• Article Galaxy Blog

November 4, 2020

Will Text Mining Transform Scientific Research and Publishing?

Posted by: Leah Rodriguez

Artificial intelligence (AI) technologies are helping businesses across industries accomplish their work faster and easier, with better results. In a previous blog post, we examined the potential for AI in literature search. In particular, we looked at how machine learning and natural language processing (NLP) techniques can be applied to improve the way researchers "find, synthesize, understand, and measure the impacts of developments in their field."

In this post, we'll take a deeper dive into an AI technology that's gaining traction due to the range of benefits it offers for researchers and scholarly publishers alike: text mining.

As you may know, text mining is an AI-driven process that goes beyond basic keyword search to add context to text. Whereas data mining systems uncover patterns and trends in large datasets that machines can already process, text mining is a bit trickier: it employs NLP techniques to transform the language within a document into information that machines can "understand" and process.

Applications for text mining in research and publishing

Text mining systems can help scientists find relevant research in massive literature databases far faster than they can using traditional methods. More importantly, text mining introduces the ability to extract meaningful relationships across broad swathes of literature. By connecting the dots between seemingly unrelated journal articles, text mining can help scientists develop truly novel hypotheses, which can lead to groundbreaking discoveries. A 2012 article on Text Mining and Scholarly Publishing illustrates the value quite clearly:

“According to published literature there is no relationship between deforestation and hurricanes. No amount of text mining will reveal these or similar words in the same context. However, there are sentences to be found that link deforestation to increased silt in rivers. There are sentences that relate increased silt in rivers to seas becoming shallower due to silt run-off. And shallow seas have been linked to an increase in hurricane formation. A new question emerges from this analysis: is it possible that deforestation could lead to more hurricanes?”
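
The chain of reasoning in that quote can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular text mining product works: the relation pairs are hand-entered stand-ins for what an NLP pipeline might extract from individual sentences, and the function name `find_indirect_links` is hypothetical.

```python
from collections import defaultdict

# Hand-entered (cause, effect) pairs standing in for relations an NLP
# pipeline might extract from individual sentences. Illustrative only.
relations = [
    ("deforestation", "silt in rivers"),
    ("silt in rivers", "shallow seas"),
    ("shallow seas", "hurricanes"),
]

def find_indirect_links(pairs, start):
    """Return {concept: path} for concepts reachable from `start` only
    through a chain of relations, never through a single direct link."""
    graph = defaultdict(list)
    for cause, effect in pairs:
        graph[cause].append(effect)

    direct = set(graph[start])   # links stated outright in the "literature"
    paths = {}
    stack = [[start]]
    while stack:
        path = stack.pop()
        for nxt in graph[path[-1]]:
            if nxt in path:      # skip cycles
                continue
            new_path = path + [nxt]
            if nxt not in direct and nxt not in paths:
                paths[nxt] = new_path
            stack.append(new_path)
    return paths

links = find_indirect_links(relations, "deforestation")
for concept, path in links.items():
    print(concept, "<-", " -> ".join(path))
```

No single pair connects deforestation to hurricanes, yet following the pairs end to end surfaces the link as a candidate hypothesis. Whether that hypothesis holds up is still a question for researchers, not for the software.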


Text mining seems poised to transform many research and publishing processes. Scientists can use it to improve and accelerate the way they conduct literature searches, produce literature reviews, and generate hypotheses. On the publishing end, text mining can be leveraged during peer review to improve fact checking, weed out plagiarism, and more. The potential use cases are wide-ranging.

The importance of machine readability

Unless the majority of journal articles in a database are "machine readable," text mining technologies will offer little value. For text mining systems to direct their searches efficiently and leverage NLP techniques effectively, the applications must have access to the articles' metadata: information on the documents themselves, such as author, topic, field of study, publisher, and publication date. It is also important to note that a journal article being accessible online (as a PDF, for example) does not guarantee that it is machine readable. To ensure their journal articles are picked up by text mining software, publishers must either manually enter metadata into index deposit forms (which is extremely time-consuming and prone to error) or produce journal articles in machine-readable formats from the start.
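
To make "machine readable" a little more concrete, here is a minimal sketch of a metadata record and a completeness check. The field names and the `missing_metadata` helper are assumptions chosen to mirror the examples above; they do not follow any particular publisher's or indexing service's schema.

```python
import json

# Hypothetical required fields, mirroring the metadata examples above.
REQUIRED_FIELDS = ("author", "topic", "field_of_study", "publisher", "publication_date")

def missing_metadata(record):
    """Return the required fields that are absent or empty in `record`."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

# A made-up article record, as it might arrive in JSON form.
article_json = """
{
  "title": "An example article",
  "author": ["A. Researcher"],
  "publisher": "Example Press",
  "publication_date": "2020-11-04"
}
"""

record = json.loads(article_json)
print(missing_metadata(record))
```

A record like this one, with no topic or field of study, gives text mining software less to work with, even if the article's PDF is freely available online.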

Will COVID-19 spur advances in text mining technology? 

In response to the coronavirus pandemic, many technology companies stepped up to help facilitate and enhance COVID-19 research. In March 2020, the White House Office of Science and Technology Policy announced the release of the COVID-19 Open Research Dataset (CORD-19), calling it "the most extensive machine-readable Coronavirus literature collection available for data and text mining to date, with over 29,000 articles, more than 13,000 of which have full text."

CORD-19 was developed through a collaborative effort—coordinated by Dr. Dewey Murdick, Director of Data Science at Georgetown University’s Center for Security and Emerging Technology (CSET)—which included teams from CSET, the Allen Institute for AI, Chan Zuckerberg Initiative (CZI), Microsoft, and the National Library of Medicine (NLM) at the National Institutes of Health.

While thanking the cross-industry team, U.S. Chief Technology Officer Michael Kratsios called on the nation's "artificial intelligence experts to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19."

Looking further ahead, Dr. Murdick expressed his hope that the CORD-19 project "will inspire new ways to use machine learning to advance scientific research."

We’re in this together! Here are some of the COVID-19 initiatives our team at Research Solutions has put in place to facilitate coronavirus research.

Publishers' critical role in text mining success

As noted earlier, the greatest value text mining offers for improving research is its ability to leverage NLP to add context to the words and phrases in journal articles. As with any AI technology, text mining's potential to transform research depends on widespread user adoption. But user adoption will only increase if scholarly publishers do their part to facilitate text mining. And that starts with text mining rights. As Open Access publisher PLOS says, "the right to read is the right to mine."

Springer Nature is one example of a scholarly publisher leading the way to facilitate text mining for researchers, both by formalizing text mining rights for its customers and by offering a variety of text mining tools and services, such as full-text APIs and metadata delivery.

Here at Research Solutions, because we work directly with scholarly publishers to supply full-text PDFs, we are also able to provide access to each article's metadata and any available supplementary materials. For more information, please contact an Article Galaxy expert.
