Data Science for Research

Automatic Text Summarization of scientific articles

Build numerous summaries of scientific articles

Based on a corpus of more than 200,000 research papers

Unique summaries are abstractive in nature: a paraphrasing approach was used to represent the content of your article.

I am looking for companies that see the tremendous potential of this technique and want to collaborate on this project.

This approach is based on:

Artificial Intelligence (AI)

Improves natural language processing 

Is used for machine translation (e.g., Google Translate), text classification (e.g., Bloomberg news sentiment), text generation (e.g., chatbots), and automatic text summarization.

Automatic Text Summarization

Builds new text in its own words

Uses artificial intelligence to generate numerous condensed versions of existing articles. It is based on a sequence-to-sequence neural network model.

Examples from the three most highly cited PLOS ONE articles:

1.) Designer Self-Assembling Peptide Nanofiber Scaffolds for Adult Mouse Neural Stem Cell 3-Dimensional Cultures

2.) Regional Decline of Coral Cover in the Indo-Pacific: Timing, Extent, and Subregional Comparisons

3.) Gut Microbiota in Human Adults with Type 2 Diabetes Differs from Non-Diabetic Adults

AI generated Abstract:

Here we report a designer peptide nanofiber system that mimics the 3 – D nanostructure of the extracellular matrix, which represents a promising alternative to conventional cell culture systems. We show that the designer peptide can be produced for a variety of functional motifs. The designer peptide scaffolds formed with different lengths significantly improve mouse neural stem cell survival and differentiation.

AI generated Abstract:

Coral cover is a critical measure of habitat loss and degradation. We used quantitative indo – pacific reef monitoring data to examine the spatial patterns of coral cover in the indo – pacific islands. We used repeated measures regression analysis to assess the uniformity of the average coral cover and the average net decline of coral cover in the indo – pacific islands. We found that coral cover was lower and far more uniform than expected.

AI generated Abstract:

The aim of this study was to investigate the intestinal microbiota in humans with type 2 diabetes. The intestinal microbiota is associated with obesity and other metabolic diseases in humans. We found that the proportion of Firmicutes and Clostridia were significantly smaller in diabetic persons compared to their non – diabetic counterparts. These results show that humans with type 2 diabetes exhibit differences in intestinal microbiota in comparison to healthy controls.

What's under the hood?


In this project, a sequence-to-sequence neural network with attention was used to produce summaries of scientific articles. Why is that interesting? Abstractive summarization does not simply copy and paste important parts of a text. Instead, it produces unique, self-contained text based on the content of a paper. This is useful for anyone who needs a summary of a scientific article written “in its own words”. There are many possible applications. Target groups include, among others, science journalists who need a concise summary of a scientific article; students who are preparing an assignment and need a short overview of several articles on the same topic; researchers who want to write a new abstract of their own paper for an invited talk or a poster presentation; and journal editors who need an overview of the many manuscripts submitted to their journal.


The general idea of sequence-to-sequence summarization is that a neural network model is fed a document and outputs a short, unique summary of that document. The model learns this by being trained on a large number of articles and their summaries. Most summarization research has been conducted on the “CNN/Daily Mail” corpus (Hermann et al., 2015), which contains 290,000 articles with their highlights. Recently, Grusky, Naaman, and Artzi (2018) introduced a new corpus, ‘Newsroom’, containing 1.3 million texts. Training involves presenting each article (the input sequence, processed by the encoder) and its summary (the output sequence, produced by the decoder) to the model. During the training phase, the encoder-decoder model is run for many iterations over the articles and their summaries. After a number of training iterations, the model is validated by letting it generate summaries of articles it has not seen before (i.e., out-of-sample validation).

The quality of the generated summaries is measured using the ROUGE score (Lin, 2014). This score compares the original summary with the corresponding model-generated summary at a local and a global level. At the local level, short units of adjacent words are compared; at the global level, longer word sequences are compared. Several of these values are averaged to obtain the mean ROUGE score, which can be interpreted as the percentage of overlap between the original and the model-generated summary. ROUGE has become the standard metric for evaluating text summarization. Recent studies achieved mean ROUGE scores of up to 26% on the “CNN/Daily Mail” corpus (e.g., McCann, Keskar, Xiong, & Socher, 2018) and the ‘Newsroom’ corpus (Grusky et al., 2018).
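The local-level comparison described above can be illustrated with a minimal sketch of n-gram overlap (ROUGE-N). This is not the official ROUGE package and not the evaluation code used in this project; the function names, the whitespace tokenization, and the two toy sentences are assumptions made purely for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Return (precision, recall, f1) of the n-gram overlap between two texts."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped count of shared n-grams
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy sentences, invented for this example.
reference = "coral cover declined across the region"
candidate = "coral cover declined in the region"

rouge_1 = rouge_n(reference, candidate, n=1)  # single words
rouge_2 = rouge_n(reference, candidate, n=2)  # pairs of adjacent words
```

In this toy case five of the six words match, but only three of the five word pairs do, so the bigram score is lower than the unigram score. Real evaluations typically add stemming and a longest-common-subsequence variant, which are left out here.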

To my knowledge, scientific articles have not yet been analyzed with a sequence-to-sequence neural network approach.

In the present project, a new corpus of more than 200,000 scientific articles was compiled. Next, a sequence-to-sequence neural network was trained on these texts. Finally, the model was validated on a set of new scientific articles.

Corpus of scientific articles:

The database of more than 200,000 articles covers a broad range of research topics, such as medicine, biology, genetics, geology, psychology, and neuroscience. A typical scientific article comprises the following sections: abstract, introduction, methods, results, discussion, and bibliography. Neural networks for text summarization are known to struggle with long texts (Paulus, Xiong, & Socher, 2017), producing redundant, repetitive, and incoherent phrases. To deal with this issue, only the discussion section of each article was used, since it contains most of the information relevant for a summary.

Before the texts were passed to the neural network, several cleaning steps were applied, including the removal of citations, formulas, names, sub-headlines, figure legends, and non-ASCII characters.
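Two of these cleaning steps, removing citations and non-ASCII characters, might look roughly like the sketch below. The actual cleaning rules of this project are not published, so the regular expressions and the function name clean_text are hypothetical. Formulas, names, sub-headlines, and figure legends would need additional rules.

```python
import re

# Hypothetical patterns; the cleaning rules used in the project are not known.
# Parenthetical author-year citations, e.g. "(Paulus, Xiong, & Socher, 2017)".
PAREN_CITATION = re.compile(r"\s*\([^()]*\b(?:19|20)\d{2}[a-z]?\)")
# Numeric citations in square brackets, e.g. "[3, 4]" or "[12-15]".
BRACKET_CITATION = re.compile(r"\s*\[\d+(?:\s*[,\u2013-]\s*\d+)*\]")
# Any run of characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")


def clean_text(text):
    """Strip citations and non-ASCII characters, then normalize whitespace."""
    text = PAREN_CITATION.sub("", text)
    text = BRACKET_CITATION.sub("", text)
    text = NON_ASCII.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "Long texts are hard to summarize (Paulus, Xiong, & Socher, 2017). "
    "Attention helps [3, 4] \u2013 at least in part."
)
cleaned = clean_text(sample)
```

A simple pattern like this one also removes any other parenthetical that ends in a year-like number, so a real pipeline would need to check for such false positives.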

On average, discussion sections in the present corpus consisted of 1373 words (sd = 816). Even though only the discussion section was used in order to reduce text size, the texts are still considerably longer than those in the ‘CNN/Daily Mail’ or ‘Newsroom’ datasets. To account for a possible influence of text size on summarization performance, two different models were trained. The first model (long text model) used the whole discussion section of each article. Text size ranged from 400 to 4000 words (mean = 1218, sd = 791). In the second model (short text model), the discussion section was reduced to a maximum of 600 words using the TextRank algorithm, a technique that ranks the sentences of a text in order of their importance. For each text, a similarity matrix was computed from the words in each sentence, and the eigenvector centrality was calculated. Subsequently, sentences were sorted by eigenvector centrality (i.e., importance). The top-ranked sentences, up to a total of 600 words, were then put back into their order of occurrence in the text and constituted the new discussion section.
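The shortening step can be sketched as follows. This is a simplified illustration, not the project's code: the sentence splitter, the word-overlap similarity measure, and all function names are assumptions, and the word budget is a parameter so that the example can run on a tiny text.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? (an assumption for this sketch)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    """Lowercased set of the words in a sentence, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a, b):
    """Shared words, normalized by sentence length (an assumed measure)."""
    wa, wb = words(a), words(b)
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def eigenvector_centrality(matrix, iterations=100):
    """Power iteration on (I + A); the shift keeps A's eigenvectors
    and prevents the iteration from oscillating."""
    n = len(matrix)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = [scores[i] + sum(matrix[j][i] * scores[j] for j in range(n))
               for i in range(n)]
        total = sum(new)
        scores = [value / total for value in new]
    return scores


def shorten(text, max_words=600):
    """Keep the top-ranked sentences within the word budget, in original order."""
    sentences = split_sentences(text)
    matrix = [[0.0 if i == j else similarity(a, b)
               for j, b in enumerate(sentences)]
              for i, a in enumerate(sentences)]
    scores = eigenvector_centrality(matrix)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length > max_words:
            break  # budget reached: stop at the first sentence that does not fit
        kept.append(i)
        used += length
    return " ".join(sentences[i] for i in sorted(kept))


text = (
    "Coral cover declined across the region. "
    "Coral cover was lower than expected across reefs. "
    "The weather was pleasant. "
    "Reef monitoring data show the decline of coral cover."
)
short = shorten(text, max_words=15)
```

With a budget of 15 words, the off-topic sentence about the weather is dropped because it shares few words with the rest of the text, while the retained sentences appear in their original order.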

In the next step, the words of the corpus were transformed into word vectors, such that words with similar meaning lie close together in vector space. Pre-trained word vectors (vocabulary: 2.2 million words, 300-dimensional vectors) from the GloVe algorithm (Global Vectors for Word Representation) were used (Pennington, Socher, & Manning, 2014).
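As a rough illustration of what "close together in vector space" means, the sketch below parses vectors from the plain-text layout in which pre-trained GloVe files are distributed (one token per line, followed by its vector components) and compares words by cosine similarity. The three-dimensional toy vectors are invented for this example; the function names are assumptions, not part of any GloVe tooling.

```python
import io
import math


def load_vectors(lines, dim):
    """Parse 'token v1 ... v_dim' lines into a dict of token -> vector."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) <= dim:
            continue  # skip malformed or empty lines
        # Take the last `dim` fields as the vector, in case a token contains spaces.
        token = " ".join(parts[:-dim])
        vectors[token] = [float(x) for x in parts[-dim:]]
    return vectors


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy 3-D vectors made up for illustration; the real vectors have 300 dimensions.
toy_file = io.StringIO(
    "coral 0.9 0.1 0.0\n"
    "reef 0.8 0.2 0.1\n"
    "diabetes 0.0 0.1 0.9\n"
)
vectors = load_vectors(toy_file, dim=3)

related = cosine(vectors["coral"], vectors["reef"])
unrelated = cosine(vectors["coral"], vectors["diabetes"])
```

With these made-up numbers, "coral" and "reef" point in nearly the same direction, whereas "coral" and "diabetes" are almost orthogonal. For a real file, the same function could be called with an open file handle and dim=300.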

Neural network computations:

In this project, a sequence-to-sequence long short-term memory (LSTM) neural network with attention was applied (Chopra, Auli, & Rush, 2016; Nallapati, Zhou, Santos, Gulcehre, & Xiang, 2016; Rush, Chopra, & Weston, 2015). Both models were trained for 500,000 iterations. Training took about 4 days for the short text model and 7 days for the long text model on an NVIDIA V100 GPU.
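The core of the attention mechanism can be sketched in a few lines: at each decoding step, the decoder state is compared with every encoder state, the resulting scores are turned into weights, and the weighted average of the encoder states serves as context for predicting the next word. The sketch below uses dot-product scoring only for brevity; which scoring function was used in this project is not stated, and the cited models differ in this respect. The toy states are invented and the LSTM itself is omitted.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (attention weights, context vector) for one decoding step."""
    # Dot-product score between the decoder state and each encoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(decoder_state)
    # Context vector: attention-weighted average of the encoder states.
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


# Toy hidden states for a three-word input; real states have hundreds of dimensions.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attend(decoder_state, encoder_states)
```

Here the first and third input positions receive equal and large weights because their states align with the decoder state, while the second position is mostly ignored.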


[table id=1 /]

Inspection of the ROUGE scores reveals that the long text model performed slightly better than the short text model. Overall, these values are similar to those of summarization projects using other datasets, such as ‘CNN/Daily Mail’ (McCann et al., 2018) or ‘Newsroom’ (Grusky et al., 2018).

Critical considerations

In some cases the model produced semantically wrong sentences, which the ROUGE metric cannot capture. For example, the sentences “Function A, but not function B activates region X.” and “Function B, but not function A activates region X.” yield a mean ROUGE score of 68%. This is a very high score for two statements with opposite meanings. A different metric is needed that captures differences in the overall content of a text, something like a meta-analyzer that parses the generated summaries and checks them for plausibility. So far, such a plausibility judgment can only be made by a human reader. For the present project, this could be operationalized by having each generated summary judged by the researcher who wrote the paper, accompanied by an independent judgment from one of the co-authors. This would provide a quantified measure of summary quality with high external validity and (presumably) high interrater reliability.
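Why overlap-based scores miss such a swap can be seen in a small sketch that compares the two example sentences by their word content and by their longest common subsequence, the idea behind the ROUGE-L variant. The tokenization and function names are assumptions, and the sketch does not attempt to reproduce the 68% quoted above, since that value depends on which ROUGE variants are averaged and how they are configured.

```python
import re


def tokens(text):
    """Lowercased word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


first = tokens("Function A, but not function B activates region X.")
second = tokens("Function B, but not function A activates region X.")

# Both sentences contain exactly the same words, only in a different order.
same_words = sorted(first) == sorted(second)

# Share of the first sentence covered by the longest common subsequence.
lcs = lcs_length(first, second)
lcs_recall = lcs / len(first)
```

The two sentences share every single word, and seven of the nine words still appear in the same order, so any metric built on word overlap rates them as near-duplicates although their meanings are opposite.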

What's next?

The next project should incorporate more data. The data for the present model came from open access journals, which limits the number of available articles. To increase the number of articles drastically, new database sources are needed; more training data should improve summarization performance. Another way to improve automatic text summarization is to train separate models for different scientific topics: reducing the variation in structure, size, and content of the articles should improve model performance. Finally, the dataset can be tested with different neural network architectures, such as the pointer-generator model (See, Liu, & Manning, 2017).


Chopra, S., Auli, M., & Rush, A. M. (2016). Abstractive Sentence Summarization with Attentive Recurrent Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 93–98). San Diego, California: Association for Computational Linguistics.

Grusky, M., Naaman, M., & Artzi, Y. (2018). Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies. ArXiv:1804.11283 [Cs].

Hermann, K. M., Kočiský, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., & Blunsom, P. (2015). Teaching Machines to Read and Comprehend. ArXiv:1506.03340 [Cs].

Lin, C. (2014). ROUGE: A Package for Automatic Evaluation of Summaries. Proceedings of ACL Workshop on Text Summarization, 10.

McCann, B., Keskar, N. S., Xiong, C., & Socher, R. (2018). The Natural Language Decathlon: Multitask Learning as Question Answering. ArXiv:1806.08730 [Cs, Stat].

Nallapati, R., Zhou, B., Santos, C. N. dos, Gulcehre, C., & Xiang, B. (2016). Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond. ArXiv:1602.06023 [Cs].

Paulus, R., Xiong, C., & Socher, R. (2017). A Deep Reinforced Model for Abstractive Summarization. ArXiv:1705.04304 [Cs].

Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global Vectors for Word Representation. Association for Computational Linguistics, 1532–1543.

Rush, A. M., Chopra, S., & Weston, J. (2015). A Neural Attention Model for Abstractive Sentence Summarization. ArXiv:1509.00685 [Cs].

See, A., Liu, P. J., & Manning, C. D. (2017). Get To The Point: Summarization with Pointer-Generator Networks. ArXiv:1704.04368 [Cs].

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. ArXiv:1409.3215 [Cs].