Data Science for Research

Future Impact Score (FIS)

This approach is based on:

Natural Language Processing

Enables interaction between computers and human language

Processes and analyzes large amounts of natural language data for speech recognition, natural language understanding, and text generation.

Deep Neural Networks

State-of-the-art multilayer machine learning method

Inspired by the organizational principles of biological neurons. Deep neural networks have led to tremendous advances in the performance of natural language processing tasks.

Usage Examples:

One of the many possible applications of this approach is the comparison of predicted citation counts between research institutions, states, or publishers. Why is that important? Research institutions compete to attract grant money and excellent students. It is therefore essential to know the impact of articles in order to evaluate the impact of the research institution at which the research was carried out. This impact can be quantified by the number of citations an article receives, and predicting this number can be crucial for weighing the future development of research institutions.

The present example analyzed ca. 50,000 research articles published in January 2019. First, the number of future citations was predicted for each article based on its content. Next, articles and their corresponding predicted citation numbers were grouped by research institution, state, and publisher. This yielded the mean and standard deviation of the predicted number of future citations for each research institution, state, and publisher.
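The grouping step described above can be sketched with pandas. The institution names and prediction values below are purely illustrative, not results from the actual analysis:

```python
import pandas as pd

# Hypothetical toy data: each row is one article with a predicted
# number of future citations (names and values are illustrative).
articles = pd.DataFrame({
    "institution": ["Stanford", "Stanford", "Berkeley", "Berkeley"],
    "predicted_citations": [4.0, 6.0, 8.0, 10.0],
})

# Aggregate per institution: output (article count), plus mean and
# standard deviation of the predicted future citations.
summary = articles.groupby("institution")["predicted_citations"].agg(
    output="count", mean="mean", std="std"
)
print(summary)
```

The same aggregation works unchanged for states or publishers by swapping the grouping column.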

Competitions: Research Institutions

Stanford has a greater output, but Berkeley will have more citations.

MIT has a greater output, but will not have more citations than Caltech.

Cambridge and Oxford have equal output and equal future impact.

Competitions: States

The USA has a far greater output, but China will have more citations.

Germany has a greater output and will have more citations than the United Kingdom.

Australia has a surprisingly high output and will have more citations than New Zealand.

Competitions: Publishers

Springer Nature, Elsevier, and Wiley publish the most articles. The predicted number of future citations is highest for the American Chemical Society (ACS), followed by MDPI (Multidisciplinary Digital Publishing Institute). Springer Nature, Elsevier, and Wiley show similar predicted numbers of future citations.

Development of FIS:

The number of citations that a research article receives is a crucial measure of the quality and importance of the underlying research. The number of citations has direct implications for an author's track record, which in turn affects grant and patent outcomes. Predicting the future number of citations of a freshly published article would thus provide a crucial measure of that article's future impact. Moreover, such a prediction would also shed light on the author's future impact on his or her field of research.

The present algorithm is able to predict the future number of citations based on the content of research articles. For this, neural network architectures were applied to a corpus of more than 500,000 papers to predict the number of citations within the first two years after publication.

The results of this approach are very promising: the model predicts the number of citations with an accuracy of up to 90%. A score with five categories was generated, ranging from 'very low future impact' to 'very high future impact'. Several different models were generated.
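The five-category score can be sketched as a simple binning function. Only the outer boundaries (0-2 citations for very low, more than 20 for very high) are stated in the text; the middle cut-offs below are illustrative assumptions:

```python
def fis_category(predicted_citations: float) -> str:
    """Map a predicted citation count to one of five future impact
    categories. The 0-2 (very low) and >20 (very high) boundaries
    come from the text; the middle cut-offs are assumed."""
    if predicted_citations <= 2:
        return "very low"
    elif predicted_citations <= 7:    # assumed cut-off
        return "low"
    elif predicted_citations <= 13:   # assumed cut-off
        return "medium"
    elif predicted_citations <= 20:
        return "high"
    return "very high"

print(fis_category(1), fis_category(10), fis_category(25))
```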

Model Comparison:

In order to find the best solution for predicting the future number of citations, different neural network architectures were compared. First, each neural network model was trained on articles and their corresponding numbers of citations. Next, the models were tested on new data (i.e., out-of-sample prediction). Below are the results of three different neural network models, representing the latest developments in deep learning research. For comparability, the same test data set was used for each model.
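The evaluation protocol above (several trained models scored against one shared held-out set) can be sketched as follows. The two stand-in "models" and the synthetic data are placeholders for the actual trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for trained models: each maps article
# features to a predicted citation count.
def model_a(x):  # plays the role of, e.g., the CNN
    return x.sum(axis=1)

def model_b(x):  # plays the role of, e.g., the Bi-LSTM
    return x.mean(axis=1)

# One shared test set so the comparison is fair.
X_test = rng.random((100, 8))
y_test = X_test.sum(axis=1)  # synthetic "true" citation counts

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted citations."""
    return float(np.mean(np.abs(y_true - y_pred)))

scores = {name: mae(y_test, f(X_test))
          for name, f in [("CNN-like", model_a), ("LSTM-like", model_b)]}
print(scores)
```

Because every model sees exactly the same `X_test`, differences in the error scores reflect the architectures rather than the data split.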

Results of model comparison:

The top figures show the real number of citations in the test data set (x-axis) against the number of citations predicted by the model (y-axis). The two values are highly correlated, suggesting high predictive power of the models. Predictive power is highest for the CNN.

The bottom figures show the accuracy for each category of the future impact score. Predicted numbers of citations were aggregated into five categories, from very low future impact (0-2 citations) to very high future impact (more than 20 citations). Overall, the best performance was achieved by the CNN model. Prediction is most precise for the very low and very high future impact categories.

Convolutional Neural Networks (CNN)

Bidirectional Long Short-Term Memory Neural Networks (Bi-LSTM)

Hierarchical Attention Networks (HAN)

In summary, the CNN model shows the highest performance, followed by the Bi-LSTM model. The HAN shows rather disappointing results. These findings represent a proof of concept; further research is needed to increase the validity and reliability of the models.

Proof of concept

The number of citations that a research article receives is highly correlated with the journal in which it was published: the higher a journal's impact factor, the more likely an article from that journal is to receive many citations. Hence, the predicted number of citations of an article should also correlate highly with the impact factor of the journal in which it was published. Importantly, the prediction model has no information about the journal, nor about the author, affiliation, or any other metadata; prediction is based solely on the content of the article. Thus, a high correlation between the predicted number of citations of an article and the impact factor of its journal would provide strong support for the concept that the number of citations is predictable.

In this analysis, the future number of citations of ca. 13,000 research articles published between 02/08/2019 and 02/15/2019 was predicted. Prediction values were then aggregated by the journal in which each article was published, resulting in a data set of 1,223 different research journals and their corresponding mean numbers of predicted citations. The journal impact factor and the mean number of predicted citations revealed a high correlation (r = 0.62). This shows that, even though the prediction model holds no information about the journal, the predicted number of citations goes hand in hand with the journal's impact factor; this is strong evidence for the predictive power of the future impact score.
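The final correlation step can be sketched with NumPy. The journal-level values below are fabricated toy numbers, not the actual 1,223-journal data set:

```python
import numpy as np

# Hypothetical journal-level data: impact factor and the mean
# predicted number of citations for articles in that journal.
impact_factor = np.array([1.2, 3.5, 5.0, 9.1, 12.3])
mean_predicted = np.array([0.8, 2.9, 4.2, 7.5, 10.0])

# Pearson correlation coefficient r between the two measures.
r = float(np.corrcoef(impact_factor, mean_predicted)[0, 1])
print(round(r, 3))
```

In the study itself this computation over all journals yielded r = 0.62; the toy values here are nearly linear, so they produce a much higher r.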

What's next?

1. Reliability testing

The present corpus was generated from articles published in 2016 and their numbers of citations in 2018. However, how can we be sure that articles published today will also yield a FIS with similar prediction accuracy? In other words, how reliable is the model? Reliability can be tested using different time windows for model calculation. For instance, papers published in 2015 and their citation numbers in 2017 can be analyzed, and the time window then shifted systematically (e.g., 2014+2016, 2015+2017, etc.). If model performance is similar across time windows, the reliability of the FIS measurement is high.
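The sliding-window scheme described above can be made explicit with a small helper. The function name and the two-year gap parameter are assumptions for illustration:

```python
def reliability_windows(start: int, end: int, gap: int = 2):
    """Return (publication_year, citation_count_year) pairs for the
    reliability check: citations are counted `gap` years after
    publication, and the window shifts one year at a time."""
    return [(year, year + gap) for year in range(start, end + 1)]

windows = reliability_windows(2014, 2016)
print(windows)  # [(2014, 2016), (2015, 2017), (2016, 2018)]
```

A separate model would then be trained and evaluated on each window, and the per-window performance scores compared.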

2. Validity testing

So far, the FIS was calculated from a corpus of 200,000 articles covering a broad variety of research fields. However, each field of research has a different distribution of citation counts. In order to validate and improve the FIS, separate neural network models will be generated for different fields of research, such as medicine, genetics, and the social sciences. The results of the model predictions will then be compared across research fields. Similar prediction accuracy and specificity across fields would validate the previous results; in addition, prediction accuracy and specificity should be higher with this approach than in the previous results.