Reuters has successfully predicted 35 Nobel Laureates in the past by applying the PageRank algorithm to its own citation data on scientists. Inspired by this, I decided to calculate such an index myself and use it as an attribute in my data set. Because of the limitations of the citation data available to me, I could not apply the PageRank algorithm directly and instead created a variant of it. More specifically, I used the following formula:
My model predicts that a Northwestern professor and a Northwestern alumnus will win the Nobel Prize at some point in the future! Their names are...
The Nobel Prize is arguably the most prestigious award an individual can receive in academia. Awarded "for the greatest benefit to mankind," the Nobel Prize goes to those who have made significant contributions to the natural sciences, such as Chemistry, Physiology or Medicine, and Physics, as well as to fields like Economics and Literature.
Since 1901, there have been 889 winners of the Nobel Prize. Because Nobel Laureates include legendary figures such as Albert Einstein, not only academia but also the general public takes an interest in guessing who each year's winners will be.
Unfortunately, the Nobel Prize selection process is very opaque. Each year, the Nobel Prize Committee selects a number of nominators who make their nominations in secret, and the nominations are then sealed for the following 50 years, making it difficult to find out even who the candidates are.
I show a proof of concept that this task can be approached with classification algorithms even with a data set limited in both quality and size, and I compare the different classifiers that can be used to train the model.
Using 10-fold cross-validation, I evaluated my model's performance on a data set of 233 chemists and reached a maximum accuracy of 78.5%.
The entire data set that I worked with can be accessed here. The chemists in this data set fall into the following categories:
As mentioned in the citation index section, the parameter alpha can be adjusted. I preprocessed the data set by computing the score for different values of alpha, ranging from 0 to 1 in steps of 0.1.
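For illustration, the preprocessing loop looked roughly like the sketch below. The `Chemist` record, the field names, and the `citationScore` helper are simplified stand-ins for my actual pipeline; only the alpha weighting is the point here.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.Locale;

// Sketch of the preprocessing step (field and file names are illustrative,
// not the exact ones from my pipeline): for each alpha in {0.0, 0.1, ..., 1.0},
// recompute every chemist's citation score and write out a separate data set.
public class PreprocessAlpha {

    // Simplified view of one chemist; the real data set has more attributes.
    record Chemist(String name, String institution, String almaMater,
                   double ownCitations, double laureateCitationComponent,
                   boolean laureate) {}

    // Hypothetical blend of a chemist's own citation count and the
    // PageRank-style component, weighted by alpha.
    static double citationScore(Chemist c, double alpha) {
        return (1 - alpha) * c.ownCitations() + alpha * c.laureateCitationComponent();
    }

    static void writeDatasets(List<Chemist> chemists) throws IOException {
        for (int step = 0; step <= 10; step++) {
            double alpha = step / 10.0;
            try (FileWriter out = new FileWriter("chemists_alpha_" + alpha + ".csv")) {
                out.write("name,institution,almaMater,citationScore,laureate\n");
                for (Chemist c : chemists) {
                    out.write(String.format(Locale.US, "%s,%s,%s,%.4f,%b%n",
                            c.name(), c.institution(), c.almaMater(),
                            citationScore(c, alpha), c.laureate()));
                }
            }
        }
    }
}
```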
Each of the preprocessed data sets for the different values of alpha can be accessed here:
I considered three different classifiers to train and compare. More specifically:
I used Weka to train models based on these classifiers (the IBk, FT, and BayesNet packages). In addition, I tried different values of the alpha parameter in the citation index score, ranging from 0 to 1 in steps of 0.1.
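For concreteness, here is a minimal sketch of building the three classifiers with Weka's Java API; the ARFF file name is a placeholder for one of the per-alpha data sets above.

```java
import weka.classifiers.Classifier;
import weka.classifiers.bayes.BayesNet;
import weka.classifiers.lazy.IBk;
import weka.classifiers.trees.FT;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Load one of the per-alpha data sets and build the three classifiers.
// FT (functional trees) ships as a separate Weka package in newer releases.
public class TrainModels {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("chemists_alpha_0.5.arff"); // placeholder file
        data.setClassIndex(data.numAttributes() - 1);                // class = laureate or not

        Classifier[] models = { new IBk(3), new FT(), new BayesNet() };
        for (Classifier model : models) {
            model.buildClassifier(data);
            System.out.println("Built " + model.getClass().getSimpleName());
        }
    }
}
```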
With a limited number of chemists in my data set, it was very easy for my model to overfit. In particular, having a few nominal attributes whose value space is restricted to the values seen so far limits my model significantly. For example, just because only a handful of institutions (e.g. Harvard, Cambridge, Oxford) are represented among past Nobel Laureates, it doesn't mean there can't be a winner from an institution without a past Laureate (e.g. Northwestern!). To guard against overfitting, I used 10-fold cross-validation to evaluate the accuracy of my model.
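A minimal sketch of the 10-fold cross-validation step with Weka, again with a placeholder file name:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Estimate accuracy with 10-fold cross-validation instead of testing on the
// training data, which would overstate performance on such a small data set.
public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("chemists_alpha_0.5.arff"); // placeholder file
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new IBk(3), data, 10, new Random(1));
        System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```

With a data set this small, the reported accuracy still shifts slightly with the random seed used to assign the folds.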
With a limited number of instances (233), the results came down to a difference of only 2 to 3 instances across the range of alpha values. A few points to note here:
Using Weka's AttributeSelectedClassifier, I also found the three most important attributes. Not surprisingly, the citation score was one of them. The other two were Institutions and Alma Mater, which again is not too surprising a result.
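The attribute-selection step can be sketched as follows. The CFS subset evaluator and best-first search shown here are Weka's defaults for AttributeSelectedClassifier, and the choice of FT as the base classifier is an assumption for illustration.

```java
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.trees.FT;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Wrap a base classifier in AttributeSelectedClassifier so that attribute
// selection (CFS subset evaluation + best-first search) is applied before training.
public class SelectAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("chemists_alpha_0.5.arff"); // placeholder file
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        asc.setClassifier(new FT());
        asc.setEvaluator(new CfsSubsetEval());
        asc.setSearch(new BestFirst());
        asc.buildClassifier(data);

        // toString() lists the attributes that were actually selected.
        System.out.println(asc);
    }
}
```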
Google Scholar is arguably the richest source of citation data that is publicly available. However, Google has a very strict policy against automated queries, and this made the scraping process extremely difficult. Not only does Google drop requests that do not appear to come from a browser, but it also has very sophisticated bot detection.
By modifying the HTTP request headers, I was able to scrape data for a few scientists (about 10), but after that Google permanently blocked the IP address I was querying from. Attempts to trick Google by spacing queries at random intervals of 20 to 30 seconds did not work either, and it was unclear how Google was detecting my scraper.
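For reference, the header-tweaking and random-delay approach looked roughly like the sketch below; the profile URLs and the User-Agent string are placeholders, not the exact values I used.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.Random;

// Fetch Google Scholar profile pages with a browser-like User-Agent header and
// a random 20-30 second pause between requests. This worked for roughly ten
// scientists before the IP was blocked.
public class ScholarScraper {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        Random random = new Random();
        List<String> profileUrls = List.of(
                "https://scholar.google.com/citations?user=EXAMPLE_ID_1",
                "https://scholar.google.com/citations?user=EXAMPLE_ID_2");

        for (String url : profileUrls) {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .header("User-Agent",
                            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(url + " -> HTTP " + response.statusCode());

            // Wait a random 20-30 seconds before the next query.
            Thread.sleep(20_000 + random.nextInt(10_000));
        }
    }
}
```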
As discussed previously, my biggest challenge was defining a reasonable set of scientists for my data set. Its largest limitation is that it is not representative of all chemists in the world, mostly because only well-known chemists are listed on Wikipedia. In other words, the results of my model may not necessarily apply to every chemist in the world.
Another major limitation is that my data set contains only 233 instances. As a result, the accuracy of my model is vulnerable to small changes in the data and more prone to overfitting. As mentioned in the Google Scholar scraping section, I had trouble collecting enough data to make the PageRank algorithm really work. Given access to a larger source of citation data, my variation of the PageRank algorithm could work much better than it does now.
Something that I could have tried, but did not due to time limits, is a way to deal with attributes that may contain more than one value, such as institutions or alma mater. One approach is a binary encoding: have each bit represent a school, so that n bits are enough to encode all possibilities when a total of n different schools appear in the data set (a small sketch follows below). This could have yielded a better result. Another approach is to make each school its own attribute, although that could require classifiers other than something like decision trees, since the number of attributes would grow extremely large and the data set would become a very sparse matrix.
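A minimal sketch of that binary-encoding idea, with made-up attribute values for illustration: each distinct school is assigned a bit position, and a chemist's set of schools becomes a bit pattern.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the binary-encoding idea: every distinct school gets a bit
// position, and a multi-valued attribute such as "institutions" becomes a
// bit pattern with one bit set per school the chemist is affiliated with.
public class SchoolEncoder {
    private final Map<String, Integer> bitIndex = new LinkedHashMap<>();

    // Assign the next free bit to a school the first time it is seen.
    private int indexOf(String school) {
        return bitIndex.computeIfAbsent(school, s -> bitIndex.size());
    }

    // Encode a chemist's list of institutions (or alma maters) as a BitSet.
    public BitSet encode(List<String> schools) {
        BitSet bits = new BitSet();
        for (String school : schools) {
            bits.set(indexOf(school));
        }
        return bits;
    }

    public static void main(String[] args) {
        SchoolEncoder encoder = new SchoolEncoder();
        System.out.println(encoder.encode(List.of("Northwestern", "MIT")));     // {0, 1}
        System.out.println(encoder.encode(List.of("Harvard", "Northwestern"))); // {0, 2}
    }
}
```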