Task
Given a scientist sufficiently famous to have a Wikipedia page, predict whether he/she will win the Nobel Prize in Chemistry in the future.

Dataset
Choosing an appropriate dataset was the most difficult challenge I faced. It was easy to find the names of all the scientists who won the Nobel Prize, but it was much harder to decide where to find the "unsuccessful" examples to classify against without introducing a bias into the dataset. In the end, I decided that it made the most sense to scrape the lists of scientists (physicists, chemists, and biologists) from Wikipedia. The data collection process is described in the Scraping Wikipedia and Scraping Google Scholar sections.

Attributes
To classify scientists into the winner and non-winner groups, I chose the following attributes:
  • Country of Origin
  • Date of Birth
  • Institutions Affiliated with
  • Alma Mater
  • Previous Awards Won
  • Citation Index

Citation Index

Reuters has successfully predicted 35 Nobel Laureates in the past by using its own citation data on scientists and applying the PageRank algorithm to it. Influenced by this, I decided to calculate such an index myself and use it as an attribute in my data set. Because the citation data available to me was limited, I could not apply the PageRank algorithm directly and had to create a variant of it. More specifically, I used the following formula:

where I is the citation score of a paper, alpha is an adjustable weight, and p is the citation score of a paper that cites the paper under consideration. The recursion was applied only once because of limitations in the data collection process, which are explained further in the Scraping Google Scholar section.
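As a rough illustration of a one-level score of this kind, the sketch below assumes the form I = c + alpha * sum(p), where c is the paper's own citation count. That exact form, the function name `citation_score`, and the sample numbers are assumptions made for illustration, not necessarily what the project used.

```python
from typing import Sequence


def citation_score(own_citations: int,
                   citing_scores: Sequence[float],
                   alpha: float) -> float:
    """One-level citation score (assumed form): I = c + alpha * sum(p)."""
    return own_citations + alpha * sum(citing_scores)


# Hypothetical paper: cited 3 times; the citing papers were themselves
# cited 30, 4 and 0 times (used as their scores p at recursion depth one).
own_citations = 3
citing_scores = [30, 4, 0]

# Sweep alpha from 0 to 1 in steps of 0.1.
alphas = [round(0.1 * step, 1) for step in range(11)]
for alpha in alphas:
    score = citation_score(own_citations, citing_scores, alpha)
    print(f"alpha={alpha:.1f}  I={score:.1f}")
```

With alpha = 0 the score reduces to the paper's own citation count; larger values of alpha give more weight to how heavily cited the citing papers are.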

A Point of Interest

My model predicts that a Northwestern professor and a Northwestern alumnus will win the Nobel Prize at some point in the future! Their names are...


The Nobel Prize Predictor

By:

  • James Whang (sungyoonwhang2017@u.northwestern.edu)
For EECS 349: Machine Learning, Spring 2015 at Northwestern University. All data and scrapers used for this project are available in the public Git repository.

Abstract

The Nobel Prize is arguably the most prestigious award an individual can receive in academia. Awarded for "the Benefit on Mankind", the Nobel Prize is given to those who have made a significant contribution to the natural sciences, namely Chemistry, Physiology/Medicine, and Physics, as well as to other fields such as Economics and Literature.

Since 1901, there have been 889 winners of the Nobel Prize. Because Nobel Laureates include legendary figures such as Albert Einstein, both academia and the general public are interested in guessing who each year's winners will be.

Unfortunately, the process of Nobel Prize selection is very complicated. Each year, the Nobel Prize Committee selects a few nominators who secretly make the nominations. The nominations are kept sealed for the following 50 years, which makes it difficult to find out even who the candidates for the prize are.

I show, as a proof of concept, that predicting the winners is feasible with classifier algorithms even when the data set is limited in both quality and size, and I compare the different classifier algorithms that can be used to train the model.

Using the 10-fold cross-validation method, I evaluated my model's performance on a dataset containing 233 chemists and reached a maximum accuracy of 78.5%.


Preprocessing

The entire data set that I worked with can be accessed here. The chemists in this data set fall into the following categories:

  1. Nobel Laureates who have already died
  2. Nobel Laureates who are still alive
  3. Non-Nobel Laureates who have already died
  4. Non-Nobel Laureates who are still alive
Of these, the 4th category has to be excluded because, strictly speaking, its members can still win the Nobel Prize in the future. Since I cannot label them correctly, I left them out of my training set. In addition, I omitted the following attributes from the training data:
  • Name : This can become a source of overfitting.
  • Total Count and Sum of Counts that cite this person's papers : The citation index is essentially a function of these two attributes, so keeping them as well would be redundant.
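The filtering described above can be sketched in a few lines of Python. The record layout and the field names (`is_laureate`, `is_alive`, `total_count`, `sum_of_citing_counts`) are hypothetical stand-ins for whatever the actual data files use, and the sample records are purely illustrative.

```python
# Attributes left out of the training data (hypothetical field names).
# `is_alive` is treated here as a helper flag used only for filtering,
# which is an assumption of this sketch.
DROPPED = {"name", "total_count", "sum_of_citing_counts", "is_alive"}


def build_training_set(records):
    """Drop living non-laureates (category 4) and the omitted attributes."""
    rows = []
    for record in records:
        if not record["is_laureate"] and record["is_alive"]:
            continue  # label unknown: this person may still win in the future
        rows.append({k: v for k, v in record.items() if k not in DROPPED})
    return rows


# Illustrative records only.
records = [
    {"name": "A", "is_laureate": True, "is_alive": False,
     "total_count": 10, "sum_of_citing_counts": 40, "citation_index": 30.0},
    {"name": "B", "is_laureate": False, "is_alive": True,
     "total_count": 5, "sum_of_citing_counts": 8, "citation_index": 9.0},
    {"name": "C", "is_laureate": False, "is_alive": False,
     "total_count": 2, "sum_of_citing_counts": 1, "citation_index": 2.5},
]
training = build_training_set(records)
print(training)
```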

As mentioned in the citation index section, the variable alpha can be adjusted. I preprocessed the dataset by calculating the score with different alpha values (ranging from 0 to 1 in steps of 0.1).
Each of the preprocessed datasets, one per value of alpha, can be accessed here:

Result

I considered three different classifiers to train and compare. More specifically:

  • K Nearest-Neighbor : Since I expect Nobel Laureates to form some sort of cluster, a K nearest-neighbor classifier is a natural choice. Because the data are noisy, the value of K should be high.
  • Functional Trees : Nobel Laureates had noticeably higher citation scores than the non-winners. However, the general trend was that the more recently a scientist lived, the higher his or her citation score was. This suggests a regression-like relationship between the two attributes. Functional trees could be helpful in this scenario, since they are similar to decision trees except that each leaf of the tree can hold a regression function.
  • Bayes Net : Given the small number of examples and the Bayesian nature of the problem (i.e. given the past Nobel Laureates' citation scores, how likely is a person with this score to become a Nobel Laureate in the future?), a Bayesian network seemed like a reasonable candidate.

Weka packages were used to train models based on these classifiers (the IBk, FT, and BayesNet packages). In addition, different values for the alpha parameter in the citation index score were tried, ranging from 0 to 1 in steps of 0.1.

With a limited number of chemists in my dataset, it was very easy for my model to overfit. In particular, the nominal attributes can only take the values seen so far, which limits my model significantly. For example, just because only a handful of institutions are represented among past Nobel Laureates (e.g. Harvard, Cambridge, Oxford), it doesn't mean there can't be a winner from an institution without a past Laureate (e.g. Northwestern!). To keep overfitting from inflating the reported accuracy, I used the 10-fold cross-validation method to evaluate my model.
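For readers unfamiliar with the evaluation procedure, here is a minimal pure-Python sketch of 10-fold cross-validation around a simple K nearest-neighbor classifier. It is not the Weka implementation used in the project: the split is plain rather than stratified, the distance is squared Euclidean on numeric features only, and all function names and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def knn_predict(train_X, train_y, x, k=5):
    """Majority vote among the k training points closest to x."""
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(train_X[i], x))
    nearest = sorted(range(len(train_X)), key=sq_dist)[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(X, y, folds=10, neighbors=5, seed=0):
    """Each instance is predicted by a model that never saw it in training."""
    correct = 0
    for fold in k_fold_indices(len(X), folds, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        correct += sum(knn_predict(train_X, train_y, X[i], neighbors) == y[i]
                       for i in fold)
    return correct / len(X)


# Toy data: two well-separated one-dimensional clusters.
X = [[float(v)] for v in range(20)] + [[100.0 + v] for v in range(20)]
y = [0] * 20 + [1] * 20
print(cross_val_accuracy(X, y))
```

Because every prediction is made on a held-out fold, memorizing the training instances does not help the score, which is what makes the estimate less optimistic than training-set accuracy.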

With a limited number of instances (233), the results came down to a difference of only 2 to 3 instances across the range of values for alpha. A few points to note here are:

  • The highest accuracy achieved was 78.5408%, achieved by using functional trees with alpha values greater than 0.8.
  • The classifier with the highest accuracy is functional trees, followed by KNN, and Bayes Net.
  • The value of alpha didn't affect the accuracy of functional trees, because they can compensate for it by applying regression in each node.
  • KNN shows a trend: the higher the value of alpha, the higher the accuracy.
  • Bayes Net was not affected by alpha either.
This result shows that even though I lacked much of the citation data, I was able to predict whether a person will win the Nobel Prize with an accuracy similar to what Reuters was able to achieve using their extensive data.
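To put the reported figures in perspective, a quick calculation shows how coarse the accuracy scale is with 233 instances. The numbers below are taken from the text; only the variable names are made up.

```python
n_instances = 233
best_accuracy_pct = 78.5408

# Number of correctly classified instances behind the best accuracy.
correct = round(best_accuracy_pct / 100 * n_instances)
print(correct)  # 183

# Accuracy change, in percentage points, per additional correct instance.
per_instance_pct = 100 / n_instances
print(round(per_instance_pct, 2))  # 0.43
```

A gap of 2 to 3 instances between settings therefore corresponds to roughly 0.9 to 1.3 percentage points of accuracy.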

Using Weka's AttributeSelectedClassifier, I also identified the three most important attributes. Not surprisingly, the citation score was one of them. The other two were Institutions and Alma Mater, which again is not too surprising.

Scraping Wikipedia
I scraped the names of scientists from three different Wikipedia pages. These pages list both winners and non-winners of the Nobel Prize in the three natural science disciplines. However, using Wikipedia also brought some problems. Since Wikipedia is an open encyclopedia that anyone can freely edit, its content wasn't always reliable. For example, Xi Jinping (China's President) was in Wikipedia's list of chemists. This didn't matter too much, because such entries inevitably ended up as outliers once I started collecting their attributes.

I scraped names from these lists specifically because all the attributes I wanted to collect were available on the Wikipedia pages of most scientists, and I decided that scraping the pages of all the scientists available to me would introduce the least bias. Furthermore, this drastically decreased the number of missing attributes in my data set, since my sample was limited to those who at least have a Wikipedia page. A downside of this decision is that it yields only about 950 instances, which may or may not be enough for building an accurate model.
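As an illustration of how names can be pulled from such a list page, the sketch below parses a small HTML fragment with Python's standard `html.parser`. It assumes each scientist is a list item whose first link points at an article under `/wiki/`; the real pages may be structured differently, and the class name and sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ListNameParser(HTMLParser):
    """Collect the title of the first article link in each <li>."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._want_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._want_link = True
        elif tag == "a" and self._want_link:
            self._want_link = False  # only the first link of the item counts
            attributes = dict(attrs)
            href = attributes.get("href") or ""
            title = attributes.get("title")
            # Skip namespaced pages such as /wiki/Category:...
            if href.startswith("/wiki/") and ":" not in href and title:
                self.names.append(title)

    def handle_endtag(self, tag):
        if tag == "li":
            self._want_link = False


sample_html = """
<ul>
<li><a href="/wiki/Marie_Curie" title="Marie Curie">Marie Curie</a>
 (1867-1934), <a href="/wiki/Poland" title="Poland">Polish</a>-French</li>
<li><a href="/wiki/Linus_Pauling" title="Linus Pauling">Linus Pauling</a></li>
<li><a href="/wiki/Category:Chemists" title="Category:Chemists">More</a></li>
</ul>
"""

parser = ListNameParser()
parser.feed(sample_html)
print(parser.names)
```

A heuristic like this cannot catch entries that are listed under the wrong discipline, so unreliable entries still have to be filtered out later.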

Scraping Google Scholar

Google Scholar is arguably the richest source of citation data that is publicly available. However, Google has a very strict policy against automated queries, and this made the scraping process extremely difficult. Not only does Google drop requests that do not appear to come from browsers, but it also has a very sophisticated bot detection method.

By modifying the HTTP request headers, I was able to scrape the data for a few scientists (about 10), but after that Google permanently blocked my IP address. Attempts to trick Google by sending queries at random intervals of 20-30 seconds did not work either, and it was unclear how Google was detecting my scraper.

Consequently, I manually searched for every chemist's name in Google Scholar and set the recursion depth of my PageRank score calculation to only 2, instead of the initially planned 3. To speed up the process, I wrote a short script in JavaScript that extracts just the citation counts from a given page. In the process, I also found that Google flagged me as a bot even when I was running the queries by hand. It seemed that Google simply blocked an IP address once it sent more than a certain number of queries within some period of time, which meant that to get around the bot detection, the bot would have to go slower than I could by hand, which sort of defeats the purpose of using a bot.
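The original helper was written in JavaScript and is not reproduced here. As a rough Python analogue, the sketch below pulls the numbers out of the "Cited by N" links in the text of an already saved results page. The function name and the sample markup are assumptions, and the pattern relies on the English "Cited by" label.

```python
import re

# Matches the "Cited by 123" label shown under each search result.
CITED_BY = re.compile(r"Cited by (\d+)")


def citation_counts(page_text):
    """Return every 'Cited by N' count found in the page, in page order."""
    return [int(count) for count in CITED_BY.findall(page_text)]


# Hypothetical fragment of a saved results page.
sample_page = """
<div class="result"><a href="#">Cited by 1523</a> <a href="#">Related articles</a></div>
<div class="result"><a href="#">Cited by 87</a></div>
<div class="result"><a href="#">Related articles</a></div>
"""

counts = citation_counts(sample_page)
print(counts, sum(counts))
```

Results that have never been cited carry no "Cited by" link at all, so they simply contribute nothing to the list.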


Future Directions & Improvements

As discussed previously, my biggest challenge was defining a reasonable set of scientists for my dataset. The largest limitation of my data set is that it does not represent all chemists in the world well, mostly because only well-known chemists are listed on Wikipedia. In other words, the results of my model may not necessarily apply to every chemist in the world.

Another major limitation is that my dataset contains only 233 instances. Therefore, the accuracy of my model is sensitive to small changes in the dataset and more prone to overfitting. As mentioned in the Google Scholar scraping section, I had trouble accessing enough data to make the PageRank algorithm really work. With access to a bigger source of citation data, my variation of the PageRank algorithm could work much better than it does now.

Something that I could have tried but couldn’t due to time limits is a way to deal with attributes that may contain more than one value, for example institutions or alma mater. One way to do this is to use some kind of binary encoding: have each bit represent a school, so that n bits are enough to encode all possibilities when the data set contains a total of n different schools. This could have yielded a better result. Another way is to make each school its own attribute, although that might have required classifiers other than something like decision trees, since the number of attributes would grow extremely large and the dataset would become a very sparse matrix.
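The binary encoding idea can be sketched as follows: one bit per school, set to 1 when the scientist is associated with that school. The function names and the sample rows are illustrative assumptions, not part of the project's code.

```python
def build_vocabulary(rows):
    """Sorted list of every distinct school that appears in the data."""
    return sorted({school for row in rows for school in row})


def multi_hot(schools, vocabulary):
    """Encode a set of schools as one bit per vocabulary entry."""
    present = set(schools)
    return [1 if school in present else 0 for school in vocabulary]


# Hypothetical multi-valued "alma mater" attribute for three scientists.
rows = [
    ["Harvard", "Cambridge"],
    ["Oxford"],
    [],
]

vocabulary = build_vocabulary(rows)
encoded = [multi_hot(row, vocabulary) for row in rows]
print(vocabulary)
print(encoded)
```

With n distinct schools each scientist becomes an n-bit vector, which is also the sparse matrix mentioned above when each bit is treated as its own attribute.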