6 Conclusion

The goal of this project was to use Natural Language Processing to classify fake news. To do so, a dataset web-scraped from PolitiFact was used. The text was cleaned, the vocabulary was reduced, and features were extracted. Seven different machine learning models were then fit using these features. The two simplest algorithms (Naive Bayes and basic logistic regression) performed the worst, with accuracies of 60% and 62%, respectively. The other five models performed comparably, each with an accuracy of around 70%.
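
As a rough illustration of this pipeline, the sketch below extracts TF-IDF features and fits the two baseline models with scikit-learn. The variable names (`claims`, `labels`), the vectorizer settings, and the train/test split are assumptions for illustration only, not the exact configuration used in this project.

```python
# Minimal sketch of the baseline pipeline, assuming the cleaned claim texts and
# binary labels are available as the (hypothetical) lists `claims` and `labels`.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Feature extraction with a reduced vocabulary (rare terms dropped via min_df).
vectorizer = TfidfVectorizer(stop_words="english", min_df=5)
X = vectorizer.fit_transform(claims)

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

# Fit the two simplest models and compare held-out accuracy.
for model in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))
```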

An automated fake news classifier that performs with 70% accuracy is by no means useless, but it is far from robust. With a topic such as fake news, which has already hurt much of the American public’s perception of the government, every misclassification carries an outsized cost. A classifier that is only 70% accurate would, in practice, do little to help the public trust what is fake and what is not. Thus, far more work is needed to create a reliable fake news classifier. Luckily, there are clear areas for improvement, which follow from the limitations laid out in the next section.

6.1 Limitations

Among other things, the raw PolitiFact dataset used in this project contains the original PolitiFact URL, the PolitiFact truth rating, the claim being assessed, and the URL and text of the article in which the claim appeared. The original idea of this project was to use the article text (along with the PolitiFact truth rating) to train a fake news classifier. However, an article’s text does not necessarily make the claim that PolitiFact is grading: each PolitiFact rating applies to the claim an article reports on, not to the article itself.

For example, if an article reports on a false claim and casts the same doubt on it that PolitiFact does with its truth rating of “False,” then the article text itself is not necessarily false (even though the claim it reports on is). In that case, the target of “False” would not be aligned with the article text. Without manually reviewing each article, there is no way to check whether its text casts doubt on a claim or presents the same claim that PolitiFact is fact-checking. Since manual review is infeasible, the project switched from extracting features from the original article text to extracting them from the claim being graded by PolitiFact. This also meant that duplicate PolitiFact entries could be removed from the dataset, since duplicate entries share the same claim and truth level.
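
Because this deduplication step is straightforward, a minimal sketch is shown below, assuming the scraped data sits in a pandas DataFrame with hypothetical `claim` and `truth_rating` columns and a placeholder file path.

```python
import pandas as pd

# Hypothetical raw scrape of PolitiFact entries; the path and column names
# are placeholders, not the actual files used in this project.
raw = pd.read_csv("politifact_raw.csv")

# Once the original article text is dropped, entries that share the same claim
# and truth rating are duplicates, so only the first occurrence is kept.
claims = raw.drop_duplicates(subset=["claim", "truth_rating"]).reset_index(drop=True)
```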

Additionally, this project initially tried to classify all six PolitiFact truth levels. In the end, only the PolitiFact claims rated “True” (\(1\)) or “Pants on Fire!” (\(6\)) were used. This simplified the many models fit in the project and improved their performance. A coarser grouping, in which ratings 1-3 were labeled “True” and ratings 4-6 were labeled “False,” was also attempted, but it resulted in worse model performance. This is likely because claims are inherently very short pieces of text, and distinguishing a “Half-True” claim from a “Mostly False” one is very difficult when looking at the claim alone.
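
A minimal sketch of this label simplification is shown below, continuing the hypothetical DataFrame from the previous sketch and assuming the truth ratings are stored as integers from \(1\) (“True”) to \(6\) (“Pants on Fire!”).

```python
# Keep only the two extreme ratings and binarize them (1 = fake, 0 = true).
two_class = claims[claims["truth_rating"].isin([1, 6])].copy()
two_class["label"] = (two_class["truth_rating"] == 6).astype(int)

# The coarser grouping that was also tried (ratings 1-3 vs. 4-6), but that
# performed worse, would look like this:
# claims["label"] = (claims["truth_rating"] >= 4).astype(int)
```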

The many models fit in this project also suffered from the limited amount of data available. After addressing the limitations mentioned above, only \(1911\) rows were left in the dataset. This is not nearly enough data in the context of NLP, especially for deep learning algorithms. Standard NLP deep learning problems use datasets on the order of tens of thousands of observations (rather than the \(1911\) available in this project).25

6.2 Future Work

The largest areas for future work in this project are collecting more data and fine-tuning the deep learning methods used. As mentioned in Section 6.1, the amount of data used in this project was not on the same scale as in other NLP deep learning work. Collecting more, and higher-quality, data is paramount to increasing the accuracy of a fake news classifier. There is also much work to be done to improve the performance of the models used here. Only a limited amount of hyperparameter tuning was performed, as robust hyperparameter tuning for seven different models was beyond the scope of this project. In particular, much work remains in fine-tuning the deep learning models because of their inherently “black box”26 nature.
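
As one possible starting point for that tuning, the sketch below runs a small cross-validated grid search over a single model (logistic regression), reusing the hypothetical TF-IDF features and labels from the baseline sketch; the parameter grid is an illustrative assumption rather than a recommended search space.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small illustrative grid over the regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)  # X_train, y_train as in the baseline sketch
print(search.best_params_, search.best_score_)
```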


  25. “How to Build a Neural Network With Keras Using the IMDB Dataset” (n.d.)

  26. In machine learning, “black box” refers to the idea that “data goes in to a model, decisions come out, but the processes between input and output are opaque.” “Black Box” (n.d.)