# 3 Data

The data used in this project comes from PolitiFact, a fact-checking website. Articles, their claims, and the associated truth ratings of those claims are web-scraped from PolitiFact by the Discourse Processing Lab at Simon Fraser University. The Discourse Processing Lab provides a web-scraping tool for multiple fact-checking websites, including PolitiFact and Snopes. The dataset web-scraped from PolitiFact was chosen over the one from Snopes because PolitiFact uses six levels of truth compared to Snopes' two. The web-scraping works in two parts: first, information such as the claim being assessed, the assessment of that claim, and all links to articles mentioned in the assessment is downloaded from the selected source and added to the final dataset. Then, the link most likely to contain the article being assessed is selected, and the original text of that article is downloaded and added to the dataset.
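The two-part flow above can be sketched in a few lines. This is only an illustration (in Python) of the idea, not the Discourse Processing Lab's actual tool: the field names and the link-selection heuristic (word overlap between the claim and each URL) are hypothetical stand-ins.

```python
# Hypothetical sketch of the two-step scraping flow described above.
# The real scraper is the Discourse Processing Lab's tool; the field
# names and the link-selection heuristic here are illustrative only.

def select_article_link(links, claim_words):
    """Pick the link most likely to contain the article being assessed:
    here, the one whose URL shares the most words with the claim."""
    def overlap(url):
        return sum(word in url.lower() for word in claim_words)
    return max(links, key=overlap)

# Step 1: the claim, its assessment, and all linked articles are stored.
entry = {
    "claim": "example claim about approval ratings",
    "rating": "Mostly True",
    "links": [
        "https://example.com/politics/approval-ratings-analysis",
        "https://example.com/about-us",
    ],
}

# Step 2: choose the most likely article link; its text would then be
# downloaded and appended to the dataset.
claim_words = entry["claim"].lower().split()
best = select_article_link(entry["links"], claim_words)
```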

The PolitiFact dataset includes columns such as the original PolitiFact URL, the PolitiFact truth rating, the claim of the article being assessed, the URL and text of the article where the claim came from, and the category into which the claim falls. The levels of truth are described by PolitiFact as follows:

• True (1): The statement is accurate and there’s nothing significant missing.
• Mostly True (2): The statement is accurate but needs clarification or additional information.
• Half-True (3): The statement is partially accurate but leaves out important details or takes things out of context.
• Mostly False (4): The statement contains an element of truth but ignores critical facts that would give a different impression.
• False (5): The statement is not accurate.
• Pants on Fire! (6): The statement is not accurate and makes a ridiculous claim.

To simplify the classification task, only two of the six truth levels were used in this project: claims rated "True" (1) or "Pants on Fire!" (6). An alternative binarization, in which ratings 1–3 were labeled "True" and ratings 4–6 "False," was also attempted, but it resulted in worse model performance. This is discussed further in Section 6.1.
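The filtering and numeric encoding just described can be sketched as follows. This is an illustrative Python sketch with made-up rows, not the project's own code (which appears in Section 7.1).

```python
# Sketch of the label filtering described above (toy rows; the project's
# actual cleaning code is shown in Section 7.1).
RATING_TO_NUM = {"True": 1, "Mostly True": 2, "Half-True": 3,
                 "Mostly False": 4, "False": 5, "Pants on Fire!": 6}

rows = [
    {"claim": "claim A", "rating": "True"},
    {"claim": "claim B", "rating": "Half-True"},
    {"claim": "claim C", "rating": "Pants on Fire!"},
]

# Keep only the two extreme ratings and encode them as numbers.
kept = [{"claim": r["claim"], "y": RATING_TO_NUM[r["rating"]]}
        for r in rows if r["rating"] in ("True", "Pants on Fire!")]
# kept -> [{'claim': 'claim A', 'y': 1}, {'claim': 'claim C', 'y': 6}]
```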

## 3.1 Data Cleaning

Each entry in this dataset contains a claim to which PolitiFact has assigned a truth rating. Since multiple external news articles can discuss the same claim, a single PolitiFact claim (and its truth rating) can be associated with multiple articles, and thus multiple bodies of text. For example, the first three entries in the raw dataset all come from the same PolitiFact entry, which rates the phrase "Trump approval rating better than Obama and Reagan at same point in their presidencies." as Mostly True. There are three entries from this one PolitiFact article (all sharing the exact same PolitiFact URL) because multiple news sources (in this case Fox, NBC, and FiveThirtyEight) reported on the claim. To reduce such duplicate claims, I kept only the first entry for each unique PolitiFact URL (even though the three articles contain different textual content). The removal of duplicate entries is discussed further in Section 6.1.
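The keep-first de-duplication can be sketched as follows. This is an illustrative Python sketch with hypothetical field names; the project's own implementation is in Section 7.1.

```python
# Sketch of the de-duplication step: keep only the first entry for each
# unique PolitiFact URL (field names and rows are illustrative).
rows = [
    {"url": "https://politifact.example/claim-1", "source": "Fox"},
    {"url": "https://politifact.example/claim-1", "source": "NBC"},
    {"url": "https://politifact.example/claim-1", "source": "FiveThirtyEight"},
    {"url": "https://politifact.example/claim-2", "source": "AP"},
]

seen, deduped = set(), []
for row in rows:
    if row["url"] not in seen:       # first occurrence of this URL?
        seen.add(row["url"])
        deduped.append(row)          # keep it; later duplicates are dropped
# deduped keeps the Fox entry for claim-1 and the AP entry for claim-2
```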

In addition to removing duplicate PolitiFact entries, the dataset was filtered down to the two target ratings of interest (described in the previous section), and those ratings were converted to numeric labels. The remaining data cleaning involved tidying the text of the claims associated with the two targets and reducing the size of the overall vocabulary (for the sake of bag-of-words feature extraction). In particular, this involved basic text-cleaning methods (such as removing punctuation), removing stop words, and lemmatization. The full data cleaning done in this project can be seen in Section 7.1.
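The cleaning steps just listed can be sketched in miniature. This is an illustrative Python sketch: the stop-word list and the tiny lemma table are toy stand-ins for the real ones used in the project's code (Section 7.1).

```python
import re

# Minimal sketch of the cleaning pipeline described above: lowercase,
# strip punctuation, drop stop words, lemmatize. The stop-word list and
# lemmatizer here are toy stand-ins for the real ones.
STOP_WORDS = {"the", "a", "an", "is", "was", "by", "that", "and", "of"}
LEMMAS = {"investigators": "investigator", "killed": "kill",
          "operatives": "operative"}

def clean(text):
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # keep letters only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(LEMMAS.get(t, t) for t in tokens)

clean('"Investigators: Anthony Bourdain was killed by Clinton operatives."')
# -> 'investigator anthony bourdain kill clinton operative'
```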

## 3.2 Exploratory Data Analysis

After the full text cleaning and vocabulary reduction described previously (and shown fully in Section 7.1), the vocabulary contains 806 terms across 1,911 documents. To check that the text cleaning worked properly, we can compare a few selected claim documents before and after cleaning:

```
## [1] "\"Investigators: Anthony Bourdain was killed by Clinton operatives.\""
## [1] "investigator anthony bourdain kill clinton operative"
## [1] "Says that people \"went out in their boats to watch\" Hurricane Harvey."
## [1] "say people go boat watch hurricane harvey"
```

Since everything is now working as expected, the final thing to explore is the most frequently used terms in the vocabulary. Figure 3.1 displays the ten most frequently used words in the PolitiFact text corpus.

It appears that the word "say" was by far the most used word, with 417 occurrences. "Say" is a fairly common word that is not a stop word, so this makes sense; it is also reasonable to keep it in this context, as the act of "saying" something may be important in classifying fake news. We also observe that no classic stop word (such as "the") appears in the bar plot, which is what we expect. Lastly, words such as "obama," "president," and "trump" also appear in the bar plot; since PolitiFact primarily fact-checks political claims, this makes sense as well.
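The frequency count behind a plot like Figure 3.1 can be sketched as follows. This is an illustrative Python sketch over a toy corpus; the real counts come from the full cleaned PolitiFact corpus.

```python
from collections import Counter

# Sketch of the term-frequency count behind a top-words bar plot
# (toy documents, not the real corpus).
docs = [
    "say people go boat watch hurricane harvey",
    "say obama approval rating high",
    "say trump approval rating high",
]
counts = Counter(token for doc in docs for token in doc.split())
top = counts.most_common(3)
# top[0] -> ('say', 3)
```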

With successful text cleaning and feature extraction, it is time to fit machine learning models to the data. (Note that the full code for the process of feature extraction mirrors that of the previous example in Section 2.1.2 and can be seen fully in Section 7.1.4.)

1. Note that it does not make sense to visualize the document-term matrix (DTM) here because it is far too large to inspect directly, and DTMs are inherently sparse. The DTMs were examined and work properly; they can be seen further in Section 7.1.