2 Background

2.1 Natural Language Processing

Natural Language Processing (NLP) is the process of helping computers understand and interpret human language (which is hard to do without an inherent knowledge of tone, connotation, sarcasm, etc.).7 NLP is important because it allows computers to read text (or listen to speech) and interpret what messages are being conveyed. It is used in tasks such as sentiment analysis, language translation, or—in the case of fake news—classification.

NLP is advantageous because it is able to analyze language-based data “in a consistent and unbiased way.”8 Note that the notion of algorithmic bias9 is a complex topic that is beyond the scope of this project. Thus, for the purposes of this project, the NLP methods used are assumed to classify without bias. Under that assumption, using NLP to classify fake news is beneficial because every claim is analyzed in the same consistent way.

In this project, NLP is used to extract features from a set of PolitiFact claims. Those features, and the associated truth rating of each claim, are then used to fit multiple machine learning algorithms with the end goal of classifying fake news with high accuracy. In order to extract features from text, lemmatization and stop word removal are first used to reduce the size of the overall vocabulary. Then, the bag-of-words model, which is represented by a document-term matrix (DTM), is used to create a matrix of features.

2.1.1 Reducing a Vocabulary with Lemmatization and Stop Words

In NLP, a text corpus refers to the collection of text documents being used. In the context of this project, the text corpus is the full set of PolitiFact claims. A vocabulary is then the set of unique words used in a text corpus.10 When extracting features from text, a common problem is that the vocabulary is too large. One reason is that words with very similar meanings (such as “cook,” “cooks,” and “cooking”) are each included as a separate entry in the vocabulary. Another is that words that carry little meaning in most contexts (such as “and,” “of,” and “the”) significantly increase the vocabulary size. To combat this, stemming or lemmatization and the removal of stop words can be used.
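As a rough illustration of how a vocabulary is derived from a corpus, the sketch below uses only base R. The two-document corpus and the whitespace tokenization are assumptions made for this example and are not the tokenization used in the project.

```r
# a tiny, made-up corpus of two documents
corpus <- c("the cook cooks", "cooking and the cook")

# split each document on whitespace and pool all of the tokens
tokens <- unlist(strsplit(tolower(corpus), "\\s+"))

# the vocabulary is the set of unique tokens
vocabulary <- unique(tokens)
vocabulary_size <- length(vocabulary)
```

Here "cook," "cooks," and "cooking" each take up their own vocabulary entry, and "the" and "and" add two more, giving five entries for what is essentially one meaningful word.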

Stemming is a process used in NLP that removes the ends of words in order to strip derivational affixes.11 There are standard stemming libraries that remove word endings such as “-ing,” “-es,” and “-ed.” In addition, users can create their own custom functions to remove word endings that a stemming library does not cover. In stemming, the words “dog,” “dogs,” and “dog’s” would all be cut down to the word “dog.” This helps to reduce the number of distinct words that are counted in a document. However, a word such as “doggie” would likely be missed by stemming (unless a custom function had “-gie” set to be removed). Because of such exceptions, lemmatization is used in this project instead of stemming.

Lemmatization has the same end goal as stemming, but uses a vocabulary of base words (lemmas) that words in a document are matched to (and then changed to).12 An example of when lemmatization is preferable to stemming is a word such as “better,” which has “good” as its lemma. This sort of word would be missed with stemming, but caught with lemmatization (which would have a dictionary matching these two words together). Another example is the word “caring.” Lemmatizing “caring” gives “care,” but a simple stemming function would likely give “car” (since it would erroneously remove just the “-ing”). Lemmatization is useful because it reduces the vocabulary size by condensing words that have different textual representations (but are otherwise the same word) into a single vocabulary entry.
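The contrast between the two approaches can be sketched in base R. Both `naive_stem()` and the hand-written `lemma_dict` are hypothetical helpers written for this illustration. Real stemming and lemmatization libraries use more elaborate rules and much larger dictionaries, so their output may differ.

```r
# a deliberately naive stemmer: strip a few common endings with a regex
naive_stem <- function(words) {
  sub("(ing|es|ed|'s|s)$", "", words)
}

# a toy lemma dictionary mapping surface forms to their lemmas
lemma_dict <- c(better = "good", caring = "care", dogs = "dog")

# look each word up in the dictionary; keep the word itself if it is absent
lookup_lemma <- function(words) {
  hits <- lemma_dict[words]
  unname(ifelse(is.na(hits), words, hits))
}

stems  <- naive_stem(c("dog", "dogs", "dog's", "caring", "doggie"))
lemmas <- lookup_lemma(c("better", "caring", "dog"))
```

With this naive rule, "caring" is cut down to "car" and "doggie" is left untouched, while the dictionary lookup maps "better" to "good" and "caring" to "care."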

Stop word removal is the process of removing specified words from a vocabulary with the goal of reducing the vocabulary size. Stop words are words such as “and,” “the,” and “as” that add noise to NLP feature extraction and don’t provide any interesting insights. Custom stop words (depending on the context of the problem at hand) can also be added to a stop words dictionary.
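A minimal base R sketch of stop word removal follows. The short `stop_words` vector and the `remove_stop_words()` helper are assumptions for illustration. In practice a predefined list from a text mining package would normally be used.

```r
# a small, hand-picked stop word list (real lists are much longer)
stop_words <- c("and", "the", "as", "of")

# drop every token that appears in the stop word list
remove_stop_words <- function(text, stops) {
  tokens <- strsplit(tolower(text), "\\s+")[[1]]
  paste(tokens[!tokens %in% stops], collapse = " ")
}

cleaned <- remove_stop_words("The cat and the hat", stop_words)
# cleaned is "cat hat"
```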

A full working example of how to clean a single text entry to do the above discussed lemmatization and stop word removal (as well as some other standard text cleaning operations) can be seen in the code chunk below. A very similar text cleaning function is used to clean the PolitiFact claims in this project (and can be seen in Section 7.1).

# load the packages that provide the cleaning functions
library(tm)        # removePunctuation(), removeWords(), stopwords(), etc.
library(textstem)  # lemmatize_strings()
library(dplyr)     # the %>% pipe

# this is our text input 
trump_claim_raw <- "'Trump approval rating    better than Obama  and Reagan at 
            same point in their presidencies.'"  

# a function to perform text cleaning
clean_text <- function(input_text) {
  # create a list of all of the stop words we want to remove from our corpus
  # note that `stopwords("en")` is a pre-defined list of stop words, but we
  # can also add custom stop words (here Obama and Reagan are added)
  all_stops <- c("Obama", "Reagan", stopwords("en"))

  # remove any punctuation from the text
  output_text <- removePunctuation(input_text) %>% 
    # remove our custom list of stop words
    removeWords(all_stops) %>% 
    # make all letters lowercase
    tolower() %>% 
    # lemmatize the words in the text 
    lemmatize_strings() %>% 
    # remove any numbers in the text
    removeNumbers() %>% 
    # get rid of any extra white space
    stripWhitespace()

  return(output_text)
}

# apply the text cleaning function
trump_claim_clean <- clean_text(trump_claim_raw)
# note how "better" becomes "good" and "presidencies" becomes "presidency"
# also notice that "Obama" and "Reagan" were removed
trump_claim_clean
## [1] "trump approval rate good point presidency"

2.1.2 Bag-of-Words Model for Feature Extraction

Once a vocabulary has been created and its size reduced, features can be extracted using the bag-of-words model. The bag-of-words model extracts features from text by describing the occurrence of each vocabulary word within a single document (and discarding any information about the order of those words).13 Each document in a text corpus (each PolitiFact claim) is represented by a score for each word in the vocabulary. In the simplest case, that score is the number of times the vocabulary word occurs in the document. There are many alternative scoring metrics, one being term frequency-inverse document frequency (tf-idf).14 (Tf-idf is used as the scoring metric in this project.)
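To make the two scoring metrics concrete, the base R sketch below computes raw term counts and then one common textbook form of tf-idf, in which each count is multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. The three pre-tokenized documents are made up for this example, and this formula is an assumption. Packages may apply smoothing or normalization, so their values can differ.

```r
# three already cleaned and tokenized documents (made up for illustration)
docs <- list(
  c("bob", "really", "like", "car"),
  c("sally", "really", "really", "like", "car"),
  c("sally", "like", "type", "car")
)

# the vocabulary is the set of unique words across all documents
vocab <- sort(unique(unlist(docs)))

# term counts: one row per document, one column per vocabulary word
counts <- t(sapply(docs, function(tokens) table(factor(tokens, levels = vocab))))

# document frequency: in how many documents does each word appear?
doc_freq <- colSums(counts > 0)

# inverse document frequency, then tf-idf = count * idf
idf <- log(nrow(counts) / doc_freq)
tf_idf <- sweep(counts, 2, idf, "*")
```

Words that appear in every document, such as "car" and "like," receive a tf-idf score of zero under this formula, while "really" in the second document scores 2 * log(3 / 2).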

In practice, bag-of-words feature extraction is represented by a document-term matrix (DTM). A DTM is “a matrix that describes the frequency of terms that occur in a collection of documents” where “rows correspond to documents in the collection and columns correspond to terms.”15 To best illustrate this, an example of creating a DTM (using term frequency to fill in cell values) with the text2vec package can be seen in the code chunk below. Here, we can see a basic example of how to extract features from an initially uncleaned text corpus by first reducing the vocabulary size and then creating a DTM. The resulting DTM can be seen in Table 2.1.

# load the packages used below (clean_text() is the function defined earlier)
library(dplyr)     # mutate() and the %>% pipe
library(text2vec)  # tokenizing, vocabulary, and DTM functions

# our initial text corpus, with a unique id for each document
car_text <- data.frame(
  text = c("Bob really likes cars", 
           "Sally really, really does not like cars",
           "Sally only likes 3 types of cars"),
  id = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# clean the text using the same cleaning function from before
# in order to reduce the vocabulary size
car_text <- car_text %>% 
  mutate(cleaned_text = clean_text(text))

# tokenize the text into individual words and wrap the tokens in an iterator
tokenizer_car <- word_tokenizer(car_text$cleaned_text) %>% 
  itoken(ids = car_text$id, progressbar = FALSE)
# create our vocabulary using the tokenized words
car_vocabulary <- create_vocabulary(tokenizer_car) 

# create a vectorizer that maps each word in the vocabulary to a DTM column
vocabulary_vectorizer <- vocab_vectorizer(car_vocabulary) 

# finally, create our DTM
car_dtm <- create_dtm(tokenizer_car, vocabulary_vectorizer)
# convert the DTM to a matrix so that we can print it
car_dtm <- as.matrix(car_dtm)

Table 2.1: An example DTM using a term frequency scoring metric

document  bob  type  sally  car  like  really
1           1     0      0    1     1       1
2           0     0      1    1     1       2
3           0     1      1    1     1       0

In Table 2.1 we observe that the vocabulary in this example has been trimmed to six words. Words such as “does,” “of,” and “only” have been removed and words such as “likes” have been lemmatized to “like.” Each row represents a document and each column represents a word in the vocabulary. Each cell tells us how many times a vocabulary word appears in a specific document. For example, the first document has one occurrence of “really,” the second document has two occurrences, and the third document has none.

2.2 Deep Learning Models

2.2.1 Multilayer Perceptrons

Deep learning is a subfield of machine learning that is concerned with artificial neural networks, which are algorithms inspired by the human brain. Neural networks, which are what the term “deep learning models” refers to, are beneficial because results tend to improve with more data and larger models (at the expense of computation and run-time). They are also able to “perform automatic feature extraction from raw data,” which makes them well suited to NLP tasks.16

To understand how neural networks work, we must first understand their building blocks, perceptrons. A perceptron is made up of four parts: input values, weights and bias, a net input function, and an activation function. Figure 2.1 illustrates these parts.17

Figure 2.1: A diagram of a basic perceptron

The input values, represented by the first row of nodes in the above figure, are each multiplied by the corresponding weight value in the second row of nodes (weight values are initially randomized). The net input function then sums these weighted inputs, together with the bias, into a single net input. This net input is passed to the activation function, which decides (based on the net input) whether to output a \(+1\) or a \(-1\). This output is the predicted class label of the example. During the learning phase, the predicted class labels are used to calculate the error of each prediction and update the weights accordingly.18
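The steps above can be sketched in base R. This is a minimal illustration that assumes the classic perceptron learning rule (each weight is nudged by the learning rate times the prediction error times the input). The function names, the learning rate `eta`, and the toy AND data set are all made up for this example. Weights start at zero rather than at random values so that the run is reproducible.

```r
# net input: weighted sum of the inputs plus the bias
net_input <- function(x, weights, bias) sum(x * weights) + bias

# activation: output +1 if the net input is non-negative, otherwise -1
activation <- function(z) ifelse(z >= 0, 1, -1)

predict_perceptron <- function(x, weights, bias) {
  activation(net_input(x, weights, bias))
}

# learning phase: update the weights whenever a prediction is wrong
train_perceptron <- function(X, y, eta = 0.5, epochs = 10) {
  weights <- rep(0, ncol(X))
  bias <- 0
  for (epoch in seq_len(epochs)) {
    for (i in seq_len(nrow(X))) {
      error <- y[i] - predict_perceptron(X[i, ], weights, bias)
      weights <- weights + eta * error * X[i, ]
      bias <- bias + eta * error
    }
  }
  list(weights = weights, bias = bias)
}

# toy data: the logical AND of two inputs, labelled +1 / -1
X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1))
y <- c(-1, -1, -1, 1)

fit <- train_perceptron(X, y)
predictions <- apply(X, 1, predict_perceptron,
                     weights = fit$weights, bias = fit$bias)
```

Because the AND data is linearly separable, the learned weights classify all four examples correctly after a handful of passes over the data.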

When multiple perceptrons are connected and stacked in several layers, a neural network is created. In their most basic form, neural networks consist of an input layer of features, a hidden layer, and an output layer. A multilayer perceptron (MLP) is a neural network with at least one hidden layer. Figure 2.2 displays an MLP where each node in the input and hidden layers represents an individual perceptron.19 Each perceptron in a layer is connected to every perceptron in the following layer by weighted edges. Because there can be a large number of nodes in each layer, an MLP can quickly become complex. As with individual perceptrons, output values are predicted by stepping through the neural network during the feedforward phase. Edge weights between neurons are then updated during training through a process called backpropagation.20 This feedforward and backpropagation process repeats a user-defined number of times. Normally, when training a neural network, this process should be repeated until the edge weights maximize classification accuracy on a validation set.
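The feedforward phase can be sketched in base R as two matrix multiplications. The layer sizes, the random initial weights, and the sigmoid activation are assumptions chosen for illustration. Backpropagation is omitted, so this shows only how a prediction is produced from a fixed set of weights.

```r
# sigmoid activation squashes any real number into the interval (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(42)
n_input  <- 4  # number of features
n_hidden <- 3  # nodes in the single hidden layer
n_output <- 1  # one output node

# randomly initialized edge weights and zero biases
W1 <- matrix(rnorm(n_input * n_hidden), nrow = n_input)
b1 <- rep(0, n_hidden)
W2 <- matrix(rnorm(n_hidden * n_output), nrow = n_hidden)
b2 <- rep(0, n_output)

# feedforward: input layer -> hidden layer -> output layer
feedforward <- function(x, W1, b1, W2, b2) {
  hidden <- sigmoid(as.vector(x %*% W1) + b1)
  sigmoid(as.vector(hidden %*% W2) + b2)
}

x <- c(1, 0, 2, 1)  # one example with four feature values
prediction <- feedforward(x, W1, b1, W2, b2)
```

Training would repeatedly compare such predictions with the true labels and adjust `W1`, `b1`, `W2`, and `b2` through backpropagation.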

Figure 2.2: A diagram of a multilayer perceptron

Setting the number of hidden layers in an MLP (as well as the number of nodes in each hidden layer) is an issue of model tuning and can vary from model to model. However, there is rarely much practical need to have more than two hidden layers, and one is sufficient in most problems.21 There is no fixed rule for how many nodes each layer should have, but a general rule of thumb is this: the input layer should have as many nodes as there are features, and a hidden layer should have a number of nodes equal to the average of the number of nodes in the input layer and the number of nodes in the output layer.
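The rule of thumb amounts to a one-line calculation, sketched here in base R. The helper name `suggest_hidden_nodes()` is hypothetical, and rounding up when the average is not a whole number is an assumption, since the rule itself does not say how to round.

```r
# average of the input and output layer sizes, rounded up
suggest_hidden_nodes <- function(n_input, n_output) {
  ceiling(mean(c(n_input, n_output)))
}

# 6 features and 2 output nodes give (6 + 2) / 2 = 4 hidden nodes
n_hidden <- suggest_hidden_nodes(n_input = 6, n_output = 2)
```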

2.2.2 Recurrent Neural Networks

Recurrent neural networks (RNNs) are designed for problems with a notion of sequence and add a representation of memory to a neural network. For this reason, RNNs are well suited to NLP tasks where the sequencing of words in a sentence matters. RNNs are able to keep track of sequencing by allowing neurons to pass values sideways within a given layer (in addition to being able to pass values forwards as normal).22

It’s important to note that since RNNs require a notion of sequence, the bag-of-words model mentioned previously needs to be slightly tweaked. For RNNs, each word in the vocabulary needs its own individual identifier (in the form of an integer). Then, after text cleaning, each document is translated word by word into the corresponding id numbers. This can still be represented by a matrix similar to a DTM, where the number of columns is the maximum number of words found in any document within the text corpus and the number of rows is the total number of documents. Each cell holds the numerical id of a word in a document (in the original order the words appeared in). If a given document is shorter than the maximum document length, zeros are added to the end of it.
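This encoding can be sketched in base R. The three pre-tokenized documents, the order in which ids are assigned, and the `encode_and_pad()` helper are assumptions made for illustration. Deep learning libraries offer their own utilities for this step, and their defaults may differ.

```r
# three already cleaned and tokenized documents (made up for illustration)
docs <- list(
  c("bob", "really", "like", "car"),
  c("sally", "really", "really", "like", "car"),
  c("sally", "like", "type", "car")
)

# give every vocabulary word its own integer id (0 is reserved for padding)
vocab <- unique(unlist(docs))
word_ids <- setNames(seq_along(vocab), vocab)

# the longest document determines the number of columns
max_len <- max(lengths(docs))

# translate words to ids and pad the end of short documents with zeros
encode_and_pad <- function(tokens, word_ids, max_len) {
  ids <- unname(word_ids[tokens])
  c(ids, rep(0L, max_len - length(ids)))
}

sequence_matrix <- t(sapply(docs, encode_and_pad,
                            word_ids = word_ids, max_len = max_len))
```

The result has one row per document and five columns. The first document becomes 1 2 3 4 0, where the trailing zero is padding, while the second document keeps both occurrences of "really" in their original positions.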

  7. Cybiant (n.d.)

  8. “What Is Natural Language Processing?” (n.d.)

  9. Algorithmic bias describes systematic and repeatable errors in a computer system that create unfair outcomes, such as privileging one arbitrary group of users over others. “Algorithmic Bias” (2020)

  10. “Vocabulary - Natural Language Processing with Machine Learning” (n.d.)

  11. “Stemming and Lemmatization” (n.d.)

  12. “Stemming and Lemmatization” (n.d.)

  13. Brownlee (2017)

  14. Tf-idf is used to increase the weight of terms which are specific to a single document (or handful of documents) and decrease the weight for terms used in many documents. “Analyzing Texts with the Text2vec Package” (n.d.)

  15. “Document-Term Matrix” (2020)

  16. Brownlee (2019)

  17. “What Is Perceptron | Simplilearn” (n.d.)

  18. ReginaOfTech (2019)

  19. “Perceptrons & Multi-Layer Perceptrons: The Artificial Neuron - MissingLink” (n.d.)

  20. “Backpropagation is an algorithm for supervised learning of artificial neural networks using gradient descent. Given an artificial neural network and an error function, the method calculates the gradient of the error function with respect to the neural network’s weights.” “Backpropagation” (n.d.)

  21. “Model Selection - How to Choose the Number of Hidden Layers and Nodes in a Feedforward Neural Network?” (n.d.)

  22. Brownlee (2016)