7 Appendix
The appendix walks through all of the code performed in this project including data wrangling, feature extraction, and model fitting.
7.1 Data Wrangling/Feature Extraction
7.1.1 Basic Data Wrangling
We begin by reading in our raw data:
< read_csv("../dataraw/politifact_phase2_clean_2018_7_3.csv") politifact_df_raw
We then clean up the raw data by getting rid of rows with duplicate URLs, getting rid of targets we don’t care about, and creating two columns for our targets (one numerical, one categorical) since different models require different variable types for target values. In addition, we create a unique ID for each row (which consists of an article’s claim and its PolitiFact truth rating).
# clean up raw data
< politifact_df_raw %>%
politifact_df_cleaned # get rid of duplicate URLs (reasoning for this is discussed in limitations)
distinct(politifact_url_phase1, .keep_all = TRUE) %>%
# keep only the two targets we care about
filter(fact_tag_phase1 == "True"  fact_tag_phase1 == "Pants on Fire!") %>%
# change desired targets to categorical and numerical
mutate(targets_numerical = as.numeric(ifelse(fact_tag_phase1 == "True", 1, 0)),
targets_categorical = as.factor(ifelse(fact_tag_phase1 == "True", 1, 0))) %>%
# rename the article claim variable
rename(article_claim = article_claim_phase1) %>%
# only keep the variables we care about
select(article_claim, targets_numerical, targets_categorical)
# add an id
$article_id < seq.int(nrow(politifact_df_cleaned))
politifact_df_cleaned# take a look at our cleaned dataframe
glimpse(politifact_df_cleaned)
## Rows: 1,911
## Columns: 4
## $ article_claim <chr> "\"When you tally up their representation in Congress and governorships…
## $ targets_numerical <dbl> 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,…
## $ targets_categorical <fct> 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,…
## $ article_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, …
7.1.2 Cleaning the Text/Reducing the Vocabulary Size
Since the article_claim
column contains messy text data, text cleaning is needed.
The text cleaning done in this project consists of the following: removing punctuation, making all
letters lowercase, removing English stop words, lemmatizing each word, removing numbers, and
removing any extra white space. Removing stop words, numbers, and lemmatizing words all leads to a
decrease in the size of the overall vocabulary (all unique words across a set of text).
The clean_text
function performs the aforementioned text cleaning and vocabulary
reduction:
# a function to perform text cleaning
< function(input_text) {
clean_text # remove punctuation
< removePunctuation(input_text) %>%
output_text # make all letters lowercase
tolower() %>%
# remove a custom list of English stop words (from the `tm` package)
removeWords(stopwords("en")) %>%
# lemmatize the words in the text
lemmatize_strings() %>%
# remove any numbers in the text
removeNumbers() %>%
# get rid of any extra white space
stripWhitespace()
return(output_text)
}
# an example of using the `clean_text` function
clean_text("'Trump approval rating better than Obama and Reagan at
same point in their presidencies.'")
## [1] "trump approval rate good obama reagan point presidency"
The clean_text
function is used to clean up the text in
politifact_df_cleaned
.
# clean up the article claim text, create a final dataframe
< politifact_df_cleaned %>%
politifact_df_final mutate(article_text = clean_text(article_claim)) %>%
select(article_id, article_text, targets_numerical, targets_categorical)
glimpse(politifact_df_final)
## Rows: 1,911
## Columns: 4
## $ article_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, …
## $ article_text <chr> "tally representation congress governorship democrat almost low represe…
## $ targets_numerical <dbl> 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,…
## $ targets_categorical <fct> 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0,…
7.1.3 Train/Test Split
In order to evaluate classification models, a training set and a testing set are needed. In this project, an \(80%\)/\(20%\) training/testing split is used.
# set the seed to ensure reproducibility
set.seed(2)
# the number of rows in our final data frame
< nrow(politifact_df_final)
num_politifact_rows
# get random row numbers from our data frame (80% of all possible row numbers
# will be stored in `random_row_numbers` since we have an 80/20 split)
< sample(1:num_politifact_rows, 0.8 * num_politifact_rows)
random_row_numbers
# use our randomly generated row numbers to select 80% of our data for training
< politifact_df_final[random_row_numbers, ] %>%
politifact_df_train # sort by article id
arrange(article_id)
# and use the remaining 20% of our data for testing
< politifact_df_final[random_row_numbers, ] %>%
politifact_df_test arrange(article_id)
# keep track of our target variables in a list
< politifact_df_train$targets_numerical
train_targets_numerical < politifact_df_train$targets_categorical
train_targets_categorical
< politifact_df_test$targets_numerical
test_targets_numerical < politifact_df_test$targets_categorical test_targets_categorical
We next want to make sure our training/testing split has a roughly equal distribution of each target value.
prop.table(table(train_targets_numerical))
## train_targets_numerical
## 0 1
## 0.4443717 0.5556283
prop.table(table(test_targets_numerical))
## test_targets_numerical
## 0 1
## 0.4830287 0.5169713
Across our two target variables—where a “false” piece of news corresponds to \(0\) and a “true” piece of news corresponds to \(1\)—we observe a \(44.4%\), \(55.6%\) to \(48.3%\), \(51.7%\) training/testing split. Since the proportion of target values is reasonably even, we can proceed with this split.
7.1.4 Creating a Document Term Matrix
To create a Document Term Matrix (DTM), the article_text
needs to be tokenized into
individual words so that an overall vocabulary can be created. Here ngrams are used (from \(n=1\) to \(n=3\)). This means that
all possible \(1\), \(2\), and
\(3\) word combinations are included in the vocabulary. Note that
we must create our vocabulary using only the training data, since our test data should not be
looked at until model evaluation.
Once a vocabulary is created, terms that only appeared once throughout our training dataset are removed. This reduces the size of the vocabulary terms from \(28678\) terms to \(806\), which is a far more manageable number of terms to have in a DTM.
# tokenize the text into individual words
< word_tokenizer(politifact_df_train$article_text) %>%
tokenizer_train itoken(ids = politifact_df_train$article_id, progressbar = FALSE)
# this test tokenizer will be used later
< word_tokenizer(politifact_df_test$article_text) %>%
tokenizer_test itoken(ids = politifact_df_test$article_id, progressbar = FALSE)
# create our vocabulary with the training data
< create_vocabulary(tokenizer_train, ngram = c(1L, 3L))
vocabulary # we observe 28678 unique terms, this is too large
nrow(vocabulary)
## [1] 28678
# prune our vocabulary to only include terms used at least 2 times
< prune_vocabulary(vocabulary, term_count_min = 5)
pruned_vocabulary # we now observe 806 unique terms, which is more manageable
pruned_vocabulary
## Number of docs: 1528
## 0 stopwords: ...
## ngram_min = 1; ngram_max = 3
## Vocabulary:
## term term_count doc_count
## 1: account 5 5
## 2: affair 5 5
## 3: affect 5 4
## 4: africanamerican 5 5
## 5: agree 5 5
## 
## 802: percent 130 102
## 803: obama 131 130
## 804: year 149 130
## 805: state 171 159
## 806: say 417 372
Now that we have a reasonably sized vocabulary, we can create our training and testing DTMs.
# create our training and testing DTMs using our tokenizers
< vocab_vectorizer(pruned_vocabulary)
vocabulary_vectorizer
< create_dtm(tokenizer_train, vocabulary_vectorizer)
dtm_train < create_dtm(tokenizer_test, vocabulary_vectorizer) dtm_test
We then change our DTMs to hold term frequency–inverse document frequency (tfidf) values, which will normalize the DTMs and increase the weight of terms which are specific to a single document (or handful of documents) and decrease the weight for terms used in many documents.^{27}
< TfIdf$new()
tfidf
# fit model to train data and transform train data with fitted model
< fit_transform(dtm_train, tfidf)
dtm_train < fit_transform(dtm_test, tfidf)
dtm_test
# convert our DTMs to a matrix and data framedifferent formats are needed
# for different models (`keras` only takes matrices for example)
< as.matrix(dtm_train)
dtm_train_matrix < as.matrix(dtm_test)
dtm_test_matrix
< as.data.frame(dtm_train_matrix)
dtm_train_df < as.data.frame(dtm_test_matrix) dtm_test_df
The data wrangling and feature extraction is now complete. Our features exist in the form of Document Term Matrices where each column represents a term, each row represents a PolitiFact article, and each cell holds the term frequency–inverse document frequency value of that term. Our targets exist as \(0\)’s and \(1\)’s, where a “false” piece of news corresponds to 0 and a “true” piece of news corresponds to 1. We observe a small portion of our training and testing DTMs below. Note that each value is \(0\) since our vocabulary contains \(3000\) terms and it is very unlikely that the first \(5\) terms are in any of the \(10\) different rows we are looking at.
1:5, 1:5] dtm_train_df[
## account affair affect africanamerican agree
## 1 0 0 0 0 0
## 3 0 0 0 0 0
## 6 0 0 0 0 0
## 8 0 0 0 0 0
## 9 0 0 0 0 0
1:5, 1:5] dtm_test_df[
## account affair affect africanamerican agree
## 2 0 0 0 0 0
## 4 0 0 0 0 0
## 5 0 0 0 0 0
## 7 0 0 0 0 0
## 12 0 0 0 0 0
7.2 Initial Model Fitting
This section displays the code for each of the initial models fit in an attempt to classify fake news.
7.2.1 Naive Bayes
We can use the e1071
package to run a Naive Bayes model. Note that the
e1071::naiveBayes
function requires categorical targets and accepts features in the
form of a dataframe.
# applying Naive Bayes model to training data
< e1071::naiveBayes(x = dtm_train_df, y = train_targets_categorical,
nb_model laplace = 1)
# predicting applying to test set
< predict(nb_model, dtm_test_df)
nb_predicted_values
< caret::confusionMatrix(nb_predicted_values,
nb_confusion_matrix
test_targets_categorical,dnn = c("Predicted", "Actual"))
# confusion matrix table
$table nb_confusion_matrix
## Actual
## Predicted 0 1
## 0 137 104
## 1 48 94
# 60.3% accuracy
round(nb_confusion_matrix$overall[1], 3)
## Accuracy
## 0.603
# [55.2%, 65.2%] 95% CI for accuracy
round(nb_confusion_matrix$overall[3:4], 3)
## AccuracyLower AccuracyUpper
## 0.552 0.652
# for full information, uncomment and run the following line of code:
# nb_confusion_matrix
Our Naive Bayes model performs rather poorly with an accuracy of only \(60.3\%\) (and a 95% confidence interval of \([55.2\%, 65.2\%]\)).
7.2.2 Basic Logistic Regression
Since the glm
function requires a formula, we must create a new dataframe combining
our features and targets. Note that the glm
function requires categorical targets.
# logistic regression data frames (training and testing)
< cbind(dtm_train_df, train_targets_categorical)
lr_df_train < cbind(dtm_test_df, test_targets_categorical) lr_df_test
We now fit the logisitic regression model and calculate its accuracy.
# fit the basic logistic regression model
< glm(train_targets_categorical ~ ., data = lr_df_train,
blr_model family = binomial(link = "logit"))
# calculated the predicted value probabilities (which gives us
# the probability that an observation is classified as 1)
< predict(blr_model, newdata = lr_df_test,
blr_predicted_value_probs type = "response")
# turn the probabilites into 1's or 0'sif the probability of being a 1
# is greater than 50%, turn it into a 1 (turn it into a 0 otherwise)
< as.factor(ifelse(blr_predicted_value_probs > 0.5, 1, 0))
blr_predicted_values
< caret::confusionMatrix(blr_predicted_values,
blr_confusion_matrix
test_targets_categorical,dnn = c("Predicted", "Actual"))
# confusion matrix table
$table blr_confusion_matrix
## Actual
## Predicted 0 1
## 0 102 63
## 1 83 135
# 61.9% accuracy
round(blr_confusion_matrix$overall[1], 3)
## Accuracy
## 0.619
# [56.8%, 66.8%] 95% CI for accuracy
round(blr_confusion_matrix$overall[3:4], 3)
## AccuracyLower AccuracyUpper
## 0.568 0.668
Our basic logistic regression model performs rather poorly with an accuracy of only \(61.9\%\) (and a 95% confidence interval of \([56.8\%, 66.8\%]\)).
7.2.3 Logistic Regresion with L1 penalty (Lasso Regression)
This model is an improved form of logistic regression using an L1 regularization penalty and
5fold crossvalidation using the glmnet
package. Logistic regression using an L1
regularization is also known as “Lasso regression.” Lasso regression shrinks the less important
features’ coefficients to zero, thus removing some features altogether.^{28} This works well for feature selection when
there are a large number of features, as in our case (we have \(806\) features).
Nfold crossvalidation (here \(n=5\)) is used to “flag problems like overfitting or selection bias and to give an insight on how the model will generalize to an independent dataset.”^{29}
The cv.glmnet
function from the glmnet
function is used here to perform
the Lasso regression and 5fold crossvalidation discussed above. Note that our features must be
in the form of a matrix here (not a dataframe) and our targets must be numerical.
< 5
num_folds < 1e4
threshold < 1e5
max_num_it
# manually compute our folds so below model doesn't vary each time it's run
set.seed(2)
< sample(rep(seq(num_folds), length.out = nrow(dtm_train_matrix)))
fold_id
# use cross validation to determine the optimal lambda value for L1 penalty
< cv.glmnet(x = dtm_train_matrix, y = train_targets_numerical,
lasso_model # this gives us logistic regression
family = "binomial",
# L1 penalty
alpha = 1,
# interested in the area under the ROC curve
type.measure = "auc",
# 5fold crossvalidation
nfolds = num_folds,
# manually create our folds for reproducibility
foldid = fold_id,
# a higher threshold is less accurate,
# but has faster training
thresh = threshold,
# a lower number of iterations is less accurate,
# but has faster training
maxit = max_num_it)
< predict(lasso_model, dtm_test_matrix,
lasso_predicted_value_probs type = "response")
< as.factor(ifelse(lasso_predicted_value_probs > 0.5, 1, 0))
lasso_predicted_values
< caret::confusionMatrix(lasso_predicted_values,
lasso_confusion_matrix
test_targets_categorical,dnn = c("Predicted", "Actual"))
# confusion matrix table
$table lasso_confusion_matrix
## Actual
## Predicted 0 1
## 0 91 22
## 1 94 176
# 69.7% accuracy
round(lasso_confusion_matrix$overall[1], 3)
## Accuracy
## 0.697
# [64.8%, 74.3%] 95% CI for accuracy
round(lasso_confusion_matrix$overall[3:4], 3)
## AccuracyLower AccuracyUpper
## 0.648 0.743
Our logistic regression model using an L1 regularization penalty performs significantly better than our basic logistic regression model. This logistic regression model has an accuracy of \(69.7\%\) (and a 95% confidence interval of \([64.8\%, 74.3\%]\)).
7.2.4 Support Vector Machine
We can use the e1071
package to run a Support Vector Machine (SVM). Note that the
randomForest
function requires categorical targets and accepts features in the form
of a dataframe.
< e1071::svm(x = dtm_train_df, y = train_targets_categorical,
svm_model type = "Cclassification", kernel = "sigmoid")
< predict(svm_model, dtm_test_df)
svm_predicted_values
< caret::confusionMatrix(svm_predicted_values,
svm_confusion_matrix
test_targets_categorical,dnn = c("Predicted", "Actual"))
# confusion matrix table
$table svm_confusion_matrix
## Actual
## Predicted 0 1
## 0 113 46
## 1 72 152
# 69.2% accuracy
round(svm_confusion_matrix$overall[1], 3)
## Accuracy
## 0.692
# [64.3%, 73.8%] 95% CI for accuracy
round(svm_confusion_matrix$overall[3:4], 3)
## AccuracyLower AccuracyUpper
## 0.643 0.738
Our SVM performs reasonably well with an accuracy of \(69.2\%\) (and a 95% confidence interval of \([64.3\%, 73.8\%]\)).
7.2.5 Random Forest
We can use the randomForest
package to run a Naive Bayes model. Note that the
randomForest
function requires categorical targets and accepts features in the form
of a dataframe.
set.seed(2)
< randomForest(x = dtm_train_df, y = train_targets_categorical)
rf_model < predict(rf_model, dtm_test_df)
rf_predicted_values
< caret::confusionMatrix(rf_predicted_values,
rf_confusion_matrix
test_targets_categorical,dnn = c("Predicted", "Actual"))
# confusion matrix table
$table rf_confusion_matrix
## Actual
## Predicted 0 1
## 0 124 51
## 1 61 147
# 70.8% accuracy
round(rf_confusion_matrix$overall[1], 3)
## Accuracy
## 0.708
# [65.9%, 75.3%] 95% CI for accuracy
round(rf_confusion_matrix$overall[3:4], 3)
## AccuracyLower AccuracyUpper
## 0.659 0.753
Our random forest model performs reasonably well with an accuracy of \(70.8\%\) (and a 95% confidence interval of \([65.9\%, 75.3\%]\)).
7.3 Deep Learning Model Fitting
This section displays the code for the deep learning models fit in an attempt to classify fake news.
7.3.1 Multilayer Perceptron Neural Network
The keras
package is used to fit a Multilayer Perceptron (MLP). Here, a MLP with one
input layer, one hidden layer, and one output layer is created. The keras
requires
features to be stored in matrices and targets to be numerical.
Note that the keras
package randomizes initial weights each time a model is run,
meaning that results will vary slightly each time the following two code chunks are run
(specifically, Figure 7.1 will change and the MLP
model’s accuracy will change). The provided function to ensure reproducibility in
keras
(which is named use_session_with_seed()
) does not currently work
within a knitted R Markdown file. Additionally, using the normal set.seed()
function
does not alleviate the issue. For a more detailed breakdown, I’ve created a file outlining the
problem that can be found in the directory scratchwork/kerasreprex.R
.
set.seed(2)
< ncol(dtm_train_matrix)
num_features
# the input layer is implicitly added with input_shape
< keras_model_sequential() %>%
mlp_model layer_dense(units = 8, activation = "relu", input_shape = c(num_features)) %>%
layer_dense(units = 1, activation = "sigmoid")
%>% compile(
mlp_model optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
# use validation data to determine best number of epochs to run our MLP for
< mlp_model %>% keras::fit(
mlp_history
dtm_train_matrix, train_targets_numerical,epochs = 15,
batch_size = 1,
validation_split = 0.2
)
Figure 7.1 shows us how our validation data performs
as our MLP performs multiple feedforward/backpropagation epochs. Since we want to minimize our
validation data’s loss and maximize our validation data’s accuracy, we observe that out \(3\) epochs seems to be reasonable number of times to run our MLP
(note that this may change due to the issue mentioned above about random reproducibility in
keras
).
set.seed(2)
< keras_model_sequential() %>%
mlp_model layer_dense(units = 8, activation = "relu", input_shape = c(num_features)) %>%
layer_dense(units = 1, activation = "sigmoid")
%>% compile(
mlp_model optimizer = "rmsprop",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
%>% keras::fit(x = dtm_train_matrix, y = train_targets_numerical,
mlp_model epochs = 3, batch_size = 1)
< mlp_model %>% evaluate(x = dtm_test_matrix, test_targets_numerical)
mlp_results mlp_results
## loss accuracy
## 0.5965526 0.6814622
< mlp_model %>% predict_classes(dtm_test_matrix, batch_size = 1)
mlp_predicted_values
# confusion matrix
< caret::confusionMatrix(as.factor(mlp_predicted_values),
mlp_confusion_matrix
test_targets_categorical,dnn = c("Predicted", "Actual"))
# confusion matrix table
$table mlp_confusion_matrix
## Actual
## Predicted 0 1
## 0 116 53
## 1 69 145
# around 70% accuracy
round(mlp_confusion_matrix$overall[1], 3)
## Accuracy
## 0.681
# 95% CI
round(mlp_confusion_matrix$overall[3:4], 3)
## AccuracyLower AccuracyUpper
## 0.632 0.728
Our MLP performs reasonably well with an accuracy of around \(70\%\) (and a 95% confidence interval around \([65\%, 75\%]\)).
7.3.2 Recurrent Neural Network
The keras
package is used to fit a Recurrent Neural Network (RNN). Here, a RNN with
one input layer, one long shortterm memory (LSTM)^{30} layer, and one output layer is created. The keras
requires features to be stored in matrices and targets to be numerical. Note that a RNN requires a
different type of DTM representation (as mentioned previously) where each word must be given a
unique numerical ID so that a sense of sequence can be maintained.
# the number of words to include in our vocab (keep the same as our other vocab)
< 806
vocab_size
# note that this tokenizer is based on the training data
< text_tokenizer(num_words = vocab_size) %>%
rnn_tokenizer fit_text_tokenizer(politifact_df_train$article_text)
< texts_to_sequences(rnn_tokenizer, politifact_df_train$article_text)
train_sequences < texts_to_sequences(rnn_tokenizer, politifact_df_test$article_text)
test_sequences
# pad sequences (add 0's to the end) for documents that are short
< pad_sequences(train_sequences, padding = "post")
rnn_x_train < pad_sequences(test_sequences, padding = "post")
rnn_x_test
# take a look at what a RNN DTM looks like
1:5, 1:5] rnn_x_train[
## [,1] [,2] [,3] [,4] [,5]
## [1,] 57 48 146 165 3
## [2,] 53 135 0 0 0
## [3,] 219 136 9 368 414
## [4,] 1 11 15 415 0
## [5,] 1 485 137 573 98
Now that we have our features in the right form (they have a notion of sequence), we can fit our RNN.
set.seed(2)
< keras_model_sequential() %>%
rnn_model layer_embedding(input_dim = vocab_size, output_dim = 64) %>%
layer_lstm(units = 128, dropout = 0.2, recurrent_dropout = 0.2) %>%
layer_dense(units = 1, activation = "sigmoid")
%>% compile(
rnn_model optimizer = "adam",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
< 64
batch_size < 15
epochs < 0.1
validation_split
< rnn_model %>% keras::fit(
rnn_history
rnn_x_train, train_targets_numerical,batch_size = batch_size,
epochs = epochs,
validation_split = validation_split
)
Figure 7.2 shows us how our validation data performs
as our RNN performs multiple feedforward/backpropagation epochs Since we want to minimize our
validation data’s loss and maximize our validation data’s accuracy, we observe that out \(3\) epochs seems to be reasonable number of times to run our MLP
(note that this may change due to the issue mentioned above about random reproducibility in
keras
).
# recompile model
< keras_model_sequential() %>%
rnn_model layer_embedding(input_dim = vocab_size, output_dim = 64) %>%
layer_lstm(units = 128, dropout = 0.2, recurrent_dropout = 0.2) %>%
layer_dense(units = 1, activation = "sigmoid")
%>% compile(
rnn_model optimizer = "adam",
loss = "binary_crossentropy",
metrics = c("accuracy")
)
%>% keras::fit(rnn_x_train, train_targets_numerical, epochs = 3, batch_size = batch_size)
rnn_model < rnn_model %>% evaluate(rnn_x_test, test_targets_numerical)
rnn_results < rnn_model %>% predict_classes(rnn_x_test, batch_size = batch_size)
rnn_predicted_values
# Confusion matrix
< caret::confusionMatrix(as.factor(rnn_predicted_values),
rnn_confusion_matrix
test_targets_categorical,dnn = c("Predicted", "Actual"))
# confusion matrix table
$table rnn_confusion_matrix
## Actual
## Predicted 0 1
## 0 125 58
## 1 60 140
# around 68% accuracy
round(rnn_confusion_matrix$overall[1], 3)
## Accuracy
## 0.692
# 95% CI around (0.64, 0.73)
round(rnn_confusion_matrix$overall[3:4], 3)
## AccuracyLower AccuracyUpper
## 0.643 0.738
Our RNN performs reasonably well with an accuracy of around \(68\%\) (and a 95% confidence interval around \([63\%, 73\%]\)).

“LSTMs were developed to deal with the vanishing gradient problem that can be encountered when training traditional RNNs.” “Long ShortTerm Memory” (2020)↩︎