A comparison of models is presented using the IMDB dataset, a collection of movie reviews labelled by sentiment.
More details regarding the dataset can be found here.
This report includes only the code for the neural networks; the code for the other statistical models is omitted, as it is not the focus here. The comparison of all the results can be found in the conclusion.
Run the code below if you wish to set the seed and obtain reproducible results.
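The seed-setting chunk is not reported here; a minimal sketch of what it could look like follows. The seed value 42 and the use of tensorflow::set_random_seed() are assumptions, the latter being available in recent versions of the tensorflow R package.

# Assumed sketch: fix the R and TensorFlow random seeds for reproducibility.
set.seed(42)
tensorflow::set_random_seed(42)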
Next, we define two variables needed for sentiment analysis, and for text mining in general (example assignments are sketched after this list):

- max_features: when analysing text we usually refer to a vocabulary, that is, a collection of words sorted by the frequency with which they occur in a corpus of documents; among these words, only the n most frequent are kept, where n is the value of the max_features variable; all remaining words are replaced with a generic out-of-vocabulary value (the integer 2).
- maxlen: the maximum length of a sentence; sentences longer than this value are truncated.

Generally the first preparation step is tokenization, that is, mapping each word to an integer value. This step is not necessary here, because the sequences are already tokenized.
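The chunk that assigns these two variables is not reported above. As a sketch: the parameter counts in the model summaries below are consistent with max_features = 5000, while the value used for maxlen is an assumption.

max_features <- 5000  # vocabulary size; consistent with the parameter counts reported below
maxlen <- 500         # assumed maximum sentence length; the original value is not shown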
Next, we define a function that will be useful later in the analysis: vectorize_sequences. This function transforms the entire dataset, performing the dichotomization (one-hot encoding) of the words in the sentences.
vectorize_sequences <- function(sequences, dimension = max_features) {
  # Applies one-hot encoding to the sequences.
  # Input:
  #   - sequences: list of sequences; each sequence has length equal to the length
  #     of the sentence, and each integer value of the sequence corresponds to a word.
  #   - dimension: maximum number of words to include; words are ranked by how often
  #     they occur (in the training set) and only the most frequent ones are kept.
  # Output:
  #   - results: matrix of dim = (n, dimension); each row represents a sentence and
  #     each column a word; an entry is 1 if the word occurs in that sentence, 0 otherwise.
  results <- matrix(0, nrow = length(sequences), ncol = dimension)
  for (i in seq_along(sequences)) results[i, sequences[[i]]] <- 1
  return(results)
}
y_train <- as.numeric(train_labels)
y_test <- as.numeric(test_labels)
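As an illustration (toy sequences, not part of the original analysis), the function turns each tokenized sentence into a row of 0/1 values:

toy <- list(c(1, 3), c(2, 3, 5))           # two hypothetical tokenized sentences
vectorize_sequences(toy, dimension = 5)    # 2 x 5 matrix
# Row 1 has 1s in columns 1 and 3; row 2 has 1s in columns 2, 3 and 5.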
Specify the neural network’s architecture:
model <- keras_model_sequential() %>%
layer_dense(units = 32, activation = "relu", input_shape = c(max_features)) %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
model
## Model
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## dense (Dense) (None, 32) 160032
## ________________________________________________________________________________
## dense_1 (Dense) (None, 16) 528
## ________________________________________________________________________________
## dense_2 (Dense) (None, 1) 17
## ================================================================================
## Total params: 160,577
## Trainable params: 160,577
## Non-trainable params: 0
## ________________________________________________________________________________
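The parameter counts follow directly from the layer sizes: the first dense layer has \(5000 \times 32 + 32 = 160032\) weights and biases (consistent with max_features = 5000), the second \(32 \times 16 + 16 = 528\), and the output layer \(16 \times 1 + 1 = 17\), for a total of 160,577 trainable parameters.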
Compile the model. The optimizer is rmsprop, with a learning rate of \(0.001\); the loss is binary cross-entropy and the metric is accuracy.
model %>% compile(
optimizer = optimizer_rmsprop(lr = 0.001),
loss = "binary_crossentropy",
metrics = c("accuracy")
)
Train the neural network. As input to the network, we feed the dichotomized sequences, obtained by applying the vectorize_sequences function to the x_train dataset.
history <- model %>% fit(
x = vectorize_sequences(x_train),
y = y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.2,
verbose = 1
)
The plot below shows the loss and the accuracy as functions of the number of epochs.
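A minimal way to reproduce it, using the plot method that keras provides for training histories:

plot(history)  # loss and accuracy curves, for both training and validation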
Evaluate the model on the test set:
results <- model %>% evaluate(
x = vectorize_sequences(x_test),
y = y_test,
verbose = 0
)
print(paste("Loss on test data:", results["loss"]))
## [1] "Loss on test data: 0.592347383499146"
## [1] "Accuracy on test data: 0.860759973526001"
Specify the architecture of the second model, a recurrent neural network:
model <- keras_model_sequential() %>%
layer_embedding(input_dim = max_features, output_dim = 128) %>%
layer_lstm(units = 32) %>%
layer_dense(units = 1, activation = "sigmoid")
model
## Model
## Model: "sequential_1"
## ________________________________________________________________________________
## Layer (type) Output Shape Param #
## ================================================================================
## embedding (Embedding) (None, None, 128) 640000
## ________________________________________________________________________________
## lstm (LSTM) (None, 32) 20608
## ________________________________________________________________________________
## dense_3 (Dense) (None, 1) 33
## ================================================================================
## Total params: 660,641
## Trainable params: 660,641
## Non-trainable params: 0
## ________________________________________________________________________________
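Again the parameter counts can be checked by hand: the embedding layer has \(5000 \times 128 = 640000\) weights, the LSTM layer has \(4 \times (128 \times 32 + 32 \times 32 + 32) = 20608\) parameters (four gates, each with input weights, recurrent weights and biases), and the output layer has \(32 + 1 = 33\).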
Compile the model. The optimizer is rmsprop, with a learning rate of \(0.001\); the loss is binary cross-entropy and the metric is accuracy.
model %>% compile(
optimizer = optimizer_rmsprop(lr = 0.001),
loss = "binary_crossentropy",
metrics = c("accuracy")
)
Now it’s time to train the neural network. Before giving the sequences as input, we apply the pad_sequences function: it truncates sequences longer than maxlen and pads the shorter ones with zeros (by default at the beginning of the sequence) until their length reaches maxlen. This step is necessary because the neural network requires all input sequences to have the same length. A toy example is sketched below.
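As an illustration (hypothetical sequences, not from the dataset), with maxlen = 5 and the default "pre" padding and truncation:

toy <- list(c(3, 7, 11), c(5, 9, 2, 8, 4, 6, 10, 12))
pad_sequences(toy, maxlen = 5)
# Row 1: 0 0 3 7 11    (zeros added at the front of the shorter sequence)
# Row 2: 8 4 6 10 12   (only the last 5 values of the longer sequence are kept)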
history <- model %>% fit(
x = pad_sequences(x_train, maxlen = maxlen),
y = y_train,
epochs = 10,
batch_size = 32,
validation_split = 0.2,
verbose = 1
)
The plot below shows the loss and the accuracy as functions of the number of epochs.
Evaluate the model on the test set:
results <- model %>% evaluate(
x = pad_sequences(x_test, maxlen = maxlen),
y = y_test,
verbose = 0
)
print(paste("Loss on test data:", results["loss"]))
## [1] "Loss on test data: 0.309290438890457"
## [1] "Accuracy on test data: 0.883960008621216"
The table below summarizes the test-set accuracy of every model compared in the report.

| Model | Accuracy |
|---|---|
| Recurrent neural network | 88.4% |
| Lasso regression | 87.0% |
| Logistic regression | 86.2% |
| Deep neural network | 86.1% |
| Random forest | 84.7% |
| Bagging | 77.0% |
| Adaboost | 72.5% |
| Gradient boosting | 70.1% |