Rotten Tomatoes Review Analysis using Naive Bayes Classifier

Abhishek Wadhwani
9 min read · Apr 16, 2023
Image from: https://www.studiobinder.com/blog/rotten-tomatoes-ratings-system/

Introduction

Rotten Tomatoes is a website and online platform that aggregates reviews from professional film critics, as well as reviews from the general public, to provide a comprehensive overview of a movie’s critical and audience reception. Each movie is given a “Tomatometer” score on the website, which represents the proportion of favorable reviews from professional reviewers. The website also includes related information such as movie news and trailers. It has grown to be a well-known and significant resource for movie reviews and ratings.

The naive Bayes classifier is a popular probabilistic approach to classification tasks. Based on Bayes' theorem, it is noted for its simplicity, scalability, and strong accuracy. Because it handles large datasets well and produces accurate predictions, it is widely used in fields such as text classification, sentiment analysis, and spam filtering.

Bayes’ Theorem

Bayes' theorem, also known as Bayes' rule, is stated in terms of conditional probabilities.

P(A|B) is the conditional probability of event A occurring given that event B occurs.

P(A|B) and P(B|A) are not the same; Bayes' theorem provides the link between them. [1]

The formula for Bayes’ theorem is as follows:

P(A|B) = (P(B|A) * P(A)) / P(B)

Figure 1: Bayes' Theorem
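As a quick worked example with made-up numbers (not taken from the dataset): suppose 60% of reviews are fresh, the word "brilliant" appears in 20% of fresh reviews, and it appears in only 5% of rotten reviews. A minimal Python sketch of the calculation:

```python
# Worked Bayes example with made-up numbers (not from the dataset).
p_fresh = 0.6               # P(A): prior probability that a review is fresh
p_rotten = 1 - p_fresh      # prior probability that a review is rotten
p_word_given_fresh = 0.20   # P(B|A): "brilliant" appears in a fresh review
p_word_given_rotten = 0.05  # "brilliant" appears in a rotten review

# P(B): total probability that a review contains "brilliant"
p_word = p_word_given_fresh * p_fresh + p_word_given_rotten * p_rotten

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_fresh_given_word = p_word_given_fresh * p_fresh / p_word
print(p_fresh_given_word)   # ~0.857: such a review is probably fresh
```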

The Naive Bayes classifier uses Bayes' theorem to estimate the probability of each class and picks the class with the highest probability as its prediction. It has proven accurate in a variety of domains, such as text classification and spam filtering.

Naive Bayes Classifier

The Naive Bayes classifier is a probabilistic algorithm used for classification tasks. It applies Bayes' theorem and assumes that the features used for classification are independent of each other. Given a set of features, it computes the probability of each class and selects the class with the highest probability. Due to its scalability and high accuracy, the Naive Bayes classifier is widely used in fields such as spam filtering, sentiment analysis, and text classification [2].
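Under the independence assumption, for a document containing words w1, …, wn the score of a class c factorizes into a simple product:

P(c|w1, …, wn) ∝ P(c) × P(w1|c) × P(w2|c) × … × P(wn|c)

The classifier evaluates this product for every class and predicts the class with the highest value.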

Classification using Naive Bayes Classifier

Classification using Naive Bayes classifier includes the following steps:

· Data preparation: First, the dataset is preprocessed to remove unnecessary features and to ensure that all features are on the same scale.

· Training: The classifier is then trained on a labeled dataset. The prior probability and the conditional probabilities of each class are estimated during training.

· Prediction: Once trained, the classifier can classify new input data. It computes the probability of each class and chooses the class with the highest probability as the prediction.

· Evaluation: Several metrics can be used to evaluate the model's performance, including accuracy, precision, recall, and F1 score. These help assess the classifier and point out areas for improvement. In this project we use the accuracy metric [3].

Figure 2: Use of the Naive Bayes theorem

Applications of Naive Bayes Classifier

Figure 3: Applications of the Naive Bayes algorithm

The Naive Bayes classifier is used for classification tasks such as:

Text classification: Tasks like spam filtering, sentiment analysis, and topic classification frequently use the Naive Bayes classifier.

Medical diagnosis: Naive Bayes classifiers can be used for tasks like disease prediction and diagnosis in the field of medicine.

Image classification: The Naive Bayes classifier can be used for applications like face and object detection in image classification.

Fraud detection: The Naive Bayes classifier can be used to detect insurance and credit card fraud, among other types of fraud.

In this project we work on text classification, specifically sentiment analysis.

Technical Challenges

Despite being a popular and efficient classification technique, the Naive Bayes classifier has a number of technical issues that can compromise its accuracy.

One difficulty is the independence assumption: the features used for classification are assumed to be independent, which often does not hold in real-world data. Another is overfitting, which occurs when there are too few training examples and makes it hard to generalize to new input data. Continuous data can also be difficult to handle; techniques such as binning or discretization may lose information and hurt classification accuracy [4]. Finally, class imbalance and data sparsity can bias the classification outcomes and reduce accuracy on minority classes.

My contribution

To address the overfitting issue, my code implements a dataset-splitting strategy that divides the data into training, development, and testing sets. This allows a more honest evaluation of the model and prevents it from simply memorizing the training data.

To address the naive Bayes classifier's data sparsity issue, I used Laplace smoothing. This helps avoid a problem that frequently arises in text classification tasks: the model assigning a zero probability to a class because a particular word never appears with that class in the training data.
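Concretely, with Laplace (add-alpha) smoothing the conditional probability of a word w in class c becomes:

P(w|c) = (count(w, c) + α) / (total words in c + α × |V|)

where |V| is the size of the vocabulary; setting α = 1 gives classic add-one smoothing, and no word ends up with a probability of exactly zero.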

The prior probabilities are computed by counting the number of reviews for each class (fresh or rotten) in the dataset. Each count is divided by the total number of reviews to normalize it. The resulting probabilities are returned as a dictionary with the class names as keys and the corresponding probabilities as values.

Dataset Description

The dataset used here is Rotten Tomatoes' movie reviews dataset. It is provided as a CSV file, with each row representing a review and its associated sentiment label. The collection includes reviews of films across many genres and languages, from a wide range of reviewers and publications. Each review comes with metadata about the film's title, year of release, genre, director, cast, and other details. This comprehensive metadata lets researchers investigate how factors such as genre, director, or cast affect the tone of movie reviews.

The dataset is loaded and then split into three parts for training, development, and testing. The model is trained on the training set, fine-tuned on the development set, and finally evaluated on the testing set.

Implementation

The code consists of several functions that implement the above algorithm:

Importing dataset

This section describes a Python function called LoadDataset that loads data from a CSV file and returns it as a list. The function takes a filename as input and uses the built-in csv module to read the file's contents. Each row of the file is appended to the dataset list, which the function then returns. The function can load any CSV file and can be extended to preprocess or transform the data as needed.
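A minimal sketch of what LoadDataset might look like; the exact code lives in the linked repository, but the csv usage shown here is standard:

```python
import csv

def LoadDataset(filename):
    """Read a CSV file and return its rows as a list."""
    dataset = []
    with open(filename, newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        for row in reader:          # e.g. [review_text, sentiment_label]
            dataset.append(row)
    return dataset
```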

Preprocessing

Splitting Dataset

This defines a function that divides the dataset into three parts: train, dev, and test. It takes the dataset and a split ratio as input, where the split ratio determines the size of the training set.
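A sketch of the splitting logic, assuming the split ratio gives the training share and the remainder is divided evenly between dev and test (the function name SplitDataset and the shuffling are my assumptions, not necessarily the repository's exact code):

```python
import random

def SplitDataset(dataset, split_ratio, seed=42):
    """Shuffle a dataset and split it into train, dev, and test parts."""
    data = dataset[:]                       # copy; leave caller's list intact
    random.Random(seed).shuffle(data)
    train_end = int(len(data) * split_ratio)
    dev_end = train_end + (len(data) - train_end) // 2
    return data[:train_end], data[train_end:dev_end], data[dev_end:]
```

With a split ratio of 0.8, for example, this yields roughly an 80/10/10 split.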

The function ClassProbabilities computes the prior probability of each class (fresh and rotten). It accepts a dataset as input and returns a dictionary in which each key is a class and each value is that class's probability, derived from the number of documents belonging to it.
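A sketch of ClassProbabilities, assuming each row stores its label ('fresh' or 'rotten') in a fixed column:

```python
def ClassProbabilities(dataset, label_index=1):
    """Prior probability of each class: class count / total reviews."""
    counts = {}
    for row in dataset:
        label = row[label_index]
        counts[label] = counts.get(label, 0) + 1
    total = len(dataset)
    return {label: count / total for label, count in counts.items()}
```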

Two functions calculate the conditional probability of each word for a given class (fresh or rotten) in the dataset. The first, word_conditional_probability_smoothing, uses Laplace smoothing with a specified alpha value to handle zero probabilities. It returns a dictionary mapping each word to its probability in both classes.

The second function, calculate_conditional_probabilities, takes no alpha parameter but adds a constant of 1 to each count (i.e., add-one smoothing) to avoid zero probabilities. It likewise returns a dictionary mapping each word to its conditional probability in both classes.
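A sketch of the smoothed variant, word_conditional_probability_smoothing; the column layout and the per-class word counting are assumptions about the repository's internals:

```python
from collections import Counter, defaultdict

def word_conditional_probability_smoothing(dataset, vocabulary, alpha=1.0,
                                           text_index=0, label_index=1):
    """P(word | class) with Laplace (add-alpha) smoothing."""
    word_counts = defaultdict(Counter)      # class -> word -> count
    total_words = Counter()                 # class -> total word count
    for row in dataset:
        label = row[label_index]
        for word in row[text_index].lower().split():
            if word in vocabulary:
                word_counts[label][word] += 1
                total_words[label] += 1
    probs = {}
    for label in word_counts:
        denom = total_words[label] + alpha * len(vocabulary)
        probs[label] = {w: (word_counts[label][w] + alpha) / denom
                        for w in vocabulary}
    return probs
```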

The predict function takes a document, the vocabulary, the class probabilities, and the conditional probabilities as input and returns a prediction of 'fresh' or 'rotten'.
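A sketch of predict; working in log space is my choice here, a common trick to avoid numerical underflow when multiplying many small probabilities:

```python
import math

def predict(document, vocabulary, class_probabilities,
            conditional_probabilities):
    """Return the class ('fresh' or 'rotten') with the highest posterior."""
    best_class, best_score = None, float('-inf')
    for label, prior in class_probabilities.items():
        score = math.log(prior)
        for word in document.lower().split():
            if word in vocabulary:          # ignore out-of-vocabulary words
                score += math.log(conditional_probabilities[label][word])
        if score > best_score:
            best_class, best_score = label, score
    return best_class
```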

The evaluation function takes the dataset, vocabulary, class_probabilities, and conditional_probabilities as input parameters. It loops through each document in the dataset, using the predict function to make a prediction and comparing the predicted sentiment with the actual one.
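A sketch of that loop, reusing the predict function above (here the evaluation function is called evaluate; the column layout is again an assumption):

```python
def evaluate(dataset, vocabulary, class_probabilities,
             conditional_probabilities, text_index=0, label_index=1):
    """Fraction of documents whose predicted label matches the true label."""
    correct = 0
    for row in dataset:
        prediction = predict(row[text_index], vocabulary,
                             class_probabilities, conditional_probabilities)
        if prediction == row[label_index]:
            correct += 1
    return correct / len(dataset)
```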

The main() function performs sentiment analysis on the rotten-tomatoes-reviews dataset using the Naive Bayes classifier.

First, it counts recurring words in the dataset and builds a vocabulary from them. It then splits the dataset into train, development, and test sets, computes class probabilities and conditional probabilities from word occurrences in the training set, and applies smoothing to the conditional probabilities to avoid zeros.

The function then evaluates classification accuracy, which came out to 77%, and measures the effect of smoothing. Finally, it applies the optimal hyperparameter to the test set and reports the accuracy. It also reports the top 10 words for both fresh and rotten reviews.
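Putting the sketches above together, main() would look roughly like this; the file name, the minimum word count, and the alpha grid are all assumptions for illustration:

```python
from collections import Counter

def build_vocabulary(dataset, min_count=2, text_index=0):
    """Hypothetical helper: keep words that recur in the training data."""
    counts = Counter(word for row in dataset
                     for word in row[text_index].lower().split())
    return {word for word, count in counts.items() if count >= min_count}

def main():
    dataset = LoadDataset('rotten_tomatoes_reviews.csv')
    train, dev, test = SplitDataset(dataset, 0.8)

    vocabulary = build_vocabulary(train)
    priors = ClassProbabilities(train)

    # Tune the smoothing strength alpha on the development set
    best_alpha, best_acc = None, 0.0
    for alpha in (0.1, 0.5, 1.0, 2.0):
        cond = word_conditional_probability_smoothing(train, vocabulary, alpha)
        acc = evaluate(dev, vocabulary, priors, cond)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc

    # Final report on the held-out test set
    cond = word_conditional_probability_smoothing(train, vocabulary, best_alpha)
    print('test accuracy:', evaluate(test, vocabulary, priors, cond))

if __name__ == '__main__':
    main()
```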

Results

The class probabilities are nearly equal, about 0.5 each for 'fresh' and 'rotten.' With the best hyperparameters, the model's accuracy is 77% on the training set and 75% on the test set. The word 'the' occurs frequently in the text data, with a probability of 0.57. Based on its sentiment, the model has a 57.93% likelihood of categorizing a line as 'fresh.' Smoothing increased accuracy by 2%. The top words in both classes are common stop words: 'the,' 'a,' 'of,' 'and,' 'to,' 'is,' and 'in.'

Submission details

Github: https://github.com/Abhismoothie/NBC_Tomato

References

[1] J. Brownlee, “A Gentle Introduction to Bayes Theorem for Machine Learning,” 2019.

[2] N. S. Chauhan, “Naïve Bayes Algorithm: Everything You Need to Know,” 2022.

[3] T. Srivastava, “12 Important Model Evaluation Metrics for Machine Learning Everyone Should Know,” 2019.

[4] S. Ray, "Naive Bayes Classifier Explained: Applications and Practice Problems of Naive Bayes Classifier," Analytics Vidhya, 2017.

[5] A. Ng, “Support Vector Machines: A Visual Explanation with Sample Python Code,” 2017.

[6] J. VanderPlas, “Python Data Science Handbook,” O’Reilly Media, 2016.

[7] D. Jurafsky and J. H. Martin, “Speech and Language Processing,” Pearson Education, 2019.

[8] A. Géron, “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow,” O’Reilly Media, 2019.
