This is the class project for the Natural Language Processing course at New York University, Spring 2022. All rights reserved.
A movie can belong to multiple genres. We built a system that predicts a movie's genres from its plot overview.
The paper can be accessed at: (click here)
The system is built on an existing dataset, The Movies Dataset, which is publicly available on Kaggle and includes metadata for over 45,000 movies. We have already downloaded it for our experiments; see dataset/movies_metadata.csv.
We built four models: our final system, RoBERTa; two comparison models, GloVe+CNN and GloVe+BiLSTM; and our baseline, TF-IDF+LR. They can be found in the multiclassification/models directory.
Data_Preprocessing.ipynb
This notebook preprocesses and cleans the data. Rows with non-English or invalid (NaN) overviews are dropped. Raw genre labels, originally in JSON format, are converted into Python lists of strings and stored in a Pandas DataFrame. The dataset is then split into 70% training, 15% validation, and 15% test data. The splits are stored as three CSV files for our experiments; see the multiclassification/data directory.
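The genre parsing and the 70/15/15 split described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the notebook's actual code: the function names `parse_genres` and `split_70_15_15` are hypothetical, the raw genre cell is assumed to be a string holding a list of dictionaries with a `name` key, and the shuffle with a fixed seed is an assumption about how the split might be made reproducible.

```python
import ast
import random


def parse_genres(raw):
    """Turn a raw genre cell into a list of genre names (hypothetical helper).

    Assumes the cell looks like "[{'id': 16, 'name': 'Animation'}, ...]".
    Anything that cannot be parsed yields an empty list.
    """
    try:
        entries = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []
    if not isinstance(entries, list):
        return []
    return [e["name"] for e in entries if isinstance(e, dict) and "name" in e]


def split_70_15_15(rows, seed=0):
    """Shuffle rows and split them into 70% train, 15% validation, 15% test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    n_train = len(rows) * 70 // 100
    n_val = len(rows) * 15 // 100
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


genres = parse_genres("[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]")
train, val, test = split_70_15_15(range(100))
```

Integer arithmetic is used for the split sizes so that rounding never leaves a row unassigned; whatever does not fit into the training and validation sets ends up in the test set.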
MovieClassification_Baseline.ipynb
This is the implementation of our baseline, TF-IDF+Logistic Regression. Make sure the environment is set up, keep the preprocessed dataset in the same directory as the notebook, and run the cells to check the results.
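For readers unfamiliar with this kind of baseline, a multi-label TF-IDF + logistic regression classifier can be sketched with scikit-learn as below. This is an illustration under assumptions, not the notebook's code: the one-vs-rest setup, the default vectorizer settings, and the toy overviews and genre lists are all made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data (invented for illustration): plot overviews and their genre lists.
overviews = [
    "A soldier fights his way out of an ambush.",
    "Two strangers fall in love during a chaotic wedding.",
    "A clumsy spy stumbles through explosions and car chases.",
    "A widow reconnects with her first love.",
]
genres = [["Action"], ["Comedy", "Romance"], ["Action", "Comedy"], ["Romance"]]

# Encode each genre list as a multi-hot vector, one column per genre.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(genres)

# One binary logistic regression per genre on top of TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(overviews, y)

# Predict a multi-hot vector for a new overview and map it back to genre names.
pred = model.predict(["A spy falls in love in the middle of a car chase."])
predicted_genres = mlb.inverse_transform(pred)
```

Training one classifier per genre lets a single overview receive several genres at once, which a plain multi-class classifier would not allow.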
MovieClassification_BiLSTM.ipynb
This is the implementation of our comparison models, GloVe+BiLSTM and GloVe+CNN. Make sure the environment is set up, keep the preprocessed dataset in the same directory as the notebook, and run the cells to check the results.
MovieClassification_RoBERTa.ipynb
This is the implementation of our final system, RoBERTa. Make sure the environment is set up, keep the preprocessed dataset in the same directory as the notebook, and run the cells to check the results.
Note: you can use our preprocessed data in the multiclassification/data directory. It is the output of running our Data_Preprocessing.ipynb notebook in the same directory. All training logs are stored in the models directory.
People who contribute to this system: Andy Wang ([email protected]), Stephen Zhang ([email protected]), Koji Liu ([email protected]), Matthew Li ([email protected])