
stephen-ck-zhang/Multi-Label-Classification


Multi-Label-Classification

This is the class project for the Natural Language Processing course at New York University, Spring 2022. All rights reserved.

A movie can belong to several genres at once, so we built a system that predicts a movie's genres from its plot overview.
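Because each movie may carry several genres, the prediction target is a set of labels rather than a single class. The sketch below shows one common way to represent such targets, as multi-hot vectors. It is an illustration only: the function names `build_genre_index` and `to_multi_hot` and the example genres are hypothetical and are not taken from the notebooks.

```python
def build_genre_index(label_lists):
    """Map every genre seen in the data to a fixed column index (alphabetical)."""
    genres = sorted({genre for labels in label_lists for genre in labels})
    return {genre: i for i, genre in enumerate(genres)}


def to_multi_hot(labels, genre_index):
    """Encode one movie's genre list as a 0/1 vector with one slot per genre."""
    vector = [0] * len(genre_index)
    for genre in labels:
        vector[genre_index[genre]] = 1
    return vector


# Hypothetical example data: three movies with one to three genres each.
example_labels = [["Comedy", "Romance"], ["Drama"], ["Comedy", "Drama", "Crime"]]
genre_index = build_genre_index(example_labels)
targets = [to_multi_hot(labels, genre_index) for labels in example_labels]
```

Each row of `targets` has one position per genre, so a model can score every genre independently instead of picking exactly one.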

The paper can be accessed at: (click here)

The system is built on The Movies Dataset, which is publicly available on Kaggle. We have already downloaded it for our experiments; see dataset/movies_metadata.csv, which contains metadata for over 45,000 movies.

We built four models: our final system, RoBERTa; two comparison models, GloVe+CNN and GloVe+BiLSTM; and our baseline, TF-IDF+LR (logistic regression). They can be found in the multiclassification/models directory.

Data_Preprocessing.ipynb This is our data preprocessing and cleaning notebook. Rows with non-English or invalid (NaN) overviews are dropped. Raw genre labels, originally in JSON format, are converted into Python lists of strings and stored in a Pandas DataFrame. The dataset is then split into 70% training data, 15% validation data, and 15% test data. The splits are saved as three CSV files for our experiments; you can find them in the multiclassification/data directory.
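The preprocessing steps described above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the notebook's code: the helper names (`parse_genres`, `clean_rows`, `split_70_15_15`), the row layout (dicts with `overview` and `genres` keys), and the fixed seed are all hypothetical, and language filtering is left out because the README does not say how non-English overviews are detected. `ast.literal_eval` is used so that both JSON-style and Python-literal genre strings can be parsed.

```python
import ast
import random


def parse_genres(raw):
    """Turn a raw cell like "[{'id': 35, 'name': 'Comedy'}]" into ['Comedy']."""
    try:
        entries = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return []  # unparseable cell: treat as "no genres"
    if not isinstance(entries, list):
        return []
    return [e["name"] for e in entries if isinstance(e, dict) and "name" in e]


def clean_rows(rows):
    """Drop rows whose overview is missing (NaN/None) or blank; parse genres."""
    cleaned = []
    for row in rows:
        overview = row.get("overview")
        if isinstance(overview, str) and overview.strip():
            cleaned.append({"overview": overview.strip(),
                            "genres": parse_genres(row.get("genres"))})
    return cleaned


def split_70_15_15(rows, seed=0):
    """Shuffle and split into 70% train, 15% validation, and the rest as test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 70 // 100
    n_val = len(rows) * 15 // 100
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])
```

For example, `split_70_15_15(clean_rows(rows))` on 20 valid rows yields splits of 14, 3, and 3 rows. Integer arithmetic is used for the split sizes to avoid floating-point rounding surprises.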

MovieClassification_Baseline.ipynb This is the implementation of our baseline, TF-IDF+Logistic Regression. Make sure the environment is set up, keep the preprocessed dataset in the same directory, and run the cells to see the results.
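To illustrate the kind of features a TF-IDF baseline works with, here is a small standard-library sketch of TF-IDF weighting. It is not the notebook's implementation, which may well rely on a library vectorizer; the function name `tfidf_vectors`, the whitespace tokenizer, and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` are assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (per-document {term: weight} dicts, {term: idf}) for raw strings."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (an assumed variant, see lead-in).
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1.0
           for term, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # raw term frequency within this document
        vectors.append({term: count * idf[term] for term, count in counts.items()})
    return vectors, idf


# Hypothetical overviews, for illustration only.
overviews = ["a heist goes wrong", "a love story"]
vectors, idf = tfidf_vectors(overviews)
```

Terms that occur in every overview, such as "a" here, receive the lowest weight, while rarer terms such as "heist" are weighted up. A logistic regression classifier would then be trained on these vectors, typically one binary classifier per genre.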

MovieClassification_BiLSTM.ipynb This is the implementation of our comparison models, GloVe+BiLSTM and GloVe+CNN. Make sure the environment is set up, keep the preprocessed dataset in the same directory, and run the cells to see the results.

MovieClassification_RoBERTa.ipynb This is the implementation of our final system, RoBERTa. Make sure the environment is set up, keep the preprocessed dataset in the same directory, and run the cells to see the results.
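A multi-label classifier typically emits one score (logit) per genre, and those scores have to be turned into a set of predicted genres. The sketch below shows one common way to do that: apply a sigmoid to each logit and keep every genre whose probability reaches a threshold. The README does not state how the notebooks decode predictions, so the function names, the 0.5 threshold, and the example logits are all assumptions.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_genres(logits, genre_names, threshold=0.5):
    """Keep every genre whose independent probability reaches the threshold."""
    return [name for logit, name in zip(logits, genre_names)
            if sigmoid(logit) >= threshold]


# Hypothetical logits for a single overview.
predicted = decode_genres([2.0, -1.0, 0.3], ["Comedy", "Drama", "Romance"])
```

Unlike a softmax followed by an argmax, this treats each genre as its own yes/no decision, so a movie can receive zero, one, or several genres.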

Note: you can use our preprocessed data in the multiclassification/data directory. It is the output of running our Data_Preprocessing.ipynb notebook in the same directory. All training logs are stored in the models directory.

People who contributed to this system: Andy Wang, Stephen Zhang, Koji Liu, Matthew Li
