GitHub - stephen-ck-zhang/Multi-Label-Classification

Multi-Label-Classification

A movie can belong to multiple genres at once. We built a system that predicts a movie's genres from its plot overview.

The paper can be accessed at: (click here)

The system is built on an existing dataset, The Movies Dataset, which is publicly available on Kaggle. We have already downloaded it for our experiments; see dataset/movies_metadata.csv, which contains metadata for over 45,000 movies.

We built four models: our final system, RoBERTa; two comparison models, GloVe+CNN and GloVe+BiLSTM; and our baseline, TF-IDF+LR. They can be found in the multiclassification/models directory.

Data_Preprocessing.ipynb: This notebook handles data preprocessing and cleaning. Rows with non-English or invalid (NaN) overviews are dropped. Raw genre labels, originally in JSON format, are converted into Python lists of strings and stored in a Pandas DataFrame. The dataset is then split into 70% training, 15% validation, and 15% test data. The splits are saved as three CSV files for our experiments; you can find them in the multiclassification/data directory.
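The preprocessing steps described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code. The column names `overview` and `genres` and the stringified list-of-dicts genre format are assumptions about movies_metadata.csv, the function names are hypothetical, and the non-English filter is left out because it needs a language detection library.

```python
import ast

import pandas as pd


def parse_genres(raw):
    """Turn a raw genre string into a list of genre names (empty if unparseable)."""
    if not isinstance(raw, str):
        return []
    try:
        items = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []
    if not isinstance(items, list):
        return []
    return [item["name"] for item in items if isinstance(item, dict) and "name" in item]


def preprocess(df):
    """Drop rows without an overview and parse the genre column into lists."""
    df = df.dropna(subset=["overview"]).copy()
    df["genres"] = df["genres"].apply(parse_genres)
    return df.reset_index(drop=True)


def split_70_15_15(df, seed=42):
    """Shuffle, then cut into 70% train, 15% validation, and the rest as test."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = len(shuffled) * 70 // 100
    n_val = len(shuffled) * 15 // 100
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, val, test
```

Each split could then be written out with `DataFrame.to_csv`. Using integer arithmetic for the cut points avoids off-by-one surprises from floating-point rounding.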

MovieClassification_Baseline.ipynb: This is the implementation of our baseline, TF-IDF+Logistic Regression. Make sure the environment is set up, keep the preprocessed dataset in the same directory, and run the cells to see the results.
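For readers unfamiliar with multi-label baselines, here is a minimal sketch of a TF-IDF+Logistic Regression setup using scikit-learn. The one-vs-rest wrapping, the hyperparameters, and the toy data are assumptions made for illustration; the notebook may be configured differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Toy overviews and genre lists (made up for this example).
train_texts = [
    "A soldier fights through explosions to stop a bomb.",
    "Two friends stumble through a hilarious wedding mix-up.",
    "A grieving family struggles to rebuild their lives.",
    "A cop cracks jokes while chasing robbers across the city.",
    "A boxer battles his rivals and his own past.",
    "A bittersweet and funny story about growing old.",
]
train_genres = [
    ["Action"],
    ["Comedy"],
    ["Drama"],
    ["Action", "Comedy"],
    ["Action", "Drama"],
    ["Comedy", "Drama"],
]

# Encode each genre list as a multi-hot row, one column per genre.
binarizer = MultiLabelBinarizer()
y_train = binarizer.fit_transform(train_genres)

# One binary logistic regression per genre on top of TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_texts, y_train)

# Predict a multi-hot row, then map it back to genre names.
pred = model.predict(["A funny cop chases robbers."])
predicted_genres = binarizer.inverse_transform(pred)
```

Because each genre gets its own binary classifier, a single overview can receive zero, one, or several genres.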

MovieClassification_BiLSTM.ipynb: This is the implementation of our comparison models, GloVe+BiLSTM and GloVe+CNN. Make sure the environment is set up, keep the preprocessed dataset in the same directory, and run the cells to see the results.
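Both comparison models start from pretrained GloVe vectors. As a rough, dependency-free sketch of that first step, the snippet below parses the standard GloVe text format (one word per line followed by its vector components) and builds an embedding table for a vocabulary. The function names, the tiny in-memory sample, and the zero-vector fallback for unknown words are assumptions for illustration, not the notebook's code.

```python
import io


def load_glove(lines):
    """Parse GloVe text lines ("word v1 v2 ...") into a word -> vector dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector for vocab[i]; unknown words get a zero vector."""
    return [vectors.get(word, [0.0] * dim) for word in vocab]


# A two-word stand-in for a real GloVe file.
sample = io.StringIO("the 0.1 0.2 0.3\nfilm 0.4 0.5 0.6\n")
vectors = load_glove(sample)
matrix = embedding_matrix(["film", "zombie"], vectors, dim=3)
```

In a real run, `lines` would be an open GloVe file, and the resulting table would initialize the embedding layer of the BiLSTM or CNN.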

MovieClassification_RoBERTa.ipynb: This is the implementation of our final system, RoBERTa. Make sure the environment is set up, keep the preprocessed dataset in the same directory, and run the cells to see the results.
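A multi-label classifier head typically emits one logit per genre, and each logit is converted to an independent yes/no decision. The sketch below shows that decoding step in plain Python. The 0.5 threshold, the genre names, and the example logits are assumptions for illustration; the notebook may decode its RoBERTa outputs differently.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode(logits, genre_names, threshold=0.5):
    """Keep every genre whose sigmoid probability reaches the threshold."""
    return [
        name
        for name, logit in zip(genre_names, logits)
        if sigmoid(logit) >= threshold
    ]


genres = ["Action", "Comedy", "Drama"]
predicted = decode([2.0, -1.5, 0.3], genres)
```

Unlike a softmax over classes, the per-genre sigmoid lets several genres be selected for the same movie.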

Note: You can use our preprocessed data in the multiclassification/data directory, which is the output of running Data_Preprocessing.ipynb in the same directory. All training logs are stored in the models directory.

Contributors: Andy Wang, Stephen Zhang, Koji Liu, Matthew Li

