This project implements a comprehensive pipeline for classifying Reddit posts based on their popularity using advanced Natural Language Processing (NLP) techniques and machine learning models. It leverages BERT-based embeddings, Convolutional Neural Networks (CNNs), and Autoencoder-based architectures to deliver a robust and flexible classification framework.
The goal is to categorize Reddit posts into three levels of popularity:
- Less Popularity: Upvote ratio ≤ 0.5
- Average Popularity: 0.5 < Upvote ratio ≤ 0.8
- Most Popularity: Upvote ratio > 0.8
The pipeline includes:
- Data preparation and labeling.
- Pretrained BERT embedding extraction for text representation.
- Model training using different neural network architectures.
- Evaluation of model performance through multiple metrics.
- Visualization of results for comparative analysis.
This project demonstrates how cutting-edge techniques can address real-world NLP classification challenges, with the flexibility to adapt to various datasets and use cases.
Ensure the following libraries are installed:
- Core Libraries: Python 3.7+, Pandas, NumPy, Matplotlib, Seaborn
- Machine Learning: scikit-learn, TensorFlow, imbalanced-learn
- NLP: HuggingFace Transformers
Install all required dependencies:
```bash
pip install -r requirements.txt
```
Place the input dataset at the specified path: `/kaggle/input/ukraine-war/original_data.csv`.
The dataset must include the following columns:
- subreddit: Name of the subreddit.
- title: Title of the Reddit post.
- selftext: Main text content of the post.
- upvote_ratio: Ratio of upvotes to total votes.
- Data Cleaning: Remove rows with missing or invalid data.
- Label Generation: Categorize posts into "Less Popularity," "Average Popularity," and "Most Popularity" based on `upvote_ratio` (see the sketch below).
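As a concrete illustration, here is a minimal pandas sketch of both steps, assuming the column names listed above; the thresholds follow the label definitions in the introduction, and `label_post` is an illustrative helper name, not necessarily the project's own function:

```python
import pandas as pd

df = pd.read_csv("/kaggle/input/ukraine-war/original_data.csv")

# Data cleaning: drop rows with missing fields or an out-of-range upvote ratio.
df = df.dropna(subset=["subreddit", "title", "selftext", "upvote_ratio"])
df = df[df["upvote_ratio"].between(0.0, 1.0)]

# Label generation: map the upvote ratio onto the three popularity classes.
def label_post(ratio: float) -> str:
    if ratio <= 0.5:
        return "Less Popularity"
    if ratio <= 0.8:
        return "Average Popularity"
    return "Most Popularity"

df["label"] = df["upvote_ratio"].apply(label_post)
```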
- Text data is tokenized using pretrained BERT models:
  - Default: `bert-base-uncased`
  - Optional models: TinyBERT, RoBERTa, ALBERT
- Extract embeddings (see the sketch after this list):
  - Pooled Output: Encodes the entire input text.
  - CLS Token Output: Embedding of the `[CLS]` token.
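A minimal sketch of extracting both outputs with HuggingFace Transformers (TensorFlow weights; the variable names are illustrative):

```python
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = TFAutoModel.from_pretrained("bert-base-uncased")

texts = ["Example Reddit post title and body text."]
inputs = tokenizer(texts, padding=True, truncation=True,
                   max_length=128, return_tensors="tf")
outputs = bert(**inputs)

pooled_output = outputs.pooler_output            # (batch, 768): whole-input encoding
cls_output = outputs.last_hidden_state[:, 0, :]  # (batch, 768): [CLS] token embedding
```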
To handle imbalanced classes, three strategies are available:
- Oversampling: Replicate data from underrepresented classes.
- Undersampling: Reduce samples from overrepresented classes.
- SMOTE: Synthesize new samples for underrepresented classes.
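All three strategies map directly onto imbalanced-learn. A sketch, using random stand-ins for the real training embeddings and labels:

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Illustrative stand-ins for the training embeddings and labels.
X_train = np.random.rand(300, 768)
y_train = np.random.choice([0, 1, 2], size=300, p=[0.1, 0.3, 0.6])

# Pick one strategy; each returns a rebalanced copy of the training data.
oversample = RandomOverSampler(random_state=42)    # replicate minority samples
undersample = RandomUnderSampler(random_state=42)  # drop majority samples
smote = SMOTE(random_state=42)                     # synthesize minority samples

X_balanced, y_balanced = smote.fit_resample(X_train, y_train)
```

Resampling is applied to the training split only, so the test set still reflects the original class distribution.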
- Dense Neural Network:
  - Input: BERT embeddings (pooled or CLS output).
  - Fully connected layers with ReLU activation and dropout for regularization.
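A minimal Keras sketch of this architecture; the layer sizes and dropout rates are illustrative assumptions, not the project's exact configuration:

```python
import tensorflow as tf

dense_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(768,)),            # pooled or CLS embedding
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation="softmax")  # three popularity classes
])
dense_model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
```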
- Convolutional Neural Network:
  - Input: Reshaped BERT embeddings.
  - Apply 1D convolutional layers for feature extraction.
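A sketch of the reshape-then-convolve idea under the same assumptions (hyperparameters are illustrative):

```python
import tensorflow as tf

cnn_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(768,)),
    tf.keras.layers.Reshape((768, 1)),   # treat the embedding as a 1D sequence
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation="softmax")
])
cnn_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
```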
- Autoencoder-based Classifier:
  - Stage 1: Train an autoencoder to learn compressed representations.
  - Stage 2: Use the compressed representations for classification.
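One way to realize the two stages in Keras; this is a sketch with illustrative layer sizes, and the project's `AutoencoderClassifierNetwork` may differ:

```python
import tensorflow as tf

# Stage 1: the autoencoder learns a compressed representation of the embeddings.
encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(768,)),
    tf.keras.layers.Dense(128, activation="relu"),   # bottleneck
])
decoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128,)),
    tf.keras.layers.Dense(768, activation="linear"),
])
autoencoder = tf.keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(X_train, X_train, epochs=10)  # reconstruct the input

# Stage 2: a classifier head consumes the compressed representations,
# reusing the encoder weights learned in stage 1.
classifier = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(3, activation="softmax"),
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
```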
- Evaluation Metrics:
  - Confusion Matrix
  - Classification Report (Precision, Recall, F1 Score)
- Visualization:
  - Confusion matrices for all models.
  - Overall accuracy comparison.
  - Label-specific accuracy and F1 scores.
  - Precision and recall comparisons.
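Both metrics come straight from scikit-learn. A sketch, using random stand-ins for the true labels and a model's predictions:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative stand-ins for the test labels and predicted classes.
y_test = np.random.choice([0, 1, 2], size=200)
y_pred = np.random.choice([0, 1, 2], size=200)

print(confusion_matrix(y_test, y_pred))
print(classification_report(
    y_test, y_pred,
    target_names=["Less Popularity", "Average Popularity", "Most Popularity"]))
```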
```bash
git clone https://github.com/sntk-76/Data-Mining
cd Data-Mining
pip install -r requirements.txt
```
Ensure the dataset is placed at `/kaggle/input/ukraine-war/original_data.csv`. Update the path in the code if needed.
- Execute the script sequentially.
- Alternatively, run all cells in the provided Jupyter Notebook.
Generated visualizations will be saved in the working directory.
- `Label_classification`: Classifies posts based on their `upvote_ratio`.
- `preprocessing`: Handles data tokenization, train-test splitting, and balancing.
- `neural_network`: Base class for Dense Neural Networks.
- `ConvolutionalDenseNetwork`: Extends `neural_network` to include 1D convolutional layers.
- `AutoencoderClassifierNetwork`: Extends `neural_network` with an autoencoder stage.
- `Visualization`: Visualizes results using confusion matrices, accuracy metrics, and F1 scores.
The project outputs the following visualizations for easy analysis:
- Confusion Matrices: `confusion_matrices.png`
- Overall Accuracy: `overall_accuracy.png`
- Label-specific Accuracy: `label_accuracy.png`
- F1 Score Comparison: `f1_score_comparison.png`
- Precision and Recall: `precision_recall_per_label.png`
- Confusion Matrix Differences: `confusion_matrix_difference.png`
- Scalable Framework: Easily extendable for new datasets or BERT variants.
- Comprehensive Models: Combines dense, convolutional, and autoencoder architectures.
- Visual Insights: Graphical representations for enhanced interpretability.
- Balanced Data Handling: Offers multiple techniques to manage class imbalance.
- Explore Additional Models:
  - Add more BERT variants like TinyBERT, RoBERTa, and ALBERT.
- Hyperparameter Tuning:
  - Optimize learning rates, layer sizes, and dropout rates.
- Advanced Visualization:
  - Incorporate new metrics like ROC curves and feature importance.
- Cross-domain Adaptation:
  - Test on datasets from other domains to assess generalizability.
This project is licensed under the MIT License. See the LICENSE file for more details.