In molecular biology, codons are sequences of three nucleotides in DNA or RNA that specify an amino acid or a stop signal during protein synthesis. The ribosome translates each gene's sequence of codons into a chain of amino acids, forming proteins essential to life. Because most amino acids can be encoded by more than one codon, organisms differ in how often they use each of these synonymous codons. Codon usage refers to these frequencies, which vary across species and reflect evolutionary adaptation and gene expression preferences.
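To make the idea of codon usage frequency concrete, here is a minimal sketch that splits a coding sequence into codons and computes the relative frequency of each one. The function name `codon_frequencies` and the toy sequence are made up for illustration, and reporting codons in the RNA alphabet (U instead of T) is an assumption, not something taken from the notebook.

```python
from collections import Counter
from typing import Dict


def codon_frequencies(sequence: str) -> Dict[str, float]:
    """Return the relative frequency of each codon in a coding sequence."""
    # Assumption: report codons in the RNA alphabet, so T is written as U.
    seq = sequence.upper().replace("T", "U")
    # Ignore trailing bases that do not fill a complete codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    total = len(codons)
    if total == 0:
        return {}
    counts = Counter(codons)
    return {codon: n / total for codon, n in counts.items()}


# Toy sequence (not real data): AUG GCU GCU GCC UAA
freqs = codon_frequencies("ATGGCTGCTGCCTAA")
for codon, freq in sorted(freqs.items()):
    print(codon, round(freq, 2))
```

In this toy sequence GCU and GCC both encode alanine, and GCU appears twice as often as GCC. Differences of this kind between synonymous codons are what the classifier uses as features.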
This project aims to leverage codon usage frequencies to predict the Kingdom and DNA type of a species. Using the relative abundance of codons across species, we can gain insights into their biological classification. Our approach involves analyzing codon frequencies and training a predictive model to classify species by Kingdom and DNA type based on these patterns.
This analysis uses the Codon Usage dataset, available from the UCI Machine Learning Repository. Download the dataset from the following link and place it in the project directory:
The dataset contains codon frequency data across various species, providing the foundation for the predictive modeling.
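As a rough sketch of how such a table might be loaded and cleaned with pandas, the example below reads a CSV, treats every column outside an assumed set of metadata columns as a codon frequency, and drops rows whose codon values are not numeric. The metadata column names in `META_COLS`, the helper `load_codon_table`, and the three sample rows are assumptions for illustration; check them against the downloaded file.

```python
import io

import pandas as pd

# Assumed metadata column names; verify them against the downloaded file.
META_COLS = ["Kingdom", "DNAtype", "SpeciesID", "Ncodons", "SpeciesName"]


def load_codon_table(source) -> pd.DataFrame:
    """Read a codon usage CSV and coerce the codon columns to floats."""
    df = pd.read_csv(source, low_memory=False)
    codon_cols = [c for c in df.columns if c not in META_COLS]
    # Non-numeric cells become NaN so that malformed rows can be dropped.
    df[codon_cols] = df[codon_cols].apply(pd.to_numeric, errors="coerce")
    return df.dropna(subset=codon_cols).reset_index(drop=True)


# Synthetic three-row sample (not real data) with one malformed cell.
sample = io.StringIO(
    "Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC\n"
    "vrl,0,1,1000,toy virus,0.02,0.01\n"
    "bct,0,2,2000,toy bacterium,bad,0.03\n"
    "pln,1,3,3000,toy plant,0.04,0.02\n"
)
table = load_codon_table(sample)
print(table[["Kingdom", "UUU", "UUC"]])
```

For the real dataset, pass the path of the downloaded file to `load_codon_table` in place of the in-memory sample.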
The following libraries are required for this project:
- Pandas - Data manipulation and analysis
- NumPy - Numerical computations
- Matplotlib and Seaborn - Data visualization
- scikit-learn - Machine learning tools for classification and preprocessing
Install the dependencies by running:
pip install pandas numpy matplotlib seaborn scipy scikit-learn
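After installing, a quick way to confirm that everything is importable is a small check like the one below. This is an optional sketch; the helper name `missing_modules` is hypothetical, and the list uses import names, so scikit-learn appears as `sklearn`.

```python
import importlib.util

# Import names for the packages in the pip command above
# (scikit-learn is imported as "sklearn").
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn", "scipy", "sklearn"]


def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
print("missing:", ", ".join(missing) if missing else "none")
```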
The notebook, Codon_usage_final.ipynb, is organized as follows:
- Introduction - Overview of codon usage and its biological significance.
- Data Loading and Preprocessing - Loads the dataset, applies necessary cleaning, and prepares it for analysis.
- Exploratory Data Analysis (EDA) - Visual exploration of codon usage frequencies to identify trends and patterns.
- Predictive Modeling - Trains a machine learning model to classify species by Kingdom and DNA type based on codon usage frequency.
- Results and Visualization - Presents the model’s classification accuracy and visualizes the results, including confusion matrices and feature importances.
- Conclusion - Summarizes the findings and their implications for understanding codon usage patterns.
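The predictive modeling step outlined above can be sketched roughly as follows. This README does not state which model the notebook trains, so the random forest here is an assumption, chosen because it exposes feature importances. The data is synthetic (two made-up classes drawn from Dirichlet distributions over 64 codon frequencies) and stands in for the real table.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real table: 200 "species", 64 codon frequencies each.
n_per_class, n_codons = 100, 64
alpha_a = np.ones(n_codons)
alpha_b = np.ones(n_codons)
alpha_b[:8] = 5.0  # class "b" favors the first eight codons
X = np.vstack([
    rng.dirichlet(alpha_a, size=n_per_class),
    rng.dirichlet(alpha_b, size=n_per_class),
])
y = np.array(["a"] * n_per_class + ["b"] * n_per_class)

# Hold out a stratified quarter of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumed model choice; the notebook may use a different classifier.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
top_codons = np.argsort(model.feature_importances_)[::-1][:5]
print(f"held-out accuracy: {accuracy:.2f}")
print("most informative feature indices:", top_codons)
```

With the real data, `X` would hold the codon frequency columns and `y` either the Kingdom or the DNA type column, with one model fitted per target.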
- Clone the repository and change into its directory:
git clone https://github.com/yubrajniraula/ML-Project-on-Codon-Usage-Frequency
cd ML-Project-on-Codon-Usage-Frequency
- Download the Codon Usage dataset from the UCI link provided above, and place it in the root directory of the project.
- Open the notebook in Jupyter:
jupyter notebook Codon_usage_final.ipynb
- Run each cell sequentially to execute data loading, analysis, and modeling.
The output of this notebook includes:
- Model Performance Metrics - Accuracy, precision, and recall for Kingdom and DNA type prediction.
- Feature Importance Analysis - Highlights which codons or groups of codons are most informative for predicting the classification.
- Visualization - Confusion matrices and plots for understanding classification performance.
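To show how the reported metrics relate to a confusion matrix, here is a small self-contained sketch that builds the matrix from true and predicted labels and derives accuracy, precision, and recall from it. The labels and predictions are toy values, not results from the notebook, and the helper names are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix


def per_class_scores(matrix, labels):
    """Compute precision and recall for each class from the matrix."""
    scores = {}
    for i, label in enumerate(labels):
        true_pos = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column total
        actual = sum(matrix[i])                    # row total
        scores[label] = {
            "precision": true_pos / predicted if predicted else 0.0,
            "recall": true_pos / actual if actual else 0.0,
        }
    return scores


# Toy labels and predictions (not results from the notebook).
labels = ["bct", "vrl"]
y_true = ["bct", "bct", "bct", "vrl", "vrl"]
y_pred = ["bct", "bct", "vrl", "vrl", "bct"]

matrix = confusion_matrix(y_true, y_pred, labels)
scores = per_class_scores(matrix, labels)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / len(y_true)

print(matrix)
print(scores)
print(accuracy)
```

Precision divides the diagonal entry by its column total, while recall divides it by its row total, so a class can score well on one and poorly on the other.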
The Codon Usage dataset used in this project is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. This permits sharing and adapting the dataset for any purpose, provided appropriate credit is given to the original source.