MediChestAI is a comprehensive MLOps project focused on classifying chest CT images as normal or adenocarcinoma using a deep learning model. The project integrates modern tools and practices to streamline development, testing, deployment, and maintenance.
- Project Overview
- Project Structure
- Dataset
- Data Pipeline
- MLflow Experiments
- CI/CD Deployment
- Installation
- Usage
- MLflow DagsHub Connection
- DVC Commands
- MLOps Workflow: Implements an end-to-end CI/CD pipeline in which GitHub Actions triggers the Jenkins job that builds, tests, and deploys the project.
- Deployment: The application is designed to be deployed on AWS EC2 instances, leveraging Docker containers stored in Amazon ECR for consistent and scalable deployment.
  - Docker: A Dockerfile is included to build the project's Docker image, based on python:3.8-slim-buster. It installs the required dependencies and starts the Flask application.
  - Docker Compose: A docker-compose.yaml file runs the application with Docker Compose, mapping port 8080 on the host to port 8080 in the container.
- Pipeline Tracking (DVC): DVC manages and tracks the project's machine learning pipeline, including the data ingestion, base model preparation, training, and evaluation stages.
- MLflow Integration: MLflow is integrated for experiment tracking and serving, with DagsHub providing a centralized URI for managing and tracking experiments.
- Data Storage: The project's data is stored on Google Drive, ensuring easy access and version control through DVC.
- Flask Application: A Flask web application with a user-friendly interface lets users upload CT images and receive predictions from the model.
- Python Modular Approach: The project follows a modular Python structure and industry best practices to ensure maintainability and scalability.
- Model Framework: TensorFlow and Keras are used to develop the deep learning model, which is built on the VGG16 architecture (see the sketch after this list).
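As a rough illustration of how a VGG16-based classifier can be prepared with Keras: the image size, learning rate, and class count below are placeholder values, not the project's exact settings from params.yaml.

```python
# Illustrative sketch only; parameter values are assumptions, not the project's exact configuration.
import tensorflow as tf

IMAGE_SIZE = (224, 224, 3)   # assumed IMAGE_SIZE from params.yaml
NUM_CLASSES = 2              # normal vs. adenocarcinoma
LEARNING_RATE = 0.01         # assumed LEARNING_RATE from params.yaml

def prepare_base_model() -> tf.keras.Model:
    # Load VGG16 with ImageNet weights and without the top classification layers.
    base_model = tf.keras.applications.VGG16(
        input_shape=IMAGE_SIZE,
        weights="imagenet",
        include_top=False,
    )
    base_model.trainable = False  # freeze the convolutional base for transfer learning

    # Attach a small classification head for the two classes.
    x = tf.keras.layers.Flatten()(base_model.output)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = tf.keras.Model(inputs=base_model.input, outputs=outputs)

    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=LEARNING_RATE),
        loss=tf.keras.losses.CategoricalCrossentropy(),
        metrics=["accuracy"],
    )
    return model

if __name__ == "__main__":
    prepare_base_model().summary()
```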
MediChestAI/
│
├── .dvc/ # DVC cache and temporary files
│ ├── cache/
│ └── tmp/
│
├── .github/
│ └── workflows/
│ └── main.yaml # GitHub Actions workflow for triggering Jenkins
│
├── jenkins/
│ └── Jenkinsfile # Jenkins pipeline configuration
│
├── artifacts/ # Folder to store pipeline outputs
│
├── config/ # Configuration files for the project
│ └── config.yaml # Project configuration
│
├── logs/
│ └── running_logs.log # Application logs
│
├── model/
│ └── model.h5 # Saved model
│
├── research/ # Jupyter notebooks
│ ├── 01_data_ingestion.ipynb
│ ├── 02_prepare_base_model.ipynb
│ ├── 03_model_trainer.ipynb
│ └── 04_model_evaluation_with_mlflow.ipynb
│
├── scripts/
│ ├── ec2_setup.sh # Shell script to setup EC2 instance
│ └── jenkins.sh # Jenkins setup script
│
├── src/ # Source code for the project
│ └── CNNClassifier/
│ ├── __init__.py # Init file for the package
│ ├── components/ # Pipeline components
│ │ ├── data_ingestion.py
│ │ ├── model_evaluation.py
│ │ ├── model_trainer.py
│ │ ├── prepare_base_model.py
│ │ └── __init__.py
│ │
│ ├── config/ # Configuration-related source code
│ │ ├── configuration.py
│ │ └── __init__.py
│ │
│ ├── constants/ # Project constants
│ │ ├── __init__.py
│ │
│ ├── entity/ # Entity classes
│ │ ├── config_entity.py
│ │ └── __init__.py
│ │
│ ├── pipeline/ # Pipeline scripts
│ │ ├── stage_01_data_ingestion.py
│ │ ├── stage_02_prepare_base_model.py
│ │ ├── stage_03_model_trainer.py
│ │ ├── stage_04_model_evaluation.py
│ │ └── __init__.py
│ │
│ ├── utils/ # Utility functions
│ │ ├── common.py
│ │ └── __init__.py
│
├── templates/
│ └── index.html # HTML template for the Flask web app
│
├── .dockerignore # Files to exclude from Docker builds
├── .dvcignore # Files to exclude from DVC
├── .gitignore # Files to exclude from Git
├── app.py # Flask application entry point
├── docker-compose.yaml # Docker Compose configuration file
├── Dockerfile # Dockerfile for building the Docker image
├── dvc.lock # DVC lock file
├── dvc.yaml # DVC pipeline configuration
├── inputImage.jpg # Sample input image
├── LICENSE # Project license
├── main.py # Main Python script with the model pipeline stages
├── params.yaml # Parameters for training VGG16 model
├── README.md # Project README
├── requirements.txt # Python dependencies
├── scores.json # Model evaluation scores
├── setup.py # Setup script for the Python package
└── template.py # Template script for setting up project structure
The dataset used for training and evaluation can be found on Google Drive. You can download it using the link below:
The data pipeline of the MediChestAI project includes several key stages, each with a specific role in the machine learning workflow:

- Data Ingestion: The stage_01_data_ingestion.py script downloads the dataset as a zip file from a given URL and extracts it to the specified location for further processing (see the sketch after this list).
- Prepare Base Model: The stage_02_prepare_base_model.py script prepares a base model on the VGG16 architecture, initializing it with configurations such as input image size, weights ("imagenet"), and whether to include the top layer, then saves the initialized model for use in training and evaluation.
- Training: The stage_03_model_trainer.py script trains the prepared model on the training dataset, fine-tuning it with key hyperparameters like learning rate, batch size, and epochs as specified in the params.yaml file, and saves the trained model for evaluation.
- Evaluation: The stage_04_model_evaluation.py script evaluates the trained model's performance on a validation dataset, computing metrics like accuracy and loss and saving them to a JSON file for analysis.
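To make the ingestion step concrete, here is a rough sketch of downloading and extracting a dataset archive. The URL, paths, and function names are placeholders (the project's data lives on Google Drive, so the real stage may use a Drive-specific downloader rather than a plain URL fetch).

```python
# Hypothetical sketch of the ingestion step: download a zip archive and extract it.
# The URL and paths are placeholders, not the project's real configuration values.
import zipfile
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = "https://example.com/chest-ct-scan-data.zip"   # placeholder URL
LOCAL_ZIP = Path("artifacts/data_ingestion/data.zip")
UNZIP_DIR = Path("artifacts/data_ingestion")

def download_data(url: str = DATA_URL, destination: Path = LOCAL_ZIP) -> Path:
    """Download the dataset archive into the artifacts folder."""
    destination.parent.mkdir(parents=True, exist_ok=True)
    urlretrieve(url, destination)
    return destination

def extract_zip(zip_path: Path = LOCAL_ZIP, unzip_dir: Path = UNZIP_DIR) -> None:
    """Extract the downloaded archive for the downstream pipeline stages."""
    unzip_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(unzip_dir)

if __name__ == "__main__":
    extract_zip(download_data())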
The pipeline is managed with DVC to ensure reproducibility and version control, making it easy to track changes and improve the pipeline iteratively.
The project leverages MLflow to track and visualize experiments, allowing you to monitor different metrics such as accuracy and loss over multiple training runs. This helps in understanding the model's performance across different hyperparameter settings.
- Tracking Experiments: Track experiments and compare their metrics and parameters side by side.
- UI Visualization: The MLflow UI provides a clear visualization of the experiments to evaluate model performance.
- Integration with DagsHub: Easily integrates with DagsHub for centralized experiment tracking.
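As a rough illustration of how a run's parameters and metrics could be logged to the DagsHub-hosted tracking server: the parameter and metric values below are placeholders, and the tracking URI is taken from the MLFLOW_TRACKING_URI environment variable described in the MLflow DagsHub Connection section.

```python
# Illustrative MLflow logging sketch; parameter and metric values are placeholders.
import os
import mlflow

# The tracking URI comes from the MLFLOW_TRACKING_URI environment variable
# provided by DagsHub (see the MLflow DagsHub Connection section).
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

with mlflow.start_run(run_name="vgg16-evaluation"):
    # Log the hyperparameters used for this run.
    mlflow.log_params({"learning_rate": 0.01, "batch_size": 16, "epochs": 10})

    # Log the evaluation metrics produced by the evaluation stage.
    mlflow.log_metrics({"loss": 0.35, "accuracy": 0.92})
```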
To see the experiment tracking in action, here's a screenshot from the MLflow UI:
The CI/CD deployment architecture for MediChestAI includes the following components:
- GitHub Actions: Watches the GitHub repository and triggers the Jenkins job that runs the CI/CD pipeline.
- Jenkins (EC2-1): A Jenkins server on a dedicated EC2 instance builds and tests the project and produces the Docker image.
- Amazon ECR (Elastic Container Registry): Stores the built Docker images used for deployment.
- EC2-2 (Flask Application):
  - A second EC2 instance (EC2-2) hosts the Flask application.
  - The application is deployed using the Docker image pulled from ECR.
  - This instance acts as an endpoint that serves predictions based on the input data.
The deployment flow proceeds as follows:

- Source Code Commit: Changes to the code are pushed to the GitHub repository.
- Triggering Jenkins: GitHub Actions triggers a Jenkins job to handle the CI/CD pipeline.
- Building the Docker Image: Jenkins builds the Docker image based on the Dockerfile in the repository.
- Pushing to ECR: The Docker image is pushed to Amazon ECR.
- Deployment on EC2-2: EC2-2 pulls the latest Docker image from ECR and runs it, making the new version of the application available for use.
This setup ensures automated, consistent, and reliable deployment of the project, enabling continuous integration and delivery.
- Clone the repository:
git clone https://github.com/Omar-Karimov/MediChestAI.git
- Create and activate a virtual environment:
conda create -n medichest python=3.8 -y
conda activate medichest
- Install the required Python dependencies:
pip install -r requirements.txt
- Set up the package for development:
pip install -e .
- To run the application:
python app.py
- Access the application in your web browser at http://localhost:8080.
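For orientation, here is a minimal sketch of what app.py's prediction flow might look like. The route names, upload field name, class order, and preprocessing details are assumptions, not the project's exact implementation.

```python
# Minimal Flask sketch; routes, field names, and preprocessing are illustrative assumptions.
import numpy as np
from flask import Flask, render_template, request, jsonify
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing import image

app = Flask(__name__)
model = load_model("model/model.h5")            # trained VGG16-based classifier
CLASS_NAMES = ["adenocarcinoma", "normal"]      # assumed class order

@app.route("/", methods=["GET"])
def home():
    return render_template("index.html")

@app.route("/predict", methods=["POST"])
def predict():
    # Save the uploaded CT image and preprocess it to the model's input size.
    uploaded = request.files["file"]
    uploaded.save("inputImage.jpg")
    img = image.load_img("inputImage.jpg", target_size=(224, 224))
    batch = np.expand_dims(image.img_to_array(img), axis=0)

    # Run inference and return the predicted class label.
    probs = model.predict(batch)
    return jsonify({"prediction": CLASS_NAMES[int(np.argmax(probs))]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```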
To set up MLflow tracking using DagsHub:
- Create a DagsHub Account:
  - Visit DagsHub and create a free account.
- Connect Your GitHub Repository to DagsHub:
  - In DagsHub, create a new project and link it to your existing GitHub repository.
  - Enable the Experiments feature for your project.
- Get MLflow Tracking Credentials:
  - After connecting your GitHub repo, click on the Remote tab in DagsHub.
  - Copy the following environment variables provided by DagsHub:

    MLFLOW_TRACKING_URI=https://dagshub.com/<your-username>/<your-repo>.mlflow
    MLFLOW_TRACKING_USERNAME=<your-username>
    MLFLOW_TRACKING_PASSWORD=<your-password>

- Set the Environment Variables:
  - You can either export these variables using Git Bash:

    export MLFLOW_TRACKING_URI=https://dagshub.com/<your-username>/<your-repo>.mlflow
    export MLFLOW_TRACKING_USERNAME=<your-username>
    export MLFLOW_TRACKING_PASSWORD=<your-password>

  - Alternatively, you can add these to your system environment variables.
- Access the MLflow UI:
  - Go to the MLflow UI tab in DagsHub to track all experiments linked to your project.
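If you prefer to configure the credentials from Python rather than the shell, a small sketch follows; the values are placeholders you would replace with your own DagsHub details.

```python
# Optional alternative to exporting the variables in the shell; values are placeholders.
import os
import mlflow

os.environ["MLFLOW_TRACKING_URI"] = "https://dagshub.com/<your-username>/<your-repo>.mlflow"
os.environ["MLFLOW_TRACKING_USERNAME"] = "<your-username>"
os.environ["MLFLOW_TRACKING_PASSWORD"] = "<your-password>"

# MLflow reads the credentials from the environment when a run starts.
mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
print("Tracking to:", mlflow.get_tracking_uri())
```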
- Initialize DVC:
dvc init
- Reproduce the pipeline:
dvc repro
- DVC will automatically skip stages that haven't changed and run only the necessary parts of the pipeline.
- Visualize the pipeline:
dvc dag
- This will generate a graph illustrating the flow of the data pipeline.
+----------------+ +--------------------+
| data_ingestion | | prepare_base_model |
+----------------+***** +--------------------+
* ***** *
* ****** *
* *** *
** +----------+
** | training |
*** +----------+
*** ***
** **
** **
+------------+
| evaluation |
+------------+