This repository provides a curated collection of scripts and a Jupyter Notebook designed for training a custom
ruGPT-3.5-13B
model in the load_in_8bit
mode utilizing datasets and some scripts
from Saiga-2 (rulm).
The training process outlined here leverages Peft/LoRA technology. The resources provided are designed to facilitate a smooth training experience, seamless merging of LoRA weights with the original model, and a straightforward conversion of the model into the GGML format.
Note: While the training settings for this model mirror those used in GigaSaiga, but my model is enriched with additional dataset.
The primary objective of this repository is to reproduce the success achieved by GigaSaiga and to provide a detailed, step-by-step documentation of the training procedure. This initiative aims to empower and support the Russian-speaking ML community by making the process of training the ruGPT-3.5-13B model more accessible and understandable.
For your convenience, pretrained models are readily available at the following locations:
- https://huggingface.co/evilfreelancer/ruGPT-3.5-13B-lora
- https://huggingface.co/evilfreelancer/ruGPT-3.5-13B-ggml
By following the instructions and using the scripts provided in this repository, users can efficiently train their versions of the ruGPT-3.5-13B model with the flexibility to incorporate additional datasets as necessary.
First of all I would like to extend our sincere gratitude to the following authors and contributors:
-
The Sber AI Team, the brains behind the original
ruGPT-3.5-13B
model. Their groundbreaking work and continuous efforts in advancing AI and machine learning technologies have laid a solid foundation for this project and many others in the AI community. -
IlyaGusev and the rulm project team for their invaluable resources and datasets from Saiga-2/GigaSaiga, which have been fundamental in the training process of this custom ruGPT-3.5-13B model.
-
graysonwhite and the gglm project team. I'm particularly thankful for their comprehensive documentation on the ggml project, which has been indispensable in guiding me through the correct procedures for model transformation.
-
iashchak for his ruGPT-3.5-13B-ggml repository on HuggingFace. His contributions and shared expertise with llm-rs-python have been crucial in the successful creation of this project.
This project has been significantly enriched and made possible through the cumulative efforts and shared knowledge of these incredible individuals and teams. I deeply appreciate their contributions and are immensely thankful for their openness to sharing resources with the broader community.
For anyone looking to understand, extend, or build upon my work, I strongly recommend referring to and acknowledging these original authors and contributors, as their work represents the cornerstone of this project and many others in the field.
Before embarking on the training process, ensure your system meets the following requirements:
- ~100 GB of system RAM
- ~200 GB on HDD/SSD
- Nvidia GPU with at least 20 GB VRAM (eg. RTX 3090 or 4090)
- CUDA 12.2
Requirements:
- Python 3.10
- Python VirtualEnv
Clone the repo with all submodules:
git clone --recurse-submodules https://github.com/EvilFreelancer/ruGPT-3.5-training.git
Instantiate a virtual environment:
python -m venv venv
Switch to a virtual environment:
source venv/bin/activate
Download Python packages:
pip install -r requirements.txt
Requirements:
- Docker
- Docker Compose
- Nvidia Docker Runtime
Solution based on nvidia/cuda:12.2.0-devel-ubuntu22.04 image.
Clone the repo with all submodules:
git clone --recurse-submodules https://github.com/EvilFreelancer/ruGPT-3.5-training.git
Copy compose config from dist (and change settings if you need):
cp docker-compose.dist.yml docker-compose.yml
Build an image:
docker-compose build
Start container:
docker-compose build
Attach to container's shell:
docker-compose exec app bash
The entire process is broken down into four main steps, each corresponding to a script in the project’s root directory. Below is a step-by-step guide.
python3 1_dataset.py
The datasets utilized for training this model are consistent with those used for Saiga-2 (rulm).
Here's the comprehensive list:
- ru_turbo_alpaca
- ru_turbo_alpaca_evol_instruct
- ru_turbo_saiga
- ru_sharegpt_cleaned
- oasst1_ru_main_branch
- gpt_roleplay_realm
- ru_instruct_gpt4
To download and merge all datasets from this list you need to execute:
The resultant datasets train_full.jsonl
and val_full.jsonl
are generated in chat format.
python3 2_train.py
The sequence of operations performed by this script is as follows:
-
Download Original Model: The script initiates by downloading the original
ruGPT-3.5-13B
model from HuggingFace. The downloaded files are stored in theruGPT-3.5-13B
folder. -
Configuration Modification: After the download is complete, the script copies and modifies the configuration files located in
ruGPT-3.5-13B
folder. The altered configurations, which are necessary to enable training, are then placed in theoutput
folder. -
Training Initialization: The script subsequently instantiates the
src.train
Python module from therulm
project. This operation occurs within therulm/self_instruct
subdirectory. -
Output Files: Upon the completion of the above steps,
adapter_model.bin
andadapter_config.json
are generated and saved in theoutput
folder.
Each of the generated files plays a crucial role in the subsequent steps of the model training and application process.
python3 3_merge.py
This script performs the following tasks:
-
Weights Merging: It uses a modified version of the [convert_to_native.py] script. The script seamlessly merges the LoRA adapter weights with the weights of the base
ruGPT-3.5-13B
model. This merging process is crucial for enhancing the model’s performance with the learned adaptations from the LoRA training. -
Saving Merged Model: After the merging process is complete, the script saves the resultant model with the filename
pytorch_model.bin
in theoutput
directory of project.
Ensure you have sufficient storage space available in the output
directory as the merged model file can be quite
large.
python3 4_ggml.py
This step involves two main tasks:
-
Conversion to GGML-Compatible Format: The script starts by converting the
pytorch_model.bin
file into a format that is compatible with GGML. This converted format serves as an intermediate step that prepares the model for subsequent quantization processes. -
Quantization: Following the initial conversion, the script performs quantization on the model. The quantization process generates various quantized versions of the model, specifically: q4_0, q4_1, q5_0, q5_1, and q8_0. Each quantized version is optimized for different levels of precision and performance requirements.
-
Library Utilization: This entire process utilizes the llm-rs-python library. Ensure that this library is installed and accessible, as it plays a pivotal role in the GGML conversion and quantization processes.
-
Saving GGML Models: Upon completion of the conversion and quantization steps, the script saves the resultant GGML models in the
output_ggml
directory within your project’s root.
Ensure you have adequate storage space available in the output_ggml
directory, as the GGML models, especially the
quantized versions, may occupy significant space.
The root directory contains four additional scripts for testing each intermediate step:
- test_gigasaiga.py: Demonstrates the functionality of the original GigaSaiga as implemented by the authors of the rulm project.
- test_lora.py: Tests the on-the-fly merging of the LoRA adapter with adapter_model.bin from the output directory.
- test_merged.py: Shows the functionality of the original ruGPT-3.5 model after LoRA weights merging.
- test_ggml.py: Tests the GGML versions of the model to ensure proper functioning.
Feel free to open issues or pull requests if you have suggestions or encounter issues. Contributions to improve or expand this project are always welcome!