Merge pull request #501 from ivan-tse/distillation_dataset_validation
Add dataset validation script for model distillation
Showing 8 changed files with 501 additions and 0 deletions.
35 changes: 35 additions & 0 deletions
custom-models/model_distillation/dataset-validation/README.md
## Dataset Validation for Model Distillation
Before you create a model distillation job in the Amazon Bedrock console, use the provided script to validate your dataset. This lets you catch any formatting errors sooner and save costs. More details about the accepted format can be found here: https://docs.aws.amazon.com/bedrock/latest/userguide/prequisites-model-distillation.html

### Usage

Install the latest version of Python [here](https://www.python.org/downloads/) if you haven't already.

Download the `dataset-validation` folder, `cd` into its root directory, and run the dataset validation script:

```
# Install or upgrade the dependencies
pip install -r requirements.txt -U
# Validate the dataset at the given path
python3 dataset_validator.py -p <path>
# Specify an output file for detailed validation logs
python3 dataset_validator.py -p <path> -o <log file>
# Indicate that the given path contains invocation logs
python3 dataset_validator.py -p <path> -i
```

- Path options
  - file: /path/to/file.jsonl
  - folder: /path/to/folder
  - S3: s3://bucket/key

### Features
1. Validates that the prompts in the given path satisfy the `bedrock-conversation-2024` format
2. If an output file is given, the validation errors for each prompt are logged to that file
3. If the invocation logs flag is present, the validator checks the data for the invocation logs use case instead

### Limitations

This script currently does not support the following features:
- Invocation logs validation with filters
- Validating that prompts do not contain invalid tags
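
The flags and path options described in the README can be illustrated with a minimal sketch. This is not the code of `dataset_validator.py`, which is not shown here. The function names `build_parser`, `classify_path`, and `split_s3_uri` are hypothetical, as is the rule that any local path not ending in `.jsonl` is treated as a folder. Only the `-p`, `-o`, and `-i` flags and the two constants come from the commit.

```python
import argparse
from typing import Tuple

S3_PREFIX = "s3://"         # same value as in constants.py
JSONL_EXTENSION = ".jsonl"  # same value as in constants.py


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented -p, -o, and -i flags."""
    parser = argparse.ArgumentParser(description="Validate a distillation dataset")
    parser.add_argument("-p", dest="path", required=True,
                        help="file, folder, or S3 URI to validate")
    parser.add_argument("-o", dest="output", default=None,
                        help="optional file for detailed validation logs")
    parser.add_argument("-i", dest="invocation_logs", action="store_true",
                        help="treat the path as invocation logs")
    return parser


def classify_path(path: str) -> str:
    """Return 's3', 'file', or 'folder' from the shape of the string alone.

    Assumption: anything local that does not end in .jsonl is a folder.
    """
    if path.startswith(S3_PREFIX):
        return "s3"
    if path.endswith(JSONL_EXTENSION):
        return "file"
    return "folder"


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split s3://bucket/key into (bucket, key); the key may be empty."""
    bucket, _, key = uri[len(S3_PREFIX):].partition("/")
    return bucket, key
```

For example, `build_parser().parse_args(["-p", "s3://bucket/key", "-i"])` yields a namespace whose `path` can be passed to `classify_path` to pick a loader.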
12 changes: 12 additions & 0 deletions
custom-models/model_distillation/dataset-validation/constants.py
USER_ROLE = "user"
ASSISTANT_ROLE = "assistant"
MESSAGES_FIELD = "messages"
ROLE_FIELD = "role"
REQUIRED_INVOCATION_LOG_KEYS = [
    "modelId",
]
MIN_NUM_PROMPTS = 100
JSONL_EXTENSION = ".jsonl"
GZ_EXTENSION = ".gz"
S3_PREFIX = "s3://"
MAX_SIZE = 1 * 1024 * 1024 * 1024  # 1 GB in bytes
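
As a rough illustration of how these constants might be used, the sketch below checks a JSONL dataset line by line. It is an assumption-laden example and not the validator from this commit. The functions `check_record` and `check_dataset` are hypothetical. Reading `MIN_NUM_PROMPTS` as a lower bound on the prompt count and `MAX_SIZE` as an upper bound on dataset size is inferred from the names only. The full `bedrock-conversation-2024` schema has more requirements than the role check shown here.

```python
import json
from typing import Iterable, List

# Values copied from constants.py
USER_ROLE = "user"
ASSISTANT_ROLE = "assistant"
MESSAGES_FIELD = "messages"
ROLE_FIELD = "role"
MIN_NUM_PROMPTS = 100
MAX_SIZE = 1 * 1024 * 1024 * 1024  # 1 GB in bytes


def check_record(line: str) -> List[str]:
    """Return error strings for one JSONL line (empty list if it passes).

    Hypothetical checks: valid JSON object, non-empty messages list,
    and every message role is either user or assistant.
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    messages = record.get(MESSAGES_FIELD)
    if not isinstance(messages, list) or not messages:
        return [f"'{MESSAGES_FIELD}' must be a non-empty list"]
    errors = []
    for index, message in enumerate(messages):
        role = message.get(ROLE_FIELD) if isinstance(message, dict) else None
        if role not in (USER_ROLE, ASSISTANT_ROLE):
            errors.append(f"message {index}: unexpected role {role!r}")
    return errors


def check_dataset(lines: Iterable[str], size_in_bytes: int) -> List[str]:
    """Apply the assumed size and count limits plus the per-record checks."""
    errors = []
    if size_in_bytes > MAX_SIZE:
        errors.append(f"dataset is {size_in_bytes} bytes; limit is {MAX_SIZE}")
    count = 0
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines rather than counting them as prompts
        count += 1
        errors.extend(f"line {number}: {err}" for err in check_record(line))
    if count < MIN_NUM_PROMPTS:
        errors.append(f"only {count} prompts; at least {MIN_NUM_PROMPTS} expected")
    return errors
```

Collecting errors in a list instead of raising on the first failure matches the README's description of logging the validation errors for each prompt to an output file.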