Merge pull request #501 from ivan-tse/distillation_dataset_validation
Add dataset validation script for model distillation
zhanyany16 authored Feb 27, 2025
2 parents 1c663e7 + 42366ff commit e093c6e
Showing 8 changed files with 501 additions and 0 deletions.
35 changes: 35 additions & 0 deletions custom-models/model_distillation/dataset-validation/README.md
@@ -0,0 +1,35 @@
## Dataset Validation for Model Distillation
Before you create a model distillation job in the Amazon Bedrock console, use the provided script to validate your dataset. This lets you catch formatting errors (if any) sooner and save costs. More details about the accepted format can be found here: https://docs.aws.amazon.com/bedrock/latest/userguide/prequisites-model-distillation.html

### Usage

Install the latest version of Python from [python.org](https://www.python.org/downloads/) if you haven't already.

Download the `dataset-validation` folder, `cd` into its root directory, and run the dataset validation script:

```
pip install -r requirements.txt -U
python3 dataset_validator.py -p <path>
# Specify an output file for detailed validation logs
python3 dataset_validator.py -p <path> -o <log file>
# Indicate that the given path contains invocation logs
python3 dataset_validator.py -p <path> -i
```

- Path options
  - file: /path/to/file.jsonl
  - folder: /path/to/folder
  - S3: s3://bucket/key
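
As a rough illustration of how the three path forms might be told apart, the sketch below classifies a path string by its prefix and extension. The helper names `classify_path` and `split_s3_uri` are hypothetical and are not taken from `dataset_validator.py`; the two constant values mirror `constants.py` in this commit. Treating anything that is neither an S3 URI nor a `.jsonl` file as a folder is an assumption made for brevity.

```python
# Values mirror constants.py in this commit.
S3_PREFIX = "s3://"
JSONL_EXTENSION = ".jsonl"


def classify_path(path: str) -> str:
    """Hypothetical helper: label a path as 's3', 'file', or 'folder'."""
    if path.startswith(S3_PREFIX):
        return "s3"
    if path.endswith(JSONL_EXTENSION):
        return "file"
    # Assumption: everything else is treated as a local folder.
    return "folder"


def split_s3_uri(uri: str) -> tuple:
    """Hypothetical helper: split 's3://bucket/key' into (bucket, key)."""
    if not uri.startswith(S3_PREFIX):
        raise ValueError(f"not an S3 URI: {uri}")
    bucket, _, key = uri[len(S3_PREFIX):].partition("/")
    return bucket, key
```

A real implementation would probably also check that local paths exist before reading them; this sketch only inspects the string.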

### Features
1. Validates that prompts in the given path satisfy the `bedrock-conversation-2024` format
2. If an output file is given, validation errors for each prompt are logged to that file
3. If the invocation logs flag (`-i`) is present, the script validates for the invocation logs use case instead
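
The per-prompt check described above can be pictured with a minimal sketch. It is not the logic of `dataset_validator.py`, which is not shown here. The function name `validate_record` is hypothetical, the field and role names mirror `constants.py` in this commit, and the rule that roles alternate starting with `user` is an assumption. The actual format has further requirements; see the linked AWS documentation.

```python
import json

# Values mirror constants.py in this commit.
MESSAGES_FIELD = "messages"
ROLE_FIELD = "role"
USER_ROLE = "user"
ASSISTANT_ROLE = "assistant"


def validate_record(line: str) -> list:
    """Hypothetical helper: return error strings for one JSONL line."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    messages = record.get(MESSAGES_FIELD)
    if not isinstance(messages, list) or not messages:
        return [f"'{MESSAGES_FIELD}' must be a non-empty list"]

    errors = []
    expected = USER_ROLE  # assumption: a conversation starts with the user
    for i, message in enumerate(messages):
        role = message.get(ROLE_FIELD) if isinstance(message, dict) else None
        if role != expected:
            errors.append(f"message {i}: expected role '{expected}', got {role!r}")
        # Assumption: roles strictly alternate between user and assistant.
        expected = ASSISTANT_ROLE if expected == USER_ROLE else USER_ROLE
    return errors
```

Returning a list of errors per line, rather than raising on the first problem, fits the README's description of logging validation errors for each prompt to an output file.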

### Limitations

This script currently does not support the following features:
- Invocation logs validation with filters
- Validating that prompts do not contain invalid tags
12 changes: 12 additions & 0 deletions custom-models/model_distillation/dataset-validation/constants.py
@@ -0,0 +1,12 @@
USER_ROLE = "user"
ASSISTANT_ROLE = "assistant"
MESSAGES_FIELD = "messages"
ROLE_FIELD = "role"
REQUIRED_INVOCATION_LOG_KEYS = [
    "modelId",
]
MIN_NUM_PROMPTS = 100
JSONL_EXTENSION = ".jsonl"
GZ_EXTENSION = ".gz"
S3_PREFIX = "s3://"
MAX_SIZE = 1 * 1024 * 1024 * 1024 # 1 GB in bytes
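
How these limits are enforced is not shown in this excerpt. One plausible use, sketched below with a hypothetical `check_limits` helper, is to reject a dataset that has fewer than `MIN_NUM_PROMPTS` records or is larger than `MAX_SIZE` bytes. The error wording and the choice to collect both errors rather than stop at the first are assumptions.

```python
# Values mirror constants.py in this commit.
MIN_NUM_PROMPTS = 100
MAX_SIZE = 1 * 1024 * 1024 * 1024  # 1 GB in bytes


def check_limits(num_prompts: int, size_in_bytes: int) -> list:
    """Hypothetical helper: return error strings for dataset-level limits."""
    errors = []
    if num_prompts < MIN_NUM_PROMPTS:
        errors.append(
            f"dataset has {num_prompts} prompts; at least {MIN_NUM_PROMPTS} are required"
        )
    if size_in_bytes > MAX_SIZE:
        errors.append(
            f"dataset is {size_in_bytes} bytes; the maximum is {MAX_SIZE} bytes"
        )
    return errors
```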