Add dataset validation script for model distillation #501

Merged
35 changes: 35 additions & 0 deletions custom-models/model_distillation/dataset-validation/README.md
@@ -0,0 +1,35 @@
## Dataset Validation for Model Distillation
Before you create a model distillation job in the Amazon Bedrock console, use the provided script to validate your dataset. This lets you catch formatting errors early and save costs. More details about the accepted format can be found here: https://docs.aws.amazon.com/bedrock/latest/userguide/prequisites-model-distillation.html

### Usage

Install the latest version of Python [here](https://www.python.org/downloads/) if you haven't already.

Download the `dataset-validation` folder, `cd` into it, and run the dataset validation script:

```
pip install -r requirements.txt -U
python3 dataset_validator.py -p <path>

# Specifying an output file for detailed validation logs
python3 dataset_validator.py -p <path> -o <log file>

# Specifying that the given path contains invocation logs
python3 dataset_validator.py -p <path> -i
```

- Path options
  - file: /path/to/file.jsonl
  - folder: /path/to/folder
  - S3: s3://bucket/key
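
As a rough illustration of how the three path forms might be told apart, here is a minimal sketch. The helper `classify_path` is hypothetical and is not taken from `dataset_validator.py`, whose source is not shown here; only the `s3://` prefix and the `.jsonl` extension come from this change.

```python
import os

S3_PREFIX = "s3://"
JSONL_EXTENSION = ".jsonl"


def classify_path(path: str) -> str:
    """Return 's3', 'folder', or 'file' for a dataset path (hypothetical helper)."""
    if path.startswith(S3_PREFIX):
        return "s3"
    if os.path.isdir(path):
        return "folder"
    if path.endswith(JSONL_EXTENSION):
        return "file"
    # Anything else is rejected in this sketch; the real script may accept more.
    raise ValueError(f"Unsupported path: {path}")
```

Checking the S3 prefix first avoids touching the local filesystem for remote paths.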

### Features
1. Validates that prompts in the given path satisfy the `bedrock-conversation-2024` format
2. If an output file is given, validation errors for each prompt are logged to that file
3. If the invocation logs flag is present, the validator checks the data against the invocation logs format instead
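
The first check can be pictured roughly as follows. This is a minimal sketch, not the code from `dataset_validator.py`: the function `validate_record` is hypothetical, the `schemaVersion` field name is an assumption based on the documented `bedrock-conversation-2024` format, and the real script may enforce more or different rules (for example turn ordering or content structure).

```python
import json

USER_ROLE = "user"
ASSISTANT_ROLE = "assistant"
MESSAGES_FIELD = "messages"
ROLE_FIELD = "role"
SCHEMA_VERSION = "bedrock-conversation-2024"  # assumed value of "schemaVersion"


def validate_record(line: str) -> list[str]:
    """Return a list of error strings for one JSONL line; empty means valid."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["record is not a JSON object"]

    errors = []
    if record.get("schemaVersion") != SCHEMA_VERSION:
        errors.append("schemaVersion is missing or unexpected")

    messages = record.get(MESSAGES_FIELD)
    if not isinstance(messages, list) or not messages:
        errors.append(f"'{MESSAGES_FIELD}' must be a non-empty list")
        return errors

    # Every message must be an object whose role is user or assistant.
    for i, message in enumerate(messages):
        role = message.get(ROLE_FIELD) if isinstance(message, dict) else None
        if role not in (USER_ROLE, ASSISTANT_ROLE):
            errors.append(f"message {i}: unexpected role {role!r}")
    return errors
```

Returning a list of errors per prompt, rather than raising on the first one, matches the idea of writing every problem to an output log.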

### Limitations

This script currently does not support the following features:
- Invocation logs validation with filters
- Validating that prompts do not contain invalid tags
12 changes: 12 additions & 0 deletions custom-models/model_distillation/dataset-validation/constants.py
@@ -0,0 +1,12 @@
USER_ROLE = "user"
ASSISTANT_ROLE = "assistant"
MESSAGES_FIELD = "messages"
ROLE_FIELD = "role"
REQUIRED_INVOCATION_LOG_KEYS = [
"modelId",
]
MIN_NUM_PROMPTS = 100
JSONL_EXTENSION = ".jsonl"
GZ_EXTENSION = ".gz"
S3_PREFIX = "s3://"
MAX_SIZE = 1 * 1024 * 1024 * 1024 # 1 GB in bytes
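
How `MIN_NUM_PROMPTS` and `MAX_SIZE` are applied is not visible in this part of the change. One plausible use, sketched here under that assumption, is a size and prompt-count check on a local JSONL file. The helper `check_limits` is hypothetical, and counting one prompt per non-empty line is also an assumption.

```python
import os

MIN_NUM_PROMPTS = 100
MAX_SIZE = 1 * 1024 * 1024 * 1024  # 1 GB in bytes


def check_limits(path: str) -> list[str]:
    """Hypothetical helper: report size and prompt-count violations for a local JSONL file."""
    errors = []
    size = os.path.getsize(path)
    if size > MAX_SIZE:
        errors.append(f"file is {size} bytes, above the {MAX_SIZE}-byte limit")
    # Assumes one prompt per non-empty line, as in JSONL.
    with open(path, encoding="utf-8") as handle:
        num_prompts = sum(1 for line in handle if line.strip())
    if num_prompts < MIN_NUM_PROMPTS:
        errors.append(f"found {num_prompts} prompts, need at least {MIN_NUM_PROMPTS}")
    return errors
```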