> [!WARNING]
> This project is in its very early stages of development and should not be used in production environments.
> [!NOTE]
> PRs are more than welcome. Check the roadmap if you want to contribute, or open a discussion to submit a use case.
Synda (synthetic data) is a package that allows you to create synthetic data generation pipelines. It is opinionated and fast by design, with plans to become highly configurable in the future.
Synda requires Python 3.10 or higher.
You can install Synda using pip:
```shell
pip install synda
```
- Create a YAML configuration file (e.g., `config.yaml`) that defines your pipeline:
```yaml
input:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/source.csv
    target_column: content
    separator: "\t"

pipeline:
  - type: split
    method: chunk
    name: chunk_faq
    parameters:
      size: 500
      # overlap: 20

  - type: split
    method: separator
    name: sentence_chunk_faq
    parameters:
      separator: .
      keep_separator: true

  - type: generation
    method: llm
    parameters:
      provider: openai
      model: gpt-4o-mini
      template: |
        Ask a question regarding the sentence about the content.
        content: {chunk_faq}
        sentence: {sentence_chunk_faq}

        Instructions:
        1. Use english only
        2. Keep it short

        question:

  - type: clean
    method: deduplicate-tf-idf
    parameters:
      strategy: fuzzy
      similarity_threshold: 0.9
      keep: first

  - type: ablation
    method: llm-judge-binary
    parameters:
      provider: openai
      model: gpt-4o-mini
      consensus: all # any, majority
      criteria:
        - Is the question written in english?
        - Is the question consistent?

output:
  type: csv
  properties:
    path: tests/stubs/simple_pipeline/output.csv
    separator: "\t"
```
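The `input` section reads a tab-separated CSV and takes the text to process from the `target_column` (here, `content`). As an illustration only (the actual contents of `tests/stubs/simple_pipeline/source.csv` may differ), a compatible source file can be produced with Python's standard library:

```python
import csv

# Illustrative only: build a tab-separated CSV matching the `input`
# section above (target_column: content, separator: "\t").
rows = [
    {"id": "1", "content": "Synda builds synthetic data pipelines. It is fast by design."},
    {"id": "2", "content": "Each pipeline step transforms or generates data."},
]

with open("source.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "content"], delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the target column is present.
with open("source.csv", newline="") as f:
    loaded = list(csv.DictReader(f, delimiter="\t"))
print(loaded[0]["content"])
```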
- Add a model provider:

```shell
synda provider add openai --api-key [YOUR_API_KEY]
```
- Generate some synthetic data:

```shell
synda generate config.yaml
```
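For intuition on the `clean` step's `deduplicate-tf-idf` method in the example config, here is a minimal pure-Python sketch of TF-IDF cosine-similarity deduplication with a 0.9 threshold and `keep: first` semantics. This illustrates the general technique, not Synda's actual implementation:

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Build simple smoothed TF-IDF vectors for a list of texts."""
    docs = [t.lower().split() for t in texts]
    df = Counter()  # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (1 + math.log((1 + n) / (1 + df[term])))
            for term, count in tf.items()
        })
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse (dict) vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(texts, similarity_threshold=0.9):
    """Drop any text too similar to an earlier kept text (keep: first)."""
    vectors = tfidf_vectors(texts)
    kept, kept_vecs = [], []
    for text, vec in zip(texts, vectors):
        if all(cosine(vec, kv) <= similarity_threshold for kv in kept_vecs):
            kept.append(text)
            kept_vecs.append(vec)
    return kept

questions = [
    "What is the capital of France?",
    "What is the capital of France?",   # duplicate: dropped
    "How many moons does Mars have?",
]
print(deduplicate(questions))
```

The example config uses `strategy: fuzzy`, which would additionally catch near-duplicates that differ in a few tokens; the threshold and keep policy work the same way.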
The Synda pipeline consists of three main parts:
- Input: Data source configuration
- Pipeline: Sequence of transformation and generation steps
- Output: Configuration for the generated data output
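Stripped of step-specific parameters, every configuration reduces to the same three-section skeleton (illustrative; see the full example above for real parameters):

```yaml
input:        # where the data comes from
  type: csv
  properties:
    # source-specific settings (path, target_column, ...)

pipeline:     # ordered list of transformation and generation steps
  - type: split
  - type: generation
  # further steps...

output:       # where the generated data goes
  type: csv
  properties:
    # destination-specific settings
```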
Currently, Synda supports four pipeline steps (as shown in the example above):

- split: breaks data down into chunks (`method: chunk` or `method: separator`)
- generation: generates content using LLM models (`method: llm`)
- clean: removes duplicated data (`method: deduplicate-tf-idf`)
- ablation: filters data based on defined criteria (`method: llm-judge-binary`)
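The ablation step's `consensus` parameter (`all`, `any`, `majority`) controls how multiple binary judge verdicts combine into a single keep/drop decision. As a hedged sketch of that voting logic (not Synda's code; `consensus_keep` is a hypothetical name):

```python
def consensus_keep(verdicts, mode="all"):
    """Combine binary judge verdicts (True = criterion passed)
    into one keep/drop decision. Illustrative sketch only."""
    if mode == "all":
        return all(verdicts)          # every verdict must pass
    if mode == "any":
        return any(verdicts)          # a single pass is enough
    if mode == "majority":
        return sum(verdicts) > len(verdicts) / 2
    raise ValueError(f"unknown consensus mode: {mode}")

# Two criteria, as in the example config:
verdicts = [True, False]  # English: yes; consistent: no
keep = consensus_keep(verdicts, "all")  # with `all`, this row is dropped
```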
More steps will be added in future releases.
The following features are planned for future releases.
- Implement a Proof of Concept
- Implement a common interface (Node) for input and output of each step
- Add SQLite support
- Add setter command for provider variable (openai, etc.)
- Store each execution and step in DB
- Add "split" -> "separator" step
- Add named step
- Store each Node in DB
- Add "clean" -> "deduplicate" step
- Allow injecting params from distant steps into prompts
- Add Ollama with structured generation output
- Retry logic for LLM steps
- Move input into pipeline (step type: 'load')
- Move output into pipeline (step type: 'export')
- Allow pausing and resuming pipelines
- Trace each synthetic data point with its history
- Enable caching of each step's output
- Implement custom scriptable steps for developers
- vLLM and Transformers providers
- Use Ray for large workloads
- Batch processing logic (via parameter) for LLM steps
- Add a programmatic API
- input/output: .xls format
- input/output: Hugging Face datasets
- chunk: Semantic chunks
- clean: embedding deduplication
- ablation: LLMs as juries
- masking: NER (GliNER)
- masking: Regexp
- masking: PII
- translations (SeamlessM4T)
- speech-to-text
- text-to-speech
- metadata extraction
- t-SNE / PCA
- custom steps?
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.