Write-Audit-Publish on the lakehouse with Bauplan and DBOS

A reference implementation of the write-audit-publish pattern with Bauplan and DBOS

Overview

A common need in S3-backed analytics systems (e.g. a data lakehouse) is to safely ingest new data into tables that are available to downstream consumers.

Data engineering best practices suggest the Write-Audit-Publish (WAP) pattern, which consists of three main logical steps (a code sketch follows the list):

  • Write: ingest the data into a "staging" / "temporary" section of the lakehouse (a data branch) - the data is not yet visible to downstream consumers;
  • Audit: run quality checks on the data to verify its integrity and quality (avoiding the "garbage in, garbage out" problem);
  • Publish: if the quality checks succeed, publish the data to the production branch - the data is now visible to downstream consumers; otherwise, raise an error and perform the necessary clean-up operations.
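
In plain Python, and leaving the actual lakehouse work aside, the pattern boils down to a guarded publish with a clean-up at the end. The sketch below uses stub helpers (not the functions in this repository) just to make the branching logic concrete:

def write_to_branch(s3_path: str, table: str, branch: str) -> None:
    print(f"writing {s3_path} into {table} on branch {branch}")  # stub

def run_quality_checks(table: str, branch: str) -> bool:
    print(f"auditing {table} on branch {branch}")  # stub
    return True

def publish_branch(branch: str) -> None:
    print(f"merging {branch} into the production branch")  # stub

def delete_branch(branch: str) -> None:
    print(f"cleaning up branch {branch}")  # stub

def wap_ingestion(s3_path: str, table: str, branch: str) -> None:
    write_to_branch(s3_path, table, branch)        # Write
    try:
        if not run_quality_checks(table, branch):  # Audit
            raise ValueError("quality checks failed: data not published")
        publish_branch(branch)                     # Publish
    finally:
        delete_branch(branch)                      # clean up in both cases

wap_ingestion("s3://mybucket/some_file.parquet", "yellow_trips", "myuser.ingestion_branch")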

This repository showcases how DBOS and Bauplan can be used to implement WAP in ~150 lines of no-nonsense pure Python code: no knowledge of the JVM, SQL or Iceberg is required.

If you are impatient and want to see the project in action, this is us running the code from our laptop.

What happens under the hood?

While the workflow looks and feels like a simple, no-nonsense Python script, a lot of magic happens behind the scenes in the cloud, over object storage. In particular, the WAP logic maps exactly to Bauplan operations over the data lake:

  • create a data branch, a zero-copy sandbox of the entire data lake in which to perform the ingestion safely;
  • create an Iceberg table inside this ingestion branch, loading the files in S3 into it;
  • retrieve a selected column from the Iceberg table to make sure there are no nulls (quality check);
  • merge the data branch into the production branch (on success), and clean up the data branch before exiting.

What looks to the developer like a simple function call (wrapped by DBOS for durable execution) is actually a complex sequence of infrastructure and cloud operations that Bauplan performs for you: you do not need to know anything about Iceberg specs, data branches, or columnar querying - you can just focus on the business logic.
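
To make the shape of that function call concrete, here is a minimal sketch of a DBOS workflow wrapping the three WAP steps. The @DBOS.workflow() / @DBOS.step() decorators come from the DBOS Python SDK, and bauplan.Client() is the Bauplan SDK entry point; the specific lakehouse calls are deliberately left as comments - check the repository code and the Bauplan docs for the exact methods and parameters.

import os

import bauplan
from dbos import DBOS

# The steps below would use this client for the lakehouse calls
# (assumes your Bauplan API key is already configured).
client = bauplan.Client()

@DBOS.step()
def write_step() -> str:
    branch = os.environ.get("BRANCH_NAME", "myuser.dbos_ingestion")
    # Bauplan calls go here: create the zero-copy data branch, create the
    # Iceberg table in it, and load the S3 files into the table.
    return branch

@DBOS.step()
def audit_step(branch: str) -> bool:
    # Bauplan call goes here: query the selected column on the branch and
    # verify there are no nulls; return False if the check fails.
    return True  # placeholder result

@DBOS.step()
def publish_step(branch: str) -> None:
    # Bauplan calls go here: merge the data branch into the production
    # branch, then delete the temporary branch.
    pass

@DBOS.workflow()
def wap_workflow() -> None:
    branch = write_step()
    if not audit_step(branch):
        raise ValueError("audit failed: data was not published")
    publish_step(branch)

Because DBOS durably records the outcome of each step, a workflow interrupted mid-flight resumes from the last completed step instead of re-ingesting the data from scratch.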

Setup

Bauplan

Bauplan is the programmable lakehouse: you can load, transform, and query data, all from your code (CLI or Python). You can start by reading our docs, dive deep into the underlying architecture, or explore how the API simplifies advanced use cases.

To use Bauplan, you need an API key for our demo environment: you can request one here. Run the 3-minute quick start to get familiar with the platform first.

Note: the current SDK version is 0.0.3a292 but it is subject to change as the platform evolves - ping us if you need help with any of the APIs used in this project.

Setup your S3 bucket

To run a Write-Audit-Publish flow you need some files to write first!

When using the Bauplan demo environment, any Parquet or CSV file in a publicly readable bucket will do: just upload your (non-sensitive!) file(s) to an S3 bucket and set the appropriate permissions.
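
If your files are still on your laptop, a one-off upload with boto3 is enough; the bucket and object names below are placeholders that mirror the configuration example further down:

import boto3

# Upload a local Parquet file to a bucket you control (placeholder names);
# make sure the resulting object is readable by the Bauplan demo environment.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="yellow_tripdata_2024-01.parquet",
    Bucket="mybucket",
    Key="yellow_tripdata_2024-01.parquet",
)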

Note: our example video demo below is based on the Yellow Trip Dataset - adjust the quality check function accordingly if you use a different dataset.

Setup your Python environment and get started with DBOS

Install the required dependencies in a virtual environment:

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Run the local DBOS setup to get started - i.e. install the CLI tool and set up the database with one of the recommended methods. For example, if you have Docker installed, you can use the following commands to start a containerized Postgres database (customize the variables at your discretion):

docker pull postgres
docker run --name some-postgres -e POSTGRES_USER=postgres -e POSTGRES_PASSWORD=password -p 5432:5432 -d postgres

Once the database is running, make sure your dbos-config.yaml has both the database and the env section properly set up. For example:

env:
  TABLE_NAME: 'yellow_trips'
  BRANCH_NAME: 'mybauplanuser.dbos_ingestion'
  S3_PATH: 's3://mybucket/yellow_tripdata_2024-01.parquet'
  NAMESPACE: 'dbos'
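
The env section defines environment variables for the application, so the workflow code can read these values with os.environ; a quick sketch matching the example above (the repository's own code may read them slightly differently):

import os

# Ingestion parameters defined in the env section of dbos-config.yaml
table_name = os.environ["TABLE_NAME"]    # e.g. 'yellow_trips'
branch_name = os.environ["BRANCH_NAME"]  # e.g. 'mybauplanuser.dbos_ingestion'
s3_path = os.environ["S3_PATH"]          # e.g. 's3://mybucket/yellow_tripdata_2024-01.parquet'
namespace = os.environ["NAMESPACE"]      # e.g. 'dbos'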

Remember to run the migration when you first set up the DBOS project in your Postgres instance:

dbos migrate

Run the workflow

You can run the workflow with DBOS through the CLI:

dbos start

If you want to see the end result, you can watch this video demonstration of the flow in action, both in case of successful audit and in case of failure.

License

The code in this project is licensed under the MIT License (DBOS and Bauplan belong to their respective owners and come with their own licenses).
