This repository is a reference implementation demonstrating how to use bauplan and MongoDB to build a full-stack recommender system.
This application is an embedding-based recommender system from music playlists, using bauplan for data preparation and model training and MongoDB Atlas for serving.
We use the Spotify Million Playlist Dataset (originally from AI Crowd) - available as sample dataset in the bauplan sandbox.
Make sure to check out the companion blog post for the full context on the use case and the tools behind the implementation!
Given sequences of music tracks - Spotify playlists -, we wish to learn an embedding for each track, and use these embeddings to recommend similar tracks using vector similarity at inference time.
In particular, we showcase how to train a sequential model on playlists. The application can be easily changed to use a transformer model on metadata.
We then use a Streamlit app as an intuitive UI for users to explore the embedding space and get music recommendations based on the tracks they like: you can imagine the live endpoint as answering a question like "given that the user is listening to this track, what track do you recommend to play next?".
Credits: The data wrangling code is adapted from the NYU Machine Learning System Course by Jacopo Tagliabue and Ethan Rosenthal.
In the full-stack application, the data flows as follows between tools and environments:
- The original dataset is stored in AWS S3, as a bauplan-backed Iceberg table. The dataset is already available in the bauplan sandbox in the
public
namespace - note that this dataset is available as "One Big Table". - The end-to-end data pipeline is in
src/bpln_pipeline
and comprises the data preparation and training steps as simple decorated Python functions - running the pipeline with bauplan will execute these functions and persist the embeddings both in an Iceberg table and in MongoDB Atlas. - The Streamlit app in
src/app
showcases how to retrieve the embeddings from bauplan (higher latency / high throughput) and from MongoDB (lower latency / low throughput) to explore the vector space and the recommendations in real-time.
Note: both the pipeline and the app code are heavily commented. Do not hesitate to reach out to the Bauplan team for any questions or clarifications - shoot an email to [email protected].
To run the project, you need Python 3.10 or higher. We recommend using a virtual environment to manage dependencies:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
- Join bauplan's sandbox, sign in, create your username and API key.
- The sandbox includes several public datasets - including the one used in this project.
- Check out our 3-min tutorial to get familiar with the platform.
- Sign up for a MongoDB account and create a cluster Atlas.
- Write down your username and password, and the connection string to your cluster. To find your string in your Atlas console, click on your Cluster and then on
Connect/Drivers/
. Scroll to the sectionAdd your connection string into your application code
and use the connection string in there, making sure to replace the variables with your database password (for more on Connection Strings, check MongoDB's documentation). - In
Cluster > Security > Network Access
, make sure your cluster is reachable from bauplan by enabling:Allow access from anywhere (0.0.0.0/0)
.
Bauplan comes with its own CLI. To get acquainted with the dataset and its schema simply run in the terminal:
bauplan table get public.spotify_playlists
You can also query the data by passing SQL statements to bauplan query
in the CLI. For instance, to retrieve the top 10 artists in the dataset run:
bauplan query "SELECT artist_name, artist_uri, COUNT(*) as _C FROM public.spotify_playlists GROUP BY ALL ORDER BY _C DESC LIMIT 10"
To run the pipeline - i.e. the DAG going from the original table to the vector space -- you first will create a data branch to develop safely in the cloud.
cd src/bpln_pipeline
bauplan branch create <YOUR_USER_NAME>.music_recommendations
bauplan branch checkout <YOUR_USER_NAME>.music_recommendations
Now, add the Mongo URI as a secret to your bauplan project allowing bauplan to connect to the cluster securely:
bauplan parameter set --name mongo_uri --value "mongodb+srv://aa:[email protected]/" --type secret
If you inspect your bauplan_project.yml
file in the pipeline
folder, you will see the new parameter as mongo_uri
:
parameters:
mongo_uri:
type: secret
default: kUg6q4141413...
key: awskms:///arn:aws:kms:us-...
You can now run the entire pipeline in the cloud by simply running:
bauplan run
Once the system is done, you can check that we successfully created the embedding table directly from the CLI:
bauplan table get track_vectors_with_metadata
You can also query the final table to get the most represented authors in the vector space:
bauplan query "SELECT artist_name, COUNT(*) as _C FROM track_vectors_with_metadata GROUP BY 1 ORDER BY 2 DESC LIMIT 10"
We can visualize the structure of the embedding space using the Streamlit app, and then ask for recommendations leveraging the low-latency query capabilities of MongoDB. You will need to pass the Mongo URI (the same one used before to generate the secret) as an environment variable to connect to the database. You will also pass your branch name to the app.
To run the Streamlit app, run:
cd src/app
MONGO_URI=<YOUR_MONGO_URI> streamlit run explore_and_recommend.py -- --bauplan_user_name <YOUR_BAUPLAN_USERNAME>
Note: make sure that the vector search index created by the bauplan pipeline is completed before running the app (check the MongoDB Atlas dashboard).
The app will open in your browser, and you can start exploring the embedding space and asking for recommendations. Note how easy it is to interact with both bauplan and MongoDB from any Python process using the bauplan
and pymongo
libraries.
Building embeddings from sequences is not the only way to produce track-to-track recommendations. For example, we can also use metadata to build text-based embeddings. This is a stub to get your started (also showcasing how to use transformers from the Hugging Face model hub):
@bauplan.python('3.11', pip={'sentence-transformers': '3.1.1'})
# note that we are using the internet_access=True flag to allow this function
# to download the transformer model from the Hugging Face hub
@bauplan.model(internet_access=True)
def content_to_vectors(
tracks=bauplan.Model(
'public.spotify_playlists',
# extract track metadata
columns=[
'track_name',
'artist_name',
'track_uri'
],
filter="num_followers > $num_followers and num_tracks > $num_tracks"
)
):
from sentence_transformers import SentenceTransformer
# instantiate the transformer model
model = SentenceTransformer("distiluse-base-multilingual-cased-v1")
# encode the content wih the model
embeddings = model.encode([list_of_metadata_strings_here])
# finish the function...
The code in this repository is released under the MIT License and provided as is.