Welcome to the data and runtime repository for the Water Supply Forecast Rodeo competition on DrivenData! This repository contains a few things:

- **Data download code** (`data_download/`) — a Python package with code and a CLI for downloading data from each approved feature data source. DrivenData will download datasets from certain approved data sources and mount them to the competition runtime for code execution submissions. Use the CLI to reproduce the saved file structure in the runtime.
- **Data reading code** (`data_reading/`) — a Python library with example code for loading each of the feature datasets downloaded by the data download package, available for you to optionally use. It will be installed in the code execution runtime environment, and you will be able to import it.
- **Submission template** (`examples/template/`) — a template with the function signatures that you should implement in your submission.
- **Example submission** (`examples/moving_average/`) — a submission with a simple demonstration solution. It runs successfully in the code execution runtime and outputs a valid submission.
- **Runtime environment specification** (`runtime/`) — the definition of the environment where your code will run.
You can use this repository to:
⬇️ Get feature data: The same code that is used to get feature data for the runtime environment is available for you to use locally.
🔧 Test your submission: Test your submission using a locally running version of the competition runtime to discover errors before submitting to the competition website.
📦 Request new packages in the official runtime: Since your submission will not have general access to the internet, all dependencies must be pre-installed. If you want to use a package that is not in the runtime environment, make a pull request to this repository. Make sure to test out adding the new package to both official environments, CPU and GPU.
Changes to the repository are documented in `CHANGELOG.md`.
- Prerequisites
- Setting up the data directory
- Code submission format
- Running your submission locally
- Smoke tests
- Runtime network access
This repo contains a Python package named `wsfr-download` located in the `data_download/` directory. It provides a command-line interface (CLI) for downloading approved challenge datasets. DrivenData will use this package to download the test feature data that will be made available to the code execution runtime. You can use it to download feature data in the same way for testing your submission or for training.
> **Note:** Data download code may be added for requested data sources that get approved.
Requires Python 3.10. To install with the exact dependencies that will be used by DrivenData, create a new virtual environment and run:

```bash
pip install -r ./data_download/requirements.txt
pip install ./data_download/
```
By default, data is saved into a subdirectory named `data/` relative to your current working directory. You can explicitly override this by setting the environment variable `WSFR_DATA_ROOT` to another directory path. The expected default usage is that you run all commands with the root directory of this repository as your working directory.
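The lookup behavior described above presumably reduces to something like the following sketch (an assumption for illustration; the actual resolution logic lives inside the `wsfr-download` package and may differ):

```python
import os
from pathlib import Path

def data_root() -> Path:
    # Assumed lookup: honor WSFR_DATA_ROOT if set, else fall back to ./data
    return Path(os.environ.get("WSFR_DATA_ROOT", "data"))

os.environ.pop("WSFR_DATA_ROOT", None)
print(data_root())  # -> data
```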
You will also need to download the following files from the competition data download page and place them into your data directory. The data download scripts depend on some of these files.

- `geospatial.gpkg` -> `data/geospatial.gpkg`
- `metadata.csv` -> `data/metadata.csv`
- `cdec_snow_stations.csv` -> `data/cdec_snow_stations.csv`
- `cpc_climate_divisions.gpkg` -> `data/cpc_climate_divisions.gpkg`
- `nlcd_release_dates.csv` -> `data/nlcd_release_dates.csv`
Additionally, the following data products are static releases and involve large single files. If you are planning to use any of these datasets, please manually download them from their approved sources and move them into the designated locations.

- BasinATLAS basin attributes -> `data/BasinATLAS_Data_v10.gdb.zip` (2.7 GB)
- NLCD Urban Imperviousness -> `data/NLCD_impervious_2021_release_all_files_20230630.zip` (19.54 GB)
You will need at least 115 GB of free disk space to download all datasets. See the "Expected files" section below for a breakdown by data source.
To simply download all test feature data that will be available, use the `bulk` command. From the repository root as your working directory, run:

```bash
python -m wsfr_download bulk data_download/hindcast_test_config.yml
```
You can invoke the CLI with `python -m wsfr_download`. For example, to see a list of all available commands:

```bash
python -m wsfr_download --help
```
The CLI is organized with one command per data source, e.g., see:

```bash
python -m wsfr_download grace_indicators --help
```
There is also the `bulk` command for downloading multiple data sources at once, as shown in the previous section. A bulk download is configured by a YAML configuration file. The configuration file for the Hindcast test set is `data_download/hindcast_test_config.yml`. To download feature data for training, create your own YAML configuration file for the years and data sources that you need, using the test set file as an example.
By default, all download functions skip files that already exist in your data directory. This is controlled by an option called `skip_existing`. To force downloads to overwrite existing files, set `skip_existing` to `false` in the bulk download config file when using the `bulk` command, or use the `--no-skip-existing` flag when using an individual data source's download command.
A list of all files present in the runtime data volume is available in `data.find.txt`. You can generate an equivalent version of this file for your local data directory with the following command:

```bash
find data -type f ! -name '.DS_Store' ! -name '.gitkeep' | sort
```
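If you'd rather stay in Python, the same listing can be produced with the standard library (a convenience sketch, not part of the competition tooling; output order may differ slightly from `find | sort` in edge cases):

```python
from pathlib import Path

def list_data_files(root: str = "data") -> list[str]:
    """Recursively list files under root, excluding .DS_Store and
    .gitkeep, sorted by path -- mirrors the find command above."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.name not in {".DS_Store", ".gitkeep"}
    )
```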
You can also find a listing of subdirectory sizes in `data.du.txt`, which will give you an idea of the disk space needed for each data source. You can generate an equivalent version of this file for your local data directory with the following command:

```bash
du -sh data/*
```
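A rough Python analogue of that size report is sketched below (note that `du` reports allocated disk blocks in human-readable units, while this sums file sizes in bytes, so the numbers will not match exactly):

```python
from pathlib import Path

def dir_sizes(root: str = "data") -> dict[str, int]:
    """Total size in bytes of each entry directly under root."""
    sizes: dict[str, int] = {}
    for entry in sorted(Path(root).iterdir()):
        if entry.is_dir():
            # Sum the sizes of all regular files in the subtree
            sizes[entry.name] = sum(
                f.stat().st_size for f in entry.rglob("*") if f.is_file()
            )
        else:
            sizes[entry.name] = entry.stat().st_size
    return sizes
```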
This repo contains a Python package named `wsfr-read` located in the `data_reading/` directory. It provides a library with example functions to read the data downloaded by `wsfr-download`. This package will be installed into the code execution runtime for you to optionally use during inference on the test set. These functions may be helpful because they implement subsetting by `site_id` and `issue_date`. You are not required to use these functions in your solution.
> **Note:** Data reading code may be added for requested data sources that get approved.
Requires Python 3.10. Install with pip:

```bash
pip install ./data_reading/
```
Modules are provided with names matching the data source names in the `wsfr-download` package. Each module contains `read_*_data` functions that provide basic ways to load that data for use as features for your models. See the functions' docstrings for more details on usage.
By default, data is assumed to be in a subdirectory named `data/` relative to your current working directory. You can explicitly override this by setting the environment variable `WSFR_DATA_ROOT`.
When you make a submission on the DrivenData competition site, we run your submission inside a Docker container, a virtual operating system that allows for a consistent software environment across machines. The best way to make sure your submission to the site will run is to first run it successfully in the container on your local machine.
- A clone of this repository
- Docker
- At least 5 GB of free space for the CPU version of the Docker image or at least 10 GB of free space for the GPU version
- GNU make (optional, but useful for running the commands in the Makefile)
Additional requirements to run with GPU:
- NVIDIA drivers with CUDA 11
- NVIDIA container toolkit
In the official code execution platform, `code_execution/data` will contain data provided for the test set. This will include data from the data download page as well as feature data downloaded by the data pipelines in `data_download/`. See the data download section for more about setting up the test data.
In addition to the files detailed in the data download section, you will also need the following two files from the data download page:

- `submission_format.csv` -> `data/submission_format.csv`
- `smoke_submission_format.csv` -> `data/smoke_submission_format.csv`
When testing your submission locally, the `data/` directory in the repository root will be mounted into the container. You can explicitly override this by setting the environment variable `WSFR_DATA_ROOT` to another directory path.
Your final submission should be a zip archive named with the extension `.zip` (for example, `submission.zip`). The root level of the `submission.zip` file must contain a `solution.py` which contains a `predict` function that returns predictions for a single site on a single issue date.

A template for `solution.py` is included at `examples/template/solution.py`. For more detail, see the "what to submit" section of the code submission page.
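As a rough illustration of the shape such a function takes, here is a hypothetical sketch. The authoritative signature is in `examples/template/solution.py`; the parameter names and the three-quantile return below are assumptions based on the challenge description, not the official interface:

```python
# Hypothetical sketch only -- consult examples/template/solution.py
# for the real signature required by the runtime.
def predict(site_id: str, issue_date: str) -> tuple[float, float, float]:
    """Return (0.10, 0.50, 0.90) quantile forecasts for one site
    on one issue date."""
    # A constant placeholder forecast, purely for illustration.
    return 100.0, 250.0, 400.0
```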
This section provides instructions on how to run your submission in the code execution container from your local machine. To simplify the steps, key processes have been defined in the `Makefile`. Commands from the `Makefile` are run with `make {command_name}`. The basic steps are:

```bash
make pull
make pack-submission
make test-submission
```

Run `make help` for more information about the available commands, as well as information on the official and locally built images that are available.
Here's the process in a bit more detail:
1. First, make sure you have set up the prerequisites.

2. Download the official competition Docker image:

   ```bash
   make pull
   ```

   > **Note:** If you have built a local version of the runtime image with `make build`, that image will take precedence over the pulled image when using any make commands that run a container. You can explicitly use the pulled image by setting the `SUBMISSION_IMAGE` shell/environment variable to the pulled image or by deleting all locally built images.

3. Save all of your submission files, including the required `solution.py` script, in the `submission_src` folder of the runtime repository. Make sure any needed model weights and other assets are saved in `submission_src` as well.

4. Create a `submission/submission.zip` file containing your code and model assets:

   ```bash
   make pack-submission
   #> mkdir -p submission/
   #> cd submission_src; zip -r ../submission/submission.zip ./*
   #>   adding: solution.py (deflated 73%)
   ```
5. Launch an instance of the competition Docker image, and run the same inference process that will take place in the official runtime:

   ```bash
   make test-submission
   ```

   This runs the container entrypoint script. First, it unzips `submission/submission.zip` into `/code_execution/src/` in the container. Then, it runs the `supervisor.py` script, which imports code from your submitted `solution.py`. In the local testing setting, the final submission is saved out to `submission/submission.csv` on your local machine.
When you run `make test-submission`, the logs will be printed to the terminal and written out to `submission/log.txt`. If you run into errors, use `log.txt` to determine what changes you need to make for your code to execute successfully.
An example code submission that runs successfully and generates valid predictions is provided in `examples/moving_average`. Please note that this model is not a realistic solution to the problem. You can use the example in place of steps 3 and 4 above. To pack this submission for testing or for submission to the platform, run:

```bash
make pack-example
```
When submitting on the platform, you will have the ability to submit "smoke tests". Smoke tests run on a reduced version of the test set in order to run more quickly. They will not be considered for prize evaluation and are intended to let you test your code for correctness.
Smoke tests use the `smoke_submission_format.csv` file instead of the full `submission_format.csv` file. When testing locally, a submission will run as a smoke test if the `IS_SMOKE` shell variable is set to a non-empty string. For example:

```bash
IS_SMOKE=1 make test-submission
```
You can read more about smoke tests on the code submission format page.
In the real competition runtime, all internet access is blocked except to the hosts documented in `allowed_hosts.txt`, corresponding to the approved data sources labeled with "Direct API access permitted" on the Approved data sources page.

The local test runtime does not impose any network restrictions; as a result, submissions that require internet access might succeed in local tests but fail in the actual competition runtime. It's up to you to make sure that your code does not make requests to unauthorized web resources. If your submission does not require internet access, you can test it with networking blocked by running:

```bash
BLOCK_INTERNET=true make test-submission
```
If you want to use a package that is not in the environment, you are welcome to make a pull request to this repository. If you're new to the GitHub contribution workflow, check out this guide by GitHub.
The runtime manages dependencies using conda environments and conda-lock. Here is a good general guide to conda environments. The official runtime uses Python 3.10.13 environments.
To submit a pull request for a new package:
1. Fork this repository.

2. Install conda-lock. See here for installation options.
3. Edit the conda environment YAML files, `runtime/environment-cpu.yml` and `runtime/environment-gpu.yml`. There are two ways to add a requirement:

   - **Conda package manager (preferred):** Add an entry to the `dependencies` section. This installs from the conda-forge channel using `conda install`. Conda performs robust dependency resolution with other packages in the `dependencies` section, so we can avoid package version conflicts.
   - **Pip package manager:** Add an entry to the `pip` section. This installs from PyPI using `pip`, and is an option for packages that are not available in a conda channel.
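As an illustration, the two kinds of entries look like this in a conda environment file (the `dependencies` and nested `pip` sections are the standard conda format; the package names below are placeholders, not real entries in the runtime files):

```yaml
dependencies:
  # Conda package manager (preferred): resolved from conda-forge
  - example-conda-package=1.2
  - pip:
      # Pip package manager: installed from PyPI
      - example-pypi-package==0.3
```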
4. Run `make update-lockfiles`. This will read `environment-cpu.yml` and `environment-gpu.yml`, resolve exact package versions, and save the pinned environments to `conda-lock-cpu.yml` and `conda-lock-gpu.yml`.

5. Locally test that the Docker image builds successfully for the CPU and GPU images:

   ```bash
   CPU_OR_GPU=cpu make build
   CPU_OR_GPU=gpu make build
   ```
6. Commit the changes to your forked repository. Ensure that your branch includes updated versions of all of the following:

   - `runtime/conda-lock-cpu.yml`
   - `runtime/conda-lock-gpu.yml`
   - `runtime/environment-cpu.lock`
   - `runtime/environment-cpu.yml`
   - `runtime/environment-gpu.lock`
   - `runtime/environment-gpu.yml`

7. Open a pull request from your branch to the `main` branch of this repository. Navigate to the Pull requests tab in this repository, and click the "New pull request" button. For more detailed instructions, check out GitHub's help page.
8. Once you open the pull request, we will use GitHub Actions to build the Docker images with your changes and run the tests in `runtime/tests`. For security reasons, administrators may need to approve the workflow run before it happens. Once it starts, the process can take up to 30 minutes, and may take longer if your build is queued behind others. You will see a section on the pull request page that shows the status of the tests and links to the logs.

9. You may be asked to submit revisions to your pull request if the tests fail or if a DrivenData staff member has feedback. Pull requests won't be merged until all tests pass and the team has reviewed and approved the changes.
A Makefile with several helpful shell recipes is included in the repository. The runtime documentation above uses it extensively. Running `make` by itself in your shell will list relevant Docker images and provide you the following list of available commands:

```
Available commands:

build               Builds the container locally
clean               Delete temporary Python cache and bytecode files
interact-container  Open an interactive bash shell within the running container (with network access)
pack-example        Creates a submission/submission.zip file from the source code in examples_src
pack-submission     Creates a submission/submission.zip file from the source code in submission_src
pull                Pulls the official container from Azure Container Registry
test-container      Ensures that your locally built image can import all the Python packages successfully when it runs
test-submission     Runs container using code from `submission/submission.zip` and data from WSFR_DATA_ROOT (default `data/`)
update-lockfiles    Updates runtime environment lockfiles
```