CAISO Wind Energy Forecast

This repository uses machine learning to forecast the amount of wind energy in the California electricity grid. With regard to the typical data science workflow, the repository covers modeling and limited feature engineering.

The project is implemented in R, using the Modeltime library for modeling and MLFlow for tracking experiments.

The motivation for this project is to explore the capabilities (and limits) of the Modeltime time series machine learning library and to get a feel for the challenges of modeling a part of the electricity grid.

Overview

The picture below gives an overview of the ETL and modeling process. To aid reproducibility, alphanumeric strings in labels (e.g. b2caff6) refer to the specific git commits from which the corresponding models were generated.

About the Data

All data have been acquired and aggregated by the author from public sources. Three datasets are available to aid the forecasting effort:

db_pull_production_data_raw_20201208.csv – 5-minute time series data stating the energy in the California grid by energy source, according to CAISO. Columns Time and Wind are relevant for this analysis. Wind is the target we are trying to forecast.
db_pull_weather_data_raw_20201208.csv – 1-hour weather time series data at 10 key locations for renewable energy generation (wind, solar) in California. The author determined these locations based on a geospatial analysis of renewable energy assets in the state; that analysis is outside the scope of this repository. Columns starting with 0_ through 4_ belong to wind-generating locations, sorted in descending order of generating capacity. This dataset can be used to create features for the model. Weather data have been acquired from Dark Sky.
db_pull_feature_gross_production_20201209.csv – 1-hour time series dataset with domain-informed features at 10 key locations (see above). The author generated the wind-related features (0_wind through 4_wind) by combining weather and turbine power curve information. From the weather information above, the author generated estimates of the available wind energy, adjusted for air density and hub height. These estimates were fed through an assumed power curve and multiplied by the assumed total capacity available at that key location. That analysis is outside the scope of this repository, but the generated features can be used for modeling.
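To make the description of the wind-related features more concrete, here is a minimal sketch of how such a feature could be derived. It is written in Python purely for illustration (the repository itself uses R) and is not the author's actual pipeline, which is outside the scope of this repository. All function names, the measurement and hub heights, the shear exponent, and the simplified power curve are assumptions chosen for this example.

```python
R_DRY_AIR = 287.05  # J/(kg*K), specific gas constant of dry air
RHO_REF = 1.225     # kg/m^3, standard sea-level air density


def air_density(pressure_pa, temperature_c):
    """Dry-air density from the ideal gas law."""
    return pressure_pa / (R_DRY_AIR * (temperature_c + 273.15))


def hub_height_speed(speed_ms, measured_height_m=10.0, hub_height_m=80.0,
                     shear_exponent=1 / 7):
    """Extrapolate wind speed to hub height with the power-law wind profile.

    Heights and shear exponent are assumed values for illustration.
    """
    return speed_ms * (hub_height_m / measured_height_m) ** shear_exponent


def density_adjusted_speed(speed_ms, rho):
    """Scale wind speed so that it carries the same power at reference density."""
    return speed_ms * (rho / RHO_REF) ** (1 / 3)


def power_curve_fraction(speed_ms, cut_in=3.0, rated=12.0, cut_out=25.0):
    """Hypothetical simplified power curve: fraction of rated power in [0, 1]."""
    if speed_ms < cut_in or speed_ms >= cut_out:
        return 0.0
    if speed_ms >= rated:
        return 1.0
    # Cubic ramp between cut-in and rated wind speed
    return (speed_ms ** 3 - cut_in ** 3) / (rated ** 3 - cut_in ** 3)


def wind_feature(speed_ms, pressure_pa, temperature_c, capacity_mw):
    """Estimated gross production at one location, in the unit of capacity_mw."""
    rho = air_density(pressure_pa, temperature_c)
    adjusted = density_adjusted_speed(hub_height_speed(speed_ms), rho)
    return power_curve_fraction(adjusted) * capacity_mw
```

For example, `wind_feature(6.0, 101325, 15, 500)` would estimate the gross production of a hypothetical 500 MW location for a measured wind speed of 6 m/s under standard atmospheric conditions.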

All data are fed through a feature engineering pipeline, for which relevant features have been selected using a random forest model.

About the Model

The model is a weighted ensemble of 8 tree-based models. Two of those models are based on Cubist, a boosted regression model (find a great presentation about Cubist by Max Kuhn here). Three models are based on XGBoost and another three are random forest models. These model types were chosen for their ability to incorporate a set of 188 features and for their good training performance in R.

Each model has been trained on about 143k data points. The model parameters have been selected after hyperparameter tuning, subject to 4-fold time series cross-validation. The number of folds was constrained by training performance.
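As an illustration of what 4-fold time series cross-validation means, the sketch below generates expanding-window splits in which every fold trains only on observations that precede its test block. It is written in Python for illustration only (the repository uses R), and the splitting scheme, the function name, and the equal-sized test blocks are assumptions; the exact resampling configuration used in the project may differ.

```python
def time_series_cv_splits(n_samples, n_folds=4, test_size=None):
    """Yield (train_indices, test_indices) for expanding-window time series CV.

    Each fold trains on all observations before its test block, so no
    future data leaks into training.
    """
    if test_size is None:
        # Assumed default: split the series into n_folds + 1 equal blocks
        test_size = n_samples // (n_folds + 1)
    if test_size < 1 or n_folds * test_size >= n_samples:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = n_samples - (n_folds - fold) * test_size
        test_end = test_start + test_size
        yield range(0, test_start), range(test_start, test_end)


# Example with roughly the number of data points mentioned above
for train_idx, test_idx in time_series_cv_splits(143_000, n_folds=4):
    print(len(train_idx), "training points,", len(test_idx), "test points")
```

Because every additional fold requires another full training run per hyperparameter candidate, the number of folds directly multiplies the training time, which is why it can become a limiting factor.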

A weighted average was chosen for ensembling the models instead of potentially more accurate methods like stacking because of performance constraints. The weights were chosen after assessing 20 different weight combinations through a Latin hypercube experimental design (find assessment here).
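The weight search described above can be sketched as follows. This is an illustrative Python example (the repository uses R), not the author's implementation: the function names, the normalization of weights to sum to 1, and the use of MAE as the selection metric are assumptions.

```python
import random


def latin_hypercube(n_samples, n_dims, rng):
    """Sample n_samples points in [0, 1)^n_dims.

    In every dimension, each of the n_samples equal-width strata
    contains exactly one point.
    """
    columns = []
    for _ in range(n_dims):
        strata = list(range(n_samples))
        rng.shuffle(strata)
        columns.append([(s + rng.random()) / n_samples for s in strata])
    return [[columns[d][i] for d in range(n_dims)] for i in range(n_samples)]


def normalize(weights):
    """Scale weights so that they sum to 1 (an assumption of this sketch)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not all be zero")
    return [w / total for w in weights]


def weighted_ensemble(predictions, weights):
    """Weighted average of per-model prediction lists of equal length."""
    n_steps = len(predictions[0])
    return [sum(w * p[t] for w, p in zip(weights, predictions))
            for t in range(n_steps)]


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def best_weights(predictions, actual, n_candidates=20, seed=0):
    """Pick the candidate weight vector with the lowest MAE."""
    rng = random.Random(seed)
    candidates = [normalize(c)
                  for c in latin_hypercube(n_candidates, len(predictions), rng)]
    return min(candidates,
               key=lambda w: mae(actual, weighted_ensemble(predictions, w)))
```

Compared with stacking, this approach needs no additional model training: each of the 20 candidates only requires a weighted sum of predictions that already exist.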

In summary, the modeling process was heavily constrained by performance considerations and available project time.

All model runs, including training and hyperparameter optimization, have been recorded in MLFlow. You can start a working MLFlow instance using the associated Dockerfile. All results have also been exported as CSV files to the "static" directory, where every file name corresponds to an MLFlow experiment.

Repository Structure

data – Data used or produced in the modeling process
- raw – Raw data (see above)
- processed – Processed data
docs – Documentation
models – Trained models (removed to save storage space)
- ens_level1_f9e6c40.rds – Final, trained ensemble model (see above)
notebooks – Notebooks for experiments and analyses
- ext – Notebooks used in external environments, for example for large-scale training in the cloud
- analyze_performance.rmd (HTML | view in browser) – Analysis of model performance
src – Source code for building models
- data – Source code for ETL process
- feature_engineering
- models_hyperparam_tuning
- models_training
- util – Utility code related to building models
util – Utility tools
- mlflow – Data related to MLFlow
  - mlruns – MLFlow data directory
  - static – Static export of tracked MLFlow experiments
  - Dockerfile – Dockerfile for running MLFlow server
  - environment.yml – Conda environment file to create MLFlow environment

Model Performance

This section summarizes model performance by use case.

Time Series Forecasting Use Case

The table below shows the performance of the models at time series forecasting. Comparing the metrics on the training and test sets shows that the Cubist models overfit heavily and that the random forest models overfit to some extent as well.

Cross-validation (CV) has not been performed for the ensemble model due to performance constraints.

Model name	CV (folds)	MAE (train)	RMSE (train)	R² (train)	MAE (test)	RMSE (test)	R² (test)	MAE ratio (train/test)
ens_level1_f9e6c40	❌				469	599	0.779
cubist_level0_b2caff6_1	✅ 4	15	32	0.999	493	634	0.686	0.03
cubist_level0_b2caff6_3	✅ 4	20	39	0.999	507	656	0.670	0.04
rf_level0_cc50409_1	✅ 4	219	300	0.943	439	559	0.755	0.50
rf_level0_cc50409_2	✅ 4	231	315	0.937	440	560	0.753	0.53
rf_level0_cc50409_3	✅ 4	37	57	0.998	444	567	0.751	0.08
xgboost_level0_9dc6cbe_1	✅ 4	580	772	0.759	640	799	0.678	0.91
xgboost_level0_9dc6cbe_2	✅ 4	519	689	0.803	598	752	0.706	0.87
xgboost_level0_9dc6cbe_3	✅ 4	565	752	0.763	628	783	0.670	0.90
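The last column of the table can be reproduced from the MAE columns. The short Python sketch below (for illustration only; the repository uses R) recomputes the train/test MAE ratio and sorts the models by it. The values are copied from the table above; the helper names are made up for this example. A ratio close to 1 means similar errors on both sets, while a ratio close to 0 means the training error is far smaller than the test error, which points to overfitting.

```python
# (model name, MAE on training set, MAE on test set), copied from the table above
MAE_SCORES = [
    ("cubist_level0_b2caff6_1", 15, 493),
    ("cubist_level0_b2caff6_3", 20, 507),
    ("rf_level0_cc50409_1", 219, 439),
    ("rf_level0_cc50409_2", 231, 440),
    ("rf_level0_cc50409_3", 37, 444),
    ("xgboost_level0_9dc6cbe_1", 580, 640),
    ("xgboost_level0_9dc6cbe_2", 519, 598),
    ("xgboost_level0_9dc6cbe_3", 565, 628),
]


def mae_ratio(mae_train, mae_test):
    """Ratio of training MAE to test MAE."""
    return mae_train / mae_test


def rank_by_overfitting(scores):
    """Return (name, ratio) pairs, strongest overfitting (lowest ratio) first."""
    ratios = [(name, mae_ratio(train, test)) for name, train, test in scores]
    return sorted(ratios, key=lambda item: item[1])


for name, ratio in rank_by_overfitting(MAE_SCORES):
    print(f"{name}: {ratio:.2f}")
```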

Time Series Peak Forecasting Use Case

Forecasting the peaks of the wind energy time series in the California grid is a specific use case of this model. The analyze_performance.rmd (HTML | view in browser) notebook investigates this case in detail. In summary, in 75% of all cases, the peaks predicted by the model miss the energy in the actual peaks within a 26-hour window by no more than 14%.
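One possible way to compute a statistic of this kind is sketched below in Python (for illustration only; the repository uses R). This is not the method used in the notebook: the definition of a "miss" as the relative difference between the highest prediction in a window around an actual peak and the actual peak value, the centering of the window, and all function names are assumptions.

```python
import statistics


def peak_miss(actual, predicted, peak_index, window_hours=26):
    """Relative miss of the predicted peak near one actual peak.

    Looks at the hourly predictions in a window of window_hours samples
    around peak_index (fewer at the edges of the series) and compares the
    highest prediction with the actual peak value.
    """
    actual_peak = actual[peak_index]
    if actual_peak <= 0:
        raise ValueError("actual peak must be positive")
    half = window_hours // 2
    lo = max(0, peak_index - half)
    hi = min(len(predicted), peak_index + half)
    predicted_peak = max(predicted[lo:hi])
    return abs(predicted_peak - actual_peak) / actual_peak


def third_quartile_miss(actual, predicted, peak_indices, window_hours=26):
    """75th percentile of the relative misses; needs at least two peaks."""
    misses = [peak_miss(actual, predicted, i, window_hours)
              for i in peak_indices]
    return statistics.quantiles(misses, n=4)[2]
```

Under these assumptions, a result of 0.14 from `third_quartile_miss` would correspond to the statement that the predicted peaks are off by no more than 14% in 75% of all cases.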

Contribution

I welcome any contribution or feedback. Please share your comments via the Issues tab on GitHub.

License and Attributions

This project is released under the MIT License.

Please note that raw data as provided in db_pull_production_data_raw_20201208.csv have been generated by the California ISO.

Please also note that weather data as provided in db_pull_weather_data_raw_20201208.csv have been extracted from Dark Sky and are subject to its Terms of Use, which allow use only for "personal, non-commercial purposes".

For the social preview picture, the California bear is from Vecteezy.com. The R logo is used in its original form under the CC BY-SA 4.0 license and is (C) 2016 by The R Foundation. The Modeltime logo is taken from the Modeltime GitHub repository and is subject to the MIT License.
