This repository is about using machine learning to forecast the amount of wind energy in the California electricity grid. With regard to the typical data science workflow, the repository covers modeling and limited feature engineering.
This project is implemented in R, using the Modeltime library for modeling and MLFlow for tracking experiments.
The motivation for this project is to explore the capabilities (and limits) of the Modeltime time series machine learning library and to get a feeling for the challenges of modeling a part of the electricity grid.
The picture below gives an overview of the ETL and modeling process. To aid reproducibility, alphanumeric strings in labels (e.g. `b2caff6`) refer to the specific git commits on which these models were generated.
All data have been acquired and aggregated by the author from public sources. Three datasets are available to aid the forecasting effort:
- `db_pull_production_data_raw_20201208.csv` - 5-min time series data stating energy in the California grid by energy source, according to CAISO. Columns `Time` and `Wind` are relevant for this analysis. `Wind` is what we are trying to forecast.
- `db_pull_weather_data_raw_20201208.csv` - 1-hour weather time series data at 10 key locations for renewable energy generation (wind, solar) in California. The author determined these locations based on geospatial analysis of renewable energy assets in the state, an analysis that is outside the scope of this repository. Columns starting with `0_` through `4_` belong to wind-generating locations, ordered in descending order of generating capacity. This dataset can be used to create features for the model. Weather data have been acquired from Dark Sky.
- `db_pull_feature_gross_production_20201209.csv` - 1-hour time series dataset with domain-informed features at 10 key locations (see above). The author generated the wind-related features (`0_wind` through `4_wind`) by combining weather and turbine power curve information. From the weather information above, the author generated air density and hub height adjusted estimates of the available wind energy. That information was fed through an assumed power curve and multiplied by the assumed total capacity available at that key location. That analysis is outside the scope of this repository, but the generated features can be used for modeling.
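The wind feature described in the last item can be illustrated with a minimal sketch. It is written in Python purely for illustration (the project itself uses R) and is not the author's implementation: the ideal-gas density formula, the power-law height extrapolation, the 80 m hub height, the 1/7 shear exponent, and the cubic power curve with its cut-in, rated, and cut-out speeds are all assumptions, and every function name is hypothetical.

```python
R_DRY_AIR = 287.05  # J/(kg*K), specific gas constant of dry air
RHO_REF = 1.225     # kg/m^3, standard sea-level air density


def air_density(pressure_hpa, temperature_c):
    """Ideal-gas air density in kg/m^3 from pressure (hPa) and temperature (deg C)."""
    return pressure_hpa * 100.0 / (R_DRY_AIR * (temperature_c + 273.15))


def hub_height_speed(speed_ref, hub_height_m=80.0, ref_height_m=10.0, alpha=1 / 7):
    """Extrapolate a measured wind speed to hub height with the power-law profile."""
    return speed_ref * (hub_height_m / ref_height_m) ** alpha


def density_adjusted_speed(speed, rho):
    """Rescale speed so that a power curve defined at reference density can be reused."""
    return speed * (rho / RHO_REF) ** (1 / 3)


def power_curve(speed, cut_in=3.0, rated=12.0, cut_out=25.0):
    """Assumed normalized power curve: cubic ramp from cut-in to rated speed."""
    if speed < cut_in or speed >= cut_out:
        return 0.0
    if speed >= rated:
        return 1.0
    return (speed ** 3 - cut_in ** 3) / (rated ** 3 - cut_in ** 3)


def wind_feature(speed_10m, pressure_hpa, temperature_c, capacity_mw):
    """Estimated gross production (MW) at one location for one hour."""
    rho = air_density(pressure_hpa, temperature_c)
    speed = density_adjusted_speed(hub_height_speed(speed_10m), rho)
    return capacity_mw * power_curve(speed)


# Illustrative inputs only: 6 m/s at 10 m, 1005 hPa, 18 deg C, 500 MW capacity
print(round(wind_feature(6.0, 1005.0, 18.0, 500.0), 1))
```

The structure mirrors the three steps named above: adjust the weather reading for density and hub height, pass it through a power curve, and scale by capacity.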
All data are fed through a feature engineering pipeline, for which relevant features have been selected using a random forest model.
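Because the production data are sampled every 5 minutes while the weather and feature data are hourly, any such pipeline has to bring the series onto a common time grid. The sketch below, in Python for illustration only, averages 5-minute readings into hourly values. Averaging (rather than, say, taking the last reading) is an assumption, and `to_hourly_mean` is a hypothetical helper, not a function from this repository.

```python
from collections import defaultdict
from datetime import datetime


def to_hourly_mean(records):
    """Average (timestamp, value) pairs into buckets keyed by the start of each hour."""
    buckets = defaultdict(list)
    for timestamp, value in records:
        hour_start = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour_start].append(value)
    # Return the hours in chronological order
    return {hour: sum(values) / len(values) for hour, values in sorted(buckets.items())}


# Made-up readings: twelve 5-minute values for one hour
readings = [(datetime(2020, 12, 8, 0, 5 * i), 1000.0 + 10 * i) for i in range(12)]
print(to_hourly_mean(readings))
```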
The model is a weighted ensemble of 8 tree-based models. Two of those models are based on Cubist, a boosted regression model (find a great presentation about Cubist by Max Kuhn here). Three models are based on XGBoost and another three are random forest models. These model types were chosen for their ability to handle a set of 188 features and for their acceptable training speed in R.
Each model has been trained on about 143k data points. The model parameters have been selected through hyperparameter tuning with 4-fold time series cross-validation. The number of folds was limited by the time required for training.
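Time series cross-validation differs from ordinary k-fold in that the training data must always precede the assessment data. The sketch below, in Python for illustration only, shows one common scheme: an expanding training window with equally sized assessment blocks. The text above does not state which resampling scheme or block sizes were used, so both are assumptions, and `time_series_folds` is a hypothetical helper.

```python
def time_series_folds(n_samples, n_folds=4, assess_size=None):
    """Return (train_range, assess_range) pairs, oldest fold first.

    Every training range starts at index 0 and ends right before its
    assessment range, so no fold is trained on data from its own future.
    """
    if assess_size is None:
        assess_size = n_samples // (n_folds + 1)
    folds = []
    for k in range(n_folds):
        assess_end = n_samples - k * assess_size
        assess_start = assess_end - assess_size
        folds.append((range(0, assess_start), range(assess_start, assess_end)))
    return folds[::-1]


# About 143k observations and 4 folds, as described above
for train, assess in time_series_folds(143_000, n_folds=4):
    print(f"train: 0..{train[-1]}, assess: {assess[0]}..{assess[-1]}")
```

Each additional fold means one more full training run per hyperparameter candidate, which is why the number of folds trades off directly against training time.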
A weighted average was chosen for ensembling the models because potentially more accurate methods like stacking were too computationally expensive. The weights were chosen after assessing 20 different weight combinations through a Latin hypercube experimental design (find assessment here).
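The weight search can be sketched as follows, in Python for illustration only. A Latin hypercube design splits each weight's range into as many equal strata as there are candidates and places exactly one candidate in every stratum, so 20 candidates cover each weight's range evenly. Selecting the winner by MAE, the `[0, 1)` weight range, and all function names are assumptions rather than details taken from this project.

```python
import random


def latin_hypercube(n_samples, n_dims, seed=0):
    """Return n_samples points in [0, 1)^n_dims with one point per stratum in each dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(n_dims):
        # One random value inside each of the n_samples equal-width strata
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(column)  # pair strata across dimensions in random order
        columns.append(column)
    return [list(point) for point in zip(*columns)]


def weighted_average(predictions, weights):
    """Combine per-model prediction lists into one forecast, normalizing the weights."""
    total = sum(weights)
    return [sum(w * p for w, p in zip(weights, step)) / total for step in zip(*predictions)]


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def best_weights(predictions, actual, n_candidates=20, seed=0):
    """Evaluate n_candidates weight vectors and return the one with the lowest MAE."""
    candidates = latin_hypercube(n_candidates, len(predictions), seed)
    return min(candidates, key=lambda w: mae(actual, weighted_average(predictions, w)))


# Made-up example: two models, the first one much closer to the actual values
actual = [100.0, 200.0, 300.0]
predictions = [[110.0, 190.0, 310.0], [300.0, 50.0, 600.0]]
print(best_weights(predictions, actual))
```

Unlike stacking, this approach needs no additional model fit: each candidate only requires recomputing a weighted sum of predictions that already exist.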
In summary, the modeling process was heavily constrained by computational performance and available project time.
All model runs, including training and hyperparameter optimization, have been recorded in MLFlow. You can start a working MLFlow instance using the associated Dockerfile. All results have also been exported as CSV files to the "static" directory, where every file name corresponds to an MLFlow experiment.
- data - Data used or produced in the modeling process
- docs - Documentation
- models - Trained models (removed to save storage space)
  - ens_level1_f9e6c40.rds - Final, trained ensemble model (see above)
- notebooks - Notebooks for experiments and analyses
  - ext - Notebooks used in external environments, for example for large-scale training in the cloud
  - analyze_performance.rmd (HTML | view in browser) - Analysis of model performance
- src - Source code for building models
  - data - Source code for ETL process
  - feature_engineering
  - models_hyperparam_tuning
  - models_training
  - util - Utility code related to building models
- util - Utility tools
  - mlflow - Data related to MLFlow
    - mlruns - MLFlow data directory
    - static - Static export of tracked MLFlow experiments
    - Dockerfile - Dockerfile for running MLFlow server
    - environment.yml - Conda environment file to create MLFlow environment
This section summarizes model performance by use case.
The table below shows the performance of the models at time series forecasting. Comparing training and test metrics shows that the Cubist models overfit heavily (their training MAE is only 3-4% of their test MAE) and that the random forest models overfit to some extent as well.
Cross-validation (CV) has not been performed for the ensemble model due to performance constraints.
Model Name | CV, folds | mae (train) | rmse (train) | rsq (train) | mae (test) | rmse (test) | rsq (test) | ratio mae train/test |
---|---|---|---|---|---|---|---|---|
ens_level1_f9e6c40 | no | 469 | 599 | 0.779 | | | | |
cubist_level0_b2caff6_1 | yes, 4 | 15 | 32 | 0.999 | 493 | 634 | 0.686 | 0.03 |
cubist_level0_b2caff6_3 | yes, 4 | 20 | 39 | 0.999 | 507 | 656 | 0.670 | 0.04 |
rf_level0_cc50409_1 | yes, 4 | 219 | 300 | 0.943 | 439 | 559 | 0.755 | 0.50 |
rf_level0_cc50409_2 | yes, 4 | 231 | 315 | 0.937 | 440 | 560 | 0.753 | 0.53 |
rf_level0_cc50409_3 | yes, 4 | 37 | 57 | 0.998 | 444 | 567 | 0.751 | 0.08 |
xgboost_level0_9dc6cbe_1 | yes, 4 | 580 | 772 | 0.759 | 640 | 799 | 0.678 | 0.91 |
xgboost_level0_9dc6cbe_2 | yes, 4 | 519 | 689 | 0.803 | 598 | 752 | 0.706 | 0.87 |
xgboost_level0_9dc6cbe_3 | yes, 4 | 565 | 752 | 0.763 | 628 | 783 | 0.670 | 0.90 |
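For reference, the metrics in the table can be computed as in the sketch below, written in Python for illustration only. One assumption needs flagging: `rsq` is taken to be the squared Pearson correlation between actual and predicted values, which is how the yardstick R package defines its `rsq` metric; the table itself does not state which definition was used. The ratio in the last column is the training MAE divided by the test MAE, so values far below 1 indicate overfitting.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rsq(actual, predicted):
    """Squared Pearson correlation between actual and predicted values (assumed definition)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_a * var_p)


def mae_ratio(mae_train, mae_test):
    """Training MAE divided by test MAE; values near 1 suggest little overfitting."""
    return mae_train / mae_test


# Values taken from the table: cubist_level0_b2caff6_1 and xgboost_level0_9dc6cbe_1
print(round(mae_ratio(15, 493), 2))   # 0.03
print(round(mae_ratio(580, 640), 2))  # 0.91
```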
Forecasting the peaks of the wind energy time series in the California grid is a specific use case of this model. The analyze_performance.rmd (HTML | view in browser) notebook investigates this case in detail. In summary, in 75% of all cases the energy in a predicted peak differs from the energy in the corresponding actual peak, measured within a 26-hour window, by no more than 14%.
I welcome any contribution or feedback. Please share your comments via the Issues tab on GitHub.
This project is released under the MIT License.
Please note that the raw data provided in `db_pull_production_data_raw_20201208.csv` have been generated by the California ISO.
Please also note that the weather data provided in `db_pull_weather_data_raw_20201208.csv` have been extracted from Dark Sky and are subject to its Terms of Use, which allow use only for "personal, non-commercial purposes".
For the social preview picture, the California bear is from Vecteezy.com. The R Logo is used in its original form under the CC BY-SA 4.0, and is (C) 2016 by The R Foundation. The Modeltime logo is taken from the Modeltime GitHub repository and is subject to the MIT License.