Reorganize the code base (#904)
lbittarello authored Feb 4, 2025
1 parent 2ecff36 commit 4f9db8f
Showing 50 changed files with 3,358 additions and 4,878 deletions.
1 change: 0 additions & 1 deletion .gitattributes
Original file line number Diff line number Diff line change
@@ -1,3 +1,2 @@
# GitHub syntax highlighting
pixi.lock linguist-language=YAML

4 changes: 2 additions & 2 deletions .github/workflows/build_wheels.yml
@@ -52,7 +52,7 @@ jobs:
if: github.event_name == 'release' && github.event.action == 'published'
needs: [build_wheels, build_sdist]
runs-on: ubuntu-latest
environment:
environment:
name: test_release
url: https://test.pypi.org/p/glum
permissions:
@@ -70,7 +70,7 @@ jobs:
if: github.event_name == 'release' && github.event.action == 'published'
needs: [build_wheels, build_sdist, upload_testpypi]
runs-on: ubuntu-latest
environment:
environment:
name: release
url: https://pypi.org/p/glum
permissions:
1 change: 0 additions & 1 deletion .gitignore
@@ -150,4 +150,3 @@ pkgs/*
# pixi environments
.pixi
*.egg-info

19 changes: 19 additions & 0 deletions .pre-commit-config.yaml
@@ -45,3 +45,22 @@ repos:
language: system
types: [python]
require_serial: true
# pre-commit-hooks
- id: trailing-whitespace-fixer
name: trailing-whitespace-fixer
entry: pixi run -e lint trailing-whitespace-fixer
language: system
types: [text]
exclude: (\.py|README.md)$
- id: end-of-file-fixer
name: end-of-file-fixer
entry: pixi run -e lint end-of-file-fixer
language: system
types: [text]
exclude: (\.py|changelog.rst)$
- id: check-merge-conflict
name: check-merge-conflict
entry: pixi run -e lint check-merge-conflict --assume-in-merge
language: system
types: [text]
exclude: \.py$
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,4 +1,4 @@
Copyright 2020-2021 QuantCo Inc, Christian Lorentzen
Copyright 2020-2021 QuantCo Inc, Christian Lorentzen

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

3 changes: 3 additions & 0 deletions NOTICE
@@ -0,0 +1,3 @@
Modified from code submitted as a PR to sklearn: https://github.com/scikit-learn/scikit-learn/pull/9405

Original attribution from: https://github.com/scikit-learn/scikit-learn/pull/9405/files#diff-38e412190dc50455611b75cfcf2d002713dcf6d537a78b9a22cc6b1c164390d1
1 change: 0 additions & 1 deletion build_tools/prepare_macos_wheel.sh
@@ -10,4 +10,3 @@ else
fi

conda create -n build -c $CONDA_CHANNEL 'llvm-openmp=11'

19 changes: 9 additions & 10 deletions docs/contributing.rst
@@ -1,7 +1,7 @@
Contributing and Development
====================================

Hello! And thanks for exploring glum more deeply. Please see the issue tracker and pull request tabs on GitHub for information about what is currently happening. Feel free to post an issue if you'd like to get involved in development and don't really know where to start -- we can give some advice.
Hello! And thanks for exploring glum more deeply. Please see the issue tracker and pull request tabs on GitHub for information about what is currently happening. Feel free to post an issue if you'd like to get involved in development and don't really know where to start -- we can give some advice.

We welcome contributions of any kind!

@@ -25,7 +25,7 @@ Pull request process
Releases
--------------------------------------------------

- We make package releases infrequently, but usually any time a new non-trivial feature is contributed or a bug is fixed. To make a release, just open a PR that updates the change log with the current date. Once that PR is approved and merged, you can create a new release on `GitHub <https://github.com/Quantco/glum/releases/new>`_. Use the version from the change log as the tag and copy the change log entry into the release description.
- We make package releases infrequently, but usually any time a new non-trivial feature is contributed or a bug is fixed. To make a release, just open a PR that updates the change log with the current date. Once that PR is approved and merged, you can create a new release on `GitHub <https://github.com/Quantco/glum/releases/new>`_. Use the version from the change log as the tag and copy the change log entry into the release description.

Install for development
--------------------------------------------------
@@ -75,10 +75,10 @@ The test suite is in ``tests/``. A pixi task is available to run the tests:
Golden master tests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We use golden master testing to preserve correctness. The results of many different GLM models have been saved. After an update, the tests will compare the new output to the saved models. Any significant deviation will result in a test failure. This doesn't strictly mean that the update was wrong. In case of a bug fix, it's possible that the new output will be more accurate than the old output. In that situation, the golden master results can be overwritten as explained below.
We use golden master testing to preserve correctness. The results of many different GLM models have been saved. After an update, the tests will compare the new output to the saved models. Any significant deviation will result in a test failure. This doesn't strictly mean that the update was wrong. In case of a bug fix, it's possible that the new output will be more accurate than the old output. In that situation, the golden master results can be overwritten as explained below.

There are two sets of golden master tests, one with artificial data and one directly using the benchmarking problems from :mod:`glum_benchmarks`. For both sets of tests, creating the golden master and the tests definition are located in the same file. Calling the file with pytest will run the tests while calling the file as a python script will generate the golden master result. When creating the golden master results, both scripts accept the ``--overwrite`` command line flag. If set, the existing golden master results will be overwritten. Otherwise, only the new problems will be run.
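The golden-master pattern described above can be sketched generically in a few lines (an illustration of the idea only, not glum's actual test code; every name here is hypothetical):

```python
import json
import os


def save_golden(path, results, overwrite=False):
    # Write golden-master results; keep existing entries unless overwrite=True,
    # mirroring the --overwrite flag described above.
    existing = {}
    if os.path.exists(path) and not overwrite:
        with open(path) as f:
            existing = json.load(f)
    for key, value in results.items():
        if overwrite or key not in existing:
            existing[key] = value
    with open(path, "w") as f:
        json.dump(existing, f)


def check_against_golden(path, results, tol=1e-8):
    # Compare fresh results to the saved ones; any significant deviation fails.
    with open(path) as f:
        golden = json.load(f)
    return all(abs(golden[key] - value) <= tol for key, value in results.items())
```

Running the test file under pytest corresponds to `check_against_golden`, while running the same file as a script (optionally with `--overwrite`) corresponds to `save_golden`.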

Skipping the slow tests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -102,7 +102,7 @@ Building a conda package
To use the package in another project, we distribute it as a conda package.
For building the package locally, you can use the following command:

::
::

conda build conda.recipe

@@ -121,7 +121,7 @@ Then, navigate to `<http://localhost:8000>`_ to view the documentation.

Alternatively, if you install `entr <http://eradman.com/entrproject/>`_, then you can auto-rebuild the documentation any time a file changes with:

::
::

cd docs
./dev
@@ -141,23 +141,23 @@ If you are a newbie to Sphinx, the links below may help get you up to speed on s
Where to start looking in the source?
-------------------------------------

The primary user interface of ``glum`` consists of the :class:`GeneralizedLinearRegressor <glum.GeneralizedLinearRegressor>` and :class:`GeneralizedLinearRegressorCV <glum.GeneralizedLinearRegressorCV>` classes via their constructors and the :meth:`fit() <glum.GeneralizedLinearRegressor.fit>` and :meth:`predict() <glum.GeneralizedLinearRegressor.predict>` functions. Those are the places to start looking if you plan to change the system in some way.
The primary user interface of ``glum`` consists of the :class:`GeneralizedLinearRegressor <glum.GeneralizedLinearRegressor>` and :class:`GeneralizedLinearRegressorCV <glum.GeneralizedLinearRegressorCV>` classes via their constructors and the :meth:`fit() <glum.GeneralizedLinearRegressor.fit>` and :meth:`predict() <glum.GeneralizedLinearRegressor.predict>` functions. Those are the places to start looking if you plan to change the system in some way.

What follows is a high-level summary of the source code structure. For more details, please look in the documentation and docstrings of the relevant classes, functions and methods.

* ``_glm.py`` - This is the main entrypoint and implements the core logic of the GLM. Most of the code in this file handles input arguments and prepares the data for the GLM fitting algorithm.
* ``_glm_cv.py`` - This is the entrypoint for the cross-validated GLM implementation. It depends on much of the code in ``_glm.py`` and only modifies the sections necessary for training many models with different regularization parameters.
* ``_solvers.py`` - This contains the bulk of the IRLS and L-BFGS algorithms for training GLMs.
* ``_cd_fast.pyx`` - This is a Cython implementation of the coordinate descent algorithm used for fitting L1 penalty GLMs. Note the ``.pyx`` extension indicating that it is a Cython source file.
* ``_distribution.py`` - definitions of the distributions that can be used. Includes Normal, Poisson, Gamma, InverseGaussian, Tweedie, Binomial and GeneralizedHyperbolicSecant distributions.
* ``_distribution.py`` - definitions of the distributions that can be used. Includes Normal, Poisson, Gamma, InverseGaussian, Tweedie, Binomial and GeneralizedHyperbolicSecant distributions.
* ``_link.py`` - definitions of the link functions that can be used. Includes identity, log, logit and Tweedie link functions.
* ``_functions.pyx`` - This is a Cython implementation of the log likelihoods, gradients and Hessians for several popular distributions.
* ``_util.py`` - This contains a few general purpose linear algebra routines that serve several other modules and don't fit well elsewhere.

The GLM benchmark suite
------------------------

Before deciding to build a custom library for our purposes, we did a thorough investigation of the various open-source GLM implementations available. This resulted in an extensive suite of benchmarks for comparing the correctness, runtime and feature availability of these libraries.
Before deciding to build a custom library for our purposes, we did a thorough investigation of the various open-source GLM implementations available. This resulted in an extensive suite of benchmarks for comparing the correctness, runtime and feature availability of these libraries.

The benchmark suite has two command line entrypoints:

@@ -167,4 +167,3 @@ The benchmark suite has two command line entrypoints:
Both of these CLI tools take a range of arguments that specify the details of the benchmark problems and which libraries to benchmark.

For more details on the benchmark suite, see the README in the source at ``src/glum_benchmarks/README.md``.

8 changes: 4 additions & 4 deletions docs/getting_started/getting_started.md
@@ -14,7 +14,7 @@ jupyter:
---

<!-- #region tags=[] -->
# Getting Started: fitting a Lasso model
# Getting Started: fitting a Lasso model

The purpose of this tutorial is to show the basics of `glum`. It assumes a working knowledge of Python, regularized linear models, and machine learning. The API is very similar to scikit-learn. After all, `glum` is based on a fork of scikit-learn.

@@ -62,7 +62,7 @@ X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(

## GLM basics: fitting and predicting using the normal family

We'll use `glum.GeneralizedLinearRegressor` to predict the house prices using the available predictors.
We'll use `glum.GeneralizedLinearRegressor` to predict the house prices using the available predictors.

We set three key parameters:

@@ -118,7 +118,7 @@ which we interact with as in the example above.

## Fitting a GLM with cross validation

Now, we fit using automatic cross validation with `glum.GeneralizedLinearRegressorCV`. This mirrors the commonly used `cv.glmnet` function.
Now, we fit using automatic cross validation with `glum.GeneralizedLinearRegressorCV`. This mirrors the commonly used `cv.glmnet` function.

Some important parameters:

@@ -130,7 +130,7 @@ Some important parameters:
3. If `min_alpha_ratio` is set, create a path where the ratio of
`min_alpha / max_alpha = min_alpha_ratio`.
4. If none of the above parameters are set, use a `min_alpha_ratio`
of 1e-6.
of 1e-6.
- `l1_ratio`: for `GeneralizedLinearRegressorCV`, if you pass `l1_ratio` as an array, the `fit` method will choose the best value of `l1_ratio` and store it as `self.l1_ratio_`.
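The alpha-path rules above imply a decreasing geometric grid of penalty strengths; a sketch of how such a path could be built (an illustration of the stated rules, not glum's actual code):

```python
import numpy as np


def alpha_path(max_alpha, n_alphas=100, min_alpha=None, min_alpha_ratio=None):
    # Precedence mirrors the rules above: an explicit min_alpha wins,
    # then min_alpha_ratio, then the default ratio of 1e-6.
    if min_alpha is None:
        ratio = min_alpha_ratio if min_alpha_ratio is not None else 1e-6
        min_alpha = max_alpha * ratio
    # Geometrically spaced, from the strongest penalty down to the weakest.
    return np.geomspace(max_alpha, min_alpha, n_alphas)
```

Passing the resulting array as `alphas` (rule 1) would short-circuit all of this, since explicit alphas take precedence.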

2 changes: 1 addition & 1 deletion docs/index.rst
Expand Up @@ -15,7 +15,7 @@ Welcome to glum's documentation!

.. image:: _static/headline_benchmark.png
:width: 600

We suggest visiting the :doc:`Installation<install>` and :doc:`Getting Started<getting_started/getting_started>` sections first.

.. toctree::
2 changes: 1 addition & 1 deletion docs/make.bat
@@ -32,4 +32,4 @@ goto end
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%

:end
popd
popd