
Feature Request: add timeout parameter to the .fit() method #6596

Open

fingoldo opened this issue Aug 8, 2024 · 4 comments

Comments

fingoldo commented Aug 8, 2024

Adding a timeout parameter to the .fit() method, which would force the library to return the best solution found so far once the given number of seconds since the start of training has elapsed, would make it possible to satisfy training SLAs when a user has only a limited time budget to finish a model's training. It would also enable fair comparison of different hyperparameter combinations.

Reaching the timeout should have the same effect as reaching the maximum number of iterations, perhaps with an additional warning and/or an attribute set so that the reason the training job finished is clear to the end user.

jameslamb (Collaborator) commented Aug 8, 2024

Thanks for using LightGBM and taking the time to open this.

I'm -1 on adding this to LightGBM. I understand why this might be useful, but I don't think LightGBM is the right place for this logic. This would introduce some non-trivial maintenance burden and complexity.

This would be better handled outside of LightGBM, in code that you use to invoke it.

Since you mentioned .fit(), I assume you're specifically talking about lightgbm (the Python package for LightGBM). You could, for example, use asyncio's built-in support for timing out Python function calls: https://docs.python.org/3/library/asyncio-task.html#timeouts.
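
For example, something like this (a rough sketch, assuming Python 3.9+ for asyncio.to_thread; note that wait_for raises TimeoutError in the caller but can't kill the worker thread, so training would keep running in the background until it finishes):

import asyncio

import lightgbm as lgb
from sklearn.datasets import make_regression

def train_model():
    X, y = make_regression(n_samples=10_000, n_features=20)
    dtrain = lgb.Dataset(X, label=y)
    return lgb.train(
        params={"objective": "regression"},
        train_set=dtrain,
        num_boost_round=1000,
    )

async def main():
    try:
        # run the blocking train() call in a worker thread, and stop
        # waiting for its result after 2 seconds
        bst = await asyncio.wait_for(asyncio.to_thread(train_model), timeout=2.0)
    except asyncio.TimeoutError:
        # the worker thread is not interrupted; it keeps training in the
        # background, we just stop waiting for its result
        print("training exceeded the time budget")

asyncio.run(main())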

Alternatively, you could use a lightgbm callback for this purpose. Something like the following:

import lightgbm as lgb
from datetime import datetime
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000, n_features=20)
dtrain = lgb.Dataset(X, label=y)

class TimeoutCallback:
    def __init__(self, timeout_seconds: int):
        self.before_iteration = False
        self.timeout_seconds = timeout_seconds
        self._start = datetime.utcnow()

    def __call__(self, *args, **kwargs) -> None:
        if (datetime.utcnow() - self._start).total_seconds() > self.timeout_seconds:
            raise RuntimeError(f"timing out: elapsed time has exceeded {self.timeout_seconds} seconds")

bst = lgb.train(
    params={
        "objective": "regression",
        "num_leaves": 100
    },
    train_set=dtrain,
    num_boost_round=1000,
    callbacks=[TimeoutCallback(2)]
)

I just tested that with LightGBM 4.5.0 and saw the following:

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001736 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5100
[LightGBM] [Info] Number of data points in the train set: 10000, number of used features: 20
[LightGBM] [Info] Start training from score 0.256686
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jlamb/miniforge3/envs/lgb-dev/lib/python3.11/site-packages/lightgbm/engine.py", line 317, in train
    cb(
  File "<stdin>", line 8, in __call__
RuntimeError: timing out: elapsed time has exceeded 2 seconds

That's not perfect, as it only runs after each iteration and individual iterations could run for much longer on a realistic dataset. But hopefully that imperfection also shows one example of how complex this would be to implement in LightGBM.

I'm only one vote here though, maybe other maintainers will have a different perspective.

fingoldo commented Aug 8, 2024

I didn't think of this approach! If I'm using early stopping, are the best "weights" applied to the model after this exception is thrown? In other words, is best_iter set correctly? The goal would be to stay within the time budget without losing the training progress made up to that point.

jameslamb (Collaborator) commented Aug 8, 2024

Oh interesting! It wasn't clear to me that you would want to see training time out but also keep that model.

No, in the Python package best_iteration and other early stopping state are only set after early stopping is explicitly triggered, not along the way as training proceeds.

A Python exception is used to tell the training process that early stopping has been triggered, and to carry forward details like best iteration and evaluation results.

raise EarlyStopException(self.best_iter[i], self.best_score_list[i])

class EarlyStopException(Exception):
    """Exception of early stopping.

    Raise this from a callback passed in via keyword argument ``callbacks``
    in ``cv()`` or ``train()`` to trigger early stopping.
    """

except callback.EarlyStopException as earlyStopException:
    booster.best_iteration = earlyStopException.best_iteration + 1
    evaluation_result_list = earlyStopException.best_score
    break

You could rely on that behavior in your own callback, and have it raise a lightgbm.EarlyStopException instead of a RuntimeError like in my example. That'd allow you to treat "training has been running for too long" as a triggering condition for early stopping.
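
A sketch of that (the name TimeoutEarlyStopping is mine; this assumes you pass at least one validation set to train() so that env.evaluation_result_list is populated):

import lightgbm as lgb
from datetime import datetime, timezone

class TimeoutEarlyStopping:
    def __init__(self, timeout_seconds: int):
        # run after each iteration, like the TimeoutCallback above
        self.before_iteration = False
        self.timeout_seconds = timeout_seconds
        self._start = datetime.now(timezone.utc)

    def __call__(self, env) -> None:
        # env is the CallbackEnv namedtuple lightgbm passes to callbacks
        elapsed = (datetime.now(timezone.utc) - self._start).total_seconds()
        if elapsed > self.timeout_seconds:
            # treated by train() as early stopping: it sets
            # booster.best_iteration = env.iteration + 1 and keeps the model
            raise lgb.callback.EarlyStopException(
                env.iteration, env.evaluation_result_list
            )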

Alternatively... have you tried optuna? I haven't used this particular feature of it, but it looks like they directly offer a time_budget: https://optuna.readthedocs.io/en/v2.0.0/reference/generated/optuna.integration.lightgbm.LightGBMTuner.html

time_budget – A time budget for parameter tuning in seconds.

(that might be for the entire experiment though, not per-trial... I'm not sure)
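
Untested, but based on those docs usage would look roughly like this (note that in recent optuna releases the integration has moved to the separate optuna-integration package, so the exact import path and signature may differ):

import optuna.integration.lightgbm as opt_lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=10_000, n_features=20)
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
dtrain = opt_lgb.Dataset(X_train, label=y_train)
dvalid = opt_lgb.Dataset(X_valid, label=y_valid)

tuner = opt_lgb.LightGBMTuner(
    params={"objective": "regression", "metric": "l2"},
    train_set=dtrain,
    valid_sets=[dvalid],
    time_budget=600,  # seconds, per the docs quoted above
)
tuner.run()
print(tuner.best_params)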

fingoldo commented Aug 8, 2024

Hah! I'm planning to create my own hyperparameter tuner; that's one of the reasons I'm interested in this functionality. I can easily see how to do time budgeting at the level of the tuner, in the hyperparameter-checking loop after each combination has been tried, but the underlying estimator has to finish its training gracefully before that point, which for some combinations can take an extremely long time.

Writing a great hyperparameter optimizer is one more use case for this timeout feature. Now I think it's the early stopping callback I should subclass (I can hardly imagine training without early stopping).

Does it make sense to prepare a PR that adds a timeout parameter to the early stopping callback?

That said, it still seems more natural to me to be able to specify a timeout directly in the estimator's fit or init methods, the same way we do with n_iters; in this case we're just interested in a maximum number of seconds rather than a maximum number of trees.
