
Logging enhancements #19580

Open
ksuderman opened this issue Feb 10, 2025 · 2 comments
@ksuderman (Contributor) commented Feb 10, 2025

This is a planning and discussion issue for enhancements I would like to make to Galaxy's logging system.

TL;DR

I want to add logger.trace() calls to the Galaxy codebase, but before doing that I would like to get some infrastructure in place.

What is proposed

  1. Enhance support for the trace() method that is already present in galaxy.util.custom_logging
  2. Add a simplified logging configuration section to either galaxy.yml or a separate logging config file
  3. Add API endpoints to query and set logging levels at runtime
  4. Add an admin panel that gives administrators a GUI for setting log levels on a running instance.

What is not proposed

  1. Providing an API that allows arbitrary logger reconfiguration. We only need (want) to change the logging level at runtime. No other configuration changes will be allowed.
  2. Changing the way loggers are currently configured. The logging configuration changes proposed here work transparently with the current logging configuration methods.
  3. Doing everything now. This is a first pass and we can add more features and enhancements later.

Proof of concept

I have a proof of concept running that implements the first three items (everything but the admin panel). The code is available in the logging-enhancements branch of my Galaxy fork. See this draft PR for details.

However, before I start submitting PRs there are a number of design decisions to resolve.

The trace() method

My proof of concept is based on the collective wisdom of StackOverflow.

If we monkeypatch Python's logging library early enough in the code loading process then the trace() method is automagically available to all loggers. There are ~750 configured loggers in a running Galaxy instance.

This works well in a running Galaxy instance, but may present problems for Galaxy packages meant to be used as libraries. Calling galaxy.util.logging.addTraceLoggingLevel() in each package's __init__.py may be sufficient. The goal is to make it as easy as possible for developers to add trace() calls to their code without requiring a special logger class.
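
For reference, the StackOverflow recipe boils down to something like the following sketch; the level number (5, just below DEBUG) and the details of the helper here are illustrative assumptions, not necessarily what the branch does.

import logging

TRACE = logging.DEBUG - 5  # 5; any value below DEBUG (10) works

def addTraceLoggingLevel():
    logging.addLevelName(TRACE, "TRACE")

    def trace(self, message, *args, **kwargs):
        # Skip the formatting work entirely unless TRACE is enabled.
        if self.isEnabledFor(TRACE):
            self._log(TRACE, message, args, **kwargs)

    # Patching the Logger class makes trace() available on every logger,
    # including the ~750 already configured in a running instance.
    logging.Logger.trace = trace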

Logging configuration

The proof of concept implementation adds a logging_levels section to the galaxy.yml file. This could also be done in a separate configuration file. The purpose of the logging_levels section is to provide an easier way to set the logging levels for groups of loggers in a single statement.

logging_levels:
  galaxy.*: ERROR
  galaxy.jobs.runners.*: DEBUG
  galaxy.jobs.runners.kubernetes: TRACE
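
One straightforward way to interpret those wildcard patterns (a sketch of the idea, not necessarily how the proof of concept implements it) is to match them against Python's logger registry with fnmatch. Patterns are applied in the order listed, so more specific patterns should come last:

import fnmatch
import logging

def apply_logging_levels(levels):
    """Apply a {pattern: level} mapping to every configured logger."""
    # loggerDict holds every logger created so far; PlaceHolder entries are
    # internal bookkeeping for intermediate package names and are skipped.
    for name, logger in logging.Logger.manager.loggerDict.items():
        if isinstance(logger, logging.PlaceHolder):
            continue
        for pattern, level in levels.items():
            if fnmatch.fnmatch(name, pattern):
                # setLevel accepts level names, but "TRACE" must already be
                # registered via addLevelName for this to work.
                logger.setLevel(level)

apply_logging_levels({
    "galaxy.*": "ERROR",
    "galaxy.jobs.runners.*": "DEBUG",
    "galaxy.jobs.runners.kubernetes": "TRACE",
})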

Changes at runtime

Adding trace() level logging statements has the potential to blow up log files and Galaxy should not be configured to use TRACE as the default logging level. Therefore we will need some way to enable TRACE level logging at runtime. Options include, but are not limited to:

  1. set up a file watcher on the config file and reconfigure loggers when changes are detected
  2. provide an API that allows log levels to be changed.

These are not mutually exclusive options, but in some situations, e.g. AnVIL, users may not have access to the configuration files.
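
For what it's worth, option 1 could be fairly small. Here is a hypothetical sketch using the watchdog library and the apply_logging_levels helper sketched above; the paths are placeholders:

import yaml
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

class LoggingConfigWatcher(FileSystemEventHandler):
    def __init__(self, config_path):
        self.config_path = config_path

    def on_modified(self, event):
        if event.src_path == self.config_path:
            with open(self.config_path) as fh:
                config = yaml.safe_load(fh)
            # apply_logging_levels as sketched in the configuration section.
            apply_logging_levels(config.get("logging_levels", {}))

observer = Observer()
observer.schedule(LoggingConfigWatcher("/srv/galaxy/config/galaxy.yml"),
                  path="/srv/galaxy/config")
observer.start()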

The proof of concept provides a simple API that allows the level to be changed at runtime; a config watcher could be added later if needed. For example:

curl http://localhost:8080/api/logging
curl "http://localhost:8080/api/logging/galaxy.jobs.runners.*"
curl -X POST "http://localhost:8080/api/logging/galaxy.jobs.runners.*?level=DEBUG"
curl -X POST "http://localhost:8080/api/logging/galaxy.jobs.runners.kubernetes?level=TRACE"

For brevity I have omitted setting the API key. Only administrators can list or change the logging levels.

The first GET method returns JSON containing information for all currently configured loggers. The second GET method returns information for a single logger or all loggers in a particular package. The POST method can be used to set the logging level for a single logger or all the loggers in a package.
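
In FastAPI terms (the framework behind Galaxy's API), the POST handler might look roughly like the following. This is an illustrative sketch only; the route layout, wildcard handling, and require_admin wiring in the actual branch may differ.

import logging
from fastapi import APIRouter

router = APIRouter(prefix="/api/logging", tags=["logging"])

@router.post("/{name}")
def set_level(name: str, level: str):
    # A trailing ".*" selects a package and every logger beneath it.
    if name.endswith(".*"):
        prefix = name[:-2]
        targets = [
            n for n in logging.Logger.manager.loggerDict
            if n == prefix or n.startswith(prefix + ".")
        ]
    else:
        targets = [name]
    for target in targets:
        logging.getLogger(target).setLevel(level)
    return {"loggers": targets, "level": level}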

TODO

Authorization

All API endpoints specify require_admin=True in the @router decorator. Is this sufficient? I was thinking of adding checks in relevant galaxy.util.logging methods, but is this necessary?

Admin GUI

I have not started on an admin panel yet, but I envision it as a new option in the Server section of the admin panel. The logging panel would contain a table and/or tree view of all the loggers with their current levels, along with controls to change those levels. I will need help and guidance with this task.

Naming

I've used what I thought were reasonable names for packages, API endpoints, functions, and configuration options, but I'm open to suggestions. The same applies to the general code layout and structure.

@jmchilton (Member) commented

I'll be honest and frank - I'm skeptical we need to change the logging levels at runtime. It sounds like a problem with querying and processing the logs. If we're collecting the logs properly and storing them, why would the level they are stored at matter? Deploying a more sophisticated log-querying solution in the container feels like the right way to sort through noisy logs. Something like Loki -> {Alloy or Promtail} -> Grafana might be an option - I'm not familiar with this space and I might be speaking nonsense, but the options look promising to my naive eyes. Writing docs for integrating these tools and adapting our logs for richer metadata parsing would seem like a better use of developer time than changing the log levels at runtime. The UI would probably be better, and it would be more generally useful and a more familiar workflow for admins and developers.

@ksuderman (Contributor, Author) commented

Thanks for the feedback. Honest and frank is always good.

The issue isn't with querying and processing logs; the issue is that not enough logging is being done for us (AnVIL et al.) to debug problems when Galaxy is running in Kubernetes. We just spent months figuring out a problem with data managers (the exec_after_process hook in particular) because we were not able to easily see what was going on at runtime. Our usual workflow in situations like this is to add a bunch of logging statements, rebuild the Docker container(s), and relaunch Galaxy. I want to start committing these additional logging statements to the codebase so we don't have to repeat the process every time we run into a problem. My concern is that this could result in much larger log files (orders of magnitude larger) and lead to storage problems, problems with logrotate rules, etc. in production. If blowing up the log files isn't a concern, then changing logging levels at runtime likely isn't necessary. However, my thought was to add tons of logging at the trace() level and only enable those statements when needed, then use tools like Loki/Grafana to query and analyze the logs.

By tons of logging statements I mean adding trace() statements (see the sketch after this list) to:

  • the top of every method in a file including all class constructors
  • the start of flow-of-control statements (branches, loops, etc.)
  • any other place that seems relevant
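
As a concrete (entirely hypothetical) example of what that looks like in practice, assuming addTraceLoggingLevel() has already run at import time:

import logging

log = logging.getLogger(__name__)

def exec_after_process(app, inp_data, out_data, **kwargs):  # hypothetical signature
    log.trace("exec_after_process: inp_data=%s out_data=%s", inp_data, out_data)
    for name, dataset in out_data.items():
        log.trace("Processing output %r", name)
        # ... the actual work ...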
