Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

auditbeat: system/process module backed by quark #42032

Open
wants to merge 33 commits into
base: main
Choose a base branch
from
Open

Conversation

haesbaert
Copy link
Contributor

@haesbaert haesbaert commented Dec 13, 2024

Proposed commit message

This introduces a new provider for the sytem/process module in linux.

The main motivation is to address some of the limitations of the current implementation. The gosysinfo provider sends state reports by scraping /proc from time to time, so it loses all short lived processes. Some customers also would like to have full telemetry but can't run auditd for various reasons.

As a bonus we get some extra ECS fields that were not available before.

MAIN DIFFERENCES:

  • Publishes every process in the system, regardless of lifespan.
  • Publishes exec events for an existing process (without a fork).
  • Aggregates fork+exec+exit within one event.
  • Adds event.exit_code for processes that exited, can't express exit_time in ECS?
  • Include the original process.args, sysinfo reports args that were fetched when it parsed /proc, so a userland process can masquerade itself. For the initial /proc scraping we report the current value like sysinfo. We can't get the original value since the kernel overwrites it, if you wanna have fun: https://github.com/systemd/systemd/blob/main/src/basic/argv-util.c#L165
  • Adds process.args_count.
  • Adds process.interactive and if true, process.tty.char_device.{major,minor}
  • Attempts to hash all processes, not just long lived ones.
  • Hashing is not rate-limited anymore, but it's cached and refreshed based on metadata. It's a LRU keyed by path and refreshed if the metadata of the file changes, statx(2) if the kernel supports, stat(2) otherwise.
  • No more periodic state reports, only initial batch.
  • No more saving the timestamp of the last state-report in disk.
  • No more /proc parsing during runtime, only on boot.

MISSING:

  • Unify entity id with sessionview.
  • Publish metrics from quark.Stats(). Done, but naming and gauges should be discussed.
  • Docs.
  • Properly define config options and names.

EXTRA CHANGES:

  • Added statx(2) to seccomp_linux so we can properly use CachedHasher.
  • Updated quark to 0.3 so we have namespace inode numbers.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

Run auditbeat on linux with the following configuration:

auditbeat.modules:

- module: system
  datasets:
    - process
  process.backend: "quark"

Related issues

Integrated PRs related to this

List of previous work done to minimize the size of this PR

Screenshots

Non interactive SSH

Below is a shot of a non interactive ssh session, done with ssh fc39vm /bin/echo hi from quarkio.
It shows the intermediary processes of sshd until we fork the shell and echo, the interesting bits is that we can see a process that forked+execed and then execs again: sshd forks+execs mksh,, which in turn execs /bin/echo, without forking.
ssh_nonint

Comparison against the sysinfo provider for a long lived process:

Here we run a long sleep and just compare the events against the existing provider on 8.14.3:
vs_old

On event.type, event.action and others

I've tried to keep things as close as possible to the old provider, but it's really just a suggestion at this point and it's likely we want to change things

event.type gosysinfo quark
fork start start
fork+exec start [start, change]
short fork+exec+exit N/A [start, change, end]
short fork+exit N/A [start, end]
existing processes info info
exec only N/A change
exec+exit end [change, end]
event.action gosysinfo quark
fork process_started process_started
fork+exec process_started process_started
short fork+exec+exit N/A process_ran
short fork+exit N/A process_ran
existing processes existing_process existing_process
exec only N/A process_changed_image
exec+exit end process_stopped

As you can see, expressing things in event.action is not great, I'm
all open to suggestions, life would be easier if it could be an
array. I've tried to compromise more states into fewer words.
process_changed_image might look a bit weird, but it's less ambiguous
than "executed". Again really open to suggestions here and I have no
strong feelings about it.

event.kind is now always event as there is no more state reports every X seconds.
The initial state report at init remains, but it's also event.

On the state of this PR

This doesn't include the documentation bits, I'd like to do this in a subsequent PR once the naming, config and whatnot is decided.
We should unify process.entity_id with sessionviewer, and we can do it in this PR, worth noting that the gosysinfo backend calculates things differently as well, so this is no worse than that.

I'm going out on holidays, but I'm taking this PR out of draft so that we can start the discussion and interested parties can test it.

This introduces a new provider for the sytem/process module in linux.

The main motivation is to address some of the limitations of the current
implementation. The gosysinfo provider sends state reports by scraping /proc
from time to time, so it loses all short lived processes. Some customers also
would like to have full telemetry but can't run auditd for various reasons.

As a bonus we get some extra ECS fields that were not available before.

MAIN DIFFERENCES:
 * Publishes every process in the system, regardless of lifespan.
 * Publishes exec events for an existing process (without a fork).
 * Aggregates fork+exec+exit within one event.
 * Adds event.exit_code for processes that exited, can't express exit_time in ECS?
 * Include the original process.args, sysinfo reports args that were
   fetched when it parsed /proc, so a userland process can masquerade
   itself. For the initial /proc scraping we report the current value like
   sysinfo. We can't get the original value since the kernel
   overwrites it, if you wanna have fun:
   https://github.com/systemd/systemd/blob/main/src/basic/argv-util.c#L165
 * Adds process.args_count.
 * Adds process.interactive and if true, process.tty.char_device.{major,minor}
 * Attempts to hash all processes, not just long lived ones.
 * Hashing is not rate-limited anymore, but it's cached and refreshed
   based on metadata. It's a LRU keyed by path and refreshed if the
   metadata of the file changes, statx(2) if the kernel supports,
   stat(2) otherwise.
 * No more periodic state reports, only initial batch.
 * No more saving the timestamp of the last state-report in disk.
 * No more /proc parsing during runtime, only on boot.

MISSING:
 * Unify entity id with sessionview.
 * Publish metrics from quark.Stats().
 * Docs.
 * Properly define config options and names.

EXTRA CHANGES:
 * Added statx(2) to seccomp_linux so we can properly use CachedHasher.
 * Updated quark to 0.3 so we have namespace inode numbers.
@haesbaert haesbaert added enhancement Team:Security-Linux Platform Linux Platform Team in Security Solution labels Dec 13, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Dec 13, 2024
Copy link
Contributor

mergify bot commented Dec 13, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @haesbaert? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit

Copy link
Contributor

mergify bot commented Dec 13, 2024

backport-8.x has been added to help with the transition to the new branch 8.x.
If you don't need it please use backport-skip label and remove the backport-8.x label.

@mergify mergify bot added the backport-8.x Automated backport to the 8.x branch with mergify label Dec 13, 2024
@haesbaert haesbaert marked this pull request as ready for review December 13, 2024 11:11
@haesbaert haesbaert requested review from a team as code owners December 13, 2024 11:11
@elasticmachine
Copy link
Collaborator

Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)

if err := c.HasherConfig.Validate(); err != nil {
return err
}
if c.Backend != "quark" && c.Backend != "proc" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be consistent with add_session_metadata in terms of backend option names.

Copy link
Contributor Author

@haesbaert haesbaert Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, I've changed it to kernel_tracing like the processor, I was hoping we could discuss the naming but it makes more sense to stick with the same for now.

Copy link
Contributor

mergify bot commented Jan 8, 2025

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b quark-process upstream/quark-process
git merge upstream/main
git push upstream quark-process

@haesbaert haesbaert marked this pull request as draft January 9, 2025 15:09
@haesbaert
Copy link
Contributor Author

This is ready for some bashing, major changes since the initial push:

  • Added metrics, naming and what should be gauge needs consensus.
  • Added tests for ebpf and kprobe, for initial snapshot and generated events.
  • process.backend is now procfs or kernel_tracing, instead of proc and quark.
  • Still missing consensus on event.action and event.type.

@nicholasberlin
Copy link
Contributor

Still missing consensus on event.action and event.type.

I think we should treat event.action as an array, as Endpoint is currently doing.

Copy link
Contributor

@nicholasberlin nicholasberlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM so far.

@haesbaert
Copy link
Contributor Author

haesbaert commented Jan 13, 2025

Still missing consensus on event.action and event.type.

I think we should treat event.action as an array, as Endpoint is currently doing.

That'd be great, it would simplify things a lot.

@haesbaert haesbaert marked this pull request as ready for review January 13, 2025 08:56
@nfritts
Copy link

nfritts commented Feb 12, 2025

@andrewkroh any further thoughts on this?

@haesbaert this will be a new optional backend that isn't used by default after this PR right? If that's the case it feels fairly safe to me if we merge it and look in to have some additional parties test it.

Copy link
Member

@andrewkroh andrewkroh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work. Excited to have this capability available.

Just a few minor comments.

if err != nil {
processErr = fmt.Errorf("failed to hash executable %v for PID %v: %w",
process.Filename, process.Pid, err)
ms.log.Error(processErr.Error())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anything the user can do about this when they see these errors? The error message is already being sent along with the event. This seems more like "informational" rather than "error".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I was just keeping parity with the gosysinfo code that ms.log.Warn here:

It should have been a Warn, I don't feel strongly about it, especially since this is a somewhat common warning/error, as the default size limit is not huge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then let's lower to Warn.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in 2bb1815

x-pack/auditbeat/module/system/process/process.go Outdated Show resolved Hide resolved
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-8.x Automated backport to the 8.x branch with mergify enhancement Team:Security-Linux Platform Linux Platform Team in Security Solution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants