Performance Investigation: CPU Spikes & System Call Overhead in New Relic PHP Agent #1021

theophileds opened this issue Feb 13, 2025

Context

Following up on issue #806, we conducted an isolated investigation to better understand the CPU spikes observed when the New Relic PHP agent is enabled. Our testing was performed in a controlled environment with a single container on a dedicated Kubernetes node.

Environment

  • Kubernetes 1.30 (EKS)
  • Instance type: m7a
  • PHP-FPM 8.2
  • New Relic PHP Agent: Latest version with all features disabled
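
For reference, "all features disabled" means a newrelic.ini configuration along these lines (an illustrative sketch, not the exact file used in our tests; the directives shown are standard agent settings):

    ; keep the agent loaded so its baseline overhead is measurable
    newrelic.enabled = true
    ; optional features switched off for the test
    newrelic.transaction_tracer.enabled = false
    newrelic.error_collector.enabled = false
    newrelic.browser_monitoring.auto_instrument = false
    newrelic.distributed_tracing_enabled = false
    newrelic.span_events.enabled = false
    newrelic.application_logging.enabled = false
    newrelic.code_level_metrics.enabled = false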

Findings

CPU Usage Pattern

CPU usage graph showing spikes to 100% with New Relic disabled and 300% with New Relic enabled
Figure 1: Grafana CPU metrics showing distinct usage patterns:

  • Baseline period with normal activity
  • Spike to ~100% CPU with New Relic disabled (16:10)
  • Spike to ~300% CPU with New Relic enabled (16:15)

Flame Graph Comparison

Flame graph visualization without New Relic enabled
Figure 2: System-wide flame graph (test 4) with New Relic disabled, showing normal system call patterns and CPU usage distribution

Flame graph visualization with New Relic enabled
Figure 3: System-wide flame graph (test 4) with New Relic enabled, demonstrating significantly increased fstatat64 system calls and higher CPU utilization across all cores

This pattern remained consistent across multiple test runs and was not affected by:

  • Disabling all New Relic features
  • Using the latest agent version
  • Different sampling frequencies (99Hz and 997Hz)

System Call Analysis

Through system-wide performance profiling, we identified a significant increase in fstatat64 system calls when the New Relic agent is enabled. This suggests excessive file operations being performed by the agent.
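
As a rough way to quantify this (a sketch; assumes strace can attach to the FPM master inside the container), a per-syscall summary can be compared with the agent on and off — the fstatat64/newfstatat row is the one of interest:

    # 60-second per-syscall count summary across the FPM master and its children
    timeout 60 strace -f -c -p $(pgrep -o php-fpm)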

Testing Methodology

We conducted extensive profiling using:

  1. PHP-FPM specific profiling at different sampling rates:
    perf record -F [99|997] -p $(pgrep php-fpm -o) -a -g --call-graph fp -- sleep 60

  2. System-wide profiling:
    perf record -F [99|997] -a -g -- sleep 60

  3. System call tracing:
    timeout 60 strace -tt -f -C -p $(pgrep -o php-fpm)
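
The flame graphs above can be reproduced from these recordings with a pipeline along the following lines (a sketch assuming Brendan Gregg's FlameGraph scripts are on the PATH; file names are illustrative):

    # fold the recorded stacks and render an interactive SVG flame graph
    perf script -i perf.data | stackcollapse-perf.pl > out.folded
    flamegraph.pl out.folded > flamegraph.svg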

Version Impact

This performance regression appears to have been introduced between versions 10.0.0.312 and 10.7.0.319. Earlier versions did not exhibit this behavior.
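
For anyone reproducing this, the installed agent version can be confirmed from the loaded extension (one-liner sketch):

    php -r "echo phpversion('newrelic'), PHP_EOL;"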

Supporting Evidence

All profiling results are attached to this issue in newrelic_profiling_results.zip, which includes:

PHP-FPM Specific Profiles

  • With New Relic disabled:
    • 99Hz sampling (phpfpm_nr_off_99hz.*)
    • 997Hz sampling (phpfpm_nr_off_997hz.*)
  • With New Relic enabled:
    • 99Hz sampling (phpfpm_nr_on_99hz.*)
    • 997Hz sampling (phpfpm_nr_on_997hz.*)

System-Wide Profiles

  • With New Relic disabled:
    • Test 3 (system_nr_off_99hz_test3.*)
    • Test 4 (system_nr_off_99hz_test4.*)
  • With New Relic enabled:
    • Test 3 (system_nr_on_99hz_test3.*)
    • Test 4 (system_nr_on_99hz_test4.*)

Questions

  • Is there a known reason for the increased frequency of fstatat64 calls?
  • Are there plans to optimize file operations in future releases?
  • Could this be related to the agent's file monitoring or instrumentation mechanisms?

newrelic_profiling_results.zip


iekadou commented Feb 27, 2025

We have a similar issue, but with almost all versions, including those prior to 10.0.0.312.

Once we reach 500+ req/s on PHP-FPM, we run into a CPU spike (all 96 cores at ~100% kernel usage) that is resolved immediately after restarting the newrelic-daemon.
Those spikes do not occur at all if newrelic is not running.

We also see filename_lookup calls that result in locks.
A screenshot of the flame graph is attached.


I think this could be related.

Sadly, I have no idea how to detect which file(s) are causing those locks.
If someone knows how to find out which paths end up in native_queued_spin_lock_slowpath, feel free to help me :)
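
Maybe a bpftrace probe on the kernel's filename_lookup could surface the paths being looked up while the spike happens (untested sketch; assumes a kernel with BTF so the struct filename layout is known):

    # print the process name and the path string passed to each filename_lookup call
    bpftrace -e 'kprobe:filename_lookup { printf("%s %s\n", comm, str(((struct filename *)arg1)->name)); }'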
