
REQUEST: Repository maintenance on Benchmark Bare Metal Runners #2331

Closed
XSAM opened this issue Sep 4, 2024 · 9 comments
Labels
area/repo-maintenance Maintenance of repos in the open-telemetry org

Comments

@XSAM (Member) commented Sep 4, 2024

Affected Repository

https://github.com/open-telemetry/opentelemetry-go

Requested changes

Need to investigate the `Error: No space left on device` failure this runner hits while initializing jobs: https://github.com/open-telemetry/opentelemetry-go/actions/runs/10705102088/job/29682643790

The runner fails the job before running any tasks, and the Go SIG cannot resolve this on its own, as we lack context about the running environment and do not have access to the bare metal machine.
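For whoever does have shell access to the host, a minimal diagnostic sketch using standard GNU coreutils (paths are assumptions; the runner's work directory may live elsewhere):

```sh
# Confirm which filesystem is actually full; the partition holding /tmp and
# the runner's work directory is the usual suspect.
df -h

# Inode exhaustion produces the same "No space left on device" error.
df -i

# List the 20 largest entries under /tmp.
du -ah /tmp 2>/dev/null | sort -rh | head -n 20
```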

Purpose

https://github.com/open-telemetry/opentelemetry-go needs a working benchmark runner to execute its benchmarks.

Repository Maintainers

  • @open-telemetry/go-maintainers
XSAM added the area/repo-maintenance (Maintenance of repos in the open-telemetry org) label Sep 4, 2024
XSAM changed the title from REQUEST: Repository maintenance on 'Benchmark Bare Metal Runners' to REQUEST: Repository maintenance on Benchmark Bare Metal Runners Sep 4, 2024
@XSAM (Member, Author) commented Sep 5, 2024

Now, the runner seems to work again. https://github.com/open-telemetry/opentelemetry-go/actions/runs/10715454343/job/29710949026

I am curious whether someone fixed the issue or the runner healed itself.

@XSAM (Member, Author) commented Sep 9, 2024

We haven't encountered any issue like this recently. I will close this for now.

Feel free to re-open if other people encounter similar issues.

XSAM closed this as completed Sep 9, 2024
@XSAM (Member, Author) commented Sep 12, 2024

It happens again:

XSAM reopened this Sep 12, 2024
@trask (Member) commented Sep 17, 2024

cc @tylerbenson

also see https://cloud-native.slack.com/archives/C01NJ7V1KRC/p1725475267605189

@tylerbenson (Member) commented:

Some job is generating a lot of 1GB+ logs in the /tmp directory:

```
...
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5408 item_index=item_5408 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5536 item_index=item_5536 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5537 item_index=item_5537 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5538 item_index=item_5538 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5444 item_index=item_5444 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5539 item_index=item_5539 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5544 item_index=item_5544 a=test b=5 c=3 d=true
2024-09-09 INFO3 Load Generator Counter #0 batch_index=batch_5520 item_index=item_5520 a=test b=5 c=3 d=true
...
```

Perhaps the collector @codeboten?
Each job should really clean up the /tmp directory before or after executing. I'm not really sure how to enforce this better.
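A hedged sketch of how the offending job could be identified from the host, assuming shell access and GNU find (the `+1G` size suffix is a GNU extension): listing the owner of each oversized file usually points to the job or service account that wrote it.

```sh
# Show every file over 1 GB under /tmp along with its owner and timestamp.
find /tmp -type f -size +1G -exec ls -lh {} +
```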

@tylerbenson (Member) commented:

Alternatively, the TC could decide to schedule a weekly restart to ensure the /tmp directory is cleaned, perhaps on Sunday to reduce the risk of interrupting an active test.
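If the TC went that route, a minimal sketch of the schedule, assuming a Linux host with root access and a standard cron daemon (the time and shutdown path are assumptions):

```sh
# Root crontab entry (e.g. added via `sudo crontab -e`):
# reboot at 04:00 every Sunday, when a benchmark run is least likely to be in flight.
0 4 * * 0 /sbin/shutdown -r now
```

If /tmp is mounted as tmpfs, a reboot also clears it automatically.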

@tylerbenson (Member) commented:

For the time being, I followed this guide and added a script that executes `find /tmp -user "ghrunner" -delete` at the end of each job execution. We'll see if that helps.
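For reference, a minimal sketch of what such a post-job script can look like, assuming the linked guide is GitHub's job-hook documentation for self-hosted runners (the script path and name are assumptions; the hook would be registered via `ACTIONS_RUNNER_HOOK_JOB_COMPLETED` in the runner's `.env` file):

```sh
#!/bin/bash
# cleanup-tmp.sh -- run by the runner after each job completes.
# Delete anything under /tmp owned by the runner's service account, ghrunner;
# -mindepth 1 keeps /tmp itself, and stat/permission errors are ignored.
find /tmp -mindepth 1 -user "ghrunner" -delete 2>/dev/null || true
```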

@tylerbenson (Member) commented:

@XSAM It should be fixed now, but please reconsider running your performance job so frequently. It looks like your job takes over an hour to run. That is entirely too long to be run on every merge to main. Remember, this is a single instance shared by all OTel projects. You should either make it run in under 15 minutes, or limit it to only run daily.

XSAM closed this as completed Sep 21, 2024