Describe the question/issue
We are running aws-for-fluent-bit as an ECS service that sidecar containers send logs to via the Forward output/input plugins. The receiving end hits a SIGSEGV whenever one specific service sends logs. This reproduces 100% of the time and crash-loops our collecting service instances every 5 minutes.
We have built the debug version of 2.32.2.20240516 and have core dumps available, but by policy we can't post them to publicly visible places like this issue.
The tags come in as <environment>.<service name>.<aws region>.<data classification>
Configuration
Fluent Bit Config
[SERVICE]
    flush 1
    daemon Off
    log_level ${LOGLEVEL}
    parsers_file parsers.conf
    plugins_file plugins.conf
    http_server on
    http_listen 0.0.0.0
    http_port 2020
    Health_Check On
    HC_Errors_Count 5
    HC_Retry_Failure_Count 5
    HC_Period 5
    storage.metrics on

[INPUT]
    Name forward
    Listen 0.0.0.0
    Port 24224
    storage.type memory
    mem_buf_limit ${MEM_BUF_LIMIT}
    Buffer_Chunk_Size 5M
    Buffer_Max_Size 50M
# Regular logs
[OUTPUT]
    Name s3
    # Match tags whose classification is one of the two non-sensitive values,
    # or that have no classification segment at all.
    Match_regex ^\w++\.[\w\d\-]++\.[\w\d\-]++(?:\.(?=not_applicable|public)\w++)?$
    bucket ${BUCKET}
    region ${REGION}
    total_file_size 250M
    compression gzip
    s3_key_format /$TAG[0]/$TAG[1]/$TAG[2]/%Y/%m/%d/%H-%M-%S-$UUID.gz
    s3_key_format_tag_delimiters .
    storage_class INTELLIGENT_TIERING
    workers ${OUTPUT_WORKERS}
# Sensitive logs
[OUTPUT]
    Name s3
    # Match all data classification values that are not the known non-sensitive values.
    Match_regex ^\w++\.[\w\d\-]++\.[\w\d\-]++(?:\.(?!not_applicable|public)\w++)$
    bucket ${BUCKET_SENSITIVE}
    region ${REGION}
    total_file_size 250M
    compression gzip
    s3_key_format /$TAG[0]/$TAG[1]/$TAG[2]/%Y/%m/%d/%H-%M-%S-$UUID.gz
    s3_key_format_tag_delimiters .
    storage_class INTELLIGENT_TIERING
    workers ${OUTPUT_WORKERS}
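For reference, the intended tag routing can be sanity-checked outside Fluent Bit with a small script. This is only a sketch: Fluent Bit evaluates Match_regex with Onigmo, while the script uses Python's re engine (3.11+ is needed for the possessive ++ quantifiers), and the sample tags are made up.

import re

# The two Match_regex patterns from the OUTPUT sections above.
NON_SENSITIVE = re.compile(
    r"^\w++\.[\w\d\-]++\.[\w\d\-]++(?:\.(?=not_applicable|public)\w++)?$"
)
SENSITIVE = re.compile(
    r"^\w++\.[\w\d\-]++\.[\w\d\-]++(?:\.(?!not_applicable|public)\w++)$"
)

# Hypothetical tags in <environment>.<service name>.<aws region>.<data classification> form.
samples = [
    "prd.cid-core.us-east-1.public",   # expected: non-sensitive output (s3.0)
    "prd.cid-core.us-east-1",          # no classification -> non-sensitive output
    "prd.kyriosb.us-east-1.pii",       # expected: sensitive output (s3.1)
]

for tag in samples:
    if NON_SENSITIVE.match(tag):
        route = "s3.0 (${BUCKET})"
    elif SENSITIVE.match(tag):
        route = "s3.1 (${BUCKET_SENSITIVE})"
    else:
        route = "no output matches"
    print(f"{tag:35} -> {route}")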
Fluent Bit Log Output
[2025/02/19 01:30:43] [engine] caught signal (SIGSEGV)
[2025/02/19 01:30:43] [ info] [output:s3:s3.1] Successfully uploaded object /prd/kyriosb/us-east-1/2025/02/19/01-30-37-xovTeH6T.gz
[2025/02/19 01:30:43] [ info] [output:s3:s3.0] Successfully uploaded object /prd/cid-core/us-east-1/2025/02/19/01-30-40-aggO6YgA.gz
[2025/02/19 01:30:43] [ info] [output:s3:s3.1] Pre-compression upload_chunk_size= 7069012, After compression, chunk is only 670356 bytes, the chunk was too small, using PutObject to upload
The lines before the SIGSEGV are normal.
Fluent Bit Version Info
Tried versions:
2.32.2.20240516
2.32.2.20240516 debug build
2.32.5.20250212
All versions displayed the SIGSEGV.
Cluster Details
We are running as a Fargate ECS service inside a VPC. Until the offending log stream showed up, there were no problems.
Application Details
Each instance is processing ~30GB an hour at a fairly steady rate.
Steps to reproduce issue
Deploy and run normally
Deploy one specific service that starts sending logs to these Fluent Bit instances
Fluent Bit starts crashing, debug builds generate core dumps
Debugging core dumps yields stack traces like this:
(gdb) bt
#0 0x00007fcb619b2ca0 in raise () from /lib64/libc.so.6
#1 0x00007fcb619b4148 in abort () from /lib64/libc.so.6
#2 0x0000000000455abe in flb_signal_handler (signal=11) at /tmp/fluent-bit-1.9.10/src/fluent-bit.c:581
#3 0x00007fcb5db3415d in runtime.sigfwd () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:331
#4 0x00007fcb5caa5c48 in ?? ()
#5 0x00007fcb5db16c05 in runtime.sigfwdgo (sig=<optimized out>, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>, ~r0=<optimized out>)
at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:1135
#6 0x00007fcb5db15347 in runtime.sigtrampgo (sig=0, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>) at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:432
#7 0x00007fcb5db341c9 in runtime.sigtramp () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:354
#8 0x00007fcb3247771d in runtime.sigfwd () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:331
#9 0x00007fcb5caa5d58 in ?? ()
#10 0x00007fcb3245a285 in runtime.sigfwdgo (sig=<optimized out>, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>, ~r0=<optimized out>)
at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:1135
#11 0x00007fcb324589c7 in runtime.sigtrampgo (sig=0, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>) at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:432
#12 0x00007fcb32477789 in runtime.sigtramp () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:354
#13 0x00007fcb0595135d in runtime.sigfwd () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:331
#14 0x00007fcb5caa5e68 in ?? ()
#15 0x00007fcb05933ac5 in runtime.sigfwdgo (sig=<optimized out>, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>, ~r0=<optimized out>)
at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:1135
#16 0x00007fcb05932207 in runtime.sigtrampgo (sig=0, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>) at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:432
#17 0x00007fcb059513c9 in runtime.sigtramp () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:354
#18 <signal handler called>
#19 0x00007fcb61accbcf in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#20 0x000000000085b119 in cio_file_write (ch=0x7fcb5ca100c0, buf=0x7fc9c626fd10, count=2286604) at /tmp/fluent-bit-1.9.10/lib/chunkio/src/cio_file.c:954
#21 0x0000000000857376 in cio_chunk_write (ch=0x7fcb5ca100c0, buf=0x7fc9c626fd10, count=2286604) at /tmp/fluent-bit-1.9.10/lib/chunkio/src/cio_chunk.c:223
#22 0x0000000000643373 in flb_fstore_file_append (fsf=0x7fcb5ca9e000, data=0x7fc9c626fd10, size=2286604) at /tmp/fluent-bit-1.9.10/src/flb_fstore.c:287
#23 0x000000000060331a in s3_store_buffer_put (ctx=0x7fcad9e32e40, s3_file=0x7fcb5cacb028, tag=0x7fcad93c6630 "prd.kyriosb.us-east-1.pii", tag_len=25,
data=0x7fc9c626fd10 "{\"date\":\"2025-02-14T22:35:09.095340Z\",\"content\":\"{\\\"timestamp\\\":\\\"2025-02-14T22:35:09.095Z\\\",\\\"level\\\":\\\"INFO\\\",\\\"thread\\\":\\\"http-nio-8090-exec-9\\\",\\\"loggerName\\\":\\\"com.chewy.kyriosb.log.LoggingAspect"..., bytes=2286604) at /tmp/fluent-bit-1.9.10/plugins/out_s3/s3_store.c:188
#24 0x0000000000600853 in buffer_chunk (out_context=0x7fcad9e32e40, upload_file=0x0,
chunk=0x7fc9c626fd10 "{\"date\":\"2025-02-14T22:35:09.095340Z\",\"content\":\"{\\\"timestamp\\\":\\\"2025-02-14T22:35:09.095Z\\\",\\\"level\\\":\\\"INFO\\\",\\\"thread\\\":\\\"http-nio-8090-exec-9\\\",\\\"loggerName\\\":\\\"com.chewy.kyriosb.log.LoggingAspect"..., chunk_size=2286604, tag=0x7fcad93c6630 "prd.kyriosb.us-east-1.pii", tag_len=25)
at /tmp/fluent-bit-1.9.10/plugins/out_s3/s3.c:1663
#25 0x0000000000602495 in cb_s3_flush (event_chunk=0x7fcad93e4b08, out_flush=0x7fcb5ca9e2c0, i_ins=0x7fcb5ec0a780, out_context=0x7fcad9e32e40, config=0x7fcb5ec19980)
at /tmp/fluent-bit-1.9.10/plugins/out_s3/s3.c:2202
#26 0x00000000004e702e in output_pre_cb_flush () at /tmp/fluent-bit-1.9.10/include/fluent-bit/flb_output.h:522
#27 0x0000000000a518a7 in co_init () at /tmp/fluent-bit-1.9.10/lib/monkey/deps/flb_libco/amd64.c:117
#28 0x0000000000000000 in ?? ()
Related Issues
This may be related to #816 , but the logs don't match up. Our issue seems to be in the S3 output plugin.
Are you able to check the utilization metrics of one of the ECS tasks around the time it gets a SIGSEGV and crashes? How are the memory and filesystem utilization looking?
Also, in your Fluent Bit debug-level logs, are you seeing log lines of the format [cio file] alloc_size from %lu to %lu? Additionally, are you seeing anything indicating potential issues with writing data?
Given that you can't publish the debug logs themselves, are you allowed to find the code lines in this repo that emit any debug logs occurring just before the SIGSEGV, and share those code lines with us so we can better understand the execution path?
Bit of a long shot, but if you set the storage.path SERVICE setting, does that fix the issue?
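That would mean adding something like this to your existing [SERVICE] section (the directory is only an example; it has to exist and be writable from inside the container):

[SERVICE]
    storage.path /var/log/flb-storage/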
The provided stack trace is from a memcpy with a fairly straightforward call context, and even if the size values were incorrect I'd expect that to just publish incorrect data rather than SIGSEGV, so I'm suspecting one of the following (see the gdb sketch after this list for a way to narrow it down):
Some weird race condition
Fluent Bit having trouble allocating enough memory for the file
Or some issue with not having write access to the required directory
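If you can spend a bit more time in the core dump, printing the chunkio state in frame #20 should help distinguish between these. Rough sketch below; the field names are what I recall from lib/chunkio (cio_chunk.h / cio_file.h), so please double-check them against the source tree you built:

(gdb) frame 20
(gdb) print count
(gdb) print *ch
(gdb) print *(struct cio_file *) ch->backend
(gdb) print ((struct cio_file *) ch->backend)->map
(gdb) print ((struct cio_file *) ch->backend)->alloc_size
(gdb) print ((struct cio_file *) ch->backend)->data_size

If, for example, map comes back NULL or alloc_size is smaller than data_size plus count, that would point at a failed allocation/resize of the backing chunk rather than a race.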
In terms of mitigating impact: as I understand it, these crashes are sporadic rather than "start publishing => immediate crash", correct?
If so, you can try using a container restart policy on your FireLens containers, so that they restart and continue publishing logs after a small gap instead of crashing your application.
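A sketch of what that could look like in the container definition for the Fluent Bit container (field names are from the ECS container restart policy feature as I remember it, and the values are only examples; please verify against the current task definition reference):

{
  "name": "fluent-bit",
  "image": "<your aws-for-fluent-bit image>",
  "restartPolicy": {
    "enabled": true,
    "ignoredExitCodes": [0],
    "restartAttemptPeriod": 60
  }
}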