Describe the question/issue
We are running aws-for-fluent-bit as an ECS service that sidecar containers send logs to via the Forward output/input plugins. The receiving end hits a SIGSEGV whenever one specific service sends logs. This reproduces 100% of the time and crash-loops our collecting service instances every 5 minutes.
We have built the debug version of 2.32.2.20240516 and have core dumps available, but by policy we can't post them to publicly visible places like this issue.
The tags come in as <environment>.<service name>.<aws region>.<data classification>
Configuration
Fluent Bit Config
[SERVICE]
    flush 1
    daemon Off
    log_level ${LOGLEVEL}
    parsers_file parsers.conf
    plugins_file plugins.conf
    http_server on
    http_listen 0.0.0.0
    http_port 2020
    Health_Check On
    HC_Errors_Count 5
    HC_Retry_Failure_Count 5
    HC_Period 5
    storage.metrics on

[INPUT]
    Name forward
    Listen 0.0.0.0
    Port 24224
    storage.type memory
    mem_buf_limit ${MEM_BUF_LIMIT}
    Buffer_Chunk_Size 5M
    Buffer_Max_Size 50M
# Regular logs
[OUTPUT]
    Name s3
    # Match tags whose classification is one of the two non-sensitive values,
    # or that have no classification segment at all.
    Match_regex ^\w++\.[\w\d\-]++\.[\w\d\-]++(?:\.(?=not_applicable|public)\w++)?$
    bucket ${BUCKET}
    region ${REGION}
    total_file_size 250M
    compression gzip
    s3_key_format /$TAG[0]/$TAG[1]/$TAG[2]/%Y/%m/%d/%H-%M-%S-$UUID.gz
    s3_key_format_tag_delimiters .
    storage_class INTELLIGENT_TIERING
    workers ${OUTPUT_WORKERS}
# Sensitive logs
[OUTPUT]
    Name s3
    # Match all data classification values that are not the known non-sensitive values.
    Match_regex ^\w++\.[\w\d\-]++\.[\w\d\-]++(?:\.(?!not_applicable|public)\w++)$
    bucket ${BUCKET_SENSITIVE}
    region ${REGION}
    total_file_size 250M
    compression gzip
    s3_key_format /$TAG[0]/$TAG[1]/$TAG[2]/%Y/%m/%d/%H-%M-%S-$UUID.gz
    s3_key_format_tag_delimiters .
    storage_class INTELLIGENT_TIERING
    workers ${OUTPUT_WORKERS}
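For reference, the intended tag routing can be sanity-checked outside Fluent Bit with a small script. This is only a sketch: Fluent Bit evaluates Match_regex with Onigmo, while the script uses Python's re engine (3.11+ is needed for the possessive ++ quantifiers), and the sample tags are made up.

import re

# The two Match_regex patterns from the OUTPUT sections above.
NON_SENSITIVE = re.compile(
    r"^\w++\.[\w\d\-]++\.[\w\d\-]++(?:\.(?=not_applicable|public)\w++)?$"
)
SENSITIVE = re.compile(
    r"^\w++\.[\w\d\-]++\.[\w\d\-]++(?:\.(?!not_applicable|public)\w++)$"
)

# Hypothetical tags in <environment>.<service name>.<aws region>.<data classification> form.
samples = [
    "prd.cid-core.us-east-1.public",   # expected: non-sensitive output (s3.0)
    "prd.cid-core.us-east-1",          # no classification -> non-sensitive output
    "prd.kyriosb.us-east-1.pii",       # expected: sensitive output (s3.1)
]

for tag in samples:
    if NON_SENSITIVE.match(tag):
        route = "s3.0 (${BUCKET})"
    elif SENSITIVE.match(tag):
        route = "s3.1 (${BUCKET_SENSITIVE})"
    else:
        route = "no output matches"
    print(f"{tag:35} -> {route}")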
Fluent Bit Log Output
[2025/02/19 01:30:43] [engine] caught signal (SIGSEGV)
[2025/02/19 01:30:43] [ info] [output:s3:s3.1] Successfully uploaded object /prd/kyriosb/us-east-1/2025/02/19/01-30-37-xovTeH6T.gz
[2025/02/19 01:30:43] [ info] [output:s3:s3.0] Successfully uploaded object /prd/cid-core/us-east-1/2025/02/19/01-30-40-aggO6YgA.gz
[2025/02/19 01:30:43] [ info] [output:s3:s3.1] Pre-compression upload_chunk_size= 7069012, After compression, chunk is only 670356 bytes, the chunk was too small, using PutObject to upload
The lines before the SIGSEGV are normal.
Fluent Bit Version Info
Tried versions:
2.32.2.20240516
2.32.2.20240516 debug build
2.32.5.20250212
All versions displayed the SIGSEGV.
Cluster Details
We are running as a Fargate ECS service inside a VPC. Until the offending log stream showed up, there were no problems.
Application Details
Each instance is processing ~30GB an hour at a fairly steady rate.
Steps to reproduce issue
Deploy and run normally
Deploy one specific service that starts sending logs to these Fluent Bit instances
Fluent Bit starts crashing, debug builds generate core dumps
Debugging core dumps yields stack traces like this:
(gdb) bt
#0 0x00007fcb619b2ca0 in raise () from /lib64/libc.so.6
#1 0x00007fcb619b4148 in abort () from /lib64/libc.so.6
#2 0x0000000000455abe in flb_signal_handler (signal=11) at /tmp/fluent-bit-1.9.10/src/fluent-bit.c:581
#3 0x00007fcb5db3415d in runtime.sigfwd () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:331
#4 0x00007fcb5caa5c48 in ?? ()
#5 0x00007fcb5db16c05 in runtime.sigfwdgo (sig=<optimized out>, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>, ~r0=<optimized out>)
at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:1135
#6 0x00007fcb5db15347 in runtime.sigtrampgo (sig=0, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>) at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:432
#7 0x00007fcb5db341c9 in runtime.sigtramp () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:354
#8 0x00007fcb3247771d in runtime.sigfwd () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:331
#9 0x00007fcb5caa5d58 in ?? ()
#10 0x00007fcb3245a285 in runtime.sigfwdgo (sig=<optimized out>, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>, ~r0=<optimized out>)
at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:1135
#11 0x00007fcb324589c7 in runtime.sigtrampgo (sig=0, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>) at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:432
#12 0x00007fcb32477789 in runtime.sigtramp () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:354
#13 0x00007fcb0595135d in runtime.sigfwd () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:331
#14 0x00007fcb5caa5e68 in ?? ()
#15 0x00007fcb05933ac5 in runtime.sigfwdgo (sig=<optimized out>, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>, ~r0=<optimized out>)
at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:1135
#16 0x00007fcb05932207 in runtime.sigtrampgo (sig=0, info=0x7fcb5caa6070, ctx=0x7fcb619b2ca0 <raise+272>) at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/signal_unix.go:432
#17 0x00007fcb059513c9 in runtime.sigtramp () at /home/.gimme/versions/go1.20.7.linux.amd64/src/runtime/sys_linux_amd64.s:354
#18 <signal handler called>
#19 0x00007fcb61accbcf in __memmove_avx_unaligned_erms () from /lib64/libc.so.6
#20 0x000000000085b119 in cio_file_write (ch=0x7fcb5ca100c0, buf=0x7fc9c626fd10, count=2286604) at /tmp/fluent-bit-1.9.10/lib/chunkio/src/cio_file.c:954
#21 0x0000000000857376 in cio_chunk_write (ch=0x7fcb5ca100c0, buf=0x7fc9c626fd10, count=2286604) at /tmp/fluent-bit-1.9.10/lib/chunkio/src/cio_chunk.c:223
#22 0x0000000000643373 in flb_fstore_file_append (fsf=0x7fcb5ca9e000, data=0x7fc9c626fd10, size=2286604) at /tmp/fluent-bit-1.9.10/src/flb_fstore.c:287
#23 0x000000000060331a in s3_store_buffer_put (ctx=0x7fcad9e32e40, s3_file=0x7fcb5cacb028, tag=0x7fcad93c6630 "prd.kyriosb.us-east-1.pii", tag_len=25,
data=0x7fc9c626fd10 "{\"date\":\"2025-02-14T22:35:09.095340Z\",\"content\":\"{\\\"timestamp\\\":\\\"2025-02-14T22:35:09.095Z\\\",\\\"level\\\":\\\"INFO\\\",\\\"thread\\\":\\\"http-nio-8090-exec-9\\\",\\\"loggerName\\\":\\\"com.chewy.kyriosb.log.LoggingAspect"..., bytes=2286604) at /tmp/fluent-bit-1.9.10/plugins/out_s3/s3_store.c:188
#24 0x0000000000600853 in buffer_chunk (out_context=0x7fcad9e32e40, upload_file=0x0,
chunk=0x7fc9c626fd10 "{\"date\":\"2025-02-14T22:35:09.095340Z\",\"content\":\"{\\\"timestamp\\\":\\\"2025-02-14T22:35:09.095Z\\\",\\\"level\\\":\\\"INFO\\\",\\\"thread\\\":\\\"http-nio-8090-exec-9\\\",\\\"loggerName\\\":\\\"com.chewy.kyriosb.log.LoggingAspect"..., chunk_size=2286604, tag=0x7fcad93c6630 "prd.kyriosb.us-east-1.pii", tag_len=25)
at /tmp/fluent-bit-1.9.10/plugins/out_s3/s3.c:1663
#25 0x0000000000602495 in cb_s3_flush (event_chunk=0x7fcad93e4b08, out_flush=0x7fcb5ca9e2c0, i_ins=0x7fcb5ec0a780, out_context=0x7fcad9e32e40, config=0x7fcb5ec19980)
at /tmp/fluent-bit-1.9.10/plugins/out_s3/s3.c:2202
#26 0x00000000004e702e in output_pre_cb_flush () at /tmp/fluent-bit-1.9.10/include/fluent-bit/flb_output.h:522
#27 0x0000000000a518a7 in co_init () at /tmp/fluent-bit-1.9.10/lib/monkey/deps/flb_libco/amd64.c:117
#28 0x0000000000000000 in ?? ()
Related Issues
This may be related to #816 , but the logs don't match up. Our issue seems to be in the S3 output plugin.
Are you able to check the utilization metrics of one of the ECS tasks around the time it gets a SIGSEGV and crashes? How are the memory and filesystem utilization looking?
Also, in your Fluent Bit debug-level logs, are you seeing log lines of the format [cio file] alloc_size from %lu to %lu? Additionally, are you seeing anything indicating potential issues with writing data?
Given that you can't publish the debug logs themselves, are you allowed to find the code lines in this repo that emit any debug logs occurring just before the SIGSEGV, and share those code lines with us so we can better understand the execution path?
Bit of a long shot, but if you set the storage.path SERVICE setting, does that fix the issue?
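That would mean adding something like this to your existing [SERVICE] section (the directory is only an example; it has to exist and be writable from inside the container):

[SERVICE]
    storage.path /var/log/flb-storage/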
The provided stack trace is from a memcpy with a fairly straightforward call context, and even if the size values were incorrect I'd expect that to just publish incorrect data rather than SIGSEGV, so I'm suspecting one of the following (see the gdb sketch after this list for a way to narrow it down):
Some weird race condition
Fluent Bit having trouble allocating enough memory for the file
Or some issue with not having write access to the required directory
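If you can spend a bit more time in the core dump, printing the chunkio state in frame #20 should help distinguish between these. Rough sketch below; the field names are what I recall from lib/chunkio (cio_chunk.h / cio_file.h), so please double-check them against the source tree you built:

(gdb) frame 20
(gdb) print count
(gdb) print *ch
(gdb) print *(struct cio_file *) ch->backend
(gdb) print ((struct cio_file *) ch->backend)->map
(gdb) print ((struct cio_file *) ch->backend)->alloc_size
(gdb) print ((struct cio_file *) ch->backend)->data_size

If, for example, map comes back NULL or alloc_size is smaller than data_size plus count, that would point at a failed allocation/resize of the backing chunk rather than a race.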
In terms of mitigating impact: as I understand it, these crashes are sporadic rather than "start publishing => immediate crash", correct?
If so, you can try using a container restart policy on your FireLens containers, so that they restart and continue publishing logs after a small gap instead of crashing your application.
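A sketch of what that could look like in the container definition for the Fluent Bit container (field names are from the ECS container restart policy feature as I remember it, and the values are only examples; please verify against the current task definition reference):

{
  "name": "fluent-bit",
  "image": "<your aws-for-fluent-bit image>",
  "restartPolicy": {
    "enabled": true,
    "ignoredExitCodes": [0],
    "restartAttemptPeriod": 60
  }
}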