network: add support for tcp keepalive probes #3096

Closed
wants to merge 1 commit into from


@abiliojr abiliojr commented Feb 21, 2021

Signed-off-by: Abilio Marques [email protected]

The term "TCP keepalive" has long been used in networking to describe a mechanism entirely different from simply "not closing the connection after a request". This PR adds support for real TCP keepalive probing. A matching PR with suggestions on how to make the distinction clearer in the documentation is also available.

TCP keepalives have 2 potentially useful applications:

  • To prevent disconnection due to network inactivity: any TCP connection is liable to be silently dropped by intermediate equipment in the network (e.g., routers) if it stays quiet for too long.
  • To detect dead connections, either because of a crash on the other side, or because the network channel has gone down.

All of this happens at the transport layer, without any intervention from the application, and is transparent to upper layers (e.g., HTTP).

For a more detailed description, please see: https://tldp.org/HOWTO/TCP-Keepalive-HOWTO/overview.html

TCP keepalive values can be configured system-wide, but this is not ideal, because keepalive still has to be enabled by the application. The other option is to set the values per socket, which requires Fluent Bit to apply them to the socket itself. This patch adds support for that. It is disabled by default.
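For readers unfamiliar with the per-socket approach, here is a minimal sketch of what enabling keepalive on a single socket can look like. It assumes the Linux option names (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT); the function name enable_tcp_keepalive is hypothetical and is not the code from this patch.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper, not the function added by this PR.
 * Enables keepalive probing on one TCP socket using Linux option names.
 *   idle:  seconds of inactivity before the first probe is sent
 *   intvl: seconds between probes that go unanswered
 *   cnt:   number of unanswered probes before the connection is dropped
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int enable_tcp_keepalive(int fd, int idle, int intvl, int cnt)
{
    int on = 1;

    /* Keepalive is off by default; the application must opt in per socket. */
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
        return -1;
    }
    /* Override the system-wide timing values for this socket only. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) != 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) != 0) {
        return -1;
    }
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)) != 0) {
        return -1;
    }
    return 0;
}
```

Skipping the three TCP_* calls and setting only SO_KEEPALIVE would leave the system-wide timing values in effect for that socket.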


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

Documentation

  • Documentation required for this feature

Documentation provided here: fluent/fluent-bit-docs#471

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@abiliojr
Author

Configuration example:

[INPUT]
    Name mqtt
    Port 1885
    Tag demo

[OUTPUT]
    Name http
    Match *
    host localhost
    port 3000
    URI /messages
    net.keepalive_idle_timeout 180
    net.tcp_keepalive_time 30
    net.tcp_keepalive_interval 15
    net.tcp_keepalive_probes 3
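
As a rough illustration of what these three values imply, the sketch below estimates how long a silent dead peer can go unnoticed. It assumes the usual Linux behavior (first probe after the idle time, then one probe per interval, connection dropped once the configured number of probes goes unanswered); the function name is hypothetical.

```c
/*
 * Hypothetical helper: approximate seconds between the last traffic on a
 * connection and the moment the kernel gives up on an unresponsive peer.
 * Assumes the first probe is sent after `idle` seconds and each of the
 * `probes` unanswered probes waits `interval` seconds.
 */
int keepalive_detection_seconds(int idle, int interval, int probes)
{
    return idle + interval * probes;
}
```

With the configuration above (time 30, interval 15, probes 3) this gives 30 + 15 * 3 = 75 seconds, well inside the 180-second net.keepalive_idle_timeout.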

@abiliojr
Author

Debug log:

Fluent Bit v1.8.0
* Copyright (C) 2019-2021 The Fluent Bit Authors
* Copyright (C) 2015-2018 Treasure Data
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2021/02/21 10:44:20] [ info] Configuration:
[2021/02/21 10:44:20] [ info]  flush time     | 5.000000 seconds
[2021/02/21 10:44:20] [ info]  grace          | 5 seconds
[2021/02/21 10:44:20] [ info]  daemon         | 0
[2021/02/21 10:44:20] [ info] ___________
[2021/02/21 10:44:20] [ info]  inputs:
[2021/02/21 10:44:20] [ info]      mqtt
[2021/02/21 10:44:20] [ info] ___________
[2021/02/21 10:44:20] [ info]  filters:
[2021/02/21 10:44:20] [ info] ___________
[2021/02/21 10:44:20] [ info]  outputs:
[2021/02/21 10:44:20] [ info]      http.0
[2021/02/21 10:44:20] [ info] ___________
[2021/02/21 10:44:20] [ info]  collectors:
[2021/02/21 10:44:20] [ info] [engine] started (pid=7968)
[2021/02/21 10:44:20] [debug] [engine] coroutine stack size: 24576 bytes (24.0K)
[2021/02/21 10:44:20] [debug] [storage] [cio stream] new stream registered: mqtt.0
[2021/02/21 10:44:20] [ info] [storage] version=1.1.0, initializing...
[2021/02/21 10:44:20] [ info] [storage] in-memory
[2021/02/21 10:44:20] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=128
[2021/02/21 10:44:20] [ info] [input:mqtt:mqtt.0] listening on 0.0.0.0:1885
[2021/02/21 10:44:20] [debug] [http:http.0] created event channels: read=19 write=20
[2021/02/21 10:44:20] [debug] [router] match rule mqtt.0:http.0
[2021/02/21 10:44:20] [ info] [sp] stream processor started
[2021/02/21 10:44:28] [debug] [input:mqtt:mqtt.0] [fd=25] new TCP connection
[2021/02/21 10:44:30] [debug] [task] created task=0x7f07a8009210 id=0 OK
[2021/02/21 10:44:30] [debug] [http_client] not using http_proxy for header
[2021/02/21 10:44:30] [debug] [http_client] header=POST /messages HTTP/1.1
Host: localhost:3000
Content-Length: 34
Content-Type: application/msgpack
User-Agent: Fluent-Bit


[2021/02/21 10:44:30] [ info] [output:http:http.0] localhost:3000, HTTP status=200
{"status": "received"}
[2021/02/21 10:44:30] [debug] [upstream] KA connection #25 to localhost:3000 is now available
[2021/02/21 10:44:30] [debug] [out coro] cb_destroy coro_id=0
[2021/02/21 10:44:30] [debug] [task] destroy task=0x7f07a8009210 (task_id=0)
^C[2021/02/21 10:44:51] [engine] caught signal (SIGINT)
[2021/02/21 10:44:51] [ warn] [engine] service will stop in 5 seconds
[2021/02/21 10:44:56] [ info] [engine] service stopped

@abiliojr
Author

Valgrind report:

==8219== 
==8219== HEAP SUMMARY:
==8219==     in use at exit: 0 bytes in 0 blocks
==8219==   total heap usage: 513 allocs, 513 frees, 1,032,942 bytes allocated
==8219== 
==8219== All heap blocks were freed -- no leaks are possible
==8219== 
==8219== For counts of detected and suppressed errors, rerun with: -v
==8219== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

@abiliojr
Author

Sample captured with Wireshark (keepalive was configured at 15 seconds during this test):

No.     Time           Source                Destination           Protocol Length Info                                                            Source Port Destination Port
     12 15.203402344   127.0.0.1             127.0.0.1             TCP      66     [TCP Keep-Alive] 35680 → 3000 [ACK] Seq=162 Ack=180 Win=65536 Len=0 TSval=2983848706 TSecr=2983833505 35680       3000

Frame 12: 66 bytes on wire (528 bits), 66 bytes captured (528 bits) on interface lo, id 0
Ethernet II, Src: 00:00:00_00:00:00 (00:00:00:00:00:00), Dst: 00:00:00_00:00:00 (00:00:00:00:00:00)
Internet Protocol Version 4, Src: 127.0.0.1, Dst: 127.0.0.1
Transmission Control Protocol, Src Port: 35680, Dst Port: 3000, Seq: 162, Ack: 180, Len: 0

src/flb_io.c Outdated
@@ -102,6 +102,13 @@ int flb_io_net_connect(struct flb_upstream_conn *u_conn,
u_conn->fd, u->tcp_host, u->tcp_port);
}

/* set TCP keepalive if configured */
ret = flb_net_socket_tcp_keepalive(fd, &u->net);
Member

This line changes the default system values. I think this call must be optional, depending on whether the user set any parameters. That is, all the internal tcp_something fields should have an unset value such as -1, and the invoked call should perform the syscall to change the behavior only if the value is >= 0.

Author

Changed. Added a boolean, net.tcp_keepalive, that allows enabling keepalives while keeping the system-wide default values.

src/flb_network.c (resolved)
@@ -65,6 +65,25 @@ struct flb_config_map upstream_net[] = {
"before it is retired."
},

{
FLB_CONFIG_MAP_INT, "net.tcp_keepalive_time", "0",
Member

I think default values should be avoided, since every system can have an implicit default that might not match this 0. I am thinking mostly of current users who may face a different internal setup because of this change.

Author

Replaced with -1. As indicated in the other comment, the system-wide default will then be used if the configuration does not specify a value.
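
The "-1 means unset" convention discussed in this thread could be sketched roughly as follows. The struct and function names are hypothetical stand-ins for the fields in struct flb_net_setup, not the code in this PR, and the option names are the Linux ones.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the keepalive fields of struct flb_net_setup. */
struct keepalive_opts {
    int enabled;   /* 0 = leave the socket untouched */
    int time;      /* -1 = keep the system-wide default */
    int interval;  /* -1 = keep the system-wide default */
    int probes;    /* -1 = keep the system-wide default */
};

/* Issue the syscall only when the user actually configured a value. */
int set_if_configured(int fd, int optname, int value)
{
    if (value < 0) {
        return 0;
    }
    return setsockopt(fd, IPPROTO_TCP, optname, &value, sizeof(value));
}

int apply_keepalive(int fd, const struct keepalive_opts *o)
{
    int on = 1;

    if (!o->enabled) {
        return 0;   /* feature disabled: no system value is changed */
    }
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) != 0) {
        return -1;
    }
    if (set_if_configured(fd, TCP_KEEPIDLE, o->time) != 0 ||
        set_if_configured(fd, TCP_KEEPINTVL, o->interval) != 0 ||
        set_if_configured(fd, TCP_KEEPCNT, o->probes) != 0) {
        return -1;
    }
    return 0;
}
```

With enabled set and all three values left at -1, the socket gets keepalive probing using whatever the system-wide settings are, which matches the behavior described for net.tcp_keepalive above.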

@github-actions
Contributor

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Mar 26, 2021
@fdr

fdr commented Mar 7, 2022

I have some need for this particular feature, presuming it can work with out_cloudwatch_logs, to work around #4606. It looks like a lot of the work has been done though there has not been correspondence in some time. If it's a bit more effort to get this committed, please let me know.

Sidebar: this is perhaps more suitable for 1.9 given the size of the change, but for reasons like bug #4606 I think it's a bad idea not to set keepalive when doing synchronous I/O, unless something unusual is done to time out blocking system calls. In other words, there should be a default.

@PettitWesley
Contributor

@abiliojr @edsiper Can we resurrect this PR?

I am hoping it can fix this issue: aws/aws-for-fluent-bit#293

Do you have recommended values for the settings?

@github-actions
Contributor

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

Collaborator

@leonardo-albertovich leonardo-albertovich left a comment

With those minimal changes this should be ready to be merged.

int intvl = net->tcp_keepalive_interval;
int probes = net->tcp_keepalive_probes;

printf("keepalive = %d, %d %d\n", time, intvl, probes);
Collaborator

Please remove this line.

/*
* Enable TCP keepalive
*/
int flb_net_socket_tcp_keepalive(flb_sockfd_t fd, struct flb_net_setup *net)
Collaborator

Please replace the bitwise assignments of ret with regular assignments and add ret == 0 && as a prefix to the conditional blocks so a failed setsockopt call is not overwritten and thus missed by the conditional in line 221.

Please change the conditional in line 221 so it explicitly compares against zero (ie. if (ret != 0) {).

Collaborator

This is important because, according to MSDN, Windows apparently doesn't accept values larger than 255 for TCP_KEEPCNT, whereas other operating systems accept much larger values.
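
The error-handling pattern requested in this review could look roughly like the sketch below: plain assignments to ret, each later call guarded by ret == 0 so that the first failure is preserved, and an explicit comparison against zero at the end. The function name and parameter names are hypothetical, and the option names are the Linux ones.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical sketch of the suggested control flow, not the PR's code.
 * Negative values mean "not configured" and are skipped.
 */
int apply_keepalive_checked(int fd, int idle, int interval, int probes)
{
    int on = 1;
    int ret;

    ret = setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));

    /* Each step runs only if every previous step succeeded. */
    if (ret == 0 && idle >= 0) {
        ret = setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    }
    if (ret == 0 && interval >= 0) {
        ret = setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                         &interval, sizeof(interval));
    }
    if (ret == 0 && probes >= 0) {
        /* An out-of-range count is rejected here and reported below. */
        ret = setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                         &probes, sizeof(probes));
    }

    if (ret != 0) {
        return -1;
    }
    return 0;
}
```

Because the chain stops at the first failing call, errno still describes that failure when the caller logs it.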

@@ -102,6 +102,15 @@ int flb_io_net_connect(struct flb_upstream_conn *u_conn,
u_conn->fd, u->tcp_host, u->tcp_port);
}

/* set TCP keepalive and its options */
if (u->net.tcp_keepalive != FLB_FALSE) {
Collaborator

Please either use if (u->net.tcp_keepalive == FLB_TRUE) { or the preferred syntax according to the coding style if (u->net.tcp_keepalive) {.

@@ -43,6 +43,18 @@ struct flb_net_setup {

/* maximum of times a keepalive connection can be used */
int keepalive_max_recycle;

/* enable/disable tcp keepalive */
char tcp_keepalive;
Collaborator

This property needs to be an int

@leonardo-albertovich
Collaborator

@abiliojr if you are not able to make those changes please let me know and I'll make them for you.

Thanks a lot!

@abiliojr
Author

@leonardo-albertovich, I haven't used Fluent Bit for at least a couple of years. Please go ahead and make any needed changes.

@edsiper
Member

edsiper commented Aug 19, 2024

Closing this, as the changes are part of #9249.

@edsiper edsiper closed this Aug 19, 2024