KAFKA-17182: Consumer fetch sessions are evicted too quickly with AsyncKafkaConsumer #18795

Open · wants to merge 97 commits into trunk
Conversation

@kirktrue (Collaborator) commented Feb 4, 2025:

This change reduces fetch session cache evictions on the broker for the KafkaConsumer by altering the logic that determines which partitions it includes in fetch requests.

Background

Consumer implementations fetch data from the cluster and temporarily buffer it in memory until the user next calls Consumer.poll(). When a fetch request is being generated, partitions that already have buffered data are not included in the fetch request.

The ClassicKafkaConsumer performs much of its fetch logic and network I/O in the application thread. On poll(), if there is any locally-buffered data, the ClassicKafkaConsumer does not fetch any new data and simply returns the buffered data to the user from poll().

The AsyncKafkaConsumer, on the other hand, splits its logic and network I/O between two threads, which introduces a potential race condition during fetch. The AsyncKafkaConsumer also checks for buffered data on its application thread; if it finds none, it signals the background thread to create a fetch request. However, it is possible for the background thread to receive and buffer data from a previous fetch before the fetch request logic starts. When that happens, the background thread skips the newly buffered partitions as it creates the fetch request, with the unintended result that those partitions are added to the fetch request's "to remove" set. This signals the broker to remove those partitions from its internal cache.

This issue is technically possible in the ClassicKafkaConsumer too, since the heartbeat thread performs network I/O in addition to the application thread. However, because of the frequency at which the AsyncKafkaConsumer's background thread runs, it is ~100x more likely to happen.
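
To make that failure mode concrete, here is a minimal, self-contained sketch of the buffered-partition skip described above. The class, method, and topic names are illustrative only and are not the actual client internals.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Minimal sketch of the buffered-partition skip described above.
 * Names are illustrative, not the real client code.
 */
public class BufferedPartitionSkipSketch {

    record TopicPartition(String topic, int partition) { }

    /**
     * Partitions to include in the next fetch request: everything fetchable
     * except partitions whose data is already sitting in the fetch buffer.
     */
    static Set<TopicPartition> partitionsToFetch(Set<TopicPartition> fetchable,
                                                 Set<TopicPartition> buffered) {
        Set<TopicPartition> result = new HashSet<>(fetchable);
        // In an incremental fetch session, a partition that was in the previous
        // request but is omitted from this one ends up in the request's
        // "to remove" set, telling the broker to drop it from its session cache.
        result.removeAll(buffered);
        return result;
    }

    public static void main(String[] args) {
        Set<TopicPartition> fetchable = Set.of(
                new TopicPartition("events", 0),
                new TopicPartition("events", 1));
        // If the background thread buffered events-1 just before building the
        // request, events-1 silently falls out of the fetch session.
        Set<TopicPartition> buffered = Set.of(new TopicPartition("events", 1));
        System.out.println(partitionsToFetch(fetchable, buffered)); // only events-0 remains
    }
}
```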

Options

The core decision is: what should the background thread do if it is asked to create a fetch request and discovers that there is buffered data? There were multiple proposals to address this issue in the AsyncKafkaConsumer. Among them were:

  1. The background thread should omit buffered partitions from the fetch request (the existing behavior)
  2. The background thread should skip the fetch request generation entirely if there are any buffered partitions
  3. The background thread should include buffered partitions in the fetch request, but use a small “max bytes” value
  4. The background thread should skip fetching from the nodes that have buffered partitions

Option 4 won out. The change is localized to AbstractFetch, where the basic idea is to skip fetch requests to a given node if that node is the leader for any buffered partition. Because no fetch request is sent to that node, the fetch session won't have any "holes" where the buffered partitions should be.
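
Below is a minimal sketch of that idea. The names (bufferedNodes, fetchRequests, the record types) are illustrative stand-ins, not the actual AbstractFetch code; as the review discussion further down notes, only buffered partitions that are still fetchable should cause their leader to be skipped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/**
 * Sketch of option 4: skip fetch requests to nodes that lead buffered data.
 * Illustrative names only, not the actual AbstractFetch implementation.
 */
public class SkipBufferedNodesSketch {

    record TopicPartition(String topic, int partition) { }
    record Node(int id) { }

    /** Leaders of partitions that still have buffered data and are still fetchable. */
    static Set<Node> bufferedNodes(Set<TopicPartition> buffered,
                                   Predicate<TopicPartition> isFetchable,
                                   Map<TopicPartition, Node> leaderFor) {
        Set<Node> nodes = new HashSet<>();
        for (TopicPartition tp : buffered) {
            // A paused or unassigned buffered partition should not block fetching
            // from its node (the isFetchable() point raised in the review below).
            if (isFetchable.test(tp) && leaderFor.containsKey(tp))
                nodes.add(leaderFor.get(tp));
        }
        return nodes;
    }

    /** Group partitions by leader, skipping any node that leads buffered data. */
    static Map<Node, List<TopicPartition>> fetchRequests(Set<TopicPartition> assigned,
                                                         Set<TopicPartition> buffered,
                                                         Predicate<TopicPartition> isFetchable,
                                                         Map<TopicPartition, Node> leaderFor) {
        Set<Node> skip = bufferedNodes(buffered, isFetchable, leaderFor);
        Map<Node, List<TopicPartition>> requests = new HashMap<>();
        for (TopicPartition tp : assigned) {
            if (!isFetchable.test(tp) || buffered.contains(tp))
                continue; // buffered or non-fetchable partitions are never requested
            Node leader = leaderFor.get(tp);
            // Sending no request at all to a skipped node means its fetch session
            // never sees "holes" where the buffered partitions should be.
            if (leader == null || skip.contains(leader))
                continue;
            requests.computeIfAbsent(leader, n -> new ArrayList<>()).add(tp);
        }
        return requests;
    }

    public static void main(String[] args) {
        Map<TopicPartition, Node> leaderFor = Map.of(
                new TopicPartition("events", 0), new Node(1),
                new TopicPartition("events", 1), new Node(2));
        Set<TopicPartition> buffered = Set.of(new TopicPartition("events", 1));
        // events-1 is buffered, so node 2 is skipped entirely; node 1 still gets a request.
        System.out.println(fetchRequests(leaderFor.keySet(), buffered, tp -> true, leaderFor));
    }
}
```

Skipping the whole node, rather than sending a request with the buffered partitions missing, is what keeps the incremental fetch session intact for those partitions.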

Testing

Eviction rate testing

Here are the results of our internal stress testing:

  • ClassicKafkaConsumer: after the initial spike during test start-up, the average rate settles down to ~0.14 evictions/second
  • AsyncKafkaConsumer (w/o fix): after start-up, the evictions still settle down, but at ~1.48 evictions/second they are roughly 10x higher than the ClassicKafkaConsumer
  • AsyncKafkaConsumer (w/ fix): the eviction rate is now much closer to the ClassicKafkaConsumer at ~0.22 evictions/second

EndToEndLatency testing

The bundled EndToEndLatency test runner was executed on a single machine using Docker. The apache/kafka:latest Docker image was used with either the cluster/combined/plaintext/docker-compose.yml or the single-node/plaintext/docker-compose.yml Docker Compose configuration file, depending on the test. The Docker containers were recreated from scratch before each test.

A single topic was created with 30 partitions and a replication factor of either 1 or 3, depending on whether the setup was single- or multi-node.

Each test run used the following argument values:

  • Message count: 100000
  • acks: 1
  • Message size: 128 bytes

A configuration file which contained a single configuration value of group.protocol=<$group_protocol> was also provided to the test, where $group_protocol was either CLASSIC or CONSUMER.
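
For example, the CONSUMER runs used a client configuration file along these lines (the file name is arbitrary):

```properties
# Passed to EndToEndLatency as the client configuration file.
# The CLASSIC runs used group.protocol=CLASSIC instead.
group.protocol=CONSUMER
```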

Test results

Test 1—CLASSIC group protocol, cluster size: 3 nodes, replication factor: 3

| Metric | trunk (ms) | PR (ms) |
|---|---|---|
| Average latency | 1.4901 | 1.4871 |
| 50th percentile | 1 | 1 |
| 99th percentile | 3 | 3 |
| 99.9th percentile | 6 | 6 |

Test 2—CONSUMER group protocol, cluster size: 3 nodes, replication factor: 3

| Metric | trunk (ms) | PR (ms) |
|---|---|---|
| Average latency | 1.4704 | 1.4807 |
| 50th percentile | 1 | 1 |
| 99th percentile | 3 | 3 |
| 99.9th percentile | 6 | 7 |

Test 3—CLASSIC group protocol, cluster size: 1 node, replication factor: 1

| Metric | trunk (ms) | PR (ms) |
|---|---|---|
| Average latency | 1.0777 | 1.0193 |
| 50th percentile | 1 | 1 |
| 99th percentile | 2 | 2 |
| 99.9th percentile | 5 | 4 |

Test 4—CONSUMER group protocol, cluster size: 1 node, replication factor: 1

| Metric | trunk (ms) | PR (ms) |
|---|---|---|
| Average latency | 1.0937 | 1.0503 |
| 50th percentile | 1 | 1 |
| 99th percentile | 2 | 2 |
| 99.9th percentile | 4 | 4 |

Conclusion

These tests did not reveal any significant differences between the current fetcher logic on trunk and the one proposed in this PR. Additional test runs using larger message counts and/or larger message sizes did not affect the results.

… the new consumer

Updated the FetchRequestManager to only create and enqueue fetch requests when signaled to do so by a FetchEvent.
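
Conceptually, that signaling flow looks something like the sketch below; the class, queue, and method names are illustrative and are not the real FetchRequestManager/FetchEvent APIs.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Conceptual sketch of "fetch on signal"; names are illustrative, not the real consumer internals. */
public class FetchSignalSketch {

    /** Marker event from the application thread asking for a new fetch request. */
    record FetchSignal() { }

    private final BlockingQueue<FetchSignal> events = new LinkedBlockingQueue<>();

    /** Application thread: only signal the background thread when nothing is buffered locally. */
    void maybeSignalFetch(boolean hasBufferedData) {
        if (!hasBufferedData)
            events.offer(new FetchSignal());
    }

    /** Background thread: create and enqueue fetch requests only when signaled. */
    void backgroundPoll() {
        if (events.poll() != null) {
            // ... build and enqueue fetch requests here ...
        }
    }
}
```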
@kirktrue requested a review from Copilot · February 4, 2025 00:21


Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

clients/src/test/java/org/apache/kafka/clients/consumer/internals/FetcherTest.java:215

  • [nitpick] The parameter name 'numNodes' could be more descriptive. Consider renaming it to 'numberOfNodes'.
private void assignFromUser(Set<TopicPartition> partitions, int numNodes) {
@kirktrue (Collaborator, Author) left a comment:

The two sub-comments highlight the key differences between #17700 and this PR.

@kirktrue (Collaborator, Author) commented Feb 4, 2025:

@jeffkbkim @junrao @lianetm—this is the second attempt at fixing the fetch session eviction bug (#17700). I've highlighted the differences between the two PRs. It really boils down to the addition of a check for paused partitions. I've added the check, comments, and another relevant unit test.

I ran the StandbyTaskEOSMultiRebalanceIntegrationTest integration test over 260 times without failure (before this change, it failed within 5 tries).

cc @mjsax

@junrao (Contributor) left a comment:

@kirktrue : Thanks for the PR. A couple of comments.

// data to be returned.
//
// See FetchCollector.collectFetch().
if (subscriptions.isPaused(partition))
junrao (Contributor):
Should we just use subscriptions.isFetchable? Intuitively, an un-assigned/revoking partition shouldn't block the fetching of other partitions.

kirktrue (Collaborator, Author):
Done. isFetchable() replaces both the isAssigned() and isPaused() calls. I did a little more refactoring and moved that whole loop into a new bufferedNodes() method.

//
// - tp0 was collected and thus not in the fetch buffer
// - tp1, while still in the fetch buffer, is paused and its node should be ignored
assertEquals(1, sendFetches());
junrao (Contributor):
Is there an easy way to verify the partitions in the fetch request?

kirktrue (Collaborator, Author):
Yes, good suggestion. I've updated all the tests to ensure the partitions and nodes in the requests are as expected. I also added three more tests for other cases isFetchable() should catch.

Labels: Blocker (this pull request is identified as solving a blocker for a release), clients, consumer, ctr (Consumer Threading Refactor, KIP-848), KIP-848 (The Next Generation of the Consumer Rebalance Protocol)