Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC-11961 Provide comprehensive documentation for hotspots #19282

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

florence-crl
Copy link
Contributor

@florence-crl florence-crl commented Jan 6, 2025

Fixes DOC-11961

Added understand-hotspots.md and corresponding images.
In sidebar-data/troubleshooting.json, added link to understand-hotspots.md.

Rendered preview:

Copy link

github-actions bot commented Jan 6, 2025

Files changed:

Copy link

netlify bot commented Jan 6, 2025

Deploy Preview for cockroachdb-api-docs canceled.

Name Link
🔨 Latest commit 171a9cc
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-api-docs/deploys/67b644194d25ff00085af666

Copy link

netlify bot commented Jan 6, 2025

Deploy Preview for cockroachdb-interactivetutorials-docs canceled.

Name Link
🔨 Latest commit 171a9cc
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-interactivetutorials-docs/deploys/67b644195be58d0008e04105

Copy link

netlify bot commented Jan 6, 2025

Netlify Preview

Name Link
🔨 Latest commit 171a9cc
🔍 Latest deploy log https://app.netlify.com/sites/cockroachdb-docs/deploys/67b644199bc4f70008620f2f
😎 Deploy Preview https://deploy-preview-19282--cockroachdb-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link

@angles-n-daemons angles-n-daemons left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great, left a couple comments - but appreciate all the extra work put into this (and the links are amazing).

Copy link
Contributor Author

@florence-crl florence-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@angles-n-daemons
Copy link

Florence one more thing to note, the hot key visual may be misleading. While write hotspots can go up to 20k writes per second for an index hotspot, hot keys are generally limited to < 1k writes per second. The image I created seems to indicate that they can go much higher, which I doubt is the case in most deployments. Not sure if this is important for documentation, but it might be worth updating the image. I've updated it in the "Talking about hotspots" document

@florence-crl
Copy link
Contributor Author

Florence one more thing to note, the hot key visual may be misleading. While write hotspots can go up to 20k writes per second for an index hotspot, hot keys are generally limited to < 1k writes per second. The image I created seems to indicate that they can go much higher, which I doubt is the case in most deployments. Not sure if this is important for documentation, but it might be worth updating the image. I've updated it in the "Talking about hotspots" document

I updated the 2 images.

Copy link

@kevin-v-ngo kevin-v-ngo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together! Few comments.

Copy link
Contributor Author

@florence-crl florence-crl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevin-v-ngo updated according to your comments. tftr
@angles-n-daemons, please provide clarification to the unresolved conversations.

@florence-crl
Copy link
Contributor Author

@angles-n-daemons tftr!
@kevin-v-ngo , please review the updated sections:

Copy link

@kevin-v-ngo kevin-v-ngo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Approved pending small change. Thanks Florence!

Copy link
Contributor

@kathancox kathancox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow! I learned a lot from this. Thanks @florence-crl. I left a bunch of non-blocking comments, these are all suggestions/questions I have for clarification. Let me know if you want me to take another look.


Because [`CURRENT_TIMESTAMP()`]({% link {{ page.version.version }}/functions-and-operators.md %}#date-and-time-functions) is a steadily increasing value, this [`UPDATE`]({% link {{ page.version.version }}/update.md %}) will similarly burden the range at the tail of the index. While hotspots on an index tail tend to be the most common, bottlenecks on the head are not unheard of. For example, indexing on `DESC` with the same insertion strategy will cause a hotspot.

In this page, the phrase _index hotspot_ will be reserved for a hot by write hotspot on an index, even though indexes can become hot due to read. This is because a hot by write index hotspot is the most common hotspot pattern that occurs now and in the future as workloads continue to be migrated from legacy single-node installations. Hot by read index hotspots are defined later on this page as [_lookback hotspots_](#lookback-hotspots).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about making this paragraph a note? I would suggest "On this page"


**Synonyms:** Outbox hotspot

A _queuing hotspot_ is a type of index hotspot that occurs when an workload treats CockroachDB like a distributed queue. This can happen if you implement the [Outbox microservice pattern]({% link {{ page.version.version }}/cdc-queries.md %}#queries-and-the-outbox-pattern).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A _queuing hotspot_ is a type of index hotspot that occurs when an workload treats CockroachDB like a distributed queue. This can happen if you implement the [Outbox microservice pattern]({% link {{ page.version.version }}/cdc-queries.md %}#queries-and-the-outbox-pattern).
A _queuing hotspot_ is a type of index hotspot that occurs when a workload treats CockroachDB like a distributed queue. This can happen if you implement the [Outbox microservice pattern]({% link {{ page.version.version }}/cdc-queries.md %}#queries-and-the-outbox-pattern).


The following image visualizes a keyspace with multiple hot rows. In a large enough cluster, each of these rows can burden the range they live in, leading to multiple burdened nodes.

<img src="{{ 'images/v25.1/hotspots-figure-7.png' | relative_url }}" alt="Multiple row hotspots example" style="border:1px solid #eee;max-width:100%" />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about including celebrity names in a diagram — my feeling is that if we just use a user + number or even celebrity_a, this will make the diagram more timeless and something we have control over.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about including celebrity names in a diagram — my feeling is that if we just use a user + number or even celebrity_a, this will make the diagram more timeless and something we have control over.


The word _hotspot_ describes various skewed data access patterns in a [cluster]({% link {{ page.version.version }}/architecture/overview.md %}#cluster), often manifesting as higher [CPU]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#cpu) utilization on one or more [nodes]({% link {{ page.version.version }}/architecture/overview.md %}#node). Hotspots can also be based on [disk I/O]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#storage-and-disk-i-o), [memory]({% link {{ page.version.version }}/common-issues-to-monitor.md %}#memory) usage, or other finite resources. Hotspots are troublesome because they are often limited to a fixed-size subset of the cluster’s resources, which puts them in a class of performance issues that cannot be solved by [scaling the cluster size]({% link {{ page.version.version }}/frequently-asked-questions.md %}#how-does-cockroachdb-scale).

### Hot Node
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all the subtitles, do we want to change to sentence case as per the style guide? Also, for the "Synonyms", there was a difference in capitalization, probably keep them all lower case?


<img src="{{ 'images/v25.1/hotspots-figure-6.png' | relative_url }}" alt="Single row hotspot example" style="border:1px solid #eee;max-width:100%" />

Without changing the default behavior of the system, the load will not be distributed because it needs to be served by a single range. This behavior is not just temporary; certain users, such as celebrities and influencers, may consistently experience a high volume of activity compared to the average user. This can result in a system with multiple hotspots, each of which can potentially overload the system at any moment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Without changing the default behavior of the system, the load will not be distributed because it needs to be served by a single range. This behavior is not just temporary; certain users, such as celebrities and influencers, may consistently experience a high volume of activity compared to the average user. This can result in a system with multiple hotspots, each of which can potentially overload the system at any moment.
Without changing the default behavior of the system, the load will not be distributed because it needs to be served by a single range. This behavior is not just temporary; certain users may consistently experience a high volume of activity compared to the average user. This can result in a system with multiple hotspots, each of which can potentially overload the system at any moment.

I'm not sure that defining the users is necessary.


Because sequences avoid user expressions, optimizations can be made to improve their performance, but unfortunately the write volume on the sequence is still that of the sum total of all its accesses.

[Sequence caching]({% link {{ page.version.version }}/create-sequence.md %}#cache-sequence-values-in-memory), which allows clients to cache sequence values as to reduce the burden on the target range, serves as a good mitigation for hot sequences. Alternatively, the `unique_rowid()` function generates sequential values which have strong guarantees against collision, with the drawback that its values are not a series.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[Sequence caching]({% link {{ page.version.version }}/create-sequence.md %}#cache-sequence-values-in-memory), which allows clients to cache sequence values as to reduce the burden on the target range, serves as a good mitigation for hot sequences. Alternatively, the `unique_rowid()` function generates sequential values which have strong guarantees against collision, with the drawback that its values are not a series.
[Sequence caching]({% link {{ page.version.version }}/create-sequence.md %}#cache-sequence-values-in-memory), which allows clients to cache sequence values to reduce the burden on the target range, serves as a good mitigation for hot sequences. Alternatively, the `unique_rowid()` function generates sequential values which have strong guarantees against collision, with the drawback that its values are not a series.


<img src="{{ 'images/v25.1/hotspots-figure-9.png' | relative_url }}" alt="Table hotspot example" style="border:1px solid #eee;max-width:100%" />

Reads in the `posts` table may be evenly distributed, but joining the `countries` table becomes a bottleneck, since it exists in so few ranges. Splitting the `countries` table ranges can relieve pressure, but only to a theoretical limit as the indivisible points, the rows themselves, experience high throughput. [Global tables]({% link {{ page.version.version }}/global-tables.md %}) and [follower reads]({% link {{ page.version.version }}/follower-reads.md %}) can help scaling in this case, especially when write throughput is low.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Reads in the `posts` table may be evenly distributed, but joining the `countries` table becomes a bottleneck, since it exists in so few ranges. Splitting the `countries` table ranges can relieve pressure, but only to a theoretical limit as the indivisible points, the rows themselves, experience high throughput. [Global tables]({% link {{ page.version.version }}/global-tables.md %}) and [follower reads]({% link {{ page.version.version }}/follower-reads.md %}) can help scaling in this case, especially when write throughput is low.
Reads in the `posts` table may be evenly distributed, but joining the `countries` table becomes a bottleneck, since it exists in so few ranges. Splitting the `countries` table ranges can relieve pressure, but only to a limit as the indivisible rows experience high throughput. [Global tables]({% link {{ page.version.version }}/global-tables.md %}) and [follower reads]({% link {{ page.version.version }}/follower-reads.md %}) can help scaling in this case, especially when write throughput is low.

) STORED;
~~~

In the `ALTER` statement, the first condition `province <> 'alabama'` checks whether the province is not `alabama`. It matches every single row in the table that is not `alabama`, and will ironically place them in the `us-east-1` region.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My sense is that the "ironically" may be lost on some users?


**Synonyms:** time-based hotspot

_Temporal hotspots_ refer to increased database usage during particular windows of time. These take a variety of shapes, from event and holiday usage (such as Black Friday or the Superbowl), to synchronized job runs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit split on the mentions of "Black Friday and the Superbowl". I think it's good to provide practical examples that users can understand, but my one worry was that they're both US-centric and I wonder whether they're timeless? Do you think just "from event and holiday usage" is enough without the parenthesis?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants