Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Managing false positives #1063

Merged
merged 19 commits into from
Oct 16, 2024
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
65 changes: 65 additions & 0 deletions site/content/docs/monitoring/false-positives.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
---
title: Managing false positives
weight: 4
menu:
platform:
parent: "Monitoring"
---

Effective monitoring stands and falls with its accuracy. You want to be alerted-and, if it comes to it, woken up in the middle of the night-only when there's a real issue. However, the reality is that false positives can occur for various reasons. A check might be affected by a harmless glitch in the network connection between Checkly and your check target, causing your site to be flagged as slow or unreachable, when, in fact, it is running perfectly fine.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add: We aim to help you achieve the lowest false positive rate in the market.


The public internet, with its millions of networks and nodes, can be unpredictable, making occasional hiccups unavoidable. Thankfully, many of these are just isolated incidents that resolve on their own, and there are ways to protect your monitoring setup from them.

In this section, we'll dive into why these issues happen and share strategies on minimizing the impact of false positives, allowing you to more clearly distinguish between a real problem and a simple network blip.

## Common Causes of Performance Variations

When setting up your monitoring strategy, it's important to consider a few common factors that can affect networking performance:

- **Network Latency:** The time it takes for data to travel between points. This can vary due to distance, routing, or congestion.
- **Packet Loss:** Sometimes data packets don’t reach their destination and need to be resent, leading to delays.
- **Server Response Time:** Servers might occasionally take longer to respond, or load balancers and backend services could be at capacity.
- **Infrastructure Issues:** Temporary issues like routing errors, server load spikes, or scheduled maintenance can also impact performance.
sujaya-sys marked this conversation as resolved.
Show resolved Hide resolved
- **Security-Related Blockages:** Firewalls, DDoS protection, or other security measures might prevent Checkly’s monitoring traffic from reaching your servers if flagged as suspicious. This can disrupt checks and result in false positives.

While these factors are generally well-managed, they can’t be completely avoided. However, you can take steps to make your monitoring setup more resilient.

## Best Practices to Minimize False Positives

False positives can lead to unnecessary alerts, but you can keep them under control by following these best practices:

### Use Retries

Retries are your first line of defense against transient issues. By setting up retries, your checks can automatically try again if the first attempt fails because of a temporary glitch. This reduces the impact of brief disruptions and helps prevent false positives.

While you might want to adjust based on specific use cases, we suggest implementing at least one retry for checks with a high escalation impact as a standard best practice.

Learn more about implementing retries: [Retries at Checkly](https://www.checklyhq.com/docs/alerting-and-retries/retries/#retries).

### Run Checks from Multiple Locations

A temporary issue from one location might cause a check to fail, but that doesn't always mean your application is down. With 22 available regions, Checkly lets you run checks from multiple locations simultaneously. This helps you spot the difference between a localized network problem and a real issue with your app.

We recommend running checks from at least two different regions to get a clearer picture and reduce the chance of a false positive caused by a regional failure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't that why I use a specific region? I want to understand if my app is not performing in a specific region.


Learn more about setting up multi-location checks: [Global Location and Scheduling Strategies](https://www.checklyhq.com/docs/monitoring/global-locations/).

### Fine-Tune Your Alerting

Once you’ve implemented retries and multi-location checks, you can adjust your alert settings to avoid unnecessary notifications. For example, you can require failures in multiple regions or after several retries before firing off an alert. This helps you avoid alerts for short-lived, isolated failures that don’t need immediate attention.

A good starting point is to set alerts to trigger only after a check has failed at least twice (or after two intervals) and has failed in more than one region.

Learn more about fine-tuning your alerts: [Alert Settings at Checkly](https://www.checklyhq.com/docs/alerting-and-retries/alert-settings/).

### Allowlist Checkly Traffic

In some cases, firewalls, load balancers, or security solutions might inadvertently block Checkly's monitoring traffic, which could interfere with running checks successfully. This can happen if your system detects Checkly’s requests as suspicious or treats them as unwanted traffic.

To prevent this, we recommend reviewing your security settings to ensure that Checkly’s IP ranges are included in your allowlist.

You can find Checkly’s current IP ranges here: [Allowlisting & filtering traffic](https://www.checklyhq.com/docs/monitoring/allowlisting/#ip-range-allowlisting).

## Need Help?

If you’re not sure what’s going on or if an issue keeps popping up without a clear reason, don’t hesitate to reach out. Our support team is here to help you troubleshoot any unusual check performance. You can contact us at [email protected].
2 changes: 1 addition & 1 deletion site/content/docs/monitoring/how-we-store-data.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
---
title: How we store data
weight: 4
weight: 5
menu:
platform:
parent: "Monitoring"
Expand Down
Loading