Defining Alert Thresholds
Defining alert thresholds means establishing the specific criteria under which the monitoring system triggers an alert, signalling a potential issue or anomaly in the application or infrastructure.
Goal
The primary goal is to enable proactive identification and resolution of issues, minimising the impact on users and ensuring the system operates within its desired parameters.
Context
Things will go wrong in production. We need to be able to quickly identify and address these issues to minimise their impact on users and the business.
Threshold Types
Type | Description |
---|---|
User Behaviour Thresholds | Criteria based on user interactions and behaviour, such as session length and conversion rates. |
Performance Thresholds | Criteria based on application performance metrics, such as response times and throughput. |
Resource Utilisation Thresholds | Limits set on the usage of system resources, like CPU, memory, and disk space. |
Error Rate Thresholds | Defined levels for acceptable error rates within the application's operations. |
Cost Thresholds | Limits on cloud resource costs to manage and optimise spending. |
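As a rough illustration, each threshold type in the table can be written down as a small record: a metric, a comparison, and a limit. The sketch below is one possible shape and is not tied to any particular monitoring tool. The metric names and limits are hypothetical placeholders and should be replaced with values derived from your own data.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Threshold:
    category: str                            # threshold type from the table above
    metric: str                              # hypothetical metric name
    compare: Callable[[float, float], bool]  # how a value is compared to the limit
    limit: float                             # hypothetical limit, for illustration only

    def breached(self, value: float) -> bool:
        """Return True when the observed value violates this threshold."""
        return self.compare(value, self.limit)


# One example per threshold type. All names and numbers are assumptions.
THRESHOLDS = [
    Threshold("user_behaviour", "checkout_conversion_rate", operator.lt, 0.02),
    Threshold("performance", "p95_response_ms", operator.gt, 500.0),
    Threshold("resource_utilisation", "cpu_utilisation", operator.gt, 0.85),
    Threshold("error_rate", "http_5xx_ratio", operator.gt, 0.01),
    Threshold("cost", "daily_cloud_spend_usd", operator.gt, 1200.0),
]


def breached_metrics(sample: dict[str, float]) -> list[str]:
    """List the metrics in `sample` that violate their configured threshold."""
    return [
        t.metric
        for t in THRESHOLDS
        if t.metric in sample and t.breached(sample[t.metric])
    ]
```

Note that the comparison direction differs by type: a conversion rate is a problem when it falls below its limit, whereas response time, utilisation, error rate, and cost are problems when they rise above theirs.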
Inputs
Artifact | Description |
---|---|
Real-time Application Performance Data | Data collected from the application's monitoring tools, providing insights into performance and usage. |
Service Level Objectives (SLOs) | Agreed-upon performance and reliability targets for the service. |
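One common way to combine these two inputs is to express the observed error ratio as a multiple of the error budget implied by the SLO, often called a burn rate. This technique is not prescribed by this page. The sketch below assumes an availability-style SLO expressed as a fraction, and the function names and the 99.9% target are illustrative.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO (e.g. 0.999 -> ~0.001)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    A value of 1.0 means errors are arriving at exactly the rate the SLO
    permits; values above 1.0 mean the budget is being spent too fast.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so nothing has been spent
    return (failed / total) / error_budget(slo_target)


# Example: 30 failures out of 10,000 requests against an assumed 99.9% SLO.
rate = burn_rate(failed=30, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```

A threshold can then be set on the burn rate rather than on the raw error count, which keeps the alert tied to the agreed target instead of an arbitrary number.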
Outputs
Artifact | Description | Benefits |
---|---|---|
Automated Alerts | Alerts configured against the defined thresholds that notify stakeholders when a threshold is breached. | Enables timely detection and notification of issues, facilitating quick response. |
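A minimal sketch of what an automated alert does: compare a reading to its limit and, on a breach, hand a message to a notification channel. The `notify` callable is a stand-in for whatever channel is in use (chat, paging, email). It and the message format are assumptions for illustration.

```python
from typing import Callable


def check_and_alert(
    metric: str,
    value: float,
    limit: float,
    notify: Callable[[str], None],
) -> bool:
    """Send a notification if `value` exceeds `limit`; return whether it did."""
    if value > limit:
        notify(f"ALERT {metric}: observed {value}, threshold {limit}")
        return True
    return False


# Usage: collect notifications in a list instead of calling a real channel.
sent: list[str] = []
check_and_alert("http_5xx_ratio", 0.03, 0.01, sent.append)
check_and_alert("http_5xx_ratio", 0.002, 0.01, sent.append)
```

Passing the channel in as a parameter keeps the threshold logic testable without contacting a real notification service.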
Anti-patterns
- Over-Alerting: Setting thresholds too sensitively produces frequent, often unnecessary alerts and leads to alert fatigue. Systems fluctuate naturally, and not every fluctuation indicates a problem.
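One possible mitigation for over-alerting, sketched here as an assumption rather than a recommendation from this page, is to alert only when a threshold stays breached for several consecutive samples, so that a single spike does not page anyone. The class name, the limit, and the window size are hypothetical.

```python
from collections import deque


class SustainedBreachDetector:
    """Fires only after `required_consecutive` breaching samples in a row."""

    def __init__(self, limit: float, required_consecutive: int) -> None:
        if required_consecutive < 1:
            raise ValueError("required_consecutive must be at least 1")
        self.limit = limit
        # Sliding window holding the breach status of the most recent samples.
        self._recent: deque[bool] = deque(maxlen=required_consecutive)

    def observe(self, value: float) -> bool:
        """Record a sample and report whether the breach is sustained."""
        self._recent.append(value > self.limit)
        return len(self._recent) == self._recent.maxlen and all(self._recent)


# Usage: a lone spike (620) is ignored; three breaches in a row trigger.
detector = SustainedBreachDetector(limit=500.0, required_consecutive=3)
results = [detector.observe(v) for v in [620.0, 480.0, 510.0, 530.0, 560.0]]
```

The trade-off is detection delay: a longer window suppresses more noise but also postpones the alert for a genuine incident.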