Defining Alert Thresholds

Defining alert thresholds means setting the specific criteria under which the monitoring system triggers an alert, signalling a potential issue or anomaly in the application or infrastructure.

Goal

The primary goal is to enable proactive identification and resolution of issues, minimising the impact on users and ensuring the system operates within its desired parameters.

Context

Things will go wrong in production. We need to be able to quickly identify and address these issues to minimise their impact on users and the business.

Threshold Types

Type	Description
User Behaviour Thresholds	Criteria based on user interactions and behaviour, such as session length and conversion rates.
Performance Thresholds	Criteria based on application performance metrics, such as response times and throughput.
Resource Utilisation Thresholds	Limits set on the usage of system resources, like CPU, memory, and disk space.
Error Rate Thresholds	Defined levels for acceptable error rates within the application's operations.
Cost Thresholds	Limits on cloud resource costs to manage and optimise spending.
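
As an illustration only, the threshold types above could be expressed as data and checked against a sample of current metric values. This is a minimal sketch: the metric names, the limits, and the `Threshold` and `breached` helpers are all hypothetical and not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from enum import Enum


class Direction(Enum):
    ABOVE = "above"  # breach when the value rises above the limit
    BELOW = "below"  # breach when the value falls below the limit


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float
    direction: Direction

    def is_breached(self, value: float) -> bool:
        """Return True when `value` is on the wrong side of the limit."""
        if self.direction is Direction.ABOVE:
            return value > self.limit
        return value < self.limit


# Hypothetical metric names and limits, one per threshold type.
THRESHOLDS = [
    Threshold("conversion_rate", 0.02, Direction.BELOW),         # user behaviour
    Threshold("p95_response_time_ms", 500.0, Direction.ABOVE),   # performance
    Threshold("cpu_utilisation", 0.85, Direction.ABOVE),         # resource utilisation
    Threshold("error_rate", 0.01, Direction.ABOVE),              # error rate
    Threshold("daily_cloud_cost_usd", 1200.0, Direction.ABOVE),  # cost
]


def breached(thresholds, sample):
    """Return the names of metrics in `sample` that breach their threshold.

    Metrics missing from `sample` are skipped rather than treated as breaches.
    """
    return [
        t.metric
        for t in thresholds
        if t.metric in sample and t.is_breached(sample[t.metric])
    ]
```

Note that user behaviour thresholds often alert on a value dropping below a limit, while the other types usually alert on a value rising above one, which is why the sketch records a direction per threshold.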

Inputs

Artifact	Description
Real-time Application Performance Data	Data collected from the application's monitoring tools, providing insights into performance and usage.
Service Level Objectives (SLOs)	Agreed-upon performance and reliability targets for the service.
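
One common way to turn an SLO into an error rate threshold is to compare the observed error rate with the error budget the SLO allows. The sketch below assumes an availability-style SLO expressed as a fraction (for example 0.999); the function names and the default `max_burn_rate` of 2.0 are hypothetical choices for illustration, not recommended values.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail under the SLO, e.g. 0.001 for 0.999."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget.

    A value of 1.0 means errors arrive exactly as fast as the SLO permits;
    higher values mean the budget is being consumed faster than that.
    """
    if requests == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (errors / requests) / error_budget(slo_target)


def should_alert(errors: int, requests: int, slo_target: float,
                 max_burn_rate: float = 2.0) -> bool:
    """Alert when the budget burns faster than `max_burn_rate` (an assumed limit)."""
    return burn_rate(errors, requests, slo_target) > max_burn_rate
```

For example, with a 0.999 target, 50 errors in 10,000 requests gives a burn rate of about 5, which would alert under the assumed limit, while 5 errors in 10,000 requests gives about 0.5 and would not.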

Outputs

Artifact	Description	Benefits
Automated Alerts	Configured alerts based on defined thresholds, triggering notifications to stakeholders when breached.	Enables timely detection and notification of issues, facilitating quick response.

Anti-patterns

Over-Alerting: Setting thresholds so sensitively that they fire frequent, often unnecessary alerts, which leads to alert fatigue. Systems fluctuate naturally, and not every fluctuation indicates a problem.
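
One possible way to reduce over-alerting is to fire only when a threshold stays breached for several consecutive samples, so that a single spike is ignored. The following is a minimal sketch of that idea; the class name `SustainedBreachDetector` and its parameters are hypothetical, and the right number of samples depends on how often the metric is collected.

```python
from collections import deque


class SustainedBreachDetector:
    """Fire only after `required_breaches` consecutive samples exceed `limit`."""

    def __init__(self, limit: float, required_breaches: int):
        if required_breaches < 1:
            raise ValueError("required_breaches must be at least 1")
        self.limit = limit
        # Keeps only the most recent breach/no-breach results.
        self._recent = deque(maxlen=required_breaches)

    def observe(self, value: float) -> bool:
        """Record one sample and return True if the alert should fire now."""
        self._recent.append(value > self.limit)
        window_full = len(self._recent) == self._recent.maxlen
        return window_full and all(self._recent)


# Usage: CPU utilisation samples against an assumed limit of 0.85.
detector = SustainedBreachDetector(limit=0.85, required_breaches=3)
results = [detector.observe(v) for v in [0.9, 0.7, 0.9, 0.9, 0.9, 0.6]]
```

In this usage the isolated first spike does not fire; the alert fires only on the fifth sample, once three breaches in a row have been seen, and clears again when the value drops.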