Monitoring
In software development, monitoring is the continuous observation of a system's operation to confirm that it performs as expected. It involves collecting and analysing operational data, then acting on it to identify and resolve issues proactively.
Purpose
The purpose of monitoring is to identify issues quickly so you can maintain high availability, performance, and reliability of software applications.
- Early Detection of Issues: Identify and resolve problems before they affect users.
- Performance Optimisation: Continuous feedback on performance allows for adjustments to improve efficiency.
- Enhanced User Satisfaction: Ensures a seamless and responsive user experience.
Context
Industry Context
Systems go down. Applications fail. Users encounter issues. Monitoring is essential to identify and address these problems quickly, ensuring that applications remain available and performant.
ZeroBlockers Context
As you increase the pace of delivery, there is always a risk that you will introduce new issues. Monitoring shortens the gap between a release introducing a problem and the team noticing it, so that faster delivery does not come at the cost of availability or performance.
Methods
Method | Description | Benefits |
---|---|---|
Instrumentation | The process of integrating monitoring tools and code within an application to collect data on its operation, such as performance metrics, error rates, and usage patterns. | Enables real-time visibility into application behaviour, facilitates troubleshooting, and supports performance optimisation. |
Defining Alert Thresholds | Defining specific criteria or metrics that, when breached, trigger notifications or alerts to stakeholders, indicating potential issues or anomalies. | Allows teams to proactively address issues before they impact users or escalate into more significant problems, minimising downtime. |
Blameless Postmortems | A collaborative analysis of incidents or failures without assigning blame to individuals, focusing on understanding root causes and systemic issues. | Encourages a culture of transparency, trust, and continuous improvement, enabling teams to learn from failures and prevent recurrence. |
Backup and Recovery | Implementing systematic processes for creating regular backups of data and applications, and ensuring that they can be quickly restored in case of data loss. | Protects against data loss and ensures quick recovery in case of incidents, minimising downtime and data corruption. |
Disaster Recovery Planning | Developing a structured approach for responding to catastrophic events that cause system downtime or data loss, ensuring the organisation can quickly recover. | Enhances organisational resilience, reduces recovery time after disasters, and minimises potential losses. |
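
The first two methods in the table, instrumentation and alert thresholds, can be sketched together in a few lines. The example below is a minimal illustration and not a recommended production setup: the `Metrics` class, the `instrumented` decorator, the `breached` function, and the 5% error-rate threshold are all hypothetical names and values chosen for this sketch. A real system would normally send these measurements to a dedicated monitoring tool instead of holding them in memory.

```python
import time
from collections import deque
from functools import wraps


class Metrics:
    """In-memory record of recent call outcomes (hypothetical helper)."""

    def __init__(self, window=100):
        self.latencies = deque(maxlen=window)  # seconds taken by each call
        self.outcomes = deque(maxlen=window)   # True for success, False for error

    def record(self, latency, ok):
        self.latencies.append(latency)
        self.outcomes.append(ok)

    def error_rate(self):
        """Fraction of failed calls within the window (0.0 when empty)."""
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)


def instrumented(metrics):
    """Decorator that records latency and success or failure for each call."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                metrics.record(time.perf_counter() - start, ok=False)
                raise  # instrumentation must not swallow the error
            metrics.record(time.perf_counter() - start, ok=True)
            return result
        return wrapper
    return decorator


def breached(metrics, max_error_rate=0.05):
    """Return True when the observed error rate exceeds the alert threshold."""
    return metrics.error_rate() > max_error_rate


metrics = Metrics()


@instrumented(metrics)
def handle_request(fail=False):
    # Stand-in for real application work; `fail` simulates an error.
    if fail:
        raise RuntimeError("simulated failure")
    return "ok"
```

Each call to `handle_request` adds one latency and one outcome to `metrics`. A periodic check of `breached(metrics)` then decides whether to notify someone. Keeping the threshold as a parameter makes it straightforward to tune as the team learns what a normal error rate looks like.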
Anti-patterns
- Under-Monitoring: Failing to monitor key aspects of the system, resulting in blind spots and undetected issues.
- Overly Frequent Alerts: Generating excessive alerts that overwhelm teams and desensitise them to critical notifications.
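
One common way to reduce overly frequent alerts is to suppress repeats of the same alert during a cooldown period. The sketch below assumes a hypothetical `AlertThrottle` class and a five-minute default cooldown. Both are illustrative choices, not something this page prescribes.

```python
class AlertThrottle:
    """Suppress repeat alerts for the same key within a cooldown period."""

    def __init__(self, cooldown_seconds=300.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> time of the last notification

    def should_notify(self, key, now):
        """Return True if an alert for `key` should be sent at time `now`."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down: drop the duplicate
        self._last_sent[key] = now
        return True
```

The caller passes the current time explicitly, for example from `time.monotonic()`, which keeps the class easy to test. Alerts with different keys are throttled independently, so a noisy database alert does not hide a new API alert.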
Case Studies
Improving Team Performance with a Blameless Culture
How Monzo Bank enhanced team performance and learning by creating a blameless culture.
Optimising Response Times with Advanced Performance Monitoring
How Sky improved response times and application performance through advanced monitoring and production support.
Enhancing Product Health with Instrumentation
How Google improved product health and performance by implementing a comprehensive instrumentation strategy.
Enhancing Customer Satisfaction through Effective Production Monitoring and Support
How Reddit improved customer satisfaction by implementing robust production monitoring and support practices.