dev #3308
open [BE] Setup alerting system in aiaxio
Description
Grafana Alert Rules for Backend
Details:
The following alert rules have been configured in Grafana using Terraform:
- Request Latency Alert
  - Metric: 95th Percentile Response Time (P95)
  - Evaluation Interval: 10 minutes
  - Threshold: Configured per rule
  - Impact: High latency may degrade user experience or indicate performance issues.
  - Recommended Actions: Check server load, resource utilization, recent deployments, slow queries, and network performance.
- CPU Usage Alert
  - Metric: CPU utilization across instances
  - Evaluation Interval: 10 minutes
  - Threshold: 70%
  - Impact: Sustained high CPU can cause slow responses or degraded performance.
  - Recommended Actions: Investigate CPU-intensive tasks, scale resources, review deployments.
- Memory Usage Alert
  - Metric: Memory utilization
  - Evaluation Interval: 10 minutes
  - Threshold: 90%
  - Impact: High memory usage can lead to instability or crashes.
  - Recommended Actions: Identify memory-intensive processes, review deployments, scale memory, check for leaks.
- High 4xx Errors Alert
  - Metric: HTTP 4xx errors
  - Evaluation Interval: 1 hour
  - Threshold: 30 occurrences
  - Impact: Frequent client errors may indicate misconfigured requests or broken endpoints.
  - Recommended Actions: Investigate traffic patterns, check endpoints, validate API calls.
- High 5xx Errors Alert
  - Metric: HTTP 5xx errors
  - Evaluation Interval: 1 hour
  - Threshold: 30 occurrences
  - Impact: Frequent server errors indicate potential system issues or instability.
  - Recommended Actions: Check server logs, identify failing endpoints, verify resource allocation and dependencies.
- High Traffic Alert
  - Metric: Total HTTP requests, excluding the /metrics and /health routes
  - Evaluation Interval: 10 minutes
  - Threshold: 1000 requests per hour
  - Impact: Sudden traffic spikes may affect system stability or indicate abuse.
  - Recommended Actions: Investigate surge in usage, ensure proper scaling and monitoring.
Contact Point: aiaxio-alert-contact-point
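
For reference, the backend thresholds listed above can be written down as data and checked against sample readings. This is only an illustrative sketch, not the Terraform configuration: the `AlertRule` class, its field names, and the helper functions are hypothetical, and the comparison is assumed to be "strictly greater than" because the ticket does not state the operator. The latency rule has no number in the ticket, so its threshold is left unset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical container for one rule from the ticket."""
    name: str
    interval_minutes: int        # evaluation interval from the ticket
    threshold: Optional[float]   # None where the ticket gives no number
    unit: str


# Values copied from the rule list in this ticket.
BACKEND_RULES = [
    AlertRule("Request Latency Alert", 10, None, "P95 response time"),
    AlertRule("CPU Usage Alert", 10, 70.0, "% CPU"),
    AlertRule("Memory Usage Alert", 10, 90.0, "% memory"),
    AlertRule("High 4xx Errors Alert", 60, 30.0, "4xx responses"),
    AlertRule("High 5xx Errors Alert", 60, 30.0, "5xx responses"),
    AlertRule("High Traffic Alert", 10, 1000.0, "requests per hour"),
]


def is_firing(rule: AlertRule, observed: float) -> Optional[bool]:
    """Return True/False for a breach, or None when no threshold is known.

    Assumption: a rule fires when the observed value is strictly above
    its threshold; the ticket does not state the comparison operator.
    """
    if rule.threshold is None:
        return None
    return observed > rule.threshold


def firing_rules(observations: dict[str, float]) -> list[str]:
    """Names of rules whose observed value breaches the threshold."""
    return [
        rule.name
        for rule in BACKEND_RULES
        if rule.name in observations and is_firing(rule, observations[rule.name])
    ]
```

Calling `firing_rules({"CPU Usage Alert": 85.0})` would report the CPU rule, while a reading for the latency rule is never reported because its threshold is not given here.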
Grafana Alert Rules for Frontend
Details:
A Grafana alert rule has been configured using Terraform to monitor frontend exceptions:
- Rule Name: Front End Exception Alert
- Datasource: Grafana Cloud Logs (Loki)
- Evaluation Interval: 1 hour
- Metric: Count of exceptions for app_id 2278
- Threshold: 10 or more exceptions within 1 hour
- Impact: High exception rates may indicate potential application instability or degraded frontend performance.
- Recommended Actions:
  - Investigate recent exceptions
  - Review application logs
  - Identify recurring errors
  - Roll back recent changes if necessary
Contact Point: aiaxio-frontend-alert-contact-point
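
The frontend condition (10 or more exceptions for app_id 2278 within 1 hour) can be illustrated with a small windowed count. This is a sketch under assumptions, not the actual Loki query: the `(timestamp, app_id)` event shape and the function names are hypothetical, and the window is assumed to be the hour ending at the evaluation time.

```python
from datetime import datetime, timedelta

APP_ID = "2278"               # app_id from the ticket
THRESHOLD = 10                # "10 or more exceptions"
WINDOW = timedelta(hours=1)   # "within 1 hour"


def exceptions_in_window(events, now):
    """Count events for APP_ID with a timestamp in (now - WINDOW, now].

    `events` is an iterable of (timestamp, app_id) pairs; this shape is
    hypothetical and stands in for whatever the log query returns.
    """
    start = now - WINDOW
    return sum(
        1 for ts, app_id in events
        if app_id == APP_ID and start < ts <= now
    )


def should_alert(events, now):
    """True when the windowed count reaches the threshold (>= 10)."""
    return exceptions_in_window(events, now) >= THRESHOLD
```

Note the inclusive comparison (`>=`), which matches the "10 or more" wording of the rule.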