dev #3308

open

[BE] Setup alerting system in aiaxio

Added by Zahid Hassan 7 months ago.

Status: Complete
Priority: High
Assignee:
Target version:
Start date: 09/22/2025
Due date:
% Done: 0%
Estimated time:
Spent time:

Description

Grafana Alert Rules for Backend

Details:

The following alert rules have been configured in Grafana using Terraform:

  1. Request Latency Alert

    • Metric: 95th Percentile Response Time (P95)
    • Evaluation Interval: 10 minutes
    • Threshold: Configured per rule
    • Impact: High latency may degrade user experience or indicate performance issues.
    • Recommended Actions: Check server load, resource utilization, recent deployments, slow queries, and network performance.
  2. CPU Usage Alert

    • Metric: CPU utilization across instances
    • Evaluation Interval: 10 minutes
    • Threshold: 70%
    • Impact: Sustained high CPU can cause slow responses or degraded performance.
    • Recommended Actions: Investigate CPU-intensive tasks, scale resources, review deployments.
  3. Memory Usage Alert

    • Metric: Memory utilization
    • Evaluation Interval: 10 minutes
    • Threshold: 90%
    • Impact: High memory usage can lead to instability or crashes.
    • Recommended Actions: Identify memory-intensive processes, review deployments, scale memory, check for leaks.
  4. High 4xx Errors Alert

    • Metric: HTTP 4xx errors
    • Evaluation Interval: 1 hour
    • Threshold: 30 occurrences
    • Impact: Frequent client errors may indicate misconfigured requests or broken endpoints.
    • Recommended Actions: Investigate traffic patterns, check endpoints, validate API calls.
  5. High 5xx Errors Alert

    • Metric: HTTP 5xx errors
    • Evaluation Interval: 1 hour
    • Threshold: 30 occurrences
    • Impact: Frequent server errors indicate potential system issues or instability.
    • Recommended Actions: Check server logs, identify failing endpoints, verify resource allocation and dependencies.
  6. High Traffic Alert

    • Metric: Total HTTP requests excluding /metrics and /health routes
    • Evaluation Interval: 10 minutes
    • Threshold: 1000 requests per hour
    • Impact: Sudden traffic spikes may affect system stability or indicate abuse.
    • Recommended Actions: Investigate surge in usage, ensure proper scaling and monitoring.

Contact Point: aiaxio-alert-contact-point
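
For reference, the backend thresholds listed above can be sketched in Python as a small lookup table with a check function. This is only an illustration of the rule logic, not the Terraform configuration itself. The names `AlertRule`, `BACKEND_RULES`, `p95`, `count_traffic`, and `firing` are hypothetical, the nearest-rank percentile method is an assumption, and so is treating a value equal to the threshold as firing. The latency rule is left out of the table because its threshold is configured per rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    threshold: float
    unit: str


# Thresholds copied from the ticket. Keys are hypothetical identifiers.
BACKEND_RULES = {
    "cpu_usage": AlertRule("CPU Usage Alert", 70, "%"),
    "memory_usage": AlertRule("Memory Usage Alert", 90, "%"),
    "http_4xx": AlertRule("High 4xx Errors Alert", 30, "occurrences"),
    "http_5xx": AlertRule("High 5xx Errors Alert", 30, "occurrences"),
    "traffic": AlertRule("High Traffic Alert", 1000, "requests per hour"),
}

# Routes excluded from the traffic count, as stated in the ticket.
EXCLUDED_ROUTES = {"/metrics", "/health"}


def p95(samples):
    """Nearest-rank 95th percentile (assumed method, for illustration)."""
    if not samples:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of 0.95 * n, avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def count_traffic(paths):
    """Count requests, skipping the excluded routes."""
    return sum(1 for path in paths if path not in EXCLUDED_ROUTES)


def firing(observed):
    """Return names of rules whose observed value meets or exceeds the threshold.

    Using >= rather than > is an assumption; the ticket does not say which.
    """
    return [
        BACKEND_RULES[key].name
        for key, value in observed.items()
        if key in BACKEND_RULES and value >= BACKEND_RULES[key].threshold
    ]
```

A call such as `firing({"cpu_usage": 75, "memory_usage": 50})` would report only the CPU rule, since 50% memory usage is below the 90% threshold.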


Grafana Alert Rules for Frontend

Details:

A Grafana alert rule has been configured using Terraform to monitor frontend exceptions:

  • Rule Name: Front End Exception Alert
  • Datasource: Grafana Cloud Logs (Loki)
  • Evaluation Interval: 1 hour
  • Metric: Count of exceptions for app_id 2278
  • Threshold: 10 or more exceptions within 1 hour
  • Impact: A high exception rate may indicate application instability or degraded frontend performance.
  • Recommended Actions:
    • Investigate recent exceptions
    • Review application logs
    • Identify recurring errors
    • Roll back recent changes if necessary

Contact Point: aiaxio-frontend-alert-contact-point
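
The frontend rule above amounts to counting exceptions in a trailing one-hour window and comparing the count to 10. A minimal Python sketch of that check follows. The function names and the exact window boundaries (start exclusive, end inclusive) are assumptions, and the actual rule runs as a Loki query in Grafana rather than as application code.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # evaluation window from the ticket
THRESHOLD = 10               # "10 or more exceptions within 1 hour"


def exceptions_in_window(timestamps, now):
    """Count exception timestamps in the trailing window ending at `now`."""
    start = now - WINDOW
    return sum(1 for ts in timestamps if start < ts <= now)


def should_alert(timestamps, now):
    """True when the windowed exception count reaches the threshold."""
    return exceptions_in_window(timestamps, now) >= THRESHOLD
```

Timestamps older than one hour are ignored, so a burst of errors earlier in the day does not keep the alert firing.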
