Part 8 — Monitoring, Telemetry & Observability
From Monitoring to Observability
Traditional monitoring answers “Is the device up?” Observability helps answer “Why did this happen?” That difference is critical for mature NetDevOps operations.
SNMP vs Streaming Telemetry
SNMP (Polling): mature and widely supported, but the collector has to ask each device for data on a schedule, so results are coarse-grained and high-latency.
Streaming Telemetry: devices push rich, near-real-time telemetry to collectors for anomaly detection and fine-grained analysis.
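To make "coarse-grained" concrete, here is a minimal sketch of how a poller typically turns two SNMP octet-counter readings into an interface utilization figure. The function name and parameters are illustrative assumptions, not part of any SNMP library. Note that the result is only an average over the polling interval, so a short burst between two polls is invisible.

```python
def utilization_percent(octets_prev, octets_now, interval_s, speed_bps, counter_bits=64):
    """Average link utilization (%) between two polls of an octet counter.

    octets_prev / octets_now: counter readings from consecutive polls
    interval_s: seconds between the two polls
    speed_bps: interface speed in bits per second
    counter_bits: counter width (32 or 64), used to handle a single wrap
    """
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")

    delta = octets_now - octets_prev
    if delta < 0:
        # The counter wrapped past its maximum between the two polls.
        delta += 2 ** counter_bits

    # Octets -> bits, then express the rate as a percentage of link speed.
    return delta * 8 * 100 / (interval_s * speed_bps)


# Hypothetical readings taken 300 seconds apart on a 1 Gb/s interface.
avg = utilization_percent(1_000_000, 38_500_000, interval_s=300, speed_bps=1_000_000_000)
print(f"average utilization: {avg:.3f}%")
```

With a 300-second interval, a link that was saturated for 10 seconds and idle otherwise would still report only a few percent, which is the gap streaming telemetry is meant to close.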
Prometheus: The Monitoring Backend
Prometheus pulls (scrapes) metrics from exporters over HTTP at a fixed interval and stores them as time-series data.
Example prometheus.yml snippet:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'network-devices'
    static_configs:
      - targets: ['10.1.1.1:9161']
        labels:
          device: 'R1'
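The other side of that scrape is an HTTP endpoint that returns metrics in the Prometheus text exposition format. The following is a rough sketch using only the Python standard library. The sample values, the interface_rx_errors_total metric name, and the serve() helper are assumptions made for illustration; a real deployment would more likely use an existing exporter.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical samples: (metric name, labels, value).
SAMPLES = [
    ("device_cpu_usage_percent", {}, 42.5),
    ("interface_rx_errors_total", {"interface": "Gi0/1"}, 3),
]


def render_metrics(samples):
    """Render samples as Prometheus text exposition lines."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Prometheus scrapes the /metrics path by default.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(SAMPLES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9161):
    """Serve /metrics until interrupted (port chosen to match the snippet above)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. The device label does not need to appear in the exporter output, because Prometheus attaches the labels from static_configs to every series it scrapes from that target.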
Grafana: Visualization and Dashboards
Grafana queries Prometheus as a data source, visualizes metrics such as utilization, BGP status, and interface errors, and can also define alerts on them.
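A dashboard panel is, underneath, a PromQL query sent to the Prometheus HTTP API. The sketch below builds an instant-query URL for the /api/v1/query endpoint and flattens the JSON response into a dictionary keyed by the device label. The base URL and helper names are assumptions for illustration, and the sample payload is hand-written to match the documented response shape rather than captured from a real server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload, label="device"):
    """Map each series' label value to its current sample as a float."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = {}
    for series in payload["data"]["result"]:
        # Each instant-vector sample is [unix_timestamp, "value as string"].
        _, raw_value = series["value"]
        result[series["metric"].get(label, "")] = float(raw_value)
    return result


# Hand-written example response in the instant-vector shape.
sample = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "device_cpu_usage_percent", "device": "R1"},
                      "value": [1700000000, "87.5"]}]}}
""")

print(build_query_url("http://localhost:9090", "device_cpu_usage_percent"))
print(parse_instant_vector(sample))
```

Fetching the URL with urllib.request.urlopen and passing the decoded JSON to parse_instant_vector would give a script the same numbers a Grafana panel plots.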
Logs, Metrics, Traces
Collect:
- Metrics: Prometheus
- Logs: Grafana Loki / ELK
- Traces: Jaeger
Combined, these provide deep observability for troubleshooting and automation-driven remediation.
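Combining the three signals works best when they share identifiers. One common approach, sketched here as an assumption rather than a prescribed method, is to emit logs as JSON with the same device label used in metrics and an optional trace ID, so a log pipeline such as Loki or ELK can index them. The field names below are illustrative.

```python
import json


def format_log_event(device, message, level="info", ts=0.0, trace_id=None):
    """Return one JSON log line carrying the same 'device' label as the metrics."""
    event = {
        "ts": ts,            # Unix timestamp supplied by the caller
        "level": level,
        "device": device,    # matches the Prometheus 'device' label
        "message": message,
    }
    if trace_id is not None:
        event["trace_id"] = trace_id  # lets a log line be tied to a trace
    # sort_keys keeps the output stable, which helps when diffing or testing.
    return json.dumps(event, sort_keys=True)


print(format_log_event("R1", "BGP neighbor 10.0.0.2 went down", level="warning",
                       ts=1700000000.0, trace_id="abc123"))
```

With a shared device value, an engineer looking at a CPU spike for R1 on a dashboard can filter logs to the same device and time window instead of searching free text.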
Alerting Best Practices
Good alerts detect real issues and avoid noise. Example:
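groups:
  - name: network-devices   # illustrative group name
    rules:
      - alert: HighCPUUsage
        expr: device_cpu_usage_percent > 95
        for: 5m   # condition must hold for 5 minutes before the alert fires
        annotations:
          summary: "{{ $labels.device }} CPU high"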
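The for: 5m clause is what keeps this rule from firing on a brief spike. The following sketch approximates that behavior in plain Python: the alert is pending while the condition holds, fires only once it has held continuously for the hold time, and resets as soon as one evaluation falls below the threshold. It is a simplified model for illustration, not Prometheus's actual implementation, and the sample data is made up.

```python
def evaluate_alert(samples, threshold=95.0, hold_s=300):
    """Return (timestamp, state) for each sample.

    samples: list of (unix_timestamp, value), ordered by time
    state: 'inactive', 'pending', or 'firing'
    """
    pending_since = None
    states = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            state = "firing" if ts - pending_since >= hold_s else "pending"
        else:
            pending_since = None    # any dip resets the timer
            state = "inactive"
        states.append((ts, state))
    return states


# Hypothetical CPU readings, one evaluation every 60 seconds.
cpu = [96, 97, 50, 96, 96, 96, 96, 96, 96]
samples = [(i * 60, v) for i, v in enumerate(cpu)]
for ts, state in evaluate_alert(samples):
    print(ts, state)
```

In this data the early two-minute spike never fires. Only the later run, which stays above 95 for a full five minutes, reaches the firing state at the last sample.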
Try this now: Stand up Prometheus + Grafana in a sandbox, add an SNMP exporter for one device, and build a dashboard that shows interface utilization over time.