Under Review | v0.1.0-alpha

Grafana Dashboards

Eight dashboards are provisioned automatically in the ecom-engine folder when Grafana starts. All dashboards share a 30-second auto-refresh and default to a 3-hour time window.

Starting point: Always open Service Health first during an incident. It shows the critical SLO indicators in one view and links naturally to the focused dashboards.


Incident starts

Service Health — is something broken?
├─ 5xx spike → HTTP Traffic → drill by route
├─ High latency → HTTP Traffic → check DB / Cache
├─ DB pool > 90% → Database
├─ Cache hit rate < 50% → Cache
├─ Memory / CPU spike → Runtime & Process
├─ Auth failures → Security
└─ Unexplained → Logs (filter by error)
               → Traces (via trace.id in logs)
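The routing in the tree above can be sketched as a small function. This is only an illustration: the `TriageInput` type and its field names are hypothetical, the 5% cutoff for a "5xx spike" is borrowed from the red threshold on the 5xx Error Rate panel, and the order in which symptoms are checked is an assumption.

```go
package main

import "fmt"

// TriageInput holds a few Service Health readings used for routing.
// The type and field names are hypothetical, not metric names from the service.
type TriageInput struct {
	FiveXXRatePct   float64 // 5xx error rate, percent
	DBPoolUtilPct   float64 // DB pool utilization, percent
	CacheHitRatePct float64 // cache hit rate, percent
	AuthFailures    bool    // true when auth failures are elevated
}

// NextDashboard mirrors part of the decision tree: the first matching
// symptom wins, and anything unexplained falls through to Logs.
func NextDashboard(in TriageInput) string {
	switch {
	case in.FiveXXRatePct >= 5: // assumed cutoff, taken from the panel's red threshold
		return "HTTP Traffic"
	case in.DBPoolUtilPct > 90:
		return "Database"
	case in.CacheHitRatePct < 50:
		return "Cache"
	case in.AuthFailures:
		return "Security"
	default:
		return "Logs"
	}
}

func main() {
	in := TriageInput{FiveXXRatePct: 0.2, DBPoolUtilPct: 94, CacheHitRatePct: 85}
	fmt.Println(NextDashboard(in)) // prints "Database"
}
```
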

Service Health

UID: ecom-engine-health | Purpose: On-call first look, SLO monitoring

The highest-level view. Covers the most critical indicators across all subsystems in a single screen.

Panels

Row: Service Overview

| Panel | Type | Metric / Query | Description |
| --- | --- | --- | --- |
| Request Rate | Stat | job:ecom_engine_http_requests_total:rate1m | Requests per second; the baseline for "is traffic normal?" |
| 5xx Error Rate | Stat | job:ecom_engine_http_5xx_rate_pct:rate1m | Red threshold at 5%. Any sustained value here needs investigation. |
| P50 Latency | Stat | histogram_quantile(0.50, ...) | Median response time. Healthy target: < 100ms. |
| In-Flight Requests | Stat | ecom_engine_http_requests_in_flight | High in-flight with low request rate = stuck handlers. |
| Error Log Events | Stat | Loki count of log.level=error in 5m | Log-level error count. Catches issues not reflected in HTTP status codes. |
| Warn Log Events | Stat | Loki count of log.level=warn in 5m | Leading indicator: warnings often precede errors. |

Row: Trends

| Panel | Type | Description |
| --- | --- | --- |
| Error & Warn Rate | Timeseries | Error and warn log volume over time. Spot when an issue started. |
| HTTP Request Rate | Timeseries | Request volume trend. Distinguish traffic-driven spikes from internal failures. |

When to use

Open this dashboard at the start of every on-call incident. If anything is red, navigate to the focused dashboard matching the symptom.


HTTP Traffic

UID: ecom-engine-http | Purpose: Request rate, latency, and error investigation by route

Panels

Row: Summary

| Panel | Type | Description |
| --- | --- | --- |
| Request Rate | Stat | Total requests/s |
| 5xx Error Rate | Stat | Percentage of requests returning 5xx |
| 4xx Error Rate | Stat | Percentage of requests returning 4xx |
| In-Flight Requests | Stat | Concurrent requests being processed |
| P50 Latency | Stat | Median response time |
| P99 Latency | Stat | 99th percentile: the response time exceeded by only the slowest 1% of requests |

Row: Request Traffic

| Panel | Type | Description |
| --- | --- | --- |
| Request Rate by Route | Timeseries | Per-route traffic volume. Identify which endpoint is receiving the spike. |
| Response Time Percentiles (p50/p95/p99) | Timeseries | Latency trends over time across all routes. Watch for p99 diverging from p50. |

Row: Status Codes

| Panel | Type | Description |
| --- | --- | --- |
| Requests by Status Code Class | Timeseries | 2xx / 3xx / 4xx / 5xx, stacked. See how the error class distribution changes. |

Row: Error Analysis

| Panel | Type | Description |
| --- | --- | --- |
| Top Error Messages | Table | Most frequent error log messages in the time range. Pivot from metric to root cause. |

When to use

  • 5xx spike in Service Health → open this dashboard to identify the route
  • Latency complaint → compare p50 vs p99 to distinguish outliers from systemic slowness
  • After a deploy → confirm request rate and error rate return to baseline
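The latency panels are built on histogram_quantile, so the p50 and p99 values are estimates interpolated from histogram buckets rather than exact measurements. The sketch below shows the general idea of such an estimate using linear interpolation inside the bucket that contains the target rank. The bucket bounds and counts are made up for illustration, and the `Bucket` type and `Quantile` function are hypothetical names, not part of the service.

```go
package main

import (
	"fmt"
	"math"
)

// Bucket is one cumulative histogram bucket: Count observations were <= UpperBound.
type Bucket struct {
	UpperBound float64 // seconds; +Inf for the last bucket
	Count      float64 // cumulative count
}

// exampleBuckets is invented data: 60 requests <= 100ms, 80 <= 250ms, 100 <= 1s.
var exampleBuckets = []Bucket{
	{0.1, 60}, {0.25, 80}, {1, 100}, {math.Inf(1), 100},
}

// Quantile estimates the q-quantile (0 < q <= 1) by linear interpolation
// within the bucket that contains the target rank. Buckets must be sorted
// by UpperBound and end with the +Inf bucket.
func Quantile(q float64, buckets []Bucket) float64 {
	if len(buckets) < 2 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.Count < rank {
			continue
		}
		if math.IsInf(b.UpperBound, 1) {
			// Target falls in the +Inf bucket: report the highest finite bound.
			return buckets[i-1].UpperBound
		}
		lower, prev := 0.0, 0.0
		if i > 0 {
			lower, prev = buckets[i-1].UpperBound, buckets[i-1].Count
		}
		return lower + (b.UpperBound-lower)*(rank-prev)/(b.Count-prev)
	}
	return math.NaN()
}

func main() {
	fmt.Printf("p50=%.3fs p90=%.3fs\n",
		Quantile(0.5, exampleBuckets), Quantile(0.9, exampleBuckets))
}
```

Because the estimate can only be as fine-grained as the bucket boundaries, a p99 that sits exactly on a bucket bound for a long time may mean the true value is somewhere inside (or beyond) that bucket.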

Database

UID: ecom-engine-db | Purpose: PostgreSQL connection pool monitoring

Panels

Row: Pool Status

| Panel | Type | Metric | Description |
| --- | --- | --- | --- |
| DB Pool Utilization | Gauge | job:ecom_engine_db_pool_utilization_pct | Current pool utilization %. Yellow at 70%, red at 90%. Above 90%, acquire timeouts become likely. |
| DB Pool Connections | Stat | acquired / idle / max | Three stats in one: how many connections are in use, how many are available, and the configured max. |

Row: Pool Trends

| Panel | Type | Description |
| --- | --- | --- |
| Pool Connections & Acquire Rate | Timeseries | Acquired, idle, and total connections over time alongside the rate of empty-pool waits. |

What to look for

  • Utilization gauge red (>90%): Pool is near exhaustion. Acquire timeouts will start causing request failures. Increase DB_POOL_MAX_CONNS or reduce query duration.
  • Rising empty-acquire rate: Requests are waiting for a free connection. Correlates with high latency in the HTTP Traffic dashboard.
  • Total conns < max_conns but utilization high: The pool is sized correctly, but connections are being held too long. Investigate slow queries.
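As a rough illustration of how the gauge relates to the three connection stats, the sketch below computes utilization as acquired connections over the configured max and applies the gauge thresholds. The formula is an assumption (the recording rule behind job:ecom_engine_db_pool_utilization_pct is not shown here), and `PoolStats` is a hypothetical type.

```go
package main

import "fmt"

// PoolStats is a hypothetical snapshot of the values on the
// DB Pool Connections panel.
type PoolStats struct {
	Acquired int // connections currently in use
	Idle     int // open connections waiting for work
	Max      int // configured maximum (DB_POOL_MAX_CONNS)
}

// UtilizationPct returns acquired connections as a percentage of the max.
// Assumption: this matches how the dashboard gauge is derived.
func UtilizationPct(s PoolStats) float64 {
	if s.Max <= 0 {
		return 0
	}
	return float64(s.Acquired) / float64(s.Max) * 100
}

// Severity applies the gauge thresholds: yellow at 70%, red at 90%.
func Severity(pct float64) string {
	switch {
	case pct >= 90:
		return "red"
	case pct >= 70:
		return "yellow"
	default:
		return "green"
	}
}

func main() {
	s := PoolStats{Acquired: 19, Idle: 1, Max: 20}
	pct := UtilizationPct(s)
	fmt.Printf("%.0f%% -> %s\n", pct, Severity(pct))
}
```
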

Cache

UID: ecom-engine-cache | Purpose: L1/L2 cache efficiency and Redis pool health

Panels

Row: Cache Summary

| Panel | Type | Description |
| --- | --- | --- |
| L1 (Memory) Hit Rate | Stat | Fraction of L1 lookups returning a hit. Healthy: > 80% for hot data. |
| L2 (Redis) Hit Rate | Stat | Fraction of L2 (Redis) lookups returning a hit. Healthy: > 50%. |
| L1 Memory Fill Ratio | Gauge | Current items / max items in the in-memory cache. Yellow at 80%, red at 95%. |
| Stampede Dedup (5m) | Stat | Count of singleflight collapses. High value = stampede protection firing under load. |

Row: Cache Operations

| Panel | Type | Description |
| --- | --- | --- |
| Cache Requests by Layer & Result | Timeseries | Hit/miss/error counts per layer over time. Spot when hit rate dropped. |
| Cache Operation Latency (p50/p99) | Timeseries | L1 and L2 get latency percentiles. L1 should be sub-millisecond; L2 < 5ms. |

Row: L1 Memory Cache

| Panel | Type | Description |
| --- | --- | --- |
| L1 Memory Items | Timeseries | Item count over time. Flat line at max = capacity eviction is happening. |
| L1 Evictions by Reason | Timeseries | expired (TTL) vs lfu (capacity) evictions. Rising LFU evictions = L1 is too small. |

Row: Redis (L2) Pool

| Panel | Type | Description |
| --- | --- | --- |
| Redis Pool Connections | Stat | Total and idle Redis connections at a glance. |
| Redis Pool Hits / Misses / Timeouts | Timeseries | Pool hit rate and timeout rate over time. Timeout spikes indicate Redis is overloaded or unreachable. |

What to look for

  • L2 hit rate drops suddenly: Cold start after deploy, cache invalidation storm, or TTL too short.
  • Redis pool timeouts increasing: Redis CPU saturation, network partition, or pool size too small.
  • Fallback alert fires: Distributed rate limiting is no longer enforced — check Redis connectivity.
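To make the Stampede Dedup panel easier to interpret, here is a minimal standard-library sketch of the singleflight pattern: concurrent loads of the same key collapse into one execution, and every collapsed caller increments a counter. It is not the service's implementation (which may well use golang.org/x/sync/singleflight); the `Group` type, the `Deduped` counter, and the key name are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// call tracks one in-flight load for a key.
type call struct {
	wg  sync.WaitGroup
	val string
	err error
}

// Group collapses concurrent loads of the same key into one execution.
type Group struct {
	mu      sync.Mutex
	calls   map[string]*call
	deduped int // callers that piggybacked on an in-flight load
}

// Do runs fn at most once at a time per key. Callers arriving while a load
// is in flight wait for its result; shared reports whether this caller was
// collapsed into another caller's load.
func (g *Group) Do(key string, fn func() (string, error)) (val string, err error, shared bool) {
	g.mu.Lock()
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	if c, ok := g.calls[key]; ok {
		g.deduped++
		g.mu.Unlock()
		c.wg.Wait() // wait for the in-flight load to finish
		return c.val, c.err, true
	}
	c := &call{}
	c.wg.Add(1)
	g.calls[key] = c
	g.mu.Unlock()

	c.val, c.err = fn()
	c.wg.Done()

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	return c.val, c.err, false
}

// Deduped returns how many callers have been collapsed so far.
func (g *Group) Deduped() int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.deduped
}

func main() {
	var g Group
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Hypothetical slow backend load for a hot key.
			g.Do("product:42", func() (string, error) {
				time.Sleep(50 * time.Millisecond)
				return "cached value", nil
			})
		}()
	}
	wg.Wait()
	fmt.Println("collapsed callers:", g.Deduped())
}
```

A high collapse count is not a fault in itself: it means many callers asked for the same missing key at once and only one of them reached the backend.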

Business Events

UID: ecom-engine-business | Purpose: Order, payment, cart, and catalog event monitoring

Panels

Row: Business Stats

| Panel | Type | Description |
| --- | --- | --- |
| Order Events | Stat | Order-related event count in the time range |
| Payment Events | Stat | Payment-related event count |
| Cart Events | Stat | Cart add/remove/clear event count |
| Checkout Events | Stat | Checkout initiated / completed counts |

Row: Event Rates

| Panel | Type | Description |
| --- | --- | --- |
| Orders & Payments | Timeseries | Order and payment event rates over time. Correlated drops indicate a funnel problem. |
| Cart & Checkout | Timeseries | Cart activity and checkout conversion over time. |
| Catalog, Inventory & Shipping | Timeseries | Product view, inventory check, and shipping event rates. |

When to use

  • "Revenue is down" — check if order/payment rates dropped or if cart → checkout conversion fell.
  • After a deploy — confirm business event rates return to expected levels.
  • During a sale or campaign — monitor event volume in real time.

Security

UID: ecom-engine-security | Purpose: Authentication failures and rate limiting health

Panels

Row: Authentication

| Panel | Type | Description |
| --- | --- | --- |
| Auth Events | Timeseries | Login success / failure rates over time. A spike in failures may indicate a credential-stuffing attack. |

Row: Rate Limiting

| Panel | Type | Description |
| --- | --- | --- |
| Rate Limit Decisions (Prometheus) | Timeseries | Allow vs deny request counts over time. Normal traffic produces very low deny rates. |
| Rate Limit Backend | Stat | Current backend: redis (distributed) or memory (per-instance fallback). Red if fallback is active. |
| Backend Fallbacks (1h) | Stat | Count of Redis → memory fallback events in the last hour. Any non-zero value warrants investigation. |

What to look for

  • Auth failure spike: Possible brute-force or credential-stuffing attack. Check source IPs in the Logs dashboard.
  • Rate limit backend = memory: Distributed limiting is not enforced. Investigate Redis connectivity.
  • Deny rate rising with no traffic increase: Rate limit thresholds may be too tight, or a bot is probing.

Logs

UID: ecom-engine-logs | Purpose: Ad-hoc log search with level and trace ID filtering

Variables

| Variable | Type | Description |
| --- | --- | --- |
| Log Level | Multi-select | Filter the log stream to one or more levels (debug, info, warn, error) |

Panels

Row: Log Overview

| Panel | Type | Description |
| --- | --- | --- |
| Log Events by Level | Timeseries | Error, warn, info, debug event volume over time. Spot when an issue started even before the alert fires. |
| Log Level Distribution | Pie chart | Proportional breakdown of log levels in the selected time range. |

Row: Live Log Stream

| Panel | Type | Description |
| --- | --- | --- |
| ecom-engine Live Logs | Logs | Full raw log stream, filterable by level. Click any log line to expand structured fields. Click trace.id to jump to Tempo. |

Workflow

  1. Set Log Level variable to error to filter out noise.
  2. Find a relevant log line and expand it.
  3. Copy the trace.id value.
  4. Open Explore → Tempo and search by trace ID to open the full trace.
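When working from exported log lines rather than the Grafana UI, steps 1 to 3 can be approximated in code. The sketch below assumes each log line is a JSON object with flat, dotted keys named log.level and trace.id; the actual log schema may nest these fields, and the sample line and `TraceID` helper are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TraceID returns the trace.id of a JSON log line whose log.level is "error".
// Assumption: keys are flat and dotted ("log.level", "trace.id").
func TraceID(line string) (string, bool) {
	var fields map[string]any
	if err := json.Unmarshal([]byte(line), &fields); err != nil {
		return "", false // not a JSON log line
	}
	if level, _ := fields["log.level"].(string); level != "error" {
		return "", false // filtered out, like the Log Level variable
	}
	id, ok := fields["trace.id"].(string)
	return id, ok && id != ""
}

func main() {
	// Invented sample line for illustration only.
	line := `{"log.level":"error","message":"db acquire timeout","trace.id":"abc123"}`
	if id, ok := TraceID(line); ok {
		fmt.Println("search Tempo for trace ID:", id)
	}
}
```
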

Runtime & Process

UID: ecom-engine-runtime | Purpose: Go heap, GC, goroutines, CPU, and memory

Panels

Row: Resource Overview

| Panel | Type | Thresholds | Description |
| --- | --- | --- | --- |
| CPU Usage | Stat | Yellow > 60%, Red > 80% | Process CPU %. Sustained high CPU needs profiling. |
| Memory RSS | Stat | Yellow > 512 MiB, Red > 1 GiB | Physical memory in use. Growing RSS over time = memory leak. |
| Heap Alloc | Stat | Yellow > 256 MiB, Red > 512 MiB | Live Go heap allocations. |
| Goroutines | Stat | Yellow > 200, Red > 500 | Current goroutine count. Monotonic growth = goroutine leak. |
| GC Cycles / min | Stat | Yellow > 30, Red > 60 | GC frequency. Very high rates inflate p99 latency. |

Row: CPU & Memory Trends

| Panel | Type | Description |
| --- | --- | --- |
| CPU Usage Over Time | Timeseries | CPU % trend. Identify when it started rising and correlate with deploy or traffic changes. |
| Memory Over Time | Timeseries | RSS, VMS, and heap allocated on a single graph. RSS growing while heap is stable = memory held outside the Go allocator. |

Row: Go Runtime Details

| Panel | Type | Description |
| --- | --- | --- |
| Goroutines Over Time | Timeseries | Goroutine count trend. Use the 15-minute window to distinguish normal bursts from a true leak. |
| GC Activity | Timeseries | Dual-axis: GC pause time (s/s) on the left, GC cycles/min on the right. High pause time with normal cycle count = large heap objects. |
| Heap Breakdown | Timeseries | Four lines: heap allocated, heap sys (OS granted), nextGC target, and heap objects count. Useful for diagnosing fragmentation or unexpectedly large heap reservation. |

What to look for

  • Goroutines monotonically increasing: Goroutine leak — a goroutine is started but never exits. Check for unclosed channels, missing cancel() calls, or stuck HTTP/DB calls.
  • GC pause rate > 50ms/s: GC is consuming significant CPU time. Reduce heap allocation rate — look for hot paths creating many small objects.
  • RSS growing while Heap Alloc is stable: Memory is being held outside the Go heap (e.g. cgo allocations or large mmap regions).
  • CPU spike with no traffic increase: A background goroutine is doing expensive work — check GC cycle rate and goroutine count for correlation.
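For local debugging, the same kind of numbers can be read directly from the Go runtime. The sketch below samples the goroutine count and the heap fields that correspond to the Heap Breakdown lines, and includes a simple check for the "monotonic growth" pattern. How the dashboards actually collect these values is not shown here; the `RuntimeSnapshot` type and both helper functions are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
)

// RuntimeSnapshot holds values comparable to the Goroutines and
// Heap Breakdown panels.
type RuntimeSnapshot struct {
	Goroutines  int
	HeapAlloc   uint64 // bytes of allocated heap objects
	HeapSys     uint64 // bytes of heap memory obtained from the OS
	NextGC      uint64 // target heap size of the next GC cycle
	HeapObjects uint64 // number of allocated heap objects
	NumGC       uint32 // completed GC cycles
}

// ReadRuntime samples the Go runtime once. ReadMemStats briefly stops the
// world, so avoid calling it in a tight loop.
func ReadRuntime() RuntimeSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return RuntimeSnapshot{
		Goroutines:  runtime.NumGoroutine(),
		HeapAlloc:   m.HeapAlloc,
		HeapSys:     m.HeapSys,
		NextGC:      m.NextGC,
		HeapObjects: m.HeapObjects,
		NumGC:       m.NumGC,
	}
}

// StrictlyGrowing reports whether every sample exceeds the previous one,
// the pattern that suggests a goroutine leak rather than a normal burst.
func StrictlyGrowing(samples []int) bool {
	if len(samples) < 2 {
		return false
	}
	for i := 1; i < len(samples); i++ {
		if samples[i] <= samples[i-1] {
			return false
		}
	}
	return true
}

func main() {
	s := ReadRuntime()
	fmt.Printf("goroutines=%d heapAlloc=%d heapSys=%d nextGC=%d objects=%d gcCycles=%d\n",
		s.Goroutines, s.HeapAlloc, s.HeapSys, s.NextGC, s.HeapObjects, s.NumGC)
	fmt.Println("leak pattern:", StrictlyGrowing([]int{120, 180, 260, 410}))
}
```
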