When you run your application or service (especially in production), sooner or later you will need to answer questions like:
- “Is my service alive?”
- “How fast is my app?”
- “What just happened before the crash?”
Observability tools throw around the words logs and metrics all the time. They sound similar (both are “something my code emits”), but they solve different problems and need different handling.
This post sums up:
- what a log is
- what a metric is
- how they differ
- good (and bad) habits for each
- when to rely on which
What Is a Log?
A log entry is a statement produced by code at the moment something interesting happens.
Examples:
```
2025-07-21T14:18:04Z INFO api user=42 "payment created"
2025-07-21T14:18:07Z ERROR db "duplicate key value violates unique constraint"
```
Key points:
- Timestamped “facts” about the run-time path.
- Usually human-readable first, machine-parseable second.
- Volume can explode when traffic or error rates spike.
- You normally read them to answer “why did X happen?”.
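To make the format concrete, here is a minimal sketch of emitting structured log lines from Python's standard logging module. The logger name, the JSON field set, and the extra keys are assumptions for illustration, not a required layout.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (machine-parseable)."""
    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Attach illustrative extra fields passed via `extra=` so they become searchable keys.
        for key in ("user", "trace_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("api")
log.addHandler(handler)
log.setLevel(logging.INFO)

# A timestamped "fact" about what just happened, with context attached.
log.info("payment created", extra={"user": 42})
```

Each record comes out as one JSON object per line, so it stays readable to a human while remaining trivially machine-parseable.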
What Is a Metric?
A metric is a numeric measurement sampled at a regular interval (seconds, minutes) and named so that you can graph/alert on it.
Examples:
```
payments_processed_total{service="api"} 2_400_123
request_latency_seconds_bucket{le="0.25", route="/login"} 486
cpu_usage_percent{host="api-01"} 73.2
```
Key points:
- Small key-value records, cheap to ship/store.
- Optimised for math: rate, avg, percentiles, heatmaps.
- Meant for dashboards and automated alerts.
- You look at them to answer “is X healthy?” or “how big is X?”.
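As a sketch of how numbers like these get produced, this is roughly what registering and updating them looks like with the Python prometheus_client library; the metric names, labels, and port mirror the examples above and are illustrative only.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

PAYMENTS = Counter("payments_processed_total", "Payments processed", ["service"])
CPU = Gauge("cpu_usage_percent", "CPU usage", ["host"])
LATENCY = Histogram(
    "request_latency_seconds", "Request latency", ["route"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

# Expose /metrics so a scraper (e.g. Prometheus) can sample values on its own schedule.
start_http_server(8000)

# In the request path:
with LATENCY.labels(route="/login").time():   # Histogram: records elapsed seconds into buckets
    ...                                       # handle the request
PAYMENTS.labels(service="api").inc()          # Counter: only ever goes up
CPU.labels(host="api-01").set(73.2)           # Gauge: set to the current value, can go up or down
```

The client only keeps the current values in memory; the scraper decides the sampling interval, which is part of why each update is so cheap.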
Logs vs Metrics: A Quick Comparison
Think of a web request:
- Log = “user 42 failed to pay, card expired” (time, user id, stack trace).
- Metric = `checkout_fail_total{reason="card_expired"}` incremented by 1.

Logs and metrics are complementary: metrics tell you something is wrong; logs tell you why. They overlap and can sometimes be converted into each other, but they play different roles (see the sketch after the table below).
| | Logs | Metrics |
|---|---|---|
| Sample | Individual event | Aggregate/summary |
| Format | Text or structured JSON/proto | Numbers with labels |
| Typical size | Kilobytes each | Bytes each |
| Query style | Search, grep, full-text, trace | Math, filter, group by time |
| Storage retention | Hours to weeks (expensive) | Weeks to years (cheap) |
| Use cases | Debug, audit, investigate | Dashboards, alerts, capacity |
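To see how the two signals come out of the same code path, here is a small sketch of a checkout handler that increments a counter for the dashboard and writes a log line for the investigation; the metric name and the exception type are made up for the example.

```python
import logging
from prometheus_client import Counter

log = logging.getLogger("checkout")
CHECKOUT_FAIL = Counter("checkout_fail_total", "Failed checkouts", ["reason"])

class CardExpiredError(Exception):
    """Illustrative domain error for an expired card."""

def checkout(user_id, charge_card):
    try:
        charge_card()
    except CardExpiredError:
        # Metric: one anonymous tick that an alerting rule can rate() over.
        CHECKOUT_FAIL.labels(reason="card_expired").inc()
        # Log: the full story for this one user, with a stack trace.
        log.exception("checkout failed", extra={"user": user_id})
        raise
```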
Logging Best Practices
- Log intent, not just data. Branches (`if`, `catch`) and state changes are what to look at.
- Be structured. JSON, key=value, or protobuf give you free dimensions for search.
- Add correlation IDs. `trace_id`, `user_id`, … let you stitch together the full story (see the sketch after this list).
- Control volume.
  - Rate-limit or sample noisy statements (see the sketch after this list).
  - When rate-limiting, time-based sampling (once per second) usually beats count-based (every 100 calls) when traffic varies.
- Keep levels, but use them. `DEBUG` off in prod; `ERROR` should be rare.
- Ship ASAP. Buffering is OK, losing logs isn’t (especially around crashes).
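As a sketch of two of these points, here is one way to inject a correlation ID into every record and to rate-limit a noisy statement with time-based sampling; the contextvar name, logger names, and the one-per-second interval are assumptions, not a standard.

```python
import contextvars
import logging
import time

trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every record with the current request's trace_id so lines can be stitched together."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

class OncePerSecondFilter(logging.Filter):
    """Time-based sampling: let at most one record through per second."""
    def __init__(self):
        super().__init__()
        self._last = 0.0
    def filter(self, record):
        now = time.monotonic()
        if now - self._last >= 1.0:
            self._last = now
            return True
        return False

log = logging.getLogger("api")
log.addFilter(TraceIdFilter())
# (A formatter with %(trace_id)s in its format string would surface the field in the output.)

noisy = logging.getLogger("api.retry")
noisy.addFilter(OncePerSecondFilter())

# Set the correlation ID once per request; every later line carries it.
trace_id_var.set("req-1234")
log.warning("payment retry scheduled")

# A hot retry loop: only about one of these lines per second is actually emitted.
for _ in range(10_000):
    noisy.warning("transient db error, retrying")
```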
Pros:
- Rich context for humans
- Works everywhere
- Good for incident investigations and audits
Cons:
- Costly at high TPS
- Harder to turn into time-series math
- Sensitive data easily leaks
Metrics Best Practices
- Make them first-class. Use a metrics client (`prometheus_client`, `statsd`, etc.), not `logger.info`.
- Pick clear names and units. `queue_size_gauge`, `request_latency_seconds`. Units in the name avoid “is that ms or s?”.
- Choose the right type. Counter (monotonically increasing), Gauge (up/down), Histogram/Summary (distribution).
- Label sparingly. 10 high-cardinality labels will sink your TSDB. Pick the ones you actually slice on (see the sketch after this list).
- Sample/aggregate in the agent, not in code. The collector can down-sample later; raw ≠ noisy.
- Alert on RED/USE. Request Rate, Errors, Duration; Utilization, Saturation, Errors.
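To illustrate the labeling point, here is a small sketch with prometheus_client; the metric and label names are made up for the example, and the high-cardinality variant is deliberately left commented out.

```python
from prometheus_client import Counter

# Good: a handful of low-cardinality labels you actually group by.
HTTP_ERRORS = Counter(
    "http_errors_total", "HTTP error responses",
    ["route", "status_class"],            # e.g. "/login", "5xx"
)

# Bad (do not do this): user_id or request_id as a label creates one
# time series per user/request and will sink the TSDB.
# HTTP_ERRORS = Counter("http_errors_total", "...", ["user_id", "request_id"])

HTTP_ERRORS.labels(route="/login", status_class="5xx").inc()
```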
Pros:
- Tiny, cheap, fast
- Perfect for SLOs and autoscaling
- Easy math & alerting
Cons:
- Loses per-event detail
- “Which request failed?” still needs logs or traces
- Wrong type/labels are hard to fix retroactively
When to Use Which?
- Am I alive? → Metric (`up == 1`)
- Error rate spiking? → Metric (`5xx_rate_per_min`) alerts, then logs for stack traces
- Need to audit one user’s journey? → Logs (plus a distributed trace if you have it)
- Capacity planning? → Metrics (CPU, p99 latency, QPS)
- Post-mortem timeline? → Both (metrics for the impact window, logs for the root cause)
Rule of thumb: emit both, treat them differently.
Conclusion
Logs and metrics are signals, not rivals.
Store raw events as logs (your source of truth and storytelling material); publish rolled-up numbers as metrics for health checks and automation. Mixing them (“metrics-inside-logs” or “log-every-metric-tick”) usually hurts both scale and clarity.
Happy shipping, and may your dashboards stay green!