When you run your application or service (especially in production), sooner or later you will need to answer questions like:
- “Is my service alive?”
- “How fast is my app?”
- “What just happened before the crash?”
Observability tools throw around the words logs and metrics all the time. They sound similar (both are “something my code emits”), but they solve different problems and need different handling.
This post sums up:
- what a log is
- what a metric is
- how they differ
- good (and bad) habits for each
- when to rely on which
What Is a Log?
A log entry is a statement produced by code at the moment something interesting happens.
Examples:
```
2025-07-21T14:18:04Z INFO api user=42 "payment created"
2025-07-21T14:18:07Z ERROR db "duplicate key value violates unique constraint"
```
Key points:
- Timestamped “facts” about the run-time path.
- Usually human-readable first, machine-parseable second.
- Volume can explode when traffic or error rates spike.
- You normally read them to answer “why did X happen?”.
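To make the format concrete, here is a minimal sketch of emitting structured log lines from Python's standard logging module. The logger name, the JSON field set, and the extra keys are assumptions for illustration, not a required layout.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (machine-parseable)."""
    def format(self, record):
        entry = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Attach illustrative extra fields passed via `extra=` so they become searchable keys.
        for key in ("user", "trace_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("api")
log.addHandler(handler)
log.setLevel(logging.INFO)

# A timestamped "fact" about what just happened, with context attached.
log.info("payment created", extra={"user": 42})
```

Each record comes out as one JSON object per line, so it stays readable to a human while remaining trivially machine-parseable.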
What Is a Metric?
A metric is a numeric measurement sampled at a regular interval (seconds, minutes) and named so that you can graph/alert on it.
Examples:
```
payments_processed_total{service="api"} 2_400_123
request_latency_seconds_bucket{le="0.25", route="/login"} 486
cpu_usage_percent{host="api-01"} 73.2
```
Key points:
- Small key-value records, cheap to ship/store.
- Optimised for math: rate, avg, percentiles, heatmaps.
- Meant for dashboards and automated alerts.
- You look at them to answer “is X healthy?” or “how big is X?”.
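As a sketch of how numbers like these get produced, this is roughly what registering and updating them looks like with the Python prometheus_client library; the metric names, labels, and port mirror the examples above and are illustrative only.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

PAYMENTS = Counter("payments_processed_total", "Payments processed", ["service"])
CPU = Gauge("cpu_usage_percent", "CPU usage", ["host"])
LATENCY = Histogram(
    "request_latency_seconds", "Request latency", ["route"],
    buckets=(0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

# Expose /metrics so a scraper (e.g. Prometheus) can sample values on its own schedule.
start_http_server(8000)

# In the request path:
with LATENCY.labels(route="/login").time():   # Histogram: records elapsed seconds into buckets
    ...                                       # handle the request
PAYMENTS.labels(service="api").inc()          # Counter: only ever goes up
CPU.labels(host="api-01").set(73.2)           # Gauge: set to the current value, can go up or down
```

The client only keeps the current values in memory; the scraper decides the sampling interval, which is part of why each update is so cheap.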
Logs vs Metrics: A Quick Comparison
Think of a web request:
- Log = “user 42 failed to pay, card expired” (time, user id, stack trace).
- Metric = `checkout_fail_total{reason="card_expired"}` incremented by 1.

Logs and metrics are complementary: metrics tell you something is wrong; logs tell you why. They overlap and can sometimes be converted into each other, but they play different roles (see the sketch after the table below).
| | Logs | Metrics |
|---|---|---|
| Sample | Individual event | Aggregate/summary |
| Format | Text or structured JSON/proto | Numbers with labels |
| Typical size | Kilobytes each | Bytes each |
| Query style | Search, grep, full-text, trace | Math, filter, group by time |
| Storage retention | Hours to weeks (expensive) | Weeks to years (cheap) |
| Use cases | Debug, audit, investigate | Dashboards, alerts, capacity |
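To see how the two signals come out of the same code path, here is a small sketch of a checkout handler that increments a counter for the dashboard and writes a log line for the investigation; the metric name and the exception type are made up for the example.

```python
import logging
from prometheus_client import Counter

log = logging.getLogger("checkout")
CHECKOUT_FAIL = Counter("checkout_fail_total", "Failed checkouts", ["reason"])

class CardExpiredError(Exception):
    """Illustrative domain error for an expired card."""

def checkout(user_id, charge_card):
    try:
        charge_card()
    except CardExpiredError:
        # Metric: one anonymous tick that an alerting rule can rate() over.
        CHECKOUT_FAIL.labels(reason="card_expired").inc()
        # Log: the full story for this one user, with a stack trace.
        log.exception("checkout failed", extra={"user": user_id})
        raise
```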
Logging Best Practices
- Log intent, not just data. Branches (`if`, `catch`) and state changes are what to look at.
- Be structured. JSON, key=value, or protobuf give you free dimensions for search.
- Add correlation IDs. `trace_id`, `user_id`, … let you stitch together the full story (see the sketch after this list).
- Control volume.
  - Rate-limit or sample noisy statements (see the sketch after this list).
  - When rate-limiting, time-based sampling (once per second) usually beats count-based (every 100 calls) when traffic varies.
- Keep levels, but use them. `DEBUG` off in prod; `ERROR` should be rare.
- Ship ASAP. Buffering is OK, losing logs isn’t (especially around crashes).
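As a sketch of two of these points, here is one way to inject a correlation ID into every record and to rate-limit a noisy statement with time-based sampling; the contextvar name, logger names, and the one-per-second interval are assumptions, not a standard.

```python
import contextvars
import logging
import time

trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every record with the current request's trace_id so lines can be stitched together."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

class OncePerSecondFilter(logging.Filter):
    """Time-based sampling: let at most one record through per second."""
    def __init__(self):
        super().__init__()
        self._last = 0.0
    def filter(self, record):
        now = time.monotonic()
        if now - self._last >= 1.0:
            self._last = now
            return True
        return False

log = logging.getLogger("api")
log.addFilter(TraceIdFilter())
# (A formatter with %(trace_id)s in its format string would surface the field in the output.)

noisy = logging.getLogger("api.retry")
noisy.addFilter(OncePerSecondFilter())

# Set the correlation ID once per request; every later line carries it.
trace_id_var.set("req-1234")
log.warning("payment retry scheduled")

# A hot retry loop: only about one of these lines per second is actually emitted.
for _ in range(10_000):
    noisy.warning("transient db error, retrying")
```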
Pros:
- Rich context for humans
- Works everywhere
- Good for incident investigations and audits
Cons:
- Costly at high TPS
- Harder to turn into time-series math
- Sensitive data easily leaks
Metrics Best Practices
- Make them first-class. Use a metrics client (`prometheus_client`, `statsd`, etc.), not `logger.info`.
- Pick clear names and units. `queue_size_gauge`, `request_latency_seconds`. Units in the name avoid “is that ms or s?”.
- Choose the right type. Counter (monotonically increasing), Gauge (up/down), Histogram/Summary (distribution).
- Label sparingly. 10 high-cardinality labels will sink your TSDB. Pick the ones you actually slice on (see the sketch after this list).
- Sample/aggregate in the agent, not in code. The collector can down-sample later; raw ≠ noisy.
- Alert on RED/USE. Request Rate, Errors, Duration; Utilization, Saturation, Errors.
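To illustrate the labeling point, here is a small sketch with prometheus_client; the metric and label names are made up for the example, and the high-cardinality variant is deliberately left commented out.

```python
from prometheus_client import Counter

# Good: a handful of low-cardinality labels you actually group by.
HTTP_ERRORS = Counter(
    "http_errors_total", "HTTP error responses",
    ["route", "status_class"],            # e.g. "/login", "5xx"
)

# Bad (do not do this): user_id or request_id as a label creates one
# time series per user/request and will sink the TSDB.
# HTTP_ERRORS = Counter("http_errors_total", "...", ["user_id", "request_id"])

HTTP_ERRORS.labels(route="/login", status_class="5xx").inc()
```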
Pros:
- Tiny, cheap, fast
- Perfect for SLOs and autoscaling
- Easy math & alerting
Cons:
- Loses per-event detail
- “Which request failed?” still needs logs or traces
- Wrong type/labels are hard to fix retroactively
When to Use Which?
- Am I alive? → Metric (`up == 1`)
- Error rate spiking? → Metric (`5xx_rate_per_min`) alerts, then logs for stack traces
- Need to audit one user’s journey? → Logs (plus a distributed trace if you have it)
- Capacity planning? → Metrics (CPU, p99 latency, QPS)
- Post-mortem timeline? → Both (metrics for the impact window, logs for the root cause)
Rule of thumb: emit both, treat them differently.
Conclusion
Logs and metrics are signals, not rivals.
Store raw events as logs (your source of truth and storytelling material); publish rolled-up numbers as metrics for health checks and automation. Mixing them (“metrics-inside-logs” or “log-every-metric-tick”) usually hurts both scale and clarity.
Happy shipping, and may your dashboards stay green!