
Why kube-prometheus-stack Isn’t Enough for Kubernetes Observability

2025/10/28 14:04

Observability in Kubernetes has become a hot topic in recent years. Teams everywhere deploy the popular kube-prometheus-stack, which bundles Prometheus and Grafana into an opinionated setup for monitoring Kubernetes workloads. On the surface, it looks like the answer to all your monitoring needs. But here is the catch: monitoring is not observability. And if you confuse the two, you will hit a wall when your cluster scales or your incident response gets messy.

In this first post of my observability series, I want to break down the real difference between monitoring and observability, highlight the gaps in kube-prometheus-stack, and suggest how we can move toward true Kubernetes observability.

The question I keep hearing

I worked with a team running microservices on Kubernetes. They had kube-prometheus-stack deployed, beautiful Grafana dashboards, and alerts configured. Everything looked great until 3 AM on a Tuesday when API requests started timing out.

The on-call engineer got paged. Prometheus showed CPU spikes. Grafana showed pod restarts. When the team jumped on Slack, they asked me: “Do you have tools for understanding what causes these timeouts?” They spent two hours manually correlating logs across CloudWatch, checking recent deployments, and guessing at database queries before finding the culprit: a batch job with an unoptimized query hammering the production database.

I had seen this pattern before. Their monitoring stack told them something was broken, but not why. With distributed tracing, they would have traced the slow requests back to that exact query in minutes, not hours. This is the observability gap I keep running into: teams confuse monitoring dashboards with actual observability. The lesson for them was clear: monitoring answers “what broke” while observability answers “why it broke.” And fixing this requires shared ownership. Developers need to instrument their code for visibility. DevOps engineers need to provide the infrastructure to capture and expose that behavior. When both sides own observability together, incidents get resolved faster and systems become more reliable.

Monitoring vs Observability

Most engineers use the terms interchangeably, but they are not the same. Monitoring tells you when something is wrong, while observability helps you understand why it went wrong.

  • Monitoring: Answers “what is happening?” You collect predefined metrics (CPU, memory, disk) and set alerts when thresholds are breached. Your alert fires: “CPU usage is 95%.” Now what?
  • Observability: Answers “why is this happening?” You investigate using interconnected data you didn’t know you’d need. Which pod is consuming CPU? What user request triggered it? Which database query is slow? What changed in the last deployment?

The classic definition of observability relies on the three pillars:

  • Metrics: Numerical values over time (CPU, latency, request counts).
  • Logs: Unstructured text for contextual events.
  • Traces: Request flow across services.

Prometheus and Grafana excel at metrics, but Kubernetes observability requires all three pillars working together. The CNCF observability landscape shows how the ecosystem has evolved beyond simple monitoring. If you only deploy kube-prometheus-stack, you will only get one piece of the puzzle.

The Dominance of kube-prometheus-stack

Let’s be fair. kube-prometheus-stack is the default for a reason. It provides:

  • Prometheus for metrics scraping
  • Grafana for dashboards
  • Alertmanager for rule-based alerts
  • Node Exporter for hardware and OS metrics

With Helm, you can set it up in minutes. This is why it dominates Kubernetes monitoring setups today. But it’s not the full story.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

Within minutes, you’ll have Prometheus scraping metrics, Grafana running on port 3000, and a collection of pre-configured dashboards. It feels like magic at first.
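Before opening Grafana, it is worth confirming the components actually came up. A quick check (resource names depend on your Helm release name; these match the install command above):

# List the stack's pods; expect Prometheus, Grafana, Alertmanager, the operator, and exporters
kubectl get pods -n monitoring

# Confirm the ServiceMonitors that tell Prometheus what to scrape
kubectl get servicemonitors -n monitoring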

Access Grafana to see your dashboards:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80

Default credentials are admin / prom-operator. You’ll immediately see dashboards for Kubernetes cluster monitoring, node exporter metrics, and pod resource usage. The data flows in automatically.
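If your values override the default password, you can read it from the secret the chart creates. A small sketch, assuming the release name from the install command above (adjust the secret name if yours differs):

# Read the Grafana admin password from the chart-managed secret
kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d && echo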

In many projects, I’ve seen teams proudly display dashboards full of red and green panels yet still struggle during incidents. Why? Because the dashboards told them what broke, not why.

Common Pitfalls with kube-prometheus-stack

Metric Cardinality Explosion

Cardinality is the number of unique time series created by combining a metric name with all possible label value combinations. Each unique combination creates a separate time series that Prometheus must store and query. The Prometheus documentation on metric and label naming provides official guidance on avoiding cardinality issues.

Prometheus loves labels, but unbounded label values can take your monitoring down. If you add dynamic labels like user_id or transaction_id, you end up with millions of time series, which hurts both storage and query performance. I’ve watched a production cluster go down not because of the application but because Prometheus itself was choking.

Here’s a bad example that will destroy your Prometheus instance:

from prometheus_client import Counter

# BAD: High-cardinality labels
http_requests = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'user_id', 'transaction_id']  # AVOID!
)

# With 1000 users and 10000 transactions per user, you get:
# 5 methods * 20 endpoints * 1000 users * 10000 transactions = 1 billion time series

Instead, use low-cardinality labels and track high-cardinality data elsewhere:

from prometheus_client import Counter

# GOOD: Low-cardinality labels
http_requests = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']  # Limited set of values
)

# Now you have: 5 methods * 20 endpoints * 5 status codes = 500 time series

You can check your cardinality with this PromQL query:

count({__name__=~".+"}) by (__name__)

If you see metrics with hundreds of thousands of series, you’ve found your culprit.
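To rank the worst offenders directly instead of scanning the full list, a topk variant of the same idea works. This is a general PromQL pattern, not something shipped with the chart:

# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))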

Lack of Scalability

In small clusters, a single Prometheus instance works fine. In large enterprises with multiple clusters, it becomes a nightmare. Without federation or sharding, Prometheus does not scale well. If you’re building multi-cluster infrastructure, understanding Kubernetes deployment patterns becomes critical for running monitoring components reliably.

For multi-cluster setups, you’ll need Prometheus federation (see the Prometheus federation documentation). Here’s a basic configuration for a global Prometheus instance that scrapes cluster-specific instances:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-pods"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'prometheus-cluster-1.monitoring:9090'
          - 'prometheus-cluster-2.monitoring:9090'
          - 'prometheus-cluster-3.monitoring:9090'

Even with federation, you hit storage limits. A single Prometheus instance struggles beyond 10-15 million active time series.
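When you outgrow a single instance, the usual escape hatch is remote_write, shipping samples to a horizontally scalable backend such as Thanos, Cortex, or Mimir. A minimal sketch; the endpoint URL is a placeholder for whatever backend you run:

remote_write:
  - url: http://metrics-backend.monitoring:9009/api/v1/push   # placeholder backend endpoint
    queue_config:
      max_samples_per_send: 5000   # batch size per shard
      capacity: 10000              # buffered samples per shard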

Alert Fatigue

Kube-prometheus-stack ships with a bunch of default alerts. While they are useful at first, they quickly generate alert fatigue. Engineers drown in notifications that don’t actually help them resolve issues.

Check your current alert rules:

kubectl get prometheusrules -n monitoring

You’ll likely see dozens of pre-configured alerts. Here’s an example of a noisy alert that fires too often:

- alert: KubePodCrashLooping
  annotations:
    description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'
    summary: Pod is crash looping.
  expr: |
    max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"}[5m]) >= 1
  for: 15m
  labels:
    severity: warning

The problem? This fires for every pod in CrashLoopBackOff, including those in development namespaces or expected restarts during deployments. You end up with alert spam.

A better approach is to tune alerts based on criticality:

- alert: CriticalPodCrashLooping
  annotations:
    description: 'Critical pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping'
    summary: Production-critical pod is failing.
  expr: |
    max_over_time(kube_pod_container_status_waiting_reason{
      reason="CrashLoopBackOff",
      namespace=~"production|payment|auth"
    }[5m]) >= 1
  for: 5m
  labels:
    severity: critical

Now you only get alerted for crashes in critical namespaces, and you can respond faster because the signal-to-noise ratio is higher.
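You can push the same idea into Alertmanager by routing on severity, so warnings land in a chat channel while only critical alerts page someone. A minimal sketch with hypothetical Slack and PagerDuty receivers; the webhook URL and routing key are placeholders, and where this config lives depends on how you manage Alertmanager in your setup:

route:
  receiver: slack-warnings             # default: low-urgency notifications
  group_by: ['alertname', 'namespace']
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-oncall       # only critical alerts page a human

receivers:
  - name: slack-warnings
    slack_configs:
      - channel: '#alerts-warning'
        api_url: 'https://hooks.slack.com/services/REPLACE_ME'   # placeholder webhook
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: 'REPLACE_ME'      # placeholder Events API v2 key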

Dashboards That Show What but Not Why

Grafana panels look impressive, but most of them only highlight symptoms. High CPU, failing pods, dropped requests. They don’t explain the underlying cause. This is the observability gap.

Here’s a typical PromQL query you’ll see in Grafana dashboards:

# Shows CPU usage percentage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

This tells you what: CPU is at 95%. But it doesn’t tell you why. Which process? Which pod? What triggered the spike?

You can try drilling down with more queries:

# Top 10 pods by CPU usage
topk(10, rate(container_cpu_usage_seconds_total[5m]))

Even this shows you the pod name, but not the request path, user action, or external dependency that caused the spike. Without distributed tracing, you’re guessing. You end up in Slack asking, “Did anyone deploy something?” or “Is the database slow?”

Why kube-prometheus-stack Alone Is Not Enough for Kubernetes Observability

Here is the opinionated part: kube-prometheus-stack is monitoring, not observability. It’s a foundation, but not the endgame. Kubernetes observability requires:

  • Logs (e.g., Loki, Elasticsearch)
  • Traces (e.g., Jaeger, Tempo)
  • Correlated context (not isolated metrics)

Without these, you will continue firefighting with partial visibility.

Building a Path Toward Observability

So, how do we close the observability gap?

  • Start with kube-prometheus-stack, but acknowledge its limits.
  • Add a centralized logging solution (Loki, Elasticsearch, or your preferred stack).
  • Adopt distributed tracing with Jaeger or Tempo.
  • Prepare for the next step: OpenTelemetry.

Here’s how to add Loki for centralized logging alongside your existing Prometheus setup:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install Loki for log aggregation
helm install loki grafana/loki \
  --namespace monitoring \
  --create-namespace

For distributed tracing, Tempo integrates seamlessly with Grafana:

# Install Tempo for traces
helm install tempo grafana/tempo \
  --namespace monitoring

Now configure Grafana to use Loki and Tempo as data sources, either through the Grafana UI or with a provisioning file like this:

apiVersion: 1

datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3100

With this setup, you can jump from a metric spike in Prometheus to related logs in Loki and traces in Tempo. This is when monitoring starts becoming observability.
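To make that jump clickable, Grafana’s Loki data source supports derived fields that turn a trace ID found in a log line into a link to the matching trace in Tempo. A minimal sketch extending the provisioning above, assuming your logs contain a trace_id=<id> field and the Tempo data source is given uid: tempo (both assumptions, not part of the default setup):

datasources:
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo:3100
    uid: tempo                           # referenced by the derived field below
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'   # assumes logs include trace_id=<id>
          url: '$${__value.raw}'           # $$ escapes $ in provisioning files
          datasourceUid: tempo             # link the log line to the Tempo trace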

OpenTelemetry introduces a vendor-neutral way to capture metrics, logs, and traces in a single pipeline. Instead of bolting together siloed tools, you get a unified foundation. I’ll cover this in detail in the next post on OpenTelemetry in Kubernetes.

Conclusion

Kubernetes observability is more than Prometheus and Grafana dashboards. Kube-prometheus-stack gives you a strong monitoring foundation, but it leaves critical gaps in logs, traces, and correlation. If you only rely on it, you will face cardinality explosions, alert fatigue, and dashboards that tell you what went wrong but not why.

True Kubernetes observability requires a mindset shift. You’re not just collecting metrics anymore. You’re building a system that helps you ask questions you didn’t know you’d need to answer. When an incident happens at 3 AM, you want to trace a slow API call from the user request, through your microservices, down to the database query that’s timing out. Prometheus alone won’t get you there.

To build true Kubernetes observability:

  • Accept kube-prometheus-stack as monitoring, not observability
  • Add logs and traces into your pipeline
  • Watch out for metric cardinality and alert noise
  • Move toward OpenTelemetry pipelines for a unified solution

The monitoring foundation you build today shapes how quickly you can respond to incidents tomorrow. Start with kube-prometheus-stack, acknowledge its limits, and plan your path toward full observability. Your future self (and your on-call team) will thank you.

In the next part of this series, I will show how to deploy OpenTelemetry in Kubernetes for centralized observability. That is where the real transformation begins.

