“The thought experiment, Schroedinger’s cat, introduces a cat that is both alive and dead, at the same time. No better analogy exists for describing the complexity of monitoring a platform like Kubernetes, where things come, go, live, and die, dozens of times, every minute.” — Matt Reider, Dynatracer and Kubernetes Wizard
Kubernetes is the de-facto standard for container orchestration as it solves many problems, like distributing workloads across machines, achieving fault tolerance, and re-scheduling workloads when problems occur. While speeding up development processes and reducing complexity does make the lives of Kubernetes operators easier, the inherent abstraction and automation can lead to new types of errors that are difficult to find, troubleshoot, and prevent.
Typically, Kubernetes monitoring is managed using a separate dashboard (like the Kubernetes Dashboard or the Grafana App for Kubernetes) that shows the state of the cluster and alerts when anomalies occur. Monitoring agents installed on the Kubernetes nodes monitor the Kubernetes environment and give valuable information about the status of nodes. Nevertheless, there are related components and processes, for example, virtualization infrastructure and storage systems (see image below), that can lead to problems in your Kubernetes infrastructure.
As a platform operator, you want to identify problems quickly and learn from them to prevent future outages. As an application developer, you want to instrument your code to understand how your services communicate with each other and where bottlenecks cause performance degradations. Fortunately, monitoring solutions are available to analyze and display such data, provide deep insights, and take automated actions based on those insights (for example, alerting or remediation).
The Kubernetes experience
When using managed environments like Google Kubernetes Engine (GKE), Amazon Elastic Kubernetes (EKS), or Azure Kubernetes Service it’s easy to spin up a new cluster. After applying the first manifests (which are likely copied and pasted from a how-to tutorial), a web server is up and running within minutes.
However, when extending the configuration for production, with your growing expertise, you may discover that:
- Your application isn’t as stateless as you thought it was.
- Configuring storage in Kubernetes is more complex than using a file system on your host.
- Storing configurations or secrets in the container image may not be the best idea.
You overcome all these obstacles, and after some time, your application is running smoothly. During the adoption-phase, some assumptions about the operating conditions were made, and the application deployment is aligned to them. Even though Kubernetes has built-in error/fault detection and recovery mechanisms, unexpected anomalies can still creep in, leading to data loss, instability, and negative impact on user experience. Additionally, the auto-scaling mechanisms embedded in Kubernetes can have a negative impact on costs if your resource limits are set to high (or not set at all).
To protect yourself from this, you want to instrument your application to provide deep monitoring insights. This enables you to take actions (automatically or manually) when anomalies and performance problems occur that have an impact on end-user experience.
What does observability mean for Kubernetes?
When designing and running modern, scalable, and distributed applications, Kubernetes seems to be the solution for all your needs. Nevertheless, as a container orchestration platform, Kubernetes doesn’t know a thing about the internal state of your applications. That’s why developers and SRE’s rely on telemetry data (i.e., metrics, traces, and logs) to gain a better understanding of the behavior of their code during runtime.
- Metrics are a numeric representation of intervals over time. They can help you find out how the behavior of a system changes over time (for example, how long do requests take in the new version compared to the last version?).
- Traces represent causally related distributed events on a distributed system, showing, for example, how a request flows from the user to the database.
- Logs are easy to produce and provide data in plain-text, structured (JSON, XML), or binary format. Logs can also be used to represent event data.
- Apart from the three pillars of observability (i.e., logs, metrics, and traces), more sophisticated approaches can add topology information, real user experience data, and other meta-information.