Monitoring at Grafana Labs

Grafana Labs has 8+ clusters in GKE running 270 nodes of various sizes, and all the hosted metrics and hosted logs Grafana Cloud offerings run on 16-core, 64 GB machines. This infrastructure monitoring involves Prometheus for metrics, Loki for logs, and Jaeger for distributed tracing, along with monitoring mixins.

At the recent All Systems Go! conference in Berlin, David Kaltschmidt, Director of User Experience, gave a talk about what monitoring these clusters and servers looks like at Grafana Labs and shared some best practices.

Alerting

At Grafana Labs, “we make dashboarding software, but we don’t want to look at dashboards all day,” Kaltschmidt said. “So we try to just have monitoring by alerting, using the time series that we produce, and write alerts that only page us when things are happening.”

That requires the team to make sure that anyone who receives a page will have meaningful dashboards and good procedures in place.

This is achieved with Prometheus using the node exporter, a service running on the host that collects hardware and OS metrics exposed by *NIX kernels. It’s written in Go with pluggable metric collectors.

With so many collector modules, Kaltschmidt said he sought the advice of GitLab’s Ben Kochie, who referred him to the “Linux Observability Tools” diagram.

“Ben has been using a bunch of these in his day-to-day job, but the bigger task for him is always how can we also have the information that these tools provide in a sort of time series,” Kaltschmidt said. “Luckily, it has a bit of overlap with what node exporter gives you, and for us it became a challenge: How can we map node exporter metrics to what we need to cover in the parts of a Linux system?”

The resulting graphic maps the metrics exported by node exporter to the parts of a Linux system.

[Figure: System Chart with node_exporter Metrics]

(The terms on the graphic are the collector module names; the metrics are named node_ followed by the module name.)

Metrics Used

CPU Utilization

“The time series node_cpu_seconds_total gives you the seconds spent per CPU, in wall clock time, in the various modes that the CPU supports,” said Kaltschmidt. This is a classic example of a metric used for monitoring infrastructure. “So you have system, user, idle, guest, etc., and all these different modes.”
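Those per-mode counters are typically turned into a utilization figure with a PromQL rate query. The following is a common pattern for doing so, not a query from the talk itself; the instance label value is a placeholder for whatever host you are inspecting:

```promql
# Fraction of time each instance spent non-idle over the last 5 minutes,
# averaged across all cores: a standard CPU-utilization expression.
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Per-mode breakdown (system, user, iowait, ...) for one host;
# "myhost:9100" is a placeholder instance label.
sum by (mode) (rate(node_cpu_seconds_total{instance="myhost:9100"}[5m]))
```

Because node_cpu_seconds_total is a monotonically increasing counter, rate() is what converts it into seconds-per-second (i.e., a fraction of CPU time), which is why nearly all dashboards and alerts built on this metric start from a rate() expression.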