How Prometheus Monitoring works | Prometheus Architecture explained

TechWorld with Nana
24 Apr 2020 · 21:30

Summary

TL;DR: This video delves into Prometheus, a pivotal monitoring tool for dynamic container environments like Kubernetes and Docker Swarm. It outlines Prometheus' architecture, including its server, time-series database, and data retrieval worker. The script explains how Prometheus collects metrics from targets via a pull system, using exporters for services lacking native support. It also touches on alerting mechanisms, data storage, and the use of PromQL for querying metrics. The video promises practical examples, highlighting Prometheus' importance in modern DevOps for automating monitoring and alerting to maintain service availability.

Takeaways

  • 📈 **Prometheus Overview**: Prometheus is a monitoring tool designed for dynamic container environments like Kubernetes and Docker Swarm, but also applicable to traditional infrastructure.
  • 🔍 **Use Cases**: It's used for monitoring servers, applications, and services, providing insights into hardware and application levels, which is crucial for maintaining uptime and performance.
  • 🏗️ **Architecture**: Prometheus consists of a server, time series database, data retrieval workers, and a web server/API for querying stored data.
  • 🎯 **Key Characteristics**: It's known for its pull model of data collection, which reduces network load and allows for easier service status detection.
  • 🚀 **Popularity in DevOps**: It's become a mainstream choice in container and microservice monitoring due to its automation capabilities and efficiency.
  • 📊 **Data Collection**: Prometheus collects metrics from targets via an HTTP endpoint, which requires the target to expose a `/metrics` endpoint.
  • 🔌 **Exporters**: For services without native Prometheus support, exporters are used to translate metrics into a format Prometheus can understand.
  • 💾 **Storage**: It stores data in a local on-disk time series database and can integrate with remote storage systems.
  • 📑 **Configuration**: Prometheus uses a `prometheus.yml` file to configure target scraping and rule evaluation intervals.
  • 🔥 **Alerting**: The Alertmanager component is responsible for handling alerts based on defined rules and sending notifications through various channels.
  • 🔄 **Scalability**: While Prometheus is reliable and self-contained, scaling it across many servers can be challenging due to its design.

Q & A

  • What is Prometheus and why is it important in modern infrastructure?

    -Prometheus is a monitoring tool designed for highly dynamic container environments like Kubernetes and Docker Swarm. It's important because it helps monitor and manage complex infrastructures, providing insights into hardware and application levels to prevent downtimes and ensure smooth operations.

  • What are the different use cases of Prometheus?

    -Prometheus can be used for monitoring containerized applications, traditional non-container infrastructure, and microservices. It's particularly useful in environments where there's a need for automated monitoring and alerting to maintain system reliability and performance.

  • What is the architecture of Prometheus?

    -Prometheus architecture consists of a Prometheus server that includes a time series database for storing metrics, a data retrieval worker for pulling metrics from targets, and a web server/API for querying stored data. It also includes components like exporters for non-native Prometheus targets and client libraries for custom application metrics.

  • How does Prometheus collect metrics from targets?

    -Prometheus collects metrics by pulling data from HTTP endpoints exposed by targets. This requires the target to expose a /metrics endpoint in a format that Prometheus understands. Exporters are used to convert metrics from services that don't have native Prometheus endpoints.

  • What is the significance of the pull model in Prometheus?

    -The pull model allows Prometheus to reduce the load on the infrastructure by not requiring services to push metrics. It also simplifies the monitoring process and makes it easier to detect that a service is down, since the target simply stops responding to Prometheus' scrape requests.

  • How does Prometheus handle short-lived targets that aren't around long enough to be scraped?

    -For short-lived targets like batch jobs, Prometheus offers a component called Pushgateway. Such jobs push their metrics to the Pushgateway, and Prometheus collects them from there into its database.

  • What is the role of the Prometheus Alertmanager?

    -The Alertmanager is responsible for firing alerts through various channels like email or Slack when certain conditions specified in the alert rules are met.

  • How does Prometheus store its data and how can it be accessed?

    -Prometheus stores metrics data in a local on-disk time series database in a custom format. It can also integrate with remote storage systems. The data can be queried through its server API using the PromQL query language or visualized through tools like Grafana.

  • What is the difficulty with scaling Prometheus?

    -Scaling Prometheus can be challenging due to its design to be reliable even when other systems have outages. Each Prometheus server is standalone, and setting up an extensive infrastructure for aggregation of metrics from multiple servers can be complex.

  • How is Prometheus integrated with container environments like Docker and Kubernetes?

    -Prometheus components are available as Docker images, making it easy to deploy in Kubernetes or other container environments. It integrates well with Kubernetes, providing cluster node resource monitoring out of the box.
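
As a small illustration of that Kubernetes integration, a scrape job can use Prometheus' built-in Kubernetes service discovery to find cluster nodes automatically. This is a minimal sketch rather than the configuration shown in the video; the job name is a placeholder.

```yaml
# Minimal sketch: let Prometheus discover Kubernetes nodes via service discovery
scrape_configs:
  - job_name: "kubernetes-nodes"    # hypothetical job name
    kubernetes_sd_configs:
      - role: node                  # discover every node in the cluster
```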

Outlines

00:00

🌟 Introduction to Prometheus

The video introduces Prometheus, a monitoring tool crucial for modern infrastructure, especially in containerized environments like Kubernetes and Docker Swarm. It explains the importance of Prometheus in monitoring dynamic systems, its architecture, components, and its widespread acceptance. The tool is highlighted for its ability to handle complex DevOps environments with automation, offering insights into potential issues before they affect users. It also discusses the challenges of maintaining large-scale infrastructures without monitoring tools and the benefits of early detection and alerting of problems.

05:01

🛠️ Prometheus Architecture and Data Collection

This section delves into the architecture of Prometheus, which consists of a server, time series database, data retrieval worker, and a web server/API. It explains how Prometheus monitors targets and collects metrics, the role of exporters in converting data into a format Prometheus can understand, and the importance of the /metrics endpoint. The paragraph also covers different metric types Prometheus uses and how it collects data through a pull system, which is more efficient in environments with many microservices, and the use of exporters for services that don't natively support Prometheus monitoring.
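
To make that metric format concrete, here is what a scraped `/metrics` page typically looks like, with one entry per metric type; the metric names and values are illustrative, not taken from the video. Each entry carries the HELP and TYPE attributes mentioned above.

```text
# HELP http_requests_total Total number of HTTP requests handled
# TYPE http_requests_total counter
http_requests_total{method="get",status="200"} 1027

# HELP node_memory_active_bytes Memory currently marked active
# TYPE node_memory_active_bytes gauge
node_memory_active_bytes 4.190208e+08

# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 2402
http_request_duration_seconds_sum 1523.7
http_request_duration_seconds_count 3000
```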

10:02

🔌 Exporters and Prometheus Client Libraries

The script discusses the use of exporters for various services and platforms to make metrics available for Prometheus scraping. It mentions how to set up exporters for Linux servers and MySQL containers in Kubernetes. The paragraph also explains the role of Prometheus client libraries in applications, allowing developers to expose metrics that infrastructure teams can monitor. The importance of the pull model over the push model for monitoring is reiterated, along with the use of the Push Gateway for short-lived jobs.
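
For the Linux-server case described above, the setup is roughly the following; the release version and URL are examples, so check the current node_exporter release before copying them.

```bash
# Download, unpack and run the node exporter (version shown is an example)
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvf node_exporter-1.3.1.linux-amd64.tar.gz
cd node_exporter-1.3.1.linux-amd64
./node_exporter
# Host metrics are now scrapable at http://<server-ip>:9100/metrics
```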

15:04

📝 Prometheus Configuration and Alerting

This part of the script explains how Prometheus is configured through the prometheus.yml file, detailing how to define targets and scrape intervals. It introduces the concept of service discovery, rule files for creating alerts, and the global configuration for setting evaluation intervals. The paragraph also discusses the Alert Manager, Prometheus's component for firing alerts through various channels when certain conditions are met.
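
A prometheus.yml along the lines described here might look as follows; the target addresses, job names, and rule file name are placeholders rather than the exact values from the video.

```yaml
global:
  scrape_interval: 15s          # how often targets are scraped by default
  evaluation_interval: 15s      # how often recording/alerting rules are evaluated

rule_files:
  - "alert.rules.yml"           # hypothetical rule file with aggregation/alert rules

scrape_configs:
  - job_name: "prometheus"      # Prometheus scraping its own /metrics endpoint
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node_exporter"
    scrape_interval: 30s        # overrides the global interval for this job
    static_configs:
      - targets: ["192.168.1.10:9100"]   # example host running a node exporter
```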

20:05

💾 Data Storage and Scaling Challenges

This section covers how Prometheus stores data on disk using a local time series database and can integrate with remote storage systems. It explains the use of PromQL for querying metric data and how tools like Grafana can visualize this data. It also touches on the complexity of configuring Prometheus and the steep learning curve involved, and concludes with a discussion of Prometheus's reliability during system outages, its difficulty in scaling, and ways to work around these limitations.
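
The PromQL example from the video (all HTTP status codes outside the 4xx range, plus a 30-minute subquery over it) corresponds roughly to queries like the ones below; the metric name is an assumption, since the exact series isn't named in the summary.

```promql
# Series whose status label is not in the 4xx range (metric name is illustrative)
http_requests_total{status!~"4.."}

# Roughly: a 30-minute subquery over the 5-minute rate of the same selection
rate(http_requests_total{status!~"4.."}[5m])[30m:1m]
```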

🚀 Prometheus with Docker and Kubernetes

The script concludes with a brief mention of Prometheus's compatibility with Docker and Kubernetes, noting that Prometheus components are available as Docker images and integrate well with Kubernetes for cluster node resource monitoring. It also announces a forthcoming video on deploying and configuring Prometheus for Kubernetes monitoring.
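
Because the components ship as Docker images, a throwaway local instance can be started with a single command; the image and port below are just the published defaults, not a production setup.

```bash
# Run the official Prometheus image and expose its web UI/API on port 9090
docker run -d --name prometheus -p 9090:9090 prom/prometheus
# Optionally mount your own configuration file:
# docker run -d -p 9090:9090 -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus
```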

Keywords

💡Prometheus

Prometheus is an open-source monitoring and alerting toolkit that is widely used for its effectiveness in containerized environments like Kubernetes and Docker Swarm. It plays a central role in the video's theme by being the main tool discussed for monitoring modern infrastructure. The script explains how Prometheus helps in monitoring dynamic container environments and traditional infrastructure, highlighting its importance in the DevOps world.

💡Monitoring

Monitoring refers to the act of observing and tracking the performance and health of systems, infrastructure, and applications. In the context of the video, monitoring is crucial for maintaining the smooth operation of complex infrastructures, especially those involving multiple servers and containerized applications. The script uses monitoring as a means to prevent downtimes and quickly identify issues within the system.

💡Containerized Environments

Containerized environments are computing environments where applications are deployed within containers, providing isolated and portable runtime environments. The video script discusses how Prometheus was created specifically for monitoring such environments, emphasizing its role in handling the complexities and dynamics of container orchestration platforms like Kubernetes.

💡Alerting

Alerting is the process of notifying system administrators or users when specific conditions or thresholds are met within a monitored system. The script describes Prometheus' alerting capabilities, which can notify maintainers of potential issues before they become critical, thus playing a vital role in proactive system management.

💡Targets

In Prometheus, a target refers to the entities being monitored, which can range from servers to specific applications or services. The script explains that Prometheus can monitor various targets, making it a versatile tool for different monitoring needs within an infrastructure.

💡Metrics

Metrics in the context of the video are quantitative measurements that provide insights into the performance or status of a system, application, or service. Metrics are collected by Prometheus from targets and are crucial for monitoring and alerting purposes. The script provides examples such as CPU usage, memory usage, and number of exceptions as types of metrics.

💡Exporters

Exporters are components that translate and expose metrics from various services or systems into a format that Prometheus can understand. The script mentions exporters as a solution for services that do not natively expose Prometheus metrics endpoints, thus expanding Prometheus' monitoring capabilities.
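
For the MySQL-in-Kubernetes scenario mentioned in the video, a sidecar exporter might be wired up roughly like this; the image names, port, and connection string are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mysql
spec:
  containers:
    - name: mysql
      image: mysql:8.0
    - name: mysql-exporter            # sidecar translating MySQL metrics for Prometheus
      image: prom/mysqld-exporter     # community exporter image
      ports:
        - containerPort: 9104         # serves the exporter's own /metrics endpoint
      env:
        - name: DATA_SOURCE_NAME      # how the exporter connects to MySQL in the same pod
          value: "exporter:password@(localhost:3306)/"
```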

💡PromQL

PromQL (Prometheus Query Language) is the query language used to interact with Prometheus data. The script briefly introduces PromQL as the means to query and retrieve time series data from Prometheus, which can be used for creating dashboards or triggering alerts.
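
Besides the dashboard UI and Grafana, the same PromQL can be sent to the server API over HTTP; this is a generic example against the standard /api/v1/query endpoint, not something shown in the video.

```bash
# Ask the Prometheus server API which targets are currently up (1) or down (0)
curl 'http://localhost:9090/api/v1/query?query=up'
# Returns a JSON result vector with one sample per scraped target
```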

💡Service Discovery

Service discovery in Prometheus refers to the process of dynamically identifying the targets to be monitored. This is important in the video's narrative as it allows Prometheus to adapt to changes in the infrastructure, such as the addition or removal of servers or services.

💡Alert Manager

The Alert Manager in Prometheus is responsible for handling alerts. When certain conditions are met as defined by rules, the Alert Manager triggers notifications through various channels like email or Slack. The script explains its role in the alerting process, emphasizing the proactive nature of Prometheus.
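
A hedged sketch of the two pieces involved: an alerting rule evaluated by the Prometheus server, and an Alertmanager receiver that forwards the firing alert by email. The threshold, names, and address are invented for illustration.

```yaml
# --- alert.rules.yml: evaluated by the Prometheus server ---
groups:
  - name: host-alerts
    rules:
      - alert: HostHighCpuLoad
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m                                   # condition must hold for 10 minutes
        labels:
          severity: warning
        annotations:
          summary: "CPU above 80% on {{ $labels.instance }}"

# --- alertmanager.yml: routes the fired alert to a notification channel ---
route:
  receiver: ops-team
receivers:
  - name: ops-team
    email_configs:
      - to: "ops@example.com"
```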

💡Configuration

Configuration in Prometheus involves setting up the monitoring parameters, such as which targets to monitor and how often to scrape metrics. The script mentions 'prometheus.yml' as the configuration file where these settings are defined, highlighting the importance of proper configuration for effective monitoring.

Highlights

Prometheus is a vital tool for monitoring modern infrastructure, especially in containerized environments like Kubernetes and Docker Swarm.

Its architecture includes a Prometheus server with a time series database, data retrieval worker, and a web server/API for querying stored data.

Prometheus can monitor various targets such as servers, standalone services, or applications, with metrics representing the units of monitoring.

Metrics are categorized into counter, gauge, and histogram types, facilitating the tracking of different aspects of system performance.

Prometheus collects metrics from targets by pulling data from an HTTP endpoint, which reduces the load on the infrastructure compared to push-based systems.

Exporters are used to convert metrics from services into a format that Prometheus can understand and expose at a metrics endpoint.

Prometheus client libraries allow applications to expose their own metrics, which can be monitored and scraped by Prometheus.
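
As a sketch of what that looks like in application code, here is the Python client library exposing a /metrics endpoint with one metric of each type; the Node.js, Java, and other official libraries follow the same pattern. Metric names and the simulated workload are made up.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server
import random, time

REQUESTS = Counter("app_requests_total", "Total requests handled")
IN_FLIGHT = Gauge("app_requests_in_flight", "Requests currently being processed")
LATENCY = Histogram("app_request_duration_seconds", "Request duration in seconds")

start_http_server(8000)  # serves the /metrics endpoint on port 8000

while True:              # simulated workload emitting metrics
    with IN_FLIGHT.track_inprogress(), LATENCY.time():
        REQUESTS.inc()
        time.sleep(random.random())
```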

The pull model of Prometheus makes it easier to detect service health and reduces the risk of network bottlenecks.

For short-lived or batch jobs, Prometheus offers a Push Gateway to allow services to push their metrics directly to the database.
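
A short-lived job using the Python client could hand its result to the Pushgateway roughly like this; the gateway address and job name are placeholders.

```python
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
last_success = Gauge(
    "batch_cleanup_last_success_unixtime",
    "Unix time of the last successful cleanup run",
    registry=registry,
)
last_success.set_to_current_time()

# Push once at the end of the job; Prometheus later scrapes the gateway
push_to_gateway("pushgateway.example.com:9091", job="nightly_cleanup", registry=registry)
```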

Prometheus configurations are defined in a prometheus.yml file, which controls target scraping intervals and service discovery.

Alerts are managed by the Alertmanager component, which can send notifications through various channels when certain conditions are met.

Prometheus stores metrics data in a local on-disk time series database and can integrate with remote storage systems.

Data can be queried using the PromQL query language, and visualization tools like Grafana can display this data.

Prometheus is designed to be reliable even when other systems are down, allowing for effective diagnostics and problem-solving.

While Prometheus is easy to start with a single node, scaling it across many servers can be challenging due to its standalone nature.

Prometheus is fully compatible with Docker and Kubernetes, offering native support for cluster node resource monitoring.

A separate tutorial will cover deploying and configuring Prometheus to monitor Kubernetes clusters.

Transcripts

play00:01

in this video we're going to talk about

play00:02

prometheus so first i'm going to explain

play00:04

to you what prometheus is and what are

play00:06

different use cases where prometheus is

play00:08

used and why is it such an important

play00:11

tool in modern infrastructure we're

play00:13

going to go through prometheus

play00:15

architecture so different components

play00:18

that it contains we're going to see an

play00:20

example configuration and also some of

play00:22

these key characteristics why it became

play00:25

so widely accepted and popular

play00:28

especially in containerized environments

play00:33

prometheus was created to monitor highly

play00:36

dynamic container environments like

play00:39

kubernetes docker swarm etc however it

play00:42

can also be used in a traditional

play00:44

non-container infrastructure where you

play00:47

have just bare servers with applications

play00:49

deployed directly on it so over the past

play00:51

years prometheus has become the

play00:53

mainstream monitoring tool of choice in

play00:57

container and micro service world so

play00:59

let's see why prometheus is so important

play01:02

in such infrastructure and what are some

play01:04

of its use cases modern devops is

play01:07

becoming more and more complex to handle

play01:09

manually and therefore needs more

play01:11

automation

play01:12

so typically you have multiple servers

play01:14

that run containerized applications and

play01:17

there are hundreds of different

play01:18

processes running on that infrastructure

play01:21

and things are interconnected so

play01:23

maintaining such setup to run smoothly

play01:26

and without application down times is

play01:28

very challenging

play01:30

imagine having such a complex

play01:32

infrastructure with loads of servers

play01:34

distributed over many locations and you

play01:37

have no insight of what is happening on

play01:40

hardware level or on application level

play01:42

like errors response latency hardware

play01:46

down or overloaded maybe running out of

play01:49

resources etc

play01:51

in such complex infrastructure there are

play01:53

more things that can go wrong

play01:55

when you have tons of services and

play01:57

applications deployed any one of them

play02:00

can crash and cause failure of other

play02:02

services and you have so many moving

play02:05

pieces and suddenly application becomes

play02:07

unavailable to users you must quickly

play02:10

identify what exactly out of this

play02:13

hundred different things went wrong and

play02:15

that could be difficult and

play02:17

time-consuming when debugging the system

play02:20

manually so let's take a specific

play02:22

example

play02:23

say one specific server ran out of

play02:26

memory and kicked off a running

play02:28

container that was responsible for

play02:31

providing database sync between two

play02:33

database pods in a kubernetes cluster

play02:36

that in turn caused those two database

play02:38

parts to fail that database was used by

play02:41

an authentication service that also

play02:44

stopped working because the database

play02:46

became unavailable

play02:47

and then application

play02:49

that depended on that authentication

play02:51

service couldn't authenticate users in

play02:54

the ui anymore but from a user

play02:56

perspective all you see is error in the

play02:58

ui can't login so how do you know what

play03:01

actually went wrong when you don't have

play03:04

any insight of what's going on inside

play03:06

the cluster you don't see that red line

play03:09

of the chain of events as displayed here

play03:12

you just see the error so now you start

play03:14

working backwards from there to find the

play03:16

cause and fix it

play03:18

so you check is the application back and

play03:20

running does it show an exception is the

play03:23

authentication service running did it

play03:25

crash why did it crash and so on all the way to

play03:27

the initial container failure but what

play03:30

will make this searching the problem

play03:32

process more efficient would be to have

play03:34

a tool that constantly monitors whether

play03:38

services are running and alerts the

play03:40

maintainers as soon as one service

play03:43

crashes so you know exactly what

play03:45

happened or even better it identifies

play03:48

problems before they even occur and

play03:50

alerts the system administrators

play03:52

responsible for that infrastructure to

play03:55

prevent that issue so for example in

play03:57

this case it would check regularly the

play04:00

status of memory usage on each server

play04:03

and when on one of the servers it spikes

play04:05

over for example 70 percent for over an

play04:08

hour or keeps increasing notify about

play04:11

the risk that the memory on that server

play04:14

might soon run out or let's consider

play04:16

another scenario where suddenly you stop

play04:18

seeing logs for your application because

play04:21

elasticsearch doesn't accept any new

play04:24

logs because the server ran out of disk

play04:27

space or elasticsearch reached the

play04:29

storage limit that was allocated for it

play04:32

again the monitoring tool would check

play04:34

continuously the storage space and

play04:36

compare with the elastic search

play04:38

consumption of space of storage

play04:41

and it will see the risk and notify

play04:43

maintainers of the possible storage

play04:45

issue and you can tell the monitoring

play04:47

tool what that critical point is when

play04:50

the alert should be triggered for

play04:52

example if you have a very important

play04:53

application that absolutely cannot have any

play04:56

log data loss you may be very strict and

play04:59

want to take measures as soon as 50 or

play05:01

60 percent capacity is reached or maybe

play05:03

you know adding more storage space will

play05:06

take long because it's a bureaucratic

play05:08

process in your organization where you

play05:10

need approval of some it department and

play05:12

several other people

play05:14

then maybe you also want to be notified

play05:17

earlier about the possible storage issue

play05:19

so that you have more time to fix it or

play05:22

a third scenario where application

play05:24

suddenly becomes too slow because one

play05:26

service breaks down and starts sending

play05:28

hundreds of error messages in a loop

play05:30

across the network that creates high

play05:33

network traffic and slows down other

play05:35

services too having a tool that detects

play05:38

such spikes in network load plus tells

play05:42

you which service is responsible for

play05:43

causing it

play05:45

can give you timely alert to fix the

play05:47

issue

play05:48

and such automated monitoring and

play05:50

alerting is exactly what prometheus

play05:53

offers as a part of a modern devops

play05:56

workflow so how does prometheus actually

play06:00

work or how does its architecture

play06:02

actually look like

play06:04

at its core prometheus has the main

play06:06

component called prometheus server that

play06:08

does the actual monitoring work and is

play06:11

made up of three parts it has a time

play06:14

series database that stores all the

play06:17

metrics data like current cpu usage or

play06:20

number of exceptions in an application

play06:23

second it has a data retrieval worker

play06:27

that is responsible for getting or

play06:30

pulling those metrics from applications

play06:33

services

play06:35

servers and other target resources

play06:38

and

play06:39

storing them or pushing them into that

play06:41

database

play06:42

and third it has a web server or server

play06:45

api that accepts queries for that stored

play06:49

data and that web server component or

play06:51

the server api is used to display the

play06:54

data in a dashboard or ui either through

play06:57

prometheus dashboard or some other data

play07:00

visualization tool like grafana so the

play07:03

prometheus server monitors a particular

play07:06

thing and that thing could be anything

play07:08

it could be an entire linux server or

play07:10

windows server it could be a standalone

play07:13

apache server

play07:14

a single application or service like a

play07:17

database

play07:18

and those things that prometheus

play07:20

monitors are called targets and each

play07:22

target has units of monitoring for linux

play07:26

server target it could be a current cpu

play07:29

status its memory usage disk space usage

play07:33

etc for an application for example

play07:37

it could be number of exceptions number

play07:39

of requests or request duration and that

play07:42

unit that you would like to monitor for

play07:45

a specific target is called a metric and

play07:48

metrics are what gets saved into

play07:50

prometheus database component prometheus

play07:53

defines human readable text-based format

play07:56

for these metrics metric entries or data

play08:00

has type and help attributes to increase

play08:03

its readability so help is basically a

play08:05

description that just describe what the

play08:07

metrics is about and type is one of

play08:10

three metrics types

play08:12

for metrics about how many times

play08:15

something happened

play08:16

like number of exceptions that

play08:18

application had or number of requests it

play08:20

has received there is a counter type

play08:23

metric that can go both up and down

play08:26

is represented by a gauge example what

play08:30

is the current value of cpu usage now

play08:34

or what is the current capacity of disk

play08:37

space now or what is the number of

play08:40

concurrent requests at that given moment

play08:42

and for tracking how long something took

play08:45

or how big for example the size of a

play08:47

request was there is a histogram type

play08:51

so now the interesting question is how

play08:53

does prometheus actually collect those

play08:55

metrics from the targets

play08:57

prometheus pulls metrics data from the

play09:00

targets from an http endpoint which by

play09:03

default is host address slash metrics

play09:07

and for that to work one targets must

play09:10

expose that slash metrics endpoint and

play09:13

two data available at slash metrics

play09:16

endpoint must be in the format that

play09:18

prometheus understands and we saw that

play09:21

example metrics before

play09:23

some servers are already exposing

play09:25

prometheus endpoints so you don't need

play09:27

extra work to gather metrics from them

play09:30

but many services don't have native

play09:33

prometheus endpoints so extra component

play09:35

is required to do that

play09:37

and this component is exporter so

play09:39

exporter is basically a script or

play09:42

service that fetches metrics from your

play09:44

target and converts them in format

play09:47

prometheus understands and exposes this

play09:49

converted data at its own slash metrics

play09:52

endpoint

play09:53

where prometheus can scrape them and

play09:56

prometheus has a list of exporters for

play09:59

different services like mysql

play10:01

elasticsearch linux server build tools

play10:04

cloud platforms and so on i will put the

play10:08

link to prometheus official

play10:10

documentation and exporter list as

play10:12

well as its repository in the

play10:14

description so for example if you want

play10:15

to monitor a linux server you can

play10:18

download a node exporter tar file from

play10:20

prometheus repository you can untar and

play10:23

execute it and it will start converting

play10:26

the metrics of the server and making

play10:28

them scrapable at its own slash metrics

play10:31

endpoint and then you can go and

play10:33

configure prometheus to scrape that

play10:36

endpoint

play10:37

and these exporters are also available

play10:39

as docker images so for example if you

play10:42

want to monitor your mysql container in

play10:44

kubernetes cluster

play10:46

you can deploy a sidecar container of

play10:48

mysql exporter that will run inside the

play10:51

pod with mysql container connect to it

play10:54

and start translating mysql metrics for

play10:57

prometheus and making them available at

play11:00

its own slash metrics endpoint and again

play11:03

once you add mysql exporter endpoint to

play11:06

prometheus configuration prometheus will

play11:09

start collecting those metrics and

play11:11

saving them in its database what about

play11:13

monitoring your own applications let's

play11:15

say you want to see how many requests

play11:18

your application is getting at different

play11:20

times or how many exceptions are

play11:22

occurring how many server resources your

play11:25

application is using etc for this use

play11:28

case there are prometheus client

play11:30

libraries for different languages like

play11:32

node.js java etc using these libraries

play11:36

you can expose the slash metrics

play11:38

scraping endpoint in your application

play11:40

and provide different metrics that are

play11:42

relevant for you on that endpoint and

play11:45

this is a pretty convenient way for the

play11:47

infrastructure team to tell developers

play11:50

emit metrics that are relevant to you

play11:52

and will collect and monitor them in our

play11:55

infrastructure

play11:56

and i will also link the list of client

play11:58

libraries prometheus supports where you

play12:01

can see the documentation of how to use

play12:03

them

play12:06

so i mentioned that prometheus pulls

play12:08

this data from endpoints and that's

play12:10

actually an important characteristic of

play12:12

prometheus let's see why most monitoring

play12:15

systems like amazon cloud watch or new

play12:18

relief etc use a push system meaning

play12:22

applications and servers are responsible

play12:25

for pushing their metric data to a

play12:27

centralized collection platform of that

play12:30

monitoring tool so when you're working

play12:32

with many microservices and you have

play12:34

each service pushing their metrics to

play12:37

the monitoring system it creates a high

play12:39

load of traffic within your

play12:41

infrastructure and your monitoring can

play12:43

actually become your bottleneck so you

play12:45

have monitoring which is great but you

play12:47

pay the price of overloading your

play12:49

infrastructure with constant push

play12:51

requests from all the services and thus

play12:54

flooding the network plus you also have

play12:56

to install daemons on each of these

play12:59

targets to push the metrics to

play13:01

monitoring server while prometheus

play13:03

requires just a scraping endpoint and

play13:06

this way metrics can also be pulled by

play13:08

multiple prometheus instances and

play13:10

another advantage of that is using pull

play13:13

prometheus can easily detect whether

play13:15

service is up and running for example

play13:18

when it doesn't respond on the pull or

play13:19

when the endpoint isn't available while

play13:22

with push if the service doesn't push

play13:24

any data or send its health status it

play13:27

might have many reasons other than the

play13:29

service isn't running it could be that

play13:31

network isn't working the packet got

play13:33

lost on the way

play13:34

or some other problem so you don't

play13:36

really have an insight of what happened

play13:39

but there are limited number of cases

play13:41

where a target that needs to be

play13:43

monitored runs only for a short time so

play13:46

they aren't around long enough to be

play13:48

scraped example could be a batch job or

play13:52

scheduled job that say cleans up some

play13:54

old data or does backups etc for such

play13:58

jobs prometheus offers push gateway

play14:01

component so that these services can

play14:04

push their metrics directly to

play14:05

prometheus database but obviously using

play14:08

pushgateway to gather metrics in

play14:10

prometheus should be an exception

play14:12

because of the reasons i mentioned

play14:14

earlier

play14:15

so how does prometheus know what to

play14:17

scrape and when

play14:18

all that is configured in

play14:20

prometheus.yaml configuration file so

play14:23

you define which targets prometheus

play14:25

should scrape and at what interval

play14:28

prometheus then uses a service discovery

play14:30

mechanism to find those target endpoints

play14:33

when you first download and install

play14:35

prometheus you will see the sample

play14:37

config file with some default values in

play14:40

it here is an example we have

play14:42

global config that defines scrape

play14:45

interval or how often prometheus will

play14:47

scrape its targets and you can override

play14:49

these for individual targets

play14:51

the rule files block specifies the

play14:54

location of any rules we want prometheus

play14:56

server to load and the rules are

play14:58

basically either for aggregating metric

play15:01

values or creating alerts when some

play15:04

condition is met like cpu usage reached

play15:08

80 percent for example so prometheus

play15:10

uses rules to create new time series

play15:13

entries and to generate alerts and the

play15:16

evaluation interval option in global

play15:18

config defines how often prometheus will

play15:22

evaluate these rules in the last block

play15:25

scrape configs controls what resources

play15:28

prometheus monitors this is where you

play15:30

define the targets

play15:32

since prometheus has its own metrics

play15:35

endpoint to expose its own data it can

play15:37

monitor its own health so in this

play15:40

default configuration there is a single

play15:42

job

play15:43

called prometheus which scrapes the

play15:46

metrics exposed by the prometheus server

play15:49

so it has a single target at localhost

play15:51

9090 and prometheus expects metrics to

play15:55

be available on a target on a path of

play15:59

slash metrics which is a default path

play16:02

that is configured for that endpoint

play16:06

and here you can also define other

play16:07

endpoints to scrape through jobs so you

play16:10

can create another job and for example

play16:13

override the scrape interval from the

play16:15

global configuration and define the

play16:18

target host address

play16:21

so a couple of important points here so

play16:23

the first one is how does prometheus

play16:26

actually trigger the alerts that are

play16:28

defined by rules and who receives them

play16:31

prometheus has a component called alert

play16:34

manager

play16:35

that is responsible for firing alerts

play16:38

via

play16:39

different channels it could be email it

play16:41

could be a slack channel

play16:43

or some other notification client so

play16:45

prometheus server will read the alert

play16:47

rules and if the condition in the rules

play16:50

is met an alert gets fired through that

play16:53

configured channel and the second one is

play16:56

prometheus data storage

play16:58

where does prometheus store all this

play17:00

data that it collects

play17:03

and then aggregates

play17:04

and how can other systems access this

play17:06

data

play17:08

prometheus stores the metrics data on

play17:10

disk so it includes a local on disk time

play17:13

series database but also optionally

play17:16

integrates with remote storage system

play17:18

and the data is stored in a custom time

play17:21

series format and because of that you

play17:23

can't write prometheus data directly

play17:26

into a relational database for example

play17:28

so once you've collected the metrics

play17:30

prometheus also lets you query the

play17:32

metrics data on targets through its

play17:35

server api using promql query language

play17:41

you can use prometheus dashboard ui to

play17:43

ask the prometheus server via promql to

play17:46

for example show the status of a

play17:48

particular target right now or you can

play17:51

use more powerful data visualization

play17:54

tools like grafana

play17:56

to display the data which under the hood

play17:59

also uses promql to get the data out of

play18:02

prometheus and this is an example of a

play18:04

promql query which this one here

play18:07

basically queries all http status codes

play18:09

except the ones in 400 range and this

play18:12

one basically does some sub query on

play18:14

that for a period of 30 minutes and this

play18:17

is just to give you an example of what the

play18:20

query language looks like but with

play18:22

grafana instead of writing promql

play18:24

queries directly into the prometheus

play18:25

server um you basically have grafana ui

play18:29

where you can create dashboards that can

play18:32

then in the background use promql to

play18:35

query the data that you want to display

play18:38

now concerning promql the prometheus

play18:41

configuration in grafana ui i have to

play18:44

say from my personal experience that

play18:47

configuring the prometheus.yml file to

play18:50

scrape different targets and then

play18:52

creating all those dashboards

play18:54

to display meaningful

play18:57

data out of the scraped metrics can

play18:59

actually be pretty complex and it's also

play19:02

not very well documented

play19:04

so there is some steep learning curve to

play19:07

learning how to correctly configure

play19:08

prometheus and how to then query the

play19:11

collected metrics data to create

play19:13

dashboards

play19:14

so i will make a separate video where i

play19:17

configure prometheus to monitor

play19:19

kubernetes services

play19:20

to show some of the practical examples

play19:23

and the final point is an important

play19:25

characteristic of prometheus

play19:28

that it is designed to be reliable

play19:30

even when other systems have an outage

play19:33

so that you can diagnose the problems

play19:35

and fix them so each prometheus server

play19:37

is standalone and self-contained

play19:40

meaning it doesn't depend on network

play19:41

storage or other remote services it's

play19:44

meant to work when other parts of the

play19:46

infrastructure are broken and you don't

play19:49

need to set up extensive infrastructure

play19:51

to use it which of course is a great

play19:54

thing however it also has disadvantage

play19:57

that prometheus can be difficult to

play19:59

scale so when you have hundreds of

play20:01

servers you might want to have multiple

play20:04

prometheus servers that somewhere

play20:06

aggregate all this metrics data and

play20:08

configuring that and scaling prometheus

play20:11

in that way can actually be very

play20:13

difficult because of this characteristic

play20:15

so while using a single node is less

play20:17

complex and you can get started very

play20:19

easily it puts a limit on the number of

play20:21

metrics that can be monitored by

play20:23

prometheus so to work around that you

play20:25

either increase the capacity of the

play20:28

prometheus server so it can store more

play20:30

metrics data or you limit the number of

play20:33

metrics that prometheus collects from

play20:36

the applications to keep it down to only

play20:38

the relevant ones

play20:40

and finally in terms of prometheus with

play20:43

docker and kubernetes as i mentioned

play20:45

throughout the video with different

play20:47

examples prometheus is fully compatible

play20:50

with both and prometheus components are

play20:52

available as docker images and therefore

play20:55

can easily be deployed in kubernetes or

play20:58

other container environments

play21:00

and it integrates great with kubernetes

play21:02

infrastructure providing cluster node

play21:05

resource monitoring out of the box which

play21:07

means once it's deployed on kubernetes

play21:10

it starts gathering metrics data on each

play21:13

kubernetes node server without any extra

play21:15

configuration and i will make a separate

play21:18

video on how to deploy and configure

play21:20

prometheus to monitor your kubernetes

play21:22

cluster so subscribe to my channel click

play21:25

that notification bell and you will be

play21:27

notified when the new video is out


Related Tags
Prometheus Monitoring, Container Environments, Kubernetes, Docker Swarm, DevOps Tools, Infrastructure Monitoring, Microservices, Alert Management, Data Visualization, Prometheus Configuration