# Prometheus Cheatsheets

## Basics

• `Counter`: A counter metric can only go up (it resets to zero when the process restarts)
• `Gauge`: A gauge metric can increase or decrease
• `Histogram`: A histogram samples observations into configurable buckets; the per-bucket counts and the sum can only increase
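
As a rough sketch (the metric names here are illustrative), each type is typically queried like this:

```
# Counter: query the per-second rate, not the raw value
rate(http_requests_total[5m])

# Gauge: the raw value is meaningful on its own
node_memory_MemAvailable_bytes

# Histogram: estimate quantiles from the cumulative buckets
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
```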

Query Functions:

• `rate` - Calculates the per-second rate at which the counter increases, averaged over a given time window.
• `irate` - Also calculates the per-second rate at which the counter increases, but only looks at the last two data points in the window. This makes `irate` well suited for graphing volatile and/or fast-moving counters.
• `increase` - Calculates the total counter increase over a given time window.
• `resets` - Gives the number of counter resets over a given time window.
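
For example, applied to a counter such as `http_requests_total` (illustrative name):

```
rate(http_requests_total[5m])      # average requests per second over the last 5m
irate(http_requests_total[5m])     # per-second rate based on the last two samples
increase(http_requests_total[1h])  # total increase over the last hour
resets(http_requests_total[1h])    # number of counter resets (e.g. restarts) in the last hour
```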

## Curated Examples

Example queries per exporter / service:

How can I get the number of requests over a given time range (the dashboard's time range):

```
sum by (uri) (increase(http_requests_total[$__range]))
```

How many pod restarts over the last 15 minutes?

```
rate(kube_pod_container_status_restarts_total{job="kube-state-metrics",namespace="default"}[15m]) * 60 * 15
```
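
Roughly the same, expressed directly with `increase`:

```
increase(kube_pod_container_status_restarts_total{job="kube-state-metrics", namespace="default"}[15m])
```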

## Example Queries

How many nodes are up?

```
up
```
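
`up` returns one series per target, with value 1 (up) or 0 (down). To get an actual count of healthy targets:

```
count(up == 1)
```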

Combining values from 2 different vectors (Hostname with a Metric):

```
up * on(instance) group_left(nodename) (node_uname_info)
```

Exclude labels:

```
sum without(job) (up * on(instance) group_left(nodename) (node_uname_info))
```

Count targets per job:

```
count by (job) (up)
```

Amount of Memory Available:

```
node_memory_MemAvailable_bytes
```

Amount of Memory Available in MiB:

```
node_memory_MemAvailable_bytes/1024/1024
```

Amount of Memory Available in MiB 10 minutes ago:

```
node_memory_MemAvailable_bytes/1024/1024 offset 10m
```

Average Memory Available for Last 5 Minutes:

```
avg_over_time(node_memory_MemAvailable_bytes[5m])/1024/1024
```

Memory Usage in Percent:

```
100 * (1 - ((avg_over_time(node_memory_MemFree_bytes[10m]) + avg_over_time(node_memory_Cached_bytes[10m]) + avg_over_time(node_memory_Buffers_bytes[10m])) / avg_over_time(node_memory_MemTotal_bytes[10m])))
```
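
On kernels that expose `MemAvailable` (Linux 3.14+), a simpler sketch of the same idea:

```
100 * (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes))
```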

CPU Utilization:

```
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle", instance="my-instance"}[5m])) * 100)
```

CPU Utilization, offset by 24 hours:

```
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle", instance="my-instance"}[5m] offset 24h)) * 100)
```

CPU Utilization per Core:

```
( (1 - rate(node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}[$__interval])) / ignoring(cpu) group_left count without (cpu)( node_cpu_seconds_total{job="node-exporter", mode="idle", instance="$instance"}) )
```

CPU Utilization by Node:

```
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[10m]) * 100) * on(instance) group_left(nodename) (node_uname_info))
```

Memory Available by Node:

```
node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)
```

Or, if you need to join in labels from another metric:

```
(node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"} - node_memory_Buffers_bytes{job="node-exporter"} - node_memory_Cached_bytes{job="node-exporter"}) * on(instance) group_left(nodename) (node_uname_info{nodename=~"$nodename"})
```

Load Average in percent (relative to the number of CPUs):

```
avg(node_load1{instance=~"$name", job=~"$job"}) / count(count(node_cpu_seconds_total{instance=~"$name", job=~"$job"}) by (cpu)) * 100
```

```
sum(node_load5{}) by (instance) / count(node_cpu_seconds_total{mode="user"}) by (instance) * 100
```

Load Average, averaged per instance_id (useful when two series share the same `instance` label but differ in `instance_id`):

```
avg by (instance_id, instance) (node_load1{job=~"node-exporter", aws_environment="dev", instance="debug-dev"})
# {instance="debug-dev",instance_id="i-aaaaaaaaaaaaaaaaa"}
# {instance="debug-dev",instance_id="i-bbbbbbbbbbbbbbbbb"}
```

Disk Available by Node:

```
node_filesystem_free_bytes{mountpoint="/"} * on(instance) group_left(nodename) (node_uname_info)
```

Disk IO per Node (reads):

```
sum(rate(node_disk_read_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
```

Disk IO per Node (writes):

```
sum(rate(node_disk_written_bytes_total{job="node"}[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
```

Network IO per Node:

```
sum(rate(node_network_receive_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
sum(rate(node_network_transmit_bytes_total[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info)
```

Process Restarts:

```
changes(process_start_time_seconds{job=~".+"}[15m])
```

Container Cycling:

```
(time() - container_start_time_seconds{job=~".+"}) < 60
```

Histogram (the 1.0 quantile gives the maximum; `* 1e3` converts seconds to milliseconds):

```
histogram_quantile(1.00, sum(rate(prometheus_http_request_duration_seconds_bucket[5m])) by (handler, le)) * 1e3
```
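
The more common case is an estimated percentile, e.g. the p95 latency per handler:

```
histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[5m])) by (handler, le))
```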

Metrics 24 hours ago (handy when comparing today with yesterday):

```
# query a
total_number_of_errors{instance="my-instance", region="eu-west-1"}
# query b
total_number_of_errors{instance="my-instance", region="eu-west-1"} offset 24h
```
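
To plot the day-over-day difference directly in one panel, the two can be subtracted (the labels match one-to-one):

```
total_number_of_errors{instance="my-instance", region="eu-west-1"} - total_number_of_errors{instance="my-instance", region="eu-west-1"} offset 24h
```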

Number of Nodes (Up):

```
count(up{job="cadvisor_my-swarm"})
```

Running Containers per Node:

```
count(container_last_seen) BY (container_label_com_docker_swarm_node_id)
```

Running Containers per Node, include corresponding hostnames:

```
count(container_last_seen) BY (container_label_com_docker_swarm_node_id) * ON (container_label_com_docker_swarm_node_id) GROUP_LEFT(node_name) node_meta
```

HAProxy Response Codes:

```
haproxy_server_http_responses_total{backend=~"$backend", server=~"$server", code=~"$code", alias=~"$alias"} > 0
```

Metrics with the most time series:

```
topk(10, count by (__name__)({__name__=~".+"}))
```

the same, but per job:

```
topk(10, count by (__name__, job)({__name__=~".+"}))
```

or which jobs have the most time series:

```
topk(10, count by (job)({__name__=~".+"}))
```

Top 5 per value:

```
sort_desc(topk(5, aws_service_costs))
```

Table - Top 5 (enable the Instant option as well):

```
sort(topk(5, aws_service_costs))
```

Most metrics per job, sorted:

```
sort_desc(sum by (job) (count by (__name__, job)({job=~".+"})))
```

Group per Day (Table) - wip

```
aws_service_costs{service=~"$service"} + ignoring(year, month, day) group_right
count_values without() ("year", year(timestamp(
count_values without() ("month", month(timestamp(
count_values without() ("day", day_of_month(timestamp(
aws_service_costs{service=~"$service"}
)))
)))
))) * 0
```

Group Metrics per node hostname:

```
node_memory_MemAvailable_bytes * on(instance) group_left(nodename) (node_uname_info)

# ..
# {cloud_provider="amazon",instance="x.x.x.x:9100",job="node_n1",my_hostname="n1.x.x",nodename="n1.x.x"}
```

Subtract two gauge metrics (ignoring the label that doesn't match):

```
polkadot_block_height{instance="polkadot", chain=~"$chain", status="sync_target"} - ignoring(status) polkadot_block_height{instance="polkadot", chain=~"$chain", status="finalized"}
```

Container CPU Average for 5m:

```
(sum by(instance, container_label_com_amazonaws_ecs_container_name, container_label_com_amazonaws_ecs_cluster) (rate(container_cpu_usage_seconds_total[5m])) * 100)
```

Container Memory Usage: Total:

```
sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"})
```

The same, broken down per task and node:

```
sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_task_name, container_label_com_docker_swarm_node_id)
```

Container Memory per Node:

```
sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_swarm_node_id)
```

Memory Usage per Stack:

```
sum(container_memory_rss{container_label_com_docker_swarm_task_name=~".+"}) BY (container_label_com_docker_stack_namespace)
```

Remove metrics from results that do not contain a specific label:

```
container_cpu_usage_seconds_total{container_label_com_amazonaws_ecs_cluster!=""}
```

Remove labels from a metric:

```
sum without (age, country) (people_metrics)
```

View top 10 biggest metrics by name:

```
topk(10, count by (__name__)({__name__=~".+"}))
```

View top 10 biggest metrics by name, job:

```
topk(10, count by (__name__, job)({__name__=~".+"}))
```

View all metrics for a specific job:

``````{__name__=~".+", job="node-exporter"}
``````

View all metrics for more than one job using vector selectors:

``````{__name__=~".+", job=~"traefik|cadvisor|prometheus"}
``````

Website uptime with blackbox-exporter:

```
# https://www.robustperception.io/what-percentage-of-time-is-my-service-down-for
avg_over_time(probe_success{job="node"}[15m]) * 100
```
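
The same idea over a longer window gives an availability percentage, e.g. for the last 30 days:

```
avg_over_time(probe_success{job="node"}[30d]) * 100
```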

Remove / Replace:

• https://medium.com/@texasdave2/replace-and-remove-a-label-in-a-prometheus-query-9500faa302f0

Client Request Counts:

```
irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])
```

Client Response Time:

```
irate(http_client_requests_seconds_sum{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m]) /
irate(http_client_requests_seconds_count{job="web-metrics", environment="dev", uri!~".*actuator.*"}[5m])
```

Requests per Second:

```
sum(increase(http_server_requests_seconds_count{service="my-service", env="dev"}[1m])) by (uri)
```

is the same as:

```
sum(rate(http_server_requests_seconds_count{service="my-service", env="dev"}[1m]) * 60 ) by (uri)
```


## Scrape Config

relabel configs:

```
# full example: https://gist.github.com/ruanbekker/72216bea59fc56af189f5a7b2e3a8002
scrape_configs:
  - job_name: 'multipass-nodes'
    static_configs:
      - targets: ['ip-192-168-64-29.multipass:9100']
        labels:
          env: test
      - targets: ['ip-192-168-64-30.multipass:9100']
        labels:
          env: test
    # https://grafana.com/blog/2022/03/21/how-relabeling-in-prometheus-works/#internal-labels
    relabel_configs:
      # keep only the host part of 'host:port' as the instance label
      # (source_labels added here; the original snippet omitted it)
      - source_labels: [__address__]
        separator: ':'
        regex: '(.*):(.*)'
        replacement: '${1}'
        target_label: instance
```

static_configs:

```
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
        labels:
          region: 'eu-west-1'
```

dns_sd_configs:

```
scrape_configs:
  - job_name: 'mysql-exporter'
    scrape_interval: 5s
    dns_sd_configs:
      - names:
          # list the DNS A-record name(s) of your exporter(s) here
        type: 'A'
        port: 9104
    relabel_configs:
      - regex: '.*'
        target_label: instance
        replacement: 'mysqld-exporter'
```

• https://gist.github.com/ruanbekker/72216bea59fc56af189f5a7b2e3a8002
• https://gist.github.com/trastle/1aa205354577ef0b329d4b8cc84c674a
• https://github.com/prometheus/docs/issues/341
• https://medium.com/quiq-blog/prometheus-relabeling-tricks-6ae62c56cbda
• https://blog.freshtracks.io/prometheus-relabel-rules-and-the-action-parameter-39c71959354a
• https://www.robustperception.io/relabel_configs-vs-metric_relabel_configs
• https://training.robustperception.io/courses/prometheus-configuration/lectures/3170347

## Grafana with Prometheus

If you have output like this in Grafana:

``````{instance="10.0.2.66:9100",job="node",nodename="rpi-02"}
``````

and you only want to show the hostnames, you can apply the following in the "Legend" input:

```
{{nodename}}
```

If your output has an `exported_instance` label that you want in the legend, as in:

```
sum(exporter_memory_usage{exported_instance="myapp"})
```

You would need to do:

```
sum by (exported_instance) (exporter_memory_usage{exported_instance="myapp"})
```

Then on Legend:

```
{{exported_instance}}
```

### Variables

• Hostname:

name: `node` label: `node` query: `label_values(node_uname_info, nodename)`

Then in Grafana you can use:

```
sum(rate(node_disk_read_bytes_total{job="node"}[1m])) by (device, instance) * on(instance) group_left(nodename) (node_uname_info{nodename=~"$node"})
```

type: `query` query: `label_values(node_network_up, instance)`

type: `query` query: `label_values(mysql_up, instance)`

• Static Values:

type: `custom` name: `dc` label: `dc` values separated by comma: `eu-west-1a,eu-west-1b,eu-west-1c`
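
The variable can then be used in a query, assuming your metrics carry a matching `dc` label (sketch):

```
node_load1{dc=~"$dc"}
```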

• Docker Swarm Stack Names

name: `stack` label: `stack` query: `label_values(container_last_seen,container_label_com_docker_stack_namespace)`

• Docker Swarm Service Names

name: `service_name` label: `service_name` query: `label_values(container_last_seen,container_label_com_docker_swarm_service_name)`

• Docker Swarm Manager NodeId:

name: `manager_node_id` label: `manager_node_id` query:

```
label_values(container_last_seen{container_label_com_docker_swarm_service_name=~"proxy_traefik", container_label_com_docker_swarm_node_id=~".*"}, container_label_com_docker_swarm_node_id)
```

• Docker Swarm Stacks Running on Managers

name: `stack_on_manager` label: `stack_on_manager` query:

```
label_values(container_last_seen{container_label_com_docker_swarm_node_id=~"$manager_node_id"},container_label_com_docker_stack_namespace)
```