PCP Measurements

Based on grafana/checklist.json.

CPU

CPU (main)

The speed of the CPU is limiting performance

Threshold: processors 85% busy

Compute kernel.percpu.cpu.util.all with:
1 - rate(kernel.percpu.cpu.idle)
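
As a rough illustration of what the expression above evaluates to, the sketch below computes the busy fraction of one CPU from two samples of its idle counter. It assumes the counter holds cumulative milliseconds of idle time and that the caller supplies the sampling interval in seconds; the function and constant names are hypothetical and are not part of PCP.

```python
BUSY_THRESHOLD = 0.85  # "processors 85% busy" from the checklist


def cpu_busy_fraction(idle_ms_prev, idle_ms_curr, elapsed_s):
    """Return the busy fraction of one CPU between two samples.

    Assumption: the idle counter is cumulative milliseconds of idle time,
    and elapsed_s is the wall-clock time between the two samples.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    # rate(idle): idle milliseconds per second, scaled to a unitless fraction
    idle_fraction = (idle_ms_curr - idle_ms_prev) / 1000.0 / elapsed_s
    return 1.0 - idle_fraction


def cpu_is_limiting(busy_fraction):
    """True when the busy fraction reaches the checklist threshold."""
    return busy_fraction >= BUSY_THRESHOLD
```

For example, a CPU that accumulated 100 ms of idle time over a 1 second interval would be about 90% busy and would cross the threshold.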

References:

User time

The CPU is executing application code

Threshold: processors 80% in application code

Compute kernel.percpu.cpu.util.user with:
rate(kernel.percpu.cpu.user)

References:

System time

The CPU is executing system code

Threshold: processors 20% in kernel

Compute kernel.percpu.cpu.util.sys with:
rate(kernel.percpu.cpu.sys)

References:

Kernel samepage merging daemon (ksmd)

The kernel samepage merging daemon (ksmd) is using too much CPU time

Compute hotproc.psinfo.ksmd.util with:
rate(hotproc.psinfo.utime)+rate(hotproc.psinfo.stime)

References:

Note: Only one hotproc predicate can be active at a time. For this to work, hotproc.control.config would need to be set to '(fname=="ksmd" && cpuburn > 0.10)'.


Storage

Storage (main)

Excessive waiting for storage

Threshold: disk 85% busy

Compute diskbusy with:
rate(disk.dm.avactive)

References:

Bandwidth

Saturating bandwidth of storage

Threshold: > 2.5GB/s

Compute disk.dm.bw with:
rate(disk.dm.total)

Note: There does not seem to be a way to query a storage device for its maximum read/write bandwidth, so the checklist cannot have a predicate for this at the moment.

Small blocks

Storage operations are excessively small

Threshold: < 0.5 Kbytes per iop

Compute disk.dm.avgsz with:
delta(disk.dm.total_bytes)/delta(disk.dm.total)

Note: The computation of avgiosz does not seem to be quite right, and it is dropped from the display at times.
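
The average operation size is the ratio of two deltas, so any implementation has to decide what to do when no operations completed during the interval. The sketch below is one way to handle that case, assuming the byte counter is reported in Kbytes (to match the threshold units); the names are hypothetical, and returning None for an idle interval is a choice made for this example, not something the checklist specifies.

```python
SMALL_IO_KBYTES = 0.5  # "< 0.5 Kbytes per iop" from the checklist


def average_io_size_kbytes(kb_prev, kb_curr, ops_prev, ops_curr):
    """Return the average Kbytes per operation between two samples.

    Assumption: both counters are cumulative and the byte counter is
    already in Kbytes. Returns None when no operations completed, since
    the ratio is undefined for an idle interval.
    """
    ops = ops_curr - ops_prev
    if ops <= 0:
        return None
    return (kb_curr - kb_prev) / ops


def is_small_block(avg_kbytes):
    """True when a defined average falls below the checklist threshold."""
    return avg_kbytes is not None and avg_kbytes < SMALL_IO_KBYTES
```

With 40 Kbytes transferred over 100 operations the average is 0.4 Kbytes per operation, which is below the threshold.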


Memory

Memory (main)

Running low on available memory

Threshold: < 10% available memory

Compute mem.ratio.available with:
mem.util.available/mem.physmem
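
A minimal sketch of the same ratio and threshold check, assuming both values are instantaneous readings in the same units (Kbytes here); the function and constant names are hypothetical.

```python
LOW_MEMORY_RATIO = 0.10  # "< 10% available memory" from the checklist


def available_ratio(available_kb, physmem_kb):
    """Return available memory as a fraction of physical memory."""
    if physmem_kb <= 0:
        raise ValueError("physmem_kb must be positive")
    return available_kb / physmem_kb


def is_low_on_memory(ratio):
    """True when the available fraction is below the checklist threshold."""
    return ratio < LOW_MEMORY_RATIO
```

For instance, 1024 Kbytes available out of 16384 Kbytes of physical memory gives a ratio of 0.0625, which is below the 10% threshold.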

References:

Swapping

Not enough physical memory and data being moved out to swap space

Threshold: swapping occurring

Compute swaps with:
rate(swap.pagesout)

References:

Huge page defragmentation

The system is spending large amounts of time grouping small pages of memory together into contiguous physical regions of memory

Monitor PCP metrics:
  • mem.vmstat.thp_collapse_alloc
  • mem.vmstat.thp_fault_alloc
  • mem.vmstat.thp_fault_fallback

References:

Huge page fragmentation

The system is splitting large regions of memory (Huge pages) into small pages

Monitor PCP metrics:
  • mem.vmstat.thp_split

References:

Note: RHEL 7 has mem.vmstat.thp_split (thp_split is available in /proc/vmstat), but on Fedora 25 /proc/vmstat has thp_split_page and thp_split_pmd, which do not match up with PCP's mem.vmstat.thp_split.


Network TX

Network TX (main)

Amount of network traffic sent

Threshold: network tx bandwidth

Compute network_tx_bandwidth with:
rate(network.interface.out.bytes)/network.interface.baudrate
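
The expression above divides a byte rate by the link speed to get a utilization fraction. The sketch below shows that arithmetic for one interface, assuming the link speed is expressed in bytes per second so that the units cancel; the names are hypothetical, and returning None when the link speed is not positive is a choice made for this example.

```python
def link_utilization(bytes_prev, bytes_curr, elapsed_s, link_bytes_per_s):
    """Return the fraction of link capacity used between two samples.

    Assumption: the byte counter is cumulative and link_bytes_per_s is
    the interface speed in bytes per second. Returns None when the
    interval or the link speed is not positive, since the ratio is then
    undefined.
    """
    if elapsed_s <= 0 or link_bytes_per_s <= 0:
        return None
    byte_rate = (bytes_curr - bytes_prev) / elapsed_s
    return byte_rate / link_bytes_per_s
```

Sending 62,500,000 bytes in one second over a link rated at 125,000,000 bytes per second gives a utilization of 0.5. The same function applies to the receive direction with the corresponding in.bytes counter.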

Saturation

Network packets being dropped

Threshold: network tx drops

Compute network_tx_drops with:
rate(network.interface.out.drops)

Note: The URL mentions comparing the current ring buffer size to the maximum allowed and increasing the ring buffer size, but PCP does not have metrics that provide ring buffer information. A 1% packet drop threshold might be too high.

Errors

Show network errors

Threshold: network tx errors

Compute network_tx_errors with:
rate(network.interface.out.errors)

References:


Network RX

Network RX (main)

Amount of network traffic received

Threshold: network rx bandwidth

Compute network_rx_bandwidth with:
rate(network.interface.in.bytes)/network.interface.baudrate

Saturation

Network packets being dropped

Threshold: network rx drops

Compute network_rx_drops with:
rate(network.interface.in.drops)

Note: The URL mentions comparing the current ring buffer size to the maximum allowed and increasing the ring buffer size, but PCP does not have metrics that provide ring buffer information. A 1% packet drop threshold might be too high.

Errors

Show network errors

Threshold: network rx errors

Compute network_rx_errors with:
rate(network.interface.in.errors)

References:

RX queue too small

The per-CPU RX queues are filled to capacity and some RX packets are being dropped as a result

Threshold: rx cpu queue drop

Compute rxcpuqdropped with:
rate(network.softnet.dropped)

References:

RX packet processing exceeding time quota

The RX packet processing function had more work remaining when it ran out of time

Threshold: exceeded rx time budget

Compute time_squeeze with:
rate(network.softnet.time_squeeze)

References: