PCP Measurements
Based on grafana/checklist.json.
CPU
CPU (main)
The speed of the CPU is limiting performance
Threshold: processors 85% busy
Compute kernel.percpu.cpu.util.all with: 1 - rate(kernel.percpu.cpu.idle)
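As a rough illustration of the expression above, the following Python sketch derives a busy fraction for one CPU from two samples of the idle counter and compares it with the 85% threshold. It assumes the counter holds cumulative idle time in milliseconds. The function name and sample values are hypothetical and not part of PCP.

```python
def cpu_busy_fraction(idle_ms_prev, idle_ms_curr, interval_s):
    """Return 1 - rate(idle) for one CPU as a fraction between 0 and 1.

    Assumes the idle counter is cumulative milliseconds (an assumption).
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    idle_fraction = (idle_ms_curr - idle_ms_prev) / (interval_s * 1000.0)
    return 1.0 - idle_fraction


# Hypothetical samples: the CPU was idle for 100 ms of a 1 s interval.
busy = cpu_busy_fraction(idle_ms_prev=10_000, idle_ms_curr=10_100, interval_s=1.0)
over_threshold = busy > 0.85  # 85% threshold from the checklist
```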
References:
User time
The CPU is executing application code
Threshold: processors 80% in application code
Compute kernel.percpu.cpu.util.user with: rate(kernel.percpu.cpu.user)
References:
System time
The CPU is executing system code
Threshold: processors 20% in kernel
Compute kernel.percpu.cpu.util.sys with: rate(kernel.percpu.cpu.sys)
References:
Kernel samepage merging daemon (ksmd)
Kernel SamePage Merging Daemon (ksmd) using too much time
Compute hotproc.psinfo.ksmd.util with: rate(hotproc.psinfo.utime) + rate(hotproc.psinfo.stime)
References:
Note: Only one hotproc predicate can be active at a time. For this to work, hotproc.control.config would need to be set to '(fname=="ksmd" && cpuburn > 0.10)'.
Storage
Storage (main)
Threshold: disk 85% busy
Compute diskbusy with: rate(disk.dm.avactive)
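The busy check can be sketched per device in Python as follows. This is only an illustration: it assumes disk.dm.avactive is a cumulative count of active milliseconds per device, and the device names, sample values, and function name are hypothetical.

```python
def busy_devices(prev_ms, curr_ms, interval_s, threshold=0.85):
    """Return {device: busy_fraction} for devices above the threshold.

    prev_ms and curr_ms map device names to cumulative active time in
    milliseconds (an assumption about the counter's units).
    """
    interval_ms = interval_s * 1000.0
    result = {}
    for dev, curr in curr_ms.items():
        if dev not in prev_ms:
            continue  # device appeared between samples, so no rate yet
        fraction = (curr - prev_ms[dev]) / interval_ms
        if fraction > threshold:
            result[dev] = fraction
    return result


# Hypothetical samples taken 1 s apart.
prev = {"dm-0": 5_000, "dm-1": 8_000}
curr = {"dm-0": 5_900, "dm-1": 8_200}
hot = busy_devices(prev, curr, interval_s=1.0)
```

With these sample values only dm-0 is reported, since it was active for 900 ms of the 1 s interval.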
References:
Bandwidth
Saturating bandwidth of storage
Threshold: > 2.5GB/s
Compute disk.dm.bw with: rate(disk.dm.total)
Note: There does not seem to be a way to query storage devices for their maximum read/write bandwidth, so the checklist cannot have a predicate for this at the moment.
Small blocks
Excessively small sized operations for storage
Threshold: < 0.5 Kbytes per iop
Compute disk.dm.avgsz with: delta(disk.dm.total_bytes)/delta(disk.dm.total)
Note: The computation of avgiosz does not seem to be quite right, and it is dropped from the display at times.
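A minimal Python sketch of the ratio above, assuming disk.dm.total_bytes is reported in Kbytes so that the result is directly comparable with the 0.5 Kbytes threshold. The function name and sample values are hypothetical. The sketch returns None for an interval with no completed operations, where the ratio is undefined.

```python
def avg_io_size_kb(kb_prev, kb_curr, ops_prev, ops_curr):
    """Return delta(Kbytes) / delta(operations), or None if no operations ran."""
    delta_ops = ops_curr - ops_prev
    if delta_ops <= 0:
        return None  # no I/O in the interval (or a counter reset)
    return (kb_curr - kb_prev) / delta_ops


# Hypothetical samples: 40 Kbytes transferred by 100 operations.
size = avg_io_size_kb(kb_prev=1_000, kb_curr=1_040, ops_prev=500, ops_curr=600)
too_small = size is not None and size < 0.5  # threshold from the checklist
```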
Memory
Memory (main)
Running low on available memory
Threshold: < 10% available memory
Compute mem.ratio.available with: mem.util.available/mem.physmem
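The ratio and its threshold can be sketched in Python as below. Both inputs are assumed to use the same unit (Kbytes here); the function name and sample values are hypothetical.

```python
def available_ratio(available_kb, physmem_kb):
    """Return mem.util.available / mem.physmem as a fraction."""
    if physmem_kb <= 0:
        raise ValueError("physmem_kb must be positive")
    return available_kb / physmem_kb


# Hypothetical host: 1.2 GB available out of 16 GB of physical memory.
ratio = available_ratio(available_kb=1_200_000, physmem_kb=16_000_000)
low_memory = ratio < 0.10  # threshold from the checklist
```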
References:
Swapping
Not enough physical memory and data being moved out to swap space
Threshold: swapping occurring
Compute swaps with: rate(swap.pagesout)
References:
Huge page defragmentation
The system is spending large amounts of time grouping small pages of memory together into contiguous physical regions of memory
Monitor PCP metrics:
- mem.vmstat.thp_collapse_alloc
- mem.vmstat.thp_fault_alloc
- mem.vmstat.thp_fault_fallback
References:
Huge page fragmentation
The system is splitting large regions of memory (Huge pages) into small pages
Monitor PCP metrics:
- mem.vmstat.thp_split
References:
Note: RHEL 7 has thp_split in /proc/vmstat, so mem.vmstat.thp_split is available there. On Fedora 25, however, /proc/vmstat has thp_split_page and thp_split_pmd, which do not match up with PCP's mem.vmstat.thp_split.
Network TX
Network TX (main)
Amount of network traffic sent
Threshold: network tx bandwidth
Compute network_tx_bandwidth with: rate(network.interface.out.bytes)/network.interface.baudrate
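A hedged Python sketch of this utilization ratio follows. It assumes network.interface.baudrate is expressed in bytes per second, so that dividing a byte rate by it yields a fraction of link capacity. The function name and sample values are hypothetical. Interfaces that report a baudrate of zero are skipped by returning None.

```python
def tx_utilization(bytes_prev, bytes_curr, interval_s, baudrate_bytes_per_s):
    """Return the fraction of link capacity used for transmit, or None.

    None is returned when the interval or the reported baudrate is not
    positive, since the ratio is undefined in those cases.
    """
    if interval_s <= 0 or baudrate_bytes_per_s <= 0:
        return None
    byte_rate = (bytes_curr - bytes_prev) / interval_s
    return byte_rate / baudrate_bytes_per_s


# Hypothetical 1 Gbit/s link (125,000,000 bytes/s) sending 62.5 MB in 1 s.
util = tx_utilization(0, 62_500_000, interval_s=1.0,
                      baudrate_bytes_per_s=125_000_000)
```

The same calculation applies to the receive side with network.interface.in.bytes.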
Saturation
Threshold: network tx drops
Compute network_tx_drops with: rate(network.interface.out.drops)
Note: The URL mentions comparing the current ring buffer size to the maximum allowed and increasing the ring buffer size, but PCP doesn't have metrics that provide ring buffer information. A 1% packet drop threshold might be too high.
Errors
Threshold: network tx errors
Compute network_tx_errors with: rate(network.interface.out.errors)
References:
Network RX
Network RX (main)
Amount of network traffic received
Threshold: network rx bandwidth
Compute network_rx_bandwidth with: rate(network.interface.in.bytes)/network.interface.baudrate
Saturation
Threshold: network rx drops
Compute network_rx_drops with: rate(network.interface.in.drops)
Note: The URL mentions comparing the current ring buffer size to the maximum allowed and increasing the ring buffer size, but PCP doesn't have metrics that provide ring buffer information. A 1% packet drop threshold might be too high.
Errors
Threshold: network rx errors
Compute network_rx_errors with: rate(network.interface.in.errors)
References:
RX queue too small
Per-CPU RX queues are filled to capacity and some RX packets are being dropped as a result
Threshold: rx cpu queue drop
Compute rxcpuqdropped with: rate(network.softnet.dropped)
References:
RX packet processing exceeding time quota
The RX packet processing function had more work remaining when it ran out of time
Threshold: exceeded rx time budget
Compute time_squeeze with: rate(network.softnet.time_squeeze)
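Both softnet checks reduce to taking the rate of an event counter and flagging any non-zero value. The Python sketch below illustrates this, assuming that "any increase" is the intended threshold. The helper name and sample values are hypothetical.

```python
def counter_rate(prev, curr, interval_s):
    """Per-second rate of an increasing counter, or None after a reset."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if curr < prev:
        return None  # counter went backwards, e.g. after a reboot
    return (curr - prev) / interval_s


# Hypothetical samples taken 10 s apart.
rates = {
    "dropped": counter_rate(120, 120, 10.0),      # no new drops
    "time_squeeze": counter_rate(40, 65, 10.0),   # 25 events in 10 s
}
# Flag every counter that increased during the interval.
alerts = {name for name, rate in rates.items() if rate is not None and rate > 0}
```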
References: