PCP Measurements

CPU

CPU (main)

The speed of the CPU is limiting performance

Threshold: processors 85% busy

Compute kernel.percpu.cpu.util.all with:

1 - rate(kernel.percpu.cpu.idle)

References:

https://access.redhat.com/articles/767563#cpu
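
As a rough illustration of the arithmetic behind the CPU expression above, the sketch below rate-converts two samples of a cumulative idle-time counter by hand. It assumes the counter is in milliseconds and that each sample is a (timestamp in seconds, counter value) pair. The function names and sample values are hypothetical and are not part of PCP.

```python
def rate(prev, curr):
    """Per-second rate of change between two (timestamp_s, counter) samples."""
    (t0, v0), (t1, v1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (v1 - v0) / elapsed


def cpu_busy_fraction(prev_idle_ms, curr_idle_ms):
    """1 - rate(idle), with idle time converted from ms to seconds (assumed unit)."""
    idle_fraction = rate(prev_idle_ms, curr_idle_ms) / 1000.0
    return 1.0 - idle_fraction


# Hypothetical samples taken 10 s apart; the CPU was idle for 1200 ms of them.
busy = cpu_busy_fraction((100.0, 500_000), (110.0, 501_200))
exceeds_threshold = busy > 0.85  # 85% busy threshold from the text
```

The same rate() helper applies unchanged to the user and system time counters, compared against their own thresholds.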

User time

The CPU is executing application code

Threshold: processors 80% in application code

Compute kernel.percpu.cpu.util.user with:

rate(kernel.percpu.cpu.user)

References:

https://access.redhat.com/articles/767563#cpu

System time

The CPU is executing system code

Threshold: processors 20% in kernel

Compute kernel.percpu.cpu.util.sys with:

rate(kernel.percpu.cpu.sys)

References:

https://access.redhat.com/articles/767563#cpu

Kernel samepage merging daemon (ksmd)

The Kernel SamePage Merging daemon (ksmd) is using too much CPU time

Compute hotproc.psinfo.ksmd.util with:

rate(hotproc.psinfo.utime)+rate(hotproc.psinfo.stime)

References:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Virtualization_Tuning_and_Optimization_Guide/index.html

Note: There can only be one hotproc predicate at a time. For this to work, hotproc.control.config would need to be set to '(fname=="ksmd" && cpuburn > 0.10)'

Storage

Storage (main)

Excessive waiting for storage

Threshold: disk 85% busy

Compute diskbusy with:

rate(disk.dm.avactive)

References:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Performance_Tuning_Guide/index.html#chap-Red_Hat_Enterprise_Linux-Performance_Tuning_Guide-Storage_and_File_Systems

Bandwidth

Saturating bandwidth of storage

Threshold: > 2.5GB/s

Compute disk.dm.bw with:

rate(disk.dm.total)

Note: There does not seem to be a way to query a storage device for its maximum read/write bandwidth, so the checklist cannot have a predicate for this at the moment.

Small blocks

Storage operations are excessively small

Threshold: < 0.5 Kbytes per I/O operation

Compute disk.dm.avgsz with:

delta(disk.dm.total_bytes)/delta(disk.dm.total)

Note: The computation of avgiosz does not seem to be quite right, and the value is dropped from the display at times.
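
A minimal sketch of the delta/delta computation for the average operation size, assuming both inputs are cumulative counters and that the byte counter is reported in Kbytes (matching the units of the threshold). The function name and the sample values are hypothetical. In an interval with no completed operations the ratio is undefined, so the sketch returns None rather than dividing by zero.

```python
def avg_io_size_kb(prev, curr):
    """Average Kbytes per operation between two samples.

    Each sample is a (total_kbytes, total_ops) pair of cumulative counters.
    Returns None when no operations completed in the interval.
    """
    d_kb = curr[0] - prev[0]
    d_ops = curr[1] - prev[1]
    if d_ops <= 0:
        return None
    return d_kb / d_ops


# Hypothetical interval: 200 Kbytes moved by 1000 operations.
size = avg_io_size_kb((4096, 1000), (4296, 2000))
too_small = size is not None and size < 0.5  # threshold from the text

# Hypothetical idle interval: neither counter advanced.
idle = avg_io_size_kb((4296, 2000), (4296, 2000))
```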

Memory

Memory (main)

Running low on available memory

Threshold: < 10% available memory

Compute mem.ratio.available with:

mem.util.available/mem.physmem

References:

https://access.redhat.com/articles/781733
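
The available-memory ratio is a plain division of two instantaneous values, so no rate conversion is involved. The sketch below assumes both values are in the same unit (Kbytes here); the function name and the numbers are hypothetical.

```python
def available_ratio(available_kb, physmem_kb):
    """Fraction of physical memory that is currently available."""
    if physmem_kb <= 0:
        raise ValueError("physical memory size must be positive")
    return available_kb / physmem_kb


# Hypothetical host: 16,000,000 Kbytes of RAM, 1,000,000 Kbytes available.
ratio = available_ratio(1_000_000, 16_000_000)
low_memory = ratio < 0.10  # 10% threshold from the text
```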

Swapping

There is not enough physical memory, and data is being moved out to swap space

Threshold: swapping occurring

Compute swaps with:

rate(swap.pagesout)

References:

Huge page defragmentation

The system is spending large amounts of time grouping small pages of memory together into contiguous physical regions of memory

Monitor PCP metrics:

mem.vmstat.thp_collapse_alloc
mem.vmstat.thp_fault_alloc
mem.vmstat.thp_fault_fallback

References:

http://dl.acm.org/citation.cfm?id=2930834

Huge page fragmentation

The system is splitting large regions of memory (Huge pages) into small pages

Monitor PCP metrics:

mem.vmstat.thp_split

References:

https://engineering.linkedin.com/performance/optimizing-linux-memory-management-low-latency-high-throughput-databases

Note: RHEL 7 has mem.vmstat.thp_split (thp_split is available in /proc/vmstat), but on Fedora 25 /proc/vmstat has thp_split_page and thp_split_pmd, which do not match up with PCP's mem.vmstat.thp_split.

Network TX

Network TX (main)

Amount of network traffic sent

Threshold: network tx bandwidth

Compute network_tx_bandwidth with:

rate(network.interface.out.bytes)/network.interface.baudrate
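
The bandwidth expression divides a byte rate by the link speed, giving the fraction of the link in use. The sketch below works through that arithmetic, assuming the link speed is expressed in bytes per second so that the units cancel. The function name and the sample values are hypothetical.

```python
def link_utilization(prev, curr, baudrate_bytes_per_s):
    """Fraction of link capacity used between two (timestamp_s, bytes) samples.

    Returns None if the interval or the reported link speed is not positive.
    """
    (t0, b0), (t1, b1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0 or baudrate_bytes_per_s <= 0:
        return None
    bytes_per_s = (b1 - b0) / elapsed
    return bytes_per_s / baudrate_bytes_per_s


# Hypothetical 1 Gbit/s link (125,000,000 bytes/s) sending 250 MB in 4 s.
util = link_utilization((0.0, 0), (4.0, 250_000_000), 125_000_000)
```

Guarding against a non-positive link speed is a choice made for this sketch, since a division by zero would otherwise be possible if an interface reports no speed.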

Saturation

Network packets being dropped

Threshold: network tx drops

Compute network_tx_drops with:

rate(network.interface.out.drops)

Note: The URL mentions comparing the current ring buffer size to the maximum allowed and increasing the ring buffer size, but PCP doesn't have metrics that provide ring buffer information. A 1% packet drop threshold might be too high.
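
To relate a raw drop rate to a percentage threshold such as the 1% mentioned in the note, the drops can be expressed as a fraction of transmit attempts. This is only a sketch: it assumes a cumulative packets-sent counter is available alongside the drop counter, and that dropped packets are not already included in the sent count. The function name and the values are hypothetical.

```python
def drop_fraction(prev, curr):
    """Fraction of transmit attempts dropped between two samples.

    Each sample is a (packets_sent, packets_dropped) pair of cumulative
    counters. Returns None when there was no traffic in the interval.
    """
    d_sent = curr[0] - prev[0]
    d_drop = curr[1] - prev[1]
    attempts = d_sent + d_drop  # assumes drops are not counted as sent
    if attempts <= 0:
        return None
    return d_drop / attempts


# Hypothetical interval: 9950 packets sent, 50 dropped.
frac = drop_fraction((10_000, 5), (19_950, 55))
over_one_percent = frac is not None and frac > 0.01
```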

Errors

Show network errors

Threshold: network tx errors

Compute network_tx_errors with:

rate(network.interface.out.errors)

References:

https://access.redhat.com/solutions/518893

Network RX

Network RX (main)

Amount of network traffic received

Threshold: network rx bandwidth

Compute network_rx_bandwidth with:

rate(network.interface.in.bytes)/network.interface.baudrate

Saturation

Network packets being dropped

Threshold: network rx drops

Compute network_rx_drops with:

rate(network.interface.in.drops)

Note: The URL mentions comparing the current ring buffer size to the maximum allowed and increasing the ring buffer size, but PCP doesn't have metrics that provide ring buffer information. A 1% packet drop threshold might be too high.

Errors

Show network errors

Threshold: network rx errors

Compute network_rx_errors with:

rate(network.interface.in.errors)

References:

https://access.redhat.com/solutions/518893

RX queue too small

Per-CPU RX queues are filled to capacity, and some RX packets are being dropped as a result

Threshold: rx cpu queue drop

Compute rxcpuqdropped with:

rate(network.softnet.dropped)

References:

RX packet processing exceeding time quota

The RX packet processing function had more work remaining when it ran out of time

Threshold: exceeded rx time budget

Compute time_squeeze with:

rate(network.softnet.time_squeeze)

References: