Montag, 10. März 2014

Capturing SNMP Values During Load Tests

After capturing and analyzing response times I need some monitoring data to correlate the results to. As far as I could find out, JMeter does not provide means to capture system attributes like CPU, memory, and I/O utilization. So I set off to build something on my own. My application servers run a SNMP daemon. It seemed to obvious to query system data via the available service.


In order to query SNMP data I had to install some Debian packages:
sudo apt-get install \
  snmp \
The snmp package provides the required command line tools, especially snmpget and snmpwalk. snmp-mibs-downloader provides Management Information Base files in the directory /usr/share/snmp/mibs SNMP data structures.
SNMP structures the management information in a numerical, hierarchical format that is very hard to reason about:
snmpwalk -v2c -c public
iso. = STRING: "Linux bay5 3.2.0-59-generic #90-Ubuntu SMP Tue Jan 7 22:43:51 UTC 2014 x86_64"
iso. = OID: iso.
iso. = Timeticks: (145234) 0:24:12.34
iso. = STRING: "Root <root@localhost>"
iso. = STRING: "bay5"
iso. = STRING: "Server Room"
iso. = Timeticks: (0) 0:00:00.00
iso. = OID: iso.
iso. = OID: iso.
iso. = OID: iso.
iso. = OID: iso.
iso. = OID: iso.
iso. = OID: iso.
iso. = OID: iso.
iso. = OID: iso.
iso. = STRING: "The SNMP Management Architecture MIB."
iso. = STRING: "The MIB for Message Processing and Dispatching."
iso. = STRING: "The management information definitions for the SNMP User-based Security Model."
iso. = STRING: "The MIB module for SNMPv2 entities"
iso. = STRING: "The MIB module for managing TCP implementations"
iso. = STRING: "The MIB module for managing IP and ICMP implementations"
iso. = STRING: "The MIB module for managing UDP implementations"
iso. = STRING: "View-based Access Control Model for SNMP."
# ~2300 more entries
The Management Information Base provides metadata that provide human-readable names and categories. To activate MIBs I have to set the MIBS environment variable:
export MIBS=ALL
Afterwords, the output is easier to understand:
$ snmpwalk -v2c -c public
SNMPv2-MIB::sysDescr.0 = STRING: Linux bay5 3.2.0-59-generic #90-Ubuntu SMP Tue Jan 7 22:43:51 UTC 2014 x86_64
SNMPv2-MIB::sysObjectID.0 = OID: NET-SNMP-TC::linux
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (201014) 0:33:30.14
SNMPv2-MIB::sysContact.0 = STRING: Root <root@localhost>
SNMPv2-MIB::sysName.0 = STRING: bay5
SNMPv2-MIB::sysLocation.0 = STRING: Server Room
SNMPv2-MIB::sysORLastChange.0 = Timeticks: (0) 0:00:00.00
SNMPv2-MIB::sysORID.1 = OID: SNMP-FRAMEWORK-MIB::snmpFrameworkMIBCompliance
SNMPv2-MIB::sysORID.2 = OID: SNMP-MPD-MIB::snmpMPDCompliance
SNMPv2-MIB::sysORID.4 = OID: SNMPv2-MIB::snmpMIB
SNMPv2-MIB::sysORID.6 = OID: RFC1213-MIB::ip
SNMPv2-MIB::sysORDescr.1 = STRING: The SNMP Management Architecture MIB.
SNMPv2-MIB::sysORDescr.2 = STRING: The MIB for Message Processing and Dispatching.
SNMPv2-MIB::sysORDescr.3 = STRING: The management information definitions for the SNMP User-based Security Model.
SNMPv2-MIB::sysORDescr.4 = STRING: The MIB module for SNMPv2 entities
SNMPv2-MIB::sysORDescr.5 = STRING: The MIB module for managing TCP implementations
SNMPv2-MIB::sysORDescr.6 = STRING: The MIB module for managing IP and ICMP implementations
SNMPv2-MIB::sysORDescr.7 = STRING: The MIB module for managing UDP implementations
SNMPv2-MIB::sysORDescr.8 = STRING: View-based Access Control Model for SNMP.
# ~2300 more entries
In addition, there are categories of values that can be queried, e.g. systemStats or memory:
snmpwalk -v2c -c public systemStats >
snmpwalk -v2c -c public memory      >
UCD-SNMP-MIB::ssIndex.0 = INTEGER: 1
UCD-SNMP-MIB::ssErrorName.0 = STRING: systemStats
UCD-SNMP-MIB::ssSwapIn.0 = INTEGER: 0 kB
UCD-SNMP-MIB::ssSwapOut.0 = INTEGER: 0 kB
UCD-SNMP-MIB::ssIOSent.0 = INTEGER: 6 blocks/s
UCD-SNMP-MIB::ssIOReceive.0 = INTEGER: 0 blocks/s
UCD-SNMP-MIB::ssSysInterrupts.0 = INTEGER: 62 interrupts/s
UCD-SNMP-MIB::ssSysContext.0 = INTEGER: 124 switches/s
UCD-SNMP-MIB::ssCpuUser.0 = INTEGER: 1
UCD-SNMP-MIB::ssCpuSystem.0 = INTEGER: 0
UCD-SNMP-MIB::ssCpuIdle.0 = INTEGER: 97
UCD-SNMP-MIB::ssCpuRawUser.0 = Counter32: 3112
UCD-SNMP-MIB::ssCpuRawNice.0 = Counter32: 0
UCD-SNMP-MIB::ssCpuRawSystem.0 = Counter32: 366
UCD-SNMP-MIB::ssCpuRawIdle.0 = Counter32: 16735
UCD-SNMP-MIB::ssCpuRawWait.0 = Counter32: 96
UCD-SNMP-MIB::ssCpuRawKernel.0 = Counter32: 0
UCD-SNMP-MIB::ssCpuRawInterrupt.0 = Counter32: 33
UCD-SNMP-MIB::ssIORawSent.0 = Counter32: 5482
UCD-SNMP-MIB::ssIORawReceived.0 = Counter32: 203276
UCD-SNMP-MIB::ssRawInterrupts.0 = Counter32: 31382
UCD-SNMP-MIB::ssRawContexts.0 = Counter32: 68221
UCD-SNMP-MIB::ssCpuRawSoftIRQ.0 = Counter32: 13
UCD-SNMP-MIB::ssRawSwapIn.0 = Counter32: 0
UCD-SNMP-MIB::ssRawSwapOut.0 = Counter32: 0
UCD-SNMP-MIB::memIndex.0 = INTEGER: 0
UCD-SNMP-MIB::memErrorName.0 = STRING: swap
UCD-SNMP-MIB::memTotalSwap.0 = INTEGER: 466936 kB
UCD-SNMP-MIB::memAvailSwap.0 = INTEGER: 466936 kB
UCD-SNMP-MIB::memTotalReal.0 = INTEGER: 1026948 kB
UCD-SNMP-MIB::memAvailReal.0 = INTEGER: 632672 kB
UCD-SNMP-MIB::memTotalFree.0 = INTEGER: 1099608 kB
UCD-SNMP-MIB::memMinimumSwap.0 = INTEGER: 16000 kB
UCD-SNMP-MIB::memBuffer.0 = INTEGER: 4116 kB
UCD-SNMP-MIB::memCached.0 = INTEGER: 97152 kB
UCD-SNMP-MIB::memSwapError.0 = INTEGER: noError(0)
UCD-SNMP-MIB::memSwapErrorMsg.0 = STRING: 

Capturing Data Regularly

As a first step, I was fine with just capturing CPU and memory statistics. Selectively querying this information was quite valueable, because a full snmpwalk turned out to be too expensive and influenced the test results. I wrote a small script to record systemStats and memory every 60 seconds:

export MIBS=ALL
while true; do
  timestamp=$(date +%s%N | cut -b1-13)
  echo "Recording SNMP indicators to $record_dir/"
  mkdir --parents "$record_dir"
  snmpwalk -t 1 -v2c -c public $host systemStats > $record_dir/$host.systemStats.snmpwalk
  snmpwalk -t 1 -v2c -c public $host memory      > $record_dir/$host.memory.snmpwalk
  sleep $interval
The results are organized hierarchical in directories by host and the timestamp.

Cleaning Up and Collecting the Captured Data

To prepare the data for analysis with R, I stripped out some uninteresting tokens:
cat - | \
  sed "s/UCD-SNMP-MIB:://" | \
  sed "s/INTEGER: //" | \
  sed "s/STRING: //" | \
  sed "s/Counter32: //" | \
  sed "s/\.0//" | \
  sed "s/ = /,/g"
Afterwards I wrote a script to traverse the directories, read the snmpwalk data, clean it up and prepending the host name and timestamp.
for recording in $( ls --directory */*/ ); do
  host=$( echo "$recording" | sed 's/\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+\)\/\([0-9]\+\)\//\1/' )
  timestamp=$( echo "$recording" | sed 's/\([0-9]\+\.[0-9]\+\.[0-9]\+\.[0-9]\+\)\/\([0-9]\+\)\//\2/' )
  cat $host/$timestamp/*.snmpwalk | ./ | sed "s/^/$host,$timestamp,/"
Executing ./collect-snmp-data > snmp.csv produces a neat CSV file, ready for R.

Plotting the Data

Finally, I plotted the data with R.
ssCpuUser<-snmps[ snmps$valueType == 'ssCpuUser', ]
     main=expression(bold(paste(ssCpuUser, ", ", 1-textstyle(frac(memAvailReal, memTotalReal))))),
     xlab="Test time [min]",
     ylab="Utilization [%]",

timestamps<-snmps[ snmps$valueType == 'memTotalReal', ]$timestamp
memTotalReal<-as.numeric(sub("([0-9]+) kB", "\\1", snmps[ snmps$valueType == 'memTotalReal', ]$value))
memAvailReal<-as.numeric(sub("([0-9]+) kB", "\\1", snmps[ snmps$valueType == 'memAvailReal', ]$value))

legend(0, 20,
       c("ssCpuUser", expression(1-textstyle(frac(memAvailReal, memTotalReal)))),
       col=c("purple", "orange"),
Combined with the response times from JMeter, I got the following image:

The result does not suggest any relationships between these system parameters and the response times. But that just motivates further investigation. :-) So stay tuned!
As always, the sources are available via GitHub.


Dienstag, 4. März 2014

Analyzing JMeter Results with R

"If you can't measure it, you can't manage it."
— Peter Drucker
To be able to improve a system's performance I need to understand the current characteristics of its operation. So I created a very simple (you might call it naïve as well) performance test with JMeter. Executing the test for roughly 35 minutes resulted in 386629 lines of raw CSV data. But raw data does not provide any insight. In order to understand what is going on I needed some statistical numbers and charts. This is where R comes into play. Reading data in is quite simple:
First of all, I wanted an overview of the latency over the test runtime.
     xlab="Test time [min]",
     ylab="Latency [ms]",

Response times seem to be pretty stable over time, I cannot identify any trends at first sight. Nevertheless, the results are split: requests are replied to either quite fast or after about 5 seconds. The "five second barrier" is interesting, though. It is too constant to be incidental. This begs for further investigation.
As a next step, I analyzed the ratios of HTTP response codes during the test:
    col=c("steelblue3", "tomato2", "tomato3"),
    main="HTTP Response Codes"

About 2/3 of the requests are handled successfully, but 1/3 of the requests resulted in server errors. That's definitely too many and needs improvements.
As a last, but very important step, I analyzed the overall service levels. So I created a plot of the cumulated relative frequency of the response times.
     main="Cumulative relative frequency of response times",
     xlab="Latency [ms]",
Important indicators are usability barriers and the 95 percentile. Regarding usability, the ultimate goal are 100 ms response time; this makes the system appear instantaneous. 1000 ms response time are the maximum not to interrupt the user's flow of thought.
instantResponse = cumsum(table(cut(responseTimes$Latency,c(0,100))))/nrow(responseTimes)
text(100,instantResponse,paste(format(instantResponse*100,digits=3), "%"),col="green4",adj=c(1.1,-.3))

fastResponse = cumsum(table(cut(responseTimes$Latency,c(0,1000))))/nrow(responseTimes)
text(1000,fastResponse,paste(format(fastResponse*100,digits=3), "%"),col="green4",adj=c(1.1,-.3))
The 95 percentile is the "realistic maximum" response time. Beyond this limit are the extreme outliers that you cannot prevent on a loosely coupled, unreliable, distributed system like the internet. Fighting against these is a waste of your valuable time. But for service quality, it is important that nearly every user gets a reasonable response time. In order to calculate and plot the 95 percentile, I had to:
ninetyfiveQuantile = quantile(responseTimes$Latency,c(0.95))
segments(-10000, 0.95, ninetyfiveQuantile, .95, col="tomato1",lty="dashed",lwd=2)
These calculations result in the following plot.

To sum it up, the system is able to
  • serve 60.7 % of its users in 100 ms or less;
  • serve 63.7 % of its users in 1000 ms or less;
  • serve 95 % of its users in 5017 ms or less.
It seems very likely that the successful requests are responded quickly, and that the responses that took about 5 seconds are the 503 errors. The timeout is probably caused by some kind of bottleneck. I did not investigate this further yet, but I am quite happy with the visualizations I produced.
As always, the sources are available via GitHub.