Thursday, March 24, 2005

Extracting useful exacct process data

I modified the /usr/demo/libexacct/exdump.c example code to include the data structures described in previous posts, and made the code update each item in the structure. Then I added a printf of the data structure so I could check that all the data is being captured correctly. Part of the output is shown below; I hand-edited the column headers on the first two lines to make them line up.


procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1693 1679 100 40.23 3.53 416930.11 237 0 58745.32 261.17 4.22 0.00 801.7 38.4 44.6 mozilla-bin
procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1679 1647 100 0.00 0.00 416930.16 0 0 15.11 0.02 0.00 0.00 0.6 0.6 26.4 run-mozilla.sh
procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1647 1646 100 0.00 0.01 416930.30 0 0 8.29 0.03 0.00 0.00 0.8 0.4 26.4 mozilla



Running the modified exdump with the -v option shows the data objects in detail, so I can check that the right data is appearing in the right place. Here is the same data as the second line above. It looks as if the minor faults counter really is zero, which is probably a Solaris bug.


100 group-proc [group of 35 object(s)]
1000 pid 1679
1001 uid 100
1002 gid 10 staff
1004 projid 10 group.staff
1003 taskid 54
100b cpu-user-sec 0
100c cpu-user-nsec 2861447
100d cpu-sys-sec 0
100e cpu-sys-nsec 4150875
1007 start-sec 1109810711 03/02/05 16:45:11
1008 start-nsec 813223175
1009 finish-sec 1110227641 03/07/05 12:34:01
100a finish-nsec 968607249
1006 command "run-mozilla.sh"
100f tty-major 4294967295
1010 tty-minor 4294967295
1005 hostname "crun"
1011 faults-major 0
1012 faults-minor 0
1014 msgs-snd 0
1013 msgs-recv 0
1015 blocks-in 0
1016 blocks-out 0
1017 chars-rdwr 15105
1018 ctxt-vol 21
1019 ctxt-inv 0
101a signals 0
101b swaps 0
101c syscalls 586
101d acctflags 0
101f ppid 1647
1020 wait-status 0 exit
1021 zone "global"
1022 memory-rss-avg-k 584
1023 memory-rss-max-k 27004
procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1679 1647 100 0.00 0.00 416930.16 0 0 15.11 0.02 0.00 0.00 0.6 0.6 26.4 run-mozilla.sh

Wednesday, March 16, 2005

An exercise in complexity....

Time for a grumble....

My plan was to take the libexacct.so API and expose it as an SE toolkit class. After looking at what it would take to do this I have come to the conclusion that the data structure definitions and API for reading the data are too complex.

The design is so abstract that it seems that reading meaningful data out of the log file is some obscure side effect of the code. You can read the data, but there is no guarantee that any specific item of data will be present. The accounting system has various options to send more or less data to the file, so it needs to be flexible, but the important thing is the meaning of the data being logged. I care about the semantic and informational content of the data source. What I get from exacct is "there are some tagged typed objects in this file". I can't consume the data without making assumptions about it, and the API doesn't embody those assumptions.

Some of the data being reported is useless (blocks in and blocks out are archaic measures that are always zero) and other data is missing, like the good microstate information on CPU wait, page-in wait, etc.

I'm going to take the exdump.c code and turn it into a library that exports a sane and simple set of abstractions (like the data structures I defined in earlier posts).

Monday, March 14, 2005

Data structures and objects

The exacct data file is a complex tagged object format that is read via the libexacct library routines. While generic and flexible, it is a pain to get at the data. There are two demo programs that display the information: I used /usr/demo/libexacct/exdump to print out the information shown earlier, and there is also a Perl library and a script called dumpexacct.pl, which displays the tags and types like this:


GROUP
Catalog = EXT_GROUP|EXC_DEFAULT|EXD_GROUP_PROC
ITEM
Catalog = EXT_UINT32|EXC_DEFAULT|EXD_PROC_PID
Value = 1904
ITEM
Catalog = EXT_UINT32|EXC_DEFAULT|EXD_PROC_UID
Value = 25
...


I used this information to define a data structure that will be populated with the data from the file as the first processing step. I left the tags as comments and defined reasonable amounts of fixed space for strings. The task and flow structures are similar.


struct ex_proc {                    // EXT_GROUP|EXC_DEFAULT|EXD_GROUP_PROC
    uint32_t pid;                   // EXT_UINT32|EXC_DEFAULT|EXD_PROC_PID
    uint32_t uid;                   // EXT_UINT32|EXC_DEFAULT|EXD_PROC_UID
    uint32_t gid;                   // EXT_UINT32|EXC_DEFAULT|EXD_PROC_GID
    uint32_t projid;                // EXT_UINT32|EXC_DEFAULT|EXD_PROC_PROJID
    uint32_t taskid;                // EXT_UINT32|EXC_DEFAULT|EXD_PROC_TASKID
    uint64_t cpu_user_sec;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_USER_SEC
    uint64_t cpu_user_nsec;         // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_USER_NSEC
    uint64_t cpu_sys_sec;           // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_SYS_SEC
    uint64_t cpu_sys_nsec;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_SYS_NSEC
    uint64_t proc_start_sec;        // EXT_UINT64|EXC_DEFAULT|EXD_PROC_START_SEC
    uint64_t proc_start_nsec;       // EXT_UINT64|EXC_DEFAULT|EXD_PROC_START_NSEC
    uint64_t proc_finish_sec;       // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FINISH_SEC
    uint64_t proc_finish_nsec;      // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FINISH_NSEC
    char command[PRFNSZ];           // EXT_STRING|EXC_DEFAULT|EXD_PROC_COMMAND
    uint32_t tty_major;             // EXT_UINT32|EXC_DEFAULT|EXD_PROC_TTY_MAJOR
    uint32_t tty_minor;             // EXT_UINT32|EXC_DEFAULT|EXD_PROC_TTY_MINOR
#define EX_SYS_NMLN 40              // SYS_NMLN = 257 - too much
    char hostname[EX_SYS_NMLN];     // EXT_STRING|EXC_DEFAULT|EXD_PROC_HOSTNAME
    uint64_t major_faults;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FAULTS_MAJOR
    uint64_t minor_faults;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FAULTS_MINOR
    uint64_t messages_snd;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MESSAGES_SND
    uint64_t messages_rcv;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MESSAGES_RCV
    uint64_t blocks_in;             // EXT_UINT64|EXC_DEFAULT|EXD_PROC_BLOCKS_IN
    uint64_t blocks_out;            // EXT_UINT64|EXC_DEFAULT|EXD_PROC_BLOCKS_OUT
    uint64_t chars_rdwr;            // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CHARS_RDWR
    uint64_t vctx;                  // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CONTEXT_VOL
    uint64_t ictx;                  // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CONTEXT_INV
    uint64_t signals;               // EXT_UINT64|EXC_DEFAULT|EXD_PROC_SIGNALS
    uint64_t swaps;                 // EXT_UINT64|EXC_DEFAULT|EXD_PROC_SWAPS
    uint64_t syscalls;              // EXT_UINT64|EXC_DEFAULT|EXD_PROC_SYSCALLS
    uint32_t acct_flags;            // EXT_UINT32|EXC_DEFAULT|EXD_PROC_ACCT_FLAGS
    uint32_t ppid;                  // EXT_UINT32|EXC_DEFAULT|EXD_PROC_ANCPID
    uint32_t wait_status;           // EXT_UINT32|EXC_DEFAULT|EXD_PROC_WAIT_STATUS
#define EX_ZONENAME 64
    char zonename[EX_ZONENAME];     // EXT_STRING|EXC_DEFAULT|EXD_PROC_ZONENAME
    uint64_t mem_rss_avg_k;         // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MEM_RSS_AVG_K
    uint64_t mem_rss_max_k;         // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MEM_RSS_MAX_K
};

Capture Ratio and measurement overhead

For performance monitoring applications, we often want process-related information, but collecting it from /proc is very expensive compared to collecting other performance data, and the amount of work increases as the number of processes increases.

The capture ratio is defined as the proportion of total systemwide CPU time that is accounted for by the per-process data. The difference is made up by processes that stop during the intervals between measurements. Since short-lived processes may start and stop between measurements, and we don't know whether a process stopped immediately before a measurement or just after one, there is always an error in sampled process measures. The error is reduced by using a short measurement interval, but that increases overhead. Bypassing the /proc interface and reading process data directly from the kernel is very implementation dependent, but it is used by BMC's PATROL® for Unix - Perform & Predict data collector so that it can collect process data efficiently on large systems at high data rates.

By watching for traditional SysV accounting records that fall between measurements, some heuristics can be applied to improve the capture ratio. However, the SysV accounting record does not include the process id, so this is an inexact technique. The TeamQuest® View and Model data collector uses this trick.

With extended accounting, we can use wracct to force a record of all current processes to be written to the accounting file, along with the processes that terminate. This gives us a perfect capture ratio, even at infrequent measurement intervals, so the overhead of process data collection is extremely low.

There is no need for a performance collection agent: a cron script can invoke wracct at the desired measurement interval. Another cron script can use acctadm to switch logging to a new file, then process the old file locally or ship it to a central location as required.

That, in a nutshell, is why extended accounting is interesting: very good quality data, a perfect capture ratio, and very low measurement overhead.

Partial and Interval accounting records

Traditional accounting generates a record only when the process terminates. The current process data from ps contains different information, so it's not possible to keep the two in sync. Extended accounting includes the wracct command, which forces an accounting record to be generated for a process or task. The partial record is tagged differently, as shown below, but contains the same information.


106 group-proc-partial [group of 35 object(s)]
1000 pid 664
...


For a process, partial records provide the total resource usage since process creation. For a task, an additional option allows for resource usage over the interval since the previous record was written.

Thursday, March 10, 2005

Data from task accounting record

A task is a group of related processes; when the last one exits, a task record is written. The information is very similar to the process record.


ff group-header [group of 4 object(s)]
1 version 1
2 filetype "exacct"
3 creator "SunOS"
4 hostname "crun"
101 group-task [group of 25 object(s)]
2000 taskid 61
2001 projid 1 user.root
2007 cpu-user-sec 0
2008 cpu-user-nsec 0
2009 cpu-sys-sec 0
200a cpu-sys-nsec 0
2003 start-sec 1109844060 03/03/05 02:01:00
2004 start-nsec 907341842
2005 finish-sec 1109844060 03/03/05 02:01:00
2006 finish-nsec 925962473
2002 hostname "crun"
200b faults-major 2
200c faults-minor 0
200e msgs-snd 0
200d msgs-recv 0
200f blocks-in 3
2010 blocks-out 1
2011 chars-rdwr 13483
2012 ctxt-vol 12
2013 ctxt-inv 1
2014 signals 0
2015 swaps 0
2016 syscalls 666
2018 anctaskid 29
2019 zone "global"

Data from process accounting

The process accounting record is far more detailed than the standard SysV acct record used by most Unix-based systems. For a start, it includes the pid of the process and the pid of the parent process, so you can stitch the records together properly. The Solaris project and task ids let you manage and control workloads effectively, and since microstate accounting is on by default in Solaris 10, the CPU usage numbers are accurate and high resolution. By default everything is in the global zone. Zones are the virtual machine containers used for fault isolation and resource management in Solaris 10, so data needs to be separated by zone as well as by workload.


ff group-header [group of 4 object(s)]
1 version 1
2 filetype "exacct"
3 creator "SunOS"
4 hostname "crun"
100 group-proc [group of 34 object(s)]
1000 pid 1748
1001 uid 0 root
1002 gid 0 root
1004 projid 1 user.root
1003 taskid 56
100b cpu-user-sec 0
100c cpu-user-nsec 417258
100d cpu-sys-sec 0
100e cpu-sys-nsec 989267
1007 start-sec 1109813617 03/02/05 17:33:37
1008 start-nsec 139893535
1009 finish-sec 1109813617 03/02/05 17:33:37
100a finish-nsec 152386048
1006 command "acctadm"
100f tty-major 24
1010 tty-minor 4
1011 faults-major 0
1012 faults-minor 0
1014 msgs-snd 0
1013 msgs-recv 0
1015 blocks-in 0
1016 blocks-out 5
1017 chars-rdwr 481
1018 ctxt-vol 2
1019 ctxt-inv 0
101a signals 0
101b swaps 0
101c syscalls 94
101d acctflags 2 SU
101f ppid 1568
1020 wait-status 0 exit
1021 zone "global"
1022 memory-rss-avg-k 996
1023 memory-rss-max-k 29636

Data logged by flow accounting

The data comes in two forms: outgoing traffic is tagged with the userid and project of the initiating process, but incoming traffic is missing this information. Since TCP flows are captured in pairs, they need to be matched up. The output from the provided demo program /usr/demo/libexacct/exdump -v is shown below.

The two flows match if the source and destination addresses and ports are reversed.


ff group-header [group of 4 object(s)]
1 version 1
2 filetype "exacct"
3 creator "SunOS"
4 hostname "crun"
109 group-flow [group of 11 object(s)]
3000 src-addr-v4 a.b.c.d
3001 dest-addr-v4 e.f.g.h crun
3004 src-port 80
3005 dest-port 43727
3006 protocol 6 tcp
3007 diffserv-field 0
300a creation-time 1110482732 03/10/05 11:25:32
300b last-seen 1110482734 03/10/05 11:25:34
3008 total-bytes 3447
3009 total-packets 10
300e action-name "acct"
109 group-flow [group of 13 object(s)]
3000 src-addr-v4 e.f.g.h crun
3001 dest-addr-v4 a.b.c.d
3004 src-port 43727
3005 dest-port 80
3006 protocol 6 tcp
3007 diffserv-field 0
300a creation-time 1110482732 03/10/05 11:25:32
300b last-seen 1110482734 03/10/05 11:25:34
3008 total-bytes 4561
3009 total-packets 5
300c projid 10
300d uid 100
300e action-name "acct"

Wednesday, March 09, 2005

IPQoS Configuration

I found a config file that logs data to the accounting system without filtering it much; it basically filters by protocol. I just take tcp and pass it to the flow accounting module. You could add udp and other protocols if needed.

This is installed using


# ipqosconf -a exacct.qos

where exacct.qos is written as follows:

fmt_version 1.0

action {
    module ipgpc
    name ipgpc.classify
    params {
        global_stats true
    }
    filter {
        name tcpfilter
        protocol tcp
        class allclass
    }
    class {
        name allclass
        next_action acct
        enable_stats true
    }
}
action {
    module flowacct
    name acct
    params {
        global_stats true
        timer 10000
        timeout 10000
        max_limit 2048
        next_action continue
    }
}

Wednesday, March 02, 2005

Configuring IPQoS for flow accounting

I spent some time today working my way through the manuals. What I want is to just have the accounting information about the network flows, with no complex QoS rules and minimum overhead. So far I have it configured, but I'm getting no output. No error messages, but no output, so my machine is giving me some sulky passive-aggressive treatment :-(

I'm working on a Sun W2100z, which is a dual-CPU Opteron system. It came with Solaris 9, but I got Solaris 10 loaded on it by feeding it a bunch of CDs. The hardest part was configuring the graphics display, which defaulted to something horrible with 256 colors. I had to use kdmconfig, but it took me a while to figure out a) that fbconfig is only for SPARC and there is an x86 command with a different name, b) that the system I have has a certain kind of NVidia card in it, c) what display resolution my non-Sun LCD screen supports, and d) that if I scrolled up the screen I could find options with 16M colors that look right.

It's a nice box: not too big, fairly quiet, fast.