Thursday, March 24, 2005

Extracting useful exacct process data

I modified the /usr/demo/libexacct/exdump.c example code to include the data structures described in previous posts, and made the code update each item in the structure. Then I added a printf of the data structure so I could check that all the data is being captured correctly. Part of the output is shown below; I hand-edited the column headers on the first two lines to make them line up.


procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1693 1679 100 40.23 3.53 416930.11 237 0 58745.32 261.17 4.22 0.00 801.7 38.4 44.6 mozilla-bin
procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1679 1647 100 0.00 0.00 416930.16 0 0 15.11 0.02 0.00 0.00 0.6 0.6 26.4 run-mozilla.sh
procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1647 1646 100 0.00 0.01 416930.30 0 0 8.29 0.03 0.00 0.00 0.8 0.4 26.4 mozilla



Running the modified exdump with the -v option shows the data objects in detail, so I can check that the right data is appearing in the right place. Here is the same data as the second line above. It looks as if the minor faults counter really is zero, which is probably a Solaris bug.


100 group-proc [group of 35 object(s)]
1000 pid 1679
1001 uid 100
1002 gid 10 staff
1004 projid 10 group.staff
1003 taskid 54
100b cpu-user-sec 0
100c cpu-user-nsec 2861447
100d cpu-sys-sec 0
100e cpu-sys-nsec 4150875
1007 start-sec 1109810711 03/02/05 16:45:11
1008 start-nsec 813223175
1009 finish-sec 1110227641 03/07/05 12:34:01
100a finish-nsec 968607249
1006 command "run-mozilla.sh"
100f tty-major 4294967295
1010 tty-minor 4294967295
1005 hostname "crun"
1011 faults-major 0
1012 faults-minor 0
1014 msgs-snd 0
1013 msgs-recv 0
1015 blocks-in 0
1016 blocks-out 0
1017 chars-rdwr 15105
1018 ctxt-vol 21
1019 ctxt-inv 0
101a signals 0
101b swaps 0
101c syscalls 586
101d acctflags 0
101f ppid 1647
1020 wait-status 0 exit
1021 zone "global"
1022 memory-rss-avg-k 584
1023 memory-rss-max-k 27004
procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1679 1647 100 0.00 0.00 416930.16 0 0 15.11 0.02 0.00 0.00 0.6 0.6 26.4 run-mozilla.sh

Wednesday, March 16, 2005

An exercise in complexity....

Time for a grumble....

My plan was to take the libexacct.so API and expose it as an SE toolkit class. After looking at what it would take to do this I have come to the conclusion that the data structure definitions and API for reading the data are too complex.

The design is so abstract that it seems that reading meaningful data out of the log file is some obscure side effect of the code. You can read the data, but there is no guarantee that any specific item of data will be present. The accounting system has various options to send more or less data to the file, so it needs to be flexible, but the important thing is the meaning of the data being logged. I care about the semantic and informational content of the data source. What I get from exacct is "there are some tagged typed objects in this file". I can't consume the data without making assumptions about it, and the API doesn't embody those assumptions.

Some of the data being reported is useless (blocks in and blocks out are archaic measures that are always zero) and other data is missing, like the good microstate information on CPU wait, page-in wait, etc.

I'm going to take the exdump.c code and turn it into a library that exports a sane and simple set of abstractions (like the data structures I defined in earlier posts).

Monday, March 14, 2005

Data structures and objects

The exacct data file is a complex tagged object format that is read via the libexacct library routines. While generic and flexible, it is a pain to get at the data. There are two demo programs that display the information: I used /usr/demo/libexacct/exdump to print out the information shown earlier, and there is also a Perl library and a script called dumpexacct.pl, which displays the tags and types like this:


GROUP
Catalog = EXT_GROUP|EXC_DEFAULT|EXD_GROUP_PROC
ITEM
Catalog = EXT_UINT32|EXC_DEFAULT|EXD_PROC_PID
Value = 1904
ITEM
Catalog = EXT_UINT32|EXC_DEFAULT|EXD_PROC_UID
Value = 25
...


I used this information to define a data structure that will be populated with the data from the file as the first processing step. I left the tags as comments and defined reasonable amounts of fixed space for strings. The task and flow structures are similar.


struct ex_proc {                    // EXT_GROUP|EXC_DEFAULT|EXD_GROUP_PROC
    uint32_t pid;                   // EXT_UINT32|EXC_DEFAULT|EXD_PROC_PID
    uint32_t uid;                   // EXT_UINT32|EXC_DEFAULT|EXD_PROC_UID
    uint32_t gid;                   // EXT_UINT32|EXC_DEFAULT|EXD_PROC_GID
    uint32_t projid;                // EXT_UINT32|EXC_DEFAULT|EXD_PROC_PROJID
    uint32_t taskid;                // EXT_UINT32|EXC_DEFAULT|EXD_PROC_TASKID
    uint64_t cpu_user_sec;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_USER_SEC
    uint64_t cpu_user_nsec;         // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_USER_NSEC
    uint64_t cpu_sys_sec;           // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_SYS_SEC
    uint64_t cpu_sys_nsec;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_SYS_NSEC
    uint64_t proc_start_sec;        // EXT_UINT64|EXC_DEFAULT|EXD_PROC_START_SEC
    uint64_t proc_start_nsec;       // EXT_UINT64|EXC_DEFAULT|EXD_PROC_START_NSEC
    uint64_t proc_finish_sec;       // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FINISH_SEC
    uint64_t proc_finish_nsec;      // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FINISH_NSEC
    char command[PRFNSZ];           // EXT_STRING|EXC_DEFAULT|EXD_PROC_COMMAND
    uint32_t tty_major;             // EXT_UINT32|EXC_DEFAULT|EXD_PROC_TTY_MAJOR
    uint32_t tty_minor;             // EXT_UINT32|EXC_DEFAULT|EXD_PROC_TTY_MINOR
#define EX_SYS_NMLN 40              // SYS_NMLN = 257 - too much
    char hostname[EX_SYS_NMLN];     // EXT_STRING|EXC_DEFAULT|EXD_PROC_HOSTNAME
    uint64_t major_faults;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FAULTS_MAJOR
    uint64_t minor_faults;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FAULTS_MINOR
    uint64_t messages_snd;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MESSAGES_SND
    uint64_t messages_rcv;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MESSAGES_RCV
    uint64_t blocks_in;             // EXT_UINT64|EXC_DEFAULT|EXD_PROC_BLOCKS_IN
    uint64_t blocks_out;            // EXT_UINT64|EXC_DEFAULT|EXD_PROC_BLOCKS_OUT
    uint64_t chars_rdwr;            // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CHARS_RDWR
    uint64_t vctx;                  // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CONTEXT_VOL
    uint64_t ictx;                  // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CONTEXT_INV
    uint64_t signals;               // EXT_UINT64|EXC_DEFAULT|EXD_PROC_SIGNALS
    uint64_t swaps;                 // EXT_UINT64|EXC_DEFAULT|EXD_PROC_SWAPS
    uint64_t syscalls;              // EXT_UINT64|EXC_DEFAULT|EXD_PROC_SYSCALLS
    uint32_t acct_flags;            // EXT_UINT32|EXC_DEFAULT|EXD_PROC_ACCT_FLAGS
    uint32_t ppid;                  // EXT_UINT32|EXC_DEFAULT|EXD_PROC_ANCPID
    uint32_t wait_status;           // EXT_UINT32|EXC_DEFAULT|EXD_PROC_WAIT_STATUS
#define EX_ZONENAME 64
    char zonename[EX_ZONENAME];     // EXT_STRING|EXC_DEFAULT|EXD_PROC_ZONENAME
    uint64_t mem_rss_avg_k;         // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MEM_RSS_AVG_K
    uint64_t mem_rss_max_k;         // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MEM_RSS_MAX_K
};

Capture Ratio and measurement overhead

For performance monitoring applications, we often want process-related information, but collecting it from /proc is very expensive compared to collecting other performance data, and the amount of work increases as the number of processes increases.

The capture ratio is defined as the proportion of total systemwide CPU time that is accounted for by the per-process data. The difference is made up by processes that stop during the intervals between measurements. Since short-lived processes may start and stop between measurements, and we don't know whether a process stopped immediately before a measurement or just after one, there is always an error in sampled process measures. The error is reduced by using a short measurement interval, but that increases overhead. Bypassing the /proc interface and reading process data directly from the kernel is very implementation dependent, but it is used by BMC's PATROL® for Unix - Perform & Predict data collector so that it can collect process data efficiently on large systems at high data rates.

By watching for traditional SysV accounting records that fall between measurements, some heuristics can be applied to improve the capture ratio. However, the SysV accounting record does not include the process id, so this is an inexact technique. The TeamQuest® View and Model data collector uses this trick.

With extended accounting, we can use wracct to force a record of all current processes to be written to the accounting file, along with the processes that terminate. This gives us a perfect capture ratio, even at infrequent measurement intervals, so the overhead of process data collection is extremely low.

There is no need for a performance collection agent: a cron script can invoke wracct at the desired measurement interval. Another cron script can use acctadm to switch logging to a new file, then process the old file locally or ship it to a central location as required.

That, in a nutshell, is why extended accounting is interesting: very good quality data, a perfect capture ratio, and very low measurement overhead.

Partial and Interval accounting records

Traditional accounting generates a record only when the process terminates. The current process data from ps contains different information, so it's not possible to keep the two in sync. Extended accounting includes the wracct command, which forces an accounting record to be generated for a process or task. The partial record is tagged differently, as shown below, but contains the same information.


106 group-proc-partial [group of 35 object(s)]
1000 pid 664
...


For a process, partial records provide the total resource usage since process creation. For a task, an additional option allows for resource usage over the interval since the previous record was written.

Thursday, March 10, 2005

Data from task accounting record

A task is a group of related processes; when the last one exits, a task record is written. The information is very similar to the process record.


ff group-header [group of 4 object(s)]
1 version 1
2 filetype "exacct"
3 creator "SunOS"
4 hostname "crun"
101 group-task [group of 25 object(s)]
2000 taskid 61
2001 projid 1 user.root
2007 cpu-user-sec 0
2008 cpu-user-nsec 0
2009 cpu-sys-sec 0
200a cpu-sys-nsec 0
2003 start-sec 1109844060 03/03/05 02:01:00
2004 start-nsec 907341842
2005 finish-sec 1109844060 03/03/05 02:01:00
2006 finish-nsec 925962473
2002 hostname "crun"
200b faults-major 2
200c faults-minor 0
200e msgs-snd 0
200d msgs-recv 0
200f blocks-in 3
2010 blocks-out 1
2011 chars-rdwr 13483
2012 ctxt-vol 12
2013 ctxt-inv 1
2014 signals 0
2015 swaps 0
2016 syscalls 666
2018 anctaskid 29
2019 zone "global"

Data from process accounting

The process accounting record is far more detailed than the standard SysV acct record used by most Unix-based systems. For a start, it includes the pid of the process and the pid of the parent process, so you can stitch the records together properly. The Solaris project and task ids let you manage and control workloads effectively, and since microstate accounting is on by default in Solaris 10, the CPU usage numbers are accurate and high resolution. By default everything is in the global zone. Zones are the virtual machine containers used for fault isolation and resource management in Solaris 10, so data needs to be separated by zone as well as by workload.


ff group-header [group of 4 object(s)]
1 version 1
2 filetype "exacct"
3 creator "SunOS"
4 hostname "crun"
100 group-proc [group of 34 object(s)]
1000 pid 1748
1001 uid 0 root
1002 gid 0 root
1004 projid 1 user.root
1003 taskid 56
100b cpu-user-sec 0
100c cpu-user-nsec 417258
100d cpu-sys-sec 0
100e cpu-sys-nsec 989267
1007 start-sec 1109813617 03/02/05 17:33:37
1008 start-nsec 139893535
1009 finish-sec 1109813617 03/02/05 17:33:37
100a finish-nsec 152386048
1006 command "acctadm"
100f tty-major 24
1010 tty-minor 4
1011 faults-major 0
1012 faults-minor 0
1014 msgs-snd 0
1013 msgs-recv 0
1015 blocks-in 0
1016 blocks-out 5
1017 chars-rdwr 481
1018 ctxt-vol 2
1019 ctxt-inv 0
101a signals 0
101b swaps 0
101c syscalls 94
101d acctflags 2 SU
101f ppid 1568
1020 wait-status 0 exit
1021 zone "global"
1022 memory-rss-avg-k 996
1023 memory-rss-max-k 29636

Data logged by flow accounting

The data comes in two forms: outgoing traffic is tagged with the userid and project of the initiating process, but incoming traffic is missing this information. Since TCP flows are captured in pairs, they need to be matched up. The output from the provided demo program /usr/demo/libexacct/exdump -v is shown below.

The two flows match if the source and destination addresses and ports are reversed.


ff group-header [group of 4 object(s)]
1 version 1
2 filetype "exacct"
3 creator "SunOS"
4 hostname "crun"
109 group-flow [group of 11 object(s)]
3000 src-addr-v4 a.b.c.d
3001 dest-addr-v4 e.f.g.h crun
3004 src-port 80
3005 dest-port 43727
3006 protocol 6 tcp
3007 diffserv-field 0
300a creation-time 1110482732 03/10/05 11:25:32
300b last-seen 1110482734 03/10/05 11:25:34
3008 total-bytes 3447
3009 total-packets 10
300e action-name "acct"
109 group-flow [group of 13 object(s)]
3000 src-addr-v4 e.f.g.h crun
3001 dest-addr-v4 a.b.c.d
3004 src-port 43727
3005 dest-port 80
3006 protocol 6 tcp
3007 diffserv-field 0
300a creation-time 1110482732 03/10/05 11:25:32
300b last-seen 1110482734 03/10/05 11:25:34
3008 total-bytes 4561
3009 total-packets 5
300c projid 10
300d uid 100
300e action-name "acct"

Wednesday, March 09, 2005

IPQoS Configuration

I found a config file that logs data to the accounting system without filtering it much; it basically filters by protocol. I just take tcp and pass it to the flow accounting module. You could add udp and other protocols if needed.

This is installed using


# ipqosconf -a exacct.qos

where exacct.qos is written as follows:

fmt_version 1.0

action {
    module ipgpc
    name ipgpc.classify
    params {
        global_stats true
    }
    filter {
        name tcpfilter
        protocol tcp
        class allclass
    }
    class {
        name allclass
        next_action acct
        enable_stats true
    }
}
action {
    module flowacct
    name acct
    params {
        global_stats true
        timer 10000
        timeout 10000
        max_limit 2048
        next_action continue
    }
}

Wednesday, March 02, 2005

Configuring IPQoS for flow accounting

I spent some time today working my way through the manuals. What I want is to just have the accounting information about the network flows, with no complex QoS rules and minimum overhead. So far I have it configured, but I'm getting no output. No error messages, but no output, so my machine is giving me some sulky passive-aggressive treatment :-(

I'm working on a Sun W2100z, which is a dual-CPU Opteron system. It came with Solaris 9, but I got Solaris 10 loaded on it by feeding it a bunch of CDs. The hardest part was configuring the graphics display, which defaulted to something horrible with 256 colors. I had to use kdmconfig, but it took me a while to figure out a) that fbconfig is only for SPARC and there is an x86 command with a different name, b) that the system I have has a certain kind of NVidia card in it, c) what display resolution my non-Sun LCD screen supports, and d) that if I scrolled up the screen I could find options with 16M colors that look right.

It's a nice box: not too big, fairly quiet, fast.