Wednesday, December 21, 2005

CMG05 trip comments and "utilization is useless..."

I had a good time at the Computer Measurement Group meeting in Orlando recently. Mario Jauvin and I put together a tutorial on "Capacity Planning and Performance Monitoring with Free Tools" that was well attended, although relatively few attendees seemed to be using free tools, mostly due to policy and support issues. We also attended James Holtman's workshop on using the free statistics package 'R' for performance data analysis and plotting.

The main new theme at this year's conference seemed to be CPU virtualization. Many people using VMware, Xen, Solaris containers and other virtualization facilities are finding that their measurements of CPU utilization don't make sense any more. Both BMC and TeamQuest are working on building some support for virtualization concepts into their tools.

My observation is that utilization is useless as a metric and should be abandoned. It has been useless in virtualized disk subsystems for some time, and is now useless for CPU measurement as well. There used to be a clear relationship between response time and utilization, but systems are now so complex that those relationships no longer hold. Instead, you need to directly measure response time and relate it to throughput. Utilization is properly defined as busy time as a proportion of elapsed time. The replacement for utilization is headroom, which is defined as the unused proportion of the maximum possible throughput. Dave Fisk calls this Capability Utilization.
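
To make the distinction concrete, here is a trivial sketch (my own illustration; the maximum possible throughput has to come from measurement or from a model of the saturation point, it isn't something the OS reports):

/* utilization: busy time as a proportion of elapsed time */
double utilization(double busy_secs, double elapsed_secs) {
        return busy_secs / elapsed_secs;
}

/* headroom: the unused proportion of the maximum possible throughput,
 * e.g. 1.0 - 3000/5000 = 40% headroom at 3000 tx/s against a 5000 tx/s peak
 */
double headroom(double throughput, double max_throughput) {
        return 1.0 - throughput / max_throughput;
}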

I'm thinking of writing a paper, maybe for next year's CMG on this topic....

Happy holidays everyone
Cheers Adrian

Monday, October 31, 2005

Help! I've lost my memory! Updated Sunworld Column

Originally published in Unix Insider 10/1/95
Stripped of adverts, URL references fixed, and comments added in red type to bring it up to date ten years later.

Dear Adrian, After a reboot I saw that most of my computer's memory was
free, but when I launched my application it used up almost all the
memory. When I stopped the application the memory didn't come back!
Take a look at my vmstat
output:

% vmstat 5
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr s0 s1 s2 s3 in sy cs us sy id

This is before the program starts:

0 0 0 330252 80708   0   2  0  0  0  0  0  0  0  0  1   18  107  113  0  1 99
0 0 0 330252 80708 0 0 0 0 0 0 0 0 0 0 0 14 87 78 0 0 99

I start the program and it runs like this for a while:

0 0 0 314204  8824   0   0  0  0  0  0  0  0  0  0  0  414  132   79 24  1 74
0 0 0 314204 8824 0 0 0 0 0 0 0 0 0 0 0 411 99 66 25 1 74

I stop it, then almost all the swap space comes back, but the free memory does not:

0 0 0 326776 21260   0   3  0  0  0  0  0  0  1  0  0  420  116   82  4  2 95
0 0 0 329924 24396 0 0 0 0 0 0 0 0 0 0 0 414 82 77 0 0 100
0 0 0 329924 24396 0 0 0 0 0 0 0 0 2 0 1 430 90 84 0 1 99

I checked that there were no application processes running. It looks like a huge memory leak in the operating system. How can I get my memory back?
--RAMless in Ripon

Update

This remains one of the most frequently asked questions of all time. The
original answer is still true for many Unix variants. However, while
writing his book on Solaris Internals, Richard McDougall worked out
how to fix Solaris to make it work better, and to make this apparent
problem go away. The result was one of the most significant
performance improvements in Solaris 8, but the first edition of his
book was written before Solaris 8 came out, so it doesn't describe the
fix!

The short answer

Launch your application again. Notice that it starts up more quickly than it did the first time, and with less disk activity. The application code and its data files are still
in memory, even though they are not active. The memory they occupy is
not "free." If you restart the same application it finds
the pages that are already in memory. The pages are attached to the
inode cache entries for the files. If you start a different
application, and there is insufficient free memory, the kernel will
scan for pages that have not been touched for a long time, and "free"
them. Once you quit the first application, the memory it occupies is
not being touched, so it will be freed quickly for use by other
applications.

In 1988, Sun introduced this feature in SunOS 4.0. It still applies to
all versions of Solaris 1 and 2. The kernel is trying to avoid disk
reads by caching as many files as possible in memory. Attaching to a
page in memory is around 1,000 times faster than reading it in from
disk. The kernel figures that you paid good money for all of that
RAM, so it will try to make good use of it by retaining files you
might need.

Since Solaris 8, the memory in the file cache is actually also on the free
list, so you do see the vmstat free memory come back when you quit a
program. You should also expect large amounts of file I/O to cause
high scan rates on older Solaris releases, and no scanning at all
on Solaris 8 systems. If Solaris 8 scans at all, then
it has truly run out of memory and is overloaded.

By contrast, memory leaks appear as a shortage of swap space after the
misbehaving program runs for a while. You will probably find a
process that has a larger than expected size. You should restart the
program to free up the swap space, and check it with a debugger that
offers a leak-finding feature (run it with the libumem version of the malloc library that instruments memory leaks).

The long (and technical) answer

To understand how Sun's operating systems handle memory, I will explain how the inode cache works, how the buffer cache fits into the picture, and how the life
cycle of a typical page evolves as the system uses it for several
different purposes.

The inode cache and file data caching

Whenever you access a file, the kernel needs to know the size, the access permissions,
the date stamps and the locations of the data blocks on disk.
Traditionally, this information is known as the inode for the file.
There are many filesystem types. For simplicity I will assume we are
only interested in the Unix filesystem (UFS) on a local disk. Each
filesystem type has its own inode cache.

The filesystem stores inodes on the disk; the inode must be read into
memory whenever an operation is performed on an entity in the
filesystem. The number of inodes read per second is reported as
iget/s by the sar
-a
command. The inode read from disk is cached in case
it is needed again, and the number of inodes that the system will
cache is influenced by a kernel parameter called ufs_ninode.
The kernel keeps inodes on a linked list, rather than in a fixed-size
table.

As I mention each command I will show you what the output looks like. In
my case I'm collecting sar data automatically using cron, so I just run
sar, which defaults to reading the stored data for today. If you have no
stored data, specify a time interval and sar will show you current
activity.

% sar -a

SunOS hostname 5.4 Generic_101945-32 sun4c 09/18/95

00:00:01 iget/s namei/s dirbk/s
01:00:01 4 6 0

All reads or writes to UFS files occur by paging from the filesystem. All
pages that are part of the file and are in memory will be attached to
the inode cache entry for that file. When a file is not in use, its
data is cached in memory, using an inactive inode cache entry. When
the kernel reuses an inactive inode cache entry that has pages
attached, it puts the pages on the free list; this case is shown by
sar -g as %ufs_ipf.
This number is the percentage of UFS inodes that were overwritten in
the inode cache by iget and
that had reusable pages associated with them. The kernel flushes the
pages, writing any modified pages back to disk. Thus, this %ufs_ipf
number is the percentage of igets with page flushes. Any non-zero
values of %ufs_ipf reported by sar -g
indicate that the inode cache is too small for the current workload.

% sar -g

SunOS hostname 5.4 Generic_101945-32 sun4c 09/18/95

00:00:01 pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
01:00:01 0.02 0.02 0.08 0.12 0.00

For SunOS 4 and releases up to Solaris 2.3, the number of inodes that the
kernel will keep in the inode cache is set by the kernel variable
ufs_ninode. To simplify: When a file is opened, an inactive
inode will be reused from the cache if the cache is full; when an
inode becomes inactive, it is discarded if the cache is over-full. If
the cache limit has not been reached then an inactive inode is placed
at the back of the reuse list and invalid inodes (inodes for files
that no longer exist) are placed at the front for immediate reuse. It is
entirely possible for the number of open files in the system to cause
the number of active inodes to exceed ufs_ninode; raising
ufs_ninode allows more inactive inodes to be cached in case
they are needed again.

Solaris 2.4 uses a more clever inode cache algorithm. The kernel maintains a
reuse list of blank inodes for instant use. The number of active
inodes is no longer constrained, and the number of idle inodes
(inactive but cached in case they are needed again) is kept between
ufs_ninode and 75 percent of ufs_ninode by a new
kernel thread that scavenges the inodes to free them and maintains
entries on the reuse list. If you use sar
-v
to look at the inode cache, you may see a larger
number of existing inodes than the reported "size."

% sar -v

SunOS hostname 5.4 Generic_101945-32 sun4c 09/18/95

00:00:01 proc-sz ov inod-sz ov file-sz ov lock-sz
01:00:01 66/506 0 2108/2108 0 353/353 0 0/0

Buffer cache

The buffer cache is used to cache filesystem
data in SVR3 and BSD Unix. In SunOS 4, generic SVR4, and Solaris 2,
it is used to cache inode, indirect block, and cylinder group blocks
only. Although this change was introduced in 1988, many people still
incorrectly think the buffer cache is used to hold file data. Inodes
are read from disk to the buffer cache in 8-kilobyte blocks, then the
individual inodes are read from the buffer cache into the inode
cache.

Life cycle of a typical physical memory page

This section provides additional insight into the way memory is used. The sequence
described is an example of some common uses of pages; many other
possibilities exist.

1. Initialization -- A page is born
When the system boots, it forms all free memory into pages, and allocates
a kernel data structure to hold the state of every page in the
system.

2. Free -- An untouched virgin page
All the memory is put onto the free list to start with. At this stage the
content of the page is undefined.

3. ZFOD -- Joining an uninitialized data segment
When a program accesses data that is preset to zero for the very first
time, a minor page fault occurs and a Zero Fill On Demand (ZFOD)
operation takes place. The page is taken from the free list,
block-cleared to contain all zeroes, and added to the list of
anonymous pages for the uninitialized data segment. The program then
reads and writes data to the page.

4. Scanned -- The pagedaemon awakes
When the free list gets below a certain size, the pagedaemon starts to
look for memory pages to steal from processes. It looks at all pages
in physical memory order; when it gets to the page, the page is
synchronized with the memory management unit (MMU) and a reference
bit is cleared.

5. Waiting -- Is the program really using this page right now?
There is a delay that varies depending upon how quickly the pagedaemon
scans through memory. If the program references the page during this
period, the MMU reference bit is set.

6. Pageout Time -- Saving the contents
The pageout daemon returns and checks the MMU reference bit to find that
the program has not used the page so it can be stolen for reuse. The
pagedaemon checks to see if anything had been written to the page;
if it contains data, a page-out occurs. The page is moved to the
pageout queue and marked as I/O pending. The swapfs code clusters
the page together with other pages on the queue and writes the
cluster to the swap space. The page is then free and is put on the
free list again. It remembers that it still contains the program
data.

7. Reclaim -- Give me back my page!
Belatedly, the program tries to read the page and takes a page fault. If the
page had been reused by someone else in the meantime, a major fault
would occur and the data would be read from the swap space into a
new page taken from the free list. In this case, the page is still
waiting to be reused, so a minor fault occurs, and the page is moved
back from the free list to the program's data segment.

8. Program Exit -- Free again
The program finishes running and exits. The data segments are private to
that particular instance of the program (unlike the shared-code
segments), so all the pages in the data segment are marked as
undefined and put onto the free list. This is the same state as Step 2.

9. Page-in -- A shared code segment
A page fault occurs in the code segment of a window system shared library.
The page is taken off the free list, and a read from the filesystem
is scheduled to get the code. The process that caused the page fault
sleeps until the data arrives. The page is attached to the inode of
the file, and the segments reference the inode.

10. Attach -- A popular page
Another process using the same shared-library page faults in the same place.
It discovers that the page is already in memory and attaches to the
page, increasing its inode reference count by one.

11. COW -- Making a private copy
If one of the processes sharing the page tries to write to it, a
copy-on-write (COW) page fault occurs. Another page is grabbed from
the free list, and a copy of the original is made. This new page
becomes part of a privately mapped segment backed by anonymous
storage (swap space) so it can be changed, but the original page is
unchanged and can still be shared. Shared libraries contain jump
tables in the code that are patched, using COW as part of the
dynamic linking process.

12. File Cache -- Not free
The entire window system exits, and both processes go away. This time
the page stays in use, attached to the inode of the shared library
file. The inode is now inactive but will stay in the inode cache
until it is reused, and the pages act as a file cache in case the
user is about to restart the window system again. The
change made in Solaris 8 was that the file cache is the tail of the
free list, and any file cache page can be reused immediately for
something else without needing to be scanned first.

13. fsflush -- Flushed by the sync
Every 30 seconds all the pages in the system are examined in physical page
order to see which ones contain modified data and are attached to a
vnode. The details differ between SunOS 4 and Solaris 2, but
essentially any modified pages will be written back to the
filesystem, and the pages will be marked as clean.

This example sequence can continue from Step 4 or
Step 9 with minor variations. The fsflush
process occurs every 30 seconds by default for all pages, and
whenever the free list size drops below a certain value, the
pagedaemon scanner wakes up and reclaims some pages. A
recent change in Solaris 10, backported to Solaris 8 and 9 patches,
makes fsflush run much more efficiently on machines with very large
amounts of memory. However, if you see fsflush using an excessive
amount of CPU time you should increase “autoup” in /etc/system
from its default of 30s, and you will see fsflush usage reduce
proportionately.

Now you know

I have seen this missing-memory question
asked about once a month since 1988! Perhaps the manual page for
vmstat should include a better explanation of what the
values are measuring. This answer is based on some passages from my
book Sun Performance and Tuning. The book explains in detail how the
memory algorithms work and how to tune them. However, the book doesn't cover the changes made in Solaris 8.

Tuesday, October 25, 2005

SunWorld Columns at ITworld

I found that the columns I wrote between 1995 and 1999 all seem to be online, but it's hard to find them, so I'm going to start by just listing everything I can find in its original form. BEWARE! Many of these articles were obsoleted by developments in later releases of Solaris, or refer to dead URLs, so I'll work through them in subsequent blog entries and provide commentary and updates.

2001/03: Collected Short Questions and Answers
1999/08: What does 100 percent busy mean?
1999/07: Disk Error Detection
1999/03: Digging into the details of WorkShop 5.0
1999/02: SyMON and SE get upgraded
1998/12: Out and about at conferences
1998/11: What's new in Solaris 7?
1998/10: IOwait, what's the holdup?
1998/09: Do collision levels accurately tell you the real story about your Ethernet?
1998/08: Unlocking the kernel
1998/07: Clearing up swap space confusion
1998/06: How busy is the CPU, really?
1998/05: Processor partitioning
1998/04: Prying into processes and workloads
1998/03: Sizing up memory in Solaris
1998/02: Perfmeter unmasked
1998/01: SE Toolkit FAQ
1997/12: At last! The updated SE release has arrived
1997/11: Performance perplexities: Help! Where do I start?
1997/10: Learn to performance tune your Java programs
1997/09: Clarifying disk measurements and terminology
1997/08: How does Solaris 2.6 improve performance stats and Web performance?
1997/07: Dissecting proxy Web cache performance
1997/06: Analysis of TCP transfer characteristics for Web servers made easier
1997/05: The memory go round
1997/04: Craving more books on Solaris? Look no further
1997/03: How to optimize caching file accesses
1997/02: Increase system performance by maximizing your cache
1997/01: Design your cache to match your applications
1996/12: Tips for TCP/IP monitoring and tuning
1996/11: The right disk configurations for servers
1996/10: Solving the iostat disk mystery
1996/09: Unveiling vmstat's charms
1996/08: What's the best way to probe processes?
1996/07: Monitor your Web server in realtime, Part 2
1996/07: How can I optimize my programs for UltraSPARC?
1996/06: How do disks really work?
1996/05: How much RAM is enough?
1996/04: Monitor your Web server in real time, Part 1
1996/04: How does swap space work?
1996/03: Watching your Web server
1996/02: Which is better, static or dynamic linking?
1996/01: What are the tunable kernel parameters for Solaris 2?
1995/12: Performance Q&A Compendium
1995/11: When is it faster to have 64 bits?
1995/10: Help! I've lost my memory!

Sunday, October 02, 2005

How busy is your CPU, really?

Just in case you thought that you could compare your CPU utilization data across Solaris releases I have a few words of caution.

To start with, there is the whole problem of interrupts: do they count as system time, or do they just make whatever they interrupted take longer?

Then there is the question of wait-for-io: who is waiting for which I/O? This is a form of idle time that tends to confuse people, as it doesn't really mean anything once you have more than one CPU.

There is one mechanism used to report the systemwide CPU utilization data. This data is reported in a per-CPU kstat data structure and is used by every tool that ever reports CPU usr, sys, wio, idle etc. Most tools sum the data over all the CPUs; mpstat gives you the per-CPU data. The form of the data is a number of ticks of CPU time that accumulates, starting at zero at boot time. To measure CPU utilization over a time interval, you measure the difference in the number of ticks and divide by the time and the tick rate. The tick rate is set by the clock interrupt; it defaults to 100Hz, but can be set to 1000Hz.
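
For example, something like this rough sketch (error handling omitted, compile with -lkstat) reads the per-CPU tick counters twice and turns the deltas into the usual percentages:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <kstat.h>
#include <sys/sysinfo.h>

/* sum the cpu_stat tick counters over all CPUs */
static void
sum_ticks(kstat_ctl_t *kc, unsigned long long t[CPU_STATES])
{
        kstat_t *ksp;
        cpu_stat_t cs;
        int i;

        memset(t, 0, CPU_STATES * sizeof (t[0]));
        for (ksp = kc->kc_chain; ksp != NULL; ksp = ksp->ks_next) {
                if (strcmp(ksp->ks_module, "cpu_stat") != 0)
                        continue;
                if (kstat_read(kc, ksp, &cs) == -1)
                        continue;
                for (i = 0; i < CPU_STATES; i++)
                        t[i] += cs.cpu_sysinfo.cpu[i];  /* ticks per state */
        }
}

int
main(void)
{
        kstat_ctl_t *kc = kstat_open();
        unsigned long long t1[CPU_STATES], t2[CPU_STATES], total = 0;
        int i;

        sum_ticks(kc, t1);
        sleep(5);
        (void) kstat_chain_update(kc);  /* CPUs can come and go */
        sum_ticks(kc, t2);
        for (i = 0; i < CPU_STATES; i++)
                total += t2[i] - t1[i];
        printf("usr %.1f%% sys %.1f%% wio %.1f%% idle %.1f%%\n",
            100.0 * (t2[CPU_USER] - t1[CPU_USER]) / total,
            100.0 * (t2[CPU_KERNEL] - t1[CPU_KERNEL]) / total,
            100.0 * (t2[CPU_WAIT] - t1[CPU_WAIT]) / total,
            100.0 * (t2[CPU_IDLE] - t1[CPU_IDLE]) / total);
        return (0);
}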

There are two mechanisms used to measure CPU usage. One is the old method of using the clock interrupt to see what is running on every 100Hz tick. This is low resolution, and since the clock wakes up jobs, it's possible for jobs to hide between the ticks, so it's a statistically biased measure.
The other mechanism is microstate accounting, where every change of state is timed with a hires clock. This was only used for tracking CPU usage by processes, and needed special calls to get the data in releases up to Solaris 9.

So how do those nice numbers you get from vmstat or your favourite tool vary?

Solaris 8 and earlier releases: interrupt time does not get classified as system time

Solaris 9: Interrupt time is counted as system time.

Solaris 10: wait-for-io time will always be zero. The ratio of confusion to enlightenment was too high, so the entire concept has been removed from Solaris. This is a good thing.

Solaris 10 initial release: Systemwide CPU time is now measured using microstates; this is far more accurate than ticks, but somehow the interrupt time ended up spread out rather than in system time.

This is unfortunate, but I'm told that an update of Solaris 10 will again classify interrupts as system time, and then it will all be just about as accurate as it can be.

You may be curious about the size of the errors in these measurements; the answer is that "it depends". There is an SE toolkit script called cpuchk.se that looks at the process data and compares the ticks with the microstate data. However, I don't have any measures of interrupt time differences.

SE toolkit 3.4 on Solaris 10 Opteron Workaround

The current 3.4 build of SE is available from the sunfreeware site and supports Solaris 8, 9 and 10 for SPARC and x86. It also includes full source code of the interpreter for the first time.
However, it only includes 32-bit x86 support, and when run on a 64-bit Solaris kernel on an Opteron system, it will fail to run. This is because the isalist command returns amd64 as the first word, rather than pentium. You can get it going with a workaround by changing the startup script /opt/RICHPse/bin/se to match the string "*pentium*" rather than "pentium*". This lets the 32-bit x86 binary run on the 64-bit amd64 system, and most of the scripts will still work. Some that try to access the kernel directly via kvm will fail, but most scripts use kstat, which doesn't need 64-bit accesses.
The real fix is to use the source to compile an amd64 build....

Monday, September 12, 2005

SunWorld Offline

The monthly columns I wrote for SunWorld Online between 1995 and 1999 seem to have gone offline. At one point after SunWorld shut down I got permission from the editors at ITworld to take back the content, remove the adverts, and host it on the Sun site. Since I left Sun in 2004 the link has been closed down. I think there is still some useful content and historical interest in the columns, which seem to have been copied to a few other places already. I'm going to dig them out of my archives, and repost them here, along with comments and perhaps a few corrections and updates.

The original idea for the column came from a featured article on the Sun homepage in 1995 where I talked about the release of my first book, Sun Performance and Tuning - SPARC and Solaris. This generated an offer from the editors at SunWorld Online to do a monthly Performance Q&A column. I thought it worked out very well. I got paid for the column, which forced me to write something every month, and I based the columns on questions from the readers, or parts of the original book that needed to be updated, or parts of the second edition that I was writing during this period.

The monthly column seemed to have the effect of positioning me as the expert to quite a wide public, and I'm sure it helped me get promoted to Distinguished Engineer in 1999. I stopped writing the column due to a combination of being too busy, running out of topics to write about, and the introduction of the BluePrint program.

In effect, several engineers had created books and columns in their spare time, and we worked with management to make it an official program for publishing Sun's best practices on how to configure products and combinations of products. This made it our day job, which was easier in some ways, and we had support staff and more detailed reviews, but we didn't get paid royalties for doing the BluePrint columns or books. After a few years and several books (the very first BluePrint, on Resource Management, and one on Capacity Planning) I moved on from the BluePrint group, but the program is still running and there is a large body of work there now.

There are over forty columns. I could post them en-masse unedited but I think they can already be found lurking in Google. It seems like more fun to try and remember what was going on at the time and make it an anecdotal reminiscence.

Monday, August 15, 2005

Solving storage tuning problems

I wrote a while ago about Dave Fisk's Ortera Atlas tool for storage analysis. I recently had a chance to use a beta release of Atlas on a real problem, and they are about to do a GA release; it's ready for prime time.

Like most tools, it can produce masses of numbers and graphs, but compared to other storage analysis tools I've seen it goes further in three ways:

1) It collects critically important data that is not provided by the OS
2) It processes the data to tell you exactly what is wrong
3) It runs heuristics to tell you how to fix the problem

I wish more tools spent this much effort on solving the actual problem rather than making pretty graphs that only an expert would understand.

What we actually did was run the tool on a pre-production Oracle system using Veritas Filesystem and Volume Manager with Solaris on a SAN connected to a Hitachi storage array. Atlas starts off by looking at all the active processes on the system, and ignoring any that are not doing any I/O. It collects data on which files are being read or written by which process, and what the pattern and sizes are at the system call, file system and device level. You can also set the tool to focus on a set of devices, and gather information on the processes that actually talk to those devices.

Atlas immediately pointed out that two volumes had been concatenated to form a filesystem, and that 98% of the accesses were to one of the volumes. It recommended that the volumes be striped together for better overall performance.

It also pointed out that some of the I/O accesses were taking two seconds to complete at the filesystem level, but only two milliseconds at the device level. I guessed this was CPU starvation caused by fsflush running flat out on this machine which had over 50GB of RAM. Adding set autoup=600 to /etc/system and rebooting made the problem go away. We also saw this effect in the terminal window, where our typing would stop echoing for a few seconds every now and again. I've been told by Sun that the very latest patches finally fix fsflush so that it can't use a lot of CPU time, so large memory machines will finally work properly without needing this tweak.

Finally Atlas showed that the filesystem block size was set too small and Oracle was doing large reads that were being chopped into smaller reads by the filesystem layer before being sent to the device. It gave a specific recommendation for the block size that should be used. Reconfiguring the disks takes a long time to do, but we'll fix it before it goes into production.

We could have figured out the concatenation problem using iostat data, but the other two problems are normally invisible, and the topic of what filesystem block size to use can generate masses of discussion and confusion, so having "virtual Dave Fisk" tell you what blocksize to use can save a lot of time :-)

Saturday, July 23, 2005

Extracct code posted on orcaware.com

Since the code I was writing to extract useful information from extended accounting logs is intended to end up as a data source for Orca, it made sense to host it on that site. I haven't had a chance to do any more development on it for a few months, so the source, including SPARC and x86 binaries, is now posted there. Thanks to Blair Zajac for the space.

extracct - extract useful data from extended Solaris accounting

If anyone feels inspired to develop it further, please send me any changes you make.

Cheers Adrian

Update Feb 2006: I just downloaded it and untar'd it with no problem, so try downloading it again if you have any problems; it should be about 54K long.

Adrian

Friday, July 15, 2005

Free performance and capacity tools

What free tools do you use for performance and capacity planning work? I'm particularly interested in figuring out combinations of tools that work well together.

Mario Jauvin and I are signed up to present a workshop this December at CMG05 in Florida on performance monitoring and capacity planning with free tools. I know a lot about some of the tools, but it would be useful to get some input on the relative usage level of what is available.

Here is a list of the kind of tools I'm interested in:

Orca, www.orcaware.com data collection using SE toolkit etc. and web display using rrdtool
BigBrother www.bb4.org and BigSister, bigsister.graeff.com which do you use and why?
OpenNMS, opennms.org/wiki Large scale network management system
Nagios, www.nagios.org/about Simpler and smaller scale network element management
SE Toolkit, www.sunfreeware.com/setoolkit.html Solaris specific instrumentation and tools
R, www.r-project.org statistical analysis and graphing package
MySQL, www.mysql.com database for storing large amounts of performance data
PDQ, www.perfdynamics.com toolkit for building analytical queueing models in C or Perl
Ganglia, ganglia.sourceforge.net large scale distributed system monitoring

Does the license model matter? There is a wide variety of licenses for free tools, from an absence of any license, through GPL to free use of commercially owned products. Do you use anything that is free, or does it have to meet some criteria?

I'm inviting a lot more input than I have before: add comments to this post or send me messages. I'll summarize, and I'll describe the tools in more detail and how they inter-relate in future postings.

Cheers Adrian

Wednesday, June 15, 2005

Playing with Fire

What would you do if your house burned down? Build another one, but that takes a while, so next time it catches fire you call the fire brigade. They get there in time to save some stuff, but you still have to rebuild it.

Since fires seem to happen rather often (bear with me on this, I'll get to my point eventually) you hire your own fireman. That's better: you are putting out fires early enough now that the house remains habitable more often.

But the fireman is only there during working hours, so you hire a team of firemen to give you 24x7 coverage, and tidy up the burned patches that appear every few days.

However, your spouse is not happy with the mess, the disruption and the cost of a team of firemen, and a nice salesman comes by and sells you a set of smoke detectors, alarm bells and water sprinklers, and you hire a cleanup crew with mops.

It's still messy and inconvenient, and everything is slightly soggy, so you ask around. You find that some people are having fires much less often than you do, and they think it's because they had their house checked over by a building standards inspector.

The inspector looked around your house and predicted the next few things that would start a fire - so you could fix them in advance. The inspector also advised you on how to build a new house that wouldn't catch fire so easily in the first place.

So my point is that if you are running a datacenter things will go wrong, and from a performance and capacity perspective rapid growth rates, fast changing applications and sudden changes in user activity levels can all put you into a fire fighting situation.
You can develop a fire fighting mentality, or a fire prevention mentality. One problem is that fire fighters are heroes, but when did you last hear the story of a heroic building inspector saving thousands of lives? Few people are fans of building inspectors, but they are mandated by local government to keep people safe.

The recent upsurge of interest in the Information Technology Infrastructure Library (ITIL) seems to be a response to the auditing requirements of Sarbanes-Oxley (SOX). ITIL specifies lots of best practices, for capacity planning amongst other things, as my co-author Bill Walker described in our Sun BluePrint book on Capacity Planning. So it may be SOX that gets you to look at ITIL, and helps you justify your building inspector status as the hero that helped pass a SOX audit, perhaps? OK, it's a stretch, but perhaps it helps get beyond the fire fighting mentality towards predictive capacity modelling.

So how well are you doing? One way to find out is to rate yourself on a maturity model, and there is a good paper on this in the current issue (3.06) of the CMG newsletter Measure-IT.

Friday, May 20, 2005

Performance monitoring with Hyperthreading

Hyperthreading is used by most recent Intel servers; Intel describe it here.

I tried searching for information on the performance impact and performance monitoring impact, and found several studies that describe improvements in terms of throughput that ranged from negative impact (a slowdown) to speedups of 15-20% per CPU. I didn't see much discussion of the effects on observability or of effects on response time, so that's what I'm going to concentrate on writing about.

First, to summarize the information that I could find, the early measurements of performance included most of the negative impact cases. This is due to two separate effects: software and hardware improvements mean that the latest operating systems on the most recent Intel Pentium 4 and Xeon chipsets have a larger benefit and minimise any negative effects.

From a software point of view, Intel advises that Hyperthreading should be disabled for anything other than Windows XP and recent releases of RedHat and SuSe Linux. Older operating systems are naive about Hyperthreading, and while they run, they don't schedule jobs optimally. I would add that older Solaris x86 releases are also unaware of Hyperthreading, and that Solaris 9 and 10 for x86 include optimized support. I have also been told that Linux 2.6 includes optimizations for Hyperthreading that have been backported to recent patches for Linux 2.4.

From a hardware point of view, the benefit of Hyperthreading increases as CPU clock rates and pipeline lengths increase. Longer waits for pipeline stalls due to memory references and branch mispredictions create larger "pipeline bubbles" for the second Hyperthread to run in. It is possible that there may also be improvements in the implementation of Hyperthreading at the silicon level as Intel learns from early experiences and improves its designs.

The fundamental problem with monitoring a Hyperthreaded system is that one of the most basic and common assumptions made by capacity planners is invalid. For many years the CPU utilization of a normal system has been used as a primary capacity metric since it is assumed to be linearly related to the capacity and throughput of a normal system that has not saturated. In other words, if you keep the CPU usage below 80-90% busy, you expect that the throughput at 60% busy is about twice the throughput at 30% busy for a constant workload.

Hyperthreaded CPUs are non-linear; in other words, they do not "scale" properly. A typical two-CPU Hyperthreaded system behaves a bit like two normal fast CPUs with two very slow CPUs which are only active when the system is busy. The OS will normally report it as a four-CPU system.

Whereas the service time (CPU used) for an average transaction remains constant on a normal CPU as the load increases, the service time for a Hyperthreaded CPU increases as the load increases.

If all you are looking at is the CPU %busy reported by the OS, you will see a Hyperthreaded system perform reasonably well up to 50% busy; then, as a small amount of additional load is added, the CPU usage shoots up and the machine saturates.

To be clear, the Hyperthreads do normally increase the peak throughput of the system and often provide a performance benefit, but they make it extremely difficult to manage the capacity of a system.
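
To illustrate the shape of the problem, here is a toy model (the numbers are invented for illustration, not calibrated against any real chip): treat each physical core as one unit of capacity, and the second hyperthread on a core as an extra 20% that only helps once the first thread on that core is already busy.

/* toy model of a 2-core, 4-logical-CPU Hyperthreaded box: reported %busy
 * is linear in the number of busy logical CPUs, but throughput is not
 */
#include <stdio.h>

int main(void) {
        double ht_gain = 0.2;   /* assumed benefit of the second thread */
        int logical;

        for (logical = 1; logical <= 4; logical++) {
                double reported_busy = 100.0 * logical / 4;
                /* the first two busy logical CPUs land on separate cores,
                 * the next two only add the hyperthread increment
                 */
                double throughput = (logical <= 2) ?
                    logical : 2 + (logical - 2) * ht_gain;
                printf("reported %5.1f%% busy -> relative throughput %.1f\n",
                    reported_busy, throughput);
        }
        return (0);
}

In this toy example the box delivers its first 2.0 units of throughput getting to 50% reported busy, but the remaining 50% of reported "capacity" only adds another 0.4.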

I've been looking at calibration methods and models that can deal with the capacity measurement problem, and will discuss them in future posts.

Thursday, May 12, 2005

Seeing storage problems

Change of topic...

Those of you who have dealt with Sun storage performance over the years might have heard of Dave Fisk. We became good friends when I worked together with him in the same group many years ago; then he went to work for the Sun storage business unit, and finally he left Sun to build a storage performance analysis product. After a lot of work over several years he has finally got his product ready, and it cracks open the storage stack so you can see what is really happening.

Dave's company is called ORtera and they have a downloadable beta release available, for which you can get a free one-month evaluation license.

If you ever wondered why the I/O traffic your application (e.g. Oracle) is generating doesn't seem to match up with the stats you see in iostat, then this tool will show you why. For example, small filesystem writes can cause larger read-modify-write operations to occur, and the tool will show you this, point it out with a heuristic, and suggest a filesystem tuning fix.

It's a useful addition to the toolbox; I think they should have called it "virtual Dave" :-)

Wednesday, May 04, 2005

extraccting on SPARC

I made it with no problems on SPARC Solaris 10 (using /usr/sfw/bin/gcc) and it seems to be working fine. The binary doesn't work on Solaris 8 because it's looking for a later version (SUNW_1.2) of libexacct.so.1, and I don't have a compiler on this Solaris 8 machine, so that's enough testing for today...

extracct 1.0

I cleaned up the build directory and renamed the program extracct - extract useful info from exacct - and it's been running fine on Solaris 10/x64 for a while. If anyone wants to try it out let me know and I'll email a tarfile; I haven't yet got around to testing it on SPARC. I built it on Opteron using gmake/gcc from /usr/sfw/bin.

I'm also wondering how best to make it more generally available, perhaps via orcaware.com or sunfreeware.com, or I could post it to BigAdmin or the OpenSolaris source code as an updated demo. Any opinions out there?

Thursday, April 28, 2005

Pesky style sheet format.....

Sorry that some of the lines are wrapped and others run off the edge, I'm still figuring out how to make the composition tools and template do what I want, rather than what they want....

Automated rotating exdump

I've added an option that automates the processing of exacct data to text file format and rotates the logs. It is hard-wired to create a new log file in /var/adm/exacct (which is an empty directory created during Solaris install as standard), and it includes the hostname and the date+time in the new file name.

The previous log name is used as the base of the output file, which is written to a specified directory with a .txt extension.

The file /etc/acctadm.conf is maintained by the acctadm command. I read the log file names from it, and if there is no log file I don't start one, i.e. you need to manually start accounting with acctadm the first time to decide which logs you want to generate. The command syntax now looks like this:

# ./exdump
Usage: exdump [-vwr] [ file | -a dir ]
-v: verbose
-w: wracct all processes first
-r: rotate logs
-a dir: use acctadm.conf to get input logs, and write output files to dir

The usual way to run the command will be from cron, as shown:

0 * * * * /opt/exdump/exdump -war /var/tmp/exacct > /dev/null 2>&1
2 * * * * /bin/find /var/adm/exacct -ctime +7 -exec rm {} \;

This also shows how to clean up old log files; I only delete the binary files in this example, and I created /var/tmp/exacct to hold the text files. The process data in the text file looks like this:

timestamp locltime duration procid ppid uid usr sys majf rwKB vcxK icxK sigK sycK arMB mrMB command
1114734370 17:26:10 0.0027 16527 16526 0 0.000 0.002 0 0.53 0.00 0.00 0.00 0.1 0.7 28.9 acctadm
1114734370 17:26:10 0.0045 16526 16525 0 0.000 0.001 0 0.00 0.00 0.00 0.00 0.1 1.1 28.9 sh
1114734370 17:26:10 0.0114 16525 8020 0 0.001 0.005 0 1.71 0.00 0.00 0.00 0.3 1.0 28.9 exdump
1109786952 10:09:12 -1.0000 0 0 0 0.012 118.883 131 0.00 0.33 0.00 0.00 0.0 0.0 0.0 sched
1109786952 10:09:12 -1.0000 0 0 0 0.012 118.883 131 0.00 0.33 0.00 0.00 0.0 0.0 0.0 sched
1109786952 10:09:12 -1.0000 0 0 0 0.012 118.883 131 0.00 0.33 0.00 0.00 0.0 0.0 0.0 sched
1109786959 10:09:19 -1.0000 1 0 0 4.311 3.066 96 47504.69 49.85 0.18 0.34 456.2 0.9 1.0 init
1109786959 10:09:19 -1.0000 2 0 0 0.000 0.000 0 0.00 0.00 0.00 0.00 0.0 0.0 0.0 pageout

Since the -w option forces long-running processes to write an accounting record before they have finished, I chose to mark this by showing the duration as -1. One odd thing shown here is that the sched process generated three identical accounting records; this needs some more investigation.

I'm going to run this for a while, and install it on a few other machines including SPARC Solaris 10 and Solaris 8 to make sure it's robust.



Monday, April 18, 2005

Flow accounting printout

I went to edit the code and realized that I hadn't finished some previous work, so I added a printout of the flow accounting data; an example is shown below.

The timestamp is the start of the flow; I translate it to the local timezone and show that as well. I difference the two timestamps to get the duration of the flow in seconds; add the uid and project id (which are only present for outbound flows); show the protocol; decode the IP addresses and look them up as names if possible, along with their port numbers; list the number of packets and bytes in the flow; and end with the filter name from the ipqos definition.

timestamp  locltime dur   uid  proj prot           srcip:port            dstip:port  npkts  nbytes aname
1113850483 11:54:43 13 0 0 tcp 66.102.15.100:80 crun:55783 21 4148 acct
1113850483 11:54:43 13 100 10 tcp crun:55783 66.102.15.100:80 13 15082 acct
1113850483 11:54:43 13 0 0 tcp crun:55783 66.102.15.100:80 6 240 acct
1113850480 11:54:40 16 100 10 tcp crun:55782 66.102.15.101:80 3 1253 acct
1113850480 11:54:40 16 0 0 tcp crun:55782 66.102.15.101:80 7 280 acct

Thursday, April 14, 2005

Sequencing exacct logs

I decided to simplify the next step and to just get raw data logged in a useful manner. To do this there needs to be a periodic process that generates readable log files. After thinking about this for a while, I think I have a simple and effective way to do it.

I will modify the exdump code to add another option "-a". This will invoke acctadm to switch to a new set of log files, with a datestamped name. Before it does the switch, it will need to cause all current processes to write accounting entries, so "-a" will normally be used in conjunction with the "-w" option I added already. After the log switch, the old log files will be processed from their binary form to a text file with one record per line, ready for consumption by further processing steps.
When "-a" is specified, a directory can be specified on the command line to hold output files, the input filename does not need to be specified as it was for the original version. This makes it easy to invoke exdump directly from cron without needing a wrapper script.

The sequence is:

Obtain existing log files by running acctadm or reading /etc/acctadm.conf
Note which logs are being generated and their filenames
Run wracct on all existing processes
Run acctadm to switch to new log files with date-time-stamped names
Process existing log files to text form in the current directory or a
specified directory, using the same base filenames

To start exacct the user needs to manually run acctadm once to enable the logs of interest (flow, task or proc); then exdump will generate the output logs at the cron interval. A find command can be used from cron to remove old logs and keep disk space managed.

Since many commands run on the hour, it might be worth starting exdump a few seconds early and having the file timestamp round up to the next minute, so that any command sequences started from cron on the hour are less likely to get split up.
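
The rounding itself is trivial; something like this fragment (time.h assumed), wherever the new log file name gets built:

        /* round the current time up to the start of the next whole minute */
        time_t now = time(NULL);
        time_t stamp = ((now + 59) / 60) * 60;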

This still leaves the problem of processing individual accounting records into useful data, but at least the one line per record text form with interval data will be more tractable than the binary file.

Friday, April 01, 2005

Writing accounting records at time intervals

A major new feature of the exacct system is the ability to get an accounting record logged without terminating the process. There are two forms of this: for tasks, you can get the record to dump the delta since the last record was logged. Somehow the task remembers the data each time it cuts a record so it can do the differencing. This seems to be too much overhead at the process level, so the other option is to cut a record that logs the same data as if the process had just exited, and this option is available for both tasks and processes.

The command that causes a record to be written is "wracct"; it takes a list of process or task ids and makes a system call to write the record. You have to be root to do this. The wracct command line syntax is a pain if you want to get it to dump multiple processes, as shown in this example from the manpage:

# /usr/sbin/wracct -i "`pgrep sendmail`" process

I want to make every process cut a record, and if you attempt to do this with wracct you need to form a list of every process id on the system. I tried to do this by listing all the entries in the /proc filesystem, but if any of the pids exit before wracct gets to them it gets an error and quits. This is stupid, because if a process has exited, it has already cut an accounting record! The wracct command should have a "-a" option that writes all records and ignores errors.

I modified the exdump command to have a "-w" option that loops over all processes and forces an accounting record to be written before it reads the accounting file. If you aren't root, it has no effect. The code looks like this:


if (wflag) {
        DIR *dirp;
        struct dirent *dp;

        /* Walk /proc and ask the kernel to write a partial accounting
         * record for every process. Errors are deliberately ignored, so
         * a pid that exits before we get to it is harmless. Note that
         * the "." and ".." entries parse as pid 0 via atoi().
         */
        dirp = opendir("/proc");
        do {
                if ((dp = readdir(dirp)) != NULL)
                        (void) wracct(P_PID, atoi(dp->d_name), EW_PARTIAL);
        } while (dp != NULL);
        (void) closedir(dirp);
}


The next step is a bit more complex. Both for flow accounting and interval process accounting, I need code that remembers previous records and matches them with subsequent ones so that they can be accumulated (flows) or differenced (process intervals).
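
For the process intervals I have in mind something like this (just a sketch of the idea: prev_lookup and prev_store stand in for whatever lookup table I end up writing, and the fields come from the ex_proc structure I defined earlier):

        /* difference the cumulative counters in successive partial records
         * for the same process, keyed by pid plus start time to catch
         * reused pids
         */
        struct ex_proc *prev = prev_lookup(cur->pid);
        double user, sys;

        if (prev != NULL && prev->proc_start_sec == cur->proc_start_sec) {
                /* same process as the last interval: usage since then */
                user = (cur->cpu_user_sec + cur->cpu_user_nsec / 1e9) -
                    (prev->cpu_user_sec + prev->cpu_user_nsec / 1e9);
                sys = (cur->cpu_sys_sec + cur->cpu_sys_nsec / 1e9) -
                    (prev->cpu_sys_sec + prev->cpu_sys_nsec / 1e9);
                /* ...and the same differencing for the other counters... */
        } else {
                /* first sighting of this pid: the cumulative record is
                 * itself the usage so far
                 */
                user = cur->cpu_user_sec + cur->cpu_user_nsec / 1e9;
                sys = cur->cpu_sys_sec + cur->cpu_sys_nsec / 1e9;
        }
        prev_store(cur->pid, cur);      /* remember this record for next time */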

Thursday, March 24, 2005

Extracting useful exacct process data

I modified the /usr/demo/libexacct/exdump.c example code to include the data structures described in previous posts, and made the code update each item in the structure. Then I added a printf from the data structure so I could check that all the data is being captured correctly. Part of the output is shown below. I hand edited the column headers a bit for the first two lines to make them line up.


procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1693 1679 100 40.23 3.53 416930.11 237 0 58745.32 261.17 4.22 0.00 801.7 38.4 44.6 mozilla-bin
procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1679 1647 100 0.00 0.00 416930.16 0 0 15.11 0.02 0.00 0.00 0.6 0.6 26.4 run-mozilla.sh
procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1647 1646 100 0.00 0.01 416930.30 0 0 8.29 0.03 0.00 0.00 0.8 0.4 26.4 mozilla



Running the modified exdump with the -v option shows the data objects in detail so I can check that the right data is appearing in the right place. Here is the same data as the second line above. It looks as if the minor faults counter really is zero, which is probably a Solaris bug.


100 group-proc [group of 35 object(s)]
1000 pid 1679
1001 uid 100
1002 gid 10 staff
1004 projid 10 group.staff
1003 taskid 54
100b cpu-user-sec 0
100c cpu-user-nsec 2861447
100d cpu-sys-sec 0
100e cpu-sys-nsec 4150875
1007 start-sec 1109810711 03/02/05 16:45:11
1008 start-nsec 813223175
1009 finish-sec 1110227641 03/07/05 12:34:01
100a finish-nsec 968607249
1006 command "run-mozilla.sh"
100f tty-major 4294967295
1010 tty-minor 4294967295
1005 hostname "crun"
1011 faults-major 0
1012 faults-minor 0
1014 msgs-snd 0
1013 msgs-recv 0
1015 blocks-in 0
1016 blocks-out 0
1017 chars-rdwr 15105
1018 ctxt-vol 21
1019 ctxt-inv 0
101a signals 0
101b swaps 0
101c syscalls 586
101d acctflags 0
101f ppid 1647
1020 wait-status 0 exit
1021 zone "global"
1022 memory-rss-avg-k 584
1023 memory-rss-max-k 27004
procid ppid uid usr sys time majf minf rwKB vcxK icxK sigK sycK arMB mrMB command
1679 1647 100 0.00 0.00 416930.16 0 0 15.11 0.02 0.00 0.00 0.6 0.6 26.4 run-mozilla.sh

Wednesday, March 16, 2005

An exercise in complexity....

Time for a grumble....

My plan was to take the libexacct.so API and expose it as an SE toolkit class. After looking at what it would take to do this I have come to the conclusion that the data structure definitions and API for reading the data are too complex.

The design is so abstract that it seems that reading meaningful data out of the log file is some obscure side effect of the code. You can read the data, but there is no guarantee that any specific item of data will be present. The accounting system has various options to send more or less data to the file, so it needs to be flexible, but the important thing is the meaning of the data being logged. I care about the semantic and informational content of the data source. What I get from exacct is "there are some tagged typed objects in this file". I can't consume the data without making assumptions about it, and the API doesn't embody those assumptions.

Some of the data being reported is useless (blocks in and blocks out are archaic measures that are always zero) and other stuff is missing - like the good microstate information on CPU wait, page-in wait etc.

I'm going to take the exdump.c code and turn it into a library that exports a sane and simple set of abstractions (like the data structures I defined in earlier posts).

Monday, March 14, 2005

Data structures and objects

The exacct data file is a complex tagged object format that is read via the libexacct library routines. While generic and flexible, it is a pain to get at the data. There are two demo programs that display the information: I used /usr/demo/libexacct/exdump to print out the information shown earlier, and there is also a Perl library and a script called dumpexacct.pl. It displays the tags and types like this:


GROUP
Catalog = EXT_GROUP|EXC_DEFAULT|EXD_GROUP_PROC
ITEM
Catalog = EXT_UINT32|EXC_DEFAULT|EXD_PROC_PID
Value = 1904
ITEM
Catalog = EXT_UINT32|EXC_DEFAULT|EXD_PROC_UID
Value = 25
...


I used this information to define a data structure that will be populated with the data from the file as the first processing step. I left the tags as comments, and defined reasonable amounts of fixed space for strings. The task and flow structures are similar.


struct ex_proc {                    // EXT_GROUP|EXC_DEFAULT|EXD_GROUP_PROC
        uint32_t pid;               // EXT_UINT32|EXC_DEFAULT|EXD_PROC_PID
        uint32_t uid;               // EXT_UINT32|EXC_DEFAULT|EXD_PROC_UID
        uint32_t gid;               // EXT_UINT32|EXC_DEFAULT|EXD_PROC_GID
        uint32_t projid;            // EXT_UINT32|EXC_DEFAULT|EXD_PROC_PROJID
        uint32_t taskid;            // EXT_UINT32|EXC_DEFAULT|EXD_PROC_TASKID
        uint64_t cpu_user_sec;      // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_USER_SEC
        uint64_t cpu_user_nsec;     // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_USER_NSEC
        uint64_t cpu_sys_sec;       // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_SYS_SEC
        uint64_t cpu_sys_nsec;      // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CPU_SYS_NSEC
        uint64_t proc_start_sec;    // EXT_UINT64|EXC_DEFAULT|EXD_PROC_START_SEC
        uint64_t proc_start_nsec;   // EXT_UINT64|EXC_DEFAULT|EXD_PROC_START_NSEC
        uint64_t proc_finish_sec;   // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FINISH_SEC
        uint64_t proc_finish_nsec;  // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FINISH_NSEC
        char command[PRFNSZ];       // EXT_STRING|EXC_DEFAULT|EXD_PROC_COMMAND
        uint32_t tty_major;         // EXT_UINT32|EXC_DEFAULT|EXD_PROC_TTY_MAJOR
        uint32_t tty_minor;         // EXT_UINT32|EXC_DEFAULT|EXD_PROC_TTY_MINOR
#define EX_SYS_NMLN 40              // SYS_NMLN = 257 - too much
        char hostname[EX_SYS_NMLN]; // EXT_STRING|EXC_DEFAULT|EXD_PROC_HOSTNAME
        uint64_t major_faults;      // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FAULTS_MAJOR
        uint64_t minor_faults;      // EXT_UINT64|EXC_DEFAULT|EXD_PROC_FAULTS_MINOR
        uint64_t messages_snd;      // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MESSAGES_SND
        uint64_t messages_rcv;      // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MESSAGES_RCV
        uint64_t blocks_in;         // EXT_UINT64|EXC_DEFAULT|EXD_PROC_BLOCKS_IN
        uint64_t blocks_out;        // EXT_UINT64|EXC_DEFAULT|EXD_PROC_BLOCKS_OUT
        uint64_t chars_rdwr;        // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CHARS_RDWR
        uint64_t vctx;              // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CONTEXT_VOL
        uint64_t ictx;              // EXT_UINT64|EXC_DEFAULT|EXD_PROC_CONTEXT_INV
        uint64_t signals;           // EXT_UINT64|EXC_DEFAULT|EXD_PROC_SIGNALS
        uint64_t swaps;             // EXT_UINT64|EXC_DEFAULT|EXD_PROC_SWAPS
        uint64_t syscalls;          // EXT_UINT64|EXC_DEFAULT|EXD_PROC_SYSCALLS
        uint32_t acct_flags;        // EXT_UINT32|EXC_DEFAULT|EXD_PROC_ACCT_FLAGS
        uint32_t ppid;              // EXT_UINT32|EXC_DEFAULT|EXD_PROC_ANCPID
        uint32_t wait_status;       // EXT_UINT32|EXC_DEFAULT|EXD_PROC_WAIT_STATUS
#define EX_ZONENAME 64
        char zonename[EX_ZONENAME]; // EXT_STRING|EXC_DEFAULT|EXD_PROC_ZONENAME
        uint64_t mem_rss_avg_k;     // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MEM_RSS_AVG_K
        uint64_t mem_rss_max_k;     // EXT_UINT64|EXC_DEFAULT|EXD_PROC_MEM_RSS_MAX_K
};
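
Filling the structure in is then just a matter of switching on the data id of each item as the group is walked, in the same style as the demo code. Roughly like this (a sketch showing a few of the 35 tags, assuming the EXD_DATA_MASK macro and the item accessors from the exacct headers):

/* copy one item "o" from an exacct process group into struct ex_proc "p" */
switch (o->eo_catalog & EXD_DATA_MASK) {
case EXD_PROC_PID:
        p->pid = o->eo_item.ei_uint32;
        break;
case EXD_PROC_UID:
        p->uid = o->eo_item.ei_uint32;
        break;
case EXD_PROC_CPU_USER_NSEC:
        p->cpu_user_nsec = o->eo_item.ei_uint64;
        break;
case EXD_PROC_COMMAND:
        (void) strlcpy(p->command, o->eo_item.ei_string, sizeof (p->command));
        break;
/* ...and so on for the remaining tags... */
}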

Capture Ratio and measurement overhead

For performance monitoring applications, we often want to know the process-related information, but collecting it from /proc is very expensive compared to collecting other performance data, and the amount of work increases as the number of processes increases.

The capture ratio is defined as the amount of CPU time that is gathered by looking at process data versus the total systemwide CPU used. The difference is made up by processes that stop during the intervals between measurements. Since short lived processes may start and stop between measurements, and we don't know whether a process stopped immediately before a measurement or just after a measurement, there is always an error in sampled process measures. The error is reduced by using a short measurement interval, but that increases overhead. Bypassing the /proc interface, and reading process data directly from the kernel is very implementation dependent but is used by BMC's PATROL® for Unix - Perform & Predict data collector, so that they can collect process data efficiently on large systems at high data rates.
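
As a made-up example: if the process records captured over an interval add up to 160 seconds of CPU time while the systemwide counters show 200 CPU-seconds consumed, the capture ratio is 160/200 = 80%, and the missing 20% was used by processes that started and exited between samples.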

By watching traditional SysV accounting records that fall between measures, some heuristics can be used to improve the capture ratio. However, the SysV accounting record does not include the process id, so this is an inexact technique. The TeamQuest® View and Model data collector uses this trick.

With extended accounting, we can use wracct to force a record of all current processes to be written to the accounting file, along with the processes that terminate. This gives us a perfect capture ratio, even at infrequent measurement intervals, so the overhead of process data collection is extremely low.

There is no need for a performance collection agent; a cron script can invoke wracct at the desired measurement interval. Another cron script can use acctadm to switch logs to a new file, and process the old file locally or ship it to a central location as required.

That, in a nutshell, is why extended accounting is interesting: very good quality data, a perfect capture ratio, and very low measurement overhead.

Partial and Interval accounting records

Traditional accounting generates a record only when the process terminates. The current process data from ps contains different information, so it's not possible to keep the two in sync. Extended accounting includes the wracct command that forces an accounting record to be generated for a process or task. The partial record is tagged differently, as shown below, but contains the same information.


106 group-proc-partial [group of 35 object(s)]
1000 pid 664
...


For a process, partial records provide the total resource usage since process creation. For a task, an additional option allows for resource usage over the interval since the previous record was written.

Thursday, March 10, 2005

Data from task accounting record

A task is a group of related processes; when the last one exits, a task record is written. The information is very similar to the process record.


ff group-header [group of 4 object(s)]
1 version 1
2 filetype "exacct"
3 creator "SunOS"
4 hostname "crun"
101 group-task [group of 25 object(s)]
2000 taskid 61
2001 projid 1 user.root
2007 cpu-user-sec 0
2008 cpu-user-nsec 0
2009 cpu-sys-sec 0
200a cpu-sys-nsec 0
2003 start-sec 1109844060 03/03/05 02:01:00
2004 start-nsec 907341842
2005 finish-sec 1109844060 03/03/05 02:01:00
2006 finish-nsec 925962473
2002 hostname "crun"
200b faults-major 2
200c faults-minor 0
200e msgs-snd 0
200d msgs-recv 0
200f blocks-in 3
2010 blocks-out 1
2011 chars-rdwr 13483
2012 ctxt-vol 12
2013 ctxt-inv 1
2014 signals 0
2015 swaps 0
2016 syscalls 666
2018 anctaskid 29
2019 zone "global"

Data from process accounting

The process accounting record is far more detailed than the standard SysV acct record used by most Unix-based systems. For a start, it includes the pid of the process and the pid of the parent process, so you can stitch the records together properly. The Solaris project and task ids let you manage and control workloads effectively, and since microstate accounting is on by default in Solaris 10, the CPU usage numbers are accurate and high resolution. By default everything is in the global zone. Zones are the virtual machine containers used for fault isolation and resource management in Solaris 10, so data needs to be separated by zone as well as by workload.


ff group-header [group of 4 object(s)]
1 version 1
2 filetype "exacct"
3 creator "SunOS"
4 hostname "crun"
100 group-proc [group of 34 object(s)]
1000 pid 1748
1001 uid 0 root
1002 gid 0 root
1004 projid 1 user.root
1003 taskid 56
100b cpu-user-sec 0
100c cpu-user-nsec 417258
100d cpu-sys-sec 0
100e cpu-sys-nsec 989267
1007 start-sec 1109813617 03/02/05 17:33:37
1008 start-nsec 139893535
1009 finish-sec 1109813617 03/02/05 17:33:37
100a finish-nsec 152386048
1006 command "acctadm"
100f tty-major 24
1010 tty-minor 4
1011 faults-major 0
1012 faults-minor 0
1014 msgs-snd 0
1013 msgs-recv 0
1015 blocks-in 0
1016 blocks-out 5
1017 chars-rdwr 481
1018 ctxt-vol 2
1019 ctxt-inv 0
101a signals 0
101b swaps 0
101c syscalls 94
101d acctflags 2 SU
101f ppid 1568
1020 wait-status 0 exit
1021 zone "global"
1022 memory-rss-avg-k 996
1023 memory-rss-max-k 29636

Data logged by flow accounting

The data comes in two forms: outgoing traffic is tagged with the userid and project of the initiating process, but incoming traffic is missing this information. Since TCP flows are captured in pairs they need to be matched up. The output from the provided demo program /usr/demo/libexacct/exdump -v is shown below.

These match if the src and dest addresses and ports are reversed.


ff group-header [group of 4 object(s)]
1 version 1
2 filetype "exacct"
3 creator "SunOS"
4 hostname "crun"
109 group-flow [group of 11 object(s)]
3000 src-addr-v4 a.b.c.d
3001 dest-addr-v4 e.f.g.h crun
3004 src-port 80
3005 dest-port 43727
3006 protocol 6 tcp
3007 diffserv-field 0
300a creation-time 1110482732 03/10/05 11:25:32
300b last-seen 1110482734 03/10/05 11:25:34
3008 total-bytes 3447
3009 total-packets 10
300e action-name "acct"
109 group-flow [group of 13 object(s)]
3000 src-addr-v4 e.f.g.h crun
3001 dest-addr-v4 a.b.c.d
3004 src-port 43727
3005 dest-port 80
3006 protocol 6 tcp
3007 diffserv-field 0
300a creation-time 1110482732 03/10/05 11:25:32
300b last-seen 1110482734 03/10/05 11:25:34
3008 total-bytes 4561
3009 total-packets 5
300c projid 10
300d uid 100
300e action-name "acct"

Wednesday, March 09, 2005

IPQoS Configuration

I found a config file that logs data to the accounting system without
filtering it much; it basically filters by protocol. I just take tcp
and pass it to the flow accounting module. You could add udp and other protocols if needed.

This is installed using


# ipqosconf -a exacct.qos

where exacct.qos is written as follows:

fmt_version 1.0

action {
    module ipgpc
    name ipgpc.classify
    params {
        global_stats true
    }
    filter {
        name tcpfilter
        protocol tcp
        class allclass
    }
    class {
        name allclass
        next_action acct
        enable_stats true
    }
}
action {
    module flowacct
    name acct
    params {
        global_stats true
        timer 10000
        timeout 10000
        max_limit 2048
        next_action continue
    }
}

Wednesday, March 02, 2005

Configuring IPQoS for flow accounting

I spent some time today working my way through the manuals. What I want is to just have the accounting information about the network flows, with no complex QoS rules and minimum overhead. So far I got it configured, but I'm getting no output. No error messages, but no output - so my machine is giving me some sulky passive-aggressive treatment :-(

I'm working on a Sun W2100z, which is a dual-CPU Opteron system. It came with Solaris 9, but I got Solaris 10 loaded on it by feeding it a bunch of CDs. The hardest part was configuring the graphics display; it defaulted to something horrible with 256 colors. I had to use kdmconfig, but it took me a while to figure out a) that fbconfig is only for SPARC and that there is an x86 command with a different name, b) that the system I have has a certain kind of NVidia card in it, c) what display resolution my non-Sun LCD screen supports, and d) that if I scrolled up the screen I could find options with 16M colors that look right.

It's a nice box, not too big, fairly quiet, fast.

Saturday, February 26, 2005

Grid overview presentation at ACM

I gave this talk at the SF Bay ACM meeting last October.

They posted the slides, which were the first part of my half-day tutorial at CMG04 in December. The talk went well, was well attended, and I caught up with some old friends afterwards.

Extended Accounting in Solaris 8-10

This is a very under-used facility introduced in Solaris 8. Solaris 10
extends it so that you can get TCP flow accounting as well as extended
process accounting info. It also adds a Perl interface.

Unlike traditional Unix accounting - which is low overhead but stores
very poor quality information - extended accounting can save data at
fixed time intervals (as well as at process termination). It can be
configured to store detailed information, and in Solaris 10 since
microstate accounting is enabled for all processes by default you can
get the microstate details for everything with no extra overhead. I'll
be digging into the details of how to configure and process this data
soon.

Friday, February 25, 2005

Motivation

I started this blog, but didn't get far with it last summer. I just updated it and aim to be much more active, as I'm collecting information on free performance and capacity planning tools and want a place to share it.

For those that know me from my books, I left Sun last summer and started work at eBay in the Operations Architecture group in September 2004.