Adrian Cockcroft's Blog: October 2005

Originally published in Unix Insider 10/1/95
Stripped of adverts, url references fixed and comments added in red type to bring it up to date ten years later.

Dear Adrian, After a reboot I saw that most of my computer's memory was
free, but when I launched my application it used up almost all the
memory. When I stopped the application the memory didn't come back!
Take a look at my vmstat output:

% vmstat 5
procs     memory            page            disk          faults      cpu
r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id

This is before the program starts:

0 0 0 330252 80708   0   2  0  0  0  0  0  0  0  0  1   18  107  113  0  1 99
0 0 0 330252 80708   0   0  0  0  0  0  0  0  0  0  0   14   87   78  0  0 99

I start the program and it runs like this for a while:

0 0 0 314204  8824   0   0  0  0  0  0  0  0  0  0  0  414  132   79 24  1 74
0 0 0 314204  8824   0   0  0  0  0  0  0  0  0  0  0  411   99   66 25  1 74

I stop it, then almost all the swap space comes back, but the free memory does not:

0 0 0 326776 21260   0   3  0  0  0  0  0  0  1  0  0  420  116   82  4  2 95
0 0 0 329924 24396   0   0  0  0  0  0  0  0  0  0  0  414   82   77  0  0 100
0 0 0 329924 24396   0   0  0  0  0  0  0  0  2  0  1  430   90   84  0  1 99

I checked that there were no application processes running. It looks like a huge memory leak in the operating system. How can I get my memory back?
--RAMless in Ripon

Update

This remains one of the most frequently asked questions of all time. The
original answer is still true for many Unix variants. However while
writing his book on Solaris Internals, Richard McDougall worked out
how to fix Solaris to make it work better, and to make this aparrent
problem go away. The result was one of the most significant
performance improvements in Solaris 8, but the first edition of his
book was written before Solaris 8 came out, so doesn't describe the
fix!

The short answer

Launch your application again. Notice that it starts up more quickly than it did the first time, and with less disk activity. The application code and its data files are still
in memory, even though they are not active. The memory they occupy is
not "free." If you restart the same application it finds
the pages that are already in memory. The pages are attached to the
inode cache entries for the files. If you start a different
application, and there is insufficient free memory, the kernel will
scan for pages that have not been touched for a long time, and "free"
them. Once you quit the first application, the memory it occupies is
not being touched, so it will be freed quickly for use by other
applications.

In 1988, Sun introduced this feature in SunOS 4.0. It still applies to
all versions of Solaris 1 and 2. The kernel is trying to avoid disk
reads by caching as many files as possible in memory. Attaching to a
page in memory is around 1,000 times faster than reading it in from
disk. The kernel figures that you paid good money for all of that
RAM, so it will try to make good use of it by retaining files you
might need.

Since Solaris 8, the memory in the file cache is actually also on the free
list, so you do see vmstat free memory reduce when you quit a
program. You also should expect large amounts of file I/O to cause
high scan rates on older Solaris releases, and for there to be no
scanning at all on Solaris 8 systems. If Solaris 8 scans at all, then
it has truly run out of memory and is overloaded.

By contrast, Memory leaks appear as a shortage of swap space after the
misbehaving program runs for a while. You will probably find a
process that has a larger than expected size. You should restart the
program to free up the swap space, and check it with a debugger that
offers a leak-finding feature (run it with the libumem version of the malloc library that instruments memory leaks).

The long (and technical) answer

To understand how Sun's operating systems handle memory, I will explain how the inode cache works, how the buffer cache fits into the picture, and how the life
cycle of a typical page evolves as the system uses it for several
different purposes.

The inode cache and file data caching

Whenever you access a file, the kernel needs to know the size, the access permissions,
the date stamps and the locations of the data blocks on disk.
Traditionally, this information is known as the inode for the file.
There are many filesystem types. For simplicity I will assume we are
only interested in the Unix filesystem (UFS) on a local disk. Each
filesystem type has its own inode cache.

The filesystem stores inodes on the disk; the inode must be read into
memory whenever an operation is performed on an entity in the
filesystem. The number of inodes read per second is reported as
iget/s by the sar -a command. The inode read from disk is cached in case
it is needed again, and the number of inodes that the system will
cache is influenced by a kernel parameter called ufs_ninode.
The kernel keeps inodes on a linked list, rather than in a fixed-size
table.

As I mention each command I will show you what the output looks like. In
my case I'm collecting sar
data automatically using cron.
sar, which defaults to
reading the stored data for today. If you have no stored data,
specify a time interval and sar
will show you current activity.

% sar -a

SunOS hostname 5.4 Generic_101945-32 sun4c    09/18/95

00:00:01  iget/s namei/s dirbk/s
01:00:01       4       6       0

All reads or writes to UFS files occur by paging from the filesystem. All
pages that are part of the file and are in memory will be attached to
the inode cache entry for that file. When a file is not in use, its
data is cached in memory, using an inactive inode cache entry. When
the kernel reuses an inactive inode cache entry that has pages
attached, it puts the pages on the free list; this case is shown by
sar -g as %ufs_ipf.
This number is the percentage of UFS inodes that were overwritten in
the inode cache by iget and
that had reusable pages associated with them. The kernel flushes the
pages, and updates on disk any modified pages. Thus, this %ufs_ipf
number is the percentage of igets with page flushes. Any non-zero
values of %ufs_ipf reported by sar -g
indicate that the inode cache is too small for the current workload.

% sar -g

SunOS hostname 5.4 Generic_101945-32 sun4c    09/18/95

00:00:01  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
01:00:01     0.02     0.02     0.08     0.12     0.00

For SunOS 4 and releases up to Solaris 2.3, the number of inodes that the
kernel will keep in the inode cache is set by the kernel variable
ufs_ninode. To simplify: When a file is opened, an inactive
inode will be reused from the cache if the cache is full; when an
inode becomes inactive, it is discarded if the cache is over-full. If
the cache limit has not been reached then an inactive inode is placed
at the back of the reuse list and invalid inodes (inodes for files
that longer exist) are placed at the front for immediate reuse. It is
entirely possible for the number of open files in the system to cause
the number of active inodes to exceed ufs_ninode; raising
ufs_ninode allows more inactive inodes to be cached in case
they are needed again.

Solaris 2.4 uses a more clever inode cache algorithm. The kernel maintains a
reuse list of blank inodes for instant use. The number of active
inodes is no longer constrained, and the number of idle inodes
(inactive but cached in case they are needed again) is kept between
ufs_ninode and 75 percent of ufs_ninode by a new
kernel thread that scavenges the inodes to free them and maintains
entries on the reuse list. If you use sar -v to look at the inode cache, you may see a larger
number of existing inodes than the reported "size."

% sar -v

SunOS hostname 5.4 Generic_101945-32 sun4c    09/18/95

00:00:01  proc-sz    ov  inod-sz    ov  file-sz    ov   lock-sz
01:00:01   66/506     0 2108/2108    0  353/353     0    0/0

Buffer cache

The buffer cache is used to cache filesystem
data in SVR3 and BSD Unix. In SunOS 4, generic SVR4, and Solaris 2,
it is used to cache inode, indirect block, and cylinder group blocks
only. Although this change was introduced in 1988, many people still
incorrectly think the buffer cache is used to hold file data. Inodes
are read from disk to the buffer cache in 8-kilobyte blocks, then the
individual inodes are read from the buffer cache into the inode
cache.

Life cycle of a typical physical memory page

This section provides additional insight into the way memory is used. The sequence
described is an example of some common uses of pages; many other
possibilities exist.

1. Initialization -- A page is born: When the system boots, it forms all free memory into pages, and allocates
a kernel data structure to hold the state of every page in the
system.
2. Free -- An untouched virgin page: All the memory is put onto the free list to start with. At this stage the
content of the page is undefined.
3. ZFOD -- Joining an uninitialized data segment: When a program accesses data that is preset to zero for the very first
time, a minor page fault occurs and a Zero Fill On Demand (ZFOD)
operation takes place. The page is taken from the free list,
block-cleared to contain all zeroes, and added to the list of
anonymous pages for the uninitialized data segment. The program then
reads and writes data to the page.
4. Scanned -- The pagedaemon awakes: When the free list gets below a certain size, the pagedaemon starts to
look for memory pages to steal from processes. It looks at all pages
in physical memory order; when it gets to the page, the page is
synchronized with the memory management unit (MMU) and a reference
bit is cleared.
5. Waiting -- Is the program really using this page right now?: There is a delay that varies depending upon how quickly the pagedaemon
scans through memory. If the program references the page during this
period, the MMU reference bit is set.
6. Pageout Time -- Saving the contents: The pageout daemon returns and checks the MMU reference bit to find that
the program has not used the page so it can be stolen for reuse. The
pagedaemon checks to see if anything had been written to the page;
if it contains no data, a page-out occurs. The page is moved to the
pageout queue and marked as I/O pending. The swapfs code clusters
the page together with other pages on the queue and writes the
cluster to the swap space. The page is then free and is put on the
free list again. It remembers that it still contains the program
data.
7. Reclaim -- Give me back my page!: Belatedly, the program tries to read the page and takes a page fault. If the
page had been reused by someone else in the meantime, a major fault
would occur and the data would be read from the swap space into a
new page taken from the free list. In this case, the page is still
waiting to be reused, so a minor fault occurs, and the page is moved
back from the free list to the program's data segment.
8. Program Exit -- Free again: The program finishes running and exits. The data segments are private to
that particular instance of the program (unlike the shared-code
segments), so all the pages in the data segment are marked as
undefined and put onto the free list. This is the same state as Step
2.
9. Page-in -- A shared code segment: A page fault occurs in the code segment of a window system shared library.
The page is taken off the free list, and a read from the filesystem
is scheduled to get the code. The process that caused the page fault
sleeps until the data arrives. The page is attached to the inode of
the file, and the segments reference the inode.
10. Attach -- A popular page: Another process using the same shared-library page faults in the same place.
It discovers that the page is already in memory and attaches to the
page, increasing its inode reference count by one.
11. COW -- Making a private copy: If one of the processes sharing the page tries to write to it, a
copy-on-write (COW) page fault occurs. Another page is grabbed from
the free list, and a copy of the original is made. This new page
becomes part of a privately mapped segment backed by anonymous
storage (swap space) so it can be changed, but the original page is
unchanged and can still be shared. Shared libraries contain jump
tables in the code that are patched, using COW as part of the
dynamic linking process.
12. File Cache -- Not free: The entire window system exits, and both processes go away. This time
the page stays in use, attached to the inode of the shared library
file. The inode is now inactive but will stay in the inode cache
until it is reused, and the pages act as a file cache in case the
user is about to restart the window system again. The
change made in Solaris 8 was that the file cache is the tail of the
free list, and any file cache page can be reused immediately for
something else without needing to be scanned first.
13. fsflush -- Flushed by the sync: Every 30 seconds all the pages in the system are examined in physical page
order to see which ones contain modified data and are attached to a
vnode. The details differ between SunOS 4 and Solaris 2, but
essentially any modified pages will be written back to the
filesystem, and the pages will be marked as clean.
This example sequence can continue from Step 4 or
Step 9 with minor variations. The fsflush
process occurs every 30 seconds by default for all pages, and
whenever the free list size drops below a certain value, the
pagedaemon scanner wakes up and reclaims some pages. A
recent change in Solaris 10, backported to Solaris 8 and 9 patches,
makes fsflush run much more efficiently on machines with very large
amounts of memory. However, if you see fsflush using an excessive
amount of CPU time you should increase “autoup” in /etc/system
from its default of 30s, and you will see fsflush usage reduce
proportionately.

Now you know

I have seen this missing-memory question
asked about once a month since 1988! Perhaps the manual page for
vmstat should include a better explanation of what the
values are measuring. This answer is based on some passages from my
book Sun Performance and Tuning. The book explains in detail how the
memory algorithms work and how to tune them. However the book doesn't cover the changes made in Solaris 8.

Just in case you thought that you could compare your CPU utilization data across Solaris releases I have a few words of caution.

To start with there is the whole problem of interrupts, do they count as system time, or do they just make whatever they interrupted take longer?

Then there is the question of wait-for-io, and who is waiting for which io? This is a form of idle time that tends to confuse people as it doesn't really mean anything once you have more than one CPU.

There is one mechanism used to report the systemwide CPU utilization data. This data is reported in a per-cpu kstat data structure and is used by every tool that ever reports CPU usr, sys, wio, idle etc. Most tools sum the data over all the CPUs, mpstat gives you the per CPU data. The form of the data is a number of ticks of CPU time that accumulates, starting at zero at boot time. To measure CPU utilization over a time interval, you measure the difference in the number of ticks and divide by the time, and the tick rate. The tick rate is set by the clock interrupt, it defaults to 100Hz, but can be set to 1000Hz.

There are two mechanisms used to measure CPU usage, one is the old method of using the clock interrupt to see what is running every 100Hz. This is low resolution, and since the clock wakes up jobs, its possible for jobs to hide between the ticks, so its a statistically biased measure.
The other mechanism is microstate accounting, where every change of state is timed with a hires clock. This was only used for tracking CPU usage by processes, and needed special calls to get the data in releases up to Solaris 9.

So how do those nice numbers you get from vmstat or your favourite tool vary?

Solaris 8 and earlier releases: interrupt time does not get classified as system time

Solaris 9: Interrupt time is counted as system time.

Solaris 10: wait-for-io time will always be zero. The ratio of confusion to enlightenment was too high, so the entire concept has been removed from Solaris. This is a good thing.

Solaris 10 initial release: systemwide CPU time is now measured using microstates, this is far more accurate than ticks, but somehow the interrupt time ended up spread out rather than in system time.

This is unfortunate, but I'm told that an update of Solaris 10 will again classify interrupts as system time, and then it will all be just about as accurate as it can be.

You may be curious about the size of errors in these measurements, the answer is that "it depends". There is an SE toolkit script called cpuchk.se that looks at the process data and compares the ticks with the microstate data. However, I don't have any measures of interrupt time differences.

Adrian Cockcroft's Blog

Archive

Monday, October 31, 2005

Help! I've lost my memory! Updated Sunworld Column

Tuesday, October 25, 2005

SunWorld Columns at ITworld

Sunday, October 02, 2005

How busy is your CPU, really?

SE toolkit 3.4 on Solaris 10 Opteron Workaround