Source: http://constantin.glez.de/blog/2011/01/my-favorite-oracle-solaris-performance-analysis-commands
Let's see how we can quickly answer the question: Do I have enough CPU power?
In the old days of single-core, single-CPU systems, we fired up top
and watched the system load value, or the top processes' CPU
percentage. But in today's multi-CPU, multi-core world, this doesn't
work anymore. The old concept of "load" is now misleading and quite
useless if you want to assess whether your system has enough CPU power
or not.
Here's a more modern way:
constant@fridolin:~$ vmstat 5
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
0 0 0 446144 130076 23 100 0 1 3 0 12 7 -0 13 0 465 1352 1137 6 12 82
0 0 0 405376 90808 33 41 0 0 0 0 0 39 0 3 0 514 500 571 4 11 85
0 0 0 405296 90536 0 0 0 0 0 0 0 29 0 1 0 502 778 551 4 10 86
...
(Remember to ignore the first line of the output: it reports averages since boot, not the current sampling interval.)
Now watch the rightmost column (id), which is the system idle time in
percent. Is it bigger than 0 most of the time? Then you have enough CPU
power. It's that simple. If idle time is 0 most of the time, buy a
bigger CPU; if not, look elsewhere.
The above system has enough CPU: It's idle more than 80% of its time
so even if something runs slow, it can't be the CPU in this case.
(Yes, life can be more complex than that, but remember, we're talking
about a cheat sheet here. This is the most useful approach for a
majority of cases.)
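If you'd rather not eyeball the idle column, a few lines of script can average it for you. The following is a minimal Python sketch, not part of the original cheat sheet: the function name `average_idle` and the idea of feeding it captured `vmstat 5` text are assumptions made for illustration. It treats the last field of every numeric row as the idle percentage and drops the first sample.

```python
def average_idle(vmstat_text, skip_first=True):
    """Average the rightmost (id) column over the samples in vmstat output."""
    idle_values = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        try:
            numbers = [int(field) for field in fields]
        except ValueError:
            continue  # header rows and "..." are not purely numeric
        if numbers:
            idle_values.append(numbers[-1])
    if skip_first:
        idle_values = idle_values[1:]  # the first line is not a real interval
    if not idle_values:
        raise ValueError("no usable vmstat samples found")
    return sum(idle_values) / len(idle_values)


# The three samples from the vmstat run shown above.
VMSTAT_CPU_SAMPLE = """\
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
0 0 0 446144 130076 23 100 0 1 3 0 12 7 -0 13 0 465 1352 1137 6 12 82
0 0 0 405376 90808 33 41 0 0 0 0 0 39 0 3 0 514 500 571 4 11 85
0 0 0 405296 90536 0 0 0 0 0 0 0 29 0 1 0 502 778 551 4 10 86
"""

print(average_idle(VMSTAT_CPU_SAMPLE))  # 85.5
```

An average well above 0, as here, matches the "enough CPU" verdict; an average near 0 over a longer capture would be the signal to look at the CPU.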
How's My Memory Doing?
Now that we've ruled out "not enough CPU horsepower" as the
bottleneck, let's look at the next layer: RAM. Do we have enough RAM? Or
is the system starving for more memory, possibly resorting to using
slow disks as a poor substitute for RAM? Again,
constant@fridolin:~$ vmstat 5
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
2 0 7 6472 30620 6 85 108 392 546 3060 2617 143 0 111 0 839 408 14606 3 40 57
0 0 7 8360 33960 10 51 89 155 1910 1816 19090 187 0 52 0 883 529 9512 5 36 59
0 0 7 12548 42948 19 48 66 215 215 1080 0 121 0 70 0 737 340 10273 3 31 66
1 0 7 13612 39916 38 90 106 0 0 632 0 171 0 56 0 900 616 10160 5 29 66
4 0 7 8060 29528 10 47 55 0 383 232 5514 112 0 77 0 854 739 6665 4 26 70
0 0 7 7312 38468 3 9 15 234 1500 0 17073 33 0 47 0 580 349 3993 2 25 73
0 0 7 8960 39460 17 46 55 0 0 0 0 101 0 37 0 744 529 7870 3 27 70
2 0 7 8836 37020 6 31 46 0 0 0 0 87 0 87 0 749 418 6033 3 20 77
is our friend. This time, let's look at three values: swap, free and sr (or: scan rate):
- swap: This is the amount of free virtual memory, in kilobytes.
- free: This is the amount of free physical memory, in kilobytes.
- scan rate (sr): This is the number of memory pages per second that
the page scanner examines while cleaning up, freeing the lesser
used memory pages to make room for data that needs to be allocated from
physical memory.
Again, the old adage was: If memory is full, you need more of it. But
today that's misleading: Modern operating systems tend to use up as much
memory as they can, to get the most out of your hard-earned RAM bucks.
For example, ZFS uses as much free memory as possible as a read cache to
save you from spending precious IOPS on disks. So if the "free mem"
column in top is small, this is actually a good sign: It means that your RAM is doing useful stuff.
A better question to ask here: Is my memory system in trouble? That's
what the scan rate value is telling us: The bigger this value, the more
stressed our memory subsystem is, because the OS is more and more busy
scanning memory pages for expendable chunks so it can fulfill a high
demand for fresh memory. If the scan rate is a single-digit value most of
the time, you're ok. If it shows large values over extended periods of
time, you'll likely benefit from some extra RAM in your system.
In the second vmstat
example above, I created extra stress for the memory system by starting
a ZFS scrub (filling up RAM), starting OpenOffice with a large
presentation and asking GIMP to set up a new 8k x 8k picture for me.
That resulted in some samples showing more than a thousand page scans.
That's certainly a situation where more RAM would have come in handy.
The system was unusable, although the CPU was still 57% to 77% idle.
(Again, there's a lot more detail that we don't cover here, but we
don't want to make this post bigger than a good bedtime reading, do we?)
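The scan rate rule of thumb is easy to script as well. Here is a rough Python sketch under a few assumptions: the names `scan_rates` and `pressured_fraction` are made up for this example, the sr column is located through the header row rather than a fixed position, and "single digit" is translated into a threshold of 10.

```python
def scan_rates(vmstat_text):
    """Return the sr column of every data row in vmstat output."""
    sr_index = None
    rates = []
    for line in vmstat_text.splitlines():
        fields = line.split()
        if "sr" in fields:
            sr_index = fields.index("sr")  # this is the column header row
            continue
        if sr_index is None or len(fields) <= sr_index:
            continue  # group header ("kthr memory ...") or blank line
        try:
            rates.append(int(fields[sr_index]))
        except ValueError:
            continue
    return rates


def pressured_fraction(rates, threshold=10):
    """Fraction of samples whose scan rate is not a single-digit value."""
    if not rates:
        raise ValueError("no samples")
    return sum(1 for rate in rates if rate >= threshold) / len(rates)


# The first four samples from the stressed vmstat run shown above.
VMSTAT_MEM_SAMPLE = """\
kthr memory page disk faults cpu
r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
2 0 7 6472 30620 6 85 108 392 546 3060 2617 143 0 111 0 839 408 14606 3 40 57
0 0 7 8360 33960 10 51 89 155 1910 1816 19090 187 0 52 0 883 529 9512 5 36 59
0 0 7 12548 42948 19 48 66 215 215 1080 0 121 0 70 0 737 340 10273 3 31 66
1 0 7 13612 39916 38 90 106 0 0 632 0 171 0 56 0 900 616 10160 5 29 66
"""

rates = scan_rates(VMSTAT_MEM_SAMPLE)
print(rates)                      # [2617, 19090, 0, 0]
print(pressured_fraction(rates))  # 0.5
```

What fraction counts as "extended periods" is a judgment call; the sketch only reports the number and leaves the verdict to you.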
The nice thing about vmstat
is that with just one command, you can easily assess if the CPU and RAM
situation is ok or not, then move on to the next layer.
Or Is There a Disk Problem?
Now it gets interesting. Most if not all of the performance problems I
see are disk I/O related, and there's no indication that this is about
to change.
You can get a quick overview of your IO situation by using:
constant@fridolin:~$ iostat -xzn 5
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
8.2 6.2 163.8 90.0 0.5 0.2 35.4 13.1 8 10 c3d0
1.4 12.2 30.0 81.4 0.1 0.2 8.9 13.0 3 7 c6t0d0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
126.6 33.1 1613.0 400.3 3.5 1.6 21.9 9.8 75 81 c3d0
0.0 19.7 0.0 40.7 0.6 0.1 28.6 7.5 14 15 c6t0d0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
33.4 2.0 242.5 14.4 7.1 2.0 200.0 56.4 100 100 c3d0
0.0 15.8 0.0 39.4 2.3 0.5 148.2 31.3 49 49 c6t0d0
Again, looking at simple performance numbers like reads/writes per
second or even kilobytes read/written per second doesn't tell you much.
Are 126 reads fast? Or too slow? Wow, 1613k read per second. That's a
lot! Is it? Wait, what disks am I using again? (Answer: The above is a
Solaris 11 Express system running on VirtualBox on my 3-year-old Mac.)
A more interesting figure to look at is wait:
This is the number of IO operations that are waiting to be serviced. In
other words: "wait" tells you the waiting queue length. If your queue
length looks like the one in front of an Apple store on the day of the
introduction of the new iPhone, you need to work on your disks (Here are a few suggestions if you use ZFS). If the wait value is in the single-digit range, then your problem may be elsewhere.
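To compare queue lengths across several samples, you can collect the largest wait value per device. This is a minimal Python sketch, assuming the column layout of the `iostat -xzn` output shown above (wait is the fifth field, the device name the eleventh); the function name `peak_wait` is hypothetical.

```python
def peak_wait(iostat_text):
    """Map each device to the largest wait value seen in iostat -xzn output."""
    peaks = {}
    for line in iostat_text.splitlines():
        fields = line.split()
        if len(fields) != 11:
            continue  # "extended device statistics" banner or blank line
        try:
            wait = float(fields[4])  # r/s w/s kr/s kw/s wait ...
        except ValueError:
            continue  # the column header row has "wait" in this position
        device = fields[10]
        peaks[device] = max(wait, peaks.get(device, 0.0))
    return peaks


# The three samples from the iostat run shown above.
IOSTAT_SAMPLE = """\
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
8.2 6.2 163.8 90.0 0.5 0.2 35.4 13.1 8 10 c3d0
1.4 12.2 30.0 81.4 0.1 0.2 8.9 13.0 3 7 c6t0d0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
126.6 33.1 1613.0 400.3 3.5 1.6 21.9 9.8 75 81 c3d0
0.0 19.7 0.0 40.7 0.6 0.1 28.6 7.5 14 15 c6t0d0
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
33.4 2.0 242.5 14.4 7.1 2.0 200.0 56.4 100 100 c3d0
0.0 15.8 0.0 39.4 2.3 0.5 148.2 31.3 49 49 c6t0d0
"""

print(peak_wait(IOSTAT_SAMPLE))  # {'c3d0': 7.1, 'c6t0d0': 2.3}
```

Both disks stay in the single-digit range here, with c3d0 getting close to the top of it in the last sample.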
Sometimes you want a more application-level view into your IO situation, and that is what the following command is about:
admin@krengi:~$ fsstat -F 5
new name name attr attr lookup rddir read read write write
file remov chng get set ops ops ops bytes ops bytes
0 0 0 0 0 0 0 0 0 0 0 ufs
0 0 0 0 0 0 0 0 0 0 0 proc
0 0 0 0 0 0 0 0 0 0 0 nfs
0 0 0 68 0 43 0 0 0 9 1.06K zfs
0 0 0 0 0 0 0 0 0 0 0 lofs
0 0 0 0 0 0 0 0 0 0 0 tmpfs
0 0 0 0 0 0 0 0 0 0 0 mntfs
0 0 0 0 0 0 0 0 0 0 0 nfs3
0 0 0 0 0 0 0 0 0 0 0 nfs4
0 0 0 0 0 0 0 0 0 0 0 autofs
(I threw away the first batch of data, which is always useless.) Or, if the number of filesystems you're interested in is limited:
admin@krengi:~$ fsstat zfs 5
new name name attr attr lookup rddir read read write write
file remov chng get set ops ops ops bytes ops bytes
2.08M 613K 171K 7.68G 2.25M 10.0G 43.3M 1.09G 1.97T 189M 638G zfs
0 0 0 74 0 79 0 35 608 18 860 zfs
0 0 0 67 0 39 0 0 0 1 112 zfs
0 0 0 71 0 73 0 1 4 1 112 zfs
This is another great way of quickly having a look at what's up with your disk IO.
Are your users creating lots of files? Or are they
modifying/removing/changing attributes a lot? What filesystems are
causing the most IO load? How much IO goes through NFS and how much is
local? All these questions can be easily answered with fsstat and a few flags.
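The "which filesystem causes the most load" question can be answered by summing the operation columns per filesystem type. The Python sketch below is an illustration only, with hypothetical names (`parse_count`, `ops_by_filesystem`). It assumes the twelve-column layout shown above, and it assumes the K/M/G/T suffixes are binary multiples, which you should verify against your fsstat man page before trusting the byte figures.

```python
# Assumption: fsstat's size suffixes are treated as binary multiples here.
SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

# Operation-count columns; read bytes (8) and write bytes (10) are left out.
OP_COLUMNS = (0, 1, 2, 3, 4, 5, 6, 7, 9)


def parse_count(text):
    """Convert an fsstat field such as '68' or '1.06K' to a number."""
    if text[-1] in SUFFIXES:
        return float(text[:-1]) * SUFFIXES[text[-1]]
    return float(text)


def ops_by_filesystem(fsstat_text):
    """Sum the operation columns per filesystem type over all rows given."""
    totals = {}
    for line in fsstat_text.splitlines():
        fields = line.split()
        if len(fields) != 12:
            continue  # the two header rows have only 11 fields
        try:
            values = [parse_count(field) for field in fields[:11]]
        except ValueError:
            continue
        name = fields[11]
        totals[name] = totals.get(name, 0) + sum(values[i] for i in OP_COLUMNS)
    return totals


# A few rows from the fsstat -F batch shown above (first batch already dropped).
FSSTAT_SAMPLE = """\
new name name attr attr lookup rddir read read write write
file remov chng get set ops ops ops bytes ops bytes
0 0 0 0 0 0 0 0 0 0 0 ufs
0 0 0 0 0 0 0 0 0 0 0 nfs
0 0 0 68 0 43 0 0 0 9 1.06K zfs
0 0 0 0 0 0 0 0 0 0 0 tmpfs
"""

totals = ops_by_filesystem(FSSTAT_SAMPLE)
print(max(totals, key=totals.get))  # zfs
```

Feed it only the batches you care about: the first batch holds totals since boot and would swamp everything else.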
Checking Out the Network
Finally, if your problem is neither on the CPU nor on the memory nor
on the disk IO side, it may lie outside of your system, perhaps at the
networking level. Again, there's a favorite command that gets me a
useful picture most of the time. For example, while streaming some video
on my home server, I checked the effect on the network with this:
admin@krengi:~$ netstat -I e1000g0 5
input e1000g output input (Total) output
packets errs packets errs colls packets errs packets errs colls
417683472 4 384816503 0 0 420603019 4 387736050 0 0
5779 0 3282 0 0 5779 0 3282 0 0
6487 0 3556 0 0 6487 0 3556 0 0
3672 0 2351 0 0 3673 0 2352 0 0
Notice that netstat counts packets here, not MB/s. Network
performance analysis and tuning is a science of its own, but with this
command you can quickly assess what each networking interface is doing,
and whether its packet counts are in the right ballpark. Maybe
you have multiple network interfaces configured, but still all your data
is sent through the same pipe?
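Since the counts are per sample, dividing by the interval gives packets per second, which is easier to compare between runs. The following is a small Python sketch, with the assumptions spelled out: the function name `packet_rates` is hypothetical, the first numeric row is taken to be the total since boot, and every later row is taken to hold the packets counted during one interval.

```python
def packet_rates(netstat_text, interval):
    """Return (input, output) packets per second for each interval sample."""
    samples = []
    for line in netstat_text.splitlines():
        fields = line.split()
        try:
            numbers = [int(field) for field in fields]
        except ValueError:
            continue  # header rows contain words such as "packets"
        if len(numbers) == 10:
            # Fields 0 and 2 are the interface's input and output packets.
            samples.append((numbers[0], numbers[2]))
    # Assumption: the first row is the running total since boot, so drop it.
    return [(inp / interval, out / interval) for inp, out in samples[1:]]


# The samples from the netstat run shown above.
NETSTAT_SAMPLE = """\
input e1000g output input (Total) output
packets errs packets errs colls packets errs packets errs colls
417683472 4 384816503 0 0 420603019 4 387736050 0 0
5779 0 3282 0 0 5779 0 3282 0 0
6487 0 3556 0 0 6487 0 3556 0 0
3672 0 2351 0 0 3673 0 2352 0 0
"""

for inbound, outbound in packet_rates(NETSTAT_SAMPLE, interval=5):
    print(f"in {inbound:.0f} pkt/s, out {outbound:.0f} pkt/s")
```

This still says nothing about MB/s: without packet sizes, packet counts cannot be turned into throughput.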
Digging Deeper
So that's it for my performance cheat sheet: vmstat for CPU and memory, iostat with the -xzn flags and fsstat for disk IO, and good old netstat -I
for the network. This is the 20% effort solution, the minimum effective
set of commands that will get you a quick overview of a system in 80%
of the cases.
Now for that other 20% of more complicated cases, you will need some
extra digging. If you want to learn more, here are a few useful
pointers:
- The Solaris Internals Wiki has a great page about CPU/Processor Analysis.
- dim_STAT is a complete
toolset for collecting and analyzing system performance. It can
generate either a high-level overview or a deep-down analysis of a system.
- Jörg wrote a nice article about fsstat, and he promised a little series about
*stat articles. Jörg, why don't you continue your series with some of your favorite tools? That would be cool!