My Favorite Oracle Solaris Performance Analysis Commands

In the old days of single-core, single-CPU systems, we fired up top and watched the system load value, or the top processes' CPU percentage. But in today's multi-CPU, multi-core world, this doesn't work anymore. The old concept of "load" is now misleading and of little use if you want to assess whether your system has enough CPU power.

constant@fridolin:~$ vmstat 5
 kthr     memory                 page                   disk         faults        cpu
 r b w   swap   free re  mf  pi  po   fr   de    sr  cd s0  s2 --  in   sy    cs us sy id
 0 0 0 446144 130076 23 100   0   1    3    0    12   7 -0  13  0 465 1352  1137  6 12 82
 0 0 0 405376  90808 33  41   0   0    0    0     0  39  0   3  0 514  500   571  4 11 85
 0 0 0 405296  90536  0   0   0   0    0    0     0  29  0   1  0 502  778   551  4 10 86
...

(Remember to ignore the first line of the output: it reports averages accumulated since boot rather than the current interval.)

Now watch the rightmost column (id), which is the system idle time in percent. Is it bigger than 0 most of the time? Then you have enough CPU power. It's that simple. If idle time is 0 most of the time, buy a bigger CPU; if not, look elsewhere.

The above system has enough CPU: It's idle more than 80% of its time so even if something runs slow, it can't be the CPU in this case.

(Yes, life can be more complex than that, but remember, we're talking about a cheat sheet here. This is the most useful approach for a majority of cases.)
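If you'd rather let a script do the watching, the rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the sample text is the output shown above, and the 50% cut-off for "most of the time" is my own assumption, not a Solaris rule.

```python
VMSTAT_SAMPLE = """\
 kthr memory page disk faults cpu
 r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
 0 0 0 446144 130076 23 100 0 1 3 0 12 7 -0 13 0 465 1352 1137 6 12 82
 0 0 0 405376 90808 33 41 0 0 0 0 0 39 0 3 0 514 500 571 4 11 85
 0 0 0 405296 90536 0 0 0 0 0 0 0 29 0 1 0 502 778 551 4 10 86
"""


def idle_percentages(vmstat_output: str) -> list[int]:
    """Return the id column of every sample except the first one."""
    samples = []
    for line in vmstat_output.splitlines():
        fields = line.split()
        # Header lines start with a word, data lines with a number.
        if not fields or not fields[0].isdigit():
            continue
        samples.append(int(fields[-1]))  # id is the rightmost column
    return samples[1:]  # drop the first line, as recommended above


def cpu_starved(idle: list[int], threshold: float = 0.5) -> bool:
    """True if idle time is 0 in more than `threshold` of the samples.

    The threshold is an assumed stand-in for "most of the time".
    """
    if not idle:
        return False
    zero_samples = sum(1 for value in idle if value == 0)
    return zero_samples / len(idle) > threshold


idle = idle_percentages(VMSTAT_SAMPLE)
print("idle samples:", idle, "CPU starved:", cpu_starved(idle))
```

On a live system you would feed it the captured output of vmstat instead of the sample string.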

How's My Memory Doing?

Now that we've ruled out "not enough CPU horsepower" as the bottleneck, let's look at the next layer: RAM. Do we have enough RAM? Or is the system starving for memory, possibly resorting to slow disks as a poor substitute for RAM? Again, vmstat is our friend:

constant@fridolin:~$ vmstat 5
 kthr     memory                 page                   disk         faults        cpu
 r b w   swap   free re  mf  pi  po   fr   de    sr  cd s0  s2 --  in   sy    cs us sy id
 2 0 7   6472  30620  6  85 108 392  546 3060  2617 143  0 111  0 839  408 14606  3 40 57
 0 0 7   8360  33960 10  51  89 155 1910 1816 19090 187  0  52  0 883  529  9512  5 36 59
 0 0 7  12548  42948 19  48  66 215  215 1080     0 121  0  70  0 737  340 10273  3 31 66
 1 0 7  13612  39916 38  90 106   0    0  632     0 171  0  56  0 900  616 10160  5 29 66
 4 0 7   8060  29528 10  47  55   0  383  232  5514 112  0  77  0 854  739  6665  4 26 70
 0 0 7   7312  38468  3   9  15 234 1500    0 17073  33  0  47  0 580  349  3993  2 25 73
 0 0 7   8960  39460 17  46  55   0    0    0     0 101  0  37  0 744  529  7870  3 27 70
 2 0 7   8836  37020  6  31  46   0    0    0     0  87  0  87  0 749  418  6033  3 20 77

This time, let's look at three values: swap, free and sr (the scan rate).

Again, the old adage was: If memory is full, you need more of it. But today this is misleading: Modern operating systems tend to use as much memory as they can, so you get the most out of the money you spent on RAM. For example, ZFS uses as much free memory as possible as a read cache to save you from spending precious IOPS on disks. So if the "free mem" column in top is small, this is actually a good sign: It means that your RAM is doing useful stuff.

A better question to ask here: Is my memory system in trouble? That's what the scan rate value tells us: The bigger this value, the more stressed the memory subsystem is, because the OS is increasingly busy scanning memory pages for expendable ones so it can satisfy a high demand for fresh memory. If the scan rate is a single-digit value most of the time, you're ok. If it shows large values over extended periods of time, you'll likely benefit from some extra RAM in your system.

In the second vmstat example above, I created extra stress for the memory system by starting a ZFS scrub (filling up RAM), starting OpenOffice with a large presentation and asking GIMP to set up a new 8k x 8k picture for me. That resulted in several samples showing scan rates in the thousands, with a peak above 19,000. That's certainly a situation where more RAM would have come in handy. The system was unusable, although the CPU was more than 50% idle in every sample.

(Again, there's a lot more detail that we don't cover here, but we don't want to make this post bigger than a good bedtime reading, do we?)
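The scan rate check can be sketched the same way. Treat the following Python as a rough illustration only: the function names are hypothetical, the sample is taken from the stressed system above, and both the single-digit limit of 9 and the "a quarter of the samples" cut-off are assumptions I picked to mirror the rule of thumb.

```python
VMSTAT_STRESSED = """\
 kthr memory page disk faults cpu
 r b w swap free re mf pi po fr de sr cd s0 s2 -- in sy cs us sy id
 2 0 7 6472 30620 6 85 108 392 546 3060 2617 143 0 111 0 839 408 14606 3 40 57
 0 0 7 8360 33960 10 51 89 155 1910 1816 19090 187 0 52 0 883 529 9512 5 36 59
 0 0 7 12548 42948 19 48 66 215 215 1080 0 121 0 70 0 737 340 10273 3 31 66
 1 0 7 13612 39916 38 90 106 0 0 632 0 171 0 56 0 900 616 10160 5 29 66
 4 0 7 8060 29528 10 47 55 0 383 232 5514 112 0 77 0 854 739 6665 4 26 70
"""


def scan_rates(vmstat_output: str) -> list[int]:
    """Return the sr column of every sample except the first one."""
    lines = [line.split() for line in vmstat_output.splitlines() if line.strip()]
    # Locate sr by name, because the number of disk columns can vary.
    header = next(fields for fields in lines if "sr" in fields)
    column = header.index("sr")
    rows = [fields for fields in lines if fields[0].isdigit()]
    return [int(fields[column]) for fields in rows[1:]]


def under_memory_pressure(rates: list[int], limit: int = 9,
                          threshold: float = 0.25) -> bool:
    """True if more than `threshold` of the samples exceed single digits.

    Both `limit` and `threshold` are assumed values, not Solaris defaults.
    """
    if not rates:
        return False
    high = sum(1 for rate in rates if rate > limit)
    return high / len(rates) > threshold


rates = scan_rates(VMSTAT_STRESSED)
print("peak scan rate:", max(rates), "pressure:", under_memory_pressure(rates))
```

Looking the column up by its header name instead of a fixed position keeps the sketch working on machines that show a different number of disks.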

The nice thing about vmstat is that with just one command, you can easily assess if the CPU and RAM situation is ok or not, then move on to the next layer.

Or Is There a Disk Problem?

Now it gets interesting. Most if not all of the performance problems I see are disk I/O related, and there's no indication that this is about to change.

constant@fridolin:~$ iostat -xzn 5
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
    8.2    6.2  163.8   90.0  0.5  0.2   35.4   13.1   8  10 c3d0
    1.4   12.2   30.0   81.4  0.1  0.2    8.9   13.0   3   7 c6t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  126.6   33.1 1613.0  400.3  3.5  1.6   21.9    9.8  75  81 c3d0
    0.0   19.7    0.0   40.7  0.6  0.1   28.6    7.5  14  15 c6t0d0
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   33.4    2.0  242.5   14.4  7.1  2.0  200.0   56.4 100 100 c3d0
    0.0   15.8    0.0   39.4  2.3  0.5  148.2   31.3  49  49 c6t0d0

Again, looking at simple performance numbers like reads/writes per second or even kilobytes read/written per second doesn't tell you much. Are 126 reads fast? Or too slow? Wow, 1613k read per second. That's a lot! Is it? Wait, what disks am I using again? (Answer: The above is a Solaris 11 Express system running on VirtualBox on my 3-year-old Mac.)

A more interesting figure to look at is wait: This is the number of IO operations that are waiting to be serviced. In other words, "wait" tells you the length of the waiting queue. If your queue looks like the one in front of an Apple store on the day a new iPhone is introduced, you need to work on your disks (Here are a few suggestions if you use ZFS). If the queue length is in the single-digit range, then your problem may be elsewhere.
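To keep an eye on the queue length per device, something like the following Python sketch could collect the wait column from captured iostat -xzn output. The function name is made up for this example, and skipping the first report is an assumption based on iostat's first report covering the time since boot.

```python
IOSTAT_SAMPLE = """\
 extended device statistics
 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
 8.2 6.2 163.8 90.0 0.5 0.2 35.4 13.1 8 10 c3d0
 1.4 12.2 30.0 81.4 0.1 0.2 8.9 13.0 3 7 c6t0d0
 extended device statistics
 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
 126.6 33.1 1613.0 400.3 3.5 1.6 21.9 9.8 75 81 c3d0
 0.0 19.7 0.0 40.7 0.6 0.1 28.6 7.5 14 15 c6t0d0
 extended device statistics
 r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
 33.4 2.0 242.5 14.4 7.1 2.0 200.0 56.4 100 100 c3d0
 0.0 15.8 0.0 39.4 2.3 0.5 148.2 31.3 49 49 c6t0d0
"""


def wait_queues(iostat_output: str) -> dict[str, list[float]]:
    """Map each device to its wait values, one per report after the first."""
    queues: dict[str, list[float]] = {}
    report = 0
    for line in iostat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "extended":  # banner that starts a new report
            report += 1
            continue
        if fields[0] == "r/s" or report < 2:  # column header or first report
            continue
        # wait is the fifth column, the device name is the last one.
        queues.setdefault(fields[-1], []).append(float(fields[4]))
    return queues


for device, waits in wait_queues(IOSTAT_SAMPLE).items():
    print(device, "max queue length:", max(waits))
```

From there it is a small step to flag every device whose queue length leaves the single digits.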

Sometimes you want a more application level view into your IO situation and that is what the following command is about:

admin@krengi:~$ fsstat -F 5
  new  name  name  attr  attr lookup rddir  read  read write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
    0     0     0     0     0      0     0     0     0     0     0 ufs
    0     0     0     0     0      0     0     0     0     0     0 proc
    0     0     0     0     0      0     0     0     0     0     0 nfs
    0     0     0    68     0     43     0     0     0     9 1.06K zfs
    0     0     0     0     0      0     0     0     0     0     0 lofs
    0     0     0     0     0      0     0     0     0     0     0 tmpfs
    0     0     0     0     0      0     0     0     0     0     0 mntfs
    0     0     0     0     0      0     0     0     0     0     0 nfs3
    0     0     0     0     0      0     0     0     0     0     0 nfs4
    0     0     0     0     0      0     0     0     0     0     0 autofs

admin@krengi:~$ fsstat zfs 5
  new  name  name  attr  attr lookup rddir  read  read write write
 file remov  chng   get   set    ops   ops   ops bytes   ops bytes
2.08M  613K  171K 7.68G 2.25M  10.0G 43.3M 1.09G 1.97T  189M  638G zfs
    0     0     0    74     0     79     0    35   608    18   860 zfs
    0     0     0    67     0     39     0     0     0     1   112 zfs
    0     0     0    71     0     73     0     1     4     1   112 zfs

This is another great way of quickly having a look at what's up with your disk IO.

Are your users creating lots of files? Or are they modifying/removing/changing attributes a lot? What filesystems are causing the most IO load? How much IO goes through NFS and how much is local? All these questions can be easily answered with fsstat and a few flags.
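To answer the "which filesystem causes the most IO" question from a captured fsstat -F report, a small Python sketch like this one could rank the filesystem types by bytes read plus bytes written. The names are hypothetical, and I'm assuming that the K/M/G/T suffixes are powers of 1024.

```python
FSSTAT_SAMPLE = """\
 new name name attr attr lookup rddir read read write write
 file remov chng get set ops ops ops bytes ops bytes
 0 0 0 0 0 0 0 0 0 0 0 ufs
 0 0 0 0 0 0 0 0 0 0 0 nfs
 0 0 0 68 0 43 0 0 0 9 1.06K zfs
 0 0 0 0 0 0 0 0 0 0 0 tmpfs
"""

# Assumption: fsstat scales its values in powers of 1024.
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def to_number(text: str) -> float:
    """Convert values such as '1.06K' or '860' to a plain number."""
    if text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def io_bytes_by_fs(fsstat_output: str) -> list[tuple[str, float]]:
    """Return (filesystem, read bytes + write bytes), busiest first."""
    totals = []
    for line in fsstat_output.splitlines():
        fields = line.split()
        # Data rows have 11 counters plus the filesystem name.
        if len(fields) != 12:
            continue
        read_bytes = to_number(fields[8])
        write_bytes = to_number(fields[10])
        totals.append((fields[11], read_bytes + write_bytes))
    return sorted(totals, key=lambda item: item[1], reverse=True)


for name, total in io_bytes_by_fs(FSSTAT_SAMPLE):
    print(f"{name:8s} {total:12.0f} bytes")
```

The same parsing works for the other columns, so you could rank by new files or attribute changes instead.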

Checking Out the Network

Finally, if your problem is neither on the CPU nor on the memory nor on the disk IO side, it may lie outside of your system, perhaps at the networking level. Again, there's a favorite command that gets me a useful picture most of the time. For example, while streaming some video on my home server, I checked the effect on the network with this:

admin@krengi:~$ netstat -I e1000g0 5
    input   e1000g    output             input  (Total)    output
packets    errs  packets    errs  colls  packets    errs  packets    errs  colls
417683472  4     384816503  0     0      420603019  4     387736050  0     0
5779       0     3282       0     0      5779       0     3282       0     0
6487       0     3556       0     0      6487       0     3556       0     0
3672       0     2351       0     0      3673       0     2352       0     0

Notice that netstat counts packets here, not MB/s. Network performance analysis and tuning is a science of its own, but with this command you can quickly assess what each network interface is doing, and whether its packet counts are in the right ballpark. Maybe you have multiple network interfaces configured, but still all your data is sent through the same pipe?
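To check whether one interface carries all the traffic, the per-interval lines can be turned into packets per second and a share of the total. The Python below is a sketch under a few assumptions: the function name is invented, the interval matches the 5 seconds given on the command line, and the first line is skipped because it holds the counters accumulated since boot.

```python
NETSTAT_SAMPLE = """\
 input e1000g output input (Total) output
packets errs packets errs colls packets errs packets errs colls
417683472 4 384816503 0 0 420603019 4 387736050 0 0
5779 0 3282 0 0 5779 0 3282 0 0
6487 0 3556 0 0 6487 0 3556 0 0
3672 0 2351 0 0 3673 0 2352 0 0
"""


def interface_load(netstat_output: str, interval: int = 5) -> list[tuple[float, float]]:
    """Return (packets per second, share of total packets) per interval."""
    rows = [line.split() for line in netstat_output.splitlines()]
    # Keep only the numeric data rows; both header lines are dropped here.
    rows = [fields for fields in rows if len(fields) == 10 and fields[0].isdigit()]
    result = []
    for fields in rows[1:]:  # skip the cumulative first line
        interface = int(fields[0]) + int(fields[2])  # input + output packets
        total = int(fields[5]) + int(fields[7])
        share = interface / total if total else 0.0
        result.append((interface / interval, share))
    return result


for rate, share in interface_load(NETSTAT_SAMPLE):
    print(f"{rate:8.1f} packets/s, {share:6.1%} of all traffic")
```

A share close to 100% on a machine with several configured interfaces is a hint that everything still goes through the same pipe.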

Digging Deeper

So that's it for my performance cheat sheet: vmstat for CPU and memory, iostat with the -xzn flags and fsstat for disk IO, and good old netstat -I for the network. This is the 20% effort solution, the minimum effective set of commands that will get you a quick overview of a system in 80% of the cases.

Now for that other 20% of more complicated cases, you will need some extra digging. If you want to learn more, here are a few useful pointers: