Version 7 (modified by dkg, 10 years ago)

--

Why is my computer so slow?

Many folks get frustrated with their computer taking "too long" to respond to them, or to do things that they want it to do. While some of these problems can be fixed with better software implementations, some of them are related fundamentally to underlying resource exhaustion which no software fix can address. Even if you think that a software fix is possible, it's good to think about what resources are at their limit (the "bottleneck") so you can focus your software development energies in the right direction.

So how do you know where the bottleneck actually is? Which resources might be causing problems that are noticeable to the user of a given machine? A good starting point for this (under Linux, anyway) is vmstat. When invoked as:

vmstat 1 5

it will produce a series of 5 rows, one per second, each of which tells you a lot about the state of the system. The first row shows averages since the system booted, and each row thereafter shows values for the system during the last interval. Specific details about the numbers can be found in the man page.
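If you want to poke at these numbers from a script instead of eyeballing them, here is a minimal Python sketch that turns vmstat output into one dictionary per row. The sample text is two rows copied from the monkey transcript on this page; the function name parse_vmstat is made up for illustration, and the sketch assumes the plain two-line header shown here.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4  1 131464   5440   5532 102616    1    1    20    44  170  289 18  2 79  1
 3  1 131464   6308   5532 100744    0    0     0    44  272 2593 25 75  0  0
"""


def parse_vmstat(text):
    """Return a list of {column name: int} dicts, one per data row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # lines[0] is the group banner (procs, memory, ...);
    # lines[1] holds the actual column names.
    columns = lines[1].split()
    rows = []
    for ln in lines[2:]:
        fields = ln.split()
        if len(fields) != len(columns):
            continue  # skip anything that is not a data row
        rows.append(dict(zip(columns, map(int, fields))))
    return rows


rows = parse_vmstat(SAMPLE)
# The first row is the since-boot average; the rest are per-interval samples.
since_boot, intervals = rows[0], rows[1:]
print("idle since boot:", since_boot["id"])
print("busy per interval:", [r["us"] + r["sy"] for r in intervals])
```

To feed it live data you could pass in the text captured from running vmstat 1 5 yourself (for instance via subprocess.run with capture_output=True), remembering to treat the first row separately.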

On a typical modern system there are 4 main categories of resources whose exhaustion causes user-noticeable lag (please give a shout if there are other categories I'm ignoring):

CPU Cycles (aka "the processor")

Your basic computer internally can really only be doing one thing at a time. Even more modern computers with multi-core CPUs and/or multiple processors can only be doing a handful of things at once. The illusion of "multitasking" comes from the fact that the operating system forces the CPU to switch contexts very rapidly between all the outstanding tasks that the user has instructed it to work on. If you've instructed your computer to do more work than it has time to get to, you'll perceive it as sluggishness.
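A toy model can make this concrete. The Python sketch below simulates a single CPU handing out fixed time slices in round-robin order. It is only an illustration: real schedulers are far more sophisticated, and the task names and work amounts are invented.

```python
from collections import deque


def round_robin(tasks, time_slice):
    """Simulate one CPU sharing time between tasks.

    tasks maps a task name to the units of CPU work it needs.
    Returns (name, finish_time) pairs in order of completion.
    """
    queue = deque(tasks.items())
    clock = 0
    finished = []
    while queue:
        name, remaining = queue.popleft()
        ran = min(time_slice, remaining)
        clock += ran            # only one task runs at any moment
        remaining -= ran
        if remaining:
            queue.append((name, remaining))  # back of the line
        else:
            finished.append((name, clock))
    return finished


# Hypothetical workload: two light tasks and one CPU hog.
print(round_robin({"editor": 2, "browser": 3, "transcode": 10}, time_slice=1))
# -> [('editor', 4), ('browser', 7), ('transcode', 15)]
```

The editor only needs 2 units of work but does not finish until time 4, because it has to wait its turn behind the other tasks. Pile on more work and that wait is what you feel as lag.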

Some example ways to exhaust your CPU: complicated statistical analysis (e.g. seti@home), heavy-duty cryptanalysis (e.g. password cracking), algorithmically-intensive data transformations (e.g. transcoding video), excessive javascript in pages you've viewed (e.g. the countdown clock that used to be on ussf2007.org), etc.

Symptoms that might mean you've got a CPU bottleneck:

  • run "vmstat 1": do you see the "us" and "sy" (user and kernel) columns in the "cpu" section dominating the "id" and "wa" (idle and I/O wait) columns?
  • are all the fans on your system running full blast, and is the computer churning out a lot of heat? Processors under heavy load get hot and need to purge their heat somewhere.

Here's vmstat of a computer under heavy CPU load (note that the first line just shows that the CPU on monkey has been idle for 79% of the time since it booted):

[0 dkg@monkey ~]$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4  1 131464   5440   5532 102616    1    1    20    44  170  289 18  2 79  1
 3  1 131464   6308   5532 100744    0    0     0    44  272 2593 25 75  0  0
 3  1 131464   5776   5532 100184    0    0     4     0  270 2560 25 75  0  0
 3  0 131464   6248   5548  98192    0    0     4  5188  303 2125 22 78  0  0
 4  1 131464   6056   5548  97668    0    0     0     0  452 3884 39 61  0  0
[0 dkg@monkey ~]$ 
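The first symptom above (us and sy dominating id and wa) is easy to check mechanically. Here is a small Python sketch using the cpu columns from the transcript above; the function name looks_cpu_bound and the simple "busy exceeds not-busy" rule are my own shorthand, not anything vmstat defines.

```python
def looks_cpu_bound(row):
    """True when user+kernel time outweighs idle+I/O-wait time."""
    busy = row["us"] + row["sy"]
    not_busy = row["id"] + row["wa"]
    return busy > not_busy


# cpu columns from the monkey transcript; the first row is the since-boot average.
since_boot = {"us": 18, "sy": 2, "id": 79, "wa": 1}
intervals = [
    {"us": 25, "sy": 75, "id": 0, "wa": 0},
    {"us": 25, "sy": 75, "id": 0, "wa": 0},
    {"us": 22, "sy": 78, "id": 0, "wa": 0},
    {"us": 39, "sy": 61, "id": 0, "wa": 0},
]

print(looks_cpu_bound(since_boot))                 # False
print(all(looks_cpu_bound(r) for r in intervals))  # True
```

A single busy second doesn't mean much, so it's worth looking at several intervals in a row before blaming the CPU.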

RAM (aka "memory")

The memory is the "working set" of information that the computer can access relatively fast. Every tab you have open in your web browser (even if it's not foregrounded and no javascript is running) will set aside some RAM to keep track of it. Every word processing document you're writing is loaded into RAM while you're writing it. Pretty pictures on your desktop require a chunk of RAM.

Modern computers are clever enough to use swap (aka "virtual memory") when you ask them to hold more things in RAM than they physically have -- this just means that they substitute the slower (but much larger) hard disk for RAM when things get tight. A common principle here is to eject the LRU (Least Recently Used) block of RAM, writing it out to a special place on disk (the "swap file"), and loading in new data requested by an active process.
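The LRU idea can be sketched in a few lines of Python. This is a toy model only: real kernels use approximations of LRU and track far more state, and the class name TinyRam and the page names are invented for illustration.

```python
from collections import OrderedDict


class TinyRam:
    """Toy RAM with room for `capacity` pages, backed by unlimited swap."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.ram = OrderedDict()  # page -> data, least recently used first
        self.swap = {}            # pages that have been written out to disk
        self.swap_ins = 0
        self.swap_outs = 0

    def touch(self, page):
        """Access a page, pulling it into RAM if it is not there."""
        if page in self.ram:
            self.ram.move_to_end(page)  # now the most recently used
            return
        if page in self.swap:
            data = self.swap.pop(page)  # slow path: read back from disk
            self.swap_ins += 1
        else:
            data = "data for " + page   # brand new page
        if len(self.ram) >= self.capacity:
            # RAM is full: eject the least recently used page to swap.
            victim, victim_data = self.ram.popitem(last=False)
            self.swap[victim] = victim_data
            self.swap_outs += 1
        self.ram[page] = data


mem = TinyRam(capacity=2)
for page in ["a", "b", "a", "c", "b"]:
    mem.touch(page)
print(list(mem.ram), sorted(mem.swap), mem.swap_outs, mem.swap_ins)
# -> ['c', 'b'] ['a'] 2 1
```

Every swap_in or swap_out in this model stands for a trip to the slow disk, which is why a working set even slightly larger than RAM can hurt so much.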

Symptoms that might mean you've got a RAM bottleneck:

  • frequent, heavy disk activity when you're not trying to write out or copy large files usually means that you're swapping. Since you only swap when you've run out of RAM, that's a bad sign. If you're lucky enough to have a machine with a functional disk activity light, keep an eye on it. If you don't have a disk activity light, listen with your ears: unless you've got a solid-state disk (as of 2008, if you aren't sure whether you have a solid state disk, you probably don't have one), disk activity like this is actually audible as whirring and clicking.
  • Applications shutting down without warning could mean hitting a hard wall on RAM. If the computer has X amount of RAM, and you've instructed the operating system to set aside Y amount of swap space, and then you ask the computer to do tasks that consume more than X + Y memory in aggregate, your computer has to decide what to do: it's probably going to start by killing off some of the more offensive memory hogs to get the system back into a normal state. On Linux-based systems, this job is performed by a kernel subsystem called the oom-killer (out of memory killer). It's kind of a black art and you really don't want to get to the point where you need it.

Here's vmstat of a computer that has exhausted its RAM and is chewing into swap:

[0 dkg@monkey ~]$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0 131548   5880   4068 121408    1    1    19    48  171  290 18  2 79  1
 1  1 131628   4920   4448  84212    0   80    12    80  116  313  2 16 75  7
 6  4 143132   5124    256  50252    0 11496     0 11648  615 1296  1 69  0 30
 2  1 158556   4856    172  43008   96 15404   100 15404  745  721  3 70  0 27
 7  7 181712   4732    192  38280  528 22192   556 22276  983  954  4 84  0 12
[0 dkg@monkey ~]$ 

Things to note:

  • the swap columnset is active in both si (swap in, meaning bringing a block of RAM in from disk) and so (swap out, meaning writing a page of RAM out to disk)
  • the cpu spends a good chunk of its time in the wa (I/O wait) state, meaning that it is otherwise idle, but waiting for some sort of disk access.
  • the amount of swap space in use (swpd) is increasing
  • the amount of memory allocated to buffers (buff) and cache (cache) drops precipitously -- buffering and caching are two performance-optimizing ways that the kernel makes use of memory that is otherwise unallocated. They speed up your use of the machine without you asking them to do anything concretely, but they are not strictly required to make the computer work correctly. So when memory gets tight, the kernel reclaims the RAM it was using for buffers and caches to try to accommodate the new requirements coming from the user.
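The points above can be boiled down to a few numbers. The Python sketch below does this for the four interval rows of the transcript (the since-boot row is left out); the function name swap_pressure and the choice of summary fields are assumptions made for illustration.

```python
def swap_pressure(rows):
    """Summarise swap-related trends across a run of vmstat interval rows."""
    first, last = rows[0], rows[-1]
    return {
        # any swap-in or swap-out traffic at all during the run
        "swapping": any(r["si"] or r["so"] for r in rows),
        # how much more swap is in use at the end than at the start
        "swpd_growth": last["swpd"] - first["swpd"],
        # how much buffer+cache memory the kernel gave back
        "cache_shrink": (first["buff"] + first["cache"])
                        - (last["buff"] + last["cache"]),
        # worst I/O-wait percentage seen
        "max_wa": max(r["wa"] for r in rows),
    }


# Interval rows from the monkey transcript above.
rows = [
    {"swpd": 131628, "buff": 4448, "cache": 84212, "si": 0, "so": 80, "wa": 7},
    {"swpd": 143132, "buff": 256, "cache": 50252, "si": 0, "so": 11496, "wa": 30},
    {"swpd": 158556, "buff": 172, "cache": 43008, "si": 96, "so": 15404, "wa": 27},
    {"swpd": 181712, "buff": 192, "cache": 38280, "si": 528, "so": 22192, "wa": 12},
]

print(swap_pressure(rows))
# -> {'swapping': True, 'swpd_growth': 50084, 'cache_shrink': 50188, 'max_wa': 30}
```

Growth in swpd together with shrinking buff and cache over just a few seconds is the pattern to watch for; a little si/so traffic on its own is not necessarily a problem.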

Disks (aka "I/O")

There are really two kinds of resources you can exhaust related to disks, but only one of them typically results in the sluggishness this page attempts to diagnose. I'll get the other one out of the way first:

Disk Space (capacity)

This is the form of disk resource people are most used to seeing exhausted. You get messages like "cannot save file, disk is full" from your programs, or you get weird misbehaviors or system failures -- services being unable to log, mail transfer agents bouncing mail, etc.

The quickest way on a reasonable system to get a sense of how full your disks are is just df. Using the -h flag shows you the numbers in "human-readable" format:

0 ape:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_ape0-root
                     1008M  802M  156M  84% /
tmpfs                  64M     0   64M   0% /lib/init/rw
udev                   10M   44K   10M   1% /dev
tmpfs                  64M     0   64M   0% /dev/shm
/dev/sda1             228M  139M   78M  65% /boot
0 ape:~# 

You see here that ape only has 156MB available on its root filesystem. This is pretty tight: any time you get close to 90% full on a filesystem, the kernel has to do a lot more work to decide how to place the files you want to store. With a near-full filesystem, storing a larger file can take more time because it often needs to be broken up into smaller chunks and distributed across the disk. Having files scattered across the disk ("fragmented") means more work for the moving parts within the disk when you want to access that file. Moving parts are slow compared to electronics.
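If you'd rather have a script warn you before a filesystem gets that tight, here is a minimal Python sketch using shutil.disk_usage from the standard library. The function names and the 90% default threshold are illustrative choices; the percentage is computed as used / (used + available), which should land close to df's Use% column but may not match it exactly.

```python
import shutil


def percent_used(path):
    """Approximate df's Use% for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0  # some pseudo-filesystems report no blocks at all
    return 100.0 * usage.used / denominator


def nearly_full(path, threshold=90.0):
    """True when the filesystem holding `path` is at or above `threshold` percent."""
    return percent_used(path) >= threshold


pct = percent_used(".")
print(f"{pct:.0f}% used; nearly full: {nearly_full('.')}")
```

Pass in a mount point such as / or /boot to check a particular filesystem rather than the current directory.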

But the modern operating system kernel hides almost all of these shenanigans from the end user pretty well, so you usually won't notice performance degradation from full filesystems until they're actually full, at which point you'll get the nasty hard errors mentioned above. This brings me to the flavor of disk-related resource that often does cause perceptible performance problems...

Disk Throughput ("I/O" or "bandwidth")

Where disks are most likely to cause user-visible sluggishness is in their throughput, not their capacity. Accessing data off of a disk (or writing data to a disk) is extremely slow when compared to other parts of a modern computer. If the user has to experience this behavior directly, they'll likely feel like the computer is sluggish.

A good kernel can deal with this for disk writes transparently, as long as there is enough RAM: when the user says "save this file to disk", the kernel just says, "ok, fine", caches the data first in (fast) RAM, returns control to the user, and then (while the user is otherwise idle) dribbles the data out to (slow) disk in little chunks when the opportunity presents itself. If you've ever tried to save a file to a floppy disk in Windows 98, you know the annoyance that comes from an OS not doing this sort of "write caching": writes to floppy disks under that OS were synchronous (they had to happen exactly when the user requested them) -- so the whole machine would lock up while the file was actually being saved.
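You can get a feel for the difference yourself. The Python sketch below writes the same data twice: once letting the kernel cache the write, and once forcing it out with os.fsync before returning. The file names and the 8 MiB size are arbitrary, and the timings depend entirely on your disk and filesystem (on a RAM-backed temporary directory the two may be nearly identical), so no expected output is shown.

```python
import os
import tempfile
import time


def timed_write(path, data, sync):
    """Write data to path; if sync is true, wait until it reaches the device."""
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)
        if sync:
            f.flush()             # push Python's own buffer to the kernel
            os.fsync(f.fileno())  # ask the kernel to push it out to the disk
    return time.perf_counter() - start


data = os.urandom(8 * 1024 * 1024)  # 8 MiB of junk to write
with tempfile.TemporaryDirectory() as tmpdir:
    cached = timed_write(os.path.join(tmpdir, "cached.bin"), data, sync=False)
    synced = timed_write(os.path.join(tmpdir, "synced.bin"), data, sync=True)

print(f"cached write: {cached:.4f}s   synced write: {synced:.4f}s")
```

On a spinning disk the synced write will usually take noticeably longer, which is roughly the delay the kernel is hiding from you when it caches writes.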

Modern kernels also do similar sleights-of-hand on disk reads, though it is harder to do because predicting what the user is going to want to read next is an imprecise art. So if some subsystem in your computer is accessing the disk a lot, then other programs which need to access the disk will be noticeably slower, as they wait for their turn at the limited bandwidth available to the disk.

Here's a common situation: a program starting up needs to read its executable binary (and all linked libraries) from your disk into RAM. If you've already opened the program previously (and you have enough RAM), it's likely that the computer will have that copy of it in RAM already, so you won't see the sluggishness related to pulling it from disk. But if you're low on RAM already (so the cached data has been ejected from RAM), or you've just never opened this program before, it won't be cached. If another process is hammering the disk (e.g. you're making two copies of a multi-gigabyte DVD image), then the disk accesses the program needs to actually start up will be interleaved with the other requests made of the disk. This will manifest itself as a slow program startup.

Note also that it starts getting tricky here: when you run out of RAM, your computer often exhausts the throughput to your disks because it's swapping. So sometimes when you see that the disk activity is excessive, it could be due fundamentally to RAM exhaustion: so check there first.

Symptoms that might mean you have a disk I/O bottleneck:

  • constant hard disk activity -- if your machine is physically nearby, look for the disk activity lights, or listen for the whirring, clicking sound that an active disk makes. If your machine is remote, use vmstat to look for high levels of bi and bo in the io grouping, or high levels of wa in the cpu grouping.
  • launching programs for the first time since boot, or opening files for the first time since boot takes much longer than it should.

Here's vmstat of a system just coming under heavy I/O load:

0 dkg@squash:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0     56   6944 238140 106932    0    0    15    44  260   54  4  1 93  2
 1  0     56   4896 252776  94608    0    0  7916     0  253  204  9 58 24  9
 1  1     56   4644 253984  93612    0    0 10624  2800  253  264 12 80  0  8
 1  0     56   4632 252984  93612    0    0  5376  5488  253  261  3 52  0 45
 1  0     56   5548 252528  93416    0    0  4224  4144  253  263  3 44  0 53
0 dkg@squash:~$ 

Note that once the load kicks in there are no idle CPU cycles (id in the cpu section), but a significant number of cycles in I/O wait (wa in cpu). The I/O wait column is new as of Linux kernel 2.6 -- if you're using a 2.4 series kernel, you won't be able to see this. CPU cycles counted in this last column are cycles in which the CPU is idle, but there is an outstanding request to the disks (or does other I/O count?). This indicates that if the I/O had completed, more activity could happen (because the CPU is otherwise idle), so large values in that column are good indicators of a disk throughput problem.

Note also that bi and bo are large values (in io), while there is no actual swap activity. This is helpful to distinguish from an out-of-RAM state.
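Putting the pieces together, here is a rough Python sketch of a per-row classifier that checks for swapping first (since RAM exhaustion can masquerade as a disk problem), then for disk throughput trouble, then for CPU load. The function name classify and all of the thresholds are assumptions picked to fit the transcripts on this page; treat it as a starting point for your own judgment, not a diagnosis.

```python
def classify(row, swap_threshold=0, io_threshold=1000, wa_threshold=20):
    """Guess which resource a single vmstat interval row points at."""
    # Swap traffic means RAM is the underlying problem, so check it first.
    if row["si"] + row["so"] > swap_threshold:
        return "ram"
    # Heavy block I/O plus a CPU stuck waiting on it suggests disk throughput.
    if row["wa"] >= wa_threshold and row["bi"] + row["bo"] >= io_threshold:
        return "disk"
    # Otherwise, see whether the CPU itself is saturated.
    if row["us"] + row["sy"] > row["id"] + row["wa"]:
        return "cpu"
    return "ok"


# One row from each transcript on this page.
disk_row = {"si": 0, "so": 0, "bi": 5376, "bo": 5488, "us": 3, "sy": 52, "id": 0, "wa": 45}
swap_row = {"si": 96, "so": 15404, "bi": 100, "bo": 15404, "us": 3, "sy": 70, "id": 0, "wa": 27}
cpu_row = {"si": 0, "so": 0, "bi": 0, "bo": 44, "us": 25, "sy": 75, "id": 0, "wa": 0}
idle_row = {"si": 0, "so": 0, "bi": 15, "bo": 44, "us": 4, "sy": 1, "id": 93, "wa": 2}

for row in (disk_row, swap_row, cpu_row, idle_row):
    print(classify(row))
# -> disk, ram, cpu, ok (one per line)
```

Note that the swap row also shows heavy bi/bo and a high wa; it is only the si and so columns that give away the real cause.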

Network

Diagnosing network resource exhaustion is tougher than the other forms of resource exhaustion because it's often caused by external systems. So saying "it's a network problem" is sometimes the answer of last resort when none of the other resources are anything close to exhausted.

FIXME: more to write here

If you suspect that your upstream connection is clogged and you control the router/gateway, you might try using iftop to figure out which particular client is hogging the pipe.