
Why is my computer so slow?

Many folks get frustrated with their computer taking "too long" to respond to them, or to do the things they want it to do. While some of these problems can be fixed with better software implementations, others are fundamentally rooted in underlying resource exhaustion that no software fix can address. Even if you think a software fix is possible, it's good to figure out which resources are at their limit so you can focus your software development energies in the right direction.

So how do you know where the bottleneck actually is? Which resources might be causing problems that are noticeable to the user of a given machine? A good starting point for this (under Linux, anyway) is vmstat. When invoked as:

vmstat 1 5

it will produce a series of 5 rows, one per second, each of which tells you a lot about the state of the system. The first row shows the aggregate state of the system since it booted, and each row thereafter shows values for the system during the last interval. Specific details about each number can be found in the vmstat man page.

On a typical modern system there are 4 main categories of resources whose exhaustion causes user-noticeable lag (please give a shout if there are other categories I'm ignoring):

CPU Cycles (aka "the processor")

Your basic computer internally can really only do one thing at a time. Even more modern computers with multi-core CPUs and/or multiple processors can only be doing a handful of things at once. The illusion of "multitasking" comes from the fact that the operating system forces the CPU to switch contexts very rapidly between all the outstanding tasks that the user has instructed it to work on. If you've instructed your computer to do more work than it has time to get to, you'll perceive it as sluggishness.

Some example ways to exhaust your CPU: complicated statistical analysis (e.g. seti@home), heavy-duty cryptanalysis (e.g. password cracking), algorithmically-intensive data transformations (e.g. transcoding video), excessive JavaScript in pages you've viewed (e.g. the countdown clock that used to be on ussf2007.org), etc.

Symptoms that might mean you've got a CPU bottleneck:

  • run "vmstat 1": do you see the "us" and "sy" (user and kernel) columns in the "cpu" section dominating the "id" and "wa" (idle and I/O wait) columns?
  • are all the fans on your system running full blast, and the computer is churning out a lot of heat? Processors under heavy load get hot and need to purge their heat somewhere.

Here's vmstat of a computer under heavy CPU load (note that the first line just shows that the CPU on monkey has been idle for 79% of the time since it booted):

[0 dkg@monkey ~]$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4  1 131464   5440   5532 102616    1    1    20    44  170  289 18  2 79  1
 3  1 131464   6308   5532 100744    0    0     0    44  272 2593 25 75  0  0
 3  1 131464   5776   5532 100184    0    0     4     0  270 2560 25 75  0  0
 3  0 131464   6248   5548  98192    0    0     4  5188  303 2125 22 78  0  0
 4  1 131464   6056   5548  97668    0    0     0     0  452 3884 39 61  0  0
[0 dkg@monkey ~]$ 
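
Another quick check is to compare the load average to the number of CPUs in the machine. This is a rough sketch -- the timestamps and numbers below are just illustrative:

# roughly, how many tasks have been waiting to run, averaged over the
# last 1, 5 and 15 minutes?
$ uptime
 14:02:33 up 12 days,  3:04,  2 users,  load average: 3.98, 3.52, 2.10
# how many CPUs does this machine actually have?
$ grep -c ^processor /proc/cpuinfo
2

A load average that stays well above the number of CPUs means there are more runnable tasks than the processor can keep up with -- the same thing the "r" column in the vmstat output above is telling you.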

RAM (aka "memory")

RAM is the "working set" of information that the computer can access relatively quickly. Every tab you have open in your web browser (even if it's not in the foreground and no JavaScript is running) sets aside some RAM to keep track of it. Every word processing document you're writing is loaded into RAM while you're writing it. Pretty pictures on your desktop require a chunk of RAM.

Modern computers are clever enough to use swap (aka "virtual memory") when you ask them to hold more things in RAM than they physically have -- this just means that they substitute the slower (but much larger) hard disk for RAM when things get tight. A common approach is to evict the LRU (Least Recently Used) block of RAM, writing it out to a special place on disk (the "swap file" or swap partition), and loading in the new data requested by an active process.
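
The quickest way to see how much RAM and swap you have, and how much of it is in use, is free. This is a sketch -- the numbers below are made up, and the -m flag just reports everything in megabytes:

$ free -m
             total       used       free     shared    buffers     cached
Mem:          1010        970         40          0          5         98
-/+ buffers/cache:        867        143
Swap:          478        128        350

The "-/+ buffers/cache" line is the interesting one: it shows usage once you discount the buffers and cache, which the kernel will give back under memory pressure (see the note about buff and cache below).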

Symptoms that might mean you've got a RAM bottleneck:

  • frequent, heavy disk activity when you're not trying to write out or copy large files usually means that you're swapping. Since you only swap when you've run out of RAM, that's a bad sign. If you're lucky enough to have a machine with a functional disk activity light, keep an eye on it. If you don't have a disk activity light, listen with your ears: unless you've got a solid-state disk (as of 2008, if you aren't sure whether you have a solid state disk, you probably don't have one), disk activity like this is actually audible as whirring and clicking.
  • Applications shutting down without warning could mean hitting a hard wall on RAM. If the computer has X amount of RAM, and you've instructed the operating system to set aside Y amount of swap space, and then you ask the computer to do tasks that consume more than X + Y memory in aggregate, your computer has to decide what to do: it's probably going to start by killing off some of the more offensive memory hogs to get the system back into a normal state. On Linux-based systems, this job is performed by a kernel subsystem called the oom-killer (out of memory killer). It's kind of a black art and you really don't want to get to the point where you need it.
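
If you suspect the oom-killer has already struck, the kernel log usually records it. A rough check (the exact message text varies between kernel versions, and the log file location varies between distributions):

# search the kernel's ring buffer for out-of-memory kills
$ dmesg | grep -i "out of memory"
# or search the saved kernel log (this path is Debian-ish)
$ grep -i "out of memory" /var/log/kern.log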

Here's vmstat of a computer that has exhausted its RAM and is chewing into swap:

[0 dkg@monkey ~]$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0 131548   5880   4068 121408    1    1    19    48  171  290 18  2 79  1
 1  1 131628   4920   4448  84212    0   80    12    80  116  313  2 16 75  7
 6  4 143132   5124    256  50252    0 11496     0 11648  615 1296  1 69  0 30
 2  1 158556   4856    172  43008   96 15404   100 15404  745  721  3 70  0 27
 7  7 181712   4732    192  38280  528 22192   556 22276  983  954  4 84  0 12
[0 dkg@monkey ~]$ 

Things to note:

  • the swap column set is active in both si (swap in, meaning bringing a page of RAM back in from disk) and so (swap out, meaning writing a page of RAM out to disk)
  • the cpu spends a good chunk of its time in the wa (I/O wait) state, waiting for the swap to take effect
  • the number of swapped pages (swpd) is increasing
  • the amount of memory allocated to buffers (buff) and cache drops precipitously -- buffering and caching are two performance-optimizing ways that the kernel makes use of memory that is otherwise unallocated. They speed up your use of the machine without your asking for anything concrete, but they are not strictly required to make the computer work. So when memory gets tight, the kernel reclaims the RAM it was using for buffers and caches to try to accommodate the new demands coming from the user.

Disks (aka "I/O")

There are really two kinds of resources you can exhaust related to disks, but only one of them typically results in the sluggishness this page attempts to diagnose. I'll get the other one out of the way first:

Disk Space (capacity)

This is the form of disk resource people are most used to seeing exhausted. You get messages like "cannot save file, disk is full" from your programs, or you get weird misbehaviors or system failures -- services being unable to log, mail transfer agents bouncing mail, etc.

The quickest way on a reasonable system to get a sense of how full your disks are is just df. Using the -h flag shows you the numbers in "human-readable" format:

0 ape:~# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_ape0-root
                     1008M  802M  156M  84% /
tmpfs                  64M     0   64M   0% /lib/init/rw
udev                   10M   44K   10M   1% /dev
tmpfs                  64M     0   64M   0% /dev/shm
/dev/sda1             228M  139M   78M  65% /boot
0 ape:~# 

You see here that ape only has 156MB available on its root filesystem. This is pretty tight: any time you get close to 90% full on a filesystem, the kernel has to do a lot more work to decide how to place the files you want to store. With a near-full filesystem, storing a larger file can take more time because it often needs to be broken up into smaller chunks and distributed across the disk, which means more moving parts.
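
If you need to track down what is eating the space on a nearly-full filesystem, du will add up directory sizes for you. A quick sketch -- adjust the path to the filesystem you're investigating:

# total up each top-level directory on the root filesystem, staying on this
# filesystem only (-x), and sort numerically (sizes are in kilobytes)
$ du -x --max-depth=1 / | sort -n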

But the modern operating system kernel hides almost all of these shenanigans from the end user pretty well, so you usually won't notice performance degradation from full filesystems until they're actually full, and you get the nasty hard errors mentioned above. This brings me to the flavor of disk-related resource that often does cause performance problems...

Disk Throughput

Where disks are most likely to cause user-visible sluggishness is in their throughput, not their capacity. Accessing data off of a disk (or writing data to a disk) is extremely slow when compared to other parts of a modern computer. If the user has to experience this behavior directly, they'll likely feel like the computer is sluggish.

A good kernel can deal with this for disk writes transparently, as long as there is enough RAM: when the user says "save this file to disk", the kernel just says "ok, fine", caches the data first in (fast) RAM, returns control to the user, and writes the data out to (slow) disk in little chunks during times when the system is otherwise idle. (If you've ever tried to save a file to a floppy disk in Windows 98, you know the annoyance that comes from an OS not doing this: writes to floppy disks under that OS were synchronous -- so the whole machine would lock up while the file was actually being saved.)
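
You can actually watch this write caching in action. A rough sketch -- the file name and size here are arbitrary, and you'll want a couple hundred megabytes free wherever you run it:

# this usually returns almost immediately, because the data mostly lands in RAM first
$ dd if=/dev/zero of=testfile bs=1M count=200
# sync doesn't return until all pending writes have actually reached the disk
$ sync
# clean up after ourselves
$ rm testfile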

Modern kernels also do similar sleights of hand on disk reads, though that is harder: the kernel can't hand back data it hasn't actually fetched yet, so the best it can do is cache data that has been read before and try to read ahead of what a program is likely to ask for next.

It starts getting tricky here: when you run out of RAM, your computer often starts thrashing your disks. But there are other things that can cause you to thrash your disks too.
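
If you want to see whether a disk itself is the thing being thrashed, iostat (from the sysstat package) gives a per-device view that vmstat's aggregate bi/bo columns don't. A sketch -- the device names and exact columns depend on your version of iostat:

# extended per-device statistics, once per second, five times
$ iostat -x 1 5

The columns to watch are the per-device utilization and average wait times: a disk that is busy close to 100% of the time, with requests queueing up behind it, will make everything that touches it feel slow -- just like the high wa numbers in the vmstat output above.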

Network

This resource really kinda sucks because it's often hard to diagnose from the machine in question.
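
Still, there are a couple of quick checks you can run locally. A sketch -- the hostname and interface name here are just examples:

# is the path to a well-known host slow or lossy? watch the round-trip times and packet loss
$ ping -c 4 www.debian.org
# are the interface counters showing errors or dropped packets?
$ ip -s link show eth0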