Changes between Version 1 and Version 2 of DiagnosingSluggishness


Timestamp:
Feb 13, 2008 6:25:29 PM (5 years ago)
Author:
dkg
Comment:

still working on this (saving intermediate state)

  • DiagnosingSluggishness

    v1 v2  
    3131etc. 
    3232 
    33 symptoms that might mean you've got a CPU bottleneck:  
     33Symptoms that might mean you've got a CPU bottleneck:  
    3434 
    3535 * run "vmstat 1": do you see the "us" and "sy" (user and kernel) columns in the "cpu" section dominating the "id" and "wa" (idle and I/O wait) columns? 
    3636 * are all the fans on your system running full blast, and is the computer churning out a lot of heat?  Processors under heavy load get hot and need to purge their heat somewhere. 
    3737 
    38 Here's `vmstat` of a computer under heavy CPU load: 
     38Here's `vmstat` of a computer under heavy CPU load (note that the first line just shows that the CPU on `monkey` has been idle for 79% of the time since it booted): 
    3939{{{ 
    40 [0 dkg@squeak ~]$ vmstat 1 5 
     40[0 dkg@monkey ~]$ vmstat 1 5 
    4141procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- 
    4242 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa 
     
    4646 3  0 131464   6248   5548  98192    0    0     4  5188  303 2125 22 78  0  0 
    4747 4  1 131464   6056   5548  97668    0    0     0     0  452 3884 39 61  0  0 
    48 [0 dkg@squeak ~]$  
     48[0 dkg@monkey ~]$  
    4949}}} 
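
The "us + sy versus id + wa" comparison described above can be sketched in a few lines of Python. This is only a rough illustration: the function names and the "busy time outweighs idle time" rule are assumptions made for this example, not part of `vmstat`, and the column list is copied from the header shown above (newer `vmstat` versions may print extra columns, which this sketch ignores).

```python
# Column names copied from the vmstat header shown above.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()


def parse_vmstat_row(line):
    """Turn one vmstat data row into a dict keyed by column name."""
    values = [int(field) for field in line.split()]
    # zip() stops at the shorter sequence, so extra trailing columns are dropped.
    return dict(zip(VMSTAT_COLUMNS, values))


def looks_cpu_bound(row):
    """Hypothetical rule of thumb: us + sy outweighs id + wa."""
    return row["us"] + row["sy"] > row["id"] + row["wa"]


# One of the data rows from the vmstat output above.
sample = " 3  0 131464   6248   5548  98192    0    0     4  5188  303 2125 22 78  0  0"

if __name__ == "__main__":
    print(looks_cpu_bound(parse_vmstat_row(sample)))
```

A single row proves little; in practice you would want to see the same pattern across many consecutive samples before blaming the CPU.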
    5050 
     
    5757javascript is running) will set aside some RAM to keep track of it. 
    5858Every word processing document you're writing is loaded into RAM 
    59 while you're writing it.  Modern computers are clever enough to use 
     59while you're writing it.  Pretty pictures on your desktop require a chunk of RAM. 
     60 
     61Modern computers are clever enough to use 
    6062swap (aka "virtual memory") when you ask them to hold more things 
    6163in RAM than they physically have -- this just means that they 
    62 substitute 
     64substitute the slower (but much larger) hard disk for RAM when things get tight.  A common principle here is to eject the LRU (Least Recently Used) block of RAM, writing it out to a special place on disk (the "swap file"), and loading in new data requested by an active process. 
     65 
     66Symptoms that might mean you've got a RAM bottleneck: 
     67 
     68 * Frequent, heavy disk activity when you're not trying to write out or copy large files usually means that you're swapping.  Since you only swap when you've run out of RAM, that's a bad sign.  If you're lucky enough to have a machine with a functional disk activity light, keep an eye on it.  If you don't have a disk activity light, listen: unless you've got a solid-state disk (as of 2008, if you aren't sure whether you have a solid-state disk, you ''probably don't have one''), disk activity like this is actually audible as whirring and clicking. 
     69 * Applications shutting down without warning could mean hitting a hard wall on RAM.  If the computer has ''X'' amount of RAM, and you've instructed the operating system to set aside ''Y'' amount of swap space, and then you ask the computer to do tasks that consume more than ''X + Y'' memory in aggregate, your computer has to decide what to do:  it's probably going to start by killing off some of the more offensive memory hogs to get the system back into a normal state.  On Linux-based systems, this job is performed by a kernel subsystem called the [http://linux-mm.org/OOM_Killer oom-killer] (out of memory killer).  It's kind of a black art and you really don't want to get to the point where you need it. 
     70 
     71Here's `vmstat` of a computer that has exhausted its RAM and is chewing into swap: 
     72{{{ 
     73[0 dkg@monkey ~]$ vmstat 1 5 
     74procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu---- 
     75 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa 
     76 2  0 131548   5880   4068 121408    1    1    19    48  171  290 18  2 79  1 
     77 1  1 131628   4920   4448  84212    0   80    12    80  116  313  2 16 75  7 
     78 6  4 143132   5124    256  50252    0 11496     0 11648  615 1296  1 69  0 30 
     79 2  1 158556   4856    172  43008   96 15404   100 15404  745  721  3 70  0 27 
     80 7  7 181712   4732    192  38280  528 22192   556 22276  983  954  4 84  0 12 
     81[0 dkg@monkey ~]$  
     82}}} 
     83 
     84Things to note: 
     85 
     86 * the `swap` column set is active in both `si` (swap in, meaning reading a page back into RAM from disk) and `so` (swap out, meaning writing a page of RAM out to disk) 
     87 * the `cpu` spends a good chunk of its time in the `wa` (I/O wait) state, waiting for swap reads and writes to complete 
     88 * the amount of swap in use (`swpd`) is increasing 
     89 * the amount of `memory` allocated to buffers (`buff`) and `cache` drops precipitously -- buffering and caching are two performance-optimizing ways that the kernel makes use of memory that is otherwise unallocated.  They speed up your use of the machine without you having to ask for anything, but they are not strictly required to make the computer work.  So when memory gets tight, the kernel reclaims the RAM it was using for buffers and caches to try to accommodate the new requirements coming from the user. 
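
The observations in the list above can be checked mechanically. The following Python sketch runs over the sample rows from the `vmstat` output above; the function names and the three checks are assumptions chosen for illustration, not an established diagnostic. It skips the first row, since that one reports averages since boot rather than current activity.

```python
# Column names copied from the vmstat header shown above.
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa".split()

# Data rows copied from the swapping example above.
SAMPLE = """\
 2  0 131548   5880   4068 121408    1    1    19    48  171  290 18  2 79  1
 1  1 131628   4920   4448  84212    0   80    12    80  116  313  2 16 75  7
 6  4 143132   5124    256  50252    0 11496     0 11648  615 1296  1 69  0 30
 2  1 158556   4856    172  43008   96 15404   100 15404  745  721  3 70  0 27
 7  7 181712   4732    192  38280  528 22192   556 22276  983  954  4 84  0 12
"""


def parse_rows(text):
    """Parse every all-numeric vmstat line into a dict keyed by column name."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= len(VMSTAT_COLUMNS) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(VMSTAT_COLUMNS, map(int, fields))))
    return rows


def swap_pressure(rows):
    """Hypothetical checks mirroring the list above, ignoring the since-boot row."""
    live = rows[1:]
    pairs = list(zip(live, live[1:]))
    return {
        "swapping_out": any(r["so"] > 0 for r in live),
        "swpd_growing": all(b["swpd"] > a["swpd"] for a, b in pairs),
        "cache_shrinking": all(b["cache"] < a["cache"] for a, b in pairs),
    }


if __name__ == "__main__":
    print(swap_pressure(parse_rows(SAMPLE)))
```

Requiring strictly monotonic growth is deliberately crude; on a real machine `swpd` and `cache` fluctuate, so a trend over a longer window would be a more forgiving test.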
    6390 
    6491== Disks (aka "I/O") == 
     92 
     93There are really two kinds of resources you can exhaust related to disks, but only one of them typically results in the sluggishness this page attempts to diagnose.  I'll get the other one out of the way first: 
     94 
     95=== Disk Space (capacity) === 
     96 
     97This is the form of disk resource people are most used to seeing exhausted.  You get messages like "cannot save file, disk is full" from your programs, or you get weird misbehaviors or system failures -- services being unable to log, mail transfer agents bouncing mail, etc. 
     98 
     99The quickest way to see how full your disks are on a reasonable system is `df`.  Using the `-h` flag shows you the numbers in "human-readable" format: 
     100 
     101{{{ 
     1020 ape:~# df -h 
     103Filesystem            Size  Used Avail Use% Mounted on 
     104/dev/mapper/vg_ape0-root 
     105                     1008M  802M  156M  84% / 
     106tmpfs                  64M     0   64M   0% /lib/init/rw 
     107udev                   10M   44K   10M   1% /dev 
     108tmpfs                  64M     0   64M   0% /dev/shm 
     109/dev/sda1             228M  139M   78M  65% /boot 
     1100 ape:~#  
     111}}} 
     112You see here that `ape` only has 156MB available on its root filesystem.  This is pretty tight: any time you get close to 90% full on a filesystem, the kernel has to do a lot more work to decide how to place the files you want to store.  With a near-full filesystem, storing a larger file can take more time because it often needs to be broken up into smaller chunks and distributed across the disk, which means more moving parts. 
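
If you would rather check this from a script than read `df` output by eye, a minimal Python sketch using the standard library's `shutil.disk_usage` might look like the following. The helper names and the 90% default threshold are assumptions for this example, and the percentage may not match `df`'s `Use%` column exactly, because `df` accounts for blocks reserved for root.

```python
import shutil


def percent_used(total, used):
    """Return used space as a percentage of total (0.0 for an empty filesystem)."""
    return 100.0 * used / total if total else 0.0


def nearly_full(path, threshold=90.0):
    """Hypothetical check: is the filesystem holding `path` at or above the threshold?"""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return percent_used(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    print(nearly_full("."))
```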
     113 
     114But the modern operating system kernel hides almost all of these shenanigans from the end user pretty well, so you usually won't notice performance degradation from full filesystems until they're actually full, and you get the nasty hard errors mentioned above.  This brings me to the flavor of disk-related resource that often does cause performance problems... 
     115 
     116=== Disk Throughput === 
     117 
     118Where disks are most likely to cause user-visible sluggishness is in their ''throughput'', not their ''capacity''.  Accessing data off of a disk (or writing data to a disk) is extremely slow when compared to other parts of a modern computer.  If the user has to experience this behavior directly, they'll likely feel like the computer is sluggish. 
     119 
     120A good kernel can deal with this for disk ''writes'' transparently, as long as there is enough RAM: when the user says "save this file to disk", the kernel just says, "ok, fine", caches the data first in (fast) RAM, returns control to the user, and writes the data out to (slow) disk in little chunks while the user is otherwise idle.  (If you've ever tried to save a file to a floppy disk in Windows 98, you know the annoyance that comes from an OS ''not'' doing this: writes to floppy disks under that OS were synchronous -- so the whole machine would lock up while the file was actually being saved.) 
     121 
     122Modern kernels also do similar sleights-of-hand on disk ''reads'', though it is harder to do. 
     123 
     124  It starts getting tricky here: when you [#RAMakamemory run out of RAM], your computer often starts thrashing your disks.  But there are other things that can cause you to thrash your disks too. 
    65125 
    66126== Network ==