Changes between Version 5 and Version 6 of DiagnosingSluggishness


Timestamp:
Feb 18, 2008, 6:35:15 PM (10 years ago)
Author:
dkg
Comment:

--

  • DiagnosingSluggishness

    v5 v6  
    114114But the modern operating system kernel hides almost all of these shenanigans from the end user pretty well, so you usually won't notice performance degradation from full filesystems until they're actually full, at which point you'll get the nasty hard errors mentioned above.  This brings me to the flavor of disk-related resource that often ''does'' cause perceptible performance problems...
    115115
    116 === Disk Throughput ("bandwidth") ===
     116=== Disk Throughput ("I/O" or "bandwidth") ===
    117117
    118118Where disks are most likely to cause user-visible sluggishness is in their ''throughput'', not their ''capacity''.  Reading data from a disk (or writing data to one) is extremely slow compared to other parts of a modern computer.  If the user has to wait on the disk directly, the computer will likely feel sluggish.
     
    120120A good kernel can deal with this for disk ''writes'' transparently, as long as there is enough RAM: when the user says "save this file to disk", the kernel just says, "ok, fine", caches the data first in (fast) RAM, returns control to the user, and then (while the user is otherwise idle) dribbles the data out to (slow) disk in little chunks when the opportunity presents itself.  If you've ever tried to save a file to a floppy disk in Windows 98, you know the annoyance that comes from an OS ''not'' doing this sort of "write caching": writes to floppy disks under that OS were synchronous (they had to happen exactly when the user requested them) -- so the whole machine would lock up while the file was actually being saved.
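To make the write-caching behaviour concrete, here is a minimal Python sketch (not part of the original page) that writes the same data twice: once letting the kernel cache it, and once forcing it out to the device with `os.fsync`.  The helper name `timed_write` and the 8 MiB payload are assumptions chosen for illustration, and the actual timings depend entirely on your disk, filesystem and kernel.

```python
import os
import tempfile
import time


def timed_write(path, data, sync):
    """Write data to path and return the elapsed time in seconds.

    With sync=False the call usually returns as soon as the kernel has
    the data in its cache.  With sync=True, os.fsync() blocks until the
    kernel reports that the data has been handed to the storage device.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)
        f.flush()  # push Python's own buffer down to the kernel
        if sync:
            os.fsync(f.fileno())
    return time.perf_counter() - start


if __name__ == "__main__":
    payload = os.urandom(8 * 1024 * 1024)
    with tempfile.TemporaryDirectory() as d:
        cached = timed_write(os.path.join(d, "cached.bin"), payload, sync=False)
        synced = timed_write(os.path.join(d, "synced.bin"), payload, sync=True)
    print(f"cached write: {cached:.4f}s, synchronous write: {synced:.4f}s")
```

On a machine with a slow disk and plenty of free RAM, the synchronous write would typically take noticeably longer, but that is not guaranteed (for example, a temporary directory on a RAM-backed filesystem would show little difference).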
    121121
    122 Modern kernels also do similar sleights-of-hand on disk ''reads'', though it is harder to do.
     122Modern kernels also do similar sleights-of-hand on disk ''reads'', though this is harder to do, because predicting what the user will want to read next is an imprecise art.  So if some subsystem in your computer is accessing the disk a lot, other programs that need the disk will be noticeably slower while they wait their turn at the disk's limited bandwidth.
    123123
    124 ''FIXME: more to write here''
    125   It starts getting tricky here: when you [#RAMakamemory run out of RAM], your computer often starts thrashing your disks.  But there are other things that can cause you to thrash your disks too.
     124Here's a common situation: a program starting up needs to read its executable binary (and all linked libraries) from your disk into RAM.  If you've opened the program before (and you have enough RAM), the kernel probably still has a copy cached in RAM, so you won't see the sluggishness of pulling it from disk.  But if you're already low on RAM (so the cached data has been evicted), or you've never opened this program before, it won't be cached.  If another process is hammering the disk at the same time (e.g. you're making two copies of a multi-gigabyte DVD image), then the reads the program needs in order to start will be interleaved with those other disk requests.  This manifests itself as a slow program startup.
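The cached versus uncached read described above can be observed with a rough Python sketch like the following (an illustration, not part of the original page).  It assumes a Linux-like system where `os.posix_fadvise` is available; the helper names `timed_read` and `drop_from_cache` are hypothetical, and the eviction request is only advisory, so the kernel may ignore it.

```python
import os
import tempfile
import time


def timed_read(path):
    """Read the whole file; return (elapsed seconds, bytes read)."""
    start = time.perf_counter()
    with open(path, "rb") as f:
        n = len(f.read())
    return time.perf_counter() - start, n


def drop_from_cache(path):
    """Ask the kernel to evict this file's cached pages (advisory only).

    Returns False where posix_fadvise is unavailable or refuses.
    """
    if not hasattr(os, "posix_fadvise"):
        return False
    fd = os.open(path, os.O_RDONLY)
    try:
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    except OSError:
        return False
    finally:
        os.close(fd)
    return True


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(os.urandom(64 * 1024 * 1024))
        f.flush()
        os.fsync(f.fileno())  # only clean (flushed) pages can be evicted
        path = f.name
    try:
        evicted = drop_from_cache(path)
        first, _ = timed_read(path)   # may have to go to the disk
        second, _ = timed_read(path)  # likely served from RAM
        print(f"evicted={evicted} first read: {first:.4f}s, "
              f"second read: {second:.4f}s")
    finally:
        os.unlink(path)
```

If the eviction takes effect and the temporary file lives on a real disk, the first read should be the slower one; running a second disk-heavy job at the same time would be expected to widen the gap.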
    126125
     126Note that it gets tricky here: when you [#RAMakamemory run out of RAM], your computer often exhausts the throughput to your disks because it's swapping.  So excessive disk activity can be fundamentally a symptom of RAM exhaustion: check there first.
     127
     128Symptoms that might mean you have a disk I/O bottleneck:
     129
     130 * constant hard disk activity -- if your machine is physically nearby, look for the disk activity lights, or listen for the whirring, clicking sound that an active disk makes.  If your machine is remote, use `vmstat` to look for high levels of `bi` and `bo` in the `io` grouping, or high levels of `wa` in the `cpu` grouping.
     131 * launching programs, or opening files, for the first time since boot takes much longer than it should.
     132
     133Here's `vmstat` of a system just coming under heavy I/O load:
{{{
0 dkg@squash:~$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0     56   6944 238140 106932    0    0    15    44  260   54  4  1 93  2
 1  0     56   4896 252776  94608    0    0  7916     0  253  204  9 58 24  9
 1  1     56   4644 253984  93612    0    0 10624  2800  253  264 12 80  0  8
 1  0     56   4632 252984  93612    0    0  5376  5488  253  261  3 52  0 45
 1  0     56   5548 252528  93416    0    0  4224  4144  253  263  3 44  0 53
0 dkg@squash:~$
}}}
     145
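     146Note that once the load is under way (the last three samples) there are no idle CPU cycles (`id` in the `cpu` section), while a significant share of cycles goes to I/O wait (`wa` in `cpu`).  The first sample row is an average since boot, so it doesn't reflect the current load.  The I/O wait column is new as of Linux kernel 2.6 -- if you're using a 2.4 series kernel, you won't be able to see this.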
     147
     148Note also that `bi` and `bo` (in `io`) are large, though there is no actual swap activity (`si` and `so` in `swap` stay at 0), so this load is not caused by RAM exhaustion.
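The reading of the `vmstat` columns above can be sketched in Python as follows.  This is an illustration rather than part of the original page: the function names `parse_vmstat` and `diagnose`, and the 30% `wa` threshold, are assumptions picked for the example, not established cut-offs.  The sample text is the output shown above, minus the shell prompts.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  0     56   6944 238140 106932    0    0    15    44  260   54  4  1 93  2
 1  0     56   4896 252776  94608    0    0  7916     0  253  204  9 58 24  9
 1  1     56   4644 253984  93612    0    0 10624  2800  253  264 12 80  0  8
 1  0     56   4632 252984  93612    0    0  5376  5488  253  261  3 52  0 45
 1  0     56   5548 252528  93416    0    0  4224  4144  253  263  3 44  0 53
"""


def parse_vmstat(text):
    """Return one dict per sample row, keyed by vmstat's column names."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # The column-name row is the one that starts with "r  b".
    header = next(i for i, ln in enumerate(lines)
                  if ln.split()[:2] == ["r", "b"])
    names = lines[header].split()
    rows = []
    for ln in lines[header + 1:]:
        fields = ln.split()
        # Skip anything that is not a purely numeric sample row.
        if len(fields) != len(names) or not all(f.isdigit() for f in fields):
            continue
        rows.append(dict(zip(names, map(int, fields))))
    return rows


def diagnose(row, wa_threshold=30):
    """Classify one sample; wa_threshold is an arbitrary example value."""
    if row["si"] or row["so"]:
        return "swapping"   # disk traffic caused by RAM exhaustion
    if row.get("wa", 0) >= wa_threshold:
        return "io-bound"   # CPU is mostly waiting on the disk
    return "ok"


if __name__ == "__main__":
    samples = parse_vmstat(SAMPLE)
    # The first row is an average since boot, so skip it.
    print([diagnose(r) for r in samples[1:]])
    # prints ['ok', 'ok', 'io-bound', 'io-bound']
```

Checking `si` and `so` before `wa` mirrors the advice above to rule out RAM exhaustion first.  Using `row.get("wa", 0)` lets the sketch tolerate output from older kernels that lack the `wa` column.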
     149 
    127150== Network ==
    128151