[wplug] RAID performance
Michael Skowvron
michaels at penguincentral.org
Sat Mar 3 17:55:32 EST 2007
Tagged Command Queuing:
SATA-I does not have NCQ.
NCQ and CTQ/TCQ on SCSI are designed to allow the device to take a group
of unrelated requests and reorder them so that they can be serviced most
efficiently. Multiple requests that target the same area of the disk
will be serviced together, possibly ahead of other requests, so that
head movement is minimized.
In our simple test of streaming reads and streaming writes, there is
nothing to reorder, so command queuing will have no effect.
Command queuing is a benefit for multiple processes doing simultaneous
reads and writes. Running 3 or 4 simultaneous lmdd sessions _might_
allow the NCQ SATA-II drives to pull further ahead in overall throughput.
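That kind of multi-stream workload is easy to sketch. Below is a rough
Python stand-in for running several simultaneous lmdd-style readers;
the file size, block size, and session count are made up for
illustration (a real test would use files much larger than RAM, or raw
devices, to keep the page cache honest):

```python
# Several threads streaming through the same file from different
# offsets, so the drive sees concurrent, reorderable requests.
# Sizes here are illustrative only.
import os, tempfile, threading, time

SIZE = 8 * 1024 * 1024   # small 8M test file (a real test wants GBs)
BS = 64 * 1024           # 64K reads, like a dd/lmdd bs= setting
NSESSIONS = 4

path = os.path.join(tempfile.mkdtemp(), "testfile")
with open(path, "wb") as f:
    f.write(os.urandom(SIZE))

totals = [0] * NSESSIONS

def reader(i):
    # Start each session at a different offset and stream to EOF
    with open(path, "rb") as f:
        f.seek(i * (SIZE // NSESSIONS))
        while True:
            buf = f.read(BS)
            if not buf:
                break
            totals[i] += len(buf)

t0 = time.time()
threads = [threading.Thread(target=reader, args=(i,)) for i in range(NSESSIONS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - t0
print("aggregate MB/s:", sum(totals) / elapsed / 1e6)
```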
RAID-5 parity calculation:
The read-modify-write penalty is something that is often overlooked, yet
has a big impact on performance. RAID-5 parity is calculated in what I
will call "stripes", but there are other terms that mean the same. If a
4 drive RAID5 (also known as a 3+1) has a stripe size of 64K, the RAID
works in data chunks that are 3X 64K in size. The parity is calculated
on the 192K chunk to generate another 64K of data and the 256K chunk is
written as 4X 64K chunks, one to each drive.
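For a concrete picture, here is that XOR arithmetic in Python, using
the 3+1 / 64K geometry from the example (the data itself is just
random filler):

```python
# RAID-5 parity for a 3+1 array with 64K chunks, as described above.
import os

CHUNK = 64 * 1024  # 64K chunk per drive

# Three 64K data chunks make up one 192K data stripe
d0, d1, d2 = (os.urandom(CHUNK) for _ in range(3))

# The fourth 64K chunk is the byte-wise XOR of the other three
parity = bytes(a ^ b ^ c for a, b, c in zip(d0, d1, d2))

# The array writes 4x 64K: d0, d1, d2, parity -- one chunk per drive.
# Any single lost chunk can be rebuilt by XOR-ing the remaining three:
rebuilt = bytes(a ^ p ^ c for a, p, c in zip(d0, parity, d2))
assert rebuilt == d1
```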
The read-modify-write penalty comes from the fact that the RAID must
always work on a full stripe worth of data to calculate the parity. If
the operating system issues a 16K write to the RAID, the RAID must read
an entire 192K stripe into memory, perform the parity calc, and then
write the data back out. It doesn't have to write the full 256K. It only
has to write back the 64K chunk where the 16K of data was modified and
write a new 64K chunk of parity data.
If your system is doing a lot of 16K I/Os, the RAID is going to be
potentially reading 192K for every 16K write. This is the
read-modify-write penalty. The OS won't see it, but the RAID will be
doing a lot of reads (on the back-end) for every write. Write
performance will be severely degraded and will show up as long service
and queue times in 'iostat -x'.
Ideally, the kernel would issue 192K of sequential data every time it
does a write. When the RAID sees this, it's supposed to recognize that
_all_ the data in a stripe is going to get modified and know that it
doesn't have to read anything. It just takes the 192K write, calculates
the parity and writes the whole stripe out.
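The difference between the two cases can be put into a
back-of-the-envelope model. This follows the full-stripe-read behavior
described above (a real controller may instead read only the old data
chunk and old parity for small writes, but the numbers make the same
point):

```python
# Simple model of back-end traffic for one aligned write to a 3+1
# RAID-5 with 64K chunks. Purely illustrative arithmetic.

CHUNK = 64 * 1024
DATA_DISKS = 3
STRIPE = CHUNK * DATA_DISKS  # 192K of data per parity stripe

def backend_io(write_size):
    """Return (bytes_read, bytes_written) on the back end for one
    stripe-aligned write of the given size."""
    if write_size % STRIPE == 0:
        # Full-stripe write: nothing to read, write data plus parity
        return 0, write_size + (write_size // STRIPE) * CHUNK
    # Partial-stripe write: read the whole stripe to recompute parity,
    # then write back the modified chunk(s) plus the new parity chunk
    chunks_touched = -(-write_size // CHUNK)  # ceiling division
    return STRIPE, (chunks_touched + 1) * CHUNK

print(backend_io(16 * 1024))   # 16K write: read 192K, write 128K
print(backend_io(192 * 1024))  # full stripe: read 0, write 256K
```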
It should work this way when using software RAID. The kernel knows all
about the stripe width, so it should always be trying to coalesce writes
into efficient stripe-sized chunks. For the 3Ware RAID controller, the
kernel driver provides the preferred I/O size information to the block
layer. Unfortunately, for an external stand-alone SCSI or FibreChannel
RAID, this information is almost never communicated back to the kernel.
If the kernel falls back on its default of doing 16K I/O, a nice,
expensive, enterprise-class RAID could end up looking very unimpressive.
To minimize the read-modify-write penalty, the preferred I/O size for
the RAID should be as closely matched to the expected write sizes as
possible. If you're going to be writing big data files only, go for a
big stripe size (256K) on your RAID. It's more efficient. If you're
going to run a database, and a row update is going to be less than 2K, a
large 256K RAID stripe could kill database responsiveness.
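Some rough arithmetic makes the point. Using the same whole-stripe
read-modify-write model as above, the back-end data moved for a 2K row
update is (chunk sizes and the 2K update are illustrative):

```python
# Back-of-the-envelope write amplification for a small database update
# on a 3+1 RAID-5, comparing 64K and 256K per-drive chunk sizes.

def moved_bytes(write_size, chunk, data_disks=3):
    """Back-end bytes moved for one sub-stripe write: read the full
    data stripe, write back one data chunk plus the parity chunk."""
    stripe = chunk * data_disks
    return stripe + 2 * chunk

update = 2048  # a 2K row update
for chunk in (64 * 1024, 256 * 1024):
    moved = moved_bytes(update, chunk)
    print(f"{chunk // 1024}K chunks: {moved // 1024}K moved, "
          f"{moved // update}x amplification")
```

With 64K chunks the array moves 320K (160x the data that actually
changed); with 256K chunks it moves 1280K (640x).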
RAID stripe alignment:
Just knowing the stripe size and sending the right amount of data isn't
enough; the data also has to be aligned. If we send the RAID in
the previous example 192K of sequential data, but the specific blocks
are straddled across two RAID parity stripes, we hit the
read-modify-write problem again. This time the RAID has to take the 192K
that it was given, break it into two separate chunks, read 384K of data,
calculate parity on both, and write both back out. The RAID does 2x the
work for every I/O you send.
To avoid the alignment problem, the first block on the filesystem must
be aligned on the first block of a RAID stripe. For a 3+1 RAID5 with 64K
stripe, the start of the second stripe would be 192K into the volume at
block number 384. The partition for the filesystem must start at exactly
block number 384. I don't know what partitioning tools allow you to work
in blocks. The last time I used parted, for example, it only wanted to
work in megabytes and gigabytes, so it would be impossible to properly
align a filesystem.
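The alignment arithmetic itself is simple. A sketch for the 3+1 / 64K
example, assuming 512-byte blocks (block 63 is the classic DOS-style
partition start, included here as a known-bad case):

```python
# Checking whether a partition start lands on a RAID stripe boundary,
# in 512-byte blocks, for the 3+1 RAID-5 with 64K chunks from above.

BLOCK = 512
CHUNK = 64 * 1024
DATA_DISKS = 3

stripe_blocks = CHUNK * DATA_DISKS // BLOCK  # 192K stripe = 384 blocks

def aligned(start_block):
    """True if the partition begins exactly on a stripe boundary."""
    return start_block % stripe_blocks == 0

print(stripe_blocks)   # 384
print(aligned(384))    # True: starts exactly on the second stripe
print(aligned(63))     # False: old DOS-style default, misaligned
```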
Does anyone think that I/O performance tuning would make for a good
session at a meeting?
Michael