[wplug] RAID performance

Michael Skowvron michaels at penguincentral.org
Sat Mar 3 17:55:32 EST 2007


Tagged Command Queuing:
SATA-I does not have NCQ.

NCQ, and TCQ on SCSI, are designed to allow the device to take a group 
of unrelated requests and reorder them so that they can be serviced most 
efficiently. Multiple requests that target the same area of the disk 
will be serviced together, possibly ahead of other requests, so that 
head movement is minimized.

In our simple test of streaming reads and streaming writes, there is 
nothing to reorder so command queueing will have no effect.

Command queuing is a benefit for multiple processes doing simultaneous 
reads and writes. Running 3 or 4 simultaneous lmdd sessions _might_ 
allow the NCQ SATA-II drives to pull further ahead in overall throughput.
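 
If you want to experiment with that, here is a rough sketch in Python of 
the same idea -- several threads each doing a streaming read from a 
different region of the disk, so the drive has a queue of outstanding 
requests it could reorder. The device path, session count, and sizes are 
just placeholders; point it at a scratch disk or a big file of your own 
(and run as root if you use a raw device).

    import threading

    TEST_PATH = "/dev/sdb"     # placeholder: a scratch disk or large file
    SESSIONS = 4               # number of simultaneous streaming readers
    CHUNK = 1 << 20            # 1 MB per read()
    TOTAL = 256 << 20          # 256 MB read per session

    def reader(offset):
        # Stream TOTAL bytes sequentially, starting at a per-thread offset.
        with open(TEST_PATH, "rb", buffering=0) as f:
            f.seek(offset)
            remaining = TOTAL
            while remaining > 0:
                data = f.read(min(CHUNK, remaining))
                if not data:
                    break
                remaining -= len(data)

    threads = [threading.Thread(target=reader, args=(i * TOTAL,))
               for i in range(SESSIONS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()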



RAID-5 parity calculation:
The read-modify-write penalty is something that is often overlooked, yet 
has a big impact on performance. RAID-5 parity is calculated in what I 
will call "stripes", but there are other terms that mean the same thing. 
If a 4-drive RAID5 (also known as a 3+1) has a stripe size of 64K, the 
RAID works in data chunks that are 3x 64K in size. The parity is 
calculated over that 192K of data to generate another 64K, and the 
resulting 256K is written as 4x 64K chunks, one to each drive.
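 
To make the arithmetic concrete, here is a toy Python sketch of that 
layout (3 data chunks + 1 parity chunk, 64K each). It is purely an 
illustration of the XOR parity idea, not how any particular controller 
is implemented.

    CHUNK = 64 * 1024

    def parity(chunks):
        # RAID-5 parity is the byte-wise XOR of the data chunks in a stripe.
        p = bytearray(CHUNK)
        for c in chunks:
            for i, b in enumerate(c):
                p[i] ^= b
        return bytes(p)

    data_chunks = [bytes([d]) * CHUNK for d in (1, 2, 3)]   # 3 x 64K of data
    stripe = data_chunks + [parity(data_chunks)]            # 4 x 64K, one chunk per drive
    assert sum(len(c) for c in stripe) == 256 * 1024        # 192K data + 64K parity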

The read-modify-write penalty comes from the fact that the RAID must 
always work on a full stripe's worth of data to calculate the parity. If 
the operating system issues a 16K write to the RAID, the RAID must read 
the entire 192K stripe into memory, perform the parity calculation, and 
then write the data back out. It doesn't have to write the full 256K; it 
only has to write back the 64K chunk where the 16K of data was modified 
and write a new 64K chunk of parity data.

If your system is doing a lot of 16K I/Os, the RAID is potentially 
reading 192K for every 16K write. This is the read-modify-write penalty. 
The OS won't see it, but the RAID will be doing a lot of reads on the 
back end for every write. Write performance will be severely degraded 
and will show up as long service and queue times in 'iostat -x'.

Ideally, the kernel would issue 192K of sequential data every time it 
does a write. When the RAID sees this, it's supposed to recognize that 
_all_ the data in a stripe is going to get modified and know that it 
doesn't have to read anything. It just takes the 192K write, calculates 
the parity, and writes the whole stripe out.
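 
The controller (or the md layer) can only take that shortcut when a 
write covers whole stripes exactly. Here is a minimal sketch of that 
decision for the 16K and 192K cases above, under the same 3+1 / 64K 
assumptions, with the partial-write case simplified to a write that 
stays inside one 64K chunk (and assuming the write is stripe-aligned -- 
see the next section):

    def backend_io(write_size, chunk=64 * 1024, data_disks=3):
        # Returns (bytes read, bytes written) on the back end for one write.
        stripe_data = chunk * data_disks
        if write_size > 0 and write_size % stripe_data == 0:
            # Full-stripe write(s): nothing to read, just write data + parity.
            stripes = write_size // stripe_data
            return 0, stripes * (stripe_data + chunk)
        # Partial write: read the whole stripe back in, then rewrite the
        # modified data chunk plus the new parity chunk.
        return stripe_data, 2 * chunk

    print(backend_io(192 * 1024))   # (0, 262144)      -- full stripe, no reads
    print(backend_io(16 * 1024))    # (196608, 131072) -- the read-modify-write case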

It should work this way when using software RAID. The kernel knows all 
about the stripe width, so it should always be trying to coalesce writes 
into efficient stripe-sized chunks. For the 3Ware RAID controller, the 
kernel driver provides the preferred I/O size information to the block 
layer. Unfortunately, for an external stand-alone SCSI or FibreChannel 
RAID, this information is almost never communicated back to the kernel. 
If the kernel falls back on its default of doing 16K I/O, a nice, 
expensive, enterprise-class RAID could end up looking very unimpressive.
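 
On Linux you can at least check what the kernel thinks the preferred I/O 
geometry is. A small sketch, assuming a kernel recent enough to export 
these sysfs attributes (older kernels won't have all of them) and that 
the RAID volume shows up as /dev/sdb:

    DEV = "sdb"   # placeholder device name

    def queue_attr(attr):
        path = "/sys/block/%s/queue/%s" % (DEV, attr)
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError:
            return "n/a"

    for attr in ("minimum_io_size", "optimal_io_size", "max_sectors_kb"):
        print(attr, "=", queue_attr(attr))

If optimal_io_size comes back as 0 or is missing, the kernel has no idea 
what the RAID's stripe geometry is, which is exactly the situation 
described above.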

To minimize the read-modify-write penalty, the preferred I/O size for 
the RAID should be matched as closely as possible to the expected write 
sizes. If you're only going to be writing big data files, go for a big 
stripe size (256K) on your RAID; it's more efficient. If you're going to 
run a database, and a row update is going to be less than 2K, a large 
256K RAID stripe could kill database responsiveness.


RAID stripe alignment:
Just knowing the stripe size and sending the right amount of data isn't 
enough. The data also has to be aligned. If we send the RAID from the 
previous example 192K of sequential data, but the specific blocks 
straddle two RAID parity stripes, we hit the read-modify-write problem 
again. This time the RAID has to take the 192K that it was given, break 
it into two separate chunks, read 384K of data, calculate parity on 
both, and write both back out. The RAID does 2x the work for every I/O 
you send.
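 
A quick way to see whether a given I/O straddles a stripe boundary, 
using the same 3+1, 64K-chunk layout (offsets are byte offsets from the 
start of the RAID volume):

    def straddles_stripe(offset, length, chunk=64 * 1024, data_disks=3):
        # True if the I/O touches more than one parity stripe.
        stripe_data = chunk * data_disks          # 192K of data per stripe
        first = offset // stripe_data
        last = (offset + length - 1) // stripe_data
        return first != last

    print(straddles_stripe(0, 192 * 1024))          # False: exactly one full stripe
    print(straddles_stripe(64 * 1024, 192 * 1024))  # True: split across two stripes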

To avoid the alignment problem, the first block of the filesystem must 
be aligned on the first block of a RAID stripe. For a 3+1 RAID5 with a 
64K stripe, the start of the second stripe would be 192K into the 
volume, at block number 384 (assuming 512-byte blocks). The partition 
for the filesystem must start at exactly block number 384. I don't know 
which partitioning tools allow you to work in blocks. The last time I 
used parted, for example, it only wanted to work in megabytes and 
gigabytes, so it would be impossible to properly align a filesystem.
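 
The arithmetic for picking the partition start is simple, though, if 
your tool lets you specify it. A sketch in 512-byte sectors, same 
3+1 / 64K layout (the old DOS-style default of starting a partition at 
sector 63 is exactly the kind of thing that breaks alignment):

    SECTOR = 512

    def aligned_start(min_start_sector, chunk=64 * 1024, data_disks=3):
        # Round the earliest usable sector up to the next stripe boundary.
        stripe_sectors = (chunk * data_disks) // SECTOR   # 384 sectors per data stripe
        return ((min_start_sector + stripe_sectors - 1)
                // stripe_sectors) * stripe_sectors

    print(aligned_start(63))    # 384 -- sector 63 rounded up to the stripe boundary
    print(aligned_start(384))   # 384 -- already on a boundary

I believe newer parted releases can be switched to sector units ("unit 
s" at the prompt), which would make this kind of alignment possible.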



Does anyone think that I/O performance tuning would make for a good 
session at a meeting?

Michael


