[wplug] Install Question -- upstream (RHEL 4) initrd/support bug

Bryan J. Smith b.j.smith at ieee.org
Wed Jul 25 16:05:10 EDT 2007


On Wed, 2007-07-25 at 15:16 -0400, Eli Heady wrote:
> Can you expand on this?

Yes, I'll try to be brief.

- Short Answer ...

A single byte error at the right point of an MD volume's meta-data can
_destroy_ everything atop it -- including the _entire_ LVM structure.

A single byte error at various points of LVM meta-data will _only_
destroy those slices/volumes managed by those meta-data entries.

It's all about "localization."

If you make 1 MD volume _per_ set of LVM slices, then you're localizing
the impact to those volumes.

If you make 1 MD volume for the _entire_ disk, then if that MD volume
fails, the _whole_ disk fails!
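
For illustration, a rough sketch of both layouts with mdadm and LVM2
commands.  The device names, sizes and volume group names below are
just placeholders, not a recommendation:

  # "one big MD" -- a single MD meta-data failure takes out everything
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  pvcreate /dev/md0
  vgcreate vg_all /dev/md0
  lvcreate -L 10G -n root vg_all
  lvcreate -L 20G -n home vg_all

  # one MD _per_ set of LVM volumes -- damage stays localized
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
  pvcreate /dev/md0 /dev/md1
  vgcreate vg_sys /dev/md0
  vgcreate vg_data /dev/md1
  lvcreate -L 10G -n root vg_sys
  lvcreate -L 20G -n home vg_data

Lose the meta-data on /dev/md1 in the second layout and only vg_data is
at risk.  Lose it on /dev/md0 in the first and everything above it goes
with it.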

- Long Answer ...

First off, the biggest issue I have with MD is not MD itself, it's with
disk labels and slices.  MD does _nothing_ to "hide" the "raw" slices
(partitions).  This causes many things to touch them as if they were
Ext3.  The "raid autodetect" slice type (0xfd) helps a bit at boot, but
it still has the same issue of being "seen" as Ext3 at times.  MD also
doesn't offer any redundant meta-data.

[ SIDE NOTE:  This isn't merely limited to Linux/UNIX.  It was also the
issue with NT 3/4 volume sets.  They were directly on the disk label and
slices. ]

Logical Volume Management (on Linux, HP/UX, Solaris and other systems),
on the other hand, is a _full_ "disk label," not a filesystem or "raw
slices with meta-data."  It can_not_ and will not be mistaken for a
filesystem.  It can still be "messed up" by software/firmware looking at
slices, but it's much less likely to be.  It also has select points of
redundancy and recovery for its own meta-data.

[ SIDE NOTE:  This was also the driver for NT 5+ (2000+)'s Logical Disk
Manager (LDM) disk label as well (aka a "Dynamic Disk").  LDM also has a
half-way decent journaling system which can be "rolled back."  This is
to deal with various _core_design_flaws_ with NTFS (which actually have
to do with the Security IDs in the Security Accounts Manager (SAM) --
which is in the local registry as well as Windows domains, long
story). ]

LVM is also required for non-x86 platforms in many cases, and goes in
their native disk label.  For most people only used to PC BIOS, LVM
doesn't seem to make any sense.  For those people used to non-PC, LVM
usage almost seems "mandatory" -- especially on non-Linux.  ;)

[ SIDE NOTE:  For non-PC BIOS, NT required the ARC firmware and a small
FAT16 filesystem to "emulate" this startup.  With NT 5+, LDM is now used
in the native disk label of the architecture -- such as Intel 64-bit EFI
on IA-64. ]

But probably the "greatest testament" to OS volume management is the
legacy BSD disk label.  In fact, the lack of a Linux disk label before
LVM really used to cause me grief compared to BSD (or even SCO UNIX for
that matter) -- especially when you have an HBA/LUN that _limits_ you to
only 8 or even 4 slices because of 32 or 64 LUNs per device (only 4-8
minor numbers per major device).

Putting MD inside of LVM solves a lot of issues with the "raw" disk
label / slices of the platform.  Doing it vice-versa only amplifies
them.  In fact, 99% of the complaints I see about LVM with MD are
because people are making "one big MD" and putting LVM atop of it -- a
single byte error _wipes_out_ the _entire_ LVM.  _Rare_ is it that I've
seen the issue the other way.

> One can find many configuration examples and how-to's 
> online which suggest the use of lvm on software raid.

Yes, it's mainly for "ease of use."  Having to create an MD for each
and every set of LVM logical volumes gets tiresome.

But it's far more reliable to put an MD in LVM than vice-versa (as
above).  This is not just a Linux thing -- the same practice is common
on Solaris.  Nearly all UNIX platforms have their own "disk label" for
volume management.

Note you can also find many "common viewpoint" documents on ...  
- Tarball compress a backup, then image it to ISO9660
- How 3Ware cards can offer hot-swap on JBOD
- How hardware RAID should never be used
- How hardware RAID is slower
- Etc...

Just to answer those ...

- This has got to be the worst backup strategy.  If you compress a
backup and then put it on ISO9660, a _single_byte_error_ can render
your _entire_ backup _useless_ from that point forward on the _entire_
CD.  Error rates of CD-RW and DVD-RW/+RW (although not DVD-RAM, long
story) writes are horrendous (1 in 10^9 -- 1 per 1GB).  But people do
this for "ease of use."  It's more work to use per-file compression in
an archive, although programs like "afio" do exactly that (see the
sketch after this list).

- 3Ware cards _only_ offer hot-swap when the controller manages the RAID
volume, _not_ with JBOD and software RAID (because the system/OS _never_
touches the "end storage device" -- unless it's JBOD).  3Ware gets a
"bad name" because far too many MD people ignorantly repeat the claim
that 3Ware supports hot-swap in JBOD mode -- it has nothing to do with
the card, but with how the volume is presented to the kernel.  If you
use 3Ware's controller to manage your volume, you don't have to do any
kernel hot-plug, ATA or other "disconnect" notification.  If you don't,
and use JBOD, you now have to inform the kernel that a device is "no
longer there," because the volume's hardware availability isn't
controlled by the microcontroller (see the second sketch after this
list).  But people assume otherwise because they don't want to deal
with the "software management" aspects of hot-swap -- which really only
work over the SCSI-2 protocol (including SAS) anyway; libata's
"disconnect" is still being worked on for ATA/SATA -- but then that's
exactly why you should _take_advantage_ of 3Ware's volume management.

- DeviceMapper can actually now map many hardware RAID volumes on "dumb"
ATA channels, including many 3Ware volumes which are openly documented.
Many other hardware RAID volumes are also documented, including more of
the Intel IOP reference (Areca) volumes too.

- A 300-500MHz+ I/O Processor with ASIC XOR units can _cream_ a 3GHz
dual-core x86-64 at "in-line/real-time" XOR data transfer rates (DTR),
all without pegging your CPU-I/O interconnect the way host-based XOR
does -- where your CPU reads only 10% busy because the entire
interconnect is 100% utilized.  Most people
benchmark only storage DTR or try to look at CPU utilization (which has
_nothing_ to do with I/O processing), instead of actually benchmarking
their application under load, to see the actual performance difference.
Using SIMD isn't the issue, it's the interconnect approach differences.
After all, you don't use a PC for an Ethernet switch -- same deal for
Storage switching.  ;)

- Etc...
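
To make the backup point concrete, here is roughly the difference (the
paths and archive names are just placeholders -- check the tar/afio man
pages on your system for the exact options):

  # whole-stream compression: one bad byte can lose everything after it
  tar -cf - /home | gzip -9 > /backup/home.tar.gz

  # per-file compression with afio: a bad byte loses only that one file
  find /home -print | afio -o -v -Z /backup/home.afio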
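
And for the JBOD / software RAID hot-swap point, this is roughly the
manual kernel notification you end up doing yourself.  Device names and
the SCSI host number are placeholders, and the sysfs paths assume a 2.6
kernel presenting the disks as SCSI devices:

  # fail and pull the dying member out of the MD volume
  mdadm /dev/md0 --fail /dev/sdc1
  mdadm /dev/md0 --remove /dev/sdc1

  # tell the kernel the device is gone before physically removing it
  echo 1 > /sys/block/sdc/device/delete

  # after inserting the replacement, rescan the host and re-add
  echo "- - -" > /sys/class/scsi_host/host2/scan
  mdadm /dev/md0 --add /dev/sdc1

With a 3Ware-managed volume, the controller takes care of all of that
for you.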

I'm sure I'm in the _minority_ on this.  But I'm not an "ease-of-use"
type of guy, I'm a "data integrity is ultimate" type of guy.

I've just had too many things "touch" an MD volume and leave me
stranded -- not because of MD itself, but because of how MD sits on the
"raw" disk label / slices.  Now that LVM2 has more and more RAID support
built in, this is much better, removing the requirement for MD volumes
altogether (like "real" UNIX OSes ;-).  Volume management might be a
"foreign" concept to PC/Linux users, but it's lock, stock and barrel for
UNIX, and even PC/BSD users.
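
For example, on a recent LVM2 the mirroring can be requested at the
logical volume level with no MD volume underneath at all.  A minimal
sketch -- volume group and LV names are placeholders, and depending on
the release you may need --corelog (or a third PV for the mirror log):

  # two slices as PVs, one volume group
  pvcreate /dev/sda2 /dev/sdb2
  vgcreate vg0 /dev/sda2 /dev/sdb2

  # "-m 1" asks LVM2 itself for one mirror copy of the logical volume
  lvcreate -m 1 -L 20G -n data vg0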

Of course, something like ZFS (integrated volume + filesystem) is far
more ideal.

Taking it one step further, if you throw hardware requirements at it,
like an NVRAM and a full data journal, then things are really ideal
(especially for NFS sync).  E.g., a NetApp filer with its integrated
Data ONTAP OS, its integrated volume-filesystem concept in WAFL, and its
reliance on hardware NVRAM.

VA Linux used to do something similar with an NVRAM board and Ext3 full
data journaling (as did I, back in the late kernel 2.2.x series, with
Ext3 full data journaling).


-- 
Bryan J. Smith         Professional, Technical Annoyance
mailto:b.j.smith at ieee.org   http://thebs413.blogspot.com
--------------------------------------------------------
        Fission Power:  An Inconvenient Solution


