[wplug] Thoughts on a Storage System

Michael Skowvron michaels at penguincentral.org
Thu May 21 10:07:16 EDT 2009


Michael Semcheski wrote:
> On Wed, May 20, 2009 at 1:50 PM, Max Putas <maxblaze at gmail.com> wrote:
>> OpenFiler appears to be fairly capable, although I can't fully vouch
>> for it since I never used it in production. You might want to look at
>> this tutorial for setting up two OpenFiler boxes in a HA
>> configuration:
>
> OpenFiler seems like you're sort of limited to having as much
> available storage as you can attach to a single node.
>
> To me, attaching lots of storage to a single node isn't as good bang
> for the buck as having multiple nodes.  With one node, your IO
> bandwidth is practically limited to what the node's NIC can support.
> If you have many nodes, you might be able to get n-NIC IO.

Without defining the performance and bandwidth requirements of the application, it's hard
to say that one architecture is better than the other. Each one has different strengths
and weaknesses. If your application is distributed across many compute nodes and accesses
data in parallel, then a parallel filesystem is going to be a perfect fit. But if your
application tends to run on a single node, then a single server may provide all of the
bandwidth necessary. This could be true even if you had many clients.

Sometimes it can be more efficient (bang for the buck) to have a single (or few) large
server(s). This is true from an administration standpoint as well as for many workloads. I
would say it's similar to how a system with a single 3GHz processor would probably be
faster on most workloads than having two 2GHz processors. But if you've got a completely
parallel task, the dual-processor system is going to win.

Having said that, how is your particular application going to be architected?

From what you've described earlier, you already have a handful of 1U and 2U file servers,
but they are independent and you now want a unified namespace. Most of the options
suggested would work and solve the problem in different ways. Here's how I think they
stack up.

OpenFiler is the monolithic-server choice, providing file-based access over NFS or Samba.
Not all the storage has to be on a single server, because you can turn your existing 1U
and 2U servers into iSCSI block-based storage serving as the back end for OpenFiler. Drop
in a couple of 10Gb interfaces and it can deliver hundreds of MB/s. It's a simple
architecture built on technologies that are mostly mature, and a simple system to maintain
because everything runs a small appliance-like distribution. You would have to invest in a
couple of switches with 10Gb uplinks. Performance would be limited to what the single
server could shovel, which would be around 400-ish MB/s.
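To give a feel for the back-end half of that architecture, here is a minimal sketch of
exporting a disk from one of the existing servers as an iSCSI target using tgtadm (from
scsi-target-utils). The device path and IQN are made up for illustration; OpenFiler itself
would then log in to these targets and layer NFS/Samba on top.

```shell
# Create iSCSI target #1 with a made-up IQN (adjust to your naming scheme)
tgtadm --lld iscsi --op new --mode target --tid 1 \
    -T iqn.2009-05.org.example:storage.disk1

# Attach a local block device (hypothetical /dev/sdb) as LUN 1
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 \
    -b /dev/sdb

# Allow initiators to connect (restrict by IP/IQN in a real deployment)
tgtadm --lld iscsi --op bind --mode target --tid 1 -I ALL
```

Repeating this on each 1U/2U box gives the OpenFiler head node a pool of network block
devices to aggregate.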

Lustre is a very scalable object-based parallel filesystem. It requires more hardware to
implement and is significantly more complex on the software side, both to get running
and to maintain. It is one of the most popular HPC filesystems because of its performance
and because it's open source. Lustre is not in the mainline kernel, so you'll have to
build it yourself. From what I've heard, it's not the easiest to build because there are a
lot of dependencies. Lustre relies on the RAID of the OSS nodes to protect the data. It is
not tolerant of OSS node failures, so my understanding is that the OSSes are usually
implemented in HA pairs with shared storage. Lustre can deliver very high per-file and
per-client performance and is very scalable. In short, Lustre has the highest performance,
but requires the most hardware, the most expertise, and the most time to implement.
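For a sense of the moving parts involved, here is a hedged sketch of a minimal Lustre
deployment. The hostnames, devices, and filesystem name are all made up; a real setup
would also need the patched kernel, the HA pairing mentioned above, and proper LNET
configuration.

```shell
# On the metadata server (combined MGS + MDT for simplicity):
mkfs.lustre --fsname=testfs --mgs --mdt --index=0 /dev/sda
mount -t lustre /dev/sda /mnt/mdt

# On each object storage server (OSS), one OST per backing device:
mkfs.lustre --fsname=testfs --ost --index=0 --mgsnode=mds1@tcp /dev/sdb
mount -t lustre /dev/sdb /mnt/ost0

# On a compute/client node:
mount -t lustre mds1@tcp:/testfs /mnt/testfs
```

Even this toy layout needs three distinct roles (MGS/MDT, OSS, client), which is where
much of the extra hardware and administration effort comes from.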

Gluster seems to be an interesting project and might be the closest to what you are
looking for. It's a FUSE (user-space) filesystem that aggregates a collection of
independent servers, unifying the files stored on them into a single namespace.
In general, a file is stored in its entirety on a single server. The developers report
that it scales well, but obviously per-file performance can't exceed what a single
server could deliver. Still, it may provide all the performance your application
needs, and it's apparently very easy to get going. Gluster can also replicate files. The
scariest thing about Gluster is that it's very immature. It would be fun to test, but I'm
not sure I'd commit production data to it just yet. Still, it looks like it has a lot of
potential.
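As a rough illustration of how Gluster aggregates independent servers, here is a sketch
using the gluster CLI (available in newer releases; earlier versions used hand-written
volume files instead). The hostnames, volume name, and brick paths are made up: each
existing server contributes a local directory (a "brick") to the unified volume.

```shell
# From server1, join server2 into the trusted pool
gluster peer probe server2

# Create a 2-way replicated volume from one brick on each server
gluster volume create myvol replica 2 \
    server1:/export/brick1 server2:/export/brick1
gluster volume start myvol

# On a client, mount the unified namespace via FUSE
mount -t glusterfs server1:/myvol /mnt/myvol
```

The replica count is what gives you the file replication mentioned above; without it,
files are simply distributed whole across the bricks.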

Another possible solution is Cleversafe (http://www.cleversafe.org/). This is a dispersed
data solution mostly targeted at spreading data across geographic locations. The data is
encoded (with Reed-Solomon, I think) and encrypted and distributed across many storage
nodes. It's highly tolerant of the loss of nodes, and the amount of encoding is adjustable.
You can, for example, set it to use 16/10 and tolerate the loss of 6 nodes in 16 or set it
to something like 10/8 to tolerate the loss of 2 nodes in 10. The code has been around a
while and is fairly mature. It's also offered in a commercial version. I have no idea what
kind of performance it would deliver on a local SAN/LAN.
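The n/k dispersal arithmetic above works out simply: with n slices in total and a
threshold of k slices needed to reconstruct the data, the system tolerates the loss of
n - k nodes.

```shell
# Dispersal (information dispersal / erasure coding) parameters:
# n = total slices written, k = slices required to reconstruct,
# so up to n - k nodes can be lost without losing data.
n=16; k=10
echo "16/10 tolerates $((n - k)) lost nodes"   # 6
n=10; k=8
echo "10/8 tolerates $((n - k)) lost nodes"    # 2
```

The trade-off is storage overhead: the 16/10 scheme stores 1.6x the raw data for its
6-node tolerance, while 10/8 stores only 1.25x but tolerates just 2 failures.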

At the end of the day, there seem to be a number of different ways to tackle the problem,
and obviously there are others that haven't even been mentioned. I'm sure we're all
looking forward to hearing about what you end up implementing.

Michael
Solutions Architect, Storage
SGI
