[wplug] Thoughts on a Storage System

Wed May 20 10:20:21 EDT 2009

I'm going to play with GlusterFS.  I'll have a review sometime this summer.  http://gluster.org/

It looks like you can build a very simple system and grow it into a cluster.  With 2 nodes you can set up something close to RAID 10.

GlusterFS is a cluster file-system capable of scaling to
several peta-bytes. It aggregates various storage bricks over
Infiniband RDMA or TCP/IP interconnect into one large parallel network
file system. GlusterFS is based on a stackable user space design
without compromising performance. 

________________________________
From: Michael Semcheski <mhsemcheski at gmail.com>
To: General user list <wplug at wplug.org>
Sent: Tuesday, May 19, 2009 1:32:29 PM
Subject: [wplug] Thoughts on a Storage System

Hey All,

I'm currently looking at some different options for providing lots of
storage to a few applications.  The applications are in development
and use in-house.  We analyze video data, and if you record a lot of
video, you need a lot of disk space.  We have other applications that
may be coming online in a few months that could use the space too.
Typically though, we have no need to access the data via the file
system - doing everything via API would be a-ok.

I've been spending some time here and there to see if there are any
compelling open source projects that we should trial.  Hadoop is
definitely on the radar, but it's not a perfect fit.  It seems designed
for Java, and though it supports C++, I'm not sure if there's
first class support.  Also, there's a fad for things that scale up to
thousands of nodes.  That has its place, but many of those systems don't
scale down to three or four nodes as well.  CAStor, from Caringo, seems
excellent, but it blows our budget and our cost per GB.

Anyway, here are the requirements I've come up with so far:

   1. Able to put, get, and delete data.
   2. Able to run efficiently on as few as three nodes.
   3. If there are multiple nodes, be able to use capacity for
redundancy / failover / balancing.
   4. Each additional node adds to the total storage available.
   5. Good linear read / write performance.
   6. Must be able to recover from multi-node failure.
   7. System stays online if 15% of the nodes are offline.
   8. Data can be deleted - we do not have unlimited storage space.
   9. Able to scale up to many TB of total space. (i.e., we currently
have about 10TB, but could easily get another 10TB in the next year.)
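
To make requirements 1, 3, 6 and 7 a bit more concrete, here is a rough
sketch of the put / get / delete interface with simple replication across
a handful of nodes. It is an in-memory toy, not any real product's API:
the TinyStore class, its replicas parameter, and the ring-style placement
are all assumptions made up for illustration, and it ignores re-syncing a
node that missed writes while it was offline.

```python
import hashlib


class TinyStore:
    """Hypothetical in-memory stand-in for a small replicated object store."""

    def __init__(self, node_names, replicas=2):
        if replicas > len(node_names):
            raise ValueError("replicas cannot exceed node count")
        self.nodes = {name: {} for name in node_names}  # node -> {key: bytes}
        self.order = sorted(node_names)
        self.replicas = replicas
        self.offline = set()  # names of nodes currently treated as down

    def _placement(self, key):
        # Hash the key to pick a starting node, then take the next
        # nodes in ring order so each key lives on `replicas` nodes.
        start = int(hashlib.sha1(key.encode()).hexdigest(), 16) % len(self.order)
        return [self.order[(start + i) % len(self.order)]
                for i in range(self.replicas)]

    def put(self, key, data):
        live = [n for n in self._placement(key) if n not in self.offline]
        if not live:
            raise IOError("no live replica for " + key)
        for n in live:
            self.nodes[n][key] = data

    def get(self, key):
        # Read from the first live replica that holds the key.
        for n in self._placement(key):
            if n not in self.offline and key in self.nodes[n]:
                return self.nodes[n][key]
        raise KeyError(key)

    def delete(self, key):
        # Drop the key from every replica so the space is reclaimed.
        for n in self._placement(key):
            self.nodes[n].pop(key, None)


def usable_tb(node_tb, replicas):
    """Usable capacity when every object is stored `replicas` times."""
    return sum(node_tb) / replicas
```

With three nodes and replicas=2, any single node can be offline and every
key still has one live copy, and each added node raises usable_tb by its
size divided by the replica count.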

And these are a few of the "wishlist" items I've come up with - not
requirements but they would be neato:

   1. Support for different tiers of storage. (i.e., the most recent data
goes to the faster tier, and is moved to the slower tier over time.)
   2. Integration with a distributed job processing system.
   3. Support for local clients. (i.e., clients can cache a portion of
the data for offline or disconnected operation.)
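
The tiering item (wishlist 1) could be as simple as an age-based rule. A
minimal sketch, assuming a hypothetical 30-day "hot" window and made-up
tier names "fast" and "slow"; a real system would likely also weigh access
frequency and free space:

```python
from datetime import datetime, timedelta


def pick_tier(recorded_at, now, hot_window=timedelta(days=30)):
    """Return "fast" for data newer than hot_window, otherwise "slow"."""
    return "fast" if now - recorded_at <= hot_window else "slow"


def keys_to_demote(recorded, now, hot_window=timedelta(days=30)):
    """List keys (sorted) whose data has aged out of the fast tier.

    `recorded` maps each key to the datetime its data was recorded.
    """
    return sorted(key for key, ts in recorded.items()
                  if pick_tier(ts, now, hot_window) == "slow")
```

A periodic job could call keys_to_demote and move those objects to the
slower disks.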

Anybody have any thoughts?  Know of something that does what we want,
or something similar?

Thanks,

Mike
_______________________________________________
wplug mailing list
wplug at wplug.org
http://www.wplug.org/mailman/listinfo/wplug