[wplug] Thoughts on a Storage System

Wed May 20 14:20:10 EDT 2009

Two things I wanted to follow up on:

First, Tom Grove mentioned the issue of backing this puppy up.  That
will definitely be an issue.  LTO4 seems like the obvious media for
the backups to me.  You can get a tape library that holds 48 LTO4
tapes for $10,000.  (48 LTO4 tapes cost about $2500).  That gives you
a library which holds almost 28TB for $12,500, or $0.44 per GB.
$0.44/GB is tough to beat.  Of course you'll need some software to run
the tape library, and depending on what you do there, it can be a
non-trivial cost.
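
As a quick sanity check on that arithmetic, here is a minimal sketch in
Python.  The dollar figures and the 28TB capacity are the ones quoted
above; the function name and the decimal GB-per-TB conversion are
assumptions for illustration, and software and tape drive costs are
left out.

```python
def cost_per_gb(library_usd, tapes_usd, capacity_tb, gb_per_tb=1000):
    """Return (total cost in USD, cost per GB in USD).

    gb_per_tb defaults to decimal units (1 TB = 1000 GB); this is an
    assumption, pass 1024 for binary units.
    """
    total_usd = library_usd + tapes_usd
    capacity_gb = capacity_tb * gb_per_tb
    return total_usd, total_usd / capacity_gb


# Figures from the post: $10,000 library, $2,500 of tapes, ~28TB total.
total, per_gb = cost_per_gb(library_usd=10_000, tapes_usd=2_500, capacity_tb=28)
print(f"total ${total:,}, ${per_gb:.3f} per GB")
```

With decimal units this comes out to roughly $0.446 per GB; with 1024 GB
per TB it is roughly $0.436, so the figure lands near $0.44 either way.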

Second, if I do a thorough evaluation of one or more systems in
pursuit of a solution, I'll make sure to send my results to this list.

Mike

On Wed, May 20, 2009 at 9:25 AM, Duncan Hutty <dhutty at ece.cmu.edu> wrote:
> Michael Semcheski wrote:
>> Hey All,
>>
>> I'm currently looking at some different options for providing lots of
>> storage to a few applications.  The applications are in development
>> and use in-house.  We analyze video data, and if you record a lot of
>> video, you need a lot of disk space.  We have other applications that
>> may be coming online in a few months that could use the space too.
>> Typically though, we have no need to access the data via the file
>> system - doing everything via API would be a-ok.
>>
>> I've been spending some time here and there to see if there are any
>> compelling open source projects that we should trial.  Hadoop is
>> definitely on the radar, but it's not a perfect fit.  It seems designed
>> for Java, and though it supports C++, I'm not sure if there's
>> first class support.  Also, there's a fad for things that scale up to
>> 1000's of nodes.  That has its place, but many of those systems don't
>> scale down to three or four nodes as well.  CAStor, from Caringo seems
>> excellent, but it blows our budget and our cost per GB.
>>
>>
>> Anyway, here are the requirements I've come up with so far:
>>
>>    1. Able to put, get, and delete data.
>>    2. Able to run efficiently on as few as three nodes.
>>    3. If there are multiple nodes, be able to use capacity for
>> redundancy / failover / balancing.
>>    4. Each additional node adds to the total storage available.
>>    5. Good linear read / write performance.
>>    6. Must be able to recover from multi-node failure.
>>    7. System stays online if 15% of the nodes are offline.
>>    8. Data can be deleted - we do not have unlimited storage space.
>>    9. Able to scale up to many TB of total space. (ie, currently we
>> have about 10TB, but could easily get another 10TB in the next year.)
>>
>>
>> And these are a few of the "wishlist" items I've come up with - not
>> requirements but they would be neato:
>>
>>    1. Support for different tiers of storage. (ie, most recent data
>> goes to the faster tier, and is moved to the slower tier over time.)
>>    2. Integration with a distributed job processing system.
>>    3. Support for local clients. (ie, clients can cache a portion of
>> the data for offline or disconnected operation.)
>>
>>
>> Anybody have any thoughts?  Know of something that does what we want,
>> or something similar?
>>
>
> Storage is something that's been on my OneDay list for a while now and
> this is the project that I have at the back of my mind to look into:
> http://www.gluster.org
>
> I'm sure wplug would be grateful if you make a comparison of some of the
> FOSS options, their advantages and disadvantages:)
> --
> Duncan Hutty
> _______________________________________________
> wplug mailing list
> wplug at wplug.org
> http://www.wplug.org/mailman/listinfo/wplug
>