[wplug] Thoughts on a Storage System

Thu May 21 10:46:31 EDT 2009

On Thu, May 21, 2009 at 10:07 AM, Michael Skowvron
<michaels at penguincentral.org> wrote:
> Having said that, how is your particular application going to be architected?

Right now, our application is distributed across 5-10 Windows PCs.
The nodes that process the most data are the storage nodes, because
they also have the most RAM and processor cores, but some nodes are
just workstations not being used by anyone else.

We would consider porting the job application to Linux if the storage
nodes were Linux.

> From what you've described earlier, you already have a handful of 1U and 2U file servers,
> but they are independent and you now want a unified namespace.

It's not that we want a unified namespace so much as that we want the
data to be stored more efficiently.  That is, the data should be
balanced across nodes so that utilization is optimized.  As an added
bonus, it would be nice to have the nodes doing processing look for
data that is stored locally rather than take whatever is next in the
queue.
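
To make the "look for local data first" idea concrete, here is a
minimal sketch.  The job layout (a dict with "id" and "input" keys)
and the function name next_job are made up for illustration; this is
not how our current application works.

```python
from collections import deque


def next_job(queue, local_files):
    """Return the first queued job whose input is stored on this node.

    Falls back to the head of the queue when nothing local is waiting,
    and returns None when the queue is empty.
    """
    for i, job in enumerate(queue):
        if job["input"] in local_files:
            del queue[i]          # remove the local job from the queue
            return job
    return queue.popleft() if queue else None
```

A node would call next_job(queue, files_on_this_node) each time it
becomes idle, so remote reads only happen when no local work is left.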

We started out looking at Hadoop, and still might set up a test
implementation, but it didn't seem like a perfect fit.  It's based
around a queue of jobs waiting to execute, each working towards
reducing the data to a result set.  And if you don't want to write
everything in Java (we use C++), you have to do everything with pipes.
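
For anyone who hasn't seen the pipe approach: with Hadoop Streaming
the mapper is just an executable that reads records on stdin and
writes tab-separated key/value lines on stdout.  A rough sketch of
that contract follows, in Python for brevity (a C++ binary would
follow the same stdin/stdout convention).  The "key,value" input
format is an assumption made up for the example.

```python
import sys


def map_lines(lines):
    """Turn 'key,value' input lines into 'key<TAB>value' output lines."""
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if len(fields) < 2:
            continue              # skip malformed records
        yield f"{fields[0]}\t{fields[1]}"


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Entry point for use as a mapper: stream stdin to stdout.
    for out in map_lines(stdin):
        stdout.write(out + "\n")
```

Calling main() at the bottom of the script is all it takes to turn
this into a mapper, which also shows the limitation: everything the
job knows has to pass through those two text streams.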

What we're really looking for is an architecture that allows us to
scale up more efficiently.  It's not worthwhile for us to add more
dedicated nodes before we really need their disk space, because they
would be underutilized.  The data isn't replicated and it isn't
spread around.  That's what we want to fix.
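
As a rough illustration of what "replicated and spread around" could
mean, here is a sketch that puts each new file on the least-utilized
nodes that still have room.  The node table layout, the default of
two replicas, and the name place_replicas are all assumptions for the
example, not a description of any existing system.

```python
def place_replicas(nodes, size, replicas=2):
    """Pick the `replicas` least-utilized nodes that can hold `size` bytes.

    nodes maps a node name to {"used": bytes, "capacity": bytes}.
    The chosen nodes' "used" counters are updated in place.
    """
    candidates = [name for name, n in nodes.items()
                  if n["used"] + size <= n["capacity"]]
    if len(candidates) < replicas:
        raise ValueError("not enough nodes with free space")
    # Rank by the utilization each node would have after the write.
    candidates.sort(
        key=lambda name: (nodes[name]["used"] + size) / nodes[name]["capacity"])
    chosen = candidates[:replicas]
    for name in chosen:
        nodes[name]["used"] += size
    return chosen
```

Ranking by fractional utilization rather than free bytes means small
workstations and large storage nodes fill up at about the same rate.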