[wplug] Thoughts on a Storage System

Michael Semcheski mhsemcheski at gmail.com
Thu May 21 14:06:12 EDT 2009


On Thu, May 21, 2009 at 1:09 PM, Michael Skowvron
<michaels at penguincentral.org> wrote:
> Do you only mean the utilization of the storage on those nodes, or the processing workload
> also?

So far, utilization of processing resources is good.  The only ways to
improve it would be to add CPU cores, use faster ones, or reduce the
time spent waiting for IO.


> Does this mean you are currently doing some sort of cross-mounting of filesystems so that
> storage nodes can process data that resides on other storage nodes? And if it was
> selective about working on local data, it would obviously run faster.

Storage nodes are really just boxes that have 4 or 8 disks in RAID 5
or RAID 6.  They export a Windows file share, and each share has
between 2 and 6 TB of disk space available.  As it happens, those
servers have relatively new processors and plenty of memory, so they
run the job processing software too.  They can each access data on the
others.

We have an SQL database that maps each piece of data to a server and a
file path on that server.  When we ingest data, we read it into our
own structures, compress it, ask the database for the path to write it
to, write it, then record in the database where it was written.
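As a rough illustration, the ingest flow might look like the sketch below.  The table layout, server names, and paths are all invented; the real schema and the actual write to the Windows share are not shown.

```python
import sqlite3
import zlib

# Simulate the SQL database with an in-memory SQLite instance.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE datasets (name TEXT PRIMARY KEY, server TEXT, path TEXT)")
db.execute("CREATE TABLE write_target (server TEXT, base_path TEXT)")
# Today, operators set the write target by hand.
db.execute("INSERT INTO write_target VALUES ('storage03', '/share/incoming')")

def ingest(name, raw_bytes):
    compressed = zlib.compress(raw_bytes)       # read in and compress
    server, base = db.execute(
        "SELECT server, base_path FROM write_target").fetchone()
    path = f"{base}/{name}.z"                   # ask the DB where to write
    # (the actual write of `compressed` to the file share is omitted here)
    db.execute("INSERT INTO datasets VALUES (?, ?, ?)",
               (name, server, path))            # record where it was written
    return server, path

server, path = ingest("dataset_Y", b"sample payload")
```

A later lookup is then just a SELECT on the `datasets` table by name.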

Then, when a job is told to apply operation X to dataset Y, it gets
the path to Y from the database, reads in Y, applies X, and writes the
result.  (We determine where to write the result from the database in
the same manner as when we were dealing with the input data.)
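That read-apply-write loop can be sketched in a few lines, with the SQL lookups replaced by an in-memory dict and the file shares simulated by another dict; all the names and paths here are made up.

```python
# Toy sketch of the job flow: look up Y, read it, apply X, write the result,
# and record the result's location.  Dicts stand in for the database and shares.
db_paths = {"Y": r"\\storage03\share\Y.dat"}      # dataset name -> location
shares = {r"\\storage03\share\Y.dat": [3, 1, 2]}  # simulated file contents

def run_job(operation, name):
    in_path = db_paths[name]               # get the path to Y from the database
    data = shares[in_path]                 # read in Y
    result = operation(data)               # apply X
    out_path = in_path + ".out"            # where the DB says to write results
    shares[out_path] = result              # write the result
    db_paths[name + "_result"] = out_path  # record where it was written
    return out_path

out = run_job(sorted, "Y")
```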

It's a decent system, but the problem is that the SQL database is
stupid.  We set the location for new data to be written by hand, and
when a server fills up, we update by hand where to start writing data.
One option we're considering is improving the logic in the database,
so that new datasets get spread around better as they come in.  But
that's a more complex solution than it may appear at first.
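One simple version of that placement logic would be to pick the node with the most free space instead of a hand-set target.  The node names, sizes, and safety margin below are invented for illustration:

```python
# Hypothetical free-space table the database would maintain (in TB).
free_space_tb = {"storage01": 0.4, "storage02": 3.1, "storage03": 1.8}

def choose_node(dataset_size_tb, reserve_tb=0.2):
    # Only consider nodes that can hold the dataset plus a safety margin.
    candidates = {n: f for n, f in free_space_tb.items()
                  if f - dataset_size_tb >= reserve_tb}
    if not candidates:
        raise RuntimeError("no storage node has enough free space")
    node = max(candidates, key=candidates.get)   # emptiest eligible node
    free_space_tb[node] -= dataset_size_tb       # the DB would track this
    return node

node = choose_node(1.0)  # -> 'storage02'
```

Even this much raises the harder questions: who keeps `free_space_tb` accurate, and what happens when two ingests race for the same node.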


> Here's where I get confused because you mention that a node can process whatever is next
> in the queue, but here you state that additional nodes aren't effectively utilized unless
> you need their disk space.

I should have differentiated between processing nodes (which are well
utilized) and storage nodes (which are not).  Conceptually they're
different, but they're not necessarily physically different.


> When you ingest new data that is to be processed, how do you
> decide where it is to be stored? If the ingest process spread the data around the storage,
> would that make it possible to utilize new nodes more effectively?

Improving the ingestion would improve the storage utilization, and we
are considering that course of action.  It seems simple until you get
into the details.  To meet all our requirements, it would involve
creating some process that monitors the storage nodes, balances the
data by moving or copying it from one node to another, tells the
database which nodes have space and how much, and so on.
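A rough sketch of what such a balancing process might do, with made-up thresholds and in-memory stand-ins for the database tables:

```python
# When a node drops below a low-water mark, move its largest dataset to the
# emptiest node and update the location records.  All values are illustrative.
free_tb = {"storage01": 0.3, "storage02": 2.5}
datasets = {"A": ("storage01", 0.8), "B": ("storage01", 0.2)}  # name -> (node, TB)

def rebalance(low_water_tb=0.5):
    moves = []
    for node, free in free_tb.items():
        if free >= low_water_tb:
            continue
        local = [(size, name) for name, (host, size) in datasets.items()
                 if host == node]
        if not local:
            continue
        size, name = max(local)                  # largest dataset on this node
        dest = max(free_tb, key=free_tb.get)     # node with the most room
        if dest == node or free_tb[dest] - size < low_water_tb:
            continue
        datasets[name] = (dest, size)            # move it, then tell the DB
        free_tb[node] += size
        free_tb[dest] -= size
        moves.append((name, node, dest))
    return moves

moves = rebalance()
```

The real version would also have to cope with in-flight jobs reading a dataset while it moves, which is where much of the hidden complexity lives.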


> If every node were to have high speed access to a shared filesystem, wouldn't that be the
> most efficient? Any node can run any job and you don't have to waste disk space (and time
> spent copying) on replicated data.

As I see it, replicating the data is a step that would effectively
increase our available IO bandwidth: when a job needs a piece of data,
it has more places to get it from.  Wasting disk space isn't a big
concern, as long as we have some control over how much is wasted and
it's "wasted efficiently".
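If the database recorded every copy of a dataset, a job could read from the least-busy replica.  A minimal sketch, with invented replica and load tables:

```python
# Hypothetical tables: which nodes hold a copy, and how busy each node is.
replicas = {"Y": ["storage01", "storage02", "storage03"]}
active_readers = {"storage01": 4, "storage02": 1, "storage03": 2}

def pick_source(name):
    # Spread reads across copies so a hot dataset doesn't bottleneck one node.
    return min(replicas[name], key=active_readers.get)

source = pick_source("Y")  # -> 'storage02'
```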


>  1. Something cheap
>  2. Something that doesn't require building an infrastructure (like a SAN)
>  3. Something that allows processing and storage nodes to be added dynamically
>  4. Something that will distribute incoming data across independent storage nodes
>  5. Something that will keep track of where all the data is and direct jobs to the proper node

I'd say that pretty well sums it up.  On number 5, directing jobs to
the proper node is more a wish than a requirement...

Our job management system isn't perfect, but it's working so far.
When an end user wants to process a dataset, they do so interactively,
and the work is distributed across available processing nodes.
They're hoping to get a result in a matter of minutes or less.  I
think it would be tough to replace our job processor with something
like Condor or Hadoop because they're not tuned to that kind of time
frame.  But I haven't dug too deeply into them yet, and we're
considering that too.
