[wplug] Thoughts on a Storage System

Thu May 21 13:09:55 EDT 2009

Well, I have a little better picture of what you're looking for, but now I have even more
questions. I would like to know more about the workflow as well as how things grew to
where they are today.

You say that
> The nodes which process the most data are the storage nodes because
> they also have the most RAM and processor cores

Then you say that
> the data should be balanced
> across nodes so that utilization is optimized.

Do you mean only the utilization of the storage on those nodes, or the processing workload
as well?

When you say
> As an added bonus it
> would be nice to have nodes doing processing look for data that was
> stored locally rather than whatever is next in the queue.

Does this mean you are currently doing some sort of cross-mounting of filesystems so that
storage nodes can process data that resides on other storage nodes? If a node were
selective about working on local data, it would obviously run faster.
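
To make the "prefer local data" idea concrete, here is a rough sketch in Python. It is
only an illustration under my own assumptions: the job dictionaries, the next_job name,
and the set of local files are all hypothetical, not anything from your setup. A node
scans the queue for the first job whose input it already holds, and falls back to the
head of the queue if nothing is local.

```python
from collections import deque

def next_job(queue, local_files):
    """Return the first queued job whose input is stored locally.

    Falls back to the job at the head of the queue when no queued job
    has local input, and returns None when the queue is empty.
    All names here are hypothetical, for illustration only.
    """
    for i, job in enumerate(queue):
        if job["input"] in local_files:
            del queue[i]          # remove the local job from the queue
            return job            # return immediately after mutating
    return queue.popleft() if queue else None

# Three queued jobs; only b.dat lives on this (hypothetical) node.
queue = deque([
    {"id": 1, "input": "a.dat"},
    {"id": 2, "input": "b.dat"},
    {"id": 3, "input": "c.dat"},
])
local = {"b.dat"}

first = next_job(queue, local)    # picks the local job
second = next_job(queue, local)   # nothing local left, takes the head
```

The fallback matters: without it, a node with no local work would sit idle while the
queue backs up.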

> What we're really looking for is an architecture that allows us to
> scale up more efficiently.  Its not worthwhile for us to add more
> dedicated nodes before we really need their disk space now, because
> they would be underutilized.

Here's where I get confused: you mention that a node can process whatever is next in the
queue, but here you state that additional nodes aren't effectively utilized unless you
need their disk space. When you ingest new data that is to be processed, how do you
decide where it is to be stored? If the ingest process spread the data around the storage,
would that make it possible to utilize new nodes more effectively?

> The data isn't replicated and it isn't spread around.

If every node had high-speed access to a shared filesystem, wouldn't that be the most
efficient arrangement? Any node could run any job, and you wouldn't have to waste disk
space (and time spent copying) on replicated data.

What is the workflow from ingest through processing? When you ingest new data, how do you
determine where it will be stored? When you process data,

If I try to read between the lines and guess at what you are looking for, I would say you
want:
  1. Something cheap
  2. Something that doesn't require building an infrastructure (like a SAN)
  3. Something that allows processing and storage nodes to be added dynamically
  4. Something that will distribute incoming data across independent storage nodes
  5. Something that will keep track of where all the data is and direct jobs to the proper
     node
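
Items 4 and 5 together amount to a small catalog. As a rough sketch only (the Catalog
class, node names, and sizes below are made up by me, and the "most free space wins"
placement rule is just one possible policy), it could look like this in Python:

```python
class Catalog:
    """Toy placement catalog; every name here is hypothetical."""

    def __init__(self, capacity_by_node):
        self.free = dict(capacity_by_node)   # node -> free space (arbitrary units)
        self.location = {}                   # file name -> node holding it

    def ingest(self, name, size):
        """Place new data on the node with the most free space."""
        node = max(self.free, key=self.free.get)
        if self.free[node] < size:
            raise RuntimeError("no node has room for " + name)
        self.free[node] -= size
        self.location[name] = node
        return node

    def node_for_job(self, name):
        """Return the node a job on this file should be sent to."""
        return self.location[name]

# Assumed example: two large nodes and one smaller node.
catalog = Catalog({"node1": 100, "node2": 100, "node3": 50})
placements = [catalog.ingest(f"file{i}", 40) for i in range(4)]
```

With a rule like this, a newly added node starts out with the most free space, so it
receives incoming data (and therefore jobs) right away instead of waiting until the
older nodes fill up.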

Michael
Solution Architect, Storage
SGI