[wplug] Large Database

Philip Reiche Philip.Reiche at PL.NETL.DOE.GOV
Fri Mar 6 09:56:27 EST 2009


How big and complex are the records, and how fast are you reading them? Can you load them into some other data structure, say a sparse array, during data acquisition and then process the recorded data offline?
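To sketch what I mean in Python (untested, and read_sample(), the buffer size, and the output file are all made up for illustration):

    import struct
    import time

    BUFFER_LIMIT = 100000       # samples held in RAM before each flush
    buffer = []

    def read_sample():
        """Stand-in for the real acquisition hardware."""
        return time.time(), 42.0    # (timestamp, value) pair

    # Append to an in-memory buffer during acquisition and flush in
    # large sequential writes; load into the database offline later.
    with open("samples.bin", "ab") as out:
        for _ in range(1000000):
            buffer.append(read_sample())
            if len(buffer) >= BUFFER_LIMIT:
                out.writelines(struct.pack("dd", t, v) for t, v in buffer)
                buffer.clear()
        if buffer:
            out.writelines(struct.pack("dd", t, v) for t, v in buffer)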

>>> "Bill Moran" <wmoran at potentialtech.com> 3/6/2009 9:37 AM >>>
In response to DK <wplug at curlynoodle.com>:
> 
> I need to implement a data acquisition system which will sample and
> store large amounts of "time-series" data, that is, hundreds of millions
> of records.  I would like to investigate using an open source
> database.  Does anyone have suggestions?

PostgreSQL is frequently used for datasets that size.
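
For concreteness, here's a bare-bones sketch of what such a time-series
table might look like, created from Python with psycopg2 (the column names,
types, and connection string are placeholder guesses, not anything from the
original post):

    import psycopg2

    # Placeholder connection details -- substitute your own.
    conn = psycopg2.connect("dbname=daq user=daq")
    cur = conn.cursor()

    # One row per sample: timestamp, channel id, and measured value.
    # Hold off on creating indexes until after the bulk load; building
    # them afterwards is usually much cheaper.
    cur.execute("""
        CREATE TABLE samples (
            ts      timestamptz      NOT NULL,
            channel integer          NOT NULL,
            value   double precision NOT NULL
        )
    """)
    conn.commit()
    cur.close()
    conn.close()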

However, whatever you choose, you're going to have to take some time to
tune your methodology if you're dealing with that much data.  You're
crossing a line between what ordinary people expect computers to do and
what _REAL_ number crunching is like.

You will _NEVER_ get the kind of performance out of a database that size
that the average user gets out of his MS Access database that he uses to
catalog his library of mp3s.  You will have queries that take hours to
run.  This will be the case regardless of what DB system you use ... pay
$infinity for Oracle and the performance won't be any better.  It's just
the realities of scale.

You will probably also find that collecting said data will require various
forms of "trickery".  If you just try to insert it one row at a time, you'll
spend weeks getting it into the database.  If you're trying to capture it
"real time", you'll probably find that it's impossible to do.  There are
LOTS of tricks to getting this to work, but the standard methods of
data processing usually don't scale up to recordsets that size.  Again,
it doesn't matter what DB you use, you'll come across these same problems.
Keep in mind, that depending on the size of each row, you're talking about
10s of G of data ... possibly 100s of G.
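
To give one concrete example of that kind of trickery: PostgreSQL's COPY
is usually orders of magnitude faster than row-at-a-time INSERTs for bulk
loads.  A rough, untested sketch using psycopg2's copy_from, against the
hypothetical "samples" table above:

    import io
    import psycopg2

    conn = psycopg2.connect("dbname=daq user=daq")  # placeholder details
    cur = conn.cursor()

    # In practice you'd stream the recorded data file; a tiny in-memory
    # tab-separated buffer keeps this example self-contained.
    rows = io.StringIO(
        "2009-03-06 09:00:00\t1\t0.125\n"
        "2009-03-06 09:00:01\t1\t0.250\n"
    )

    # COPY pushes rows through a single protocol stream instead of
    # parsing and planning one INSERT statement per row.
    cur.copy_from(rows, "samples", columns=("ts", "channel", "value"))
    conn.commit()
    cur.close()
    conn.close()

Batching many INSERTs inside a single transaction, or dropping indexes
during the load and rebuilding them afterwards, buys similar wins.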

With that warning in mind, I further recommend PostgreSQL.  I've talked to
many people dealing with datasets that size, and PG handles them as well
as anything you'll pay money for.  Personally, I have a few databases with
10s of millions of rows and have had no problems with PG's performance, but
I've yet to push a database into the hundreds of millions.  I'll be interested
to see how it works for your project.

-- 
Bill Moran
http://www.potentialtech.com 
http://people.collaborativefusion.com/~wmoran/ 
