[wplug] Large Database

terry mcintyre terrymcintyre at yahoo.com
Fri Mar 6 10:26:40 EST 2009


My employer has Oracle databases with hundreds of millions of rows and terabytes of data on NetApp filers with hundreds of spindles. That solution may be pricey, but it does suggest some of the issues you'll be dealing with. We use hundreds of spindles to increase the disk I/O rate. Big RAM caches (gigabytes) also help. Some folks use SSDs to push the I/O rates even higher, usually on hot spots such as index files. As Bill Moran says, you're definitely pushing the envelope. PostgreSQL can probably do it, but you'll need to do a lot of tuning of both the database and the underlying hardware. Lashing a few terabyte drives together may be cheap, but it won't perform.
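
To make the "index hot spot" idea concrete: PostgreSQL lets you carve out a tablespace on the fast storage and put indexes there while the bulk of the table stays on the big, slow disks. A rough sketch using psycopg2 - the mount point and the "samples" table are made up for illustration, not anything from my employer's setup:

    # Put index "hot spots" on SSD-backed storage via a dedicated tablespace.
    import psycopg2

    conn = psycopg2.connect("dbname=acquisition")
    conn.autocommit = True  # CREATE TABLESPACE can't run inside a transaction
    cur = conn.cursor()

    # Hypothetical SSD mount point; it must exist, be empty, and be owned by
    # the postgres user (creating a tablespace requires superuser rights).
    cur.execute("CREATE TABLESPACE ssd_fast LOCATION '/mnt/ssd/pgdata'")

    # The table itself stays where it is; only the index lives on the SSD.
    cur.execute("CREATE INDEX samples_time_idx ON samples (sample_time) "
                "TABLESPACE ssd_fast")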

What is the expected data acquisition rate? What sort of queries are expected? Is SQL a good match for this, or would something else be more suitable?

Terry McIntyre <terrymcintyre at yahoo.com>


-- Libertarians Do It With Consent!



----- Original Message ----
> From: Bill Moran <wmoran at potentialtech.com>
> To: General user list <wplug at wplug.org>
> Sent: Friday, March 6, 2009 6:37:29 AM
> Subject: Re: [wplug] Large Database
> 
> In response to DK:
> > 
> > I need to implement a data acquisition system which will sample and
> > store large amounts of "time-series" data, that is, hundreds of millions
> > of records.  I would like to investigate using an open source
> > database.  Does anyone have suggestions?
> 
> PostgreSQL is frequently used for datasets that size.
> 
> However, whatever you choose, you're going to have to take some time to
> tune your methodology if you're dealing with that much data.  You're
> crossing a line between what ordinary people expect computers to do and
> what _REAL_ number crunching is like.
> 
> You will _NEVER_ get the kind of performance out of a database that size
> that the average user gets out of his MS Access database that he uses to
> catalog his library of mp3s.  You will have queries that take hours to
> run.  This will be the case regardless of what DB system you use ... pay
> $infinity for Oracle and the performance won't be any better.  That's just
> the reality of scale.
> 
> You will probably also find that collecting said data will require various
> forms of "trickery".  If you just try to insert it one row at a time, you'll
> spend weeks getting it into the database.  If you're trying to capture it
> "real time", you'll probably find that it's impossible to do.  There are
> LOTS of tricks to getting this to work, but the standard methods of
> data processing usually don't scale up to recordsets that size.  Again,
> it doesn't matter what DB you use, you'll come across these same problems.
> Keep in mind that, depending on the size of each row, you're talking about
> 10s of GB of data ... possibly 100s of GB.
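
On the row-at-a-time point: with PostgreSQL the usual trick is to batch the incoming samples and feed each batch to COPY instead of committing individual INSERTs. A rough sketch in Python with psycopg2 - the samples(sample_time, sensor_id, value) table and the batch handling are made up for illustration:

    # Bulk-load batches of time-series samples with COPY instead of INSERTs.
    import io
    import psycopg2

    conn = psycopg2.connect("dbname=acquisition")
    cur = conn.cursor()

    def flush_batch(rows):
        """Load a batch of (timestamp, sensor_id, value) tuples in one COPY."""
        buf = io.StringIO()
        for ts, sensor, value in rows:
            buf.write("%s\t%s\t%s\n" % (ts, sensor, value))
        buf.seek(0)
        cur.copy_from(buf, "samples",
                      columns=("sample_time", "sensor_id", "value"))
        conn.commit()

    # In the acquisition loop, accumulate rows and call flush_batch() every
    # few thousand samples rather than committing one row at a time.

The same idea works from the command line with psql's \copy if the data lands in flat files first.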
> 
> With that warning in mind, I further recommend PostgreSQL.  I've talked to
> many people dealing with datasets that size, and PG handles them as well
> as anything you'll pay money for.  Personally, I have a few databases with
> 10s of millions of rows and have had no problems with PG's performance, but
> I've yet to push a database into the 100s of millions.  I'll be interested
> to see how it works for your project.
> 
> -- 
> Bill Moran
> http://www.potentialtech.com
> http://people.collaborativefusion.com/~wmoran/
> _______________________________________________
> wplug mailing list
> wplug at wplug.org
> http://www.wplug.org/mailman/listinfo/wplug