[wplug] Large Database

Bill Moran wmoran at potentialtech.com
Fri Mar 6 14:53:57 EST 2009


In response to Drew from Zhrodague <drew at zhrodague.net>:

> > Each record would include a datestamp field and at least one value
> > field.  More than likely I would have 20 or more value fields, all
> > with a common datestamp.
> 
>  	Use a binary format! Compress the records in chunks, maybe hourly. 
> Make sure you don't drop them when writing to the next file.
> 
> > The system would sample data from process devices every 10ms, 100ms,
> > 500ms, 1s or 5s, depending on the source of the data.  Sampling at
> > these rates for a year or more yield millions of records.
> 
>  	Is BerkeleyDB fast enough for this? The stuff I was doing uses an 
> old Lotus123 format, and it is pretty fast (shapefiles).
> 
>  	There should be a simple method of streaming these records to 
> disk. Company I worked for did stuff like this, and it seemed to be best 
> to use an efficient binary format. Compress the samples in chunks to save 
> disk space.
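
For what it's worth, here's a sketch of the sort of thing Drew is
describing: packed binary records, compressed in hourly chunks, with
the rotation done so that a failed open can't lose records.  The record
layout (one double timestamp plus 20 floats) and the hourly chunking
are my assumptions, not anything Drew specified.

import gzip
import struct
import time

# Assumed layout: one little-endian double timestamp + 20 float values.
RECORD = struct.Struct("<d20f")

class ChunkWriter:
    """Append packed records to gzip-compressed hourly chunk files."""

    def __init__(self, prefix):
        self.prefix = prefix
        self.hour = None
        self.fh = None

    def write(self, ts, values):
        hour = int(ts // 3600)
        if hour != self.hour:
            # Open the new chunk before closing the old one, so a
            # failed rotation can't silently drop records.
            new_fh = gzip.open("%s-%d.bin.gz" % (self.prefix, hour), "ab")
            if self.fh is not None:
                self.fh.close()
            self.fh, self.hour = new_fh, hour
        self.fh.write(RECORD.pack(ts, *values))

w = ChunkWriter("samples")
w.write(time.time(), [0.0] * 20)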

The validity of Drew's answers depends on what you're going to do with
the data once you've stored it.

If you're just going to be looking at individual records, or generating
graphs over specified time ranges, you might be much better off with
flat files.  I'm not crazy about binary formats, as they're difficult
to parse with tools like sed/awk/sh, but flat files will have a lot of
performance benefits over an RDBMS if you're accessing the data
sequentially, or only looking at individual records and the records are
of predictable size.
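
As an example of what I mean by predictable size: if every line is
padded to a fixed width, the file is still plain text for sed/awk, and
you can also seek straight to the Nth record without scanning.  The
80-byte record and four value fields below are made-up numbers, just to
show the arithmetic:

LINE_LEN = 80  # 79 data bytes + newline (assumed record size)

def append_record(f, ts, values):
    # Pad every line to the same width: still greppable plain text,
    # but record offsets stay computable.  Assumes four value fields;
    # widen LINE_LEN if you store more.
    line = "%17.6f " % ts + " ".join("%14.6f" % v for v in values)
    f.write(line.ljust(LINE_LEN - 1) + "\n")

def read_record(path, n):
    # Predictable record size: record n starts at byte n * LINE_LEN,
    # so an individual lookup is one seek rather than a scan.
    with open(path, "rb") as f:
        f.seek(n * LINE_LEN)
        return f.read(LINE_LEN).decode().split()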

However, if you're going to be doing data-mining work, such as looking
for trends, you'll probably find it much easier to work with an RDBMS.
For performance reasons, you may find it best to store the incoming
data in flat files and import it into the RDBMS in batches as needed.
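
With PostgreSQL, for instance, you could accumulate a day's samples in
a flat file and bulk-load it with COPY, avoiding the per-row round
trips of individual INSERTs.  The table and file names here are
invented for the example:

import psycopg2

# Hypothetical table: samples (stamp, v1 ... v20), matching the file's
# space-separated columns.
conn = psycopg2.connect("dbname=plant")
cur = conn.cursor()
with open("samples-2009-03-06.txt") as f:
    # COPY streams the whole file in one statement, one transaction.
    cur.copy_from(f, "samples", sep=" ")
conn.commit()
conn.close()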

If there will be relationships between different aspects of the data,
e.g. you'll have parent/child table relationships, then an RDBMS is
liable to be a better system for storage than flat files (again,
possibly importing the requested data on demand).
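
By parent/child I mean something like one table describing each device
and another holding its samples, which is awkward to express in flat
files.  A minimal sketch, with invented names:

import psycopg2

conn = psycopg2.connect("dbname=plant")
cur = conn.cursor()
# Parent: one row per process device.
cur.execute("""CREATE TABLE device (
    id    serial PRIMARY KEY,
    name  text NOT NULL)""")
# Child: many samples, each referencing the device it came from.
cur.execute("""CREATE TABLE sample (
    device_id  integer NOT NULL REFERENCES device (id),
    stamp      timestamptz NOT NULL,
    value      real NOT NULL)""")
conn.commit()
conn.close()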

Terry pointed out that your initial #s seemed to be wrong.  100 TPS
(one sample every 10ms is only 100 inserts per second) isn't that awful
(http://www.tgc.com/dsstar/00/0822/102059.html shows tested systems
exceeding 1000 TPS, and that benchmark is 8 years old), so you should
be OK, unless you've got a lot of systems all reporting to the same
system at 10ms intervals.

-- 
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

