[wplug] crashh! ext2, fsck, and duplicate blocks

Sun Oct 31 21:34:17 EST 2004

On Sun, 2004-10-31 at 17:33, Brandon Kuczenski wrote:
> I have a few questions about system administration practices.
> 
> I am running a FreeBSD server (okay, okay, should I send it to wplug-bsd?
> I think these questions are of general interest, though) and I had a
> peculiar problem.  A configuration file for one of my scripts got
> suddenly and unexpectedly filled with garbage.  The garbage looked like
> this:
> 

What did this script do?  Was it set up as a cron job or something else?
Shell/perl/other?

> ...
> t 1099176433 N:N.N.N
> t 1099176433 greetings
> t 1099176433 HTime-Received:Oct
> t 1099176433 H*F:D*com.br
> t 1099176433 H*F:D*br
> ...
> 
> When I fixed the file, it soon became garbage again.  I disabled the
> script and decided to puzzle over it for awhile.
> 
> But wait! There's more!
> 
> My server just crashed.  When it restarted, I fsck'ed the disks and found
> "duplicate blocks" -- shared by my Bayes tokens database and that
> configuration file.  Aha!  So my config file got overwritten by spam
> data.  fsck fixed the problem.
> 
> So, given all this, I have three questions: One, how in blazes (I wanted
> to say something more R-rated) did this happen?  As I said, the OS was
> FreeBSD, the filesystem was ext2 (FreeBSD doesn't support ext3, and these
> disks were migrated from a Linux system).
> 

Did the system ever go down unexpectedly?  In my experience, it is not
at all unusual to have some truncated or garbage disk entries after a
crash.  This is especially true when you are using I/O intensive
applications (like a database).  It sounds like you might have had a
cross-linked file here, i.e. two files claiming the same disk blocks,
which is what fsck reports as "duplicate blocks".  This can be somewhat
hard to narrow down, especially if it happened a while back and you were
not aware of any problems.  Had you disabled fsck for this volume for
some reason?  It sounds as if you had to invoke a manual fsck.
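
To make the "duplicate blocks" idea concrete, here is a minimal Python
sketch of the check: collect which inodes claim each block and flag any
block with more than one owner.  The inode and block numbers are made up
for illustration; a real fsck reads them from the on-disk inode tables,
which this sketch does not attempt.

```python
from collections import defaultdict


def find_duplicate_blocks(block_map):
    """Return {block: sorted inode list} for blocks claimed by 2+ inodes.

    block_map maps an inode number to the list of block numbers it claims.
    """
    owners = defaultdict(set)
    for inode, blocks in block_map.items():
        for block in blocks:
            owners[block].add(inode)
    return {blk: sorted(inodes) for blk, inodes in owners.items()
            if len(inodes) > 1}


# Hypothetical data: inode 12 stands in for the config file and
# inode 57 for the Bayes token database.
example = {12: [100, 101], 57: [101, 102, 103], 90: [200]}
duplicates = find_duplicate_blocks(example)
print(duplicates)
```

Whichever file wrote to a shared block last wins, which would explain a
config file suddenly filling up with database contents.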

> Two: should I run fsck on a routine (i.e. cron) basis, to catch glitches
> like this?  How often do they happen?  Or should I just wait for
> random reboots to check the disks? What is the "Right thing to do"?
> 

AFAIK, the "standard procedure" for any system-created volume is that at
boot the system checks whether the volume was cleanly unmounted and, if
not, runs fsck.  With ext2, fsck will also force a check after N mounts
(the maximum mount count, which tune2fs -c can change).  See /etc/fstab
(the last field is the fsck pass number) or cron for further details.
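
As a quick way to see which volumes are set to be checked at boot, here
is a small Python sketch that pulls the fsck pass number (the sixth
field) out of fstab-style text.  The sample entries and device names are
invented for the example, not taken from your system.

```python
def fsck_pass_numbers(fstab_text):
    """Map each mount point to its fsck pass number (sixth fstab field).

    0, or a missing field, means the filesystem is not checked at boot.
    """
    passes = {}
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        passes[fields[1]] = int(fields[5]) if len(fields) > 5 else 0
    return passes


# Hypothetical fstab contents.
SAMPLE = """\
# Device      Mountpoint  FStype  Options  Dump  Pass#
/dev/ad0s1a   /           ufs     rw       1     1
/dev/ad0s1b   none        swap    sw       0     0
/dev/ad1s1    /data       ext2fs  rw       0     2
"""
pass_numbers = fsck_pass_numbers(SAMPLE)
print(pass_numbers)
```

To run it against the real file, pass in open("/etc/fstab").read()
instead of SAMPLE.  A volume showing 0 is skipped at boot.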

> Three: when my server crashes and leaves no helpful information in
> /var/log/messages (in fact, even the startup log is missing), am I just
> supposed to pretend like nothing happened?  How do I find bugs if there
> are no logs?
> 

Were there any other logs left, e.g. /var/log/messages.0.gz?
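
If rotated copies do exist, a rough Python sketch like the one below
could gather the kernel entries from the live log and the rotated ones
in one pass.  It assumes the files are named messages, messages.0.gz and
so on, and that they are either plain text or gzip; the function names
are my own.

```python
import glob
import gzip
import os


def iter_log_lines(log_dir="/var/log", base="messages"):
    """Yield (filename, line) from the live log and its rotated copies."""
    pattern = os.path.join(log_dir, base + "*")
    for path in sorted(glob.glob(pattern)):
        # Rotated copies are assumed to be gzip-compressed if named *.gz.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                yield os.path.basename(path), line.rstrip("\n")


def kernel_lines(log_dir="/var/log"):
    """Return only the entries logged by the kernel."""
    return [(name, line) for name, line in iter_log_lines(log_dir)
            if "/kernel:" in line]
```

Calling kernel_lines() as root and looking at the last few entries
before the reboot is where I would start.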

> Here's my /var/log/messages surrounding the time of the reboot (the first
> several lines are just IP filter data):
> 
> 
> Oct 31 14:27:49 ocean ipmon[83]: 14:27:48.985723 rl0 @0:17 b 209.195.143.195,2077 -> 209.195.172.207,3127 PR tcp len 20 48 -S IN
> Oct 31 14:28:46 ocean ipmon[83]: 14:28:45.874701 rl0 @0:17 b 209.195.87.230,2728 -> 209.195.172.207,6129 PR tcp len 20 48 -S IN
> Oct 31 14:28:49 ocean ipmon[83]: 14:28:48.785276 rl0 @0:17 b 209.195.87.230,2728 -> 209.195.172.207,6129 PR tcp len 20 48 -S IN
> Oct 31 14:28:55 ocean ipmon[83]: 14:28:54.896598 rl0 @0:17 b 209.195.87.230,2728 -> 209.195.172.207,6129 PR tcp len 20 48 -S IN
> Oct 31 14:34:27 ocean ipmon[83]: 14:34:27.219536 rl0 @0:17 b 63.205.221.242,4325 -> 209.195.172.207,1433 PR tcp len 20 48 -S IN
> Oct 31 14:34:31 ocean ipmon[83]: 14:34:30.216077 rl0 @0:17 b 63.205.221.242,4325 -> 209.195.172.207,1433 PR tcp len 20 48 -S IN
> Oct 31 17:03:37 ocean /kernel: e
> Oct 31 17:03:37 ocean /kernel: dscheck(#ad/0x3000a): b_bcount 1 is not on a sector boundary (ssize 512)
> Oct 31 17:03:37 ocean last message repeated 11 times
> Oct 31 17:03:37 ocean /kernel: IP Filter: v3.4.31 initialized.  Default = pass all, Logging = enabled
> Oct 31 17:03:40 ocean ipmon[84]: 17:03:40.513362 rl0 @0:17 b 209.195.138.37,2091 -> 209.195.172.207,3127 PR tcp len 20 48 -S IN
> Oct 31 17:03:44 ocean ntpd[116]: ntpd 4.1.0-a Tue May 25 21:15:34 GMT 2004 (1)
> Oct 31 17:03:44 ocean ntpd[116]: kernel time discipline status 2040
> Oct 31 17:04:02 ocean login: ROOT LOGIN (root) ON ttyv0
> 
> Sometime before the line that reads "/kernel: e", the system rebooted.  I
> mean, WTF?
> 

A little OT, but this line:

Oct 31 17:03:37 ocean /kernel: IP Filter: v3.4.31 initialized.  Default
= pass all, Logging = enabled

says "Default = pass all", so anything not matched by a rule is let
through.

Are you blocking anything, running any services, etc.?  This is a whole
other subject, but I feel it is worth mentioning.

HTH,

-- 
Carl Benedict
Pittsburgh Techs
Main:  724-741-0233
http://www.pittsburghtechs.com
cbenedic at pittsburghtechs.com