[wplug] NFS mount problem

Michael Skowvron skowvron at verizon.net
Fri Jul 2 20:06:00 EDT 2004


Gentgeen wrote:

> I guess I have used the wrong term and/or have the wrong understanding. 
> When I did consecutive 'ifconfig' while NFS was 'frozen' I could watch
> the RX errors going up and up and up on 'linuxbox' but stayed at zero on
> 'kingpin'.
> So is this still a hardware problem, or something else? 

You've got yourself a really interesting problem. Receive errors almost
always indicate a hardware problem. But it isn't a solid hardware
failure, because we know that small, interactive traffic works fine
even when NFS is "hung." Knowing that there are receive errors is much
better than just knowing that packets are dropped: it tells us they're
being dropped because they arrive bad.

The problem is definitely related to the amount of data being passed,
because small NFS request sizes show no errors and larger request
sizes get progressively worse. An important thing to remember is that
kingpin is going to send "rsize" bytes of data as fast as it can when
it transmits. A smaller "rsize" means fewer back-to-back packets that
linuxbox has to ingest at one time.
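
As a rough illustration of why rsize matters, the sketch below estimates
how many back-to-back Ethernet frames a single read reply turns into. It
assumes NFS over UDP on ordinary 1500-byte-MTU Ethernet, neither of which
is stated in this thread, and the 128-byte RPC/NFS header allowance is a
made-up round figure, not a measured one.

```python
import math

def fragments_per_read(rsize, mtu=1500, rpc_overhead=128):
    """Estimate how many IP fragments one NFS-over-UDP read reply needs.

    rpc_overhead is a hypothetical allowance for RPC/NFS reply headers.
    """
    ip_header = 20                      # IPv4 header without options
    udp_header = 8
    per_fragment = mtu - ip_header      # payload bytes carried per fragment
    datagram = udp_header + rpc_overhead + rsize
    return math.ceil(datagram / per_fragment)

for rsize in (2048, 4096, 8192):
    print(rsize, fragments_per_read(rsize))
```

Under those assumptions an 8192-byte read arrives as a burst of 6
frames, a 4096-byte read as 3, and a 2048-byte read as 2, which fits the
pattern of errors growing with rsize.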

>    When rsize=8192, netstat showed RX-ERR increasing by 5 for every
>         increase of 7 in RX-OK   
>    When rsize=4096, netstat showed RX-ERR increasing by 1 for every
>         increase of 2 in RX-OK
>    When rsize=2048, netstat showed RX-ERR remaining at 0, and RX-OK
>         increasing.
> 
> As a note, the above is not very scientific, just a matter of
> continuously running 'netstat -i' and doing some quick
> math.

That's as scientific as I think you need to be. Netstat can be a
diagnostic tool just like anything else. The numbers you are showing
are absolutely horrible. On my network of 12 or so hosts I've got
packet counts of over 400 million and I have 0 errors. I use a lot of
NFS with 32K request sizes.
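
For anyone who wants to repeat the "quick math" without squinting at
netstat, here is a minimal sketch that turns two counter samples into a
fraction of bad frames. It assumes a Linux host where /proc/net/dev is
available, and it assumes RX-OK counts only good frames (so the total
seen is OK plus ERR); the function names are invented for this example.

```python
def read_rx_counters(proc_net_dev_text, iface):
    """Return (rx_packets, rx_errs) for iface from /proc/net/dev contents."""
    for line in proc_net_dev_text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and name.strip() == iface:
            fields = rest.split()
            # Receive columns start: bytes, packets, errs, drop, ...
            return int(fields[1]), int(fields[2])
    raise ValueError("interface not found: " + iface)

def bad_fraction(before, after):
    """Fraction of frames received between two samples that were errors.

    Each sample is an (rx_ok, rx_err) pair.
    """
    ok = after[0] - before[0]
    err = after[1] - before[1]
    total = ok + err
    return err / total if total else 0.0

# The ratios reported in this thread, expressed as fractions:
print(bad_fraction((0, 0), (7, 5)))   # rsize=8192: 5 errors per 7 OK
print(bad_fraction((0, 0), (2, 1)))   # rsize=4096: 1 error per 2 OK
```

On a live system the samples would come from calling
read_rx_counters(open("/proc/net/dev").read(), "eth0") twice, a few
seconds apart ("eth0" is a placeholder interface name). By this
reckoning the reported figures work out to roughly 42% bad frames at
rsize=8192 and 33% at rsize=4096.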

The problem appears to be isolated to linuxbox receiving packets. To
confirm, you should also test kingpin under heavy receive load. Instead
of reading data from your NFS filesystem, write a lot of data to it
from linuxbox. Use a wsize of 4k or 8k (wsize is the write-side
counterpart of rsize) and see if rx errors start to show up on kingpin.
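
Any bulk write will do for this test (dd is the usual tool). The sketch
below is one hedged way to generate the load from linuxbox; the function
name is invented and the mount point in the usage comment is a
placeholder, not a path from this thread. Watch the RX error counter on
kingpin while it runs.

```python
import os

def write_load(path, total_bytes, chunk_size=64 * 1024):
    """Write total_bytes of zeros to path and push them out to the server."""
    chunk = b"\0" * chunk_size
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(chunk[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())   # make sure the data is not just sitting in cache
    return os.path.getsize(path)

# Hypothetical usage on linuxbox; /mnt/nfs is a placeholder mount point:
# write_load("/mnt/nfs/rxtest.bin", 256 * 1024 * 1024)
```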

If kingpin still doesn't show errors, the problem must be isolated to
linuxbox. At this point, I would begin to suspect that the ethernet
card is bad or there is some sort of interrupt servicing problem. If
kingpin does start to show errors, it could be a bug in the receive
packet handling of the tulip driver.

I notice that you always seem to have the problem when playing audio
files. It could be that you don't do anything else on your NFS
filesystem, or it could be that the audio board is affecting the PCI
bus and corrupting data coming from the ethernet card. Maybe it's a
funny interaction between the audio driver and the tulip driver.

Do large file copies also hang NFS? If they do, then swap the ethernet
cards between kingpin and linuxbox. Do the receive errors follow the
card? If they do, bad card. If not, maybe it's an IRQ problem or some
other resource conflict on linuxbox.

On the other hand, if NFS only hangs when playing audio files, look
for the problem to be related in some way to the audio board or the
audio driver in the kernel.

Whew! I'll be anxious to see what you post next!

Michael





