
Solaris+NFS can be "fun"


So there I was, sitting comfortably in my chair reading Slashdot, when the emails started coming in: nobody could use Subversion, Emacs mail was failing, and sendmail was choking whenever it tried to deliver to a file (i.e. not via procmail).

The exact error was: "no record locks available." Google led me to believe that everyone in the world has solved this problem with a file server reboot, or by restarting lockd. I rebooted many times.

The next day my boss basically said that I needed to stay all night until the problem was resolved. I logged in and ran 'lockfs -fa;sync;sync;sync;reboot'. Clearly, rebooting this Sun server wasn't the solution.

To make a very long story short, I finally discovered how to get info from the kernel about record locks. (Note: a record lock locks part of a file, starting from an offset, as opposed to a standard whole-file lock; those were working fine.) It turns out that lockd only has 20 threads to deal with locks, and we were being DoS'd with a lock storm. Before that, I had run lockd in debug mode and discovered that the exact error being returned was NLM4_DENIED_GRACE_PERIOD (see the RFC), which is very misleading.
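For anyone who hasn't met a record lock before, here is a minimal sketch in Python of taking one on a byte range. The helper names (try_lock_range, unlock_range) are made up for illustration; the only real API used is fcntl.lockf() from the standard library. On a file that lives on an NFS mount, a request like this is what ends up being handed to lockd on the file server, and a client that can't get a lock at all typically sees errno ENOLCK, which is the "no record locks available" message.

```python
import errno
import fcntl
import os


def try_lock_range(fd, offset, length):
    """Try to take an exclusive record lock on `length` bytes starting at `offset`.

    Returns True if the lock was granted, False if another process holds a
    conflicting lock. Any other failure (for example ENOLCK) is re-raised.
    """
    try:
        # LOCK_NB makes the call fail immediately instead of blocking.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, length, offset, os.SEEK_SET)
        return True
    except OSError as exc:
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise


def unlock_range(fd, offset, length):
    """Release a record lock previously taken on the same byte range."""
    fcntl.lockf(fd, fcntl.LOCK_UN, length, offset, os.SEEK_SET)
```

Something like try_lock_range(fd, 100, 50) asks for bytes 100 through 149 only; a whole-file lock is the same call with a length of 0 from offset 0.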

So it's possible to have a lockd storm. Running 'snoop rpc nlockmgr' showed a tad bit of traffic from a web server that mounts home directories from the file server. A student's Perl CGI was calling lockf() explicitly, and the site had become very popular. MSNbot was crawling it at the time, and the CGI processes never died. They never ate up many resources on the web server, so nobody noticed.
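I don't know exactly what the student's script looked like, so the following is only a sketch of the defensive pattern I'd want in a CGI that locks files on an NFS home directory: never wait forever for a lock. It's in Python rather than Perl, and lock_with_deadline and its timeout and poll parameters are hypothetical names, not anything from the original script.

```python
import errno
import fcntl
import time


def lock_with_deadline(fd, timeout=5.0, poll=0.1):
    """Try to lock the whole file exclusively, giving up after `timeout` seconds.

    Returns True once the lock is held, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Non-blocking attempt: returns at once or raises OSError.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return True
        except OSError as exc:
            # EACCES/EAGAIN mean "someone else holds it"; anything else is a real error.
            if exc.errno not in (errno.EACCES, errno.EAGAIN):
                raise
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A caller that gets False back can return an error page and exit, instead of sitting there as one more process queued up on the file server's lockd.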

Isn't that just the strangest thing? :) My gratitude goes to Dan Leger of Sun Networking Support for helping me, even though he wasn't required to, by providing "contract only" documentation which led straight to the resolution.

Anyone ever seen that?

Read More Entries by Charles Schluting.

2 Comments

cschluti said:

lockd 200
I guess I wasn't too clear :)
No, that didn't work. The solution was to kill the web server that was grabbing all the locks (the everlasting CGI processes, I mean).

Laen said:

lockd 200
So was the solution to just run lockd with more threads?
