accept() hangs forever, although the clients try to connect()
I have a multithreaded concurrent server running under Solaris 8. The main thread is dedicated solely to accept()ing connections and creating new worker threads, which then handle incoming requests and send responses. A worker thread finishes after the client closes the socket connection. The accept()ing in the main thread occurs in a simple infinite loop, like this:
for ( ; ; ) {
    socket = accept();            /* blocks until a client connects */
    create_new_thread( socket );  /* the worker thread handles this connection */
}
All of the above happens over TCP/IP.
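To make the loop concrete, here is a minimal C sketch of a single accept step with the return value checked. It is not the production code: accept_checked and accept_error_is_transient are hypothetical names, and treating EINTR and ECONNABORTED as "just retry" is an assumption on my part.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: returns 1 if the accept() error is one the loop
 * can simply retry, 0 if it should be reported (for example EMFILE,
 * which means the process has run out of file descriptors). */
static int accept_error_is_transient(int err)
{
    return err == EINTR || err == ECONNABORTED;
}

/* Hypothetical wrapper: accept one connection, retrying transient errors.
 * Returns the connected descriptor, or -1 with errno describing the failure. */
static int accept_checked(int listen_fd)
{
    for (;;) {
        int fd = accept(listen_fd, NULL, NULL);
        if (fd >= 0)
            return fd;
        if (accept_error_is_transient(errno))
            continue;

        /* Preserve errno across the logging call. */
        int saved = errno;
        fprintf(stderr, "accept failed: %s\n", strerror(saved));
        errno = saved;
        return -1;
    }
}
```

With something like this, the main loop would call accept_checked(listen_fd) and only create a worker thread when the result is not -1, so a failure such as EMFILE would at least show up in the log.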
The accept()ing server socket is created like this:
socket_descriptor = socket(AF_INET, SOCK_STREAM, 0);
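For completeness, a sketch of how such a listening socket is typically set up after the socket() call. Only the socket() line above is taken from the real server; the function name make_listener, the use of INADDR_ANY, and the port and backlog parameters are assumptions for illustration.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper: create a TCP socket, bind it to the given port on
 * all local addresses and start listening with the given backlog.
 * Returns the listening descriptor, or -1 on failure. */
static int make_listener(unsigned short port, int backlog)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    /* The backlog bounds the queue of connections waiting for accept(). */
    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) != 0 ||
        listen(fd, backlog) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Passing port 0 lets the system pick a free port, which is handy for trying the sketch without touching the real service port.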
The following incident happened on the production system:
- the main thread blocked in the call to accept(),
- all the worker threads finished their work successfully,
- the accept() in the main thread never returned, although the clients tried to connect,
- a normal kill on the server process did not work; only "kill -9" managed to stop it.
Some additional information about the environment of the server process:
- the server usually runs for weeks or months without restarting and without problems,
- in the last few weeks it was noticed that the rlimit parameter (maximum number of open file descriptors) for the server process is set to 256 (the default on Solaris 8) instead of the 4096 that was recommended for the server,
- the number of open socket connections of the server (open file descriptors) grows slowly because of half-open sockets; it was slightly under 200 when the incident happened.
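The descriptor limit mentioned above can be inspected and raised from inside the process with getrlimit() and setrlimit(). The following is only a sketch under assumptions: raise_nofile_limit is a hypothetical name, and it caps the request at the hard limit because an unprivileged process cannot go beyond it.

```c
#include <sys/time.h>
#include <sys/resource.h>

/* Hypothetical helper: make sure the soft RLIMIT_NOFILE limit is at least
 * `wanted`, capped at the hard limit. Returns the soft limit in effect
 * after the call, or (rlim_t)-1 on error. */
static rlim_t raise_nofile_limit(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return (rlim_t)-1;

    /* Nothing to do if the soft limit is already high enough. */
    if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= wanted)
        return rl.rlim_cur;

    /* The soft limit may not exceed the hard limit. */
    if (rl.rlim_max != RLIM_INFINITY && wanted > rl.rlim_max)
        wanted = rl.rlim_max;

    rl.rlim_cur = wanted;
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return (rlim_t)-1;
    return rl.rlim_cur;
}
```

Called early in main() as raise_nofile_limit(4096), this would move the soft limit from 256 towards the recommended value, provided the hard limit allows it.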
My question is: is there any explanation for this phenomenon (accept() blocks forever although the clients still try to connect())?
I'm pretty sure that this is not a software problem. The code responsible for accept()ing is too simple to hide an error. The logfile from the incident clearly shows that the application hangs in accept().
I tried to find out which OS or TCP/IP parameters could be responsible for such behaviour. I found some information about the backlog queue and SYN flood attacks, but this does not seem to be the problem - the clients don't try to open that many connections simultaneously...
Any ideas?