Weird, weird thing. We're running OEL 6.9 under VMware. No problems, it works beautifully, and we've had 25 or 30 of these environments up for a couple of years. About 6 months ago we consolidated a dozen databases onto one server with filesystems mounted via NFS, and all was fine for 4 months. But over the last couple of months we've started getting defunct processes on this one server that hosts the dozen databases. The count keeps climbing until it hits max processes, and down go the databases.
The defunct processes all have a parent process ID of 1. They include oracle, sshd, and java, and even bash, su, and fprintd have shown up as zombies. Once they get cranking we get an average of 20 or 30 per minute, though it can be as low as 5 or as high as 50. They can't be killed, and nothing in dmesg or /var/log/messages shows anything happening. We're not out of memory on the server, and we don't appear to be out of anything else like semaphores or shared memory. We do get vague reports from users complaining about performance, but nothing reproducible or obvious like really bad disk I/O.
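For anyone who wants to compare numbers on their own box, here is a minimal sketch of one way to tally zombies whose parent is PID 1, grouped by command name. It is an illustration, not the exact check used above. It assumes a Python 3 interpreter is installed (not the default on OEL 6), and the function names parse_stat and zombies_by_name are made up for this example.

```python
import os
from collections import Counter


def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen].strip())
    comm = text[lparen + 1:rparen]
    fields = text[rparen + 1:].split()
    state, ppid = fields[0], int(fields[1])
    return pid, comm, state, ppid


def zombies_by_name(proc_root="/proc", parent=1):
    """Count processes in state Z whose parent PID equals `parent`."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                _, comm, state, ppid = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process went away or the stat file was unreadable
        if state == "Z" and ppid == parent:
            counts[comm] += 1
    return counts


if __name__ == "__main__":
    # Print one line per command name, most frequent first.
    for name, n in zombies_by_name().most_common():
        print("{:6d} {}".format(n, name))
```

Running it once a minute from cron and keeping the output would give a rough growth rate per command name, which might help line the spikes up with something else happening on the server.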
A reboot makes them go away, and no new ones appear afterward. Well, not for a couple of weeks, and then it starts happening again.
While running a sosreport for Oracle, we did find that an 'initctl list' run as any user clears out all the zombies, though they immediately start accumulating again. That one confuses me.
Anybody run into anything similar?