We've got a client application that uses the java.net.Socket API to communicate with a remote device (normally in a LAN) using some proprietary packet protocol embedded into TCP frames.
That means our client application (either Java SE 6 or 7) will start a request or send some data and the remote device will simply acknowledge the data or respond to the request by sending some requested data.
When the data to be sent exceeds a certain size it needs to be packetized, forming a sequence of packets. Each packet has to be acknowledged, so it's always only one side sending and the other waiting.
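To make the packetizing step concrete, here is a minimal sketch of how a payload could be cut into a sequence of packets. The class name `Packetizer`, the method `split` and the size limit passed to it are hypothetical; the real protocol's headers, checksums and acknowledgement handling are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: the actual packet layout of the proprietary protocol is not shown here.
final class Packetizer {
    private Packetizer() {}

    /** Splits payload into chunks of at most maxPayload bytes, preserving order. */
    static List<byte[]> split(byte[] payload, int maxPayload) {
        if (maxPayload <= 0) {
            throw new IllegalArgumentException("maxPayload must be positive");
        }
        List<byte[]> packets = new ArrayList<byte[]>();
        for (int off = 0; off < payload.length; off += maxPayload) {
            int end = Math.min(off + maxPayload, payload.length);
            // Each chunk is copied so later changes to payload cannot affect it.
            packets.add(Arrays.copyOfRange(payload, off, end));
        }
        return packets;
    }
}
```

Each chunk is then sent on its own, and the next one only goes out after the previous one has been acknowledged.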
The reason for using that protocol embedded into TCP frames is that it was originally meant to be used to communicate via an RS232 interface, and the TCP communication is just used as a tunnel.
The implementation in our Java client is a low-priority workloop thread that receives commands (via a LinkedBlockingQueue) from the actual application, handles the communication with the remote device and, upon completion, passes the response back to the actual application. Before doing so it may do some work of its own, such as preparing received bitmap graphic data for display or parsing received XML data (done in the very same workloop thread).
When fetching commands from the queue the socket connection is established, the request is sent to the remote device and the response is received (in individually acknowledged packets as described above). When the queue runs empty (all commands from the application have been processed), the socket connection is closed.
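Reduced to its control flow, the workloop looks roughly like the sketch below. The `Link` interface is a hypothetical stand-in for our socket handling, and commands and responses are plain strings here only to keep the example short; the real thread blocks on the queue again afterwards instead of returning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

final class WorkloopSketch {
    private WorkloopSketch() {}

    /** Hypothetical stand-in for the socket connection to the remote device. */
    interface Link {
        void open();
        String exchange(String request);
        void close();
    }

    /**
     * Processes all currently queued commands. The link is opened when the
     * first command is fetched and closed once the queue has run empty.
     */
    static List<String> drain(BlockingQueue<String> commands, Link link) {
        List<String> responses = new ArrayList<String>();
        boolean connected = false;
        try {
            String command;
            while ((command = commands.poll()) != null) {
                if (!connected) {
                    link.open();
                    connected = true;
                }
                responses.add(link.exchange(command));
            }
        } finally {
            if (connected) {
                link.close();
            }
        }
        return responses;
    }
}
```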
Now the problem we're currently facing:
On modern developer workstations (running Windows XP or Windows 7) the client application shows no communication problems that we know of. On some target systems (based on embedded Windows XP or Windows 7), however, some received packet content seems to be wrong (possibly fragments of content from previous packets, but nothing I would expect).
Even more confusing is that while the data itself isn't correct, the number of bytes returned by read() is correct (commonly 536 bytes).
We verified that the number of received bytes is correct by letting the application issue the very same requests to the very same remote device, once on the developer workstation and once on the target system, logging the sent and received data and comparing the logs.
What we've tried so far, without success:
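For illustration, the comparison of the two logs boils down to something like the following sketch. `DumpCompare`, `firstMismatch` and `toHex` are hypothetical names, not the code we actually use.

```java
final class DumpCompare {
    private DumpCompare() {}

    /**
     * Returns the index of the first differing byte, or -1 if both arrays
     * have the same length and content. If one array is a prefix of the
     * other, the length of the shorter array is returned.
     */
    static int firstMismatch(byte[] reference, byte[] actual) {
        int common = Math.min(reference.length, actual.length);
        for (int i = 0; i < common; i++) {
            if (reference[i] != actual[i]) {
                return i;
            }
        }
        return reference.length == actual.length ? -1 : common;
    }

    /** Formats data[from..to) as space-separated two-digit hex values for logging. */
    static String toHex(byte[] data, int from, int to) {
        StringBuilder sb = new StringBuilder();
        for (int i = Math.max(0, from); i < Math.min(to, data.length); i++) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", data[i] & 0xff));
        }
        return sb.toString();
    }
}
```

In our case the lengths match and `firstMismatch` would point somewhere inside the packet.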
- reading byte by byte instead of bulk read.
- using the socket's InputStream directly instead of the originally used BufferedInputStream
- switching TcpNoDelay off (we used to need that enabled)
- changing workloop thread priority
- using temporarily allocated receive buffers instead of reusing the same receive buffer
- using blocking SocketChannel
- using non-blocking SocketChannel with a HeapByteBuffer and Selectors for OP_CONNECT, OP_READ and OP_WRITE
- using non-blocking SocketChannel with a DirectByteBuffer and Selectors for OP_CONNECT, OP_READ and OP_WRITE
While using SocketChannel seemed to help on Windows 7 embedded so far (where the problem occurred very frequently until then), it doesn't help on at least one affected Windows XP embedded system.
By using the DirectByteBuffer I wanted to get closer to the origin of the problem, because browsing the actual JVM source code shows that in that case the WSARecv Winsock2 API function writes data directly to the address of that same buffer and also reports the number of bytes written into the buffer.
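The read into the direct buffer can be sketched as follows. This is a simplified, blocking-mode version under assumptions: `DirectRead` and `readExactly` are hypothetical names, and the expected packet length is taken as known in advance. With a non-blocking channel, read() may return 0, and the code would have to go back to the selector instead of looping.

```java
import java.io.EOFException;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

final class DirectRead {
    private DirectRead() {}

    /**
     * Reads exactly length bytes from a blocking channel into a freshly
     * allocated direct buffer and returns it flipped, ready for reading.
     */
    static ByteBuffer readExactly(ReadableByteChannel channel, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocateDirect(length);
        while (buffer.hasRemaining()) {
            // read() returns the number of bytes stored in the buffer, or -1 at end of stream.
            int n = channel.read(buffer);
            if (n < 0) {
                throw new EOFException("stream ended after " + buffer.position() + " bytes");
            }
        }
        buffer.flip();
        return buffer;
    }
}
```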
So far, the wrong data seems to actually come from that Winsock2 API function. What further makes me think so is that a simple 10 ms sleep after the selector reports OP_READ and before calling SocketChannel.read() miraculously makes the data arrive correctly (nearly always)! It also seems that by increasing the sleep duration in millisecond steps, starting with 1 ms, the amount of "correct" data in the buffer increases.
Do you have any clue what might be wrong in our Java client application that could cause such behaviour? Or does the fact that using the DirectByteBuffer doesn't help indicate a problem outside the JVM?