Losing All ZeroMQ Messages
As of today, we have finally cracked a long-standing bug in both our payment terminal integration and our lecture hall control systems at NTK. So, what went wrong in the first place?
Both these systems use the venerable ZeroMQ messaging layer with JSON payloads to facilitate communication between the application server and a multitude of microservices that encapsulate communication with the actual devices such as projectors or payment terminals.
We shut some of the devices down for the night, along with their microservices. For some devices, such as projectors, this is our choice, but with the payment terminals we have no choice: every day, they time out during their nightly batch processing. When started again in the morning, some of the microservices were unable to receive any messages. This of course surprised us a lot. Moreover, restarting neither the microservice nor the application server seemed to help. A full reboot, on the other hand, always did.
After wasting time writing a lot of debugging code, we noticed the following in the ss output:
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 0 10.8.0.30:46112 10.8.0.30:46112
The sockets were connecting to themselves! This will cause things to break for sure, but how did this happen in the first place? Well, our working theory is this:
ZeroMQ attempts to reconnect every 100 ms after losing a connection. For each attempt, the kernel assigns a different source port, until it eventually hands out the very port the socket is trying to connect to. Since nothing is listening there at that moment, the connection attempt meets itself, succeeds, and creates a looped connection. From that moment on, the legitimate owner of the port can no longer bind to it, and so no communication can ever happen.
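To convince ourselves that a TCP socket really can connect to itself, the effect can be forced on purpose. The following is a minimal sketch, not our production code: it assumes Linux and a working loopback interface, uses plain Python sockets instead of ZeroMQ, and the helper names (make_self_connected_socket, is_self_connected, can_bind) are made up for illustration. Instead of waiting for the kernel to pick the unlucky source port, it binds the socket to a port first and then connects to that same port.

```python
import socket


def make_self_connected_socket(host="127.0.0.1"):
    """Return a TCP socket whose peer is its own local address and port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(2.0)  # do not hang forever if the platform behaves differently
    try:
        sock.bind((host, 0))              # let the kernel pick a free port
        sock.connect(sock.getsockname())  # source port == destination port
    except OSError:
        sock.close()
        raise
    return sock


def is_self_connected(sock):
    """True if the local and the peer endpoint of the socket are identical."""
    return sock.getsockname() == sock.getpeername()


def can_bind(host, port):
    """Check whether a would-be server could still bind to host:port."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        probe.bind((host, port))
        return True
    except OSError:
        return False
    finally:
        probe.close()


if __name__ == "__main__":
    with make_self_connected_socket() as sock:
        host, port = sock.getsockname()
        print("local:", sock.getsockname(), "peer:", sock.getpeername())
        print("self-connected:", is_self_connected(sock))
        print("server could bind:", can_bind(host, port))
```

While such a socket stays open, ss shows the same address and port in both columns, and anything written to the socket comes back on the very same socket.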
What are the ports that are automatically assigned to clients initiating connections on Linux?
# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 61000
Sadly, our 46112 falls right into the middle of that range. When I picked that port, I had no idea what a stupid mistake that was.
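A check along the following lines would have flagged the port before we ever deployed it. This is only a sketch under a few assumptions: it reads the range from /proc/sys/net/ipv4/ip_local_port_range, which exists on Linux only, it treats both bounds as inclusive, and the function names are hypothetical.

```python
from pathlib import Path

# Linux-only location of the setting that sysctl prints.
PORT_RANGE_FILE = Path("/proc/sys/net/ipv4/ip_local_port_range")


def parse_port_range(text):
    """Parse the two whitespace-separated numbers of ip_local_port_range."""
    low, high = (int(part) for part in text.split())
    return low, high


def is_ephemeral(port, port_range):
    """True if the kernel may hand out this port as a client source port."""
    low, high = port_range
    return low <= port <= high


if __name__ == "__main__":
    if PORT_RANGE_FILE.exists():
        port_range = parse_port_range(PORT_RANGE_FILE.read_text())
    else:
        # Fall back to the values from our machine.
        port_range = (32768, 61000)
    for port in (46112, 5555):
        verdict = "inside" if is_ephemeral(port, port_range) else "outside"
        print(f"port {port} is {verdict} the range {port_range}")
```

With the values from our machine, 46112 is reported as inside the range, while a port below 32768, such as the arbitrarily chosen 5555, is outside it.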
And why did restarting not help? We never managed to wait long enough for TIME_WAIT to expire. If we did, and then started the server first, everything would have worked again. That is, until the next morning.