An Interesting Bug

A few years ago I wrote the C++ program latheControl, which controls my lathe (and subsequently my milling machine and rotary table) by moving the machine’s axes via stepper motors. It’s designed to be an aid to manual machining rather than going the full CNC route.

It consists of a low-priority UI thread which handles display updates and input, and multiple realtime-scheduled threads which handle stepper motors and sensors. These are high priority as they need microsecond-level timing to handle operations like cutting threads, where the program needs to know exactly where the spindle is (rotation-wise) and how far to advance the z-axis.

All this runs on a Raspberry Pi, with a regular keyboard attached for input. It’s been rock solid and I’ve used it to great effect in many machining projects.

I recently relocated my mill, and started a new cut on a fairly large chunk of steel. To my surprise, the latheControl program stopped responding to any keypresses about halfway through the first cut. I had to physically cut power to the stepper motor to stop the tool’s motion. Strangely, although the program wasn’t responding, I could still switch to a virtual console (Ctrl-Alt-F1) and reboot the Pi without trouble. So the Pi hadn’t crashed.

After the reboot, I tried again. About halfway through the cut, it became unresponsive in exactly the same way.

The program is written such that I can run it on my laptop with all the hardware (steppers, sensors, rotary encoders etc) mocked out. This allows for unit testing as well as manual testing. I could not replicate the issue at all.

Back to the “real” hardware, I started to wonder about hardware failures. I tried installing the latest Pi OS. Again, same issue. Interestingly, if I traversed the x-axis without actually cutting metal, the problem didn’t appear. Now it was getting strange.

It’s not easy (though not impossible) to debug the program while it’s running on the mill. I started thinking about running strace against the live program, but first I tried the trusty “got here” messages in the program’s log output. They confirmed that the input call simply stopped returning keyboard input after a while, but only while cutting metal. I wondered whether the extra load of the mill’s spindle was causing some form of brown-out on the Pi, but discounted this because the Pi kept working fine when I opened a virtual console to reboot it. I wondered if the keyboard was physically failing (it does occasionally get showered with hot metal chips), but the fact that I could switch to a virtual console and type normally ruled that out too.

I started looking more closely at the code, puzzled. The input loop has a function to get input from the keyboard. The program has a virtual function for this – I do this so I can encapsulate and abstract out the graphics library in case I want to swap it in future (I’ve already done this once – the first version of the program used a TUI rather than a GUI). The current one uses SFML to get input and display graphical output.

In pseudocode, the input function does this:

    • get event (this is any input, e.g. window interactions, keypresses, joystick input, mouse motion etc)
    • if event is not a keypress, return “no key pressed”
    • otherwise return the key pressed

This is called from the low-priority UI thread, in a loop, which has a 50ms delay each iteration. Handling one keypress every 50ms is fine, right?
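As a rough C++ sketch of that shape (not the actual latheControl code, and not SFML’s API): the `Event`, `EventType` and `EventQueue` types below are hypothetical stand-ins for the graphics library’s event queue, and `getKeyOnePerCall` is an invented name. The point to notice is that every call consumes exactly one event, whatever its type.

```cpp
#include <cassert>
#include <optional>
#include <queue>

// Hypothetical stand-ins for the graphics library's event types.
enum class EventType { KeyPressed, MouseMoved };

struct Event {
    EventType type;
    char key;  // only meaningful when type == KeyPressed
};

using EventQueue = std::queue<Event>;

// Pops at most ONE event per call, mirroring the pseudocode.
// Returns the key if that event was a keypress, otherwise std::nullopt
// ("no key pressed").
std::optional<char> getKeyOnePerCall(EventQueue& events) {
    if (events.empty()) return std::nullopt;
    Event e = events.front();
    events.pop();
    if (e.type != EventType::KeyPressed) return std::nullopt;
    return e.key;
}

// Usage: a single mouse event ahead of a keypress costs one whole call.
void demoOnePerCall() {
    EventQueue q;
    q.push({EventType::MouseMoved, '\0'});
    q.push({EventType::KeyPressed, 'x'});
    assert(!getKeyOnePerCall(q).has_value());  // first call eats the mouse event
    assert(getKeyOnePerCall(q) == 'x');        // the key only arrives on the second
}
```

Called once per 50ms loop iteration, a function like this can never hand back more than 20 events a second, keypresses or otherwise.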

After staring at the lower-level input code for a few hours, I suddenly realised that if many non-keyboard events are received, the event queue will back up (because only one event is popped off the queue every 50ms).

The keyboard attached to the mill’s Pi has a rollerball on it, to control the mouse. Can you see the issue?

It turned out that the rollerball was being vibrated very slightly by the mill when cutting, causing a slew of “mouse moved” events to be added to the event queue while chewing metal. These were only being popped off at one every 50ms, so the queue grew to the point that keypresses were buried deep in the queue, rendering the program unresponsive.
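To get a feel for how quickly this goes wrong, here is a small back-of-the-envelope model. The 50ms tick comes from the UI loop; the arrival rate of mouse events is purely an assumption for illustration (I never measured it), and the function names are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Models the UI loop: during each tick some events arrive, and the loop
// pops exactly one. `mouseEventsPerTick` is an ASSUMED rate, not a measurement.
std::size_t backlogAfterTicks(std::size_t ticks, std::size_t mouseEventsPerTick) {
    std::size_t queued = 0;
    for (std::size_t t = 0; t < ticks; ++t) {
        queued += mouseEventsPerTick;  // events arriving during this tick
        if (queued > 0) --queued;      // the loop removes only one of them
    }
    return queued;
}

// Approximate wait before a keypress sitting behind `backlog` events is
// reached, given one pop per tick of `tickMs` milliseconds.
std::size_t approxKeyDelayMs(std::size_t backlog, std::size_t tickMs) {
    return backlog * tickMs;
}

void demoBacklog() {
    // ASSUMED: 5 mouse events per 50 ms tick (100 per second), for 10 seconds.
    std::size_t backlog = backlogAfterTicks(200, 5);
    assert(backlog == 800);                          // net gain of 4 events per tick
    assert(approxKeyDelayMs(backlog, 50) == 40000);  // roughly 40 s before the key is seen
}
```

Under those assumed numbers, ten seconds of cutting is enough to delay a keypress by about forty seconds, and the delay keeps growing for as long as the vibration continues.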

When the cause finally dawned on me, it was a simple job to put a loop in the input processing function to discard any non-keyboard events in one go, and the problem disappeared. It was one of the most interesting bugs I’ve encountered during decades of C++ development.
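The shape of the fix, again as a hedged sketch rather than the real code: the types are the same hypothetical stand-ins for the graphics library’s event queue, and `getKeyDraining` is an invented name. With a real library, the `while` loop would wrap its event-polling call instead of a `std::queue`.

```cpp
#include <cassert>
#include <optional>
#include <queue>

// Hypothetical stand-ins for the graphics library's event types.
enum class EventType { KeyPressed, MouseMoved };

struct Event {
    EventType type;
    char key;  // only meaningful when type == KeyPressed
};

using EventQueue = std::queue<Event>;

// Keeps popping until it finds a keypress or the queue is empty, so
// non-keyboard events are discarded in one go instead of one per call.
std::optional<char> getKeyDraining(EventQueue& events) {
    while (!events.empty()) {
        Event e = events.front();
        events.pop();
        if (e.type == EventType::KeyPressed) return e.key;
    }
    return std::nullopt;  // nothing but non-keyboard events were queued
}

// Usage: a keypress behind 1000 mouse events is returned by a single call.
void demoDraining() {
    EventQueue q;
    for (int i = 0; i < 1000; ++i) q.push({EventType::MouseMoved, '\0'});
    q.push({EventType::KeyPressed, 'q'});
    assert(getKeyDraining(q) == 'q');
    assert(q.empty());
}
```

This version still returns at most one keypress per call, so later keypresses stay queued in order for the next iteration; only the events nobody was going to use get thrown away.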
