After some time, I've come back to this and decided the next useful thing would be to add graphics to the LCD. While I've written low-level graphics code before, it wasn't a pleasant experience, so I went with a graphics library I'd seen used before that has built-in support for fonts and images: cairo.
The latest g13 code is on github as usual, and this time it includes example code that uses cairo to write the time to the screen. It does this by using a cairo.ImageSurface as the target surface for the context, then converting the surface's data buffer into the format the G13 device accepts over USB. The G13's LCD data is stored in a vertical-then-horizontal format, some weird amalgam of C-order and Fortran-order display memory, so the ImageSurface's data has to be converted into the G13's layout immediately before drawing to the LCD. I initially used the A1 format, since it was the only 1-bit format and therefore closest to the G13's, and did the conversion in Python. That code turned out to be a nested for-loop over memory in pure Python, which is exactly what Python is bad at.
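A minimal sketch of that pure-Python nested loop, assuming a simplified layout for illustration: a 160x43 screen where each output byte holds eight vertically stacked pixels, top pixel in the low bit. The real code has to read cairo's A1 stride and word packing instead, and the exact G13 bit order here is an assumption:

```python
# Sketch: convert a row-major 1-bit framebuffer (list of rows of
# 0/1 values) into a G13-style vertical-byte buffer.
# Screen size and bit order are illustrative assumptions.
LCD_WIDTH, LCD_HEIGHT = 160, 43

def to_g13(pixels):
    """pixels: LCD_HEIGHT rows, each a list of LCD_WIDTH 0/1 values."""
    bands = (LCD_HEIGHT + 7) // 8          # 6 bands of 8 vertical pixels
    out = bytearray(LCD_WIDTH * bands)
    for y in range(LCD_HEIGHT):            # the nested loop Python hates
        for x in range(LCD_WIDTH):
            if pixels[y][x]:
                out[(y // 8) * LCD_WIDTH + x] |= 1 << (y % 8)
    return bytes(out)
```

Every pixel costs a Python-level index, shift, and or, which is why this ends up in the millisecond range.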
Once this was written and working, I decided I wanted to rewrite the conversion code in C/C++ somehow, and looked around. I had already used Cython and Shedskin, and while I have no particular qualms with either, I decided to try something new and went with SciPy's weave.inline function. I have to admit, if you're familiar with C++ and not wary of using it, weave.inline is the way to go. I wrote the code in a big docstring (not that big, really), passed it into weave, and immediately got an immense speed boost. According to the checked-in micro-benchmark in benchmark.py, the A1 conversion went from 3ms per conversion to 58us on my machine, a 51x speedup. While benchmark numbers should always be taken with a grain of salt, this is more than an order of magnitude faster, and the benchmark runs on input data similar to what's seen in normal use (I write some text to the cairo image surface before converting it). To get these numbers I ran each function 512 and 32,768 times, respectively, so each test ran for over a second in total. Running longer didn't make a significant difference; upping the limit to 10 seconds gave the same numbers within 1%.
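The benchmarking approach itself is simple enough to sketch with the standard library's timeit. The function names here are hypothetical, not the ones in benchmark.py; the idea is just to pick an iteration count high enough that the total runtime exceeds a second, so timer resolution and loop overhead stop mattering:

```python
import timeit

def per_call_time(func, iterations):
    """Return the average seconds per call of `func`.

    Choose `iterations` so the total runtime is at least ~1 second;
    below that, timer granularity and loop overhead skew the result.
    """
    total = timeit.timeit(func, number=iterations)
    return total / iterations

# Trivial stand-in for a conversion function, just to show usage.
avg = per_call_time(lambda: sum(range(100)), 10_000)
```

Doubling the iteration counts (as with the 10-second run mentioned above) is a cheap sanity check that the numbers have converged.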
However, that's not the end. I noticed that even the C++ version had two loops: one over all the input bytes, and an inner one over each bit of the byte. While the A1 format uses the least memory, and therefore needs the fewest memory accesses, it was doing a lot of bitwise math that might be unnecessary. So, as an experiment, I switched to another ImageSurface format: RGB24. (I skipped ARGB32 since I didn't want to deal with transparency on a 1-bit screen.) The new weave implementation was over twice as fast, taking 24us to do the same conversion.
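The win comes from dropping the inner bit loop: in RGB24, cairo stores each pixel as a full 32-bit word, so the conversion can test a single channel byte per pixel instead of unpacking bits. A pure-Python sketch of that structure, under the same assumed G13 layout as before (which channel byte to test also depends on endianness, so treat the `+ 1` offset as an assumption):

```python
LCD_WIDTH, LCD_HEIGHT = 160, 43

def rgb24_to_g13(data, stride):
    """data: flat RGB24 surface buffer, 4 bytes per pixel.
    stride: bytes per surface row, as cairo would report it.
    For a black-on-white 1-bit image, any one channel byte is
    enough to decide whether the pixel is lit."""
    bands = (LCD_HEIGHT + 7) // 8
    out = bytearray(LCD_WIDTH * bands)
    for y in range(LCD_HEIGHT):
        base = y * stride
        for x in range(LCD_WIDTH):
            # One indexed access and one truth test per pixel;
            # no per-bit shifting or masking.
            if data[base + x * 4 + 1]:
                out[(y // 8) * LCD_WIDTH + x] |= 1 << (y % 8)
    return bytes(out)
```

In C++ this trade (4x the memory traffic, no bit math) pays off; as the next paragraph shows, in pure Python it goes the other way.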
For this blog post, I thought I'd round out the corner and implement the fourth piece of code, converting RGB24 in pure Python, and found it was in fact over twice as slow as the pure-Python A1 version, taking 8ms. Switching from struct.unpack and bit-math to direct array access gains a bit, bringing it down to 5.6ms, but that's still nearly twice as slow. It turns out that, oddly enough, a loop over xrange(8) with bit access is faster than three array accesses and a straight numeric comparison.
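Condensed to just the inner loops, the comparison looks something like this (illustrative sketch, not the checked-in code): the A1 style pays in shifts and masks, the RGB24 style pays in extra indexed accesses, and in pure Python the bit math turns out cheaper per pixel:

```python
def pixels_from_a1_byte(b):
    # A1-style inner loop: one input byte yields eight pixels
    # via shift-and-mask bit math.
    return [(b >> i) & 1 for i in range(8)]

def pixels_from_rgb24_words(chunk):
    # RGB24-style inner loop: eight pixels cost eight separate
    # indexed accesses into 4-byte words; no bit math, but far
    # more Python-level indexing work.
    return [1 if chunk[i * 4 + 1] else 0 for i in range(8)]
```

Both produce the same eight pixels; only the per-pixel work differs, and that difference flips sign between C++ and Python.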
Check out the code on github. I'm going to combine this with the state machine and start using it on my own machine as a shortcut tool.