* * * * * Sixteen hours to change three lines of code The bug report was simple enough: “Project: Sippy-Cup [1]” would, when it had to retransmit the data (remember, we're running over UDP), it would send an additional packet. Table: Data flow for “Project: Sippy-Cup” Sippy Cup direction Client ------------------------------ DATA → (packet dropped) DATA → (packet received) (packet recieved) ← ACK DATA →   So far, the bugs found in “Project: Sippy-Cup” have been easy to reproduce (thankfully!) and easy to fix once identified. This one was easy to reproduce … and took two days of intense effort to fix. The bug is not a show-stopper, and the client should (there's that word) handle the duplicate packet. My manager, S, was subtly hinting that I should put this on the backburner, but this is one of those problems that just irks me. I added logging statements. I instrumented the code. I traced the execution of the program line-by-line (easy enough to do within Lua [2] as there's a pre-defined method for hooking into the execution engine, and even though this didn't reveal the root cause of the bug, it did reveal a few places where some simple changes optimized the code (but that's for another post [3]). For example, here's all the code that's executed [4] when you run the luarocks [5] command, but fair warning—it's nearly 7,000 lines long) and it became clear that something was wrong with the timeout handling (what I didn't mention the other day about scheduling coroutines [6] is how I handle timeouts, but alas, that's for another post [7] as well). The code is basically: > request = receive_packet_from_client() > send_ack() > send_request_to_db(request) > > -- ------------------------------------------------------------------ > -- This sets things up to force the coroutine to resume in one second, > -- returning a result of nil. > -- ------------------------------------------------------------------ > > timeout(1) > > -- ---------------------------------------------------------------- > -- the code that resumes this coroutine will pass in the result in > -- coroutine.resume(), which is assigned to dbresult > -- ---------------------------------------------------------------- > > dbresult = coroutine.yield() > > -- ------------------------------------------------------------------------ > -- cancel a pending timeout, if any. If we did timeout, this is a NOP. If > -- we haven't timed out, this prevents the timeout from triggering. > -- ------------------------------------------------------------------------ > > timeout(0) > > if not dbresult then > info = failed() > else > info = extract(dbresult) > end > > for retries = 1 , 3 do > send_information_to_client() > timeout(1) > ack = coroutine.yield() > timeout(0) > if ack then > return > end > end > I was able to determine that the initial “timeout” was the request to the database engine timeout trigger (after hacking in a way to track the timeout requests) which meant the timeout code wasn't working properly. Once that was clear, I was able to locate the problem code: > -- ------------------------------------------------------------------------ > -- TQUEUE contains references to coroutines that want a timeout. If we set > -- the timeout to 0, this just sets the awake field to 0, which informs the > -- coroutine scheduler not to run the coroutine when it "times out". > -- ------------------------------------------------------------------------ > > function timeout(when) > local f = coroutine.running() > > if when == 0 then > if TQUEUE[co] then > TQUEUE[co].awake = 0 -- oops, bug. See this post [8] for details > end > else > TQUEUE:insert(when,f) > end > end > Here we have an instance where a static compiler would have flagged this immediately [9]. I either need to rename f to co, or change co to f. A simple two line fix (and I really need to find a code checker for Lua). The other issue, now quickly found, was this bit of code: > local function handler(info,err,remote,message,startime) > ... > local ninfo,err,nremote,nmessage = coroutine.yield() > timeout(0) > > if err == 0 then > if not message.request.status then > -- error > else > return nmessage > end > else > ... > Even a static compiler would not have found this issue—I was checking the wrong variable! The code should read: > if not nmessage.request.status then > Bad me for using a bad name for a variable. No cookie for me! Okay, this is a simple one line change. Total lines changed: 3. [1] gopher://gopher.conman.org/0Phlog:2014/03/05.1 [2] http://www.lua.org/ [3] gopher://gopher.conman.org/0Phlog:2015/02/13.1 [4] gopher://gopher.conman.org/0Phlog:2015/02/12/luarocks-run.txt [5] https://rocks.moonscript.org/ [6] gopher://gopher.conman.org/0Phlog:2015/02/05.1 [7] gopher://gopher.conman.org/0Phlog:2015/02/16.1 [8] gopher://gopher.conman.org/0Phlog:2015/02/16.1 [9] gopher://gopher.conman.org/0Phlog:2015/02/09.1 Email Sean Conner at sean@conman.org .