https://medium.com/@polyglot_factotum/re-fixing-servos-event-loop-e00bdf267385

Re-fixing Servo's event-loop

Gregory Terzian · 5 min read

A follow-up on a previous story, in which we identify another problem, and perhaps confirm a theory through a theorem.

As part of the ongoing renovation of Servo's event-loop, we are trying to move closer to the specification of updating the rendering. After having merged a general change that supported the overall structure of the spec, the next step was to respect certain ordering constraints. What appeared to be an easy fix (a simple filter map of available web documents) instead ended up exposing a concurrency problem.

The problem

The data representing a web page in Servo comes in different flavors, each one dealing with a different view into the page. For example, the Document corresponds to the DOM element, while the Pipeline is rather a view into a web document from the perspective of the overall loading, navigating, communicating with, and ultimately rendering of content (orchestrated by the constellation).

When a new web page opens, a new event-loop spawns (or an existing one is re-used), and it is that event-loop that will run the contents of the page, as per the HTML event-loop. In the case of same-origin iframes or tabs, multiple documents end up running on the same event-loop (bar some exceptions, or it wouldn't be the Web), and ordering constraints should be applied to those documents when updating the rendering.

When a web page closes, the HTML event-loop keeps running as long as other pages need it, but the tasks queued for the closing page are cancelled. Since any queued "update the rendering" task runs its steps for all active documents, their queuing can be batched so that only one task runs per cycle, which in Servo was achieved through a boolean.

With hindsight, it is obvious that this wasn't a good idea: the task queued can be cancelled if the page for which it was queued is closed, and then other pages relying on this batching mechanism will never get their rendering updated. The fix is, with hindsight, just as obvious: scope the batching per page.

Screenshot: the fix, at https://github.com/servo/servo/pull/32574/files#diff-2c5948584719622eef023c56cdf90e8278971f2f2eda26e7fd0485a24c32be3cL520

But what is not so obvious is the actual flaw in logic, and its correction. As you can see from the screenshot above, the fix does not even relate to concurrency on the face of it, and the borrow checker is of no help here. We need other tools and mindsets.

What I wrote above ("the task queued can be cancelled if ...") comes down to a sequence of steps leading to the erroneous state:

1. A task is queued for document A, and the boolean is set to true.
2. Document B also wants to queue a task, but skips it because the boolean is already set to true.
3. Document A is closed before the task runs, and the task is cancelled.

Document B is now stuck waiting for eternity for the task to run: deadlock.

This is an example of error-prone thinking in terms of previous steps taken by the algorithm. A more robust form of thinking is through an invariant that must be respected at each step (it is hard to keep permutations of steps in mind; an invariant is easily kept in mind). And due to the concurrent nature of the problem, it is impractical to reproduce it through testing.
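To make the failure sequence above concrete, and to show the shape of the per-page fix, here is a minimal, self-contained sketch. It is not Servo's actual code (that lives behind the PR linked above); the names (EventLoop, note_rendering_opportunity, a plain u32 for PipelineId) are invented for illustration.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical names throughout: a sketch of the batching idea, not Servo's code.
type PipelineId = u32;

#[derive(Default)]
struct EventLoop {
    // The broken design kept a single flag shared by every document on this
    // event-loop:
    //     update_the_rendering_queued: bool,
    // Its task could be cancelled when the one document that set it closed,
    // stranding every other document. The fix scopes the flag per pipeline:
    rendering_update_queued: HashMap<PipelineId, bool>,
    // Queued "update the rendering" tasks, tagged with the pipeline for which
    // they were queued (and with which they are cancelled).
    task_queue: VecDeque<PipelineId>,
}

impl EventLoop {
    fn note_rendering_opportunity(&mut self, pipeline: PipelineId) {
        // Batch per pipeline: only queue a new task if *this* pipeline
        // does not already have one pending.
        let queued = self
            .rendering_update_queued
            .entry(pipeline)
            .or_insert(false);
        if !*queued {
            *queued = true;
            self.task_queue.push_back(pipeline);
        }
    }

    fn close_pipeline(&mut self, pipeline: PipelineId) {
        // Cancel the closing pipeline's queued task and forget its batching
        // state; the other pipelines are unaffected.
        self.task_queue.retain(|p| *p != pipeline);
        self.rendering_update_queued.remove(&pipeline);
    }

    fn run_one_task(&mut self) {
        if let Some(pipeline) = self.task_queue.pop_front() {
            self.rendering_update_queued.insert(pipeline, false);
            // The real task would run "update the rendering" for all active
            // documents of this event-loop here.
            println!("update the rendering (queued by pipeline {pipeline})");
        }
    }
}

fn main() {
    let mut event_loop = EventLoop::default();
    event_loop.note_rendering_opportunity(1); // step 1: document A queues a task
    event_loop.note_rendering_opportunity(2); // step 2: document B is no longer skipped
    event_loop.close_pipeline(1);             // step 3: document A closes, its task is cancelled
    event_loop.run_one_task();                // document B still gets its rendering updated
}
```

With the old shared boolean, step 2 would have skipped queuing for the second pipeline and step 3 would have cancelled the only queued task, so nothing would ever run; with per-pipeline flags, closing one pipeline only clears that pipeline's batching state.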
So we must find a way to define an invariant, use our own judgement to ensure it matches the reality of the system, and test its logical correctness independently of the software. How? Enter: TLA+.

Informal verification

We will first model the broken state of the current system.

Screenshot: our broken batching-of-tasks spec.

Above is a simple TLA+ spec, consisting of three actions that queue, run, and cancel tasks. For a task to run, the pipeline must remain open. A boolean is used to batch the queuing of tasks.

As a quick side note: the spec looks nothing like the software it is meant to model. One thing that, until recently, made it very hard for me to write TLA+ specs was the urge to use them to write software: don't do it. Focus instead on building a model for a particular behavior of your system.

The interesting part is the invariant: QueuedTaskRuns. Its English translation would be: the boolean being true implies that there is a task queued for a pipeline. Running the spec in TLC reveals almost immediately that this invariant is not respected for long:

Screenshot: results from a TLC run of the broken spec.

The moment a pipeline that has queued a task is closed, the invariant is invalidated. This confirms our mental model that using a simple boolean was not a good idea. Now we can model the fix.

Screenshot: the fixed spec.

The above spec is a fleshed-out version of the previous one: batching is done per pipeline (the fix), and an additional variable is used to track closed pipelines (needed for the invariant). The interesting part is again the invariant, which now reads as: for all open pipelines, if a rendering task was queued, then a task is pending in the queue (we use "if", which means "implies", versus the stronger "iff", which means "if, and only if"). Note that A => B means that A can be false while B is true (but not the other way around), which is exactly what can happen as a result of resetting the batching in RunTask.

Screenshot: TLC run results.

TLC finds no error after having explored 624 distinct states in 7 seconds. The number of states should convince you that finding an invariant is easier than running all scenarios in your mind.

At this point, we have done the following:

* Formalized our mental model of how rendering-task batching works.
* Thrown a bunch of CPUs at the model to try to find errors.

What we haven't done is formal verification: that would require writing a proof to back up our spec.

And what about the relationship between the code and the spec? My personal judgement is that the spec reflects the behavior of the system. I could be wrong in two ways:

1. The batching behaves differently.
2. The batching behaves as modeled, but the intermittent test failures experienced in CI relate to another part of the system, and my theory about their cause was wrong.

Case 1 would mean I could update the spec with newfound knowledge, and case 2 would mean I would have to write another spec. In both cases, what links the spec to the code is human judgement.

Thank you for reading. Sure enough, before the ink dried on this post, the test started failing intermittently. Is it because the batching is still not fixed, or is there something else interfering? I have no idea yet... Stay tuned.

Written by Gregory Terzian. I write in .js, .py, .rs, .tla, and English. Always for people to read.