https://medium.com/@polyglot_factotum/re-fixing-servos-event-loop-e00bdf267385

Re-fixing Servo's event-loop

Gregory Terzian · 5 min read

A follow-up on a previous story, in which we identify another problem, and perhaps confirm a theory through a theorem.

As part of the ongoing renovation of Servo's event-loop, we are trying to move closer to the specification of updating the rendering. After having merged a general change that supported the overall structure of the spec, the next step was to respect certain ordering constraints. What appeared to be an easy fix (a simple filter map of available web documents) instead ended up exposing a concurrency problem.

The problem

The data representing a web page in Servo comes in different flavors, each one dealing with a different view into the page. For example, the Document corresponds to the DOM element, while the Pipeline is rather a view into a web document from the perspective of the overall loading, navigating, communicating with, and ultimately rendering of content (orchestrated by the constellation).

When a new web page opens, a new event-loop spawns (or an existing one is re-used), and it is that event-loop that will run the contents of the page, as per the HTML event-loop. In the case of same-origin iframes or tabs, multiple documents end up running on the same event-loop (bar some exceptions, or it wouldn't be the Web), and ordering constraints should be applied to those documents when updating the rendering.

When a web page closes, the HTML event-loop keeps running as long as other pages need it, but the tasks queued for the closing page are cancelled. Since any queued "update the rendering" task runs its steps for all active documents, their queuing can be batched so that only one task runs per cycle, which in Servo was achieved through a boolean.

With hindsight, it is obvious that this wasn't a good idea: the task queued can be cancelled if the page for which it was queued is closed, and then other pages relying on this batching mechanism will never get their rendering updated. The fix is, with hindsight, just as obvious: scope the batching per page.

Screenshot: the fix, at https://github.com/servo/servo/pull/32574/files#diff-2c5948584719622eef023c56cdf90e8278971f2f2eda26e7fd0485a24c32be3cL520

But what is not so obvious is the actual flaw in logic, and its correction. As you can see from the screenshot above, the fix does not even relate to concurrency on the face of it, and the borrow checker is of no help here. We need other tools and mindsets.

What I wrote above ("the task queued can be cancelled if ...") comes down to a sequence of steps leading to the erroneous state:

1. A task is queued for document A, and the boolean is set to true.
2. Document B also wants to queue a task, but skips it because the boolean is already set to true.
3. Document A is closed before the task runs, and the task is cancelled.

Document B is now stuck waiting for eternity for the task to run: deadlock.

This is an example of error-prone thinking in terms of previous steps taken by the algorithm. A more robust form of thinking is through an invariant that must be respected at each step (it is hard to keep permutations of steps in mind; an invariant is easily kept in mind). And due to the concurrent nature of the problem, it is impractical to reproduce it through testing.
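To make the failure sequence above concrete, and to show the shape of the per-page fix, here is a minimal, self-contained sketch. It is not Servo's actual code (that lives behind the PR linked above); the names (EventLoop, note_rendering_opportunity, a plain u32 for PipelineId) are invented for illustration.

```rust
use std::collections::{HashMap, VecDeque};

// Hypothetical names throughout: a sketch of the batching idea, not Servo's code.
type PipelineId = u32;

#[derive(Default)]
struct EventLoop {
    // The broken design kept a single flag shared by every document on this
    // event-loop:
    //     update_the_rendering_queued: bool,
    // Its task could be cancelled when the one document that set it closed,
    // stranding every other document. The fix scopes the flag per pipeline:
    rendering_update_queued: HashMap<PipelineId, bool>,
    // Queued "update the rendering" tasks, tagged with the pipeline for which
    // they were queued (and with which they are cancelled).
    task_queue: VecDeque<PipelineId>,
}

impl EventLoop {
    fn note_rendering_opportunity(&mut self, pipeline: PipelineId) {
        // Batch per pipeline: only queue a new task if *this* pipeline
        // does not already have one pending.
        let queued = self
            .rendering_update_queued
            .entry(pipeline)
            .or_insert(false);
        if !*queued {
            *queued = true;
            self.task_queue.push_back(pipeline);
        }
    }

    fn close_pipeline(&mut self, pipeline: PipelineId) {
        // Cancel the closing pipeline's queued task and forget its batching
        // state; the other pipelines are unaffected.
        self.task_queue.retain(|p| *p != pipeline);
        self.rendering_update_queued.remove(&pipeline);
    }

    fn run_one_task(&mut self) {
        if let Some(pipeline) = self.task_queue.pop_front() {
            self.rendering_update_queued.insert(pipeline, false);
            // The real task would run "update the rendering" for all active
            // documents of this event-loop here.
            println!("update the rendering (queued by pipeline {pipeline})");
        }
    }
}

fn main() {
    let mut event_loop = EventLoop::default();
    event_loop.note_rendering_opportunity(1); // step 1: document A queues a task
    event_loop.note_rendering_opportunity(2); // step 2: document B is no longer skipped
    event_loop.close_pipeline(1);             // step 3: document A closes, its task is cancelled
    event_loop.run_one_task();                // document B still gets its rendering updated
}
```

With the old shared boolean, step 2 would have skipped queuing for the second pipeline and step 3 would have cancelled the only queued task, so nothing would ever run; with per-pipeline flags, closing one pipeline only clears that pipeline's batching state.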
So we must find a way to define an invariant, use our own judgement to ensure it matches the reality of the system, and test its logical correctness independently of the software. How? Enter: TLA+.

Informal verification

We will first model the broken state of the current system.

Screenshot: our broken batching-of-tasks spec.

Above is a simple TLA+ spec, consisting of three actions that queue, run, and cancel tasks. For a task to run, the pipeline must remain open. A boolean is used to batch the queuing of tasks.

As a quick side note: the spec looks nothing like the software it is meant to model. One thing that, until recently, made it very hard for me to write TLA+ specs was the urge to use them to write software: don't do it. Focus instead on building a model for a particular behavior of your system.

The interesting part is the invariant: QueuedTaskRuns. Its English translation would be: the boolean being true implies that there is a task queued for a pipeline. Running the spec in TLC reveals almost immediately that this invariant is not respected for long:

Screenshot: results from a TLC run of the broken spec.

The moment a pipeline that has queued a task is closed, the invariant is invalidated. This confirms our mental model that using a simple boolean was not a good idea. Now we can model the fix.

Screenshot: the fixed spec.

The above spec is a fleshed-out version of the previous one: batching is done per pipeline (the fix), and an additional variable is used to track closed pipelines (needed for the invariant). The interesting part is again the invariant, which now reads as: for all open pipelines, if a rendering task was queued, then a task is pending in the queue (we use "if", which means "implies", versus the stronger "iff", which means "if, and only if"). Note that A => B means that A can be false while B is true (but not the other way around), which is exactly what can happen as a result of resetting the batching in RunTask.

Screenshot: TLC run results.

TLC finds no error after having explored 624 distinct states in 7 seconds. The number of states should convince you that finding an invariant is easier than running all scenarios in your mind.

At this point, we have done the following:

* Formalized our mental model of how rendering-task batching works.
* Thrown a bunch of CPUs at the model to try to find errors.

What we haven't done is formal verification: that would require writing a proof to back up our spec.

And what about the relationship between the code and the spec? My personal judgement is that the spec reflects the behavior of the system. I could be wrong in two ways:

1. The batching behaves differently.
2. The batching behaves as modeled, but the intermittent test failures experienced in CI relate to another part of the system, and my theory about their cause was wrong.

Case 1 would mean I could update the spec with newfound knowledge, and case 2 would mean I would have to write another spec. In both cases, what links the spec to the code is human judgement.

Thank you for reading. Sure enough, before the ink dried on this post, the test started failing intermittently. Is it because the batching is still not fixed, or is there something else interfering? I have no idea yet... Stay tuned.

Written by Gregory Terzian. I write in .js, .py, .rs, .tla, and English. Always for people to read.