http://varnish-cache.org/docs/trunk/phk/notes.html#so-what-s-wrong-with-1975-programming

Navigation

  * index
  * next |
  * previous |
  * Varnish version trunk documentation >>
  * Poul-Hennings random outbursts >>
  * Notes from the Architect

Notes from the ArchitectP

Once you start working with the Varnish source code, you will notice
that Varnish is not your average run of the mill application.

That is not a coincidence.

I have spent many years working on the FreeBSD kernel, and only
rarely did I venture into userland programming, but when I had
occation to do so, I invariably found that people programmed like it
was still 1975.

So when I was approached about the Varnish project I wasn't really
interested until I realized that this would be a good opportunity to
try to put some of all my knowledge of how hardware and kernels work
to good use, and now that we have reached alpha stage, I can say I
have really enjoyed it.

So what's wrong with 1975 programming ?P

The really short answer is that computers do not have two kinds of
storage any more.

It used to be that you had the primary store, and it was anything
from acoustic delaylines filled with mercury via small magnetic
dougnuts via transistor flip-flops to dynamic RAM.

And then there were the secondary store, paper tape, magnetic tape,
disk drives the size of houses, then the size of washing machines and
these days so small that girls get disappointed if think they got
hold of something else than the MP3 player you had in your pocket.

And people program this way.

They have variables in "memory" and move data to and from "disk".

Take Squid for instance, a 1975 program if I ever saw one: You tell
it how much RAM it can use and how much disk it can use. It will then
spend inordinate amounts of time keeping track of what HTTP objects
are in RAM and which are on disk and it will move them forth and back
depending on traffic patterns.

Well, today computers really only have one kind of storage, and it is
usually some sort of disk, the operating system and the virtual
memory management hardware has converted the RAM to a cache for the
disk storage.

So what happens with squids elaborate memory management is that it
gets into fights with the kernels elaborate memory management, and
like any civil war, that never gets anything done.

What happens is this: Squid creates a HTTP object in "RAM" and it
gets used some times rapidly after creation. Then after some time it
get no more hits and the kernel notices this. Then somebody tries to
get memory from the kernel for something and the kernel decides to
push those unused pages of memory out to swap space and use the
(cache-RAM) more sensibly for some data which is actually used by a
program. This however, is done without squid knowing about it. Squid
still thinks that these http objects are in RAM, and they will be,
the very second it tries to access them, but until then, the RAM is
used for something productive.

This is what Virtual Memory is all about.

If squid did nothing else, things would be fine, but this is where
the 1975 programming kicks in.

After some time, squid will also notice that these objects are
unused, and it decides to move them to disk so the RAM can be used
for more busy data. So squid goes out, creates a file and then it
writes the http objects to the file.

Here we switch to the high-speed camera: Squid calls write(2), the
address i gives is a "virtual address" and the kernel has it marked
as "not at home".

So the CPU hardwares paging unit will raise a trap, a sort of
interrupt to the operating system telling it "fix the memory please".

The kernel tries to find a free page, if there are none, it will take
a little used page from somewhere, likely another little used squid
object, write it to the paging poll space on the disk (the "swap
area") when that write completes, it will read from another place in
the paging pool the data it "paged out" into the now unused RAM page,
fix up the paging tables, and retry the instruction which failed.

Squid knows nothing about this, for squid it was just a single normal
memory acces.

So now squid has the object in a page in RAM and written to the disk
two places: one copy in the operating systems paging space and one
copy in the filesystem.

Squid now uses this RAM for something else but after some time, the
HTTP object gets a hit, so squid needs it back.

First squid needs some RAM, so it may decide to push another HTTP
object out to disk (repeat above), then it reads the filesystem file
back into RAM, and then it sends the data on the network connections
socket.

Did any of that sound like wasted work to you ?

Here is how Varnish does it:

Varnish allocate some virtual memory, it tells the operating system
to back this memory with space from a disk file. When it needs to
send the object to a client, it simply refers to that piece of
virtual memory and leaves the rest to the kernel.

If/when the kernel decides it needs to use RAM for something else,
the page will get written to the backing file and the RAM page reused
elsewhere.

When Varnish next time refers to the virtual memory, the operating
system will find a RAM page, possibly freeing one, and read the
contents in from the backing file.

And that's it. Varnish doesn't really try to control what is cached
in RAM and what is not, the kernel has code and hardware support to
do a good job at that, and it does a good job.

Varnish also only has a single file on the disk whereas squid puts
one object in its own separate file. The HTTP objects are not needed
as filesystem objects, so there is no point in wasting time in the
filesystem name space (directories, filenames and all that) for each
object, all we need to have in Varnish is a pointer into virtual
memory and a length, the kernel does the rest.

Virtual memory was meant to make it easier to program when data was
larger than the physical memory, but people have still not caught on.

More cachesP

But there are more caches around, the silicon mafia has more or less
stalled at 4GHz CPU clock and to get even that far they have had to
put level 1, 2 and sometimes 3 caches between the CPU and the RAM
(which is the level 4 cache), there are also things like write
buffers, pipeline and page-mode fetches involved, all to make it a
tad less slow to pick up something from memory.

And since they have hit the 4GHz limit, but decreasing silicon
feature sizes give them more and more transistors to work with,
multi-cpu designs have become the fancy of the world, despite the
fact that they suck as a programming model.

Multi-CPU systems is nothing new, but writing programs that use more
than one CPU at a time has always been tricky and it still is.

Writing programs that perform well on multi-CPU systems is even
trickier.

Imagine I have two statistics counters:

    unsigned n_foo; unsigned n_bar;

So one CPU is chugging along and has to execute n_foo++

To do that, it read n_foo and then write n_foo back. It may or may
not involve a load into a CPU register, but that is not important.

To read a memory location means to check if we have it in the CPUs
level 1 cache. It is unlikely to be unless it is very frequently
used. Next check the level two cache, and let us assume that is a
miss as well.

If this is a single CPU system, the game ends here, we pick it out of
RAM and move on.

On a Multi-CPU system, and it doesn't matter if the CPUs share a
socket or have their own, we first have to check if any of the other
CPUs have a modified copy of n_foo stored in their caches, so a
special bus-transaction goes out to find this out, if if some cpu
comes back and says "yeah, I have it" that cpu gets to write it to
RAM. On good hardware designs, our CPU will listen in on the bus
during that write operation, on bad designs it will have to do a
memory read afterwards.

Now the CPU can increment the value of n_foo, and write it back. But
it is unlikely to go directly back to memory, we might need it again
quickly, so the modified value gets stored in our own L1 cache and
then at some point, it will end up in RAM.

Now imagine that another CPU wants to n_bar+++ at the same time, can
it do that ? No. Caches operate not on bytes but on some "linesize"
of bytes, typically from 8 to 128 bytes in each line. So since the
first cpu was busy dealing with n_foo, the second CPU will be trying
to grab the same cache-line, so it will have to wait, even through it
is a different variable.

Starting to get the idea ?

Yes, it's ugly.

How do we cope ?P

Avoid memory operations if at all possible.

Here are some ways Varnish tries to do that:

When we need to handle a HTTP request or response, we have an array
of pointers and a workspace. We do not call malloc(3) for each
header. We call it once for the entire workspace and then we pick
space for the headers from there. The nice thing about this is that
we usually free the entire header in one go and we can do that simply
by resetting a pointer to the start of the workspace.

When we need to copy a HTTP header from one request to another (or
from a response to another) we don't copy the string, we just copy
the pointer to it. Provided we do not change or free the source
headers, this is perfectly safe, a good example is copying from the
client request to the request we will send to the backend.

When the new header has a longer lifetime than the source, then we
have to copy it. For instance when we store headers in a cached
object. But in that case we build the new header in a workspace, and
once we know how big it will be, we do a single malloc(3) to get the
space and then we put the entire header in that space.

We also try to reuse memory which is likely to be in the caches.

The worker threads are used in "most recently busy" fashion, when a
workerthread becomes free it goes to the front of the queue where it
is most likely to get the next request, so that all the memory it
already has cached, stack space, variables etc, can be reused while
in the cache, instead of having the expensive fetches from RAM.

We also give each worker thread a private set of variables it is
likely to need, all allocated on the stack of the thread. That way we
are certain that they occupy a page in RAM which none of the other
CPUs will ever think about touching as long as this thread runs on
its own CPU. That way they will not fight about the cachelines.

If all this sounds foreign to you, let me just assure you that it
works: we spend less than 18 system calls on serving a cache hit, and
even many of those are calls tog get timestamps for statistics.

These techniques are also nothing new, we have used them in the
kernel for more than a decade, now it's your turn to learn them :-)

So Welcome to Varnish, a 2006 architecture program.

phk

Table of Contents

  * Notes from the Architect
      + So what's wrong with 1975 programming ?
      + More caches
      + How do we cope ?

Previous topic

Why Sphinx and reStructuredText ?

Next topic

Varnish Glossary

This Page

  * Show Source

Quick search

[                    ] [Go]
Navigation

  * index
  * next |
  * previous |
  * Varnish version trunk documentation >>
  * Poul-Hennings random outbursts >>
  * Notes from the Architect

(c) Copyright 2010-2014, Varnish Software AS. Created using Sphinx
3.5.2.