* * * * *
                                        
                         Is profiling even viable now?
                                        
Mark brought up (in email) an interesting optimization technique using GCC 3:

> I came across an interesting optimization that is GCC specific but quite
> clever.
> 
> In lots of places in the Linux kernel you will see something like:
> 
> > p = get_some_object();
> > if (unlikely(p == NULL))
> > {
> >   kill_random_process();
> >   return (ESOMETHING);
> > }
> > 
> > do_stuff(p);
> > 
> 
> The conditional is clearly an error path and as such means it is rarely
> taken. This is actually a macro defined like this:
> 
> > #define unlikely(b)   __builtin_expect(b, 0)
> > 
> 
> On newer versions of GCC this tells the compiler to expect the condition
> not to be taken. You could also tell the compiler that the branch is likely
> to be taken:
> 
> > #define likely(b)     __builtin_expect(b, 1)
> > 
> 
> So how does this help GCC anyhow? Well, on some architectures (PowerPC)
> there is actually a bit in the branch instruction to tell the CPU's
> speculative execution unit if the branch is likely to be taken. On other
> architectures it avoids conditional branches to make the “fast path” branch
> free (with -freorder-blocks).
> 
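
As a rough sketch of how those macros might be used outside the kernel: the
function sum_bytes() below is made up purely for illustration, and the
fallback for compilers other than GCC 3 or later is an assumption, not
something from Mark's email. The `!!(b)` is there because __builtin_expect()
compares its first argument against the expected value, so any non-zero
condition has to be normalized to 1 for likely() to mean what it says.

```c
#include <stddef.h>

/* Use the hint only where the builtin exists; elsewhere the   */
/* macros collapse to the bare condition (assumed fallback).    */
#if defined(__GNUC__) && (__GNUC__ >= 3)
#  define likely(b)     __builtin_expect(!!(b), 1)
#  define unlikely(b)   __builtin_expect(!!(b), 0)
#else
#  define likely(b)     (b)
#  define unlikely(b)   (b)
#endif

/* Hypothetical example: sum a buffer, treating a NULL buffer   */
/* as the rarely taken error path.                              */
static long sum_bytes(const unsigned char *buf, size_t len)
{
  long   total = 0;
  size_t i;

  if (unlikely(buf == NULL))
    return -1;

  for (i = 0 ; i < len ; i++)
    total += buf[i];

  return total;
}
```

Calling sum_bytes() on a four-byte buffer holding 1, 2, 3 and 4 returns 10,
and passing NULL returns -1. The hint changes nothing about the result, only
(possibly) how the compiler lays out the two paths.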

I was curious to see if this would actually help any, so I found a machine
that had GCC 3 installed (swift), compiled a version of mod_blog [1] with
profiling information, ran it, found a function that looked good to speed up,
added some calls to __builtin_expect(), reran the code and got a rather
encouraging result.

I then reran the code, and got a completely different result.

In fact, each time I run the code, the profiling information I get is nearly
useless—well, to a degree. For instance one run:

Table: Each sample counts as 0.01 seconds.
% time	cumulative seconds	self seconds	calls	self ms/call	total ms/call	name
------------------------------
100.00	0.01	0.01	119529	0.00	0.00	line_ioreq
0.00	0.01	0.00	141779	0.00	0.00	BufferIOCtl
0.00	0.01	0.00	60991	0.00	0.00	line_readchar
0.00	0.01	0.00	59747	0.00	0.00	ht_readchar

Then another run:

Table: Each sample counts as 0.01 seconds.
% time	cumulative seconds	self seconds	calls	self ms/call	total ms/call	name
------------------------------
33.33	0.01	0.01	119529	0.00	0.00	line_ioreq
33.33	0.02	0.01	60991	0.00	0.00	line_readchar
33.33	0.03	0.01	21200	0.00	0.00	ufh_write
0.00	0.03	0.00	141779	0.00	0.00	BufferIOCtl

Yet another run:

Table: Each sample counts as 0.01 seconds; no time accumulated.
% time	cumulative seconds	self seconds	calls	self ms/call	total ms/call	name
------------------------------
0.00	0.00	0.00	141779	0.00	0.00	BufferIOCtl
0.00	0.00	0.00	119529	0.00	0.00	line_ioreq
0.00	0.00	0.00	60991	0.00	0.00	line_readchar
0.00	0.00	0.00	59747	0.00	0.00	ht_readchar

And still another one:

Table: Each sample counts as 0.01 seconds.
% time	cumulative seconds	self seconds	calls	self ms/call	total ms/call	name
------------------------------
50.00	0.01	0.01	60991	0.00	0.00	line_readchar
50.00	0.02	0.01	1990	0.01	0.01	HtmlParseNext
0.00	0.02	0.00	141779	0.00	0.00	BufferIOCtl
0.00	0.02	0.00	119529	0.00	0.00	line_ioreq

Like I said, nearly useless. Sure, there are the usual suspects, like
BufferIOCtl() and line_ioreq(), but with each run accumulating at most three
0.01-second samples, it's impossible to say what improvements I'm getting by
doing this. And by today's standards, swift isn't a fast machine, being only
(only!) a 1.3GHz (gigahertz) Pentium III with half a gig of RAM (Random
Access Memory). I can only imagine how impossible profiling would be on a
faster machine, or what could even be profiled on one.

I have to wonder what the Linux guys are smoking to even think that, in the
grand scheme of things, __builtin_expect() will improve things all that
much.

Unless they have access to better profiling mechanics than I do.

Looks like I might have to find a slower machine to get a better feel for how
to improve the speed of the program.
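
Short of a slower machine, one possible workaround (a sketch under
assumptions, not something tried here) is to call the routine of interest
many times in a loop and divide the elapsed CPU time by the repetition count,
so the total climbs well above a single 0.01-second sample. The names work()
and seconds_per_call() are hypothetical, and work() merely stands in for
whatever function is being tuned.

```c
#include <time.h>

/* Hypothetical stand-in for the routine being tuned; it only   */
/* counts how often it was called.                              */
static unsigned long calls;

static void work(void)
{
  calls++;
}

/* Call fn() reps times and return the average CPU seconds per  */
/* call, or -1.0 if reps is zero or clock() is unavailable.     */
static double seconds_per_call(void (*fn)(void), unsigned long reps)
{
  clock_t       start;
  clock_t       end;
  unsigned long i;

  if (reps == 0)
    return -1.0;

  start = clock();
  for (i = 0 ; i < reps ; i++)
    fn();
  end = clock();

  if ((start == (clock_t)-1) || (end == (clock_t)-1))
    return -1.0;

  return ((double)(end - start) / CLOCKS_PER_SEC) / (double)reps;
}
```

Something like seconds_per_call(work, 1000000UL) would be run once before
and once after adding the __builtin_expect() calls. The numbers will still
wobble from run to run, and a real routine needs observable side effects so
the compiler doesn't optimize the loop away.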

[1] https://boston.conman.org/mod_blog.tar.gz

Email Sean Conner at sean@conman.org
