* * * * *

Is profiling even viable now?

Mark brought up (in email) an interesting optimization technique using GCC 3:

> I came across an interesting optimization that is GCC specific but quite
> clever.
>
> In lots of places in the Linux kernel you will see something like:
>
> > p = get_some_object();
> > if (unlikely(p == NULL))
> > {
> >   kill_random_process();
> >   return (ESOMETHING);
> > }
> >
> > do_stuff(p);
>
> The conditional is clearly an error path and as such means it is rarely
> taken. This is actually a macro defined like this:
>
> > #define unlikely(b) __builtin_expect(b, 0)
>
> On newer versions of GCC this tells the compiler to expect the condition
> not to be taken. You could also tell the compiler that the branch is likely
> to be taken:
>
> > #define likely(b) __builtin_expect(b, 1)
>
> So how does this help GCC anyhow? Well, on some architectures (PowerPC)
> there is actually a bit in the branch instruction to tell the CPU's
> speculative execution unit if the branch is likely to be taken. On other
> architectures it avoids conditional branches to make the “fast path” branch
> free (with -freorder-blocks).

I was curious to see if this would actually help any, so I found a machine that had GCC 3 installed (swift), compiled a version of mod_blog [1] with profiling information, ran it, found a function that looked good to speed up, added some calls to __builtin_expect(), reran the code and got a rather interesting result.

I then reran the code, and got a completely different result.

In fact, each time I run the code, the profiling information I get is nearly useless (well, to a degree). For instance, one run:

Table: Each sample counts as 0.01 seconds.

% time  cumulative seconds  self seconds   calls  self ms/call  total ms/call  name
------  ------------------  ------------  ------  ------------  -------------  -------------
100.00                0.01          0.01  119529          0.00           0.00  line_ioreq
  0.00                0.01          0.00  141779          0.00           0.00  BufferIOCtl
  0.00                0.01          0.00   60991          0.00           0.00  line_readchar
  0.00                0.01          0.00   59747          0.00           0.00  ht_readchar

Then another run:

Table: Each sample counts as 0.01 seconds.

% time  cumulative seconds  self seconds   calls  self ms/call  total ms/call  name
------  ------------------  ------------  ------  ------------  -------------  -------------
 33.33                0.01          0.01  119529          0.00           0.00  line_ioreq
 33.33                0.02          0.01   60991          0.00           0.00  line_readchar
 33.33                0.03          0.01   21200          0.00           0.00  ufh_write
  0.00                0.03          0.00  141779          0.00           0.00  BufferIOCtl

Yet another run:

Table: Each sample counts as 0.01 seconds. no time accumulated

% time  cumulative seconds  self seconds   calls  self ms/call  total ms/call  name
------  ------------------  ------------  ------  ------------  -------------  -------------
  0.00                0.00          0.00  141779          0.00           0.00  BufferIOCtl
  0.00                0.00          0.00  119529          0.00           0.00  line_ioreq
  0.00                0.00          0.00   60991          0.00           0.00  line_readchar
  0.00                0.00          0.00   59747          0.00           0.00  ht_readchar

And still another one:

Table: Each sample counts as 0.01 seconds.

% time  cumulative seconds  self seconds   calls  self ms/call  total ms/call  name
------  ------------------  ------------  ------  ------------  -------------  -------------
 50.00                0.01          0.01   60991          0.00           0.00  line_readchar
 50.00                0.02          0.01    1990          0.01           0.01  HtmlParseNext
  0.00                0.02          0.00  141779          0.00           0.00  BufferIOCtl
  0.00                0.02          0.00  119529          0.00           0.00  line_ioreq

Like I said, nearly useless. Sure, there are the usual suspects, like BufferIOCtl() and line_ioreq(), but it's impossible to say what improvements I'm getting by doing this. And by today's standards, swift isn't a fast machine, being only (only!) a 1.3GHz Pentium III with half a gig of RAM. I can only imagine the impossibility of profiling on a faster machine, or even imagine what could be profiled on a faster machine.

I have to wonder what the Linux guys are smoking to even think that, in the grand scheme of things, __builtin_expect() will improve things all that much.
Unless they have access to better profiling mechanics than I do.

Looks like I might have to find a slower machine to get a better feel for how to improve the speed of the program.

[1] https://boston.conman.org/mod_blog.tar.gz

Email Sean Conner at sean@conman.org.