Subj : Re: hyperthreading in database-benchmarks
To   : comp.arch,comp.programming.threads
From : Bill Todd
Date : Thu Oct 13 2005 03:31 am

David Kanter wrote:
> Bill Todd wrote:
> 
>>Oliver S. wrote:
>>
>>>Has anyone found information on how much hyperthreading is able to
>>>improve the performance of database-workloads (OLTP as well as DWH)?
>>
>>My recollection is that POWER5's SMT is said to give it something like a
>>35% boost in TPC-C, and Montecito's coarser-grained 'hyperthreading' is
>>said to provide less (more like 25%).  Those of course are both
>>dual-thread SMT implementations without any more execution units than
>>their non-SMT predecessors:  EV8's quad-thread implementation did (IIRC)
>>contain more execution units, was fine-grained, and was said to provide
>>over 2x (possibly as much as 3x - it's been a long time since I visited
>>the material) the TPC-C throughput that a non-SMT version would have
>>managed.

....

> It was estimated by Joel Emer at about a 225-230% boost

As a mathematician, you really ought to be more careful with your 
terminology (and this isn't the first time I've noticed that, which is 
why I'm commenting upon it):  the 'boost' you're describing is 125% - 130%.
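
To make the arithmetic concrete, here is a minimal Python sketch (the 
function names are mine and purely illustrative):  throughput quoted as a 
percentage *of* the baseline exceeds the boost *over* the baseline by 
exactly 100 points.

```python
def boost_percent(percent_of_baseline):
    """Gain over baseline, given throughput as a percentage of baseline."""
    return percent_of_baseline - 100.0


def percent_of_baseline(boost):
    """Inverse: throughput as a percentage of baseline, given the boost."""
    return boost + 100.0


if __name__ == "__main__":
    # The 225-230% figure read off the graph, taken as "percent of baseline".
    for relative in (225.0, 230.0):
        print(f"{relative:.0f}% of baseline is a {boost_percent(relative):.0f}% boost")
```

By the same conversion, the 35% boost recalled above for POWER5 
corresponds to 135% of the non-SMT throughput.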

>   (hard to tell
> with the graph and scale):
> 
> www.cs.washington.edu/research/smt/papers/compaqMF.ppt
> 
> This persuades me that a chip designed for SMT from the ground up can
> get quite a bit better than just 40%.

Well, even with the added execution units, the EV8 managed less than 70% 
in the 'TP' workload described when running only two threads (and did 
even worse in some of the other workloads when limited to two threads):  
the ability to support four concurrent threads (and to keep them 
reasonably well-supplied with resources) was its most significant advantage.

>    The real question is whether you
> are better off with CMP than a wide SMT...hard to say

Not really.  Even as cores continue to diminish in size, *some* level of 
SMT will remain desirable insofar as it allows one to put more execution 
units to good use, whether to enhance the performance of a single thread 
or of multiple concurrent threads within the single core.  That is, it 
provides a core which can handle a wider range of workloads closer to 
optimally than a static arrangement, which is either starved for 
execution units when servicing fewer demanding threads than there are 
cores, or leaves execution units idle even when far-less-demanding 
threads cover all the cores.
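
A toy model may make the point easier to see.  Everything in the sketch 
below is an assumption made up for illustration (the fixed budget of 
eight execution units, the per-thread 'demand' figures, the round-robin 
placement, and the function name), and it deliberately ignores caches, 
wire delays and synchronization costs:

```python
def units_used(demands, cores, units_per_core, threads_per_core):
    """Execution units kept busy under a toy model.

    demands: hypothetical number of units each thread could keep busy.
    Threads beyond the available hardware thread slots simply wait.
    """
    slots = cores * threads_per_core
    running = demands[:slots]
    per_core = [0] * cores
    for i, demand in enumerate(running):
        per_core[i % cores] += demand          # round-robin placement
    # A core can never use more units than it physically has.
    return sum(min(load, units_per_core) for load in per_core)


# Three ways to spend the same budget of 8 execution units.
CONFIGS = {
    "4 narrow cores, no SMT": (4, 2, 1),
    "2 wide cores, no SMT": (2, 4, 1),
    "1 wide core, 4-way SMT": (1, 8, 4),
}

WORKLOADS = {
    "two demanding threads": [4, 4],
    "four light threads": [2, 2, 2, 2],
}

if __name__ == "__main__":
    for wname, demands in WORKLOADS.items():
        for cname, (cores, units, threads) in CONFIGS.items():
            busy = units_used(demands, cores, units, threads)
            print(f"{wname:22s} | {cname:24s} | {busy} of 8 units busy")
```

In this model the narrow cores starve the two demanding threads, the 
wide cores leave half their units idle under the four light threads, and 
only the SMT arrangement keeps all eight units busy in both cases.  It 
says nothing about how large a single core can usefully become.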

So the *real* question is whether that's *enough* of an improvement to 
justify the added design effort (and relatively small additional 
physical overheads - at least as evidenced by current examples) involved 
- and if the answer is 'yes', then just what level of multi-threading 
within the multiple separate cores on a chip is ideal across the normal 
distribution of real-world workloads (one can't just suggest that SMT 
could eliminate *all* need for CMP since wire and synchronization delays 
within a single core are non-negligible factors which bound total core 
size even if the complexity of, say, supporting many dozens of 
concurrent threads did not).

EV8 may have occupied a unique moment in time when placing multiple 
relatively high-performance cores on a single chip was not yet quite 
feasible but when the level of performance desired from a single thread 
was not yet so limited by the 'memory wall' that single-thread 
performance had ceased to be so desirable - at least if you could find 
good uses for the many execution units at other times as well to make 
the chip more generally useful.  Even so, it should be some time yet 
before such considerations fade away completely (and if ever something 
pushes back that memory wall sufficiently they'll resurface).

- bill
