Subj : Re: hyperthreading in database-benchmarks
To   : comp.arch,comp.programming.threads
From : David Kanter
Date : Thu Oct 13 2005 01:12 pm

Bill Todd wrote:
> David Kanter wrote:
> > Bill Todd wrote:
> >
> >> Oliver S. wrote:
> >>
> >>> Has anyone found information on how much hyperthreading is able to
> >>> improve the performance of database workloads (OLTP as well as DWH)?
> >>
> >> My recollection is that POWER5's SMT is said to give it something like
> >> a 35% boost in TPC-C, and Montecito's coarser-grained 'hyperthreading'
> >> is said to provide less (more like 25%). Those of course are both
> >> dual-thread SMT implementations without any more execution units than
> >> their non-SMT predecessors: EV8's quad-thread implementation did (IIRC)
> >> contain more execution units, was fine-grained, and was said to provide
> >> over 2x (possibly as much as 3x - it's been a long time since I visited
> >> the material) the TPC-C throughput that a non-SMT version would have
> >> managed.
>
> ...
>
> > It was estimated by Joel Emer at about a 225-230% boost
>
> As a mathematician, you really ought to be more careful with your
> terminology (and this isn't the first time I've noticed that, which is
> why I'm commenting upon it): the 'boost' you're describing is 125% - 130%.

You're right, sorry about that; it was rather late. Yes, it is a 125-130%
boost over the non-SMT case (i.e. roughly 2.25x - 2.3x the non-SMT
throughput).

> > (hard to tell with the graph and scale):
> >
> > www.cs.washington.edu/research/smt/papers/compaqMF.ppt
> >
> > This persuades me that a chip designed for SMT from the ground up can
> > get quite a bit better than just 40%.
>
> Well, even with the added execution units when running only two threads
> the EV8 managed less than 70% in the 'TP' workload described (and did
> even worse in some of the other workloads when limited to two threads):
> the ability to support four concurrent threads (and to keep them
> reasonably well-supplied with resources) was its most significant
> advantage.
>
> > The real question is whether you are better off with CMP than a wide
> > SMT...hard to say
>
> Not really. Even as cores continue to diminish in size, *some* level of
> SMT will remain desirable insofar as it allows one to put to good use
> more execution units, whether to enhance the performance of a single
> thread or to enhance the performance of multiple concurrent threads
> within the single core (i.e., it provides a core which can handle a
> wider range of workloads more closely to optimally, rather than a static
> arrangement either starved for execution units when servicing a number
> of demanding threads lower than the number of cores, or leaving execution
> units idle even when a number of far-less-demanding threads covers all
> the cores).
>
> So the *real* question is whether that's *enough* of an improvement to
> justify the added design effort

That was part of my question/point. What I really want to know is this:
'doubling' performance using CMP usually doubles the die size (and a
little more), but it keeps each core the same size. What does it take to
double performance using SMT, and what does that do to overall die size
and core size? Core size is important because a bigger core means longer
wires, and hence extra pipeline stages, to drive data across the chip.

We know that adding the SMT machinery itself is very small and affordable,
even for a 4T design. However, how much would it cost in terms of
additional functional units, branch prediction mechanisms, etc., to take
today's existing two-threaded designs and provide a 70% boost?
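To make the comparison concrete, here is a toy back-of-envelope model of
throughput per unit area (just a sketch in Python; the core area, glue
overhead, core-growth and boost numbers in it are purely illustrative
assumptions, not measurements of any real design):

# Toy throughput-per-area comparison of CMP doubling vs. SMT widening.
# Every number below is an illustrative assumption, not a measurement.

def cmp_doubling(core_area, core_tput, glue_overhead=1.1):
    # Double performance by duplicating the core (CMP): roughly 2x the
    # area plus some interconnect/arbitration glue; throughput assumed
    # to scale close to linearly.
    return 2.0 * core_area * glue_overhead, 2.0 * core_tput

def smt_widening(core_area, core_tput, core_growth=1.5, smt_boost=1.7):
    # Chase the same goal by widening one core and adding SMT threads.
    # core_growth: assumed core-area growth for extra execution units,
    #              a larger register file, beefier branch prediction, etc.
    # smt_boost:   assumed multithreaded throughput gain (e.g. the ~70%
    #              two-thread figure quoted above).
    return core_area * core_growth, core_tput * smt_boost

base_area, base_tput = 100.0, 1.0   # hypothetical baseline core
for name, (area, tput) in (("CMP x2", cmp_doubling(base_area, base_tput)),
                           ("SMT-widened", smt_widening(base_area, base_tput))):
    print("%-12s area=%6.1f mm^2  throughput=%4.2f  tput/area=%6.4f"
          % (name, area, tput, tput / area))

The point isn't the particular numbers; it's that the CMP route keeps each
core (and hence its wire lengths) fixed, while the SMT route has to buy its
boost with a bigger core.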
I guess what I am trying to get at is: what are the costs to double
performance using SMT, compared to CMP? The biggest cost for SMT is
probably validation.

> (and relatively small additional physical overheads - at least as
> evidenced by current examples) involved - and if the answer is 'yes',
> then just what level of multi-threading within the multiple separate
> cores on a chip is ideal across the normal distribution of real-world
> workloads (one can't just suggest that SMT could eliminate *all* need
> for CMP, since wire and synchronization delays within a single core are
> non-negligible factors which bound total core size even if the
> complexity of, say, supporting many dozens of concurrent threads did
> not).

Precisely. We're on the same page.

> EV8 may have occupied a unique moment in time when placing multiple
> relatively high-performance cores on a single chip was not yet quite
> feasible, but when the level of performance desired from a single thread
> was not yet so limited by the 'memory wall' that single-thread
> performance had ceased to be so desirable - at least if you could find
> good uses for the many execution units at other times as well to make
> the chip more generally useful. Even so, it should be some time yet
> before such considerations fade away completely (and if ever something
> pushes back that memory wall sufficiently, they'll resurface).

Fundamentally, the issue stems from the memory wall, but the immediate
issue was heat and power.

David