Subj: Re: hyperthreading in database-benchmarks
To: comp.arch,comp.programming.threads
From: JJ
Date: Thu Oct 13 2005 02:45 pm

David Kanter wrote:
> Bill Todd wrote:

snipping

> 'Doubling' the performance using CMP usually doubles the die size (and
> a little more). However, it keeps the core at the same size.
>
> What does it take to double the performance using SMT? What does it do
> to overall die size and core size?
>
> Core size is important because a big core --> longer pipelines to drive
> data across the chip.
>
> We know adding SMT is very small and affordable, even for a 4T design.
> However, how much would it cost in terms of additional function units,
> branch prediction mechanisms, etc. etc. to take today's existing
> 2-threaded designs and provide a 70% boost?
>
> I guess what I am trying to get at is "What are the costs to double
> performance using SMT, compared to CMP?"

One way of looking at the problem is to see that any cache, even L1, is
itself a memory wall - even if it's only one to a few cycles - because
only one thread at a time is serialized through one SRAM. Now project a
relatively simple 4-way MTA that runs much faster than the usual complex
designs but pushes the L1 down to be slower and much bigger, or that uses
the L2 directly but massively interleaves it, driving all the banks
concurrently for all the threads in flight. Getting all the banks to work
concurrently is what makes the wall fall down, married to multiple PEs
designed to exploit the huge issue rates that banking enables.

> The biggest cost for SMT is probably validation.
> > (and relatively small additional
> > physical overheads - at least as evidenced by current examples) involved
> > - and if the answer is 'yes', then just what level of multi-threading
> > within the multiple separate cores on a chip is ideal across the normal
> > distribution of real-world workloads (one can't just suggest that SMT
> > could eliminate *all* need for CMP, since wire and synchronization delays
> > within a single core are non-negligible factors which bound total core
> > size even if the complexity of, say, supporting many dozens of
> > concurrent threads did not).
>
> Precisely. We're on the same page.
>
> > EV8 may have occupied a unique moment in time when placing multiple
> > relatively high-performance cores on a single chip was not yet quite
> > feasible but when the level of performance desired from a single thread
> > was not yet so limited by the 'memory wall' that single-thread
> > performance had ceased to be so desirable - at least if you could find
> > good uses for the many execution units at other times as well to make
> > the chip more generally useful. Even so, it should be some time yet
> > before such considerations fade away completely (and if ever something
> > pushes back that memory wall sufficiently, they'll resurface).
>
> To be first always seems to hurt badly :)
>
> Fundamentally, the issue stems from the memory wall, but immediately
> the issue was heat and power.

The memory wall fades away if the big fat SRAM is forced to deliver a lot
more concurrency, which can then be exploited by many MTA PEs. It does
require doing a few things that might seem unpalatable (randomizing
addresses, etc.). And no, it isn't black magic or 666 or any such voodoo
nonsense - it just requires some out-of-the-box thinking that is more
familiar in DSP terms.

More later.

John
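To make the "randomizing addresses" point concrete, here is a minimal sketch (my own illustration, not anything from the thread - bank count, line size, and the XOR hash are all assumptions) of why a hashed bank index spreads strided accesses across interleaved banks, whereas plain modulo interleaving can serialize every access onto one bank:

```python
# Hypothetical sketch of bank interleaving. NUM_BANKS and LINE_BYTES
# are illustrative values, not taken from any real design.
NUM_BANKS = 16
LINE_BYTES = 64

def bank_modulo(addr):
    """Naive interleave: the low bits of the line address pick the bank."""
    return (addr // LINE_BYTES) % NUM_BANKS

def bank_xor(addr):
    """XOR-hash two bit fields of the line address, so a power-of-two
    stride no longer maps every access to the same bank."""
    line = addr // LINE_BYTES
    return (line ^ (line >> 4)) % NUM_BANKS

# A stride of NUM_BANKS cache lines defeats plain modulo indexing:
stride = NUM_BANKS * LINE_BYTES
addrs = [i * stride for i in range(64)]

print(len({bank_modulo(a) for a in addrs}))  # 1  - all hits on one bank
print(len({bank_xor(a) for a in addrs}))     # 16 - spread over all banks
```

With modulo indexing the strided stream serializes through a single bank, so only one request completes per bank cycle; with the hashed index the same stream touches all the banks, which is what lets many threads' requests proceed concurrently.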