100x Defect Tolerance: How Cerebras Solved the Yield Problem

January 13, 2025
James Wang

Conventional wisdom in semiconductor manufacturing has long held that bigger chips mean worse yields. Yet at Cerebras, we've successfully built and commercialized a chip 50x larger than the largest computer chips - and achieved comparable yields. This seeming paradox is one of our most frequently asked questions: how do we achieve a usable yield with a wafer-scale processor? The answer lies in rethinking the relationship between chip size and fault tolerance.

This article will provide a detailed, apples-to-apples comparison of manufacturing yields between the Cerebras Wafer Scale Engine and an H100-sized chip, both manufactured at 5nm. By examining the interplay between defect rates, core size, and fault tolerance, we'll show how we achieve wafer-scale integration with equal or better yields vs. reticle-limited GPUs.

What determines yield

Like any manufacturing process, computer chips are prone to defects.
Larger chips are more likely to encounter defects, so as die area grows, yields fall exponentially. Even though larger chips generally run faster, early microprocessors were kept modest in size to maintain acceptable manufacturing yields and profit margins.

In the early 2000s, this started to change. As transistor budgets grew past 100 million, it became the norm to build processors with multiple independent cores per chip. Since all the cores were identical and independent, chip designers built in core-level fault tolerance so that if one core suffered a defect, the remaining cores could still operate. For example, in 2006 Intel released the Intel Core Duo - a chip with two CPU cores. If one core was faulty, it was disabled and the product was sold as an Intel Core Solo. Nvidia, AMD, and others embraced this core-level redundancy in the following years.

[100x-tolerance-table-01-scaled]

Today, fault tolerance is widely used in high-performance processors, and it's perfectly normal to sell chips with some cores disabled. AMD and Intel CPUs generally ship as a flagship version with all cores enabled and a lower-end version with a portion of cores disabled. Nvidia's data center GPUs are substantially larger than CPU dies, so even its flagship models have some portion of cores disabled.

Take the Nvidia H100 - a massive GPU weighing in at 814mm^2. Traditionally, a chip this size would be very difficult to yield economically. But since its cores (SMs) are fault tolerant, a manufacturing defect does not knock out the entire product. The chip physically has 144 SMs, but the commercialized product has only 132 SMs active. This means the chip can suffer defects across as many as 12 SMs and still be sold as a flagship part.

Defect tolerance is the key to yield

Traditionally, chip size directly dictated chip yields. In the modern era, yield is a function of both chip size and defect tolerance.
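The exponential relationship between die size and yield is commonly captured by a Poisson defect model: the probability that a die of area A escapes all defects at density D is e^(-D*A). A minimal sketch (real foundry yield models are more elaborate; the 0.001 defects/mm^2 density is the figure used later in this article):

```python
import math

# Poisson yield model: the probability that a die of area A (mm^2)
# contains zero defects at defect density D (defects/mm^2) is exp(-D*A).
def poisson_yield(area_mm2, defect_density=0.001):
    return math.exp(-defect_density * area_mm2)

for area in (100, 400, 814, 46_225):
    print(f"{area:>6} mm^2 -> {poisson_yield(area):8.1%} perfect-die yield")
```

An 814mm^2 die comes out defect-free less than half the time, and a 46,225mm^2 wafer-scale die essentially never - which is why fault tolerance, not defect-free silicon, is the only workable path to wafer scale.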
800mm^2 chips were once thought infeasible to commercialize due to yield, but with defect-tolerant design they are now mainstream products. The degree of defect tolerance can be measured by the amount of chip area lost when a defect occurs. For multi-core chips, this means the smaller the core, the greater the defect tolerance. If individual cores are small enough, it becomes possible to build a very large chip.

Wafer Scale Engine Cores

[100x-tolerance-wafer-scale-01]

At Cerebras, before committing to build a wafer-scale chip, we first designed a very small core. Each AI core in the Wafer Scale Engine 3 is approximately 0.05mm^2, or about 1% the size of an H100 SM. Both core designs are fault tolerant, but a defect in a WSE core disables 0.05mm^2 of silicon, while the same defect in an H100 disables ~6mm^2. To a first order of approximation, the Wafer Scale Engine is ~100x more fault tolerant than a GPU when considering the silicon area affected by each defect.

The Routing Architecture

[100x-tolerance-defects-01-scaled]

But small cores alone aren't enough. We developed a sophisticated routing architecture that allows us to dynamically reconfigure connections between cores. When a defect is detected, the system automatically routes around it using redundant communication pathways, preserving the chip's overall computational capability by leveraging nearby cores. This routing system works in concert with a small reserve of spare cores that can replace defective units. Unlike previous approaches that required massive redundancy overhead, our architecture achieves high yield with minimal spare capacity through intelligent routing.

A wafer scale walkthrough

Defect tolerance at a chip level is fairly clear.
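The spare-core idea can be illustrated with a toy model. The sketch below is purely hypothetical - the actual WSE fabric and its routing protocol are not public - but it shows the general principle: each row of cores carries a small number of spares, and logical core slots are mapped onto physical columns, skipping any column with a defect.

```python
# Toy model of defect remapping on one row of a 2D core grid.
# Hypothetical illustration only; not the actual Cerebras fabric design.

def build_row_map(defective_cols, n_physical, n_logical):
    """Map n_logical core slots onto n_physical columns, skipping defects.

    The difference n_physical - n_logical is the row's spare capacity.
    Raises if a row has more defects than spares.
    """
    mapping = []
    col = 0
    while len(mapping) < n_logical and col < n_physical:
        if col not in defective_cols:
            mapping.append(col)
        col += 1
    if len(mapping) < n_logical:
        raise RuntimeError("row has more defects than spare cores")
    return mapping

# A row with 10 physical cores, 9 logical slots, and a defect at column 3:
print(build_row_map({3}, n_physical=10, n_logical=9))
# logical slots use columns [0, 1, 2, 4, 5, 6, 7, 8, 9] - the defect is skipped
```

Because each defect only shifts the mapping within its own neighborhood, the spare capacity needed is tiny relative to the total core count - which is the point of fine-grained redundancy.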
Let's now compare how a traditional GPU and a wafer-scale chip would yield on TSMC's 5nm 300mm wafer:

[100x-GPUs-per-wafer-01]

On the left is an H100-like GPU: it is 814mm^2, it has 144 fault-tolerant cores, and a single 300mm wafer yields 72 full-die chips. On the right is the Cerebras Wafer Scale Engine 3. It's one giant square measuring 46,225mm^2, with 970,000 fault-tolerant cores. One wafer yields one chip.

[100x-tolerance-table-02-scaled]

At the 5nm node, TSMC's process reportedly has ~0.001 defects per mm^2. The 72 GPU dies have a total die area of 58,608mm^2. Applying this defect density, that area would see a total of 59 defects. For simplicity, let's assume each defect lands on a separate core. At 6.2mm^2 per core, this means 361mm^2 of die area would be lost to defects.

On the Cerebras side, the effective die size is a bit smaller at 46,225mm^2. Applying the same defect rate, the WSE-3 would see 46 defects. Each core is 0.05mm^2, so only 2.2mm^2 in total would be lost to defects. Measuring total area lost, the GPU loses 164x more silicon area than the Wafer Scale Engine on an apples-to-apples basis - same manufacturing node, same defect rate.

The above makes a high-level point but simplifies a few details. First, not all of the chip is occupied by compute cores. Caches, memory controllers, and the on-chip fabric take up a substantial amount of die area, perhaps up to 50%, though these components can be designed to be fault tolerant in their own way. An H100 SM is likely smaller than 6.2mm^2 once these components are accounted for, though not by an order of magnitude. Second, a cluster of defects could overwhelm fault-tolerant areas and disable the whole chip, so in practice even fault-tolerant chips will not yield close to 100%. These caveats aside, the general rule holds: smaller cores make for greater fault tolerance.
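The walkthrough's arithmetic can be sketched in a few lines. This uses the same assumptions as above (~0.001 defects/mm^2, one core lost per defect, each defect on a separate core); the unrounded results differ slightly from the article's figures, which round intermediate values to whole defects.

```python
# Expected silicon area lost to defects, under the walkthrough's
# assumptions: uniform defect density, one core lost per defect.

DEFECT_DENSITY = 0.001  # defects per mm^2 (reported figure for TSMC 5nm)

def silicon_lost(total_area_mm2, core_area_mm2, density=DEFECT_DENSITY):
    """Expected mm^2 of silicon lost: (expected defects) x (core area)."""
    expected_defects = total_area_mm2 * density
    return expected_defects * core_area_mm2

gpu_lost = silicon_lost(72 * 814, 6.2)   # 72 H100-sized dies per wafer
wse_lost = silicon_lost(46_225, 0.05)    # one WSE-3 per wafer

print(f"GPU wafer: {gpu_lost:.0f} mm^2 lost; WSE-3: {wse_lost:.1f} mm^2 lost")
print(f"ratio: ~{gpu_lost / wse_lost:.0f}x")
```

The unrounded ratio lands near 157x - the same order of magnitude as the article's 164x, and either way a two-orders-of-magnitude advantage for small cores.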
Putting Cerebras in the Table

[100x-tolerance-table-03-scaled]

Let's revisit the first table, now with the Cerebras Wafer Scale Engine added. Like Nvidia's data center GPUs, the WSE-3 is designed to be fault tolerant and disables a portion of its cores to manage yield. Because our cores are so tiny, the core count is far larger: 970,000 physical cores, with 900,000 active on our current shipping product. This provides tremendous, fine-grained defect tolerance. Despite having built the world's largest chip, we enable 93% of our silicon area - higher than the leading GPU today.

To summarize, Cerebras solved the wafer-scale manufacturing challenge by designing a small fault-tolerant core in combination with a fault-tolerant on-chip fabric. While total chip area increased by ~50x compared to conventional GPUs, we reduced individual core size by ~100x. As a result, defects are far less damaging to the WSE than to conventional multi-core processors. The third-generation WSE achieves 93% silicon utilization - the highest among leading AI accelerators - demonstrating that wafer-scale computing is not just possible, but commercially viable at scale.