[HN Gopher] 100x defect tolerance: How we solved the yield problem
___________________________________________________________________
100x defect tolerance: How we solved the yield problem
Author : jwan584
Score : 318 points
Date : 2025-01-15 21:19 UTC (1 day ago)
(HTM) web link (cerebras.ai)
(TXT) w3m dump (cerebras.ai)
| ChuckMcM wrote:
| I think this is an important step, but it skips over the fact
| that 'fault tolerant routing architecture' means you're
| spending die space on routes vs transistors. This is exactly
| analogous to using bits in your storage for error correction
| vs storing data.
|
| That said, I think they do a great job of exploiting this
| technique to create a "larger"[1] chip. And, like storage, it
| benefits from the fact that every core is the same and you
| don't need to get to every core directly (pin limiting).
|
| In the early 2000's I was looking at a wafer scale startup that
| had the same idea but they were applying it to an FPGA
| architecture rather than a set of tensor units for LLMs. Nearly
| the exact same pitch, "we don't have to have all of our GLUs[2]
| work because the built-in routing only uses the ones that are
| qualified." Xilinx was still aggressively suing people who put
| SERDES ports on FPGAs so they were pin limited overall but the
| idea is sound.
|
| While I continue to believe that many people are going to
| collectively lose trillions of dollars ultimately pursuing "AI"
| at this stage, I appreciate that the amount of money people are
| willing to put at risk here allows folks to try these "out of
| the box" kinds of ideas.
|
| [1] It is physically more cores on a single die but the overall
| system is likely smaller, given the integration here.
|
| [2] "Generic Logic Unit" which was kind of an extended LUT with
| some block RAM and register support.
| __Joker wrote:
| "While I continue to believe that many people are going to
| collectively lose trillions of dollars ultimately pursuing "AI"
| at this stage"
|
| Can you please explain more why you think so?
|
| Thank you.
| mschuster91 wrote:
| It's a hype cycle with many of the hypers and deciders having
| _zero_ idea about what AI actually is and how it works.
| ChatGPT, while amazing, is at its core a token predictor; it
| cannot ever get to an AGI level that you'd assume to be
| competitive to a human, even most animals.
|
| And just as every other hype cycle, this one will crash down
| hard. The crypto crashes were bad enough but at least gamers
| got some very cheap GPUs out of all the failed crypto farms
| back then, but this time so much more money, particularly
| institutional money, is flowing around AI that we're looking
| at a repeat of Lehman's once people wake up and realize
| they've been scammed.
| KronisLV wrote:
| > And just as every other hype cycle, this one will crash
| down hard.
|
| Isn't that an inherent problem with pretty much everything
| nowadays: crypto, blockchain, AI, even the likes of
| serverless and Kubernetes, or cloud and microservices in
| general.
|
| There's always some hype cycle where the people who are
| early benefit and a lot of people chasing the hype later
| lose when the reality of the actual limitations and the
| real non-inflated utility of each technology hits. And
| then, a while later, it all settles down.
|
| I don't think the current "AI" is special in any way, it's
| just that everyone tries to get rich (or benefit in other
| ways, as in the microservices example, where you still very
| much had a hype cycle) quick without caring about the
| actual details.
| anon373839 wrote:
| > I don't think the current "AI" is special in any way
|
| As someone who loves to pour ice water on AI hype, I have
| to say: you can't be serious.
|
| The current AI tech has opened up paths to develop
| applications that were impossible just a few years ago.
| Even if the tech freezes in place, I think it will yield
| substantial economic value in the coming years.
|
| It's very different from crypto, the main use case for
| which appears to be money laundering.
| carlmr wrote:
| >It's very different from crypto, the main use case for
| which appears to be money laundering.
|
| Which has substantial economic value (for certain groups
| of people).
| lazide wrote:
| According to this random estimate, the black market economy
| in the US alone is worth ~$2 trillion/yr.
| [https://www.investopedia.com/terms/u/underground-economy.asp]
|
| Roughly 11-12% of GDP.
|
| In many countries, black+grey market is larger than the
| 'white' market. The US is notoriously 'clean' compared to
| most (probably top 10).
|
| Even in the US, if you suddenly stopped 10-12% of GDP
| we're talking 'great depression' levels of economic pain.
|
| Honestly, the only reason Crypto isn't _bigger_ IMO is
| because there is such a large and established set of
| folks doing laundering in the 'normal' system, and those
| work well enough that there is not nearly as much demand
| as you'd expect.
| KronisLV wrote:
| > The current AI tech has opened up paths to develop
| applications that were impossible just a few years ago.
|
| My argument is that if it's advertised as a direct
| precursor to AGI based on wishful thinking and people
| don't know any better, then it's no different to claims
| about how putting blockchain technology in X industry
| will solve all of its problems.
|
| I use LLMs daily and don't scoff at AI generated imagery
| or use cases like agentic systems, but there absolutely
| is a similar hype cycle to every other innovation out
| there where people are borderline delusional in the
| initial stages (Kubernetes will solve all of our issues,
| moving to cloud and microservices will solve all of our
| issues, the blockchain will...), before the limitations
| crystallize and we know what each technology is good or
| bad at.
|
| Though maybe that says more about human nature than the
| technology itself.
|
| > It's very different from crypto, the main use case for
| which appears to be money laundering.
|
| That's akin to saying "The main use case for AI appears
| to be stealing people's art and even for writers and
| others it seems to be firing people to replace them with
| soulless AI generated slop."
|
| I'd even argue that there's nothing wrong with the
| technologies themselves, be it LLMs, AI for image, video,
| audio generation, blockchain and crypto, or whatever. The
| problems arise based on how the technologies are used, or
| in my argument above - how they're touted as the solution
| to all the problems. Some people profit a lot, others
| collide with reality and their limitations at speed.
|
| In other words, if the technology will generate 100
| billion USD of actual value but people are betting on 500
| billion USD, then clearly we have a bit of an issue.
| idiotsecant wrote:
| >the main use case for which appears to be money
| laundering.
|
| You say tomato, I say freedom from the tyranny of fiat
| power structures.
| carlhjerpe wrote:
| Both Kubernetes and serverless (FaaS) are here to stay.
| Microservices is just an excuse to build shit software.
| KronisLV wrote:
| > Both Kubernetes and serverless (FaaS) are here to stay.
|
| *in environments and projects where they are a good fit
|
| > Microservices is just an excuse to build shit software.
|
| *in environments and projects where they are a bad fit
| CamperBob2 wrote:
| _it cannot ever get to an AGI level that you'd assume to
| be competitive to a human, even most animals._
|
| Suppose you turn out to be wrong. What would convince you?
| ChuckMcM wrote:
| It could diagram a sentence it had never seen.
| karolist wrote:
| It does that all the time even now. Play with temp,
| top_k, min_p and nothing is preventing you from getting
| combinations and order of tokens previously not
| encountered in history.
| kragen wrote:
| I did eventually get an LLM to produce what seems to be a
| correct diagram of a sentence it had never seen, but it
| took about ten tries. Grammatical analysis seems to have
| happened correctly every time, but getting to a usable
| diagram was difficult.
|
| (I know that it's generally rude to include LLM output in
| HN comments, but in this case I think it's essential
| supporting material to elevate the discussion of LLM
| capabilities above "yes it is", "no it isn't".)
|
| First, I just asked ChatGPT (GPT-4o mini, what they offer
| anonymous free users):
|
| > Please diagram the sentence, "If my mother's road had
| been cultural, my uncle's sloth squad residue would have
| compiled a periodic meadow of night."
|
| I just invented that sentence, picking several words with
| a random number generator, but of course I can't prove to
| you that I didn't copy it from some book you haven't
| seen. (For that, you'd have to invent a sentence
| yourself, or at least provide me with some words to
| include in it.) I hope you can at least agree that it's a
| _highly unlikely_ sentence for anyone ever to have
| uttered before!
|
| After talking to itself for a few paragraphs, it arrived
| at the following grammatical analysis:
|     Step 4: Diagramming the sentence
|     Main Clause:
|         Subject: "my uncle's sloth squad residue"
|         Verb: "would have compiled"
|         Direct Object: "a periodic meadow of night"
|             "a periodic meadow" (noun phrase)
|             "of night" (prepositional phrase modifying "meadow")
|     Conditional Clause:
|         Subject: "my mother's road"
|         Verb: "had been" (linking verb)
|         Predicate Adjective: "cultural"
| This is exactly correct. From that it produced the
| following ASCII art "diagram", which looks like nonsense
| to me, but maybe it's in some kind of sentence-
| diagramming formalism I haven't seen:
| If ______/______ | |
| my mother's had been road |
| cultural my uncle's sloth squad
| residue | would have
| compiled | a periodic meadow
| of night | of
| night
|
| I told it the diagram looked like nonsense, and it
| produced this one, which looks worse:
| If _______/______ |
| | my mother's road had been |
| | cultural (Main Clause)
| | my uncle's sloth squad residue
| | would have compiled
| | a periodic meadow
| | of night
|
| I asked GPT-4 (paid) the same question, and it gave me
| another exactly correct grammatical analysis in words:
|     1. **Identify the main clauses**:
|        - "If my mother's road had been cultural" is a dependent
|          clause (specifically, a conditional clause).
|        - "my uncle's sloth squad residue would have compiled a
|          periodic meadow of night" is the main clause.
|     2. **Start with the main clause**:
|        - Subject: "my uncle's sloth squad residue"
|          - "my uncle's" modifies "sloth squad," which in turn
|            modifies "residue."
|        - Predicate: "would have compiled"
|          - This includes the modal auxiliary verb ("would"),
|            perfect auxiliary ("have"), and main verb ("compiled").
|        - Direct object: "a periodic meadow of night"
|          - "a" is an article modifying "meadow,"
|          - "periodic" modifies "meadow,"
|          - "of night" is a prepositional phrase modifying "meadow."
|     3. **Diagram the dependent clause**:
|        - Subject: "road"
|          - Modified by possessive noun phrase "my mother's"
|        - Predicate: "had been"
|          - Complement/Adjective: "cultural"
|
| (I feel like this is definitely "a level that you'd
| assume to be competitive to a human", in mschuster91's
| phrase; most native English speakers can't do that.)
|
| But then it came up with the following incorrect diagram:
| +---------------------------------------------------+
| | Main Clause |
| +---------------------------------------------------+
| / | \
| / | \
| / | \
| [My uncle's] [sloth squad] [residue]
| \ / /
| \ / /
| \ / /
| would have compiled -------------------
| \
| \
| \
| [a periodic meadow of night]
| / | \
| / | \
| a periodic [of night]
| +---------------------+ | Dependent Clause |
| +---------------------+ If |
| road ___|_____ | | my
| mother's had been cultural
|
| Telling it the diagram was nonsense didn't help.
|
| But my conclusion from this is not that GPT-4 can't
| analyze the grammatical structure of sentences it had
| never seen. It's that GPT-4 gets "confused" when it tries
| to draw ASCII art, like a person trying to tell you how
| to draw it over the telephone, but not quite as bad. I've
| seen similarly abominable results when it tried to draw
| circuit schematics and guitar tablatures as ASCII art. I
| tried asking it for a graphviz diagram, which was
| syntactically correct graphviz code but didn't express
| the sentence structure properly.
|
| I tried simplifying the problem with the prompt, 'Please
| provide a syntactic tree showing the syntactic derivation
| of the sentence, "My uncle's sloth squad residue would
| have compiled a periodic meadow of night," using graphviz
| syntax and with individual words at the leaf nodes of the
| tree.' which yielded a slightly incorrect diagram. (I'm
| still sure nobody has ever uttered this sentence before.)
| On being told, "This diagram is not quite right.", GPT-4
| produced the following diagram, which I believe is
| correct except for an extra "s" on "uncle" (you can see
| this in your browser at https://is.gd/Axufjx):
|     digraph G {
|         node [shape=plaintext];
|         S [label="S"];
|         NP [label="NP"];
|         VP [label="VP"];
|         S -> NP;
|         S -> VP;
|         // Noun Phrase breakdown
|         DP_my [label="DP"];
|         DP_my -> My;
|         UnclePossessive [label="NP"];
|         UnclePossessive -> Uncles;
|         UnclePossessive -> ApostropheS;
|         NP -> DP_my;
|         NP -> UncleCompound;
|         UncleCompound [label="NP"];
|         SlothSquadResidue [label="N'"];
|         SlothSquadResidue -> Sloth;
|         SlothSquadResidue -> SquadResidue;
|         SquadResidue [label="N'"]
|         SquadResidue -> Squad;
|         SquadResidue -> Residue;
|         UncleCompound -> UnclePossessive;
|         UncleCompound -> SlothSquadResidue;
|         // Verb Phrase breakdown
|         ModalWould [label="Modal"];
|         ModalWould -> Would;
|         AuxHave [label="Aux"];
|         AuxHave -> Have;
|         VCompiled [label="V"];
|         VCompiled -> Compiled;
|         ObjectNP [label="NP"];  // Direct object noun phrase
|         ArticleA [label="D"];
|         ArticleA -> A;
|         MeadowPhrase [label="N'"];  // 'Meadow' modified by 'periodic'
|         AdjPeriodic [label="Adj"];
|         AdjPeriodic -> Periodic;
|         MeadowWord [label=N];
|         MeadowWord -> Meadow;
|         MeadowPhrase -> AdjPeriodic;
|         MeadowPhrase -> MeadowWord;
|         PPofNight [label=PP];  // Prepositional phrase 'of night'
|         PrepOf [label=P]
|         PrepOf -> Of;
|         NightNounPhrase [label=N]
|         NightNounPhrase -> Night;
|         PPofNight -> PrepOf;
|         PPofNight -> NightNounPhrase;
|         ObjectNP -> ArticleA;
|         ObjectNP -> MeadowPhrase;
|         ObjectNP -> PPofNight;
|         VP -> ModalWould;
|         VP -> AuxHave;
|         VP -> VCompiled
|         VP -> ObjectNP
|     }
|
| Claude produced a reasonably serviceable diagram on the
| first try with that prompt:
|     digraph SyntacticTree {
|         node [shape=plaintext];
|         S -> NP1;
|         S -> VP;
|         NP1 -> DET1 [label="Det"];
|         NP1 -> NP2 [label="NP"];
|         NP2 -> NP3 [label="NP"];
|         NP2 -> N1 [label="N"];
|         NP3 -> N2 [label="N"];
|         NP3 -> POS [label="Pos"];
|         VP -> AUX1 [label="Aux"];
|         VP -> AUX2 [label="Aux"];
|         VP -> V [label="V"];
|         VP -> NP4 [label="NP"];
|         NP4 -> DET2 [label="Det"];
|         NP4 -> ADJ [label="Adj"];
|         NP4 -> N3 [label="N"];
|         NP4 -> PP [label="PP"];
|         PP -> P [label="P"];
|         PP -> NP5 [label="NP"];
|         NP5 -> N4 [label="N"];
|         DET1 -> "My";
|         N2 -> "uncle";
|         POS -> "'s";
|         N1 -> "sloth";
|         N1 -> "squad";
|         N1 -> "residue";
|         AUX1 -> "would";
|         AUX2 -> "have";
|         V -> "compiled";
|         DET2 -> "a";
|         ADJ -> "periodic";
|         N3 -> "meadow";
|         P -> "of";
|         N4 -> "night";
|     }
|
| On being told, I think incorrectly, "This diagram is not
| quite right.", it produced a worse diagram.
|
| So LLMs didn't perform nearly as well on this task as I
| thought they would, but they also performed much better
| than you thought they would.
| emkee wrote:
| Having only taken one syntax class for fun in college, I
| find this pretty impressive. Generating syntax trees was
| never a trivial task for me (but I was just a CS major
| who needed a credit). Slightly related, but I have also
| never had ChatGPT successfully generate ASCII art, even
| with extensive conversation.
| kragen wrote:
| Yeah, I think it would be very challenging for most
| people. It did considerably better with Graphviz than
| with ASCII art, but it still had trouble with the
| transition from a perfectly correct and highly nuanced
| verbal grammatical analysis to Graphviz. I think this is
| pretty convincing evidence against ChuckMcM's implicit
| position. It's weaker evidence against mschuster91's
| explicit position because parsing is something computers
| have been doing for a long time, so it doesn't imply any
| new capabilities.
|
| I'm pretty sure there are part-of-speech tagging parsers
| using fairly shallow statistics that could also have
| produced an equivalently good sentence diagram.
| https://corenlp.run/ seems to produce a correct parse,
| though in a different format.
| stonemetal12 wrote:
| If you ask it to draw a dinosaur, it does an okay
| brontosaurus. __
| / _) _.----._/ / / /
| __/ ( | ( | /__.-'|_|--|_|
|
| Asking for a Tyrannosaurus Rex gives you more or less the
| same brontosaurus: __
| / _) _.----._/ / / /
| __/ ( | ( | /__.-'|_|--|_|
| CamperBob2 wrote:
| _(I know that it's generally rude to include LLM output
| in HN comments, but in this case I think it's essential
| supporting material to elevate the discussion of LLM
| capabilities above "yes it is", "no it isn't".)_
|
| You just have to be prepared to take a karma hit for it.
| The audience here does not consist largely of 'hackers',
| but seems to skew toward the sort of fearful, resentful
| reactionaries that hacker culture traditionally opposes.
|
| I will say I wouldn't peg ChuckMcM as being one of the
| reactionaries, though. That would be an unpleasant
| surprise.
|
| As far as the diagram goes, my guess is that sentence
| diagrams were underrepresented in the training corpus.
| Diagramming sentences was already out of fashion when I
| was in school in the 1980s -- in fact, I don't recall
| ever having done it. The model is struggling much the
| same way you'd expect a grade-school student (or me, I
| guess) to struggle upon being asked to perform the task
| for the first time.
|
| Knowing when to say "I don't know how to do that" is
| still a foundational weakness of LLMs, but I don't expect
| it to remain unaddressed for long. We will see
| improvement in that area, sooner or later. The
| anklebiters will respond by moving their goalposts and
| hitting the downvote button as usual.
| kragen wrote:
| ASCII art Reed-Kellogg sentence diagrams are probably
| hard to find anywhere, and Graphviz can't really express
| Reed-Kellogg diagrams. But Reed and Kellogg published
| their somewhat ad-hoc diagram language in 01877, 78 years
| before what we now call "linguistics" was known in the
| West thanks to Chomsky's work in 01955. These are among
| the reasons I thought it might be a good idea to use the
| form of sentence diagrams used by linguists instead of
| the more compact Reed-Kellogg diagrams.
| CamperBob2 wrote:
| A language model can generate valid sentences that have
| never existed before -- not just syntactically or
| statistically valid sentences in a Markovian sense, but
| semantically valid ones. How could it possibly do that
| without an implicit awareness of all aspects of language,
| including grammar and parts of speech?
|
| That implicit awareness is indistinguishable from
| knowledge, at least to the extent it yields correct
| results.
|
| And the fact that it was inferred from examples rather
| than taught explicitly is indistinguishable from
| intelligence.
| dsign wrote:
| Those glorified token predictors are the missing piece in
| the puzzle of general intelligence. There is a long way to
| go still in putting all those pieces together, but I don't
| think any of the steps left are in the same order of "we
| need a miracle breakthrough".
|
| That said, I believe that this is going one of two ways: we
| use AI to make things materially harder for humans, in a
| scale from "you don't get this job" to "oops, this is
| Skynet", with many unpleasant stops in the middle. By the
| amount of money going into AI right now and most of the
| applications I'm seeing being hyped, I don't think we have
| any scruples with this direction.
|
| The other way this can go, and Cerebras is a good example,
| is that we increase our compute capability and our AI-
| usefulness to a point where we can fight cancer and
| stop/revert aging, both being computational problems at
| this point. But most people either don't realize it or
| have strong moral objections to this outcome and don't
| even want to talk about it, so it probably won't happen.
|
| In simpler words, I think we want to use AI to commit
| species suicide :-)
| Shorel wrote:
| I'm sure there are more missing pieces.
|
| We are more than Broca's areas. Our intelligence is much
| more than linguistic intelligence.
|
| However, and this is also an important point, we have
| built language models far more capable than any language
| model a single human brain can have.
|
| Makes me shudder in awe of what's going to happen when we
| add the missing pieces.
| idiotsecant wrote:
| Yes, I sometimes wonder if what we're witnessing in our
| lifetimes is the next stage of the 'bootstrapping' of
| life into a more complex form. If we might be the
| mitochondria contributing our little piece to the cell
| that comes after.
| immibis wrote:
| Why do you think that an AGI can't be a token predictor?
| mschuster91 wrote:
| Because an LLM _by definition_ cannot even do basic maths
| (well, except if you're OpenAI and cheat your way around
| it by detecting if the user asks a simple math question).
|
| I'd expect an actually "general" intelligence Thing to be
| able to be as versatile in intellectual tasks as a human
| is - and LLMs are reasonably decent at repetition, but
| cannot infer something completely new from the data they
| have.
| versteegen wrote:
| Define "by definition".
|
| Because this statement really makes no sense.
| Transformers are perfectly capable (and capable of
| perfectly) learning mathematical functions, given the
| necessary working-out space, e.g. for long division or
| for algebraic manipulation. And they can learn to
| generalise from their training data very well (although
| very data-inefficiently). That's their entire strength!
| dogcomplex wrote:
| Yet they can get silver medal PhD level competition math
| scores.
|
| Perhaps your "definition" should be simply that LLMs have
| temporarily seen limitations in their ability to natively
| do math unassisted by an external memory, but are
| _exceptionally_ good at _very advanced math_ when they
| can compensate for their lossy short-term attention
| memory...
| Shorel wrote:
| By analogy with human brains: Because our own brains are
| far more than the Broca's areas in them.
|
| Evolution selects for efficiency.
|
| If token prediction could work for everything, our brains
| would also do nothing else but token prediction. Even the
| brains of fishes and insects would work like that.
|
| The human brain has dedicated clusters of neurons for
| several different cognitive abilities, including face
| recognition, line detection, body parts self perception,
| 3D spatial orientation, and so on.
| myrmidon wrote:
| > Evolution selects for efficiency.
|
| I think this is a poor argument here. From an
| evolutionary point of view, our brains are optimized to:
|
| - Provide fine-motor control to craft weapons and tools
| (enhancing adaptability and enabling us to hunt way
| outside our weight class)
|
| - Communicate/coordinate effectively in small groups
|
| - Do sensor processing and the above with a low energy
| budget
|
| Our brains are *not* selected to be minimum-complexity
| intelligences, and a _lot_ of what our brain does is
| completely useless for AGI building (motor control,
| sensor processing, ...).
|
| Furthermore, the cost/complexity (from an evolutionary
| PoV) is a totally different beast from what complexity
| means to us.
|
| Just consider flight as an example: A fruitfly is an
| insanely simple and straightforward beast, but to us, a
| biochemically fuelled, beating-wing design is _still_
| infeasibly complicated. If our approach to flight had
| been to ape after how nature does it in detail, we likely
| _still_ would not have planes.
|
| I do agree that todays LLMs still have clear
| architectural flaws that we need to overcome (online
| learning being a very glaring one), but, to pick up the
| flight analogy, we might well have the main wing
| structure _already_ down, and we won't necessarily have
| to make those wings _beat_ to get into the air...
| Shorel wrote:
| Just because there are some parts of our brains that are
| not needed for an AGI...
|
| Doesn't mean that there aren't some parts of our brains
| that are needed for an AGI, and are not present in the
| current crop of LLMs.
| immibis wrote:
| What do our brains do that isn't token prediction?
|
| They receive information about photons and air vibrations
| and control muscles, okay. If a human brain was hooked up
| the way ChatGPT was, only to text input and output, would
| that make it not intelligent?
| Shorel wrote:
| > What do our brains do that isn't token prediction?
|
| I am planning a masters and phd on that question, so give
| me a few years to answer.
| Shorel wrote:
| While I basically agree with everything you say, I have to
| add some caveats:
|
| ChatGPT, while being as far from true AGI as the ELIZA
| chatbot written in Lisp, is extraordinarily more useful,
| and being used for many things that previously required
| humans to write the bullshit, like lobbying and propaganda.
|
| And Crypto... right now BTC is at a historical high. It
| could even go higher. And it will eventually crash again.
| It's the nature of that beast.
| idiotsecant wrote:
| All the big LLMs are no longer _just_ token predictors.
| They are beginning to incorporate memory, chain of thought,
| and other architectural tricks that use the token predictor
| in novel ways to produce some startlingly useful output.
|
| It's certainly the case that an LLM alone cannot achieve
| AGI. As a component of a larger system though? That remains
| to be seen. Maybe all we need to do is duct tape a limbic
| system and memory onto an LLM and the result is something
| sort of like an AGI.
|
| It's a little bit like saying that a ball bearing can't
| possibly ever be an internal combustion engine. While true,
| it's sidestepping the point a little bit.
| ChuckMcM wrote:
| I would guess you're not asking a serious question here but
| if you were feel free to contact me, it's why I put my email
| address in my profile.
| bigdict wrote:
| Why are you assuming bad faith?
| ChuckMcM wrote:
| What gave you the impression I was assuming bad faith?
| It's off topic to the discussion (which is fine) but can
| be annoying in the middle of an HN thread.
| bigdict wrote:
| > What gave you the impression I was assuming bad faith?
|
| You said "I would guess you're not asking a serious
| question here"
| ripped_britches wrote:
| It was a direct quote from your original comment
| bruce343434 wrote:
| You brought it up...
| ossopite wrote:
| Without offering any opinion on its merits, if you think
| justifying this controversial claim is off topic, then so
| is the claim and you shouldn't have written it.
| kragen wrote:
| You said, "I would guess you're not asking a serious
| question here," which is to say, you were guessing that
| the question was asked in bad faith. Or, at any rate, you
| _would_ , if for some reason the question came up, for
| example in deciding how to answer it. Which is what you
| were doing. That is to say, you _did_ guess that it was
| asked in bad faith. Given the minimal amount of evidence
| available (12 words and a nickname "__Joker") I think
| it's reasonable to describe that guess as an assumption.
| Ergo, you were assuming bad faith.
| __Joker wrote:
| Really sorry, if the question came as snarky or if
| otherwise. Those were not my intent.
|
| Related to AI given all around noise, really wanted to
| understand kind of contrarian view of monetary aspects.
|
| Once again, apologies if the question seems frivolous.
| enragedcacti wrote:
| Any thoughts on why they are disabling so many cores in their
| current product? I did some quick noodling based on the
| 46/970000 number and the only way I ended up close to 900,000
| was by assuming that an entire row or column would be disabled
| if any core within it was faulty. But doing that gave me a ~6%
| yield as most trials had active core counts in the high
| 800,000s
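The noodling above can be roughly reproduced with a short Monte Carlo sketch. Everything in it is an assumption for illustration: a ~985 x 985 grid (≈ 970,000 cores), an expected 46 faulty cores per wafer read off the 46/970,000 figure (modeled as Poisson), the whole-row-and-column disabling rule, and "yield" counted as the fraction of trials that still reach 900,000 active cores:

```python
import math
import random

GRID = 985            # assumed ~985 x 985 layout, about 970,225 cores
LAM = 46.0            # expected faulty cores per wafer (the 46/970,000 figure)
TARGET = 900_000      # cores enabled in the shipping part

def poisson(rng: random.Random, lam: float) -> int:
    """Knuth's Poisson sampler; fine for lam around 46."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        p *= rng.random()
        k += 1
    return k - 1

def active_cores(rng: random.Random, grid: int = GRID, lam: float = LAM) -> int:
    """Disable every row and column that contains a faulty core."""
    bad_rows, bad_cols = set(), set()
    for _ in range(poisson(rng, lam)):
        bad_rows.add(rng.randrange(grid))
        bad_cols.add(rng.randrange(grid))
    return (grid - len(bad_rows)) * (grid - len(bad_cols))

rng = random.Random(0)
counts = [active_cores(rng) for _ in range(2000)]
mean = sum(counts) / len(counts)
yield_frac = sum(c >= TARGET for c in counts) / len(counts)
print(f"mean active: {mean:,.0f}, fraction reaching {TARGET:,}: {yield_frac:.1%}")
```

Under these assumptions the mean active-core count does land in the high 800,000s, with well under a tenth of trials reaching 900,000, which is in the same ballpark as the ~6% figure above.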
| projektfu wrote:
| They did mention that they stash extra cores to enable the
| re-routing. Those extra cores are presumably unused when not
| routed in.
| enragedcacti wrote:
| That was my first thought but based on the rerouting
| graphic it seems like the extra cores would be one or two
| rows and columns around the border which would only account
| for ~4000 cores.
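For what it's worth, the ~4000 figure checks out arithmetically, assuming the same ~985-wide square grid:

```python
def border_cores(grid: int, rings: int) -> int:
    # Cores in the outermost `rings` rows and columns, each counted once:
    # total grid minus the interior square that remains.
    return grid * grid - (grid - 2 * rings) ** 2

# Assumed ~985 x 985 grid (about 970k cores):
print(border_cores(985, 1))   # one ring around the border -> 3936
print(border_cores(985, 2))   # two rings -> 7864
```

So one row/column ring is ~3,900 cores, matching the "~4000" estimate; two rings roughly doubles it.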
| projektfu wrote:
| If the system were broken down into more subdivisions
| internally, there would be more cores dedicated to
| replacement. It seems like it could be more difficult to
| reroute an entire row or column of cores on a wafer than
| a small block. Perhaps, also, they are building in heavy
| redundancy for POC and in the future will optimize the
| number of cores they expect to lose.
| ChuckMcM wrote:
| I could _guess_ that it helps with heat dissipation
| /management. But I don't know. That guess is from looking at
| the list of patents[1] they have.
|
| [1] https://patents.justia.com/assignee/cerebras-systems-inc
| girvo wrote:
| > Xilinx was still aggressively suing people who put SERDES
| ports on FPGAs
|
| This so isn't important to your overall point, but where would
| I begin to look into this? Sounds fascinating!
| nroize wrote:
| Not OP but I was curious too. Here's all I could find that
| seemed related:
| https://www.businesswire.com/news/home/20200121005582/en/Xil...
| ChuckMcM wrote:
| Well this was the patent they were threatening with as I
| recall (https://patents.google.com/patent/US20030023912A1/en)
| and there was this one too:
| https://patents.google.com/patent/US5576554A/en
|
| Basically the "secret sauce" of the startup recruiting me was
| that they were going to do wafer scale FPGAs that could be
| tiled together to build arbitrarily complex systems like
| military phased array radars and such. All very hush hush but
| apparently they had recruited some key talent from Xilinx
| which was annoying Xilinx.
| dogcomplex wrote:
| Of course many people are going to collectively lose trillions.
| AI is a very highly hyped industry, with people racing into it
| without an intellectual edge, and any temporary achievement by
| any one company will be quickly replicated and undercut by
| another using the same tools. Economic success of the
| individuals swarming on a new technology is not a guarantee
| whatsoever, nor is it an indicator of the impact of the
| technology.
|
| Just like the dotcom bubble, AI is gonna hit, make a few
| companies stinking rich, and make the vast majority (of both
| AI-chasing and legacy) companies bankrupt. And it's gonna
| rewire the way everything else operates too.
| idiotsecant wrote:
| >it's gonna rewire the way everything else operates too.
|
| This is the part that I think a lot of very tech literate
| people don't seem to get. I see people all the time
| essentially saying 'AI is just autocomplete' or pointing out
| that some vaporware ai company is a scam so surely everyone
| is.
|
| A lot of it is scams and flash in the pan. But a few of them
| are going to transform our lives in ways we probably don't
| even anticipate yet, for good and bad.
| Retric wrote:
| I'm not so sure it's going to even do that much. People are
| currently happy to use LLM's, but the outputs aren't
| accurate and don't seem to be improving quickly.
|
| A YouTuber I watch regularly includes questions they asked
| ChatGPT, and every single time there's a detailed response
| in the comments showing how the output is wildly wrong,
| with multiple mistakes.
|
| I suspect the backlash from disgruntled users is going to
| hit the industry hard and these models are still extremely
| expensive to keep updated.
| Thews wrote:
| Using function calls for correct answer lookup already
| practically eliminates this. It's not widespread yet,
| but doing it is already practical for many.
|
| New models aren't being trained specifically on single
| answers which will only help.
|
| The expense for the larger models is something to be
| concerned about. Small models with function calls is
| already great, especially if you narrow down what they
| are being used for. Not seeing their utility is just a
| lack of imagination.
| ithkuil wrote:
| Dollars are not lost; they are just very indirectly invested
| into GPU makers (and energy providers)
| wizzard0 wrote:
| this is an important reminder that all digital electronics is
| really analog but with good correction circuitry.
|
| and run-time CPU and memory error rates are always nonzero too,
| though orders of magnitude lower than manufacturing defect rates
| nine_k wrote:
| CPUs may be very digital inside, but DRAM and flash memory are
| _highly_ analog, especially MLC flash. DDR4 even has a
| dedicated training mode [1], during which DRAM and the memory
| controller learn the quirks of particular data lines and adjust
| to them, in order to communicate reliably.
|
| [1]: https://www.systemverilog.io/design/ddr4-initialization-
| and-...
| ajb wrote:
| So they massively reduce the area lost to defects per wafer, from
| 361 to 2.2 square mm. But from the figures in this blog, this is
| massively outweighed by the fact that they only get 46222 sq mm
| useable area out of the wafer, as opposed to 56247 that the H100
| gets - because they are using a single square die instead of
| filling the circular wafer with smaller square dies, they lose
| 10,025 sq mm!
|
| Not sure how that's a win.
|
| Unless the rest of the wafer is useable for some other customer?
| olejorgenb wrote:
| Is the wafer itself so expensive? I assume they don't pattern
| the unused area, so the process should be quicker?
| yannyu wrote:
| > I assume they don't pattern the unused area, so the process
| should be quicker?
|
| The primary driver of time and cost in the fabrication
| process is the number of layers for the wafers, not the
| surface area, since all wafers going through a given process
| are the same size. So you generally want to maximize the
| number of devices per wafer, because a large part of your
| costs will be calculated at the per-wafer level, not a per-
| device level.
| olejorgenb wrote:
| Yes, but my understanding is that the wafer is exposed in
| multiple steps, so there would still be fewer exposure
| steps? Probably insignificant compared to all the rest
| though. (Etching, moving the wafer, etc.)
|
| EDIT: to clarify - I mean the exposure of one single
| pattern/layer is done in multiple steps.
| (https://en.wikipedia.org/wiki/Photolithography#Projection)
| yannyu wrote:
| The number of exposure steps would be unrelated to the
| (surface area) size of die/device that you're making. In
| fact, in semiconductor manufacturing you're typically
| trying to maximize the number of devices per wafer
| because it costs the same to manufacture 1 device with 10
| layers vs 100 devices with 10 layers on the same wafer.
| This goes so far as to have companies or business units
| share wafers for prototyping runs so as to minimize cost
| per device (by maximizing output per wafer).
|
| Also, etching, moving, etc is all done on the entire
| wafer at the same time generally, via masks and baths.
| It's less of a pencil/stylus process, and more of a
| t-shirt silk-screening process.
| gpm wrote:
| > This goes so far as to have companies or business units
| share wafers for prototyping runs so as to minimize cost
| per device
|
| Can this be done in production? Is there a chance that
| the portion of the wafer cerebras.ai can't fit their
| giant square in is being used for production of some
| other companies chips?
| mattashii wrote:
| Yes, but isn't a big driver of layer costs the cost of the
| machines to build those layers?
|
| For patterning, a single iteration could be (example
| values, no actual values used, probably only ballpark
| accuracy) on a 300M$ EUV machine with 5-year write off
| cycle, patterns on average 180 full wafers /hour. Excluding
| energy usage and service time, each wafer that needs full
| patterning would cost ~38$. If each wafer only needed half
| the area patterned, the lithography machine might only
| spend half its usual time on such a wafer, and that could
| double the throughput of the EUV machine, halving the
| write-off based cost component of such a patterning step.
|
| Given that each layer generally consists of multiple
| patterning steps, a 10-20% reduction in those steps could
| give a meaningful reduction in time spent in the machines
| whose time spent on the wafer depends on the used wafer
| area.
|
| This of course doesn't help reduce time in polishing or
| etching (and other steps that happen with whole wafers at a
| time), so it won't be as straightforward as % reduction in
| wafer area usage == % reduction in cost, but I wouldn't be
| surprised if it was a meaningful percentage.
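The ballpark above can be sanity-checked in a few lines. All figures are the comment's own example values (a $300M EUV machine, 5-year write-off, 180 wafers/hour), not real fab numbers:

```python
# Back-of-envelope check of the example values in the comment above.
machine_cost_usd = 300e6          # hypothetical EUV machine price
write_off_years = 5               # hypothetical depreciation period
wafers_per_hour = 180             # hypothetical full-wafer throughput

hours = write_off_years * 365 * 24              # ~43,800 machine-hours
cost_per_wafer = machine_cost_usd / (hours * wafers_per_hour)

print(f"${cost_per_wafer:.2f} per fully patterned wafer")   # ~$38
# If half the area patterned really doubled throughput, this one
# cost component would halve:
print(f"${cost_per_wafer / 2:.2f} per half-patterned wafer")
```

This only checks the write-off component; as the comment notes, deposition, etching, and other whole-wafer steps wouldn't scale down the same way.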
| yannyu wrote:
| > Yes, but isn't a big driver of layer costs the cost of
| the machines to build those layers?
|
| Let's say the time spent in lithography step is linear
| the way you're describing. Even with that, the deposition
| step beforehand is surface area independent and would be
| applied across the entire wafer, and takes just as long
| if not longer than the lithography.
|
| Additionally, if you were going to build a fab ground up
| for some specific purpose, then you might optimize the
| fab for those specific devices as you lay out. But most
| of these companies are not doing that and are simply
| going through TSMC or a similar subcontractor. So you've
| got an additional question of how far TSMC will go to
| accommodate customers who only want to use half a wafer,
| and whether that's the kind of project they could
| profitably cater to.
| ajb wrote:
| Good question. I think the wafer has a cost per area which is
| fairly significant, but I don't have any figures. There has
| historically been a push to utilise them more efficiently, eg
| by building fabs that can process larger wafers. Although
| mask exposure would be per processed area, I think that there
| are also some proportion of processing time which is per
| wafer, so the unprocessed area would have an opportunity cost
| relating to that.
| pulvinar wrote:
| There's also no reason they couldn't pattern that area with
| some other suitable commodity chips. Like how sawmills and
| butchers put all cuts to use.
| sitkack wrote:
| Often those areas are used for test chips and structures
| for the next version. They are effectively free, so you can
| use them to test out ideas.
| georgeburdell wrote:
| They probably pattern at least next nearest neighbors for
| local uniformity. That's just litho though. The rest of the
| process is done all at once on the wafer
| kristjansson wrote:
| AIUI Wafer marginal cost is lower than you'd expect. I had
| $50k in my head, quick google indicates[1] maybe <$20k at
| AAPL volumes? Regardless seems like the economics for
| Cerebras would strongly favor yield over wafer area
| utilization.
|
| [1] https://www.tomshardware.com/tech-industry/tsmcs-wafer-
| prici...
| addaon wrote:
| > I assume they don't pattern the unused area
|
| I'm out of date on this stuff, so it's possible things have
| changed, but I wouldn't make that assumption. It is (used to
| be?) standard to pattern the entire wafer, with partially-
| off-the-wafer dice around the edges of the circle. The reason
| for this is that etching behavior depends heavily on the
| surrounding area -- the amount of silicon or copper or whatever
| etched in your neighborhood affects the speed of etching for
| you, which affects line width, and (for a single mask used
| for the whole wafer) thus either means you need to have more
| margin on your parameters (equivalent to running on an old
| process) or have a higher defect rate near the edge of the
| die (which you do anyway, since you can only take "similar
| neighborhood" so far). This goes as far as, for hyper-
| optimized things like SRAM arrays, leaving an unused row and
| column at each border of the array.
| kurthr wrote:
| All the process steps are limited by wafers per hour.
| Lithography (esp EUV) might be slightly faster, but that's
| not 30% of total steps, since you generally have deposit
| and etch/implant for every lithography step.
|
| It's close to a dead loss in process cost.
| Scaevolus wrote:
| Why does their chip have to be rectangular, anyways? Couldn't
| they cut out a (blocky) circle too?
| guyzero wrote:
| I've never cut a wafer, but I assume cutting is hard and
| single straight lines are the easiest.
| sroussey wrote:
| I wonder if you could... just not cut the wafer at all??
| axus wrote:
| Might be jumping in without reading, but the chips you
| cut out of the wafer have to be delivered to physically
| different locations.
| ajb wrote:
| Normally yes. But they're using a whole wafer for a
| single chip! So it's actually a good idea.
|
| I guess the issue is how do you design your routing
| fabric to work in the edge regions.
|
| Actually I wonder how they are exposing this wafer.
| Normal chips are exposed in a rectangular batch called a
| reticle. The reticle mask has repeated patterns across
| it, and it is then exposed repeatedly across the wafer.
| So either they have to make a reticle mask the full size
| of the wafer, which sounds expensive, or they somehow
| have to precisely align reticle exposures so that the
| joined edges form valid circuits.
| daedrdev wrote:
| That's the idea in the article. Just one big chip. But
| the reason chips are normally cut is that there is a
| pretty high defect rate, so with cutting, even if every
| wafer has 1-2 defects you still get (X-1.5) devices per
| wafer. In the article they go into how they avoid this
| problem (I think it's better fault tolerance, at a cost)
| gpm wrote:
| The article shows them using a single maximally sized
| square portion of a circular wafer.
|
| I think the proposal you're responding to is "just use
| the whole circular wafer without cutting out a square".
| ryao wrote:
| I suspect this would cause alignment issues since you
| could literally rotate it into the wrong position when
| doing soldering. That said, perhaps they could get away
| with cutting less and using more.
| Dylan16807 wrote:
| If you want to have nice straight edges to clamp into
| place, then you only need to shave off four slivers. You
| can lose a couple percent instead of more than a third.
| CorrectHorseBat wrote:
| They already have a notch or flat for alignment, which is
| much more critical during the lithography process than
| during soldering.
| sharpesttool wrote:
| You just need a sharpie to mark the top.
| nine_k wrote:
| Rather, I wonder why they even need to cut off the extra space,
| instead of putting something there. I suppose that the
| structure of the device is highly rectangular from the
| logical PoV, so there's nothing useful to put there. I
| suspect smaller unrelated chips can be produced on these
| areas along the way.
| yannyu wrote:
| The cost driver for fabbing out wafers is the number of
| layers and the number of usable devices per wafer. Higher
| layer count increases cost and tends to decrease yield, and
| more robust designs with higher yields increase usable
| devices per wafer. If circles or other shapes could help with
| either of those, they would likely be used. Generally the end
| goal is to have the most usable devices per wafer, so they'll
| be packed as tightly as possible on the wafer so as to have
| the highest potential output.
| Scaevolus wrote:
| Right, but they're making just _one_ usable device per
| wafer already.
| Qwertious wrote:
| You need a rectilinear polygon that tessellates, and has the
| fewest sides possible to minimize the number of cuts
| necessary. And it would probably help the cutting if the
| shape is entirely convex, so that cuts can overshoot a bit
| without damaging anything.
|
| That suggests a rectangle is the only possible shape.
| CorrectHorseBat wrote:
| If it's just one chip per wafer, why even bother cutting?
| timerol wrote:
| Why does it need to tessellate if there's only one chip per
| wafer?
| nine_k wrote:
| It's a win because they have to test one chip, and don't have
| to spend resources on connecting the chiplets. The latter costs
| a lot (though it has other advantages). I suspect that a
| chiplet-based device with total 900k cores would just be not
| viable due to the size constraints.
|
| If their routing around the defects is automated enough (given
| the highly regular structure), it may be a massive economy of
| effort in testing and packaging the chip.
| sroussey wrote:
| It's a win if you can use the wafer as opposed to throwing it
| away.
| kristjansson wrote:
| A win is a manufacturing process that results in a
| functioning product. Wafers, etc. aren't so scarce as to
| demand every mm2 be used on every one every time.
| kristjansson wrote:
| Additional wafer area would be a marginal increase in
| performance (+~20% core count best case) but increases the
| complexity of their design, and requires they figure out how to
| package/connect/house/etc. a non-standard shape. A wafer scale
| chip is already a huge tech risk, why spend more novelty budget
| on nonessential weirdness?
| ungreased0675 wrote:
| Why does it have to be a square? There's no need to worry about
| interchangeable third-party heat sink compatibility. Is it
| possible to make it an irregular polygon instead of square?
| bee_rider wrote:
| > Second, a cluster of defects could overwhelm fault tolerant
| areas and disable the whole chip.
|
| That's an interesting point. In architecture class (which was
| basic and abstract so I'm sure Cerebras is doing something much
| more clever), we learned that defects cluster, but this is a good
| thing. A bunch of defects clustering on one core takes out the
| core, a bunch of defects not clustering could take out... a bunch
| of cores, maybe rendering the whole chip useless.
|
| I wonder why they don't like clustering. I could imagine in a
| network of little cores, maybe enough defects clustered on the
| network could... sort of overwhelm it, maybe?
|
| Also I wonder how much they benefit from being on one giant
| wafer. It is definitely cool as hell. But could chiplets eat away
| at their advantage?
| IshKebab wrote:
| TSMC also have a manufacturing process used by Tesla's Dojo where
| you can cut up the chips, throw away the defective ones, and then
| reassemble working ones into a sort of wafer scale device (5x5
| chips for Dojo). Seems like a more logical design to me.
| mhh__ wrote:
| Amazing. I clicked a button in the azure deployment menu
| today...
| ryao wrote:
| I had been under the impression that Nvidia had done something
| similar here, but they did not talk about deploying the space
| saving design and instead only talked about the server rack
| where all of the chips on the mega wafer normally are.
|
| https://www.sportskeeda.com/gaming-tech/what-nvlink72-nvidia...
| wmf wrote:
| That shield is just a prop that looks nothing like the real
| product. The NVL72 rack doesn't use any wafer-scale-like
| packaging.
| ryao wrote:
| It would be nice if they made it real. The cost savings
| from not needing so much material should be fantastic.
| bee_rider wrote:
| Is this similar to a chiplet design? Chiplets have been a thing
| for a while, so I assume Cerebras avoided them on purpose.
| IshKebab wrote:
| I don't think so - chiplets are much smaller and I think the
| process is different.
| iataiatax10 wrote:
| That they found a solution to the yield problem is not
| surprising. Maybe they could elaborate more on the power
| distribution and dissipation problem?
| highfrequency wrote:
| To summarize: localize defect contamination to a very small unit
| size, by making the cores tiny and redundant.
|
| Analogous to a conglomerate wrapping each business vertical in a
| limited liability veil so that lawsuits and bankruptcy do not
| bring down the whole company. The smaller the subsidiaries, the
| less defect contamination but also the less scope for
| frictionless resource and information sharing.
| exabrial wrote:
| I have a dumb question. Why isn't silicon sold in cubes instead
| of cylinders?
| bigmattystyles wrote:
| no matter how you orient a circle on a plane, it's the same
| amelius wrote:
| The silicon ingots have a rotating production process that
| results in cylinders, not bricks.
| exabrial wrote:
| fascinating, I figured it was something like that. maybe we
| should produce hexagonal, instead of square, chip designs
| kryptiskt wrote:
| Crystalline silicon is produced with the Czochralski process
| (https://en.wikipedia.org/wiki/Czochralski_method), which
| produces a round ingot. So you'd have to cut away perfectly
| fine silicon to make something squarish.
| NickHoff wrote:
| Neat. What about power density?
|
| An H100 has a TDP of 700 watts (for the SXM5 version). With a die
| size of 814 mm^2 that's 0.86 W/mm^2. If the cerebras chip has the
| same power density, that means a Cerebras TDP of 39.8 kW.
|
| That's a lot. Let's say you cover the whole die area of the chip
| with water 1 cm deep. How long would it take to boil the water
| starting from room temperature (20 degrees C)?
|
| amount of water = (die area of 46225 mm^2) * (1 cm deep) *
| (density of water) = 462 grams
|
| energy needed = (specific heat of water) * (80 kelvin difference)
| * (462 grams) = 154 kJ
|
| time = 154 kJ / 39.8 kW = 3.9 seconds
|
| This thing will boil (!) a centimeter of water in 4 seconds. A
| typical consumer water cooler radiator would reduce the
| temperature of the coolant water by only 10-15 C relative to
| ambient, and wouldn't like it (I presume) if you pass in boiling
| water. To use water cooling you'd need some extreme flow rate and
| a big rack of radiators, right? I don't really know. I'm not even
| sure if that would work. How do you cool a chip at this power
| density?
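The parent's arithmetic checks out; here it is as a quick sketch, using the H100 figures from the comment and standard water properties (this is a thought experiment about power density, not how the CS-3 is actually cooled):

```python
# Apply the H100's power density to the WSE die area, then see how
# long that power takes to heat a 1 cm layer of water to boiling.
h100_tdp_w = 700.0          # H100 SXM5 TDP from the comment
h100_area_mm2 = 814.0       # H100 die size
wse_area_mm2 = 46225.0      # WSE-3 die size

power_density = h100_tdp_w / h100_area_mm2       # ~0.86 W/mm^2
wse_power_w = power_density * wse_area_mm2       # ~39.8 kW

water_mass_g = wse_area_mm2 * 10 * 1e-3          # 1 cm deep, 1 g/cm^3 -> ~462 g
energy_j = 4.186 * (100 - 20) * water_mass_g     # specific heat * delta-T * mass
time_s = energy_j / wse_power_w

print(f"{wse_power_w / 1e3:.1f} kW, {time_s:.1f} s to reach boiling")
# prints "39.8 kW, 3.9 s to reach boiling"
```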
| lostlogin wrote:
| If rack mounted, you are ending up with something like a
| reverse power station.
|
| So why not use it as an energy source? Spin a turbine.
| sebzim4500 wrote:
| If my very stale physics is accurate then even with perfect
| thermodynamic efficiency you would only recover about a third
| of the energy that you put into the chips.
| dylan604 wrote:
| 1/3 > 0, so even if you don't get a $0 energy bill I'd
| venture that any company that could get 1/3 of energy bill
| would be happy
| bentcorner wrote:
| I'm aware of the efficiency losses but I think it would be
| amusing to use that turbine to help power the machine
| generating the heat.
| twic wrote:
| Hey, we're building artificial general intelligence, what's
| a little perpetual motion on the side?
| kristjansson wrote:
| If you let the chip actual boil enough water to run a turbine
| you're going to have a hard time keeping the magic smoke
| inside. Much better to run at reasonable temps and try to
| recover energy from the waste heat.
| ericye16 wrote:
| What if you chose a refrigerant with a lower boiling point?
| kristjansson wrote:
| That's basically the principle of binary cycle[1]
| generators. However for data center waste heat recovery,
| I'd think you'd want to use a more stable fluid for
| cooling, and then pump it to a separate closed-loop
| binary-cycle generator. No reason to make your datacenter
| cooling system also deal with high pressure fluids, and
| moving high pressure working fluid from 1000s of chips to
| a turbine of sufficient size, etc.
|
| [1]: https://en.wikipedia.org/wiki/Binary_cycle
| renhanxue wrote:
| There's a bunch of places in Europe that use waste heat from
| datacenters in district heating systems. Same thing with
| waste heat from various industrial processes. It's relatively
| common practice.
| buildbot wrote:
| A Very Fancy cooling engine: https://www.eetimes.com/powering-
| and-cooling-a-wafer-scale-d...
| jwan584 wrote:
| A good talk on how Cerebras does power & cooling (8min)
| https://www.youtube.com/watch?v=wSptSOcO6Vw&ab_channel=Appli...
| flopsamjetsam wrote:
| Minor correction, the keynote video says ~20 kW
| throwup238 wrote:
| The machine that actually holds one of their wafers is almost
| as impressive as the chip itself. Tons of water cooling
| channels and other interesting hardware for cooling.
| Paul_Clayton wrote:
| The enthalpy of vaporization of water (at standard pressure) is
| listed by Wikipedia[1] as 2.257 kJ/g, so _boiling_ 462 grams
| would require an additional 1.04 MJ, adding 26 seconds.
| Cerebras claims a "peak sustained system power of 23kW" for
| the CS-3 16 Rack Unit system[2], so clearly the power density
| is lower than for an H100.
|
| [1]
| https://en.wikipedia.org/wiki/Enthalpy_of_vaporization#Other...
| [2] https://cerebras.ai/product-system/
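Adding the latent heat the parent mentions on top of the earlier heating estimate (same hypothetical H100-level power density, which the CS-3's quoted 23 kW shows is too high in practice):

```python
# Extra energy and time to actually boil off the 462 g of water,
# beyond just heating it to 100 C.
mass_g = 462.25              # 46,225 mm^2 * 1 cm of water
power_w = 0.86 * 46225       # hypothetical H100-level density, ~39.8 kW
latent_j = 2257 * mass_g     # enthalpy of vaporization ~2257 J/g

extra_s = latent_j / power_w
print(f"{latent_j / 1e6:.2f} MJ extra, {extra_s:.0f} s more")
# prints "1.04 MJ extra, 26 s more"
```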
| twic wrote:
| On a tangent: has anyone built an active cooling system which
| operates in a partial vacuum? At half atmospheric pressure,
| water boils at around 80 C, which i believe is roughly the
| operating temperature for a hard-working chip. You could pump
| water onto the chip, have it vapourise, taking away all that
| heat, then take the vapour away and condense it at the fan
| end.
|
| This is how heat pipes work, i believe, but heat pipes aren't
| pumped, they rely entirely on heat-driven flow. I would have
| thought there were pumped heat pipes. Are they called
| something else?
|
| It's also not a refrigerator, because those use a pump to
| pressurise the coolant in its gas phase, whereas here you
| would only be pumping the water.
| pants2 wrote:
| No need to bother with a partial vacuum when ethanol boils
| at around 80 C as well and doesn't destroy electronics. I'm
| not aware of any active cooling systems utilizing this
| though.
| ddxxdd wrote:
| I could argue that ethanol has 1/3 the latent heat of
| vaporization of water, and would boil off 3 times
| quicker. However, what ultimately matters is the rate of
| heat transfer, so my nitpick may be irrelevant.
| pezezin wrote:
| May I introduce you to the glorious vodka cooled PC?
| https://www.youtube.com/watch?v=IYTJfLyo_vE
| TehCorwiz wrote:
| I found this review from 2019 of mechanically pumped heat
| pipe technologies. I skimmed the intro. Looks like it
| already has a foothold in aerospace.
|
| https://www.sciencedirect.com/science/article/abs/pii/S1359
| 4...
| Dylan16807 wrote:
| > This is how heat pipes work, i believe, but heat pipes
| aren't pumped, they rely entirely on heat-driven flow. I
| would have thought there were pumped heat pipes.
|
| Do you have a particular benefit in mind that a pump would
| help with?
| bigmattystyles wrote:
| When I was a kid, I used to get intel keychains with a die in
| acrylic - good job to whoever thought of that to sell the fully
| defective chips.
| dylan604 wrote:
| wow, fancy with the acrylic. lots of places just place a chip
| (I'm more familiar with RAM sticks) on a keychain and call it a
| day.
| bigmattystyles wrote:
| they're all over eBay, I just checked - the one I was
| thinking of, that I think I had is going for $150 - the
| things you get rid of....
| bradyd wrote:
| Electronic Goldmine sells entire scrapped 200mm wafers for
| $15 or less
|
| https://theelectronicgoldmine.com/search?options%5Bprefix%5
| D...
| kragen wrote:
| Those aren't just a chip; they're an epoxy package with a
| leadframe and a chip inside it. To put just a chip on a
| keychain, you'd have to drill a hole through it, which is
| difficult because silicon is so brittle--almost like drilling
| a hole in glass. Then, when someone put it onto a keyring,
| the keyring would form a lever that applies a massive force
| to the edge of the brittle hole, shattering the brittle
| silicon. Potting the chip in acrylic resin is a much cheaper
| solution that works better.
| Neywiny wrote:
| Understanding that there's inherent bias by them being
| competitors of the other companies, but still this article seems
| to make some stretches. If you told me you had an 8% core defect
| rate reduced 100x, I'd assume you got to close to 99% enablement.
| The table at the end shows... Otherwise.
|
| They also keep flipping between cores, SMs, dies, and maybe other
| block sizes. At the end of the day I'm not very impressed. They
| seemingly have marginally better yields despite all that effort.
| sfink wrote:
| I think you're missing the point. The comparison is not between
| 93% and 92%. The comparison is between what they're getting
| (93%) and what you'd get if you scaled up the usual process to
| the core size they're using (0%). They are doing something
| different (namely: a ~whole wafer chip) that isn't possible
| without massively boosting the intra-chip redundancy. (The
| usual process stops working once you no longer have any extra
| dies to discard.)
|
| > Despite having built the world's largest chip, we enable 93%
| of our silicon area, which is higher than the leading GPU
| today.
|
| The important part is building the largest chip. The icing on
| the top is that the enablement is not lower. Which it would be
| without the routing-to-spare-cores magic sauce.
|
| And the differing terminology is because they're talking about
| differing things? You could call an SM a core, but it kind of
| contains (heterogeneous) cores itself. (I've no idea whether
| intra-SM cores can be redundant to boost yield.) A die is the
| part you break off and build a computer out of, it may contain
| a bunch of cores, a wafer can be broken up into multiple dies
| but for Cerebras it isn't.
|
| If NVIDIA were to go and build a whole-wafer die, they'd do
| something similar. But Cerebras did it and got it to work.
| NVIDIA hasn't gotten into that space yet, so there's no point
| in building a product that you can't sell to a consumer or even
| a data center that isn't built around that exact product (or to
| contain a Balrog).
| fspeech wrote:
| There is nothing inherently good about wafer scale. It's
| actually harder to dissipate heat and enable hybrid bonding
| with DRAM. So the gp is entirely correct that you need to
| actually show higher silicon utilization to be even
| considered as being something worthwhile.
| Neywiny wrote:
| I think I'll still stand by my viewpoint. They said:
|
| > On the Cerebras side, the effective die size is a bit
| smaller at 46,225mm2. Applying the same defect rate, the
| WSE-3 would see 46 defects. Each core is 0.05mm2. This means
| 2.2mm2 in total would be lost to defects.
|
| So ok they claim that they should see (46225-2.2)/46225 =
| 99.995%. Doing the same math for their Nvidia numbers it's
| 99.4%. And yet in practice neither approach got to these
| numbers. Nowhere near it. I just feel like the whole article
| talks about all this theory and numbers and math of how
| they're so much better but in practice it's meaningless.
|
| So what I'm not seeing is why it'd be impossible for all the
| H100s on a wafer to be interconnected and call it a day.
| You'd presumably get 92/93 = 98.9% of the performance and,
| here's the kicker, no need to switch to another architecture.
| I didn't know where your 0% number came from. Nothing about
| this article says that a competitor doing the same scaling to
| wafer scale would get 0%, just a marginal decrease in how
| many cores made it through fab.
|
| Fundamentally I am not convinced from this article that
| Cerebras has done something in their design that makes this
| possible. All I'm seeing is that it'd perform 1% faster.
|
| Edit: thinking a bit more on it, to me it's like they said
| TSMC has a guy with a sledgehammer who smashes all the wafers
| and their architecture snaps a tiny bit cleaner. But they
| haven't said anything about firing the guy with the
| sledgehammer. Their paragraph before the final table says
| that this whole exercise is pretty much meaningless because
| their numbers are made up about competitors and they aren't
| even the right numbers to be using. Then the table backs up
| my paraphrase.
| wendyshu wrote:
| What's yield?
| elpocko wrote:
| When driving a car, to yield means that merging drivers must
| prepare to stop if necessary to let a driver on another
| approach proceed.
|
| That's not necessary if you have strong weaponry mounted on
| your vehicle: research shows that you don't have to stop if
| all the other drivers are dead.
| wmf wrote:
| It's the fraction of usable product from a manufacturing
| process.
| anonymousDan wrote:
| Very interesting. Am I correct in saying that fault tolerance
| here is with respect to 'static' errors that occur during
| manufacturing and are straightforward to detect before reaching
| the customer? Or can these failures potentially occur later on
| (and be tolerated) during the normal life of the chip?
| abrookewood wrote:
| Looking at the H100 on the left, why is the chip yield (72) based
| on a circular layout/constraint? Why do they discard all of the
| other chips that fall outside the circle?
| flumpcakes wrote:
| Because the circle is the physical silicon. Any chips that fall
| outside the circle are only part of a full chip. They will be
| physically missing half the chip.
| donavanm wrote:
| AFAIK all wafer ingots are cylinders, which means the wafers
| themselves are a circular cross section. So manufacturing is
| binpacking rectangles in to a circle. Plus different
| effects/defects in the chips based on the distance from the
| edge of the wafer.
|
| So I believe it's the opposite: why are they representing the
| larger square and implying lower yield off the wafer in space
| that doesnt practically exist?
| therealcamino wrote:
| That's just the shape of the wafer. I don't know why the
| diagram continued the grid outside it.
| ryao wrote:
| > Take the Nvidia H100 - a massive GPU weighing in at 814mm2.
| Traditionally this chip would be very difficult to yield
| economically. But since its cores (SMs) are fault tolerant, a
| manufacturing defect does not knock out the entire product. The
| chip physically has 144 SMs but the commercialized product only
| has 132 SMs active. This means the chip could suffer numerous
| defects across 12 SMs and still be sold as a flagship part.
|
| Fault tolerance seems to be the wrong term to use here. If I
| wrote this, I would have written redundant.
| jjk166 wrote:
| Redundant cores lead to a fault tolerant chip.
| ryao wrote:
| ECC memory is fault tolerant. It repairs issues on the fly
| without disabling hardware. This on the other hand is merely
| redundant to handle manufacturing defects. If they make a
| mistake and ship a bad core that malfunctions at runtime, it
| is not going to tolerate that.
| jjk166 wrote:
| Redundancy is a method of providing fault tolerance, the
| existence of other methods doesn't make it less fault
| tolerant.
|
| Nothing is tolerant to all possible faults. Fault tolerance
| refers to being able to tolerate specific types of faults
| under specific conditions.
|
| Fault tolerant is the proper term for this.
| gunalx wrote:
| My biggest question is who are the buyers?
| asdasdsddd wrote:
| mostly 1 ai company in the middle east last I heard
| bcatanzaro wrote:
| This is a strange blog post. Their tables say:
|
| Cerebras yields 46225 * .93 = 43000 square millimeters per wafer
|
| NVIDIA yields 58608 * .92 = 54000 square millimeters per wafer
|
| I don't know if their numbers are correct but it is a strange
| thing for a startup to brag that it is worse than a big company
| at something important.
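The two figures quoted above, computed from the blog's stated die areas and enablement rates (72 H100 dies of 814 mm^2 each):

```python
# Enabled silicon area per wafer, from the blog post's own numbers.
cerebras_mm2 = 46225 * 0.93        # WSE-3 die area * 93% enablement
nvidia_mm2 = 72 * 814 * 0.92       # 72 H100 dies * 814 mm^2 * 92% enablement

print(round(cerebras_mm2))   # 42989, i.e. ~43,000 mm^2
print(round(nvidia_mm2))     # 53919, i.e. ~54,000 mm^2
```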
| saulpw wrote:
| Being within striking distance of SOTA while using orders of
| magnitude fewer resources is worth bragging about.
| RecycledEle wrote:
| IIRC, it was Carl Bruggeman's IPSA Thesis that showed us how to
| laser out bad cores.
| oksurewhynot wrote:
| I live in a small city/large town that has a large number of
| craft breweries. I always marveled at how these small operations
| were able to churn out so many different varieties. Turns out
| they are actually trying to make their few core recipes but the
| yield is so low they market the less consistent results as...all
| that variety I was so impressed with.
| trhway wrote:
| 56K mm2 vs 46K mm2. I wonder why they wouldn't use the smart
| routing etc. to use a more fitting shape than a square and thus
| use more of the wafer.
| ilaksh wrote:
| I assume people are aware, but Cerebras has a web demo and API
| which is open to try and it is 2000 tokens per second for Llama
| 3.3 70b and 1000 tokens per second for Llama 3.1 405b.
|
| https://cerebras.ai/inference
| Fokamul wrote:
| Does anyone have pictures of what it looks like inside these
| servers?
| hoseja wrote:
| Why square chip? Make it an octagon or something.
| aurareturn wrote:
| Bear case on Cerebras:
| https://irrationalanalysis.substack.com/p/cerebras-cbrso-equ...
|
| Note: This author is heavily invested in Nvidia.
| larsrc wrote:
| How do these much smaller cores compare in computing power to the
| bigger ones? They seem to implicitly claim that a core is a core
| is a core, but surely one gets something extra out of the much
| bigger one?
| jstrong wrote:
| I would like a workstation with 900k cores. lmk when these things
| are on ebay.
| riskable wrote:
| Just need that 20kW connection to your energy provider.
| aaroninsf wrote:
| The number of people ITT who have absorbed the world-weary
| AI-is-a-bubble skepticism...
|
| I'm just gonna say, with serene certainty,
|
| that the economic order we inhabit is going through a phase
| change is certain. From certain myopic perspectives we can
| shoehorn that into a narrative of cyclical patterns in the tech
| industry or financial markets, etc.
|
| This is not going to be that. No more than the transformation of
| American retail can be shoehorned into looking like it used to
| if you don't know anything at all about what contemporary
| international trade, logistics, and oligopoly actually mean in
| terms of what is coming into your home from where and why it is
| or isn't cheap.
|
| Where we'll be in 10 or 20 years is literally unimaginable
| today; and trying to navigate that wrt traditional landmarks...
| oof.
| lofaszvanitt wrote:
| A well written, easy to understand article.
___________________________________________________________________
(page generated 2025-01-16 23:01 UTC)