[HN Gopher] 100x defect tolerance: How we solved the yield problem
       ___________________________________________________________________
        
       100x defect tolerance: How we solved the yield problem
        
       Author : jwan584
       Score  : 318 points
       Date   : 2025-01-15 21:19 UTC (1 day ago)
        
 (HTM) web link (cerebras.ai)
 (TXT) w3m dump (cerebras.ai)
        
       | ChuckMcM wrote:
       | I think this is an important step, but it skips over the fact
       | that a 'fault tolerant routing architecture' means you're
       | spending die space on routes rather than transistors. This is
       | exactly analogous to spending bits in your storage on error
       | correction rather than on data.
       | 
       | That said, I think they do a great job of exploiting this
       | technique to create a "larger"[1] chip. And like storage, it
       | benefits from the fact that every core is the same and that you
       | don't need to reach every core directly (pin limiting).
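       | The yield arithmetic behind this trade is easy to sketch. A toy
       | model (hypothetical numbers, cores assumed to fail
       | independently; not Cerebras's actual figures):

```python
import math

def yield_with_redundancy(total_cores, spare_cores, defect_rate):
    """Probability that at most `spare_cores` cores are defective,
    i.e. the routing fabric can still stitch together a fully
    working chip (cores assumed to fail independently)."""
    return sum(
        math.comb(total_cores, k)
        * defect_rate ** k
        * (1 - defect_rate) ** (total_cores - k)
        for k in range(spare_cores + 1)
    )

# Hypothetical die: 1000 cores, 0.1% chance any given core is bad.
all_must_work = yield_with_redundancy(1000, 0, 0.001)    # ~0.37
fifteen_spares = yield_with_redundancy(1000, 15, 0.001)  # ~1.0
print(f"no spares: {all_must_work:.3f}, 15 spares: {fifteen_spares:.6f}")
```

       | With a 0.1% per-core defect rate, requiring all 1000 cores to
       | work kills roughly 63% of dice, while budgeting 15 spare cores
       | recovers essentially all of them - the same trade as spending
       | storage bits on ECC.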
       | 
       | In the early 2000's I was looking at a wafer scale startup that
       | had the same idea but they were applying it to an FPGA
       | architecture rather than a set of tensor units for LLMs. Nearly
       | the exact same pitch, "we don't have to have all of our GLUs[2]
       | work because the built in routing only uses the ones that are
       | qualified." Xilinx was still aggressively suing people who put
       | SERDES ports on FPGAs, so they were pin limited overall, but the
       | idea is sound.
       | 
       | While I continue to believe that many people are going to
       | collectively lose trillions of dollars ultimately pursuing "AI"
       | at this stage, I appreciate that the amount of money people are
       | willing to put at risk here allows folks to try these "out of
       | the box" kinds of ideas.
       | 
       | [1] It is physically more cores on a single die but the overall
       | system is likely smaller, given the integration here.
       | 
       | [2] "Generic Logic Unit" which was kind of an extended LUT with
       | some block RAM and register support.
        
         | __Joker wrote:
         | "While I continue to believe that many people are going to
         | collectively lose trillions of dollars ultimately pursuing "AI"
         | at this stage"
         | 
         | Can you please explain more why you think so ?
         | 
         | Thank you.
        
           | mschuster91 wrote:
           | It's a hype cycle, with many of the hypers and deciders
           | having _zero_ idea about what AI actually is and how it
           | works. ChatGPT, while amazing, is at its core a token
           | predictor; it cannot ever get to an AGI level that you'd
           | assume to be competitive to a human, even most animals.
           | 
           | And just as with every other hype cycle, this one will
           | crash down hard. The crypto crashes were bad enough, but at
           | least gamers got some very cheap GPUs out of all the failed
           | crypto farms back then. This time so much more money,
           | particularly institutional money, is flowing into AI that
           | we're looking at a repeat of Lehman's once people wake up
           | and realize they've been scammed.
        
             | KronisLV wrote:
             | > And just as every other hype cycle, this one will crash
             | down hard.
             | 
             | Isn't that an inherent problem with pretty much
             | everything nowadays: crypto, blockchain, AI, even the
             | likes of serverless and Kubernetes, or cloud and
             | microservices in general?
             | 
             | There's always some hype cycle where the people who are
             | early benefit and a lot of people chasing the hype later
             | lose when the reality of the actual limitations and the
             | real non-inflated utility of each technology hits. And
             | then, a while later, it all settles down.
             | 
             | I don't think the current "AI" is special in any way, it's
             | just that everyone tries to get rich (or benefit in other
             | ways, as in the microservices example, where you still very
             | much had a hype cycle) quick without caring about the
             | actual details.
        
               | anon373839 wrote:
               | > I don't think the current "AI" is special in any way
               | 
               | As someone who loves to pour ice water on AI hype, I have
               | to say: you can't be serious.
               | 
               | The current AI tech has opened up paths to develop
               | applications that were impossible just a few years ago.
               | Even if the tech freezes in place, I think it will yield
               | substantial economic value in the coming years.
               | 
               | It's very different from crypto, the main use case for
               | which appears to be money laundering.
        
               | carlmr wrote:
               | >It's very different from crypto, the main use case for
               | which appears to be money laundering.
               | 
               | Which has substantial economic value (for certain groups
               | of people).
        
               | lazide wrote:
               | According to this random estimate, the black market
               | economy alone in just the US is worth ~$2 trillion/yr:
               | https://www.investopedia.com/terms/u/underground-economy.asp
               | 
               | Roughly 11-12% of GDP.
               | 
               | In many countries, black+grey market is larger than the
               | 'white' market. The US is notoriously 'clean' compared to
               | most (probably top 10).
               | 
               | Even in the US, if you suddenly stopped 10-12% of GDP
               | we're talking 'great depression' levels of economic pain.
               | 
               | Honestly, the only reason crypto isn't _bigger_, IMO,
               | is that there is such a large and established set of
               | folks doing laundering in the 'normal' system, and
               | those channels work well enough that there is not
               | nearly as much demand as you'd expect.
        
               | KronisLV wrote:
               | > The current AI tech has opened up paths to develop
               | applications that were impossible just a few years ago.
               | 
               | My argument is that if it's advertised as a direct
               | precursor to AGI based on wishful thinking and people
               | don't know any better, then it's no different to claims
               | about how putting blockchain technology in X industry
               | will solve all of its problems.
               | 
               | I use LLMs daily and don't scoff at AI generated imagery
               | or use cases like agentic systems, but there absolutely
               | is a similar hype cycle to every other innovation out
               | there where people are borderline delusional in the
               | initial stages (Kubernetes will solve all of our issues,
               | moving to cloud and microservices will solve all of our
               | issues, the blockchain will...), before the limitations
               | crystallize and we know what each technology is good or
               | bad at.
               | 
               | Though maybe that says more about human nature than the
               | technology itself.
               | 
               | > It's very different from crypto, the main use case for
               | which appears to be money laundering.
               | 
               | That's akin to saying "The main use case for AI appears
               | to be stealing people's art and even for writers and
               | others it seems to be firing people to replace them with
               | soulless AI generated slop."
               | 
               | I'd even argue that there's nothing wrong with the
               | technologies themselves, be it LLMs, AI for image, video,
               | audio generation, blockchain and crypto, or whatever. The
               | problems arise based on how the technologies are used, or
               | in my argument above - how they're touted as the solution
               | to all the problems. Some people profit a lot, others
               | collide with reality and their limitations at speed.
               | 
               | In other words, if the technology will generate 100
               | billion USD of actual value but people are betting on 500
               | billion USD, then clearly we have a bit of an issue.
        
               | idiotsecant wrote:
               | >the main use case for which appears to be money
               | laundering.
               | 
               | You say tomato, I say freedom from the tyranny of fiat
               | power structures.
        
               | carlhjerpe wrote:
               | Both Kubernetes and serverless (FaaS) are here to stay.
               | Microservices are just an excuse to build shit software.
        
               | KronisLV wrote:
               | > Both Kubernetes and serverless (FaaS) are here to stay.
               | 
               | *in environments and projects where they are a good fit
               | 
               | > Microservices are just an excuse to build shit software.
               | 
               | *in environments and projects where they are a bad fit
        
             | CamperBob2 wrote:
             | _it cannot ever get to an AGI level that you'd assume to
             | be competitive to a human, even most animals._
             | 
             | Suppose you turn out to be wrong. What would convince you?
        
               | ChuckMcM wrote:
               | It could diagram a sentence it had never seen.
        
               | karolist wrote:
               | It does that all the time even now. Play with temp,
               | top_k, min_p and nothing is preventing you from getting
               | combinations and order of tokens previously not
               | encountered in history.
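               | The sampling knobs mentioned above can be sketched in a
               | few lines (toy logits and vocabulary, not any particular
               | model's):

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=None):
    """Sample one token id from raw logits after temperature scaling
    and optional top-k truncation (toy version of LLM decoding)."""
    scaled = [(i, logit / temperature) for i, logit in enumerate(logits)]
    if top_k is not None:
        # Keep only the top_k highest-scoring candidates.
        scaled = sorted(scaled, key=lambda p: p[1], reverse=True)[:top_k]
    m = max(s for _, s in scaled)
    weights = [math.exp(s - m) for _, s in scaled]  # stable softmax
    ids = [i for i, _ in scaled]
    return random.choices(ids, weights=weights, k=1)[0]

random.seed(0)
vocab = ["the", "cat", "sat", "on", "quantum"]
logits = [2.0, 1.5, 1.0, 0.5, -1.0]
# Raising the temperature flattens the distribution, so low-probability
# tokens -- and hence never-before-seen orderings -- show up more often.
for t in (0.2, 1.0, 3.0):
    print(t, [vocab[sample_token(logits, temperature=t)] for _ in range(8)])
```

               | At low temperature the argmax token dominates; at high
               | temperature even "quantum" gets sampled, which is why
               | novel token combinations are routine.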
        
               | kragen wrote:
               | I did eventually get an LLM to produce what seems to be a
               | correct diagram of a sentence it had never seen, but it
               | took about ten tries. Grammatical analysis seems to have
               | happened correctly every time, but getting to a usable
               | diagram was difficult.
               | 
               | (I know that it's generally rude to include LLM output in
               | HN comments, but in this case I think it's essential
               | supporting material to elevate the discussion of LLM
               | capabilities above "yes it is", "no it isn't".)
               | 
               | First, I just asked ChatGPT (GPT-4o mini, what they offer
               | anonymous free users):
               | 
               | > Please diagram the sentence, "If my mother's road had
               | been cultural, my uncle's sloth squad residue would have
               | compiled a periodic meadow of night."
               | 
               | I just invented that sentence, picking several words with
               | a random number generator, but of course I can't prove to
               | you that I didn't copy it from some book you haven't
               | seen. (For that, you'd have to invent a sentence
               | yourself, or at least provide me with some words to
               | include in it.) I hope you can at least agree that it's a
               | _highly unlikely_ sentence for anyone ever to have
               | uttered before!
               | 
               | After talking to itself for a few paragraphs, it arrived
               | at the following grammatical analysis:
               | Step 4: Diagramming the sentence
               | 
               |   Main Clause:
               |     Subject: "my uncle's sloth squad residue"
               |     Verb: "would have compiled"
               |     Direct Object: "a periodic meadow of night"
               |       "a periodic meadow" (noun phrase)
               |       "of night" (prepositional phrase modifying "meadow")
               | 
               |   Conditional Clause:
               |     Subject: "my mother's road"
               |     Verb: "had been" (linking verb)
               |     Predicate Adjective: "cultural"
               | 
               | This is exactly correct. From that it produced the
               | following ASCII art "diagram", which looks like nonsense
               | to me, but maybe it's in some kind of sentence-
               | diagramming formalism I haven't seen:
               | If             ______/______            |              |
               | my mother's    had been           road             |
               | cultural                    my uncle's sloth squad
               | residue                      |         would have
               | compiled                      |       a periodic meadow
               | of night                   |                 of
               | night
               | 
               | I told it the diagram looked like nonsense, and it
               | produced this one, which looks worse:
               | If                _______/______               |
               | |       my mother's road    had been                |
               | |            cultural      (Main Clause)
               | |            my uncle's sloth squad residue
               | |                      would have compiled
               | |                    a periodic meadow
               | |                            of night
               | 
               | I asked GPT-4 (paid) the same question, and it gave me
               | another exactly correct grammatical analysis in words:
               | 1. **Identify the main clauses**:
               |    - "If my mother's road had been cultural" is a
               |      dependent clause (specifically, a conditional
               |      clause).
               |    - "my uncle's sloth squad residue would have compiled
               |      a periodic meadow of night" is the main clause.
               | 2. **Start with the main clause**:
               |    - Subject: "my uncle's sloth squad residue"
               |      - "my uncle's" modifies "sloth squad," which in
               |        turn modifies "residue."
               |    - Predicate: "would have compiled"
               |      - This includes the modal auxiliary verb ("would"),
               |        perfect auxiliary ("have"), and main verb
               |        ("compiled").
               |    - Direct object: "a periodic meadow of night"
               |      - "a" is an article modifying "meadow,"
               |      - "periodic" modifies "meadow,"
               |      - "of night" is a prepositional phrase modifying
               |        "meadow."
               | 3. **Diagram the dependent clause**:
               |    - Subject: "road"
               |      - Modified by possessive noun phrase "my mother's"
               |    - Predicate: "had been"
               |    - Complement/Adjective: "cultural"
               | 
               | (I feel like this is definitely "a level that you'd
               | assume to be competitive to a human", in mschuster91's
               | phrase; most native English speakers can't do that.)
               | 
               | But then it came up with the following incorrect diagram:
               | +---------------------------------------------------+
               | |                      Main Clause                   |
               | +---------------------------------------------------+
               | /                |               \
               | /                 |                \
               | /                  |                 \
               | [My uncle's]         [sloth squad]          [residue]
               | \                    /                     /
               | \                  /                     /
               | \                /                     /
               | would have compiled -------------------
               | \
               | \
               | \
               | [a periodic meadow of night]
               | /           |            \
               | /            |             \
               | a        periodic       [of night]
               | +---------------------+       |  Dependent Clause   |
               | +---------------------+             If              |
               | road           ___|_____         |          |        my
               | mother's had been cultural
               | 
               | Telling it the diagram was nonsense didn't help.
               | 
               | But my conclusion from this is not that GPT-4 can't
               | analyze the grammatical structure of sentences it had
               | never seen. It's that GPT-4 gets "confused" when it tries
               | to draw ASCII art, like a person trying to tell you how
               | to draw it over the telephone, but not quite as bad. I've
               | seen similarly abominable results when it tried to draw
               | circuit schematics and guitar tablatures as ASCII art. I
               | tried asking it for a graphviz diagram, which was
               | syntactically correct graphviz code but didn't express
               | the sentence structure properly.
               | 
               | I tried simplifying the problem with the prompt, 'Please
               | provide a syntactic tree showing the syntactic derivation
               | of the sentence, "My uncle's sloth squad residue would
               | have compiled a periodic meadow of night," using graphviz
               | syntax and with individual words at the leaf nodes of the
               | tree.' which yielded a slightly incorrect diagram. (I'm
               | still sure nobody has ever uttered this sentence before.)
               | On being told, "This diagram is not quite right.", GPT-4
               | produced the following diagram, which I believe is
               | correct except for an extra "s" on "uncle" (you can see
               | this in your browser at https://is.gd/Axufjx):
               | digraph G {
               |   node [shape=plaintext];
               |   S [label="S"];
               |   NP [label="NP"];
               |   VP [label="VP"];
               | 
               |   S -> NP;
               |   S -> VP;
               | 
               |   // Noun Phrase breakdown
               |   DP_my [label="DP"];
               |   DP_my -> My;
               | 
               |   UnclePossessive [label="NP"];
               |   UnclePossessive -> Uncles;
               |   UnclePossessive -> ApostropheS;
               | 
               |   NP -> DP_my;
               |   NP -> UncleCompound;
               | 
               |   UncleCompound [label="NP"];
               |   SlothSquadResidue [label = "N'"];
               |   SlothSquadResidue -> Sloth;
               |   SlothSquadResidue -> SquadResidue;
               |   SquadResidue [label = "N'"]
               |   SquadResidue -> Squad;
               |   SquadResidue -> Residue;
               |   UncleCompound -> UnclePossessive;
               |   UncleCompound -> SlothSquadResidue;
               | 
               |   // Verb Phrase breakdown
               |   ModalWould [label = "Modal"];
               |   ModalWould -> Would;
               |   AuxHave [label = "Aux"];
               |   AuxHave -> Have;
               |   VCompiled [label = "V"];
               |   VCompiled -> Compiled;
               |   ObjectNP [label = "NP"];  // Direct object noun phrase
               |   ArticleA [label = "D"];
               |   ArticleA -> A;
               |   MeadowPhrase [label = "N'"];  // 'Meadow' modified by 'periodic'
               |   AdjPeriodic [label = "Adj"];
               |   AdjPeriodic -> Periodic;
               |   MeadowWord [label = N];
               |   MeadowWord -> Meadow;
               |   MeadowPhrase -> AdjPeriodic;
               |   MeadowPhrase -> MeadowWord;
               |   PPofNight [label = PP];  // Prepositional phrase 'of night'
               |   PrepOf [label=P]
               |   PrepOf -> Of;
               |   NightNounPhrase [label=N]
               |   NightNounPhrase -> Night;
               |   PPofNight -> PrepOf;
               |   PPofNight -> NightNounPhrase;
               |   ObjectNP -> ArticleA;
               |   ObjectNP -> MeadowPhrase;
               |   ObjectNP -> PPofNight;
               |   VP -> ModalWould;
               |   VP -> AuxHave;
               |   VP -> VCompiled
               |   VP -> ObjectNP
               | }
               | 
               | Claude produced a reasonably serviceable diagram on the
               | first try with that prompt:
               | 
               |   digraph SyntacticTree {
               |     node [shape=plaintext];
               |     S -> NP1;
               |     S -> VP;
               | 
               |     NP1 -> DET1 [label="Det"];
               |     NP1 -> NP2 [label="NP"];
               |     NP2 -> NP3 [label="NP"];
               |     NP2 -> N1 [label="N"];
               |     NP3 -> N2 [label="N"];
               |     NP3 -> POS [label="Pos"];
               |     VP -> AUX1 [label="Aux"];
               |     VP -> AUX2 [label="Aux"];
               |     VP -> V [label="V"];
               |     VP -> NP4 [label="NP"];
               | 
               |     NP4 -> DET2 [label="Det"];
               |     NP4 -> ADJ [label="Adj"];
               |     NP4 -> N3 [label="N"];
               |     NP4 -> PP [label="PP"];
               | 
               |     PP -> P [label="P"];
               |     PP -> NP5 [label="NP"];
               | 
               |     NP5 -> N4 [label="N"];
               | 
               |     DET1 -> "My";
               |     N2 -> "uncle";
               |     POS -> "'s";
               |     N1 -> "sloth";
               |     N1 -> "squad";
               |     N1 -> "residue";
               |     AUX1 -> "would";
               |     AUX2 -> "have";
               |     V -> "compiled";
               |     DET2 -> "a";
               |     ADJ -> "periodic";
               |     N3 -> "meadow";
               |     P -> "of";
               |     N4 -> "night";
               |   }
               | 
               | On being told, I think incorrectly, "This diagram is not
               | quite right.", it produced a worse diagram.
               | 
               | So LLMs didn't perform nearly as well on this task as I
               | thought they would, but they also performed much better
               | than you thought they would.
        
               | emkee wrote:
               | Having only taken one syntax class for fun in college, I
               | find this pretty impressive. Generating syntax trees was
               | never a trivial task for me (but I was just a CS major
               | who needed a credit). Slightly related, but I have also
               | never had ChatGPT successfully generate ASCII art, even
               | with extensive conversation.
        
               | kragen wrote:
               | Yeah, I think it would be very challenging for most
               | people. It did considerably better with Graphviz than
               | with ASCII art, but it still had trouble with the
               | transition from a perfectly correct and highly nuanced
               | verbal grammatical analysis to Graphviz. I think this is
               | pretty convincing evidence against ChuckMcM's implicit
               | position. It's weaker evidence against mschuster91's
               | explicit position because parsing is something computers
               | have been doing for a long time, so it doesn't imply any
               | new capabilities.
               | 
               | I'm pretty sure there are part-of-speech tagging parsers
               | using fairly shallow statistics that could also have
               | produced an equivalently good sentence diagram.
               | https://corenlp.run/ seems to produce a correct parse,
               | though in a different format.
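               | As a sketch of how little machinery a correct parse of
               | that sentence needs, here is a tiny CYK chart parser
               | over a hand-written toy grammar (the category names and
               | rules below are my own invention, not CoreNLP's):

```python
# Toy grammar: LEXICON maps words to categories; RULES are binary
# (parent, left child, right child), CNF-style.
LEXICON = {
    "my": {"Det"}, "a": {"Det"}, "'s": {"Pos"}, "of": {"P"},
    "uncle": {"N", "Nbar"}, "sloth": {"N"}, "squad": {"N"},
    "residue": {"N", "Nbar"}, "meadow": {"N", "Nbar"},
    "night": {"N", "Nbar", "NP"}, "periodic": {"Adj"},
    "would": {"Modal"}, "have": {"Aux"}, "compiled": {"V"},
}
RULES = [
    ("S", "NP", "VP"),
    ("NP", "Det", "Nbar"), ("PossNP", "NP", "Pos"), ("NP", "PossNP", "Nbar"),
    ("Nbar", "N", "Nbar"), ("Nbar", "Adj", "Nbar"), ("Nbar", "Nbar", "PP"),
    ("PP", "P", "NP"),
    ("VP", "Modal", "VP1"), ("VP1", "Aux", "VP2"), ("VP2", "V", "NP"),
]

def cyk_parse(tokens):
    """CYK chart parse; returns a bracketed tree string, or None."""
    n = len(tokens)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(tokens):
        for cat in LEXICON[w]:
            chart[i][i + 1][cat] = w
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for parent, lc, rc in RULES:
                    if lc in chart[i][k] and rc in chart[k][j]:
                        chart[i][j].setdefault(parent, ((lc, i, k), (rc, k, j)))
    def tree(cat, i, j):
        entry = chart[i][j][cat]
        if isinstance(entry, str):  # lexical leaf
            return f"({cat} {entry})"
        (lc, li, lj), (rc, ri, rj) = entry
        return f"({cat} {tree(lc, li, lj)} {tree(rc, ri, rj)})"
    return tree("S", 0, n) if "S" in chart[0][n] else None

sentence = ("my uncle 's sloth squad residue "
            "would have compiled a periodic meadow of night")
print(cyk_parse(sentence.split()))
```

               | It prints a bracketed constituency tree for exactly that
               | sentence, which is all the diagram really encodes.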
        
               | stonemetal12 wrote:
               | If you ask it to draw a dinosaur it does an okay
               | brontosaurus:
               | 
               |                  __
               |                 / _)
               |        _.----._/ /
               |       /         /
               |    __/ (  | (  |
               |   /__.-'|_|--|_|
               | 
               | Asking for a Tyrannosaurus Rex gives you more or less
               | the same brontosaurus:
               | 
               |                  __
               |                 / _)
               |        _.----._/ /
               |       /         /
               |    __/ (  | (  |
               |   /__.-'|_|--|_|
        
               | CamperBob2 wrote:
               | _(I know that it's generally rude to include LLM output
               | in HN comments, but in this case I think it's essential
               | supporting material to elevate the discussion of LLM
               | capabilities above "yes it is", "no it isn't".)_
               | 
               | You just have to be prepared to take a karma hit for it.
               | The audience here does not consist largely of 'hackers',
               | but seems to skew toward the sort of fearful, resentful
               | reactionaries that hacker culture traditionally opposes.
               | 
               | I will say I wouldn't peg ChuckMcM as being one of the
               | reactionaries, though. That would be an unpleasant
               | surprise.
               | 
               | As far as the diagram goes, my guess is that sentence
               | diagrams were underrepresented in the training corpus.
               | Diagramming sentences was already out of fashion when I
               | was in school in the 1980s -- in fact, I don't recall
               | ever having done it. The model is struggling much the
               | same way you'd expect a grade-school student (or me, I
               | guess) to struggle upon being asked to perform the task
               | for the first time.
               | 
               | Knowing when to say "I don't know how to do that" is
               | still a foundational weakness of LLMs, but I don't expect
               | it to remain unaddressed for long. We will see
               | improvement in that area, sooner or later. The
               | anklebiters will respond by moving their goalposts and
               | hitting the downvote button as usual.
        
               | kragen wrote:
               | ASCII art Reed-Kellogg sentence diagrams are probably
               | hard to find anywhere, and Graphviz can't really express
               | Reed-Kellogg diagrams. But Reed and Kellogg published
               | their somewhat ad-hoc diagram language in 01877, 78 years
               | before what we now call "linguistics" was known in the
               | West thanks to Chomsky's work in 01955. These are among
               | the reasons I thought it might be a good idea to use the
               | form of sentence diagrams used by linguists instead of
               | the more compact Reed-Kellogg diagrams.
        
               | CamperBob2 wrote:
               | A language model can generate valid sentences that have
               | never existed before -- not just syntactically or
               | statistically valid sentences in a Markovian sense, but
               | semantically valid ones. How could it possibly do that
               | without an implicit awareness of all aspects of language,
               | including grammar and parts of speech?
               | 
               | That implicit awareness is indistinguishable from
               | knowledge, at least to the extent it yields correct
               | results.
               | 
               | And the fact that it was inferred from examples rather
               | than taught explicitly is indistinguishable from
               | intelligence.
        
             | dsign wrote:
             | Those glorified token predictors are the missing piece in
             | the puzzle of general intelligence. There is a long way to
             | go still in putting all those pieces together, but I don't
             | think any of the steps left are in the same order of "we
             | need a miracle breakthrough".
             | 
             | That said, I believe that this is going one of two ways: we
             | use AI to make things materially harder for humans, in a
             | scale from "you don't get this job" to "oops, this is
             | Skynet", with many unpleasant stops in the middle. By the
             | amount of money going into AI right now and most of the
             | applications I'm seeing being hyped, I don't think we have
             | have any scruples with this direction.
             | 
             | The other way this can go, and Cerebras is a good example,
             | is that we increase our compute capability and our AI-
             | usefulness to a point where we can fight cancer and
             | stop/revert aging, both being a computational problem at
             | this point. But most people either don't realize it or
             | have strong moral objections to this outcome and don't
             | even want to talk about it, so it probably won't happen.
             | 
             | In simpler words, I think we want to use AI to commit
             | species suicide :-)
        
               | Shorel wrote:
               | I'm sure there are more missing pieces.
               | 
               | We are more than Broca's areas. Our intelligence is much
               | more than linguistic intelligence.
               | 
               | However, and this is also an important point, we have
               | built language models far more capable than any language
               | model a single human brain can have.
               | 
               | Makes me shudder in awe of what's going to happen when we
               | add the missing pieces.
        
               | idiotsecant wrote:
               | Yes, I sometimes wonder if what we're witnessing in our
               | lifetimes is the next stage of the 'bootstrapping' of
               | life into a more complex form. If we might be the
               | mitochondria contributing our little piece to the cell
               | that comes after.
        
             | immibis wrote:
             | Why do you think that an AGI can't be a token predictor?
        
               | mschuster91 wrote:
               | Because an LLM _by definition_ cannot even do basic maths
               | (well, except if you're OpenAI and cheat your way around
               | it by detecting if the user asks a simple math question).
               | 
               | I'd expect an actually "general" intelligence Thing to be
               | able to be as versatile in intellectual tasks as a human
               | is - and LLMs are reasonably decent at repetition, but
               | cannot infer something completely new from the data it
               | has.
        
               | versteegen wrote:
               | Define "by definition".
               | 
               | Because this statement really makes no sense.
               | Transformers are perfectly capable (and capable of
               | perfectly) learning mathematical functions, given the
               | necessary working-out space, e.g. for long division or
               | for algebraic manipulation. And they can learn to
               | generalise from their training data very well (although
               | very data-inefficiently). That's their entire strength!
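               | For instance, the "working-out space" for long division
               | can be made explicit as a scratchpad sequence. A toy
               | illustration of that training format (my own sketch, not
               | any specific paper's):

```python
def long_division_scratchpad(dividend, divisor):
    """Emit long division digit by digit as an explicit token
    sequence -- the 'show your working' format that lets a model
    learn the algorithm stepwise instead of in one shot."""
    steps, remainder, digits = [], 0, []
    for d in str(dividend):
        remainder = remainder * 10 + int(d)
        q = remainder // divisor
        steps.append(f"bring down {d}: rem={remainder}, "
                     f"{divisor}*{q}={divisor * q}, digit={q}")
        digits.append(str(q))
        remainder -= divisor * q
    steps.append(f"answer: {int(''.join(digits))} remainder {remainder}")
    return steps

for line in long_division_scratchpad(9876, 7):
    print(line)
```

               | Each step is a short, local computation over the
               | previous step's output, which is exactly the shape of
               | problem a transformer with a scratchpad can learn.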
        
               | dogcomplex wrote:
               | Yet they can get silver medal PhD level competition math
               | scores.
               | 
               | Perhaps your "definition" should be simply that LLMs have
               | temporarily seen limitations in their ability to natively
               | do math unassisted by an external memory, but are
               | _exceptionally_ good at _very advanced math_ when they
               | can compensate for their lossy short-term attention
               | memory...
        
               | Shorel wrote:
               | By analogy with human brains: Because our own brains are
               | far more than the Broca's areas in them.
               | 
               | Evolution selects for efficiency.
               | 
               | If token prediction could work for everything, our brains
               | would also do nothing else but token prediction. Even the
               | brains of fishes and insects would work like that.
               | 
               | The human brain has dedicated clusters of neurons for
               | several different cognitive abilities, including face
               | recognition, line detection, body parts self perception,
               | 3D spatial orientation, and so on.
        
               | myrmidon wrote:
               | > Evolution selects for efficiency.
               | 
               | I think this is a poor argument here. From an
               | evolutionary point of view, our brains are optimized to:
               | 
               | - Provide fine-motor control to craft weapons and tools
                | (enhancing adaptability and enabling us to hunt way
               | outside our weight class)
               | 
               | - Communicate/coordinate effectively in small groups
               | 
               | - Do sensor processing and the above with a low energy
               | budget
               | 
               | Our brains are *not* selected to be minimum-complexity
               | intelligences, and a _lot_ of what our brain does is
               | completely useless for AGI building (motor control,
               | sensor processing, ...).
               | 
                | Furthermore, the cost/complexity (from an evolutionary
               | PoV) is a totally different beast from what complexity
               | means to us.
               | 
               | Just consider flight as an example: A fruitfly is an
               | insanely simple and straightforward beast, but to us, a
               | biochemically fuelled, beating-wing design is _still_
               | infeasibly complicated. If our approach to flight had
               | been to ape after how nature does it in detail, we likely
               | _still_ would not have planes.
               | 
                | I do agree that today's LLMs still have clear
               | architectural flaws that we need to overcome (online
               | learning being a very glaring one), but, to pick up the
               | flight analogy, we might well have the main wing
                | structure _already_ down, and we won't necessarily have
               | to make those wings _beat_ to get into the air...
        
               | Shorel wrote:
               | Just because there are some parts of our brains that are
               | not needed for an AGI...
               | 
               | Doesn't mean that there aren't some part of our brains
               | that are needed for an AGI, and are not present in the
               | current crop of LLM.
        
               | immibis wrote:
               | What do our brains do that isn't token prediction?
               | 
               | They receive information about photons and air vibrations
               | and control muscles, okay. If a human brain was hooked up
               | the way ChatGPT was, only to text input and output, would
               | that make it not intelligent?
        
               | Shorel wrote:
               | > What do our brains do that isn't token prediction?
               | 
               | I am planning a masters and phd on that question, so give
               | me a few years to answer.
        
             | Shorel wrote:
             | While I basically agree with everything you say, I have to
             | add some caveats:
             | 
              | ChatGPT, while being as far from true AGI as the ELIZA
             | chatbot written in Lisp, is extraordinarily more useful,
             | and being used for many things that previously required
             | humans to write the bullshit, like lobbying and propaganda.
             | 
              | And Crypto... right now BTC is at a historical high. It
             | could even go higher. And it will eventually crash again.
             | It's the nature of that beast.
        
             | idiotsecant wrote:
             | All the big LLMs are no longer _just_ token predictors.
             | They are beginning to incorporate memory, chain of thought,
             | and other architectural tricks that use the token predictor
             | in novel ways to produce some startlingly useful output.
             | 
             | It's certainly the case that an LLM alone cannot achieve
             | AGI. As a component of a larger system though? That remains
             | to be seen. Maybe all we need to do is duct tape a limbic
             | system and memory onto an LLM and the result is something
             | sort of like an AGI.
             | 
             | It's a little bit like saying that a ball bearing can't
             | possibly ever be an internal combustion engine. While true,
             | it's sidestepping the point a little bit.
        
           | ChuckMcM wrote:
           | I would guess you're not asking a serious question here but
           | if you were feel free to contact me, it's why I put my email
           | address in my profile.
        
             | bigdict wrote:
             | Why are you assuming bad faith?
        
               | ChuckMcM wrote:
               | What gave you the impression I was assuming bad faith?
               | It's off topic to the discussion (which is fine) but can
               | be annoying in the middle of an HN thread.
        
               | bigdict wrote:
               | > What gave you the impression I was assuming bad faith?
               | 
               | You said "I would guess you're not asking a serious
               | question here"
        
               | ripped_britches wrote:
               | It was a direct quote from your original comment
        
               | bruce343434 wrote:
               | You brought it up...
        
               | ossopite wrote:
               | Without offering any opinion on its merits, if you think
               | justifying this controversial claim is off topic, then so
               | is the claim and you shouldn't have written it.
        
               | kragen wrote:
               | You said, "I would guess you're not asking a serious
               | question here," which is to say, you were guessing that
               | the question was asked in bad faith. Or, at any rate, you
                | _would_, if for some reason the question came up, for
               | example in deciding how to answer it. Which is what you
               | were doing. That is to say, you _did_ guess that it was
               | asked in bad faith. Given the minimal amount of evidence
                | available (12 words and a nickname "__Joker") I think
               | it's reasonable to describe that guess as an assumption.
               | Ergo, you were assuming bad faith.
        
             | __Joker wrote:
              | Really sorry if the question came across as snarky or
              | otherwise. That was not my intent.
             | 
              | Related to AI, given all the noise around it, I really
              | wanted to understand the contrarian view of the monetary
              | aspects.
             | 
             | Once again, apologies if the question seems frivolous.
        
         | enragedcacti wrote:
         | Any thoughts on why they are disabling so many cores in their
         | current product? I did some quick noodling based on the
         | 46/970000 number and the only way I ended up close to 900,000
         | was by assuming that an entire row or column would be disabled
          | if any core within it was faulty. But doing that gave me a ~6%
          | yield, as most trials had active core counts in the high
          | 800,000s.
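That quick noodling can be sketched as a small simulation. A hedged sketch under assumed numbers: ~970,000 cores arranged as a roughly 985x985 grid, 46 random defects per wafer, and a defect disabling its entire row and column (all parameters are my assumptions, not Cerebras' actual layout):

```python
import random

def active_cores(rows=985, cols=985, defects=46, trials=200):
    """Average active cores when each defect disables its row and column."""
    totals = []
    for _ in range(trials):
        dead_rows, dead_cols = set(), set()
        for _ in range(defects):
            dead_rows.add(random.randrange(rows))
            dead_cols.add(random.randrange(cols))
        totals.append((rows - len(dead_rows)) * (cols - len(dead_cols)))
    return sum(totals) / len(totals)

# Typically lands in the high 800,000s, matching the estimate above.
print(f"mean active cores: {active_cores():,.0f}")
```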
        
           | projektfu wrote:
           | They did mention that they stash extra cores to enable the
           | re-routing. Those extra cores are presumably unused when not
           | routed in.
        
             | enragedcacti wrote:
             | That was my first thought but based on the rerouting
             | graphic it seems like the extra cores would be one or two
             | rows and columns around the border which would only account
             | for ~4000 cores.
        
               | projektfu wrote:
               | If the system were broken down into more subdivisions
               | internally, there would be more cores dedicated to
               | replacement. It seems like it could be more difficult to
               | reroute an entire row or column of cores on a wafer than
               | a small block. Perhaps, also, they are building in heavy
               | redundancy for POC and in the future will optimize the
               | number of cores they expect to lose.
        
           | ChuckMcM wrote:
           | I could _guess_ that it helps with heat dissipation
           | /management. But I don't know. That guess is from looking at
           | the list of patents[1] they have.
           | 
           | [1] https://patents.justia.com/assignee/cerebras-systems-inc
        
         | girvo wrote:
         | > Xilinx was still aggressively suing people who put SERDES
         | ports on FPGAs
         | 
         | This so isn't important to your overall point, but where would
         | I begin to look into this? Sounds fascinating!
        
           | nroize wrote:
           | Not OP but I was curious too. Here's all I could find that
           | seemed related: https://www.businesswire.com/news/home/202001
           | 21005582/en/Xil...
        
           | ChuckMcM wrote:
           | Well this was the patent they were threatening with as I
           | recall (https://patents.google.com/patent/US20030023912A1/en)
           | and there was this one too:
           | https://patents.google.com/patent/US5576554A/en
           | 
           | Basically the "secret sauce" of the startup recruiting me was
           | that they were going to do wafer scale FPGAs that could be
           | tiled together to build arbitrarily complex systems like
           | military phased array radars and such. All very hush hush but
           | apparently they had recruited some key talent from Xilinx
           | which was annoying Xilinx.
        
         | dogcomplex wrote:
          | Of course many people are going to collectively lose trillions.
          | AI is a very highly hyped industry with people racing into it
          | without an intellectual edge, and any temporary achievement by
          | any one company will be quickly replicated and undercut by
          | another using the same tools. Economic success of the
          | individuals swarming on a new technology is not guaranteed,
          | nor is it an indicator of the impact of the technology.
         | 
         | Just like the dotcom bubble, AI is gonna hit, make a few
         | companies stinking rich, and make the vast majority (of both
         | AI-chasing and legacy) companies bankrupt. And it's gonna
         | rewire the way everything else operates too.
        
           | idiotsecant wrote:
           | >it's gonna rewire the way everything else operates too.
           | 
           | This is the part that I think a lot of very tech literate
           | people don't seem to get. I see people all the time
           | essentially saying 'AI is just autocomplete' or pointing out
           | that some vaporware ai company is a scam so surely everyone
           | is.
           | 
           | A lot of it is scams and flash in the pan. But a few of them
           | are going to transform our lives in ways we probably don't
           | even anticipate yet, for good and bad.
        
             | Retric wrote:
             | I'm not so sure it's going to even do that much. People are
             | currently happy to use LLM's, but the outputs aren't
             | accurate and don't seem to be improving quickly.
             | 
              | A YouTuber I watch regularly includes questions they asked
              | ChatGPT, and every single time there's a detailed response
              | in the comments showing how the output is wildly wrong,
              | with multiple mistakes.
             | 
             | I suspect the backlash from disgruntled users is going to
             | hit the industry hard and these models are still extremely
             | expensive to keep updated.
        
               | Thews wrote:
                | Using function calls to look up correct answers already
                | practically eliminates this. It's not widespread yet,
                | but it's already practical for many.
                | 
                | New models aren't being trained specifically on single
                | answers, which will only help.
                | 
                | The expense for the larger models is something to be
                | concerned about. Small models with function calls are
                | already great, especially if you narrow down what they
                | are being used for. Not seeing their utility is just a
                | lack of imagination.
        
           | ithkuil wrote:
           | Dollars are not lost; they are just very indirectly invested
           | into gpu makers (and energy providers)
        
       | wizzard0 wrote:
       | this is an important reminder that all digital electronics is
       | really analog but with good correction circuitry.
       | 
       | and run-time cpu and memory error rates are always nonzero too,
       | though orders of magnitude lower than chip yield rates
        
         | nine_k wrote:
         | CPUs may be very digital inside, but DRAM and flash memory are
         | _highly_ analog, especially MLC flash. DDR4 even has a
         | dedicated training mode [1], during which DRAM and the memory
         | controller learn the quirks of particular data lines and adjust
         | to them, in order to communicate reliably.
         | 
         | [1]: https://www.systemverilog.io/design/ddr4-initialization-
         | and-...
        
       | ajb wrote:
       | So they massively reduce the area lost to defects per wafer, from
       | 361 to 2.2 square mm. But from the figures in this blog, this is
       | massively outweighed by the fact that they only get 46222 sq mm
       | useable area out of the wafer, as opposed to 56247 that the H100
       | gets - because they are using a single square die instead of
       | filling the circular wafer with smaller square dies, they lose
       | 10,025 sq mm!
       | 
       | Not sure how that's a win.
       | 
       | Unless the rest of the wafer is useable for some other customer?
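A rough sketch of the geometry behind those numbers, assuming an ideal 300 mm circle (real wafers have edge exclusion, and Cerebras' quoted 46,222 mm^2 is evidently a bit larger than a perfectly inscribed square, so treat this as ballpark only):

```python
import math

WAFER_D = 300.0  # mm, standard wafer diameter (assumed)

wafer_area = math.pi * (WAFER_D / 2) ** 2         # ~70,686 mm^2 total
inscribed_square = (WAFER_D / math.sqrt(2)) ** 2  # 45,000 mm^2, side ~212 mm
left_over = wafer_area - inscribed_square         # circle area outside the square

print(f"wafer area:       {wafer_area:,.0f} mm^2")
print(f"inscribed square: {inscribed_square:,.0f} mm^2")
print(f"left over:        {left_over:,.0f} mm^2")
```

Rectangular dies diced from the same circle can pack much closer to the full circle area, which is where the ~10,000 mm^2 gap comes from.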
        
         | olejorgenb wrote:
         | Is the wafer itself so expensive? I assume they don't pattern
         | the unused area, so the process should be quicker?
        
           | yannyu wrote:
           | > I assume they don't pattern the unused area, so the process
           | should be quicker?
           | 
           | The primary driver of time and cost in the fabrication
           | process is the number of layers for the wafers, not the
           | surface area, since all wafers going through a given process
           | are the same size. So you generally want to maximize the
           | number of devices per wafer, because a large part of your
           | costs will be calculated at the per-wafer level, not a per-
           | device level.
        
             | olejorgenb wrote:
              | Yes, but my understanding is that the wafer is exposed in
              | multiple steps, so there would still be fewer exposure
              | steps? Probably insignificant compared to all the rest,
              | though. (Etching, moving the wafer, etc.)
             | 
             | EDIT: to clarify - I mean the exposure of one single
             | pattern/layer is done in multiple steps.
             | (https://en.wikipedia.org/wiki/Photolithography#Projection)
        
               | yannyu wrote:
               | The number of exposure steps would be unrelated to the
               | (surface area) size of die/device that you're making. In
               | fact, in semiconductor manufacturing you're typically
               | trying to maximize the number of devices per wafer
               | because it costs the same to manufacture 1 device with 10
               | layers vs 100 devices with 10 layers on the same wafer.
               | This goes so far as to have companies or business units
               | share wafers for prototyping runs so as to minimize cost
               | per device (by maximizing output per wafer).
               | 
               | Also, etching, moving, etc is all done on the entire
               | wafer at the same time generally, via masks and baths.
               | It's less of a pencil/stylus process, and more of a
               | t-shirt silk-screening process.
        
               | gpm wrote:
               | > This goes so far as to have companies or business units
               | share wafers for prototyping runs so as to minimize cost
               | per device
               | 
               | Can this be done in production? Is there a chance that
               | the portion of the wafer cerebras.ai can't fit their
               | giant square in is being used for production of some
               | other companies chips?
        
             | mattashii wrote:
             | Yes, but isn't a big driver of layer costs the cost of the
             | machines to build those layers?
             | 
              | For patterning, a single iteration could be done (example
              | values, probably only ballpark accuracy) on a $300M EUV
              | machine with a 5-year write-off cycle that patterns on
              | average 180 full wafers/hour. Excluding energy usage and
              | service time, each wafer that needs full patterning would
              | cost ~$38. If each wafer only needed half the area
              | patterned, the lithography machine might only spend half
              | its usual time on such a wafer, which could double the
              | throughput of the EUV machine, halving the write-off-based
              | cost component of that patterning step.
             | 
              | Given that each layer generally consists of multiple
              | patterning steps, a 10-20% reduction in those steps could
              | give a meaningful reduction in time spent in the machines
              | whose time per wafer depends on the patterned area.
             | 
             | This of course doesn't help reduce time in polishing or
             | etching (and other steps that happen with whole wafers at a
             | time), so it won't be as straightforward as % reduction in
             | wafer area usage == % reduction in cost, but I wouldn't be
             | surprised if it was a meaningful percentage.
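The example arithmetic above checks out directly (all figures are the hypothetical values from the comment, not real tool pricing):

```python
# Hypothetical example values, not real EUV scanner pricing.
MACHINE_COST = 300e6      # USD
WRITE_OFF_YEARS = 5
WAFERS_PER_HOUR = 180     # full-wafer patterning passes

hours = WRITE_OFF_YEARS * 365 * 24
cost_per_pass = MACHINE_COST / (hours * WAFERS_PER_HOUR)

print(f"write-off cost per full-wafer pass: ${cost_per_pass:.2f}")  # ~$38
# Patterning half the area could nearly double throughput,
# roughly halving this cost component:
print(f"per half-patterned wafer:           ${cost_per_pass / 2:.2f}")
```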
        
               | yannyu wrote:
               | > Yes, but isn't a big driver of layer costs the cost of
               | the machines to build those layers?
               | 
               | Let's say the time spent in lithography step is linear
               | the way you're describing. Even with that, the deposition
               | step beforehand is surface area independent and would be
               | applied across the entire wafer, and takes just as long
               | if not longer than the lithography.
               | 
               | Additionally, if you were going to build a fab ground up
               | for some specific purpose, then you might optimize the
               | fab for those specific devices as you lay out. But most
               | of these companies are not doing that and are simply
               | going through TSMC or a similar subcontractor. So you've
               | got an additional question of how far TSMC will go to
               | accommodate customers who only want to use half a wafer,
               | and whether that's the kind of project they could
               | profitably cater to.
        
           | ajb wrote:
           | Good question. I think the wafer has a cost per area which is
           | fairly significant, but I don't have any figures. There has
           | historically been a push to utilise them more efficiently, eg
           | by building fabs that can process larger wafers. Although
           | mask exposure would be per processed area, I think that there
            | is also some proportion of processing time which is per
           | wafer, so the unprocessed area would have an opportunity cost
           | relating to that.
        
           | pulvinar wrote:
           | There's also no reason they couldn't pattern that area with
           | some other suitable commodity chips. Like how sawmills and
           | butchers put all cuts to use.
        
             | sitkack wrote:
             | Often those areas are used for test chips and structures
             | for the next version. They are effectively free, so you can
             | use them to test out ideas.
        
           | georgeburdell wrote:
           | They probably pattern at least next nearest neighbors for
           | local uniformity. That's just litho though. The rest of the
           | process is done all at once on the wafer
        
           | kristjansson wrote:
           | AIUI Wafer marginal cost is lower than you'd expect. I had
           | $50k in my head, quick google indicates[1] maybe <$20k at
           | AAPL volumes? Regardless seems like the economics for
           | Cerebras would strongly favor yield over wafer area
           | utilization.
           | 
           | [1] https://www.tomshardware.com/tech-industry/tsmcs-wafer-
           | prici...
        
           | addaon wrote:
           | > I assume they don't pattern the unused area
           | 
           | I'm out of date on this stuff, so it's possible things have
           | changed, but I wouldn't make that assumption. It is (used to
           | be?) standard to pattern the entire wafer, with partially-
           | off-the-wafer dice around the edges of the circle. The reason
           | for this is that etching behavior depends heavily on the
           | surrounding area -- the amount of silicon or copper whatever
           | etched in your neighborhood affects the speed of etching for
            | you, which affects line width, and (for a single mask used
           | for the whole wafer) thus either means you need to have more
           | margin on your parameters (equivalent to running on an old
            | process) or have a higher defect rate near the edge of the
           | die (which you do anyway, since you can only take "similar
           | neighborhood" so far). This goes as far as, for hyper-
           | optimized things like SRAM arrays, leaving an unused row and
           | column at each border of the array.
        
             | kurthr wrote:
              | All the process steps are limited by wafers per hour.
             | Lithography (esp EUV) might be slightly faster, but that's
             | not 30% of total steps, since you generally have deposit
             | and etch/implant for every lithography step.
             | 
             | It's close to a dead loss in process cost.
        
         | Scaevolus wrote:
         | Why does their chip have to be rectangular, anyways? Couldn't
         | they cut out a (blocky) circle too?
        
           | guyzero wrote:
           | I've never cut a wafer, but I assume cutting is hard and
           | single straight lines are the easiest.
        
             | sroussey wrote:
             | I wonder if you could... just not cut the wafer at all??
        
               | axus wrote:
               | Might be jumping in without reading, but the chips you
               | cut out of the wafer have to be delivered to physically
               | different locations.
        
               | ajb wrote:
               | Normally yes. But they're using a whole wafer for a
               | single chip! So it's actually a good idea.
               | 
               | I guess the issue is how do you design your routing
               | fabric to work in the edge regions.
               | 
               | Actually I wonder how they are exposing this wafer.
               | Normal chips are exposed in a rectangular batch called a
               | reticle. The reticle mask has repeated patterns across
               | it, and it is then exposed repeatedly across the wafer.
               | So either they have to make a reticle mask the full size
               | of the wafer, which sounds expensive, or they somehow
               | have to precisely align reticle exposures so that the
               | joined edges form valid circuits.
        
               | daedrdev wrote:
                | That's the idea in the article: just one big chip. But
                | the reason dicing is normally done is that there is a
                | pretty high defect rate, so if every wafer has 1-2
                | defects you still get (X-1.5) working devices per wafer.
                | In the article they go into how they avoid this problem
                | (I think it's better fault tolerance, at a cost).
        
               | gpm wrote:
               | The article shows them using a single maximally sized
               | square portion of a circular wafer.
               | 
               | I think the proposal you're responding to is "just use
               | the whole circular wafer without cutting out a square".
        
               | ryao wrote:
               | I suspect this would cause alignment issues since you
               | could literally rotate it into the wrong position when
               | doing soldering. That said, perhaps they could get away
               | with cutting less and using more.
        
               | Dylan16807 wrote:
               | If you want to have nice straight edges to clamp into
               | place, then you only need to shave off four slivers. You
               | can lose a couple percent instead of more than a third.
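A back-of-envelope check of that "couple percent" figure, assuming a 300 mm wafer and four flats each cut 6 mm deep (the depth is my assumed value, not the commenter's):

```python
import math

R = 150.0  # mm, radius of a 300 mm wafer
d = 6.0    # mm, depth of each flat (assumed value)

# Area of one circular segment of depth d (standard segment formula).
segment = R**2 * math.acos((R - d) / R) - (R - d) * math.sqrt(2 * R * d - d**2)
circle = math.pi * R**2

loss = 4 * segment / circle
print(f"area lost to four 6 mm flats: {loss:.1%}")  # ~2%
```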
        
               | CorrectHorseBat wrote:
               | They already have a notch or flat for alignment, which is
               | much more critical during the lithography process than
               | during soldering.
        
               | sharpesttool wrote:
               | You just need a sharpie to mark the top.
        
           | nine_k wrote:
            | Rather, I wonder why they even need to cut off the extra
            | space instead of putting something there. I suppose that the
           | structure of the device is highly rectangular from the
           | logical PoV, so there's nothing useful to put there. I
           | suspect smaller unrelated chips can be produced on these
           | areas along the way.
        
           | yannyu wrote:
           | The cost driver for fabbing out wafers is the number of
           | layers and the number of usable devices per wafer. Higher
           | layer count increases cost and tends to decrease yield, and
           | more robust designs with higher yields increase usable
           | devices per wafer. If circles or other shapes could help with
           | either of those, they would likely be used. Generally the end
           | goal is to have the most usable devices per wafer, so they'll
           | be packed as tightly as possible on the wafer so as to have
           | the highest potential output.
        
             | Scaevolus wrote:
             | Right, but they're making just _one_ usable device per
             | wafer already.
        
           | Qwertious wrote:
           | You need a rectilinear polygon that tessellates, and has the
           | fewest sides possible to minimize the number of cuts
           | necessary. And it would probably help the cutting if the
           | shape is entirely convex, so that cuts can overshoot a bit
           | without damaging anything.
           | 
           | That suggests a rectangle is the only possible shape.
        
             | CorrectHorseBat wrote:
             | If it's just one chip per wafer, why even bother cutting?
        
             | timerol wrote:
             | Why does it need to tessellate if there's only one chip per
             | wafer?
        
         | nine_k wrote:
         | It's a win because they have to test one chip, and don't have
         | to spend resources on connecting the chiplets. The latter costs
         | a lot (though it has other advantages). I suspect that a
         | chiplet-based device with total 900k cores would just be not
         | viable due to the size constraints.
         | 
         | If their routing around the defects is automated enough (given
         | the highly regular structure), it may be a massive economy of
          | effort in testing and packaging the chip.
        
         | sroussey wrote:
         | It's a win if you can use the wafer as opposed to throwing it
         | away.
        
           | kristjansson wrote:
           | A win is a manufacturing process that results in a
           | functioning product. Wafers, etc. aren't so scarce as to
           | demand every mm2 be used on every one every time.
        
         | kristjansson wrote:
         | Additional wafer area would be a marginal increase in
          | performance (+~20% more cores best case) but increases the
         | complexity of their design, and requires they figure out how to
         | package/connect/house/etc. a non-standard shape. A wafer scale
         | chip is already a huge tech risk, why spend more novelty budget
         | on nonessential weirdness?
        
         | ungreased0675 wrote:
         | Why does it have to be a square? There's no need to worry about
         | interchangeable third-party heat sink compatibility. Is it
         | possible to make it an irregular polygon instead of square?
        
       | bee_rider wrote:
       | > Second, a cluster of defects could overwhelm fault tolerant
       | areas and disable the whole chip.
       | 
       | That's an interesting point. In architecture class (which was
       | basic and abstract so I'm sure Cerebras is doing something much
       | more clever), we learned that defects cluster, but this is a good
       | thing. A bunch of defects clustering on one core takes out the
       | core, a bunch of defects not clustering could take out... a bunch
       | of cores, maybe rendering the whole chip useless.
       | 
       | I wonder why they don't like clustering. I could imagine in a
       | network of little cores, maybe enough defects clustered on the
       | network could... sort of overwhelm it, maybe?
       | 
       | Also I wonder how much they benefit from being on one giant
       | wafer. It is definitely cool as hell. But could chiplets eat away
       | at their advantage?
        
       | IshKebab wrote:
       | TSMC also have a manufacturing process used by Tesla's Dojo where
       | you can cut up the chips, throw away the defective ones, and then
       | reassemble working ones into a sort of wafer scale device (5x5
       | chips for Dojo). Seems like a more logical design to me.
        
         | mhh__ wrote:
         | Amazing. I clicked a button in the azure deployment menu
         | today...
        
         | ryao wrote:
         | I had been under the impression that Nvidia had done something
         | similar here, but they did not talk about deploying the space
         | saving design and instead only talked about the server rack
         | where all of the chips on the mega wafer normally are.
         | 
         | https://www.sportskeeda.com/gaming-tech/what-nvlink72-nvidia...
        
           | wmf wrote:
           | That shield is just a prop that looks nothing like the real
           | product. The NVL72 rack doesn't use any wafer-scale-like
           | packaging.
        
             | ryao wrote:
             | It would be nice if they made it real. The cost savings
             | from not needing so much material should be fantastic.
        
         | bee_rider wrote:
         | Is this similar to a chiplet design? Chiplets have been a thing
         | for a while, so I assume Cerebras avoided them on purpose.
        
           | IshKebab wrote:
           | I don't think so - chiplets are much smaller and I think the
           | process is different.
        
       | iataiatax10 wrote:
        | It's not surprising that they found a solution to the yield
        | problem. Maybe they could elaborate more on the power
        | distribution and dissipation problem?
        
       | highfrequency wrote:
       | To summarize: localize defect contamination to a very small unit
       | size, by making the cores tiny and redundant.
       | 
       | Analogous to a conglomerate wrapping each business vertical in a
       | limited liability veil so that lawsuits and bankruptcy do not
       | bring down the whole company. The smaller the subsidiaries, the
       | less defect contamination but also the less scope for
       | frictionless resource and information sharing.
        
       | exabrial wrote:
       | I have a dumb question. Why isn't silicon sold in cubes instead
       | of cylinders?
        
         | bigmattystyles wrote:
         | no matter how you orient a circle on a plane, it's the same
        
         | amelius wrote:
         | The silicon ingots have a rotating production process that
         | results in cylinders, not bricks.
        
           | exabrial wrote:
           | fascinating, I figured it was something like that. maybe we
           | should produce hexagonal, instead of square, chip designs
        
         | kryptiskt wrote:
         | Crystalline silicon is produced with the Czochralski process
         | (https://en.wikipedia.org/wiki/Czochralski_method), which
         | produces a round ingot. So you'd have to cut away perfectly
         | fine silicon to make something squarish.
        
       | NickHoff wrote:
       | Neat. What about power density?
       | 
       | An H100 has a TDP of 700 watts (for the SXM5 version). With a die
       | size of 814 mm^2 that's 0.86 W/mm^2. If the cerebras chip has the
        | same power density, that means a cerebras TDP of 39.8 kW.
       | 
       | That's a lot. Let's say you cover the whole die area of the chip
       | with water 1 cm deep. How long would it take to boil the water
       | starting from room temperature (20 degrees C)?
       | 
       | amount of water = (die area of 46225 mm^2) * (1 cm deep) *
       | (density of water) = 462 grams
       | 
       | energy needed = (specific heat of water) * (80 kelvin difference)
       | * (462 grams) = 154 kJ
       | 
       | time = 154 kJ / 39.8 kW = 3.9 seconds
       | 
       | This thing will boil (!) a centimeter of water in 4 seconds. A
       | typical consumer water cooler radiator would reduce the
       | temperature of the coolant water by only 10-15 C relative to
       | ambient, and wouldn't like it (I presume) if you pass in boiling
       | water. To use water cooling you'd need some extreme flow rate and
       | a big rack of radiators, right? I don't really know. I'm not even
       | sure if that would work. How do you cool a chip at this power
       | density?
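[Ed.: the back-of-envelope arithmetic above can be checked with a short Python sketch. The H100 TDP, die areas, and uniform power density are the parent comment's assumptions, not measured figures.]

```python
# Back-of-envelope check of the parent comment's numbers.
# Assumed inputs: H100 TDP of 700 W over 814 mm^2, Cerebras die of
# 46,225 mm^2, and the same power density on both chips.
H100_TDP_W = 700.0
H100_AREA_MM2 = 814.0
WSE_AREA_MM2 = 46225.0

power_density = H100_TDP_W / H100_AREA_MM2        # ~0.86 W/mm^2
wse_power_w = power_density * WSE_AREA_MM2        # ~39.8 kW

# 1 cm of water over the die: mm^2 -> cm^2, times 1 cm depth, times
# water's density of 1 g/cm^3, so cm^3 equals grams.
water_g = (WSE_AREA_MM2 / 100.0) * 1.0            # ~462 g

# Energy to heat 20 C water to its 100 C boiling point,
# using water's specific heat of 4.186 J/(g*K).
energy_j = 4.186 * (100.0 - 20.0) * water_g       # ~155 kJ

seconds_to_boiling = energy_j / wse_power_w       # ~3.9 s
print(round(seconds_to_boiling, 1))
```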
        
         | lostlogin wrote:
         | If rack mounted, you are ending up with something like a
         | reverse power station.
         | 
         | So why not use it as an energy source? Spin a turbine.
        
           | sebzim4500 wrote:
           | If my very stale physics is accurate then even with perfect
           | thermodynamic efficiency you would only recover about a third
           | of the energy that you put into the chips.
        
             | dylan604 wrote:
              | 1/3 > 0, so even if you don't get a $0 energy bill, I'd
              | venture that any company that could recover a third of its
              | energy bill would be happy
        
           | bentcorner wrote:
           | I'm aware of the efficiency losses but I think it would be
           | amusing to use that turbine to help power the machine
           | generating the heat.
        
             | twic wrote:
             | Hey, we're building artificial general intelligence, what's
             | a little perpetual motion on the side?
        
           | kristjansson wrote:
            | If you let the chip actually boil enough water to run a turbine
           | you're going to have a hard time keeping the magic smoke
           | inside. Much better to run at reasonable temps and try to
           | recover energy from the waste heat.
        
             | ericye16 wrote:
             | What if you chose a refrigerant with a lower boiling point?
        
               | kristjansson wrote:
               | That's basically the principle of binary cycle[1]
               | generators. However for data center waste heat recovery,
               | I'd think you'd want to use a more stable fluid for
               | cooling, and then pump it to a separate closed-loop
               | binary-cycle generator. No reason to make your datacenter
               | cooling system also deal with high pressure fluids, and
               | moving high pressure working fluid from 1000s of chips to
               | a turbine of sufficient size, etc.
               | 
               | [1]: https://en.wikipedia.org/wiki/Binary_cycle
        
           | renhanxue wrote:
           | There's a bunch of places in Europe that use waste heat from
           | datacenters in district heating systems. Same thing with
           | waste heat from various industrial processes. It's relatively
           | common practice.
        
         | buildbot wrote:
         | A Very Fancy cooling engine: https://www.eetimes.com/powering-
         | and-cooling-a-wafer-scale-d...
        
         | jwan584 wrote:
         | A good talk on how Cerebras does power & cooling (8min)
         | https://www.youtube.com/watch?v=wSptSOcO6Vw&ab_channel=Appli...
        
         | flopsamjetsam wrote:
         | Minor correction, the keynote video says ~20 kW
        
         | throwup238 wrote:
         | The machine that actually holds one of their wafers is almost
         | as impressive as the chip itself. Tons of water cooling
         | channels and other interesting hardware for cooling.
        
         | Paul_Clayton wrote:
         | The enthalpy of vaporization of water (at standard pressure) is
         | listed by Wikipedia[1] as 2.257 kJ/g, so _boiling_ 462 grams
         | would require an additional 1.04 MJ, adding 26 seconds.
         | Cerebras claims a  "peak sustained system power of 23kW" for
         | the CS-3 16 Rack Unit system[2], so clearly the power density
         | is lower than for an H100.
         | 
         | [1]
         | https://en.wikipedia.org/wiki/Enthalpy_of_vaporization#Other...
         | [2] https://cerebras.ai/product-system/
        
           | twic wrote:
           | On a tangent: has anyone built an active cooling system which
           | operates in a partial vacuum? At half atmospheric pressure,
           | water boils at around 80 C, which i believe is roughly the
           | operating temperature for a hard-working chip. You could pump
           | water onto the chip, have it vapourise, taking away all that
           | heat, then take the vapour away and condense it at the fan
           | end.
           | 
           | This is how heat pipes work, i believe, but heat pipes aren't
           | pumped, they rely entirely on heat-driven flow. I would have
           | thought there were pumped heat pipes. Are they called
           | something else?
           | 
           | It's also not a refrigerator, because those use a pump to
           | pressurise the coolant in its gas phase, whereas here you
           | would only be pumping the water.
        
             | pants2 wrote:
             | No need to bother with a partial vacuum when ethanol boils
             | at around 80 C as well and doesn't destroy electronics. I'm
             | not aware of any active cooling systems utilizing this
             | though.
        
               | ddxxdd wrote:
               | I could argue that ethanol has 1/3 the latent heat of
               | vaporization of water, and would boil off 3 times
               | quicker. However, what ultimately matters is the rate of
               | heat transfer, so my nitpick may be irrelevant.
        
               | pezezin wrote:
               | May I introduce you to the glorious vodka cooled PC?
               | https://www.youtube.com/watch?v=IYTJfLyo_vE
        
             | TehCorwiz wrote:
             | I found this review from 2019 of mechanically pumped heat
             | pipe technologies. I skimmed the intro. Looks like it
             | already has a foothold in aerospace.
             | 
             | https://www.sciencedirect.com/science/article/abs/pii/S1359
             | 4...
        
             | Dylan16807 wrote:
             | > This is how heat pipes work, i believe, but heat pipes
             | aren't pumped, they rely entirely on heat-driven flow. I
             | would have thought there were pumped heat pipes.
             | 
             | Do you have a particular benefit in mind that a pump would
             | help with?
        
       | bigmattystyles wrote:
       | When I was a kid, I used to get intel keychains with a die in
       | acrylic - good job to whoever thought of that to sell the fully
       | defective chips.
        
         | dylan604 wrote:
         | wow, fancy with the acrylic. lots of places just place a chip
         | (I'm more familiar with RAM sticks) on a keychain and call it a
         | day.
        
           | bigmattystyles wrote:
           | they're all over eBay, I just checked - the one I was
           | thinking of, that I think I had is going for $150 - the
           | things you get rid of....
        
             | bradyd wrote:
             | Electronic Goldmine sells entire scrapped 200mm wafers for
             | $15 or less
             | 
             | https://theelectronicgoldmine.com/search?options%5Bprefix%5
             | D...
        
           | kragen wrote:
           | Those aren't just a chip; they're an epoxy package with a
           | leadframe and a chip inside it. To put just a chip on a
           | keychain, you'd have to drill a hole through it, which is
           | difficult because silicon is so brittle--almost like drilling
           | a hole in glass. Then, when someone put it onto a keyring,
           | the keyring would form a lever that applies a massive force
           | to the edge of the brittle hole, shattering the brittle
           | silicon. Potting the chip in acrylic resin is a much cheaper
           | solution that works better.
        
       | Neywiny wrote:
       | Understanding that there's inherent bias by them being
       | competitors of the other companies, but still this article seems
       | to make some stretches. If you told me you had an 8% core defect
       | rate reduced 100x, I'd assume you got to close to 99% enablement.
       | The table at the end shows... Otherwise.
       | 
       | They also keep flipping between cores, SMs, dies, and maybe other
       | block sizes. At the end of the day I'm not very impressed. They
       | seemingly have marginally better yields despite all that effort.
        
         | sfink wrote:
         | I think you're missing the point. The comparison is not between
         | 93% and 92%. The comparison is between what they're getting
         | (93%) and what you'd get if you scaled up the usual process to
         | the core size they're using (0%). They are doing something
         | different (namely: a ~whole wafer chip) that isn't possible
         | without massively boosting the intra-chip redundancy. (The
         | usual process stops working once you no longer have any extra
         | dies to discard.)
         | 
         | > Despite having built the world's largest chip, we enable 93%
         | of our silicon area, which is higher than the leading GPU
         | today.
         | 
         | The important part is building the largest chip. The icing on
         | the top is that the enablement is not lower. Which it would be
         | without the routing-to-spare-cores magic sauce.
         | 
         | And the differing terminology is because they're talking about
         | differing things? You could call an SM a core, but it kind of
         | contains (heterogeneous) cores itself. (I've no idea whether
         | intra-SM cores can be redundant to boost yield.) A die is the
         | part you break off and build a computer out of, it may contain
         | a bunch of cores, a wafer can be broken up into multiple dies
         | but for Cerebras it isn't.
         | 
         | If NVIDIA were to go and build a whole-wafer die, they'd do
         | something similar. But Cerebras did it and got it to work.
         | NVIDIA hasn't gotten into that space yet, so there's no point
         | in building a product that you can't sell to a consumer or even
         | a data center that isn't built around that exact product (or to
         | contain a Balrog).
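[Ed.: the "0%" figure for an unredundant wafer-sized die follows from the standard Poisson yield model, Y = e^(-D*A). A minimal sketch, assuming the defect density implied by the article's numbers (~46 defects over 46,225 mm^2, i.e. ~0.001/mm^2):]

```python
import math

# Defect density implied by the article's figures (assumed, not measured).
D = 46 / 46225.0  # defects per mm^2

def poisson_yield(area_mm2, defect_density=D):
    """Probability that a die of the given area has zero defects."""
    return math.exp(-defect_density * area_mm2)

# An H100-sized die (~814 mm^2) is defect-free roughly 44% of the time;
# a monolithic wafer-sized die (~46,225 mm^2) essentially never is,
# which is why spare cores and rerouting are mandatory at wafer scale.
print(poisson_yield(814))
print(poisson_yield(46225))
```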
        
           | fspeech wrote:
           | There is nothing inherently good about wafer scale. It's
           | actually harder to dissipate heat and enable hybrid bonding
           | with DRAM. So the gp is entirely correct that you need to
           | actually show higher silicon utilization to be even
           | considered as being something worthwhile.
        
           | Neywiny wrote:
           | I think I'll still stand by my viewpoint. They said:
           | 
           | > On the Cerebras side, the effective die size is a bit
           | smaller at 46,225mm2. Applying the same defect rate, the
           | WSE-3 would see 46 defects. Each core is 0.05mm2. This means
           | 2.2mm2 in total would be lost to defects.
           | 
           | So ok they claim that they should see (46225-2.2)/46225 =
           | 99.995%. Doing the same math for their Nvidia numbers it's
           | 99.4%. And yet in practice neither approach got to these
           | numbers. Nowhere near it. I just feel like the whole article
           | talks about all this theory and numbers and math of how
           | they're so much better but in practice it's meaningless.
           | 
           | So what I'm not seeing is why it'd be impossible for all the
           | H100s on a wafer to be interconnected and call it a day.
           | You'd presumably get 92/93 = 98.9% of the performance and,
           | here's the kicker, no need to switch to another architecture.
            | I don't know where your 0% number came from. Nothing about
           | this article says that a competitor doing the same scaling to
           | wafer scale would get 0%, just a marginal decrease in how
           | many cores made it through fab.
           | 
           | Fundamentally I am not convinced from this article that
           | Cerebras has done something in their design that makes this
           | possible. All I'm seeing is that it'd perform 1% faster.
           | 
           | Edit: thinking a bit more on it, to me it's like they said
           | TSMC has a guy with a sledgehammer who smashes all the wafers
           | and their architecture snaps a tiny bit cleaner. But they
           | haven't said anything about firing the guy with the
           | sledgehammer. Their paragraph before the final table says
           | that this whole exercise is pretty much meaningless because
           | their numbers are made up about competitors and they aren't
           | even the right numbers to be using. Then the table backs up
           | my paraphrase.
        
       | wendyshu wrote:
       | What's yield?
        
         | elpocko wrote:
         | When driving a car, to yield means that merging drivers must
         | prepare to stop if necessary to let a driver on another
         | approach proceed.
         | 
         | That's not necessary if you have strong weaponry mounted on
          | your vehicle: research shows that you don't have to stop if
         | all the other drivers are dead.
        
         | wmf wrote:
         | It's the fraction of usable product from a manufacturing
         | process.
        
       | anonymousDan wrote:
       | Very interesting. Am I correct in saying that fault tolerance
       | here is with respect to 'static' errors that occur during
       | manufacturing and are straightforward to detect before reaching
       | the customer? Or can these failures potentially occur later on
       | (and be tolerated) during the normal life of the chip?
        
       | abrookewood wrote:
       | Looking at the H100 on the left, why is the chip yield (72) based
       | on a circular layout/constraint? Why do they discard all of the
       | other chips that fall outside the circle?
        
         | flumpcakes wrote:
          | Because the circle is the physical silicon. Any chips that fall
          | outside the circle are only partial chips; they would be
          | physically missing part of the die.
        
         | donavanm wrote:
         | AFAIK all wafer ingots are cylinders, which means the wafers
         | themselves are a circular cross section. So manufacturing is
         | binpacking rectangles in to a circle. Plus different
         | effects/defects in the chips based on the distance from the
         | edge of the wafer.
         | 
          | So I believe it's the opposite: why are they representing the
          | larger square and implying lower yield off the wafer in space
          | that doesn't practically exist?
        
         | therealcamino wrote:
         | That's just the shape of the wafer. I don't know why the
         | diagram continued the grid outside it.
        
       | ryao wrote:
       | > Take the Nvidia H100 - a massive GPU weighing in at 814mm2.
       | Traditionally this chip would be very difficult to yield
       | economically. But since its cores (SMs) are fault tolerant, a
       | manufacturing defect does not knock out the entire product. The
       | chip physically has 144 SMs but the commercialized product only
       | has 132 SMs active. This means the chip could suffer numerous
       | defects across 12 SMs and still be sold as a flagship part.
       | 
       | Fault tolerance seems to be the wrong term to use here. If I
       | wrote this, I would have written redundant.
        
         | jjk166 wrote:
         | Redundant cores lead to a fault tolerant chip.
        
           | ryao wrote:
           | ECC memory is fault tolerant. It repairs issues on the fly
           | without disabling hardware. This on the other hand is merely
           | redundant to handle manufacturing defects. If they make a
           | mistake and ship a bad core that malfunctions at runtime, it
           | is not going to tolerate that.
        
             | jjk166 wrote:
             | Redundancy is a method of providing fault tolerance, the
             | existence of other methods doesn't make it less fault
             | tolerant.
             | 
             | Nothing is tolerant to all possible faults. Fault tolerance
             | refers to being able to tolerate specific types of faults
             | under specific conditions.
             | 
             | Fault tolerant is the proper term for this.
        
       | gunalx wrote:
       | My biggest question is who are the buyers?
        
         | asdasdsddd wrote:
         | mostly 1 ai company in the middle east last I heard
        
       | bcatanzaro wrote:
       | This is a strange blog post. Their tables say:
       | 
       | Cerebras yields 46225 * .93 = 43000 square millimeters per wafer
       | 
       | NVIDIA yields 58608 * .92 = 54000 square millimeters per wafer
       | 
       | I don't know if their numbers are correct but it is a strange
       | thing for a startup to brag that it is worse than a big company
       | at something important.
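[Ed.: the parent's comparison can be checked directly from the figures quoted above (both taken from the blog post's tables as reported in the comment):]

```python
# Enabled (usable) silicon area per wafer, per the quoted figures.
cerebras_enabled_mm2 = 46225 * 0.93  # one wafer-scale chip, ~43,000 mm^2
nvidia_enabled_mm2 = 58608 * 0.92    # many H100 dies, ~54,000 mm^2

print(round(cerebras_enabled_mm2), round(nvidia_enabled_mm2))
```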
        
         | saulpw wrote:
         | Being within striking distance of SOTA while using orders of
         | magnitude fewer resources is worth bragging about.
        
       | RecycledEle wrote:
       | IIRC, it was Carl Bruggeman's IPSA Thesis that showed us how to
       | laser out bad cores.
        
       | oksurewhynot wrote:
       | I live in a small city/large town that has a large number of
       | craft breweries. I always marveled at how these small operations
       | were able to churn out so many different varieties. Turns out
       | they are actually trying to make their few core recipes but the
       | yield is so low they market the less consistent results as...all
       | that variety I was so impressed with.
        
       | trhway wrote:
       | 56K mm2 vs 46K mm2. I wonder why they wouldn't use the smart
       | routing/etc to use more fitting shape than square and thus use
       | more of the wafer.
        
       | ilaksh wrote:
       | I assume people are aware, but Cerebras has a web demo and API
       | which is open to try and it is 2000 tokens per second for Llama
       | 3.3 70b and 1000 tokens per second for Llama 3.1 405b.
       | 
       | https://cerebras.ai/inference
        
       | Fokamul wrote:
        | Does anyone have pictures of what it looks like inside these
        | servers?
        
       | hoseja wrote:
       | Why square chip? Make it an octagon or something.
        
       | aurareturn wrote:
       | Bear case on Cerebras:
       | https://irrationalanalysis.substack.com/p/cerebras-cbrso-equ...
       | 
       | Note: This author is heavily invested in Nvidia.
        
       | larsrc wrote:
       | How do these much smaller cores compare in computing power to the
       | bigger ones? They seem to implicitly claim that a core is a core
       | is a core, but surely one gets something extra out of the much
       | bigger one?
        
       | jstrong wrote:
       | I would like a workstation with 900k cores. lmk when these things
       | are on ebay.
        
         | riskable wrote:
         | Just need that 20kW connection to your energy provider.
        
       | aaroninsf wrote:
        | The number of people ITT who have absorbed the world-weary
        | AI-is-a-bubble skepticism...
       | 
       | I'm just gonna say, with serene certainty,
       | 
       | the economic order we inhabit going through phase change is
       | certain. From certain myopic perspectives we can shoehorn that
       | into a narrative of cyclical patterns in the tech industry or
       | financial markets etc etc.
       | 
       | This is not going to be that. No more than the transformation of
        | American retail can be shoehorned to kind of look like it used to if
       | you don't know anything at all about what contemporary
       | international trade and logistics and oligopoly actually mean in
       | terms of what is coming into your home from where and why it is
       | or isn't cheap.
       | 
        | Where we'll be in 10 or 20 years is literally unimaginable today;
       | and trying to navigate that wrt traditional landmarks... oof.
        
       | lofaszvanitt wrote:
       | A well written, easy to understand article.
        
       ___________________________________________________________________
       (page generated 2025-01-16 23:01 UTC)