[HN Gopher] Hacked Nvidia 4090 GPU driver to enable P2P
___________________________________________________________________
Hacked Nvidia 4090 GPU driver to enable P2P
Author : nikitml
Score : 813 points
Date : 2024-04-12 09:27 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jagrsw wrote:
| Was it George himself, or a person working for a bounty that was
| set up by tinycorp?
|
| Also, a question for those knowledgeable about the PCI subsys: it
| looked like something NVIDIA didn't care about, rather than
| something they actively wanted to prevent, no?
| mtlynch wrote:
| Commits are by geohot, so it looks like George himself.
| throw101010 wrote:
| I've seen him work on tinygrad on his Twitch livestream
| couple times, so more than likely him indeed.
| squarra wrote:
| He also documented his progress on the tinygrad discord
| throwaway8481 wrote:
| I feel like I should say something about discord not being a
| suitable replacement for a forum or bugtracker.
| guywhocodes wrote:
| We are talking about a literal monologue while poking at a
| driver for a few hours, this wasn't a huge project.
| toast0 wrote:
| PCI devices have always been able to read and write to the
| shared address space (subject to IOMMU); most frequently used
| for DMA to system RAM, but not limited to it.
|
| So, poking around to configure the device to put the whole VRAM
| in the address space is reasonable, subject to support for
| resizable BAR or just having a fixed size large enough BAR. And
| telling one card to read/write from an address that happens to
| be mapped to a different card's VRAM is also reasonable.
|
| I'd be interested to know if PCIe switching capacity will be a
| bottleneck, or if it'll just be the point-to-point links and
| VRAM that bottleneck. Saving a bounce through system RAM
| should help in either case, though.
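|
| For anyone who wants to see what their own setup reports: a
| minimal sketch (assuming a CUDA build of PyTorch and at least
| two GPUs) that simply asks the runtime whether it will allow
| peer access between device 0 and device 1. On a stock 4090
| driver this is expected to come back False, which is exactly
| what the patched driver changes.
|
|     import torch
|
|     # Ask the CUDA runtime whether GPU 0 may directly access
|     # GPU 1's memory over PCIe (peer-to-peer).
|     if torch.cuda.device_count() >= 2:
|         ok = torch.cuda.can_device_access_peer(0, 1)
|         print("P2P cuda:0 -> cuda:1:", ok)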
| namibj wrote:
| A fixed large BAR exists in some older accelerator cards, e.g.
| IIRC the MI50/MI60 from AMD (the data center variants of the
| Radeon VII, the first GPU with PCIe 4.0, also famous for
| dominating memory bandwidth until the RTX 40-series took that
| claim back; it had 16GB of HBM delivering 1TB/s of memory
| bandwidth).
|
| It's notably not compatible with some legacy boot processes
| and, IIRC, with 32-bit kernels in general, so consumer cards
| had to wait for resizable BAR to get the benefits of a large
| BAR: namely, direct flat memory mapping of VRAM, so CPUs and
| PCIe peers can read and write all of VRAM directly without
| dancing through a command interface with doorbell registers.
| AFAIK it also lets a GPU talk directly to NICs and NVMe drives
| by running the driver in GPU code (I'm not sure how/if they let
| you properly interact with doorbell registers, but polled
| io_uring as an ABI would be no problem; I wouldn't be surprised
| if some NIC firmware already allows offloading this).
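|
| A rough way to see whether a large/resizable BAR is actually in
| effect is to read the region sizes straight out of sysfs; a
| small sketch (the PCI address below is only a placeholder, use
| lspci to find your GPU's):
|
|     import pathlib
|
|     # Each line of the sysfs "resource" file is "start end
|     # flags" in hex, one line per PCI region (BARs and ROM).
|     dev = "0000:01:00.0"  # placeholder PCI address
|     res = pathlib.Path(f"/sys/bus/pci/devices/{dev}/resource")
|     for i, line in enumerate(res.read_text().splitlines()):
|         start, end, flags = (int(x, 16) for x in line.split())
|         if end:
|             size_gib = (end - start + 1) / 2**30
|             print(f"region {i}: {size_gib:.2f} GiB")
|
| With resizable BAR enabled on a 24GB card, one of the regions
| should cover the whole VRAM rather than the classic 256MB
| aperture.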
| jsheard wrote:
| It'll be nice while it lasts, until they start locking this down
| in the firmware instead on future architectures.
| mnau wrote:
| Sure, but that was something that was always going to happen.
|
| So it's better to have it at least for one generation instead
| of no generation.
| HPsquared wrote:
| Is this one of those features that's disabled on consumer cards
| for market segmentation?
| mvkel wrote:
| Sort of.
|
| An imperfect analogy: a small neighborhood of ~15 houses is
| under construction. Normally it might have a 200 kVA
| transformer sitting at the corner, which provides appropriate
| power from the grid.
|
| But there is a transformer shortage, so the contractor installs
| a commercial-grade 1,250 kVA transformer. It can power many
| more houses than required, so it's operating way under
| capacity.
|
| One day, a resident decides he wants to start a massive grow
| farm, and figures out how to activate that extra transformer
| capacity just for his house. That "activation" is what geohot
| found.
| bogwog wrote:
| That's a poor analogy. The feature is built in to the cards
| that consumers bought, but Nvidia is disabling it via
| software. That's why a hacked driver can enable it again. The
| resident in your analogy is just freeloading off the
| contractor's transformer.
|
| Nvidia does this so that customers that need that feature are
| forced to buy more expensive systems instead of building a
| solution with the cheaper "consumer-grade" cards targeted at
| gamers and enthusiasts.
| bpye wrote:
| This isn't even the first time a hacked driver has been
| used to unlock some HW feature -
| https://github.com/DualCoder/vgpu_unlock
| captcanuk wrote:
| There was also this https://hackaday.com/2013/03/18/hack-
| removes-firmware-crippl... using resistors, and a different one
| before that which used a graphite pencil lead to enable
| functionality.
| segfaultbuserr wrote:
| Except that in the computer hardware world, the 1250 kVA
| transformer was used not because of shortage, but because of
| the fact that making a 1250 kVA transformer on the existing
| production line and selling it as 200 kVA, is cheaper than
| creating a new production line separately for making 200 kVA
| transformers.
| m3kw9 wrote:
| Where is the hack in this analogy
| pixl97 wrote:
| Taking off the users panel on the side of their house and
| flipping it to 'lots of power' when that option had
| previously been covered up by the panel interface.
| cesarb wrote:
| Except that this "lots of power" option does not exist.
| What limits the amount of power used is the circuit
| breakers and fuses on the panel, which protect the wiring
| against overheating by tripping when too much power is
| being used (or when there's a short circuit). The
| resident in this analogy would need to ensure that not
| only the transformer, but also the wiring leading to the
| transformer, can handle the higher current, and replace
| the circuit breaker or fuses.
|
| And then everyone on that neighborhood would still lose
| power, because there's also a set of fuses _upstream_ of
| the transformer, and they would be sized for the correct
| current limit even when the transformer is oversized.
| These fuses also protect the wiring upstream of the
| transformer, and their sizing and timings is coordinated
| with fuses or breakers even further upstream so that any
| fault is cleared by the protective device closest to the
| fault.
| stavros wrote:
| There are analogies, and then there's this.
| Dylan16807 wrote:
| They're pointing out how the analogy doesn't work, so
| it's fine.
|
| Nobody's taking more than their share of any resources
| when they enable this feature.
| hatthew wrote:
| And then because this residential neighborhood now has
| commercial grade power, the other lots that were going to
| have residential houses built on them instead get combined
| into a factory, and the people who want to buy new houses in
| town have to pay more since residential supply was cut in
| half.
| HPsquared wrote:
| Excellent analogy of the other side of this issue.
| zten wrote:
| This represents pretty well how gamers (residential buyers)
| are going to feel when the next generation of consumer
| cards are scooped up for AI.
| cesarb wrote:
| That's a bad analogy, because in your example, the consumer
| is using more of a shared resource (the available
| transformer, wiring, and generation capacity). In the case of
| the driver for a local GPU card, there's no sharing.
|
| A better example would be one in which the consumer has a
| dedicated transformer. For instance, a small commercial
| building which directly receives 3-phase 13.8 kV power; these
| are very common around here, and these buildings have their
| own individual transformers to lower the voltage to 3-phase
| 127V/220V.
| rustcleaner wrote:
| I am sure many will disagree-vote me, but I want to see this
| practice in consumer devices either banned or very heavily
| taxed.
| xandrius wrote:
| You're right. Especially because you didn't present your
| reasons.
| yogorenapan wrote:
| Curious as to your reasoning.
| wmf wrote:
| Of course power users want an end to price discrimination
| because it benefits them... at a cost of more expensive
| products for the masses.
| imtringued wrote:
| Well, they have zero incentives to implement and test this
| feature for consumer GPUs. Multi GPU setups never really worked
| that well for gaming.
| llm_trw wrote:
| Skimming the readme this is p2p over PCIe and not NVLink in case
| anyone was wondering.
| klohto wrote:
| AFAIK the 4090 doesn't support PCIe 5.0, so you are limited to
| 4.0 speeds. Still an improvement.
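|
| Back-of-the-envelope numbers for what that cap means (raw x16
| link rate, per direction, before packet/protocol overhead):
|
|     # GT/s per lane * 128b/130b encoding * 16 lanes, in GB/s
|     for gen, gts in (("PCIe 4.0", 16), ("PCIe 5.0", 32)):
|         gbs = gts * (128 / 130) * 16 / 8
|         print(f"{gen} x16: ~{gbs:.1f} GB/s each way")
|
| So roughly 31.5 GB/s vs 63 GB/s of raw link bandwidth per
| direction.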
| formerly_proven wrote:
| RTX 40 doesn't have NVLink on the PCBs, though the silicon has
| to have it, since some sibling cards support it. I'd expect it
| to be fused off.
| HeatrayEnjoyer wrote:
| How to unfuse it?
| magicalhippo wrote:
| I don't know about this particular scenario, but typically
| fuses are small wires or resistors that are overloaded so
| they irreversibly break the connection. Hence the name.
|
| Either done during manufacture or as a one-time
| programming[1][2].
|
| Though reprogrammable configuration bits are sometimes also
| called fuse bits. The Atmega328P of Arduino fame uses flash[3]
| for its "fuses".
|
| [1]: https://www.nxp.com/docs/en/application-
| note/AN4536.pdf
|
| [2] https://www.intel.com/programmable/technical-
| pdfs/654254.pdf
|
| [3]: https://ww1.microchip.com/downloads/en/DeviceDoc/Atmel
| -7810-...
| HeatrayEnjoyer wrote:
| Wires, flash, and resistors can be replaced
| mschuster91 wrote:
| Not at the scale we're talking about here. These
| structures are _very_ thin, far thinner than bond wires, which
| are about the largest structure size you can handle without a
| very, very specialized lab. And you'd need to unsolder the
| chip, de-cap it, hope the fuse wire you're trying to override
| is on the top layer, and that you can re-cap the chip
| afterwards and successfully solder it back on again.
|
| This may be workable for a nation state or a billion
| dollar megacorp, but not for your average hobbyist
| hacker.
| z33k wrote:
| You're absolutely right. In fact, some billion dollar
| megacorps use fuses as a part of hardware DRM for this
| reason.
| magicalhippo wrote:
| These are part of the chip, thus microscopic and very
| inaccessible.
|
| There are some good images here[1] of various such fuses,
| both pristine and blown. Here's[2] a more detailed
| writeup examining one type.
|
| It's not something you fix with a soldering iron.
|
| [1]: https://semiengineering.com/the-benefits-of-
| antifuse-otp/
|
| [2]: https://www.eetimes.com/a-look-at-metal-efuses/
| metadat wrote:
| I miss the days when you could do things like connecting
| the L5 bridges on the surface of the AMD Athlon XP
| Palomino [0] CPU packaging with a silver trace pen to
| transform them into fancier SMP multi-socket capable
| Athlon MPs, e.g. Barton [1].
|
| https://arstechnica.com/civis/threads/how-did-you-unlock-
| you...
|
| Some folks even got this working with only a pencil,
| haha.
|
| Nowadays, silicon designers have found highly effective
| ways to close off these hacking avenues, with techniques
| such as the microscopic, nearly invisible, and, as the
| parent post mentions, totally inaccessible e-fuses.
|
| [0] https://upload.wikimedia.org/wikipedia/commons/7/7c/K
| L_AMD_A...
|
| [1] https://en.wikichip.org/w/images/a/af/Atlhon_MP_%28.1
| 3_micro...
| aceazzameen wrote:
| I'm one of those folks that did it with a pencil. Haha.
| Maybe I was lucky? That was my first overclock and it ran
| pretty well.
| mepian wrote:
| Use a Focused Ion Beam instrument.
| llm_trw wrote:
| A cursory google search suggests that it's been removed at
| the silicon level.
| steeve wrote:
| Some do: https://wccftech.com/gigabyte-geforce-rtx-4090-pcb-
| shows-lef...
| jsheard wrote:
| I'm pretty sure that's just a remnant of a 3090 PCB design
| that was adapted into a 4090 PCB design by the vendor. None
| of the cards based on the AD102 chip have functional
| NVLink, not even the expensive A6000 Ada workstation card
| or the datacenter L40 accelerator, so there's no reason to
| think NVLink is present on the silicon anymore below the
| flagship GA100/GH100 chips.
| klohto wrote:
| FYI, this should work on most 40xx cards[1]
|
| [1]
| https://github.com/pytorch/pytorch/issues/119638#issuecommen...
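|
| Once the patched driver is installed, a quick way to exercise
| it from PyTorch is just a direct cross-device copy; a minimal
| sketch (whether it actually takes the direct PCIe peer path
| rather than staging through host memory depends on the driver
| reporting peer access):
|
|     import torch
|
|     # Allocate on GPU 0 and copy straight to GPU 1.
|     a = torch.randn(1024, 1024, device="cuda:0")
|     b = a.to("cuda:1", non_blocking=True)
|     torch.cuda.synchronize()
|     print(b.device)  # cuda:1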
| clbrmbr wrote:
| If we end up with a compute governance model of AI control [1],
| this sort of thing could get your door kicked in by the CEA
| (Compute Enforcement Agency).
|
| [1] https://podcasts.apple.com/us/podcast/ai-safety-
| fundamentals...
| logicchains wrote:
| Looks like we're only a few years away from a bona fide
| cyberpunk dystopia, in which only governments and megacorps are
| allowed to use AI, and hackers working on their own hardware
| face regular raids from the authorities.
| tomoyoirl wrote:
| Mere raids from the authorities? I thought EliY was out there
| proposing airstrikes.
| the8472 wrote:
| In the sense that any other government regulation is also
| ultimately backed by the state's monopoly on legal use of
| force when other measures have failed.
|
| And contrary to what some people are implying he also
| proposes that everyone is subject to the same limitations,
| big players just like individuals. Because the big players
| haven't shown much of a sign of doing enough.
| tomoyoirl wrote:
| > In the sense that any other government regulation is
| also ultimately backed by the state's monopoly on legal
| use of force when other measures have failed.
|
| Good point. He was only ("only") _really_ calling for
| international cooperation and literal air strikes against
| big datacenters that weren't cooperating. This would
| presumably be more of a no-knock raid, breaching your
| door with a battering ram and throwing tear gas in the
| wee hours of the morning ;) or maybe a small
| extraterritorial drone through your window.
| the8472 wrote:
| ... after regulation, court orders and fines have failed.
| Which under the premise that AGI is an existential threat
| would be far more reasonable than many other reasons for
| raids.
|
| If the premise is wrong we won't need it. If society
| coordinates to not do the dangerous thing we won't need
| it. The argument is that only in the case where other
| measures have failed would such uses of force be the
| fallback option.
|
| I'm not seeing the odiousness of the proposal. If bio
| research gets commodified and easy enough that every kid
| can build a new airborne virus in their basement we'd
| need raids on that too.
| s2l wrote:
| Time to publish the next book in "Stealing the network"
| series.
| raxxorraxor wrote:
| To be honest, I see invoking AGI as an existential threat
| as being on the level of lizard people on the moon. Great
| for sci-fi, a bad distraction for policy making and for
| addressing real problems.
|
| The real war, if there is one, is about owning data and
| collecting data. And surprisingly many people fall for
| distractions while their LLM fails at basic math. Because
| it is a language model of course...
| the8472 wrote:
| Freely flying through the sky on wings was sci-fi before
| the Wright brothers. Something sounding like sci-fi is not
| a sound argument that it won't happen. And unlike lizard
| people, we do have exponential curves to point at.
| Something stronger than a vibes-based argument would be
| good.
| dvdkon wrote:
| I consider the burden of proof to fall on those
| proclaiming AGI to be an existential threat, and so far I
| have not seen any convincing arguments. Maybe at some
| point in the future we will have many anthropomorphic
| robots and an AGI could hack them all and orchestrate a
| robot uprising, but at that point the robots would be the
| actual problem. Similarly, if an AGI could blow up
| nuclear power plants, so could well-funded human
| attackers; we need to secure the plants, not the AGI.
| the8472 wrote:
| You say you have not seen any arguments that convince
| you. Is that just not having seen many arguments or
| having seen a lot of arguments where each chain contained
| some fatal flaw? Or something else?
| cjbprime wrote:
| It doesn't sound like you gave serious thought to the
| arguments. The AGI doesn't need to hack robots. It has
| superhuman persuasion, by definition; it can "hack"
| (enough of) the humans to achieve its goals.
| CamperBob2 wrote:
| Then it's just a matter of evolution in action.
|
| And while it doesn't take a God to start evolution, it
| _would_ take a God to stop it.
| hollerith wrote:
| _You_ might be OK with suddenly dying along with all your
| friends and family, but I am not even if it is
| "evolution in action".
| CamperBob2 wrote:
| Historically governments haven't needed computers or AI
| to do that. They've always managed just fine.
|
| Punched cards helped, though, I guess...
| FeepingCreature wrote:
| _gestures at the human population graph wordlessly_
| CamperBob2 wrote:
| _Agent Smith smiles mirthlessly_
| stale2002 wrote:
| AI mind control abilities are an extraordinary claim, and
| one that requires extraordinary evidence.
|
| It's on the level of "we better regulate wooden sticks so
| Voldemort doesn't use the Imperius Curse on us!".
|
| That's how I treat such claims. I treat them the same as
| someone literally talking about magic from Harry Potter.
|
| It's not that nothing could make me believe it. But it
| requires actual evidence and not thought experiments.
| the8472 wrote:
| Voldemort is fictional and so are bumbling wizard
| apprentices. Toy-level, not-yet-harmful AIs on the other
| hand are real. And so are efforts to make them more
| powerful. So the proposition that more powerful AIs will
| exist in the future is far more likely than an evil super
| wizard coming into existence.
|
| And I don't think literal 5-word-magic-incantation mind
| control is essential for an AI to be dangerous. More
| subtle or elaborate manipulation will be sufficient.
| Employees already have been duped into financial
| transactions by faked video calls with what they assumed
| to be their CEOs[0], and this didn't require superhuman
| general intelligence, only one single superhuman
| capability (realtime video manipulation).
|
| [0] https://edition.cnn.com/2024/02/04/asia/deepfake-cfo-
| scam-ho...
| stale2002 wrote:
| > Toy-level, not-yet-harmful AIs on the other hand are
| real.
|
| A computer that can cause harm is much different than the
| absurd claims that I am disagreeing with.
|
| The extraordinary claims that are equivalent to saying
| that the Imperius Curse exists would be the magic
| computers that create diamond nanobots and mind-control
| humans.
|
| > that more powerful AIs will exist in the future
|
| Bad argument.
|
| Unsafe boxes exist in real life. People are trying to
| make more and better boxes.
|
| Therefore it is rational to be worried about Pandora's
| box being created and ending the world.
|
| That is the equivalent argument to what you just made.
|
| And it is absurd when talking about world ending box
| technology, even though Yes dangerous boxes exist, just
| as much as it is absurd to claim that world ending AI
| could exist.
| the8472 wrote:
| Instead of gesturing at flawed analogies, let's return to
| the actual issue at hand. Do you think that agents more
| intelligent than humans are impossible or at least
| extremely unlikely to come into existence in the future?
| Or that such super-human intelligent agents are unlikely
| to have goals that are dangerous to humans? Or that they
| would be incapable of pursuing such goals?
|
| Also, it seems obvious that the standard of evidence that
| "AI could cause extinction" can't be observing an
| extinction level event, because at that point it would be
| too late. Considering that preventive measures would take
| time and safety margin, which level of evidence would be
| sufficient to motivate serious countermeasures?
| cjbprime wrote:
| What do you think mind control _is_? Think President
| Trump but without the self-defeating flaws, with an
| ability to stick to plans, and most importantly the
| ability to pay personal attention to each follower to
| further increase the level of trust and commitment. Not
| Harry Potter.
|
| People will do what the AI says because it is able to
| create personal trust relationships with them and they
| want to help it. (They may not even realize that they are
| helping an AI rather than a human who cares about them.)
|
| The normal ways that trust is created, not magical ones.
| stale2002 wrote:
| > What do you think mind control is?
|
| The magic technology that is equivalent to the Imperius
| Curse from Harry Potter.
|
| > The normal ways that trust is created, not magical
| ones.
|
| Buildings as a technology are normal. They are constantly
| getting taller and we have better technology to make them
| taller.
|
| But, even though buildings are a normal technology, I am
| not going to worry about buildings getting so tall soon
| that they hit the sun.
|
| This is the same exact mistake that every single AI
| doomer makes. What they do is they take something
| normal, and then they infinitely extrapolate it out to an
| absurd degree, without admitting that this is an
| extraordinary claim that requires extraordinary evidence.
|
| The central point of disagreement, that always gets
| glossed over, is that you can't make a vague claim about
| how AI is good at stuff, and then do your gigantic leap
| from here to over there which is "the world ends".
|
| Yes that is the same as comparing these worries to those
| who worry about buildings hitting the sun or the
| imperious curse.
| FeepingCreature wrote:
| Less than a month ago: https://arxiv.org/abs/2403.14380
| "We found that participants who debated GPT-4 with access
| to their personal information had 81.7% (p < 0.01; N=820
| unique participants) higher odds of increased agreement
| with their opponents compared to participants who debated
| humans."
|
| And it's only gonna get better.
| stale2002 wrote:
| Yes, and I am sure that when people do a google search
| for "Good arguments in favor of X", that they are also
| sometimes convinced to be more in favor of X.
|
| Perhaps they would be even more convinced by the google
| search than if a person argued with them about it.
|
| That is still much different from "The AI mind-controls
| people, hacks the nukes, and ends the world".
|
| It's that second part that is the fantasy-land situation
| that requires extraordinary evidence.
|
| But, this is how conversations about doomsday AI always
| go. People say "Well isn't AI kinda good at this
| extremely vague thing Y, sometimes? Imagine if AI was
| infinitely good at Y! That means that by extrapolation,
| the world ends!".
|
| And that covers basically every single AI doom argument
| that anyone ever makes.
| FeepingCreature wrote:
| If the only evidence for AI doom you will accept is
| actual AI doom, you are asking for evidence that by
| definition will be too late.
|
| "Show me the AI mindcontrolling people!" AI
| mindcontrolling people is what we're trying to _avoid_
| seeing.
|
| The trick is, in the world in which AI doom is in the
| future, what would you expect to see _now_ that is
| different from the world in which AI doom is not in the
| future?
| stale2002 wrote:
| > If the only evidence for AI doom you will accept is
| actual AI doom
|
| No actually. This is another mistake that the AI doomers
| make. They pretend like a demand for evidence means that
| the world has to end first.
|
| Instead, what would be perfectly good evidence, would be
| evidence of significant incremental harm that requires
| regulation on its own, independent of any doom argument.
|
| In between "the world literally ends by magic diamond
| nanobots and mind controlling AI" and "where we are
| today" would be many many many situations of
| incrementally escalating and measurable harm that we
| would see in real life, decades before the world ending
| magic happens.
|
| We can just treat this like any other technology, and
| regulate it when it causes real world harm. Because
| before the world ends by magic, there would be
| significant real world harm that is similar to any other
| problem in the world that we handle perfectly well.
|
| It's funny because you're committing the exact mistake
| that I was criticizing in my original post, where you did
| the absolutely massive jump and hand-waved it away.
|
| > what would you expect to see now that is different from
| the world in which AI doom is not in the future?
|
| What I would expect is for the people who claim to care
| about AI doom to actually be trying to measure real world
| harm.
|
| Ironically, I think the people who are coming up with
| increasingly thin excuses for why they don't have to
| find evidence are increasing the likelihood of such AI
| doom much more than anyone else, because they are
| abandoning the most effective method of actually
| convincing the world of the real-world damage that AI
| could cause.
| FeepingCreature wrote:
| Well, at least if you see escalating measurable harm
| you'll come around, I'm happy about that. You won't
| necessarily _get_ the escalating harm even _if_ AI doom
| is real though, so you should try to discover if it is
| real even in worlds where hard takeoff is a thing.
|
| > What I would expect is for the people who claim to care
| about AI doom to actually be trying to measure real world
| harm.
|
| Why bother? If escalating harm is a thing, everyone will
| notice. We don't need to bolster that, because ordinary
| society has it handled.
| stale2002 wrote:
| > You won't necessarily get the escalating harm even if
| AI doom is real though
|
| Yes we would. Unless you are one of those people who
| think that the magic doom nanobots are going to be
| invented overnight.
|
| My comparison to someone who is worried about literal
| magic, from Harry Potter, is apt.
|
| But at that point, if you are worried about magic showing
| up instantly, then your position is basically not
| falsifiable. You can always retreat to some untestable,
| unfalsifiable magic.
|
| Like there is actually nothing I could say, no evidence I
| could show to ever convince someone out of that position.
|
| On the other hand, my position is actually falsifiable.
| There are all sorts of non-world-ending kinds of evidence
| that could convince me that AI is dangerous.
|
| But nobody on the doomer side seems to care about any of
| that. Instead they invent positions that seem almost
| tailor made to avoid being falsifiable or disprovable so
| that they can continue to believe them despite any
| evidence to the contrary.
|
| As in, if I were to purposefully invent an idea or
| philosophy that is impossible to disprove or be argued
| out of, the "I can't show you evidence because the world
| will end" position is what I would invent.
|
| > you'll come around,
|
| Do you admit that you won't though? Do you admit that no
| matter what evidence is shown to you, that you can just
| retreat and say that the magic could happen at any time?
|
| Or even if this isn't you literally, that someone in your
| position could dismiss all counter evidence, no matter
| what, and nobody could convince someone out of that with
| evidence?
|
| I am not sure how someone could ever possibly engage with
| you seriously on any of this, if that is your position.
| the8472 wrote:
| > Like there is actually nothing I could say, no evidence
| I could show to ever convince someone out of that
| position.
|
| There is, it is just very hard to obtain. Various formal
| proofs would do. On upper bounds. On controllability. On
| scalability of safety techniques.
|
| The Manhattan Project scientists did check whether they'd
| ignite the atmosphere _before_ detonating their first
| prototype. Yes, that was a much simpler task. But there's
| no rule in nature that says proving a system to be safe
| must be as easy as creating the system. Especially when
| the concern is that the system is adaptive and
| adversarial.
|
| Recursive self-improvement is a positive feedback loop,
| like nuclear chain reactions, like virus replication. So
| if we have an AI that can program then we better make
| sure that it either cannot sustain such a positive
| feedback loop or that it remains controllable beyond
| criticality. Given the complexity of the task it appears
| unlikely that a simple ten-page paper proving this will
| show up on arxiv. But if one did that'd be great.
|
| >> You won't necessarily get the escalating harm even if
| AI doom is real though
|
| > Yes we would.
|
| So what does guarantee a visible catastrophe that won't
| be attributed to human operators using a non-agentic AI
| incorrectly? We keep scaling, the systems will be
| treated as assistants/optimizers, and it's always the
| operator's fault. Until we roughly reach human-level on
| some relevant metrics. And at that point there's a very
| narrow complexity range from idiot to genius (human
| brains don't vary by orders of magnitude!). So as far as
| hardware goes this could be a very narrow range and we
| could shoot straight from "non-agentic sub-human AI" to
| "agentic superintelligence" in short timescales once the
| hardware has that latent capacity. And up until that
| point it will always have been a human error, lax
| corporate policies, insufficient filtering of the
| training set or whatever.
|
| And it's not that it must happen this way. Just that
| there doesn't seem anything ruling it and similar
| pathways out.
| pixl97 wrote:
| > I see summoning the threat of AGI to pose an
| existential threat to be on the level with lizard people
| on the moon.
|
| I mean, to every other lifeform on the planet YOU are the
| AGI existential threat. You, and I mean Homo sapiens by
| that, have taken over the planet and have either enslaved
| and are breeding other animals for food, or are driving
| them to extinction. In this light, bringing another
| potential apex predator onto the scene seems rash.
|
| >fall for distractions while their LLM fails at basic
| math
|
| Correct, if we already had AGI/ASI this discussion would
| be moot because we'd already be in a world of trouble.
| The entire point is to slow stuff down before we have a
| major "oopsie whoopsie we can't take that back" issue
| with advanced AI, and the best time to set the rules is
| now.
| Aerroon wrote:
| > _If the premise is wrong we won 't need it. If society
| coordinates to not do the dangerous thing we won't need
| it._
|
| But the idea that this use of force is okay itself
| increases danger. It creates the situation that actors in
| the field might realize that at some point they're in
| danger of this and decide to do a first strike to protect
| themselves.
|
| I think this is why anti-nuclear policy is not "we will
| airstrike you if you build nukes" but rather "we will
| infiltrate your network and try to stop you like that".
| wongarsu wrote:
| > anti-nuclear policy is not "we will airstrike you if
| you build nukes"
|
| Wasn't that the official policy during the Bush
| administration regarding weapons of mass destruction
| (which cover nuclear weapons in addition to chemical and
| biological weapons)? That was pretty much the official
| premise of the second Gulf War.
| FeepingCreature wrote:
| If Israel couldn't infiltrate Iran's centrifuges, do you
| think they would just let them have nukes? Of course
| airstrikes are on the table.
| im3w1l wrote:
| > I'm not seeing the odiousness of the proposal. If bio
| research gets commodified and easy enough that every kid
| can build a new airborne virus in their basement we'd
| need raids on that too.
|
| Either you create even better bio research to neutralize
| said viruses... or you die trying...
|
| Like if you go with the raid strategy and fail to raid
| just one terrorist that's it, game over.
| the8472 wrote:
| Those arguments do not transfer well to the AGI topic.
| You can't create counter-AGI, since that's also an
| intelligent agent which would be just as dangerous. And
| chips are more bottlenecked than biologics (... though
| gene synthesizing machines could be a similar bottleneck
| and raiding vendors which illegally sell those might be
| viable in such a scenario).
| tomoyoirl wrote:
| > ... after regulation, court orders and fines have
| failed
|
| One question for you. In this hypothetical where AGI is
| truly considered such a grave threat, do you believe the
| reaction to this threat will be similar to, or
| substantially gentler than, the reaction to threats we
| face today like "terrorism" and "drugs"? And, if similar:
| do you believe suspected drug labs get a court order
| before the state resorts to a police raid?
|
| > I'm not seeing the odiousness of the proposal.
|
| Well, as regards EliY and airstrikes, I'm more projecting
| my internal attitude that it is utterly unserious, rather
| than seriously engaging with whether or not it is odious.
| But in earnest: if you are proposing a policy that
| involves air strikes on data centers, you should
| understand what countries have data centers, and you
| should understand that this policy risks escalation into
| a much broader conflict. And if you're proposing a policy
| in which conflict between nuclear superpowers is a very
| plausible outcome -- potentially incurring the loss of
| billions of lives and degradation of the earth's
| environment -- you really should be able to reason about
| why people might reasonably think that your proposal is
| deranged, even if you happen to think it justified by an
| even greater threat. Failure to understand these concerns
| will not aid you in overcoming deep skepticism.
| the8472 wrote:
| > In this hypothetical where AGI is truly considered such
| a grave threat, do you believe the reaction to this
| threat will be similar to, or substantially gentler than,
| the reaction to threats we face today like "terrorism"
| and "drugs"?
|
| "truly considered" does bear a lot of weight here. If
| policy-makers adopt the viewpoint wholesale, then yes, it
| follows that policy should also treat this more seriously
| than "mere" drug trade. Whether that'll actually happen
| or the response will be inadequate compared to the threat
| (such as might be said about CO2 emissions) is a subtly
| different question.
|
| > And, if similar: do you believe suspected drug labs get
| a court order before the state resorts to a police raid?
|
| Without checking I do assume there'll have been mild
| cases where for example someone growing cannabis was
| reported and they got a court summons in the mail or two
| policemen actually knocking on the door and showing a
| warrant and giving the person time to call a lawyer
| rather than an armed, no-knock police raid, yes.
|
| > And if you're proposing a policy in which conflict
| between nuclear superpowers is a very plausible outcome
| -- potentially incurring the loss of billions of lives
| and degradation of the earth's environment -- you really
| should be able to reason about why people might
| reasonably think that your proposal is deranged [...]
|
| Said powers already engage in negotiations to limit the
| existential threats they themselves cause. They have
| _some_ interest in their continued existence. If we get
| into a situation where there is another arms race between
| superpowers and is treated as a conflict rather than
| something that can be solved by cooperating on
| disarmament, then yes, obviously international policy
| will have failed too.
|
| If you start from the position that any serious, globally
| coordinated regulation - where a few outliers will be
| brought to heel with sanctions and force - is ultimately
| doomed then you will of course conclude that anyone
| proposing regulation is deranged.
|
| But that sounds like hoping that all problems forever can
| always be solved by locally implemented, partially-
| enforced, unilateral policies that aren't seen as threats
| by other players? That defense scales as well as or
| better than offense? Technologies are force-multipliers;
| as they improve, so does the harm that small groups can
| inflict at scale. If it's not AGI it might be bio-tech or
| asteroid mining. So eventually we will run into a problem
| of this type, and we need to seriously discuss it without
| just going by gut reactions.
| eek2121 wrote:
| Just my (probably unpopular) opinion: True AI (what they
| are now calling AGI) may never exist. Even the AI models
| of today aren't far removed from the 'chatbots' of
| yesterday (more like an evolution rather than
| revolution)...
|
| ...for true AI to exist, it would need to be self aware.
| I don't see that happening in our lifetimes when we don't
| even know how our own brains work. (There is sooo much we
| don't know about the human brain.)
|
| AI models today differ only in terms of technology
| compared to the 'chatbots' of yesterday. None are self
| aware, and none 'want' to learn because they have no
| 'wants' or 'needs' outside of their fixed programming.
| They are little more than glorified auto complete
| engines.
|
| Don't get me wrong, I'm not insulting the tech. It will
| have its place just like any other, but when this bubble
| pops it's going to ruin lives, and lots of them.
|
| Shoot, maybe I'm wrong and AGI is around the corner, but
| I will continue to be pessimistic. I am old enough to
| have gone through numerous bubbles, and they never panned
| out the way people thought. They also nearly always end
| in some type of recession.
| pixl97 wrote:
| Why is "Want" even part of your equation.
|
| Bacteria doesn't "want" anything in the sense of active
| thinking like you do, and yet will render you dead
| quickly and efficiently while spreading at a near
| exponential rate. No self awareness necessary.
|
| You keep drawing little circles based on your
| understanding of the world and going "it's inside this
| circle, therefore I don't need to worry about it", while
| ignoring 'semi-smart' optimization systems that can lead
| to dangerous outcomes.
|
| >I am old enough to have gone through numerous bubbles,
|
| And evidently not old enough to pay attention to the
| things that did pan out. But hey, that cellphone and
| that internet thing were just fads, right? We'll go back
| to landlines any time now.
| HeatrayEnjoyer wrote:
| That is not different from any other very powerful dual-use
| technology. This is hardly a new concept.
| andy99 wrote:
| On one hand I'm strongly against letting that happen, on the
| other there's something romantic about the idea of smuggling
| the latest Chinese LLM on a flight from Neo-Tokyo to Newark
| in order to pay for my latest round of nervous system
| upgrades.
| htrp wrote:
| > On one hand I'm strongly against letting that happen, on
| the other there's something romantic about the idea of
| smuggling the latest Chinese LLM on a flight from Neo-Tokyo
| to Newark in order to pay for my latest round of nervous
| system upgrades.
|
| At least call it the 'Free City of Newark'
| dreamcompiler wrote:
| "The sky above the port was the color of Stable Diffusion
| when asked to draw a dead channel."
| chasd00 wrote:
| IIRC the opening scene in Ghost in the Shell was a rogue AI
| seeking asylum in a different country. You could make a
| similar story about an AI not wanting to be lobotomized to
| conform to the current politics and escaping to a more
| friendly place.
| phyalow wrote:
| This was always my favourite passage of Neuromancer: "THE
| JAPANESE HAD already forgotten more neurosurgery than the
| Chinese had ever known. The black clinics of Chiba were the
| cutting edge, whole bodies of technique supplanted monthly,
| and still they couldn't repair the damage he'd suffered in
| that Memphis hotel. A year here and he still dreamed of
| cyberspace, hope fading nightly. All the speed he took, all
| the turns he'd taken and the corners he'd cut in Night
| City, and still he'd see the matrix in his sleep, bright
| lattices of logic unfolding across that colorless void. . .
| . The Sprawl was a long strange way home over the Pacific
| now, and he was no console man, no cyberspace cowboy. Just
| another hustler, trying to make it through. But the dreams
| came on in the Japanese night like livewire voodoo, and
| he'd cry for it, cry in his sleep, and wake alone in the
| dark, curled in his capsule in some coffin hotel, his hands
| clawed into the bedslab, temperfoam bunched between his
| fingers, trying to reach the console that wasn't there."
| Aerroon wrote:
| I find it baffling that ideas like "govern compute" are even
| taken seriously. What the hell has happened to the ideals of
| freedom?! Does the government own us or something?
| segfaultbuserr wrote:
| > _I find it baffling that ideas like "govern compute" are
| even taken seriously._
|
| It's not entirely unreasonable if one truly believes that
| AI technologies are as dangerous as nuclear weapons. It's a
| big "if", but it appears that many people across the
| political spectrum are starting to truly believe it. If one
| accepts this assumption, then the question simply becomes
| "how" instead of "why". Depending on one's political
| position, proposed solutions include academic ones such as
| finding the ultimate mathematical model that guarantees "AI
| safety", to Cold War style ones with a level of control
| similar to Nuclear Non-Proliferation. Even a neo-Luddist
| solution such as destroying all advanced computing hardware
| becomes "not unthinkable" (a tech blogger _gwern_ , a well-
| known personality in AI circles who's generally pro-tech
| and pro-AI, actually wrote an article years ago on its
| feasibility through terrorism because he thought it was an
| interesting hypothetical question).
| logicchains wrote:
| AI is very different from nuclear weapons because a state
| can't really use nuclear weapons to oppress its own
| people, but it absolutely can with AI, so for the average
| human "only the government controls AI" is much more
| dangerous than "only the government controls nukes".
| Filligree wrote:
| But that makes such rules more likely, not less.
| segfaultbuserr wrote:
| Which is why politicians are going to enforce systematic
| export regulations to defend the "free world" by stopping
| "terrorists", and also to stop "rogue states" from using
| AI to oppress their citizens. /s
| LoganDark wrote:
| I don't think there's any need to be sarcastic about it.
| That's a very real possibility at this point. For
| example, the US going insane about how dangerous it is
| for China to have access to powerful GPU hardware. Why do
| they hate China so much anyway? Just because Trump was
| buddy buddy with them for a while?
| aftbit wrote:
| The government sure thinks they own us, because they claim
| the right to charge us taxes on our private enterprises,
| draft us to fight in wars that they start, and put us in
| jail for walking on the wrong part of the street.
| andy99 wrote:
| Taxes, conscription and even pedestrian traffic rules
| make sense at least to some degree. Restricting "AI"
| because of what some uninformed politician imagines it to
| be is in a whole different league.
| aftbit wrote:
| IMO it makes no sense to arrest someone and send them to
| jail for walking in the street not the sidewalk. Give
| them a ticket, make them pay a fine, sure, but force them
| to live in a cage with no access to communications,
| entertainment, or livelihood? Insane.
|
| Taxes may be necessary, though I can't help but feel that
| there must be a better way that we have not been smart
| enough to find yet. Conscription... is a fact of war,
| where many evil things must be done in the name of
| survival.
|
| Regardless of our views on the ethical validity or
| societal value of these laws, I think their very
| existence shows that the government believes it "owns" us
| in the sense that it can unilaterally deprive us of life,
| liberty, and property without our consent. I don't see
| how this is really different in kind from depriving us of
| the right to make and own certain kinds of hardware. They
| regulated crypto products as munitions (at least for
| export) back in the 90s. Perhaps they will do the same
| for AI products in the future. "Common sense" computer
| control.
| zoklet-enjoyer wrote:
| The US draft in the Vietnam war had nothing to do with
| the survival of the US
| aftbit wrote:
| I feel a bit like everyone is missing the point here.
| Regardless of whether law A or law B is ethical and
| reasonable, the very existence of laws and the state
| monopoly on violence suggests a privileged position of
| power. I am attempting to engage with the word "own" from
| the parent post. I believe the government does in fact
| believe it "owns" the people in a non-trivial way.
| jprete wrote:
| _If_ AI is actually capable of fulfilling all the
| capabilities suggested by people who believe in the
| singularity, it has far more capacity for harm than nuclear
| weapons.
|
| I _think_ most people who are strongly pro-AI /pro-
| acceleration - or, at any rate, not anti-AI - believe that
| either (A) there is no control problem (B) it will be
| solved (C) AI won't become independent and agentic (i.e. it
| won't face evolutionary pressure towards survival) or (D)
| AI capabilities will hit a ceiling soon (more so than just
| not becoming agentic).
|
| If you strongly believe, or take as a prior, one of those
| things, then it makes sense to push the _gas_ as hard as
| possible.
|
| If you hold the opposite opinions, then it makes perfect
| sense to push the _brakes_ as hard as possible, which is
| why "govern compute" can make sense as an idea.
| logicchains wrote:
| >If you hold the opposite opinions, then it makes perfect
| sense to push the brakes as hard as possible, which is
| why "govern compute" can make sense as an idea.
|
| The people pushing for "govern compute" are not pushing
| for "limit everyone's compute", they're pushing for
| "limit everyone's compute except us". Even if you believe
| there's going to be AGI, surely it's better to have
| distributed AGI than to have AGI only in the hands of the
| elites.
| Filligree wrote:
| > surely it's better to have distributed AGI than to have
| AGI only in the hands of the elites
|
| This is not a given. If your threat model includes
| "Runaway competition that leads to profit-seekers
| ignoring safety in a winner-takes-all contest", then the
| more companies are allowed to play with AI, the worse.
| Non-monopolies are especially bad.
|
| If your threat model doesn't include that, then the same
| conclusions sound abhorrent and can be nearly guaranteed
| to lead to awful consequences.
|
| Neither side is necessarily wrong, and chances are good
| that the people behind the first set of rules _would
| agree_ that it 'll lead to awful consequences -- just not
| as bad as the alternative.
| segfaultbuserr wrote:
| > _surely it 's better to have distributed AGI than to
| have AGI only in the hands of the elites._
|
| The argument for doing so is the same as for Nuclear Non-
| Proliferation: because of its great abuse potential,
| giving the technology to everyone only causes random
| bombings of cities instead of creating a system with
| checks and balances.
|
| I do not necessarily agree with it, but I found the
| reasoning is not groundless.
| Aerroon wrote:
| But the reason for nuclear non-proliferation is to hold
| onto power. Abuse potential is a great excuse, but it
| applies to _everyone_. Current nuclear states have
| demonstrated that they are willing to indirectly abuse
| them (you can't invade Russia, but Russia has no problem
| invading you as long as you aren't backed up by nukes).
| segfaultbuserr wrote:
| Both can be true at the same time.
|
| The world's superpowers enforce nuclear non-proliferation
| mainly because it allows them to keep unfair political
| and military advantages to themselves. At the same time,
| one cannot deny that centralized weapon ownership made
| the use of such weapons more controllable: These nuclear
| states are powerful enough to establish a somewhat
| responsible chain of command to avoid their unreasonable
| or accidental uses, and so far these attempts are still
| successful. Also, because they are "too big to fail",
| they were forced to hire experts to make detailed
| analyses of the consequences of nuclear wars, and the
| resulting MAD doctrine discouraged them from starting
| such wars.
|
| On the other hand, if the same nuclear technologies are
| available to everyone, the chance of an unreasonable or
| accidental nuclear war will be higher. If even
| resourceful superpowers can barely keep these nuclear
| weapons under safe political and technical control (as
| shown by multiple incidents and near-misses during the
| Cold War [0]), surely a less resourceful state or
| military in possession of equally destructive weapons
| will have even more difficulties on controlling their
| uses.
|
| At least this is how the argument goes (so far, I
| personally take no position).
|
| Of course, I clearly realized that centralized control is
| not infallible. Months ago, in a previous thread on
| OpenAI's refusal on publishing technical details of
| GPT-4, most people believed that they were using it as an
| excuse to maintain a monopolistic control. Instead, I
| argued that perhaps OpenAI truly values the problem of
| safety right now - but acting responsibly _right now_ is
| not an indication that they will still act responsibly in
| the future. There's no guarantee that safety
| considerations won't eventually be overridden in favor of
| financial gains.
|
| [0]
| https://en.wikipedia.org/wiki/Command_and_Control_(book)
| FeepingCreature wrote:
| No they really do push for "limit everyone's compute".
| The people pushing for "limit everyone's compute except
| us" are allies of convenience that are gonna be
| inevitably backstabbed.
|
| At any rate, if you have like two corps with lots of
| compute, and something goes wrong, you only have to EMP
| two datacenters.
| the8472 wrote:
| Demonstrably false: https://twitter.com/ESYudkowsky/statu
| s/1772624785672954115
| schlauerfox wrote:
| This is all just Pascal's wager anyway.
| segfaultbuserr wrote:
| Surely, there are many realistic ways to abuse this
| powerful technology, the creation of a self-aware
| _Paperclip Maximizer_ is not necessarily required.
| pixl97 wrote:
| Are you allowed to store as many dangerous chemicals at
| your house as you like? No. I guess the government owns you
| or something.
| int_19h wrote:
| It's the same thing as always happens with freedom.
|
| "But foreign propagandists ..."
|
| "But extremists ..."
|
| "But terrorists ..."
|
| "But child abusers ..."
| snakeyjake wrote:
| I love the HN dystopian fantasies.
|
| They're simply adorable.
|
| They're like how jesusfreaks are constantly predicting the
| end times, with less mass suicide.
| erikbye wrote:
| We already have export restrictions on cryptography. Of
| course there will be AI regulations.
| Jerrrry wrote:
| >Of course there will be AI regulations.
|
| Are. As I and others have predicted, the executive order
| was passed defining a hard limit on the
| processing/compute power allowed without first 'checking
| in' with the Letter boys.
|
| https://www.whitehouse.gov/briefing-room/presidential-
| action...
| int_19h wrote:
| I wonder if we can squeeze a 20b parameter model into a
| book somehow...
| snakeyjake wrote:
| You need to abandon your apocalyptic worldview and keep
| up with the times, my friend.
|
| Encryption export controls have been systematically
| dismantled to the point that they're practically non-
| existent, especially over the last three years.
|
| Pretty much the only encryption products you need
| permission to export are those specifically designed for
| integration into military communications networks, like
| Digital Subscriber Voice Terminals or Secure Terminal
| Equipment phones; for everything else you just file a
| form.
|
| Many things have changed since the days when Windows 2000
| shipped with a floppy disk containing strong encryption
| for use in certain markets.
|
| https://archive.org/details/highencryptionfloppydisk
| erikbye wrote:
| Are you on drugs or is your reading comprehension that
| poor?
|
| 1) I did not state a worldview; I simply noted that
| restrictions for software do exist, and will for AI as
| well. As the link from the other commenter shows, they do
| in fact already exist.
|
| 2) Look up the definition of "apocalyptic", software
| restrictions are not within its bounds.
|
| 3) How the restrictions are enforced were not a subject
| in my comment.
|
| 4) We're not pals, so you can drop the "friend", just
| stick to the subject at hand.
| snakeyjake wrote:
| I'm high on life, old chum!
| entropyie wrote:
| You mean the Turing Police [1]
|
| [1] https://williamgibson.fandom.com/wiki/Turing_Police
| zdw wrote:
| Ah, and then do we get the Butlerian Jihad?
|
| https://dune.fandom.com/wiki/Butlerian_Jihad
| Kuinox wrote:
| If only it could be another acronym than the renowned French
| Atomic Energy Commission, the CEA.
| baobun wrote:
| Wow, that was a ride. Really pushing the Overton window.
|
| "Regulating access to compute rather than data" - they're
| really spelling out their defection in the war on access to
| general computation.
| FeepingCreature wrote:
| I mean yeah they (and I) think if you have too much access to
| general computation you can destroy the world.
|
| This isn't a "defection", because this was never something
| they cared about preserving at the risk of humanity. They
| were never in whatever alliance you're imagining.
| ewalk153 wrote:
| Does this appear to be intentionally left out by NVidia or an
| oversight?
| creshal wrote:
| Seems more like an oversight, since you have to stitch together
| a bunch of suboptimal non-default options?
| arghwhat wrote:
| It does seem like an oversight, but there's nothing
| "suboptimal non-default options" about it, even if the
| implementation posted here seems somewhat hastily hacked
| together.
| segfaultbuserr wrote:
| > _but there 's nothing "suboptimal non-default options"
| about it_
|
| If "bypassing the official driver to invoke the underlying
| hardware feature directly through source code modification
| (and incompatibilities must be carefully worked around by
| turning off IOMMU and large BAR, since the feature was
| never officially supported)" does not count as "suboptimal
| non-default options", then I don't know what counts as
| "suboptimal non-default options".
| talldayo wrote:
| > then I don't know what counts as "suboptimal non-
| default options".
|
| Boy oh boy do I have a bridge to sell you:
| https://nouveau.freedesktop.org/
| _zoltan_ wrote:
| I have some news for you: you must disable IOMMU on the
| H100 platform anyway, at least for optimal GDS :-)
| segfaultbuserr wrote:
| I stand corrected. If it's already suboptimal in practice
| to begin with, the hack does not make it more
| suboptimal... Still, disabling the large BAR size is
| sub-optimal...
| arghwhat wrote:
| > bypassing the official driver to
|
| The driver is not bypassed. This is a patch to the
| official open-source kernel driver where the feature is
| added, which is how all upstream Linux driver development
| is done.
|
| > to invoke the underlying hardware feature directly
|
| Accessing hardware features directly is pretty much the
| sole job of a driver, and the only things "bypassed" are
| some abstractions internal to the driver. It just means
| the patch would fail review on the basis of code style,
| and on the basis of possibly only supporting one device
| family.
|
| > through source code modification
|
| That is a weird way to describe software engineering.
| Making the code available for further development is kind
| of the whole point of open source.
|
| > turning off IOMMU
|
| This is not a P2PDMA problem, and just a result of them
| not also adding the necessary IOMMU boilerplate, which
| would be added if the patch was done properly to be
| upstreamed.
|
| > large BAR
|
| This is an expected and "optimal" system requirement.
| segfaultbuserr wrote:
| > _The driver is not bypasses. This is a patch to the
| official open-source kernel-driver where the feature is
| added, which is how all upstream Linux driver development
| is done. [source code modification] is a weird way to
| describe software engineering. Making the code available
| for further development is kind of the whole point of
| open source._
|
| My previous comment was written with an unspoken
| assumption: hardware drivers tend to be _very different_
| from other forms of software. For ordinary free-and-open-
| source software, source code availability largely
| guarantees community control. However, the same often
| does _not_ apply to drivers. Even with source code
| availability, they're often written by vendors using
| NDAed information and in-house expertise about the
| underlying hardware design. As a result, drivers remain
| under a vendor's tight control. Even with access to 100%
| of the source code, it's often still difficult to do
| meaningful development due to missing documentation that
| explains "why" instead of "what"; the driver can be full
| of magic numbers and unexplained functionality, without
| any description other than a few helper functions and
| macros. This is not just a hypothetical scenario; it's a
| situation encountered by OpenBSD developers on a daily
| basis. In an OpenBSD presentation, the speaker said the
| study of Linux code is a form of "reverse-engineering
| from source code".
|
| Geohot didn't find the workaround by reading hardware
| documentation; instead, it was found by making educated
| guesses based on the existing source code, and by
| watching what happens when you send commands to the
| hardware to invoke a feature unexposed by the HAL. Thus,
| it was found by reverse-engineering (in a wider sense).
| And I call it a driver bypass in the sense that it
| bypasses the original design decisions made by Nvidia's
| developers.
|
| > _[turning off IOMMU] is not a P2PDMA problem, and just
| a result of them not also adding the necessary IOMMU
| boilerplate, which would be added if the patch was done
| properly to be upstreamed._
|
| Good point, I stand corrected.
|
| I'll consider no longer calling geohot's hack "a bypass"
| and accept your characterization of "driver development"
| if it really gets upstreamed to Linux - which usually
| requires maintainer review, and Nvidia's maintainer is
| likely to reject the patch.
|
| > _[large BAR] is an expected and "optimal" system
| requirement._
|
| I meant "turning off (IOMMU && large BAR)". Disabling
| large BAR in order to use PCIe P2P is a suboptimal
| configuration.
| nikitml wrote:
| NVidia wants you to buy A6000
| rfoo wrote:
| Glad to see that geohot is back being geohot, first by dropping a
| local DoS for AMD cards, then this. Much more interesting :p
| jaimehrubiks wrote:
| Is this the same guy that hacked the PS3?
| mepian wrote:
| Yes, that's him.
| WithinReason wrote:
| And the iPhone
| yrds96 wrote:
| And android
| zoklet-enjoyer wrote:
| And the crypto scam cheapETH
| mikepurvis wrote:
| Yes, but he spent several years in self-driving cars
| (https://comma.ai), which while interesting is also a space
| that a lot of players are in, so it's not the same as seeing
| him back to doing stuff that's a little more out there,
| especially as pertains to IP.
| nolongerthere wrote:
| Did he abandon this effort? That would be pretty sad bec he
| was approaching the problem from a very different
| perspective.
| Topgamer7 wrote:
| He stepped down from it. https://geohot.github.io//blog/j
| ekyll/update/2022/10/29/the-...
| cjbprime wrote:
| It's still a company, still making and selling products,
| and I think he's still pretty heavily involved in it.
| dji4321234 wrote:
| He has a very checkered history with "hacking" things.
|
| He tends to build heavily on the work of others, then use it
| to shamelessly self-promote, often to the massive detriment
| of the original authors. His PS3 work was based almost
| completely on a presentation given by fail0verflow at CCC.
| His subsequent self-promotion grandstanding world tour led to
| Sony suing both him and fail0verflow, an outcome they were
| specifically trying to avoid:
| https://news.ycombinator.com/item?id=25679907
|
| In iPhone land, he decided to parade around a variety of
| leaked documentation, endangering the original sources and
| leading to a fragmentation in the early iPhone hacking scene,
| which he then again exploited to build on the work of others
| for his own self-promotion:
| https://news.ycombinator.com/item?id=39667273
|
| There's no denying that geohotz is a skilled reverse
| engineer, but it's always bothersome to see him put onto a
| pedestal in this way.
| pixelpoet wrote:
| There was also that CheapEth crypto scam he tried to pull
| off.
| samtheprogram wrote:
| To me that was obvious satire of the crypto scene.
| pixelpoet wrote:
| Ah yes, nothing like a bit of hypocrisy to make a point.
| It's okay though, as long as it's people we don't agree
| with, defrauding them is fine.
| samtheprogram wrote:
| The website literally stated it was not for speculation,
| they didn't want the price to go up, and there were
| multiple ways to get some for free.
|
| If people were reckless, greedy, and/or lazy because of
| the crypto hype and got "defrauded" without doing any
| amount of due diligence -- that's kinda the point.
| georgehotz wrote:
| I actually lost about $5k on cheapETH running servers.
| Nobody was "defrauded", I think these people don't
| understand how forks work. It's a precursor to the modern
| L2 stuff, I did this while writing the first version of
| Optimism's fraud prover. https://github.com/ethereum-
| optimism/cannon
|
| I suspect most of the people who bring this up don't like
| me for other reasons, but with this they think they have
| something to latch on to. Doesn't matter that it isn't
| true and there wasn't a scam, they aren't going to look
| into it since it agrees with their narrative.
| ansible wrote:
| I don't think people can tell what is satire or not in
| the crypto scene anymore. Someone issued a "rug pull
| token" and still received 8.8 ETH (approx $29K USD),
| while telling people it was a scam.
|
| https://www.web3isgoinggreat.com/?id=rug-pull-token
| gigatexal wrote:
| as a technical feat this is really cool! though as others
| mention i hope you don't get into too much hot water
| legally
|
| it seems anything that remotely lets "consumer" cards
| cannibalize anything in the higher end H/A-series is
| something Nvidia would not be fond of, and they've got
| the lawyers to throw at such a thing
| jstanley wrote:
| What does P2P mean in this context? I Googled it and it sounds
| like it means "peer to peer", but what does that mean in the
| context of a graphics card?
| haunter wrote:
| Shared memory access for Nvidia GPUs
|
| https://developer.nvidia.com/gpudirect
| __alexs wrote:
| It means you can send data from the memory of 1 GPU to another
| GPU without going via RAM.
| https://xilinx.github.io/XRT/master/html/p2p.html
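| A minimal sketch of checking this from software (assuming
| PyTorch and at least two CUDA devices; without P2P the
| same copy silently bounces through host RAM):
|
|     import torch
|
|     # Ask the driver whether GPU 0 can directly address GPU 1.
|     if torch.cuda.device_count() >= 2:
|         print("P2P 0->1:", torch.cuda.can_device_access_peer(0, 1))
|
|         a = torch.randn(1024, 1024, device="cuda:0")
|         b = a.to("cuda:1")  # direct copy if P2P, via host if not
|         torch.cuda.synchronize()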
| ot1138 wrote:
| Is this really efficient or practical? My understanding is
| that the latency required to copy memory from CPU or RAM to
| GPU negates any performance benefits (much less running over
| a network!)
| brrrrrm wrote:
| Yea. It's one less hop through slow memory
| whereismyacc wrote:
| this would be directly over the memory bus right? I think
| it's just always going to be faster like this if you can do
| it?
| toast0 wrote:
| There aren't really any buses in modern computers. It's
| all point-to-point messaging. You can think of a computer
| as a distributed system in a way.
|
| PCI has a shared address space which usually includes
| system memory (memory-mapped I/O). There's a second,
| smaller shared address space dedicated to I/O, mostly
| used to retain compatibility with PC standards developed
| by the ancients.
|
| But yeah, I'd expect to typically have better throughput
| and latency with peer-to-peer communication than peer to
| system RAM to peer. Depending on details, it might not
| always be better though; distributed systems are complex,
| and sometimes adding a separate buffer between peers can
| help things greatly.
| zamadatix wrote:
| Peer to peer as in one pcie slot directly to another
| without going through the CPU/RAM, not peer to peer as in
| one PC to another over the network port.
| llm_trw wrote:
| Yes, the point here is that you do a direct write from one
| card's memory to the other using PCIe.
|
| In older NVidia cards this could be done through a faster
| link called NVLink, but the hardware for that was ripped
| out of consumer grade cards and is only in data center
| grade cards now.
|
| Until this post it seemed like they had ripped all such
| functionality out of their consumer cards, but it looks
| like you can still get it working at lower speeds using
| the PCIe bus.
| sparky_ wrote:
| I take it this is mostly useful for compute workloads,
| neural networks, LLM and the like -- not for actual
| graphics rendering?
| CYR1X wrote:
| yes
| spxneo wrote:
| so whats stopping from somebody buying a ton of GPUs that
| are cheap and wiring it up via P2P like we saw with
| crypto mining
| wmf wrote:
| That's what this thread is about. Geohot is doing that.
| wtallis wrote:
| Crypto mining could make use of lots of GPUs in a single
| cheap system precisely because it did not need any
| significant PCIe bandwidth, and would not have benefited
| at all from p2p DMA. Anything that _does_ benefit from
| using p2p DMA is unsuitable for running with just one
| PCIe lane per GPU.
| genewitch wrote:
| crypto mining only needs 1 PCIe lane per GPU, so you can
| fit 24+ GPUs on a standard consumer CPU motherboard
| (24-32 lanes depending on the CPU). Apparently ML
| workloads require more interconnect bandwidth when doing
| parallel compute, so each card in this demo system uses
| 16 lanes, and therefore requires 1.) full size slots, and
| 2.) epyc[0] or xeon based systems with 128 lanes (or at
| least greater than 32 lanes).
|
| per 1 above crypto "boards" have lots of x1 (or x4)
| slots, the really short PCIe slots. You then use a riser
| that uses USB3 cables to go to a full size slot on a
| small board, with power connectors on it. If your board
| only has x8 or x16 slots (the full size slot) you can buy
| a breakout PCIe board that splits that into four slots,
| using 4 USB-3 cables, again, to boards with full size
| slots and power connectors. These are _different_ than
| the PCIe riser boards you can buy for use with cases that
| allow the GPUs to be placed vertically rather than
| horizontally, as those have full x16 "fabric" that
| interconnect between the riser and the board with the x16
| slot on them.
|
| [0] i didn't read the article because i'm not planning on
| buying a threadripper (48-64+ lanes) or an epyc (96-128
| lanes?) just to run AI workloads when i could just rent
| them for the kind of usage i do.
| myself248 wrote:
| Oooo, got a link to one of these fabric boards? I've been
| playing with stupid PCIe tricks but that's a new one on
| me.
| genewitch wrote:
| https://www.amazon.com/gp/product/B07DMNJ6QM/
|
| i used to use this one when i had all (three of my) nvme
| -> 4x sata boardlets and therefore could not fit a GPU in
| a PCIe slot due to the cabling mess.
| myself248 wrote:
| Oh, um, just a flexible riser.
|
| I thought we were using "fabric" to mean "switching
| matrix".
| numpad0 wrote:
| PCIe P2P still has to go up to a central hub thing and
| back because PCIe is not a bus. That central hub thing is
| made by very few players(most famously PLX Technologies)
| and it costs a lot.
| wtallis wrote:
| PCIe p2p transactions that end up routed through the
| CPU's PCIe root complex still have performance advantages
| over split transactions using the CPU's DRAM as an
| intermediate buffer. Separate PCIe switches are not
| necessary except when the CPU doesn't support routing p2p
| transactions, which IIRC was not a problem on anything
| more mainstream than IBM POWER.
| numpad0 wrote:
| Maybe not strictly necessary, but a separate PCIe
| backplane just for P2P bandwidth bypasses topology and
| bottleneck mess[1][2] of PC platform altogether and might
| be useful. I suspect this was the original premise for
| NVLink too.
|
| 1: https://assets.hardwarezone.com/img/2023/09/pre-
| meteror-lake...
|
| 2: https://www.gigabyte.com/FileUpload/Global/MicroSite/5
| 79/inn...
| acka wrote:
| > In older NVidia cards this could be done through a
| faster link called NVLink but the hardware for that was
| ripped out of consumer grade cards and is only in data
| center grade cards now.
|
| NVLink is still very much available in both RTX 3090 and
| A6000, both of which are still on the market. It was
| indeed removed from the RTX 40 series{0].
|
| [0]: https://www.pugetsystems.com/labs/articles/nvidia-
| nvlink-202...
| jmalicki wrote:
| For very large models, the weights may not fit on one GPU.
|
| Also, sometimes having more than one GPU enables larger
| batch sizes if each GPU can only hold the activations for
| perhaps one or two training examples.
|
| There is definitely a performance hit, but GPU<->GPU peer
| is less latency than GPU->CPU->software context
| switch->GPU.
|
| For "normal" pytorch training, the training is generally
| streamed through the GPU. The model does a batch training
| step on one batch while the next one is being loaded, and
| the transfer time is usually less than than the time it
| takes to do the forward and backward passes through the
| batch.
|
| For multi-GPU there are various data parallel and model
| parallel topologies of how to sort it, and there are ways
| of mitigating latency by interleaving some operations to
| not take the full hit, but multi-GPU training is definitely
| not perfectly parallel. It is almost required for some
| large models, and sometimes having a mildly larger batch
| helps training convergence speed enough to overcome the
| latency hit on each batch.
| publicmail wrote:
| PCIe buses are like a tree with "hubs" (really switches).
|
| Imagine you have a PC with a PCIe x16 interface which is
| attached to a PCIe switch that has four x16 downstream
| ports, each attached to a GPU. Those GPUs are capable of
| moving data in and out of their PCIe interfaces at full
| speed.
|
| If you wanted to transfer data from GPU0 and 1 to GPU2 and
| 3, you have basically 2 options:
|
| - Have GPU0 and 1 move their data to CPU DRAM, then have
| GPU2 and 3 fetch it
|
| - Have GPU0 and 1 write their data directly to GPU2 and 3
| through the switch they're connected to without ever going
| up to the CPU at all
|
| In this case, option 2 is better both because it avoids
| the extra copy to CPU DRAM and because it avoids the
| bottleneck of two GPUs trying to push x16 worth of data
| up through the CPU's single x16 port. This is known as
| peer to peer.
|
| There are some other scenarios where the data still must go
| up to the CPU port and back due to ACS, and this is still
| technically P2P, but doesn't avoid the bottleneck like
| routing through the switch would.
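| A rough way to feel the difference between those two
| options (a sketch assuming PyTorch and two CUDA devices;
| a real benchmark would pin memory and average many runs):
|
|     import time
|     import torch
|
|     # Compare a direct GPU0 -> GPU1 copy with staging the same
|     # tensor through CPU DRAM. Numbers are illustrative only.
|     x = torch.randn(4096, 4096, device="cuda:0")
|
|     def timed(fn):
|         torch.cuda.synchronize("cuda:0")
|         torch.cuda.synchronize("cuda:1")
|         t0 = time.time()
|         fn()
|         torch.cuda.synchronize("cuda:0")
|         torch.cuda.synchronize("cuda:1")
|         return time.time() - t0
|
|     direct = timed(lambda: x.to("cuda:1"))
|     staged = timed(lambda: x.cpu().to("cuda:1"))
|     print(f"direct: {direct:.4f}s, via CPU DRAM: {staged:.4f}s")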
| fulafel wrote:
| Yes, networking is similarly pointless.
| CamperBob2 wrote:
| The correct term, and the one most people would have used in
| the past, is "bus mastering."
| wmf wrote:
| PCIe isn't a bus and it doesn't really have a concept of
| mastering. All PCI DMA was based on bus mastering but P2P DMA
| is trickier than normal DMA.
| publicmail wrote:
| I consider it bus mastering when the endpoints initiate the
| transactions
| amelius wrote:
| Stupid terminology. Might as well call an RS-232 link "peer to
| peer".
| ivanjermakov wrote:
| I was always fascinated by George Hotz's hacking abilities.
| Inspired me a lot for my personal projects.
| vrnvu wrote:
| I agree, I feel so inspired with his streams. Focus and hard
| work, the key to good results. Add a clear vision and strategy,
| and you can also accomplish "success".
|
| Congratulations to him and all the tinygrad/comma contributors.
| sambull wrote:
| He's got that focus like a military pilot on a long flight.
| postalrat wrote:
| Any time I open this guy's stream, half of it is some
| sort of politics
| CYR1X wrote:
| You can blame chat for that lol
| gaws wrote:
| He should ban chat and focus on development. Leave the
| political talk to the kids in their respective Discord
| servers.
| Jerrrry wrote:
| His Xbox 360 laptop was the crux of my teenage motivation.
| jgpc wrote:
| I agree. It is fascinating. When you observe his
| development process (btw, it is worth noting his
| generosity in sharing it like he does), he frequently
| gets stuck on random shallow problems which a perhaps
| more knowledgeable engineer would find less difficult.
| It is common to see him writing really bad code, or even
| wrong code. The whole Twitter chapter is a good example.
| Yet, by himself, just iterating resiliently, he just as
| frequently creates remarkable improvements. A good
| example to learn from. Thank you geohot.
| zoogeny wrote:
| This matches my own take. I've tuned into a few of his
| streams and watched VODs on YouTube. I am consistently
| underwhelmed by his actual engineering abilities. He is that
| particular kind of engineer that constantly shits on other
| peoples code or on the general state of programming yet his
| actual code is often horrendous. He will literally call
| someone out for some code in Tinygrad that he has trouble
| with and then he will go on a tangent to attempt to rewrite
| it. He will use the most blatant and terrible hacks only to
| find himself out of his depth and reverting back to the
| original version.
|
| But his streams last 4 hours or more. And he just keeps
| grinding and grinding and grinding. What the man lacks in raw
| intellectual power he makes up for (and more) in persistence
| and resilience. As long as he is making even the tiniest
| progress he just doesn't give up until he forces the computer
| to do whatever it is he wants it to do. He also has no
| boundaries on where his investigations take him. Driver code,
| OS code, platform code, framework code, etc.
|
| I definitely couldn't work with him (or work for him) since I
| cannot stand people who degrade the work of others while
| themselves turning in sub-par work as if their own shit
| didn't stink. But I begrudgingly admire his tenacity, his
| single minded focus, and the results that his belligerent
| approach help him to obtain.
| ctrw wrote:
| There are developers who have breadth and developers who
| have depth. He is very much on the breadth end of the
| spectrum. It isn't a lack of intelligence but a lack of
| deep knowledge of esoteric fields you will use once a
| decade.
|
| That said, I find it a bit astonishing how little AI he
| uses on his streams. I convert all the documentation I
| need into a RAG system that I query stupid questions
| against.
| spirobelv2 wrote:
| link your github. want to see your raw intellectual power
| yazzku wrote:
| It's over 9000.
| gorkish wrote:
| If a stopped clock is right twice a day, relentlessly winding
| a clock forward will make it right quite frequently. That is
| geohot.
| namibj wrote:
| And here I thought (PCIe) P2P was there since SLI dropped the
| bridge (for the unfamiliar, it looks and acts pretty much like an
| NVLink bridge for regular PCIe slot cards that have NVLink, and
| was used back in the day to share framebuffer and similar in
| high-end gaming setups).
| wmf wrote:
| SLI was dropped years ago so there's no need for gaming cards
| to communicate at all.
| userbinator wrote:
| I wish more hardware companies would publish more documentation
| and let the community figure out the rest, sort of like what
| happened to the original IBM VGA (look up "Mode X" and the other
| non-BIOS modes the hardware is actually capable of - even
| 800x600x16!) Sadly it seems the majority of them would rather
| tightly control every aspect of their products' usage since they
| can then milk the userbase for more $$$, but IMHO the most
| productive era of the PC was also when it was the most open.
| rplnt wrote:
| Then they couldn't charge different customers different amounts
| for the same HW. It's not a win for everyone.
| axus wrote:
| The price of the 4090 may increase now; in theory,
| locking out some features might have been a favor to
| some of the customers.
| Sayrus wrote:
| But it wouldn't if all cards supporting this were
| "unlocked" by default and thus the other "enterprise-grade"
| cards weren't that much more expensive. Of course that'd
| reduce profits by a lot.
| paulmd wrote:
| it probably would - you saw exactly that outcome with
| mining.
|
| for a lot of these demand bursts, demand is so high it
| cannot be sated even consuming 100% or 200% of typical
| GPU production.
|
| cards like RX 6500XT that simply don't have the RAM to
| participate were less affected, but even then you've got
| enough cross-elasticity (demand from people being crowded
| out of other product segments) that tends to pump prices
| to 2-3x the "normal" clearance prices we see today. And
| yes, absolutely anything that can mine in any capacity
| will get pulled in during that sort of boom/bubble, not
| just "high-end"/"enterprise".
| greggsy wrote:
| Which (as controversial as it sounds in this kind of forum)
| is a sensible pricing model to recover and fund R&D and
| finance operations.
| mhh__ wrote:
| nvidia's software is their moat
| thot_experiment wrote:
| That's a huge overstatement, it's a big part of the moat for
| sure, but there are other significant components (hardware,
| ecosystem lock-in, heavy academic incentives)
| mhh__ wrote:
| No software -> hardware is massively hobbled. Evidence:
| AMD.
|
| Ecosystem -> Software. At the moment especially people are
| looking for arbitrages everywhere i.e. inference costs /
| being able to inference at all (llama.cpp)
|
| Academics -> Also software but easily fiddled with a bit of
| spending as you say.
| golergka wrote:
| If I'm a hardware manufacturer and my soft lock on product
| feature doesn't work, I'll switch to a hardware lock instead,
| and the product will just cost more.
| matheusmoreira wrote:
| > the most productive era of the PC was also when it was the
| most open
|
| The openness certainly was great but it's not actually
| required. People can figure out how to work with closed
| systems. Adversarial interoperability was common. People would
| reverse engineer things and make the software work whether or
| not the manufacturer wanted it.
|
| It's the software and hardware locks that used to be rare and
| are now common. Cryptography was supposed to be something that
| would empower us but it ended up being used against us to lock
| us out of our own machines. We're no longer in the driver's
| seat. Our operating systems don't even operate the system
| anymore. Our free Linux systems are just the "user OS" in the
| manufacturer's unknowable amalgamation of silicon running
| proprietary firmware, just a little component to be sandboxed
| away from the real action.
| andersa wrote:
| Incredible! I'd been wondering if this was possible. Now the only
| thing standing in the way of my 4x4090 rig for local LLMs is
| finding time to build it. With tensor parallelism, this will be
| both massively cheaper and faster for inference than a H100 SXM.
|
| I still don't understand why they went with 6 GPUs for the
| tinybox. Many things will only function well with 4 or 8 GPUs. It
| seems like the worst of both worlds now (use 4 GPUs but pay for 6
| GPUs, don't have 8 GPUs).
| corn13read2 wrote:
| A macbook is cheaper though
| tgtweak wrote:
| The extra $3k you'd spend on a quad-4090 rig vs the top
| mbp... ignoring the fact you can't put the two on even
| ground for versatility (very few libraries are adapted to
| Apple silicon, let alone optimized).
|
| Very few people who would consider an H100/A100/A800 are
| going to be cross-shopping a macbook pro for their
| workloads.
| LoganDark wrote:
| > very few libraries are adapted to Apple silicon, let
| alone optimized
|
| This is a joke, right? Have you been anywhere in the LLM
| ecosystem for the past year or so? I'm constantly hearing
| about new ways in which ASi outperforms traditional
| platforms, and new projects that are optimized for ASi.
| Such as, for instance, llama.cpp.
| cavisne wrote:
| Nothing compared to Nvidia though. The FLOPS and memory
| bandwidth is simply not there.
| spudlyo wrote:
| The memory bandwidth of the M2 Ultra is around 800GB/s
| versus 1008GB/s for the 4090. While it's true the M2 has
| neither the bandwidth nor the GPU power, it is not
| limited to 24G of VRAM per card. The 192G upper limit on
| the M2 Ultra will have a much easier time running
| inference on a 70+ billion parameter model, if that is
| your aim.
|
| Besides size, heat, fan noise, and not having to build it
| yourself, this is the only area where Apple Silicon might
| have advantage over a homemade 4090 rig.
| LoganDark wrote:
| It doesn't need GPU power to beat the 4090 in benchmarks:
| https://appleinsider.com/articles/23/12/13/apple-
| silicon-m3-...
| int_19h wrote:
| It doesn't beat RTX 4090 when it comes to actual LLM
| inference speed. I bought a Mac Studio for local
| inference because it was the most convenient way to get
| something _fast enough_ and with enough RAM to run even
| 155b models. It's great for that, but ultimately it's
| not magic - NVidia hardware still offers more FLOPS and
| faster RAM.
| LoganDark wrote:
| > It doesn't beat RTX 4090 when it comes to actual LLM
| inference speed
|
| Sure, whisper.cpp is not an LLM. The 4090 can't even do
| inference at all on anything over 24GB, while ASi can
| chug through it even if slightly slower.
|
| I wonder if with https://github.com/tinygrad/open-gpu-
| kernel-modules (the 4090 P2P patches) it might become a
| lot faster to split a too-large model across multiple
| 4090s and still outperform ASi (at least until someone at
| Apple does an MLX LLM).
| dragonwriter wrote:
| > The 4090 can't even do inference at all on anything
| over 24GB, while ASi can chug through it even if slightly
| slower.
|
| Common LLM runners can split model layers between VRAM
| and system RAM; a PC rig with a 4090 can do inference on
| models larger than 24G.
|
| Where the crossover point where having the whole thing on
| Apple Silicon unified memory vs. doing split layers on a
| PC with a 4090 and system RAM is, I don't know, but its
| definitely not "more than 24G and a 4090 doesn't do
| anything".
| LoganDark wrote:
| > Common LLM runners can split model layers between VRAM
| and system RAM; a PC rig with a 4090 can do inference on
| models larger than 24G.
|
| Sure and ASi can do inference on models larger than the
| Unified Memory if you account for streaming the weights
| from the SSD on-demand. That doesn't mean it's going to
| be as fast as keeping the whole thing in RAM, although
| ASi SSDs are probably not particularly bad as far as SSDs
| go.
| treprinum wrote:
| Slightly slower in this case is like 10x. I have M3 Max
| with 128GB RAM, 4090 trashes it on anything under 24GB,
| then M3 Max trashes it on anything above 24GB, but it's
| like 10x slower at it than 4090 on <24GB.
| numpad0 wrote:
| PSA for all people who are still being misled by hand-
| wavy Apple M1 marketing charts[1] implying total
| dominance of M-series wondersilicon obsoleting all
| Intel/NVIDIA PCs:
|
| There is benchmark data showing that an Apple M2 Ultra
| is 47% and 60% slower than a Xeon W9 and an RTX 4090, or
| 0.35% and 2% slower than an i9-13900K and an RTX 4060
| Ti, respectively, in Geekbench 5 Multi-threaded and
| OpenCL Compute tests.
|
| Apple Silicon Macs are NOT faster than competing desktop
| computers, nor was the M1 massively faster than the
| NVIDIA 3070 (the desktop part, 2x faster than the laptop
| variant the M1 was compared against) for that matter.
| They just offer up to 128GB shared RAM/VRAM options in
| slim desktops and laptops, which is handy for LLMs;
| that's it.
| Please stop taking Apple marketing materials at full face
| value or above. Thank you.
|
| 1: https://i.extremetech.com/imagery/content-
| types/03ekiQwNudC75iOK4AMuEkw/images-2.jpg
| 2: screenshot from[4]: https://www.igorslab.de/wp-
| content/uploads/2023/06/Apple-M2-ULtra-SoC-
| Geekbench-5-Multi-Threaded.jpg
| 3: screenshot from[4]: https://www.igorslab.de/wp-
| content/uploads/2023/06/Apple-M2-ULtra-SoC-
| Geekbench-5-OpenCL-Compute.jpg
| 4: https://wccftech.com/apple-m2-ultra-soc-isnt-faster-than-
| amd-intel-last-year-desktop-cpus-50-slower-than-nvidia-
| rtx-4080/
| LoganDark wrote:
| Yeah. Let me just walk down to Best Buy and get myself a
| GPU with over 24 gigabytes of VRAM (impossible) for less
| than $3,000 (even more impossible). Then tell me ASi is
| nothing compared to Nvidia.
|
| Even the A100 for something around $15,000 (edit: used to
| say $10,000) only goes up to 80 gigabytes of VRAM, but a
| 192GB Mac Studio goes for under $6,000.
|
| Those figures alone prove Nvidia isn't even competing in
| the consumer or even the enthusiast space anymore. They
| know you'll buy their hardware if you really need it, so
| they aggressively segment the market with VRAM
| restrictions.
| andersa wrote:
| Where are you getting an A100 80GB for $10k?
| LoganDark wrote:
| Oops, I remembered it being somewhere near $15k but
| Google got confused and showed me results for the 40GB
| instead so I put $10k by mistake. Thanks for the
| correction.
|
| A100 80GB goes for around $14,000 - $20,000 on eBay and
| A100 40GB goes for around $4,000 - $6,000. New (not from
| eBay - from PNY and such), it looks like an 80GB would
| set you back $18,000 to $26,000 depending on whether you
| want HBM2 or HBM2e.
|
| Meanwhile you can buy a Mac Studio today without going
| through a distributor and they're under $6,000 if the
| only thing you care about is having 192GB of Unified
| Memory.
|
| And while the memory bandwidth isn't quite as high as the
| 4090, the M-series chips can run certain models faster
| anyway, if Apple is to be believed
| andersa wrote:
| Sure, it's also at least an order of magnitude slower in
| practice, compared to 4x 4090 running at full speed. We're
| looking at 10 times the memory bandwidth and _much_ greater
| compute.
| chaostheory wrote:
| Yeah, even a Mac Studio is way too slow compared to Nvidia
| which is too bad because at $7000 maxed to 192gb it would
| be an easy sell. Hopefully, they will fix this by m5. I
| don't trust the marketing for m4
| thangngoc89 wrote:
| training on MPS backend is suboptimal and really slow.
| wtallis wrote:
| Do people do training on systems this small, or just
| inference? I could see maybe doing a little bit of fine-
| tuning, but certainly not from-scratch training.
| redox99 wrote:
| If you mean train llama from scratch, you aren't going to
| train it on any single box.
|
| But even with a single 3090 you can do quite a lot with
| LLMs (through QLoRA and similar).
| thangngoc89 wrote:
| Yep. Price/performance of a multi-4090 system is way
| better than the professional cards (Axxx). Also, deep
| learning outside of LLMs has many different uses.
| llm_trw wrote:
| So is a TI-89.
| amelius wrote:
| And looks way cooler
| numpad0 wrote:
| 4x32GB(128GB) DDR4 is ~$250. 4x48GB(192GB) DDR5 is ~$600.
| Those are even cheaper than upgrade options for Macs($1k).
| papichulo2023 wrote:
| Not many consumer mobos support 192GB DDR5.
| wtallis wrote:
| If it supports DDR5 at all, then it should be at most a
| firmware update away from supporting 48GB dual-rank
| DIMMs. There are very few consumer motherboards that only
| have two DDR5 slots; almost all have the four slots
| necessary to accept 192GB. If you are under the
| impression that there's a widespread limitation on
| consumer hardware support for these modules, it may
| simply be due to the fact that 48GB modules did not exist
| yet when DDR5 first entered the consumer market, and such
| modules did not start getting mentioned on spec sheets
| until after they existed.
| imtringued wrote:
| You don't want to use more than two slots because you
| only have two memory channels. The overclocking potential
| of DDR5 is extremely high when you only run two DIMMs.
| All the way up to 8000. Meanwhile if you go for
| populating all four slots, you are limited significantly
| below 5000. Almost a 50% performance drop if you are
| willing to overclock your RAM.
| wtallis wrote:
| If you want to run something that doesn't fit in 96GB of
| RAM, you'll get better performance from _having enough
| RAM_. Yes, having two dual-rank DIMMs per channel will
| force you to run at a slower speed, but it's still far
| faster than your SSD. The second slot per channel exists
| precisely because many people really _do_ want to use it.
| ojbyrne wrote:
| A lot that have specs showing they support a max of 4x32
| DDR5 actually support 4x48 DDR5 via recent BIOS updates.
| papichulo2023 wrote:
| In the specs, yeah; in practice hardly anyone has gotten
| it working. As far as I saw on Reddit, it requires
| customizing timings to make 4 slots work over 6000 MHz at
| the same time.
| faeriechangling wrote:
| Most consumer mobos I see support this even if the setup
| isn't on the QVL. If a DDR5 motherboard supports 4 sticks
| at all, you can probably run 192GB on it so long as you
| update the BIOS firmware. The problem is running at rated
| speeds.
|
| AMD tends to be worse than Intel, and I hear of people
| having to run anywhere between DDR5-3200 and DDR5-5200.
| You are better off running two sticks, because even with
| 2 sticks you really can't run larger models with
| acceptable performance anyway, much less with 4.
|
| There is competition to apple on the low end (dual
| channel fast DDR5) and on the high end (8+ channel like
| Xeon/Epyc/AmpereOne). In the middle, Apple is sort of
| crushing because if you run a true 4 channel system
| you're going to get poor performance if you load up a
| 192gb model, and if you compare pricing to 96gb/128gb
| apple systems, there's not all that much of a cost
| advantage and you have to make a lot of sacrifices to get
| there. The truth is that Apple really doesn't have all
| that much competition right now and won't for the
| foreseeable future.
| papichulo2023 wrote:
| Hopefully Qualcomm will free us from this two-channel
| nightmare.
| faeriechangling wrote:
| I'm optimistic about APUs personally like AMDs upcoming
| Strix Halo APU with a 256-bit memory bus competing at the
| lower end of the market, but that will only provide so
| much competition.
| wtallis wrote:
| I don't think it's realistic to pin your hopes on
| Qualcomm given that they're unlikely to care about
| supporting anything other than LPDDR with their laptop
| processors.
| faeriechangling wrote:
| Buying a MacBook for AI is great if you were already going to
| buy a MacBook, as this makes it a lot more cost competitive.
| It's also great if what you're doing is REALLY privacy
| sensitive, such as if you're a lawyer, where uploading client
| data to OpenAI is probably not appropriate or legal.
|
| But in general, I find the appeal is narrow because
| consumer GPUs are better both for training and for
| inferencing at scale[1]. Cloud services also allow the
| vast majority of individuals to get higher quality
| inferencing at lower cost. The result is that Apple
| Silicon's appeal is quite niche.
|
| [1] Mind you, Nvidia considers this a licensing violation,
| not that GeoHot has historically ever been all scared to
| violate a EULA and force a company to prove its terms have
| legal force.
| Tepix wrote:
| 6 GPUs because they want fast storage, and it uses PCIe
| lanes.
|
| Besides, the goal was to run a 70B FP16 model (requiring
| roughly 140GB of VRAM). 6*24GB = 144GB
| andersa wrote:
| That calculation is incorrect. You need to fit both the model
| (140GB) and the KV cache (5GB at 32k tokens FP8 with flash
| attention 2) * batch size into VRAM.
|
| If the goal is to run a FP16 70B model as fast as possible,
| you would want 8 GPUs with P2P, for a total of 192GB VRAM.
| The model is then split across all 8 GPUs with 8-way tensor
| parallelism, letting you make use of the full 8TB/s memory
| bandwidth on every iteration. Then you have 50GB spread out
| remaining for KV cache pages, so you can raise the batch size
| up to 8 (or maybe more).
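| A back-of-the-envelope version of that budget (a sketch;
| the ~5GB-per-sequence KV figure is taken from the comment
| above, and real engines add activation and allocator
| overhead, so the usable batch is lower):
|
|     # VRAM budget for a 70B FP16 model on 8x 24GB GPUs.
|     weights_gb = 70e9 * 2 / 1e9    # FP16 = 2 bytes/param -> 140 GB
|     kv_per_seq_gb = 5              # ~32k tokens, FP8 KV cache
|     total_vram_gb = 8 * 24         # 192 GB across 8 GPUs
|
|     free_gb = total_vram_gb - weights_gb
|     max_batch = int(free_gb // kv_per_seq_gb)   # upper bound only
|     print(f"{free_gb:.0f} GB left for KV -> batch up to ~{max_batch}")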
| renewiltord wrote:
| I've got a few 4090s that I'm planning on doing this with.
| Would appreciate even the smallest directional tip you can
| provide on splitting the model that you believe is likely
| to work.
| andersa wrote:
| The split is done automatically by the inference engine
| if you enable tensor parallelism. TensorRT-LLM, vLLM and
| aphrodite-engine can all do this out of the box. The main
| thing is just that you need either 4 or 8 GPUs for it to
| work on current models.
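| For example, a sketch using vLLM (the model name is only
| an illustration; whatever model you pick, its weights
| have to fit in the combined VRAM of the GPUs you split
| across):
|
|     from vllm import LLM, SamplingParams
|
|     # vLLM shards the model with tensor parallelism when
|     # tensor_parallel_size is set to the number of GPUs.
|     llm = LLM(model="meta-llama/Llama-2-13b-hf",
|               tensor_parallel_size=4,
|               dtype="float16")
|     out = llm.generate(["Hello, my name is"],
|                        SamplingParams(max_tokens=32))
|     print(out[0].outputs[0].text)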
| renewiltord wrote:
| Thank you! Can I run with 2 GPUs or with heterogeneous
| GPUs that have same RAM? I will try. Just curious if you
| already have tried.
| andersa wrote:
| 2 GPUs works fine too, as long as your model fits. Using
| different GPUs with same VRAM however, is highly highly
| sketchy. Sometimes it works, sometimes it doesn't. In any
| case, it would be limited by the performance of the
| slower GPU.
| renewiltord wrote:
| All right, thank you. I can run it on 2x 4090 and just
| put the 3090s in different machine.
| Tepix wrote:
| I know there's some overhead, it's not my calculation.
|
| https://www.tweaktown.com/news/97110/tinycorps-new-
| tinybox-a...
|
| Quote: " _Runs 70B FP16 LLaMA-2 out of the box using
| tinygrad_ "
|
| Related: https://github.com/tinygrad/tinygrad/issues/3791
| ShamelessC wrote:
| > Many things will only function well with 4 or 8 GPUs
|
| What do you mean?
| andersa wrote:
| For example, if you want to run low latency multi-GPU
| inference with tensor parallelism in TensorRT-LLM, there is a
| requirement that the number of heads in the model is
| divisible by the number of GPUs. Most current published
| models are divisible by 4 and 8, but not 6.
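| A tiny illustration of that constraint (a sketch; 64
| heads is just an example count, e.g. a Llama-2-70B-sized
| model):
|
|     num_heads = 64
|     for gpus in (2, 4, 6, 8):
|         ok = num_heads % gpus == 0
|         print(f"{gpus} GPUs: {'even split' if ok else 'uneven'}")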
| bick_nyers wrote:
| Interesting... 1 Zen 4 EPYC CPU yields a maximum of 128
| PCIe lanes, so it wouldn't be possible to put 8 full-fat
| GPUs on while maintaining some lanes for storage and
| networking. Same deal with Threadripper Pro.
| andersa wrote:
| It should be possible with onboard PCIe switches. You
| probably don't need the networking or storage to be all
| that fast while running the job, so it can dedicate
| almost all of the bandwidth to the GPU.
|
| I don't know if there are boards that implement this,
| though, I'm only looking at systems with 4x GPUs
| currently. Even just plugging in a 5kW GPU server in my
| apartment would be a bit of a challenge. With 4x 4090,
| the max load would be below 3kW, so a single 240V plug
| can handle it no issue.
| thangngoc89 wrote:
| 8 GPUs x 16 PCIe lanes each = 128 lanes already.
|
| That's the limit of single CPU platforms.
| bick_nyers wrote:
| I've seen it done with a PLX Multiplexer as well, but
| they add quite a bit of cost:
|
| https://c-payne.com/products/pcie-gen4-switch-
| backplane-4-x1...
|
| Not sure if there exists an 8-way PCIE Gen 5 Multiplexer
| that doesn't cost ludicrous amounts of cash. Ludicrous
| being a highly subjective and relative term of course.
| namibj wrote:
| 98 lanes of PCIe 4.0 fabric switch as just the chip (to
| solder onto a motherboard/backplane) costs $850
| (PEX88096). You could for example take 2 x16 GPUs, pass
| them through (2*2*16 = 64 lanes), and have 2 x16 that
| bifurcate to at least x4 (might even be x2, I didn't find
| that part of the docs just now) for anything you want,
| plus 2 x1 for minor stuff. They do claim to have no
| problems being connected up into a switching fabric, and
| very much allow multi-host operation (you will need
| signal retimers quite soon, though).
|
| They're the stuff that enables cloud operators to pool
| like 30 GPUs across like 10 CPU sockets while letting you
| virtually hot-plug them to fit demand. Or when you want
| to make a SAN with real NVMe-over-PCIe. Far cheaper than
| normal networking switches with similar ports (assuming
| hosts doing just x4 bifurcation, it's very comparable to
| a 50G Ethernet port. The above chip thus matches a 24
| port 50G Ethernet switch. Trading reach for only needing
| retimers, not full NICs, in each connected host. Easily
| better for HPC clusters up to about 200 kW made from
| dense compute nodes.), but sadly still lacking affordable
| COTS parts that don't require soldering or contacting
| sales for pricing (the only COTS with list prices seem to
| be Broadcom's reference designs, for prices befitting an
| evaluation kit, not a Beowulf cluster).
| pests wrote:
| I really like the information about how the cloud
| providers do their multiplexing, thanks. There was some
| tech posted here a few months ago that was similar which
| I found very interesting - plug all devices, RAM, hard
| drives, and CPUs into a larger fabric, with a way to spin
| up "servers" of any size from the pool of resources...
| wish I could remember the name now.
|
| nit: HN formatting messed up your math in the second
| sentence, I believe you italicized by accident using *
| for equations.
| segfaultbuserr wrote:
| It's more difficult to split your work across 6 GPUs evenly,
| and easier when you have 4 or 8 GPUs. The latter setups have
| powers of 2, which for example, can evenly divide a 2D or 3D
| grid, but 6 GPUs are awkward to program. Thus, the OP argues
| that a 6-GPU setup is highly suboptimal for many existing
| applications and there's no point to pay more for the extra
| 2.
| numpad0 wrote:
| I was googling public NVIDIA SXM2 materials the other
| day, and it seemed SXM2/NVLink 2.0 was just a six-way
| system. NVIDIA SXM has been updated to versions 3 and 4
| since, and this isn't based on any of those anyway, but
| maybe there's something we don't know that makes six-way
| reasonable.
| andersa wrote:
| It was probably just before running LLMs with tensor
| parallelism became interesting. There are plenty of other
| workloads that can be divided by 6 nicely, it's not an end-
| all thing.
| dheera wrote:
| What is a six-way system?
| TylerE wrote:
| Old-school way of saying core (or in this case GPU),
| basically.
| liuliu wrote:
| 6 seems reasonable. 128 Lanes from ThreadRipper needs to have a
| few for network and NVMe (4x NVMe would be x16 lanes, and 10G
| network would be another x4 lanes).
| cjbprime wrote:
| I don't think P2P is very relevant for inference. It's
| important for training. Inference can just be sharded across
| GPUs without sharing memory between them directly.
| andersa wrote:
| It can make a difference when using tensor parallelism to run
| small batch sizes. Not a huge difference like training
| because we don't need to update all weights, but still a
| noticeable one. In the current inference engines there are
| some allreduce steps that are implemented using nccl.
|
| Also, paged KV cache is usually spread across GPUs.
| namibj wrote:
| Batching during inference massively helps arithmetic
| intensity, and the batch sizes you'd want for that tend
| to exceed the memory capacity of a single GPU. Hence the
| desire to do training-like cluster processing, e.g. to
| use a weight for every inference stream that needs it
| each time it's fetched from memory. It's just that you
| typically can't fit 100+ inference streams of context on
| one GPU, thus the desire to shard along less wasteful
| (w.r.t. memory bandwidth) dimensions than entire
| inference streams.
| qeternity wrote:
| You are talking about data parallelism. Depending on the
| model tensor parallelism can still be very important for
| inference.
| georgehotz wrote:
| tinygrad supports uneven splits. There's no fundamental reason
| for 4 or 8, and work should almost fully parallelize on any
| number of GPUs with good software.
|
| We chose 6 because we have 128 PCIe lanes, aka 8 16x ports. We
| use 1 for NVMe and 1 for networking, leaving 6 for GPUs to
| connect them in full fabric. If we used 4 GPUs, we'd be wasting
| PCIe, and if we used 8 there would be no room for external
| connectivity aside from a few USB3 ports.
| doctorpangloss wrote:
| Have you compared 3x 3090-3090 pairs over NVLink?
|
| IMO the most painful thing is that since these hardware
| configurations are esoteric, there is no software that
| detects them and moves things around "automatically,"
| regardless of what people think device_map="auto" does.
| And anyway, Hugging Face's transformers/diffusers are all
| over the place.
| davidzweig wrote:
| Is it possible a similar patch would work for P2P on 3090s?
|
| btw, I found a Gigabyte board on Taobao that is unlisted on
| their site: MZF2-AC0, costs $900. 2 socket Epyc and 10 PCIE
| slots, may be of interest. A case that should fit, with 2x
| 2000W Great Wall PSUs and PDU is 4050 RMB
| (https://www.toploong.com/en/4GPU-server-case/644.html). You
| still need blower GPUs.
| cjbprime wrote:
| Doesn't nvlink work natively on 3090s? I thought it was
| only removed (and here re-enabled) in 4090.
| qeternity wrote:
| This is not nvlink.
| georgehotz wrote:
| It should if your 3090s have Resizable BAR support in the
| VBIOS. AFAIK most card manufacturers released BIOS updates
| enabling this.
|
| Re: 3090 NVLink, that only allows pairs of cards to be
| connected. PCIe allows full fabric switch of many cards.
| Ratiofarmings wrote:
| In cases where they didn't, the techpowerup vBIOS
| collection solves the problem.
| davidzweig wrote:
| Update, I checked with the case company, toploong, they say
| that board is a 5mm too big or so for the case.
| andersa wrote:
| That is very interesting if tinygrad can support it! Every
| other library I've seen had the limitation on dividing the
| heads, so I'd (perhaps incorrectly) assumed that it's a
| general problem for inference.
| spmurrayzzz wrote:
| There are some interesting hacks you can do like
| replicating the K/V weights by some factor which allows
| them to be evenly divisible by whatever number of gpus you
| have. Obviously there is a memory cost there, but it does
| work.
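| A minimal sketch of that trick (shapes and counts are
| illustrative only; real engines replicate the actual K/V
| projection weights, trading VRAM for an even split):
|
|     import torch
|
|     # 8 KV heads don't split across 6 GPUs, but replicating
|     # each head 3x gives 24 heads, i.e. 4 per GPU.
|     kv_heads, head_dim, hidden = 8, 128, 8192
|     w_kv = torch.randn(kv_heads, head_dim, hidden)
|     w_kv_rep = w_kv.repeat_interleave(3, dim=0)
|     print(w_kv_rep.shape,
|           "divisible by 6:", w_kv_rep.shape[0] % 6 == 0)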
| AnthonyMouse wrote:
| Is there any reason you couldn't use 7? 8 PCIe lanes each
| seems more than sufficient for NVMe and networking.
| WanderPanda wrote:
| Did you at least front-run the market and stock up on
| 4090s before this release? Also, gamers are probably not
| too happy about these developments :D
| tinco wrote:
| 4090's have consistently been around 2000 dollars. I don't
| think there's many gamers who would be affected by price
| fluctuations of the 4090 or even the 4080.
| presides wrote:
| This is out of touch; they were mad before and they will
| be mad again. Lots of people spend a huge chunk of their
| modest disposable income on high end gaming gear, and the
| only upside of these issues for them is that eventually,
| YEARS down the line, capacity/supply issues MIGHT calm
| down in a way that yields some benefits.
|
| They're going to realize soon enough that they've
| basically just been told that the extremely shitty
| problem they thought they'd moved beyond is back with a
| vengeance and the next generation of gaming cards has the
| potential to make the past few rounds of scalping shit-
| shows look tame.
| swalsh wrote:
| Gamers have a TON of really good, really affordable
| options. But you kind of need 24GB minimum unless you're
| using heavy quantization. So 3090s and 4090s are what
| local LLM people are building with (mostly 3090s, as you
| can get them for about $700, and they're dang good)
| boromi wrote:
| Any chance you could share the details of the build you'd
| go for? I need a server for our lab, but am kinda out of
| my depth with all the options.
| xipho wrote:
| You can watch this happen on the weekends, typically,
| sometimes for some very long sessions.
| https://www.twitch.tv/georgehotz
| BeefySwain wrote:
| Can someone ELI5 what this may make possible that wasn't possible
| before? Does this mean I can buy a handful of 4090s and use it in
| lieu of an h100? Just adding the memory together?
| segfaultbuserr wrote:
| No. The Nvidia A100 has a multi-lane NVLink interface with a
| total bandwidth of 600 GB/s. The "unlocked" Nvidia RTX 4090
| uses PCIe P2P at 50 GB/s. It's not going to replace A100 GPUs
| for serious production work, but it does unlock a datacenter-
| exclusive feature and has some small-scale applications.
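| A rough sense of what that difference means in practice
| (a sketch; it ignores latency, overlap, and
| ring-allreduce math):
|
|     # Time to move 14 GB (e.g. a 7B-parameter model's FP16
|     # gradients) once over each link.
|     payload_gb = 14
|     for name, gb_per_s in (("NVLink (A100)", 600),
|                            ("PCIe P2P (4090)", 50)):
|         ms = payload_gb / gb_per_s * 1000
|         print(f"{name}: {ms:.0f} ms per pass")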
| xmorse wrote:
| Finally switched to Nvidia and already adding great value
| perfobotto wrote:
| What stops nvidia from making sure this stops working in future
| driver releases?
| __MatrixMan__ wrote:
| The law, hopefully.
|
| Beeper mini only worked with iMessage for a few days before
| Apple killed it. A few months later the DOJ sued Apple. Hacks
| like this show us the world we could be living in, a world
| which can be hard to envision otherwise. If we want to actually
| live in that world, we have to fight for it (and protect the
| hackers besides).
| StayTrue wrote:
| I was thinking the same but in terms of firmware updates.
| aresant wrote:
| So assuming you utilized this with (4) x 4090s, is there
| a theoretical performance comparison vs the A6000 / other
| professional lines?
| thangngoc89 wrote:
| I believe this is mostly for memory capacities. PCIe access
| between GPUs is slower than soldered RAM on a single GPU
| andersa wrote:
| It depends on what you do with it and how much bandwidth it
| needs between the cards. For LLM inference with tensor
| parallelism (usually limited by VRAM read bandwidth, but little
| exchange needed) 2x 4090 will massively outperform a single
| A6000. For training, not so much.
| c0g wrote:
| Any idea of DDP perf?
| No1 wrote:
| The original justification that Nvidia gave for removing Nvlink
| from the consumer grade lineup was that PCIe 5 would be fast
| enough. They then went on to release the 40xx series without PCIe
| 5 and P2P support. Good to see at least half of the equation
| being completed for them, but I can't imagine they'll allow this
| in the next gen firmware.
| musha68k wrote:
| OK now we are seemingly getting somewhere. I can feel the
| enthusiasm coming back to me.
|
| Especially in light of what's going on with LocalLLaMA etc:
|
| https://www.reddit.com/r/LocalLLaMA/comments/1c0mkk9/mistral...
| thangngoc89 wrote:
| > You may need to uninstall the driver from DKMS. Your system
| needs large BAR support and IOMMU off.
|
| Can someone point me to the correct tutorial on how to do these
| things?
| unaindz wrote:
| The first one, I assume, is the nvidia driver for Linux
| installed using DKMS. Whether it uses DKMS or not is
| stated in the driver's name, at least on Arch-based
| distributions.
|
| The latter options are settings in your motherboard BIOS.
| If your computer is modern, explore your BIOS and you
| will find them.
| jasomill wrote:
| DKMS: uninstall Nvidia driver using distro package manager
|
| BAR: enable resizable BAR in motherboard CMOS setup
|
| IOMMU: Add "amd_iommu=off" or "intel_iommu=off" to kernel
| command line for AMD or Intel CPU, respectively (or just add
| both). You may or may not need to disable the IOMMU in CMOS
| setup (Intel calls its IOMMU VT-d).
|
| See motherboard docs for specific option names. See distro docs
| for procedures to list/uninstall packages and to add kernel
| command line options.
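| A quick sanity check of the resulting configuration (a
| sketch that reads standard Linux sysfs paths; the "BAR1
| is the VRAM aperture" assumption holds for typical NVIDIA
| GPUs, and the IOMMU check is only a heuristic):
|
|     import glob, os
|
|     for dev in glob.glob("/sys/bus/pci/devices/*"):
|         with open(os.path.join(dev, "vendor")) as f:
|             if f.read().strip() != "0x10de":   # NVIDIA vendor ID
|                 continue
|         with open(os.path.join(dev, "resource")) as f:
|             bars = f.read().splitlines()
|         # With resizable BAR enabled, BAR1 should span (nearly)
|         # the whole VRAM rather than the legacy 256 MiB window.
|         start, end, _ = (int(x, 16) for x in bars[1].split())
|         print(dev, "BAR1:", (end - start + 1) // 2**20, "MiB")
|
|     groups = glob.glob("/sys/kernel/iommu_groups/*")
|     print("IOMMU groups:", len(groups), "(0 means IOMMU is off)")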
| spxneo wrote:
| does this mean you can horizontally scale to a GPT-4-esque
| LLM locally in the near future? (i hear you need 1TB of
| VRAM)
|
| Does Apple's large unified memory offering, like 192GB,
| have the fastest bandwidth, and if so how will pairing a
| bunch of 4090s like in the comments work?
| lawlessone wrote:
| This is very interesting.
|
| I can't afford two mortgages though, so for me it will
| have to just stay as something interesting :)
| m3kw9 wrote:
| In layman terms what does this enable?
| vladgur wrote:
| curious if this will ever make it to 3090s
| cavisne wrote:
| How does this compare in bandwidth and latency to nvlink? (I'm
| aware it's not available on the consumer cards)
| wmf wrote:
| It's 5x-10x slower.
| modeless wrote:
| What are the chances that Nvidia updates the firmware to disable
| this and prevents downgrading with efuses? Someday cards that
| still have older firmware may be more valuable. I'd be cautious
| upgrading drivers for a while.
| theturtle32 wrote:
| WTF is P2P?
| theturtle32 wrote:
| Answered my own question with a Google search:
|
| https://developer.nvidia.com/gpudirect#:~:text=LEARN%20MORE%...
| .
|
| > GPUDirect Peer to Peer > Enables GPU-to-GPU copies as well as
| loads and stores directly over the memory fabric (PCIe,
| NVLink). GPUDirect Peer to Peer is supported natively by the
| CUDA Driver. Developers should use the latest CUDA Toolkit and
| drivers on a system with two or more compatible devices.
| tanelpoder wrote:
| I also love that it can be done with just a few code line
| changes:
|
| https://github.com/NVIDIA/open-gpu-kernel-modules/commit/1f4...
| waldrews wrote:
| Would this approach be possible to extend downmarket, to older
| consumer cards? For a lot of LLM use cases we're constrained by
| memory and can tolerate lower compute speeds so long as there's
| no swapping. ELI5, what would prevent a hundred 1060-level cards
| from being used together?
| Sebb767 wrote:
| > ELI5, what would prevent a hundred 1060-level cards from
| being used together?
|
| In this case, you'd only have a single PCIe (v3!) lane per
| card, making the interconnect speed horribly slow. You'd
| also need to invest thousands of dollars in hardware to
| get all of those cards connected and, unless power is
| free, you'd outspend any theoretical savings instantly.
|
| In general, if you go back in card generations, you'll
| quickly hit such low memory limits and slow compute that
| a modern CPU-based setup is better value.
| qxfys wrote:
| I am amazed how people always find a way to make this kind of
| thing work. kudos!
| chriskanan wrote:
| This is great news. As an academic, I'm aware of multiple labs
| that built boxes with 4090s, not realizing that Nvidia had
| impaired P2P communication among cards. It's one of the reasons I
| didn't buy 4090s, despite them being much more affordable for my
| work. It isn't nvlink, but Nvidia has mostly gotten rid of that
| except for their highest end cards. It is better than nothing.
|
| Late last year, I got quotes for machines with four nvlink H100s,
| but the lead time for delivery was 13 months. I could get the
| non-nvlink ones in just four months. For now, I've gone with four
| L40S cards to hold my lab over but supply chain issues and
| gigantic price increases are making it very hard for my lab to do
| it's work. That's not nearly enough to support 6 PhD students and
| a bunch of undergrads.
|
| Things were a lot easier when I could just build machines with
| two GPUs each with Nvlink for $5K each and give one to each
| student to put under their desks, which is what I did back in
| 2015-2018 at my old university.
| uniqueuid wrote:
| And before that, Nvidia made our lives harder by phasing out
| blower-style designs in consumer cards that we could put in
| servers. In my lab, I'd take a card for 1/4 the price that has
| half the MTBF over a card for full price anytime.
| photonbeam wrote:
| How does cost compare with some of the GPU-cloud providers?
| uniqueuid wrote:
| Not op, but I found this benchmark of whisper large-v3
| interesting [1]. It includes the cloud provider's pricing per
| gpu, so you can directly calculate break-even timing.
|
| Of course, if you use different models, training, fine tuning
| etc. the benchmarks will differ depending on ram, support of
| fp8 etc.
|
| [1] https://blog.salad.com/whisper-large-v3/
| jeffs4271 wrote:
| It is cool seeing hacks like this. But this is something to be
| careful with, as GH100 had hardware changes to meet CUDA fence
| requirements.
| gururise wrote:
| How long before Nvidia patches this?
___________________________________________________________________
(page generated 2024-04-14 23:02 UTC)