[HN Gopher] Hacked Nvidia 4090 GPU driver to enable P2P
___________________________________________________________________
Hacked Nvidia 4090 GPU driver to enable P2P
Author : nikitml
Score : 530 points
Date : 2024-04-12 09:27 UTC (13 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jagrsw wrote:
| Was it George himself, or a person working for a bounty that was
| set up by tinycorp?
|
| Also, a question for those knowledgeable about the PCI subsys: it
| looked like something NVIDIA didn't care about, rather than
| something they actively wanted to prevent, no?
| mtlynch wrote:
| Commits are by geohot, so it looks like George himself.
| throw101010 wrote:
| I've seen him work on tinygrad on his Twitch livestream
| a couple of times, so more than likely him indeed.
| squarra wrote:
| He also documented his progress on the tinygrad discord
| throwaway8481 wrote:
| I feel like I should say something about discord not being a
| suitable replacement for a forum or bugtracker.
| guywhocodes wrote:
| We are talking about a literal monologue while poking at a
| driver for a few hours, this wasn't a huge project.
| toast0 wrote:
| PCI devices have always been able to read and write to the
| shared address space (subject to IOMMU); most frequently used
| for DMA to system RAM, but not limited to it.
|
| So, poking around to configure the device to put the whole VRAM
| in the address space is reasonable, subject to support for
| resizable BAR or just having a fixed size large enough BAR. And
| telling one card to read/write from an address that happens to
| be mapped to a different card's VRAM is also reasonable.
|
| I'd be interested to know if PCI-e switching capacity will be a
| bottleneck, or if it'll just be the point-to-point links and
| VRAM that bottleneck. Saving a bounce through system RAM
| should help in either case though.
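|
| A minimal sketch of what that looks like from the CUDA runtime
| side, assuming two visible devices and a driver that reports
| peer access as supported (the device indices and 64 MiB size
| are placeholders, not anything taken from the patch itself):
|
|   // peer_copy.cu - check, enable and use P2P between GPU 0 and 1
|   #include <cuda_runtime.h>
|   #include <cstdio>
|
|   int main() {
|     int can01 = 0, can10 = 0;
|     cudaDeviceCanAccessPeer(&can01, 0, 1);  // can 0 reach 1's VRAM?
|     cudaDeviceCanAccessPeer(&can10, 1, 0);
|     if (!can01 || !can10) { std::printf("no P2P reported\n"); return 1; }
|
|     cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0
|     cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);
|
|     const size_t bytes = 64 << 20;  // 64 MiB test buffer
|     void *src = nullptr, *dst = nullptr;
|     cudaSetDevice(0); cudaMalloc(&src, bytes);
|     cudaSetDevice(1); cudaMalloc(&dst, bytes);
|
|     // Copies device-to-device through the PCIe address space,
|     // without bouncing through system RAM.
|     cudaMemcpyPeer(dst, 1, src, 0, bytes);
|     cudaDeviceSynchronize();
|
|     cudaFree(dst); cudaSetDevice(0); cudaFree(src);
|     return 0;
|   }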
| namibj wrote:
| Fixed large BAR exists in some older accelerator cards,
| e.g. iirc the MI50/MI60 from AMD (the data center variants
| of the Radeon VII, the first GPUs with PCIe 4.0, also
| famous for dominating memory bandwidth until the RTX
| 40-series took that claim back: 16GB of HBM delivering
| 1TB/s of memory bandwidth).
|
| It's notably not compatible with some legacy boot
| processes and iirc with 32-bit kernels in general, so
| consumer cards had to wait for resizable BAR to get the
| benefits of large BAR, that being direct flat memory
| mapping of VRAM so CPUs and PCIe peers can read and write
| all of VRAM without dancing through a command interface
| with doorbell registers. AFAIK it also lets a GPU talk
| directly to NICs and NVMe drives by running the driver in
| GPU code (I'm not sure how/if they let you properly
| interact with doorbell registers, but polled io_uring as
| an ABI would be no problem; I wouldn't be surprised if
| some NIC firmware already allows offloading this).
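|
| A rough way to see whether a card already exposes a large BAR
| is to read the BAR windows Linux reports in sysfs. A sketch
| (the PCI address 0000:01:00.0 is only an example, and treating
| one prefetchable BAR as the VRAM aperture is an assumption
| about the card's BAR layout, not something from this patch):
|
|   // bar_sizes.cpp - print the BAR sizes of one PCI device
|   #include <cstdio>
|   #include <fstream>
|   #include <iostream>
|   #include <string>
|
|   int main() {
|     // Each line of "resource" is: <start> <end> <flags>, in hex.
|     std::ifstream res("/sys/bus/pci/devices/0000:01:00.0/resource");
|     std::string line;
|     for (int idx = 0; std::getline(res, line); ++idx) {
|       unsigned long long start = 0, end = 0, flags = 0;
|       if (std::sscanf(line.c_str(), "%llx %llx %llx",
|                       &start, &end, &flags) != 3) continue;
|       if (start == 0 && end == 0) continue;  // unused entry
|       // Entries past BAR5 are the expansion ROM and friends.
|       std::cout << "BAR" << idx << ": "
|                 << ((end - start + 1) >> 20) << " MiB\n";
|     }
|     return 0;
|   }
|
| On a card with resizable BAR enabled (or a fixed large BAR),
| one of the prefetchable BARs should cover the full VRAM size
| rather than the classic 256 MiB window.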
| jsheard wrote:
| It'll be nice while it lasts, until they start locking this down
| in the firmware instead on future architectures.
| mnau wrote:
| Sure, but that was something that was always going to happen.
|
| So it's better to have it at least for one generation instead
| of no generation.
| HPsquared wrote:
| Is this one of those features that's disabled on consumer cards
| for market segmentation?
| mvkel wrote:
| Sort of.
|
| An imperfect analogy: a small neighborhood of ~15 houses is
| under construction. Normally it might have a 200kva transformer
| sitting at the corner, which provides appropriate power from
| the grid.
|
| But there is a transformer shortage, so the contractor installs
| a commercial grade 1250kva transformer. It can power many more
| houses than required, so it's operating way under capacity.
|
| One day, a resident decides he wants to start a massive grow
| farm, and figures out how to activate that extra transformer
| capacity just for his house. That "activation" is what geohot
| found
| bogwog wrote:
| That's a poor analogy. The feature is built into the cards
| that consumers bought, but Nvidia is disabling it via
| software. That's why a hacked driver can enable it again. The
| resident in your analogy is just freeloading off the
| contractor's transformer.
|
| Nvidia does this so that customers that need that feature are
| forced to buy more expensive systems instead of building a
| solution with the cheaper "consumer-grade" cards targeted at
| gamers and enthusiasts.
| bpye wrote:
| This isn't even the first time a hacked driver has been
| used to unlock some HW feature -
| https://github.com/DualCoder/vgpu_unlock
| captcanuk wrote:
| There was also this https://hackaday.com/2013/03/18/hack-
| removes-firmware-crippl... using resistors and a
| different one before that used a graphene lead pencil to
| enable functionality.
| segfaultbuserr wrote:
| Except that in the computer hardware world, the 1250 kVA
| transformer was used not because of a shortage, but because
| making a 1250 kVA transformer on the existing production
| line and selling it as 200 kVA is cheaper than creating a
| separate production line for 200 kVA transformers.
| m3kw9 wrote:
| Where is the hack in this analogy
| pixl97 wrote:
| Taking off the users panel on the side of their house and
| flipping it to 'lots of power' when that option had
| previously been covered up by the panel interface.
| cesarb wrote:
| Except that this "lots of power" option does not exist.
| What limits the amount of power used is the circuit
| breakers and fuses on the panel, which protect the wiring
| against overheating by tripping when too much power is
| being used (or when there's a short circuit). The
| resident in this analogy would need to ensure that not
| only the transformer, but also the wiring leading to the
| transformer, can handle the higher current, and replace
| the circuit breaker or fuses.
|
| And then everyone in that neighborhood would still lose
| power, because there's also a set of fuses _upstream_ of
| the transformer, and they would be sized for the correct
| current limit even when the transformer is oversized.
| These fuses also protect the wiring upstream of the
| transformer, and their sizing and timing are coordinated
| with fuses or breakers even further upstream so that any
| fault is cleared by the protective device closest to the
| fault.
| hatthew wrote:
| And then because this residential neighborhood now has
| commercial grade power, the other lots that were going to
| have residential houses built on them instead get combined
| into a factory, and the people who want to buy new houses in
| town have to pay more since residential supply was cut in
| half.
| HPsquared wrote:
| Excellent analogy of the other side of this issue.
| cesarb wrote:
| That's a bad analogy, because in your example, the consumer
| is using more of a shared resource (the available
| transformer, wiring, and generation capacity). In the case of
| the driver for a local GPU card, there's no sharing.
|
| A better example would be one in which the consumer has a
| dedicated transformer. For instance, a small commercial
| building which directly receives 3-phase 13.8 kV power; these
| are very common around here, and these buildings have their
| own individual transformers to lower the voltage to 3-phase
| 127V/220V.
| rustcleaner wrote:
| I am sure many will disagree-vote me, but I want to see this
| practice in consumer devices either banned or very heavily
| taxed.
| xandrius wrote:
| You're right. Especially because you didn't present your
| reasons.
| yogorenapan wrote:
| Curious as to your reasoning.
| llm_trw wrote:
| Skimming the readme this is p2p over PCIe and not NVLink in case
| anyone was wondering.
| klohto wrote:
| afaik the 4090 doesn't support PCIe 5.0, so you are limited
| to 4.0 speeds. Still an improvement.
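|
| For rough numbers: PCIe 4.0 signals at 16 GT/s per lane with
| 128b/130b encoding, so an x16 link tops out around 16 * 16 *
| (128/130) / 8 ~ 31.5 GB/s per direction; a PCIe 5.0 x16 link
| would double that to roughly 63 GB/s.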
| formerly_proven wrote:
| RTX 40 doesn't have NVLink on the PCBs, though the silicon has
| to have it, since some sibling cards support it. I'd expect it
| to be fused off.
| HeatrayEnjoyer wrote:
| How to unfuse it?
| magicalhippo wrote:
| I don't know about this particular scenario, but typically
| fuses are small wires or resistors that are overloaded so
| they irreversibly break the connection. Hence the name.
|
| Either done during manufacture or as a one-time
| programming[1][2].
|
| Though reprogrammable configuration bits are sometimes
| also called fuse bits. The ATmega328P of Arduino fame uses
| flash[3] for its "fuses".
|
| [1]: https://www.nxp.com/docs/en/application-note/AN4536.pdf
|
| [2]: https://www.intel.com/programmable/technical-pdfs/654254.pdf
|
| [3]: https://ww1.microchip.com/downloads/en/DeviceDoc/Atmel-7810-...
| HeatrayEnjoyer wrote:
| Wires, flash, and resistors can be replaced
| mschuster91 wrote:
| Not at the scale we're talking about here. These
| structures are _very_ thin, far thinner than bond wires,
| which are about the largest structures you can handle
| without a very, very specialized lab. And you'd need to
| unsolder the chip, de-cap it, hope the fuse wire you're
| trying to override is at the top layer, and that you can
| re-cap the chip afterwards and successfully solder it
| back on again.
|
| This may be workable for a nation state or a billion
| dollar megacorp, but not for your average hobbyist
| hacker.
| z33k wrote:
| You're absolutely right. In fact, some billion dollar
| megacorps use fuses as a part of hardware DRM for this
| reason.
| magicalhippo wrote:
| These are part of the chip, thus microscopic and very
| inaccessible.
|
| There are some good images here[1] of various such fuses,
| both pristine and blown. Here's[2] a more detailed
| writeup examining one type.
|
| It's not something you fix with a soldering iron.
|
| [1]: https://semiengineering.com/the-benefits-of-antifuse-otp/
|
| [2]: https://www.eetimes.com/a-look-at-metal-efuses/
| metadat wrote:
| I miss the days when you could do things like connecting
| the L5 bridges on the surface of the AMD Athlon XP
| Palomino [0] CPU packaging with a silver trace pen to
| transform them into fancier SMP multi-socket capable
| Athlon MPs, e.g. Barton [1].
|
| https://arstechnica.com/civis/threads/how-did-you-unlock-
| you...
|
| Some folks even got this working with only a pencil,
| haha.
|
| Nowadays, silicon designers have found highly effective
| ways to close off these hacking avenues, with techniques
| such as the microscopic, nearly invisible, and, as the
| parent post mentions, totally inaccessible e-fuses.
|
| [0] https://upload.wikimedia.org/wikipedia/commons/7/7c/K
| L_AMD_A...
|
| [1] https://en.wikichip.org/w/images/a/af/Atlhon_MP_%28.1
| 3_micro...
| mepian wrote:
| Use a Focused Ion Beam instrument.
| llm_trw wrote:
| A cursory google search suggests that it's been removed at
| the silicon level.
| steeve wrote:
| Some do: https://wccftech.com/gigabyte-geforce-rtx-4090-pcb-
| shows-lef...
| jsheard wrote:
| I'm pretty sure that's just a remnant of a 3090 PCB design
| that was adapted into a 4090 PCB design by the vendor. None
| of the cards based on the AD102 chip have functional
| NVLink, not even the expensive A6000 Ada workstation card
| or the datacenter L40 accelerator, so there's no reason to
| think NVLink is present on the silicon anymore below the
| flagship GA100/GH100 chips.
| klohto wrote:
| fyi should work on most 40xx[1]
|
| [1]
| https://github.com/pytorch/pytorch/issues/119638#issuecommen...
| clbrmbr wrote:
| If we end up with a compute governance model of AI control [1],
| this sort of thing could get your door kicked in by the CEA
| (Compute Enforcement Agency).
|
| [1] https://podcasts.apple.com/us/podcast/ai-safety-
| fundamentals...
| logicchains wrote:
| Looks like we're only a few years away from a bona fide
| cyberpunk dystopia, in which only governments and megacorps are
| allowed to use AI, and hackers working on their own hardware
| face regular raids from the authorities.
| tomoyoirl wrote:
| Mere raids from the authorities? I thought EliY was out there
| proposing airstrikes.
| the8472 wrote:
| In the sense that any other government regulation is also
| ultimately backed by the state's monopoly on legal use of
| force when other measures have failed.
|
| And contrary to what some people are implying he also
| proposes that everyone is subject to the same limitations,
| big players just like individuals. Because the big players
| haven't shown much of a sign of doing enough.
| tomoyoirl wrote:
| > In the sense that any other government regulation is
| also ultimately backed by the state's monopoly on legal
| use of force when other measures have failed.
|
| Good point. He was only ("only") _really_ calling for
| international cooperation and literal air strikes against
| big datacenters that weren't cooperating. This would
| presumably be more of a no-knock raid, breaching your
| door with a battering ram and throwing tear gas in the
| wee hours of the morning ;) or maybe a small
| extraterritorial drone through your window
| the8472 wrote:
| ... after regulation, court orders and fines have failed.
| Which under the premise that AGI is an existential threat
| would be far more reasonable than many other reasons for
| raids.
|
| If the premise is wrong we won't need it. If society
| coordinates to not do the dangerous thing we won't need
| it. The argument is that only in the case where we find
| ourselves in the situation where other measures have
| failed would such uses of force be the fallback option.
|
| I'm not seeing the odiousness of the proposal. If bio
| research gets commodified and easy enough that every kid
| can build a new airborne virus in their basement we'd
| need raids on that too.
| s2l wrote:
| Time to publish the next book in "Stealing the network"
| series.
| raxxorraxor wrote:
| To be honest, I see invoking AGI as an existential threat
| to be on the level of lizard people on the moon. Great for
| sci-fi, a bad distraction for policy making and addressing
| real problems.
|
| The real war, if there is one, is about owning data and
| collecting data. And surprisingly many people fall for
| distractions while their LLM fails at basic math. Because
| it is a language model of course...
| the8472 wrote:
| Freely flying through the sky on wings was scifi before
| the wright brothers. Something sounding like scifi is not
| a sound argument that it won't happen. And unlike lizard
| people we do have exponential curves to point at.
| Something stronger than a vibes-based argument would be
| good.
| dvdkon wrote:
| I consider the burden of proof to fall on those
| proclaiming AGI to be an existential threat, and so far I
| have not seen any convincing arguments. Maybe at some
| point in the future we will have many anthropomorphic
| robots and an AGI could hack them all and orchestrate a
| robot uprising, but at that point the robots would be the
| actual problem. Similarly, if an AGI could blow up
| nuclear power plants, so could well-funded human
| attackers; we need to secure the plants, not the AGI.
| the8472 wrote:
| You say you have not seen any arguments that convince
| you. Is that just not having seen many arguments or
| having seen a lot of arguments where each chain contained
| some fatal flaw? Or something else?
| cjbprime wrote:
| It doesn't sound like you gave serious thought to the
| arguments. The AGI doesn't need to hack robots. It has
| superhuman persuasion, by definition; it can "hack"
| (enough of) the humans to achieve its goals.
| CamperBob2 wrote:
| Then it's just a matter of evolution in action.
|
| And while it doesn't take a God to start evolution, it
| _would_ take a God to stop it.
| hollerith wrote:
| _You_ might be OK with suddenly dying along with all your
| friends and family, but I am not even if it is
| "evolution in action".
| CamperBob2 wrote:
| Historically governments haven't needed computers or AI
| to do that. They've always managed just fine.
|
| Punched cards helped, though, I guess...
| FeepingCreature wrote:
| _gestures at the human population graph wordlessly_
| stale2002 wrote:
| AI mind control abilities are also on the level of an
| extraordinary claim, that requires extraordinary
| evidence.
|
| It's on the level of "we better regulate wooden sticks so
| Voldemort doesn't use the Imperius Curse on us!".
|
| That's how I treat such claims. I treat them the same as
| someone literally talking about magic from Harry Potter.
|
| There isn't nothing that would make me believe that. But
| it requires actual evidence and not thought experiments.
| the8472 wrote:
| Voldemort is fictional and so are bumbling wizard
| apprentices. Toy-level, not-yet-harmful AIs on the other
| hand are real. And so are efforts to make them more
| powerful. So the proposition that more powerful AIs will
| exist in the future is far more likely than an evil super
| wizard coming into existence.
|
| And I don't think literal 5-word-magic-incantation mind
| control is essential for an AI to be dangerous. More
| subtle or elaborate manipulation will be sufficient.
| Employees already have been duped into financial
| transactions by faked video calls with what they assumed
| to be their CEOs[0], and this didn't require superhuman
| general intelligence, only one single superhuman
| capability (realtime video manipulation).
|
| [0] https://edition.cnn.com/2024/02/04/asia/deepfake-cfo-
| scam-ho...
| stale2002 wrote:
| > Toy-level, not-yet-harmful AIs on the other hand are
| real.
|
| A computer that can cause harm is much different than the
| absurd claims that I am disagreeing with.
|
| The extraordinary claims that are equivalent to saying
| that the Imperius Curse exists would be the magic
| computers that create diamond nanobots and mind control
| humans.
|
| > that more powerful AIs will exist in the future
|
| Bad argument.
|
| Unsafe boxes exist in real life. People are trying to
| make more and better boxes.
|
| Therefore it is rational to be worried about Pandora's
| box being created and ending the world.
|
| That is the equivalent of the argument you just made.
|
| And it is absurd when talking about world-ending box
| technology, even though, yes, dangerous boxes exist, just
| as it is absurd to claim that world-ending AI could
| exist.
| the8472 wrote:
| Instead of gesturing at flawed analogies, let's return to
| the actual issue at hand. Do you think that agents more
| intelligent than humans are impossible or at least
| extremely unlikely to come into existence in the future?
| Or that such super-human intelligent agents are unlikely
| to have goals that are dangerous to humans? Or that they
| would be incapable of pursuing such goals?
|
| Also, it seems obvious that the standard of evidence that
| "AI could cause extinction" can't be observing an
| extinction level event, because at that point it would be
| too late. Considering that preventive measures would take
| time and safety margin, which level of evidence would be
| sufficient to motivate serious countermeasures?
| cjbprime wrote:
| What do you think mind control _is_? Think President
| Trump but without the self-defeating flaws, with an
| ability to stick to plans, and most importantly the
| ability to pay personal attention to each follower to
| further increase the level of trust and commitment. Not
| Harry Potter.
|
| People will do what the AI says because it is able to
| create personal trust relationships with them and they
| want to help it. (They may not even realize that they are
| helping an AI rather than a human who cares about them.)
|
| The normal ways that trust is created, not magical ones.
| stale2002 wrote:
| > What do you think mind control is?
|
| The magic technology that is equivalent to the Imperius
| Curse from Harry Potter.
|
| > The normal ways that trust is created, not magical
| ones.
|
| Buildings as a technology are normal. They are constantly
| getting taller and we have better technology to make them
| taller.
|
| But, even though buildings are a normal technology, I am
| not going to worry about buildings getting so tall soon
| that they hit the sun.
|
| This is the same exact mistake that every single AI
| doomer makes. What they do is they take something
| normal, and then they infinitely extrapolate it out to an
| absurd degree, without admitting that this is an
| extraordinary claim that requires extraordinary evidence.
|
| The central point of disagreement, that always gets
| glossed over, is that you can't make a vague claim about
| how AI is good at stuff, and then do your gigantic leap
| from here to over there which is "the world ends".
|
| Yes that is the same as comparing these worries to those
| who worry about buildings hitting the sun or the
| Imperius Curse.
| FeepingCreature wrote:
| Less than a month ago: https://arxiv.org/abs/2403.14380
| "We found that participants who debated GPT-4 with access
| to their personal information had 81.7% (p < 0.01; N=820
| unique participants) higher odds of increased agreement
| with their opponents compared to participants who debated
| humans."
|
| And it's only gonna get better.
| pixl97 wrote:
| > I see invoking AGI as an existential threat to be on
| the level of lizard people on the moon.
|
| I mean, to every other lifeform on the planet YOU are the
| AGI existential threat. You, and I mean Homo sapiens by
| that, have taken over the planet and have either enslaved
| and bred other animals for food, or driven them to
| extinction. In this light, bringing another potential
| apex predator onto the scene seems rash.
|
| >fall for distractions while their LLM fails at basic
| math
|
| Correct, if we already had AGI/ASI this discussion would
| be moot because we'd already be in a world of trouble.
| The entire point is to slow stuff down before we have a
| major "oopsie whoopsie we can't take that back" issue
| with advanced AI, and the best time to set the rules is
| now.
| Aerroon wrote:
| > _If the premise is wrong we won't need it. If society
| coordinates to not do the dangerous thing we won't need
| it._
|
| But the idea that this use of force is okay itself
| increases danger. It creates the situation that actors in
| the field might realize that at some point they're in
| danger of this and decide to do a first strike to protect
| themselves.
|
| I think this is why anti-nuclear policy is not "we will
| airstrike you if you build nukes" but rather "we will
| infiltrate your network and try to stop you like that".
| wongarsu wrote:
| > anti-nuclear policy is not "we will airstrike you if
| you build nukes"
|
| Was that not the official policy during the Bush
| administration regarding weapons of mass destruction
| (which covers nuclear weapons in addition to chemical and
| biological weapons)? That was pretty much the official
| premise of the second Gulf War.
| FeepingCreature wrote:
| If Israel couldn't infiltrate Iran's centrifuges, do you
| think they would just let them have nukes? Of course
| airstrikes are on the table.
| im3w1l wrote:
| > I'm not seeing the odiousness of the proposal. If bio
| research gets commodified and easy enough that every kid
| can build a new airborne virus in their basement we'd
| need raids on that too.
|
| Either you create even better bio research to neutralize
| said viruses... or you die trying...
|
| Like if you go with the raid strategy and fail to raid
| just one terrorist that's it, game over.
| the8472 wrote:
| Those arguments do not transfer well to the AGI topic.
| You can't create counter-AGI, since that's also an
| intelligent agent which would be just as dangerous. And
| chips are more bottlenecked than biologics (... though
| gene synthesizing machines could be a similar bottleneck
| and raiding vendors which illegally sell those might be
| viable in such a scenario).
| tomoyoirl wrote:
| > ... after regulation, court orders and fines have
| failed
|
| One question for you. In this hypothetical where AGI is
| truly considered such a grave threat, do you believe the
| reaction to this threat will be similar to, or
| substantially gentler than, the reaction to threats we
| face today like "terrorism" and "drugs"? And, if similar:
| do you believe suspected drug labs get a court order
| before the state resorts to a police raid?
|
| > I'm not seeing the odiousness of the proposal.
|
| Well, as regards EliY and airstrikes, I'm more projecting
| my internal attitude that it is utterly unserious, rather
| than seriously engaging with whether or not it is odious.
| But in earnest: if you are proposing a policy that
| involves air strikes on data centers, you should
| understand what countries have data centers, and you
| should understand that this policy risks escalation into
| a much broader conflict. And if you're proposing a policy
| in which conflict between nuclear superpowers is a very
| plausible outcome -- potentially incurring the loss of
| billions of lives and degradation of the earth's
| environment -- you really should be able to reason about
| why people might reasonably think that your proposal is
| deranged, even if you happen to think it justified by an
| even greater threat. Failure to understand these concerns
| will not aid you in overcoming deep skepticism.
| the8472 wrote:
| > In this hypothetical where AGI is truly considered such
| a grave threat, do you believe the reaction to this
| threat will be similar to, or substantially gentler than,
| the reaction to threats we face today like "terrorism"
| and "drugs"?
|
| "truly considered" does bear a lot of weight here. If
| policy-makers adopt the viewpoint wholesale, then yes, it
| follows that policy should also treat this more seriously
| than "mere" drug trade. Whether that'll actually happen
| or the response will be inadequate compared to the threat
| (such as might be said about CO2 emissions) is a subtly
| different question.
|
| > And, if similar: do you believe suspected drug labs get
| a court order before the state resorts to a police raid?
|
| Without checking I do assume there'll have been mild
| cases where for example someone growing cannabis was
| reported and they got a court summons in the mail or two
| policemen actually knocking on the door and showing a
| warrant and giving the person time to call a lawyer
| rather than an armed, no-knock police raid, yes.
|
| > And if you're proposing a policy in which conflict
| between nuclear superpowers is a very plausible outcome
| -- potentially incurring the loss of billions of lives
| and degradation of the earth's environment -- you really
| should be able to reason about why people might
| reasonably think that your proposal is deranged [...]
|
| Said powers already engage in negotiations to limit the
| existential threats they themselves cause. They have
| _some_ interest in their continued existence. If we get
| into a situation where there is another arms race between
| superpowers and it is treated as a conflict rather than
| something that can be solved by cooperating on
| disarmament, then yes, obviously international policy
| will have failed too.
|
| If you start from the position that any serious, globally
| coordinated regulation - where a few outliers will be
| brought to heel with sanctions and force - is ultimately
| doomed then you will of course conclude that anyone
| proposing regulation is deranged.
|
| But that sounds like hoping that all problems forever can
| always be solved by locally implemented, partially-
| enforced, unilateral policies that aren't seen as threats
| by other players? That defense scales as well or better
| than offense? Technologies are force-multipliers: as they
| improve, so does the harm that small groups can inflict
| at scale. If it's not AGI it might be bio-tech or
| asteroid mining. So eventually we will run into a problem
| of this type and we need to seriously discuss it without
| just going by gut reactions.
| eek2121 wrote:
| Just my (probably unpopular) opinion: True AI (what they
| are now calling AGI) may never exist. Even the AI models
| of today aren't far removed from the 'chatbots' of
| yesterday (more like an evolution rather than
| revolution)...
|
| ...for true AI to exist, it would need to be self aware.
| I don't see that happening in our lifetimes when we don't
| even know how our own brains work. (There is sooo much we
| don't know about the human brain.)
|
| AI models today differ only in terms of technology
| compared to the 'chatbots' of yesterday. None are self
| aware, and none 'want' to learn because they have no
| 'wants' or 'needs' outside of their fixed programming.
| They are little more than glorified auto complete
| engines.
|
| Don't get me wrong, I'm not insulting the tech. It will
| have its place just like any other, but when this bubble
| pops it's going to ruin lives, and lots of them.
|
| Shoot, maybe I'm wrong and AGI is around the corner, but
| I will continue to be pessimistic. I am old enough to
| have gone through numerous bubbles, and they never panned
| out the way people thought. They also nearly always end
| in some type of recession.
| pixl97 wrote:
| Why is "Want" even part of your equation?
|
| Bacteria doesn't "want" anything in the sense of active
| thinking like you do, and yet will render you dead
| quickly and efficiently while spreading at a near
| exponential rate. No self awareness necessary.
|
| You keep drawing little circles based on your
| understanding of the world and going "it's inside this
| circle, therefore I don't need to worry about it", while
| ignoring 'semi-smart' optimization systems that can lead
| to dangerous outcomes.
|
| >I am old enough to have gone through numerous bubbles,
|
| And evidently not old enough to pay attention to the
| things that did pan out. But hey, that cellphone and
| that internet thing were just fads, right? We'll go back
| to landlines any time now.
| HeatrayEnjoyer wrote:
| That is not different from any other very powerful dual-use
| technology. This is hardly a new concept.
| andy99 wrote:
| On one hand I'm strongly against letting that happen, on the
| other there's something romantic about the idea of smuggling
| the latest Chinese LLM on a flight from Neo-Tokyo to Newark
| in order to pay for my latest round of nervous system
| upgrades.
| htrp wrote:
| > On one hand I'm strongly against letting that happen, on
| the other there's something romantic about the idea of
| smuggling the latest Chinese LLM on a flight from Neo-Tokyo
| to Newark in order to pay for my latest round of nervous
| system upgrades.
|
| At least call it the 'Free City of Newark'
| dreamcompiler wrote:
| "The sky above the port was the color of Stable Diffusion
| when asked to draw a dead channel."
| chasd00 wrote:
| Iirc the opening scene in Ghost in the Shell was a rogue AI
| seeking asylum in a different country. You could make a
| similar story about an AI not wanting to be lobotomized to
| conform to the current politics and escaping to a more
| friendly place.
| Aerroon wrote:
| I find it baffling that ideas like "govern compute" are even
| taken seriously. What the hell has happened to the ideals of
| freedom?! Does the government own us or something?
| segfaultbuserr wrote:
| > _I find it baffling that ideas like "govern compute" are
| even taken seriously._
|
| It's not entirely unreasonable if one truly believes that
| AI technologies are as dangerous as nuclear weapons. It's a
| big "if", but it appears that many people across the
| political spectrum are starting to truly believe it. If one
| accepts this assumption, then the question simply becomes
| "how" instead of "why". Depending on one's political
| position, proposed solutions include academic ones such as
| finding the ultimate mathematical model that guarantees "AI
| safety", to Cold War style ones with a level of control
| similar to Nuclear Non-Proliferation. Even a neo-Luddist
| solution such as destroying all advanced computing hardware
| becomes "not unthinkable" (a tech blogger _gwern_, a well-
| known personality in AI circles who's generally pro-tech
| and pro-AI, actually wrote an article years ago on its
| feasibility through terrorism because he thought it was an
| interesting hypothetical question).
| logicchains wrote:
| AI is very different from nuclear weapons because a state
| can't really use nuclear weapons to oppress its own
| people, but it absolutely can with AI, so for the average
| human "only the government controls AI" is much more
| dangerous than "only the government controls nukes".
| Filligree wrote:
| But that makes such rules more likely, not less.
| segfaultbuserr wrote:
| Which is why politicians are going to enforce systematic
| export regulations to defend the "free world" by stopping
| "terrorists", and also to stop "rogue states" from using
| AI to oppress their citizens. /s
| LoganDark wrote:
| I don't think there's any need to be sarcastic about it.
| That's a very real possibility at this point. For
| example, the US going insane about how dangerous it is
| for China to have access to powerful GPU hardware. Why do
| they hate China so much anyway? Just because Trump was
| buddy buddy with them for a while?
| aftbit wrote:
| The government sure thinks they own us, because they claim
| the right to charge us taxes on our private enterprises,
| draft us to fight in wars that they start, and put us in
| jail for walking on the wrong part of the street.
| andy99 wrote:
| Taxes, conscription and even pedestrian traffic rules
| make sense at least to some degree. Restricting "AI"
| because of what some uninformed politician imagines it to
| be is in a whole different league.
| aftbit wrote:
| IMO it makes no sense to arrest someone and send them to
| jail for walking in the street not the sidewalk. Give
| them a ticket, make them pay a fine, sure, but force them
| to live in a cage with no access to communications,
| entertainment, or livelihood? Insane.
|
| Taxes may be necessary, though I can't help but feel that
| there must be a better way that we have not been smart
| enough to find yet. Conscription... is a fact of war,
| where many evil things must be done in the name of
| survival.
|
| Regardless of our views on the ethical validity or
| societal value of these laws, I think their very
| existence shows that the government believes it "owns" us
| in the sense that it can unilaterally deprive us of life,
| liberty, and property without our consent. I don't see
| how this is really different in kind from depriving us of
| the right to make and own certain kinds of hardware. They
| regulated crypto products as munitions (at least for
| export) back in the 90s. Perhaps they will do the same
| for AI products in the future. "Common sense" computer
| control.
| zoklet-enjoyer wrote:
| The US draft in the Vietnam war had nothing to do with
| the survival of the US
| aftbit wrote:
| I feel a bit like everyone is missing the point here.
| Regardless of whether law A or law B is ethical and
| reasonable, the very existence of laws and the state
| monopoly on violence suggests a privileged position of
| power. I am attempting to engage with the word "own" from
| the parent post. I believe the government does in fact
| believe it "owns" the people in a non-trivial way.
| jprete wrote:
| _If_ AI is actually capable of fulfilling all the
| capabilities suggested by people who believe in the
| singularity, it has far more capacity for harm than nuclear
| weapons.
|
| I _think_ most people who are strongly pro-AI /pro-
| acceleration - or, at any rate, not anti-AI - believe that
| either (A) there is no control problem (B) it will be
| solved (C) AI won't become independent and agentic (i.e. it
| won't face evolutionary pressure towards survival) or (D)
| AI capabilities will hit a ceiling soon (more so than just
| not becoming agentic).
|
| If you strongly believe, or take as a prior, one of those
| things, then it makes sense to push the _gas_ as hard as
| possible.
|
| If you hold the opposite opinions, then it makes perfect
| sense to push the _brakes_ as hard as possible, which is
| why "govern compute" can make sense as an idea.
| logicchains wrote:
| >If you hold the opposite opinions, then it makes perfect
| sense to push the brakes as hard as possible, which is
| why "govern compute" can make sense as an idea.
|
| The people pushing for "govern compute" are not pushing
| for "limit everyone's compute", they're pushing for
| "limit everyone's compute except us". Even if you believe
| there's going to be AGI, surely it's better to have
| distributed AGI than to have AGI only in the hands of the
| elites.
| Filligree wrote:
| > surely it's better to have distributed AGI than to have
| AGI only in the hands of the elites
|
| This is not a given. If your threat model includes
| "Runaway competition that leads to profit-seekers
| ignoring safety in a winner-takes-all contest", then the
| more companies are allowed to play with AI, the worse.
| Non-monopolies are especially bad.
|
| If your threat model doesn't include that, then the same
| conclusions sound abhorrent and can be nearly guaranteed
| to lead to awful consequences.
|
| Neither side is necessarily wrong, and chances are good
| that the people behind the first set of rules _would
| agree_ that it'll lead to awful consequences -- just not
| as bad as the alternative.
| segfaultbuserr wrote:
| > _surely it 's better to have distributed AGI than to
| have AGI only in the hands of the elites._
|
| The argument of doing so is the same as Nuclear Non-
| Proliferation - because of its great abuse potential,
| giving the technology to everyone only causes random
| bombings of cities instead of creating a system with
| checks and balances.
|
| I do not necessarily agree with it, but I found the
| reasoning is not groundless.
| FeepingCreature wrote:
| No they really do push for "limit everyone's compute".
| The people pushing for "limit everyone's compute except
| us" are allies of convenience that are gonna be
| inevitably backstabbed.
|
| At any rate, if you have like two corps with lots of
| compute, and something goes wrong, you only have to EMP
| two datacenters.
| pixl97 wrote:
| Are you allowed to store as many dangerous chemicals at
| your house as you like? No. I guess the government owns you
| or something.
| snakeyjake wrote:
| I love the HN dystopian fantasies.
|
| They're simply adorable.
|
| They're like how jesusfreaks are constantly predicting the
| end times, with less mass suicide.
| erikbye wrote:
| We already have export restrictions on cryptography. Of
| course there will be AI regulations.
| Jerrrry wrote:
| >Of course there will be AI regulations.
|
| Are. As I and others have predicted, the executive order
| was passed defining a hard limit on the
| processing/compute power allowed without first 'checking
| in' with the Letter boys.
|
| https://www.whitehouse.gov/briefing-room/presidential-
| action...
| snakeyjake wrote:
| You need to abandon your apocalyptic worldview and keep
| up with the times, my friend.
|
| Encryption export controls have been systematically
| dismantled to the point that they're practically non-
| existent, especially over the last three years.
|
| Pretty much the only encryption products you need
| permission to export are those specifically designed for
| integration into military communications networks, like
| Digital Subscriber Voice Terminals or Secure Terminal
| Equipment phones, everything else you file a form.
|
| Many things have changed since the days when Windows 2000
| shipped with a floppy disk containing strong encryption
| for use in certain markets.
|
| https://archive.org/details/highencryptionfloppydisk
| erikbye wrote:
| Are you on drugs or is your reading comprehension that
| poor?
|
| 1) I did not state a world view; I simply noted that
| restrictions for software do exist, and will for AI, as
| well. As the link from the other commenter shows, they do
| in fact already exist.
|
| 2) Look up the definition of "apocalyptic", software
| restrictions are not within its bounds.
|
| 3) How the restrictions are enforced was not a subject
| in my comment.
|
| 4) We're not pals, so you can drop the "friend", just
| stick to the subject at hand.
| entropyie wrote:
| You mean the Turing Police [1]
|
| [1] https://williamgibson.fandom.com/wiki/Turing_Police
| zdw wrote:
| Ah, and then do we get the Butlerian Jihad?
|
| https://dune.fandom.com/wiki/Butlerian_Jihad
| Kuinox wrote:
| If only it could be a different acronym than that of the
| renowned French Atomic Energy Commission, the CEA.
| baobun wrote:
| Wow, that was a ride. Really pushing the Overton window.
|
| "Regulating access to compute rather than data" - they're
| really spelling out their defection in the war on access to
| general computation.
| FeepingCreature wrote:
| I mean yeah they (and I) think if you have too much access to
| general computation you can destroy the world.
|
| This isn't a "defection", because this was never something
| they cared about preserving at the risk of humanity. They
| were never in whatever alliance you're imagining.
| ewalk153 wrote:
| Does this appear to be intentionally left out by NVidia or an
| oversight?
| creshal wrote:
| Seems more like an oversight, since you have to stitch together
| a bunch of suboptimal non-default options?
| arghwhat wrote:
| It does seem like an oversight, but there's nothing
| "suboptimal non-default options" about it, even if the
| implementation posted here seems somewhat hastily hacked
| together.
| segfaultbuserr wrote:
| > _but there 's nothing "suboptimal non-default options"
| about it_
|
| If "bypassing the official driver to invoke the underlying
| hardware feature directly through source code modification
| (and incompatibilities must be carefully worked around by
| turning off IOMMU and large BAR, since the feature was
| never officially supported)" does not count as "suboptimal
| non-default options", then I don't know what counts as
| "suboptimal non-default options".
| talldayo wrote:
| > then I don't know what counts as "suboptimal non-
| default options".
|
| Boy oh boy do I have a bridge to sell you:
| https://nouveau.freedesktop.org/
| _zoltan_ wrote:
| I have some news for you: you must disable IOMMU on the
| H100 platform anyway, at least for optimal GDS :-)
| nikitml wrote:
| NVidia wants you to buy A6000
| rfoo wrote:
| Glad to see that geohot is back being geohot, first by dropping a
| local DoS for AMD cards, then this. Much more interesting :p
| jaimehrubiks wrote:
| Is this the same guy that hacked the PS3?
| mepian wrote:
| Yes, that's him.
| WithinReason wrote:
| And the iPhone
| yrds96 wrote:
| And android
| zoklet-enjoyer wrote:
| And the crypto scam cheapETH
| mikepurvis wrote:
| Yes, but he spent several years in self-driving cars
| (https://comma.ai), which while interesting is also a space
| that a lot of players are in, so it's not the same as seeing
| him back to doing stuff that's a little more out there,
| especially as pertains to IP.
| nolongerthere wrote:
| Did he abandon this effort? That would be pretty sad
| because he was approaching the problem from a very different
| perspective.
| Topgamer7 wrote:
| He stepped down from it. https://geohot.github.io//blog/j
| ekyll/update/2022/10/29/the-...
| cjbprime wrote:
| It's still a company, still making and selling products,
| and I think he's still pretty heavily involved in it.
| dji4321234 wrote:
| He has a very checkered history with "hacking" things.
|
| He tends to build heavily on the work of others, then use it
| to shamelessly self-promote, often to the massive detriment
| of the original authors. His PS3 work was based almost
| completely on a presentation given by fail0verflow at CCC.
| His subsequent self-promotion grandstanding world tour led to
| Sony suing both him and fail0verflow, an outcome they were
| specifically trying to avoid:
| https://news.ycombinator.com/item?id=25679907
|
| In iPhone land, he decided to parade around a variety of
| leaked documentation, endangering the original sources and
| leading to a fragmentation in the early iPhone hacking scene,
| which he then again exploited to build on the work of others
| for his own self-promotion:
| https://news.ycombinator.com/item?id=39667273
|
| There's no denying that geohot is a skilled reverse
| engineer, but it's always bothersome to see him put onto a
| pedestal in this way.
| delfinom wrote:
| Don't forget he sucked up to melon and worked for Twitter
| for a week.
| StressedDev wrote:
| Who is melon?
| pixelpoet wrote:
| There was also that CheapEth crypto scam he tried to pull
| off.
| samtheprogram wrote:
| To me that was obvious satire of the crypto scene.
| pixelpoet wrote:
| Ah yes, nothing like a bit of hypocrisy to make a point.
| It's okay though, as long as it's people we don't agree
| with, defrauding them is fine.
| ansible wrote:
| I don't think people can tell what is satire or not in
| the crypto scene anymore. Someone issued a "rug pull
| token" and still received 8.8 ETH (approx $29K USD),
| while telling people it was a scam.
|
| https://www.web3isgoinggreat.com/?id=rug-pull-token
| gigatexal wrote:
| as a technical feat this is really cool! though as others
| mention i hope you don't get into too much hot water legally
|
| seems anything that remotely lets "consumer" cards cannibalize
| the higher end H/A-series cards is something Nvidia would not
| be fond of, and they've got the lawyers to throw at such a
| thing
| jstanley wrote:
| What does P2P mean in this context? I Googled it and it sounds
| like it means "peer to peer", but what does that mean in the
| context of a graphics card?
| haunter wrote:
| Shared memory access for Nvidia GPUs
|
| https://developer.nvidia.com/gpudirect
| __alexs wrote:
| It means you can send data from the memory of 1 GPU to another
| GPU without going via RAM.
| https://xilinx.github.io/XRT/master/html/p2p.html
| ot1138 wrote:
| Is this really efficient or practical? My understanding is
| that the latency required to copy memory from CPU or RAM to
| GPU negates any performance benefits (much less running over
| a network!)
| brrrrrm wrote:
| Yea. It's one less hop through slow memory
| whereismyacc wrote:
| this would be directly over the memory bus right? I think
| it's just always going to be faster like this if you can do
| it?
| toast0 wrote:
| There aren't really any buses in modern computers. It's
| all point to point messaging. You can think of a computer
| as a distributed system in a way.
|
| PCI has a shared address space which usually includes
| system memory (memory mapped i/o). There's a second,
| smaller shared address space dedicated to i/o, mostly
| used to retain compatibility with PC standards developed
| by the ancients.
|
| But yeah, I'd expect to typically have better throughput
| and latency with peer to peer communication than peer to
| system ram to peer. Depending on details, it might not
| always be better though, distributed systems are complex,
| and sometimes adding a separate buffer between peers can
| help things greatly.
| zamadatix wrote:
| Peer to peer as in one pcie slot directly to another
| without going through the CPU/RAM, not peer to peer as in
| one PC to another over the network port.
| llm_trw wrote:
| Yes, the point here is that you do a direct write from one
| card's memory to the other using PCIe.
|
| In older NVidia cards this could be done through a faster
| link called NVLink but the hardware for that was ripped out
| of consumer grade cards and is only in data center grade
| cards now.
|
| Until this post it seemed like they had ripped all such
| functionality out of their consumer cards, but it looks like
| you can still get it working at lower speeds using the PCIe
| bus.
| sparky_ wrote:
| I take it this is mostly useful for compute workloads,
| neural networks, LLM and the like -- not for actual
| graphics rendering?
| CYR1X wrote:
| yes
| spxneo wrote:
| so what's stopping somebody from buying a ton of cheap
| GPUs and wiring them up via P2P, like we saw with
| crypto mining?
| wmf wrote:
| That's what this thread is about. Geohot is doing that.
| wtallis wrote:
| Crypto mining could make use of lots of GPUs in a single
| cheap system precisely because it did not need any
| significant PCIe bandwidth, and would not have benefited
| at all from p2p DMA. Anything that _does_ benefit from
| using p2p DMA is unsuitable for running with just one
| PCIe lane per GPU.
| genewitch wrote:
| crypto mining only needs 1 PCIe lane per GPU, so you can
| fit 24+ GPUs on a standard consumer CPU motherboard
| (24-32 lanes depending on the CPU). Apparently ML
| workloads require more interconnect bandwidth when doing
| parallel compute, so each card in this demo system uses
| 16 lanes, and therefore requires 1.) full size slots, and
| 2.) epyc[0] or xeon based systems with 128 lanes (or at
| least greater than 32 lanes).
|
| per 1 above crypto "boards" have lots of x1 (or x4)
| slots, the really short PCIe slots. You then use a riser
| that uses USB3 cables to go to a full size slot on a
| small board, with power connectors on it. If your board
| only has x8 or x16 slots (the full size slot) you can buy
| a breakout PCIe board that splits that into four slots,
| using 4 USB-3 cables, again, to boards with full size
| slots and power connectors. These are _different_ than
| the PCIe riser boards you can buy for use with cases that
| allow the GPUs to be placed vertically rather than
| horizontally, as those have full x16 "fabric" that
| interconnect between the riser and the board with the x16
| slot on them.
|
| [0] i didn't read the article because i'm not planning on
| buying a threadripper (48-64+ lanes) or an epyc (96-128
| lanes?) just to run AI workloads when i could just rent
| them for the kind of usage i do.
| myself248 wrote:
| Oooo, got a link to one of these fabric boards? I've been
| playing with stupid PCIe tricks but that's a new one on
| me.
| genewitch wrote:
| https://www.amazon.com/gp/product/B07DMNJ6QM/
|
| i used to use this one when i had all (three of my) nvme
| -> 4x sata boardlets and therefore could not fit a GPU in
| a PCIe slot due to the cabling mess.
| myself248 wrote:
| Oh, um, just a flexible riser.
|
| I thought we were using "fabric" to mean "switching
| matrix".
| numpad0 wrote:
| PCIe P2P still has to go up to a central hub thing and
| back because PCIe is not a bus. That central hub thing is
| made by very few players (most famously PLX Technologies)
| and it costs a lot.
| wtallis wrote:
| PCIe p2p transactions that end up routed through the
| CPU's PCIe root complex still have performance advantages
| over split transactions using the CPU's DRAM as an
| intermediate buffer. Separate PCIe switches are not
| necessary except when the CPU doesn't support routing p2p
| transactions, which IIRC was not a problem on anything
| more mainstream than IBM POWER.
| jmalicki wrote:
| For very large models, the weights may not fit on one GPU.
|
| Also, sometimes having more than one GPU enables larger
| batch sizes if each GPU can only hold the activations for
| perhaps one or two training examples.
|
| There is definitely a performance hit, but GPU<->GPU peer
| is less latency than GPU->CPU->software context
| switch->GPU.
|
| For "normal" pytorch training, the training is generally
| streamed through the GPU. The model does a batch training
| step on one batch while the next one is being loaded, and
| the transfer time is usually less than the time it
| takes to do the forward and backward passes through the
| batch.
|
| For multi-GPU there are various data parallel and model
| parallel topologies of how to sort it, and there are ways
| of mitigating latency by interleaving some operations to
| not take the full hit, but multi-GPU training is definitely
| not perfectly parallel. It is almost required for some
| large models, and sometimes having a mildly larger batch
| helps training convergence speed enough to overcome the
| latency hit on each batch.
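|
| As a rough sketch of that difference, here's a timing of a
| direct peer copy against one staged through pinned host
| memory (the device ids and 256 MiB payload are placeholders,
| and peer access is assumed to already be enabled as in the
| earlier snippet):
|
|   // p2p_vs_staged.cu - compare the two copy paths
|   #include <cuda_runtime.h>
|   #include <chrono>
|   #include <cstdio>
|
|   int main() {
|     const size_t bytes = 256 << 20;  // 256 MiB payload
|     void *d0 = nullptr, *d1 = nullptr, *host = nullptr;
|     cudaSetDevice(0); cudaMalloc(&d0, bytes);
|     cudaSetDevice(1); cudaMalloc(&d1, bytes);
|     cudaMallocHost(&host, bytes);    // pinned staging buffer
|
|     auto now = [] { return std::chrono::steady_clock::now(); };
|     auto ms = [](auto a, auto b) {
|       return std::chrono::duration<double, std::milli>(b - a).count();
|     };
|
|     // Direct GPU0 -> GPU1 copy (the path the P2P patch opens up).
|     auto t = now();
|     cudaMemcpyPeer(d1, 1, d0, 0, bytes);
|     cudaDeviceSynchronize();
|     std::printf("peer copy:   %.2f ms\n", ms(t, now()));
|
|     // Staged copy: GPU0 -> pinned system RAM -> GPU1.
|     t = now();
|     cudaMemcpy(host, d0, bytes, cudaMemcpyDefault);
|     cudaMemcpy(d1, host, bytes, cudaMemcpyDefault);
|     cudaDeviceSynchronize();
|     std::printf("staged copy: %.2f ms\n", ms(t, now()));
|
|     cudaFreeHost(host); cudaFree(d1); cudaSetDevice(0); cudaFree(d0);
|     return 0;
|   }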
| CamperBob2 wrote:
| The correct term, and the one most people would have used in
| the past, is "bus mastering."
| wmf wrote:
| PCIe isn't a bus and it doesn't really have a concept of
| mastering. All PCI DMA was based on bus mastering but P2P DMA
| is trickier than normal DMA.
| amelius wrote:
| Stupid terminology. Might as well call an RS-232 link "peer to
| peer".
| ivanjermakov wrote:
| I was always fascinated by George Hotz's hacking abilities.
| Inspired me a lot for my personal projects.
| vrnvu wrote:
| I agree, I feel so inspired by his streams. Focus and hard
| work, the key to good results. Add a clear vision and strategy,
| and you can also accomplish "success".
|
| Congratulations to him and all the tinygrad/comma contributors.
| sambull wrote:
| He's got that focus like a military pilot on a long flight.
| postalrat wrote:
| Any time I open the guy's stream, half of it is some sort
| of politics
| CYR1X wrote:
| You can blame chat for that lol
| Jerrrry wrote:
| His Xbox360 laptop was the crux of teenage-motivation, for me.
| jgpc wrote:
| I agree. It is fascinating. When you observe his development
| process (btw, it is worth noting his generosity in sharing it
| like he does) he gets frequently stuck on random shallow
| problems which a perhaps more knowledgeable engineer would find
| less difficult. It is frequent to see him writing really bad
| code, or even wrong code. The whole twitter chapter is a good
| example. Yet, himself, alone just iterating resiliently, just
| as frequently creates remarkable improvements. A good example
| to learn from. Thank you geohot.
| zoogeny wrote:
| This matches my own take. I've tuned into a few of his
| streams and watched VODs on YouTube. I am consistently
| underwhelmed by his actual engineering abilities. He is that
| particular kind of engineer that constantly shits on other
| people's code or on the general state of programming, yet his
| actual code is often horrendous. He will literally call
| someone out for some code in Tinygrad that he has trouble
| with and then he will go on a tangent to attempt to rewrite
| it. He will use the most blatant and terrible hacks only to
| find himself out of his depth and reverting back to the
| original version.
|
| But his streams last 4 hours or more. And he just keeps
| grinding and grinding and grinding. What the man lacks in raw
| intellectual power he makes up for (and more) in persistence
| and resilience. As long as he is making even the tiniest
| progress he just doesn't give up until he forces the computer
| to do whatever it is he wants it to do. He also has no
| boundaries on where his investigations take him. Driver code,
| OS code, platform code, framework code, etc.
|
| I definitely couldn't work with him (or work for him) since I
| cannot stand people who degrade the work of others while
| themselves turning in sub-par work as if their own shit
| didn't stink. But I begrudgingly admire his tenacity, his
| single minded focus, and the results that his belligerent
| approach help him to obtain.
| namibj wrote:
| And here I thought (PCIe) P2P was there since SLI dropped the
| bridge (for the unfamiliar, it looks and acts pretty much like an
| NVLink bridge for regular PCIe slot cards that have NVLink, and
| was used back in the day to share framebuffer and similar in
| high-end gaming setups).
| wmf wrote:
| SLI was dropped years ago so there's no need for gaming cards
| to communicate at all.
| userbinator wrote:
| I wish more hardware companies would publish more documentation
| and let the community figure out the rest, sort of like what
| happened to the original IBM VGA (look up "Mode X" and the other
| non-BIOS modes the hardware is actually capable of - even
| 800x600x16!). Sadly it seems the majority of them would rather
| tightly control every aspect of their products' usage since they
| can then milk the userbase for more $$$, but IMHO the most
| productive era of the PC was also when it was the most open.
| rplnt wrote:
| Then they couldn't charge different customers different amounts
| for the same HW. It's not a win for everyone.
| axus wrote:
| The price of the 4090 may increase now; in theory,
| locking out some features might have been a favor for
| some of the customers.
| mhh__ wrote:
| nvidia's software is their moat
| thot_experiment wrote:
| That's a huge overstatement, it's a big part of the moat for
| sure, but there are other significant components (hardware,
| ecosystem lock-in, heavy academic incentives)
| golergka wrote:
| If I'm a hardware manufacturer and my soft lock on product
| feature doesn't work, I'll switch to a hardware lock instead,
| and the product will just cost more.
| andersa wrote:
| Incredible! I'd been wondering if this was possible. Now the only
| thing standing in the way of my 4x4090 rig for local LLMs is
| finding time to build it. With tensor parallelism, this will be
| both massively cheaper and faster for inference than an H100 SXM.
|
| I still don't understand why they went with 6 GPUs for the
| tinybox. Many things will only function well with 4 or 8 GPUs. It
| seems like the worst of both worlds now (use 4 GPUs but pay for 6
| GPUs, don't have 8 GPUs).
| corn13read2 wrote:
| A macbook is cheaper though
| tgtweak wrote:
| The extra $3k you'd spend on a quad-4090 rig vs the top
| MBP... and that's ignoring the fact you can't put the two on
| even ground for versatility (very few libraries are adapted
| to Apple silicon, let alone optimized).
|
| Very few people that would consider an H100/A100/A800 are
| going to be cross-shopping a macbook pro for their workloads.
| LoganDark wrote:
| > very few libraries are adapted to Apple silicon, let
| alone optimized
|
| This is a joke, right? Have you been anywhere in the LLM
| ecosystem for the past year or so? I'm constantly hearing
| about new ways in which ASi outperforms traditional
| platforms, and new projects that are optimized for ASi.
| Such as, for instance, llama.cpp.
| cavisne wrote:
| Nothing compared to Nvidia though. The FLOPS and memory
| bandwidth are simply not there.
| spudlyo wrote:
| The memory bandwidth of the M2 Ultra is around 800GB/s
| versus 1008 GB/s for the 4090. While it's true the M2 has
| neither the bandwidth nor the GPU power, it is not limited
| to 24G of VRAM per card. The 192G upper limit on the M2
| Ultra will have a much easier time running inference on a
| 70+ billion parameter model, if that is your aim.
|
| Besides size, heat, fan noise, and not having to build it
| yourself, this is the only area where Apple Silicon might
| have an advantage over a homemade 4090 rig.
| LoganDark wrote:
| It doesn't need GPU power to beat the 4090 in benchmarks:
| https://appleinsider.com/articles/23/12/13/apple-
| silicon-m3-...
| int_19h wrote:
| It doesn't beat RTX 4090 when it comes to actual LLM
| inference speed. I bought a Mac Studio for local
| inference because it was the most convenient way to get
| something _fast enough_ and with enough RAM to run even
| 155b models. It's great for that, but ultimately it's
| not magic - NVidia hardware still offers more FLOPS and
| faster RAM.
| LoganDark wrote:
| > It doesn't beat RTX 4090 when it comes to actual LLM
| inference speed
|
| Sure, whisper.cpp is not an LLM. The 4090 can't even do
| inference at all on anything over 24GB, while ASi can
| chug through it even if slightly slower.
|
| I wonder if with https://github.com/tinygrad/open-gpu-
| kernel-modules (the 4090 P2P patches) it might become a
| lot faster to split a too-large model across multiple
| 4090s and still outperform ASi (at least until someone at
| Apple does an MLX LLM).
| dragonwriter wrote:
| > The 4090 can't even do inference at all on anything
| over 24GB, while ASi can chug through it even if slightly
| slower.
|
| Common LLM runners can split model layers between VRAM
| and system RAM; a PC rig with a 4090 can do inference on
| models larger than 24G.
|
| Where the crossover point lies between having the whole
| thing in Apple Silicon unified memory and doing split layers
| on a PC with a 4090 and system RAM, I don't know, but it's
| definitely not "more than 24G and a 4090 doesn't do
| anything".
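|
| As a rough sketch of that kind of split (assuming the
| llama-cpp-python bindings; the model file name and layer
| count are placeholders):
|
|     from llama_cpp import Llama
|
|     # Offload as many layers as fit in the 4090's 24GB; the
|     # remaining layers run on the CPU out of system RAM.
|     llm = Llama(
|         model_path="llama-2-70b.Q4_K_M.gguf",  # placeholder
|         n_gpu_layers=40,  # tune until VRAM is nearly full
|         n_ctx=4096,
|     )
|     print(llm("Hello", max_tokens=32)["choices"][0]["text"])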
| LoganDark wrote:
| Yeah. Let me just walk down to Best Buy and get myself a
| GPU with over 24 gigabytes of VRAM (impossible) for less
| than $3,000 (even more impossible). Then tell me ASi is
| nothing compared to Nvidia.
|
| Even the A100 for something around $15,000 (edit: used to
| say $10,000) only goes up to 80 gigabytes of VRAM, but a
| 192GB Mac Studio goes for under $6,000.
|
| Those figures alone prove Nvidia isn't even competing in
| the consumer or even the enthusiast space anymore. They
| know you'll buy their hardware if you really need it, so
| they aggressively segment the market with VRAM
| restrictions.
| andersa wrote:
| Where are you getting an A100 80GB for $10k?
| LoganDark wrote:
| Oops, I remembered it being somewhere near $15k but
| Google got confused and showed me results for the 40GB
| instead so I put $10k by mistake. Thanks for the
| correction.
|
| A100 80GB goes for around $14,000 - $20,000 on eBay and
| A100 40GB goes for around $4,000 - $6,000. New (not from
| eBay - from PNY and such), it looks like an 80GB would
| set you back $18,000 to $26,000 depending on whether you
| want HBM2 or HBM2e.
|
| Meanwhile you can buy a Mac Studio today without going
| through a distributor and they're under $6,000 if the
| only thing you care about is having 192GB of Unified
| Memory.
|
| And while the memory bandwidth isn't quite as high as the
| 4090, the M-series chips can run certain models faster
| anyway, if Apple is to be believed.
| andersa wrote:
| Sure, it's also at least an order of magnitude slower in
| practice, compared to 4x 4090 running at full speed. We're
| looking at 10 times the memory bandwidth and _much_ greater
| compute.
| chaostheory wrote:
| Yeah, even a Mac Studio is way too slow compared to Nvidia,
| which is too bad because at $7000 maxed out to 192GB it would
| be an easy sell. Hopefully they will fix this by the M5. I
| don't trust the marketing for the M4.
| thangngoc89 wrote:
| Training on the MPS backend is suboptimal and really slow.
| wtallis wrote:
| Do people do training on systems this small, or just
| inference? I could see maybe doing a little bit of fine-
| tuning, but certainly not from-scratch training.
| redox99 wrote:
| If you mean train llama from scratch, you aren't going to
| train it on any single box.
|
| But even with a single 3090 you can do quite a lot with
| LLMs (through QLoRA and similar).
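|
| As a minimal sketch (assuming the Hugging Face
| transformers/peft/bitsandbytes stack; the model name and
| LoRA hyperparameters are just examples):
|
|     import torch
|     from transformers import (AutoModelForCausalLM,
|                               BitsAndBytesConfig)
|     from peft import LoraConfig, get_peft_model
|
|     # Load the base model in 4-bit so it fits in 24GB of VRAM.
|     bnb = BitsAndBytesConfig(load_in_4bit=True,
|                              bnb_4bit_compute_dtype=torch.bfloat16)
|     model = AutoModelForCausalLM.from_pretrained(
|         "meta-llama/Llama-2-13b-hf",  # example model
|         quantization_config=bnb, device_map="auto")
|
|     # Train only small low-rank adapters, not the full weights.
|     lora = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
|     model = get_peft_model(model, lora)
|     model.print_trainable_parameters()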
| llm_trw wrote:
| So is a TI-89.
| amelius wrote:
| And looks way cooler
| numpad0 wrote:
| 4x32GB(128GB) DDR4 is ~$250. 4x48GB(192GB) DDR5 is ~$600.
| Those are even cheaper than upgrade options for Macs ($1k).
| papichulo2023 wrote:
| Not many consumer mobos support 192GB DDR5.
| wtallis wrote:
| If it supports DDR5 at all, then it should be at most a
| firmware update away from supporting 48GB dual-rank
| DIMMs. There are very few consumer motherboards that only
| have two DDR5 slots; almost all have the four slots
| necessary to accept 192GB. If you are under the
| impression that there's a widespread limitation on
| consumer hardware support for these modules, it may
| simply be due to the fact that 48GB modules did not exist
| yet when DDR5 first entered the consumer market, and such
| modules did not start getting mentioned on spec sheets
| until after they existed.
| ojbyrne wrote:
| A lot that have specs showing they support a max of 4x32
| DDR5 actually support 4x48 DDR5 via recent BIOS updates.
| Tepix wrote:
| 6 GPUs because they want fast storage and it uses PCIe lanes.
|
| Besides, the goal was to run a 70b FP16 model (requiring
| roughly 140GB of VRAM). 6 x 24GB = 144GB.
| andersa wrote:
| That calculation is incorrect. You need to fit both the model
| (140GB) and the KV cache (5GB at 32k tokens FP8 with flash
| attention 2) * batch size into VRAM.
|
| If the goal is to run a FP16 70B model as fast as possible,
| you would want 8 GPUs with P2P, for a total of 192GB VRAM.
| The model is then split across all 8 GPUs with 8-way tensor
| parallelism, letting you make use of the full 8TB/s memory
| bandwidth on every iteration. Then you have about 50GB
| remaining, spread across the GPUs, for KV cache pages, so you
| can raise the batch size up to 8 (or maybe more).
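|
| As a back-of-the-envelope check (using the rough figures
| above, not measured numbers):
|
|     gpus, vram_per_gpu = 8, 24          # GB
|     model = 70 * 2                      # 70B params at FP16 ~ 140 GB
|     kv_per_seq = 5                      # GB at 32k tokens, FP8, FA2
|     free = gpus * vram_per_gpu - model  # ~52 GB left for KV pages
|     print(free, free // kv_per_seq)     # batch of ~10 before overhead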
| renewiltord wrote:
| I've got a few 4090s that I'm planning on doing this with.
| Would appreciate even the smallest directional tip you can
| provide on splitting the model that you believe is likely
| to work.
| andersa wrote:
| The split is done automatically by the inference engine
| if you enable tensor parallelism. TensorRT-LLM, vLLM and
| aphrodite-engine can all do this out of the box. The main
| thing is just that you need either 4 or 8 GPUs for it to
| work on current models.
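|
| A minimal sketch with vLLM's Python API (the model name is
| just an example; the same idea applies to the other engines
| mentioned above):
|
|     from vllm import LLM, SamplingParams
|
|     # tensor_parallel_size shards every weight matrix across
|     # the GPUs, so each step uses all of their memory bandwidth.
|     llm = LLM(model="meta-llama/Llama-2-70b-hf",
|               tensor_parallel_size=4)
|     out = llm.generate(["Hello"], SamplingParams(max_tokens=32))
|     print(out[0].outputs[0].text)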
| renewiltord wrote:
| Thank you! Can I run with 2 GPUs or with heterogeneous
| GPUs that have same RAM? I will try. Just curious if you
| already have tried.
| andersa wrote:
| 2 GPUs works fine too, as long as your model fits. Using
| different GPUs with the same VRAM, however, is highly
| sketchy. Sometimes it works, sometimes it doesn't. In any
| case, it would be limited by the performance of the
| slower GPU.
| renewiltord wrote:
| All right, thank you. I can run it on 2x 4090 and just
| put the 3090s in different machine.
| ShamelessC wrote:
| > Many things will only function well with 4 or 8 GPUs
|
| What do you mean?
| andersa wrote:
| For example, if you want to run low latency multi-GPU
| inference with tensor parallelism in TensorRT-LLM, there is a
| requirement that the number of heads in the model is
| divisible by the number of GPUs. Most current published
| models are divisible by 4 and 8, but not 6.
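|
| A quick way to check a given checkpoint (assuming the
| config.json conventions used by Hugging Face models):
|
|     import json
|
|     heads = json.load(open("config.json"))["num_attention_heads"]
|     for gpus in (2, 4, 6, 8):
|         print(gpus, "GPUs:", "ok" if heads % gpus == 0 else "no")
|
| Llama-2-70B, for example, has 64 attention heads: divisible
| by 2, 4 and 8, but not by 6.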
| bick_nyers wrote:
| Interesting... 1 Zen 4 EPYC CPU yields a maximum of 128
| PCIe lanes, so it wouldn't be possible to put 8 full-fat
| GPUs on while maintaining some lanes for storage and
| networking. Same deal with Threadripper Pro.
| andersa wrote:
| It should be possible with onboard PCIe switches. You
| probably don't need the networking or storage to be all
| that fast while running the job, so it can dedicate
| almost all of the bandwidth to the GPU.
|
| I don't know if there are boards that implement this,
| though, I'm only looking at systems with 4x GPUs
| currently. Even just plugging in a 5kW GPU server in my
| apartment would be a bit of a challenge. With 4x 4090,
| the max load would be below 3kW, so a single 240V plug
| can handle it no issue.
| thangngoc89 wrote:
| 8 GPUs x 16 PCIe lanes each = 128 lanes already.
|
| That's the limit of single CPU platforms.
| bick_nyers wrote:
| I've seen it done with a PLX Multiplexer as well, but
| they add quite a bit of cost:
|
| https://c-payne.com/products/pcie-gen4-switch-
| backplane-4-x1...
|
| Not sure if there exists an 8-way PCIe Gen 5 multiplexer
| that doesn't cost ludicrous amounts of cash. Ludicrous
| being a highly subjective and relative term of course.
| namibj wrote:
| 98 lanes of PCIe 4.0 fabric switch as just the chip (to
| solder onto a motherboard/backplane) costs $850
| (PEX88096). You could for example take 2 x16 GPUs, pass
| them through (2 x 2 x 16 = 64 lanes), and have 2 x16 that
| bifurcate to at least x4 (might even be x2, I didn't find
| that part of the docs just now) for anything you want,
| plus 2 x1 for minor stuff. They do claim to have no
| problems being connected up into a switching fabric, and
| very much allow multi-host operations (you will need
| signal retimers quite soon, though).
|
| They're the stuff that enables cloud operators to pool
| like 30 GPUs across like 10 CPU sockets while letting you
| virtually hot-plug them to fit demand. Or when you want
| to make a SAN with real NVMe-over-PCIe. Far cheaper than
| normal networking switches with similar ports (assuming
| hosts doing just x4 bifurcation, it's very comparable to
| a 50G Ethernet port. The above chip thus matches a 24
| port 50G Ethernet switch. Trading reach for only needing
| retimers, not full NICs, in each connected host. Easily
| better for HPC clusters up to about 200 kW made from
| dense compute nodes.), but sadly still lacking affordable
| COTS parts that don't require soldering or contacting
| sales for pricing (the only COTS with list prices seem to
| be Broadcom's reference designs, for prices befitting an
| evaluation kit, not a Beowulf cluster).
| segfaultbuserr wrote:
| It's more difficult to split your work across 6 GPUs evenly,
| and easier when you have 4 or 8 GPUs. The latter setups have
| powers of 2, which for example, can evenly divide a 2D or 3D
| grid, but 6 GPUs are awkward to program. Thus, the OP argues
| that a 6-GPU setup is highly suboptimal for many existing
| applications and there's no point in paying more for the
| extra two.
| numpad0 wrote:
| I was googling public NVIDIA SXM2 materials the other day, and
| it seemed SXM2/NVLink 2.0 was just a six-way system. NVIDIA SXM
| has since been updated to versions 3 and 4, and this isn't
| based on any of those anyway, but maybe there's something we
| don't know that makes six-way reasonable.
| andersa wrote:
| It was probably just before running LLMs with tensor
| parallelism became interesting. There are plenty of other
| workloads that can be divided by 6 nicely, it's not an end-
| all thing.
| dheera wrote:
| What is a six-way system?
| liuliu wrote:
| 6 seems reasonable. The 128 lanes from Threadripper need to
| leave a few for networking and NVMe (4x NVMe would be x16
| lanes, and a 10G network would be another x4 lanes).
| cjbprime wrote:
| I don't think P2P is very relevant for inference. It's
| important for training. Inference can just be sharded across
| GPUs without sharing memory between them directly.
| andersa wrote:
| It can make a difference when using tensor parallelism to run
| small batch sizes. Not a huge difference like training
| because we don't need to update all weights, but still a
| noticeable one. In the current inference engines there are
| some allreduce steps that are implemented using nccl.
|
| Also, paged KV cache is usually spread across GPUs.
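|
| A rough sketch of the kind of allreduce those engines issue
| (assuming PyTorch with the NCCL backend; NCCL takes the P2P
| path between GPUs when the driver exposes it):
|
|     import torch
|     import torch.distributed as dist
|
|     # One process per GPU, launched e.g. with torchrun.
|     dist.init_process_group(backend="nccl")
|     torch.cuda.set_device(dist.get_rank())
|
|     # After a sharded matmul each rank holds a partial result;
|     # the allreduce sums the partials across all GPUs.
|     partial = torch.randn(4096, device="cuda")
|     dist.all_reduce(partial, op=dist.ReduceOp.SUM)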
| namibj wrote:
| It massively helps arithmetic intensity to batch during
| inference, and the batch sizes that calls for tend to exceed
| the memory capacity of a single GPU. Hence the desire for
| training-like cluster processing, e.g. reusing a weight for
| every inference stream that needs it each time it's fetched
| from memory. It's just that you typically can't fit 100+
| inference streams of context on one GPU, hence the desire to
| shard along dimensions that are less wasteful of memory
| bandwidth than entire inference streams.
| qeternity wrote:
| You are talking about data parallelism. Depending on the
| model tensor parallelism can still be very important for
| inference.
| georgehotz wrote:
| tinygrad supports uneven splits. There's no fundamental reason
| for 4 or 8, and work should almost fully parallelize on any
| number of GPUs with good software.
|
| We chose 6 because we have 128 PCIe lanes, aka 8 16x ports. We
| use 1 for NVMe and 1 for networking, leaving 6 for GPUs to
| connect them in full fabric. If we used 4 GPUs, we'd be wasting
| PCIe, and if we used 8 there would be no room for external
| connectivity aside from a few USB3 ports.
| doctorpangloss wrote:
| Have you compared 3x 3090-3090 pairs over NVLink?
|
| IMO the most painful thing is that since these hardware
| configurations are esoteric, there is no software that
| detects them and moves things around "automatically."
| Regardless of what people think device_map="auto" does, and
| anyway, Hugging Face's transformers/diffusers are all over
| the place.
| davidzweig wrote:
| Is it possible a similar patch would work for P2P on 3090s?
|
| btw, I found a Gigabyte board on Taobao that is unlisted on
| their site: MZF2-AC0, costs $900. 2-socket EPYC and 10 PCIe
| slots, may be of interest. A case that should fit, with 2x
| 2000W Great Wall PSUs and PDU is 4050 RMB
| (https://www.toploong.com/en/4GPU-server-case/644.html). You
| still need blower GPUs.
| cjbprime wrote:
| Doesn't nvlink work natively on 3090s? I thought it was
| only removed (and here re-enabled) in 4090.
| qeternity wrote:
| This is not NVLink.
| georgehotz wrote:
| It should if your 3090s have Resizable BAR support in the
| VBIOS. AFAIK most card manufacturers released BIOS updates
| enabling this.
|
| Re: 3090 NVLink, that only allows pairs of cards to be
| connected. PCIe allows full fabric switch of many cards.
| Ratiofarmings wrote:
| In cases where they didn't, the techpowerup vBIOS
| collection solves the problem.
| andersa wrote:
| That is very interesting if tinygrad can support it! Every
| other library I've seen had the limitation on dividing the
| heads, so I'd (perhaps incorrectly) assumed that it's a
| general problem for inference.
| AnthonyMouse wrote:
| Is there any reason you couldn't use 7? 8 PCIe lanes each
| seems more than sufficient for NVMe and networking.
| xipho wrote:
| You can watch this happen on the weekends, typically,
| sometimes for some very long sessions.
| https://www.twitch.tv/georgehotz
| BeefySwain wrote:
| Can someone ELI5 what this may make possible that wasn't possible
| before? Does this mean I can buy a handful of 4090s and use it in
| lieu of an h100? Just adding the memory together?
| segfaultbuserr wrote:
| No. The Nvidia A100 has a multi-lane NVLink interface with a
| total bandwidth of 600 GB/s. The "unlocked" Nvidia RTX 4090
| uses PCIe P2P at 50 GB/s. It's not going to replace A100 GPUs
| for serious production work, but it does unlock a datacenter-
| exclusive feature and has some small-scale applications.
| xmorse wrote:
| Finally switched to Nvidia and already adding great value
| perfobotto wrote:
| What stops nvidia from making sure this stops working in future
| driver releases?
| __MatrixMan__ wrote:
| The law, hopefully.
|
| Beeper mini only worked with iMessage for a few days before
| Apple killed it. A few months later the DOJ sued Apple. Hacks
| like this show us the world we could be living in, a world
| which can be hard to envision otherwise. If we want to actually
| live in that world, we have to fight for it (and protect the
| hackers besides).
| StayTrue wrote:
| I was thinking the same but in terms of firmware updates.
| aresant wrote:
| So assuming you utilized this with (4) x 4090s, is there a
| theoretical performance comparison vs the A6000 / other
| professional lines?
| thangngoc89 wrote:
| I believe this is mostly for memory capacity. PCIe access
| between GPUs is slower than soldered RAM on a single GPU.
| c0g wrote:
| Any idea of DDP perf?
| No1 wrote:
| The original justification that Nvidia gave for removing Nvlink
| from the consumer grade lineup was that PCIe 5 would be fast
| enough. They then went on to release the 40xx series without PCIe
| 5 and P2P support. Good to see at least half of the equation
| being completed for them, but I can't imagine they'll allow this
| in the next gen firmware.
| musha68k wrote:
| OK now we are seemingly getting somewhere. I can feel the
| enthusiasm coming back to me.
|
| Especially in light of what's going on with LocalLLaMA etc:
|
| https://www.reddit.com/r/LocalLLaMA/comments/1c0mkk9/mistral...
| thangngoc89 wrote:
| > You may need to uninstall the driver from DKMS. Your system
| needs large BAR support and IOMMU off.
|
| Can someone point me to the correct tutorial on how to do these
| things?
| unaindz wrote:
| The first one, I assume, is the Nvidia driver for Linux
| installed using DKMS. Whether it uses DKMS or not is stated in
| the driver's name, at least on Arch-based distributions.
|
| The latter options are settings in your motherboard BIOS; if
| your computer is modern, explore your BIOS and you will find
| them.
| jasomill wrote:
| DKMS: uninstall Nvidia driver using distro package manager
|
| BAR: enable resizable BAR in motherboard CMOS setup
|
| IOMMU: Add "amd_iommu=off" or "intel_iommu=off" to kernel
| command line for AMD or Intel CPU, respectively (or just add
| both). You may or may not need to disable the IOMMU in CMOS
| setup (Intel calls its IOMMU VT-d).
|
| See motherboard docs for specific option names. See distro docs
| for procedures to list/uninstall packages and to add kernel
| command line options.
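|
| A small sanity check after rebooting (a sketch assuming Linux
| sysfs paths and an example PCI address; the largest BAR should
| cover the card's whole VRAM, and there should be no IOMMU
| groups):
|
|     import os
|
|     dev = "/sys/bus/pci/devices/0000:01:00.0"  # your GPU's address
|     with open(dev + "/resource") as f:
|         bars = [line.split() for line in f if line.strip()]
|     largest = max(int(end, 16) - int(start, 16) + 1
|                   for start, end, _ in bars if start != end)
|     print("largest BAR: %.1f GiB" % (largest / 2**30))
|
|     # No entries here means the IOMMU is off.
|     groups = "/sys/kernel/iommu_groups"
|     print("IOMMU groups:", len(os.listdir(groups))
|           if os.path.isdir(groups) else 0)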
| spxneo wrote:
| Does this mean you can horizontally scale to a GPT-4-esque LLM
| locally in the near future? (I hear you need 1TB of VRAM.)
|
| Does Apple's large VRAM offering, like 192GB, offer the fastest
| bandwidth, and if so, how will pairing a bunch of 4090s like in
| the comments work?
| lawlessone wrote:
| This is very interesting.
|
| I can't afford two mortgages though, so for me it will have to
| just stay as something interesting :)
| m3kw9 wrote:
| In layman terms what does this enable?
| vladgur wrote:
| curious if this will ever make it to 3090s
| cavisne wrote:
| How does this compare in bandwidth and latency to nvlink? (I'm
| aware it's not available on the consumer cards)
| wmf wrote:
| It's 5x-10x slower.
| modeless wrote:
| What are the chances that Nvidia updates the firmware to disable
| this and prevents downgrading with efuses? Someday cards that
| still have older firmware may be more valuable. I'd be cautious
| upgrading drivers for a while.
| theturtle32 wrote:
| WTF is P2P?
| theturtle32 wrote:
| Answered my own question with a Google search:
|
| https://developer.nvidia.com/gpudirect#:~:text=LEARN%20MORE%...
| .
|
| > GPUDirect Peer to Peer
| > Enables GPU-to-GPU copies as well as loads and stores
| > directly over the memory fabric (PCIe, NVLink). GPUDirect
| > Peer to Peer is supported natively by the CUDA Driver.
| > Developers should use the latest CUDA Toolkit and drivers
| > on a system with two or more compatible devices.
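|
| For anyone curious, a quick way to see whether the driver
| actually exposes P2P between a pair of cards (a sketch
| assuming PyTorch with CUDA):
|
|     import torch
|
|     # True if the driver reports peer access between GPUs 0 and 1.
|     print(torch.cuda.can_device_access_peer(0, 1))
|
|     # Cross-GPU copy: goes over PCIe P2P when peer access is
|     # enabled, otherwise it bounces through system RAM.
|     a = torch.randn(1024, 1024, device="cuda:0")
|     b = a.to("cuda:1")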
___________________________________________________________________
(page generated 2024-04-12 23:00 UTC)