[HN Gopher] What lengths will Chinese companies go to get an Nvi...
       ___________________________________________________________________
        
       What lengths will Chinese companies go to get an Nvidia A100 chip?
        
       Author : yorwba
       Score  : 49 points
       Date   : 2024-02-26 15:24 UTC (7 hours ago)
        
 (HTM) web link (chinai.substack.com)
 (TXT) w3m dump (chinai.substack.com)
        
       | wtcactus wrote:
       | Interesting, I didn't know this was so sought after.
       | 
        | I actually have one for sale (the 40GB PCIe one), but I haven't
        | gotten around to listing it on eBay yet, due to lack of time (and
        | because I didn't think there was so much interest in it).
       | 
        | To be honest, maybe for DL this really is much better than the
        | alternatives, but for some simulations and for parallelizing some
        | radiative transfer code, it was not that much better than an RTX
        | 4090, with the extra hassle that it's more difficult to cool.
        
         | moondev wrote:
          | How did you end up cooling it? I have an A40, and it's been
          | interesting testing all kinds of methods, from two 40mm fans to
          | a 3A 9030 centrifugal blower with a 3D-printed duct.
        
         | siver_john wrote:
          | As someone who used various types of GPUs in graduate school:
          | for most simulations, and even machine learning (unless you
          | need the VRAM), you are generally better off going with a
          | consumer card. They have roughly the same number of CUDA
          | cores, and the higher clock speeds will generally net you
          | better performance overall.
         | 
          | The exception is simulations that need double-precision
          | floating point (which you previously could get in the Titan
          | series of consumer-ish cards). And where it is super important
          | for DL is the VRAM: it allows you to use much larger models.
          | Plus there's the added feature of being able to string the
          | cards together and share memory, which has been left off
          | consumer cards (honestly, in a way that makes sense, because
          | SLI has been dumb for some time).
        
         | rhdunn wrote:
         | The A100 is comparable to the 3090 but with more memory. The
         | H100 is the one comparable to the 4090.
         | 
         | The advantage of these is the access to the larger memory. And
         | they are able to be linked together such that they all share
         | the same memory via NVLink. This makes them scalable for
         | processing the large data and holding the models for the larger
         | scale LLMs and other NN/ML based models.
        
           | tmaly wrote:
           | Have you seen an actual A100?
           | 
            | They are massive; I can't imagine them being comparable to
            | a 3090 at all.
        
             | Feorn wrote:
             | A reference 3090 is longer by 69mm, wider by 29mm, and
             | thicker by a slot than a PCIe A100.
             | 
             | Though I think the comment you're replying to was talking
             | about them both using the same Nvidia GPU architecture,
             | Ampere.
        
           | adfbkandfionio wrote:
           | >And they are able to be linked together such that they all
           | share the same memory via NVLink. This makes them scalable
           | for processing the large data and holding the models for the
           | larger scale LLMs and other NN/ML based models.
           | 
           | GPUs connected with NVLink do not exactly share memory. They
           | don't look like a single logical GPU. One GPU can issue loads
           | or stores to a different GPU's memory using "GPUDirect Peer-
           | To-Peer", but you cannot have a single buffer or a single
           | kernel that spans multiple GPUs. This is easier to use and
           | more powerful than the previous system of explicit copies
           | from device to device, perhaps, but a far cry from the way
           | multiple CPU sockets "just work". Even if you could treat the
           | system as one big GPU you wouldn't want to. The performance
           | takes a serious hit if you constantly access off-device
           | memory.
           | 
           | NVLink doesn't open up any functionality that isn't available
           | over PCIe, as far as I know. It's "merely" a performance
           | improvement. The peer-to-peer technology still works without
           | NVLink.
           | 
           | NVidia's docs are, as always, confusing at best. There are
           | several similarly-named technologies. The main documentation
           | page just says "email us for more info". The best online
           | documentation I've found is in some random slides.
           | 
           | https://developer.nvidia.com/gpudirect
           | 
           | https://developer.download.nvidia.com/CUDA/training/cuda_web.
           | ..
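            | 
            | A toy sketch of the distinction above (pure Python, not real
            | CUDA; all names are illustrative): each GPU gets its own
            | private address space, a peer "load" is an explicit cross-
            | device access, and no single buffer spans two devices.

```python
class ToyGPU:
    """Toy model of a GPU's private address space (illustrative only)."""

    def __init__(self, name):
        self.name = name
        self.mem = {}  # this device's own memory; not visible to peers

    def store(self, addr, value):
        self.mem[addr] = value

    def peer_load(self, other, addr):
        # Analogous to GPUDirect Peer-To-Peer: one GPU explicitly reads
        # another GPU's memory. Every such access crosses the interconnect,
        # which is why constantly touching off-device memory is slow.
        return other.mem[addr]

gpu0, gpu1 = ToyGPU("cuda:0"), ToyGPU("cuda:1")
gpu0.store(0x10, "weight_shard")
print(gpu1.peer_load(gpu0, 0x10))  # works: explicit cross-device read
print(0x10 in gpu1.mem)            # False: the buffer lives only on gpu0
```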
        
             | rhdunn wrote:
             | Interesting. So that would mean that you would still need a
             | 40 or 80 GB card to run the larger models (30B LLM, 70B
             | LLM, 8x7B LLM) and perform training of them.
             | 
             | Or would it be possible to split the model layers between
             | the cards like you can between RAM and VRAM? I suppose in
             | that case each card would be able to evaluate the results
             | of the layers in its own memory and then pass those results
             | to the other card(s) as necessary.
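              | 
              | Some back-of-envelope arithmetic for the memory question,
              | plus a sketch of the layer-splitting idea (this is
              | essentially what pipeline-style model parallelism does).
              | Weights only, ignoring activations and KV cache; the
              | helper names are made up for illustration.

```python
def weights_vram_gib(n_params_billion, bytes_per_param=2):
    """Rough VRAM (GiB) just to hold the weights; fp16 = 2 bytes/param."""
    return n_params_billion * 1e9 * bytes_per_param / 2**30

def split_layers(n_layers, n_devices):
    """Contiguous layer partition, pipeline-parallel style: each card
    holds a slice of layers and passes activations to the next card."""
    base, extra = divmod(n_layers, n_devices)
    parts, start = [], 0
    for i in range(n_devices):
        size = base + (1 if i < extra else 0)
        parts.append(range(start, start + size))
        start += size
    return parts

# A 70B model in fp16 needs ~130 GiB for weights alone, so it cannot fit
# on a single 80 GB card without quantization or splitting.
print(f"70B fp16 weights: ~{weights_vram_gib(70):.0f} GiB")

# An 80-layer model spread across 3 cards:
print(split_layers(80, 3))
```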
        
       | dachworker wrote:
       | Here's an interesting question: why are there lots of startups
       | claiming they can create better bespoke DL hardware accelerators
       | than Nvidia's offerings?
       | 
        | I understand how a bespoke architecture could in theory
        | accommodate larger models and offer better throughput despite
        | being based on older MOSFET nodes. But if that were the case,
        | wouldn't China simply create their own bespoke hardware
        | accelerators? So what's stopping them?
        
         | russli1993 wrote:
         | hint: they are
        
         | alephnerd wrote:
         | > wouldn't China simply create their own bespoke hardware
         | accelerators? So what's stopping them?
         | 
         | Fabrication.
         | 
         | SMEE is supposedly able to mass produce 28nm lithography
         | machines, but most modern (2016/Pascal onwards) GPUs are
         | fabricated using 16nm lithography or lower (eg. Ampere uses a
         | 7nm process, and there are multiple newer architectures in the
         | pipeline that leverage 3nm fabrication processes at Samsung,
         | TSMC, and Intel).
         | 
         | Chinese companies like SMEE are trying, but it will take 3-5
         | years to reach 16nm lithography at scale.
         | 
         | Also, GPUs are being limited for simulation (aka Nuclear
         | weapons testing) reasons - not "AI" - as just about every
         | country except North Korea honors the Comprehensive Nuclear
         | Test Ban Treaty, forcing countries to test using HPC (edit:
         | also used for Jet Turbine simulation a la Autodesk Federal).
         | 
          | A 28nm process is more than enough to make EWS, avionics, and
          | precision weapons, which is why Russia can manufacture the
          | Elbrus-8S chipset domestically.
          | 
          | This is partially why companies like Nvidia have begun moving
          | fabrication to Samsung fabs over TSMC in the short term, as
          | South Korea has a formal defense agreement with the US, unlike
          | Taiwan.
        
           | mistrial9 wrote:
           | > GPUs are being limited for simulation (aka Nuclear weapons
           | testing) reasons - not "AI"
           | 
           | that information seems dated in 2024
        
             | alephnerd wrote:
             | Not really.
             | 
             | You don't need bespoke cutting edge hardware or models for
             | most defense applications (aka to kill people) today.
             | 
             | For example, C-RAMs are using Maxwell level hardware at
             | most.
             | 
             | The biggest driver for GPU, FPGA, and CPU development has
             | been nuclear research, and is a major reason why the top
             | supercomputers and HPC programs globally are usually linked
             | with Nuclear Weapons Labs (eg. LLNL, LBL, Argonne, Oak
             | Ridge, NSC Guangzhou).
             | 
             | It just so happens that you use the same math for nuclear
             | simulations as you would for "ML", bioinformatics, and
             | computer graphics.
             | 
             | It's all Numerical Analysis and Optimization Theory at the
             | end of the day.
        
               | rfoo wrote:
               | I don't know. I thought Gina specifically said the recent
               | ban is for AI. [0]
               | 
               | > "What we cannot allow them to ship is the most
               | sophisticated, highest-processing power AI chips, which
               | would enable China to train their frontier models," she
               | added.
               | 
               | [0] https://www.reuters.com/technology/us-talks-with-
               | nvidia-abou...
        
               | alephnerd wrote:
               | The origin of these bans started in 2016-17, with Intel
               | Xeon processors being restricted after the Chinese NSC
               | program was found to be using some of their HPC infra for
               | nuclear weapons simulations.
               | 
               | No one wants to say the "Ne" word as it causes some
               | pretty severe domestic political blowback.
               | 
               | Already China has been looking at countering American
               | second strike capabilities with their nuclear weapons
               | buildup over the past decade.
               | 
                | It also doesn't help that, unlike most previous post-
                | Mao Premiers, Xi Jinping started his career in the PLA.
        
               | rfoo wrote:
               | Xeon Phi bans were _targeted_ and successful. It shows
               | that you don 't need to issue a country-wide ban to
               | prevent China from using a chip to build their
               | supercomputers.
               | 
                | And then Sunway happened; it's built for military-use
                | supercomputers. Do we even know whether it's 40nm, 28nm,
                | or 14nm, or who on earth fabricated it? And nothing
                | changed in how export control works after that, so that
                | was certainly not the trigger.
                | 
                | GPU bans _before ChatGPT happened_ were also _targeted_,
                | similar to how the BIS Entity List works.
               | 
                | Let's face it: the recent ban-entire-China movement is
                | just about "AI", not HPC/simulations; its only purpose
                | is to deny China NVIDIA GPU access and ensure they can't
                | compete on SOTA language models.
        
           | kkarakk wrote:
           | >as just about every country except North Korea honors the
           | Comprehensive Nuclear Test Ban Treaty
           | 
            | I don't think this is correct -
            | https://www.nti.org/education-center/treaties-and-
            | regimes/co... there are 8 other countries that have not
            | signed, so why would they honor the treaty?
            | 
            | India in particular.
        
             | alephnerd wrote:
             | > india in particular
             | 
              | The last time India (and Pakistan) tested a live nuclear
              | weapon was 1998, and both faced SEVERE sanctions at the
              | time - which was a major reason the two countries stopped
              | buying American weapons systems and switched to Israeli
              | and Chinese vendors, respectively.
              | 
              | The only country left that tests nuclear weapons live is
              | North Korea.
        
               | kkarakk wrote:
               | yes but couldn't the countries that want to do
               | simulations just buy through india/saudi arabia etc?
        
               | alephnerd wrote:
               | It doesn't scale out logistically.
               | 
               | While some lossage is expected, you can't build a
               | Tianhe-2 type supercomputer by smuggling Nvidia (or AMD)
               | GPUs - and a nuclear program needs dozens.
               | 
               | China, Russia, NK, and Iran have very severe hardware
               | import restrictions circa 2024.
               | 
                | This is why Russia, China, and Iran have been building
                | out domestic fabrication capabilities (Russia in the
                | early 2010s, China in the early 2020s, and Iran
                | presently).
        
               | kkarakk wrote:
               | Interesting, thank you for answering!
        
         | cherioo wrote:
         | Huawei has created the "ascend" series ai chip. Reportedly it
         | has 80% of A100 in theoretical performance.
         | 
         | Rumor is that they had trouble selling it against A100 due to
         | worse price to performance and worse software integration. But
         | the US sanction has now created a huge market for it.
        
           | alephnerd wrote:
           | It's still fabricated by TSMC though.
           | 
           | Domestic fabrication is still at the 28nm phase at most.
           | 
            | That said, the Ascend series is a fairly massive leap, as it
            | means domestic EDA capabilities have grown massively in
            | China.
            | 
            | However, I remember they had massive design labs in both SV
            | and Delhi+Bangalore that poached from the Samsung, Nvidia,
            | and Intel labs in town before they were kicked out of the US
            | and India, so I don't know how much design was done
            | domestically in China.
        
             | DinaCoder99 wrote:
             | > Domestic fabrication is still at the 28nm phase at most.
             | 
             | Allegedly SMIC has both 14nm and 7nm fabs as of last year.
        
               | alephnerd wrote:
               | Allegedly but with low yield rates and leveraging
               | ASML+Nikon DUV products.
               | 
               | Not to say that Chinese vendors won't eventually crack
               | that nut, but it'll take several (3-7) years. Also,
               | domestic design capacity seems to be somewhat lacking
               | (though rapidly changing)
               | 
                | The most cutting-edge domestic lithography tool China
                | has is SMEE's 28nm one - that's still a MASSIVE
                | accomplishment, but still significantly far from
                | 20/16/14/7nm processes.
               | 
               | Even Russia had domestic capabilities for 28nm
               | fabrication at scale in the 2010s.
               | 
               | That said, for most military use cases, this is more than
               | enough.
        
         | imtringued wrote:
         | Because most of them only implement inference, duh.
        
       | elzbardico wrote:
        | If money is not a problem for them, I don't see how we can deny
        | Chinese companies access to those cards.
        | 
        | The Soviets were able to smuggle whole mainframes and mini-
        | computers during the 70s. Giant fucking machines that you could
        | only buy through very specific sales channels. It was not like
        | you could eBay yourself an S/360, and yet they did it.
        
         | The_Colonel wrote:
          | It's not a question of "if", but "how many". And that's
          | largely what matters, same as in the case of the USSR.
        
         | __loam wrote:
          | Asianometry on YouTube has a ton of videos on Soviet
          | computing, including ones discussing things like the failed
          | semiconductor industry in East Germany. The Soviets certainly
          | smuggled and tried to clone a lot of machines, but they were
          | always behind the United States.
        
       | tibbydudeza wrote:
        | I have seen some reports of modded GPUs in Brazil and China
        | where they convert earlier-gen consumer cards (3000 series) into
        | 12GB or 16GB variants using GDDR6 from Samsung.
        | 
        | It seems the GPU boot code is not hard-limited but reads the
        | capacity from whatever memory is installed.
        
       ___________________________________________________________________
       (page generated 2024-02-26 23:02 UTC)