[HN Gopher] The art of high performance computing
       ___________________________________________________________________
        
       The art of high performance computing
        
       Author : rramadass
       Score  : 415 points
       Date   : 2023-12-30 14:01 UTC (8 hours ago)
        
 (HTM) web link (theartofhpc.com)
 (TXT) w3m dump (theartofhpc.com)
        
       | mkoubaa wrote:
       | UT Austin really is a fantastic institution for HPC and
       | computational methods.
        
         | bee_rider wrote:
         | Every BLAS you want to use has at least some connection to UT
         | Austin's TACC.
        
           | mgaunard wrote:
            | Aren't the LAPACK people in Tennessee?
        
             | bee_rider wrote:
             | Sort of like BLAS, LAPACK is more than just one
             | implementation. Dongarra described what everybody should do
              | from Tennessee, but other places implemented it elsewhere.
        
           | victotronics wrote:
           | Not quite. Every modern BLAS is (likely) based on Kazushige
           | Goto's implementation, and he was indeed at TACC for a while.
            | But probably the best open-source implementation, BLIS, is
            | from UT Austin, though not connected to TACC.
        
             | bee_rider wrote:
             | Oh really? I thought BLIS was from TACC. Oops, mea culpa.
        
               | RhysU wrote:
               | https://github.com/flame/blis/
               | 
               | Field et al, recent winners of the James H. Wilkinson
               | Prize for Numerical Software.
               | 
               | Field and Goto both collaborated with Robert van de
               | Geijn. Lots of TACC interaction in that broader team.
        
       | davidthewatson wrote:
       | I was asked to share a TA role on a graduate course in HPC a
       | decade ago. I turned down the offer.
       | 
       | After a cursory glance, I can honestly say that if this book were
       | available then, I'd have taken the opportunity.
       | 
        | The combination of what I perceive to be Knuth's framing of art,
        | along with carpentry and the need to be a better devops person
        | than your devops person, is compelling.
       | 
       | Kudos to the author for such an achievement. UT Austin seems to
       | have achieved in computer science what North Texas State did in
       | music.
        
       | atrettel wrote:
       | I took a course on scientific computing in 2013. It was cross-
       | listed under both the computer science and applied math
       | departments. The issue is that the field is pretty broad overall
       | and a lot of topics were covered in a cursory manner, including
       | anything related to HPC and parallel programming in particular. I
       | don't regret taking the course, but it was too broad for the
       | applications I was pursuing.
       | 
       | I haven't looked at what courses are being offered in several
       | years, but when I was a graduate student, I really would have
       | benefited from a dedicated semester-long course on parallel
       | computing, especially going into the weeds about particular
       | algorithms and data structures in parallel and distributed
       | computing. Those were handled in a super cursory manner in the
       | scientific computing course I took, as if somehow you'd know
       | precisely how to parallelize things the first time you try. I've
       | since learned a lot of this stuff on my own and from colleagues
       | over the years, as many people do in HPC, but books like these
       | would have been invaluable as part of a dedicated semester-long
       | course.
        
       | dist1ll wrote:
        | It's very interesting how abstracted away from the hardware HPC
        | sometimes looks. The books seem to revolve a lot around SPMD
       | programming, algo & DS, task parallelism, synchronization etc,
       | but very little about computer architecture details like
       | supercomputer memory subsystems, high-bandwidth interconnects
       | like CXL, GPU architecture and so on. Are the abstractions and
       | tooling already good enough that you don't need to worry about
        | these details? I'm also curious whether HPC practitioners have to
        | fiddle with a lot of black-box knobs to squeeze out performance.
        
         | bee_rider wrote:
         | I don't think I do HPC (I only will use up to, say, 8 nodes at
         | a time), but the impression I get is that they are already
          | working on quite hard problems at the high level, so they need
         | to lean on good libraries for the low-level stuff, otherwise it
         | is just too much.
        
         | atrettel wrote:
         | Yes and no.
         | 
         | MPI and OpenMP are the primary abstractions from the hardware
         | in HPC, with MPI being an abstracted form of distributed-memory
         | parallel computing and OpenMP being an abstracted form of
         | shared-memory parallel computing. Many researchers write their
         | codes purely using those, often both in the same code. When
         | using those, you really do not need to worry about the
         | architectural details most of the time.
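          | 
          | A minimal sketch of the hybrid style (my example, not from the
          | books); note that nothing in it names the actual hardware:
          | 
          |     // hybrid sketch: MPI across nodes, OpenMP threads within
          |     #include <mpi.h>
          |     #include <omp.h>
          |     #include <cstdio>
          |     
          |     int main(int argc, char **argv) {
          |         MPI_Init(&argc, &argv);
          |         int rank, nprocs;
          |         MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // distributed level
          |         MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
          |         #pragma omp parallel                   // shared-memory level
          |         std::printf("rank %d of %d, thread %d of %d\n",
          |                     rank, nprocs, omp_get_thread_num(),
          |                     omp_get_num_threads());
          |         MPI_Finalize();
          |     }
          | 
          | (Built with something like "mpicxx -fopenmp" and launched with
          | mpirun; the node and thread counts come from the scheduler.)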
         | 
          | Still, some researchers who like to optimize things further do
          | in fact fiddle with a lot of small architectural details to
          | squeeze out more performance. For example, loop unrolling is
         | pretty common and can get quite confusing in my opinion. I
         | vaguely recall some stuff about trying to vectorize operations
         | by preferring addition over multiplication due to the
         | particular CPU architecture, but I do not think I've seen that
         | in practice.
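          | 
          | For concreteness, a sketch of a 4-way manual unroll (my
          | hypothetical AXPY loop, not a quote from any real code); the
          | point is that the four updates are independent, so they can be
          | vectorized or overlapped:
          | 
          |     // y += a*x, unrolled by 4; the four statements have no
          |     // dependence on one another
          |     void axpy(int n, double a, const double *x, double *y) {
          |         int i = 0;
          |         for (; i + 3 < n; i += 4) {
          |             y[i]     += a * x[i];
          |             y[i + 1] += a * x[i + 1];
          |             y[i + 2] += a * x[i + 2];
          |             y[i + 3] += a * x[i + 3];
          |         }
          |         for (; i < n; ++i)  // remainder loop
          |             y[i] += a * x[i];
          |     }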
         | 
         | Preventing cache misses is another major one, where some codes
         | are written so that the most needed information is stored in
         | the CPU's cache rather than memory. Most codes only handle this
         | by ensuring column-major order loops for array operations in
         | Fortran or row-major order loops in C, but the concept can be
         | extended further. If you know the cache size for your
         | processors, you could hypothetically optimize some operations
         | to keep all of the needed information inside the cache to
         | minimize cache misses. I've never seen this in practice but it
         | was actively discussed in the scientific computing course I
         | took in 2013.
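          | 
          | The loop-order point, as a sketch (hypothetical n-by-n array):
          | 
          |     const int n = 1024;          // hypothetical size
          |     static double a[n][n];
          |     double sum = 0.0;
          |     
          |     // row-major order: the inner index walks contiguous memory
          |     for (int i = 0; i < n; ++i)
          |         for (int j = 0; j < n; ++j)
          |             sum += a[i][j];      // cache-friendly in C/C++
          |     
          |     // swapped order strides a whole row per access
          |     for (int j = 0; j < n; ++j)
          |         for (int i = 0; i < n; ++i)
          |             sum += a[i][j];      // cache-hostile in C/C++
          | 
          | The cache-size idea is the same thing taken further: tile the
          | loops so each block of data fits in L1/L2 before moving on.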
         | 
         | The use of particular GPUs depends heavily on the problem being
         | solved, with some being great on GPUs and others being too
         | difficult. I'm not too knowledgeable about that, unfortunately.
        
           | bee_rider wrote:
           | Of course, not every problem can be solved by BLAS, but if
           | you are doing linear algebra, the cache stuff should be
           | mostly handled by BLAS.
           | 
           | I'm not sure how much multiplication vs addition matters on a
           | modern chip. You can have a bazillion instructions in flight
           | after all, as long as they don't have any dependencies, so
           | I'd go with whichever option shortens the data dependencies
           | on the critical path. The computer will figure out where to
            | park longer instructions if it needs to.
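            | 
            | A sketch of what I mean (mine, not from any particular
            | library): a sum with one accumulator is one long dependency
            | chain, while splitting the accumulator shortens the critical
            | path:
            | 
            |     // one chain: every add waits for the previous add
            |     double sum = 0.0;
            |     for (int i = 0; i < n; ++i)
            |         sum += x[i];
            |     
            |     // two chains: half the dependency depth; same result up
            |     // to floating-point reassociation
            |     double s0 = 0.0, s1 = 0.0;
            |     for (int i = 0; i + 1 < n; i += 2) {
            |         s0 += x[i];
            |         s1 += x[i + 1];
            |     }
            |     if (n % 2) s0 += x[n - 1];
            |     double sum2 = s0 + s1;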
        
             | atrettel wrote:
             | You're right that the addition vs. multiplication issue
             | likely does not matter on a modern chip. I just gave the
             | example because it shows how the CPU architecture can
             | affect how you write the code. I do not recall precisely
             | when or where I heard the idea, but it was about a decade
             | ago --- ages ago by computing standards.
        
         | MichaelZuo wrote:
         | Memory architecture and bandwidth are still very important,
         | most of IBM's latest performance gains for both mainframes and
         | POWER are reliant on some novel innovations there.
        
         | eslaught wrote:
         | No, the abstractions are not sufficient. We do care about these
         | details, a lot.
         | 
         | Of course, not every application is optimized to the hilt. But
          | if you _want_ to optimize an application that far, exactly the
          | things you're talking about are what come into play.
         | 
         | So yes, I would expect every competent HPC practitioner to have
         | a solid (if not necessarily intimate) grasp of hardware
         | architecture.
        
         | mgaunard wrote:
         | Regardless of what you do, domain knowledge tends to be more
         | valuable than purely technical skills.
         | 
          | Knowing more numerical analysis will probably get you further
          | in HPC than knowledge of specific hardware architectures.
         | 
         | Ideally you want both, of course.
        
         | jandrewrogers wrote:
         | For most HPC, you will not be able to maximize parallelism and
         | throughput without intimate knowledge of the hardware
         | architecture and its behavior. As a general principle, you want
         | the topology of the software to match the topology of the
         | hardware as closely as possible for optimal scaling behavior.
         | Efficient HPC software is strongly influenced by the nature of
         | the hardware.
         | 
         | When I wrote code for new HPC hardware, people were always
         | surprised when I asked for the system hardware and architecture
         | docs instead of the programming docs. But if you understood the
         | hardware design, the correct way of designing software for it
         | became obvious from first principles. The programming docs
         | typically contained quite a few half-truths intended to make
         | things seem misleadingly easier for developers than a proper
         | understanding would suggest. In fact, some HPC platforms failed
         | in large part because they consistently misrepresented what was
         | required from developers to achieve maximum performance in
          | order to appear "easy to use", and then failed to deliver the
         | performance the silicon was capable of if you actually wrote
         | software the way the marketing implied would be effective.
         | 
         | You can write HPC code on top of abstractions, and many people
         | do, but the performance and scaling losses are often
          | an unavoidable integer factor. As with most software, this was
         | considered an acceptable loss in many cases if it allowed less
         | capable software devs to design the code. HPC is like any other
         | type of software in that most developers that notionally
         | specialize in it struggle to produce consistently good results.
         | Much of the expensive hardware used in HPC is there to mitigate
         | the performance losses of worse software designs.
         | 
         | In HPC there are no shortcuts to actually understanding how the
          | hardware works if you want maximum performance. That is no
          | different from regular software; in HPC the hardware systems
          | are just bigger and more complex.
        
         | crabbone wrote:
          | You'd be surprised how backwards and primitive the tools used
          | in HPC actually are.
         | 
          | Take for instance the so-called workload managers, of which the
          | most popular ones are Slurm, PBS, UGE, and LSF. Only Slurm is
          | really open source; PBS has a community edition; the rest is
          | proprietary stuff executed in the best traditions of enterprise
          | software, which locks you into pathetically bad tools: ancient
          | and backwards tech with crappy or nonexistent documentation and
          | inept tech support.
         | 
         | The interface between WLMs and the user who wants to use some
         | resources is through submitting "jobs". These jobs can be
         | interactive, but most often they are the so-called "batch
          | jobs". A batch job is usually defined as... a Unix shell
          | script, where certain comments are parsed as instructions to
          | the WLM. In a world with dozens of configuration formats...
          | they chose to do this: embed configuration into shell comments.
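          | 
          | Concretely, a typical Slurm batch job looks something like this
          | (the resource numbers are made up):
          | 
          |     #!/bin/bash
          |     #SBATCH --job-name=my_sim          # read by Slurm,
          |     #SBATCH --nodes=4                  # ignored by the shell
          |     #SBATCH --ntasks-per-node=16
          |     #SBATCH --time=02:00:00
          |     #SBATCH --output=my_sim.%j.log
          |     
          |     module load mpi                    # site-specific setup
          |     mpirun ./my_simulation input.dat
          | 
          | You hand this to the WLM with "sbatch job.sh" and wait for it
          | to be scheduled.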
         | 
         | Debugging job failures is a nightmare, mostly because WLM
         | software has really poor quality of execution. Pathetic error
         | reporting. Idiotic defaults. Everything is so fragile it falls
         | apart if you just as much as look at it in the wrong way.
          | Working with it reminds me of the very early days of Linux,
          | when sometimes things just wouldn't build, or would segfault
          | right after you tried running them, and there wasn't much you
         | could do beside spending days or weeks trying to debug it just
         | to get some basic functionality going.
         | 
          | When I have to deal with it, I feel kind of like I'm in a
          | steampunk movie. Some stuff is really advanced, and then you
          | find out that this advanced stuff is propped up by some DIY
          | retro nonsense you thought had died off decades ago. The advanced
         | stuff is usually more on the side of hardware, while software
         | is not keeping up with it for the most part.
        
           | StableAlkyne wrote:
           | > Working with it reminds me the very early days of Linux
           | 
           | The other cool thing about HPC is it is one of the last areas
           | where multi-user Unix is used! At least, if you're using a
           | university or NSF cluster that is!
           | 
            | The only other places I really see multiple humans using the
            | same machine are SDF or the Tildes.
        
             | victotronics wrote:
              | It's Saturday afternoon.
              | 
              |     [login1 ~:3] who | cut -d ' ' -f 1 | sort -u | wc -l
              |     41
        
           | bee_rider wrote:
           | Having switched from LSF to slurm, I have to appreciate that
           | the ecosystem is so bash-centric. Lots of re-use in the
           | conversion. If I'd had to learn some kind of slurm-markup-
           | language or slurmScript or find buttons in some SlurmWizard,
           | it would have been a nightmare.
        
             | crabbone wrote:
             | Oh LSF... I don't know if you know this. LSF is perhaps the
             | only system alive today that I know of that uses literal
             | patches as a means of software distribution.
             | 
              | The first time I saw it, I had a flashback to the times when
             | worked for HP, and they were making some huge SAP knock-
             | off, and that system was so labor-intensive to deploy that
             | their QA process involved actual patches. As in pre-release
             | QA cycle involved installing the system, validating it
             | (which could take a few weeks) and if it's not considered
             | DoD, then the developers are given the final list of things
             | they need to fix and those fixes would have to be submitted
             | as patches (sometimes, literal diffs that need to be
             | applied to the deployed system with the patch tool).
             | 
             | This is, I guess, how the "patch version component" came to
             | be in SemVer spec. It's kind of funny how lots of tools are
             | using this component today for completely unrelated
              | purposes... but yeah, LSF feels like time is ticking at a
              | different pace there :)
        
           | OPA100 wrote:
           | I've dug deeply into LSF in the last few years and it's like
           | a car crash - you can't look away. It feels like something
            | that started in the early Unix days and was developed until
            | perhaps the late 90s, but in reality LSF was only started in
           | the 90s (in academia). As far as I can tell development all
           | but stopped when IBM acquired it some ten years ago.
        
           | convolvatron wrote:
           | HPC software is one area where we have arguably regressed in
           | the last 30 years. Chapel is the only light I see in the
           | darkness
        
           | victotronics wrote:
            | You use a lot of scare quotes. Do you have any suggestions on
           | how things could be different? You need batch jobs because
           | the scheduler has to wait for resources to be available. It's
           | kinda like Tetris in processor/time space. (In fact, that's
           | my personal "proof" that workload scheduling is NP-complete:
           | it's isomorphic to Tetris.)
           | 
           | And what's wrong with shell scripts? It's a lingua franca,
           | generally accepted across scientific disciplines, cluster
           | vendors, workload managers, .... Considering the complexity
           | of some setups (copy data to node-local file systems; run
           | multiple programs, post-process results, ... ) I don't see
           | how you could set up things other than in some scripting
           | language. And then unix shell scripts are not the worst idea.
           | 
           | Debugging failures: yeah. Too many levels where something can
           | go wrong, and it can be a pain to debug. Still, your average
            | cluster processes a few million jobs in its lifetime. If more
            | than a microscopic portion of those failed, computing centers
            | would need way more personnel than they have.
        
           | romanows wrote:
           | I really like using Slurm, the documentation is great
           | (https://slurm.schedmd.com) and the model is pretty
           | straightforward, at least for the mostly-single-node jobs I
           | used it for.
           | 
            | You can launch jobs via the command line, config in Bash
            | comments, REST APIs, linking against their library, and I
            | think a few more ways.
           | 
            | I found it pretty easy to set up and admin. Scaling in the
           | cloud was way less developed when I used it, so I just hacked
           | in a simple script that allowed scaling up and down based on
           | the job queue size.
           | 
           | What do you like better and for what use-case? Mine was for a
           | group of researchers training models, and the feature _I_
           | desired most was an approximately fair distribution of
           | resources (cores, GPU hours, etc.).
        
         | dahart wrote:
         | There is a lot of abstraction, but knowing which abstraction to
         | use still takes knowing a lot about the hardware.
         | 
         | > I'm also curious if HPC practitioners have to fiddle a lot of
         | black-box knobs to squeeze out performance?
         | 
         | In my experience with CUDA developers, yes the Shmoo Plot
         | (https://en.wikipedia.org/wiki/Shmoo_plot, sometimes called a
         | 'wedge' in some industries) is one of the workhorses of every
         | day optimization. I'm not sure I'd call it black-box, though
         | maybe the net effect is the same. It's really common to have
         | educated guesses and to know what the knobs do and how they
         | work, and still find big surprises when you measure. The first
         | rule of optimization is measure. I always think of Michael
         | Abrash's first chapter in the "Black Book": "The Best Optimizer
         | is Between Your Ears"
         | http://twimgs.com/ddj/abrashblackbook/gpbb1.pdf. This is a
         | fabulous snippet of the philosophy of high performance (even
         | though it's PC game centric and not about modern HPC.)
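          | 
          | Mechanically it's nothing fancy: sweep the knobs, record a
          | metric at each grid point, and stare at the resulting surface.
          | A toy sketch (the knobs and the kernel are made up):
          | 
          |     #include <chrono>
          |     #include <cstdio>
          |     
          |     volatile double sink;                 // keep the work alive
          |     void kernel(int block, int unroll) {  // stand-in kernel
          |         double s = 0;
          |         for (int i = 0; i < 1000000; i += unroll)
          |             s += i % block;
          |         sink = s;
          |     }
          |     
          |     int main() {                  // print a CSV shmoo grid
          |         for (int b = 32; b <= 1024; b *= 2)
          |             for (int u = 1; u <= 8; u *= 2) {
          |                 auto t0 = std::chrono::steady_clock::now();
          |                 kernel(b, u);
          |                 auto us = std::chrono::duration_cast<
          |                     std::chrono::microseconds>(
          |                     std::chrono::steady_clock::now() - t0).count();
          |                 std::printf("%d,%d,%lld\n", b, u, (long long)us);
          |             }
          |     }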
         | 
         | Related to your point about abstraction, the heaviest knob-
         | tuning should get done at the end of the optimization process,
         | because as soon as you refactor or change anything, you have to
         | do the knob tuning again. A minor change in register spills or
         | cache access patterns can completely reset any fine-tuning of
         | thread configuration or cache or shared memory size, etc..
         | Despite this, some healthy amount of knob tuning is still done
         | along the way to check & balance & get an intuitive sense of
         | the local perf space of the code. (Just noticed Abrash talks a
         | little about why this is a good idea.)
        
           | squidgyhead wrote:
           | Could you explain how you use a shmoo plot for optimization?
           | Do you just have a performance metric at each point in
           | parameter space?
        
         | marcosdumay wrote:
          | It's not intuitive, but HPC is more about scalability than
          | performance.
         | 
         | You won't be able to use a supercomputer at all without
         | scalability, and it's the one topic that is specific to it.
          | But, of course, those computers' time is quite expensive, so
         | you'll want to optimize for performance too. It's just
         | secondary.
        
         | bluedino wrote:
         | I started in HPC about 2 years ago on a ~500 node cluster at a
         | Fortune 100 company. I was really just looking for a job where
         | I was doing Linux 100% of the time, and it's been fun so far.
         | 
         | But it wasn't what I thought it would be. I guess I expected to
         | be doing more performance oriented work, analyzing numbers and
         | trying to get every last bit of performance out of the cluster.
         | To be honest, they didn't even have any kind of monitoring
         | running. I set some up, and it doesn't really get used. Once in
         | a while we get questions from management about "how busy is the
         | cluster", to justify budgets and that sort of thing.
         | 
         | Most of my 'optimization' work ends up being things like making
         | sure people aren't (usually unknowingly) requesting 384 CPUs
          | when their script only uses 16, or testing software to see how
          | many CPUs it scales to before you see degradation. I've only
          | had the Intel profiler open twice.
         | 
         | And I've found that most of the job is really just helping
         | researchers and such with their work. Typically running either
         | a commercial or open-source program, troubleshooting it, or
         | getting some code written by another team on another cluster
         | and getting it built and running on yours. Slogging through
         | terrible Python code. Trying to get a C++ project built on a
         | more modern cluster in a CentOS 7 environment.
         | 
         | It can be fun in a way. I've worked with different languages
         | over the years so I enjoy trying to get things working, digging
         | through crashes and stack traces. And working with such large
         | machines, your sense of normal gets twisted when you're on a
         | server with 'only' 128GB of RAM or 20TB of disk.
         | 
         | It's a little scary when you know the results of some of this
         | stuff are being used in the real world, and the people running
          | the simulations aren't even doing things right. Incorrect code,
          | mixed-up source code, not using the data they think they are. I
          | once found a huge bug that had existed for 3 years. Doesn't
          | that invalidate all the work you've done on the subject?
         | 
          | The one drawback I find is that a lot of HPC jobs want you to
          | have a master's degree. Even to just run the cluster. It doesn't
          | make sense to me: I'm not writing the software you're running,
          | and we aren't running some state-of-the-art TOP500 cluster. We're
         | just getting a bunch of machines networked together and running
         | some code.
        
           | throwawaaarrgh wrote:
            | I always found that funny too. A business that needs a
            | powerful computing solution can come up with some amazingly
           | robust stuff, whereas science/research just buys a big
           | mainframe and hopes it works.
        
           | justin66 wrote:
            | > The one drawback I find is that a lot of HPC jobs want you
            | > to have a master's degree.
           | 
           | Is it possible that pretty much any specialization, outside
           | of the most common ones, engages in a lot of gatekeeping? I
           | remember how difficult it appeared to be after I graduated to
           | break into embedded systems (I never did). I persisted until
           | I realized it doesn't even pay very well, comparatively.
        
         | bayindirh wrote:
         | HPC admin here, generally serving "long tail of science"
         | researchers.
         | 
         | In today's x86_64 hardware, there's no "supercomputer memory
         | subsystem". It's just a glorified NUMA system, and the biggest
         | problem is putting the memory close to your core, i.e. keeping
         | data local in your NUMA node to reduce latencies.
         | 
          | Your resource mapping is handled by your scheduler. It knows
          | your hardware, so it creates a cgroup that satisfies your needs
          | as optimally as possible, stuffs your application into that
          | cgroup, and runs it.
         | 
          | The current king of high-performance interconnects is
          | InfiniBand, and it accelerates MPI at the fabric level. You can
          | send messages, broadcast, and reduce results like there's no
          | tomorrow, because when a message arrives at you, it's already
          | reduced; when you broadcast, you send only a single message,
          | which is broadcast at the fabric layer. Multi-context IB cards
          | have many queues, and more than one MPI job can run on the same
          | node/card with queue/context isolation.
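          | 
          | At the MPI level that is just one call; whether the reduction
          | happens on the hosts or in the switches (e.g. Mellanox/NVIDIA
          | SHARP) is invisible to the code. A sketch:
          | 
          |     #include <mpi.h>
          |     #include <cstdio>
          |     
          |     int main(int argc, char **argv) {
          |         MPI_Init(&argc, &argv);
          |         int rank;
          |         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          |         double local = rank + 1.0;   // stand-in partial result
          |         double global = 0.0;
          |         // may be computed in the fabric on capable IB networks
          |         MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
          |                       MPI_COMM_WORLD);
          |         if (rank == 0) std::printf("sum = %g\n", global);
          |         MPI_Finalize();
          |     }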
         | 
          | If you're using a framework for GPU work, the architecture &
          | optimization are handled at that level automatically (the
          | framework developers generally do the hard work). NVIDIA's
          | drivers are pure black magic and handle some parts of the
          | optimization, too. Inter-GPU connection is handled by a
          | physical fabric, managed by the drivers and its own daemon.
         | 
          | If you're CPU bound, your libraries are generally hand-tuned by
          | their vendors (Intel MKL, BLAS, Eigen, etc.). I personally used
         | Eigen, and it has processor specific hints and optimizations
         | baked in.
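          | 
          | With Eigen, for instance, the vectorized kernels come for free
          | (a trivial sketch of mine; the sizes are made up):
          | 
          |     #include <Eigen/Dense>
          |     #include <iostream>
          |     
          |     int main() {
          |         const int n = 512;
          |         Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
          |         Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
          |         Eigen::MatrixXd C(n, n);
          |         C.noalias() = A * B;  // dispatches to SIMD kernels
          |         std::cout << C(0, 0) << "\n";
          |     }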
         | 
          | The things you have to worry about are compiling your code for
          | the correct architecture and making sure that the hardware you
          | run on can satisfy your demands (i.e., do not make too many
          | random memory accesses, keep the prefetcher and branch
          | predictor happy if you're trying to go "all-out fast" on the
          | node, do not abuse disk access, etc.).
         | 
         | On the number crunching side, keeping things independent (so
         | they can be instruction level parallelized/vectorized), making
         | sure you're not doing unnecessary calculations, and not abusing
          | MPI (reducing inter-node talk to only necessary chatter) are
          | key.
         | 
          | It's way easier said than done, but when you get the hang of
          | it, thinking about these things becomes second nature, if this
          | kind of thing is your cup of tea.
        
         | efxhoy wrote:
         | I wrote scientific simulation software in academia for a few
         | years. None of us writing the software had any formal software
         | engineering training above what we'd pieced together ourselves
         | from statistics courses. We wrote our simulations to run
         | independently on many nodes and aggregated the results at the
         | end, no use of any HPC features other than "run these 100
         | scripts on a node each please, thank you slurm". That approach
         | worked very well for our problem.
         | 
         | I'd bet a significant part of compute work on HPC clusters in
         | academia works the same way. The only thing we paid attention
         | to was number of cores on the node and preferring node local
         | storage over the shared volumes for caching. No MPI.
         | 
         | There are of course problems requiring "genuine" HPC clusters
         | but ours could have run on any pile of workers with a job
         | queue.
        
       | teleforce wrote:
        | Is there something wrong with the GitHub files? I cannot
        | render any of the textbook PDF files.
       | 
       | https://github.com/VictorEijkhout/TheArtofHPC_pdfs/blob/main...
        
         | npalli wrote:
          | I think the files are too large to render in the GitHub browser
         | and they give an error. You can pick the 'download raw' option
         | to download locally and read the file. Worked for me.
        
           | TimMeade wrote:
           | I just "git clone
           | https://github.com/VictorEijkhout/TheArtofHPC_pdfs.git" on my
           | local drive. Had it all in under a minute.
        
       | rramadass wrote:
       | Just amazed at how the author has created (and shared for free)
       | such a comprehensive set of books including teaching C++ and Unix
       | tools! There is something to learn for all Programmers (HPC
       | specific or not) here.
       | 
       | Related: Jorg Arndt's "Matters Computational" book and FXT
       | library - https://www.jjj.de/fxt/
        
       | rlupi wrote:
       | I am interested in the more hardware management side of HPC (how
       | problems are detected, diagnosed, mapped into actions such as
       | reboot/reinstall/repairs, how these are scheduled and how that is
       | optimized to provide the best level of service, how this is done
       | if there are multiple objectives to optimize at once e.g. node
       | availability vs overall throughput, how different topologies
       | affect the above, how other constraints affect the above, and in
       | general a system dynamics approach to these problems).
       | 
       | I haven't found many good sources for this kind of information.
       | If you are aware of any, please cite them in a comment below.
        
         | synergy20 wrote:
          | Check out the OpenBMC project and the DTMF association.
        
           | timoteostewart wrote:
           | DMTF (not DTMF)
           | 
           | https://www.dmtf.org/
        
         | CoastalCoder wrote:
         | This seemed like a big topic when I was interviewing with Meta
         | and nVidia some months ago.
         | 
         | Meta had a few good YouTube videos about the problems of
         | dealing with this many GPUs at scale.
        
           | keefle wrote:
           | Could you link me the YouTube videos/articles in question? It
           | happens to be my research area and I'm interested in knowing
            | how big companies such as Meta deal with multi-GPU systems.
        
             | CoastalCoder wrote:
             | I don't have them bookmarked anymore, but they may have
             | been from this playlist: [0]
             | 
              | [0] https://www.youtube.com/playlist?list=PLBnLThDtSXOw_kePWy3CS...
        
         | nyrikki wrote:
          | Assuming you are moving past just the typical nonblocking
          | folded-Clos networks or Little's Law, and want a more
          | engineering focus, queuing theory is one discipline you want
          | to dig into.
         | 
          | Queuing theory seems trivial and easy as it is usually
          | introduced, but it has many open questions.
         | 
          | As an example, performance metrics for a system with random
          | arrival times, general independent service times, and k servers
          | (M/G/k) are still an open question.
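          | 
          | (For contrast, the single-server case M/G/1 does have a clean
          | closed form, the Pollaczek-Khinchine formula for the mean wait:
          | 
          |     W_q = lambda * E[S^2] / (2 * (1 - rho)),  rho = lambda * E[S]
          | 
          | No such closed form is known for M/G/k.)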
         | 
         | https://www.sciencedirect.com/science/article/pii/S089571770...
         | 
         | There are actually lots of open problems in queuing theory that
         | one wouldn't expect.
        
         | cavisne wrote:
         | This paper from Microsoft [1] is the coolest thing I've seen in
          | this space. Basically workload-level (deep learning, in this
          | case) optimization that allows jobs to be resized and preempted.
         | 
         | [1] https://arxiv.org/pdf/2202.07848.pdf
        
       | justin66 wrote:
       | There is some really good content here for any programmer.
       | 
       | And with volume 3, such a contrast: the author teaches C++17
        | and... Fortran 2008.
        
       | toddm wrote:
       | Kudos to Victor for assembling such a wonderful resource!
       | 
       | While I am not acquainted with him personally, I did my doctoral
        | work at UT Austin in the 1990s and had the privilege of working
       | with the resources (Cray Y-MP, IBM SP/2 Winterhawk, and mostly on
       | Lonestar, a host name which pointed to a Cray T3E at the time)
       | maintained by TACC (one of my Ph.D. committee members is still on
       | staff!) to complete my work (TACC was called HPCC and/or CHPC if
       | I recall the acronyms correctly).
       | 
       | Back then, it was incumbent on the programmer to parallelize
       | their code (in my case, using MPI on the Cray T3E in the UNICOS
       | environment) and have some understanding of the hardware, if only
       | because the field was still emergent and problems were solved by
       | reading the gray Cray ring-binder and whichever copies of Gropp
       | et al. we had on-hand. That and having a very knowledgeable
       | contact as mentioned above :) of course helped...
        
         | victotronics wrote:
         | > Lonestar, a host name which pointed to a Cray T3E
         | 
         | Lonestar5 was a Cray again. Currently Lonestar6 is an oil-
         | immersion AMD Milan cluster with A100 GPUs. The times, they
         | never stand still.
        
       | LASR wrote:
       | The hardware / datacenter side of this is equally fascinating.
       | 
       | I used to work in AWS, but on the software / services side of
       | things. But now and then, we would crash some talks from the
       | datacenter folks.
       | 
        | One key revelation for me was that increasing compute power in
        | DCs is more a thermodynamics problem than a computing one.
       | The nodes have become so dense that shipping power in and
       | shipping heat out, with all kinds of redundancies is an extremely
       | hard problem. And it's not like you can perform a software update
       | if you've discovered some inefficiencies.
       | 
       | This was ~10 years ago, so probably some things have changed.
       | 
        | What blows me away is that Amazon, which started out as an
        | internet bookstore, is at the cutting edge of solving
        | thermodynamics problems.
        
         | projectileboy wrote:
         | Seymour Cray used to say this all the way back in the 1970s:
         | his biggest problems were associated with dissipating heat. For
         | the Cray 2 he took an even more dramatic approach: "The
         | Cray-2's unusual cooling scheme immersed dense stacks of
         | circuit boards in a special non-conductive liquid called
          | Fluorinert(tm)"
          | (https://www.computerhistory.org/revolution/supercomputers/10...)
        
         | cogman10 wrote:
         | It always made me wonder why liquid cooling wasn't more of a
         | thing for datacenters.
         | 
          | Water has a massive thermal capacity and can be cooled to
          | optimal temperatures quickly and in bulk. You'd probably still
          | need fans and AC to dissipate the heat of non-liquid-cooled
         | parts, but for the big energy items like CPUs and GPUs/compute
         | engines, you could ship out huge amounts of heat fairly quickly
         | and directly.
         | 
         | I guess the complexity and risk of a leak would be a problem,
          | but for Amazon-sized data centers that doesn't seem like a
         | major concern.
        
       | jebarker wrote:
       | I'm interested in what people think of the approach to teaching
       | C++ used here. Any particular drawbacks?
       | 
       | I'm a very experienced Python programmer with some C, C++ and
       | CUDA doing application level research in HPC environments
        | (ML/DL). I'd really like to level up my C++ skills, and looking
        | through book 3 it seems aimed at exactly the right level for me -
       | doesn't move too slowly and teaches best practices (per the
       | author) rather than trying to be comprehensive.
        
       ___________________________________________________________________
       (page generated 2023-12-30 23:00 UTC)