[HN Gopher] ROOT: analyzing petabytes of data scientifically
       ___________________________________________________________________
        
       ROOT: analyzing petabytes of data scientifically
        
       Author : z3phyr
       Score  : 306 points
       Date   : 2024-06-01 07:25 UTC (15 hours ago)
        
 (HTM) web link (root.cern)
 (TXT) w3m dump (root.cern)
        
       | mjtlittle wrote:
       | Didn't know there was a .cern TLD
        
         | SiempreViernes wrote:
         | Yeah... according to wikipedia they've had it since 2014, but
         | even now a lot of their pages are on .ch
        
         | sneak wrote:
         | Yes, the root zone is terribly polluted now. Unfortunately
         | there's no way to unring that bell, people depend on a lot of
         | these new domains now.
         | 
         | It was a huge mistake, born out of greed and recklessness.
         | 
         | https://en.wikipedia.org/wiki/ICANN#Notable_events
        
           | jesprenj wrote:
           | I guess ICANN needs to get money somehow.
        
             | lambdaxyzw wrote:
             | Why can't it just get funding from the government?
        
               | rnhmjoj wrote:
               | Aren't they already getting an outrageous amount of money
               | for essentially supervising a txt file?
        
           | Biganon wrote:
           | I fail to see the problem with those new TLDs.
        
             | oefrha wrote:
             | Certain gTLDs have been borderline scams. The most infamous
             | one might be .sucks, an extortion scheme charging an annual
             | protection fee of $$$, complete with the pre-registration
             | process when you could buy <yourtrademark>.sucks for $$$$
             | before it's snatched up by your enemies.
             | 
             | They also screwed up some old URL/email parsers/sniffers
             | hardcoding TLDs. Largely the fault of bad assumptions to
             | begin with.
             | 
             | Other than the above, I don't see much of a problem.
             | Whatever problems people like to point out about gTLDs
             | already existed with numerous sketchy ccTLDs, like .io.
             | Guess what, the latest hotness .ai is also one of those.
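
The parser breakage described above can be illustrated with a minimal
sketch. It assumes a hypothetical validator that hardcodes a short TLD
list; both patterns and the helper name `accepts` are invented for
illustration and are not taken from any real library.

```python
import re

# Hypothetical "legacy" pattern: only accepts a hardcoded handful of TLDs.
LEGACY_HOST = re.compile(r"^(?:[a-z0-9-]+\.)+(?:com|org|net|edu|gov|ch)$")

# More permissive pattern: any alphabetic TLD of 2 to 63 characters.
# Still a simplification: it ignores internationalized (punycode) TLDs.
GENERIC_HOST = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,63}$")


def accepts(pattern, host):
    """Return True if the pattern accepts the lower-cased host name."""
    return pattern.match(host.lower()) is not None


# Compare how both patterns treat an old-style and two new-style names.
for host in ("home.cern.ch", "root.cern", "example.sucks"):
    print(host, accepts(LEGACY_HOST, host), accepts(GENERIC_HOST, host))
```

The legacy pattern rejects root.cern only because "cern" is missing
from its list, which is the bad assumption the comment refers to.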
        
           | 9dev wrote:
           | I still wonder why we need that arbitrary restriction anyway?
        
             | 8organicbits wrote:
             | If we allowed all possible TLDs, then we'd need a default
             | organization to administer them. The current setup requires
             | an organization to control each TLD, which allows us to
             | grant control to countries or large organizations. The web
             | should be decentralized, which means TLD ownership should
             | be spread across multiple organizations. More TLDs with
             | more distinct owners is a better situation than one
             | default.
        
         | ragebol wrote:
         | Handy if they host conferences, for people worried about too
         | many TLDs perhaps.
         | 
         | https://con.cern is not yet used, so...
        
       | SiempreViernes wrote:
       | Ah, root... every day it happens I am thankful I don't have to
       | use a version older than 6.
        
         | YakBizzarro wrote:
         | Root was one of the reasons I decided not to study particle
         | physics
        
           | oefrha wrote:
           | You don't have to. I worked on data analysis (mostly cleaning
           | and correction) for CMS (one of the two main experiments at
           | LHC) for a while and didn't have to touch it. Disclaimer: I
           | was a high energy theorist, but did the aforementioned
           | experimental work early in my PhD for funding.
        
             | aoanla wrote:
             | I mean, most of the researchers I know at least use PyRoot
             | (or the Julia equivalent) as much as possible, rather than
             | actually interacting with Root itself. Which probably saves
             | their sanity...
        
           | brnt wrote:
           | I did my master and PhD around the time numpy/scipy got
           | competitive for a lot of analysis (for me a complete
           | replacement) but the Python bindings for root weren't there or
           | in beta. Root-the-data+format remained however the main
           | output of Geant4, so I set up a tiny Python wrapper around a
           | root script that would dump any .root contents and load it up
           | in a numpy file.
           | 
           | My plots looked a lot nicer ;)
        
           | tempay wrote:
           | These days you can mostly avoid it. The Python HEP ecosystem
           | is now pretty advanced so you can even read ROOT files
           | without needing root itself. See:
           | 
           | https://scikit-hep.org/
        
         | twixfel wrote:
         | I'm still waiting for the interface-breaking, let's-finally-
         | make-root-good, version 7, which I think I first heard about in
         | 2016 or so... true vapourware.
        
           | amadio wrote:
           | ROOT 7 is coming. Things are being discussed this year about
           | it, the target is for HL-LHC. See link below.
           | https://indico.cern.ch/event/1369601/contributions/5867782/a...
        
       | nomilk wrote:
       | Source code: https://github.com/root-project
        
       | dailykoder wrote:
       | >Debugging CERN ROOT scripts and ROOT-based programs in Eclipse
       | IDE (30 Oct 2021)
       | 
       | Oh gosh. The nightmares. Which obviously shows that you can
       | build extraordinary stuff in horrible environments.
        
         | BSDobelix wrote:
         | I don't understand, is it about Eclipse?
        
           | amadio wrote:
           | It was a nice guest post on the website about eclipse, but
           | most people just use gdb. It is now possible to step through
           | ROOT macros with gdb by exporting CLING_DEBUG=1. See
           | https://indico.jlab.org/event/459/contributions/11563/
        
       | leohonexus wrote:
       | Very cool to see large-scale software projects used for
       | scientific discoveries.
       | 
       | Another example: Gravitational waves were found with GStreamer at
       | LIGO: https://lscsoft.docs.ligo.org/gstlal/
        
         | hkwerf wrote:
         | Here it's more the other way around. CERN needs a data analysis
         | framework, so CERN develops, maintains and publishes it for
         | other users.
         | 
         | That being said, I don't know whether it's actually a good idea
         | for someone external to actually use it. My experience may be a
         | little outdated, but it's quite clunky and dated. The big
         | advantage of using it for CERN or particle physics stuff is
         | that it's basically a standard, so it's easy to collaborate
         | internally.
        
         | aulin wrote:
         | Well these are two very different examples. One, ROOT, is a
         | powerful data analysis framework that, as powerful as it is,
         | failed to be general and easy enough to use to ever get out of
         | the HEP world.
         | 
         | The other one, gstreamer, is a beautifully designed platform
         | with an architecture so nice it can be easily abstracted and
         | reused in completely different scenarios, even ones that
         | probably never occurred to the authors.
        
           | im3w1l wrote:
           | Gstreamer must have been a winamp clone right?
        
         | jakjak123 wrote:
         | > Gravitational waves were found with GStreamer at LIGO:
         | https://lscsoft.docs.ligo.org/gstlal/
         | 
         | Say WHAT now?!
        
           | semi-extrinsic wrote:
           | They even have a "gstlal-ugly" package!
        
       | scheme271 wrote:
       | ROOT, providing the C++ repl that no one asked for.
        
         | fooker wrote:
         | The researchers behind this contributed it into mainline clang
         | as clang-repl
        
         | pjmlp wrote:
         | Before ROOT, there was Energize C++ and Visual Age for C++ v
         | 4.0, however too expensive and resource demanding for early
         | 1990's workstations.
         | 
         | There are also a couple of C++ live environments in the game
         | industry.
        
         | Jeaye wrote:
         | I definitely asked for it. I'm using Cling for JIT compiling my
         | native Clojure dialect: https://github.com/jank-lang/jank
         | 
         | Without Cling, this sort of thing wouldn't be feasible in C++.
         | Not in the way which Clojure dialects work. The runtime is a
         | library and the generated code is just using that library.
        
       | SilverSlash wrote:
       | Let me guess, it only runs on an IBN 5100?
        
         | 8organicbits wrote:
         | No. https://root.cern/install/
        
         | div72 wrote:
         | Only for the optional "read time travel and world domination
         | plans" module.
        
       | codecalec wrote:
       | Root is definitely the backbone of a ton of work done in
       | experimental particle physics but it is also the nightmare of new
       | graduate students. It's effectively engrained into particle
       | physics and I don't expect that to change anytime soon
        
         | elashri wrote:
         | It is not that bad now with pyroot (ROOT python interface) and
         | uproot being an option that is easy to learn for new graduate
         | students. The problem is about legacy code which they usually
         | have to maintain as part of experiment service
        
           | ephimetheus wrote:
           | I can't count the number of times where a beginner did
           | some stuff in pyroot that was horrifically slow and just
           | implementing the exact same algorithm in C++ was two orders
           | of magnitude faster.
           | 
           | If you don't use RDataFrame, or it's just histogram plotting,
           | be very careful with pyroot.
        
             | SiempreViernes wrote:
             | You should be using RDataFrame though, or awkward + dask.
        
               | ephimetheus wrote:
               | +1 for RDataFrame for what it can do. Just be prepared to
               | bail to C++ and for loops when you exceed what it can do
               | without major headaches.
        
       | lnauta wrote:
       | Have they released v7 yet? When I started my PhD they
       | announced it, and I looked forward to the consistency
       | between certain parts of the software they would introduce (some
       | mismatches really don't make sense and are clearly organic) and
       | now I'm already 2 years past my graduation.
        
         | npalli wrote:
         | v6.32
        
       | bobek wrote:
       | Aaah, this brings memories of late night debugging sessions of
       | code written by brilliant physicists without computer science
       | background ;)
        
         | andrepd wrote:
         | Ahh I can imagine the 2000 lines-long main() :)
        
       | elashri wrote:
       | There are not many reasons why new analyses should default to
       | using ROOT instead of more user friendly and sane options like
       | uproot [1]. Maybe some people have some legacy workflow or their
       | experiments have many custom patches on top of ROOT (common
       | practice) for other things but for physics analysis you might be
       | torturing yourself.
       | 
       | Also I really like their 404 page [2]. And no it is not about
       | room 404 :)
       | 
       | [1] https://github.com/scikit-hep/uproot5
       | 
       | [2] https://root.cern/404/
        
         | moelf wrote:
         | One common criticism of uproot is that it's not flexible when
         | per-row computation gets complicated because for-loops in
         | Python are too slow. For that one can either use Numba (when it
         | works), or, here's the shameless plug, use Julia:
         | https://github.com/JuliaHEP/UnROOT.jl
         | 
         | Past HN discussion on Julia for particle physics:
         | https://news.ycombinator.com/item?id=38512793
        
           | elashri wrote:
           | That's true and Julia might be a solution but I don't see the
           | adoption happening anytime soon.
           | 
           | But this particular problem (per-row computation) now has
           | several options in the hep-python ecosystem. One
           | approach is to leverage array programming with NumPy to
           | vectorize operations as much as possible. By operating on
           | entire arrays rather than looping over individual elements,
           | significant speedups can often be achieved.
           | 
           | Another possibility is to use a library like Awkward Array,
           | which is designed to work with nested, variable-sized data
           | structures. Awkward Array integrates well with uproot and
           | provides a powerful and flexible framework for performing
           | fast computations on, e.g., jagged arrays.
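
Both ideas in the comment above can be sketched with plain NumPy. This
is a minimal illustration on invented toy data, not Awkward Array's
actual API: the jagged structure is modelled by hand as a flat content
array plus an `offsets` array, and all function names are hypothetical.

```python
import numpy as np

# Toy data, invented for illustration: momentum components of 6 particles
# spread over 3 events containing 2, 1 and 3 particles respectively.
px = np.array([3.0, 0.0, 6.0, 1.0, 5.0, 8.0])
py = np.array([4.0, 2.0, 8.0, 0.0, 12.0, 6.0])
# Event i owns the slice content[offsets[i]:offsets[i + 1]].
offsets = np.array([0, 2, 3, 6])


def pt_loop(px, py):
    """Per-element Python loop: simple, but slow for large arrays."""
    out = np.empty(len(px))
    for i in range(len(px)):
        out[i] = (px[i] ** 2 + py[i] ** 2) ** 0.5
    return out


def pt_vectorized(px, py):
    """Same computation expressed on whole arrays at once."""
    return np.hypot(px, py)


def per_event_sum(content, offsets):
    """Sum a flat content array per event.

    Assumes no event is empty: np.add.reduceat does not return 0
    for an empty slice.
    """
    return np.add.reduceat(content, offsets[:-1])


pt = pt_vectorized(px, py)
print(pt)
print(per_event_sum(pt, offsets))
```

The loop and the vectorized version give the same numbers; the
difference is that the vectorized one runs its inner loop in compiled
code rather than in the Python interpreter.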
        
             | moelf wrote:
             | Uproot already returns you Awkward array, so both things
             | you mentioned are different ways of saying the same thing.
             | The irreducible complexity of data analysis is there no
             | matter how you do it, and "one-vector-at-a-time" sometimes
             | feels like shoehorning (other terms people come up with
             | include vector-style mental gymnastics).
             | 
             | For the record, vector-style programming is great when it
             | works, I mean Julia even has a dedicated syntax for
             | broadcasting. I'm saying when the irreducible complexity
             | arrives, you don't want to NOT be able to just write a for-
             | loop
             | 
             | Just a recent example, a double-for loop looks like this in
             | Awkward array:
             | https://github.com/Moelf/UnROOT_RDataFrame_MiniBenchmark/blo...
             | -- the result looks "neat" as in a piece of art.
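
For contrast, here is what "just writing a for-loop" looks like for a
per-event pair search. This is a sketch on invented toy data, not the
code from the linked benchmark; `best_pair` and `TARGET` are
hypothetical names chosen for illustration.

```python
# Toy jagged data, invented for illustration: one list of values per event.
events = [[40.0, 52.0, 10.0], [91.0], [], [30.0, 60.0, 61.5, 5.0]]
TARGET = 91.2  # hypothetical target value for the pair sum


def best_pair(values, target):
    """Return the index pair (i, j), i < j, whose sum is closest to target.

    Returns None if the event has fewer than two entries.
    """
    best, best_dist = None, float("inf")
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            dist = abs(values[i] + values[j] - target)
            if dist < best_dist:
                best, best_dist = (i, j), dist
    return best


# One result per event; events with fewer than two entries yield None.
result = [best_pair(ev, TARGET) for ev in events]
print(result)
```

The logic is easy to read and handles short or empty events without
special tricks, but in pure Python it pays interpreter overhead on
every iteration, which is the trade-off under discussion here.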
        
           | szvsw wrote:
           | A great alternative to numba for accelerated Python is
           | Taichi. Trivial to convert a regular python program into a
           | taichi kernel, and then it can target CUDA (and a variety of
           | other options) as the backend. No need to worry about
           | block/grid/thread allocation etc. At the same time, it's
           | super deep with great support for data classes, custom memory
           | layouts for complexly nested classes, etc etc, comes with
           | autograd, etc. I'm a huge fan - makes writing code that runs
           | on the GPU _and_ integrates with your python libraries an
           | absolute breeze. Super powerful. By far the best tool in the
           | accelerated python toolbox IMO.
        
             | OutOfHere wrote:
             | Negative, as Taichi doesn't even support Python 3.12, and
             | it's unclear if it ever will. Why would I limit myself to
             | an old version of Python?
        
               | almostgotcaught wrote:
               | Hn people are so haughty
               | 
               | https://github.com/taichi-dev/taichi/pull/8522
        
               | OutOfHere wrote:
               | The haughtiness is not for nothing. Since Dec 2023, they
               | made a lame excuse that Pytorch didn't support 3.12:
               | https://github.com/taichi-dev/taichi/issues/8365#issuecommen...
               | 
               | Later, even when Pytorch added support for 3.12, nothing
               | changed (so far) in Taichi.
        
               | almostgotcaught wrote:
               | >they made a lame excuse that Pytorch didn't support 3.12
               | 
               | how is this a lame excuse
               | 
               | >but it fails on a bunch of PyTorch-related tests. We
               | then figured out that PyTorch does not have Python 3.12
               | support
               | 
               | they have a dep that was blocking them from upgrading.
               | you would have them do what? push pytorch to upgrade?
               | 
               | >Later, even when Pytorch added support for 3.12, nothing
               | changed (so far) in Taichi.
               | 
               | my friend that "Later" is feb/march of this year ie 2-3
               | months ago. exactly how fast would you like for this open
               | source project to service your needs? not to mention
               | _there is a PR up for the bump_.
               | 
               | I stand by my original comment.
        
       | captainmuon wrote:
       | A blast from the past, I used to work in particle physics and
       | used ROOT a lot. I had a love/hate relationship with it. On the
       | one hand, it had a lot of technical debt and idiosyncrasies. But
       | on the other hand, there are a bunch of things that are easier in
       | ROOT than in more "modern" options like matplotlib. For example,
       | anything that has to do with histograms. Or highly structured
       | data (where your 'columns' contain objects with fields). Or just
       | plotting functions (without having to allocate arrays for the x
       | and y values). I also like the very straightforward object-
       | oriented API. It feels like old-school C++ or Java, as opposed to
       | pandas/matplotlib which has a lot of method chaining, abuse of []
       | syntax and other magic. It is not elegant, and quite verbose, but
       | that is probably a good thing when doing a scientific analysis.
       | 
       | I left about 5 years ago, and ROOT was in a process of change.
       | They already ripped out the old CINT interpreter and moved to a
       | clang-based codebase, and now you can run your analyses in
       | Jupyter as far as I know (in C++ or Python). I heard the code
       | quality has improved a lot, too.
        
         | ilrwbwrkhv wrote:
         | I wonder if Haskell would also be a good fit for writing
         | something like this.
        
           | shrimp_emoji wrote:
           | No.
        
             | hackable_sand wrote:
             | Could it though?
        
             | mynameisvlad wrote:
             | This is a technical community. You really have to do better
             | than a one word dismissal without any reasoning.
             | 
             | In other words, why do you think it's not a good fit?
        
               | sfpotter wrote:
               | I think the response gets right to the point!
               | 
               | Using something like Haskell for ROOT is ridiculous for a
               | lot of obvious reasons. A simple and dismissive "no"
               | invites the cautious reader to discover them on their own
               | rather than waste time engaging in a protracted debate. Maybe
               | it's better to reject the idea out of hand and spend our
               | time elsewhere.
        
           | goy wrote:
           | I think having Haskell bindings to it will be quite valuable.
           | For implementation of core structures, though, it's better
           | to stick to C++ to max out on performance and have a finer
           | control on resource usage. Haskell isn't particularly good at
           | that.
           | 
           | EDIT: there's one at
           | https://hackage.haskell.org/package/HROOT
        
           | tikhonj wrote:
           | Haskell would be great for _designing the interface_ of a
           | library like this, but not for _implementing_ it. It would
           | definitely not look like "old-school C++ or Java" but, well,
           | that's the whole point :P
           | 
           | I haven't used ROOT so I don't know how well it would work to
           | write bindings for it in Haskell; it can be hard to provide a
           | good interface to an implementation that was designed for a
           | totally different style of use. Possible, just difficult.
        
         | BiteCode_dev wrote:
         | Honestly now with chatgpt, matplotlib's terrible API is less of a
         | problem.
        
           | OutOfHere wrote:
           | That's true, but still, there are things you just can't do in
           | matplotlib that you can do better in other GPT-aware packages
           | like plotly.
        
           | typon wrote:
           | This is a great example of why the age of truly terrible
           | software is going to be ushered in as LLMs get better.
           | 
           | When the cost of complexity of interacting with an API is
           | paid by the LLM, optimizing this particular part of software
           | design (also one of the hardest to get right) will be less
           | fashionable.
        
         | ephimetheus wrote:
         | We all have a love/hate relationship with it. It's a bit like
         | Stockholm syndrome.
        
         | cozzyd wrote:
         | Because matplotlib is not so histogram focused (I guess because
         | the kids these days have plenty of RAM), people always show
         | these abominable scatter plots that have so many points on top
         | of each other that they're useless. Yuck.
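
The memory argument behind histograms can be made concrete with a
minimal sketch. The class below is invented for illustration and is
not ROOT's or matplotlib's API; it only shows that a fixed-bin
histogram needs storage proportional to the number of bins, not to
the number of entries filled.

```python
class Hist1D:
    """Minimal fixed-width 1D histogram with under/overflow counters."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi
        self.counts = [0] * nbins
        self.underflow = 0
        self.overflow = 0

    def fill(self, x):
        """Count one value; bins are half-open intervals [low, high)."""
        if x < self.lo:
            self.underflow += 1
        elif x >= self.hi:
            self.overflow += 1
        else:
            idx = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
            # Guard against floating-point rounding at the upper edge.
            self.counts[min(idx, self.nbins - 1)] += 1


# Values can be streamed one at a time; nothing is kept except the counts.
h = Hist1D(4, 0.0, 4.0)
for x in (0.5, 1.5, 1.7, 3.9, -1.0, 4.0, 10.0):
    h.fill(x)
print(h.counts, h.underflow, h.overflow)
```

A scatter plot, by contrast, has to keep (and draw) every point.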
        
         | casualscience wrote:
         | The best thing about root was how it handled data loading.
         | TTrees, with their column-based slicing on disk, are such a
         | good idea. Ever since I graduated and moved into industry, I've
         | been looking for something that works the same way.
        
           | moelf wrote:
           | Apache arrow and parquet all work this way. Even HDF5 in
           | column mode isn't completely bad.
           | 
           | TTree is succeeded by RNTuple, which is basically CERN's take
           | on Apache Arrow, they're incredibly similar
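
The column-based idea can be sketched with the standard library alone.
The file layout below is invented for illustration (it is not the
TTree, RNTuple, Arrow or Parquet format): each column is stored as one
contiguous block, and a small index at the end of the file says where
each block starts, so a reader can fetch one column without touching
the others.

```python
import json
import os
import struct
import tempfile
from array import array


def write_columns(path, columns):
    """Write each column as one contiguous block of float64 values,
    followed by a JSON index and an 8-byte footer holding its size."""
    index = {}
    with open(path, "wb") as f:
        for name, values in columns.items():
            index[name] = (f.tell(), len(values))
            f.write(array("d", values).tobytes())
        footer = json.dumps(index).encode()
        f.write(footer)
        f.write(struct.pack("<Q", len(footer)))


def read_column(path, name):
    """Read one column, touching only the index and that column's bytes."""
    with open(path, "rb") as f:
        f.seek(-8, os.SEEK_END)
        (footer_len,) = struct.unpack("<Q", f.read(8))
        f.seek(-8 - footer_len, os.SEEK_END)
        index = json.loads(f.read(footer_len))
        offset, count = index[name]
        f.seek(offset)
        col = array("d")
        col.frombytes(f.read(count * col.itemsize))
        return list(col)


# Write two columns, then read back only one of them.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.col")
    write_columns(path, {"pt": [5.0, 2.0, 10.0], "eta": [0.1, -1.2, 2.3]})
    eta = read_column(path, "eta")
print(eta)
```

Real formats add compression, chunking and typed schemas on top, but
the seek-to-one-column access pattern is the part being praised here.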
        
             | amelius wrote:
             | Is this a kind of lazy loading?
        
       | nousernamed wrote:
       | the amount of times I googled 'taxis' with predictable results
        
       | koolala wrote:
       | can they release a quantized 1bit version? i don't think anyone's
       | pc can science this
        
       | qa-wolf-bates wrote:
       | I think that this article is very interesting
        
       | wolfspider wrote:
       | The part of Root I use is Cling the C++ interpreter along with
       | Xeus in a Jupyter notebook. I decided one night to test the
       | fastest n-body from benchmarkgames comparing Xeus and Python 3.
       | With Xeus I get 15.58 seconds and running the fastest Python code
       | with Python3 kernel, both on binder using the same instance, I
       | get 5 minutes. Output is exactly the same for both runs. Even
       | with an overhead tax for running dynamic C++ at ~300% for this
       | program Cling is very quick. SIMD and vectorization were not used
       | just purely the code from benchmarkgames. I use Cling primarily
       | as a quick stand-in JIT for languages that compile to C++.
        
         | Jeaye wrote:
         | I'm using Cling for JIT compiling my native Clojure dialect:
         | https://github.com/jank-lang/jank
         | 
         | Trying to bring C++ into the Clojure world and
         | Clojure/interactive programming into the C++ world.
        
       | usgroup wrote:
       | I struggle to see why one may want to use an interactive analysis
       | toolkit via C++. Could anyone who has used ROOT enlighten me on
       | this? I understand why you may write it in C++, but why would you
       | want to invoke it with C++ for this sort of work?
        
         | ephimetheus wrote:
         | All of our other code is C++. The data reconstruction framework
         | writing ROOT files, the analysis frameworks doing stat
         | analysis. The event data model is implemented in C++.
         | 
         | It has its rough edges, but you do get a lot of good synergy
         | out of this setup for sure.
        
         | konstantinua00 wrote:
         | if you can work in a fast language, why not?
         | 
         | comments here have already mentioned a couple of horror stories of
         | people accidentally/by inexperience doing a lot of work above
         | the framework - if you can save that by not being slow, why
         | not?
        
       | rubicks wrote:
       | What I remember about ROOT Cint is that it was an absolute
       | nightmare to work with, mostly because it couldn't do STL
       | containers very well. It was a weird time to do language interop
       | for physicists.
        
       | sbinet wrote:
       | IMHO, ROOT[3-5] is too many things with a lot of poorly designed
       | API and most importantly a lack of separation between ROOT-the-
       | library and ROOT-the-program (lots of globals and assumptions
       | that ROOT-the-program is how people should use it). ROOT 6
       | started to correct some of these things, but it takes time (and
       | IMHO, they are buying too much into llvm and clang, increasing
       | even more the build times and worsening the hackability of ROOT
       | as a project)
       | 
       | Also, for the longest time, the I/O format wasn't very well
       | documented, with only _1_ implementation.
       | 
       | Now, thanks to groot [1], uproot (that was developed building on
       | the work from groot) and others (freehep, openscientist, ...),
       | it's possible to read/write ROOT data w/o bringing the whole TWorld.
       | Interoperability. For data, I'd say it's very much paramount in
       | my book to have some hope to be able to read back that unique
       | data in 20, 30, ... years down the line.
       | 
       | [1] https://go-hep.org/x/hep/groot (I am the main dev behind go-
       | hep)
        
         | ephimetheus wrote:
         | uproot to this day doesn't properly implement reading
         | TEfficiency, I believe, which is a bummer, to be honest.
        
           | sbinet wrote:
           | that's odd. TEfficiency is a relatively simple thing to
           | read/write :
           | 
             | - https://github.com/go-hep/hep/blob/main/groot/rhist/efficien...
        
             | ephimetheus wrote:
             | Yeah I think it has to do with the memberwise splitting.
             | https://github.com/scikit-hep/uproot5/issues/38
             | 
             | I understand this has not been a priority so far.
             | 
             | It kinda works if you open a magic file with a specific on-
             | disk representation which bypasses this, but that's not a
             | solution at all.
        
       ___________________________________________________________________
       (page generated 2024-06-01 23:00 UTC)