[HN Gopher] ROOT: analyzing petabytes of data scientifically
___________________________________________________________________
ROOT: analyzing petabytes of data scientifically
Author : z3phyr
Score : 306 points
Date : 2024-06-01 07:25 UTC (15 hours ago)
(HTM) web link (root.cern)
| mjtlittle wrote:
| Didn't know there was a cern TLD
| SiempreViernes wrote:
| Yeah... according to wikipedia they've had it since 2014, but
| even now a lot of their pages are on .ch
| sneak wrote:
| Yes, the root zone is terribly polluted now. Unfortunately
| there's no way to unring that bell, people depend on a lot of
| these new domains now.
|
| It was a huge mistake, born out of greed and recklessness.
|
| https://en.wikipedia.org/wiki/ICANN#Notable_events
| jesprenj wrote:
| I guess ICANN needs to get money somehow.
| lambdaxyzw wrote:
| Why can't it just get funding from the government?
| rnhmjoj wrote:
| Aren't they already getting an outrageous amount of money
| for essentially supervising a txt file?
| Biganon wrote:
| I fail to see the problem with those new TLDs.
| oefrha wrote:
| Certain gTLDs have been borderline scams. The most infamous
| one might be .sucks, an extortion scheme charging an annual
| protection fee of $$$, complete with the pre-registration
| process when you could buy <yourtrademark>.sucks for $$$$
| before it's snatched up by your enemies.
|
| They also screwed up some old URL/email parsers/sniffers
| hardcoding TLDs. Largely the fault of bad assumptions to
| begin with.
|
| Other than the above, I don't see much of a problem.
| Whatever problems people like to point out about gTLDs
| already existed with numerous sketchy ccTLDs, like .io.
| Guess what, the latest hotness .ai is also one of those.
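The hardcoded-TLD parser problem mentioned above can be sketched roughly as follows. This is a minimal illustration, not code from any real parser: the `LEGACY_TLDS` list, both regular expressions, and the `looks_like_url` helper are all hypothetical.

```python
import re

# Hypothetical "old-style" matcher that only accepts a fixed list of TLDs.
LEGACY_TLDS = ("com", "org", "net", "edu", "gov", "ch")
LEGACY_URL = re.compile(
    r"https?://[A-Za-z0-9.-]+\.(?:" + "|".join(LEGACY_TLDS) + r")(?:/\S*)?$"
)

# More permissive matcher: any alphabetic label of two or more characters
# is accepted as the TLD, so newer gTLDs are not rejected.
GENERIC_URL = re.compile(r"https?://[A-Za-z0-9.-]+\.[A-Za-z]{2,}(?:/\S*)?$")


def looks_like_url(text, pattern):
    """Return True if the whole string matches the given URL pattern."""
    return pattern.match(text) is not None


legacy_accepts_ch = looks_like_url("https://home.cern.ch/", LEGACY_URL)
legacy_accepts_cern = looks_like_url("https://root.cern/install/", LEGACY_URL)
generic_accepts_cern = looks_like_url("https://root.cern/install/", GENERIC_URL)
```

With these assumed patterns, the legacy matcher accepts the .ch address but rejects the .cern one, while the permissive matcher accepts both. A real sniffer would more likely consult an up-to-date TLD list than guess from the label's shape.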
| 9dev wrote:
| I still wonder why we need that arbitrary restriction anyway?
| 8organicbits wrote:
| If we allowed all possible TLDs, then we'd need a default
| organization to administer them. The current setup requires
| an organization to control each TLD, which allows us to
| grant control to countries or large organizations. The web
| should be decentralized, which means TLD ownership should
| be spread across multiple organizations. More TLDs with
| more distinct owners is a better situation than one
| default.
| ragebol wrote:
| Handy if they host conferences, for people worried about too
| many TLDs perhaps.
|
| https://con.cern is not yet used, so...
| SiempreViernes wrote:
| Ah, root... every day it happens I am thankful I don't have to
| use a version older than 6.
| YakBizzarro wrote:
| Root was one of the reasons I decided not to study particle
| physics
| oefrha wrote:
| You don't have to. I worked on data analysis (mostly cleaning
| and correction) for CMS (one of the two main experiments at
| LHC) for a while and didn't have to touch it. Disclaimer: I
| was a high energy theorist, but did the aforementioned
| experimental work early in my PhD for funding.
| aoanla wrote:
| I mean, most of the researchers I know at least use PyRoot
| (or the Julia equivalent) as much as possible, rather than
| actually interacting with Root itself. Which probably saves
| their sanity...
| brnt wrote:
| I did my master and PhD around the time numpy/scipy got
| competitive for a lot of analysis (for me a complete
| replacement) but the Python bindings for root weren't there or
| in beta. Root-the-data+format remained however the main
| output of Geant4, so I set up a tiny Python wrapper around a
| root script that would dump any .root contents and load it up
| in a numpy file.
|
| My plots looked a lot nicer ;)
| tempay wrote:
| These days you can mostly avoid it. The Python HEP ecosystem
| is now pretty advanced so you can even read ROOT files
| without needing root itself. See:
|
| https://scikit-hep.org/
| twixfel wrote:
| I'm still waiting for the interface-breaking, let's-finally-
| make-root-good, version 7, which I think I first heard about in
| 2016 or so... true vapourware.
| amadio wrote:
| ROOT 7 is coming. Things are being discussed this year about
| it, the target is for HL-LHC. See link below.
| https://indico.cern.ch/event/1369601/contributions/5867782/a...
| nomilk wrote:
| Source code: https://github.com/root-project
| dailykoder wrote:
| >Debugging CERN ROOT scripts and ROOT-based programs in Eclipse
| IDE (30 Oct 2021)
|
| Oh gosh. The nightmares. Which obviously shows that you can
| build extraordinary stuff in horrible environments.
| BSDobelix wrote:
| I don't understand, is it about Eclipse?
| amadio wrote:
| It was a nice guest post on the website about eclipse, but
| most people just use gdb. It is now possible to step through
| ROOT macros with gdb by exporting CLING_DEBUG=1. See
| https://indico.jlab.org/event/459/contributions/11563/
| leohonexus wrote:
| Very cool to see large-scale software projects used for
| scientific discoveries.
|
| Another example: Gravitational waves were found with GStreamer at
| LIGO: https://lscsoft.docs.ligo.org/gstlal/
| hkwerf wrote:
| Here it's more the other way around. CERN needs a data analysis
| framework, so CERN develops, maintains and publishes it for
| other users.
|
| That being said, I don't know whether it's actually a good idea
| for someone external to actually use it. My experience may be a
| little outdated, but it's quite clunky and dated. The big
| advantage of using it for CERN or particle physics stuff is
| that it's basically a standard, so it's easy to collaborate
| internally.
| aulin wrote:
| Well these are two very different examples. One, ROOT, is a
| powerful data analysis framework that, as powerful as it is,
| failed to be general and easy enough to use to ever get out of
| the HEP world.
|
| The other one, gstreamer, is a beautifully designed platform
| with an architecture so nice it can be easily abstracted and
| reused in completely different scenarios, even ones that
| probably never occurred to the authors.
| im3w1l wrote:
| Gstreamer must have been a winamp clone right?
| jakjak123 wrote:
| > Gravitational waves were found with GStreamer at LIGO:
| https://lscsoft.docs.ligo.org/gstlal/
|
| Say WHAT now?!
| semi-extrinsic wrote:
| They even have a "gstlal-ugly" package!
| scheme271 wrote:
| ROOT, providing the C++ repl that no one asked for.
| fooker wrote:
| The researchers behind this contributed it into mainline clang
| as clang-repl
| pjmlp wrote:
| Before ROOT, there was Energize C++ and Visual Age for C++
| v4.0, however too expensive and resource demanding for early
| 1990s workstations.
|
| There are also a couple of C++ live environments in the game
| industry.
| Jeaye wrote:
| I definitely asked for it. I'm using Cling for JIT compiling my
| native Clojure dialect: https://github.com/jank-lang/jank
|
| Without Cling, this sort of thing wouldn't be feasible in C++.
| Not in the way which Clojure dialects work. The runtime is a
| library and the generated code is just using that library.
| SilverSlash wrote:
| Let me guess, it only runs on an IBN 5100?
| 8organicbits wrote:
| No. https://root.cern/install/
| div72 wrote:
| Only for the optional "read time travel and world domination
| plans" module.
| codecalec wrote:
| Root is definitely the backbone of a ton of work done in
| experimental particle physics but it is also the nightmare of new
| graduate students. It's effectively ingrained into particle
| physics and I don't expect that to change anytime soon
| elashri wrote:
| It is not that bad now with pyroot (ROOT python interface) and
| uproot being an option that is easy to learn for new graduate
| students. The problem is about legacy code which they usually
| have to maintain as part of experiment service
| ephimetheus wrote:
| I can't count the number of times where a beginner did
| some stuff in pyroot that was horrifically slow and just
| implementing the exact same algorithm in C++ was two orders
| of magnitude faster.
|
| If you don't use RDataFrame, or it's just histogram plotting,
| be very careful with pyroot.
| SiempreViernes wrote:
| You should be using RDataFrame though, or awkward + dask.
| ephimetheus wrote:
| +1 for RDataFrame for what it can do. Just be prepared to
| bail to C++ and for loops when you exceed what it can do
| without major headaches.
| lnauta wrote:
| Have they released v7 yet? When I started my PhD they
| announced it, and I looked forward towards the consistency
| between certain parts of the software they would introduce (some
| mismatches really don't make sense and are clearly organic) and
| now I'm already 2 years past my graduation.
| npalli wrote:
| v6.32
| bobek wrote:
| Aaah, this brings memories of late night debugging sessions of
| code written by brilliant physicists without computer science
| background ;)
| andrepd wrote:
| Ahh I can imagine the 2000 lines-long main() :)
| elashri wrote:
| There are not many reasons why new analyses should default to
| using ROOT instead of more user friendly and sane options like
| uproot [1]. Maybe some people have some legacy workflow or their
| experiments have many custom patches on top of ROOT (common
| practice) for other things but for physics analysis you might be
| torturing yourself.
|
| Also I really like their 404 page [2]. And no it is not about
| room 404 :)
|
| [1] https://github.com/scikit-hep/uproot5
|
| [2] https://root.cern/404/
| moelf wrote:
| One common criticism of uproot is that it's not flexible when
| per-row computation gets complicated because for-loops in
| Python are too slow. For that one can either use Numba (when it
| works), or, here's the shameless plug, use Julia:
| https://github.com/JuliaHEP/UnROOT.jl
|
| Past HN discussion on Julia for particle physics:
| https://news.ycombinator.com/item?id=38512793
| elashri wrote:
| That's true and Julia might be a solution but I don't see the
| adoption happening anytime soon.
|
| But this particular problem (per-row computation) now has
| several ways to tackle it in the hep-python ecosystem. One
| approach is to leverage array programming with NumPy to
| vectorize operations as much as possible. By operating on
| entire arrays rather than looping over individual elements,
| significant speedups can often be achieved.
|
| Another possibility is to use a library like Awkward Array,
| which is designed to work with nested, variable-sized data
| structures. Awkward Array integrates well with uproot and
| provides a powerful and flexible framework for performing
| fast computations on, e.g., jagged arrays.
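The jagged-array idea above can be sketched in plain Python. This is only an illustration of the offsets-plus-flat-content layout that libraries such as Awkward Array use for variable-length lists; it is not Awkward Array's API, and the function names and sample values are made up for the example.

```python
# Jagged data: each event holds a variable number of values
# (hypothetical numbers standing in for a per-particle quantity).
events = [[50.0, 20.5], [], [33.0], [12.0, 8.0, 5.5]]


def to_offsets(nested):
    """Flatten nested lists into (offsets, content).

    Event i occupies content[offsets[i]:offsets[i + 1]].
    """
    offsets, content = [0], []
    for row in nested:
        content.extend(row)
        offsets.append(len(content))
    return offsets, content


def sum_per_event(offsets, content):
    """Reduce each event to a single number using only the flat layout."""
    return [
        sum(content[start:stop])
        for start, stop in zip(offsets, offsets[1:])
    ]


offsets, content = to_offsets(events)
sums = sum_per_event(offsets, content)
```

Because all values live in one flat buffer, a vectorized library can run the reduction in compiled code instead of looping over Python lists, which is where the speedup comes from.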
| moelf wrote:
| Uproot already returns you Awkward array, so both things
| you mentioned are different ways of saying the same thing.
| The irreducible complexity of data analysis is there no
| matter how you do it, and "one-vector-at-a-time" sometimes
| feels like shoehorning (other terms people come up with
| include vector-style mental gymnastics).
|
| For the record, vector-style programming is great when it
| works, I mean Julia even has a dedicated syntax for
| broadcasting. I'm saying when the irreducible complexity
| arrives, you don't want to NOT be able to just write a for-
| loop
|
| Just a recent example, a double-for loop looks like this in
| Awkward array:
| https://github.com/Moelf/UnROOT_RDataFrame_MiniBenchmark/blo...
| -- the result looks "neat" as in a piece of art.
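For comparison, here is what a per-event double for-loop looks like when written directly. This is a hypothetical toy, not the benchmark linked above: `best_pair_per_event` and its inputs are invented, and a real analysis would combine four-vectors rather than add plain numbers.

```python
def best_pair_per_event(events, target):
    """For each event, find the index pair (i, j), i < j, whose summed
    value is closest to target. Events with fewer than two entries
    yield None.
    """
    results = []
    for values in events:
        best, best_dist = None, float("inf")
        # Explicit double loop over all unordered pairs in this event.
        for i in range(len(values)):
            for j in range(i + 1, len(values)):
                dist = abs(values[i] + values[j] - target)
                if dist < best_dist:
                    best, best_dist = (i, j), dist
        results.append(best)
    return results


pairs = best_pair_per_event(
    [[40.0, 50.0, 45.0], [10.0], [91.0, 1.0]], target=91.0
)
```

The logic reads top to bottom, but in pure Python it is slow on large inputs, which is the trade-off the thread is arguing about: the same loop compiled with Numba, or written in Julia or C++, keeps the readability without the interpreter overhead.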
| szvsw wrote:
| A great alternative to numba for accelerated Python is
| Taichi. Trivial to convert a regular python program into a
| taichi kernel, and then it can target CUDA (and a variety of
| other options) as the backend. No need to worry about
| block/grid/thread allocation etc. At the same time, it's
| super deep with great support for data classes, custom memory
| layouts for complexly nested classes, etc etc, comes with
| autograd, etc. I'm a huge fan - makes writing code that runs
| on the GPU _and_ integrates with your python libraries an
| absolute breeze. Super powerful. By far the best tool in the
| accelerated python toolbox IMO.
| OutOfHere wrote:
| Negative, as Taichi doesn't even support Python 3.12, and
| it's unclear if it ever will. Why would I limit myself to
| an old version of Python?
| almostgotcaught wrote:
| Hn people are so haughty
|
| https://github.com/taichi-dev/taichi/pull/8522
| OutOfHere wrote:
| The haughtiness is not for nothing. Since Dec 2023, they
| made a lame excuse that Pytorch didn't support 3.12:
| https://github.com/taichi-dev/taichi/issues/8365#issuecommen...
|
| Later, even when Pytorch added support for 3.12, nothing
| changed (so far) in Taichi.
| almostgotcaught wrote:
| >they made a lame excuse that Pytorch didn't support 3.12
|
| how is this a lame excuse
|
| >but it fails on a bunch of PyTorch-related tests. We
| then figured out that PyTorch does not have Python 3.12
| support
|
| they have a dep that was blocking them from upgrading.
| you would have them do what? push pytorch to upgrade?
|
| >Later, even when Pytorch added support for 3.12, nothing
| changed (so far) in Taichi.
|
| my friend that "Later" is feb/march of this year ie 2-3
| months ago. exactly how fast would you like for this open
| source project to service your needs? not to mention
| _there is a PR up for the bump_.
|
| I stand by my original comment.
| captainmuon wrote:
| A blast from the past, I used to work in particle physics and
| used ROOT a lot. I had a love/hate relationship with it. On the
| one hand, it had a lot of technical debt and idiosyncrasies. But
| on the other hand, there are a bunch of things that are easier in
| ROOT than in more "modern" options like matplotlib. For example,
| anything that has to do with histograms. Or highly structured
| data (where your 'columns' contain objects with fields). Or just
| plotting functions (without having to allocate arrays for the x
| and y values). I also like the very straightforward object-
| oriented API. It feels like old-school C++ or Java, as opposed to
| pandas/matplotlib which has a lot of method chaining, abuse of []
| syntax and other magic. It is not elegant, and quite verbose, but
| that is probably a good thing when doing a scientific analysis.
|
| I left about 5 years ago, and ROOT was in a process of change.
| They already ripped out the old CINT interpreter and moved to a
| clang-based codebase, and now you can run your analyses in
| Jupyter as far as I know (in C++ or Python). I heard the code
| quality has improved a lot, too.
| ilrwbwrkhv wrote:
| I wonder if Haskell would also be a good fit for writing
| something like this.
| shrimp_emoji wrote:
| No.
| hackable_sand wrote:
| Could it though?
| mynameisvlad wrote:
| This is a technical community. You really have to do better
| than a one word dismissal without any reasoning.
|
| In other words, why do you think it's not a good fit?
| sfpotter wrote:
| I think the response gets right to the point!
|
| Using something like Haskell for ROOT is ridiculous for a
| lot of obvious reasons. A simple and dismissive "no"
| invites the cautious reader to discover them on their own
| rather than waste time engaging in a protracted debate. Maybe
| it's better to reject the idea out of hand and spend our
| time elsewhere.
| goy wrote:
| I think having Haskell bindings to it will be quite valuable.
| For implementation of core structures, though, it's better
| to stick to C++ to max out on performance and have a finer
| control on resource usage. Haskell isn't particularly good at
| that.
|
| EDIT: there's one at
| https://hackage.haskell.org/package/HROOT
| tikhonj wrote:
| Haskell would be great for _designing the interface_ of a
| library like this, but not for _implementing_ it. It would
| definitely not look like "old-school C++ or Java" but, well,
| that's the whole point :P
|
| I haven't used ROOT so I don't know how well it would work to
| write bindings for it in Haskell; it can be hard to provide a
| good interface to an implementation that was designed for a
| totally different style of use. Possible, just difficult.
| BiteCode_dev wrote:
| Honestly now with chatgpt, matplotlib's terrible API is less of a
| problem.
| OutOfHere wrote:
| That's true, but still, there are things you just can't do in
| matplotlib that you can do better in other GPT-aware packages
| like plotly.
| typon wrote:
| This is a great example of why the age of truly terrible
| software is going to be ushered in as LLMs get better.
|
| When the cost of complexity of interacting with an API is
| paid by the LLM, optimizing this particular part of software
| design (also one of the hardest to get right) will be less
| fashionable.
| ephimetheus wrote:
| We all have a love/hate relationship with it. It's a bit like
| Stockholm syndrome.
| cozzyd wrote:
| Because matplotlib is not so histogram focused (I guess because
| the kids these days have plenty of RAM), people always show
| these abominable scatter plots that have so many points on top
| of each other that they're useless. Yuck.
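The difference between a scatter plot and a histogram can be sketched without any plotting library. This is a minimal pure-Python illustration with made-up data; `hist2d` is a hypothetical helper, not ROOT's TH2 or matplotlib's API.

```python
def hist2d(xs, ys, nbins, lo, hi):
    """Count points in an nbins x nbins grid covering [lo, hi) on both axes.

    counts[iy][ix] is the number of points in that cell. Points outside
    the range are dropped.
    """
    width = (hi - lo) / nbins
    counts = [[0] * nbins for _ in range(nbins)]
    for x, y in zip(xs, ys):
        if lo <= x < hi and lo <= y < hi:
            # min() guards against floating-point rounding at the top edge.
            ix = min(int((x - lo) / width), nbins - 1)
            iy = min(int((y - lo) / width), nbins - 1)
            counts[iy][ix] += 1
    return counts


# Three points sit on top of each other: a scatter plot draws one blob,
# while the histogram records the multiplicity.
xs = [0.1, 0.1, 0.1, 0.9, 5.0]
ys = [0.1, 0.1, 0.1, 0.9, 5.0]
counts = hist2d(xs, ys, nbins=2, lo=0.0, hi=1.0)
```

Memory use depends on the number of bins, not the number of points, so the same grid works whether it is filled with five entries or five billion.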
| casualscience wrote:
| The best thing about root was how it handled data loading.
| TTrees, with their column-based slicing on disk, are such a
| good idea. Ever since I graduated and moved into industry, I've
| been looking for something that works the same way.
| moelf wrote:
| Apache arrow and parquet all work this way. Even HDF5 in
| column mode isn't completely bad.
|
| TTree is succeeded by RNTuple, which is basically CERN's take
| on Apache Arrow, they're incredibly similar
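Column-based storage, as described above, can be illustrated with a toy layout. This is a sketch under loud assumptions: it is not the TTree, RNTuple, Parquet, or Arrow format (those add baskets or pages, compression, and schemas), and `write_columns` and `read_column` are hypothetical helpers.

```python
import io
import struct


def write_columns(buf, columns):
    """Write each column as one contiguous block of little-endian float64.

    Returns an index mapping column name -> (byte offset, value count).
    """
    index = {}
    for name, values in columns.items():
        index[name] = (buf.tell(), len(values))
        buf.write(struct.pack(f"<{len(values)}d", *values))
    return index


def read_column(buf, index, name):
    """Seek straight to one column and read only its bytes."""
    offset, count = index[name]
    buf.seek(offset)
    return list(struct.unpack(f"<{count}d", buf.read(8 * count)))


# An in-memory buffer stands in for a file on disk.
buf = io.BytesIO()
index = write_columns(
    buf,
    {
        "pt": [10.0, 20.0, 30.0],
        "eta": [0.1, -0.2, 0.3],
        "phi": [1.0, 2.0, 3.0],
    },
)
eta = read_column(buf, index, "eta")
```

Reading `eta` touches 24 bytes at a known offset and never looks at `pt` or `phi`. With a row-oriented layout the reader would have to walk every record to pull out one field.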
| amelius wrote:
| Is this a kind of lazy loading?
| nousernamed wrote:
| the amount of times I googled 'taxis' with predictable results
| koolala wrote:
| can they release a quantized 1bit version? i dont think anyones
| pc can science this
| qa-wolf-bates wrote:
| I think that this article is very interesting
| wolfspider wrote:
| The part of Root I use is Cling the C++ interpreter along with
| Xeus in a Jupyter notebook. I decided one night to test the
| fastest n-body from benchmarkgames comparing Xeus and Python 3.
| With Xeus I get 15.58 seconds and running the fastest Python code
| with Python3 kernel, both on binder using the same instance, I
| get 5 minutes. Output is exactly the same for both runs. Even
| with an overhead tax for running dynamic C++ at ~300% for this
| program Cling is very quick. SIMD and vectorization were not used
| just purely the code from benchmarkgames. I use Cling primarily
| as a quick stand-in JIT for languages that compile to C++.
| Jeaye wrote:
| I'm using Cling for JIT compiling my native Clojure dialect:
| https://github.com/jank-lang/jank
|
| Trying to bring C++ into the Clojure world and
| Clojure/interactive programming into the C++ world.
| usgroup wrote:
| I struggle to see why one may want to use an interactive analysis
| toolkit via C++. Could anyone who has used ROOT enlighten me on
| this? I understand why you may write it in C++, but why would you
| want to invoke it with C++ for this sort of work?
| ephimetheus wrote:
| All of our other code is C++. The data reconstruction framework
| writing ROOT files, the analysis frameworks doing stat
| analysis. The event data model is implemented in C++.
|
| It has its rough edges, but you do get a lot of good synergy
| out of this setup for sure.
| konstantinua00 wrote:
| if you can work in a fast language, why not?
|
| comments here have already mentioned a couple of horror stories of
| people accidentally/by inexperience doing a lot of work above
| the framework - if you can save that by not being slow, why
| not?
| rubicks wrote:
| What I remember about ROOT Cint is that it was an absolute
| nightmare to work with, mostly because it couldn't do STL
| containers very well. It was a weird time to do language interop
| for physicists.
| sbinet wrote:
| IMHO, ROOT[3-5] is too many things with a lot of poorly designed
| API and most importantly a lack of separation between ROOT-the-
| library and ROOT-the-program (lots of globals and assumptions
| that ROOT-the-program is how people should use it). ROOT 6
| started to correct some of these things, but it takes time (and
| IMHO, they are buying too much into llvm and clang, increasing
| even more the build times and worsening the hackability of ROOT
| as a project)
|
| Also, for the longest time, the I/O format wasn't very well
| documented, with only _1_ implementation.
|
| Now, thanks to groot [1], uproot (that was developed building on
| the work from groot) and others (freehep, openscientist, ...),
| it's possible to read/write ROOT data w/o bringing the whole TWorld.
| Interoperability. For data, I'd say it's very much paramount in
| my book to have some hope to be able to read back that unique
| data in 20, 30, ... years down the line.
|
| [1] https://go-hep.org/x/hep/groot (I am the main dev behind go-
| hep)
| ephimetheus wrote:
| uproot to this day doesn't properly implement reading
| TEfficiency, I believe, which is a bummer, to be honest.
| sbinet wrote:
| that's odd. TEfficiency is a relatively simple thing to
| read/write :
|
| - https://github.com/go-hep/hep/blob/main/groot/rhist/efficien...
| ephimetheus wrote:
| Yeah I think it has to do with the memberwise splitting.
| https://github.com/scikit-hep/uproot5/issues/38
|
| I understand this has not been a priority so far.
|
| It kinda works if you open a magic file with a specific on-
| disk representation which bypasses this, but that's not a
| solution at all.
___________________________________________________________________
(page generated 2024-06-01 23:00 UTC)