[HN Gopher] The big-load anti-pattern
       ___________________________________________________________________
        
       The big-load anti-pattern
        
       Author : r4um
       Score  : 56 points
       Date   : 2021-08-21 11:35 UTC (11 hours ago)
        
 (HTM) web link (lemire.me)
 (TXT) w3m dump (lemire.me)
        
       | touisteur wrote:
        | I'd add: 'every time you read the data, do something so that
        | next time it's easier.' Using Java and pcapng files as an
        | example: the first time you read your data, you should at
        | least build a simple block/packet index so that next time
        | you won't have to read the whole file again. Same for all
        | kinds of sparse data. I've had great success using 'simple'
        | things like Roaring bitmaps as 'field x is present in block
        | y of stream z' indexes. I save the compressed bitmap(s) in a
        | SQLite DB and reuse them the next time I open the data file.
        | This can be grafted on quite quickly; a sketch of the idea
        | follows below.
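        | 
        | A minimal sketch of the indexing idea in Python rather than
        | Java (the 8-byte header layout here is illustrative, not real
        | pcapng): on the first full read, record each block's byte
        | offset in SQLite; afterwards, seek straight to the blocks you
        | need.
        | 
        |     import sqlite3
        | 
        |     def build_index(path, db_path):
        |         # First (and only) full pass: remember block offsets.
        |         con = sqlite3.connect(db_path)
        |         con.execute(
        |             "CREATE TABLE IF NOT EXISTS blocks (n INT, pos INT)")
        |         with open(path, "rb") as f:
        |             n, pos = 0, 0
        |             while True:
        |                 hdr = f.read(8)  # assumed: 4B type + 4B body length
        |                 if len(hdr) < 8:
        |                     break
        |                 body_len = int.from_bytes(hdr[4:], "little")
        |                 con.execute("INSERT INTO blocks VALUES (?, ?)",
        |                             (n, pos))
        |                 f.seek(body_len, 1)  # skip body without loading it
        |                 pos, n = f.tell(), n + 1
        |         con.commit()
        |         return con
        | 
        |     def read_block(path, con, n):
        |         # Later runs: jump straight to block n, no full re-read.
        |         (pos,) = con.execute(
        |             "SELECT pos FROM blocks WHERE n = ?", (n,)).fetchone()
        |         with open(path, "rb") as f:
        |             f.seek(pos)
        |             hdr = f.read(8)
        |             return f.read(int.from_bytes(hdr[4:], "little"))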
       | 
        | I've realized over the years that the 'load everything' habit
        | is often linked to a lack of understanding of machine
        | limitations, and to little training in stream processing and
        | in how efficient and scalable it can be.
       | 
        | I'd blame maths teaching that focuses on the abstract
        | operation (the full formula) rather than its 'implementation'
        | beyond simple examples. Or I'd blame Excel as the gateway
        | drug to numerical computing. But mostly it's probably 'just'
        | that not that many people happen to encounter data 'that' big
        | (yet not 'big' data), and when they do they're often not
        | helped in finding 'progressive' solutions. A running
        | variance/average isn't hard to understand, but you must know
        | it exists (a sketch follows below)... Basic stream processing
        | can be achieved without too many changes (depending on the
        | algorithms, of course). Simple indexes can be quite easy to
        | build... But we often sell them 'you need to go full DB' or
        | 'this is a job for Hadoop or
        | infrastructure-nightmare-tool-of-the-day'. Not everyone
        | points you to https://ujmp.org/ with its sparse/dense
        | 1d/2d/3d matrix structures and operations and different
        | storage options (disk-backed, etc.).
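        | 
        | A minimal sketch of such a running mean/variance in Python
        | (one-pass, Welford-style; the file name in the usage lines is
        | made up):
        | 
        |     def running_mean_variance(stream):
        |         # One pass, O(1) memory: the dataset never sits in RAM.
        |         n, mean, m2 = 0, 0.0, 0.0
        |         for x in stream:
        |             n += 1
        |             delta = x - mean
        |             mean += delta / n
        |             m2 += delta * (x - mean)
        |         variance = m2 / (n - 1) if n > 1 else 0.0
        |         return mean, variance
        | 
        |     # e.g. fed line by line, never loading the file whole:
        |     # mean, var = running_mean_variance(
        |     #     float(line) for line in open("values.txt"))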
       | 
        | Most of the time, when I meet data scientists in difficulty,
        | after an hour of explaining how I do X using Roaring bitmaps
        | or sparse structures, or an hour spent building a file/field
        | index using very robust (old) libraries in their
        | language/environment of choice, I see them go on to build
        | pretty solid and scalable pipelines...
        
       | OnlyMortal wrote:
        | This is nothing new. Back in the late 2000s, I did this to
        | deal with a huge quantity of XML being sent over the wire to
        | desktop PCs. The original version needed 20GB of RAM. I
        | changed it to pick up only the tags needed and to parse the
        | stream as it came in. Processing time was massively reduced
        | too.
       | 
        | I see the same mistakes being made with JSON nowadays.
       | 
       | Basically, if you don't need the DOM in its entirety, don't parse
       | it all.
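        | 
        | A sketch of that approach with Python's standard library (the
        | tag and field names are made up): parse incrementally and keep
        | only what you need, never the whole DOM.
        | 
        |     import xml.etree.ElementTree as ET
        | 
        |     def stream_prices(path):
        |         # Incremental parse: never builds the whole DOM.
        |         for event, elem in ET.iterparse(path, events=("end",)):
        |             if elem.tag == "item":            # hypothetical tag
        |                 yield elem.findtext("price")  # only what's needed
        |                 elem.clear()                  # free the subtree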
        
       | dehrmann wrote:
       | I didn't find the solutions to be all that helpful; it was
       | essentially "pipeline your processing." If all I'm doing is a
       | programmatic "cat | grep | cut", that's fine. It gets messy when
       | I want to sort the data or join it with another medium-large
       | dataset, and this is why people load into memory in the first
       | place. Is there a sane path where I'm not immediately jumping to
       | Hadoop or Presto? Maybe I can offload the join to SQLite?
        
         | civilized wrote:
         | Yes! If your data is too big to comfortably manipulate in
         | memory with R/Python/your language of choice, and the data is
         | (or can be made to be) tabular, the next best place for it is a
         | relational database. If you don't have an existing server
         | that's well-suited, use SQLite or DuckDB, the "SQLite of
         | analytics". Definitely use a columnar database if you're going
         | to be doing complex queries and joins.
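          | 
          | A sketch of the SQLite route (file, table and column names
          | are made up): load two medium-large CSVs and let the
          | database do the join and sort outside your process's heap.
          | 
          |     import csv, sqlite3
          | 
          |     con = sqlite3.connect("scratch.db")  # on disk, not in RAM
          |     con.execute("CREATE TABLE orders (user_id INT, amount REAL)")
          |     con.execute("CREATE TABLE users (user_id INT, country TEXT)")
          | 
          |     with open("orders.csv") as f:  # assuming no header rows
          |         con.executemany("INSERT INTO orders VALUES (?, ?)",
          |                         csv.reader(f))
          |     with open("users.csv") as f:
          |         con.executemany("INSERT INTO users VALUES (?, ?)",
          |                         csv.reader(f))
          |     con.commit()
          | 
          |     query = """SELECT u.country, SUM(o.amount)
          |                FROM orders o JOIN users u USING (user_id)
          |                GROUP BY u.country ORDER BY 2 DESC"""
          |     for country, total in con.execute(query):
          |         print(country, total)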
         | 
         | This is battle-worn experience and I will absolutely die on
         | this hill. Relational databases JUST WORK.
         | 
         | The only exception is when you need to do complex ML on big
         | data, in which case, take your pick from Apache Spark, Dask, or
         | some similar framework. Hopefully these will be increasingly
         | supplemented by in-database ML going forward, so that there is
         | no reason to ever leave the database.
        
         | dredmorbius wrote:
         | The best thing to do when it comes to sorting data is to not do
         | that. Sorts frequently are used in merges or to find some top-n
         | / bottom-n instance rate.
         | 
         | If you've _got_ to sort (or writing an alternative is too time-
         | consuming), do it _after_ eliminating as much data (variables,
         | records) as possible.
         | 
         | Hash tables (or other comparable tools such as btrees or Bloom
         | filters) can be used for dataset comparisons or key-value
         | lookups.
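          | 
          | For instance, a sketch of a hash-based semi-join (filenames
          | and the key position are made up): build a set of keys from
          | the smaller file, then stream the big one past it.
          | 
          |     def matching_records(small_path, big_path):
          |         # Hash table from the small side, one pass over the big.
          |         with open(small_path) as f:
          |             keys = {line.split(",")[0] for line in f}
          |         with open(big_path) as f:
          |             for line in f:
          |                 if line.split(",")[0] in keys:
          |                     yield line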
         | 
          | Arrays can be used to keep a running count of top-n records
         | (this may need to be sorted on insert, though comparing
         | largest/smallest entry can mitigate this).
         | 
         | Random sampling is very often as good as complete processing,
         | especially where the goal is to find the general
         | characteristics of a dataset rather than an exhaustive view or
         | precise tabulation. Even in accounting and financial
         | operations, sampling methods may be acceptable (see for
         | example, mechanical copyright royalties on performance of
         | recorded music broadcasts, or advertising and circulation rates
         | in publishing). It's fairly staggering how clearly a _small_
         | sample (as few as 30 records, often only a few hundred or
         | thousand) can describe an immense dataset (millions or billions
         | of entries) with reasonable accuracy, _so long as it is
         | randomly selected_. And if the first sample seems off, try
         | pulling a few others and seeing how they compare. (Effectively:
         | Monte Carlo method.)
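          | 
          | One concrete way to draw such a sample in a single pass and
          | constant memory is reservoir sampling (Algorithm R); a
          | minimal sketch:
          | 
          |     import random
          | 
          |     def reservoir_sample(stream, k):
          |         # Uniform sample of k items, one pass, O(k) memory.
          |         sample = []
          |         for i, item in enumerate(stream):
          |             if i < k:
          |                 sample.append(item)
          |             else:
          |                 j = random.randrange(i + 1)
          |                 if j < k:
          |                     sample[j] = item
          |         return sample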
         | 
         | The place where sampling methods break down most is in
         | iterative-calculation forecasts (e.g., weather modeling), where
         | larger aggregations (typically measured as datapoints over both
         | area and time) greatly reduce accuracy. Though even with very
         | precise reporting density, today's weather forecasts tend to be
         | limited to 7--10 days. Sometimes randomness simply wins.
        
           | kragen wrote:
            | > _Arrays can be used to keep a running count of top-n
            | records (this may need to be sorted on insert, though
            | comparing largest/smallest entry can mitigate this)._
           | 
           | Honored and wise Morbius, I humbly beg to differ; top-N
           | record arrays never need to be sorted on insert, because a
           | binary heap is 25 lines of code. Then all you need for top-N
            | is:
            | 
            |     for record in input_file:
            |         if record <= aMinheap.minimum:
            |             continue
            |         aMinheap.extract_minimum()
            |         aMinheap.add(record)
           | 
           | I think that, to produce the top 100, this takes about 8
            | comparisons per _record added to the candidate set_, rather
           | than the 1000 or so required by your suggested method. The
           | difference won't matter for a big enough unordered dataset,
           | though, since almost all the time will be taken by discarding
           | records in the initial comparison.
           | 
           | I was thinking quickselect was even faster when your data
            | _does_ fit in RAM, but I can't see how that could be true,
           | even though it's linear time: if you have a million records,
           | then the first partitioning step of quickselect will examine
           | all of them (doing one comparison for each), the second
           | partitioning step will examine on average half of them, and
           | the third partitioning step will examine on average somewhat
           | more than a quarter of them (because the expected size for
           | the result of the second partitioning step is more than
           | 250,000 records), and so we end up doing something like _e_
           | comparisons per record on average rather than 1.
           | 
           | But, with the running-candidate-set approach you suggested,
           | if the million records are in random order, after we're past
           | the first 1000 records, we do only a single comparison for at
           | least 90% of the remaining records (because they're less than
           | the top 100 from the first 1000), and after we're past the
           | first 10,000 records, for at least 99%. So if we did 8000
            | comparisons in the first 1000, then inserted 900 of the
            | next 9000 records for another 7200 comparisons (plus 9000
            | comparisons to reject the other 8100), and then another
            | 900 of the next 90,000 records for a further 7200
            | comparisons (plus 90,000 initial comparisons), we're at
            | 121,400 comparisons for the first 100,000 records, about
            | 1.21 comparisons per record. So, for a small top-N query,
           | quickselect loses the race before it's finished with its
           | second partitioning step. Or am I overlooking something?
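            | 
            | For concreteness, a runnable version of that loop with
            | Python's standard-library heapq (heap[0] plays the role
            | of aMinheap.minimum):
            | 
            |     import heapq
            | 
            |     def top_n(records, n):
            |         heap = []          # min-heap of the n largest so far
            |         for record in records:
            |             if len(heap) < n:
            |                 heapq.heappush(heap, record)
            |             elif record > heap[0]:  # smallest of the top n
            |                 heapq.heapreplace(heap, record)
            |         return sorted(heap, reverse=True)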
        
             | dredmorbius wrote:
             | I'd hand-coded a "top-n" utility in awk some time back.
             | 
             | I think my first iteration used bubble sort. That was
             | _still_ faster than a sort | head script, by a lot.
             | 
             | (I think I later improved on that, possibly by retaining
             | min/max records.)
             | 
             | Even in awk, and even when choosing large values for n
              | (e.g., thousands or tens of thousands of records), this
             | easily beat the sort utility in cpu time, wall-clock time,
             | and memory usage, for many millions of input records.
             | 
             | (Most of my coding tends to be hackish and brutish.
             | Algorithms are not my strong suit.)
        
               | kragen wrote:
               | > I think my first iteration used bubble sort. That was
               | still faster than a sort | head script, by a lot.
               | 
               | Haha! Gotta get Barack Obama to give you some algorithms
               | lessons.
               | 
               | Thinking about it further, the initial heapify phase of
               | heapsort is linear-time, and it's the later heap-popping
               | phase that takes most of the runtime; if the command-line
               | `sort` used heapsort for in-memory data, maybe `sort |
               | head` would have been competitive or even faster? At
               | least for n>32 or so? The usual algorithm for `sort` is
               | mergesort, because that makes it perform acceptably for
               | things that don't fit in RAM, but mergesort has to do
               | most of its work before it can start producing output.
        
       | InjectionVE wrote:
       | At work we are batch-processing a decent amount of data every
       | night using Spark. The way the workload can be split into parts
       | allows the following strategy:
       | 
       | - Generate jobs only containing metadata and distribute those
       | jobs to workers. There the actual data that is required for that
       | specific job is queried on the fly from the db.
       | 
       | For a similar task however we later switched to the following
       | strategy:
       | 
       | - Load all data to be processed at once, use all sorts of
       | distributed transformations and aggregations on the full data set
       | and do the splitting at the end of the workflow.
       | 
        | The reason we switched? The only real reason was that it
        | seemed more like the "Big Data" way of doing things. With the
        | first approach we wouldn't actually need all the fancy Spark
        | functionality, right? We would only be abusing the framework
        | as a fancy way to distribute mostly independent workloads.
       | 
       | However, I very much regretted that decision later as it made our
       | life harder in many ways. For example, I could easily execute and
       | thus debug one of the former jobs locally within a minute. Try
       | that when the workflow is designed in a way that it needs to load
       | several gigabytes before applying any logic. To be fair, the
       | total load on the db was somewhat lower using the second
       | approach, but that just wasn't worth it.
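        | 
        | A rough sketch of the first strategy in PySpark (fetch_rows,
        | summarize and dates_to_process are illustrative, not our real
        | code): distribute only metadata and let each worker pull its
        | own slice of data from the db.
        | 
        |     from pyspark.sql import SparkSession
        | 
        |     def run_job(job):
        |         # Query only this job's slice, on the worker itself.
        |         rows = fetch_rows(job["table"], job["date"])
        |         return summarize(rows)
        | 
        |     spark = SparkSession.builder.appName("nightly").getOrCreate()
        |     jobs = [{"table": "events", "date": d}  # metadata only
        |             for d in dates_to_process]
        | 
        |     results = (spark.sparkContext
        |                     .parallelize(jobs, numSlices=len(jobs))
        |                     .map(run_job)
        |                     .collect())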
        
       | thanksforfish wrote:
       | > To be fair, if the rest of your pipeline runs in the megabytes
       | per second, then memory allocation might as well be free from a
       | speed point of view.
       | 
        | This is important, but I think sometimes people struggle with
        | this sort of thinking. I struggled to explain a similar
        | concept to a junior engineer recently. He was very keen to
        | try to optimize part of a process that wasn't the bottleneck.
        | I tried a couple of approaches, like benchmarking various
        | parts under different conditions and modeling how much
        | end-to-end time speeding up different components would
        | actually save.
       | 
        | I wasn't convincing, unfortunately, so he implemented some
        | changes that successfully sped up one part but didn't improve
        | end-to-end performance. I think sometimes you need to see it
        | with your own eyes.
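        | 
        | The arithmetic behind that modeling is short enough to show
        | directly (a back-of-the-envelope Amdahl's law sketch with
        | made-up numbers):
        | 
        |     def overall_speedup(fraction, local_speedup):
        |         # Only `fraction` of end-to-end time is being sped up.
        |         return 1.0 / ((1.0 - fraction) + fraction / local_speedup)
        | 
        |     # A stage that is 10% of the pipeline, made 10x faster:
        |     print(overall_speedup(0.10, 10.0))  # ~1.10x overall, i.e.
        |                                         # only ~9% less wall time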
        
         | kragen wrote:
         | At my first job I spent about US$10k on a super fast compile
         | server which didn't speed up our slow compiles because the
         | bottleneck was the shared 10BaseT Ethernet to the NFS
         | fileserver where we were storing both the source code and the
         | build artifacts. I should have listened to my boss who was
         | telling me it probably wouldn't help.
        
       | delaaxe wrote:
       | Am I the only one that giggled reading the title?
        
         | delaaxe wrote:
         | Jesus guys a little humor, why all the downvotes
        
           | _Algernon_ wrote:
           | Probably off-topic?
        
           | oblak wrote:
           | HN doesn't seem to do phrasing
        
           | klyrs wrote:
           | Jokes are not punished on HN as a rule. Low-effort comments
           | are. Double for whining about downvotes.
        
           | dredmorbius wrote:
           | They're best suffered silently.
        
       | criticaltinker wrote:
        | _> Simply put, it is nicer to build your systems so that, as
        | much as possible, they use a constant amount of memory
        | irrespective of the input size_
       | 
        | Really good advice - this is a hard-earned lesson for many folks.
       | I've worked with quite a few data scientists who were brilliant
       | at experimental design but not necessarily experts in the field
       | of comp sci. Their relatively simple python scripts would run
       | nice and fast initially. As time passed and the organization
       | grew, their scripts would start to run slower and slower as the
       | datasets scaled and swapping to disk started occurring, etc. In
       | some cases they would completely lock up shared machines, taking
       | a good chunk of the team offline for a bit.
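        | 
        | The fix is usually a small change of shape rather than a
        | rewrite; a sketch of the streaming form of a typical script
        | (file name and key choice are made up):
        | 
        |     from collections import Counter
        | 
        |     # Anti-pattern: open("events.log").readlines() grows with
        |     # the input. Streaming keeps memory bounded by the number
        |     # of distinct keys instead of the input size:
        |     def count_keys(path):
        |         counts = Counter()
        |         with open(path) as f:
        |             for line in f:
        |                 counts[line.split(",")[0]] += 1
        |         return counts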
       | 
       | Anyway, Daniel Lemire's blog is a fantastic resource. I highly
       | recommend taking a look through his publications and open source
       | contributions. I was able to save my employer _a lot_ of money by
       | building on time series compression algorithms [1] and vectorized
       | implementations [2][3] that he has provided.
       | 
       | [1] Decoding billions of integers per second through
       | vectorization
       | https://onlinelibrary.wiley.com/doi/full/10.1002/spe.2203
       | 
       | [2] https://github.com/lemire/FastPFor
       | 
       | [3] https://github.com/searchivarius/PyFastPFor
        
         | alpineidyll3 wrote:
         | The flip side is that sometimes unnecessary optimizations can
         | slow progress towards the actual goal, especially if the data
         | transformations are sophisticated math.
        
           | Ologn wrote:
           | I have run into this problem more than once. Usually I am not
           | creating the initial data, but am importing it. Usually I
           | have ignored optimizations initially and just loaded and
           | processed the data. Then I go back and hack on the
           | subroutines loading the data and splitting it into records. I
           | was the only programmer working on both projects though.
           | 
           | Once I did it with a large RDF (XML) that was a list of
           | books. Initially I loaded and processed it in Perl with
           | XML::Simple, but XML::Twig let me deal with things one record
           | at a time.
           | 
           | The other time I had a large Java linked list holding
           | objects, where I grabbed the head, processed it, and then
           | shrunk the list (removeFirst()) as I proceeded (I probably
           | could have optimized the loading even more).
           | 
            | As a one-person project, punting on the big-load
            | optimization worked for me, since I had a working program
            | even before optimizing. On a larger project, the best time
            | to optimize might have been different.
        
           | jka wrote:
           | In the context of high-performance, revenue-sensitive data
           | transformations, it'd be interesting to see how reusable
           | free/open-source components would fare, especially with
           | regard to licensing.
           | 
           | I can see MIT/permissive licenses being _less_ attractive
           | than usual in an environment like that, because the author
           | might perceive reduced chance of reaping rewards (versus
           | aiding competitors) from publishing components.
           | 
           | If improvements led to greater-than-zero reward for
           | publishers and consumers alike, though (as enforced by
           | copyleft approaches), then perhaps the results would be
           | different. All conjecture, really.
        
         | kragen wrote:
         | > _Really good advice - this is a hard-earned lesson for many
          | folks. I've worked with quite a few data scientists who were
         | brilliant at experimental design but not necessarily experts in
         | the field of comp sci. Their relatively simple Python scripts
         | would run nice and fast initially. As time passed and the
         | organization grew,_
         | 
         | The problem of "as the organization grew,..." is the problem
         | you want to have. 90% of organizations instead have the problem
         | "As we failed to get any interesting results or any funding,
         | and our contributors started drifting away to other
         | projects,..." and sometimes the reason they have that problem
         | is precisely because the initial contributors were too
         | concerned about doing things "the right way" and weren't
         | satisfied with "relatively simple Python scripts [that] run
         | nice and fast initially".
         | 
         | When your limiting factor is programmer effort, there are two
         | kinds of systems: the kind that runs as slowly as the users and
         | developers can tolerate, and the kind that people don't use
         | because it lacks the functionality they want, functionality
         | that some other system has. Probably a slower one.
         | 
         | Of course you should make things go as fast as you can, and
         | performance can be critical not only for ergonomics but also
         | for the ability to iterate rapidly, and sometimes the
         | performance _is_ the functionality, but when you find yourself
         | facing down the shitty performance of  "founder code" that
         | locks up shared machines, don't fall into the trap of wishing
         | that the people had done it right in the first place, before
         | "time passed and the organization grew". In the startup
         | graveyard there are nine other groups who tried to do the same
         | thing, but couldn't quite get it right. Some of them probably
         | had damn fast code, which you aren't looking at, because their
         | organization never grew.
         | 
         | And I say that as someone who spent a fair bit of the last week
         | looking at assembly language and thinking about compiler
         | optimization tradeoffs.
        
       | kohlerm wrote:
        | This is, in my experience, the most common performance
        | anti-pattern. What makes it even worse is that people quite
        | often do it in the name of performance, which is rarely a
        | good idea.
        
       ___________________________________________________________________
       (page generated 2021-08-21 23:01 UTC)