[HN Gopher] The big-load anti-pattern
___________________________________________________________________
The big-load anti-pattern
Author : r4um
Score : 56 points
Date : 2021-08-21 11:35 UTC (11 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| touisteur wrote:
| I'd add 'every time you read the data, do something so that next
| time it's easier'. Taking Java and pcapng files as an example:
| the first time you read your data, you should at least build a
| simple block/packet index so that next time you won't have to
| read it all again. Same for all kinds of sparse data. I've had
| great success using 'simple' things like roaring bitmaps as
| 'field x is present in block y of stream z' indexes. I save the
| compressed roaring bitmap(s) in a SQLite DB and the next time I
| open the data file I use it. This can be grafted on quite quickly.
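|
| Roughly the shape of it in Python with pyroaring and sqlite3
| (read_blocks and has_field_x stand in for your own streaming
| reader and predicate; an untested sketch, not my actual code):
|
|     import sqlite3
|     from pyroaring import BitMap
|
|     con = sqlite3.connect("index.db")
|     con.execute("CREATE TABLE IF NOT EXISTS field_index "
|                 "(field TEXT PRIMARY KEY, blocks BLOB)")
|
|     # first pass: remember which blocks contain field x
|     present = BitMap()
|     for block_no, block in enumerate(read_blocks("cap.pcapng")):
|         if has_field_x(block):
|             present.add(block_no)
|     con.execute("INSERT OR REPLACE INTO field_index VALUES (?, ?)",
|                 ("x", present.serialize()))
|     con.commit()
|
|     # next time the file is opened: load the tiny bitmap and
|     # only touch the blocks it points at
|     row = con.execute("SELECT blocks FROM field_index WHERE "
|                       "field = ?", ("x",)).fetchone()
|     wanted = BitMap.deserialize(row[0])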
|
| I've realized over the years that the 'load everything' approach
| is often linked to a lack of understanding of machine limitations
| and to little training in stream processing and in how efficient
| and scalable it can be.
|
| I'd blame maths teaching that focuses on the abstract operation
| (the full formula) and rarely covers its 'implementation' past
| simple examples. Or I'd blame Excel as the gateway drug to
| numerical computing. But mostly, it's probably 'just' that not
| that many people happen to encounter data 'that' big (yet it's
| not 'big' data), and when they do they're often not helped in
| finding 'progressive' solutions. A running variance/average isn't
| hard to understand, but you must know it exists (see the sketch
| below)... Basic stream processing can be achieved without too
| many changes (depending on the algorithms, of course). Simple
| indexes can be quite easy to build... But often we sell them 'you
| need to go full DB' or 'this is a job for Hadoop or
| infrastructure-nightmare-tool-of-the-day'. Not everyone points
| you to https://ujmp.org/ with sparse/dense 1d/2d/3d matrix
| structures and operations and different storage options
| (disk-backed, etc.).
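|
| For instance, a running mean/variance (Welford's online
| algorithm) needs only three variables, no matter how much data
| streams past. A minimal sketch, with stream_of_numbers() standing
| in for whatever iterator feeds you the data:
|
|     n, mean, m2 = 0, 0.0, 0.0   # m2: sum of squared deviations
|     for x in stream_of_numbers():
|         n += 1
|         delta = x - mean
|         mean += delta / n
|         m2 += delta * (x - mean)
|     variance = m2 / (n - 1) if n > 1 else 0.0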
|
| Most of the time, when I meet data scientists in difficulty,
| after an hour of explaining how I do X using roaring bitmaps or
| sparse structures, or an hour spent building a file/field index
| using very robust (old) libraries in their language/environment
| of choice, I see them build pretty solid and scalable pipelines...
| OnlyMortal wrote:
| This is nothing new. Back in the late 2000s, I did this to deal
| with a huge quantity of XML being sent over the wire to desktop
| PCs. The original version needed 20GB of RAM. I changed it to
| pick up only the tags needed and parsed the stream as it arrived.
| Processing time was massively reduced too.
|
| I see the same mistakes made with JSON nowadays.
|
| Basically, if you don't need the DOM in its entirety, don't parse
| it all.
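|
| In Python the same idea looks roughly like this with
| xml.etree.ElementTree.iterparse (the tag name is made up; just a
| sketch of the streaming approach, not the original code):
|
|     import xml.etree.ElementTree as ET
|
|     total = 0
|     # iterparse walks the file as a stream; no full DOM is built
|     for _, elem in ET.iterparse("huge.xml", events=("end",)):
|         if elem.tag == "price":      # only the tags we need
|             total += float(elem.text)
|         elem.clear()                 # drop what we've already seen
|     print(total)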
| dehrmann wrote:
| I didn't find the solutions to be all that helpful; it was
| essentially "pipeline your processing." If all I'm doing is a
| programmatic "cat | grep | cut", that's fine. It gets messy when
| I want to sort the data or join it with another medium-large
| dataset, and this is why people load into memory in the first
| place. Is there a sane path where I'm not immediately jumping to
| Hadoop or Presto? Maybe I can offload the join to SQLite?
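|
| Something like this is what I have in mind: push the join and the
| sort into SQLite and stream the result back (file and column
| names invented, untested):
|
|     import csv, sqlite3
|
|     con = sqlite3.connect("scratch.db")  # or ":memory:" if it fits
|     con.execute("CREATE TABLE a (id TEXT, v TEXT)")
|     con.execute("CREATE TABLE b (id TEXT, w TEXT)")
|     with open("a.csv") as f:
|         con.executemany("INSERT INTO a VALUES (?, ?)",
|                         csv.reader(f))
|     with open("b.csv") as f:
|         con.executemany("INSERT INTO b VALUES (?, ?)",
|                         csv.reader(f))
|     con.execute("CREATE INDEX idx_b_id ON b (id)")
|
|     # SQLite does the sort/join; rows stream back one at a time
|     query = ("SELECT a.v, b.w FROM a JOIN b USING (id) "
|              "ORDER BY a.v")
|     for v, w in con.execute(query):
|         pass  # process each joined row here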
| civilized wrote:
| Yes! If your data is too big to comfortably manipulate in
| memory with R/Python/your language of choice, and the data is
| (or can be made to be) tabular, the next best place for it is a
| relational database. If you don't have an existing server
| that's well-suited, use SQLite or DuckDB, the "SQLite of
| analytics". Definitely use a columnar database if you're going
| to be doing complex queries and joins.
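|
| For example, DuckDB's Python API can aggregate a CSV in place
| without a separate loading step (table/column names invented;
| just a sketch):
|
|     import duckdb
|
|     con = duckdb.connect()  # in-memory DB; CSV scanned in place
|     top = con.execute("""
|         SELECT user_id, avg(amount) AS avg_amount
|         FROM read_csv_auto('orders.csv')
|         GROUP BY user_id
|         ORDER BY avg_amount DESC
|         LIMIT 10
|     """).fetchall()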
|
| This is battle-worn experience and I will absolutely die on
| this hill. Relational databases JUST WORK.
|
| The only exception is when you need to do complex ML on big
| data, in which case, take your pick from Apache Spark, Dask, or
| some similar framework. Hopefully these will be increasingly
| supplemented by in-database ML going forward, so that there is
| no reason to ever leave the database.
| dredmorbius wrote:
| The best thing to do when it comes to sorting data is to not do
| that. Sorts frequently are used in merges or to find some top-n
| / bottom-n instance rate.
|
| If you've _got_ to sort (or writing an alternative is too time-
| consuming), do it _after_ eliminating as much data (variables,
| records) as possible.
|
| Hash tables (or other comparable tools such as btrees or Bloom
| filters) can be used for dataset comparisons or key-value
| lookups.
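|
| E.g., to line two datasets up by key without sorting either one,
| load only the smaller side into a hash table and stream the
| larger side past it (column layout invented; sketch only):
|
|     import csv
|
|     # small side fits in memory as a plain dict
|     with open("small.csv") as f:
|         lookup = {key: val for key, val in csv.reader(f)}
|
|     # big side is streamed, never held in memory
|     with open("big.csv") as f, \
|          open("joined.csv", "w", newline="") as out:
|         w = csv.writer(out)
|         for key, val in csv.reader(f):
|             if key in lookup:        # O(1) lookup, no sort
|                 w.writerow([key, val, lookup[key]])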
|
| Arrays can be used to keep a running count of top-n records
| (this may need to be sorted on insert, though comparing
| largest/smallest entry can mitigate this).
|
| Random sampling is very often as good as complete processing,
| especially where the goal is to find the general
| characteristics of a dataset rather than an exhaustive view or
| precise tabulation. Even in accounting and financial
| operations, sampling methods may be acceptable (see for
| example, mechanical copyright royalties on performance of
| recorded music broadcasts, or advertising and circulation rates
| in publishing). It's fairly staggering how clearly a _small_
| sample (as few as 30 records, often only a few hundred or
| thousand) can describe an immense dataset (millions or billions
| of entries) with reasonable accuracy, _so long as it is
| randomly selected_. And if the first sample seems off, try
| pulling a few others and seeing how they compare. (Effectively:
| Monte Carlo method.)
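|
| And the sample itself can be drawn in one pass with constant
| memory, without knowing the dataset's size up front (the classic
| reservoir-sampling "Algorithm R"; a sketch):
|
|     import random
|
|     def reservoir_sample(stream, k=1000):
|         sample = []
|         for i, item in enumerate(stream):
|             if i < k:
|                 sample.append(item)       # fill the reservoir
|             else:
|                 j = random.randint(0, i)  # inclusive bounds
|                 if j < k:
|                     sample[j] = item      # keep with prob k/(i+1)
|         return sample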
|
| The place where sampling methods break down most is in
| iterative-calculation forecasts (e.g., weather modeling), where
| larger aggregations (typically measured as datapoints over both
| area and time) greatly reduce accuracy. Though even with very
| precise reporting density, today's weather forecasts tend to be
| limited to 7--10 days. Sometimes randomness simply wins.
| kragen wrote:
| > _Arrays can be used to keep a running count of top-n
| records (this may need to be sorted on insert, though
| comparing largest/smallest entry can mitigate this)._
|
| Honored and wise Morbius, I humbly beg to differ; top-N
| record arrays never need to be sorted on insert, because a
| binary heap is 25 lines of code. Then all you need for top-N
| is:
|
|     for record in input_file:
|         if record <= aMinheap.minimum: continue
|         aMinheap.extract_minimum()
|         aMinheap.add(record)
|
| I think that, to produce the top 100, this takes about 8
| comparisons per _record added to the candidate set_, rather
| than the 1000 or so required by your suggested method. The
| difference won't matter for a big enough unordered dataset,
| though, since almost all the time will be taken by discarding
| records in the initial comparison.
|
| I was thinking quickselect was even faster when your data
| _does_ fit in RAM, but I can't see how that could be true,
| even though it's linear time: if you have a million records,
| then the first partitioning step of quickselect will examine
| all of them (doing one comparison for each), the second
| partitioning step will examine on average half of them, and
| the third partitioning step will examine on average somewhat
| more than a quarter of them (because the expected size for
| the result of the second partitioning step is more than
| 250,000 records), and so we end up doing something like _e_
| comparisons per record on average rather than 1.
|
| But, with the running-candidate-set approach you suggested,
| if the million records are in random order, after we're past
| the first 1000 records, we do only a single comparison for at
| least 90% of the remaining records (because they're less than
| the top 100 from the first 1000), and after we're past the
| first 10,000 records, for at least 99%. So if we did 8000
| comparisons in the first 1000, then inserted 900 of the next
| 9000 records for another 7200 comparisons (plus 9000
| comparisons to reject the other 8100), and then another 900
| of the next 90,000 records (plus 90,000 initial comparisons),
| we're at 121,400 comparisons for the first 100,000 records,
| 1.214 comparisons per record. So, for a small top-N query,
| quickselect loses the race before it's finished with its
| second partitioning step. Or am I overlooking something?
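|
| For anyone who wants a runnable version of that pseudocode:
| Python's heapq does the bookkeeping, and heapq.nlargest is
| essentially this loop packaged up. A sketch:
|
|     import heapq
|
|     def top_n(records, n=100):
|         heap = []                  # bounded min-heap of size n
|         for r in records:
|             if len(heap) < n:
|                 heapq.heappush(heap, r)
|             elif r > heap[0]:      # heap[0] = current minimum
|                 heapq.heapreplace(heap, r)
|         return sorted(heap, reverse=True)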
| dredmorbius wrote:
| I'd hand-coded a "top-n" utility in awk some time back.
|
| I think my first iteration used bubble sort. That was
| _still_ faster than a sort | head script, by a lot.
|
| (I think I later improved on that, possibly by retaining
| min/max records.)
|
| Even in awk, and even when choosing large values for n
| (e.g., thousands or tens of thousands of records), this
| easily beat the sort utility in CPU time, wall-clock time,
| and memory usage, for many millions of input records.
|
| (Most of my coding tends to be hackish and brutish.
| Algorithms are not my strong suit.)
| kragen wrote:
| > I think my first iteration used bubble sort. That was
| still faster than a sort | head script, by a lot.
|
| Haha! Gotta get Barack Obama to give you some algorithms
| lessons.
|
| Thinking about it further, the initial heapify phase of
| heapsort is linear-time, and it's the later heap-popping
| phase that takes most of the runtime; if the command-line
| `sort` used heapsort for in-memory data, maybe `sort |
| head` would have been competitive or even faster? At
| least for n>32 or so? The usual algorithm for `sort` is
| mergesort, because that makes it perform acceptably for
| things that don't fit in RAM, but mergesort has to do
| most of its work before it can start producing output.
| InjectionVE wrote:
| At work we are batch-processing a decent amount of data every
| night using Spark. The way the workload can be split into parts
| allows the following strategy:
|
| - Generate jobs containing only metadata and distribute those
| jobs to workers. There, the actual data required for that
| specific job is queried on the fly from the DB.
|
| For a similar task however we later switched to the following
| strategy:
|
| - Load all data to be processed at once, use all sorts of
| distributed transformations and aggregations on the full data set
| and do the splitting at the end of the workflow.
|
| Why did we switch? The only real reason was that it seemed more
| like the 'Big Data' way of doing things. With the first approach
| we wouldn't actually need all the fancy Spark functionality,
| right? We would only be abusing the framework as a fancy way to
| distribute mostly independent workloads.
|
| However, I very much regretted that decision later, as it made
| our lives harder in many ways. For example, I could easily
| execute, and thus debug, one of the former jobs locally within a
| minute. Try that when the workflow is designed so that it needs
| to load several gigabytes before applying any logic. To be fair,
| the total load on the DB was somewhat lower with the second
| approach, but that just wasn't worth it.
| thanksforfish wrote:
| > To be fair, if the rest of your pipeline runs in the megabytes
| per second, then memory allocation might as well be free from a
| speed point of view.
|
| This is important, but I think sometimes people struggle with
| this sort of thinking. I struggled to explain a similar concept
| to a junior engineer recently. He was very keen to optimize part
| of a process that wasn't the bottleneck. I tried a couple of
| approaches, like benchmarking various parts under different
| conditions and modeling how much speeding up different components
| would affect the end-to-end time.
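|
| The napkin version of that model is just Amdahl's law; with
| made-up numbers for his case:
|
|     fraction = 0.10  # share of end-to-end time in that one stage
|     speedup = 5.0    # how much faster he might make that stage
|     overall = 1 / ((1 - fraction) + fraction / speedup)
|     print(f"{overall:.2f}x end to end")   # ~1.09x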
|
| I wasn't convincing, unfortunately, so he implemented some
| changes that successfully sped up one part but didn't improve
| end-to-end performance. I think sometimes you need to see it
| with your own eyes.
| kragen wrote:
| At my first job I spent about US$10k on a super fast compile
| server which didn't speed up our slow compiles because the
| bottleneck was the shared 10BaseT Ethernet to the NFS
| fileserver where we were storing both the source code and the
| build artifacts. I should have listened to my boss who was
| telling me it probably wouldn't help.
| delaaxe wrote:
| Am I the only one that giggled reading the title?
| delaaxe wrote:
| Jesus guys a little humor, why all the downvotes
| _Algernon_ wrote:
| Probably off-topic?
| oblak wrote:
| HN doesn't seem to do phrasing
| klyrs wrote:
| Jokes are not punished on HN as a rule. Low-effort comments
| are. Double for whining about downvotes.
| dredmorbius wrote:
| They're best suffered silently.
| criticaltinker wrote:
| _> Simply put, it is nicer to build your systems so that, as much
| as possible, they use a constant amount of memory irrespective of
| the input size_
|
| Really good advice - this is a hard-earned lesson for many folks.
| I've worked with quite a few data scientists who were brilliant
| at experimental design but not necessarily experts in the field
| of comp sci. Their relatively simple Python scripts would run
| nice and fast initially. As time passed and the organization
| nice and fast initially. As time passed and the organization
| grew, their scripts would start to run slower and slower as the
| datasets scaled and swapping to disk started occurring, etc. In
| some cases they would completely lock up shared machines, taking
| a good chunk of the team offline for a bit.
|
| Anyway, Daniel Lemire's blog is a fantastic resource. I highly
| recommend taking a look through his publications and open source
| contributions. I was able to save my employer _a lot_ of money by
| building on time series compression algorithms [1] and vectorized
| implementations [2][3] that he has provided.
|
| [1] Decoding billions of integers per second through
| vectorization
| https://onlinelibrary.wiley.com/doi/full/10.1002/spe.2203
|
| [2] https://github.com/lemire/FastPFor
|
| [3] https://github.com/searchivarius/PyFastPFor
| alpineidyll3 wrote:
| The flip side is that sometimes unnecessary optimizations can
| slow progress towards the actual goal, especially if the data
| transformations are sophisticated math.
| Ologn wrote:
| I have run into this problem more than once. Usually I am not
| creating the initial data, but am importing it. Usually I
| have ignored optimizations initially and just loaded and
| processed the data. Then I go back and hack on the
| subroutines loading the data and splitting it into records. I
| was the only programmer working on both projects though.
|
| Once I did it with a large RDF (XML) file that was a list of
| books. Initially I loaded and processed it in Perl with
| XML::Simple, but XML::Twig let me deal with things one record
| at a time.
|
| The other time I had a large Java linked list holding
| objects, where I grabbed the head, processed it, and then
| shrunk the list (removeFirst()) as I proceeded (I probably
| could have optimized the loading even more).
|
| As a one-person project, punting on the big-load optimization
| worked for me, as I had a working program even before
| optimizing. If it were a larger project, the best time to
| optimize might have differed.
| jka wrote:
| In the context of high-performance, revenue-sensitive data
| transformations, it'd be interesting to see how reusable
| free/open-source components would fare, especially with
| regard to licensing.
|
| I can see MIT/permissive licenses being _less_ attractive
| than usual in an environment like that, because the author
| might perceive reduced chance of reaping rewards (versus
| aiding competitors) from publishing components.
|
| If improvements led to greater-than-zero reward for
| publishers and consumers alike, though (as enforced by
| copyleft approaches), then perhaps the results would be
| different. All conjecture, really.
| kragen wrote:
| > _Really good advice - this is a hard-earned lesson for many
| folks. I 've worked with quite a few data scientists who were
| brilliant at experimental design but not necessarily experts in
| the field of comp sci. Their relatively simple Python scripts
| would run nice and fast initially. As time passed and the
| organization grew,_
|
| The problem of "as the organization grew,..." is the problem
| you want to have. 90% of organizations instead have the problem
| "As we failed to get any interesting results or any funding,
| and our contributors started drifting away to other
| projects,..." and sometimes the reason they have that problem
| is precisely because the initial contributors were too
| concerned about doing things "the right way" and weren't
| satisfied with "relatively simple Python scripts [that] run
| nice and fast initially".
|
| When your limiting factor is programmer effort, there are two
| kinds of systems: the kind that runs as slowly as the users and
| developers can tolerate, and the kind that people don't use
| because it lacks the functionality they want, functionality
| that some other system has. Probably a slower one.
|
| Of course you should make things go as fast as you can, and
| performance can be critical not only for ergonomics but also
| for the ability to iterate rapidly, and sometimes the
| performance _is_ the functionality, but when you find yourself
| facing down the shitty performance of "founder code" that
| locks up shared machines, don't fall into the trap of wishing
| that the people had done it right in the first place, before
| "time passed and the organization grew". In the startup
| graveyard there are nine other groups who tried to do the same
| thing, but couldn't quite get it right. Some of them probably
| had damn fast code, which you aren't looking at, because their
| organization never grew.
|
| And I say that as someone who spent a fair bit of the last week
| looking at assembly language and thinking about compiler
| optimization tradeoffs.
| kohlerm wrote:
| This is in my experience the most common performance antipattern.
| What makes it even worse is the fact that people quite often do
| this on the name of performance. Which rarely is a good idea.
___________________________________________________________________
(page generated 2021-08-21 23:01 UTC)