The big-load anti-pattern

Daniel Lemire, August 21, 2021

When doing data engineering, it is common for engineers to want to first load all of the data in memory before processing it. If you have sufficient memory and the loaded data is not ephemeral, or if you have small volumes, it is a sensible approach. After all, that is how a spreadsheet typically works: you load the whole spreadsheet in memory. But what if your application falls in the "extract-transform-load" category, where you discard the data as you scan it? What if you have large volumes of data? Then I consider "load everything in memory at once" a performance anti-pattern when you are doing high-performance data engineering.

The most obvious problem with a big load is scalability. As your data inputs get larger and larger, you consume more and more memory. Though it is true that over time we get more memory, we also tend to get more processing cores, and RAM is a shared resource. Currently, in 2021, some of the most popular instance types on the popular cloud system AWS have 4 GB per virtual CPU. If you have the means, AWS will provide you with memory-optimized virtual nodes that have 24 TB of RAM. However, these nodes have 448 logical processors sharing that memory.

Frequent and large memory allocations are somewhat risky. A single memory allocation failure may force an entire process to terminate. On a server, it might be difficult to anticipate what other processes do and how much memory they use. Thus each process should seek to keep its memory usage predictable, if not stable. Simply put, it is nicer to build your systems so that, as much as possible, they use a constant amount of memory irrespective of the input size.
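To make the pattern concrete, here is a minimal sketch of the big-load shape (my own illustration, not code from any particular system): the entire input is materialized in one buffer before any processing begins, so memory usage grows linearly with the input.

    #include <fstream>
    #include <iostream>
    #include <vector>

    // Big-load shape: the whole file is resident in memory before the
    // "transform" step even starts, so memory grows with the input size.
    std::vector<char> load_whole_file(const char *path) {
      std::ifstream in(path, std::ios::binary | std::ios::ate);
      if (!in) { return {}; }
      std::streamsize size = in.tellg();
      in.seekg(0, std::ios::beg);
      std::vector<char> content(static_cast<size_t>(size));
      in.read(content.data(), size);
      return content;
    }

    int main(int argc, char **argv) {
      if (argc < 2) { return 1; }
      std::vector<char> data = load_whole_file(argv[1]);
      std::cout << "loaded " << data.size() << " bytes" << std::endl;
      // ... only now would the actual processing begin ...
      return 0;
    }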
If you are designing a web service and you put a hard limit on the size of a single result, you will help engineers build better clients. You may also encounter various limits which reduce your portability. Not every cloud framework will allow you to upload a 40 GB file at once, without fragmentation. And, of course, on-device processing in a mobile setting becomes untenable if you have no bound on the data inputs.

But what about performance? If you have inefficient code (maybe written in JavaScript or bad C++), then you should have no worries. But if you are concerned with performance, the story gets more intricate. If you are processing the data in tiny increments, you can keep most of the data that you are consuming in CPU cache. However, if you are using a big load, then you need to allocate a large memory region, initialize it, fill it up and then read it again. The data goes from the CPU to the RAM and back again. The process is relatively expensive.

To illustrate the point, I wrote a little benchmark. I consider a function which allocates memory and populates an array of integers with the values 0, 1, 2, and so forth:

    int *content = new int[volume / sizeof(int)];
    init(content, volume);
    delete[] content;

It is a silly function: everything you would do that involves memory allocation is likely far slower. So how fast is this fast function? I get the following numbers on two of my machines. I pick the best results within each run.

                                            1 MB       1 GB
    alloc-init (best), AMD Rome (Linux)     33 GB/s    7 GB/s
    alloc-init (best), Apple M1             30 GB/s    9 GB/s

Simply put, allocating memory and pushing data into it gets slower and slower with the volume. We can explain it in terms of CPU cache and RAM, but the principle is entirely general. You may consider 7 GB/s or 9 GB/s to be a good speed, and indeed these processors and operating systems are efficient. However, consider that this is only your starting point: we have not read the data yet. If you need to actually "read" that data, let alone transform it or do any kind of reasoning over it, you must then bring it back from RAM to cache. So you have the full cache-to-RAM and RAM-to-cache cycle. Unavoidably, your speed will start to fall... 5 GB/s, 2 GB/s... and soon you will be in the megabytes per second.

Your pipeline could be critically bounded because it is built out of slow software (e.g., JavaScript code) or because you are relying on slow networks and disks. To be fair, if the rest of your pipeline runs in the megabytes per second, then memory allocation might as well be free from a speed point of view. That is why I qualify the big load as an anti-pattern for high-performance data engineering.

In a high-performance context, for efficiency, you should stream through the data as much as possible, reading it in chunks that are roughly the size of your CPU cache (e.g., megabytes). The best chunk size depends on many parameters, but it is typically neither tiny (kilobytes) nor large (gigabytes). If you bypass such an optimization as part of your system's architecture, you may have hard limits on your performance later.
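As a rough illustration of this kind of streaming, here is a minimal sketch (my own, using an arbitrary 1 MB chunk size rather than a value prescribed by the post): a single buffer is allocated once and reused, so memory usage stays constant no matter how large the input is.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Stream through a file in fixed-size chunks: the buffer is allocated once
    // and reused, so memory usage does not grow with the size of the input.
    int main(int argc, char **argv) {
      if (argc < 2) { return 1; }
      std::FILE *f = std::fopen(argv[1], "rb");
      if (f == nullptr) { return 1; }
      const size_t chunk_size = 1 << 20; // 1 MB: roughly cache-sized
      std::vector<char> buffer(chunk_size);
      uint64_t checksum = 0; // stand-in for whatever per-chunk processing you need
      size_t n;
      while ((n = std::fread(buffer.data(), 1, buffer.size(), f)) > 0) {
        for (size_t i = 0; i < n; i++) {
          checksum += static_cast<unsigned char>(buffer[i]);
        }
      }
      std::fclose(f);
      std::printf("checksum: %llu\n", static_cast<unsigned long long>(checksum));
      return 0;
    }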
It is best to view the processor as a dot at the middle of a sequence of concentric circles. The processor is hard of hearing: it can only communicate with people in the inner circle. But there is limited room in each circle. The further you are from the processor, the more expensive it is for the processor to talk to you, because you first need to move to the center, possibly pushing out some other folks. The room close to the processor is crowded and precious. So if you can, you should have your guests come into the center once, and then exit forever. What a big load tends to do is to get people into the inner circle, then out to some remote circle, and then back again into the inner circle. It works well when there are few guests because everyone gets to stay in the inner circle or nearby, but as more and more people come in, it becomes less and less efficient.

It does not matter how your code looks: if you need to fully consume all of a large data file before you process it, you have a case of big load. Whether or not you are using fancy techniques such as memory file mapping does not change the equation. Some parameters, like the size of your pages, may help however.

How may you avoid the big-load anti-pattern?

* Within the files themselves, you should have some kind of structure so that you do not need to consume the whole file at once when it is large. It comes naturally with popular formats such as CSV, where you can often consume just one line at a time (see the sketch after this list). If you are working with JSON data files, you may want to adopt JSON streaming for an equivalent result. Most data-engineering formats will support some concept of chunk or page to help you.
* Consider splitting your data. If you have a database engine, you may consider sharding. If you are working with large files, you may want to use smaller files. You should be cautious not to fall for the small-load anti-pattern: e.g., do not store only a few bytes per file and do not fragment your web applications into 400 loadable resources.
* When compressing data, try to make sure you can uncompress small usable chunks (a few megabytes).
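Here is the sketch referenced in the first item: a minimal, hypothetical example (my own, not from the post) that consumes a CSV-like file one line at a time, so only the current record needs to be held in memory.

    #include <fstream>
    #include <iostream>
    #include <string>

    // Consume a CSV-like file one record (line) at a time: memory usage is
    // bounded by the length of the longest line, not by the size of the file.
    int main(int argc, char **argv) {
      if (argc < 2) { return 1; }
      std::ifstream in(argv[1]);
      std::string line;
      size_t records = 0;
      while (std::getline(in, line)) {
        // process_line(line); // hypothetical per-record processing
        records++;
      }
      std::cout << records << " records" << std::endl;
      return 0;
    }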
5 thoughts on "The big-load anti-pattern"

--- says (August 21, 2021 at 12:34 pm):

    roughly the size of your CPU cache

That's often too small, although if you're talking about L3 caches on many present-day CPUs, it's probably okay. The cost of context switching and I/O is significantly worse than RAM speed on most setups, so you often want to tune for I/O as opposed to CPU cache, such as making large read requests to disk (particularly if it's a hard disk). What's probably more important is doing async I/O, so your processing can occur whilst the disk is pulling in data for you. Also, avoid random I/O, as sequential I/O is generally fastest (i.e. be careful with reading too many things at the same time). Since I/O costs typically dwarf RAM speeds, large files are better than smaller ones (lower random access costs and fewer performance penalties from opening files). It's actually why many games go to the effort of packing all their assets into archives: it is faster to deal with one big file than many smaller ones.

Daniel Lemire says (August 21, 2021 at 1:53 pm):

Yes, using many tiny files would be a performance anti-pattern of its own (which I refer to as "small-load"). Note that in my context, I assume a high-performance SSD. If you are using a spinning drive, then you may be limited to a few megabytes per second (if that) and my whole post is irrelevant.

--- says (August 21, 2021 at 10:48 pm):

It's not just many tiny files: any splitting can slow you down because each file you have incurs overhead with opening and closing it (and can reduce sequential accesses because filesystems are less likely to place these specific files sequentially on disk). Granted, if the files are large enough, the cost is pretty small, but it doesn't mean you should arbitrarily introduce splitting with the assumption that it improves performance. Most SSDs are limited by SATA bandwidth, so around 500 MB/s max. If you're using a cloud hosting platform, you'll also be limited by network bandwidth, so possibly even less. Even if you're using a high-performance modern NVMe drive, you're still stuck with needing to do kernel traps when issuing reads. As such, the point stands that you shouldn't optimise based on CPU cache, but on whatever is most efficient for I/O. 1 GB is likely too big, but 1 MB is likely too small. SSDs obviously have different performance characteristics to HDDs, but fortunately, for sequential access, the guidelines for the two are the same.

Jens Alfke says (August 21, 2021 at 5:22 pm):

Memory-mapping the file (e.g. with mmap) can give you the best of both worlds. You can code as though the file were all in memory, which is simpler, and you can even backtrack or peek forward if necessary. But the computer doesn't have to copy the whole file into RAM. (Depending on the OS, you may lose some performance because the file is read in smaller chunks, possibly as small as 4 KB. But in practice the kernel will usually see your access pattern and read ahead in bigger chunks, which incidentally gives you async I/O for free.)

Daniel Lemire says (August 21, 2021 at 6:11 pm):

Memory mapping is a great approach, but underneath it is limited by the same fundamental principles. It does not bypass CPU cache and RAM. (I am aware that you know this, but I am clarifying it.)
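For readers curious about the memory-mapping approach discussed in the comments above, here is a minimal POSIX sketch (my own illustration, not code from the post or the commenters); as the last reply notes, the data still flows through RAM and the CPU cache, so the fundamental costs do not go away.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    // Memory-map a file and scan it: the kernel pages data in as it is touched,
    // so the whole file is never copied into the process's heap up front.
    int main(int argc, char **argv) {
      if (argc < 2) { return 1; }
      int fd = open(argv[1], O_RDONLY);
      if (fd < 0) { return 1; }
      struct stat st;
      if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }
      size_t size = static_cast<size_t>(st.st_size);
      void *p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (p == MAP_FAILED) { close(fd); return 1; }
      const char *data = static_cast<const char *>(p);
      uint64_t checksum = 0; // stand-in for real processing
      for (size_t i = 0; i < size; i++) { checksum += static_cast<unsigned char>(data[i]); }
      munmap(p, size);
      close(fd);
      std::printf("checksum: %llu\n", static_cast<unsigned long long>(checksum));
      return 0;
    }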