The big-load anti-pattern

Daniel Lemire, August 21, 2021

When doing data engineering, it is common for engineers to want to first load all of the data in memory before processing it. If you have sufficient memory and the loaded data is not ephemeral, or if you have small volumes, it is a sensible approach. After all, that is how a spreadsheet typically works: you load the whole spreadsheet in memory. But what if your application falls in the "extract-transform-load" category, where you discard the data as you scan it? What if you have large volumes of data? Then I consider "load everything in memory at once" a performance anti-pattern when you are doing high-performance data engineering.

The most obvious problem with a big load is scalability. As your data inputs get larger and larger, you consume more and more memory. Though it is true that over time we get more memory, we also tend to get more processing cores, and RAM is a shared resource. Currently, in 2021, some of the most popular instance types on the popular cloud system AWS have 4 GB per virtual CPU. If you have the means, AWS will provide you with memory-optimized virtual nodes that have 24 TB of RAM. However, these nodes have 448 logical processors sharing that memory.

Frequent and large memory allocations are somewhat risky. A single memory allocation failure may force an entire process to terminate. On a server, it might be difficult to anticipate what other processes do and how much memory they use. Thus each process should seek to keep its memory usage predictable, if not stable. Simply put, it is nicer to build your systems so that, as much as possible, they use a constant amount of memory irrespective of the input size.
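To make the pattern concrete, here is a minimal sketch of the big-load shape (my own illustration, not code from any particular system): the entire input is materialized in one buffer before any processing begins, so memory usage grows linearly with the input.

    #include <fstream>
    #include <iostream>
    #include <vector>

    // Big-load shape: the whole file is resident in memory before the
    // "transform" step even starts, so memory grows with the input size.
    std::vector<char> load_whole_file(const char *path) {
      std::ifstream in(path, std::ios::binary | std::ios::ate);
      if (!in) { return {}; }
      std::streamsize size = in.tellg();
      in.seekg(0, std::ios::beg);
      std::vector<char> content(static_cast<size_t>(size));
      in.read(content.data(), size);
      return content;
    }

    int main(int argc, char **argv) {
      if (argc < 2) { return 1; }
      std::vector<char> data = load_whole_file(argv[1]);
      std::cout << "loaded " << data.size() << " bytes" << std::endl;
      // ... only now would the actual processing begin ...
      return 0;
    }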
If you are designing a web service and you put a hard limit on the size of a single result, you will help engineers build better clients. You may also encounter various limits which reduce your portability. Not every cloud framework will allow you to upload a 40 GB file at once, without fragmentation. And, of course, on-device processing in a mobile setting becomes untenable if you have no bound on the data inputs.

But what about performance? If you have inefficient code (maybe written in JavaScript or bad C++), then you should have no worries. But if you are concerned with performance, the story gets more intricate. If you are processing the data in tiny increments, you can keep most of the data that you are consuming in CPU cache. However, if you are using a big load, then you need to allocate a large memory region, initialize it, fill it up and then read it again. The data goes from the CPU to the RAM and back again. The process is relatively expensive.

To illustrate the point, I wrote a little benchmark. I consider a function which allocates memory and populates an array of integers with the values 0, 1, 2, and so forth:

    int *content = new int[volume / sizeof(int)];
    init(content, volume);
    delete[] content;

It is a silly function: everything you would do that involves memory allocation is likely far slower. So how fast is this fast function? I get the following numbers on two of my machines. I pick the best results within each run.

                                            1 MB       1 GB
    alloc-init (best), AMD Rome (Linux)     33 GB/s    7 GB/s
    alloc-init (best), Apple M1             30 GB/s    9 GB/s

Simply put, allocating memory and pushing data into it gets slower and slower with the volume. We can explain it in terms of CPU cache and RAM, but the principle is entirely general. You may consider 7 GB/s or 9 GB/s to be a good speed, and indeed these processors and operating systems are efficient. However, consider that this is only your starting point: we have not read the data yet. If you need to actually "read" that data, let alone transform it or do any kind of reasoning over it, you must then bring it back from RAM to cache. So you have the full cache-to-RAM and RAM-to-cache cycle. Unavoidably, your speed will start to fall... 5 GB/s, 2 GB/s... and soon you will be in the megabytes per second.

Your pipeline could be critically bounded because it is built out of slow software (e.g., JavaScript code) or because you are relying on slow networks and disks. To be fair, if the rest of your pipeline runs in the megabytes per second, then memory allocation might as well be free from a speed point of view. That is why I qualify the big load as an anti-pattern for high-performance data engineering.

In a high-performance context, for efficiency, you should stream through the data as much as possible, reading it in chunks that are roughly the size of your CPU cache (e.g., megabytes). The best chunk size depends on many parameters, but it is typically neither tiny (kilobytes) nor large (gigabytes). If you bypass such an optimization as part of your system's architecture, you may have hard limits on your performance later.
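As a rough illustration of this kind of streaming, here is a minimal sketch (my own, using an arbitrary 1 MB chunk size rather than a value prescribed by the post): a single buffer is allocated once and reused, so memory usage stays constant no matter how large the input is.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Stream through a file in fixed-size chunks: the buffer is allocated once
    // and reused, so memory usage does not grow with the size of the input.
    int main(int argc, char **argv) {
      if (argc < 2) { return 1; }
      std::FILE *f = std::fopen(argv[1], "rb");
      if (f == nullptr) { return 1; }
      const size_t chunk_size = 1 << 20; // 1 MB: roughly cache-sized
      std::vector<char> buffer(chunk_size);
      uint64_t checksum = 0; // stand-in for whatever per-chunk processing you need
      size_t n;
      while ((n = std::fread(buffer.data(), 1, buffer.size(), f)) > 0) {
        for (size_t i = 0; i < n; i++) {
          checksum += static_cast<unsigned char>(buffer[i]);
        }
      }
      std::fclose(f);
      std::printf("checksum: %llu\n", static_cast<unsigned long long>(checksum));
      return 0;
    }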
It is best to view the processor as a dot at the middle of a sequence of concentric circles. The processor is hard of hearing: it can only communicate with people in the inner circle. But there is limited room in each circle. The further you are from the processor, the more expensive it is for the processor to talk to you, because you first need to move to the center, possibly pushing out some other folks. The room close to the processor is crowded and precious. So if you can, you should have your guests come into the center once, and then exit forever. What a big load tends to do is to get people into the inner circle, then out to some remote circle, and then back again into the inner circle. It works well when there are few guests because everyone gets to stay in the inner circle or nearby, but as more and more people come in, it becomes less and less efficient.

It does not matter how your code looks: if you need to fully consume all of a large data file before you process it, you have a case of big load. Whether or not you are using fancy techniques such as memory file mapping does not change the equation. Some parameters, like the size of your pages, may help however.

How may you avoid the big-load anti-pattern?

* Within the files themselves, you should have some kind of structure so that you do not need to consume the whole file at once when it is large. It comes naturally with popular formats such as CSV, where you can often consume just one line at a time (see the sketch after this list). If you are working with JSON data files, you may want to adopt JSON streaming for an equivalent result. Most data-engineering formats will support some concept of chunk or page to help you.
* Consider splitting your data. If you have a database engine, you may consider sharding. If you are working with large files, you may want to use smaller files. You should be cautious not to fall for the small-load anti-pattern: e.g., do not store only a few bytes per file and do not fragment your web applications into 400 loadable resources.
* When compressing data, try to make sure you can uncompress small usable chunks (a few megabytes).
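Here is the sketch referenced in the first item: a minimal, hypothetical example (my own, not from the post) that consumes a CSV-like file one line at a time, so only the current record needs to be held in memory.

    #include <fstream>
    #include <iostream>
    #include <string>

    // Consume a CSV-like file one record (line) at a time: memory usage is
    // bounded by the length of the longest line, not by the size of the file.
    int main(int argc, char **argv) {
      if (argc < 2) { return 1; }
      std::ifstream in(argv[1]);
      std::string line;
      size_t records = 0;
      while (std::getline(in, line)) {
        // process_line(line); // hypothetical per-record processing
        records++;
      }
      std::cout << records << " records" << std::endl;
      return 0;
    }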
5 thoughts on "The big-load anti-pattern"

--- says (August 21, 2021 at 12:34 pm):

    roughly the size of your CPU cache

That's often too small, although if you're talking about L3 caches on many present-day CPUs, it's probably okay. The cost of context switching and I/O is significantly worse than RAM speed on most setups, so you often want to tune for I/O as opposed to CPU cache, such as making large read requests to disk (particularly if it's a hard disk). What's probably more important is doing async I/O, so your processing can occur whilst the disk is pulling in data for you. Also, avoid random I/O, as sequential I/O is generally fastest (i.e. be careful with reading too many things at the same time). Since I/O costs typically dwarf RAM speeds, large files are better than smaller ones (lower random access costs and fewer performance penalties from opening files). It's actually why many games go to the effort of packing all their assets into archives: it is faster to deal with one big file than many smaller ones.

Daniel Lemire says (August 21, 2021 at 1:53 pm):

Yes, using many tiny files would be a performance anti-pattern of its own (which I refer to as "small-load"). Note that in my context, I assume a high-performance SSD. If you are using a spinning drive, then you may be limited to a few megabytes per second (if that) and my whole post is irrelevant.

--- says (August 21, 2021 at 10:48 pm):

It's not just many tiny files: any splitting can slow you down because each file you have incurs overhead with opening and closing it (and can reduce sequential accesses because filesystems are less likely to place these specific files sequentially on disk). Granted, if the files are large enough, the cost is pretty small, but it doesn't mean you should arbitrarily introduce splitting with the assumption that it improves performance. Most SSDs are limited by SATA bandwidth, so around 500 MB/s max. If you're using a cloud hosting platform, you'll also be limited by network bandwidth, so possibly even less. Even if you're using a high-performance modern NVMe drive, you're still stuck with needing to do kernel traps when issuing reads. As such, the point stands that you shouldn't optimise based on CPU cache, but on whatever is most efficient for I/O. 1 GB is likely too big, but 1 MB is likely too small. SSDs obviously have different performance characteristics to HDDs, but fortunately, for sequential access, the guidelines for the two are the same.

Jens Alfke says (August 21, 2021 at 5:22 pm):

Memory-mapping the file (e.g. with mmap) can give you the best of both worlds. You can code as though the file were all in memory, which is simpler, and you can even backtrack or peek forward if necessary. But the computer doesn't have to copy the whole file into RAM. (Depending on the OS, you may lose some performance because the file is read in smaller chunks, possibly as small as 4 KB. But in practice the kernel will usually see your access pattern and read ahead in bigger chunks, which incidentally gives you async I/O for free.)

Daniel Lemire says (August 21, 2021 at 6:11 pm):

Memory mapping is a great approach, but underneath it is limited by the same fundamental principles. It does not bypass CPU cache and RAM. (I am aware that you know this, but I am clarifying it.)
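For readers curious about the memory-mapping approach discussed in the comments above, here is a minimal POSIX sketch (my own illustration, not code from the post or the commenters); as the last reply notes, the data still flows through RAM and the CPU cache, so the fundamental costs do not go away.

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdint>
    #include <cstdio>

    // Memory-map a file and scan it: the kernel pages data in as it is touched,
    // so the whole file is never copied into the process's heap up front.
    int main(int argc, char **argv) {
      if (argc < 2) { return 1; }
      int fd = open(argv[1], O_RDONLY);
      if (fd < 0) { return 1; }
      struct stat st;
      if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return 1; }
      size_t size = static_cast<size_t>(st.st_size);
      void *p = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (p == MAP_FAILED) { close(fd); return 1; }
      const char *data = static_cast<const char *>(p);
      uint64_t checksum = 0; // stand-in for real processing
      for (size_t i = 0; i < size; i++) { checksum += static_cast<unsigned char>(data[i]); }
      munmap(p, size);
      close(fd);
      std::printf("checksum: %llu\n", static_cast<unsigned long long>(checksum));
      return 0;
    }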