https://memgraph.com/blog/in-memory-vs-disk-based-databases-larger-than-memory-architecture
In-memory vs. disk-based databases: Why do you need a larger than memory architecture?

Andi Skrgat
August 31, 2023
Topics: Under the Hood

Memgraph is an in-memory graph database that recently added support for working with data that cannot fit into memory. This allows users with smaller budgets to still load large graphs into Memgraph without paying for more expensive RAM. However, extending a main-memory graph database to support disk storage is a complex engineering endeavor. Let's break the process down into pieces.

On-disk databases

Disk-based databases have long been a de facto standard in the database world. Their big advantage is the ability to store vast amounts of data relatively cheaply on disk. However, developing them can be very complex because of the interaction with low-level OS primitives. Fetching data from disk is something everyone strives to avoid, since it is typically orders of magnitude slower than reading the same data from main memory. Neo4j is an example of an on-disk graph database: it uses disk as its main storage medium while trying to cache as much data as possible in main memory so that it can be reused afterward.

[Figure: disk-oriented DBMS]

In-memory databases

In-memory databases avoid the fundamental cost of accessing data on disk by simply storing all their data in main memory. Such an architecture also significantly simplifies the development of the storage part of the database, since there is no need for a buffer pool. However, the biggest issue with in-memory databases arises when the data cannot fit into random access memory, since the only way out is to move the data to a larger and, consequently, more expensive machine. In-memory database users rely on the fact that durability is still ensured through mechanisms such as transaction logging and snapshots, so that data loss does not occur.

Larger-than-memory architecture

Main memory computation

Larger-than-memory architecture describes a database architecture in which the majority of computation still happens in main memory, but the database can also store a vast amount of data on disk, without the complexity of interacting with buffer pools.

Identify hot & cold data

The larger-than-memory architecture exploits the fact that, in terms of access, a database always has hot and cold parts. The goal is to find the cold data and move it to disk so that transactions still have fast access to hot data.
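
To make the hot and cold split more concrete, here is a minimal sketch in Python. It is not how Memgraph implements it: the `TieredStore` class, its `move_cold_to_disk` method, and the `keep_hot` parameter are hypothetical names chosen for illustration, "disk" is simulated with a plain dictionary, and "coldness" is assumed to mean "least recently accessed".

```python
from dataclasses import dataclass, field


@dataclass
class TieredStore:
    """Toy key-value store: hot records in memory, cold records on simulated disk."""

    memory: dict = field(default_factory=dict)       # hot tier
    disk: dict = field(default_factory=dict)         # stand-in for on-disk storage
    last_access: dict = field(default_factory=dict)  # key -> logical time of last access
    clock: int = 0                                   # logical clock, bumped on every access

    def _touch(self, key):
        # Record that `key` was just used, so it counts as hot.
        self.clock += 1
        self.last_access[key] = self.clock

    def put(self, key, value):
        self.disk.pop(key, None)  # a fresh write makes the record hot again
        self.memory[key] = value
        self._touch(key)

    def get(self, key):
        if key not in self.memory:
            # Cold read: bring the record back into the hot tier.
            # Raises KeyError if the key exists in neither tier.
            self.memory[key] = self.disk.pop(key)
        self._touch(key)
        return self.memory[key]

    def move_cold_to_disk(self, keep_hot):
        """Keep the `keep_hot` most recently accessed records in memory, move the rest."""
        by_recency = sorted(self.memory, key=lambda k: self.last_access[k], reverse=True)
        cold_keys = by_recency[keep_hot:]
        for key in cold_keys:
            self.disk[key] = self.memory.pop(key)
        return cold_keys


store = TieredStore()
for name in ("a", "b", "c", "d"):
    store.put(name, name.upper())
store.get("a")                               # "a" becomes the most recently used record
moved = store.move_cold_to_disk(keep_hot=2)  # the two least recently used records go cold
```

A real system would track accesses far more cheaply than a per-record timestamp and would write to actual storage, but the shape is the same: record access recency, pick the least recently used records, and move them out of memory.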
Cold data identification can be done either online, by directly tracking transactions' access patterns, or offline, by having a background thread analyze the data.

The second very important feature of the larger-than-memory architecture is the process of evicting cold data. This can be done in two ways:

1. The database tracks memory usage and starts evicting data as soon as usage reaches a predefined threshold.
2. Data is evicted only when room for new data is needed.

Transaction management

Different systems also behave differently regarding transaction management. If a transaction needs data that is currently stored on disk, the system can:

1. Abort the transaction, fetch the data from disk, and restart the transaction.
2. Stall the transaction while synchronously fetching the data from disk.

Transaction must fit into memory

The question is, what happens when the transaction data cannot fit into random access memory? In Memgraph, we decided to start with the approach that all transaction data must fit into memory. This means that some analytical queries cannot be executed on a large dataset, but this is a tradeoff we were willing to accept in the first iteration.

[Figure: memory DBMS]

Benefits of larger-than-memory databases

Memgraph uses RocksDB as a key-value store for extending the capabilities of the in-memory database. Without going into too much detail about RocksDB, let's briefly mention that it is based on a data structure called the Log-Structured Merge-Tree (LSMT) rather than the B-Tree, which is typically the default option in databases. LSM trees are stored on disk and, by design, come with much smaller write amplification than B-Trees. The in-memory version of Memgraph uses Delta storage to support multi-version concurrency control (MVCC).
However, for larger-than-memory storage, we decided to use the optimistic concurrency control (OCC) protocol, since we assumed conflicts would rarely happen and we could make use of RocksDB's transactions without a custom layer of complexity like the one Delta storage requires. We implemented OCC so that every transaction has its own private workspace, and potential conflicts are detected at commit time.

One of our primary requirements before starting to add disk-based data storage was not to ruin the performance of the main-memory storage. Although we all knew there is no such thing as a zero-cost abstraction, we managed to stay within 10% of the original version's performance. We chose snapshot isolation as the isolation level, since we believed it could be the default option for the large majority of Memgraph users.

Disadvantages of larger-than-memory databases

As always, not everything is sunshine and flowers, especially when introducing such a significant feature to an existing database, so there are still improvements to be made. First, the requirement that a single transaction must fit into memory makes it impossible to run large analytical queries. It also makes our LOAD CSV command for importing CSV files practically unusable for large files, since the command is executed as a single transaction. Second, although RocksDB is really good, fits well into our codebase, and has proved to be very efficient in its caching mechanisms, maintaining an external library is always hard.

In retrospect

Despite being a significant engineering endeavor, the larger-than-memory architecture is a very valuable asset to Memgraph users, since it allows them to store large amounts of data cheaply on disk without sacrificing the performance of in-memory computation. We are actively working on resolving issues introduced with the new storage mode, so feel free to ask questions, open an issue, or open a pull request. We will be more than happy to help.

Until next time