https://memgraph.com/blog/in-memory-vs-disk-based-databases-larger-than-memory-architecture

In-memory vs. disk-based databases: Why do you need a larger than
memory architecture?

Andi Skrgat
August 31, 2023
Topics: Under the Hood

Memgraph is an in-memory graph database that recently added support
for working with data that cannot fit into memory. This allows users
with smaller budgets to load large graphs into Memgraph without
paying for more (expensive) RAM. However, extending a main-memory
graph database to support disk storage is a complex engineering
endeavor. Let's break the process down into pieces.

On-disk databases

Disk-based databases have long been the de facto standard in the
database development world. Their huge advantage lies in their
ability to store a vast amount of data relatively cheaply on disk.
However, developing them can be very complex because of the
interaction with low-level OS primitives. Fetching data from disk is
something everyone strives to avoid, since it is far slower than
reading from main memory (at least an order of magnitude, and often
much more, depending on the storage device and access pattern).
Neo4j is an example of an on-disk graph database: it uses disk as its
main storage medium while caching as much data as possible in main
memory so that it can be reused afterward.
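
To make the caching idea concrete, here is a minimal sketch of a
least-recently-used (LRU) cache sitting in front of a slow store. It
is an illustration only, not how Neo4j or any other database actually
implements its cache; the `PageCache` class and the dictionary that
stands in for the disk are assumptions made for this example.

```python
from collections import OrderedDict


class PageCache:
    """Tiny LRU cache in front of a slow 'disk' (here just a dict)."""

    def __init__(self, disk, capacity):
        self.disk = disk            # hypothetical stand-in for on-disk pages
        self.capacity = capacity    # maximum number of pages kept in memory
        self.pages = OrderedDict()  # page_id -> data, least recently used first
        self.disk_reads = 0

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        self.disk_reads += 1                 # cache miss: go to "disk"
        data = self.disk[page_id]
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the least recently used page
        return data


disk = {1: "a", 2: "b", 3: "c"}
cache = PageCache(disk, capacity=2)
for page_id in [1, 2, 1, 3, 1, 2]:
    cache.get(page_id)
print(cache.disk_reads)  # number of accesses that had to touch the "disk"
```

Every hit is served from memory; only misses pay the cost of a disk
read, which is why on-disk systems try to keep the working set cached.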

[Figure: disk-oriented DBMS]

In-memory databases

In-memory databases avoid the fundamental cost of accessing data from
disk by simply storing all their data in main memory. Such an
architecture also significantly simplifies the development of the
storage part of the database since there is no need for a buffer
pool. However, the biggest issue with in-memory databases arises when
the data cannot fit into random access memory, since the only way out
is to move the data to a larger and, consequently, more expensive
machine.

In-memory database users still get durability through mechanisms
like transaction logging and snapshots, so data loss does not occur.
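
The following sketch shows the general idea behind combining a
write-ahead log with snapshots. It is a simplified illustration and
not Memgraph's actual durability code; the `DurableKV` class, the file
names, and the JSON encoding are all assumptions made for this
example.

```python
import json
import os
import tempfile


class DurableKV:
    """In-memory key-value store with a write-ahead log and snapshots."""

    def __init__(self, directory):
        self.wal_path = os.path.join(directory, "wal.log")
        self.snap_path = os.path.join(directory, "snapshot.json")
        self.data = {}
        self._recover()

    def _recover(self):
        # Load the latest snapshot, then replay the log written after it.
        if os.path.exists(self.snap_path):
            with open(self.snap_path) as f:
                self.data = json.load(f)
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def put(self, key, value):
        # Persist the change to the log before applying it in memory.
        with open(self.wal_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value

    def snapshot(self):
        # Write the full state atomically, then truncate the log.
        tmp_path = self.snap_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.snap_path)
        open(self.wal_path, "w").close()


with tempfile.TemporaryDirectory() as directory:
    db = DurableKV(directory)
    db.put("a", 1)
    db.snapshot()
    db.put("b", 2)
    recovered = DurableKV(directory)  # simulates a restart after a crash
    print(recovered.data)
```

After the simulated restart, the state is rebuilt from the snapshot
plus the log entries written after it, even though all reads and
writes are served from memory during normal operation.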

Larger-than-memory architecture

Main memory computation

Larger-than-memory architecture describes a database architecture in
which the majority of computations still happen in main memory, but
the database can also store a vast amount of data on disk without
taking on the complexity of interacting with buffer pools.

Identify hot & cold data

The larger-than-memory architecture exploits the fact that a database
always has hot and cold parts in terms of how often they are
accessed. The goal is to identify cold data and move it to disk so
that transactions still have fast access to hot data. Cold data can
be identified either online, by directly tracking transactions'
access patterns, or offline, by a background thread that analyzes the
data.
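
As a rough illustration of the online approach, the sketch below
records a logical timestamp on every access and treats anything not
touched recently as cold. The `AccessTracker` class and the `max_age`
parameter are hypothetical names chosen for this example; real systems
use more elaborate (and cheaper) bookkeeping.

```python
class AccessTracker:
    """Tracks the last access of each key using a logical clock."""

    def __init__(self):
        self.clock = 0
        self.last_access = {}  # key -> clock value of the most recent access

    def record(self, key):
        self.clock += 1
        self.last_access[key] = self.clock

    def cold_keys(self, max_age):
        # A key is cold if it has not been accessed in the last max_age ticks.
        return sorted(
            key
            for key, accessed_at in self.last_access.items()
            if self.clock - accessed_at > max_age
        )


tracker = AccessTracker()
for key in ["a", "b", "c", "a", "a", "c"]:
    tracker.record(key)
print(tracker.cold_keys(max_age=3))  # -> ['b']
```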

The second important aspect of the larger-than-memory architecture is
how cold data gets evicted. This can be done in two ways:

 1. The database tracks memory usage and starts evicting data as soon
    as usage reaches a predefined threshold.
 2. The database evicts data only when space is needed for new data.
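
The two eviction strategies above can be sketched as follows. This is
a toy model under simplifying assumptions: the number of entries
stands in for memory usage, insertion and update order stands in for
"coldness", and the `TieredStore` class with its `capacity` and
`threshold` parameters is hypothetical.

```python
from collections import OrderedDict


class TieredStore:
    """Keeps hot entries in memory and moves the coldest ones to 'disk'."""

    def __init__(self, capacity, threshold=None):
        self.capacity = capacity     # hard limit on in-memory entries
        self.threshold = threshold   # None means: evict only on demand
        self.memory = OrderedDict()  # coldest entry first
        self.disk = {}               # stand-in for disk storage

    def _evict_one(self):
        key, value = self.memory.popitem(last=False)
        self.disk[key] = value

    def put(self, key, value):
        self.disk.pop(key, None)
        if self.threshold is None and key not in self.memory:
            # Strategy 2: make room only when the new entry needs it.
            while len(self.memory) >= self.capacity:
                self._evict_one()
        self.memory[key] = value
        self.memory.move_to_end(key)
        if self.threshold is not None:
            # Strategy 1: evict as soon as usage reaches the threshold.
            while len(self.memory) >= self.threshold:
                self._evict_one()


eager = TieredStore(capacity=4, threshold=3)
lazy = TieredStore(capacity=4)
for key in ["a", "b", "c", "d", "e"]:
    eager.put(key, key.upper())
    lazy.put(key, key.upper())
print(list(eager.memory), list(lazy.memory))
```

The threshold-based variant keeps headroom in memory at the price of
evicting earlier; the on-demand variant uses all available memory but
pays the eviction cost at the moment new data arrives.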

Transaction management

Different systems also behave differently regarding transaction
management. If a transaction needs data that is currently stored on
disk, the system can:

 1. Abort the transaction, fetch the data from disk, and restart the
    transaction.
 2. Stall the transaction while synchronously fetching the data from
    disk.
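
The difference between the two options can be illustrated with a small
sketch. Everything here (the `Store` class, the `NeedsFetch`
exception, and the two runner functions) is hypothetical and only
models the control flow, not any particular database.

```python
class NeedsFetch(Exception):
    """Raised when a transaction touches data that is only on disk."""

    def __init__(self, key):
        super().__init__(key)
        self.key = key


class Store:
    def __init__(self, memory, disk):
        self.memory = dict(memory)
        self.disk = dict(disk)

    def fetch(self, key):
        # Move one entry from "disk" into memory.
        self.memory[key] = self.disk.pop(key)


def run_abort_and_restart(store, txn):
    """Option 1: abort on a miss, fetch the data, restart from scratch."""

    def read(key):
        if key not in store.memory:
            raise NeedsFetch(key)
        return store.memory[key]

    restarts = 0
    while True:
        try:
            return txn(read), restarts
        except NeedsFetch as miss:
            store.fetch(miss.key)
            restarts += 1


def run_stalling(store, txn):
    """Option 2: block inside the transaction until the data is loaded."""

    def read(key):
        if key not in store.memory:
            store.fetch(key)  # synchronous fetch; the transaction waits
        return store.memory[key]

    return txn(read)


def total(read):
    return read("a") + read("b") + read("c")


result_restart = run_abort_and_restart(Store({"a": 1}, {"b": 2, "c": 3}), total)
result_stall = run_stalling(Store({"a": 1}, {"b": 2, "c": 3}), total)
print(result_restart, result_stall)
```

Both variants compute the same result. The first one redoes work after
every miss, while the second one holds the transaction open during the
disk read.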

Transaction must fit into memory

The question is, what happens when the transaction data cannot fit
into random access memory? In Memgraph, we decided to start with the
requirement that all transaction data must fit into memory. This
means that some analytical queries cannot be executed on a large
dataset, but this is a tradeoff we were willing to accept in the
first iteration.

[Figure: memory DBMS]

Benefits of larger-than-memory databases

Memgraph uses RocksDB as a key-value store to extend the
capabilities of the in-memory database. Without going into too much
detail about RocksDB, let's just briefly mention that it is based on
a data structure called the Log-Structured Merge-Tree (LSMT) instead
of the B-Tree, which is typically the default option in databases.
LSMTs are persisted on disk and, by design, typically come with much
smaller write amplification than B-Trees.
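
For readers unfamiliar with the structure, here is a heavily
simplified sketch of the LSM-tree idea: writes go to an in-memory
table, full tables are flushed as immutable sorted runs, and reads
check the newest data first. This is not RocksDB's API or
implementation; `TinyLSM` and `memtable_limit` are names invented for
this example, and the runs live in a Python list rather than in files.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}  # recent writes, kept in memory
        self.runs = []      # sorted (key, value) lists, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, immutable run.
        if self.memtable:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:  # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Merge all runs into one, keeping only the newest value per key.
        merged = {}
        for run in reversed(self.runs):  # oldest first, newer overwrite
            merged.update(run)
        self.runs = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)
print(db.get("a"), db.get("b"), len(db.runs))
db.compact()
print(db.runs)
```

Because each flush writes a whole run sequentially and existing runs
are never modified in place, updates avoid the in-place page rewrites
that a B-Tree performs.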

The in-memory version of Memgraph uses Delta storage to support
multi-version concurrency control (MVCC). For larger-than-memory
storage, however, we decided to use optimistic concurrency control
(OCC): we assumed conflicts would rarely happen, and this let us rely
on RocksDB's transactions without the custom layer of complexity
that Delta storage requires.

We've implemented OCC so that every transaction has its own private
workspace and potential conflicts are detected at commit time. One of
our primary requirements before starting to add disk-based data
storage was not to ruin the performance of the main memory-based
storage. Although we all knew there was no such thing as a zero-cost
abstraction, we managed to stay within 10% of the original version's
performance. We chose snapshot isolation as the isolation level since
we believed it could be the default option for the large majority of
Memgraph users.
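
The private-workspace idea can be sketched as follows. This is a
generic illustration of optimistic concurrency control with
commit-time write-write validation, not Memgraph's or RocksDB's
implementation. The `Database`, `Transaction`, and `ConflictError`
names are hypothetical, and for brevity reads return the latest
committed value instead of a true snapshot.

```python
class ConflictError(Exception):
    pass


class Database:
    def __init__(self):
        self.data = {}           # key -> committed value
        self.versions = {}       # key -> commit number of the last write
        self.commit_counter = 0

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, db):
        self.db = db
        self.start = db.commit_counter  # commits visible when we started
        self.workspace = {}             # private, uncommitted writes

    def read(self, key):
        if key in self.workspace:
            return self.workspace[key]
        return self.db.data.get(key)

    def write(self, key, value):
        self.workspace[key] = value     # nothing is shared until commit

    def commit(self):
        # Validate: fail if someone committed a write to our keys
        # after this transaction started.
        for key in self.workspace:
            if self.db.versions.get(key, 0) > self.start:
                raise ConflictError(key)
        self.db.commit_counter += 1
        for key, value in self.workspace.items():
            self.db.data[key] = value
            self.db.versions[key] = self.db.commit_counter


db = Database()
t1, t2 = db.begin(), db.begin()
t1.write("x", 1)
t2.write("x", 2)
t1.commit()
try:
    t2.commit()
    conflict = False
except ConflictError:
    conflict = True
print(db.data, conflict)
```

No locks are taken while the transactions run. The cost of a conflict
is paid only by the transaction that loses at commit time, which is a
good trade when conflicts are rare.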

Disadvantages of larger-than-memory databases

As always, not everything is sunshine and flowers, especially when
introducing such a significant feature to an existing database, so
there are still improvements to be made. First, the requirement that
a single transaction must fit into memory makes it impossible to run
large analytical queries.

It also makes our LOAD CSV command for importing CSV files
practically unusable for large imports, since the command is executed
as a single transaction. Finally, although RocksDB is really good,
fits really well into our codebase, and has proved to be very
efficient in its caching mechanisms, maintaining an external library
is always hard.

In retrospect

Despite the significant engineering effort, the larger-than-memory
architecture is a very valuable asset to Memgraph users since it
allows them to store large amounts of data cheaply on disk without
sacrificing the performance of in-memory computation. We are actively
working on resolving issues introduced with the new storage mode, so
feel free to ask questions, open an issue, or submit a pull request.
We will be more than happy to help. Until next time!
