S3 is files, but not a filesystem
March 2024
"Deep" modules, mismatched interfaces - and why SAP is so painful
[Image: a box labelled "CAL'S MISC" - my very own "object store"]
Amazon S3 is the original cloud technology: it came out in 2006.
"Objects" were popular at the time and S3 was labelled an "object
store", but everyone really knows that S3 is for files. S3 is a cloud
filesystem, not an object-whatever.
I think the idea that S3 is really "Amazon Cloud Filesystem" is a bit
of a load-bearing fiction. It's sort of true: S3 can store files. It's
also a very useful belief, because it helps get people to adopt S3, a
fundamentally good technology, which otherwise they might not. But
it's false: S3 is not a filesystem and can't stand in for one.
What filesystems are about, and module "depth"
The unix file API is pretty straightforward. There are just five
basic functions. They don't take many arguments.
Here are (the Python versions of) these five basic functions:
# open a file (for reading and writing, in binary mode)
file = open(filepath, "r+b")
# read from that file (moving the position forward)
file.read(100)  # returns up to 100 bytes
# write to that file (moving the position forward)
file.write(b"hello, world")
# move the position to byte 94
file.seek(94)
# close the file
file.close()
Well, perhaps I should add an asterisk: I am simplifying a bit. There
are loads more calls than that. But still, those five calls are the
irreducible nub of the file API. They're all you need to read and
write files.
Those five functions handle a lot of concerns:
* buffering
* the page cache
* fragmentation
* permissions
* IO scheduling
* and whatever else
The file API handles all of those concerns, but it doesn't expose
them to you. A narrow interface handling a large number of concerns -
that is what makes the unix file API a "deep" module.
[Diagram: a deep module vs a shallow module]
Deep modules are great because you can benefit from their features -
like wear-levelling on SD cards - but without bearing the psychic
toll of thinking about any of it as you save a jpeg to your phone.
Happy days.
But if the file API is "deep", what sorts of things are "shallow"?
A shallow module would have a relatively large API surface in
proportion to what it's handling for you. One hint these days that a
module is shallow is that the interface to it is YAML. YAML appears
to be a mere markup language but in practice is a reusable syntax
onto which almost any semantics can be plonked.
Often YAML works as the "Programming language of DevOps" and
programming languages provide about the widest interface possible.
Examine your YAML micro-language closely. Does it offer a looping
construct? If so, it's likely Turing complete.
But sometimes it is hard to package something up nicely with a bow on
top. SQL ORMs are inherently a leaky abstraction. You can't use them
without some understanding of SQL. So being shallow isn't inherently
a criticism. Sometimes a shallow module is the best that can be done.
But all else equal, deeper is better.
What S3 is about (it is deep too)
The unix file API was in place by the early 1970s. The interface has
been retained and the guts have been re-implemented many times for
compatibility reasons.
But Amazon S3 does not reimplement the unix filesystem API.
It has a wholly different arrangement and the primitives are only
partly compatible. Here's a brief description of the calls that are
analogous to the above five basic unix calls:
# Read (part) of an object
GetObject(Bucket, Key, Range=None) # contents is the HTTP body
# Write an (entire) object
PutObject(Bucket, Key) # send contents as HTTP body
# er, that's it!
Two functions versus five. That's right, the S3 API is simpler than
the unix file API. There is one additional concept ("buckets") but I
think when you net it out, S3's interface-to-functionality ratio is
even better than the unix file API.
But something is missing. While you can partially read an object
using the Range argument to GetObject, you can't overwrite partially.
Overwrites have to be the whole file.
That sounds minor but actually scopes S3 to a subset of the old use
cases for files.
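To make that concrete, here's roughly what those two calls look like
through boto3 (the bucket and key names are made up):

import boto3

s3 = boto3.client("s3")

# reading part of an object is fine: just pass a Range
resp = s3.get_object(
    Bucket="example-bucket", Key="data/file.bin", Range="bytes=0-99"
)
first_hundred_bytes = resp["Body"].read()

# but a write always replaces the entire object
s3.put_object(
    Bucket="example-bucket",
    Key="data/file.bin",
    Body=b"the entire new contents",
)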
Filesystem software, especially databases, can't be ported to Amazon
S3
Databases of all kinds need a place to put their data. Generally,
that place has ended up being various files on the filesystem.
Postgres maintains two or three files per table, plus loads of others
for bookkeeping. SQLite famously stores everything in a single file.
MySQL, MongoDB, Elasticsearch - whatever - they all store data in
files.
Crucially, these databases overwhelmingly rely on the ability to do
partial overwrites. They store data in "pages" (eg 4 or 8 kilobytes
long) in "heap" files where writes are done page by page. There might
be thousands of pages in a single file. Pages are overwritten as
necessary to store whatever data is required. That means partial
overwrites are absolutely essential.
[Diagram: a database heap file. A heap file is full of pages (and
empty slots). Pages are overwritten individually as necessary.]
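In unix file terms, overwriting a single page is nothing more than a
seek and a short write. A sketch (the filename and page size here are
hypothetical):

# overwrite page 42 of a heap file in place (hypothetical 8 KiB pages)
PAGE_SIZE = 8192
new_page = b"\x00" * PAGE_SIZE  # stand-in for the real page contents

with open("table.heap", "r+b") as heap_file:
    heap_file.seek(42 * PAGE_SIZE)  # jump straight to page 42
    heap_file.write(new_page)       # rewrite just those 8 KiB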
Some software projects start with a dream of storing their data in a
'simple' way by combining two well tested technologies: Amazon S3 and
SQLite (or DuckDB). After all, what could be simpler and more
straightforward? Sadly, they go together like oil and water.
When your SQLite database is kept in S3, each write suddenly becomes
a total overwrite of the entire database. While S3 can do big writes
fast, even it isn't fast enough to make that strategy work for any
but the smallest datasets. And you're jettisoning all the
transactional integrity that the database authors have painstakingly
implemented: rewriting the database file each time throws out all
that stuff. On S3, the last write wins.
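In practice, "keeping your SQLite database in S3" ends up meaning
something like this (a sketch - the bucket, key and table names are
made up):

import sqlite3

import boto3

s3 = boto3.client("s3")

# 1. download the entire database file
s3.download_file("example-bucket", "app.sqlite", "/tmp/app.sqlite")

# 2. make one tiny change locally
conn = sqlite3.connect("/tmp/app.sqlite")
conn.execute("UPDATE users SET name = ? WHERE id = ?", ("Cal", 1))
conn.commit()
conn.close()

# 3. upload the entire database file again - and if anyone else did
#    the same in the meantime, the last write wins
s3.upload_file("/tmp/app.sqlite", "example-bucket", "app.sqlite")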
What S3 is good at and what it is bad at
The joy of S3 is that bandwidth ("speed") for reads and writes is
extremely, extremely high. It's not hard to find examples online of
people who have written to or read from S3 at over 10 gigabytes per
second. In fact I once saturated a financial client's office network
with a set of S3 writes.
But the lack of partial overwrites isn't the only problem. There are
a few more.
S3 has no rename or move operation. Renaming is CopyObject and then
DeleteObject, and CopyObject takes time linear in the size of the
file(s). This comes up fairly often when someone has written a lot of
files to the wrong place - moving them back is very slow.
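So a "move" comes out looking something like this (a sketch; bucket
and key names are made up):

import boto3

s3 = boto3.client("s3")

# there is no rename: copy the whole object to its new key...
s3.copy_object(
    Bucket="example-bucket",
    CopySource={"Bucket": "example-bucket", "Key": "wrong/place.csv"},
    Key="right/place.csv",
)  # takes time proportional to the object's size

# ...and then delete the original
s3.delete_object(Bucket="example-bucket", Key="wrong/place.csv")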
And listing files is slow. While the joy of Amazon S3 is that you can
read and write at extremely, extremely high bandwidths, listing out
what is there is much, much slower. Slower than a slow local
filesystem.
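Listing is also a paginated API - at most 1,000 keys come back per
request - so walking a big bucket means a lot of round trips. A
sketch (the bucket and prefix are made up):

import boto3

s3 = boto3.client("s3")

# page through the keys under a prefix, 1,000 at a time
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="example-bucket", Prefix="logs/"):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])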
But S3 is much lower maintenance than a filesystem. You just name the
bucket, name the key and the cloud elves will sort out everything
else. This is worth a lot, as setting up backups, replicating offsite
and provisioning (which, remember, is for IO ops as well as capacity)
are pure drudgework.
Module depth is even more important across organisations
In retrospect it is not a surprise that S3 was the first popular
cloud API. If deep APIs are helpful in containing the complexity
between different modules within a single system (like your
computer), they are even more helpful in containing the complexity of
an interaction between two different businesses, where the costs of
interacting are so much higher.
Consider a converse example. Traditionally, when one business wants
to get its computers working with those of another, they call it
"integration". It is a byword for suffering. Imagine you are tasked
with integrating some Big Enterprise software horror into your
organisation. Something like SAP. Is SAP a deep module? No. The
tragedy of SAP is that almost your entire organisation has to
understand it. Then you have to reconcile it with everything you're
doing. At all times. SAP integration projects are consequently
expensive, massive and regularly fail.
There isn't much less complexity in S3 than there is in a SAP
installation. Amazon named it the "Simple Storage Service" but the
amount of complexity in S3 is pretty frightening. Queueing theory, IO
contention, sharding - the list of problems just goes on and on, in
addition to all the stuff I listed above that filesystems deal with.
(And can you believe they do it all on-prem?)
The "simple" in S3 is a misnomer. S3 is not actually simple. It's
deep.
---------------------------------------------------------------------
Contact/etc
Please do send me an email about this article, especially if you
disagreed with it.
If you liked it, you might like other things I've written.
Find out when I write something new - by email or RSS. Or follow me
on Mastodon.
I have moved to Helsinki and am working to resurrect the Helsinki
Python meetup. If you know someone willing to give a talk or lend us
space to meet, please do get in touch. Our first meeting is probably
going to happen in early April. Join the group on meetup.com to get
an alert when we announce it.
If you enjoyed this article and as a result are feeling charitable
towards me: please try out my side-project, csvbase, or "Github, but
for data tables".
---------------------------------------------------------------------
Other notes
I don't mean to suggest in any way via this article that S3 is not
overpriced for what it is. To rephrase a famous joke about hedge
funds, it often seems like The Cloud is a revenue model masquerading
as a service model.
The concept of deep vs shallow modules comes from John Ousterhout's
excellent book. The book is effectively a list of ideas on software
design. Some are real hits with me, others not, but it is well worth
reading overall. Praise, too, for keeping it succinct.
A few databases are explicitly designed from the start to use the S3
API for storage. Snowflake was. So it's possible - but not
transparently. Snowflake is one of the few I'm aware of (and they
made this decision very early, at least by 2016). If you know of
others - let me know by email.
It isn't just databases that struggle on S3. Many file formats assume
that you'll be able to seek around cheaply and are less performant on
S3 than on disk. Zipfiles are a key example.
Other stuff about S3 that is a matter for regret
I genuinely like S3, so I did not want to create the wrong impression
by including a laundry list of complaints in the middle of the post,
but anyway, here are the other major problems I didn't mention above:
1. The S3 API is only available as XML. JSON was around in 2006 but
XML was still dominant and so it's probably not a surprise that
Amazon picked XML originally. It is a surprise that Amazon never
released a JSON version though - particularly when they made the
switch from SOAP to REST, which would have been a good time.
2. It's also a matter for regret that Amazon gave up on maintaining
the XSD schema as this is one of the key benefits of XML for
APIs. The canonical documentation is just a website now.
3. Criminally, Amazon - like many cloud service providers - have
never produced any kind of local test environment. In Python, the
more diligent test with the moto library. moto is maintained by
volunteers which is weird given that it's a testing tool for a
commercial offering.
4. Amazon S3 does support checksums, but for whatever reason they
are not turned on by default (a sketch of opting in follows this
list). Amazon makes many claims about durability. I haven't heard
of people having problems but equally: I've never seen these
claims tested. I am at least a bit curious about these claims.
5. For years Amazon S3 held one other trap for the unwary: eventual
consistency. If you read a file, then overwrote it, you might
read it back and find it hadn't changed yet. Particularly because
it only happened sometimes, for short periods of time, this
caused all manner of chaos. Other implementors of S3 didn't copy
this property and a few years ago Amazon fixed it in their
implementation.
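On point 4: with recent versions of boto3 you can opt in per upload
by passing a ChecksumAlgorithm argument - a minimal sketch, with
made-up bucket and key names:

import boto3

s3 = boto3.client("s3")

# ask S3 to compute and store a SHA-256 checksum for this object
s3.put_object(
    Bucket="example-bucket",
    Key="important/file.bin",
    Body=b"contents worth verifying",
    ChecksumAlgorithm="SHA256",
)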