[HN Gopher] S3 is files, but not a filesystem
       ___________________________________________________________________
        
       S3 is files, but not a filesystem
        
       Author : todsacerdoti
       Score  : 393 points
       Date   : 2024-03-10 04:11 UTC (18 hours ago)
        
 (HTM) web link (calpaterson.com)
 (TXT) w3m dump (calpaterson.com)
        
       | 3weeksearlier wrote:
       | I dunno, are features like partial file overwrites necessary to
       | make something a filesystem? This reminds me of how there are
        | lots of internal systems at Google whose maintainers keep
        | asserting they are not filesystems, but everyone considers them
        | so, to the point where "_____ is not a filesystem" has become an
        | inside joke.
        
         | fiddlerwoaroof wrote:
         | Yeah, it's sort of funny how "POSIXish semantics" has become
         | our definition of these things, when it's just one kind of
         | thing that's been called a filesystem historically.
        
           | mickael-kerjean wrote:
            | Fun experiment I made with my mum: building a storage-
            | independent Dropbox-like UI [1] for anything that implements
            | this interface:
            | 
            |     type IBackend interface {
            |         Ls(path string) ([]os.FileInfo, error)
            |         Cat(path string) (io.ReadCloser, error)
            |         Mkdir(path string) error
            |         Rm(path string) error
            |         Mv(from string, to string) error
            |         Save(path string, file io.Reader) error
            |         Touch(path string) error
            |     }
            | 
            | My mum really couldn't care less about the POSIX semantics as
            | long as she can see the pictures of my kid, which happen to
            | be on S3.
           | 
           | [1] https://github.com/mickael-kerjean/filestash
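            | 
            | (A simplified sketch of what an S3 implementation of two of
            | those methods can look like with the AWS SDK for Go v2 -- not
            | the actual Filestash code, just the idea:)
            | 
            |     package s3backend
            | 
            |     import (
            |         "context"
            |         "io"
            | 
            |         "github.com/aws/aws-sdk-go-v2/aws"
            |         "github.com/aws/aws-sdk-go-v2/service/s3"
            |     )
            | 
            |     // S3Backend is a toy adapter: one bucket, with the
            |     // IBackend path used directly as the object key.
            |     type S3Backend struct {
            |         client *s3.Client
            |         bucket string
            |     }
            | 
            |     func (b S3Backend) Cat(path string) (io.ReadCloser, error) {
            |         out, err := b.client.GetObject(context.TODO(), &s3.GetObjectInput{
            |             Bucket: aws.String(b.bucket),
            |             Key:    aws.String(path),
            |         })
            |         if err != nil {
            |             return nil, err
            |         }
            |         return out.Body, nil // the body is already an io.ReadCloser
            |     }
            | 
            |     func (b S3Backend) Save(path string, file io.Reader) error {
            |         _, err := b.client.PutObject(context.TODO(), &s3.PutObjectInput{
            |             Bucket: aws.String(b.bucket),
            |             Key:    aws.String(path),
            |             Body:   file,
            |         })
            |         return err
            |     }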
        
             | wwalexander wrote:
             | Reducing things to basically the interface you laid out is
             | the point of 9p [1], and is what Plan 9's UNIX-but-
             | distributed design was built on top of. Same inventor as
             | Go! If you haven't dived down the Plan 9 rabbit hole yet,
             | it's a beautiful and haunting vision of how simple cloud
             | computing could have been.
             | 
             | [1] https://9fans.github.io/plan9port/man/man9/intro.html
        
             | MrJohz wrote:
             | I think this interface is less interesting than the
             | semantics behind it, particularly when it comes to
             | concurrency: what happens when you delete a folder, and
             | then try and create a file in that folder at the same time?
             | What happens when you move a folder to a new location, and
             | during that move, delete the new or old folders?
             | 
             | Like yes, for your mum's use case, with a single user, it's
             | probably not all that important that you cover those edge
             | cases, but every time I've built pseudo-filesystems on top
             | of non-filesystem storage APIs, those sorts of semantic
             | questions have been where all the problems have hidden.
             | It's not particularly hard to implement the interface
             | you've described, but it's very hard to do it in such a way
             | that, for example, you never have dangling files that exist
             | but aren't contained in any folder, or that you never have
             | multiple files with the same path, and so on.
        
           | DonHopkins wrote:
           | Can S3 murder your wife like ReiserFS and Reiser4?
           | 
           | https://en.wikipedia.org/w/index.php?title=Comparison_of_fil.
           | ..
        
         | CobrastanJorji wrote:
         | They are necessary because as soon as someone decides that S3
         | is a filesystem, they will look at the other cloud
         | "filesystems," notice that S3 is cheaper than most of them, and
         | then for some reason they will decide to run giant Hadoop fs
         | stuff on it or mount a relational database on it or all other
         | manner of stupidity. I guarantee you S3's customer-facing
         | engineers are fielding multiple calls per week from customers
         | who are angry that S3 isn't as fast as some real filesystem
         | solution that the customer migrated from because S3 was
         | cheaper.
         | 
         | When people decide that X is a filesystem, they try to use it
         | like it's a local, POSIX filesystem, and that's terrible
         | because it won't be immediately obvious why it's a stupid plan.
        
           | albert_e wrote:
            | If a customer makes an IT decision as big as running Hadoop
            | or an RDBMS with S3 as storage ... but does not consult even
            | an Associate-level AWS Certified architect (who are a dime a
            | dozen) for at least one day's worth of advice, which is
            | probably a couple of hundred dollars at most ...
            | 
            | Can we really blame AWS?
            | 
            | I am sure none of the official AWS documentation or examples
            | shows such an architecture.
           | 
           | ----
           | 
            | Amazon EMR can run Hadoop and use Amazon S3 as storage via
            | EMRFS.
            | 
            | "S3 mountpoints" are a feature specifically for workloads
            | that need to see S3 as a file system.
            | 
            | For block storage workloads there is EBS, and for file
            | storage workloads there are EFS and FSx, which AWS heavily
            | advertises.
        
         | karmasimida wrote:
          | Exactly, especially since the concept of a filesystem was
          | really defined before internet scale became a thing or a
          | reality.
          | 
          | Maybe S3 isn't a filesystem according to this definition, but
          | does it really matter to make it one? I doubt it. Elastic File
          | System is also an AWS product, but you can't really work with
          | it the way you do locally: an ls on any folder with over 20k
          | files will basically time out. Does that make EFS a filesystem
          | or not?
        
         | yencabulator wrote:
         | The problem is once you let go of those semantics, a lot of
         | software stops working if run against such a "filesystem". If
         | you dilute the meaning of "filesystem" too much, it becomes
         | less useful as a term.
         | 
         | https://en.wikipedia.org/wiki/Andrew_File_System was
          | interesting. I'd actually love to see something similar re-
          | implemented with modern ideas, but it's more of a direct-
          | access archival system than a general-purpose filesystem[1];
         | you can't just put files written by arbitrary software on it.
         | It's a bit like NFS without locks&leases, but even less like a
         | normal filesystem; only really good for files created once that
         | "settle down" into effectively being read-only.
         | 
         | [1]: I wrote https://github.com/bazil/plop that is
         | (unfortunately undocumented) content-addressed immutable file
         | storage over object storage, used in conjunction with a git
         | repo with symlinks to it to manage the "naming layer". See
         | https://bazil.org/doc/ for background, plop is basically a
         | simplification of the ideas to get to working code easier. Site
         | hasn't been updated in almost a decade, wow. It's in everyday
         | use though!
        
       | leetrout wrote:
       | My big pet peeve is AWS adding buttons in the UI to make
       | "folders".
       | 
       | It is also a fiction! There are no folders in S3.
       | 
       | > When you create a folder in Amazon S3, S3 creates a 0-byte
       | object with a key that's set to the folder name that you
       | provided. For example, if you create a folder named photos in
       | your bucket, the Amazon S3 console creates a 0-byte object with
       | the key photos/. The console creates this object to support the
       | idea of folders.
       | 
       | https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...
        
         | riehwvfbk wrote:
         | Is that really so different from how folders work on other
         | systems? A directory inode is just an inode.
        
           | daynthelife wrote:
           | The payload still contains a list of other inodes though
        
           | klodolph wrote:
           | Yes. It is, in practice, incredibly different.
           | 
           | Imagine you have a file named /some/dir/file.jpg.
           | 
           | In a filesystem, there's an inode for /some. It contains an
           | entry for /some/dir, which is also an inode, and then in the
           | very deepest level, there is an inode for /some/dir/file.jpg.
           | You can rename /some to /something_else if you want. Think of
            | it kind of like a table:
            | 
            |     +-------+--------+----------+-------+
            |     | inode | parent |     name |  data |
            |     +-------+--------+----------+-------+
            |     |     1 | (null) |     some | (dir) |
            |     |     2 |      1 |      dir | (dir) |
            |     |     3 |      2 | file.jpg |  jpeg |
            |     +-------+--------+----------+-------+
            | 
            | In S3 (and other object stores), the table is like this:
            | 
            |     +-------------------+------+
            |     | key               | data |
            |     +-------------------+------+
            |     | some/dir/file.jpg | jpeg |
            |     +-------------------+------+
           | 
            | The kinds of queries you can do are completely different.
            | There
           | are no inodes in S3. There is just a mapping from keys to
           | objects. There's an index on these keys, so you can do
           | queries--but the / character is NOT SPECIAL and does not
           | actually have any significance to the S3 storage system and
           | API. The / character only has significance in the UI.
           | 
           | You can, if you want, use a completely different character to
           | separate "components" in S3, rather than using /, because /
           | is not special. If you want something like
           | "some:dir:file.jpg" or "some.dir.file.jpg" you can do that.
           | Again, because / is not special.
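            | 
            | (Concretely -- a minimal sketch with the AWS SDK for Go v2,
            | bucket and prefix names made up -- the console's "folder"
            | view is nothing more than this kind of query:)
            | 
            |     package main
            | 
            |     import (
            |         "context"
            |         "fmt"
            |         "log"
            | 
            |         "github.com/aws/aws-sdk-go-v2/aws"
            |         "github.com/aws/aws-sdk-go-v2/config"
            |         "github.com/aws/aws-sdk-go-v2/service/s3"
            |     )
            | 
            |     func main() {
            |         ctx := context.Background()
            |         cfg, err := config.LoadDefaultConfig(ctx)
            |         if err != nil {
            |             log.Fatal(err)
            |         }
            |         client := s3.NewFromConfig(cfg)
            | 
            |         // List "some/dir/" as if it were a folder: a prefix
            |         // plus a "/" delimiter. CommonPrefixes are the
            |         // pretend subfolders.
            |         out, err := client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
            |             Bucket:    aws.String("my-bucket"),
            |             Prefix:    aws.String("some/dir/"),
            |             Delimiter: aws.String("/"),
            |         })
            |         if err != nil {
            |             log.Fatal(err)
            |         }
            |         for _, p := range out.CommonPrefixes {
            |             fmt.Println("folder-ish:", *p.Prefix)
            |         }
            |         for _, o := range out.Contents {
            |             fmt.Println("object:", *o.Key)
            |         }
            |     }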
        
             | fiddlerwoaroof wrote:
             | Except, S3 does let you query by prefix and so the keys
             | have more structure than the second diagram implies:
             | they're not just random keys, the API implies that common
             | prefixes indicate related objects.
        
               | klodolph wrote:
               | That's kind of stretching the idea of "more structure" to
               | the breaking point, I think. The key is just a string.
               | There is no entry for directories.
               | 
               | > the API implies that common prefixes indicate related
               | objects.
               | 
               | That's something users do. The API doesn't imply anything
               | is related.
               | 
               | And prefixes can be anything, not just directories. If
               | you have /some/dir/file.jpg, then you can query using
               | /some/dir/ as a prefix (like a directory!) or you can
               | query using /so as a prefix, or /some/dir/fil as a
               | prefix. It's just a string. It only looks like a
               | directory when you, the user, decide to interpret the /
               | in the file key as a directory separator. You could just
               | as easily use any other character.
        
               | hiyer wrote:
               | One operation where this difference is significant is
               | renaming a "folder". In UNIX (and even UNIX-y distributed
               | filesystems like HDFS) a rename operation at "folder"
               | level is O(1) as it only involves metadata changes. In
               | S3, renaming a "folder" is O(number of files).
        
               | okr wrote:
               | Imho, renaming "folders" on S3 results in copying and
               | deleting O(number of files)
        
               | pepa65 wrote:
               | From reading the above, if you have a folder 'dir' and a
               | file 'dir/file', after renaming 'dir' to 'folder', you
               | would just have 'folder' and 'dir/file'.
        
               | klodolph wrote:
               | There is really no such thing as a folder in S3.
               | 
               | If you have something which is dir/file, then NORMALLY
               | "dir" does not exist at all. Only dir/file exists. There
               | is nothing to rename.
               | 
               | If you happen to have something which is named "dir",
               | then it's just another file (a.k.a. object). In that
               | scenario, you have two files (objects) named "dir" and
               | "dir/file". Weird, but nothing stopping you from doing
               | that. You can also have another object named
               | "dir///../file" or something, although that can be
               | inconvenient, for various reasons.
        
               | Someone wrote:
               | > In S3, renaming a "folder" is O(number of files).
               | 
               | More like _O(max(number of files, total file size))_. You
               | can't rename objects in S3. To simulate a rename, you
               | have to copy an object and then delete the old one.
               | 
               | Unlike renames in typical file systems, that isn't atomic
               | (there will be a time period in which both the old and
               | the new object exist), and it becomes slower the larger
               | the file.
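                | 
                | (A sketch of that copy-then-delete dance with the AWS SDK
                | for Go v2, imports omitted. It assumes a key that needs
                | no URL-encoding in CopySource, and it ignores objects
                | over 5 GB, which need a multipart copy:)
                | 
                |     // Not atomic: both keys exist until the delete
                |     // lands, and the copy cost grows with object size.
                |     func rename(ctx context.Context, c *s3.Client,
                |         bucket, oldKey, newKey string) error {
                |         _, err := c.CopyObject(ctx, &s3.CopyObjectInput{
                |             Bucket:     aws.String(bucket),
                |             CopySource: aws.String(bucket + "/" + oldKey),
                |             Key:        aws.String(newKey),
                |         })
                |         if err != nil {
                |             return err
                |         }
                |         _, err = c.DeleteObject(ctx, &s3.DeleteObjectInput{
                |             Bucket: aws.String(bucket),
                |             Key:    aws.String(oldKey),
                |         })
                |         return err
                |     }
                | 
                | (The copy happens server-side, but it is still a full
                | copy, which is why renaming a big "folder" hurts.)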
        
               | fiddlerwoaroof wrote:
               | > That's something users do. The API doesn't imply
               | anything is related.
               | 
               | Querying ids by prefix doesn't make any sense for a
               | normal ID type. Just making this operation available and
               | part of your public API indicates that prefixes are
               | semantically relevant to your API's ID type.
        
               | klodolph wrote:
               | "Prefix" is not the same thing as "directory".
               | 
               | I can look up names with the prefix "B" and get Bart,
               | Bella, Brooke, Blake, etc. That doesn't imply that
               | there's some kind of semantics associated with prefixes.
               | It's just a feature of your system that you may find
               | useful. The fact that these names have a common prefix,
               | "B", is not a particularly interesting thing to me. Just
               | like if I had a list of files, 1.jpg, 10.jpg, 100.jpg,
               | it's probably not significant that they're being returned
               | sequentially (because I probably want 2.jpg after 1.jpg).
        
               | afiori wrote:
                | by this logic the file "foo/bar/" corresponds to the
                | filename "f:o:o:/:b:a:r:/" (using a different character
                | as separator)
        
             | riehwvfbk wrote:
             | Thank you, now I understand what the special 0-byte object
             | refers to. It represents an empty folder.
             | 
             | Fair enough, basing folders on object names split by / is
             | pretty inefficient. I wonder why they didn't go with a
             | solution like git's trees.
        
               | klodolph wrote:
               | > Fair enough, basing folders on object names split by /
               | is pretty inefficient. I wonder why they didn't go with a
               | solution like git's trees.
               | 
               | What, exactly, is inefficient about it?
               | 
               | Think for a moment about the data structures you would
               | use to represent a directory structure in a filesystem,
               | and the data structures you would use to represent a
               | key/value store.
               | 
               | With a filesystem, if you split a string
               | /some/dir/file.jpg into three parts, "some", "dir",
               | "file.jpg", then you are actually making a decision about
               | the tree structure. And here's a question--is that a
               | _balanced_ tree you got there? Maybe it's completely
               | unbalanced! That's actually inefficient.
               | 
               | Let's suppose, instead, you treat the key as a plain
               | string and stick it in a tree. You have a lot of freedom
               | now, in how you balance the tree, since you are not
               | forced to stick nodes in the tree at every / character.
               | 
               | It's just a different efficiency tradeoff. Certain
               | operations are now much less efficient (like "rename a
               | directory" which, on S3, is actually "copy a zillion
                | objects"). Some operations are more efficient, like "store
               | a file" or "retrieve a file".
        
               | umanwizard wrote:
               | I think what you're describing is simply not a
               | hierarchical file system. It's a different thing that
               | supports different operations and, indeed, is better or
               | worse at different operations.
        
               | afiori wrote:
               | I think it is fair to say that S3 (as named files) is not
               | a filesystem and it is inefficient to use it directly as
               | such for common filesystem use cases; the same way that
               | you could say it for a tarball[0].
               | 
                | This does not make S3 bad storage, just a bad
                | filesystem; not everything needs to be a filesystem.
                | 
                | Arguably it is good that S3 is not a filesystem, as it
                | can be a leaky abstraction: e.g. in git you cannot have
                | two tags named "v2" and "v2/feature-1", because you
                | cannot have both a file and a folder with the same name.
               | 
               | For something more closely related to URLs than filenames
               | forcing a filesystem abstraction is a limitation as
               | "/some/url", "/some/url/", and "/some/url/some-default-
               | name-decided-by-the-webserver" can be different.[1]
               | 
               | [0] where a different tradeoff is that searching a file
               | by name is slower but reading many small files can be
               | faster.
               | 
               | [1] maybe they should be the same, but enforcing it is a
               | bad idea
        
               | inkyoto wrote:
               | > [...] what the special 0-byte object refers to. It
               | represents an empty folder.
               | 
               | Alas, no. It represents a tag, e.g. <<folder/>>, that
               | points to a zero byte object.
               | 
               | You can then upload two files, e.g. <<folder/file1.txt>>
               | and <<folder/file2.txt>>, delete the <<folder/>>, being a
                | _tag_, and still have the <<folder/file1.txt>> and
                | <<folder/file2.txt>> files intact in the S3 bucket.
               | 
               | Deleting <<folder/>> in a traditional file system, on the
               | other hand, will also delete <<file1.txt>> and
               | <<file2.txt>> in it.
        
               | dchest wrote:
                | It's a matter of client UI implementation. You can't
                | delete a non-empty folder with the POSIX API on common
                | filesystems or over FTP either.
               | 
               | However, there are file managers, FTP clients, and S3
               | clients that will do that for you by deleting individual
               | files.
        
               | _flux wrote:
                | But even then the S3 semantics are not helping you: e.g.
                | with multiple clients doing copy/move/delete operations
                | in the hierarchy, you could still end up with files that
                | are not in "directories".
                | 
                | So essentially an S3 file manager must be able to handle
                | the situation where there are files without a "directory"
                | --which I assume is also the most common case for S3.
                | Might as well just not have the "directories" in the
                | first place.
        
               | klodolph wrote:
               | I have personally never seen the 0-byte files people keep
               | talking about here. In every S3 bucket I've ever looked
               | at, the "directories" don't exist at all. If you have a
               | dir/file1.txt and dir/file2.txt, there is NO such object
               | as dir. Not even a placeholder.
        
               | _flux wrote:
               | Yeah, this post was the first one I had even heard of
               | them.
        
               | cwillu wrote:
               | Deleting folder/ in a traditional file system will _fail_
               | if the folder is not empty. Userspace needs to recurse
               | over the directory structure to unlink everything in it
               | before unlinking the actual folder.
        
               | gjvc wrote:
               | "folders" do not exist in S3 -- why do you keep insisting
               | that they do?
               | 
               | They appear to exist because the key is split on the
               | slash character for navigation in the web front-end. This
               | gives the familiar appearance of a filesystem, but the
               | implementation is at a much higher level.
        
             | Demiurge wrote:
             | Let's start with the fact that you're talking to an HTTP
             | api... Even if S3 had web3.0 inodes, the querying semantics
             | would not make sense. It's a higher level API, because you
             | don't deal with blocks of magnetic storage and binary
             | buffers. Of course s3 is not a filesystem, that is part of
             | its definition, and reason to be...
        
               | klodolph wrote:
               | I think if you focus too narrowly on the details of the
               | wire protocol, you'll lose sight of the big picture and
               | the semantics.
               | 
               | S3 is not a filesystem because the semantics are
               | different from the kind of semantics we expect from
               | filesystems. You can't take the high-level API provided
               | by a filesystem, use S3 as the backing storage, and
               | expect to get good performance out of it unless you use a
               | _ton_ of translation.
               | 
               | Stuff like NFS or CIFS _are_ filesystems. They behave
               | like filesystems, in practice. You can rename files. You
               | can modify files. You can create directories.
        
               | Demiurge wrote:
               | Right, the NFS/CIFS support writing blocks, but S3
               | basically does HTTP get and post verbs. I would say that
               | these concepts are the defining difference. To call S3 a
               | filesystem is not wrong in abstract, but it's not
               | different than calling Wordpress a filesystem, or DNS, or
               | anything that stores something for you. Of course, it
               | will be inefficient to implement a block write on top of
               | any of these, that's because you have to literally do it
               | yourself. As in, download the file, edit it, upload
               | again.
        
               | klodolph wrote:
               | I think the blocks are one part of it, and the other part
               | is that S3 doesn't support renaming or moving objects,
               | and doesn't have directories (just prefixes). Whenever
               | I've seen something with filesystem-like semantics on top
               | of S3, it's done by using S3 as a storage layer, and
               | building some other kind of view of the storage on top
               | using a separate index.
               | 
               | For example, maybe you have a database mapping file paths
               | to S3 objects. This gives you a separate metadata layer,
               | with S3 as the storage layer for large blocks of data.
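                | 
                | (A toy, in-memory version of that idea, using only "fmt"
                | and "sync" from the standard library -- a real system
                | would keep this index in a database -- just to show why
                | renames become cheap: only the mapping changes, the S3
                | objects never move:)
                | 
                |     // Index maps logical paths to opaque S3 object keys
                |     // (e.g. UUIDs). S3 is only the blob store.
                |     type Index struct {
                |         mu    sync.Mutex
                |         paths map[string]string
                |     }
                | 
                |     func (ix *Index) Rename(oldPath, newPath string) error {
                |         ix.mu.Lock()
                |         defer ix.mu.Unlock()
                |         key, ok := ix.paths[oldPath]
                |         if !ok {
                |             return fmt.Errorf("no such path: %s", oldPath)
                |         }
                |         delete(ix.paths, oldPath)
                |         ix.paths[newPath] = key // the object never moves
                |         return nil
                |     }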
        
             | keithalewis wrote:
             | Even youngsters are yelling at clouds now. Just a different
             | kind of cloud.
        
             | tuwtuwtuwtuw wrote:
             | "filesystem" is not a name reserved for Unix-style file
             | systems. There are many types of file system which is not
             | built on according to your description. When I was a kid, I
             | used systems which didn't support directories, but it was
             | still file systems.
             | 
             | It's an incorrect take that a system to manage files must
             | follow a set of patterns like the ones you mentioned to be
             | called "file system".
        
               | afiori wrote:
                | Terms evolve, and now "filesystem" and "system of files"
                | mean different things.
                | 
                | I would argue that not supporting folders or many other
                | file operations makes something not a filesystem today.
        
               | quickthrower2 wrote:
                | Yeah, "hacker" used to not mean someone breaking into a
                | computer and cracking passwords; then it did, and now it
                | means both that and a tech tinkerer.
        
               | tuwtuwtuwtuw wrote:
               | You're free to argue whatever you want, but claiming that
               | a file system should have folders as the parent commenter
               | did, or support specific operations, seems a bit
               | meaningless.
               | 
               | I could create a system not supporting folders because it
               | relies on tags or something else. Or I could create a
               | system which is write-only and doesn't support rename or
               | delete.
               | 
               | These systems would be file systems according to how the
               | term has been used for 40 (?) years at least. Just don't
               | see any point in restricting the term to exclude random
               | variants.
        
           | erik_seaberg wrote:
           | You can create a simulated directory, and write a bunch of
           | files in it, but you can't atomically rename it--behind the
           | scenes each file needs to be copied from old name to new.
        
           | 8organicbits wrote:
           | Another challenge is directory flattening. On a file system
           | "a/b" and "a//b" are usually considered the same path. But on
           | S3 the slash isn't a directory separator, so the paths are
           | distinct. You need to be extra careful when building paths
           | not to include double slashes.
           | 
           | Many tools end up handling this by showing a folder named "a"
           | containing a folder named "" (empty string). This confuses
           | users quite a bit. It's more than the inodes, it's how the
           | tooling handles the abstraction.
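            | 
            | (A tiny Go illustration; note that "cleaning" the path
            | silently changes which object you address, so neither
            | behaviour is automatically the right one:)
            | 
            |     package main
            | 
            |     import (
            |         "fmt"
            |         "path"
            |     )
            | 
            |     func main() {
            |         prefix := "a/"
            |         name := "/b" // oops, a leading slash slipped in
            |         naive := prefix + name             // "a//b" -- a real, distinct key
            |         cleaned := path.Join(prefix, name) // "a/b" -- a *different* key
            |         fmt.Println(naive, cleaned)
            |     }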
        
             | hnlmorg wrote:
             | Coincidentally I ran into an issue just like this a week
             | ago. A customer facing application failed because there was
             | an object named "/foo/bar" (emphasis on the leading slash).
             | 
             | This created a prefix named "/" which confused the hell out
             | of the application.
        
           | ithkuil wrote:
           | In S3 each file is identified with a full path.
           | 
            | Not only can you not rename a single file, you also cannot
            | rename a "folder" (because that would imply a bulk rename of
            | a large number of children of that "folder").
           | 
           | This is the fundamental difference between a first class
           | folder and just a convention on prefixes of full path names.
           | 
           | If you don't allow renames, it doesn't really make sense to
           | have each "folder" store the list of the children.
           | 
            | You can instead have a giant ordered map (some kind of
            | b-tree) that allows for efficient lookups and scanning of
            | neighbouring nodes.
        
             | lukeh wrote:
             | UMich LDAP server, upon which many were based, stored
              | entries' hierarchical (distinguished) names with each entry,
             | which I always found a bit weird. AD, eDirectory, and the
             | OpenLDAP HDB backend don't have this problem.
        
         | solumunus wrote:
         | What exactly do you think a folder is? It's just an abstraction
         | for organising data.
        
           | winwang wrote:
           | I'm having a lot of fun imagining this being said to a kid
           | who's trying to buy some folders for school.
        
           | klodolph wrote:
           | S3 doesn't have that abstraction.
           | 
           | The console UI shows folders but they don't actually exist in
           | S3. They're made up by the UI.
        
             | 3weeksearlier wrote:
             | It sounds like they have that abstraction in the UI. But if
             | the CLI and API don't have it too, that's weird.
        
               | klodolph wrote:
               | Yeah, the UI and CLI show you "folders". It's a client-
               | side thing that doesn't exist in the actual service.
               | Behind the scenes, the clients are making specific types
               | of queries on the object keys.
               | 
               | You can't examine when a folder was created (it doesn't
               | exist in the first place), you can't rename a folder (it
               | doesn't exist), you can't delete a folder (again, it
               | doesn't exist).
        
               | throwitaway222 wrote:
               | That's just an implementation detail of well known
               | filesystems.
        
               | dathery wrote:
               | Yes, which is why it's not ideal to reuse the folder
               | metaphor here. Users have an idea how directories work on
               | well-known filesystems and get confused when these fake
               | folders don't behave the same way.
        
               | throwitaway222 wrote:
                | Are all your S3 keys opaque strings (like UUIDs)? Do you
                | use / (slash) in your keys?
               | 
               | If you truly believe S3 has absolutely no connection to
               | folders, you would answer Yes and No.
        
               | klodolph wrote:
               | I don't think that's a defensible standpoint.
               | 
               | Folders are an important part of the way most people use
               | filesystems.
        
             | throwitaway222 wrote:
             | Similarly the UI in linux is making up the notion of
             | folders and files in them. But we don't say it doesn't
             | exist.
        
               | dathery wrote:
               | Directories actually exist on the filesystem, which is
               | why you have to create them before use and they can exist
               | and be empty. They don't exist in S3 and neither of those
               | properties do, either. Similarly, common filesystem
               | operations on directories (like efficiently renaming
               | them, and thus the files under them) are not possible in
               | S3.
               | 
               | Of course it can still be useful to group objects in the
               | S3 UI, but it would probably be better to use some kind
               | of prefix-centric UI rather than reusing the folder
               | metaphor when it doesn't match the paradigm people are
               | used to.
        
               | kelnos wrote:
                | No, they're not made up. A folder (or directory) is a
                | specific type of inode, just as a file is.
               | 
               | S3 doesn't have folders. The UI fakes them by creating a
               | 0-byte object (or file, if you will). It's a kludge.
        
               | klodolph wrote:
               | The UI will fake them without even creating the 0-byte
               | object.
        
             | DonHopkins wrote:
              | Speaking of user interfaces with optical illusions about
             | directory separators:
             | 
             | On the Mac, the Finder lets you have files with slashes in
             | their names, even though it's a Unix file system
             | underneath. Don't believe me? Go try to use the Finder to
             | make a directory whose name is "Reports from 2024/03/10".
             | See?
             | 
             | But as everyone knows, slash is the ONLY character you're
             | not allowed to have in a file or directory name under Unix.
              | It's enforced in the kernel at the system call interface.
             | There is absolutely no way to make a file with a slash in
             | it. Yet there it is!
             | 
             | The original MacOS operating system used the ":" character
             | to delimit directory names, instead of "/", so you could
             | have files and directories with slashes in their names,
              | just not with colons in their names.
             | 
             | When Apple transitioned from MacOS to Unix, they did not
              | want to freak out their users by renaming all their files.
             | 
             | So now try to use the Finder (or any app that uses the
             | standard file dialog) to make a folder or file with a ":"
             | in its name on a modern Mac. You still can't!
             | 
             | So now go into the shell and list out the parent directory
             | containing the directory you made with a slash in its name.
             | It's actually called "Reports from 2024:03:10"!
             | 
             | The Mac Finder and system file dialog user interfaces
              | actually switch "/" and ":" when they show paths on the
             | screen!
             | 
             | Try making a file in the shell with colons in it, then look
             | at it in the finder to see the slashes.
             | 
             | However, back in the days of the old MacOS that permitted
             | slashes in file names, there was a handy network gateway
             | box called the "Gatorbox" that was a Localtalk-to-Ethernet
             | AFP/NFS bridge, which took a subtly different approach.
             | 
             | https://en.wikipedia.org/wiki/GatorBox
             | 
             | It took advantage of the fact (or rather it triggered the
             | bug) that the Unix NFS implementation boldly made an end-
             | run around the kernel's safe system call interface that
             | disallowed slashes in file names. So any NFS client could
             | actually trick Unix into putting slashes into file names
             | via the NFS protocol!
             | 
             | It appeared to work just fine, but then down the line the
             | Unix "restore" command would totally shit itself! Of course
             | "dump" worked just fine, never raising an error that it was
             | writing corrupted dumps that you would not be able to read
             | back in your time of need, so you'd only learn that you'd
             | been screwed by the bug and lost all your files months or
             | years later!
             | 
             | So not only does NFS stand for "No File Security", it also
             | stands for "Nasty Forbidden Slashes"!
             | 
             | https://news.ycombinator.com/item?id=31820504
             | 
             | >NFS originally stood for "No File Security".
             | 
             | >The NFS protocol wasn't just stateless, but also
             | securityless!
             | 
             | >Stewart, remember the open secret that almost everybody at
             | Sun knew about, in which you could tftp a host's
             | /etc/exports (because tftp was set up by default in a way
             | that left it wide open to anyone from anywhere reading
             | files in /etc) to learn the name of all the servers a host
             | allowed to mount its file system, and then in a root shell
             | simply go "hostname foo ; mount remote:/dir /mnt ; hostname
             | `hostname`" to temporarily change the CLIENT's hostname to
             | the name of a host that the SERVER allowed to mount the
             | directory, then mount it (claiming to be an allowed
             | client), then switch it back?
             | 
             | >That's right, the server didn't bother checking the
             | client's IP address against the host name it claimed to be
             | in the NFS mountd request. That's right: the protocol
             | itself let the client tell the server what its host name
             | was, and the server implementation didn't check that
             | against the client's ip address. Nice professional protocol
             | design and implementation, huh?
             | 
             | >Yes, that actually worked, because the NFS protocol
             | laughably trusted the CLIENT to identify its host name for
             | security purposes. That level of "trust" was built into the
             | original NFS protocol and implementation from day one, by
             | the geniuses at Sun who originally designed it. The network
             | is the computer is insecure, indeed.
             | 
             | [...]
             | 
             | From the Unix-Haters Handbook:
             | 
             | https://archive.org/stream/TheUnixHatersHandbook/ugh_djvu.t
             | x...
             | 
             | Don't Touch That Slash!
             | 
             | UFS allows any character in a filename except for the slash
             | (/) and the ASCII NUL character. (Some versions of Unix
             | allow ASCII characters with the high-bit, bit 8, set.
             | Others don't.)
             | 
             | This feature is great -- especially in versions of Unix
             | based on Berkeley's Fast File System, which allows
             | filenames longer than 14 characters. It means that you are
             | free to construct informative, easy-to-understand filenames
             | like these:
             | 
             | 1992 Sales Report
             | 
             | Personnel File: Verne, Jules
             | 
              | rt005mfkbgkw0.cp
             | 
             | Unfortunately, the rest of Unix isn't as tolerant. Of the
             | filenames shown above, only rt005mfkbgkw0.cp will work with
             | the majority of Unix utilities (which generally can't
             | tolerate spaces in filenames).
             | 
             | However, don't fret: Unix will let you construct filenames
             | that have control characters or graphics symbols in them.
             | (Some versions will even let you build files that have no
             | name at all.) This can be a great security feature --
             | especially if you have control keys on your keyboard that
             | other people don't have on theirs. That's right: you can
             | literally create files with names that other people can't
             | access. It sort of makes up for the lack of serious
             | security access controls in the rest of Unix.
             | 
             | Recall that Unix does place one hard-and-fast restriction
             | on filenames: they may never, ever contain the magic slash
             | character (/), since the Unix kernel uses the slash to
             | denote subdirectories. To enforce this requirement, the
             | Unix kernel simply will never let you create a filename
             | that has a slash in it. (However, you can have a filename
             | with the 0200 bit set, which does list on some versions of
             | Unix as a slash character.)
             | 
              | Never? Well, hardly ever.
              | 
              |     Date: Mon, 8 Jan 90 18:41:57 PST
              |     From: sun!wrs!yuba!steve@decwrl.dec.com (Steve Sekiguchi)
              |     Subject: Info-Mac Digest V8 #35
              | 
              |     I've got a rather difficult problem here. We've got a
              |     Gator Box running the NFS/AFP conversion. We use this
              |     to hook up Macs and Suns. With the Sun as a AppleShare
              |     File server. All of this works great!
              | 
              |     Now here is the problem, Macs are allowed to create
              |     files on the Sun/Unix fileserver with a "/" in the
              |     filename. This is great until you try to restore one
              |     of these files from your "dump" tapes, "restore" core
              |     dumps when it runs into a file with a "/" in the
              |     filename. As far as I can tell the "dump" tape is fine.
              | 
              |     Does anyone have a suggestion for getting the files off
              |     the backup tape?
              | 
              |     Thanks in Advance,
              |     Steven Sekiguchi                  Wind River Systems
              |     sun!wrs!steve, steve@wrs.com      Emeryville CA, 94608
             | 
             | Apparently Sun's circa 1990 NFS server (which runs inside
             | the kernel) assumed that an NFS client would never, ever
             | send a filename that had a slash inside it and thus didn't
             | bother to check for the illegal character. We're surprised
             | that the files got written to the dump tape at all. (Then
             | again, perhaps they didn't. There's really no way to tell
             | for sure, is there now?)
        
           | ahepp wrote:
           | Is it an abstraction for requesting the data you want, or an
           | abstraction for storing the data in a retrievable manner?
        
         | nostrebored wrote:
         | Weird that it says folders now. I remember it being very
         | strictly called a prefix when I was at AWS.
        
           | paranoidrobot wrote:
            | I think it's just the web console; it's still a prefix in
            | the APIs and CLI.
           | 
           | https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObje.
           | ..
        
             | Izkata wrote:
             | The web console even collapses them like folders on
             | slashes, further obfuscating how it actually works. I
             | remember having to explain to coworkers why it was so slow
             | to load a large bucket.
        
         | klodolph wrote:
         | I see you getting downvotes, but you're speaking the honest
         | truth, here.
        
         | halayli wrote:
         | I don't know why you are being downvoted, what you said is true
         | and confuses many newcomers.
        
         | highwaylights wrote:
         | This!
         | 
         | I'm fine with it, I actually appreciate the logic and
         | simplicity behind it, but the amount of times I've tried to
         | explain why "folders" on S3 keep disappearing while people
         | stare at me like I'm an idiot is really frustrating.
         | 
         | (When you remove the last file in a "folder" on S3, the
         | "folder" disappears, because that pattern no longer appears in
         | the bucket k/v dictionary so there's no reason to show it as it
         | never existed in the first place).
        
         | wkat4242 wrote:
         | Hmm well there's no folders but if you interact with the object
         | the URL does become nested. So in a sense it does behave
         | exactly like a folder for all intents and purposes when dealing
         | with it that way. It depends what API you use I guess.
         | 
         | I use S3 just as a web bucket of files (I know it's not the
         | best way to do that but it's what I could easily obtain through
         | our company's processes). But in this case it makes a lot of
         | sense though I try to avoid making folders. But other people
         | using the same hosting do use them.
        
           | raverbashing wrote:
            | Except stuff like the s3 cli has all these weird names for
            | normal filesystem items, and you have to bang your head
            | trying to figure out what it all means
           | 
           | (also don't get me started on the whole s3api thing)
        
       | inkyoto wrote:
        | S3 is tagged, versioned object storage with file-like semantics
        | implemented in the AWS SDK (via the AWS S3 APIs). The S3 object
        | key is the tag.
       | 
       | Files and folders are used to make S3 buckets more approachable
       | to those who either don't know or don't want to know what it
       | actually is, and one day they get a surprise.
        
       | Twirrim wrote:
        | S3 is a key-value store. It just happens to be able to store
        | really large values.
        
       | dmarinus wrote:
       | I talked to people at AWS who work in RDS Aurora and they hinted
       | they use S3 internally as a backend for MySQL and PostgreSQL.
        
         | readyman wrote:
         | Big if true. That was definitely not in the AWS cert I took
         | lol.
        
           | multani wrote:
            | Separating compute and storage is one of the core ideas
            | behind Aurora. They talked about it in several places, for
            | instance:
            | 
            | * https://www.amazon.science/publications/amazon-aurora-design...
            | * https://d1.awsstatic.com/events/reinvent/2019/REPEAT_Amazon_...
        
         | WatchDog wrote:
         | Maybe for snapshots, but certainly not for live data.
        
       | YouWhy wrote:
       | The article is well written, but I am annoyed at the attempt to
       | gatekeep the definition of a filesystem.
       | 
       | Like literally any abstraction out there, filesystems are
       | associated with a multitude of possible approaches with
       | conceptually different semantics. It's a bit sophistic to say
       | that Postgres cannot be run on S3 because S3 is not a filesystem;
       | a better choice would have been to explore the underlying
       | assumptions; (I suspect latency would kill the hypothetical use
       | case of Postgres over S3 even if S3 had incorporated the
       | necessary API semantics - could somebody more knowledgeable chime
       | in?).
       | 
        | A more interesting avenue to pursue would be: what other
        | additions could be made to the S3 API to make it more usable in
        | its own right - for example, why doesn't S3 offer more than one
        | filename per blob? (e.g., similar to what links do in POSIX)
        
         | bilalq wrote:
         | This might be of interest to you: https://neon.tech/blog/bring-
         | your-own-s3-to-neon.
         | 
         | There's also the OG Aurora whitepaper:
         | https://www.amazon.science/publications/amazon-aurora-design...
        
         | zX41ZdbW wrote:
         | ClickHouse can work with S3 as a main storage. This is possible
         | because a table is a set of immutable data parts. Data parts
         | can be written once and deleted, possibly as a result of a
         | background merge operation. S3 API is almost enough, except for
         | cases of concurrent database updates. In this case, it is not
         | possible to rely on S3 only because it does not support an
         | atomic "write if not exists" operation. That's why external,
         | strongly consistent metadata storage is needed, which is
         | handled by ClickHouse Keeper.
        
           | afiori wrote:
           | Is a "write if not exists" atomic operation enouhg as a
           | concurrency primitive for database locks?
        
             | justincormack wrote:
              | Yes, it's not necessarily the most efficient mechanism
              | (there could be a lot of retries) but it's sufficient. See
              | the Delta Lake paper for example [0]
             | 
             | [0] https://people.eecs.berkeley.edu/~matei/papers/2020/vld
             | b_del...
        
             | yencabulator wrote:
             | When talking about analytical databases for "big data",
              | yeah. They generally just want an "atomically replace the
             | list of Parquet files that make up this table", with one
             | writer succeeding at a time.
             | 
             | That would not be a great base to build a transactional
             | database on.
        
           | mlhpdx wrote:
           | Conditional PUT would be a great addition to S3, indeed.
        
             | buremba wrote:
             | That would probably require them to rewrite a non-trivial
             | part of S3 from scratch.
        
           | yencabulator wrote:
           | Google Cloud Storage supports create-if-not-exist and
           | compare-and-swap on generation counter. S3 is much harder to
           | use as a building block without tying your code into a second
           | system like DynamoDB etc.
           | 
           | https://pkg.go.dev/cloud.google.com/go/storage#Conditions
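            | 
            | (A sketch with that Go package -- bucket/object names made
            | up, imports omitted; a losing writer gets a precondition-
            | failed error when Close runs:)
            | 
            |     // Create the object only if it does not exist yet.
            |     func createIfAbsent(ctx context.Context, c *storage.Client,
            |         bucket, name string, data []byte) error {
            |         w := c.Bucket(bucket).Object(name).
            |             If(storage.Conditions{DoesNotExist: true}).
            |             NewWriter(ctx)
            |         if _, err := w.Write(data); err != nil {
            |             w.Close()
            |             return err
            |         }
            |         return w.Close() // precondition failure surfaces here
            |     }
            | 
            | (Conditions{GenerationMatch: ...} gives the compare-and-swap
            | variant.)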
        
         | jillesvangurp wrote:
         | The notion of postgres not being able to run on s3 has more to
         | do with the characteristics of how it works than with it not
         | being a filesystem. After all, people have developed fuse
         | drivers for s3 so they can actually pretend it's a filesystem.
         | But using that to store a database is going to end in tears for
         | the same reasons that using e.g. NFS for this is also likely to
         | end in tears. You might get it to work but it won't be fast or
          | even reliable. And since NFS actually stands for Network File
          | System, it's hard to argue that NFS isn't a filesystem.
         | 
         | Whether something is or isn't a filesystem requires defining
         | what that actually is. A system that stores files would be a
         | simple explanation. Which is clearly something S3 is capable
         | of. This probably upsets the definition gatekeepers for
         | whatever more specific definitions they are guarding. But it
         | has a nice simple logic to it.
         | 
         | It's worth considering that file systems have had a long
         | history, weren't always the way they are now, and predate the
         | invention of relational databases (like postgres). Technically
         | before hard disks were invented in the fifties, we had no file
          | systems. Just tapes and punch cards. A tape would consist of a
          | single blob of bits, which you'd load into memory. Or it would
          | have multiple such blobs at known offsets. I had cassettes full
          | of games for my Commodore 64, but no disk drive. These blobs
          | were called files, but there was no file system. Some time
          | after the invention of disks, file systems were invented, in
          | the early sixties.
         | 
         | Hierarchical databases were common before relational databases
         | and filesystems with directories are basically a hierarchical
         | database. S3 lacking hierarchy as a simpler key value store
         | clearly isn't a hierarchical database. But of course it's easy
         | to mimic one simply by using / characters in the keys. Which is
         | how the fuse driver probably fakes directories. And S3 even has
          | APIs to list files with a common prefix. A bigger deal is the
         | inability to modify files. You can only replace them with other
         | files (delete and add). That kind of is a show stopper for a
         | database. Replacing the entire database on every write isn't
         | very practical.
        
           | buremba wrote:
            | Neon.tech runs PostgreSQL on S3. They persist the WAL to S3
            | so that they can replicate the data and bring it to local
            | SSDs, I assume.
        
         | defaultcompany wrote:
         | I've wondered this also because it can be handy to have
         | multiple ways of accessing the same file. For example to
         | obfuscate database uuids if they are used in the key. In theory
         | you could implement soft links in AWS by just storing a file
         | with the path to the linked file. But it would be a lot of
         | manual work.
        
       | throwaway892238 wrote:
       | > The "simple" in S3 is a misnomer. S3 is not actually simple.
       | It's deep.
       | 
       | Simple doesn't mean "not deep". It means having the fewest parts
       | needed in order to accomplish your requirements.
       | 
       | If you require a distributed, centralized, replicated, high-
       | availability, high-durability, high-bandwidth, low-latency,
       | strongly-consistent, synchronous, scalable object store with HTTP
       | REST API, you can't get much simpler than S3. Lots of features
       | have been added to AWS S3 over the years, but the basic operation
       | has remained the same.
        
         | svat wrote:
         | > _It means having the fewest parts needed in order to
         | accomplish your requirements._
         | 
         | That is exactly what "deep" means, in the terminology of this
         | post (from Ousterhout's book _A Philosophy of Software Design_
         | ). Simple means "not complex" (see also Rich Hickey's talk
         | Simple Made Easy: https://www.infoq.com/presentations/Simple-
         | Made-Easy/), while "deep" means providing/having a lot of
         | internally-complex functionality via a small interface. The
         | latter is a better description of S3 (which is what you seem to
         | be saying too) than "simple" which would mean there isn't much
         | to it.
        
           | throwaway892238 wrote:
            | Hickey's definition of simple is wrong. It's not the opposite
            | of complex at all. They are not opposites, nor mutually
            | exclusive.
            | 
            |   - Easy is when something does not require much effort.
            |   - Simple means the least complex it can be and still work.
            |   - Complex means there are lots of components.
            | 
            | These are all quite different concepts:
            | 
            |   - Easy is a concept that distinguishes the amount of work
            |     needed to use a solution.
            |   - Simple is a concept that distinguishes whether or not
            |     there is an excess number of interacting properties in a
            |     system.
            |   - Complex is a concept describing the quality of having a
            |     number of interacting properties in a system.
           | 
           | Hickey's talk is useful in terms of thinking about software,
           | but it also contains many over-generalizations which are
           | incorrect and lead to incorrect thinking about things that
           | aren't software. (Even some of his declarations about
           | software are wrong)
           | 
           | "Deep", in the context of software complexity, probably only
           | makes sense in terms of describing the number of layers
           | involved in a piece of technology. You could make something
           | have many layers, and it could still be simple, or be
           | complex, or easy.
        
         | ahepp wrote:
         | In terms the article puts forth, I would almost argue that
         | simple implies deep (and the associated "narrow" interface).
        
       | type_Ben_struct wrote:
       | Tools like LucidLink and Weka go some way toward making S3 even
       | more of a "file system". They break files into smaller chunks (S3
       | objects), which helps with partial writes, reads and performance,
       | alongside tiering of data from S3 to disk when needed for
       | performance.
        
         | hnlmorg wrote:
         | I don't know a whole lot about LucidLink but Weka basically
         | uses S3 as a dataplane for their own file system.
        
         | rwmj wrote:
         | Someone contributed an nbdkit S3 plugin which basically works
         | the way you described. It stores numbered S3 chunks using the
         | pattern "key/%16x", allowing the virtual disk to be updated.
         | (https://libguestfs.org/nbdkit-S3-plugin.1.html
         | https://gitlab.com/nbdkit/nbdkit/-/tree/master/plugins/S3)
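         | 
         | The offset-to-object mapping behind that is roughly the
         | following (just a sketch of the idea, not the plugin's actual
         | code; the 64 KiB chunk size and the zero-padded key format are
         | assumptions):
         | 
         |   package main
         |   
         |   import "fmt"
         |   
         |   // chunkKey maps a byte offset in the virtual disk to the S3
         |   // object that holds it, assuming fixed-size chunks stored
         |   // under "<base>/<chunk index as %016x>". Zero-padding keeps
         |   // the chunk keys in lexicographic order.
         |   func chunkKey(base string, offset, chunkSize int64) (string, int64) {
         |       idx := offset / chunkSize
         |       return fmt.Sprintf("%s/%016x", base, idx), offset % chunkSize
         |   }
         |   
         |   func main() {
         |       key, within := chunkKey("disks/vm1", 5*1024*1024+123, 64*1024)
         |       fmt.Println(key, within) // disks/vm1/0000000000000050 123
         |   }
         | 
         | Updating a 300-byte region then means rewriting one 64 KiB
         | object rather than the whole multi-gigabyte disk image.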
        
         | cuno wrote:
         | The problem with these approaches is that the data is scrambled
         | on the backend, so you can't access the files directly from S3
         | anymore. Instead you need an S3 gateway to convert from
         | scrambled S3 to unscrambled S3. They rely on a separate
         | database to reassemble the pieces back together again.
        
       | hn72774 wrote:
       | > Filesystem software, especially databases, can't be ported to
       | Amazon S3
       | 
       | Hudi, Delta and Iceberg bridge that gap now. Databricks built a
       | company around it.
       | 
       | Don't try to do relational on object storage on your own. Use one
       | of those libraries. It seems simple but it's not. Late arriving
       | data, deletes, updates, primary key column values changing, etc.
        
         | albert_e wrote:
         | There are specifically block storage services (EBS), flavors of
         | it like EBS multi-attach, and EFS, which can be used if there
         | is a need to port software/databases to the cloud with low-
         | level filesystem support.
         | 
         | Why would we need to do it on object storage, which addresses a
         | different type of storage need?
         | 
         | Nevertheless there are projects like EMRFS and S3 file system
         | mount points that try to provide file system interfaces to
         | workloads that need to see S3 as a filesystem.
        
           | hn72774 wrote:
           | S3 is better for large datasets. It's cheaper and handles
           | large file sizes with ease.
           | 
           | It has become a de-facto standard for distributed, data-
           | intensive workloads like those common with spark.
           | 
           | A key benefit is decoupling the data from the compute so that
           | they can scale independently. EBS is tightly coupled to iops
           | and you pay extra for that.
           | 
           | (Source: a long time working in data engineering)
        
             | albert_e wrote:
             | Yes and I also believe:
             | 
             | Experienced Spark / Data Engineering teams would not assume
             | S3 is readily useable as a filesystem.
             | 
             | This [1] seems like a good guide on how to configure spark
             | for working with Cloud object stores, while recognizing the
             | limitations and pitfalls.
             | 
             | [1]: https://spark.apache.org/docs/latest/cloud-
             | integration.html
             | 
             | ---
             | 
             | Amazon EMR offers a managed way to run Hadoop or Spark
             | clusters, and it implements an "EMRFS" [2] system to
             | interface with S3 as storage.
             | 
             | [2]:
             | https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-
             | fs.h...
             | 
             | AWS Glue is another option, which is "serverless" ETL.
             | Source and destination can be S3 data lakes read through a
             | data catalog (Hive or the Glue Data Catalog). During
             | processing, AWS Glue can optionally use S3 [3,4,5] for
             | shuffle partitions.
             | 
             | [3]: https://aws.amazon.com/blogs/big-data/introducing-
             | amazon-s3-...
             | 
             | [4]: https://docs.aws.amazon.com/glue/latest/dg/monitor-
             | spark-shu...
             | 
             | [5]: https://aws.amazon.com/blogs/big-data/introducing-the-
             | cloud-...
        
         | 8n4vidtmkvmk wrote:
         | I still don't understand why you'd want to do it in the first
         | place. Just buy some contiguous storage.
        
       | zmmmmm wrote:
       | The limitations of S3 (and all the cloud "file systems") are
       | quite astonishing when you consider you're paying for it as a
       | premium service.
       | 
       | Try to imagine your astonishment if a traditional storage vendor
       | showed up and told you that their very expensive premium file
       | system they had just sold you:
       | 
       |   - can't store log files, because it can't append anything to
       |     an existing file
       |   - can't copy files larger than 5GB
       |   - can't rename or move a file
       | 
       | When challenged on how you are supposed to make all your
       | applications work with limitations like that, they glibly told
       | you "oh you're supposed to rewrite them all".
        
         | umanwizard wrote:
         | Amazon doesn't market S3 as a replacement for file systems,
         | that's why EBS exists.
         | 
         | Also, is S3 really "very expensive"? Relative to what?
        
           | vbezhenar wrote:
           | S3 is usually the cheapest storage, not only on Amazon but
           | on other clouds too. I don't understand why.
        
             | ForHackernews wrote:
             | This is not true in my experience
             | https://www.backblaze.com/cloud-storage/pricing
        
               | kiwijamo wrote:
               | That Backblaze page (not surprisingly) compares their
               | prices to a fairly expensive S3 pricing tier and makes
               | other assumptions in Backblaze's favour. For some use
               | cases B2 is more expensive, e.g. one copy of my backups
               | goes to S3 Glacier Deep Archive, which is really cheap.
        
         | throwaway290 wrote:
         | It's for building things on top. If you want to
         | rename/move/copy data, implement a layer that maps objects to
         | "filenames" or any metadata you like (or use some lib). If you
         | want to write logs, implement append and rotation. But I for
         | example don't and won't need any of that and if it helps keep
         | the API simpler and more reliable then I benefit.
         | 
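         | For logs, for example, "append" can just mean shipping each
         | rotated segment as its own immutable object - a rough sketch
         | with the AWS SDK for Go v2, where the bucket name and key
         | scheme are made up:
         | 
         |   package main
         |   
         |   import (
         |       "bytes"
         |       "context"
         |       "fmt"
         |       "log"
         |       "time"
         |   
         |       "github.com/aws/aws-sdk-go-v2/aws"
         |       "github.com/aws/aws-sdk-go-v2/config"
         |       "github.com/aws/aws-sdk-go-v2/service/s3"
         |   )
         |   
         |   // shipSegment uploads one rotated log segment as its own
         |   // object. Readers list the prefix and concatenate segments
         |   // in key order; nothing is ever appended in place.
         |   func shipSegment(ctx context.Context, client *s3.Client, bucket, app string, segment []byte) error {
         |       // Fixed-width UTC timestamp so keys sort chronologically.
         |       key := fmt.Sprintf("logs/%s/%s.log", app,
         |           time.Now().UTC().Format("2006-01-02T15-04-05.000000000"))
         |       _, err := client.PutObject(ctx, &s3.PutObjectInput{
         |           Bucket: aws.String(bucket),
         |           Key:    aws.String(key),
         |           Body:   bytes.NewReader(segment),
         |       })
         |       return err
         |   }
         |   
         |   func main() {
         |       ctx := context.Background()
         |       cfg, err := config.LoadDefaultConfig(ctx)
         |       if err != nil {
         |           log.Fatal(err)
         |       }
         |       client := s3.NewFromConfig(cfg)
         |       // Pretend this buffer was filled by the application and
         |       // has just been rotated.
         |       segment := []byte("2024-03-10 12:00:01 request handled\n")
         |       if err := shipSegment(ctx, client, "my-log-bucket", "api", segment); err != nil {
         |           log.Fatal(err)
         |       }
         |   }
         | 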
         | Being a conventional filesystem would make S3 either a very
         | leaky abstraction or a completely different product.
        
         | Cthulhu_ wrote:
         | They're not filesystems though, they're object storage, or
         | key/value storage if you will. It's intended to store log
         | files for the long term once they're full.
         | 
         | You can rename / move a file, but it involves copying and
         | deleting the original; I don't understand why they don't have a
         | shortcut for that, but it probably makes sense that the user of
         | the service is aware of the process instead of hiding it.
         | 
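         | That copy-then-delete "rename" looks something like this with
         | the AWS SDK for Go v2 (bucket and key names made up; note that
         | it isn't atomic - both keys exist for a moment, and a failure
         | in between leaves two copies):
         | 
         |   package main
         |   
         |   import (
         |       "context"
         |       "log"
         |   
         |       "github.com/aws/aws-sdk-go-v2/aws"
         |       "github.com/aws/aws-sdk-go-v2/config"
         |       "github.com/aws/aws-sdk-go-v2/service/s3"
         |   )
         |   
         |   // rename does what S3 actually does under the hood:
         |   // server-side copy, then delete the original.
         |   func rename(ctx context.Context, client *s3.Client, bucket, oldKey, newKey string) error {
         |       _, err := client.CopyObject(ctx, &s3.CopyObjectInput{
         |           Bucket: aws.String(bucket),
         |           Key:    aws.String(newKey),
         |           // Keys with special characters need URL-encoding here.
         |           CopySource: aws.String(bucket + "/" + oldKey),
         |       })
         |       if err != nil {
         |           return err
         |       }
         |       _, err = client.DeleteObject(ctx, &s3.DeleteObjectInput{
         |           Bucket: aws.String(bucket),
         |           Key:    aws.String(oldKey),
         |       })
         |       return err
         |   }
         |   
         |   func main() {
         |       ctx := context.Background()
         |       cfg, err := config.LoadDefaultConfig(ctx)
         |       if err != nil {
         |           log.Fatal(err)
         |       }
         |       client := s3.NewFromConfig(cfg)
         |       if err := rename(ctx, client, "my-bucket", "reports/draft.csv", "reports/final.csv"); err != nil {
         |           log.Fatal(err)
         |       }
         |   }
         | 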
         | I'm not sure about the 5GB limit; it's probably documented
         | somewhere why that is. Possibly, like tweets, having an upper
         | limit helps them optimize things. Anyway, there are tools for
         | that too: you can do multipart copies, and there's an official
         | blog post on the subject:
         | https://aws.amazon.com/blogs/storage/copying-objects-greater...
         | 
         | Interesting to note, maybe, in the context of the post: copy,
         | rename, moving large files - all that _could_ be abstracted
         | away, but that would hide the underlying logic - which might
         | lead to inefficient usage of the service - and worse, make
         | users _think_ it's just a filesystem and use it accordingly,
         | but it's not intended or designed for that use case.
        
           | gray_-_wolf wrote:
           | The current limit is 5TB. The 5GB limit is for a single
           | upload; you can however do a multipart upload to get up to
           | the maximum size of 5TB.
           | 
           | https://aws.amazon.com/s3/faqs/
        
         | ozim wrote:
         | These "file systems" are not file systems and I don't
         | understand why people expect them to be.
         | 
         | Some people are creating tools that make those services easier
         | to synch with file systems but that is not intended use anyway.
        
         | inkyoto wrote:
         | S3 is object storage, not a file system. The file system
         | service in AWS is called EFS. S3 is not positioned as a
         | substitute for file systems, either.
        
         | pjc50 wrote:
         | It's not a filesystem, but it has _better_ semantics for
         | distributed operation because of it. Nobody talks about the
         | locking semantics of S3, because it's at the blob level; that
         | rules out whole categories of problems.
         | 
         | And that's also why you can't append. If you had multiple
         | readers while appending, and appending to multiple replicas,
         | guaranteeing that each reader would see a consistent only-
         | forwards read of the append is extremely hard. So simply ban
         | people from doing that and force them to use a different system
         | designed for the purpose of logging.
         | 
         | Microservices. S3 is for blobs. If you want something that
         | isn't a blob, use a different microservice.
        
       | hiAndrewQuinn wrote:
       | I feel like I understand the lasting popularity of the humble FTP
       | fileserver a bit better now. Thank you.
        
         | jugg1es wrote:
         | oh but amazon offers SFTP on top of S3 so you don't have to
         | miss out.
        
           | hiAndrewQuinn wrote:
           | If it's offered on top of S3, though, doesn't it still have
           | all the same issues of needing to totally overwrite files?
        
       | globular-toast wrote:
       | A filesystem is an abstraction built on a block device. A block
       | device just gives you a massive array of bytes and lets you
       | read/write from them in blocks (e.g. write these 300 bytes at
       | position 273041).
       | 
       | A block device itself is an abstraction built on real hardware.
       | "Write these 300 bytes" really means something like "move needle
       | on platter 2 to position 6... etc"
       | 
       | S3 is just a different abstraction that is also built on raw
       | storage somehow. It's a strictly flat key-object store. That's
       | it. I don't know why people have a problem with this. If you need
       | "filesystem stuff" then implement it in your app, or use a
       | filesystem. You only need to append? Use a database to keep track
       | of the chain of appends and store the chunks in S3. Doesn't work
       | for you? Use something else. Need to "copy"? Make a new reference
       | to the same object in your db. Doesn't work for you? Use
       | something else.
       | 
       | S3 works for a lot of people. Stop trying to make it something
       | else.
       | 
       | And stop trying to change the meaning of super well-established
       | names in your field. A filesystem is described in text books
       | everywhere. S3 is not a filesystem and never claimed to be one.
       | 
       | Oh and please study a bit of operating system design. Just a
       | little bit. It really helps and is great fun too.
        
       | gjvc wrote:
       | JFC, the number of people on this thread missing the difference
       | between object storage and a blocks-and-inodes filesystem is
       | alarming.
        
       | nickcw wrote:
       | Great article - would have been useful to read before starting
       | out on the journey of making rclone mount (mount your cloud
       | storage via fuse)!
       | 
       | After a lot of iterating we eventually came up with the VFS layer
       | in rclone which adapts S3 (or any other similar storage system
       | like Google Cloud Storage, Azure Blob, Openstack Swift, Oracle
       | Object Storage, etc) into a POSIX-ish file system layer in
       | rclone. The actual rclone mount code is quite a thin layer on top
       | of this.
       | 
       | The VFS layer has various levels of compatibility; the lowest,
       | "off", just does directory caching. In this mode, as the article
       | states, you can't read and write to a file simultaneously, you
       | can't write to the middle of a file, and you can only write files
       | sequentially. Surprisingly, quite a lot of things work OK with
       | these limitations. The next level up is "writes" - this supports
       | nearly all the POSIX features that applications want like being
       | able to read and write to the same file at the same time, write
       | to the middle of the file, etc. The cost for that though is a
       | local copy of the file which is uploaded asynchronously when it
       | is closed.
       | 
       | Here are some docs for the VFS caching modes - these mirror the
       | limitations in the article nicely!
       | 
       | https://rclone.org/commands/rclone_mount/#vfs-file-caching
       | 
       | By default S3 doesn't have real directories either. This means
       | you can't have a directory with no files in it, and directories
       | don't have valid metadata (like modification time). You can
       | create zero length files ending in / which are known as directory
       | markers and a lot of tools (including rclone) support these. Not
       | being able to have empty directories isn't too much of a problem
       | normally as the VFS layer fakes them and most apps then write
       | something into their empty directories pretty quickly.
       | 
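       | Creating one of those directory markers is just a zero-byte PUT
       | whose key ends in "/" - a minimal sketch with the AWS SDK for Go
       | v2 (bucket and prefix made up):
       | 
       |   package main
       |   
       |   import (
       |       "bytes"
       |       "context"
       |       "log"
       |   
       |       "github.com/aws/aws-sdk-go-v2/aws"
       |       "github.com/aws/aws-sdk-go-v2/config"
       |       "github.com/aws/aws-sdk-go-v2/service/s3"
       |   )
       |   
       |   func main() {
       |       ctx := context.Background()
       |       cfg, err := config.LoadDefaultConfig(ctx)
       |       if err != nil {
       |           log.Fatal(err)
       |       }
       |       client := s3.NewFromConfig(cfg)
       |   
       |       // A "directory marker": a zero-length object whose key
       |       // ends in "/". Tools that understand the convention show
       |       // photos/2024/ as an empty directory even though S3 has
       |       // no such concept.
       |       _, err = client.PutObject(ctx, &s3.PutObjectInput{
       |           Bucket: aws.String("my-bucket"),
       |           Key:    aws.String("photos/2024/"),
       |           Body:   bytes.NewReader(nil),
       |       })
       |       if err != nil {
       |           log.Fatal(err)
       |       }
       |   }
       | 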
       | So it is really quite a lot of work trying to convert something
       | which looks like S3 into something which looks like a POSIX file
       | system. There is a whole lot of smoke and mirrors behind the
       | scenes when things like renaming an open file happen, and other
       | nasty corner cases like that.
       | 
       | Rclone's lower level move/sync/copy commands don't bother though
       | and use the S3 API pretty much as-is.
       | 
       | If I could change one thing about S3's API I would like an option
       | to read the metadata with the listings. Rclone stores
       | modification times of files as metadata on the object and there
       | isn't a bulk way of reading these, you have to HEAD the object.
       | Or alternatively a way of setting the Last-Modified on an object
       | when you upload it would do too.
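       | 
       | In SDK terms that's one HEAD request per key on top of the
       | listing - sketched here with the AWS SDK for Go v2, where the
       | bucket, the key and the idea that the modification time lives
       | under a user-metadata entry such as "mtime" are illustrative:
       | 
       |   package main
       |   
       |   import (
       |       "context"
       |       "log"
       |   
       |       "github.com/aws/aws-sdk-go-v2/aws"
       |       "github.com/aws/aws-sdk-go-v2/config"
       |       "github.com/aws/aws-sdk-go-v2/service/s3"
       |   )
       |   
       |   func main() {
       |       ctx := context.Background()
       |       cfg, err := config.LoadDefaultConfig(ctx)
       |       if err != nil {
       |           log.Fatal(err)
       |       }
       |       client := s3.NewFromConfig(cfg)
       |   
       |       // Listing alone won't return user metadata, so a tool
       |       // that stores modification times there has to HEAD every
       |       // key it lists.
       |       head, err := client.HeadObject(ctx, &s3.HeadObjectInput{
       |           Bucket: aws.String("my-bucket"),
       |           Key:    aws.String("backups/photo.jpg"),
       |       })
       |       if err != nil {
       |           log.Fatal(err)
       |       }
       |       log.Printf("Last-Modified (set by S3): %v", aws.ToTime(head.LastModified))
       |       log.Printf("user metadata (e.g. a stored mtime): %v", head.Metadata)
       |   }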
        
         | Hakkin wrote:
         | > If I could change one thing about S3's API I would like an
         | option to read the metadata with the listings. Rclone stores
         | modification times of files as metadata on the object and there
         | isn't a bulk way of reading these, you have to HEAD the object.
         | Or alternatively a way of setting the Last-Modified on an
         | object when you upload it would do too.
         | 
         | I wonder if you couldn't hack this in by storing the metadata
         | in the key name itself? Obviously with the key length limit of
         | 1024 you would be limited in how much metadata you could store,
         | but it's still quite a lot of space, even taking into account
         | the file path. You could use a delimiter that would be
         | invalid in a normalized path, like '//', for example:
         | /path/to/file.txt//mtime=1710066090
         | 
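         | The encoding and decoding side of that is trivial - a sketch,
         | using the "//" convention and mtime format suggested above:
         | 
         |   package main
         |   
         |   import (
         |       "fmt"
         |       "strings"
         |   )
         |   
         |   // encodeKey appends the mtime to the key itself, using "//"
         |   // as a separator that a normalized path never contains.
         |   func encodeKey(path string, mtime int64) string {
         |       return fmt.Sprintf("%s//mtime=%d", path, mtime)
         |   }
         |   
         |   // decodeKey splits a stored key back into path and metadata.
         |   func decodeKey(key string) (path, meta string) {
         |       if i := strings.LastIndex(key, "//"); i >= 0 {
         |           return key[:i], key[i+2:]
         |       }
         |       return key, ""
         |   }
         |   
         |   func main() {
         |       k := encodeKey("/path/to/file.txt", 1710066090)
         |       fmt.Println(k)            // /path/to/file.txt//mtime=1710066090
         |       fmt.Println(decodeKey(k)) // /path/to/file.txt mtime=1710066090
         |   }
         | 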
         | You would still be able to fetch "directories" via prefixes and
         | direct files by using '<filename>//' as the prefix.
         | 
         | This kind of formatting would probably make it pretty
         | incompatible with other software though.
        
           | nickcw wrote:
           | I think that is a nice idea - maybe something we could
           | implement in an overlay backend. However, people really like
           | the fact that the objects they upload with rclone arrive on
           | S3 with the filenames they had originally, so I think the
           | "incompatible with other software" downside would make it
           | unattractive for most users.
        
         | klauspost wrote:
         | > If I could change one thing about S3's API I would like an
         | option to read the metadata with the listings.
         | 
         | Agree. In MinIO (disclaimer: I work there) we added a "secret"
         | parameter (metadata=true) to include metadata and tags in
         | listings if the user has the appropriate permissions. Of course
         | it being an extension it is not really something that you can
         | reliably use. But rclone can of course always try it and use it
         | if available :)
         | 
         | > You can create zero length files ending in /
         | 
         | Yeah. Though you could also consider "shared prefixes" in
         | listings as directories by themselves. That of course makes
         | directories "stateless" and unable to exist if there are no
         | objects in them - which has pros and cons.
         | 
         | > Or alternatively a way of setting the Last-Modified on an
         | object when you upload it would do too.
         | 
         | Yes, that imposes severe limitations on clients. However, it
         | does make the "server" time the reference. But we have to deal
         | with the same limitation for client-side replication/mirroring.
         | 
         | My personal biggest complaint is that there isn't a
         | `HeadObjectVersions` that returns version information for a
         | single object. `ListObjectVersions` is always going to be a
         | "cluster-wide" operation, since you cannot know if the given
         | prefix is actually a prefix or an object key. AWS recently
         | added "GetObjectAttributes" - but it doesn't add version
         | information, which would have fit in nicely there.
        
           | nickcw wrote:
           | > Agree. In MinIO (disclaimer: I work there) we added a
           | "secret" parameter (metadata=true) to include metadata and
           | tags in listings if the user has the appropriate permissions.
           | Of course it being an extension it is not really something
           | that you can reliably use. But rclone can of course always
           | try it and use it if available :)
           | 
           | Is this "secret" parameter documented somewhere? Sounds very
           | useful :-) Rclone knows when it is talking to Minio so we
           | could easily wedge that in.
           | 
           | > My personal biggest complaint is that there isn't a
           | `HeadObjectVersions` that returns version information for a
           | single object. `ListObjectVersions` is always going to be a
           | "cluster-wide" operation, since you cannot know if the given
           | prefix is actually a prefix or an object key
           | 
           | Yes, it is annoying having to do a List just to figure out
           | which object version is being referred to. (Rclone has this
           | problem when using --s3-list-version.)
        
         | glitchcrab wrote:
         | Hey Nick :wave:
        
       | wodenokoto wrote:
       | Is there a generic name for these distributed cloud file
       | storages?
       | 
       | AWS is S3, google is buckets, Azure is blob storage, the open
       | source version is ... ?
        
         | dexwiz wrote:
         | Object Storage
        
           | jeffbr13 wrote:
           | I tend to go by Binary Large OBject (BLOB) storage to
           | distinguish between this kind of object storage and "object"
           | as in OOP. BLOB is also what databases call binary data
           | stored in columns.
        
             | OJFord wrote:
             | When would that be confusing? As in what would an AWS
             | service offering OOP object storage be/mean?
        
         | gilbetron wrote:
         | "blob storage" is the usual generic term, even though Azure
         | uses it explicitly. It's like calling adhesive bandages,
         | "bandaids" even though that is a specific company's term.
        
         | surajrmal wrote:
         | "Google buckets" is a bit off - the product is called Google
         | Cloud Storage. Buckets are also a term used by S3 and are
         | equivalent to Azure Blob Storage containers. They are an
         | intermediary layer that determines attributes for the objects
         | stored within it, such as ACLs and storage class (and
         | therefore cost and performance).
         | 
         | As to your question, object storage[1] seems to be the generic
         | term for the technology. Internally they all rely on naming
         | files based on the hash of their contents for quick lookup,
         | deduplication, and avoiding name clashes.
         | 
         | 1: https://en.wikipedia.org/wiki/Object_storage
        
       | tison wrote:
       | This was discussed in https://github.com/apache/arrow-
       | rs/issues/3888, comparing object_store in Apache Arrow to the
       | APIs provided by Apache OpenDAL.
       | 
       | Briefly, Apache OpenDAL is a library providing FS-like APIs over
       | multiple storage backends, including S3 and many other cloud
       | storage services.
       | 
       | A few database systems, such as GreptimeDB and Databend, use
       | OpenDAL as a better S3 SDK to access data on cloud storage.
       | 
       | Other solutions exist to manage filesystem-like interfaces over
       | S3, including Alluxio and JuiceFS. Unlike Apache OpenDAL, Alluxio
       | and JuiceFS need to be deployed standalone and have a dedicated
       | internal metadata service.
        
         | Lucasoato wrote:
         | I'm not sure if Alluxio could be substituted by OpenDAL as a
         | local cache layer for TrinoDB.
        
       | cynicalsecurity wrote:
       | Backblaze B2 is worth mentioning while we are speaking of S3. I'm
       | absolutely in love with their prices (about a third of S3's).
       | (I'm not their representative.)
        
         | silvertaza wrote:
         | With every alternative, the prevailing issue is that your data
         | is only as safe as the company it is with. But I think this
         | can be remedied by keeping a second backup elsewhere.
        
           | didgeoridoo wrote:
           | B2 having an S3-compatible API available makes this
           | particularly easy :)
        
           | OJFord wrote:
           | Backblaze is like if Amazon spun AWS S3 out as its own
           | business (and it added some backup helper tooling as a
           | result) though, I wouldn't really worry any more about it.
           | You could write a second copy to S3 Glacier Deep Archive
           | (using B2 for instant access when you wanted to restore or on
           | a new device) and still be much cheaper.
        
         | overstay8930 wrote:
         | We liked B2, but not enough to pay for IPv4 addresses. It's
         | insane that they advertise as a multi-cloud solution but
         | basically kill any chance at adoption when NAT gateways and
         | IPv4 charges are everywhere. We would literally save money
         | paying B2 bandwidth fees (high read, low write), but not when
         | being pushed through a NAT64 gateway or paying an hourly
         | charge just to be able to access B2.
        
           | Kwpolska wrote:
           | How could they launch a cloud service like this and not have
           | IPv6 in 2015? What other basic things did they cheap out on?
        
           | miyuru wrote:
           | I also migrated, after asking for IPv6 for more than 3 years
           | on Reddit.
           | 
           | They don't seem to understand the users of the B2 product;
           | it's almost as if B2 is just a supplementary service to
           | their backup product.
           | 
           | https://www.reddit.com/r/backblaze/comments/ij9y9s/b2_s3_not.
           | ..
        
       | orf wrote:
       | > And listing files is slow. While the joy of Amazon S3 is that
       | you can read and write at extremely, extremely, high bandwidths,
       | listing out what is there is much much slower. Slower than a slow
       | local filesystem
       | 
       | This misses something critical. Yes, s3 has fast reading and
       | writing, but that's not really what makes it _useful_.
       | 
       | What makes it useful _is_ listing. In an unversioned bucket (or
       | one with no delete markers), listing any given prefix is
       | essentially constant time: I can take any given string, in a
       | bucket with 100 billion objects, and say "give me the next 1000
       | keys alphabetically that come after this random string".
       | 
       | What's more, using "/" as a delimiter is just the default - you
       | can use any character you want and get a set of common prefixes.
       | There are no "directories", "directories" are created out of thin
       | air on demand.
       | 
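       | Concretely, that looks something like this with the AWS SDK for
       | Go v2 - the bucket, the keys and the "-" delimiter are made up,
       | and each call returns at most 1000 entries:
       | 
       |   package main
       |   
       |   import (
       |       "context"
       |       "fmt"
       |       "log"
       |   
       |       "github.com/aws/aws-sdk-go-v2/aws"
       |       "github.com/aws/aws-sdk-go-v2/config"
       |       "github.com/aws/aws-sdk-go-v2/service/s3"
       |   )
       |   
       |   func main() {
       |       ctx := context.Background()
       |       cfg, err := config.LoadDefaultConfig(ctx)
       |       if err != nil {
       |           log.Fatal(err)
       |       }
       |       client := s3.NewFromConfig(cfg)
       |   
       |       // "Give me the keys that sort after this arbitrary
       |       // string", grouped by a delimiter of my choosing - no
       |       // directories involved.
       |       out, err := client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
       |           Bucket:     aws.String("my-bucket"),
       |           StartAfter: aws.String("invoices/2024-03-09"),
       |           Delimiter:  aws.String("-"),
       |       })
       |       if err != nil {
       |           log.Fatal(err)
       |       }
       |       for _, p := range out.CommonPrefixes {
       |           fmt.Println("prefix:", aws.ToString(p.Prefix))
       |       }
       |       for _, obj := range out.Contents {
       |           fmt.Println("key:", aws.ToString(obj.Key))
       |       }
       |   }
       | 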
       | This is super powerful, and it's the thing that lets you
       | partition your data in various ways, using whatever identifiers
       | you need, without worrying about performance.
       | 
       | If listing were just "slow", couldn't list on key prefixes, _and_
       | got slower in proportion to the number of keys (i.e. a
       | traditional Unix file system), then it wouldn't be useful at all.
        
         | adrian_b wrote:
         | For 30 years now (starting with XFS in 1993, which was inspired
         | by HPFS), all the good UNIX file systems have implemented
         | directories as some kind of B-tree.
         | 
         | Therefore they do not get slower in proportion to the number of
         | entries, and listing based on file prefixes is extremely fast.
        
           | orf wrote:
           | Yes they do. What APIs does Linux offer that allow you to
           | list a directory's contents alphabetically _starting at a
           | specific filename_ in constant time? You have to iterate the
           | _directory_ contents.
           | 
           | You can maybe use "d_off" with readdir in some way, but
           | that's specific to the filesystem. There's no portable way to
           | do this with POSIX.
           | 
           | Regardless of whether you can do it with a single directory, you
           | can't do it for all files recursively under a given prefix.
           | You can't just ignore directories, or say that "for this list
           | request, '-' is my directory separator".
           | 
           | The use of b-trees in file systems is completely beside the
           | point.
        
             | adrian_b wrote:
             | The POSIX API is indeed even older, so it is not helpful.
             | 
             | But as you say, there are filesystem-specific methods or
             | operating-system specific methods to reach the true
             | performance of the filesystem.
             | 
             | It is likely that for maximum performance one would have to
             | write custom directory search functions using the Linux
             | syscalls directly, instead of the standard libc functions,
             | but I would rather do that than pay for S3 or something
             | like it.
        
               | orf wrote:
               | Yes. You could also just use a SQLite table with two
               | columns (path, contents), then just query that. Or do any
               | number of other things.
               | 
               | The question isn't if it's possible, because of course it
               | is, the question is if it's portable and well supported
               | with the POSIX interface. Because if it's not, then...
        
               | anamexis wrote:
               | > The question isn't if it's possible, because of course
               | it is, the question is if it's portable and well
               | supported with the POSIX interface. Because if it's not,
               | then...
               | 
               | Where did this goalpost come from? S3 is not portable or
               | POSIX compliant.
        
               | orf wrote:
               | From the article we're commenting on, which is comparing
               | the interface of S3 to the POSIX interface. Not any given
               | filesystem + platform specific interface.
        
               | anamexis wrote:
               | The article does not mention POSIX, or anything about
               | listing files, at all.
        
               | zaphar wrote:
               | The article starts out by making a comparison between the
               | posix api filesystem calls and S3's api. The context is
               | very much a comparison between those two api surface
               | areas.
        
               | orf wrote:
               | It mistakenly mentions UNIX whilst referencing the POSIX
               | filesystem API, and I literally quoted where it talks
               | about listing in my original comment.
        
               | justincormack wrote:
               | There are no specific syscalls that you can use for this.
               | The libc functions and the syscalls are extremely
               | similar.
        
           | nh2 wrote:
           | > listing based on file prefixes is extremely fast
           | 
           | This functionality does not exist to my knowledge.
           | 
           | ext4 and XFS return directory entries in pseudo-random order
           | (due to hashing), not lexicographically.
           | 
           | For an example, see e.g.
           | https://righteousit.wordpress.com/2022/01/13/xfs-
           | part-6-btre...
           | 
           | If you know a way to return lexicographical order directly
           | from the file system, without the need to sort, please link
           | it.
        
           | kbolino wrote:
           | Resolving random file system paths still gets slower
           | proportional to their _depth_, which is not the case for S3,
           | where the prefix is on the entire object key and not just the
           | "basename" part of it, like in a filesystem.
        
         | jacobsimon wrote:
         | What is it about S3 that enables this speed, and why can't
         | traditional Unix file systems do the same?
        
           | orf wrote:
           | S3 doesn't have directories; it can be thought of as a flat,
           | sorted list of keys.
           | 
           | UNIX (and all operating systems) differentiate between a file
           | and a directory. To list the contents of a directory, you
           | need to make an explicit call. That call might return files
           | or directories.
           | 
           | So to list all files recursively, you need to "list, sort,
           | check if an entry is a directory, recurse". This isn't great.
        
             | bradleyjg wrote:
             | Code written against s3 is not portable either. It doesn't
             | support azure or gcp, much less some random proprietary
             | cloud.
        
               | arcfour wrote:
               | I've seen several S3-compatible APIs and there are open-
               | source clients. If anything it's the de-facto standard.
        
               | zaphar wrote:
               | GCP storage buckets implement the S3 API. You can treat
               | them like they were an S3 bucket - something I do all
               | the time.
        
               | cuno wrote:
               | Actually we've found it's often much worse than that.
               | Code written against AWS S3 using the AWS SDK often
               | doesn't work on a great many "S3-compatible" vendors
               | (including on-prem versions). Although there's
               | documentation on S3, it's vague in many ways, and the AWS
               | SDKs rely on actual AWS behaviour. We've had to deal with
               | a lot of commercial and cloud vendors that subtly break
               | things. This includes giant public cloud companies. In
               | one case a giant vendor only failed at high loads, making
               | it appear to "work" until it didn't, because its backoff
               | response was not what the AWS SDK expected. It's been a
               | headache that we've had to deal with for cunoFS, as well as
               | making it work with GCP and Azure. At the big HPC
               | conference Supercomputing 2023, when we mentioned
               | supporting "S3 compatible" systems, we would often be
               | told stories about applications not working with their
               | supposedly "S3 compatible" one (from a mix of vendors).
        
               | yencabulator wrote:
               | Back in 2011 when I was working on making Ceph's RadosGW
               | more S3-compatible, it was pretty common that AWS S3
               | behavior differed from their documentation too. I wrote a
               | test suite to run against AWS and Ceph, just to figure
               | out the differences. That lives on at
               | https://github.com/ceph/s3-tests
        
             | mechanicalpulse wrote:
             | Isn't that a limitation imposed by the POSIX APIs, though,
             | as a direct consequence of the interface's representation
             | of hierarchical filesystems as trees? As you've
             | illustrated, that necessitates walking the tree. Many
             | tools, I suppose, walk the tree via a single thread,
             | further serializing the process. In an admittedly haphazard
             | test, I ran `find(1)` on ext4, xfs, and zfs filesystems and
             | saw only one thread.
             | 
             | I imagine there's at least one POSIX-compatible file system
             | out there that supports another, more performant method of
             | dumping its internal metadata via some system call or
             | another. But then we would no longer be comparing the S3
             | and POSIX APIs.
        
         | aeyes wrote:
         | And if for some reason you need a complete listing along with
         | object sizes and other attributes, you can get one every 24
         | hours with an S3 Inventory report.
         | 
         | That has always been good enough for me.
        
         | tjoff wrote:
         | Is listing really such a key feature that people use it as a
         | database to find objects?
         | 
         | Have not used S3, but that is not how I imagined using it.
        
           | orf wrote:
           | Sure. It's kind of an index - limited to prefix-only
           | searching, but useful.
           | 
           | Say you store uploads associated with a company and a user.
           | You'd maybe naively store them as `[company-uuid]/[user-
           | id].[timestamp]`.
           | 
           | If you need to list a given user's (123) uploads after a
           | given date, you'd list keys after
           | `[company-uuid]/123.[date]`. If you need to list all of that
           | user's uploads, you'd list the prefix `[company-uuid]/123.`.
           | If you need to get the set of all users who have photos,
           | you'd list `[company-uuid]/` with a Delimiter set to `.`
           | 
           | The point is that it's flexible and with a bit of thought it
           | allows you to "remove all a users uploads between two dates",
           | "remove all a companies uploads" or "remove all a users
           | uploads" with a single call. Or whatever specific stuff is
           | important to your use-case, that might otherwise need a
           | separate DB.
           | 
           | It's not perfect - you can't reverse the listing (i.e you
           | can't get the _latest_ photo for a given user by sorting
           | descending, for example), and it needs some thought about
           | your key structure.
        
             | tjoff wrote:
             | But surely you need to track that elsewhere anyway?
             | 
             | That some niche edge case runs efficiently doesn't sound
             | like a defining feature of S3. On the contrary, many common
             | operations map terribly to S3, so you kind of need the
             | logic to be elsewhere.
        
               | orf wrote:
               | My overall point can be summarised as this:
               | 
               | - Listing things is a very common operation to do.
               | 
               | - The POSIX API and the directory/file hierarchy it
               | provides is a restrictive one.
               | 
               | - S3 does not suffer from this, you can recursively list
               | and group keys into directories at "list time".
               | 
               | - If you find yourself needing to list gigantic numbers
               | of keys in one go, you can do better by only listing a
               | subset. S3 isn't a filesystem, you shouldn't need to list
               | 1k+ keys sequentially apart from during maintenance
               | tasks.
               | 
               | - This is actually quite fast, compared to alternatives.
               | 
               | Whether or not you see a use case for this is sort of
               | irrelevant: they exist. It's what allows you to easily
               | put data into S3 and flexibly group/scan it by specific
               | attributes.
        
               | tjoff wrote:
               | Listing things is very common, so why would you outsource
               | that to S3 when all your bookkeeping is elsewhere? It's
               | not like you would ever rely on the POSIX API for that
               | anyway, even for when your files actually are on a POSIX
               | filesystem.
               | 
               | For sure, for maintenance tasks etc. it sounds quite
               | useful. And good hygiene with prefixes sounds like a sane
               | idea. But listing being a critical part of what "makes S3
               | useful"? That seems like an huge stretch that your points
               | don't seem to address.
        
               | orf wrote:
               | > It's not like you would ever rely on the POSIX API for
               | that anyway, even for when your files actually are on a
               | POSIX filesystem.
               | 
               | Because there _is no_ POSIX api for this. Depending on
               | your requirements and query patterns, you may not need a
               | completely separate database that you need to keep in
               | sync.
        
               | kbolino wrote:
               | > But surely you need to track that elsewhere anyway?
               | 
               | Why? If the S3 structure and listing is sufficient, I
               | don't need to store anything else anywhere else.
               | 
               | Many use cases may involve other requirements that S3
               | can't meet, such as being able to find the same object
               | via different keys, or being able to search through the
               | metadata fields. However, if the requirements match up
               | with S3's structure, then additional services are
               | unnecessary and keeping them in sync with S3 is more
               | hassle than it's worth.
        
               | tjoff wrote:
               | I agree, but something as simple (in functionality) as
               | that ought to be an edge-case. Not a defining feature of
               | S3.
        
               | dekhn wrote:
               | it's a property of the system that I, as an architect,
               | would seriously consider as part of my system's design.
               | I've worked with many systems where iterating over items
               | in order starting from a prefix is extremely cheap
               | (sstables).
        
               | orf wrote:
               | It's fundamental to how S3 works and its ability to
               | scale, so it is a defining feature of S3.
               | 
               | If you think wider, a bucket itself is just a prefix.
        
               | tjoff wrote:
               | From amazons perspective, sure!
               | 
               | But that's not what we are discussing.
        
           | belter wrote:
           | No. The standard practice is to use a DynamoDB table as the
           | index for your objects in S3.
           | 
           | This article misunderstood S3 and could as well have the
           | title: "An Airplane is not a Car" :-)
        
             | macintux wrote:
             | I don't know that you can characterize that as a "standard
             | practice".
             | 
             | Maybe it's widespread, but I've not encountered it.
        
               | belter wrote:
               | "Building and Maintaining an Amazon S3 Metadata Index
               | without Servers" - https://aws.amazon.com/pt/blogs/big-
               | data/building-and-mainta...
               | 
               | Here is the architecture of Amazon Drive and the storage
               | of metadata.
               | 
               | "AWS re:Invent 2014 | (ARC309) Building and Scaling
               | Amazon Cloud Drive to Millions of Users" -
               | https://youtu.be/R2pKtmhyNoA
               | 
               | And you can see the use here at correct time:
               | https://youtu.be/R2pKtmhyNoA?t=546
        
               | ianburrell wrote:
               | That article is old. DynamoDB was used because of the
               | old, weak consistency model of S3. Writes were atomic,
               | but lists could return old results so needed consistent
               | list of objects.
               | 
               | But in 2020, S3 changed to strong consistency model.
               | There is no need to use DynamoDB now.
        
             | fijiaarone wrote:
             | So in reality S3 takes about 2 seconds to retrieve a single
             | file, under ideal conditions. 1 second round trip for the
             | request to DynamoDB to get the object key of the file and 1
             | second round trip to S3 to get the file contents (assuming
             | no CPU cost on the search because you're getting the key by
             | ID from the DynamoDB in a flat single table store. And that
             | the file has no network IO because it is a trivial number
             | of bytes, so the HTTP header overwhelms the content.)
             | 
             | I know what you're thinking -- 2 seconds, that's faster
             | than I can type the 300-character file key with its pseudo
             | prefixes!
             | 
             | Ah, but what if you wanted to get 2 files from S3?
        
         | calpaterson wrote:
         | I have to say that I'm not hugely convinced. I don't really
         | think that being able to pull out the keys before or after a
         | prefix is particularly impressive. That is the basis for
         | database indices going back to the 1970s after all.
         | 
         | Perhaps the use-cases you're talking about are very different
         | from mine. That's possible of course.
         | 
         | But for me, often the slow speed of listing the bucket gets in
         | the way. Your bucket doesn't have to get very big before
         | listing the keys takes longer than reading them. I seem to
         | remember that listing operations ran at sub-1mbps, but
         | admittedly I don't have a big bucket handy right now to test
         | that.
        
           | orf wrote:
           | It depends on a few factors. The list objects call hides
           | deleted and noncurrent versions, but it has to skip over
           | them. Grouping prefixes also takes time, if they contain a
           | lot of noncurrent or deleted keys.
           | 
           | A pathological case would be a prefix with 100 million
           | deleted keys, and 1 actual key at the end. Listing the parent
           | prefix takes a long time in this case - I've seen it take
           | several minutes.
           | 
           | If your bucket is pretty "normal" and doesn't have this, or
           | isn't versioned, then you can do 4-5 thousand list requests a
           | second, at any given key/prefix, in constant time. Or you
           | can explicitly list object versions (and not skip deleted
           | keys) also in constant time.
           | 
           | It all depends on your data: if you need to list all objects
           | then yeah it's gonna be slow because you need to paginate
           | through all the objects. But the point is that you don't have
           | to do that if you don't want to, unlike a traditional
           | filesystem with a directory hierarchy.
           | 
           | And this enables parallelisation: why list everything
           | sequentially, when you can group the prefixes by some
           | character (i.e "-"), then process each of those prefixes in
           | parallel.
           | 
           | The world is your oyster.
        
           | cuno wrote:
           | We and our customers use S3 as a POSIX filesystem, and we
           | generally find it faster than a local filesystem for many
           | benchmarks. For listing directories we find it faster than
           | Lustre (a real high performance filesystem). Our approach is
           | to first try listing directories with a single ListObjectV2
           | (which on AWS S3 is in lexicographic order) and if it hasn't
           | made much progress, we start listing with parallel
           | ListObjectV2. Once you start parallelising the ListObjectV2
           | (rather than sequentially "continuing") you get massive
           | speedups.
        
             | crabbone wrote:
             | > find it faster than a local filesystem for many
             | benchmarks.
             | 
             | What did you measure? How did you compare? This claim seems
             | _very_ contrary to my experience and understanding of how
             | things work...
             | 
             | Let me refine the question: did you measure metadata or
             | data operations? What kind of storage medium is used by the
             | filesystem you use? How much memory (and subsequently the
             | filesystem cache) does your system have?
             | 
             | ----
             | 
              | The thing is: in the best case you should expect
              | something like 5 ms latency on network calls over the
              | Internet. Within the datacenter, maybe you
             | can achieve sub-ms latency, but that's hard. AWS within
             | region but different zones tends to be around 1 ms latency.
             | 
             | This is while NVMe latency, even on consumer products, is
             | 10-20 _micro_ seconds. I.e. we are talking about roughly
             | 100 times faster than anything going through the network
             | can offer.
        
               | cuno wrote:
               | For AWS, we're comparing against filesystems in the
               | datacenter - so EBS, EFS and FSx Lustre. Compared to
               | these, you can see in the graphs where S3 is much faster
               | for workloads with big files and small files:
               | https://cuno.io/technology/
               | 
               | and in even more detail of different types of EBS/EFS/FSx
               | Lustre here: https://cuno.io/blog/making-the-right-
               | choice-comparing-the-c...
        
               | hnlmorg wrote:
               | EFS is ridiculously slow though. Almost to the point
               | where I fail to see how it's actually useful for any of
               | the traditional use cases for NFS.
        
               | dekhn wrote:
               | if you turn all the EFS performance knobs up (at a high
               | cost), it's quite fast.
        
               | hnlmorg wrote:
               | Fast _er_ , sure. But I wouldn't got so far as to say it
               | is _fast_
        
               | wenc wrote:
               | S3 is really high latency though. I store parquet files
               | on S3 and querying them through DuckDB is much slower
               | than file system because random access patterns. I can
               | see S3 being decent if it's bulk access but definitely
               | not for random access.
               | 
               | This is why there's a new S3 Express offering that is low
               | latency (but costs more).
        
               | crabbone wrote:
               | The tests are very weird...
               | 
               | Normally, from someone working in the storage, you'd
               | expect tests to be in IOPS, and the goto tool for
               | reproducible tests is FIO. I mean, of course
               | "reproducibility" is a very broad subject, but people are
               | so used to this tool that they develop certain intuition
               | and interpretation for it / its results.
               | 
               | On the other hand, seeing throughput figures is kinda...
               | it tells you very little about how the system performs.
               | Just to give you some reasons: a system can be configured
               | to do compression or deduplication on client / server,
               | and this will significantly impact your throughput,
               | depending on what do you actually measure: the amount of
               | useful information presented to the user or the amount of
               | information transferred. Also throughput at the expense
               | of higher latency may or may not be a good thing...
               | Really, if you ask anyone who ever worked on a storage
               | product about how they could crank up throughput numbers,
               | they'd tell you: "write bigger blocks asynchronously".
               | This is the basic recipe, if that's what you want.
               | Whether this makes a good all around system or not... I'd
               | say, probably not.
               | 
               | Of course, there are many other concerns. Data
               | consistency is a big one, and this is a typical tradeoff
               | when it comes to choosing between object store and a
               | filesystem, since filesystem offers more data consistency
               | guarantees, whereas object store can do certain things
               | faster, while breaking them.
               | 
               | BTW, I don't think most readers would understand Lustre
               | and similar to be the "local filesystem", since it
               | operates over network and network performance will have a
               | significant impact, of course, it will also put it in the
               | same ballpark as other networked systems.
               | 
               | I'd also say that Ceph is kinda missing from this
               | benchmark... Again, if we are talking about filesystem on
               | top of object store, it's the prime example...
        
               | cuno wrote:
               | IOPS is a really lazy benchmark that we believe can
               | greatly diverge from most real life workloads, except for
               | truly random I/O in applications such as databases. For
               | example, in Machine Learning, training usually consists
               | of taking large datasets (sometimes many PBs in scale),
               | randomly shuffling them each Epoch, and feeding them into
               | the engine as fast as possible. Because of this, we see
               | storage vendors for ML workloads concentrate on IOPS
               | numbers. The GPUs however only really care about
               | throughput. Indeed, we find a great many applications
               | only really care about the throughput, and IOPS is only
               | relevant if it helps to accomplish that throughput. For
               | ML, we realised that the shuffling isn't actually random
               | - there's no real reason for it to be random versus
               | pseudo-random. And if its pseudo-random then it is
               | predictable, and if its predictable then we can exploit
               | that to great effect - yielding a 60x boost in throughput
               | on S3, beating out a bunch of other solutions. S3 is not
               | going to do great for truly random I/O, however, we find
               | that most scientific, media and finance workloads are
               | actually deterministic or semi-deterministic, and this is
               | where cunoFS, by peering inside each process, can better
               | predict intra-file and inter-file access patterns, so
               | that we can hide the latencies present in S3. At the end
               | of the day, the right benchmark is the one that reflects
               | real world usage of applications, but that's a lot of
               | effort to document one by one.
               | 
               | I agree that things like dedupe and compression can
               | affect things, so in our large file benchmarks each file
               | is actually random. The small file benchmarks aren't
               | affected by "write bigger blocks" because there's nothing
               | bigger than the file itself. Yes, data consistency can be
               | an issue, and we've had to do all sorts of things to
               | ensure POSIX consistency guarantees beyond what S3 (or
               | compatible) can provide. These come with restrictions
               | (such as on concurrent writes to the same file on
               | multiple nodes), but so does NFS. In practice, we
               | introduced a cunoFS Fusion mode that relies on a
               | traditional high-IOPS filesystem for such workloads and
               | consistency (automatically migrating data to that tier),
               | and high throughput object for other workloads that don't
               | need it.
        
             | supriyo-biswas wrote:
             | > Once you start parallelising the ListObjectV2 (rather
             | than sequentially "continuing")
             | 
             | How are you "parallelizing" the ListObjectsV2? The
             | continuation token can be only fed in once the previous
             | ListObjectsV2 response has completed, unless you know the
             | name or structure of keys ahead of time, in which listing
             | objects isn't necessary.
        
               | cuno wrote:
               | For example, you can do separate parallel ListObjectsV2
               | calls for keys starting a-f, g-k, etc., covering the whole
               | key space. You can parallelize recursively based on what
               | is found in the first 1000 entries so that it matches the
               | statistics of the keys. Yes there may be pathological
               | cases, but in practice we find this works very well.
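               |
               | A rough sketch of that with boto3 (bucket name is made up;
               | in practice we adapt the split points to the observed key
               | statistics rather than using fixed per-character prefixes):
               |
               |     import string
               |     from concurrent.futures import ThreadPoolExecutor
               |     import boto3
               |
               |     s3 = boto3.client("s3")
               |     BUCKET = "my-bucket"  # placeholder
               |
               |     def list_prefix(prefix):
               |         # Each worker pages through its own slice of the
               |         # key space independently.
               |         keys = []
               |         paginator = s3.get_paginator("list_objects_v2")
               |         for page in paginator.paginate(Bucket=BUCKET,
               |                                        Prefix=prefix):
               |             keys += [o["Key"] for o in page.get("Contents", [])]
               |         return keys
               |
               |     prefixes = string.ascii_lowercase + string.digits
               |     with ThreadPoolExecutor(max_workers=16) as pool:
               |         all_keys = [k for chunk in pool.map(list_prefix, prefixes)
               |                     for k in chunk]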
        
               | johnmaguire wrote:
               | You're right that it won't work for all use cases, but
               | starting two threads with prefixes A and M, for example,
               | is one way you might achieve this.
        
             | fijiaarone wrote:
             | If you think s3 is fast, you should try FTP. It's at least
             | a hundred times faster. And combined with rsync, dozens of
             | times more reliable.
        
         | hayd wrote:
         | You can set up CloudWatch events to trigger a Lambda function
         | that stores metadata about the S3 object in a regular database.
         | That way you can index it however you expect to list it.
         | 
         | Very effective for our use case.
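         |
         | A minimal sketch of such a handler (shown with DynamoDB, but any
         | database works; table and field names are made up, and it assumes
         | the S3 event notification shape - the EventBridge shape differs
         | slightly):
         |
         |     import boto3
         |
         |     # One row per object, so listings become database queries
         |     # instead of ListObjectsV2 calls. Table name is a placeholder.
         |     table = boto3.resource("dynamodb").Table("s3-object-index")
         |
         |     def handler(event, context):
         |         for record in event["Records"]:
         |             obj = record["s3"]["object"]
         |             table.put_item(Item={
         |                 "bucket": record["s3"]["bucket"]["name"],
         |                 "key": obj["key"],  # note: URL-encoded in the event
         |                 "size": obj.get("size", 0),
         |                 "event": record["eventName"],
         |             })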
        
         | foldr wrote:
         | >What makes it useful is listing.
         | 
         | I think 99% of S3 usage just consists of retrieving objects
         | with known keys. It seems odd to me to consider prefix listing
         | as a key feature.
        
           | bostik wrote:
           | When you embed the relevant (not necessarily that of object
           | creation) timestamp as a prefix, it sure becomes one. Whether
            | that prefix is part of the "path"
            | (object/path/prefix/with/<4-digit year>/) or directly part of
            | the basename (object/path/prefix/to/app-
            | specific/files/<4-digit year>-<2-digit month>-....), being
           | able to limit the search space server-side becomes incredibly
           | useful.
           | 
           | You can try it yourself: list objects in a bucket prefix with
           | _lots_ of files, and measure the time it takes to list all of
           | them vs. the time it takes to list only a subset of them that
           | share a common prefix.
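            |
            | With boto3 the comparison looks roughly like this (bucket and
            | prefix are placeholders):
            |
            |     import time
            |     import boto3
            |
            |     s3 = boto3.client("s3")
            |
            |     def count_keys(bucket, prefix=""):
            |         # Pages through everything under the prefix, filtered
            |         # server-side.
            |         n = 0
            |         paginator = s3.get_paginator("list_objects_v2")
            |         for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            |             n += len(page.get("Contents", []))
            |         return n
            |
            |     for prefix in ["", "app/logs/2024/"]:
            |         t0 = time.monotonic()
            |         n = count_keys("my-bucket", prefix)  # placeholder bucket
            |         print(prefix or "(whole bucket)", n, time.monotonic() - t0)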
        
         | gamache wrote:
         | > ...listing any given prefix is essentially constant time: I
         | can take any given string, in a bucket with 100 billion
         | objects, and say "give me the next 1000 keys alphabetically
         | that come after this random string".
         | 
         | I'm not sure we agree on the definition of "constant time"
         | here. Just because you get 1000 keys in one network call
         | doesn't imply anything about the complexity of the backend!
        
           | orf wrote:
            | Constant time regardless of the number of objects in the
            | bucket and regardless of the initial starting position of
            | your list request.
        
             | hobobaggins wrote:
              | The technical implementation is indeed impressive in that
              | it operates more or less in constant time, but probably
              | very few use cases actually fit that narrow window, so this
              | technical strength is moot when it comes to actual usage.
             | 
             | Since each request is dependent upon the position received
             | in the last request, 1000 arbitrary keys on your 3rd or
             | 1000th attempt doesn't really help unless you found your
             | needle in the haystack in _that_ request (and in that case
             | the rest of that 1000 key listing was wasted.)
        
               | orf wrote:
               | You're assuming you're paginating through all objects
               | from start to finish.
               | 
               | A request to list objects under "foo/" is a request to
                | list all objects starting with "foo/", which is constant
                | time regardless of the number of keys before it. Same
               | applies for "foo/bar-", or any other list request for any
               | given prefix. There are no directories on s3.
        
         | nh2 wrote:
         | The key difference between lexicographically keyed flat
         | hierarchies, and directory-nested filesystem hierarchies,
          | becomes clear based on this example:
          |             dir1/a/000000
          |             dir1/a/...
          |             dir1/a/999999
          |             dir1/b
          |
          | On a proper hierarchical file system with directories as
          | tree interior nodes, `ls dir1/` needs to traverse and return
          | only 2 entries ("a" and "b").
          |
          | A flat string-indexed KV store that only supports lexicographic
          | order, without special handling of delimiters, needs to traverse
          | 1 million dirents ("a/000000" through "a/999999") before
          | arriving at "b".
         | 
         | Thus, simple flat hierarchies are much slower at listing the
         | contents of a single dir: O(all recursive children), vs.
         | O(immediate children) on a "proper" filesystem.
         | 
          | Lexicographic strings cannot model multi-level tree structures
          | with the same complexity; this may be what gives S3 its
          | reputation that "listing files is slow".
          |
          | UNLESS you tell the listing algorithm what the delimiter
          | character is (e.g. `/`). Then a lexicographical prefix tree can
          | efficiently skip over all subtrees at the next `/`.
         | 
         | Amazon S3 supports that, with the docs explicitly mentioning
         | "skipping over and summarizing the (possibly millions of) keys
         | nested at deeper levels" in the `CommonPrefixes` field:
         | https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...
         | 
          | I have not tested whether Amazon's implementation actually
          | saves the traversal (or whether it traverses and just returns
          | fewer results), but I'd hope so.
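          |
          | With boto3, the Delimiter parameter is what turns the flat
          | listing into the `ls`-like one; a quick sketch (bucket name is
          | a placeholder):
          |
          |     import boto3
          |
          |     s3 = boto3.client("s3")
          |     resp = s3.list_objects_v2(Bucket="my-bucket",
          |                               Prefix="dir1/", Delimiter="/")
          |
          |     # Subtrees are summarised, not enumerated:
          |     print([p["Prefix"] for p in resp.get("CommonPrefixes", [])])
          |     # -> ["dir1/a/"]
          |     print([o["Key"] for o in resp.get("Contents", [])])
          |     # -> ["dir1/b"]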
        
           | nh2 wrote:
            | For completeness: The original post says:
            |         S3 has no rename or move operation.
            |         Renaming is CopyObject and then DeleteObject.
            |         CopyObject takes linear time to the size of the file(s).
            |         This comes up fairly often when someone has written a
            |         lot of files to the wrong place - moving the files back
            |         is very slow.
           | 
           | This is right:
           | 
           | In a normal file system, renaming a directory is fast O(1),
           | in S3 it's slow O(all recursive children).
           | 
           | And Amazon S3 has not added a delimiter-based function to
           | reduce its complexity, even though that would be easily
           | possible in a lexicographic prefix tree (re-rooting the
           | subtree).
           | 
           | So here the original post has indeed found a case where S3 is
           | much slower than a normal file system.
        
       | finalhacker wrote:
        | S3 does not implement the VFS API, but you can treat it as a
        | software-defined storage filesystem. Just like Ceph.
        |
        | There are so many applications that depend on file storage, such
        | as MySQL, but horizontal scaling for those apps is still difficult
        | in many cases. Replacing the VFS API with S3 storage is perhaps
        | the trend, in my experience.
        
       | igtztorrero wrote:
        | Check out kopia.io: it's backup software that uses S3 to store
        | files as blocks or pages.
        |
        | You can browse, search and sort the files and directories of the
        | different snapshots or versions of a file.
       | 
       | I love it !
       | 
       | For me it's a file system in S3.
       | 
        | Bonus: you must use a key to encrypt the files.
        
       | MatthiasPortzel wrote:
       | This article was an epiphany for me because I realized I've been
       | thinking of the Unix filesystem as if it has two functions:
       | read_file and write_file. (And then getting frustrated with the
       | filesystem APIs in programming languages.)
        
         | markhahn wrote:
         | So you came from an S3 or other put-get world, and found actual
         | filesystems odd?
         | 
         | I suppose that's not so different from a WMP user's epiphany
         | when they discover processes, shells, etc.
        
           | MatthiasPortzel wrote:
           | Well I'm used to an application-level view of the file
           | system.
           | 
           | A document editor or text editor opens files and saves files,
           | but these are whole-document operations. I can't open a
           | document in Sublime Text without reading it, and I can't save
           | part of a file without saving all of it. So it's not obvious
           | that these would be different at an OS level.
           | 
           | As the post points out, there are uses for Unix's sub-file-
           | level read-and-write commands, but I've never needed them.
        
       | arvindamirtaa wrote:
       | Like Gmail is emails but not IMAP. It's fine. We have seen that
       | these kinds of wrappers work pretty well most of the time
       | considering the performance and simplicity they bring in building
       | and managing these systems.
        
       | chrisblackwell wrote:
       | Random note: Has anyone noticed how fast the author's webpage is?
       | I know it's static, but I mean it's fast even for the DNS lookup.
        | I would love to know what they're hosting it on.
        
         | adverbly wrote:
         | The response headers include
         | 
         | server: cloudflare
         | 
          | You said it though - the reason is that it's static without any
          | js/frameworks/SPA round-trip requests.
        
         | overstay8930 wrote:
         | Full stack Cloudflare is really fast
        
         | wooptoo wrote:
          | Could be using Cloudflare Pages hosted on an R2 bucket:
         | https://pages.cloudflare.com/
        
       | alphazard wrote:
       | S3 is not even files, and definitely not a filesystem.
       | 
       | The thing I would expect from a file abstraction is mutability. I
       | should be able to edit pieces of a file, grow it, shrink it, read
       | and write at random offsets. I shouldn't have to go back up to
       | the root, or a higher level concept once I have the file in hand.
        | S3 provides a mutable listing of immutable objects; if I want to
        | do any of the mutability business, I need to make a copy and re-
        | upload.
       | sectors on disk, and presents them to the client as a contiguous
       | buffer. S3 solves a different problem.
       | 
       | Many people misinterpret the Good Idea from UNIX "everything is a
       | file" to mean that everything should look like a contiguous
       | virtual buffer. That's not what the real Good Idea is. Really:
       | everything can be listed in a directory, including directories.
       | There will be base leaves, which could be files, or any object
       | the system wants to present to a process, and there will be
       | recursive trees (which are directories). The directories are what
       | make the filesystem, not the type of a particular leaf. Adding a
       | new type of leaf, like a socket or a frame buffer, or whatever,
       | is almost boring, and doesn't erode the integrity of the real
       | good idea. Adding a different kind of container like a list,
       | would make the structure of the filesystem more complex, and that
       | _would_ erode the conceptual integrity.
       | 
       | S3 doesn't do any of these things, and that's fine. I just want a
       | place to put things that won't fit in the database, and know they
       | won't bitrot when I'm not looking. The desire to make S3 look
       | more like a filesystem comes from client misunderstanding of what
       | it's good at/for, and poor product management indulging that
       | misunderstanding instead of guarding the system from it.
        
         | akerl_ wrote:
         | How do read-only filesystems align with your definition?
        
           | yencabulator wrote:
           | You can't create new things on a read-only filesystem, you
           | can in S3; not a good analogy.
        
         | thinkharderdev wrote:
         | > S3 is not even files, and definitely not a filesystem.
         | 
         | I agree. To me the correct analog for S3 is a block storage
         | device (a very weird one where blocks can be any size and can
         | have a key associated with them) and not a filesystem. A
         | filesystem is an abstraction that sits on top of a block
         | storage device and so an "S3 filesystem" would have to be an
         | abstraction that sits on top of S3 as the underlying block
         | storage.
        
       | sbussard wrote:
       | It's been a while, but I really like the way google handles its
       | file system internally. No confusion.
        
       | remram wrote:
       | I am currently pondering this exact problem. I want to run a
       | file-sharing web application (think: NextCloud) but I don't want
       | to use expensive block storage or the dedicated server's disk
       | space for the files, as some of them will be accessed
       | infrequently.
       | 
       | I am wondering if s3fs/rclone-mount is sufficient, or if I should
       | use something like JuiceFS that adds random-access, renaming, etc
       | on top of it. Are those really necessary APIs for my use case? Is
       | there only one way to find out?
       | 
       | (The app doesn't have native S3 support)
        
         | cuno wrote:
          | It depends on whether you want to expose filesystem semantics or
         | metadata to applications using it. For example random access
         | writes are done by ffmpeg, which is a workhorse of the media
         | industry, but most things can't handle that or are too slow. We
         | had to build our own solution cunoFS to make it work properly
         | at high speeds.
        
       | jkoudys wrote:
       | I absolutely loved this article. Super well written with
       | interesting insights.
        
       | donatj wrote:
       | > And listing files is slow. While the joy of Amazon S3 is that
       | you can read and write at extremely, extremely, high bandwidths,
       | listing out what is there is much much slower. Slower than a slow
       | local filesystem.
       | 
        | I was taken aback by this recently. At my coworker's request, I
       | was putting some work into a script we have to manage assets in
       | S3. It has a cache for the file listing, and my coworker who
       | wrote it sent me his pre-populated cache. My initial thought was
       | "this can't really be necessary" and started poking.
       | 
       | We have ~100,000 root level directories for our individual
       | assets. Each of those have five or six directories with a handful
       | of files. Probably less than a million files total, maybe 3
       | levels deep at its deepest.
       | 
       | Recursively listing these files takes literally fifteen minutes.
       | I poked and prodded suggestions from stack overflow and ChatGPT
       | at potential ways to speed up the process and got nothing
       | notable. That's absurdly slow. Why on earth is it so slow?
       | 
        | Why is this something Amazon has not fixed? From the outside it
        | really seems like they could slap some B-trees on the individual
       | buckets and call it a day.
       | 
       | If it is a difficult problem, I'm sure it would be for
       | fascinating reasons I'd love to hear about.
        
         | returningfory2 wrote:
         | Are you performing list calls sequentially? If you have O(100k)
         | directories and are doing O(100k) requests sequentially, 15
         | minutes works out at O(10ms) per request which doesn't seem
         | that bad? (assuming my math is correct...)
        
           | luhn wrote:
           | At risk of being pedantic, you seem to be using big O to mean
           | "approximately" or "in the order of", but that's not what it
           | means at all. Big O is an expression of the growth rate of a
           | function. Any constant value has a growth rate of 0, so
           | O(100k) isn't meaningful: It's exactly the same as O(1).
        
         | anonymous-panda wrote:
          | I think it's far more mundane a reason. You can list at most
          | 1,000 objects per request and getting the next 1,000 requires
          | the result of the previous request, so it's all serial. That
          | means to list 1M files, you're looking at 1,000 back-to-back
          | requests. Assuming a ping time of 50ms, that's easily 50s of
          | just going back and forth, not including the cost of doing the
          | listing itself on a flat iteration. The cost of a 1,000-item
          | list is about the cost of a write, which is kinda slow.
          | Additionally, I
         | suspect each listing is a strongly consistent snapshot which
         | adds to the cost of the operation (it can be hard to provide an
         | inconsistent view).
         | 
         | I don't think btrees would help unless you're doing directory
         | traversals, but even then I suspect that's not that beneficial
         | as your bottleneck is going to be the network operations and
         | exposed operations. Ultimately, file listing isn't that
         | critical a use case and typically most use cases are
         | accomplished through things like object lifecycles where you
         | tell S3 what you want done and it does it efficiently at the FS
         | layer for you.
        
           | tsimionescu wrote:
            | That's 50s of a 15m duration. I don't think it matters in
            | the least.
        
             | anonymous-panda wrote:
              | Depends how you're iterating. If you're iterating by
             | hierarchy level, then you could easily see this being
             | several orders of magnitude more requests.
        
         | catlifeonmars wrote:
         | S3 is fundamentally a key value store. The fact that you can
         | view objects in "directories" is nothing more than a prefix
         | filter. It is not a file system and has no concept of
         | directories.
        
           | Spivak wrote:
           | If I wanted to use S3 as a filesystem in the manner people
           | are describing I would probably start looking at storing
           | filesystem metadata in a sidecar database so you can get
           | directory listings, permissions bits, xattrs and only have to
           | round-trip to S3 when you need the content.
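            |
            | One possible shape for that sidecar index, sketched with
            | SQLite (schema and names are invented); metadata operations
            | stay local and S3 is only touched for object bodies:
            |
            |     import sqlite3
            |
            |     db = sqlite3.connect("fs-metadata.db")
            |     db.execute("""
            |         CREATE TABLE IF NOT EXISTS entries (
            |             path   TEXT PRIMARY KEY,  -- /photos/2024/img.jpg
            |             parent TEXT NOT NULL,     -- /photos/2024
            |             s3_key TEXT,              -- NULL for directories
            |             size   INTEGER,
            |             mode   INTEGER,           -- permission bits
            |             mtime  REAL,
            |             xattrs TEXT               -- JSON blob
            |         )""")
            |     db.execute("CREATE INDEX IF NOT EXISTS by_parent"
            |                " ON entries(parent)")
            |
            |     def listdir(parent):
            |         # Directory listing without any S3 round trip.
            |         rows = db.execute("SELECT path FROM entries"
            |                           " WHERE parent = ?", (parent,))
            |         return [r[0] for r in rows]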
        
             | SOLAR_FIELDS wrote:
             | Isn't this essentially what systems like Minio and
             | SeaweedFS do with their S3 integrations/mirroring/caching?
             | What you describe sounds a lot like SeaweedFS Filer when
             | backed by S3
        
           | anonymous-panda wrote:
            | Directories make up a hierarchical filesystem, but they're not a
           | necessary condition. A filesystem at its core is just a way
           | of organizing files. If you're storing and organizing files
           | in s3 then it's a filesystem for you. Saying it's
           | "fundamentally a key value store" like it's something
           | different is confusing because a filesystem is just a key
           | value store of path to contents of file.
           | 
           | Indeed there's every reason to believe that a modern file
           | system would perform significantly faster if the hierarchy
           | was implemented as a prefix filter than actually maintaining
           | the hierarchical data structures (at least for most
            | operations). You can guess that this might be the case from
            | the fact that file creation is extremely slow on modern file
            | systems (on the order of hundreds or maybe thousands per
            | second on a modern NVMe disk that can otherwise do millions
            | of IOPS), and that listing the contents of an extremely large
            | directory is exceedingly slow.
        
             | senderista wrote:
             | A real hierarchy makes global constraints easier to scale,
             | e.g. globally unique names or hierarchical access controls.
             | These policies only need to scale to a single node rather
             | than to the whole namespace (via some sort of global
             | index).
        
             | catlifeonmars wrote:
             | In context of the comment I was addressing, it's clear that
             | filesystem means more than just a key value store. I'd
             | argue that this is generally true in common vernacular.
        
               | anonymous-panda wrote:
               | This is a technical website discussing the nuances of
               | filesystems. Common vernacular is how you choose to
               | define it but even the Wikipedia definition says that
               | directories and hierarchy are just one property of some
               | filesystems. That they became the dominant model on local
               | machines doesn't take away from the more general
               | definition that can describe distributed filesystems.
        
         | jamesrat wrote:
         | I implemented a solution by threading the listing. Get the
          | files in the root, then spin up a separate process to do the
         | recursion for each directory.
        
         | perryizgr8 wrote:
          | It's not a good model to think of S3 as having directories in
         | a bucket. It's all objects. The web interface has a visual way
         | of representing prefixes separated by slashes. But that's just
         | a nice way to present the objects. Each object has a key, and
         | that key can contain slashes, and you can think of each segment
         | to be a directory for your ease of mind.
         | 
         | But that illusion breaks when you try to do operations you
         | usually do with/on directories.
        
         | electroly wrote:
         | The way that you said "recursively" and spent a lot of time
         | describing "directories" and "levels" worries me. The fastest
         | way to list objects in S3 wouldn't involve recursion at all;
         | you just list all objects under a prefix. If you're using the
         | path delimiter to pretend that S3 keys are a folder structure
         | (they're not) and go "folder by folder", it's going to be way
         | slower. When calling ListObjectsV2, make sure you are NOT
         | passing "delimiter". The "directories" and "levels" have no
         | impact on performance when you're not using the delimiter
         | functionality. Split the one list operation into multiple
         | parallel lists on separate prefixes to attain any total time
         | goal you'd like.
        
           | petters wrote:
           | Yes, this is very good advice and will likely solve their
           | problem
        
           | blakesley wrote:
           | All these comments saying merely "S3 has no concept of
           | directories" without an explanation (or at least a link to an
           | explanation) are pretty unhelpful, IMO. I dismissed your
           | comment, but then I came upon this later one explaining why:
           | https://news.ycombinator.com/item?id=39660445
           | 
           | After reading that, I now understand your comment.
        
             | electroly wrote:
             | I appreciate you sharing that point of view. There's a
             | "curse of knowledge" effect with AWS where its card-
             | carrying proponents (myself included) lose perspective on
             | how complex it actually is.
        
         | jameshart wrote:
         | A fun corollary of this issue:
         | 
         |  _Deleting_ an S3 bucket is nontrivial!
         | 
         | You can't delete a bucket with objects in it. And you can't
         | just tell S3 to delete all the objects. You need to send
         | individual API requests to S3 to delete each object. Which
         | means sending requests to S3 to list out the objects, 1000 at a
         | time. Which takes time. And those list calls cost money to
         | execute.
         | 
         | This is a good summary of the situation:
         | https://cloudcasts.io/article/deleting-an-s3-bucket-costs-mo...
         | 
         | The fastest way to quickly dispose of an S3 bucket turns out to
         | be to _delete the AWS account it belongs to_.
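          |
          | Roughly what that cleanup loop looks like with boto3 (bucket
          | name is made up; DeleteObjects at least lets you batch 1000
          | keys per call, and versioned buckets additionally need their
          | old versions and delete markers removed):
          |
          |     import boto3
          |
          |     s3 = boto3.client("s3")
          |     BUCKET = "doomed-bucket"  # placeholder
          |
          |     paginator = s3.get_paginator("list_objects_v2")
          |     for page in paginator.paginate(Bucket=BUCKET):
          |         keys = [{"Key": o["Key"]} for o in page.get("Contents", [])]
          |         if keys:  # batch-delete each page of up to 1000 keys
          |             s3.delete_objects(Bucket=BUCKET,
          |                               Delete={"Objects": keys})
          |
          |     s3.delete_bucket(Bucket=BUCKET)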
        
           | electroly wrote:
           | No, don't do that. Set up a lifecycle rule that expires all
           | of the objects and wait 24 hours. You won't pay for API calls
           | and even the cost of storing the objects themselves is waived
           | once they are marked for expiration.
           | 
           | The article has a mistake about this too: expirations do NOT
           | count as lifecycle transitions and you don't get charged as
           | such. You will, of course, get charged if you prematurely
           | delete objects that are in a storage class with a minimum
           | storage duration that they haven't reached yet. This is what
           | they're actually talking about when they mention Infrequent
           | Access and other lower tiers.
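            |
            | For reference, setting that rule up with boto3 looks roughly
            | like this (bucket name is a placeholder):
            |
            |     import boto3
            |
            |     boto3.client("s3").put_bucket_lifecycle_configuration(
            |         Bucket="bucket-to-empty",
            |         LifecycleConfiguration={"Rules": [{
            |             "ID": "empty-bucket",
            |             "Status": "Enabled",
            |             "Filter": {"Prefix": ""},  # apply to every object
            |             "Expiration": {"Days": 1},
            |             "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
            |             "AbortIncompleteMultipartUpload":
            |                 {"DaysAfterInitiation": 1},
            |         }]},
            |     )
            |     # Once the rule has run and the bucket is empty, delete it.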
        
             | jameshart wrote:
             | Still counts as nontrivial.
        
               | electroly wrote:
               | This is really easy; much easier than trying to delete
               | them by hand. AWS does all the work for you. It takes
               | longer to log into the AWS Management Console than it
               | does to set up this lifecycle rule.
        
       | breckognize wrote:
       | > I haven't heard of people having problems [with S3's
       | Durability] but equally: I've never seen these claims tested. I
       | am at least a bit curious about these claims.
       | 
       | Believe the hype. S3's durability is industry leading and
       | traditional file systems don't compare. It's not just the
       | software - it's the physical infrastructure and safety culture.
       | 
       | AWS' availability zone isolation is better than the other cloud
       | providers. When I worked at S3, customers would beat us up over
       | pricing compared to GCP blob storage, but the comparison was
       | unfair because Google would store your data in the same building
       | (or maybe different rooms of the same building) - not with the
       | separation AWS did.
       | 
       | The entire organization was unbelievably paranoid about data
       | integrity (checksum all the things) and bigger events like
       | natural disasters. S3 even operates at a scale where we could
       | detect "bitrot" - random bit flips caused by gamma rays hitting a
       | hard drive platter (roughly one per second across trillions of
       | objects iirc). We even measured failure rates by hard drive
       | vendor/vintage to minimize the chance of data loss if a batch of
       | disks went bad.
       | 
       | I wouldn't store critical data anywhere else.
       | 
       | Source: I wrote the S3 placement system.
        
         | tracerbulletx wrote:
         | My first job was at a startup in 2012 where I was expected to
         | build things at a scale way over what I really had the
         | experience to do. Anyways the best choice I ever made was using
         | RDS and S3 (and django).
        
         | supriyo-biswas wrote:
          | Checksumming the data is not born of paranoia but simply a
          | result of having to detect which blocks are unusable in order
          | to run the Reed-Solomon algorithm.
         | 
         | I'd also assume that a sufficient number of these corruption
         | events are used as a signal to "heal" the system by migrating
         | the individual data blocks onto different machines.
         | 
         | Overall, I'd say the things that you mentioned are pretty
         | typical of a storage system, and are not at all specific to S3
         | :)
        
           | catlifeonmars wrote:
           | The S3 checksum feature applies to the objects, so that's
           | entirely orthogonal to erasure codes. Unless you know
           | something I don't and SHA256 has commutative properties.
           | You'd still need to compute the object hash independent of
           | any blocks.
           | 
           | Source: https://docs.aws.amazon.com/AmazonS3/latest/userguide
           | /checki...
        
         | staunch wrote:
         | > _Believe the hype._
         | 
         | I'd rather believe the test results.
         | 
         | Is there a neutral third-party that has validated S3's
         | durability/integrity/consistency? Something as rigorous as
         | Jepsen?
         | 
         | It'd be really neat if someone compared all the S3 compatible
         | cloud storage systems in a really rigorous way. I'm sure we'd
         | discover that there are huge scary problems. Or maybe someone
         | already has?
        
         | rsync wrote:
         | "AWS' availability zone isolation is better than the other
         | cloud providers."
         | 
         | Not better than _all_ of them.
         | 
         | A geo-redundant rsync.net account exists in two different
         | states (or countries) - for instance, primary in Fremont[1] and
         | secondary in Denver.
         | 
         | "S3 even operates at a scale where we could detect "bitrot""
         | 
         | That is not a function of scale. My personal server running ZFS
         | detects bitrot just fine - and the scale involved is tiny.
         | 
         | [1] he.net headquarters
        
           | Helmut10001 wrote:
           | Agree.
           | 
           | > S3 even operates at a scale where we could detect "bitrot"
           | - random bit flips caused by gamma rays hitting a hard drive
           | platter (roughly one per second across trillions of objects
           | iirc).
           | 
           | I would expect any cloud provider to be able to detect bitrot
           | these days.
        
             | senderista wrote:
             | I think the point the OP was trying to make is that they
             | _regularly detected_ bitrot due to their scale, not that
             | they were merely _capable_ of doing so.
        
               | Helmut10001 wrote:
               | Ah, thank you. This makes more sense. And I think I
               | remember reading about it once. Apologies for the
               | misinterpretation!
        
               | pclmulqdq wrote:
               | Everyone with significant scale and decent software
               | regularly detects bitrot.
        
           | breckognize wrote:
           | Backing up across two different regions is possible for any
           | provider with two "regions" but requires either doubling your
           | storage footprint or accepting a latency hit because you have
           | to make a roundtrip from Fremont to Denver.
           | 
           | The neat thing about AWS' AZ architecture is that it's a
           | sweet spot in the middle. They're far enough apart for good
           | isolation, which provides durability and availability, but
           | close enough that the network round trip time is negligible
           | compared to the disk seek.
           | 
           | Re: bit rot, I mean the frequency of events. If you've got a
           | few disks, you may see one flip every couple years. They
           | happen frequently enough in S3 that you can have expectations
           | about the arrival rate and alarm when that deviates from
           | expectations.
        
             | logifail wrote:
             | > The neat thing about AWS' AZ architecture is that it's a
             | sweet spot in the middle
             | 
             | What may be less of a sweet spot is AWS' pricing.
        
               | emodendroket wrote:
               | Sending the data to /dev/null is the cheapest option if
               | that's all you care about.
        
               | logifail wrote:
               | Seems the snark detector just went off :)
               | 
               | Back on topic, I'd hope all of us would expect value for
               | money for any and all services we recommend or purchase.
               | Search for "site:news.ycombinator.com Away From AWS" to
               | find dozens of discussions on how to save money by
               | leaving AWS.
               | 
               | EDIT: just one article of the many I've read recently:
               | 
               | "What I've always found surprising about egress is just
               | how expensive it is. On AWS, downloading a file from S3
               | to your computer once costs 4 times more than storing it
               | for an entire month"
               | 
               | https://robaboukhalil.medium.com/youre-paying-too-much-
               | for-e...
        
             | senderista wrote:
             | > the network round trip time is negligible compared to the
             | disk seek
             | 
             | Only for spinning rust, right?
        
               | breckognize wrote:
               | Yes, which is what all the hyperscalers use for object
               | storage. HDD seek time is ~10ms. Inter-az network latency
               | is a few hundred micros.
        
             | alexchamberlain wrote:
             | > They're far enough apart for good isolation, which
             | provides durability and availability
             | 
             | It can't possibly be enough for critical data though,
             | right? I'm guessing a fire in 1 is unlikely to spread to
             | another, but could it affect the availability of another?
             | What about a deliberate attack on the DCs or the utilities
             | supplying the DCs?
        
               | immibis wrote:
               | Yes, if a terrorist blows up all of the several Amazon
               | DCs holding your data, your data will be lost. This is
               | true no matter how many DCs are holding your data, who
               | owns them, or where they are. You can improve your
               | chances, of course.
               | 
               | There have been region-wide availability outages before.
               | They're pretty rare and make worldwide news media due to
               | how much of the internet they take out. I don't think
               | there's been S3 data loss since they got serious about
               | preventing S3 data loss.
        
           | allset_ wrote:
           | FWIW, both AWS S3 and GCP GCS also allow you to store data in
           | multi-region.
           | 
           | https://docs.aws.amazon.com/AmazonS3/latest/userguide/MultiR.
           | ..
           | 
           | https://cloud.google.com/storage/docs/locations#consideratio.
           | ..
        
             | andrewguenther wrote:
             | Yes, but S3 has single region redundancy that is better
             | than GCP. Your data in two AZs in one region is in two
             | physically separate buildings. So multi-region is less
             | important to durability.
        
           | mannyv wrote:
           | How does the latest ZFS bug impact your bitrot statement?
           | 
           | I mean, technically it's not bitrot if zeros were
           | accidentally written out instead of data.
        
             | woodada wrote:
             | Probably none because they didn't update to the exact
             | version that had the bug
        
         | medler wrote:
         | > customers would beat us up over pricing compared to GCP blob
         | storage, but the comparison was unfair because Google would
         | store your data in the same building
         | 
         | I don't think this is true. Per the Google Cloud Storage docs,
         | data is replicated across multiple zones, and each zone maps to
         | a different cluster.
         | https://cloud.google.com/compute/docs/regions-zones/zone-vir...
        
           | singron wrote:
           | Google puts multiple clusters in a single building.
        
             | medler wrote:
             | Seems you're right. They say each zone is a separate
             | failure domain but you kind of have to trust their word on
             | that.
        
             | navaati wrote:
             | Flashback to that Clichy datacenter fire near Paris...
        
           | yencabulator wrote:
           | Zones are about correlated power and networking failures.
           | Regions are about disasters. If you want multiple regions,
           | Google can of course do that too:
           | 
           | https://cloud.google.com/storage/docs/locations#consideratio.
           | ..
        
         | treflop wrote:
         | What's your experience like at other storage outfits?
         | 
         | I only ask because your post is a bit like singing praises for
         | Cinnabon that they make their own dough.
         | 
         | The things that you mentioned are standard storage company
         | activities.
         | 
         | Checksum-all-the-things is a basic feature of a lot of file
         | systems. If you can already set up your home computer to detect
         | bitrot and alert you, you can bet big storage vendors do it.
         | 
         | Keeping track of hard drive failure rates by vendor is normal.
         | Storage companies publicly publish their own reports. The tiny
         | 6-person IT operation I was in had a spreadsheet. Hell, I
         | toured a friend's friend's major data center last year and he
         | managed to find time to talk hard drive vendors. Now you. I get
         | it -- y'all make spreadsheets.
         | 
         | There are a lot of smart people working on storage outside AWS
         | and long before AWS existed.
        
           | pclmulqdq wrote:
           | When I worked at Google in storage, we had our own figures of
           | merit that showed that we were the best and Amazon's
           | durability was trash in comparison to us.
           | 
           | As far as I can tell, every cloud provider's object store is
           | too durable to actually measure ("14 9's"), and it's not a
           | problem.
        
             | breckognize wrote:
             | 9's are overblown. When cloud providers report that,
             | they're really saying "Assuming random hard drive failure
              | at the rates we've historically measured and how quickly
             | we detect and fix those failures, what's the mean time to
             | data loss".
             | 
             | But that's burying the lede. By far the greatest risks to a
             | file's durability are: 1. Bugs (which aren't captured by a
             | durability model). This is mitigated by deploying slowly
             | and having good isolation between regions. 2. An act of God
             | that wipes out a facility.
             | 
             | The point of my comment was that it's not just about
             | checksums. That's table stakes. The main driver of data
             | loss for storage organizations with competent software is
             | safety culture and physical infrastructure.
             | 
             | My experience was that S3's safety culture is outstanding.
             | In terms of physical separation and how "solid" the AZs
             | are, AWS is overbuilt compared to the other players.
        
               | pclmulqdq wrote:
               | That was not how we treated the 9's at Google. Those had
               | been tested through natural experiments (disasters).
               | 
               | I was not at Google for the Clichy fire, but it wasn't
               | the first datacenter fire Google experienced. I think
               | your information about Google's data placement may be
               | incorrect, or you may be mapping AWS concepts onto Google
               | internal infrastructure in the wrong way.
        
               | fsociety wrote:
               | I would not lose sleep over storing data on GCS, but have
               | heard from several Google Cloud folks that their concept
               | of zones is a mirage at best.
        
               | pclmulqdq wrote:
               | Yeah, that's definitely true. Google sort of mapped an
               | AWS concept onto its own cluster splits. However, there
               | are enough regional-scale outages at all the major clouds
               | that I don't personally place much stock in the idea of
               | zones to begin with. The only way to get close to true
               | 24/7 five-9's uptime with clouds is to be multi-region
               | (and preferably multi-cloud).
        
               | breckognize wrote:
               | Do you mean Google included "acts of God" when computing
               | 9's? That's definitely not right.
               | 
               | 11 9's of durability means mean time to data loss of 100
               | billion years. Nothing on earth is 11 9's durable in the
               | face of natural (or man-made) disasters. The earth is
               | only 4.5 billion years old.
        
               | pclmulqdq wrote:
               | Normally, companies store more than 1 byte of data, and
               | the 9's (not just for data loss, for everything) are
               | ensemble averages.
               | 
               | By the way, I don't doubt that AWS has plenty of 9's by
               | that metric - perhaps more than GCP.
        
               | jftuga wrote:
               | If I were to upload a 50kb object to S3 (standard tier),
               | about how many unique physical copies would exist?
        
               | cyberax wrote:
               | At least 3.
        
           | fierro wrote:
           | it's well known and not debatable that Cinnabon is fire
        
           | FooBarWidget wrote:
           | "Checksum-all-the-things is a basic feature of a lot of file
           | systems"
           | 
           | "A lot"? Does anything but ZFS and maybe btrfs do this? Ext4
            | and XFS -- two very common filesystems -- still don't have
           | data checksums.
        
             | Filligree wrote:
             | Bcachefs, and LVM also has a way to do it.
             | 
             | Unfortunately I'm not aware of any filesystem that does it
             | while maintaining the full bandwidth of a modern NVMe. Not
             | even with the extra reads factored in; on ZFS I get 800
             | MB/s max.
        
           | 4death4 wrote:
           | This was a few years ago, but blob storage on GCP had a
           | global outage due to an outage in a single zone. That, among
           | numerous other issues with GCP, lost my confidence entirely.
           | Maybe it's better now.
        
         | loeg wrote:
         | Not a public cloud, but storage at Facebook is similar in terms
         | of physical infrastructure, safety culture, and scale.
        
         | spintin wrote:
         | Correct me if I'm wrong but bitrot only affects spinning rust
         | since NAND uses ECC?
         | 
         | If you see this I wonder if S3 is planning on adding hardlinks?
        
           | sgtnoodle wrote:
           | Pretty much any modern storage medium depends on a healthy
           | amount of error correcting code.
        
           | surajrmal wrote:
           | Nand is constantly moving around your data to prevent it from
           | bit rotting. If you leave data too long without moving it,
           | you may not be able to read the data from the nand.
        
         | Veserv wrote:
          | But they asked if the claims were audited by an unbiased third
         | party. Are there such audits?
         | 
         | Alternatively, AWS does publicly provide legally binding
         | availability guarantees, but I have never seen any prominently
         | displayed legally binding durability guarantees. Are these
         | published somewhere less prominently?
        
           | cyberax wrote:
           | > Alternatively, AWS does publicly provide legally binding
           | availability guarantees, but I have never seen any
           | prominently displayed legally binding durability guarantees.
           | Are these published somewhere less prominently?
           | 
           | It's listed prominently in the public docs:
           | https://aws.amazon.com/s3/storage-classes/
        
         | chupasaurus wrote:
         | > and bigger events like natural disasters
         | 
         | Outdated anecdata: I've worked for a company that lost some
         | parts of buckets after the lightning strike incident in 2011,
          | which bumped the paranoia quite a bit. AFAIK the same thing
          | hasn't been possible for more than a decade.
        
       | svat wrote:
       | It's nice to see Ousterhout's idea of module depth (the main idea
       | from his _A Philosophy of Software Design_ ) getting more
       | mainstream -- mentioned in this article with attribution only in
       | "Other notes", which suggests the author found it natural enough
       | not to require elaboration. Being obvious-in-hindsight like this
       | is a sign of a good idea. :-)
       | 
       | > _The concept of deep vs shallow modules comes from John
       | Ousterhout 's excellent book. The book is [effectively] a list of
       | ideas on software design. Some are real hits with me, others not,
       | but well worth reading overall. Praise for making it succinct._
        
       | ahepp wrote:
       | Are filesystems the correct abstraction to build databases on?
       | Isn't a filesystem a database in a way? Is there a reason to
       | build a database on top of a filesystem abstraction rather than a
       | block abstraction?
       | 
       | To say you can't build an efficient database on top of S3 makes
       | sense to me. S3 is already a certain kind of data-storing
       | abstraction optimized for certain usages. If you try and build
       | another data-storing abstraction optimized for incompatible
       | usages on top of that, you are going to have a difficult time.
        
         | d0gsg0w00f wrote:
         | In my $dayjob as cloud architect I sometimes suggest S3 as an
         | alternative to pulling massive JSON blobs from RDS
          | Postgres/Redis etc. As long as their latency requirements are
          | lenient enough, there's no reason you can't.
        
         | jandrewrogers wrote:
         | The traditional POSIX filesystem is the wrong abstraction for a
         | database, but not filesystems per se. All databases that care
         | about performance and scalability implement their own
         | filesystems, either directly against raw block devices or as an
         | overlay on top of a POSIX filesystem that bypasses some of its
         | limitations. The performance and scalability gains by doing so
         | are not small.
         | 
         | The issue with POSIX filesystems is that they are required to
         | make a set of tradeoffs to support features a database engine
         | doesn't need, to the significant detriment of scalability and
         | performance in areas that databases care about a lot. For
         | example, one such database filesystem I've used occasionally
         | over the years, while a bit dated at this point, is designed
         | such that you can have tens of millions of files in a single
         | directory where you are creating and destroying tens of
         | thousands of files every second, on upwards of a petabyte of
         | storage. Very far from being POSIX compatible but you don't get
         | anything like that type of scalability on POSIX.
         | 
         | Object storage is far from ideal as database storage. The
         | biggest issue, though, is the terrible storage bandwidth
         | available in the cloud. It is a small fraction of what is
         | available in a normal server and modern database engines are
         | capable of fully exploiting a large JBOD of NVMe.
        
       | d-z-m wrote:
       | > S3 is a cloud filesystem, not an object-whatever. [...]I think
       | the idea that S3 is really "Amazon Cloud Filesystem" is a bit of
       | a load bearing fiction.
       | 
       | Does anyone actually think this? I have never encountered anyone
       | who has described S3 in these terms.
        
         | teaearlgraycold wrote:
         | Not sure if the author is aware of EFS
        
       | chubot wrote:
       | > Filesystem software, especially databases, can't be ported to
       | Amazon S3
       | 
       | This seems mistaken. Porting databases that run on local disk to
       | S3 seems like a good way to get a lashing from https://aphyr.com/
       | 
       | Can any databases do it correctly?
       | 
       | If so, I doubt they work with the model of partial overwrites.
       | They probably have to do something very custom, and either
       | sacrifice a lot of tail latency, or their uptime is capped by the
       | uptime of a single AWS availability zone. Doesn't seem like a
       | great design.
       | 
       | (copy of lobste.rs comment)
        
         | est31 wrote:
         | My employer (Neon) offers Postgres databases that run on top of
         | a couple of caching layers at the end of which there is S3:
         | https://neon.tech/docs/introduction/architecture-overview
         | 
         | Directly exposing every write to S3 gives you the partial
         | overwrite issues as described. But one can collect a bunch of
         | traffic and push state to S3 once it reaches a threshold.
          | Until then, the most recent writes in the Postgres WAL are held
          | outside of S3 in a replicated on-disk cache.
        
           | chubot wrote:
           | Thanks for the link.
           | 
           | But I searched the docs for "durability" and got zero
           | results. Before I use anything like this, I'd like to see
           | what durability settings are used:
           | 
           | https://www.postgresql.org/docs/current/non-durability.html
           | 
           | Litestream documents the their data loss window, it seems
           | like Neon should too:
           | 
           | https://litestream.io/tips/
           | 
           |  _By default, Litestream will replicate new changes to an S3
           | replica every second. During this time where data has not yet
           | been replicated, a catastrophic crash on your server will
           | result in the loss of data in that time window._
           | 
           | I also searched for "data loss" and got zero results -- this
           | is important because Neon is almost certainly sacrificing
           | durability for performance.
        
             | yencabulator wrote:
             | Neon handles that by staging the WAL segments on 3x
             | replicated Safekeeper nodes. Durability relies on not
             | having all of those blow up at the same time. I'd expect it
             | to be much safer than traditional Postgres replication
             | mechanisms (with the trade-off having a comparatively large
             | minimum node count; Neon really is built for multitenancy
             | where that cost can be amortized across lots of databases).
        
             | est31 wrote:
             | > I searched the docs for "durability" and got zero
             | results.
             | 
             | The link I gave above explains it, right the sentence with
             | "durability":
             | 
             | > Safekeepers are responsible for durability of recent
             | updates. Postgres streams Write-Ahead Log (WAL) to the
             | Safekeepers, and the Safekeepers store the WAL durably
             | until it has been processed by the Pageservers and uploaded
             | to cloud storage.
             | 
             | > Safekeepers can be thought of as an ultra reliable write
             | buffer that holds the latest data until it is processed and
             | uploaded to cloud storage. Safekeepers implement the Paxos
             | protocol for reliability.
        
       | BirAdam wrote:
       | Underneath the software, there's still a filesystem with files.
       | 
       | If you stand up an S3 instance with Ceph, you still have a
       | filesystem on spinning rust or fancy SSDs. There's just a bunch
        | of stuff on top of that. It's cool, but saying that there's no
        | filesystem only describes what the customer or middle person sees,
        | not what is actually happening.
        
         | seabrookmx wrote:
         | S3 actually uses a completely custom system[1] for writing
         | bytes to disk. I haven't seen much in the way of details on the
         | on-disk format but I certainly wouldn't assume it resembles a
         | normal filesystem.
         | 
         | [1]: https://aws.amazon.com/blogs/storage/how-automated-
         | reasoning...
        
         | aseipp wrote:
         | No there isn't. AWS does not use the traditional filesystem
         | layer to store data; that would be a massive mistake from a
         | performance and reliability POV; the POSIX filesystem
         | specification is notoriously vague about things like fsync
         | consistency under particular scenarios, i.e. "do I need to
         | fsync the parent directory before or after fsyncing the
         | contents" for instance and has many bizarre performance cliffs
         | if you aren't careful. At the scale AWS is at even a 10%
         | performance cliff or performance delta would be worth clawing
         | back if it meant removing the POSIX filesystem.
         | 
         | Filesystems are not free; they incur "complexity" (that
         | favorite bugbear everyone on HN loves to complain about) just
         | as much as any other component in the stack does.
         | 
         | > If you stand up an S3 instance with Ceph,
         | 
         | Okay, but AWS does not run on Ceph. Even then, Ceph is an
         | example that recommends the opposite. Nowadays they recommend
         | solutions like the Bluestore OSD backend to store actual data
         | directly on raw block devices, completely bypassing the
         | filesystem layer -- for the exact same reasons I outlined above
         | and many, many others (the actual metadata does use "BlueFS"
         | which is a small FS shim, but this is mostly so that RocksDB
         | can write directly to the block device too, next to the data
         | segments, and BlueFS is in no way a real POSIX filesystem, it's
         | just a shim for existing software).
         | 
         | See "File Systems Unfit as Distributed Storage Backends:
         | Lessons from 10 Years of Ceph Evolution" written by the Ceph
         | authors[1] about why they finally gave in and wrote Bluestore.
         | The spoiler alert is they got rid of the filesystem precisely
         | because "a filesystem with files" underneath, as you describe,
         | was problematic and worked poorly in comparison (see the
         | conclusion in Section 9.)
         | 
         | Many places do use POSIX filesystems for various reasons, even
         | at large scale, of course.
         | 
         | [1] https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf
        
         | yencabulator wrote:
         | Ceph's BlueStore has talked direct to block devices, no
         | filesystem in between, since 2017.
         | 
         | https://ceph.com/community/new-luminous-bluestore/
         | 
         | [Disclaimer: ex-Ceph employee, from before BlueStore]
        
         | jandrewrogers wrote:
         | I seriously doubt this is correct. It is common for database
         | engines to install directly on raw block devices, bypassing
         | the kernel's filesystem layer and effectively becoming the
         | filesystem for those storage devices. Why would S3 work any
         | differently? There are
         | no advantages to building on top of a filesystem and many
         | disadvantages for this kind of thing.
         | 
         | It would be a poor engineering choice to build something like
         | S3 on top of some other filesystem. There are ways to do it by
         | using an overlay that converts a filesystem into a pseudo
         | block device, but that is usually considered a compatibility
         | shim for environments that lack dedicated storage, at the cost
         | of robustness and performance.
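         | 
         | For a sense of what "directly on raw block devices" means
         | mechanically, here is a rough sketch -- purely illustrative,
         | with a placeholder device path, not S3's or any real engine's
         | code -- of writing a sector-aligned buffer straight to a block
         | device with O_DIRECT, so the engine rather than a kernel
         | filesystem decides what lives at which offset:
         | 
         |     package main
         | 
         |     import (
         |         "os"
         |         "syscall"
         |         "unsafe"
         |     )
         | 
         |     const sector = 4096 // assume a 4 KiB-sector device
         | 
         |     // alignedBuf over-allocates and slices so the buffer
         |     // start is sector-aligned, which O_DIRECT requires.
         |     func alignedBuf(size int) []byte {
         |         raw := make([]byte, size+sector)
         |         pad := sector - int(uintptr(unsafe.Pointer(&raw[0]))%sector)
         |         return raw[pad : pad+size]
         |     }
         | 
         |     func main() {
         |         // O_DIRECT is Linux-specific; the device path is a
         |         // placeholder.
         |         f, err := os.OpenFile("/dev/nvme0n1",
         |             os.O_RDWR|syscall.O_DIRECT, 0)
         |         if err != nil {
         |             panic(err)
         |         }
         |         defer f.Close()
         | 
         |         buf := alignedBuf(sector)
         |         copy(buf, "raw bytes, no filesystem in sight")
         |         // Write one sector at an offset the engine itself
         |         // tracks in its own metadata.
         |         if _, err := f.WriteAt(buf, 0); err != nil {
         |             panic(err)
         |         }
         |     }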
        
       | somedudetbh wrote:
       | > Amazon S3 is the original cloud technology: it came out in
       | 2006. "Objects" were popular at the time and S3 was labelled an
       | "object store", but everyone really knows that S3 is for files.
       | S3
       | 
       | Alternative theory: everyone who worked on this knew that it was
       | not a filesystem, and "object store" was the label chosen
       | precisely to signal everything else pointed out in this post.
       | 
       | "Objects were really popular" is about objects as software
       | component that combines executable code with local state. None of
       | the original S3 examples were about "hey you can serialize live
       | objects to this store and then deserialize them into another live
       | process!" It was all like "hey you know how you have all those
       | static assets for your website..." "Objects" was used in this
       | sense in databases at the time in the phrase "binary large
       | object" or "blob". S3 was like "hey, stuff that doesn't fit in
       | your database, you know...objects...this is a store for them."
       | 
       | This is meant to describe precisely things like "listing is
       | slow": when S3 was designed, the launch use cases assumed an
       | index of contents existed _somewhere else_, because, yeah, it's
       | not a filesystem. It's an object store.
        
         | senderista wrote:
         | Yes, the author doesn't seem to realize that "object storage"
         | is a term of art in storage systems that has nothing to do with
         | OOP.
         | 
         | https://en.wikipedia.org/wiki/Object_storage
        
       | tutfbhuf wrote:
       | S3 is obviously not a filesystem in the sense of a POSIX
       | filesystem, and I would argue it is not a filesystem even if we
       | relax POSIX semantics (i.e. do not implement the full spec). But
       | it is certainly possible to layer a filesystem on top of S3 --
       | you can layer a filesystem on top of anything that can store
       | data. For demonstration purposes you can even go crazy and put a
       | filesystem on top of YouTube (there are some tech demos for that
       | on GitHub).
       | 
       | I think a better question is whether there are any good
       | filesystem implementations on top of S3. There are many attempts
       | like s3fs-fuse[^1] or seaweedfs[^2], but I have not heard many
       | stories about their use at scale at big companies. Just recently
       | there was a post here about cunoFS[^3], a startup that
       | implements a POSIX-compliant filesystem (symlinks, emulated hard
       | links, UIDs & GIDs, permissions, random writes, etc.) on top of
       | S3/AZ/GCP storage and claims really good performance. I think
       | only time will tell whether using S3 as a filesystem through
       | such implementations works out for companies in practice.
       | 
       | [^1]: https://github.com/s3fs-fuse/s3fs-fuse
       | 
       | [^2]: https://github.com/seaweedfs/seaweedfs
       | 
       | [^3]: https://news.ycombinator.com/item?id=39640307
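       | 
       | For a flavor of what those shims do under the hood, here is a
       | toy sketch (my own, not s3fs-fuse's or cunoFS's code): S3 has no
       | real directories, only keys, so "ls" means grouping keys under a
       | prefix by the next "/" -- which is what ListObjectsV2's Prefix
       | and Delimiter parameters do server-side.
       | 
       |     package s3ls
       | 
       |     import "strings"
       | 
       |     // List returns the immediate "children" of prefix given a
       |     // flat set of object keys: files directly under it, plus
       |     // pseudo-directories implied by deeper keys.
       |     func List(keys []string, prefix string) (files, dirs []string) {
       |         seen := map[string]bool{}
       |         for _, k := range keys {
       |             if !strings.HasPrefix(k, prefix) {
       |                 continue
       |             }
       |             rest := strings.TrimPrefix(k, prefix)
       |             if i := strings.Index(rest, "/"); i >= 0 {
       |                 d := rest[:i+1] // exists only by implication
       |                 if !seen[d] {
       |                     seen[d] = true
       |                     dirs = append(dirs, d)
       |                 }
       |             } else if rest != "" {
       |                 files = append(files, rest)
       |             }
       |         }
       |         return files, dirs
       |     }
       | 
       | For example, List([]string{"a/b.txt", "a/c/d.txt"}, "a/") yields
       | files ["b.txt"] and dirs ["c/"] -- and "c/" exists only because a
       | deeper key implies it, which is exactly why renaming or deleting
       | a "directory" on S3 turns into a bulk operation over keys.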
        
       | ein0p wrote:
       | A bit off topic but also related: I use Minio as a local "S3" to
       | store datasets and model checkpoints for my garage compute.
       | Minio, however, has a bunch of features that I simply don't need.
       | I just want to be able to copy to/from, list prefixes, and
       | delete every now and then. I could use NFS I suppose, but that'd
       | be a
       | bit inconvenient since I also use Minio to store build deps
       | (which Bazel then downloads), and I'd like to be able to
       | comfortably build stuff on my laptop. In particular, one feature
       | I do not need is the constant disk access that Minio does to
       | "protect against bit rot" and whatever. That protection is
       | already provided by periodic scrubs on my raidz6.
       | 
       | So what's the current best (preferably statically linked) self-
       | hosted, single-node option for a minimal S3-like "thing" that
       | just lets me CRUD the files and list them?
        
       | OnlyMortal wrote:
       | It can be a file system.
       | 
       | I've written my own FUSE filesystem that uses Rabin chunking and
       | stores the data (and metadata) in S3. The C++/AWS SDK FUSE
       | filesystem is connected to a Go SMB server that runs locally on
       | my Mac and works with (local) Time Machine.
       | 
       | I use Wasabi for cost and speed reasons.
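       | 
       | For readers unfamiliar with it, content-defined chunking cuts a
       | stream where a rolling hash over the bytes says to, not at fixed
       | offsets, so an edit early in a file only disturbs nearby chunks.
       | The sketch below is my own illustration using a simpler
       | gear-style rolling hash rather than true Rabin fingerprints, so
       | it shows the idea, not the parent's actual implementation:
       | 
       |     package cdc
       | 
       |     import "math/rand"
       | 
       |     // gear maps each byte value to a pseudo-random word; a
       |     // fixed seed keeps chunk boundaries stable across runs.
       |     var gear [256]uint64
       | 
       |     func init() {
       |         r := rand.New(rand.NewSource(1))
       |         for i := range gear {
       |             gear[i] = r.Uint64()
       |         }
       |     }
       | 
       |     // Split cuts data into chunks whose boundaries depend on
       |     // the content itself, bounded by minSize and maxSize.
       |     func Split(data []byte, minSize, maxSize int, mask uint64) [][]byte {
       |         var chunks [][]byte
       |         start := 0
       |         var h uint64
       |         for i, b := range data {
       |             h = (h << 1) + gear[b]
       |             size := i - start + 1
       |             if (size >= minSize && h&mask == 0) || size >= maxSize {
       |                 chunks = append(chunks, data[start:i+1])
       |                 start = i + 1
       |                 h = 0
       |             }
       |         }
       |         if start < len(data) {
       |             chunks = append(chunks, data[start:])
       |         }
       |         return chunks
       |     }
       | 
       | Calling Split(buf, 2<<10, 64<<10, (1<<13)-1) gives chunks of
       | roughly 8 KiB on average, which dedupes nicely when the chunks
       | are then stored as individual S3 objects keyed by their hash.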
        
       | jcims wrote:
       | It seems like they're moving away from this with S3 directory
       | buckets and Express One Zone.
        
       ___________________________________________________________________
       (page generated 2024-03-10 23:00 UTC)