[HN Gopher] S3 is files, but not a filesystem
___________________________________________________________________
S3 is files, but not a filesystem
Author : todsacerdoti
Score : 393 points
Date : 2024-03-10 04:11 UTC (18 hours ago)
(HTM) web link (calpaterson.com)
(TXT) w3m dump (calpaterson.com)
| 3weeksearlier wrote:
| I dunno, are features like partial file overwrites necessary to
| make something a filesystem? This reminds me of how there are
| lots of internal systems at Google whose maintainers keep
| insisting they are not filesystems, but everyone treats them as
| such, to the point where "_____ is not a filesystem" has become
| an inside joke.
| fiddlerwoaroof wrote:
| Yeah, it's sort of funny how "POSIXish semantics" has become
| our definition of these things, when it's just one kind of
| thing that's been called a filesystem historically.
| mickael-kerjean wrote:
| Fun experiment I made with my mum: building a storage-
| independent Dropbox-like UI [1] for anything that implements
| this interface:
|
|     type IBackend interface {
|         Ls(path string) ([]os.FileInfo, error)
|         Cat(path string) (io.ReadCloser, error)
|         Mkdir(path string) error
|         Rm(path string) error
|         Mv(from string, to string) error
|         Save(path string, file io.Reader) error
|         Touch(path string) error
|     }
|
| My mum really couldn't care less about the POSIX semantics as
| long as she can see the pictures of my kid, which happen to be
| on S3
|
| [1] https://github.com/mickael-kerjean/filestash
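|
| For illustration, here is a rough sketch of how a couple of
| those methods could be backed by S3 using the AWS SDK for Go v2
| (hypothetical and untested; the bucket field and error handling
| are kept minimal):
|
|     import (
|         "context"
|         "io"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // S3Backend is a sketch of an IBackend implementation where the
|     // "path" is used directly as the object key.
|     type S3Backend struct {
|         client *s3.Client
|         bucket string
|     }
|
|     // Cat streams an object's bytes.
|     func (b *S3Backend) Cat(path string) (io.ReadCloser, error) {
|         out, err := b.client.GetObject(context.TODO(), &s3.GetObjectInput{
|             Bucket: aws.String(b.bucket),
|             Key:    aws.String(path),
|         })
|         if err != nil {
|             return nil, err
|         }
|         return out.Body, nil
|     }
|
|     // Save uploads the whole reader as a new object -- there is no
|     // partial write; the object is replaced wholesale.
|     func (b *S3Backend) Save(path string, file io.Reader) error {
|         _, err := b.client.PutObject(context.TODO(), &s3.PutObjectInput{
|             Bucket: aws.String(b.bucket),
|             Key:    aws.String(path),
|             Body:   file,
|         })
|         return err
|     }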
| wwalexander wrote:
| Reducing things to basically the interface you laid out is
| the point of 9p [1], and is what Plan 9's UNIX-but-
| distributed design was built on top of. Same inventor as
| Go! If you haven't dived down the Plan 9 rabbit hole yet,
| it's a beautiful and haunting vision of how simple cloud
| computing could have been.
|
| [1] https://9fans.github.io/plan9port/man/man9/intro.html
| MrJohz wrote:
| I think this interface is less interesting than the
| semantics behind it, particularly when it comes to
| concurrency: what happens when you delete a folder, and
| then try and create a file in that folder at the same time?
| What happens when you move a folder to a new location, and
| during that move, delete the new or old folders?
|
| Like yes, for your mum's use case, with a single user, it's
| probably not all that important that you cover those edge
| cases, but every time I've built pseudo-filesystems on top
| of non-filesystem storage APIs, those sorts of semantic
| questions have been where all the problems have hidden.
| It's not particularly hard to implement the interface
| you've described, but it's very hard to do it in such a way
| that, for example, you never have dangling files that exist
| but aren't contained in any folder, or that you never have
| multiple files with the same path, and so on.
| DonHopkins wrote:
| Can S3 murder your wife like ReiserFS and Reiser4?
|
| https://en.wikipedia.org/w/index.php?title=Comparison_of_fil...
| CobrastanJorji wrote:
| They are necessary because as soon as someone decides that S3
| is a filesystem, they will look at the other cloud
| "filesystems," notice that S3 is cheaper than most of them, and
| then for some reason they will decide to run giant Hadoop fs
| stuff on it or mount a relational database on it or all other
| manner of stupidity. I guarantee you S3's customer-facing
| engineers are fielding multiple calls per week from customers
| who are angry that S3 isn't as fast as some real filesystem
| solution that the customer migrated from because S3 was
| cheaper.
|
| When people decide that X is a filesystem, they try to use it
| like it's a local, POSIX filesystem, and that's terrible
| because it won't be immediately obvious why it's a stupid plan.
| albert_e wrote:
| If a customer makes an IT decision as big as running Hadoop
| or RDBMS with S3 as storage ... but does not consult at least
| an Associate-level AWS Certified architect (who are a dime a
| dozen) for at least one day's worth of advice, which is probably
| a couple of hundred dollars at most ...
|
| Can we really blame AWS?
|
| I am sure no official AWS documentation or examples show
| such an architecture.
|
| ----
|
| Amazon EMR can run Hadoop and use Amazon S3 as storage via
| EMR FS.
|
| "S3 mountpoints" are a feature specifically for workloads
| that need to see S3 as a file system.
|
| For block storage workloads there is EBS and EFS and FSx that
| AWS heavily advertises.
| karmasimida wrote:
| Exactly, especially since the concept of a filesystem was
| defined before internet scale became a thing or a reality.
|
| Maybe S3 isn't a filesystem according to this definition, but
| does it really matter to make it one? I doubt it. The Elastic
| File System is also an AWS product, but you can't really work
| with it as you do locally: any folder with over 20k files will
| basically time out if you do an ls. Does that make EFS a
| filesystem or not?
| yencabulator wrote:
| The problem is once you let go of those semantics, a lot of
| software stops working if run against such a "filesystem". If
| you dilute the meaning of "filesystem" too much, it becomes
| less useful as a term.
|
| https://en.wikipedia.org/wiki/Andrew_File_System was
| interesting, I'd actually love to see something similar re-
| implemented with modern ideas, but it's more of an direct-
| access archival system than a general-purpose filesystem[1],
| you can't just put files written by arbitrary software on it.
| It's a bit like NFS without locks&leases, but even less like a
| normal filesystem; only really good for files created once that
| "settle down" into effectively being read-only.
|
| [1]: I wrote https://github.com/bazil/plop that is
| (unfortunately undocumented) content-addressed immutable file
| storage over object storage, used in conjunction with a git
| repo with symlinks to it to manage the "naming layer". See
| https://bazil.org/doc/ for background, plop is basically a
| simplification of the ideas to get to working code easier. Site
| hasn't been updated in almost a decade, wow. It's in everyday
| use though!
| leetrout wrote:
| My big pet peeve is AWS adding buttons in the UI to make
| "folders".
|
| It is also a fiction! There are no folders in S3.
|
| > When you create a folder in Amazon S3, S3 creates a 0-byte
| object with a key that's set to the folder name that you
| provided. For example, if you create a folder named photos in
| your bucket, the Amazon S3 console creates a 0-byte object with
| the key photos/. The console creates this object to support the
| idea of folders.
|
| https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...
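|
| For illustration, what the console does under the hood is just a
| zero-byte PutObject with a trailing slash in the key -- a sketch
| with the AWS SDK for Go v2 (bucket and prefix names are made up):
|
|     import (
|         "context"
|         "strings"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // makeFolderMarker "creates a folder" the way the console does:
|     // it puts a zero-byte object whose key ends in "/". Nothing else
|     // about the bucket changes.
|     func makeFolderMarker(ctx context.Context, client *s3.Client, bucket, prefix string) error {
|         _, err := client.PutObject(ctx, &s3.PutObjectInput{
|             Bucket: aws.String(bucket),
|             Key:    aws.String(strings.TrimSuffix(prefix, "/") + "/"), // e.g. "photos/"
|             Body:   strings.NewReader(""),                             // zero bytes
|         })
|         return err
|     }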
| riehwvfbk wrote:
| Is that really so different from how folders work on other
| systems? A directory inode is just an inode.
| daynthelife wrote:
| The payload still contains a list of other inodes though
| klodolph wrote:
| Yes. It is, in practice, incredibly different.
|
| Imagine you have a file named /some/dir/file.jpg.
|
| In a filesystem, there's an inode for /some. It contains an
| entry for /some/dir, which is also an inode, and then in the
| very deepest level, there is an inode for /some/dir/file.jpg.
| You can rename /some to /something_else if you want. Think of
| it kind of like a table:
|     +-------+--------+----------+-------+
|     | inode | parent | name     | data  |
|     +-------+--------+----------+-------+
|     | 1     | (null) | some     | (dir) |
|     | 2     | 1      | dir      | (dir) |
|     | 3     | 2      | file.jpg | jpeg  |
|     +-------+--------+----------+-------+
|
| In S3 (and other object stores), the table is like this:
|     +-------------------+------+
|     | key               | data |
|     +-------------------+------+
|     | some/dir/file.jpg | jpeg |
|     +-------------------+------+
|
| The kinds of queries you can do are completely different. There
| are no inodes in S3. There is just a mapping from keys to
| objects. There's an index on these keys, so you can do
| queries--but the / character is NOT SPECIAL and does not
| actually have any significance to the S3 storage system and
| API. The / character only has significance in the UI.
|
| You can, if you want, use a completely different character to
| separate "components" in S3, rather than using /, because /
| is not special. If you want something like
| "some:dir:file.jpg" or "some.dir.file.jpg" you can do that.
| Again, because / is not special.
| fiddlerwoaroof wrote:
| Except, S3 does let you query by prefix and so the keys
| have more structure than the second diagram implies:
| they're not just random keys, the API implies that common
| prefixes indicate related objects.
| klodolph wrote:
| That's kind of stretching the idea of "more structure" to
| the breaking point, I think. The key is just a string.
| There is no entry for directories.
|
| > the API implies that common prefixes indicate related
| objects.
|
| That's something users do. The API doesn't imply anything
| is related.
|
| And prefixes can be anything, not just directories. If
| you have /some/dir/file.jpg, then you can query using
| /some/dir/ as a prefix (like a directory!) or you can
| query using /so as a prefix, or /some/dir/fil as a
| prefix. It's just a string. It only looks like a
| directory when you, the user, decide to interpret the /
| in the file key as a directory separator. You could just
| as easily use any other character.
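|
| To make that concrete, here's a rough sketch with the AWS SDK
| for Go v2 (names are made up): the delimiter is something the
| client chooses per request, not a property of the store.
|
|     import (
|         "context"
|         "fmt"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // listPrefix lists keys under an arbitrary prefix. Passing "/" as
|     // the delimiter makes S3 group keys into CommonPrefixes, which is
|     // all a "folder view" really is.
|     func listPrefix(ctx context.Context, client *s3.Client, bucket, prefix string) error {
|         out, err := client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
|             Bucket:    aws.String(bucket),
|             Prefix:    aws.String(prefix), // "some/dir/" or just "some/dir/fil"
|             Delimiter: aws.String("/"),
|         })
|         if err != nil {
|             return err
|         }
|         for _, p := range out.CommonPrefixes { // what the console draws as "folders"
|             fmt.Println("prefix:", *p.Prefix)
|         }
|         for _, obj := range out.Contents { // the actual objects
|             fmt.Println("key:", *obj.Key)
|         }
|         return nil
|     }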
| hiyer wrote:
| One operation where this difference is significant is
| renaming a "folder". In UNIX (and even UNIX-y distributed
| filesystems like HDFS) a rename operation at "folder"
| level is O(1) as it only involves metadata changes. In
| S3, renaming a "folder" is O(number of files).
| okr wrote:
| Imho, renaming "folders" on S3 results in copying and
| deleting O(number of files)
| pepa65 wrote:
| From reading the above, if you have a folder 'dir' and a
| file 'dir/file', after renaming 'dir' to 'folder', you
| would just have 'folder' and 'dir/file'.
| klodolph wrote:
| There is really no such thing as a folder in S3.
|
| If you have something which is dir/file, then NORMALLY
| "dir" does not exist at all. Only dir/file exists. There
| is nothing to rename.
|
| If you happen to have something which is named "dir",
| then it's just another file (a.k.a. object). In that
| scenario, you have two files (objects) named "dir" and
| "dir/file". Weird, but nothing stopping you from doing
| that. You can also have another object named
| "dir///../file" or something, although that can be
| inconvenient, for various reasons.
| Someone wrote:
| > In S3, renaming a "folder" is O(number of files).
|
| More like _O(max(number of files, total file size))_. You
| can't rename objects in S3. To simulate a rename, you
| have to copy an object and then delete the old one.
|
| Unlike renames in typical file systems, that isn't atomic
| (there will be a time period in which both the old and
| the new object exist), and it becomes slower the larger
| the file.
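|
| For illustration, the copy-then-delete dance with the AWS SDK
| for Go v2 (a sketch; names are made up, and CopyObject itself
| only handles sources up to 5 GB in a single call):
|
|     import (
|         "context"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // renameObject simulates a rename: copy to the new key, then delete
|     // the old key. It is not atomic -- both keys exist in between -- and
|     // a "folder" rename repeats this for every key under the prefix.
|     func renameObject(ctx context.Context, client *s3.Client, bucket, oldKey, newKey string) error {
|         _, err := client.CopyObject(ctx, &s3.CopyObjectInput{
|             Bucket:     aws.String(bucket),
|             CopySource: aws.String(bucket + "/" + oldKey), // "source-bucket/source-key" (URL-encode unusual characters)
|             Key:        aws.String(newKey),
|         })
|         if err != nil {
|             return err
|         }
|         _, err = client.DeleteObject(ctx, &s3.DeleteObjectInput{
|             Bucket: aws.String(bucket),
|             Key:    aws.String(oldKey),
|         })
|         return err
|     }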
| fiddlerwoaroof wrote:
| > That's something users do. The API doesn't imply
| anything is related.
|
| Querying ids by prefix doesn't make any sense for a
| normal ID type. Just making this operation available and
| part of your public API indicates that prefixes are
| semantically relevant to your API's ID type.
| klodolph wrote:
| "Prefix" is not the same thing as "directory".
|
| I can look up names with the prefix "B" and get Bart,
| Bella, Brooke, Blake, etc. That doesn't imply that
| there's some kind of semantics associated with prefixes.
| It's just a feature of your system that you may find
| useful. The fact that these names have a common prefix,
| "B", is not a particularly interesting thing to me. Just
| like if I had a list of files, 1.jpg, 10.jpg, 100.jpg,
| it's probably not significant that they're being returned
| sequentially (because I probably want 2.jpg after 1.jpg).
| afiori wrote:
| by this logic the file "foo/bar/" corresponds to the
| filename "f:o:o:/:b:a:r:/" (using a different character as
| separator)
| riehwvfbk wrote:
| Thank you, now I understand what the special 0-byte object
| refers to. It represents an empty folder.
|
| Fair enough, basing folders on object names split by / is
| pretty inefficient. I wonder why they didn't go with a
| solution like git's trees.
| klodolph wrote:
| > Fair enough, basing folders on object names split by /
| is pretty inefficient. I wonder why they didn't go with a
| solution like git's trees.
|
| What, exactly, is inefficient about it?
|
| Think for a moment about the data structures you would
| use to represent a directory structure in a filesystem,
| and the data structures you would use to represent a
| key/value store.
|
| With a filesystem, if you split a string
| /some/dir/file.jpg into three parts, "some", "dir",
| "file.jpg", then you are actually making a decision about
| the tree structure. And here's a question--is that a
| _balanced_ tree you got there? Maybe it's completely
| unbalanced! That's actually inefficient.
|
| Let's suppose, instead, you treat the key as a plain
| string and stick it in a tree. You have a lot of freedom
| now, in how you balance the tree, since you are not
| forced to stick nodes in the tree at every / character.
|
| It's just a different efficiency tradeoff. Certain
| operations are now much less efficient (like "rename a
| directory" which, on S3, is actually "copy a zillion
| objects"). Some operations are more efficient, like "store
| a file" or "retrieve a file".
| umanwizard wrote:
| I think what you're describing is simply not a
| hierarchical file system. It's a different thing that
| supports different operations and, indeed, is better or
| worse at different operations.
| afiori wrote:
| I think it is fair to say that S3 (as named files) is not
| a filesystem and it is inefficient to use it directly as
| such for common filesystem use cases; the same way that
| you could say it for a tarball[0].
|
| This does not make S3 a bad storage, just a bad
| filesystem, not everything needs to be a filesystem.
|
| Arguably it is good that S3 is not a filesystem, since a
| filesystem can be a leaky abstraction: e.g. in git you cannot
| have two tags named "v2" and "v2/feature-1", because you cannot
| have both a file and a folder with the same name.
|
| For something more closely related to URLs than filenames
| forcing a filesystem abstraction is a limitation as
| "/some/url", "/some/url/", and "/some/url/some-default-
| name-decided-by-the-webserver" can be different.[1]
|
| [0] where a different tradeoff is that searching a file
| by name is slower but reading many small files can be
| faster.
|
| [1] maybe they should be the same, but enforcing it is a
| bad idea
| inkyoto wrote:
| > [...] what the special 0-byte object refers to. It
| represents an empty folder.
|
| Alas, no. It represents a tag, e.g. <<folder/>>, that
| points to a zero byte object.
|
| You can then upload two files, e.g. <<folder/file1.txt>>
| and <<folder/file2.txt>>, delete the <<folder/>>, being a
| _tag_ , and still have the <<folder/file1.txt>> and
| <<folder/file2.txt>> files intact in the S3 bucket.
|
| Deleting <<folder/>> in a traditional file system, on the
| other hand, will also delete <<file1.txt>> and
| <<file2.txt>> in it.
| dchest wrote:
| It's a matter of client UI implementation. You can't
| delete a non-empty folder with the POSIX API on common
| filesystems, or over FTP, either.
|
| However, there are file managers, FTP clients, and S3
| clients that will do that for you by deleting individual
| files.
| _flux wrote:
| But the S3 semantics are not helping you here: e.g. with
| multiple clients doing copy/move/delete operations in the
| hierarchy, you could still end up with files that are not in
| "directories".
|
| So essentially an S3 file manager must be able to handle the
| situation where there are files without a "directory" -- and I
| assume that is also the most common case for S3. Might as well
| not have the "directories" in the first place.
| klodolph wrote:
| I have personally never seen the 0-byte files people keep
| talking about here. In every S3 bucket I've ever looked
| at, the "directories" don't exist at all. If you have a
| dir/file1.txt and dir/file2.txt, there is NO such object
| as dir. Not even a placeholder.
| _flux wrote:
| Yeah, this post was the first one I had even heard of
| them.
| cwillu wrote:
| Deleting folder/ in a traditional file system will _fail_
| if the folder is not empty. Userspace needs to recurse
| over the directory structure to unlink everything in it
| before unlinking the actual folder.
| gjvc wrote:
| "folders" do not exist in S3 -- why do you keep insisting
| that they do?
|
| They appear to exist because the key is split on the
| slash character for navigation in the web front-end. This
| gives the familiar appearance of a filesystem, but the
| implementation is at a much higher level.
| Demiurge wrote:
| Let's start with the fact that you're talking to an HTTP
| api... Even if S3 had web3.0 inodes, the querying semantics
| would not make sense. It's a higher level API, because you
| don't deal with blocks of magnetic storage and binary
| buffers. Of course s3 is not a filesystem, that is part of
| its definition, and reason to be...
| klodolph wrote:
| I think if you focus too narrowly on the details of the
| wire protocol, you'll lose sight of the big picture and
| the semantics.
|
| S3 is not a filesystem because the semantics are
| different from the kind of semantics we expect from
| filesystems. You can't take the high-level API provided
| by a filesystem, use S3 as the backing storage, and
| expect to get good performance out of it unless you use a
| _ton_ of translation.
|
| Stuff like NFS or CIFS _are_ filesystems. They behave
| like filesystems, in practice. You can rename files. You
| can modify files. You can create directories.
| Demiurge wrote:
| Right, NFS/CIFS support writing blocks, but S3
| basically does HTTP GET and POST verbs. I would say that
| these concepts are the defining difference. To call S3 a
| filesystem is not wrong in the abstract, but it's no
| different from calling WordPress a filesystem, or DNS, or
| anything that stores something for you. Of course, it
| will be inefficient to implement a block write on top of
| any of these, that's because you have to literally do it
| yourself. As in, download the file, edit it, upload
| again.
| klodolph wrote:
| I think the blocks are one part of it, and the other part
| is that S3 doesn't support renaming or moving objects,
| and doesn't have directories (just prefixes). Whenever
| I've seen something with filesystem-like semantics on top
| of S3, it's done by using S3 as a storage layer, and
| building some other kind of view of the storage on top
| using a separate index.
|
| For example, maybe you have a database mapping file paths
| to S3 objects. This gives you a separate metadata layer,
| with S3 as the storage layer for large blocks of data.
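|
| A minimal sketch of what such a metadata layer might store (the
| names here are hypothetical, not anything a real product uses):
|
|     import "time"
|
|     // FileEntry lives in the database; S3 only stores the bytes under
|     // an opaque key that is never derived from the user-visible path.
|     type FileEntry struct {
|         Path    string    // "/some/dir/file.jpg" -- what users see
|         S3Key   string    // e.g. a UUID assigned at upload time
|         Size    int64
|         ModTime time.Time
|     }
|
|     // A rename is then a metadata-only update, e.g.
|     //   UPDATE file_entries SET path = $1 WHERE path = $2;
|     // and the S3 object is never touched.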
| keithalewis wrote:
| Even youngsters are yelling at clouds now. Just a different
| kind of cloud.
| tuwtuwtuwtuw wrote:
| "filesystem" is not a name reserved for Unix-style file
| systems. There are many types of file system which is not
| built on according to your description. When I was a kid, I
| used systems which didn't support directories, but it was
| still file systems.
|
| It's an incorrect take that a system to manage files must
| follow a set of patterns like the ones you mentioned to be
| called "file system".
| afiori wrote:
| Terms evolve, and now "filesystem" and "system of files" mean
| different things.
|
| I would argue that not supporting folders or many other
| file operations make something not a filesystem today.
| quickthrower2 wrote:
| Yeah, "hacker" used to not mean someone hacking into a
| computer and breaking a password; then it did, and now it
| means both that and a tech tinkerer.
| tuwtuwtuwtuw wrote:
| You're free to argue whatever you want, but claiming that
| a file system should have folders as the parent commenter
| did, or support specific operations, seems a bit
| meaningless.
|
| I could create a system not supporting folders because it
| relies on tags or something else. Or I could create a
| system which is write-only and doesn't support rename or
| delete.
|
| These systems would be file systems according to how the
| term has been used for 40 (?) years at least. Just don't
| see any point in restricting the term to exclude random
| variants.
| erik_seaberg wrote:
| You can create a simulated directory, and write a bunch of
| files in it, but you can't atomically rename it--behind the
| scenes each file needs to be copied from old name to new.
| 8organicbits wrote:
| Another challenge is directory flattening. On a file system
| "a/b" and "a//b" are usually considered the same path. But on
| S3 the slash isn't a directory separator, so the paths are
| distinct. You need to be extra careful when building paths
| not to include double slashes.
|
| Many tools end up handling this by showing a folder named "a"
| containing a folder named "" (empty string). This confuses
| users quite a bit. It's more than the inodes, it's how the
| tooling handles the abstraction.
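|
| A small sketch of the kind of guard that helps (plain Go,
| illustrative only):
|
|     import "strings"
|
|     // joinKey builds an S3 key without accidentally doubling slashes.
|     // On S3, "a/b" and "a//b" are two distinct keys, so a naive
|     // dir + "/" + name where name starts with "/" quietly creates an
|     // object under an empty-string "folder".
|     func joinKey(parts ...string) string {
|         cleaned := make([]string, 0, len(parts))
|         for _, p := range parts {
|             if p = strings.Trim(p, "/"); p != "" {
|                 cleaned = append(cleaned, p)
|             }
|         }
|         return strings.Join(cleaned, "/")
|     }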
| hnlmorg wrote:
| Coincidentally I ran into an issue just like this a week
| ago. A customer facing application failed because there was
| an object named "/foo/bar" (emphasis on the leading slash).
|
| This created a prefix named "/" which confused the hell out
| of the application.
| ithkuil wrote:
| In S3 each file is identified with a full path.
|
| Not only can you not rename a single file, you also cannot
| rename a "folder" (because that would imply a bulk rename on
| a large number of children of that "folder")
|
| This is the fundamental difference between a first class
| folder and just a convention on prefixes of full path names.
|
| If you don't allow renames, it doesn't really make sense to
| have each "folder" store the list of the children.
|
| You can instead have a giant ordered map (some kind of
| B-tree) that allows for efficient lookup and for scanning
| neighbouring nodes.
| lukeh wrote:
| UMich LDAP server, upon which many were based, stored
| entries' hierarchical (distinguished) names with each entry,
| which I always found a bit weird. AD, eDirectory, and the
| OpenLDAP HDB backend don't have this problem.
| solumunus wrote:
| What exactly do you think a folder is? It's just an abstraction
| for organising data.
| winwang wrote:
| I'm having a lot of fun imagining this being said to a kid
| who's trying to buy some folders for school.
| klodolph wrote:
| S3 doesn't have that abstraction.
|
| The console UI shows folders but they don't actually exist in
| S3. They're made up by the UI.
| 3weeksearlier wrote:
| It sounds like they have that abstraction in the UI. But if
| the CLI and API don't have it too, that's weird.
| klodolph wrote:
| Yeah, the UI and CLI show you "folders". It's a client-
| side thing that doesn't exist in the actual service.
| Behind the scenes, the clients are making specific types
| of queries on the object keys.
|
| You can't examine when a folder was created (it doesn't
| exist in the first place), you can't rename a folder (it
| doesn't exist), you can't delete a folder (again, it
| doesn't exist).
| throwitaway222 wrote:
| That's just an implementation detail of well known
| filesystems.
| dathery wrote:
| Yes, which is why it's not ideal to reuse the folder
| metaphor here. Users have an idea how directories work on
| well-known filesystems and get confused when these fake
| folders don't behave the same way.
| throwitaway222 wrote:
| Are all your S3 keys opaque strings (like UUIDs)? Do you
| use / (slash) in your keys?
|
| If you truly believe S3 has absolutely no connection to
| folders, you would answer Yes and No.
| klodolph wrote:
| I don't think that's a defensible standpoint.
|
| Folders are an important part of the way most people use
| filesystems.
| throwitaway222 wrote:
| Similarly the UI in linux is making up the notion of
| folders and files in them. But we don't say it doesn't
| exist.
| dathery wrote:
| Directories actually exist on the filesystem, which is
| why you have to create them before use and they can exist
| and be empty. They don't exist in S3 and neither of those
| properties do, either. Similarly, common filesystem
| operations on directories (like efficiently renaming
| them, and thus the files under them) are not possible in
| S3.
|
| Of course it can still be useful to group objects in the
| S3 UI, but it would probably be better to use some kind
| of prefix-centric UI rather than reusing the folder
| metaphor when it doesn't match the paradigm people are
| used to.
| kelnos wrote:
| No, they're not made up. A folder (or directory) is a
| specific type of inode, just as a file is.
|
| S3 doesn't have folders. The UI fakes them by creating a
| 0-byte object (or file, if you will). It's a kludge.
| klodolph wrote:
| The UI will fake them without even creating the 0-byte
| object.
| DonHopkins wrote:
| Speaking of user interfaces with optical illusions about
| directory separators:
|
| On the Mac, the Finder lets you have files with slashes in
| their names, even though it's a Unix file system
| underneath. Don't believe me? Go try to use the Finder to
| make a directory whose name is "Reports from 2024/03/10".
| See?
|
| But as everyone knows, slash is the ONLY character you're
| not allowed to have in a file or directory name under Unix.
| It's enforced in the kernel at the system call interface.
| There is absolutely no way to make a file with a slash in
| it. Yet there it is!
|
| The original MacOS operating system used the ":" character
| to delimit directory names, instead of "/", so you could
| have files and directories with slashes in their names,
| just not with colons in their names.
|
| When Apple transitioned from MacOS to Unix, they did not
| want to freak out their users by renaming all their files.
|
| So now try to use the Finder (or any app that uses the
| standard file dialog) to make a folder or file with a ":"
| in its name on a modern Mac. You still can't!
|
| So now go into the shell and list out the parent directory
| containing the directory you made with a slash in its name.
| It's actually called "Reports from 2024:03:10"!
|
| The Mac Finder and system file dialog user interfaces
| actually switch "/" and ":" when they show paths on the
| screen!
|
| Try making a file in the shell with colons in it, then look
| at it in the finder to see the slashes.
|
| However, back in the days of the old MacOS that permitted
| slashes in file names, there was a handy network gateway
| box called the "Gatorbox" that was a Localtalk-to-Ethernet
| AFP/NFS bridge, which took a subtly different approach.
|
| https://en.wikipedia.org/wiki/GatorBox
|
| It took advantage of the fact (or rather it triggered the
| bug) that the Unix NFS implementation boldly made an end-
| run around the kernel's safe system call interface that
| disallowed slashes in file names. So any NFS client could
| actually trick Unix into putting slashes into file names
| via the NFS protocol!
|
| It appeared to work just fine, but then down the line the
| Unix "restore" command would totally shit itself! Of course
| "dump" worked just fine, never raising an error that it was
| writing corrupted dumps that you would not be able to read
| back in your time of need, so you'd only learn that you'd
| been screwed by the bug and lost all your files months or
| years later!
|
| So not only does NFS stand for "No File Security", it also
| stands for "Nasty Forbidden Slashes"!
|
| https://news.ycombinator.com/item?id=31820504
|
| >NFS originally stood for "No File Security".
|
| >The NFS protocol wasn't just stateless, but also
| securityless!
|
| >Stewart, remember the open secret that almost everybody at
| Sun knew about, in which you could tftp a host's
| /etc/exports (because tftp was set up by default in a way
| that left it wide open to anyone from anywhere reading
| files in /etc) to learn the name of all the servers a host
| allowed to mount its file system, and then in a root shell
| simply go "hostname foo ; mount remote:/dir /mnt ; hostname
| `hostname`" to temporarily change the CLIENT's hostname to
| the name of a host that the SERVER allowed to mount the
| directory, then mount it (claiming to be an allowed
| client), then switch it back?
|
| >That's right, the server didn't bother checking the
| client's IP address against the host name it claimed to be
| in the NFS mountd request. That's right: the protocol
| itself let the client tell the server what its host name
| was, and the server implementation didn't check that
| against the client's ip address. Nice professional protocol
| design and implementation, huh?
|
| >Yes, that actually worked, because the NFS protocol
| laughably trusted the CLIENT to identify its host name for
| security purposes. That level of "trust" was built into the
| original NFS protocol and implementation from day one, by
| the geniuses at Sun who originally designed it. The network
| is the computer is insecure, indeed.
|
| [...]
|
| From the Unix-Haters Handbook:
|
| https://archive.org/stream/TheUnixHatersHandbook/ugh_djvu.tx...
|
| Don't Touch That Slash!
|
| UFS allows any character in a filename except for the slash
| (/) and the ASCII NUL character. (Some versions of Unix
| allow ASCII characters with the high-bit, bit 8, set.
| Others don't.)
|
| This feature is great -- especially in versions of Unix
| based on Berkeley's Fast File System, which allows
| filenames longer than 14 characters. It means that you are
| free to construct informative, easy-to-understand filenames
| like these:
|
| 1992 Sales Report
|
| Personnel File: Verne, Jules
|
| rt005mfkbgkw0.cp
|
| Unfortunately, the rest of Unix isn't as tolerant. Of the
| filenames shown above, only rt005mfkbgkw0.cp will work with
| the majority of Unix utilities (which generally can't
| tolerate spaces in filenames).
|
| However, don't fret: Unix will let you construct filenames
| that have control characters or graphics symbols in them.
| (Some versions will even let you build files that have no
| name at all.) This can be a great security feature --
| especially if you have control keys on your keyboard that
| other people don't have on theirs. That's right: you can
| literally create files with names that other people can't
| access. It sort of makes up for the lack of serious
| security access controls in the rest of Unix.
|
| Recall that Unix does place one hard-and-fast restriction
| on filenames: they may never, ever contain the magic slash
| character (/), since the Unix kernel uses the slash to
| denote subdirectories. To enforce this requirement, the
| Unix kernel simply will never let you create a filename
| that has a slash in it. (However, you can have a filename
| with the 0200 bit set, which does list on some versions of
| Unix as a slash character.)
|
| Never? Well, hardly ever.
|
|     Date: Mon, 8 Jan 90 18:41:57 PST
|     From: sun!wrs!yuba!steve@decwrl.dec.com (Steve Sekiguchi)
|     Subject: Info-Mac Digest V8 #35
|
|     I've got a rather difficult problem here. We've got a Gator
|     Box running the NFS/AFP conversion. We use this to hook up
|     Macs and Suns. With the Sun as a AppleShare File server.
|     All of this works great!
|
|     Now here is the problem: Macs are allowed to create files on
|     the Sun/Unix fileserver with a "/" in the filename. This is
|     great until you try to restore one of these files from your
|     "dump" tapes; "restore" core dumps when it runs into a file
|     with a "/" in the filename. As far as I can tell the "dump"
|     tape is fine.
|
|     Does anyone have a suggestion for getting the files off the
|     backup tape?
|
|     Thanks in Advance,
|     Steven Sekiguchi                    Wind River Systems
|     sun!wrs!steve, steve@wrs.com        Emeryville CA, 94608
|
| Apparently Sun's circa 1990 NFS server (which runs inside
| the kernel) assumed that an NFS client would never, ever
| send a filename that had a slash inside it and thus didn't
| bother to check for the illegal character. We're surprised
| that the files got written to the dump tape at all. (Then
| again, perhaps they didn't. There's really no way to tell
| for sure, is there now?)
| ahepp wrote:
| Is it an abstraction for requesting the data you want, or an
| abstraction for storing the data in a retrievable manner?
| nostrebored wrote:
| Weird that it says folders now. I remember it being very
| strictly called a prefix when I was at AWS.
| paranoidrobot wrote:
| I think it's just the Web console, It's still prefix in the
| APIs and CLI.
|
| https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObje...
| Izkata wrote:
| The web console even collapses them like folders on
| slashes, further obfuscating how it actually works. I
| remember having to explain to coworkers why it was so slow
| to load a large bucket.
| klodolph wrote:
| I see you getting downvotes, but you're speaking the honest
| truth, here.
| halayli wrote:
| I don't know why you are being downvoted, what you said is true
| and confuses many newcomers.
| highwaylights wrote:
| This!
|
| I'm fine with it, I actually appreciate the logic and
| simplicity behind it, but the number of times I've tried to
| explain why "folders" on S3 keep disappearing while people
| stare at me like I'm an idiot is really frustrating.
|
| (When you remove the last file in a "folder" on S3, the
| "folder" disappears, because that pattern no longer appears in
| the bucket k/v dictionary so there's no reason to show it as it
| never existed in the first place).
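|
| The closest you can get to "does this folder still exist?" is
| asking whether any key still carries the prefix -- a sketch with
| the AWS SDK for Go v2 (names are made up):
|
|     import (
|         "context"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // folderExists is true only while at least one object carries the
|     // prefix; delete the last one and the "folder" is gone.
|     func folderExists(ctx context.Context, client *s3.Client, bucket, prefix string) (bool, error) {
|         out, err := client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
|             Bucket: aws.String(bucket),
|             Prefix: aws.String(prefix + "/"),
|             // (optionally limit with MaxKeys; a single match is enough)
|         })
|         if err != nil {
|             return false, err
|         }
|         return len(out.Contents) > 0, nil
|     }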
| wkat4242 wrote:
| Hmm, well, there are no folders, but if you interact with the object
| the URL does become nested. So in a sense it does behave
| exactly like a folder for all intents and purposes when dealing
| with it that way. It depends what API you use I guess.
|
| I use S3 just as a web bucket of files (I know it's not the
| best way to do that but it's what I could easily obtain through
| our company's processes). But in this case it makes a lot of
| sense though I try to avoid making folders. But other people
| using the same hosting do use them.
| raverbashing wrote:
| Except stuff like s3 cli has all these weird names for normal
| filesystem items, and you have to bang your head trying to
| figure out what it all means
|
| (also don't get me started on the whole s3api thing)
| inkyoto wrote:
| S3 is tagged, versioned object storage with file-like semantics
| implemented in the AWS SDK (via the AWS S3 APIs). The S3 object
| key is the tag.
|
| Files and folders are used to make S3 buckets more approachable
| to those who either don't know or don't want to know what it
| actually is, and one day they get a surprise.
| Twirrim wrote:
| S3 is a key value store. Just happens to be able to store really
| large values.
| dmarinus wrote:
| I talked to people at AWS who work in RDS Aurora and they hinted
| they use S3 internally as a backend for MySQL and PostgreSQL.
| readyman wrote:
| Big if true. That was definitely not in the AWS cert I took
| lol.
| multani wrote:
| Separating compute and storage is one of the core ideas behind
| Aurora. They talked about it in several places, for instance:
|
| * https://www.amazon.science/publications/amazon-aurora-design...
|
| * https://d1.awsstatic.com/events/reinvent/2019/REPEAT_Amazon_...
| WatchDog wrote:
| Maybe for snapshots, but certainly not for live data.
| YouWhy wrote:
| The article is well written, but I am annoyed at the attempt to
| gatekeep the definition of a filesystem.
|
| Like literally any abstraction out there, filesystems are
| associated with a multitude of possible approaches with
| conceptually different semantics. It's a bit sophistic to say
| that Postgres cannot be run on S3 because S3 is not a filesystem;
| a better choice would have been to explore the underlying
| assumptions; (I suspect latency would kill the hypothetical use
| case of Postgres over S3 even if S3 had incorporated the
| necessary API semantics - could somebody more knowledgeable chime
| in?).
|
| A more interesting avenue to pursue would be: what other
| additions could be made to the S3 API to make it more usable in
| its own right? For example, why doesn't S3 offer more than one
| filename per blob (e.g., similar to what links do in POSIX)?
| bilalq wrote:
| This might be of interest to you: https://neon.tech/blog/bring-
| your-own-s3-to-neon.
|
| There's also the OG Aurora whitepaper:
| https://www.amazon.science/publications/amazon-aurora-design...
| zX41ZdbW wrote:
| ClickHouse can work with S3 as a main storage. This is possible
| because a table is a set of immutable data parts. Data parts
| can be written once and deleted, possibly as a result of a
| background merge operation. S3 API is almost enough, except for
| cases of concurrent database updates. In this case, it is not
| possible to rely on S3 only because it does not support an
| atomic "write if not exists" operation. That's why external,
| strongly consistent metadata storage is needed, which is
| handled by ClickHouse Keeper.
| afiori wrote:
| Is a "write if not exists" atomic operation enouhg as a
| concurrency primitive for database locks?
| justincormack wrote:
| Yes, it's not necessarily the most efficient mechanism
| (could be a lot of retries) but it's sufficient. See the
| Delta Lake paper for example [0]
|
| [0] https://people.eecs.berkeley.edu/~matei/papers/2020/vldb_del...
| yencabulator wrote:
| When talking about analytical databases for "big data",
| yeah. They generally just want an "atomically replace the
| list of Parquet files that make up this table", with one
| writer succeeding at a time.
|
| That would not be a great base to build a transactional
| database on.
| mlhpdx wrote:
| Conditional PUT would be a great addition to S3, indeed.
| buremba wrote:
| That would probably require them to rewrite a non-trivial
| part of S3 from scratch.
| yencabulator wrote:
| Google Cloud Storage supports create-if-not-exist and
| compare-and-swap on generation counter. S3 is much harder to
| use as a building block without tying your code into a second
| system like DynamoDB etc.
|
| https://pkg.go.dev/cloud.google.com/go/storage#Conditions
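|
| For reference, a sketch of what that looks like with the Go
| client from the linked docs (bucket/object names are made up):
|
|     import (
|         "context"
|
|         "cloud.google.com/go/storage"
|     )
|
|     // createIfAbsent writes an object only if no generation of it
|     // exists yet; the precondition is enforced server-side when the
|     // write is finalized on Close.
|     func createIfAbsent(ctx context.Context, bkt *storage.BucketHandle, name string, data []byte) error {
|         w := bkt.Object(name).If(storage.Conditions{DoesNotExist: true}).NewWriter(ctx)
|         if _, err := w.Write(data); err != nil {
|             w.Close()
|             return err
|         }
|         return w.Close() // fails with a precondition error if the object exists
|     }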
| jillesvangurp wrote:
| The notion of postgres not being able to run on s3 has more to
| do with the characteristics of how it works than with it not
| being a filesystem. After all, people have developed fuse
| drivers for s3 so they can actually pretend it's a filesystem.
| But using that to store a database is going to end in tears for
| the same reasons that using e.g. NFS for this is also likely to
| end in tears. You might get it to work but it won't be fast or
| even reliable. And since NFS actually stands for networked file
| system, it's hard to argue that NFS isn't a filesystem.
|
| Whether something is or isn't a filesystem requires defining
| what that actually is. A system that stores files would be a
| simple explanation. Which is clearly something S3 is capable
| of. This probably upsets the definition gatekeepers for
| whatever more specific definitions they are guarding. But it
| has a nice simple logic to it.
|
| It's worth considering that file systems have had a long
| history, weren't always the way they are now, and predate the
| invention of relational databases (like postgres). Technically
| before hard disks were invented in the fifties, we had no file
| systems. Just tapes and punch cards. A tape would consist of a
| single blob of bits, which you'd load in memory. Or it would
| have multiple such blobs at known offsets. I had cassettes full
| of games for my commodore 64. But no disk drive. These blobs
| were called files but there was no file system. Sometime, after
| the invention of disks file systems were invented in the early
| sixties.
|
| Hierarchical databases were common before relational databases
| and filesystems with directories are basically a hierarchical
| database. S3 lacking hierarchy as a simpler key value store
| clearly isn't a hierarchical database. But of course it's easy
| to mimic one simply by using / characters in the keys. Which is
| how the fuse driver probably fakes directories. And S3 even has
| APIs to list files with a common prefix. A bigger deal is the
| inability to modify files. You can only replace them with other
| files (delete and add). That kind of is a show stopper for a
| database. Replacing the entire database on every write isn't
| very practical.
| buremba wrote:
| Neon.tech runs PostgreSQL on S3. They persist the WAL to
| S3 so that they can replicate the data and bring it to local
| SSDs, I assume.
| defaultcompany wrote:
| I've wondered this also because it can be handy to have
| multiple ways of accessing the same file. For example to
| obfuscate database uuids if they are used in the key. In theory
| you could implement soft links in AWS by just storing a file
| with the path to the linked file. But it would be a lot of
| manual work.
| throwaway892238 wrote:
| > The "simple" in S3 is a misnomer. S3 is not actually simple.
| It's deep.
|
| Simple doesn't mean "not deep". It means having the fewest parts
| needed in order to accomplish your requirements.
|
| If you require a distributed, centralized, replicated, high-
| availability, high-durability, high-bandwidth, low-latency,
| strongly-consistent, synchronous, scalable object store with HTTP
| REST API, you can't get much simpler than S3. Lots of features
| have been added to AWS S3 over the years, but the basic operation
| has remained the same.
| svat wrote:
| > _It means having the fewest parts needed in order to
| accomplish your requirements._
|
| That is exactly what "deep" means, in the terminology of this
| post (from Ousterhout's book _A Philosophy of Software Design_
| ). Simple means "not complex" (see also Rich Hickey's talk
| Simple Made Easy: https://www.infoq.com/presentations/Simple-
| Made-Easy/), while "deep" means providing/having a lot of
| internally-complex functionality via a small interface. The
| latter is a better description of S3 (which is what you seem to
| be saying too) than "simple" which would mean there isn't much
| to it.
| throwaway892238 wrote:
| Hickey's definition of simple is wrong. It's not the opposite
| of complex at all. They are not opposites, nor mutually
| exclusive.
|
|     - Easy is when something does not require much effort.
|     - Simple means the least complex it can be and still work.
|     - Complex means there are lots of components.
|
| These are all quite different concepts:
|
|     - Easy is a concept that distinguishes the amount of work
|       needed to use a solution
|     - Simple is a concept that distinguishes whether or not there
|       is an excess number of interacting properties in a system
|     - Complex is a concept describing the quality of having a
|       number of interacting properties in a system
|
| Hickey's talk is useful in terms of thinking about software,
| but it also contains many over-generalizations which are
| incorrect and lead to incorrect thinking about things that
| aren't software. (Even some of his declarations about
| software are wrong)
|
| "Deep", in the context of software complexity, probably only
| makes sense in terms of describing the number of layers
| involved in a piece of technology. You could make something
| have many layers, and it could still be simple, or be
| complex, or easy.
| ahepp wrote:
| In terms the article puts forth, I would almost argue that
| simple implies deep (and the associated "narrow" interface).
| type_Ben_struct wrote:
| Tools like LucidLink and Weka go some way toward making S3 even
| more of a "file system". They break files into smaller chunks (S3
| objects), which helps with partial writes, reads and performance,
| alongside tiering of data from S3 to disk when needed for
| performance.
| hnlmorg wrote:
| I don't know a whole lot about LucidLink but Weka basically
| uses S3 as a dataplane for their own file system.
| rwmj wrote:
| Someone contributed an nbdkit S3 plugin which basically works
| the way you described. It uses numbered S3 chunks using the
| pattern "key/%16x", allowing the virtual disk to be updated.
| (https://libguestfs.org/nbdkit-S3-plugin.1.html
| https://gitlab.com/nbdkit/nbdkit/-/tree/master/plugins/S3)
| cuno wrote:
| The problem with these approaches is that the data is scrambled
| on the backend, so you can't access the files directly from S3
| anymore. Instead you need an S3 gateway to convert from
| scrambled S3 to unscrambled S3. They rely on a separate
| database to reassemble the pieces back together again.
| hn72774 wrote:
| > Filesystem software, especially databases, can't be ported to
| Amazon S3
|
| Hudi, Delta, iceberg bridge that gap now. Databricks built a
| company around it.
|
| Don't try to do relational on object storage on your own. Use one
| of those libraries. It seems simple but it's not. Late arriving
| data, deletes, updates, primary key column values changing, etc.
| albert_e wrote:
| There is specifically a block storage service (EBS) and flavors
| of it like EBS multi-attach, plus EFS, that can be used if there
| is a need to port software/databases to the cloud with low-
| level filesystem support.
|
| Why would we need to do it on object storage which addresses a
| different type of storage need.
|
| Nevertheless there are projects like EMRFS and S3 file system
| mount points that try to provide file system interfaces to
| workloads that need to see S3 as a filesystem.
| hn72774 wrote:
| S3 is better for large datasets. It's cheaper and handles
| large file sizes with ease.
|
| It has become a de-facto standard for distributed, data-
| intensive workloads like those common with spark.
|
| A key benefit is decoupling the data from the compute so that
| they can scale independently. EBS is tightly coupled to iops
| and you pay extra for that.
|
| (Source: a long time working in data engineering)
| albert_e wrote:
| Yes and I also believe:
|
| Experienced Spark / Data Engineering teams would not assume
| S3 is readily useable as a filesystem.
|
| This [1] seems like a good guide on how to configure spark
| for working with Cloud object stores, while recognizing the
| limitations and pitfalls.
|
| [1]: https://spark.apache.org/docs/latest/cloud-integration.html
|
| ---
|
| Amazon EMR offers a managed way to run hadoop or spark
| clusters and it implements an "EMR FS" [2] system to
| interface with S3 as storage.
|
| [2]:
| https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-
| fs.h...
|
| AWS Glue is another option which is "serverless" ETL.
| Source and Destination can be S3 data lakes read through a
| data catalog (hive or glue data catalog). During processing
| AWS Glue can optionally use S3 [3,4,5] for shuffle
| partitions.
|
| [3]: https://aws.amazon.com/blogs/big-data/introducing-
| amazon-s3-...
|
| [4]: https://docs.aws.amazon.com/glue/latest/dg/monitor-
| spark-shu...
|
| [5]: https://aws.amazon.com/blogs/big-data/introducing-the-
| cloud-...
| 8n4vidtmkvmk wrote:
| I still don't understand why you'd want to do it in the first
| place. Just buy some contiguous storage.
| zmmmmm wrote:
| The limitations of S3 (and all the cloud "file systems") are
| quite astonishing when you consider you're paying for it as a
| premium service.
|
| Try to imagine your astonishment if a traditional storage vendor
| showed up and told you that their very expensive premium file
| system they had just sold you:
|
|     - can't store log files, because it can't append anything to
|       an existing file
|     - can't copy files larger than 5GB
|     - can't rename or move a file
|
| When challenged on how you are supposed to make all your
| applications work with limitations like that, they glibly told
| you "oh you're supposed to rewrite them all".
| umanwizard wrote:
| Amazon doesn't market S3 as a replacement for file systems,
| that's why EBS exists.
|
| Also, is S3 really "very expensive"? Relative to what?
| vbezhenar wrote:
| S3 is usually the cheapest storage, not only on Amazon but on
| other clouds too. I don't understand why.
| ForHackernews wrote:
| This is not true in my experience
| https://www.backblaze.com/cloud-storage/pricing
| kiwijamo wrote:
| That Backblaze page (not surprisingly) compares their
| prices to a fairly expensive S3 pricing tier and makes
| other assumptions in Backblaze's favour. For some use
| cases B2 is more expensive e.g. one copy of my backups
| goes to AWS Deep Glacier which is really cheap.
| throwaway290 wrote:
| It's for building things on top. If you want to
| rename/move/copy data, implement a layer that maps objects to
| "filenames" or any metadata you like (or use some lib). If you
| want to write logs, implement append and rotation. But I for
| example don't and won't need any of that and if it helps keep
| the API simpler and more reliable then I benefit.
|
| Being a conventional filesystem would make S3 either a very
| leaky abstraction or a completely different product.
| Cthulhu_ wrote:
| They're not filesystems though, they're object storage or
| key/value storage if you will. It's intended to store the log
| files for the long term once they're full.
|
| You can rename / move a file, but it involves copying and
| deleting the original; I don't understand why they don't have a
| shortcut for that, but it probably makes sense that the user of
| the service is aware of the process instead of hiding it.
|
| I'm not sure about the 5GB limit, it's probably documented
| somewhere as to why that is; possibly, like tweets, having an
| upper limit helps them optimize things. Anyway, there are tools
| for that too: you can do multipart copies, and there's this
| official blog post on the subject:
| https://aws.amazon.com/blogs/storage/copying-objects-greater...
|
| Interesting to note maybe in the context of the post; copy,
| rename, moving large files, all that _could_ be abstracted
| away, but that would hide the underlying logic - which might
| lead to inefficient usage of the service - and worse, make
| users _think_ it's just a filesystem and use it accordingly,
| but it's not intended or designed for that use case.
| gray_-_wolf wrote:
| The current limit is 5TB. The 5GB limit is for a single upload;
| you can however do a multipart upload to get up to the maximum
| object size of 5TB.
|
| https://aws.amazon.com/s3/faqs/
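|
| For illustration, the high-level uploader in the AWS SDK for Go
| v2 does the multipart splitting for you (a sketch; the part size
| and names here are arbitrary):
|
|     import (
|         "context"
|         "io"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/feature/s3/manager"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // uploadLarge streams a body of any size; the manager splits it
|     // into parts behind the scenes, which is how objects past the
|     // single-PUT limit get uploaded.
|     func uploadLarge(ctx context.Context, client *s3.Client, bucket, key string, body io.Reader) error {
|         uploader := manager.NewUploader(client, func(u *manager.Uploader) {
|             u.PartSize = 64 * 1024 * 1024 // 64 MiB parts (minimum is 5 MiB)
|         })
|         _, err := uploader.Upload(ctx, &s3.PutObjectInput{
|             Bucket: aws.String(bucket),
|             Key:    aws.String(key),
|             Body:   body,
|         })
|         return err
|     }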
| ozim wrote:
| These "file systems" are not file systems and I don't
| understand why people expect them to be.
|
| Some people are creating tools that make those services easier
| to sync with file systems, but that is not the intended use anyway.
| inkyoto wrote:
| S3 is object storage, not a file system. The file system in
| AWS is called EFS. S3 is not positioned as a substitute for
| file systems, either.
| pjc50 wrote:
| It's not a filesystem, but it has _better_ semantics for
| distributed operation because of it. Nobody talks about the
| locking semantics of S3 because it's at the blob level; that
| rules out whole categories of problems.
|
| And that's also why you can't append. If you had multiple
| readers while appending, and appending to multiple replicas,
| guaranteeing that each reader would see a consistent only-
| forwards read of the append is extremely hard. So simply ban
| people from doing that and force them to use a different system
| designed for the purpose of logging.
|
| Microservices. S3 is for blobs. If you want something that
| isn't a blob, use a different microservice.
| hiAndrewQuinn wrote:
| I feel like I understand the lasting popularity of the humble FTP
| fileserver a bit better now. Thank you.
| jugg1es wrote:
| oh but amazon offers SFTP on top of S3 so you don't have to
| miss out.
| hiAndrewQuinn wrote:
| If it's offered on top of S3, though, doesn't it still have
| all the same issues of needing to totally overwrite files?
| globular-toast wrote:
| A filesystem is an abstraction built on a block device. A block
| device just gives you a massive array of bytes and lets you
| read/write from them in blocks (e.g. write these 300 bytes at
| position 273041).
|
| A block device itself is an abstraction built on real hardware.
| "Write these 300 bytes" really means something like "move needle
| on platter 2 to position 6... etc"
|
| S3 is just a different abstraction that is also built on raw
| storage somehow. It's a strictly flat key-object store. That's
| it. I don't know why people have a problem with this. If you need
| "filesystem stuff" then implement it in your app, or use a
| filesystem. You only need to append? Use a database to keep track
| of the chain of appends and store the chunks in S3. Doesn't work
| for you? Use something else. Need to "copy"? Make a new reference
| to the same object in your db. Doesn't work for you? Use
| something else.
|
| S3 works for a lot of people. Stop trying to make it something
| else.
|
| And stop trying to change the meaning of super well-established
| names in your field. A filesystem is described in text books
| everywhere. S3 is not a filesystem and never claimed to be one.
|
| Oh and please study a bit of operating system design. Just a
| little bit. It really helps and is great fun too.
| gjvc wrote:
| JFC the people on this thread missing the difference between
| object storage and a blocks-and-inodes filesystem is alarming
| nickcw wrote:
| Great article - would have been useful to read before starting
| out on the journey of making rclone mount (mount your cloud
| storage via fuse)!
|
| After a lot of iterating we eventually came up with the VFS layer
| in rclone which adapts S3 (or any other similar storage system
| like Google Cloud Storage, Azure Blob, Openstack Swift, Oracle
| Object Storage, etc) into a POSIX-ish file system layer in
| rclone. The actual rclone mount code is quite a thin layer on top
| of this.
|
| The VFS layer has various levels of compatibility. The lowest,
| "off", just does directory caching. In this mode, like the article
| states you can't read and write to a file simultaneously and you
| can't write to the middle of a file and you can only write files
| sequentially. Surprisingly quite a lot of things work OK with
| these limitations. The next level up is "writes" - this supports
| nearly all the POSIX features that applications want like being
| able to read and write to the same file at the same time, write
| to the middle of the file, etc. The cost for that though is a
| local copy of the file which is uploaded asynchronously when it
| is closed.
|
| Here are some docs for the VFS caching modes - these mirror the
| limitations in the article nicely!
|
| https://rclone.org/commands/rclone_mount/#vfs-file-caching
|
| By default S3 doesn't have real directories either. This means
| you can't have a directory with no files in, and directories
| don't have valid metadata (like modification time). You can
| create zero length files ending in / which are known as directory
| markers and a lot of tools (including rclone) support these. Not
| being able to have empty directories isn't too much of a problem
| normally as the VFS layer fakes them and most apps then write
| something into their empty directories pretty quickly.
|
| So it is really quite a lot of work trying to convert something
| which looks like S3 into something which looks like a POSIX file
| system. There is a whole lot of smoke and mirrors behind the
| scene when things like renaming an open file happens and other
| nasty corner cases like that.
|
| Rclone's lower level move/sync/copy commands don't bother though
| and use the S3 API pretty much as-is.
|
| If I could change one thing about S3's API I would like an option
| to read the metadata with the listings. Rclone stores
| modification times of files as metadata on the object and there
| isn't a bulk way of reading these, you have to HEAD the object.
| Or alternatively a way of setting the Last-Modified on an object
| when you upload it would do too.
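|
| For anyone curious what that per-object HEAD looks like, a
| sketch with the AWS SDK for Go v2 (assuming the modification
| time is stored under an "mtime" user-metadata key; names are
| made up):
|
|     import (
|         "context"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // readMtime fetches one object's user metadata. There is no bulk
|     // variant, so a listing of N objects costs N extra HEAD requests
|     // if you need the stored modification times.
|     func readMtime(ctx context.Context, client *s3.Client, bucket, key string) (string, error) {
|         out, err := client.HeadObject(ctx, &s3.HeadObjectInput{
|             Bucket: aws.String(bucket),
|             Key:    aws.String(key),
|         })
|         if err != nil {
|             return "", err
|         }
|         return out.Metadata["mtime"], nil // user metadata is a map[string]string
|     }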
| Hakkin wrote:
| > If I could change one thing about S3's API I would like an
| option to read the metadata with the listings. Rclone stores
| modification times of files as metadata on the object and there
| isn't a bulk way of reading these, you have to HEAD the object.
| Or alternatively a way of setting the Last-Modified on an
| object when you upload it would do too.
|
| I wonder if you couldn't hack this in by storing the metadata
| in the key name itself? Obviously with the key length limit of
| 1024 you would be limited in how much metadata you could store,
| but it's still quite a lot of space, even taking into account
| the file path. You could use a delimiter that would be
| invalid in a normalized path, like '//', for example:
| /path/to/file.txt//mtime=1710066090
|
| You would still be able to fetch "directories" via prefixes and
| direct files by using '<filename>//' as the prefix.
|
| This kind of formatting would probably make it pretty
| incompatible with other software though.
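|
| A quick sketch of what that encoding and decoding could look
| like in Go. The "//mtime=" scheme is just the hypothetical
| format from this comment, not something S3 or any tool actually
| defines:
|
|     package main
|
|     import (
|         "fmt"
|         "strconv"
|         "strings"
|         "time"
|     )
|
|     // encodeKey appends "//mtime=<unix seconds>" to the object
|     // key, relying on "//" never appearing in a normalized path.
|     func encodeKey(path string, mtime time.Time) string {
|         return fmt.Sprintf("%s//mtime=%d", path, mtime.Unix())
|     }
|
|     // decodeKey splits the metadata suffix back off the key.
|     func decodeKey(key string) (string, time.Time, bool) {
|         i := strings.LastIndex(key, "//mtime=")
|         if i < 0 {
|             return key, time.Time{}, false
|         }
|         secs, err := strconv.ParseInt(key[i+len("//mtime="):], 10, 64)
|         if err != nil {
|             return key, time.Time{}, false
|         }
|         return key[:i], time.Unix(secs, 0), true
|     }
|
|     func main() {
|         k := encodeKey("/path/to/file.txt", time.Unix(1710066090, 0))
|         fmt.Println(k) // /path/to/file.txt//mtime=1710066090
|         p, m, _ := decodeKey(k)
|         fmt.Println(p, m.Unix())
|     }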
| nickcw wrote:
| I think that is a nice idea - maybe something we could
| implement in an overlay backend. However, people really like
| the fact that the objects they upload with rclone arrive on S3
| with the filenames they had originally, so I think the
| incompatibility-with-other-software downside would make it
| unattractive for most users.
| klauspost wrote:
| > If I could change one thing about S3's API I would like an
| option to read the metadata with the listings.
|
| Agree. In MinIO (disclaimer: I work there) we added a "secret"
| parameter (metadata=true) to include metadata and tags in
| listings if the user has the appropriate permissions. Of course
| it being an extension it is not really something that you can
| reliably use. But rclone can of course always try it and use it
| if available :)
|
| > You can create zero length files ending in /
|
| Yeah. Though you could also consider "shared prefixes" in
| listings as directories in themselves. That of course makes
| directories "stateless" and unable to exist if there are no
| objects under them - which has pros and cons.
|
| > Or alternatively a way of setting the Last-Modified on an
| object when you upload it would do too.
|
| Yes, that gives severe limitations to clients. However it does
| make the "server" time the reference. But we have to deal with
| the same limitation for client side replication/mirroring.
|
| My personal biggest complaint is that there isn't a
| `HeadObjectVersions` that returns version information for a
| single object. `ListObjectVersions` is always going to be a
| "cluster-wide" operation, since you cannot know if the given
| prefix is actually a prefix or an object key. AWS recently
| added "GetObjectAttributes" - but it doesn't add version
| information, which would have fit in nicely there.
| nickcw wrote:
| > Agree. In MinIO (disclaimer: I work there) we added a
| "secret" parameter (metadata=true) to include metadata and
| tags in listings if the user has the appropriate permissions.
| Of course it being an extension it is not really something
| that you can reliably use. But rclone can of course always
| try it and use it if available :)
|
| Is this "secret" parameter documented somewhere? Sounds very
| useful :-) Rclone knows when it is talking to Minio so we
| could easily wedge that in.
|
| > My personal biggest complaint is that there isn't a
| `HeadObjectVersions` that returns version information for a
| single object. `ListObjectVersions` is always going to be a
| "cluster-wide" operation, since you cannot know if the given
| prefix is actually a prefix or an object key
|
| Yes that is annoying having to do a List just to figure out
| which object Version is being referred to. (Rclone has this
| problem when using --s3-list-version).
| glitchcrab wrote:
| Hey Nick :wave:
| wodenokoto wrote:
| Is there a generic name for these distributed cloud file
| storages?
|
| AWS's is S3, Google's is buckets, Azure's is Blob Storage, and
| the open source version is ... ?
| dexwiz wrote:
| Object Storage
| jeffbr13 wrote:
| I tend to go by Binary Large OBject (BLOB) storage to discern
| between this kind of object storage and "object" as in OOP.
| BLOB is also what databases call files stored in columns.
| OJFord wrote:
| When would that be confusing? As in what would an AWS
| service offering OOP object storage be/mean?
| gilbetron wrote:
| "blob storage" is the usual generic term, even though Azure
| uses it explicitly. It's like calling adhesive bandages,
| "bandaids" even though that is a specific company's term.
| surajrmal wrote:
| "Google buckets" is a bit off - the product is called Google
| Cloud Storage. Buckets are also a term used by S3 and are
| equivalent to Azure Blob Storage containers. They are an
| intermediary layer that determines attributes for the objects
| stored within them, such as ACLs and storage class (and
| therefore cost and performance).
|
| As to your question, object storage[1] seems to be the generic
| term for the technology. Internally they all rely on naming
| files based on the hash of their contents for quick lookup,
| deduplication, and avoiding name clashes.
|
| 1: https://en.wikipedia.org/wiki/Object_storage
| tison wrote:
| This was discussed in https://github.com/apache/arrow-
| rs/issues/3888, which compares object_store in Apache Arrow to
| the APIs provided by Apache OpenDAL.
|
| Briefly, Apache OpenDAL is a library providing FS-like APIs over
| multiple storage backends, including S3 and many other cloud
| storage services.
|
| A few database systems, such as GreptimeDB and Databend, use
| OpenDAL as a better S3 SDK to access data on cloud storage.
|
| Other solutions exist to manage filesystem-like interfaces over
| S3, including Alluxio and JuiceFS. Unlike Apache OpenDAL, Alluxio
| and JuiceFS need to be deployed standalone and have a dedicated
| internal metadata service.
| Lucasoato wrote:
| I'm not sure if Alluxio could be substituted by OpenDAL as a
| local cache layer for TrinoDB.
| cynicalsecurity wrote:
| Backblaze B2 is worth mentioning while we are speaking of S3. I'm
| absolutely in love with their prices (3 times lower than S3's).
| (I'm not their representative.)
| silvertaza wrote:
| With every alternative, the prevailing issue is that your data
| is only as safe as the company your data is with. But I think
| this can be remedied by keeping a second backup elsewhere.
| didgeoridoo wrote:
| B2 having an S3-compatible API available makes this
| particularly easy :)
| OJFord wrote:
| Backblaze is like what you'd get if Amazon spun S3 out as its
| own business (and it added some backup helper tooling as a
| result), though, so I wouldn't really worry any more about it.
| You could write a second copy to S3 Glacier Deep Archive
| (using B2 for instant access when you wanted to restore or on
| a new device) and still be much cheaper.
| overstay8930 wrote:
| We liked B2, but not enough to pay for IPv4 addresses. It's
| insane that they advertise as a multi-cloud solution but
| basically kill any chance at adoption when NAT gateways and
| IPv4 charges are everywhere. We would literally save money
| paying B2 bandwidth fees (high read, low write), but not when
| being pushed through a NAT64 gateway or paying an hourly charge
| just to be able to access B2.
| Kwpolska wrote:
| How could they launch a cloud service like this and not have
| IPv6 in 2015? What other basic things did they cheap out on?
| miyuru wrote:
| I also migrated, after asking for IPv6 for more than 3 years
| on reddit.
|
| They do not seem to understand the users of the B2 product.
| It's almost as if B2 is just a supplementary service to
| their backup product.
|
| https://www.reddit.com/r/backblaze/comments/ij9y9s/b2_s3_not.
| ..
| orf wrote:
| > And listing files is slow. While the joy of Amazon S3 is that
| you can read and write at extremely, extremely, high bandwidths,
| listing out what is there is much much slower. Slower than a slow
| local filesystem
|
| This misses something critical. Yes, s3 has fast reading and
| writing, but that's not really what makes it _useful_.
|
| What makes it useful _is_ listing. In an unversioned bucket (or
| one with no delete markers), listing any given prefix is
| essentially constant time: I can take any given string, in a
| bucket with 100 billion objects, and say "give me the next 1000
| keys alphabetically that come after this random string".
|
| What's more, using "/" as a delimiter is just the default - you
| can use any character you want and get a set of common prefixes.
| There are no "directories", "directories" are created out of thin
| air on demand.
|
| This is super powerful, and it's the thing that lets you
| partition your data in various ways, using whatever identifiers
| you need, without worrying about performance.
|
| If listing were just "slow", couldn't list on file prefixes,
| _and_ got slower in proportion to the number of keys (i.e. a
| traditional Unix file system), then it wouldn't be useful at all.
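|
| To illustrate, this is roughly what such a request looks like
| with ListObjectsV2 in the AWS SDK for Go v2; the bucket, start
| string and delimiter are arbitrary placeholders, and the client
| comes from the usual LoadDefaultConfig/NewFromConfig setup:
|
|     package sketch
|
|     import (
|         "context"
|         "fmt"
|         "log"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     func listAfter(ctx context.Context, client *s3.Client) {
|         out, err := client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
|             Bucket: aws.String("example-bucket"),
|             // start just after an arbitrary string in the bucket
|             StartAfter: aws.String("some/arbitrary/string"),
|             // group on any character, not just "/"
|             Delimiter: aws.String("-"),
|         })
|         if err != nil {
|             log.Fatal(err)
|         }
|         for _, obj := range out.Contents { // up to 1000 keys
|             fmt.Println("key:", aws.ToString(obj.Key))
|         }
|         for _, cp := range out.CommonPrefixes { // "directories"
|             fmt.Println("prefix:", aws.ToString(cp.Prefix))
|         }
|     }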
| adrian_b wrote:
| For 30 years now (starting with XFS in 1993, which was
| inspired by HPFS), all the good UNIX file systems have
| implemented directories as some kind of B-tree.
|
| Therefore they do not get slower in proportion to the number of
| entries, and listing based on file prefixes is extremely fast.
| orf wrote:
| Yes they do. What APIs does Linux offer that allow you to
| list a directory's contents alphabetically _starting at a
| specific filename_ in constant time? You have to iterate the
| _directory_ contents.
|
| You can maybe use "d_off" with readdir in some way, but
| that's specific to the filesystem. There's no portable way to
| do this with POSIX.
|
| Regardless of if you can do it with a single directory, you
| can't do it for all files recursively under a given prefix.
| You can't just ignore directories, or say that "for this list
| request, '-' is my directory separator".
|
| The use of b-trees in file systems is completely beside the
| point.
| adrian_b wrote:
| The POSIX API is indeed even older, so it is not helpful.
|
| But as you say, there are filesystem-specific methods or
| operating-system specific methods to reach the true
| performance of the filesystem.
|
| It is likely that for maximum performance one would have to
| write custom directory search functions using directly the
| Linux syscalls, instead of using the standard libc
| functions, but I would rather do that instead of paying for
| S3 or something like it.
| orf wrote:
| Yes. You could also just use a SQLite table with two
| columns (path, contents), then just query that. Or do any
| number of other things.
|
| The question isn't if it's possible, because of course it
| is, the question is if it's portable and well supported
| with the POSIX interface. Because if it's not, then...
| anamexis wrote:
| > The question isn't if it's possible, because of course
| it is, the question is if it's portable and well
| supported with the POSIX interface. Because if it's not,
| then...
|
| Where did this goalpost come from? S3 is not portable or
| POSIX compliant.
| orf wrote:
| From the article we're commenting on, which is comparing
| the interface of S3 to the POSIX interface. Not any given
| filesystem + platform specific interface.
| anamexis wrote:
| The article does not mention POSIX, or anything about
| listing files, at all.
| zaphar wrote:
| The article starts out by making a comparison between the
| posix api filesystem calls and S3's api. The context is
| very much a comparison between those two api surface
| areas.
| orf wrote:
| It mistakenly mentions UNIX whilst referencing the POSIX
| filesystem API, and I literally quoted where it talks
| about listing in my original comment.
| justincormack wrote:
| There are no specific syscalls that you can use for this.
| The libc functions and the syscalls are extremely
| similar.
| nh2 wrote:
| > listing based on file prefixes is extremely fast
|
| This functionality does not exist to my knowledge.
|
| ext4 and XFS return directory entries in pseudo-random order
| (due to hashing), not lexicographically.
|
| For an example, see e.g.
| https://righteousit.wordpress.com/2022/01/13/xfs-
| part-6-btre...
|
| If you know a way to return lexicographical order directly
| from the file system, without the need to sort, please link
| it.
| kbolino wrote:
| Resolving random file system paths still gets slower
| proportional to their _depth_ , which is not the case for S3,
| where the prefix is on the entire object key and not just the
| "basename" part of it, like in a filesystem.
| jacobsimon wrote:
| What is it about S3 that enables this speed, and why can't
| traditional Unix file systems do the same?
| orf wrote:
| S3 doesn't have directories; it can be thought of as a flat,
| sorted list of keys.
|
| UNIX (and all operating systems) differentiates between a file
| and a directory. To list the contents of a directory, you
| need to make an explicit call. That call might return files
| or directories.
|
| So to list all files recursively, you need to list, sort,
| check if an entry is a directory, and recurse. This isn't great.
| bradleyjg wrote:
| Code written against s3 is not portable either. It doesn't
| support azure or gcp, much less some random proprietary
| cloud.
| arcfour wrote:
| I've seen several S3-compatible APIs and there are open-
| source clients. If anything it's the de-facto standard.
| zaphar wrote:
| GCP storage buckets implement the S3 api. You can treat
| them like they were an s3 bucket. Something I do all the
| time.
| cuno wrote:
| Actually we've found it's often much worse than that.
| Code written against AWS S3 using the AWS SDK often
| doesn't work on a great many "S3-compatible" vendors
| (including on-prem versions). Although there's
| documentation on S3, it's vague in many ways, and the AWS
| SDKs rely on actual AWS behaviour. We've had to deal with
| a lot of commercial and cloud vendors that subtly break
| things. This includes giant public cloud companies. In
| one case a giant vendor only failed at high loads, making
| it appear to "work" until it didn't, because its backoff
| response was not what the AWS SDK expected. It's been a
| headache that we've had to deal with for cunoFS, as well as
| making it work with GCP and Azure. At the big HPC
| conference Supercomputing 2023, when we mentioned
| supporting "S3 compatible" systems, we would often be
| told stories about applications not working with their
| supposedly "S3 compatible" one (from a mix of vendors).
| yencabulator wrote:
| Back in 2011 when I was working on making Ceph's RadosGW
| more S3-compatible, it was pretty common that AWS S3
| behavior differed from their documentation too. I wrote a
| test suite to run against AWS and Ceph, just to figure
| out the differences. That lives on at
| https://github.com/ceph/s3-tests
| mechanicalpulse wrote:
| Isn't that a limitation imposed by the POSIX APIs, though,
| as a direct consequence of the interface's representation
| of hierarchical filesystems as trees? As you've
| illustrated, that necessitates walking the tree. Many
| tools, I suppose, walk the tree via a single thread,
| further serializing the process. In an admittedly haphazard
| test, I ran `find(1)` on ext4, xfs, and zfs filesystems and
| saw only one thread.
|
| I imagine there's at least one POSIX-compatible file system
| out there that supports another, more performant method of
| dumping its internal metadata via some system call or
| another. But then we would no longer be comparing the S3
| and POSIX APIs.
| aeyes wrote:
| And if for some reason you need a complete listing along with
| object sizes and other attributes you can get one every 24
| hours with S3 inventory report.
|
| That has always been good enough for me.
| tjoff wrote:
| Is listing really such a key feature that people use it as a
| database to find objects?
|
| Have not used S3, but that is not how I imagined using it.
| orf wrote:
| Sure. It's kind of an index - limited to prefix-only
| searching, but useful.
|
| Say you store uploads associated with a company and a user.
| You'd maybe naively store them as `[company-uuid]/[user-
| id].[timestamp]`.
|
| If you need to list a given user's (123) uploads after a given
| date, you'd list keys after `[company-uuid]/123.[date]`. If
| you need to list all of that user's uploads, you'd list
| `[company-uuid]/123.`. If you need to get a set of all users
| who have photos, you'd list `[company-uuid]/` with a Delimiter
| set to `.`
|
| The point is that it's flexible, and with a bit of thought it
| allows you to "remove all a user's uploads between two dates",
| "remove all a company's uploads" or "remove all a user's
| uploads" with a single call. Or whatever specific stuff is
| important to your use-case, that might otherwise need a
| separate DB.
|
| It's not perfect - you can't reverse the listing (i.e you
| can't get the _latest_ photo for a given user by sorting
| descending for example), and needs some thought about your
| key structure.
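|
| To make the key scheme concrete, here is a rough sketch with the
| AWS SDK for Go v2; the bucket name, company id, user id and date
| are placeholders, and the client setup is assumed elsewhere:
|
|     package sketch
|
|     import (
|         "context"
|         "fmt"
|         "log"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // listUploads works on keys shaped like
|     // "<company-uuid>/<user-id>.<timestamp>".
|     func listUploads(ctx context.Context, client *s3.Client) {
|         const bucket = "example-bucket"
|         const company = "company-uuid" // placeholder
|
|         // User 123's uploads strictly after 2024-01-01:
|         out, err := client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
|             Bucket:     aws.String(bucket),
|             Prefix:     aws.String(company + "/123."),
|             StartAfter: aws.String(company + "/123.2024-01-01"),
|         })
|         if err != nil {
|             log.Fatal(err)
|         }
|         fmt.Println("uploads after date:", len(out.Contents))
|
|         // The set of users with uploads: group at the "." delimiter.
|         users, err := client.ListObjectsV2(ctx, &s3.ListObjectsV2Input{
|             Bucket:    aws.String(bucket),
|             Prefix:    aws.String(company + "/"),
|             Delimiter: aws.String("."),
|         })
|         if err != nil {
|             log.Fatal(err)
|         }
|         for _, cp := range users.CommonPrefixes {
|             fmt.Println("user prefix:", aws.ToString(cp.Prefix))
|         }
|     }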
| tjoff wrote:
| But surely you need to track that elsewhere anyway?
|
| That some niche edge-case runs efficiently doesn't sound
| like a defining feature of S3. On the contrary, many common
| operations map terribly to S3, so you kind of need the
| logic to be elsewhere.
| orf wrote:
| My overall point can be summarised as this:
|
| - Listing things is a very common operation to do.
|
| - The POSIX api and the directory/file hierarchy it
| provides is a restrictive one.
|
| - S3 does not suffer from this, you can recursively list
| and group keys into directories at "list time".
|
| - If you find yourself needing to list gigantic numbers
| of keys in one go, you can do better by only listing a
| subset. S3 isn't a filesystem, you shouldn't need to list
| 1k+ keys sequentially apart from during maintenance
| tasks.
|
| - This is actually quite fast, compared to alternatives.
|
| Whether or not you see a use case for this is sort of
| irrelevant: they exist. It's what allows you to easily
| put data into S3 and flexibly group/scan it by specific
| attributes.
| tjoff wrote:
| Listing things is very common, so why would you outsource
| that to S3 when all your bookkeeping is elsewhere? It's
| not like you would ever rely on the POSIX API for that
| anyway, even for when your files actually are on a POSIX
| filesystem.
|
| For sure, for maintenance tasks etc. it sounds quite
| useful. And good hygiene with prefixes sounds like a sane
| idea. But listing being a critical part of what "makes S3
| useful"? That seems like an huge stretch that your points
| don't seem to address.
| orf wrote:
| > It's not like you would ever rely on the POSIX API for
| that anyway, even for when your files actually are on a
| POSIX filesystem.
|
| Because there _is no_ POSIX api for this. Depending on
| your requirements and query patterns, you may not need a
| completely separate database that you need to keep in
| sync.
| kbolino wrote:
| > But surely you need to track that elsewhere anyway?
|
| Why? If the S3 structure and listing is sufficient, I
| don't need to store anything else anywhere else.
|
| Many use cases may involve other requirements that S3
| can't meet, such as being able to find the same object
| via different keys, or being able to search through the
| metadata fields. However, if the requirements match up
| with S3's structure, then additional services are
| unnecessary and keeping them in sync with S3 is more
| hassle than it's worth.
| tjoff wrote:
| I agree, but something as simple (in functionality) as
| that ought to be an edge-case. Not a defining feature of
| S3.
| dekhn wrote:
| it's a property of the system that I, as an architect,
| would seriously consider as part of my system's design.
| I've worked with many systems where iterating over items
| in order starting from a prefix is extremely cheap
| (sstables).
| orf wrote:
| It's fundamental to how S3 works and its ability to
| scale, so it is a defining feature of S3.
|
| If you think wider, a bucket itself is just a prefix.
| tjoff wrote:
| From amazons perspective, sure!
|
| But that's not what we are discussing.
| belter wrote:
| No. The standard practice is to use a DynamoDB table as the
| index for your objects in S3.
|
| This article misunderstood S3 and could as well have the
| title: "An Airplane is not a Car" :-)
| macintux wrote:
| I don't know that you can characterize that as a "standard
| practice".
|
| Maybe it's widespread, but I've not encountered it.
| belter wrote:
| "Building and Maintaining an Amazon S3 Metadata Index
| without Servers" - https://aws.amazon.com/pt/blogs/big-
| data/building-and-mainta...
|
| Here is the architecture of Amazon Drive and the storage
| of metadata.
|
| "AWS re:Invent 2014 | (ARC309) Building and Scaling
| Amazon Cloud Drive to Millions of Users" -
| https://youtu.be/R2pKtmhyNoA
|
| And you can see the use here at correct time:
| https://youtu.be/R2pKtmhyNoA?t=546
| ianburrell wrote:
| That article is old. DynamoDB was used because of the
| old, weak consistency model of S3. Writes were atomic,
| but lists could return stale results, so you needed a
| consistent list of objects elsewhere.
|
| But in 2020, S3 changed to a strong consistency model.
| There is no need to use DynamoDB now.
| fijiaarone wrote:
| So in reality S3 takes about 2 seconds to retrieve a single
| file, under ideal conditions. 1 second round trip for the
| request to DynamoDB to get the object key of the file and 1
| second round trip to S3 to get the file contents (assuming
| no CPU cost on the search because you're getting the key by
| ID from the DynamoDB in a flat single table store. And that
| the file has no network IO because it is a trivial number
| of bytes, so the HTTP header overwhelms the content.)
|
| I know what you're thinking -- 2 seconds, that's faster
| than I can type the 300 character file key with its pseudo
| prefixes!)
|
| Ah, but what if you wanted to get 2 files from S3?
| calpaterson wrote:
| I have to say that I'm not hugely convinced. I don't really
| think that being able to pull out the keys before or after a
| prefix is particularly impressive. That is the basis for
| database indices going back to the 1970s after all.
|
| Perhaps the use-cases you're talking about are very different
| from mine. That's possible of course.
|
| But for me, often the slow speed of listing the bucket gets in
| the way. Your bucket doesn't have to get very big before
| listing the keys takes longer than reading them. I seem to
| remember that listing operations ran at sub-1mbps, but
| admittedly I don't have a big bucket handy right now to test
| that.
| orf wrote:
| It depends on a few factors. The list objects call hides
| deleted and noncurrent versions, but it has to skip over
| them. Grouping prefixes also takes time, if they contain a
| lot of noncurrent or deleted keys.
|
| A pathological case would be a prefix with 100 million
| deleted keys, and 1 actual key at the end. Listing the parent
| prefix takes a long time in this case - I've seen it take
| several minutes.
|
| If your bucket is pretty "normal" and doesn't have this, or
| isn't versioned, then you can do 4-5 thousand list requests a
| second, at any given key/prefix, in constant time. Or you
| can explicitly list object versions (and not skip deleted
| keys) also in constant time.
|
| It all depends on your data: if you need to list all objects
| then yeah it's gonna be slow because you need to paginate
| through all the objects. But the point is that you don't have
| to do that if you don't want to, unlike a traditional
| filesystem with a directory hierarchy.
|
| And this enables parallelisation: why list everything
| sequentially, when you can group the prefixes by some
| character (e.g. "-"), then process each of those prefixes in
| parallel.
|
| The world is your oyster.
| cuno wrote:
| We and our customers use S3 as a POSIX filesystem, and we
| generally find it faster than a local filesystem for many
| benchmarks. For listing directories we find it faster than
| Lustre (a real high performance filesystem). Our approach is
| to first try listing directories with a single ListObjectsV2
| (which on AWS S3 is in lexicographic order) and, if it hasn't
| made much progress, we start listing with parallel
| ListObjectsV2 calls. Once you start parallelising the
| ListObjectsV2 (rather than sequentially "continuing") you get
| massive speedups.
| crabbone wrote:
| > find it faster than a local filesystem for many
| benchmarks.
|
| What did you measure? How did you compare? This claim seems
| _very_ contrary to my experience and understanding of how
| things work...
|
| Let me refine the question: did you measure metadata or
| data operations? What kind of storage medium is used by the
| filesystem you use? How much memory (and subsequently the
| filesystem cache) does your system have?
|
| ----
|
| The thing is: you should expect, in the best case,
| something like 5 ms latency on network calls over the
| Internet in an ideal case. Within the datacenter, maybe you
| can achieve sub-ms latency, but that's hard. AWS within
| region but different zones tends to be around 1 ms latency.
|
| This is while NVMe latency, even on consumer products, is
| 10-20 _micro_ seconds. I.e. we are talking about roughly
| 100 times faster than anything going through the network
| can offer.
| cuno wrote:
| For AWS, we're comparing against filesystems in the
| datacenter - so EBS, EFS and FSx Lustre. Compared to
| these, you can see in the graphs where S3 is much faster
| for workloads with big files and small files:
| https://cuno.io/technology/
|
| and in even more detail of different types of EBS/EFS/FSx
| Lustre here: https://cuno.io/blog/making-the-right-
| choice-comparing-the-c...
| hnlmorg wrote:
| EFS is ridiculously slow though. Almost to the point
| where I fail to see how it's actually useful for any of
| the traditional use cases for NFS.
| dekhn wrote:
| if you turn all the EFS performance knobs up (at a high
| cost), it's quite fast.
| hnlmorg wrote:
| Fast _er_ , sure. But I wouldn't got so far as to say it
| is _fast_
| wenc wrote:
| S3 is really high latency though. I store Parquet files
| on S3 and querying them through DuckDB is much slower
| than a local file system because of random access
| patterns. I can see S3 being decent for bulk access but
| definitely not for random access.
|
| This is why there's a new S3 Express offering that is low
| latency (but costs more).
| crabbone wrote:
| The tests are very weird...
|
| Normally, from someone working in the storage, you'd
| expect tests to be in IOPS, and the goto tool for
| reproducible tests is FIO. I mean, of course
| "reproducibility" is a very broad subject, but people are
| so used to this tool that they develop certain intuition
| and interpretation for it / its results.
|
| On the other hand, seeing throughput figures is kinda...
| it tells you very little about how the system performs.
| Just to give you some reasons: a system can be configured
| to do compression or deduplication on client / server,
| and this will significantly impact your throughput,
| depending on what you actually measure: the amount of
| useful information presented to the user or the amount of
| information transferred. Also throughput at the expense
| of higher latency may or may not be a good thing...
| Really, if you ask anyone who ever worked on a storage
| product about how they could crank up throughput numbers,
| they'd tell you: "write bigger blocks asynchronously".
| This is the basic recipe, if that's what you want.
| Whether this makes a good all around system or not... I'd
| say, probably not.
|
| Of course, there are many other concerns. Data
| consistency is a big one, and this is a typical tradeoff
| when it comes to choosing between an object store and a
| filesystem, since a filesystem offers more data consistency
| guarantees, whereas an object store can do certain things
| faster while breaking them.
|
| BTW, I don't think most readers would understand Lustre
| and similar to be the "local filesystem", since it
| operates over the network and network performance will have a
| significant impact, of course, it will also put it in the
| same ballpark as other networked systems.
|
| I'd also say that Ceph is kinda missing from this
| benchmark... Again, if we are talking about filesystem on
| top of object store, it's the prime example...
| cuno wrote:
| IOPS is a really lazy benchmark that we believe can
| greatly diverge from most real life workloads, except for
| truly random I/O in applications such as databases. For
| example, in Machine Learning, training usually consists
| of taking large datasets (sometimes many PBs in scale),
| randomly shuffling them each Epoch, and feeding them into
| the engine as fast as possible. Because of this, we see
| storage vendors for ML workloads concentrate on IOPS
| numbers. The GPUs however only really care about
| throughput. Indeed, we find a great many applications
| only really care about the throughput, and IOPS is only
| relevant if it helps to accomplish that throughput. For
| ML, we realised that the shuffling isn't actually random
| - there's no real reason for it to be random versus
| pseudo-random. And if it's pseudo-random then it is
| predictable, and if it's predictable then we can exploit
| that to great effect - yielding a 60x boost in throughput
| on S3, beating out a bunch of other solutions. S3 is not
| going to do great for truly random I/O, however, we find
| that most scientific, media and finance workloads are
| actually deterministic or semi-deterministic, and this is
| where cunoFS, by peering inside each process, can better
| predict intra-file and inter-file access patterns, so
| that we can hide the latencies present in S3. At the end
| of the day, the right benchmark is the one that reflects
| real world usage of applications, but that's a lot of
| effort to document one by one.
|
| I agree that things like dedupe and compression can
| affect things, so in our large file benchmarks each file
| is actually random. The small file benchmarks aren't
| affected by "write bigger blocks" because there's nothing
| bigger than the file itself. Yes, data consistency can be
| an issue, and we've had to do all sorts of things to
| ensure POSIX consistency guarantees beyond what S3 (or
| compatible) can provide. These come with restrictions
| (such as on concurrent writes to the same file on
| multiple nodes), but so does NFS. In practice, we
| introduced a cunoFS Fusion mode that relies on a
| traditional high-IOPS filesystem for such workloads and
| consistency (automatically migrating data to that tier),
| and high throughput object for other workloads that don't
| need it.
| supriyo-biswas wrote:
| > Once you start parallelising the ListObjectsV2 (rather
| than sequentially "continuing")
|
| How are you "parallelizing" the ListObjectsV2? The
| continuation token can only be fed in once the previous
| ListObjectsV2 response has completed, unless you know the
| name or structure of keys ahead of time, in which case
| listing objects isn't necessary.
| cuno wrote:
| For example, you can do separate parallel ListObjectsV2
| calls for keys starting a-f, g-k, etc., covering the whole
| key space. You can parallelize recursively based on what
| is found in the first 1000 entries so that it matches the
| statistics of the keys. Yes, there may be pathological
| cases, but in practice we find this works very well.
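|
| In case it helps to picture it, a rough sketch of that kind of
| parallel listing with the AWS SDK for Go v2 is below. The split
| points and bucket name are arbitrary assumptions; a real
| implementation would re-split based on what early pages return:
|
|     package sketch
|
|     import (
|         "context"
|         "fmt"
|         "log"
|         "sync"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // listParallel runs one lister per slice of the key space.
|     // An empty "stop" means "to the end of the bucket".
|     func listParallel(ctx context.Context, client *s3.Client) {
|         splits := []struct{ start, stop string }{
|             {"", "g"}, {"g", "n"}, {"n", "t"}, {"t", ""},
|         }
|         var wg sync.WaitGroup
|         for _, s := range splits {
|             s := s // capture loop variable
|             wg.Add(1)
|             go func() {
|                 defer wg.Done()
|                 in := &s3.ListObjectsV2Input{
|                     Bucket: aws.String("example-bucket"),
|                 }
|                 if s.start != "" {
|                     in.StartAfter = aws.String(s.start)
|                 }
|                 p := s3.NewListObjectsV2Paginator(client, in)
|                 for p.HasMorePages() {
|                     page, err := p.NextPage(ctx)
|                     if err != nil {
|                         log.Println(err)
|                         return
|                     }
|                     for _, obj := range page.Contents {
|                         key := aws.ToString(obj.Key)
|                         if s.stop != "" && key >= s.stop {
|                             return // next worker's slice
|                         }
|                         fmt.Println(key)
|                     }
|                 }
|             }()
|         }
|         wg.Wait()
|     }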
| johnmaguire wrote:
| You're right that it won't work for all use cases, but
| starting two threads with prefixes A and M, for example,
| is one way you might achieve this.
| fijiaarone wrote:
| If you think s3 is fast, you should try FTP. It's at least
| a hundred times faster. And combined with rsync, dozens of
| times more reliable.
| hayd wrote:
| You can set up CloudWatch Events to trigger a Lambda function
| to store metadata about the S3 object in a regular database.
| That way you can index it however you expect to list it.
|
| Very effective for our use case.
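|
| A minimal sketch of that pattern in Go, using the aws-lambda-go
| event types, is below. indexObject stands in for whatever
| database write you would actually do (it is not a real API), and
| wiring the S3 event notification to the function is assumed to
| be configured separately:
|
|     package main
|
|     import (
|         "context"
|         "log"
|
|         "github.com/aws/aws-lambda-go/events"
|         "github.com/aws/aws-lambda-go/lambda"
|     )
|
|     // indexObject is a placeholder for your own database write
|     // (DynamoDB, Postgres, ...). It is not a real library call.
|     func indexObject(ctx context.Context, bucket, key string,
|         size int64) error {
|         log.Printf("index %s/%s (%d bytes)", bucket, key, size)
|         return nil
|     }
|
|     func handler(ctx context.Context, e events.S3Event) error {
|         for _, r := range e.Records {
|             err := indexObject(ctx, r.S3.Bucket.Name,
|                 r.S3.Object.Key, r.S3.Object.Size)
|             if err != nil {
|                 return err
|             }
|         }
|         return nil
|     }
|
|     func main() {
|         lambda.Start(handler)
|     }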
| foldr wrote:
| >What makes it useful is listing.
|
| I think 99% of S3 usage just consists of retrieving objects
| with known keys. It seems odd to me to consider prefix listing
| as a key feature.
| bostik wrote:
| When you embed the relevant (not necessarily that of object
| creation) timestamp as a prefix, it sure becomes one. Whether
| that prefix is part of the "path"
| (object/path/prefix/with/<4-digit year>/) or directly part of
| the basename (object/path/prefix/to/app-
| specific/files/<4-digit year>-<2-digit month>-....), being
| able to limit the search space server-side becomes incredibly
| useful.
|
| You can try it yourself: list objects in a bucket prefix with
| _lots_ of files, and measure the time it takes to list all of
| them vs. the time it takes to list only a subset of them that
| share a common prefix.
| gamache wrote:
| > ...listing any given prefix is essentially constant time: I
| can take any given string, in a bucket with 100 billion
| objects, and say "give me the next 1000 keys alphabetically
| that come after this random string".
|
| I'm not sure we agree on the definition of "constant time"
| here. Just because you get 1000 keys in one network call
| doesn't imply anything about the complexity of the backend!
| orf wrote:
| Constant time regardless of the number of objects in the
| bucket and regardless of the initial starting position of
| your list request.
| hobobaggins wrote:
| The technical implementation is indeed impressive that it
| operates more-or-less within constant time, but probably
| very few use cases actually fit that narrow window, so this
| technical strength is moot when it comes to actual usage.
|
| Since each request is dependent upon the position received
| in the last request, 1000 arbitrary keys on your 3rd or
| 1000th attempt doesn't really help unless you found your
| needle in the haystack in _that_ request (and in that case
| the rest of that 1000 key listing was wasted.)
| orf wrote:
| You're assuming you're paginating through all objects
| from start to finish.
|
| A request to list objects under "foo/" is a request to
| list all objects starting with "foo/", which is constant
| time regardless of the number of keys before it. Same
| applies for "foo/bar-", or any other list request for any
| given prefix. There are no directories on s3.
| nh2 wrote:
| The key difference between lexicographically keyed flat
| hierarchies, and directory-nested filesystem hierarchies,
| becomes clear based on this example:
|
|     dir1/a/000000
|     dir1/a/...
|     dir1/a/999999
|     dir1/b
|
| On a proper hierarchical file system with directories as
| tree interior nodes, `ls dir1/` needs to traverse and return
| only 2 entries ("a" and "b").
|
| A flat string-indexed KV store that only supports lexicographic
| order, without special handling of delimiters, needs to traverse
| 1 million dirents ("a/000000" through "a/999999") before arriving
| at "b".
|
| Thus, simple flat hierarchies are much slower at listing the
| contents of a single dir: O(all recursive children), vs.
| O(immediate children) on a "proper" filesystem.
|
| Lexicographic strings cannot model multi-level tree structures
| with the same complexities; this may give it the reputation of
| "listing files is slow".
|
| UNLESS you tell the listing algorithm what the delimiter
| character is (e.g. `/`). Then a lexicographical prefix tree can
| efficiently skip over all subtrees at the next `/`.
|
| Amazon S3 supports that, with the docs explicitly mentioning
| "skipping over and summarizing the (possibly millions of) keys
| nested at deeper levels" in the `CommonPrefixes` field:
| https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-...
|
| I have not tested whether Amazon's implementation actually saves
| the traversal (or whether it traverses and just returns fewer
| results), but I'd hope so.
| nh2 wrote:
| For completeness: the original post says:
|
|     S3 has no rename or move operation. Renaming is
|     CopyObject and then DeleteObject. CopyObject takes
|     linear time to the size of the file(s). This comes up
|     fairly often when someone has written a lot of files
|     to the wrong place - moving the files back is very slow.
|
| This is right:
|
| In a normal file system, renaming a directory is fast O(1),
| in S3 it's slow O(all recursive children).
|
| And Amazon S3 has not added a delimiter-based function to
| reduce its complexity, even though that would be easily
| possible in a lexicographic prefix tree (re-rooting the
| subtree).
|
| So here the original post has indeed found a case where S3 is
| much slower than a normal file system.
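|
| For what it's worth, this is roughly what the copy-then-delete
| dance looks like per object with the AWS SDK for Go v2. The
| bucket and key names are made up, and keys with special
| characters would also need URL-encoding in CopySource:
|
|     package sketch
|
|     import (
|         "context"
|         "log"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // rename "moves" one object: server-side copy, then delete.
|     // For a whole prefix you repeat this for every key under it.
|     func rename(ctx context.Context, client *s3.Client,
|         src, dst string) {
|         const bucket = "example-bucket"
|         _, err := client.CopyObject(ctx, &s3.CopyObjectInput{
|             Bucket:     aws.String(bucket),
|             CopySource: aws.String(bucket + "/" + src),
|             Key:        aws.String(dst),
|         })
|         if err != nil {
|             log.Fatal(err)
|         }
|         _, err = client.DeleteObject(ctx, &s3.DeleteObjectInput{
|             Bucket: aws.String(bucket),
|             Key:    aws.String(src),
|         })
|         if err != nil {
|             log.Fatal(err)
|         }
|     }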
| finalhacker wrote:
| S3 doesn't implement the VFS API, but you can treat it as a
| software-defined storage filesystem, just like Ceph.
|
| There are many applications that depend on file storage, such as
| MySQL, but horizontal scaling for those apps is still difficult
| in many cases. Replacing the VFS API with S3 storage is, in my
| experience, a growing trend.
| igtztorrero wrote:
| Check out kopia.io - it's backup software that uses S3 to store
| files as blocks or pages.
|
| You can browse, search and sort the files and directories of the
| different snapshots or versions of a file.
|
| I love it!
|
| For me it's a file system on S3.
|
| Bonus: you must use a key to encrypt the files.
| MatthiasPortzel wrote:
| This article was an epiphany for me because I realized I've been
| thinking of the Unix filesystem as if it has two functions:
| read_file and write_file. (And then getting frustrated with the
| filesystem APIs in programming languages.)
| markhahn wrote:
| So you came from an S3 or other put-get world, and found actual
| filesystems odd?
|
| I suppose that's not so different from a WMP user's epiphany
| when they discover processes, shells, etc.
| MatthiasPortzel wrote:
| Well I'm used to an application-level view of the file
| system.
|
| A document editor or text editor opens files and saves files,
| but these are whole-document operations. I can't open a
| document in Sublime Text without reading it, and I can't save
| part of a file without saving all of it. So it's not obvious
| that these would be different at an OS level.
|
| As the post points out, there are uses for Unix's sub-file-
| level read-and-write commands, but I've never needed them.
| arvindamirtaa wrote:
| Like Gmail is emails but not IMAP. It's fine. We have seen that
| these kinds of wrappers work pretty well most of the time
| considering the performance and simplicity they bring in building
| and managing these systems.
| chrisblackwell wrote:
| Random note: Has anyone noticed how fast the author's webpage is?
| I know it's static, but I mean it's fast even for the DNS lookup.
| I would love to know what they host it on.
| adverbly wrote:
| The response headers include
|
| server: cloudflare
|
| You said it though - the reason is that it's static without any
| js/frameworks/SPA round trip requests.
| overstay8930 wrote:
| Full stack Cloudflare is really fast
| wooptoo wrote:
| Could be using Cloudflare Pages hosted on an R2 bucket:
| https://pages.cloudflare.com/
| alphazard wrote:
| S3 is not even files, and definitely not a filesystem.
|
| The thing I would expect from a file abstraction is mutability. I
| should be able to edit pieces of a file, grow it, shrink it, read
| and write at random offsets. I shouldn't have to go back up to
| the root, or a higher level concept once I have the file in hand.
| S3 provides a mutable listing of immutable objects; if I want to
| do any of the mutability business, I need to make a copy and re-
| upload. As originally conceived, the file abstraction finds some
| sectors on disk, and presents them to the client as a contiguous
| buffer. S3 solves a different problem.
|
| Many people misinterpret the Good Idea from UNIX "everything is a
| file" to mean that everything should look like a contiguous
| virtual buffer. That's not what the real Good Idea is. Really:
| everything can be listed in a directory, including directories.
| There will be base leaves, which could be files, or any object
| the system wants to present to a process, and there will be
| recursive trees (which are directories). The directories are what
| make the filesystem, not the type of a particular leaf. Adding a
| new type of leaf, like a socket or a frame buffer, or whatever,
| is almost boring, and doesn't erode the integrity of the real
| good idea. Adding a different kind of container like a list,
| would make the structure of the filesystem more complex, and that
| _would_ erode the conceptual integrity.
|
| S3 doesn't do any of these things, and that's fine. I just want a
| place to put things that won't fit in the database, and know they
| won't bitrot when I'm not looking. The desire to make S3 look
| more like a filesystem comes from client misunderstanding of what
| it's good at/for, and poor product management indulging that
| misunderstanding instead of guarding the system from it.
| akerl_ wrote:
| How do read-only filesystems align with your definition?
| yencabulator wrote:
| You can't create new things on a read-only filesystem, you
| can in S3; not a good analogy.
| thinkharderdev wrote:
| > S3 is not even files, and definitely not a filesystem.
|
| I agree. To me the correct analog for S3 is a block storage
| device (a very weird one where blocks can be any size and can
| have a key associated with them) and not a filesystem. A
| filesystem is an abstraction that sits on top of a block
| storage device and so an "S3 filesystem" would have to be an
| abstraction that sits on top of S3 as the underlying block
| storage.
| sbussard wrote:
| It's been a while, but I really like the way google handles its
| file system internally. No confusion.
| remram wrote:
| I am currently pondering this exact problem. I want to run a
| file-sharing web application (think: NextCloud) but I don't want
| to use expensive block storage or the dedicated server's disk
| space for the files, as some of them will be accessed
| infrequently.
|
| I am wondering if s3fs/rclone-mount is sufficient, or if I should
| use something like JuiceFS that adds random-access, renaming, etc
| on top of it. Are those really necessary APIs for my use case? Is
| there only one way to find out?
|
| (The app doesn't have native S3 support)
| cuno wrote:
| It depends on if you want to expose filesystem semantics or
| metadata to applications using it. For example random access
| writes are done by ffmpeg, which is a workhorse of the media
| industry, but most things can't handle that or are too slow. We
| had to build our own solution cunoFS to make it work properly
| at high speeds.
| jkoudys wrote:
| I absolutely loved this article. Super well written with
| interesting insights.
| donatj wrote:
| > And listing files is slow. While the joy of Amazon S3 is that
| you can read and write at extremely, extremely, high bandwidths,
| listing out what is there is much much slower. Slower than a slow
| local filesystem.
|
| I was taken aback by this recently. At my coworker's request, I
| was putting some work into a script we have to manage assets in
| S3. It has a cache for the file listing, and my coworker who
| wrote it sent me his pre-populated cache. My initial thought was
| "this can't really be necessary" and started poking.
|
| We have ~100,000 root level directories for our individual
| assets. Each of those have five or six directories with a handful
| of files. Probably less than a million files total, maybe 3
| levels deep at its deepest.
|
| Recursively listing these files takes literally fifteen minutes.
| I poked and prodded suggestions from stack overflow and ChatGPT
| at potential ways to speed up the process and got nothing
| notable. That's absurdly slow. Why on earth is it so slow?
|
| Why is this something Amazon has not fixed? From the outside it
| really seems like they could slap some B-trees on the individual
| buckets and call it a day.
|
| If it is a difficult problem, I'm sure it would be for
| fascinating reasons I'd love to hear about.
| returningfory2 wrote:
| Are you performing list calls sequentially? If you have O(100k)
| directories and are doing O(100k) requests sequentially, 15
| minutes works out at O(10ms) per request which doesn't seem
| that bad? (assuming my math is correct...)
| luhn wrote:
| At risk of being pedantic, you seem to be using big O to mean
| "approximately" or "in the order of", but that's not what it
| means at all. Big O is an expression of the growth rate of a
| function. Any constant value has a growth rate of 0, so
| O(100k) isn't meaningful: It's exactly the same as O(1).
| anonymous-panda wrote:
| I think it's far more mundane a reason. You can list 10k
| objects per request and getting the next 10k requires the
| result of the previous request, so it's all serial. That means
| to list 1M files, you're looking at 100 back to back requests.
| Assuming a ping time of 50ms, that's easily 5s of just going
| back and forth, not including the cost of doing the listing
| itself on a flat iteration. The cost of a 10k item list is
| about the cost of a write which is kinda slow. Additionally, I
| suspect each listing is a strongly consistent snapshot which
| adds to the cost of the operation (it can be hard to provide an
| inconsistent view).
|
| I don't think btrees would help unless you're doing directory
| traversals, but even then I suspect that's not that beneficial
| as your bottleneck is going to be the network operations and
| exposed operations. Ultimately, file listing isn't that
| critical a use case and typically most use cases are
| accomplished through things like object lifecycles where you
| tell S3 what you want done and it does it efficiently at the FS
| layer for you.
| tsimionescu wrote:
| That's 5s of a 15m duration. I don't think it matters in the
| least.
| anonymous-panda wrote:
| Depends how you're iterating. If you're iterating by
| hierarchy level, then you could easily see this being
| several orders of magnitude more requests.
| catlifeonmars wrote:
| S3 is fundamentally a key value store. The fact that you can
| view objects in "directories" is nothing more than a prefix
| filter. It is not a file system and has no concept of
| directories.
| Spivak wrote:
| If I wanted to use S3 as a filesystem in the manner people
| are describing I would probably start looking at storing
| filesystem metadata in a sidecar database so you can get
| directory listings, permissions bits, xattrs and only have to
| round-trip to S3 when you need the content.
| SOLAR_FIELDS wrote:
| Isn't this essentially what systems like Minio and
| SeaweedFS do with their S3 integrations/mirroring/caching?
| What you describe sounds a lot like SeaweedFS Filer when
| backed by S3
| anonymous-panda wrote:
| Directories make up a hierarchical filesystem, but it's not a
| necessary condition. A filesystem at its core is just a way
| of organizing files. If you're storing and organizing files
| in s3 then it's a filesystem for you. Saying it's
| "fundamentally a key value store" like it's something
| different is confusing because a filesystem is just a key
| value store of path to contents of file.
|
| Indeed, there's every reason to believe that a modern file
| system would perform significantly faster if the hierarchy
| were implemented as a prefix filter rather than by actually
| maintaining hierarchical data structures (at least for most
| operations). You can guess that this might be the case from
| the fact that file creation is extremely slow on modern file
| systems (on the order of hundreds or maybe thousands per
| second on a modern NVMe disk that can otherwise do millions
| of IOPS), and listing the contents of an extremely large
| directory is exceedingly slow.
| senderista wrote:
| A real hierarchy makes global constraints easier to scale,
| e.g. globally unique names or hierarchical access controls.
| These policies only need to scale to a single node rather
| than to the whole namespace (via some sort of global
| index).
| catlifeonmars wrote:
| In context of the comment I was addressing, it's clear that
| filesystem means more than just a key value store. I'd
| argue that this is generally true in common vernacular.
| anonymous-panda wrote:
| This is a technical website discussing the nuances of
| filesystems. Common vernacular is how you choose to
| define it but even the Wikipedia definition says that
| directories and hierarchy are just one property of some
| filesystems. That they became the dominant model on local
| machines doesn't take away from the more general
| definition that can describe distributed filesystems.
| jamesrat wrote:
| I implemented a solution by threading the listing: get the
| files in the root, then spin up a separate process to do the
| recursion for each directory.
| perryizgr8 wrote:
| It's not a good model to think of S3 as having directories in
| a bucket. It's all objects. The web interface has a visual way
| of representing prefixes separated by slashes. But that's just
| a nice way to present the objects. Each object has a key, and
| that key can contain slashes, and you can think of each segment
| to be a directory for your ease of mind.
|
| But that illusion breaks when you try to do operations you
| usually do with/on directories.
| electroly wrote:
| The way that you said "recursively" and spent a lot of time
| describing "directories" and "levels" worries me. The fastest
| way to list objects in S3 wouldn't involve recursion at all;
| you just list all objects under a prefix. If you're using the
| path delimiter to pretend that S3 keys are a folder structure
| (they're not) and go "folder by folder", it's going to be way
| slower. When calling ListObjectsV2, make sure you are NOT
| passing "delimiter". The "directories" and "levels" have no
| impact on performance when you're not using the delimiter
| functionality. Split the one list operation into multiple
| parallel lists on separate prefixes to attain any total time
| goal you'd like.
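|
| A sketch of that flat listing with the Go v2 SDK's paginator is
| below; the bucket and prefix are placeholders. Note there is
| deliberately no Delimiter, so the pretend folder structure costs
| nothing:
|
|     package sketch
|
|     import (
|         "context"
|         "fmt"
|         "log"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     // countAssets pages through every key under one prefix, flat.
|     func countAssets(ctx context.Context, client *s3.Client) {
|         p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
|             Bucket: aws.String("example-bucket"),
|             Prefix: aws.String("assets/"),
|             // no Delimiter: don't simulate folders, just stream keys
|         })
|         total := 0
|         for p.HasMorePages() {
|             page, err := p.NextPage(ctx)
|             if err != nil {
|                 log.Fatal(err)
|             }
|             total += len(page.Contents)
|         }
|         fmt.Println("objects:", total)
|     }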
| petters wrote:
| Yes, this is very good advice and will likely solve their
| problem
| blakesley wrote:
| All these comments saying merely "S3 has no concept of
| directories" without an explanation (or at least a link to an
| explanation) are pretty unhelpful, IMO. I dismissed your
| comment, but then I came upon this later one explaining why:
| https://news.ycombinator.com/item?id=39660445
|
| After reading that, I now understand your comment.
| electroly wrote:
| I appreciate you sharing that point of view. There's a
| "curse of knowledge" effect with AWS where its card-
| carrying proponents (myself included) lose perspective on
| how complex it actually is.
| jameshart wrote:
| A fun corollary of this issue:
|
| _Deleting_ an S3 bucket is nontrivial!
|
| You can't delete a bucket with objects in it. And you can't
| just tell S3 to delete all the objects. You need to send
| individual API requests to S3 to delete each object. Which
| means sending requests to S3 to list out the objects, 1000 at a
| time. Which takes time. And those list calls cost money to
| execute.
|
| This is a good summary of the situation:
| https://cloudcasts.io/article/deleting-an-s3-bucket-costs-mo...
|
| The fastest way to quickly dispose of an S3 bucket turns out to
| be to _delete the AWS account it belongs to_.
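|
| For a sense of what sending those list-and-delete requests, 1000
| keys at a time, means in code, a rough sketch with the AWS SDK
| for Go v2 is below. The bucket name is made up, a versioned
| bucket would also need its versions and delete markers removed,
| and the lifecycle approach in the reply below is usually the
| cheaper way:
|
|     package sketch
|
|     import (
|         "context"
|         "log"
|
|         "github.com/aws/aws-sdk-go-v2/aws"
|         "github.com/aws/aws-sdk-go-v2/service/s3"
|         "github.com/aws/aws-sdk-go-v2/service/s3/types"
|     )
|
|     // emptyBucket lists keys a page (up to 1000) at a time and
|     // batch-deletes each page before the bucket can be deleted.
|     func emptyBucket(ctx context.Context, client *s3.Client) {
|         const bucket = "doomed-bucket"
|         p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{
|             Bucket: aws.String(bucket),
|         })
|         for p.HasMorePages() {
|             page, err := p.NextPage(ctx)
|             if err != nil {
|                 log.Fatal(err)
|             }
|             if len(page.Contents) == 0 {
|                 break
|             }
|             ids := make([]types.ObjectIdentifier, 0, len(page.Contents))
|             for _, obj := range page.Contents {
|                 ids = append(ids, types.ObjectIdentifier{Key: obj.Key})
|             }
|             // DeleteObjects accepts at most 1000 keys, which is
|             // also the maximum page size of the listing.
|             _, err = client.DeleteObjects(ctx, &s3.DeleteObjectsInput{
|                 Bucket: aws.String(bucket),
|                 Delete: &types.Delete{Objects: ids},
|             })
|             if err != nil {
|                 log.Fatal(err)
|             }
|         }
|     }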
| electroly wrote:
| No, don't do that. Set up a lifecycle rule that expires all
| of the objects and wait 24 hours. You won't pay for API calls
| and even the cost of storing the objects themselves is waived
| once they are marked for expiration.
|
| The article has a mistake about this too: expirations do NOT
| count as lifecycle transitions and you don't get charged as
| such. You will, of course, get charged if you prematurely
| delete objects that are in a storage class with a minimum
| storage duration that they haven't reached yet. This is what
| they're actually talking about when they mention Infrequent
| Access and other lower tiers.
| jameshart wrote:
| Still counts as nontrivial.
| electroly wrote:
| This is really easy; much easier than trying to delete
| them by hand. AWS does all the work for you. It takes
| longer to log into the AWS Management Console than it
| does to set up this lifecycle rule.
| breckognize wrote:
| > I haven't heard of people having problems [with S3's
| Durability] but equally: I've never seen these claims tested. I
| am at least a bit curious about these claims.
|
| Believe the hype. S3's durability is industry leading and
| traditional file systems don't compare. It's not just the
| software - it's the physical infrastructure and safety culture.
|
| AWS' availability zone isolation is better than the other cloud
| providers. When I worked at S3, customers would beat us up over
| pricing compared to GCP blob storage, but the comparison was
| unfair because Google would store your data in the same building
| (or maybe different rooms of the same building) - not with the
| separation AWS did.
|
| The entire organization was unbelievably paranoid about data
| integrity (checksum all the things) and bigger events like
| natural disasters. S3 even operates at a scale where we could
| detect "bitrot" - random bit flips caused by gamma rays hitting a
| hard drive platter (roughly one per second across trillions of
| objects iirc). We even measured failure rates by hard drive
| vendor/vintage to minimize the chance of data loss if a batch of
| disks went bad.
|
| I wouldn't store critical data anywhere else.
|
| Source: I wrote the S3 placement system.
| tracerbulletx wrote:
| My first job was at a startup in 2012 where I was expected to
| build things at a scale way over what I really had the
| experience to do. Anyways the best choice I ever made was using
| RDS and S3 (and django).
| supriyo-biswas wrote:
| Checksumming the data is not done out of paranoia but simply
| as a result of having to detect which blocks are unusable in
| order to run the Reed-Solomon algorithm.
|
| I'd also assume that a sufficient number of these corruption
| events are used as a signal to "heal" the system by migrating
| the individual data blocks onto different machines.
|
| Overall, I'd say the things that you mentioned are pretty
| typical of a storage system, and are not at all specific to S3
| :)
| catlifeonmars wrote:
| The S3 checksum feature applies to the objects, so that's
| entirely orthogonal to erasure codes. Unless you know
| something I don't and SHA256 has commutative properties.
| You'd still need to compute the object hash independent of
| any blocks.
|
| Source: https://docs.aws.amazon.com/AmazonS3/latest/userguide
| /checki...
| staunch wrote:
| > _Believe the hype._
|
| I'd rather believe the test results.
|
| Is there a neutral third-party that has validated S3's
| durability/integrity/consistency? Something as rigorous as
| Jepsen?
|
| It'd be really neat if someone compared all the S3 compatible
| cloud storage systems in a really rigorous way. I'm sure we'd
| discover that there are huge scary problems. Or maybe someone
| already has?
| rsync wrote:
| "AWS' availability zone isolation is better than the other
| cloud providers."
|
| Not better than _all_ of them.
|
| A geo-redundant rsync.net account exists in two different
| states (or countries) - for instance, primary in Fremont[1] and
| secondary in Denver.
|
| "S3 even operates at a scale where we could detect "bitrot""
|
| That is not a function of scale. My personal server running ZFS
| detects bitrot just fine - and the scale involved is tiny.
|
| [1] he.net headquarters
| Helmut10001 wrote:
| Agree.
|
| > S3 even operates at a scale where we could detect "bitrot"
| - random bit flips caused by gamma rays hitting a hard drive
| platter (roughly one per second across trillions of objects
| iirc).
|
| I would expect any cloud provider to be able to detect bitrot
| these days.
| senderista wrote:
| I think the point the OP was trying to make is that they
| _regularly detected_ bitrot due to their scale, not that
| they were merely _capable_ of doing so.
| Helmut10001 wrote:
| Ah, thank you. This makes more sense. And I think I
| remember reading about it once. Apologies for the
| misinterpretation!
| pclmulqdq wrote:
| Everyone with significant scale and decent software
| regularly detects bitrot.
| breckognize wrote:
| Backing up across two different regions is possible for any
| provider with two "regions" but requires either doubling your
| storage footprint or accepting a latency hit because you have
| to make a roundtrip from Fremont to Denver.
|
| The neat thing about AWS' AZ architecture is that it's a
| sweet spot in the middle. They're far enough apart for good
| isolation, which provides durability and availability, but
| close enough that the network round trip time is negligible
| compared to the disk seek.
|
| Re: bit rot, I mean the frequency of events. If you've got a
| few disks, you may see one flip every couple years. They
| happen frequently enough in S3 that you can have expectations
| about the arrival rate and alarm when that deviates from
| expectations.
| logifail wrote:
| > The neat thing about AWS' AZ architecture is that it's a
| sweet spot in the middle
|
| What may be less of a sweet spot is AWS' pricing.
| emodendroket wrote:
| Sending the data to /dev/null is the cheapest option if
| that's all you care about.
| logifail wrote:
| Seems the snark detector just went off :)
|
| Back on topic, I'd hope all of us would expect value for
| money for any and all services we recommend or purchase.
| Search for "site:news.ycombinator.com Away From AWS" to
| find dozens of discussions on how to save money by
| leaving AWS.
|
| EDIT: just one article of the many I've read recently:
|
| "What I've always found surprising about egress is just
| how expensive it is. On AWS, downloading a file from S3
| to your computer once costs 4 times more than storing it
| for an entire month"
|
| https://robaboukhalil.medium.com/youre-paying-too-much-
| for-e...
| senderista wrote:
| > the network round trip time is negligible compared to the
| disk seek
|
| Only for spinning rust, right?
| breckognize wrote:
| Yes, which is what all the hyperscalers use for object
| storage. HDD seek time is ~10ms. Inter-AZ network latency
| is a few hundred microseconds.
| alexchamberlain wrote:
| > They're far enough apart for good isolation, which
| provides durability and availability
|
| It can't possibly be enough for critical data though,
| right? I'm guessing a fire in 1 is unlikely to spread to
| another, but could it affect the availability of another?
| What about a deliberate attack on the DCs or the utilities
| supplying the DCs?
| immibis wrote:
| Yes, if a terrorist blows up all of the several Amazon
| DCs holding your data, your data will be lost. This is
| true no matter how many DCs are holding your data, who
| owns them, or where they are. You can improve your
| chances, of course.
|
| There have been region-wide availability outages before.
| They're pretty rare and make worldwide news media due to
| how much of the internet they take out. I don't think
| there's been S3 data loss since they got serious about
| preventing S3 data loss.
| allset_ wrote:
| FWIW, both AWS S3 and GCP GCS also allow you to store data in
| multi-region.
|
| https://docs.aws.amazon.com/AmazonS3/latest/userguide/MultiR.
| ..
|
| https://cloud.google.com/storage/docs/locations#consideratio.
| ..
| andrewguenther wrote:
| Yes, but S3 has single-region redundancy that is better
| than GCP's. Your data in two AZs in one region is in two
| physically separate buildings, so multi-region is less
| important for durability.
| mannyv wrote:
| How does the latest ZFS bug impact your bitrot statement?
|
| I mean, technically it's not bitrot if zeros were
| accidentally written out instead of data.
| woodada wrote:
| Probably none, because they didn't update to the exact
| version that had the bug.
| medler wrote:
| > customers would beat us up over pricing compared to GCP blob
| storage, but the comparison was unfair because Google would
| store your data in the same building
|
| I don't think this is true. Per the Google Cloud Storage docs,
| data is replicated across multiple zones, and each zone maps to
| a different cluster.
| https://cloud.google.com/compute/docs/regions-zones/zone-vir...
| singron wrote:
| Google puts multiple clusters in a single building.
| medler wrote:
| Seems you're right. They say each zone is a separate
| failure domain but you kind of have to trust their word on
| that.
| navaati wrote:
| Flashback to that Clichy datacenter fire near Paris...
| yencabulator wrote:
| Zones are about correlated power and networking failures.
| Regions are about disasters. If you want multiple regions,
| Google can of course do that too:
|
| https://cloud.google.com/storage/docs/locations#consideratio.
| ..
| treflop wrote:
| What's your experience like at other storage outfits?
|
| I only ask because your post is a bit like singing praises for
| Cinnabon that they make their own dough.
|
| The things that you mentioned are standard storage company
| activities.
|
| Checksum-all-the-things is a basic feature of a lot of file
| systems. If you can already set up your home computer to detect
| bitrot and alert you, you can bet big storage vendors do it.
|
| Keeping track of hard drive failure rates by vendor is normal.
| Storage companies publicly publish their own reports. The tiny
| 6-person IT operation I was in had a spreadsheet. Hell, I
| toured a friend's friend's major data center last year and he
| managed to find time to talk hard drive vendors. Now you. I get
| it -- y'all make spreadsheets.
|
| There are a lot of smart people working on storage outside AWS
| and long before AWS existed.
| pclmulqdq wrote:
| When I worked at Google in storage, we had our own figures of
| merit that showed that we were the best and Amazon's
| durability was trash in comparison to us.
|
| As far as I can tell, every cloud provider's object store is
| too durable to actually measure ("14 9's"), and it's not a
| problem.
| breckognize wrote:
| 9's are overblown. When cloud providers report that,
| they're really saying "Assuming random hard drive failure
| at the rates we've historically measured and how quickly we
| detect and fix those failures, what's the mean time to
| data loss".
|
| But that's burying the lede. By far the greatest risks to a
| file's durability are:
|
| 1. Bugs (which aren't captured by a durability model). This
| is mitigated by deploying slowly and having good isolation
| between regions.
|
| 2. An act of God that wipes out a facility.
|
| The point of my comment was that it's not just about
| checksums. That's table stakes. The main driver of data
| loss for storage organizations with competent software is
| safety culture and physical infrastructure.
|
| My experience was that S3's safety culture is outstanding.
| In terms of physical separation and how "solid" the AZs
| are, AWS is overbuilt compared to the other players.
| pclmulqdq wrote:
| That was not how we treated the 9's at Google. Those had
| been tested through natural experiments (disasters).
|
| I was not at Google for the Clichy fire, but it wasn't
| the first datacenter fire Google experienced. I think
| your information about Google's data placement may be
| incorrect, or you may be mapping AWS concepts onto Google
| internal infrastructure in the wrong way.
| fsociety wrote:
| I would not lose sleep over storing data on GCS, but have
| heard from several Google Cloud folks that their concept
| of zones is a mirage at best.
| pclmulqdq wrote:
| Yeah, that's definitely true. Google sort of mapped an
| AWS concept onto its own cluster splits. However, there
| are enough regional-scale outages at all the major clouds
| that I don't personally place much stock in the idea of
| zones to begin with. The only way to get close to true
| 24/7 five-9's uptime with clouds is to be multi-region
| (and preferably multi-cloud).
| breckognize wrote:
| Do you mean Google included "acts of God" when computing
| 9's? That's definitely not right.
|
| 11 9's of durability means mean time to data loss of 100
| billion years. Nothing on earth is 11 9's durable in the
| face of natural (or man-made) disasters. The earth is
| only 4.5 billion years old.
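|
| For concreteness, the arithmetic behind that claim (per object,
| treating loss as a constant-rate process):
|
|     package main
|
|     import (
|       "fmt"
|       "math"
|     )
|
|     func main() {
|       nines := 11.0
|       // Annual probability of losing a given object.
|       pLoss := math.Pow(10, -nines) // 1e-11
|       // Mean time to data loss is the reciprocal of that rate.
|       mttdlYears := 1 / pLoss
|       fmt.Printf("MTTDL ~ %.0f years\n", mttdlYears) // ~1e11
|     }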
| pclmulqdq wrote:
| Normally, companies store more than 1 byte of data, and
| the 9's (not just for data loss, for everything) are
| ensemble averages.
|
| By the way, I don't doubt that AWS has plenty of 9's by
| that metric - perhaps more than GCP.
| jftuga wrote:
| If I were to upload a 50kb object to S3 (standard tier),
| about how many unique physical copies would exist?
| cyberax wrote:
| At least 3.
| fierro wrote:
| it's well known and not debatable that Cinnabon is fire
| FooBarWidget wrote:
| "Checksum-all-the-things is a basic feature of a lot of file
| systems"
|
| "A lot"? Does anything but ZFS and maybe btrfs do this? Ext4
| and XFS -- two very common filesystems -- still don't have
| data checksums.
| Filligree wrote:
| Bcachefs does, and LVM also has a way to do it.
|
| Unfortunately I'm not aware of any filesystem that does it
| while maintaining the full bandwidth of a modern NVMe. Not
| even with the extra reads factored in; on ZFS I get 800
| MB/s max.
| 4death4 wrote:
| This was a few years ago, but blob storage on GCP had a
| global outage due to an outage in a single zone. That, among
| numerous other issues with GCP, lost my confidence entirely.
| Maybe it's better now.
| loeg wrote:
| Not a public cloud, but storage at Facebook is similar in terms
| of physical infrastructure, safety culture, and scale.
| spintin wrote:
| Correct me if I'm wrong but bitrot only affects spinning rust
| since NAND uses ECC?
|
| If you see this I wonder if S3 is planning on adding hardlinks?
| sgtnoodle wrote:
| Pretty much any modern storage medium depends on a healthy
| amount of error correcting code.
| surajrmal wrote:
| Nand is constantly moving around your data to prevent it from
| bit rotting. If you leave data too long without moving it,
| you may not be able to read the data from the nand.
| Veserv wrote:
| But they asked if the claims were audited by an unbiased third
| party. Are there such audits?
|
| Alternatively, AWS does publicly provide legally binding
| availability guarantees, but I have never seen any prominently
| displayed legally binding durability guarantees. Are these
| published somewhere less prominently?
| cyberax wrote:
| > Alternatively, AWS does publicly provide legally binding
| availability guarantees, but I have never seen any
| prominently displayed legally binding durability guarantees.
| Are these published somewhere less prominently?
|
| It's listed prominently in the public docs:
| https://aws.amazon.com/s3/storage-classes/
| chupasaurus wrote:
| > and bigger events like natural disasters
|
| Outdated anecdata: I've worked for a company that lost some
| parts of buckets after the lightning strike incident in 2011,
| which bumped the paranoia quite a bit. AFAIK the same thing
| hasn't been possible for more than a decade.
| svat wrote:
| It's nice to see Ousterhout's idea of module depth (the main idea
| from his _A Philosophy of Software Design_ ) getting more
| mainstream -- mentioned in this article with attribution only in
| "Other notes", which suggests the author found it natural enough
| not to require elaboration. Being obvious-in-hindsight like this
| is a sign of a good idea. :-)
|
| > _The concept of deep vs shallow modules comes from John
| Ousterhout 's excellent book. The book is [effectively] a list of
| ideas on software design. Some are real hits with me, others not,
| but well worth reading overall. Praise for making it succinct._
| ahepp wrote:
| Are filesystems the correct abstraction to build databases on?
| Isn't a filesystem a database in a way? Is there a reason to
| build a database on top of a filesystem abstraction rather than a
| block abstraction?
|
| To say you can't build an efficient database on top of S3 makes
| sense to me. S3 is already a certain kind of data-storing
| abstraction optimized for certain usages. If you try and build
| another data-storing abstraction optimized for incompatible
| usages on top of that, you are going to have a difficult time.
| d0gsg0w00f wrote:
| In my $dayjob as cloud architect I sometimes suggest S3 as an
| alternative to pulling massive JSON blobs from RDS
| Postgres/Redis etc. As long as their latency tolerance is
| high enough, there's no reason you can't.
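|
| For what that looks like in practice, a rough sketch with the
| aws-sdk-go-v2 packages (bucket, key and struct are hypothetical):
| the application fetches the big JSON blob by key instead of
| selecting it out of Postgres.
|
|     package main
|
|     import (
|       "context"
|       "encoding/json"
|       "log"
|
|       "github.com/aws/aws-sdk-go-v2/aws"
|       "github.com/aws/aws-sdk-go-v2/config"
|       "github.com/aws/aws-sdk-go-v2/service/s3"
|     )
|
|     type report struct {
|       ID    string          `json:"id"`
|       Stats json.RawMessage `json:"stats"`
|     }
|
|     func main() {
|       ctx := context.Background()
|       cfg, err := config.LoadDefaultConfig(ctx)
|       if err != nil {
|         log.Fatal(err)
|       }
|       client := s3.NewFromConfig(cfg)
|
|       // Fetch the blob by key; names are made up for the example.
|       out, err := client.GetObject(ctx, &s3.GetObjectInput{
|         Bucket: aws.String("my-report-blobs"),
|         Key:    aws.String("reports/2024/03/big.json"),
|       })
|       if err != nil {
|         log.Fatal(err)
|       }
|       defer out.Body.Close()
|
|       var r report
|       if err := json.NewDecoder(out.Body).Decode(&r); err != nil {
|         log.Fatal(err)
|       }
|       log.Printf("loaded report %s", r.ID)
|     }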
| jandrewrogers wrote:
| The traditional POSIX filesystem is the wrong abstraction for a
| database, but not filesystems per se. All databases that care
| about performance and scalability implement their own
| filesystems, either directly against raw block devices or as an
| overlay on top of a POSIX filesystem that bypasses some of its
| limitations. The performance and scalability gains by doing so
| are not small.
|
| The issue with POSIX filesystems is that they are required to
| make a set of tradeoffs to support features a database engine
| doesn't need, to the significant detriment of scalability and
| performance in areas that databases care about a lot. For
| example, one such database filesystem I've used occasionally
| over the years, while a bit dated at this point, is designed
| such that you can have tens of millions of files in a single
| directory where you are creating and destroying tens of
| thousands of files every second, on upwards of a petabyte of
| storage. Very far from being POSIX compatible but you don't get
| anything like that type of scalability on POSIX.
|
| Object storage is far from ideal as database storage. The
| biggest issue, though, is the terrible storage bandwidth
| available in the cloud. It is a small fraction of what is
| available in a normal server and modern database engines are
| capable of fully exploiting a large JBOD of NVMe.
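|
| As a flavor of what "directly against raw block devices" means, a
| minimal Linux-only sketch (not how S3 itself works): open with
| O_DIRECT and write a page-aligned buffer, bypassing the kernel's
| page cache; the engine supplies its own caching and layout.
|
|     package main
|
|     import (
|       "fmt"
|       "os"
|       "syscall"
|       "unsafe"
|     )
|
|     const blockSize = 4096
|
|     // alignedBuf returns a blockSize-aligned slice; O_DIRECT
|     // requires aligned buffer addresses and lengths.
|     func alignedBuf(n int) []byte {
|       raw := make([]byte, n+blockSize)
|       off := 0
|       addr := uintptr(unsafe.Pointer(&raw[0]))
|       if r := addr % blockSize; r != 0 {
|         off = int(blockSize - r)
|       }
|       return raw[off : off+n]
|     }
|
|     func main() {
|       // A scratch file stands in for a device like /dev/nvme0n1.
|       f, err := os.OpenFile("scratch.dat",
|         os.O_RDWR|os.O_CREATE|syscall.O_DIRECT, 0o600)
|       if err != nil {
|         panic(err)
|       }
|       defer f.Close()
|
|       page := alignedBuf(blockSize)
|       copy(page, []byte("page header goes here"))
|       if _, err := f.WriteAt(page, 0); err != nil {
|         panic(err)
|       }
|       fmt.Println("wrote one 4 KiB page with O_DIRECT")
|     }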
| d-z-m wrote:
| > S3 is a cloud filesystem, not an object-whatever. [...]I think
| the idea that S3 is really "Amazon Cloud Filesystem" is a bit of
| a load bearing fiction.
|
| Does anyone actually think this? I have never encountered anyone
| who has described S3 in these terms.
| teaearlgraycold wrote:
| Not sure if the author is aware of EFS
| chubot wrote:
| > Filesystem software, especially databases, can't be ported to
| Amazon S3
|
| This seems mistaken. Porting databases that run on local disk to
| S3 seems like a good way to get a lashing from https://aphyr.com/
|
| Can any databases do it correctly?
|
| If so, I doubt they work with the model of partial overwrites.
| They probably have to do something very custom, and either
| sacrifice a lot of tail latency, or their uptime is capped by the
| uptime of a single AWS availability zone. Doesn't seem like a
| great design.
|
| (copy of lobste.rs comment)
| est31 wrote:
| My employer (Neon) offers Postgres databases that run on top of
| a couple of caching layers at the end of which there is S3:
| https://neon.tech/docs/introduction/architecture-overview
|
| Directly exposing every write to S3 would give you the partial
| overwrite issues described in the article. So instead, recent
| writes in the Postgres WAL are held outside of S3 in a
| replicated on-disk cache, and state is pushed to S3 once it
| reaches a size threshold.
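|
| The general pattern (names hypothetical, not Neon's actual code)
| is easy to sketch: append WAL records to a buffer whose
| durability comes from replication, and ship a segment to object
| storage once it crosses a size threshold.
|
|     package walbuf
|
|     import (
|       "bytes"
|       "fmt"
|     )
|
|     // Uploader abstracts "put this segment in object storage" so
|     // the sketch stays independent of any particular S3 client.
|     type Uploader interface {
|       Put(key string, body []byte) error
|     }
|
|     type WALBuffer struct {
|       buf       bytes.Buffer
|       threshold int
|       segment   int
|       store     Uploader
|     }
|
|     func (w *WALBuffer) Append(record []byte) error {
|       w.buf.Write(record)
|       if w.buf.Len() < w.threshold {
|         return nil // keep buffering; replicas hold the recent WAL
|       }
|       key := fmt.Sprintf("wal/segment-%06d", w.segment)
|       if err := w.store.Put(key, w.buf.Bytes()); err != nil {
|         return err
|       }
|       w.segment++
|       w.buf.Reset()
|       return nil
|     }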
| chubot wrote:
| Thanks for the link.
|
| But I searched the docs for "durability" and got zero
| results. Before I use anything like this, I'd like to see
| what durability settings are used:
|
| https://www.postgresql.org/docs/current/non-durability.html
|
| Litestream documents their data loss window; it seems
| like Neon should too:
|
| https://litestream.io/tips/
|
| _By default, Litestream will replicate new changes to an S3
| replica every second. During this time where data has not yet
| been replicated, a catastrophic crash on your server will
| result in the loss of data in that time window._
|
| I also searched for "data loss" and got zero results -- this
| is important because Neon is almost certainly sacrificing
| durability for performance.
| yencabulator wrote:
| Neon handles that by staging the WAL segments on 3x
| replicated Safekeeper nodes. Durability relies on not
| having all of those blow up at the same time. I'd expect it
| to be much safer than traditional Postgres replication
| mechanisms (with the trade-off having a comparatively large
| minimum node count; Neon really is built for multitenancy
| where that cost can be amortized across lots of databases).
| est31 wrote:
| > I searched the docs for "durability" and got zero
| results.
|
| The link I gave above explains it, in the sentence with
| "durability":
|
| > Safekeepers are responsible for durability of recent
| updates. Postgres streams Write-Ahead Log (WAL) to the
| Safekeepers, and the Safekeepers store the WAL durably
| until it has been processed by the Pageservers and uploaded
| to cloud storage.
|
| > Safekeepers can be thought of as an ultra reliable write
| buffer that holds the latest data until it is processed and
| uploaded to cloud storage. Safekeepers implement the Paxos
| protocol for reliability.
| BirAdam wrote:
| Underneath the software, there's still a filesystem with files.
|
| If you stand up an S3 instance with Ceph, you still have a
| filesystem on spinning rust or fancy SSDs. There's just a bunch
| of stuff on top of that. It's cool, but saying "there's no
| filesystem" only describes what the customer or middleman
| sees, not what is actually happening.
| seabrookmx wrote:
| S3 actually uses a completely custom system[1] for writing
| bytes to disk. I haven't seen much in the way of details on the
| on-disk format but I certainly wouldn't assume it resembles a
| normal filesystem.
|
| [1]: https://aws.amazon.com/blogs/storage/how-automated-
| reasoning...
| aseipp wrote:
| No there isn't. AWS does not use the traditional filesystem
| layer to store data; that would be a massive mistake from a
| performance and reliability POV. The POSIX filesystem
| specification is notoriously vague about things like fsync
| consistency in particular scenarios (e.g. "do I need to fsync
| the parent directory before or after fsyncing the contents?";
| see the sketch below) and has many bizarre performance cliffs
| if you aren't careful. At the scale AWS operates at, even a
| 10% performance cliff or performance delta would be worth
| clawing back if it meant removing the POSIX filesystem.
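|
| For reference, the belt-and-braces dance portable software does
| because of that ambiguity, as a small Go sketch: fsync the file,
| then fsync its parent directory so the new entry itself is
| durable.
|
|     package main
|
|     import (
|       "os"
|       "path/filepath"
|     )
|
|     func writeDurably(path string, data []byte) error {
|       f, err := os.OpenFile(path,
|         os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
|       if err != nil {
|         return err
|       }
|       if _, err := f.Write(data); err != nil {
|         f.Close()
|         return err
|       }
|       if err := f.Sync(); err != nil { // flush file contents
|         f.Close()
|         return err
|       }
|       if err := f.Close(); err != nil {
|         return err
|       }
|       // Fsync the directory so the entry survives a crash.
|       dir, err := os.Open(filepath.Dir(path))
|       if err != nil {
|         return err
|       }
|       defer dir.Close()
|       return dir.Sync()
|     }
|
|     func main() {
|       if err := writeDurably("data.txt", []byte("hi")); err != nil {
|         panic(err)
|       }
|     }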
|
| Filesystems are not free; they incur "complexity" (that
| favorite bugbear everyone on HN loves to complain about) just
| as much as any other component in the stack does.
|
| > If you stand up an S3 instance with Ceph,
|
| Okay, but AWS does not run on Ceph. Even then, Ceph is an
| example that recommends the opposite. Nowadays they recommend
| solutions like the Bluestore OSD backend to store actual data
| directly on raw block devices, completely bypassing the
| filesystem layer -- for the exact same reasons I outlined above
| and many, many others (the actual metadata does use "BlueFS"
| which is a small FS shim, but this is mostly so that RocksDB
| can write directly to the block device too, next to the data
| segments, and BlueFS is in no way a real POSIX filesystem, it's
| just a shim for existing software).
|
| See "File Systems Unfit as Distributed Storage Backends:
| Lessons from 10 Years of Ceph Evolution" written by the Ceph
| authors[1] about why they finally gave in and wrote Bluestore.
| The spoiler alert is they got rid of the filesystem precisely
| because "a filesystem with files" underneath, as you describe,
| was problematic and worked poorly in comparison (see the
| conclusion in Section 9.)
|
| Many places do use POSIX filesystems for various reasons, even
| at large scale, of course.
|
| [1] https://pdl.cmu.edu/PDL-FTP/Storage/ceph-exp-sosp19.pdf
| yencabulator wrote:
| Ceph's BlueStore has talked directly to block devices, no
| filesystem in between, since 2017.
|
| https://ceph.com/community/new-luminous-bluestore/
|
| [Disclaimer: ex-Ceph employee, from before BlueStore]
| jandrewrogers wrote:
| I seriously doubt this is correct. It is common for database
| engines to run directly against raw block devices, bypassing
| the kernel's filesystem layer and effectively becoming the
| filesystem for those storage devices. Why would S3 work any
| differently? There are
| no advantages to building on top of a filesystem and many
| disadvantages for this kind of thing.
|
| It would be a poor engineering choice to build something like
| S3 on top of some other filesystem. There are often ways to do
| it by using an overlay that converts a filesystem into a pseudo
| block device, but that is usually considered a compatibility
| shim used for environments that lack dedicated storage, at
| the cost of robustness and performance.
| somedudetbh wrote:
| > Amazon S3 is the original cloud technology: it came out in
| 2006. "Objects" were popular at the time and S3 was labelled an
| "object store", but everyone really knows that S3 is for files.
| S3
|
| Alternative theory: everyone who worked on this knew that it was
| not a filesystem, and "object store" is a label intended to
| capture everything else pointed out in this post.
|
| "Objects were really popular" is about objects as software
| component that combines executable code with local state. None of
| the original S3 examples were about "hey you can serialize live
| objects to this store and then deserialize them into another live
| process!" It was all like "hey you know how you have all those
| static assets for your website..." "Objects" was used in this
| sense in databases at the time in the phrase "binary large
| object" or "blob". S3 was like "hey, stuff that doesn't fit in
| your database, you know...objects...this is a store for them."
|
| This is meant to describe precisely things like "listing is
| slow", because when S3 was designed, the launch use cases
| assumed an index of contents existed _somewhere else_,
| because, yeah, it's not a filesystem. It's an object store.
| senderista wrote:
| Yes, the author doesn't seem to realize that "object storage"
| is a term of art in storage systems that has nothing to do with
| OOP.
|
| https://en.wikipedia.org/wiki/Object_storage
| tutfbhuf wrote:
| S3 is obviously not a filesystem in the sense of a POSIX
| filesystem. And I would argue it is not a filesystem, even if we
| were to relax POSIX filesystem semantics (do not implement the
| full spec). But what is certainly possible is to layer a
| filesystem on top of S3; it is basically possible to layer a
| filesystem on anything that can store data. You can even go crazy
| for demonstration purposes and put a filesystem on top of YouTube
| (there are some tech demos for that on GitHub).
|
| I think a better question is whether there are any good
| filesystem implementations on top of S3. There are many attempts
| like s3fs-fuse[^1] or seaweedfs[^2], but I have not heard many
| stories about their use at scale from big companies. Just
| recently there was a post here about cunoFS[^3]. It is a startup
| that implements a POSIX-compliant (supports symlinks, hard links
| (emulated), UIDs & GIDs, permissions, random writes, etc.)
| filesystem on top of S3/AZ/GCP storage and claims to have really
| good performance. I think only time will tell if it works out in
| practice for companies to use S3 as a filesystem through fs
| implementations on top of S3.
|
| [^1]: https://github.com/s3fs-fuse/s3fs-fuse
|
| [^2]: https://github.com/seaweedfs/seaweedfs
|
| [^3]: https://news.ycombinator.com/item?id=39640307
| ein0p wrote:
| A bit off topic but also related: I use Minio as a local "S3" to
| store datasets and model checkpoints for my garage compute.
| Minio, however, has a bunch of features that I simply don't need.
| I just want to be able copy to/from, list prefixes, and delete
| every now and then. I could use nfs I suppose, but that'd be a
| bit inconvenient since I also use Minio to store build deps
| (which Bazel then downloads), and I'd like to be able to
| comfortably build stuff on my laptop. In particular, one feature
| I do not need is the constant disk access that Minio does to
| "protect against bit rot" and whatever. That protection is
| already provided by periodic scrubs on my raidz6.
|
| So what's the current best (preferably statically linked) self-
| hosted, single-node option for a minimal S3-like "thing" that just
| lets me CRUD the files and list them?
| OnlyMortal wrote:
| It can be a file system.
|
| I've written my own FUSE that uses Rabin Chunking and stores the
| data (and meta) in S3. The C++/AWS SDK FUSE is connected to a Go
| SMB server that runs locally on my Mac and works with (local)
| TimeMachine.
|
| I use Wasabi for cost and speed reasons.
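|
| For anyone curious what content-defined chunking buys you, a toy
| sketch (a gear-style rolling hash standing in for real Rabin
| fingerprinting): cut points come from the data itself, so a small
| edit only reshuffles the chunks around it and the rest still
| dedupes against what is already in S3.
|
|     package chunker
|
|     const (
|       mask    = (1 << 13) - 1 // ~8 KiB average chunk size
|       minSize = 2 << 10
|       maxSize = 64 << 10
|     )
|
|     // gear is a fixed pseudo-random per-byte table (any values
|     // work as long as they are stable across runs).
|     var gear = func() [256]uint64 {
|       var t [256]uint64
|       seed := uint64(0x9E3779B97F4A7C15)
|       for i := range t {
|         seed ^= seed << 13
|         seed ^= seed >> 7
|         seed ^= seed << 17
|         t[i] = seed
|       }
|       return t
|     }()
|
|     // Chunks splits data at content-defined boundaries.
|     func Chunks(data []byte) [][]byte {
|       var out [][]byte
|       start := 0
|       var h uint64
|       for i, b := range data {
|         h = (h << 1) + gear[b]
|         size := i - start + 1
|         if (size >= minSize && h&mask == 0) || size >= maxSize {
|           out = append(out, data[start:i+1])
|           start = i + 1
|           h = 0
|         }
|       }
|       if start < len(data) {
|         out = append(out, data[start:])
|       }
|       return out
|     }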
| jcims wrote:
| It seems like they're moving away from this with S3 directory
| buckets and Express One Zone.
___________________________________________________________________
(page generated 2024-03-10 23:00 UTC)