[HN Gopher] Gcsfuse: A user-space file system for interacting wi...
___________________________________________________________________
Gcsfuse: A user-space file system for interacting with Google Cloud
Storage
Author : yla92
Score : 126 points
Date : 2023-09-06 09:48 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| carbocation wrote:
| I do scientific computing in google cloud. When I first got
| started, I heavily relied on GCSFuse. Over time, I have
| encountered enough trouble that I no longer use it for the vast
| majority of my work. Instead, I explicitly localize the files I
| want to the machine that will be operating on them, and this has
| eliminated a whole class of slowdown bugs and availability bugs.
|
| The scale of data for my work is modest (~50TB, ~1 million files
| total, about 50k files per "directory").
| paulddraper wrote:
| > The scale of data for my work is modest (~50TB, ~1 million
| files total, about 50k files per "directory").
|
| Then my work must be downright embarrassing.
| nyc_pizzadev wrote:
| Did you use a local caching proxy like Varnish or Squid? Would
| that have helped?
| dekhn wrote:
| These codes aren't talking HTTP. They are talking POSIX to a
| real filesystem. The problem is that cloud-based FUSE mounts
| are never as reliable (they will "just hang" at random times
| and you need some sort of external timeout to kill the
| process and restart the job and possibly the host) as a real
| filesystem (either a local POSIX one or NFS or SMB).
|
| I've used all the main FUSE cloud FS (gcsfuse, s3-fuse,
| rclone, etc) and they all end up falling over in prod.
|
| I think a better approach would be to port all the important
| science codes to work with file formats like parquet and use
| user-space access libraries linked into the application, with
| both the access library and the user code handling errors
| robustly. This is how systems like mapreduce work, and in my
| experience they work far more reliably than FUSE-mounts when
| dealing with 10s to 100s of TBs.
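|
| For example, a rough sketch of reading parquet straight from GCS
| with a user-space access library (assuming the pyarrow and gcsfs
| packages; bucket and column names here are made up):
|
|       import gcsfs
|       import pyarrow.dataset as ds
|
|       # gcsfs talks to GCS over HTTP from user space; no FUSE mount
|       fs = gcsfs.GCSFileSystem()  # application-default credentials
|       dataset = ds.dataset("my-bucket/experiments/results",
|                            format="parquet", filesystem=fs)
|       # column pruning: only the listed columns are fetched
|       table = dataset.to_table(columns=["sample_id", "score"])
|
| Retries and timeouts then live in the application (or the access
| library) rather than in a kernel mount that can hang.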
| laurencerowe wrote:
| These file systems are not a good fit for large numbers of
| small files. Their sweet spot is working with large (~GB+)
| files which are mostly read from beginning to end. I've mostly
| used them for bioinformatics stuff.
| ashishbijlani wrote:
| FUSE does not work well with a large number of small files (due
| to high metadata ops such as inode/dentry lookups).
|
| ExtFUSE (optimized FUSE with eBPF) [1] can offer you much
| higher performance. It caches metadata in the kernel to avoid
| lookups in user space. Disclaimer: I built it.
|
| 1. https://github.com/extfuse/extfuse
| laurencerowe wrote:
| ExtFUSE seems really cool and great for implementing
| performant drivers in userspace for local or lower latency
| network filesystems, but I doubt FUSE is the bottleneck in
| this case since S3/GCS have 100ms first byte latency.
|
| https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimi.
| ..
| markstos wrote:
| I had a similar experience with S3 Fuse. It was slower, more
| complex and expensive than using S3 directly. I had feared
| refactoring my code to use the API, but it went quickly. I've
| never gone back to using or recommending a cloud filesystem
| like that for a project.
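|
| For what it's worth, a minimal sketch of the API-direct route with
| boto3 (bucket and key names are made up):
|
|       import boto3
|
|       s3 = boto3.client("s3")
|       # read an object without a mounted filesystem in the way
|       obj = s3.get_object(Bucket="my-bucket",
|                           Key="reports/2023-09-06.json")
|       data = obj["Body"].read()
|       # ...process data...
|       s3.put_object(Bucket="my-bucket", Key="reports/processed.json",
|                     Body=data)
|
| Errors and retries are then explicit in the code instead of showing
| up as a hung mount.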
| [deleted]
| yread wrote:
| There is also blobfuse2 for mounting Azure Storage
| https://github.com/Azure/azure-storage-fuse
|
| It has some nice features like streaming with block-level caching
| for fast read-only access.
| yla92 wrote:
| There is s3fs-fuse as well for AWS S3.
| https://github.com/s3fs-fuse/s3fs-fuse
|
| It even supports GCS (as GCS has an S3-compatible API)
|
| https://github.com/s3fs-fuse/s3fs-fuse/wiki/Google-Cloud-Sto...
| alpb wrote:
| Hah nice! I developed https://github.com/ahmetb/azurefs back in
| 2012 when I was about to join Azure. I'm glad Azure actually
| provides a supported and actively-maintained tool for this.
| easton wrote:
| mountpoint-s3 is AWS' first party solution for mounting s3
| buckets as file systems:
| https://github.com/awslabs/mountpoint-s3
|
| Haven't used it but it looks cool, if a bit immature.
| bushbaba wrote:
| I'd also look at goofys, which I've found to be quite performant
| for reads. Also nice that it's a golang binary which is easily
| passed around to hosts.
| nyc_pizzadev wrote:
| Does anyone have any experience on how this works at scale?
|
| Let's say I have a directory tree with 100MM files in a nested
| structure, where the average file is 4+ directories deep. When I
| `ls` the top few directories, is it fast? How long until I
| discover updates?
|
| Reading the docs, it looks like it's using this API for traversal
| [0]?
|
| What about metadata like creation times, permissions, owner,
| group?
|
| Any consistency concerns?
|
| [0]
| https://cloud.google.com/storage/docs/json_api/v1/objects/li...
| BrandonY wrote:
| Hi, Brandon from GCS here. If you're looking for all of the
| guarantees of a real POSIX filesystem, want to do fast top-level
| directory listing for 100MM+ nested files, and need POSIX
| permissions/owner/group and other file metadata, Gcsfuse is
| probably not what you're after. You might
| want something more like Filestore:
| https://cloud.google.com/filestore
|
| We've got some additional documentation on the differences and
| limitations between Gcsfuse and a proper POSIX filesystem:
| https://cloud.google.com/storage/docs/gcs-fuse#expandable-1
|
| Gcsfuse is a great way to mount Cloud Storage buckets and view
| them like they're in a filesystem. It scales quite well for all
| sorts of uses. However, Cloud Storage itself is a flat
| namespace with no built-in directory support. Listing the few
| top level directories of a bucket with 100MM files more or less
| requires scanning over your entire list of objects, which means
| it's not going to be very fast. Listing objects in a leaf
| directory will be much faster, though.
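|
| As a rough illustration with the Python client (bucket and prefix
| names are made up), "directories" are just delimiter-grouped
| prefixes over the flat namespace:
|
|       from google.cloud import storage
|
|       client = storage.Client()
|       # delimiter="/" returns one "level"; the service still has to
|       # scan the object namespace to find the distinct prefixes
|       blobs = client.list_blobs("my-bucket", prefix="data/2023/",
|                                 delimiter="/")
|       for blob in blobs:
|           print(blob.name)       # objects directly under the prefix
|       print(blobs.prefixes)      # the "subdirectories"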
| milesward wrote:
| Brandon, I know why this was built, and I agree with your
| list of viable uses; that said, it strikes me as extremely
| likely to lead to gnarly support load, grumpy customers, and
| system instability when it is inevitably misused. What steps
| across all of the user interfaces is GCP taking to warn users
| who may not understand their workload characteristics at all
| as to the narrow utility of this feature?
| nyc_pizzadev wrote:
| Thanks for the reply.
|
| Our theoretical use case is 10+ PB and we need multiple TB/s
| of read throughput (maybe a fraction of that for writing).
| So I don't think Filestore fits this scale, right?
|
| As for the directory traversals, I guess caching might help
| here? Top level changes aren't as frequent as leaf additions.
|
| That being said, I don't see any (caching) proxy support
| anywhere other than the Google CDN.
| daviesliu wrote:
| If you really expect a file system experience over GCS, please
| try JuiceFS [1], which scales to 10 billion files pretty
| well with TiKV or FoundationDB as the metadata engine.
|
| PS, I'm founder of JuiceFS.
|
| [1] https://github.com/juicedata/juicefs
| victor106 wrote:
| The description says S3. Does it also support GCS?
| 8organicbits wrote:
| The architecture image shows GCS and others, so I suspect
| it does.
|
| https://github.com/juicedata/juicefs#architecture
| skrowl wrote:
| [dead]
| asah wrote:
| gcsfuse worked great for me on a couple of projects, but YMMV for
| production use. As with all distributed storage systems, make
| sure you can handle timeouts, retries, high latency periods and
| outages.
| djbusby wrote:
| Why not rclone? It was discussed here yesterday as a replacement
| for sshfs - and supports GCS as well as dozens more backends.
|
| https://rclone.org/
|
| https://news.ycombinator.com/item?id=37390184
| [deleted]
| capableweb wrote:
| Last time gcsfuse was on HN
| (https://news.ycombinator.com/item?id=35784889), the author of
| rclone was in the comments:
|
| > From reading the docs, it looks very similar to `rclone
| mount` with `--vfs-cache-mode off` (the default). The
| limitations are almost identical.
|
| > However rclone has `--vfs-cache-mode writes` which caches
| file writes to disk first to allow overwriting in the middle of
| a file and `--vfs-cache-mode full` to cache all objects on a
| LRU basis. They both make the file system a whole lot more
| POSIX compatible and most applications will run using `--vfs-
| cache-mode writes` unlike `--vfs-cache-mode off`.
|
| https://news.ycombinator.com/item?id=35788919
|
| Seems rclone would be an even better option than Google's own
| tool.
| tough wrote:
| Comments like this are why the HN comments section is usually
| better than the news in it.
|
| Also hi capableweb I think your name rings a bell from
| LLM's/Gen AI threads
| capableweb wrote:
| Me too!
|
| Hello! That's probably a sign I need to take a break from
| writing too many HN comments per day, thanks :)
| tough wrote:
| Don't worry man, wasn't implying that, it's just cool to
| see the same names/non-faces around tbh. simon is
| likewise heh
| paulgb wrote:
| > Cloud Storage FUSE can only write whole objects at a time to
| Cloud Storage and does not provide a mechanism for patching. If
| you try to patch a file, Cloud Storage FUSE will reupload the
| entire file. The only exception to this behavior is that you can
| append content to the end of a file that's 2 MB and larger, where
| Cloud Storage FUSE will only reupload the appended content.
|
| I didn't know GCS supported appends efficiently. Correct me if
| I'm wrong, but I don't think S3 has an equivalent way to append
| to a value, which makes it clunky to work with as a log sink.
| nicornk wrote:
| With S3 you can do something similar by misusing the multipart
| upload functionality, e.g.:
| https://github.com/fsspec/s3fs/blob/fa1c76a3b75c6d0330ed03c4...
| rickette wrote:
| Azure Blob Storage actually has explicit append support using
| "Append"-blobs (next to block and page blobs)
| capableweb wrote:
| Building a storage service like these today and not having
| "append" would be very silly indeed. I guess S3 is kind of
| excused since it's so old by now. Although I haven't read
| anything about them adding it, so maybe less excused...
| paulddraper wrote:
| > Correct me if I'm wrong, but I don't think S3 has an
| equivalent way to append to a value, which makes it clunky to
| work with as a log sink.
|
| You are correct. (There are multipart uploads, but that's kinda
| different.)
|
| ELB logs are delivered as separate objects every few minutes,
| FWIW.
| londons_explore wrote:
| Append workloads are common in distributed systems. Turns out
| nearly every time you think you need random read/write to a
| data structure (e.g. a hard drive/block device for a
| Windows/Linux VM), you can instead emulate that with an append-
| only log of changes and a set of append-only indexes.
|
| Doing so has huge benefits: Write performance is _way_ higher,
| you can do rollbacks easily (just ignore the tail of the
| files), you can do snapshotting easily (just make a new file
| and include by reference a byte range of the parent), etc.
|
| The downside is from time to time you need to make a new file
| and chuck out the dead data - but such an operation can be done
| 'online', and can be done during times of lower system load.
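|
| A toy sketch of the pattern (illustrative only, local file instead
| of an object store): mutations are appended, an index maps each key
| to its newest offset, and rollback just ignores the tail.
|
|       import json
|
|       class AppendOnlyStore:
|           def __init__(self, path):
|               self.f = open(path, "a+b")  # writes always go to the end
|               self.index = {}             # key -> offset of newest record
|
|           def put(self, key, value):
|               offset = self.f.seek(0, 2)
|               self.f.write(json.dumps({"k": key, "v": value}).encode()
|                            + b"\n")
|               self.f.flush()
|               self.index[key] = offset
|
|           def get(self, key):
|               self.f.seek(self.index[key])
|               return json.loads(self.f.readline())["v"]
|
|           def rollback(self, length):
|               # "ignore the tail": rebuild the index from the first
|               # `length` bytes and pretend the rest was never written
|               self.index, pos = {}, 0
|               self.f.seek(0)
|               while pos < length:
|                   rec = json.loads(self.f.readline())
|                   self.index[rec["k"]] = pos
|                   pos = self.f.tell()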
| merb wrote:
| You just described the WAL files of a database.
| KRAKRISMOTT wrote:
| The generalized architecture is called event sourcing
| capableweb wrote:
| Well, I'd argue that it's just two different names for
| similar concepts, but applied at different levels. WAL is
| a low-level implementation detail, usually for
| durability, while event sourcing is an architecture applied
| to solve business problems.
|
| A WAL would usually disappear or truncate its length
| after a while, and you'd only rerun things from it if you
| absolutely have to. Changes in business requirements
| shouldn't require you to do anything with a WAL.
|
| In contrast, an event sourcing log would be kept
| indefinitely, so when business requirements change, you
| could (if you want to, not required) re-run N previous
| events so you can apply new changes to old data in your
| data storage.
|
| But, if you really want to, it's basically the same, but
| in the end, applied differently :)
| tough wrote:
| ha I was just thinking how similar this was to postgres pg-
| audit way of reusing the logs to sum up the correct state
| capableweb wrote:
| > Append workloads are common in distributed systems.
|
| Bringing back the topic to what the parent was saying; since
| S3 is a pretty common system, and a distributed system at
| that, are you saying that S3 does support appending data?
| AFAIK, S3 never supported any append operations.
| vlovich123 wrote:
| I'll add some nuance here. You can implement append
| yourself in a clunky way. Create a new multipart upload for
| the file, copy its existing contents, create a new part
| appending what you need, and then complete the upload.
|
| Not as elegant / fast as GCS's and there may be other
| subtleties, but it's possible to simulate.
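|
| A sketch of that dance with boto3 (bucket/key made up; one of the
| subtleties is that every part except the last must be at least
| 5 MiB, so the copied original has to meet that minimum):
|
|       import boto3
|
|       s3 = boto3.client("s3")
|       bucket, key = "my-bucket", "logs/app.log"
|
|       mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
|       # part 1: copy the existing object into the new upload
|       part1 = s3.upload_part_copy(
|           Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
|           PartNumber=1, CopySource={"Bucket": bucket, "Key": key})
|       # part 2: the bytes being "appended"
|       part2 = s3.upload_part(
|           Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
|           PartNumber=2, Body=b"new log line\n")
|       s3.complete_multipart_upload(
|           Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
|           MultipartUpload={"Parts": [
|               {"PartNumber": 1,
|                "ETag": part1["CopyPartResult"]["ETag"]},
|               {"PartNumber": 2, "ETag": part2["ETag"]},
|           ]})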
| ozfive wrote:
| It does not. Consider using services like Amazon Kinesis
| Firehose, which can buffer and batch logs, then
| periodically write them to S3.
| Severian wrote:
| You can use S3 versioning, assuming you have enabled this on
| the bucket. It would be a little clunky. It would also be done
| in batches and not continuous append.
|
| Basically if your data is append only (such as a log), buffer
| whatever reasonable amount is needed, and then put a new
| version of the file with said data (recording the generated
| version ID AWS gives you). This gets added to the "stack" of
| versions of said S3 object. To read them all, you basically get
| each version from oldest to newest and concatenate them
| together on the application side.
|
| Tracking versions would need to be done application side
| overall.
|
| You could also do "random" byte ranges if you track the
| versioning and your object has the range embedded somewhere in
| it. You'd still need to read everything to find what is the
| most up to date as some byte ranges would overwrite others.
|
| Definitely not the most efficient but it is doable.
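|
| A rough sketch of that version-stacking idea with boto3 (bucket and
| key made up, versioning assumed to be enabled on the bucket):
|
|       import boto3
|
|       s3 = boto3.client("s3")
|       bucket, key = "my-versioned-bucket", "logs/app.log"
|
|       def append_chunk(chunk: bytes) -> str:
|           # each buffered chunk becomes a new version of the object
|           resp = s3.put_object(Bucket=bucket, Key=key, Body=chunk)
|           return resp["VersionId"]   # track this application-side
|
|       def read_all() -> bytes:
|           versions = []
|           paginator = s3.get_paginator("list_object_versions")
|           for page in paginator.paginate(Bucket=bucket, Prefix=key):
|               for v in page.get("Versions", []):
|                   if v["Key"] == key:
|                       versions.append((v["LastModified"],
|                                        v["VersionId"]))
|           # concatenate oldest to newest on the application side
|           return b"".join(
|               s3.get_object(Bucket=bucket, Key=key,
|                             VersionId=vid)["Body"].read()
|               for _, vid in sorted(versions))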
| jeffbarr wrote:
| OMG...
| dataangel wrote:
| what is the advantage of versioning versus just naming your
| objects log001, log002, etc and opening them in order?
| vlovich123 wrote:
| You can set up lifecycle policies. For example, auto delete
| or auto archive versions > X date. That's one lifecycle
| rule. With custom naming schemes, it wouldn't scale as
| well.
| advisedwang wrote:
| gcsfuse uses compose [1] to append. Basically it uploads the
| new data to a temp object, then performs a compose operation to
| make a new object in the place of the original with the
| combined content.
|
| [1] https://cloud.google.com/storage/docs/composing-objects
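|
| Roughly, a compose-based append with the Python client looks
| something like this (names made up; not gcsfuse's actual code):
|
|       from google.cloud import storage
|
|       client = storage.Client()
|       bucket = client.bucket("my-bucket")
|       original = bucket.blob("logs/app.log")
|       temp = bucket.blob("logs/.app.log.append.tmp")
|
|       # 1. upload only the new bytes to a temporary object
|       temp.upload_from_string("new log line\n")
|       # 2. compose [original, temp] back over the original name
|       original.compose([original, temp])
|       # 3. clean up the temporary object
|       temp.delete()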
| KptMarchewa wrote:
| I wonder how it handles potential conflicts.
| londons_explore wrote:
| For appends, the normal way is to apply the append operations
| in an arbitrary order if there are multiple concurrent
| writers. That way you can have 10 jobs all appending data to
| the same 'file', and you know every record will end up in
| that file when you later scan through it.
|
| Obviously, you need to make sure no write operation breaks a
| record midway while doing that (unlike the POSIX write() API,
| which can be interrupted midway).
| KptMarchewa wrote:
| That makes sense - if you keep data in something like
| ndjson and don't require any order.
|
| If you need order then probably writing to separate files
| and having compaction jobs is still better.
| boulos wrote:
| Objects have three fields for this: Version, Generation,
| and Metageneration. There's also a checksum. You can be
| sure that you were the writer / winner by checking these.
| dpkirchner wrote:
| You can also send a x-goog-if-generation-match[0] header
| that instructs GCS to reject writes that would replace
| the wrong generation (sort of like a version) of a file.
| Some utilities use this for locking.
|
| 0: https://cloud.google.com/storage/docs/xml-
| api/reference-head...
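|
| The Python client exposes that precondition as
| if_generation_match; a small sketch of the locking pattern
| (object name made up):
|
|       from google.cloud import storage
|       from google.api_core.exceptions import PreconditionFailed
|
|       client = storage.Client()
|       blob = client.bucket("my-bucket").blob("state/leader.lock")
|
|       blob.reload()                # fetch the current generation
|       try:
|           blob.upload_from_string(
|               "new contents",
|               if_generation_match=blob.generation)  # 412 on a race
|       except PreconditionFailed:
|           print("another writer replaced the object first")
|
| (Passing if_generation_match=0 means "only create if the object
| doesn't exist yet", which is the usual building block for locks.)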
| buildbuildbuild wrote:
| The gcsfuse k8s CSI driver also works well if you build your
| workloads to expect occasional timeouts. It is a shame that a
| reliable S3-compatible
| alternative does not yet exist in the open source realm.
| jarym wrote:
| One thing I don't fully understand is whether data is cached
| locally or whether I would have to handle that myself (for
| example if I have to read a configuration file)? And if it is
| cached, how can I control how often it refreshes?
| plicense wrote:
| It uses FUSE, and there are three types of kernel cache you could
| use with FUSE (although it seems like gcsfuse is exposing only
| one):
|
| 1. Cache of file attributes in the kernel (this is controlled by
| the "stat-cache-ttl" value -
| https://github.com/GoogleCloudPlatform/gcsfuse/blob/7dc5c7ff...)
|
| 2. Cache of directory listings
|
| 3. Cache of file contents
|
| It should be possible to use (2) and (3) for better
| performance, but that might need changes to the underlying FUSE
| library they use to expose those options.
| hansonw wrote:
| gcsfuse has controllable built-in caching of _metadata_ but not
| contents: https://cloud.google.com/storage/docs/gcsfuse-
| performance-an...
|
| You'd have to use your own cache otherwise. IME the OS-level
| page cache is actually quite effective at caching reads and
| seems to work out of the box with gcsfuse.
| droque wrote:
| I don't think it is; instead, each operation makes a request.
| You can use something like catfs
| https://github.com/kahing/catfs
___________________________________________________________________
(page generated 2023-09-06 20:00 UTC)