[HN Gopher] File Permissions: A painful side of Docker (2019)
       ___________________________________________________________________
        
       File Permissions: A painful side of Docker (2019)
        
       Author : zdw
       Score  : 104 points
       Date   : 2021-05-28 04:04 UTC (3 days ago)
        
 (HTM) web link (blog.gougousis.net)
 (TXT) w3m dump (blog.gougousis.net)
        
       | addingnumbers wrote:
       | This doesn't even really seem like a problem that docker
       | introduced. All these problems have been encountered by anyone
       | running an NFS server, or a dozen other ways you can have systems
       | with disparate uid/gid mappings using a shared or removable file
       | system
        
       | mschuster91 wrote:
       | > First of all, security issues may rise in a production system.
       | If a container is compromised and the container is executed as
       | root (uid = 0), then the intruder has access to any file of the
       | host filesystem that has been loaded to the container filesystem
       | through a mount. The owner UID of files that belong to the host
       | root will be 0 in the container. So, they will be accessible to
       | the intruder.
       | 
       | Use supervisord to coordinate the processes inside your Docker
       | container, as easy as that. Bonus point, you don't need to
       | wrangle with properly handling "docker stop"/ctrl+c.
        
       | throayobviousl wrote:
       | Isn't this a bit of an anti-pattern? There really are very few
       | situations in which you should be mounting things in production.
       | Apache/PHP/etc is definitely not one of those situations.
        
         | q3k wrote:
         | I would absolutely say it's a production antipattern to run a
         | container with access to some already existing host files
         | belonging to some other user.
         | 
         | However, this is something that's basically unavoidable if
         | you're attempting to use OCI/Docker for dev where you access a
         | developer's source code checkout from a container running a
         | standardized language runtime. And that's what a lot of people
         | use OCI/Docker for...
        
           | dividedbyzero wrote:
           | Couldn't you run into this issue when mounting device files?
           | I believe doing that for accessing external hardware or
           | sensors is not all that uncommon.
        
             | q3k wrote:
             | Sure, that's one of the cases when this might be needed in
             | prod (although in the parent post I meant only access to
             | honest-to-god data files, not things like bindmounting
             | /dev).
             | 
             | In practice bindmount smell can also be somewhat alleviated
             | by using things like k8s device plugins to request things
             | at a higher level ('I want GPU access' vs. 'please
             | bindmount /dev/drm... and use the proper modes'). It's
             | still effectively a bindmount, but some extra security
             | precautions can be made to ensure exclusive access and that
             | no arbitrary mounts from the host are permitted. And things
             | like k8s device plugins can also poke at file modes and
             | other namespace magic at runtime so that the end user never
             | has to worry about things like UID/GID and chardev modes.
             | That IMO prevents the smell associated with random host
             | bindmounts.
        
               | dividedbyzero wrote:
               | I wasn't aware of k8s device plugins, that seems like it
               | would help with that, if k8s is an option. Thanks for the
               | pointer!
        
               | q3k wrote:
               | You're welcome :).
               | 
               | They're also very easy to write, so if you ever happen to
               | run k8s and need to give workloads access to some
               | odd/custom host hardware, implementing a proper plugin
               | for it is quite painless and gives much better guarantees
               | than plain bindmounts.
        
         | nicce wrote:
         | Every additional mount can be considered an extra design
         | failure in terms of security, or simply laziness. Each one
         | increases the attack surface. Even though containers were not
         | designed primarily for isolation, every mount and volume is one
         | step closer to breaking what isolation there is. Of course, the
         | total risk depends on where you are mounting from.
        
       | gourlaysama wrote:
       | There is a new mount syscall in Linux 5.12, see "ID remapping in
       | mounts" [1], that should help with all the permission madness,
       | eventually.
       | 
       | It allows different mounts to expose the same content with
       | different ownership, and in general to map permissions IDs
       | between mounts in any way we like.
       | 
       | systemd-homed will use that to abstract over the uids and gids of
       | portable home directories, for example.
       | 
       | [1]: https://lwn.net/Articles/837566/
        
       | xorcist wrote:
       | Using uid 0 in containers is asking for trouble. Any privileged
       | resources (such as low ports) can be mapped in without messing
       | with capabilities so there should be no need for it.
        
         | hda111 wrote:
         | The port mapping is done by the container engine, not the
         | container. Using low ports is allowed if the engine runs as
         | root. Moreover I think it's acceptable to use uid 0 inside a
         | rootless container like podman since it's by default only
         | mapped to the user running it.
        
           | 0xbadcafebee wrote:
           | AWS Fargate won't let you remap ports. Whatever the container
           | exposes, that's the port it's going to listen on. To work
           | around this and other problems, I ended up making fat
           | containers that start as root, and add entrypoints that can
           | either run a process as root (to listen on low ports) or sudo
           | to a user to drop perms before starting a process (to listen
           | on high ports).
           | 
           | There's also weird junk you sometimes need to do in order to
           | capture file handles depending on how a container engine is
           | running the container, which you need to do before you fork
           | or drop privs. But it took me years to finally run into that
           | use case, most people will never need to do this.
        
       | choeger wrote:
       | No one mentions how podman solves that problem with user id
       | mapping?
        
         | hda111 wrote:
         | I mentioned this in another comment here.
        
         | dividedbyzero wrote:
         | Would you mind elaborating?
        
           | choeger wrote:
           | I don't have the time to write elaborate comments right now,
           | but see here:
           | 
           | http://docs.podman.io/en/latest/markdown/podman-run.1.html
           | 
           | Especially the "userns" option with the "keep-id" value.
        
       | FooBarWidget wrote:
       | I blogged about this same problem a month ago.
       | 
       | "Docker and the host filesystem owner matching problem":
       | https://www.joyfulbikeshedding.com/blog/2021-03-15-docker-an...
       | 
       | In my blog post I lay out 2 solution strategies, how one might go
       | about implementing them, and caveats to watch out for.
       | 
       | 1. Matching the container's UID/GID with the host's UID/GID.
       | 
       | 2. Remounting the host path in the container using BindFS.
        
       | 988747 wrote:
       | I think all those problems disappear when you run containers with
       | proper orchestration tools, such as Kubernetes.
       | 
       | And not only that, I think that examples given in the article
       | ("Assume that your Apache/PHP container is mounting the host's
       | /home/alexandros/myapp/ application directory to the container's
       | /var/www/html directory.") are in fact anti-patterns. If your
       | container depends on a specific file being available at a specific
       | location on the host then you're doing it wrong. The only place
       | where that makes sense is in a developer's local environment. In
       | shared environments you want something like a Kubernetes ConfigMap
       | to contain config files, and dedicated persistent volumes for
       | everything else.
        
         | 0xbadcafebee wrote:
         | The orchestration tool does not provide any additional
         | functionality to fix this problem, it's up to the container
         | execution environment, and today's container execution
         | environments have no way (that I am aware of) to natively map
         | file permissions outside of the container.
         | 
         | It could be I just haven't dug enough into the kernel
         | internals, maybe there is a transparent permissions remapping
         | thing. But something would absolutely have to map permissions.
         | Otherwise there is no way to use filesystem ownership between
         | execution environments without them using conflicting UID/GIDs,
         | to say nothing of changing the file perms.
        
       | hda111 wrote:
       | In Podman this is a solved problem: podman run --userns=keep-id
        
         | Honiix wrote:
         | also `podman unshare` is really helpful
        
       | voidfunc wrote:
       | I've always solved this by just having a proxy script that creates
       | a user when the container starts with the right UID/GID then
       | executes the given command.
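       |
       | A minimal sketch of such a proxy script, assuming the image ships
       | getent, groupadd, useradd and setpriv, and that the host IDs are
       | passed in as HOST_UID/HOST_GID. Those variable names and the
       | "hostuser"/"hostgroup" names are hypothetical, not from any
       | particular project:

```shell
#!/bin/sh
# Hypothetical entrypoint helper: create a user matching the host's
# UID/GID at container start, then run the given command as that user.
# HOST_UID / HOST_GID are assumed names, e.g. set with `docker run -e`.
entrypoint() {
  uid="${HOST_UID:-1000}"
  gid="${HOST_GID:-1000}"

  # Nothing to switch to (already that user) or no privilege to switch
  # (not root): just run the command as-is.
  if [ "$(id -u)" -eq "$uid" ] || [ "$(id -u)" -ne 0 ]; then
    exec "$@"
  fi

  # Create the group and user only if those numeric IDs are still unused.
  getent group "$gid" >/dev/null || groupadd -g "$gid" hostgroup
  getent passwd "$uid" >/dev/null || useradd -m -u "$uid" -g "$gid" hostuser

  # Drop root and replace this shell with the requested command.
  exec setpriv --reuid="$uid" --regid="$gid" --init-groups "$@"
}
```

       | In a real entrypoint file the last line would be
       | `entrypoint "$@"`, and the container would be started with
       | something like -e HOST_UID="$(id -u)" -e HOST_GID="$(id -g)".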
        
       | dandarie wrote:
       | > The problem with this approach is that is not portable. What if
       | I am developing using more than one computers where in each
       | computer my user has different ID?
       | 
       | Make the build script use local $USERID and $GROUPID as args
       | during the build process.
       | 
       | In docker-compose.yml (or, if using docker directly, using
       | --build-arg):
       | 
       |     build:
       |       context: ./build
       |       args:
       |         USERID: ${USERID}
       |         GROUPID: ${GROUPID}
       | 
       | So you're passing the local uid and gid as variables to the build
       | process.(1)
       | 
       | In build/Dockerfile:
       | 
       |     FROM image:tag
       |     WORKDIR "/application"
       |     ARG USERID
       |     ARG GROUPID
       | 
       |     RUN if [ ${USERID:-0} -ne 0 ] && [ ${GROUPID:-0} -ne 0 ]; then userdel -f www-data ;fi \
       |         && if getent group ${GROUPID} ; then groupdel www-data; fi \
       |         && groupadd -g ${GROUPID} www-data && useradd -m -l -u ${USERID} -g www-data www-data -s /bin/bash \
       | 
       | (1) $USERID and $GROUPID might not be available as environment
       | variables on your system. To make them available, place this in
       | .bashrc:
       | 
       |     export USERID=$(id -u)
       |     export GROUPID=$(id -g)
        
         | q3k wrote:
         | But that doesn't solve the problem, just works around it:
         | 
         | 1. Images are still pre-baked with a given UID/GID pair, so you
         | can't distribute them as something universal and reusable.
         | 
         | 2. This requires workarounds / extra steps on a local
         | workstation, so it doesn't work for everyone unless they follow
         | a given project's unique setup quirks.
         | 
         | Shell/compose duct tape like this doesn't make for a great
         | experience, this really should be solved by upstream projects
         | themselves as it's an extremely common issue when attempting to
         | use Docker.
        
           | dandarie wrote:
           | 1. Nope, they are not pre-baked. They are built at runtime
           | from env vars on each machine.
           | 
           | 2. One step, setting up two vars. They can be set by a build
           | script. Lots of things have build scripts way more
           | complicated than this.
           | 
           | The only tedious thing is you have to adapt this for every
           | image type you run.
        
             | momothereal wrote:
             | If you have to build it on each machine, I would not
             | consider that easily/universally distributable. One of the
             | key points of Docker is you can build once (in your CI or
             | someone else's) and run it on any machine. I think that was
             | GP's point.
        
             | woodrowbarlow wrote:
             | but that _requires_ you to build-at-runtime, which is
             | sometimes not the best way to deploy a docker app. if you
             | have one app that you want to run on many nodes, you'll
             | want to set up a docker registry and have the nodes pull
             | pre-built images.
        
               | dandarie wrote:
               | Of course, but really only build once on every machine.
               | The subsequent starts use the cached build, even after
               | reboot.
               | 
               | In fact, docker-compose up -d takes care of the build
               | thing by itself. It's a five second tradeoff for the
               | lifetime of the application.
        
               | lukeck wrote:
               | For anyone that uses immutable infrastructure, where a
               | server's configuration is never changed once built and
               | subsequent deployments result in replacement with entirely
               | new VMs, building once per machine still happens every time
               | there is a deployment. You don't ever reboot these machines.
               | 
               | In environments where vulnerability scanning of docker
               | images used is important, running anything in production
               | that isn't stored in a docker registry kind of breaks
               | things.
               | 
               | This approach also won't work with container
               | orchestrators like Kubernetes, ECS, Lambda, CloudRun,
               | etc.
               | 
               | Where I can see doing a docker build of a small layer
               | that just sets file perms potentially being useful is for
               | container based dev environments to be run on laptops and
               | workstations.
        
             | oauea wrote:
             | Sure, great, let me just rebuild all my docker images on
             | every single machine they run on thereby completely
             | defeating the point of having images in the first place.
        
               | dandarie wrote:
               | You start from a base image of your choice. You only
               | build the user replacement part.
               | 
               | You run docker-compose build ONCE and you're set. On my
               | machine, it takes five seconds.
               | 
               | Heck, you can even run docker-compose build every time you
               | start the application, it will use the cached build and
               | take less than one second.
               | 
               | ---
               | 
               | Correction: the docker-compose up -d takes care of the
               | build process the first time it runs.
               | 
               | Literally, it takes more to complain about the issue than
               | build the image ONCE.
        
             | q3k wrote:
             | > The only tedious thing is you have to adapt this for
             | every image type you run.
             | 
             | The tedious thing is that this escalates into complexity
             | whenever you have to deal with K developers using M
             | projects developed by N teams each using a different way to
             | handle this:
             | 
             | Do I need to set USERID for project foo, or UID? Does it
             | default to 1000 or the author's UID? Oh, someone has a
             | problem with our project, did they remember to set
             | COMPANY_USERID in their bashrc? Oh, wait, they're using
             | zsh, how do you do that there? Oh, but they followed this
             | other project's readme and that set COMPANY_USERID but not
             | COMPANY_GROUPID...
             | 
             | Docker is supposed to simplify this by unification and a
             | limited API surface, and applying hacks like this on top
             | kind of kills that whole premise.
        
               | dandarie wrote:
               | > Do I need to set USERID for project foo, or UID? Does
               | it default to 1000 or the author's UID? Oh, someone has a
               | problem with our project, did they remember to set
               | COMPANY_USERID in their bashrc? Oh, wait, they're using
               | zsh, how do you do that there? Oh, but they followed this
               | other project's readme and that set COMPANY_USERID but
               | not COMPANY_GROUPID...
               | 
               | You set it to the output of id -u and id -g. It's two
               | lines. There are definitely lots of things more complex
               | when dealing with docker than this.
               | 
               | You provide the team with a script containing those two
               | lines and a docker-compose wrapper and you're set.
               | 
               | Of course it would have been better not to have to care
               | about these things, but hey, at least you're not
               | installing and configuring 4-5 services to bootstrap an
               | application.
        
           | rad_gruchalski wrote:
           | It's a feature for a multi-tenant deployment if you use user
           | remaps. Maybe you only allow specific tenant containers with
           | tenant specific uid/gid.
        
         | VLM wrote:
         | > more than one computers where in each computer my user has
         | different ID
         | 
         | Decades of network filesystem users have had many solutions to
         | that.
        
           | Joker_vD wrote:
           | I can think of basically two solutions:
           | 
           | 1) pass user/group names around and resolve them at the
           | destination to UID/GID;
           | 
           | 2) ignore them entirely; assign ownership of all newly created
           | files to the currently authenticated user (if authorized).
           | 
           | Are there other ones?
        
             | fiddlerwoaroof wrote:
             | 3) treat a machine-id/user-id pair as the "real userid"
             | 
             | 4) add a remote->local userid mapping feature to your
             | filesystem.
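             |
             | As a rough illustration of option 4, here is a sketch of a
             | remote->local lookup against a plain mapping table. The
             | file format ("remote_uid local_uid" per line) and the
             | fallback to 65534 (commonly "nobody") are assumptions made
             | for the example, not how any specific filesystem does it:

```shell
#!/bin/sh
# Look up a remote UID in a mapping file whose lines have the form
# "remote_uid local_uid" and print the matching local UID.
# Unmapped IDs fall back to 65534 (an assumed "nobody" ID).
map_uid() {
  remote="$1"
  map_file="$2"
  awk -v r="$remote" '
    $1 == r { print $2; found = 1; exit }
    END     { if (!found) print 65534 }
  ' "$map_file"
}
```

             | For example, with a table containing "1001 1000", calling
             | `map_uid 1001 table` prints 1000, and any UID missing from
             | the table is squashed to the fallback ID.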
        
         | encryptluks2 wrote:
         | Containers are ideally meant for a single service. The best way
         | I've found is to just pass the `--user` flag to `docker run`
         | and have the service run as whatever user it is that you want.
         | The only challenge is that you need to make sure that the
         | volume mounts are already created on the host with the correct
         | permissions.
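         |
         | A small sketch of that workflow, assuming a bind mount onto
         | /data and a placeholder image name (both hypothetical). It
         | creates the mount source as the invoking user first, since the
         | Docker daemon typically creates a missing `-v` source
         | directory itself, owned by root, and then only prints the
         | command rather than running it:

```shell
#!/bin/sh
# Create the bind-mount source as the invoking user *before* starting
# the container, then print a matching `docker run` command.
# The image name and the /data target are placeholders.
docker_run_as_me() {
  host_dir="$1"
  image="$2"
  mkdir -p "$host_dir"   # ends up owned by whoever runs this script
  printf 'docker run --rm --user %s:%s -v %s:/data %s\n' \
    "$(id -u)" "$(id -g)" "$host_dir" "$image"
}
```

         | Something like `docker_run_as_me "$HOME/myapp-data" alpine:3`
         | prints the command to copy or pipe into a shell.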
        
           | dandarie wrote:
           | That runs the container as a given user, but doesn't prevent
           | the container running some processes as a different internal
           | user.
        
         | professor_v wrote:
         | Within docker-compose.yml I use
         | 
         |     services:
         |       foo:
         |         image: foo/bar:6.9
         |         user: ${UID:-1000}:${UID:-1000}
         | 
         | On Linux with Bash it runs with your current user and on most
         | other platforms it runs with id 1000, which is set up as the
         | default user in the Dockerfile. This is no problem on macOS or
         | Windows because of the way Docker Desktop uses VMs.
         | 
         | ZSH or other shells don't necessarily set $UID, so if you're
         | running Linux, not id 1000 and not running Bash you might need
         | a little .env file with `UID=1001` in it to make it work. And
         | then the user is still nameless in the container. This is kind
         | of rare and I only use it for dev containers where most
         | relevant files (and permissions) are bind-mounted from the
         | host, so it hasn't really been a problem in practice.
         | 
         | Remaps would be cleaner but I find it too much work to explain
         | for normal developers just wanting to use a dev container.
        
           | dandarie wrote:
           | From my experience, UID is not always available to
           | docker-compose.yml because it isn't exported (at least in bash).
           | 
           | See more here: https://stackoverflow.com/a/50900530/15428104
           | 
           |     $ declare -p UID
           |     declare -ir UID="1000"
           | 
           | The -x option is missing.
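           |
           | One possible way around the missing export, sketched on the
           | assumption that Compose reads a .env file next to
           | docker-compose.yml for variable substitution. The function
           | name is made up; USERID and GROUPID are the variable names
           | used earlier in this thread:

```shell
#!/bin/sh
# Write the caller's numeric IDs into an env file that Compose can use
# for ${USERID} / ${GROUPID} substitution. Defaults to ./.env.
# Note: this overwrites the target file, so merge by hand if you
# already keep other variables there.
write_compose_env() {
  env_file="${1:-.env}"
  printf 'USERID=%s\nGROUPID=%s\n' "$(id -u)" "$(id -g)" > "$env_file"
}
```

           | Because it calls id directly, this does not depend on which
           | shell the developer uses or on what that shell exports.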
        
         | StavrosK wrote:
         | This has been a major Docker pain point, and not many people
         | know about this trick. I didn't know you could have the
         | variables in the Compose file directly, does that really work?
         | 
         | Our approach so far was to add yet another layer (a script to
         | pass uid/gid to Compose), but if we don't need the script that
         | would be fantastic.
         | 
         | EDIT: Ah, I just saw the bashrc wrinkle you mention. Yeah,
         | that's why we had the script, and it's a damn shame Docker
         | can't do this natively. It has been a _major_ hassle.
        
           | nickjj wrote:
           | > I didn't know you could have the variables in the Compose
           | file directly, does that really work?
           | 
           | Yep, it's because the build args get read in from a .env file
           | by default and then from there Docker Compose sends those
           | build args to Docker when it builds the image.
           | 
           | This was one of the topics from my talk at DockerCon last
           | week (creating a production ready Docker Compose set up). The
           | video and 6,000 word blog post for it will be coming out
           | tomorrow. Both things will be added to the talk's reference
           | links at
           | https://github.com/nickjj/dockercon21-docker-best-practices.
        
             | StavrosK wrote:
             | That's interesting, thanks! My shell sets the USER variable
             | (but no USERID or GROUPID), which might be good enough for
             | all our developers, but probably not reliable enough for a
             | general audience.
        
               | nickjj wrote:
               | Honestly in practice everything tends to work fine
               | without any hacks or extra scripts.
               | 
               | I run all of my containers as a non-root user and create
               | the user in the image with its default values of
               | 1000:1000 for the uid:gid. I haven't bothered to expose
               | the uid:gid as build arguments because it's pretty much
               | never an issue in development or production.
               | 
               | With a uid:gid of 1000:1000 built into the image any bind
               | mounted files end up being correctly owned by the Docker
               | host's user under the following conditions:
               | 
               | - Docker Desktop on macOS
               | 
               | - Docker Desktop on Windows using WSL 1
               | 
               | - Docker Desktop on Windows using WSL 2 and native Linux
               | (as long as your dev box's user is set to 1000:1000)
               | 
               | IMO it's really rare that your dev box's user wouldn't be
               | 1000:1000 on native Linux or WSL 2.
               | 
               | In production you also have full control over the uid:gid
               | of your deploy user.
               | 
               | The only time where it kind of stinks is CI, but it's
               | super easy to get around this by simply not using volumes
               | in CI.
               | 
               | I have a bunch of examples of this pattern at:
               | - https://github.com/nickjj/docker-flask-example
               | - https://github.com/nickjj/docker-django-example
               | - https://github.com/nickjj/docker-rails-example
               | - https://github.com/nickjj/docker-phoenix-example
               | - https://github.com/nickjj/docker-node-example
               | - https://github.com/oleksandra-holovina/docker-play-example
        
               | q3k wrote:
               | > IMO it's really rare that your dev box's user wouldn't
               | be 1000:1000 on native Linux or WSL 2.
               | 
               | Any company-wide (GNU/)Linux deployment that uses LDAP or
               | some other centralized user directory will not have devs
               | with UID/GID 1000:1000. Hope is not a strategy.
        
               | nickjj wrote:
               | > Any company-wide (GNU/)Linux deployment that uses
               | LDAP...
               | 
               | You can go the extra mile and turn the UID:GID into build
               | args like the original parent and you're good to go. No
               | hacks necessary, and since it's all self contained into a
               | .env file there's nothing extra you need to run since
               | you're likely using an .env file already for other vars.
               | 
               | Alternatively you could do this:
               | https://news.ycombinator.com/item?id=27344491
               | 
               | In either case you can solve the problem without too much
               | effort.
        
               | q3k wrote:
               | > You can go the extra mile and turn the UID:GID into
               | build args like the original parent and you're good to
               | go.
               | 
               | That doesn't help you if you're attempting to use pre-
               | built/existing Docker images that are not built
               | internally and make the assumption that "1000:1000 is
               | good enough". You then not only have to hack around
               | Docker limitations, but also around someone else's broken
               | assumption.
        
               | nickjj wrote:
               | > That doesn't help you if you're attempting to use pre-
               | built/existing Docker images that are not built
               | internally
               | 
               | Most pre-built images that I've come across don't require
               | bind mounts to function.
               | 
               | Images like PostgreSQL aren't affected by this because
               | you can use a named volume, and most pre-built
               | applications that are shipped as images tend to store
               | their state in a database and don't require bind mounts
               | to function.
        
               | dilatedmind wrote:
               | maybe i did something weird last time i installed ubuntu,
               | but my user is 1001:1002 and the default ubuntu user is
               | 1000:1001
        
               | 1_player wrote:
               | IIRC on Arch, unless you create your own group, you're
               | part of the users group, with GID 100
        
               | mschuster91 wrote:
               | > IMO it's really rare that your dev box's user wouldn't
               | be 1000:1000 on native Linux or WSL 2.
               | 
               | Any major company using LDAP/AD or other forms of
               | centralized user management won't be able to make that
               | guarantee.
               | 
               | > In production you also have full control over the
               | uid:gid of your deploy user.
               | 
               | If you're running in an un-managed environment, yes -
               | managed hosting of any kind generally doesn't provide
               | these guarantees.
        
       | staticassertion wrote:
       | It makes sense that mounting a volume requires understanding a
       | user mapping tbh. I think the answer is twofold:
       | 
       | a) Many problems solvable with a volume can be solved with a
       | bind-mount, cache-mount, etc [0].
       | 
       | b) In the event that you actually need to map in a user-file,
       | wrap the docker command in a script that manages the logic. At
       | this point you're writing a system tool that's doing things
       | outside of the context of a container - it's not really docker's
       | fault that it doesn't try to make this trivial.
       | 
       | [0] https://vsupalov.com/buildkit-cache-mount-dockerfile/
        
       | viraptor wrote:
       | > If this user is the "root", then these files will not be
       | accessible from web server or the CGI server, except if the
       | server is running as root
       | 
       | Wait, what? Why not install the immutable files as root and let
       | them be readable to everyone?
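       |
       | A minimal sketch of that idea, assuming GNU or BusyBox chmod.
       | The function name is hypothetical, and handing ownership to root
       | (chown -R root:root) would be a separate step that itself needs
       | root:

```shell
#!/bin/sh
# Make a tree readable (and its directories traversable) by every user
# while leaving it writable only by its owner.
# Capital X adds execute only to directories and to files that are
# already executable, so plain data files do not become executable.
make_world_readable() {
  chmod -R a+rX,go-w "$1"
}
```

       | With the files owned by root and set up this way, a server
       | running as an unprivileged user can read them but not modify
       | them.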
        
       | prpl wrote:
       | This is something Charliecloud was built around for HPC and
       | something podman can work around. User namespaces and fuse-
       | overlayfs are the building blocks to fix this
        
       | tacone wrote:
       | Shameless plug: a boilerplate where I had to solve UID
       | permissions, running as non-root user, publishing files to
       | another container, mounting fs as read only, and hot reloading in
       | dev environment.
       | 
       | It's still pretty much a proof of concept and it relies on docker
       | compose but perhaps some of you may find it useful as a starting
       | point: https://github.com/tacone/loki
        
       | epage wrote:
       | Recently ran into this. So far I've landed on `setfacl`
       | 
       | - `--user` didn't work for me because there were root permissions
       | in my image
       | 
       | - I didn't dig into why `userns-remap` didn't work
       | 
       | - I didn't give https://github.com/boxboat/fixuid a try yet
       | 
       | Some notes from my experience
       | 
       |     setfacl -dm "u:alexandros:rw" ~/alpine
       | 
       | should be
       | 
       |     setfacl -R -dm "u:alexandros:rwx" ~/alpine
       | 
       | In case:
       | 
       | - `-R`: There is existing content in `~/alpine` you want made
       | available
       | 
       | - `x`: You want your container to be able to create directories
       | 
       | However, you can still run into problems if
       | 
       | - Your container copies data from outside your bind-mount to
       | inside. It sort-of worked except somehow the mask was `r--`,
       | making things lose write access.
       | 
       | - Your container moves data from outside your bind-mount to
       | inside. This fully preserves the permissions.
       | 
       | I ended up creating a `.keep` file in the bind mount and doing a
       | `cp --attributes-only --preserve=mode,ownership,xattr .keep
       | <target>`
        
       | hasheddan wrote:
       | Related to this post, a recent runc version included a change
       | that inadvertently made a number of images built on the
       | distroless base image difficult to use:
       | https://danielmangum.com/posts/runc-chdir-to-cwd/
        
       ___________________________________________________________________
       (page generated 2021-05-31 23:01 UTC)