[HN Gopher] Show HN: Morgan - PyPI Mirror for Restricted/Offline...
       ___________________________________________________________________
        
       Show HN: Morgan - PyPI Mirror for Restricted/Offline Environments
        
       Mirroring PyPI packages for environments/networks that do not have
       access to the Internet is hard. It's actually hard even in
       environments that do have access to the Internet. Most solutions
       out there either:

       1. Depend on pip to download and cache package distributions. This
          means those downloads will probably only work in a similar
          environment (same Python interpreter, same libc), because of
          the nature of binary package distributions and the fact that
          packages have optional dependencies for different environments.

       2. Depend on other PyPI packages, meaning installing the mirror in
          a restricted environment in itself is too difficult.

       3. Cannot resolve dependencies of dependencies, meaning mirroring
          PyPI partially is extremely difficult, and PyPI is huge.

       Morgan works differently. It creates a mirror based on a
       configuration file that defines target environments (using
       Python's standard Environment Markers specification from PEP 345)
       and a list of package requirement strings (e.g.
       "requests>=2.24.0"). It downloads all files relevant to the target
       environments from PyPI (both source and binary distributions), and
       recursively resolves and downloads their dependencies, again based
       on the target environments. It then extracts a single-file server
       to the mirror directory that works with Python 3.7+, has no
       outside dependencies, and implements the standard Simple API. This
       directory can be copied to the restricted network, through
       whatever security policies are in place, and deployed easily with
       a simple `python server.py` command.

       I should note that Morgan can find dependencies from various
       metadata sources inside package distributions, including standard
       METADATA/PKG-INFO/pyproject.toml files, and non-standard files
       such as setuptools' requires.txt.
       There's more information in the Git repository. If this is
       interesting to you, I'll be happy to receive your feedback.
       Thanks!
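
       The recursive, environment-aware resolution described in the post
       can be sketched roughly as follows. This is a minimal illustration
       and not Morgan's actual code: the package names, the `TOY_INDEX`
       table and the `resolve` function are hypothetical, and real
       environment markers are reduced here to a single optional platform
       tag per dependency.

```python
from collections import deque

# Hypothetical in-memory stand-in for package metadata. A real mirror would
# read dependencies from METADATA/PKG-INFO files inside each distribution.
# Each dependency is (name, platform); platform None means "always needed".
TOY_INDEX = {
    "app": [("http-lib", None), ("win-only-lib", "win32")],
    "http-lib": [("tls-lib", None)],
    "tls-lib": [],
    "win-only-lib": [],
}


def resolve(requirements, target_platforms, index):
    """Return every package needed by `requirements` on the target platforms."""
    needed = set()
    queue = deque(requirements)
    while queue:
        name = queue.popleft()
        if name in needed:
            continue  # already visited; also guards against cycles
        needed.add(name)
        for dep, only_on in index.get(name, []):
            # Follow a dependency only if it applies to some target platform.
            if only_on is None or only_on in target_platforms:
                queue.append(dep)
    return needed
```

       With only a Linux target, `resolve(["app"], {"linux"}, TOY_INDEX)`
       skips the Windows-only package; adding "win32" to the targets
       pulls it in.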
        
       Author : idop
       Score  : 67 points
       Date   : 2022-09-23 12:39 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | Galanwe wrote:
       | Maybe I'm confused about what this offers, but I have been
       | running private pypi repositories for a decade now, and it never
       | required more than running an HTTP server with directory listing.
       | 
       | As for doing partial mirroring of pypi with only what you are
       | using, is that really a good idea anyway? it will break whenever
       | you add or change any dependency.
        
         | colpabar wrote:
         | Out of curiosity, how do you run yours?
        
         | jamescampbell wrote:
         | Came here to say this. I run private pypi repositories for this
         | use case and it works fine. I've had to thumbdrive over all of
         | our dependencies from the wheels etc. A single bash script that
         | runs all the checks and downloads and zips to the offline
         | environment then use your pip install like normal with the
         | login creds to your offline pypi registry.
        
         | idop wrote:
         | The problem isn't really on the serving side, it's on the
         | mirroring side. Trying to mirror PyPI - at its current 13.4 TB
         | size[1] - and bringing all those terabytes into a restricted
         | network with security policies and no access to the internet,
         | is impossible. Partial mirror is the only way to go for such a
         | use case, and given that Morgan automatically resolves and
         | mirrors dependencies, adding a new dependency shouldn't break
         | anything.
         | 
         | [1] https://pypi.org/stats/
        
           | vasco wrote:
           | Can't you resolve the dependencies by running pip download
           | when you have internet and later serving that directory with
           | a local HTTP server as the parent suggested? Pip download
           | will resolve all the dependencies for you already the same
           | way as pip install would.
        
             | nijave wrote:
             | You can download as source packages instead of wheels but
             | then you need to make sure you have all the requisite
             | compilers and libraries needed. This isn't an issue for
             | Python-only dependencies but can be difficult for
             | dependencies with lots of native code like numpy/pandas
             | where you need a C toolchain & Fortran toolchain installed
             | (and possibly other libs)
             | 
             | If you're using something like Docker/containers, you can
             | download the dependencies inside the container and be
             | reasonably sure you get the right wheels. This becomes
             | trickier when you have different setups like developers on
             | Windows and production on Linux.
        
             | idop wrote:
             | No, as I mention both in this post and in the README. Pip
             | will download binary distributions (wheels) that were
             | compiled for the system it is running on. If my mirror is
             | meant to serve a different version of Python installed on a
             | different OS with a different libc (or other such
             | differences), then it won't work. I could try to match the
             | target environment on the mirroring side, say with Docker,
             | but this is either cumbersome or still not possible if you
             | have legacy environments from years before.
        
               | jpitz wrote:
               | By default it will download wheels for the system it is
               | running on, but there are knobs to tweak that.
               | 
               | https://pip.pypa.io/en/stable/cli/pip_download/#cmdoption
               | -pl...
               | 
               | https://peps.python.org/pep-0425/
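
               The "knobs" mentioned above are the target-selection
               options of `pip download` (`--platform`,
               `--python-version`, `--implementation`, `--abi`). A
               minimal sketch of assembling such a command is shown
               below; the helper name `pip_download_args` and the example
               values are assumptions for illustration. pip requires
               `--only-binary=:all:` (or `--no-deps`) when these options
               are used, so packages that ship no matching wheel will
               fail to download.

```python
import sys


def pip_download_args(requirement, dest, platform, python_version,
                      implementation="cp", abi=None):
    """Build an argv list for `pip download` aimed at a foreign environment."""
    args = [
        sys.executable, "-m", "pip", "download",
        "--dest", dest,
        # pip insists on binary-only (or --no-deps) when targeting another
        # platform, because it cannot build sdists for a foreign target.
        "--only-binary=:all:",
        "--platform", platform,
        "--python-version", python_version,
        "--implementation", implementation,
    ]
    if abi is not None:
        args += ["--abi", abi]
    args.append(requirement)
    return args
```

               The resulting list can be passed to `subprocess.run`, for
               example `pip_download_args("requests>=2.24.0", "mirror",
               "manylinux2014_x86_64", "3.8")`.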
        
               | vasco wrote:
               | Makes sense, and you're right: we did encounter issues when
               | changing platforms at the time when we were using some
               | self-rolled janky version of this! Thanks
        
       | jvolkman wrote:
       | This looks similar to some Bazel rules I'm working on. I'm also
       | using the approach of defining target environments up front [1],
       | but the main difference is that I'm currently offloading the
       | actual resolution process to Poetry or PDM, which both generate
       | cross-platform lock files.
       | 
       | But Poetry and PDM don't add build dependencies to lock files -
       | which I need - so I'm thinking of building a custom resolver.
       | 
       | Did you consider using resolvelib [2], which is what underlies
       | both pip and PDM?
       | 
       | [1]
       | https://github.com/jvolkman/rules_pycross/blob/main/examples...
       | 
       | [2] https://github.com/sarugaku/resolvelib
        
         | idop wrote:
         | By the way, Poetry's dependency resolution isn't that great. It
         | doesn't properly evaluate optional dependencies. For example,
         | when I try to install pymongo on Linux, it will insist on
         | installing pywin32 as well, even though it is completely
         | irrelevant. It's given me a lot of headaches.
        
         | idop wrote:
         | I didn't know about resolvelib, looks interesting, I'll have to
         | give it a deeper look, thanks.
        
       | danrocks wrote:
       | When I worked at Microsoft, one team created a big solution for
       | an e-commerce customer using Kubernetes, Helm charts, etc.
       | Beautiful.
       | 
       | Then I had to take it to run in mainland China.
       | 
       | Nope.
        
       | indrora wrote:
       | Oh neat. Not only do I share a name with a project, it's a
       | project I was seriously thinking of starting.
        
         | idop wrote:
         | :) Naming projects is hard, so I tend to give it as little
         | thought as possible. I was playing Red Dead Redemption 2 while
         | writing the first version of this so I just named it after
         | Arthur Morgan, the main protagonist.
        
           | arthurcolle wrote:
           | I figured it was someone lamenting working at Morgan Stanley
           | for not letting you pull in dependencies without a lot of red
           | tape ;)
        
       | mofeing wrote:
       | hey,
       | 
       | We were running into the same problem (supercomputer with
       | clusters of different architecture and no outgoing connections
       | permitted) and so we created "pypickup" [1,2]. nice to see that
       | we came up with similar solutions! I have some questions:
       | 
       | 1. is the directory of packages you create compatible with
       | PEP 503? (so I can use the `--index-url file://PATH_TO_LOCAL_CACHE`
       | flag with pip and it should work)
       | 
       | 2. is there some filtering mechanism? e.g. we are not interested
       | in non-release versions ("dev" versions, "rc" versions, "post"
       | versions, ...)
       | 
       | 3. I guess that the way morgan resolves dependencies is by
       | manually parsing files like "pyproject.toml" or
       | "requirements.txt" and it does not ask the build-system for the
       | dependencies. if so...
       |
       |   - does "morgan" detect build-dependencies?
       |   - which build-systems are compatible?
       |   - is "morgan" capable of detecting more complex dependency
       |     specifications? e.g. "oldest-supported-numpy" which is used
       |     by "scipy" has dependency strings like the following:
       |     numpy==1.19.2; python_version=='3.8' and
       |     platform_machine=='aarch64' and
       |     platform_python_implementation != 'PyPy'
       | 
       | kudos for the good work
       | 
       | [1] https://pypi.org/project/pypickup/
       |
       | [2] https://github.com/UB-Quantic/pypickup
        
         | idop wrote:
         | Too bad your project didn't come up in any of my searches while
         | researching this problem. Probably because it doesn't use the
         | word "mirror" at all :)
         | 
         | As for your questions:
         | 
         | 1. I don't see any mention of directory structures in PEP 503.
         | The Morgan server does implement PEP 503 though. In any case, I
         | tried installing now straight from the directory and it didn't
         | work. Are you sure you meant PEP 503?
         | 
         | 2. Where Morgan differs from pypickup, as I can see, is that it
         | interprets requirement strings as per PEP 508 (e.g.
         | "requests>=2.40.0; python_version < '3.8'") instead of
         | providing a command such as `pypickup add requests`. For every
         | requirement string, it looks for the latest version in PyPI
         | that satisfies it, and downloads that version. You can filter
         | _in_ the requirement strings, other than that Morgan doesn't
         | have any specific handling of dev/rc/etc.
         | 
         | 3. Morgan detects and downloads the build system based either
         | on the [build-system] section of pyproject.toml, or the
         | setup_requires.txt file (from setuptools). These are the
         | sources currently supported. It doesn't actually care what the
         | build system is, it simply attempts to find where it is defined
         | and download it as well.
         | 
         | As for complex dependency specifications, yes, they are
         | supported and honored (Morgan relies on the "packaging" library
         | to properly evaluate those). By the way, I recently moved from
         | Poetry to Hatch for managing the Morgan project itself
         | specifically because I got fed up with Poetry not honoring
         | those specifications, and trying to download completely
         | irrelevant packages.
        
           | mofeing wrote:
           | Well, we first named it "pypi-cache" but there is a package
           | named "pypicache" from the year 2007 and we had to rename it.
           | We always thought of it as a "cache" rather than a
           | "mirror"... but yes, "mirror" is more appropriate. Btw we
           | released it just 1 week ago which is also maybe why you did
           | not find it.
           | 
           | 1. Well, the flag "--index-url" explicitly says that "...
           | should point to a repository compliant with PEP 503 (the
           | simple repository API) or a local directory laid out in the
           | same format". PEP 503 defines the directory structure where
           | there is a folder per package, an "index.html" on the root
           | with a link to each package and *an "index.html" in each
           | package folder that has a link per available file*.
           | 
           | URLs are not limited to "https", they can also be relative
           | paths. So the trick we do is to download the file to the
           | folder of the package and add an anchor to that file in the
           | "index.html" of the package. For example,
           | 
           | If you go to https://pypi.org/simple/numpy, you will find
           | links like the following: <a href="https://files.pythonhosted
           | .org/packages/f6/d8/ab692a75f584d1..." data-requires-
           | python=">=3.8">numpy-1.22.4.zip</a>
           | 
           | But we download it and write, <a href="./numpy-1.22.4.zip"
           | data-requires-python=">=3.8">numpy-1.22.4.zip</a>
           | 
           | This is especially important for us because we cannot set up
           | any kind of server.
           | 
           | 2. Okay nice. Yep, we thought that parsing would be more
           | difficult and that relying on parsing would be problematic
           | due to the different build-systems and that many packages
           | still do not have the "pyproject.toml" file. We opted for a
           | manual approach in which you do "pypickup add" until you have
           | no more "dependency missing" errors. Your approach looks much
           | better to me, but like you said is limited to
           | "pyproject.toml" and "setuptools" right now.
           | 
           | Btw, does it also download extra dependencies?
           | 
           | 3. Nice. I also stopped using Poetry for things like that,
           | but now I manually write my "pyproject.toml" with
           | "setuptools".
           | 
           | I like the idea of trying to parse the dependencies. I will
           | probably try something but since we download all files
           | (filtering some of them), it would be more costly. Maybe in
           | some weeks when I'm more free.
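
           The server-less layout described in point 1 of the comment
           above (one folder per package, an "index.html" at the root,
           and an "index.html" per package with relative links) can be
           sketched as follows. This is an illustrative assumption of how
           such a generator might look, not pypickup's code: the function
           names are hypothetical, and the sketch omits PEP 503 name
           normalization, hash fragments, and the data-requires-python
           attribute.

```python
import html
from pathlib import Path


def _page(links):
    """Wrap anchor lines in a minimal HTML document."""
    return f"<!DOCTYPE html>\n<html><body>\n{links}\n</body></html>\n"


def write_simple_index(root):
    """Write index.html files so `root` resembles a static simple index."""
    root = Path(root)
    projects = sorted(p for p in root.iterdir() if p.is_dir())
    for project in projects:
        files = sorted(
            f.name for f in project.iterdir()
            if f.is_file() and f.name != "index.html"
        )
        # Relative hrefs, so the links resolve without any HTTP server.
        links = "\n".join(
            f'<a href="./{html.escape(name)}">{html.escape(name)}</a><br>'
            for name in files
        )
        (project / "index.html").write_text(_page(links), encoding="utf-8")
    root_links = "\n".join(
        f'<a href="./{html.escape(p.name)}/">{html.escape(p.name)}</a><br>'
        for p in projects
    )
    (root / "index.html").write_text(_page(root_links), encoding="utf-8")
    return [p.name for p in projects]
```

           Calling `write_simple_index("mirror")` on a directory that
           contains `numpy/numpy-1.22.4.zip` writes a root index linking
           to `./numpy/` and a package index linking to
           `./numpy-1.22.4.zip`.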
        
             | idop wrote:
             | Ahh, I get it, it needs index.html files. I can easily
             | implement this, but I actually did want the server because
             | I wanted it to be easily accessible from multiple machines,
             | I also wanted to implement the JSON API, and also want (in
             | an upcoming version) to allow uploading private packages to
             | the mirror.
             | 
             | As for extra dependencies, yes, they will be mirrored, but
             | only if relevant, i.e. if they are included in a
             | requirement string (be it a direct requirement or a
             | dependency of a dependency).
        
       | skbly7 wrote:
       | Thanks for creating it and looking forward to trying it out.
       | 
       | I have been looking for a similar solution and the whitelist used
       | to fail with other tools as they weren't resolving the
       | dependencies.
        
       | hackish wrote:
       | Thanks for posting this. I'm going to give setting up Morgan a
       | shot when I've got some free cycles.
       | 
       | I'd hesitantly accepted the risk of serving a devpi server over
       | vsock and into my (personal) restricted VLAN. I did so because
       | using a shared folder meant I'd need to have cached the module
       | and any dependencies from my internet-connected VLAN first.
       | 
       | Combined with debmirror[0], vscodeoffline[1], and some nightly
       | snatcher shell scripts, I think I have most of my needs covered.
       | 
       | [0] https://help.ubuntu.com/community/Debmirror
       | 
       | [1] https://github.com/LOLINTERNETZ/vscodeoffline
        
       ___________________________________________________________________
       (page generated 2022-09-23 23:01 UTC)