[HN Gopher] Parsing URLs in Python
       ___________________________________________________________________
        
       Parsing URLs in Python
        
       Author : ibobev
       Score  : 78 points
       Date   : 2024-03-16 16:53 UTC (6 hours ago)
        
 (HTM) web link (tkte.ch)
 (TXT) w3m dump (tkte.ch)
        
       | memco wrote:
       | This is intriguing to me for the performance and correctness
        | reasons, but also if it makes the result more dev-friendly than
        | the urllib.parse tuple-ish-object result thing.
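A minimal stdlib sketch of the "tuple-ish-object" result being discussed (the URL here is just an illustration):

```python
from urllib.parse import urlparse

# urlparse returns a ParseResult, a namedtuple subclass: fields are
# reachable both by name and by tuple index, and the object is immutable.
result = urlparse("https://example.com/path?x=1#frag")
print(result.netloc)   # example.com
print(result[1])       # example.com (same field, by index)

# Changing a component means going through namedtuple._replace rather
# than a pathlib.Path-style fluent API.
print(result._replace(path="/other").geturl())
# https://example.com/other?x=1#frag
```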
        
       | Areading314 wrote:
       | Hard to imagine the tradeoff of using a third party binary
       | library developed this year vs just using urllib.parse being
       | worth it. Is this solving a real problem?
        
         | pyuser583 wrote:
          | urllib.parse is a pain. We really need something more like
         | pathlib.Path.
        
           | masklinn wrote:
           | That used to be werkzeug.urls, kinda (it certainly had a more
           | convenient API than urllib.parse), but it was killed in
           | Werkzeug 3.
        
             | pyuser583 wrote:
             | I remember and miss that. But I'm not going to install
             | werkzeug just for the url parsing.
        
               | d_kahneman7 wrote:
               | Is it that inconvenient?
        
           | Ch00k wrote:
           | There is https://github.com/gruns/furl
        
             | AMCMneCy wrote:
             | Also https://github.com/aio-libs/yarl
        
             | pyuser583 wrote:
             | Yes! That's the one I like!
        
             | VagabundoP wrote:
             | This library is very pythonic.
        
           | joouha wrote:
           | You might be interested in
           | https://github.com/fsspec/universal_pathlib
        
         | masklinn wrote:
          | According to itself, it's solving the issue of parser-
          | differential vulnerabilities: urllib.parse is ad hoc and
          | pretty crummy, and the headliner function "urlparse" is
          | literally the one you should not use under any circumstance:
          | it follows RFC 1808 (maybe, anyway), which was deprecated by
          | RFC 2396 _25 years ago_.
         | 
         | The odds that any other parser uses the same broken semantics
         | are basically nil.
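A concrete stdlib-only example of the legacy RFC 1808 semantics: urlparse still splits the old `;parameters` field out of the path, which urlsplit (and modern WHATWG-style parsers) do not treat specially.

```python
from urllib.parse import urlparse, urlsplit

url = "http://example.com/name;v=1?q=2"

# urlparse follows the old RFC 1808 model and strips ';v=1' out of
# the path into a separate 'params' field.
p = urlparse(url)
print(p.path, "|", p.params)   # /name | v=1

# urlsplit keeps the path intact, as modern parsers do.
s = urlsplit(url)
print(s.path)                  # /name;v=1
```

Two parsers in the same standard library already disagree on what the path of this URL is.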
        
           | Areading314 wrote:
           | It seems unlikely that this C++ library written by a solo dev
           | is somehow more secure than the Python standard library would
           | be for such a security-sensitive task.
        
             | masklinn wrote:
             | Not in the sense of differential vulnerabilities, since the
             | standard library _refuses_ to match any sort of modern
             | standard.
             | 
             | It's also
             | 
             | 1. not a solo dev
             | 
             | 2. Daniel Lemire
             | 
             | 3. a serious engineering _and_ research effort:
             | https://arxiv.org/pdf/2311.10533.pdf
        
               | Areading314 wrote:
               | This is the commit history:
               | https://github.com/TkTech/can_ada/commits/main/
               | 
               | I guess you are right that there are 2 commits from a
               | different dev, so it is technically not a solo project. I
               | still wouldn't ever use this in production code.
        
               | bqmjjx0kac wrote:
               | The can_ada repo threw me off, too. It looks super
               | amateurish because of the lack of tests, fuzzers, etc.
               | 
               | But it appears that they've just exported the meat of the
               | Ada project and left everything else upstream.
        
               | masklinn wrote:
               | ...
               | 
               | can_ada is just the python bindings.
               | 
               | The actual underlying project is at
               | https://github.com/ada-url/ada
        
             | TkTech wrote:
             | Hi, can_ada (but not ada!) dev here. Ada is over 20k lines
             | of well-tested and fuzzed source by 25+ developers, along
             | with an accompanying research paper. It is the parser used
             | in node.js and parses billions of URLs a day.
             | 
              | can_ada is simply 60 lines of glue and packaging, making
              | Ada available to Python with low overhead.
        
           | woodruffw wrote:
           | I agree that the stdlib parser is a mess, but as an
           | observation: replacing one use of it with a (better!)
           | implementation _introduces_ a potential parser differential
           | where one didn't exist before. I've seen this issue crop up
           | multiple times in real Python codebases, where a well-
           | intentioned developer adds a differential by incrementally
           | replacing the old, bad implementation.
           | 
           | That's the perverse nature of "wrong but ubiquitous" parsers:
            | unless you're confident that your replacement is _complete_,
           | you can make the situation worse, not better.
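A contrived sketch of the kind of differential being described, with a hypothetical sloppy check standing in for the "old, bad implementation": everything before `@` in the authority is userinfo, not the host, so the two readings disagree.

```python
from urllib.parse import urlsplit

# A URL crafted so that a naive string-based check disagrees with a
# real parser about which host will be contacted.
url = "https://good.example@evil.example/path"

# Hypothetical legacy check: grab the "host" by string slicing.
naive_host = url.split("/")[2].split("@")[0]
print(naive_host)                # good.example

# A real parser treats good.example as userinfo.
print(urlsplit(url).hostname)    # evil.example
```

If one code path validates with the first reading and another fetches with the second, the differential is exploitable.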
        
             | Spivak wrote:
             | > unless you're confident that your replacement is complete
             | 
             | And that any 3rd party libs you use also don't ever call
             | the stdlib parser internally because you do not want to
             | debug why a URL works through some code paths but not
             | others.
             | 
             | Turns out that url parsing is a cross-cutting concern like
             | logging where libs should defer to the calling code's
             | implementation but the Python devs couldn't have known that
             | when this module was written.
        
       | pyuser583 wrote:
       | The Ada programming language is cursed with overlapping acronyms.
       | GPS, ADA, SPARK, AWS. Seems it just got a little bit worse.
        
         | Solvency wrote:
         | It is _mindboggling_ to me how often developers create project
          | names without even _trying_ to search for precedent names _in
          | their own domain/industry_.
         | 
         | Calling this Ada is just ridiculous.
        
           | pyuser583 wrote:
           | Ada needs to fight back. Java templating library! Python
           | dependency resolver! Zag image manipulation library! SQL
           | event logging persistence framework!
           | 
           | I can't believe Amazon wasn't violating some anti-competition
           | rule by using "AWS" for "Amazon Web Services" when it already
           | meant "Ada Web Server."
           | 
            | Wow, when I search for "Ada GPS" I get global positioning
            | system support libraries before GNAT Programming Studio.
        
           | gjvc wrote:
           | yes but they think it's punny/funny/clever
        
         | masklinn wrote:
         | > Ada is a WHATWG-compliant and fast URL parser written in
         | modern C++
         | 
         | Why would you do that, Daniel?
        
           | TkTech wrote:
           | Hah :) It's named after Yagiz Nizipli's (Ada dev) newborn
           | daughter, Ada.
        
         | EmilStenstrom wrote:
         | At least the can_ada (canada!) makes it a little bit more
         | unique.
        
       | ulrischa wrote:
        | Parsing a URL is really a pain in the a*
        
         | SloopJon wrote:
         | Another post in this thread was downvoted and flagged (really?)
         | for claiming that URL parsing isn't difficult. The linked
          | article claims that "Parsing URLs _correctly_ is surprisingly
          | hard." As a software tester, I'm very willing to believe that,
         | but I don't know that the article really made the case.
         | 
         | I did find a paper describing some vulnerabilities in popular
         | URL parsing libraries, including urllib and urllib3. Blog post
         | here:
         | 
         | https://claroty.com/team82/research/exploiting-url-parsing-c...
         | 
         | Paper here:
         | 
         | https://web-assets.claroty.com/exploiting-url-parsing-confus...
         | 
          | If you remember the Log4j vulnerability from a couple of years
          | ago, that was a URL parsing bug.
        
           | masklinn wrote:
            | > If you remember the Log4j vulnerability from a couple of
            | > years ago, that was a URL parsing bug.
           | 
           | I don't think that's a fair description of the issue.
           | 
           | The log4j vulnerability was that it specifically added JNDI
           | support (https://issues.apache.org/jira/browse/LOG4J2-313) to
            | property substitution (https://logging.apache.org/log4j/2.x/manual/configuration.ht...),
            | which it would apply on logged messages. So it was a pretty
            | literal feature of log4j. log4j
           | would just pass the URL to JNDI for resolution, and
           | substitute the result.
        
             | SloopJon wrote:
             | I didn't look into this in detail at the time, but the
              | report's summary of CVE-2021-45046 is that the parser that
              | validated a URL behaved differently from the separate
              | parser used to fetch it, so a URL like
              | jndi:ldap://127.0.0.1#.evilhost.com:1389/a
             | 
             | is validated as 127.0.0.1, which may be whitelisted, but
             | fetched from evilhost.com, which probably isn't.
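The strict reading of that URL can be checked with the stdlib (the laxer Java-side parser from the CVE is of course not reproduced here):

```python
from urllib.parse import urlsplit

# The CVE-2021-45046 shape: '#' terminates the authority for a strict
# parser, so the host validates as 127.0.0.1, while the laxer fetching
# parser treated the whole 127.0.0.1#.evilhost.com:1389 run as the host.
parsed = urlsplit("ldap://127.0.0.1#.evilhost.com:1389/a")
print(parsed.hostname)  # 127.0.0.1
print(parsed.fragment)  # .evilhost.com:1389/a
```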
        
       | bqmjjx0kac wrote:
       | Writing a new parser in C++ is a mistake IMO. At the very least,
       | you need to write a fuzzer. At best, you should be using one of
       | the many memory safe languages available to you.
       | 
       | I retract my criticism if this project is just for fun.
       | 
       | Edit: downvoters, do you disagree?
       | 
        | Edit2: OK, I may have judged a bit prematurely. _Ada_ itself has
        | fuzzers and tests. They're just not exported to the _can_ada_
       | project.
        
       | VagabundoP wrote:
        | Okay so some googling found me that the "xn--" means the rest of
        | the hostname will be Unicode, but why does é become -fsa in
        | www.xn--googl-fsa.com?
       | 
       | Google failed on the second part.
        
         | whalesalad wrote:
          | It's called an IDN. This is an encoding format called Punycode
          | that transforms international domains into ASCII.
        
         | js2 wrote:
         | Because that's the Punycode representation:
         | 
         | https://en.wikipedia.org/wiki/Punycode
         | 
         | https://www.punycoder.com/
        
         | ekimekim wrote:
          | To expand on the sibling comments: This encoding (called
          | Punycode) works by combining the character to encode (é, code
          | point 233) and the position it should be inserted at into a
          | single delta. After the 5 basic characters "googl" there are 6
          | possible insertion positions, and é goes at position 5
          | (0-indexed), so the delta is (233 - 128) * 6 + 5 = 635. This
          | is then encoded via a fairly complex variable-length encoding
          | scheme into the letters "fsa".
         | 
          | See https://en.wikipedia.org/wiki/Punycode#Encoding_the_non-ASCI...
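The encoding above can be checked with codecs Python already ships:

```python
# Python has codecs for both layers: "punycode" performs the raw delta
# encoding, while "idna" also adds the "xn--" ACE prefix and handles
# each dot-separated label separately.
assert "googlé".encode("punycode") == b"googl-fsa"
assert "googlé".encode("idna") == b"xn--googl-fsa"
print("www.googlé.com".encode("idna"))  # b'www.xn--googl-fsa.com'
```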
        
       | bormaj wrote:
        | Why not just use `httpx`? If you're not bound to the stdlib, it's
        | a great alternative to `requests` and URL parsing.
        
         | kmike84 wrote:
          | The URL parsing in httpx follows RFC 3986, which is not the
          | same as the WHATWG URL Living Standard.
          | 
          | RFC 3986 may reject URLs that browsers accept, or handle them
          | differently. The WHATWG URL Living Standard tries to put real
          | browser behavior on paper, so it's a much better standard if
          | you need to parse URLs extracted from real-world web pages.
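One illustrative divergence, shown with the stdlib only (httpx's RFC 3986 parser is not reproduced here): per the WHATWG standard, browsers treat backslashes in special-scheme URLs as slashes, so `http:\\example.com\path` loads from example.com, while strict RFC-style parsing sees no authority at all.

```python
from urllib.parse import urlsplit

# Literal input: http:\\example.com\path
# A WHATWG parser normalizes this to http://example.com/path;
# RFC-style parsing finds no '//' and so no host.
s = urlsplit("http:\\\\example.com\\path")
print(repr(s.netloc))  # ''
print(s.path)          # \\example.com\path
```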
        
         | gjvc wrote:
         | httpx is great, and "needs to be in base"
        
       | ojbyrne wrote:
       | Looking at the github site for can_ada, I discovered that the
       | developers live in Montreal, Canada. Nice one.
        
         | TkTech wrote:
         | I do! :) I should probably also never be allowed to name
         | things.
        
       | kmike84 wrote:
       | A great initiative!
       | 
       | We need a better URL parser in Scrapy, for similar reasons. Speed
       | and WHATWG standard compliance (i.e. do the same as web browsers)
       | are the main things.
       | 
       | It's possible to get closer to WHATWG behavior by using urllib
       | and some hacks. This is what https://github.com/scrapy/w3lib
       | does, which Scrapy currently uses. But it's still not quite
       | compliant.
       | 
        | Also, surprisingly, on some crawls URL parsing can consume CPU
        | time comparable to HTML parsing.
       | 
       | Ada / can_ada look very promising!
        
         | TkTech wrote:
         | can_ada dev here. Scrapy is a fantastic project, we used it
         | extensively at 360pi (now Numerator), making trillions of
         | requests. Let me know if I can help :)
        
       ___________________________________________________________________
       (page generated 2024-03-16 23:00 UTC)