[HN Gopher] Parsing URLs in Python
___________________________________________________________________
Parsing URLs in Python
Author : ibobev
Score : 158 points
Date : 2024-03-16 16:53 UTC (1 day ago)
(HTM) web link (tkte.ch)
(TXT) w3m dump (tkte.ch)
| memco wrote:
| This is intriguing to me for the performance and correctness
| reasons, but also if it makes the result more dev friendly than
| the urllib.parse tuple-ish-object result thing.
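For context, the "tuple-ish" result being criticized is urllib.parse's named-tuple return value, which mixes positional and attribute access; a quick stdlib sketch:

```python
from urllib.parse import urlsplit

# urlsplit returns a SplitResult, a named-tuple subclass: the same
# value supports attribute access, indexing, and unpacking.
parts = urlsplit("https://user@example.com:8080/path?q=1#frag")

print(parts.scheme)    # https
print(parts.hostname)  # example.com (a derived property, not a field)
print(parts[1])        # user@example.com:8080 (the netloc, by index)
scheme, netloc, path, query, fragment = parts  # unpacking works too
```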
| nerdponx wrote:
| I posted this elsewhere in the thread, but there absolutely is
| prior art here. Check out Yarl (urllib.parse wrapper with the
| nicer interface) and Hyperlink (green field, immutable OO
| style, focus on correctness). Both on PyPI for many years now.
| memco wrote:
| Thanks! I have come across yarl, but not hyperlink.
| diarrhea wrote:
| Yeah, that tuple API is bizarre. It really doesn't play well
| with type annotations either.
| Areading314 wrote:
| Hard to imagine the tradeoff of using a third party binary
| library developed this year vs just using urllib.parse being
| worth it. Is this solving a real problem?
| pyuser583 wrote:
| urllib.parse is a pain. We really need something more like
| pathlib.Path.
| masklinn wrote:
| That used to be werkzeug.urls, kinda (it certainly had a more
| convenient API than urllib.parse), but it was killed in
| Werkzeug 3.
| pyuser583 wrote:
| I remember and miss that. But I'm not going to install
| werkzeug just for the url parsing.
| d_kahneman7 wrote:
| Is it that inconvenient?
| Ch00k wrote:
| There is https://github.com/gruns/furl
| AMCMneCy wrote:
| Also https://github.com/aio-libs/yarl
| 4ec0755f5522 wrote:
| I use yarl as my default for this as well, it's been
| great to work with.
| pyuser583 wrote:
| Yes! That's the one I like!
| VagabundoP wrote:
| This library is very pythonic.
| joouha wrote:
| You might be interested in
| https://github.com/fsspec/universal_pathlib
| masklinn wrote:
| According to itself, it's solving the issue of parser-
| differential vulnerabilities: urllib.parse is ad-hoc and
| pretty crummy, and the headliner function "urlparse" is
| literally the one you should not use under any circumstance:
| it follows RFC 1808 (maybe, anyway), which was deprecated by
| RFC 2396 _25 years ago_.
|
| The odds that any other parser uses the same broken semantics
| are basically nil.
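One concrete differential can be sketched with the stdlib: urllib treats a backslash as an ordinary character, while a WHATWG-compliant parser (a browser, or Ada) normalizes "\" to "/" and ends the authority there, so the two disagree about the host:

```python
from urllib.parse import urlsplit

# To urllib the backslash is opaque, so everything up to the first
# "/" is the authority, and the part after "@" becomes the host.
url = "https://example.com\\@evil.com/"
print(urlsplit(url).hostname)  # evil.com

# A WHATWG parser treats "\" like "/", so the authority ends at
# example.com -- a validator and a fetcher using different parsers
# can thus be steered to different hosts.
```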
| Areading314 wrote:
| It seems unlikely that this C++ library written by a solo dev
| is somehow more secure than the Python standard library would
| be for such a security-sensitive task.
| masklinn wrote:
| Not in the sense of differential vulnerabilities, since the
| standard library _refuses_ to match any sort of modern
| standard.
|
| It's also
|
| 1. not a solo dev
|
| 2. Daniel Lemire
|
| 3. a serious engineering _and_ research effort:
| https://arxiv.org/pdf/2311.10533.pdf
| Areading314 wrote:
| This is the commit history:
| https://github.com/TkTech/can_ada/commits/main/
|
| I guess you are right that there are 2 commits from a
| different dev, so it is technically not a solo project. I
| still wouldn't ever use this in production code.
| bqmjjx0kac wrote:
| The can_ada repo threw me off, too. It looks super
| amateurish because of the lack of tests, fuzzers, etc.
|
| But it appears that they've just exported the meat of the
| Ada project and left everything else upstream.
| masklinn wrote:
| ...
|
| can_ada is just the python bindings.
|
| The actual underlying project is at
| https://github.com/ada-url/ada
| TkTech wrote:
| Hi, can_ada (but not ada!) dev here. Ada is over 20k lines
| of well-tested and fuzzed source by 25+ developers, along
| with an accompanying research paper. It is the parser used
| in node.js and parses billions of URLs a day.
|
| can_ada is simply 60 lines of glue and packaging, making it
| available with low overhead to Python.
| Areading314 wrote:
| Ah, that makes more sense -- it might be a good idea to
| integrate with the upstream library as a submodule rather
| than lifting the actual .cpp/.h files into the bindings
| repo. That way people know the upstream C++ code is from
| a much more active project.
|
| Despite my snarky comments, thank you for contributing to
| the Python ecosystem; this does seem like a cool project
| for high-performance URL parsing!
| woodruffw wrote:
| I agree that the stdlib parser is a mess, but as an
| observation: replacing one use of it with a (better!)
| implementation _introduces_ a potential parser differential
| where one didn't exist before. I've seen this issue crop up
| multiple times in real Python codebases, where a well-
| intentioned developer adds a differential by incrementally
| replacing the old, bad implementation.
|
| That's the perverse nature of "wrong but ubiquitous" parsers:
| unless you're confident that your replacement is _complete_ ,
| you can make the situation worse, not better.
| Spivak wrote:
| > unless you're confident that your replacement is complete
|
| And that any 3rd party libs you use also don't ever call
| the stdlib parser internally because you do not want to
| debug why a URL works through some code paths but not
| others.
|
| Turns out that url parsing is a cross-cutting concern like
| logging where libs should defer to the calling code's
| implementation but the Python devs couldn't have known that
| when this module was written.
| yagiznizipli wrote:
| Ada was developed at the end of 2022, and has been included
| in Node.js since March 2023. Since then, Ada has powered
| Node.js, Cloudflare Workers, Redpanda, ClickHouse and many
| more libraries.
| pyuser583 wrote:
| The Ada programming language is cursed with overlapping acronyms.
| GPS, ADA, SPARK, AWS. Seems it just got a little bit worse.
| Solvency wrote:
| It is _mindboggling_ to me how often developers create project
| names without even _trying_ to search for precedent names _in
| their own domain/industry_.
|
| Calling this Ada is just ridiculous.
| pyuser583 wrote:
| Ada needs to fight back. Java templating library! Python
| dependency resolver! Zag image manipulation library! SQL
| event logging persistence framework!
|
| I can't believe Amazon wasn't violating some anti-competition
| rule by using "AWS" for "Amazon Web Services" when it already
| meant "Ada Web Server."
|
| Wow, when I search for "Ada GPS" I get global positioning
| system support libraries before GNAT Programming Studio.
| gjvc wrote:
| yes but they think it's punny/funny/clever
| nine_k wrote:
| It's usually a poor reason. Calling the GNU Image
| Manipulation Program "GIMP" was kinda funny and punny, and
| maybe humble, but not wise.
|
| Inkscape and Krita have less poignant but more reasonable
| names. (But yes, such an approach removes some of the
| teenage fun from doing a project.)
| yagiznizipli wrote:
| Ada developer here. Ada URL parser is named after my daughter
| Ada. We chose this name in particular as a reference to Ada
| Lovelace.
| Hendrikto wrote:
| > We chose this name in particular as a reference to Ada
| Lovelace.
|
| Just like the ~20 other projects named Ada.
| masklinn wrote:
| > Ada is a WHATWG-compliant and fast URL parser written in
| modern C++
|
| Why would you do that, Daniel?
| TkTech wrote:
| Hah :) It's named after Yagiz Nizipli's (Ada dev) newborn
| daughter, Ada.
| yagiznizipli wrote:
| Ada developer here, Ada is the name of my daughter, and this
| project is my gift to her, to remember me.
| pyuser583 wrote:
| Congrats on having a daughter! I hope one day she'll need
| to parse a URL, and be able to see your love for her in her
| code!
| EmilStenstrom wrote:
| At least the name can_ada (canada!) makes it a little bit
| more unique.
| ulrischa wrote:
| Parsing a URL is really a pain in the a*
| SloopJon wrote:
| Another post in this thread was downvoted and flagged (really?)
| for claiming that URL parsing isn't difficult. The linked
| article claims that "Parsing URLs _correctly_ is surprisingly
| hard." As a software tester, I'm very willing to believe that,
| but I don't know that the article really made the case.
|
| I did find a paper describing some vulnerabilities in popular
| URL parsing libraries, including urllib and urllib3. Blog post
| here:
|
| https://claroty.com/team82/research/exploiting-url-parsing-c...
|
| Paper here:
|
| https://web-assets.claroty.com/exploiting-url-parsing-confus...
|
| If you remember the Log4j vulnerability from a couple of years
| ago, that was a URL parsing bug.
| masklinn wrote:
| > If you remember the Log4j vulnerability from a couple of
| years ago, that was a URL parsing bug.
|
| I don't think that's a fair description of the issue.
|
| The log4j vulnerability was that it specifically added JNDI
| support (https://issues.apache.org/jira/browse/LOG4J2-313) to
| property substitution (https://logging.apache.org/log4j/2.x/
| manual/configuration.ht...), which it would apply to logged
| messages. So it was a pretty literal feature of log4j. log4j
| would just pass the URL to JNDI for resolution, and
| substitute the result.
| SloopJon wrote:
| I didn't look into this in detail at the time, but the
| report's summary of CVE-2021-45046 is that the parser that
| validated a URL behaved differently from a separate parser
| used to fetch the URL, so a URL like
| jndi:ldap://127.0.0.1#.evilhost.com:1389/a
|
| is validated as 127.0.0.1, which may be whitelisted, but
| fetched from evilhost.com, which probably isn't.
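A rough Python analogue of that differential, where the "naive fetcher" below is a hypothetical sketch of a parser that ignores fragments and takes everything before the first "/" as the authority:

```python
from urllib.parse import urlsplit

url = "ldap://127.0.0.1#.evilhost.com:1389/a"

# Fragment-aware validator: "#" ends the authority, so the host
# checked against the allowlist is 127.0.0.1.
print(urlsplit(url).hostname)  # 127.0.0.1

# Naive fetcher (hypothetical): takes everything between "//" and the
# first "/" as the authority, fragment included.
authority = url.split("//", 1)[1].split("/", 1)[0]
print(authority)  # 127.0.0.1#.evilhost.com:1389
```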
| bqmjjx0kac wrote:
| Writing a new parser in C++ is a mistake IMO. At the very least,
| you need to write a fuzzer. At best, you should be using one of
| the many memory safe languages available to you.
|
| I retract my criticism if this project is just for fun.
|
| Edit: downvoters, do you disagree?
|
| Edit2: OK, I may have judged a bit prematurely. _Ada_ itself has
| fuzzers and tests. They're just not exported to the _can_ada_
| project.
| yagiznizipli wrote:
| Ada developer here. Ada has more than 5000 tests, is included
| in oss-fuzz project and battle tested in Node.js and Cloudflare
| workers.
| bqmjjx0kac wrote:
| I apologize for the misjudgment. I just followed the link to
| can_ada and saw really minimal tests, e.g.
| https://github.com/TkTech/can_ada/blob/main/tests/test_parsi...
|
| I didn't understand that can_ada is not where the parser is
| developed.
| VagabundoP wrote:
| Okay, so some googling taught me that "xn--" means the rest
| of the hostname is Unicode, but why does "é" become "-fsa" in
| www.xn--googl-fsa.com?
|
| Google failed on the second part.
| whalesalad wrote:
| It's called an IDN. This is an encoding format called Punycode
| that transforms international domains into ASCII.
| js2 wrote:
| Because that's the Punycode representation:
|
| https://en.wikipedia.org/wiki/Punycode
|
| https://www.punycoder.com/
| andy99 wrote:
| I wasn't aware of this, I'd seen those URLs before but only
| in the context of Chinese ones and thought it was Chinese-
| specific.
|
| It's interesting because I just went down an apparent rabbit
| hole implementing byte-level encoding for using language
| models with Unicode. There, each byte in a Unicode character
| is mapped to a printable character with roughly 255 <
| ord(x) < 511 (I don't remember the highest, but the point is
| that each byte is mapped to another printable Unicode
| character).
|
| See https://github.com/openai/gpt-2/blob/9b63575ef42771a015060c9...
|
| And the actual list of characters:
|
| https://github.com/rbitr/llm.f90/blob/dev/phi2/phi2/pretoken...
| ekimekim wrote:
| To expand on the sibling comments: This encoding (called
| Punycode) works by combining the character to encode (e) and
| the position the character should be in (the 7th position out
| of a possible 7) into a single number. e is 233, there are 7
| possible positions, and it is in position 6 (0-indexed) so that
| single number is 233 * 7 + 6 = 1637. This is then encoded via a
| fairly complex variable-length encoding scheme into the letters
| "fsa".
|
| See https://en.wikipedia.org/wiki/Punycode#Encoding_the_non-ASCI...
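The arithmetic above can be checked against Python's built-in codecs, which implement both raw Punycode and the IDNA wrapping that adds the "xn--" prefix:

```python
# Raw Punycode: the basic code points, a "-" delimiter, then the
# encoded deltas for the non-ASCII insertions.
assert "googlé".encode("punycode") == b"googl-fsa"

# The idna codec applies Punycode per label and adds the "xn--"
# prefix only to labels that need it.
assert "www.googlé.com".encode("idna") == b"www.xn--googl-fsa.com"
assert b"xn--googl-fsa".decode("idna") == "googlé"
```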
| pixelesque wrote:
| In case anyone else is confused as to why the domain in the
| example provided needs to be unicode (compared to the filename
| which is obvious): it's because the hyphen is the shorter '-'
| char, which is extended ASCII 226 not the standard '-' (which
| would be ASCII 45).
| ezequiel-garzon wrote:
| The first character you pasted is U+2011 (8209 in decimal);
| it does not appear in the document and cannot be ASCII, as it
| is beyond code point 127/0x7F. Also, U+2011 is meant to be a
| non-breaking hyphen.
| ks2048 wrote:
| This system seems pretty weird to me.
|
| I was wondering, can that clash with a "normal" domain
| registered as "xn--....."? Apparently there is another specific
| rule in RFC 5891 saying "The Unicode string MUST NOT contain
| "--" (two consecutive hyphens) in the third and fourth
| character positions" [0]
|
| Also, if I was forced to represent Unicode as ASCII, punycode
| encoding is not the obvious one - it's pretty confusing. But, I
| don't know much about how and why it was chosen, so I assume
| there's good reason.
|
| [0]
| https://datatracker.ietf.org/doc/html/rfc5891#section-4.2.3....
| lathiat wrote:
| I mean, yeah, but the odds of someone using "xn--" on the
| start of a domain are pretty small. The double dash is pretty
| uncommon.
| xg15 wrote:
| IDNs and Punycode were basically bolted-on extensions to DNS
| that were added after DNS was already widely deployed.
| Because there was no "proper" extension mechanism available,
| it was a design requirement that they can be implemented "on
| top" of the standard DNS without having to change any of the
| underlying components. So I think most of the DNS
| infrastructure can be (and is) still completely unaware that
| IDNs and Punycode exists.
|
| Actually, I wonder what happens if you take a "normal" (i.e.
| non-IDN, ascii-only) domain and encode it as Punycode. Should
| the encoded and non-encoded domains be considered identical
| or separate? (for purposes of DNS resolutions, origin
| separation, etc)
|
| Identical would be more intuitive and would match the
| behavior of domain names with non-ascii characters - on the
| other hand, this would require reworking of ALL non-punycode-
| aware DNS software, which I'm doubtful is possible.
|
| So this seems like a tricky thing to get right.
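A sketch of that question using the stdlib codecs: force-encoding an all-ASCII label still yields a distinct string, so at the plain-DNS level the two names would be separate; IDNA-aware software sidesteps the issue by never encoding labels that are already ASCII:

```python
# Raw Punycode of an all-ASCII label just appends the delimiter, so a
# hypothetical "xn--example-" would be a different DNS name than
# "example".
assert "example".encode("punycode") == b"example-"

# The idna codec leaves already-ASCII labels untouched, which is why
# Punycode-unaware DNS infrastructure never sees a difference.
assert "example.com".encode("idna") == b"example.com"
```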
| Zambyte wrote:
| Regarding the quality of Google search results - I copied this
| comment verbatim into GPT 3.5, Claude 1, and Mistral small (the
| lowest quality LLMs from each provider available through Kagi)
| and each one explained Punycode encoding.
| atorodius wrote:
| FWIW I find this is the perfect question for ChatGPT/Gemini.
| Whenever the knowledge is somewhere on the web but hard to
| Google, I use these LLMs.
|
| In this case, Gemini correctly points to Punycode
| bormaj wrote:
| Why not just use `httpx`? If you're not bound to the stdlib,
| it's a great alternative to `requests` and urllib.parse.
| kmike84 wrote:
| The URL parsing in httpx is rfc3986, which is not the same as
| WHATWG URL living standard.
|
| rfc3986 may reject URLs which browsers accept, or it can handle
| them in a different way. WHATWG URL living standard tries to
| put on paper the real browser behavior, so it's a much better
| standard if you need to parse URLs extracted from real-world
| web pages.
| gjvc wrote:
| httpx is great, and "needs to be in base"
| orf wrote:
| No it doesn't, absolutely not. It's ironic that you say this
| after the post you're commenting on spells out quite
| explicitly why things "in base" are hard to change and adapt.
| gjvc wrote:
| where do you draw the line with this approach?
| ojbyrne wrote:
| Looking at the github site for can_ada, I discovered that the
| developers live in Montreal, Canada. Nice one.
| TkTech wrote:
| I do! :) I should probably also never be allowed to name
| things.
| kmike84 wrote:
| A great initiative!
|
| We need a better URL parser in Scrapy, for similar reasons. Speed
| and WHATWG standard compliance (i.e. do the same as web browsers)
| are the main things.
|
| It's possible to get closer to WHATWG behavior by using urllib
| and some hacks. This is what https://github.com/scrapy/w3lib
| does, which Scrapy currently uses. But it's still not quite
| compliant.
|
| Also, surprisingly, on some crawls URL parsing can take CPU
| amounts similar to HTML parsing.
|
| Ada / can_ada look very promising!
| TkTech wrote:
| can_ada dev here. Scrapy is a fantastic project, we used it
| extensively at 360pi (now Numerator), making trillions of
| requests. Let me know if I can help :)
| JulianWasTaken wrote:
| Nice.
|
| I'll also throw in that I recently wrote bindings to
| Mozilla's Servo URL library.
|
| Those live at https://github.com/crate-py/url
|
| They're not complete yet (meaning only the parsing bits are
| exposed, not URL modification) but I too was frustrated with the
| state of URL parsing.
| timhh wrote:
| IMO that URL crate is not especially high quality. I barely
| work with URLs and I quickly found an embarrassingly trivial
| bug:
|
| https://github.com/servo/rust-url/issues/864#issuecomment-16...
| phyzome wrote:
| Honestly to hell with the WHATWG's weird pseudo-standard.
| Backslashes? Five forward slashes? _No one_ is sending those URLs
| around, and if they are, they should fix it. The only thing that
| needs to deal with that kind of malformed URL is the browser's
| address bar. If browsers want to do fixups on broken URLs at that
| point, they should feel free to, but it shouldn't pollute an
| entire spec.
|
| (It's not even a real standard -- it's a "living standard", which
| means it just changes randomly and there's no way to actually say
| "yes, you're compliant with that".)
| domenicd wrote:
| It's always been extremely funny to me how arguments like this
| and from the curl author go. "Yes, I had to change curl away
| from strictly accepting two slashes, for web compatibility. But
| `while(slash) { advance_parser() }`? That's completely
| unreasonable! `while(slash && ++i <= 3)` is so much better, and
| works almost as well!" Ok, whatever...
|
| As for your claim about living standards, I'd encourage you to
| read https://whatwg.org/faq#living-standard
| nerdponx wrote:
| No mention of prior art? The Hyperlink library has stated
| correctness as its goal for a long time:
| https://pypi.org/project/hyperlink
|
| Of course there is always room for new projects, but it still
| feels weird to act as if this is the first time anybody has ever
| tried this. It seems like a lot of people are under this same
| mistaken impression, at least according to the sample of HN users
| who commented in this thread.
| TkTech wrote:
| Hyperlink (which I didn't know existed, by the way) is not a
| parser for the WHATWG spec, it's for RFC3986. You seem to be
| getting things confused.
| nerdponx wrote:
| The article makes it sound like the only parser for URLs in
| the entire Python ecosystem is urllib.parse, regardless of
| which spec it supports. Hyperlink and Yarl are absolutely
| prior art here IMO, and at least deserve a mention in an
| article like this.
| abstractbslayer wrote:
| Pretty URLs are the most unnecessary thing ever invented:
| they don't provide anything that non-pretty URLs can't, and
| millions of parsers have to process them on every request.
| What a waste.
| edflsafoiewq wrote:
| What is a "pretty url"?
| d0mine wrote:
| There are people who are not native English speakers. They
| might find Unicode useful.
| d0mine wrote:
| stdlib behavior may be preferable:
|
| - resolving "../" may have security implications
|
| - unicode hostnames seem more readable. To get punycode, one
| can call .encode("idna") if necessary.
|
| How often is URL parsing a performance bottleneck?
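Both points can be illustrated with a minimal stdlib sketch:

```python
from urllib.parse import urljoin, urlsplit

# "../" resolution clamps at the root per RFC 3986 -- convenient, but
# it means a checked path prefix can be escaped before resolution.
assert urljoin("https://example.com/a/b/", "../../etc/passwd") == \
    "https://example.com/etc/passwd"

# urllib keeps the hostname in Unicode; punycode is one encode away.
host = urlsplit("https://bücher.example/").hostname
assert host == "bücher.example"
assert host.encode("idna") == b"xn--bcher-kva.example"
```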
___________________________________________________________________
(page generated 2024-03-17 23:02 UTC)