[HN Gopher] Parsing millions of URLs per Second (2023)
___________________________________________________________________
Parsing millions of URLs per Second (2023)
Author : PaulHoule
Score : 141 points
Date : 2024-12-23 23:48 UTC (23 hours ago)
(HTM) web link (onlinelibrary.wiley.com)
(TXT) w3m dump (onlinelibrary.wiley.com)
| notamy wrote:
| The title seems to have a few words missing. Original title:
|
| > Parsing millions of URLs per second
| HL33tibCe7 wrote:
| HN's stupid/arrogant automatic title rewriter strikes again
| PaulHoule wrote:
| Fixed
| ignoramous wrote:
| So, surpassing 80k karma, one gets title edit rights?
| PaulHoule wrote:
| I think anybody can edit a title within a short time of
| posting something. Or if there is a karma threshold it is
| way less than 80k.
|
| I caught that one manually but YOShInOn's tail end needs
| some love and could be updated so that it fixes up
| titles that get mashed automatically or adds a comment
| sometimes to editorialize or provide an archive link.
| kristianp wrote:
| I've never noticed a title being rewritten automatically when
| posting an article. Are you sure that's really a thing?
| pests wrote:
| There are some auto rewrite rules. Off the top of my head:
| numbers in the beginning are stripped, [pdf] or [video] can
| be added to the end, and one more I can't remember that
| gets stripped off beginning and can cause confusion.
|
| A pdf link to "5 Reasons To Do Things" will be "Reasons To
| Do Things [pdf]" for example.
| Tomte wrote:
| "How" at the beginning is stripped, leading to all these
| strange-sounding "I <verb>" submissions.
| Tomte wrote:
| Yes. And the algorithm is really incredibly stupid, but
| dang is opposed to even small improvements (like showing
| the changed title on submission beforehand, like the "x
| characters too long" message).
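|
| The rewrite rules described in this subthread can be sketched
| roughly as below. This is only an illustration of what the
| commenters report (leading number stripped, leading "How"
| stripped, "[pdf]" appended for PDF links); it is not HN's actual
| code, and the function name `rewrite_title` and the `.pdf`
| suffix check are assumptions.

```python
import re


def rewrite_title(title: str, url: str = "") -> str:
    """Hypothetical sketch of the title rules reported in the thread."""
    title = title.strip()
    # Reported rule: a number at the beginning is stripped.
    title = re.sub(r"^\d+\s+", "", title)
    # Reported rule: "How" at the beginning is stripped.
    title = re.sub(r"^How\s+", "", title)
    # Reported rule: "[pdf]" can be added to the end (assumed here to
    # be triggered by a URL ending in ".pdf").
    if url.lower().endswith(".pdf") and not title.lower().endswith("[pdf]"):
        title += " [pdf]"
    return title


print(rewrite_title("5 Reasons To Do Things", "https://example.com/reasons.pdf"))
print(rewrite_title("How I built a parser"))
```

| With these assumed rules, the first call reproduces the example
| given above ("Reasons To Do Things [pdf]") and the second shows
| how an "I <verb>" title can arise.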
| Cicero22 wrote:
| This sort of work is something I wouldn't be able to do, but I
| can't help but point out at least one potential issue with the
| paper. It's a lot easier to find problems than solutions I guess.
|
| Are the benchmarks comparing node versions valid to conclude a
| real world performance increase?
|
| One possible confounder is the version of V8.
|
| https://github.com/nodejs/node/blob/v18.x/deps/v8/include/v8...
| https://github.com/nodejs/node/blob/v20.x/deps/v8/include/v8...
|
| Ideally, they would've patched Node 18.15 with their changes
| directly and tested their patch against 18.15.
| beached_whale wrote:
| Found the arXiv link, which is easier to read/download:
|
| https://arxiv.org/abs/2311.10533
| maartenscholl wrote:
| I had a lot of fun writing low latency parsers for various
| message standards in C++. There are a lot of fun things you can do
| when you can take ownership of the read buffer and you can figure
| out how to parse in-situ (modifying the data in place as you move
| along)
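|
| As a small illustration of the in-situ idea (not the commenter's
| C++ code, and applied to URL percent-decoding rather than any
| message standard), the sketch below decodes a buffer in place with
| a read index and a write index. Because decoded output is never
| longer than the input, the write index never overtakes the read
| index. The function names are hypothetical.

```python
def _hex_value(b: int) -> int:
    """Return the value of an ASCII hex digit, or -1 if it is not one."""
    if 0x30 <= b <= 0x39:      # '0'-'9'
        return b - 0x30
    if 0x41 <= b <= 0x46:      # 'A'-'F'
        return b - 0x41 + 10
    if 0x61 <= b <= 0x66:      # 'a'-'f'
        return b - 0x61 + 10
    return -1


def percent_decode_in_place(buf: bytearray) -> int:
    """Decode %XX escapes inside buf itself; return the decoded length."""
    read = write = 0
    n = len(buf)
    while read < n:
        byte = buf[read]
        if byte == 0x25 and read + 2 < n:  # '%' followed by two more bytes
            hi, lo = _hex_value(buf[read + 1]), _hex_value(buf[read + 2])
            if hi >= 0 and lo >= 0:
                buf[write] = (hi << 4) | lo
                write += 1
                read += 3
                continue
        # Ordinary byte, or a malformed escape that is copied verbatim.
        buf[write] = byte
        write += 1
        read += 1
    return write


data = bytearray(b"/a%20b%2Fc%zz")
length = percent_decode_in_place(data)
del data[length:]  # drop the now-unused tail of the buffer
print(bytes(data))
```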
| Thorrez wrote:
| > For example, the input string
| http://xn--6qqa088eba.xn--3ds443g/./a/../b/./c should be normalized
| to the string https://xn--xn6qqa088eba-l19f.xn--xn3ds-zu3b/b/c
|
| Why would normalization change http:// to https:// ?
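|
| For comparison, here is a minimal sketch of path normalization
| alone, assuming the only intended change is removal of the "." and
| ".." segments. It uses Python's standard urllib.parse to split the
| URL; `remove_dot_segments` and `normalize` are hypothetical helpers
| written for this example, not the paper's implementation. Scheme
| and host pass through untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Collapse "." and ".." path segments (in the spirit of RFC 3986, 5.2.4)."""
    out = []
    segments = path.split("/")
    for i, seg in enumerate(segments):
        last = i == len(segments) - 1
        if seg == ".":
            if last:
                out.append("")  # keep the trailing slash
        elif seg == "..":
            if len(out) > 1:    # never pop the leading empty segment
                out.pop()
            if last:
                out.append("")
        else:
            out.append(seg)
    return "/".join(out)


def normalize(url: str) -> str:
    """Normalize only the path; scheme and host are left as given."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme, parts.netloc, remove_dot_segments(parts.path),
         parts.query, parts.fragment)
    )


print(normalize("http://xn--6qqa088eba.xn--3ds443g/./a/../b/./c"))
# prints http://xn--6qqa088eba.xn--3ds443g/b/c
```

| Under that assumption the scheme stays http:// and the host is
| unchanged, which is what the question above expects.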
| ramon156 wrote:
| Because it's 2024
| marginalia_nu wrote:
| http:// is not a typo for https://. There's still a fairly
| large amount of web servers that do not talk https, and you
| simply cannot assume that they do. That will leave you with a
| lot of dead links. Besides, most that accept both will auto-
| renegotiate to https.
| TacticalCoder wrote:
| > There's still a fairly large amount of web servers that
| do not talk https, and you simply cannot assume that they
| do.
|
| OTOH I've been browsing for _years_ forcing HTTPS only and life
| goes on fine. If worst comes to worst, I can
| use archive.is or archive.org but it's very rare that I
| need that.
|
| Basically: if a link is HTTP to me it's not worth opening.
|
| The one exception would be Debian package URLs: but these
| are signed and the signatures are verified.
|
| User __apt_ is the only one allowed to emit HTTP traffic.
|
| This prevents my ISP or anyone else injecting nasty stuff.
| forgotmypw17 wrote:
| Just because it is accessible to you does not mean it is
| accessible to everyone else. HTTPS has many failure modes
| which make it unreliable for essential access, such as
| time mismatches, certificate expirations, ssl version
| mismatches, etc. Security and privacy are important, and
| they are also not absolute. Sometimes the risk is
| outweighed by the importance of being able to access
| essential resources and reading material.
| Analemma_ wrote:
| User preferences should not be encoded into _parser_
| behavior, that's nuts. You wouldn't just arbitrarily
| change an ftp:// link to an imap:// link, so why would
| you accept it here? That exists at a whole other layer of
| the stack.
| dmd wrote:
| They would arbitrarily change an ftp:// link to an
| sftp:// link and then complain that it didn't work.
| chrismorgan wrote:
| There's _got_ to be some accidental mangling there.
| _Somewhere._ Because of that error, and still more because of
| the blatant error in the next sentence:
|
| > _For example, given the base
| string http://example.org/foo/bar, the relative string
| http://example.com/ leads to the final URL
| http://example.org/example.com/._
|
| That's just... no. I do not believe I have ever encountered any
| software which would parse it in that way, and I refuse to
| believe such software _ever_ existed. It would be
| <http://example.com/>.
|
| But the PDF matches the HTML. I dunno, _something_ weird is
| going on. Look at the hyperlinks there, too, "http://xn--ivg
| but not the rest of the URL that follows, and how the -- has
| been changed to -. Something went wrong _somewhere_ in the
| editing or publication.
| silvestrov wrote:
| My guess is that the html formatter changed the text
| "example.com" into "http://example.com" to make it a valid
| absolute URL.
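|
| Both readings can be checked with a standard resolver. The sketch
| below uses Python's urllib.parse.urljoin, which follows RFC 3986
| reference resolution rather than the WHATWG URL standard the paper
| targets, so it is only an approximation. The scheme-less reference
| "example.com/" is an assumption based on the guess above.

```python
from urllib.parse import urljoin

base = "http://example.org/foo/bar"

# An absolute reference replaces the base entirely.
print(urljoin(base, "http://example.com/"))
# prints http://example.com/

# A scheme-less reference is treated as a relative path and merged
# with the base path (the last segment "bar" is dropped first).
print(urljoin(base, "example.com/"))
# prints http://example.org/foo/example.com/
```

| The first result matches chrismorgan's expectation. The second is
| close to the paper's stated output but keeps the /foo/ directory,
| so the quoted example may also depend on a different base path.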
| youngtaff wrote:
| Lemire's blog is well worth a read if you're interested in this
| sort of thing https://lemire.me/blog/
| grayhatter wrote:
| I wonder how much time was spent promoting this parser, vs time
| spent on writing it? I've seen a lot of spam for this one, and
| I'm not the only one.
|
| https://daniel.haxx.se/blog/2023/11/21/url-parser-performanc...
| yagiznizipli wrote:
| Almost a year of development, 3 months of writing the paper. All of
| the benchmarks are public. Run it before sharing someone else's
| blog post.
___________________________________________________________________
(page generated 2024-12-24 23:01 UTC)