[HN Gopher] Parsing millions of URLs per Second (2023)
       ___________________________________________________________________
        
       Parsing millions of URLs per Second (2023)
        
       Author : PaulHoule
       Score  : 141 points
       Date   : 2024-12-23 23:48 UTC (23 hours ago)
        
 (HTM) web link (onlinelibrary.wiley.com)
 (TXT) w3m dump (onlinelibrary.wiley.com)
        
       | notamy wrote:
       | The title seems to have a few words missing. Original title:
       | 
       | > Parsing millions of URLs per second
        
         | HL33tibCe7 wrote:
         | HN's stupid/arrogant automatic title rewriter strikes again
        
           | PaulHoule wrote:
           | Fixed
        
             | ignoramous wrote:
             | So, surpassing 80k karma, one gets title edit rights?
        
               | PaulHoule wrote:
               | I think anybody can edit a title within a short time of
               | posting something. Or if there is a karma threshold it is
               | way less than 80k.
               | 
               | I caught that one manually but YOShInOn's tail end needs
               | some love and could be updated so that it fixes up
               | titles that get mashed automatically or adds a comment
               | sometimes to editorialize or provide an archive link.
        
           | kristianp wrote:
           | I've never noticed a title being rewritten automatically when
           | posting an article. Are you sure that's really a thing?
        
             | pests wrote:
             | There are some auto rewrite rules. Off the top of my head:
             | numbers in the beginning are stripped, [pdf] or [video] can
             | be added to the end, and one more I can't remember that
             | gets stripped off the beginning and can cause confusion.
             | 
             | A pdf link to "5 Reasons To Do Things" will be "Reasons To
             | Do Things [pdf]" for example.
        
               | Tomte wrote:
               | "How" at the beginning is stripped, leading to all these
               | strange-sounding "I <verb>" submissions.
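
A minimal Python sketch of the rewrite rules as the commenters above
describe them. This is not HN's actual code: the function name
`rewrite_title` and the `is_pdf` flag are hypothetical, and the real
rule set is probably larger than these three rules.

```python
import re


def rewrite_title(title: str, is_pdf: bool = False) -> str:
    """Apply the title rules mentioned in the thread (illustrative only)."""
    # Rule 1 (per the thread): a number at the beginning is stripped.
    title = re.sub(r"^\d+\s+", "", title)
    # Rule 2 (per the thread): a leading "How" is stripped.
    title = re.sub(r"^How\s+", "", title)
    # Rule 3 (per the thread): "[pdf]" can be added to the end.
    if is_pdf and not title.lower().endswith("[pdf]"):
        title += " [pdf]"
    return title


print(rewrite_title("5 Reasons To Do Things", is_pdf=True))
# Reasons To Do Things [pdf]
print(rewrite_title("How I Built a Parser"))
# I Built a Parser
```

Under these assumed rules, a title with no leading number and no
leading "How" passes through unchanged.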
        
             | Tomte wrote:
             | Yes. And the algorithm is really incredibly stupid, but
             | dang is opposed to even small improvements (like showing
             | the changed title on submission beforehand, like the "x
             | characters too long" message).
        
       | Cicero22 wrote:
       | This sort of work is something I wouldn't be able to do, but I
       | can't help but point out at least one potential issue with the
       | paper. It's a lot easier to find problems than solutions I guess.
       | 
       | Are the benchmarks comparing node versions valid to conclude a
       | real world performance increase?
       | 
       | One possible confounder is the version of V8.
       | 
       | https://github.com/nodejs/node/blob/v18.x/deps/v8/include/v8...
       | https://github.com/nodejs/node/blob/v20.x/deps/v8/include/v8...
       | 
       | Ideally, they would've patched Node 18.15 with their changes
       | directly and tested their patch against stock 18.15.
        
       | beached_whale wrote:
       | Found the easier-to-read/download arXiv link:
       | 
       | https://arxiv.org/abs/2311.10533
        
       | maartenscholl wrote:
       | I had a lot of fun writing low-latency parsers for various
       | message standards in C++. There are a lot of fun things you can
       | do when you can take ownership of the read buffer and figure out
       | how to parse in situ (modifying the data in place as you move
       | along).
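
To illustrate the in-place idea from the comment above, here is a small
sketch in Python (not the commenter's C++ code). Percent-decoding is
chosen only as an example task; the helper name
`percent_decode_in_place` is hypothetical. The write cursor never
overtakes the read cursor, so the decoded bytes can overwrite the input
buffer without any extra allocation.

```python
# Map ASCII codes of hex digits to their numeric value.
_HEX = {ord(c): int(c, 16) for c in "0123456789abcdefABCDEF"}


def percent_decode_in_place(buf: bytearray) -> memoryview:
    """Decode %XX escapes inside buf itself; return a view of the result."""
    read = write = 0
    n = len(buf)
    while read < n:
        if buf[read] == 0x25 and read + 2 < n:  # 0x25 is "%"
            hi = _HEX.get(buf[read + 1])
            lo = _HEX.get(buf[read + 2])
            if hi is not None and lo is not None:
                # Three input bytes collapse into one output byte.
                buf[write] = (hi << 4) | lo
                read += 3
                write += 1
                continue
        # Ordinary byte (or malformed escape): copy it through unchanged.
        buf[write] = buf[read]
        read += 1
        write += 1
    # The decoded data occupies the first `write` bytes of the buffer.
    return memoryview(buf)[:write]


data = bytearray(b"a%20b%2Fc%zz%4")
print(bytes(percent_decode_in_place(data)))  # b'a b/c%zz%4'
```

The function only works because the caller owns a mutable buffer, which
is the precondition the comment mentions.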
        
       | Thorrez wrote:
       | > For example, the input string
       | http://xn--6qqa088eba.xn--3ds443g/./a/../b/./c should be
       | normalized to the string
       | https://xn--xn6qqa088eba-l19f.xn--xn3ds-zu3b/b/c
       | 
       | Why would normalization change http:// to https:// ?
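
As a point of comparison for the question above, here is a hedged
sketch of path normalization using Python's standard `urllib.parse`.
Note the assumptions: `urllib.parse` follows RFC 3986 rather than the
WHATWG URL standard the paper targets, it does no IDNA processing of
the host, and the helper name `normalize_path` is hypothetical. Under
these assumptions, only the dot segments change; the scheme and host
are left as they were.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize_path(url: str) -> str:
    """Remove "." and ".." path segments, keeping everything else."""
    parts = urlsplit(url)
    # Resolving the URL's own path against the URL makes urljoin apply
    # RFC 3986 dot-segment removal.
    clean_path = urlsplit(urljoin(url, parts.path)).path
    # Only the path is replaced; scheme, host, query, fragment are kept.
    return urlunsplit(parts._replace(path=clean_path))


url = "http://xn--6qqa088eba.xn--3ds443g/./a/../b/./c"
print(normalize_path(url))
# http://xn--6qqa088eba.xn--3ds443g/b/c
```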
        
         | ramon156 wrote:
         | Because it's 2024
        
           | marginalia_nu wrote:
           | http:// is not a typo for https://. There's still a fairly
           | large amount of web servers that do not talk https, and you
           | simply cannot assume that they do. That will leave you with a
           | lot of dead links. Besides, most that accept both will auto-
           | renegotiate to https.
        
             | TacticalCoder wrote:
             | > There's still a fairly large amount of web servers that
             | do not talk https, and you simply cannot assume that they
             | do.
             | 
             | OTOH I've been browsing for _years_ forcing HTTPS only and
             | life goes on fine. If worst comes to worst, I can use
             | archive.is or archive.org, but it's very rare that I need
             | that.
             | 
             | Basically: if a link is HTTP, to me it's not worth opening.
             | 
             | The one exception would be Debian package URLs, but these
             | are signed and the signatures are verified.
             | 
             | User __apt_ is the only one allowed to emit HTTP traffic.
             | 
             | This prevents my ISP or anyone else injecting nasty stuff.
        
               | forgotmypw17 wrote:
               | Just because it is accessible to you does not mean it is
               | accessible to everyone else. HTTPS has many failure modes
               | which make it unreliable for essential access, such as
               | time mismatches, certificate expirations, ssl version
               | mismatches, etc. Security and privacy are important, and
               | they are also not absolute. Sometimes the risk is
               | outweighed by the importance of being able to access
               | essential resources and reading material.
        
               | Analemma_ wrote:
               | User preferences should not be encoded into _parser_
               | behavior, that's nuts. You wouldn't just arbitrarily
               | change an ftp:// link to an imap:// link, so why would
               | you accept it here? That exists at a whole other layer of
               | the stack.
        
               | dmd wrote:
               | They would arbitrarily change an ftp:// link to an
               | sftp:// link and then complain that it didn't work.
        
         | chrismorgan wrote:
         | There's _got_ to be some accidental mangling there.
         | _Somewhere._ Because of that error, and still more because of
         | the blatant error in the next sentence:
         | 
         | > _For example, given the base string
         | http://example.org/foo/bar, the relative string
         | http://example.com/ leads to the final URL
         | http://example.org/example.com/._
         | 
         | That's just... no. I do not believe I have ever encountered any
         | software which would parse it in that way, and I refuse to
         | believe such software _ever_ existed. It would be
         | <http://example.com/>.
         | 
         | But the PDF matches the HTML. I dunno, _something_ weird is
         | going on. Look at the hyperlinks there, too, "http://xn--ivg
         | but not the rest of the URL that follows, and how the -- has
         | been changed to -. Something went wrong _somewhere_ in the
         | editing or publication.
        
           | silvestrov wrote:
           | My guess is that the html formatter changed the text
           | "example.com" into "http://example.com" to make it a valid
           | absolute URL.
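
The two readings discussed above can be checked with a short sketch
using Python's standard `urllib.parse.urljoin`. This follows RFC 3986
reference resolution, which is an assumption here: the paper targets
the WHATWG URL standard, and this is not the paper's parser. The
variable names are illustrative.

```python
from urllib.parse import urljoin

BASE = "http://example.org/foo/bar"

# An absolute reference carries its own scheme and host, so it
# replaces the base entirely.
absolute = urljoin(BASE, "http://example.com/")
print(absolute)    # http://example.com/

# A scheme-less reference is treated as a relative path and is
# resolved against the directory of the base path.
schemeless = urljoin(BASE, "example.com/")
print(schemeless)  # http://example.org/foo/example.com/
```

Only the scheme-less form ends up under example.org, which is
consistent with the guess that the scheme was added to the relative
string during formatting.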
        
       | youngtaff wrote:
       | Lemire's blog is well worth a read if you're interested in this
       | sort of thing https://lemire.me/blog/
        
       | grayhatter wrote:
       | I wonder how much time was spent promoting this parser, vs time
       | spent on writing it? I've seen a lot of spam for this one, and
       | I'm not the only one.
       | 
       | https://daniel.haxx.se/blog/2023/11/21/url-parser-performanc...
        
         | yagiznizipli wrote:
         | Almost a year of development, 3 months of writing the paper. All
         | of the benchmarks are public. Run them before sharing someone
         | else's blog post.
        
       ___________________________________________________________________
       (page generated 2024-12-24 23:01 UTC)