       [HN Gopher] Rewriting the Lexer Benchmark in Rust
       ___________________________________________________________________
        
       Rewriting the Lexer Benchmark in Rust
        
       Author : ibobev
       Score  : 24 points
       Date   : 2022-05-30 14:08 UTC (8 hours ago)
        
 (HTM) web link (eli.thegreenplace.net)
 (TXT) w3m dump (eli.thegreenplace.net)
        
       | vlmutolo wrote:
       | I'd like to see an "owned" version where the tokens hold
       | something like a CompactString [0] instead of a &str or String,
       | where CompactString is just some type that uses the small-string
       | optimization and usually avoids heap allocation. This could
       | result in a lifetime-free API with probably only slight
       | performance overhead compared to the &str version.
       | 
       | It would also be interesting to see how smol_str [1] stacks up,
       | since it was built with tokens in mind. Though I'm not sure how
       | helpful it would be in this specific case; one of its primary
       | advantages seems to be that it stores whitespace compactly, and I
       | don't think the author of the article is preserving whitespace.
       | 
       | [0]: https://docs.rs/compact_str
       | 
       | [1]: https://docs.rs/smol_str
        
         | Measter wrote:
         | Another alternative would be the use of a string interner, with
         | the tokens storing the interner ID.
         | 
         | Advantages would be that the token type can stay small and
         | Copy, while not having a lifetime to carry around.
         | 
         | Disadvantages would be the overhead of the interning, which
         | would slow down lexing, and you'd need to drag around the
         | interner to anywhere you need the actual string.
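
The interner alternative described above might look roughly like this. It is a sketch under assumptions: `Symbol`, `Interner`, `Token`, and `lex_interned` are hypothetical names, and a production interner would avoid storing each string twice.

```rust
use std::collections::HashMap;

/// Small, Copy handle that stands in for an interned string.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Symbol(u32);

/// Hypothetical interner: maps each distinct string to a stable ID and back.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, Symbol>,
    strings: Vec<String>,
}

impl Interner {
    /// Returns the existing ID for `s`, or assigns a new one.
    fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&sym) = self.ids.get(s) {
            return sym;
        }
        let sym = Symbol(self.strings.len() as u32);
        self.strings.push(s.to_string());
        self.ids.insert(s.to_string(), sym);
        sym
    }

    /// Looks the string back up; this is why the interner has to be passed around.
    fn resolve(&self, sym: Symbol) -> &str {
        &self.strings[sym.0 as usize]
    }
}

/// The token stays small and Copy, with no lifetime parameter.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token {
    Identifier(Symbol),
    Plus,
}

/// Toy lexer: splits on whitespace and interns every identifier.
fn lex_interned(src: &str, interner: &mut Interner) -> Vec<Token> {
    src.split_whitespace()
        .map(|word| match word {
            "+" => Token::Plus,
            other => Token::Identifier(interner.intern(other)),
        })
        .collect()
}
```

Both trade-offs from the comment show up here: every identifier costs a hash lookup during lexing, and any code that wants the text has to call `resolve` on the same `Interner`.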
        
           | vlmutolo wrote:
           | I wonder if string interning is advantageous for a lexer,
           | where you know the strings you'd want to intern ahead-of-time
           | (AoT). If you have reserved words, those will probably end up
           | as enum variants in "Token". And things you can't know AoT
           | are less likely to be amenable to interning, like comments
           | and string literals. Tokenization already includes something
           | like a manual string interning process.
           | 
           | Originally I had the same thought as you, and that's why I
           | hunted down the smol_str library. I knew that Aleksey used it
           | in his rust-analyzer parser and figured it was an interner.
           | 
           | But then I saw it wasn't (other than for whitespace, kinda)
           | and started to wonder whether string interning fits this
           | problem.
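
The point above about reserved words ending up as enum variants can be illustrated with a short sketch. The `Token` type, the keyword set, and `classify` are hypothetical and are not taken from the linked article. Matching a word against the known keywords acts like a hand-written interner for strings known ahead of time, while everything else stays a borrowed slice of the source.

```rust
/// Keywords become data-free variants; only unknown words carry text.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Token<'a> {
    If,
    Else,
    While,
    Identifier(&'a str),
}

/// Maps a word to its keyword variant, or keeps it as a borrowed identifier.
fn classify(word: &str) -> Token<'_> {
    match word {
        "if" => Token::If,
        "else" => Token::Else,
        "while" => Token::While,
        other => Token::Identifier(other),
    }
}
```

After this step, comparing two keyword tokens is a cheap enum comparison rather than a string comparison, which is much of what an interner would otherwise provide.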
        