[HN Gopher] Show HN: Gosax - A high-performance SAX XML parser f...
       ___________________________________________________________________
        
       Show HN: Gosax - A high-performance SAX XML parser for Go
        
       I've just released gosax, a new Go library for high-performance SAX
       (Simple API for XML) parsing. It's designed for efficient,
       memory-conscious XML processing, drawing inspiration from quick-xml
       and pkg/json.

       https://github.com/orisano/gosax

       Key features:

       - Read-only SAX parsing
       - Highly efficient parsing using techniques inspired by quick-xml
         and pkg/json
       - SWAR (SIMD Within A Register) optimizations for fast text
         processing

       gosax is particularly useful for processing large XML files or
       streams without loading the entire document into memory. It's
       well-suited for data feeds, large configuration files, or any
       scenario where XML parsing speed is crucial.

       I'd appreciate any feedback, especially from those working with
       large-scale XML processing in Go. What are your current pain points
       with XML parsing? How could gosax potentially help your projects?
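
For readers unfamiliar with the SWAR technique named in the feature list, here is a minimal sketch of the general idea: scanning 8 bytes per step for a delimiter such as `<` using plain 64-bit integer arithmetic. This is not taken from gosax; `indexByteSWAR` and the constants `lo` and `hi` are hypothetical names chosen for illustration, and the zero-byte test is the classic bit trick rather than anything the library is confirmed to use.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	lo = 0x0101010101010101 // 0x01 in every byte lane
	hi = 0x8080808080808080 // 0x80 in every byte lane
)

// indexByteSWAR returns the index of the first occurrence of c in b, or -1.
// (Hypothetical helper for this sketch, not part of any library.)
func indexByteSWAR(b []byte, c byte) int {
	pattern := lo * uint64(c) // c repeated in all eight lanes
	i := 0
	for ; i+8 <= len(b); i += 8 {
		// After the XOR, every lane that held c is 0x00.
		w := binary.LittleEndian.Uint64(b[i:]) ^ pattern
		// Zero-byte test: the lowest set bit of m marks the first zero lane.
		if m := (w - lo) & ^w & hi; m != 0 {
			return i + bits.TrailingZeros64(m)/8
		}
	}
	// Fewer than eight bytes remain: fall back to a byte-at-a-time scan.
	for ; i < len(b); i++ {
		if b[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	fmt.Println(indexByteSWAR([]byte("hello, world <tag>"), '<'))
}
```

The word is loaded little-endian so that the lowest lane corresponds to the earliest byte, which is what makes the trailing-zero count usable as an index.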
        
       Author : orisano
       Score  : 46 points
       Date   : 2024-06-27 03:17 UTC (19 hours ago)
        
        
       | artpar wrote:
       | I upvoted you just because I made a golang library with the same
       | name but different purpose
       | 
       | https://github.com/artpar/gosax/
       | 
       | It's a high-performance Go implementation of _S_ymbolic
       | _A_ggregate appro_X_imation.
        
       | lanstin wrote:
       | Nice. I like the event based/callback based parsing tools for XML
       | a lot. A little more cognitive work up front but much more
       | efficient. A little sad if unsurprised that XML is still a thing
       | in 2024, but if you have to read it, use a streaming parser.
        
         | glenjamin wrote:
         | If you've ever tried to read data from an XLSX file, you'll
         | find that streaming XML parsing is quite beneficial
         | 
         | And the world runs on Excel files.
        
         | IshKebab wrote:
         | I really hate SAX. Callback based parsing is really
         | unergonomic, and means you always have to code an explicit
         | state machine. You can't use your control flow as implicit
         | state.
         | 
         | It's like choosing to use `.then()` instead of `await`. I
         | seriously don't understand why it is so popular in the XML
         | world when pull based parsing is much easier to use and surely
         | just as efficient? Just brain damaged Java design patterns
         | maybe?
        
           | lanstin wrote:
           | Because of messages that are larger than I want to allocate.
           | Explicit state machines force one to think through the
           | problem, and they force the solution to be one pass over the
           | input data. I am almost never forced to use Java, so I'm
           | unsure about that reference.
        
             | IshKebab wrote:
             | Pull parsers can deal with arbitrarily large messages too.
             | And they also do one pass over the input data.
             | 
             | Yeah if you're unfamiliar, SAX is like this (pseudocode):
             | 
             |     interface SAXCallbacks {
             |       void onBeginToken(string name);
             |       void onAttribute(string key, string val);
             |       void onText(string text);
             |       void onEndToken(string name);
             |     }
             | 
             |     void parse(Reader input,
             |                SAXCallbacks yourCallbackImplementations);
             | 
             | Whereas pull parsers are like this:
             | 
             |     enum Token {
             |       Begin(string name),
             |       Attribute(string key, string val),
             |       Text(string text),
             |       End(string name),
             |     }
             | 
             |     class PullParser {
             |       void open(Reader input);
             |       Token next();
             |     }
             | 
             | They are _much_ easier to use because you can trivially
             | write a recursive descent parser:
             | 
             |     void parseThing(parser) {
             |       let token = parser.next();
             |       if (token == Begin("foo")) {
             |         parseFoo(parser);
             |       } else if ...
             | 
             | Whereas with SAX you're going to end up with some monstrous
             | hand-coded state machine like
             | 
             |     class ThingParser {
             |       enum State {
             |         ParsingThing,
             |         ParsingFoo,
             |         ParsingFooExpectingAttributes,
             |         ParsingFooExpectingEndTag,
             |         ...
             | 
             | So painful. Honestly it's so obviously the right way to do
             | tokenisation and parsing that I have yet to see another
             | language that even has names for them. They all just use
             | pull parsers. Nobody else does callback-based parsing like
             | SAX because it's obviously ridiculous.
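
The pull style described in this comment can be tried in Go with the standard library, since `encoding/xml`'s `Decoder.Token` hands back one token per call. Below is a minimal sketch of a recursive descent parser built on it; the `node` type and the `parseElement` and `parseDocument` helpers are hypothetical names for illustration, and the sketch uses the standard library rather than gosax.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// node is a hypothetical tree type used only for this sketch.
type node struct {
	Name     string
	Text     string
	Children []*node
}

// parseElement consumes tokens up to the end tag that closes start.
// The recursion depth mirrors the element nesting, so the call stack
// carries the state that a callback API would need an explicit enum for.
func parseElement(d *xml.Decoder, start xml.StartElement) (*node, error) {
	n := &node{Name: start.Name.Local}
	for {
		tok, err := d.Token()
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			child, err := parseElement(d, t)
			if err != nil {
				return nil, err
			}
			n.Children = append(n.Children, child)
		case xml.CharData:
			n.Text += string(t)
		case xml.EndElement:
			return n, nil
		}
	}
}

// parseDocument skips anything before the root element, then descends.
func parseDocument(src string) (*node, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	for {
		tok, err := d.Token()
		if err != nil {
			return nil, err
		}
		if start, ok := tok.(xml.StartElement); ok {
			return parseElement(d, start)
		}
	}
}

func main() {
	root, err := parseDocument(`<thing><foo>bar</foo><foo>baz</foo></thing>`)
	if err != nil {
		panic(err)
	}
	fmt.Println(root.Name, len(root.Children))
}
```

This sketch builds a whole tree for brevity; a streaming consumer would act on each child inside the loop instead of appending it.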
        
       | JonChesterfield wrote:
       | Very nice, thank you!
       | 
       | Unhelpfully my only pain point with XML parsing is colleagues
       | refusing to use XML in favour of json or, in really grim moments,
       | yaml.
       | 
       | So I'm delighted to see a sensible modern web language
       | implementation of the one true data exchange format. Thank you
       | for sharing it.
        
         | nonlogical wrote:
         | Out of curiosity, what are your top reasons to pick XML over
         | JSON(+jsonschema) or Msgpack/Protobuf as a data interchange
         | format? I came of age as a professional software engineer around
         | the time the industry started switching from XML to JSON, and as
         | a consequence I am in the JSON camp, but I am always curious to
         | hear out folks with a different opinion.
        
           | 616c wrote:
           | Have you tried CBOR/CDDL tooling, nonlogical? What is your
           | opinion of it?
        
           | dgb23 wrote:
           | Not OP:
           | 
           | I'm in the same boat, but I found XML has some nice
           | properties that I sometimes miss in JSON, given that XML is
           | used well ("correctly"), such as the differentiation of
           | metadata (attributes) and data (nodes), namespaces, standard
           | query languages, XSLT etc. (You can use XSLT on the web
           | even.)
           | 
           | Think of all the custom, ad-hoc code that turns JSON into
           | HTML vs having a declarative standardized way of doing so.
           | 
           | https://developer.mozilla.org/en-US/docs/Web/XSLT
        
           | jerf wrote:
           | When to use XML/What XML is good at:
           | https://news.ycombinator.com/item?id=11446984
        
             | jtmarmon wrote:
             | Great writeup. To add an example, I personally use JSON for
             | most of my work, but have found myself using XML for
             | certain AI use cases that require annotating an original
             | text.
             | 
             | For example, if I wanted an AI to help me highlight to the
             | user where in a body of text I mentioned AI, I might have
             | it return something like:
             | 
             | <text>Great writeup. To add an example, I personally use
             | JSON for most of my work, but have found myself using XML
             | for certain <ai-mention>AI</ai-mention> use cases that
             | require annotating an original text with segments.</text>
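
One way to consume inline annotations like the example above is to stream the tokens and track where each annotated segment lands in the plain text. The following is a rough sketch using the standard library's `encoding/xml`; `span` and `extractSpans` are hypothetical names, and the `ai-mention` tag is simply the one from the comment.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// span marks a half-open byte range [Start, End) in the plain text.
type span struct{ Start, End int }

// extractSpans strips all markup from src and reports where elements
// named tag sat in the resulting plain text.
func extractSpans(src, tag string) (string, []span, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	var text strings.Builder
	var starts []int // start offsets of the currently open elements
	var spans []span
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return text.String(), spans, nil
		}
		if err != nil {
			return "", nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			starts = append(starts, text.Len())
		case xml.CharData:
			text.Write(t)
		case xml.EndElement:
			// Token verifies that tags match, so the stack is non-empty here.
			start := starts[len(starts)-1]
			starts = starts[:len(starts)-1]
			if t.Name.Local == tag {
				spans = append(spans, span{start, text.Len()})
			}
		}
	}
}

func main() {
	src := `<text>I use XML for <ai-mention>AI</ai-mention> use cases.</text>`
	plain, spans, err := extractSpans(src, "ai-mention")
	if err != nil {
		panic(err)
	}
	for _, s := range spans {
		fmt.Printf("%q at [%d, %d)\n", plain[s.Start:s.End], s.Start, s.End)
	}
}
```

The offsets refer to the markup-free text, so a caller can highlight the original string without re-parsing it.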
        
         | IshKebab wrote:
         | I agree YAML is awful. JSON is ok if you allow comments at
         | least (for configuration use cases). There are a couple of
         | variants that do: JSON5 and Jsonnet. I like JSON5 for its
         | relative simplicity but Jsonnet has much better ecosystem
         | support so I'd probably go with that.
         | 
         | XML is just terrible though. Unless you have a proper schema
         | everything is _entirely_ untyped (tbf the schema support is
         | pretty good). But more to the point it just doesn't map to
         | normal programming language objects cleanly. It's a document
         | markup language, not an object encoding.
         | 
         | That means there's an annoying mismatch when parsing for 99% of
         | use cases.
         | 
         | Couple that with the crazy verbosity and the weird confusing
         | features like namespaces... I think I would rather use YAML to
         | be honest, even though it is really bad.
         | 
         | Since YAML is a superset of JSON I sometimes actually use JSON
         | with `#` comments, and read it as YAML. Only downside is
         | nothing checks if you are using that format correctly.
        
       | danesparza wrote:
       | This feels ... 20 years too late?
       | 
       | But excellent. Thanks!
        
         | slt2021 wrote:
         | libexpat was released in 1998 - the original high perf
         | streaming parser for XML written in C
        
         | blipvert wrote:
         | As Go only emerged in late 2009 then it can't really be more
         | than 15 years too late, can it?
        
           | IshKebab wrote:
           | I think he means SAX parsers and XML were all the rage 20
           | years ago. Today, not so much thankfully!
        
       | singpolyma3 wrote:
       | Does this support DTD/custom entities stuff? I would hope the
       | answer is no, but just checking
        
       | glenjamin wrote:
       | Oh nice, I've recently been looking into streaming XML parsing in
       | Go without a CGO dependency and found the available options pretty
       | lacking.
       | 
       | Great to see this sort of thing!
        
       | euroderf wrote:
       | Is there any improvement on the deficient namespace handling in
       | the stdlib ?
        
         | jerf wrote:
         | The "deficient namespace handling in the stdlib" is only
         | relevant when parsing XML, then trying to re-emit it. Since
         | this library does not support re-emitting XML, it is either
         | "worse" or "n/a", depending on your mood.
         | 
         | However, looking at the output data structures, yes, it would
         | have the same problem if the obvious modification to re-emit
         | XML was made.
         | 
         | It's actually very common, to the point I'm surprised when I
         | encounter an XML parser that handles the problem you are
         | referring to correctly, in any language. I've had to hack it in
         | to every XML parser I've ever used when I care about preserving
         | namespace abbreviations.
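
The prefix loss discussed here can be observed directly in the standard library: `Decoder.Token` translates a namespace prefix into its URL, while `Decoder.RawToken` leaves the prefix as written. A small sketch, with `firstStartName` as a hypothetical helper and `urn:example` as a made-up namespace:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// firstStartName returns the name of the first start element in src,
// read either with RawToken (raw=true) or with Token (raw=false).
func firstStartName(src string, raw bool) (xml.Name, error) {
	d := xml.NewDecoder(strings.NewReader(src))
	for {
		var tok xml.Token
		var err error
		if raw {
			tok, err = d.RawToken() // prefixes are left untranslated
		} else {
			tok, err = d.Token() // prefixes are replaced by namespace URLs
		}
		if err != nil {
			return xml.Name{}, err
		}
		if s, ok := tok.(xml.StartElement); ok {
			return s.Name, nil
		}
	}
}

func main() {
	src := `<a:foo xmlns:a="urn:example"/>`
	cooked, _ := firstStartName(src, false)
	raw, _ := firstStartName(src, true)
	fmt.Printf("Token:    Space=%q Local=%q\n", cooked.Space, cooked.Local)
	fmt.Printf("RawToken: Space=%q Local=%q\n", raw.Space, raw.Local)
}
```

With `Token`, the abbreviation `a` is no longer available on the element name, which is why code that wants to re-emit the original prefixes has to track them separately.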
        
       | runlevel1 wrote:
       | Wish I'd had this a few years ago. I had to parse Confluence wiki
       | backups which, for reasons only known to Atlassian and god,
       | lacked any closing tags. I ended up writing something similar to
       | this, but mine was a lot kludgier.
        
       | 38 wrote:
       | Little trick with xml.Decoder: unlike Unmarshal, Decoder ignores
       | any garbage after the XML, which is nice if you want to parse
       | HTML without dealing with the DOM.
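
The Decoder half of this tip is easy to check: `Decoder.Decode` returns once the first complete element has been read and does not look at the bytes that follow. A minimal sketch, where the `greeting` type and `decodeFirst` helper are hypothetical; it only demonstrates the Decoder behaviour and makes no claim about `Unmarshal`.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// greeting is a hypothetical target type for this sketch.
type greeting struct {
	XMLName xml.Name `xml:"greeting"`
	Text    string   `xml:",chardata"`
}

// decodeFirst decodes the first element of src and ignores the rest.
func decodeFirst(src string) (string, error) {
	var g greeting
	err := xml.NewDecoder(strings.NewReader(src)).Decode(&g)
	return g.Text, err
}

func main() {
	// Everything after </greeting> is malformed as XML, but it is never parsed.
	text, err := decodeFirst("<greeting>hello</greeting>\n<p>unclosed <b>html")
	fmt.Println(text, err)
}
```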
        
       ___________________________________________________________________
       (page generated 2024-06-27 23:01 UTC)