[HN Gopher] Show HN: Gosax - A high-performance SAX XML parser f...
___________________________________________________________________
Show HN: Gosax - A high-performance SAX XML parser for Go
I've just released gosax, a new Go library for high-performance SAX
(Simple API for XML) parsing. It's designed for efficient,
memory-conscious XML processing, drawing inspiration from quick-xml
and pkg/json.

https://github.com/orisano/gosax

Key features:
- Read-only SAX parsing
- Highly efficient parsing using techniques inspired by quick-xml
  and pkg/json
- SWAR (SIMD Within A Register) optimizations for fast text
  processing

gosax is particularly useful for processing large XML files or
streams without loading the entire document into memory. It's
well-suited for data feeds, large configuration files, or any
scenario where XML parsing speed is crucial.

I'd appreciate any feedback, especially from those working with
large-scale XML processing in Go. What are your current pain points
with XML parsing? How could gosax potentially help your projects?
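
For readers unfamiliar with SWAR, the following is a minimal sketch of
the general technique, not code taken from gosax. It scans a buffer
eight bytes at a time for a delimiter such as '<' using the well-known
"has zero byte" bit trick. The function name indexByteSWAR and the
sample input are assumptions made for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math/bits"
)

const (
	lo = 0x0101010101010101 // 0x01 repeated in every byte
	hi = 0x8080808080808080 // 0x80 repeated in every byte
)

// indexByteSWAR returns the index of the first occurrence of c in buf,
// or -1 if c is absent. It is a hypothetical helper for illustration.
func indexByteSWAR(buf []byte, c byte) int {
	pattern := lo * uint64(c) // c repeated in every byte of the word
	i := 0
	for ; i+8 <= len(buf); i += 8 {
		// Little-endian load: the first byte in memory is the lowest byte.
		w := binary.LittleEndian.Uint64(buf[i:])
		x := w ^ pattern // bytes equal to c become zero
		// Non-zero iff some byte of x is zero. The lowest set bit always
		// belongs to the first zero byte, so it locates the match exactly.
		if m := ^x & (x - lo) & hi; m != 0 {
			return i + bits.TrailingZeros64(m)/8
		}
	}
	// Handle the remaining tail (fewer than 8 bytes) one byte at a time.
	for ; i < len(buf); i++ {
		if buf[i] == c {
			return i
		}
	}
	return -1
}

func main() {
	doc := []byte(`some leading text <item id="1">`)
	fmt.Println(indexByteSWAR(doc, '<'))
}
```

The point of the trick is that one 64-bit comparison replaces eight
byte comparisons on the common path where no delimiter is present.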
Author : orisano
Score : 46 points
Date : 2024-06-27 03:17 UTC (19 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| artpar wrote:
| I upvoted you just because I made a golang library with the same
| name but different purpose
|
| https://github.com/artpar/gosax/
|
| It's a high-performance golang implementation of Symbolic
| Aggregate approXimation (SAX)
| lanstin wrote:
| Nice. I like the event based/callback based parsing tools for XML
| a lot. A little more cognitive work up front but much more
| efficient. A little sad if unsurprised that XML is still a thing
| in 2024, but if you have to read it, use a streaming parser.
| glenjamin wrote:
| If you've ever tried to read data from an XLSX file, you'll
| find that streaming XML parsing is quite beneficial
|
| And the world runs on Excel files.
| IshKebab wrote:
| I really hate SAX. Callback based parsing is really
| unergonomic, and means you always have to code an explicit
| state machine. You can't use your control flow as implicit
| state.
|
| It's like choosing to use `.then()` instead of `await`. I
| seriously don't understand why it is so popular in the XML
| world when pull based parsing is much easier to use and surely
| just as efficient? Just brain damaged Java design patterns
| maybe?
| lanstin wrote:
| Because of msgs that are larger than I want to allocate.
| Explicit state machines force one to think through the problem.
| And it forces the solution to be one pass over the input
| data. I almost never am forced to use Java so unsure about
| that reference.
| IshKebab wrote:
| Pull parsers can deal with arbitrarily large messages too.
| And they also do one pass over the input data.
|
| Yeah if you're unfamiliar, SAX is like this (pseudocode):
|
|     interface SAXCallbacks {
|       void onBeginToken(string name);
|       void onAttribute(string key, string val);
|       void onText(string text);
|       void onEndToken(string name);
|     }
|
|     void parse(Reader input, SAXCallbacks yourCallbackImplementations);
|
| Whereas pull parsers are like this:
|
|     enum Token {
|       Begin(string name),
|       Attribute(string key, string val),
|       Text(string text),
|       End(string name),
|     }
|
|     class PullParser {
|       void open(Reader input);
|       Token next();
|     }
|
| They are _much_ easier to use because you can trivially
| write a recursive descent parser:
|
|     void parseThing(parser) {
|       let token = parser.next();
|       if (token == Begin("foo")) {
|         parseFoo(parser);
|       } else if ...
|
| Whereas with SAX you're going to end up with some monstrous
| hand-coded state machine like
|
|     class ThingParser {
|       enum State {
|         ParsingThing,
|         ParsingFoo,
|         ParsingFooExpectingAttributes,
|         ParsingFooExpectingEndTag,
|         ...
|
| So painful. Honestly it's so obviously the right way to do
| tokenisation and parsing that I have yet to see another
| language that even has names for them. They all just use
| pull parsers. Nobody else does callback-based parsing like
| SAX because it's obviously ridiculous.
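
As a concrete Go illustration of the pull style described above, here
is a minimal sketch using the standard library's encoding/xml, whose
Decoder.Token method pulls one token at a time. It does not use gosax.
The item type, the element names, and the helper functions are
assumptions made for the example. Nesting is tracked by the call stack
rather than by an explicit state enum.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// item is a hypothetical record type used only for this illustration.
type item struct {
	ID   string
	Name string
}

// parseItems pulls tokens until EOF and descends into every <item>.
func parseItems(dec *xml.Decoder) ([]item, error) {
	var items []item
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return items, nil
		}
		if err != nil {
			return nil, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "item" {
			it, err := parseItem(dec, se)
			if err != nil {
				return nil, err
			}
			items = append(items, it)
		}
	}
}

// parseItem consumes tokens up to the end tag matching start.
func parseItem(dec *xml.Decoder, start xml.StartElement) (item, error) {
	var it item
	for _, a := range start.Attr {
		if a.Name.Local == "id" {
			it.ID = a.Value
		}
	}
	for {
		tok, err := dec.Token()
		if err != nil {
			return it, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "name" {
				if it.Name, err = parseText(dec); err != nil {
					return it, err
				}
			} else if err := dec.Skip(); err != nil { // ignore unknown children
				return it, err
			}
		case xml.EndElement:
			// Children are fully consumed above, so this closes the item.
			return it, nil
		}
	}
}

// parseText collects character data up to the enclosing end tag.
func parseText(dec *xml.Decoder) (string, error) {
	var sb strings.Builder
	for {
		tok, err := dec.Token()
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.CharData:
			sb.Write(t)
		case xml.StartElement:
			if err := dec.Skip(); err != nil {
				return "", err
			}
		case xml.EndElement:
			return sb.String(), nil
		}
	}
}

// parseString is a small convenience wrapper for string input.
func parseString(s string) ([]item, error) {
	return parseItems(xml.NewDecoder(strings.NewReader(s)))
}

func main() {
	items, err := parseString(`<items><item id="1"><name>foo</name></item></items>`)
	fmt.Println(items, err)
}
```

The decoder reads from an io.Reader, so the document is never held in
memory as a whole, which matches the point that pull parsers also make
a single pass over arbitrarily large input.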
| JonChesterfield wrote:
| Very nice, thank you!
|
| Unhelpfully my only pain point with XML parsing is colleagues
| refusing to use XML in favour of json or, in really grim moments,
| yaml.
|
| So I'm delighted to see a sensible modern web language
| implementation of the one true data exchange format. Thank you
| for sharing it.
| nonlogical wrote:
| Out of curiosity, what are your top reasons to pick XML over
| JSON(+jsonschema) or Msgpack/Protobuf, as data interchange? I
| have come of age as a professional software engineer around the
| time when industry has started switching from XML to JSON, and
| as a consequence in the JSON camp, but I am always curious to
| hear out folks with a different opinion.
| 616c wrote:
| Have you tried CBOR/CDDL tooling, nonlogical? What is your
| opinion of it?
| dgb23 wrote:
| Not OP:
|
| I'm in the same boat, but I found XML has some nice
| properties that I sometimes miss in JSON, given that XML is
| used well ("correctly"), such as the differentiation of
| metadata (attributes) and data (nodes), namespaces, standard
| query languages, XSLT etc. (You can use XSLT on the web
| even.)
|
| Think of all the custom, ad-hoc code that turns JSON into
| HTML vs having a declarative standardized way of doing so.
|
| https://developer.mozilla.org/en-US/docs/Web/XSLT
| jerf wrote:
| When to use XML/What XML is good at:
| https://news.ycombinator.com/item?id=11446984
| jtmarmon wrote:
| Great writeup. To add an example, I personally use JSON for
| most of my work, but have found myself using XML for
| certain AI use cases that require annotating an original
| text.
|
| For example, if I wanted an AI to help me highlight to the
| user where in a body of text I mentioned AI, I might have
| it return something like:
|
| <text>Great writeup. To add an example, I personally use
| JSON for most of my work, but have found myself using XML
| for certain <ai-mention>AI</ai-mention> use cases that
| require annotating an original text with segments.</text>
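
A minimal sketch of how such annotated output could be consumed in Go,
assuming the response is well-formed XML and uses the ai-mention
element shown above. It uses the standard library's encoding/xml; the
mentions function is a hypothetical helper, and nested ai-mention
elements are not handled.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// mentions returns the text of every <ai-mention> element, in document
// order. It is a hypothetical helper written for this illustration.
func mentions(annotated string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(annotated))
	var out []string
	var cur strings.Builder
	inMention := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "ai-mention" {
				inMention = true
				cur.Reset()
			}
		case xml.CharData:
			if inMention {
				cur.Write(t)
			}
		case xml.EndElement:
			if t.Name.Local == "ai-mention" {
				inMention = false
				out = append(out, cur.String())
			}
		}
	}
}

func main() {
	got, err := mentions(`<text>using XML for certain <ai-mention>AI</ai-mention> use cases</text>`)
	fmt.Println(got, err)
}
```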
| IshKebab wrote:
| I agree YAML is awful. JSON is ok if you allow comments at
| least (for configuration use cases). There are a couple of
| variants that do: JSON5 and Jsonnet. I like JSON5 for its
| relative simplicity but Jsonnet has much better ecosystem
| support so I'd probably go with that.
|
| XML is just terrible though. Unless you have a proper schema
| everything is _entirely_ untyped (tbf the schema support is
| pretty good). But more to the point it just doesn't map to
| normal programming language objects cleanly. It's a document
| markup language, not an object encoding.
|
| That means there's an annoying mismatch when parsing for 99% of
| use cases.
|
| Couple that with the crazy verbosity and the weird confusing
| features like namespaces... I think I would rather use YAML to
| be honest, even though it is really bad.
|
| Since YAML is a superset of JSON I sometimes actually use JSON
| with `#` comments, and read it as YAML. Only downside is
| nothing checks if you are using that format correctly.
| danesparza wrote:
| This feels ... 20 years too late?
|
| But excellent. Thanks!
| slt2021 wrote:
| libexpat was released in 1998 - the original high perf
| streaming parser for XML written in C
| blipvert wrote:
| As Go only emerged in late 2009 then it can't really be more
| than 15 years too late, can it?
| IshKebab wrote:
| I think he means SAX parsers and XML were all the rage 20
| years ago. Today, not so much thankfully!
| singpolyma3 wrote:
| Does this support DTD/custom entities stuff? I would hope the
| answer is no, but just checking
| glenjamin wrote:
| Oh nice, I've recently been looking into streaming XML parsing in
| Go without a CGO dependency and found the available options pretty
| lacking.
|
| Great to see this sort of thing!
| euroderf wrote:
| Is there any improvement on the deficient namespace handling in
| the stdlib ?
| jerf wrote:
| The "deficient namespace handling in the stdlib" is only
| relevant when parsing XML, then trying to re-emit it. Since
| this library does not support re-emitting XML, it is either
| "worse" or "n/a", depending on your mood.
|
| However, looking at the output data structures, yes, it would
| have the same problem if the obvious modification to re-emit
| XML was made.
|
| It's actually very common, to the point I'm surprised when I
| encounter an XML parser that handles the problem you are
| referring to correctly, in any language. I've had to hack it in
| to every XML parser I've ever used when I care about preserving
| namespace abbreviations.
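
To make the prefix problem concrete, here is a minimal sketch using the
standard library's encoding/xml (not gosax). Decoder.Token replaces a
namespace prefix with the URL it is bound to, so the abbreviation the
document used is no longer visible on the element name, while
Decoder.RawToken leaves the prefix untranslated. The sample document
and the firstStart helper are assumptions made for the example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// doc is a made-up sample with one prefixed namespace.
const doc = `<a:feed xmlns:a="urn:example:atom"><a:entry/></a:feed>`

// firstStart returns the name of the first start element in doc, read
// with RawToken when raw is true and with Token otherwise.
func firstStart(raw bool) (xml.Name, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		var tok xml.Token
		var err error
		if raw {
			tok, err = dec.RawToken() // prefixes are left as written
		} else {
			tok, err = dec.Token() // prefixes are translated to URLs
		}
		if err != nil {
			return xml.Name{}, err
		}
		if se, ok := tok.(xml.StartElement); ok {
			return se.Name, nil
		}
	}
}

func main() {
	resolved, _ := firstStart(false)
	raw, _ := firstStart(true)
	fmt.Printf("Token:    Space=%q Local=%q\n", resolved.Space, resolved.Local)
	fmt.Printf("RawToken: Space=%q Local=%q\n", raw.Space, raw.Local)
}
```

With Token the Space field holds "urn:example:atom", and with RawToken
it holds "a". Code that wants to re-emit the original abbreviations has
to track the prefix bindings itself.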
| runlevel1 wrote:
| Wish I'd had this a few years ago. I had to parse Confluence wiki
| backups which, for reasons only known to Atlassian and god,
| lacked any closing tags. I ended up writing something similar to
| this, but mine was a lot kludgier.
| 38 wrote:
| Little trick with xml.Decoder: unlike Unmarshal, Decoder ignores
| any garbage after the XML, which is nice if you want to parse
| HTML without dealing with the DOM
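
A minimal sketch of the Decoder behaviour described above, using the
standard library's encoding/xml. Decode reads one element and stops at
its end tag, so whatever follows is never tokenized. The note type and
the sample input are assumptions made for the example, and the sketch
only exercises the Decoder side of the comparison.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// note is a hypothetical target type used only for this illustration.
type note struct {
	XMLName xml.Name `xml:"note"`
	Body    string   `xml:"body"`
}

// decodeFirst decodes the first element in input and ignores the rest.
func decodeFirst(input string) (note, error) {
	var n note
	err := xml.NewDecoder(strings.NewReader(input)).Decode(&n)
	return n, err
}

func main() {
	n, err := decodeFirst("<note><body>hi</body></note>\n<<< not xml at all")
	fmt.Println(n.Body, err)
}
```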
___________________________________________________________________
(page generated 2024-06-27 23:01 UTC)