[HN Gopher] Parsing can become accidentally quadratic because of...
       ___________________________________________________________________
        
       Parsing can become accidentally quadratic because of sscanf
        
       Author : JdeBP
       Score  : 403 points
       Date   : 2021-03-01 13:53 UTC (9 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | randomNumber7 wrote:
       | I didn't think this is sensational, because I found this behavior
       | while writing one of my first C programs a couple of years ago.
       | 
       | Also you just have to google "sscanf slow"....
        
       | petters wrote:
       | This a performance bug in glibc (and others), right? Several
       | developers seem to be hit by this.
        
       | brundolf wrote:
       | Something I've mused about is whether it would be possible to
       | reason about performance characteristics at a types level, to
       | statically identify things like quadratic behavior. No idea what
       | this would look like; I just see this gap where type systems have
       | gotten quite good at ensuring against incorrectness (not C's, of
       | course, but that's another discussion), but there's almost no
       | equivalent story for catastrophic performance regressions, which
       | in some cases can be just as serious as a program-crashing bug.
       | 
       | Basically when it comes to performance we're still in the place
       | that we used to be with correctness: "either read the entire
       | stack of source code to understand everything that's happening,
       | or try and remember to read the docs which may or may not be
       | correct and exhaustive and just hope for the best".
       | 
       | The performance characteristics of a piece of code are part of
       | its contract with the outside world, but that contract is in no
       | way formalized right now.
        
         | voldacar wrote:
         | maybe you could encode the number of maximum nested loops into
         | the type system, so when defining a function you set a maximum
         | number which would fail to compile if exceeded? would be crude
         | but possibly helpful at times
        
         | zionic wrote:
         | This is brilliant, I too feel like there's an opportunity there
         | but don't have any idea off the top of my head how to implement
         | such a thing.
        
       | rurban wrote:
       | I've been hit by this behaviour just a few days ago with the
       | fantastic libfuzzer, which finds much more errors than afl++ or
       | honggfuzz in much less time. But it's buffer strategy is enoying,
       | leading to inefficiency and artificial sscanf bugs. Problem is,
       | oss-fuzz/libfuzzer gives you a const char* buffer without zero
       | termination. In reality every such input buffer is zero
       | terminated, because otherwise strtol or sscanf are worthless and
       | lead to buffer overflows. They don't even allow adding a zero at
       | the end, so you have to copy the whole input buffer just to add
       | your zero termination, which makes all the fuzzer speed
       | advantages worthless.
        
       | joe_the_user wrote:
       | Broadly, it's actually easy to see how naive parsing can become
       | quadratic if a programmer isn't careful.
       | 
       | You parse by processing a long string/file via "get-a-token(),
       | add-token-to-structure()" repeating until you have built your
       | structure or made an error in the construction process (here a
       | token a kind, a type of substring and get-a-token retrieves the
       | substring type in some form).
       | 
       | The problems come because the naive programmer is only thinking
       | about their structure building process, so they use some library
       | function to get their tokens. Whether that's a regular expression
       | interpreter or some other flexible facility by provided by the
       | library, it's easy for it's definition of a token-string to
       | include things "not in the string" as well as things in the
       | string - a "quoted string" begins and ends with a quote, so to
       | discover that X isn't a quoted string, you have to scan to the
       | end. So if thing go wrong, each call to get-a-token may scan the
       | while file to make sure this token isn't that. Just an example.
       | The problem here seems to be tokenizing floats, which also has a
       | "where it end" problem in c++. And the alternative is parsing
       | floats yourself, which indeed no one wants to do since they are
       | complex and defined by a complex standard.
        
       | ufo wrote:
       | In case anyone missed it, this sscanf behavior was the cause of
       | the long loading times in GTA V, a story that hit the front page
       | yesterday:
       | 
       | https://news.ycombinator.com/item?id=26296339
        
         | aparsons wrote:
         | That was a brilliant investigation. Reminded me of when I used
         | to work on Windows debuggers. I wish the author touched on
         | symbolization more. That is where a lot of the fun lies when
         | dissecting precompiled binaries.
        
           | jjoonathan wrote:
           | What are the tools for symbolization? I recently spent a lot
           | of time laboriously propagating RTTI information around
           | Visual C++ binaries by hand in Ghidra, so I would be very
           | interested in learning about this tooling, even if it's all
           | IDA for now.
        
             | aparsons wrote:
             | It has been 2-3 decades since I've worked on this, so this
             | may be outdated advice. In fact, I just learned of Ghidra
             | when I read the article.
             | 
             | In general there are three cases:
             | 
             | - Developer forgot to strip/obfuscate symbols, in which
             | case you're golden
             | 
             | - Symbol you seek is in a shared library: patch the DLL.
             | Often what we did was do investigations in a VM, where the
             | n most common shared library calls were patched to send
             | stack and symbol data to a local DB. Very similar play to
             | RR, but only in the patched code.
             | 
             | - If the (missing) symbol you seek is pre linked (static),
             | things get interesting. We built a pattern matcher that was
             | incredible. Idea is, if you have the disassembly and a
             | loop, like:
             | 
             | int fun(char* str) { call processChar on each uppercase
             | character }
             | 
             | It pattern matched a name like
             | rax_foreach_iflte_cx_call_obfuscatedSymbolForProcessChar
             | 
             | 90+% of code either:
             | 
             | - enters the same dynamic library trampoline
             | 
             | - displays very very common programming patterns.
             | 
             | From what I know, we never released this outside early MS
             | DN members because the 10% is often very useful code and
             | developer tools weren't given away for free yet.
             | 
             | But once the DLL is patched, you propagate names up graph.
             | Any remaining symbols were hand annotated (and we run the
             | propagator again).
             | 
             | Brian Parks went on to extend this system to work with C++
             | templates. C++ mangling is very predictable, so his program
             | could scavenge for symbols (obfuscators only worked on the
             | root symbol name) and build out the class at compile time
             | for us to reference.
        
             | wcarss wrote:
             | just a little related: I watched a guy step'n'watch the no-
             | symbols code once after he hooked WINDBG up to an MS Office
             | process that had a paint issue. He narrowed it down to
             | something IME was doing -- I gained nothing but awe.
        
         | masklinn wrote:
         | Well half the cause (or probably two thirds), as the post shows
         | there's a second quadratic blowup when filling the not-a-
         | hashmap thing after the very slow parsing has completed.
        
           | crowbahr wrote:
           | A filling that wasn't even required because it was de-duping
           | already de-duped data instead of just inserting it into a
           | list (or a hashmap, as suggested in the post).
        
       | JdeBP wrote:
       | I found this while making a collection of what C implementation
       | does what at https://news.ycombinator.com/item?id=26298300 .
       | 
       | There are two basic implementation strategies. The BSD (FreeBSD
       | and OpenBSD and more than likely NetBSD too), Microsoft, GNU, and
       | MUSL C libraries use one, and suffer from this; whereas the
       | OpenWatcom, P.J. Plauger, Tru64 Unix, and my standard C libraries
       | use another, and do not.
       | 
       | The 2002 report in the comp.lang.c Usenet newsgroup (listed in
       | that discussion) is the earliest that I've found so far.
       | 
       | There have been several assertions that library support for
       | strings in C++ is poor. However, note that the fix here was to
       | switch from the _Standard C library_ function sscanf() to a C++
       | library.
       | 
       | * https://github.com/fastfloat
       | 
       | One might advocate using other people's libraries, on the grounds
       | that they will have got such things right. However, note that
       | this is RapidYAML, and the bug got fixed several years after GTA
       | Online was released. Other third-party parsing libraries have
       | suffered from being layered on top of sscanf() in other ways.
       | 
       | * https://github.com/json-c/json-c/issues/173
       | 
       | * https://github.com/kgabis/parson/commit/96150ba1fd7f3398aa6a...
        
         | kiwidrew wrote:
         | It turns out that the strlen() behaviour was present as far
         | back as the 4BSD libc [1] circa _1980_ , and via the Net/2
         | releases has found its way into NetBSD/FreeBSD/OpenBSD along
         | with countless other BSD-derived libc implementations.
         | 
         | Of course, the underlying vfscanf() implementation that reads
         | from a FILE* simply proceeds one character at a time; the
         | pseudo-quadratic behaviour arising from the strlen() call is
         | simply an artifact of the simplistic "simulate a file using a
         | string" wrapper in sscanf().
         | 
         | [1] https://minnie.tuhs.org/cgi-
         | bin/utree.pl?file=4BSD/usr/src/l...
        
         | Jasper_ wrote:
         | Musl reads the "file" in 256 byte increments and calls strlen
         | on each one. Still a strlen, but on short strings it would only
         | be called once, on a short amount of data. It would work fine
         | for this use case.
        
           | JdeBP wrote:
           | As laid out at https://news.ycombinator.com/item?id=26298300,
           | it doesn't call strlen(); and it isn't 256 bytes. Moreover
           | that use case is not the use cases at hand, which are people
           | parsing 10MiB of JSON and 504KiB of YAML with parsers that
           | are not calling sscanf() just once.
           | 
           | MUSL, with the same implementation strategy of nonce FILE
           | objects, ends up performing multiple passes over those big
           | input strings, once to look for NUL and a second time to pull
           | each character out one by one. And it's doing those multiple
           | passes multiple times, as it sets up and tears down the nonce
           | FILE object each call to sscanf(), meaning that it's scanning
           | the same parts of the string for NUL over and over again as
           | it repeats the same buffer fill(s) from the input string a
           | few characters further on. As I said before, it's not
           | _entirely_ free of this behaviour. With the other
           | implementation strategy, in contrast, there 's only a single
           | pass over the whole input string, even with those multiple
           | calls to sscanf().
        
             | Jasper_ wrote:
             | Let's look at the source: https://git.musl-
             | libc.org/cgit/musl/tree/src/stdio/vsscanf.c
             | 
             | It fills in a buf with a string_read callback, but if we
             | look at what vfscanf does, it uses shgetc, which eventually
             | calls into __uflow, which is where the read is performed:
             | 
             | https://git.musl-
             | libc.org/cgit/musl/tree/src/stdio/vfscanf.c
             | https://git.musl-
             | libc.org/cgit/musl/tree/src/internal/shgetc...
             | https://git.musl-
             | libc.org/cgit/musl/tree/src/stdio/__uflow.c
             | 
             | This will call into the string_read callback with a buf
             | size of 1. The string_read will internally do a memchr(buf,
             | 0, len+256);, so, OK, it's memchr-ing at most 257 bytes,
             | sorry. The read data then gets stored in a stdio-internal
             | buffer (the rpos/rlen below), so this code path will
             | actually be skipped on future reads.
             | 
             | If it then needs to unparse a character, it will use ungetc
             | to put it back in a stdio-internal buffer. Reading this
             | code isn't that hard, I don't know why you keep insisting
             | it does something else.
        
               | JdeBP wrote:
               | You haven't read the code properly. The memcpy() copies
               | _one_ character. After shgetc() reads that and possibly a
               | few more characters, the entire FILE object goes away and
               | the next call to sscanf() sets up an entirely fresh nonce
               | FILE object and _does the initial fill all over again_ ,
               | including scanning the stuff that it has already scanned
               | for NUL, a few characters further on. As I just
               | explained.
        
               | Jasper_ wrote:
               | Yes, but it's not scanning the entire string, it's
               | scanning at most 256 bytes. In MSVC, the number of bytes
               | scanned is `M*N` where M is the number of calls to scanf
               | and N is the size of the file. In musl, N is now 257, no
               | matter the size of the file. Compared to ~10MB of data,
               | that's basically free.
               | 
               | Sure, it's scanning a bit more than maybe it needs to,
               | but it's much closer to a non-scanning libc in the amount
               | of data it's consuming.
        
               | JdeBP wrote:
               | And this will be the third time that I've written the
               | phrase "not _entirely_ free of this behaviour ", in
               | comparison with the implementations that are entirely
               | free of this behaviour.
        
               | wnoise wrote:
               | It's not entirely free of overhead. The overhead is
               | constant per invocation though, so it's entirely free of
               | _quadratic_ behavior. Which is what's being discussed.
        
               | JdeBP wrote:
               | No, that was not what was being discussed, neither here
               | nor in that discussion. See
               | https://news.ycombinator.com/item?id=26297612 . None of
               | us mentioned quadratic behaviour. We are talking about
               | sscanf() calling strlen(), and in the original discussion
               | why.
        
               | wnoise wrote:
               | Sure, the quadratic behavior is attached to the "calling
               | sscanf in a loop". The point is linear behavior for a
               | call to sscanf. Which isn't the case for musl. Nor does
               | musl call strlen, or even effectively call it. It will
               | bizarrely scan for a NUL in up to 256 bytes ahead of
               | where in the string it is currently. But that's not
               | actually the same behavior as calling strlen or having
               | other linear behavior.
        
               | jka wrote:
               | This isn't solely directed towards you, but as a general
               | comment: if you're keen to illustrate a point about code
               | behaviour (as we often are!) then it can potentially save
               | time and ambiguity to include commit references and line
               | ranges.
               | 
               | It still doesn't guarantee that the behaviour is stable
               | (because dependencies within the linked range may change
               | before a reader gets to it) but it can help support your
               | message (and hopefully your argument) both at time-of-
               | authorship and in future.
               | 
               | (and yep, it's a tradeoff of your own time -- and
               | response time -- against future readers' time; often
               | tricky to handle. I think code browsers could do a lot
               | more to make it easier)
        
         | [deleted]
        
       | [deleted]
        
       | 1f60c wrote:
       | I think "(2020)" should be added to the title. (There was a very
       | similar story on the front page yesterday, but alas, it's
       | ultimately unrelated.)
        
       | dividuum wrote:
       | Seen the GTA post already and the sscanf behavior is very
       | surprising as one would expect it to parse the given string until
       | the given format is satisfied or \0 is hit. Seems like this is
       | not the case. Found this: https://stackoverflow.com/a/23924112
       | 
       | > sscanf() converts the string you pass in to an _IO_FILE* to
       | make the string look like a "file". This is so the same internal
       | _IO_vfscanf() can be used for both a string and a FILE*. However,
       | as part of that conversion, done in a
       | _IO_str_init_static_internal() function, it calls __rawmemchr
       | (ptr, '\0'); essentially a strlen() call, on your input string.
       | 
       | Looks correct. Call flow:
       | 
       | https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...
       | 
       | https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...
       | 
       | https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...
       | 
       | https://github.com/bminor/glibc/blob/83908b3a1ea51e3aa7ff422...
        
         | kazinator wrote:
         | By itself, that doesn't add up to N-squared behavior, though.
         | 
         | > _It looks like the problem is that reading floats is done
         | using c4::from_chars - > c4::atof -> sscanf which ultimately
         | calls _IO_str_init_static_internal in the glibc implementation.
         | As far as I understand, this essentially causes strlen to be
         | called on the string that rapidyaml passes to it, essentially
         | causing strlen to be called O(n^2) times._
         | 
         | As such, this doesn't follow. Just because sscanf takes the
         | length of the entire source string doesn't mean will get N^2
         | behavior, unless you do silly things with sscanf, like extract
         | a small token from the beginning of a large string, and then
         | skip past it to do the same thing again after that small token:
         | essentially, build a lexical analyzer in which the pattern
         | matcher consists of calls to scanf. This is not an intended use
         | case for sscanf. No self-respecting language parser designer
         | would do such a thing in a non-throwaway, production program.
         | 
         | sscanf has undefined behaviors for out-of-range numeric values,
         | so you can't use it for just blindly extracting numeric tokens
         | at all, regardless of any other issues.
         | 
         | sscanf should only be used on a string that your program has
         | either prepared itself, or carefully delimited by some other
         | means.
         | 
         | For instance, suppose we have a string which holds a date
         | (could be an input to the program). We can test whether it's in
         | the YYYY-MM-DD format, and get those values out:
         | int year, month, day;        char junk;             if
         | (sscanf(input, "%4d-%2d-%2d%c", &year, &month, &day, &junk) ==
         | 3)        {           /* we got it */        }
         | 
         | The purpose of junk is to test that there is no trailing
         | garbage character. If there is, it will be scanned and the
         | return value will be 4. If we don't care about that, we can
         | just omit that argument and conversion specifier.
         | 
         | This will accept inputs like 01-1-1. Worse, it will scan
         | leading signs!
         | 
         | If you wan to insist on nothing but digits, and exact digit
         | counts, you can break it up as string tokens first:
         | char yearstr[5], monthstr[3], daystr[3];        char junk;
         | if (sscanf(input, "%4[0-9]-%2[0-9]-%2[0-9]%c", yearstr,
         | monthstr, daystr, &junk) == 3)        {           int year =
         | atoi(yearstr);           int month = atoi(monthstr);
         | int day = atoi(daystr);        }
         | 
         | atoi is safe here, by the way, because we know that the
         | integers are decimal strings no more than four digits.
         | 
         | This is the sort of thing where sscanf comes in very handy. If
         | you know how to use it, it solves these sorts of things nicely,
         | and the performance is probably okay. At least not degenerately
         | bad.
         | 
         | Complete sample with all the validation:
         | #include <stdio.h>       #include <stdlib.h>       #include
         | <string.h>            int main(int argc, char **argv)       {
         | char yearstr[5] = "", monthstr[3] = "", daystr[3] = "";
         | char junk;               if (argv[0] && argv[1] &&
         | sscanf(argv[1], "%4[0-9]-%2[0-9]-%2[0-9]%c", yearstr, monthstr,
         | daystr, &junk) == 3 &&              yearstr[3] && monthstr[1]
         | && daystr[1])          {             int year = atoi(yearstr);
         | int month = atoi(monthstr);             int day = atoi(daystr);
         | printf("%d:%d:%d\n", year, month, day);                  return
         | 0;          }               return EXIT_FAILURE;       }
         | 
         | Some tests:                 $ ./date asdf       $ ./date 1-2-3
         | $ ./date 2020-01-1       $ ./date 20-01-01       $ ./date
         | 2020-01-01       2020:1:1       $ ./date 2020-01-01z       $
         | ./date 2020-01-12       2020:1:12
        
           | froh wrote:
           | to my understanding there is a long input string, completely
           | in memory. that string consists of a gazillion of floats. the
           | string is walked with a pointer, the floats are sscanf-ed one
           | after the other. each of the sscanfs will call strlen _once_
           | , but each strlen will scan the full remaining huge string to
           | find the \0 eos byte.
           | 
           | the sum of all scanned bytes is quadratic to the length of
           | the huge input string.
           | 
           | ?
           | 
           | edit: oh the autocorrect...
        
             | kazinator wrote:
             | That description seems to resemble my remark:
             | 
             | > _extract a small token from the beginning of a large
             | string, and then skip past it to do the same thing again
             | after that small token: essentially, build a lexical
             | analyzer in which the pattern matcher consists of calls to
             | scanf._
             | 
             | Scanning floats with sscanf means undefined behavior on a
             | datum like 1.0E999 against a %lf (double), where the
             | exponent is out of range. IEEEE double goes to E308, if I
             | recall.
             | 
             | C99 says _" Unless assignment suppression was indicated by
             | a[n asterisk], the result of the conversion is placed in
             | the object pointed to by the first argument following the
             | format argument that has not already received a conversion
             | result. If this object does not have an appropriate type,
             | or if the result of the conversion cannot be represented in
             | the object, the behavior is undefined."_
             | 
             | I'm seeing that on glibc, 1E999 scans to an Inf, but that
             | is not a requirement.
             | 
             | This problem could be solved with sscanf, too, while
             | maintaining the degenerate quadratic behavior. sscanf
             | patterns could be used to destructure a floating-point
             | number textually into pieces like the whole part,
             | fractional part and exponent after the E, which could then
             | be validated separately, like that the E part is between
             | -308 and 308 and so forth. Someone could still do this in a
             | loop over a string with millions of floats. I'd say though
             | that you're way past the point of needing a real lexical
             | analyzer, and not multiple passes through scanf.
        
               | 3xccc wrote:
               | What about parsing an unknown number of at-most-4-digits
               | integers from a string (being careful, as you
               | exemplified)? Isn't that a reasonable use of sscanf?
               | 
               | Do I need a full fledged parser for that? No. Does it
               | have to traverse the whole string every time? No, because
               | I'm being very clear about the maximum length of the
               | thing I'm trying to parse.
               | 
               | You cannot _not_ conclude that it's a bug.
        
           | adwn wrote:
           | > _[...] so you can 't use it for just blindly extracting
           | numeric tokens at all [...]_
           | 
           | So in other words, it's completely useless?
           | 
           | > _This is not an intended use case for sscanf._
           | 
           | Then what is?
        
             | kazinator wrote:
             | Yes, scanf is almost completely useless, and sscanf a
             | little less so. sscanf is useful if you need a pattern-
             | based breakdown of some string datum, and the patterns are
             | not out of sscanf's league. It will work "anywhere"; it's
             | been in C since before 1989 ANSI C.
             | 
             | In the entire TXR project, I used sscanf in one function: a
             | floating to integer conversion routine. The context is that
             | integers can be bignums, so we can't just rely on ordinary
             | conversion, except in the fixnum range.
             | 
             | http://www.kylheku.com/cgit/txr/tree/arith.c?h=txr-252#n291
             | 2
             | 
             | The approach taken is to sprintf the _double_ value into a
             | generous buffer, with a generous precision, using the %g
             | format which will automatically use the _e_ notation if
             | necessary.
             | 
             | We detect whether the _e_ notation was produced, and other
             | characteristics, and then proceed accordingly.
             | 
             | I see that an impossible case is being tested for; since
             | values with a small magnitude are handled via the fixnum
             | route, we will never hit the case where we can short-
             | circuit to zero.
             | 
             | Can anyone see a bug here, by the way? :)
        
             | nitrogen wrote:
             | The comment to which you reply uses the %[] conversion to
             | impose stricter limits on acceptable characters to be
             | interpreted as numbers.
             | 
             | My guess at the historical intention of *scanf is that it
             | was meant for parsing tab-delimited and fixed-width records
             | from old-school data files.
        
               | sdenton4 wrote:
               | (As opposed to the comma-separated data files that we
               | enlightened future-folk use!)
        
               | Darkphibre wrote:
               | You know, why don't we use 0x1F (Unit Separator)? Seems
               | like we have had a hierarchical delimiters since ASCII /
               | punch cards, yet I never hear about them. https://en.wiki
               | pedia.org/wiki/C0_and_C1_control_codes#Field_...
               | 
               | CSVs are such a pain if you have freeform text (in which
               | you have to handle newlines and escaped quotes).
               | 
               | It seems like using FS/GS/RS/US from the C0 control codes
               | would be divine (provided there was implicit support for
               | viewing/editing them in a simple text editor). I get that
               | it's a non-printable control character, and thus not
               | traditionally rendered... but that latter point could
               | have been done ages ago as a standard convention in
               | editors (enter VIM et. al., as they will indeed allow you
               | to see them).
        
               | jimmaswell wrote:
               | It's nice being able to edit a CSV in a text editor. It
               | wouldn't be worth it to lose that just to not have to
               | escape some strings.
        
               | afiori wrote:
               | more than that it is nice to edit files with a standard
               | keyboard,
        
               | db48x wrote:
               | If your editor doesn't allow you to enter arbitrary
               | characters using a standard keyboard then you need a
               | better editor.
        
               | sp332 wrote:
               | It wouldn't take that much effort to add a glyph in the
               | editor for it?
        
               | [deleted]
        
               | JdeBP wrote:
               | Text editors that can put control characters into files,
               | and indeed PC keyboards that have a [Control] key, have
               | been around for decades, though. For example: With
               | WordStar, SideKick, and Microsoft's EDIT, one just typed
               | [Control]+[P] and then the control character. M.
               | Darkphibre mentioned VIM, but really _this_ isn 't an
               | idea that was ever limited to just VIM and emacs.
        
               | nitrogen wrote:
               | No matter what delimiters we use, we still have either a
               | data sanitization problem or a data escaping problem.
               | I've worked with a wire protocol that used the ASCII
               | record delimiters, but with binary data so you also had
               | to use binary escape (0x1b) and set the eighth bit on the
               | following byte.
        
               | 0xbadcafebee wrote:
               | There are plenty of ways a lazy programmer will want to
               | pass the records into something else that either doesn't
               | handle non-printables, or does but the user doesn't like
               | how it looks.
               | 
               | TSV also works more reliably than CSV, because most
               | people don't put tab characters in the data in these
               | kinds of records. Tab is even the default field delimiter
               | for _cut_. But everybody uses CSV, because again it 's
               | easier to reason about the above. _(shrug)_
        
               | elygre wrote:
               | Which is a mess, because some parts of the world use
               | commas as decimal commas: 3,14159etc. So we have to use
               | semicolon for our CSVs.
        
               | bmn__ wrote:
               | > some parts of the world
               | 
               |  _most_ parts
               | 
               | Why can't you be normal? https://commons.wikimedia.org/wi
               | ki/File:DecimalSeparator.svg
        
               | unilynx wrote:
               | C is full of record based stuff. Apart from the numeric
               | modifiers to printf and scanf we have fread, fwrite,
               | calloc all taking record sizes
        
               | [deleted]
        
           | shawnz wrote:
           | > ... doesn't mean will get N^2 behavior, unless you do silly
           | things with sscanf, like extract a small token from the
           | beginning of a large string, and then skip past it to do the
           | same thing again after that small token: essentially, build a
           | lexical analyzer in which the pattern matcher consists of
           | calls to scanf. This is not an intended use case for sscanf.
           | 
           | Then why does it support the "%n" specifier? Isn't it exactly
           | for that use case?
        
             | [deleted]
        
           | liuliu wrote:
           | I am confused about this comment. The parent comment gave
           | clear call stack that pointing to the problem where the
           | glibc's sscanf implementation would strlen the input string
           | no matter what format string you use (because it need to
           | masquerade as a file and that requires strlen the whole
           | input: https://github.com/bminor/glibc/blob/21c3f4b5368686ade
           | 28d90d...)
           | 
           | All your magic on format string doesn't change that poor
           | behavior unless you manually add '\0' somewhere to null-
           | terminate the input string early.
           | 
           | If you parse the year-month-day in 100 places of that input
           | string, it will strlen 100 times. And now replace the "100
           | places" to a linear function of the input string length
           | (strlen(input) / 11?), that is quadratic.
        
             | jchw wrote:
             | I agree with you. However, the point of the parent comment
             | is that you should not use scanf to do this; it is not
             | designed to parse a small part of a large string. And
             | indeed, I would say it is bad for more reasons than just
             | this performance cliff. You should really prefer to use
             | strtod or strtol on the exact scanned token instead. Even
             | then, these functions may be impacted by C locale settings,
             | which means you should probably instead prefer using
             | implementations of this algorithm outside of libc...
        
               | josefx wrote:
               | I just ran strtod through a debugger it at least ends up
               | in strlen.S, so it probably wont improve the runtime
               | complexity . It seems to be best to just assume the worst
               | when it comes to any functions in the C standard library
               | that have to do with string handling and avoid them in
               | favor of literally anything else.
        
               | jchw wrote:
               | Did you call strtod with a null endptr? If not, I'd be
               | surprised if it actually strlen'd the input rather than a
               | copy of it. Still, I agree, you shouldn't really use the
               | C standard library to do strings.
        
             | [deleted]
        
           | oceanghost wrote:
           | > This is not an intended use case for sscanf.
           | 
           | It's almost like JSON and XML were invented so people didn't
           | have to understand lex/yacc.
           | 
           | And behold... they don't.
        
           | knuthsat wrote:
           | let's say you have a newline delimited dates in &junk. now
           | you do a while loop, whoops quadratic.
        
             | cozzyd wrote:
             | Well, not if you do something like                  char *
             | start = data;         char * next = strchr(data,'\n');
             | while(next)         {                *next=0;
             | sscanf(start, ...)            start=next+1;            char
             | * next = strchr(start,'\n');         }
        
             | nitrogen wrote:
             | junk is a character, not a string, though.
        
               | JdeBP wrote:
               | The calls to sscanf() in the preceding post have been
               | edited since xe wrote that. My guess is that xe meant the
               | parameter now named "input".
        
               | kazinator wrote:
               | The valid point is that the junk could be arbitrarily
               | long, and that glibc's sscanf will scan through it in
               | taking the length of the input.
               | 
               | But we are not continuing to scan into the junk here to
               | get more tokens here; we don't have a loop that will
               | cause O(N*N) behavior.
               | 
               | E.g. this could be user input from a form field where
               | they are expected to type a date, and we want to reject
               | that if there is trailing junk. At that point, the user
               | gets an error, and has to fix the input. We don't look
               | for anything in the trailing junk.
        
       | dmix wrote:
       | Unrelated but I found this but in the repo interesting:
       | 
       | (Under the heading "is it rapid?" Measuring performance).
       | 
       | > The 450MB/s read rate for JSON puts ryml squarely in the same
       | ballpark as RapidJSON and other fast json readers (data from
       | here). Even parsing full YAML is at ~150MB/s, which is still in
       | that performance ballpark, albeit at its lower end. This is
       | something to be proud of, as the YAML specification is much more
       | complex than JSON: 23449 vs 1969 words.
       | 
       | How did YAML get so complicated. I'd assume it would be much more
       | similarity YAML's quirks adding a few extra things to work on.
       | But that spec is quite huge by comparison (of words alone).
        
         | dragonwriter wrote:
         | > How did YAML get so complicated
         | 
         | YAML has both more syntactical options, and broader semantic
         | coverage; particularly, it includes the concept of schemas
         | defining YAML profiles.
         | 
         | Also, the YAML spec has much more informational content.
         | Examples are a stark illustration: RFC 8259 has 5 example JSON
         | documents (3 of which are simple scalar values). The preview
         | chapter of the YAML 1.2 spec has 28 examples, only a handful of
         | which are single scalars.
        
       | rob74 wrote:
       | Is this good ol' Shlemiel the painter
       | (https://www.joelonsoftware.com/2001/12/11/back-to-basics/)
       | rearing his lovable-but-sometimes-a-little-bit-empty head again?
        
         | thaumasiotes wrote:
         | Same pattern, but what's going on is that sscanf has been
         | implemented as a call to a related function, with setup that
         | incidentally does a bunch of unnecessary work. The scanning
         | isn't what's causing the problem; it's that, before sscanf
         | starts its actual work, it passes over the entire string to see
         | how long it is.
        
       | rpnx wrote:
       | Never been a fan of C style strings for exactly this reason. C
       | style strlen is slow. In C++ strings, the size is included in the
       | string structure. Also, fetching the distance between two
       | std::string::iterator is fast. string_view::size should also be
       | fast when that's available.
        
         | TwoBit wrote:
         | Was there ever anybody a fan of C style strings?
         | 
         | Maybe I misremember, but I think long ago people made arguments
         | that C strings were superior because of how simple they were.
         | Such was the thinking 30 years ago.
        
       | mazykhan1 wrote:
       | (y)
        
       | PaulHoule wrote:
       | Another example of so many things wrong with the status quo.
       | 
       | (1) Naive people find liberation in text-based protocols that
       | don't require tools to understand.
       | 
       | (2) With experience one finds that variable-length data
       | structures are much more treacherous than fixed-length data
       | structures: you find simplicity instead in binary data formats
       | that are dead easy to parse.
       | 
       | (3) Floating point numbers are astonishingly expensive to parse
       | and unparse to ASCII compared to how many FLOPs you could do in
       | that time. All that, for a function that isn't quite correct...
       | That is, the number 1/5, which you write as "0.2" doesn't
       | actually exist in IEEE floats. You don't notice at first because
       | that number unparses as "0.2" -- but then note that "0.1 + 0.2 ==
       | 0.3" is false in most computer languages.
       | 
       | (4) Null-terminated strings are the definition of insanity,
       | rivaled only by "pointers are integers, integers are pointers"
       | 
       | One pet project of mine lately is a "Data Assembler" that takes a
       | graph of data structures (say test data for a family of logic
       | chips, graphics data for a laser light show, or "maps" to upload
       | to the engine control unit in your car) in Python and converts it
       | to a byte array that is incorporated into an Arduino program.
       | 
       | The total size of the structure could be up to 32kbytes on a
       | small device, or up to 250kbytes or so on a big device, but the
       | program memory is huge compared to the RAM which might be as
       | little as 2kb so the unpacking code is going to stream the data
       | with a small working area.
       | 
       | The minimum it has to do is put together blocks of data that have
       | references to one another that are only resolved once all of the
       | blocks are otherwise complete.
       | 
       | I deliberated a lot about how to handle variable length data
       | structures and decided to embed a two-byte length before the
       | block for all blocks. I could have saved some bytes when the
       | length of the structure was known, but then I'd have to figure
       | out how to code lengths then (e.g. for an array of 2-byte ints do
       | I count the ints or bytes) and figured I'd make up my mind once.
        
         | rmrfrmrf wrote:
         | This whole rant is a bit of a non-sequitor for a minor
         | performance bug with a working fix.
        
         | CivBase wrote:
         | > Naive people find liberation in text-based protocols that
         | don't require tools to understand.
         | 
         | I don't think text is necessarily the problem here. Other
         | protocols might allow you to use less memory, but they aren't
         | running into a memory limitation in this instance.
         | 
         | You can divide parsers into two groups: parsers designed to
         | read data which follows a specific schema and parsers designed
         | to read arbitrary data.
         | 
         | Schema-aware parsers have additional context which allows them
         | to process the data much more efficiently. The stricter the
         | schema, the more efficient the parser can be. Meanwhile,
         | parsers for arbitrary data have no such context and must rely
         | completely on syntax to parse the data. A poorly optimized
         | schema-aware parser may still outperform a well optimized
         | parser for arbitrary data.
         | 
         | You can still encode data as text and even use popular syntaxes
         | like JSON or YAML with schema-aware parsers. I've personally
         | helped design a schema-aware parser for XML data. But you
         | should keep in mind that most third-party parsers for those
         | syntaxes will be designed to parse arbitrary data. Parsers for
         | arbitrary data can still be very fast - fast enough for many
         | purposes. However, schema-aware parsers have a significant
         | advantage when it comes to computational performance.
        
           | minitech wrote:
           | Text _is_ the problem, here and everywhere else. Text
           | encourages looking for delimiters, escaping, and flexibility
           | (optional whitespace, case insensitivity, leading zeros,
           | redundant signs, newline style compatibility). Compared to
           | binary with length prefixes, that's _really_ slow, prone to
           | injection vulnerabilities and subtle incompatibility,
           | requires copying, introduces more side channels (the length
           | of your message now depends on the values of numbers inside
           | it!), ....
        
         | astrobe_ wrote:
         | On (2): implicit type promotions with unsigned and signed types
         | can make it no-so-easy in C. FWIW, I always test with bytes
         | with the sign bit on (on two's complement architectures, never
         | worked with others) when I do stuff like that.
         | 
         | On (3): never perform equality tests on FP numbers in languages
         | that use IEEE standard FP format.
         | 
         | On (4): I have worked both with ASCIIZ and counted strings, I
         | think both have their pros and cons. But if you are writing
         | stuff that has any contact with C/C++, it's often a bad idea to
         | go against the stream. No, auto-adding a final 0 to your
         | counted strings won't save the day every time.
        
           | a1369209993 wrote:
           | Never perform equality tests on FP numbers _period_ - it
           | defeats the purpose of using floating point. This is not
           | specific to the IEEE format.
        
         | nly wrote:
         | I agree generally that if you can afford the cycles to
         | serialize and deserialize from decimal ASCII ("0.2") that you
         | can probably afford to use a decimal numeric type in your
         | business logic.
         | 
         | Personally I blame the C standard for not splitting strtod in
         | to 1 parsing function (returning an integer and the power of
         | ten), and 1 computation/conversion function. Without this
         | you're basically hosed with respect to locales.
        
         | nitrogen wrote:
         | _Naive people find liberation in text-based protocols that don
         | 't require tools to understand._
         | 
         | As a huge fan of text-based protocols (but not of JSON or
         | YAML), calling a whole class of debuggability,
         | interoperability, and onboarding benefits "naive" is a bit
         | much.
        
           | PaulHoule wrote:
           | It's a lot of the reason why software has gotten to be so
           | porous in terms of security flaws.
           | 
           | I would trust programmers at Intel to handle FORTRAN-style
           | data structures, and maybe even make a compiler in C.
           | 
           | I wouldn't trust anybody from Intel to write a web server or
           | even a web application with PHP in that wild west of
           | applications programming with strings they are sure to mess
           | it up and thus leave a vulnerability in the management engine
           | on your CPU. It might superficially look like good code, but
           | American Fuzzy Lop will find the branch that isn't obvious,
           | and so will some kid.
        
             | nicoburns wrote:
             | To me the problem seems to be old-school C programmers
             | thinking that they don't need to use a library to parse a
             | standard format. I have only ever seen JSON parsing errors
             | in C and C++ codebases because they are the only languages
             | where developers seem to think it's a good idea to parse
             | JSON using string functions rather than a JSON parser.
        
               | Cloudef wrote:
               | As a C programmer, i can say that most of the json
               | libraries fail on simple fuzzing test, or do some other
               | idiotic stuff like trying to handle numbers (and doing it
               | wrong of course). We have awesome parser generators such
               | as ragel, yet people still handroll parsing code that is
               | both buggy and slow.
        
               | nicoburns wrote:
               | > We have awesome parser generators such as ragel, yet
               | people still handroll parsing code that is both buggy and
               | slow.
               | 
               | Wasn't ragel responsible for the cloudbleed
               | vulnerability? Wherever possible we really shouldn't be
               | using C at all.
        
               | PaulHoule wrote:
               | Some of that is that they don't have systems like maven
               | or npm that would make it really easy to incorporate
               | libraries into the build.
               | 
               | That said, what I like about C is that it is a simple
               | language where you can throw out the standard library and
               | start anew: it makes me think of the old versions of
               | ALGOL that didn't specify mechanisms for I/O, and now we
               | know you can specify that behavior in libraries and leave
               | the language pure.
        
               | nicoburns wrote:
               | > Some of that is that they don't have systems like maven
               | or npm that would make it really easy to incorporate
               | libraries into the build.
               | 
               | Definitely! I wish the C and C++ communities would take
               | this problem seriously.
               | 
               | > That said, what I like about C is that it is a simple
               | language where you can throw out the standard library and
               | start anew
               | 
               | This definitely is a nice property. But it's not just C
               | that can do this. You can for example do this with Rust,
               | and get an excellent package management system (that can
               | indeed package these alternatives to the stdlib) to boot.
        
         | api wrote:
         | I'll add that variable length stuff is inefficient to parse as
         | it breaks instruction level parallelism. Later parsing events
         | become dependent on earlier events.
        
         | adrian_b wrote:
         | Whenever floating-point numbers converted to text are not
         | intended to be used by a human, converting the numbers to
         | decimal is a great mistake in my opinion.
         | 
         | The floating-point numbers should be converted to text as
         | hexadecimal, e.g., in C99 and later, using printf with the %a
         | or %A conversion specifiers.
         | 
         | Even when a human uses the textual floating-point numbers, if
         | they are used only for making copies or comparisons, the
         | hexadecimal numbers are more convenient and they are free of
         | conversion errors.
        
           | orf wrote:
           | How would the compiler be able to use any fast, floating
           | point specific instructions if you parsed it as a hexadecimal
           | string?
        
             | firebacon wrote:
             | You don't need any specific instructions. The hex encoded
             | double contains the final binary representation of the
             | floating point number. So no conversion is required to load
             | it, except maybe for swapping around bytes on some
             | architectures. Conceptually:                   double v;
             | memcpy(&v, "\x00\x00\x00\x00\x00\xe4\x94\x40", sizeof(v));
             | // LE
        
             | MauranKilom wrote:
             | Not sure I understand the question. The suggestion is to
             | use hexfloats, as they are unambiguous and straightforward
             | to convert from and to (as opposed to anything with decimal
             | significands):
             | https://www.cplusplus.com/reference/ios/hexfloat/
        
         | fokinsean wrote:
         | > (4) Null-terminated strings are the definition of insanity
         | 
         | Could you elaborate on this one? I've only used C in college
         | and don't know why this is bad.
        
           | edflsafoiewq wrote:
           | Another reason is a substring of a nul-terminated string is
           | only nul-terminated if it is a suffix. So slicing out a
           | substring generally requires copying to a new buffer so you
           | can append the nul byte.
        
           | ativzzz wrote:
           | It's incredibly easy to do something wrong and cause bugs or
           | security issues in your app.
           | 
           | Or, compare string manipulation in C with any high level
           | language like Ruby, Python, JS and the ease of use is night
           | and day.
        
           | ben509 wrote:
           | Besides the fact that you have to walk the string just to
           | find the length of it...
           | 
           | Null-terminated strings try to save space by encoding the
           | string length by convention. This can fail due to off-by-one
           | errors, mistakenly allocating a fixed buffer, strings that
           | have an unexpected null in them, and more.
           | 
           | Those can lead to all kinds of nasty problems like buffer
           | overflows, which allow someone who can craft an input to
           | write arbitrary stuff into memory.
           | 
           | God only knows how many security vulnerabilities and
           | performance problems could be traced back to null terminated
           | strings.
        
           | birktj wrote:
           | As is the case in this article it often happens that having
           | the length of a string is required for certain operations,
           | which leads to a lot of calls to strlen. This doesn't really
           | seem like such a big problem to begin with as one assumes
           | that the operations that require the length also have linear
           | complexity. However it turns out (like in the article) that
           | this is not always the case which leads to unexpected
           | quadratic algorithms. The pointer + length representation is
           | only 7 bytes longer on modern architectures and does not have
           | this problem and has a few other advantages like being able
           | to create substrings without allocating new memory.
        
             | wnoise wrote:
             | In fact, having the length of the string was not required.
             | Certain ways of implementing scanf() required getting the
             | length of the string, or otherwise scanning the whole
             | thing.
             | 
             | But saner ways did not.
        
           | TwoBit wrote:
           | The decision to use 0-terminated strings in C may be the
           | single worst decision in programming language history, at
           | least from a security standpoint.
        
         | pjc50 wrote:
         | > variable-length data structures are much more treacherous
         | than fixed-length data structures
         | 
         | A fixed-length structure is only fixed until you need to add a
         | field to it. I understand the appeal, but good solutions seem
         | to be very rare here. ASN.1 was possibly the first attempt but
         | turned into a total nightmare. Protobuf seems to be doing a
         | reasonable job, but only as a wire format and not a persistence
         | format.
         | 
         | Text always "works", even if inadequately. And it can be
         | handled with your normal diff and editing tools.
        
           | gpm wrote:
           | We've been using protobuf as a persistence format without
           | issue?
        
             | nitrogen wrote:
             | A self-describing format like MessagePack can be easier to
             | build readers for in other applications, or if the schema
             | is lost or changed.
        
               | gpm wrote:
               | If we've lost access to the schema, then we've lost
               | access to our entire source repository, so we probably
               | don't need to read our data anymore?
               | 
               | I guess I could see that being an interesting property in
               | software that wants to deal with incompatible forks, or
               | software that for whatever reason wants other people to
               | read their data but is unwilling to ship a schema with
               | it, but just for persistence on backend servers it
               | doesn't seem like a very interesting property or worth
               | the extra space requirements.
        
             | pjc50 wrote:
             | You have. Far more apps out there are using JSON or XML,
             | including (as we've seen GTA).
        
           | astrobe_ wrote:
           | > Text always work
           | 
           | Windows vs Linux end-of-line markers often screws-up things.
           | UTF often screws-up things. Tab-vs-space sometimes screws-up
           | things. Text is related to Postel's principle, which is often
           | questioned because it leads to sloppy implementations on one
           | side and complex implementations on the other side, and other
           | bug-is-now-a-feature niceties.
           | 
           | I designed a couple of small internal protocols, and have
           | been maintaining them for at least a decade. Maintenance time
           | is nearly 0 because I never had to add a field to them. Even
           | though the software using them evolved a lot and in
           | unexpected ways.
           | 
           | Yes, they stood the test of time because they solved the
           | problem we had, and those problems didn't evolve much. Not
           | every protocol or format needs to be hyper-extendable and
           | scalable.
           | 
           | > even if inadequately
           | 
           | So it works poorly but it works, so it can be shipped. That's
           | why you buy a new smartphone every two year and don't see
           | much improvement in response time or battery life.
        
         | arbitrage wrote:
         | > Null-terminated strings
         | 
         | There is nothing inherently wrong about a null-terminated
         | string convention.
        
           | PaulHoule wrote:
           | There's no guarantee that it terminates.
        
       | arduinomancer wrote:
       | So if I understand is the issue not that sscanf reads the whole
       | string length, but that people call sscanf multiple times on the
       | same string for each value they want to parse?
       | 
       | Meaning sscanf is actually designed to scan all your values in
       | one pass?
        
         | JdeBP wrote:
         | It isn't _designed_ that way, no. This is simply an internal
         | choice made by C library implementors, and one implementation
         | choice fares very badly in this situation, as the strings grow
         | long. The other choice is pretty much immune to long strings in
         | these scenarios. You cannot infer from this an intended usage
         | of sscanf(), because this issue isn 't part of the interface
         | contract, just an effect of a particular implementation.
        
       | smasher164 wrote:
       | Sounds like a new entry for
       | https://accidentallyquadratic.tumblr.com/?
        
       | high_byte wrote:
       | _GTA loading screen with cyberpunk music on loop intensifies_
        
       | simias wrote:
       | I write a lot of C but I literally never use scanf. C simply
       | lacks the facilities to make such an API truly useful and
       | flexible. You can't scan custom types, error handling is super
       | limited, it's very easy to mess up the format string and
       | potentially end up with corrupted data and security
       | vulnerabilities.
       | 
       | If you need to parse JSON or some high level format look for a
       | lib, for instance I can vouch for cJSON which is very small and
       | has a rather nice API (by C standards at least).
       | 
       | If I need to parse a simple format string like
       | "id:name:description" something like strtok_r does the job to
       | split the components and then you can parse individual components
       | any way you like, potentially offering better error messages.
       | 
       | If I need to parse a complex format then chances are scanf
       | wouldn't do the trick anyway, so it's either custom made parser
       | state machine or lex/yacc.
       | 
       | Note for instance that in the GTA case the call was `sscanf(p,
       | "%d", buf)` which is a glorified strtol with vastly worse
       | performance.
       | 
       | I would seriously advise not using scanf and friends for anything
       | serious. Too many footguns.
        
       | whalesalad wrote:
       | I believe I experienced a form of this recently with a regular
       | expression. I never did figure out what exactly was going on but
       | a certain string would cause the expression to run for an
       | eternity. That was an interesting bug to track down.
        
         | klyrs wrote:
         | Depends what you mean by "regular expression," but matching in
         | perl regexps is NP-Hard.
         | 
         | https://perl.plover.com/NPC/
        
         | secondcoming wrote:
         | A dodgy regex took down Cloudflare!
         | 
         | https://blog.cloudflare.com/details-of-the-cloudflare-outage...
        
       | djhaskin987 wrote:
       | I ran up against this issue when I was creating a parser that
       | parsed debian package information (code here[1]). Trying to match
       | against a line with regex slowed the program down past
       | feasibility. I had to switch to just assuming that two line
       | breaks meant a new package record instead. After that, the
       | program was zippy.
       | 
       | 1: https://github.com/djhaskin987/degasolv/blob/develop/cli-
       | src...
        
         | tambre wrote:
         | Does the control format have an official spec? I too have had
         | to resort to parsing those files with regexes.
        
           | JdeBP wrote:
           | It's chapter 5 of the _Debian Policy Manual_.
        
           | djhaskin987 wrote:
           | Not that I know of, but I have noticed that two newline
           | characters separate package records, and the record is of the
           | form `Key: Value` for each line. As you can see in my code,
           | I've gotten rid of most/all of the regex stuff, because, as
           | we observe, it kills performance.
        
       | yread wrote:
       | Just another example of the old adage "whatever problem you're
       | facing someone on the internet has already seen it and solved
       | it". We just need to get better at searching
        
         | aeturnum wrote:
         | Well, as long as you're facing this problem in 2020 and not
         | around 2012. Even though Rockstar should should have patched
         | GTA by now, battle-tested JSON parsing solutions were a lot
         | rarer 8 years ago.
        
         | ant6n wrote:
         | True. Searching is turning more and more into a not solved
         | problem. For a while, it seemed close to being a solved
         | problem.
        
       ___________________________________________________________________
       (page generated 2021-03-01 23:00 UTC)