[HN Gopher] Parsing can become accidentally quadratic because of...
___________________________________________________________________
Parsing can become accidentally quadratic because of sscanf
Author : JdeBP
Score : 403 points
Date : 2021-03-01 13:53 UTC (9 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| randomNumber7 wrote:
| I don't find this sensational, because I found this behavior
| while writing one of my first C programs a couple of years ago.
|
| Also you just have to google "sscanf slow"....
| petters wrote:
| This is a performance bug in glibc (and others), right? Several
| developers seem to be hit by this.
| brundolf wrote:
| Something I've mused about is whether it would be possible to
| reason about performance characteristics at the type level, to
| statically identify things like quadratic behavior. No idea what
| this would look like; I just see this gap where type systems have
| gotten quite good at ensuring against incorrectness (not C's, of
| course, but that's another discussion), but there's almost no
| equivalent story for catastrophic performance regressions, which
| in some cases can be just as serious as a program-crashing bug.
|
| Basically when it comes to performance we're still in the place
| that we used to be with correctness: "either read the entire
| stack of source code to understand everything that's happening,
| or try and remember to read the docs which may or may not be
| correct and exhaustive and just hope for the best".
|
| The performance characteristics of a piece of code are part of
| its contract with the outside world, but that contract is in no
| way formalized right now.
| voldacar wrote:
| maybe you could encode the number of maximum nested loops into
| the type system, so when defining a function you set a maximum
| number which would fail to compile if exceeded? would be crude
| but possibly helpful at times
| zionic wrote:
| This is brilliant, I too feel like there's an opportunity there
| but don't have any idea off the top of my head how to implement
| such a thing.
| rurban wrote:
| I was hit by this behaviour just a few days ago with the
| fantastic libfuzzer, which finds many more errors than afl++ or
| honggfuzz in much less time. But its buffer strategy is annoying,
| leading to inefficiency and artificial sscanf bugs. Problem is,
| oss-fuzz/libfuzzer gives you a const char* buffer without zero
| termination. In reality every such input buffer is zero
| terminated, because otherwise strtol or sscanf are worthless and
| lead to buffer overflows. They don't even allow adding a zero at
| the end, so you have to copy the whole input buffer just to add
| your zero termination, which makes all the fuzzer speed
| advantages worthless.
| joe_the_user wrote:
| Broadly, it's actually easy to see how naive parsing can become
| quadratic if a programmer isn't careful.
|
| You parse by processing a long string/file via "get-a-token(),
| add-token-to-structure()", repeating until you have built your
| structure or made an error in the construction process (here a
| token is a kind or type of substring, and get-a-token retrieves
| the next such substring in some form).
|
| The problems come because the naive programmer is only thinking
| about their structure-building process, so they use some library
| function to get their tokens. Whether that's a regular expression
| interpreter or some other flexible facility provided by the
| library, it's easy for its definition of a token-string to
| include things "not in the string" as well as things in the
| string - a "quoted string" begins and ends with a quote, so to
| discover that X isn't a quoted string, you have to scan to the
| end. So if things go wrong, each call to get-a-token may scan the
| whole file to make sure this token isn't that. Just an example.
| The problem here seems to be tokenizing floats, which also has a
| "where does it end" problem in C++. And the alternative is parsing
| floats yourself, which indeed no one wants to do since they are
| complex and defined by a complex standard.
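|
| A minimal sketch of that pattern (my own illustration, not code
| from the article), using sscanf() itself as the tokenizer:
|
|     /* On libcs that wrap the string in a FILE and take its length
|        up front, every iteration below rescans the rest of the
|        buffer for the terminating NUL, so the total work grows
|        quadratically with the input size. */
|     #include <stdio.h>
|
|     size_t parse_floats(const char *buf, float *out, size_t max)
|     {
|         size_t count = 0;
|         int consumed = 0;
|
|         while (count < max &&
|                sscanf(buf, "%f%n", &out[count], &consumed) == 1) {
|             buf += consumed;  /* skip past the token just parsed */
|             count++;
|         }
|         return count;
|     }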
| ufo wrote:
| In case anyone missed it, this sscanf behavior was the cause of
| the long loading times in GTA V, a story that hit the front page
| yesterday:
|
| https://news.ycombinator.com/item?id=26296339
| aparsons wrote:
| That was a brilliant investigation. Reminded me of when I used
| to work on Windows debuggers. I wish the author touched on
| symbolization more. That is where a lot of the fun lies when
| dissecting precompiled binaries.
| jjoonathan wrote:
| What are the tools for symbolization? I recently spent a lot
| of time laboriously propagating RTTI information around
| Visual C++ binaries by hand in Ghidra, so I would be very
| interested in learning about this tooling, even if it's all
| IDA for now.
| aparsons wrote:
| It has been 2-3 decades since I've worked on this, so this
| may be outdated advice. In fact, I just learned of Ghidra
| when I read the article.
|
| In general there are three cases:
|
| - Developer forgot to strip/obfuscate symbols, in which
| case you're golden
|
| - Symbol you seek is in a shared library: patch the DLL.
| Often what we did was do investigations in a VM, where the
| n most common shared library calls were patched to send
| stack and symbol data to a local DB. Very similar play to
| RR, but only in the patched code.
|
| - If the (missing) symbol you seek is pre linked (static),
| things get interesting. We built a pattern matcher that was
| incredible. Idea is, if you have the disassembly and a
| loop, like:
|
| int fun(char* str) { call processChar on each uppercase
| character }
|
| It pattern matched a name like
| rax_foreach_iflte_cx_call_obfuscatedSymbolForProcessChar
|
| 90+% of code either:
|
| - enters the same dynamic library trampoline
|
| - displays very very common programming patterns.
|
| From what I know, we never released this outside early MSDN
| members because the 10% is often very useful code and
| developer tools weren't given away for free yet.
|
| But once the DLL is patched, you propagate names up the
| graph. Any remaining symbols were hand-annotated (and we ran
| the propagator again).
|
| Brian Parks went on to extend this system to work with C++
| templates. C++ mangling is very predictable, so his program
| could scavenge for symbols (obfuscators only worked on the
| root symbol name) and build out the class at compile time
| for us to reference.
| wcarss wrote:
| just a little related: I watched a guy step'n'watch the no-
| symbols code once after he hooked WINDBG up to an MS Office
| process that had a paint issue. He narrowed it down to
| something IME was doing -- I gained nothing but awe.
| masklinn wrote:
| Well half the cause (or probably two thirds), as the post shows
| there's a second quadratic blowup when filling the not-a-
| hashmap thing after the very slow parsing has completed.
| crowbahr wrote:
| A filling that wasn't even required because it was de-duping
| already de-duped data instead of just inserting it into a
| list (or a hashmap, as suggested in the post).
| JdeBP wrote:
| I found this while making a collection of what C implementation
| does what at https://news.ycombinator.com/item?id=26298300 .
|
| There are two basic implementation strategies. The BSD (FreeBSD
| and OpenBSD and more than likely NetBSD too), Microsoft, GNU, and
| MUSL C libraries use one, and suffer from this; whereas the
| OpenWatcom, P.J. Plauger, Tru64 Unix, and my standard C libraries
| use another, and do not.
|
| The 2002 report in the comp.lang.c Usenet newsgroup (listed in
| that discussion) is the earliest that I've found so far.
|
| There have been several assertions that library support for
| strings in C++ is poor. However, note that the fix here was to
| switch from the _Standard C library_ function sscanf() to a C++
| library.
|
| * https://github.com/fastfloat
|
| One might advocate using other people's libraries, on the grounds
| that they will have got such things right. However, note that
| this is RapidYAML, and the bug got fixed several years after GTA
| Online was released. Other third-party parsing libraries have
| suffered from being layered on top of sscanf() in other ways.
|
| * https://github.com/json-c/json-c/issues/173
|
| * https://github.com/kgabis/parson/commit/96150ba1fd7f3398aa6a...
| kiwidrew wrote:
| It turns out that the strlen() behaviour was present as far
| back as the 4BSD libc [1] circa _1980_ , and via the Net/2
| releases has found its way into NetBSD/FreeBSD/OpenBSD along
| with countless other BSD-derived libc implementations.
|
| Of course, the underlying vfscanf() implementation that reads
| from a FILE* simply proceeds one character at a time; the
| pseudo-quadratic behaviour arising from the strlen() call is
| simply an artifact of the simplistic "simulate a file using a
| string" wrapper in sscanf().
|
| [1] https://minnie.tuhs.org/cgi-bin/utree.pl?file=4BSD/usr/src/l...
| Jasper_ wrote:
| Musl reads the "file" in 256 byte increments and calls strlen
| on each one. Still a strlen, but on short strings it would only
| be called once, on a short amount of data. It would work fine
| for this use case.
| JdeBP wrote:
| As laid out at https://news.ycombinator.com/item?id=26298300,
| it doesn't call strlen(); and it isn't 256 bytes. Moreover
| that use case is not the use cases at hand, which are people
| parsing 10MiB of JSON and 504KiB of YAML with parsers that
| are not calling sscanf() just once.
|
| MUSL, with the same implementation strategy of nonce FILE
| objects, ends up performing multiple passes over those big
| input strings, once to look for NUL and a second time to pull
| each character out one by one. And it's doing those multiple
| passes multiple times, as it sets up and tears down the nonce
| FILE object each call to sscanf(), meaning that it's scanning
| the same parts of the string for NUL over and over again as
| it repeats the same buffer fill(s) from the input string a
| few characters further on. As I said before, it's not
| _entirely_ free of this behaviour. With the other
| implementation strategy, in contrast, there's only a single
| pass over the whole input string, even with those multiple
| calls to sscanf().
| Jasper_ wrote:
| Let's look at the source:
| https://git.musl-libc.org/cgit/musl/tree/src/stdio/vsscanf.c
|
| It fills in a buf with a string_read callback, but if we
| look at what vfscanf does, it uses shgetc, which eventually
| calls into __uflow, which is where the read is performed:
|
| https://git.musl-libc.org/cgit/musl/tree/src/stdio/vfscanf.c
| https://git.musl-libc.org/cgit/musl/tree/src/internal/shgetc...
| https://git.musl-libc.org/cgit/musl/tree/src/stdio/__uflow.c
|
| This will call into the string_read callback with a buf
| size of 1. The string_read will internally do a memchr(buf,
| 0, len+256);, so, OK, it's memchr-ing at most 257 bytes,
| sorry. The read data then gets stored in a stdio-internal
| buffer (the rpos/rlen below), so this code path will
| actually be skipped on future reads.
|
| If it then needs to unparse a character, it will use ungetc
| to put it back in a stdio-internal buffer. Reading this
| code isn't that hard, I don't know why you keep insisting
| it does something else.
| JdeBP wrote:
| You haven't read the code properly. The memcpy() copies
| _one_ character. After shgetc() reads that and possibly a
| few more characters, the entire FILE object goes away and
| the next call to sscanf() sets up an entirely fresh nonce
| FILE object and _does the initial fill all over again_ ,
| including scanning the stuff that it has already scanned
| for NUL, a few characters further on. As I just
| explained.
| Jasper_ wrote:
| Yes, but it's not scanning the entire string, it's
| scanning at most 256 bytes. In MSVC, the number of bytes
| scanned is `M*N` where M is the number of calls to scanf
| and N is the size of the file. In musl, N is now 257, no
| matter the size of the file. Compared to ~10MB of data,
| that's basically free.
|
| Sure, it's scanning a bit more than maybe it needs to,
| but it's much closer to a non-scanning libc in the amount
| of data it's consuming.
| JdeBP wrote:
| And this will be the third time that I've written the
| phrase "not _entirely_ free of this behaviour ", in
| comparison with the implementations that are entirely
| free of this behaviour.
| wnoise wrote:
| It's not entirely free of overhead. The overhead is
| constant per invocation though, so it's entirely free of
| _quadratic_ behavior. Which is what's being discussed.
| JdeBP wrote:
| No, that was not what was being discussed, neither here
| nor in that discussion. See
| https://news.ycombinator.com/item?id=26297612 . None of
| us mentioned quadratic behaviour. We are talking about
| sscanf() calling strlen(), and in the original discussion
| why.
| wnoise wrote:
| Sure, the quadratic behavior is attached to the "calling
| sscanf in a loop". The point is linear behavior for a
| call to sscanf. Which isn't the case for musl. Nor does
| musl call strlen, or even effectively call it. It will
| bizarrely scan for a NUL in up to 256 bytes ahead of
| where in the string it is currently. But that's not
| actually the same behavior as calling strlen or having
| other linear behavior.
| jka wrote:
| This isn't solely directed towards you, but as a general
| comment: if you're keen to illustrate a point about code
| behaviour (as we often are!) then it can potentially save
| time and ambiguity to include commit references and line
| ranges.
|
| It still doesn't guarantee that the behaviour is stable
| (because dependencies within the linked range may change
| before a reader gets to it) but it can help support your
| message (and hopefully your argument) both at time-of-
| authorship and in future.
|
| (and yep, it's a tradeoff of your own time -- and
| response time -- against future readers' time; often
| tricky to handle. I think code browsers could do a lot
| more to make it easier)
| [deleted]
| [deleted]
| 1f60c wrote:
| I think "(2020)" should be added to the title. (There was a very
| similar story on the front page yesterday, but alas, it's
| ultimately unrelated.)
| dividuum wrote:
| I'd seen the GTA post already, and the sscanf behavior is very
| surprising, as one would expect it to parse the given string only
| until the given format is satisfied or \0 is hit. Seems like this is
| not the case. Found this: https://stackoverflow.com/a/23924112
|
| > sscanf() converts the string you pass in to an _IO_FILE* to
| make the string look like a "file". This is so the same internal
| _IO_vfscanf() can be used for both a string and a FILE*. However,
| as part of that conversion, done in a
| _IO_str_init_static_internal() function, it calls __rawmemchr
| (ptr, '\0'); essentially a strlen() call, on your input string.
|
| Looks correct. Call flow:
|
| https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...
|
| https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...
|
| https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...
|
| https://github.com/bminor/glibc/blob/83908b3a1ea51e3aa7ff422...
| kazinator wrote:
| By itself, that doesn't add up to N-squared behavior, though.
|
| > _It looks like the problem is that reading floats is done
| using c4::from_chars - > c4::atof -> sscanf which ultimately
| calls _IO_str_init_static_internal in the glibc implementation.
| As far as I understand, this essentially causes strlen to be
| called on the string that rapidyaml passes to it, essentially
| causing strlen to be called O(n^2) times._
|
| As such, this doesn't follow. Just because sscanf takes the
| length of the entire source string doesn't mean you will get
| N^2 behavior, unless you do silly things with sscanf, like extract
| a small token from the beginning of a large string, and then
| skip past it to do the same thing again after that small token:
| essentially, build a lexical analyzer in which the pattern
| matcher consists of calls to scanf. This is not an intended use
| case for sscanf. No self-respecting language parser designer
| would do such a thing in a non-throwaway, production program.
|
| sscanf has undefined behaviors for out-of-range numeric values,
| so you can't use it for just blindly extracting numeric tokens
| at all, regardless of any other issues.
|
| sscanf should only be used on a string that your program has
| either prepared itself, or carefully delimited by some other
| means.
|
| For instance, suppose we have a string which holds a date
| (could be an input to the program). We can test whether it's in
| the YYYY-MM-DD format, and get those values out:
|
|     int year, month, day;
|     char junk;
|
|     if (sscanf(input, "%4d-%2d-%2d%c",
|                &year, &month, &day, &junk) == 3) {
|         /* we got it */
|     }
|
| The purpose of junk is to test that there is no trailing
| garbage character. If there is, it will be scanned and the
| return value will be 4. If we don't care about that, we can
| just omit that argument and conversion specifier.
|
| This will accept inputs like 01-1-1. Worse, it will scan
| leading signs!
|
| If you want to insist on nothing but digits, and exact digit
| counts, you can break it up as string tokens first:
|
|     char yearstr[5], monthstr[3], daystr[3];
|     char junk;
|
|     if (sscanf(input, "%4[0-9]-%2[0-9]-%2[0-9]%c",
|                yearstr, monthstr, daystr, &junk) == 3) {
|         int year = atoi(yearstr);
|         int month = atoi(monthstr);
|         int day = atoi(daystr);
|     }
|
| atoi is safe here, by the way, because we know that the
| integers are decimal strings no more than four digits.
|
| This is the sort of thing where sscanf comes in very handy. If
| you know how to use it, it solves these sorts of things nicely,
| and the performance is probably okay. At least not degenerately
| bad.
|
| Complete sample with all the validation:
|
|     #include <stdio.h>
|     #include <stdlib.h>
|     #include <string.h>
|
|     int main(int argc, char **argv)
|     {
|         char yearstr[5] = "", monthstr[3] = "", daystr[3] = "";
|         char junk;
|
|         if (argv[0] && argv[1] &&
|             sscanf(argv[1], "%4[0-9]-%2[0-9]-%2[0-9]%c",
|                    yearstr, monthstr, daystr, &junk) == 3 &&
|             yearstr[3] && monthstr[1] && daystr[1]) {
|             int year = atoi(yearstr);
|             int month = atoi(monthstr);
|             int day = atoi(daystr);
|             printf("%d:%d:%d\n", year, month, day);
|             return 0;
|         }
|         return EXIT_FAILURE;
|     }
|
| Some tests:
|
|     $ ./date asdf
|     $ ./date 1-2-3
|     $ ./date 2020-01-1
|     $ ./date 20-01-01
|     $ ./date 2020-01-01
|     2020:1:1
|     $ ./date 2020-01-01z
|     $ ./date 2020-01-12
|     2020:1:12
| froh wrote:
| to my understanding there is a long input string, completely
| in memory. that string consists of a gazillion floats. the
| string is walked with a pointer, the floats are sscanf-ed one
| after the other. each of the sscanfs will call strlen _once_,
| but each strlen will scan the full remaining huge string to
| find the \0 eos byte.
|
| the sum of all scanned bytes is quadratic to the length of
| the huge input string.
|
| ?
|
| edit: oh the autocorrect...
| kazinator wrote:
| That description seems to resemble my remark:
|
| > _extract a small token from the beginning of a large
| string, and then skip past it to do the same thing again
| after that small token: essentially, build a lexical
| analyzer in which the pattern matcher consists of calls to
| scanf._
|
| Scanning floats with sscanf means undefined behavior on a
| datum like 1.0E999 against a %lf (double), where the
| exponent is out of range. IEEE double goes to E308, if I
| recall.
|
| C99 says _" Unless assignment suppression was indicated by
| a[n asterisk], the result of the conversion is placed in
| the object pointed to by the first argument following the
| format argument that has not already received a conversion
| result. If this object does not have an appropriate type,
| or if the result of the conversion cannot be represented in
| the object, the behavior is undefined."_
|
| I'm seeing that on glibc, 1E999 scans to an Inf, but that
| is not a requirement.
|
| This problem could be solved with sscanf, too, while
| maintaining the degenerate quadratic behavior. sscanf
| patterns could be used to destructure a floating-point
| number textually into pieces like the whole part,
| fractional part and exponent after the E, which could then
| be validated separately, like that the E part is between
| -308 and 308 and so forth. Someone could still do this in a
| loop over a string with millions of floats. I'd say though
| that you're way past the point of needing a real lexical
| analyzer, and not multiple passes through scanf.
| 3xccc wrote:
| What about parsing an unknown number of at-most-4-digit
| integers from a string (being careful, as you
| exemplified)? Isn't that a reasonable use of sscanf?
|
| Do I need a full fledged parser for that? No. Does it
| have to traverse the whole string every time? No, because
| I'm being very clear about the maximum length of the
| thing I'm trying to parse.
|
| You cannot _not_ conclude that it's a bug.
| adwn wrote:
| > _[...] so you can't use it for just blindly extracting
| numeric tokens at all [...]_
|
| So in other words, it's completely useless?
|
| > _This is not an intended use case for sscanf._
|
| Then what is?
| kazinator wrote:
| Yes, scanf is almost completely useless, and sscanf a
| little less so. sscanf is useful if you need a pattern-
| based breakdown of some string datum, and the patterns are
| not out of sscanf's league. It will work "anywhere"; it's
| been in C since before 1989 ANSI C.
|
| In the entire TXR project, I used sscanf in one function: a
| floating to integer conversion routine. The context is that
| integers can be bignums, so we can't just rely on ordinary
| conversion, except in the fixnum range.
|
| http://www.kylheku.com/cgit/txr/tree/arith.c?h=txr-252#n2912
|
| The approach taken is to sprintf the _double_ value into a
| generous buffer, with a generous precision, using the %g
| format which will automatically use the _e_ notation if
| necessary.
|
| We detect whether the _e_ notation was produced, and other
| characteristics, and then proceed accordingly.
|
| I see that an impossible case is being tested for; since
| values with a small magnitude are handled via the fixnum
| route, we will never hit the case where we can short-
| circuit to zero.
|
| Can anyone see a bug here, by the way? :)
| nitrogen wrote:
| The comment to which you reply uses the %[] conversion to
| impose stricter limits on acceptable characters to be
| interpreted as numbers.
|
| My guess at the historical intention of *scanf is that it
| was meant for parsing tab-delimited and fixed-width records
| from old-school data files.
| sdenton4 wrote:
| (As opposed to the comma-separated data files that we
| enlightened future-folk use!)
| Darkphibre wrote:
| You know, why don't we use 0x1F (Unit Separator)? Seems
| like we have had hierarchical delimiters since ASCII /
| punch cards, yet I never hear about them.
| https://en.wikipedia.org/wiki/C0_and_C1_control_codes#Field_...
|
| CSVs are such a pain if you have freeform text (in which
| you have to handle newlines and escaped quotes).
|
| It seems like using FS/GS/RS/US from the C0 control codes
| would be divine (provided there was implicit support for
| viewing/editing them in a simple text editor). I get that
| it's a non-printable control character, and thus not
| traditionally rendered... but that latter point could
| have been done ages ago as a standard convention in
| editors (enter VIM et al., as they will indeed allow you
| to see them).
| jimmaswell wrote:
| It's nice being able to edit a CSV in a text editor. It
| wouldn't be worth it to lose that just to not have to
| escape some strings.
| afiori wrote:
| More than that, it is nice to edit files with a standard
| keyboard.
| db48x wrote:
| If your editor doesn't allow you to enter arbitrary
| characters using a standard keyboard then you need a
| better editor.
| sp332 wrote:
| It wouldn't take that much effort to add a glyph in the
| editor for it?
| [deleted]
| JdeBP wrote:
| Text editors that can put control characters into files,
| and indeed PC keyboards that have a [Control] key, have
| been around for decades, though. For example: With
| WordStar, SideKick, and Microsoft's EDIT, one just typed
| [Control]+[P] and then the control character. M.
| Darkphibre mentioned VIM, but really _this_ isn't an
| idea that was ever limited to just VIM and emacs.
| nitrogen wrote:
| No matter what delimiters we use, we still have either a
| data sanitization problem or a data escaping problem.
| I've worked with a wire protocol that used the ASCII
| record delimiters, but with binary data so you also had
| to use binary escape (0x1b) and set the eighth bit on the
| following byte.
| 0xbadcafebee wrote:
| There are plenty of ways a lazy programmer will want to
| pass the records into something else that either doesn't
| handle non-printables, or does but the user doesn't like
| how it looks.
|
| TSV also works more reliably than CSV, because most
| people don't put tab characters in the data in these
| kinds of records. Tab is even the default field delimiter
| for _cut_. But everybody uses CSV, because again it's
| easier to reason about the above. _(shrug)_
| elygre wrote:
| Which is a mess, because some parts of the world use
| the comma as the decimal separator: 3,14159 etc. So we have
| to use semicolons for our CSVs.
| bmn__ wrote:
| > some parts of the world
|
| _most_ parts
|
| Why can't you be normal?
| https://commons.wikimedia.org/wiki/File:DecimalSeparator.svg
| unilynx wrote:
| C is full of record-based stuff. Apart from the numeric
| modifiers to printf and scanf, we have fread, fwrite, and
| calloc all taking record sizes.
| [deleted]
| shawnz wrote:
| > ... doesn't mean you will get N^2 behavior, unless you do silly
| things with sscanf, like extract a small token from the
| beginning of a large string, and then skip past it to do the
| same thing again after that small token: essentially, build a
| lexical analyzer in which the pattern matcher consists of
| calls to scanf. This is not an intended use case for sscanf.
|
| Then why does it support the "%n" specifier? Isn't it exactly
| for that use case?
| [deleted]
| liuliu wrote:
| I am confused about this comment. The parent comment gave a
| clear call stack pointing to the problem: glibc's sscanf
| implementation will strlen the input string no matter what
| format string you use (because it needs to masquerade as a
| file, and that requires strlen on the whole input:
| https://github.com/bminor/glibc/blob/21c3f4b5368686ade28d90d...)
|
| All your format string magic doesn't change that poor
| behavior unless you manually add a '\0' somewhere to null-
| terminate the input string early.
|
| If you parse the year-month-day in 100 places of that input
| string, it will strlen 100 times. And now replace the "100
| places" with a linear function of the input string length
| (strlen(input) / 11?), and that is quadratic.
| jchw wrote:
| I agree with you. However, the point of the parent comment
| is that you should not use scanf to do this; it is not
| designed to parse a small part of a large string. And
| indeed, I would say it is bad for more reasons than just
| this performance cliff. You should really prefer to use
| strtod or strtol on the exact scanned token instead. Even
| then, these functions may be impacted by C locale settings,
| which means you should probably instead prefer using
| implementations of this algorithm outside of libc...
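|
| A rough sketch of that strtod()-based approach (my own
| illustration; whether a given libc's strtod scans further
| internally is another matter):
|
|     #include <stdlib.h>
|
|     /* One pass over a big buffer of numbers: strtod's end pointer
|        tells us where each number stops, so no call ever needs the
|        length of the whole string. */
|     size_t parse_doubles(const char *buf, double *out, size_t max)
|     {
|         size_t count = 0;
|         char *end;
|
|         while (count < max) {
|             double v = strtod(buf, &end);
|             if (end == buf)       /* no conversion: stop */
|                 break;
|             out[count++] = v;
|             buf = end;            /* continue right after it */
|         }
|         return count;
|     }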
| josefx wrote:
| I just ran strtod through a debugger; it at least ends up
| in strlen.S, so it probably won't improve the runtime
| complexity. It seems best to just assume the worst
| when it comes to any functions in the C standard library
| that have to do with string handling and avoid them in
| favor of literally anything else.
| jchw wrote:
| Did you call strtod with a null endptr? If not, I'd be
| surprised if it actually strlen'd the input rather than a
| copy of it. Still, I agree, you shouldn't really use the
| C standard library to do strings.
| [deleted]
| oceanghost wrote:
| > This is not an intended use case for sscanf.
|
| It's almost like JSON and XML were invented so people didn't
| have to understand lex/yacc.
|
| And behold... they don't.
| knuthsat wrote:
| let's say you have newline-delimited dates in &junk. now
| you do a while loop - whoops, quadratic.
| cozzyd wrote:
| Well, not if you do something like
|
|     char *start = data;
|     char *next = strchr(data, '\n');
|
|     while (next) {
|         *next = 0;
|         sscanf(start, ...);      /* parse this line */
|         start = next + 1;
|         next = strchr(start, '\n');
|     }
| nitrogen wrote:
| junk is a character, not a string, though.
| JdeBP wrote:
| The calls to sscanf() in the preceding post have been
| edited since xe wrote that. My guess is that xe meant the
| parameter now named "input".
| kazinator wrote:
| The valid point is that the junk could be arbitrarily
| long, and that glibc's sscanf will scan through it in
| taking the length of the input.
|
| But we are not continuing to scan into the junk to get
| more tokens here; we don't have a loop that will cause
| O(N*N) behavior.
|
| E.g. this could be user input from a form field where
| they are expected to type a date, and we want to reject
| that if there is trailing junk. At that point, the user
| gets an error, and has to fix the input. We don't look
| for anything in the trailing junk.
| dmix wrote:
| Unrelated, but I found this bit in the repo interesting:
|
| (Under the heading "is it rapid?", measuring performance.)
|
| > The 450MB/s read rate for JSON puts ryml squarely in the same
| ballpark as RapidJSON and other fast json readers (data from
| here). Even parsing full YAML is at ~150MB/s, which is still in
| that performance ballpark, albeit at its lower end. This is
| something to be proud of, as the YAML specification is much more
| complex than JSON: 23449 vs 1969 words.
|
| How did YAML get so complicated? I'd have assumed it would be much
| more similar, with YAML's quirks adding a few extra things to work
| on. But that spec is quite huge by comparison (in words alone).
| dragonwriter wrote:
| > How did YAML get so complicated
|
| YAML has both more syntactical options, and broader semantic
| coverage; particularly, it includes the concept of schemas
| defining YAML profiles.
|
| Also, the YAML spec has much more informational content.
| Examples are a stark illustration: RFC 8259 has 5 example JSON
| documents (3 of which are simple scalar values). The preview
| chapter of the YAML 1.2 spec has 28 examples, only a handful of
| which are single scalars.
| rob74 wrote:
| Is this good ol' Shlemiel the painter
| (https://www.joelonsoftware.com/2001/12/11/back-to-basics/)
| rearing his lovable-but-sometimes-a-little-bit-empty head again?
| thaumasiotes wrote:
| Same pattern, but what's going on is that sscanf has been
| implemented as a call to a related function, with setup that
| incidentally does a bunch of unnecessary work. The scanning
| isn't what's causing the problem; it's that, before sscanf
| starts its actual work, it passes over the entire string to see
| how long it is.
| rpnx wrote:
| Never been a fan of C style strings for exactly this reason. C
| style strlen is slow. In C++ strings, the size is included in the
| string structure. Also, fetching the distance between two
| std::string::iterator is fast. string_view::size should also be
| fast when that's available.
| TwoBit wrote:
| Was anybody ever a fan of C-style strings?
|
| Maybe I misremember, but I think long ago people made arguments
| that C strings were superior because of how simple they were.
| Such was the thinking 30 years ago.
| mazykhan1 wrote:
| (y)
| PaulHoule wrote:
| Another example of so many things wrong with the status quo.
|
| (1) Naive people find liberation in text-based protocols that
| don't require tools to understand.
|
| (2) With experience one finds that variable-length data
| structures are much more treacherous than fixed-length data
| structures: you find simplicity instead in binary data formats
| that are dead easy to parse.
|
| (3) Floating point numbers are astonishingly expensive to parse
| and unparse to ASCII compared to how many FLOPs you could do in
| that time. All that, for a function that isn't quite correct...
| That is, the number 1/5, which you write as "0.2" doesn't
| actually exist in IEEE floats. You don't notice at first because
| that number unparses as "0.2" -- but then note that "0.1 + 0.2 ==
| 0.3" is false in most computer languages.
|
| (4) Null-terminated strings are the definition of insanity,
| rivaled only by "pointers are integers, integers are pointers"
|
| One pet project of mine lately is a "Data Assembler" that takes a
| graph of data structures (say test data for a family of logic
| chips, graphics data for a laser light show, or "maps" to upload
| to the engine control unit in your car) in Python and converts it
| to a byte array that is incorporated into an Arduino program.
|
| The total size of the structure could be up to 32kbytes on a
| small device, or up to 250kbytes or so on a big device, but the
| program memory is huge compared to the RAM which might be as
| little as 2kb so the unpacking code is going to stream the data
| with a small working area.
|
| The minimum it has to do is put together blocks of data that have
| references to one another that are only resolved once all of the
| blocks are otherwise complete.
|
| I deliberated a lot about how to handle variable length data
| structures and decided to embed a two-byte length before the
| block for all blocks. I could have saved some bytes when the
| length of the structure was known, but then I'd have to figure
| out how to code lengths then (e.g. for an array of 2-byte ints do
| I count the ints or bytes) and figured I'd make up my mind once.
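|
| For illustration, a sketch of the "two-byte length before the
| block" framing (my own sketch; the byte order here is an
| assumption, not necessarily what the real format uses):
|
|     #include <stdint.h>
|     #include <string.h>
|
|     /* Append one length-prefixed block to out; returns the number
|        of bytes written, or 0 if it does not fit. */
|     size_t emit_block(uint8_t *out, size_t cap,
|                       const uint8_t *block, uint16_t len)
|     {
|         if (cap < (size_t)len + 2)
|             return 0;
|         out[0] = (uint8_t)(len & 0xff);   /* low byte of length  */
|         out[1] = (uint8_t)(len >> 8);     /* high byte of length */
|         memcpy(out + 2, block, len);
|         return (size_t)len + 2;
|     }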
| rmrfrmrf wrote:
| This whole rant is a bit of a non sequitur for a minor
| performance bug with a working fix.
| CivBase wrote:
| > Naive people find liberation in text-based protocols that
| don't require tools to understand.
|
| I don't think text is necessarily the problem here. Other
| protocols might allow you to use less memory, but they aren't
| running into a memory limitation in this instance.
|
| You can divide parsers into two groups: parsers designed to
| read data which follows a specific schema and parsers designed
| to read arbitrary data.
|
| Schema-aware parsers have additional context which allows them
| to process the data much more efficiently. The stricter the
| schema, the more efficient the parser can be. Meanwhile,
| parsers for arbitrary data have no such context and must rely
| completely on syntax to parse the data. A poorly optimized
| schema-aware parser may still outperform a well optimized
| parser for arbitrary data.
|
| You can still encode data as text and even use popular syntaxes
| like JSON or YAML with schema-aware parsers. I've personally
| helped design a schema-aware parser for XML data. But you
| should keep in mind that most third-party parsers for those
| syntaxes will be designed to parse arbitrary data. Parsers for
| arbitrary data can still be very fast - fast enough for many
| purposes. However, schema-aware parsers have a significant
| advantage when it comes to computational performance.
| minitech wrote:
| Text _is_ the problem, here and everywhere else. Text
| encourages looking for delimiters, escaping, and flexibility
| (optional whitespace, case insensitivity, leading zeros,
| redundant signs, newline style compatibility). Compared to
| binary with length prefixes, that's _really_ slow, prone to
| injection vulnerabilities and subtle incompatibility,
| requires copying, introduces more side channels (the length
| of your message now depends on the values of numbers inside
| it!), ....
| astrobe_ wrote:
| On (2): implicit type promotions with unsigned and signed types
| can make it not-so-easy in C. FWIW, I always test with bytes
| with the sign bit on (on two's complement architectures, never
| worked with others) when I do stuff like that.
|
| On (3): never perform equality tests on FP numbers in languages
| that use IEEE standard FP format.
|
| On (4): I have worked both with ASCIIZ and counted strings, I
| think both have their pros and cons. But if you are writing
| stuff that has any contact with C/C++, it's often a bad idea to
| go against the stream. No, auto-adding a final 0 to your
| counted strings won't save the day every time.
| a1369209993 wrote:
| Never perform equality tests on FP numbers _period_ - it
| defeats the purpose of using floating point. This is not
| specific to the IEEE format.
| nly wrote:
| I agree generally that if you can afford the cycles to
| serialize and deserialize from decimal ASCII ("0.2") that you
| can probably afford to use a decimal numeric type in your
| business logic.
|
| Personally I blame the C standard for not splitting strtod
| into one parsing function (returning an integer and the power of
| ten) and one computation/conversion function. Without this
| you're basically hosed with respect to locales.
| nitrogen wrote:
| _Naive people find liberation in text-based protocols that
| don't require tools to understand._
|
| As a huge fan of text-based protocols (but not of JSON or
| YAML), calling a whole class of debuggability,
| interoperability, and onboarding benefits "naive" is a bit
| much.
| PaulHoule wrote:
| It's a lot of the reason why software has gotten to be so
| porous in terms of security flaws.
|
| I would trust programmers at Intel to handle FORTRAN-style
| data structures, and maybe even make a compiler in C.
|
| I wouldn't trust anybody from Intel to write a web server or
| even a web application with PHP. In that wild west of
| applications programming with strings, they are sure to mess
| it up and thus leave a vulnerability in the management engine
| on your CPU. It might superficially look like good code, but
| American Fuzzy Lop will find the branch that isn't obvious,
| and so will some kid.
| nicoburns wrote:
| To me the problem seems to be old-school C programmers
| thinking that they don't need to use a library to parse a
| standard format. I have only ever seen JSON parsing errors
| in C and C++ codebases because they are the only languages
| where developers seem to think it's a good idea to parse
| JSON using string functions rather than a JSON parser.
| Cloudef wrote:
| As a C programmer, I can say that most of the json
| libraries fail simple fuzzing tests, or do some other
| idiotic stuff like trying to handle numbers (and doing it
| wrong of course). We have awesome parser generators such
| as ragel, yet people still handroll parsing code that is
| both buggy and slow.
| nicoburns wrote:
| > We have awesome parser generators such as ragel, yet
| people still handroll parsing code that is both buggy and
| slow.
|
| Wasn't ragel responsible for the cloudbleed
| vulnerability? Wherever possible we really shouldn't be
| using C at all.
| PaulHoule wrote:
| Some of that is that they don't have systems like maven
| or npm that would make it really easy to incorporate
| libraries into the build.
|
| That said, what I like about C is that it is a simple
| language where you can throw out the standard library and
| start anew: it makes me think of the old versions of
| ALGOL that didn't specify mechanisms for I/O, and now we
| know you can specify that behavior in libraries and leave
| the language pure.
| nicoburns wrote:
| > Some of that is that they don't have systems like maven
| or npm that would make it really easy to incorporate
| libraries into the build.
|
| Definitely! I wish the C and C++ communities would take
| this problem seriously.
|
| > That said, what I like about C is that it is a simple
| language where you can throw out the standard library and
| start anew
|
| This definitely is a nice property. But it's not just C
| that can do this. You can for example do this with Rust,
| and get an excellent package management system (that can
| indeed package these alternatives to the stdlib) to boot.
| api wrote:
| I'll add that variable length stuff is inefficient to parse as
| it breaks instruction level parallelism. Later parsing events
| become dependent on earlier events.
| adrian_b wrote:
| Whenever floating-point numbers converted to text are not
| intended to be used by a human, converting the numbers to
| decimal is a great mistake in my opinion.
|
| The floating-point numbers should be converted to text as
| hexadecimal, e.g., in C99 and later, using printf with the %a
| or %A conversion specifiers.
|
| Even when a human uses the textual floating-point numbers, if
| they are used only for making copies or comparisons, the
| hexadecimal numbers are more convenient and they are free of
| conversion errors.
| orf wrote:
| How would the compiler be able to use any fast, floating
| point specific instructions if you parsed it as a hexadecimal
| string?
| firebacon wrote:
| You don't need any specific instructions. The hex encoded
| double contains the final binary representation of the
| floating point number. So no conversion is required to load
| it, except maybe for swapping around bytes on some
| architectures. Conceptually:
|
|     double v;
|     memcpy(&v, "\x00\x00\x00\x00\x00\xe4\x94\x40", sizeof(v)); // LE
| MauranKilom wrote:
| Not sure I understand the question. The suggestion is to
| use hexfloats, as they are unambiguous and straightforward
| to convert from and to (as opposed to anything with decimal
| significands):
| https://www.cplusplus.com/reference/ios/hexfloat/
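|
| A tiny round-trip example (C99; my own sketch):
|
|     #include <stdio.h>
|     #include <stdlib.h>
|
|     int main(void)
|     {
|         double x = 0.2;
|         char buf[64];
|
|         /* Print the exact binary value as a hexfloat, e.g.
|            "0x1.999999999999ap-3", then read it back. */
|         snprintf(buf, sizeof buf, "%a", x);
|         double y = strtod(buf, NULL);   /* strtod accepts hexfloats */
|
|         printf("%s -> %s\n", buf, x == y ? "exact" : "lossy");
|         return 0;
|     }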
| fokinsean wrote:
| > (4) Null-terminated strings are the definition of insanity
|
| Could you elaborate on this one? I've only used C in college
| and don't know why this is bad.
| edflsafoiewq wrote:
| Another reason is that a substring of a nul-terminated string
| is only nul-terminated if it is a suffix. So slicing out a
| substring generally requires copying to a new buffer so you
| can append the nul byte.
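|
| A sketch of what that usually looks like in practice (hypothetical
| helper, for illustration):
|
|     #include <stdlib.h>
|     #include <string.h>
|
|     /* Copy n bytes starting at s + start into a fresh buffer and
|        terminate it, because the slice itself has no NUL. */
|     char *substr_dup(const char *s, size_t start, size_t n)
|     {
|         char *out = malloc(n + 1);
|         if (out) {
|             memcpy(out, s + start, n);
|             out[n] = '\0';
|         }
|         return out;
|     }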
| ativzzz wrote:
| It's incredibly easy to do something wrong and cause bugs or
| security issues in your app.
|
| Or, compare string manipulation in C with any high level
| language like Ruby, Python, JS and the ease of use is night
| and day.
| ben509 wrote:
| Besides the fact that you have to walk the string just to
| find the length of it...
|
| Null-terminated strings try to save space by encoding the
| string length by convention. This can fail due to off-by-one
| errors, mistakenly allocating a fixed buffer, strings that
| have an unexpected null in them, and more.
|
| Those can lead to all kinds of nasty problems like buffer
| overflows, which allow someone who can craft an input to
| write arbitrary stuff into memory.
|
| God only knows how many security vulnerabilities and
| performance problems could be traced back to null terminated
| strings.
| birktj wrote:
| As is the case in this article, it often happens that having
| the length of a string is required for certain operations,
| which leads to a lot of calls to strlen. This doesn't really
| seem like such a big problem to begin with, as one assumes
| that the operations that require the length also have linear
| complexity. However it turns out (as in the article) that
| this is not always the case, which leads to unexpected
| quadratic algorithms. The pointer + length representation is
| only 7 bytes longer on modern architectures and does not have
| this problem and has a few other advantages like being able
| to create substrings without allocating new memory.
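|
| A sketch of that representation (names are made up for
| illustration):
|
|     #include <stddef.h>
|
|     struct strview {
|         const char *data;
|         size_t      len;
|     };
|
|     /* O(1): no scan for a terminator, and no copy or allocation
|        is needed to refer to a substring. */
|     struct strview sv_sub(struct strview s, size_t start, size_t n)
|     {
|         struct strview out = { s.data + start, n };
|         return out;
|     }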
| wnoise wrote:
| In fact, having the length of the string was not required.
| Certain ways of implementing scanf() required getting the
| length of the string, or otherwise scanning the whole
| thing.
|
| But saner ways did not.
| TwoBit wrote:
| The decision to use 0-terminated strings in C may be the
| single worst decision in programming language history, at
| least from a security standpoint.
| pjc50 wrote:
| > variable-length data structures are much more treacherous
| than fixed-length data structures
|
| A fixed-length structure is only fixed until you need to add a
| field to it. I understand the appeal, but good solutions seem
| to be very rare here. ASN.1 was possibly the first attempt but
| turned into a total nightmare. Protobuf seems to be doing a
| reasonable job, but only as a wire format and not a persistence
| format.
|
| Text always "works", even if inadequately. And it can be
| handled with your normal diff and editing tools.
| gpm wrote:
| We've been using protobuf as a persistence format without
| issue?
| nitrogen wrote:
| A self-describing format like MessagePack can be easier to
| build readers for in other applications, or if the schema
| is lost or changed.
| gpm wrote:
| If we've lost access to the schema, then we've lost
| access to our entire source repository, so we probably
| don't need to read our data anymore?
|
| I guess I could see that being an interesting property in
| software that wants to deal with incompatible forks, or
| software that for whatever reason wants other people to
| read their data but is unwilling to ship a schema with
| it, but just for persistence on backend servers it
| doesn't seem like a very interesting property or worth
| the extra space requirements.
| pjc50 wrote:
| You have. Far more apps out there are using JSON or XML,
| including (as we've seen) GTA.
| astrobe_ wrote:
| > Text always "works"
|
| Windows vs Linux end-of-line markers often screw things up.
| UTF often screws things up. Tab-vs-space sometimes screws
| things up. Text is related to Postel's principle, which is often
| questioned because it leads to sloppy implementations on one
| side and complex implementations on the other side, and other
| bug-is-now-a-feature niceties.
|
| I designed a couple of small internal protocols, and have
| been maintaining them for at least a decade. Maintenance time
| is nearly 0 because I never had to add a field to them. Even
| though the software using them evolved a lot and in
| unexpected ways.
|
| Yes, they stood the test of time because they solved the
| problem we had, and those problems didn't evolve much. Not
| every protocol or format needs to be hyper-extendable and
| scalable.
|
| > even if inadequately
|
| So it works poorly but it works, so it can be shipped. That's
| why you buy a new smartphone every two years and don't see
| much improvement in response time or battery life.
| arbitrage wrote:
| > Null-terminated strings
|
| There is nothing inherently wrong about a null-terminated
| string convention.
| PaulHoule wrote:
| There's no guarantee that it terminates.
| arduinomancer wrote:
| So if I understand correctly, the issue is not that sscanf reads
| the whole string length, but that people call sscanf multiple
| times on the same string, once for each value they want to parse?
|
| Meaning sscanf is actually designed to scan all your values in
| one pass?
| JdeBP wrote:
| It isn't _designed_ that way, no. This is simply an internal
| choice made by C library implementors, and one implementation
| choice fares very badly in this situation, as the strings grow
| long. The other choice is pretty much immune to long strings in
| these scenarios. You cannot infer from this an intended usage
| of sscanf(), because this issue isn't part of the interface
| contract, just an effect of a particular implementation.
| smasher164 wrote:
| Sounds like a new entry for
| https://accidentallyquadratic.tumblr.com/?
| high_byte wrote:
| _GTA loading screen with cyberpunk music on loop intensifies_
| simias wrote:
| I write a lot of C but I literally never use scanf. C simply
| lacks the facilities to make such an API truly useful and
| flexible. You can't scan custom types, error handling is super
| limited, it's very easy to mess up the format string and
| potentially end up with corrupted data and security
| vulnerabilities.
|
| If you need to parse JSON or some high level format look for a
| lib, for instance I can vouch for cJSON which is very small and
| has a rather nice API (by C standards at least).
|
| If I need to parse a simple format string like
| "id:name:description" something like strtok_r does the job to
| split the components and then you can parse individual components
| any way you like, potentially offering better error messages.
|
| If I need to parse a complex format then chances are scanf
| wouldn't do the trick anyway, so it's either custom made parser
| state machine or lex/yacc.
|
| Note for instance that in the GTA case the call was `sscanf(p,
| "%d", buf)` which is a glorified strtol with vastly worse
| performance.
|
| I would seriously advise not using scanf and friends for anything
| serious. Too many footguns.
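|
| For instance, a minimal strtok_r sketch for that kind of record
| (POSIX strtok_r; illustrative only, and note that strtok_r
| modifies its input and collapses empty fields):
|
|     #include <stdio.h>
|     #include <string.h>
|
|     int main(void)
|     {
|         char record[] = "42:widget:a small example record";
|         char *save = NULL;
|
|         char *id   = strtok_r(record, ":", &save);
|         char *name = strtok_r(NULL, ":", &save);
|         char *desc = strtok_r(NULL, ":", &save);
|
|         if (id && name && desc)
|             printf("id=%s name=%s desc=%s\n", id, name, desc);
|         return 0;
|     }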
| whalesalad wrote:
| I believe I experienced a form of this recently with a regular
| expression. I never did figure out what exactly was going on but
| a certain string would cause the expression to run for an
| eternity. That was an interesting bug to track down.
| klyrs wrote:
| Depends what you mean by "regular expression," but matching in
| perl regexps is NP-Hard.
|
| https://perl.plover.com/NPC/
| secondcoming wrote:
| A dodgy regex took down Cloudflare!
|
| https://blog.cloudflare.com/details-of-the-cloudflare-outage...
| djhaskin987 wrote:
| I ran up against this issue when I was creating a parser that
| parsed debian package information (code here[1]). Trying to match
| against a line with regex slowed the program down past
| feasibility. I had to switch to just assuming that two line
| breaks meant a new package record instead. After that, the
| program was zippy.
|
| 1: https://github.com/djhaskin987/degasolv/blob/develop/cli-src...
| tambre wrote:
| Does the control format have an official spec? I too have had
| to resort to parsing those files with regexes.
| JdeBP wrote:
| It's chapter 5 of the _Debian Policy Manual_.
| djhaskin987 wrote:
| Not that I know of, but I have noticed that two newline
| characters separate package records, and the record is of the
| form `Key: Value` for each line. As you can see in my code,
| I've gotten rid of most/all of the regex stuff, because, as
| we observe, it kills performance.
| yread wrote:
| Just another example of the old adage "whatever problem you're
| facing someone on the internet has already seen it and solved
| it". We just need to get better at searching
| aeturnum wrote:
| Well, as long as you're facing this problem in 2020 and not
| around 2012. Even though Rockstar should have patched
| GTA by now, battle-tested JSON parsing solutions were a lot
| rarer 8 years ago.
| ant6n wrote:
| True. Searching is turning more and more into an unsolved
| problem. For a while, it seemed close to being a solved
| problem.
___________________________________________________________________
(page generated 2021-03-01 23:00 UTC)