[HN Gopher] Parsing time stamps faster with SIMD instructions
___________________________________________________________________
Parsing time stamps faster with SIMD instructions
Author : mfiguiere
Score : 179 points
Date : 2023-07-02 14:50 UTC (8 hours ago)
(HTM) web link (lemire.me)
(TXT) w3m dump (lemire.me)
| victor82 wrote:
| The solution presented doesn't seem competitive enough to rank
| well on the problem posed here: https://highload.fun/tasks/14
|
| That site is really good to play with SIMD code.
| reaperman wrote:
| How can I see Yuriy's top solution for this task?
| anonymoushn wrote:
| You can ask him but he probably won't share
| reaperman wrote:
| Kind of a weird leaderboard if you can't even check whether
| answers are correct. I've seen plenty of leaderboards where
| the top answers would fail cases that weren't covered by the
| leaderboard's testing.
| paulmd wrote:
| Yeah, most of the leetcode top solutions are a hardcoded
| table of answers for the scoring questions followed by
| the general solution.
| anonymoushn wrote:
| I think it's not such a big deal because the input
| generators are usually pretty good and because most
| solutions that give up some correctness for performance
| can pretty cheaply detect that they done goofed and fall
| back to a slow path where they start over and do it some
| other way. This ends up being useful because scoring is
| the median runtime of 3 successful runs (so if you fail
| one of the 3 then your submission fails completely). It
| also means that probabilistic solutions that would be
| correct in non-adversarial environments (like using a
| bloom filter or N bloom filters tuned to give an exact
| result given the input constraints with probability
| >99.9%) are admissible.
| victor82 wrote:
| The solution code is not available, since you can continuously
| update your solution.
| pvitz wrote:
| What is actually the task here? Parsing is clear and it looks
| like one has to sum all parsed timestamps, but what does the
| additional comment about the shortest time mean?
| anonymoushn wrote:
| It means that your solution is scored on how fast it runs.
|
| Frequently we end up with solutions that use tricks that
| would not be applicable to the n=1 problem in TFA ("parse 1
| timestamp") when we try to get the fastest solutions for
| "find the sum of all these timestamps."
| renewiltord wrote:
| Is the code for some of the solutions available? Thanks for the
| link. That's a fun place.
| jeffbee wrote:
| I don't think I have ever seen %Y%m%d%H%M%S used in real life.
| wood_spirit wrote:
| in datasets that I've worked with it is so incredibly common
| and so incredibly slow that I've previously made a fast library
| to work with them: https://github.com/williame/TimeMillis
|
| Not as fast as SIMD, perhaps, but much faster than the Java
| standard library...
| jeffbee wrote:
| "The TimeMillis library offers helper functions to read,
| write and operate on timestamps stored as milliseconds in
| UNIX epoch."
|
| That doesn't seem related?
| gliptic wrote:
| The parse function takes a string in the format mentioned.
| tuukkah wrote:
| Sweden uses "%Y-%m-%d %H:%M:%S" - removal of the separators
| using SIMD left as an exercise ;-)
| https://en.wikipedia.org/wiki/Date_and_time_notation_in_Swed...
| GordonS wrote:
| China does too, and IIRC Japan and South Korea.
| tjoff wrote:
| You mean without delimiters, or? Stripping superfluous
| delimiters in logs isn't unusual.
| vardump wrote:
| That's a very common format.
| [deleted]
| morelisp wrote:
| Other than RRSIG (which I would argue is not at all common),
| where have you seen it?
| masklinn wrote:
| Timestamped filenames.
| morelisp wrote:
| Whenever I've seen it on timestamped filenames I've also
| needed sub-second precision.
|
| If that's a performance concern, a (potentially
| zero-padded, but realistically you don't even need
| that) Unix timestamp works just as well.
| wongarsu wrote:
| I've seen lots of %Y%m%d on filenames timestamped by
| humans.
|
| People timestamp stuff for different reasons. If you do
| it to avoid collisions then adding sub-second precision
| makes sense, but in many contexts second or even day
| precision is all you care about.
|
| The important point is that %Y%m%d%H%M%S is human
| parsable and still sorts correctly (though humans
| typically prefer %Y-%m-%d %H:%M:%S for readability)
| horse666 wrote:
| [dead]
| dahart wrote:
| Nothing wrong with not having seen it, but I default to time
| stamps like this for log files, I use it all the time in
| scripts. I even also do it manually and have an alias in my
| .bashrc I call 'now' that spits out a call to date with that
| format string. This is, obviously, useful for organizing and
| sorting and finding files & folders by date & time. (And,
| hopefully obviously, creation & mod times on files are
| completely unreliable for this once you factor in copying &
| archiving.)
| lyu07282 wrote:
| This is really cool. I recently wanted to implement an
| ARGB->RGBA conversion function with SIMD in Rust, but as far as
| I can tell SIMD is STILL not in stable Rust :(
| mirsadm wrote:
| You won't get much speed up using SIMD for such a simple task.
| You'll be memory bandwidth limited.
| lpghatguy wrote:
| The CPU-specific intrinsics stabilized a little while ago for
| cases where you need SIMD on a specific platform, at least. A
| lot of crates like glam seem to only support SIMD on x86/x86-64
| for that reason.
|
| I know that `wide`, `faster`, and `simdeez` are options for
| crates to do cross-platform SIMD in safe Rust. Eventually we'll
| have `std::simd`, but I don't know what the timeline is or what
| the blockers are.
| ComputerGuru wrote:
| SIMD _is_ stable in rust, it just requires unsafe.
|
| Here's an article I wrote about using SIMD in rust, with a
| real-world example: https://neosmart.net/blog/using-simd-
| acceleration-in-rust-to...
| mabbo wrote:
| How hard would it be to modify this to work with ISO-8601, the
| international time stamp format?
|
| In the most common case, you'd have something like
| 2023-07-02T15:46:00Z, possibly with a ".123" between the last
| "0" and the "Z".
|
| Would it be fastest to just copy the relevant chars to the
| correct place for the code in the post? Or could you do it in
| place somehow?
| zokier wrote:
| Shouldn't be too hard, assuming you don't need any validation.
| There are AFAIK SIMD instructions to move bytes around (vpermb
| maybe?), so that should take care of getting rid of the
| separators.
| anonymoushn wrote:
| The code in the post doesn't seem to finish the job. For a
| timestamp like "2023-07-02T15:46:00Z" or the other formats you
| mentioned, the timestamp fits in a single 32-byte register, so
| you could write code that deals with the stuff as it is. If you
| have a mix of timestamps with different numbers of decimals or
| with different time zones, dealing with those things will take
| a bit of work compared to the ideal case where you had all "Z"
| or all "+" or all "-" for the time zones and all the same
| number of decimals for the fractional seconds.
| zokier wrote:
| > We know that some character must be smaller than 9, for
| example, we cannot have more than 59 seconds and never 60
| seconds, in the time stamp string.
|
| This is how you get leap seconds bugs?
| ryandrake wrote:
| Yea, my eye twitched when I read that and I said to myself "so,
| who's gonna tell him?" Glad to see this is the top comment.
| This is why we should not roll our own anything.
| habitue wrote:
| > This is why we should not roll our own anything.
|
| This seems like a major overcorrection. It's equivalent to
| saying "you can't know anything" or "no one is qualified to do
| anything".
|
| You really have to consider the downsides. In cryptography,
| "don't roll you own crypto" is the maxim because the downside
| is pretty bad. Make sure you're an expert before writing
| crypto.
|
| What's the downside to mis-parsing leap seconds? Maybe a 500
| on a super rare request? Then they patch it and it's fine?
| Most computers are off by more than 1 second anyway.
|
| I think this mindset comes from developers looking into some
| field and realizing it's super deep ("Did you know there are
| missing dates in the Gregorian calendar if you go back
| hundreds of years? Your time library is incorrect if it
| doesn't handle these cases!"). And sure, most things aren't
| completely correct, but you have to consider what the
| downside is to being slightly wrong. Maybe it'll be wrong in
| cases that will never actually happen in practice.
| simlevesque wrote:
| > This is why we should not roll our own anything.
|
| How is anything going to be made then ?
|
| Also, the author isn't just anybody and has created stuff
| that you've most likely interacted with. If almost nobody
| should "roll his own", he's one of those who should.
| dataflow wrote:
| I believe if you're using epoch timestamps you already have
| leap seconds bugs. I don't think they increment/decrement when
| the leap second happens.
| morelisp wrote:
| If you're using epoch timestamps to represent time you don't
| have leap second bugs. Each Unix timestamp corresponds to one
| unique point in time which also has some representation in
| civil time.
|
| If you're subtracting Unix timestamps to get durations you
| may have bugs, but those bugs might align better to your
| user's expectations anyway.
| wongarsu wrote:
| > Each Unix timestamp corresponds to one unique point in
| time
|
| That depends. 30th June 2015 23:59:59 has the timestamp
| 1435708799. 1st July 2015 00:00:00 has the timestamp
| 1435708800. But what about the leap second that happened
| between those two seconds? I think most people would claim
| it also has timestamp 1435708799. But then "Each Unix
| timestamp corresponds to one unique point in time" isn't
| true, because 1435708799 now corresponds to two different
| points in time (30th June 2015 23:59:59 and 30th June 2015
| 23:59:60). Saying there is a unique mapping from unix
| timestamp to time would only be true if you either say that
| 30th June 2015 23:59:60 has no unix timestamp (which isn't
| a valid return value of any time API I've seen) or if you
| smear time.
| jodrellblank wrote:
| It still wouldn't be true because there are points in
| time smaller than a second.
| morelisp wrote:
| No, I didn't say it was a bijective mapping. (Although
| with smearing, it still can be.)
| bobbylarrybobby wrote:
| Well, all time stamps are rounded or floored to the
| nearest second (or millisecond, or micro, or...) because
| you only have a finite amount of space to store them --
| you always have infinitely many time stamps mapping to
| each Unix second.
|
| This means you can just adjust the mapping around the
| leap second. A naive implementation might have those two
| real-life seconds tick twice as fast to take only one
| Unix second. Or you could have the whole day's seconds
| tick 86401/86400 times as fast, etc. In terms of the
| mapping from continuous time to discrete time, this is
| just as fine as the non-leap-second case.
| morelisp wrote:
| > 30th June 2015 23:59:60 has no unix timestamp (which
| isn't a valid return value of any time API I've seen)
|
| https://go.dev/play/p/I8GcKBoAVXi
| zokier wrote:
| > Each Unix timestamp corresponds to one unique point in
| time which also has some representation in civil time
|
| This is plain false. For example 1435708799 corresponds to
| both 2015-06-30T23:59:59Z and 2015-06-30T23:59:60Z
| morelisp wrote:
| No, your misdesigned ntpd resets your clock to 1435708799
| at 2015-06-30T23:59:60Z. I have yet to see a convincing
| argument that smearing and freezing are not compatible
| with POSIX's definitions of time.
| zokier wrote:
| It's the kernel that does the leap second handling; ntpd just
| informs it of upcoming leaps.
|
| https://elixir.bootlin.com/linux/v6.4.1/source/kernel/tim
| e/n...
|
| Sure, leap smearing could be considered compatible with
| POSIX in the sense that POSIX AFAIK does not require any
| level of accuracy from clocks, so even something silly
| like xkcd 2266 could be considered POSIX compatible:
| https://xkcd.com/2266/
| dataflow wrote:
| No, bugs can come up without any subtraction. If an event
| was recorded as happening at 23:59:60.2, you need it to be
| sequenced strictly after 23:59:59.9. Winding back the clock
| by 1 second makes it so you end up with the later event as
| having occurred at 23:59:59.2, violating causality. Even
| without sub-second precision the situation is still buggy,
| just not as bad - you end up with two simultaneous events
| when there weren't any in the original representation,
| which is itself quite surprising when the two look like
| they have the same precision.
| heywhatupboys wrote:
| > you end up with the later event as having occurred at
| 23:59:59.1
|
| Unix time is seconds since 1970... If you convert between
| time formats you MAY lose information. That has nothing
| to do with Unix time.
| dataflow wrote:
| "Lose information" is not a binary thing, nor is it
| equivalent to introducing _wrong_ information. You don't
| expect (short)INT_MAX to suddenly give you 1337 just
| because it "loses information". You still want to have
| (and do have) some useful properties & information to be
| preserved even after certain information loss, otherwise
| the implementation is buggy despite the information loss.
|
| In this case, without leap seconds, you'd have still lost
| the _precision_, but you wouldn't have actually violated
| causality. Winding back during a leap second makes it so
| that a timestamp that's > 1 second later in real time
| still ends up being ordered as strictly before the
| earlier time. That's a bug, regardless of how you slice
| it.
|
| Now this can be partially avoided by saturating the
| subtraction at the boundary rather than winding it back,
| and I'm not sure how many implementations do this. That
| seems better, but it's still surprising (and thus bug-
| prone) given the input and output representation
| otherwise appear to have sufficient precision.
| zokier wrote:
| > Unix time is seconds since 1970...
|
| This is unfortunately common misconception. Unix time in
| reality is seconds since epoch _minus the number of leap
| seconds_
| notfed wrote:
| Can you elaborate? How does this lead to a leap second bug?
| Would a string really say "60" as the number of seconds,
| somehow, due to leap seconds?
| layer8 wrote:
| Yes: http://www.niceties.com/leapsec.html
|
| See also: http://leapsecond.com/notes/leap-watch.htm
| paulddraper wrote:
| > we want to convert them to an integer presenting the number
| of seconds since the Unix epoch. The Unix epoch is January 1st
| 1970.
|
| UNIX time ignores leap seconds.
|
| I.e. it's the number of seconds ignoring leap seconds since
| 1970.
| dataflow wrote:
| > UNIX time ignores leap seconds.
|
| Side note but I always found this phrasing to be incredibly
| confusing (hence the need for the "i.e."...); it's the exact
| opposite of what I would think it means to "ignore" leap
| seconds. Ignoring something means you don't change your
| behavior when that something happens. But when leap seconds
| happen, the epoch timestamp _halts_ its counting. That's not
| ignoring leap seconds -- that's actively countering the leap
| seconds!
| Groxx wrote:
| yeah - "ignoring" to me means "is ignorant of" means "one
| second passing means increasing the time by one second".
| You do by far the most obvious thing when time passes, and
| you ignore the complexities of reality to continue being
| maximally obvious.
|
| Doing anything else means _handling_ them by adding custom
| behavior.
| paulddraper wrote:
| Ignores = does nothing
|
| Normally, UNIX time goes up by one each second.
|
| Unless, it's a leap second, in which case UNIX time does
| nothing.
| dataflow wrote:
| Halting the advancement that would've ordinarily happened
| is not "ignoring it", it's doing something to actively
| counter it. That "something" ends up canceling the
| previous trajectory to evaluate to "nothing", but that
| doesn't happen through ignorance - it required a
| countering action.
|
| Example: You're driving your car when pedestrians appear
| in front of you. What does "ignoring" them constitute?
| Stopping the car, or running into them at the previous
| speed?
| agons wrote:
| I'd say that the people counting leap seconds are the
| ones "doing something".
|
| At the point they needed to be inserted, Unix systems
| carry on counting seconds as units of time as they pass,
| "ignoring" the need to add 1 for the purposes of
| accounting.
| zokier wrote:
| > At the point they needed to be inserted, Unix systems
| carry on counting seconds as units of time as they pass
|
| unix systems essentially will roll back clocks during
| leap seconds. Here is a handy dandy table as an example:
|
| Time                      tv_sec      tv_usec
| 2015-06-30T23:59:58.9Z    1435708798  900000
| 2015-06-30T23:59:59.0Z    1435708799  0
| 2015-06-30T23:59:59.1Z    1435708799  100000
| ...                       ...         ...
| 2015-06-30T23:59:59.9Z    1435708799  900000
| 2015-06-30T23:59:60.0Z    1435708799  0
| 2015-06-30T23:59:60.1Z    1435708799  100000
| ...                       ...         ...
| 2015-06-30T23:59:60.9Z    1435708799  900000
| 2015-07-01T00:00:00.0Z    1435708800  0
| 2015-07-01T00:00:00.1Z    1435708800  100000
| ...                       ...         ...
|
| I don't know about you but I wouldn't call that "counting
| seconds as units of time as they pass"
| paulddraper wrote:
| I suppose you can envision UNIX clock with
| "inertia"/Newton's second law.
|
| But when I count people passing me, I press my clicker
| each time to count them. Except once in a while I don't
| do anything, i.e. ignore them.
| catiopatio wrote:
| Not doing anything is a decision to depart from your
| standard behavior. It's not "ignoring".
| zokier wrote:
| Yes, that means that :60 should parse into the same value as
| :59. My reading of the code is that now it might end up in the
| `return false` branch instead, although I'm not exactly sure
| how the limits thing is used here.
| MobiusHorizons wrote:
| Sure, but he's not parsing Unix timestamps. He's parsing time
| in the format %Y%m%d%H%M%S, which would have leap seconds on
| most systems
| paulddraper wrote:
| He's parsing them into UNIX timestamps.
|
| And apparently these strings don't have second 60.
|
| Works for me.
| worewood wrote:
| Reminds me of that post about things programmers always get
| wrong about dates [1].
|
| [1]: https://news.ycombinator.com/item?id=4128208
| 0xr0kk3r wrote:
| I love how bad the comments are for this code. You'd think a
| professor would want to encourage better documentation, or at
| least expose students who are reading this post to some basic
| commenting. Especially because this is a really great code
| example of how to twiddle packed values of different sizes and
| ranges simultaneously, and not the standard SIMD examples of
| matrix math that students usually see first.
| dundarious wrote:
| Looks to be using a lot of similar ideas to the ones explained
| well in these articles:
|
| https://kholdstare.github.io/technical/2020/05/26/faster-int...
|
| http://0x80.pl/articles/simd-parsing-int-sequences.html
| jakear wrote:
| What are you even talking about? There's tons of comments and
| the code isn't obscure. https://github.dev/lemire/Code-used-on-
| Daniel-Lemire-s-blog/...
| 0xr0kk3r wrote:
| What do you think I'm talking about? OP pruned the comments
| out of github and put the code on the page w/o them. I didn't
| click to the github page. Still confused?
| DonPellegrino wrote:
| It's not bad at all https://github.com/lemire/Code-used-on-
| Daniel-Lemire-s-blog/...
| 0xr0kk3r wrote:
| Oh, I didn't follow that link, I was just reading the blog.
| Those are decent comments, too bad they were nixed for the
| blog.
| morelisp wrote:
| But the entire paragraph where they are introduced is a
| commentary on them.
| 0xr0kk3r wrote:
| Yes, I read that. It is in-prose form and detached from
| the comments. Can't believe I'm arguing about this, but
| comments need to be next to the code (like the github).
| Thanks for mansplaining tho.
| brigadier132 wrote:
| Code from academia is notoriously unreadable
| WJW wrote:
| > You'd think a professor would want to encourage better
| documentation, or at least expose students who are reading this
| post to some basic commenting.
|
| Welcome to academic code. We're lucky there's any comments at
| all. :)
| dmw_ng wrote:
| Neatly omits the massively expensive calendar conversion step for
| turning the timestamp into a useful format for fast arithmetic.
| Parsing integers from strings is not why parsing timestamps is
| slow, any more than counting the length of strings is why XML
| parsing might be slow.
|
| For time-sequential data, one easy speedup hack is caching the
| UNIX epoch for the current day start and adding the HH:MM:SS
| manually. No measurements to offer, but I'd suggest approaches
| like this would cleanly beat SIMDification of a trivial part of
| a complex task.
| morelisp wrote:
| > omits the massively expensive calendar conversion step
|
| Is that not included in what's 6x faster? I agree comparing
| against strptime is a bit of a gimme and a scalar fixed-format
| parser would be fairer, but I'm pretty sure it will beat that
| too.
|
| _Edit to reply since rate-limited:_
|
| > No mention of calendar in either place. Did you see something
| in the article or code, or are you just asking us to check?
|
| The (lack so far of) calendar is mentioned in the blog post -
| "It does not get all the parsing done... We can just use
| standard C code for the result." - and is _very obviously_ in
| the GitHub code.
|
| https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...
| refulgentis wrote:
| After reading twice and scouring the code on GitHub, I don't
| think so.
|
| No mention of calendar in either place.
|
| Did you see something in the article or code, or are you just
| asking us to check?
| anarazel wrote:
| FWIW, the last time I measured, the string handling part was far
| slower than the timezone conversion when loading timestamptz
| data in postgres.
| wood_spirit wrote:
| Indeed! It was a while ago but I'm quite proud of the fast
| timestamp functions I made for Java after the standard library
| timestamp performance turned out to be critical to some data
| pipelines I was working on.
|
| The early versions of the code did cache the last parsed date
| etc but as the parser got quicker this became diminishing
| returns.
|
| The fastest way I found for the calendar month numbers was
| to use a precomputed table.
|
| https://github.com/williame/TimeMillis
| lichtenberger wrote:
| Wouldn't it be possible to contribute the faster algorithm to
| the JDK (e.g. the Instant class)? Maybe you'd need some
| iterations, but do you see any downsides?
| dataflow wrote:
| > one easy speedup hack is caching the UNIX epoch for the
| current day start and adding the HH:MM:SS manually
|
| Side note, this is a really great example of why you'd want an
| operation such as parsing to be state-mutating when you'd
| normally expect it not to. i.e. why it makes sense to support
| passing a Parser object rather than just a parse() function,
| even for operations where you don't think it's necessary at
| first glance.
| DrJokepu wrote:
| the downside is that now you will have to consider thread
| safety
| dataflow wrote:
| Yep, you'd want to copy the parser per thread.
| haberman wrote:
| Do you mean the conversion from YYYY-MM-DD to Unix Time? That
| can be quite fast; I wrote a blog article showing some
| algorithms that take less than 10ns to perform the conversion:
| https://blog.reverberate.org/2020/05/12/optimizing-date-algo...
|
| Recently I came across another variant of the algorithm that is
| even smaller and simpler. I've been meaning to test this
| variant and add it to my article (assuming it is correct), but
| I haven't had the chance:
| https://dotat.at/@/2008-09-10-counting-the-days.html. (EDIT: I
| just tried it out. It works, and while the source code is
| smaller than my other variants, epoch_days_fast() from my
| article is still smaller and faster in terms of machine code:
| https://godbolt.org/z/Kjnafzqcx vs
| https://godbolt.org/z/e7nchjKGW)
| dmw_ng wrote:
| Amazing work :) Do you know if glibc gmtime_r() might be
| impacted by the active timezone? My "massively expensive"
| experience above came from gmtime_r() on Amazon Linux running
| in Lambda, but I can't remember the exact scenario involved
| (or e.g. whether TZ was set). It certainly seems unlikely to
| have been worth the effort I put into it if the costs were
| similar to those observed in your post.
| haberman wrote:
| Thanks! I didn't experiment with gmtime_r() since it goes
| in the other direction (UnixTime -> civil time). I wouldn't
| expect timegm() or gmtime_r() to depend on current
| timezone, since both are hard-coded to use UTC. I never did
| figure out why the macOS version is performing mutex locks
| on global data.
| kolbe wrote:
| Interesting. I think you'd find that doing this in chunks of
| 128-bit wide vectors with platform-agnostic logic would be even
| better. Have you tried something like Google Highway or writing
| this exact logic in scalar code and seeing if the compiler will
| autovectorize it? Then you have the option to do up to 4 of them
| at once instead of one if your platform supports it.
| cratermoon wrote:
| At one consulting gig I had at an airline, I noted that
| timestamps were everywhere and among the most frequent operations
| in the system was comparison with current time. The legacy
| systems spit out date/time in at least 3 different formats, and
| almost never included the timezone. The TZ had to be determined
| by looking up the airport's TZ.
|
| When I got there, time comparisons were a mess: parsing in
| multiple places, some comparisons not taking the TZ into
| account, some add and subtract operations incorrectly coded,
| and other issues.
|
| I tried to make the case that parsing and comparing timestamps
| was a core operation, and that the system should have all that
| functionality polished and packaged up to its own module so
| programmers could focus on the rest of the system, but was
| unsuccessful. We were writing in Go, so the language libraries
| did about 80% of the work, and the less experienced programmers
| thought it was sufficient.
|
| When I left they were struggling with issues around determining
| passenger age at the time of flight and getting off-by-one errors
| around certain cutoff ages.
|
| Time is Hard.
| https://gist.github.com/timvisee/fcda9bbdff88d45cc9061606b4b...
___________________________________________________________________
(page generated 2023-07-02 23:00 UTC)