[HN Gopher] Overflowing timeout
___________________________________________________________________
Overflowing timeout
Author : antonio2368
Score : 48 points
Date : 2021-07-14 13:18 UTC (1 days ago)
(HTM) web link (memgraph.com)
(TXT) w3m dump (memgraph.com)
| wahern wrote:
| > When you check the elapsed time, you subtract the old time from
| the new time. But if the new time is less than the old time, you
| get an overflow. Because we save time as an unsigned integer, you
| get a very large number that will for sure be larger than any
| timeout value you set, including our default 3 minutes.
|
| If they were using unsigned arithmetic then they would've gotten
| the expected value. For example, 0 - 0xffffffffffffffff is 1 when
| using 64-bit types, just as one might expect. Unsigned (modulo)
| arithmetic works exactly how you'd want in a situation like this;
| it works until the interval exceeds the range of the type. So
| either they're describing things wrong, or their code wasn't
| doing what they thought.
|
| Of course, in any event this wouldn't be the root of their
| problem--time stamp counter drift across cores. The Linux kernel
| has a lot of code to decide whether to use the RDTSC-optimized
| clock_gettime vDSO implementation or fall back to a syscall. And
| when it does use the RDTSC method, IIRC it has code to train the
| time stamps to minimize any cross-core drift, and mask out any
| potential difference (reducing granularity). The moral of the
| story is to just use clock_gettime+CLOCK_MONOTONIC; at least on
| Linux it'll use RDTSC in most cases where that's practical.
| antonio2368 wrote:
| Sorry if something was explained awkwardly or incorrectly; I'm
| still getting the hang of writing technical posts that are
| interesting and fun to read.
|
| > If they were using unsigned arithmetic then they would've
| gotten the expected value.
|
| And we did, but we did not expect to get the input values we
| got, which are explained by your last paragraph: "time stamp
| counter drift across cores." This was the real problem, and
| something new we learned.
|
| > use clock_gettime+CLOCK_MONOTONIC
|
| I forgot to include this because I was focused on the solutions
| we ended up using, but after some research this is 100%
| correct! We were afraid that too many syscalls during query
| execution could cause a big performance hit, so we decided to look
| for a different solution.
|
| All in all, thanks for the comment!
| namibj wrote:
| Just to correct you: `clock_gettime` is handled by the vDSO and
| has not involved an actual syscall for over a decade now.
| antonio2368 wrote:
| I didn't think about the vDSO, and that it's used for
| `clock_gettime`, so thank you for the information and
| correction!
|
| It's really hard for me to get a feel for what is okay to use and
| what the standard way of doing something is without actually
| implementing everything and seeing how it behaves, so information
| like this helps me a lot.
| namibj wrote:
| I recommend looking at it, as all those calls are very
| cheap: man 7 vdso
| pornel wrote:
| It turns out lots of systems are subtly broken, and need
| workarounds. This is where it's handy to have a decent standard
| library instead of learning all of this the hard way:
|
| https://github.com/rust-lang/rust/blob/b9197978a90be6f757074...
| ComputerGuru wrote:
| __rdtsc drift isn't an AMD bug, it's by design. The only way to
| use the rdtsc instruction is to pin code to the same physical
| core between calls. They probably just got lucky in that the
| drift always exceeded the time that elapsed between calls on other
| CPUs, or the Linux scheduler's behavior on Ryzen 7 differs from
| that on other CPUs (I'm not familiar with Ryzen 7, but
| ThreadRipper is basically non-NUMA and needed scheduler patches
| under Windows to fix some pathological conditions).
| wahern wrote:
| It seems more of a QoI issue. Most Intel chips maintain
| synchronized (same rate of change) TSCs. That knowledge is
| encoded in the Linux kernel:
|
|     /*
|      * Intel systems are normally all synchronized.
|      * Exceptions must mark TSC as unstable:
|      */
|     if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) {
|         /* assume multi socket systems are not synchronized: */
|         if (num_possible_cpus() > 1)
|             return 1;
|     }
|
| From unsynchronized_tsc function at
| https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
| antonio2368 wrote:
| > __rdtsc drift isn't an AMD bug, it's by design.
|
| Yes, I should've emphasized this.
|
| > They probably just got lucky in that the drift always exceeded
| the time that elapsed between calls on other cpus
|
| And this is also true. :D But I would say we were lucky to
| notice this problem at all because it really doesn't happen on
| that many CPUs, at least on the ones we tried out.
___________________________________________________________________
(page generated 2021-07-15 23:02 UTC)