[HN Gopher] Nanolog supports logging with 7 ns median latency
___________________________________________________________________
Nanolog supports logging with 7 ns median latency
Author : eventhelix
Score : 110 points
Date : 2024-08-23 12:32 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| zdw wrote:
| Doesn't a logging system need a storage system that can keep up
| with it, if the goal is to persist logs for later analysis? What
| storage could keep up with this?
| rfoo wrote:
| This is for very bursty logs. You don't log every 7 ns. On
| average you are not generating a huge amount of logs.
|
| But you need each logging call to be very fast, because
| logging calls are usually synchronous.
| mannyv wrote:
| If the logging call is in the critical path then logging data
| is probably critical as well. After all, the fastest call is
| the one you don't do.
|
| If you're writing telemetry then that's different. But if
| you're using logging to write your telemetry then there are
| better ways to do it.
| wdfx wrote:
| I think the idea here is to separate the log call site in
| application code from the log processing/persistence?
|
| So, the nanosecond values quoted are the impact exposed to your
| application code, but some other process takes over the bulk of
| the work of the logging.
|
| So as long as the offloaded process can keep up with the
| _average_ log rate, it'll be fine - but also as a bonus the
| application does not see increased latency due to logging.
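|
| A minimal sketch of that split (hypothetical types, not
| NanoLog's actual API): the hot path copies a small binary
| record into a preallocated ring and returns; a background
| thread formats and persists at its own pace.
        #include <atomic>
        #include <cstddef>
        #include <cstdint>

        struct Record { uint32_t fmt_id; uint64_t ts; uint64_t args[4]; };

        // Single-producer/single-consumer ring: the app thread does
        // two atomic ops and a small copy, nothing else.
        struct Ring {
            static constexpr size_t N = 1 << 16;  // power of two
            Record buf[N];
            std::atomic<uint64_t> head{0}, tail{0};

            bool push(const Record& r) {          // app thread
                uint64_t h = head.load(std::memory_order_relaxed);
                if (h - tail.load(std::memory_order_acquire) == N)
                    return false;                 // full: caller decides policy
                buf[h & (N - 1)] = r;
                head.store(h + 1, std::memory_order_release);
                return true;
            }

            bool pop(Record& r) {                 // background thread
                uint64_t t = tail.load(std::memory_order_relaxed);
                if (t == head.load(std::memory_order_acquire))
                    return false;                 // empty
                r = buf[t & (N - 1)];
                tail.store(t + 1, std::memory_order_release);
                return true;
            }
        };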
| 01HNNWZ0MV43FF wrote:
| Sounds a bit like how Tracy works
| cma wrote:
| Battery backed SRAM
| username81 wrote:
| Are there libraries like this, but in rust? As far as I
| understand, it relies on C's preprocessor, so it is impossible to
| create bindings for another language.
| cmptrnerd6 wrote:
| I have not used this and it says it targets embedded systems
| but maybe it is close enough to what you might be looking for:
| https://github.com/knurling-rs/defmt
| steveklabnik wrote:
| I have used this, but not the library in the link. From the
| link's README, they're at least analogous. While maybe not
| the exact same thing, they're at least the same idea.
| andrepd wrote:
| Sounds like something that would be doable in rust. I'm not
| sure how you would go about building the compile-time table of
| log locations and format strings, sounds like the hardest part.
| wtetzner wrote:
| > I'm not sure how you would go about building the compile-
| time table of log locations and format strings, sounds like
| the hardest part.
|
| Sounds like a job for a macro.
| steveklabnik wrote:
| Yep. And macros for logging are already the norm in Rust.
| cmptrnerd6 wrote:
| I've used https://github.com/rokath/trice which is similar but
| targeting microcontrollers. It isn't immediately clear to me if
| nanolog could run on a microcontroller with its output directed
| over RTT/uart/etc or not.
| fra wrote:
| This is a common technique in embedded software. A few other
| examples:
|
| 1. Trice (already mentioned in the comments)
| https://github.com/rokath/trice
|
| 2. Pigweed's Tokenizer (from Google)
| https://pigweed.dev/pw_tokenizer/
|
| 3. Memfault's Compact Logs
| https://docs.memfault.com/docs/mcu/compact-logs
|
| 4. Defmt by Ferrous Systems https://defmt.ferrous-systems.com/
| frizlab wrote:
| 5. macOS logging system
| https://developer.apple.com/documentation/os/logging/viewing...
| enigmo wrote:
| 6. WPP in Windows 2000 ETW https://learn.microsoft.com/en-
| us/windows-hardware/test/weg/...
| lokar wrote:
| The google logging library also defers formatting
| perching_aix wrote:
| that sounds deliciously performant, love projects like these
| yas_hmaheshwari wrote:
| I was also thinking the same. How come such a good idea is
| not already part of standard logging libraries -- allowing
| you to offload to another process or message queue!
|
| Loved the idea
| Validark wrote:
| Amazing work! I was wondering just a few months ago whether
| someone ever made a logger that deferred all the expensive work
| of string formatting to consumption time.
|
| ~~I'm a bit surprised that it didn't come along sooner though.
| How come nobody at Google or VMware who said they noticed this
| was a problem solved it? Or any other major tech company? I
| guess maybe this is partially an issue with our programming
| languages and build tools. I'm a Zig enthusiast, so in my head
| it's trivial, but I guess it won't be until C++26 that C++
| gets potentially comparable comptime facilities.~~ I'm
| surprised Go doesn't work like this by default though. For a
| language like Go, I'd have made a builtin log keyword that
| does this.
|
| EDIT: Looks like other implementations of similar ideas do exist.
| Still awesome though!
| yuliyp wrote:
| You have to be careful in deferring such work. It may end up
| more expensive if it means you have multiple threads accessing
| that data, and/or needing to extend the lifetime of an object
| so the logger can access it.
| jnordwick wrote:
| as long as you are just using static strings and native types
| it amounts to a pointer/index bump and a load/store per item.
| Let's imagine you have the format string, priority number,
| system id, and 7 pieces of data in the payload. That would be
| 10 items, so something like 40 cycles? I can see the 18ns the
| paper gets.
|
| I have no doubt the 7ns number is heavily cooked.
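|
| Roughly what "a pointer bump and a store per item" looks
| like (sketch; the record layout is made up):
        #include <cstddef>
        #include <cstdint>
        #include <cstring>

        // Serialize one log call as {format id, priority, system id,
        // raw args}. No formatting happens on this thread.
        inline char* log_raw(char* p, uint32_t fmt_id, uint32_t prio,
                             uint32_t sys_id, const uint64_t* args,
                             size_t n) {
            std::memcpy(p, &fmt_id, 4); p += 4;
            std::memcpy(p, &prio,   4); p += 4;
            std::memcpy(p, &sys_id, 4); p += 4;
            for (size_t i = 0; i < n; ++i) {  // one 8-byte store per item
                std::memcpy(p, &args[i], 8); p += 8;
            }
            return p;  // the new bump pointer
        }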
| lokar wrote:
| The google logging library has deferred the formatting for
| years
| geertj wrote:
| The consumer side of this would be polling a memory location for
| new logs, correct? It would not be possible to wake up the
| consumer in 7ns as that would take a FUTEX_WAKE system call,
| which is O(microseconds).
|
| I've been wondering about a FUTEX_WAKE that does not require a
| system call. Possibly, the kernel could poll a global memory
| area. Or maybe there is some low-level debugging API available
| where the kernel could be notified of a memory write by a
| process?
| gpderetta wrote:
| There isn't a significant advantage in having the kernel do
| the polling; it would still be busy polling.
|
| If you just don't want to burn power but you can still dedicate
| a core, there is https://www.felixcloutier.com/x86/mwait.
| geertj wrote:
| > There isn't a significant advantage in having the kernel
| do the polling; it would still be busy polling.
|
| I was thinking in terms of a generic syscall-less wake
| functionality where the kernel could do this for all
| processes in the system. So you'd lose one core per system
| instead of one core per consumer.
|
| >If you just don't want to burn power but you can still
| dedicate a core, there is
| https://www.felixcloutier.com/x86/mwait.
|
| Interesting. Could be used to make the kernel loop above burn
| less power.
|
| A user-space implementation could presumably also be built.
| There could be a shared memory segment shared between
| producers and a monitor. A producer sets a flag in case it
| needs attention, and the monitor busy polls the segment. The
| monitor could then use e.g. a signal to wake up consumers.
|
| The latency between the producer signaling and the consumer
| taking action would be higher than with futexes. But there
| would be no waits/context switches in the producer at all.
| Might be a solution for some low latency use cases.
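|
| A sketch of that monitor scheme (hypothetical layout; one
| slot per producer in the shared segment, and std::atomic
| works cross-process as long as it's lock-free):
        #include <atomic>
        #include <cstddef>
        #include <cstdint>
        #include <signal.h>
        #include <sys/types.h>

        struct Slot {
            std::atomic<uint32_t> needs_attention;
            pid_t consumer;   // whom to wake for this producer
        };

        void producer_notify(Slot& s) {   // wait-free, no syscall
            s.needs_attention.store(1, std::memory_order_release);
        }

        // The one busy-polling core for the whole box.
        [[noreturn]] void monitor_loop(Slot* slots, size_t n) {
            for (;;)
                for (size_t i = 0; i < n; ++i)
                    if (slots[i].needs_attention.exchange(
                            0, std::memory_order_acq_rel))
                        kill(slots[i].consumer, SIGUSR1);
        }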
| toxik wrote:
| O(microseconds) = O(years), this is not what big O notation
| means.
| renewiltord wrote:
| That seems like a type error on your part. O(unit) is a
| categorically different notation than O(mathematical
| function). You must be confusing it with the asymptotic
| notion that is the latter. But units are not functions and so
| clearly it cannot be the same notation.
| toxik wrote:
| No, and it's absurdly petty to argue otherwise.
| davidsgk wrote:
| One might say it's also pretty petty to call out a casual
| usage of a notation being used in a way that people in
| the thread are understanding just fine...
| toxik wrote:
| I think it's helpful because it is a common mistake.
| jnordwick wrote:
| The background thread can just sleep if there is no data in
| the queue.
|
| Since you are preprocessing the log messages, you know the
| size bounds. Say the thread sleeps for 1 ms. Be generous and
| say you can push to the queue at most once every 5ns, and if
| you know the largest messages you push will be 200 bytes, you
| can statically determine that a 40MB ring buffer will ensure
| enough space to sleep even at the max rate with the largest
| messages.
|
| And that's just a simplistic scheme. If you have a pool of
| logging structs so your queue is just a pointer to one of
| those, you can get away with even less.
|
| And I can think of faster ways too. It isn't that difficult
| to get around using a semaphore.
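|
| The sizing arithmetic above as a compile-time check (sketch;
| numbers are the ones in this comment):
        #include <cstdint>

        // Worst-case ring sizing for a 1 ms consumer sleep.
        constexpr uint64_t sleep_ns      = 1'000'000; // consumer sleeps 1 ms
        constexpr uint64_t min_push_ns   = 5;         // at most one push per 5 ns
        constexpr uint64_t max_msg_bytes = 200;       // largest message
        constexpr uint64_t worst_case_bytes =
            (sleep_ns / min_push_ns) * max_msg_bytes; // 200'000 * 200
        static_assert(worst_case_bytes == 40'000'000, "a 40 MB ring suffices");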
| sriram_malhar wrote:
| The consumer (background thread) only polls; there is no need
| to wake up the consumer.
|
| At steady state, the consumer is either blocked on I/O or is
| scanning the producer buffers. When all producer buffers are
| empty, it can just sleep for a short time.
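|
| As a sketch, the consumer loop is just drain-then-nap
| (try_pop and write_out are hypothetical stand-ins for the
| ring pop and the blocking I/O):
        #include <chrono>
        #include <cstdint>
        #include <thread>

        struct Record { uint32_t fmt_id; uint64_t ts; uint64_t args[4]; };
        bool try_pop(Record&);          // stand-in: wait-free pop from a ring
        void write_out(const Record&);  // stand-in: the actual (blocking) I/O

        void consumer_loop() {
            Record r;
            for (;;) {
                bool any = false;
                while (try_pop(r)) {    // drain everything available
                    write_out(r);
                    any = true;
                }
                if (!any)               // all producer buffers empty: nap
                    std::this_thread::sleep_for(std::chrono::microseconds(100));
            }
        }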
| geertj wrote:
| This would work in this specific case where we know that
| there is a maximum rate at which work is produced. Arguably I
| was hijacking the thread to discuss a more general problem
| that I've been thinking about for a while. I have the sense
| that a ring-buffer that has a wait-free push with low bound
| on consumer side latency that doesn't require fixed interval
| polling would be a nice primitive that I certainly could have
| used at times.
|
| And in fact ... the wait-free wakeup part of this is already
| there. Now that io_uring has futex support, a producer can
| enable kernel-side busy polling on the uring, and then submit
| a FUTEX_WAKE to the ring without doing any syscalls. This
| Github issue [1] has a nice description.
|
| [1] https://github.com/axboe/liburing/issues/385
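|
| Rough shape of that (a sketch, not tested: assumes liburing
| >= 2.5 and a 6.7+ kernel with io_uring futex ops; the
| constants come from the futex2 uapi):
        #include <liburing.h>
        #include <linux/futex.h>
        #include <cstdint>

        struct io_uring ring;

        void setup() {
            // Kernel-side SQ polling: a kernel thread watches the
            // submission queue, so submits need no syscall while
            // that thread is awake.
            io_uring_params p = {};
            p.flags = IORING_SETUP_SQPOLL;
            p.sq_thread_idle = 1000;    // ms before the poller naps
            io_uring_queue_init_params(64, &ring, &p);
        }

        void wake(uint32_t* futex_word) {
            io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            io_uring_prep_futex_wake(sqe, futex_word, /*nr_wake=*/1,
                                     FUTEX_BITSET_MATCH_ANY,
                                     FUTEX2_SIZE_U32, 0);
            io_uring_submit(&ring);     // no syscall if the SQPOLL
                                        // thread is still awake
        }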
| jnordwick wrote:
| It uses a background thread to do most of the work, and it
| appears the 7ns latency numbers are a little cooked:
|
| 1. The paper's corresponding number is 8ns for microbenchmarks
| but 18ns in applications. The 7ns number I'm guessing is from
| microbenchmarks, and the true application-level number is
| probably more in the 17ns range.
|
| 2. It isn't precisely clear what that is measuring. The paper
| says it is the invocation time of the logging call. Considering
| the thread making the call to log just passes most of the work
| to a background thread through a multi-producer single-consumer
| queue of some sort, this is likely the time to dump it in the
| queue.
|
| So you really aren't logging in 7ns. The way I'm reading this is
| you're dumping on a queue in 17ns and letting a background thread
| do the actual work.
|
| The workload is cut down by a preprocessing step that creates
| a dictionary of static elements to reduce the I/O cost of the
| thread doing the actual writing (I assume this just means
| taking the format strings and indexing them, which you could
| build at runtime, so I'm not sure the pre-processing step is
| really needed).
|
| My logger that dumps binary blobs onto a ring buffer for
| another process to log might be able to beat this invocation
| latency. This isn't really groundbreaking. I know a few places
| that log the binary blobs and format them later. None of them
| do the dictionary part, but when that work is going to a
| background thread, I'm not sure how much it matters.
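|
| The runtime version of that dictionary is tiny (sketch:
| intern each format string the first time a call site fires,
| then ship only the index plus raw args):
        #include <cstdint>
        #include <mutex>
        #include <unordered_map>
        #include <vector>

        // String literals have stable addresses, so pointer
        // identity is enough as the key.
        uint32_t intern_format(const char* fmt) {
            static std::mutex m;
            static std::unordered_map<const void*, uint32_t> ids;
            static std::vector<const char*> table;  // id -> format
            std::lock_guard<std::mutex> g(m);       // cold path
            auto [it, fresh] = ids.try_emplace(fmt, uint32_t(table.size()));
            if (fresh) table.push_back(fmt);
            return it->second;
        }
| In practice each call site would cache the returned id in a
| local static so the lock is only ever taken once per site.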
| szundi wrote:
| The only thing that makes sense is that the thread sending
| the logs is blocked for 7ns - otherwise too much
| context-dependent overhead comes in to make a claim like this.
| gpderetta wrote:
| Yes, the overhead in the logging thread is what this is trying
| to minimize. The background thread is considered "free". This
| sort of async logging is a common setup for some class of
| applications.
|
| And yes, it boils down to writing data to a message queue. Most
| of the overhead is probably the call to the hardware timestamp
| counter.
| jnordwick wrote:
| In logging code I wrote that is basically an SPSC ring
| buffer, I use some RDTSC assembly, and at startup I calculate
| the frequency and epoch offset. It has a throughput of around
| 30 cycles.
|
| That's already ~10 ns, so I'm not sure how they are getting
| their numbers. If they are timestamping the data when the
| background thread gets to it that would be pushing even more
| work to it. I'm guessing they do, or else they could
| potentially be logging out-of-order data with multiple
| threads.
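|
| For reference, the calibration part is only a few lines
| (sketch, x86 only; uses the __rdtsc intrinsic instead of
| inline asm):
        #include <x86intrin.h>  // __rdtsc
        #include <chrono>
        #include <cstdint>
        #include <ctime>
        #include <thread>

        static uint64_t tsc0, ns0;
        static double ns_per_tick;

        static uint64_t wall_ns() {
            timespec ts;
            clock_gettime(CLOCK_REALTIME, &ts);
            return uint64_t(ts.tv_sec) * 1000000000u + ts.tv_nsec;
        }

        // Pair rdtsc with CLOCK_REALTIME once at startup, then
        // convert later tick readings without further syscalls.
        void calibrate() {
            tsc0 = __rdtsc();
            ns0  = wall_ns();
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
            ns_per_tick = double(wall_ns() - ns0) / double(__rdtsc() - tsc0);
        }

        uint64_t tsc_to_ns(uint64_t tsc) {  // cheap hot-path conversion
            return ns0 + uint64_t(double(tsc - tsc0) * ns_per_tick);
        }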
| gpderetta wrote:
| They are likely timestamping on the logging thread.
| Possibly they are just running at a higher frequency.
| screcth wrote:
| You could store the current date and time in a global
| variable and have the producers just read it atomically.
| The consumer thread would then update it periodically.
|
| Timestamps will be somewhat inaccurate but it may help
| performance.
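|
| e.g. (sketch; wall_ns() is a hypothetical clock_gettime
| wrapper):
        #include <atomic>
        #include <cstdint>

        uint64_t wall_ns();  // stand-in: clock_gettime wrapper

        // The consumer refreshes this periodically; producers pay
        // one relaxed load instead of a clock read. Accuracy is
        // the refresh period.
        std::atomic<uint64_t> coarse_now_ns{0};

        uint64_t log_timestamp() {      // producer hot path
            return coarse_now_ns.load(std::memory_order_relaxed);
        }

        void refresh_clock() {          // consumer, every ~N us
            coarse_now_ns.store(wall_ns(), std::memory_order_relaxed);
        }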
| jnordwick wrote:
| that's what the vdso version of clock_gettime does. If
| you use one of the *_COARSE clocks it will only update
| periodically and be much faster, but that means like 15
| milliseconds of log messages will all have the same
| timestamp.
|
| The fastest option for nanosecond precision (bonus: it is
| even sub-nanosecond) is just to store the return value of
| RDTSC and let the background thread figure it all out.
| You don't even need to precalculate the freq or epoch
| offset. Just write a couple of logging messages with the
| rdtsc value and CLOCK_REALTIME and let the post-processing
| figure it out.
|
| To cut down on I/O, each log message's timestamp can just
| be an offset from the last event.
|
| If you are willing to push a lot of work to the
| background thread and even more to the post-processing
| step, you really don't need to do very much.
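|
| The delta encoding is similarly small (sketch; single
| logging thread assumed):
        #include <cstdint>

        // Write only the tick delta since the previous record; the
        // post-processor reconstructs absolute times by a running
        // sum. Small deltas also compress well.
        uint64_t tick_delta(uint64_t cur_tsc) {
            static uint64_t last_tsc = 0;
            uint64_t d = cur_tsc - last_tsc;
            last_tsc = cur_tsc;
            return d;
        }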
| Thaxll wrote:
| > that's what the vdso version of clock_gettime does. If
| you use one of the *_COARSE clocks it will only update
| periodically and be much faster, but that means like 15
| milliseconds of log messages will all have the same
| timestamp.
|
| Not sure it matters a lot to have multiple messages with
| the same timestamp; since they were added in order you
| still know which one is older. The problem might arise
| when you send those logs to a remote place where the
| order of insertion is discarded and the timestamp is
| used instead.
|
| I assume that when you use a single thread with a queue /
| ring buffer the order of insertion is kept.
| gpderetta wrote:
| FWIW, I have relied on the accuracy of log timestamps
| well into the low microseconds.
|
| If you have an event loop, it might be acceptable to
| sample the counter once per iteration, but less than that
| it becomes too lossy.
| pas wrote:
| probably in this case it's important to use some kind of
| synthetic timestamping to preserve the ordering (for
| example, for a 5.123 [ms] timestamp one can add ~1000000
| ns timestamps, so let's say are a thousand entries that
| need to be ordered, one can then represent them as
| 5.123000[n] ... and the "000" part is just a silly in-
| band way to give a hint to someone who will later examine
| the logs)
| jnordwick wrote:
| since you aren't going to be writing a message per
| nanosecond, you can always just do `last nanos = max(last
| nanos + 1, cur nanos)` and then use last nanos for the
| timestamp. you can even do it in rdtsc ticks and get 1/3
| nanosecond granularity. Obv the clock isn't nearly that
| accurate, but it lets you use those fractional nanos to
| ensure a strictly increasing ordering.
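|
| i.e. (sketch; single writer assumed):
        #include <algorithm>
        #include <cstdint>

        // Force strictly increasing timestamps: ties get bumped by
        // one unit, so ordering survives a sort by timestamp.
        uint64_t next_stamp(uint64_t cur_nanos) {
            static uint64_t last_nanos = 0;
            last_nanos = std::max(last_nanos + 1, cur_nanos);
            return last_nanos;
        }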
| andrepd wrote:
| If rdtsc is a cycle counter then this also assumes a
| constant clock rate.
| loeg wrote:
| rdtsc has been "invariant" for a long time.
| andrepd wrote:
| Oh TIL, I always thought it was a plain cycle counter
| loeg wrote:
| In a similar SPSC ring buffer setup I see something like
| 1.8ns per 16B entry without RDTSC (our design has a cached
| thread-local tsc that is updated periodically, so this is
| vaguely reasonable) or 7.6ns per entry with RDTSC
| (approximate RDTSC cost of 4.8ns). This is on a VM on some
| sort of Skylake CPU (maybe Skylake X).
| renewiltord wrote:
| Yeah. But that's okay. For the applications that care about
| this, it's just a question of how long the application thread
| is blocked. The logging thread can spin on its core and take as
| long as it wants so long as it can burn down the buffer faster
| than the application thread can add to it.
|
| I thought log4j2 used a disruptor between the app thread and
| the logging thread, but maybe I'm wrong.
| DesiLurker wrote:
| I did something similar about a decade ago. the main active
| costs for logging (assuming writing into a fixed-size lockless
| ring buffer) are the string formatting and timestamp fetching.
| I actually did my own ts interpolation with rdtsc and periodic
| fetches to get this to a minimum. string formatting you can
| avoid completely by preregistering messages and storing away
| the arguments.
|
| I have also seen people make the logger into a dso so they can
| dump strings from the main app binary and record addresses to
| avoid the hassle of preregistering. but with preregistering I
| was able to get the perf all the way down to ~40ns per message
| w/o the separate log dumper thread. at that point you actually
| start running into disk IO limitations.
|
| one additional thing I did was to create the ring buffer in a
| shared memory segment and log to that from all services in the
| system. this way you don't even have to think about crashing
| services. anyway the point is this hardly seems groundbreaking.
| what I'd like to see is this evolve into a fast syslog-type
| service where one just freely logs trace messages throughout
| w/o worrying about the cost and we can use them for postmortem
| analysis or filter at dump time.
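|
| Preregistration can be as small as a static per call site
| (sketch; intern_format and push_record are hypothetical
| helpers, not the system described above):
        #include <cstddef>
        #include <cstdint>

        uint32_t intern_format(const char*);  // registry lookup/insert
        void push_record(uint32_t id, const uint64_t* args,
                         size_t n);           // copy into the ring

        // The local static means registration runs once per call
        // site; afterwards a log call is an id plus raw arg copies.
        #define FAST_LOG2(fmt, a, b)                                \
            do {                                                    \
                static const uint32_t id_ = intern_format(fmt);     \
                const uint64_t args_[2] = {(uint64_t)(a),           \
                                           (uint64_t)(b)};          \
                push_record(id_, args_, 2);                         \
            } while (0)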
| hinkley wrote:
| One of the reasons I think logging as a production tool has
| risen in status in the last fifteen years is that in the 90's
| you had languages where it was decided that if your app
| crashes, we want to know the last operation that was in
| flight when it crashed.
|
| As we now know, the syscalls to force a flush to disk often
| didn't, so if you were trying to figure out what happened right
| before a kernel panic, good fucking luck. But if it was a
| purely userspace problem, you might be okay.
|
| These days we care more about throughput. I'd personally rather
| see us commit to stats for use cases that require that sort of
| logging traffic, but I think that's one of those conversations
| that's nearing the top of the heap. Maybe we'll have that
| argument as soon as we stop flooding the literature with talk
| of AI.
|
| So 7ns tells me something about throughput, but I think I'd
| rather just know the messages per second per CPU. And perhaps
| more importantly, with a CPU that's 80% saturated doing non-
| logging tasks. Because I need to make forward progress between
| the log messages.
| packetlost wrote:
| I have ideas for a logging/wide-metric system that uses this
| technique and some others stolen from DNS and IP. It's largely
| inspired by a system I've built at my day job that implements
| distributed command & control for servo-ing, monitoring, etc.
| It's been really successful, but the hardest part is mapping a
| unique numeric identifier to a human-readable string in a way
| that is dynamic and efficient enough. It really seems like the
| exact same problem as DNS, which leads me to believe there's
| likely no way to do it without a persistent centralized
| registry/database.
| kolbe wrote:
| I could swear I did a deep dive into Spdlog vs Nanolog six months
| ago, and the performance differences weren't nearly this stark
| synergy20 wrote:
| what do you mean? considering spdlog is the de facto logger for
| c++
| kolbe wrote:
| Nano is claiming insanely better performance over spdlog,
| which confuses me
| swah wrote:
| So you're actually spawning threads in this simple C++
| example? I thought this was refrained from in C++ land...

        #include "NanoLogCpp17.h"
        using namespace NanoLog::LogLevels;

        int main() {
            NANO_LOG(NOTICE, "Hello World! This is an integer %d "
                             "and a double %lf\r\n", 1, 2.0);
            return 0;
        }
| bazzargh wrote:
| The paper says a lot of the secret sauce is dumping a dictionary
| of the static content and then logging in a binary format. That
| format looks a lot like gzip, if you squint. Could something like
| this use the _actual_ gzip format, but writing with a static
| dictionary, to make life easier for tools? (gzip has a trailer,
| but I'm not sure how much attention is paid to that, since it's
| often used for streams)
| newobj wrote:
| The real headline here is that log4j2 is faster than Boost.Log
| loeg wrote:
| 7ns latency is in the ballpark of _small_ writes to L1 cache.
| I.e. some sort of in-mem only append to somewhere predicted by
| the prefetcher of like, a single cache line or less. So yeah,
| some sort of ringbuffer log could definitely support this kind of
| figure. The question is how much throughput does your persistence
| channel have, how much memory can you devote to logging while
| your persistence runs async, are you ok with a somewhat volatile
| log, etc.
___________________________________________________________________
(page generated 2024-08-23 23:00 UTC)