[HN Gopher] Nanolog supports logging with 7 ns median latency
       ___________________________________________________________________
        
       Nanolog supports logging with 7 ns median latency
        
       Author : eventhelix
       Score  : 110 points
       Date   : 2024-08-23 12:32 UTC (10 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | zdw wrote:
       | Doesn't a logging system need a storage system that can keep up
       | with it, if the goal is to persist logs for later analysis? What
       | storage could keep up with this?
        
         | rfoo wrote:
          | This is for very bursty logs. You don't log every 7 ns; on
          | average you are not generating a huge amount of logs.
          | 
          | But you need each logging call to be very fast, because
          | logging calls are usually synchronous.
        
           | mannyv wrote:
           | If the logging call is in the critical path then logging data
           | is probably critical as well. After all, the fastest call is
           | the one you don't do.
           | 
           | If you're writing telemetry then that's different. But if
           | you're using logging to write your telemetry then there are
           | better ways to do it.
        
         | wdfx wrote:
         | I think the idea here is to separate the log call site in
         | application code from the log processing/persistence?
         | 
         | So, the nanosecond values quoted are the impact exposed to your
         | application code, but some other process takes over the bulk of
         | the work of the logging.
         | 
          | So as long as the offloaded process can keep up with the
          | _average_ log rate, it'll be fine - but, as a bonus, the
          | application also does not see increased latency due to
          | logging.
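          | 
          | A minimal sketch of that split, assuming a hypothetical
          | wait-free queue API (queue_push/queue_pop and
          | format_and_write are illustrative, not NanoLog's actual
          | interface):
          | 
          |     #include <chrono>
          |     #include <cstdint>
          |     #include <thread>
          | 
          |     struct LogRecord { uint32_t fmt_id; uint64_t ts; };
          | 
          |     bool queue_pop(LogRecord& out);   // hypothetical queue
          |     void queue_push(const LogRecord& r);
          |     void format_and_write(const LogRecord& r);
          | 
          |     void log_call_site(const LogRecord& r) {
          |         queue_push(r);  // the only cost the app thread sees
          |     }
          | 
          |     void logger_thread() {
          |         LogRecord r;
          |         for (;;) {
          |             while (queue_pop(r))
          |                 format_and_write(r);  // bulk of the work,
          |                                       // off the hot path
          |             std::this_thread::sleep_for(
          |                 std::chrono::microseconds(50));
          |         }
          |     }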
        
           | 01HNNWZ0MV43FF wrote:
           | Sounds a bit like how Tracy works
        
         | cma wrote:
         | Battery backed SRAM
        
       | username81 wrote:
        | Are there libraries like this, but in Rust? As far as I
       | understand, it relies on C's preprocessor, so it is impossible to
       | create bindings for another language.
        
         | cmptrnerd6 wrote:
         | I have not used this and it says it targets embedded systems
         | but maybe it is close enough to what you might be looking for:
         | https://github.com/knurling-rs/defmt
        
           | steveklabnik wrote:
           | I have used this, but not the library in the link. From the
           | link's README, they're at least analogous. While maybe not
           | the exact same thing, they're at least the same idea.
        
         | andrepd wrote:
          | Sounds like something that would be doable in Rust. I'm not
         | sure how you would go about building the compile-time table of
         | log locations and format strings, sounds like the hardest part.
        
           | wtetzner wrote:
           | > I'm not sure how you would go about building the compile-
           | time table of log locations and format strings, sounds like
           | the hardest part.
           | 
           | Sounds like a job for a macro.
        
             | steveklabnik wrote:
             | Yep. And macros for logging are already the norm in Rust.
        
       | cmptrnerd6 wrote:
       | I've used https://github.com/rokath/trice which is similar but
       | targeting microcontrollers. It isn't immediately clear to me if
       | nanolog could run on a microcontroller with its output directed
       | over RTT/uart/etc or not.
        
       | fra wrote:
       | This is a common technique in embedded software. A few other
       | examples:
       | 
        | 1. Trice (already mentioned in the comments)
       | https://github.com/rokath/trice
       | 
       | 2. Pigweed's Tokenizer (from Google)
       | https://pigweed.dev/pw_tokenizer/
       | 
       | 3. Memfault's Compact Logs
       | https://docs.memfault.com/docs/mcu/compact-logs
       | 
       | 4. Defmt by Ferrous Systems https://defmt.ferrous-systems.com/
        
         | frizlab wrote:
         | 5. macOS logging system
         | https://developer.apple.com/documentation/os/logging/viewing...
        
         | enigmo wrote:
         | 6. WPP in Windows 2000 ETW https://learn.microsoft.com/en-
         | us/windows-hardware/test/weg/...
        
         | lokar wrote:
          | The Google logging library also defers formatting.
        
       | perching_aix wrote:
       | that sounds deliciously performant, love projects like these
        
         | yas_hmaheshwari wrote:
          | I was also thinking the same! How come such a good idea is
          | not already part of standard logging libraries -- letting you
          | configure offloading to another process or message queue?
         | 
         | Loved the idea
        
       | Validark wrote:
       | Amazing work! I was wondering just a few months ago whether
       | someone ever made a logger that deferred all the expensive work
       | of string formatting to consumption time.
       | 
       | ~~I'm a bit surprised that it didn't come along sooner though.
       | How come nobody at Google or VMware who said they noticed this
       | was a problem solved it? Or any other major tech company? I guess
       | maybe this is partially an issue with our programming languages
       | and build tools though? I'm a Zig enthusiast though so in my head
       | it's trivial, but I guess it won't be until C++26 that they get
       | potentially comparable comptime facilities for C++.~~ I'm
       | surprised Go doesn't work like this by default though. For a
       | language like Go, I'd have made a builtin log keyword that does
       | this.
       | 
       | EDIT: Looks like other implementations of similar ideas do exist.
       | Still awesome though!
        
         | yuliyp wrote:
         | You have to be careful in deferring such work. It may end up
         | more expensive if it means you have multiple threads accessing
         | that data, and/or needing to extend the lifetime of an object
         | so the logger can access it.
        
           | jnordwick wrote:
            | As long as you are just using static strings and native
            | types, it amounts to a pointer/index bump and a load/store
            | per item. Let's imagine you have the format string, priority
            | number, system id, and 7 pieces of data in the payload. That
            | would be 10 items, so something like 40 cycles? I can see
            | the 18ns the paper gets.
            | 
            | I have no doubt the 7ns number is heavily cooked.
        
         | lokar wrote:
          | The Google logging library has deferred the formatting for
          | years.
        
       | geertj wrote:
        | The consumer side of this would be polling a memory location
        | for new logs, correct? It would not be possible to wake up the
        | consumer in 7ns, as that would take a FUTEX_WAKE system call,
        | which is O(microseconds).
       | 
       | I've been wondering about a FUTEX_WAKE that does not require a
       | system call. Possibly, the kernel could poll a global memory
       | area. Or maybe there is some low-level debugging API available
       | where the kernel could be notified of a memory write by a
       | process?
        
         | gpderetta wrote:
         | There isn't a significant advantage in having the kernel doing
         | the polling, it would still be busy polling.
         | 
         | If you just don't want to burn power but you can still dedicate
         | a core, there is https://www.felixcloutier.com/x86/mwait.
        
           | geertj wrote:
           | > There isn't a significant advantage in having the kernel
           | doing the polling, it would still be busy polling.
           | 
            | I was thinking in terms of a generic syscall-less wake
            | functionality where the kernel could do this for all
            | processes in the system. So you'd lose one core per system
            | instead of one core per consumer.
           | 
           | >If you just don't want to burn power but you can still
           | dedicate a core, there is
           | https://www.felixcloutier.com/x86/mwait.
           | 
           | Interesting. Could be used to make the kernel loop above burn
           | less power.
           | 
            | A user-space implementation could presumably also be built:
            | a memory segment shared between the producers and a monitor.
            | A producer sets a flag when it needs attention, and the
            | monitor busy-polls the segment. The monitor could then use
            | e.g. a signal to wake up consumers.
            | 
            | The latency between the producer signaling and the consumer
            | taking action would be higher than with futexes. But there
            | would be no waits/context switches in the producer at all.
            | Might be a solution for some low-latency use cases.
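            | 
            | A rough sketch of that monitor idea (all names are
            | hypothetical; a real version would need registration,
            | error handling, etc.):
            | 
            |     #include <atomic>
            |     #include <immintrin.h>   // _mm_pause
            |     #include <signal.h>      // kill
            |     #include <sys/types.h>   // pid_t
            | 
            |     struct SharedSeg {       // lives in shared memory
            |         std::atomic<uint32_t> needs_attention[64];
            |         pid_t consumer_pid[64];
            |     };
            | 
            |     // Producer: a single store, no syscall, no wait.
            |     void notify(SharedSeg* seg, int slot) {
            |         seg->needs_attention[slot]
            |             .store(1, std::memory_order_release);
            |     }
            | 
            |     // Monitor: one dedicated core busy-polls the segment
            |     // and wakes the matching consumer, e.g. with a signal.
            |     void monitor_loop(SharedSeg* seg) {
            |         for (;;) {
            |             for (int i = 0; i < 64; ++i)
            |                 if (seg->needs_attention[i].exchange(
            |                         0, std::memory_order_acquire))
            |                     kill(seg->consumer_pid[i], SIGUSR1);
            |             _mm_pause();     // or mwait, to burn less power
            |         }
            |     }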
        
         | toxik wrote:
          | O(microseconds) = O(years); this is not what big-O notation
          | means.
        
           | renewiltord wrote:
           | That seems like a type error on your part. O(unit) is a
           | categorically different notation than O(mathematical
           | function). You must be confusing it with the asymptotic
           | notion that is the latter. But units are not functions and so
           | clearly it cannot be the same notation.
        
             | toxik wrote:
             | No, and it's absurdly petty to argue otherwise.
        
               | davidsgk wrote:
               | One might say it's also pretty petty to call out a casual
               | usage of a notation being used in a way that people in
               | the thread are understanding just fine...
        
               | toxik wrote:
               | I think it's helpful because it is a common mistake.
        
         | jnordwick wrote:
          | The background thread can just sleep if there's no data in
          | the queue.
          | 
          | Since you are preprocessing the log messages, you know the
          | bounds. Say the thread sleeps for 1 ms; be generous and assume
          | you can push to the queue at most once every 5ns, and that the
          | largest messages you push will be 200 bytes. You can then
          | statically determine that a 40MB ring buffer ensures enough
          | space to sleep even at the max rate with the largest messages.
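          | 
          | The arithmetic, as a compile-time sketch (numbers from above):
          | 
          |     #include <cstdint>
          | 
          |     constexpr uint64_t kSleepNs    = 1'000'000; // 1 ms sleep
          |     constexpr uint64_t kMinPushNs  = 5;         // fastest push
          |     constexpr uint64_t kMaxMsgSize = 200;       // bytes
          |     constexpr uint64_t kMaxMsgs  = kSleepNs / kMinPushNs;
          |     constexpr uint64_t kBufBytes = kMaxMsgs * kMaxMsgSize;
          |     static_assert(kBufBytes == 40'000'000,
          |                   "40MB absorbs one full sleep interval");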
         | 
          | And that's just a simplistic scheme. If you have a pool of
          | logging structs, so your queue is just a pointer to one of
          | those, you can get away with much less, even.
          | 
          | And I can think of faster ways too. It isn't that difficult
          | to get around using a semaphore.
        
         | sriram_malhar wrote:
         | The consumer (background thread) only polls; there is no need
         | to wake up the consumer.
         | 
          | At steady state, the consumer is either blocked on I/O or is
          | scanning the producer buffers. When all producer buffers are
          | empty, it can just sleep for a short time.
        
           | geertj wrote:
            | This would work in this specific case, where we know that
            | there is a maximum rate at which work is produced. Arguably
            | I was hijacking the thread to discuss a more general problem
            | that I've been thinking about for a while. I have the sense
            | that a ring buffer with a wait-free push, a low bound on
            | consumer-side latency, and no fixed-interval polling would
            | be a nice primitive that I certainly could have used at
            | times.
            | 
            | And in fact ... the wait-free wakeup part of this is already
            | there. Now that io_uring has futex support, a producer can
            | enable kernel-side busy polling on the uring and then submit
            | a FUTEX_WAKE to the ring without doing any syscalls. This
            | GitHub issue [1] has a nice description.
            | 
            | [1] https://github.com/axboe/liburing/issues/385
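            | 
            | Roughly, and untested (assuming liburing >= 2.5 and a 6.7+
            | kernel; the exact flags should be checked against the
            | liburing man pages):
            | 
            |     #include <liburing.h>
            |     #include <linux/futex.h>
            | 
            |     struct io_uring ring;
            | 
            |     void setup() {
            |         struct io_uring_params p = {};
            |         p.flags = IORING_SETUP_SQPOLL; // kernel polls the SQ
            |         p.sq_thread_idle = 2000;       // ms before it dozes
            |         io_uring_queue_init_params(64, &ring, &p);
            |     }
            | 
            |     void wake_one(uint32_t* futex_word) {
            |         struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
            |         io_uring_prep_futex_wake(sqe, futex_word, /*val=*/1,
            |                                  FUTEX_BITSET_MATCH_ANY,
            |                                  FUTEX2_SIZE_U32, 0);
            |         // No syscall here while the SQPOLL thread is awake.
            |         io_uring_submit(&ring);
            |     }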
        
       | jnordwick wrote:
       | It uses a background thread to do most of the work, and it
       | appears the 7ns latency numbers are a little cooked:
       | 
        | 1. The paper's corresponding number is 8ns for microbenchmarks
        | but 18ns in applications. The 7ns number is, I'm guessing, from
        | microbenchmarks, and the true application-level number is
        | probably more in the 17ns range.
       | 
        | 2. It isn't precisely clear what that is measuring. The paper
        | says it is the invocation time of the logging thread.
        | Considering the thread making the call to log just passes most
        | of the work to a background thread through a multi-producer
        | single-consumer queue of some sort, this is likely the time to
        | dump it in the queue.
       | 
        | So you really aren't logging in 7ns. The way I'm reading this
        | is that you're dumping onto a queue in 17ns and letting a
        | background thread do the actual work.
       | 
        | The workload is cut down by preprocessing: creating a
        | dictionary of static elements to reduce the I/O cost of the
        | thread doing the actual writing. (I assume this just means
        | taking the format strings and indexing them, which you could
        | build at runtime, so I'm not sure the preprocessing step is
        | really needed.)
       | 
        | My logger, which dumps binary blobs onto a ring buffer for
        | another process to log, might be able to beat this invocation
        | latency. This isn't really groundbreaking. I know a few places
        | that log the binary blobs and format them later. None of them
        | do the dictionary part, but when that work is going to a
        | background thread, I'm not sure how much that matters.
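        | 
        | For illustration, the general binary-blob shape (a sketch, not
        | NanoLog's actual code; ring_reserve/ring_commit are an assumed
        | wait-free ring-buffer API):
        | 
        |     #include <cstddef>
        |     #include <cstdint>
        |     #include <cstring>
        | 
        |     struct RecordHdr {
        |         uint32_t fmt_id; // index into the format dictionary
        |         uint64_t tsc;    // raw timestamp, decoded later
        |     };
        | 
        |     char* ring_reserve(size_t n); // assumed wait-free API
        |     void  ring_commit(size_t n);
        | 
        |     template <typename... Args>
        |     void fast_log(uint32_t fmt_id, uint64_t tsc,
        |                   const Args&... args) {
        |         constexpr size_t payload = (sizeof(Args) + ... + 0);
        |         char* p = ring_reserve(sizeof(RecordHdr) + payload);
        |         RecordHdr hdr{fmt_id, tsc};
        |         std::memcpy(p, &hdr, sizeof hdr);
        |         p += sizeof hdr;
        |         // Copy raw argument bytes; assumes trivially copyable
        |         // arguments. Formatting happens later, off this thread.
        |         ((std::memcpy(p, &args, sizeof args),
        |           p += sizeof args), ...);
        |         ring_commit(sizeof(RecordHdr) + payload);
        |     }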
        
         | szundi wrote:
          | The only thing that makes sense is that the thread sending
          | the logs is blocked for 7ns - otherwise too much
          | context-dependent extra comes in to make a claim like this.
        
         | gpderetta wrote:
         | Yes, the overhead in the logging thread is what this is trying
         | to minimize. The background thread is considered "free". This
         | sort of async logging is a common setup for some class of
         | applications.
         | 
         | And yes, it boils down to writing data to a message queue. Most
         | of the overhead is probably the call to the hardware timestamp
         | counter.
        
           | jnordwick wrote:
            | In my logging code, which is basically an SPSC ring buffer,
            | I use some RDTSC assembly, and at startup I calculate the
            | frequency and epoch offset. It has a throughput of around
            | 30 cycles.
            | 
            | That's already ~10 ns, so I'm not sure how they are getting
            | their numbers. If they are timestamping the data when the
            | background thread gets to it, that would be pushing even
            | more work to it. I'm guessing they do, or else they could
            | potentially be logging out-of-order data with multiple
            | threads.
        
             | gpderetta wrote:
             | They are likely timestamping on the logging thread.
             | Possibly they are just running at higher frequency.
        
             | screcth wrote:
             | You could store the current date and time in a global
             | variable and have the producers just read it atomically.
             | The consumer thread would then update it periodically.
             | 
             | Timestamps will be somewhat inaccurate but it may help
             | performance.
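                | 
                | Something like this (a sketch; read_clock_ns is
                | whatever real clock read the consumer uses):
                | 
                |     #include <atomic>
                |     #include <cstdint>
                | 
                |     // Consumer refreshes this periodically, e.g. once
                |     // per drain iteration:
                |     //   g_now_ns.store(read_clock_ns(),
                |     //                  std::memory_order_relaxed);
                |     std::atomic<uint64_t> g_now_ns{0};
                | 
                |     // Producer hot path: one relaxed load instead of
                |     // a clock read.
                |     uint64_t cheap_timestamp() {
                |         return g_now_ns.load(std::memory_order_relaxed);
                |     }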
        
               | jnordwick wrote:
               | that's what the vdso version of clock_gettime does. If
               | you use one of the *_COARSE clocks it will only update
               | periodically and be much faster, but that means like 15
               | milliseconds of log messages will all have the same
               | timestamp.
               | 
                | The fastest option for nanosecond precision (bonus:
                | it's even sub-nanosecond) is just to store the return
                | value of RDTSC and let the background thread figure it
                | all out. You don't even need to precalculate the
                | frequency or epoch offset. Just write a couple of log
                | messages with the rdtsc value and CLOCK_REALTIME and
                | let the post-processing figure it out.
               | 
                | To cut down on I/O, each log message's timestamp can
                | just be an offset from the last event.
               | 
                | If you are willing to push a lot of work to the
                | background thread and even more to the post-processing
                | step, you really don't need to do very much.
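                | 
                | A sketch of that calibration-by-logging idea:
                | 
                |     #include <cstdint>
                |     #include <ctime>
                |     #include <x86intrin.h>   // __rdtsc
                | 
                |     // Log a (tsc, CLOCK_REALTIME) pair at startup and
                |     // shutdown; post-processing derives frequency and
                |     // epoch offset from the two pairs and converts
                |     // every raw tsc value afterwards.
                |     struct CalPoint { uint64_t tsc; uint64_t real_ns; };
                | 
                |     CalPoint calibration_point() {
                |         timespec ts;
                |         clock_gettime(CLOCK_REALTIME, &ts);
                |         return { __rdtsc(),
                |                  uint64_t(ts.tv_sec) * 1'000'000'000u
                |                      + uint64_t(ts.tv_nsec) };
                |     }
                | 
                |     // Hot path: the "timestamp" is just the raw
                |     // counter value.
                |     uint64_t log_timestamp() { return __rdtsc(); }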
        
               | Thaxll wrote:
                | > that's what the vdso version of clock_gettime does.
                | If you use one of the *_COARSE clocks it will only
                | update periodically and be much faster, but that means
                | like 15 milliseconds of log messages will all have the
                | same timestamp.
               | 
                | Not sure it matters a lot to have multiple messages
                | with the same timestamp; since they were added in
                | order, you still know which one is older. The problem
                | might arise when you send those logs to a remote place
                | where the order of insertion is discarded and the
                | timestamp is used instead.
               | 
               | I assume that when you use a single thread with a queue /
               | ring buffer the order of insertion is kept.
        
               | gpderetta wrote:
               | FWIW, I have relied on the accuracy of log timestamps
               | well into the low microseconds.
               | 
               | If you have an event loop, it might be acceptable to
               | sample the counter once per iteration, but less than that
               | it becomes too lossy.
        
               | pas wrote:
                | probably in this case it's important to use some kind
                | of synthetic timestamping to preserve the ordering (for
                | example, a 5.123 ms timestamp spans ~1,000,000 possible
                | nanosecond values, so if, say, a thousand entries need
                | to be ordered, one can represent them as 5.123000[n]
                | ... and the "000" part is just a silly in-band way to
                | give a hint to someone who will later examine the logs)
        
               | jnordwick wrote:
                | Since you aren't going to be writing a message per
                | nanosecond, you can always just do `last_nanos =
                | max(last_nanos + 1, cur_nanos)` and then use last_nanos
                | for the timestamp. You can even do it in rdtsc ticks
                | and get 1/3-nanosecond granularity. Obviously the clock
                | isn't nearly that accurate, but it lets you use those
                | fractional nanos to ensure a strictly increasing
                | ordering.
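                | 
                | In code:
                | 
                |     #include <algorithm>
                |     #include <cstdint>
                | 
                |     // Strictly increasing timestamps: never emit the
                |     // same value twice, even if the clock hasn't
                |     // advanced since the last message.
                |     uint64_t next_timestamp(uint64_t& last_nanos,
                |                             uint64_t cur_nanos) {
                |         last_nanos = std::max(last_nanos + 1, cur_nanos);
                |         return last_nanos;
                |     }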
        
             | andrepd wrote:
             | If rdtsc is a cycle counter then this also assumes a
             | constant clock rate.
        
               | loeg wrote:
               | rdtsc has been "invariant" for a long time.
        
               | andrepd wrote:
               | Oh TIL, I always thought it was a plain cycle counter
        
             | loeg wrote:
             | In a similar SPSC ring buffer setup I see something like
             | 1.8ns per 16B entry without RDTSC (our design has a cached
             | thread-local tsc that is updated periodically, so this is
             | vaguely reasonable) or 7.6ns per entry with RDTSC
             | (approximate RDTSC cost of 4.8ns). This is on a VM on some
             | sort of Skylake CPU (maybe Skylake X).
        
         | renewiltord wrote:
         | Yeah. But that's okay. For the applications that care about
         | this, it's just a question of how long the application thread
         | is blocked. The logging thread can spin on its core and take as
         | long as it wants so long as it can burn down the buffer faster
         | than the application thread can add to it.
         | 
         | I thought log4j2 used a disruptor in between app thread and
         | logging thread but maybe I'm wrong.
        
         | DesiLurker wrote:
          | I did something similar about a decade ago. The main active
          | costs for logging (assuming writing into a fixed-size
          | lockless ring buffer) are the string formatting and the
          | timestamp fetching. I did my own timestamp interpolation
          | with rdtsc and periodic fetches to get that to a minimum;
          | string formatting you can avoid completely by preregistering
          | messages and storing away the arguments.
          | 
          | I have also seen people make the logger into a DSO so they
          | can dump strings from the main app binary and record
          | addresses, to avoid the hassle of preregistering. But with
          | preregistering I was able to get the perf all the way down
          | to ~40ns per message w/o a separate log-dumper thread. At
          | that point you actually start running into disk I/O
          | limitations.
          | 
          | One additional thing I did was to create the ring buffer in
          | a shared memory segment and log to that from all services in
          | the system, so you don't even have to think about crashing
          | services. Anyway, the point is this hardly seems
          | groundbreaking. What I'd like to see is this evolve into a
          | fast syslog-type service where one just freely logs trace
          | messages throughout w/o worrying about the cost, and we can
          | use them for postmortem analysis or filter at dump time.
        
         | hinkley wrote:
        | One of the reasons I think logging as a production tool has
        | risen in status in the last fifteen years is that in the 90's
        | you had languages where it was decided that if your app
        | crashes, we want to know what the last operation in flight was
        | when it crashed.
        | 
        | As we now know, the syscalls to force a flush to disk often
        | didn't, so if you were trying to figure out what happened
        | right before a kernel panic, good fucking luck. But if it was
        | a purely userspace problem, you might be okay.
         | 
         | These days we care more about throughput. I'd personally rather
         | see us commit to stats for use cases that require that sort of
         | logging traffic, but I think that's one of those conversations
         | that's nearing the top of the heap. Maybe we'll have that
         | argument as soon as we stop flooding the literature with talk
         | of AI.
         | 
        | So 7ns tells me something about throughput, but I think I'd
        | rather just know the messages per second per CPU. And perhaps
        | more importantly, with a CPU that's 80% saturated doing
        | non-logging tasks, because I need to make forward progress
        | between the log messages.
        
       | packetlost wrote:
       | I have ideas for a logging/wide metric system that uses this
       | technique and some others stolen from DNS and IP. It's largely
       | inspired by a combination of a system I've built at my day job
       | that implements distributed command & control for servo-ing,
       | monitoring, etc. It's been really successful, but the hardest
       | part is mapping a unique numeric identifier to a human readable
       | string in a way that is dynamic and efficient enough. It really
       | seems like the exact same problem as DNS, which leads me to
       | believe there's likely no way without a persistent centralized
       | registry/database.
        
       | kolbe wrote:
       | I could swear I did a deep dive into Spdlog vs Nanolog six months
       | ago, and the performance differences weren't nearly this stark
        
         | synergy20 wrote:
         | what do you mean? considering spdlog is the de facto logger for
         | c++
        
           | kolbe wrote:
           | Nano is claiming insanely better performance over spdlog,
           | which confuses me
        
       | swah wrote:
        | So you're actually spawning threads in this simple C++
        | example? I thought this was frowned upon in C++ land...
        | 
        |     #include "NanoLogCpp17.h"
        |     using namespace NanoLog::LogLevels;
        | 
        |     int main()
        |     {
        |         NANO_LOG(NOTICE, "Hello World! This is an integer %d "
        |                  "and a double %lf\r\n", 1, 2.0);
        |         return 0;
        |     }
        
       | bazzargh wrote:
       | The paper says a lot of the secret sauce is dumping a dictionary
       | of the static content and then logging in a binary format. That
       | format looks a lot like gzip, if you squint. Could something like
       | this use the _actual_ gzip format, but writing with a static
       | dictionary, to make life easier for tools? (gzip has a trailer,
       | but I'm not sure how much attention is paid to that, since it's
       | often used for streams)
        
       | newobj wrote:
       | The real headline here is that log4j2 is faster than Boost.Log
        
       | loeg wrote:
        | 7ns latency is in the ballpark of _small_ writes to L1 cache,
        | i.e. some sort of in-memory-only append, to somewhere predicted
        | by the prefetcher, of a single cache line or less. So yeah,
        | some sort of ring-buffer log could definitely support this kind
        | of figure. The question is how much throughput your persistence
        | channel has, how much memory you can devote to logging while
        | your persistence runs async, whether you are ok with a somewhat
        | volatile log, etc.
        
       ___________________________________________________________________
       (page generated 2024-08-23 23:00 UTC)