[HN Gopher] A lock-free ring-buffer with contiguous reservations...
       ___________________________________________________________________
        
       A lock-free ring-buffer with contiguous reservations (2019)
        
       Author : simonpure
       Score  : 153 points
       Date   : 2024-02-29 14:57 UTC (8 hours ago)
        
 (HTM) web link (ferrous-systems.com)
 (TXT) w3m dump (ferrous-systems.com)
        
       | stmblast wrote:
       | Awesome article! Bookmarked.
        
       | cozzyd wrote:
       | Good thing the subtle concurrency bugs are on a platform rarely
       | used for embedded development
        
       | fritzo wrote:
       | See also the Java LMAX Disruptor https://github.com/LMAX-
       | Exchange/disruptor
       | 
       | I've built a similar lock-free ring buffer in C++11
       | https://github.com/posterior/loom/blob/master/doc/adapting.m...
        
         | GenericCanadian wrote:
         | I also wrote an LMAX Disruptor in Crystal:
         | https://github.com/nolantait/disruptor.cr
         | 
         | Here is one in Ruby: https://github.com/ileitch/disruptor
         | 
         | Both languages are quite readable and I've used these to teach
         | the concepts to beginners.
        
       | rdtsc wrote:
       | My favorite ring buffer structure is like the one described in
       | https://www.gnuradio.org/blog/2017-01-05-buffers/
       | 
       | > It asks the operating system to give it memory-mappable memory,
       | and map twice, "back to back". Blocks then get called with
       | pointers to the "earliest" position of a workload within this
       | memory region - guaranteeing that they can access all memory they
       | were offered in a linear fashion, as if they'd actually be
       | dealing with hardware ring buffers.
       | 
       | It imposes limitations on hardware and OS support in a way, but I
       | think it's pretty neat.
       | 
        | This is also used by the kernel BPF ring buffer:
       | 
       | https://www.kernel.org/doc/html/latest/bpf/ringbuf.html
       | 
       | > ...data area is mapped twice contiguously back-to-back in the
       | virtual memory. This allows to not take any special measures for
       | samples that have to wrap around at the end of the circular
       | buffer data area, because the next page after the last data page
       | would be first data page again, and thus the sample will still
       | appear completely contiguous in virtual memory
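        | 
        | Roughly, the setup looks like this (a minimal Linux-only
        | sketch using the libc crate; the function name and layout
        | are illustrative, not from either article):
        | 
        |     use libc::*;
        | 
        |     /// Map `len` bytes (a multiple of the page size) twice,
        |     /// back to back. The returned region is 2 * len long.
        |     unsafe fn magic_ring(len: usize) -> *mut u8 {
        |         // Anonymous in-memory file backing both mappings.
        |         let fd = memfd_create(b"ring\0".as_ptr().cast(), 0);
        |         assert!(fd >= 0 && ftruncate(fd, len as i64) == 0);
        |         // Reserve a 2 * len window of address space.
        |         let base = mmap(std::ptr::null_mut(), 2 * len, PROT_NONE,
        |                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        |         assert_ne!(base, MAP_FAILED);
        |         // Map the same file into both halves of the window.
        |         for i in 0..2 {
        |             let at = (base as *mut u8).add(i * len);
        |             let p = mmap(at.cast(), len, PROT_READ | PROT_WRITE,
        |                          MAP_SHARED | MAP_FIXED, fd, 0);
        |             assert_ne!(p, MAP_FAILED);
        |         }
        |         close(fd);
        |         base as *mut u8
        |     }
        | 
        | A write that runs past base + len lands in the second
        | mapping, which aliases the start of the buffer, so
        | wrap-around needs no special casing on the software side.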
        
         | dist1ll wrote:
         | Unfortunately that means the chunks are only contiguous in
         | virtual memory. So it won't work with the DMA use case
         | mentioned in the article, which requires contiguous physical
         | addresses.
         | 
         | But it's still a nice trick, I like it when people get creative
         | with HW features.
        
           | jnwatson wrote:
           | Theoretically, the same trick could be used to double map
           | addresses coming from an external master via an IOMMU.
        
             | speed_spread wrote:
             | It is unlikely that cheap microcontrollers where DMA is
             | most helpful will have an IOMMU.
        
           | gpderetta wrote:
            | But the hardware only needs to see one copy of the
            | duplicated memory, and you can let it deal with the
            | wraparound. The software can use the convenience of the
            | double mapping.
        
         | gpderetta wrote:
          | AKA the Magic Ring Buffer. It's extremely convenient not
          | having to deal with split messages, especially if you
          | support variable-size payloads.
        
         | scottlamb wrote:
         | This is a cool trick, and iirc there are a few Rust crates that
         | implement it, including slice-deque.
         | 
         | ...but I think there are a few significant downsides, even in
         | userspace:
         | 
         | * The most obvious one: you have to write `unsafe`, non-
         | portable code. The Linux, macOS, and Windows implementations
         | are totally different from each other.
         | 
          | * The setup and teardown of each ring is a bit expensive (a
          | few system calls). For a regular allocation, the malloc
         | implementation typically caches that for you. Here you have to
         | do your own pooling if you might be frequently creating them.
         | 
         | * Using whole pages, and two mappings per ring, is wasteful in
         | terms of not only RAM (often no big deal) but also TLB space
         | (which often turns into significant CPU usage). If you just
         | allocate 32 64 KiB rings from the standard allocator, on x86-64
         | you might be talking about a single 2 MiB huge page mapping. If
         | you do this instead, you're talking about 1024 4 KiB page
         | mappings.
        
           | duped wrote:
            | Any real, production-ready ring buffer should be using
            | unsafe. I would consider anything that doesn't to be a
            | toy.
           | 
            | In Rust it's basically impossible to do this without
            | MaybeUninit. You _could_ use Option, but then you're
            | paying a massive cost for a very easy-to-write-and-audit
            | chunk of unsafe code.
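            | 
            | For instance, slot storage without the Option tag might
            | look like this (a sketch, not any particular crate's
            | layout):
            | 
            |     use core::mem::MaybeUninit;
            | 
            |     // N slots with no per-slot discriminant; which slots
            |     // are initialized is tracked by the read/write
            |     // indices, and that invariant is exactly what the
            |     // small unsafe block below relies on.
            |     struct Slots<T, const N: usize> {
            |         buf: [MaybeUninit<T>; N],
            |     }
            | 
            |     impl<T, const N: usize> Slots<T, N> {
            |         /// Caller must guarantee slot `i` was written
            |         /// and not yet taken.
            |         unsafe fn take(&mut self, i: usize) -> T {
            |             self.buf[i].as_ptr().read()
            |         }
            |     }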
        
       | jamesmunns wrote:
       | Oh hey, one of the authors of this post here (James), happy to
       | answer any questions.
       | 
       | This post has been discussed here a couple times, but AMA :)
       | 
        | Edit: the most commented version of this post was the
        | original:
       | 
       | https://news.ycombinator.com/item?id=20096946.
       | 
       | This is what I'm up to these days:
       | 
       | https://onevariable.com/
        
         | agentultra wrote:
         | Has there been a formal specification/model that we can check?
         | I love circular buffers but I'm curious how we know that the
         | design is correct with respect to the stated properties.
         | 
         | I did a bit of link diving to the various blog posts and sites
         | but haven't been able to find it. Would be nice, if it exists,
         | to have it front and centre.
        
           | jamesmunns wrote:
           | As far as I know, it hasn't been formally verified.
           | 
            | Andrea Lattuada (https://andrea.lattuada.me/) is more
            | likely to have done such work on his implementation of
            | the algorithm than I did on mine.
           | 
            | I have run the tests with various dynamic analysis tools,
            | but for sure this isn't formal verification. If someone
            | is interested in doing it, I'm happy to chat with them!
        
             | anonymousDan wrote:
              | Shouldn't be that hard to at least model check it in
              | TLA+, I would have thought (albeit potentially more
              | complex if trying to account for weak memory).
        
         | Fiahil wrote:
          | I used the same approach while designing a lock-free
          | bounded broadcast log (as in a "RWLock<Vec<T>>"; an MPMC,
          | append-only Vec). It's quite easy to do because it's
          | bounded. However, I could not find a way to make it both
          | unbounded and efficient.
         | 
         | Any ideas ?
        
           | jamesmunns wrote:
           | I am not sure! Most of the data structures I design are for
           | embedded systems without allocators. On the desktop, I mostly
           | defer to others.
           | 
           | I've used tokio's broadcast channel quite a bit before, but
           | it is also bounded. After talking to Eliza from the tokio
           | project, I'm fairly convinced that unbounded queues are a
           | scary thing to have around, operationally :).
           | 
           | But again - this is a bit out of my actual expertise!
        
           | paholg wrote:
           | I'm no expert here, but I wonder if a linked list of bounded
           | logs would work well.
           | 
           | So, everyone has a pointer to the current one, and there's an
           | atomic pointer to the next.
           | 
           | When the log fills up, you allocate a new one and set the
           | next pointer to it using compare_and_swap. If someone beat
           | you to it, you can walk the list and add yours to the end so
           | that the allocation isn't wasted.
           | 
           | This way, most of the time you're using the efficient bounded
           | log.
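            | 
            | The linking step might look something like this
            | (hypothetical types; the bounded log itself is elided):
            | 
            |     use std::ptr;
            |     use std::sync::atomic::{AtomicPtr, Ordering};
            | 
            |     struct Segment {
            |         // ... the bounded log lives here ...
            |         next: AtomicPtr<Segment>,
            |     }
            | 
            |     // Try to link `new_seg` after `cur`. If another
            |     // thread won the race, walk forward and retry so the
            |     // allocation isn't wasted.
            |     fn link(mut cur: &Segment, new_seg: *mut Segment) {
            |         loop {
            |             match cur.next.compare_exchange(
            |                 ptr::null_mut(), new_seg,
            |                 Ordering::AcqRel, Ordering::Acquire,
            |             ) {
            |                 Ok(_) => return,
            |                 Err(winner) => cur = unsafe { &*winner },
            |             }
            |         }
            |     }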
        
       | ChrisMarshallNY wrote:
       | That watermark is a simple, elegant idea.
       | 
        | I haven't really had the need for that kind of thing in many
        | years, but I like the idea.
        
         | loeg wrote:
         | Recently (2022) I designed a ring buffer for use as a 'flight
         | recorder' type tracing system. (I.e., there is typically no
         | reader; the writer needs to write over old records without
         | blocking on any reader. If the reader is triggered, it flips a
         | switch that disables the writer temporarily while the buffer
         | contents are persisted.) In that design I subdivided the ring
         | into several subbuffers (~8). Each subbuffer has its own
         | equivalent of a watermark. That way, the valid portion of the
         | ring always starts at the beginning of one of the subbuffers,
         | and the writer could 'free' the next subbuffer worth of space
         | trivially (without having to scan through old contents record
         | by record). (Any write that did not fit in the current
         | subbuffer was advanced to the start of the next one.)
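          | 
          | The subbuffer arithmetic is roughly this (a sketch with
          | made-up sizes; the wrap and reader-freeze bookkeeping is
          | elided):
          | 
          |     const SUBBUFS: usize = 8;
          |     const SUB_LEN: usize = 64 * 1024;
          | 
          |     struct Recorder {
          |         data: Box<[u8; SUBBUFS * SUB_LEN]>,
          |         write_pos: usize, // absolute offset of next write
          |     }
          | 
          |     impl Recorder {
          |         // Reserve `len` bytes in `data`; a record never
          |         // straddles a subbuffer boundary, so the valid
          |         // region always starts at a subbuffer boundary.
          |         fn reserve(&mut self, len: usize) -> usize {
          |             assert!(len <= SUB_LEN);
          |             let sub_end =
          |                 (self.write_pos / SUB_LEN + 1) * SUB_LEN;
          |             if self.write_pos + len > sub_end {
          |                 // Advance to the next subbuffer, freeing
          |                 // it wholesale: no record-by-record scan.
          |                 self.write_pos =
          |                     sub_end % (SUBBUFS * SUB_LEN);
          |             }
          |             let at = self.write_pos;
          |             self.write_pos += len;
          |             at
          |         }
          |     }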
        
       | monocasa wrote:
       | > Contended writes from multiple threads on the same memory
       | location are a lot harder for the CPU's cache coherence protocol
       | to handle
       | 
       | FWIW, those are the same location according to most cache
       | coherency protocols, since cache coherency generally works on the
       | cache line level. You'd want to split the two contexts to their
       | own cache lines.
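        | 
        | In Rust that split usually looks something like this
        | (illustrative; not the article's actual layout):
        | 
        |     use std::sync::atomic::AtomicUsize;
        | 
        |     // Force each index onto its own 64-byte cache line so
        |     // the producer and consumer don't false-share.
        |     #[repr(align(64))]
        |     struct CachePadded<T>(T);
        | 
        |     struct Indices {
        |         write: CachePadded<AtomicUsize>, // producer writes
        |         read: CachePadded<AtomicUsize>,  // consumer writes
        |     }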
        
         | stefanha wrote:
         | Another cache optimization trick some ring-buffer
         | implementations use is to keep a shadow copy of the read or
         | write pointer to avoid frequently fetching the other context's
         | cache line. The latest version of the read pointer is only
         | needed when the writer catches up with their shadow copy and
         | vice versa.
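          | 
          | As a sketch (a hypothetical SPSC consumer; the producer
          | side mirrors it with a shadow copy of the read pointer):
          | 
          |     use std::sync::atomic::{AtomicUsize, Ordering};
          | 
          |     struct Consumer<'a> {
          |         write: &'a AtomicUsize, // shared, producer-owned
          |         read_pos: usize,        // local read index
          |         shadow_write: usize,    // cached copy of `write`
          |     }
          | 
          |     impl Consumer<'_> {
          |         // How many slots are readable right now?
          |         fn readable(&mut self) -> usize {
          |             if self.shadow_write == self.read_pos {
          |                 // Only touch the producer's cache line
          |                 // once the cached value is exhausted.
          |                 self.shadow_write =
          |                     self.write.load(Ordering::Acquire);
          |             }
          |             self.shadow_write.wrapping_sub(self.read_pos)
          |         }
          |     }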
        
           | anonymousDan wrote:
           | Yes this is absolutely crucial for performance.
        
       | liquid153 wrote:
        | Aren't lock-free buffers usually just as expensive to use as
        | locks, or more expensive?
        
         | zengid wrote:
          | Not if your program needs to be realtime or near-realtime
          | safe. Locks are typically controlled by the OS and can have
          | non-deterministic latency.
        
           | loeg wrote:
           | Even ignoring mutexes and OS scheduling, plain spinlocks add
           | contention that would not otherwise exist in a SPSC
           | ringbuffer. https://news.ycombinator.com/item?id=39551575
        
         | CyberDildonics wrote:
         | No, where are you getting that information?
        
         | loeg wrote:
          | No -- for a lock-free design with a single producer and
          | consumer, the two sides are typically writing to
          | independent regions of memory. With a lock, both have to
          | write the same cache line to take and release the lock.
        
         | ww520 wrote:
          | Lock free has two advantages: the checking code can run in
          | user mode, and non-contested access is very cheap, with
          | just one instruction.
          | 
          | To do it correctly, a lock needs to be taken in the kernel,
          | so obtaining a lock requires calling into the kernel, which
          | is more expensive.
          | 
          | I think you meant the memory barrier for syncing cache is
          | just as expensive as the lock version, which is true.
        
           | scaredginger wrote:
           | Obtaining an uncontested lock absolutely doesn't require
           | calling into the kernel
        
             | jcelerier wrote:
              | How can you give hard guarantees of that on Windows,
              | Mac, and Linux with the OS- and/or libc-provided locks?
        
               | gpderetta wrote:
               | If you really really really need such a guarantee, you
               | implement your own.
               | 
               | Otherwise you inspect the implementation, but in 2024 a
               | fast-pathed OS lock is table stakes.
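                | 
                | Sketched, the fast path is just a user-space CAS; the
                | kernel is only entered (a futex wait, elided here)
                | when it fails:
                | 
                |     use std::sync::atomic::{AtomicU32, Ordering};
                | 
                |     // 0 = unlocked, 1 = locked. An uncontended
                |     // acquire is a single user-space instruction.
                |     fn try_lock(state: &AtomicU32) -> bool {
                |         state
                |             .compare_exchange(
                |                 0, 1,
                |                 Ordering::Acquire, Ordering::Relaxed,
                |             )
                |             .is_ok()
                |     }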
        
       | piterrro wrote:
        | This is a great article! Very detailed, and it explains
        | things at a low level.
        | 
        | For a more high-level implementation, yesterday I released a
        | blog post about ring buffers in Golang:
        | https://logdy.dev/blog/post/ring-buffer-in-golang
        
       | samsquire wrote:
        | I tried to write a lock-free ring buffer with weak atomics. I
        | haven't proved it right with TLA+ yet, but I started writing
        | a model. I use tagging to avoid the ABA problem.
       | 
        | They're all on https://github.com/samsquire/assembly. I tried
        | to write a disruptor with multiple consumers, then one with
        | multiple producers, then one with multiple consumers AND
        | multiple producers, inspired by the LMAX Disruptor. (There
        | are files for each of them and a table in the repo. It's not
        | proven yet!)
        | 
        | The contention on the same memory address (the read/write
        | index) is the thing that seems difficult to address.
       | 
       | One thing I've learned about thread safety:
       | 
        | I think if you have thread-owned values then you can be
        | thread-safe with a simple semaphore, provided that you have
        | unique, DISTINCT values for each thread.
       | 
        | If you have two threads that have this in a hot loop in
        | parallel:
        | 
        |     // thread 0
        |     if buffer[x].available == 1:
        |         // do stuff
        |         buffer[x].available = 0
        | 
        |     // thread 1
        |     if buffer[x].available == 0:
        |         // do stuff
        |         buffer[x].available = 1
       | 
        | Due to causality, no matter the interleaving, thread 0 owns
        | buffer[x].available and the body of the if statement when it
        | is 1, and thread 1 owns them when it is 0.
       | 
        | The CMP is a cheap mutex with distinct-valued memory
        | locations.
       | 
        | Even though thread 1 is writing to buffer[x].available and
        | thread 0 is writing to buffer[x].available, it doesn't matter
        | because the causality is mutually exclusive. There is no
        | interleaving of buffer[x].available = x because of the if
        | statement.
       | 
        | The buffer[x].available = 0 will never run while
        | buffer[x].available is equal to 0, overwriting or causing a
        | data race when setting buffer[x].available to 1. So the
        | second line cannot happen in parallel.
       | 
       | I need to write a TLA model to assert its safety.
       | 
        | If you have more than 2 threads, then you need different
        | tokens to provide admissibility to the if statement.
       | 
        | Remember to use a compiler memory barrier:
        | 
        |     asm volatile ("" ::: "memory");
        | 
        | so you don't need volatile struct values.
        
         | loeg wrote:
          | > The buffer[x].available = 0 will never run while
          | buffer[x].available is equal to 0, overwriting or causing a
          | data race when setting buffer[x].available to 1.
         | 
         | In particular, because loads and stores of the same variable
         | cannot be reordered out of program order. Once your algorithm
         | involves other variables, you would (likely) need to be a
         | little careful about loading/storing with acquire/release
         | semantics to prevent reordering other accesses relative to this
         | protocol.
         | 
         | > Remember to use compiler memory barrier
         | 
         | I would highly recommend using the language atomic types (and
         | barriers if truly needed) instead of gcc inline assembly
         | syntax.
        
           | samsquire wrote:
           | Thanks for your reply. This subject is still new to me.
           | 
           | My understanding of that syntax is that it is a compiler
           | memory barrier, not a CPU memory barrier because the asm
           | block is empty (no sfence or mfence).
        
             | loeg wrote:
             | Hey, no problem.
             | 
             | > My understanding of that syntax is that it is a compiler
             | memory barrier, not a CPU memory barrier because the asm
             | block is empty (no sfence or mfence).
             | 
             | In C11, you can write compiler-only fences with
             | atomic_signal_fence:
             | 
             | https://en.cppreference.com/w/c/atomic/atomic_signal_fence
             | 
             | (In practice, though, I think it is rare that you actually
             | want a compiler-only fence. Instead, correct use of
             | acquire/release operations prevents reorderings.)
        
               | samsquire wrote:
                | Thank you loeg, I appreciate you and the information
                | you brought. TIL.
               | 
                | I've been using a compiler fence to force reloads
                | from memory, to prevent -O3 from optimising away
                | changes to my variables/structs made by other threads
                | and keeping data in registers rather than reloading
                | it from memory each time. I saw volatile recommended
                | against by the Linux kernel programmers.
               | 
               | such as my thread->running == 1 in my event loops for my
               | threads.
               | 
               | https://www.kernel.org/doc/html/latest/process/volatile-
               | cons...
        
               | loeg wrote:
               | > I've been using a compiler fence to force reloads from
               | memory to prevent -O3 from optimising away my
               | variables/structs changing by other threads
               | 
               | I would highly recommend using the language standard
               | atomic primitives instead.
        
         | samsquire wrote:
          | When you check available, you might have to do it as an
          | __atomic_load_n(&sender->realend, __ATOMIC_SEQ_CST) rather
          | than a plain load, and do an __atomic_store_n when setting
          | available.
        
         | gpderetta wrote:
          | The problem here is:
          | 
          |     [1] // do stuff
          |     [2] buffer[x].available = Y
          | 
          | There is no explicit nor implicit ordering between 1 and 2,
          | so the compiler or CPU can reorder them. You need a release
          | barrier between the two.
         | 
         | Also while most CPUs preserve the control dependency, not all
         | do (famously Alpha), and certainly not compilers. You would
         | need a consume barrier, except that c++11 consume is only for
         | data dependencies and unimplemented anyway.
         | 
          | Edit: with the correct barriers in place, you can prove
          | correctness by analogy to two size-1 SPSC queues used to
          | exchange a mutual exclusion token, with the added quirk
          | that, as the queues are never used at the same time, they
          | can actually be physically colocated in memory.
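          | 
          | In Rust-flavoured form, the fixed handoff for thread 0
          | might read (illustrative; thread 1 is symmetric with 0 and
          | 1 swapped):
          | 
          |     use std::sync::atomic::{AtomicU8, Ordering};
          | 
          |     fn thread0(available: &AtomicU8, buffer: &mut [u8]) {
          |         // Acquire: "do stuff" can't move above this load.
          |         if available.load(Ordering::Acquire) == 1 {
          |             buffer[0] = 42; // do stuff
          |             // Release: "do stuff" can't move below this
          |             // store, so the other thread observes it.
          |             available.store(0, Ordering::Release);
          |         }
          |     }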
        
           | samsquire wrote:
            | Thank you for sharing your knowledge, gpderetta.
            | Appreciated, TIL.
        
         | anonymousDan wrote:
         | Regarding the contention, one thing that's important is to
         | cache a local copy of the shared head and tail variables every
         | time you access them. Then for subsequent operations you can
         | first check the local cached copy to see if you can perform the
         | read or write without needing to check the shared variables.
        
       | loeg wrote:
       | > The safe thing to do here is to always choose Ordering::SeqCst,
       | "sequential consistency", which provides the strongest
       | guarantees. ... This is often good enough in practice, and
       | switching to a weaker Ordering is only necessary to squeeze out
       | the last bit of performance.
       | 
       | If you're going to write lock-free algorithms using atomics, the
       | least you can do is learn about ordering semantics and use the
       | correct abstract semantics for your design's actual needs. It is
       | much easier to do it at design time than to try to figure out if
       | it is safe to relax SeqCst later. (This is one of the major flaws
       | of C++ std::atomic's default SeqCst semantics.) If you aren't
       | going to bother understanding ordering semantics, it is unlikely
       | you can write a safe lock-free algorithm anyway. (It's really
       | hard to do!)
        
         | jamesmunns wrote:
         | Back then, there weren't as good references for explaining
         | atomic ordering, and the blog post had gotten long enough.
          | Mentioning SeqCst was a bit of a cop-out, though neither
          | Andrea nor I ended up using SeqCst past the initial impl
          | anyway.
         | 
         | Today I would have just linked to https://marabos.nl/atomics/,
         | Mara does a much better job of explaining atomics than I could
         | have then or now.
        
           | hinkley wrote:
           | Back then? Do you mean 2019? Or a different "then"? Because
           | there was plenty of material in CS about this subject even in
           | 2010. Java was wrestling with this twenty years ago, and
           | databases long before that.
        
             | jamesmunns wrote:
             | I meant 2019, and there weren't any materials that I would
             | consider as clear and well defined as Mara's linked docs
             | explaining the different orderings used by C, C++, and Rust
             | (Relaxed, Release, Acquire, AcqRel, and SeqCst).
             | 
             | I'm very sure there were discussions and teaching materials
             | then, but none (that I was aware of) focused on Rust, and
             | something I'd link to someone who had never heard of atomic
             | ordering before.
        
             | anonymous-panda wrote:
             | I think the hard part of it is that x86 only has one atomic
             | ordering and none of the other modes do anything. As such,
             | it's really hard to build intuition about it unless you
              | spend a lot of time writing such code on ARM, which
              | wasn't that common in the industry; and today most
              | people use higher-level abstractions.
             | 
             | By databases, do you mean those running on DEC Alphas?
             | Cause that was a niche system that few would have had
              | experience with. If you meant to compare in terms of
              | consistency semantics, sure, but there are meaningful
              | differences between the consistency semantics of
              | concurrent database transactions and atomic ordering in
              | a multithreaded context.
             | 
             | Java's memory model "wrestling" was about defining it
             | formally in an era of multithreading and it's largely
             | sequentially consistent - no weakly consistent ordering
             | allowed.
             | 
             | The c++ memory model was definitely the first large scale
             | adoption of weaker consistency models I'm aware of and was
             | done so that ARM CPUs could be properly optimized for since
             | this was c++11 when mobile CPUs were very much front of
              | mind. Weak consistency remains really difficult to
              | reason about, and even harder to play around with if
              | you primarily work with x86, and there's very little
              | validation tooling around that can help you gain
              | confidence that your code is correct. Of course, you
              | can follow common "patterns" (e.g. loads are always
              | acquire and stores are release), but fully grokking
              | correctness and being able to play with the model in
              | interesting ways is no small task, no matter how many
              | learning resources are out there.
        
               | gpderetta wrote:
                | Nit: x86 has acquire/release and seq_cst for
                | loads/stores (it technically also has relaxed, but it
                | is not useful to map it to C++11 relaxed). What x86
                | lacks is weaker ordering for RMW, but there are a lot
                | of useful lock-free algorithms that are implementable
                | just or mostly with loads and stores, and it can be a
                | significant win to use non-seq-cst stores for these
                | on x86.
        
               | haberman wrote:
               | Indeed there is different code generated by seq_cst for
               | stores. Though for loads it appears to be the same:
               | https://godbolt.org/z/WbvEcM83q
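                | 
                | For example, rustc on x86-64 emits the following (a
                | small illustration of the same point, not the exact
                | godbolt snippet):
                | 
                |     use std::sync::atomic::{AtomicU32, Ordering};
                | 
                |     pub fn st_rel(x: &AtomicU32) {
                |         x.store(1, Ordering::Release) // plain `mov`
                |     }
                |     pub fn st_sc(x: &AtomicU32) {
                |         x.store(1, Ordering::SeqCst) // locked `xchg`
                |     }
                |     pub fn ld(x: &AtomicU32) -> u32 {
                |         x.load(Ordering::SeqCst) // same `mov` as
                |                                  // an Acquire load
                |     }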
        
               | gpderetta wrote:
               | Yes, seqcst loads map to plain loads on x86.
        
               | foobiekr wrote:
                | x86 might, but devices connected to it in the
                | embedded world have had to be very, very aware of
                | this stuff since the 90s.
        
         | jcelerier wrote:
         | > It is much easier to do it at design time
         | 
         | Is it? I always worked the second way (starting from seq_cst
         | and then when the core design matured enough and didn't change
         | for a few months, trying to see what could actually be
          | relaxed). I'd be very afraid that in the first case, you
          | start with, say, relaxed semantics somewhere, then you
          | change the design because the requirements changed, and now
          | you have to go again through all the atomic operations to
          | make sure the assumptions all still hold.
        
           | jamesmunns wrote:
           | Back when this post was written, I would have agreed with
           | you. But Mara's book makes a good case for this:
           | 
            | https://marabos.nl/atomics/memory-ordering.html#common-
            | misco...
            | 
            |     More importantly, when reading code, SeqCst basically
            |     tells the reader: "this operation depends on the
            |     total order of every single SeqCst operation in the
            |     program," which is an incredibly far-reaching claim.
            |     The same code would likely be easier to review and
            |     verify if it used weaker memory ordering instead, if
            |     possible...
            | 
            |     It is advisable to see SeqCst as a warning sign.
        
           | gpderetta wrote:
           | If you change the design of a lock free algo you very likely
           | have to go through all the atomic operations to make sure
           | that all assumptions hold anyway.
        
         | ajross wrote:
          | Completely agree, though the more iconoclastic corollary
          | that goes unspoken there is that putting The Final Word on
          | memory ordering semantics into programming language
          | standards was a terrible mistake.
         | 
         | Memory ordering is a hardware behavior. It needs to be
         | specified at the hardware level, and hardware vendors have been
         | very mixed on clarity. And more importantly, lockless
         | algorithms (that rely on memory ordering control) are really,
         | really hard, and demand clarity over all things. And instead
         | we're crippling those poor programmers with nonsense like
         | "sequentially consistent"[1] or trying to figure out what on
         | earth "consume" means[2].
         | 
         | x86 does this pretty well with their comparatively simple model
         | of serializing instructions. Traditional ARM ISAs did only a
         | little worse by exposing the interface as read/write barrier
         | instructions. Everyone else... meh.
         | 
          | But if you really want to do this (and you probably don't),
          | do the analysis yourself at the level of
          | ISA/architecture/hardware, cite your references, and be
          | prepared to handle whatever portability work is needed on
          | your own.
         | 
         | [1] A statement about desired final state, not hardware
         | behavior!
         | 
         | [2] Nothing, on any hardware you will ever use. Don't ask.
        
           | gpderetta wrote:
           | On the contrary, fixing the memory model on a widely used
           | language like C++ forced hardware vendors to get their act
           | together and provide more rigorous memory model explanations.
           | For example intel went from Processor Ordering to TSO, and
           | arm started offering explicit acquire/release operations.
           | 
           | Java had the opportunity as well, but by initially only
           | providing a stronger, mostly sequentially consistent MO, the
           | hardware vendors managed to get away a little longer.
        
             | ajross wrote:
             | I don't think that's the case with Intel at all, the events
             | are off by a decade at least; do you have a blog or
             | something to cite there? I'd be curious to read it. And as
             | for ARM, "explicit acquire/release" is objectively less
             | informative and harder to reason about than the xxxSB
             | instructions were (and yes, I've used both). ARM went
              | _backwards_ to accommodate C++'s nonsense.
             | 
             | Again, the language standard writers aren't remotely the
             | experts here, the hardware designers are. That C++ invented
             | its own metaphors instead of listening to the experts is a
             | bug, not a feature.
        
               | gpderetta wrote:
                | The hardware designers were involved in the
                | standardization process. I don't have citations at
                | hand; I think most of the mailing lists where the
                | discussions re the C++ MO happened have been lost,
                | but (as a lurker trying to learn this stuff) I was
                | following the process closely.
               | 
                | The question was, given PO, whether it was at all
                | possible to recover sequential consistency on Intel,
                | either with mfence or a lock xchg, given the
                | possibility of IRIW. Intel then updated their MO to
                | exclude IRIW, de facto standardizing on TSO.
               | 
               | This was early 2000s. I think both ARM and IBM published
               | revisions to their architecture clarifying details around
               | the same time.
               | 
                | This spawned a set of academic papers that proved the
                | correctness of the agreed mapping of the C++ memory
                | model to those architectures.
        
               | ajross wrote:
                | > The hardware designers were involved in the
                | standardization process.
               | 
               | That sounds cyclic then. You're saying that Intel's SDM
               | was ambiguous[1] (which it was) and that it was sorted
               | out as part of a standardization process. I'm saying that
               | it doesn't really matter what the SDM said, it mattered
               | whether or not you could reliably write lockless code on
               | x86 using the hardware and docs available at the time,
               | and you could. And further, I'm saying that the standard
               | ended up making things worse by perpetuating arguments
               | like this about what some buggy English text in an SDM
               | said and _not about actual hardware behavior_.
               | 
               | [1] In ways that AFAICT didn't actually impact hardware.
               | I found this, which is probably one of the papers you're
               | citing. It's excellent work in standards-writing, but
               | it's also careful to note that the IRIW cases were never
               | observed on hardware.
               | https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-
               | paper.tpho...
        
               | gpderetta wrote:
               | It didn't impact hardware because intel hadn't taken
               | advantage yet of the additional latitude offered by their
               | original model. Then they closed the hole and they
               | guaranteed no IRIW[1]. But in the meantime if your
               | algorithm was susceptible to this reordering, there was
               | no written guarantee that an mfence would fix it. But
               | most importantly as the model was informal and not self
               | consistent, there was no possibility to write formal
               | proofs of correctness of an algorithm or run it against a
               | model checker.
               | 
               | [1] in practice this means no store-forwarding from
               | sibling hyper thread store buffers, something that for
               | example POWER allows and is observed in real hardware.
        
         | haberman wrote:
         | I've never actually seen a production lock-free algorithm that
         | uses SeqCst. I have a hard time even imagining an algorithm
         | where SeqCst is the right choice.
         | 
         | It seems like SeqCst was chosen as a default to reduce the
         | surprises and gotchas around lock-free programming. But lock-
         | free programming is inherently tricky; if you're trying to
         | avoid surprises and gotchas you should probably use a mutex.
        
         | derefr wrote:
         | > If you aren't going to bother understanding ordering
         | semantics, it is unlikely you can write a safe lock-free
         | algorithm anyway.
         | 
         | I think the implicit suggestion here is that the target
         | audience for this abstraction is actually two separate
         | developers:
         | 
         | 1. A developer who doesn't know much about locking semantics,
         | but can write the simple-but-slow case correctly.
         | 
         | 2. Another developer, much later on -- probably a senior SRE
         | type -- who can revisit this code and optimize it with full
         | understanding of the constraints.
         | 
         | (This might also be the same person, years later, with more
         | experience under their belt.)
         | 
         | The benefit of the way this library is designed, is that the
         | second developer doesn't have to completely rewrite everything
         | just to optimize it. The naive dev's code has already been
         | forced into an "almost but not quite" lock-free mold by the
         | design of the library's API surface.
        
       | magnat wrote:
       | > A typical serial port configuration is "115200 8N1", which
       | means 115,200 baud (or raw bits on the wire per second), with no
       | parity, and 1 stop bit. This means that for every data byte sent,
       | there will be 8 data bits, 1 unused parity bit, and 1 stop bit,
       | to signal the end of the byte, sent over the wire. This means
       | that we will need 40 bits on the wire to receive a 4 data bytes.
       | 
       | 8N1 means there is 1 start bit, 8 data bits and 1 stop bit (10
       | bits total), not 8 data bits, 1 unused parity bit and 1 stop bit
       | (also 10 bits total).
        
         | jamesmunns wrote:
         | Yep, good catch! That's a whoops in my explanation. I don't
         | work at FS any more, so I'm not sure I could PR that change,
         | but you're definitely right :)
        
       | bfrog wrote:
        | I have to say, after looking at various DMA hardware, I much
        | prefer the scatter-gather list type of DMA to the ring type.
        | 
        | The entire need for a bip-buffer allocation for DMA then goes
        | away. You can have a simple pool of fixed-size blocks to
        | throw at the DMA, pretty easily done with a free list.
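        | 
        | The descriptor chain for that style of DMA typically looks
        | something like this (field names are made up; real layouts
        | are device-specific):
        | 
        |     // One entry in a scatter-gather list: the DMA engine
        |     // walks `next` pointers, transferring `len` bytes from
        |     // each fixed-size block taken off a free list.
        |     #[repr(C)]
        |     struct Descriptor {
        |         addr: u32,  // physical address of the data block
        |         len: u16,   // bytes to transfer from this block
        |         flags: u16, // e.g. an "end of chain" bit
        |         next: u32,  // physical address of next descriptor
        |     }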
       | 
        | I do think the implementation here is cool though, and it's
        | nice to see some work in this area.
        
         | 95014_refugee wrote:
         | To make scatter/gather go fast, you either spend a lot of
         | effort caching descriptor lists for pinned buffers (because you
         | expect to see them often), or heavily optimising your VM's v2p
         | translation machinery, or some combination of the two.
         | 
          | And then you wind up discovering that you, the driver
          | writer, aren't actually trusted, and so you need to insert
          | at least one if not several IOMMUs between the peripheral
          | and the memory(ies) that they may access, managed by
          | another software component in a different address space.
         | 
         | Then someone asks you to make all of this work for clients in
         | VMs.
         | 
         | At which point you start wondering why you didn't just allocate
         | a physically contiguous buffer at startup and copy to/from your
         | client buffers using the CPU...
         | 
         | Sorry for sounding triggered... 8)
        
       | LAC-Tech wrote:
       | TIL about BipBuffers. I've been struggling with a similar data
       | structure, and to see it already has a name, and a better
       | implementation than what I've been doing, is very welcome.
        
       | dnedic wrote:
       | Bipartite buffers are amazing, and very rarely used
        | unfortunately. For those looking for C and C++
        | implementations, you can check out my libraries, lfbb and
        | lockfree: https://github.com/DNedic/lfbb,
        | https://github.com/DNedic/lockfree (although lockfree
        | contains more data structures as well)
        
       | nyanpasu64 wrote:
        | > In Andrea's implementation of the lock-free ring-buffer,
        | spsc-bip-buffer, some of the orderings are relaxed for
        | performance. This has the downside that it can introduce
        | subtle concurrency bugs that may only show up on some
        | platforms (ARM, for example): to be a bit more confident that
        | everything's still fine, Andrea has continuous integration
        | tests both on x86 and ARM.
       | 
        | It might be worth testing/forking the library to test on Loom
        | (https://github.com/tokio-rs/loom), which can model atomic
        | orderings and check for concurrency errors to some degree
        | (though I last used it years ago). TSAN might be able to
        | check for ordering errors in _visited_ execution traces
        | (though I haven't tried using it).
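        | 
        | A loom test is shaped roughly like this (a self-contained
        | toy, not spsc-bip-buffer's actual test suite):
        | 
        |     use loom::sync::atomic::{AtomicUsize, Ordering};
        |     use loom::sync::Arc;
        |     use loom::thread;
        | 
        |     #[test]
        |     fn explores_all_interleavings() {
        |         // loom re-runs the closure under every legal
        |         // interleaving (and ordering-allowed outcome),
        |         // failing the test on any panicking execution.
        |         loom::model(|| {
        |             let flag = Arc::new(AtomicUsize::new(0));
        |             let f2 = flag.clone();
        |             let t = thread::spawn(move || {
        |                 f2.store(1, Ordering::Release);
        |             });
        |             let seen = flag.load(Ordering::Acquire);
        |             assert!(seen == 0 || seen == 1);
        |             t.join().unwrap();
        |         });
        |     }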
        
       ___________________________________________________________________
       (page generated 2024-02-29 23:00 UTC)