[HN Gopher] A lock-free ring-buffer with contiguous reservations...
___________________________________________________________________
A lock-free ring-buffer with contiguous reservations (2019)
Author : simonpure
Score : 153 points
Date : 2024-02-29 14:57 UTC (8 hours ago)
(HTM) web link (ferrous-systems.com)
(TXT) w3m dump (ferrous-systems.com)
| stmblast wrote:
| Awesome article! Bookmarked.
| cozzyd wrote:
| Good thing the subtle concurrency bugs are on a platform rarely
| used for embedded development
| fritzo wrote:
| See also the Java LMAX Disruptor https://github.com/LMAX-
| Exchange/disruptor
|
| I've built a similar lock-free ring buffer in C++11
| https://github.com/posterior/loom/blob/master/doc/adapting.m...
| GenericCanadian wrote:
| I also wrote an LMAX Disruptor in Crystal:
| https://github.com/nolantait/disruptor.cr
|
| Here is one in Ruby: https://github.com/ileitch/disruptor
|
| Both languages are quite readable and I've used these to teach
| the concepts to beginners.
| rdtsc wrote:
| My favorite ring buffer structure is like the one described in
| https://www.gnuradio.org/blog/2017-01-05-buffers/
|
| > It asks the operating system to give it memory-mappable memory,
| and map twice, "back to back". Blocks then get called with
| pointers to the "earliest" position of a workload within this
| memory region - guaranteeing that they can access all memory they
| were offered in a linear fashion, as if they'd actually be
| dealing with hardware ring buffers.
|
| It imposes limitations on hardware and OS support in a way, but I
| think it's pretty neat.
|
| This is also used by the kernel BPF ring buffer:
|
| https://www.kernel.org/doc/html/latest/bpf/ringbuf.html
|
| > ...data area is mapped twice contiguously back-to-back in the
| virtual memory. This allows to not take any special measures for
| samples that have to wrap around at the end of the circular
| buffer data area, because the next page after the last data page
| would be first data page again, and thus the sample will still
| appear completely contiguous in virtual memory
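|
| For reference, the double-mapping setup on Linux looks roughly like
| this (a minimal sketch in Rust assuming the libc crate; error
| handling omitted, and `len` must be a multiple of the page size):
|
|         use std::ptr;
|
|         unsafe fn mirrored(len: usize) -> *mut u8 {
|             // Anonymous memory-mappable file backing both halves.
|             let fd = libc::memfd_create(b"ring\0".as_ptr().cast(), 0);
|             libc::ftruncate(fd, len as libc::off_t);
|
|             // Reserve 2*len of address space...
|             let base = libc::mmap(ptr::null_mut(), 2 * len,
|                 libc::PROT_NONE,
|                 libc::MAP_PRIVATE | libc::MAP_ANONYMOUS, -1, 0)
|                 as *mut u8;
|
|             // ...then map the same file into both halves, back to back.
|             let prot = libc::PROT_READ | libc::PROT_WRITE;
|             let flags = libc::MAP_SHARED | libc::MAP_FIXED;
|             libc::mmap(base.cast(), len, prot, flags, fd, 0);
|             libc::mmap(base.add(len).cast(), len, prot, flags, fd, 0);
|
|             // base[i] and base[i + len] now alias the same physical
|             // page, so data that "wraps" is still contiguous in
|             // virtual memory.
|             base
|         }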
| dist1ll wrote:
| Unfortunately that means the chunks are only contiguous in
| virtual memory. So it won't work with the DMA use case
| mentioned in the article, which requires contiguous physical
| addresses.
|
| But it's still a nice trick, I like it when people get creative
| with HW features.
| jnwatson wrote:
| Theoretically, the same trick could be used to double map
| addresses coming from an external master via an IOMMU.
| speed_spread wrote:
| It is unlikely that cheap microcontrollers where DMA is
| most helpful will have an IOMMU.
| gpderetta wrote:
| But the hardware only needs to see one copy of the duplicated
| memory and you can let it deal with the wraparound. The
| software can use the convenience of the double mapping.
| gpderetta wrote:
| AKA the Magic Ring Buffer. It's extremely convenient not having
| to deal with split messages, especially if you support
| variable-size payloads.
| scottlamb wrote:
| This is a cool trick, and iirc there are a few Rust crates that
| implement it, including slice-deque.
|
| ...but I think there are a few significant downsides, even in
| userspace:
|
| * The most obvious one: you have to write `unsafe`, non-
| portable code. The Linux, macOS, and Windows implementations
| are totally different from each other.
|
| * The setup and teardown of each ring is a bit expensive (a few
| system calls). For a regular allocation, the malloc
| implementation typically caches that for you. Here you have to
| do your own pooling if you might be frequently creating them.
|
| * Using whole pages, and two mappings per ring, is wasteful in
| terms of not only RAM (often no big deal) but also TLB space
| (which often turns into significant CPU usage). If you just
| allocate 32 64 KiB rings from the standard allocator, on x86-64
| you might be talking about a single 2 MiB huge page mapping. If
| you do this instead, you're talking about 1024 4 KiB page
| mappings.
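|
| (Rough arithmetic: 32 rings x 64 KiB = 2 MiB total, which fits in
| one huge-page mapping; mirrored, it's 32 rings x 2 mappings x 16
| pages = 1024 separate 4 KiB page-table entries competing for TLB
| space.)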
| duped wrote:
| Any real, production ready ring buffer should be using
| unsafe. I would consider anything that doesn't to be a toy.
|
| In Rust it's basically impossible to do this without
| MaybeUninit. You _could_ use Option, but then you're paying
| a massive cost for a very easy to write and audit chunk of
| unsafe code.
| jamesmunns wrote:
| Oh hey, one of the authors of this post here (James), happy to
| answer any questions.
|
| This post has been discussed here a couple times, but AMA :)
|
| edit, the most commented version of this post was the original:
|
| https://news.ycombinator.com/item?id=20096946.
|
| This is what I'm up to these days:
|
| https://onevariable.com/
| agentultra wrote:
| Has there been a formal specification/model that we can check?
| I love circular buffers but I'm curious how we know that the
| design is correct with respect to the stated properties.
|
| I did a bit of link diving to the various blog posts and sites
| but haven't been able to find it. Would be nice, if it exists,
| to have it front and centre.
| jamesmunns wrote:
| As far as I know, it hasn't been formally verified.
|
| Andrea Lattuada (https://andrea.lattuada.me/) is more likely
| to have done that kind of work on his implementation of the
| algorithm than I did on mine.
|
| I have run the testing with various dynamic analysis tools,
| but for sure this isn't formal verification. If someone is
| interested in doing it, happy to chat with them!
| anonymousDan wrote:
| Shouldn't be that hard to at least model check it in TLA+ I
| would have thought (albeit potentially more complex if
| trying to account for weak memory).
| Fiahil wrote:
| I used the same approach while designing a lock-free bounded
| broadcast log (as in a "RWLock<Vec<T>>"; an MPMC, append-only
| Vec). It's quite easy to do because it's bounded. However, I
| could not find a way to make it both unbounded and efficient.
|
| Any ideas ?
| jamesmunns wrote:
| I am not sure! Most of the data structures I design are for
| embedded systems without allocators. On the desktop, I mostly
| defer to others.
|
| I've used tokio's broadcast channel quite a bit before, but
| it is also bounded. After talking to Eliza from the tokio
| project, I'm fairly convinced that unbounded queues are a
| scary thing to have around, operationally :).
|
| But again - this is a bit out of my actual expertise!
| paholg wrote:
| I'm no expert here, but I wonder if a linked list of bounded
| logs would work well.
|
| So, everyone has a pointer to the current one, and there's an
| atomic pointer to the next.
|
| When the log fills up, you allocate a new one and set the
| next pointer to it using compare_and_swap. If someone beat
| you to it, you can walk the list and add yours to the end so
| that the allocation isn't wasted.
|
| This way, most of the time you're using the efficient bounded
| log.
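|
| A rough sketch of the shape I mean (illustrative only, names made
| up; using compare_exchange rather than the deprecated
| compare_and_swap):
|
|         use std::ptr;
|         use std::sync::atomic::{AtomicPtr, Ordering};
|
|         struct Segment<T> {
|             slots: Vec<T>,                // bounded log for this segment
|             next: AtomicPtr<Segment<T>>,  // null until a successor exists
|         }
|
|         impl<T> Segment<T> {
|             // Attach `new` as the successor; if someone beat us to
|             // it, walk forward and attach at the end so the
|             // allocation isn't wasted.
|             fn attach(&self, new: *mut Segment<T>) {
|                 let mut cur = self;
|                 loop {
|                     match cur.next.compare_exchange(
|                         ptr::null_mut(), new,
|                         Ordering::AcqRel, Ordering::Acquire,
|                     ) {
|                         Ok(_) => return,
|                         Err(existing) => cur = unsafe { &*existing },
|                     }
|                 }
|             }
|         }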
| ChrisMarshallNY wrote:
| That watermark is a simple, elegant idea.
|
| I haven't really had the need for that kind of thing in many
| years, but I like the idea.
| loeg wrote:
| Recently (2022) I designed a ring buffer for use as a 'flight
| recorder' type tracing system. (I.e., there is typically no
| reader; the writer needs to write over old records without
| blocking on any reader. If the reader is triggered, it flips a
| switch that disables the writer temporarily while the buffer
| contents are persisted.) In that design I subdivided the ring
| into several subbuffers (~8). Each subbuffer has its own
| equivalent of a watermark. That way, the valid portion of the
| ring always starts at the beginning of one of the subbuffers,
| and the writer could 'free' the next subbuffer worth of space
| trivially (without having to scan through old contents record
| by record). (Any write that did not fit in the current
| subbuffer was advanced to the start of the next one.)
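|
| For illustration, the single-threaded shape of it was roughly like
| this (a sketch from memory, with the concurrency and the
| reader-side switch omitted; assumes records are smaller than a
| subbuffer):
|
|         const SUB: usize = 8192; // subbuffer size (illustrative)
|
|         struct Ring {
|             subs: Vec<Vec<u8>>, // ~8 fixed-size subbuffers
|             marks: Vec<usize>,  // per-subbuffer watermark (valid bytes)
|             cur: usize,         // subbuffer currently being written
|             pos: usize,         // write offset within subs[cur]
|         }
|
|         impl Ring {
|             fn write(&mut self, rec: &[u8]) {
|                 if self.pos + rec.len() > SUB {
|                     // Doesn't fit: mark where valid data ends, then
|                     // "free" the next subbuffer wholesale by moving
|                     // on to it -- no record-by-record scan needed.
|                     self.marks[self.cur] = self.pos;
|                     self.cur = (self.cur + 1) % self.subs.len();
|                     self.pos = 0;
|                 }
|                 self.subs[self.cur][self.pos..self.pos + rec.len()]
|                     .copy_from_slice(rec);
|                 self.pos += rec.len();
|             }
|         }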
| monocasa wrote:
| > Contended writes from multiple threads on the same memory
| location are a lot harder for the CPU's cache coherence protocol
| to handle
|
| FWIW, those count as the same location as far as most cache
| coherency protocols are concerned, since cache coherency generally
| works at the cache-line level. You'd want to split the two contexts
| onto their own cache lines.
| stefanha wrote:
| Another cache optimization trick some ring-buffer
| implementations use is to keep a shadow copy of the read or
| write pointer to avoid frequently fetching the other context's
| cache line. The latest version of the read pointer is only
| needed when the writer catches up with their shadow copy and
| vice versa.
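|
| For example, the consumer side might look roughly like this (an
| illustrative sketch, not taken from any particular implementation):
|
|         use std::sync::atomic::{AtomicUsize, Ordering};
|
|         struct Consumer<'a> {
|             write: &'a AtomicUsize, // shared; the producer's cache line
|             read: &'a AtomicUsize,  // shared; updated by this consumer
|             shadow_write: usize,    // local copy of `write`
|         }
|
|         impl Consumer<'_> {
|             // How many bytes are readable, touching the producer's
|             // cache line only when we appear to have caught up with
|             // our shadow copy.
|             fn readable(&mut self) -> usize {
|                 let read = self.read.load(Ordering::Relaxed);
|                 if self.shadow_write == read {
|                     self.shadow_write =
|                         self.write.load(Ordering::Acquire);
|                 }
|                 self.shadow_write.wrapping_sub(read)
|             }
|         }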
| anonymousDan wrote:
| Yes this is absolutely crucial for performance.
| liquid153 wrote:
| Aren't lock-free buffers usually just as expensive to use as
| locks, or more so?
| zengid wrote:
| Not if your program needs to be realtime or near realtime safe.
| Locks are controlled by the OS typically, and can have non-
| deterministic latency.
| loeg wrote:
| Even ignoring mutexes and OS scheduling, plain spinlocks add
| contention that would not otherwise exist in a SPSC
| ringbuffer. https://news.ycombinator.com/item?id=39551575
| CyberDildonics wrote:
| No, where are you getting that information?
| loeg wrote:
| No -- in a lock-free design with a single producer and single
| consumer, the two sides can typically write to independent
| regions of memory. With a lock, both have to write to the same
| cache line to take and release the lock.
| ww520 wrote:
| Lock free has two advantages: the checking code can run in user
| mode and non-contested access is very cheap with just one
| instruction.
|
| To do it correctly, the lock needs to be handled in the kernel,
| so obtaining a lock requires calling into the kernel, which is
| more expensive.
|
| I think you meant the memory barrier for syncing cache is just
| as expensive as the lock version, which is true.
| scaredginger wrote:
| Obtaining an uncontested lock absolutely doesn't require
| calling into the kernel
| jcelerier wrote:
| How can you give hard guarantees of that on Windows, Mac, and
| Linux with the OS- and/or libc-provided locks?
| gpderetta wrote:
| If you really really really need such a guarantee, you
| implement your own.
|
| Otherwise you inspect the implementation, but in 2024 a
| fast-pathed OS lock is table stakes.
| piterrro wrote:
| This is a great article! Very detailed, and it explains things
| at a low level.
|
| For a more high-level implementation, I released a blog post
| yesterday about a ring buffer in Golang:
| https://logdy.dev/blog/post/ring-buffer-in-golang
| samsquire wrote:
| I tried to write a lock-free ring buffer with weak atomics. I
| haven't proved it right with TLA+ yet, but I've started writing a
| model for it. I use tagging to avoid the ABA problem.
|
| They're all on https://github.com/samsquire/assembly. I tried to
| write a disruptor with multiple consumers, then one with multiple
| producers, then one with multiple consumers AND multiple
| producers, inspired by the LMAX Disruptor. (There are files for
| each of them and a table in the repo. It's not proven yet!)
|
| The contention on the same memory address (the read/write index)
| is the thing that seems difficult to address.
|
| One thing I've learned about thread safety:
|
| I think if you have thread-owned values then you can be thread
| safe with a simple semaphore, provided that you have unique,
| DISTINCT values for each thread.
|
| If you have two threads that have this in a hot loop in parallel:
|
|         // thread 0
|         if buffer[x].available == 1:
|             // do stuff
|             buffer[x].available = 0
|
|         // thread 1
|         if buffer[x].available == 0:
|             // do stuff
|             buffer[x].available = 1
|
| Due to causality, no matter the interleaving, thread 0 owns
| buffer[x].available and the body of its if statement when the
| value is 1, and thread 1 owns buffer[x].available and the body of
| its if statement when the value is 0.
|
| The compare is a cheap mutex built on distinct-valued memory
| locations.
|
| Even though thread 1 is writing to buffer[x].available and thread
| 0 is writing to buffer[x].available, it doesn't matter, because
| the causality is mutually exclusive. There is no interleaving of
| buffer[x].available = x because of the if statement.
|
| The buffer[x].available = 0 will never run while
| buffer[x].available is equal to 0 overwriting or causing a data
| race when setting buffer[x].available to 1. So the second line
| cannot happen in parallel.
|
| I need to write a TLA model to assert its safety.
|
| If you have more than 2 threads, then you need different tokens
| to provide admissibility to the if statement.
|
| Remember to use a compiler memory barrier:
|
|         asm volatile ("" ::: "memory");
|
| so you don't need volatile struct values.
| loeg wrote:
| > The buffer[x].available = 0 will never run while
| buffer[x].available is equal to 0 overwriting or causing a data
| race when setting buffer[x].available to 1.
|
| In particular, because loads and stores of the same variable
| cannot be reordered out of program order. Once your algorithm
| involves other variables, you would (likely) need to be a
| little careful about loading/storing with acquire/release
| semantics to prevent reordering other accesses relative to this
| protocol.
|
| > Remember to use compiler memory barrier
|
| I would highly recommend using the language atomic types (and
| barriers if truly needed) instead of gcc inline assembly
| syntax.
| samsquire wrote:
| Thanks for your reply. This subject is still new to me.
|
| My understanding of that syntax is that it is a compiler
| memory barrier, not a CPU memory barrier because the asm
| block is empty (no sfence or mfence).
| loeg wrote:
| Hey, no problem.
|
| > My understanding of that syntax is that it is a compiler
| memory barrier, not a CPU memory barrier because the asm
| block is empty (no sfence or mfence).
|
| In C11, you can write compiler-only fences with
| atomic_signal_fence:
|
| https://en.cppreference.com/w/c/atomic/atomic_signal_fence
|
| (In practice, though, I think it is rare that you actually
| want a compiler-only fence. Instead, correct use of
| acquire/release operations prevents reorderings.)
| samsquire wrote:
| Thank you loeg, I appreciate you and the information you
| brought; TIL.
|
| I've been using a compiler fence to force reloads from
| memory to prevent -O3 from optimising away my
| variables/structs changing by other threads and keeping
| data in registers rather than reloading from memory each
| time. I saw volatile recommended against by the Linux kernel
| programmers.
|
| such as my thread->running == 1 in my event loops for my
| threads.
|
| https://www.kernel.org/doc/html/latest/process/volatile-
| cons...
| loeg wrote:
| > I've been using a compiler fence to force reloads from
| memory to prevent -O3 from optimising away my
| variables/structs changing by other threads
|
| I would highly recommend using the language standard
| atomic primitives instead.
| samsquire wrote:
| When you check available, you might have to do it as
| __atomic_load_n(&sender->realend, __ATOMIC_SEQ_CST), and do an
| __atomic_store_n when setting available, rather than just a
| plain load and store.
| gpderetta wrote:
| The problem here is:
|
|         [1] // do stuff
|         [2] buffer[x].available = y
|
| There is neither an explicit nor an implicit ordering between 1
| and 2, so the compiler or CPU can reorder them. You need a
| release barrier between the two.
|
| Also while most CPUs preserve the control dependency, not all
| do (famously Alpha), and certainly not compilers. You would
| need a consume barrier, except that c++11 consume is only for
| data dependencies and unimplemented anyway.
|
| Edit: with the correct barriers in place, you can prove
| correctness by similitude to two size 1 SPSC queues used to
| exchange a mutual exclusion token, with the added quirk that as
| the queues are never used at the same time, they can actually
| be physically colocated in memory.
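|
| In Rust-atomics terms, the pairing looks roughly like this (an
| illustrative sketch of thread 0's side only; names made up):
|
|         use std::sync::atomic::{AtomicU8, Ordering};
|
|         fn thread0(available: &AtomicU8, do_stuff: impl FnOnce()) {
|             // Acquire: if we observe 1, everything the other thread
|             // did before its release-store of 1 is visible to us.
|             if available.load(Ordering::Acquire) == 1 {
|                 do_stuff();
|                 // Release: "do stuff" cannot be reordered after this
|                 // store, so the other thread sees the work once it
|                 // observes 0.
|                 available.store(0, Ordering::Release);
|             }
|         }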
| samsquire wrote:
| Thank you for sharing your knowledge, gpderetta; appreciated,
| TIL.
| anonymousDan wrote:
| Regarding the contention, one thing that's important is to
| cache a local copy of the shared head and tail variables every
| time you access them. Then for subsequent operations you can
| first check the local cached copy to see if you can perform the
| read or write without needing to check the shared variables.
| loeg wrote:
| > The safe thing to do here is to always choose Ordering::SeqCst,
| "sequential consistency", which provides the strongest
| guarantees. ... This is often good enough in practice, and
| switching to a weaker Ordering is only necessary to squeeze out
| the last bit of performance.
|
| If you're going to write lock-free algorithms using atomics, the
| least you can do is learn about ordering semantics and use the
| correct abstract semantics for your design's actual needs. It is
| much easier to do it at design time than to try to figure out if
| it is safe to relax SeqCst later. (This is one of the major flaws
| of C++ std::atomic's default SeqCst semantics.) If you aren't
| going to bother understanding ordering semantics, it is unlikely
| you can write a safe lock-free algorithm anyway. (It's really
| hard to do!)
| jamesmunns wrote:
| Back then, there weren't as good references for explaining
| atomic ordering, and the blog post had gotten long enough.
| Mentioning SeqCst was a bit of a cop-out, though neither Andrea
| nor I ended up using SeqCst past the initial impl anyway.
|
| Today I would have just linked to https://marabos.nl/atomics/,
| Mara does a much better job of explaining atomics than I could
| have then or now.
| hinkley wrote:
| Back then? Do you mean 2019? Or a different "then"? Because
| there was plenty of material in CS about this subject even in
| 2010. Java was wrestling with this twenty years ago, and
| databases long before that.
| jamesmunns wrote:
| I meant 2019, and there weren't any materials that I would
| consider as clear and well defined as Mara's linked docs
| explaining the different orderings used by C, C++, and Rust
| (Relaxed, Release, Acquire, AcqRel, and SeqCst).
|
| I'm very sure there were discussions and teaching materials
| then, but none (that I was aware of) focused on Rust, or were
| something I'd link to someone who had never heard of atomic
| ordering before.
| anonymous-panda wrote:
| I think the hard part of it is that x86 only has one atomic
| ordering and none of the other modes do anything. As such,
| it's really hard to build intuition about it unless you
| spend a lot of time writing such code on ARM, which wasn't
| that common in the industry, and today most people use
| higher-level abstractions.
|
| By databases, do you mean those running on DEC Alphas?
| Cause that was a niche system that few would have had
| experience with. If you meant to compare in terms of
| consistency semantically, sure, but there are meaningful
| differences between the consistency semantics of concurrent
| database transactions and atomic ordering in a multithreaded
| context.
|
| Java's memory model "wrestling" was about defining it
| formally in an era of multithreading and it's largely
| sequentially consistent - no weakly consistent ordering
| allowed.
|
| The c++ memory model was definitely the first large scale
| adoption of weaker consistency models I'm aware of and was
| done so that ARM CPUs could be properly optimized for since
| this was c++11 when mobile CPUs were very much front of
| mind. Weak consistency remains really difficult to reason
| about and even harder to play around with if you primarily
| work with x86, and there's very little validation tooling
| around that can help you get confidence about whether
| your code is correct. Of course, you can follow common
| "patterns" (eg loads are always acquire and stores are
| release), but fully grokking correctness and being able to
| play with the model in interesting ways is no small task no
| matter how many learning resources are out there.
| gpderetta wrote:
| Nit: x86 has acquire/release and seq_cst for load/stores
| (it technically also has relaxed, but it is not useful to
| map it to c++11 relaxed). What x86 lacks is weaker
| ordering for RMW, but there are a lot of useful lock-free
| algorithms that are implementable entirely or mostly with
| loads and stores, and it can be a significant win to use
| non-seq-cst stores for these on x86.
| haberman wrote:
| Indeed, seq_cst generates different code for stores, though
| for loads it appears to be the same:
| https://godbolt.org/z/WbvEcM83q
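|
| For example (illustrative; on x86-64, LLVM emits a plain mov for
| the Release store but an xchg for the SeqCst one, while both
| loads compile to a plain mov):
|
|         use std::sync::atomic::{AtomicU32, Ordering};
|
|         pub fn store_release(x: &AtomicU32) {
|             x.store(1, Ordering::Release) // x86-64: plain mov
|         }
|         pub fn store_seqcst(x: &AtomicU32) {
|             x.store(1, Ordering::SeqCst)  // x86-64: xchg (or mov + mfence)
|         }
|         pub fn load_seqcst(x: &AtomicU32) -> u32 {
|             x.load(Ordering::SeqCst)      // x86-64: plain mov, same as Acquire
|         }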
| gpderetta wrote:
| Yes, seqcst loads map to plain loads on x86.
| foobiekr wrote:
| x86 might, but devices connected to it in the embedded world
| have had to be very, very aware of this stuff since the 90s.
| jcelerier wrote:
| > It is much easier to do it at design time
|
| Is it? I always worked the second way (starting from seq_cst
| and then when the core design matured enough and didn't change
| for a few months, trying to see what could actually be
| relaxed). I'd be very afraid that in the first case, you start
| with, say, relaxed semantics somewhere, then you change the
| design because the requirements changed, and now you have to go
| again through all the atomic operations to make sure the
| assumptions all still hold.
| jamesmunns wrote:
| Back when this post was written, I would have agreed with
| you. But Mara's book makes a good case for this:
|
| https://marabos.nl/atomics/memory-ordering.html#common-misco...
|
| > More importantly, when reading code, SeqCst basically tells the
| > reader: "this operation depends on the total order of every
| > single SeqCst operation in the program," which is an incredibly
| > far-reaching claim. The same code would likely be easier to
| > review and verify if it used weaker memory ordering instead, if
| > possible... It is advisable to see SeqCst as a warning sign.
| gpderetta wrote:
| If you change the design of a lock free algo you very likely
| have to go through all the atomic operations to make sure
| that all assumptions hold anyway.
| ajross wrote:
| Completely agree, though the more iconoclastic corollary that
| goes unspoken there is that putting The Final Word on memory
| ordering semantics into programming language standards was a
| terrible mistake.
|
| Memory ordering is a hardware behavior. It needs to be
| specified at the hardware level, and hardware vendors have been
| very mixed on clarity. And more importantly, lockless
| algorithms (that rely on memory ordering control) are really,
| really hard, and demand clarity over all things. And instead
| we're crippling those poor programmers with nonsense like
| "sequentially consistent"[1] or trying to figure out what on
| earth "consume" means[2].
|
| x86 does this pretty well with their comparatively simple model
| of serializing instructions. Traditional ARM ISAs did only a
| little worse by exposing the interface as read/write barrier
| instructions. Everyone else... meh.
|
| But if you really want to do this (and you probably don't) do
| the analysis yourself at the level of
| ISA/architecture/hardware, cite your references, and be
| prepared to handle whatever portability work is needed on your
| own.
|
| [1] A statement about desired final state, not hardware
| behavior!
|
| [2] Nothing, on any hardware you will ever use. Don't ask.
| gpderetta wrote:
| On the contrary, fixing the memory model on a widely used
| language like C++ forced hardware vendors to get their act
| together and provide more rigorous memory model explanations.
| For example intel went from Processor Ordering to TSO, and
| arm started offering explicit acquire/release operations.
|
| Java had the opportunity as well, but because it initially only
| provided a stronger, mostly sequentially consistent MO, the
| hardware vendors managed to get away with it a little longer.
| ajross wrote:
| I don't think that's the case with Intel at all, the events
| are off by a decade at least; do you have a blog or
| something to cite there? I'd be curious to read it. And as
| for ARM, "explicit acquire/release" is objectively less
| informative and harder to reason about than the xxxSB
| instructions were (and yes, I've used both). ARM went
| _backwards_ to accommodate C++ 's nonsense.
|
| Again, the language standard writers aren't remotely the
| experts here, the hardware designers are. That C++ invented
| its own metaphors instead of listening to the experts is a
| bug, not a feature.
| gpderetta wrote:
| The hardware designers were involved on the
| standardization process. I don't have citations at hand,
| I think most of the mailing lists where the discussion re
| the C++ MO happened have been lost, but (as a lurker
| trying to learn this stuff) I was following the process
| closely.
|
| The question was, given PO, whether it was at all
| possible to recover sequential consistency on Intel
| either with mfence or a lock xchg, given the possibility
| of IRIW. Intel then updated their MO to exclude IRIW, de
| facto standardizing on TSO.
|
| This was early 2000s. I think both ARM and IBM published
| revisions to their architecture clarifying details around
| the same time.
|
| This spawned a set of academic papers that proved the
| correctness of the agreed mapping of the C++ memory model
| to those architectures.
| ajross wrote:
| > The hardware designers were involved on the
| standardization process.
|
| That sounds cyclic then. You're saying that Intel's SDM
| was ambiguous[1] (which it was) and that it was sorted
| out as part of a standardization process. I'm saying that
| it doesn't really matter what the SDM said, it mattered
| whether or not you could reliably write lockless code on
| x86 using the hardware and docs available at the time,
| and you could. And further, I'm saying that the standard
| ended up making things worse by perpetuating arguments
| like this about what some buggy English text in an SDM
| said and _not about actual hardware behavior_.
|
| [1] In ways that AFAICT didn't actually impact hardware.
| I found this, which is probably one of the papers you're
| citing. It's excellent work in standards-writing, but
| it's also careful to note that the IRIW cases were never
| observed on hardware.
| https://www.cl.cam.ac.uk/~pes20/weakmemory/x86tso-
| paper.tpho...
| gpderetta wrote:
| It didn't impact hardware because intel hadn't taken
| advantage yet of the additional latitude offered by their
| original model. Then they closed the hole and they
| guaranteed no IRIW[1]. But in the meantime if your
| algorithm was susceptible to this reordering, there was
| no written guarantee that an mfence would fix it. But
| most importantly as the model was informal and not self
| consistent, there was no possibility to write formal
| proofs of correctness of an algorithm or run it against a
| model checker.
|
| [1] in practice this means no store-forwarding from
| sibling hyper thread store buffers, something that for
| example POWER allows and is observed in real hardware.
| haberman wrote:
| I've never actually seen a production lock-free algorithm that
| uses SeqCst. I have a hard time even imagining an algorithm
| where SeqCst is the right choice.
|
| It seems like SeqCst was chosen as a default to reduce the
| surprises and gotchas around lock-free programming. But lock-
| free programming is inherently tricky; if you're trying to
| avoid surprises and gotchas you should probably use a mutex.
| derefr wrote:
| > If you aren't going to bother understanding ordering
| semantics, it is unlikely you can write a safe lock-free
| algorithm anyway.
|
| I think the implicit suggestion here is that the target
| audience for this abstraction is actually two separate
| developers:
|
| 1. A developer who doesn't know much about locking semantics,
| but can write the simple-but-slow case correctly.
|
| 2. Another developer, much later on -- probably a senior SRE
| type -- who can revisit this code and optimize it with full
| understanding of the constraints.
|
| (This might also be the same person, years later, with more
| experience under their belt.)
|
| The benefit of the way this library is designed is that the
| second developer doesn't have to completely rewrite everything
| just to optimize it. The naive dev's code has already been
| forced into an "almost but not quite" lock-free mold by the
| design of the library's API surface.
| magnat wrote:
| > A typical serial port configuration is "115200 8N1", which
| means 115,200 baud (or raw bits on the wire per second), with no
| parity, and 1 stop bit. This means that for every data byte sent,
| there will be 8 data bits, 1 unused parity bit, and 1 stop bit,
| to signal the end of the byte, sent over the wire. This means
| that we will need 40 bits on the wire to receive a 4 data bytes.
|
| 8N1 means there is 1 start bit, 8 data bits and 1 stop bit (10
| bits total), not 8 data bits, 1 unused parity bit and 1 stop bit
| (also 10 bits total).
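|
| (Either way it works out to 10 bits on the wire per data byte:
| 4 bytes x 10 bits = 40 bits, and at 115,200 bits per second
| that's at most 115,200 / 10 = 11,520 data bytes per second.)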
| jamesmunns wrote:
| Yep, good catch! That's a whoops in my explanation. I don't
| work at FS any more, so I'm not sure I could PR that change,
| but you're definitely right :)
| bfrog wrote:
| I have to say, after looking at various DMA hardware, I much, much
| prefer the scatter-gather list type of DMA to the ring type.
|
| The entire need for a bipbuffer allocation for DMA then goes away.
| You can have a simple pool of fixed-size blocks to throw at the
| DMA; it's pretty easily done with a free list (see the sketch
| below).
|
| I do think the implementation here is cool though, and it's nice
| to see some work in this area.
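|
| The pool I mean, very roughly (an illustrative sketch with
| made-up sizes and no DMA specifics):
|
|         const BLOCK: usize = 512;  // fixed block size (illustrative)
|
|         struct Pool {
|             blocks: Vec<[u8; BLOCK]>,
|             next: Vec<Option<usize>>, // next free block, per block
|             free_head: Option<usize>, // head of the free list
|         }
|
|         impl Pool {
|             // Pop a block off the free list to hand to the DMA engine.
|             fn alloc(&mut self) -> Option<usize> {
|                 let idx = self.free_head?;
|                 self.free_head = self.next[idx];
|                 Some(idx)
|             }
|             // Push a completed block back onto the free list.
|             fn free(&mut self, idx: usize) {
|                 self.next[idx] = self.free_head;
|                 self.free_head = Some(idx);
|             }
|         }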
| 95014_refugee wrote:
| To make scatter/gather go fast, you either spend a lot of
| effort caching descriptor lists for pinned buffers (because you
| expect to see them often), or heavily optimising your VM's v2p
| translation machinery, or some combination of the two.
|
| And then you wind up discovering that you, the driver writer,
| aren't actually trusted, and so you need to insert at least one
| if not several IOMMUs between the peripheral and the
| memory(ies) that they may access, managed by another software
| component in a different address space.
|
| Then someone asks you to make all of this work for clients in
| VMs.
|
| At which point you start wondering why you didn't just allocate
| a physically contiguous buffer at startup and copy to/from your
| client buffers using the CPU...
|
| Sorry for sounding triggered... 8)
| LAC-Tech wrote:
| TIL about BipBuffers. I've been struggling with a similar data
| structure, and to see it already has a name, and a better
| implementation than what I've been doing, is very welcome.
| dnedic wrote:
| Bipartite buffers are amazing, and very rarely used
| unfortunately. For those looking for C and C++ implementations,
| you can check out my libraries, lfbb and lockfree:
| https://github.com/DNedic/lfbb,
| https://github.com/DNedic/lockfree (although lockfree contains
| more data structures as well)
| nyanpasu64 wrote:
| >In Andrea's implementation of the lock-free ring-buffer, spsc-
| bip-buffer, some of the orderings are relaxed for performance.
| This has the downside that it can introduce subtle concurrency
| bugs that may only show up on some platform (ARM, for example):
| to be a bit more confident that everything's still fine, Andrea's
| has continuous integration tests both on x86 and ARM.
|
| It might be worth testing/forking the library to test on Loom
| (https://github.com/tokio-rs/loom), which can model atomic
| orderings and check for concurrency errors to some degree (though
| I last used it years ago). TSAN might be able to check for
| ordering errors in _visited_ execution traces (though I haven't
| tried using it in the past).
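|
| A loom test is roughly shaped like this (an illustrative sketch;
| loom swaps in its own atomics and threads and exhaustively
| explores the allowed interleavings):
|
|         #[test]
|         fn publish_is_seen() {
|             loom::model(|| {
|                 use loom::sync::atomic::{AtomicUsize, Ordering};
|                 use loom::sync::Arc;
|
|                 let flag = Arc::new(AtomicUsize::new(0));
|                 let f = flag.clone();
|                 let t = loom::thread::spawn(move || {
|                     f.store(1, Ordering::Release);
|                 });
|                 // loom checks every permitted ordering of this load
|                 // relative to the store above.
|                 let _seen = flag.load(Ordering::Acquire);
|                 t.join().unwrap();
|             });
|         }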
___________________________________________________________________
(page generated 2024-02-29 23:00 UTC)