[HN Gopher] A viable solution for Python concurrency
___________________________________________________________________
A viable solution for Python concurrency
Author : zorgmonkey
Score : 392 points
Date : 2021-10-15 17:54 UTC (5 hours ago)
(HTM) web link (lwn.net)
(TXT) w3m dump (lwn.net)
| ferdowsi wrote:
| If this effort succeeds (and I hope it does), Python
| developers will still need to contend with the event-loop
| albatross of asyncio and all of its weird complexity.
|
| In an alternate Python timeline, asyncio was not introduced into
| the Python standard library, and instead we got a natively
| supported, robust, easy-to-use concurrency paradigm built around
| green/virtual threading that accommodates both IO and CPU bound
| work.
| BiteCode_dev wrote:
| asyncio is not a competitor to threads; it's complementary.
|
| In fact, it's a perfectly viable strategy in Python to have
| several processes, each having several threads, each having an
| event loop.
|
| And it will still be so once this comes out. You will
| certainly use threads more, and processes less, but replacing
| 1,000,000 coroutines with 1,000,000 system threads is not
| necessarily the right strategy for your task. See nginx vs
| Apache.
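|
| A minimal sketch of that shape, using only the stdlib (the
| worker coroutine and the thread count are just placeholders);
| wrapping the same thing in multiprocessing gives the
| process x thread x loop layout described above:
|       import asyncio
|       import threading
|
|       async def worker(name):
|           # stand-in for real IO-bound work
|           await asyncio.sleep(0.1)
|           print(f"{name} done")
|
|       def run_loop_in_thread(name):
|           # asyncio.run() (3.7+) gives each thread its own event loop
|           asyncio.run(worker(name))
|
|       threads = [threading.Thread(target=run_loop_in_thread,
|                                    args=(f"thread-{i}",))
|                  for i in range(4)]
|       for t in threads:
|           t.start()
|       for t in threads:
|           t.join()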
| dralley wrote:
| Multiple threads with one asyncio loop per thread would be
| absolutely pointless in Python, because of the GIL.
|
| With that said, sure, threads and asyncio are complementary
| in the sense that you can run tasks on threadpool executors
| and treat them as if they were coroutines on an event loop.
| But that serves no purpose unless you're trying to do
| blocking IO without blocking your whole process.
| bonzini wrote:
| In Python it would be pointless, but for example it's how
| Seastar/ScyllaDB work: each thread is bound to a CPU on the
| host and has its own reactor (event loop) with coroutines
| on it. QEMU has a similar design.
| yellowapple wrote:
| It's also (to my knowledge) how Erlang's VMs (e.g. BEAM)
| work: one thread per CPU core, and a VM on each thread
| preemptively switching between processes.
| BiteCode_dev wrote:
| It would not be pointless at all, because while one thread
| is busy on the CPU, context switching will let another one
| deal with IO. This can let you smooth out the progress of
| each part of your program, and can be useful for workloads
| where you don't want anything to block for too long.
| heavyset_go wrote:
| I read it as each process having multiple threads _and_ an
| event loop. If the threads are performing I/O or calling
| out to compiled code and releasing the GIL, said GIL won't
| block the event loop.
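|
| Roughly, and assuming the blocking_io/heartbeat functions are
| just placeholders, that pattern looks like this with
| asyncio.to_thread (3.9+):
|       import asyncio
|       import time
|
|       def blocking_io():
|           # stand-in for blocking I/O or GIL-releasing compiled code
|           time.sleep(1)
|           return "done"
|
|       async def heartbeat():
|           for _ in range(3):
|               print("still alive")
|               await asyncio.sleep(0.5)
|
|       async def main():
|           # the blocking call runs on a worker thread; the loop keeps
|           # servicing heartbeat() in the meantime
|           result, _ = await asyncio.gather(
|               asyncio.to_thread(blocking_io), heartbeat())
|           print(result)
|
|       asyncio.run(main())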
| azinman2 wrote:
| This entire article is about removing the GIL
| Zarathust wrote:
| "Viable" as in "you have no other choice sometimes". This
| forces you to deal with 3 libraries each with their own
| quirks, pitfalls and incompatibilities. Sometimes you even
| deal with dependencies reimplementing some parts in a 4th or
| 5th library to deal with shortcomings.
|
| I really don't care that much which of them survive, I just
| want to rely on less of them
| BiteCode_dev wrote:
| No, it's just useful. They are techs with different trade
| off, and life is full of opportunities.
| throwaway81523 wrote:
| Python Zen = one obvious way to do it. Having a bunch of
| very different ones, each with serious disadvantages, is
| a bad look.
| heavyset_go wrote:
| Zen of Python is an ideal, and at this point, kind of
| tongue-in-cheek.
|
| This is the same language that shipped with at least 3
| different methods to apply functions across iterables
| when the Zen of Python was adopted as a PEP in 2004.
| throwaway81523 wrote:
| There is at least some recognition in those cases that
| they introduced the new thing because they got it wrong
| in the old thing. That's different than saying they
| should co-exist on equal terms.
| BiteCode_dev wrote:
| It's a technical thread, not a political one. If you were
| so sure of your argument, you wouldn't use a throwaway.
|
| Besides, it's weird, like saying we should not have int,
| float and complex, there should be one way to do it.
|
| Just because those are 3 kinds of numbers doesn't mean they
| don't each have their own specific benefit.
| throwaway81523 wrote:
| int, float, and complex are for different purposes. async
| and threads paper over each others' weaknesses, instead
| of fixing the weaknesses at the start. Async itself is an
| antipattern (technical opinion, so there) but Python uses
| it because of the hazards and high costs of threads.
| Chuck Moore figured out 50 years ago to keep the async
| stuff out of the programmer's way, when he put
| multitasking into Polyforth, which ran on tiny machines.
| Python (and Node) still make the programmer deal with it.
|
| If you look at Haskell, Erlang/Elixir, and Go, they all
| let you write performant sequential code by pushing the
| async into the runtime where the programmer doesn't have
| to see it. Python had an opportunity to do the same, but
| stayed with async and coroutines. What a pain.
| throwaway81523 wrote:
| Yes, I have a big sense of tragedy about Python 3. Python
| should run on something like (or maybe the actual) Erlang BEAM
| with lightweight isolated processes. All my threaded Python
| code is written using that style anyway (threads communicating
| through synchronized queues) and I've almost never needed
| traditional shared mutable objects. Maybe completely never, but
| I'm not sure about a certain program any more.
|
| Added: I don't understand the downvotes. If Python 3 was going
| to make an incompatible departure from Python 2, they might as
| well have done stuff like the above, that brought real
| benefits. Instead they had 10+ years of pain over relatively
| minor changes that arguably weren't all improvements.
| BeetleB wrote:
| You are likely being downvoted because most claims about the
| pain of a Python 3 transition are inflated/hyperbole.
|
| It took less than a day to migrate all my code to Python 3.
| And by "less than a day" I mean "less than 2 hours". Granted,
| bigger projects would take longer, but saying stuff like "10+
| years of pain" is ridiculous. Probably less than 1% of
| projects had serious issues with the migration. We just hear
| of a few popular ones that had some pain and assume that was
| representative.
| KaiserPro wrote:
| > easy-to-use concurrency paradigm
|
| Well it has queues and threads already.
|
| It's just that asyncio for socket handling at least (in the
| testing that I did) is about 5% faster. (one asyncio socket
| "server" vs ten threads [with a number of ways to monitor for
| new connections])
|
| I always assumed that people wanted asyncio because they looked
| at JavaScript and thought "hey, I want GOTOs cosplaying as a fun
| paradigm"
| BiteCode_dev wrote:
| GOTO cosplaying should go away with structured concurrency
| (via TaskGroup) being adopted in 3.11, as pioneered by Trio.
|
| Check out anyio if you want to use them now.
| nine_k wrote:
| BTW I wonder why async is so painless in ES6 compared to
| Python. Why did the presence of the GIL (which JS also has,
| in effect) not make running async coroutines completely
| transparent, the way it made running generators (which are,
| well, coroutines already) transparent? Why is the whole event
| loop thing even visible at all?
| laurencerowe wrote:
| Because JavaScript never had threads so I/O in JavaScript has
| always been non-blocking and the whole ecosystem surrounding
| it has grown up under that assumption.
|
| JavaScript doesn't need a GIL because it doesn't really have
| threads. WebWorkers are more akin to multiprocessing than
| threads in Python. Objects cannot be shared directly across
| WebWorkers so transferring data comes with the expense of
| serializing/deserializing at the boundary.
| catlifeonmars wrote:
| JS now has shared array buffers.
| laurencerowe wrote:
| SharedArrayBuffer is just raw memory similar to using
| mmap from Python multiprocessing. The developer
| experience is very different to simply sharing objects
| across threads.
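|
| Python's nearest stdlib analogue is probably
| multiprocessing.shared_memory (3.8+), which likewise exposes
| only raw bytes, never objects; a rough sketch:
|       from multiprocessing import shared_memory
|
|       # raw bytes only, much like a SharedArrayBuffer
|       shm = shared_memory.SharedMemory(create=True, size=16)
|       shm.buf[:5] = b"hello"
|
|       # another process would attach by name and see the same bytes
|       other = shared_memory.SharedMemory(name=shm.name)
|       print(bytes(other.buf[:5]))  # b'hello'
|
|       other.close()
|       shm.close()
|       shm.unlink()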
| BiteCode_dev wrote:
| I used them both extensively, and here are the main reasons I
| can think of:
|
| - The event loop in JS is invisible and implicit. V8 proved
| it can be done without paying a cost for it, and in fact most
| real-life Python projects are using uvloop because it's
| faster than asyncio's default loop. JS devs don't think of the
| loop at all, because it's always been there. They don't have
| to choose a loop, or think about its lifecycle or
| scheduling. The API doesn't show the loop at all.
|
| - Asynchronous functions in JS are scheduled automatically.
| In Python, calling a coroutine function does... nothing. You
| have to either await it, or pass it to something like
| asyncio.create_task(). The latter is not only verbose, it's
| not intuitive.
|
| - Async JS functions can be called from sync functions
| transparently. They just return a Promise after all, and you
| can use good old callbacks. Instantiating a Python coroutine
| does... nothing, as we said. You need to schedule it AND await
| it. If you don't, it may or may not be executed. Which is why
| asyncio.gather() and co are to be used in Python. Most people
| don't know that, and even if you do, it's verbose, and you
| can forget. All that, again, because using the event loop
| must be explicit. That's one thing TaskGroup from trio will
| help with in the next Python versions...
|
| - the early asyncio API sucked. The new one is ok,
| asyncio.run() and create_task() with implicit loop is a huge
| improvement. But you better use 3.7 at least. And you have to
| think about all the options for awaiting:
| https://stackoverflow.com/questions/42231161/asyncio-
| gather-...
|
| - asyncio tutorials and docs are not great, people have no
| idea how to use it. Since it's more complex, it compounds.
|
| E.g., if you use await:
|
| With node v14.8+:
|       await async_func(params)
|
| With python 3.7+:
|       import asyncio
|
|       async def main():
|           # no top level await, it must happen in a loop
|           await async_func(params)
|
|       asyncio.run(main())  # explicit loop, but an easy one thanks to 3.7
|
| E.g., deep inside function calls, but no await:
|
| With node:
|       ...
|       async_func(params)
|
| With python 3.7+:
|       ...
|       # async_func(params) alone would do nothing
|       res = asyncio.create_task(async_func(params))
|       ...
|       # you MAY get away with not using gather() or wait(),
|       # but you also may get "coroutine is never awaited":
|       # RuntimeWarning: coroutine 'async_func' was never awaited
|       asyncio.gather(res)
|
| Of course, you could use "run_until_complete()", but then you
| would be blocking. Which is just not possible in JS, there is
| one way to do it, and it's always non-blocking and easy.
| Ironic, isn't it? Besides, which Python dev knows all this?
| I'm guessing most readers of this post are hearing about it
| for the first time.
|
| Python is my favorite language, and I can live with the
| explicit loop, but explicit scheduling is ridiculous. Just
| run the damn coroutine, I'm not instantiating it for the
| beauty of it. If I want a lazy construct, I can always make a
| factory.
|
| Now, thanks to the trio nursery concept, we will get
| TaskGroup in the next release (also you can already use them
| with anyio):
|       async with asyncio.TaskGroup() as tg:
|           tg.start_soon(async_func, params)
|
| Which, while still verbose, is way better:
|
| - no gather or wait. Schedule it, it will run or be cleaned
| up.
|
| - no need to choose an awaiting strategy, or learn about 1000
| things. This works for every case. Wanna use it in a sync
| call? Pass the tg reference into it.
|
| - lifecycle is cleanly scoped, a real problem with a lot of
| async code (including in JS, where it doesn't have a clean
| solution)
| heavyset_go wrote:
| > _Why the whole event loop thing is even visible at all._
|
| It isn't anymore.
|       In [3]: from asyncio import run
|
|       In [4]: async def async_func():
|          ...:     print('Ran async_func()')
|
|       In [5]: run(async_func())
|       Ran async_func()
|
| Top-level async/await is also available in the Python REPL
| and IPython, and there are discussions on the Python mailing
| list about making top-level async/await the default for
| Python[1].
|       In [1]: async def async_func():
|          ...:     print('Ran async_func()')
|
|       In [2]: await async_func()
|       Ran async_func()
|
| [1] https://groups.google.com/g/python-
| ideas/c/PN1_j7Md4j0/m/0xy...
| BiteCode_dev wrote:
| Oh, top level await... I missed that.
|
| Not sure it will get there, but it would be nice. I think
| putting a top level "await" is explicit enough for stating
| you want an event loop anyway.
|
| Now, with TaskGroup in 3.11, things are going to get pretty
| nice, especially if this top-level await plays out, provided
| they include async for and async with in the mix.
|
| Now, if they could just make it so that coroutines are
| automatically scheduled to the nearest task group, we would
| almost have something usable.
| dekhn wrote:
| so true. I've been writing thread-callback code for decades
| (common in network and gui event loops, see QtPy as an example)
| and when I looked at asyncio my first thought was "this is not
| better". It's entirely nontrivial to analyze code using asyncio
| (or yield) compared to callbacks.
| harpiaharpyja wrote:
| If you are ever considering making use of asyncio for your
| project, I would strongly recommend taking a look at curio [1]
| as an alternative. It's like asyncio but far, far easier to
| use.
|
| [1] https://curio.readthedocs.io/en/latest/index.html
| acidbaseextract wrote:
| The video (or blog post) below is one of the best
| explanations I've seen about what subtle bugs are easy to
| make with asyncio, why it's easy to make them, and how the
| trio library addresses them.
|
| But yes, consider alternatives before you pick asyncio as
| your approach!
|
| Talk: https://www.youtube.com/watch?v=oLkfnc_UMcE
|
| Blog post: https://vorpus.org/blog/notes-on-structured-
| concurrency-or-g...
| VWWHFSfQ wrote:
| Highly recommend curio
| [deleted]
| BiteCode_dev wrote:
| While the design of Curio is quite interesting, it may not
| be a good choice, not for technical reasons, but for
| logistical reasons: the chances it gets wide adoption are
| slim to None.
|
| And since we are stuck with colored functions in python, the
| choice of stack matters very much.
|
| Now, if you want easier concurrency, and a solution to a lot
| of concurrency problems that curio solves, while still being
| compatible with asyncio, use anyio:
|
| https://anyio.readthedocs.io/en/stable/
|
| It's a layer that works on top of asyncio, so it's compatible
| with all of it. But it features the nursery concept from
| Trio, which makes async programming so much simpler and
| safer.
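|
| A minimal sketch of that nursery/task-group style with anyio
| (3.x API; the fetch coroutine is just a placeholder):
|       import anyio
|
|       async def fetch(name):
|           await anyio.sleep(1)  # stand-in for real async work
|           print(f"{name} finished")
|
|       async def main():
|           # the task group scopes the lifetime of everything started
|           # inside it: the block only exits once every child task has
|           # finished or failed
|           async with anyio.create_task_group() as tg:
|               for i in range(3):
|                   tg.start_soon(fetch, f"task-{i}")
|
|       anyio.run(main)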
| heavyset_go wrote:
| anyio is also compatible with asyncio and Trio, so you can
| use it with either library or paradigm.
| sandGorgon wrote:
| Uvloop/uvicorn, the production-grade ASGI server, only
| works with asyncio.
|
| Hypercorn works with trio... but you lose a LOT of
| performance
| quietbritishjim wrote:
| Curio's spiritual successor is Trio [1], which was written by
| one of the main Curio contributors and is more actively
| maintained (and, at this point, much more widely used). Like
| Curio, it's much easier to use than asyncio, although ideas
| from it are gradually being incorporated back into asyncio
| e.g. asyncio.run() was inspired by curio.run()/trio.run().
|
| I have used Trio in real projects and I thoroughly recommend
| it.
|
| This blog post [2] by the creator of Trio explains some of
| the benefits of those libraries in a very readable way.
|
| [1] https://trio.readthedocs.io/en/stable/
|
| [2] https://vorpus.org/blog/some-thoughts-on-asynchronous-
| api-de...
| btown wrote:
| > instead we got a natively supported, robust, easy-to-use
| concurrency paradigm built around green/virtual threading that
| accommodates both IO and CPU bound work
|
| Minus the "natively supported" part, we have this today in
| http://www.gevent.org/ ! It's so, so empowering to be able to
| access the entire historical body of work of synchronous-I/O
| Python libraries, and with a single monkey patch cause every
| I/O operation, no matter how deep in the stack, to yield to
| your greenlet pool _without code changes_.
|
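| For anyone who hasn't seen it, the core of that approach is
| tiny; a rough sketch (the fetch helper and the URL are
| placeholders; gevent.monkey/spawn/joinall are the real API):
|       from gevent import monkey
|       monkey.patch_all()  # must run before anything imports socket/ssl
|
|       import gevent
|       import urllib.request
|
|       def fetch(url):
|           # plain synchronous code; the patched socket module yields
|           # to the hub whenever this blocks on the network
|           return urllib.request.urlopen(url).read()
|
|       jobs = [gevent.spawn(fetch, "https://example.com") for _ in range(3)]
|       gevent.joinall(jobs)
|       print([len(job.value) for job in jobs])
|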
| We fire up one process per core (gevent doesn't have good
| support for multiprocessing, but if you're relying on that,
| you're stuck on one machine anyways), spend perhaps 1 person-
| day a quarter dealing with its quirks, and in turn we never
| need to worry about the latencies of external services; our web
| servers and batch workers have throughput limited only by CPU
| and RAM, for which there's relatively little (though nonzero)
| overhead.
|
| IMO Python should have leaned into official adoption of gevent.
| It may not beat asyncio in raw performance numbers because
| asyncio can rely on custom-built bytecode instructions, whereas
| gevent has "userspace" code that must execute upon every yield.
| And, as with asyncio, you have to be careful about CPU-
| intensive code that may prevent you from yielding. But it's
| perfect for most horizontal-scaling soft-realtime web-style use
| cases.
| int_19h wrote:
| How would those green/virtual threads interface with native
| async APIs (e.g. the entirety of WinRT)?
| tbabb wrote:
| What specifically is the problem with asyncio? I quite like
| using it, so I'm curious if there's some aspect that makes it
| unsustainable?
| fullstop wrote:
| I like using it as well, but I've been bit several times by
| having runtime exceptions completely swallowed.
| throwaway81523 wrote:
| > What specifically is the problem with asyncio?
|
| Watch the very NSFW (lots of swearing) but hysterically funny
| video "node.js is bad ass rock star tech" on youtube sometime
| ;). https://www.youtube.com/watch?v=bzkRVzciAZg
| calpaterson wrote:
| The key disadvantage is largely that it bifurcates the
| library base. Async libraries and sync libraries co-exist
| uneasily in the same program.
|
| For nearly every popular library there is now a (usually
| inferior, less robust) async one. The benefits of Linus' Law
| are reduced.
| Redoubts wrote:
| Trifurcates, since now there's stdlib asyncio and a popular
| trio async flavor too.
| tbabb wrote:
| Fair point. Async is a big enough idea that it probably
| warrants designing the language with it in mind. I guess
| another way of phrasing it would be that it violates the
| "there's only one way to do it" maxim, and the "two ways of
| doing it" circumstance necessarily came about because the
| idea was discovered long after the core language and
| libraries were already written.
| aeyes wrote:
| It solves only one problem, the name says it: Async I/O
|
| If you do anything on the CPU or if you have any I/O which is
| not async you stall the event loop and everything grinds to a
| halt.
|
| Imagine a program which needs to send heartbeats or data to a
| server in a short interval to show liveness, Kafka for
| example. Asyncio alone can't reliably do this, you need to
| take great care to not stall the event loop. You only have
| exactly one CPU core to work with, if you do work on the CPU
| you stall the event loop.
|
| We see web frameworks built on asyncio, but even simple
| API-only applications constantly need to serialize data, which
| is CPU-bound. These frameworks make no effort (and asyncio
| doesn't give us any tools) to protect the event loop from
| getting stalled by your code. They work great in simple
| benchmarks and for a few types of applications, but you have
| to know the limits. And I feel that the general public does
| not know the limitations of asyncio; it wasn't made for
| building web frameworks on the async event loop. It was made
| for communicating with external services like databases and
| calling APIs.
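|
| One way to protect the loop is to push the CPU-bound part into
| a process pool; a rough sketch (serialize and heartbeat are
| placeholders for the real work):
|       import asyncio
|       import json
|       from concurrent.futures import ProcessPoolExecutor
|
|       def serialize(payload):
|           # CPU-bound work that would otherwise stall the event loop
|           return json.dumps(payload)
|
|       async def heartbeat():
|           while True:
|               print("heartbeat")
|               await asyncio.sleep(0.5)
|
|       async def main():
|           loop = asyncio.get_running_loop()
|           hb = asyncio.create_task(heartbeat())
|           with ProcessPoolExecutor() as pool:
|               # serialization runs in another process, so the loop
|               # stays responsive and the heartbeat keeps firing
|               data = await loop.run_in_executor(
|                   pool, serialize, {"n": list(range(100_000))})
|           hb.cancel()
|           print(len(data))
|
|       if __name__ == "__main__":
|           asyncio.run(main())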
| twic wrote:
| > With this scheme, the reference count in each object is split
| in two, with one "local" count for the owner (creator) of the
| object and a shared count for all other threads. Since the owner
| has exclusive access to its count, increments and decrements can
| be done with fast, non-atomic instructions. Any other thread
| accessing the object will use atomic operations on the shared
| reference count.
|
| > Whenever the owning thread drops a reference to an object, it
| checks both reference counts against zero. If both the local and
| the shared count are zero, the object can be freed, since no
| other references exist. If the local count is zero but the shared
| count is not, a special bit is set to indicate that the owning
| thread has dropped the object; any subsequent decrements of the
| shared count will then free the object if that count goes to
| zero.
|
| So in this program:
|       import threading
|
|       def produce():
|           global global_foo
|           local_foo = "potato"
|           global_foo = local_foo
|
|       def consume():
|           global global_foo
|           local_foo = global_foo
|           global_foo = None
|
|       if __name__ == '__main__':
|           produce()
|           thread = threading.Thread(target=consume)
|           thread.start()
|           thread.join()
|
| What happens to the counts on the string "potato"?
|
| In produce, the main thread creates it and puts it in a local,
| and increments the local count. It assigns it to a global, and
| increments the local count. It then drops the local when produce
| returns, and decrements the local count. In consume, the second
| thread copies the global to a local, and increments the shared
| count. It clears out the global, and decrements the shared count.
| It then drops the local when consume returns, and decrements the
| shared count.
|
| That leaves the local count at 1 and the shared count at -1!
|
| You might think that there must be special handling around
| globals, but that doesn't fix it. Wrap the string in a perfectly
| ordinary list, and put the list in the global, and you have the
| same problem.
|
| I imagine this is explained in the paper by Choi et al, but I
| have not read it!
| ameixaseca wrote:
| I couldn't find this in the design document but the only
| obvious solution is to track globals via the shared count.
| Since a global reference is part of all threads simultaneously,
| it cannot be treated as local.
|
| If you follow this reasoning, the operations above result in
| local=0/shared=0 after the last assignment.
| twic wrote:
| As i said in the comment, that doesn't work. Put a list in
| the global, and then push and pop the string on the list.
| Even better, push the string into a local list, then put that
| list in another local list, then put that in a global, etc.
| You would need to dynamically keep every object reachable
| from a global marked as such, and that's a non-starter.
| twic wrote:
| The paper:
|
| > When the shared counter for an object becomes negative for
| the first time, the non-owner thread updating the counter also
| sets the object's Queued flag. In addition, it puts the object
| in a linked list belonging to the object's owner thread called
| QueuedObjects. Without any special action, this object would
| leak. This is because, even after all the references to the
| object are removed, the biased counter will not reach zero --
| since the shared counter is negative. As a result, the owner
| would trigger neither a counter merge nor a potential
| subsequent object deallocation.
|
| > To handle this case, BRC provides a path for the owner thread
| to explicitly merge the counters called the ExplicitMerge
| operation. Specifically, each thread has its own thread-safe
| QueuedObjects list. The thread owns the objects in the list. At
| regular intervals, a thread examines its list. For each queued
| object, the thread merges the object's counters by accumulating
| the biased counter into the shared counter. If the sum is zero,
| the thread deallocates the object. Otherwise, the thread
| unbiases the object, and sets the Merged flag. Then, when a
| thread sets the shared counter to zero, it will deallocate the
| object. Overall, as shown in invariant I4, an owner only gives
| up ownership when it merges the counters.
|
| Well, that works, but it's a bit naff.
| Jtsummers wrote:
| Two spaces in front of each line of the code block. As written,
| right now, your comment is hard to parse:
|       import threading
|
|       def produce():
|           global global_foo
|           local_foo = "potato"
|           global_foo = local_foo
|
|       def consume():
|           global global_foo
|           local_foo = global_foo
|           global_foo = None
|
|       if __name__ == '__main__':
|           produce()
|           thread = threading.Thread(target=consume)
|           thread.start()
|           thread.join()
| twic wrote:
| Sorry about that. I did indent before pasting the code - but
| gedit indents with tabs, which HN ignores!
| kzrdude wrote:
| It can be configured. I remember when Gedit was quite the
| potent editor, language plugins, snippets and stuff. But it
| has the basics left still :)
| overgard wrote:
| I feel like GvR just doesn't want to change things. Feels doomed.
|
| This has been a problem for like 20 years and they have refused
| fixes before. And there have been fixes. They just don't see this
| as important
|
| It's practically a religion that it's a thing they won't change.
| heavyset_go wrote:
| I disagree entirely. The last few releases of Python have made
| significant changes to the language, coinciding with the
| project becoming community-led after Guido stepped down.
| mixmastamyk wrote:
| Yes, stepped down as a result of him forcing the walrus
| operator change into the language over significant
| opposition.
| randlet wrote:
| Guido is no longer the BDFL and spoke fairly positively about
| this change in the mailing list thread[1].
|
| "To be clear, Sam's basic approach is a bit slower for single-
| threaded code, and he admits that. But to sweeten the pot he
| has also applied a bunch of unrelated speedups that make it
| faster in general, so that overall it's always a win. But
| presumably we could upstream the latter easily, separately from
| the GIL-freeing part."
|
| [1] https://mail.python.org/archives/list/python-
| dev@python.org/...
| Waterluvian wrote:
| I can see it now:
|
| "My program has 2^64 references to an object, which caused it to
| become immortal"
|
| =)
| notriddle wrote:
| In a 64-bit address space, with objects requiring more than one
| word to store, that's literally impossible.
| The_rationalist wrote:
| Or you could just use GraalVM python
| https://github.com/oracle/graalpython
| misnome wrote:
| > This "optimization" actually slows single-threaded accesses
| down slightly, according to the design document, but that penalty
| becomes worthwhile once multi-threaded execution becomes
| possible.
|
| My understanding was that CPython viewed any single-threaded
| performance regression as a blocker to GIL-removal attempts,
| regardless of whether other work by the developer has sped up
| the interpreter? This article seems to somewhat gloss over that
| with "it's only small". I'd be interested in other estimations
| of the "better-than-average chance" this (promising-sounding)
| attempt has.
|
| Breaking C extensions (especially the less-conforming ones, which
| seem likely to be the least maintained) also seems like it would
| be a very hard pill to swallow, and the sort of thing that might
| make it a Python 3-to-4 breaking change, which I imagine would
| also be approached extremely carefully given there are still
| people to-this-day who believe that python 3 is a mistake and one
| day everyone will realise it and go back to python 2 (yes,
| really).
| singhrac wrote:
| From the article:
|
| > Gross has also put some significant work into improving the
| performance of the CPython interpreter in general. This was
| done to address the concern that has blocked GIL-removal work
| in the past: the performance impact on single-threaded code.
| The end result is that the new interpreter is 10% faster than
| CPython 3.9 for single-threaded programs.
| singhrac wrote:
| Sorry, to be clear, I missed your point "regardless of if
| other work by the developer has sped up the interpreter".
| That's fair, though my personal opinion is that that seems
| like an incredibly high bar for any language.
| a1369209993 wrote:
| > and one day everyone will realise it
|
| No? Why would we think that? There are people who willingly use
| _java_ ; compared to that the problems with python 3 are
| downright non-obvious as long as you never need to work with
| things like non-Unicode text.
| Kranar wrote:
| C extensions can continue to be supported. Said extensions
| already explicitly lock/release the GIL, so to keep things
| backwards compatible it would be perfectly fine if there was a
| GIL that existed strictly for C extension compatibility.
| masklinn wrote:
| > My understanding was that CPython viewed any single-threaded
| performance regression as a blocker to GIL-removal attempts,
| regardless of if other work by the developer has sped up the
| interpreter?
|
| Previous GILectomy attempts incurred significant single-
| threaded performance penalties, on the order of 50% or above.
| If Gross's work yields a low single-digit performance penalty,
| it's pretty likely to be accepted, as this is the sort of
| impact that can happen semi-routinely as part of interpreter
| updates.
|
| The complete breakage of C extensions would be a much bigger
| issue.
| ajkjk wrote:
| There are people who believe all kinds of crazy things; that
| doesn't make them true. Going back to Python 2 is never going
| to happen (and no one working on Py3 would ever want to,
| anyway).
|
| A hard pill to swallow... ain't that bad if it also benefits
| you tremendously, which fixing the GIL would do.
| EamonnMR wrote:
| I do wish for a world where Python 3 had handled
| unicode/bytes very differently.
| fatbird wrote:
| It was Guido's requirement that GIL removal not degrade single
| threaded performance at all, but in the talk I attended at PyCon
| 2019, the speaker mentioned nothing about qualifications on
| that. Guido's restriction was presented, quite reasonably, as
| "no one should have to suffer because of removing the GIL". So
| a net break-even or performance improvement is fine.
|
| And on top of that, Guido has retired now, and the steering
| committee may feel differently as long as the spirit of the
| restrictions is upheld.
| fatbird wrote:
| Guido has replied to Gross's announcement to observe that his
| performance improvements are not tied to removing the GIL and
| could be accepted separately. But he doesn't reject Gross's
| work outright, and if the same release that includes the GIL
| removal also delivers a concrete performance upgrade, I
| suspect that Guido would be fine with it. His concern is,
| after all, practical, to do with the actual use of python and
| not some architecture principle.
| efoto wrote:
| "The biggest source of problems might be multi-threaded programs
| with concurrency-related bugs that have been masked by the GIL
| until now."
| zinodaur wrote:
| > concurrency-related bugs that have been masked by the GIL
|
| Yeah... could phrase this as "All programs written with the
| assumption of a GIL are now broken" instead. Wish they had done
| this as part of the breaking changes for python 3, I guess
| they'll have to wait for Python 4 for this?
| Animats wrote:
| Yes. I once discovered that CPickle was not thread-safe. The
| response was that much of the library didn't really work in
| multi-threaded programs.
| formerly_proven wrote:
| You mean programs where you put an object into pickle and
| some other threads modify it while pickle is processing it?
| Doesn't surprise me - the equivalent written in plain Python
| would be very thread unsafe as well.
| Animats wrote:
| No, I mean several threads doing completely separate
| CPickle streams with no shared data or variables at the
| Python level.
| kzrdude wrote:
| Has it since been fixed?
| toyg wrote:
| Probably not. CPickle is famously shunned by anyone who
| has to do serious, performance-critical
| serialization/deserialization.
| kzrdude wrote:
| I was curious, and an issue that fits the description was
| fixed in Py 3.7.x here:
| https://bugs.python.org/issue34572 but other threading
| bugs remain: https://bugs.python.org/issue38884
| nomdep wrote:
| If the Python maintainers don't want to approve this, Gross
| should talk to the PyPy developers.
| otterley wrote:
| This may be a silly question, but if you really need concurrency,
| why not use a language that's built for concurrency from the
| ground up instead? Elixir is a great example.
| klyrs wrote:
| I rarely need concurrency, and do a lot of Python because it's
| what all my dependencies are written in. But sometimes, I find
| myself bottlenecked on a trivially parallelizable operation. In
| the state (my dependencies are in Python, I have a working
| Python implementation), there's _no way in hell_ that (rewrite
| my dependencies in Elixir, rewrite my code in Elixir) is a
| sensible next move.
| lucb1e wrote:
| Are you proposing to write anything that will need concurrency
| anywhere in your favorite language, or just call into the
| concurrent code from python? (Since comments like
| https://news.ycombinator.com/item?id=28883990 seem to be taking
| it as the former whereas I took it as the latter.)
| ska wrote:
| To a first approximation, people don't use python for itself,
| they use it for the vast ecosystem and network effect. If you
| jump to another language for better concurrency, what are you
| giving up?
|
| Unless you really are doing greenfield development in an
| isolated application, these considerations often trump any
| language feature.
| otterley wrote:
| Don't get me wrong; I'm not suggesting that anyone dump
| Python altogether to switch to a different language for any
| arbitrary project or purpose. Many businesses I work with use
| different languages for different components or applications,
| using the network or storage to intercommunicate when
| necessary. The right tool for the job, as it were.
| ferdowsi wrote:
| There are some organizations with lots of domain knowledge and
| expertise around developing, securing and deploying
| Python and they don't have the Innovation Currency to spend on
| investing in a new language.
|
| Specific to your point, recruiting for Elixir talent is a
| problem compared to more mainstream languages. Recruiting in
| general is extremely hard at this moment.
| otterley wrote:
| Given all the corner cases people are going to continue to
| find whilst trying to coax Python into behaving correctly in
| a highly concurrent program -- especially one that utilizes
| random libraries from the ecosystem -- I can't help but
| wonder whether the Innovation Currency is better spent
| replacing the components that require high concurrency (which
| often is only a subset of them) instead of getting stuck in
| the mire of bug-smashing.
| pmontra wrote:
| A possible answer is that everybody in the company knows Python
| and no other language. Another one is that they have to reuse
| or extend a bunch of existing Python code. The latter happened
| to me. Performance was definitely not a concern, but I
| suddenly needed threads doing extra functionality on top of
| the original single-threaded algorithm. BTW, I used a queue to
| pass messages between them.
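|
| The queue-passing style looks roughly like this (the worker
| body is a placeholder for the real work):
|       import threading
|       import queue
|
|       def worker(tasks, results):
|           while True:
|               item = tasks.get()
|               if item is None:       # sentinel: tell the worker to exit
|                   break
|               results.put(item * 2)  # placeholder for the real work
|
|       tasks, results = queue.Queue(), queue.Queue()
|       threads = [threading.Thread(target=worker, args=(tasks, results))
|                  for _ in range(4)]
|       for t in threads:
|           t.start()
|       for n in range(10):
|           tasks.put(n)
|       for _ in threads:
|           tasks.put(None)
|       for t in threads:
|           t.join()
|       print(sorted(results.get() for _ in range(10)))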
| otterley wrote:
| Using multiple interpreters with message passing is a
| workable, if expensive, way to deal with the problem. It is
| trading one cost for another. (These sort of tradeoffs are
| encountered all the time in business, to be sure.)
| Fordec wrote:
| Sometimes a project starts off aiming to solve a problem.
| Maybe it's a data science problem, so support already exists
| in Python, so let's do that. OK, it worked great and it's
| catching on with users. Now we need to scale, but we are
| running into concurrency issues. Which is the better answer:
| work on improving Python concurrency under the hood, or
| completely scrap the code base and switch to a different
| language?
|
| Very few people set out going asking themselves about such low
| level details on day one of a project. Especially something
| that was an MVP or POC
| otterley wrote:
| I'll plead ignorance here: Do data science workflows often
| require high concurrency using a single interpreter? I
| thought all that stuff was compute-bound and parceled out to
| workers that farm out calculations to CPUs and GPUs.
| Animats wrote:
| Or you could just use PyPy, which uses a garbage collector, does
| more compile-time analysis, and runs much faster.
|
| CPython is a naive interpreter, like original JavaScript. There's
| been progress since then.
| llimllib wrote:
| lots of people need C extensions, which you can't* have on
| pypy.
|
| *: mostly true
| willvarfar wrote:
| Pypy is still single threaded.
| https://doc.pypy.org/en/latest/faq.html#does-pypy-have-a-gil...
|
| This work is super exciting! Can pypy use the same recipe to
| offer true parallelism plus the jit??
|
| Will be really interesting to see what pypy devs think of this
| work and how they might also leverage it!
| nas wrote:
| I think it can't use the same recipe. Sam's approach for
| CPython uses biased reference counting. Internally, Pypy uses
| a tracing garbage collector, not reference counting. I don't
| know how difficult it would be to make their GC thread-safe.
| Probably you don't want to "stop the world" on every GC pass
| so I guess changes are non-trivial.
|
| Sam's changes to CPython's container objects (dicts, lists),
| to make them thread safe might also be hard to port directly
| to Pypy. Pypy implements those objects differently.
| willvarfar wrote:
| I think the biggest thing it will give is a need to go
| there. Until now, pypy has been able to not do parallelism.
| But if cpython is suddenly faster for a big class of
| programs, pypy will have to bite the bullet to stay
| relevant?
| masklinn wrote:
| pypy also has a GIL.
| nerdponx wrote:
| PyPy being stuck on 3.7 hurts. If 3.8 support comes out soon,
| I'll be happy to switch for general-purpose work. 3.9 would be
| even nicer, to support the type annotation improvements. I
| donate every month, but I'm just an individual donating pocket
| change; it'd be great to see some corporate support for PyPy.
| calpaterson wrote:
| There are very few new features in 3.8.
|
| It is a much less important release (for features) than 3.7,
| which for example added dataclasses and lots of typing and
| asyncio stuff.
|
| The most significant change in 3.8 is a notoriously
| controversial new infix operator. Even its supporters would
| say that it's a niche use case.
| masklinn wrote:
| > There are very few new features in 3.8.
|
| > It is a much less important release (for features) than
| 3.7, which for example added dataclasses and lots of typing
| and asyncio stuff.
|
| That's funny because my take is the exact opposite:
| dataclasses are not very useful (attrs exists and does
| more), deferred type annotations are meh, contextvars,
| breakpoint(), and module-level getattr/settattr but not
| exactly anything you can't do without.
|
| Assignment expressions provide for great cleanups in some
| contexts (and avoiding redundant evaluations in e.g.
| comprehensions), expr= is tremendous for printf-debugging,
| posonly args is really useful, \N in regex can much improve
| their readability when relevant.
|
| $dayjob has migrated to python 3.7 and there's really
| nothing I'm excited to use (possibly aside from doing weird
| things with breakpoint), whereas 3.8 would be a genuine
| improvement to my day-to-day enjoyment.
| nerdponx wrote:
| Deferred type annotations with `from __future__ import
| annotations` are a game-changer IMO. You can use them
| 3.7, which is good enough for me. The big improvement in
| 3.9 is not having to use `typing.*` for a lot of basic
| data types.
|
| The biggest improvements between 3.7, 3.8, 3.9, and 3.10
| are in `asyncio`, which was pretty rough in 3.7 and very
| usable in 3.9. I use the 3rd-party `anyio` library in a
| lot of cases anyway (https://anyio.readthedocs.io/), but
| it's not always feasible.
| laurencerowe wrote:
| It's been a few years since I last played around with PyPy but
| while it provided amazing performance gains for simple
| algorithmic code I saw no speed up on a more complex web
| application.
| typical182 wrote:
| This is a great list of influences on the design (from the
| article comments where the prototype author Sam Gross responded
| to someone wishing for more cross pollination across language
| communities):
|
| ----------
|
| "... but I'll give a few more examples specific to this project
| of ideas (or code) taken from other communities:
|
| - Biased reference counting (originally implemented for Swift)
|
| - mimalloc (originally developed for Koka and Lean)
|
| - The design of the internal locks is taken from WebKit
| (https://webkit.org/blog/6161/locking-in-webkit/)
|
| - The collection thread-safety adapts some code from FreeBSD
| (https://github.com/colesbury/nogil/blob/nogil/Python/qsbr.c)
|
| - The interpreter took ideas from LuaJIT and V8's ignition
| interpreter (the register-accumulator model from ignition, fast
| function calls and other perf ideas from LuaJIT)
|
| - The stop-the-world implementation is influenced by Go's design
| (https://github.com/golang/go/blob/fad4a16fd43f6a72b6917eff65...
| )"
| [deleted]
| Ericson2314 wrote:
| > Gross has also put some significant work into improving the
| performance of the CPython interpreter in general.
|
| Earmarks work, folks!
| a1369209993 wrote:
| > With this scheme, the reference count in each object is split
| in two, with one "local" count for the owner (creator) of the
| object and a shared count for all other threads. Since the owner
| has exclusive access to its count, increments and decrements can
| be done with fast, non-atomic instructions. Any other thread
| accessing the object will use atomic operations on the shared
| reference count.
|
| > Whenever the owning thread drops a reference to an object, it
| checks both reference counts against zero. If both the local and
| the shared count are zero, the object can be freed, since no
| other references exist. If the local count is zero but the shared
| count is not, _a special bit is set to indicate that the
| owning thread has dropped the object_; any subsequent
| decrements of the shared count will then free the object if that
| count goes to zero.
|
| This seems... off. Wouldn't it work better for the owning thread
| to hold (exactly) one atomic reference, which is released (using
| the same decref code as other threads) when the local reference
| count goes to zero?
|
| Edit: I probably should have explicitly noted that, as jetrink
| points out, the object is initialized with an atomic refcount of
| one (the "local refcount is nonzero" reference), and destroyed
| when the atomic refcount is one and to-be-decremented, so a
| purely local object never has atomic writes.
| [deleted]
| Someone wrote:
| > which is released (using the same decref code as other
| threads) when the local reference count goes to zero?
|
| (I may misunderstand your remark, as 'releasing' is a bit
| ambiguous. It could mean decreasing reference count and freeing
| the memory if the count goes to zero or just plain freeing the
| memory)
|
| The local ref count can go to zero while other threads still
| have references to the object (e.g. when the allocating thread
| sends an object as a message to another thread and, knowing the
| message arrived, releases it), so freeing the memory when it
| does would be a serious bug.
|
| Also, the shared ref count can go negative. From the paper:
|
| > _As an example, consider two threads T1 and T2. Thread T1
| creates an object and sets itself as the owner of it. It points
| a global pointer to the object, setting the biased counter to
| one. Then, T2 overwrites the global pointer, decrementing the
| shared counter of the object. As a result, the shared counter
| becomes negative._
|
| That can't happen with the biased counter because, when it
| would end up going negative, the object gets unbiased, and the
| shared counter gets decreased instead.
|
| That asymmetry is what ensures that only a single thread
| updates the biased counter, so that no locks are needed to do
| that.
| a1369209993 wrote:
| > I may misunderstand your remark, as 'releasing' is a bit
| ambiguous.
|
| The _reference_ is released; ie the (atomic) reference count
| is decremented (and the object is only freed if that caused
| the atomic reference count to go to zero).
|
| > From the paper
|
| I missed that there was a paper and was referring to the
| proposed implementation in python that was described in TFA.
| IIUC, biased refcount (in paper) is local (in my
| description), and shared is atomic, correct?
|
| > the shared ref count can go negative
|
| And _that_ makes sense. Thanks. (And also explains how to
| deal with references added by one thread and removed by
| another, when one of those threads is the object owner.)
| morelisp wrote:
| This seems like it would be less efficient if most objects
| don't escape their owning thread (you would need one atomic
| inc/dec versus zero), which is probably true of most objects.
| a1369209993 wrote:
| Sorry, should have been more clear; edited.
| johntb86 wrote:
| Suppose thread A (the owner) keeps a reference, but also puts
| another reference in a global variable. This would increment
| its local refcount to 1 and have a shared refcount of 1.
|
| Then thread B clears the global variable. With your scheme the
| local refcount would be 1 but the shared refcount would be 0,
| so thread B would destroy the object even though it's
| referenced by thread A.
| kccqzy wrote:
| I like this idea. In fact another possibility is to have a
| thread-local reference count for each thread that uses the
| object which can use fast non-atomic operations, and then each
| thread can use a shared atomic reference count, that counts how
| many threads use the object. When each thread-local count goes
| to zero, the shared count is decremented by one.
|
| This way, if an object is created in one thread and transferred
| to another, the other thread wouldn't even need to do a lot of
| atomic reference count manipulations. There wouldn't be
| surprising behavior in which different threads run the same
| code with different speed, just by virtue of whether they
| created the objects or not.
| Fronzie wrote:
| Good point. Windows COM did not follow your suggestion,
| leading to all sorts of awkwardness in applications that have
| compute- and ui-threads and share objects between the two.
| Object destruction becomes unpredictable and can hold up a
| UI thread.
| twic wrote:
| How would the storage be laid out?
|
| With the proposed scheme, there are two counters (and, I
| assume, the ID of the owning thread), so a small fixed-size
| structure, which can sit directly in the object header. With
| your scheme, you need a variable and unbounded number of
| counters. Where would they go?
| wishawa wrote:
| Maybe each thread could have a mapping storing the number
| of references held (in that thread) for each object? This
| way only the atomic refcount has to be in the object
| header. Also I don't think there would be an owning thread
| at all with this idea, so no ID needed.
| wishawa wrote:
| I think yours is a much cleaner design. In the original plan,
| if the owning thread just set the special bit, but before that
| set is propagated, another thread drops the shared refcount to
| zero, the object would never be released, would it?
|
| EDIT: never mind the question, I just read that the special bit
| is atomic.
| dennisafa wrote:
| I think that idea was mentioned earlier in the article:
|
| > The simplest change would be to replace non-atomic reference
| count operations with their atomic equivalents. However, atomic
| instructions are more expensive than their non-atomic
| counterparts. Replacing Py_INCREF and Py_DECREF with atomic
| variants would result in a 60% average slowdown on the
| pyperformance benchmark suite.
| a1369209993 wrote:
| > to replace non-atomic reference count operations with their
| atomic equivalents.
|
| Nope, my proposal still uses two reference counts (one
| atomic, one local); it just avoids having a separate flag bit
| to indicate that the owning thread is done.
| cogman10 wrote:
| Yeah, I'm not exactly getting all the complexity here.
|
| I'm digging the 2 reference counters, that makes sense to me,
| but I don't know why it isn't something more like:
|
| "every time a new thread takes a reference, atomic +1, every
| time a new thread's local count hits 0, atomic -1. If the
| shared reference is 0, free".
|
| IDK what special purpose the flags are serving here.
| [deleted]
| jeremyjh wrote:
| Most objects are never shared so there would be a performance
| impact from incrementing (and decrementing) an atomic counter
| even just once.
| jetrink wrote:
| That is true, but what if the shared count were initialized
| to one and the creator thread frees an object when the
| shared count is equal to one and the local count is
| decremented to zero? (Since it knows it holds one shared
| reference.) Then the increment and decrement would be
| avoided for non-shared objects.
| a1369209993 wrote:
| I probably should have mentioned that explicitly; edited.
| kzrdude wrote:
| To have one local count per thread would add memory overhead,
| I think? In his solution there are only two counters per
| object, local and shared.
|
| Any other thread can't know if it's the first time or not
| it's taking a reference to an object.
| AgentME wrote:
| >Edit: I probably should have explicitly noted that, as jetrink
| points out, the object is initialized with a atomic refcount of
| one (the "local refcount is nonzero" reference), and destroyed
| when the atomic refcount is one and to-be-decremented, so a
| purely local object never has atomic writes.
|
| I think you're under the impression that the refcount would
| only ever need to be incremented if the object was shared to
| another thread, but that's not the case. The refcount isn't a
| count of how many threads have a reference to the object; it's
| a count of how many objects and closures have a reference to
| the object. Even objects that never leave their creator thread
| will be likely to have their reference count incremented and
| decremented a few times over their life.
| a1369209993 wrote:
| > Even objects that never leave their creator thread will be
| likely to have their reference count incremented and
| decremented a few times over their life.
|
| I think you're under the impression that there's only one
| refcount. The point of the original design (and this one) is
| that there are two refcounts: one that's updated only by the
| thread that created the object, and therefore doesn't need to
| use slow atomic accesses, and one that's updated atomically,
| and therefore can be adjusted by arbitrary threads.
| AgentME wrote:
| Oh, I misunderstood you then. I thought you were trying to
| get rid of the local refcount and make the atomic one
| handle its job too, but what you're suggesting is a
| possible simplification of the logic that detects when it's
| time to destroy the object. That makes sense, just seems
| more minor than I thought you were going at and I guess I
| missed it.
| jeremyjh wrote:
| Threads don't hold references - other objects do and we have to
| know how many do. If threads held a reference it might never be
| released. Since most objects are never shared we wouldn't want
| to increment an atomic counter even once for those.
| a1369209993 wrote:
| I don't _think_ the objection you're actually making is
| valid (the extra atomic reference is just a representation of
| the fact that the local refcount is nonzero), but come to
| think of it, even in the original version, how the heck does
| a thread know whether a reference held by (say) a dictionary
| that is itself accessible to multiple threads was increfed by
| the owning thread or another thread?
| lormayna wrote:
| Why not use something like trio or curio? They are quite easy
| to learn, very powerful, and have an approach similar to
| channels in golang.
| synchronizing wrote:
| async != multithreading
| calpaterson wrote:
| Those are not multithreading, they are asynchronous io which is
| different. With asynchronous io in Python the only
| concurrency/parallelism you can do is for IO.
|
| Multithreading in Python currently has the same limitation
| though but it needn't.
| rich_sasha wrote:
| This is some of the best news I read in a while!
|
| Multiprocessing sort of works but it's really sucky.
| jeremyis wrote:
| Way to go Sam! Mark my words: our generation's Carmack!
| dsr_ wrote:
| I'm going to assume that there is a reason that this isn't a
| switch control, so that the default is a single-threaded program
| and the programmer needs to state explicitly that this one will
| be multi-threaded, upon which the interpreter changes into the
| atomic mode for the rest of execution?
| toxik wrote:
| That would be expensive.
| mikepurvis wrote:
| Basically no one would get the glorious single-threaded
| performance then, since the first time you pip install
| anything, you're going to discover that it spins up a thread
| under the hood that you're never exposed to.
|
| Or worse, you end up with the async schism all over again, with
| new "threadless" versions of popular libraries springing up.
| __s wrote:
| Most references are thread local, where this implementation
| will still beat out atomic refcounts in a multi-threaded app
| sandGorgon wrote:
| has anyone built and run this in docker? would love to test this
| out - I don't have a lot of experience compiling python inside
| docker
|
| EDIT: there is a dockerfile in there
| https://raw.githubusercontent.com/colesbury/nogil/nogil/Dock...
| cormacrelf wrote:
| > _" biased reference counts" and is described in this paper by
| Jiho Choi et al. With this scheme, the reference count in each
| object is split in two, with one "local" count for the owner
| (creator) of the object and a shared count for all other threads_
|
| > _The interpreter's memory allocator has been replaced with
| mimalloc_
|
| These are very similar ideas!
|
| Mimalloc is notable for its use of separate local and remote free
| lists, where objects that are being freed from a different thread
| than the page's heap's owner are placed in a separate queue. The
| local free list is (IIRC) non-atomic until it is empty and local
| allocs start pulling from the remote queue.
|
| The general idea is clearly lazy support for concurrency,
| matching up perfectly with Python's need to keep any single
| threaded perf it has. I'm impressed with the application of all
| of these things at once.
| marris wrote:
| How big a problem is the possible breakage of C extensions for
| new code? Is there currently some standard "future proofed for
| multi-thread" way of writing them that will reduce the odds of
| the C extension breaking? And maybe also being compatible with
| PyPy? Or do developers today need to write a separate version for
| each interpreter that they want to support?
| heavyset_go wrote:
| There are projects[1] that are abstracting away the C extension
| interface in order to standardize C extensions across
| implementations and prevent breaking changes.
|
| [1] https://github.com/hpyproject/hpy
| singhrac wrote:
| Notably the dev proposing this (Sam Gross aka colesbury) is/was a
| major PyTorch developer, so someone quite familiar with high
| performance Python and extensions.
| ajtulloch wrote:
| and he's a genius!
| jeremyis wrote:
| +1 ! Though, not a very good Oculus player (yet)!
| mzs wrote:
| Yikes, C extensions can't assume they are under the GIL by default:
|
| https://github.com/colesbury/numpy/commits/v1.19.3-nogil
| kzrdude wrote:
| It looks like a total of four lines needed changing in numpy
| due to his change. That's a very good score in my book, numpy
| is huge.
| Jweb_Guru wrote:
| Unfortunately, every C extension will need to undergo manual
| review for safety, unless there's some very easy way to have
| the C extension opt into using the GIL. And some of them will
| be close to impossible to detangle in this way.
| veryupwork wrote:
| no
| int_19h wrote:
| It really depends on how the library is written, and how much
| shared data it has. It has been very common to use GIL as a
| general-purpose synchronization mechanism in native Python
| modules, since you have to pay that tax either way.
| ikiris wrote:
| I was half expecting a link to go or rust.
| jeffybefffy519 wrote:
| I don't know about others, but I really enjoy content about the
| Python GIL. It's a fascinatingly complex problem.
| lucb1e wrote:
| For a minute I thought I finally found someone else who likes
| the GIL, but then you said _content about_. Programs that just
| divide up work across processes are much easier to write
| without introducing obscure bugs due to the lack of atomicity.
| I'm definitely excited for a GIL-less python, even if it's a
| rare scenario where it makes sense to try to do performant code
| in python in the first place rather than offloading a few lines
| to another language to be fast, but I am a bit afraid that
| people (particularly beginner programmers) will embrace this
| too eagerly. Having also seen recommendations for this-or-
| that threading method going around in other languages, threads
| are recommended far more often than makes sense, and beginners
| won't have a comparative experience yet of writing
| multi-process code instead.
|
| That said, I am also always interested in GIL-related content
| like this! Loved the article.
| solarmist wrote:
| Yup. Me too, but I'm not sad to see it go.
| phkahler wrote:
| >> If that bit is set, the interpreter doesn't bother tracking
| references for the relevant object at all. That avoids contention
| (and cache-line bouncing) for the reference counts in these
| heavily-used objects. This "optimization" actually slows single-
| threaded accesses down slightly, according to the design
| document, but that penalty becomes worthwhile once multi-threaded
| execution becomes possible.
|
| Was going to say do the opposite. Set the bit if you want
| counting and then modify the increment and decrement to add or
| subtract the bit, thereby eliminating condition checking and
| branching. But it sounds like the concern is cache behavior when
| the count is written. Checking the bit can avoid any modification
| at all.
| Jtsummers wrote:
| Under this scheme, objects get freed when both local and shared
| counts are zero. By using a special value that makes the shared
| count non-zero (for eternal and long-lived objects), it ensures
| that should the owner (for some reason) drop them, they will
| not be freed. No extra logic has to be introduced, the shared
| count is non-zero and that's all that's needed to prevent
| freeing.
___________________________________________________________________
(page generated 2021-10-15 23:00 UTC)