[HN Gopher] Linear Address Spaces: Unsafe at any speed
___________________________________________________________________
Linear Address Spaces: Unsafe at any speed
Author : gbrown_
Score : 115 points
Date : 2022-06-29 19:45 UTC (3 hours ago)
(HTM) web link (queue.acm.org)
(TXT) w3m dump (queue.acm.org)
| Veserv wrote:
| Of course things would be faster if we did away with coarse
| grained virtual memory protection and instead merged everything
| into a single address space and guaranteed protection using fine
| grained permission mechanisms.
|
| The problem with that is that a single error in the fine-grained
| mechanism anywhere in the system can quite easily cause complete
| system compromise. Achieving any safety guarantee requires
| achieving a perfect safety guarantee across all arbitrary code in
| your entire deployed system. That is astronomically harder than
| ensuring safety with virtual memory protection, where you only
| need to analyze the small trusted code base that establishes the
| linear address space and never need to analyze or even understand
| arbitrary code to enforce safety and separation.
|
| For that matter, fine-grained permissions are a strict superset
| of the prevailing virtual memory paradigm: you can trivially
| model the existing coarse-grained protection by just making the
| fine-grained protection more coarse. So, if you can make a safe
| system using fine-grained permissions, then you can trivially
| create a safe system using coarse-grained virtual memory
| protection. And if you can do that, then you can create an
| unhackable operating system right now using those techniques. So
| where is it?
|
| Anybody who claims to be able to solve this problem should first
| demonstrate a mathematically proven unhackable operating system,
| as that is _strictly easier_ than what is being proposed. Until
| they do that, the entire idea is a total pipe dream with respect
| to multi-tenant systems.
| VogonPoetry wrote:
| I think the plague of speculative-execution bugs qualifies as a
| single error in virtual memory systems that causes complete
| system compromise. It was not a logic error in code, but a flaw
| in the hardware. It isn't clear to me whether CHERI would have
| been immune to speculative-execution problems, but access issues
| would likely have shown up if the memory-ownership tests were in
| the wrong place.
|
| I have been following CHERI. I note that in order to create the
| first FPGA implementation they had to first define the HDL for a
| virtual memory system -- none of the research "processor" models
| that were available had working or complete VM implementations.
| CHERI doesn't replace VM; it is in addition to VM.
|
| I've found that memory bugs (including virtual memory ones) are
| difficult to debug, because the error is almost never in the
| place where the failures show up and there is no easy way to
| track back who ought to own the object or how long ago the
| error happened. CHERI can help with this by at least being able
| to identify the owner.
|
| Virtual memory systems are usually pretty complex. Take a look
| at the list of issues in the design of L3
| <https://pdos.csail.mit.edu/6.828/2007/lec/l3.html>; the largest
| section there is about creating address spaces. For the Linux
| kernel, a lot of the MM code in this diagram is colored green
| <https://i.stack.imgur.com/1dyzH.png> -- it is a significant
| portion. More code means more bugs and makes formal verification
| much harder.
|
| I am not convinced by the argument that it is possible to take a
| fine-grained system and trivially expand it to a coarse-grained
| one. How are shared memory, mmap'ed dylibs, and page-level
| copy-on-write handled?
| [deleted]
| potatoalienof13 wrote:
| You have misunderstood the article. It is not advocating for
| the return to single address space systems. It is advocating
| for potential alternatives to the linear address space model.
| Here [1] is an operating system that I think fits the
| description of what you were talking about.
|
| [1] https://en.wikipedia.org/wiki/Singularity_%28operating_syste...
| Genbox wrote:
| The more I research Singularity, the more I like it. I dove deep
| into all the design docs years ago, and the amount of rethinking
| of existing OS infrastructure is astounding.
|
| Joe Duffy has some great blog posts on Midori (OS based on
| Singularity) here:
| http://joeduffyblog.com/2015/11/03/blogging-about-midori/
| infogulch wrote:
| The Mill's memory model is one of its most interesting features
| IMO [1] and solves some of the same problems, but by going the
| other way.
|
| On the Mill the whole processor bank uses a global virtual
| address space. The TLB and mapping to physical memory happen at
| the _memory controller_. Everything above the memory controller
| is in the same virtual address space, including the L1-L3+
| caches. This solves _a lot_ of problems. For example: if you go
| out to main memory you're already paying ~300 cycles of latency,
| so a large silicon area / data structure for translation is no
| longer a 1-cycle latency problem. Writes to main memory are
| flushed down the same memory hierarchy that reads come from, and
| succeed as soon as they hit L1. Since all cache lines are in the
| same virtual address space you don't have to track and
| synchronize reads and writes across translation zones within the
| cache hierarchy. When you request an unallocated page you get
| the whole pre-zeroed page back _instantly_, since it doesn't
| need to be mapped to physical pages until writes are flushed out
| of L3. This means it's possible for a page to be allocated,
| written to, read, and deallocated while _never actually touching
| physical memory_: the whole sequence is served purely within the
| cache hierarchy.
|
| Protection is a separate system (the "PLB") and can be much
| smaller and more streamlined since it's not trying to do two
| jobs at once. The PLB allows a process to give fine-grained
| temporary access to a portion of its memory to another process:
| RW, RO, or WO, byte-addressed ranges, for one call or longer,
| etc. Processes get allocated available address space on start;
| they can't just assume they own the whole address space or start
| at some specific address (you should be using ASLR anyway, so
| this should have no effect on well-formed programs, though there
| is a legacy fallback).
|
| [1]: My previous comment:
| https://news.ycombinator.com/item?id=27952660
| pclmulqdq wrote:
| The Mill model is kind of cool, but today many peripherals
| (including GPUs and NICs) can dump bytes straight into the L3
| cache. This improves latency in a lot of tasks, including the
| server-side ones the Mill is designed for. It is possible
| because the MMU sits above the L3 cache, so the L3 is physically
| addressed.
|
| Honestly, I'm happy waiting for 4k pages to die and be replaced
| by huge pages. Page tables were added to the x86 architecture in
| 1985, when 1MB of memory was a ton of memory to have. Having 256
| pages' worth of memory in your computer was weird and exotic.
| Fast forward to today, and the average user has several GB of
| memory - mainstream computers can be expanded to over 128 GB -
| and we still mainly use 4k pages. That is the problem here. If
| we could swap to 2M pages in most applications, we would reduce
| page table sizes by a factor of 512, and they would still be a
| lot larger than page tables were when virtual memory was
| invented. And we wouldn't waste much memory!
|
| But no, 4k pages for backwards compatibility. 4k pages forever.
| And while we're at it, let's add features to Linux (like TCP
| zero copy) that rely on having 4k pages.
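|
| A minimal sketch of what explicitly asking for 2M pages looks
| like from userspace on Linux (assuming hugetlb pages have been
| reserved, e.g. via vm.nr_hugepages; otherwise it falls back to
| hinting transparent huge pages with madvise):
|
|   #define _GNU_SOURCE
|   #include <stdio.h>
|   #include <string.h>
|   #include <sys/mman.h>
|
|   int main(void) {
|       size_t len = 2UL * 1024 * 1024;   /* one 2 MiB huge page */
|
|       /* Explicit huge page; fails if no hugetlb pages are reserved. */
|       void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
|                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
|       if (p == MAP_FAILED) {
|           /* Fall back to normal pages and merely hint that the
|              kernel may back them with transparent huge pages. */
|           p = mmap(NULL, len, PROT_READ | PROT_WRITE,
|                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
|           if (p == MAP_FAILED) { perror("mmap"); return 1; }
|           madvise(p, len, MADV_HUGEPAGE);
|       }
|
|       memset(p, 0, len);                /* touch the memory */
|       printf("mapped %zu bytes at %p\n", len, p);
|       munmap(p, len);
|       return 0;
|   }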
| a-dub wrote:
| > Why do we even have linear physical and virtual addresses in
| the first place, when pretty much everything today is object-
| oriented?
|
| are there alternatives to linearly growing call stacks?
| robotresearcher wrote:
| A stack is a list of objects with a LIFO interface. Doesn't
| have to be a contiguous byte sequence.
| a-dub wrote:
| is there an example of machine code that doesn't make use of
| a linear contiguous call stack?
|
| what would the alternative be? compute the size of all the
| stack frames a-priori in the compiler and then spray them all
| over main memory and then maintain a linear contiguous list
| of addresses? doesn't the linear contiguous nature of
| function call stacks in machine code preserve locality in
| order to make more efficient use of caches? or would the
| caches have to become smarter in order to know to preserve
| "nearby" stack frames when possible?
|
| also, why not just make the addresses wider and put the pid
| in the high bits? they're already doing this masking stuff
| for the security descriptors, why not just throw the pid in
| there as well and be done with it?
| robotresearcher wrote:
| The linked article doesn't mention call stacks explicitly, but
| it describes the R1000 as object+offset addressed in hardware.
| So unless they restricted the call stack to fit into one object
| and used only offsets, then yes, they must have chained objects
| together for the stack.
|
| A page-based memory model is what creates the importance of
| address locality. If you have an object-based memory model, and
| the working set is one of objects, not pages, then address
| locality between objects doesn't matter.
|
| Of course, page-based memory models are by FAR the most common
| in practice.
|
| (Note: pages ARE objects, but the objects are significant
| to the VM system and not to your program. So strictly,
| page-based models are a corner case of object-based models,
| where the objects are obscure.)
| a-dub wrote:
| would be interesting to see how the actual call stack is
| implemented. they must either have a fixed width object
| as you mention or some kind of linear chaining like
| you're describing.
|
| found this on wikipedia:
| https://resources.sei.cmu.edu/asset_files/TechnicalReport/19...
|
| memory and disk are unified into one address space, code
| is represented by this "diana" structure which can be
| compressed text, text, ast or machine code. would be
| curious how procedures are represented in machine code.
|
| what a fascinating machine!
| Someone wrote:
| > is there an example of machine code that doesn't make use
| of a linear contiguous call stack?
|
| Early CPUs didn't have support for a stack, and some early
| languages such as COBOL and Fortran didn't need one. They
| didn't allow recursive function calls, so return addresses
| could be stored at fixed addresses, and a return could
| either be an indirect jump reading from that address or a
| direct jump whose target address got modified when writing
| to that fixed address (see
| https://people.cs.clemson.edu/~mark/subroutines.html for
| the history of subroutine calls)
|
| Both go (https://blog.cloudflare.com/how-stacks-are-
| handled-in-go) and rust
| (https://mail.mozilla.org/pipermail/rust-
| dev/2013-November/00...) initially had split stacks
| (https://releases.llvm.org/3.0/docs/SegmentedStacks.html,
| https://gcc.gnu.org/wiki/SplitStacks)
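|
| A toy C sketch of that older, stack-less linkage (illustrative
| only, not how those machines actually encoded it): each routine
| gets a single statically allocated slot for its argument and its
| return target, so there is no per-call frame, and recursion is
| impossible because a nested call would overwrite the saved slot:
|
|   #include <stdio.h>
|
|   static void (*square_return)(int);  /* fixed "return address" slot */
|   static int   square_arg;            /* fixed argument slot         */
|
|   static void print_result(int r) {
|       printf("result = %d\n", r);
|   }
|
|   static void square(void) {
|       int result = square_arg * square_arg;
|       square_return(result);          /* "return" via the saved slot */
|   }
|
|   int main(void) {
|       square_arg    = 7;
|       square_return = print_result;   /* store the return target ... */
|       square();                       /* ... then transfer control   */
|       return 0;
|   }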
| kazinator wrote:
| > _Why do we even have linear physical and virtual addresses in
| the first place, when pretty much everything today is object-
| oriented?_
|
| Simple: we don't want some low level kernel memory management
| dictating what constitutes an "object".
|
| Not everything is object-oriented: e.g., large arrays and
| memory-mapped files, including executables and libraries.
|
| Linear memory sucks, but every other organization sucks more.
|
| Segmented memory has been done; the benefit-to-clunk ratio was
| negligible.
| MarkSweep wrote:
| The benefit-to-thunk ratio was not great either.
|
| ( one reference to thunks involving segmented memory:
| https://devblogs.microsoft.com/oldnewthing/20080207-00/?p=23...
| )
| kazinator wrote:
| Real segmentation would have solved the problem described in
| the article. With virtual memory segments like those on the
| 80386 (and on mainframes before that), you can physically
| relocate a segment while adjusting its descriptor so that the
| addressing doesn't change.
|
| The problem was mainly caused by having no MMU, so moving
| objects around to save space required adjusting pointers.
| Today, a copying garbage collector does the same thing: it
| rewrites all the links among the moved objects. You'd have
| similar hacks on Apple Macintoshes, with their MC68K processors
| and flat address space.
| mwcremer wrote:
| tl;dr: page-based linear addressing costs performance and needs
| complicated access machinery, e.g. multilevel page tables. Mr.
| Kamp would prefer an object model of memory access and
| protection. Also, CHERI
| (https://dl.acm.org/doi/10.5555/2665671.2665740) improves memory
| safety by treating pointers and integers as distinct types.
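|
| A rough C sketch of what "pointers as a distinct, bounded type"
| buys you. The cap_t struct and its helpers below are hypothetical
| stand-ins for illustration, not CHERI's actual compressed
| encoding or API; on CHERI the tag and bounds live in hardware and
| cannot be forged from an integer:
|
|   #include <stdbool.h>
|   #include <stddef.h>
|   #include <stdint.h>
|   #include <stdio.h>
|
|   typedef struct {
|       uintptr_t base;    /* start of the object this pointer may touch  */
|       size_t    length;  /* size of that object                         */
|       uintptr_t addr;    /* current address within (or outside) bounds  */
|       bool      valid;   /* the "tag": cleared by any illegal operation */
|   } cap_t;
|
|   static cap_t cap_from(void *p, size_t len) {
|       return (cap_t){ (uintptr_t)p, len, (uintptr_t)p, true };
|   }
|
|   static cap_t cap_offset(cap_t c, ptrdiff_t off) {
|       c.addr += off;
|       if (c.addr < c.base || c.addr >= c.base + c.length)
|           c.valid = false;          /* out of bounds: drop the tag */
|       return c;
|   }
|
|   static uint8_t cap_load(cap_t c) {
|       if (!c.valid) { fprintf(stderr, "capability fault\n"); return 0; }
|       return *(uint8_t *)c.addr;
|   }
|
|   int main(void) {
|       uint8_t buf[16] = { 42 };
|       cap_t c = cap_from(buf, sizeof buf);
|       printf("%d\n", cap_load(c));                  /* in bounds: 42    */
|       printf("%d\n", cap_load(cap_offset(c, 32)));  /* faults, prints 0 */
|       return 0;
|   }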
| gumby wrote:
| The Multics system was designed to have segments (for this
| discussion == pages) that were handled the way he describes,
| down to the pointer handling. Not bad for the 1960s, though Unix
| was designed for machines with far fewer transistors, back when
| that mattered a lot.
|
| Things like TLBs (not a new invention; they go back to the
| 1960s) really only matter to systems programmers, as he says,
| and their judicious use has simplified programming for a long
| time. I think if he really wants to go down this path he'll
| discover that the worst case behavior (five probes to find a
| page) really is worth it in the long run.
| anewpersonality wrote:
| CHERI is a game changer
| gralx wrote:
| Link didn't work for me. Direct link did:
|
| https://dl.acm.org/doi/abs/10.1145/3534854
| scottlamb wrote:
| tl;dr: conventional design bad, me smart, capability-based
| pointers (base+offset with provenance) can replace virtual
| memory, CHERI good (a real modern implementation of capability-
| based pointers).
|
| The first two points are similar to other Poul-Henning Kamp
| articles [1]. The last two are more interesting.
|
| I'm inclined to agree with "CHERI good". Memory safety is a huge
| problem. I'm a fan of improving it by software means (e.g. Rust)
| but CHERI seems attractive at least for the huge corpus of
| existing C/C++ software. The cost is doubling the size of
| pointers, but I think it's worth it in many cases.
|
| I would have liked to see more explanation of how capability-
| based pointers replacing virtual memory would actually work on a
| modern system.
|
| * Would we give up fork() and other COW sorts of tricks?
| Personally I'd be fine with that, but it's worth mentioning.
|
| * What about paging/swap/mmap (to compressed memory contents,
| SSD/disk, the recently-discussed "transparent memory offload"
| [2], etc)? That seems more problematic. Or would we do a more
| intermediate thing like The Mill [3] where there's still a
| virtual address space but only one rather than per-process
| mappings?
|
| * What bookkeeping is needed, and how does it compare with the
| status quo? My understanding with CHERI is that the hardware
| verifies provenance [4]. The OS would still need to handle the
| assignment. My best guess is the OS would maintain analogous data
| structures to track assignment to processes (or maybe an extent-
| based system rather than pages) but maybe the hardware wouldn't
| need them?
|
| * How would performance compare? I'm not sure. On the one hand,
| double pointer size => more memory, worse cache usage. On the
| other hand, I've seen large systems spend >15% of their time
| waiting on the TLB. Huge pages have taken a chunk out of that
| already, so maybe the benefit isn't as much as it seemed a few
| years ago. Still, if this nearly eliminates that time, that may
| be significant, and it's something you can measure with e.g.
| "perf"/"pmu-tools"/"toplev" on Linux.
|
| * etc
|
| [1] eyeroll at https://queue.acm.org/detail.cfm?id=1814327
|
| [2] https://news.ycombinator.com/item?id=31814804
|
| [3] http://millcomputing.com/wiki/Memory#Address_Translation
|
| [4] I haven't dug into _how_ this works when fetching pointers
| from RAM rather than in pure register operations, but for the
| moment I'll just assume it works, unless it's probabilistic?
| throw34 wrote:
| "The R1000 addresses 64 bits of address space instantly in every
| single memory access. And before you tell me this is impossible:
| The computer is in the next room, built with 74xx-TTL
| (transistor-transistor logic) chips in the late 1980s. It worked
| back then, and it still works today."
|
| That statement has to come with some hidden caveats. 64 bits of
| address space is crazy huge, so it's unlikely the entire range
| was even present. If only a subset of the range was "instantly"
| available, we have that now: turn off main memory and run right
| out of the L1 cache. Done.
|
| We need to keep in mind, the DRAM ICs themselves have a hierarchy
| with latency trade-offs.
| https://www.cse.iitk.ac.in/users/biswap/CS698Y/lectures/L15....
|
| This does seem pretty neat though. "CHERI makes pointers a
| different data type than integers in hardware and prevents
| conversion between the two types."
|
| I'm definitely curious how the runtime loader works.
| cmrdporcupine wrote:
| _" We need to keep in mind, the DRAM ICs themselves have a
| hierarchy with latency trade-offs_" Yes this is the thing --
| I'm not a hardware engineer or hardware architecture expert,
| but -- it seems to me that what we have now is a set of
| abstractions presented by the hardware to the software based on
| a model of what hardware "used to" look like, mostly what it
| used to look like in a 1970s minicomputer, when most of the
| intensive key R&D in operating systems architecture was done.
|
| One can reasonably ask, as Mr Kamp does, why we should stick to
| these architectural idols at this point in time. It's reasonable
| enough, except that the world of heterodox, alternative
| architectures is also heterogeneous -- new concepts that don't
| necessarily "play well with others." All our compiler
| technology, all our OS conventions, our tooling, etc. would need
| to be rethought under new abstractions.
|
| And those are fun hobby or thought exercises, but in the real
| world of industry, they just won't happen. (Though I guess from
| TFA it could happen in a more specialized domain like
| aerospace/defence)
|
| In the meantime, hardware engineering is doing amazing things
| building powerfully performing systems that give us some nice
| convenient consistent (if sometimes insecure and awkward) myths
| about how our systems work, and they're making them faster
| every year.
| bentcorner wrote:
| Makes me wonder if 50 years from now we'll still be stuck
| with the hardware equivalent of the floppy disk icon, only
| because retooling the universe over from scratch is too
| expensive.
| nine_k wrote:
| As they say, C was designed for the PDP-11 architecture, and
| modern computers are forced to emulate it, because the tools we
| have for describing software (languages and OSes) can't easily
| describe other architectures.
|
| There have been modern, semi-successful attempts, though -- see
| the PS3 / Cell architecture. It didn't stick.
|
| I'd say that the modern heterodox architecture domain is GPUs,
| but we have one proprietary and successful interface for them
| (CUDA), and the open alternatives (OpenCL) are markedly weaker.
| And that's not even touching the OS abstractions.
| jart wrote:
| You can avoid the five levels of indirection by using "unreal
| mode". I just wish it were possible to do with 64-bit code.
| cmrdporcupine wrote:
| "The R1000 has many interesting aspects ... the data bus is 128
| bits wide: 64-bit for the data and 64-bit for data's type"
|
| _what what what?_
|
| How on earth would you ever need a type enumeration 2^64
| entries long?
|
| Neat, though.
| btilly wrote:
| My guess is that it is an object-oriented system. The data's
| type is a pointer to the address that defines the type, which
| could be anywhere in the system.
|
| This is also a security feature. If you find a way to randomly
| change the data's type, you're unlikely to land on another
| valid type.
| kimixa wrote:
| The other option is to use those 64 bits to double the total
| bandwidth of the "traditional" page-table system.
|
| All this extra complexity and bus width doesn't come for free,
| after all; there's an opportunity cost.
| KerrAvon wrote:
| No idea, but consider that it could be an enum + bitfield rather
| than strictly an enum.
| robotresearcher wrote:
| I don't know if this machine supported it, but it could allow
| you to have a system-wide unique type for this-struct-in-this-
| thread-in-this-process, with strong type checking all the way
| through the compiler into run time. Which would be pretty cool.
|
| GUIDs for types.
| gpderetta wrote:
| At Intel they probably still have nightmares about iAPX 432. They
| are not going to try an OO architecture again.
|
| Having said that, I wouldn't be surprised if some form of
| segmentation became popular again.
| KerrAvon wrote:
| I'd hope that anyone at Intel with said nightmares would have
| read this paper by now (wherein Bob Colwell, et al, argue that
| the 432 could have been faster with some minor fixes, and
| competitive with contemporary CPUs with some additional larger
| modifications).
|
| https://archive.org/details/432_complexity_paper/
| gumby wrote:
| The underexplored value of early segmentation was the
| discretionary segment level permissions enforced by hardware.
|
| Years ago I prototyped a system that had filesystem permission
| support at the segment level. The idea was you could have a
| secure dynamic library for, say, manipulating the passwd file
| (you can tell how long ago that was). You could call into it if
| you had the execute bit set appropriately, even if you didn't
| have the read bit set, so you couldn't read the memory but
| could call into it at the allowed locations (i.e. PLT was x
| only).
|
| However it was clear everyone wanted to get rid of the segment
| support, so that idea never went anywhere.
| monocasa wrote:
| They made a decent go at it again in 16- and 32-bit protected
| mode. The GDT and LDT, along with task gates, were intended to
| be used as a hardware object-capability system like the iAPX
| 432's.
| kimixa wrote:
| I'm a little confused about how the object base is looked up in
| these systems, whether they're sparse or dense, whether there
| are limits on object size or total object count, and whether
| that ends up hitting the same limits on total count that pushed
| page tables to the current multi-level approach.
|
| Surely you could consider a page table as effectively
| implementing a fixed-size "object cache"? It is just a lookup
| for an offset into physical memory, after all, with the "object
| ID" being the masked first part of the address. And if the
| objects are variable sized, is it possible to end up with
| physical address fragmentation as objects of different sizes
| are allocated and freed?
|
| The claim of single-cycle lookups today would require an
| on-chip, fixed-size (and small!) fast SRAM, as there's a pretty
| hard limit on the amount of memory you can read in a single
| clock cycle, no matter how fancy or simple the logic behind the
| lookup. If we call this area the "TLB", haven't we gotten back
| to page tables again?
|
| And as for the size of the SRAM holding the TLB / object-cache
| entries: increasing the amount of data stored per entry means
| you get fewer entries in total. A current x86_64 CPU supports
| 2^48 bytes of physical address space, so a physical address
| takes 48 bits, reduced to 36 if you know it's 4k-aligned, and
| 2^57 bytes of virtual address space as the tag, again reduced to
| 45 bits if it's 4k-aligned. So storing the tag and physical
| address takes a total of 81 bits of SRAM. A 64-bit object ID
| plus a 64-bit physical address plus a 64-bit size is 192 bits,
| over 2x that, so the conventional scheme can pack more than
| twice as many TLB entries into the same SRAM block. Even trimmed
| to match the example above -- 57 bits of physical address (which
| can't be reduced, since arbitrary sizes mean it isn't aligned)
| plus a similarly reduced 48-bit object ID and 48-bit size -- it
| still adds up to 153 bits, only slightly less than 2x. People
| could argue that reducing the capabilities here has merit; I
| don't know how many objects, or how large, such a system would
| need to support. And that's the "worst case" of 4k pages for the
| page-table system, too.
|
| I can't see how this idea could be implemented without extreme
| limitations - look at the TLB size of modern processors and
| that's the maximum number of objects you could have while meeting
| the claims of speed and simplicity. There may be some advantage
| in making them flexible in terms of size, rather than fixed-size,
| but then you run into the same fragmentation issues, and need to
| keep that size somewhere in the extremely-tight TLB memory.
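|
| The back-of-envelope numbers above, worked through in a few
| lines of C (the bit widths are the ones assumed in this comment,
| not taken from any real TLB design):
|
|   #include <stdio.h>
|
|   int main(void) {
|       /* Conventional TLB entry: 4k-aligned virtual tag + phys frame. */
|       int pte_bits = (57 - 12) + (48 - 12);      /* 45 + 36 = 81 */
|
|       /* Object-cache entry: object ID + physical base + size. */
|       int obj_naive   = 64 + 64 + 64;            /* = 192 */
|       int obj_trimmed = 48 + 57 + 48;            /* = 153 */
|
|       printf("page-table entry: %d bits\n", pte_bits);
|       printf("object entry    : %d bits (naive), %d bits (trimmed)\n",
|              obj_naive, obj_trimmed);
|       printf("size vs PTE     : %.2fx, %.2fx\n",
|              (double)obj_naive / pte_bits,
|              (double)obj_trimmed / pte_bits);
|       return 0;
|   }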
| monocasa wrote:
| > As surely you could consider page table as effectively
| implementing a fixed-size "object cache"? It is just a lookup
| for an offset into physical memory, after all, with the "object
| ID" just being the masked first part of the address? And if the
| objects are variable sized, is it possible to end up with
| physical address fragmentation as objects of different sizes
| are allocated and freed?
|
| Because that's only a base, not a limit. The right pointer
| arithmetic can spill over to any other object base's memory.
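|
| A small C sketch of that point (the two "objects" here are just
| adjacent pages inside one Linux mmap, so the access itself stays
| well-defined): the page tables know only page bases, so nothing
| stops arithmetic on a pointer into A from reaching B, which is
| exactly what a base+limit capability would catch:
|
|   #define _DEFAULT_SOURCE
|   #include <stdio.h>
|   #include <string.h>
|   #include <sys/mman.h>
|
|   int main(void) {
|       /* One two-page mapping stands in for two unrelated objects
|          that happen to be adjacent in the address space. */
|       unsigned char *mem = mmap(NULL, 8192, PROT_READ | PROT_WRITE,
|                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
|       if (mem == MAP_FAILED) { perror("mmap"); return 1; }
|
|       unsigned char *obj_a = mem;          /* "object A": first page  */
|       unsigned char *obj_b = mem + 4096;   /* "object B": second page */
|       strcpy((char *)obj_b, "secret");
|
|       /* Out-of-bounds arithmetic on an A pointer reads B's data;
|          a bounded capability pointer would fault here instead. */
|       printf("%s\n", (char *)(obj_a + 4096));
|
|       munmap(mem, 8192);
|       return 0;
|   }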
| marshray wrote:
| > with the "object ID" just being the masked first part of the
| address?
|
| Doesn't that imply the minimum-sized object requires 4K
| physical ram?
|
| Is that a problem?
| kimixa wrote:
| Maybe? If you just round each "object" up to 4k then you can
| implement this using the current PTEs on x86_64, but that
| removes the (supposed) advantage of needing only a single PTE
| per object (or "object cache" lookup entry, or whatever you want
| to call it) in cases where an object spans multiple pages of
| data.
|
| Having arbitrarily sized objects is likely possible in hardware
| -- it's just an extra size stored in the PTE, if you can mask
| the object ID out of the address (in the example in the original
| post it's a whole 64-bit object ID, allowing a full 64 bits of
| offset within each object, but effectively totaling a HUGE
| 128-bit address).
|
| But arbitrary sizes feel like they push the issues that many
| userspace allocators have to deal with today into the
| hardware/microcode -- namely packing to cope with fragmentation
| and the like (only instead of virtual address space, they would
| have to deal with physical address space). Today's solutions to
| this are certainly non-trivial and can still fail in many ways,
| so it is far from solved, let alone solved in a way simple
| enough to implement that close to the hardware.
| avodonosov wrote:
| Since this addressing scheme is <object, offset>, and these
| pairs need to fit in 64 bits, I am curious: is the number of
| bits for each part fixed, and what are those widths? In other
| words, what is the maximum possible offset within one object,
| and the maximum number of objects?
|
| Probably segment registers in x86 can be thought of as object
| identifiers, thus allowing the same non-linear approach? (Isn't
| that the very purpose of segments?)
|
| Update: BTW, another term for what the author calls "linear" is
| "flat".
| monocasa wrote:
| Yeah, x86 segments in the protected modes were intended to be
| used as a hardware object capability system like the author is
| getting at.
|
| And yeah, it's probably a fixed 64bit lookup into an object
| descriptor table.
| marshray wrote:
| Wouldn't it be hilarious if the 21st century brought about
| the re-adoption of the security design features introduced in
| the 80286 (1982)?
| monocasa wrote:
| I came this close to ordering custom "Make the LDT Great
| Again" hats after spectre was released, lol.
| dragontamer wrote:
| > Why do we even have linear physical and virtual addresses in
| the first place, when pretty much everything today is object-
| oriented?
|
| Well, GPU code is certainly not object-oriented, and I hope it
| never becomes that. SIMD code won't be able to jump between
| objects like typical CPU-oriented OOP does (unless all objects
| within a warp/workgroup jump to the same function pointers?)
|
| GPU code is common in video games. DirectX needs to lay out its
| memory very specifically as you write out the triangles and
| other vertex/pixel data for the GPU to later process. This
| memory layout is then copied over PCIe using the linear address
| space mechanism, and GPUs are now coherent with this space
| (thanks to Shared Virtual Memory).
|
| So today, thanks to shared virtual memory and advanced atomics,
| we can have an atomic compare-and-swap coordinate CPU and GPU
| code operating on the same data (and copies of that data can be
| cached in CPU RAM or GPU VRAM and transferred automatically with
| PCIe memory barriers and whatnot).
|
| ----------
|
| Similarly, shared linear address spaces operate over RDMA
| (remote direct memory access), a protocol that commonly runs on
| top of Ethernet. Your linear memory space is mmap'd on your CPU,
| but it actually refers to someone else's RAM over the network;
| the mmap turns all those "inefficient pointer traversals" into
| Ethernet packets that share RAM between CPUs.
|
| Ultimately, when you start dealing with high-speed data sharing
| between "external" compute units (i.e. a GPU, or an Ethernet-
| connected far-away CPU) rather than "just" a NUMA node or other
| nearby CPU, the linear address space seems ideal.
|
| --------
|
| Even the most basic laptop, or even cell phone, these days is a
| distributed system consisting of a CPU + GPU. Apple chips even
| have a DSP and a few other elements. Passing data between all of
| these things makes sense in a distributed linear address space
| (albeit a really wonky one, with PCIe, mmaps, base address
| pointers and all sorts of complications... but they are figured
| out, and it does work every day).
|
| I/O devices working directly in memory is only going to become
| more common. 100Gbps network connections exist in supercomputer
| labs, and 10Gbps Ethernet is around the corner for consumers.
| NVMe drives are pushing I/O to bandwidths that would make DDR2
| RAM blush. GPUs are growing more complicated and are rumored to
| start turning into distributed chiplets soon. USB 3.0 and beyond
| are high-speed links that drop data directly into linear address
| spaces (or so I've been told). Etc., etc.
| edave64 wrote:
| There is often quite a significant distance between the
| beautiful, elegant, efficient design that brings tears to a
| designer's eyes and what is pragmatic and financially viable.
|
| Building a new, competitive processor architecture isn't
| feasible if you can't at least ensure compile-time compatibility
| with existing programs. People won't buy a processor that won't
| run their programs.
| ajb wrote:
| This article compares CHERI to an 80's computer, the Rational
| R1000 (which I'm glad to know of). It's worth noting that CHERI's
| main idea was explored in the 70's by the CAP computer[1]. CAP
| and CHERI are both projects of the University of Cambridge's
| Computer Lab. It's fairly clear that CAP inspired CHERI.
|
| [1] https://en.wikipedia.org/wiki/CAP_computer
| yvdriess wrote:
| Are you sure it wasn't done before by IBM in the '60s? That's
| usually the case, for hardware at least.
|
| For software, it was usually done before by Lisp in the '70s.
| Animats wrote:
| The original machines like that were the Burroughs 5000
| (1961), and the Burroughs 5500 (1964), which was quite
| successful. Memory was allocated by the OS in variable length
| chunks. Addresses were not plain numbers; they were more like
| Unix paths, as in /program/function/variable/arrayindex.
|
| That model works, but is not compatible with C and UNIX.
| heavenlyblue wrote:
| How would you address recursive functions this way?
| EvanAnderson wrote:
| You beat me! CHERI totally made me think about those
| machines.
|
| There's some good background here for those who are interested:
| https://www.smecc.org/The%20Architecture%20%20of%20the%20Bur...
|
| The architecture of the B5000 / B5500 / B6500 lives on
| today in the Unisys ClearPath line. I believe the OS, MCP,
| is one of the longest-maintained software operating systems
| still in active use, too.
| monocasa wrote:
| IBM didn't really play with hardware object capabilities until
| the S/38, and even then it's a bit of a stretch to call them
| that.
| cmrdporcupine wrote:
| Another system that had an object-based non-linear address space
| I believe was the "Rekursiv" CPU developed at Linn (yes, the
| Swedish audio/drum machine company; EDIT: Linn. Scottish. Not
| drum machine. Thanks for the corrections. In fact I even knew
| this at one time. Yay brain.) in the 80s.
|
| https://en.wikipedia.org/wiki/Rekursiv
|
| I actually have a copy of the book they wrote about it here
| somewhere. I often fantasize about implementing a version of it
| in FPGA someday.
| Gordonjcp wrote:
| > Linn (yes, the Swedish audio/drum machine company) in the 80s
|
| Uhm.
|
| Linn the audio company, known as Linn Products, are Scottish,
| being based a little to the south of Glasgow, and named after
| the park the original workshop was beside.
|
| Linn the drum machine company, known as Linn Electronics, were
| American, being founded by and named after Roger Linn.
|
| Two totally different companies, run by totally different
| people, not connected in any way, and neither of them Swedish.
|
| The Linn Rekursiv was designed by the audio company, and was
| largely unsuccessful, and none exist any more - not even bits
| of them :-/
| cmrdporcupine wrote:
| oops :-)
| kwhitefoot wrote:
| Surely Linn is Scottish.
| martincmartin wrote:
| "Unsafe at Any Speed" is the name of Ralph Nader's book on car
| manufacturers resisting car safety measures. It resulted in the
| creation of the United States Department of Transportation in
| 1966 and the predecessor agencies of the National Highway Traffic
| Safety Administration in 1970.
| akdor1154 wrote:
| > They also made it a four-CPU system, with all CPUs operating in
| the same 64-bit global address space. It also needed a good 1,000
| amperes at 5 volts delivered to the backplane through a dozen
| welding cables.
|
| That is absolutely terrifying.
| buildbot wrote:
| These days you just use 12 V and convert right next to, or on,
| the die - but we are still in that range of amps for big chips!
| Take for example a 3090 at 500 W: the input is 12 V, but the
| core runs at about 1.056 V, and that's 473 amps!
___________________________________________________________________
(page generated 2022-06-29 23:00 UTC)