[HN Gopher] A 1000-processor chip powered by an AA battery (2016)
___________________________________________________________________
A 1000-processor chip powered by an AA battery (2016)
Author : Bluestein
Score : 84 points
Date : 2021-02-15 09:39 UTC (13 hours ago)
(HTM) web link (www.ucdavis.edu)
(TXT) w3m dump (www.ucdavis.edu)
| Bluestein wrote:
| "The chip is the most energy-efficient 'many-core' processor ever
| reported, Baas said. For example, the 1,000 processors can
| execute 115 billion instructions per second while dissipating
| only 0.7 Watts, low enough to be powered by a single AA battery.
| The KiloCore chip executes instructions more than 100 times more
| efficiently than a modern laptop processor."
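|
| A quick back-of-the-envelope sketch of what the quoted figures imply
| per instruction. Only the two numbers come from the press release;
| the variable names are illustrative.

```python
# Figures quoted from the press release above.
instructions_per_second = 115e9   # 115 billion instructions per second
power_watts = 0.7                 # dissipated power in watts

# Energy per instruction = power / throughput (joules per instruction).
joules_per_instruction = power_watts / instructions_per_second
picojoules_per_instruction = joules_per_instruction * 1e12

# Throughput per joule, a common efficiency figure of merit.
instructions_per_joule = instructions_per_second / power_watts

print(f"{picojoules_per_instruction:.2f} pJ per instruction")
print(f"{instructions_per_joule / 1e9:.0f} billion instructions per joule")
```

| That works out to roughly 6 pJ per instruction at the quoted
| operating point.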
| karmakaze wrote:
| Yeah, thought it was interesting that the subject chip didn't
| have a heat sink/fan and one supporting? chip on the board
| behind it did.
| Bluestein wrote:
| (Reminds me of this other chip - built by a "giant" of
| computing - that was designed to be -very- energy efficient,
| while having a very large number of very simple cores ...
|
| ... ran a special FORTH if I am not mistaken ...
|
| I cannot remember where I found a very nice presentation the
| gent gave about it ... :/ )
| palsecam wrote:
| http://www.GreenArrayChips.com/ by Chuck Moore, father of
| Forth.
| ajb wrote:
| There were previous iterations as well. I think this was
| all funded by some patent they licenced to Intel.
|
| The most astonishing part was that he rolled his own VLSI
| CAD system:
| https://mschuldt.github.io/www.colorforth.com/vlsi.html
| Bluestein wrote:
| Wow.
| brandonmenc wrote:
| A video of him demonstrating that system:
|
| https://www.youtube.com/watch?v=Dbd7Xu0ibJM
| Bluestein wrote:
| So kind :)
|
| Thank you.-
|
| Right on the dot ...
|
| (I found it fascinating, when I first saw it :)
| agumonkey wrote:
| There were two talks by Moore on GA chips. Multi-core Forth
| processors were (and still are) so inspiring.
| Bluestein wrote:
| That's it, indeed :)
| amelius wrote:
| One of the big difficulties with conventional processors is the
| memory hierarchy, where the entire memory is viewed as a shared
| memory space. There is no information in the article on how these
| processors communicate, so I'm guessing that this processor side-
| stepped the entire shared memory issue by offering only local
| memory combined with a simpler form of communication (e.g.
| message passing).
| vidarh wrote:
| This is reminiscent of Parallella [1]/Epiphany (which
| unfortunately failed to gain traction). They didn't release any
| designs with core counts like that, but the entire point of the
| design was to enable large numbers of cores with in-core memory
| and predictable latency for accessing the memory of cores
| elsewhere in the grid.
|
| Epiphany did have (slower) access to external memory too, but
| the challenge of these architectures is finding problems that
| require too much branching to be able to take proper advantage
| of GPUs, yet so parallel that relatively low single-core
| performance is outweighed by throwing more cores at it and/or
| where the low power usage of the Epiphany would be worth it,
| and it seems they just didn't find enough of a market for it.
|
| I have two of the kickstarter boards sitting around. Wish
| they'd gotten larger scale versions out.
|
| [1] https://www.parallella.org/
| UncleOxidant wrote:
| I have two of those boards too. They were working on a 1024
| core version, but alas, they didn't survive as a company to
| make it.
| myself248 wrote:
| Also arguably spiritual descendants of the Transputer?
| vidarh wrote:
| Yes, there are definitely similarities in concept. The
| whole idea of an on-chip network between cores stems from
| the Transputer as far as I know.
| bee_rider wrote:
| This makes sense, everyone loves debugging MPI programs, so of
| course we'd want to replicate that experience on the desktop.
| djmips wrote:
| You might have to be dragged kicking and screaming to the
| multicore because CPUs aren't getting faster.
| bee_rider wrote:
| The distinction here is between distributed and shared
| memory, not between single and multicore programming.
| Distributed memory is more often used on clusters. I would
| say distributed memory is generally considered easier on
| the hardware, harder on the programmer (although one could
| argue that a message passing communication scheme is more
| explicit than fork/join, so the programmer has more
| control).
| tornato7 wrote:
| Yep, at KnuEdge we built some processors with thousands of
| cores in that style architecture. It's a bit different to
| reason about but you can do a lot of wild computing that way.
| Too bad the company folded before any of our processors were
| available.
| nextaccountic wrote:
| Yes, "Cores operate at an average maximum clock frequency of
| 1.78 GHz, and they transfer data directly to each other rather
| than using a pooled memory area that can become a bottleneck
| for data."
| cromwellian wrote:
| Reminds me of the PS3 Cell architecture, with the SPUs having
| limited local memory, and needing to DMA stuff around in a
| streaming fashion to build up larger programs.
| rbanffy wrote:
| The Cell was a pain to program because of that hateful memory
| architecture. The anemic PowerPC CPUs didn't help much
| either.
|
| Nothing really prevents an architecture where part of the
| address space is core-local (with low latency) and not
| directly accessible from other cores, and the rest points to
| a (much higher latency) shared memory pool. I have a feeling
| this would be much nicer to program than a Cell or the
| average GPU.
| djmips wrote:
| Nevertheless people were able to harness the Cell for good
| performance. It took a re-think and I am not denigrating
| your experience but our experience was that we benefited
| everywhere when we reorganized our data and algorithms
| around the SPUs.
| breatheoften wrote:
| Seems like the way to go ... maybe even 3 layers of memory
| -- core local memory, memory that provides shared reader
| and writer consistency, and memory that provides shared
| reader, single writer consistency (for ownership
| transfers). Maybe the second form doesn't actually need to
| exist at all?
|
| Minimizing ownership transfer cost across hardware elements
| seems like a fundamental concept to me that future
| architectures will need to specifically optimize in order
| to maximize the benefits that can be reaped from "task
| specific" hardware.
|
| The more cores we add the more we have to gain by allowing
| them to be different from each other I think ...
| jabl wrote:
| > Nothing really prevents an architecture where part of the
| address space is core-local (with low latency) and not
| directly accessible from other cores, and the rest points
| to a (much higher latency) shared memory pool. I have a
| feeling this would be much nicer to program than a Cell or
| the average GPU.
|
| Many GPUs have this kind of local scratchpad memory.
| Nvidia GPUs even allow the programmer to partition the
| local SRAM between the scratchpad memory (called shared
| memory in CUDA docs) and L1 cache.
| rbanffy wrote:
| It's a shame nobody came up with an OS that runs entirely
| on a GPU.
| als0 wrote:
| That was the PS3.
| [deleted]
| FpUser wrote:
| This is very interesting stuff. I'd love for these kinds of chips
| to be sold in the form of coprocessor cards for PCs, for example.
| For some type of servers and other uses it'd be a blessing.
| SuchAnonMuchWow wrote:
| It does exist in various forms. One I'm familiar with is Kalray
| MPPA (massively parallel processor array):
| https://www.kalrayinc.com/technology/ Their current generation
| of processors have 80 cores exposed to the userspace (the
| previous one had 256), and they are mainly used as PCIe
| acceleration boards for all kinds of applications.
|
| But as another poster said, the difficulty with those kinds of
| architecture is the memory hierarchy: you don't want all the
| cores to access the DDR directly, as that would be a massive
| bottleneck. And this becomes a larger issue as you increase the
| number of cores.
| jpm_sd wrote:
| Those are called GPUs! Or, if you like, GPGPUs.
|
| Readily available in The Cloud(TM) and everything.
|
| https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
| FpUser wrote:
| Nope. GPUs are SIMD. KiloCore is however 1000 totally
| independent cores.
| captainbland wrote:
| My understanding is that on Nvidia cards, each SM is
| basically an independent SIMD processor. So a RTX 3090 is
| like a processor with 82 independent but super-wide cores.
| I think the truth is actually even more complicated still
| with multiple warps being able to execute on a single SM
| with some memory subsystem limitations but it's something
| like that. The equivalent is true with AMD GPUs, too, just
| substitute the terminology where appropriate.
|
| Either way still not close to 1000 independent cores.
| desertrider12 wrote:
| In a way, there are many more independent cores in an
| Nvidia GPU than #SMs, since the SIMD (warp) is only 32
| threads wide, and >1k threads can be active at a time per
| SM. Each warp has its own independent program counter. So
| if you really wanted to, you could have a kernel with
| blocks that are Nx32, and code like:
|
|     if (blockIdx.x == 0) doThing1();
|     else if (blockIdx.x == 1) doThing2();
|     ...
|
| etc., and the doThings could run at the same time, even
| on the same SM. Not super practical but it's allowed.
| dang wrote:
| Discussed at the time:
| https://news.ycombinator.com/item?id=11935999
| superrad wrote:
| Should we be training on TIS-100[0] to be ready for when these
| chips become a necessity?
|
| [0]https://en.wikipedia.org/wiki/TIS-100
| Zenst wrote:
| Some more detailed information here:
| https://en.wikichip.org/wiki/uc_davis/kilocore
|
| No idea how it has progressed or if sample units became
| accessible - which would be nice.
| kingosticks wrote:
| That's a good reference.
|
| I had assumed there were 1000 instruction memory SRAMs in there
| but of course that'd be impossible considering the leakage
| numbers and there's actually just 12 x 64KB SRAMs shared by all
| the cores:
|
|     Per core:  640 bytes (128x40-bit) local instruction memory
|                512 bytes (256x16-bit) local data memory
|     On-die:    768 KB SRAM
|                12 shared SRAM memory modules, 64 KB each
|
| Those local per-core memories are tiny and presumably
| implemented with registers. But can you really do much with
| just 128 instructions? The following makes it sound like they
| might be instruction caches:
|
| > Instructions may come from the local instruction memory or
| they may be fetched from one of the independent memory module
|
| But then it goes on to say:
|
| > the KiloCore's processors do not contain traditional caches.
|
| Hopefully their paper has more info.
|
| The survey that the "World's First 1,000-Processor Chip" claim
| is based on (http://vcl.ece.ucdavis.edu/misc/many-core.html)
| doesn't seem to include any network processors, some of which
| are _technically_ programmable. But even if it had, they
| wouldn't have found any of the information required for the
| table and definitely not a datasheet!
| klyrs wrote:
| This is kinda wild. It seems like one could implement this in a
| large FPGA today and reasonably expect to hit a similar clock
| frequency.
| karmakaze wrote:
| In addition to the memory access I was wondering how they got
| to 1000 cores and not 1024.
|
| > The chip is designed as a massively parallel processor array,
| with 992 cores arranged as a grid 32 by 31. Eight additional
| cores are found along with 12 memory modules of 64 KB SRAM each
| (for a total of 768 KB). Communication between cores is done
| via a dual-layer source-synchronous circuit-switched network
| and a very-small-area packet router (see wormhole routing). The
| circuit-switched network supports communication between
| adjacent and distant processors, as resources allow, with each
| link supporting a maximum rate of 28.5 Gbps. Maximum throughput
| is 45.5 Gbps per router. Both network types contribute to an
| array bisection bandwidth of 4.2 Tbps.
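|
| The arithmetic behind the 1000 figure, as a small sketch using only
| the numbers in the quote above (variable names are illustrative):

```python
# Figures quoted from the description above.
grid_rows, grid_cols = 32, 31
extra_cores = 8
memory_modules = 12
module_kb = 64

grid_cores = grid_rows * grid_cols          # cores in the rectangular array
total_cores = grid_cores + extra_cores      # array plus the eight additional cores
total_sram_kb = memory_modules * module_kb  # shared on-die SRAM in KB

print(grid_cores, total_cores, total_sram_kb)  # 992 1000 768
```

| So the count is 32 x 31 + 8 = 1000 by layout, rather than a rounded
| 1024.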
| AtlasBarfed wrote:
| Process/Fab errors?
|
| 24 cores of failure out of 1000 would be an error rate around
| 2%, which would help in yields.
| oblio wrote:
| We're probably a programming language revolution away from these
| kinds of CPUs being practical. Isn't parallelism complicated and
| practically dangerous in most mainstream languages?
|
| Is there any language where you can just go:
| parallel-for: do-my-stuff
|
| and have the programming language Do The Right Thing with no
| further hassle? The right thing usually being blocking execution
| until every parallel do-my-stuff finishes execution, in my
| experience.
|
| In most programming languages I've seen this is a 20+ line boiler
| plate full of caveats and risks.
| PartiallyTyped wrote:
| In python, with numpy, you can use `np.vectorize`, there exists
| a jax equivalent that can enable accelerators to do your
| parallel stuff. Cuda exists, but it isn't as simple as
| `parallel-for`. For trivial for-loops, you can use OMP.
| rand_r wrote:
| Unfortunately np.vectorize doesn't do anything smart.
|
| From the docs:
|
| The vectorize function is provided primarily for convenience,
| not for performance. The implementation is essentially a for
| loop.
| PartiallyTyped wrote:
| Ah right, yes I remember. Perhaps .vmap from pytorch?
| Someone wrote:
| In Scala
| (https://docs.scala-lang.org/overviews/parallel-collections/o...):
|
| Serial: list.map(_ + 42)
|
| Parallel: list.par.map(_ + 42)
|
| In C#
| (https://docs.microsoft.com/en-us/dotnet/standard/parallel-pr...)
|
| Serial: for (int i = 0; i < matARows; i++)
|
| Parallel: Parallel.For(0, matARows, i =>
|
| (Function takes a lambda that gets called for each value)
|
| Swift has DispatchQueue.concurrentPerform
| (https://developer.apple.com/documentation/dispatch/dispatchq...).
| That is a bit less clear and less flexible (won't iterate over an
| array's values, only over its indexes) but also fairly easy to use.
|
| There probably are many more languages with similar constructs.
| snak wrote:
| Yes, I do agree C# does have many constructs (Parallel
| For/ForEach/Invoke...) but there are many things to consider
| that are not designed with thread safety in mind (e.g. the
| generic collections List<T>, Dictionary<TKey, TValue>, etc. must
| be replaced by their System.Collections.Concurrent counterparts)
| and many other risks that parallel computing poses.
|
|     int x = 0;
|     Parallel.For(0, 99999, z => { x++; });
|     Console.WriteLine(x);
|
| Something as simple as an integer sum won't produce the
| expected output unless you're using Interlocked increments.
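|
| For comparison, a minimal Python sketch of the guarded version of
| the same loop. This is an analogue, not the C# API: a
| threading.Lock stands in for Interlocked, and the names and the
| worker count are illustrative.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

ITERATIONS = 99_999  # same iteration count as the C# snippet above

counter = 0
counter_lock = threading.Lock()


def increment(_):
    """Add one to the shared counter while holding the lock."""
    global counter
    with counter_lock:  # makes the read-modify-write step atomic
        counter += 1


with ThreadPoolExecutor(max_workers=8) as pool:
    # list() drains the iterator so any worker exception is re-raised here.
    list(pool.map(increment, range(ITERATIONS)))

print(counter)
```

| With the lock in place every increment is counted; the price is
| that the increments themselves no longer run in parallel.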
|
| And that definitely does not match what oblio means with
| "have the programming language Do The Right Thing with no
| further hassle".
| gpderetta wrote:
| #pragma omp parallel for
| oblio wrote:
| What's the operational cost of doing that?
| gpderetta wrote:
| What do you mean exactly?
|
| You need a C/C++/fortran compiler that supports openmp,
| which is most of the mainstream ones (I believe MSVC still
| only supports an older standard, still more than enough for
| parallel for). Compilers that do not support openmp are
| supposed to ignore the pragma and the code will still run
| fine (although serially) and be fully compatible.
|
| Other than that and possibly passing the correct command
| line flag, openmp is fairly low friction. A parallel for
| can provide significant speedup for embarrassingly parallel
| problems and these days it even supports custom random
| access iterators in C++.
|
| Of course it is more involved to express less
| embarrassingly parallelizable problems, but it is still
| doable.
| Faaak wrote:
| It's built into Prolog's logic and its derivatives; alas, they
| are rarely used in the industry :'(
| jiofih wrote:
| Pony? https://www.ponylang.io/
| hvidgaard wrote:
| It's simple to do what you write, multiple languages already do
| just that. It's the class of problems that is called
| "embarrassingly parallel"
| (https://en.wikipedia.org/wiki/Embarrassingly_parallel).
| However, it turns out that it places some rather serious
| restrictions on the kind of calculations you can perform this
| way. In general for two pieces of code to be executed in
| parallel, they must be independent of each other. I.e. for the
| following:
|
|     1: tmpA := A(input)
|     2: tmpB := B(tmpA)
|     3: tmpC := C(tmpB)
|     4: result := D(tmpC)
|
| it's impossible to calculate a line before the previous line
| has finished. You cannot do this calculation concurrently. On
| the other hand if it was like this:
|
|     1: resA := A(input)
|     2: resB := B(input)
|     3: resC := C(input)
|     4: resD := D(input)
|
| You can calculate all 4 concurrently and achieve a great speed-
| up if you have resources to do it. Except if B, C, or D somehow
| access the result of a previous line. For mathematical notation
| this is simple, but for general programming languages there are
| plenty of ways to do this.
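|
| The two cases above can be sketched in Python as follows. A, B, C
| and D here are hypothetical stand-ins chosen to be pure functions,
| and the thread pool is just one way to run the independent calls.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for A, B, C and D; each is a pure function.
def A(x):
    return x + 1


def B(x):
    return x * 2


def C(x):
    return x - 3


def D(x):
    return x ** 2


value = 5

# Dependent chain: each step needs the previous result, so it stays serial.
chained = D(C(B(A(value))))

# Independent calls: all four only read `value`, so they can run concurrently.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(f, value) for f in (A, B, C, D)]
    independent = [future.result() for future in futures]

print(chained)      # 81
print(independent)  # [6, 10, 2, 25]
```

| Note that in standard CPython, threads mainly help with I/O or
| code that releases the GIL; the sketch shows the dependency
| structure rather than a measured speed-up.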
|
| We know that an optimal optimization, i.e. finding the fastest
| way a program can be executed is known uncomputable. Not
| difficult, not brute forceable, but impossible to compute with
| traditional computers. The best we can do is identify _some_
| concurrency optimizations, but so far it has not been a
| fruitful adventure for general programming languages. So what
| we can do is use patterns and constructs that rely on dividing
| the problems into concurrent pieces and use proper guarding on
| shared state. This is however notoriously easy to get wrong.
| chmod775 wrote:
| > We know that an optimal optimization, i.e. finding the
| fastest way a program can be executed is known uncomputable.
| Not difficult, not brute forceable, but impossible to compute
| with traditional computers.
|
| I'm really curious about the proof of this, since there's
| only a finite number of ways you can re-arrange instructions
| in a program. Likewise there's only a finite number of ways
| you could shard them across threads.
|
| For this to be true you'd clearly need to phrase the problem
| in a way that allows you to have an infinite number of
| possibilities, then prove you can't arrive at the optimal
| solution through some means that doesn't involve checking an
| infinite number of them. (Edit: Or prove that it is
| impossible to decide on which one is the fastest at the time
| the optimizer runs. When does the optimizer run?).
|
| Was the assumption that you'd also be (infinitely) unrolling
| loops or maybe 'rewriting' the program into something
| equivalent but faster? What does "finding the fastest way a
| program can be executed" mean? Just re-ordering instructions
| and distributing them across threads, or something more?
| Denzel wrote:
| Finding the optimal execution time of an arbitrary program
| is equivalent to the halting problem. [1] If you narrow the
| "arbitrary" constraint to only include well-structured,
| analyzable, guaranteed-to-terminate programs, then you can
| at least start to approximate a solution. Finding the true
| optimal case, even under those conditions, would be
| computationally expensive.
|
| [1]: https://en.wikipedia.org/wiki/Halting_problem
| chmod775 wrote:
| This wasn't about anything related to optimal execution
| time. That an optimizer needs to know whether the code it
| optimizes terminates (or how long it runs) would also
| need proof if you'd want to go that route.
|
| The assumptions and optimizations the optimizer is
| allowed to use and what kinds of programs we're talking
| about, and what even is considered an "optimal program"
| is also still unclear (Edit: I'm assuming lowest number
| of instructions executed serially?).
|
| Please do refer me to a paper. Please do not link me
| vaguely related Wikipedia articles.
| vidarh wrote:
| It's equivalent to the halting problem because there
| exists a set of problems for which for any input that
| optimiser creates the optimal solution for, there exists
| another input for which that problem is not optimal. The
| parallel to the halting problem is that with access to
| the output of the optimiser, you can always construct a
| problem where whatever the optimiser produces can be
| obstructed by producing an input that makes the optimised
| program non-optimal for the given input.
|
| A trivial example of such a program is a function that
| sorts its input and returns the sorted result, as no sort
| is optimal for all inputs.
| oblio wrote:
| I know about "embarrassingly parallel" problems, and the
| thing is, they're only "embarrassingly parallel" to solve in
| math.
|
| When programming the level of friction ranges from annoying to
| unbearable, depending on the programming language. It
| should be trivial to implement these solutions, as trivial as
| the sequential solutions.
|
| It's not. Until this is the case in mainstream programming
| languages, I doubt we'll have CPUs with 1000 cores in our
| smartphones or laptops. I mean, we'll have them but they'll
| be useless because 99% of software today is practically
| single-threaded. The reason multiple cores are good for
| laptops, for example, is because we're multi-tasking between
| multiple applications, not because those cores are frequently
| used by day-to-day apps. I'm talking about random small apps,
| not compute intensive ones that have to use multi-threading
| intensively (rendering, compiling, what have you,
| professional level software).
| rbanffy wrote:
| > but they'll be useless because 99% of software today is
| practically single-threaded.
|
| I used to joke developers should get workstations with
| SPARC Niagaras or Xeon Phis for that reason: core count is
| going up and a CPU with a dozen small cores is much cheaper
| to build than one with 4 beefy ones. Now some of the low-
| end chips Intel is pushing have 4 SMT2 cores. HEDT is on
| the 16 SMT2 core range and, if our software continues to be
| single threaded, there will be a lot of silicon being used
| for nothing more than spreading heat.
| colejohnson66 wrote:
| Isn't the point of high core counts for workloads that
| can actually _be_ parallelized? Like compilation or
| graphics rendering?
| rbanffy wrote:
| A lot of tasks can be parallelized if you think hard
| enough. Your CPU is busy reordering instructions so they
| keep as many execution ports busy as it can so that it
| can retire as many instructions per core cycle as
| possible. SMT was invented to keep those execution units
| busy by running more than one instruction stream at a
| time.
|
| The incentive to do it when most computers have two SMT2
| cores is small, but as the average device starts getting
| 4 or 8 SMT2 cores, or 8 asymmetric cores, the incentives
| to make things run faster in parallel get better and
| better.
| astrange wrote:
| That wouldn't be good if they're targeting battery-
| powered devices, because you usually don't want to turn
| on extra cores there. Mobile programs can actually go
| faster if you make them single-threaded - either the
| device doesn't have more cores available to run your
| extra threads, or your process isn't high-priority enough
| to spend more power on.
| tenebrisalietum wrote:
| Imagine an OS where each new thread or process gets
| allocated its own core. You'd be limited to 1k
| threads/processes on this CPU, but what would otherwise be
| the benefits/drawbacks?
| spiritplumber wrote:
| Parallax's Spin lets you do this.
| flohofwoe wrote:
| You can get that without a programming language revolution by
| reformulating the problem a bit.
|
| For instance 3D-API shading languages do just that under the
| hood, just without the "parallel-for". Instead you tell a
| 3D-API what code should run per-vertex or per-pixel in
| (usually) a traditional sequential C-style language, and the
| GPU driver and hardware care about the parallelization and
| scheduling.
| bipson wrote:
| Ada is quite good at this.
|
| But not every problem (i.e. the Algorithm for the problem) can
| be divided easily without any coordination (i.e. communication)
| between the chunks. This overhead can quickly become
| significant.
|
| There is a reason why desktop software (particularly the
| computationally intensive) is having a hard time exploiting
| multi-core benefits (besides the programming language and
| habits in general).
| Kuinox wrote:
| Yes, SQL. Write your query and it will be magically
| parallelised without you knowing it.
| nicoburns wrote:
| Rust's Rayon library is pretty good for these use-cases
| https://github.com/rayon-rs/rayon
|
|     use rayon::prelude::*;
|
|     fn sum_of_squares(input: &[i32]) -> i32 {
|         input.par_iter() // <-- just change that!
|              .map(|&i| i * i)
|              .sum()
|     }
| cfstras wrote:
| Java has had this since 8:
|
|     Arrays.asList("hello", "world").parallelStream()
|         .forEach(s -> System.out.println(s));
|
| If you want to calculate stuff, you can do:
|
|     List<String> suffixed = Arrays.asList("hello", "world").parallelStream()
|         .map(s -> s + "_suffix")
|         .collect(Collectors.toList());
| kevin_thibedeau wrote:
| Hardware description languages are parallelized by default.
| legulere wrote:
| This architecture is called a manycore processor. There are
| several programming paradigms that are researched for it, among
| others actor languages, where the idea is to run parts of your
| program on different cores that communicate with each other.
|
| https://ne.wikipedia.org/wiki/Manycore_processor
| leprechaun1066 wrote:
| q has this out of the box: https://code.kx.com/q/basics/peach/
| prennert wrote:
| Don't many languages already have some sort of map
| implementation that parallelizes pure function calls? Even
| Python has ThreadPoolExecutor.map and ProcessPoolExecutor.map.
|
| It gets difficult once you have side effects and/or
| synchronization (join) of the results is required.
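|
| A minimal sketch of the Python case mentioned above, using
| ThreadPoolExecutor.map on a pure function. The function and input
| are illustrative; ProcessPoolExecutor.map has the same shape but
| needs picklable functions and a main-module guard.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    """Pure function: the result depends only on the argument."""
    return n * n


numbers = list(range(10))

# Serial baseline with the built-in map.
serial = list(map(square, numbers))

with ThreadPoolExecutor() as pool:
    # Executor.map yields results in input order, whatever order tasks finish in.
    parallel = list(pool.map(square, numbers))

print(parallel == serial)
```

| Because square has no side effects, the only change from the
| serial version is which map gets called.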
| GTP wrote:
| >Isn't parallelism complicated and practically dangerous in
| most mainstream languages?
|
| Yes, but functional languages solve this problem.
| rbanffy wrote:
| > Is there any language where you can just go:
|
| Nothing prevents a map function from running in parallel.
|
| There are other things that can happen in parallel, however -
| the whole instruction flow is reordered in the CPU before it is
| executed and if you have enough execution ports and enough in-
| flight instructions to keep them fed, you can accomplish a lot
| in a single clock cycle before your source code even needs to
| acknowledge that instructions don't happen exactly in the
| sequence they are written.
| gens wrote:
| While reading the article I thought of Erlang.
|
| Multiple independent programs sending messages to each other
| fits well into multiple independent cores sending messages to
| each other.
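|
| A rough Python sketch of that style, assuming two threads that
| share nothing except a pair of queues. This only imitates the
| Erlang model; the names and the None shutdown sentinel are
| illustrative choices.

```python
import queue
import threading


def doubler(inbox, outbox):
    """Receive numbers, send back doubled values, stop on the None sentinel."""
    while True:
        message = inbox.get()
        if message is None:
            outbox.put(None)  # pass the shutdown signal downstream
            break
        outbox.put(message * 2)


to_worker = queue.Queue()
from_worker = queue.Queue()

worker = threading.Thread(target=doubler, args=(to_worker, from_worker))
worker.start()

for n in (1, 2, 3):
    to_worker.put(n)
to_worker.put(None)  # sentinel: no more work

received = []
while True:
    reply = from_worker.get()
    if reply is None:
        break
    received.append(reply)
worker.join()

print(received)  # [2, 4, 6]
```

| Neither side touches the other's variables; all coordination goes
| through the messages, which is what maps well onto cores with only
| local memory.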
| foerbert wrote:
| I'm not sure this is the best link, but I don't have anything
| better in mind and it was the first thing that popped up, so...
|
| Anyway, there's a language called ParaSail[0] that's doing
| stuff in that direction. A little bit back the guy behind it
| got hired by AdaCore (Ada compiler folks) and they seem to have
| adopted it and to be continuing with it. I haven't heard too much
| about it though.
|
| [0]https://adacore.github.io/ParaSail/
| 4gotunameagain wrote:
| MATLAB has parfor, but that's barely a language..
| tonyedgecombe wrote:
| C# has it:
| https://docs.microsoft.com/en-us/dotnet/api/system.threading...
|
| Of course when you start doing things in parallel there are
| automatically a number of caveats and risks.
| jackpeterfletch wrote:
| Occam has exactly this. SEQ executes statements sequentially.
| PAR executes them in parallel, though the language is based on
| CSP and Cooperative Multitasking, so the idea of it doing 'the
| right thing' with a PAR is a little different.
|
| Think of Occam programs as big Factorio factories.
|
| https://en.wikipedia.org/wiki/Occam_(programming_language)
|
| A lot of the ideas that came out of Occam and CSP have been brought
| into various coroutine libraries though. Kotlin Coroutines,
| Goroutines, etc. So it's definitely something that is explored.
| jacquesm wrote:
| I was looking for this comment before posting. Thank you.
| api wrote:
| Rust is much safer for parallelism and has a lot of safety
| rules around closures, so this would probably be achievable in
| Rust without much pain via a clever template library.
|
| The problem is that not all problems are easy to parallelize at
| the algorithmic level, and for some it may be impossible.
|
| Alongside massively multi core chips I would also love to see
| more efforts to push the envelope on single threaded
| performance.
| yoshuaw wrote:
| MSVC's C++ Parallel algorithms library gets close [1], and so
| does Rust's Rayon crate [2]. C# has parallel for loops [3], and
| apparently so does PowerShell [4]
|
| Though none of these are quite as convenient as I think they
| could be. I suspect there is a path for Rust to add parallel
| iteration syntax in the future [5]. There are quite a few steps
| needed to get there, but the result could look something like
| this:
|
|     let mut listener = TcpListener::bind("127.0.0.1:8080")?;
|     println!("Listening on {}", listener.local_addr()?);
|
|     par for stream? in listener.incoming() {
|         println!("Accepting from: {}", stream.peer_addr()?);
|         io::copy(&stream, &stream)?;
|     }
|
| [1]: https://devblogs.microsoft.com/cppblog/using-c17-parallel-al...
|
| [2]: https://docs.rs/rayon/1.5.0/rayon/
|
| [3]: https://dotnettutorials.net/lesson/parallel-for-method-cshar...
|
| [4]: https://devblogs.microsoft.com/powershell/powershell-foreach...
|
| [5]: https://blog.yoshuawuyts.com/parallel-stream/#future-directi...
| AtlasBarfed wrote:
| I love GPars for Groovy, but unfortunately it isn't
| CompileStatic compatible or Groovy 3.0 compatible.
| eptcyka wrote:
| What'd be the concurrency primitive used in the parallel
| block? Is scheduling done per loop or globally? All of these
| questions are trivial if there's a runtime and become very
| cumbersome if the language has ambitions to be usable without
| any standard libraries.
| volta83 wrote:
| If you use the standard library, its parallel run-time gets
| automatically linked. This is often the right choice when
| building binaries for a platform with an operating system,
| etc.
|
| If you do not include the standard library _AND_ if you use
| these features, then you'd need to provide your own
| parallel run-time or your application won't compile.
|
| There are many valid implementations of these run-times
| tuned for different applications, these run-times can be as
| simple or complex as you'd like, in some cases some run-
| times only work on particular bare metal targets, and in
| most cases you just pull a library with the implementation
| that you'd want to use.
|
| Since a valid implementation of the run-time is to just run
| all tasks sequentially, you can provide a tiny (often
| "zero-size") run-time for those environments on which code-
| size trumps everything.
|
| You can also provide the standard parallel run-time, but
| this might require operating-system like functionality,
| like thread, mutex, etc. support. If your embedded target
| does not provide those, you'd need to provide
| implementations of these yourself.
|
| The important thing is that you don't need to change any
| code that uses the run-time, e.g., all the app code can
| still use fork-join parallelism, but you control how that's
| executed by swapping the modular runtime component.
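|
| The same idea sketched in Python rather than Rust: application
| code written against a small run-time interface, with a sequential
| and a threaded implementation swapped in. Both classes and the
| map() interface are hypothetical, made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class SequentialRuntime:
    """Hypothetical minimal run-time: runs every task in the caller, one by one."""

    def map(self, fn, items):
        return [fn(item) for item in items]


class ThreadedRuntime:
    """Hypothetical run-time backed by a thread pool."""

    def __init__(self, workers=4):
        self.workers = workers

    def map(self, fn, items):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, items))


def app_sum_of_squares(runtime, values):
    """Application code: it only talks to the run-time's map(), never to threads."""
    return sum(runtime.map(lambda v: v * v, values))


values = range(1, 6)
print(app_sum_of_squares(SequentialRuntime(), values))  # 55
print(app_sum_of_squares(ThreadedRuntime(), values))    # 55
```

| The application function is identical in both calls; only the
| run-time object passed in decides whether anything actually runs in
| parallel.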
| yoshuaw wrote:
| Yes, exactly. Rust already allows for switching global
| allocators through `#[global_allocator]`. But now also
| allows plugging allocators inline through the allocator-
| api methods.
|
| I'd imagine something similar could be introduced for
| threadpools / runtimes.
| Someone wrote:
| I don't think you can do that well without some global
| state and code to handle it (aka 'a runtime').
|
| It's not like separate parts of a program can each optimize
| for the number of threads and their memory for their goals.
| Optimization/tuning has to happen globally because the
| resources used (threads, memory) are shared between those
| parts.
|
| I think most solutions have some global state that
| magically decides how many threads to run in parallel (more
| for I/O bound code) and how many to assign to each part.
| The parts don't even have to be perfectly separate: part P
| might parallelize a loop where each iteration calls into
| part Q, which also runs a parallel loop. If each of these
| decides to run one thread per CPU, for N threads in total,
| the result could/would still be that N^2 threads run in
| parallel, likely with suboptimal results.
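|
| The oversubscription is easy to reproduce. In this Python
| sketch (the names `part_p`/`part_q` and N = 3 are arbitrary
| choices for the example), each part sizes its own pool to N
| workers, and a barrier holds every inner task until all of
| them are alive so the peak can be counted.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

N = 3  # stand-in for "one thread per CPU"

lock = threading.Lock()
running = 0   # inner tasks currently alive
peak = 0      # highest value `running` ever reached
all_started = threading.Barrier(N * N, timeout=10)


def inner_task(_):
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    all_started.wait()  # released only once all N*N inner tasks are running
    with lock:
        running -= 1


def part_q(_):
    # Q sizes its own pool to N, unaware that P already did the same.
    with ThreadPoolExecutor(max_workers=N) as pool:
        list(pool.map(inner_task, range(N)))


def part_p():
    with ThreadPoolExecutor(max_workers=N) as pool:
        list(pool.map(part_q, range(N)))


part_p()
print(f"N = {N}, peak inner tasks in parallel = {peak}")
```

| The peak comes out as N * N, on top of the N outer threads
| that sit blocked waiting for their inner loops.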
| mhh__ wrote:
| D does this as a library, i.e. just add `parallel` to your
| foreach aggregate.
|
| As to languages built around this, there are quite a few, e.g.
| Intel have Data Parallel C++
| peter_d_sherman wrote:
| This might make a good candidate architecture to become a future
| GPU (see discussion here):
|
| Nyuzi - An Experimental Open-Source FPGA GPGPU Processor
|
| https://news.ycombinator.com/item?id=26132726
| guerrilla wrote:
| I can't escape the nagging question: is it really 1,000 or
| actually 1,024 and simplified for the press release?
| Bluestein wrote:
| Useful comment upthread :)
|
| - https://news.ycombinator.com/item?id=26143976
| kitd wrote:
| _Cores ... transfer data directly to each other rather than using
| a pooled memory area that can become a bottleneck for data_
|
| This sounds like transputers. IIRC they had direct memory linkage
| too.
| thawkins wrote:
| Yep, each transputer had 4 serial links, which could be
| connected in matrices and allowed data to be moved across
| the array directly. They used a language called OCCAM, and
| there was an OS called TAOS that allowed you to run various
| transputer topologies.
| kitd wrote:
| OCCAM!! That was it! Thx I was trying to remember the name of
| the language they used.
|
| I wonder whether OCCAM would be usable here, or if not, then
| something based on the lessons learnt from it.
| TickleSteve wrote:
| OCCAM implemented CSP (Communicating Sequential Processes),
| which Go took a lot of lessons from.
|
| (https://en.wikipedia.org/wiki/Communicating_sequential_pro
| ce...)
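|
| For anyone who hasn't seen the style: a rough Python
| approximation of CSP-like message passing, with "processes"
| that share nothing and talk only over channels.
| `queue.Queue` is buffered, so this is not occam's
| synchronous rendezvous, and the names (`producer`,
| `squarer`, `DONE`) are invented for the example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def producer(out_ch: queue.Queue, n: int) -> None:
    """Send 0..n-1 down the channel, then signal completion."""
    for i in range(n):
        out_ch.put(i)
    out_ch.put(DONE)


def squarer(in_ch: queue.Queue, out_ch: queue.Queue) -> None:
    """Receive values, square them, pass them on; no shared state."""
    while True:
        item = in_ch.get()
        if item is DONE:
            out_ch.put(DONE)
            return
        out_ch.put(item * item)


# Capacity 1: a sender blocks while the single slot is still full.
numbers = queue.Queue(maxsize=1)
squares = queue.Queue(maxsize=1)

threads = [
    threading.Thread(target=producer, args=(numbers, 5)),
    threading.Thread(target=squarer, args=(numbers, squares)),
]
for t in threads:
    t.start()

# The main thread acts as the final consumer in the pipeline.
received = []
while True:
    item = squares.get()
    if item is DONE:
        break
    received.append(item)

for t in threads:
    t.join()
```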
| TickleSteve wrote:
| TAOS also ran on many other architectures (68K, x86, ARM). It
| implemented a virtual processor (VP1) and evolved into Elate
| but didn't get much recognition despite its novelty.
|
| TAOS was definitely ahead of its time, but rarely gets
| mentioned unfortunately.
| rbanffy wrote:
| Dick Pountain wrote a lot of articles for BYTE about that.
|
| This is one: https://sites.google.com/site/dicknewsite/home/c
| omputing/pri...
| DrNosferatu wrote:
| ...from 2016. Back in grad school, I did a comparison between the
| (then new) Xeon Phi and GPUs to run my own N-Body engine using the
| same OpenCL implementation - the rest is history: GPUs still rule
| today for high intensity computational needs. Look at Machine
| Learning: GPUs are _so good_ at SIMD that, if your domain can
| benefit from faster computation, _you should make it SIMD_
| if it's not already. ;)
| klelatti wrote:
| It's sad that OpenCL development seems to have lost its way. I
| have some OpenCL workloads where running on a CPU is faster
| and some where GPUs are much quicker. The beauty of OpenCL is
| that I can choose between the two with zero code changes.
|
| But CUDA and ML have steamrollered everything!
| MayeulC wrote:
| I heard of Halide quite recently, and find it promising:
| https://halide-lang.org/
|
| Not only can you decide on CPU vs GPU, you also have more
| control over the kind of parallelism and intermediate results.
| lasagnaphil wrote:
| I really hope SYCL (https://en.m.wikipedia.org/wiki/SYCL) can
| gain some traction (since Intel also jumped on the GPU
| competition and are backing a substantial amount of it)
| DrNosferatu wrote:
| Indeed, OpenCL runs on everything - if you can set it up...
| :/
|
| Actually, at the time, what I did was a 3-way comparison
| between GPU, Xeon Phi _and_ classic CPU (was it 8 cores?)
| running the same OpenCL code. Anyways, the "many-core"
| specimen, Xeon Phi, benched at the geometric mean between
| classic CPU and GPU - so closer to classic CPU! (We were
| using beta hardware; our Intel specialist - nice, helpful guy
| actually - went mute after I got those results :D ) On top of
| that, reimplementing the same algo in CUDA gave a ~10%
| advantage over OpenCL on NVidia hardware. I wonder how it
| would fare on an FPGA of the same price/TDP?
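|
| To illustrate why "at the geometric mean" works out as
| "closer to classic CPU": a tiny Python calculation with
| made-up relative speeds (1x for the CPU, 100x for the GPU;
| these are not the numbers from that benchmark).

```python
import math

# Hypothetical relative speeds, NOT measured results.
cpu_speed = 1.0
gpu_speed = 100.0

geometric_mean = math.sqrt(cpu_speed * gpu_speed)
arithmetic_mean = (cpu_speed + gpu_speed) / 2

# Distance from each end on a linear scale.
gap_to_cpu = geometric_mean - cpu_speed
gap_to_gpu = gpu_speed - geometric_mean

print(f"geometric mean:  {geometric_mean:g}x")
print(f"arithmetic mean: {arithmetic_mean:g}x")
print(f"gap to CPU: {gap_to_cpu:g}x, gap to GPU: {gap_to_gpu:g}x")
```

| On a linear scale the geometric mean always sits nearer the
| smaller of the two values, which is the point being made.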
|
| And more relevant for 2021:
|
| - I wonder if we would get the same kind of results using
| Vulkan?
|
| - Speaking of which, does anyone know of any good books /
| tutorials / runnable examples of Vulkan in a compute-oriented
| application?
|
| Cheers!
| DrNosferatu wrote:
| PS: Does anyone have any experience with ArrayFire?
| https://arrayfire.com/ (I believe it's from the people that
| did the [GPU]Jacket for Matlab)
___________________________________________________________________
(page generated 2021-02-15 23:02 UTC)