[HN Gopher] A 1000-processor chip powered by an AA battery (2016)
       ___________________________________________________________________
        
       A 1000-processor chip powered by an AA battery (2016)
        
       Author : Bluestein
       Score  : 84 points
       Date   : 2021-02-15 09:39 UTC (13 hours ago)
        
 (HTM) web link (www.ucdavis.edu)
 (TXT) w3m dump (www.ucdavis.edu)
        
       | Bluestein wrote:
       | "The chip is the most energy-efficient 'many-core' processor ever
       | reported, Baas said. For example, the 1,000 processors can
       | execute 115 billion instructions per second while dissipating
       | only 0.7 Watts, low enough to be powered by a single AA battery.
       | The KiloCore chip executes instructions more than 100 times more
       | efficiently than a modern laptop processor."
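
A quick back-of-the-envelope check of the quoted figures, as a Python sketch. Only the 115 billion instructions per second and the 0.7 W come from the article; the variable names and the derived units are illustrative assumptions.

```python
# Figures quoted from the UC Davis article.
instructions_per_second = 115e9   # 115 billion instructions per second
power_watts = 0.7                 # dissipation at that rate

# Energy spent per instruction, in joules and then in picojoules.
joules_per_instruction = power_watts / instructions_per_second
picojoules_per_instruction = joules_per_instruction * 1e12

# Throughput per watt, in billions of instructions per second per watt.
gips_per_watt = instructions_per_second / power_watts / 1e9

print(f"{picojoules_per_instruction:.1f} pJ per instruction")
print(f"{gips_per_watt:.0f} GIPS per watt")
```

This only restates the article's two numbers in other units; it says nothing about how much work each instruction does compared with a laptop CPU instruction.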
        
         | karmakaze wrote:
         | Yeah, thought it was interesting that the subject chip didn't
         | have a heat sink/fan and one supporting? chip on the board
         | behind it did.
        
         | Bluestein wrote:
         | (Reminds me of this other chip - built by a "giant" of
         | computing - that was designed to be -very- energy efficient,
         | while having a very large number of very simple cores ...
         | 
         | ... ran a special FORTH if I am not mistaken ...
         | 
         | I cannot remember where I found a very nice presentation the
         | gent gave about it ... :/ )
        
           | palsecam wrote:
           | http://www.GreenArrayChips.com/ by Chuck Moore, father of
           | Forth.
        
             | ajb wrote:
             | There were previous iterations as well. I think this was
             | all funded by some patent they licenced to Intel.
             | 
             | The most astonishing part was that he rolled his own VLSI
             | CAD system:
             | https://mschuldt.github.io/www.colorforth.com/vlsi.html
        
               | Bluestein wrote:
               | Wow.
        
               | brandonmenc wrote:
               | A video of him demonstrating that system:
               | 
               | https://www.youtube.com/watch?v=Dbd7Xu0ibJM
        
             | Bluestein wrote:
             | So kind :)
             | 
             | Thank you.-
             | 
             | Right on the dot ...
             | 
             | (I found it fascinating, when I first saw it :)
        
           | agumonkey wrote:
           | There were two talks by Moore on GA chips. Multi-core Forth
           | processors were (and still are) so inspiring.
        
             | Bluestein wrote:
             | That's it, indeed :)
        
       | amelius wrote:
       | One of the big difficulties with conventional processors is the
       | memory hierarchy, where the entire memory is viewed as a shared
       | memory space. There is no information in the article on how these
       | processors communicate, so I'm guessing that this processor side-
       | stepped the entire shared memory issue by offering only local
       | memory combined with a simpler form of communication (e.g.
       | message passing).
        
         | vidarh wrote:
         | This is reminiscent of Parallella [1]/Epiphany (which
         | unfortunately failed to gain traction). They didn't release any
         | designs with core counts like that, but the entire point of the
         | design was to enable large numbers of cores with in-core memory
         | and predictable latency for accessing the memory of cores
         | elsewhere in the grid.
         | 
         | Epiphany did have (slower) access to external memory too, but
         | the challenge of these architectures is finding problems that
         | require too much branching to be able to take proper advantage
         | of GPUs, yet so parallel that relatively low single-core
         | performance is outweighed by throwing more cores at it and/or
         | where the low power usage of the Epiphany would be worth it,
         | and it seems they just didn't find enough of a market for it.
         | 
         | I have two of the kickstarter boards sitting around. Wish
         | they'd gotten larger scale versions out.
         | 
         | [1] https://www.parallella.org/
        
           | UncleOxidant wrote:
           | I have two of those boards too. They were working on a 1024
           | core version, but alas, they didn't survive as a company to
           | make it.
        
           | myself248 wrote:
           | Also arguably spiritual descendants of the Transputer?
        
             | vidarh wrote:
             | Yes, there are definitely similarities in concept. The
             | whole idea of an on-chip network between cores stems from
             | the Transputer as far as I know.
        
         | bee_rider wrote:
         | This makes sense, everyone loves debugging MPI programs, so of
         | course we'd want to replicate that experience on the desktop.
        
           | djmips wrote:
           | You might have to be dragged kicking and screaming to the
           | multicore because CPUs aren't getting faster.
        
             | bee_rider wrote:
             | The distinction here is between distributed and shared
             | memory, not between single and multicore programming.
             | Distributed memory is more often used on clusters. I would
             | say distributed memory is generally considered easier on
             | the hardware, harder on the programmer (although one could
             | argue that a message passing communication scheme is more
             | explicit than fork/join, so the programmer has more
             | control).
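
The distributed-memory style described above can be sketched in Python. This is only an analogy under stated assumptions: threads stand in for cores with private state, and `queue.Queue` stands in for an on-chip link. The names `worker` and `run_pipeline` are hypothetical, and nothing here reflects how KiloCore itself is programmed.

```python
import queue
import threading

def worker(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Square each received number; stop when the None sentinel arrives."""
    while True:
        item = inbox.get()
        if item is None:
            outbox.put(None)  # forward the sentinel downstream
            return
        outbox.put(item * item)

def run_pipeline(values):
    """Send values to one worker by message and collect its replies."""
    inbox, outbox = queue.Queue(), queue.Queue()
    t = threading.Thread(target=worker, args=(inbox, outbox))
    t.start()
    for v in values:
        inbox.put(v)      # explicit send: no shared variables are touched
    inbox.put(None)       # tell the worker there is no more input
    results = []
    while (r := outbox.get()) is not None:
        results.append(r)
    t.join()
    return results

print(run_pipeline([1, 2, 3, 4]))
```

Every exchange between the two sides is an explicit `put` or `get`, which is the extra control (and extra work) that the comment attributes to message passing.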
        
         | tornato7 wrote:
         | Yep, at KnuEdge we built some processors with thousands of
         | cores in that style architecture. It's a bit different to
         | reason about but you can do a lot of wild computing that way.
         | Too bad the company folded before any of our processors were
         | available.
        
         | nextaccountic wrote:
         | Yes, "Cores operate at an average maximum clock frequency of
         | 1.78 GHz, and they transfer data directly to each other rather
         | than using a pooled memory area that can become a bottleneck
         | for data."
        
         | cromwellian wrote:
         | Reminds me of the PS/2 CELL Architecture, with the SPUs having
         | limited local memory, and needing to DMA stuff around in a
         | streaming fashion to build up larger programs.
        
           | rbanffy wrote:
           | The Cell was a pain to program because of that hateful memory
           | architecture. The anemic PowerPC CPUs didn't help much
           | either.
           | 
           | Nothing really prevents an architecture where part of the
           | address space is core-local (with low latency) and not
           | directly accessible from other cores, and the rest points to
           | a (much higher latency) shared memory pool. I have a feeling
           | this would be much nicer to program than a Cell or the
           | average GPU.
        
             | djmips wrote:
             | Nevertheless people were able to harness the Cell for good
             | performance. It took a re-think and I am not denigrating
             | your experience but our experience was that we benefited
             | everywhere when we reorganized our data and algorithms
             | around the SPUs.
        
             | breatheoften wrote:
             | Seems like the way to go ... maybe even 3 layers of memory
             | -- core local memory, memory that provides shared reader
             | and writer consistency, and memory that provides shared
             | reader, single writer consistency (for ownership
             | transfers). Maybe the second form doesn't actually need to
             | exist at all?
             | 
             | Minimizing ownership transfer cost across hardware elements
             | seems like a fundamental concept to me that future
             | architectures will need to specifically optimize in order
             | to maximize the benefits that can be reaped from "task
             | specific" hardware.
             | 
             | The more cores we add the more we have to gain by allowing
             | them to be different from each other I think ...
        
             | jabl wrote:
             | > Nothing really prevents an architecture where part of the
             | address space is core-local (with low latency) and not
             | directly accessible from other cores, and the rest points
             | to a (much higher latency) shared memory pool. I have a
             | feeling this would be much nicer to program than a Cell or
             | the average GPU.
             | 
             | Many GPUs have this kind of local scratchpad memory.
             | Nvidia GPUs even allow the programmer to partition the
             | local SRAM between the scratchpad memory (called shared
             | memory in CUDA docs) and L1 cache.
        
               | rbanffy wrote:
               | It's a shame nobody came up with an OS that runs entirely
               | on a GPU.
        
           | als0 wrote:
           | That was the PS3.
        
       | [deleted]
        
       | FpUser wrote:
       | This is very interesting stuff. I'd love for these kinds of chips
       | to be sold in the form of coprocessor cards for PCs, for example.
       | For some types of servers and other uses it'd be a blessing.
        
         | SuchAnonMuchWow wrote:
         | It does exist in various forms. One I'm familiar with is Kalray
         | MPPA (massively parallel processor array):
         | https://www.kalrayinc.com/technology/ Their current generation
         | of processors has 80 cores exposed to the userspace (the
         | previous one had 256), and they are mainly used as PCIe
         | acceleration boards for all kinds of applications.
         | 
         | But as another poster said, the difficulty with those kinds of
         | architectures is the memory hierarchy: you don't want all the
         | cores to access the DDR directly, as that would be a massive
         | bottleneck. And this becomes a larger issue as you increase the
         | number of cores.
        
         | jpm_sd wrote:
         | Those are called GPUs! Or, if you like, GPGPUs.
         | 
         | Readily available in The Cloud(TM) and everything.
         | 
         | https://docs.aws.amazon.com/dlami/latest/devguide/gpu.html
        
           | FpUser wrote:
           | Nope. GPUs are SIMD. KiloCore is however 1000 totally
           | independent cores.
        
             | captainbland wrote:
             | My understanding is that on Nvidia cards, each SM is
             | basically an independent SIMD processor. So an RTX 3090 is
             | like a processor with 82 independent but super-wide cores.
             | I think the truth is actually even more complicated still
             | with multiple warps being able to execute on a single SM
             | with some memory subsystem limitations but it's something
             | like that. The equivalent is true with AMD GPUs, too, just
             | substitute the terminology where appropriate.
             | 
             | Either way still not close to 1000 independent cores.
        
               | desertrider12 wrote:
               | In a way, there are many more independent cores in an
               | Nvidia GPU than #SMs, since the SIMD (warp) is only 32
               | threads wide, and >1k threads can be active at a time per
               | SM. Each warp has its own independent program counter. So
               | if you really wanted to, you could have a kernel with
               | blocks that are Nx32, and code like:
               | 
               |     if(blockIdx.x == 0) doThing1();
               |     else if(blockIdx.x == 1) doThing2();
               |     ...
               | 
               | etc., and the doThings could run at the same time, even
               | on the same SM. Not super practical but it's allowed.
        
       | dang wrote:
       | Discussed at the time:
       | https://news.ycombinator.com/item?id=11935999
        
       | superrad wrote:
       | Should we be training on TIS-100 [0] to be ready for when these
       | chips become a necessity?
       | 
       | [0] https://en.wikipedia.org/wiki/TIS-100
        
       | Zenst wrote:
       | Some more detailed information here:
       | https://en.wikichip.org/wiki/uc_davis/kilocore
       | 
       | No idea how it has progressed or if sample units became
       | accessible - which would be nice.
        
         | kingosticks wrote:
         | That's a good reference.
         | 
         | I had assumed there were 1000 instruction memory SRAMs in there
         | but of course that'd be impossible considering the leakage
         | numbers and there's actually just 12 x 64KB SRAMs shared by all
         | the cores:
         | 
         |     Per core
         |         640 bytes (128x40-bit) local instruction memory
         |         512 bytes (256x16-bit) local data memory
         |     768 KB SRAM on-die
         |         12 shared SRAM memory modules, 64 KB each
         | 
         | Those local per-core memories are tiny and presumably
         | implemented with registers. But can you really do much with
         | just 128 instructions? The following makes it sound like they
         | might be instruction caches:
         | 
         | > Instructions may come from the local instruction memory or
         | they may be fetched from one of the independent memory module
         | 
         | But then goes onto say:
         | 
         | > the KiloCore's processors do not contain traditional caches.
         | 
         | Hopefully their paper has more info.
         | 
         | The survey that the "World's First 1,000-Processor Chip" claim
         | is based on (http://vcl.ece.ucdavis.edu/misc/many-core.html)
         | doesn't seem to include any network processors, some of which
         | are _technically_ programmable. But even if it had, they
         | wouldn't have found any of the information required for the
         | table and definitely not a datasheet!
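
The memory sizes listed above can be checked with a few lines of Python. The word counts and widths are the ones quoted from WikiChip; the total across 1000 cores is a derived figure that does not appear in the thread, and the variable names are illustrative.

```python
BITS_PER_BYTE = 8

# Per-core memories as (number of words, word width in bits).
instruction_words, instruction_bits = 128, 40
data_words, data_bits = 256, 16

instruction_bytes = instruction_words * instruction_bits // BITS_PER_BYTE
data_bytes = data_words * data_bits // BITS_PER_BYTE

# Shared on-die SRAM: 12 modules of 64 KB each.
shared_kb = 12 * 64

# Derived: all per-core memories added up over 1000 cores, in KB (1024 bytes).
cores = 1000
total_local_kb = cores * (instruction_bytes + data_bytes) / 1024

print(instruction_bytes, "bytes of instruction memory per core")
print(data_bytes, "bytes of data memory per core")
print(shared_kb, "KB of shared SRAM")
print(total_local_kb, "KB of local memory across all cores")
```

Under these numbers the per-core memories together hold more than the shared SRAM does, even though each one is tiny.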
        
         | klyrs wrote:
         | This is kinda wild. It seems like one could implement this in a
         | large FPGA today and reasonably expect to hit a similar clock
         | frequency.
        
         | karmakaze wrote:
         | In addition to the memory access I was wondering how they got
         | to 1000 cores and not 1024.
         | 
         | > The chip is designed as a massively parallel processor array,
         | with 992 cores arranged as a grid 32 by 31. Eight additional
         | cores are found along with 12 memory modules of 64 KB SRAM each
         | (for a total of 768 KB). Communication between cores is done
         | via a dual-layer source-synchronous circuit-switched network
         | and a very-small-area packet router (see wormhole routing). The
         | circuit-switched network supports communication between
         | adjacent and distant processors, as resources allow, with each
         | link supporting a maximum rate of 28.5 Gbps. Maximum throughput
         | is 45.5 Gbps per router. Both network types contribute to an
         | array bisection bandwidth of 4.2 Tbps.
        
           | AtlasBarfed wrote:
           | Process/Fab errors?
           | 
           | 24 cores of failure out of 1000 would be an error rate around
           | 2%, which would help in yields.
        
       | oblio wrote:
       | We're probably a programming language revolution away from these
       | kinds of CPUs being practical. Isn't parallelism complicated and
       | practically dangerous in most mainstream languages?
       | 
       | Is there any language where you can just go:
       | 
       |     parallel-for:
       |         do-my-stuff
       | 
       | and have the programming language Do The Right Thing with no
       | further hassle? The right thing usually being blocking execution
       | until every parallel do-my-stuff finishes execution, in my
       | experience.
       | 
       | In most programming languages I've seen, this is 20+ lines of
       | boilerplate full of caveats and risks.
        
         | PartiallyTyped wrote:
         | In Python, with NumPy, you can use `np.vectorize`; there exists
         | a JAX equivalent that can enable accelerators to do your
         | parallel stuff. CUDA exists, but it isn't as simple as
         | `parallel-for`. For trivial for-loops, you can use OpenMP.
        
           | rand_r wrote:
           | Unfortunately np.vectorize doesn't do anything smart.
           | 
           | From the docs:
           | 
           | The vectorize function is provided primarily for convenience,
           | not for performance. The implementation is essentially a for
           | loop.
        
             | PartiallyTyped wrote:
             | Ah right, yes I remember. Perhaps .vmap from pytorch?
        
         | Someone wrote:
         | In scala (https://docs.scala-lang.org/overviews/parallel-
         | collections/o...):
         | 
         | Serial:
         | 
         |     list.map(_ + 42)
         | 
         | Parallel:
         | 
         |     list.par.map(_ + 42)
         | 
         | In C# (https://docs.microsoft.com/en-
         | us/dotnet/standard/parallel-pr...)
         | 
         | Serial:
         | 
         |     for (int i = 0; i < matARows; i++)
         | 
         | Parallel:
         | 
         |     Parallel.For(0, matARows, i =>
         | 
         | (Function takes a lambda that gets called for each value)
         | 
         | Swift has DispatchQueue.concurrentPerform (https://developer.ap
         | ple.com/documentation/dispatch/dispatchq...). That is a bit
         | less clear and less flexible (won't iterate over an array's
         | values, only over its indexes) but also fairly easy to use.
         | 
         | There probably are many more languages with similar constructs.
        
           | snak wrote:
           | Yes, I do agree C# does have many constructs
           | (Parallel.For/ForEach/Invoke...) but there are many things to
           | consider that are not designed with thread safety in mind
           | (e.g. generic collections, List<T>, Dictionary<TKey, TValue>,
           | etc. must be replaced by their System.Collections.Concurrent
           | counterparts) and many other risks that parallel computing
           | poses.
           | 
           |     int x = 0;
           |     Parallel.For(0, 99999, z => {
           |         x++;
           |     });
           |     Console.WriteLine(x);
           | 
           | Something as simple as an integer sum won't produce the
           | expected output unless you're using Interlocked increments.
           | 
           | And that definitely does not match what oblio means with
           | "have the programming language Do The Right Thing with no
           | further hassle".
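
The same hazard exists outside C#. Below is a minimal Python sketch of a shared counter updated from a thread pool, with the increment guarded by a lock. `Counter` and `count_in_parallel` are hypothetical names introduced for illustration. Whether the unguarded version actually loses updates depends on the interpreter and timing, so only the guarded form is shown.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Counter:
    """A shared counter whose increment is protected by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self, _=None) -> None:
        # Read-modify-write must be one critical section; without the lock,
        # two threads may read the same old value and one update can be lost.
        with self._lock:
            self.value += 1

def count_in_parallel(n: int, workers: int = 8) -> int:
    counter = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so every task has finished.
        list(pool.map(counter.increment, range(n)))
    return counter.value

print(count_in_parallel(99999))
```

The lock plays the role that Interlocked increments play in the C# snippet: the language offers the parallel loop, but correctness of shared state is still left to the programmer.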
        
         | gpderetta wrote:
         | #pragma omp parallel for
        
           | oblio wrote:
           | What's the operational cost of doing that?
        
             | gpderetta wrote:
             | What do you mean exactly?
             | 
             | You need a C/C++/Fortran compiler that supports OpenMP,
             | which is most of the mainstream ones (I believe MSVC still
             | only supports an older standard, still more than enough for
             | parallel for). Compilers that do not support OpenMP are
             | supposed to ignore the pragma and the code will still run
             | fine (although serially) and be fully compatible.
             | 
             | Other than that and possibly passing the correct command
             | line flag, OpenMP is fairly low friction. A parallel for
             | can provide significant speedup for embarrassingly parallel
             | problems and these days it even supports custom random
             | access iterators in C++.
             | 
             | Of course it is more involved to express less
             | embarrassingly parallelizable problems, but it is still
             | doable.
        
         | Faaak wrote:
         | It's built into Prolog's logic and its derivatives; alas, they
         | are rarely used in the industry :'(
        
         | jiofih wrote:
         | Pony? https://www.ponylang.io/
        
         | hvidgaard wrote:
         | It's simple to do what you write, multiple languages already do
         | just that. It's the class of problems that is called
         | "embarrassingly parallel"
         | (https://en.wikipedia.org/wiki/Embarrassingly_parallel).
         | However, it turns out that it places some rather serious
         | restrictions on the kind of calculations you can perform this
         | way. In general for two pieces of code to be executed in
         | parallel, they must be independent of each other. I.e. for the
         | following
         | 
         |     1: tmpA := A(input)
         |     2: tmpB := B(tmpA)
         |     3: tmpC := C(tmpB)
         |     4: result := D(tmpC)
         | 
         | it's impossible to calculate a line before the previous line
         | has finished. You cannot do this calculation concurrently. On
         | the other hand if it was like this:
         | 
         |     1: resA := A(input)
         |     2: resB := B(input)
         |     3: resC := C(input)
         |     4: resD := D(input)
         | 
         | You can calculate all 4 concurrently and achieve a great speed-
         | up if you have resources to do it. Except if B, C, or D somehow
         | access the result of a previous line. For mathematical notation
         | this is simple, but for general programming languages there are
         | plenty of ways to do this.
         | 
         | We know that optimal optimization, i.e. finding the
         | fastest way a program can be executed, is uncomputable. Not
         | difficult, not brute forceable, but impossible to compute with
         | traditional computers. The best we can do is identify _some_
         | concurrency optimizations, but so far it has not been a
         | fruitful adventure for general programming languages. So what
         | we can do is use patterns and constructs that rely on dividing
         | the problems into concurrent pieces and use proper guarding on
         | shared state. This is however notoriously easy to get wrong.
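
The two cases described in this comment can be sketched in Python. The functions `A` to `D` below are hypothetical stand-ins with arbitrary bodies; only the data dependencies matter. Note the assumption that in CPython, threads will not speed up CPU-bound pure-Python work, so this shows the structure, not a benchmark.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for A, B, C and D from the comment.
def A(x): return x + 1
def B(x): return x * 2
def C(x): return x - 3
def D(x): return x ** 2

def dependent_chain(value):
    # Each step consumes the previous result, so the steps must run in order.
    return D(C(B(A(value))))

def independent_calls(value):
    # All four calls only read `value`, so they can be submitted together.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(f, value) for f in (A, B, C, D)]
        return [f.result() for f in futures]

print(dependent_chain(5))
print(independent_calls(5))
```

If any of the stand-ins wrote to shared state that another one reads, the second form would no longer be safe, which is the caveat the comment ends on.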
        
           | chmod775 wrote:
           | > We know that optimal optimization, i.e. finding the
           | fastest way a program can be executed, is uncomputable.
           | Not difficult, not brute forceable, but impossible to compute
           | with traditional computers.
           | 
           | I'm really curious about the proof of this, since there's
           | only a finite number of ways you can re-arrange instructions
           | in a program. Likewise there's only a finite number of ways
           | you could shard them across threads.
           | 
           | For this to be true you'd clearly need to phrase the problem
           | in a way that allows you to have an infinite number of
           | possibilities, then prove you can't arrive at the optimal
           | solution through some means that doesn't involve checking an
           | infinite number of them. (Edit: Or prove that it is
           | impossible to decide on which one is the fastest at the time
           | the optimizer runs. When does the optimizer run?).
           | 
           | Was the assumption that you'd also be (infinitely) unrolling
           | loops or maybe 'rewriting' the program into something
           | equivalent but faster? What does "finding the fastest way a
           | program can be executed" mean? Just re-ordering instructions
           | and distributing them across threads, or something more?
        
             | Denzel wrote:
             | Finding the optimal execution time of an arbitrary program
             | is equivalent to the halting problem. [1] If you narrow the
             | "arbitrary" constraint to only include well-structured,
             | analyzable, guaranteed-to-terminate programs, then you can
             | at least start to approximate a solution. Finding the true
             | optimal case, even under those conditions, would be
             | computationally expensive.
             | 
             | [1]: https://en.wikipedia.org/wiki/Halting_problem
        
               | chmod775 wrote:
               | This wasn't about anything related to optimal execution
               | time. That an optimizer needs to know whether the code it
               | optimizes terminates (or how long it runs) would also
               | need proof if you'd want to go that route.
               | 
               | The assumptions and optimizations the optimizer is
               | allowed to use and what kinds of programs we're talking
               | about, and what even is considered an "optimal program"
               | is also still unclear (Edit: I'm assuming lowest number
               | of instructions executed serially?).
               | 
               | Please do refer me to a paper. Please do not link me
               | vaguely related Wikipedia articles.
        
               | vidarh wrote:
               | It's equivalent to the halting problem because there
               | exists a set of problems for which, for any input that
               | the optimiser creates the optimal solution for, there
               | exists another input for which that solution is not
               | optimal. The parallel to the halting problem is that with
               | access to the output of the optimiser, you can always
               | construct a problem where whatever the optimiser produces
               | can be defeated by an input that makes the optimised
               | program non-optimal for that input.
               | 
               | A trivial example of such a program is a function that
               | sorts its input and returns the sorted result, as no sort
               | is optimal for all inputs.
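
The narrower point about sorting can be illustrated by counting comparisons. This Python sketch uses textbook insertion sort and top-down merge sort as two arbitrary candidates; it shows that which one does less work depends on the input, and it is not a proof of the uncomputability claim debated above.

```python
def insertion_sort_comparisons(items):
    """Sort a copy with insertion sort; return (sorted_list, comparisons)."""
    a = list(items)
    comparisons = 0
    for i in range(1, len(a)):
        j = i
        while j > 0:
            comparisons += 1
            if a[j - 1] <= a[j]:
                break  # element is already in place
            a[j - 1], a[j] = a[j], a[j - 1]
            j -= 1
    return a, comparisons

def merge_sort_comparisons(items):
    """Sort with top-down merge sort; return (sorted_list, comparisons)."""
    a = list(items)
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, count_left = merge_sort_comparisons(a[:mid])
    right, count_right = merge_sort_comparisons(a[mid:])
    merged, comparisons = [], count_left + count_right
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons

ascending = list(range(64))
descending = ascending[::-1]
for name, data in (("ascending", ascending), ("descending", descending)):
    _, ins = insertion_sort_comparisons(data)
    _, mrg = merge_sort_comparisons(data)
    print(f"{name}: insertion={ins} merge={mrg}")
```

On already sorted input insertion sort needs fewer comparisons, and on reversed input merge sort needs far fewer, so neither is the better choice for every input.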
        
           | oblio wrote:
           | I know about "embarrassingly parallel" problems, and the
           | thing is, they're only "embarrassingly parallel" to solve in
           | math.
           | 
           | When programming, the level of friction ranges from annoying
           | to unbearably high, depending on the programming language. It
           | should be trivial to implement these solutions, as trivial as
           | the sequential solutions.
           | 
           | It's not. Until this is the case in mainstream programming
           | languages, I doubt we'll have CPUs with 1000 cores in our
           | smartphones or laptops. I mean, we'll have them but they'll
           | be useless because 99% of software today is practically
           | single-threaded. The reason multiple cores are good for
           | laptops, for example, is because we're multi-tasking between
           | multiple applications, not because those cores are frequently
           | used by day-to-day apps. I'm talking about random small apps,
           | not compute intensive ones that have to use multi-threading
           | intensively (rendering, compiling, what have you,
           | professional level software).
        
             | rbanffy wrote:
             | > but they'll be useless because 99% of software today is
             | practically single-threaded.
             | 
             | I used to joke developers should get workstations with
             | SPARC Niagaras or Xeon Phis for that reason: core count is
             | going up and a CPU with a dozen small cores is much cheaper
             | to build than one with 4 beefy ones. Now some of the low-
             | end chips Intel is pushing have 4 SMT2 cores. HEDT is on
             | the 16 SMT2 core range and, if our software continues to be
             | single threaded, there will be a lot of silicon being used
             | for nothing more than spreading heat.
        
               | colejohnson66 wrote:
               | Isn't the point of high core counts for workloads that
               | can actually _be_ parallelized? Like compilation or
               | graphics rendering?
        
               | rbanffy wrote:
               | A lot of tasks can be parallelized if you think hard
               | enough. Your CPU is busy reordering instructions so they
               | keep as many execution ports busy as it can so that it
               | can retire as many instructions per core cycle as
               | possible. SMT was invented to keep those execution units
               | busy by running more than one instruction stream at a
               | time.
               | 
               | The incentive to do it when most computers have two SMT2
               | cores is small, but as the average device starts getting
               | 4 or 8 SMT2 cores, or 8 asymmetric cores, the incentives
               | to make things run faster in parallel get better and
               | better.
        
               | astrange wrote:
               | That wouldn't be good if they're targeting battery-
               | powered devices, because you usually don't want to turn
               | on extra cores there. Mobile programs can actually go
               | faster if you make them single-threaded - either the
               | device doesn't have more cores available to run your
               | extra threads, or your process isn't high-priority enough
               | to spend more power on.
        
             | tenebrisalietum wrote:
             | Imagine an OS where each new thread or process gets
             | allocated its own core. You'd be limited to 1k
             | threads/processes on this CPU, but what would otherwise be
             | the benefits/drawbacks?
        
         | spiritplumber wrote:
         | Parallax's Spin lets you do this.
        
         | flohofwoe wrote:
         | You can get that without a programming language revolution by
         | reformulating the problem a bit.
         | 
         | For instance 3D-API shading languages do just that under the
         | hood, just without the "parallel-for". Instead you tell a
         | 3D-API what code should run per-vertex or per-pixel in
         | (usually) a traditional sequential C-style language, and the
         | GPU driver and hardware care about the parallelization and
         | scheduling.
        
         | bipson wrote:
         | Ada is quite good at this.
         | 
         | But not every problem (i.e. the algorithm for the problem) can
         | be divided easily without any coordination (i.e. communication)
         | between the chunks. This overhead can quickly become
         | significant.
         | 
         | There is a reason why desktop software (particularly the
         | computationally intensive) is having a hard time exploiting
         | multi-core benefits (besides the programming language and
         | habits in general).
        
         | Kuinox wrote:
         | Yes, SQL. Write your query and it will be magically
         | parallelised without you knowing it.
        
         | nicoburns wrote:
         | Rust's Rayon library is pretty good for these use-cases
         | https://github.com/rayon-rs/rayon
         | 
         |     use rayon::prelude::*;
         |     fn sum_of_squares(input: &[i32]) -> i32 {
         |         input.par_iter() // <-- just change that!
         |              .map(|&i| i * i)
         |              .sum()
         |     }
        
         | cfstras wrote:
         | Java has had this since 8:
         | 
         |     Arrays.asList("hello", "world").parallelStream()
         |         .forEach(s -> System.out.println(s));
         | 
         | If you want to calculate stuff, you can:
         | 
         |     List<String> suffixed = Arrays.asList("hello", "world").parallelStream()
         |         .map(s -> s + "_suffix")
         |         .collect(Collectors.toList());
        
         | kevin_thibedeau wrote:
         | Hardware description languages are parallelized by default.
        
         | legulere wrote:
         | This architecture is called a manycore processor. There are
         | several programming paradigms that are researched for it, among
         | others actor languages, where the idea is to run parts of your
         | program on different cores that communicate with each other.
         | 
         | https://en.wikipedia.org/wiki/Manycore_processor
        
         | leprechaun1066 wrote:
         | q has this out the box: https://code.kx.com/q/basics/peach/
        
         | prennert wrote:
         | Don't many languages already have some sort of map
         | implementation that parallelizes pure function calls? Even
         | Python has ThreadPoolExecutor.map and ProcessPoolExecutor.map.
         | 
         | It gets difficult once you have side effects and/or
         | synchronization (join) of the results is required.
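
A minimal sketch of the pattern described above, using Python's standard `concurrent.futures`. The function names `square`, `parallel_squares` and `sum_of_squares` are made up for illustration. It assumes the mapped function is pure; the "join" happens when the results are collected and reduced on the calling thread.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n: int) -> int:
    """Pure function: the result depends only on the argument."""
    return n * n


def parallel_squares(values, max_workers: int = 4):
    # Executor.map runs the calls concurrently but yields results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(square, values))


def sum_of_squares(values) -> int:
    # The "join" step: gather every result, then reduce on the calling thread.
    return sum(parallel_squares(values))
```

Note that in standard CPython builds with the GIL, threads rarely speed up CPU-bound work like this; `ProcessPoolExecutor` exposes the same `map` interface and is usually the better fit there.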
        
         | GTP wrote:
         | >Isn't parallelism complicated and practically dangerous in
         | most mainstream languages?
         | 
         | Yes, but functional languages solve this problem.
        
         | rbanffy wrote:
         | > Is there any language where you can just go:
         | 
         | Nothing prevents a map function from running in parallel.
         | 
         | There are other things that can happen in parallel, however -
         | the whole instruction flow is reordered in the CPU before it is
         | executed and if you have enough execution ports and enough in-
         | flight instructions to keep them fed, you can accomplish a lot
         | in a single clock cycle before your source code even needs to
         | acknowledge that instructions don't happen exactly in the
         | sequence they are written.
        
         | gens wrote:
         | While reading the article I thought of Erlang.
         | 
         | Multiple independent programs sending messages to each other
         | fits well into multiple independent cores sending messages to
         | each other.
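
As a rough illustration of that model (not Erlang itself), here is a sketch in Python where a worker shares no state and communicates only through message queues. The names `doubler` and `run_pipeline`, and the use of `None` as a shutdown message, are assumptions made for this example.

```python
import queue
import threading


def doubler(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Worker that owns no shared state: it only receives and sends messages."""
    while True:
        msg = inbox.get()
        if msg is None:
            # Shutdown message: forward it so the receiver knows we are done.
            outbox.put(None)
            return
        outbox.put(msg * 2)


def run_pipeline(values):
    inbox, outbox = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=doubler, args=(inbox, outbox))
    worker.start()

    # Send every value, then the shutdown message.
    for v in values:
        inbox.put(v)
    inbox.put(None)

    # Receive replies until the worker signals completion.
    results = []
    while (item := outbox.get()) is not None:
        results.append(item)
    worker.join()
    return results
```

On a chip like the one in the article, each such worker would map onto its own core and the queues onto the direct core-to-core links, though that mapping is speculation here.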
        
         | foerbert wrote:
         | I'm not sure this is the best link, but I don't have anything
         | better in mind and it was the first thing that popped up, so...
         | 
         | Anyway, there's a language called ParaSail[0] that's doing
         | stuff in that direction. A little bit back the guy behind it
         | got hired by AdaCore (Ada compiler folks) and they seem to have
         | adopted it and to be continuing with it. I haven't heard too much
         | about it though.
         | 
         | [0]https://adacore.github.io/ParaSail/
        
         | 4gotunameagain wrote:
         | MATLAB has parfor, but that's barely a language..
        
         | tonyedgecombe wrote:
         | C# has it: https://docs.microsoft.com/en-
         | us/dotnet/api/system.threading...
         | 
         | Of course when you start doing things in parallel there are
         | automatically a number of caveats and risks.
        
         | jackpeterfletch wrote:
         | Occam has exactly this. SEQ executes statements sequentially.
         | PAR executes them in parallel, though the language is based on
         | CSP and Cooperative Multitasking, so the idea of it doing 'the
         | right thing' with a PAR is a little different.
         | 
         | Think of Occam programs as big Factorio factories.
         | 
         | https://en.wikipedia.org/wiki/Occam_(programming_language)
         | 
         | A lot of the ideas that came out of Occam and CSP have been
         | brought into various coroutine libraries though. Kotlin
         | Coroutines, Goroutines, etc. So it's definitely something that
         | is explored.
        
           | jacquesm wrote:
           | I was looking for this comment before posting. Thank you.
        
         | api wrote:
         | Rust is much safer for parallelism and has a lot of safety
         | rules around closures, so this would probably be achievable in
         | Rust without much pain via a clever template library.
         | 
         | The problem is that not all problems are easy to parallelize at
         | the algorithmic level, and for some it may be impossible.
         | 
         | Alongside massively multi core chips I would also love to see
         | more efforts to push the envelope on single threaded
         | performance.
        
         | yoshuaw wrote:
         | MSVC's C++ Parallel algorithms library gets close [1], and so
         | does Rust's Rayon crate [2]. C# has parallel for loops [3], and
         | apparently so does PowerShell [4]
         | 
         | Though none of these are quite as convenient as I think they
         | could be. I suspect there is a path for Rust to add parallel
         | iteration syntax in the future [5]. There are quite a few steps
         | needed to get there, but the result could look something like
         | this:
         | 
         |     let mut listener = TcpListener::bind("127.0.0.1:8080")?;
         |     println!("Listening on {}", listener.local_addr()?);
         | 
         |     par for stream? in listener.incoming() {
         |         println!("Accepting from: {}", stream.peer_addr()?);
         |         io::copy(&stream, &stream)?;
         |     }
         | 
         | [1]: https://devblogs.microsoft.com/cppblog/using-c17-parallel-
         | al...
         | 
         | [2]: https://docs.rs/rayon/1.5.0/rayon/
         | 
         | [3]: https://dotnettutorials.net/lesson/parallel-for-method-
         | cshar...
         | 
         | [4]: https://devblogs.microsoft.com/powershell/powershell-
         | foreach...
         | 
         | [5]: https://blog.yoshuawuyts.com/parallel-stream/#future-
         | directi...
        
           | AtlasBarfed wrote:
           | I love GPars for Groovy, but unfortunately it isn't
           | CompileStatic compatible or Groovy 3.0 compatible.
        
           | eptcyka wrote:
           | What'd be the concurrency primitive used in the parallel
           | block? Is scheduling done per loop or globally? All of these
           | questions are trivial if there's a runtime and become very
           | cumbersome if the language has ambitions to be usable without
           | any standard libraries.
        
             | volta83 wrote:
             | If you use the standard library, its parallel run-time gets
             | automatically linked. This is often the right choice when
             | building binaries for a platform with an operating system,
             | etc.
             | 
             | If you do not include the standard library _AND_ if you use
             | these features, then you'd need to provide your own
             | parallel run-time or your application won't compile.
             | 
             | There are many valid implementations of these run-times
             | tuned for different applications, these run-times can be as
             | simple or complex as you'd like, in some cases some run-
             | times only work on particular bare metal targets, and in
             | most cases you just pull a library with the implementation
             | that you'd want to use.
             | 
             | Since a valid implementation of the run-time is to just run
             | all tasks sequentially, you can provide a tiny (often
             | "zero-size") run-time for those environments on which code-
             | size trumps everything.
             | 
             | You can also provide the standard parallel run-time, but
             | this might require operating-system like functionality,
             | like thread, mutex, etc. support. If your embedded target
             | does not provide those, you'd need to provide
             | implementations of these yourself.
             | 
             | The important thing is that you don't need to change any
             | code that uses the run-time, e.g., all the app code can
             | still use fork-join parallelism, but you control how that's
             | executed by swapping the modular runtime component.
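
The idea of a swappable runtime can be sketched in Python (not in Rust, and not using any real Rust API). `SequentialRuntime`, `ThreadedRuntime` and `app_code` are hypothetical names; the only assumption is that every runtime offers a `map(fn, items)` method.

```python
from concurrent.futures import ThreadPoolExecutor


class SequentialRuntime:
    """Hypothetical minimal runtime: runs every task in order on the caller's thread."""

    def map(self, fn, items):
        return [fn(item) for item in items]


class ThreadedRuntime:
    """Hypothetical runtime backed by an OS thread pool."""

    def __init__(self, workers: int = 4):
        self.workers = workers

    def map(self, fn, items):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            return list(pool.map(fn, items))


def app_code(runtime, data):
    # Application code depends only on the runtime's map() method,
    # so it stays unchanged when the runtime is swapped.
    return sum(runtime.map(lambda x: x * x, data))
```

Calling `app_code(SequentialRuntime(), data)` and `app_code(ThreadedRuntime(), data)` should give the same answer; only the execution strategy differs, which is the point made above about the sequential runtime being a valid implementation.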
        
               | yoshuaw wrote:
               | Yes, exactly. Rust already allows for switching global
               | allocators through `#[global_allocator]`. It now also
               | allows plugging in allocators inline through the
               | allocator-api methods.
               | 
               | I'd imagine something similar could be introduced for
               | threadpools / runtimes.
        
             | Someone wrote:
             | I don't think you can do that well without some global
             | state and code to handle it (aka 'a runtime').
             | 
             | It's not like separate parts of a program can each optimize
             | for the number of threads and their memory for their goals.
             | Optimization/tuning has to happen globally because the
             | resources used (threads, memory) are shared between those
             | parts.
             | 
             | I think most solutions have some global state that
             | magically decides how many threads to run in parallel (more
             | for I/O bound code) and how many to assign to each part
             | (which don't even have to be perfectly separate). Part P
             | might parallelize a loop where each iteration calls into
             | part Q, which also runs a parallel loop. If each of these
             | decides to run one thread per CPU, for N threads in total,
             | the result still could/would be that N^2 threads run in
             | parallel, likely with suboptimal results.
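
The nested-loop effect can be reproduced with a small Python sketch. `count_nested_threads` is a made-up helper; it assumes each part naively creates its own pool of n workers, and it uses a barrier that only releases once n*n inner tasks are running at the same time.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_nested_threads(n: int) -> int:
    """Run an n-way parallel loop whose body runs its own n-way parallel loop.

    Returns the number of distinct threads executing inner-loop bodies
    at the same moment.
    """
    # The barrier releases only when n*n inner tasks are waiting together;
    # the timeout turns an unexpected shortfall into an error instead of a hang.
    barrier = threading.Barrier(n * n, timeout=10)
    seen = set()
    lock = threading.Lock()

    def inner(_):
        with lock:
            seen.add(threading.get_ident())
        barrier.wait()

    def outer(_):
        # "Part Q": creates its own pool, unaware of the caller's pool.
        with ThreadPoolExecutor(max_workers=n) as pool:
            list(pool.map(inner, range(n)))

    # "Part P": an n-way parallel loop calling into part Q.
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(outer, range(n)))
    return len(seen)
```

With n equal to the CPU count, this oversubscribes the machine by a factor of n (plus the n outer threads that sit blocked), which is why a single shared pool or a global scheduler is the usual remedy.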
        
         | mhh__ wrote:
         | D does this as a library i.e. just add parallel to your foreach
         | aggregate.
         | 
         | As to languages built around this, there are quite a few, e.g.
         | Intel have Data Parallel C++
        
       | peter_d_sherman wrote:
       | This might make a good candidate architecture to become a future
       | GPU (see discussion here):
       | 
       | Nyuzi - An Experimental Open-Source FPGA GPGPU Processor
       | 
       | https://news.ycombinator.com/item?id=26132726
        
       | guerrilla wrote:
       | I can't escape the nagging question: is it really 1,000 or
       | actually 1,024 and simplified for the press release?
        
         | Bluestein wrote:
         | Useful comment upthread :)
         | 
         | - https://news.ycombinator.com/item?id=26143976
        
       | kitd wrote:
       | _Cores ... transfer data directly to each other rather than using
       | a pooled memory area that can become a bottleneck for data_
       | 
       | This sounds like transputers. IIRC they had direct memory linkage
       | too.
        
         | thawkins wrote:
         | yep each transputer had 4 serial links, which could be
         | connected in matrixes, and allow data to be moved across the
         | array directly. They used a language called OCCAM, and there
         | was an OS called TAOS that allowed you to run various
         | transputer topologies.
        
           | kitd wrote:
           | OCCAM!! That was it! Thx I was trying to remember the name of
           | the language they used.
           | 
           | I wonder whether OCCAM would be usable here, or if not, then
           | something based on the lessons learnt from it.
        
             | TickleSteve wrote:
             | OCCAM implemented CSP (Communicating Sequential Processes)
             | that Go took a lot of lessons from.
             | 
             | (https://en.wikipedia.org/wiki/Communicating_sequential_pro
             | ce...)
        
           | TickleSteve wrote:
           | TAOS also ran on many other architectures (68K, x86, ARM). It
           | implemented a virtual processor (VP1) and evolved into Elate
           | but didn't get much recognition despite its novelty.
           | 
           | TAOS was definitely ahead of its time, but rarely gets
           | mentioned unfortunately.
        
           | rbanffy wrote:
           | Dick Pountain wrote a lot of articles for BYTE about that.
           | 
           | This is one: https://sites.google.com/site/dicknewsite/home/c
           | omputing/pri...
        
       | DrNosferatu wrote:
       | ...from 2016. back in grad school, I did a comparison between the
       | (then new) XeonPhi and GPUs to run my own N-Body engine using the
       | same OpenCL implementation - the rest is history: GPUs still rule
       | today for high intensity computational needs. Look at Machine
       | Learning: GPUs are _so good_ at SIMD, that if your domain can
       | benefit from faster computation, _you should make it SIMD_,
       | if it's not already. ;)
        
         | klelatti wrote:
         | It's sad that OpenCL development seems to have lost its way. I
         | have some OpenCL workloads where running on a CPU is faster
         | and some where GPUs are much quicker. The beauty of OpenCL is
         | that I can choose between the two with zero code changes.
         | 
         | But CUDA and ML have steamrollered everything!
        
           | MayeulC wrote:
           | I heard of Halide quite recently, and find it promising:
           | https://halide-lang.org/
           | 
           | Not only can you decide on CPU vs GPU, you have more control
           | over the kind of parallelism and intermediate results.
        
           | lasagnaphil wrote:
           | I really hope SYCL (https://en.m.wikipedia.org/wiki/SYCL) can
           | gain some traction (since Intel also jumped on the GPU
           | competition and are backing a substantial amount of it)
        
           | DrNosferatu wrote:
           | Indeed, OpenCL runs on everything - if you can set it up...
           | :/
           | 
           | Actually, at the time, what I did was a 3-way comparison
           | between GPU, XeonPhi _and_ classic CPU (was it 8 cores?)
           | running the same OpenCL code. Anyways, the "many-core"
           | specimen, XeonPhi, benched at the geometric mean between
           | classic CPU and GPU - so closer to classic CPU! (we were
           | using beta hardware, our intel specialist - nice, helpful guy
           | actually - went mute after I got those results :D ) On top of
           | that, reimplementing the same algo in CUDA gave ~10%
           | advantage over OpenCL on NVidia hardware. I wonder how it
           | would fare on an FPGA of the same price/TDP?
           | 
           | And more relevant for 2021:
           | 
           | - I wonder if we would get the same kind of results using
           | Vulkan?
           | 
           | - Speaking of which, does anyone know of any good books /
           | tutorials / runnable examples of Vulkan in a compute-oriented
           | application?
           | 
           | Cheers!
        
             | DrNosferatu wrote:
             | PS: Does anyone have any experience with ArrayFire?
             | https://arrayfire.com/ (I believe it's from the people that
             | did the [GPU]Jacket for Matlab)
        
       ___________________________________________________________________
       (page generated 2021-02-15 23:02 UTC)