[HN Gopher] Thinking in an array language
___________________________________________________________________
Thinking in an array language
Author : tosh
Score : 114 points
Date : 2022-05-14 12:21 UTC (10 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| carapace wrote:
| I was just playing with Nils M Holm's Klong this morning:
| https://t3x.org/klong/index.html (Klong rather than the others
| mostly because the C implementation looks like C so I have a
| ghost of a chance of actually grokking it.)
|
| These folks are really onto something, but I think they get
| sidetracked in the (admittedly very very fun) minutia of the
| languages and lose sight of the crucial insight in re:
| mathematical notation, to wit: it's a means of human
| communication.
|
| For APL or K to get out of their niches would require, I am
| convinced, something like a tome of documentation of a ratio of
| about 1.5 paragraphs per line of code. That would give us mere
| mortals a fighting chance at grokking these tools.
|
| A similar problem plagues the higher-order stuff they're pursuing
| over in Haskell land. I _know_ e.g. "Functional programming with
| bananas, lenses, envelopes and barbed wire" and "Compiling to
| Categories" are really important and useful, but I can't actually
| use them unless some brave Prometheus scales Olympus and returns
| with the fire.
|
| Stuff dribbles out eventually. Type inference and checking have
| finally made it into the mainstream after how many decades?
| jodrellblank wrote:
| > " _would require, I am convinced, something like a tome of
| documentation of a ratio of about 1.5 paragraphs per line of
| code. That would give us mere mortals a fighting chance at
| grokking these tools._ "
|
| You can download a free PDF copy of Mastering Dyalog APL by
| Bernard Legrand which is 700+ pages, from here:
|
| https://www.dyalog.com/mastering-dyalog-apl.htm
| snidane wrote:
| Array notation is great, but only for single-array operations
| and dense array operations.
|
| The moment you need to run complex group-bys and store
| non-contiguous data efficiently, it gets awkward pretty quickly.
|
| On the other hand, operations on dense data are pretty
| cumbersome in SQL. You can only do so much with limited support
| for proper scan algorithms, merge joins, or bolted-on window
| functions.
|
| Please somebody combine APL with SQL and you win the programming
| language wars.
| nerdponx wrote:
| R? Julia? Python with Pandas even.
| hoosieree wrote:
| > Please somebody combine APL with SQL and you win the
| programming language wars.
|
| kdb+ and q fit this description. Docs here:
| https://code.kx.com/q/basics/qsql/
|
| Here's an example.
|
| In SQL:
|
|       SELECT stock, SUM(amount) AS total
|       FROM trade GROUP BY stock
|
| In q:
|
|       q)select total:sum amt by stock from trade
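Since pandas came up as another candidate for "APL plus SQL", here is the same group-by aggregate sketched in pandas, with a made-up `trade` table standing in for the one in the SQL/q example:

```python
import pandas as pd

# Made-up trade table mirroring the SQL/q example above
trade = pd.DataFrame({
    "stock": ["AAPL", "AAPL", "MSFT"],
    "amt":   [100, 50, 200],
})

# Equivalent of: SELECT stock, SUM(amount) AS total FROM trade GROUP BY stock
total = (trade.groupby("stock", as_index=False)["amt"]
              .sum()
              .rename(columns={"amt": "total"}))
print(total)
```
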
| mlochbaum wrote:
| I would say APL-family languages today have largely addressed
| these concerns with operators such as Key[0][1] and nested
| arrays. J also has built-in support for sparse arrays. Some
| more complicated things like storing a list of lists in a
| compact representation (perhaps lengths and data) aren't
| supported natively, but I'd consider that a niche concern as a
| list of lists will have similar performance, just with more
| memory use.
|
| There's a lot of database software built on array languages,
| with kdb and Jd being the most prominent as far as I know.
|
| [0] https://aplwiki.com/wiki/Key
|
| [1] https://code.jsoftware.com/wiki/Vocabulary/slashdot#dyadic
| ogogmad wrote:
| Array languages might make good maths notation. It's terse and
| easy to write, and there's a logical naming scheme (for instance,
| matrix multiplication is just (+/*)\: ). I suppose the trick is
| to think of (+/*)\: as one unit.
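The decomposition behind `(+/*)\:` ("sum-reduce after elementwise multiply") can be mimicked in numpy as well; a minimal sketch with made-up example matrices, where broadcasting pairs rows with columns and stands in for K's each-left:

```python
import numpy as np

# Made-up example matrices
a = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([[5.0, 6.0], [7.0, 8.0]])

# a[:, None, :] pairs each row of a with each row of b.T (i.e. each
# column of b); * is the elementwise multiply, .sum(-1) is the
# +/ reduction over the shared axis.
c = (a[:, None, :] * b.T[None, :, :]).sum(-1)

assert np.allclose(c, a @ b)  # matches the built-in matrix multiply
```
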
| pxeger1 wrote:
| APL, an array language, literally started out as a computer
| adaptation of traditional mathematical notation.
|
| https://aplwiki.com/wiki/Iverson_notation
|
| https://aplwiki.com/wiki/Comparison_with_traditional_mathema...
| hoosieree wrote:
| Math notation is highly context dependent (is + addition or
| boolean OR?) and yet authors rarely feel the need to provide
| context.
|
| If they wrote in an array language instead of LaTeX, not only
| would it make writing papers easier (+/ is shorter than either
| \Sigma or \sum), but it would be trivially reproducible, due to
| being an executable notation.
| kmstout wrote:
| Somewhere on YouTube is a talk where Gerald Sussman described
| mathematics (or at least mathematical presentation) as
| "impressionistic."
| Mathnerd314 wrote:
| I think he says that in every talk he gives.
| https://youtu.be/HB5TrK7A4pI?t=1090,
| https://youtu.be/arMH5GjBwUQ?t=377,
| https://youtu.be/EbzQg7R2pYU?t=3201, etc. There's a 2002
| paper where he uses the term:
| https://dspace.mit.edu/bitstream/handle/1721.1/6707/AIM-2002...
| btheshoe wrote:
| This all seems very similar to writing vectorized code in numpy
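A small illustration of that resemblance, using a made-up row-centering task: the scalar-at-a-time loop and the whole-array numpy expression compute the same result.

```python
import numpy as np

# Made-up data: a small matrix whose rows we want to center
x = np.arange(12, dtype=float).reshape(3, 4)

# Scalar-at-a-time version: loop over rows, subtract each row's mean
loopy = x.copy()
for i in range(loopy.shape[0]):
    loopy[i] -= loopy[i].mean()

# Array-at-a-time version: one broadcasted expression, no explicit loop
vectorized = x - x.mean(axis=1, keepdims=True)

assert np.allclose(loopy, vectorized)
```
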
| geophile wrote:
| Sort of related: thinking in relational algebra, or SQL. It
| appears to be "natural" to think about computing one atomic value
| at a time, in loops, or slightly less intuitively, recursive
| functions. (That latter choice may follow from whether your first
| language was pure Lisp.)
|
| I was fortunate to have a teacher whose database course drilled
| relational algebra into us. This was in the 70s, shortly after
| Codd's paper, and well before SQL was invented, much less
| established. Now I think about much computation algebraically
| (and often functionally). But I do see that this is "unnatural"
| for many, including students, having taught databases for several
| years.
|
| SQL reflects this. I often see students writing nested
| subqueries, because that is more procedural, where joins would be
| a cleaner choice. A colleague of mine wrote a paper many years
| ago, pointing out that thinking procedurally is more "natural"
| for many: https://dl.acm.org/doi/10.1145/319628.319656. But
| thinking set-at-a-time instead of one-at-a-time is a valuable
| skill, not that far off from thinking functionally.
| [deleted]
| snidane wrote:
| It's easier not to mess up table based filters using explicit
| semi-join operators (eg. in, not in, exists) instead of using
| regular joins because joins can introduce duplicates.
|
| Give me 'any join' operation - ie. just select the first value
| instead of all, and I'll happily use joins more. They are
| actually more intuitive.
|
| It's not that relational algebra is unintuitive. It's that
| standard SQL sucks.
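A minimal sketch of the duplication hazard, using made-up `orders`/`payments` tables in sqlite3: a plain JOIN used as a filter repeats a row once per match, while EXISTS (a semi-join) keeps each row at most once.

```python
import sqlite3

# Made-up tables: filter orders down to customers who have any payment.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(id INTEGER, cust TEXT);
    CREATE TABLE payments(cust TEXT);
    INSERT INTO orders VALUES (1, 'a'), (2, 'b');
    INSERT INTO payments VALUES ('a'), ('a');  -- customer 'a' paid twice
""")

# Plain JOIN as a filter: order 1 comes back once per matching payment
joined = con.execute(
    "SELECT o.id FROM orders o JOIN payments p ON o.cust = p.cust"
).fetchall()

# EXISTS (semi-join): each order appears at most once
semi = con.execute(
    "SELECT o.id FROM orders o WHERE EXISTS "
    "(SELECT 1 FROM payments p WHERE p.cust = o.cust)"
).fetchall()

print(joined)  # order 1 is duplicated by the join
print(semi)    # order 1 appears once
```
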
| magicalhippo wrote:
| Indeed, I've taught myself to only use JOIN when I actually
| need some data from the table I join. For everything else I
| use EXISTS and friends.
|
| I was thinking SQL could do with a keyword for that, like
| maybe FILTER, that looks like a JOIN but works like EXISTS.
| snidane wrote:
| Clickhouse implements an explicit SEMI join. It can be
| called semi or any, it doesn't really matter. It's just
| another join modifier:
|
|       [OUTER|SEMI|ANTI|ANY|ASOF]
|
| https://clickhouse.com/docs/en/sql-reference/statements/sele...
| nerdponx wrote:
| My problem with semijoins is that the semantics of "what
| exactly does a SELECT evaluate to inside an expression" are
| sometimes murky and might vary across databases.
| magicalhippo wrote:
| Could you expand a little?
| fiddlerwoaroof wrote:
| "First" doesn't make sense without an order
| jodrellblank wrote:
| It even has its own tag on StackOverflow:
| https://stackoverflow.com/questions/tagged/greatest-n-per-gr...
|
| People who want it, want it with an order.
|
| Look at
|
| https://stackoverflow.com/questions/121387/fetch-the-row-whi...
|
| and
|
| https://stackoverflow.com/questions/3800551/select-first-row...
|
| and
|
| https://stackoverflow.com/questions/8748986/get-records-with...
|
| and their combined thousands of votes and dozens of
| answers, all full of awkward workarounds or ill-performing
| or specialised-for-one-database-engine code for this common
| and desirable thing, which would be trivial with a couple of
| boring loops in Python.
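For concreteness, here is the "couple of boring loops" version of greatest-n-per-group, with made-up rows: keep, per group, the row with the largest value.

```python
# Made-up rows: (group, value, payload). Goal: for each group, keep the
# row with the largest value -- the classic greatest-n-per-group task.
rows = [
    ("a", 3, "x"), ("a", 7, "y"),
    ("b", 5, "p"), ("b", 2, "q"),
]

best = {}
for group, value, payload in rows:
    # keep this row if it's the first, or the largest, seen for its group
    if group not in best or value > best[group][1]:
        best[group] = (group, value, payload)

print(sorted(best.values()))  # one winning row per group
```
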
| snidane wrote:
| It does make sense for semi-joins. I care about the key,
| not the value.
|
| Random order is also a valid order.
| [deleted]
| agumonkey wrote:
| I think this is related to "wholemeal" programming as some
| haskellers/FP-ists do, thinking in sets/tree/graphs operations.
| webmaven wrote:
| > SQL reflects this. I often see students writing nested
| subqueries, because that is more procedural, where joins would
| be a cleaner choice.
|
| In my experience, in the non-ad-hoc use-case, views can often
| be substituted for the procedural approach, forming the
| equivalent of a POSIX pipe.
|
| > A colleague of mine wrote a paper many years ago, pointing
| out that thinking procedurally is more "natural" for many:
| https://dl.acm.org/doi/10.1145/319628.319656. But thinking set-
| at-a-time instead of one-at-a-time is a valuable skill, not
| that far off from thinking functionally.
|
| Hmm. Given the proliferation of tabular data tools (especially
| spreadsheets) over the intervening 40 years, I wonder if those
| results would remain the same today (and whether there would be
| any difference among Excel power users who use pivot tables,
| etc.)
| TFortunato wrote:
| Having taught it, do you have any recommendations for folks who
| are looking to improve their thinking in relations/sets/SQL
| skills?
| co_dh wrote:
| https://arxiv.org/abs/1803.05316
| IshKebab wrote:
| "What if we could make entire programs as unreadable as regexes?"
| -K
| icsa wrote:
| And yet regular expressions are a part of most programming
| languages as a library or built-in syntax.
|
| Why is that?
| unnouinceput wrote:
| Any performance benchmarks between a simple C implementation of
| matrix multiplication like the one explained in the article and
| the K one?
|
| Story time: 2 years ago, right before the 1st lockdown, I landed
| a client, a data scientist who had already had an algorithm
| dealing with matrices implemented by a previous programmer he
| hired. Said implementation was in Python, no more than 10 lines
| altogether, which performed well when matrix sizes were small,
| like 10x10. But the problem was, his real work needed matrices
| of size 10^6 x 10^6. Not only did the Python implementation have
| to be run on a beast of a server with 4 TB of memory, it also
| took 3 days to finish. And while the algorithm was small in
| Python, its explanatory paper was 4 pages in total, which took
| me 1 week to understand, and then an entire month to implement
| in C. But in the end, when all was said and done, the run time
| with real data was only 20 minutes and it consumed only 8 GB of
| memory, though it did require at least 16 virtual processors.
|
| Hence my question: in the end, performance is what matters, not
| the number of lines.
| mlochbaum wrote:
| K timing, squaring a 100x100 matrix in ngn/k:
|
|       a:100 0N#?10000      / 100x100 double matrix
|       \t:1000 a(+/*)\:a    / Total of 1k runs in ms
|       1514
|
| The following C program outputs 272 when compiled and run with
| -O3 -march=native.
|
|       #include <stdio.h>
|       #include <time.h>
|
|       size_t monoclock(void) {
|           struct timespec ts;
|           clock_gettime(CLOCK_MONOTONIC, &ts);
|           return 1000000000*ts.tv_sec + ts.tv_nsec;
|       }
|
|       int main() {
|           const size_t n = 100;
|           double a[n][n], r[n][n];
|           size_t t = monoclock();
|           for (size_t iter=0; iter<1000; iter++) {
|               for (size_t i=0; i<n; i++) {
|                   for (size_t j=0; j<n; j++) {
|                       double sum = 0;
|                       for (size_t k=0; k<n; k++)
|                           sum += a[i][k]*a[k][j];
|                       r[i][j] = sum;
|                   }
|               }
|           }
|           printf("%lld\n", (monoclock() - t)/1000000);
|       }
|
| So K is slower, by a factor of 5 rather than the 200 you saw
| with Python. K appears to scale to larger matrices better than
| C: 8664 in K vs 2588 in C for 200x200, but I can't be bothered
| to sort out heap allocation for the C code to do larger sizes.
| I would certainly not say that your story supports the idea
| that performance is what matters. In the month you took to
| implement the C version, the Python solution could have run ten
| times over, and there are many computations that only ever need
| to be performed once. Besides, I bet you were glad to have
| working Python code to use as a reference!
| unnouinceput wrote:
| Nice benchmarking, thank you for your effort.
|
| Also, the scientist was leading a team, so my program would've
| been used by at least 20 people, 10 times per day. That was why
| the Python one was a no-go for them from the beginning.
| harshreality wrote:
| Can you share the algorithm, or anything computationally
| equivalent, for people to try benchmarking different
| implementations?
| unnouinceput wrote:
| The one I wrote 2 years ago? That's intellectual property
| of said data scientist, not mine. Al I can say is that I
| parallelize it a lot, hence the entire month. From
| programming point of view is a mess and hard to follow
| its ~5k lines. Usually parallel programming is a mess,
| you should take a look at any parallelization CUDA code
| available on GitHub.
| johndough wrote:
| When you compile the C example with the compiler option
| "-Wall", you'll get a warning that the variable 'r' is set
| but not used, so the compiler would be free to simply skip
| the for loops. In fact, if you compile with clang instead of
| gcc, the compiler will do just that and you'll get almost
| zero computation time.
|
| It would be better to do something with the computed result
| so the compiler does not remove the computation, e.g. print a
| randomly selected value.
|
| I also benchmarked the fixed C code against Python, and
| "Python" (which uses OpenBLAS under the hood in my case) was
| 10 times faster:
|
|       import time
|       import numpy as np
|
|       a = np.random.rand(100, 100)
|       t = time.perf_counter()
|       for _ in range(1000):
|           a @ a
|       print(time.perf_counter() - t, "seconds")
|
| Implementation matters a lot for matrix multiplication.
___________________________________________________________________
(page generated 2022-05-14 23:01 UTC)