[HN Gopher] Laptop to Lambda: Outsourcing Everyday Jobs to
Thousands of Transient Containers
___________________________________________________________________
Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of
Transient Containers
Author : mlerner
Score : 99 points
Date : 2021-07-25 16:06 UTC (6 hours ago)
(HTM) web link (www.micahlerner.com)
(TXT) w3m dump (www.micahlerner.com)
| chubot wrote:
| 2019 thread on the paper:
| https://news.ycombinator.com/item?id=20433315
|
| Another review: https://buttondown.email/nelhage/archive/papers-
| i-love-gg/
| keithwinstein wrote:
| Thank you for this nice writeup! This paper was led by my student
| Sadjad Fouladi (https://sadjad.org/), part of a broader theme of
| coercing a "purely functional-ish" design onto everyday
| applications. There's a less academic-ese version with a few
| extended results that was published in ;login: magazine (https://
| www.usenix.org/system/files/login/articles/login_fal...). There
| was also a good analysis here
| (https://buttondown.email/nelhage/archive/papers-i-love-gg/) and
| don't miss https://buttondown.email/nelhage/archive/http-
| pipelining-s3-... .
|
| Some of Sadjad's other work has included:
|
| - ExCamera, which somewhat kicked off the trend of "fire up 4,000
| lambda workers in a burst, all working on one job" -- for things
| like searching a video frame-by-frame with a neural network,
| compressing video in parallel at sub-GOP granularity, etc.
| (https://news.ycombinator.com/item?id=16197253)
|
| - Salsify, which reused the "purely functional" video codec from
| ExCamera to improve WebRTC/Zoom-style live video
| (https://news.ycombinator.com/item?id=16964112 ,
| https://news.ycombinator.com/item?id=20794541). Sadjad is giving
| an Applied Networking Research Prize talk about this work at IETF
| tomorrow.
|
| - 3D ray-tracing (running PBRT on thousands of Lambdas, sending
| rays across the network), SMT/SAT solving, etc.
|
| We're working to extend this line of work towards a more general,
| Wasm-based, "purely functional" operating system where most
| computations operate on content-addressed data and are content-
| addressed themselves, and determinism and reproducibility are
| properties guaranteed by the OS. Sort of analogous to how the
| operating systems of today (try to) guarantee memory isolation
| between processes. Imagine, e.g., a Git repository where you
| could represent the fact that "blob <x> is the result of running
| computation <y> given tree <z> as input," and anybody can verify
| that result, or rebase the computation to run on top of their own
| input. If you're interested in this general area, please consider
| doing a PhD at Stanford and/or get in touch -- I'm hiring.
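|
| A minimal sketch of that kind of record, assuming SHA-256 content
| addressing and an abstract deterministic run step (the names here
| are illustrative, not the actual system's format):
|
|     import hashlib
|
|     def blob_hash(data: bytes) -> str:
|         # A blob's name is the hash of its bytes.
|         return hashlib.sha256(data).hexdigest()
|
|     def claim(computation: str, input_tree: str, result: str) -> dict:
|         # "blob <result> is the result of running computation
|         # <computation> given tree <input_tree> as input"
|         return {"computation": computation,
|                 "input": input_tree,
|                 "result": result}
|
|     def verify(c: dict, run) -> bool:
|         # Anyone can re-run the deterministic computation on the
|         # same input and check that it reproduces the claimed
|         # blob, much as anyone can verify a Git object's hash.
|         out = run(c["computation"], c["input"])
|         return blob_hash(out) == c["result"]
|
| Rebasing the computation onto your own input is then just
| evaluating the same computation against a different tree.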
| boulos wrote:
| Hi, Keith! Glad to see you're still enjoying hipster compute
| :).
|
| What do you think about Cloudflare Workers, fly.io, and similar
| "run pure-ish functions anywhere" platforms? I no longer have any
| skin in the game, but it seems to me that "ignoring locality"
| just means having to reinvent locality later on.
| keithwinstein wrote:
| Heya, great to see you pop up here! I gotta be honest -- I
| think EC2 (and, in general, doing computation in units of
| VMware/Xen-style virtual PCs) is the actual hipster compute
| substrate. AWS Lambda feels closer to cgi-bin from 1995, i.e.
| back when things still made sense. (Have you ever joined a
| tech company and been handed a 10 gigabyte VM image that uses
| a Vagrant pipeline to provision itself so you can get a
| working dev environment, except the pipeline only works if
| 100% of its 10,000 downloads succeed, so the whole thing is
| super-flaky, but nobody at the company knows because they
| only ran it once when they first joined and have just kept
| the same local dev VM ever since? That's what hipster compute
| means to me.)
|
| All that aside, Cloudflare Workers/fly.io/Fastly
| Compute@Edge/Lucet/Google Cloud Run seem really cool, and the
| resulting work on Wasm and its ecosystem is fantastic, but
| they're also not exactly what excites me. Deploying code
| close to the edge (or "anywhere" in particular) isn't very
| important if the application only makes one round-trip. Even
| if my code is pure, it's not like fly.io is willing to sign a
| certificate saying, "We evaluated function <y> on input <z>
| and the correct answer is <x>, signed Fly Inc., and if you
| can prove us wrong in the next 10 years, our insurance
| company will pay you $1 million from our E&O policy." Which
| would really be cool. And, I don't know of people spinning up
| 4,000 nodes on those systems in 100 ms to do a 1-second-long
| computation. I haven't seen any of the providers or outsiders
| benchmarking the "burst-to-N,000-nodes latency" numbers
| averaged over many trials at various times of day. (We
| measured GKE a small number of times in the gg paper [fig. 7]
| and found it to be... really slow at that particular metric.)
|
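| A rough sketch of how one might take that measurement, assuming
| boto3 and an already-deployed no-op function (the name "noop" and
| the burst size are illustrative; you'd repeat over many trials
| and times of day for a fair number):
|
|     import time
|     import boto3
|     from concurrent.futures import ThreadPoolExecutor
|
|     lam = boto3.client("lambda")  # boto3 clients are thread-safe
|
|     def invoke(_):
|         t0 = time.monotonic()
|         lam.invoke(FunctionName="noop", Payload=b"{}")
|         return time.monotonic() - t0
|
|     N = 1000  # target burst size
|     start = time.monotonic()
|     with ThreadPoolExecutor(max_workers=N) as ex:
|         lat = list(ex.map(invoke, range(N)))
|     total = time.monotonic() - start
|     print("burst-to-%d wall time: %.2fs" % (N, total))
|     print("slowest single invoke: %.2fs" % max(lat))
|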
| I don't think we want to ignore locality! But I do want the
| OS to be able to secure access to thousands of cores in <1
| second for <10 second duration workloads, and I think many
| applications would be willing to _compromise_ on locality, or
| accept heterogeneous/irregular locality, in exchange for
| that. I'd still love _visibility_ into the locality I end up
| with, I'd love not to have to do flaky NAT-traversal hacks
| to get direct communication among nodes, and I could imagine
| the application bidding more to persuade the infrastructure
| owner to provide computation in larger units (e.g. more cores
| on fewer machines, machines in a placement group with full
| bisection bandwidth, etc.), which is sort of where Lambda
| seems to be heading already.
|
| (Long term, I don't really think applications should be
| renting cores and RAM per unit time and thinking about
| locality; I'd love to be dealing with the infrastructure
| provider in terms of some higher-level abstraction, because
| then you could imagine the provider might be genuinely
| incentivized to discover better ways of computing the same
| answer, to our mutual benefit.)
| r3trohack3r wrote:
| I'm loving this train of thought Keith.
|
| What are your thoughts on program correctness and runaway
| cost? I'm a little uncomfortable running a workload that
| could scale unexpectedly into a denial-of-wallet situation.
|
| For this research, how did you enforce bounds on your
| workload to prevent exceeding your funding budget? Is the
| whole compute graph calculated locally? The recursive
| workloads seem particularly anxiety-inducing.
| seg_lol wrote:
| I too have been thinking about <<general, Wasm-based, "purely
| functional">> content-addressed computation.
|
| I think it can support both legacy applications and purely
| functional uses. I really want to support both; the case
| for linking against any git commit and doing live differential
| testing is really enticing. I toyed with a serverless
| deployment system years ago where code was callable by githash
| and was run directly from git. One could execute any version at
| any time. This system would be able to automatically rerun
| executions against new code to track regressions across many
| dimensions. On failure, the system could fall back to older
| code paths. TBD how to manage modularity and coherence across
| sets of functions; restart might need to happen at a much
| higher level.
|
| For processing an input stream, I think the lambdas would need
| to be tail-recursive so that the internal state could be
| externally checkpointed:
| stream_setup/process_chunk/stream_close. process_chunk would
| need to emit either a total copy of its internal state or a
| token linking to persistent storage.
|
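| A tiny sketch of that shape (the arithmetic is a stand-in; the
| three function names are from the comment above):
|
|     def stream_setup() -> dict:
|         # Initial state, small enough to checkpoint wholesale.
|         return {"count": 0, "total": 0}
|
|     def process_chunk(state: dict, chunk: list) -> dict:
|         # Pure step: prior state in, successor state out. The
|         # platform persists the returned state (or a token
|         # pointing at it in storage) before the next chunk, so
|         # any worker can resume from the last checkpoint.
|         return {"count": state["count"] + len(chunk),
|                 "total": state["total"] + sum(chunk)}
|
|     def stream_close(state: dict) -> float:
|         # Final fold over the last checkpointed state.
|         return state["total"] / max(state["count"], 1)
|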
| Curious what your current set of basis functions is and how
| failures are accounted for.
| haolez wrote:
| This could be very useful for quantum chemistry simulations,
| which are generally parallelizable and very CPU-intensive. If gg
| gets tweaked to support MPI, this niche could see a
| breakthrough!
| z3ncyberpunk wrote:
| Disgustingly wasteful
| JZL003 wrote:
| My work uses GCP, not AWS, so I've been experimenting with Google
| Cloud Run (it's actually parallelizing R code, so I need the
| Docker container infra). My only problem is that I have very
| bursty usage and the auto-scaling is too slow. I made one attempt
| [1] to encourage larger allocation but don't know another way. Do
| people have experience with this?
|
| [1] Slightly costly, but ~5 minutes before I need it, I set the
| minimum instance count to a larger number so it starts ramping
| up; then, when I'm done, I lower it (see the sketch below).
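|
| For reference, that pre-warming can be scripted; a sketch
| assuming the gcloud CLI, with an illustrative service name and
| region:
|
|     import subprocess
|
|     def set_min_instances(service: str, n: int, region: str) -> None:
|         # Raise the instance floor ahead of the burst, then
|         # lower it once the work is done.
|         subprocess.run(["gcloud", "run", "services", "update",
|                         service, "--min-instances", str(n),
|                         "--region", region],
|                        check=True)
|
|     set_min_instances("my-r-job", 20, "us-central1")  # ~5 min early
|     # ... submit the bursty work ...
|     set_min_instances("my-r-job", 0, "us-central1")   # scale down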
| neolog wrote:
| Hi Micah, I'd like to follow these posts but I don't like signing
| up. Would you mind adding an Atom/RSS feed?
| mlerner wrote:
| Thanks for the feedback! Does this Atom feed work for you?
| https://www.micahlerner.com/feed.xml
| neolog wrote:
| Yep, thanks.
| MarkSweep wrote:
| There is a feed. It is titled "untitled", though, so it may be
| hard to find in your feed reader after you add it:
|
| https://www.micahlerner.com/feed.xml
| gumby wrote:
| Good find -- I searched in the page source and there was no
| reference to it.
| gumby wrote:
| I filed a GitHub issue asking the author to enable the RSS feed
| which their web tool (Hugo) has built in.
| mlerner wrote:
| Thanks!
___________________________________________________________________
(page generated 2021-07-25 23:01 UTC)