[HN Gopher] We used Elixir's Observer to hunt down bottlenecks
___________________________________________________________________
We used Elixir's Observer to hunt down bottlenecks
Author : todsacerdoti
Score : 107 points
Date : 2022-08-23 16:00 UTC (7 hours ago)
(HTM) web link (blog.sequin.io)
(TXT) w3m dump (blog.sequin.io)
| dminor wrote:
| Sounds like there are some very nice observability features built
| into BEAM. I wish NodeJS had something similar!
| lliamander wrote:
| The BEAM is really cool, and was actually originally intended
| to be a bare-metal operating system. That's why it has so many
| features that are useful for operations: they couldn't assume
| you'd have any other tooling available, and often didn't even
| have physical access to the machines that were running it.
| davidw wrote:
| > Second, we passed one particularly large data structure from a
| manager to a pool of dedicated worker processes. This meant we
| were reincurring the memory cost of this data structure for each
| worker process. We couldn't eliminate the repetition, but
| reducing the data to its bare essentials before passing it down
| to the workers minimizes that cost.
|
| Hard to say without knowing much about the data in question, but
| my recollection is that large Erlang/Elixir/BEAM "binaries" are
| actually not copied around. That might be a strategy for sharing
| larger things in some cases.
|
| Marshalling data is pretty easy in Erlang:
|
|     2> Bin = erlang:term_to_binary([1, 2, 3]).
|     <<131,107,0,3,1,2,3>>
|     3> erlang:binary_to_term(Bin).
|     [1,2,3]
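
The same idea can be sketched in Elixir. This is a minimal illustration, not the article's code: the data, the worker count, and the use of `Task` are all assumptions made for the example. It serializes a term once and hands the resulting binary to several processes.

```elixir
# A large term, serialized once into a single binary (example data only).
data = Enum.to_list(1..10_000)
bin = :erlang.term_to_binary(data)

# Binaries larger than 64 bytes are reference-counted and stored outside
# any process heap, so each task below receives a reference, not a copy.
sizes =
  1..3
  |> Enum.map(fn _ -> Task.async(fn -> byte_size(bin) end) end)
  |> Enum.map(&Task.await/1)

# Decoding rebuilds the full term on the heap of the calling process.
roundtrip = :erlang.binary_to_term(bin)
```

Note that the saving only holds while workers keep the data in binary form; a worker that decodes the whole binary pays the memory cost of the decoded term again.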
| realcorvus wrote:
| If the data does not change, persistent_term is useful as well.
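
A minimal sketch of that suggestion, assuming the shared structure is written once and only read afterwards. The module name `SharedData`, the key, and the sample map are hypothetical; `:persistent_term.put/2` and `:persistent_term.get/1` are the real Erlang calls.

```elixir
defmodule SharedData do
  # Hypothetical key; any term can serve as a persistent_term key.
  @key {__MODULE__, :lookup_table}

  # Store the structure once, e.g. from the manager process at startup.
  def store(data), do: :persistent_term.put(@key, data)

  # Readers get the stored term without copying it onto their own heap.
  def fetch, do: :persistent_term.get(@key)
end

SharedData.store(%{a: 1, b: 2})

# Several worker processes read the same stored term.
results =
  1..4
  |> Enum.map(fn _ -> Task.async(fn -> Map.fetch!(SharedData.fetch(), :a) end) end)
  |> Enum.map(&Task.await/1)
```

The trade-off is on the write side: replacing or erasing a stored term forces a scan of all processes, so this fits data that rarely or never changes.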
| conradfr wrote:
| A related anecdote: some months ago I had a memory leak inside a
| (greatly duplicated) GenServer while repeatedly calling a lib[0]
| function inside it, which would result in the server basically
| crashing after a while.
|
| I never understood what in that lib was causing the leak, but I
| fixed it (or more accurately mitigated it) by wrapping the call
| in a Task.async/1.
|
| Maybe that will help someone else one day.
|
| [0] https://hexdocs.pm/shoutcast/Shoutcast.html#read_meta/1
| filmor wrote:
| It was probably leaking refc binaries, see for example
| https://ferd.github.io/recon/recon.html#bin_leak-1.
|
| Running the function (which probably parses large binaries) in
| a separate process ensures that it's properly garbage collected
| in time.
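
The mitigation discussed above can be sketched as follows. `MetaReader.parse/1` is a hypothetical stand-in for the library call, and the 5-second timeout is an arbitrary choice. The point is only that the binary-heavy work runs in a short-lived process whose heap, and its references to large binaries, are released when it exits.

```elixir
defmodule MetaReader do
  # Hypothetical stand-in for a library function that parses a large binary.
  # binary_part/3 returns a sub-binary that still references the original.
  def parse(bin), do: binary_part(bin, 0, 4)

  # Run the parse in a throwaway process instead of the long-lived GenServer.
  def read_isolated(bin) do
    task =
      Task.async(fn ->
        # Copy the small result so it no longer references the large binary.
        :binary.copy(parse(bin))
      end)

    Task.await(task, 5_000)
  end
end

# Example input: a short header followed by 100,000 zero bytes.
big = "ICY 200 OK" <> :binary.copy(<<0>>, 100_000)
header = MetaReader.read_isolated(big)
```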
| conradfr wrote:
| Interesting thanks.
|
| Yes that could be it.
| austinjp wrote:
| So, the graphic at the top of the article (on mobile) is AI-
| generated, right? The character's fingers are smooshed.
|
| Interesting to see this approach to article graphics after I
| first read about it on HN recently.
| _acco wrote:
| It is. DALL-E did the heavy lifting; I tweaked it in Photoshop.
|
| > Painting of a detective from the 1800s, portrait, looking at
| a magnifying glass at a computer monitor, digital art
| [deleted]
| cpursley wrote:
| Sequin is really cool! Are y'all listening to the Postgres WAL?
| _acco wrote:
| Thanks! We considered using Postgres' WAL but decided not to
| for the time being.
|
| Our solution now uses trigger functions. These trigger
| functions fire whenever a create/update/delete happens on a
| Sequin table. They insert a row into a log table. That log
| table is processed by our workers to send changes to the
| upstream API.
|
| The advantages of using trigger functions + a log table are all
| about ease of use and compatibility: our customers don't have
| to do anything fancy to set up Sequin, we just need a role with
| `create` privileges in the database. The log table also makes
| it easy for both them and us to debug issues, as the stream of
| changes that we captured is right there in the database.
| cpursley wrote:
| Very cool.
|
| I'm using Elixir to listen to change events via
| https://github.com/cpursley/walex (which I basically ripped
| off from Supabase).
| losvedir wrote:
| This is really cool. We use Elixir at work, but we mostly use it
| in a "traditional web app" (i.e. non-Elixir) way, of Docker
| containers deployed to independent AWS instances.
|
| So I'm always intrigued by some of the more BEAM-specific things
| that folks do, like using `observer` on a remote (production??)
| node here, or distributed Elixir where the nodes communicate with
| each other, or "hot" code updates.
|
| How do companies deploy Elixir in a way that takes advantage of
| all those things? Does Sequin talk anywhere about their deploy
| process and how their infrastructure looks?
| mattbaker wrote:
| For us we have our app deployed to $N containers with a load
| balancer in front (pretty standard stuff I think?)
|
| In Erlang/Elixir you can actually override how instances of the
| BEAM find each other (instead of the standard EPMD), so
| we have a module that does some DNS queries, finds the IPs of
| the other containers and says "hi, here's your cluster,
| discovery done." (Your setup may preclude all that, I know this
| all depends on how a system's architected.)
|
| After doing that we were free to use all of Erlang's cool
| cluster stuff! In our case we have in-memory caches for a few
| things. If a given instance does a lookup because of a cache
| miss, it broadcasts a message to all the other nodes saying "I
| just looked up $expensive_thing, here's its value" so they don't
| have to do the lookup themselves; they just cache that value.
| You end up with a little distributed cache in a few lines of
| code. In our case, btw, these cache entries are short lived, and
| a little inconsistency does us no harm if one of our instances
| misses the message (networks are networks). It's been great!
|
| Anyway, I think it's super cool and I'd encourage you to play
| around if you get the chance.
|
| Also the observer is just amazing. We've debugged some pretty
| weird memory and CPU usage issues with it. I have some internal
| blog posts; maybe I should see if I could make them public.
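
A rough sketch of the broadcast-on-miss cache described above, under stated assumptions: the module name `BroadcastCache` is hypothetical, the cache is a plain map inside a GenServer, and entry expiry (the comment mentions short-lived entries) is left out. `GenServer.abcast/2` is the real function; by default it casts to the named process on the local node and on every connected node.

```elixir
defmodule BroadcastCache do
  use GenServer

  def start_link(_opts \\ []) do
    GenServer.start_link(__MODULE__, %{}, name: __MODULE__)
  end

  # Return the cached value, or compute it and tell every node about it.
  def fetch(key, compute) when is_function(compute, 0) do
    case GenServer.call(__MODULE__, {:get, key}) do
      {:ok, value} ->
        value

      :miss ->
        value = compute.()
        # Fire-and-forget: a node that misses this message simply
        # performs its own lookup later.
        GenServer.abcast(__MODULE__, {:put, key, value})
        value
    end
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:get, key}, _from, state) do
    reply =
      case Map.fetch(state, key) do
        {:ok, value} -> {:ok, value}
        :error -> :miss
      end

    {:reply, reply, state}
  end

  @impl true
  def handle_cast({:put, key, value}, state) do
    {:noreply, Map.put(state, key, value)}
  end
end

{:ok, _pid} = BroadcastCache.start_link()
first = BroadcastCache.fetch(:expensive_thing, fn -> 42 end)
```

On a single, unclustered node the broadcast only reaches the local cache, so the sketch can be tried in a plain `iex` session before any clustering is set up.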
| JohnCurran wrote:
| Can you speak more to how you bypass EPMD and send the IPs of
| the containers to each other? That would be great for a
| problem we're seeing where I work.
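
The thread does not show the discovery module itself, so the following is only a guess at the DNS step. It assumes every container's node is named `<basename>@<ip>`, that all nodes share a cookie, and that a single hostname resolves to the A records of all containers. The module and function names are hypothetical. Replacing EPMD itself is a separate step that this sketch does not attempt.

```elixir
defmodule DnsDiscovery do
  # Build node names such as :"app@10.0.0.1" from resolved IPv4 tuples.
  def node_names(ips, basename) do
    Enum.map(ips, fn ip -> :"#{basename}@#{:inet.ntoa(ip)}" end)
  end

  # Resolve the A records for `hostname` and try to connect to each peer.
  # Returns a list of {node_name, result} pairs from Node.connect/1.
  def connect(hostname, basename) do
    hostname
    |> String.to_charlist()
    |> :inet_res.lookup(:in, :a)
    |> node_names(basename)
    |> Enum.reject(fn name -> name == node() end)
    |> Enum.map(fn name -> {name, Node.connect(name)} end)
  end
end

# Pure part of the sketch, usable without any network access.
names = DnsDiscovery.node_names([{10, 0, 0, 1}, {10, 0, 0, 2}], "app")
```

Calling `DnsDiscovery.connect/2` periodically would pick up containers that appear later; the libcluster project linked elsewhere in this thread packages this kind of strategy.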
| cpursley wrote:
| Distributed Elixir on Render is crazy easy. Fly.io also looks
| neat.
| lycos wrote:
| Distributed Elixir can be done with Docker containers too, see
| https://github.com/bitwalker/libcluster which by default has
| some Kubernetes support but you can also have third party (or
| custom) clustering strategies. I've not done this myself, but
| I've seen a lot of articles about it over the past few years.
|
| Hot code updates for most applications aren't really worth it
| in my opinion, assuming you do something like blue/green
| rollover deployments. It's cool that it's possible though. But
| it requires appup files and afaik Distillery is one of the
| release tools that has support for it built-in.
| ranyefet wrote:
| If you deploy to fly.io it should be very easy to create a
| cluster of elixir nodes.
| conradfr wrote:
| I think the screenshot under the "Memory" section is not the
| correct one.
| _acco wrote:
| Fixed, thanks!
| ananthakumaran wrote:
| recon and observer_cli are the tools I reach for first to debug
| any issues in production. In any other language, I usually think
| about how to reproduce the issue locally. With Elixir, I just get
| into a remote shell on the affected machine and live debug the
| issue, and there are cases where we applied a hotfix by using
| eval right there from the shell. The idea of the remote shell
| itself is alien to most languages.
| busterarm wrote:
| And unfortunately the kind of thing that compliance flags as a
| big no-no once you've got any kind of filing or privacy
| requirements.
| jon-wood wrote:
| This sort of thing doesn't have to be a compliance breach,
| but you will likely need some way of ensuring there's a
| second person in the loop. Typically that would take the form
| of having someone in a separate production infrastructure
| team actually driving while you talk them through what
| needs to happen.
| busterarm wrote:
| Yes and with the added benefit of having to explain that
| control to your rotating bunch of compliance people every
| single year.
|
| I'm not criticizing the methodology as much as the useless
| performative nature of compliance work.
| d4mi3n wrote:
| Compliance is performative until it isn't. If you've ever
| been party to a breach, the role of compliance and an
| audit trail to the security narrative becomes _very_
| important. Consider:
|
| 1. We had a breach. A factor in this was insufficient
| oversight on a process that granted privileged access to
| customer data. We fixed the problem, promise that your
| data is safe, and don't believe this will happen again.
|
| 2. We had a breach. A factor in this was a gap in an existing
| control around customer data that had a problem we had not
| anticipated. These were the people
| involved. This is exactly how this problem occurred. This
| is the data that was exposed. This is documentation of
| our response to this incident. This is our existing
| policy around how we handle data and how we respond to
| breaches.
|
| Customers, partners, regulators, and law enforcement
| respond a lot better when you can demonstrate good intent
| and at least imply that you have some kind of process. Of
| the two scenarios I outlined, the latter provides those
| assurances.
|
| Compliance isn't the only way to do this, but it's often
| the easiest.
| mattbaker wrote:
| Still wildly useful debugging things locally too!
___________________________________________________________________
(page generated 2022-08-23 23:00 UTC)