[HN Gopher] Scaling Kafka at Honeycomb
___________________________________________________________________
Scaling Kafka at Honeycomb
Author : i0exception
Score : 72 points
Date : 2021-11-30 19:25 UTC (3 hours ago)
(HTM) web link (www.honeycomb.io)
(TXT) w3m dump (www.honeycomb.io)
| rubiquity wrote:
| I've never used Kafka but this post is yet another hard-earned
| lesson in log replication systems: storage tiering should be much
| higher on the hierarchy of needs than horizontal scaling of
| individual logs/topics/streams. In my experience, by the time you
| need storage tiering, something awful is already happening.
|
| During network partitions or other scenarios where your disks are
| filling up quickly it's much easier to reason about how to get
| your log healthy by aggressively offloading to tiered storage and
| trimming than it is to re-partition (read: reconfigure), which
| often requires writes to some consensus-backed metadata store,
| which is also likely experiencing its own issues at that time.
|
| Another great benefit of storage tiering is that you can
| externally communicate a shorter data retention period than you
| actually have in practice, while you really put your recovery and
| replay systems through their paces to get the confidence you
| need. Tiered storage can also be a great place to bootstrap new
| nodes from.
| StreamBright wrote:
| >> Historically, our business requirements have meant keeping a
| buffer of 24 to 48 hours of data to guard against the risk of a
| bug in retriever corrupting customer data.
|
| I have used much larger buffers before. Some bugs can lurk
| around for a while before being noticed. For example, the lack
| of something is much harder to notice.
| lizthegrey wrote:
| Yes -- now that we're just paying the cost to store once on S3
| rather than 3x on NVMe, we can plausibly extend the window to
| 72 hours or longer! Before, it was a pragmatic, constraint-
| driven compromise.
| EdwardDiego wrote:
| > July 2019 we did a rolling restart to convert from self-
| packaged Kafka 0.10.0
|
| Ouch, that's a lot of fixed bugs you weren't reaping the benefits
| of >_< What was the reason to stick on 0.10.0 for so long?
|
| After we hit a few bad ones that finally convinced our sysop team
| to move past 0.11.x, life was far better - especially recovery
| speed after an unclean shutdown. Used to take two hours, dropped
| to like 10 minutes.
|
| There was a particular bug I can't find for the life of me that
| we hit about four times in one year where the replicas would get
| confused about where the high watermark was, and refuse to fetch
| from the leader. Although to be fair to Kafka 0.10.x, I think
| that was a bug introduced in 0.11.0. Which is where I developed
| my personal philosophy of "never upgrade to a x.x.0 Kafka release
| if it can be avoided."
|
| > The toil of handling reassigning partitions during broker
| replacement by hand every time one of the instances was
| terminated by AWS began to grate upon us
|
| I see you like Cruise Control in the Confluent Platform -- did
| you try it earlier?
|
| > In October 2020, Confluent announced Confluent Platform 6.0
| with Tiered Storage support
|
| Tiered storage is slowly coming to FOSS Kafka, hopefully in
| 3.2.0, thanks to some very nice developers from AirBnB. Credit
| to the StreamNative team that FOSS Pulsar has tiered storage
| (and a schema registry) built in.
| lizthegrey wrote:
| > What was the reason to stick on 0.10.0 for so long?
|
| The aforementioned self-packaging: we were mangling the .tar.gz
| files into .debs, and we had to remember to update the debs and
| then push them out onto our systems instead of just using Apt.
| That's why Confluent's prebuilt distro helped a lot! But also,
| the team was just _afraid_ of Kafka and didn't want to touch it
| unnecessarily.
|
| > I see you like Cruise Control in the Confluent Platform, did
| you try it earlier?
|
| We definitely should have. We tried Datadog's Kafka-kit but
| found adapting it to use Wavefront or Honeycomb Metrics
| products was more problematic than it needed to be.
|
| > Tiered storage is slowly coming to FOSS Kafka, hopefully in
| 3.2.0, thanks to some very nice developers from AirBnB. Credit
| to the StreamNative team, that FOSS Pulsar has tiered storage
| built-in.
|
| Yeah, we're glad the rest of the world gets to have it, and
| also glad we paid upfront for Confluent's enterprise feature
| version to get us out of the immediate bind we had in 2020.
| Those EBS/instance storage bills were adding up fast.
| EdwardDiego wrote:
| > we paid upfront for Confluent's enterprise feature version
| to get us out of the immediate bind we had in 2020.
|
| Definitely agree it's an essential feature for large datasets
| - in the past I've used Kafka Connect to stream data to S3
| for longer-term retention, but it's something else to manage,
| and getting data back into a topic if needed can be a bit
| painful.
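|
| For anyone curious, the sink side is roughly this shape --
| Confluent's S3 sink connector posted to the Connect REST API;
| the bucket, topic, flush size etc. are placeholders, so check
| the connector docs for your version:
|
|     import requests
|
|     config = {
|         "connector.class":
|             "io.confluent.connect.s3.S3SinkConnector",
|         "topics": "events",
|         "s3.bucket.name": "my-kafka-archive",
|         "s3.region": "us-east-1",
|         "storage.class":
|             "io.confluent.connect.s3.storage.S3Storage",
|         "format.class":
|             "io.confluent.connect.s3.format.json.JsonFormat",
|         "flush.size": "10000",
|         "tasks.max": "4",
|     }
|     requests.post("http://localhost:8083/connectors",
|                   json={"name": "s3-archive", "config": config})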
| lizthegrey wrote:
| getting to just use the same consistent API without
| rewriting clients was AMAZING.
| ewhauser421 wrote:
| The irony of Honeycomb using an open source tool from Datadog
| is not lost on me =)
| lizthegrey wrote:
| stand on the shoulders of giants!
| mherdeg wrote:
| It's funny how my bugbears from interacting with distributed
| async messaging (Kafka) are like 90 degrees orthogonal to the
| things described here:
|
| (1) I've occasionally wanted to know _what the actual traffic
| is_. This takes extra software work (writing some kind of
| inspector tool to consume a sample message and produce a human-
| readable version of what's inside it).
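|
| A minimal sketch of the kind of inspector I mean, assuming
| kafka-python and JSON-encoded payloads (the broker address and
| topic name are made up; swap in whatever deserialization your
| producers actually use):
|
|     # Consume one recent message and pretty-print it.
|     import json
|     from kafka import KafkaConsumer, TopicPartition
|
|     consumer = KafkaConsumer(
|         bootstrap_servers="localhost:9092",
|         auto_offset_reset="latest",
|         enable_auto_commit=False,
|         consumer_timeout_ms=10000,   # give up after 10s idle
|     )
|     consumer.assign([TopicPartition("events", 0)])
|
|     for record in consumer:
|         print(f"partition={record.partition} "
|               f"offset={record.offset} key={record.key}")
|         print(json.dumps(json.loads(record.value), indent=2))
|         break   # one sample is enough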
|
| (2) I sometimes see problems that happen at the broker-partition
| or partition-consumer assignment level, and the tools for
| visualizing this are really messy.
|
| For example, you have 200 partitions and 198 consumer threads --
| this means that, by the pigeonhole principle, there are 2
| threads that each own 2 partitions. Randomly, 1% of your data
| processing will take twice as long, which can be very hard to
| visualize.
|
| Or, for example, 10 of your 200 partitions are managed by
| broker B which, for some reason, is mishandling messages -- so 5%
| of messages are being handled poorly, which may not emerge in
| your metrics the way you expect. Remembering to view slowness by
| partition, by owning consumer, and by managing broker can be
| tricky when operating the system.
|
| (3) Provisioning capacity to have n-k availability (so that
| availability-zone-wide outages as well as deployments/upgrades
| don't hurt processing) can be tricky.
|
| How many messages per second are arriving? What is the mean
| processing time per message? How many processors (partitions) do
| you need to keep up? How much _slack_ do you have -- how much
| excess capacity is there above the typical message arrival rate,
| so that you can model how long it will take the cluster to
| process a backlog after an outage?
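|
| A back-of-the-envelope version of that arithmetic -- every
| number below is made up, it's just the shape of the calculation:
|
|     arrival_rate = 50_000        # msgs/sec at peak
|     per_msg_seconds = 0.004      # mean processing time
|     per_worker_rate = 1 / per_msg_seconds             # 250/s
|     workers_needed = arrival_rate / per_worker_rate   # 200
|
|     workers_provisioned = 260    # bounded by partition count
|     capacity = workers_provisioned * per_worker_rate  # 65,000/s
|     slack = capacity - arrival_rate                   # 15,000/s
|
|     # After a 30-minute outage, how long to drain the backlog?
|     backlog = arrival_rate * 30 * 60
|     print(f"drain in ~{backlog / slack / 60:.0f} minutes")  # ~100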
|
| (4) Remembering how to scale up when the message arrival rate
| grows feels like a bit of a chore. You have to increase the
| number of partitions to be able to handle the new messages ...
| but then you also have to remember to scale up every consumer.
| You did remember that, right? And you know you can't ever reduce
| the partition count, right?
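|
| For the mechanics, something like this with kafka-python's admin
| client (topic name and count are made up) -- the count can only
| ever go up, and you still have to scale the consumers yourself:
|
|     from kafka.admin import KafkaAdminClient, NewPartitions
|
|     admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
|     # Partition counts only increase; there is no shrink.
|     admin.create_partitions(
|         {"events": NewPartitions(total_count=400)})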
|
| (5) I often end up wondering what the processing latency is. You
| can approximate this by dividing the total backlog of unprocessed
| messages for an entire consumer group (unit "messages") by the
| message arrival rate (unit "arriving messages per second"), which
| gets you something with the dimensionality of "seconds" that
| represents a quasi processing lag. But the lag is often different
| per partition.
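|
| Per partition, the same estimate looks roughly like this with
| kafka-python (group, topic, and the arrival-rate number are
| placeholders -- the rate has to come from your own metrics):
|
|     from kafka import KafkaConsumer, TopicPartition
|
|     consumer = KafkaConsumer(
|         bootstrap_servers="localhost:9092", group_id="my-group")
|     parts = [TopicPartition("events", p)
|              for p in consumer.partitions_for_topic("events")]
|     end = consumer.end_offsets(parts)
|     arrival_per_partition = 250.0   # msgs/sec, from metrics
|
|     for tp in parts:
|         committed = consumer.committed(tp) or 0
|         lag_msgs = end[tp] - committed
|         lag_secs = lag_msgs / arrival_per_partition
|         print(tp.partition, lag_msgs, f"~{lag_secs:.0f}s behind")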
|
| Better is to teach the application-level consumer library to emit
| a metric about how long processing took and how old the message
| it evaluated was - then, as long as processing is still
| happening, you can measure delays. Both are messy metrics that
| require you to get, and remain, hands-on with the data to
| understand them.
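|
| A sketch of that second approach, assuming kafka-python; the
| handle() and emit_metric() functions are stand-ins for your
| actual processing and metrics client:
|
|     import time
|     from kafka import KafkaConsumer
|
|     def handle(value):             # stand-in for real processing
|         pass
|
|     def emit_metric(name, value):  # stand-in for metrics client
|         print(name, value)
|
|     consumer = KafkaConsumer(
|         "events", bootstrap_servers="localhost:9092",
|         group_id="my-group")
|     for record in consumer:
|         start = time.time()
|         handle(record.value)
|         emit_metric("proc_seconds", time.time() - start)
|         # record.timestamp is the producer/broker time in ms
|         emit_metric("msg_age_seconds",
|                     time.time() - record.timestamp / 1000.0)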
|
| (6) There's a complicated relationship between "processing time
| per message" and effective capacity -- any application changes
| which make a Kafka consumer slower may not have immediate effects
| on end-to-end lag SLIs, but they may increase the amount of
| parallelism needed to handle peak traffic, and this can be tough
| to reason about.
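|
| Toy numbers for why the headroom math is nonlinear even when the
| lag SLI barely moves:
|
|     # A small per-message regression eats most of the headroom.
|     arrival = 160.0                     # msgs/sec per partition
|     for per_msg_ms in (4.0, 5.0, 6.0):
|         capacity = 1000.0 / per_msg_ms  # msgs/sec per partition
|         print(per_msg_ms, "ms ->",
|               f"{100 * arrival / capacity:.0f}% busy,",
|               f"headroom {capacity - arrival:.0f}/s")
|     # 4 ms: 64% busy, 90/s spare; 6 ms: 96% busy, 7/s spare --
|     # the backlog-drain rate collapses long before lag does.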
|
| (7) Planning only ex post facto for processing outages is always
| a pain. More than once I've heard teams say "this outage would be
| a lot shorter if we had built in a way to process newly arrived
| messages first", and I've even seen folks jury-rig LIFO by e.g.
| changing the topic name for newly arrived messages and using the
| previous queue as a backlog only.
|
| I wonder if my clusters have just been too small? The stuff here
| ("how can we afford to operate this at scale?") is super
| interesting, just not the reliability stuff I've worried about
| day-to-day.
| sealjam wrote:
| > ...RedPanda, a scratch backend rewrite in Rust that is client
| API compatible
|
| I thought RedPanda was mostly C++?
| zellyn wrote:
| The RedPanda website says it's written in C++, and their
| open-source GitHub repo agrees.
| lizthegrey wrote:
| thanks for the correction! knew an error would slip in there
| somewhere! apparently they have considered rust though!
| https://news.ycombinator.com/item?id=25112601
|
| it's fixed now.
___________________________________________________________________
(page generated 2021-11-30 23:00 UTC)