[HN Gopher] Issues we've encountered while building a Kafka base...
___________________________________________________________________
Issues we've encountered while building a Kafka based data
processing pipeline
Author : poolik
Score : 141 points
Date : 2021-10-18 09:22 UTC (13 hours ago)
(HTM) web link (sixfold.medium.com)
(TXT) w3m dump (sixfold.medium.com)
| mfateev wrote:
| temporal.io provides a much higher level abstraction for building
| asynchronous microservices. It allows one to model async
| invocations as synchronous blocking calls of any duration (months
| for example). And the state updates and queueing are
| transactional out of the box.
|
| Here is an example using the TypeScript SDK:
|
|     async function main(userId, intervals) {
|       // Send reminder emails, e.g. after 1, 7, and 30 days
|       for (const interval of intervals) {
|         await sleep(interval * DAYS);
|         // can take hours if the downstream service is down
|         await activities.sendEmail(interval, userId);
|       }
|       // Easily cancelled when user unsubscribes
|     }
|
| Disclaimer: I'm one of the creators of the project.
| LgWoodenBadger wrote:
| I didn't quite follow their explanation for why producing to
| Kafka first didn't/wouldn't work for them (db state potentially
| being out of sync requiring continuous messaging until fixed).
| anentropic wrote:
| it's a chicken and egg problem
|
| you can either send a kafka message but potentially not commit
| the db transaction (i.e. an event is published for which the
| action did not actually occur) or commit the db transaction and
| potentially not send the kafka message
|
| it sounds like they implemented something like the
| Transactional Outbox pattern
| https://microservices.io/patterns/data/transactional-outbox....
|
| i.e. you use the db transaction to also commit a record of your
| intent to send a kafka message - you can then move the actual
| event sending to a separate process and implement at-least-once
| semantics
|
| This is the job queueing system they described in the article
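|
| a minimal sketch of the outbox idea (table names and the pg /
| kafkajs usage below are illustrative, not what the article's
| authors actually run):
|
|     import { Pool } from "pg";
|     import { Kafka } from "kafkajs";
|
|     const pool = new Pool();
|     const producer = new Kafka({ brokers: ["localhost:9092"] }).producer();
|
|     // write side: commit the state change and the intent-to-publish
|     // in the same database transaction
|     async function updateOrder(orderId: string, status: string) {
|       const client = await pool.connect();
|       try {
|         await client.query("BEGIN");
|         await client.query(
|           "UPDATE orders SET status = $1 WHERE id = $2",
|           [status, orderId]
|         );
|         await client.query(
|           "INSERT INTO outbox (topic, key, payload) VALUES ($1, $2, $3)",
|           ["order-events", orderId, JSON.stringify({ orderId, status })]
|         );
|         await client.query("COMMIT");
|       } catch (err) {
|         await client.query("ROLLBACK");
|         throw err;
|       } finally {
|         client.release();
|       }
|     }
|
|     // relay: a separate process publishes pending rows, then marks them
|     // sent -- at-least-once, so consumers must tolerate duplicates
|     async function relayOutbox() {
|       await producer.connect();
|       const { rows } = await pool.query(
|         "SELECT id, topic, key, payload FROM outbox WHERE sent_at IS NULL ORDER BY id"
|       );
|       for (const row of rows) {
|         await producer.send({
|           topic: row.topic,
|           messages: [{ key: row.key, value: row.payload }],
|         });
|         await pool.query("UPDATE outbox SET sent_at = now() WHERE id = $1", [row.id]);
|       }
|     }
|
| the relay can crash after the send but before marking the row sent,
| which is exactly where the at-least-once / duplicate delivery comes
| from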
| LgWoodenBadger wrote:
| Their solution seems like a "produce to Kafka first" but with
| extra steps.
|
| Regarding:
|
| _When we produce first and the database update fails
| (because of incorrect state) it means in the worst case we
| enter a loop of continuously sending out duplicate messages
| until the issue is resolved_
|
| I don't understand where either 1) the incorrect state or 2)
| the need to continuously send duplicate messages comes from.
|
| Regarding:
|
| _The Job might still fail during execution, in which case
| it's retried with exponential backoff, but at least no
| updates are lost. While the issue persists, further state
| change messages will be queued up also as Jobs (with same
| group value). Once the (transient) issue resolves, and we can
| again produce messages to Kafka, the updates would go out in
| logical order for the rest of the system and eventually
| everyone would be in sync._
|
| This is the part that is equivalent to Kafka-first, except
| with all the extra steps of a job scheduling, grouping,
| tracking, and execution framework on top of it.
| fafle wrote:
| The issue of running a transaction that spans multiple
| heterogeneous systems is usually solved with a 2 phase commit.
| The "jobs" abstraction from the article looks similar to the
| "coordinator" in 2PC. The article does not talk about how they
| achieve fault tolerance in case the "job" crashes in between the
| two transactions. Postgres supports the XA standard, which might
| help with this. Kafka does not support it.
| jgraettinger1 wrote:
| I can't speak to their solution, but when solving an equivalent
| problem within Gazette, where you desire a distributed
| transaction that includes both a) published downstream
| messages, and b) state mutations in a DB, the solution is to 1)
| write downstream messages marked as pending a future ACK, and
| 2) encode the ACK you _intend_ to write into the checkpoint
| itself.
|
| Commit the checkpoint alongside state mutations in a single
| store transaction. Only then do you publish ACKs to all of the
| downstream streams.
|
| Of course, you can fail immediately after commit but before you
| get around to publishing all of those ACKS. So, on recovery,
| the first thing a task assignment does is publish (or re-
| publish) the ACKs encoded in the recovered checkpoint. This
| will either 1) provide a first notification that a commit
| occurred, or 2) be an effective no-op because the ACK was
| already observed, or 3) roll-back pending messages of a
| partial, failed transaction.
|
| More details:
| https://gazette.readthedocs.io/en/latest/architecture-exactl...
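|
| To make the shape of that concrete, a toy sketch (every name here is
| invented for illustration; it is not Gazette's actual API -- see the
| doc above for the real protocol):
|
|     interface Ack { stream: string; txnId: string }
|     interface Checkpoint {
|       readThrough: Record<string, number>;
|       intendedAcks: Ack[];
|     }
|     interface Store {
|       // commits state mutations and the checkpoint in one transaction
|       commit(mutations: unknown[], checkpoint: Checkpoint): Promise<void>;
|       loadCheckpoint(): Promise<Checkpoint>;
|     }
|     interface Broker {
|       publishPending(stream: string, payload: unknown, txnId: string): Promise<void>;
|       publishAck(ack: Ack): Promise<void>;
|     }
|
|     async function commitTxn(
|       store: Store, broker: Broker, txnId: string,
|       msgs: { stream: string; payload: unknown }[],
|       mutations: unknown[], readThrough: Record<string, number>,
|     ) {
|       // 1) write downstream messages marked as pending a future ACK
|       for (const m of msgs) await broker.publishPending(m.stream, m.payload, txnId);
|
|       // 2) encode the ACKs we *intend* to write into the checkpoint and
|       //    commit it alongside state mutations in a single store transaction
|       const streams = [...new Set(msgs.map((m) => m.stream))];
|       const checkpoint: Checkpoint = {
|         readThrough,
|         intendedAcks: streams.map((stream) => ({ stream, txnId })),
|       };
|       await store.commit(mutations, checkpoint);
|
|       // 3) only then publish the ACKs that make the pending messages visible
|       for (const ack of checkpoint.intendedAcks) await broker.publishAck(ack);
|     }
|
|     // on recovery, (re-)publish the ACKs from the recovered checkpoint:
|     // either it completes a commit whose ACKs never went out, or it's a
|     // no-op, or it rolls back pending messages of a failed transaction
|     async function recover(store: Store, broker: Broker) {
|       const checkpoint = await store.loadCheckpoint();
|       for (const ack of checkpoint.intendedAcks) await broker.publishAck(ack);
|     }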
| nitwit005 wrote:
| The solution I've seen is to write the message you want to send
| to the DB along with the transaction, and have some separate
| thread that tries to send the messages to Kafka.
|
| Although, from various code bases I've seen, a lot of people
| just don't seem to worry about the possibility of data loss.
| eternalban wrote:
| IIRC ~2 decades ago we were dequeueing from JMS, updating
| RDBMS, and then enqueuing all under the cover of JTA (Java
| Transaction API) for atomic ops.
|
| https://docs.oracle.com/en/middleware/fusion-middleware/12.2...
|
| Using a very broad definition of 'noSQL' approach that would
| include solutions like Kafka, the issue becomes clear: A 2PC or
| 'distributed transaction manager' approach ala JTA comes with a
| performance/scalability cost -- arguably a non-issue for most
| companies who don't operate at LinkedIn scale (where Kafka was
| created).
| zeckalpha wrote:
| And MySQL didn't yet support transactions!
| jpgvm wrote:
| What you want is called Apache Pulsar. Log when you need it, work
| queue when you need that instead.
|
| As for ensuring transactional consistency, that isn't so bad: you
| can use a table to track which offsets have been applied, and verify
| against it before you update consumer offsets (or the Pulsar
| subscription if you go that route).
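|
| Roughly, with Postgres from TypeScript (table and column names are
| illustrative; the same shape works against a Pulsar subscription):
|
|     import { Pool } from "pg";
|
|     const pool = new Pool();
|
|     // called for every consumed message *before* acknowledging it to the
|     // broker; returns false if this offset was already applied (a redelivery)
|     async function applyOnce(
|       topic: string, part: number, offset: string, payload: string,
|     ): Promise<boolean> {
|       const client = await pool.connect();
|       try {
|         await client.query("BEGIN");
|         const inserted = await client.query(
|           `INSERT INTO processed_offsets (topic, part, msg_offset)
|            VALUES ($1, $2, $3) ON CONFLICT DO NOTHING`,
|           [topic, part, offset]
|         );
|         if (inserted.rowCount === 0) {
|           // already seen: skip the state change so the redelivery is a no-op
|           await client.query("ROLLBACK");
|           return false;
|         }
|         // the actual state update rides in the same transaction as the offset row
|         await client.query("INSERT INTO shipment_events (payload) VALUES ($1)", [payload]);
|         await client.query("COMMIT");
|         return true;
|       } catch (err) {
|         await client.query("ROLLBACK");
|         throw err;
|       } finally {
|         client.release();
|       }
|     }
|
| Only after applyOnce returns do you ack / move the subscription or
| consumer offset forward, so a crash in between just causes a harmless
| redelivery.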
| jgraettinger1 wrote:
| If you're in the Go ecosystem, Gazette [0] offers transactional
| integrations [1] with remote DB's for stateful processing
| pipelines, as well as local stores for embedded in-process state
| management.
|
| It also natively stores data as files in cloud storage. Brokers
| are ephemeral, you don't need to migrate data between them, and
| you're not constrained by their disk size. Gazette defaults to
| exactly-once semantics, and has stronger replication guarantees
| (your R factor is your R factor, period -- no "in sync
| replicas").
|
| Estuary Flow [2] is building on Gazette as an implementation
| detail to offer end-to-end integrations with external SaaS & DB's
| for building real-time dataflows, as a managed service.
|
| [0]: https://github.com/gazette/core
| [1]: https://gazette.readthedocs.io/en/latest/consumers-concepts....
| [2]: https://github.com/estuary/flow
| ivanr wrote:
| Small suggestion: If Gazette is ready for a wider adoption, it
| may be useful to bump it up to 1.0 as a signal of confidence.
| anotherhue wrote:
| I ran a few dozen kafka clusters at MegaCorp in a previous life.
|
| My answer to anyone who asks for kafka: Show me that you can't do
| what you need with a beefy Postgres.
| afandian wrote:
| I had a great time with Kafka for prototyping. Being able to
| push data from a number of places, have multiple consumers able
| to connect, go back and forth through time, add and remove
| independent consumer groups. Ran in pre-production very
| reliably too, for years.
|
| But for a production-grade version of the system I'm going with
| SQL and, where needed, IaC-defined SQS.
| capableweb wrote:
| > My answer to anyone who asks for kafka: Show me that you
| can't do what you need with a beefy Postgres.
|
| ...
|
| You could probably hack any database to perform any task, but
| why would you? Use the right tool for the right task, not one
| tool for all tasks.
|
| If the right tool is a relational database, then use
| Postgres/$OTHER-DATABASE
|
| If the right tool is a distributed, partitioned and replicated
| commit log service, then use Kafka/$OTHER-COMMIT-LOG
|
| Not sure why people get so emotional about the technologies they
| know the best. Sure you could hack Postgres to be a replicated
| commit log, but I'm sure it'll be easier to just throw in Kafka
| instead.
| ThinkBeat wrote:
| This is exactly how I feel about it.
|
| A while back I was on a team building a non-critical, low volume
| application. It basically just involves people sending a message
| (there is more to it).
|
| The consultants said we had to use Kafka because the messages
| could come in really fast.
|
| I said we should stick with Postgres.
|
| No, they said, we really need Kafka to be able to handle this.
|
| Then I went and spun up Postgres on my work laptop (nothing
| special), and got a loaner to act as a client. I simulated
| about 300% more traffic than we had any chance of getting. It
| worked fine. (did tax my poor work laptop).
|
| No, we could not risk it, when we use Kafka we are safe.
|
| Took it to management, Kafka won since Buzzword.
|
| Now of course we have to write a process to feed the data into
| Postgres. After all, it's what everything else depends on.
| sparsely wrote:
| There are various Kafka to Postgres adaptors. Of course, now
| you're running 3 bits of software, with 3 bottlenecks,
| instead of just 1.
| jghn wrote:
| People tend to not realize how big "at scale" problems really
| are. Instead anything at a scale at the edge of their
| experience is "at scale" and they reach for the tools they've
| read one is supposed to use in those situations. It makes
| sense, people don't know what they don't know.
|
| And thus we have a world where people have business needs
| that could be powered by my low end laptop but solutions
| inspired by the megacorps.
| jerf wrote:
| The GHz aspect of Moore's law died over a decade ago, and I
| suppose it's fair to say most other stuff has also slowed
| down, but if you've got a job that is embarrassingly
| parallel, which a lot of these "big data" jobs are, people
| badly underestimate how much progress there has been in the
| server space even so in the last 10 years if they're not
| paying attention. What was "big data" in 2011 can easily be
| "spin up a single 32-core instance with 1TB RAM for a
| couple of hours" in 2021. Even beyond the "big data" that
| my laptop comfortably handles.
|
| I'm slowly wandering into a data science role lately, and
| I've been dealing with teams who are all kinds of concerned
| about whether or not we can handle the sheer, overwhelming
| volume of their (summary) data. "Well, let's see, how much
| data are we talking about?" "Oh, gosh, we could generate
| 200 or 300 megabytes a day." (Of uncompressed JSON.) Well,
| you know, if I have to bust out a second Raspberry Pi I'll
| be sure to charge it to your team.
|
| The funny thing is that some of these teams have the
| experience that they ought to know better. They are
| legitimately running cloud services with dozens of large
| nodes continually running at high utilization and chewing
| through gigabytes of whatever per _second_. In their own
| worlds they would absolutely know that a couple hundred
| megabytes is _nothing_. They'll often have known places in
| their stack where they burn through a few hundred megabytes
| in internal API calls or something unnecessarily, and it
| will barely rise to the level of a P3 bug, quite
| legitimately so. But when they start thinking in terms of
| (someone else's) databases it's like they haven't updated
| their sense of size since 2005.
| yongjik wrote:
| You are not thinking enterprisey enough. Everything is Big
| Scale if you add enough layers, because all these overheads
| add up, to which the solution is of course more layers.
| i_like_waiting wrote:
| But can you stream data from, let's say, MS SQL directly into
| Postgres? The easiest way I found is Kafka; I would love some
| simple python script instead.
| BiteCode_dev wrote:
| Stream? Unless you have really high traffic, put that in a
| python script while loop that regularly checks for new rows,
| and it will be fine.
|
| If you want to get fancy, DBs now have pub/sub.
|
| There are use cases for stream replication, but you need way
| more data than 99% of biz have.
| dtech wrote:
| Were you advocating an endpoint + DB or different apps
| directly writing into a shared DB? The latter is not really a
| good idea for numerous reasons. Kafka is a - potentially
| overkill - replacement for REST or whatever, not for your DB.
| ceencee wrote:
| I see posts like this a lot, and it makes me wonder what the
| heck you were using Kafka for that Postgres could handle, yet
| you had dozens of clusters? I question if you actually ever
| used Kafka or just operated it? Sure anyone can follow the
| "build a queue on a database pattern" but it falls over at the
| throughputs that justify Kafka. If you have a bunch of trivial
| 10tps workloads, of course a distributed system is overkill.
| silisili wrote:
| Can you kinda high level the setup/processes for making
| Postgres a replacement for Kafka? I've not attempted such a
| thing before, and wonder about things like
| expiration/autodeletion, etc. Does it need to be vacuumed
| often, and is that a problem?
| gilbetron wrote:
| What velocity have you achieved with postgres, in terms of #
| messages/sec where messages could range between 1KB-100KB in
| size?
| lima wrote:
| Yup. That's what Kafka excels at and where it scales to
| throughput way beyond Postgres.
|
| But there are many many projects where Kafka is used for low
| value event sourcing stuff where a SQL DB could be easier.
| superyesh wrote:
| >My answer to anyone who asks for kafka: Show me that you can't
| do what you need with a beefy Postgres.
|
| Sorry, that's just a clickbait-y statement. I love Postgres; try
| handling 100-500k rps of data coming in from various sources
| reading and writing to it. You are going to get bottlenecked on
| how many connections you can handle, you will end up throwing
| pgBouncers on top of it.
|
| Eventually you will run out of disk, start throwing more in.
|
| Then end up in VACUUM hell, all while having a single point of
| failure.
|
| While I agree Kafka has its own issues, it is an amazing tool
| to a real scale problem.
| fernandotakai wrote:
| at least for me, every system has its place.
|
| i love postgresql, but i would not use it to replace a
| rabbitmq instance -- one is an RDBMS, the other is a
| queue/event system.
|
| "oh but psql can pretend to be kafka/rabbitmq!" -- sure, but
| then you need to add tooling to it, create libraries to
| handle it, and handle all the edge cases.
|
| with rmq/kafka, there are already a bunch of tools to handle the
| exact case of a queue/event system.
| dwohnitmok wrote:
| I think anotherhue would agree that half a million write
| requests per second counts as a valid answer to "you can't do
| what you need with a beefy Postgres," but that is also a
| minority of situations.
| IggleSniggle wrote:
| It's just hard to know what people mean when they say "most
| people don't need to do this." I was sitting wondering
| about a similar scale (200-1000 rps), where I've had issues
| with scaling rabbitmq, and have been thinking about whether
| kafka might help.
|
| Without context provided, you might think: "oh, here's
| somebody with kafka and postgres experience, saying that
| postgres has some other super powers I hadn't learned about
| yet. Maybe I need to go learn me some more postgres and see
| how it's possible."
|
| It would be helpful for folks to provide generalized
| measures of scale. "Right tool for the job," sure, but in
| the case of postgres, it often feels like there are a lot
| of incredible capabilities lurking.
|
| I don't know what's normal for day-to-day software
| engineers anymore. Was the parent comment describing
| 100-500 rps really "a minority of situations?" I'm sure it
| is for most _businesses_. But is it "the minority of
| situations" that _software engineers_ are actively trying
| to solve in 2021? I have no clue.
| Serow225 wrote:
| that seems like an awfully low number to be running into
| issues with RabbitMQ ?
| doliveira wrote:
| Yeah, once you do have to scale a relational database you're
| in for a world of pain. Band-aid after band-aid... I very
| much prefer to just start with Kafka already. At the very
| least you'll have a buffer to help you gain some time when
| the database struggles.
| dionian wrote:
| Honest question, how do you expire content in Postgres? Every
| time I start to use it for ephemeral data I start to wonder if
| I should have used TimescaleDB or if I should be using
| something else...
| LoriP wrote:
| You likely already know that TimescaleDB is an extension to
| PostgreSQL, so you get everything you'd get with PostgreSQL
| plus the added goodies of TimescaleDB. All that said, you can
| drop (or detach) data partitions in PostgreSQL (however you
| decide to partition...) Does that not do the trick for your
| use case, though?
| https://www.postgresql.org/docs/10/ddl-partitioning.html
|
| Transparency: I work for Timescale
| mvc wrote:
| Record an event log and reliably connect it to a variety of 3rd
| party sinks using off-the-shelf services
| sparsely wrote:
| I've also had much more success with "queues in postgres" than
| "queues in kafka". I'm sure the use cases exist where kafka
| works better, but you have to architect it so carefully, and
| there are so many ways to trip up. Whereas most of your team
| probably already understands an RDBMS and you're probably already
| running one reliably.
| lmilcin wrote:
| Just because you can do it with Postgres doesn't mean it is the
| best tool for the job.
|
| Sometimes the restrictions placed on the user are as important.
| Kafka presents a specific interface to the user that causes
| users to build their applications in a certain way.
|
| While you can replicate almost all functionality of Kafka with
| Postgres (except for performance, but hardly anybody needs as
| much of it), we all know what we end up with when we set up
| Postgres and use it to integrate applications with each other.
|
| If developers had discipline they could of course create tables
| with appendable logs of data, marked with a partition, that
| consumers could process from with basically the same guarantees
| as with Kafka.
|
| But that is not how it works in reality.
| bsaul wrote:
| using a sql db for push/pop semantics feels like using a hammer
| to squash a bug... How would you model queues & partitions with
| ordering guarantees with pg?
| throwaway81523 wrote:
| With transactions, and stored procedures if that helps ;).
| Redis also seems well suited to the use cases I've seen for
| Kafka. Kafka must have capabilities beyond those use cases,
| and I've sometimes wondered what they are.
| [deleted]
| anotherhue wrote:
| sql has many conveniences for doing so, it wouldn't be much
| work.
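|
| e.g. a bare-bones worker over a jobs table (schema is illustrative;
| per-key ordering would need an extra group/serialization column on
| top of this):
|
|     import { Pool } from "pg";
|
|     const pool = new Pool();
|
|     async function handle(payload: unknown) { /* your processing */ }
|
|     // claim one pending job; FOR UPDATE SKIP LOCKED lets many workers poll
|     // the same table without blocking each other, and a crash mid-job just
|     // releases the row lock so another worker retries it
|     async function workOne(): Promise<boolean> {
|       const client = await pool.connect();
|       try {
|         await client.query("BEGIN");
|         const { rows } = await client.query(
|           `SELECT id, payload FROM jobs
|             WHERE done = false
|             ORDER BY id
|             LIMIT 1
|             FOR UPDATE SKIP LOCKED`
|         );
|         if (rows.length === 0) {
|           await client.query("COMMIT");
|           return false;
|         }
|         await handle(rows[0].payload);
|         await client.query("UPDATE jobs SET done = true WHERE id = $1", [rows[0].id]);
|         await client.query("COMMIT");
|         return true;
|       } catch (err) {
|         await client.query("ROLLBACK");
|         throw err;
|       } finally {
|         client.release();
|       }
|     }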
|
| > using a hammer to squash a bug..
|
| Agreed - but Kafka is a much, much bigger hammer. SQS/Az
| Queues are also good choices.
| mrweasel wrote:
| We work with a client who has requested a Kafka cluster. They
| can't really say why, or what they need it for, but they now
| have a very large cluster which doesn't do much. I know why
| they want it, same reason why they "need" Kubernetes. So far
| they use it as a sort of message bus.
|
| It's not that there's anything wrong with Kafka, it's a very
| good product and extremely robust. Same with Kubernetes: it has
| its uses and I can't fault anyone for having it as a
| consideration.
|
| My problem is when people ignore how capable modern servers
| are, and when developers don't see the risks in building these
| highly complex systems, if something much simpler would solve
| the same problem, only cheaper and safer.
| aqme28 wrote:
| > Show me that you can't do what you need with a beefy
| Postgres.
|
| I've found this question very useful when pitched any esoteric
| database.
| alephu5 wrote:
| Why Postgres? Why not redis, rabbitMQ or even Kafka itself?
| hardwaresofton wrote:
| Not those other tools because you can't achieve Postgres
| functionality with those other tools generally.
|
| Postgres can pretend to be Redis, RabbitMQ, and Kafka, but
| redis, RabbitMQ, and Kafka would have a hard time pretending
| to be Postgres.
|
| Postgres has the best database query language man has
| invented so far (AFAIK), well reasoned persistence and
| semantics, and as of recently, partitioning features to boot
| and lots of addons to support different use cases. Despite all
| this, postgres is mostly Boring Technology (tm) and easily
| available as well as very actively developed in the open,
| with an enterprise base that does consulting first and usually
| upstreams improvements after some time (2nd Quadrant, EDB,
| Citus, TimescaleDB).
|
| The other tools win on simplicity for some (I'd take managing
| a PostgreSQL cluster over RMQ or Kafka any day), but for
| other things, especially feature-wise, Postgres (and its
| amalgamation of mostly-good-enough to great features) wins
| IMO.
| LoriP wrote:
| Your comment on Boring Technology (tm) made me smile... At
| Timescale, as you'll see in this blog, we are happy to
| celebrate the boring; there's an awful lot to be said for
| it :) https://blog.timescale.com/blog/when-boring-is-awesome-build...
|
| For transparency: I'm Timescale's Community Manager
| sumtechguy wrote:
| I have done both patterns.
|
| Kafka has its use case. Databases have theirs. You can make a
| DB do what kafka does. But you also add in the programming
| overhead of getting the DB semantics correct to make an event
| system. When I see people saying 'let's put the DB into kafka' I
| make the exact same argument. You will spend more time making
| kafka act like a database and getting the semantics right.
| Kafka is more of a data/event transportation system. A DB is an
| at rest data store that lets you manipulate the data. Use them
| to their strengths or get crushed by weird edge cases.
| tengbretson wrote:
| I'd argue that having the event system semantics layered on
| top of a sql database is a big benefit when you have an
| immature product, since you have an incredibly powerful
| escape hatch to jump in and fire off a few queries to fix
| problems. Kafka's visibility for debugging is pretty poor in
| my experience.
| sumtechguy wrote:
| My issues typically with layering an event system on top of
| a db are replication and ownership of that event. Kafka
| makes some very nice guarantees about giving best attempt
| to make sure only one process works on something at a time
| inside of a consumer group. You have to build a system in
| the db using locks and different things that are poor
| substitutes.
|
| If you are having trouble debugging kafka you could use a
| connector to put the data into the database/file to also
| debug, or a db streamer. You can also use the built in cli
| tools to scroll along. I have had very good luck with using
| both of those to find out what is going wrong. Also kafka
| will basically by default keep all the messages for the
| past 7 days so you can play it back if you need to by
| moving the consumer offsets. If you are trying to use kafka
| like a db and change messages on the fly you will have a
| bad time of it. Kafka is meant to be a "here is something
| that happened and some data". Changing that data after the
| fact would in the kafka world be another event. In some
| types of systems that is a very desirable property (such as
| audit heavy cultures, banks, medical, etc). Now also kafka
| can be a real pain if you are debugging and mess up a schema
| or produce a message that does not fit into the consumer.
| Getting rid of that takes some poor cli trickery when in a
| db it is a delete call.
|
| Also kafka is meant for distributed, event-based
| worker systems (typically some sort of microservice style
| system). If you are early on, you are more than likely not
| building that yet. Just dumping something into a list in a
| table and polling on that list is a very effective way for
| something that is early on or maybe even forever. But once
| you add in replication and/or multiple clients looking at
| that same task list table you will start to see the issues
| quickly.
|
| Use an event system like a db system and yes, it will feel
| broken. Also vice versa. You can do it. But like you are
| finding out, those edge cases are a pain and make you feel
| like 'bah, it's easier to do it this way'. In some cases yes.
| In your case you have a bad state in your event data. You
| are cleaning it up with some db calls. But what if instead
| you had an event that did that for you?
| BiteCode_dev wrote:
| Well, if you want events, you have lighter alternatives: Redis
| pub/sub, crossbar... Even rabbitMQ is lighter.
|
| If you need queuing, you have libs like celery.
|
| You don't need to go full kafka
| emerongi wrote:
| Redis - it's probably in your stack already and fills all
| the use-cases.
| bsaul wrote:
| Very interested to hear how people here overcome the limits of
| kafka for ordered event delivery in the real world, and what those
| were.
| luxurytent wrote:
| I feel that if you're using Kafka and expect guaranteed ordering,
| then you're using the wrong tool. At best you have guaranteed
| ordering per partition, but then you've tied your
| ordering/keying strategy to the number of partitions you've
| enabled ... which may not be ideal.
|
| But, that's speaking from my light experience with it. I'm also
| curious if there's a better way :-)
| orobinson wrote:
| At lower data volumes (<10,000 events per minute) it's
| perfectly feasible to just use single partition topics and then
| ordered event delivery is no problem at all. If a consuming
| service has processing times that mean horizontal scaling is
| necessary, then the topic can be repartitioned into a new topic
| with multiple partitions and the processing application can
| handle sorting the outputted data to some SLA.
| BFLpL0QNek wrote:
| It depends on what the events are, how they are structured.
|
| You get guaranteed ordering at the partition level.
|
| Items are partitioned by key so you also get guaranteed
| ordering for a key.
|
| If you have guaranteed ordering for a key you can't get total
| ordering across all keys but you can get eventual consistency
| across the keys.
|
| Ultimately if you want ordering you have to design around being
| eventually consistent.
|
| I don't read a lot of papers but Leslie Lamport's Time, Clocks,
| and the Ordering of Events in a Distributed System gave me a
| lot of insight in to the constraints.
| https://lamport.azurewebsites.net/pubs/time-clocks.pdf
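|
| For what it's worth, the per-key guarantee falls out of just keying
| the messages (illustrative kafkajs sketch; topic and client names are
| made up):
|
|     import { Kafka } from "kafkajs";
|
|     const kafka = new Kafka({ clientId: "orders", brokers: ["localhost:9092"] });
|     const producer = kafka.producer();
|     // producer.connect() is assumed to have been awaited at startup
|
|     // all events for one orderId hash to the same partition, so consumers
|     // see them in produced order; there is no ordering guarantee *across*
|     // different orderIds
|     async function emitOrderEvent(orderId: string, event: object) {
|       await producer.send({
|         topic: "order-events",
|         messages: [{ key: orderId, value: JSON.stringify(event) }],
|       });
|     }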
| sumtechguy wrote:
| For kafka the default is round robin in each partition. A
| hash key can let you direct the work to particular
| partitions. Each partition is guaranteed ordering. Also only
| one consumer in a consumer group can remove an item from a
| partition at a time. No two consumers in a consumer group
| will get the same message.
| tiew9Vii wrote:
| It's round robin if no key specified otherwise it uses
| murmur2 hash of the key so the partition for a key is
| always deterministic.
|
| Just checking the docs it appears the round robin is no
| longer true after Confluent Platform 5.4. After 5.4 it
| looks like if no key specified the partition is assigned
| based on the batch being processed.
|
| > If the key is provided, the partitioner will hash the key
| with murmur2 algorithm and divide it by the number of
| partitions. The result is that the same key is always
| assigned to the same partition. If a key is not provided,
| behavior is Confluent Platform version-dependent:...
|
| https://docs.confluent.io/platform/current/clients/producer....
| csours wrote:
| put a timestamp in the message. use a conflict free replicated
| data type
| sethammons wrote:
| If you need a guaranteed ordering, timestamps and distributed
| systems are not friends. See logical / vector clocks.
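|
| A minimal Lamport-clock sketch of what "logical clock" means in
| practice (nothing Kafka-specific):
|
|     // a counter bumped on every local event and merged (max, then +1) with
|     // the counter carried on every received message; it gives a
|     // happened-before ordering without trusting wall-clock timestamps
|     class LamportClock {
|       private time = 0;
|
|       // local event, or stamping an outgoing message
|       tick(): number {
|         return ++this.time;
|       }
|
|       // on receiving a message stamped with the sender's time
|       merge(remoteTime: number): number {
|         this.time = Math.max(this.time, remoteTime) + 1;
|         return this.time;
|       }
|     }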
| jgraettinger1 wrote:
| Not for Kafka, but we are building Flow [1] to offer
| deterministic ordering, even across multiple logical and
| physical partitions, and regardless of whether you're back-
| filling over history or processing in real-time.
|
| This ends up being required for one of our architectural goals,
| which is fully repeatable transformations: You must be able to
| model a transactional decision as a Flow derivation (like "does
| account X have funds to transfer Y to Z?"), and if you create a
| _copy_ of that derivation months later, get the exact same
| result.
|
| Under the hood (and simplifying a bit) Flow always does a
| streaming shuffled read to map events from partitions to task
| shards, and each shard maintains a min-heap to process events
| in their ~wall-time order.
|
| This also avoids the common "Tyranny of Partitioning", where
| your upstream partitioning parallelism N also locks you into
| that same task shard parallelism -- a big problem if tasks
| manage a lot of state. With a read-time shuffle, you can scale
| them independently.
|
| [1]: https://github.com/estuary/flow
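|
| To illustrate just the read-side idea (this is not Flow's API): a toy
| merge of per-partition, internally ordered buffers into one time order
| might look like:
|
|     interface Event { ts: number; partition: string; payload: string }
|
|     // each input array is already ordered by ts; repeatedly emit the oldest
|     // buffered head across all partitions (a min-heap does this in O(log n),
|     // a linear scan is enough to show the idea)
|     function* mergeByTime(partitions: Event[][]): Generator<Event> {
|       const heads = partitions.map(() => 0);
|       while (true) {
|         let best = -1;
|         for (let i = 0; i < partitions.length; i++) {
|           if (
|             heads[i] < partitions[i].length &&
|             (best === -1 || partitions[i][heads[i]].ts < partitions[best][heads[best]].ts)
|           ) {
|             best = i;
|           }
|         }
|         if (best === -1) return;
|         yield partitions[best][heads[best]++];
|       }
|     }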
| HelloNurse wrote:
| The only time I used Kafka, it was involuntary (included for
| the sake of fashion in some complicated IBM product, where it hid
| among WebSphere, DB2 and other bigger elephants) and it ran my
| server out of disk space because, due to a bug, ridiculously
| massive temporary files weren't erased. Needless to say, I wasn't
| impressed: just one more hazard to worry about.
| kitd wrote:
| _due to a bug_
|
| Data retention time is Kafka config 101. Are you sure it was a
| bug?
| geodel wrote:
| Considering how half-assed Kafka is in general - it needs
| code changes in all clients when Kafka servers are upgraded - it
| is very likely that the user hit a Kafka bug.
| kitd wrote:
| Citation needed.
|
| New server versions are protocol backwards compatible so
| I'm not sure what you're referring to.
|
| Ofc, if you downgraded a server without changing the
| client, that may cause problems, but tbh that's hardly
| Kafka's fault.
| sam0x17 wrote:
| Couldn't the "state" issue be solved simply by enclosing the
| database save and kafka message send in the same database
| transaction block and only doing the kafka send if it reaches
| that part of the code?
___________________________________________________________________
(page generated 2021-10-18 23:01 UTC)