[HN Gopher] Chronon, Airbnb's ML feature platform, is now open s...
       ___________________________________________________________________
        
       Chronon, Airbnb's ML feature platform, is now open source
        
       Author : vquemener
       Score  : 151 points
        Date   : 2024-04-08 17:27 UTC (1 day ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | nikhilsimha wrote:
       | Author. Happy to answer any questions.
        
         | morkalork wrote:
         | At what size of team, features or number of models would you
         | say the break even point is for investing time into using this
         | platform?
        
           | nikhilsimha wrote:
            | Offline is pretty easy to get started with. It should take
            | less than a week to set it up for new use-cases across the
            | company. (You can begin building training sets once offline is
            | set up.)
           | 
            | Online is a bit more involved - you need a month or more to
            | test that your KV store scales against the read and write
            | traffic coming from Chronon.
        
         | dundun wrote:
         | How does this relate to Zipline and Bighead? Does it replace
         | those projects or is it a continuation of them?
        
           | echrisinger wrote:
           | I'd imagine a continuation... he is also the author of
           | Zipline
        
           | nikhilsimha wrote:
           | Bighead is the model training and inference platform.
           | 
            | Chronon is a full re-write of Zipline with 1) a different
            | underlying algorithm for time-travel to address scalability
            | concerns and 2) a different serde and fetching strategy to
            | address latency concerns.
        
         | andscoop wrote:
         | I noticed airflow as the backing orchestration service. Was
         | there any consideration for another orchestration tool? I know
         | Airbnb has at least two internally, but also that airflow is
         | the predominant one for the data org still.
        
           | nikhilsimha wrote:
            | Airflow is the current implementation since it is the paved
            | path at Airbnb. But we are open to accepting contributions
            | for other orchestrators.
            | 
            | Someone mentioned they wanted to add Cadence support.
        
         | echrisinger wrote:
         | How do you/AirBnB handle deeply linked features (2-hop+?) that
         | are also latency sensitive? Maybe I'm missing something, but I
         | don't imagine that with the transformation DSL described in
         | Chronon.
         | 
         | For our org, those are by far the most complicated to handle.
         | Graph DBs are kind of scaling poorly, while storing state in
         | stream processing jobs is way too large/expensive. Those would
         | also be built on top of API sources, which then lead us to the
         | unfortunate "log & wait" approach for our most important
         | features
        
           | nikhilsimha wrote:
           | we call this chaining.
           | 
           | In the API itself - you could specify the chain links by
           | specifying the source.
           | 
           | To be precise - a GroupBy(aggregation primitive) can have a
           | Join(enrichment primitive) as a source. To rephrase, you can
           | enrich first and then aggregate and continue this chain
           | indefinitely.
           | 
           | > Graph DBs are kind of scaling poorly
           | 
            | That makes sense. Scaling these on the read side is much
            | harder than pre-computing on the write side. (That is what
            | Chronon allows you to do.)
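The enrich-then-aggregate chain described above can be sketched in plain Python. This is an illustrative toy, not Chronon's actual API; `join`, `group_by`, and the sample data are all made up:

```python
# Hypothetical illustration of "chaining": enrich first (Join), then
# aggregate (GroupBy); the output can itself source the next link.
# Plain Python pseudo-infrastructure, not Chronon's actual API.

def join(events, dim_table, key):
    """Enrichment primitive: attach dimension attributes to each event."""
    return [{**e, **dim_table[e[key]]} for e in events]

def group_by(rows, key, value, agg):
    """Aggregation primitive: reduce enriched rows per key."""
    out = {}
    for r in rows:
        out.setdefault(r[key], []).append(r[value])
    return {k: agg(v) for k, v in out.items()}

# One hop: enrich page views with the page's host, then aggregate per host.
events = [{"page_id": "p1", "views": 3}, {"page_id": "p2", "views": 5}]
pages = {"p1": {"host_id": "h1"}, "p2": {"host_id": "h1"}}

enriched = join(events, pages, "page_id")        # Join as the GroupBy's source
views_per_host = group_by(enriched, "host_id", "views", sum)
print(views_per_host)  # {'h1': 8}
```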
        
         | echrisinger wrote:
         | I'm also curious how you went from a non-platformatized
         | approach to adopting this platform; what were the important
         | insights for strategizing, prioritizing, motivating teams to
         | lift existing pipelines into the new thing? Open ended question
        
           | nikhilsimha wrote:
           | There were two main drivers -
           | 
           | - inability to back-test new real-time features. People were
           | forced to log-and-wait to create training sets for months.
           | Chronon reduces this to hours or days.
           | 
           | - the difficulty of creating the lambda system (batch
           | pipeline, streaming pipeline, index, serving endpoint) for
            | every feature group. In Chronon, you simply set a flag on
           | your feature definition to spin up the lambda system.
        
       | whiplash451 wrote:
       | First of all, congrats on the release! Well done. A few
       | questions:
       | 
       | - Since the platform is designed to scale, it would be nice to
       | see scalability benchmarks
       | 
       | - Is the platform compatible with human-in-the-loop workflows? In
       | my experience, those workflows tend to require vastly different
       | needs than fully automated workflows (e.g. online advertising)
        
         | nikhilsimha wrote:
         | re: scalability benchmarks - we plan to publish more benchmark
         | information against publicly available datasets in the near
         | future.
         | 
         | re: human-in-the-loop workflows - do you mean labeling?
        
       | Reubend wrote:
       | Looks very useful. I'm not aware of any open source alternative
       | (although I could just be ignorant here!)
        
         | dundun wrote:
         | This is the biggest one: https://feast.dev/
        
           | echrisinger wrote:
           | This isn't really a drop-in replacement; they don't offer
           | transforms out of the box.
           | 
           | Admittedly some of the transforms proposed in this article
           | are a little simple & don't represent the full space of
           | feature eng requirements for all large orgs
        
             | econometrician wrote:
              | Actually Feast does support transformations, depending on
              | the source. It supports transforming data on demand and via
              | streaming. It does not support batch transformations, only
              | because technically that should just be an upload, but we
              | can revisit that decision.
        
           | xLaszlo wrote:
           | I think feast is sunsetted
        
             | econometrician wrote:
             | There are new maintainers: https://feast.dev/blog/the-
             | future-of-feast/
        
         | jamesblonde wrote:
         | Hopsworks
        
         | nikhilsimha wrote:
          | Feathr from LinkedIn is the closest. But there doesn't seem to
         | be much recent activity on the project.
        
       | syntaxing wrote:
        | Why do major sites still use Medium as a blog platform?
        
         | dartos wrote:
         | Free income?
        
         | ttul wrote:
         | Ugh yes. The first thing I see on clicking the link is an
         | overwhelming login/join pop-over. I'm never visiting that blog
         | again...
        
           | appplication wrote:
           | Substack is the same
        
           | seattle_spring wrote:
           | Don't let having to tap "x" a single time ruin your day.
           | You're missing out on a lot of good stuff.
        
           | ayhanfuat wrote:
           | Disabling JavaScript helps with that (sometimes they don't
           | show the full article if JS is disabled though).
        
           | dumbo-octopus wrote:
           | Wild that it's 2024 and you still don't have UBlock Origin.
        
             | DandyDev wrote:
             | Maybe they opened the link on Safari on iOS like me?
        
         | whiplash451 wrote:
         | Reach (sadly)
        
         | mdaniel wrote:
         | for others who also hate medium: https://scribe.rip/airbnb-
         | engineering/chronon-airbnbs-ml-fea...
         | 
         | and probably the only link you care about:
         | https://github.com/airbnb/chronon#readme (Apache 2)
        
         | brolumir wrote:
         | It's tough to prioritize migrating to a new platform for the
         | engineering blog, without a very good ROI. Airbnb's eng blog
         | was set up on Medium a while ago, it's doing fine, they have no
         | real reason to spend a lot of resources on switching.
        
         | nikhilsimha wrote:
         | I am with you on this one.
        
       | xiasongh wrote:
       | How does Chronon handle mutable data when backfilling? Or does it
       | make some assumptions on the underlying data?
        
         | nikhilsimha wrote:
         | By mutable data do you mean - change data coming from OLTP
         | databases? If yes, we do this via the EntitySource api.
         | 
         | https://www.chronon.ai/authoring_features/Source.html#stream...
        
       | sfink wrote:
       | It's refreshing to read something about ML and inference and have
       | it _not_ be anything related to a transformer architecture
       | sending up fruit growing from a huge heap of rotten, unknown,
        | mostly irrelevant data. With traditional ML, it's useful to talk
       | about the sources of bias and error, and even _measure_ some of
       | them. You can do things that improve them without starting over
       | on everything else.
       | 
       | With LLMs, it's more like you buy a large pancake machine that
       | you dump all of your compost into (and you suspect the installers
       | might have hooked up to your sewage line as input too). It
       | triples your electricity bill, it makes bizarre screeching noises
       | as it runs, you haven't seen your cat in a week, but at the end
       | out come some damn fine pancakes.
       | 
       | I apologize. I'm talking about the thing that I was saying was a
       | relief to be not talking about.
        
         | nikhilsimha wrote:
         | I agree with you - about the sentiment around the GenAI
         | megaphone.
         | 
         | FWIW, Chronon does serve context within prompts to personalize
         | LLM responses. It is also used to time-travel new prompts for
         | evaluation.
        
           | cactusplant7374 wrote:
           | > time-travel new prompts for evaluation
           | 
           | What does this mean?
        
             | nikhilsimha wrote:
             | Imagine you are building a customer support bot for a food
             | delivery app.
             | 
             | The user might say - I need a refund. The bot needs to know
             | contextual information - order details, delivery tracking
             | details etc.
             | 
             | Now you have written a prompt template that needs to be
             | rendered with contextual information. This rendered prompt
             | is what the model will use to decide whether to issue a
             | refund or not.
             | 
             | Before you deploy this prompt to prod, you want to evaluate
             | its performance - instances where it correctly decided to
             | issue or decline a refund.
             | 
             | To evaluate, you can "replay" historical refund requests.
             | The issue is that the information in the context changes
             | with time. You want to instead simulate the value of the
             | context at a historical point in time - or time-travel.
        
               | jamesblonde wrote:
               | Time-travel evals, nice.
        
               | jamesblonde wrote:
               | Are you using function calling for the context info?
        
               | uoaei wrote:
               | In what world is it appropriate or even legal to decide
               | on refunds via LLM?
               | 
               | Can you give an example that's not ripe for abuse? This
               | really doesn't sell LLMs as anything useful except
               | insulation from the consequences of bad decisions.
        
               | nikhilsimha wrote:
               | "Imagine" is the operative word :-)
        
       | giovannibonetti wrote:
       | What is the difference between a ML feature store and a low-
       | latency OLAP DB platform/data warehouse? I see many similarities
       | between both, like the possibility of performing aggregation of
       | large data sets in a very short time.
        
         | uoaei wrote:
         | Feature stores are more for fast read and moderate write/update
         | for ML training and inference flows. Good organization and fast
         | query of relatively clean data.
         | 
         | Data warehouse is more for relatively unstructured or blobby
         | data with moderate read access and capacity for massive files.
         | 
         | OLAP is mostly for feeding streaming and event-driven flows,
         | including but not limited to ML.
        
         | jamesblonde wrote:
          | You need the columnar store for both training data and batch
          | inference data. If you have a batch ML system that works with
          | time series data, the feature store will help you create
          | point-in-time correct training data snapshots from the mutable
          | feature data (no future data leakage), as well as batch
          | inference data.
          | 
          | For real-time ML systems, it gives you row-oriented retrieval
          | latencies for features.
          | 
          | Most importantly, it helps modularize your ML system into
          | feature pipelines, training pipelines, and inference pipelines.
          | No monolithic ML pipelines.
        
         | nikhilsimha wrote:
          | The ability to generate training sets against historical
          | inferences to back-test new features.
          | 
          | Another one is the focus on pushing as much compute to the
          | write-side as possible (within Chronon) - especially joins and
          | aggregations.
         | 
         | OLAP databases and even graph databases don't scale well to
         | high read traffic. Even when they do, the latencies are very
         | high.
        
           | giovannibonetti wrote:
           | You may want to take a look at Starrocks [1]. It is an open-
           | source DB [2] that competes with Clickhouse [3] and claims to
           | scale well - even with joins - to handle use cases like real-
           | time and user-facing analytics, where most queries should run
           | in a fraction of a second.
           | 
           | [1] https://www.starrocks.io/ [2]
           | https://github.com/StarRocks/starrocks [3]
           | https://www.starrocks.io/blog/starrocks-vs-clickhouse-the-
           | qu...
        
             | nikhilsimha wrote:
             | We did and gave up due to scalability limitations.
             | 
             | Fundamentally most of the computation needs to happen
             | before the read request is sent.
        
               | jvican wrote:
               | Hey! I work on the ML Feature Infra at Netflix, operating
               | a similar system to Chronon but with some crucial
               | differences. What other alternatives aside from Starrocks
               | did you evaluate as potential replacements prior to
               | building Chronon? Curious if you got to try Tecton or
               | Materialize.com.
        
               | nikhilsimha wrote:
                | We haven't tried Materialize - IIUC Materialize is pure
                | kappa. Since we need to correct upstream data errors and
                | forget selective data (GDPR) automatically, we need a
                | lambda system.
               | 
               | Tecton, we evaluated, but decided that the time-travel
               | strategy wasn't scalable for our needs at the time.
               | 
               | A philosophical difference with tecton is that, we
               | believe the compute primitives (aggregation and
               | enrichment) need to be composable. We don't have a
               | FeatureSet or a TrainingSet for that reason - we instead
               | have GroupBy and Join.
               | 
               | This enables chaining or composition to handle
               | normalization (think 3NF) / star-schema in the warehouse.
               | 
               | Side benefit is that, non ml use-cases are able to
               | leverage functionality within Chronon.
        
               | jamesblonde wrote:
               | FeatureSets are mutable data and TrainingSets are
               | consistent snapshots of feature data (from FeatureSets).
               | I fail to see what that has to do with composability.
               | Join is still available for FeatureSets to enable
                | composable feature views - join is reuse of feature
               | data. GroupBy is just an aggregation in a feature
               | pipeline, not sure your point here. You can still do star
               | schema (and even snowflake schema if you have the right
               | abstractions).
        
               | jamesblonde wrote:
               | Normalization is a model-dependent transformation and
               | happens after the feature store - needs to be consistent
               | between training and inference pipelines.
        
               | nikhilsimha wrote:
               | Normalization is overloaded. I was referring to schema
               | normalization (3NF etc) not feature normalization - like
               | standard scaling etc.
        
               | jamesblonde wrote:
               | Ok, but star schema is denormalized. Snowflake is
               | normalized.
        
               | nikhilsimha wrote:
               | To be pedantic, even in star schema - the dim tables are
               | denormalized, fact tables are not.
               | 
                | I agree that my statement would be much better if I used
                | snowflake schema instead.
        
               | jvican wrote:
               | Thank you for sharing!
        
               | esafak wrote:
               | Please can you expand? What limitations, computations?
        
               | nikhilsimha wrote:
                | Let's say you want to compute the avg transaction value
                | of a user in the last 90 days. You could pull individual
                | transactions and average at request time - or you could
                | pre-compute partial aggregates and re-aggregate on read.
               | 
                | OLAP systems are fundamentally designed to scale the read
                | path - the former approach. Feature serving needs the
                | latter.
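The write-side pre-computation described here can be sketched as follows: keep a (SUM, COUNT) partial aggregate per user per day, and only merge and finalize at read time. A toy illustration of the technique, not Chronon's internals; all names are hypothetical:

```python
# Toy write-side pre-aggregation: maintain (sum, count) partials per
# (user, day); reads merge the relevant partials and finalize the AVG.
from collections import defaultdict

partials = defaultdict(lambda: [0.0, 0])  # (user, day) -> [sum, count]

def on_write(user, day, amount):
    """Cheap O(1) update per incoming transaction event."""
    p = partials[(user, day)]
    p[0] += amount
    p[1] += 1

def read_avg(user, days):
    """Merge the partials for the requested window and finalize AVG."""
    s = c = 0
    for d in days:
        ps, pc = partials[(user, d)]
        s, c = s + ps, c + pc
    return s / c if c else None

on_write("u1", 1, 10.0)
on_write("u1", 1, 30.0)
on_write("u1", 2, 20.0)
print(read_avg("u1", days=[1, 2]))  # 20.0
```

Because partial aggregates are mergeable, the read path does O(window) merges instead of scanning raw transactions.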
        
               | esafak wrote:
               | Does Chronon automatically determine what intermediate
               | calculations should be cached? Does it accept hints?
        
               | nikhilsimha wrote:
               | We don't accept hints yet - but we determine what to
               | cache.
        
               | omeze wrote:
               | That evaluation would be an amazing addendum or
               | engineering blog post! I know it's not as sexy as
               | announcing a product, but from an engineering perspective
               | the process matters as much as the outcome :)
        
         | csmpltn wrote:
         | There is none. The industry is being flooded with DS and "AI"
         | majors (and other generally non-technical people) that have
         | zero historical context on storage and database systems - and
         | so everything needs to be reinvented (but in Python this time)
         | and rebranded. At the end of the day you're simply looking at
         | different mixtures of relational databases, key-value stores,
         | graph databases, caches, time-series databases, column stores,
         | etc. The same stuff we've had for 50+ years.
        
           | nikhilsimha wrote:
           | Two main differences - ability to time travel for training
           | data generation and the ability to push compute to the write
           | side of the view rather than the read side for low latency
           | feature serving.
        
             | ShamelessC wrote:
             | > ability to time travel for training data generation
             | 
             | What now?
        
               | nikhilsimha wrote:
               | Pardon the jargon. But it is a necessary addition to the
               | vocabulary.
               | 
               | To evaluate if a feature is valuable, you could attach
               | the value of the feature to past inferences and retrain a
               | new model to check for improvement in performance.
               | 
               | But this "attach"-ing needs the feature value to be as of
               | the time of the past inference.
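The as-of "attach" step can be sketched in plain Python: for each past inference, look up the latest feature value at or before the inference timestamp. A hypothetical illustration, not Chronon code:

```python
# Toy point-in-time lookup: pick the latest feature value whose
# timestamp is <= the inference timestamp (an "as-of" join).
import bisect

def as_of(feature_history, ts):
    """feature_history: list of (timestamp, value), sorted by timestamp."""
    times = [t for t, _ in feature_history]
    i = bisect.bisect_right(times, ts) - 1
    return feature_history[i][1] if i >= 0 else None

history = [(100, 0.2), (200, 0.5), (300, 0.9)]   # feature value over time
inferences = [150, 250, 50]                       # past inference timestamps

print([as_of(history, t) for t in inferences])  # [0.2, 0.5, None]
```

Doing this correctly at scale, for feature values that must first be reconstructed, is the hard part the thread is debating.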
        
               | mulmen wrote:
               | That's not a new concept.
        
             | csmpltn wrote:
             | > "ability to time travel for training"
             | 
             | Nah, this is nothing new.
             | 
             | We've solved this for ages with "snapshots" or "archives",
             | or fancy indexing strategies, or just a freaking
             | "timestamp" column in your tables.
        
               | nikhilsimha wrote:
                | Snapshots can't travel back with millisecond precision
                | or even minute-level precision. They are just full dumps
                | at regular fixed intervals in time.
        
               | _se wrote:
               | Databases have had many forms of time travel for 30+
               | years now.
        
               | threeseed wrote:
               | Not at the latency needed for feature serving and most
               | databases struggle with column limits.
               | 
               | But please enlighten us on which databases to use so
               | Airbnb (and the rest of us) can stop wasting time.
        
               | refset wrote:
               | Shameless plug, but XTDB v2 is being built for low-
               | latency bitemporal queries over columnar storage and
               | might be applicable:
               | https://docs.xtdb.com/quickstart/query-the-past.html
               | 
               | We've not been developing v2 with ML feature serving in
               | mind so far, but I would love to speak with anyone
               | interested in this use case and figure out where the gaps
               | are.
        
               | mulmen wrote:
               | Snapshots don't have to be at regular intervals and can
               | be at whatever resolution you choose. You could snapshot
               | as the first step of training then keep that snapshot for
               | the life of the resulting model. Or you could use some
               | other time travel methodology. Snapshots are only one of
               | many options.
        
               | nikhilsimha wrote:
                | These are reconstructions of features / columns that
                | don't exist yet.
        
               | hobs wrote:
               | https://en.wikipedia.org/wiki/Sixth_normal_form Basically
               | we've had time travel (via triggers or built in temporal
                | tables or just writing the data) for a long time, it's
                | just expensive to have it all for an OLTP database.
               | 
               | We've also had slowly changing dimensions to solve this
               | type of problem for a decent amount of time for the
               | labels that sit on top of everything, though really these
               | are just fact tables with a similar historical approach.
        
               | ezvz wrote:
               | 6NF works well for some temporal data, but I haven't seen
               | it work well for windowed aggregations because the
               | start/end time format of saving values doesn't handle
               | events "falling out of the window" too well. At least the
               | examples I've seen have values change due to explicit
               | mutation events.
        
               | ezvz wrote:
               | There's a lot more to it than snapshots or timestamped
               | columns when it comes to ML training data generation. We
                | often have windowed aggregations that need to be computed
                | as of precise intra-day timestamps in order to achieve
               | parity between training data (backfilled in batch) and
               | the data that is being served online realtime (with
               | streaming aggregations being computed realtime).
               | 
               | Standard OLAP solutions right now are really good at
               | "What's the X day sum of this column as of this
               | timestamp", but when every row of your training data has
               | a precise intra-day timestamp that you need windowed
               | aggregations to be accurate as-of, this is a different
               | challenge.
               | 
               | And when you have many people sharing these aggregations,
               | but with potentially different timestamps/timelines, you
                | also want them sharing partial aggregations where
                | possible for efficiency.
               | 
               | All of this is well beyond the scope that is addressed by
               | standard OLAP data solutions.
               | 
               | Not to mention the fact that the offline computation
               | needs to translate seamlessly to power online serving
               | (i.e. seeding feature values, and combining with
               | streaming realtime aggregations), and the need for
               | online/offline consistency measurement.
               | 
               | That's why a lot of teams don't even bother with this,
               | and basically just log their feature values from online
               | to offline. But this limits what kind of data they can
               | use, and also how quickly they can iterate on new
               | features (need to wait for enough log data to accumulate
               | before you can train).
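The per-row, intra-day-timestamp requirement described above can be sketched in plain Python, brute-forced for clarity: each training row (key, timestamp) gets the window aggregate as of its own timestamp. Illustrative only, not how Chronon computes it:

```python
# Toy point-in-time-correct windowed aggregation: for every training
# row (key, ts), sum the events falling inside [ts - window, ts).
# Brute force; a real system would share partial window aggregates.

def windowed_sum_asof(events, key, ts, window):
    """events: list of (key, event_ts, value)."""
    return sum(v for k, et, v in events
               if k == key and ts - window <= et < ts)

events = [("p1", 10, 1), ("p1", 25, 2), ("p1", 40, 4)]
training_rows = [("p1", 30), ("p1", 50)]  # each row needs its own as-of sum

print([windowed_sum_asof(events, k, ts, window=20)
       for k, ts in training_rows])
# [3, 4]
```

Note how the event at t=10 contributes to the first row but has "fallen out of the window" for the second, which is exactly what a single snapshot or timestamp filter does not capture per row.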
        
               | mulmen wrote:
               | I'm still not seeing how this is a novel problem. You
               | just apply a filter to your timestamp column and re-run
               | the window function. It will give you the same value down
               | to the resolution of the timestamp every time.
        
               | ezvz wrote:
               | Let's try an example: `average page views in the last 1,
               | 7, 30, 60, 180 days`
               | 
               | You need these values accurate as of ~500k timestamps for
               | 10k different page ids, with significant skew for some
               | page ids.
               | 
               | So you have a "left" table with 500k rows, each with a
               | page id and timestamp. Then you have a `page_views` table
               | with many millions/billions/whatever rows that need to be
               | aggregated.
               | 
                | Sure, you _could_ do this backfill with SQL and fancy
                | window functions. But let's just look at what you would
                | need to do to actually make this work, assuming you
                | wanted it to be serving online with realtime updates
                | (from a page_views kafka topic that is the source of the
                | page views table):
               | 
                | For online serving:
                | 1. Decompose the batch computation to SUM and COUNT and
                | seed the values in your KV store.
                | 2. Write the streaming job that does realtime updates to
                | your SUMs/COUNTs.
                | 3. Have an API for fetching and finalizing the AVERAGE
                | value.
                | 
                | For backfilling:
                | 1. Write your verbose query with windowed aggregations
                | (I encourage you to actually try it).
                | 2. Often you also want a daily front-fill job for
                | scheduled retraining. Now you're also thinking about how
                | to reuse previous values. Maybe you reuse your decomposed
                | SUMs/COUNTs above, but if so you're now orchestrating
                | these pipelines.
               | 
                | For making sure you didn't mess it up:
                | 1. Compare logs of fetched features to backfilled values
                | to make sure that they're temporally consistent.
                | 
                | For sharing:
                | 1. Let's say other ML practitioners are also playing
                | around with this feature, but with different timelines
                | (i.e. different timestamps). Are they redoing all of the
                | computation? Or are you orchestrating caching and reusing
                | partial windows?
               | 
               | So you can do all that, or you can write a few lines of
               | python in Chronon.
               | 
               | Now let's say you want to add a window. Or say you want
               | to change it so it's aggregated by `user_id` rather than
               | `page_id`. Or say you want to add other aggregations
               | other than AVERAGE. You can redo all of that again, or
               | change a few lines of Python.
        
               | mulmen wrote:
               | I admit this is a bit outside my wheelhouse so I'm
               | probably still missing something.
               | 
               | Isn't this just a table with 5bn rows of timestamp,
               | page_type, page_views_t1d, page_views_t7d,
               | page_views_t30d, page_views_t60d, and page_views_t180d?
               | You can even compute this incrementally or in parallel by
               | timestamp and/or page_type.
               | 
               | What's the magic Chronon is doing?
        
               | echrisinger wrote:
               | What's with the dismissiveness? The author is a senior
               | staff engineer at a huge company & has worked in this
               | space for years. I'd suspect they've done their
               | diligence...
        
             | jyhu wrote:
             | Have you guys considered Rockset? What you mentioned are
             | some classic real-time aggregation use cases and Rockset
             | seems to support that well:
             | https://docs.rockset.com/documentation/docs/ingestion-
             | rollup...
        
       | travisporter wrote:
       | Paywalled for me
        
         | nikhilsimha wrote:
         | It opens for me in incognito mode - albeit with a large popup
         | that I had to close.
        
       | djaykay wrote:
       | The downside is after you use the platform for a week, you have
       | to delete all the expired models yourself and clean up all the
       | labels or face a hefty housekeeping surcharge.
        
       | evolutionblues wrote:
        | Great work! When it comes to batched computations, why not
        | leverage intermediate state, much like streaming jobs? For
        | example, if we need to calculate a past-30-day sum for a value
        | daily - it seems like this would be computed from scratch daily.
        | Would it not make sense to model this as a sliding window that's
        | updated daily?
        
         | nikhilsimha wrote:
         | We do this for training data generation already.
         | 
         | We have plans to implement this behavior for computing the
         | batch arm of feature serving.
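The sliding-window idea in the question above can be sketched as follows: keep per-day totals inside the window, add the newest day, and subtract the day that falls out, so each daily update is O(1) rather than a from-scratch recompute. A toy illustration with a 3-day window; not Chronon's implementation:

```python
# Toy incremental sliding-window sum, updated once per "day".
from collections import deque

WINDOW = 3  # 3 days, kept small for the example

window = deque()  # per-day totals currently inside the window
running_sum = 0

def advance_day(day_total):
    """O(1) daily update: add the new day, evict the day sliding out."""
    global running_sum
    window.append(day_total)
    running_sum += day_total
    if len(window) > WINDOW:
        running_sum -= window.popleft()  # the day that falls out
    return running_sum

sums = [advance_day(v) for v in [5, 7, 3, 10]]
print(sums)  # [5, 12, 15, 20]
```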
        
       | siquick wrote:
       | What does Airbnb use ML for?
        
         | nikhilsimha wrote:
         | almost every button click is either powered by a model or
         | guarded by a model.
        
       ___________________________________________________________________
       (page generated 2024-04-09 23:01 UTC)