[HN Gopher] The Rise and Fall of the OLAP Cube (2020)
___________________________________________________________________
The Rise and Fall of the OLAP Cube (2020)
Author : smartmic
Score : 129 points
Date : 2021-07-05 11:18 UTC (11 hours ago)
(HTM) web link (www.holistics.io)
(TXT) w3m dump (www.holistics.io)
| random3 wrote:
| Brings up nostalgic memories from my previous life building
| distributed analytics systems. We had a fully distributed,
| scalable, real-time OLAP cube 10 years ago on top of HBase and
| Storm (or any other realtime processing pipeline). It was less
| than 30 KLOC and could be used as a library as well as fully
| distributed system.
|
| There are several OSS solutions but I still weep every time I
| think about the stupid reasons for which we delayed open-sourcing
| it. It was such a simple and scalable data model including many
| concepts that were not common at the time.
| https://medium.com/hstackdotorg/hbasecon-low-latency-olap-wi...
|
| I still believe there's room for a similar simple approach,
| likely taking advantage of some of the developments related to
| Algebird, Calcite, Gandiva, etc.
| georgewfraser wrote:
| The OLAP cube needs to come back, as an _optimization_ of a data-
| warehouse-centric workflow. If you are routinely running queries
| like:
|
|     SELECT dimension, measure
|     FROM table
|     WHERE filter = ?
|     GROUP BY 1
|
| You can save a lot of compute time by creating a materialized
| view [1] of:
|
|     SELECT dimension, filter, measure
|     FROM table
|     GROUP BY 1, 2
|
| and the query optimizer will even automatically redirect your
| query for you! At this point, the main thing we need is for BI
| tools to take advantage of this capability in the background.
|
| [1] https://docs.snowflake.com/en/user-guide/views-materialized....
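The pre-aggregation idea above can be sketched via Python's sqlite3. SQLite has no true materialized views, so a summary table stands in for one, and the commenter's `table` is renamed `fact` (TABLE is a keyword); all names here are placeholders, not a real schema:

```python
import sqlite3

# Sketch of the pre-aggregation idea. SQLite lacks materialized views,
# so a summary table stands in; table/column names are placeholders.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (dimension TEXT, filter TEXT, measure REAL)")
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?)",
    [("a", "x", 1.0), ("a", "x", 2.0), ("a", "y", 4.0), ("b", "x", 8.0)],
)

# "Materialized view": aggregate once at (dimension, filter) grain.
conn.execute("""
    CREATE TABLE fact_agg AS
    SELECT dimension, filter, SUM(measure) AS measure
    FROM fact GROUP BY 1, 2
""")

# The filtered query is now answered from the much smaller aggregate
# instead of rescanning the base table.
rows = conn.execute("""
    SELECT dimension, SUM(measure)
    FROM fact_agg WHERE filter = 'x'
    GROUP BY 1 ORDER BY 1
""").fetchall()
print(rows)  # [('a', 3.0), ('b', 8.0)]
```

In a warehouse like Snowflake the optimizer can redirect the original query to such a materialized view automatically; here the rewrite is done by hand.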
| motogpjimbo wrote:
| Did you mean to write HAVING in your first query? Otherwise
| your second query is not equivalent to the first, because the
| WHERE will not be performed prior to the aggregation.
| buremba wrote:
| While native materialized view features are a great start,
| unfortunately they're not very useful in practice if you have
| data lineage. They work like a black box, and they can't
| guarantee query performance.
|
| New-generation ELT tools such as dbt partially solve this
| problem. You can model your data and incrementally update the
| tables that can be used in your BI tools. Looker's Aggregate
| Awareness is also a great start but unfortunately it only works
| for Looker.
|
| We try to solve this problem with metriql as well:
| https://metriql.com/introduction/aggregates The idea is to
| define these measures & dimensions once and use them everywhere:
| your BI tools, data science tools, etc.
|
| Disclaimer: I'm the tech lead of the project.
| glogla wrote:
| Is that related to lightdash.com somehow?
|
| It seems like a very similar technology and also the webpages
| for both are almost identical.
| buremba wrote:
| We both use dbt as the data modeling layer but we don't
| actually develop a standalone BI tool. Instead, we
| integrate to third-party BI tools such as Data Studio,
| Tableau, Metabase, etc.
|
| We love Looker and wanted to bring the LookML experience to
| existing BI tools rather than introducing a new BI tool;
| that's how metriql was born. I believe that Lightdash is a
| cool project, especially for data analysts who are
| extensively using dbt, but metriql targets users who are
| already using a BI tool. I'm not sure which pages are
| identical; could you point me to them?
| glogla wrote:
| Compare https://metriql.com/introduction/creating-
| datasets and https://docs.lightdash.com/guides/how-to-
| create-metrics
|
| I thought you were affiliated somehow, but looking at it
| now, it seems you just use the same documentation site
| generator :)
| mulmen wrote:
| But those queries aren't equivalent so how is anything saved by
| materializing the second one?
|
| e:
|
| I believe (I could be wrong!) you edited the second query from
|
|     SELECT dimension, measure
|     FROM table
|     GROUP BY 1
|
| to
|
|     SELECT dimension, filter, measure
|     FROM table
|     GROUP BY 1, 2
|
| This addresses the filtering but how is that any different from
| the original table? Presumably `table` could have been a finer
| grain than the filter and dimension but you'd do better to add
| the rest of the dimensions as well, at which point you're most
| of the way to a star schema.
|
| This kind of pre-computed aggregate is typical in data
| warehousing. But is it really an "OLAP cube"?
|
| In general I agree there is value in the methods of the past
| and we would be well served to adapt those concepts to our work
| today.
| georgewfraser wrote:
| It's much smaller than the original table. If you compute
| lots of these, then voila, you have an OLAP cube.
| motogpjimbo wrote:
| He did edit his comment, but unfortunately didn't acknowledge
| the edit.
| _dark_matter_ wrote:
| looker does exactly this, though you do have to specify which
| dimensions to aggregate: https://docs.looker.com/data-
| modeling/learning-lookml/aggreg...
| unklefolk wrote:
| I'm not a big fan of OLAP cubes / data warehouses but it should
| be noted that these aren't always just about scale and speed.
| They also serve to provide a single view of the data when the
| data is held across multiple databases / systems. Your orders
| may be in PostgreSQL, your inventory tracked in MySQL, etc. So
| bringing all that data into a single database for analytics can
| be valuable.
| chipsa wrote:
| This isn't against data warehouses, but just the cubes.
| Bringing everything into a single database for analysis is
| still required. But you extract, load, then transform,
| instead of extract, transform, load.
| rossdavidh wrote:
| While that's true, it feels to me that your comment made this
| point a lot more clearly than the article did.
| AtlasBarfed wrote:
| Which is why hadoop took off?
| debarshri wrote:
| I am not a subject matter expert but I spent the first few
| years of my career doing data warehousing. I think OLAP cubes are
| falling out of fashion because traditional BI and reporting in
| general is falling out of fashion.
|
| With self-service BI tools like Looker, business folks are
| running more ad-hoc queries instead of relying on data
| traditionally piped through ETL processes.
| dang wrote:
| One past thread:
|
| _The Rise and Fall of the OLAP Cube_ -
| https://news.ycombinator.com/item?id=22189178 - Jan 2020 (53
| comments)
| gengstrand wrote:
| My first exposure to OLAP was on a team developing a front end to
| Essbase that semi-automated report generation after reloading the
| cube. Since then, I have worked with various OLAP technologies
| from Microsoft to Mondrian.
|
| I agree with Mr. Chin, the author of this blog. OLAP cubes
| can scale up but not out which limits their potential size so
| they can't really accommodate big data scenarios. My previous
| company used Apache Druid instead. Druid can do a lot of what
| OLAP did but using a distributed approach. Although Druid does
| not provide tight integration with spreadsheets, it does provide
| near real-time ingestion, on-the-fly rollup, and first class time
| series analysis.
| iblaine wrote:
| I'm a bit opinionated on data modelling, so a couple points.
|
| 1. "OLAP Cubes" arguably belong to Microsoft and refer to SQL
| Server cubes that require MDX queries. It's a solution given to
| us by Microsoft that comes with well understood features. "OLAP"
| itself is a general term used to describe any database used for
| analytical processing, so OLAP has a wide range of uses.
|
| 2. OLAP Cubes (as defined above) started to decline in
| popularity in 2015 (I'd argue).
|
| 3. Any solution to fixing "OLAP" that comes from a commercial
| vendor is suspicious. As painful as Kimball & Inmon are, they
| are ideas that don't require vendor lock-in.
|
| 4. At my current company, we recently wrapped up a process where
| we set out to define the future of our DW. Our DW was encumbered
| by various design patterns and contractors that came and went
| over the past decade. We analyzed the various tables & ETLs to
| come up with a clear set of principles. The end result looks
| eerily close to Kimball, but with our own naming conventions.
| Our new set of principles clearly solves our business needs.
| The point being, you don't need Kimball, Inmon, Looker, etc. to
| solve data modelling problems.
|
| 5. Columnar databases are so pervasive these days that you should
| no longer need to worry about a lack of options when choosing the
| right storage engine for your OLAP db.
|
| 6. More and more data marts are being defined by BI Teams
| w/little modeling/CS skills. To that end, I think it's important
| to educate your BI teams as a means to minimize the fallout, and
| be accommodating in your DW as means to work efficiently with
| your BI teams. This is to say settle on a set of data modelling
| principles that work for you, but may not work for someone else.
| willvarfar wrote:
| I think the article is spot on about workflow but off on tech.
| Under the hood, Cubes will be back.
|
| Snowflake is the cubeless column store.
|
| And now Firebolt comes along, promising orders-of-magnitude
| improvements. How? Largely "join indexes" and "aggregation
| indexes", which seem to smell like cubes.
|
| So take the modern column store and slap some cubes in front.
|
| In Firebolt these seem to be added by programmers. But I think
| the future will see OLAP MPP engines that transparently pick
| cubes, maintain them, reuse them, and discard them, all
| automatically.
| dominotw wrote:
| > How? Largely "join indexes" and "aggregation indexes" which
| seem to smell like cubes.
|
| Just curious: what stops Snowflake from adding these too?
| Trying to understand what innovation Firebolt is doing.
| elthor89 wrote:
| "Be suspicious of companies that are heavily locked into the OLAP
| cube workflow. (Learn how to do this here)."
|
| The opposite is true too. Be suspicious of companies trying to
| lock you into their self-service data visualization tool. I
| have seen many BI tool vendors trying to lock their clients
| into their tool. As a result they end up running unnecessarily
| complex models.
|
| I find OLAP a pretty good mental model that developers and
| users can understand.
|
| The sheer amount of handwritten, tailor-made ETL or ELT
| pipelines is something I'd like to see automated or replaced
| with something better.
| glogla wrote:
| Yup. At my job we ended up with some way-too-important stuff
| "modelled in the BI tool" in a way that nobody could understand
| how it actually worked, and we had to do a full rewrite to
| model it in SQL.
|
| But not all BI tools are alike - something like Superset which
| internally uses SQL and expects users to do self-service
| transformation using views is easy to industrialize since you
| already have the queries. Something like Tableau, not so much.
|
| (If only Superset were more mature)
| [deleted]
| [deleted]
| malkia wrote:
| At a previous job, on a human ads evaluation team, we had a big
| row-oriented table that got turned into a column-oriented one
| every couple of hours. I think it got close to a million columns
| (expanded from leaf protobuf fields). We had some black/white
| listing as the schema grew too big (~20GB? just the schema, and
| the limits were a few gigs). That was years ago, so it has
| probably changed a lot.
|
| But the use case was for various analysts (statisticians,
| linguists, etc., or any internal users of the table from other
| teams), as their queries might use any column they like. There
| was probably a way to track which columns got used over time,
| but since it was used for launching new things, you would not
| know in advance - better safe than sorry, I guess.
|
| Anyway, I'm still fascinated that this terabyte thing got
| converted using 20k machines for a few hours and worked! (It
| wasn't streaming; it was converting over and over from the
| beginning. This might've changed too!)
| intrasight wrote:
| I've been doing OLAP for over 20 years, so I can say with
| authority that OLAP is generally a bag of hurt. The tooling
| sucks, the languages suck, the costs suck. The name itself
| seemingly harks back to a time before computers. "Online"? As
| opposed to paper tablets?
| [deleted]
| myohmy wrote:
| OLAP is legacy hell. I built an MDX query to solve a report
| problem, but now I can't replace it: my seniors failed to
| deconstruct it because it is stupidly difficult to wrap your
| head around. So now it sits in limbo until I can find time to
| do the work myself. Let me just take a look at my calendar and
| find my next free week... oh, there is none.
| goodpoint wrote:
| Online as opposed to batch.
| intrasight wrote:
| I know. But still, you must agree, a stupid name. The rest of
| the IT industry would say "real-time vs batch". And OLAP is
| batch for most implementations.
| purerandomness wrote:
| The name was coined way before "online" meant what it means
| today.
| bradleyjg wrote:
| Real-time is engine control software not a webpage with a
| spinner.
| darepublic wrote:
| I remember, on a dark gloomy day, going to the bookstore and
| picking up a book about SQL Server Reporting Services. Some of
| the toughest times of my life.
| crazydoggers wrote:
| I always felt the value of an OLAP cube was uncovering questions
| you may not know to ask. If you've got a cube with a good
| interface, it means more people in your organization can inspect
| the data looking for trends across dimensions that might not be
| obvious. As an example, in a previous organization, someone
| just playing with an OLAP cube once found that certain DUI
| searches became much more common on weekend mornings (perhaps
| obvious in retrospect). That insight allowed better targeting
| and sales, and made concrete what would otherwise be just a
| guess.
|
| Sometimes just exploring the data allows stuff like that to pop
| out.
|
| Are there similar interfaces for columnar stores? Or do all the
| analytics need to be pre-compiled? The ability to
| slice/dice/filter/aggregate your data in perhaps non obvious ways
| is really the value of business analytics in my opinion.
| mulmen wrote:
| I don't think this benefit is unique to OLAP _cubes_. This is
| just generally the reason you have a data warehouse at all. The
| implementation is up to you but if you model the _business
| process_ then you will be ready to answer questions that have
| not been asked.
| lou1306 wrote:
| The author mentions the Codd-OLAP conflict of interest, lamenting
| that despite what happened "Codd is still regarded as the father
| of the relational database". But this remark sounds like a non
| sequitur to me. No matter the scandal he got into in the 90s, he
| still invented relational algebra back in the 70s, didn't he?
| data_ders wrote:
| the "scandal"
|
| > coined the term Online analytical processing (OLAP) and wrote
| the "twelve laws of online analytical processing". Controversy
| erupted, however, after it was discovered that this paper had
| been sponsored by Arbor Software (subsequently Hyperion, now
| acquired by Oracle), a conflict of interest that had not been
| disclosed, and Computerworld withdrew the paper.
|
| https://en.m.wikipedia.org/wiki/Edgar_F._Codd
| brazzy wrote:
| He published a _scientific paper_ that was to some degree
| paid advertising, and didn't disclose the payment.
|
| That kind of thing still totally counts as a scandal.
| ryanmarsh wrote:
| Indeed, but was the paper wrong?
| skeeter2020 wrote:
| It doesn't matter. The undisclosed conflict of interest
| should eliminate it from consideration.
| coldacid wrote:
| >If you're running a query like "give me the average price of all
| transactions over the past 5 years", a relational database would
| have to load all the rows from the previous 5 years even though
| it merely wants to aggregate the price field; a columnar database
| would only have to examine one column -- the price column.
|
| Wouldn't that be _two_ columns?
| glogla wrote:
| Yeah but it's nitpicky :)
|
| Maybe "past 5 years" means "over everything", and then it might
| technically be one column.
|
| Or if you're using something like S3 with Parquet for "columnar
| database" and it is partitioned by date, then the date values
| are stored in metadata - so you would logically read two
| columns but physically only read one from storage. Same story
| for something like Redshift and using date as sortkey.
| citrin_ru wrote:
| > Wouldn't that be two columns?
|
| Columnar databases can also have indices. If there is an index
| by date (or time), then your DB will know the row range from
| the index and will read only the price column within that
| range. If there is no index by date, it would be two columns,
| but that is still much less than a full row with many columns.
| Animats wrote:
| If the number of transactions is small enough to query quickly,
| a simple SQL query will give you that. If the number of
| transactions is very large, querying a random sample of perhaps
| 1 in 1000 will give you reasonably accurate results. You can
| calculate the probable error from the sample size.
|
| There are questions which require examining all the data, but
| often you don't really need to.
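The sampling approach described above can be sketched with synthetic data: estimate the mean from roughly 1-in-1000 rows and bound the probable error using the standard error of the sample (the data and numbers here are made up for illustration):

```python
import random
import statistics

# Synthetic stand-in for a large transactions table: 1M "prices".
random.seed(0)
population = [random.gauss(100.0, 15.0) for _ in range(1_000_000)]

# Sample roughly 1 in 1000 rows.
sample = [x for x in population if random.random() < 0.001]
n = len(sample)

# Point estimate plus its standard error (s / sqrt(n)) -- the
# "probable error from the sample size" mentioned above.
estimate = statistics.fmean(sample)
std_err = statistics.stdev(sample) / n ** 0.5

true_mean = statistics.fmean(population)
print(f"n={n}, estimate={estimate:.2f} +/- {std_err:.2f}, "
      f"true={true_mean:.2f}")
```

The estimate lands within a few standard errors of the true mean while touching ~0.1% of the rows.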
| ryanmarsh wrote:
| You can pry the star schema database from my cold dead hands. You
| can get a lot of work done, without much effort, in multi-
| dimensional row stores with bare SQL. The ETLs are easier to
| write than the map-reduce jobs I've seen on columnar stores. ETL
| pipelines get the data in the structure I need before I begin
| analysis. ELT requires me to know too much about the original
| structure of the data. Sure, that's useful when you find a novel
| case, but shouldn't that be folded back into the data warehouse
| for everyone?
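The "lots of work with bare SQL" point can be sketched against a minimal star schema via Python's sqlite3; the tables and names below are hypothetical, not from the comment:

```python
import sqlite3

# Minimal star schema: one fact table keyed to a date dimension.
# Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, weekday TEXT);
    CREATE TABLE fact_sales (date_key INTEGER, amount REAL);
    INSERT INTO dim_date VALUES (1, 'Mon'), (2, 'Sat'), (3, 'Sun');
    INSERT INTO fact_sales VALUES (1, 10.0), (2, 25.0), (2, 5.0), (3, 40.0);
""")

# Slicing by a dimension attribute is an ordinary join plus GROUP BY --
# no cube engine required.
rows = conn.execute("""
    SELECT d.weekday, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_date AS d USING (date_key)
    GROUP BY d.weekday
    ORDER BY d.weekday
""").fetchall()
print(rows)  # [('Mon', 10.0), ('Sat', 30.0), ('Sun', 40.0)]
```

Adding more dimensions (product, store, customer) just adds more joins of the same shape, which is what makes the star layout easy to query by hand.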
| garethrowlands wrote:
| Columnar stores speak SQL these days, though they may map-
| reduce under the hood. ELT vs. ETL just moves some
| transformation into SQL; the data normally still ends up in
| the data warehouse.
| mulmen wrote:
| I absolutely love Looker for this reason. It understands
| foreign keys so if you model everything as stars it just writes
| SQL for you.
|
| So simple, so powerful. I wish something open source would do
| this painfully obvious thing.
| marcinzm wrote:
| >Sure, that's useful when you find a novel case, but shouldn't
| that be folded back into the data warehouse for everyone?
|
| And this isn't possible to do in the data warehouse why
| exactly? Most every company seems to use DBT so even analysts
| can write transforms to generate tables and views in the
| warehouse from other tables. Hell, even Fivetran lets you run
| DBT or transformations after loading data.
| eyeball wrote:
| One big table combining all of your facts and dimensions for me.
|
| Less compute time, less time for business users since they don't
| need to figure out star schema joins. Storage is cheap.
| RantyDave wrote:
| I'm not sure we can call NoSQL "a fad that has gone away".
| ridaj wrote:
| I would say yes it's now possible to do analytics without cubes,
| except when butting up against two things:
|
| * Really big data (~ PB/day), which makes it impractical to
| process every analytics query by going back to the source
|
| * User data subject to GDPR-like retention constraints, where
| you cannot usually hold on to the base data for as long as you
| need the analytics
| didibus wrote:
| One thing about columnar stores that annoys me is that when you
| do want to retrieve the full records for something you've
| found, it's really slow. I wish there were something of a
| hybrid.
| mwexler wrote:
| "We model data according to rigorous frameworks like Kimball or
| Inmon because we must regularly construct OLAP cubes for our
| analyses."
|
| Actually, no. We model our data this way so it can be used for
| business decisions. It doesn't take long for any entity of any
| scale to discover that the logging and eventing done for
| heartbeat status, debugging, and scaling is just different from
| what you need to make a variety of business decisions.
|
| You can solve this with more events at different scales (button
| clicks nested in screen views), or pick events or event rollups
| that appear to be clean business stages ("completed checkout"),
| but still, your finance team and marketing group all have
| different needs.
|
| So, you decide to have some core shared metrics, derived and
| defined, and make them usable by everyone. Folks agree on the
| definitions, and thanks to ease and trust, you see more data
| supporting more decisions.
|
| You discover that some folks are doing 10 table joins to get an
| answer; it's fast but difficult to extend for new questions. You
| decide to build a view that solves some of these pains, and
| refactor to allow a better time dimension. Your version links
| with the metrics you created, and the resulting queries shed tons
| of CTEs while becoming readable to the average user.
|
| And now, you have some ELT pipelines, some event transforms that
| result in counts and filters that map nicely to your business
| needs but still let you get at the raw atomic data, and you and
| your teams start to trust in consistent results. Your metrics
| are mostly clearly summable, and the ones that aren't live in
| table views that precalc the "daily uniques" or other metrics
| that need a bit of special handling.
|
| You've started modeling your data.
|
| No, we don't need olap cubes. But we do need some type of rigor
| around analytic data. Otherwise, why go to all the trouble to
| collect it, count it, and predict from it if it may be wrong with
| no measure of that uncertainty?
|
| And yeah, Kimball et al. are from a world where OLAP was the
| answer, but it turns out they solved a broader set of SQL
| problems. So, learn the good, toss the dated, and see
| what good data modeling can do for your analysis and predictions.
| glogla wrote:
| > Actually, no. We model our data this way so it can be used
| for business decisions
|
| Yeah, that would be my disagreement as well.
|
| Sure, some of Kimball might be obsoleted by modern technology.
| But I don't think the point of Kimball is cubes, and even if it
| was, what better is there?
|
| I'd be really interested in what else is there, that is more
| modern and suited to the modern world.
|
| Data Vault? That probably isn't it, mostly because it's
| all about the "platform to build your Kimball on" rather than
| the actual business modelling. (But the parts about Metrics
| Vault and Quality Vault and the "service stuff" are really
| good.)
|
| Anchor Modelling? That one seems, to 2021 eyes, like a
| way to implement columnar storage and schema-on-read in MS SQL
| Server using stored procedures ... which is probably not
| actually a good idea.
|
| The Puppini Bridge, a.k.a. Unified Star Schema, seems like an
| interesting concept, especially in the self-service space. But
| even its proponents warn that it doesn't scale
| performance-wise, and it is also kind of an incremental change
| on Kimball. (But the way the bridge table is kind of an
| adjacency matrix of the graph of your objects tickles my inner
| computer scientist's fancy.)
|
| So really, what else is there?
| mwexler wrote:
| BTW, their follow-up piece at
| https://www.holistics.io/blog/olap-is-not-olap-cube/ is a nice
| drill-down into some of these issues. Also a good read.
| tresil wrote:
| Could not have written this any better. I've been in the data
| business for over 10 years. A number of technologies have come
| and gone during that time, but dimensional modeling and solid
| SQL skills keep delivering remarkable value.
| chrstr wrote:
| Indeed, it is generally good for analytics to have a conceptual
| data model that is reasonably efficient both on the technical
| side (to quickly get answers for queries), and also on the
| logical side (to quickly translate questions to queries).
|
| This model doesn't need to be OLAP cubes, but it's also not
| that easy to find something better.
___________________________________________________________________
(page generated 2021-07-05 23:00 UTC)