hngopher.com

       [HN Gopher] The SQL query engine Trino (formerly PrestoSQL) reca...
       ___________________________________________________________________
        
       The SQL query engine Trino (formerly PrestoSQL) recaps a decade of
       innovation
        
       Author : simpligility
       Score  : 91 points
       Date   : 2022-08-04 16:54 UTC (6 hours ago)
        
 (HTM) web link (trino.io)
 (TXT) w3m dump (trino.io)
        
       | kache_ wrote:
       | trino is awesome
       | 
       | can't believe this shit is free as in freedom
        
       | mrwnmonm wrote:
       | Can you use Trino as a database proxy?
        
         | simpligility wrote:
         | Essentially yes ... it can be the query engine for many
         | databases at the same time.
        
       | jerryjerryjerry wrote:
       | One of the features I'm interested in (or would like to have)
       | from Trino or Presto is the workload management which can better
       | manage different types of queries and allocate resources
       | accordingly. This becomes important when more applications adopt
       | Trino or Presto as a distributed SQL database/platform, where the
       | impact from different queries or workloads can be mitigated,
       | besides the dedicated resources (CPU, MEM, etc.) can be allocated
       | to high priority workloads. I'm really wondering if/when such
       | capabilities may be provided.
       | 
       | BTW, purely curiosity, I compared Trino with Presto from OSS
       | point of view (https://ossinsight.io/analyze/prestodb/presto?vs=t
       | rinodb%2Ft...), both communities are still popular but Trino
       | seems more active than Presto now. I also wonder if two
       | communities may reunion someday again to really boost its impact
       | (comparing to Spark community).
        
         | bitsondatadev wrote:
         | For managing difrerent workloads, check out this blogs and this
         | videos from Shopify, Salesforce, Goldman Sachs, and Electronic
         | Arts, respectively:
         | 
         | - https://engineering.salesforce.com/how-to-etl-at-petabyte-
         | sc... - https://shopify.engineering/faster-trino-query-
         | execution-inf... - https://trino.io/episodes/33.html -
         | https://www.youtube.com/watch?v=-5mlZGjt6H4
         | 
         | All use the Lyft "Presto but really Trino"-Gateway project to
         | run different clusters to handle various workloads. They go
         | into various details for how this is achieved.
         | 
         | https://github.com/lyft/presto-gateway
         | 
         | Regarding the Trino/Presto split. I recommend looking at this
         | blog to better understand why these two communities aren't
         | mergeing. TL;DR Presto is a Facebook-driven project that mainly
         | considers running on the Facebook infrastructure. Trino is
         | community-driven that works on running well with all clouds and
         | common infasturcture in the Trino community which is why you
         | see a higher velocity there.
         | 
         | https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-...
         | https://trino.io/blog/2020/12/27/announcing-trino.html
         | 
         | Soon we anticipate that Trino will become the common name in
         | the community space but we'll always love the origins of the
         | Trino project being Presto.
        
           | jerryjerryjerry wrote:
           | Yes, definitely it may help if going with multiple clusters,
           | however, there are also many scenarios that we don't want to
           | maintain multiple clusters. For example, when we come to a
           | SaaS platform, multi-tenant is pretty typical where different
           | tenants may have different workloads, and workload management
           | would be needed for different users, or even within the same
           | tenant. So the "built-in" workload management (besides other
           | features for multi-tenant) would be a big plus.
        
           | ambigali wrote:
           | Ali here, with a perspective about the split. Disclosure - I
           | work at Ahana and am an active member of the Presto
           | Foundation. When I see things like this, it appears that
           | Trino/Starburst wants to continue to push the narrative that
           | Presto is a Facebook-driven project to keep the communities
           | fractured which is pretty unfortunate. In reality, Presto is
           | a community-based open source project housed under The Linux
           | Foundation and has dozens of companies actively contributing
           | to it and using it - Uber, Bytedance, Intel, Twitter,
           | Tencent, and many more. There's no reason why the 2
           | communities can't coexist peacefully.
           | 
           | For all intents and purposes, both projects are active and
           | lively. It seems that Trino is more focused on federation and
           | building out connectors. Presto is more focused on being the
           | engine for the data lake/lakehouse. Both projects are doing
           | well and solving different problems. There's been a lot of
           | innovative features in the Presto project over the last year
           | that are only in Presto, like Presto-on-Spark, disaggregated
           | coordinator, Project Aria, etc. In fact we just hosted a
           | fantastic user conference a few weeks ago that showcased a
           | lot of that innovation and how companies are using Presto at
           | massive scale today (if interested, check out the sessions: h
           | ttps://www.youtube.com/watch?v=Gi8i7eHqwyw&list=PLJVeO1NMmy..
           | .)
           | 
           | Long story short, Presto is alive and well, is not solely
           | backed by 1 company (quite the opposite of Trino/Starburst),
           | and has a lot of tech innovation on the roadmap. We're
           | excited about the future of Presto.
        
       | simpligility wrote:
       | Its great to see how far the project has come from the humble
       | beginnings to the current, rich open source ecosystem and
       | community.
        
       | tekkertje wrote:
       | One of my favorite OSS projects! Probably the most flexible and
       | fully featured distributed SQL query engine around. Congrats and
       | looking forward to the next decade!
        
         | bitsondatadev wrote:
         | <3
        
       | simpligility wrote:
       | Also working on a new edition for Trino: The Definitive Guide at
       | the moment.
        
       | mrwnmonm wrote:
       | We are building a SaaS BI tool.
       | 
       | To enable the users to connect to their databases... we have a
       | form that collects the database credentials from the user, saves
       | it in a secure way, and when the user writes or uses an SQL
       | query, we establish a database connection right away (from our
       | server), execute it, and return the results, and we keep the
       | connection alive for like 15mins.
       | 
       | But with serverless architecture, first query could go to
       | instance 1, so instance 1 will establish a db connection, then
       | the second query could go to instance 2, so instance 2 will
       | establish another one. You could end up with a lot of unnecessary
       | connections.
       | 
       | If you use AWS RDS (for yourself), beside lambda for example, AWS
       | have RDS Proxy to solve this problem.
       | 
       | So I was thinking about using Trino like the RDS Proxy, but for
       | more databases, and for our customers database, not ours. Is that
       | doable with Trino?
        
       | tzury wrote:
       | Trino vs ClickHouse, can anyone tell from experience how those
       | two compare?
        
         | bitsondatadev wrote:
         | Clickhouse is a realtime system where Trino is a batch-oriented
         | system. There are tradeoffs for doing realtime vs batch.
         | 
         | Realtime is generally more expensive to run as you process
         | every individual row as it comes, batch is when you can deal
         | with minute latency and want to handle a lot of data in chunks.
         | 
         | Trino is also a query engine rather than a database and it
         | connects to many different systems:
         | https://trino.io/docs/current/connector.html
         | 
         | It also happens to connect to Clickhouse and it's very common
         | that people will use Trino to query clickhouse realtime data
         | and join it with data in big query, an object store data lake,
         | or Snowflake:
         | https://trino.io/docs/current/connector/clickhouse.html
        
         | qoega wrote:
         | You can consider that ClickHouse allows both to query a lot of
         | supported external data sources(s3/hdfs/mysql/postgre/...) and
         | to store data in pretty efficient columnar way with
         | compression, indexes and all the bells and whistles. Native
         | storage allows to use all the information about keys/indices to
         | build query plan faster. With trino you can't store data inside
         | trino. You can't even insert data using trino which allows you
         | to solve scenarios like 'readonly analytics'. Trino allows you
         | to use single query language for all the supported systems. So
         | if you have a zoo of DBMS and object storages that you can just
         | query it can help you to hide this complexity.
        
         | [deleted]
        
       | dmead wrote:
       | I do the support for my department's trino cluster. We move ~1tb
       | (and growing) in ETL jobs and support interactive queries for the
       | data scientists/analysts.
       | 
       | It would be super good if you guys added big query write support.
       | Its really annoying to have to run a hive cluster in google to
       | act as a proxy for this.
        
         | atwebb wrote:
         | Any chance you have an overview of the architecture and
         | operations support required? How many data sources are you
         | pinging?
        
         | hashhar wrote:
         | BigQuery very recently announced their Storage Write API which
         | is one of the ways we were looking to implement this but there
         | are some issues with the latency and consistency guarantees
         | that it offers.
         | 
         | But, yes, we do plan to add that eventually after ironing out
         | all the kinks. See https://github.com/trinodb/trino/pull/13094
        
           | bitsondatadev wrote:
           | Also, you can keep track of all the BQ progress here:
           | https://github.com/trinodb/trino/issues/6867
        
       | QuotedAtoms wrote:
       | Can anyone clarify the differences between Trino and SparkSQL?
       | Our company has used SparkSQL to aggressively replace use-cases
       | that were based on PrestoSQL in the past.
        
         | ergocoder wrote:
         | I can chime in in an okayish useful manner.
         | 
         | Apart from implementation details, probably not much different.
         | It is similar to mysql vs postgresql. You are probably okay
         | with either.
        
           | hashhar wrote:
           | I must disclaim that I contribute to Trino.
           | 
           | I agree but it depends a bit on what purpose you are using
           | them for. If you mainly use the tool to JOIN some data in
           | bulk and then write output somewhere else (i.e. ETL) - either
           | will serve you fine.
           | 
           | If you write complex queries with multiple filters and want
           | to JOIN across multiple datasets - sure Spark can do that as
           | well but it's not as efficient in pushing down computation to
           | the source.
           | 
           | e.g. A query like SELECT c.custkey, sum(totalprice) FROM
           | orders o INNER JOIN customer c ON o.custkey = c.custkey WHERE
           | o.orderstatus = 'O' GROUP BY c.custkey; when ran on Spark
           | will pull both tables into memory and then perform the join +
           | filter for orderstatus = 'O' and then compute the sum.
           | 
           | While in case of Trino it'll push down the entire query into
           | the remote database (in this case, in other queries it'll
           | push down some parts of the query) so the source database
           | will not need to return gigabytes of data over the network
           | every time the query runs (and hence finish faster as well).
           | 
           | Trino tries to push-down some operations to the remote system
           | which can be done more efficiently there. e.g. filtering on a
           | column that has an index in the remote RDBMS will be faster
           | than pulling all data and then filtering in Trino. Spark
           | doesn't have strong pushdown and has to pull most of the raw
           | data and then apply processing on top of it.
           | 
           | That's one of the main differences. Spark is a distributed
           | job execution framework first while Trino is a distributed
           | federated query engine first and it shows in their strengths
           | and weaknesses.
           | 
           | If you want to run arbitrary user defined transformations on
           | data then Spark definitely has much more to offer than Trino.
        
       | skadamat wrote:
       | Big shout out to Brian Olsen from the Trino community (and
       | Starburst) for helping the Trino community be successful
       | 
       | - https://github.com/bitsondatadev
       | 
       | - https://www.linkedin.com/in/bitsondatadev/
       | 
       | I recommend the Trino Slack for people not already in it:
       | https://trino.io/slack.html
        
         | bitsondatadev wrote:
         | Thanks for the shoutout! :)
         | 
         | If you want to get started with Trino, here's a repo I created
         | to do so: https://github.com/bitsondatadev/trino-getting-
         | started
        
       | georgewfraser wrote:
       | The thing I wonder about with Presto and to a lesser extent Spark
       | is, how many of their users adopted this tool because it was an
       | easy migration path from Hive, and how many of those users will
       | eventually re-platform to something else?
        
         | bitsondatadev wrote:
         | I mean, the hive migration path is one thing. Now that Iceberg
         | is taking over the old Hive model, data lakes are all the rage
         | again.
         | 
         | The other thing I would say is that Trino and Presto are not
         | one-trick ponies or just hive replacements. There's also the
         | ability to query across multiple systems that is, to me, the
         | feature that future proofs a lot of architectures. It
         | inherently frees you up to fiddle with your data in different
         | systems but keep the access to that system in one location.
        
           | georgewfraser wrote:
           | Yeah I think that is the key question: will data lakes become
           | the dominant paradigm? There is certainly a lot of talk
           | around them, though I see a ton of companies are still just
           | going all in on a conventional data warehouse, but they tend
           | not to talk about it because it's not a new or interesting
           | thing to do.
        
             | bitsondatadev wrote:
             | Yeah, though a lot of Fivetran customers are likely the
             | type that would go all in on paying for a conventional data
             | warehouse where people using open source stacks may be the
             | ones that are using open ingestion alternatives.
             | 
             | We see a pretty even mix from the Trino/Starburst lens.
             | Bigger companies like to mix and match.
        
       | bitsondatadev wrote:
       | btw, if you want to know the backstory on why Presto is now
       | called Trino, here's the article:
       | 
       | https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-...
        
       | simpligility wrote:
       | Also just to note.. I am currently working on a refresh of Trino:
       | The Definitive Guide .. and would love to see you all at Trino
       | Summit in November.
       | 
       | https://trino.io/blog/2022/06/30/trino-summit-call-for-speak...
        
       | gavinray wrote:
       | I recently had to write SQL query generation for AWS Athena,
       | which is based off Presto 0.217
       | 
       | It turns out that the dialect doesn't support LATERAL joins with
       | a LIMIT in them. The below query only works if you remove the
       | LIMIT clause.
       | 
       | https://i.stack.imgur.com/rdB1s.png
       | 
       | This makes saying things like "Fetch all artists where ..., for
       | each artist fetch their first 3 albums where ..., and for each
       | album fetch the top 10 tracks where ..." really difficult
       | 
       | Does Trino support this out of curiosity?
        
         | karma_fountain wrote:
         | Is it possible to achieve this with a window function?
        
           | gavinray wrote:
           | I found out it is! Kudos to this kind internet stranger for
           | telling me:
           | 
           | https://stackoverflow.com/a/73129836/13485494
           | 
           | But man is it a huge PITA (especially when doing programmatic
           | code generation of the SQL) compared to LATERAL joins
           | 
           | Someone familiar with the CockroachDB query planner showed me
           | that a window function like this is what Cockroach turns
           | LATERAL joins into for instance:
           | demo@127.0.0.1:26257/movr> explain select * from abc, lateral
           | (select * from xyz where x = a limit 2);              *
           | filter         | estimated row count: 1         | filter:
           | row_num <= 2         |         +-- * window             |
           | estimated row count: 2             |             +-- * hash
           | join                 | estimated row count: 2
           | | equality: (x) = (a)                 |                 +-- *
           | scan                 |     estimated row count: 6 (100% of
           | the table; stats collected 2 minutes ago)                 |
           | table: xyz@xyz_pkey                 |     spans: FULL SCAN
           | |                 +-- * scan
           | estimated row count: 1 (100% of the table; stats collected 3
           | minutes ago)                         table: abc@abc_pkey
           | spans: FULL SCAN
        
         | bitsondatadev wrote:
         | Check out this PR. I believe we may have tackled this one but
         | you'd need to try it out on Trino:
         | https://github.com/trinodb/trino/pull/1415
        
           | gavinray wrote:
           | Hooray! Yet another data point for Trino > Presto as far as
           | I'm concerned ;^)
        
             | bitsondatadev wrote:
             | If you want to try an SaaS Athena alternative that's backed
             | by Trino you can check out Starburst Galaxy:
             | https://www.starburst.io/platform/starburst-galaxy/
             | 
             | Full disclosure I work at Starburst.
        
               | gavinray wrote:
               | Oh nice, I have high opinions of you folks!
               | 
               | Guy who goes by the name of "Randgalt" online builds some
               | great Java libraries and works there too I believe.
        
               | simpligility wrote:
               | Yep .. Jordan works on Trino and Starburst Galaxy. We got
               | lots of other great engineers helping as well btw.
               | 
               | https://github.com/randgalt
        
         | jjoonathan wrote:
         | AWS Athena: selling a buggy, old, stale copy of someone else's
         | work (Presto / Trino) for high prices and getting away with it
         | because you control the platform.
         | 
         | If that's not peak Amazon, I don't know what is.
        
           | lopatin wrote:
           | Can't you say the same thing for EC2 but with Linux instead
           | of Presto? Personally I like Athena. The fact that it's in
           | the Amazon platform and managed is a plus for me.
        
             | jjoonathan wrote:
             | EC2 involved substantial VM management and networking
             | innovations that I respect. Ditto lambda and S3. I would
             | not categorize any of these as OSS flips in nearly the way
             | that I would categorize Athena as an OSS flip.
        
           | slt2021 wrote:
           | AWS is just managed open source as a service
        
           | dmitrykoval wrote:
           | other arguments aside .. Athena costs $5 per 1TB scanned and
           | also supports predicates pushdown to S3 Select. I wouldn't
           | call this expensive, at least in comparison to self hosted
           | Presto.
        
             | hashhar wrote:
             | At a certain scale it does become very expensive. It's easy
             | math.
             | 
             | When your monthly Athena bill crosses whatever it would
             | cost to have 5 or 10 EC2 machines it'll be cheaper to use
             | Trino. At my previous workplace we moved from
             | ~$40,000/month to ~$18,000/month by replacing Athena.
             | 
             | Athena is a very good tool to start with - unless you have
             | super large scale you'll probably not outgrow it. But when
             | you do there's Trino.
             | 
             | I do contribute to Trino - although I was merely a user
             | when that cost reduction happened.
        
               | tfehring wrote:
               | I'm not sure the math is so easy. Even knowing the direct
               | cost savings in hindsight, engineers' time is expensive,
               | and it's not obvious that the ongoing engineering cost of
               | maintaining Trino on an EC2 cluster would be that far
               | below $22k/month. Even if you get a net cost savings on
               | an ongoing basis (which, granted, you probably do), you
               | may have a long payback period for the initial
               | engineering time spent evaluating solutions and getting
               | the deployment spun up.
               | 
               | And that's all with the benefit of hindsight - it's hard
               | to know _a priori_ how much cheaper your own deployment
               | will be compared to a managed service or how long it will
               | take to implement. Of course, anecdotes like yours help
               | with that, so thanks for sharing your experience!
        
               | dmitrykoval wrote:
               | Sure, I agree, above certain usage threshold hosted Trino
               | becomes totally justified. But then, some engineering
               | time to maintain the cluster has to be factored in as
               | well.. for anything ad-hoc in nature, I would start with
               | Athena by default.
        
           | mr_toad wrote:
           | Setting up and maintaining your own Trino cluster isn't
           | exactly trivial. You pay a premium for not having to do that.
        
           | RobinL wrote:
           | I think many users just see they can execute a query on huge
           | data cheaply and incredibly quickly and are delighted. That's
           | certainly my experience.
           | 
           | It's one of the backends available in Splink, our FOSS record
           | linkage software and it's revolutionary how it allows users
           | to execute large scale probabilistic record linkage
           | ridiculously cheaply. It wasn't long ago you needed very
           | expensive proprietary software plus a big on prem cluster,
           | costing in the hundreds of thousands, to achieve this.
           | 
           | A lot of the magic for me is on the infrastructure side: how
           | they can read/write large datasets from s3 so quickly, so the
           | value isn't just in the SQL engine.
        
             | pbronez wrote:
             | Splink looks cool. I'm familiar with Tamr and Senzing, but
             | this is the first FOSS option I've come across.
        
             | jfim wrote:
             | The biggest value for corporate users is that they get
             | everything already included as part of their existing cloud
             | agreement.
             | 
             | Adding a new vendor to the mix needs to involve
             | procurement, the legal team, vendor negotiations, while
             | using a new AWS feature is just a matter of using it, even
             | if it's not as good as the original ISV's version and
             | doesn't support the long term viability of the project.
        
             | jjoonathan wrote:
             | Yes, it's good to be platform king. We know. Low friction
             | for you, high friction for everyone else.
        
         | stevenmih wrote:
         | Steven here, I'm Ahana CEO (Managed Service for Presto) and
         | member of Linux Foundation's Presto Foundation Since you asked
         | about an alternative to AWS Athena, we offer a free-forever in-
         | VPC Managed Service for Presto on AWS, which keeps up with the
         | latest Presto releases.
         | 
         | Note: the original Presto continues to run in production at
         | Meta (fmr. Facebook), Uber, and recently ByteDance TikTok data
         | platform talked about running 1M queries a day with tens of
         | thousands of cores. Some reasons to stay with Presto: -
         | Reliability and scalability per above - Cutting edge
         | innovations only in later versions of Presto: multi-level
         | caching (project RaptorX) to boost query performance by 10X+
         | and table scan improvements (project Aria) to name a few - Only
         | PrestoDB is hosted by Linux Foundation, giving confidence to
         | community users that future releases will remain open.
        
         | hashhar wrote:
         | Works just fine in Trino.                 trino> USE
         | memory.default;       USE       trino:default> create table
         | artist (artistid int);       CREATE TABLE       trino:default>
         | create table album (albumid int, artistid int);       CREATE
         | TABLE       trino:default> insert into artist values 1, 2;
         | INSERT: 2 rows              Query 20220804_182827_00005_n4rat,
         | FINISHED, 1 node       Splits: 19 total, 19 done (100.00%)
         | 0.52 [0 rows, 0B] [0 rows/s, 0B/s]              trino:default>
         | insert into album values (11, 1), (12, 1), (21, 2);
         | INSERT: 3 rows              Query 20220804_182857_00006_n4rat,
         | FINISHED, 1 node       Splits: 19 total, 19 done (100.00%)
         | 0.18 [0 rows, 0B] [0 rows/s, 0B/s]              trino:default>
         | select * from (select * from artist limit 2) a cross join
         | lateral (select * from album where album.artistid = a.artistid
         | limit 2);        artistid | albumid | artistid
         | ----------+---------+----------               1 |      12 |
         | 1               1 |      11 |        1               2 |
         | 21 |        2       (3 rows)              Query
         | 20220804_182930_00007_n4rat, FINISHED, 1 node       Splits: 41
         | total, 41 done (100.00%)       0.35 [8 rows, 232B] [22 rows/s,
         | 661B/s]
        
       | ck_one wrote:
       | Can Trino be used as a Snowflake replacement? How is the query
       | speed compared to Snowflake?
        
         | cpard wrote:
         | Hey ck_one that's a hard question to answer and not get into
         | "benchmarketing" territory.
         | 
         | My suggestion is to try both under your own workloads and see
         | the difference. Trino is also used by products like Athena
         | (AWS) and Galaxy (Starburst) so if you want to play around and
         | see how Trino performs without spending too much time on
         | setting up clusters on your own, you can try these great
         | products.
         | 
         | Having said that, I'd like to add that building a performant
         | distributed query engine is just hard. Trino has been in
         | development for ten years and used by major companies in very
         | demanding environments, these environments is where the
         | technology has been defined and makes it what it is today and
         | it is a proof of its performance and stability.
         | 
         | (edited to add an important disclaimer that I work at
         | Starburst)
        
         | simpligility wrote:
         | Yes ... Starburst Enterprise, which is a commercial
         | distribution of Trino, can in fact also query Snowflake, but
         | also Delta Lake and many many other systems at the same time.
        
           | dominotw wrote:
           | hey do you work for the company. Prbly add a disclaimer.
        
             | simpligility wrote:
             | Yes... where would I put the disclaimer?
        
         | dominotw wrote:
         | Not sure if those are comparable 1:1. Do you mean you have a
         | datalake that you are using snowflake to query using 'external
         | tables' that you want to replace with trino ?
        
         | chrsig wrote:
         | i can't speak to trino, but with my experience with aws athena
         | and snowflake, they're roughly on par with each other across
         | the board.
        
         | simpligility wrote:
         | Also with Iceberg or Delta Lake and Trino you can basically run
         | a full lakehouse architecture on open source and places like
         | Bloomberg, Goldman Sachs and many others do just that.
         | 
         | https://trino.io/users.html
        
       ___________________________________________________________________
       (page generated 2022-08-04 23:02 UTC)