[HN Gopher] The SQL query engine Trino (formerly PrestoSQL) reca...
___________________________________________________________________
The SQL query engine Trino (formerly PrestoSQL) recaps a decade of
innovation
Author : simpligility
Score : 91 points
Date : 2022-08-04 16:54 UTC (6 hours ago)
(HTM) web link (trino.io)
(TXT) w3m dump (trino.io)
| kache_ wrote:
| trino is awesome
|
| can't believe this shit is free as in freedom
| mrwnmonm wrote:
| Can you use Trino as a database proxy?
| simpligility wrote:
| Essentially yes ... it can be the query engine for many
| databases at the same time.
| jerryjerryjerry wrote:
| One of the features I'm interested in (or would like to have)
| from Trino or Presto is the workload management which can better
| manage different types of queries and allocate resources
| accordingly. This becomes important when more applications adopt
| Trino or Presto as a distributed SQL database/platform, where the
| impact from different queries or workloads can be mitigated,
| besides the dedicated resources (CPU, MEM, etc.) can be allocated
| to high priority workloads. I'm really wondering if/when such
| capabilities may be provided.
|
| BTW, purely curiosity, I compared Trino with Presto from OSS
| point of view (https://ossinsight.io/analyze/prestodb/presto?vs=t
| rinodb%2Ft...), both communities are still popular but Trino
| seems more active than Presto now. I also wonder if two
| communities may reunion someday again to really boost its impact
| (comparing to Spark community).
| bitsondatadev wrote:
| For managing difrerent workloads, check out this blogs and this
| videos from Shopify, Salesforce, Goldman Sachs, and Electronic
| Arts, respectively:
|
| - https://engineering.salesforce.com/how-to-etl-at-petabyte-
| sc... - https://shopify.engineering/faster-trino-query-
| execution-inf... - https://trino.io/episodes/33.html -
| https://www.youtube.com/watch?v=-5mlZGjt6H4
|
| All use the Lyft "Presto but really Trino"-Gateway project to
| run different clusters to handle various workloads. They go
| into various details for how this is achieved.
|
| https://github.com/lyft/presto-gateway
|
| Regarding the Trino/Presto split. I recommend looking at this
| blog to better understand why these two communities aren't
| mergeing. TL;DR Presto is a Facebook-driven project that mainly
| considers running on the Facebook infrastructure. Trino is
| community-driven that works on running well with all clouds and
| common infasturcture in the Trino community which is why you
| see a higher velocity there.
|
| https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-...
| https://trino.io/blog/2020/12/27/announcing-trino.html
|
| Soon we anticipate that Trino will become the common name in
| the community space but we'll always love the origins of the
| Trino project being Presto.
| jerryjerryjerry wrote:
| Yes, definitely it may help if going with multiple clusters,
| however, there are also many scenarios that we don't want to
| maintain multiple clusters. For example, when we come to a
| SaaS platform, multi-tenant is pretty typical where different
| tenants may have different workloads, and workload management
| would be needed for different users, or even within the same
| tenant. So the "built-in" workload management (besides other
| features for multi-tenant) would be a big plus.
| ambigali wrote:
| Ali here, with a perspective about the split. Disclosure - I
| work at Ahana and am an active member of the Presto
| Foundation. When I see things like this, it appears that
| Trino/Starburst wants to continue to push the narrative that
| Presto is a Facebook-driven project to keep the communities
| fractured which is pretty unfortunate. In reality, Presto is
| a community-based open source project housed under The Linux
| Foundation and has dozens of companies actively contributing
| to it and using it - Uber, Bytedance, Intel, Twitter,
| Tencent, and many more. There's no reason why the 2
| communities can't coexist peacefully.
|
| For all intents and purposes, both projects are active and
| lively. It seems that Trino is more focused on federation and
| building out connectors. Presto is more focused on being the
| engine for the data lake/lakehouse. Both projects are doing
| well and solving different problems. There's been a lot of
| innovative features in the Presto project over the last year
| that are only in Presto, like Presto-on-Spark, disaggregated
| coordinator, Project Aria, etc. In fact we just hosted a
| fantastic user conference a few weeks ago that showcased a
| lot of that innovation and how companies are using Presto at
| massive scale today (if interested, check out the sessions: h
| ttps://www.youtube.com/watch?v=Gi8i7eHqwyw&list=PLJVeO1NMmy..
| .)
|
| Long story short, Presto is alive and well, is not solely
| backed by 1 company (quite the opposite of Trino/Starburst),
| and has a lot of tech innovation on the roadmap. We're
| excited about the future of Presto.
| simpligility wrote:
| Its great to see how far the project has come from the humble
| beginnings to the current, rich open source ecosystem and
| community.
| tekkertje wrote:
| One of my favorite OSS projects! Probably the most flexible and
| fully featured distributed SQL query engine around. Congrats and
| looking forward to the next decade!
| bitsondatadev wrote:
| <3
| simpligility wrote:
| Also working on a new edition for Trino: The Definitive Guide at
| the moment.
| mrwnmonm wrote:
| We are building a SaaS BI tool.
|
| To enable the users to connect to their databases... we have a
| form that collects the database credentials from the user, saves
| it in a secure way, and when the user writes or uses an SQL
| query, we establish a database connection right away (from our
| server), execute it, and return the results, and we keep the
| connection alive for like 15mins.
|
| But with serverless architecture, first query could go to
| instance 1, so instance 1 will establish a db connection, then
| the second query could go to instance 2, so instance 2 will
| establish another one. You could end up with a lot of unnecessary
| connections.
|
| If you use AWS RDS (for yourself), beside lambda for example, AWS
| have RDS Proxy to solve this problem.
|
| So I was thinking about using Trino like the RDS Proxy, but for
| more databases, and for our customers database, not ours. Is that
| doable with Trino?
| tzury wrote:
| Trino vs ClickHouse, can anyone tell from experience how those
| two compare?
| bitsondatadev wrote:
| Clickhouse is a realtime system where Trino is a batch-oriented
| system. There are tradeoffs for doing realtime vs batch.
|
| Realtime is generally more expensive to run as you process
| every individual row as it comes, batch is when you can deal
| with minute latency and want to handle a lot of data in chunks.
|
| Trino is also a query engine rather than a database and it
| connects to many different systems:
| https://trino.io/docs/current/connector.html
|
| It also happens to connect to Clickhouse and it's very common
| that people will use Trino to query clickhouse realtime data
| and join it with data in big query, an object store data lake,
| or Snowflake:
| https://trino.io/docs/current/connector/clickhouse.html
| qoega wrote:
| You can consider that ClickHouse allows both to query a lot of
| supported external data sources(s3/hdfs/mysql/postgre/...) and
| to store data in pretty efficient columnar way with
| compression, indexes and all the bells and whistles. Native
| storage allows to use all the information about keys/indices to
| build query plan faster. With trino you can't store data inside
| trino. You can't even insert data using trino which allows you
| to solve scenarios like 'readonly analytics'. Trino allows you
| to use single query language for all the supported systems. So
| if you have a zoo of DBMS and object storages that you can just
| query it can help you to hide this complexity.
| [deleted]
| dmead wrote:
| I do the support for my department's trino cluster. We move ~1tb
| (and growing) in ETL jobs and support interactive queries for the
| data scientists/analysts.
|
| It would be super good if you guys added big query write support.
| Its really annoying to have to run a hive cluster in google to
| act as a proxy for this.
| atwebb wrote:
| Any chance you have an overview of the architecture and
| operations support required? How many data sources are you
| pinging?
| hashhar wrote:
| BigQuery very recently announced their Storage Write API which
| is one of the ways we were looking to implement this but there
| are some issues with the latency and consistency guarantees
| that it offers.
|
| But, yes, we do plan to add that eventually after ironing out
| all the kinks. See https://github.com/trinodb/trino/pull/13094
| bitsondatadev wrote:
| Also, you can keep track of all the BQ progress here:
| https://github.com/trinodb/trino/issues/6867
| QuotedAtoms wrote:
| Can anyone clarify the differences between Trino and SparkSQL?
| Our company has used SparkSQL to aggressively replace use-cases
| that were based on PrestoSQL in the past.
| ergocoder wrote:
| I can chime in in an okayish useful manner.
|
| Apart from implementation details, probably not much different.
| It is similar to mysql vs postgresql. You are probably okay
| with either.
| hashhar wrote:
| I must disclaim that I contribute to Trino.
|
| I agree but it depends a bit on what purpose you are using
| them for. If you mainly use the tool to JOIN some data in
| bulk and then write output somewhere else (i.e. ETL) - either
| will serve you fine.
|
| If you write complex queries with multiple filters and want
| to JOIN across multiple datasets - sure Spark can do that as
| well but it's not as efficient in pushing down computation to
| the source.
|
| e.g. A query like SELECT c.custkey, sum(totalprice) FROM
| orders o INNER JOIN customer c ON o.custkey = c.custkey WHERE
| o.orderstatus = 'O' GROUP BY c.custkey; when ran on Spark
| will pull both tables into memory and then perform the join +
| filter for orderstatus = 'O' and then compute the sum.
|
| While in case of Trino it'll push down the entire query into
| the remote database (in this case, in other queries it'll
| push down some parts of the query) so the source database
| will not need to return gigabytes of data over the network
| every time the query runs (and hence finish faster as well).
|
| Trino tries to push-down some operations to the remote system
| which can be done more efficiently there. e.g. filtering on a
| column that has an index in the remote RDBMS will be faster
| than pulling all data and then filtering in Trino. Spark
| doesn't have strong pushdown and has to pull most of the raw
| data and then apply processing on top of it.
|
| That's one of the main differences. Spark is a distributed
| job execution framework first while Trino is a distributed
| federated query engine first and it shows in their strengths
| and weaknesses.
|
| If you want to run arbitrary user defined transformations on
| data then Spark definitely has much more to offer than Trino.
| skadamat wrote:
| Big shout out to Brian Olsen from the Trino community (and
| Starburst) for helping the Trino community be successful
|
| - https://github.com/bitsondatadev
|
| - https://www.linkedin.com/in/bitsondatadev/
|
| I recommend the Trino Slack for people not already in it:
| https://trino.io/slack.html
| bitsondatadev wrote:
| Thanks for the shoutout! :)
|
| If you want to get started with Trino, here's a repo I created
| to do so: https://github.com/bitsondatadev/trino-getting-
| started
| georgewfraser wrote:
| The thing I wonder about with Presto and to a lesser extent Spark
| is, how many of their users adopted this tool because it was an
| easy migration path from Hive, and how many of those users will
| eventually re-platform to something else?
| bitsondatadev wrote:
| I mean, the hive migration path is one thing. Now that Iceberg
| is taking over the old Hive model, data lakes are all the rage
| again.
|
| The other thing I would say is that Trino and Presto are not
| one-trick ponies or just hive replacements. There's also the
| ability to query across multiple systems that is, to me, the
| feature that future proofs a lot of architectures. It
| inherently frees you up to fiddle with your data in different
| systems but keep the access to that system in one location.
| georgewfraser wrote:
| Yeah I think that is the key question: will data lakes become
| the dominant paradigm? There is certainly a lot of talk
| around them, though I see a ton of companies are still just
| going all in on a conventional data warehouse, but they tend
| not to talk about it because it's not a new or interesting
| thing to do.
| bitsondatadev wrote:
| Yeah, though a lot of Fivetran customers are likely the
| type that would go all in on paying for a conventional data
| warehouse where people using open source stacks may be the
| ones that are using open ingestion alternatives.
|
| We see a pretty even mix from the Trino/Starburst lens.
| Bigger companies like to mix and match.
| bitsondatadev wrote:
| btw, if you want to know the backstory on why Presto is now
| called Trino, here's the article:
|
| https://trino.io/blog/2022/08/02/leaving-facebook-meta-best-...
| simpligility wrote:
| Also just to note.. I am currently working on a refresh of Trino:
| The Definitive Guide .. and would love to see you all at Trino
| Summit in November.
|
| https://trino.io/blog/2022/06/30/trino-summit-call-for-speak...
| gavinray wrote:
| I recently had to write SQL query generation for AWS Athena,
| which is based off Presto 0.217
|
| It turns out that the dialect doesn't support LATERAL joins with
| a LIMIT in them. The below query only works if you remove the
| LIMIT clause.
|
| https://i.stack.imgur.com/rdB1s.png
|
| This makes saying things like "Fetch all artists where ..., for
| each artist fetch their first 3 albums where ..., and for each
| album fetch the top 10 tracks where ..." really difficult
|
| Does Trino support this out of curiosity?
| karma_fountain wrote:
| Is it possible to achieve this with a window function?
| gavinray wrote:
| I found out it is! Kudos to this kind internet stranger for
| telling me:
|
| https://stackoverflow.com/a/73129836/13485494
|
| But man is it a huge PITA (especially when doing programmatic
| code generation of the SQL) compared to LATERAL joins
|
| Someone familiar with the CockroachDB query planner showed me
| that a window function like this is what Cockroach turns
| LATERAL joins into for instance:
| demo@127.0.0.1:26257/movr> explain select * from abc, lateral
| (select * from xyz where x = a limit 2); *
| filter | estimated row count: 1 | filter:
| row_num <= 2 | +-- * window |
| estimated row count: 2 | +-- * hash
| join | estimated row count: 2
| | equality: (x) = (a) | +-- *
| scan | estimated row count: 6 (100% of
| the table; stats collected 2 minutes ago) |
| table: xyz@xyz_pkey | spans: FULL SCAN
| | +-- * scan
| estimated row count: 1 (100% of the table; stats collected 3
| minutes ago) table: abc@abc_pkey
| spans: FULL SCAN
| bitsondatadev wrote:
| Check out this PR. I believe we may have tackled this one but
| you'd need to try it out on Trino:
| https://github.com/trinodb/trino/pull/1415
| gavinray wrote:
| Hooray! Yet another data point for Trino > Presto as far as
| I'm concerned ;^)
| bitsondatadev wrote:
| If you want to try an SaaS Athena alternative that's backed
| by Trino you can check out Starburst Galaxy:
| https://www.starburst.io/platform/starburst-galaxy/
|
| Full disclosure I work at Starburst.
| gavinray wrote:
| Oh nice, I have high opinions of you folks!
|
| Guy who goes by the name of "Randgalt" online builds some
| great Java libraries and works there too I believe.
| simpligility wrote:
| Yep .. Jordan works on Trino and Starburst Galaxy. We got
| lots of other great engineers helping as well btw.
|
| https://github.com/randgalt
| jjoonathan wrote:
| AWS Athena: selling a buggy, old, stale copy of someone else's
| work (Presto / Trino) for high prices and getting away with it
| because you control the platform.
|
| If that's not peak Amazon, I don't know what is.
| lopatin wrote:
| Can't you say the same thing for EC2 but with Linux instead
| of Presto? Personally I like Athena. The fact that it's in
| the Amazon platform and managed is a plus for me.
| jjoonathan wrote:
| EC2 involved substantial VM management and networking
| innovations that I respect. Ditto lambda and S3. I would
| not categorize any of these as OSS flips in nearly the way
| that I would categorize Athena as an OSS flip.
| slt2021 wrote:
| AWS is just managed open source as a service
| dmitrykoval wrote:
| other arguments aside .. Athena costs $5 per 1TB scanned and
| also supports predicates pushdown to S3 Select. I wouldn't
| call this expensive, at least in comparison to self hosted
| Presto.
| hashhar wrote:
| At a certain scale it does become very expensive. It's easy
| math.
|
| When your monthly Athena bill crosses whatever it would
| cost to have 5 or 10 EC2 machines it'll be cheaper to use
| Trino. At my previous workplace we moved from
| ~$40,000/month to ~$18,000/month by replacing Athena.
|
| Athena is a very good tool to start with - unless you have
| super large scale you'll probably not outgrow it. But when
| you do there's Trino.
|
| I do contribute to Trino - although I was merely a user
| when that cost reduction happened.
| tfehring wrote:
| I'm not sure the math is so easy. Even knowing the direct
| cost savings in hindsight, engineers' time is expensive,
| and it's not obvious that the ongoing engineering cost of
| maintaining Trino on an EC2 cluster would be that far
| below $22k/month. Even if you get a net cost savings on
| an ongoing basis (which, granted, you probably do), you
| may have a long payback period for the initial
| engineering time spent evaluating solutions and getting
| the deployment spun up.
|
| And that's all with the benefit of hindsight - it's hard
| to know _a priori_ how much cheaper your own deployment
| will be compared to a managed service or how long it will
| take to implement. Of course, anecdotes like yours help
| with that, so thanks for sharing your experience!
| dmitrykoval wrote:
| Sure, I agree, above certain usage threshold hosted Trino
| becomes totally justified. But then, some engineering
| time to maintain the cluster has to be factored in as
| well.. for anything ad-hoc in nature, I would start with
| Athena by default.
| mr_toad wrote:
| Setting up and maintaining your own Trino cluster isn't
| exactly trivial. You pay a premium for not having to do that.
| RobinL wrote:
| I think many users just see they can execute a query on huge
| data cheaply and incredibly quickly and are delighted. That's
| certainly my experience.
|
| It's one of the backends available in Splink, our FOSS record
| linkage software and it's revolutionary how it allows users
| to execute large scale probabilistic record linkage
| ridiculously cheaply. It wasn't long ago you needed very
| expensive proprietary software plus a big on prem cluster,
| costing in the hundreds of thousands, to achieve this.
|
| A lot of the magic for me is on the infrastructure side: how
| they can read/write large datasets from s3 so quickly, so the
| value isn't just in the SQL engine.
| pbronez wrote:
| Splink looks cool. I'm familiar with Tamr and Senzing, but
| this is the first FOSS option I've come across.
| jfim wrote:
| The biggest value for corporate users is that they get
| everything already included as part of their existing cloud
| agreement.
|
| Adding a new vendor to the mix needs to involve
| procurement, the legal team, vendor negotiations, while
| using a new AWS feature is just a matter of using it, even
| if it's not as good as the original ISV's version and
| doesn't support the long term viability of the project.
| jjoonathan wrote:
| Yes, it's good to be platform king. We know. Low friction
| for you, high friction for everyone else.
| stevenmih wrote:
| Steven here, I'm Ahana CEO (Managed Service for Presto) and
| member of Linux Foundation's Presto Foundation Since you asked
| about an alternative to AWS Athena, we offer a free-forever in-
| VPC Managed Service for Presto on AWS, which keeps up with the
| latest Presto releases.
|
| Note: the original Presto continues to run in production at
| Meta (fmr. Facebook), Uber, and recently ByteDance TikTok data
| platform talked about running 1M queries a day with tens of
| thousands of cores. Some reasons to stay with Presto: -
| Reliability and scalability per above - Cutting edge
| innovations only in later versions of Presto: multi-level
| caching (project RaptorX) to boost query performance by 10X+
| and table scan improvements (project Aria) to name a few - Only
| PrestoDB is hosted by Linux Foundation, giving confidence to
| community users that future releases will remain open.
| hashhar wrote:
| Works just fine in Trino. trino> USE
| memory.default; USE trino:default> create table
| artist (artistid int); CREATE TABLE trino:default>
| create table album (albumid int, artistid int); CREATE
| TABLE trino:default> insert into artist values 1, 2;
| INSERT: 2 rows Query 20220804_182827_00005_n4rat,
| FINISHED, 1 node Splits: 19 total, 19 done (100.00%)
| 0.52 [0 rows, 0B] [0 rows/s, 0B/s] trino:default>
| insert into album values (11, 1), (12, 1), (21, 2);
| INSERT: 3 rows Query 20220804_182857_00006_n4rat,
| FINISHED, 1 node Splits: 19 total, 19 done (100.00%)
| 0.18 [0 rows, 0B] [0 rows/s, 0B/s] trino:default>
| select * from (select * from artist limit 2) a cross join
| lateral (select * from album where album.artistid = a.artistid
| limit 2); artistid | albumid | artistid
| ----------+---------+---------- 1 | 12 |
| 1 1 | 11 | 1 2 |
| 21 | 2 (3 rows) Query
| 20220804_182930_00007_n4rat, FINISHED, 1 node Splits: 41
| total, 41 done (100.00%) 0.35 [8 rows, 232B] [22 rows/s,
| 661B/s]
| ck_one wrote:
| Can Trino be used as a Snowflake replacement? How is the query
| speed compared to Snowflake?
| cpard wrote:
| Hey ck_one that's a hard question to answer and not get into
| "benchmarketing" territory.
|
| My suggestion is to try both under your own workloads and see
| the difference. Trino is also used by products like Athena
| (AWS) and Galaxy (Starburst) so if you want to play around and
| see how Trino performs without spending too much time on
| setting up clusters on your own, you can try these great
| products.
|
| Having said that, I'd like to add that building a performant
| distributed query engine is just hard. Trino has been in
| development for ten years and used by major companies in very
| demanding environments, these environments is where the
| technology has been defined and makes it what it is today and
| it is a proof of its performance and stability.
|
| (edited to add an important disclaimer that I work at
| Starburst)
| simpligility wrote:
| Yes ... Starburst Enterprise, which is a commercial
| distribution of Trino, can in fact also query Snowflake, but
| also Delta Lake and many many other systems at the same time.
| dominotw wrote:
| hey do you work for the company. Prbly add a disclaimer.
| simpligility wrote:
| Yes... where would I put the disclaimer?
| dominotw wrote:
| Not sure if those are comparable 1:1. Do you mean you have a
| datalake that you are using snowflake to query using 'external
| tables' that you want to replace with trino ?
| chrsig wrote:
| i can't speak to trino, but with my experience with aws athena
| and snowflake, they're roughly on par with each other across
| the board.
| simpligility wrote:
| Also with Iceberg or Delta Lake and Trino you can basically run
| a full lakehouse architecture on open source and places like
| Bloomberg, Goldman Sachs and many others do just that.
|
| https://trino.io/users.html
___________________________________________________________________
(page generated 2022-08-04 23:02 UTC)