[HN Gopher] Apache Pinot 1.0
___________________________________________________________________
Apache Pinot 1.0
Author : PeterCorless
Score : 36 points
Date : 2023-09-19 19:44 UTC (3 hours ago)
(HTM) web link (pinot.apache.org)
(TXT) w3m dump (pinot.apache.org)
| gregw2 wrote:
| I poked around trying to find a high level understanding.
|
| Here's the best place to start from what I could tell:
| https://docs.pinot.apache.org/basics/concepts
|
| Based on that, it's an MPP columnar database focused on low-latency,
| streaming-ingested/realtimeish use cases, open sourced by
| LinkedIn's infra teams:
|
| _" Pinot is designed to deliver low latency queries on large
| datasets. To achieve this performance, Pinot stores data in a
| columnar format and adds additional indices to perform fast
| filtering, aggregation and group by._
|
| _Raw data is broken into small data shards. Each shard is
| converted into a unit called a segment. One or more segments
| together form a table, which is the logical container for
| querying Pinot using SQL/PQL._
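
To make the quoted description concrete, here is a minimal Python sketch of the idea: rows are split into shards, each shard becomes a columnar "segment" with an inverted index, and a table is queried by merging per-segment results. The `Segment` class, `query_table` function, and the sample data are all hypothetical names for illustration; this is not Pinot's actual storage format or API.

```python
from collections import defaultdict


class Segment:
    """Toy columnar segment: one list per column plus an inverted index."""

    def __init__(self, rows, indexed_column):
        # Store each column contiguously instead of keeping row objects.
        # Assumes every row has the same set of keys.
        self.columns = defaultdict(list)
        for row in rows:
            for name, value in row.items():
                self.columns[name].append(value)
        # Inverted index: column value -> list of row positions.
        self.inverted = defaultdict(list)
        for doc_id, value in enumerate(self.columns[indexed_column]):
            self.inverted[value].append(doc_id)

    def sum_where(self, metric, value):
        # Filter through the index, then read only the metric column.
        doc_ids = self.inverted.get(value, [])
        return sum(self.columns[metric][i] for i in doc_ids)


def query_table(segments, metric, value):
    # A table is a collection of segments; per-segment results are merged.
    return sum(s.sum_where(metric, value) for s in segments)


rows = [
    {"country": "US", "clicks": 3},
    {"country": "DE", "clicks": 5},
    {"country": "US", "clicks": 4},
    {"country": "FR", "clicks": 1},
]
# Split the raw rows into two small shards, one segment each.
table = [Segment(rows[:2], "country"), Segment(rows[2:], "country")]
print(query_table(table, "clicks", "US"))  # prints 7
```

The filter never scans the `clicks` values of non-matching rows, which is the rough intuition behind pairing columnar storage with extra indices.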
|
| _... Logically, a cluster is simply a group of tenants. As with
| the classical definition of a cluster, it is also a grouping of a
| set of compute nodes. Typically, there is only one cluster per
| environment/data center. There is no need to create multiple
| clusters since Pinot supports the concept of tenants. At
| LinkedIn, the largest Pinot cluster consists of 1000+ nodes
| distributed across a data center. The number of nodes in a
| cluster can be added in a way that will linearly increase
| performance and availability of queries."_
|
| Also per https://docs.pinot.apache.org/basics/getting-started/frequen...
|
| _Q: When are new events queryable when getting ingested into a
| real-time table?_
|
| _A: Events are available to queries as soon as they are
| ingested. This is because events are instantly indexed in memory
| upon ingestion._
|
| _The ingestion of events into the real-time table is not
| transactional, so replicas of the open segment are not
| immediately consistent. Pinot trades consistency for availability
| upon network partitioning (CAP theorem) to provide ultra-low
| ingestion latencies at high throughput. However, when the open
| segment is closed and its in-memory indexes are flushed to
| persistent storage, all its replicas are guaranteed to be
| consistent, with the commit protocol._
|
| _... Q: Why are segments not strictly time-partitioned?_
|
| _A: It might seem odd that segments are not strictly time-
| partitioned, unlike similar systems such as Apache Druid. This
| allows real-time ingestion to consume out-of-order events. Even
| though segments are not strictly time-partitioned, Pinot will
| still index, prune, and query segments intelligently by time
| intervals for the performance of hybrid tables and time-filtered
| data. When generating offline segments, the segments are generated
| such that segments only contain one time interval and are well
| partitioned by the time column._
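
One way to picture pruning by time interval when segments are not strictly time-partitioned is the sketch below: each segment records the minimum and maximum event time it contains, and a time-filtered query skips any segment whose range does not overlap the filter. The `SegmentMeta` type, `prune_by_time` function, and the timestamps are assumptions made for illustration, not Pinot's implementation.

```python
from dataclasses import dataclass


@dataclass
class SegmentMeta:
    name: str
    min_time: int  # smallest event timestamp in the segment
    max_time: int  # largest event timestamp in the segment


def prune_by_time(segments, start, end):
    """Keep segments whose [min_time, max_time] overlaps [start, end]."""
    return [s for s in segments if s.min_time <= end and s.max_time >= start]


segments = [
    SegmentMeta("seg_0", 0, 120),    # overlaps seg_1 because of late events
    SegmentMeta("seg_1", 90, 210),
    SegmentMeta("seg_2", 200, 300),
]
selected = prune_by_time(segments, 100, 150)
print([s.name for s in selected])  # prints ['seg_0', 'seg_1']
```

Because out-of-order events let segment ranges overlap, more than one segment can match a narrow filter, but segments entirely outside the range are still skipped.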
| emmanueloga_ wrote:
| Does anyone understand how the Apache foundation works? Do
| projects receive monetary funding, or is it just the "prestige"
| of becoming an Apache project? What's the advantage of being
| under their umbrella?
|
| At this point, any legitimacy of working with their foundation
| may be lost under the weight of hundreds or even thousands of
| projects of unknown quality levels (I'm not talking about this
| project's merits, which I know nothing about).
| latchkey wrote:
| 20+ year Apache Member here... yea, it is pretty much prestige.
| But you get community, infrastructure, legal, branding, as well
| as mentoring on 'how to do open source'.
|
| It is all pretty well documented. Here are a couple good links
| to get you started...
|
| The Apache Way
|
| https://www.apache.org/theapacheway/
|
| The PMC oversees the projects:
|
| https://www.apache.org/dev/pmc.html
| drewda wrote:
| Apache Foundation provides a well-trod legal path for large
| corporations to release their internal code as open-source.
|
| They do have some competition. Linux Foundation is another
| large non-profit that creates umbrella entities for a bunch of
| open-source software originally created within larger tech
| companies. I get the impression that Apache Foundation goes for
| breadth, taking any and all donations, while Linux Foundation
| goes for depth in specific topics.
|
| In terms of funding, for open-source projects originally
| created within a larger company, that company will often
| provide a financial donation to the foundation that is taking
| on its ongoing management. The foundation will also take a cut
| of future donations to the project, to pay for the
| administrative overhead of the non-profit.
| politelemon wrote:
| So is this similar to Amazon's Athena? I'm trying to place what a
| 'realtime distributed OLAP datastore' is, or competes with, in
| cloudy/naive terms.
| fiddlerwoaroof wrote:
| My impression is that it's in the same space as RedShift,
| Snowflake, Citus, Greenplum, ClickHouse.
| glogla wrote:
| RedShift, Snowflake, Citus, Greenplum and Athena are OLAP
| engines, but not real-time focused. For this one, it is more
| similar to Druid, ClickHouse or RockSet.
|
| The 1.0 version of Pinot seems to bring a lot of maturity;
| they seem to have added a new engine that can do joins now. I'm
| not sure how stable it is, but it seems interesting.
|
| As for what this kind of database is useful for, it is for
| operational analytics on large data that also updates in real
| time. In my domain that would be things like having insight
| into large supply chains or manufacturing operations, like
| power plants or factories, just in general for monitoring
| stuff. I know it's also used in security and finance (for
| fraud).
| zX41ZdbW wrote:
| It's hardly comparable with ClickHouse. Even loading a table
| with 100M rows is not an easy endeavor in Pinot:
| https://github.com/ClickHouse/ClickBench/blob/main/pinot/ben...
___________________________________________________________________
(page generated 2023-09-19 23:00 UTC)