https://www.crunchydata.com/blog/crunchy-data-warehouse-postgres-with-iceberg-for-high-performance-analytics

Crunchy Data Warehouse: Postgres with Iceberg for High Performance Analytics
Marco Slot · Nov 20, 2024 · 8 min read

PostgreSQL is the bedrock on which many of today's organizations are built. The versatility, reliability, performance, and extensibility of PostgreSQL make it the perfect tool for a large variety of operational workloads.

The one area in which PostgreSQL has historically been lacking is analytics, which involves queries that summarize, filter, or transform large amounts of data. Modern analytical databases are designed to query data in data lakes in formats like Parquet using a fast vectorized query engine. Relational data is stored in open table formats such as Apache Iceberg, which can support transactions while also taking advantage of the cheap, durable, effectively unlimited storage that object stores like S3 provide. Multiple applications can access data directly from storage in a scalable way without losing data consistency or integrity. If only you could have all of that in PostgreSQL...

Today, we are thrilled to announce Crunchy Data Warehouse, a modern, high performance analytics database built into PostgreSQL, initially available as a managed service on AWS via Crunchy Bridge.

What is Crunchy Data Warehouse?

Crunchy Data Warehouse brings many ground-breaking new features to PostgreSQL:

* Iceberg tables in PostgreSQL: You can create, manage, query, and update Iceberg tables that are cheaply and durably stored in S3 as easily as PostgreSQL tables, with fast analytical queries, and perform ACID transactions that span your operational tables and data lake.
* High performance analytics: Crunchy Data Warehouse extends the PostgreSQL query planner to delegate parts of queries to DuckDB for vectorized execution, and automatically caches files on local NVMe drives. Together these optimizations deliver, on average, over 10x better performance than tuned PostgreSQL on TPC-H queries on the same machine, and even greater improvements on many common query patterns.
* Query raw data files in your data lake: Most data lakes consist of CSV/JSON/Parquet files in S3, which are passed between different systems. You can easily query data files and directories that are already in S3, or insert them into Iceberg tables. You can also query external Iceberg tables, Delta tables, and various geospatial file formats.
* Flexible data import/export: You can load data directly from an S3 bucket or http(s) URL into Iceberg or regular PostgreSQL tables, and you can write query results back to S3 to create advanced data pipelines.
* Seamless integrations: Crunchy Data Warehouse follows the "Lakehouse" architecture and brings together the Iceberg and PostgreSQL ecosystems. External tools can interact with Iceberg tables via PostgreSQL queries, or retrieve data directly from storage.
All of this comes without sacrificing any PostgreSQL features or compatibility with the ecosystem. Crunchy Data Warehouse uses extensions to stay current with the latest version of PostgreSQL (currently 17). Additionally, you can use Crunchy Data Warehouse as both an analytical and operational database for mixed workloads by combining regular tables with Iceberg tables.

Data Warehousing for the rest of us

You might be wondering: How does Crunchy Data Warehouse differ from other cloud data warehouses?

First and foremost, it's just PostgreSQL! You can use all the PostgreSQL tools and extensions you already know and love with Crunchy Data Warehouse, and there are many interesting new synergies between data warehouse features and existing PostgreSQL features.

Crunchy Data Warehouse uses a lightweight, server-based architecture that's optimized for efficient use of hardware and predictable cost. Performance is very competitive even when querying billions of rows, and you can supplement it with other tools thanks to the open Iceberg format. You can also use standard PostgreSQL features like materialized views to handle more demanding workloads.

We tried to make the experience of using Iceberg tables as simple and smooth as possible. If you already know PostgreSQL, then there is very little you need to learn. Even if you don't know PostgreSQL, we made it very easy to load your data and start querying.

There are still many features, optimizations, user interfaces, and integrations we are building or planning to build. However, if you're looking for a simple, batteries-included data stack with one extremely powerful, familiar tool that can serve most of your data needs and has an already-established ecosystem, Crunchy Data Warehouse might be a good choice.

Create your first Iceberg table in PostgreSQL

To get started using Iceberg tables, you simply add "using iceberg" when creating a new table in Crunchy Data Warehouse. You can create a table as usual by defining the column names and types, or create the table from a file (Parquet/CSV/JSON/...).

-- create an Iceberg table, infer the columns from an existing file and load it
create table trips_yellow ()
using iceberg
with (load_from = 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet');

\d trips_yellow
+-----------------------+-----------------------------+-----------+----------+---------+-------------+
| Column                | Type                        | Collation | Nullable | Default | FDW options |
+-----------------------+-----------------------------+-----------+----------+---------+-------------+
| vendorid              | integer                     |           |          |         |             |
| tpep_pickup_datetime  | timestamp without time zone |           |          |         |             |
| ...                   |                             |           |          |         |             |
| airport_fee           | double precision            |           |          |         |             |
+-----------------------+-----------------------------+-----------+----------+---------+-------------+

The data in Iceberg tables is stored in compressed Parquet files in managed object storage provided by Crunchy. You can also store tables in your own S3 bucket in the same region.
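If you would rather define the schema yourself instead of inferring it from a file, the same "using iceberg" clause works with ordinary DDL. A minimal sketch (the table and column names below are only illustrative):

-- a sketch: define the columns explicitly instead of inferring them from a file
create table taxi_zones (
    locationid int,
    borough text,
    zone_name text
) using iceberg;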
Now let's load some more data. You can use regular insert and copy statements, and also copy from a public or private URL:

-- load files directly from URLs
copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-02.parquet';
copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-03.parquet';
copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-04.parquet';
copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-05.parquet';
copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-06.parquet';
copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-07.parquet';
copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-08.parquet';

-- or, load files programmatically (and transactionally) in a function or DO block
do $do$
declare
  month text;
begin
  for month in
    select to_char(m, 'YYYY-MM')
    from generate_series('2024-02-01', '2024-08-31', interval '1 month') m
  loop
    execute format($$copy trips_yellow from 'https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_%s.parquet'$$, month);
  end loop;
end $do$;

Now we're ready to run our first analytical query. Below is a comparison of running an analytical query on ~26M rows on an Iceberg table and on a regular heap table. Iceberg is >70x faster.

(animated side-by-side comparison: iceberg-side-by-side-highres.gif)

Apart from regular queries, you can also perform inserts, updates, deletes, and schema changes on Iceberg tables. Small writes (e.g. single-row insert/update) will cause small files to be added to the Iceberg table, which over time reduces performance. You can vacuum the table, which compacts data and metadata files, and deletes files that are no longer used.

-- inserts create small files
create table measurements (station_name text, value double precision) using iceberg;
insert into measurements values ('Arsuz', 24.9), ('Haarlem', 13.4), ('Thessaloniki', 20.3);
insert into measurements values ('Lawrence', 23.4);
insert into measurements values ('Istanbul', 18.5);

-- compact small files in the measurements table
vacuum measurements;

-- or, run vacuum on all Iceberg tables hourly in the current database using pg_cron
select cron.schedule('@hourly', 'vacuum (iceberg)');

In the near future, we'll add auto-vacuuming of Iceberg tables across all databases.

Most PostgreSQL features are supported out-of-the-box on Iceberg tables (even triggers!). Below is a list of features that are supported and what is still on our road map.

Supported features for Iceberg tables:

* Full SELECT/INSERT/UPDATE/DELETE support
* Fast data loading via COPY
* Transactions across multiple tables (Iceberg and heap)
* Query using external tools (Spark, pyiceberg, ...)
* Vacuuming for compaction and data expiration
* CREATE TABLE .. USING iceberg AS SELECT ..
* Load data directly from CSV/JSON/Parquet/GDAL
* Schema changes
* Renames of columns, tables, schemas, databases
* Views
* Materialized views
* Modifying CTEs
* Nested data types (arrays, composite types, maps)
* User-defined types
* User-defined functions
* Stored procedures
* Prepared statements
* Subtransactions (savepoints, exceptions)
* Triggers
* Check constraints
* Sequences
* Generated columns
* EXPLAIN
* TRUNCATE
* Table-, column-, and view-level access controls
* PostGIS geometry columns
* Joins between Iceberg and heap tables
* Prepared transactions
* Export query results to CSV/JSON/Parquet files in S3
* Crunchy Bridge managed service features

Road map for Iceberg tables:

* Advanced Iceberg features: hidden partitioning, branching, time travel
* Iceberg REST API
* External writes to the Iceberg catalog
* Indexes
* Foreign key constraints, unique constraints, exclusion constraints
* Unlogged/temp tables
* MERGE
* INSERT .. ON CONFLICT
* TABLESAMPLE
* SELECT .. FOR UPDATE
* Row-level security
* System columns
* Logical replication
* Materialized views stored in Iceberg
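One item from the supported list worth calling out is CREATE TABLE .. USING iceberg AS SELECT, which lets you materialize query results directly into a new Iceberg table. A minimal sketch against the taxi data loaded earlier (the new table name and the rollup itself are just illustrative):

-- a sketch: materialize a daily rollup of the yellow taxi trips as a new Iceberg table
create table trips_yellow_daily using iceberg as
select tpep_pickup_datetime::date as pickup_date, count(*) as trip_count
from trips_yellow
group by 1;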
Transactions across your operational data and data lake

As an engineer, one of my favorite Crunchy Data Warehouse features is that a transaction can write to multiple Iceberg tables and heap tables as a single ACID transaction. For instance, you can safely perform transactional data movement operations from a staging table used for low-latency inserts into an Iceberg table used for very fast analytics.
-- create a staging table
create table measurements_staging (like measurements);

-- do fast inserts on a staging table
insert into measurements_staging values ('Charleston', 21.3);
Time: 1.153 ms

-- periodically move all the rows from the staging table into Iceberg
select cron.schedule('flush-staging', '* * * * *', $$
  with new_rows as (
    delete from measurements_staging
    returning *
  )
  insert into measurements select * from new_rows;
$$);

By setting this up as a pg_cron job, transient failures are handled automatically: the job simply re-runs, and because each run is a single atomic transaction, no rows are lost or duplicated. You can even safely do unions across the staging table and the Iceberg table if you need completely up-to-date query results, without risk of duplicates or missing rows.

-- get the total number of inserted rows across the staging table and the Iceberg table
select sum(count) from (
  select count(*) from measurements_staging
  union all
  select count(*) from measurements
) all_data;

Finally, adding an analytical query engine to a transactional database comes with many other benefits. For instance, you can easily create a materialized view directly from an Iceberg table or from raw data files, keep track of which files you have imported in incremental processing pipelines, or reliably export a time-partitioned table to S3. Transactions across operational and analytical tables solve many traditional data engineering problems, and not having to worry about them makes your data stack a lot simpler and more reliable.
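To make the materialized view point concrete, here is a minimal sketch that pre-aggregates the Iceberg-backed taxi data; the view name and aggregation are illustrative, and the view itself is stored as a regular PostgreSQL relation (materialized views stored in Iceberg are on the road map above):

-- a sketch: a materialized view over an Iceberg table, refreshed on demand
create materialized view trips_per_vendor as
select vendorid, count(*) as trip_count
from trips_yellow
group by vendorid;

-- refresh after new data has been loaded into the Iceberg table
refresh materialized view trips_per_vendor;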
Get started with your new Postgres Data Warehouse in minutes

You can get started in minutes by signing up for Crunchy Bridge and creating a "Warehouse cluster". You can then connect using your favorite PostgreSQL client, and consult the Crunchy Data Warehouse documentation for more information. We also made it possible to dump your heap tables from any PostgreSQL server and restore them directly into Iceberg tables, so you can immediately compare results.

Crunchy Data Warehouse succeeds Crunchy Bridge for Analytics and includes all of its features. If you are an existing Crunchy Bridge for Analytics customer, you can "resize" your cluster to Crunchy Data Warehouse to add pay-as-you-go managed storage. If you're interested in running Crunchy Data Warehouse on-premises, please contact us.