# Holy Duck! Fast Analysis with DuckDB + Pyarrow

*Trying out some new speedy tools for data analysis.*

Apr 26, 2022 · 12 min read · python · data · intermediate

Turning to DuckDB when you need to crunch more numbers faster than pandas in your Streamlit app. Inspired by the "DuckDB quacks Arrow" blogpost cross-posted on duckdb and arrow.

## Background

streamlit and Streamlit Cloud are fantastic for sharing your data exploration apps. A very common pattern uses csv files with pandas to accomplish the necessary steps of:

* Load the data into the program
* Filter the data by certain columns or attributes
* Compute analyses on the data (averages, counts, etc.)

## NYC Uber Data

* Streamlit DuckDB Uber NYC repo
* Streamlit Original Uber NYC repo

Let's take this NYC Uber dataset example from Streamlit. We'll pay attention to:

* How much RAM / memory is used
* How long it takes to perform each step

```python
import pandas as pd
import numpy as np

# import streamlit as st

# singleton ignored because we're not in streamlit anymore
# @st.experimental_singleton
def load_data():
    data = pd.read_csv(
        "uber-raw-data-sep14.csv.gz",
        nrows=100000,  # approx. 10% of data
        names=[
            "date/time",
            "lat",
            "lon",
        ],  # specify names directly since they don't change
        skiprows=1,  # don't read header since names specified directly
        usecols=[0, 1, 2],  # doesn't load last column, constant value "B02512"
        parse_dates=["date/time"],  # set as datetime instead of converting after the fact
    )
    return data
```

```python
%%time
data = load_data()
```

```
CPU times: user 3.29 s, sys: 33 ms, total: 3.33 s
Wall time: 3.32 s
```

```python
data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   date/time  100000 non-null  datetime64[ns]
 1   lat        100000 non-null  float64
 2   lon        100000 non-null  float64
dtypes: datetime64[ns](1), float64(2)
memory usage: 2.3 MB
```

Feel free to reference the read_csv documentation; the focus of this post, though, is the `nrows=100000` argument. `nrows` limits the number of rows that get loaded into our application. Taking in 100,000 rows landed us around 2.3 MB of memory allocation for the data, and it loaded on my computer in ~3 seconds.
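The `%%time` magic and `data.info()` above cover those two measurements inside a notebook. In a plain script you could get a rough equivalent with a perf_counter timer and `DataFrame.memory_usage`. A minimal sketch; `profile_load` is a hypothetical helper, not part of the original app:

```python
import time


def profile_load(loader):
    """Time a zero-argument loader and report the resulting DataFrame's memory footprint."""
    start = time.perf_counter()
    df = loader()
    elapsed = time.perf_counter() - start
    # deep=True counts the actual bytes behind object columns, not just the pointers
    mem_mb = df.memory_usage(deep=True).sum() / 2**20
    print(f"loaded {len(df):,} rows in {elapsed:.2f} s using ~{mem_mb:.1f} MB")
    return df


# using the load_data function defined above; on my numbers this prints
# something like "loaded 100,000 rows in 3.3 s using ~2.3 MB"
data = profile_load(load_data)
```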
Let's see how that would go without our `nrows` limitation.

```python
def load_full_data():
    data = pd.read_csv(
        "uber-raw-data-sep14.csv.gz",
        # nrows=100000,  # approx. 10% of data
        names=[
            "date/time",
            "lat",
            "lon",
        ],  # specify names directly since they don't change
        skiprows=1,  # don't read header since names specified directly
        usecols=[0, 1, 2],  # doesn't load last column, constant value "B02512"
        parse_dates=["date/time"],  # set as datetime instead of converting after the fact
    )
    return data
```

```python
%%time
full_data = load_full_data()
```

```
CPU times: user 33.6 s, sys: 243 ms, total: 33.8 s
Wall time: 33.7 s
```

```python
full_data.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028136 entries, 0 to 1028135
Data columns (total 3 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   date/time  1028136 non-null  datetime64[ns]
 1   lat        1028136 non-null  float64
 2   lon        1028136 non-null  float64
dtypes: datetime64[ns](1), float64(2)
memory usage: 23.5 MB
```

OK, so with ~10 times as much data (1,028,136 rows vs 100,000) we use:

* ~10 times as much memory (23.5 MB vs 2.3 MB)
* ~10 times as much time (30.1 s vs 2.99 s)

The first time this app loads in streamlit will be a bit slow either way, but the singleton decorator is designed to prevent having to re-compute objects like this. (Also note that this is a single month of data; a year might include ~12,337,632 entries, extrapolating from this September 2014 data.)

## Enter the Duck

Let's see if we get any improvement using pyarrow and duckdb.

```python
import duckdb
import pyarrow as pa
from pyarrow import csv
import pyarrow.dataset as ds


def load_data_duckdb():
    data = csv.read_csv(
        "uber-raw-data-sep14.csv.gz",
        convert_options=csv.ConvertOptions(
            include_columns=["Date/Time", "Lat", "Lon"],
            timestamp_parsers=["%m/%d/%Y %H:%M:%S"],
        ),
    ).rename_columns(["date/time", "lat", "lon"])

    # `dataset` is for partitioning larger datasets. Can't include timestamp parsing directly though
    # data = ds.dataset("uber-raw-data-sep14.csv.gz", schema=pa.schema([
    #     ("Date/Time", pa.timestamp('s')),
    #     ('Lat', pa.float32()),
    #     ('Lon', pa.float32())
    # ]), format='csv')

    # DuckDB can query Arrow tables, so we'll just return the table and a connection for flexible querying
    return data, duckdb.connect(":memory:")
```

```python
%%timeit
arrow_data, con = load_data_duckdb()
```

```
121 ms ± 4.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

```python
arrow_data[:5]
```

```
pyarrow.Table
date/time: timestamp[s]
lat: double
lon: double
----
date/time: [[2014-09-01 00:01:00,2014-09-01 00:01:00,2014-09-01 00:03:00,2014-09-01 00:06:00,2014-09-01 00:11:00]]
lat: [[40.2201,40.75,40.7559,40.745,40.8145]]
lon: [[-74.0021,-74.0027,-73.9864,-73.9889,-73.9444]]
```

Holy smokes! Well, that was fast and fun! pyarrow read the whole dataset in 120 ms. That's 0.120 s compared to 30.1 s with pandas!
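As a quick sanity check that the connection can see the table: DuckDB's Python client resolves `arrow_data` by its Python variable name (a replacement scan), so no registration or copy is needed before querying it. A small illustrative sketch; the count/min/max query here is not from the original app:

```python
arrow_data, con = load_data_duckdb()

# DuckDB finds `arrow_data` in the surrounding Python scope,
# so this SQL runs directly over the in-memory Arrow table
row_count, first_pickup, last_pickup = con.query(
    'SELECT count(*), min("date/time"), max("date/time") FROM arrow_data'
).fetchone()

print(row_count)  # 1028136 -- same row count as the full pandas load
```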
So how much memory are pyarrow and duckdb using?

```python
def format_bytes(size):
    """from https://stackoverflow.com/a/49361727/15685218"""
    # 2**10 = 1024
    power = 2**10
    n = 0
    power_labels = {0: '', 1: 'kilo', 2: 'mega', 3: 'giga', 4: 'tera'}
    while size > power:
        size /= power
        n += 1
    return size, power_labels[n] + 'bytes'


format_bytes(arrow_data.nbytes)
```

```
(23.53216552734375, 'megabytes')
```

OK, the pyarrow table is roughly the same size as the full pandas DataFrame.

```python
con.execute('PRAGMA database_size;')
"""
database_size VARCHAR, -- total block count times the block size
block_size BIGINT,     -- database block size
total_blocks BIGINT,   -- total blocks in the database
used_blocks BIGINT,    -- used blocks in the database
free_blocks BIGINT,    -- free blocks in the database
wal_size VARCHAR,      -- write ahead log size
memory_usage VARCHAR,  -- memory used by the database buffer manager
memory_limit VARCHAR   -- maximum memory allowed for the database
"""
database_size, block_size, total_blocks, used_blocks, free_blocks, wal_size, memory_usage, memory_limit = con.fetchall()[0]
memory_usage
```

```
'0 bytes'
```

We haven't told duckdb to load anything into its own tables, so it still has no memory usage. Nevertheless, duckdb can query `arrow_data` since it's a pyarrow table (duckdb can also load directly from csv).

So where does that leave us on loading the full ~1,000,000-row dataset?

* pandas: ~30 s of time and 23.5 MB
* pyarrow: ~0.1 s of time (120 ms) and 23.9 MB

In fairness, I tried pandas with the pyarrow engine. At the time of writing I can't find a fast datetime parse, and usecols throws an error with pyarrow (see the end of this post). Reading the full CSV without datetime parsing is in line in terms of speed, though. (Also see "why the best CSV is not a CSV at all" for more on this path.)

```python
%%time
arrow_df = pd.read_csv(
    "uber-raw-data-sep14.csv.gz",
    engine='pyarrow',
    names=[
        "date/time",
        "lat",
        "lon",
        "CONST"
    ],  # specify names directly since they don't change
    skiprows=1,  # don't read header since names specified directly
    # usecols=[1, 2],  # doesn't load last column, constant value "B02512"
    parse_dates=["date/time"],  # set as datetime instead of converting after the fact
    # infer_datetime_format=True  # Unsupported for pyarrow
    date_parser=lambda x: pd.to_datetime(x)
)
```

```
CPU times: user 33.6 s, sys: 248 ms, total: 33.9 s
Wall time: 33.7 s
```

```python
arrow_df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1028136 entries, 0 to 1028135
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype
---  ------     --------------    -----
 0   date/time  1028136 non-null  datetime64[ns]
 1   lat        1028136 non-null  float64
 2   lon        1028136 non-null  float64
 3   CONST      1028136 non-null  object
dtypes: datetime64[ns](1), float64(2), object(1)
memory usage: 31.4+ MB
```

```python
%%timeit
arrow_df_no_datetime = pd.read_csv(
    "uber-raw-data-sep14.csv.gz",
    engine='pyarrow',
    names=[
        "date/time",
        "lat",
        "lon",
        "CONST"
    ],  # specify names directly since they don't change
    skiprows=1,  # don't read header since names specified directly
    # usecols=[1, 2],  # doesn't load last column, constant value "B02512"
)
```

```
137 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
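As an aside, since duckdb can also load directly from csv (as noted above), that path looks roughly like the sketch below. Unlike querying the Arrow table, it copies rows into DuckDB's own storage, so the `database_size` pragma would report real memory usage. This is a hedged sketch assuming DuckDB's `read_csv_auto` table function handles the gzipped file; the `pickups` table name is just for illustration:

```python
import duckdb

con = duckdb.connect(":memory:")

# Let DuckDB sniff column types itself and materialize the rows
# into an in-database table this time
con.execute(
    "CREATE TABLE pickups AS "
    "SELECT * FROM read_csv_auto('uber-raw-data-sep14.csv.gz', HEADER=TRUE)"
)

print(con.execute("SELECT count(*) FROM pickups").fetchone())  # (1028136,)

# now that DuckDB owns a copy of the data, memory_usage should no longer be '0 bytes'
print(con.execute("PRAGMA database_size;").fetchall())
```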
## Filtration

We have 3 main analysis functions to compare between pandas and duckdb for this app, laid out below:

```python
# @st.experimental_memo
def filterdata(df, hour_selected):
    return df[df["date/time"].dt.hour == hour_selected]


# CALCULATE MIDPOINT FOR GIVEN SET OF DATA
# @st.experimental_memo
def mpoint(lat, lon):
    return (np.average(lat), np.average(lon))


# FILTER DATA BY HOUR
# @st.experimental_memo
def histdata(df, hr):
    # note: as in the original app, this indexes the global `data` rather than `df`,
    # which triggers the reindexing UserWarning shown below
    filtered = data[
        (df["date/time"].dt.hour >= hr) & (df["date/time"].dt.hour < (hr + 1))
    ]
    hist = np.histogram(filtered["date/time"].dt.minute, bins=60, range=(0, 60))[0]
    return pd.DataFrame({"minute": range(60), "pickups": hist})
```

```python
%%timeit
# For fairness, we'll use the full dataframe
filterdata(full_data, 14)
```

```
17.8 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```python
%%timeit
mpoint(full_data["lat"], full_data["lon"])
```

```
404 µs ± 564 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

```python
%%timeit
histdata(full_data, 14)
```

```
/var/folders/cp/ktx4zddx7q3bqctqfjykn5700000gn/T/ipykernel_82767/2026438809.py:16: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  filtered = data[
39.5 ms ± 275 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

How about DuckDB (with conversion back to pandas for fairness)?

```python
def duck_filterdata(con, hour_selected):
    return con.query(
        f'SELECT "date/time", lat, lon FROM arrow_data WHERE hour("date/time") = {hour_selected}'
    ).to_df()


def duck_mpoint(con):
    return con.query("SELECT AVG(lat), AVG(lon) FROM arrow_data").fetchone()


def duck_histdata(con, hr):
    hist_query = f'SELECT histogram(minute("date/time")) FROM arrow_data WHERE hour("date/time") >= {hr} and hour("date/time") < {hr + 1}'
    results, *_ = con.query(hist_query).fetchone()
    return pd.DataFrame(results)
```

```python
%%timeit
duck_filterdata(con, 14)
```

```
6.25 ms ± 20.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

```python
%%timeit
duck_mpoint(con)
```

```
1.64 ms ± 8.87 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

```python
%%timeit
duck_histdata(con, 14)
```

```
2.64 ms ± 26.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

We got a modest improvement in filterdata and a more than 10x speedup in histdata, but actually lost out to numpy for finding the average of two arrays in mpoint!

* filterdata:
  + pandas: 19.1 ms ± 284 µs
  + duckdb: 6.53 ms ± 126 µs
* mpoint:
  + numpy: 403 µs ± 5.35 µs
  + duckdb: 1.7 ms ± 82.6 µs
* histdata:
  + pandas + numpy: 40.8 ms ± 430 µs
  + duckdb: 2.93 ms ± 28.4 µs

```python
19.1 / 6.53
```

```
2.9249617151607965
```

```python
403 / 1700
```

```
0.23705882352941177
```

```python
40.8 / 2.93
```

```
13.924914675767916
```

## Conclusions

It's no secret that Python is not a fast language, but there are tricks to speed it up. Common advice is to utilize C optimizations via numpy and pandas. A newer contender is the C++-driven duckdb used as an in-process OLAP database manager. It takes some rewriting of Python code into SQL (or using the Relational API or another library such as the Ibis Project), but it can play nicely with pandas and pyarrow.

Speaking of Arrow, it seems to be efficient and growing in popularity and adoption. streamlit utilizes it to serialize objects into protobufs between browser and server. pandas has further integrations on their roadmap. polars uses it to power its Rust-written DataFrame library.

This post explores an example streamlit app that utilizes some pandas and numpy functions such as read_csv, average, and DataFrame slicing. Using pyarrow to load data gives a speedup over the default pandas engine. Using duckdb to generate new views of data also speeds up difficult computations.
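For anyone who would rather not build SQL strings, the Relational API mentioned above chains methods on a relation instead. A rough sketch, assuming `duckdb.arrow()` and the relation methods (`filter`, `project`, `aggregate`, `df`) behave as in the duckdb version used here; it mirrors duck_mpoint and duck_filterdata rather than reproducing them exactly:

```python
import duckdb

# wrap the Arrow table in a DuckDB relation and compose operations on it
rel = duckdb.arrow(arrow_data)

# roughly equivalent to duck_mpoint: average pickup location over the whole table
midpoint = rel.aggregate("avg(lat), avg(lon)").df()

# roughly equivalent to duck_filterdata: pickups in a chosen hour, back as a DataFrame
hour_14 = rel.filter('hour("date/time") = 14').project('"date/time", lat, lon').df()
```

The same engine runs the query either way, so the choice between SQL strings and relational methods is mostly about readability.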
## Appendix: pandas with engine='pyarrow'

For reference, here are the `engine='pyarrow'` attempts mentioned above: the first mangles the datetime parsing (note the 1970-01-01 timestamps), and the second shows the error raised by usecols.

```python
pd.read_csv(
    "uber-raw-data-sep14.csv.gz",
    # nrows=100000,  # approx. 10% of data
    engine='pyarrow',
    names=[
        "date/time",
        "lat",
        "lon",
        # "CONST"
    ],  # specify names directly since they don't change
    skiprows=1,  # don't read header since names specified directly
    # usecols=[1, 2],  # doesn't load last column, constant value "B02512"
    parse_dates=["date/time"],  # set as datetime instead of converting after the fact
    # # infer_datetime_format=True  # Unsupported for pyarrow
    date_parser=lambda x: pd.to_datetime(x)
)
```

```
                          0                      date/time       lat     lon
0          9/1/2014 0:01:00  1970-01-01 00:00:00.000000040  -74.0021  B02512
1          9/1/2014 0:01:00  1970-01-01 00:00:00.000000040  -74.0027  B02512
2          9/1/2014 0:03:00  1970-01-01 00:00:00.000000040  -73.9864  B02512
3          9/1/2014 0:06:00  1970-01-01 00:00:00.000000040  -73.9889  B02512
4          9/1/2014 0:11:00  1970-01-01 00:00:00.000000040  -73.9444  B02512
...                     ...                            ...       ...     ...
1028131  9/30/2014 22:57:00  1970-01-01 00:00:00.000000040  -73.9845  B02764
1028132  9/30/2014 22:57:00  1970-01-01 00:00:00.000000040  -74.1773  B02764
1028133  9/30/2014 22:58:00  1970-01-01 00:00:00.000000040  -73.9319  B02764
1028134  9/30/2014 22:58:00  1970-01-01 00:00:00.000000040  -74.0066  B02764
1028135  9/30/2014 22:58:00  1970-01-01 00:00:00.000000040  -73.9496  B02764

1028136 rows × 4 columns
```

```python
pd.read_csv(
    "uber-raw-data-sep14.csv.gz",
    # nrows=100000,  # approx. 10% of data
    engine='pyarrow',
    # names=[
    #     "date/time",
    #     "lat",
    #     "lon",
    #     "CONST"
    # ],  # specify names directly since they don't change
    # skiprows=1,  # don't read header since names specified directly
    usecols=[0,1],  # doesn't load last column, constant value "B02512"
    # parse_dates=[
    #     "date/time"
    # ],  # set as datetime instead of converting after the fact
    # # infer_datetime_format=True  # Unsupported for pyarrow
    # date_parser=lambda x: pd.to_datetime(x)
).info()
```
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
/Users/gar/projects/demo-uber-nyc-pickups-main/blog.ipynb Cell 38' in <cell line: 1>()
----> 1 pd.read_csv(
      2     "uber-raw-data-sep14.csv.gz",
      3     # nrows=100000, # approx. 10% of data
      4     engine='pyarrow',
      5     # names=[
      6     #     "date/time",
      7     #     "lat",
      8     #     "lon",
      9     #     "CONST"
     10     # ], # specify names directly since they don't change
     11     # skiprows=1, # don't read header since names specified directly
     12     usecols=[0,1], # doesn't load last column, constant value "B02512"
     13     # parse_dates=[
     14     #     "date/time"
     15     # ], # set as datetime instead of converting after the fact
     16     # # infer_datetime_format=True # Unsupported for pyarrow
     17     # date_parser=lambda x: pd.to_datetime(x)
     18 ).info()

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/util/_decorators.py:311, in deprecate_nonkeyword_arguments.<locals>.decorate.<locals>.wrapper(*args, **kwargs)
    305 if len(args) > num_allow_args:
    306     warnings.warn(
    307         msg.format(arguments=arguments),
    308         FutureWarning,
    309         stacklevel=stacklevel,
    310     )
--> 311 return func(*args, **kwargs)

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/io/parsers/readers.py:680, in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
    665 kwds_defaults = _refine_defaults_read(
    666     dialect,
    667     delimiter,
   (...)
    676     defaults={"delimiter": ","},
    677 )
    678 kwds.update(kwds_defaults)
--> 680 return _read(filepath_or_buffer, kwds)

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/io/parsers/readers.py:581, in _read(filepath_or_buffer, kwds)
    578     return parser
    580 with parser:
--> 581     return parser.read(nrows)

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/io/parsers/readers.py:1243, in TextFileReader.read(self, nrows)
   1241 if self.engine == "pyarrow":
   1242     try:
-> 1243         df = self._engine.read()
   1244     except Exception:
   1245         self.close()

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/pandas/io/parsers/arrow_parser_wrapper.py:153, in ArrowParserWrapper.read(self)
    146 pyarrow_csv = import_optional_dependency("pyarrow.csv")
    147 self._get_pyarrow_options()
    149 table = pyarrow_csv.read_csv(
    150     self.src,
    151     read_options=pyarrow_csv.ReadOptions(**self.read_options),
    152     parse_options=pyarrow_csv.ParseOptions(**self.parse_options),
--> 153     convert_options=pyarrow_csv.ConvertOptions(**self.convert_options),
    154 )
    156 frame = table.to_pandas()
    157 return self._finalize_output(frame)

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/pyarrow/_csv.pyx:580, in pyarrow._csv.ConvertOptions.__init__()

File ~/miniconda3/envs/py39/lib/python3.9/site-packages/pyarrow/_csv.pyx:734, in pyarrow._csv.ConvertOptions.include_columns.__set__()

File stringsource:15, in string.from_py.__pyx_convert_string_from_py_std__in_string()

TypeError: expected bytes, int found
```