[HN Gopher] One-liner for running queries against CSV files with...
___________________________________________________________________
One-liner for running queries against CSV files with SQLite
Author : jdblair
Score : 618 points
Date : 2022-06-21 13:59 UTC (9 hours ago)
(HTM) web link (til.simonwillison.net)
(TXT) w3m dump (til.simonwillison.net)
| throwaway892238 wrote:
| sqlite3 :memory: -cmd '.mode csv' ...
|
| It should be a war crime for programs in 2022 to use non-
| UNIX/non-GNU style command line options. Add it to the Rome
| Statute's Article 7 list of crimes against humanity. Full blown
| tribunal at The Hague presided over by the international criminal
| court. Punishable by having to use Visual Basic 3.0 for all
| programming for the rest of their life.
| DarmokJalad1701 wrote:
| Let me introduce you to bazel ...
|
| bazel build //:--foobar --//::\\\
| ggktk wrote:
| I'm not sure what that does, but it reminds me of GNU
| Parallel's interface, which although smart, is not very
| intuitive and memorable in my experience.
| _joel wrote:
| Would :memory: even parse in some shells?
| ketralnis wrote:
| Why wouldn't it? I'm not aware of colons being a special
| character in any shell I can think of
| naniwaduni wrote:
| Colons are a _little_ bit special in Bournish shells, since
| they're a delimiter in variable assignments after which
| tilde expansion occurs, and some shells extend this behavior
| to command line arguments.
|
| Frankly, I don't find it outside the realm of possibility
| that there's some combination of options that will make
| :memory: misparse on a popular shell, I just don't know of
| any...
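| For illustration (a hedged sketch of the POSIX assignment
| rule; the paths are hypothetical):
|
|     FOO=~/bin:~/opt/bin   # in an assignment, the tilde after
|                           # the colon also expands
|     echo :memory:         # plain argument: printed literally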
| chasil wrote:
| My VMS users say the same thing. Wow, do I not want that to
| happen. I am not a fan of DCL.
| [deleted]
| svnpenn wrote:
| Go has been doing the same thing for a decade now:
|
| https://godocs.io/flag
|
| Personally I'm fine with it. The whole, "let's combine 5 letter
| options into one string", always smacked of excess code golf to
| me.
| zrail wrote:
| The initial release of sqlite was in 2000. Yes, well after GNU-
| style command line options existed but not by much.
| hinkley wrote:
| 22 years is a long time to deprecate bad command line
| arguments.
| alana314 wrote:
| I like using q for querying CSVs on the command line:
| https://github.com/harelba/q
| AdrenalinMd wrote:
| Same. And I believe q uses sqlite under the hood, so you can
| use the same SQL syntax as the one supported by sqlite.
| Joining multiple csv files is also possible without too much
| setup.
| http://harelba.github.io/q/
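| For example (a sketch against the article's taxi.csv; -H
| tells q the input has a header row and -d sets the
| delimiter):
|
|     q -H -d , "SELECT passenger_count, COUNT(*) FROM ./taxi.csv
|                GROUP BY passenger_count"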
| valw wrote:
| Btw, am I alone in thinking that DataFrame abstractions in OOP
| languages (like Pandas in Python) are oftentimes _simply
| inferior_ to relational algebra? I'm not sure that many Data
| Scientists are aware of the expressive power of SQL.
| lenwood wrote:
| Agree. I've completed data pipelines for several projects and
| have found that the cleanest, and often fastest solution is to
| use SQL to structure the data as needed. This is anecdotal and
| I'm not an expert with SQL, but I haven't come across a
| situation where R or Pandas dataframes worked better than a
| well written query for data manipulation. This has the benefit
| of simplifying collaboration across teams because within my
| company not everyone uses the same toolset for analysis, but we
| all have access to the same database. Other tools are better
| suited to analysis or expansion of the data with input from
| other sources, but within our own data SQL wins.
| deepsun wrote:
| Often -- yes. Always -- no.
|
| For example let's try changing/fixing sampling rate of a
| dataset (.resample() in Pandas).
|
| Or something like .cumsum() -- doable with SQL window
| functions, but man they are cumbersome.
|
| Or quickly store the result in .parquet.
|
| But all the above doesn't matter, because I feel like 99% of
| Pandas work involves quickly drawing charts on the data to
| look at it or show it to teammates.
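| For reference, a sketch of the window-function equivalent of
| .cumsum() (the payments table and its ts/amount columns are
| hypothetical):
|
|     SELECT ts, amount,
|            SUM(amount) OVER (ORDER BY ts) AS running_total
|     FROM payments;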
| wenc wrote:
| SQL does not exactly implement relational algebra in its pure
| form.
|
| SQL implements a kind of set theory with relational elements
| and a bunch of practical features like pivots, window functions
| etc.
|
| Pandas does the same. Most data frame libraries like dplyr etc.
| implement a common set of useful constructs. There's not much
| difference in expressiveness. LINQ is another language for
| manipulating sets that was designed with the help of category
| theory, and it arrives at the same constructs.
|
| However SQL is declarative, which provides a path for query
| optimizers to parse and create optimized plans. Whereas with
| chained methods, unless one implements lazy evaluation, one
| misses out on look-aheads and opportunities to do rewrites.
| valw wrote:
| > There's not much difference in expressiveness
|
| > However SQL is declarative
|
| Pick one :) The way I see it, if declarativeness is not a
| factor in assessing expressiveness, then expressiveness
| reduces to the uninteresting notion of Turing-equivalence.
| wenc wrote:
| Expressiveness and declarativeness are different things,
| no?
|
| Are you talking about aesthetics? I've used SQL for 20
| years and it's elegant in parts but it also has warts. I
| talk about this elsewhere but SQL gets repetitive and
| requires multi-layer CTEs to express certain simple
| aggregations.
| devin-petersohn wrote:
| There are loads of things that are not possible or are very
| cumbersome to write in SQL, but that pandas and many other
| dataframe systems allow. Examples are dropping null values
| based on some threshold, one-hot encoding, covariance, and
| certain data cleaning operations. These are possible in SQL but
| very cumbersome to write. There are also things that are
| outright impossible in a relational database related to
| metadata manipulation.
|
| SQL is super expressive, but I think pandas gets a bad rap. At
| its core the data model and language can be more expressive
| than relational databases (see [1]).
|
| I co-authored a paper that explained these differences with a
| theoretical foundation[1].
|
| [1] https://arxiv.org/abs/2001.00888
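| As a concrete illustration of the cumbersomeness, a hedged
| sketch of one-hot encoding in plain SQL (table and column
| names are hypothetical; every category must be spelled out
| by hand):
|
|     SELECT id,
|            CASE WHEN color = 'red'   THEN 1 ELSE 0 END AS color_red,
|            CASE WHEN color = 'green' THEN 1 ELSE 0 END AS color_green,
|            CASE WHEN color = 'blue'  THEN 1 ELSE 0 END AS color_blue
|     FROM items;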
| valw wrote:
| Thanks for sharing this. I believe we essentially agree:
| chaining method calls is inexpressive compared to composing
| expressions in an algebraic language.
| Myrmornis wrote:
| I'm not defending Pandas but just want to point out that
| the inability to conveniently compose expressions is one of
| the biggest problems with SQL, since it was designed to be
| written as a sort of pseudo-English natural language, in an
| era when people imagined that it would be used by non-
| programmers. To be clear, that's a problem with SQL, not
| with the idea of a language based on relational algebra.
| There are various attempts to create SQL-alternatives which
| behave like real programming languages in terms of e.g.
| composability. This blog post makes the point better than I
| can:
|
| https://opensource.googleblog.com/2021/04/logica-
| organizing-...
| valw wrote:
| I absolutely agree - one of the biggest shortcomings of
| SQL is that its primary programming interface is based on
| text and intended for humans, instead of being based on
| data structures and intended for programs.
| practice9 wrote:
| In the past I've used https://github.com/BurntSushi/xsv to query
| some large CSVs
| ArchD wrote:
| If you need to query against multiple CSVs, e.g. using joins, you
| could use QHS: https://github.com/itchyny/qhs
| gpvos wrote:
| csvsql from csvkit[0] can do that too.
|
| [0] https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html
| tgtweak wrote:
| This is far more useful for SQL users than chaining several
| sed/awk/sort commands in a pipe (although a bit against *nix
| principles).
| chrisweekly wrote:
| For a *nixy approach, try lnav (https://lnav.org)
| mrfusion wrote:
| How smart is SQLite at detecting column types from CSV data?
|
| I once wrote a Python script to load csv files into SQLite. It
| had a whole hierarchy of rules to determine the data type of each
| column.
| mattewong wrote:
| It doesn't detect column types automatically; they are
| imported as text. You can, however, use math functions on
| them and sqlite3 will dynamically convert where possible
| (e.g. "select number_1_stored_as_text + 1 from mytable" will
| output 2).
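| A quick way to see this coercion (a sketch; output shown in
| the default list mode):
|
|     sqlite> SELECT '1' + 1, '2.5' * 2;
|     2|5.0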
| rgovostes wrote:
| SQLite's virtual table API (https://www.sqlite.org/vtab.html)
| makes it possible to access other data structures through the
| query engine. You don't need to know much, if anything, about how
| the database engine executes queries; you only need to implement
| the callbacks it needs to do its job. A few years ago I wrote an
| extension to let me search through serialized Protobufs which
| were stored as blobs in a regular database.
|
| https://github.com/rgov/sqlite_protobuf
| pstuart wrote:
| And in fact there is a CSV virtual table available for SQLite,
| but it's not built into the normal client:
| https://www.sqlite.org/csv.html
|
| It really should be, as the code is tiny and this functionality
| is not overly exotic.
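| Usage looks roughly like this (a sketch based on the
| extension's docs; you build the extension from csv.c first):
|
|     sqlite> .load ./csv
|     sqlite> CREATE VIRTUAL TABLE t USING
|        ...>   csv(filename='taxi.csv', header=true);
|     sqlite> SELECT COUNT(*) FROM t;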
| cldellow wrote:
| In my experience [1], the CSV virtual table was really slow.
| Doing some analysis on a 1,291 MB file, a query took 24.7
| seconds using the virtual table vs 3.4 seconds if you
| imported the file first.
|
| The CSV virtual table source code is a good pedagogical tool
| for teaching how to build a virtual table, though.
|
| [1]: https://cldellow.com/2018/06/22/sqlite-parquet-
| vtable.html
| mattewong wrote:
| https://github.com/liquidaty/zsv/blob/main/app/external/sql
| i... modifies the sqlite3 virtual table engine to use the
| faster zsv parser. I have not quantified the difference, but
| in all tests I have run, `zsv sql` runs faster (sometimes
| much faster) than other sqlite3-on-CSV solutions mentioned
| in this entire discussion (unless you include those that
| cache their indexes and then measure against a post-cached
| query). Disclaimer: I'm the main zsv author.
| westurner wrote:
| What are the differences between the zsv and csv parsers?
|
| Is csvw with linked data URIs also doable?
| electroly wrote:
| I like the virtual table API a lot but it has some serious
| drawbacks. You don't need to know much and indeed, you _can't_
| know much about the execution engine, even if that knowledge
| would help you. Many parts of the query are not pushed down
| into the virtual table.
|
| For instance, if the user query is:
|
| SELECT COUNT(*) FROM my_vtab;
|
| ... the query your virtual table will effectively see is:
|
| SELECT * FROM my_vtab;
|
| SQLite does the counting. That's great, unless you already know
| the count and could have reported it directly rather than
| actually returning every row in the table. You're forced to
| retrieve and return every row because you have no idea that it
| was actually just a count.
|
| As another example, if the user query includes a join, you
| won't see the join. Instead, you will receive a series of N
| queries for individual IDs, even if you could have more
| efficiently retrieved them in a batch.
|
| The join one is particularly nasty. If you're writing a virtual
| table that accesses a remote resource with some latency, any
| join will absolutely ruin your performance as you pay a full
| network roundtrip for each of those N queries.
|
| I wrote a module that exposes remote SQL
| Server/PostgreSQL/MySQL servers as SQLite virtual tables, and
| joins basically don't work at all if your server is not on your
| local network. There's nothing I can do about it (other than
| heuristically guessing what IDs might be coming and request
| them ahead of time) because SQLite doesn't provide enough
| information to the virtual table layer. It's my understanding
| that PostgreSQL's foreign data wrappers (a similar feature to
| SQLite's virtual tables) push much more information about the
| query down to the wrapper layer, but I haven't used it myself.
| lazypenguin wrote:
| You also can't add new columns using alter statements. I
| really like virtual tables in SQLite, but it would be nice if
| the documentation included their limitations and reasons not
| to use them.
| westurner wrote:
| ## /? sqlite arrow
|
| - "Comparing SQLite, DuckDB and Arrow with UN trade data"
| (2021) https://news.ycombinator.com/item?id=29010103 ; partial
| benchmarks of query time and RAM requirements [relative to data
| size] would be
|
| - "Introducing Apache Arrow Flight SQL: Accelerating Database
| Access" (2022)
| https://arrow.apache.org/blog/2022/02/16/introducing-arrow-f...
| :
|
| > _Motivation: While standards like JDBC and ODBC have served
| users well for decades, they fall short for databases and
| clients which wish to use Apache Arrow or columnar data in
| general. Row-based APIs like JDBC or PEP 249 require
| transposing data in this case, and for a database which is
| itself columnar, this means that data has to be transposed
| twice--once to present it in rows for the API, and once to get
| it back into columns for the consumer. Meanwhile, while APIs
| like ODBC do provide bulk access to result buffers, this data
| must still be copied into Arrow arrays for use with the broader
| Arrow ecosystem, as implemented by projects like Turbodbc.
| Flight SQL aims to get rid of these intermediate steps._
|
| ## "The Virtual Table Mechanism Of SQLite"
| https://sqlite.org/vtab.html :
|
| > _- One cannot create a trigger on a virtual table._
|
| Just posted about eBPF a few days ago; opcodes have costs that
| are or are not costed:
| https://news.ycombinator.com/item?id=31688180
|
| > _- One cannot create additional indices on a virtual table.
| (Virtual tables can have indices but that must be built into
| the virtual table implementation. Indices cannot be added
| separately using CREATE INDEX statements.)_
|
| It looks like e.g. sqlite-parquet-vtable implements shadow
| tables to memoize row group filters. How does JOIN performance
| vary amongst sqlite virtual table implementations?
|
| > _- One cannot run ALTER TABLE ... ADD COLUMN commands against
| a virtual table._
|
| Are there URIs in the schema? Mustn't there thus be a meta-
| schema that does e.g. nested structs with portable types [with
| URIs], (and jsonschema, [and W3C SHACL])? #nbmeta
| #linkedresearch
|
| ## /? sqlite arrow virtual table
|
| - sqlite-parquet-vtable reads parquet with arrow for SQLite
| virtual tables https://github.com/cldellow/sqlite-parquet-
| vtable :
|
|     $ sqlite/sqlite3
|     sqlite> .eqp on
|     sqlite> .load build/linux/libparquet
|     sqlite> CREATE VIRTUAL TABLE demo USING
|        ...>   parquet('parquet-generator/99-rows-1.parquet');
|     sqlite> SELECT * FROM demo;
|     sqlite> SELECT * FROM demo WHERE foo = 123;
|     sqlite> SELECT * FROM demo WHERE foo = '123';
|             // incurs a severe query plan performance
|             // regression without immediate feedback
|
| ## Sqlite query optimization
|
| `EXPLAIN QUERY PLAN` https://www.sqlite.org/eqp.html :
|
| > _The EXPLAIN QUERY PLAN SQL command is used to obtain a high-
| level description of the strategy or plan that SQLite uses to
| implement a specific SQL query. Most significantly, EXPLAIN
| QUERY PLAN reports on the way in which the query uses database
| indices. This document is a guide to understanding and
| interpreting the EXPLAIN QUERY PLAN output._ [...] _Table and
| Index Scans_ [...] _Temporary Sorting B-Trees_ (when there's
| no `INDEX` for those columns) ... `.eqp on`
|
| The SQLite "Query Planner" docs
| https://www.sqlite.org/queryplanner.html list Big-O
| computational complexity bound estimates for queries with and
| without preexisting indices.
|
| ## database / csv benchmarks
|
| - https://h2oai.github.io/db-benchmark/
| kitd wrote:
| I had to do something very similar for analysing CVE information
| recently, but I don't remember having to use the _:memory:_
| option. I suspect it defaults to that if no .db file is
| specified.
|
| Slightly tangentially, when doing aggregated queries, SQLite has
| a very useful _group_concat(..., ',')_ function that will
| concatenate the expression in the first arg for each row in the
| group, separated by the separator in the 2nd arg.
|
| In many situations SQLite is a suitable alternative to jq for
| simple tabular JSON.
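| For example (a hedged sketch against the article's taxi
| table):
|
|     SELECT passenger_count, group_concat(total_amount, ',')
|     FROM taxi GROUP BY passenger_count;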
| simonw wrote:
| I just tried it without :memory: and it dropped me into the
| SQLite shell without executing the query:
|
|     % sqlite3 -cmd '.mode csv' -cmd '.import taxi.csv taxi' \
|         'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi
|         GROUP BY passenger_count'
|     SQLite version 3.36.0 2021-06-18 18:58:49
|     Enter ".help" for usage hints.
|     sqlite>
| ultrasounder wrote:
| This comes at the most opportune time, as I am cranking through
| selectstarsql, sqlbolt, schemaverse, and pgexercises to bone up
| on my SQL skills for my upcoming data engineering interviews.
| SQL IS the secret super power that devs don't know they possess.
| gigatexal wrote:
| I've been doing this. But I hate it. CSVs need to die. They're
| terrible data formats. But here we are. And SQLite makes things
| amazing.
| krylon wrote:
| CSV sucks, yes, but for moving/exchanging/aggregating data
| between various independent sources, it's the least terrible
| option everyone can process easily.
| gpvos wrote:
| I'm willing to bet CSV will still be around in 200 years. It's
| ugly, but exceedingly effective.
| wizofaus wrote:
| I wouldn't even assume the concept of text files as we know
| them today will still exist in any meaningful way in 100
| years. It's just as likely all digital data will be stored in
| something like neural networks with no obvious textual
| representation. But yes, CSV has had remarkable persistence
| (the most recent major feature addition I made to our product
| had to read from a supposedly modern web API response that
| was in CSV format, despite all the other endpoints returning
| JSON).
| a-dub wrote:
| i used to daydream about adding an ALLOW DOING IT LIVE option to
| cassandra's cql client. in the event that your where clause was
| incompatible with your table's key, it would just wholesale dump
| the table in question into a sqlite db while indexing the
| appropriate columns, run the query in question, actually return
| the result and then properly clean up.
| adamgordonbell wrote:
| I've become a fan of using SQLite-utils to work with CSV or JSON
| files.
|
| It's a two-step process though: one to create and insert into a
| DB, and a second to select from it and return results.
|
| https://sqlite-utils.datasette.io/en/stable/index.html
| simonw wrote:
| I added a feature last year that lets you do this as a one-step
| process - "sqlite-utils memory":
| https://simonwillison.net/2021/Jun/19/sqlite-utils-memory/
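| Example usage (a sketch per the docs; the table name is
| derived from the file name):
|
|     sqlite-utils memory taxi.csv \
|       'SELECT passenger_count, COUNT(*) FROM taxi GROUP BY passenger_count'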
| eatonphil wrote:
| Those steps still exist though surely, just managed by the
| one command. If you're querying via SQLite you have to ingest
| first.
| simonw wrote:
| Yup - 'sqlite-utils memory' works by creating an in-memory
| database, importing various different file formats into it,
| running the query and then throwing everything away again.
| It's generally not a good fit for anything over a few dozen
| MBs of data.
| blacksqr wrote:
| Sqawk: an Awk-like program that uses SQL. Can parse, format,
| filter, and combine data from multiple files. Powered by SQLite.
|
| https://wiki.tcl-lang.org/page/Sqawk
| adren67 wrote:
| spapas82 wrote:
| Can somebody post the equivalent in Windows cmd?
| gpvos wrote:
| Just replace the ' with " and lose the \ . For some SQL queries
| you may have to use more exotic quoting (generally, in cmd you
| can always quote the next character with ^ ).
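| E.g. (a sketch of the article's one-liner translated for cmd;
| untested):
|
|     sqlite3 :memory: -cmd ".mode csv" -cmd ".import taxi.csv taxi" ^
|       "SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi GROUP BY passenger_count"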
| dotancohen wrote:
| Lately I've been using Visidata for any text file that looks like
| a table or other squarish data source, including JSON.
|
| https://www.visidata.org/
| dymk wrote:
| This looks like my new best friend
| dotancohen wrote:
| Just wait until you discover the file browser ))
| wenc wrote:
| Visidata is wonderful.
|
| Also for querying large CSV and Parquet files, I use DuckDB. It
| has a vectorized engine and is super fast. It can also query
| SQLite files directly. The SQL support is outstanding.
|
| https://duckdb.org/
|
| Just start the DuckDB REPL and start querying, e.g.
|
|     SELECT * FROM 'bob.CSV' a JOIN 'Mary.parquet' b
|     ON a.Id = b.Id
|
| Zips through multi-GB files in a few seconds.
| jwilk wrote:
| Beware that visidata phones home by default:
|
| https://github.com/saulpw/visidata/discussions/940
| keybored wrote:
| MODE is one of:
|     ascii    Columns/rows delimited by 0x1F and 0x1E
|
| Yes!
| chrisweekly wrote:
| One of my all-time favorite (and somehow still-obscure /
| relatively unknown) tools is called `lnav` ^1. It's a mini-ETL
| powertool with embedded SQLite, perfect for wrangling log files
| or other semi-structured data (a few million rows are no
| problem), it's intuitive and flexible...
|
| 1. https://lnav.org
| minusf wrote:
| lnav is super cool, but as its name says: log navigator, it's
| more of a less/tail/etc supercharged with sqlite under the
| hood.
|
| of course because it has a flexible format definition it can
| deal with csv files as well, but its true power is getting sql
| queries out of nginx log files and the like without the
| intermediate step of exporting them to csv.
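| e.g. inside lnav you can press ; and run SQL straight against
| the parsed log (a sketch; the access_log table and c_ip
| column come from lnav's bundled format definitions):
|
|     ;SELECT c_ip, COUNT(*) FROM access_log GROUP BY c_ip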
| flusteredBias wrote:
| I am a data scientist. I have used a lot of tools/libraries to
| interact with data. SQLite is my favorite. It is hard to beat the
| syntax/grammar.
|
| Also, when I use SQLite I do not output using column mode. I pipe
| to `tv` (tidy-viewer) to get a pretty output.
|
| https://github.com/alexhallam/tv
|
| transparency: I am the dev of this utility
| ramraj07 wrote:
| Just want to add that snowflake (imo) is better. You don't have
| to suffer SQLite's lack of data types and honestly snowflake is
| the best tool to work with messy data.
|
| Just fyi you can set up a snowflake account with a minimum
| monthly fee of 25 bucks. It'll be very hard to actually use up
| 25 bucks unless your data is in the 100s of GBs, and you
| literally use only as much compute as is needed, so it's
| perfect.
| humanistbot wrote:
| This is in no way a relevant good-faith reply. It is spam. A
| web-based cloud-based data analytics platform isn't in the
| same category as piping command-line programs together.
| ehvatum wrote:
| We can't pay for anything that doesn't have Jira integration,
| sorry!
| samstave wrote:
| flusteredBias wrote:
| https://github.com/alexhallam/tv#inspiration
| flusteredBias wrote:
| Here is an example of how I would pipe with headers to `tv`.
|
| sqlite3 :memory: -csv -header -cmd '.import taxi.csv taxi'
| 'SELECT passenger_count, COUNT(*), AVG(total_amount) FROM taxi
| GROUP BY passenger_count' | tv
| queuebert wrote:
| What a nice tool. I love how Rust has reinvigorated command
| line utilities.
| corytheboyd wrote:
| piping to jq (using json mode of course) also works well for
| this
| pdimitar wrote:
| Or `jless`, which shows you a tree and doesn't flood your
| terminal.
| corytheboyd wrote:
| Nice, that's definitely better than `jq | less`
| mytherin wrote:
| Using DuckDB [1]:
|
|     duckdb -c "SELECT passenger_count, COUNT(*), AVG(total_amount)
|                FROM taxi.csv GROUP BY ALL"
|
| DuckDB will automatically infer you are reading a CSV file from
| the extension, then automatically infer column names from the
| header, together with various CSV properties (data types,
| delimiter, quote type, etc). You don't even need to quote the
| table name as long as the file is in your current directory and
| the file name contains no special characters.
|
| DuckDB uses the SQLite shell, so all of the commands that are
| mentioned in the article with SQLite will also work for DuckDB.
|
| [1] https://github.com/duckdb/duckdb
|
| Disclaimer: Developer of DuckDB
| snidane wrote:
| A bit clunky, but works:
|
|     ps | awk '$1=$1' OFS=, | duckdb :memory: \
|       "select PID,TTY,TIME from read_csv_auto('/dev/stdin')"
| beaugunderson wrote:
| can't figure out how to make this work in bash; it just
| prints out "select PID,TTY,TIME from
| read_csv_auto('/dev/stdin')" but split into columns... using
| cli v0.4.0 da9ee490d which seems like the latest
| simonw wrote:
| How does the column data type inference work? I've run into
| that challenge myself in the past.
| mytherin wrote:
| The CSV auto-detector was implemented by Till Dohmen, who did
| his master's thesis on the subject [1] and has actually written
| a paper about it [2].
|
| Essentially we have a list of candidate types for each column
| (starting with all types). We then sample a number of tuples
| from various parts of the file, and progressively reduce the
| number of candidate types as we detect conflicts. We then
| take the most restrictive type from the remaining set of
| types, with `STRING` as a last resort in case we cannot
| convert to any other type. After we have figured out the
| types, we start the actual parsing.
|
| Note that it is possible we can end up with incorrect types
| in certain edge cases, e.g. if you have a column that has
| only numbers besides one row that is a string. If that row is
| not present in the sampling an error will be thrown and the
| user will need to override the type inference manually. This
| is generally rather rare, however.
|
| You could also use DuckDB to do your type-inference for you!
|
|     duckdb -c "DESCRIBE SELECT * FROM taxi.csv"
|
| And if you want to change the sample size:
|
|     duckdb -c "DESCRIBE SELECT * FROM read_csv_auto('taxi.csv',
|                sample_size=9999999999999)"
|
| [1] https://homepages.cwi.nl/~boncz/msc/2016-Doehmen.pdf
|
| [2] https://ir.cwi.nl/pub/26416/SSDM1111.pdf
| simonw wrote:
| This is fantastic, thanks.
|
| My solution is a lot less smart - I loop through every
| record and keep track of which potential types I've seen
| for each column: https://sqlite-
| utils.datasette.io/en/latest/python-api.html#...
|
| Implementation here: https://github.com/simonw/sqlite-
| utils/blob/3fbe8a784cc2f3fa...
| mytherin wrote:
| That works and is similar to what DuckDB does for small
| CSV files. We have the sampling step primarily for larger
| CSV files, when you might not want to do two passes over
| the file. This way we can keep the benefits of streamed
| file reading while offering type inference that "just
| works" most of the time without a major performance
| penalty.
| avogar wrote:
| Using ClickHouse, you can also process local files in one line
| using the clickhouse-local command tool. And it will look a lot
| easier:
|
| clickhouse local -q "SELECT passenger_count, COUNT(*),
| AVG(total_amount) FROM file(taxi.csv, 'CSVWithNames') GROUP BY
| passenger_count"
|
| And ClickHouse supports a lot of different file formats both for
| import and export (you can see all of them here
| https://clickhouse.com/docs/en/interfaces/formats/).
|
| There is an example of using clickhouse-local with taxi dataset
| mentioned in the post:
| https://colab.research.google.com/drive/1tiOUCjTnwUIFRxovpRX...
| eatonphil wrote:
| Clickhouse-local is incredible. It performs the best of any
| similar tool I've benchmarked. But the reason I took it out of
| the linked benchmarks in OP's post is that it's 2+GB. That's a
| massive binary. It's the whole server. I'm not sure you want to
| be distributing this all over the place in general. It's just
| not in the same category IMO. Disclaimer: I build another tool
| that does similar things.
| qoega wrote:
| It is that big only with debug symbols. If you take the packaged
| version of ClickHouse, it will be much smaller. Or just strip the
| large binary manually if you already have it.
| zX41ZdbW wrote:
| It is shipped with debug info and symbol tables just in case.
|
| Without debug info, it will be 350 MB, and compressed can fit
| in 50 MB:
| https://github.com/ClickHouse/ClickHouse/issues/29378
|
| It is definitely a worthwhile improvement.
| eatonphil wrote:
| Aha, thanks for the clarification and link. I'll be
| following that issue.
| Uptrenda wrote:
| I'm looking through this guy's website for 'today I learned' and
| at first I'm impressed by how many of them there are. But then I
| start thinking: when you're trying to solve a problem you search
| for a lot of data. None of his posts are attributed. He's getting
| all his information from somewhere and then he goes and posts
| these articles just ripping off other sources.
|
| I can understand when it's based on your original work but this
| website reads more like basic questions posted on Stackoverflow.
| E.g. "how to connect to a website with IPv6." Tell me he didn't
| just Google that and post the result. 0/10
| lemoncurd wrote:
| actionfromafar wrote:
| Hm, I guess most half-assed techy blogs are 0/10 then.
|
| Honestly I don't see a problem with the few posts I looked at.
| It's like recipes. You can't copyright recipes. At least it's
| not AI-generated blogspam, and at least a modicum of curating
| went into it.
| droopyEyelids wrote:
| That's actually allowed, if you run your own personal website.
| simonw wrote:
| You should look harder! I attribute in plenty of these pieces,
| where appropriate.
|
| Here's a query showing the 23 posts that link to StackOverflow,
| for example:
| https://til.simonwillison.net/tils?sql=select+*+from+til+whe...
|
| And 41 where I credit someone on Twitter:
| https://til.simonwillison.net/tils?sql=select+*+from+til+whe...
|
| More commonly I'll include a link from the TIL back to a GitHub
| Issue thread where I figured something out - those issue
| threads often link back to other sources.
|
| For that IPv6 one:
| https://til.simonwillison.net/networking/http-ipv6
|
| I had tried and failed to figure this out using Google searches
| in the past. I wrote that up after someone told me the answer
| in a private Slack conversation - saying who told me didn't
| feel appropriate there.
|
| My goal with that page was to ensure that future people
| (including myself) who tried to find this with Google would get
| a better result!
|
| (I'm a bit upset about this comment to be honest, because
| attributing people is something of a core value for me - the
| bookmarks on my blog have a "via" mechanism for exactly that
| reason: https://simonwillison.net/search/?type=blogmark )
| cldellow wrote:
| I tried to reproduce the OP's complaint. Of the 5 most recent
| TILs, only 1 did not reference some other source as
| inspiration. One literally gave "thanks" to tips obtained
| elsewhere.
|
| I'm offended on your behalf! :)
| madacol wrote:
| I've been following you for some months now, and you always
| put links to your sources.
|
| So please, feel 100% free to ignore that person
| mattewong wrote:
| Over all these CLIs, I prefer zsv (
| https://github.com/liquidaty/zsv )-- then again, I wrote it, so
| my preference should come as no surprise. On my Mac it was 340%
| faster than the OP command, and it does a lot more than just
| SQL/sqlite3-related tasks.
| neycmrtn wrote:
| There is also our somewhat older sqlet.py at http://www.sqlet.com
| with multiple input files, column index shortcuts.. (and probably
| overdue for an update).
| cube2222 wrote:
| Since many people are sharing one-liners with various tools...
|
| OctoSQL[0]:
|
|     octosql 'SELECT passenger_count, COUNT(*), AVG(total_amount)
|              FROM taxi.csv GROUP BY passenger_count'
|
| It also infers everything automatically and typechecks your query
| for errors. You can use it with csv, json, parquet but also
| Postgres, MySQL, etc. All in a single query!
|
| [0]:https://github.com/cube2222/octosql
|
| Disclaimer: author of OctoSQL
| elmomle wrote:
| I love the simplicity. Is there support for joins / use of
| multiple tables?
| cube2222 wrote:
| Yes, certainly! And those multiple tables can come from
| different data sources, so files or databases.
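| E.g. (a hedged sketch with hypothetical file names):
|
|     octosql 'SELECT u.name, COUNT(*) FROM users.csv u
|              JOIN orders.json o ON u.id = o.user_id
|              GROUP BY u.name'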
| aargh_aargh wrote:
| Hi cube2222!
|
| Just today I tried octosql for the first time when I wanted to
| correlate a handful of CSV files based on identifiers present
| in roughly equal form. Great idea but I immediately ran into
| many rough edges in what I think was a simple use case. Here
| are my random observations.
|
| Missing FULL JOIN (this was a dealbreaker for me). LEFT/RIGHT
| join gave me "panic: implement me".
|
| It took me a while to figure out how to quote CSV column names
| with non-ASCII characters and spaces. It's not documented as
| far as I've seen (please document quoting rules). This worked:
|
|     octosql 'SELECT `tablename.Motley Crue` FROM tablename.csv'
|
| replace() is documented [1] as replace(old, new, text) but
| actually is replace(text, old, new) just like in postgres and
| mysql.
|
| index() is documented [1] as index(substring, text)
| (postgresql equivalent: position(substring text IN string
| text) -> integer):
|
|     octosql "SELECT index('y', 'Motley Crue')"
|     Error: couldn't parse query: invalid argument
|     syntax error at position 13 near 'index'
|
|     octosql "SELECT index('Motley Crue', 'y')"
|     Error: couldn't parse query: invalid argument
|     syntax error at position 13 near 'index'
|
| Hope this helps and I wish you all the best.
|
| [1] https://github.com/cube2222/octosql/wiki/Function-
| Documentat...
| cube2222 wrote:
| Hey!
|
| Thanks a lot for this writeup!
|
| > but I immediately ran into many rough edges
|
| OctoSQL is definitely not in a stable state yet, so depending
| on the use case, there definitely are rough edges and
| occasional regressions.
|
| > Missing FULL JOIN (this was a dealbreaker for me).
| LEFT/RIGHT join gave me "panic: implement me".
|
| Indeed, right now only inner join is implemented. The others
| should be available soon.
|
| > replace() is documented [1] as replace(old, new, text) but
| actually is replace(text, old, new) just like in postgres and
| mysql. index() is documented [1] as index(substring, text)
|
| I've removed the offending docs pages, they were documenting
| a very old version of OctoSQL. The way to browse the
| available functions right now is built into OctoSQL:
|
|     octosql "SELECT * FROM docs.functions"
|
| > invalid argument syntax error at position 13 near 'index'
|
| Looks like a parser issue which I can indeed replicate, will
| look into it.
|
| Thanks again, cheers!
| cube2222 wrote:
| Just to update this response, I've just released a new
| version that contains a new `position` function, as well as
| the capability to execute SELECT statements without a FROM
| part. So, e.g., the above becomes:
|
|     octosql "SELECT position('hello', 'ello')"
| eatonphil wrote:
| Heads up: cube2222 is the original author of this benchmark. :)
| I copied it and Simon copied my copy of it.
| cube2222 wrote:
| Thanks for the acknowledgement!
|
| Though btw., I think I personally prefer the SPyQL
| benchmarks[0], as they test a bigger variety of scenarios.
| This benchmark is mostly testing CSV decoding speed - because
| the group by is very simple, with just a few keys in the
| grouping.
|
| [0]:https://colab.research.google.com/github/dcmoura/spyql/bl
| ob/...
| eatonphil wrote:
| Actually I have a big issue with that benchmark in that it
| doesn't ORDER BY. I don't believe all those tools will
| produce the same result and it's not required by SQL for
| them to do so.
|
| That doesn't change the poor performance of dsq but it does
| change the relative and absolute scores in that benchmark.
| eli wrote:
| the .import command used for actually loading the CSV is kinda
| picky about your CSVs being well-formatted. I don't think it
| supports embedded newlines at all.
| simonw wrote:
| I just tested it against a CSV file with newlines that were
| wrapped in double quotes and it worked correctly. I used this
| CSV file:
| https://til.simonwillison.net/tils/til.csv?_stream=on&_size=...
|
| And this query:
|
|     sqlite3 :memory: -cmd '.mode csv' -cmd '.import til.csv til' \
|       -cmd '.mode json' 'select * from til limit 1' | jq
| ttyprintk wrote:
| You can also prefix the source of a SQLite .import with |, so it
| is run as a command which hopefully produces text. That way, some
| import scripts can live entirely in one sql file.
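| For instance (a hedged sketch; recent sqlite3 shells treat a
| leading | in the .import argument as a command to run):
|
|     sqlite3 :memory: -cmd '.mode csv' \
|       -cmd '.import "|gunzip -c taxi.csv.gz" taxi' \
|       'SELECT COUNT(*) FROM taxi'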
| wilsonfiifi wrote:
| Another great tool written in Go is CSVQ [0][1], which can be
| used on the command line or as a library:
|
|     csvq 'select id, name from `user.csv`'
|
| [0] https://github.com/mithrandie/csvq
|
| [1] https://mithrandie.github.io/csvq/
___________________________________________________________________
(page generated 2022-06-21 23:01 UTC)