[HN Gopher] Q - Run SQL Directly on CSV or TSV Files
___________________________________________________________________
Q - Run SQL Directly on CSV or TSV Files
Author : gjvc
Score : 132 points
Date : 2022-09-21 15:31 UTC (7 hours ago)
(HTM) web link (harelba.github.io)
(TXT) w3m dump (harelba.github.io)
| Merad wrote:
| If you like the idea of this but aren't on Linux, or want
| something with a little UI involved, check out the Rainbow CSV
| extension for VS Code, vim, Sublime, and Atom. It includes RBQL
| [0], which lets you query CSVs with a SQL-like syntax.
|
| (Not affiliated with Rainbow CSV or RBQL at all, just a happy
| user)
|
| [0]: https://rbql.org/
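|
| A minimal RBQL query looks roughly like this (a hedged sketch
| using the Python backend; the column positions and threshold
| are made up):
|
|     SELECT a1, a2 WHERE int(a3) > 100 ORDER BY a1
|
| where a1, a2, a3 refer to the first, second and third columns
| of the CSV.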
| cm2187 wrote:
| Another solution would be to leverage the existing SQL engines.
|
| Someone mentioned sqlite virtual table.
|
| My home-made solution to the very same problem is a simple
| WinForms app that you can drag and drop spreadsheets and csv
| files onto. It analyses them, creates an instance of localdb if
| there isn't one running, creates the table and uploads the data
| (so it reads the csv file twice). Then I can use my loved and
| trusted SQL Server Management Studio with the MS SQL engine.
| The same UI allows me to quickly delete tables and databases
| and create a new database in two clicks. (Future development:
| auto-normalise the table to reduce disk space and improve
| performance.)
|
| What's lacking is good import tools. Most csv import tools are
| super picky in terms of the format of the data (dates in
| particular) and have too many steps.
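|
| The core of that import step is roughly this in T-SQL (a hedged
| sketch; the table and file names are made up, and FORMAT='CSV'
| needs SQL Server 2017+):
|
|     CREATE TABLE dbo.people (name NVARCHAR(100), age INT);
|     BULK INSERT dbo.people FROM 'C:\data\people.csv'
|         WITH (FORMAT = 'CSV', FIRSTROW = 2);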
| eatonphil wrote:
| Q is built on top of sqlite. :)
| redavni wrote:
| For Windows users who click here, this functionality is built in.
|
| https://learn.microsoft.com/en-us/cpp/data/odbc/data-source-...
| joshe wrote:
| Temp DBs feel like a slightly underused technique too. (Not
| saying it's always better.)
|
| I.e., just a quick script that adds a serial id as the first
| column, then imports to postgres/mysql based on the header
| (column names) and file name (which becomes the table name)
| into a brand new db.
|
| Usually DBs are so long-lived and carefully designed that
| there's a bit of a mental block against just importing trash
| data and dropping the whole database later. I'm always 15 mins
| into awk before I remember.
|
| Also in postgres you can do it as a new schema in an existing
| database, and join with the existing data. Probably safest to not
| do that in production :-).
|
| Like so: https://stackoverflow.com/questions/5712387/can-we-join-
| two-...
|
| Then just drop the whole schema when you are done screwing
| around.
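|
| In psql that's roughly (a hedged sketch; the schema, table and
| column names are made up):
|
|     CREATE SCHEMA scratch;
|     CREATE TABLE scratch.people (id SERIAL, name TEXT, age INT);
|     \copy scratch.people (name, age) FROM 'people.csv' CSV HEADER
|     -- join against existing data, then clean up:
|     DROP SCHEMA scratch CASCADE;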
| SoftTalker wrote:
| Previous discussion:
|
| https://news.ycombinator.com/item?id=18453133 (284 points|devy|4
| years ago|96 comments)
|
| https://news.ycombinator.com/item?id=27423276 (121
| points|thunderbong|1 year ago|63 comments)
|
| https://news.ycombinator.com/item?id=24694892 (11
| points|pcr910303|2 years ago|2 comments)
| gizmodo59 wrote:
| I have been using Dremio to query large volumes of CSV files:
| https://docs.dremio.com/software/data-sources/files-and-dire...
|
| Although having them in some columnar format is much better for
| fast responses.
|
| GitHub: https://github.com/dremio/dremio-oss
| eatonphil wrote:
| They're sort of different categories.
|
| Tools like Q are for command line use.
|
| Dremio is a server/web application, right?
|
| They accomplish the same thing, but you might deploy a tool
| like Q on production servers for ad-hoc log analysis or install
| it in a docker container. (Not saying you should, just
| explaining the difference.)
| luzifer42 wrote:
| clickhouse-local is a powerful alternative.
|
| https://clickhouse.com/docs/en/operations/utilities/clickhou...
|
| https://news.ycombinator.com/item?id=31561780
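|
| A quick taste (a hedged sketch; the file and column names are
| made up, and recent versions can infer the schema from the
| CSV header):
|
|     clickhouse-local --query \
|       "SELECT name, count(*) FROM file('data.csv', 'CSVWithNames') GROUP BY name"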
| eatonphil wrote:
| It's fast. But it's also a 1-2 GB binary. And its SQL
| implementation is a work in progress that often makes up its
| own names for common functions.
|
| If you can put up with both for an ad-hoc CLI exploration tool
| then yeah, it's incredible.
|
| For analytics queries in general though (not talking about
| clickhouse-local) I don't think there's any OSS competition.
| tanin wrote:
| I've made a similar app, but I aim for it to be more
| convenient, especially for people who don't want to deal with
| installation and the command line.
|
| Check it out: https://superintendent.app -- it is a paid app
| though.
| daniel-s wrote:
| Is this similar to CSV virtual tables in sqlite?
|
| https://www.sqlite.org/csv.html
| eatonphil wrote:
| It's easier to use than that. With virtual tables you must run
| CREATE VIRTUAL TABLE for every schema you have, which is very
| tedious if you are querying CSV files with differing columns.
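|
| I.e. something like this per file (a hedged sketch, assuming
| the csv extension is compiled and loadable):
|
|     .load ./csv
|     CREATE VIRTUAL TABLE temp.t1
|         USING csv(filename='thefile.csv', header=true);
|     SELECT * FROM t1;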
| cma wrote:
| Might mix people up with:
| https://en.m.wikipedia.org/wiki/Q_(programming_language_from...
| WASDx wrote:
| When will people stop giving things one-character names? It
| makes them impossible to search for.
| LtWorf wrote:
| Maybe relevant.
|
| A while ago I wrote relational
| (https://ltworf.github.io/relational/) to do relational algebra
| queries... It can load csv files, and has a GUI and a CLI.
| blacksqr wrote:
| "Sqawk is an Awk-like program that uses SQL and can combine data
| from multiple files. It is powered by SQLite."
|
| https://github.com/dbohdan/sqawk
| bachmeier wrote:
| These tools, useful as they are, appear to be only for querying
| files. AFAICT, insert and update are typically not supported as a
| way to modify the data.
| MilStdJunkie wrote:
| Reminds me of the textQL extension that's available in Asciidoc.
|
| Point it to an external CSV file, enable TextQL, and bam, there's
| your query returned as a table. Handy for parts lists, inventory,
| that kind of crap.
|
| https://github.com/dinedal/textql
|
| https://gist.github.com/mojavelinux/8856117
| redsaz wrote:
| A satisfied user here. Found it very useful when tools like cut
| and sort weren't enough, usually when I need to do a join on two
| different tables (err, files). Left joins work, but I don't think
| right joins are supported.
|
| I've used this in combination with jq as well. I'll use jq to
| convert json to CSV, and then use SQL to do whatever else.
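|
| Something like this (a hedged sketch; the field names are made
| up, and without -H q names the columns c1..cN):
|
|     jq -r '.[] | [.name, .score] | @csv' data.json \
|       | q -d , "SELECT c1, max(c2) FROM - GROUP BY c1"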
| bobivl wrote:
| Cut, sort, join and awk can be pretty powerful and fast. If it
| becomes too tedious to write them manually, you can also use
| BigBash [1] to automatically convert a SQL query into a one-
| liner that only uses these tools to execute the query.
|
| [1] http://bigbash.it
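|
| To illustrate the idea (my own hand-rolled equivalent of a
| simple query, not actual BigBash output):
|
|     # SELECT name, count(*) FROM people GROUP BY name
|     cut -d, -f1 people.csv | sort | uniq -c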
| aasasd wrote:
| Any experienced programmer learns not to use string processing
| on structured data, because that _will_ bite them in the ass.
|
| Meanwhile HN luddites: let me use awk, cut and whatnot despite
| the existence of a util that explicitly sidesteps this issue.
| jfoutz wrote:
| Have you looked at the Unix command 'join'? This is a cool
| tool, but I think join is pretty much everywhere.
| redsaz wrote:
| Hmm, I'll have to check it out!
| WASDx wrote:
| You also probably have GNU join installed:
| https://www.gnu.org/software/coreutils/manual/html_node/join...
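|
| For example (a hedged sketch; join expects both inputs sorted
| on the join field):
|
|     join -t, -1 1 -2 1 <(sort -t, -k1,1 a.csv) <(sort -t, -k1,1 b.csv)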
| smartmic wrote:
| Depending on how complex the task is, I too find myself jumping
| from sort/join/awk/sed to the SQL train more and more often.
| But once I have taken that step, I want the whole power of
| SQL(ite), and that would really blow up the command line. In
| such cases I usually write a Tcl script: the SQLite3
| integration [0] is quasi-native, and besides the full SQLite3
| functionality I also have flexible extensions (e.g. directly
| usable Tcl procedures of any complexity) at my disposal. Tools
| like Q represent a middle ground: although they build on
| SQLite, they remain behind in functionality [1]. But as long as
| I want to keep it simple on the command line while adhering to
| the UNIX philosophy, coreutils, sed, awk and possibly perl
| remain my best friends.
|
| [0]: https://www.sqlite.org/tclsqlite.html
|
| [1]: http://harelba.github.io/q/#limitations
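|
| The skeleton of such a script, run with tclsh (a hedged sketch;
| the file layout and column names are made up, and the naive
| comma split ignores quoted fields):
|
|     package require sqlite3
|     sqlite3 db :memory:
|     db eval {CREATE TABLE t(name TEXT, score INTEGER)}
|     set fh [open data.csv]
|     while {[gets $fh line] >= 0} {
|         lassign [split $line ,] name score
|         db eval {INSERT INTO t VALUES($name, $score)}
|     }
|     close $fh
|     db eval {SELECT name, max(score) AS top FROM t GROUP BY name} {
|         puts "$name $top"
|     }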
| SoftTalker wrote:
| Same. You can go a long way with cut, sort, etc. and also awk
| with its pattern matching. But if you're handy with SQL, that
| can often feel more natural and certainly things like joins
| among separate CSV files, as well as sums and other aggregates,
| are easier.
|
| If you have "unclean" CSV data, e.g. where the data contains
| delimiters and/or newlines in quoted fields, you might want to
| pipe it through csvquote.
|
| https://github.com/dbro/csvquote
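|
| Usage is symmetric (a hedged sketch): csvquote temporarily
| replaces the troublesome characters inside quoted fields, and
| -u restores them afterwards:
|
|     csvquote data.csv | awk -F, '{print $2}' | csvquote -u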
| dublin wrote:
| In addition to the usual cut/paste/sort/awk stuff, we've had
| really powerful "stream-operator" databases based on flat text
| data files for decades. They used to be somewhat slow. Not
| anymore, especially when running from RAM disks.
|
| One good one is Strozzi NoSQL (his use of the term NoSQL
| predates the current use of the term by many years...):
| http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/NoSQL/Home%20P...
|
| Starbase is another, with interesting extensions for
| astronomical work.
|
| Linux Journal article on the concept here:
| https://www.linuxjournal.com/article/3294
|
| The article that started it all:
| http://www.linux.it/~carlos/nosql/4gl.ps
|
| And there's even a book on the subject, centered on the /rdb
| implementation by the late RSW software. But I warn you,
| reading this WILL permanently change the way you think about
| databases: https://www.amazon.com/Relational-Database-
| Management-Prenti...
| dlkmp wrote:
| When I want to quickly query a csv file (usually log files), I
| like to use lnav, which also supports running SQL queries and
| handles pretty much any log file format I happen to deal with.
|
| https://lnav.org/
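|
| E.g. non-interactively (a hedged sketch; access_log is the
| table lnav exposes for Apache-style access logs, and the
| column name is an assumption):
|
|     lnav -n -c ';SELECT c_ip, count(*) FROM access_log GROUP BY c_ip' access.log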
| gorkish wrote:
| Don't know if it's the first such implementation, but Perl's
| DBD::CSV is 25 years old this year.
| jmt_ wrote:
| You can do the same thing with csvsql from csvkit:
| https://towardsdatascience.com/analyze-csvs-with-sql-in-comm...
|
| Since csvkit comes with so many other tools, I'm not sure I see
| a reason to use q over csvsql.
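|
| E.g. (a hedged sketch; csvsql derives the table name from the
| file name, here "data", and the columns are made up):
|
|     csvsql --query "SELECT name, count(*) FROM data GROUP BY name" data.csv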
| gourabmi wrote:
| Big fan of csvsql here. I use it often to look at log files.
| eatonphil wrote:
| Q is a great project! Here's a comparison of Q against some of
| the other tools out there [0] (including dsq, which I wrote). And
| there's a benchmark in there too [1].
|
| Whichever tool you end up using, I'm sure it will help out with
| your CLI data exploration!
|
| [0] https://github.com/multiprocessio/dsq#comparisons
|
| [1] https://github.com/multiprocessio/dsq#benchmark
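|
| For the curious, dsq usage looks like this ({} stands for the
| input file; the column names here are made up):
|
|     dsq data.csv "SELECT name, count(*) FROM {} GROUP BY name"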
| margarina72 wrote:
| dsq is a great tool, using it regularly to work against csv
| files, and really nice to use.
| eatonphil wrote:
| Awesome to hear!
| josebrwn wrote:
| http://harelba.github.io/q/#requirements
|
| "q is packaged as a compiled standalone-executable that has no
| dependencies, not even python itself."
|
| This is not quite true on macOS:
|
| "q: A full installation of Xcode.app 12.4 is required to compile
| this software. Installing just the Command Line Tools is not
| sufficient.
|
| Xcode can be installed from the App Store. Error: q: An
| unsatisfied requirement failed this build."
| pessimizer wrote:
| I don't understand this objection. Being able to compile
| something with no dependencies is different from being able to
| run something with no dependencies.
| DandyDev wrote:
| I think you're confusing installing the compiled package and
| using it (which doesn't require anything else) with _compiling_
| the package, which requires Xcode on macOS.
| hexo wrote:
| Yet another Q, among:
|
| https://en.wikipedia.org/wiki/Q_(programming_language_from_K...
|
| https://en.wikipedia.org/wiki/Pure_(programming_language)
___________________________________________________________________
(page generated 2022-09-21 23:00 UTC)