[HN Gopher] Show HN: Work with CSV files using SQL. For data sci...
___________________________________________________________________
Show HN: Work with CSV files using SQL. For data scientists and
engineers
Author : tanin
Score : 86 points
Date : 2021-07-18 08:40 UTC (14 hours ago)
(HTM) web link (superintendent.app)
(TXT) w3m dump (superintendent.app)
| aphrax wrote:
| This area interests me, but being completely honest, there are a
| few off-putting grammatical/spelling errors. Might be worth
| getting them cleaned up.
| tanin wrote:
| Thank you for your feedback.
|
| I'm a non-native English speaker, so I would like to ask if you
| can point out a few of the places you feel are off-putting.
|
| I would really really love to fix those.
| robbiejs wrote:
| Looks really nice. In what language is the app itself written? If
| it's a web app, I'm curious to know what table component you are
| using.
|
| Also, for anyone looking to edit simple CSV files in an Excel-
| like manner, please check https://editcsvonline.com, a free tool
| that I made, powered by DataGridXL (which I also made).
| gaha wrote:
| The way I love doing these kinds of things is by literally using
| three lines of R code:
|
|       library(sqldf)
|       tab1 = read.csv("file1.csv")
|       sqldf("select * from tab1")
|
| sqldf [1] is a library with which you can access R dataframes
| just like tables in SQL (it actually uses SQLite in the
| background). I do not have much experience with programming in R
| itself, but you barely need it. This approach has the advantage
| that it is very flexible, and you get the power of the (IMHO)
| best plotting library, ggplot, with it. Of course, you can also
| do more complicated stuff like joining multiple dataframes and
| basically everything else you can do with SQLite, and then store
| the results back in R dataframes. This workflow works well if you
| use an IDE which lets you execute single lines or blocks of code,
| like RStudio. Then you also get a nice GUI with it, but there are
| also plugins for Vi/Emacs that work very well.
|
| [1] https://cran.r-project.org/web/packages/sqldf/index.html
|
| EDIT: code formatting
| RobinL wrote:
| Check out the R bindings for DuckDB[0]. You should find that it
| does the same thing (i.e. run SQL against a dataframe/file on
| disk) much faster for many SQL operations.
|
| [0] https://duckdb.org/docs/api/r
| nojito wrote:
| If you're going to use R you might as well learn a bit of dplyr
| or data.table.
|
| You get that autocomplete goodness, and the vast majority of SQL
| can be reduced down to single lines.
| draegtun wrote:
| Perl DBD::CSV module is a classic in this genre (first version is
| from 1998) - https://metacpan.org/pod/DBD::CSV
| sgt wrote:
| It seems interesting, but what is the advantage over just \copy
| in Postgres from a CSV file - or, better yet, importing into
| SQLite without even providing the table definition?
| simlan wrote:
| Which is likely how this works under the hood.
| tanin wrote:
| The difference is between a GUI and the command line.
|
| I work with a lot of CSV files, and for some reason I rarely
| load them into Postgres or use a Python script. I still try very
| hard to use Excel.
|
| There is some degree of convenience that a GUI offers over the
| command line, though some people still prefer the command line.
| qorrect wrote:
| I use the command line all day; it definitely has its place.
| But trying to manipulate large CSV files has never felt like
| a good fit for the CLI.
|
| I'm already using and enjoying this tool a lot. I used to do
| this in DBeaver (also a great tool), but I had to manually
| specify the types in a JDBC string to get all the goodies of
| SQL; this so far has been much nicer.
|
| I have already made this a daily tool. Thanks!
| SKILNER wrote:
| The advantage would be that you do not have your data in two
| places and can avoid all the problems and confusion that can
| result.
| jonnycomputer wrote:
| How does this avoid having data in two places? You still have
| to have a CSV file to import the data...
| banana_giraffe wrote:
| > A python script loading a 1GB file is likely to take more than
| 1 minutes on any machine.
|
| Since when? I just tested it, and I can process a 1GB CSV file on
| my machine using the csv module in Python in ~10 seconds. My
| machine isn't a world-record-setting machine, either.
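|
| For illustration, a minimal sketch of that kind of pass (the file
| name is a placeholder); the csv module's parser is implemented in
| C, so streaming row by row stays fast:
|
|       import csv
|
|       # Stream the file instead of loading it into memory at once.
|       with open("big.csv", newline="") as f:
|           reader = csv.reader(f)
|           header = next(reader)
|           rows = sum(1 for _ in reader)  # or do per-row work here
|       print(header, rows)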
| tanin wrote:
| This is the OP. That is amazing.
|
| I tried reading a 1GB CSV and adding all the rows into SQLite,
| and it definitely takes minutes. This happens in both JS and
| Python.
|
| Constructing a giant INSERT alone should take more than 1 minute
| (as would transforming any 1GB-sized string).
|
| Would you mind sharing your Python snippet?
| nimrody wrote:
| Not sure how you'd do this from Python, but you can fire up
| sqlite3 from the command line and run:
|
|       .mode csv
|       .import filename.csv table_name
|
| On my 2-year-old Windows laptop (i7, Lenovo X1), importing a
| 1.4GB file with 2M rows takes about 70 seconds.
|
| So not 10 seconds, but there's no need to create an INSERT
| statement. (I'm sure you can access this functionality from
| Python if you need to.)
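|
| As a rough sketch (names are placeholders), the same dot-commands
| can be fed to the sqlite3 binary from Python:
|
|       import subprocess
|
|       # Let the sqlite3 CLI do its fast C-side CSV import.
|       commands = ".mode csv\n.import filename.csv table_name\n"
|       subprocess.run(["sqlite3", "data.db"], input=commands,
|                      text=True, check=True)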
| gigatexal wrote:
| I do this already with sqlite3
| beckingz wrote:
| Cool tool!
|
| pandasql is great for doing SQL on pandas dataframes.
| https://pypi.org/project/pandasql/
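|
| A minimal usage sketch (file and column names are made up):
|
|       import pandas as pd
|       from pandasql import sqldf
|
|       df = pd.read_csv("file.csv")
|       # pandasql looks up dataframes by name in the given namespace.
|       result = sqldf("SELECT col, COUNT(*) FROM df GROUP BY col",
|                      locals())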
| RobinL wrote:
| DuckDB is a newer but much faster alternative that lets you run
| SQL on CSV files (and also directories of Parquet files, etc.)
|
| It's so fast/good that I've actually been using it for a lot of
| data cleaning and transformation that previously I'd have done
| in Pandas.
|
| https://github.com/duckdb/duckdb
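|
| A minimal sketch of the Python side (the file name is a
| placeholder); read_csv_auto infers the schema for you:
|
|       import duckdb
|
|       # Query the CSV file directly; no table definition needed.
|       df = duckdb.query(
|           "SELECT * FROM read_csv_auto('file.csv') LIMIT 10"
|       ).to_df()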
| toyg wrote:
| I'm old enough to know I should take a pass on anything "free
| while in beta"...
| _1tan wrote:
| Is this similar to https://github.com/harelba/q ?
| refactor_master wrote:
| If I open a CSV with pandas in debug mode, I'm also pretty much
| free to play with whatever transformations I want, without being
| restricted by SQL syntax. Saved as Parquet, reloads are also
| pretty much instantaneous.
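|
| Roughly like this (file names are placeholders; pyarrow or
| fastparquet is needed for the Parquet step):
|
|       import pandas as pd
|
|       df = pd.read_csv("big.csv")          # slow, one-time parse
|       df.to_parquet("big.parquet")         # columnar cache on disk
|       df = pd.read_parquet("big.parquet")  # near-instant reloads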
| zo1 wrote:
| Not everyone is sold on and proficient in those tools.
| Honestly, whenever I think I need pandas, I try to get into it
| and spend an hour in the atrocious docs before giving up and
| realizing I could have done that "simple" grouping, filtering,
| or graphing with other tools and been done already by now, even
| though it would have "taken longer".
|
| The same goes for all these NoSQL DBs. E.g., I realize afterwards
| that I spent 2 hours researching how to do an effing join between
| two different documents. That's time I could have spent deciding
| on a relational entity model and putting down the ORM boilerplate
| for it. /minirant
| hodgesrm wrote:
| You can query CSV directly without a DBMS backend using
| clickhouse-local [1]. It's often used to clean data exactly as
| this article describes. You can also load CSV very easily into
| tables.
|
| Superintendent.app seems to bring automatic schema definition to
| the table. We've discussed adding this to ClickHouse. It would be
| a great PR if anyone is interested.
|
| [1] https://altinity.com/blog/2019/6/11/clickhouse-local-the-
| pow...
|
| (This note brought to you by the ClickHouse evangelization task
| force.)
| btown wrote:
| Interested, but writing and compiling a large new C++ codebase
| when I haven't written C++ in years isn't something I quite
| have time for at the moment :)
|
| For anyone else with more time to roll up their sleeves,
| https://github.com/ClickHouse/ClickHouse/blob/b0ddc4fb30f1a0...
| would be the place to split the header row, and perhaps use
| some heuristics on the first data row to identify datatypes!
| mjirv wrote:
| I like that this lets you join CSVs, which a lot of tools in this
| space don't.
| hermitcrab wrote:
| To do this in Easy Data Transform:
|
| - drag 2 CSV files onto the Center pane
|
| - click 'Join'
|
| - select the column in each CSV to join in the right pane
|
| - done
|
| Depending on the size of the CSV files, you can do this in a few
| seconds (plus ~1 minute to install).
| eatonphil wrote:
| Very neat! I'm also working on an (open-source/core) app for
| running SQL against any data (copy-pasted stuff, data files,
| logs, HTTP responses, other SQL requests) and being able to
| script and graph as well.
|
| A difference might be that the goal I have for DataStation is to
| aid more in data exploration (though not limited to smaller
| datasets, as this site points out Google Sheets and Excel are).
| Some optimizations like streaming all data (and streaming through
| scripts) are currently out of scope, and that will definitely
| limit the scale DataStation can handle.
|
| It's primarily to help out the case where you have a bunch of
| different data and you want to join/filter/group and graph the
| result for some business/product/engineering executive team.
|
| https://datastation.multiprocess.io/
| adam_gyroscope wrote:
| You can do this with bit.io, although it's not local; import your
| CSV (drag and drop or browse to upload) and get a Postgres
| database in the cloud. Full disclosure: I'm a founder at bit.io.
|
| I love to see anything that makes it easier to use data!
| cube2222 wrote:
| Great to see SQL being used more for such use cases! (I really
| like SQL)
|
| This looks nice, especially with the charting!
|
| It's good to know though that there are a few similar open source
| tools already:
|
| - q[0] - SQL on CSV and TSV
|
| - TextQL[1] - SQL on CSV and TSV
|
| - Datasette[2] - Running SQL on various datasets in various
| formats, by (if I understand it correctly) importing them into
| SQLite and running the query there. This one's the closest to the
| posted project, I think, with interactive data exploration and
| publishing built in.
|
| - OctoSQL[3] - SQL on files like CSV, JSON as well as non-file
| databases, endless streams with temporal SQL extensions and
| joining everything together in a single query.
|
| The performance numbers of this tool, however, are impressive if
| they check out with real data. 1GB in 10 seconds is fast.
| However, looking at the documentation, it looks like most of the
| heavy lifting is done by SQLite under the hood. From the docs:
|
| > Under the hood, Superintendent loads CSVs into a SQLite
| database.
|
| Disclaimer: I'm the author of OctoSQL. Still actively developing
| a rewrite on the redesign branch.
|
| [0]: https://github.com/harelba/q
|
| [1]: https://github.com/dinedal/textql
|
| [2]: https://datasette.io
|
| [3]: https://github.com/cube2222/octosql
| jolmg wrote:
| Shameless plug:
|
| https://github.com/jolmg/cq
| simonw wrote:
| I've been looking at calling ".import" in SQLite directly as an
| optimization for my sqlite-utils tool - I'm seeing a big speed
| up in comparison to parsing the CSV in Python:
| https://github.com/simonw/sqlite-utils/issues/297
| tanin wrote:
| Hi! I'm the OP here.
|
| The performance gain is mainly from reading/writing files in C
| and/or in SQLite (which is also C).
|
| Any other tool that uses a dynamic language like Python will
| immediately take minutes to handle a GB-sized file (initializing
| the strings alone is slow).
|
| The main difference is essentially GUI vs. command line.
| Datasette is the closest, but it seems to require the command
| line (based on the usage page). I haven't tried it out yet, so I
| may be wrong here.
|
| OctoSQL looks very interesting.
|
| Initially, I was trying to use DuckDB but couldn't make it work
| on Windows.
|
| SQLite's SQL dialect is somewhat lacking (hello, WITH RECURSIVE).
| It is one of the gaps I would like to solve.
| jonnycomputer wrote:
| So it's a GUI wrapper around SQLite. I guess I expect data
| scientists and engineers to be comfortable with the command line
| and scripting, and there are GUIs that will work with SQLite
| already, so I'm having a hard time figuring out the target
| audience for this.
| chrisweekly wrote:
| This reminds me of https://lnav.org
| hprotagonist wrote:
| The datasette author offers this tool for conversion:
| https://github.com/simonw/csvs-to-sqlite
| simonw wrote:
| Also https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
| which comes with a lot of other useful tools for manipulating the
| schema, adding indexes, and so on after you've imported the data.
| kingosticks wrote:
| The tools in these replies also offer similar functionality
| (in terms of running sql queries against csv data)
| https://news.ycombinator.com/item?id=27763129
| EMM_386 wrote:
| This open source tool on GitHub allows you to quickly import a
| CSV and write SQL queries against it. The CSV is imported into a
| SQLite database. It also does a lot more; highly recommended.
|
| https://github.com/sqlitebrowser/sqlitebrowser
| rognjen wrote:
| Thanks for this. I've been doing it manually all along.
| kmerroll wrote:
| Loading and running SQL on structured text files (CSV) is
| definitely an underserved data-tools niche, but this statement
| threw me: "No more custom and slow Python script. No more
| vlookup, no more pivot tables." Wow, I'll keep my slow
| Python/Pandas script, thank you.
| jdnier wrote:
| If your work is more OLAP-focused, DuckDB has good tools for
| loading CSV files. https://duckdb.org/docs/data/csv
|
| DuckDB is like a columnar SQLite. It also has great support for
| Pandas DataFrames.
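|
| A small sketch of the DataFrame integration (names are made up):
|
|       import duckdb
|       import pandas as pd
|
|       con = duckdb.connect()
|       df = pd.read_csv("file.csv")
|       con.register("tab", df)  # expose the DataFrame as a view
|       out = con.execute("SELECT COUNT(*) FROM tab").df()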
| cube00 wrote:
| No license keys required with H2
| http://www.h2database.com/html/tutorial.html#csv
| rognjen wrote:
| Seems useful, but at the same time its comparison with loading
| CSV into SQL seems kind of disingenuous, since if you're familiar
| with SQL you're probably comfortable enough with the command line
| to load CSV into SQLite (which is free in both senses).
___________________________________________________________________
(page generated 2021-07-18 23:01 UTC)