[HN Gopher] Show HN: Work with CSV files using SQL. For data scientists and engineers
       ___________________________________________________________________
        
       Show HN: Work with CSV files using SQL. For data scientists and
       engineers
        
       Author : tanin
       Score  : 86 points
       Date   : 2021-07-18 08:40 UTC (14 hours ago)
        
 (HTM) web link (superintendent.app)
 (TXT) w3m dump (superintendent.app)
        
       | aphrax wrote:
        | This area interests me, but being completely honest, there are a
        | few off-putting grammatical/spelling errors. Might be worth
        | getting them cleaned up.
        
         | tanin wrote:
         | Thank you for your feedback.
         | 
          | I'm a non-native English speaker, so I would like to ask if
          | you could point out a few places you feel are off-putting.
          | 
          | I would really love to fix those.
        
       | robbiejs wrote:
        | Looks really nice. In what language is the app itself written?
        | If it's a web app, I'm curious to know what table component you
        | are using.
        | 
        | Also, for anyone looking to edit simple CSV files in an Excel-
        | like manner, please check https://editcsvonline.com, a free tool
        | that I made, powered by DataGridXL (which I also made).
        
       | gaha wrote:
        | The way I love doing these kinds of things is by literally using
        | three lines of R code:
        | 
        |     library(sqldf)
        |     tab1 = read.csv("file1.csv")
        |     sqldf("select * from tab1")
        | 
        | sqldf [1] is a library with which you can access R dataframes
        | just like tables in SQL (it actually uses SQLite in the
        | background). I do not have much experience with programming in R
        | itself, but you barely need it. This approach has the advantage
        | that it is very flexible, and you get the power of the (IMHO)
        | best plotting library, ggplot, with it. Of course, you can also
        | do more complicated stuff like joining multiple dataframes and
        | basically everything else you can do with SQLite, and then store
        | the results back in R dataframes. This workflow works best if
        | you use an IDE which lets you execute single lines or blocks of
        | code, like RStudio. Then you also get a nice GUI with it, but
        | there are also plugins for Vi/Emacs that work very well.
       | 
       | [1] https://cran.r-project.org/web/packages/sqldf/index.html
       | 
       | EDIT: code formatting
        
         | RobinL wrote:
          | Check out the R bindings for DuckDB [0]. You should find that
          | it does the same thing (i.e. run SQL against a dataframe/file
          | on disk) much faster for many SQL operations.
         | 
         | [0] https://duckdb.org/docs/api/r
        
         | nojito wrote:
          | If you're going to use R, you might as well learn a bit of
          | dplyr or data.table.
          | 
          | You get that autocomplete goodness, and the vast majority of
          | SQL can be reduced to single lines.
        
       | draegtun wrote:
        | Perl's DBD::CSV module is a classic in this genre (the first
        | version is from 1998) - https://metacpan.org/pod/DBD::CSV
        
       | sgt wrote:
        | It seems interesting, but what is the advantage over just \copy
        | in Postgres from a CSV file - or better yet, importing into
        | SQLite without even providing the table definition?
        
         | simlan wrote:
         | Which is likely how this works under the hood.
        
         | tanin wrote:
          | The difference is between a GUI and the command line.
          | 
          | I work with a lot of CSV files, and for some reason I rarely
          | load them into Postgres or use a Python script. I still try
          | very hard to use Excel.
          | 
          | There is some degree of convenience that a GUI offers over the
          | command line, though some people still prefer the command
          | line.
        
           | qorrect wrote:
            | I use the command line all day; it definitely has its place.
            | But trying to manipulate large CSV files has never felt like
            | a good fit for the CLI.
            | 
            | I'm already using and enjoying this tool a lot. I used to do
            | this in DBeaver (also a great tool), but I had to manually
            | specify the types in a JDBC string to get all the goodies of
            | SQL; this so far has been much nicer.
            | 
            | I have already made this a daily tool, thanks!
        
         | SKILNER wrote:
         | The advantage would be that you do not have your data in two
         | places and can avoid all the problems and confusion that can
         | result.
        
           | jonnycomputer wrote:
           | How does this avoid having data in two places? You still have
           | to have a CSV file to import the data...
        
       | banana_giraffe wrote:
        | > A python script loading a 1GB file is likely to take more than
        | 1 minutes on any machine.
        | 
        | Since when? I just tested it, and I can process a 1GB CSV file
        | on my machine using the csv module in Python in ~10 seconds. My
        | machine isn't a world-record-setting machine, either.
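        | 
        | (For illustration, a minimal sketch of the kind of loop being
        | timed here; the file name is hypothetical, not the commenter's
        | actual snippet:)
        | 
        |     import csv
        | 
        |     # Stream the file row by row; no giant in-memory structure
        |     # is ever built, which is why this stays fast.
        |     with open("big.csv", newline="") as f:
        |         row_count = 0
        |         for row in csv.DictReader(f):
        |             row_count += 1  # replace with real per-row work
        |     print(row_count)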
        
         | tanin wrote:
          | OP here; that is amazing.
          | 
          | I tried reading a 1GB CSV and adding all the rows into SQLite,
          | and it definitely takes minutes. This happens in both JS and
          | Python.
          | 
          | Constructing a giant INSERT should already exceed 1 minute (or
          | transforming any 1GB-sized string).
          | 
          | Would you mind sharing your Python snippet?
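          | 
          | (A sketch of a batched-insert approach that avoids building
          | one giant INSERT string; the file, table, and column names are
          | hypothetical:)
          | 
          |     import csv
          |     import itertools
          |     import sqlite3
          | 
          |     con = sqlite3.connect("data.db")
          |     con.execute("CREATE TABLE t (a, b, c)")
          |     with open("big.csv", newline="") as f:
          |         reader = csv.reader(f)
          |         next(reader)  # skip the header row
          |         while True:
          |             # Bind parameters in batches instead of splicing
          |             # values into one enormous SQL string.
          |             batch = list(itertools.islice(reader, 10000))
          |             if not batch:
          |                 break
          |             con.executemany("INSERT INTO t VALUES (?, ?, ?)",
          |                             batch)
          |     con.commit()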
        
           | nimrody wrote:
            | Not sure how you'd do this from Python, but you can fire up
            | sqlite3 from the command line and run:
            | 
            |     .mode csv
            |     .import filename.csv table_name
            | 
            | On my 2-year-old Windows laptop (i7, Lenovo X1), importing a
            | 2M-row, 1.4GB file takes about 70 seconds.
            | 
            | So not 10 seconds, but there's no need to create an insert
            | statement. (I'm sure you can access this functionality from
            | Python if you need to.)
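            | 
            | (One way to reach the same .import path from Python is to
            | shell out to the sqlite3 binary; a sketch, assuming sqlite3
            | is on PATH and with hypothetical file/table names:)
            | 
            |     import subprocess
            | 
            |     # The sqlite3 shell runs each trailing argument as a
            |     # command, dot-commands included.
            |     subprocess.run(
            |         ["sqlite3", "data.db",
            |          ".mode csv",
            |          ".import filename.csv table_name"],
            |         check=True,
            |     )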
        
       | gigatexal wrote:
       | I do this already with sqlite3
        
       | beckingz wrote:
        | Cool tool!
        | 
        | pandasql is great for running SQL on pandas dataframes:
        | https://pypi.org/project/pandasql/
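        | 
        | (A minimal pandasql sketch; the file and column names are
        | hypothetical:)
        | 
        |     import pandas as pd
        |     from pandasql import sqldf
        | 
        |     df = pd.read_csv("data.csv")
        |     # sqldf looks up dataframes by name in the given namespace,
        |     # so "df" in the query refers to the dataframe above.
        |     result = sqldf("SELECT a, COUNT(*) AS n FROM df GROUP BY a",
        |                    locals())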
        
         | RobinL wrote:
          | DuckDB is a newer but much faster alternative that lets you
          | run SQL on CSV files (and also directories of parquet files,
          | etc.).
          | 
          | It's so fast/good that I've actually been using it for a lot
          | of data cleaning and transformation that I'd previously have
          | done in pandas.
         | 
         | https://github.com/duckdb/duckdb
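          | 
          | (A minimal sketch of querying a CSV with DuckDB's Python API;
          | the file name is hypothetical:)
          | 
          |     import duckdb
          | 
          |     con = duckdb.connect()  # in-memory database
          |     # read_csv_auto infers column names and types from the
          |     # file itself.
          |     df = con.execute(
          |         "SELECT * FROM read_csv_auto('data.csv') LIMIT 10"
          |     ).fetchdf()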
        
       | toyg wrote:
       | I'm old enough to know I should give a pass to anything "free
       | while in beta"...
        
       | _1tan wrote:
       | Is this similar to https://github.com/harelba/q ?
        
       | refactor_master wrote:
        | If I open a CSV with pandas in debug mode, I'm also pretty much
        | free to play with whatever transformations I want, without being
        | restricted by SQL syntax. Saved as Parquet, reloads are also
        | pretty much instantaneous.
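        | 
        | (A sketch of that CSV-to-Parquet round trip with pandas; file
        | names are hypothetical, and pyarrow or fastparquet must be
        | installed:)
        | 
        |     import pandas as pd
        | 
        |     df = pd.read_csv("data.csv")           # slow one-time parse
        |     df.to_parquet("data.parquet")          # columnar cache
        |     df = pd.read_parquet("data.parquet")   # near-instant reload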
        
         | zo1 wrote:
          | Not everyone is sold on, or proficient in, those tools.
          | Honestly, whenever I think I need pandas, I try to get into it
          | and spend an hour in the atrocious docs before giving up and
          | realizing I could have done that "simple" grouping or
          | filtering or graphing with other tools and been done already
          | by now, even though it would have "taken longer".
          | 
          | Same goes for all these NoSQL DBs. E.g., I realize afterwards
          | I spent 2 hours researching how to do an effing join between
          | two different documents. Time I could have spent deciding on a
          | relational entity model and putting down the ORM boilerplate
          | for it. /minirant
        
       | hodgesrm wrote:
       | You can query CSV directly without a DBMS backend using
       | clickhouse-local. [1] It's often used to clean data exactly as
       | this article describes. You can also load CSV very easily into
       | tables.
       | 
       | Superintendent.app seems to bring automatic schema definition to
       | the table. We've discussed adding this to ClickHouse. Would be a
       | great PR if anyone is interested.
       | 
       | [1] https://altinity.com/blog/2019/6/11/clickhouse-local-the-
       | pow...
       | 
       | (This note brought to you by the ClickHouse evangelization task
       | force.)
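        | 
        | (A sketch of driving clickhouse-local from Python, along the
        | lines of the linked post; the column structure and file name are
        | hypothetical:)
        | 
        |     import subprocess
        | 
        |     # clickhouse-local exposes stdin as a table named "table"
        |     # when given a structure and an input format.
        |     with open("data.csv", "rb") as f:
        |         subprocess.run(
        |             ["clickhouse-local",
        |              "--structure", "id Int32, name String",
        |              "--input-format", "CSV",
        |              "--query", "SELECT count(*) FROM table"],
        |             stdin=f, check=True,
        |         )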
        
         | btown wrote:
         | Interested, but writing and compiling a large new C++ codebase
         | when I haven't written C++ in years isn't something I quite
         | have time for at the moment :)
         | 
         | For anyone else with more time to roll up their sleeves,
         | https://github.com/ClickHouse/ClickHouse/blob/b0ddc4fb30f1a0...
         | would be the place to split the header row, and perhaps use
         | some heuristics on the first data row to identify datatypes!
        
       | mjirv wrote:
       | I like that this lets you join CSVs, which a lot of tools in this
       | space don't.
        
         | hermitcrab wrote:
          | To do this in Easy Data Transform:
          | 
          | - drag 2 CSV files onto the Center pane
          | 
          | - click 'join'
          | 
          | - select the column in each CSV to join in the right pane
          | 
          | - done
          | 
          | Depending on the size of the CSV files, you can do this in a
          | few seconds (plus ~1 minute to install).
        
       | eatonphil wrote:
       | Very neat! I'm also working on an (open-source/core) app for
       | running SQL against any data (copy-pasted stuff, data files,
       | logs, HTTP responses, other SQL requests) and being able to
       | script and graph as well.
       | 
        | A difference might be that the goal I have for DataStation is to
        | aid more in data exploration (but without being limited to
        | smaller datasets, as this site points out Google Sheets and
        | Excel are). Some optimizations like streaming all data (and
        | streaming through scripts) are currently out of scope, and that
        | will definitely limit the scale DataStation can handle.
       | 
       | It's primarily to help out the case where you have a bunch of
       | different data and you want to join/filter/group and graph the
       | result for some business/product/engineering executive team.
       | 
       | https://datastation.multiprocess.io/
        
       | adam_gyroscope wrote:
        | You can do this with bit.io, although it's not local; import
        | your CSV (drag and drop, or browse to upload) and get a Postgres
        | database in the cloud. Full disclosure: I'm a founder at bit.io.
       | 
       | I love to see anything that makes it easier to use data!
        
       | cube2222 wrote:
       | Great to see SQL being used more for such use cases! (I really
       | like SQL)
       | 
       | This looks nice, especially with the charting!
       | 
       | It's good to know though that there are a few similar open source
       | tools already:
       | 
       | - q[0] - SQL on CSV and TSV
       | 
       | - TextQL[1] - SQL on CSV and TSV
       | 
        | - Datasette[2] - Running SQL on various datasets in various
        | formats, by (if I understand it correctly) importing them into
        | SQLite and running the query there. This one's the closest to
        | the posted project, I think, with interactive data exploration
        | and publishing built in.
       | 
       | - OctoSQL[3] - SQL on files like CSV, JSON as well as non-file
       | databases, endless streams with temporal SQL extensions and
       | joining everything together in a single query.
       | 
        | The performance numbers of this tool, however, are impressive if
        | they check out with real data. 1GB in 10 seconds is fast.
        | However, looking at the documentation, it appears most of the
        | heavy lifting is done by SQLite under the hood. From the docs:
       | 
       | > Under the hood, Superintendent loads CSVs into a SQLite
       | database.
       | 
       | Disclaimer: I'm the author of OctoSQL. Still actively developing
       | a rewrite on the redesign branch.
       | 
       | [0]: https://github.com/harelba/q
       | 
       | [1]: https://github.com/dinedal/textql
       | 
       | [2]: https://datasette.io
       | 
       | [3]: https://github.com/cube2222/octosql
        
         | jolmg wrote:
         | Shameless plug:
         | 
         | https://github.com/jolmg/cq
        
         | simonw wrote:
          | I've been looking at calling ".import" in SQLite directly as
          | an optimization for my sqlite-utils tool - I'm seeing a big
          | speedup in comparison to parsing the CSV in Python:
          | https://github.com/simonw/sqlite-utils/issues/297
        
         | tanin wrote:
         | Hi! I'm the OP here.
         | 
          | The performance gain is mainly from reading/writing files in C
          | and/or by SQLite (which is also C).
          | 
          | Any other tool that uses a dynamic language like Python will
          | immediately take minutes to handle any GB-sized file
          | (initializing strings is already slow).
          | 
          | The main difference is essentially GUI vs command line.
          | Datasette is the closest, but it seems to require the command
          | line (based on the usage page). I haven't tried it out yet, so
          | I may be wrong here.
          | 
          | OctoSQL looks very interesting.
          | 
          | Initially, I was trying to use DuckDB but couldn't make it
          | work on Windows.
          | 
          | SQLite's SQL dialect is somewhat lacking (hello, WITH
          | RECURSIVE). It is one of the gaps I would like to solve.
        
       | jonnycomputer wrote:
        | So it's a GUI wrapper around SQLite. I guess I expect data
        | scientists and engineers to be comfortable with the command line
        | and scripting, and there are GUIs that will work with SQLite
        | already, so I'm having a hard time figuring out the target
        | audience for this.
        
       | chrisweekly wrote:
       | This reminds me of https://lnav.org
        
       | hprotagonist wrote:
       | The datasette author offers this tool for conversion:
       | https://github.com/simonw/csvs-to-sqlite
        
         | simonw wrote:
          | Also https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
          | which comes with a lot of other useful tools for manipulating
          | the schema, adding indexes, and so on after you've imported
          | the data.
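          | 
          | (For illustration, the same kind of insert via the
          | sqlite-utils Python API; file and table names are
          | hypothetical:)
          | 
          |     import csv
          |     import sqlite_utils
          | 
          |     db = sqlite_utils.Database("data.db")
          |     with open("data.csv", newline="") as f:
          |         # insert_all creates the table, inferring columns from
          |         # the first record's keys.
          |         db["mytable"].insert_all(csv.DictReader(f))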
        
           | kingosticks wrote:
            | The tools in these replies also offer similar functionality
            | (in terms of running SQL queries against CSV data):
            | https://news.ycombinator.com/item?id=27763129
        
       | EMM_386 wrote:
        | This open source tool on GitHub allows you to quickly import a
        | CSV and write SQL queries against it. The CSV is imported into a
        | SQLite database. It also does a lot more; highly recommended.
       | 
       | https://github.com/sqlitebrowser/sqlitebrowser
        
         | rognjen wrote:
         | Thanks for this. I've been doing it manually all along.
        
       | kmerroll wrote:
        | Loading and running SQL on structured text files (CSV) is
        | definitely an underserved data-tools niche, but this statement
        | threw me: "No more custom and slow Python script. No more
        | vlookup, no more pivot tables." Wow, I'll keep my slow
        | Python/Pandas script, thank you.
        
       | jdnier wrote:
        | If your work is more OLAP-focused, DuckDB has good tools for
        | loading CSV files: https://duckdb.org/docs/data/csv
        | 
        | DuckDB is like a columnar SQLite. It also has great support for
        | pandas DataFrames.
        
       | cube00 wrote:
       | No license keys required with H2
       | http://www.h2database.com/html/tutorial.html#csv
        
       | rognjen wrote:
        | Seems useful, but at the same time its comparison with loading
        | CSV into SQL seems kind of disingenuous, since if you're
        | familiar with SQL you're probably familiar enough with the
        | command line to load CSV into SQLite (which is free in both
        | meanings).
        
       ___________________________________________________________________
       (page generated 2021-07-18 23:01 UTC)