[HN Gopher] Practical SQL for Data Analysis
___________________________________________________________________
Practical SQL for Data Analysis
Author : steve-chavez
Score : 420 points
Date : 2021-05-03 14:55 UTC (8 hours ago)
(HTM) web link (hakibenita.com)
(TXT) w3m dump (hakibenita.com)
| michelpp wrote:
| This is an excellent example of what I call the Copy-Object-Copy
| effect. It's particularly apparent in frameworks with ORMs like
| Django. In Pandas' case, devs will do 'SELECT *' and use pandas as
| a sort of pseudo ORM.
|
| You run a query to get your data as a bunch of objects, but
| you're copying the data over the db connection from the postgres
| wire protocol into Python objects in memory, which are typically
| then garbage collected at the end of the transaction, then copy
| the result to a JSON buffer that is also garbage collected, and
| then send the final result to the browser, which has been
| waiting this whole time.
|
| I regularly see Django apps require several gigs of RAM per
| worker and when you look at the queries, it's just a bunch of
| poorly rendered SELECTs. It's grotesque.
|
| Contrast PostgREST: 100 megabytes of RAM per worker. I've seen
| Django DRF workers require 50 to 1 memory vs PostgREST for the
| same REST interface and result with PostgREST being much faster.
| Since the database is generating the json, it can do so
| _immediately_ upon generating the first result row which is then
| _streamed_ to the browser. No double buffering.
|
| Unlike Django, it's not that I don't like Pandas, it's that
| people way overuse it for SQL tasks as this article points out.
| I've seen pandas scripts that could be a simpler psql script. If
| psql can't do what you want, resist the temptation to 'SELECT *'
| into a data frame and break the problem up into stages where you
| get the database to do the maximum work before it gets to the
| data frame.
| password4321 wrote:
| I would appreciate pointers to any other low/no-copy db ->
| browser technologies.
|
| I was thinking it would be cool to use parts of JavaScript db
| libraries to parse query results passed straight through in
| (compressed?) db wire protocol format via WebRTC.
| geraneum wrote:
| Well, it's true that bringing row(s) of data from the database
| to Python and then serializing them for any reason has extra
| overhead.
|
| But in Django's case, the same arguments can be made as for
| Pandas. Django is a big, complex framework, and an application
| using it might consume more memory for countless other
| reasons. There are also best practices for using the Django ORM.
|
| But to say that a given Django instance consuming more memory
| is only because of the "Copy-Object-Copy effect"... I don't
| think so.
| tomnipotent wrote:
| > it's that people way overuse it for SQL tasks as this article
| points out
|
| I'm very confused by this. I've used pandas for over a decade,
| and in most cases it's a massive time saver. I can do a single
| query to bring data into local memory in a Jupyter notebook,
| and from there re-use that memory across hundreds or more
| executions of pandas functions to further refine an analysis or
| whatever task I'm up to.
|
| Your "copy-object-copy" is not relevant in data analysis use
| cases, and exists in pretty much any system that pulls data
| from an external service be it SQL or not.
| tomrod wrote:
| I know a team that would read multi-GB CSVs from an FTP
| site into an orchestrator server using Pandas, write to a local
| CSV on the orchestrator, validate the data after the write,
| reload into pandas, painfully and manually check every data
| type, then use a pandas connector to a database.
|
| Every step along the way here is wrong. The database came
| with SFTP connectors. The orchestrator should have simply
| told the database server to pull from the SFTP site into a
| landing table. Even if not, Pandas has dtypes you can
| specify on ingestion, which is heaven if your CSV or other
| files don't record data types yet are consistent in structure.
| Further, if you have custom validation you're applying (not a
| bad thing!) you rarely even need pandas; the native CSV
| library, or xlrd/openpyxl for Excel, are fine.
|
| Ultimately, it is a training issue. The tool space is so
| complex, with so many buzzwords for simple things, that
| people get lost in the whirlwind.
| mistrial9 wrote:
| parent post says "devs do SELECT *" and ...
|
| relational databases can have a lot more data in one table,
| or related tables, than one query needs. It is often very
| wasteful of RAM to get ALL and filter again
| protomyth wrote:
| If your query does " * " it gets all the columns in all the
| tables. Often, when all the columns you need are in the
| index, the optimizer never visits the actual table (a
| covering index). " * " basically screws this up.
| "SELECT *" should never be used in anything in an actual
| production environment.
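|
| A quick way to see the difference (a minimal SQLite sketch with
| made-up table and column names; Postgres behaves similarly):
|
|     import sqlite3
|
|     conn = sqlite3.connect(":memory:")
|     conn.executescript("""
|         CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT, state TEXT, dob TEXT);
|         CREATE INDEX idx_state_dob ON patients (state, dob);
|     """)
|
|     # Only indexed columns referenced: the plan reports a COVERING INDEX scan.
|     covered = "SELECT state, dob FROM patients WHERE state = 'NY'"
|     # SELECT * forces a lookup back into the table for every matching row.
|     uncovered = "SELECT * FROM patients WHERE state = 'NY'"
|
|     for sql in (covered, uncovered):
|         print(conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall())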
| tomnipotent wrote:
| Data analysis workloads more often run into problems
| solved by partitioning rather than indexes.
| mistrial9 wrote:
| yes and no - I trained on "larger than RAM datasets"
| intentionally, and subsequently took on projects that
| require "larger than RAM datasets". Two things happened
| on my way, companies led by Google invented and deployed
| datasets previously impossible e.g. BigTable, and
| secondly the equipment I worked on went from modest to
| mid-sized RAM (hard to define that). Granted that lots of
| very useful (and common?) tasks are not "larger than RAM
| datasets", and then have very different characteristics..
| which is starting to sound like "Excel problems look like
| this, they always fit in RAM, they need partitions not
| indexes" and the like..
|
| There is a blind-man-and-the-Elephant drift here, which
| is not terrible, but might need calling out to improve..
| tomnipotent wrote:
| Not for data analysis, it's a pointless constraint unless
| you're having issues.
|
| Most data analysis isn't against datasets larger than
| memory, and I'd rather have more data than I need to spend
| time waiting to bring it all local again because I forgot a
| few columns that turn out to be useful later on.
| disgruntledphd2 wrote:
| True, but in that kind of exploratory environment, you'd
| normally reduce data load and speed up iteration by using
| sampling, which is super easy in SQL.
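|
| E.g. in Postgres (a sketch; connection string and table name
| are placeholders):
|
|     import pandas as pd
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=warehouse")
|     # Roughly 1% of the table's pages; BERNOULLI samples at row level instead.
|     sample = pd.read_sql("SELECT * FROM events TABLESAMPLE SYSTEM (1)", conn)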
|
| > Not for data analysis, it's a pointless constraint
| unless you're having issues.
|
| This will always happen, as time spent on the project
| scales up (unless you use some kind of autoscaling magic
| I suppose).
| EForEndeavour wrote:
| I totally get what you're saying because most of my
| datasets also used to be smaller than my laptop's memory.
|
| _Used to._
|
| I'm in the process of moving more and more processing
| "upstream" from local pandas to the DBMS and it's already
| proving to be a bit of a superpower. I regret not
| learning it earlier, but the second-best time is now.
| tarsinge wrote:
| The point is that if you only need to join and aggregate data
| it's easier and more efficient to do it directly in SQL. Also
| in production or when your database doesn't fit in memory the
| idea is to first use SQL to generate an optimized view of
| your data from the database before further analysis and
| transformation.
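|
| E.g. (a sketch with made-up table names and connection details),
| the join and aggregation run in the database and pandas only ever
| sees the small result:
|
|     import pandas as pd
|     import sqlalchemy
|
|     engine = sqlalchemy.create_engine("postgresql://analyst@dbhost/sales")
|     monthly = pd.read_sql(
|         """
|         SELECT date_trunc('month', o.created_at) AS month,
|                c.country,
|                SUM(o.amount) AS revenue
|         FROM orders o
|         JOIN customers c ON c.id = o.customer_id
|         GROUP BY 1, 2
|         """,
|         engine,
|     )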
| jstrong wrote:
| > The point is that if you only need to join and aggregate
| data it's easier and more efficient to do it directly in
| SQL
|
| citation needed? I've seen plenty of cases where it would
| have taken the db ages to do something that pandas does
| fast, and I don't consider pandas to be particularly fast.
| qsort wrote:
| Pandas is great when you're working on the data manually
| because it lets you interleave python code and "queries", but
| it's strictly worse than a plain SQL statement if you're
| writing, say, a REST service that needs to output some data.
|
| SQL executed by the database is orders of magnitude more
| efficient, way more expressive, and doesn't require you to
| memorize pandas' absurd API.
|
| And I say this as someone who is by no means a SQL wizard,
| and fully acknowledging all of SQL's blemishes.
| tomnipotent wrote:
| > a REST service
|
| This is a post about data analysis, and everyone wants to
| point out that pandas isn't good at real-time service
| requests.
|
| > SQL executed by the database is orders of magnitude more
| efficient
|
| Compared to pandas? No, it's not. Once I have all the data
| I need locally, it's MUCH faster to use pandas locally than
| to re-issue queries to a remote database.
| crazygringo wrote:
| > _Once I have all the data I need locally, it's MUCH
| faster to use pandas locally than to re-issue queries to
| a remote database._
|
| Both of you are right.
|
| Sure, if you need to grab a huge chunk of the entire
| database and then do tons of processing on every row that
| SQL simply cannot do, then you're right.
|
| But when most people think SQL and database, they're
| thinking of grabbing a tiny fraction of rows, sped up by
| many orders of magnitude because it utilizes indexes, and
| doing all calculations/aggregations/joins server-side.
| Where it absolutely _is_ going to be orders of magnitude
| more efficient.
|
| Traditional database client API's often aren't going to
| be particularly performant in your case, because they're
| usually designed to read and store the entire result of a
| query in-memory before you can access it. If you're lucky
| you can enable streaming of results that bypasses this.
| Other times you'll be far better off accessing the data
| via some kind of command-line table export tool and
| streaming its output directly into your program.
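|
| (For example, in psycopg2 a _named_ cursor is server-side, so
| rows arrive in batches instead of the whole result being
| buffered in client memory; table and connection details below
| are placeholders.)
|
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=warehouse")
|     with conn.cursor(name="export") as cur:
|         cur.itersize = 10_000   # rows fetched per round trip
|         cur.execute("SELECT * FROM events WHERE ts >= '2021-01-01'")
|         for row in cur:
|             handle(row)         # hypothetical per-row handler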
| 3pt14159 wrote:
| I agree, they're both right.
|
| I've been doing data science since around 2008 and it's a
| balance between local needs and repeated analysis and an
| often more efficient one off query. Sure the SQL
| optimizer is going to read off the index for a count(*),
| but it doesn't really help if I need all the rows locally
| for data mining _anyway_. The counts need to line up! So
| I'll take a snapshot of the data locally for the one-off
| analysis and call it a day. If I need this type of
| report to be run nightly, it will be off of the data
| warehouse infrastructure not the production DB server.
|
| Shrug. These things take time and experience to fully
| internalize and appreciate.
| tomnipotent wrote:
| > But when most people think SQL and database
|
| We're not talking about most people, but data analysts.
| If we're just doing simple sum/count/average aggregated
| by a column or two with some basic predicates, SQL and an
| RDBMS are your best friend. Always better to keep compute
| + storage as close together as possible.
|
| > Traditional database client API's
|
| I'm not sure what point you're trying to make with this.
| Most data analysts are not working against streaming
| data, but performing one-off analyses for business
| counterparts that can be as simple as "X,Y,Z by A,B,C"
| reports to "which of three marketing treatments for our
| snail mail catalogue was most effective at converting
| customers".
| crazygringo wrote:
| I'm not talking about streaming live data, I'm talking
| about streaming query results if you re-read my comment.
|
| If you're reading many MB's or GB's of data from a
| database, it's a lot more performant to stream it from
| the database directly into your local data structure,
| rather than rely on default processing of query results
| as a single chunk which will be a lot worse for memory
| and speed.
| qsort wrote:
| > Compared to pandas? No, it's not.
|
| I'm not saying you shouldn't use pandas, it depends on
| the size of the data. I'm working right now on a project
| where a SELECT * of the entire fact table would be a
| couple hundred gigabytes.
|
| The flow is SQL -> pandas -> manipulation, and as always
| in pipelines like those, the more work you can do at the
| earliest stage, the better.
| disgruntledphd2 wrote:
| Yeah, speaking as a data person, the SQL argument is
| correct. Python/R are much, much, much slower for this
| kind of work.
|
| OTOH, SQL is super limiting for a lot of data analysis
| tasks and you'll inevitably need the data in weird forms
| that require lots of munging.
|
| Personally, I'm a big fan of using SQL/Airflow/whatever
| to generate whatever data I'll need all the time at a
| high level of granularity (user/action etc), and then
| just run a (very quick) SQL query to get whatever you
| need into your analytics environment.
|
| Gives you the best of both worlds, IME.
| waltherg wrote:
| This is a great point and way of looking at it: pandas and
| SQL live at opposite ends of the maturation level of data
| wrangling pipelines.
| dragonwriter wrote:
| > If psql can't do what you want, resist the temptation to
| 'SELECT *' into a data frame and break the problem up into
| stages where you get the database to do the maximum work before
| it gets to the data frame.
|
| Why are we introducing an additional tool (psql) here
| unnecessarily? Sure, if it can be a simpler postgres query,
| that can be useful, but introducing psql and psql scripting
| into a workflow that is still going to use python is...utterly
| wasteful. And even simplifying by optimizing the use of SQL,
| while a good tool to have in the toolbox, is probably pretty
| low value for the effort for lots of data analysis tasks.
| EForEndeavour wrote:
| I'm pretty sure the premise here is that the many
| gigabytes/terabytes of data reside in an RDBMS to begin with,
| so we aren't introducing another tool -- we're saying it
| makes sense to leverage multiple decades' worth of database
| optimizations by doing a bunch of data filtering, processing,
| and analysis in the database software and then exporting
| intermediate results to a pandas environment for final
| tweaks, visualization, etc.
| dragonwriter wrote:
| > I'm pretty sure the premise here is that the many
| gigabytes/terabytes of data reside in an RDBMS to begin
| with, so we aren't introducing another tool
|
| psql is a separate scriptable client tool from the postgres
| database server.
| shkkmo wrote:
| The point was about doing more work using the SQL DB
| engine, the client library you do that with seems
| irrelevant to making that point.
| dragonwriter wrote:
| > The point was about doing more work using the SQL DB
| engine
|
| The specific reference was to "psql script", which runs
| in the psql client program, not the DB engine. Since I
| suspected that might be an error intending to reference
| SQL running on the server, I separately addressed, in my
| initial response, both what was literally said (psql
| script) and sql running in the DB.
| shkkmo wrote:
| Any psql script would be building and running SQL queries
| so I fail to see how that has any impact of the point
| being made.
|
| Edit: The argument is that there are operations that
| should be done in the SQL layer and it is worth time
| learning enough about that layer to understand when and
| how to use it for computation. Once you learn that, it
| isn't really relevant if you are using psql or some other
| client library to build those queries.
| mixmastamyk wrote:
| Interesting. I'm guessing the extra layer may be needed for
| some manipulations of data at the application layer, but
| perhaps that is almost always avoidable?
|
| Have you ever tried Hasura? I'm thinking of giving it a go in a
| future project.
| tylerhannan wrote:
| If/as you give it a go in the future...do let us know what is
| working and what could be better. We do, in fact, care.
|
| <disclaimer: I work at Hasura>
| musingsole wrote:
| It's only avoidable to the degree that your software
| architecture and company architecture allow you to make the
| more efficient decision to move the logic upstream.
|
| The cases where this pattern emerges are invariably because
| someone with insufficient access to the DB and its
| configuration bites the bullet and instead builds whatever
| they need in the environment they do have complete
| access to.
|
| I've seen similar problems lead to dead cycles in _processor_
| design, where the impact is critical. These problems aren't
| software technology problems. They're people coordination
| problems.
| arcticfox wrote:
| As a Hasura user, I highly recommend it for OLTP workloads.
| And for the specific subject at hand (memory usage) it is
| fantastic. Its (in-database) serialization speed just
| completely runs circles around anything we could do before in
| Ruby.
| WhompingWindows wrote:
| Within the VA hospital system, the data admins often got onto
| their "soapboxes" to dress down the 1400+ data analysts for
| writing inefficient SQL queries. If you need 6 million
| patients, joining data from 15 tables, gathering 250 variables,
| a beginning SQL user has the potential to take 15-20 hours
| where they could be pulling for 1-2 if they do some up-front
| filtering and sub-grouping within SQL. If you already know
| you'll throw out NA's on certain variables, if you need a
| certain date or age range, or even the way you format these
| filters: these all lead to important savings. And this keeps R
| from filling up memory, which would often happen if you fed
| it way too much data.
|
| Within a system of 1400 analysts, it makes a big difference if
| everyone's taking 2X or 4X the time pulling that they could be.
| Then, even the efficiently written pulls get slowed down, so
| you have to run things overnight...and what if you had an error
| on that overnight pull? Suffice to say, it'd have been much
| simpler if people wrote solid SQL from the start.
| sagarm wrote:
| How many TBs of data are we really talking here, if it's just
| 6M rows? Surely this processing could easily be done on a
| single machine.
| sztanko wrote:
| It doesn't say it is just 6m rows. It is 6m patients, which
| only hints that one of the dimension tables is 6m. Facts
| gathered on patients might be significantly larger. Also,
| in my experience, if you have hundreds of queries
| running simultaneously, it is not the volume of data that
| can be a bottleneck. Depending on the system it can be
| anything, starting from acquiring lock on a record or
| initiating a transaction.
| solaxun wrote:
| I understand your point generally but I don't understand this
| example. If you need data from 15 tables, you need to do
| those joins, regardless of prefiltering or subgrouping,
| right?
| protomyth wrote:
| Well, yes but. Why do you need 15 tables for analysis
| queries and why isn't someone rolling those tables into
| something a bit easier with some backend process.
| [deleted]
| alextheparrot wrote:
| This is where I think the original data admins were
| deluding themselves. Expecting 1,400 analysts to write
| better code is a really non-trivial problem, but easy to
| proclaim.
|
| An actual solution is creating pre-joined tables and
| having processes ("Hey your query took forever, have you
| considered using X?") or connectors
| (".getPrejoinedPatientTable()") that make sure those
| tables are being used in practice.
| steve-chavez wrote:
| > why isn't someone rolling those tables into something a
| bit easier with some backend process.
|
| To put it in more concrete terms(plain SQL): the tables
| could be aggregated on a set returning function or a
| view.
| cgh wrote:
| Yes, in these situations materialized views with indexes
| are generally the correct answer.
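|
| (A rough Postgres sketch, with hypothetical table names, of
| what that backend process could maintain:)
|
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=warehouse")
|     with conn, conn.cursor() as cur:
|         # Roll the multi-table join up once, behind a name analysts query directly.
|         cur.execute("""
|             CREATE MATERIALIZED VIEW IF NOT EXISTS patient_flat AS
|             SELECT p.patient_id, p.dob, d.name AS department, v.visit_date, v.dx_code
|             FROM patients p
|             JOIN visits v USING (patient_id)
|             JOIN departments d USING (dept_id)
|         """)
|         cur.execute("CREATE INDEX IF NOT EXISTS patient_flat_date_idx ON patient_flat (visit_date)")
|         cur.execute("REFRESH MATERIALIZED VIEW patient_flat")  # on whatever schedule the process runs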
| tomrod wrote:
| > why isn't someone rolling those tables into something a
| bit easier with some backend process.
|
| You've identified why we're all going to have jobs in 100
| years.
|
| Automation sounds great. It's exponentially augmenting to
| some users in a defined user space.
|
| Until you get to someone like me, who looks at the
| production structure and goes: "this is wholly
| insufficient for what I need to build, but it has good
| bones, so I'm going to strip it out and rebuild it for my
| use case, and only my use case, because to wait for a
| team to prioritize it according to an arcane schedule
| will push my team deadlines exceedingly far."
|
| This is why you don't roll everything into backend
| processes. Companies set up for production (high
| automation-value ROI) and analytics (high labor-value
| ROI) and have a hard time serving the mid-tail. EVERYTHING
| in either direction works against the mid-tail --
| security policies, data access policies, software
| approvals, you name it.
|
| People, policy, and technology. These are the three
| pillars. If your org isn't performing the way it should,
| then by golly work at one of these and remember that
| technology is only one of three.
| punnerud wrote:
| Three pillars where you can only choose two to be perfect.
| Just like databases and the CAP theorem.
| tomrod wrote:
| Technology is the easiest to adjust, ironically.
| protomyth wrote:
| I don't disagree with writing solid SQL. I would go so far as
| to say some things (most) need to be in stored procedures
| that are reviewed by competent people. But, some folks don't
| think about usage sometimes.
|
| This is one of those things I just don't get about folks
| setting up their databases. If you have a rather large
| dataset that keeps building via daily transactions, then it's
| time to recognize you really have some basic distinct
| scenarios and to plan for them.
|
| The most common is adding or querying data about a single
| entity. Most application developers really only deal with
| this scenario since that is what most applications care about
| and how you get your transactional data. Basic database
| knowledge gets most people to do this ok with proper primary
| and secondary keys.
|
| Next up is a simple rule, "if you put a state on an entity,
| expect someone to need to know all the entities with this
| state." This is a killer for application developers for some
| reason. It actually requires some database knowledge to set up
| correctly to be performant. If the data analysts have
| problems with those queries, then it's time to get the DBA to
| fix the damn schema and write some stored procedures for the
| app team.
|
| At some point, you will need to do actual reporting, excuse
| me, business intelligence. You really should have some
| process that takes the transactional data and puts it into a
| form where the queries of data analysts can take place. In
| the old days that would be something to load up Red Brick or
| some equivalent. Transactional systems make horrid reporting
| systems. Running those types of queries on the same database
| the transactional system is currently trying to work on is
| just a bad idea.
|
| Of course, if you are buying something like IBM DB2 EEE
| spreading queries against a room of 40 POWER servers, then
| ignore the above. IBM will fix it for you.
| jnsie wrote:
| At its simplest it's OLTP vs OLAP. Separate the data
| entry/transactional side of things from the reporting part.
| Make it efficient for data analysts to do their jobs.
| fifilura wrote:
| I can't help thinking that with better processes and tools
| those 1400 analysts could instead be 50 analysts? (or even
| 5?)
|
| And that building an aggregation ETL pipeline, maybe inspired
| by this post, could be the solution?
| 1980phipsi wrote:
| VA -> government -> bloat
| allknowingfrog wrote:
| Is it possible to do any kind of version control with
| PostgREST? I have no doubt that raw SQL offers huge performance
| advantages over my ORM, but I'm pretty fond of the ability to
| roll back to a previous state when I push a bad change to
| production. Performance is just one of a number of tradeoffs
| that I consider when I'm choosing tools.
| sitkack wrote:
| It is customary to check in all DB assets, table creation
| scripts, queries, stored procedures, etc. No different than
| the rest of the development process. If something _isn't_
| being versioned, there is a huge problem.
| jonas_b wrote:
| Question: if I have a CSV file that I'd like to do some quick SQL
| queries on before moving the results into Pandas, what would be a
| good resource to do this? Preferably compatible with the rest of
| the Python-dataframe ecosystem and as simple as pd.read_csv().
| Vaslo wrote:
| Try SQL Alchemy?
| vincnetas wrote:
| http://harelba.github.io/q/
|
| q "SELECT COUNT(*) FROM ./clicks_file.csv WHERE c3 > 32.3"
|
| It uses sqlite under the hood.
| arusahni wrote:
| In addition to the recommendation for sqlite, I've found
| `csvkit` to be an extremely useful set of CLI tools for CSV
| munging. The `csvsql` [1] entrypoint is especially handy
| because it allows you to issue SQL queries against your CSV
| file directly vs. loading then querying.
|
| 1: https://csvkit.readthedocs.io/en/latest/scripts/csvsql.html
| rahulj51 wrote:
| Try duckdb to query and even transform the data and then export
| to a pandas df.
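|
| Something like (assuming a local clicks_file.csv; column names
| are made up):
|
|     import duckdb
|
|     # The filter/aggregate run inside DuckDB, straight off the CSV;
|     # only the reduced result becomes a pandas DataFrame.
|     df = duckdb.query("""
|         SELECT c3, count(*) AS n
|         FROM 'clicks_file.csv'
|         WHERE c3 > 32.3
|         GROUP BY c3
|     """).df()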
| simonw wrote:
| If you want to try it in SQLite (pros: no need to run a server
| or install anything since it's in the Python standard library,
| cons: not nearly as many advanced statistical analysis features
| as PostgreSQL) my sqlite-utils CLI tool may help here: it can
| import from CSV/TSV/JSON into a SQLite database:
| https://sqlite-utils.datasette.io/en/stable/cli.html#inserti...
| saltcured wrote:
| The sqlite3 connection object in Python allows you to
| register callables which you can use as scalar or aggregate
| functions in your SQL queries. With this, you can fill some
| of the gaps compared to PostgreSQL by essentially importing
| Python libraries. I just found this nice tutorial while
| looking for relevant docs:
|
| https://wellsr.com/python/create-scalar-and-aggregate-
| functi...
|
| However, I think the limited type system in SQLite means you
| would still want to extract more data to process in Python,
| whether via pandas, numpy, or scipy stats functions. Rather
| than introducing new composite types, I think you might be stuck
| with just JSON strings and frequent
| deserialization/reserialization if you wanted to build up
| structured results and process them via layers of user-
| defined functions.
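|
| A small sketch of both kinds of registration (made-up data):
|
|     import sqlite3
|     import statistics
|
|     conn = sqlite3.connect(":memory:")
|     conn.execute("CREATE TABLE readings (c REAL)")
|     conn.executemany("INSERT INTO readings VALUES (?)", [(1.0,), (2.0,), (10.0,)])
|
|     # Scalar function: any Python callable, registered by name and arg count.
|     conn.create_function("sq", 1, lambda x: x * x)
|
|     # Aggregate: a class with step()/finalize().
|     class Median:
|         def __init__(self):
|             self.values = []
|         def step(self, value):
|             if value is not None:
|                 self.values.append(value)
|         def finalize(self):
|             return statistics.median(self.values) if self.values else None
|
|     conn.create_aggregate("median", 1, Median)
|
|     print(conn.execute("SELECT sq(c) FROM readings").fetchall())
|     print(conn.execute("SELECT median(c) FROM readings").fetchone())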
| ithrow wrote:
| You can easily import the CSV file into SQLite, and you don't
| even have to create the table fields beforehand.
| bruiseralmighty wrote:
| I keep a non-commercial installation of SQLServer on my machine
| for this reason, but this is likely overkill for most purposes
| and requires a Windows machine.
|
| It does have some nice import features for CSV data though.
| wswope wrote:
| SQLite is what you're looking for. If you want to use the CLI:
|
|     .mode csv
|     .import <path to csv file> <table name>
|
| Alternatively, read your data into pandas and there's extremely
| easy interop between a DBAPI connection from the python
| standard lib Sqlite3 module and Pandas (to_sql, read_sql_query,
| etc.).
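|
| E.g. (file, table and column names are made up):
|
|     import sqlite3
|     import pandas as pd
|
|     conn = sqlite3.connect("scratch.db")
|
|     # Park the CSV in SQLite once, then do the heavy lifting in SQL.
|     pd.read_csv("data.csv").to_sql("data", conn, if_exists="replace", index=False)
|     summary = pd.read_sql_query(
|         "SELECT category, COUNT(*) AS n, AVG(amount) AS avg_amount "
|         "FROM data GROUP BY category",
|         conn,
|     )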
| thamer wrote:
| I would second this suggestion. I was querying CSV data
| imported into SQLite this weekend, and it was extremely easy
| to get started. SQLite is pretty snappy even for tables with
| millions of rows.
|
| SQLite supports defining columns without a type and will use
| TEXT by default, so you can take the first line of your CSV
| that lists the column names, put those in the
| brackets of a CREATE TABLE statement, and then run the
| .import described above (so just CREATE TABLE foo(x,y,z); if
| x,y,z are your column names).
|
| After importing the data don't forget to create indexes for
| the queries you'll be using most often, and you're good to
| go.
|
| Another suggestion for once your data is imported, have
| SQLite report it in table format:
|
|     .mode column
|     .headers on
| zabzonk wrote:
| Windows provides an ODBC driver for CSV files - I don't know
| how this would play with Pandas.
| jplr8922 wrote:
| tldr ; if you hear ''but we can do this in SQL'', RUN!!!
|
| My eyes hurt as I read this article. There are reasons why
| analysts don't use SQL to do their job, and it has nothing to do
| with saving RAM and memory.
|
| 1) Data analysis is not a linear process; it involves playing with
| and manipulating the data in different ways and letting your mind
| drift a bit. You want your project in an IDE made for that
| purpose, the ability to create charts, source control, export and
| share the information, etc. Pandas is just a piece of that puzzle
| which is not possible to replicate in pure SQL.
|
| 2) In 2020, there are numerical methods you want to try beyond a
| traditional regression. Most real-world data problems are not
| solved by the stats-101 tools included in SQL. Kurtosis?
| Autocorrelation?
|
| 3) Politics. Most database administrators are control freaks who
| _hate_ the idea of somebody else doing stuff in their DB. Right
| now where I work we still have to use SSIS 2013 instead of stored
| procedures in order to avoid the DBA's bureaucratic refusal
| process.
|
| 4) Eventual professional development. If your analysis is good
| and creates value, chances are it will become a 'real' program
| and you will have to explain what you are doing to turn it into
| an OOP tool. If you have CS101, good coding in Python will make
| this process much easier than a 3,000-line spaghetti SQL
| script.
|
| 5) Data cleaning. Dealing with outliers, NaNs and all that jazz
| really depends on the problem you are trying to solve. The absence
| of a one-size-fits-all solution is a good case for R/pandas/etc.
| These issues will break an SQL script in no time.
|
| 6) Debugging in SQL. Hahahahahahahaha
|
| If you are still preoccupied with the RAM usage of your PC for
| your project, here are two solutions which infuriate a lot of
| DBAs I've worked with.
|
| A)
| https://www.amazon.ca/s?k=ram&__mk_fr_CA=%C3%85M%C3%85%C5%BD...
|
| B) https://aws.amazon.com/
| qwertox wrote:
| What is your problem with this example, doesn't it get the job
| done?
|
|     WITH temperatures AS ( /* ... */ )
|     SELECT *,
|            MAX(c) OVER (
|                ORDER BY t
|                ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
|            ) AS hottest_temperature_last_three_days
|     FROM temperatures;
|
|          t      | c  | hottest_temperature_last_three_days
|     ------------+----+-------------------------------------
|      2021-01-01 | 10 | 10
|      2021-01-02 | 12 | 12
|      2021-01-03 | 13 | 13
|      2021-01-04 | 14 | 14
|      2021-01-05 | 18 | 18
|      2021-01-06 | 15 | 18
|      2021-01-07 | 16 | 18
|      2021-01-08 | 17 | 17
|
| Why should I fetch all the data and then form the result with
| another tool?
|
| What a great article.
| jplr8922 wrote:
| I re-read my comment, and I think that I expressed myself too
| harshly. If you like SQL and it gets the job done for you, no
| problem. I would see myself using that SQL query.
|
| However, in order to get there, you need to know why you need
| the ''hottest_temperature_last_three_days''. Why not 2 or 4
| days? Why not use heating degree days? What about serial
| correlation with other regions, or metering malfunction? What
| if you are working directly with raw data from instruments,
| and still need to choose which cleaning method you will use?
| What if you want to check the correlation with another
| dataset which is not yet in your database (ex: private
| dataset in excel from a potential vendor)?
|
| If you know _exactly_ what you want, sure, SQL is the way to
| go. However, the first step of a data project is to admit that
| you don't know what you want. You will perform a literature
| review and might have a general idea, but starting with a
| precise solution in mind is a recipe for failure. This is
| why you could need to fetch all the data and then form
| results with another tool. How do you even know which factors
| and features must be extracted before doing some exploration?
|
| If you know the process you need and only care about RAM-CPU
| optimization, sure, SQL is the way to go. Your project is an
| ETL and you have the job of a programmer who is coding a
| report. There is no ''data analysis'' there...
| sagarm wrote:
| If you know SQL well, then doing exploratory analysis in
| SQL is perfectly reasonable. Especially if you are using a
| high-performance analytic database.
|
| Disclosure: I develop high-performance analytic database
| systems with SQL frontends.
| jplr8922 wrote:
| What do you mean by exploratory analysis? Let's assume
| that my dataset contains features which are not normally
| distributed, and I want to check coskewness and
| cokurtosis matrices before and after I remove outliers
| with a method specific to my field of research, found in
| an academic paper. Am I supposed to code all of this in
| SQL? What is your definition of exploratory analysis, and
| which metrics do you have in mind?
|
| And what if I want to make graphs to present stuff to a
| colleague? Am I supposed to use... excel to do that?
| qwertox wrote:
| > If you know exactly what you want, sure SQL is the way to
| go.
|
| I've been doing similar aggregations with MongoDB, simply
| because when I start a project where I quickly want to
| store data and then a week later check what I can do with
| it, it's a tool which is trivial to use. I don't need to
| think about so many factors like where and how to store the
| data. I use MongoDB for its flexibility (schemaless),
| compared to SQL.
|
| But I also use SQL, mostly because of the power that relations
| have, yet I use it only for storing stuff where it's clear
| how the data will look forever, and what (mostly
| simple) queries I need to perform.
|
| I've read your comment as if it were suggesting that these
| kinds of articles are not good, yet for me it was a nice
| overview of some interesting stuff that can be done with
| SQL. I don't chase articles on SQL, so I don't get many to
| read, but this one is among the best I've read.
|
| To get back to the quote: How will I know if SQL can do
| what I want if I don't read these kinds of articles?
| kbelder wrote:
| Familiar. I think the biggest issue is understanding workflow.
| My hunch is that some developers view a data analysis project
| as a program... it needs to have these data processing steps
| performed, summarized in this certain way, and saved off in
| that certain format. Once those queries are defined, the
| program is run and the analysis is done.
|
| The actual work the analyst is doing is far more ad-hoc than
| that. It's iterative and frequently will get torn down,
| rebuilt, and joined to other sources, as they discover new
| aspects to the data. They need their hands on the raw data, not
| on a specifically summarized version of it.
| twobitshifter wrote:
| I now do most of my data analysis in Julia instead of pandas, but
| used pandas for a long time. SQL is a valuable skill and useful
| when you need to optimize a query for performance, which can
| sometimes happen with pandas and is much less likely with Julia.
|
| However, even if a pandas query takes 5X longer to run than a SQL
| query you must consider the efficiency to a developer. You can
| chain pandas commands together that will accomplish something
| that takes 10x the lines of SQL. With SQL you're more likely to
| end up with many intermediate CTEs along the way. So while you
| can definitely save processor cycles by using SQL, I don't think
| you'll save clock-face time by using it in most one off tasks.
| Data science is usually column-oriented, and Julia and pandas allow
| you to stay in that world.
| nomel wrote:
| For me, the main use case of a database server is so I can do
| operations that require more memory than I have on my local
| computer. Reading through this comment section makes me think
| my use case is rare or something. Maybe everyone is ok with
| chunked operations?
| spamalot159 wrote:
| I think for a lot of people, SQL is a skill that doesn't stick.
| You learn enough to do the queries you need for your project,
| they work, then you forget about them as you work on the rest of
| your project. These skills are perishable. Left outer join? Yeah,
| I knew what that was some time ago, but not anymore.
|
| The days of dedicated SQL programmers are mostly gone.
| zabzonk wrote:
| > The days of dedicated SQL programmers are mostly gone.
|
| I've never met a "dedicated SQL programmer" - all the C++
| programmers I've worked with in the investment banking world
| were also expected to know, and did know, SQL pretty well -
| asking questions about it was normally part of the interview
| process.
| simonw wrote:
| I started learning SQL twenty years ago and it's one of the
| only skills from the start of my career that has been
| consistently useful ever since.
| slver wrote:
| We have a lot more clueless developers on average, which makes
| it seem like no one uses SQL anymore beyond basics. But we're
| still here, believe me. No one today is a dedicated X
| programmer, but that doesn't mean we're all clueless about X.
| nisegami wrote:
| I held this view before graduating and moving back to my 3rd
| world home country. For some reason, people here (the
| established ones with 5+ years of experience) see SQL as a
| general purpose programming language with which they do as much
| as possible.
| andrewmcwatters wrote:
| Exactly. I own a software consultancy, and I tried explaining
| this to an engineering manager and he just didn't get it.
|
| For some reason, he couldn't understand that when you write SQL
| queries for a project, you typically do it once, and basically
| never again, with the exception of maybe adding or removing
| columns from the query.
|
| The "hard work," if you can call it that, is all in the joins.
| Then, I completely forget about it.
|
| You spend so much more time in development on everything else.
| muxator wrote:
| > This benchmark does not mention the memory consumed by the
| database itself - this is intentional. [...] Whether you decide
| to use the database or not, the memory is already paid for, so
| you might as well use it!
|
| This sentence is a big red flag for me. An analysis of a strategy
| that pushes work towards a subsystem, and then purposely ignores
| the performance implications on that subsystem, is methodologically
| unsound.
|
| Personally, I am all for using the DB and writing good SQL. But
| if I weren't, this argument would not convince me.
| truth_seeker wrote:
| Pure SQL porn. Very exciting!
|
| Lately I am seeing rather really creative blog posts on HN. Keep
| it up guys.
| flakiness wrote:
| The more I learn SQL, the less I write Python.
|
| Although the SQL syntax is weird and so dated, its portability
| across tools trumps everything else. You finished the EDA and
| decided to port the insights to a dashboard? With SQL it's
| trivial. With Python... well, probably you'll have to port it to
| SQL unless you have Netflix-like, Jupyter-backed dashboard
| infrastructure in place. For many of us who only have a much-more-
| prevalent SQL-based dashboard platform, why not start from
| SQL? Copy-n-paste is your friend!
|
| I still hate SQL as a programmer, but as a non-expert data
| analyst I have now accepted it.
| arcticfox wrote:
| After 15 years programming I have come to love SQL - 10 yrs ago
| I used to do everything in the book to avoid it out of hate,
| even though I knew how to use it, but over time I have moved
| more and more into the database / SQL and my code is just
| better and better as a result.
| smilbandit wrote:
| I'm neutral on SQL but very big on the right tool for the job.
| My rule of thumb is: if I'm querying data just to massage it
| and then stuff it back into the database, then I first try to do
| that as a stored procedure. I remember learning about windowing
| in SQL late in a project, and redoing some Python
| functions as stored procedures really improved performance.
| simonw wrote:
| This article is so useful. It starts out with SQL basics but then
| quickly leaps into all kinds of PostgreSQL tricks that I didn't
| know about - binning, efficient sampling, even
| linear regression.
| nerdponx wrote:
| SQL syntax sucks for doing non-trivial data analysis. I've tried
| it. Verbose, no composability or code reuse, not really portable
| across different databases, no easy interoperability with other
| tools, limited editor/IDE support, etc.
|
| I guess if you have huge amounts of data (10m+ rows) already
| loaded into a database then sure, do your basic summary stats in
| SQL.
|
| For everything else, I'll continue using SQL to get the data from
| the database and use Pandas or Data.Table to actually analyze it.
|
| That said, this is a very comprehensive review of SQL techniques
| which I think could be very useful for when you _do_ have bigger
| datasets and/or just need to get data out of a database
| efficiently. Great writeup and IMO required reading for anyone
| looking to be a serious "independent" data scientist (i.e. not
| relying on data engineers to do basic ETL for you).
|
| I'd be a huge fan of something like the PySpark DataFrame API
| that "compiles" to SQL* (but that doesn't require you to actually
| be using PySpark which is its own can of worms). I think this
| would be a lot nicer for data analysis than any traditional ORM
| style of API, while providing better
| composability and IDE support than writing raw SQL.
|
| *I also want this for Numexpr:
| https://numexpr.readthedocs.io/en/latest/user_guide.html, but
| also want a lot of other things, like Arrow-backed data frames
| and a Numexpr-like C library to interact with them.
| ramraj07 wrote:
| Completely disagree. I have a choice between Snowflake and
| spark/pandas to do my EDA and I'll choose SQL every time. The
| code is significantly more readable once you get used to it and
| you can most definitely do things one step at a time using UDFs
| and temp tables / CTEs. I've come back to EDA I did a year
| back, and it's always easier to read a long SQL script than a
| notebook with pandas code.
| disgruntledphd2 wrote:
| There are pluses and minuses to both. That being said, how do I
| write a function in SQL which can abstract over something I
| do a lot (like the pivoting example earlier), or even some
| date function like iff(date>'important_date', 'before',
| 'after') as grouper.
|
| Honestly, that's what ends up making me move away from doing
| analytics in SQL.
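|
| (The scalar case at least has a standard answer -- a rough
| Postgres sketch, with a hypothetical orders table; the
| set-level composition discussed below is the harder part:)
|
|     import psycopg2
|
|     conn = psycopg2.connect("dbname=analytics")
|     with conn, conn.cursor() as cur:
|         cur.execute("""
|             CREATE OR REPLACE FUNCTION grouper(d date, cutoff date)
|             RETURNS text LANGUAGE sql IMMUTABLE AS
|             $$ SELECT CASE WHEN d > cutoff THEN 'after' ELSE 'before' END $$
|         """)
|         cur.execute("SELECT grouper(order_date, '2021-01-01'), COUNT(*) FROM orders GROUP BY 1")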
| saltcured wrote:
| Edit: oops, I think I replied a level deeper than intended.
| I was responding to the composition/abstraction topic. I
| think I should just leave this here now though?
|
| I assume you are talking about composition of generic set-
| processing routines, but I wonder if others realize that?
| It is easy enough to write a set-returning function and
| wrap it in arbitrary SQL queries to consume its output.
| But, it is not easy to write a set-consuming function that
| can be invoked on an arbitrary SQL query to define its
| input. Thus, you cannot easily build a library of set-
| manipulation functions and then compose them into different
| pipelines.
|
| I think different RDBMS dialects have different approaches
| here, but none feel like natural use of SQL. You might do
| something terrible with cursors. Or you might start passing
| around SQL string arguments to EXECUTE within the generic
| function, much like an eval() step in other interpreted
| languages. Other workarounds are to do everything as macro-
| processing (write your compositions in a different
| programming language and "compile" to SQL you pass to the
| query engine) or to abuse arrays or other variable-sized
| types (abuse some bloated "scalar" value in SQL as a quasi-
| set).
|
| What's missing is some nice, first-class query (closure)
| and type system. It would be nice to be able to write a CTE
| and use it as a named input to a function and to have a
| sub-query syntax to pass an anonymous input to a function.
| Instead, all we can do is expand the library of scalar and
| aggregate functions but constantly repeat ourselves with
| the boilerplate SQL query structures that orchestrate these
| row-level operations.
| disgruntledphd2 wrote:
| Yeah, that's exactly the issue. One can do this in
| Python, but not in SQL, and this leads to boilerplate.
|
| I actually think that SparkSQL is a good solution here,
| as you can create easily reusable functions.
| rdedev wrote:
| For apache arrow backed dataframes check out
| Polars(https://github.com/ritchie46/polars)
| disgruntledphd2 wrote:
| > (i.e. not relying on data engineers to do basic ETL for you).
|
| Is this actually a thing? Surely it can't be a thing.
| denimnerd42 wrote:
| of course, that's my job... we do basic ETL on hundreds of
| data sources to provide data to the analysts and quants
| EForEndeavour wrote:
| In a sufficiently large org, why not?
| dreyfan wrote:
| > SQL syntax sucks for doing non-trivial data analysis. I've
| tried it. Verbose, no composability or code reuse, not really
| portable across different databases, no easy interoperability
| with other tools, limited editor/IDE support, etc.
|
| You're entitled to your opinion and tooling choices of course,
| but the problem is you don't know SQL.
| dunefox wrote:
| That could be said about any tool, framework, language,
| library, etc. Invest enough time and you can do everything
| with malbolge but that doesn't mean it doesn't suck.
| mixmastamyk wrote:
| A more useful comment would be to illustrate how. I'd like to
| know as well, as I need to reuse queries in SQL occasionally
| but am not an expert. Currently I believe learning pl/pgsql
| is the answer, but even it reeks of punch-cards and other
| sixtiesisms :-). Tough sell when you're used to Python etc.
| pandasusr wrote:
| I am a pandas user considering refactoring part of my ETL
| pipeline to SQL. I see the trade off as memory efficiency vs
| expressiveness, and for simple queries on big data, SQL wins.
| Would you disagree that Pandas/Python is more expressive than
| SQL? I'm less experienced in SQL but based on my limited
| experience there, it seems Pandas is clearly more expressive.
| What is the SQL equivalent of Pandas .apply(lambda x) ?
| higeorge13 wrote:
| Please define an example lambda function, in order to see
| whether there is an SQL equivalent. IMO >90% of the data
| issues I have seen can be solved with SQL queries. I have
| seen some in the finance sector which would require super
| complex UDFs, but other than those, SQL is the first choice
| to solve a data issue.
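|
| For a typical simple case (a made-up bucketing column, not any
| specific real use case), the usual translation is a CASE
| expression:
|
|     import pandas as pd
|
|     df = pd.DataFrame({"amount": [10, 250, 40]})
|
|     # pandas: row-wise lambda
|     df["bucket"] = df["amount"].apply(lambda x: "large" if x > 100 else "small")
|
|     # SQL equivalent, run in the database instead of in Python:
|     sql = """
|     SELECT amount,
|            CASE WHEN amount > 100 THEN 'large' ELSE 'small' END AS bucket
|     FROM transactions
|     """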
| pandasusr wrote:
| I do work in finance sector and need complex functions :)
|
| And it's interesting that pandas was invented at a hedge
| fund, AQR.
|
| I agree that SQL may be better for vanilla BI.
| hodgesrm wrote:
| ClickHouse has lambdas for arrays. They are very useful.
| Here's an example.
|
|     WITH ['a', 'bc', 'def', 'g'] AS array
|     SELECT arrayFilter(v -> (length(v) > 1), array) AS filtered
|
|     +-filtered-----+
|     | ['bc','def'] |
|     +--------------+
|
| The lambda in this case is a selector for strings with more
| than one character. I would not argue that they are as
| general as Pandas, but they are mighty useful. More examples
| in the following article.
|
| https://altinity.com/blog/harnessing-the-power-of-
| clickhouse...
| default-kramer wrote:
| "No composability or code reuse" is definitely a valid
| criticism of SQL. Check out the very first example of my
| personal project https://docs.racket-
| lang.org/plisqin/Read_Me_First.html and let me know how you
| would implement something akin to `(CategoryName p)` and
| `(TotalSales p)` in SQL. A view does not count, because then
| you either have one view per derived field and that is very
| cumbersome to consume, or you have one view with all the
| derived fields and you will end up paying performance costs
| for fields that you are not using.
| ok123456 wrote:
| Don't views and materialized views give you reuse?
|
| Can't you do composability with foreign references?
| neurocean wrote:
| Sure, while views may be fine for small projects, using
| them for larger projects like data warehousing is usually
| a mistake you'll come to regret.
|
| SQL views are rarely unit tested so you always end up in
| a regression nightmare when you need to make updates.
|
| If you're going to go pure SQL, you should use something
| like dbt.
| hodgesrm wrote:
| > I guess if you have huge amounts of data (10m+ rows) already
| loaded into a database then sure, do your basic summary stats
| in SQL.
|
| This is not huge. This is quite small. Pandas can't handle data
| that runs into billions of rows _or_ sub-second response on
| arbitrary queries. Both are common requirements in many
| analytic applications.
|
| I like Pandas. It's flexible and powerful if you have
| experience with it. But there's no question that SQL databases
| (especially data warehouses) handle large datasets and low
| latency response far better than anything in the Python
| ecosystem.
| nerdponx wrote:
| _This is not huge. This is quite small. Pandas can't handle
| data that runs into billions of rows or sub-second response
| on arbitrary queries. Both are common requirements in many
| analytic applications._
|
| You're right. It's "huge" with respect to what you can expect
| to load into Pandas and get instant results from. But it's
| not even "medium" data on the small-medium-big spectrum.
|
| _I like Pandas. It's flexible and powerful if you have
| experience with it. But there's no question that SQL
| databases (especially data warehouses) handle large datasets
| and low latency response far better than anything in the
| Python ecosystem._
|
| I agree with this 100%, but I think a lot of people missed it
| in my post.
| hodgesrm wrote:
| OTOH Pandas/Numpy are great for results. One pattern that
| I'm hoping to try is running heavy lifting in SQL, transfer
| results via Apache Arrow, and manipulate the result set in
| Python with aforesaid libraries.
|
| I don't know enough about how Arrow handles streaming to
| understand how this would really work but it would break
| away from the single-threaded connectivity with expensive
| ser/deser that databases have used since the days of Sybase DB-
| Library, the predecessor to ODBC. 36 years is probably long
| enough for that model.
| neurocean wrote:
| Yeah but that's not even close to being Pandas' value
| proposition. It's for the data scientist and analyst, not the
| db admin, data engineer or production applications.
| beforeolives wrote:
| Nice article. Pandas gets the job done but it's such a step
| backwards in terms of usability, API consistency and code feel.
| You can do anything that you can possibly need with it but you
| regularly have to look up things that you've looked up before
| because the different parts of the library are patched up
| together and don't work consistently in an intuitive way. And then
| you end up with long lines
| of().chained().['expressions'].like_this(0).
| nvilcins wrote:
| You might be thinking of specific functionality that you find
| is being implemented in an overly long/verbose fashion.. But
| generally speaking, how are
|
| > long lines of().chained().['expressions'].like_this(0)
|
| a _bad thing_?
|
| IMHO these pandas chains are easy to read and communicate quite
| clearly what's being done. If anything, I've found that in my
| day-to-day while reading pandas I parse the meaning of those
| chains at least as efficiently as from comments of any level of
| specificity, or from what other languages (that I have had
| experience with) would've looked like.
| musingsole wrote:
| People don't like them because the information density of
| pandas chains is soooo much higher than the rest of the
| surrounding code. So, they're reading along at a happy place,
| consuming a few concepts per statement
|
| ...and then BOOM, pandas chain! One statement containing 29+
| concepts and their implications.
|
| Followed by more low density code. The rollercoaster leads to
| complaints because it _feels_ harder. Not because of any
| actual change in difficulty.
|
| /that's my current working theory, anyway
| disgruntledphd2 wrote:
| Hmmm, interesting. I don't mind the information density,
| coming from R which is even more terse, but the API itself
| is just not that well thought out (which is fair enough, he
| was learning as he went).
| isoprophlex wrote:
| This really grinds my gears too. There's something about the
| pandas API that makes it impossible for me to do basic ops
| without tedious manual browsing to get inplace or index
| arguments right... assignments and conditionals are needlessly
| verbose too.
|
| Pyspark on the other hand just sticks in my brain, somehow.
| Chained pyspark method calls look much neater.
| ziml77 wrote:
| Setting inplace=True isn't too bad, but I definitely have had
| many issues with working with indexes in Pandas. I don't
| understand why they didn't design it so that the index can be
| referenced like any other column. It overcomplicates things
| like having to know the subtle difference between join() and
| merge().
| nojito wrote:
| Setting Inplace=True is not recommended and should be used
| with caution
|
| https://github.com/pandas-dev/pandas/issues/16529
| ziml77 wrote:
| I've never seen anything about inplace being a poor idea
| to use. Is that documented anywhere or is that ticket the
| only info about it?
| Peritract wrote:
| It's not particularly front-and-centre, but it's all over
| various discussion boards if you go looking for it
| directly. I'd like the debate to be more visible,
| personally, particularly whenever the argument gets
| deprecated for a particular method.
|
| Essentially, `inplace=True` rarely actually saves memory,
| and causes problems if you like chaining things together.
| The people who maintain the library/populate the
| discussion boards are generally pro-chaining, so
| `inplace` is slowly and quietly on its way out.
| disgruntledphd2 wrote:
| Pandas is just a bad API, to be honest. I know base-R very,
| very well (which was one of the inspirations, I believe) and
| I still spend most of my time looking stuff up.
|
| It's such a shame that python doesn't have a better DF
| library.
| shankr wrote:
| I also switched from R to Python/pandas. I remember always
| being frustrated with pandas since it tries to emulate
| data.frame, but then just does its own thing without being
| consistent.
| EForEndeavour wrote:
| I've used pandas regularly for the past ~5 years and find
| its API intuitive enough not to complain. I can write and
| read decently long pandas chained expressions fluently, but
| I barely know any R. Am I unwittingly a hostage of an
| inferior API and don't know what I'm missing?
| disgruntledphd2 wrote:
| Yes.
|
| Base R is OK, but dplyr is magical.
|
| For instance, integer indexing in base R is df[row,col]
| rather than the iloc pandas stuff.
|
| plot, print and summary (and generic function OOP more
| generally) are really underappreciated.
|
| Python is a better programming language, but R is a
| better data analysis environment.
|
| And dplyr is an incredibly fluent DSL for doing data
| analysis (not quite as good for modelling though).
|
| Seriously, I read the original vignette for dplyr in late
| 2013/early 2014 and within two weeks I'd switched most of
| my new analytical code over to it. So very, very good.
| Less idea-impedance mismatch than any other environment, in
| my experience.
| shankr wrote:
| I was actually using data.table. The syntax can be a bit
| cryptic, but you get used to it.
| [deleted]
| shankr wrote:
| This might not seem like a big issue but in the
| beginning, these were my issues
|
| 1. Not so easy way to rename columns during aggregation
|
| 2. The group by generates its own grouped by data and
| hence you almost always need `reset_index`
|
| 3. Sometimes group by can convert a dataframe to series
|
| 4. Now `.loc` provides somewhat consistent
| indexing/slicing, but earlier you had `.ix`, `.iloc` and
| what not
|
| These are something I can remember from top of my head.
| Of course all of these have solutions, but it makes
| pandas much more verbose. In R, these are just much more
| succinct.
| carabiner wrote:
| Can you honestly say you'd prefer to debug _thousands_ of
| lines of SQL versus the usual <100 lines of Python/pandas that
| does the same thing? It's no contest. A histogram in pandas:
| df.col.hist(). Unique values: df.col.value_counts(). Off the
| top of your head, what is the cleanest way of doing this in SQL
| and how does it compare? How anyone can say that SQL is
| objectively better (in readability, density, any metric) other
| than the fact that they learned it first and now are frustrated
| that they have to learn another new tool baffles me.
|
| I learned Pandas first. I have no issue with indexing,
| different ways of referencing cells, modifying individual rows
| and columns, numerous ways of slicing and dicing. It gets a
| little sprawling but there's a method to the madness. I can
| come back to it months later and easily debug. With SQL, it's
| just madness and 10x more verbose.
| beforeolives wrote:
| > Can you honestly say you'd prefer to be debug thousands of
| lines of SQL versus the usual <100 lines of Python/pandas
| that does the same thing?
|
| That's fair, I was using the opportunity to complain about
| pandas and didn't point out all the problems SQL has for some
| tasks. What I really want is a dataframe library for Python
| that's designed in a more sensible way than pandas.
| marktl wrote:
| A distinct count and a GROUP BY in SQL will give you distinct
| counts and a histogram (of numerical values). Like NoSQL vs
| relational, it seems there are use cases that lend themselves
| to new vs old technologies. It's hard for me to put much stock
| in opinions from those that have vastly more experience in one
| of the tools being compared than the other.
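| For comparison, a rough sketch of the GROUP BY equivalent of
| `value_counts()` (sqlite3 is used here only to keep the
| snippet self-contained; the SQL itself is generic):
|
|     import sqlite3
|     import pandas as pd
|
|     conn = sqlite3.connect(":memory:")
|     pd.DataFrame({"col": ["a", "b", "a", "a"]}).to_sql(
|         "t", conn, index=False)
|
|     # one row per distinct value with its count, like value_counts()
|     counts = pd.read_sql(
|         "SELECT col, COUNT(*) AS n FROM t GROUP BY col ORDER BY n DESC",
|         conn,
|     )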
| nerdponx wrote:
| I much prefer writing actual functions in a real programming
| language to the verbose non-reusable non-composable relic that
| is SQL syntax.
|
| Give me a better query language and I will gladly drop Pandas
| and Data.Table.
| semitones wrote:
| SQL is a real programming language.
| thinkharderdev wrote:
| It's not Turing complete, which I think is why it doesn't
| feel like a "real" programming language.
| sagarm wrote:
| Recursive CTEs exist; while I haven't investigated
| thoroughly, I would be surprised if they didn't make SQL
| Turing complete.
|
| More usefully, most databases allow you to write user-
| defined functions in other languages, including
| imperative SQL dialects.
| steve-chavez wrote:
| SQL is Turing complete. A demonstration with recursive
| CTEs is given here[1].
|
| [1]: https://wiki.postgresql.org/wiki/Cyclic_Tag_System
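| Not a proof of anything, but for readers who haven't seen one,
| a small sketch of a recursive CTE (run through sqlite3 only so
| the snippet is standalone):
|
|     import sqlite3
|
|     conn = sqlite3.connect(":memory:")
|     rows = conn.execute("""
|         WITH RECURSIVE nums(n) AS (
|             SELECT 1
|             UNION ALL
|             SELECT n + 1 FROM nums WHERE n < 5
|         )
|         SELECT n FROM nums
|     """).fetchall()
|     # rows == [(1,), (2,), (3,), (4,), (5,)]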
| beforeolives wrote:
| Yes, SQL does have the problem of bad composability and some
| things being less explicit than they are when working with
| dataframes. Dplyr is probably the most sane API out of all of
| them.
| alexilliamson wrote:
| +1 for dplyr. I've used both pandas and dplyr daily for a
| couple years at different times in my career, and there is
| no comparison in mind when it comes to
| usability/verbosity/number of times I need to look at the
| documentation.
| m16ghost wrote:
| >Pandas is a very popular tool for data analysis. It comes built-
| in with many useful features, it's battle tested and widely
| accepted. However, pandas is not always the best tool for the
| job.
|
| SQL is very useful, but there are some data manipulations which
| are much easier to perform in pandas/dplyr/data.table than in
| SQL. For example, the article discusses how to perform a pivot
| table, which takes data in a "long" format, and makes it "wider".
|
| In the article, the pandas version is:
|
| >pd.pivot_table(df, values='name', index='role',
| columns='department', aggfunc='count')
|
| Compared to the SQL version:
|
| >SELECT role, SUM(CASE department WHEN 'R&D' THEN 1 ELSE 0 END)
| as "R&D", SUM(CASE department WHEN 'Sales' THEN 1 ELSE 0 END) as
| "Sales" FROM emp GROUP BY role;
|
| Not only does the SQL code require you to know up front how many
| distinct columns you are creating, it requires you to write a
| line out for each new column. This is okay in simple cases, but
| is untenable when you are pivoting on a column with hundreds or
| more distinct values, such as dates or zip codes.
|
| There are some SQL dialects which provide pivot functions like in
| pandas, but they are not universal.
|
| There are other examples in the article where the SQL code is
| much longer and less flexible, such as binning, where the bins
| are hardcoded into the query.
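| One way around the enumeration problem is to let a small piece
| of client code build the conditional aggregation from the data
| itself. A rough sketch (the helper name and the use of a DBAPI
| connection `conn` are my own; real code would also need proper
| quoting of identifiers and values):
|
|     def pivot_sql(conn, table, index_col, pivot_col):
|         # ask the database for the distinct values first...
|         values = [row[0] for row in conn.execute(
|             f"SELECT DISTINCT {pivot_col} FROM {table} ORDER BY 1")]
|         # ...then emit one SUM(CASE ...) per value
|         cases = ", ".join(
|             f"SUM(CASE WHEN {pivot_col} = '{v}' THEN 1 ELSE 0 END)"
|             f" AS \"{v}\""
|             for v in values
|         )
|         return f"SELECT {index_col}, {cases} FROM {table} GROUP BY {index_col}"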
| michelpp wrote:
| Does `tablefunc.crosstab()` do what you want?
|
| https://www.postgresql.org/docs/13/tablefunc.html
| m16ghost wrote:
| It is not much better than the canonical example given in the
| article. It still has the following usability issues:
|
| -You still need to enumerate and label each new column and
| their types. This particular problem is fixed by crosstabN().
|
| -You need to know upfront how many columns are created before
| performing the pivot. In the context of data analysis, this
| is often dynamic or unknown.
|
| -The input to the function is not a dataframe, but a text
| string that generates the pre-pivot results. This means your
| analysis up to this point needs to be converted into a
| string. Not only does this disrupt the flow of an analysis,
| you also have to worry about escape characters in your
| string.
|
| -It is not standard across SQL dialects. This function is
| specific to Postgres, and other dialects have their own
| version of this function with their own limitations.
|
| The article contains several examples like this where SQL is
| much more verbose and brittle than the equivalent pandas
| code.
| papercrane wrote:
| That's one of the non-standard ways to do it. MSSQL and
| Oracle also have a PIVOT function. Unfortunately there is no
| standard way to do this.
| ellis-bell wrote:
| Agreed that pivoting can be a pain.
|
| A pattern that I converged on --- at least in Postgres --- is
| to aggregate your data into JSON objects and then go from
| there. You don't need to know how many attributes (columns)
| should be in the result of your pivot. You can also do this in
| reverse (pivot from wide to long) with the same technique.
|
| So, for example, if you have the schema `(obj_id, key, value)`
| in a long-formatted table, where an `obj_id` will have data
| spanning multiple rows, then you can issue a query like
|
|     SELECT obj_id, jsonb_object_agg(key, value)
|     FROM table
|     GROUP BY obj_id;
|
| up to actual syntax... it's been a while since I've had to do
| a task requiring this, so the details are fuzzy, but the
| pattern's there.
|
| So each row in your query result would look like a JSON
| document: `(obj_id, {"key1": "value", "key2": "value", ...})`
|
| see https://www.postgresql.org/docs/current/functions-json.html
| for more goodies.
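| If the result does end up in pandas afterwards, a rough sketch
| of expanding the aggregated JSON back into columns (assumes a
| Postgres connection `conn`, e.g. from psycopg2, and a
| long-format table named `long_table` with that schema):
|
|     import pandas as pd
|
|     wide = pd.read_sql(
|         "SELECT obj_id, jsonb_object_agg(key, value) AS attrs"
|         " FROM long_table GROUP BY obj_id",
|         conn,
|     )
|     # psycopg2 hands jsonb back as dicts; json_normalize spreads the
|     # keys into columns without knowing them ahead of time
|     wide = wide.drop(columns="attrs").join(
|         pd.json_normalize(wide["attrs"].tolist()))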
| [deleted]
| hantusk wrote:
| Agreed. https://ibis-project.org/ and
| https://dbplyr.tidyverse.org/ can compile dataframe-like input
| to SQL, which might bridge the gap in tooling (although there
| still are small differences to the pure dataframe syntax)
| kbelder wrote:
| I've been doing a lot of data analysis in Pandas recently. I
| started off thinking that for efficiency's sake, I should do as
| much initial processing in the DB as possible, and use Pandas
| just for the higher level functions that were difficult to do
| in SQL.
|
| But after some trial and error, I find it much faster to pull
| relatively large, unprocessed datasets and do everything in
| Pandas on the local client. Faster both in total analysis time,
| and faster in DB cycles.
|
| It seems like running a couple of simple queries such as
| "select * from cars" and "select * from drivers where age <
| 30", and doing all the joining, filtering, and summarizing on
| my machine, is often less burdensome on the db than doing it
| up-front in SQL.
|
| Of course, this can change depending on the specific dataset,
| how big it is, how you're indexed, and all that jazz. Just
| wanted to mention how my initial intuition was misguided.
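| For what it's worth, the pattern described looks roughly like
| this (a hedged sketch; the table and column names are made up
| and `conn` is whatever DB connection you already have):
|
|     import pandas as pd
|
|     # two cheap, unprocessed pulls on the database side
|     cars = pd.read_sql("SELECT * FROM cars", conn)
|     drivers = pd.read_sql("SELECT * FROM drivers WHERE age < 30", conn)
|
|     # join, filter and summarize locally instead of in the query
|     summary = (
|         cars.merge(drivers, on="driver_id", how="inner")
|             .query("mileage > 100000")
|             .groupby("make")["mileage"]
|             .mean()
|     )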
| ziml77 wrote:
| I've always been disappointed by the SQL pivot. It's hardly
| useful for me if I have to know up-front all of the columns
| it's going to pivot out into. The solution would be to use
| another SQL query to generate a dynamic SQL query, but at that
| point I would rather just use Pandas
| [deleted]
| racl101 wrote:
| I think I do a healthy mixture of both.
| codeulike wrote:
| I'm so glad to have been around long enough that SQL is now being
| seen as exotic again
| thinkharderdev wrote:
| Not long after I got into software engineering is when the
| first NoSQL craze began. I imagine that those who started not
| long after me and spent a few years wrangling with MongoDB
| queries (shudder) and trying to do joins, transactions and
| consistency guarantees in application code must see RDBMS as
| some sort of magical new technology.
| dmitrykoval wrote:
| Following similar observations, I was wondering if one could
| actually execute SQL queries inside a Python process with
| access to native Python functions and Numpy as UDFs. Thanks to
| Apache Arrow, one can essentially combine a DataFrame API with
| SQL within data analysis workflows, without the need to copy
| the data or write operators in a mix of C++ and Python, all
| within the confines of the same Python process.
|
| So I implemented Vinum, which allows you to execute queries
| that may invoke Numpy or Python functions as UDFs available to
| the interpreter. For example: "SELECT value, np.log(value)
| FROM t WHERE ..".
|
| https://github.com/dmitrykoval/vinum
|
| Finally, DuckDB is making great progress integrating pandas
| dataframes into its API, with UDF support coming soon. I would
| certainly recommend giving it a shot for OLAP workflows.
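| For anyone curious what the DuckDB side of this looks like, a
| minimal sketch (toy data of my own; DuckDB can scan a pandas
| DataFrame that is in scope and hand the result back as a
| DataFrame):
|
|     import duckdb
|     import pandas as pd
|
|     df = pd.DataFrame({"value": [1.0, 2.0, 3.0]})
|
|     # the query references the local DataFrame "df" directly,
|     # no copy into a separate server
|     out = duckdb.query("SELECT value, value * 2 AS doubled FROM df").to_df()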
| justsomeuser wrote:
| Also, I think SQLite lets you call Python functions from
| within SQL.
| dmitrykoval wrote:
| That's correct, but SQLite has to serialize/deserialize the
| data sent to the Python function (from C to Python and back),
| while Arrow lets you get a "view" of the same data without
| making a copy. Which is probably not an issue in OLTP
| workloads, but may become more visible in OLAP.
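| For reference, the SQLite route is roughly this (a small
| sketch using sqlite3's `create_function`):
|
|     import math
|     import sqlite3
|
|     conn = sqlite3.connect(":memory:")
|     # register a Python callable as a one-argument SQL function
|     conn.create_function("py_log", 1, math.log)
|
|     conn.execute("CREATE TABLE t (value REAL)")
|     conn.executemany("INSERT INTO t VALUES (?)", [(1.0,), (2.0,), (3.0,)])
|     rows = conn.execute("SELECT value, py_log(value) FROM t").fetchall()
|     # every call crosses the C <-> Python boundary, which is the
|     # overhead mentioned above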
| sbuttgereit wrote:
| The code:
|
|     WITH dt AS (
|         SELECT unnest(array[1, 2]) AS n
|     )
|     SELECT * FROM dt;
|
| Is more complex than necessary. This produces the same result:
|
|     SELECT n FROM unnest(array[1, 2]) n;
|
|     +---+
|     | n |
|     +---+
|     | 1 |
|     | 2 |
|     +---+
|     (2 rows)
|
| I think I see some other opportunities as well.
|
| I know the code is from a section dealing with CTEs, but CTEs
| aren't needed for every situation, including things like using
| VALUES lists. Most people in the target audience can probably
| ignore this next point (they're likely on a newer version of
| PostgreSQL), but older versions of PostgreSQL (prior to 12)
| would materialize the CTE results before processing the main
| query, which is not immediately obvious.
| simonw wrote:
| I think that was a deliberate choice by the author to
| consistently demonstrate CTEs.
| sbuttgereit wrote:
| Sorry was editing to address this point probably while you
| were typing your response. It's well taken, but we should be
| clear that it's not required.
| liprais wrote:
| It seems the only reason people on HN hate SQL is that they
| don't understand SQL and/or have already bought into another
| toolset. If you only have a hammer at hand, everything looks
| like a nail.
___________________________________________________________________
(page generated 2021-05-03 23:00 UTC)