[HN Gopher] Exploratory data analysis for humanities data
       ___________________________________________________________________
        
       Exploratory data analysis for humanities data
        
       Author : yarapavan
       Score  : 95 points
       Date   : 2023-10-06 16:31 UTC (6 hours ago)
        
 (HTM) web link (awk.dev)
 (TXT) w3m dump (awk.dev)
        
       | simonw wrote:
       | Wow, imagine being a humanities major and having Brian Kernighan
       | teach you Awk!
        
       | smlavine wrote:
       | If I were at Princeton, I would take every one of Kernighan's
       | classes that I could! I wonder if that's a problem there.
        
         | EvanKelly wrote:
         | I'm at this point 15 years removed, but Prof Kernighan was one
         | of the most accessible professors and taught the most popular
         | CS survey course (333).
         | 
          | At least half a dozen times I was pointed in his direction
          | by another professor. Once, Kernighan spent an hour with me
          | looking into how to scrape a dynamic website for my auction
          | theory project; when he was stumped, he introduced me to a
          | professor at another school who he knew had looked into the
          | topic.
        
       | yarapavan wrote:
       | Course website (linked from the article): https://www.hum307.com/
        
       | mbb70 wrote:
       | I'm all for this kind of exploratory hacking around before
       | booting up python/R/Excel/duckdb, especially in constrained
       | environments. A classic pain point is having to deal with column
       | numbers, so I'll share my favorite trick:
       | 
       | `head -n1 /path/to/file.csv | tr ',' '\n' | nl | grep
       | desired_column`
       | 
       | gives you the column number of desired_column
        
         | tejtm wrote:
          | Yep - before I knew about `nl`, I used `... | grep -n
          | column_header` or `... | grep -n .` to replicate the 'nl'
          | behavior.
          | 
          | edit: I like your 'nl' better since it uses whitespace
          | instead of a colon as the separator.
        
         | patrec wrote:
         | Unless there is a quoted comma or an empty column beforehand
         | (nl "helpfully" skips empty lines for numbering purposes).
        
         | [deleted]
        
         | i15e wrote:
          | Something to watch out for with _nl_ is that by default it
          | doesn't number empty lines, e.g.:
          | 
          |       $ printf 'one\n\nthree\n' | nl
          |            1  one
          |            2  three
         | 
         | Set _-ba_ to enable numbering all lines.
         | 
         | For this use case I usually end up running _cat -n_ instead
         | since I find it easier to remember.
        
         | chaps wrote:
         | grep -n also works in place of `nl`!
        
       | pmarreck wrote:
       | Recent Awk convert (after, like most people, just using it for
       | one-liners for years); it's aged remarkably well (although I wish
       | it used more functional constructs, permitted proper variable
       | initialization, and had interrupt handling... but at that point,
       | it's probably best to switch to a "full" language...)
        
       | jph wrote:
       | Awk is awesome and Dr. Kernighan has taught me so much.
       | 
       | If you like exploratory data analysis using awk, you may like the
       | "num" command:
       | 
       | https://github.com/numcommand/num
       | 
        | Num uses awk for command-line statistics, such as standard
        | deviation, kurtosis, quartiles, uniqueness, ordering, and
        | more. Num runs on a very wide range of Unix systems,
        | including systems without package managers.
       | 
       | Feature requests and PRs are welcome.
        
       | qsort wrote:
       | The article nails down a very real pain point with libraries like
       | Pandas:
       | 
       | > looping over a set of input lines seems more natural than the
       | dataframe selectors that Pandas favors
       | 
        | Row-oriented operations, as opposed to aggregations and other
        | OLAP-style queries, are kind of painful. Python's generator
        | machinery (yield from) is a partial fix, but Pandas itself
        | offers little relief.
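        | 
        | A minimal sketch of the contrast, with a hypothetical
        | books.csv and made-up column names:
        | 
        |     import csv
        |     
        |     # Row-oriented, awk-style: loop over input records and
        |     # keep the ones you care about.
        |     def old_titles(path):
        |         with open(path, newline="") as f:
        |             for row in csv.DictReader(f):
        |                 if int(row["year"]) < 1900:
        |                     yield row["title"]
        |     
        |     # The Pandas way leans on dataframe selectors instead:
        |     #   df = pd.read_csv(path)
        |     #   df.loc[df["year"] < 1900, "title"]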
        
         | esafak wrote:
         | pandas has a poor API. I'd rather use SQL with DuckDB.
        
           | importantbrian wrote:
           | This has become my workflow too. Admittedly though I've spent
           | most of my career writing large amounts of SQL, and was a
           | pretty heavy Tidyverse user for a while, so that all makes a
           | lot more sense to me than Pandas. I generally get my data
           | into whatever shape I need it in and then load it into
           | pandas.
        
           | slt2021 wrote:
            | pandas is way more powerful than most people give it
            | credit for.
            | 
            | When you have to deal with thousands of text files - a
            | mishmash of csv and tsv, some rows overlapping between
            | files, files spread across multiple locations (shared
            | drive, s3 bucket, URL, SQL db, etc.), with column names
            | that look similar but not quite the same - that is the
            | perfect use case for pandas.
            | 
            | Read a csv file? Just pd.read_csv().
            | 
            | Read and concat N csv files? Just
            | pd.concat([pd.read_csv(f) for f in glob("*.csv")]).
            | 
            | Read parquet, or read_sql()? Not a problem at all.
            | 
            | Need custom rules for data cleansing, regex or fuzzy
            | matching on column names, or conversion between csv,
            | parquet, and sql? It's a pandas one-liner.
            | 
            | A lot of painful data processing, cleaning, and correcting
            | is just a one-liner in pandas, and I don't know of a
            | better tool that can beat it - probably tidyr, but that is
            | essentially the same thing for R.
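            | 
            | For instance, a minimal self-contained version of the
            | concat pattern above (glob pattern and files are
            | hypothetical):
            | 
            |     import pandas as pd
            |     from glob import glob
            |     
            |     # Read every CSV in the directory and stack them into
            |     # one frame; ignore_index renumbers the rows instead
            |     # of keeping each file's own index.
            |     df = pd.concat(
            |         (pd.read_csv(f) for f in sorted(glob("*.csv"))),
            |         ignore_index=True,
            |     )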
        
             | IKantRead wrote:
             | > essentially same pandas just for R
             | 
              | You are aware that pandas was designed to replicate the
              | behavior of base R's data frames?
             | 
             | I've been a heavy user of both and R's data frames are
             | still superior to pandas even without the tidyverse.
             | 
              | Pandas is really nice for the use case it was designed
              | for: working with financial data. This is a big part of
              | why Pandas's indices feel so weird for everything else;
              | but if your index is a timestamp in a financial time
              | series, then all of a sudden Pandas makes sense and
              | works great.
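              | 
              | For example, a tiny made-up price series where the
              | DatetimeIndex pulls its weight:
              | 
              |     import pandas as pd
              |     
              |     # Hypothetical daily closing prices, time-indexed.
              |     prices = pd.Series(
              |         [101.0, 102.5, 101.8, 103.2],
              |         index=pd.to_datetime(
              |             ["2023-01-03", "2023-01-04",
              |              "2023-01-05", "2023-01-06"]),
              |     )
              |     returns = prices.pct_change()         # daily returns
              |     weekly = prices.resample("W").last()  # week's last price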
             | 
             | When not working with financial data I try to limit the
             | amount of time my code touches pandas, and increasingly
             | find numpy + regular python works better and is easier to
             | build out larger software with. It also makes it much
             | easier to port your code into another language for use in
             | production (i.e. it's quick and easy to map standard python
             | to language X, but not so much a large amount of non-
             | trivial pandas).
        
               | palae wrote:
               | R also has data.table, which extends data.frame and is
               | pretty powerful and very fast
        
               | slt2021 wrote:
                | With pandas 2.0 and the Arrow backend instead of
                | numpy, pandas became "cloud datalake native" - you can
                | read Arrow files in S3 very efficiently and at large
                | scale, and store/process arbitrarily many files on
                | cheap serverless infra. The Arrow format is also
                | supported by other languages.
                | 
                | With S3 + SQS + Lambda + pandas you can build cheap
                | serverless data processing pipelines and iterate
                | extremely quickly.
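                | 
                | A minimal sketch of that kind of read, assuming
                | pandas 2.x with pyarrow and s3fs installed (bucket and
                | path are made up):
                | 
                |     import pandas as pd
                |     
                |     # Arrow-backed Parquet read straight from S3;
                |     # dtype_backend="pyarrow" keeps columns in Arrow
                |     # memory instead of converting them to numpy.
                |     df = pd.read_parquet(
                |         "s3://my-bucket/events/2023-10-06.parquet",
                |         dtype_backend="pyarrow",
                |     )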
        
               | Karrot_Kream wrote:
                | Do you have any benchmarks on how much data a given
                | lambda can search/process after loading Arrow data?
                | Not trying to argue - I'm curious because I never
                | thought of this architecture myself. I would have
                | thought the time it takes to ingest the Arrow data and
                | then search through it would be too long for a lambda,
                | but I may be totally off base here. I've not played
                | around in detail with lambdas, so I don't have a
                | particularly robust mental model of their limitations.
        
               | slt2021 wrote:
                | Reading/writing Arrow is a zero-serde-overhead
                | operation between memory and disk.
                | 
                | I think of a lambda as a thread: you can put a trigger
                | on the S3 bucket so that each incoming file gets
                | processed. This lets you get around the GIL and invoke
                | your lambda once per mini-batch.
                | 
                | Assuming you have a high volume and frequency of data,
                | you will need to "cool down" your high-frequency data
                | and switch from a row basis (like millions of rows per
                | second) to a mini-batch basis (like one batch file per
                | 100 MB).
                | 
                | This can be achieved by having Kafka with a high
                | partition count on the ingestion side, with a sink to
                | S3.
                | 
                | For each new file in S3 your lambda is invoked and the
                | mini-batch is processed by your Python code. You can
                | right-size your lambda's RAM; I usually reserve 2-3x
                | the size of a batch file for the lambda.
                | 
                | The killer feature is zero ops. Just by tuning your
                | mini-batch size you can regulate how many times your
                | lambda will be invoked.
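                | 
                | A rough sketch of that kind of handler, assuming
                | pandas + pyarrow + s3fs are packaged with the Lambda
                | and the bucket layout is hypothetical:
                | 
                |     import pandas as pd  # pyarrow + s3fs also assumed
                |     
                |     def handler(event, context):
                |         # One invocation per mini-batch file landing
                |         # in the bucket (S3 event notification).
                |         for record in event["Records"]:
                |             bucket = record["s3"]["bucket"]["name"]
                |             key = record["s3"]["object"]["key"]
                |             df = pd.read_parquet(
                |                 f"s3://{bucket}/{key}")
                |             # ... cleanse / transform the mini-batch ...
                |             df.to_parquet(
                |                 f"s3://{bucket}/processed/{key}")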
        
             | esafak wrote:
             | You can do that with other tools too.
             | 
             | https://duckdb.org/docs/data/csv/overview.html
             | 
             | https://duckdb.org/docs/data/parquet/overview
             | 
             | https://duckdb.org/docs/data/multiple_files/overview.html
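              | 
              | For example, with the duckdb Python package (file and
              | column names hypothetical), a whole directory of CSVs
              | can be queried in place:
              | 
              |     import duckdb  # assuming the duckdb package
              |     
              |     # Glob over many CSV files and query them as one
              |     # table, entirely in SQL.
              |     duckdb.sql("""
              |         SELECT author, count(*) AS works
              |         FROM read_csv_auto('data/*.csv')
              |         GROUP BY author
              |         ORDER BY works DESC
              |         LIMIT 10
              |     """).show()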
        
               | slt2021 wrote:
                | Interesting, but I would still prefer pandas for data
                | cleansing/manipulation, just because I won't be
                | limited by SQL syntax - I can always use df.apply()
                | and/or any Python package for custom processing.
                | 
                | Pandas on the Apache Arrow backend is also high
                | performance and compatible with cloud-native data
                | lakes.
                | 
                | Plus, compatibility with the sklearn package is a
                | killer feature: with just a few lines you can bolt an
                | ML model on top of your data.
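                | 
                | A toy illustration of that combination (data and model
                | are made up):
                | 
                |     import pandas as pd
                |     from sklearn.linear_model import LinearRegression
                |     
                |     df = pd.DataFrame(
                |         {"pages": [100, 250, 320, 90],
                |          "price": [9.0, 14.5, 18.0, 8.0]})
                |     # Arbitrary Python in the cleaning step via
                |     # apply(), beyond what plain SQL expresses easily:
                |     df["pages"] = df["pages"].apply(lambda n: max(n, 1))
                |     # The same DataFrame drops straight into sklearn:
                |     model = LinearRegression().fit(
                |         df[["pages"]], df["price"])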
        
             | aidos wrote:
              | It definitely has its place. I like to use it to grab
              | the data, clean it up, and get it out into Python /
              | Postgres. I don't like to have it spreading through the
              | codebase.
        
             | wheresmycraisin wrote:
             | > Pandas is way more powerful
             | 
              | Only if you 1) don't know SQL and 2) are working with
              | tiny datasets that are around 5% of your total RAM.
        
               | faizshah wrote:
                | I guess it depends on who you ask, but personally I
                | can write pandas much faster than I can load data into
                | a DB and then process it. The reason is that pandas'
                | defaults on from_ and to_ are very sane, and you don't
                | need to think about things like escaping strings. It's
                | also easy to deal with nulls quickly in pandas and to
                | rapidly get some EDA graphs, like in R.
               | 
               | The other benefit of pandas is it's in python so you can
               | use your other data analysis libraries whereas with SQL
               | you need to marshal back and forth between python and
               | SQL.
               | 
               | My usual workflow is: Explore data in pandas/datasette,
               | if it's big data I explore just a sample and use bash
               | tools to pull out the sample -> write my notebook in
               | pandas -> scale it up in spark/dask/polars depending on
               | use case.
               | 
               | This is pretty good cause ChatGPT understands pandas,
               | pyspark, and SQL really well so you can easily ask it to
               | translate scripts or give you code for different things.
               | 
                | On scalability: if you need scale, there are many
                | options today for processing large datasets with a
                | dataframe API, e.g. koalas, polars, dask, modin, etc.
        
               | slt2021 wrote:
               | >>Only if you 1) don't know SQL and 2) working with tiny
               | datasets that are around 5% of your total RAM.
               | 
                | This is true only for newbie Python devs who learned
                | about pandas from blogs on medium.com. I have
                | pipelines that process terabytes per day in a
                | serverless data lake, and it requires zero of the DBA
                | work that usually comes with anything *SQL.
        
       | wslh wrote:
        | Ah! That is awkward! Sorry, I couldn't resist - I have all
        | respect for Awk.
        
       ___________________________________________________________________
       (page generated 2023-10-06 23:00 UTC)