Post AW41hdCopyZY6TTccC by dredmorbius@toot.cat
 (DIR) Post #AW41hdCopyZY6TTccC by dredmorbius@toot.cat
       2023-05-27T00:20:49Z
       
       0 likes, 0 repeats
       
       Hacker News front-page analytics
       
       A question about what states were most-frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page.  Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York).  Results are further confounded by other factors.
       
       Thread:  https://news.ycombinator.com/item?id=36076870
       
       HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by providing a list of corresponding date specifications, e.g.:
       
       https://news.ycombinator.com/front?day=2023-05-25
       
       Easy enough.
       
       So I'm crawling that and compiling a local archive.  Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.
       
       But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns.  I can also look at mean points and comments by various dimensions.
       
       One surprise:  as of January 2015, The Guardian is among the most consistently highly-voted sites.  I'd thought HN leaned consistently less liberal.
       
       The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.
       
       Contents are the 30 top-voted stories for each day since 20 February 2007.
       
       If anyone has suggestions for other questions to ask of this, fire away.
       
       And, as of early 2015, top state mentions are:
       
        1. new york:         150
        2. california:       101
        3. texas:             39
        4. washington:        38
        5. colorado:          15
        6. florida:           10
        7. georgia:           10
        8. kansas:            10
        9. north carolina:     9
       10. oregon:             9
       
       NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC).  Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly.
       
       I've also got some city-based analytics.
       
       #hn #hackernews #data #DataAnalysis #WebCrawling
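       
       The list of date specifications for the crawl can be generated mechanically.  A minimal Python sketch, assuming one URL per calendar day from the first archived day (20 February 2007) through the example day above; the function name front_page_urls is hypothetical, and this is not the code used for the crawl itself:

```python
from datetime import date, timedelta

# URL form quoted in the post; the date is appended as YYYY-MM-DD.
BASE_URL = "https://news.ycombinator.com/front?day="

def front_page_urls(first: date, last: date) -> list[str]:
    """Return one front-page URL per day from first to last, inclusive."""
    days = (last - first).days + 1
    return [BASE_URL + (first + timedelta(days=i)).isoformat() for i in range(days)]

urls = front_page_urls(date(2007, 2, 20), date(2023, 5, 25))
print(len(urls))
print(urls[0])
print(urls[-1])
```

       Written to a file, one URL per line, such a list could be handed to a crawler with whatever rate limit seems polite.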
       
 (DIR) Post #AW41hdyJzMZOTnhYem by WomanCorn@schelling.pt
       2023-05-27T00:47:54Z
       
       0 likes, 0 repeats
       
       @dredmorbius
       > I'd thought HN leaned consistently less liberal.
       
       Why did you think that?
       
       Vanilla liberalism seems to be the most common among the tech crowd (though quieter than exotic political stances).
       
 (DIR) Post #AW41hhc8RnCRm9LtdA by dredmorbius@toot.cat
       2023-05-27T00:31:47Z
       
       0 likes, 0 repeats
       
       For the above stories:
       
       "silicon valley" appears another 280 times, nearly 3x as often as "california" alone.
       
       "new york times" and "new york city" appear 61 and 21 times respectively, so reduce New York's score by 82 to 68.
       
       Top cities:  New York, San Francisco, Boston, Hollywood, Seattle, Berkeley, Chicago, Cambridge (unsure which one, as there is MA and UK), Detroit, Tempe.
       
       After this there are some ambiguous names (rockford, jackson, kent) which ... I should probably clarify in my analysis code.  Sadly, many headlines fail to unambiguously identify locations, e.g., Springfield (IL, MO, MA), Kansas City (MO, KS), Wilmington (DE, NC), and names which aren't unambiguously cities (Warren, MI; Allen, TX).  Or both (Aurora -- CO, IL, or meteorological phenomenon?).
       
       Welcome to the fun world of free-text parsing.
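       
       The correction above (150 - 61 - 21 = 68) subtracts the newspaper and city mentions after the fact.  The same exclusion can be made at match time.  A Python sketch under stated assumptions: count_phrase is a hypothetical helper, the titles are made up, and the thread's own analysis code is awk, not this:

```python
import re

def count_phrase(titles, phrase, exclude_next=()):
    """Count case-insensitive whole-word matches of phrase, skipping any
    match immediately followed by one of the words in exclude_next."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    if exclude_next:
        blocked = "|".join(re.escape(word) for word in exclude_next)
        pattern += r"(?!\s+(?:" + blocked + r")\b)"
    regex = re.compile(pattern, re.IGNORECASE)
    return sum(len(regex.findall(title)) for title in titles)

# Made-up titles, for illustration only.
titles = [
    "New York Times launches a new paywall",
    "Why startups are leaving New York",
    "New York City subway maps, redesigned",
]

all_mentions = count_phrase(titles, "new york")
plain_mentions = count_phrase(titles, "new york", ("times", "city"))
print(all_mentions, plain_mentions)
```

       This only handles collisions that can be listed in advance; Springfield-style ambiguity still needs context that an 80-character title often lacks.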
       
 (DIR) Post #AW449PvW0BaAw4lBia by dredmorbius@toot.cat
       2023-05-27T01:15:18Z
       
       0 likes, 0 repeats
       
       @WomanCorn Part of that's a sense from actively using the site (someone else's recent analysis had me as the 16th most active commenter during their analysis period, a bit of a shock to me, honestly), and much from the criticisms you'll find about HN, particularly on the Fediverse.
       
       Techbro / Rothbardian/Randian Libertarian culture, especially, are mentioned / prevalent / frequently encountered in my experience.
       
       One reason I'm looking at the domain (submission site) stats is to get a better sense of that.  My crawl's just under 2/3 complete; I'm waiting on more recent trends to start diving in deeply, though I'm querying data and testing code as the data come in.
       
 (DIR) Post #AW6QYyCEJPcLlh5V6e by dredmorbius@toot.cat
       2023-05-27T20:44:08Z
       
       0 likes, 0 repeats
       
       Crawl complete:
       
           FINISHED --2023-05-27 20:11:03--
           Total wall clock time: 1d 17h 55m 39s
           Downloaded: 5939 files, 217M in 9m 48s (378 KB/s)
       
       NB:  wget performed admirably:
       
           grep 'HTTP request sent' fetchlog | sort | uniq -c | sort -k1nr
           5939 HTTP request sent, awaiting response... 200 OK
             14 HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
              1 HTTP request sent, awaiting response... Read error (Operation timed out) in headers.
       
       Each of the read errors succeeded on a 2nd try.
       
       I'm working on parsing.  Playing with identifying countries most often mentioned in titles right now, on still-partial data (missing the past month or so's front pages).
       
       Countries most likely to be confused with a major celebrity and/or IT/tech sector personality:  Cuba & Jordan.
       
       Country most likely to be confused with a device connection standard:  US (USB).
       
       Raw stats, top-20, THERE ARE ISSUES WITH THESE DATA:
       
            1  US:  1350  (186 matched "USB")
            2  U.S.:  1073 (USA: 59, U.S.A.: 2, America/American: 979)
            3  China:  634
            4  Japan:  526
            5  India:  477
            6  UK:  288
            7  EU:  225 (E.U.: 54)
            8  Russia:  221
            9  Germany:  165
           10  Canada:  162
           11  Australia:  157
           12  Korea:  140 (DRK: 69, SK: 38)
           13  France:  116
           14  Iran:  91
           15  Dutch:  80 (25 Netherlands)
           16  United States:  75
           17  Brazil:  69
           18  North Korea:  69
           19  Sweden:  68
           20  Cuba:  67 (32 "Mark Cuban")
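       
       The "USB" and "Mark Cuban" collisions are substring matches, and word-boundary anchors remove that class of false positive.  A small Python sketch with made-up titles (the patterns and variable names are illustrative, not taken from the thread's awk code):

```python
import re

# \b marks a word boundary, so "US" no longer matches inside "USB",
# and "Cuba" no longer matches inside "Cuban".
US_WORD = re.compile(r"\bUS\b")
CUBA_WORD = re.compile(r"\bCuba\b")

titles = [  # made-up titles, for illustration only
    "US regulators probe USB-C charger recalls",
    "Mark Cuban on startups",
    "Cuba opens public wifi hotspots",
]

naive_us = sum(t.count("US") for t in titles)
strict_us = sum(len(US_WORD.findall(t)) for t in titles)
naive_cuba = sum(t.count("Cuba") for t in titles)
strict_cuba = sum(len(CUBA_WORD.findall(t)) for t in titles)
print(naive_us, strict_us, naive_cuba, strict_cuba)
```

       Word boundaries do nothing for whole-word collisions such as Jordan the country versus Jordan the person; those need context or an exclusion list.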
       
 (DIR) Post #AW6QYyxjSncC91JR9E by dredmorbius@toot.cat
       2023-05-27T22:06:37Z
       
       0 likes, 0 repeats
       
       How Much Colorado Love?  Or:  16 years of Hacker News front-page analytics
       
       I've pulled 5,939 front pages from Hacker News, dating from 20 February 2007 to 25 May 2023, initially to answer the question "how often is Colorado mentioned on the front page?"  (38 times, 5th most frequent US state).  This also affords the opportunity to ask and answer other questions.
       
       Preliminary report:  https://news.ycombinator.com/item?id=36098749
       
       #HackerNews #dataAnalysis #wget #awk #gawk #media #colorado
       
 (DIR) Post #AW6QYzgklPcyOeNOK0 by dredmorbius@toot.cat
       2023-05-27T23:05:31Z
       
       0 likes, 0 repeats
       
       I've confirmed that the story shortfall does represent actual HN experience.  Several days with fewer-than-usual stories, one day of complete outage, mostly in the first year of operations:
       
           2007-03-10:  29
           2007-03-24:  26
           2007-03-25:  25
           2007-05-19:  27
           2007-05-26:  26
           2007-05-28:  29
           2007-06-02:  19
           2007-06-16:  28
           2007-06-23:  17
           2007-06-24:  28
           2007-06-30:  20
           2007-07-01:  28
           2007-07-07:  27
           2007-07-15:  26
           2007-07-28:  27
           2014-01-06:  0
       
 (DIR) Post #AW6QZ0MwEZN6VU754i by dredmorbius@toot.cat
       2023-05-28T03:41:50Z
       
       0 likes, 0 repeats
       
       And I think I've got the per-site reporting bits nicely generated.  This is interesting in a good way.  I do want to verify that I'm not goofing the code, though it smells reasonably sane.
       
       I need to find a better place to dump my reports though, probably a pastebin.
       
       Big finding for 2023 is that openai breaks out, not surprisingly.  Only 11 submissions, but an insanely-high engagement (votes/comments):
       
         openai.com                           11      11728 ( 1066.18 )      7364 (  669.45 )
       
       (You can see that fixed-width by viewing this post/thread directly on toot.cat.)
       
       That's four figures of votes, and high three figures of comments, on average.  I had to bump up my display field-widths to accommodate that.
       
       I'm playing with cutoffs; currently the report is limited to a maximum of 40 sites (roughly:  domains) with > 10 submissions each.
       
       For 2023, that list is:  n/a, arxiv.org, wikipedia.org, youtube.com, nature.com, arstechnica.com, nytimes.com, bloomberg.com, quantamagazine.org, newyorker.com, wsj.com, archive.org, phys.org, theguardian.com, economist.com, ieee.org, science.org, smithsonianmag.com, simonwillison.net, theverge.com, lwn.net, reuters.com, theregister.com, acm.org, gist.github.com, infoq.com, righto.com, wired.com, apnews.com, bbc.com, openai.com, utoronto.ca
       
       ("n/a" is no-domain, essentially a question or post directly to HN.)
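       
       The report row above is a count, two totals, and two per-story means, filtered by the cutoffs described.  A Python sketch of that aggregation, assuming input as (site, votes, comments) tuples; site_report and its column widths are hypothetical, and the per-story split below is invented so that the totals match the openai.com row:

```python
from collections import defaultdict

def site_report(stories, min_submissions=10, max_sites=40):
    """stories: iterable of (site, votes, comments) tuples.
    Returns formatted rows for sites with more than min_submissions
    stories, most-submitted first, capped at max_sites rows."""
    totals = defaultdict(lambda: [0, 0, 0])  # site -> [count, votes, comments]
    for site, votes, comments in stories:
        entry = totals[site]
        entry[0] += 1
        entry[1] += votes
        entry[2] += comments
    kept = [(site, t) for site, t in totals.items() if t[0] > min_submissions]
    kept.sort(key=lambda item: (-item[1][0], item[0]))
    return [
        f"{site:<30} {n:>6} {votes:>10} ({votes / n:>9.2f}) "
        f"{comments:>10} ({comments / n:>9.2f})"
        for site, (n, votes, comments) in kept[:max_sites]
    ]

# Invented per-story values; only the totals (11 stories, 11728 votes,
# 7364 comments) come from the post.  example.org falls below the cutoff.
stories = (
    [("openai.com", 1000, 600)] * 10
    + [("openai.com", 1728, 1364)]
    + [("example.org", 50, 10)] * 3
)
rows = site_report(stories)
print("\n".join(rows))
```

       Sorting by submission count is only one possible ordering; ranking by mean votes or comments would surface low-volume, high-engagement sites instead.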
       
 (DIR) Post #AW6QZ17NRuWCpVqASW by dredmorbius@toot.cat
       2023-05-28T04:32:27Z
       
       0 likes, 1 repeats
       
       So ... part of this data-analysis live-blogging is me talking through the process I use to vet and explore data.
       
       One of the really remarkable aspects of the per-site report is how much some sites have fallen in front-page standing.  That's one bit I was wondering about in the parent toot.  Take the New York Times for example, "nytimes.com".  Here's front-page appearances by year:
       
           2007 216
           2008 333
           2009 270
           2010 202
           2011 168
           2012 184
           2013 191
           2014 271
           2015 289
           2016 362
           2017 343
           2018 396
           2019 326
           2020 83
           2021 64
           2022 67
           2023 29
       
       That is, in 2022 only about one-fifth as many stories made the front page as in 2019.  I'm not sure why that is, with immediate thoughts being the NYT paywall, though that's been in place for over a decade (https://news.ycombinator.com/item?id=4970846) and I don't think it's markedly tighter now than it was four years ago.  At present rates, we're shooting for about 70--75 front-page entries for 2023, consistent with the past three years' trends.
       
       I could ask dang if NYT is penalised now, which is high on my list of suspicions.  Or it could simply be that there's a broader range of sites getting submitted.
       
       One of the challenges in devising reports like this is figuring out how to report salience.  Most-frequently-appearing sites misses out on sites that appear less frequently but get strong engagement, and I'm still thinking through how I want to look at that.  It may just end up being a number of different sorts...
       
       Other notable shifts:  techcrunch, top site for HN's first six years, last appeared in the top-40 in 2020.  Oreilly.com hasn't listed since 2009.  Youtube rates surprisingly high.  plus.google.com hasn't appeared since 2014 (the site didn't die until 2019).
       
       Oh ... and I've just found a pretty annoying parsing issue.
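       
       The 70--75 estimate is a straight-line extrapolation of the year-to-date count.  A worked Python sketch, assuming the 2023 figure of 29 runs through the last crawled day (25 May 2023) and that the rate holds for the rest of the year; projected_year_total is a hypothetical name:

```python
from datetime import date

def projected_year_total(count_so_far: int, as_of: date) -> float:
    """Linearly extrapolate a year-to-date count to a full 365-day year."""
    day_of_year = as_of.timetuple().tm_yday
    return count_so_far * 365 / day_of_year

# 29 nytimes.com front-page stories by day 145 of 2023.
projection = projected_year_total(29, date(2023, 5, 25))
ratio_2022_to_2019 = 67 / 326
print(projection)                    # 73.0
print(round(ratio_2022_to_2019, 2))  # 0.21
```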
       
 (DIR) Post #AW6QZGCBOKH7alCiX2 by dredmorbius@toot.cat
       2023-05-27T23:14:48Z
       
       0 likes, 0 repeats
       
       I'm wanting to test some reporting / queries / logic based on a sampling of data.
       
       Since my file-naming convention follows ISO-8601 (YYYY-MM-DD), I can just lexically sort those.
       
       And to grab a random year's worth (365 days) of reports from across the set:
       
           ls rendered-crawl/* | sort -R | head -365 | sort
       
       (I've rendered the pages, using w3m's -dump feature, to speed processing.)
       
       The full dataset is large enough and my awk code sloppy enough (several large sequential lists used in pattern-matching) that a full parse takes about 10 minutes, so the sampling shown here speeds development better than 10x while still providing representative data across time.
       
       #ShellScripting #StupidBashTricks #Linux #DataAnalysis
       
 (DIR) Post #AW6QZGS8R12GOEfSOe by dredmorbius@toot.cat
       2023-05-28T01:11:21Z
       
       0 likes, 0 repeats
       
       Another question I'd had was how activity (votes, comments) varies over time --- over the lifetime of HN, by day of week, month of year, and page placement (though note that a story's position on the HN front page varies dynamically over the course of the day).
       
       To answer all of that it helps to massage data into a format which can briefly summarise that information, and I've now achieved this:
       
           2007-2-20.1 2007 01 Tuesday ycombinator.com pg 166 58
           2007-2-20.2 2007 02 Tuesday obvious.com beau 24 2
           2007-2-20.3 2007 03 Tuesday pulse2.com perler 21 3
           2007-2-20.4 2007 04 Tuesday pretheory.com wastedbrains 18 5
       
       That's an ISO-8601-style date followed by a decimal and the story-number, effectively a unique identifier per story, followed by the year, the story number (again), the weekday, the submitted site, the submitter, the votes, and comments.
       
       (Hrm:  I could / should add month?)
       
       Now to find out what gold^W signal lies in them thar hills...
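       
       With records in that shape, grouping by any column is a few lines.  A Python sketch, assuming the eight fields are whitespace-separated with no embedded spaces; Story, parse_record, and mean_by are hypothetical names, and the month is derived from the identifier rather than stored:

```python
from collections import defaultdict
from typing import NamedTuple

class Story(NamedTuple):
    story_id: str   # date plus rank, e.g. "2007-2-20.1"
    year: int
    rank: int
    weekday: str
    site: str
    submitter: str
    votes: int
    comments: int
    month: int      # derived from story_id, per the "add month?" aside

def parse_record(line: str) -> Story:
    story_id, year, rank, weekday, site, submitter, votes, comments = line.split()
    month = int(story_id.split("-")[1])
    return Story(story_id, int(year), int(rank), weekday, site, submitter,
                 int(votes), int(comments), month)

def mean_by(stories, key, value):
    """Mean of attribute `value`, grouped by attribute `key`."""
    sums = defaultdict(lambda: [0, 0])  # key -> [total, count]
    for story in stories:
        bucket = sums[getattr(story, key)]
        bucket[0] += getattr(story, value)
        bucket[1] += 1
    return {k: total / n for k, (total, n) in sums.items()}

records = """\
2007-2-20.1 2007 01 Tuesday ycombinator.com pg 166 58
2007-2-20.2 2007 02 Tuesday obvious.com beau 24 2
2007-2-20.3 2007 03 Tuesday pulse2.com perler 21 3
2007-2-20.4 2007 04 Tuesday pretheory.com wastedbrains 18 5
""".splitlines()

stories = [parse_record(r) for r in records]
print(mean_by(stories, "weekday", "votes"))
print(mean_by(stories, "rank", "comments"))
```

       The same mean_by call covers year, weekday, month, and page position, which are the dimensions listed above.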
       
 (DIR) Post #AW6QZUJmsqg1OOrzKS by dredmorbius@toot.cat
       2023-05-28T04:35:21Z
       
       0 likes, 0 repeats
       
       Issue:  HN usually encodes the submitted site in parentheses at the end of the title.  Occasionally ... non-site-related stuff might be found there.  I'm finding that.
       
       Good news is that it's mostly low-significance stuff --- "domains" which appear only once and are obviously not domains.  But it somewhat mucks with a few assumptions I'd made and I now have to figure out how to filter those out w/ my regexes.
       
       The alternative is to parse HTML directly, which I've been avoiding (it's tedious and the tools that are most useful aren't quite at hand), but might still be in the cards.
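       
       One possible regex filter is to accept the trailing parenthetical only when it looks like a hostname.  A Python sketch, assuming each input is a title line ending in the optional "(site)" suffix; both patterns and the submitted_site name are illustrative guesses, not the thread's actual regexes:

```python
import re

# Last parenthesised token on the line, with no spaces or nested parens.
TRAILING_PARENS = re.compile(r"\(([^()\s]+)\)\s*$")
# Dotted labels ending in an alphabetic TLD, optionally followed by a path.
DOMAIN_LIKE = re.compile(r"^(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?$", re.IGNORECASE)

def submitted_site(title_line: str):
    """Return the trailing site if it looks like a domain, else None."""
    match = TRAILING_PARENS.search(title_line)
    if match and DOMAIN_LIKE.match(match.group(1)):
        return match.group(1).lower()
    return None

for line in [  # made-up title lines, for illustration only
    "Show HN: A tiny static site generator (example.com)",
    "The Rise of Worse is Better (1991)",
    "What happened next (video)",
    "Ask HN: How do you parse free text?",
]:
    print(submitted_site(line))
```

       A title that happens to end in something domain-shaped will still slip through, so parsing the HTML remains the more reliable route.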
       
 (DIR) Post #AW6QZXWIylTLK5kWky by dredmorbius@toot.cat
       2023-05-28T03:11:16Z
       
       0 likes, 0 repeats
       
       And some date-based analysis posted to HN:
       
       https://news.ycombinator.com/item?id=36100782
       
       (That site handles monospaced text pastes better than most Fediverse clients.)
       
       Looking at year-over-year trends overall (posts, votes, comments, and averages of those), as well as day-of-week trends.  Latter shows Tue/Wed as usually most active, a big Fri fall-off, Saturday as the slow day of the week, and an up-trend come Sunday.
       
 (DIR) Post #AW6QZXfWQUqRmg3t7g by dredmorbius@toot.cat
       2023-05-27T23:37:52Z
       
       0 likes, 0 repeats
       
       ... and if I want to bump that sampled number up or down, I can use bash arithmetic, say:
       
           head -$((365*4)) -> 4 years' worth of data, sampled across the full range
           head -$((365/4)) -> 1/4 year of data (integer division truncates to 91 files)
       
 (DIR) Post #AW6QZXrvgMlmP9rnSi by dredmorbius@toot.cat
       2023-05-27T23:31:47Z
       
       0 likes, 0 repeats
       
       Clarifying:
       
       The 'ls' returns the full list of files.
       
       "sort -R" randomises that list.
       
       "head -365" grabs whatever 365 files happen to be at the top of that sort.
       
       And the final "sort" re-orders the (sampled) files so that the reporting I do is in date-order, as I'm expecting it to be.
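       
       For comparison, the same shuffle, take-k, re-sort pipeline in Python.  A sketch only: sample_files is a hypothetical name, and the file names are stand-ins rather than a real directory listing:

```python
import random

def sample_files(names, k, seed=None):
    """Pick k names at random without replacement, returned in sorted order.
    Zero-padded YYYY-MM-DD names sort lexically into date order."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(names), k))

# Hypothetical file names standing in for the rendered crawl directory.
names = [f"2022-01-{day:02d}" for day in range(1, 32)]
picked = sample_files(names, 5, seed=1)
print(picked)

print(365 * 4)   # 1460: four years' worth of files
print(365 // 4)  # 91: integer division, as with bash $((365/4))
```

       Passing a seed makes a sample repeatable between runs, which "sort -R" alone does not.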
       
 (DIR) Post #AW757sIWENxbmBECaO by niplav@schelling.pt
       2023-05-28T12:10:25Z
       
       0 likes, 0 repeats
       
       @dredmorbius "bash arithmetic", two of the scariest words in the English language
       
 (DIR) Post #AW75gm839tOgmMyOhs by dredmorbius@toot.cat
       2023-05-28T12:16:42Z
       
       0 likes, 0 repeats
       
       @niplav You're either hanging out in the right or wrong places.  I'm not sure which.
       
       But I've got a Very Simple Bash Script I can trust to answer that question for me ...
       
 (DIR) Post #AW7uo9HWZRQoH9g69Q by dredmorbius@toot.cat
       2023-05-28T21:23:33Z
       
       0 likes, 1 repeats
       
       HN Front Page / Global Cities Mentions
       
       One question I've had about HN is how well or poorly it represents non-US (or even non-Silicon Valley) viewpoints and issues.
       
       Pulling from the Globalization and World Cities Research Network list, the top 50 global city names appearing in HN front-page titles:
       
         1   191  San Francisco
         2   164  London
         3   117  Boston
         4    86  Seattle
         5    60  Tokyo
         6    58  Paris
         7    56  Chicago
         8    56  Hong Kong
         9    55  New York City
        10    50  Berlin
        11    50  Phoenix
        12    45  Rome
        13    40  Detroit
        14    36  Singapore
        15    31  Vancouver
        16    30  Los Angeles
        17    27  Austin
        18    23  Beijing
        19    20  Dubai
        20    19  Shenzhen
        21    19  Toronto
        22    17  Amsterdam
        23    16  Copenhagen
        24    16  Houston
        25    16  Moscow
        26    15  Atlanta
        27    14  Barcelona
        28    14  Denver
        29    13  Baltimore
        30    13  San Jose
        31    13  Stockholm
        32    12  San Diego
        33    12  Sydney
        34    11  Cairo
        35    10  Munich
        36    10  Wuhan
        37     9  Helsinki
        38     9  Miami
        39     9  Mumbai
        40     9  Philadelphia
        41     9  Shanghai
        42     9  Vienna
        43     8  Montreal
        44     7  Beirut
        45     7  Dublin
        46     7  Istanbul
        47     6  Bangalore
        48     6  Dallas
        49     6  Kansas City
        50     6  Minneapolis
       
       (Best viewed in original on toot.cat.)
       
       Note that some idiosyncrasies affect this, e.g., "New York City" appears rarely, whilst "New York" may refer to the city, state, any of several newspapers, universities, etc.  "New York" appears 315 times in titles (mostly as "New York Times").
       
       I've independently verified that, for example, "Ho Chi Minh City" doesn't appear, though "Ho Chi Minh" alone does:  https://news.ycombinator.com/item?id=15374051, on the 2017-9-30 front page:  https://news.ycombinator.com/front?day=2017-09-30
       
       So apply salt liberally.
       
       #HN #HackerNews #DataAnalysis #ShellScripting #GlobalCities #MediaAnalysis
       
 (DIR) Post #AW7uoA6DWxysoNOaAK by dredmorbius@toot.cat
       2023-05-28T21:28:11Z
       
       0 likes, 1 repeats
       
       Searching for "New York" followed by one or more capitalised words ... and massaging the resulting data a bit, results in the following list of New York institutions and/or aspects which receive mention at HN:
       
         96  New York Times
         51  New York City
         25  New York
          5  New York State
          4  New York Public Library
          3  New York City Subway
          2  New York Harbor
          2  New York Subway
          2  New York Times Magazine
          1  New York Attorney General
          1  New York Charity
          1  New York City Campus
          1  New York Fed
          1  New York Libraries
          1  New York Magazine
          1  New York Police
          1  New York Post
          1  New York Region
          1  New York Senate
       
       Note that the 3rd entry, "New York", is itself ambiguous, and can refer to the city, state, or metro region, amongst others.
       
       A handy reminder that language is itself ambiguous and provides a useful but not precise mechanism for transferring meaning or understanding (or sometimes ambiguity and confusion) between entities.
       
       #HN #HackerNews #DataAnalysis #ShellScripting #GlobalCities #MediaAnalysis
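       
       A search of this kind can be sketched in Python as below.  Assumptions: the pattern allows zero or more following capitalised words (so that bare "New York" is also counted, as in the list), the titles are made up, and new_york_phrases is a hypothetical name:

```python
import re
from collections import Counter

# "New York" plus any run of capitalised words that directly follows it.
NY_PHRASE = re.compile(r"\bNew York(?: [A-Z][A-Za-z]*)*")

def new_york_phrases(titles):
    """Tally each distinct 'New York ...' phrase found in the titles."""
    return Counter(m.group(0) for t in titles for m in NY_PHRASE.finditer(t))

titles = [  # made-up titles, for illustration only
    "New York Times Magazine on remote work",
    "The New York Times paywall, explained",
    "Moving to New York: a guide",
    "How the New York Times builds charts",
]
counts = new_york_phrases(titles)
for phrase, n in counts.most_common():
    print(f"{n:>4}  {phrase}")
```

       Title-case headlines over-match (in "New York City Is Sinking" the whole title is swallowed), which is presumably part of the massaging mentioned above.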
       
 (DIR) Post #AWA17aFnV7v3TIRNmS by dredmorbius@toot.cat
       2023-05-29T21:37:22Z
       
       0 likes, 0 repeats
       
       According to the Hacker News front page, there are ...:
       
       313 things that suck.
       18 things that will fail.
       116 things that rock.
       157 things that are awesome.
       0 things that are bollocks.
       685 things that are great.
       75 things that are terrible.
       1 thing that is both terrible and amazing.  And it is you.
       28 things that are horrible.
       22 things that are a list of some number of things.
       33 things that are a list of some number of reasons.
       0 hot takes.
       3,101 things that are how to's.
       6,434 things that are "hows" but not how to's.
       98 things that are how not to's.
       21 things that are silly.
       86 things that are clever.
       318 things that are smart, none of which are phones.
       58 things that are brilliant.
       147 things that are stupid.
       20 things that are terrifying.
       19 things that you must do.
       
       Edit: Hashtag surgery (whitespace in hashtags is a thing that sucks).
       
       #HackerNews #HackerNewsAnalytics #TooMuchFunWithGrep #Suck #Fail #Rock #Awesome #Bollocks
       
 (DIR) Post #AWA17aySp3eFhpL3Oy by penguin42@mastodon.org.uk
       2023-05-29T22:09:38Z
       
       0 likes, 0 repeats
       
       @dredmorbius How many of those 18 things have failed so far?
       
 (DIR) Post #AWE9I1TUFjbRwa5Fdw by dredmorbius@toot.cat
       2023-05-31T21:57:12Z
       
       0 likes, 1 repeats
       
       Hacker News Front Page Analytics ... what next?
       
       I'm thinking through where else to take this.  I've had a few side discussions and commentary here and at HN.  Part of this is coming up with questions, part with the tools to answer them.
       
       The initial question concerned places and regional references found on the HN front page.  My initial analysis answered that (US states), and it was pretty easy to add cities (US and global) and countries to the list.
       
       I also wanted some overall summary statistics, for all time, by year, by period (I've done weekdays, I still need to get to months).  There were some interesting comparisons --- vote and comment activity by page position, for example (there's an 844.4 point advantage for 1st over 30th place in votes, 340.3 in comments, for 2022, on average).
       
       I've broken out overall and average (per-story) votes and comments, which is interesting.
       
       There's top-site and top-user activity, and how that changes over time.  I've done some work on this, I'm thinking of both other questions and how to represent this graphically.
       
       (Graphical representation is a question for other aspects as well ... what I've created so far is great for people who like reading 100s of pages of tables, less for those who prefer a visual representation.)
       
       What I've done less of, and am trying to think of ways to surface interesting elements rather than be strictly query/question driven, is to find patterns and trends in the data itself, most especially in the title text.  There are challenges:  HN doesn't provide much to work with (titles are restricted to 80 characters, generally), and there is of course ambiguity, though I'd posted a set of interesting/amusing items (see: https://toot.cat/@dredmorbius/110454128168815763).
       
       I've been playing with some simple ngram code (awk associative arrays of 2..5 elements ... mind-bogglingly easy to create and often surprisingly insightful).
       
       I've relied on some external lists of entities (states, cities, countries, etc.) which are useful.  I'd done an earlier analysis based on the Foreign Policy Top 100 Global Thinkers list, assessing salience level of various online sources, in 2015 (see: https://old.reddit.com/r/dredmorbius/comments/3hp41w/tracking_the_conversation_fp_global_100_thinkers/).  I can re-use that list, though I'd like to find a few others --- top startups / companies / people.  Also perhaps major stories and terms from the past two decades.  (I've done some searches based on my own recollection, e.g., MeToo, BLM, George Floyd, and the like, with some success.)
       
       And I'd like to do a deeper parse of the source HTML to grab both HN threads and source URLs.  I've found the html-xml-utils package useful, need to check that's installed locally (OK, seems it is) and wrap my head around it again (the tools are ... idiosyncratic).  Oh, and homebrew lists package executables in /usr/local/opt//bin/, which is good to know.  Yay!
       
       (Yes, I'm aware there are other tools.  I'm a simple basher.)
       
       #HackerNews #HackerNewsAnalytics
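       
       The n-gram counting mentioned above (awk associative arrays of 2..5 elements) has a close Python analogue in collections.Counter.  A sketch under stated assumptions: titles are lowercased and split on whitespace with no punctuation handling, ngram_counts is a hypothetical name, and the titles are made up:

```python
from collections import Counter

def ngram_counts(titles, n_min=2, n_max=5):
    """Count every run of n_min..n_max consecutive words across all titles."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        for n in range(n_min, n_max + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

titles = [  # made-up titles, for illustration only
    "How to learn awk",
    "How to learn sed",
    "Why awk",
]
grams = ngram_counts(titles)
for gram, n in grams.most_common(3):
    print(n, gram)
```

       Repeated n-grams surface recurring phrasing without needing a question up front, which fits the less query-driven exploration described above.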