[HN Gopher] 1,600 Days of a Failed Hobby Data Science Project
___________________________________________________________________
1,600 Days of a Failed Hobby Data Science Project
Author : millimacro
Score : 18 points
Date : 2024-12-08 21:29 UTC (1 hour ago)
(HTM) web link (lellep.xyz)
(TXT) w3m dump (lellep.xyz)
| fardo wrote:
| The author's right about storytelling from day one, but then
| immediately throws cold water on the idea by saying it would have
| been a bad fit for this project.
|
| This feels like an error, as the big value of seeking feedback and
| results early and often on a project is that it forces you to
| confront whether you're going to want or be able to tell stories
| in the space at all. It also gives you a chance to re-kindle
| waning interests, get feedback on your project by others, and
| avoid ratholing into something for about 5 years without having
| to engage with a public.
|
| If a project can't emotionally bear day one scrutiny, it's
| unlikely to fare better five years later when you've got a lot of
| emotions about incompleteness and the feeling your work isn't
| relevant anymore tied up in the project.
| rixed wrote:
| Would you be able to recommend a project whose author engaged in
| such public storytelling from early on?
| Swizec wrote:
| Thinking Fast and Slow is a result of some 20 years of
| regularly publishing and talking about those ideas with
| others.
|
| Most really memorable works fit that same mold if you look
| carefully. An author spends years, even decades, doing small
| scale things before one day they put it all together into a
| big thing.
|
| Comedy specials are the same. Develop material in small scale
| live with an audience, then create the big thing out of
| individual pieces that survive the process.
|
| Hamming also talks about this as door-open vs. door-closed
| researchers in his famous "You and Your Research" essay.
| rjrdi38dbbdb wrote:
| The title seems misleading. Unless I'm missing something, all he
| did was scrape a news feed, which should only require a couple
| days of work to set up.
|
| The fact that he left it running for years without finding the
| time to do anything with the data isn't that interesting.
| plaidfuji wrote:
| I'm not sure I would call this a failure.. more just something
| you tried out of curiosity and abandoned. Happens to literally
| everyone. "Failed" to me would imply there was something
| fundamentally broken about the approach or the dataset, or that
| there was an actual negative impact to the unrealized result.
| It's very hard to finish long-running side projects that aren't
| generating income, attention, or driven by some quasi-
| pathological obsession. The fact you even blogged about it and
| made HN front page qualifies as a success in my book.
|
| > If I would have finished the project, this dataset would then
| have been released and used for a number of analyses using
| Python.
|
| Nothing stopping you from releasing the raw dataset and calling
| it a success!
|
| > Back then, I would have trained a specialised model (or used a
| pretrained specialised model) but since LLMs made so much
| progress during the runtime of this project from 2020-Q1 to
| 2024-Q4, I would now rather consider a foundational model wrapped
| as an AI agent instead; for example, I would try to find a
| foundation model to do the job of for example finding the right
| link on the Tagesschau website, which was by far the most
| draining part of the whole project.
|
| I actually just started (and subsequently ---abandoned--- paused)
| my own news analysis side project leveraging LLMs for
| consolidation/aggregation.. and yeah, the web scraping part is
| still the worst. And I've had the same thought that feeding raw
| HTML to the LLM might be an easier way of parsing web objects
| now. The problem is most sites are wise to scraping efforts and
| it's not so much a matter of finding the right element but
| bypassing the weird click-thru screens, tricking the site into
| thinking you're on a real browser, etc...
| querez wrote:
| Some very weird things in this.
|
| 1. The title makes it sound like the author spent a lot of time
| on this project. But really, this mostly consisted of noting down
| a couple of URLs per day. So maybe 5 min / day = ~130h spent on
| the project. Let's say 200h to be on the safe side.
|
| 2. "Get first analyses results out quickly based on a small
| dataset and don't just collect data up front to 'analyse it
| later'" => I think ignoring this is what actually killed the
| project. Collecting data for several years w/o actually doing
| anything with it is not a sound project.
|
| 3. "If I would have finished the project, this dataset would then
| have been released" ==> There is literally nothing stopping OP
| from still doing this. It costs maybe 2h of work and would
| potentially give a substantial benefit to others, i.e., turn this
| project into a win after all. I'm very puzzled why OP didn't do
| this.
| mNovak wrote:
| "The data collection process involved a daily ritual of manually
| visiting the Tagesschau website to capture links"
|
| I don't know what to say... I'm amazed they kept this up so long,
| but this really should never have been the game plan.
|
| I also had some data science hobby projects around covid; I got
| busy, lost interest after 6 months. But the scrapers keep running
| in the cloud, in case I get motivated again (anyone need
| structured data on eBay listings for laptops since 2020?). That's
| the beauty of automation for these sorts of things.
| plaidfuji wrote:
| Do you just pay the bill for the resources indefinitely?
___________________________________________________________________
(page generated 2024-12-08 23:00 UTC)