hngopher.com

       [HN Gopher] 1,600 Days of a Failed Hobby Data Science Project
       ___________________________________________________________________
        
       1,600 Days of a Failed Hobby Data Science Project
        
       Author : millimacro
       Score  : 18 points
       Date   : 2024-12-08 21:29 UTC (1 hour ago)
        
 (HTM) web link (lellep.xyz)
 (TXT) w3m dump (lellep.xyz)
        
       | fardo wrote:
       | The author's right about storytelling from day one, but then
       | immediately throws cold water on the idea by saying it would have
       | been a bad fit for this project.
       | 
       | This seems mistaken, as the big value of seeking feedback and
       | results early and often on a project is that it forces you to
       | confront whether you're going to want or be able to tell stories
       | in the space at all. It also gives you a chance to rekindle
       | waning interest, get feedback on your project from others, and
       | avoid ratholing into something for about 5 years without having
       | to engage with the public.
       | 
       | If a project can't emotionally bear day one scrutiny, it's
       | unlikely to fare better five years later when you've got a lot of
       | emotions about incompleteness and the feeling your work isn't
       | relevant anymore tied up in the project.
        
         | rixed wrote:
          | Would you be able to recommend a project whose author engaged
          | in such public storytelling from early on?
        
           | Swizec wrote:
            | Thinking, Fast and Slow is the result of some 20 years of
           | regularly publishing and talking about those ideas with
           | others.
           | 
           | Most really memorable works fit that same mold if you look
           | carefully. An author spends years, even decades, doing small
           | scale things before one day they put it all together into a
           | big thing.
           | 
           | Comedy specials are the same. Develop material in small scale
           | live with an audience, then create the big thing out of
           | individual pieces that survive the process.
           | 
            | Hamming also talks about this as door-open vs. door-closed
            | researchers in his famous "You and Your Research" essay.
        
       | rjrdi38dbbdb wrote:
       | The title seems misleading. Unless I'm missing something, all he
       | did was scrape a news feed, which should only require a couple
       | days of work to set up.
       | 
       | The fact that he left it running for years without finding the
       | time to do anything with the data isn't that interesting.
        
       | plaidfuji wrote:
       | I'm not sure I would call this a failure... more just something
       | you tried out of curiosity and abandoned. Happens to literally
       | everyone. "Failed" to me would imply there was something
       | fundamentally broken about the approach or the dataset, or that
       | there was an actual negative impact to the unrealized result.
       | It's very hard to finish long-running side projects that aren't
       | generating income, attention, or driven by some quasi-
       | pathological obsession. The fact you even blogged about it and
       | made HN front page qualifies as a success in my book.
       | 
       | > If I would have finished the project, this dataset would then
       | have been released and used for a number of analyses using
       | Python.
       | 
       | Nothing stopping you from releasing the raw dataset and calling
       | it a success!
       | 
       | > Back then, I would have trained a specialised model (or used a
       | pretrained specialised model) but since LLMs made so much
       | progress during the runtime of this project from 2020-Q1 to
       | 2024-Q4, I would now rather consider a foundational model wrapped
       | as an AI agent instead; for example, I would try to find a
       | foundation model to do the job of for example finding the right
       | link on the Tagesschau website, which was by far the most
       | draining part of the whole project.
       | 
       | I actually just started (and subsequently ---abandoned--- paused)
       | my own news analysis side project leveraging LLMs for
       | consolidation/aggregation... and yeah, the web scraping part is
       | still the worst. And I've had the same thought that feeding raw
       | HTML to the LLM might be an easier way of parsing web objects
       | now. The problem is most sites are wise to scraping efforts and
       | it's not so much a matter of finding the right element but
       | bypassing the weird click-thru screens, tricking the site into
       | thinking you're on a real browser, etc...
        
       | querez wrote:
       | Some very weird things in this.
       | 
       | 1. The title makes it sound like the author spent a lot of time
       | on this project. But really, this mostly consisted of noting down
       | a couple of URLs per day. So maybe 5 min / day = ~130h spent on
       | the project. Let's say 200h to be on the safe side.
       | 
       | 2. "Get first analyses results out quickly based on a small
       | dataset and don't just collect data up front to "analyse it
       | later"" => I think ignoring this actually killed the project. Collecting
       | data for several years without actually doing anything with it
       | is not a sound project.
       | 
       | 3. "If I would have finished the project, this dataset would then
       | have been released" ==> There is literally nothing stopping OP
       | from still doing this. It costs maybe 2h of work and would
       | potentially give a substantial benefit to others, i.e., turn this
       | project into a win after all. I'm very puzzled why OP didn't do
       | this.
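
For reference, the time estimate in point 1 above can be checked with a few lines. This is only a rough sketch: the 1,600 days come from the post title, and the 5 minutes per day is the commenter's guess, not a measured figure.

```python
# Rough effort estimate for the manual link collection.
DAYS = 1600            # taken from the post title
MINUTES_PER_DAY = 5    # the commenter's guess, not a measured value

hours = DAYS * MINUTES_PER_DAY / 60
print(f"{hours:.0f} h")  # prints "133 h"
```

That lands close to the "~130h" quoted in the comment, so the 200h upper bound leaves a comfortable margin.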
        
       | mNovak wrote:
       | "The data collection process involved a daily ritual of manually
       | visiting the Tagesschau website to capture links"
       | 
       | I don't know what to say... I'm amazed they kept this up so long,
       | but this really should never have been the game plan.
       | 
       | I also had some data science hobby projects around COVID; I got
       | busy and lost interest after 6 months. But the scrapers keep
       | running in the cloud, in case I get motivated again (anyone need
       | structured data on eBay listings for laptops since 2020?). That's
       | the beauty of automation for these sorts of things.
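
As an illustration of the kind of automation described above, the manual "capture links" step could in principle be replaced by a small link extractor run on a schedule. The sketch below uses only the Python standard library and makes several assumptions that are not from the thread: the sample HTML, the `liveblog` keyword, the `example.org` base URL, and the names `LinkCollector` and `extract_links` are all hypothetical. The real Tagesschau markup is not shown in the discussion, and fetching, scheduling, and any anti-scraping hurdles are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs of <a> tags whose href contains a keyword."""

    def __init__(self, base_url, keyword):
        super().__init__()
        self.base_url = base_url
        self.keyword = keyword
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip anchors without an href and links that do not match.
        if href and self.keyword in href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url, keyword):
    """Return matching links from an HTML string, in document order."""
    parser = LinkCollector(base_url, keyword)
    parser.feed(html)
    return parser.links


# Hypothetical sample page; a real run would fetch the HTML once per day.
SAMPLE = (
    '<a href="/newsticker/liveblog-123.html">Live</a>'
    '<a href="/sport/">Sport</a>'
)
print(extract_links(SAMPLE, "https://example.org", "liveblog"))
```

Matching on a substring of the URL is deliberately crude. It keeps the sketch short, but a page redesign or a change in URL scheme would silently break it, which is one reason such scrapers need occasional attention even when they run unattended.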
        
         | plaidfuji wrote:
         | Do you just pay the bill for the resources indefinitely?
        
       ___________________________________________________________________
       (page generated 2024-12-08 23:00 UTC)