[HN Gopher] 1,600 days of a failed hobby data science project
___________________________________________________________________
1,600 days of a failed hobby data science project
Author : millimacro
Score : 146 points
Date : 2024-12-08 21:29 UTC (1 day ago)
(HTM) web link (lellep.xyz)
(TXT) w3m dump (lellep.xyz)
| fardo wrote:
| The author's right about storytelling from day one, but then
| immediately throws cold water on the idea by saying it would have
| been a bad fit for this project.
|
| This feels in error, as the big value of seeking feedback and
| results early and often on a project is that it forces you to
| confront whether you're going to want or be able to tell stories
| in the space at all. It also gives you a chance to rekindle
| waning interests, get feedback on your project from others, and
| avoid ratholing into something for about five years without ever
| having to engage with the public.
|
| If a project can't emotionally bear day one scrutiny, it's
| unlikely to fare better five years later when you've got a lot of
| emotions about incompleteness and the feeling your work isn't
| relevant anymore tied up in the project.
| rixed wrote:
| Would you be able to recommend a project whose author did engage
| in such public storytelling from early on?
| Swizec wrote:
| Thinking, Fast and Slow is the result of some 20 years of
| regularly publishing and talking about those ideas with
| others.
|
| Most really memorable works fit that same mold if you look
| carefully. An author spends years, even decades, doing small
| scale things before one day they put it all together into a
| big thing.
|
| Comedy specials are the same. Develop material in small scale
| live with an audience, then create the big thing out of
| individual pieces that survive the process.
|
| Hamming also talks about this, as open-door vs closed-door
| researchers, in his famous You and Your Research essay.
| rjrdi38dbbdb wrote:
| The title seems misleading. Unless I'm missing something, all he
| did was scrape a news feed, which should only require a couple
| days of work to set up.
|
| The fact that he left it running for years without finding the
| time to do anything with the data isn't that interesting.
| amelius wrote:
| Yes, his #1 advice should be "do something with the data you
| collected".
| plaidfuji wrote:
| I'm not sure I would call this a failure.. more just something
| you tried out of curiosity and abandoned. Happens to literally
| everyone. "Failed" to me would imply there was something
| fundamentally broken about the approach or the dataset, or that
| there was an actual negative impact to the unrealized result.
| It's very hard to finish long-running side projects that aren't
| generating income, attention, or driven by some quasi-
| pathological obsession. The fact you even blogged about it and
| made HN front page qualifies as a success in my book.
|
| > If I would have finished the project, this dataset would then
| have been released and used for a number of analyses using
| Python.
|
| Nothing stopping you from releasing the raw dataset and calling
| it a success!
|
| > Back then, I would have trained a specialised model (or used a
| pretrained specialised model) but since LLMs made so much
| progress during the runtime of this project from 2020-Q1 to
| 2024-Q4, I would now rather consider a foundational model wrapped
| as an AI agent instead; for example, I would try to find a
| foundation model to do the job of for example finding the right
| link on the Tagesschau website, which was by far the most
| draining part of the whole project.
|
| I actually just started (and subsequently ---abandoned--- paused)
| my own news analysis side project leveraging LLMs for
| consolidation/aggregation.. and yeah, the web scraping part is
| still the worst. And I've had the same thought that feeding raw
| HTML to the LLM might be an easier way of parsing web objects
| now. The problem is most sites are wise to scraping efforts, and
| it's not so much a matter of finding the right element as
| bypassing the weird click-thru screens, tricking the site into
| thinking you're a real browser, etc...
| smcin wrote:
| > Nothing stopping you from releasing the raw dataset and
| calling it a success!
|
| Right. OP: release it as a Kaggle Dataset
| (https://www.kaggle.com/datasets) and invite people to
| collaboratively figure out how to automate the analyses. (Do
| you just want to get sentiment on a specific topic (e.g.
| vaccination, German energy supplies, German govt approval)? Or
| quantitative predictions?) Start with something easy.
|
| > _for example, I would try to find a foundation model to do
| the job of for example finding the right link on the Tagesschau
| website, which was by far the most draining part of the whole
| project._
|
| Huh? To find the specific date's news item corresponding to a
| given topic? Why not just predict the date-range, e.g. "Apr-Aug
| 2022"?
|
| > _and yeah, the web scraping part is still the worst._
|
| Sounds wrong. OP, fix your scraping. (unless it was anti-AI
| heuristics that kept breaking it, which I doubt since it's
| Tagesschau). But Tagesschau has RSS feeds, so why are you
| blocked on scraping?
| https://www.tagesschau.de/infoservices/rssfeeds
|
| Compare to: the Kaggle Dataset "10k German News Articles for
| topic classification" (Schabus, Skowron & Trapp, SIGIR 2017)
| [https://www.kaggle.com/datasets/abhishek/10k-german-news-
| art...]
| IanCal wrote:
| I'll put a shoutout for https://zenodo.org/ and
| https://figshare.com/ as places to put your data, where
| you'll get a DOI and can let someone that's not a company
| look after hosting and backing it up. Zenodo is hosted as
| long as CERN is around (is the promise) and figshare is
| backed by the CLOCKSS archive (multiple geographically
| distributed universities).
| xelxebar wrote:
| Personally, I think it's helpful to feel disappointment and
| insufficiency when those emotions pop up. They are the voices
| of certain preferences, needs, and/or desires that work to
| enrich our lives. Recontextualizing the world into some kind of
| positive success story can often gaslight those emotions out of
| existence, which can, paradoxically, be self-sabotaging.
|
| The piece reads to me like a direct and honest confrontation
| with failure. It means the author thinks they can do better and
| is working to identify unhelpful subconscious patterns and
| overcome them.
|
| Personally, I found the author's laser focus on "data science
| projects" intriguing. I have a tendency to immediately go meta
| which biases towards eliding detail; however, even if overly
| narrow, the author's focus does end up precipitating out
| concrete, actionable hypotheses for improvement.
|
| Bravo, IMHO.
| querez wrote:
| Some very weird things in this.
|
| 1. The title makes it sound like the author spent a lot of time
| on this project. But really, this mostly consisted of noting down
| a couple of URLs per day. So maybe 5 min / day = ~130h spent on
| the project. Let's say 200h to be on the safe side.
|
| 2. "Get first analyses results out quickly based on a small
| dataset and don't just collect data up front to "analyse it
| later"" => I think this actually killed the project. Collecting
| data for several years without actually doing anything with it
| is not a sound project.
|
| 3. "If I would have finished the project, this dataset would then
| have been released" ==> There is literally nothing stopping OP
| from still doing this. It costs maybe 2h of work and would
| potentially give a substantial benefit to others, i.e., turn this
| project into a win after all. I'm very puzzled why OP didn't do
| this.
| apwell23 wrote:
| yep, I spent more time on Duolingo for a 600+ day streak and
| can barely speak Spanish.
| rrr_oh_man wrote:
| That seems to be a pattern
| galleywest200 wrote:
| It is because you never really practice talking with
| Duolingo. I am quite good at _reading_ French now, though.
| pessimizer wrote:
| > I am quite good at reading French now, though.
|
| If you are, that's actually quite an achievement, and a
| good one. If you're talking about French outside of Duolingo,
| that is.
|
| I do not normally hear of people getting to reading
| fluency through Duolingo.
| wizzwizz4 wrote:
| Duolingo used to have a really good feature where you
| read through and collaboratively translated texts, but
| they shut it down years back.
| j_bum wrote:
| Wow I forgot about that! When I was using it for French
| many years ago, I imagined they were using it as a way to
| generate free translations, but I still found it
| enjoyable and useful.
|
| Wonder why they took it away.
| smcin wrote:
| Well, you can't practice producing unconstrained sentences,
| only ones within their very narrow training wheels.
| xandrius wrote:
| Duolingo is a pretty bad tool for learning a language; it's
| good at making you feel like you're learning, though.
| waste_monk wrote:
| At this point it's more about being scared of the bird.
| mettamage wrote:
| Just to give a nuanced perspective on duolingo.
|
| My wife only did 50 hours of duolingo in total the past 2
| years. Combine that with me teasing her in Dutch and she's
| actually making progress.
|
| Duolingo is a chill tool to learn some vocab. That vocab
| then gets acquired by talking to me. We talk Dutch for two
| minutes per day at most, so about 12 hours in total per year.
|
| She is 67% done with duolingo. So we bought the first real
| book to learn Dutch (De Opmaat).
|
| That book is IMO not for pure beginners. But for the level
| my wife was at, it seems perfect.
| selimthegrim wrote:
| Do you think it would be good for Flemish too or speaking
| standard Dutch in Belgium?
| mettamage wrote:
| I don't know how one would learn Flemish from books. I
| think you'd need to go to Belgium and speak Dutch there
| and then see what the differences are.
|
| Dutch and Flemish are interchangeable though. Sometimes
| it falls apart based on accent, but not on language.
| rjh29 wrote:
| I finished the whole tree in French and had nothing to show
| for it either. It really is a fun way to feel like you're
| learning, without connecting you to the language or culture
| in any significant way.
| Wololooo wrote:
| It's a useful tool if you're immersed in the language; it's
| not key to your learning, but it can help tremendously.
| MarcelOlsz wrote:
| Anki is the way, especially with their new FSRS algo.
| bowsamic wrote:
| Yep, any good textbook or course with Anki for aiding raw
| memorisation. By far the best way to go
| raister wrote:
| I feel this whilst learning (or trying to learn) German: when I
| think "how would I say this in German?" I get nothing but a
| blank in my mind. I'm a good "speaker" though, and sadly, I
| feel I'm not going anywhere either...
| katzenversteher wrote:
| Immerse yourself in the language. In Germany we have
| almost everything dubbed, so you can watch pretty much any
| popular movie or TV series in German or read any popular
| book in German. Besides that there are also quite a lot of
| German productions.
| ben_w wrote:
| Indeed.
|
| For learners, I'd also currently recommend "Easy German"
| podcasts and YouTube videos, as they come in all skill
| levels, are free, and are well made.
|
| https://youtube.com/@easygerman?si=EQdZPHMZ0lPNEl6V
| coffeecantcode wrote:
| Watch Dark on Netflix in the original German on repeat; it's a
| great way to subconsciously make note of tones and pronunciation
| while also watching an awesome show. Be very intentional
| about it though.
| Insanity wrote:
| For me, nothing beats in-person classes, short of a native
| speaker with whom you can interact. Being forced to actually
| speak the language in "mock settings" makes all the
| difference.
|
| And even if you don't get your grammar completely right, you
| will learn enough to survive in a real-life setting.
|
| I learned Spanish through a combination of both - I took
| Spanish classes after I started dating my Mexican wife,
| enough to get conversational. Then I started interacting in
| Spanish with her family, which helps me now maintain the
| language without needing the classes.
| ben_w wrote:
| Likewise; I can say the same about Arabic on Duolingo, where
| I never even mastered the alphabet.
| morkalork wrote:
| Point number 2. is super important for non-hobby projects.
| Collect a bit of data, even if you have to do it manually at
| first and do a "dry run" / first cut of whatever analysis
| you're thinking of doing so you confirm you're actually
| collecting what you need and what you're doing is even going to
| work. Seeing a pipeline get built, run for like two months and
| then the data scientist come along and say "this isn't what we
| needed" was complete goddamn shitshow. I'm just glad I was only
| a spectator to it.
| IanCal wrote:
| They touch on something relevant here and it's a great point
| to emphasise
|
| > The emphasis on preserving raw HTML proved vital when
| Tagesschau repeatedly altered their newsticker DOM structure
| throughout Q2 2020. This experience underscored a fundamental
| data engineering principle: raw data is king. While parsers
| can be rewritten, lost data is irretrievable.
|
| I've done this before, keeping full, timestamped, versioned
| raw HTML. That still risks breaking when sites shift to
| JavaScript-rendered pages, but keeping your _collection_ and
| _processing_ as distinct as you can, so you can rerun things
| later, is incredibly helpful.
|
| Usually, processing raw data is _cheap_. Recovering raw data
| is _expensive_ or _impossible_.
|
| As a bonus, collecting raw data is usually easier than
| collecting and processing it, so you might as well start
| there. Maybe you'll find out you were missing something, but
| it's no worse than if you'd tied things together.
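|
| As a rough illustration, a minimal sketch of that separation in
| Python (file layout and function names are my own, not the
| author's setup):
|
|     import time, pathlib, requests
|
|     ARCHIVE = pathlib.Path("raw_html")
|
|     def collect(url):
|         # Collection: fetch and archive the raw bytes, nothing else.
|         resp = requests.get(url, timeout=30)
|         resp.raise_for_status()
|         ARCHIVE.mkdir(exist_ok=True)
|         path = ARCHIVE / f"{int(time.time())}.html"
|         path.write_bytes(resp.content)
|         return path
|
|     def process(path):
|         # Processing reads only from the archive, so it can be
|         # rewritten and rerun whenever the DOM changes.
|         return path.read_bytes().decode("utf-8", errors="replace")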
|
| edit
|
| > Huh? To find the specific date's news item corresponding to a
| given topic? Why not just predict the date-range, e.g. "Apr-
| Aug 2022"?
|
| They say they had to manually find the links to the right
| liveblog subpage. So they had to go to the main page, find
| the link and then store it.
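|
| That step looks automatable, though. A minimal sketch, assuming
| the ticker links carry a recognisable "liveblog" substring in
| the href (an assumption to verify against the real site):
|
|     import requests
|     from bs4 import BeautifulSoup
|
|     html = requests.get("https://www.tagesschau.de/", timeout=30).text
|     soup = BeautifulSoup(html, "html.parser")
|     # Collect every front-page link whose URL mentions "liveblog".
|     links = {a["href"] for a in soup.find_all("a", href=True)
|              if "liveblog" in a["href"]}
|     print(sorted(links))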
| IanCal wrote:
| While I understand the points, I think it's worth being kinder
| to someone coming out to write about how they failed with a
| project.
|
| > 1. The title makes it sound like the author spent a lot of
| time on this project. But really, this mostly consisted of
| noting down a couple of URLs per day. So maybe 5 min / day =
| ~130h spent on the project. Let's say 200h to be on the safe
| side.
|
| Consistent work over multiple years shouldn't be looked down on
| like this. If you've done something every day for years it's
| still a lot of time in your life. We're not econs and so I
| don't think summing up the time really captures it either.
|
| > 3. "If I would have finished the project, this dataset would
| then have been released" ==> There is literally nothing
| stopping OP from still doing this. It costs maybe 2h of work
| and would potentially give a substantial benefit to others,
| i.e., turn this project into a win after all. I'm very puzzled
| why OP didn't do this.
|
| They might not realise how to do this sustainably, they might
| be mentally just done with it. It may be harder for them to
| think about.
|
| I'd recommend also that they release the data. If they put it
| on either Zenodo or Figshare it'll be hosted for free and
| referenceable by others.
|
| > 2. "Get first analyses results out quickly based on a small
| dataset and don't just collect data up front to "analyse it
| later"" => I think this actually killed the project.
|
| I agree, but again on the kinder side (because they also agree
| I think) there are multiple reasons for doing this and focusing
| on why might be more productive.
|
| 1. It gets you to actually process the data in some useful
| form. So many times I've seen things fail late on because
| people didn't realise something like "how are dates formatted"
| or whether some field was often missing or you just didn't
| capture something that turns out to be pretty key (e.g. you
| scrape timestamps, then realise that at some point the site
| changed them to relative ones like "two weeks ago" and you
| never noticed).
|
| This can be as simple as just plotting some data, counting
| uniques, anything (a sketch follows at the end of this
| comment). The automated system will fall over when things go
| wrong and you can check it.
|
| 2. What do people care about? What do you care about? Sometimes
| I've had a great idea for an analysis only to realise later
| maybe I'm the only one that cares or worse, the result is so
| obvious it's not even interesting to me.
|
| 3. Keeping interest. Keeping interest in a multi-year project
| that's giving you something back can be easier than something
| that's just taking.
|
| 4. Guilt. If I spend a long time on something, I feel it should
| be better. So I want to make it more polished, which takes
| time, which I don't have. So I don't add to it, then I'm not
| adding anything, then nothing happens. It _shouldn't_ matter,
| but I've long realised that just wishing my mind worked
| differently isn't a good plan and instead I should just plan
| for reality. For that, doing something fast feels much better:
| I am happier releasing something that's taken me half a day and
| looks kinda-ok, because not much time is sunk into it yet.
|
| 5. Get it out before something changes. COVID had no clear
| endpoint known up front (and arguably still doesn't).
|
| 6. Ensure you've actually got a plan. Unless you've got a very
| good reason, you can probably build what you need to analyse
| things and release it earlier. You can't run an analysis on an
| upcoming election, but even then you could do it on a previous
| year and see things working. This can help with motivation
| because at the end you don't have "oh right now I need to write
| and run loads of things" you just need to hit go again.
| mNovak wrote:
| "The data collection process involved a daily ritual of manually
| visiting the Tagesschau website to capture links"
|
| I don't know what to say... I'm amazed they kept this up so long,
| but this really should never have been the game plan.
|
| I also had some data science hobby projects around covid; I got
| busy, lost interest after 6 months. But the scrapers keep running
| in the cloud, in case I get motivated again (anyone need
| structured data on eBay listings for laptops since 2020?). That's
| the beauty of automation for these sorts of things.
| plaidfuji wrote:
| Do you just pay the bill for the resources indefinitely?
| hansvm wrote:
| I'm not the person you're asking, but I maintain a number of
| scraping projects. The bills are negligible for almost
| everything. A single $3/mo VPS can easily handle 1M QPS
| (enough for all the small projects put together), and most of
| these projects only accumulate O(10GB)/yr.
|
| Doing something like grabbing hourly updates of the inventory
| of every item in every Target store is a bit more involved,
| and you'll rapidly accumulate proxy/IP/storage/... costs, but
| 99% of these projects have more valuable data at a lesser
| scale, and it's absolutely worth continuing them on average.
| NavinF wrote:
| Inbound data is typically free on cloud VMs. CPU/RAM usage is
| also small unless you use chromedriver and scrape using an
| entire browser with graphics rendered on CPU. We're talking
| $5/mo for most scraping projects.
| mNovak wrote:
| I'm paying < $0.50 a month, and that's primarily driven by S3.
| For the scraping itself I'm using lambda, with maybe minutes
| of runtime per day.
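|
| For anyone curious, a minimal sketch of that shape of handler
| (bucket name and target URL are placeholders):
|
|     import time, urllib.request, boto3
|
|     s3 = boto3.client("s3")
|
|     def handler(event, context):
|         # Fetch the page and archive the raw bytes under a
|         # timestamped key; analysis happens elsewhere, later.
|         raw = urllib.request.urlopen(
|             "https://example.com/listings", timeout=30).read()
|         key = f"raw/{int(time.time())}.html"
|         s3.put_object(Bucket="my-scrape-archive", Key=key, Body=raw)
|         return {"stored": key}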
| FrustratedMonky wrote:
| "Data Science Project Failing After 1,600 Days"
|
| Sounds like my Thesis.
|
| How many people have spent 4+ years on a Thesis and then just
| completely given up: tired, drained, no interest in continuing?
| The bright-eyed, bushy-tailed wonder, all gone.
| dankwizard wrote:
| I don't speak the language so maybe what you're scraping isn't in
| this list, but why manual when they seem to have comprehensive
| RSS feeds? [1]
|
| Automating this part should have been day 1.
|
| [1] https://www.tagesschau.de/infoservices/rssfeeds
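|
| For illustration, a minimal sketch with feedparser (the feed URL
| below is a placeholder; pick a real one from the page above):
|
|     import feedparser
|
|     FEED_URL = "https://www.tagesschau.de/example-feed.xml"  # placeholder
|     feed = feedparser.parse(FEED_URL)
|     for entry in feed.entries:
|         # Field names follow common RSS conventions; availability
|         # can vary by feed.
|         print(entry.get("published", "?"), entry.get("link"))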
| smcin wrote:
| That's what I just concluded. I think the OP was oversold on
| the idea of using AI to do scraping, NLP and summarization, all
| in one go.
| j45 wrote:
| I don't know that projects ever fail.
|
| Doing them and learning and growing from them is the point.
|
| They shed light on your path, and also on what you are able to
| see as possible.
| ddxv wrote:
| Why not open source? I've been slaving away at some possibly
| pointless data scraping sites that collect app data and the SDKs
| that apps use. I figure if I at least open source it, the data
| and code are there for others to use.
| kqr wrote:
| I see some recommendations about running a small version of the
| analysis first to see if it's going to work at all. I agree, and
| the next level up is to also estimate the _value_ of performing
| the full analysis. I.e. not just whether or not it will work at
| all, but how much it is allowed to cost and still be useful.
|
| You may find, for example, that each unit of uncertainty reduced
| costs more than the value of the corresponding uncertainty
| reduction. This is the point at which one needs to either find a
| new approach, or be content with the level of uncertainty one
| has.
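|
| As a toy illustration with made-up numbers (none of these come
| from the article):
|
|     # Value-of-information check before committing to the full run.
|     cost_of_full_analysis = 200.0   # hours or dollars, your unit
|     uncertainty_reduced = 3.0       # units of uncertainty removed
|     value_per_unit = 50.0           # worth of one unit of reduction
|
|     if uncertainty_reduced * value_per_unit > cost_of_full_analysis:
|         print("worth running the full analysis")
|     else:
|         print("find a cheaper approach, or accept the uncertainty")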
| brikym wrote:
| I know the feeling. I managed 9 months scraping supermarket data
| before I gave up, mostly because a few other people were doing it
| and I was short on time.
| barrenko wrote:
| People relatively new to CS would be wise to be warned about what
| a colossal time sink it is.
| wodenokoto wrote:
| > Store raw data if possible. This allows you to condense it
| later.
|
| I have some daily scripts reading from an http endpoint, and I
| can't really decide what to do when it returns html instead of
| json. Should I store the HTML as it is "raw data" or should I
| just dismiss it? The API in question has a tendency to return 200
| with a webpage saying that the API can't be reached (typically
| because of a timeout).
| IanCal wrote:
| I wouldn't usually store that; I'd use it to trigger retries.
|
| For you, storing the raw data means storing the json that the
| http endpoint returns, rather than something like
|
|     content = get(url).json()
|     info_i_care_about = content['data']['title']
|     store(info_i_care_about)
|
| as otherwise you'll get stuck when the json response moves the
| title to data.metadata.title or whatever
|
| It's usually less of an issue with structured data; things like
| html change more often. But keeping that raw data means you can
| process it in various different ways later.
|
| You also decouple errors so your parsing error doesn't stop
| your write from happening.
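|
| A minimal sketch of that shape (names and the retry policy are
| just examples):
|
|     import json, time, requests
|
|     def fetch_json(url, attempts=3):
|         for i in range(attempts):
|             resp = requests.get(url, timeout=30)
|             try:
|                 return resp.json()
|             except ValueError:
|                 # HTML error page despite a 200: back off, retry.
|                 time.sleep(2 ** i)
|         raise RuntimeError("endpoint kept returning non-json")
|
|     content = fetch_json("https://example.com/api/items")  # placeholder
|     # Store the raw payload first; extract fields in a separate step.
|     with open(f"raw/{int(time.time())}.json", "w") as f:  # raw/ must exist
|         json.dump(content, f)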
| tessierashpool9 wrote:
| the last thing the world or rather germany needs is a news ticker
| based on ... the tagesschau LOL
| KeplerBoy wrote:
| Oh boy, the topic (Covid) alone would have left me exhausted
| after a few months. I'd heard enough of it by mid-2021.
| rybosworld wrote:
| > The data collection process involved a daily ritual of manually
| visiting the Tagesschau website to capture links to both the
| COVID and later Ukraine war newstickers. While this manual
| approach constituted the bulk of the project's effort, it was
| necessitated by Tagesschau's unstructured URL schema, which made
| automated link collection impractical.
|
| > The emphasis on preserving raw HTML proved vital when
| Tagesschau repeatedly altered their newsticker DOM structure
| throughout Q2 2020.
|
| Another big takeaway is that it's not sustainable to rely on this
| type of data source. Your data source should be stable. If the
| site offers APIs, that's almost always better than parsing HTML.
|
| Website developers do not consider scrapers when they make
| changes. Why would they? So if you are ever trying to collect
| some unique dataset, it doesn't hurt to reach out to the web devs
| to see if they can provide a public API.
| abirch wrote:
| Please consider it an early Christmas present to yourself if
| you can pay a nominal amount for an API instead of spending
| your time scraping unless you enjoy doing the scraping.
| Uptrenda wrote:
| I think whether you 'succeed' or 'fail' at a side project, it
| is still valuable. Even if you can't finish it, or it turns
| out differently to how you imagined, you come away as a
| better version of yourself: a person who is more optimized for a
| new strategy. And sometimes 'failure' is a worthwhile price for
| that ability. Who knows, it might be exactly what prepares you
| for something even bigger in the future.
| fuzzfactor wrote:
| I guess the kind of extreme effort that _doesn't usually have
| a promising conclusion_ is more common in scientific research,
| or experimentation in general, but sometimes you just have to
| get accustomed to it.
|
| Eventually it doesn't really make any difference if there's no
| breathtaking milestone because it turned out to be impossible
| by nature, ran out of runway, or lost interest after a more or
| less valiant attempt.
|
| What can be gained is the strength to overcome the near-
| impossible next time; all it has to do is be a certain degree
| less impossible, and you'll know whether it would take you over
| the goal line, like few others can, because you've been there.
|
| Without even worrying as much about whether you will lose
| interest or not, that's a lot less stress and pressure when you
| think about it.
|
| This can enable you more realistically to succeed in other
| areas where peers may find it impossible or not be able to do
| as well without as big an inconclusive project behind them.
| TheGoodBarn wrote:
| What I love about projects like this is they are dynamic enough
| to cover a number of interests all in one.
|
| I personally have some side projects that have started as X,
| transitioned into Y and Z, and then I stole some ideas and built
| A, which turned into B. Then a requirement in my professional
| job called for the Z solution mixed with the B solution, which
| resulted in something else that re-ignited my interest in X and
| helped me rebuild with a clearer mindset on what I intended in
| the first place.
|
| All that to say, these things are dynamic and a long list of
| "failed" projects is a historical narrative of learning and
| interests over time. I love to see it.
| sota_pop wrote:
| Nice article OP. I and a great many others suffer from the same
| struggles of bringing personal projects to "completion", and I've
| gotta respect the resilience in the length of time you hung in
| there. However, not to be overly pedantic, but I always felt
| "data science" was an exploratory exercise to discover insights
| into a given data set. I always personally filed the efforts to
| create the pipeline and associated automation (i.e. identify,
| capture, and store a given data set - more commonly referred to
| as "ETL") as a "data engineering" task, which these days is
| considered a different specialty. Perhaps if you scope your
| problem a little smaller, you may yet be able to capture
| something demonstrably valuable to others (and something you
| might consider "finished"). You'd be surprised how simple
| something that addresses a real issue can be to be able to
| provide real value for others.
|
| Nice work and great effort.
| sshrajesh wrote:
| Anyone know what software was used to create these diagrams:
| https://lellep.xyz/blog/images/failed_data_science_project/2...
| regular_trash wrote:
| Excalidraw
| tvrg wrote:
| Looks like something you could create with excalidraw. It's an
| awesome tool!
|
| https://excalidraw.com/
| dowager_dan99 wrote:
| I for one don't want to start counting everything I lose interest
| in as a "failure", that would be too depressing. I actually think
| this is a feature not a flaw. You have very few attention tokens
| and should be aggressive in getting them back.
|
| I think this is very different from the "finishing" decision.
| That should focus on scope and iterations, while attempting to
| account for effort vs. reward and avoiding things like sunk cost
| influences.
|
| Combine both and you've got "pragmatic grit": the ability to get
| valuable shit done.
___________________________________________________________________
(page generated 2024-12-09 23:01 UTC)