Post AbIvS3sL4EvbOJDfJw by benlk@newsie.social
 (DIR) Post #AbIvS3sL4EvbOJDfJw by benlk@newsie.social
       2023-10-30T14:33:27Z
       
       0 likes, 0 repeats
       
       I have an idea for a 1000+ record scraper project, but I'm not sure how to capture the data. CSV or a database? Which database? Any suggestions?
       
 (DIR) Post #AbIvS4yOzBN8nP41PU by benlk@newsie.social
       2023-10-30T14:35:06Z
       
       0 likes, 0 repeats
       
       If I end up putting this #scraper project online, then it's probably best to use MySQL, since that's guaranteed to be available on website hosts, but is there a better recommendation? @simon @palewire
       
 (DIR) Post #AbIvS5x1LnrTpPQQtc by simon@fedi.simonwillison.net
       2023-10-30T18:36:48Z
       
       0 likes, 0 repeats
       
       @benlk @palewire if the overall data is likely to be less than 1GB I strongly recommend a GitHub repository - best-in-class backup solution, for free. For more than that I'd personally archive scraped data as JSON (or even SQLite) to S3, but the format doesn't matter so much as having somewhere reliable and cheap to keep it all.
       
 (DIR) Post #AbIwloqUl0MeWULGhU by walinchus@journa.host
       2023-10-30T18:51:53Z
       
       0 likes, 0 repeats
       
       @benlk @simon @palewire Is the data updated regularly? GitHub Actions is fantastic for that sort of thing.
       
 (DIR) Post #AbJF8wNOirp3iLuHh2 by benlk@newsie.social
       2023-10-30T20:00:57Z
       
       0 likes, 0 repeats
       
       @palewire @walinchus @simon Yep, updated ~daily, though a weekly scrape is adequate for my purposes. Dunno how large the initial scrape will be, because I haven't scraped it yet! 🤔 100 records/day * 365 d/y * 2 KB/record ≈ 75 MB. I guess if GitHub Actions creates a .sqlite file, I can write a cron job on the webserver to import that, if I turn this into something public-facing.
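       As a sanity check on the arithmetic above (the daily record count and per-record size are the poster's own estimates, not measured values), the back-of-envelope sum works out like this:

       ```python
       # Back-of-envelope yearly storage estimate using the thread's numbers.
       records_per_day = 100
       kb_per_record = 2
       days_per_year = 365

       total_kb = records_per_day * kb_per_record * days_per_year  # 73,000 KB
       total_mb = total_kb / 1000                                  # 73.0 MB, i.e. "~75 MB"
       print(total_mb)  # 73.0
       ```

       Comfortably under Simon's 1GB threshold for keeping everything in a GitHub repository, even after several years of daily scrapes.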
       
 (DIR) Post #AbJF8x8XtZXK4ZxwBM by simon@fedi.simonwillison.net
       2023-10-30T22:16:58Z
       
       0 likes, 0 repeats
       
       @benlk @palewire @walinchus Yeah I'd try using a plain text format (which supports clean diffs) for that in a GitHub Actions repository - effectively this pattern: https://simonwillison.net/2020/Oct/9/git-scraping/ I've had a lot of success running GitHub Actions that build a SQLite database out of those plain text files and publish it elsewhere.
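       A minimal sketch of the second half of that pattern: building a SQLite file from the plain-text JSON a git-scraping workflow has committed. The file names and the single-table schema here are illustrative assumptions, not anything specified in the thread.

       ```python
       import json
       import sqlite3
       from pathlib import Path

       def build_db(json_path: str, db_path: str) -> int:
           """Load scraped records from a JSON file into a SQLite database.

           Assumes the JSON file holds a list of objects, each with an "id" key
           (a hypothetical schema for illustration). Returns the row count.
           """
           records = json.loads(Path(json_path).read_text())
           conn = sqlite3.connect(db_path)
           conn.execute(
               "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, data TEXT)"
           )
           # INSERT OR REPLACE keeps re-runs idempotent when records repeat.
           conn.executemany(
               "INSERT OR REPLACE INTO records (id, data) VALUES (?, ?)",
               [(r["id"], json.dumps(r)) for r in records],
           )
           conn.commit()
           count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
           conn.close()
           return count
       ```

       Run from a scheduled GitHub Actions step, a script like this regenerates the .sqlite file on each scrape, which a webserver cron job could then fetch, as discussed upthread.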