Post AbIvS3sL4EvbOJDfJw by benlk@newsie.social
 (DIR) Post #AbIvS3sL4EvbOJDfJw by benlk@newsie.social
       2023-10-30T14:33:27Z
       
       0 likes, 0 repeats
       
       I have an idea for a 1000+ record scraper project, but I'm not sure how to capture the data. CSV or a database? Which database? Any suggestions?
       
 (DIR) Post #AbIvS4yOzBN8nP41PU by benlk@newsie.social
       2023-10-30T14:35:06Z
       
       0 likes, 0 repeats
       
       If I end up putting this #scraper project online, then it's probably best to use MySQL, since that's guaranteed to be available on website hosts, but is there a better recommendation? @simon @palewire
       
 (DIR) Post #AbIvS5x1LnrTpPQQtc by simon@fedi.simonwillison.net
       2023-10-30T18:36:48Z
       
       0 likes, 0 repeats
       
       @benlk @palewire if the overall data is likely to be less than 1GB I strongly recommend a GitHub repository - best-in-class backup solution, for free. For more than that I'd personally archive scraped data as JSON (or even SQLite) to S3, but the format doesn't matter so much as having somewhere reliable and cheap to keep it all.
       
 (DIR) Post #AbIwloqUl0MeWULGhU by walinchus@journa.host
       2023-10-30T18:51:53Z
       
       0 likes, 0 repeats
       
       @benlk @simon @palewire Is the data updated regularly? GitHub Actions is fantastic for that sort of thing.
       
 (DIR) Post #AbJF8wNOirp3iLuHh2 by benlk@newsie.social
       2023-10-30T20:00:57Z
       
       0 likes, 0 repeats
       
       @palewire @walinchus @simon Yep, updated ~daily, though a weekly scrape is adequate for my purposes. Dunno how large the initial scrape will be, because I haven't scraped it yet! 🤔 100 records/day * 365 d/y * 2 KB/record ≈ 75 MB. I guess if GitHub Actions creates a .sqlite file, I can write a cron job on the webserver to import that, if I turn this into something public-facing.
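       As a sanity check on the arithmetic above (the daily record count and per-record size are the poster's own estimates, not measured values), the back-of-envelope sum works out like this:

       ```python
       # Back-of-envelope yearly storage estimate using the thread's numbers.
       records_per_day = 100
       kb_per_record = 2
       days_per_year = 365

       total_kb = records_per_day * kb_per_record * days_per_year  # 73,000 KB
       total_mb = total_kb / 1000                                  # 73.0 MB, i.e. "~75 MB"
       print(total_mb)  # 73.0
       ```

       Comfortably under Simon's 1GB threshold for keeping everything in a GitHub repository, even after several years of daily scrapes.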
       
 (DIR) Post #AbJF8x8XtZXK4ZxwBM by simon@fedi.simonwillison.net
       2023-10-30T22:16:58Z
       
       0 likes, 0 repeats
       
       @benlk @palewire @walinchus Yeah I'd try using a plain text format (which supports clean diffs) for that in a GitHub Actions repository - effectively this pattern: https://simonwillison.net/2020/Oct/9/git-scraping/ I've had a lot of success running GitHub Actions that build a SQLite database out of those plain text files and publish it elsewhere.
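       A minimal sketch of the second half of that pattern: building a SQLite file from the plain-text JSON a git-scraping workflow has committed. The file names and the single-table schema here are illustrative assumptions, not anything specified in the thread.

       ```python
       import json
       import sqlite3
       from pathlib import Path

       def build_db(json_path: str, db_path: str) -> int:
           """Load scraped records from a JSON file into a SQLite database.

           Assumes the JSON file holds a list of objects, each with an "id" key
           (a hypothetical schema for illustration). Returns the row count.
           """
           records = json.loads(Path(json_path).read_text())
           conn = sqlite3.connect(db_path)
           conn.execute(
               "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, data TEXT)"
           )
           # INSERT OR REPLACE keeps re-runs idempotent when records repeat.
           conn.executemany(
               "INSERT OR REPLACE INTO records (id, data) VALUES (?, ?)",
               [(r["id"], json.dumps(r)) for r in records],
           )
           conn.commit()
           count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
           conn.close()
           return count
       ```

       Run from a scheduled GitHub Actions step, a script like this regenerates the .sqlite file on each scrape, which a webserver cron job could then fetch, as discussed upthread.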