Post AGRaXplafW02jT09g0 by ThatWouldBeTelling@detroitriotcity.com
 (DIR) More posts by ThatWouldBeTelling@detroitriotcity.com
 (DIR) Post #AGRG3OjKcvDh6r9e3U by urusan@fosstodon.org
       2022-02-13T15:53:25Z
       
       0 likes, 0 repeats
       
       Website scraping?
       
 (DIR) Post #AGRGF1W8l04cajWn5s by specter@eattherich.club
       2022-02-13T15:55:29Z
       
       0 likes, 0 repeats
       
       @urusan different timezones/browsing times?
       
 (DIR) Post #AGRGPgLLp1CK6gZ0Xw by mdhughes@appdot.net
       2022-02-13T15:57:26Z
       
       0 likes, 0 repeats
       
       @urusan Tho mostly I find out what the content URL format is, and run my "snatch" script which iterates over a couple series, like from 00-00 to 99-99.
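A minimal sketch of the kind of series iteration described in this post, not the poster's actual "snatch" script. The URL template and the `series_urls` helper are hypothetical, chosen only to show two counters running from 00-00 to 99-99; substitute the real content URL format of the site in question.

```python
from itertools import product


def series_urls(template, first_range, second_range):
    """Yield one URL for every pair of numbers drawn from the two series."""
    for a, b in product(first_range, second_range):
        yield template.format(a=a, b=b)


# Hypothetical template: two zero-padded, two-digit counters.
TEMPLATE = "https://example.com/content/{a:02d}-{b:02d}.jpg"

urls = list(series_urls(TEMPLATE, range(100), range(100)))
print(urls[0])    # first URL in the series
print(urls[-1])   # last URL in the series
print(len(urls))  # total number of candidate URLs
```

Generating the list first, rather than downloading inside the loop, makes it easy to inspect the candidates before any request is sent.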
       
 (DIR) Post #AGRGR8DJgaFOppUrWC by frankie@mstdn.social
       2022-02-13T15:57:44Z
       
       0 likes, 0 repeats
       
       @urusan what's that? 😃
       
 (DIR) Post #AGRGTbeDDrfghYG0BM by dhfir@expired.mentality.rip
       2022-02-13T15:58:12.815335Z
       
       0 likes, 0 repeats
       
       @urusan I have a general idea how it's done, but I don't know specifics.
       
 (DIR) Post #AGRJF4gFIT3IGPyEnw by urusan@fosstodon.org
       2022-02-13T16:29:10Z
       
       0 likes, 0 repeats
       
       @frankie It's when you programmatically download the contents of a website to do something with it. At the most basic, it's just to archive it locally (or mirror it). You may also do additional processing, such as collecting statistics or putting the information in a different format.
       
 (DIR) Post #AGRJLebkdFspQd0PRY by frankie@mstdn.social
       2022-02-13T16:30:13Z
       
       0 likes, 0 repeats
       
       @urusan Ah, got it! My friend would write programs to do that! 😄 I vote 'Yes'. :blobcatgiggle:
       
 (DIR) Post #AGRJs2P5hAsIUqWeUS by urusan@fosstodon.org
       2022-02-13T16:36:05Z
       
       0 likes, 0 repeats
       
       @dhfir Most likely useful links: https://2.python-requests.org/en/latest/ https://www.crummy.com/software/BeautifulSoup/
       
 (DIR) Post #AGRP8JNQrAGXXgKStU by sotolf@fosstodon.org
       2022-02-13T17:35:09Z
       
       0 likes, 0 repeats
       
       @urusan yeah it's basically what NewPipe is based on, so I use it a lot :)
       
 (DIR) Post #AGRQHH7ygkJ57uwFg8 by macxool@fosstodon.org
       2022-02-13T17:48:00Z
       
       0 likes, 0 repeats
       
       @urusan Sometimes it's necessary ;-). And there are lots of tools with which to do it, so...
       
 (DIR) Post #AGRQhAH0nMCDOiVd5c by benjaminhollon@fosstodon.org
       2022-02-13T17:52:40Z
       
       0 likes, 0 repeats
       
       @urusan The only time I've done it was with Project Gutenberg, using a method they publicly posted. Other than that, I'm generally opposed but I'm willing to reconsider on a case-by-case basis.
       
 (DIR) Post #AGRaXplafW02jT09g0 by ThatWouldBeTelling@detroitriotcity.com
       2022-02-13T19:43:03.568144Z
       
       0 likes, 1 repeats
       
       @macxool @urusan I personally like wget as a command line tool (but I’m used to them; I believe it’s got some GUI front ends…), but that’s partly from familiarity, and it can get complicated; the problem itself is complicated.  (Have also written spidering code in C++, I don’t recommend that language….)
       
       Perhaps partly addressing @benjaminhollon’s comment below, the number one thing I emphasize is don’t hammer the server!!!  Be a good citizen on the net.
       
       For wget start with a fairly long interval between downloads, like 15 seconds at minimum, 30 is better, and start looking at the results quickly so you know if it’s doing what you want or you need to stop and change the parameters.  There’s also with wget a “don’t redownload pages” option, although I’m not sure how much load it really saves the server.  Also pay attention to things like depth to spider, and excluding media by extensions such as “mp4” if you’re not interested in that etc.  If you are, I’d seriously up the wait interval.
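The advice in this post (wait between downloads, skip pages already fetched, limit depth, exclude media by extension) can be sketched in a few lines. This is an illustration under stated assumptions, not how wget works internally: `wanted`, `crawl`, `MAX_DEPTH` and the injected `fetch` and `sleep` callables are all hypothetical names, and the depth limit of 2 is an arbitrary choice.

```python
import time
from urllib.parse import urlparse

SKIP_EXTENSIONS = {".mp4"}  # media to exclude, as in the post's example
WAIT_SECONDS = 30           # the post suggests 15 at minimum, 30 is better
MAX_DEPTH = 2               # assumed depth limit; tune per site


def wanted(url, depth, seen):
    """Decide whether a URL is worth requesting at all."""
    if depth > MAX_DEPTH or url in seen:
        return False
    path = urlparse(url).path.lower()
    return not any(path.endswith(ext) for ext in SKIP_EXTENSIONS)


def crawl(candidates, fetch, wait=WAIT_SECONDS, sleep=time.sleep):
    """Call fetch(url) for each wanted (url, depth) pair, pausing between requests."""
    seen = set()
    for url, depth in candidates:
        if not wanted(url, depth, seen):
            continue
        if seen:  # no need to wait before the very first request
            sleep(wait)
        seen.add(url)
        fetch(url)
    return seen
```

Passing `fetch` and `sleep` in as arguments keeps the sketch free of network calls, so the filtering and pacing can be checked before pointing it at a real server. wget has its own options covering the same ground, for example `--wait`, `--level` and `--reject`.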
       
 (DIR) Post #AGRdIKGvmXZuZYxR0i by macxool@fosstodon.org
       2022-02-13T20:05:28Z
       
       0 likes, 0 repeats
       
       @ThatWouldBeTelling @benjaminhollon @urusan I know this is heresy ;-), but I work with PowerShell extensively at work and I've taken to doing this sort of thing on Linux using PowerShell as well. I've been using Linux for a long time but only have mediocre shell scripting skills. But I can do just about anything with PowerShell very easily.
       
 (DIR) Post #AGRdIKzx59agpC1OBU by urusan@fosstodon.org
       2022-02-13T20:13:31Z
       
       1 likes, 0 repeats
       
       @macxool @ThatWouldBeTelling @benjaminhollon PowerShell is open source: https://github.com/PowerShell/PowerShell I don't see any issues with it.
       
 (DIR) Post #AGRdTOjLXvuvaLWoXg by ThatWouldBeTelling@detroitriotcity.com
       2022-02-13T20:15:51.769853Z
       
       0 likes, 1 repeats
       
       @macxool @benjaminhollon @urusan Hey, no problem, from what I’ve heard PowerShell having “objects” as the stuff it manipulates makes a whole lot of sense and is powerful.  And now that you’ve told me it’s been ported to Linux it’s gotten on my list.
       
       But mostly it was just thirty years too late for this old UNIX(TM) guy who also gave up on Windows after XP, or more like declined to move to Vista and started creating my first personal Linux systems.
       
       Here for wget there’s actually no scripting required, unless you have some way of generating filenames that you then feed into whatever scraper you use to do the work.
       
       Which does happen when for example page names are mechanically generated like [blah][incrementing “nnn”].html, or you can scrape something at the top level and massage and delete what you don’t want.
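The last idea in this post, scraping a top-level page and then trimming the result down to what you want, might look roughly like the following. It is a sketch using only Python's standard library; the `LinkCollector` class, the sample HTML and the example.com URLs are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


# Made-up stand-in for a top-level index page.
INDEX_HTML = """
<a href="page001.html">1</a>
<a href="page002.html">2</a>
<a href="/about.html">About</a>
<a href="clip.mp4">Video</a>
"""

collector = LinkCollector("https://example.com/blah/")
collector.feed(INDEX_HTML)

# "Massage and delete": keep only the mechanically named pages.
pages = [u for u in collector.links
         if u.endswith(".html") and "/blah/page" in u]
print("\n".join(pages))
```

The filtered list could then be written to a file and handed to wget with its `-i` option, so the download step itself still needs no scripting.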
       
 (DIR) Post #AGRdUlcBwnhQqlpngu by macxool@fosstodon.org
       2022-02-13T20:16:06Z
       
       1 likes, 0 repeats
       
       @urusan @ThatWouldBeTelling @benjaminhollon Yeah. I know. It's just the spawn of Microsoft ;-). I find it extremely easy to learn and work with; and, most importantly, to remember. I think it's a good tool for certain kinds of jobs.