Post ApbNBKtk9zL9FdjnDk by futurebird@sauropods.win
Post #Apaj0OHbEsTIOg7dOi by futurebird@sauropods.win
2024-12-31T02:25:58Z
0 likes, 0 repeats
I haven't thought "I should try to build my *own* web spider, then maybe I could find things" since... well, since 1998. :/
Post #Apaj1EqToRGnPuqDz6 by tsturm@famichiki.jp
2024-12-31T02:46:39Z
0 likes, 0 repeats
@futurebird I recently have been thinking of what it would take to run my own spider... for the first time in about 25 years. The search results I'm getting lately are so bad that a DIY spider might actually improve the situation for me.
Post #Apaj1JXsL2MJzfAwBE by JessTheUnstill@infosec.exchange
2024-12-31T02:50:14Z
0 likes, 0 repeats
The problem is only partly that Google has gotten so much worse. It's also that SEO, botspam, LLM spam, and affiliate link spam have gotten so good that it's functionally impossible to algorithmically filter them out of the results. So just running your own spider is unlikely to matter much. @tsturm @futurebird
Post #Apaj1YVEj8A2Np3XeK by futurebird@sauropods.win
2024-12-31T02:51:54Z
0 likes, 0 repeats
@JessTheUnstill @tsturm Well I'm thinking of doing something a little smaller and more targeted, like this: https://sauropods.win/@futurebird/113744151630008623 Because making a proper full web spider is a massive project. And even my small idea could be too big.
Post #ApajsGFdJhT57mtE2a by djsumdog@djsumdog.com
2024-12-31T03:03:08Z
0 likes, 0 repeats
I use Linkding to host my own personal bookmarks. The only crawler I've looked into was Apache Nutch, but I haven't tried running it yet. I did run YaCy for a while, but it just kinda crashed a lot. Recently someone directed me to this, which seems alright: https://wiby.me and I kinda want to try this one out: https://presearch.io/
Post #Apal3J11Tlpkpzt0YC by JessTheUnstill@infosec.exchange
2024-12-31T03:09:59Z
0 likes, 0 repeats
Not a bad idea! My (vaguely) related idea is to fork a Fediverse app / make a browser plugin that caches and indexes only the Fediverse posts that I've browsed - whether on my timeline or on the explore page or whatever. Then I could search the content I've had access to, and I don't feel like I'd be violating anyone's privacy by caching and indexing content I've already been allowed to view, exclusively for my own personal use. Obviously, there'd be other problems if I started crawling and indexing content for public usage, but I think using a computer to augment my own fallible memory would be acceptable, so I can find the posts I wanted to remember 2 weeks later. @futurebird @tsturm
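
A rough sketch of the indexing half of that idea in Python, assuming the client or plugin can hand each viewed post to a local store. The file name, table name, and function names are all made up for illustration; SQLite's FTS5 module does the actual searching:

import sqlite3

# Personal "posts I've already seen" index, using SQLite full-text search.
conn = sqlite3.connect("seen_posts.db")
conn.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS seen_posts
    USING fts5(post_id UNINDEXED, author, content)
""")

def remember(post_id: str, author: str, content: str) -> None:
    """Record a post as it scrolls past in the timeline."""
    conn.execute("INSERT INTO seen_posts VALUES (?, ?, ?)",
                 (post_id, author, content))
    conn.commit()

def recall(query: str) -> list[tuple[str, str]]:
    """Find a half-remembered post by any words still in memory."""
    rows = conn.execute(
        "SELECT post_id, author FROM seen_posts WHERE seen_posts MATCH ?",
        (query,))
    return list(rows)

Since only posts that were actually displayed ever reach remember(), the index never holds anything its owner wasn't already allowed to view.
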
Post #Apal3Nz50QFaFQ11iS by JessTheUnstill@infosec.exchange
2024-12-31T03:11:06Z
0 likes, 0 repeats
With the content already cached and indexed, it'd also be possible to strip out just the links to see what interesting stuff had popped up. @futurebird @tsturm
Post #ApamZqhBtP4GW1gwUK by JessTheUnstill@infosec.exchange
2024-12-31T03:32:24Z
0 likes, 0 repeats
Eh, there are way too many Mastodon/Fediverse server forks. I'm never going to have the time and focus to try and mess with all that. I at least have a chance at making a client-side goodie that lets me search stuff out of my browser cache. @dalias @futurebird @tsturm
Post #Apb7WpbrZVTxJfbO1Q by BrettCoulstock@adforward.org
2024-12-31T07:28:29Z
0 likes, 0 repeats
@futurebird This guy wrote his own search engine, and it's fun and interesting to play with. It finds a lot of different and idiosyncratic content... https://www.marginalia.nu/
Post #ApbLcmJc8DfCC8nPqC by dahukanna@mastodon.social
2024-12-31T10:06:26Z
0 likes, 1 repeats
@futurebird To remove & externalise bookmark dependency from browsers, I’ve resorted to manually collecting & curating links as I find them, with personal notes+tags reminding me why they are of interest. They’re always 100% searchable & findable. Given the inconsiderate, DDoS-like behavior of AI scraper bots, adding to that melee with more robo-indexing may not produce a usable search index - https://mastodon.social/@dahukanna/113741237599333856
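
That hand-curated collection needs very little machinery. One hedged illustration of a per-link record (the field names are invented here), appended to a plain JSON-lines file so it stays searchable outside any browser:

import json

# One record per curated link: the URL plus the personal note and tags
# that make it findable later. All field names are illustrative.
bookmark = {
    "url": "https://example.com/some-article",
    "note": "why this mattered to me when I saved it",
    "tags": ["ants", "biology"],
    "saved": "2024-12-31",
}

# A flat JSON-lines file is grep-able and trivially re-indexable.
with open("bookmarks.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(bookmark) + "\n")
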
Post #ApbNBKtk9zL9FdjnDk by futurebird@sauropods.win
2024-12-31T10:23:56Z
0 likes, 0 repeats
@dahukanna I'm thinking of something much more modest: https://sauropods.win/@futurebird/113744151630008623
Post #ApbRINYv3LUTMEbwem by dahukanna@mastodon.social
2024-12-31T11:09:57Z
0 likes, 0 repeats
@futurebird … extract links from within the post and links to the source post?
Post #ApbS8SkF1sA8UPk56O by futurebird@sauropods.win
2024-12-31T11:19:25Z
0 likes, 0 repeats
@dahukanna I think so, yes. Basically I want a database of every single link that's been posted to *my* feed. It would also contain any hash tags used with the link, and the post ID so I can go back and see the context. Next I'd strip out all of the "big sites" and focus more on the obscure. Then if I'm curious about, say, #fossils, I would get links mentioned in that context. And if #fossils is often used with the tag #crinoids, I could move laterally and find more links.
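
A minimal sketch of that database in Python with SQLite, under the assumptions in the post above (links keyed to post IDs, hashtags stored alongside, big sites filtered out). Every table, field, and site name here is illustrative, not anyone's actual design:

import sqlite3
from urllib.parse import urlparse

conn = sqlite3.connect("feed_links.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS links (
    url     TEXT NOT NULL,
    post_id TEXT NOT NULL,   -- so the original context can be revisited
    PRIMARY KEY (url, post_id)
);
CREATE TABLE IF NOT EXISTS link_tags (
    url     TEXT NOT NULL,
    post_id TEXT NOT NULL,
    tag     TEXT NOT NULL,   -- a hashtag used in the same post
    PRIMARY KEY (url, post_id, tag)
);
""")

# Illustrative "big sites" blocklist, for focusing on the obscure.
BIG_SITES = {"youtube.com", "twitter.com", "facebook.com"}

def record_post(post_id: str, urls: list[str], tags: list[str]) -> None:
    """Store every link in a post together with the post's hashtags."""
    for url in urls:
        host = urlparse(url).netloc.removeprefix("www.")
        if host in BIG_SITES:
            continue
        conn.execute("INSERT OR IGNORE INTO links VALUES (?, ?)",
                     (url, post_id))
        for tag in tags:
            conn.execute("INSERT OR IGNORE INTO link_tags VALUES (?, ?, ?)",
                         (url, post_id, tag.lower()))
    conn.commit()

def links_for_tag(tag: str) -> list[str]:
    """Every link ever seen alongside a given hashtag, e.g. "fossils"."""
    rows = conn.execute("SELECT DISTINCT url FROM link_tags WHERE tag = ?",
                        (tag.lower(),))
    return [r[0] for r in rows]
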
Post #ApbSJMIzJO1theBrTU by futurebird@sauropods.win
2024-12-31T11:21:26Z
0 likes, 1 repeats
@dahukanna Importantly, this database would grow over time; it wouldn't be focused on "what's new"... Basically I have a high level of trust in the way people #onhere associate hash tags with links, and I think that'd be a great way to find things. In fact I do it manually often enough, but it's time-consuming. I just want all of the links sometimes.
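
The lateral move described here, from #fossils to tags that often ride along with it, is one self-join away, reusing the hypothetical link_tags table and connection from the sketch above:

def cooccurring_tags(tag: str, limit: int = 10) -> list[tuple[str, int]]:
    """Tags most often used in the same posts as `tag`,
    e.g. surfacing "crinoids" from "fossils"."""
    rows = conn.execute("""
        SELECT b.tag, COUNT(*) AS n
        FROM link_tags a
        JOIN link_tags b ON a.post_id = b.post_id AND b.tag != a.tag
        WHERE a.tag = ?
        GROUP BY b.tag
        ORDER BY n DESC
        LIMIT ?
    """, (tag.lower(), limit))
    return list(rows)

And because nothing ever expires from the tables, the index accumulates over time instead of chasing "what's new".
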
Post #ApbSg30BFO04UlDD04 by graveolensa@mathstodon.xyz
2024-12-31T11:25:26Z
0 likes, 0 repeats
@futurebird I have been trying to collect information on a local web server, so I don't need to have it come over the network every time I want to see it (keep local copies of things which matter!)
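
A bare-bones version of that keep-local-copies habit, sketched with only Python's standard library; the mirror directory and URL-to-path mapping are assumptions, not anyone's actual setup. Pages saved this way can then be served by any local web server (e.g. python -m http.server):

import urllib.request
from pathlib import Path
from urllib.parse import urlparse

def save_local_copy(url: str, root: str = "mirror") -> Path:
    """Fetch a page once and file it under a local docroot so it
    never has to come over the network again."""
    parsed = urlparse(url)
    # Map the URL onto a filesystem path under the mirror root.
    rel = (parsed.netloc + parsed.path).rstrip("/")
    dest = Path(root) / rel
    if dest.suffix == "":
        dest = dest / "index.html"
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as resp:
        dest.write_bytes(resp.read())
    return dest

# Usage: save_local_copy("https://example.com/article")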