# web, gemini, ftp, open, distributed

I want to close my tabs, and it seems like a waste to just close them. So here we go.

This started with someone on IRC bringing up an old topic we talk about every so often. They want some decentralized way to search the internet, and my original reaction was something like: get site authors to implement search on their own sites using some common API, then users can search all of those sites through it. Usually my brain went in the OpenSearch direction, where people point to xml documents that describe how to query the site's search, with various output formats that clients can then parse reliably, like rss or atom. (There's an example of one of these documents near the end of this post.)

=> https://github.com/dewitt/opensearch

Then I was thinking, maybe site admins could provide the database they'd be using for their site's search directly, and let the user search it however many times they want without needing to bother the site more than once. Of course then you end up with users needing to download a huge fulltext search database even if they only want to search once, but I remembered seeing someone make a website that would load an sqlite database a few bits at a time, as needed.

=> https://boredcaveman.xyz/post/0x1_dbless-torrent-website.html this thing
=> https://boredcaveman.xyz/demo/megacat/ its live demo

I had mistakenly remembered it storing the sqlite database in a torrent. It actually keeps the database on ipfs, though the orange site comments I found mention other projects this was based on that /did/ store the database in a torrent.

=> https://news.ycombinator.com/item?id=29920043
=> https://github.com/lmatteis/torrent-net
=> https://github.com/bittorrent/sqltorrent
=> http://bittorrent.org/beps/bep_0046.html

Knowing it was possible to query sqlite databases this way, I decided the next step would be to figure out a good format for sites to provide their self-index in. Preferably something already in common use, like something a crawler would output. I didn't find anything that fit exactly, but I did find a tool that converts warc files into an sqlite db and lets you query it.

=> https://github.com/Florents-Tselai/WarcDB

Places that use warc files are like, archive.org and commoncrawl. It would have been really nice if archive.org and commoncrawl were already using an sqlite-based format. Would have been a lot easier to plug stuff together and make magic.

=> https://archive.org/download/warc-ffshrine.org example archive.org link that has warc files
=> https://commoncrawl.org/the-data/get-started/ commoncrawl
=> https://news.ycombinator.com/item?id=31799147 orange site comments on warcdb

I just looked up what yacy uses for its database. It seems to be some custom thing that might end up being better than warcdb, but I'm not about to write code for it.

=> https://wiki.yacy.net/index.php/En:FAQ#What_kind_of_database_do_you_use.3F_Is_it_fast_enough.3F

After thinking about applying this to gemini, which might be a bit easier, I remembered that FTP has been doing this kind of thing for quite a while, by placing ls-lR files in the root directory. It's not quite the same, though, because ls-lR is just metadata and wouldn't be usable for full-text search.

Having some pointer in my robots.txt to URLs that contain my own site's self-crawl database would be a lot easier on my server, especially if other sites are going to be the interface end-users search my site through.

The format I'm thinking of at the moment for providing crawl data of a gemini site is kind of like: a tgz of message/gemini files, each named after the request used to retrieve that gemini response. URI-escaped, of course, since slashes in filenames would cause trouble. Something like the sketch below.
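None of this tooling exists; this is just me guessing at the shape of it in python. Pretend the (url, raw response bytes) pairs come from crawling your own capsule:

```python
import io
import tarfile
from urllib.parse import quote

def write_self_index(pages, out_path="self-index.tgz"):
    """pages: iterable of (request_url, raw_response_bytes) pairs.

    Each raw gemini response (header line plus body) becomes one
    tar member, named after the percent-encoded request URL.
    """
    with tarfile.open(out_path, "w:gz") as tar:
        for url, raw in pages:
            # safe="" so "/" gets escaped too, otherwise the tar
            # member names would sprout directories
            info = tarfile.TarInfo(name=quote(url, safe=""))
            info.size = len(raw)
            tar.addfile(info, io.BytesIO(raw))

# write_self_index([("gemini://example.org/", b"20 text/gemini\r\n# hi\n")])
```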
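On the robots.txt side there's no real field for pointing at such a thing, so it would have to be a made-up one, in the same spirit as the (real) Sitemap: extension. Self-Index: here is entirely my invention:

```
# hypothetical field, not part of any robots.txt spec
Self-Index: gemini://example.org/self-index.tgz
```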
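Back on the sqlite side of the idea: once you have somebody's index db downloaded, searching it locally is the easy part. A sketch, assuming the db ships an FTS5 table named pages with url and body columns (warcdb's real schema is different, this is just the shape of it):

```python
import sqlite3

def search(db_path, query):
    con = sqlite3.connect(db_path)
    try:
        # rank sorts best matches first; snippet() pulls a short
        # highlighted excerpt out of the body column (index 1)
        return con.execute(
            "SELECT url, snippet(pages, 1, '[', ']', '...', 8) "
            "FROM pages WHERE pages MATCH ? ORDER BY rank",
            (query,),
        ).fetchall()
    finally:
        con.close()
```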
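And for the don't-download-the-whole-thing version, the trick behind that dbless demo boils down to fetching byte ranges on demand instead of the whole file. A real client implements an entire sqlite VFS on top of that; this is just the first step of the dance, grabbing the 100-byte sqlite header to learn the page size (assuming the server honors Range requests):

```python
from urllib.request import Request, urlopen

def sqlite_page_size(url):
    # fetch only the first 100 bytes: the sqlite file header
    req = Request(url, headers={"Range": "bytes=0-99"})
    with urlopen(req) as resp:
        header = resp.read(100)
    assert header[:16] == b"SQLite format 3\x00"
    # the page size lives at offset 16, 2 bytes big-endian;
    # the value 1 means 65536 by definition
    size = int.from_bytes(header[16:18], "big")
    return 65536 if size == 1 else size
```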
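For reference, the OpenSearch description document from the top of this post is a lot less machinery. One looks roughly like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>example.org</ShortName>
  <Description>Search example.org</Description>
  <Url type="application/atom+xml"
       template="https://example.org/search?q={searchTerms}"/>
</OpenSearchDescription>
```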
Anyway. All my tabs are closed now.