[HN Gopher] A Unix-style personal search engine and web crawler ...
___________________________________________________________________
A Unix-style personal search engine and web crawler for your
digital footprint
Author : amirGi
Score : 257 points
Date : 2021-07-26 16:09 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| ctocoder wrote:
 | I wrote something of the same ilk but got distracted:
| https://github.com/dathan/go-find-hexagonal
| yunruse wrote:
| I love this idea, but the name "digital footprint" sort of
| implies it's what effect you've had on the Internet for helping
| keep your online persona under control: your tweets, comments,
| emails, et cetera.
|
| But this is a great idea! Having a search engine for vaguely
 | _anything_ you touch does look like it'd increase the
 | signal:noise ratio. It'd be interesting to be able to add whole
| sites (using, say, DuckDuckGo as an external crawler) to be able
| to fetch general ideas, such as, say, "Stack Exchange posts
| marked with these tags".
| flanbiscuit wrote:
| > but the name "digital footprint" sort of implies it's what
| effect you've had on the Internet for helping keep your online
| persona under control: your tweets, comments, emails, et
| cetera.
|
| I had the exact same thought when I saw that in the title. That
| would also be a cool idea to be able to search within your own
| online accounts.
|
 | Here is the project's description of what "digital
 | footprint" means:
|
| > Apollo is a search engine and web crawler to digest your
| digital footprint. What this means is that you choose what to
| put in it. When you come across something that looks
| interesting, be it an article, blog post, website, whatever,
| you manually add it (with built in systems to make doing so
| easy). If you always want to pull in data from a certain data
| source, like your notes or something else, you can do that too.
| This tackles one of the biggest problems of recall in search
| engines returning a lot of irrelevant information because with
| Apollo, the signal to noise ratio is very high. You've chosen
| exactly what to put in it.
|
| If I'm interpreting this correctly, this seems like an
| alternative way of bookmarking with advanced searching because
| it scrapes the data from the source. Cool idea, means I have to
| worry less about organizing my bookmarks.
| zerop wrote:
 | How is it different from Instapaper-like services? There is
 | also an open-source alternative to Instapaper called Wallabag.
| fidesomnes wrote:
| Adding support for transcribed voice notes like from Otter would
| be nice.
| dpcx wrote:
| Similar also to Promnesia
| (https://github.com/karlicoss/promnesia), which includes a
| browser extension to search the records.
| dandanua wrote:
| A similar tool - https://github.com/go-shiori/shiori
| encryptluks2 wrote:
| Shiori is "okay" but is not actively being maintained at all.
| The original author abandoned it and the new maintainer
| apparently never planned on supporting it.
| toomanyducks wrote:
| If nothing else, that README is fantastic!
| soheil wrote:
| Has the author tried pressing CMD+Y to view and search browser
| history?
| ThinkBeat wrote:
| I use Evernote for this.
|
 | You can set it to save a link, a screenshot, or the content of
 | the page. You can add tags if you want, and it is also easy to
 | annotate it so you can remember the context better. You can
 | also add links to other posts inside Evernote.
|
 | Pocket is also a great tool I used for many years. Quite
 | similar, yet different.
|
| Both have browser extensions, so it is easy to clip.
|
 | With Evernote I even have shortcuts defined so I don't have to
 | click for the webpage to be clipped.
| Minor49er wrote:
| This looks really cool. It's beyond the scope of this project,
| but I think that having something like this as a browser
| extension would make it easier to use: instead of manually
| copying and scraping links, it could index and save pages that
| you've been on, placing much more significance on anything that
| you've bookmarked. Granted, this is just an immediate thought.
| I'm going to give this a proper try once I have some more spare
| time.
| ya1sec wrote:
| Great thought. I've adopted a similar workflow using the
| https://www.are.na/ chrome extension to save links to channels.
 | Might be a nice touch to feed channels into the engine using
 | their API.
| Minor49er wrote:
 | This looks like a fun way to explore topics. I just signed up.
| pantulis wrote:
| Reminds me a lot of DEVONthink for Mac
| MisterTea wrote:
| > I've wasted many an hour combing through Google and my search
| history to look up a good article, blog post, or just something
| I've seen before.
|
 | This is the fault of web browser vendors, who have yet to give
 | a damn about bookmarks.
|
| > Apollo is a search engine and web crawler to digest your
| digital footprint. What this means is that you choose what to put
| in it. When you come across something that looks interesting, be
| it an article, blog post, website, whatever, you manually add it
| (with built in systems to make doing so easy).
|
| So it's a searchable database for bookmarks then.
|
| > The first thing you might notice is that the design is
| reminiscent of the old digital computer age, back in the Unix
| days. This is intentional for many reasons. In addition to paying
| homage to the greats of the past, this design makes me feel like
| I'm searching through something that is authentically my own.
| When I search for stuff, I genuinely feel like I'm travelling
| through the past.
|
| This does not make any sense. It's Unix-like because it feels
| old? It seems like the author thoroughly misses the point of unix
| philosophy.
| chris_st wrote:
 | > _So it's a searchable database for bookmarks then._
|
 | It appears to be that, but it also seems to pull out the
 | _content_ of the web page and index that too, so you can
 | (presumably) find stuff that isn't in the "pure" bookmark,
 | which I think of as a link with maybe a title.
| nextaccountic wrote:
| I think browsers should download a full copy of each bookmark
| (so you can still see it when they are taken down) and make
| it fully searchable.
|
| Actually, I've been trying to find Firefox extensions that
| give a better interface to bookmarks and there doesn't seem
 | to be one. It's as if people don't use bookmarks anymore,
 | accept that they might as well not exist, and use something
 | else instead.
|
| It's telling that Firefox has two bookmark systems built-in
| (pocket and regular bookmarks) and they aren't integrated
| with each other; I suppose that people that use pocket never
| think about regular bookmarks.
|
 | edit: but my pet peeve is that it isn't easy to search
 | history for something I saw 10 days ago when I don't remember
 | the exact keywords.
| forgotpwd16 wrote:
| >I think browsers should download a full copy of each
| bookmark [...] and make it fully searchable.
|
| This, outside a browser, could be implemented as a
| server/client self-hosted solution with a back-end taking
| care of downloading/searching and an extension acting as
| client. Maybe it could even be made entirely as extension?
| berkes wrote:
| That would miss all the personalized content, all the
| content behind authorization and so on.
|
 | At the very least, it would need to be able to get the
 | content pushed to it by the client, the way the client
 | has it at the moment of bookmarking, making the
 | download/scraping kind of superfluous.
|
 | Indexing and search, however, are hard but solved. Hard
 | in the sense that they are not something a Firefox
 | add-on could do very well. I presume a (self-)hosted
 | meilisearch would suffice, though.
| huanwin wrote:
| You and GP might find ArchiveBox to have overlap with
| what you're describing?
| https://github.com/ArchiveBox/ArchiveBox
|
| Edit: here's the description from their repo
|
| "ArchiveBox is a powerful, self-hosted internet archiving
| solution to collect, save, and view sites you want to
| preserve offline.
|
| You can set it up as a command-line tool, web app, and
| desktop app (alpha), on Linux, macOS, and Windows.
|
| You can feed it URLs one at a time, or schedule regular
| imports from browser bookmarks or history, feeds like
| RSS, bookmark services like Pocket/Pinboard, and more.
| See input formats for a full list.
|
| It saves snapshots of the URLs you feed it in several
| formats: HTML, PDF, PNG screenshots, WARC, and more out-
| of-the-box, with a wide variety of content extracted and
| preserved automatically (article text, audio/video, git
| repos, etc.). See output formats for a full list."
| phildenhoff wrote:
| The difference, to me, about Pocket is that I use it
| specifically as a to-read list. My list is just "sites I
| want to visit/read/watch later", whereas bookmarks are more
| of "I want to go here regularly". Also, all the bookmark
| systems I've ever used treat links as files that can only
| be in one folder, whereas Pocket at least has tags so links
| can associate with multiple topics.
| [deleted]
| forgotpwd16 wrote:
| >at least has tags so links can associate with multiple
| topics
|
| This has applied since ever to regular bookmarks as well.
| Basically you can just throw everything in unsorted and
| use tags only.
| cassepipe wrote:
| Firefox has bookmark tags
| joshuaissac wrote:
 | Older versions of IE used to have something like this.
 | "Favourites" had a "Make available offline" box that could
 | be ticked to keep an offline copy of the page. But they
 | were not searchable.
| cratermoon wrote:
| > I think browsers should download a full copy of each
| bookmark
|
| Have you tried Zotero?
| totetsu wrote:
 | Zotero is great for this. Set up a WebDAV Docker
 | container and you can sync it easily too.
| nojito wrote:
| Safari reader list does this and it's awesome.
| throwawayboise wrote:
 | In Firefox, _File_ -> _Save Page As..._ lets me do this.
 | Local search tools should be able to index such archives
 | (if they can index Word documents, they should be able to
 | index HTML). Seems a fairly solved problem if it's
 | something you need?
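Stripping saved HTML down to indexable text needs nothing beyond the standard library. A minimal sketch (the `TextExtractor` class and `html_to_text` helper are illustrative assumptions, not part of any tool mentioned above):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Reduce a saved page to plain text suitable for a local search index."""
    extractor = TextExtractor()
    extractor.feed(html)
    return " ".join(extractor.chunks)
```

A local indexer could run this over every file saved via _Save Page As..._ before handing the text to whatever index it uses.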
| asdff wrote:
| Pocket isn't for bookmarks. It's a reading list. Safari and
| Chrome have this feature too.
| medstrom wrote:
| If you don't categorize bookmarks anyway, Pocket and
| equivalent might be all-around better than bookmarks.
| chillpenguin wrote:
| Agree with the unix bit. I was expecting something "unix
| philosophy" but it turns out they just meant it looks retro.
| jll29 wrote:
| "looks (intentionally) retro" like Serenity OS.
| 1vuio0pswjnm7 wrote:
| "It seems like the author thoroughly misses the point of the
| unix philosophy."
|
 | It's like a re-interpretation of history where AT&T still
 | controls UNIX. (What do people think of AT&T these days?)
|
| "The first thing you might notice ..."
|
| First thing I notice is this project is 100% tied to Google,
| what with Chrome and Go (even for SNOBOL pattern matching,
| sheesh).
|
| "... this design makes me feel like I'm searching through
| something that is authentically my own."
|
| Except it isn't. It shuns the use of freely available, open-
| source UNIX-like projects in favor of software belonging to a
| company that Hoovers up personal data and sells online ad
| services. Enjoy the illusion. :)
|
| Life can be very comfortable inside the gilded cage.1 The
| Talosians will take good care of you.2
|
| 1. https://en.wikipedia.org/wiki/Gilded_cage
|
| 2. https://en.wikipedia.org/wiki/Talosians
| stevekemp wrote:
| I've been thinking recently it might be interesting/useful to
| write a simple SOCKS proxy which could be used by my browser.
|
| The SOCKS proxy would not just fetch the content of the page(s)
| requested, but would also dump them to
| ~/Archive/$year/$month/$day/$domain/$id.html.
|
 | Of course I'd only want to archive text/plain and text/html,
 | but it seems like a simple thing to write and might be
 | useful. Searching would then be a simple matter of grep.
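The archiving half of that proxy is only a few lines. A sketch in Python rather than a full SOCKS implementation (the `archive_path`/`archive_response` helpers and the directory layout are assumptions taken from the comment above, not existing code):

```python
import os
from datetime import date
from urllib.parse import urlparse

ARCHIVE_ROOT = os.path.expanduser("~/Archive")

# Per the comment above: only textual responses are worth keeping.
ARCHIVABLE_TYPES = ("text/plain", "text/html")


def archive_path(url: str, page_id: str) -> str:
    """Build ~/Archive/$year/$month/$day/$domain/$id.html for a fetched URL."""
    today = date.today()
    domain = urlparse(url).netloc
    return os.path.join(
        ARCHIVE_ROOT,
        f"{today.year:04d}", f"{today.month:02d}", f"{today.day:02d}",
        domain, f"{page_id}.html",
    )


def archive_response(url: str, page_id: str, content_type: str, body: bytes) -> bool:
    """Dump a proxied response into the archive tree; returns True if saved."""
    if content_type.partition(";")[0].strip() not in ARCHIVABLE_TYPES:
        return False  # skip images, video, etc.
    path = archive_path(url, page_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(body)
    return True
```

Searching the result really is a simple matter of `grep -r term ~/Archive`.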
| habibur wrote:
| Did that. But then you will find your disk quickly getting
| filled up with GBs of cached contents that you rarely search
| within.
|
 | Rather, when you need that same content, you find yourself
 | going to Google, searching for it, and the page is instantly
 | there unless it has been removed.
 |
 | There's a reason bookmarks aren't as popular as they once
 | were. People now use Google + keywords instead of bookmarks.
| berkes wrote:
| It would also miss all the pages that are built from ajax-
| requests on the client side. Which, nowadays, is a large
| amount. The client is the one assembling all the content
| into the thing you read and so it is the most likely
| candidate to offer the copy that you want indexed.
| kbenson wrote:
| Maybe archive.org should run a subscription service where
| for a few bucks a month, you can request your page visits
| be archived (in a timely manner and with some level of
| assurance) and leverage their system for tracking content
| over time. That, in conjunction with something like Google,
| might actually give fairly good assurance that what you're
| searching for actually exists in a state like you saw it,
| while also leveraging that 30 people accessing this blog
| today that use the service don't use significantly more
| resources to store the data, and also helps archive.org
| fulfill its mission.
| ryandrake wrote:
| > This does not make any sense. It's Unix-like because it feels
| old? It seems like the author thoroughly misses the point of
| unix philosophy.
|
 | Yeah, I couldn't figure out what makes it Unix-like either. I
| mean, which UNIX in particular? Solaris? AIX? HP-UX? Do you use
| UNIX commands to navigate it? Is there a shell or something?
| Kind of odd way to describe it.
| chillpenguin wrote:
| Usually when someone says something is unix-like, they mean
| it "embraces unix philosophy", which usually means something
| like it operates on stdin/stdout so it can be composed in a
| pipeline on the shell.
|
 | Which is why I was misled in this case :)
| jll29 wrote:
| Microsoft Research's Dr. Susan Dumais is the expert on this kind
| of personal information management.
|
| Her landmark system (and associated seminal SIGIR'03 paper)
| "Stuff I've Seen" tackled re-finding material:
| http://susandumais.com/UMAP2009-DumaisKeynote_Share.pdf
| simonw wrote:
| My version of this is https://dogsheep.github.io/ - the idea is
| to pull your digital footprint from various different sources
| (Twitter, Foursquare, GitHub etc) into SQLite database files,
| then run Datasette on top to explore them.
|
| On top of that I built a search engine called Dogsheep Beta which
| builds a full-text search index across all of the different
| sources and lets you search in one place:
| https://github.com/dogsheep/dogsheep-beta
|
| You can see a live demonstration of that search engine on the
| Datasette website: https://datasette.io/-/beta?q=dogsheep
|
| The key difference I see with Apollo is that Dogsheep separates
| fetching of data from search and indexing, and uses SQLite as the
| storage format. I'm using a YAML configuration to define how the
| search index should work:
| https://github.com/simonw/datasette.io/blob/main/templates/d... -
| it defines SQL queries that can be used to build the index from
| other tables, plus HTML fragments for how those results should be
| displayed.
| gizdan wrote:
| Wow! That's super cool. I will have to check this out at some
| point. Am I correct in understanding that the pocket tool
| actually imports the URLs contents? If not, how hard would it
| be to include the actual content of URLs? Specifically, I'll
| probably end up using something else (for me NextCloud
| bookmarks).
| simonw wrote:
| Sadly not - I'd love it to do that, but the Pocket API
| doesn't make that available.
|
| I've been contemplating building an add-on for Dogsheep that
| can do this for any given URL (from Pocket or other sources)
| by shelling out to an archive script such as
| https://github.com/postlight/mercury-parser - I collected
| some suggestions for libraries to use here:
| https://twitter.com/simonw/status/1401656327869394945
|
 | That way you could save a URL using Pocket or browser
 | bookmarks or Pinboard or anything else that I can extract
 | saved URLs from, and a separate script could then archive
 | the full contents for you.
| neolog wrote:
| SingleFile and SingleFileZ are chrome extensions that
| export full web pages pretty effectively.
|
| https://chrome.google.com/webstore/detail/singlefile/mpiodi
| j...
|
| https://chrome.google.com/webstore/detail/singlefilez/offkd
| f...
| tomcam wrote:
| Holy crap you should submit as a Show HN
| mosselman wrote:
| Simon is not an unknown on HN.
| simonw wrote:
| It's failed to make the homepage a few times in the past:
| https://hn.algolia.com/?q=dogsheep - the one time it did make
| it was this one about Dogsheep Photos:
| https://news.ycombinator.com/item?id=23271053
| ryanfox wrote:
| I run a similar project: https://apse.io
|
| It runs locally on your laptop/desktop, so you don't need a
| server to host anything.
|
| Also, it can index _everything_ you do, not just web content.
|
| It works really well for me!
| totetsu wrote:
 | There used to be an activity timeline journal program I ran on
 | Ubuntu that let me see which days I accessed which files. It
 | was very useful as a student.
| cratermoon wrote:
| Interesting project but some of what the author writes just
| sounds flat-out weird. "The first thing you might notice is that
| the design is reminiscent of the old digital computer age, back
| in the Unix days."
|
| "Apollo's client side is written in Poseidon."
|
 | I had to look that up: Poseidon is not a language, it's just a
 | JavaScript framework for event-driven DOM updates.
| wydfre wrote:
| It seems pretty cool - but I think falcon[0] is more practical.
| You can install it from the chrome extension store[1], if you are
| too lazy to get it running yourself.
|
| [0]: https://github.com/lengstrom/falcon
|
| [1]:
| https://chrome.google.com/webstore/detail/falcon/mmifbbohghe...
| grae_QED wrote:
| Are there any Firefox equivalents to Falcon? I'm very
| interested in something like this.
| news_to_me wrote:
| If it's a WebExtension, it's usually not too hard to port to
| Firefox (https://developer.mozilla.org/en-
| US/docs/Mozilla/Add-ons/Web...)
| nathan_phoenix wrote:
 | In the issues, someone says it works even in FF. You just
 | need to change the extension of the file, though I haven't
 | tried it yet.
|
| https://github.com/lengstrom/falcon/issues/73#issuecomment-6.
| ..
| soheil wrote:
 | There is something really strange about a lot of recent Go
 | projects, including this one. I can't put my finger on it, but
 | the combination of the author and the type of problem they
 | choose to tackle often seems baffling to me.
| solving a problem that is often misidentified or otherwise badly
| solved, but somehow the focus ends up being on the code
| architecture or the UI design. It's like they're trying to solve
| a problem just for the sake of writing some code and the correct
| way to use Go idiomatically or something and don't really care
| about the problem or how well the solution actually works.
| asdff wrote:
| I think projects like this are just resume builders. Everyone
| says "show a project on github," well here is one of these
| projects. The dev is probably hoping this helps land them a job
 | offer. It's fine if the project is ultimately "lame" in some
 | way, since it's not the job description of a developer to make
 | a cool unique app, but to follow orders from the project
 | manager and write code, which is what this project shows this
 | dev can do.
| jrm4 wrote:
 | Yeah, as a bit of an old-timer, I'm trying to learn to stop
 | worrying and love watching everybody reinvent wheels. For me
 | it's "why are you people doing that in JavaScript?" that
 | continually comes up in my own head, but I suppose I should
 | try to be patient and see if anything comes of it.
| rhn_mk1 wrote:
| This seems similar to recoll augmented with recoll-we.
|
| https://addons.mozilla.org/en-US/firefox/addon/recoll-we/
| SahAssar wrote:
| Looks very much like one of the ideas I've been thinking of
| building! The way I planned to do it was to use a similar
| approach to rga for files ( https://github.com/phiresky/ripgrep-
 | all ) and to have a webextension pull all webpages I visit
 | (filtered via something like
 | https://github.com/mozilla/readability ), then dump that into
 | either sqlite with FTS5 or postgres with FTS for search.
|
| A good search engine for "my stuff" and "stuff I've seen before"
| is not available for most people in my experience. Pinboard and
| similar sites fill some of that role, but only for things that
| you bookmark (and I'm not sure they do full-text search of the
| documents).
|
| ---
|
| Two things I'd mention are:
|
 | 1. Digital footprint usually means your info on other sites,
 | not just things you've accessed. If I read a blog, that is
 | not part of my footprint, but if I leave a comment on that
 | blog, that comment is. The term is also mostly used in a
 | tracking and negative context (although there are
 | exceptions), so you might want to change that:
 | https://en.wikipedia.org/wiki/Digital_footprint
|
 | 2. I don't really get what makes it UNIX-style (or what
 | exactly you mean by that? There seem to be many definitions),
 | and the readme does not seem to clarify much besides expecting
 | me to notice it by myself.
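The sqlite-with-FTS5 half of that plan is already quite compact. A minimal sketch (the table and column names are invented for illustration, and the rows stand in for what a webextension plus Readability would feed in):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real tool would use a file on disk

# FTS5 virtual table: every column is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE pages USING fts5(url, title, body)")

conn.executemany(
    "INSERT INTO pages VALUES (?, ?, ?)",
    [
        ("https://example.com/fts", "SQLite FTS5", "full text search inside sqlite"),
        ("https://example.com/cats", "Cats", "an article about cats"),
    ],
)

# MATCH queries the full-text index; bm25() ranks hits by relevance.
rows = conn.execute(
    "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY bm25(pages)",
    ("search",),
).fetchall()
print(rows)  # only the first page contains the token 'search'
```

The postgres variant would swap the virtual table for a `tsvector` column and `@@` queries, but the shape of the tool is the same.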
| eddieh wrote:
| I've been toying with an idea like this too. I set my browser
| to never delete history items years ago, so I have a huge
 | amount of daily web use that needs to be indexed. The
 | browser's built-in history search has saved me a few times,
 | but it is so primitive it hurts.
| grae_QED wrote:
| >I don't really get what makes it UNIX-style
|
| I think what they meant was that it's an entirely text based
| program. Perhaps they are conflating UNIX with CLI.
| alanh wrote:
 | A code comment in the readme describes the Record as
 | constituting an 'interverted index'. A typo for inverted?
 | Though it is not obvious to me what would make this an
 | inverted index instead of a normal index.
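For reference, an inverted index maps each term to the documents containing it, whereas a forward index maps each document to its terms. A toy sketch of the distinction (in Python, not the project's Go):

```python
from collections import defaultdict


def build_inverted_index(docs: dict) -> dict:
    """Map each token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return dict(index)


docs = {
    "post-1": "unix style search engine",
    "post-2": "personal search tools",
}
index = build_inverted_index(docs)
print(sorted(index["search"]))  # ['post-1', 'post-2']: both documents contain the term
```

The inversion is what lets a query term jump straight to its posting list instead of scanning every document.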
| [deleted]
| etherio wrote:
| This is cool! Similar to one of the goals I'm trying to
| accomplish with Archivy (https://archivy.github.io) with the
| broader goal of not just storing your digital presence but also
| acting as a personal knowledge base.
| kordlessagain wrote:
 | Cool! It's great to see others thinking about this. I've been
 | working on https://mitta.us for a while now and it uses Solr,
 | a headless browser, and Google Vision to snapshot and index
 | full text. The UI is a bit odd, but you can just append
 | mitta.us/ to any URL to save it.
| encryptluks2 wrote:
| Why do all these bookmark projects:
|
| 1. Rely on JavaScript for the interface. Being built in Go, why
| not just paginate the results and utilize Bleve or Xapian for
| search?
|
| 2. Store data in a format that is not easily readable by itself.
| The only exception to this is nb.
|
 | 3. Have poor CLI tools. I look to rclone, Hugo, kubectl, etc.
 | for the right way to build a CLI.
___________________________________________________________________
(page generated 2021-07-26 23:00 UTC)