[HN Gopher] Whoogle Search: A self-hosted, ad-free, privacy-resp...
___________________________________________________________________
Whoogle Search: A self-hosted, ad-free, privacy-respecting
metasearch engine
Author : wsc981
Score : 161 points
Date : 2021-08-27 10:43 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| keb_ wrote:
| I deployed an instance of this for my own use. Been using it as
| my daily search engine for a little over a month now. No
| complaints. One overlooked feature is that Whoogle supports DDG-
| style bangs!
| nathan_phoenix wrote:
| What's the difference with SearX[1]? SearX seems both more
| popular and supports other search engines besides Google.
|
| [1]: https://github.com/searx/searx
| bigethan wrote:
| It's mentioned in their FAQ[1]: less config, easier to set up.
|
| [1]:https://github.com/benbusby/whoogle-search#faq
| freediver wrote:
| The available public instances [1] seem to take 5-6 seconds for
| search results page.
|
| Is this to be expected with self-hosted too?
|
| [1] https://github.com/benbusby/whoogle-search#public-instances
| ivrrimum wrote:
| I have been hosting (and using as a full replacement) a
| self-hosted YouTube alternative (called Piped). Might switch to
| this engine as well. Goal is to be fully self-hosted! :)
| izzytcp wrote:
| This sounds too good to be true in practice. If I deploy it on a
| server and continuously query it from all of my devices, will
| Google ban that server IP?
| supah wrote:
| scraping yes. normal usage no.
| tyingq wrote:
| Isn't normal usage scraping in this case? Looking at whoogle-
| search/request.py it's "scraping" google urls via the python
| requests module. I'm reasonably sure google fingerprints
| requests and assigns different weights for "probably
| scraping". I wouldn't be surprised if this has a lower
| threshold for triggering their captchas and/or blocking.
| andai wrote:
| I often trigger Google's captchas during normal usage. It
| seems to suspect the more "advanced" features like
| "intitle:" or "inurl:", or if I search too rapidly. I take
| being mistaken for a machine as a compliment!
| ricardo81 wrote:
| That's understandable, a lot of exploit seekers use those
| features to find exploits, e.g. "powered by [cms with
| known exploit]". Google (and Bing) are definitely
| more prone to showing you a captcha for those searches,
| especially if you're looking beyond the first page.
| packet_nerd wrote:
| I set this up for my family and me a few months ago and we set
| all our browsers and devices to use it as the default search
| engine. We've been really happy with it.
|
| I also set up a small script on a cron job that queries random
| search strings every few minutes and opens the first few hits
| in selenium. My theory is that if I can't completely stop them
| from tracking us, I can at least dilute their data with bogus
| searches.
|
| We haven't had any issues from Google.
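
The query-generation half of a decoy script like the one described above could be sketched as follows. This is not packet_nerd's actual script: the word list, the `INSTANCE` address, and the `/search?q=` path are assumptions made for illustration, and the step that opens results in a browser is left out.

```python
import random
from urllib.parse import urlencode

# Hypothetical word list; a real script would likely draw from a larger source.
WORDS = ["weather", "recipe", "bicycle", "history", "guitar", "garden", "train", "museum"]

# Hypothetical address of a self-hosted instance (assumption, adjust as needed).
INSTANCE = "http://localhost:5000"


def random_query(rng, min_words=2, max_words=4):
    """Pick a few distinct words at random and join them into one query string."""
    count = rng.randint(min_words, max_words)
    return " ".join(rng.sample(WORDS, count))


def search_url(query, base=INSTANCE):
    """Build a search URL for the query; the /search?q= path is an assumption."""
    return f"{base}/search?{urlencode({'q': query})}"


if __name__ == "__main__":
    rng = random.Random()
    print(search_url(random_query(rng)))
```

Run from cron every few minutes, each invocation prints one decoy URL that a separate step could then fetch.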
| DavideNL wrote:
| I guess it will be just like Searx: https://searx.me/
|
| If an IP generates too many search queries, Google will block /
| throttle it...
| spinax wrote:
| I'm on a shared IP with millions of people using the same
| public IP (T-Mobile CGNAT), there is one IP (many of them,
| actually) doing that right now from every T-Mobile customer.
| Your one server will be a blip on their radar if it even
| registers.
| izzytcp wrote:
| Dude, thanks for letting me know that everyone uses Google.
|
| I'm talking about static IPs like servers in the cloud, not
| your home. That stuff is automated and I am sure static IPs
| get banned, but there must be a quota or something.
| corobo wrote:
| I get "you appear to be a robot" whenever I use my
| DigitalOcean box as an exit node. I'd imagine you'd have to
| host this at home or get really lucky there
|
| It happens the moment I switch it on and use Google, with no
| excessive searches etc.
| spinax wrote:
| Dude, CGNAT is handled at the ISP layer, I do not have an
| IPv4 address at all locally, it's a 464XLAT done on
| T-Mobile's side. All users come from a shared IPv4 on
| _their_ network, not mine. Dude.
| treesknees wrote:
| Dude - Each of those users behind the NAT will have a
| different set of cookies, user agents, screen sizes,
| among other fingerprints that qualify them as unique.
| ISPs also routinely place their CGNAT addresses on
| specific whitelists so that services don't block them for
| abuse (you can look through the NANOG email list to find
| examples of this.) IP addresses are also classified as
| residential, cloud/server, etc. If Google sees rapid
| requests from the same IP classified as a server that's
| sending a Python Requests user-agent, they can absolutely
| block it.
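
To illustrate the user-agent point above: HTTP clients in Python identify themselves by default unless told otherwise. The sketch below uses the standard library's `urllib.request` rather than the Requests module mentioned in the thread (a deliberate swap so it runs without extra packages), and it makes no claim about which headers Whoogle itself sends. The function names are hypothetical.

```python
import urllib.request


def default_user_agent():
    """Return the User-agent header that urllib attaches when none is set."""
    opener = urllib.request.build_opener()
    return dict(opener.addheaders)["User-agent"]


def request_with_user_agent(url, user_agent):
    """Build (but do not send) a request that carries an explicit User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


if __name__ == "__main__":
    # Shows the library-identifying default, e.g. a string starting "Python-urllib/".
    print(default_user_agent())
```

A server that sees the default string can trivially tell the traffic comes from a script, which is one of the signals the comment above describes.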
| 123pie123 wrote:
| How long should Google block the IP? When the ISP
| reassigns it to a new home, they would also be
| blocked from Google.
| anonymousisme wrote:
| Is Whoogle a Lougle wannabe?
|
| https://whatculture.com/film/10-famous-things-invented-movie...
| obiwanpallav1 wrote:
| Side question:
|
| If I can just instruct the browser to delete the sessions,
| cookies, localstorage, etc. after I close a Google search tab,
| then would it require us to self host Whoogle? This considers
| that I'll never login to Google using that browser and 3rd party
| cookies are disabled.
|
| Or, can Google still recognize me?
| option_greek wrote:
| If you consistently do that, you will start seeing captcha wall
| of hell. Google gets its pound of flesh one way or the other.
| sildur wrote:
| I once had to solve thousands of captchas for an archiving
| project, and buster helped me with a quarter of the captchas
| (https://github.com/dessant/buster)
| obiwanpallav1 wrote:
| I did not understand the last sentence. If I solve the
| captchas, get the search results and then do the cleanup,
| it'll just continuously ask for the captchas and that's the
| only added pain, no? Will they be able to conclude if all the
| requests are coming from the same user?
| sodality2 wrote:
| IP and other fingerprinting techniques are enough to
| identify you
| type0 wrote:
| It might even temporarily block you and put out the message
| that they are seeing suspicious "bot" activity from your
| device.
| option_greek wrote:
| Yes. But it's going to keep asking you the captchas till the
| user changes behavior (sample of 1) :) (and of course the
| ip address too like the other poster mentioned - you can
| try it by searching for insurance/anything with good money
| on your phone and switch to desktop - assuming both are
| connected through the same router).
| svieira wrote:
| Yes, Google can still recognize "you" for some variation of
| "you". Anecdote - the other day my wife searched for an address
| on her phone while on WiFi. I searched for that same address
| just one minute later on a different computer (on the same
| WiFi) and the address was auto-completed by Google _before
| there was enough of the address entered to make it
| unambiguous_.
|
| (Consider living in a neighborhood where all the streets around
| you start with "Fl". And then you go to search for "Flanders
| Drive", which you have never searched for before, and it gets
| auto-completed. Even though you would have expected "Fl" to
| expand to "Florence Road" since that's the thing you commonly
| search for. That's what happened here.)
| akie wrote:
| Install Firefox and then install the Google Container extension
| [1]. It keeps all your Google related stuff separate from the
| rest of the world.
|
| [1] https://addons.mozilla.org/en-US/firefox/addon/google-contai...
| elliekelly wrote:
| Do containers get around the fingerprinting issue though?
| stonewareslord wrote:
| They can through browser fingerprinting, yes.
| Canvas/webgl/fonts/IP/accelerometer/every other web api
| basically
|
| > If I can...
|
| You can! Install Firefox, Multi account containers (add-on by
| Mozilla), and Temporary Containers (third party addon). You can
| configure Temporary Containers to spawn a new container for
| every tab, or every google tab, etc. Each container is like a
| new browser session. It can clear the data of a closed
| container after 15 minutes.
| raffraffraff wrote:
| I love Firefox containers. One feature I've been waiting on
| for ages is per-container security settings.
| stonewareslord wrote:
| What kind of settings? If you want per-site settings there
| is (was) uMatrix which allows for extremely granular
| configuration. Sadly it is discontinued right now. But it
| still works.
| obiwanpallav1 wrote:
| Thanks for the info!
|
| One question: Let's assume that I've added Firefox containers
| and instructed it to open every tab in a new container. If I
| open a link from the Google search result in a new tab (that
| is inside a new container) then can Google still trace the
| flow because the opened links are from Google and not the
| actual search result and it may contain the tracking info?
| stonewareslord wrote:
| You should do 2 things to mitigate this:
|
| 1) Install ClearURLs. This addon strips tracking
| identifiers from URLs. If you hover over a Google search
| link, you'll see that without this addon it doesn't direct
| you to website.com; it directs you to google.com, which
| then forwards you to the site you clicked.
|
| 2) Configure Temporary Containers to make a new container
| for every different subdomain or domain. This way, if you
| click a link from google search, regardless of using
| ClearURLs, a new container spawns for any domain/subdomain
| that does not match (ex: click netflix.com from google and
| TemporaryContainers identifies this and spawns a second tab
| for netflix). This makes some things impossible, like SSO,
| so configuring it properly can be tricky. You might be able
| to configure it such that only links clicked from
| google.com spawn a new container and those that redirect to
| sso sites don't, but I haven't done this. You can always
| open a private window where the context is shared
| (temporary containers don't work in private) if you need
| SSO.
|
| Obviously there's more you have to do to be even safer
| because with pings on by default and js enabled on google,
| they can still see you clicked a link. Also, with Google
| Analytics (GA), they can infer someone searching "x" and
| then "another user" from the same IP fetches "x" GA
| tracking scripts a second later is the same person. The
| list goes on and Google _really_ likes tracking people, so
| it's very difficult to mitigate. The first and most
| important thing you can do is GET OFF CHROME/EDGE!
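
The link cleaning described in step 1 above can be sketched in a few lines. This is not how ClearURLs is implemented; it is a minimal illustration that assumes the redirect takes the form `google.com/url?q=<target>` (or `url=<target>`) and that tracking parameters are the `utm_*` family plus a couple of well-known click IDs. All function names are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed redirect hosts and tracking parameters, for illustration only.
REDIRECT_HOSTS = {"google.com", "www.google.com"}
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"gclid", "fbclid"}


def unwrap_redirect(url):
    """If url is a google.com/url redirect, return the wrapped target URL."""
    parts = urlsplit(url)
    if parts.hostname in REDIRECT_HOSTS and parts.path == "/url":
        params = dict(parse_qsl(parts.query))
        for key in ("q", "url"):
            if key in params:
                return params[key]
    return url


def strip_tracking(url):
    """Drop query parameters that match the assumed tracking names."""
    parts = urlsplit(url)
    kept = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k not in TRACKING_KEYS and not k.startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean(url):
    """Unwrap the redirect first, then strip tracking parameters from the target."""
    return strip_tracking(unwrap_redirect(url))
```

Calling `clean()` on a redirect link yields the destination URL with its non-tracking parameters intact; links that are not redirects pass through with only the tracking parameters removed.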
| feanaro wrote:
| Yes. There are addons that degooglify the links though.
| It's such an evil practice they've introduced.
___________________________________________________________________
(page generated 2021-08-27 23:01 UTC)