[HN Gopher] Whoogle Search: A self-hosted, ad-free, privacy-resp...
       ___________________________________________________________________
        
       Whoogle Search: A self-hosted, ad-free, privacy-respecting
       metasearch engine
        
       Author : wsc981
       Score  : 161 points
       Date   : 2021-08-27 10:43 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | keb_ wrote:
       | I deployed an instance of this for my own use. Been using it as
       | my daily search engine for a little over a month now. No
       | complaints. One overlooked feature is that Whoogle supports DDG-
       | style bangs!
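For reference, "deploying an instance" is only a couple of commands. These are a sketch based on the project README at the time of the thread, so flags and image names may have changed since:

```shell
# Option 1: install and run from PyPI (listens on port 5000 by default)
pip install whoogle-search
whoogle-search --host 127.0.0.1 --port 5000

# Option 2: run the Docker image
docker run -d -p 5000:5000 benbusby/whoogle-search
```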
        
       | nathan_phoenix wrote:
       | What's the difference with SearX[1]? SearX seems both more
       | popular and supports other search engines besides Google.
       | 
       | [1]: https://github.com/searx/searx
        
         | bigethan wrote:
          | It's mentioned in their FAQ[1]: less config, easier to set up
         | 
          | [1]: https://github.com/benbusby/whoogle-search#faq
        
       | freediver wrote:
        | The available public instances [1] seem to take 5-6 seconds to
        | return a search results page.
       | 
       | Is this to be expected with self-hosted too?
       | 
       | [1] https://github.com/benbusby/whoogle-search#public-instances
        
       | ivrrimum wrote:
       | I have been hosting(and using as a full replacement) a self-
       | hosted youtube alt(called piped). Might switch to this engine as
       | well. Goal is to be fully self hosted! :)
        
       | izzytcp wrote:
        | This sounds too good to be true in practice. If I deploy it on a
        | server and continuously query it from all of my devices, will
        | Google ban that server IP?
        
         | supah wrote:
          | Scraping, yes. Normal usage, no.
        
           | tyingq wrote:
           | Isn't normal usage scraping in this case? Looking at whoogle-
           | search/request.py it's "scraping" google urls via the python
           | requests module. I'm reasonably sure google fingerprints
           | requests and assigns different weights for "probably
           | scraping". I wouldn't be surprised if this has a lower
           | threshold for triggering their captchas and/or blocking.
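The fingerprinting point is easy to demonstrate: both the stdlib and the requests library announce themselves in their default User-Agent header unless a scraper overrides it. A minimal stdlib sketch (the spoofed value below is illustrative, not what Whoogle's request.py actually sends):

```python
import urllib.request

# A bare stdlib fetch advertises itself via its default User-Agent
# ("Python-urllib/3.x"); python-requests similarly sends
# "python-requests/x.y.z". Either is trivial to flag as "probably
# scraping" before looking at anything else.
opener = urllib.request.build_opener()
ua = dict(opener.addheaders).get("User-agent")
print(ua)  # e.g. "Python-urllib/3.9"

# A scraper that wants to blend in must at minimum send a browser-like
# User-Agent (this value is illustrative only):
opener.addheaders = [(
    "User-agent",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
)]
# opener.open("https://www.google.com/search?q=example")  # not run here
```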
        
             | andai wrote:
             | I often trigger Google's captchas during normal usage. It
             | seems to suspect the more "advanced" features like
             | "intitle:" or "inurl:", or if I search too rapidly. I take
             | being mistaken for a machine as a compliment!
        
               | ricardo81 wrote:
                | That's understandable: a lot of exploit seekers use
                | those features to find exploits, e.g. "powered by [cms
                | with known exploit]". Google (and Bing) are definitely
                | more prone to showing you a captcha for those searches,
                | especially if you're looking beyond the first page.
        
         | packet_nerd wrote:
          | I set this up for my family and me a few months ago, and we set
         | all our browsers and devices to use it as the default search
         | engine. We've been really happy with it.
         | 
         | I also set up a small script on a cron job that queries random
         | search strings every few minutes and opens the first few hits
         | in selenium. My theory is that if I can't completely stop them
         | from tracking us, I can at least dilute their data with bogus
         | searches.
         | 
         | We haven't had any issues from Google.
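The noise-generation idea above can be sketched in a few lines. Everything here is hypothetical (word list, function names, the Whoogle host); the Selenium part is left as comments since the real script's details aren't given:

```python
import random

# Sketch of the "bogus searches" cron job described above: pick a
# random query from a word pool on each run, then (in the real script)
# open the first few results with Selenium to dilute the tracking
# profile with noise.
WORDS = ["weather", "recipe", "python", "history", "prices",
         "reviews", "tutorial", "news", "maps", "lyrics"]

def random_query(n_words=2):
    """Build a random multi-word search string from the word pool."""
    return " ".join(random.sample(WORDS, n_words))

if __name__ == "__main__":
    query = random_query()
    print("would search for:", query)
    # The browser part is omitted; roughly, with Selenium:
    #   driver = webdriver.Firefox()
    #   driver.get("https://whoogle.example/search?q=" + quote(query))
    #   ...click the first few result links, then driver.quit()
```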
        
         | DavideNL wrote:
         | I guess it will be just like Searx: https://searx.me/
         | 
          | If an IP generates too many search queries, Google will block /
          | throttle it...
        
         | spinax wrote:
          | I'm on a shared IP with millions of people using the same
          | public IP (T-Mobile CGNAT); there are IPs (many of them,
          | actually) doing exactly that right now on behalf of every
          | T-Mobile customer. Your one server will be a blip on their
          | radar, if it even registers.
        
           | izzytcp wrote:
           | Dude, thanks for letting me know that everyone uses Google.
           | 
            | I'm talking about static IPs like servers in the cloud, not
            | your home. That stuff is automated, and I'm sure static IPs
            | get banned, but there must be a quota or something.
        
             | corobo wrote:
             | I get "you appear to be a robot" whenever I use my
             | DigitalOcean box as an exit node. I'd imagine you'd have to
             | host this at home or get really lucky there
             | 
              | It happens the moment I switch it on and use Google, with
              | no excessive searches or anything.
        
             | spinax wrote:
             | Dude, CGNAT is handled at the ISP layer, I do not have an
             | IPv4 address at all locally, it's a 464XLAT done on
             | T-Mobile's side. All users come from a shared IPv4 on
             | _their_ network, not mine. Dude.
        
               | treesknees wrote:
               | Dude - Each of those users behind the NAT will have a
               | different set of cookies, user agents, screen sizes,
               | among other fingerprints that qualify them as unique.
               | ISPs also routinely place their CGNAT addresses on
               | specific whitelists so that services don't block them for
               | abuse (you can look through the NANOG email list to find
               | examples of this.) IP addresses are also classified as
               | residential, cloud/server, etc. If Google sees rapid
               | requests from the same IP classified as a server that's
               | sending a Python Requests user-agent, they can absolutely
               | block it.
        
               | 123pie123 wrote:
                | How long should Google block the IP? When the ISP
                | reassigns it to a new home, that customer would also be
                | blocked from Google.
        
       | anonymousisme wrote:
       | Is Whoogle a Lougle wannabe?
       | 
       | https://whatculture.com/film/10-famous-things-invented-movie...
        
       | obiwanpallav1 wrote:
       | Side question:
       | 
        | If I can just instruct the browser to delete the sessions,
        | cookies, localstorage, etc. after I close a Google search tab,
        | do I still need to self-host Whoogle? This assumes that I'll
        | never log in to Google using that browser and that 3rd party
        | cookies are disabled.
        | 
        | Or can Google still recognize me?
        
         | option_greek wrote:
         | If you consistently do that, you will start seeing captcha wall
         | of hell. Google gets its pound of flesh one way or the other.
        
           | sildur wrote:
           | I once had to solve thousands of captchas for an archiving
           | project, and buster helped me with a quarter of the captchas
           | (https://github.com/dessant/buster)
        
           | obiwanpallav1 wrote:
            | I did not understand the last sentence. If I solve the
            | captchas, get the search results, and then do the cleanup,
            | it'll just continuously ask for the captchas, and that's the
            | only added pain, no? Will they be able to tell whether all
            | the requests are coming from the same user?
        
             | sodality2 wrote:
             | IP and other fingerprinting techniques are enough to
             | identify you
        
             | type0 wrote:
             | It might even temporarily block you and put out the message
             | that they are seeing suspicious "bot" activity from your
             | device.
        
             | option_greek wrote:
              | Yes. But it's going to keep asking you captchas until the
              | user changes behavior (sample of 1) :) (and of course the
              | IP address matters too, like the other poster mentioned;
              | you can test it by searching for insurance, or anything
              | else with good money behind it, on your phone and then
              | switching to desktop, assuming both are connected through
              | the same router).
        
         | svieira wrote:
         | Yes, Google can still recognize "you" for some variation of
         | "you". Anecdote - the other day my wife searched for an address
         | on her phone while on WiFi. I searched for that same address
         | just one minute later on a different computer (on the same
         | WiFi) and the address was auto-completed by Google _before
         | there was enough of the address entered to make it
         | unambiguous_.
         | 
         | (Consider living in a neighborhood where all the streets around
         | you start with "Fl". And then you go to search for "Flanders
         | Drive", which you have never searched for before, and it gets
         | auto-completed. Even though you would have expected "Fl" to
         | expand to "Florence Road" since that's the thing you commonly
         | search for. That's what happened here.)
        
         | akie wrote:
         | Install Firefox and then install the Google Container extension
         | [1]. It keeps all your Google related stuff separate from the
         | rest of the world.
         | 
         | [1] https://addons.mozilla.org/en-US/firefox/addon/google-
         | contai...
        
           | elliekelly wrote:
           | Do containers get around the fingerprinting issue though?
        
         | stonewareslord wrote:
         | They can through browser fingerprinting, yes.
         | Canvas/webgl/fonts/IP/accelerometer/every other web api
         | basically
         | 
         | > If I can...
         | 
         | You can! Install Firefox, Multi account containers (add-on by
         | Mozilla), and Temporary Containers (third party addon). You can
         | configure Temporary Containers to spawn a new container for
         | every tab, or every google tab, etc. Each container is like a
         | new browser session. It can clear the data of a closed
         | container after 15 minutes.
        
           | raffraffraff wrote:
            | I love Firefox containers. One feature I've been waiting on
            | for ages is per-container security settings.
        
             | stonewareslord wrote:
              | What kind of settings? If you want per-site settings,
              | there is (was) uMatrix, which allows for extremely
              | granular configuration. Sadly it's discontinued now, but
              | it still works.
        
           | obiwanpallav1 wrote:
           | Thanks for the info!
           | 
            | One question: let's assume that I've added Firefox containers
            | and instructed them to open every tab in a new container. If
            | I open a link from the Google search results in a new tab
            | (that is, inside a new container), can Google still trace the
            | flow, given that the opened links point to Google rather than
            | the actual result and may contain tracking info?
        
             | stonewareslord wrote:
             | You should do 2 things to mitigate this:
             | 
              | 1) Install ClearURLs. This addon strips tracking
              | identifiers from URLs. Without this addon, if you hover a
              | Google search link you'll see it doesn't direct you to
              | website.com; it directs you to google.com, which then
              | forwards you to the site you clicked.
             | 
             | 2) Configure Temporary Containers to make a new container
             | for every different subdomain or domain. This way, if you
             | click a link from google search, regardless of using
             | ClearURLs, a new container spawns for any domain/subdomain
             | that does not match (ex: click netflix.com from google and
             | TemporaryContainers identifies this and spawns a second tab
             | for netflix). This makes some things impossible, like SSO,
             | so configuring it properly can be tricky. You might be able
             | to configure it such that only links clicked from
             | google.com spawn a new container and those that redirect to
             | sso sites don't, but I haven't done this. You can always
             | open a private window where the context is shared
             | (temporary containers don't work in private) if you need
             | SSO.
             | 
              | Obviously there's more you have to do to be even safer,
              | because with pings on by default and JS enabled on Google,
              | they can still see you clicked a link. Also, with Google
              | Analytics (GA), they can infer that someone searching "x"
              | and "another user" from the same IP whose browser fetches
              | GA tracking scripts a second later are the same person.
              | The list goes on, and Google _really_ likes tracking
              | people, so it's very difficult to mitigate. The first and
              | most important thing you can do is GET OFF CHROME/EDGE!
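A ClearURLs-style link cleaner can be sketched in a few lines. This is a minimal illustration of the redirect-unwrapping idea only; the real addon handles many more redirectors and also strips tracking parameters:

```python
from urllib.parse import urlparse, parse_qs

def degooglify(link):
    """Return the real target of a Google redirect URL, or the link
    unchanged if it isn't one. A minimal sketch of what ClearURLs-style
    addons do, not the addon's actual logic."""
    parsed = urlparse(link)
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        params = parse_qs(parsed.query)
        # Google has used both ?q= and ?url= for the target
        target = params.get("q") or params.get("url")
        if target:
            return target[0]
    return link

print(degooglify(
    "https://www.google.com/url?q=https://example.com/page&sa=U&ved=xyz"
))  # -> https://example.com/page
```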
        
             | feanaro wrote:
             | Yes. There are addons that degooglify the links though.
             | It's such an evil practice they've introduced.
        
       ___________________________________________________________________
       (page generated 2021-08-27 23:01 UTC)