[HN Gopher] Show HN: Bernard - a link checker for your website
       ___________________________________________________________________
        
       Show HN: Bernard - a link checker for your website
        
       Introducing Bernard, the project I have been working on solo since
       November 2022. After months of planning, coding and talking with a
       small number of potential users, I feel it is time to launch and
       get some real feedback.   _GOAL_ : Bernard is an automated link
       checker that crawls your website daily, identifying broken links
       (internal and external), and missing redirects, such as after a
       redesign or URL change.   _TARGET AUDIENCE_ : Anyone with a website
       --be it a personal blog or a company portal--looking to prevent
       link rot, to keep old URLs reachable and to avoid the dreaded 404
       Page Not Found error.   _STACK_ : Elixir, with Phoenix LiveView
       for the interactive dashboard. It runs on PostgreSQL, hosted as
       Podman containers on a single dedicated Hetzner server.   _PLAN_ :
       I am releasing this as a free, open beta to address the last few
       issues, and hope to introduce paid plans in February 2024. While
       there is a backlog of improvements and bug fixes, my primary focus
       now is making sure the product is aligned with what users need.
       The pricing model is yet to be defined, but I'm considering a free
       tier for small website owners, and a usage-based plan starting from
       $10/mo for X,000 links checked per month. This would pair well with
       the upcoming REST API to provide a link-checker-as-a-service
       product to embed in third-party systems. (Feel free to contact me
       if you might be interested in using this API.)  I also wrote about
       my journey to get this off the ground, and the challenges I faced
       at https://combo.cc/posts/bernard-devlog-3-pre-launch-reflectio...
       Looking forward to your criticism and suggestions.
        
       Author : sph
       Score  : 78 points
       Date   : 2024-01-23 12:18 UTC (1 day ago)
        
 (HTM) web link (bernard.app)
 (TXT) w3m dump (bernard.app)
        
       | bapetel wrote:
       | I just tested it and I like it: I have 46 broken links,
       | certainly due to the latest changes I made. I knew I would
       | have broken links in the blog part specifically, but I was
       | procrastinating on checking them. GOOD JOB, I LIKE IT.
       | 
       | Now, regarding the pricing: I really don't like subscriptions,
       | we already deal with a lot of subscriptions every month, but
       | it's up to you.
       | 
       | I would also suggest giving the ability to pay once to check
       | the links. So when I come back, maybe 6 months later, I pay
       | once again, and so on.
       | 
       | I would also like to download the links as an HTML page and
       | consult them or do whatever I like with them, for example
       | scraping my own HTML to find broken links and taking some
       | action on a development server or in robot testing.
       | Downloading doesn't have to be free, though.
       | 
       | But it's nice, keep going and find your perfect market.
        
         | sph wrote:
         | Re: subscriptions--I hear you. They are certainly simpler for a
         | business to deal with, but I will consider the possibility of
         | being able to top up a plan at your own convenience.
         | 
         | Exporting the list in CSV format is planned in the very near
         | future, along with better filtering tools, search and ignore
         | patterns.
         | 
         | Thanks for taking the time to try it out :)
        
           | bapetel wrote:
           | you're welcome
        
       | hammyhavoc wrote:
       | What's the advantage over FOSS broken link checkers? Some even
       | have a pretty good GUI.
        
         | sph wrote:
         | Care to link any of them, please? I'm not sure which you refer
         | to.
         | 
         | I plan on writing a long post on my experience writing a web
         | crawler in Elixir, and I believe it is a harder problem than
         | many--including me one year ago--think it is. Web servers out
         | there are just broken, pages have invalid HTML that somehow
         | browsers are able to interpret, etc.
         | 
         | Here are some things Bernard does that many others don't:
         | 
         | * I respect robots.txt for each link I visit, and keep a
         | single connection per {host,port} so as not to hammer
         | popular domains (see the sketch at the end of this comment).
         | Maybe not ideal speed-wise, but I try to be a good Internet
         | citizen.
         | 
         | * Modern MITM-like services such as Cloudflare, Akamai, etc.
         | need to be accounted for, as they might return 403s even if
         | the link is working perfectly. I am planning on registering
         | as a "Cloudflare good bot" very soon; right now I simply
         | ignore and log 403s for Cloudflare-served domains. The goal,
         | ultimately, is to have as few false positives as possible,
         | even at the cost of some false negatives.
         | 
         | * In the future, Bernard will be able to parse CSS and JS
         | files, to find and test links to @font-face URLs, imported JS
         | scripts, etc.
         | 
         | * If there is enough demand, as it is quite involved
         | infrastructure-wise, I might consider adding support for
         | client-side only websites, through headless Chrome (on a
         | separate, more expensive tier).
         | 
         | * Not shipped in this MVP, but the original goal of Bernard
         | is to remember all the links it has seen, so it is able to
         | notify you if one 404s because you changed its URL, moved it
         | around, and forgot to set up a redirect. This happens far
         | too often in my experience, breaking bookmarks and causing
         | SEO issues.
         | 
         | And last, but not least, none of the alternatives I have
         | tried, freeware or FOSS, felt good, correct or well-designed
         | enough to use, from my point of view. I do believe there is
         | always room to do better than the status quo.
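         | 
         | To give an idea of the first point, the politeness logic
         | boils down to something like this (a simplified sketch, not
         | the exact code; module and function names are illustrative):
         | 
         |   # Illustrative sketch only, not the actual Bernard code.
         |   defmodule Bernard.Politeness do
         |     # Allow only one in-flight request per {host, port} by
         |     # tracking busy hosts in an Agent-held set.
         |     def start_link do
         |       Agent.start_link(fn -> MapSet.new() end, name: __MODULE__)
         |     end
         | 
         |     def checkout(%URI{host: host, port: port}) do
         |       key = {host, port}
         | 
         |       Agent.get_and_update(__MODULE__, fn busy ->
         |         if MapSet.member?(busy, key),
         |           do: {:busy, busy},
         |           else: {:ok, MapSet.put(busy, key)}
         |       end)
         |     end
         | 
         |     def checkin(%URI{host: host, port: port}) do
         |       Agent.update(__MODULE__, &MapSet.delete(&1, {host, port}))
         |     end
         | 
         |     # Very naive robots.txt check: treat every Disallow rule
         |     # as applying to us. A real parser needs user-agent
         |     # groups, Allow rules, wildcards and caching.
         |     def allowed?(robots_txt, path) do
         |       robots_txt
         |       |> String.split("\n")
         |       |> Enum.map(&String.trim/1)
         |       |> Enum.flat_map(fn
         |         "Disallow:" <> rule -> [String.trim(rule)]
         |         _other -> []
         |       end)
         |       |> Enum.reject(&(&1 == ""))
         |       |> Enum.all?(&(not String.starts_with?(path, &1)))
         |     end
         |   end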
        
           | ashishb wrote:
           | Here is the one I wrote for myself a few years back
           | https://github.com/ashishb/outbound-link-checker
        
             | latexr wrote:
             | Amusingly, the first image link on that README is broken.
        
       | AznHisoka wrote:
       | Our company site has a very strict anti-scraping Cloudflare
       | mechanism. Will your crawler be able to work around it?
        
         | sph wrote:
         | Which means that requests to your website might randomly return
         | HTTP 403. At the moment, this is such a common issue I just
         | ignore HTTP 403 errors when there's a Server: Cloudflare header
         | (and a handful of others).
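         | 
         | In code, that exception is roughly the following (a
         | simplified sketch, not the exact implementation; the real
         | check covers a handful of other vendors too):
         | 
         |   # Sketch; the name is made up. Treat a 403 as "unknown"
         |   # rather than broken when the response comes from
         |   # Cloudflare's edge, since it is likely bot mitigation
         |   # rather than a genuinely dead link.
         |   defp ignorable_403?(status, headers) do
         |     status == 403 and
         |       Enum.any?(headers, fn {name, value} ->
         |         String.downcase(name) == "server" and
         |           String.downcase(value) =~ "cloudflare"
         |       end)
         |   end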
         | 
         | That said, Cloudflare has a process to register as a "good bot"
         | [1], which I believe we are (we respect robots.txt, keep a
         | single connection alive per host, etc.), so it's possible that
         | once I set that up, the gods that control the Internet might
         | let us through.
         | 
         | In general though, Bernard follows the HTTP spec but cannot do
         | miracles. If your webserver is trigger-happy or very strict, it
         | is your responsibility to let us through if you want to use the
         | service.
         | 
         | 1: https://radar.cloudflare.com/traffic/verified-bots
        
       | loph wrote:
       | It would be useful to list the IP ranges that the scans come
       | from -- a lot of my web stuff drops hosting providers at the
       | firewall, because so much malicious traffic emanates from them
       | (do you hear me, Digital Ocean and AWS?)
        
         | sph wrote:
         | Good idea. For now, I advertise as "User-agent: bernard/1.0",
         | but I might create a .well-known URL listing all the IPs we
         | scan from. I already need to do that for Cloudflare.
        
       | bibliotekka wrote:
       | I like it and would love to replace our existing link checker
       | with something better at some point. A killer feature would be
       | the ability to notify specific individuals by email when
       | broken links are detected in their content. Love the dog
       | favicon, too. Our government entity overlords did not bat an
       | eye when budgeting $2000 USD per year for a service like this.
        
         | sph wrote:
         | The dog is supposed to be a St. Bernard :) I want to commission
         | a proper icon, but it's a bit further down my todo list.
         | 
         | Feel free to contact me if you might be interested in test
         | driving it directly or through a REST API, which I'm not
         | sure there is any demand for.
        
       | xnx wrote:
       | Good list of free and open source link checkers here:
       | https://www.devopsschool.com/blog/list-of-free-open-source-s...
       | 
       | I've been using Xenu's Link Sleuth
       | (https://home.snafu.de/tilman/xenulink.html) forever, but I
       | should probably try some others out and see if there's something
       | better now. Xenu's is also a funny throwback to the old web, sort
       | of like the Space Jam website.
        
         | croisillon wrote:
         | the list doesn't have links though, and w3c link checker is
         | there twice, is it some AI-generated text?
        
       | supz_k wrote:
       | This is cool! I recently worked on integrating something like
       | this into our blogging platform [0] to help bloggers monitor
       | their links automatically. One main problem is with popular
       | websites which have pretty aggressive bot prevention mechanisms.
       | They often return 5xx codes even in HEAD requests. How do you
       | combat that?
       | 
       | [0] https://blogs.hyvor.com
        
         | sph wrote:
         | I had to write a piece of logic that sends a GET when HEAD
         | fails, because of non-HTTP-compliant servers. HEAD should in
         | theory return the same status as a GET, but without any body.
         | In practice, many web servers return 404, sometimes 500. HN
         | itself returns 405 Method Not Allowed, which makes some sense,
         | I guess.
         | 
         | The code in question, simplified:
         | 
         |   defp try_get_request?(%Link{method: :head},
         |                         %StatusResponse{} = response) do
         |     # Per HTTP specification, the HEAD response should be
         |     # functionally equivalent to a GET, but shall not
         |     # contain any body.
         |     # Not all servers respect this, so might have a
         |     # different status response on HEAD than on GET.
         |     #
         |     # We assume that some HTTP status codes are suspicious
         |     # and worth retrying.
         |     #
         |     # HTTP 520 is seen with Cloudflare to mean "Web Server
         |     # Returned an Unknown Error", possibly in response to a
         |     # HEAD request.
         |     response.code in [403, 404, 405, 520]
         |   end
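         | 
         | For context, the caller uses that predicate roughly like
         | this (illustrative only; `Fetcher` is a made-up name
         | standing in for the HTTP client wrapper):
         | 
         |   # Sketch; Fetcher is a hypothetical client wrapper.
         |   defp check(link) do
         |     case Fetcher.request(:head, link) do
         |       {:ok, response} ->
         |         # Retry suspicious HEAD responses with a full GET
         |         # before declaring the link broken.
         |         if try_get_request?(link, response),
         |           do: Fetcher.request(:get, link),
         |           else: {:ok, response}
         | 
         |       error ->
         |         error
         |     end
         |   end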
        
       | stopachka wrote:
       | Just signed up, this is awesome! Ran it over my blog and found 8
       | broken links.
       | 
       | One light bug: links to HN stories are marked as 403 forbidden (I
       | guess the server blocks the crawler)
       | 
       | For the free tier: honestly I think you can charge even for small
       | blog owners. I would pay a small yearly fee to make sure links
       | are healthy
        
         | sph wrote:
         | Yes, I know. I have a task in my TODO list to check whether
         | the 403 comes from HN, and just ignore it. Thankfully,
         | news.ycombinator.com links never seem to break :)
        
       | Eiim wrote:
       | There's a bit of a "bug" I found where Bernard interprets
       | backslashes in file paths literally and URL-encodes them as %5C,
       | whereas modern browsers automatically correct them to /. You
       | could definitely argue that the URL is specified wrong (assuming
       | that / was indeed intended) but it causes inconsistencies between
       | Bernard and an actual user.
        
         | sph wrote:
         | Can you share an example, please?
         | 
         | I've had a lot of URL encoding-related issues in the past 24
         | hours, due to Elixir's URI library being a little too strict,
         | and a lot of websites having broken links that browsers are
         | able to work around and interpret just fine. I reckon I will
         | have to spend my weekend writing a bespoke URL encoder/decoder.
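         | 
         | The backslash case is a good example of the kind of fix-up
         | browsers do that a strict parser won't. A naive sketch of
         | the idea (illustrative only, not the actual encoder; a real
         | version should only touch the path component):
         | 
         |   # Illustrative sketch. Browsers following the WHATWG URL
         |   # spec treat backslashes in http(s) URLs as path
         |   # separators, while a strict parser percent-encodes them
         |   # as %5C.
         |   defp fixup_backslashes("http" <> _ = url),
         |     do: String.replace(url, "\\", "/")
         | 
         |   defp fixup_backslashes(url), do: url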
        
           | Eiim wrote:
           | Sure, the images on this page are where I noticed it:
           | https://chainswordcs.com/e-reader-protos.html
           | 
           | No doubt URL encoding is frustrating, good luck taking it on!
        
       | crowcroft wrote:
       | This is nice!
       | 
       | Would love to know whether you found a need to define any
       | custom logic for 'international' websites?
       | 
       | For example a lot of global companies will have
       | 
       | www.theirdomain.com/example-page
       | 
       | But then also have
       | 
       | www.theirdomain.com/us/example-page
       | www.theirdomain.com/au/example-page
       | www.theirdomain.com/nz/example-page etc...
       | 
       | If you're crawling from one regional or the 'global' entry point
       | you might not always find all of the regional variations of
       | pages. Sometimes the site might redirect you to your previous
       | region if you try to change.
       | 
       | A common tool SEO teams use is Screaming Frog, which offers
       | similar features; it's crazy that a web service isn't more
       | commonly used for this kind of thing.
        
         | sph wrote:
         | That's a good question. Right now you can only enter root
         | domains, but eventually I want to relax it and you'd be able to
         | enter `example.com/en` and `example.com/au` separately.
         | 
         | If the webserver redirects based on IP geolocation, it is
         | planned _way_ down the line to have a few Bernard workers
         | spread around the globe, but I'll be honest, it's not my
         | highest priority right now unless this is a common enough
         | issue.
        
           | crowcroft wrote:
           | Interesting, probably a good example of that last 2% of a
           | product that could easily take 80% of your time.
           | 
           | Big enterprises that might want this kind of thing are
           | probably also the types of people that are least likely to
           | use this kind of tool though.
        
       | sbdaman wrote:
       | For fun: run your local government's site through this.
        
       | lwhsiao wrote:
       | I gave this a shot and was shown
       | 
       | > We have analysed 68 links on your website and haven't found a
       | single issue.
       | 
       | I'd love to see what the actual list of links was, even if they
       | weren't broken.
       | 
       | I'm partially curious because I personally use lychee [1] for
       | this purpose, but when I run lychee (even while ignoring several
       | classes of links), it is checking _significantly_ more links.
       | E.g., the output looks like:
       | 
       | > 1401 Total (in 7s) 1361 OK 4 Errors 35 Excluded
       | 
       | Curious to understand what it is checking that Bernard is not.
       | 
       | [1]: https://github.com/lycheeverse/lychee
        
         | sph wrote:
         | What is the URL you are testing with? If you want, also
         | share the /demo URL and I'll check what is going on.
         | 
         | Seems like lychee also extracts links from text, Bernard does
         | not. Only links inside HTML tags are considered, text nodes are
         | ignored.
        
           | lwhsiao wrote:
           | https://bernard.app/demo/33b26fe3-6c67-4338-a6cd-
           | dd4e9204fbf...
        
           | lwhsiao wrote:
           | I also suspect a difference is that when I invoke lychee,
           | I'm running it on ALL of the *.html files output by my
           | static site generator. Perhaps Bernard is only testing a
           | single page?
           | 
           | If I only run lychee on the same single page, the numbers are
           | closer:
           | 
           | > 76 Total (in 1s) 73 OK 0 Errors 2 Excluded
        
             | sph wrote:
             | No, it follows every link recursively, as long as they are
             | part of the same domain.
             | 
             | So, if you add example.com, the crawl starts at / and all
             | links found therein are added to the queue, and so on. For
             | convenience www.example.com is also considered within crawl
             | scope (and vice versa). Everything else is treated as an
             | external link, so tested but not parsed.
             | 
             | Is it possible that your links are hosted on another
             | subdomain than the one you have entered? I'll do a deeper
             | dive into this issue tomorrow.
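             | 
             | Roughly, the scope check is something like this (a
             | simplified sketch, not the exact code; names are made
             | up):
             | 
             |   # Sketch. A link is crawled (parsed for more links)
             |   # only when its host is the root domain or its www.
             |   # twin; any other host is just tested as an external
             |   # link.
             |   defp in_crawl_scope?(%URI{host: host}, root_host) do
             |     host == root_host or
             |       host == "www." <> root_host or
             |       "www." <> host == root_host
             |   end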
        
       | KolmogorovComp wrote:
       | Tested on various open-source related websites, and while it
       | is usually correct, I found that for lichess.org all dead
       | links were false positives. A spurious connection when
       | scraping, or a block from the website, I don't know.
        
         | sph wrote:
         | I'll look into it, thanks!
        
       | graemep wrote:
       | I considered building something like this a year or two ago,
       | and was put off by a few things - marketing, the need to deal
       | with Cloudflare and the like (a lot of potential customers'
       | websites are entirely behind services that block bots) and all
       | the things you mention too.
       | 
       | I really like your blog and find it encouraging with regard to
       | other ideas I am thinking about.
        
         | toomuchtodo wrote:
         | Really surprised Cloudflare doesn't have this built in.
        
       | hiatus wrote:
       | Interesting, I helped someone with this recently. I did it
       | with a GitHub Action and an open-source link checker,
       | something like this:
       | 
       |   run: |
       |     yarn export:start &
       |     sleep 5
       |     wget https://github.com/filiph/linkcheck/releases/download/3.0.0/linkcheck-3.0.0-linux-x64.tar.gz &&
       |       tar xvf linkcheck-3.0.0-linux-x64.tar.gz
       |     linkcheck/linkcheck -d -e --skip-file .skip :3111 || export RETURN=$?
       |     killall node
       |     if [[ "$RETURN" -eq "0" ]]; then
       |       exit
       |     else
       |       exit "$RETURN"
       |     fi
        
       | podviaznikov wrote:
       | Going to keep an eye on this. Maybe I can integrate it into my
       | own service, https://montaigne.io.
       | 
       | I saw many broken links on users' sites and wanted to do this
       | myself, but didn't have time to implement it.
        
       | gtirloni wrote:
       | Does anyone know a good open source tool for checking broken
       | links? I've used get/curl with scripts but it's fragile.
        
         | bashy wrote:
         | https://github.com/spatie/http-status-check
        
       | city41 wrote:
       | One of my websites has links to nytimes.com. They work fine if
       | clicked on manually. Bernard reports them as a 403. I wonder if
       | NYT is classifying Bernard as a scraper?
        
         | floodle wrote:
         | I'm also seeing valid links to reuters.com coming up as 401
         | unauthorized
        
       ___________________________________________________________________
       (page generated 2024-01-24 23:01 UTC)