[HN Gopher] In search of the perfect URL validation regex (2010)
___________________________________________________________________
In search of the perfect URL validation regex (2010)
Author : Jonhoo
Score : 105 points
Date : 2021-09-25 17:28 UTC (5 hours ago)
(HTM) web link (mathiasbynens.be)
(TXT) w3m dump (mathiasbynens.be)
| mpeg wrote:
 | I once failed a technical interview, partly because in the
 | coding test I was asked to write a URL parser "from scratch,
 | the way a browser would do it". I explained that it would take
 | far too long to account for every edge case in the URL RFC,
 | but that I could do a quick-and-dirty approach for common URLs.
|
| After I did this, the interviewer stopped me and told me in a
| negative way that he expected me to use a regex, which kinda
| shows he had no idea how a web browser works.
| axiosgunnar wrote:
| How do browsers parse URLs then?
| wolfgang42 wrote:
| Here's a polyfill for the JS URL() interface which should
| give you a taste: https://github.com/zloirock/core-
| js/blob/272ac1b4515c5cfbf34... (I tried finding the one in
| Firefox but I couldn't actually work out where it started,
| this one is much easier to follow)
|
| TLDR: it's a traditional parser--a big state machine that
| steps through the URL character by character and tokenizes it
| into the relevant pieces.
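As a toy illustration of that approach (my own sketch, nowhere near the WHATWG algorithm or the linked polyfill), a single pass over the characters with one state variable can already split a simple absolute URL:

```javascript
// Minimal single-pass URL tokenizer: walk the string once and
// switch states when a delimiter (":", "/", "?", "#") appears.
// Handles only simple absolute URLs: no userinfo, IPv6, escaping.
function parseUrl(input) {
  const out = { scheme: "", host: "", port: "", path: "", query: "", fragment: "" };
  let state = "scheme";
  let buf = "";
  for (const ch of input) {
    switch (state) {
      case "scheme":
        if (ch === ":") { out.scheme = buf; buf = ""; state = "slashes"; }
        else buf += ch;
        break;
      case "slashes": // skip the "//" between scheme and host
        if (ch !== "/") { buf = ch; state = "host"; }
        break;
      case "host":
        if (ch === ":") { out.host = buf; buf = ""; state = "port"; }
        else if (ch === "/") { out.host = buf; buf = "/"; state = "path"; }
        else buf += ch;
        break;
      case "port":
        if (ch === "/") { out.port = buf; buf = "/"; state = "path"; }
        else buf += ch;
        break;
      case "path":
        if (ch === "?") { out.path = buf; buf = ""; state = "query"; }
        else if (ch === "#") { out.path = buf; buf = ""; state = "fragment"; }
        else buf += ch;
        break;
      case "query":
        if (ch === "#") { out.query = buf; buf = ""; state = "fragment"; }
        else buf += ch;
        break;
      case "fragment":
        buf += ch;
        break;
    }
  }
  out[state === "slashes" ? "path" : state] = buf; // flush the last component
  return out;
}
```

Real parsers add dozens more states (userinfo, IPv6 brackets, percent-decoding, relative references), but the shape — one character at a time, one state variable — is the same.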
| djur wrote:
| There's actually a standard for it these days.
|
| https://url.spec.whatwg.org/#url-parsing
| specialist wrote:
 | Ya. I've also suffered copypasta trials administered by bar
 | raisers, Mensa members, and other self-appointed keepers of
 | the sacred nerd flame.
|
| My imagined remedies are no 1:1 interviews and recording these
| sessions for "possible quality assurance and training
| purposes".
| elif wrote:
 | How would you even "parse" a URL with a regex? Dynamically
 | defined named subpatterns for each URL parameter? I think the
 | best I could do on paper with a regex is say "yup, this is a
 | URL" or maybe "yup, I can count the number of params".
 |
 | Unless it was a specific URL with specific params?
| tyingq wrote:
| I assume they meant _" some regex implementation, including
| replace and/or match groups"_.
|
| Like, for just the params part (yes, broken and simplistic):
 |     #!/usr/bin/perl
 |     $_ = "a=b&c=d&e=f&whatever=some thing";
 |     while (s/^([^&]*)=([^&]*)(&|$)//) {
 |         print "[$1] [$2]\n";
 |     }
| chrismorgan wrote:
| Match groups so you can split it up into scheme, username,
| password, host, port, path, query, fragment. Not difficult to
| approximate, though for best results with diverse schemes
| you'd want an engine that allows repeated named groups, and I
| don't know if any do (JavaScript and Python don't).
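A (non-repeated) named-group pattern along those lines might look like the following in JavaScript — a rough sketch for illustration, not an RFC-3986-complete parser:

```javascript
// Rough splitter into scheme, username, password, host, port,
// path, query, and fragment via named capture groups. Many valid
// URLs (IPv6 hosts, relative refs, etc.) are not handled.
const urlParts =
  /^(?<scheme>[a-z][a-z0-9+.-]*):\/\/(?:(?<username>[^:@\/]*)(?::(?<password>[^@\/]*))?@)?(?<host>[^:\/?#]*)(?::(?<port>\d+))?(?<path>\/[^?#]*)?(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?$/;

const m = "https://user:pw@ex.com:8080/a?b=1#c".match(urlParts);
console.log(m.groups);
```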
| powersnail wrote:
 | Python's `regex` package does allow repeated named groups.
| mercora wrote:
| its not very likely this is whats happening here but i feel
| like this could be done on purpose to see how you act in this
| kind of situation. it kinda tells how you would act once you
| inevitably go into a conflict with colleagues arguing over
| stuff like that.
| codetrotter wrote:
| In that case I think the proper response should be: "I am
| very sure that browsers don't do it that way. But let's have
| a look." And then pull up the source code for Chromium and
| Firefox. Assuming it's not whiteboard only.
|
| And if they still insist even after the source of Chromium
| and FF has been consulted. Well then it's time to leave.
| Don't want to work with anyone like that.
| axiosgunnar wrote:
| How do browsers parse URLs then?
| Sephr wrote:
| See https://chromium.googlesource.com/chromium/src/+/HEAD
| /url/#c...
| tapland wrote:
| Or it could be one of those outsourced interviews.
| [deleted]
| jhgb wrote:
| Did you point out that his two requirements were contradictory?
| maciejgryka wrote:
 | Using https://regex.help/, I got this beauty, which passes all
 | the examples that should pass. Obviously some room for improvement ;)
| But it works! ^(?:http(?:(?:://(?:(?:(?:code\.goo
| gle\.com/events/#&product=browser|\-\.~_!\$&'\(\)\*\+,;=:%40:80%2
| f::::::@ex\.com|foo\.(?:bar/\?q=Test%20URL\-encoded%20stuff|com/(
| ?:\(something\)\?after=parens|unicode_\(\)_in_parens|b_(?:\(wiki\
| )(?:_blah)?#cite\-1|b(?:_\(wiki\)_\(again\)|/))))|uid(?::password
| @ex\.com(?::8080)?/|@ex\.com(?::8080)?/)|www\.ex\.com/wpstyle/\?p
| =364|223\.255\.255\.254|udaahrnn\.priikssaa|1(?:42\.42\.1\.1/|337
| \.net)|mthl\.khtbr|df\.ws/123|a\.b\-c\.de|\.ws/|[?]\.ws/|Li Zi
| \.Ce Shi |j\.mp)|142\.42\.1\.1:8080/)|\.damowmow\.com/)|s://(?:ww
| w\.ex\.com/foo/\?bar=baz&inga=42&quux|foo_bar\.ex\.com/))|://(?:u
| id(?::password@ex\.com(?::8080)?|@ex\.com(?::8080)?)|foo\.com/b_b
| (?:_\(wiki\))?|[?]\.ws))|ftp://foo\.bar/baz)$
|
 | I had to replace some words with shorter ones to squeeze under
 | the 1000-character limit, and there's no way to provide
 | negative examples right now. Something to fix!
| axiosgunnar wrote:
 | > [?]\.ws
|
| I guess this is the regex equivalent of overfitting :)
| saghm wrote:
| Yeah, not to mention "code.google.com" being right in there!
| maciejgryka wrote:
| Yeah, grex (the library powering this) is really cool, but
| doesn't generalize very well. I'm sure there are ways to
| improve it, but it's not a trivial thing to do.
| Sephr wrote:
| > Assume that this regex will be used for a public URL shortener
| written in PHP, so URLs like http://localhost/, //foo.bar/,
| ://foo.bar/, data:text/plain;charset=utf-8,OHAI and
| tel:+1234567890 shouldn't pass (even though they're technically
| valid)
|
| At Transcend, we need to allow site owners to regulate any
| arbitrary network traffic, so our data flow input UI1 was
| designed to detect all valid hosts (including local hosts, IDN,
| IPv6 literal addresses, etc) and URLs (host-relative, protocol-
| relative, and absolute). If the site owner inputs content that is
| not a valid host or URL, then we treat their input as a regex.
|
| I came up with these simple utilities built on top of the URL
| interface standard2 to detect all valid hosts & URLs:
|
| * isValidHost:
| https://gist.github.com/eligrey/6549ad0a635fa07749238911b429...
|
| Example valid inputs: host.example
| hazimeyou.minna (IDN domain; xn--p8j9a0d9c9a.xn--q9jyb4c)
| [::1] (IPv6 address) 0xdeadbeef (IPv4 address;
| 222.173.190.239) 123.456 (IPv4 address; 123.0.1.200)
| 123456789 (IPv4 address; 7.91.205.21) localhost
|
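Those IPv4 shorthand claims can be checked against the WHATWG URL parser itself (e.g. in Node), which recognizes hex, short-dotted, and plain-integer hosts as IPv4 addresses and normalizes them to dotted-quad form:

```javascript
// The WHATWG host parser normalizes alternative IPv4 notations.
const hosts = ["0xdeadbeef", "123.456", "123456789"].map(
  (h) => new URL(`http://${h}/`).hostname
);
console.log(hosts);
```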
| * isValidURL (and isValidAbsoluteURL):
| https://gist.github.com/eligrey/443d51fab55864005ffb3873204b...
|
| Example valid inputs to isValidURL:
| https://absolute-url.example //relative-protocol.example
| /relative-path-example
|
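The core trick — a sketch of the same idea, not the linked gists themselves — is to let the URL constructor do the parsing; the base URL below is a made-up placeholder whose only job is to let relative inputs resolve:

```javascript
// Valid if the WHATWG parser accepts it, relative inputs allowed
// (the base URL "https://base.example" is an arbitrary placeholder).
function isValidURL(input) {
  try {
    new URL(input, "https://base.example");
    return true;
  } catch {
    return false;
  }
}

// Valid only if the input parses with no base, i.e. is absolute.
function isValidAbsoluteURL(input) {
  try {
    new URL(input);
    return true;
  } catch {
    return false;
  }
}
```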
| 1. https://docs.transcend.io/docs/configuring-data-flows
|
| 2. https://developer.mozilla.org/en-US/docs/Web/API/URL
| mercora wrote:
| while not terribly important or outright not required this
| fails (treats urls as regex) for link-local addresses with
| device identifier (zone-id) applied like
| "[fe80::8caa:8cff:fe80:ff32%eth0]" although that would need to
| be fixed in the standard if its desired :)
|
| i've found some reasoning[0] as to why its not supported with
| browsers in mind though.
|
| [0] https://www.w3.org/Bugs/Public/show_bug.cgi?id=27234#c2
| gregsadetsky wrote:
| I was just struggling with this -- specifically, our users' "UX"
| expectation that entering "example.com" should work when asked
| for their website URL.
|
 | Most URL validation rules/regexes/libraries/etc. reject
| "example.com". However, if you head over to Stripe (for example),
| in the account settings, when asked for your company's URL,
| Stripe will accept "example.com", and assume "http://" as the
| prefix (which yes, can have its own problems)
|
 | What's a good solution? I want both to validate URLs and to
 | let users enter "example.com". But if I simply do
 |     if (validateURL(url)) { return true; }
 |     else if (validateURL("http://" + url)) { return true; }
 |     else { return false; }
 |
 | i.e. validate the given URL and, as a fallback, try to validate
 | "http://" + the given URL, that opens the door to weird non-URL
 | strings being incorrectly validated...
|
| Help :-)
| Rebelgecko wrote:
| This could potentially be abused, but you could actually try to
| resolve the DNS to determine if it's valid (could be weird for
| some cases like localhost or IP addresses). Or just do a "curl
| https://whatever.com" and see what happens (assuming that all
| of the websites are running a webserver, although idk if that
| is true in your situation)
| dilatedmind wrote:
| i would suggest bias your implementation against false
| negatives. They can always come back and update it if it's
| wrong, and their url could just as easily be "valid" but
| incorrect, eg any typo in a domain name.
|
| if it's really important, you could try making a request to the
| url and see if it loads, but that still doesn't validate its
| the url they intended to input.
|
| might be cool to load the url with puppeteer and capture a
| screenshot of the page. if they can't recognize their own
| website, it's on them.
| anderskaseorg wrote:
| Parse, don't validate. If you need a heuristic that accepts
| non-URL strings as if they were valid URLs, you should
| _convert_ those non-URL strings to valid URLs so the rest of
 | your code can just deal with valid URLs.
 |     if (validateURL(url)) {
 |       return url;
 |     } else if (validateURL("http://" + url)) {
 |       return "http://" + url;
 |     } else {
 |       return null;
 |     }
| JadeNB wrote:
| I know we're not golfing, but it pains me to see that
| repetition in the middle. Mightn't we write
 |     if (!validateURL(url)) {
 |       url = "http://" + url;
 |       if (!validateURL(url)) {
 |         url = null;
 |       }
 |     }
 |     return url;
 |
 | to snip a small probability of a bug?
|
| to snip a small probability of a bug?
| wolfgang42 wrote:
| I find that branchiness (and mutation of the variable)
| harder to follow. Personally, I'd just take "parse, don't
| validate" to its logical conclusion and go for:
| const parseUrl = url => validateUrl(url) ? url : null;
| return parseUrl(url) || parseUrl('http://'+url) || null;
| lelandfe wrote:
| Address validators for online checkout are notoriously
| inaccurate, though they still help a lot. You just have to
| prompt the user, "Did you mean 123 Example St?"
|
| I'd probably do the same for poorly formatted URLs. When the
| user hits Submit, a prompt appears saying, "Did you mean
| `https://example.com`?"
| dang wrote:
| Two past discussions, for the curious:
|
| _In search of the perfect URL validation regex_ -
| https://news.ycombinator.com/item?id=10019795 - Aug 2015 (77
| comments)
|
| _In search of the perfect URL validation regex_ -
| https://news.ycombinator.com/item?id=7928968 - June 2014 (81
| comments)
| dmix wrote:
 | @stephenhay seems to be the winner here if you don't need IP
 | addresses (or weird dashed URLs). It's only 38 characters long
 | and easy to understand.
| @^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS
|
| The simpler the better, if you're going to use something that is
| not ideal.
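For anyone who wants to try it outside PHP, the same pattern carries over directly once the `@...@iS` delimiters and modifiers are dropped (shown here in JavaScript, keeping only the case-insensitive flag):

```javascript
// @stephenhay's pattern as a JS regex literal: scheme, "://",
// one non-delimiter character, any character, then no whitespace.
const stephenhay = /^(https?|ftp):\/\/[^\s\/$.?#].[^\s]*$/i;
```

As discussed above, it accepts plenty of non-URLs (that's the trade-off of a 38-character pattern), but it rejects the obvious junk.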
| loloquwowndueo wrote:
 | Doesn't cover mailto:, which is fairly common. To be
 | pedantic/strict, mailto: addresses are URIs, not URLs.
| dmurray wrote:
| > I also don't want to allow every possible technically valid URL
| -- quite the opposite.
|
| Well, that should make things a lot easier. What does he mean
| here? The rest of the text doesn't make it clear to me, unless
| it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL"
| which isn't exactly "the opposite".
| lelandfe wrote:
 | The next paragraph _might_ be that clarification, although I
 | agree it isn't totally clear what he meant there:
|
| > Assume that this regex will be used for a public URL
| shortener written in PHP, so URLs like http://localhost/,
| //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and
| tel:+1234567890 shouldn't pass (even though they're technically
| valid). Also, in this case I only want to allow the HTTP, HTTPS
| and FTP protocols.
| jabo wrote:
 | Tangentially related, but mentioning it to hopefully save
 | someone time: if you ever find yourself wanting to check
 | whether a version string is semver or not, before inventing
 | your own regex, know that there is an officially provided one.
 |
 | I just discovered this yesterday and I'm glad I didn't have to
 | come up with it myself:
|
| https://semver.org/#is-there-a-suggested-regular-expression-...
|
| My use case for it: https://github.com/typesense/typesense-
| website/blob/25562d02...
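For reference, the suggested pattern can be dropped straight into JavaScript. The regex below is transcribed from memory and may differ from the canonical one — verify against semver.org before relying on it:

```javascript
// Suggested semver regex (transcribed, possibly imperfect):
// major.minor.patch with no leading zeros, optional prerelease
// and build-metadata segments.
const semverRe =
  /^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$/;
```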
| azalemeth wrote:
 | Honest question: there is a famous and very funny Stack
 | Exchange answer on the topic of parsing HTML with a regex [1]
 | which states that the problem is in general impossible, and
 | that if you find yourself doing this, something has gone wrong
 | and you should re-evaluate your life choices / pray to Cthulhu.
|
 | So, does this apply to URLs? The fact that these regexes are
 | ...so huge... makes me think that something is fundamentally
 | wrong. Are URLs describable in a Chomsky Type 3 grammar? Are
 | they sufficiently regular that using a regex is sensible? What
 | do the actual browsers do?
|
| [1] https://stackoverflow.com/questions/1732348/regex-match-
| open...
| likium wrote:
 | Even if you built a URL validation regex that follows
 | RFC 3986[1] and RFC 3987[2], you would still get user bug
 | reports, because web browsers follow a different standard.
 |
 | For example, <http://example.com./> , <http:///example.com/> and
 | <https://en.wikipedia.org/wiki/Space (punctuation)> are
 | classified as invalid URLs in the blog post, but they are
 | accepted by browsers.
|
| As the creator of cURL puts it, there is no URL standard[3].
|
| [1]: https://www.ietf.org/rfc/rfc3986.txt
|
| [2]: https://www.ietf.org/rfc/rfc3987.txt
|
| [3]: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/
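The three examples are easy to check against the WHATWG parser (what browsers and Node's `URL` implement): it keeps the trailing dot, tolerates the extra slash, and percent-encodes the space rather than rejecting anything.

```javascript
// What the WHATWG URL parser does with the "invalid" examples:
const dotted = new URL("http://example.com./").href;   // trailing dot kept
const slashes = new URL("http:///example.com/").href;  // extra slash tolerated
const spaced =
  new URL("https://en.wikipedia.org/wiki/Space (punctuation)").pathname;
console.log(dotted, slashes, spaced);
```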
| jt2190 wrote:
 | There's also a question of what we're _really_ trying to
 | validate, IMHO. All of these regex patterns will tell you that
 | a string looks like a URL, but they won't actually tell you:
 | whether there's any web server listening at that particular
 | URL; whether that server has the resource at that location;
 | whether that server is reachable from where you want to fetch
 | it; etc.
| yyyk wrote:
| <http://example.com./> is a valid URL, see for example:
|
| https://jdebp.uk/FGA/web-fully-qualified-domain-name.html
| MildlySerious wrote:
 | Tangentially, YouTube had a bug surface last year where adding
 | that extra dot let you avoid all ads. Previous discussion[1]
|
| [1] https://news.ycombinator.com/item?id=23479435
| dhsysusbsjsi wrote:
| Also nearly every paywalled media site
| Sephr wrote:
| There might not have been a generally accepted standard then,
| but there is now: https://url.spec.whatwg.org/
| queuebert wrote:
| Uh oh, Regex is approaching sentience.
| MaxBarraclough wrote:
| Every known sentient being is a finite state machine. Every
| finite state machine corresponds to a regular expression, and
| vice versa.
| JadeNB wrote:
| > Every known sentient being is a finite state machine.
|
| I know this is just a cutesy slogan, but how could you
| possibly know whether a living creature is a finite state
| machine? What would it even mean? I know I don't respond
| identically to identical stimuli presented on different
| occasions ....
| throwamon wrote:
| Obnoxious, I mean, trivial, answer: Just make "occasions" a
| variable. Assuming your lifetime is finite, you could
| simply assign each point in time to a value, and there you
| have it: a finite mapping from each moment to a state.
| MaxBarraclough wrote:
| > I know this is just a cutesy slogan
|
| Mostly, yes, but I do think there's a real point here as
| well.
|
| > how could you possibly know whether a living creature is
| a finite state machine?
|
| As I understand it, physicists don't really know whether
| the physical world has a finite number of states, or an
| infinite number. I think they tend to lean toward finite,
| though.
|
| Even if it's infinite, I doubt it's of consequence. That is
| to say, I doubt that sentience depends on the physical
| possibility of an infinite number of states. (Of course, if
| it turns out the physical world only has a finite number of
| states, that demonstrates that sentience is compatible with
| the finite-states constraint.)
|
| > What would it even mean?
|
| Systems can be modelled as finite state machines. Sentient
| entities like people are extremely sophisticated systems,
| but that's just a matter of degree, not of category.
|
| > I know I don't respond identically to identical stimuli
| presented on different occasions
|
| Right, because you're in a different state. You'll never be
| in the same state twice. We don't need to resort to non-
| determinism.
___________________________________________________________________
(page generated 2021-09-25 23:00 UTC)