[HN Gopher] URLs: It's Complicated
___________________________________________________________________
URLs: It's Complicated
Author : salutonmundo
Score : 117 points
Date : 2021-06-22 05:30 UTC (1 day ago)
(HTM) web link (www.netmeister.org)
(TXT) w3m dump (www.netmeister.org)
| jfrunyon wrote:
| > making this a valid URL:
| https://!$%:)(*&^@www.netmeister.org/blog/urls.html
|
| Uh, no. "%:)" is not <"%" HEXDIG HEXDIG> nor is % allowed outside
| of that. (Although your browser will likely accept it)
|
| > This includes spaces, and the following two URLs lead to the
| same file located in a directory that's named " ":
| > https://www.netmeister.org/blog/urls/ /f
| > https://www.netmeister.org/blog/urls/%20/f
| > Your client may automatically percent-encode the space, but
| e.g., curl(1) lets you send the raw space:
|
| Uh, no. Just because one of your clients is wrong and some
| servers allow it doesn't mean it's allowed by the spec.
|
| In fact, the HTTP/1.1 RFC defers to RFC 2396 for the meaning of
| <abs_path>: a "/" followed by <path_segments>.
|
| What is <path_segments>? A bunch of slash-delimited <segment>s.
|
| What is <segment>? A bunch of <pchar> and maybe a semicolon.
|
| What is <pchar>? <unreserved>, <escaped>, or some special
| characters (not including space).
|
| What is <unreserved>? Letters, digits, and some special
| characters (not including space).
|
| What is <escaped>? <"%" hex hex>.
|
| Most HTTP clients and servers are pretty forgiving about what
| they accept, because other people do broken stuff, like sending
| them literal spaces. But that doesn't mean it's "allowed", that
| doesn't mean every server allows it, and that doesn't mean it's a
| good idea.
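|
| A minimal sketch in Python 3 makes the same point (the regex is a
| hand-translation of the RFC 2396 <segment> rule, not a full URI
| parser, so treat it as illustrative only):
|
|     import re
|
|     # <segment> = *pchar *( ";" param ), where
|     # pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
|     # unreserved = alphanum | mark, and escaped = "%" hex hex
|     PCHAR = r"(?:[A-Za-z0-9\-_.!~*'():@&=+$,]|%[0-9A-Fa-f]{2})"
|     SEGMENT = re.compile(rf"{PCHAR}*(?:;{PCHAR}*)*")
|
|     for seg in [" ", "%20", "foo", "%:)"]:
|         ok = SEGMENT.fullmatch(seg) is not None
|         print(repr(seg), "valid" if ok else "invalid")
|     # ' '   -> invalid  (a raw space is not a pchar)
|     # '%20' -> valid    (the escaped form of a space)
|     # 'foo' -> valid
|     # '%:)' -> invalid  ('%' must be followed by two hex digits)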
|
| > That is, if your web server supports (and has enabled) user
| directories, and you submit a request for "~username": [it does
| stuff]
|
| Uh, no. If you're using Apache, that might be true. As you
| mentioned, this is implementation-defined (as are all pathnames).
|
| > Now with all of this long discussion, let's go back to that
| silly URL from above: ... Now this really looks like the Buffalo
| buffalo equivalent of a URL.
|
| Not really.
|
| > Now we start to play silly tricks: "/ /www.netmeister.org" uses
| the fraction slash characters
|
| You are aware that URLs predate Unicode, right? Not to mention
| that Unicode lookalike characters are a Unicode (or UI) problem,
| not a URL problem?
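|
| For what it's worth, the lookalike in question is U+2044 FRACTION
| SLASH; a tiny Python 3 sketch (the example hostname is made up) of
| why it renders much like "/" but is a different string entirely:
|
|     frac = "\u2044"                    # U+2044 FRACTION SLASH
|     print(frac == "/")                 # False: different code point
|     print("https:" + frac * 2 + "example.test")  # looks like a URL prefix...
|     print(ascii(frac * 2))             # ...but ascii() shows '\u2044\u2044'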
|
| > The next "https" now is the hostname component of the
| authority: a partially qualified hostname, that relies on
| /etc/hosts containing an entry pointing https to the right IP
| address.
|
| Or on a search domain (which could be configured locally, or
| through GPO on Windows, or through DHCP!). Or maybe your resolver
| has a local zone for it. Or maybe ...
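|
| A quick way to see which of those mechanisms answers on a given
| machine is to ask the resolver for the bare name directly; a small
| Python 3 sketch (on most hosts this simply fails unless /etc/hosts,
| a search domain, or a local zone happens to supply an answer):
|
|     import socket
|
|     try:
|         # "https" is being used as a *hostname* here, not a scheme.
|         print(socket.getaddrinfo("https", 443, proto=socket.IPPROTO_TCP))
|     except socket.gaierror as err:
|         print("no mapping for the bare name 'https':", err)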
| surfingdino wrote:
| I have come across even more issues caused by IRIs used
| incorrectly in place of URIs by a popular web framework, causing
| havoc with OAuth redirects.
|
| https://en.wikipedia.org/wiki/Internationalized_Resource_Ide...
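|
| For anyone hitting the same thing, the usual workaround is to map
| the IRI down to a plain URI before it reaches anything OAuth-shaped.
| A rough Python 3 sketch (IDNA-encode the host, percent-encode the
| rest; the function name and example domain are made up, and real
| code should use a vetted library, since the built-in "idna" codec
| only implements IDNA 2003):
|
|     from urllib.parse import urlsplit, urlunsplit, quote
|
|     def iri_to_uri(iri):
|         """Very rough IRI -> URI mapping; drops userinfo, ignores many
|         edge cases, illustrative only."""
|         parts = urlsplit(iri)
|         host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
|         netloc = host + (f":{parts.port}" if parts.port else "")
|         return urlunsplit((
|             parts.scheme,
|             netloc,
|             quote(parts.path, safe="/%"),
|             quote(parts.query, safe="=&%"),
|             quote(parts.fragment, safe="%"),
|         ))
|
|     print(iri_to_uri("https://bücher.example/weiße_rose?q=schön"))
|     # https://xn--bcher-kva.example/wei%C3%9Fe_rose?q=sch%C3%B6n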
| scandinavian wrote:
| Specifically using another font for the code tag than the rest of
| the blog, to hide the difference between the fraction slashes "⁄⁄"
| and the regular "//", seems weird. I get that it wouldn't be
| interesting otherwise, but doesn't that just show that it's really
| not as complicated as you make it out to be?
| teknopaul wrote:
| URLs are not complicated, unless you complicate them.
|
| foo|foo -foo 's^foo^foo^'"">foo 2>>foo
|
| is not a very good example for teaching the structure of the
| command line.
|
| Pick a better one.
|
| It's simple.
| LambdaComplex wrote:
| "The average URL" and "what is allowed by the URL
| specifications" are two very different things. (And the same
| could be said about your command line example)
| prepend wrote:
| It doesn't seem complicated at all. Complicated to me means
| difficult to understand. This just involves reading the spec and
| it all seems pretty simple and consistent.
|
| Complicated doesn't mean "new to me." If I haven't read a man
| page, that doesn't mean the command is complicated.
| alphabet9000 wrote:
| Even browser developers have made mistakes as a result of the
| complexity of the spec, resulting in things like CVE-2018-6128
| [0] happening.
|
| [0]
| https://bugs.chromium.org/p/chromium/issues/detail?id=841105
| Stratoscope wrote:
| > _This just involves reading the spec and it all seems pretty
| simple and consistent._
|
| Worth noting from the article:
|
| > And this is one of the main take-aways here: while the URL
| specification prescribes or allows one thing, different clients
| _and servers_ behave differently.
| kube-system wrote:
| If we're being sticklers for reading the docs...
|
| > com·pli·cat·ed | ˈkämpləˌkādəd |
|
| > adjective
|
| > 1 consisting of many interconnecting parts or elements
| [deleted]
| treve wrote:
| I'm trying to figure out what your point is. Is it a criticism
| of the article? Are you sharing with us that you are clever?
| I'm trying to give it a generous interpretation, but I'm having
| a hard time.
|
| So it's not complicated to you... what made you want to share
| this?
| zepearl wrote:
| All extremely useful: the overview, the examples and the
| comments.
|
| A few months ago while writing a bot/crawler I searched for hours
| for something like this, but I found only full specs or just bits
| and pieces scattered around that used different terminology
| and/or had different opinions.
|
| In the end I didn't even clearly work out what the maximum total
| URL length should be (e.g. mixed opinions here
| https://stackoverflow.com/questions/417142/what-is-the-maxim... -
| come on, an x-GiB-long URL?) => most of the time 2000 bytes is
| mentioned, but it's not 100% clear.
|
| Writing a bot made me understand 1) why browsers are so
| complicated and 2) that the Internet is a mess (e.g. once I even
| found a page that used multiple character encodings...).
|
| My personal opinion is that everything is too lax. Browsers try
| to outdo each other by implementing workarounds for stuff that
| doesn't have a spec (yet) or doesn't comply with one => this can
| only end up in a mess. A simple example is the HTTP header
| "Content-Encoding" ( https://developer.mozilla.org/en-
| US/docs/Web/HTTP/Headers/Co... ), which I think should only
| indicate what kind of compression is being used, but I keep
| seeing stuff like
| "utf8"/"image/jpeg"/"base64"/"8bit"/"none"/"binary"/etc. in
| there, and all those pages/files work perfectly in the browsers
| even though with those values they should actually be rejected.
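|
| A stricter client could, for instance, refuse anything outside the
| registered content codings. A small illustrative check in Python 3
| (the allow-list below is a non-exhaustive snapshot of the IANA
| registry; real browsers are obviously far more forgiving):
|
|     # Non-exhaustive allow-list of registered HTTP content codings.
|     KNOWN_CODINGS = {"gzip", "x-gzip", "compress", "x-compress",
|                      "deflate", "identity", "br", "zstd"}
|
|     def check_content_encoding(header_value):
|         for coding in header_value.split(","):
|             coding = coding.strip().lower()
|             if coding and coding not in KNOWN_CODINGS:
|                 raise ValueError(f"not a registered content coding: {coding!r}")
|
|     for value in ("gzip", "utf8", "image/jpeg", "base64"):
|         try:
|             check_content_encoding(value)
|             print(value, "-> ok")
|         except ValueError as err:
|             print(value, "->", err)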
| indymike wrote:
| The spec is silent on length. 2000 bytes came from some web
| servers (old IIS comes to mind) that capped the URL at 2K or
| thereabouts, so extra-long URLs were problematic (and a lot of
| early web apps went nuts with parameters). Max length is up to
| the implementer. All I know is that I've had to fix lots of code
| where someone assumed that 255 characters is all you'll ever need
| for a URL.
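|
| In other words, any cap is a local policy decision, not a spec
| requirement. A minimal sketch of such a policy in Python 3 (the
| 8 KiB figure is an assumption borrowed from common request-line
| defaults, not a standard; the ~2000-character number survives
| mostly as browser folklore):
|
|     MAX_URL_BYTES = 8192   # assumed local policy, not a spec value
|
|     def accept_url(url):
|         """Reject absurdly long URLs up front (DoS hygiene), nothing more."""
|         return len(url.encode("utf-8")) <= MAX_URL_BYTES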
| jfrunyon wrote:
| There is no single max total URL length. You probably shouldn't
| enforce one other than to prevent DoS.
| mananaysiempre wrote:
| The use of Content-Encoding for compression is actually
| something of a historical wart: what was intended for that
| purpose is Transfer-Encoding. But modern browsers don't even
| send the TE header needed to permit the HTTP server to use it
| (except for Transfer-Encoding: chunked, which every HTTP/1.1
| client must accept), even though some servers are perfectly
| capable of it and all but the most broken ones will at least
| ignore it. Things like 7bit, 8bit, binary, or quoted-printable
| are not supposed to be in the HTTP Content-Encoding header
| either, but their presence is at least somewhat understandable,
| as they are valid in the MIME Content-Transfer-Encoding header
| and HTTP originally shared much of its infrastructure with MIME
| (think Content-Disposition: attachment).
|
| I guess what I'm getting at here is that the blame for the C-E
| weirdness lies in large part on the browsers, which could've
| made a clean break and improved the semantics at the same time
| by using T-E, but instead chose to initiate a chicken-and-egg
| dilemma out of a desire to support broken HTTP servers from the
| last century.
|
| (The intended semantics is that C-E, an "end-to-end" header,
| says "this resource genuinely exists in this encoded form",
| while T-E, a "hop-by-hop" header, says "the origin or proxy
| server you're using incidentally chose to encode this resource
| in this form"; this is why the wrong combination of hacks in the
| HTTP server and the web browser will sometimes lead you to
| downloading a tar file when you expected a tar.gz file.)
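|
| To make that split concrete, a hedged illustration with plain
| Python 3 dicts standing in for the two kinds of response headers
| (no real server or browser behaviour is being modelled here):
|
|     # End-to-end: "the resource genuinely is a gzip file"; saving the
|     # body verbatim should leave you with backup.tar.gz.
|     property_of_the_resource = {
|         "Content-Type": "application/x-tar",
|         "Content-Encoding": "gzip",
|     }
|
|     # Hop-by-hop: "this transfer happens to be gzipped"; the recipient
|     # undoes it, and what remains is the plain backup.tar.
|     property_of_the_transfer = {
|         "Content-Type": "application/x-tar",
|         "Transfer-Encoding": "gzip, chunked",
|     }
|
|     for name, headers in (("resource", property_of_the_resource),
|                           ("transfer", property_of_the_transfer)):
|         print(name, "->", headers)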
|
| The use of "gzip" as the compression is also a wart, because
| it's "deflate" (which is what you want: DEFLATE compression
| with a checksum) with a useless decompressed filename (wat?) +
| decompressed mtime (double wat?) header stacked on top.
| jfrunyon wrote:
| The original filename is optional in gzip. It is not included
| in the response sent by, for example, Apache.
|
| (There is a mandatory MTIME which is included, and an OS
| byte, but those only waste 5 bytes total. Far less than gzip
| will typically save.)
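|
| The overhead is easy to put a number on; a small Python 3 sketch,
| under the assumption that HTTP's "deflate" coding means the zlib
| (RFC 1950) wrapper rather than a raw DEFLATE stream:
|
|     import gzip, zlib
|
|     data = b"hello, world\n" * 100
|
|     g = gzip.compress(data, compresslevel=9)  # 10-byte header (magic, flags,
|                                               # MTIME, XFL, OS) + CRC-32/size trailer
|     z = zlib.compress(data, 9)                # 2-byte header + Adler-32 trailer
|
|     # Same DEFLATE stream inside at the same level, so the difference
|     # is pure header/trailer overhead: (10 + 8) - (2 + 4) = 12 bytes.
|     print(len(g), len(z), len(g) - len(z))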
| leifg wrote:
| It seems like the colon is too ambiguous (it's used as the
| protocol delimiter, the user/password delimiter, and the port
| delimiter).
|
| Reminds me a little bit of Java labels, where you can do this:
|
|     public class Labels {
|         public static void main(String args[]) {
|             https://hn.ycombinator.com
|             for (int i = 0; i < 10; i++) {
|                 System.out.println("......." + i);
|             }
|         }
|     }
|
| The https: is a label named https and everything after the two
| slashes is a line comment, so this is valid code.
| jackewiehose wrote:
| > It seems like the colon is too ambiguous (it's used as the
| protocol delimiter, the user/password delimiter, and the port
| delimiter).
|
| And because that was still too boring, they came up with IPv6.
| [deleted]
| mananaysiempre wrote:
| IIUC the IPv6 weirdness here is simply due to very
| unfortunate timing: IPv6 was being finalized at a time (first
| half of the 90s) when the Web (and with it URLs) was already
| nearly frozen but still not obviously important.
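|
| The bracket syntax for IPv6 literals had to be bolted on later
| (RFC 2732, since folded into RFC 3986): the brackets are what lets
| a parser still find the port. A quick check with Python 3's urllib:
|
|     from urllib.parse import urlsplit
|
|     u = urlsplit("http://[2001:db8::1]:8080/index.html")
|     print(u.hostname, u.port)
|     # 2001:db8::1 8080 -- the colons inside [] belong to the address,
|     # only the one after ']' delimits the port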
| gpvos wrote:
| Not in URLs, but related:
|
|     Larry's 1st Law of Language Redesign: Everyone wants the colon
|     Larry's 2nd Law of Language Redesign: Larry gets the colon
|
| https://thelackthereof.org/Perl6_Colons
| sitdown wrote:
| Layouts using <table>s are complicated too. For example, this
| page has a ~7800px-wide <pre> tag in a <table> that's 720px wide.
| mananaysiempre wrote:
| Just to share a little more of the weirdness (discovered while
| reading a couple of the historical URL & URI RFCs several days
| ago):
|
| Per the original spec, in FTP URLs,
|
| - ftp://example.net/foo/bar will get you bar inside the foo
| directory inside the default directory of the FTP server at
| example.net ( _i.e._ CWD foo, RETR bar);
|
| - ftp://example.net//foo/bar will get you bar inside the foo
| directory _inside the empty string directory_ inside the default
| directory of the FTP server at example.net ( _i.e._ CWD, CWD foo,
| RETR bar; what do FTP servers even do with this?);
|
| - and it's ftp://example.net/%2Ffoo/bar that you must use if you
| want bar inside the foo directory inside the root directory of
| the FTP server at example.net ( _i.e._ CWD /foo, RETR bar; %2F
| being the result of percent-encoding a slash character).
| 1vuio0pswjnm7 wrote:
| From source code of preferred ftp/http client, maybe this is
| helpful. Also suggest reading source code for djb's ftp server.
|
|     Parse URL of form (per RFC 3986):
|         <type>://[<user>[:<password>]@]<host>[:<port>][/<path>]
|
|     XXX: this is not totally RFC 3986 compliant; <path> will have
|     the leading `/' unless it's an ftp:// URL, as this makes
|     things easier for file:// and http:// URLs.  ftp:// URLs
|     have the `/' between the host and the URL-path removed, but
|     any additional leading slashes in the URL-path are retained
|     (because they imply that we should later do "CWD" with a
|     null argument).
|
|     Examples:
|         input URL                  output path
|         ---------                  -----------
|         "http://host"              "/"
|         "http://host/"             "/"
|         "http://host/path"         "/path"
|         "file://host/dir/file"     "dir/file"
|         "ftp://host"               ""
|         "ftp://host/"              ""
|         "ftp://host//"             "/"
|         "ftp://host/dir/file"      "dir/file"
|         "ftp://host//dir/file"     "/dir/file"
|
|     If we are dealing with a classic `[user@]host:[path]'
|     (urltype is CLASSIC_URL_T) then we have a raw directory name
|     (not encoded in any way) and we can change directories in
|     one step.
|
|     If we are dealing with an `ftp://host/path' URL (urltype is
|     FTP_URL_T), then RFC 3986 says we need to send a separate
|     CWD command for each unescaped "/" in the path, and we have
|     to interpret %hex escaping *after* we find the slashes.
|
|     It's possible to get empty components here (from multiple
|     adjacent slashes in the path) and RFC 3986 says that we
|     should still do `CWD ' (with a null argument) in such cases.
|     Many ftp servers don't support `CWD ', so if there's an
|     error performing that command, bail out with a descriptive
|     message.
|
|     Examples:
|         host:                           dir="", urltype=CLASSIC_URL_T
|             logged in (to default directory)
|         host:file                       dir=NULL, urltype=CLASSIC_URL_T
|             "RETR file"
|         host:dir/                       dir="dir", urltype=CLASSIC_URL_T
|             "CWD dir", logged in
|         ftp://host/                     dir="", urltype=FTP_URL_T
|             logged in (to default directory)
|         ftp://host/dir/                 dir="dir", urltype=FTP_URL_T
|             "CWD dir", logged in
|         ftp://host/file                 dir=NULL, urltype=FTP_URL_T
|             "RETR file"
|         ftp://host//file                dir="", urltype=FTP_URL_T
|             "CWD ", "RETR file"
|         host:/file                      dir="/", urltype=CLASSIC_URL_T
|             "CWD /", "RETR file"
|         ftp://host///file               dir="/", urltype=FTP_URL_T
|             "CWD ", "CWD ", "RETR file"
|         ftp://host/%2F/file             dir="%2F", urltype=FTP_URL_T
|             "CWD /", "RETR file"
|         ftp://host/foo/file             dir="foo", urltype=FTP_URL_T
|             "CWD foo", "RETR file"
|         ftp://host/foo/bar/file         dir="foo/bar"
|             "CWD foo", "CWD bar", "RETR file"
|         ftp://host//foo/bar/file        dir="/foo/bar"
|             "CWD ", "CWD foo", "CWD bar", "RETR file"
|         ftp://host/foo//bar/file        dir="foo//bar"
|             "CWD foo", "CWD ", "CWD bar", "RETR file"
|         ftp://host/%2F/foo/bar/file     dir="%2F/foo/bar"
|             "CWD /", "CWD foo", "CWD bar", "RETR file"
|         ftp://host/%2Ffoo/bar/file      dir="%2Ffoo/bar"
|             "CWD /foo", "CWD bar", "RETR file"
|         ftp://host/%2Ffoo%2Fbar/file    dir="%2Ffoo%2Fbar"
|             "CWD /foo/bar", "RETR file"
|         ftp://host/%2Ffoo%2Fbar%2Ffile  dir=NULL
|             "RETR /foo/bar/file"
|
|     Note that we don't need `dir' after this point.
|
|     The `CWD ' command (without a directory), which is required
|     by RFC 3986 to support the empty directory in the URL
|     pathname (`//'), conflicts with the server's conformance to
|     RFC 959.
| [deleted]
| a1369209993 wrote:
| > i.e. CWD, CWD foo, RETR bar; what do FTP servers even do with
| this?
|
| If you go by shell semantics, that pokes around the home
| directory of the user running the FTP daemon; hopefully that
| doesn't actually work.
|
| > it's ftp://example.net/%2Ffoo/bar that you must use if you
| want bar inside the foo directory inside the root directory
|
| This smells like a security vulnerability for most setups.
| jfrunyon wrote:
| > If you go by shell sematics, that pokes around the home
| directory of the user running the FTP daemon; hopefully that
| doesn't actually work.
|
| This is why FTP servers have default directories. They're the
| equivalent of user home directories. By the way, many FTP
| servers (especially historically) map FTP logins to real,
| local users.
|
| > This smells like a security vulnerability for most setups.
|
| How do you figure? Surely your sensitive files aren't world-
| readable... /s
| mananaysiempre wrote:
| > This smells like a security vulnerability for most setups.
|
| Yes, but if you look around on some old FTP servers (like on
| the few still-extant mirror networks) you'll find that some
| do actually let you CWD to the system /, and sometimes they
| even drop you there by default (so you have to CWD pub or
| whatever to get at the things you actually want).
| jfrunyon wrote:
| > what do FTP servers even do with this?
|
| Pretty sure CWD by itself isn't even valid (at least RFC 959
| assumes it has an argument), and therefore // isn't valid in
| FTP URLs.
|
| The %2Ffoo/bar is needed because of the fact that FTP CWD and
| RETR paths are system dependent (with, theoretically, system
| dependent path separators), but URLs are not, so the FTP client
| breaks the URL on / and sequentially executes CWD down the tree
| so that it doesn't need to know what it's connected to.
|
| In other words: URL paths are not system paths, and it's a
| mistake to think of them as such.
|
| (Alternate in other words: FTP is awful)
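|
| A minimal sketch of that split-then-decode order in Python 3
| (illustrative of the RFC behaviour described above, not of any
| particular client; ftp_cwd_plan is a made-up helper name):
|
|     from urllib.parse import unquote
|
|     def ftp_cwd_plan(url_path):
|         """Split the URL path on *unescaped* slashes first, then
|         percent-decode each component; the last one is fetched,
|         the rest are CWD'd (an empty component means a bare CWD)."""
|         parts = [unquote(p) for p in url_path.split("/")]
|         *dirs, name = parts
|         return [f"CWD {d}" for d in dirs] + [f"RETR {name}"]
|
|     print(ftp_cwd_plan("foo/bar"))     # ['CWD foo', 'RETR bar']
|     print(ftp_cwd_plan("%2Ffoo/bar"))  # ['CWD /foo', 'RETR bar']
|     print(ftp_cwd_plan("/foo/bar"))    # ['CWD ', 'CWD foo', 'RETR bar']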
___________________________________________________________________
(page generated 2021-06-23 23:02 UTC)