[HN Gopher] URLs: It's Complicated
       ___________________________________________________________________
        
       URLs: It's Complicated
        
       Author : salutonmundo
       Score  : 117 points
       Date   : 2021-06-22 05:30 UTC (1 days ago)
        
 (HTM) web link (www.netmeister.org)
 (TXT) w3m dump (www.netmeister.org)
        
       | jfrunyon wrote:
       | > making this is a valid URL:
       | https://!$%:)(*&^@www.netmeister.org/blog/urls.html
       | 
       | Uh, no. "%:)" is not <"%" HEXDIG HEXDIG> nor is % allowed outside
       | of that. (Although your browser will likely accept it)
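       | 
       | A minimal Python sketch of that gap, assuming the standard
       | urllib.parse module as a stand-in for a lenient client: it
       | splits the URL without complaint, while a strict check for
       | <"%" HEXDIG HEXDIG> flags the userinfo. has_stray_percent is a
       | hypothetical helper, not part of any library.

```python
import re
from urllib.parse import urlsplit

# Matches any "%" that is NOT followed by two hex digits.
STRAY_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")

def has_stray_percent(component: str) -> bool:
    """Hypothetical helper: True if a "%" is not part of a %XX escape."""
    return STRAY_PERCENT.search(component) is not None

url = "https://!$%:)(*&^@www.netmeister.org/blog/urls.html"
parts = urlsplit(url)  # lenient: no exception is raised here
print(parts.username, parts.hostname)
print(has_stray_percent(parts.username))
```

       | The parser treats everything before the first ":" in the
       | userinfo as the username, so the bare "%" ends up there and the
       | strict check reports it.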
       | 
       | > This includes spaces, and the following two URLs lead to the
       | same file located in a directory that's named " ":
       | 
       | > https://www.netmeister.org/blog/urls/ /f
       | 
       | > https://www.netmeister.org/blog/urls/%20/f
       | 
       | > Your client may automatically percent-encode the space, but
       | e.g., curl(1) lets you send the raw space:
       | 
       | Uh, no. Just because one of your clients is wrong and some
       | servers allow it doesn't mean it's allowed by the spec.
       | 
       | In fact, the HTTP/1.1 RFC defers to RFC2396 for the meaning of
       | <abs_path>: <path_segments> which begin with a /.
       | 
       | What is <path_segments>? A bunch of slash-delimited <segment>s.
       | 
       | What is <segment>? A bunch of <pchar> and maybe a semicolon.
       | 
       | What is <pchar>? <unreserved>, <escaped>, or some special
       | characters (not including space).
       | 
       | What is <unreserved>? Letters, digits, and some special
       | characters (not including space).
       | 
       | What is <escaped>? <"%" hex hex>.
       | 
       | Most HTTP clients and servers are pretty forgiving about what
       | they accept, because other people do broken stuff, like sending
       | them literal spaces. But that doesn't mean it's "allowed", that
       | doesn't mean every server allows it, and that doesn't mean it's a
       | good idea.
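       | 
       | The grammar chain above can be sketched as a regular expression.
       | This is a rough Python transcription of the RFC 2396 <pchar>
       | and <segment> rules as I read them, not a full URL validator;
       | is_valid_segment is a hypothetical name.

```python
import re

# pchar = unreserved | escaped | ":" | "@" | "&" | "=" | "+" | "$" | ","
# unreserved = letters, digits, and the marks - _ . ! ~ * ' ( )
PCHAR = r"(?:[A-Za-z0-9\-_.!~*'():@&=+$,]|%[0-9A-Fa-f]{2})"

# segment = a run of pchars, optionally with ";"-separated params
SEGMENT = re.compile(rf"(?:{PCHAR}|;)*")

def is_valid_segment(segment: str) -> bool:
    """Hypothetical helper: does this path segment match the grammar?"""
    return SEGMENT.fullmatch(segment) is not None

for candidate in (" ", "%20", "f", "urls;v=1", "%2"):
    print(repr(candidate), is_valid_segment(candidate))
```

       | The raw space and the truncated escape "%2" are rejected; "%20"
       | and the plain segments pass.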
       | 
       | > That is, if your web server supports (and has enabled) user
       | directories, and you submit a request for "~username": [it does
       | stuff]
       | 
       | Uh, no. If you're using Apache, that might be true. As you
       | mentioned, this is implementation-defined (as are all pathnames).
       | 
       | > Now with all of this long discussion, let's go back to that
       | silly URL from above: ... Now this really looks like the Buffalo
       | buffalo equivalent of a URL.
       | 
       | Not really.
       | 
       | > Now we start to play silly tricks: "/ /www.netmeister.org" uses
       | the fraction slash characters
       | 
       | You are aware that URLs predate Unicode, right? Not to mention
       | that Unicode lookalike characters are a Unicode (or UI) problem,
       | not a URL problem?
       | 
       | > The next "https" now is the hostname component of the
       | authority: a partially qualified hostname, that relies on
       | /etc/hosts containing an entry pointing https to the right IP
       | address.
       | 
       | Or on a search domain (which could be configured locally, or
       | through GPO on Windows, or through DHCP!). Or maybe your resolver
       | has a local zone for it. Or maybe ...
        
       | surfingdino wrote:
       | I have come across even more issues caused by IRIs used
       | incorrectly in place of URIs by a popular web framework, causing
       | havoc with OAuth redirects.
       | 
       | https://en.wikipedia.org/wiki/Internationalized_Resource_Ide...
        
       | scandinavian wrote:
       | Specifically using another font for the code tag than the rest of
       | the blog to hide the difference between // and // seems weird. I
       | get that it wouldn't be interesting if not doing that, but
       | doesn't that just show that it's really not as complicated as you
       | make it out to be?
        
       | teknopaul wrote:
       | URLs are not complicated, unless you complicate them.
       | 
       | foo|foo -foo 's^foo^foo^'"">foo 2>>foo
       | 
       | is not a very good example for teaching the structure of the
       | command line.
       | 
       | Pick a better one.
       | 
       | It's simple.
        
         | LambdaComplex wrote:
         | "The average URL" and "what is allowed by the URL
         | specifications" are two very different things. (And the same
         | could be said about your command line example)
        
       | prepend wrote:
       | It doesn't seem complicated at all. Complicated to me means
       | difficult to understand. This just involves reading the spec and
       | it all seems pretty simple and consistent.
       | 
       | Complicated doesn't mean "new to me." If I haven't read a man
       | page, that doesn't mean the command is complicated.
        
         | alphabet9000 wrote:
         | Even browser developers have made mistakes as a result of the
         | complexity of the spec, resulting in things like CVE-2018-6128
         | [0] happening.
         | 
         | [0]
         | https://bugs.chromium.org/p/chromium/issues/detail?id=841105
        
         | Stratoscope wrote:
         | > _This just involves reading the spec and it all seems pretty
         | simple and consistent._
         | 
         | Worth noting from the article:
         | 
         | > And this is one of the main take-aways here: while the URL
         | specification prescribes or allows one thing, different clients
         | _and servers_ behave differently.
        
         | kube-system wrote:
         | If we're being sticklers for reading the docs...
         | 
         | > com·pli·cat·ed | ˈkämpləˌkādəd |
         | 
         | > adjective
         | 
         | > 1 consisting of many interconnecting parts or elements
        
           | [deleted]
        
         | treve wrote:
         | I'm trying to figure out what your point is. Is it a criticism
         | of the article? Are you sharing with us that you are clever?
         | I'm trying to give it a generous interpretation, but I'm having
         | a hard time.
         | 
         | So it's not complicated to you... what made you want to share
         | this?
        
       | zepearl wrote:
       | All extremely useful: the overview, the examples and the
       | comments.
       | 
       | A few months ago while writing a bot/crawler I searched for hours
       | for something like this, but I found only full specs or just bits
       | and pieces scattered around that used different terminology
       | and/or had different opinions.
       | 
       | In the end I didn't even clearly understand what should be the
       | max total URL length (e.g. mixed opinions here
       | https://stackoverflow.com/questions/417142/what-is-the-maxim... -
       | come on, a xGiB long URL?) => most of the time 2000 bytes is
       | mentioned but it's not 100% clear.
       | 
       | Writing a bot made me understand 1) why browsers are so
       | complicated and 2) that the Internet is a mess (e.g. once I even
       | found a page that used multiple character encodings...).
       | 
       | My personal opinion is that everything is too lax. Browsers try
       | to be the best ones by implementing workarounds for stuff that
       | does not have a spec (yet) or does not comply with one => this
       | way it can only end up in a mess. A simple example is the HTTP
       | header "Content-Encoding" (
       | https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Co...
       | ), which I think should only indicate what kind of compression
       | is being used, but I keep seeing in there stuff like
       | "utf8"/"image/jpeg"/"base64"/"8bit"/"none"/"binary"/etc., and
       | all those pages/files work perfectly in the browsers even
       | though with those values they should actually be rejected.
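       | 
       | A strict crawler could flag such values with something like the
       | Python sketch below. The set of accepted codings is my own
       | assumption (a handful of commonly registered content codings,
       | not a complete list), and unknown_codings is a hypothetical
       | helper.

```python
# Assumed allow-list of content codings; adjust to taste.
KNOWN_CODINGS = {"gzip", "x-gzip", "compress", "x-compress",
                 "deflate", "br", "zstd"}

def unknown_codings(header_value: str) -> list:
    """Return the Content-Encoding tokens that are not in the allow-list."""
    tokens = [t.strip().lower() for t in header_value.split(",")]
    return [t for t in tokens if t and t not in KNOWN_CODINGS]

for value in ("gzip", "utf8", "gzip, image/jpeg", "8bit"):
    print(repr(value), unknown_codings(value))
```

       | An empty result means every listed coding is recognized;
       | anything else is a candidate for rejecting or logging the
       | response.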
        
         | indymike wrote:
         | The spec is silent on length. 2000 bytes came from some web
         | servers (old IIS comes to mind) that capped the URL at 2K or
         | something close to that. So extra long URLs were problematic
         | (and a lot of early web apps went nuts with parameters). So,
         | max length is up to the implementer. All I know is that I've
         | had to fix lots of code where someone assumed that 255
         | characters is all you'll ever need for a URL.
        
         | jfrunyon wrote:
         | There is no single max total URL length. You probably shouldn't
         | enforce one other than to prevent DoS.
        
         | mananaysiempre wrote:
         | The use of Content-Encoding for compression is actually
         | something of a historical wart: what was intended to be used
         | for that purpose is Transfer-Encoding, but modern browsers
         | don't even send the TE header necessary to permit the HTTP
         | server to use it (except for Transfer-Encoding: chunked which
         | every HTTP 1.1 client must accept), even though some servers
         | are perfectly capable of it and all but the most broken will at
         | least ignore it. Things like 7bit, 8bit, binary, or
         | quoted-printable are not supposed to be in the HTTP
         | Content-Encoding header, either, but their presence is at least
         | somewhat understandable as they are valid in the MIME
         | Content-Transfer-Encoding header, and HTTP originally shared
         | much of its infrastructure with MIME (think
         | Content-Disposition: attachment).
         | 
         | I guess what I'm getting at here is that the blame for the C-E
         | weirdness lies in large part on the browsers, which could've
         | made a clean break and improved the semantics at the same time
         | by using T-E, but instead chose to initiate a chicken-and-egg
         | dilemma out of a desire to support broken HTTP servers from the
         | last century.
         | 
         | (The intended semantics is that C-E, an "end-to-end" header,
         | says "this resource genuinely exists in this encoded form",
         | while T-E, a "hop-by-hop" header, says "the origin or proxy
         | server you're using incidentally chose to encode this resource
         | in this form"; this is why sometimes the wrong combination of
         | hacks in the HTTP server and the Web browser will lead you to
         | downloading a tar file when you expected a tar.gz file.)
         | 
         | The use of "gzip" as the compression is also a wart, because
         | it's "deflate" (which is what you want: DEFLATE compression
         | with a checksum) with a useless decompressed filename (wat?) +
         | decompressed mtime (double wat?) header stacked on top.
        
           | jfrunyon wrote:
           | The original filename is optional in gzip. It is not included
           | in the response sent by, for example, Apache.
           | 
           | (There is a mandatory MTIME which is included, and an OS
           | byte, but those only waste 5 bytes total. Far less than gzip
           | will typically save.)
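           | 
           | A small Python sketch to measure the framing overhead,
           | assuming the standard zlib module, where the wbits argument
           | selects the container: -15 for a bare DEFLATE stream, 15 for
           | the zlib container (what HTTP calls "deflate"), 31 for gzip.
           | The compress helper and the sample data are made up for
           | illustration.

```python
import zlib

def compress(data: bytes, wbits: int) -> bytes:
    """Compress at level 9; wbits picks raw, zlib, or gzip framing."""
    c = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return c.compress(data) + c.flush()

data = b"URLs: it's complicated. " * 100
raw = compress(data, -15)  # bare DEFLATE stream, no header or checksum
zl = compress(data, 15)    # zlib container (HTTP "deflate")
gz = compress(data, 31)    # gzip container (HTTP "gzip")

print(len(zl) - len(raw))  # bytes added by the zlib framing
print(len(gz) - len(raw))  # bytes added by the gzip framing
print(bool(gz[3] & 0x08))  # FNAME flag: is an original filename stored?
```

           | With this setup the zlib framing adds 6 bytes and the gzip
           | framing 18 (a 10-byte header and an 8-byte trailer), and no
           | filename is stored.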
        
       | leifg wrote:
       | It seems like the colon is too ambiguous (is used as a protocol
       | delimiter, delimiter for user/pass, delimiter for port).
       | 
       | Reminds me a little bit of Java labels where you can do this:
       | 
       |     public class Labels {
       |         public static void main(String args[]){
       |             https://hn.ycombinator.com
       |             for(int i=0; i<10; i++){
       |                 System.out.println("......."+i );
       |             }
       |         }
       |     }
       | 
       | the https: is a label named https and everything after the colon
       | is a comment so this is valid code.
        
         | jackewiehose wrote:
         | > It seems like the colon is too ambiguous (is used as a
         | protocol delimiter, delimiter for user/pass, delimiter for
         | port).
         | 
         | and because that was still too boring they came up with ipv6
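         | 
         | The three colon roles, plus the IPv6 case, can be seen with
         | Python's urllib.parse. A quick sketch: example.com, the
         | credentials, the address, and the port are made-up
         | illustration values.

```python
from urllib.parse import urlsplit

# One URL, three colons: after the scheme, inside the userinfo,
# and before the port.
parts = urlsplit("https://user:secret@example.com:8443/path")
print(parts.scheme, parts.username, parts.password,
      parts.hostname, parts.port)

# An IPv6 literal contains colons of its own, so it has to be wrapped
# in brackets to keep the port delimiter unambiguous.
v6 = urlsplit("https://[2001:db8::1]:8443/path")
print(v6.hostname, v6.port)
```

         | The parser strips the brackets from the hostname and still
         | finds the port after the closing bracket.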
        
           | [deleted]
        
           | mananaysiempre wrote:
           | IIUC the IPv6 weirdness here is simply due to very
           | unfortunate timing: IPv6 was being finalized at a time (first
           | half of the 90s) when the Web (and with it URLs) was already
           | nearly frozen but still not obviously important.
        
         | gpvos wrote:
         | Not in URLs, but related:
         | 
         |     Larry's 1st Law of Language Redesign: Everyone wants the colon
         |     Larry's 2nd Law of Language Redesign: Larry gets the colon
         | 
         | https://thelackthereof.org/Perl6_Colons
        
       | sitdown wrote:
       | Layouts using <table>s are complicated too. For example, this
       | page has a ~7800px-wide <pre> tag in a <table> that's 720px wide.
        
       | mananaysiempre wrote:
       | Just to share a little more of the weirdness (discovered while
       | reading a couple of the historical URL & URI RFCs several days
       | ago):
       | 
       | Per the original spec, in FTP URLs,
       | 
       | - ftp://example.net/foo/bar will get you bar inside the foo
       | directory inside the default directory of the FTP server at
       | example.net ( _i.e._ CWD foo, RETR bar);
       | 
       | - ftp://example.net//foo/bar will get you bar inside the foo
       | directory _inside the empty string directory_ inside the default
       | directory of the FTP server at example.net ( _i.e._ CWD, CWD foo,
       | RETR bar; what do FTP servers even do with this?);
       | 
       | - and it's ftp://example.net/%2Ffoo/bar that you must use if you
       | want bar inside the foo directory inside the root directory of
       | the FTP server at example.net ( _i.e._ CWD /foo, RETR bar; %2F
       | being the result of percent-encoding a slash character).
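       | 
       | The three cases above can be sketched in Python. This is my own
       | reading of the rule (split on literal slashes first, then
       | percent-decode each piece), not code from any FTP client;
       | ftp_commands is a hypothetical helper.

```python
from urllib.parse import urlsplit, unquote

def ftp_commands(url: str) -> list:
    """Translate an ftp:// URL into a CWD/RETR command sequence."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        raise ValueError("not an ftp URL")
    # Drop the single "/" that separates the host from the URL-path.
    path = parts.path[1:] if parts.path.startswith("/") else parts.path
    # Split on literal slashes first, then percent-decode each piece,
    # so that %2F survives as a "/" inside a single CWD argument.
    *dirs, name = path.split("/")
    commands = ["CWD " + unquote(d) for d in dirs]
    if name:
        commands.append("RETR " + unquote(name))
    return commands

for url in ("ftp://example.net/foo/bar",
            "ftp://example.net//foo/bar",
            "ftp://example.net/%2Ffoo/bar"):
    print(url, ftp_commands(url))
```

       | The second URL yields a CWD with an empty argument, which is
       | exactly the odd case in question.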
        
         | 1vuio0pswjnm7 wrote:
         | From source code of preferred ftp/http client, maybe this is
         | helpful. Also suggest reading source code for djb's ftp server.
         | 
         | Parse URL of form (per RFC 3986):
         |     <type>://[<user>[:<password>]@]<host>[:<port>][/<path>]
         | 
         | XXX: this is not totally RFC 3986 compliant; <path> will have
         | the leading `/' unless it's an ftp:// URL, as this makes things
         | easier for file:// and http:// URLs.  ftp:// URLs have the `/'
         | between the host and the URL-path removed, but any additional
         | leading slashes in the URL-path are retained (because they
         | imply that we should later do "CWD" with a null argument).
         | 
         | Examples:
         |     input URL                   output path
         |     ---------                   -----------
         |     "http://host"               "/"
         |     "http://host/"              "/"
         |     "http://host/path"          "/path"
         |     "file://host/dir/file"      "dir/file"
         |     "ftp://host"                ""
         |     "ftp://host/"               ""
         |     "ftp://host//"              "/"
         |     "ftp://host/dir/file"       "dir/file"
         |     "ftp://host//dir/file"      "/dir/file"
         | 
         | If we are dealing with a classic `[user@]host:[path]'
         | (urltype is CLASSIC_URL_T) then we have a raw directory
         | name (not encoded in any way) and we can change
         | directories in one step.
         | 
         | If we are dealing with an `ftp://host/path' URL
         | (urltype is FTP_URL_T), then RFC 3986 says we need to
         | send a separate CWD command for each unescaped "/"
         | in the path, and we have to interpret %hex escaping
         | *after* we find the slashes.  It's possible to get
         | empty components here, (from multiple adjacent
         | slashes in the path) and RFC 3986 says that we should
         | still do `CWD ' (with a null argument) in such cases.
         | 
         | Many ftp servers don't support `CWD ', so if there's an
         | error performing that command, bail out with a descriptive
         | message.
         | 
         | Examples:
         | 
         |     host:                            dir="", urltype=CLASSIC_URL_T
         |         logged in (to default directory)
         |     host:file                        dir=NULL, urltype=CLASSIC_URL_T
         |         "RETR file"
         |     host:dir/                        dir="dir", urltype=CLASSIC_URL_T
         |         "CWD dir", logged in
         |     ftp://host/                      dir="", urltype=FTP_URL_T
         |         logged in (to default directory)
         |     ftp://host/dir/                  dir="dir", urltype=FTP_URL_T
         |         "CWD dir", logged in
         |     ftp://host/file                  dir=NULL, urltype=FTP_URL_T
         |         "RETR file"
         |     ftp://host//file                 dir="", urltype=FTP_URL_T
         |         "CWD ", "RETR file"
         |     host:/file                       dir="/", urltype=CLASSIC_URL_T
         |         "CWD /", "RETR file"
         |     ftp://host///file                dir="/", urltype=FTP_URL_T
         |         "CWD ", "CWD ", "RETR file"
         |     ftp://host/%2F/file              dir="%2F", urltype=FTP_URL_T
         |         "CWD /", "RETR file"
         |     ftp://host/foo/file              dir="foo", urltype=FTP_URL_T
         |         "CWD foo", "RETR file"
         |     ftp://host/foo/bar/file          dir="foo/bar"
         |         "CWD foo", "CWD bar", "RETR file"
         |     ftp://host//foo/bar/file         dir="/foo/bar"
         |         "CWD ", "CWD foo", "CWD bar", "RETR file"
         |     ftp://host/foo//bar/file         dir="foo//bar"
         |         "CWD foo", "CWD ", "CWD bar", "RETR file"
         |     ftp://host/%2F/foo/bar/file      dir="%2F/foo/bar"
         |         "CWD /", "CWD foo", "CWD bar", "RETR file"
         |     ftp://host/%2Ffoo/bar/file       dir="%2Ffoo/bar"
         |         "CWD /foo", "CWD bar", "RETR file"
         |     ftp://host/%2Ffoo%2Fbar/file     dir="%2Ffoo%2Fbar"
         |         "CWD /foo/bar", "RETR file"
         |     ftp://host/%2Ffoo%2Fbar%2Ffile   dir=NULL
         |         "RETR /foo/bar/file"
         | 
         | Note that we don't need `dir' after this point.
         | 
         | The `CWD ' command (without a directory), which is required by
         | RFC 3986 to support the empty directory in the URL pathname
         | (`//'), conflicts with the server's conformance to RFC 959.
        
           | [deleted]
        
         | a1369209993 wrote:
         | > i.e. CWD, CWD foo, RETR bar; what do FTP servers even do with
         | this?
         | 
         | If you go by shell semantics, that pokes around the home
         | directory of the user running the FTP daemon; hopefully that
         | doesn't actually work.
         | 
         | > it's ftp://example.net/%2Ffoo/bar that you must use if you
         | want bar inside the foo directory inside the root directory
         | 
         | This smells like a security vulnerability for most setups.
        
           | jfrunyon wrote:
           | > If you go by shell semantics, that pokes around the home
           | directory of the user running the FTP daemon; hopefully that
           | doesn't actually work.
           | 
           | This is why FTP servers have default directories. They're the
           | equivalent of user home directories. By the way, many FTP
           | servers (especially historically) map FTP logins to real,
           | local users.
           | 
           | > This smells like a security vulnerability for most setups.
           | 
           | How do you figure? Surely your sensitive files aren't world-
           | readable... /s
        
           | mananaysiempre wrote:
           | > This smells like a security vulnerability for most setups.
           | 
           | Yes, but if you look around on some old FTP servers (like on
           | the few still-extant mirror networks) you'll find that some
           | do actually let you CWD to the system /, and sometimes they
           | even drop you there by default (so you have to CWD pub or
           | whatever to get at the things you actually want).
        
         | jfrunyon wrote:
         | > what do FTP servers even do with this?
         | 
         | Pretty sure CWD by itself isn't even valid (at least RFC959
         | assumes it has an argument), and therefore // isn't valid in
         | FTP URLs.
         | 
         | The %2Ffoo/bar is needed because of the fact that FTP CWD and
         | RETR paths are system dependent (with, theoretically, system
         | dependent path separators), but URLs are not, so the FTP client
         | breaks the URL on / and sequentially executes CWD down the tree
         | so that it doesn't need to know what it's connected to.
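         | 
         | Why the order of operations matters, as a short Python
         | sketch. The two function names are hypothetical; only
         | urllib.parse.unquote comes from the standard library.

```python
from urllib.parse import unquote

def decode_then_split(url_path: str) -> list:
    """Wrong order: %2F is decoded first and then acts as a separator."""
    return unquote(url_path).split("/")

def split_then_decode(url_path: str) -> list:
    """Right order: split on literal "/" only, then decode each segment."""
    return [unquote(segment) for segment in url_path.split("/")]

url_path = "%2Ffoo/bar"  # URL-path of ftp://example.net/%2Ffoo/bar
print(decode_then_split(url_path))
print(split_then_decode(url_path))
```

         | Decoding first produces an extra empty segment and loses
         | the fact that "/foo" was meant as a single CWD argument.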
         | 
         | In other words: URL paths are not system paths, and it's a
         | mistake to think of them as such.
         | 
         | (Alternate in other words: FTP is awful)
        
       ___________________________________________________________________
       (page generated 2021-06-23 23:02 UTC)