[HN Gopher] The Pains of Path Parsing
___________________________________________________________________
The Pains of Path Parsing
Author : lukastyrychtr
Score : 22 points
Date : 2021-04-30 09:20 UTC (13 hours ago)
(HTM) web link (www.fpcomplete.com)
(TXT) w3m dump (www.fpcomplete.com)
| mlex wrote:
| Great article; in particular I hadn't thought about empty path
| components too closely and how websites usually just omit them
| when you try to go to e.g. https://github.com/nodejs//node
| michael1999 wrote:
| Once you accept that URLs don't even specify the character
| encoding, you realize it's impossible in general.
| saurik wrote:
| > Whether to include trailing slashes in URLs has been an old
| argument on the internet. Personally, because I consider the
| parsing-into-segments concept to be central to path parsing, I
| prefer excluding the trailing slash. And in fact, Yesod's default
| (and, at least for now, routetype-rs's default) is to treat such
| a URL as non-canonical and redirect away from it. I felt even
| more strongly about that when I realized lots of frameworks have
| special handling for "final segments with filename extensions."
| For example, /blog/bananas/ is good with a trailing slash, but
| /images/bananas.png should not have a trailing slash.
|
| So, this is not an argument in which people can really have an
| opinion: the URLs have fundamentally different semantics and
| behaviors with respect to relative paths. If you are talking
| about "the subresource bananas under the resource blog" then you
| _must_ use /blog/bananas, and the trailing slash is incorrect;
| in such a case, if you were to have a relative link to "apples"
| it would bring you to /blog/apples. In contrast, if you are
| wanting some kind of "default" resource--say, the moral
| equivalent of an "index.html" as is implemented in many web
| servers (but has nothing to do with the actual information model
| of the web)--as a subresource of the subresource bananas, then
| you _must_ use /blog/bananas/; in such a case, if you were to
| have a relative link to apples, it would bring you to
| /blog/bananas/apples and to link to /blog/apples you'd have to
| use ../apples.
|
| FWIW, I absolutely agree that "for the narrow question of blog
| posts, is a blog post a file or a folder?" is an interesting
| argument on which one might reasonably hold a different opinion
| than someone else, but generalizing it to the concept of URL
| routing itself is wrong: if you believe a blog post is
| semantically a folder--which is very reasonable, as a blog post
| might "contain" a number of media attachments--and the post
| itself is part of that folder, it would simply be wrong to
| elide the trailing slash, and web frameworks or content
| management systems that return the representation of a folder
| from "inside" that folder without a trailing slash deserve a
| special circle of www hell :/. My hope is that this author is
| _actually_ just expressing an opinion on the semantics of a blog
| post, not some general notion about URLs, but it is certainly
| written as the latter and it seems like the software they work on
| is general purpose.
|
| As an example, I personally find the usage on GitHub to not just
| be "incorrect" but "flagrantly ridiculous": it has decided to
| have no opinion on whether a trailing slash has any semantic
| meaning or not, and so relative paths essentially make no sense
| in the context of their website. Is the landing page of my
| repository "inside" the folder of my repository, or is my
| repository itself a resource of sorts that happens to also
| contain subresources? In the former case, the landing pages of
| other repositories in my organization are siblings of my
| repository's landing page, and the other information about my
| repository is a subresource of said landing page; while, in the
| latter case, the landing page of other repositories in my
| organization is the aunt/uncle of my repository's landing page,
| and other information about my repository is a sibling of my
| repository's landing page. Only one of these is supposed to be
| true!
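
The relative-link behavior described in the comment above can be checked with Python's standard `urllib.parse.urljoin`, which resolves a relative reference against a base URL. This is only a small sketch: `example.com` is a placeholder host, and the `/blog/bananas` and `apples` paths are the ones used in the comment.

```python
from urllib.parse import urljoin

# The same relative reference, resolved against two base URLs that differ
# only in the trailing slash. example.com is a placeholder host.
without_slash = urljoin("https://example.com/blog/bananas", "apples")
with_slash = urljoin("https://example.com/blog/bananas/", "apples")

# From the "folder" form, reaching the sibling needs an explicit "..".
parent = urljoin("https://example.com/blog/bananas/", "../apples")

print(without_slash)  # https://example.com/blog/apples
print(with_slash)     # https://example.com/blog/bananas/apples
print(parent)         # https://example.com/blog/apples
```

The last path segment of the base is dropped before the relative reference is appended, which is why the trailing slash changes where "apples" lands.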
| slver wrote:
| Regarding blog post, resources and trailing slashes: we have
| base tags.
| gopalv wrote:
| About 17 years ago, I had to solve the same problem, but since I
| used regexes, I had two problems at the end of it - memory usage
| and performance.
|
|     /* RFC2396 : Appendix B
|      As described in Section 4.3, the generic URI syntax is not
|      sufficient to disambiguate the components of some forms of
|      URI. Since the "greedy algorithm" described in that section
|      is identical to the disambiguation method used by POSIX
|      regular expressions, it is natural and commonplace to use a
|      regular expression for parsing
|
|      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
|       12            3  4          5       6  7        8 9
|
|      (Modified to support mailto: syntax as well) */
|
| But at least it wasn't the problem I started with.
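
The expression quoted in the comment above can be tried directly. What follows is a minimal sketch in Python, assuming the regex exactly as it appears in the comment and the group-to-component mapping given in the RFC appendix (groups 2, 4, 5, 7 and 9). The names `split_uri` and `path_segments` are illustrative, not from any library.

```python
import re

# The regex as quoted in the comment (RFC 2396, Appendix B).
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    """Split a URI reference into its five generic components.

    Every part of the pattern is optional, so the match always succeeds.
    Components that are absent come back as None.
    """
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


def path_segments(uri):
    """Split the path on "/" without dropping empty components."""
    return split_uri(uri)["path"].split("/")


print(split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related"))
print(split_uri("mailto:user@example.com"))
print(path_segments("https://github.com/nodejs//node"))
```

The regex only separates the components. Decisions such as what to do with the empty segment in `/nodejs//node`, or with a trailing slash, are left to the caller.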
| SavantIdiot wrote:
| That is a nightmare of a regex, IETF be damned. Splitting
| regexes up into smaller chunks helps readability, support,
| memory usage -and- performance. I suspect that example was
| provided as a definition and not actually intended to be
| implemented. Although there are better forms for representing
| construction, like BNF.
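
One way to read the "smaller chunks" suggestion is to build the same expression from named pieces. The sketch below is an assumption about what that could look like in Python, not something proposed in the thread: the constant names and the `parse` helper are made up for illustration, and it addresses readability only, with no claim about memory usage or performance.

```python
import re

# Each component of the RFC 2396 Appendix B expression as its own piece,
# using named groups instead of numbered ones.
SCHEME = r"(?:(?P<scheme>[^:/?#]+):)?"
AUTHORITY = r"(?://(?P<authority>[^/?#]*))?"
PATH = r"(?P<path>[^?#]*)"
QUERY = r"(?:\?(?P<query>[^#]*))?"
FRAGMENT = r"(?:#(?P<fragment>.*))?"

URI_RE = re.compile("^" + SCHEME + AUTHORITY + PATH + QUERY + FRAGMENT)


def parse(uri):
    """Return the named components; absent ones are None."""
    return URI_RE.match(uri).groupdict()


print(parse("http://example.com/a/b?x=1#top"))
```

The pieces concatenate to a pattern that accepts the same strings as the single-line version, so the two can be swapped without changing behavior.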
| the-dude wrote:
| _Regular Expressions: Now You Have Two Problems_
|
| https://blog.codinghorror.com/regular-expressions-now-you-ha...
___________________________________________________________________
(page generated 2021-04-30 23:01 UTC)