[HN Gopher] Finding CSV files that start with a BOM using ripgrep
___________________________________________________________________
Finding CSV files that start with a BOM using ripgrep
Author : pcr910303
Score : 113 points
Date : 2021-05-29 10:27 UTC (12 hours ago)
(HTM) web link (til.simonwillison.net)
(TXT) w3m dump (til.simonwillison.net)
| asicsp wrote:
| > _The --multiline option means the search spans multiple lines -
| I only want to match entire files that begin with my search term,
| so this means that ^ will match the start of the file, not the
| start of individual lines._
|
| That's not correct, because the `m` flag gets enabled by the
| multiline option:
|
|     $ printf 'a\nbaz\nabc\n' | rg -U '^b'
|     baz
|
| You need `\A` to match the start of the file, or to disable the
| `m` flag using `(?-m)`, but there seems to be some sort of bug
| (will file an issue soon):
|
|     $ printf 'a\nbaz\nabc\n' | rg -U '\Ab'
|     baz
|     $ printf 'a1\nbaz\nabc\n' | rg -U '\Ab'
|     baz
|     $ printf 'a12\nbaz\nabc\n' | rg -U '\Ab'
|     $
| burntsushi wrote:
| Yup, that's exactly right. '\A' or '(?-m)^' should work, but
| don't, because of an incorrectly applied optimization.
|
| The bug is fixed on master. Thanks for calling this to my
| attention! https://github.com/BurntSushi/ripgrep/issues/1878
| tialaramex wrote:
| Hmm. Thanks for fixing this, but, two things about the tests
|
| 1. These seem like they're effectively integration tests.
| They check the entire ripgrep command line app works as
| intended. Is this because the bug was _not_ where it looks
| like it is, in the regex crate, but elsewhere? If not, it
| seems like they'd be better as unit tests closer to where
| the bugs they're likely to detect would lie?
|
| 2. While repeating yourself exactly once isn't automatically
| a bad sign, it smells suspicious. It seems like there would
| be a _lot_ of tests that ought to behave exactly the same
| with or without --mmap, and so maybe that's a pattern worth
| extracting.
| tyingq wrote:
| "BOM" == UTF-8 Byte Order Mark I guess.
|
| I initially thought it was searching for "Bill of Materials" for
| electronics projects or similar.
| specialist wrote:
| TIL: I had no idea UTF-8 could have a BOM.
|
| https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
|
| https://www.w3.org/International/questions/qa-utf8-bom.en.ht...
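| The BOM is just U+FEFF serialized in UTF-8; a quick way to see
| the three bytes (using only coreutils) is:

```shell
# U+FEFF encoded as UTF-8 is the three-byte sequence EF BB BF
printf '\xEF\xBB\xBF' | od -An -tx1
```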
| maxnoe wrote:
| There is no UTF-8 BOM. UTF-8 has no byte order ambiguity.
| Only UTF-16 needs a BOM.
| codeulike wrote:
| https://superuser.com/questions/1553666/utf-8-vs-
| utf-8-with-...
| cerved wrote:
| huh?
| catblast01 wrote:
| Why does utf-32 not require a bom?
| gpvos wrote:
| It's practically never serialized to a file. And if you need
| one, you can use the same BOM value as UTF-16, just with two
| zero bytes added in the correct place.
| masklinn wrote:
| It does: https://www.unicode.org/faq/utf_bom.html#bom4
|
| Well _require_ is a bit excessive, but it certainly allows
| and recommends one.
|
| UTF-8 does not need one, because the code units are bytes,
| so byte order is not a concern.
|
| Exchanging UTF-32 is pretty rare though, and as long as you
| don't move anything between machines, byte order is not an
| issue.
| burntsushi wrote:
| From the Unicode Standard, Section 23.8 [1]:
|
| > In UTF-8, the BOM corresponds to the byte sequence <EF BB
| BF>. Although there are never any questions of byte order
| with UTF-8 text, this sequence can serve as signature for
| UTF-8 encoded text where the character set is unmarked. As
| with a BOM in UTF-16, this sequence of bytes will be
| extremely rare at the beginning of text files in other
| character encodings. For example, in systems that employ
| Microsoft Windows ANSI Code Page 1252, <EF BB BF>
| corresponds to the sequence <i diaeresis, guillemet,
| inverted question mark> "ï»¿".
|
| In practice, the UTF-8 BOM pops up. I usually see it on
| Windows.
|
| [1] - http://www.unicode.org/versions/Unicode13.0.0/ch23.pdf#
| G1963...
| cerved wrote:
| it pops up a lot and it's annoying as it's an invisible
| diff in git
| Measter wrote:
| I've also had fun due to a BOM. In my case it was for
| configuring the Assetto Corsa server. It takes an INI for
| the entry list, but halts parsing if it encounters
| unexpected input. Without any kind of message. The BOM
| was an unexpected input, so the server just immediately
| shut down because the entry list was "empty".
|
| That was a fun, and totally unstressful way to begin my
| time managing a racing league's race events.
| andylynch wrote:
| It's definitely a thing, it's even in RFC 3629, though
| definitely not recommended. However, some Microsoft tools
| default to writing CSV as UTF-8 + BOM and others expect it
| too, so it's hard to ignore.
| nemetroid wrote:
| Here's a coreutils (two-liner) version:
|
|     printf '\xEF\xBB\xBF' > bom.dat
|     find . -name '*.csv' \
|         -exec sh -c 'head --bytes 3 {} | cmp --quiet - bom.dat' \; \
|         -print
|
| The -exec option for find can be used as a filter (though
| -exec suppresses the default action, -print, so it must be
| re-enabled explicitly afterwards).
|
| Could be made into a oneliner by replacing the 'bom.dat' argument
| to cmp with '<(printf ...)'.
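| For the record, the one-liner might look roughly like this (a
| sketch assuming bash, since <(...) process substitution isn't
| POSIX sh, so the inner shell has to be bash rather than sh):

```shell
# list CSV files whose first 3 bytes equal the UTF-8 BOM, no temp file;
# bash runs once per file, so each <(printf ...) gets a fresh fd
find . -name '*.csv' -type f \
    -exec bash -c 'head --bytes 3 "$1" | cmp --quiet - <(printf "\xEF\xBB\xBF")' _ {} \; \
    -print
```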
| xorcist wrote:
| The cmp in coreutils understands:
|
|     -n, --bytes=LIMIT   compare at most LIMIT bytes
|
| so head is not really necessary:
|
|     find . -name '*.csv' -type f -exec cmp -sn 3 {} bom.dat \; -print
|
| Using -exec as a filter is a nice feature more people should
| use. That -type was put there just to avoid directories.
| RedShift1 wrote:
| Off topic but related, why does UTF-16 and UTF-32 even exist?
| Doesn't UTF-8 have the capability to go up to 32 bit wide
| characters already?
| pwdisswordfish8 wrote:
| UTF-16 was first. Or rather, UCS-2, which was limited to the
| Basic Multilingual Plane, and which UTF-16 extends to the whole
| of Unicode.
| kbumsik wrote:
| They existed before UTF-8, afaik.
| marcosdumay wrote:
| UTF-16 and UTF-32 are older than UTF-8.
|
| Besides, at the beginning people were really against
| variable-size encodings. UTF-8 won despite the Unicode
| consortium's and all the committees' efforts, not because of
| them.
| Dylan16807 wrote:
| UTF-16 came after UTF-8. Software had gotten locked into 16
| bit back when 16 bit meant fixed width, before either of
| those formats existed.
| ChrisSD wrote:
| Note that UTF-8 wasn't actually standardized until Unicode
| 2.0 in 1996. This was at the same time as the surrogate
| pairs needed for UTF-16. And UTF-8 didn't find its final
| form until 2003, which was around the time when it really
| started to gain legs.
|
| However, as you say, by 1996 people were already using the
| older UCS-2 standard.
| SCLeo wrote:
| Do you know why UTF-8 won? I feel textual data constitutes
| only a very tiny portion of memory used, but working with
| fixed-size encodings is so much easier than variable-size
| encodings.
| nneonneo wrote:
| The only universal, fixed-size encoding is UTF-32, which,
| as you can imagine, is very wasteful on space for ASCII
| text. Like it or not, most of the interesting strings in a
| given program are probably ASCII.
|
| UTF-16 is not a fixed-size encoding thanks to surrogate
| pairs. UCS-2 is a fixed-size encoding but can't represent
| code points outside the BMP (such as emoji) which makes it
| unsuitable for many applications.
|
| Besides, most of the time individual code points aren't
| what you care about anyway, so the cost of a variable-sized
| encoding like UTF-8 is only a small part of the overall
| work you need to support international text.
| a1369209993 wrote:
| Because Unicode (not UTF-anything, _Unicode itself_ )
| is/became a variable-width encoding (e.g. U+0078 U+0304 "x̄"
| is a single character, but two Unicode code points[0]). So
| encoding Unicode code points with a fixed-width encoding is
| completely useless, because your characters are still
| variable-width (it's also hazardous, since it increases how
| long it takes for bugs triggered by variable-width
| characters to surface, especially if you normalize to NFC).
|
| 0: Similarly, U+01F1 "DZ" is two characters, but one Unicode
| code point, which is much, much worse as it means you can
| no longer treat encoded strings as concatenations of
| encoded characters. UTF-8-as-such doesn't have this problem
| - any 'string' of code points can only be encoded as the
| concatenation of the encodings of its elements - but UTF-8
| in practice does inherit the character-level version of
| this problem from Unicode.
| Dylan16807 wrote:
| The only way to "properly" have a fixed-width encoding is
| to allocate 80-128 bytes per character*. Anything else
| will break horribly on accents and other common codepoints.
| So everyone uses the less-easy methods.
|
| * I base this number off the "Stream-Safe Text Format",
| which suggests that while it's preferred that you accept
| infinitely-long characters, a cap of 31 code points is more
| or less acceptable.
| dahfizz wrote:
| A file containing every single Unicode codepoint once would be
| smaller in UTF-16 than in UTF-8. UTF-16 can make sense in some
| applications.
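| For what it's worth, a back-of-the-envelope check with shell
| arithmetic (code point counts grouped by encoded length,
| surrogates excluded) seems to bear this out:

```shell
# scalar values by UTF-8 encoded length: 1 byte: 128 (U+0000..U+007F),
# 2 bytes: 1920 (U+0080..U+07FF),
# 3 bytes: 61440 (U+0800..U+FFFF minus 2048 surrogates),
# 4 bytes: 1048576 (U+10000..U+10FFFF)
utf8=$((  128*1 + 1920*2 + 61440*3 + 1048576*4 ))
# UTF-16: 63488 BMP scalars at 2 bytes each,
# 1048576 supplementary at 4 bytes each (surrogate pairs)
utf16=$(( 63488*2 + 1048576*4 ))
echo "UTF-8: $utf8 bytes, UTF-16: $utf16 bytes"
```

| which gives 4,382,592 bytes for UTF-8 versus 4,321,280 for
| UTF-16, so UTF-16 is indeed slightly smaller for that file.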
| foepys wrote:
| A character in UTF-8 can even be more than 4 bytes long.
| Examples are flag or skin-tone emojis.
| barrkel wrote:
| UTF-8 encodes code points, as do UTF-16 and UTF-32. Once you
| go from a sequence of bytes to a sequence of code points,
| you've moved beyond the specifics of the encoding.
|
| Code points might be combined to form graphemes and grapheme
| clusters. Some of the latest emojis are extended grapheme
| clusters, for e.g. handling the combinatorics of mixed
| families. This is a higher level composition than UTF-x, it's
| logically a separate layer.
|
| IMO talking about characters in the context of Unicode is
| often unhelpful because it's vague.
| tialaramex wrote:
| Right. "Character" is almost never what you meant, unless
| your goal was to be as vague as possible. In human
| languages I like the word "squiggle" to mean this thing you
| have fuzzy intuitive beliefs about, rather than
| "character". In Unicode the Code Unit, and Code Point are
| maybe things to know about, but neither of them is a
| "character".
|
| In programming languages or APIs where precision matters,
| your goal should be to avoid this notion of characters as
| much as practical. In a high level language with types,
| just do not offer a built-in "char" data type. Sub-strings
| are all anybody in a high-level language actually needs to
| get their job done: "A" is a perfectly good sub-string of
| "CAT", and there's no need to pretend you can slice strings
| up into "characters" like 'A' that have any distinct
| properties worth inventing a whole datatype for.
|
| If you're writing device drivers, once again, what do you
| care about "characters"? You want a byte data type, most
| likely, some address types, that sort of thing, but who
| wants a "character"? However, somewhere down in the guts a
| low-level language will need to think about Unicode
| encoding, and so eventually they do need a datatype for
| that when a 32-bit integer doesn't really cut it. I think
| Rust's "char" is a little bit too prominent for example, it
| needn't be more in your face than say,
| std::num::NonZeroUsize. Most people won't need it, most of
| the time and that's as it should be.
| tialaramex wrote:
| Others have talked about the history of UTF-16. I'll focus on
| that last part: You must not write 32-bit wide characters in
| UTF-8.
|
| Unicode / ISO 10646 is specifically defined to only have code
| points from 0 to 0x10FFFF. As a result UTF-8 that would decode
| outside that range is just invalid, no different from if it was
| 0xFF bytes or something.
|
| It also doesn't make sense to write UTF-8 that decodes as
| U+D800 through U+DFFF since although these code points exist,
| the standard specifically reserves them to make UTF-16 work,
| and you're not using UTF-16.
| bmn__ wrote:
| > You must not write 32-bit wide characters in UTF-8.
|
| You can't tell me what to do, dad. I'll encode 64 bits and
| you can't stop me! Bwahahahaa!
|
|     $ perl -MEncode=encode_utf8 \
|         -e 'print encode_utf8 "\x{7fff_ffff_ffff_ffff}"' | hex
|     0000  ff 80 87 bf bf bf bf bf bf bf bf bf bf  y??????????
| Dylan16807 wrote:
| > As a result UTF-8 that would decode outside that range is
| just invalid, no different from if it was 0xFF bytes or
| something.
|
| That's needlessly pedantic. If you use an old version of the
| spec those bytes are valid.
|
| And "have the capability" seems to me to be talking about
| what the underlying method is able to do, not the full set of
| "must not" rules.
| Dylan16807 wrote:
| > Doesn't UTF-8 have the capability to go up to 32 bit wide
| characters already?
|
| 31.
| superjan wrote:
| One large source of byte order marks in UTF-8 is Windows. In
| MS-DOS and later Windows, 8-bit encoded files are assumed to
| be in the system code page, which, to enable all the world's
| writing systems, varies from country to country. When UTF-8
| came along, Microsoft tools disambiguated those files from
| the local code page by prefixing them with a byte order
| mark. They also do this in (for instance) the .NET Framework
| XML libraries (by default). I don't know what .NET Core
| does. I suppose it made sense at the time, but I'm sure they
| regret this by now.
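| When the BOM gets in the way, one way to strip it in place is
| a sketch like this (assumes GNU sed, whose \xHH escapes are an
| extension; sample.csv is just a scratch file for illustration):

```shell
# make a sample CSV that starts with a UTF-8 BOM (scratch file)
printf '\xEF\xBB\xBFid,name\n1,alice\n' > sample.csv
# delete a leading BOM from the first line, if present (GNU sed)
sed -i '1s/^\xEF\xBB\xBF//' sample.csv
# the file now starts with 'id', not the BOM
head -c 2 sample.csv
```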
| superjan wrote:
| And, I bet a significant portion of the offending CSV files
| are from Excel. Excel is really annoying because it also
| silently localizes CSV files: my language uses the comma as
| a decimal separator, so Excel will switch to semicolon for
| the delimiter.
| dazfuller wrote:
| I probably won't ever need this, but I love the write-up for
| a tool which I use daily.
| nwellnhof wrote:
| Is there anything like --multiline in GNU grep?
| thijsvandien wrote:
| See here: https://stackoverflow.com/a/7167115/1163893.
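| Roughly, the trick from that answer: GNU grep's -z option reads
| NUL-separated records, so a file with no NUL bytes becomes one
| record and \A anchors at the start of the file. A sketch for
| the BOM case (assumes a grep built with PCRE support for -P;
| the temp files are just for illustration):

```shell
# set up one CSV with a BOM and one without, in a scratch dir
dir=$(mktemp -d)
printf '\xEF\xBB\xBFa,b\n' > "$dir/with.csv"
printf 'a,b\n'             > "$dir/without.csv"
# -P: PCRE, -z: NUL-separated records, so \A matches file start;
# LC_ALL=C keeps PCRE in byte mode so \xEF matches the raw byte 0xEF
LC_ALL=C grep -rlPz --include='*.csv' '\A\xEF\xBB\xBF' "$dir"
```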
| wodenokoto wrote:
| I don't know if I'm ever gonna need this, but I loved
| learning it!
___________________________________________________________________
(page generated 2021-05-29 23:01 UTC)