[HN Gopher] Finding CSV files that start with a BOM using ripgrep
       ___________________________________________________________________
        
       Finding CSV files that start with a BOM using ripgrep
        
       Author : pcr910303
       Score  : 113 points
       Date   : 2021-05-29 10:27 UTC (12 hours ago)
        
 (HTM) web link (til.simonwillison.net)
 (TXT) w3m dump (til.simonwillison.net)
        
       | asicsp wrote:
       | > _The --multiline option means the search spans multiple lines -
       | I only want to match entire files that begin with my search term,
       | so this means that ^ will match the start of the file, not the
       | start of individual lines._
       | 
        | That's not correct, because the `m` flag gets enabled by the
        | multiline option:
        | 
        |     $ printf 'a\nbaz\nabc\n' | rg -U '^b'
        |     baz
        | 
        | You need `\A` to match the start of the file, or to disable the
        | `m` flag using `(?-m)`, but there seems to be some sort of bug
        | (will file an issue soon):
        | 
        |     $ printf 'a\nbaz\nabc\n' | rg -U '\Ab'
        |     baz
        |     $ printf 'a1\nbaz\nabc\n' | rg -U '\Ab'
        |     baz
        |     $ printf 'a12\nbaz\nabc\n' | rg -U '\Ab'
        |     $
        
         | burntsushi wrote:
         | Yup, that's exactly right. '\A' or '(?-m)^' should work, but
         | don't, because of an incorrectly applied optimization.
         | 
         | The bug is fixed on master. Thanks for calling this to my
         | attention! https://github.com/BurntSushi/ripgrep/issues/1878
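          | 
          | For reference, a minimal sketch of the behaviour you'd expect
          | from both forms once the anchors work (not verified against a
          | particular release):
          | 
          |     $ printf 'a\nbaz\nabc\n' | rg -U '\Ab'
          |     $ printf 'a\nbaz\nabc\n' | rg -U '(?-m)^b'
          |     $ printf 'baz\nabc\n' | rg -U '\Ab'
          |     baz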
        
           | tialaramex wrote:
           | Hmm. Thanks for fixing this, but, two things about the tests
           | 
           | 1. These seem like they're effectively integration tests.
           | They check the entire ripgrep command line app works as
           | intended. Is this because the bug was _not_ where it looks
           | like it is, in the regex crate, but elsewhere? If not, it
            | seems like they'd be better as unit tests closer to where
           | the bugs they're likely to detect would lie?
           | 
           | 2. While repeating yourself exactly once isn't automatically
           | a bad sign, it smells suspicious. It seems like there would
           | be a _lot_ of tests that ought to behave exactly the same
            | with or without --mmap and so maybe that's a pattern worth
           | extracting.
        
       | tyingq wrote:
       | "BOM" == UTF-8 Byte Order Mark I guess.
       | 
       | I initially thought it was searching for "Bill of Materials" for
       | electronics projects or similar.
        
         | specialist wrote:
         | TIL: I had no idea UTF-8 could have a BOM.
         | 
         | https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8
         | 
         | https://www.w3.org/International/questions/qa-utf8-bom.en.ht...
        
         | maxnoe wrote:
         | There is no utf8 bom. Utf8 has no Byte order ambiguity. Only
         | utf-16 needs a bom.
        
           | codeulike wrote:
           | https://superuser.com/questions/1553666/utf-8-vs-
           | utf-8-with-...
        
           | cerved wrote:
           | huh?
        
           | catblast01 wrote:
           | Why does utf-32 not require a bom?
        
             | gpvos wrote:
             | It's practically never serialized to a file. And if you
             | need one, you can just use the same BOM value as UTF-16,
             | just add two zero bytes in the correct place.
        
             | masklinn wrote:
             | It does: https://www.unicode.org/faq/utf_bom.html#bom4
             | 
             | Well _require_ is a bit excessive, but it certainly allows
             | and recommends one.
             | 
             | Utf8 does not need one because the code units are bytes, so
              | byte order is not a concern.
             | 
             | Exchanging utf32 is pretty rare though, and as long as you
              | don't move anything between machines, byte order is not an
             | issue.
        
           | burntsushi wrote:
           | From Unicode 23.8[1]:
           | 
           | > In UTF-8, the BOM corresponds to the byte sequence <EF BB
           | BF>. Although there are never any questions of byte order
           | with UTF-8 text, this sequence can serve as signature for
           | UTF-8 encoded text where the character set is unmarked. As
           | with a BOM in UTF-16, this sequence of bytes will be
           | extremely rare at the beginning of text files in other
           | character encodings. For example, in systems that employ
            | Microsoft Windows ANSI Code Page 1252, <EF BB BF>
            | corresponds to the sequence <i diaeresis, guillemet,
            | inverted question mark> "ï»¿".
           | 
           | In practice, the UTF-8 BOM pops up. I usually see it on
           | Windows.
           | 
           | [1] - http://www.unicode.org/versions/Unicode13.0.0/ch23.pdf#
           | G1963...
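            | 
            | A quick way to see whether a given file starts with those
            | three bytes (a sketch, assuming a hypothetical file.csv and
            | od from coreutils):
            | 
            |     $ head -c 3 file.csv | od -An -tx1
            |      ef bb bf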
        
             | cerved wrote:
             | it pops up a lot and it's annoying as it's an invisible
             | diff in git
        
               | Measter wrote:
               | I've also had fun due to a BOM. In my case it was for
               | configuring the Assetto Corsa server. It takes an INI for
               | the entry list, but halts parsing if it encounters
               | unexpected input. Without any kind of message. The BOM
               | was an unexpected input, so the server just immediately
               | shut down because the entry list was "empty".
               | 
               | That was a fun, and totally unstressful way to begin my
               | time managing a racing league's race events.
        
           | andylynch wrote:
        | It's definitely a thing - it's even in RFC 3629 - though not
        | recommended. However, some Microsoft tools default to writing
        | CSV as UTF-8+BOM and others expect it too, so it's hard to
        | ignore.
        
       | nemetroid wrote:
        | Here's a coreutils (two-liner) version:
        | 
        |     printf '\xEF\xBB\xBF' >bom.dat
        |     find . -name '*.csv' \
        |         -exec sh -c 'head --bytes 3 {} | cmp --quiet - bom.dat' \; \
        |         -print
        | 
        | The -exec option for find can be used as a filter (though -exec
        | disables the default action, -print, so it must be reenabled
        | after).
        | 
        | Could be made into a one-liner by replacing the 'bom.dat'
        | argument to cmp with '<(printf ...)'.
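        | 
        | A sketch of that one-liner, using bash -c (process substitution
        | isn't POSIX sh) and passing the filename as "$1" rather than
        | splicing {} into the command string:
        | 
        |     find . -name '*.csv' -exec bash -c \
        |         'head -c 3 "$1" | cmp -s - <(printf "\xEF\xBB\xBF")' \
        |         bash {} \; -print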
        
         | xorcist wrote:
          | The cmp in coreutils understands:
          | 
          |     -n, --bytes=LIMIT   compare at most LIMIT bytes
          | 
          | so head is not really necessary:
          | 
          |     find . -name '*.csv' -type f \
          |         -exec cmp -sn 3 {} bom.dat \; -print
         | 
         | Using -exec as a filter is a nice feature more people should
         | use. That -type was put there just to avoid directories.
        
       | RedShift1 wrote:
        | Off topic but related, why do UTF-16 and UTF-32 even exist?
       | Doesn't UTF-8 have the capability to go up to 32 bit wide
       | characters already?
        
         | pwdisswordfish8 wrote:
         | UTF-16 was first. Or rather, UCS-2, which was limited to the
         | Basic Multilingual Plane, and which UTF-16 extends to the whole
         | of Unicode.
        
         | kbumsik wrote:
          | They existed before UTF-8, afaik.
        
         | marcosdumay wrote:
         | UTF-16 and UTF-32 are older than UTF-8.
         | 
         | Besides, at the beginning people were really against variable
         | size encodings. UTF-8 won despite the Unicode consortium and
          | all the committees' effort, not because of it.
        
           | Dylan16807 wrote:
           | UTF-16 came after UTF-8. Software had gotten locked into 16
           | bit back when 16 bit meant fixed width, before either of
           | those formats existed.
        
             | ChrisSD wrote:
             | Note that UTF-8 wasn't actually standardized until Unicode
             | 2.0 in 1996. This was at the same time as the surrogate
             | pairs needed for UTF-16. And UTF-8 didn't find its final
             | form until 2003, which was around the time when it really
             | started to gain legs.
             | 
             | However, as you say, by 1996 people were already using the
             | older UCS-2 standard.
        
           | SCLeo wrote:
           | Do you know why UTF-8 won? I feel textual data only
            | constitutes a very tiny portion of memory used, but working
            | with a fixed-size encoding is so much easier than with
            | variable-size encodings.
        
             | nneonneo wrote:
             | The only universal, fixed-size encoding is UTF-32, which,
             | as you can imagine, is very wasteful on space for ASCII
             | text. Like it or not, most of the interesting strings in a
             | given program are probably ASCII.
             | 
             | UTF-16 is not a fixed-size encoding thanks to surrogate
             | pairs. UCS-2 is a fixed-size encoding but can't represent
             | code points outside the BMP (such as emoji) which makes it
             | unsuitable for many applications.
             | 
             | Besides, most of the time individual code points aren't
             | what you care about anyway, so the cost of a variable-sized
             | encoding like UTF-8 is only a small part of the overall
             | work you need to support international text.
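              | 
              | A rough illustration of the size trade-offs (a sketch
              | assuming GNU iconv and a UTF-8 locale; the last example is
              | U+1F600, given as its UTF-8 bytes):
              | 
              |     $ printf 'hello' | wc -c
              |     5
              |     $ printf 'hello' | iconv -f UTF-8 -t UTF-32LE | wc -c
              |     20
              |     $ printf '\xf0\x9f\x98\x80' \
              |           | iconv -f UTF-8 -t UTF-16LE | wc -c
              |     4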
        
             | a1369209993 wrote:
              | Because Unicode (not UTF-anything, _Unicode itself_)
              | is/became a variable-width encoding (e.g. U+78 U+304 "x̄" is a
             | single character, but two Unicode code points[0]). So
             | encoding Unicode code points with a fixed-width encoding is
             | completely useless, because your characters are still
             | variable-width (it's also hazardous, since it increases how
             | long it takes for bugs triggered by variable-width
             | characters to surface, especially if you normalize to NFC).
             | 
             | 0: Similarly, U+1F1 "DZ" is two characters, but one Unicode
             | code point, which is much, much worse as it means you can
             | no longer treat encoded strings as concatenations of
             | encoded characters. UTF-8-as-such doesn't have this problem
             | - any 'string' of code points can only be encoded as the
             | concatenation of the encodings of its elements - but UTF-8
             | in practice does inherit the character-level version of
             | this problem from Unicode.
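              | 
              | A small sketch of that first example (assuming a UTF-8
              | locale and a printf that understands \u escapes, e.g. bash
              | 4.2+ or GNU coreutils):
              | 
              |     $ printf 'x\u0304' | wc -m   # code points
              |     2
              |     $ printf 'x\u0304' | wc -c   # bytes in UTF-8
              |     3
              | 
              | Two code points, three bytes, but it displays and edits as
              | the single character "x̄".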
        
             | Dylan16807 wrote:
             | The only way to "properly" have a fixed width encoding is
              | to allocate 80-128 bytes per character*. Anything else
             | will break horribly on accents and other common codepoints.
             | So everyone uses the less-easy methods.
             | 
              | * I base this number off the "Stream-Safe Text Format"
             | which suggests that while it's preferred that you accept
             | infinitely-long characters, a cap of 31 code points is more
             | or less acceptable.
        
         | dahfizz wrote:
         | A file containing every single Unicode codepoint once would be
         | smaller in UTF-16 than in UTF-8. UTF-16 can make sense in some
         | applications.
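          | 
          | A back-of-the-envelope check (128 one-byte, 1,920 two-byte and
          | 61,440 three-byte non-surrogate BMP code points, plus
          | 1,048,576 four-byte supplementary ones):
          | 
          |     $ echo $(( 128*1 + 1920*2 + 61440*3 + 1048576*4 ))  # UTF-8
          |     4382592
          |     $ echo $(( 63488*2 + 1048576*4 ))                   # UTF-16
          |     4321280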
        
         | foepys wrote:
          | A character in UTF-8 can even be more than 4 bytes long.
          | Examples are flags and skin-tone emojis.
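          | 
          | For instance, a flag emoji is two regional-indicator code
          | points, eight bytes in UTF-8 (a sketch assuming a UTF-8 locale
          | and bash 4.2+ printf with \U escapes):
          | 
          |     $ printf '\U0001F1E9\U0001F1EA' | wc -c
          |     8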
        
           | barrkel wrote:
           | UTF-8 encodes code points, as do UTF-16 and UTF-32. Once you
           | go from a sequence of bytes to a sequence of code points,
           | you've moved beyond the specifics of the encoding.
           | 
           | Code points might be combined to form graphemes and grapheme
           | clusters. Some of the latest emojis are extended grapheme
            | clusters, e.g. for handling the combinatorics of mixed
           | families. This is a higher level composition than UTF-x, it's
           | logically a separate layer.
           | 
           | IMO talking about characters in the context of Unicode is
           | often unhelpful because it's vague.
        
             | tialaramex wrote:
             | Right. "Character" is almost never what you meant, unless
             | your goal was to be as vague as possible. In human
             | languages I like the word "squiggle" to mean this thing you
             | have fuzzy intuitive beliefs about, rather than
             | "character". In Unicode the Code Unit, and Code Point are
             | maybe things to know about, but neither of them is a
             | "character".
             | 
             | In programming languages or APIs where precision matters,
             | your goal should be to avoid this notion of characters as
             | much as practical. In a high level language with types,
              | just do not offer a built-in "char" data type. Sub-strings
              | are all anybody in a high level language actually needs to
              | get their job done: "A" is a perfectly good sub-string of
              | "CAT", and there's no need to pretend you can slice strings
              | up into "characters" like 'A' with distinct properties
              | worth inventing a whole datatype for.
             | 
             | If you're writing device drivers, once again, what do you
             | care about "characters"? You want a byte data type, most
             | likely, some address types, that sort of thing, but who
              | wants a "character"? However, somewhere down in the guts a
             | low-level language will need to think about Unicode
             | encoding, and so eventually they do need a datatype for
             | that when a 32-bit integer doesn't really cut it. I think
             | Rust's "char" is a little bit too prominent for example, it
             | needn't be more in your face than say,
              | std::num::NonZeroUsize. Most people won't need it most of
              | the time, and that's as it should be.
        
         | tialaramex wrote:
         | Others have talked about the history of UTF-16. I'll focus on
         | that last part: You must not write 32-bit wide characters in
         | UTF-8.
         | 
         | Unicode / ISO 10646 is specifically defined to only have code
         | points from 0 to 0x10FFFF. As a result UTF-8 that would decode
         | outside that range is just invalid, no different from if it was
         | 0xFF bytes or something.
         | 
         | It also doesn't make sense to write UTF-8 that decodes as
         | U+D800 through U+DFFF since although these code points exist,
         | the standard specifically reserves them to make UTF-16 work,
         | and you're not using UTF-16.
        
           | bmn__ wrote:
           | > You must not write 32-bit wide characters in UTF-8.
           | 
            | You can't tell me what to do, dad. I'll encode 64 bits and
            | you can't stop me! Bwahahahaa!
            | 
            |     $ perl -MEncode=encode_utf8 \
            |           -e'print encode_utf8 "\x{7fff_ffff_ffff_ffff}"' | hex
            |     0000  ff 80 87 bf bf bf bf bf  bf bf bf bf bf   y??????????
        
           | Dylan16807 wrote:
           | > As a result UTF-8 that would decode outside that range is
           | just invalid, no different from if it was 0xFF bytes or
           | something.
           | 
           | That's needlessly pedantic. If you use an old version of the
           | spec those bytes are valid.
           | 
           | And "have the capability" seems to me to be talking about
           | what the underlying method is able to do, not the full set of
           | "must not" rules.
        
         | Dylan16807 wrote:
         | > Doesn't UTF-8 have the capability to go up to 32 bit wide
         | characters already?
         | 
         | 31.
        
       | superjan wrote:
        | One large source of byte order marks in UTF-8 is Windows. In
        | MS-DOS and later Windows, 8-bit encoded files are assumed to be
        | in the system code page, which - to cover all the world's
        | writing systems - varies from country to country. When UTF-8
        | came along, Microsoft tools disambiguated those files from the
        | local code page by prefixing them with a byte order mark. They
        | also do this in (for instance) the .NET Framework XML libraries
        | (by default). I don't know what .NET Core does. I suppose it
        | made sense at the time, but I'm sure they regret this by now.
        
         | superjan wrote:
          | And, I bet a significant portion of the offending CSV files
          | are from Excel. Excel is really annoying because it also
          | silently localizes CSV files: my language uses the comma as a
          | decimal separator, so Excel will switch to a semicolon for the
          | delimiter.
        
       | dazfuller wrote:
        | I probably won't ever need this, but I love the write-up for a
        | tool which I use daily.
        
       | nwellnhof wrote:
       | Is there anything like --multiline in GNU grep?
        
         | thijsvandien wrote:
         | See here: https://stackoverflow.com/a/7167115/1163893.
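          | 
          | For this particular task, a rough equivalent is combining -z
          | (NUL-separated "lines", so a file with no NUL bytes is one
          | record) with -P. A sketch, assuming a reasonably recent GNU
          | grep; LC_ALL=C keeps -P in byte mode so \xEF matches the raw
          | byte:
          | 
          |     LC_ALL=C grep -rlzP '\A\xEF\xBB\xBF' --include='*.csv' .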
        
       | wodenokoto wrote:
        | I don't know if I'm ever gonna need this, but I loved learning
       | it!
        
       ___________________________________________________________________
       (page generated 2021-05-29 23:01 UTC)