grep regex .* not matching everything
I've recently gotten into using tools like grep, wc, cat, etc. because I have to deal with some very large CSV files (>10GB) which aren't quite delimited correctly (for instance, having occurrences of the delimiter character inside some of the fields).
While working with one of these files, I ran the following command as part of trying to figure out a way to correctly identify which instances of ; are delimiters and replace them with some other character:
grep -v -n --text "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]" < Transactions.csv
The regex can probably be done much better, but anyway; what is surprising is that, among others, the above command outputs the following line:
12345678:2016-10-25;12345678912345;2016-10-25;gobbledegook �IDNR: 69 ;12345.67;.00;2003-09-05;12345678;2003-09-03;stuff stuff ;12345 fgadfkjgbsdkb;12/3/45678/9
(As this was actually transaction data, I've changed most of the fields' values, except for the offending �.) Maybe I'm being silly, but why doesn't the above regex match that line? It seems like the regex .* somehow doesn't match that character.
I suspect that the file is saved using the UTF-16 encoding, if that makes any difference.
Edit: Thanks to @exore for the answer. As it turns out, my file was encoded in ISO-8859-15, which I was able to figure out by grepping the lines containing special characters, which were relatively few, into a file and opening that in gedit. I then used iconv to convert the file to UTF-8, after which it worked fine!
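For anyone wanting to reproduce the steps from the edit, here is a rough sketch. It assumes GNU grep and iconv. The sample data and the function names sample and suspect_lines are made up for illustration. Note that the pattern [^ -~] flags any byte outside printable ASCII, so it will also catch tabs and other control characters, not only bytes above 0x7F.

```shell
# Made-up two-line sample; \351 is the single ISO-8859-15 byte for "é".
sample() {
    printf 'plain;ascii;line\ngobbledegook \351IDNR: 69\n'
}

# Print (with line numbers) every line that contains a byte outside
# printable ASCII. LC_ALL=C makes grep treat each byte as one character,
# and -a forces text mode even if the data looks binary.
suspect_lines() {
    LC_ALL=C grep -a -n '[^ -~]'
}

# Line numbers of the suspicious lines only.
sample | suspect_lines | cut -d: -f1

# Once the encoding is known, the same lines can be made readable.
sample | suspect_lines | cut -d: -f2- | iconv -f ISO-8859-15 -t UTF-8
```

On the real file, the input would come from Transactions.csv instead of the sample function.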
1 Answer
This is a typical character encoding problem. . means any character, but which sequences of bytes form a legal character is a matter of encoding. Dealing with text without knowing its encoding is bound to fail. Your grep command probably expects UTF-8 encoded input. UTF-8 is a multibyte encoding, meaning that some characters are represented by multiple bytes. However, not all byte sequences are valid. See, for example, the Wikipedia article on UTF-8.
When grep encounters a byte sequence that is not a valid character in the expected encoding, . cannot match it, so the line doesn't match the pattern and, because of -v, it is output. Since your terminal doesn't recognise the byte sequence either, you get a �.
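To check whether some data really is valid in the encoding grep expects, one option (a sketch, not the only way) is to let iconv attempt a UTF-8 to UTF-8 conversion, which fails on the first invalid sequence. The function name is_valid_utf8 and the sample bytes are made up here.

```shell
# Succeeds only if every byte sequence on stdin is valid UTF-8.
# iconv exits with a non-zero status when it hits an invalid sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1
}

# "é" encoded as UTF-8 is the two bytes C3 A9 ...
printf 'caf\303\251\n' | is_valid_utf8 && echo "valid UTF-8"

# ... while a lone E9 byte (é in ISO-8859-15) is not a valid sequence.
printf 'caf\351\n' | is_valid_utf8 || echo "not valid UTF-8"
```

A failure here only says the data is not UTF-8; it does not say which encoding it actually is.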
There is a workaround in your case. Tell grep not to bother about encoding, and consider one byte as one char.
env LANG=C grep ....
or maybe
env LANG=C LC_ALL=C grep ....
You may test easily:
Create 2 files, one utf-8 encoded, one utf-16-be:
$ echo éléphant | tee file.std | iconv -f utf8 -t utf16be >file.utf16be
Check content of files:
$ cat file*
éléphant
�l�phant
Try to grep. The utf16be string is not recognised, no output:
$ grep '^.*$' file*
file.std:éléphant
Don't use any encoding at all: one byte is one char, so all strings are matched. The � just means the terminal doesn't recognise the utf16be sequence as a valid UTF-8 char. Note the use of -a to tell grep to treat binary data as text.
$ env LANG=C grep -a '^.*$' file*
file.std:éléphant
file.utf16be:�l�phant
Alternatively, if you know the encoding, then you can use iconv to first convert your file and then use grep. One of the following should work.
iconv -f utf16 -t utf8 < file | grep ...
iconv -f utf16le -t utf8 < file | grep ...
iconv -f utf16be -t utf8 < file | grep ...
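For the file in the question, which turned out to be ISO-8859-15 rather than UTF-16, the same approach might look like the sketch below. The date pattern is written with -E and interval expressions as a shorter equivalent of the original one. The sample line is an abbreviated, made-up stand-in, and the names four_dates and sample_line are only for illustration.

```shell
# A date followed by anything, three times, then a fourth date
# (same meaning as the long pattern in the question).
four_dates='([0-9]{4}-[0-9]{2}-[0-9]{2}.*){3}[0-9]{4}-[0-9]{2}-[0-9]{2}'

# Abbreviated stand-in for the offending line; \351 is a raw ISO-8859-15 byte.
sample_line() {
    printf '2016-10-25;1234;2016-10-25;gobbledegook \351IDNR: 69 ;2003-09-05;1;2003-09-03\n'
}

# Convert first, then grep. After conversion the input is valid UTF-8,
# so the line matches and -c reports one matching line.
sample_line | iconv -f ISO-8859-15 -t UTF-8 | grep -c -E "$four_dates"
```

With the original -v -n options in place of -c, this line would no longer show up among the non-matching ones.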