grep regex .* not matching everything

I've recently gotten into using tools like grep, wc, cat, etc. because I have to deal with some very large CSV files (>10GB) which aren't quite delimited correctly (for instance, having occurrences of the delimiter character inside some of the fields).

While working with one of these files, I ran the following command as part of trying to figure out how to correctly identify which instances of ; are delimiters and replace them with some other character:

grep -v -n --text "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].*[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]" < Transactions.csv

The regex could probably be written much better, but what surprises me is that the above command outputs, among others, the following line:

12345678:2016-10-25;12345678912345;2016-10-25;gobbledegook �IDNR: 69 ;12345.67;.00;2003-09-05;12345678;2003-09-03;stuff stuff ;12345 fgadfkjgbsdkb;12/3/45678/9

(As this was actually transaction data, I've changed most of the fields' values, except for the offending �.) Maybe I'm being silly, but why doesn't the above regex match that line? It seems like .* somehow doesn't match that character.

I suspect that the file is saved using the UTF-16 encoding, if that makes any difference.

Edit: Thanks to @exore for the answer. As it turns out, my file was encoded in ISO-8859-15, which I was able to figure out by grepping the lines containing special characters, which were relatively few, into a file and opening that in gedit. I then used iconv to convert the file to UTF-8, after which it worked fine!

1 Answer

This is a typical character encoding problem. . means any character, but which sequences of bytes form a legal character depends on the encoding. Dealing with text without knowing its encoding is bound to fail. Your grep command probably expects UTF-8 encoded input. UTF-8 is a multibyte encoding, meaning that some characters are represented by multiple bytes. However, not all sequences of bytes are valid. See, for example, the Wikipedia article on UTF-8.
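Whether some input is valid UTF-8 can be checked by round-tripping it through iconv, which exits with a non-zero status when it hits an invalid sequence. This is only a minimal sketch, assuming iconv is installed; the helper name is_valid_utf8 and the sample bytes are made up for illustration:

```shell
# Hypothetical helper: succeeds only if stdin is entirely valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# \303\251 is "é" encoded in UTF-8 (two bytes).
if printf 'caf\303\251\n' | is_valid_utf8; then good=valid; else good=invalid; fi

# \351 is "é" in ISO-8859-1/15: a lone byte that is not valid UTF-8.
if printf 'caf\351\n' | is_valid_utf8; then bad=valid; else bad=invalid; fi

echo "UTF-8 input: $good, Latin input: $bad"
```

On a real file the same helper would be used as is_valid_utf8 < Transactions.csv.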

When grep encounters a byte sequence that is not a valid character in the expected encoding, it cannot recognise it as a character, so . does not match it and the line as a whole doesn't match. Because you used -v, which prints non-matching lines, the line is output. Since your terminal doesn't recognise the byte sequence either, it shows a �.

There is a workaround in your case: tell grep not to bother about encoding and to treat each byte as one character.

env LANG=C grep ....

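or, if that has no effect because LC_ALL is already set in your environment (it takes precedence over LANG):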

env LANG=C LC_ALL=C grep ....
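Applied to a line shaped like the one in the question, the workaround might look like the sketch below. The sample line and the variable names are made up for illustration, and the date pattern is shortened with an interval expression (\{4\}), which is not part of the original command. It counts matching lines with -c rather than printing non-matching ones with -v:

```shell
# Date pattern written with BRE interval expressions instead of repeated [0-9].
d='[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}'

# Made-up sample line: four dates plus one stray byte (octal 351) that is not valid UTF-8.
sample='2016-10-25;1;2016-10-25;x \351IDNR;2003-09-05;2;2003-09-03;y\n'

# In the C locale every byte counts as one character, so .* can span the stray byte.
# -a makes grep treat the input as text even if it looks binary.
matches=$(printf "$sample" | LC_ALL=C grep -a -c "$d.*$d.*$d.*$d")
echo "$matches"
```

Here the count should be 1, meaning the line is matched and would no longer show up in the -v output.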

You may test easily:

Create 2 files, one utf-8 encoded, one utf-16-be:

$ echo éléphant | tee file.std | iconv -f utf8 -t utf16be >file.utf16be

Check content of files:

$ cat file*
éléphant
�l�phant

Try to grep. The UTF-16BE string is not recognised, so there is no output for that file:

$ grep '^.*$' file*
file.std:éléphant

Now don't use any encoding at all, so that one byte is one character. All strings are matched; the � just means the terminal doesn't recognise the UTF-16BE sequence as a valid UTF-8 character. Note the use of -a to tell grep to treat binary files as if they were text.

$ env LANG=C grep -a '^.*$' file*
file.std:éléphant
file.utf16be:�l�phant

Alternatively, if you know the encoding, you can first convert the file with iconv and then use grep. If the file is UTF-16, one of the following should work.

iconv -f utf16 -t utf8 < file | grep ...
iconv -f utf16le -t utf8 < file | grep ...
iconv -f utf16be -t utf8 < file | grep ...
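The same approach applies to single-byte encodings such as ISO-8859-15, which is what the file in the question turned out to be. A small sketch with made-up sample bytes, showing what the conversion does to a non-ASCII character:

```shell
# \351 is "é" in ISO-8859-15; on its own it is not valid UTF-8.
converted=$(printf 'caf\351\n' | iconv -f ISO-8859-15 -t UTF-8)

# Dump the bytes as hex: the single byte e9 has become the UTF-8 pair c3 a9.
hex=$(printf '%s' "$converted" | od -An -tx1 | tr -d ' \n')
echo "$hex"
```

For the whole file this would presumably be iconv -f ISO-8859-15 -t UTF-8 < Transactions.csv | grep ..., with the grep arguments unchanged.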
