Proper encoding for file names in zip archives created in Windows and unpacked in Linux

I have problems with different charsets in Windows and Linux (CentOS).

I have files whose names contain special characters from many different languages. The zip archive is created under Windows 7 and uploaded to a Linux server. Under Windows all characters are displayed correctly. But after uploading and extracting, with either PHP's ZipArchive or the Linux unzip command, some special characters are replaced by wrong characters.

I know that this is a known problem in the interplay between Windows and Linux, but I have not been able to solve it. I've tried to unzip my zip file with different charsets, but nothing worked for me. In Portuguese, the character õ causes a lot of problems, but ç is okay.

After unzipping, aplicações.txt becomes aplicaçΣes.txt

As far as I understand, Windows uses the code page IBM860, but sometimes I read windows-1257, and I do not know which charset is used when the zip archive is made with WinRAR under Win7. Is there a way to check this, or to tell WinRAR to use UTF-8?

When the zip archive is uploaded to Linux and unzipped with ZipArchive (PHP) or with unzip in bash, the filenames are wrong. I think this is because Linux uses UTF-8.

On the Linux command line I tried:

unzip -O windows-1257 uploaded.zip -d zipout/
unzip -O IBM860 uploaded.zip -d zipout/
unzip -O IBM437 uploaded.zip -d zipout/
unzip -O UTF-8 uploaded.zip -d zipout/
unzip -O UTF-16 uploaded.zip -d zipout/

6 Answers

If the Windows 7 installation used for zipping the files is set to Brazilian Portuguese, then the encoding is probably IBM-850 or Windows-1252. Try these.
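One way to narrow down the code page without trial extraction is to take a garbled name and the name you expected, and see which candidate encoding maps one onto the other. The following is only a sketch: it assumes the extracting tool decoded the stored bytes as IBM437 (cp437), which is the traditional fallback for zip entries without a UTF-8 flag, and both the candidate list and the function name `guess_code_pages` are made up for illustration.

```python
# Candidate code pages mentioned in this thread; the list itself is an assumption.
CANDIDATES = ["cp850", "cp860", "cp1252", "cp1257"]


def guess_code_pages(garbled, expected, assumed="cp437"):
    """Return the candidates that turn `garbled` back into `expected`.

    `assumed` is the code page the unzip tool is presumed to have used
    when it decoded the raw name bytes.
    """
    raw = garbled.encode(assumed)  # recover the bytes stored in the archive
    matches = []
    for codec in CANDIDATES:
        try:
            if raw.decode(codec) == expected:
                matches.append(codec)
        except UnicodeDecodeError:
            pass  # these bytes are not valid in this code page
    return matches


print(guess_code_pages("aplicaçΣes.txt", "aplicações.txt"))
```

For the example from the question, the byte behind Σ in cp437 is 0xE4, which is õ in cp850, while ç sits at 0x87 in both tables. That would explain why only õ breaks and points to IBM-850 as the encoding used when the archive was created.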

I have this issue too. It also happens when transferring archives between Windows installations in different languages. For example, the English version uses IBM-437 and the pt-BR version uses IBM-850.

If you use WinZip for zipping, this issue does not happen. I do not recommend using the built-in Windows zip tool for zipping or extracting, as it also causes this encoding issue in filenames.

According to :

The latest ZIP format specification supports Unicode file names. Names must be encoded in UTF-8, and the 11th bit in the general purpose flags field (2 bytes at offset 6) must be set.

So if you upgrade your tools to versions supporting the newer ZIP format, things should work automatically.
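To check whether a given archive already uses the Unicode flag, you can inspect the general purpose flags of each entry. This is a minimal sketch using Python's standard zipfile module; the helper name `report_name_encoding` is hypothetical. Note that "bit 11" counts from zero, so the mask is 0x0800.

```python
import zipfile

UTF8_FLAG = 0x0800  # bit 11 of the general purpose bit flag, counting from bit 0


def report_name_encoding(archive):
    """Return a list of (filename, has_utf8_flag) pairs for every entry.

    `archive` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        return [
            (info.filename, bool(info.flag_bits & UTF8_FLAG))
            for info in zf.infolist()
        ]
```

For example, `report_name_encoding("uploaded.zip")` lists each name together with True or False. Entries reported as False were stored in some legacy code page, so their names still need an explicit encoding when extracting.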

I was able to fix it using:

saveLang=$LANG
export LANG=en_US
7z x file.zip
export LANG=$saveLang

Since you have found a workaround using LANG=en_US, you can probably also work around your issue by specifying the encoding with a command like: 7z x -mcp=437 file.zip

Instead of stashing the environment variable and then resetting it after, you can also temporarily set variables during just one command invocation by using env: env LANG=en_US 7z x file.zip
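If neither 7z nor a patched unzip is available on the server, the same idea (telling the extractor which legacy code page the names are in) can be sketched with Python's standard zipfile module. This is not what 7z does internally, just an illustration: the function names are hypothetical, cp850 is an assumed default taken from the pt-BR case discussed here, and the sketch relies on zipfile decoding unflagged names as cp437.

```python
import os
import zipfile

UTF8_FLAG = 0x0800  # set when the entry name is stored as UTF-8


def fixed_name(info, codepage="cp850"):
    """Return the entry name re-decoded from the assumed legacy code page."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already UTF-8, leave untouched
    # zipfile decoded the raw bytes as cp437; undo that, then decode properly.
    return info.filename.encode("cp437").decode(codepage)


def extract_with_codepage(archive, dest, codepage="cp850"):
    """Extract `archive` into `dest`, writing entries under corrected names."""
    written = []
    root = os.path.abspath(dest)
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            name = fixed_name(info, codepage)
            target = os.path.abspath(os.path.join(root, name))
            # Refuse entries that would land outside the destination directory.
            if not target.startswith(root + os.sep):
                raise ValueError("unsafe path in archive: " + name)
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with zf.open(info) as src, open(target, "wb") as out:
                out.write(src.read())
            written.append(name)
    return written
```

A call such as `extract_with_codepage("file.zip", "zipout", codepage="cp850")` returns the list of corrected names it wrote. Entries are read fully into memory, so for very large files a chunked copy would be a better fit.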

On Ubuntu you can use a patched p7zip instead of unzip to get proper OEM charset support.

sudo apt-add-repository ppa:alkisg/ppa
sudo apt-get update
sudo apt-get install p7zip p7zip-full

For other distros you can build the patched p7zip yourself. A patch for unzip is also available. Discussion:

This issue with zips has been fixed in the most recent far2l file and archive manager. For zip legacy charset detection by far2l to work properly, your system language setting should match the one set on the system where the archive was created (Windows' internal "zip folders" tool uses just the same logic).
