
unzip not correct with cjk filename. #45


Closed
dousee163 opened this issue Aug 15, 2019 · 26 comments

Comments

@dousee163

After unzipping, the file name is garbled:
fff - ╕▒▒╛.txt

The correct name is:
fff - 副本.txt

How can I resolve it?

@srikanth-lingala
Owner

Can you please attach the zip file?

@dousee163
Author

a.zip

@srikanth-lingala
Owner

The issue is with the tool that was used to create the zip file. For non-ascii file names, utf8 flag has to be set in zip file header, and also the filename has to be encoded with utf8 charset. This flag was not set in this case, and therefore zip4j uses the default charset. I tried extracting the zip file with another zip tool, and I see the same behaviour (random characters in file name).

Zip4j uses utf8 by default, so this zip file was (most likely) not created by Zip4j. If it was indeed created with zip4j, please post the code used to create the zip file. If not, I am afraid I cannot help you much, because the issue is with the tool that created the zip file.

@dousee163
Author

It was created by 7-Zip.
I also zipped the files with the Windows built-in compression, but the result is the same.

Does zip4j only support unzipping files that it zipped itself?

@srikanth-lingala
Owner

It is not a question of whether zip4j created the zip file or not. It is a question of whether the utf8 flag was set in the zip headers or not. Zip4j checks whether this flag is set. If it is, zip4j uses utf8; if not, it uses cp437. Please note that the file name also has to be encoded with utf8. This is according to the zip specification.

I tried extracting your zip file with the default compression tool on mac, The Unarchiver and also Keka, and all three of them had the same trouble extracting the zip file. I really wonder why 7Zip is not using utf8 flag. Do you use any custom charset when zipping files in 7Zip? Can you please check those settings? I would have presumed 7Zip uses utf8 when necessary by default.
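
The rule described in this comment (utf8 when the flag is set, cp437 otherwise) can be sketched in plain Java as below. This is only an illustration of the rule, not zip4j's actual code. The class and method names are hypothetical, the flag position (bit 11 of the general purpose bit flag) is taken from the zip specification, and the sketch assumes the JDK in use ships the IBM437 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of zip4j.
class ZipNameDecoder {
    // Bit 11 of the general purpose bit flag marks the name as UTF-8.
    static final int UTF8_FLAG = 1 << 11;

    // Decodes the raw file name bytes the way the zip specification prescribes.
    static String decodeName(byte[] rawName, int generalPurposeFlags) {
        Charset charset = (generalPurposeFlags & UTF8_FLAG) != 0
                ? StandardCharsets.UTF_8
                : Charset.forName("IBM437"); // Cp437
        return new String(rawName, charset);
    }
}
```

With the flag cleared, name bytes written in some other charset are still read as cp437, which is how box-drawing characters such as ╕▒▒╛ end up in the extracted name.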

@srikanth-lingala
Owner

Looks like the file name was encoded using the charset euc-kr. The zip specification does not allow this charset, so your zip file will be extracted incorrectly not just by zip4j but by any other zip tool, unless the tool uses the system default charset and your system default charset is euc-kr.

One way to work around this issue is with this code:

public void test() throws ZipException, UnsupportedEncodingException {
  ZipFile zipFile = new ZipFile("/Users/someuser/Downloads/a.zip");
  for (FileHeader fileHeader : zipFile.getFileHeaders()) {
    zipFile.extractFile(fileHeader, "/Users/someuser/Downloads/extract", getEucKrEncodedFileName(fileHeader.getFileName()));
  }
}

private String getEucKrEncodedFileName(String fileName) throws UnsupportedEncodingException {
  return new String(fileName.getBytes("Cp437"), "euc-kr");
}

Basically, the code above turns the name that zip4j decoded with its default charset (Cp437) back into the original bytes, and then decodes those bytes with the charset that was actually used in your case.

@dousee163
Author

Does it depend on the OS language?

@dousee163
Author

I cannot know the zip file's encoding in advance. It may be Chinese, Korean, Japanese, or English.

@LeeYoung624
Contributor

LeeYoung624 commented Aug 15, 2019

Hey guys.
I looked into this case and found something interesting:
Obviously the correct file name should be fff - 副本.txt. The characters 副本 in the name are Chinese and mean "a copy of the file" (this may have confused srikanth-lingala, who is not a native Chinese speaker :) ).
I inspected a.zip with a hex editor and found that 副本 is stored in the Local File Header as 0xB8 B1 B1 BE. These bytes are encoded with the charset GBK, which is the default charset for Chinese on most Windows systems.

image
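
The hex inspection described above can also be done in code. The following is a minimal sketch that reads the flag and the raw name bytes from the first Local File Header, using the field offsets from the zip specification (signature at 0, general purpose bit flag at 6, file name length at 26, name starting at 30). The class is hypothetical, and it assumes the archive starts directly with a Local File Header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper for inspecting the first Local File Header of a zip file.
class LocalHeaderPeek {
    static final int LOCAL_HEADER_SIGNATURE = 0x04034b50; // bytes "PK\3\4", little-endian
    static final int FIXED_HEADER_SIZE = 30;
    static final int UTF8_FLAG = 1 << 11;

    static final class Header {
        final boolean utf8Flag;
        final byte[] rawName;

        Header(boolean utf8Flag, byte[] rawName) {
            this.utf8Flag = utf8Flag;
            this.rawName = rawName;
        }
    }

    static Header read(byte[] zipBytes) {
        ByteBuffer buf = ByteBuffer.wrap(zipBytes).order(ByteOrder.LITTLE_ENDIAN);
        if (zipBytes.length < FIXED_HEADER_SIZE || buf.getInt(0) != LOCAL_HEADER_SIGNATURE) {
            throw new IllegalArgumentException("Data does not start with a Local File Header");
        }
        int flags = buf.getShort(6) & 0xFFFF;       // general purpose bit flag
        int nameLength = buf.getShort(26) & 0xFFFF; // file name length
        if (FIXED_HEADER_SIZE + nameLength > zipBytes.length) {
            throw new IllegalArgumentException("Truncated file name");
        }
        byte[] rawName = Arrays.copyOfRange(zipBytes, FIXED_HEADER_SIZE, FIXED_HEADER_SIZE + nameLength);
        return new Header((flags & UTF8_FLAG) != 0, rawName);
    }
}
```

If `utf8Flag` is false and `rawName` contains bytes above 0x7F, the name was written in some legacy charset that the archive itself does not record.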

But when I tried to reproduce this problem with srikanth-lingala's code, zip4j seemed to return a wrong file name:

image

It seems the file name is parsed incorrectly. I think there may be a bug here, but I haven't dug into it yet. Give me some time and I will dig deeper into this.

@srikanth-lingala
Owner

As I have mentioned above, Zip specification only allows for cp437 or utf8 charsets. Any other charsets that are used by zip tools will not be zip specification compliant and may not be supported by other zip tools. This is not a bug in zip4j, but zip4j just sticks with the zip specification.

If you have custom charsets, you can use the above code and just replace the charset used with your custom charset.

@LeeYoung624
Contributor

LeeYoung624 commented Aug 15, 2019

I just tested and found that you are right, @srikanth-lingala. Sorry, I didn't notice that the file name had already been decoded as Cp437.
@dousee163 You can get the correct file name like this:

public void test() throws ZipException, UnsupportedEncodingException {
  ZipFile zipFile = new ZipFile("/Users/someuser/Downloads/a.zip");
  for (FileHeader fileHeader : zipFile.getFileHeaders()) {
    zipFile.extractFile(fileHeader, "/Users/someuser/Downloads/extract", getGbkEncodedFileName(fileHeader.getFileName()));
  }
}

private String getGbkEncodedFileName(String fileName) throws UnsupportedEncodingException {
  return new String(fileName.getBytes("Cp437"), "GBK");
}

@dousee163
Author

But I zipped the files with the Windows built-in compression.
Is that incorrect?
[screenshot: Untitled]

Or can you tell me which zip tool I should use?
The zip file may also have been created on a Japanese OS, or on an OS in another language.
The code above only solves the problem for a Chinese OS.

@dousee163
Author

[screenshot: Untitled 2]
And this is a 7-Zip screenshot; there is no encoding setting.

@LeeYoung624
Contributor

I think srikanth-lingala is right: we should follow the Zip specification. Those are the only official rules.
Maybe you can try this:

public void test() throws ZipException, UnsupportedEncodingException {
  ZipFile zipFile = new ZipFile("/Users/someuser/Downloads/a.zip");
  for (FileHeader fileHeader : zipFile.getFileHeaders()) {
    zipFile.extractFile(fileHeader, "/Users/someuser/Downloads/extract", getOsDefaultEncodedFileName(fileHeader.getFileName()));
  }
}

private String getOsDefaultEncodedFileName(String fileName) throws UnsupportedEncodingException {
  return new String(fileName.getBytes("Cp437"), System.getProperty("sun.jnu.encoding"));
}

I tested this on Windows and it works. But I can not guarantee it works on other operating systems or other languages. Maybe you can have a try and let me know @dousee163 .
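
One caveat with the snippet above: as far as I know, `sun.jnu.encoding` is an internal, undocumented system property of OpenJDK-based JVMs, so it may be missing elsewhere. A slightly more defensive lookup could look like the sketch below; the class name is hypothetical, and falling back to `Charset.defaultCharset()` is my own assumption about a reasonable default.

```java
import java.nio.charset.Charset;

// Hypothetical helper: resolves the charset the JVM reports for native file names.
class NativeNameCharset {
    static Charset resolve() {
        String name = System.getProperty("sun.jnu.encoding"); // may be null on some JVMs
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalArgumentException e) {
            // The property held a syntactically illegal charset name; use the fallback.
        }
        return Charset.defaultCharset();
    }
}
```

Either way, this only helps when the archive was created on a machine with the same locale as the one extracting it.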

@azhao-2019

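@dousee163 I think what you need is to first determine the encoding used inside the zip file and then set the charset accordingly; that way you get the correct names. The example they gave above corresponds to our Chinese encoding.

@srikanth-lingala Can we use the zip4j API to detect the encoding of a zip file?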
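
The zip file itself does not record which legacy charset was used, so detection can only be a heuristic. One possible sketch, using only the JDK and not any zip4j API, is to try a list of candidate charsets with strict decoding and keep those that accept the raw name bytes. The class name and the candidate list are hypothetical; with zip4j, the raw bytes could be recovered with `fileName.getBytes("Cp437")` as in the snippets above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: narrows down which charsets could have produced the name bytes.
class CharsetCandidates {
    static List<String> plausibleCharsets(byte[] rawName, List<String> candidates) {
        List<String> result = new ArrayList<>();
        for (String name : candidates) {
            try {
                // REPORT makes the decoder throw instead of silently inserting U+FFFD.
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(rawName));
                result.add(name);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; skip it.
            }
        }
        return result;
    }
}
```

For the bytes B8 B1 B1 BE from this issue, UTF-8 is rejected while GBK is accepted. Other double-byte charsets may accept the same bytes as well, so the result is a shortlist rather than an answer, and picking among the survivors still needs knowledge of where the archive came from.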

@dousee163
Author

@LeeYoung624
I tested your code.
If the archive contains a folder with a Chinese name, the result seems incorrect.
1
2

@azhao-2019
I think so too. It would be great if the file's encoding could be detected and handled internally.

@dousee163
Author

Zip file with a Chinese folder:
a.zip

@dousee163
Author

3

If I change HeaderUtil.java as shown above, it returns the correct result.

@srikanth-lingala
Owner

Read here for more information about this issue on the 7-Zip side. I am really surprised that 7-Zip does not use utf8 by default. I cannot change zip4j to use any other charset, because that would not comply with the zip specification.

A workaround is to use "cu" (without quotes) as a parameter in 7zip settings. Try and see if that works. The text box where this has to be entered is highlighted below:

7zip

@dousee163
Author

@srikanth-lingala
7zip
It worked with the parameter cu.

But it didn't work with the Windows built-in compression.

@LeeYoung624
Contributor

LeeYoung624 commented Aug 15, 2019

Quoted comment, translated from Chinese: @dousee163 I think what you need is to first determine the encoding used inside the zip file and then set the charset accordingly; that way you get the correct names. The example they gave above corresponds to our Chinese encoding.

@dousee163 @azhao-2019 Well, I think you may not be familiar with the ZIP File Format Specification. You can check out this chapter:

APPENDIX D - Language Encoding (EFS)

As srikanth-lingala has already said, the zip format supports only two charsets: IBM Code Page 437 (Cp437) and UTF-8. The choice is recorded in the zip file as bit 11 of the general purpose bit flag. In your case, 7-Zip left the archive marked as Cp437, while the default of zip4j is UTF-8. It seems that 7-Zip performs its own charset handling:

When you create zip archives under windows, 7-zip (like other zip programs) tries to use your current locale page to store the filename and if some characters in a filename cannot be encoded in your current locale page, then 7-zip stores the utf8 filename.

This is 7-Zip's problem, not zip4j's. 7-Zip performs additional operations that the ZIP File Format Specification does not require.
I think srikanth-lingala is right. There seems to be no other solution if you want to use the default settings in 7-Zip, because the charset that was actually used cannot be recorded in the zip file.
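
The 7-Zip behaviour quoted in this comment (use the locale code page if it can represent the name, otherwise store UTF-8) can be illustrated with the sketch below. It only models the decision described in the quote and is not taken from 7-Zip's source. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical model of the "locale code page first, UTF-8 as fallback" decision.
class NameEncodingChoice {
    // Returns true when the local code page cannot represent every character of the name.
    static boolean needsUtf8(String fileName, Charset localCodePage) {
        return !localCodePage.newEncoder().canEncode(fileName);
    }
}
```

On a Chinese Windows system the local code page can represent 副本, so under this model the name is stored in that code page with the UTF-8 flag unset, which matches what was found in a.zip.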

@dousee163
Author

thank you

@Erich-Chen

Erich-Chen commented Oct 29, 2020

I usually try the following command line parameter:
<path_to_7z_directory>\7z.exe -mcp=936 x <filename>.zip

For example:
"C:\Program Files\7-Zip\7z.exe" -mcp=936 x "C:\Users\Erich\Downloads\a.zip"

(Refer to the discussion on: https://sourceforge.net/p/sevenzip/bugs/2198/)

The characters "副本" appear correctly in the extracted file's name.

Regarding the encoding of the text inside the file, try Notepad++ and choose "Encoding > Character sets > Chinese > GB2312".

BTW, another tool, BandZip, handles encoding issues in zipped file names better.
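
A rough Java counterpart to passing a code page on the command line is to tell the reader which charset to use for names that are not flagged as UTF-8. The sketch below uses `java.util.zip.ZipInputStream` from the JDK rather than zip4j. The class name is hypothetical, and treating Java's GBK charset as the equivalent of Windows code page 936 is an approximation.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists entry names, decoding them with the given charset.
class CharsetAwareListing {
    static List<String> listNames(byte[] zipBytes, Charset nameCharset) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes), nameCharset)) {
            for (ZipEntry entry = in.getNextEntry(); entry != null; entry = in.getNextEntry()) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }
}
```

A call could look like `CharsetAwareListing.listNames(Files.readAllBytes(Paths.get("a.zip")), Charset.forName("GBK"))`. As with the command line switch, the caller still has to know or guess the right charset.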

@iamqiz

iamqiz commented Feb 12, 2025

@Erich-Chen Buddy, where did you find the -mcp=936 option? It really works, thanks.

@Erich-Chen

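Glad to see it works for you. I have been using it this way all along and honestly cannot remember where I first found it. For a relatively early discussion on the internet, see: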

https://sourceforge.net/p/sevenzip/bugs/2198/

@iamqiz

iamqiz commented Mar 18, 2025

@Erich-Chen 😃 I later found it in the documentation: -mcp is actually cp (code page), a sub-option of the -m switch for zip archives, as shown in the screenshot below.
Image
