Discussion:
[BackupPC-users] BackupPC_zipCreate and charset for encoding file names
Alexander Moisseev
2007-12-10 09:53:03 UTC
#------------------------------------------------------------------------
# Version 3.1.0beta0, 3 Sep 2007
#------------------------------------------------------------------------
* Made the default charset for BackupPC_zipCreate cp1252, which
appears to work correctly with WinZip.
I don't think it was a good idea to hard-code the charset. In some countries people use national charsets; in Russia, for example, we use cp1251. I believe $Conf{ClientCharset} is more appropriate (as it was in version 3.0.0).

Additionally, I found no way to set cp1251 when restoring through the CGI interface.

Alexander
Craig Barratt
2007-12-21 10:37:41 UTC
Post by Alexander Moisseev
#------------------------------------------------------------------------
# Version 3.1.0beta0, 3 Sep 2007
#------------------------------------------------------------------------
* Made the default charset for BackupPC_zipCreate cp1252, which
appears to work correctly with WinZip.
I don't think it was a good idea to hard-code the charset. In some
countries people use national charsets; in Russia, for example, we use
cp1251. I believe $Conf{ClientCharset} is more appropriate (as it was
in version 3.0.0).
The problem is twofold:

- $Conf{ClientCharset} represents what the XferMethod delivers, not
necessarily what is actually on the client. For example, smb
delivers utf8 encoding by default, so $Conf{ClientCharset}
is set to utf8.

- I don't know how charsets work with WinZip format files. Is the
encoding included in the file? Is it just interpreted using the
local charset when you extract? I don't know. Can anyone help
here?
Post by Alexander Moisseev
Additionally, I found no way to set cp1251 when restoring through the CGI interface.
Yes, unfortunately the charset can't be set via CGI.
You can, however, do it from the command line.

Craig
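
On the question of whether a name is "just interpreted using the local charset": the short Python sketch below is an illustration only, not BackupPC code. The file name and the pairing of code pages are assumptions chosen for the example. It shows how the same stored bytes read differently depending on which code page the reader assumes.

```python
# Illustration only: a file name is stored as raw bytes, and a reader that
# assumes a different code page shows different characters for the same bytes.
name = "файл.txt"                      # hypothetical Cyrillic file name

stored = name.encode("cp1251")         # bytes as a cp1251 writer would store them

as_cp1251 = stored.decode("cp1251")    # reader assumes the same code page
as_cp1252 = stored.decode("cp1252")    # reader assumes Western European cp1252

print(as_cp1251)   # файл.txt
print(as_cp1252)   # ôàéë.txt
```

If nothing in the archive says which code page was used, the reader can only guess, which would explain why one fixed default cannot suit every locale.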
Fernando Laudares Camargos
2008-10-16 14:50:39 UTC
Hello,

I'm replying to this old thread because I believe my question/comment fits here better than in any other thread on this subject.

I'm using BackupPC to back up, among others, a Samba file server that uses the ISO8859-1 charset. To get the characters displayed correctly on the command line I have to set BackupPC's ClientCharset to ISO-8859-1. That's fine. But when I try to download a file to an Ubuntu/Windows desktop I get encoding problems.

Using the option "-e UTF8" with BackupPC_zipCreate on the command line works. But to get it working from the CGI interface I had to hard-code it in lib/BackupPC/CGI/Restore.pm:
------------------------------------------
$bpc->cmdSystemOrEvalLong(["$BinDir/BackupPC_zipCreate",
        "-h", $host,
        "-n", $num,
        "-c", $In{compressLevel},
        "-s", $share,
        @pathOpts,
        @fileList,      # added the trailing comma
        "-e UTF8"       # added this line
    ],
    sub { print(@_); },
-------------------------------------------

I'm sure this is not the best way to do it, but I haven't had success with anything else I tried. Does anybody know a better way to do this?

Thank you,
-- 
Fernando Laudares Camargos

Révolution Linux
http://www.revolutionlinux.com
---------------------------------------
** Any views and opinion presented in this e-mail are solely those of
the author and do not necessarily represent those of Révolution Linux.
Alexander Moisseev
2008-10-17 05:55:50 UTC
As a workaround, I simply use BackupPC_zipCreate from 3.0.0 with BackupPC 3.1.0.
It uses $Conf{ClientCharset} and works with the CGI interface.

But the problem goes deeper.
Post by Craig Barratt
The problem I have is I can't find any documentation
for zip files and any standards around charset encoding for the
file names in a zip file. Is utf8 the correct default, or does
it depend on which platform is trying to unpack the zip file?
I tried browsing the archives with different archivers. Some archivers display national characters correctly when the zip was created by BackupPC_zipCreate with one encoding, while other archivers need a different one.

I have never seen that behavior with conventional zip files, though.

Alexander
Craig Barratt
2008-10-17 00:59:21 UTC
Post by Fernando Laudares Camargos
I'm sure this is not the best way to do it but I haven't had success
with any other try I did. Does anybody know a better way to do that?
I added the command-line argument for charset but didn't implement
a CGI setting. The problem I have is I can't find any documentation
for zip files and any standards around charset encoding for the
file names in a zip file. Is utf8 the correct default, or does
it depend on which platform is trying to unpack the zip file?

Craig
Alexander Moisseev
2008-10-21 08:37:51 UTC
The only documentation I can find is the Application Note on the .ZIP file format from PKWARE:
http://www.pkware.com/documents/casestudies/APPNOTE.TXT

What I learned:
1. The ZIP format officially supports only ISO8859-1 file names and does not include any information about the encoding at all.
2. In practice, most Windows archivers use the OEM encoding.
3. But some (e.g. IZarc, Info-Zip, Wiz) leave file names as they are, without recoding.
4. I'm not sure, but Unix/Linux archivers seem to use the current locale's encoding.
5. UTF-8 file name storage appeared in version 6.3.2 of APPNOTE, about a year ago.

I haven't succeeded with UTF-8 archives on Windows yet. Maybe somebody knows which Windows archivers currently support UTF-8?

For 2., the encoding must be OEM (e.g. CP866 for Russian).
For 3., the encoding must be $Conf{ClientCharset} (e.g. CP1251 for Russian).
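
To make the difference between cases 2 and 3 concrete, here is a small Python sketch. It is an illustration only: `encode_name` and its parameters are hypothetical names made up for this example, and the code pages are the Russian ones mentioned above.

```python
def encode_name(name: str, archiver_recodes_oem: bool,
                oem: str = "cp866", client: str = "cp1251") -> bytes:
    """Pick the byte encoding for a stored file name (hypothetical helper)."""
    # Case 2: the archiver treats stored names as OEM, so write OEM bytes.
    # Case 3: the archiver leaves names untouched, so write the client charset.
    return name.encode(oem if archiver_recodes_oem else client)


name = "файл.txt"                                          # hypothetical file name
for_type2 = encode_name(name, archiver_recodes_oem=True)   # OEM (cp866) bytes
for_type3 = encode_name(name, archiver_recodes_oem=False)  # cp1251 bytes

print(for_type2.hex(" "))
print(for_type3.hex(" "))
print(for_type2 != for_type3)   # the two archiver types need different bytes
```

Because the two byte sequences differ, a single archive cannot satisfy both kinds of archiver by choice of encoding alone.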

Quote from APPNOTE:
The upper byte indicates the compatibility of the file
attribute information. If the external file attributes
are compatible with MS-DOS and can be read by PKZIP for
DOS version 2.04g then this value will be zero. If these
attributes are not compatible, then this value will
identify the host system on which the attributes are
compatible. Software can use this information to determine
the line record format for text files etc.

So if we use a "Windows" encoding for file names, we must also set MS-DOS compatibility.

I did some experimenting.
If, in the central directory structure of the zip archive (central file header signature 0x02014b50), I change the upper byte of the "version made by" field from 3 (UNIX) to 0 (MS-DOS), and the archive was created with the OEM encoding (CP866 for Russian), then both type 2 and type 3 archivers display the file names perfectly well.

Craig, is it possible to set the "version made by" field in BackupPC_zipCreate?

Alexander
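
The experiment described above can be sketched in Python roughly as follows. This is a minimal illustration, not the tool actually used and not BackupPC code: `set_host_system` is a hypothetical helper, it assumes a single-disk archive without ZIP64 records, and the test archive uses an ASCII name only to keep the example small.

```python
import io
import struct
import zipfile

CENTRAL_SIG = b"PK\x01\x02"   # central file header signature 0x02014b50
EOCD_SIG = b"PK\x05\x06"      # end of central directory signature


def set_host_system(data: bytes, host: int = 0) -> bytes:
    """Return a copy of the archive with the upper byte of every
    "version made by" field set to `host` (0 = MS-DOS, 3 = UNIX)."""
    buf = bytearray(data)
    eocd = buf.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("end of central directory record not found")
    # Total entry count, central directory size, central directory offset.
    total, _cd_size, pos = struct.unpack_from("<HLL", buf, eocd + 10)
    for _ in range(total):
        if buf[pos:pos + 4] != CENTRAL_SIG:
            raise ValueError("bad central file header")
        buf[pos + 5] = host   # byte 4 is the spec version, byte 5 the host system
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", buf, pos + 28)
        pos += 46 + name_len + extra_len + comment_len
    return bytes(buf)


# Build a small archive marked as created on UNIX, then patch it.
src = io.BytesIO()
with zipfile.ZipFile(src, "w") as zf:
    info = zipfile.ZipInfo("readme.txt")
    info.create_system = 3                      # UNIX
    zf.writestr(info, b"hello")

patched = set_host_system(src.getvalue(), host=0)
with zipfile.ZipFile(io.BytesIO(patched)) as zf:
    print(zf.infolist()[0].create_system)       # 0
```

Only the central directory is touched; the local file headers and the file data stay byte-for-byte the same.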
Alexander Moisseev
2008-11-18 08:46:49 UTC
My conclusions from the PKWARE Application Note:

1. The encoding must be the OEM code page of the terminal where the zip will be viewed or unpacked.
2. The "version made by" field must be set to match the OS where the zip will be viewed or unpacked.
3. If file names are encoded in UTF-8 (the Linux terminal code page), general purpose bit 11 (UTF-8) must be set for cross-platform compatibility. Recent versions of some Windows archivers support it.

I have no idea yet what the command-line options and defaults should be, or how to integrate this with the CGI.
But with a hardcoded charset and "version made by", BackupPC_zipCreate works perfectly well with most Windows archivers.

BackupPC_zipCreate:
8<--------------------------
#=> Set OEM code page for unpacking terminal
my $Charset = "cp866"; # Cyrillic code page. Change to your Windows terminal code page.
8<--------------------------
    # Specify the compression level for this member
    $zipmember->desiredCompressionLevel($compLevel) if ($compLevel =~ /[0-9]/);

    #=> Set "version made by" field to 0 (MS-DOS)
    $zipmember->fileAttributeFormat('FA_MSDOS');

    #=> Set general purpose bit 11 for UTF-8
    $zipmember->{bitFlag} |= 0x0800 if ( $Charset eq "" );

    # Finally Zip the member
    $zipfh->addMember($zipmember);
}
8<--------------------------

P.S. Appendix D of the Application Note on the .ZIP file format also defines an optional "Extra Field" for storing UTF-8 file names, but I found no archivers that actually support it.
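
For comparison, general purpose bit 11 can be seen in archives written by Python's standard `zipfile` module. The sketch below is an illustration only: it shows that module's behaviour, not Archive::Zip's or BackupPC's, and the file names are made up for the example.

```python
import io
import zipfile

UTF8_FLAG = 0x0800   # general purpose bit 11: names are stored as UTF-8

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"ascii name")       # pure ASCII name
    zf.writestr("файл.txt", b"non-ascii name")    # name needs UTF-8

raw = buf.getvalue()
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    # Read the flag back from the central directory for each member.
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in zf.infolist()}

print(flags)
```

Here the module sets the bit only for the member whose name is not pure ASCII, which matches conclusion 3: the bit tells the unpacker that the name bytes are UTF-8 rather than an OEM code page.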