perldoc -f open example for utf encoding may be incorrect #5779

p5pRT · 2002-07-25T05:28:08Z

Migrated from rt.perl.org#15533 (status was 'resolved')

Searchable as RT15533$

p5pRT · 2002-07-25T05:28:09Z

From dcd@tc.fluke.com

Created by dcd@tc.fluke.com

perldoc -f open shows an example

For example
open(FH, "<:utf8", "file")

will open the UTF-8 encoded file containing
Unicode characters, see perluniintro.

but to read a Unicode file generated on Windows it appears that
the arg should be "<:encoding(utf16)", as hinted at in binmode
example in perlopentut.pod
binmode($fh, ":encoding(utf16)");

Perl Info


Flags:
    category=docs
    severity=medium

Site configuration information for perl v5.8.0:

Configured by dcd at Tue Jul 23 15:17:41 PDT 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0 patch 17654) configuration:
  Platform:
    osname=linux, osvers=2.4.19-rc3-ac3, archname=i686-linux
    uname='linux dd 2.4.19-rc3-ac3 #1 tue jul 23 09:02:05 pdt 2002 i686 '
    config_args='-Dmksymlinks -Dinstallusrbinperl -Uversiononly -Dusedevel -Doptimize=-O3 -g -de -Dcf_email=dcd@tc.fluke.com'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3 -g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='egcs-2.91.66.1 19990314/Linux (egcs-1.1.2 release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=4
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -ldbm -ldb -ldl -lm -lc
    perllibs=-ldl -lm -lc
    libc=/lib/libc.so.5.4.44, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    DEVEL17641


@INC for perl v5.8.0:
    /usr/local/lib/perl5/5.8.0/i686-linux
    /usr/local/lib/perl5/5.8.0
    /usr/local/lib/perl5/site_perl/5.8.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.7.3
    /usr/local/lib/perl5/site_perl
    .


Environment for perl v5.8.0:
    HOME=/home/dcd
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/dcd/bin:/sbin:/usr/local/bin:/bin:/usr/bin:/usr/X11/bin:/usr/games:/usr/local/samba:/home/hobbes/tools/scripts:/home/hobbes/tools/linux:/usr0/hobbes/tools/scripts:/usr0/dcd/bin:/apps/general/bin:/usr/public
    PERL_BADLANG (unset)
    SHELL=/bin/bash

p5pRT · 2002-11-27T04:35:03Z

From @jhi

I do not quite understand that "for a Unicode file generated in Windows" the sample should be changed to be "utf-16". Windows can generate utf-8 just fine.

p5pRT · 2002-11-27T18:13:33Z

From david.dyck@fluke.com

On 27 Nov 2002 at 04:35 -0000, Jarkko Hietaniemi <perlbug@perl.org> wrote:

From: Jarkko Hietaniemi <perlbug@perl.org>
To: dcd
Date: 27 Nov 2002 04:35:03 -0000
Subject: [perl #15533] perldoc -f open example for utf encoding may be
incorrect

I do not quite understand that "for a Unicode file generated in Windows"
the sample should be changed to be "utf-16". Windows can generate utf-8
just fine.

Thanks for your question, and as I'm still learning a bit about unicode,
please bear with my thinking.

I created a text file consisting of the word "test" in windows NT using
the supplied accessory notepad.exe, and in the save as dialog box I
selected the option to "save as unicode". (There was no choice as to
utf-8 or utf-16, and I don't know what format other windows tools save
files in.)

A hex dump of the file would tend to indicate that the file was utf-16,
both for the BOM and the nulls, right?

$ hd -x /usr0/dcd/test-unicode.txt
00000000 FF FE 74 00 65 00 73 00 74 00 |..t.e.s.t.|

http://www.unicode.org/unicode/faq/utf_bom.html#25
Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text
is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising
the BOM will be whatever the Unicode character FEFF is converted into by
that transformation format. In that form, the BOM serves to indicate both
that it is a Unicode file, and which of the formats it is in. Examples:

Bytes         Encoding Form
00 00 FE FF   UTF-32, big-endian
FF FE 00 00   UTF-32, little-endian
FE FF         UTF-16, big-endian
FF FE         UTF-16, little-endian
EF BB BF      UTF-8
The above leads me to think that the default Unicode text format on
Windows is UTF-16 (at least for Unicode files created by notepad.exe).
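The BOM lookup described in the FAQ excerpt can be sketched in a few lines of Perl. This is only an illustration, not anything shipped with perl: the helper name guess_encoding_from_bom and the @BOMS table are hypothetical, and the sample bytes are copied from the hex dump above. Longer signatures are checked first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```perl
use strict;
use warnings;

# Hypothetical lookup table (not part of core Perl): BOM bytes => encoding name.
# Longer signatures come first, because FF FE 00 00 (UTF-32LE) starts with
# FF FE (UTF-16LE) and would otherwise be misreported.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: return the encoding name suggested by a leading BOM,
# or undef when the bytes do not start with a recognizable BOM.
sub guess_encoding_from_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return $name if substr($bytes, 0, length $bom) eq $bom;
    }
    return;
}

# The ten bytes from the hex dump: FF FE 74 00 65 00 73 00 74 00
my $sample = "\xFF\xFE" . "t\x00e\x00s\x00t\x00";
my $guess  = guess_encoding_from_bom($sample);
print((defined $guess ? $guess : 'unknown'), "\n");
```

For the Notepad file this reports UTF-16LE, which matches the reading of the hex dump given above. A BOM is only a hint: a file without one yields undef here and needs some other way of deciding its encoding.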

Upon re-reading the "perldoc -f open" section that I questioned

... For example
open(FH, "<:utf8", "file")
will open the UTF-8 encoded file containing
Unicode characters, see perluniintro. ....

I would have to agree that the statement was not literally
incorrect, but it did mislead me at first when I tried
to write code on Windows to read a 'unicode' text file
created by Notepad. If one changes the example
open command to open a UTF-16 file, e.g. open(FH, "<:utf16", "file"),
one gets an error that perl can't locate PerlIO/utf16.pm in @INC.

perl -e 'open(FH, "<:utf16", "/usr0/dcd/test-unicode.txt") || die "open failed:$!"'
Can't locate PerlIO/utf16.pm in @INC (@INC contains:
/usr/local/lib/perl5/5.9.0/i686-linux /usr/local/lib/perl5/5.9.0
/usr/local/lib/perl5/site_perl/5.9.0/i686-linux
/usr/local/lib/perl5/site_perl/5.9.0
/usr/local/lib/perl5/site_perl/5.8.0/i686-linux
/usr/local/lib/perl5/site_perl/5.8.0 /usr/local/lib/perl5/site_perl/5.7.3
/usr/local/lib/perl5/site_perl .) at (eval 1) line 3.
open failed:No such file or directory.

Perhaps I should submit a bug that reports that utf16 files are not
supported in the three argument open form, and after that is resolved,
then submit a document enhancement request to show how to open a windows
unicode .txt file.

p5pRT · 2002-12-03T02:17:33Z

From @jhi

Okay, now I understand a little bit better. Still, you really should get rid of the "Windows Unicode" meme :-) There's only one Unicode, which has various different encodings. Perl prefers UTF-8, Windows prefers (little-endian) UTF-16. In the three argument open the ":utf8" is currently a special case. The general case is ":encoding(foobar)", so it would be ":encoding(utf16)"-- but I have to admit that this seems to have a bug currently: one gets strange "UTF-16:Partial character" warnings that I think shouldn't happen. I'll ask the guy working on the encoding bits. I guess we could make also ":utf16" (and ":utf32", I guess) another special case since UTF-16 is prevalent enough.
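As a minimal sketch of the general ":encoding(...)" form in a three-argument open, the following script writes the same ten bytes as the Notepad file and reads them back. It assumes a perl built with PerlIO and the Encode module available; the temporary file (created with File::Temp) and the variable names are illustrative only and stand in for the reporter's real file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Recreate the Notepad file: BOM FF FE, then "test" as little-endian UTF-16.
# The temporary file is a stand-in for the real file and is removed at exit.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                                   # write raw bytes, no layers
print {$out} "\xFF\xFE", "t\x00e\x00s\x00t\x00";
close $out or die "close failed: $!";

# ":encoding(UTF-16)" uses the BOM to pick the byte order and drops it
# from the decoded text. ":utf8" alone would not decode this file.
open(my $fh, '<:encoding(UTF-16)', $path) or die "open failed: $!";
my $line = <$fh>;
close $fh;

print length($line), " characters: $line\n";
```

When the byte order is known in advance and the file carries no BOM, Encode also accepts the explicit names UTF-16LE and UTF-16BE in the same position.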

p5pRT · 2002-12-03T20:03:32Z

From david.dyck@fluke.com

On 3 Dec 2002 at 02:17 -0000, Jarkko Hietaniemi <perlbug@perl.org> wrote:

... The general case is ":encoding(foobar)",
so it would be ":encoding(utf16)"-- but I have to admit that this seems
to have a bug currently: one gets strange "UTF-16:Partial character"
warnings that I think shouldn't happen.

I didn't get any warnings from the following code
when using Encode-1.83

perl -we 'open(FH, "<:encoding(utf16)", "/usr0/dcd/test-unicode.txt")
|| die "open failed:$!";
while (<FH>) { print "$_\n" }'

I guess we could make also ":utf16" (and ":utf32",
I guess) another special case since UTF-16 is prevalent enough.

That would be nice, but next time I should just read the
perluniintro pod as the fine manual suggested that I do.

p5pRT · 2002-12-04T01:59:32Z

From @jhi

Okay... I think you got me convinced that things work as they should :-)
I'm marking the problem ticket as resolved.

p5pRT · 2002-12-04T01:59:33Z

@jhi - Status changed from 'new' to 'resolved'

p5pRT closed this as completed Dec 4, 2002

p5pRT added Severity Medium distro-Linux documentation labels Oct 18, 2019
