perldoc -f open example for utf encoding may be incorrect #5779

Closed
p5pRT opened this issue Jul 25, 2002 · 7 comments

Comments

@p5pRT

p5pRT commented Jul 25, 2002

Migrated from rt.perl.org#15533 (status was 'resolved')

Searchable as RT15533$

@p5pRT
Author

p5pRT commented Jul 25, 2002

From dcd@tc.fluke.com

Created by dcd@tc.fluke.com

perldoc -f open shows an example

  For example
  open(FH, "<:utf8", "file")

  will open the UTF-8 encoded file containing
  Unicode characters, see perluniintro.

but to read a Unicode file generated on Windows it appears that
the arg should be "<:encoding(utf16)", as hinted at in the binmode
example in perlopentut.pod
  binmode($fh, ":encoding(utf16)");

Perl Info

Flags:
    category=docs
    severity=medium

Site configuration information for perl v5.8.0:

Configured by dcd at Tue Jul 23 15:17:41 PDT 2002.

Summary of my perl5 (revision 5.0 version 8 subversion 0 patch 17654) configuration:
  Platform:
    osname=linux, osvers=2.4.19-rc3-ac3, archname=i686-linux
    uname='linux dd 2.4.19-rc3-ac3 #1 tue jul 23 09:02:05 pdt 2002 i686 '
    config_args='-Dmksymlinks -Dinstallusrbinperl -Uversiononly -Dusedevel -Doptimize=-O3 -g -de -Dcf_email=dcd@tc.fluke.com'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64',
    optimize='-O3 -g',
    cppflags='-DDEBUGGING -fno-strict-aliasing -I/usr/local/include'
    ccversion='', gccversion='egcs-2.91.66.1 19990314/Linux (egcs-1.1.2 release)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=4
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lgdbm -ldbm -ldb -ldl -lm -lc
    perllibs=-ldl -lm -lc
    libc=/lib/libc.so.5.4.44, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'

Locally applied patches:
    DEVEL17641


@INC for perl v5.8.0:
    /usr/local/lib/perl5/5.8.0/i686-linux
    /usr/local/lib/perl5/5.8.0
    /usr/local/lib/perl5/site_perl/5.8.0/i686-linux
    /usr/local/lib/perl5/site_perl/5.8.0
    /usr/local/lib/perl5/site_perl/5.7.3
    /usr/local/lib/perl5/site_perl
    .


Environment for perl v5.8.0:
    HOME=/home/dcd
    LANG (unset)
    LANGUAGE (unset)
    LD_LIBRARY_PATH (unset)
    LOGDIR (unset)
    PATH=/home/dcd/bin:/sbin:/usr/local/bin:/bin:/usr/bin:/usr/X11/bin:/usr/games:/usr/local/samba:/home/hobbes/tools/scripts:/home/hobbes/tools/linux:/usr0/hobbes/tools/scripts:/usr0/dcd/bin:/apps/general/bin:/usr/public
    PERL_BADLANG (unset)
    SHELL=/bin/bash

@p5pRT
Author

p5pRT commented Nov 27, 2002

From @jhi

I do not quite understand that "for a Unicode file generated in Windows" the sample should be changed to be "utf-16". Windows can generate utf-8 just fine.

@p5pRT
Author

p5pRT commented Nov 27, 2002

From david.dyck@fluke.com

On 27 Nov 2002 at 04:35 -0000, Jarkko Hietaniemi <perlbug@perl.org> wrote:

From: Jarkko Hietaniemi <perlbug@perl.org>
To: dcd
Date: 27 Nov 2002 04:35:03 -0000
Subject: [perl #15533] perldoc -f open example for utf encoding may be
incorrect

I do not quite understand that "for a Unicode file generated in Windows"
the sample should be changed to be "utf-16". Windows can generate utf-8
just fine.

Thanks for your question, and as I'm still learning a bit about unicode,
please bear with my thinking.

I created a text file consisting of the word "test" in windows NT using
the supplied accessory notepad.exe, and in the save as dialog box I
selected the option to "save as unicode". (There was no choice as to
utf-8 or utf-16, and I don't know what format other windows tools save
files in.)

A hex dump of the file would tend to indicate that the file was utf-16,
both for the BOM and the nulls, right?

$ hd -x /usr0/dcd/test-unicode.txt
00000000 FF FE 74 00 65 00 73 00 74 00 |..t.e.s.t.|

http://www.unicode.org/unicode/faq/utf_bom.html#25
Q: When a BOM is used, is it only in 16-bit Unicode text?

A: No, a BOM can be used as a signature no matter how the Unicode text
is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising
the BOM will be whatever the Unicode character FEFF is converted into by
that transformation format. In that form, the BOM serves to indicate both
that it is a Unicode file, and which of the formats it is in. Examples:

  Bytes          Encoding Form
  00 00 FE FF    UTF-32, big-endian
  FF FE 00 00    UTF-32, little-endian
  FE FF          UTF-16, big-endian
  FF FE          UTF-16, little-endian
  EF BB BF       UTF-8

The above leads me to think that the default unicode text format on
windows is utf-16 (at least for unicode files created by notepad.exe).
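
The BOM table quoted above can be expressed as a small lookup. The following is only a sketch and was not part of the original report: `guess_encoding_from_bom` and `@BOMS` are hypothetical names, and it assumes the file really starts with a BOM. The longer signatures are tested first because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one, so a UTF-16LE file whose first character is U+0000 would be misidentified.

```perl
use strict;
use warnings;

# Hypothetical lookup table built from the BOM list in the Unicode FAQ.
# Longer signatures come first: FF FE 00 00 (UTF-32LE) starts with
# FF FE (UTF-16LE), so order matters.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Return the encoding name implied by a leading BOM, or nothing if the
# byte string does not start with one of the signatures above.
sub guess_encoding_from_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return $name if index($bytes, $bom) == 0;
    }
    return;
}

# The ten bytes shown in the hex dump of test-unicode.txt.
my $sample = "\xFF\xFE\x74\x00\x65\x00\x73\x00\x74\x00";
print guess_encoding_from_bom($sample), "\n";    # prints "UTF-16LE"
```

For the Notepad file this agrees with the reading of the hex dump: a little-endian UTF-16 BOM followed by 16-bit code units.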

Upon re-reading the "perldoc -f open" section that I questioned

... For example
  open(FH, "<:utf8", "file")
will open the UTF-8 encoded file containing
Unicode characters, see perluniintro. ....

I would have to agree that the statement was not literally
incorrect, but it did mislead me when I first tried
to write code on windows to read a 'unicode' text file
created by notepad. If one tries to change the example
open command to open a UTF-16 file, e.g. open(FH, "<:utf16", "file"),
one gets an error that perl can't locate PerlIO/utf16.pm in @INC.

perl -e 'open(FH, "<:utf16", "/usr0/dcd/test-unicode.txt") || die "open failed:$!"'
Can't locate PerlIO/utf16.pm in @INC (@INC contains:
  /usr/local/lib/perl5/5.9.0/i686-linux /usr/local/lib/perl5/5.9.0
  /usr/local/lib/perl5/site_perl/5.9.0/i686-linux
  /usr/local/lib/perl5/site_perl/5.9.0
  /usr/local/lib/perl5/site_perl/5.8.0/i686-linux
  /usr/local/lib/perl5/site_perl/5.8.0 /usr/local/lib/perl5/site_perl/5.7.3
  /usr/local/lib/perl5/site_perl .) at (eval 1) line 3.
open failed:No such file or directory.

Perhaps I should submit a bug that reports that utf16 files are not
supported in the three argument open form, and after that is resolved,
then submit a document enhancement request to show how to open a windows
unicode .txt file.
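
As the error message shows, perl treats ":utf16" as a layer name and goes looking for a PerlIO::utf16 module, which it does not find. The general ":encoding(...)" layer accepts an encoding name instead. What follows is a minimal sketch, assuming a perl with the Encode and PerlIO::encoding modules (both ship with 5.8); the temporary file is only a stand-in for the Notepad-generated file, recreated from the bytes in the hex dump.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Recreate the ten-byte file from the hex dump: BOM FF FE, then "test"
# as little-endian 16-bit units. The temporary path is illustrative.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out);
print {$out} "\xFF\xFE\x74\x00\x65\x00\x73\x00\x74\x00";
close($out) or die "close failed: $!";

# ":encoding(UTF-16)" uses the BOM to pick the byte order and does not
# return the BOM as part of the text.
open(my $in, '<:encoding(UTF-16)', $path) or die "open failed: $!";
my $text = <$in>;
close($in);

print length($text), " characters: $text\n";
```

If the byte order is known in advance and the file carries no BOM, "UTF-16LE" or "UTF-16BE" can be named in the layer instead of "UTF-16".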

@p5pRT
Author

p5pRT commented Dec 3, 2002

From @jhi

Okay, now I understand a little bit better. Still, you really should get rid of the "Windows Unicode" meme :-) There's only one Unicode, which has various different encodings. Perl prefers UTF-8, Windows prefers (little-endian) UTF-16.

In the three argument open the ":utf8" is currently a special case. The general case is ":encoding(foobar)", so it would be ":encoding(utf16)" -- but I have to admit that this seems to have a bug currently: one gets strange "UTF-16:Partial character" warnings that I think shouldn't happen. I'll ask the guy working on the encoding bits.

I guess we could make also ":utf16" (and ":utf32", I guess) another special case since UTF-16 is prevalent enough.
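
The point about one Unicode with several encodings can be illustrated by encoding the same four characters in different ways. This is a sketch using the Encode module's `encode` and `decode` functions, not something posted in the thread; the `hex_bytes` helper is a hypothetical name introduced for display only.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical display helper: render a byte string as hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# One string of characters, three byte-level representations.
my $text = "test";
for my $name ('UTF-8', 'UTF-16LE', 'UTF-16') {
    printf "%-8s %s\n", $name, hex_bytes(encode($name, $text));
}

# Decoding the Notepad bytes with "UTF-16" consumes the BOM and
# yields the same four characters back.
my $decoded = decode('UTF-16', "\xFF\xFE\x74\x00\x65\x00\x73\x00\x74\x00");
print "round trip ok\n" if $decoded eq $text;
```

Encode's plain "UTF-16" writes a BOM and big-endian units when encoding, while "UTF-16LE" writes little-endian units with no BOM, so neither output is byte-identical to the Notepad file, even though all of them decode to the same text.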

@p5pRT
Author

p5pRT commented Dec 3, 2002

From david.dyck@fluke.com

On 3 Dec 2002 at 02:17 -0000, Jarkko Hietaniemi <perlbug@perl.org> wrote:

... The general case is ":encoding(foobar)",
so it would be ":encoding(utf16)" -- but I have to admit that this seems
to have a bug currently: one gets strange "UTF-16:Partial character"
warnings that I think shouldn't happen.

I didn't get any warnings from the following code
when using Encode-1.83

perl -we 'open(FH, "<:encoding(utf16)", "/usr0/dcd/test-unicode.txt")
  || die "open failed:$!";
  while (<FH>) { print "$_\n" }'

I guess we could make also ":utf16" (and ":utf32",
I guess) another special case since UTF-16 is prevalent enough.

That would be nice, but next time I should just read the
perluniintro pod as the fine manual suggested that I do.

@p5pRT
Author

p5pRT commented Dec 4, 2002

From @jhi

Okay... I think you got me convinced that things work as they should :-)
I'm marking the problem ticket as resolved.

@p5pRT
Author

p5pRT commented Dec 4, 2002

@jhi - Status changed from 'new' to 'resolved'
