Ramblings

<rant>

Is there a use for u"æøå" notation in perl?


After struggling a bit with charset issues on MyOpera, mostly caused by historical decisions (or the lack of them), specifically the encoding-agnostic use of strings in databases, I have become friends (in a love/hate relationship) with Encode: decode_utf8 for reading and encode_utf8 for outputting. But that's strictly for handling I/O.
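To make the decode-on-input, encode-on-output routine concrete, here is a minimal sketch. The variable names are made up for illustration, and the byte string simply stands in for something read from a database or socket:

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical input: the UTF-8 encoding of "æøå" as six raw bytes,
# written with hex escapes so the example does not depend on the file encoding.
my $bytes = "\xC3\xA6\xC3\xB8\xC3\xA5";

# Decode once at the input boundary: bytes become characters.
my $chars = decode_utf8($bytes);

# Inside the program, string functions now work on characters.
my $upper = uc $chars;

# Encode once at the output boundary: characters become bytes again.
my $out = encode_utf8($upper);

printf "%d bytes in, %d characters, %d bytes out\n",
    length $bytes, length $chars, length $out;
# prints "6 bytes in, 3 characters, 6 bytes out"
```

Everything between the decode and the encode can then treat the string as text, without having to care how it was stored.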

When using string constants containing utf8 characters, you need to either set the utf8 flag on your string (utf8::_utf8_on), use utf8::upgrade, or use utf8;.

The latter is a bit scary and a possible source of inconsistency: in addition to letting you use utf8 characters in variable names etc., it also treats all string constants as utf8 strings. However, it needs to be included everywhere to be sure, and that can be cumbersome when using many modules, possibly external ones that don't include it.
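A small sketch of why the pragma has to be present everywhere: it only affects string literals in the lexical scope where it is declared. This assumes the file itself is saved as UTF-8, and the variable names are hypothetical:

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.

my $without = 'æøå';    # no pragma in scope: the literal is six bytes

my $with;
{
    use utf8;           # only affects literals inside this block
    $with = 'æøå';      # the literal is three characters
}

my $after = 'æøå';      # pragma out of scope again: six bytes

printf "%d %d %d\n", length $without, length $with, length $after;
# prints "6 3 6"

# Mixing the two kinds of string: Perl treats each byte of $without
# as a separate character, so the result is 3 + 6 = 9 characters, not 6.
my $mixed = $with . $without;
printf "%d\n", length $mixed;
# prints "9"
```

The same literal ends up with a different length depending on whether the enclosing scope declared the pragma, which is exactly the kind of inconsistency that shows up when some modules use it and others don't.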

So I was wondering whether there could actually be any use in having a notation similar to Python's that lets you explicitly mark string constants as utf8. An example would be:

perl -MEncode -wle 'sub u { my $s = shift; utf8::upgrade($s); return $s; } print encode_utf8(u"æøå");'
What do you think?


Comments

Unregistered user Monday, February 1, 2010 11:14:42 AM

Anonymous writes:

You're doing it wrong.

use utf8; 'æøå'; # already does what one would expect

Do not mess with utf8::_utf8_on or utf8::upgrade or anything from the utf8 pragma module yourself. Your maintenance programmer will always have to wonder WTF you were trying to accomplish. If you have to, use Unicode::Semantics instead, which is much clearer in conveying its intent.

PS: No comment preview? If it comes out wrong, it's your fault.

Nicolas Mendoza (nicomen) Monday, February 1, 2010 11:27:08 AM

Hehe, but if one module has use utf8 and another one doesn't, concatenating strings ends up in a mess, AFAIK. Would you recommend having use utf8; in a startup.pl for mod_perl, for instance? (Also, tinkering with PERL_UNICODE doesn't always work, depending on modules accessing external data.)

Also, I have to deal with wrongly flagged content, forcing it to be utf8 in some places so that it passes correctly through regexps, substr, lc() calls, etc. What to do then?

Unregistered user Monday, February 1, 2010 1:59:43 PM

Anonymous writes:

> but if one module has use utf8, but another one doesn't, concatenating strings end up in a mess

True, but your suggestion in the article does not help with that, either. The problem is not in the pragma, but in the strings themselves.

> Would you recommend to have use utf8; in a startup.pl for mod_perl for instance?

No, because the pragma is lexically scoped, so this won't have any influence on later loaded modules. It needs to be declared everywhere in each module.

> wrongly flagged content

As I said above, see Unicode::Semantics for the simplest cases. Most of the time, the functions from the Encode module are enough. Document the fact that you have to undo someone else's sloppy programming, e.g.

my $perl_string = # explicitly decode because Foo::Bar didn't
    decode 'UTF-8', $passed_through_unchanged_from_outside;

For more tenacious cases, try Encode::Repair.

Nicolas Mendoza (nicomen) Friday, February 12, 2010 6:01:35 PM

Yeah, known problem, but to get a fix out we are trying to finish up something else first.

Nicolas Mendoza (nicomen) Friday, February 12, 2010 6:55:09 PM

This should be working now ;)
