Friday, 14. March 2008, 10:15:19
unicode, fun, characters
An ex-colleague just sent me this, which is from the Cyrillic block of the Unicode Character Code Charts. Coincidence or malicious intent? I’ll let you be the judge:
Monday, 18. February 2008, 07:24:12
unicode, comic, characters, fun
Wednesday, 6. February 2008, 07:31:10
flame, linus torvalds, file systems, unicode
The Sydney Morning Herald reports that Linus Torvalds, creator of Linux, claims that “[Mac OS X’s] file system is complete and utter crap, which is scary”. I wonder if this stems from the recent flame war on the Git mailing list over how Mac OS X stores file names as [partially] decomposed Unicode strings encoded as UTF-8, while Git expects file names to be a sequence of octets.
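To make the disagreement concrete, here is a minimal sketch of why a decomposed file name and a precomposed one look like two different names to a tool that only compares octets. The file name is made up for illustration, and plain NFD is used as an approximation; Mac OS X's actual decomposition is only partial, as noted above.

```python
import unicodedata

# Hypothetical file name; \u00e9 is "e with acute" as a single code point.
name = "caf\u00e9.txt"

# Precomposed (NFC) versus fully decomposed (NFD) forms of the same text.
nfc = unicodedata.normalize("NFC", name)
nfd = unicodedata.normalize("NFD", name)

# Encode both as UTF-8, which is what ends up on disk or on the wire.
nfc_bytes = nfc.encode("utf-8")
nfd_bytes = nfd.encode("utf-8")

print(nfc_bytes)               # the accented letter is one code point
print(nfd_bytes)               # plain "e" followed by a combining accent
print(nfc_bytes == nfd_bytes)  # a byte-wise comparison sees two names
```

Both strings render identically, yet the octet sequences differ, so a program that treats names as raw bytes cannot tell that they refer to the same file unless it normalizes them first.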
While I think it is good that people have their opinions, it is possible to keep discussions civilized. Calling something “complete and utter crap” will not help anything.
Personally, I think the idea of storing file names as sequences of Unicode characters, however encoded (like Windows and Mac OS X both do) has vast advantages over storing them as sequences of bytes (like Linux does). If you have ever tried accessing files with non-ASCII names over Samba, or tried to switch your locale encoding on a running Linux system from, say,
ISO-8859-1 to UTF-8, while trying to access your file names with non-ASCII characters in them, you know what I mean.
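The locale problem can be sketched in a few lines. This is only an illustration with a made-up file name: the name is stored as ISO-8859-1 bytes, and then read back by something that assumes UTF-8.

```python
# Hypothetical file name containing non-ASCII letters.
name = "bl\u00e5b\u00e6r.txt"

# What a byte-oriented file system stores under an ISO-8859-1 locale.
stored = name.encode("iso-8859-1")

# After switching the locale to UTF-8, the same bytes no longer decode.
try:
    stored.decode("utf-8")
    readable = True
except UnicodeDecodeError:
    readable = False

print(readable)
# A lenient decoder shows replacement characters instead of the letters.
print(stored.decode("utf-8", errors="replace"))
```

The bytes on disk never changed; only the interpretation did, which is exactly the information a sequence of octets cannot carry.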
Linux needs to take the step forward from the 1980s and accept that a sequence of octets is not good enough.
The world needs Unicode.
(via MacWorld)
Tuesday, 19. September 2006, 07:15:04
unicode, books
One of the things I work on in Opera is support for Unicode and various legacy character encodings. Having good literature on the subject is imperative, and one of the major works of reference is “The Unicode Standard”, published by the Unicode Consortium. This book contains all the finer details of the Unicode standard, including references to all the characters it defines.
The updated data files for Unicode version 5.0 were released earlier this year, but the book has not yet been published. Yesterday, pre-orders for version 5.0 of the book opened. If you want to know all there is to know about Unicode, this is a book I can recommend.
There are of course other good books on this and related subjects. One I can recommend is “CJKV Information Processing”, which is the reference work on the processing of East Asian text, although it could use an update; a lot has happened since 1999. Another good book is “Unicode Demystified”, which tries to explain Unicode in a more verbose form than the standard does.
Friday, 16. June 2006, 07:45:45
browsers, unicode, encodings
This neat trick was published on wincustomize.com:
- Create a text file in Notepad (or another text editor; do not use Wordpad, Word or any other word processor).
- Type this sentence exactly, without the quotes: “this app can break”.
- Save the file, exit the text editor and open the file in Notepad (by double-clicking, or by File→Open).
- Notice that the text has transformed into “桴獩愠灰挠湡戠敲歡” (a nonsensical Chinese text).
Why did it do that?
Michael Kaplan has the full explanation, but in short it is because Notepad takes a stab at auto-detecting what character encoding the file was saved in, and fails horribly. The same happens all the time on the Web, which is why browsers have implemented various ways of
guessing what the author meant. It often works well, but sometimes it fails. Perhaps not as completely as in the Notepad example above, but enough to make pages difficult or impossible to read.
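The effect is easy to reproduce outside Notepad. The sketch below assumes the wrongly guessed encoding is little-endian UTF-16, which is consistent with the garbled text quoted above: every pair of ASCII bytes is read as a single 16-bit character.

```python
# The 18 ASCII bytes that end up in the saved file.
saved_bytes = "this app can break".encode("ascii")

# Misread them as UTF-16 little-endian: two bytes become one character.
misread = saved_bytes.decode("utf-16-le")

print(len(saved_bytes), len(misread))
print(misread)

# Nothing is lost; decoding with the right encoding restores the text.
print(misread.encode("utf-16-le").decode("ascii"))
```

The sentence happens to have an even number of bytes, all of which pair up into valid code points in the CJK range, which is why the guess looks plausible to a heuristic.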
The only solution to the problem is for Web authors to make sure they declare the character encoding of the documents, scripts and style sheets they create. The easiest way to do this is to make the server software add the charset parameter to the HTTP Content-Type header. Apache, for instance, can do this with the AddDefaultCharset configuration directive. If you cannot control the server, you can also add it as a <meta> tag for HTML, an encoding declaration for XML, or a @charset at-rule for CSS. There is no way to declare the character encoding inside a piece of JavaScript or a plain text file, so there you really, really should configure the server to send the correct information.
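As a rough illustration of what "send the correct information" means for a plain text file, here is a minimal WSGI sketch. The application name and the sample text are made up for the example; the point is only that the charset parameter in the Content-Type header matches the encoding actually used for the body.

```python
# Hypothetical sample text with non-ASCII characters.
BODY_TEXT = "Bl\u00e5b\u00e6rsyltet\u00f8y\n"


def plain_text_app(environ, start_response):
    """Serve a plain text body and declare its encoding in the header."""
    body = BODY_TEXT.encode("utf-8")
    headers = [
        # The charset parameter must name the encoding used just above.
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

For a quick local test, an application like this could be served with `wsgiref.simple_server.make_server` from the Python standard library; a browser then has no need to guess.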