Originally posted by daniiswara:
on a doctype xhtml 1.1, better to use the unicode value, I guess
XHTML is a tricky beast. One school of thought is that documents should be written so that generic, non-validating XML parsers can parse them. Because such parsers are generic, they do not know the HTML named character references, and because they are non-validating, they don't read the DTD to find them out either. Hence, use numeric character references for everything except the five entities predefined by XML (&amp;, &lt;, &gt;, &quot; and &apos;).
... but if you subscribe to that school of thought, there is no point in limiting it to a specific version of XHTML. The logic either applies to all versions or none.
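To make the point about generic parsers concrete, here is a minimal sketch using the XML parser in Python's standard library as a stand-in for "a generic, non-validating XML parser". The helper name `parses` and the sample fragments are made up for illustration; other parsers may report the problem differently.

```python
import xml.etree.ElementTree as ET


def parses(fragment: str) -> bool:
    """Return True if the stdlib XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        # An HTML-only name such as &eacute; is an undefined entity here,
        # because no DTD has been read that declares it.
        return False


# Hypothetical sample fragments.
named = "<p>caf&eacute;</p>"                          # HTML named reference
numeric = "<p>caf&#233;</p>"                          # numeric reference
predefined = "<p>fish &amp; chips &lt;today&gt;</p>"  # XML predefined entities

print(parses(named))       # rejected by this parser
print(parses(numeric))     # accepted
print(parses(predefined))  # accepted
```

A browser parsing the same content as text/html would accept all three fragments, which is why the problem tends to show up only once the markup is handed to an XML toolchain.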
"You have three choices for representing characters on the page: firstly the literal character, secondly a named entity, thirdly a numeric entity. Each has its own problems. Literal characters are the shortest, canonical way to represent the character, but you have to be very clear in your mind that your content is in UTF-8, that your CMS can be guaranteed to transfer it losslessly, and that your page encoding and HTTP headers are set up to guarantee that the content is correctly received at the other end. This ought to be the case already, since you have to be able to handle accents and non-Latin alphabets anyway (what if a user of your site happens to be called Noël?). You should never paint your site management software into a corner where bulletproof basic language support becomes an add-on, so literal characters ought to always work. As a side note, make sure you never even try to store any content in ISO-8859-1/Latin 1 or Windows-1252; because of the way these encoding names are handled on the web, you can never safely transfer content using them.
"Secondly, there are named entities. These are somewhat useful if you find yourself writing raw code for your site. That ought never to happen, though, unless your CMS is dysfunctional, and you should be able to type literals in all Latin scripts and punctuation blocks effortlessly using your compose/option key, unless your operating system is dysfunctional too. There is only one good reason to positively avoid named entities, and that is if there is the slightest chance that your data will ever be parsed by a general XML parser or a non-validating SGML parser (plus a few other corner cases). Because browsers are clever, you nearly always get lucky and your named entities are replaced by the intended characters, and some other web user agents such as Googlebot are also clever, but you cannot guarantee that everything will always work. The two most common cases where named entities break are non-XML content bodged into an XML Atom container, and XHTML served as XML. For forward compatibility, it is generally best to avoid named entities. As a technical aside, note that this does not apply to the four entities &lt;, &gt;, &amp; and &quot;, which are guaranteed to always work (obscure fact: beware &apos;, which is in fact not part of the HTML 4 DTDs).
"Thirdly, numeric entities will always work. They can be input on any keyboard, transferred using any character encoding, and handled by any user agent. Their big problem is that they are maximally user-unfriendly. Pasting the entities in by hand is too slow, and learning the numbers takes us back to the insane days of Windows forcing you to memorise alt-codes to enter accents. For paranoid software doing something crazy with encodings, or a CMS including content in pages which may be served up in multiple encodings, there is a use case for these, but there is generally no good reason for a user ever to need to mess with entities at all."
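Since nobody should be typing numeric references by hand, this is the kind of job software can do on output. The following is a rough sketch, not a complete solution: the function names `named_to_numeric` and `literals_to_numeric` are invented for this example, and the lookup table is Python's HTML 4 entity list, so names that only exist in HTML5 are left untouched.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names -> code points

# The five entities every XML parser knows; these are left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup: str) -> str:
    """Rewrite HTML 4 named references as decimal numeric references."""
    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown names as-is
        return f"&#{name2codepoint[name]};"
    return _NAMED_REF.sub(replace, markup)


def literals_to_numeric(text: str) -> str:
    """Replace every non-ASCII literal character with a numeric reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(named_to_numeric("caf&eacute; &amp; cr&egrave;me&nbsp;br&ucirc;l&eacute;e"))
# caf&#233; &amp; cr&#232;me&#160;br&#251;l&#233;e
print(literals_to_numeric("Noël"))
# No&#235;l
```

The result is plain ASCII, so it survives whatever encoding the page ends up being served in, at the cost of being unreadable in the source.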
Note that this explanation is the minimum needed to discuss the situation, and ought to be entirely unnecessary for simply authoring content. Users should have their keyboards set up so they can type easily, and CMSs should be solidly architected to guarantee safe handling and transfer of content without worries or corner cases that break. Unfortunately, on most users' computers and server software, this is not the case, as Windows trails the world in keyboard drivers and leading CMSs produce unsafe or faulty output.
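For anyone wondering what "unsafe or faulty" handling of literal characters looks like in practice, here is a small illustration of the encoding points made in the quoted text. The sample strings and the helper `survives_round_trip` are made up for this post; the underlying fact is that browsers treat the label ISO-8859-1 as Windows-1252, so the two names cannot be relied on to mean different things on the web.

```python
name = "Noël"

# Sender writes UTF-8, receiver wrongly assumes Windows-1252.
utf8_bytes = name.encode("utf-8")
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # NoÃ«l

# The same byte means different things under the two legacy encodings.
smart_quote_byte = b"\x93"
as_latin1 = smart_quote_byte.decode("latin-1")  # U+0093, an invisible control character
as_cp1252 = smart_quote_byte.decode("cp1252")   # U+201C, a left double quotation mark
print(repr(as_latin1), repr(as_cp1252))


def survives_round_trip(text: str, encoding: str) -> bool:
    """Check whether text can be stored in the given encoding without loss."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False


print(survives_round_trip("Noël “quoted”", "utf-8"))    # True
print(survives_round_trip("Noël “quoted”", "latin-1"))  # False
```

A check along the lines of `survives_round_trip` is the sort of thing a CMS could run before committing to a storage encoding, though using UTF-8 end to end avoids the question entirely.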