Supplementary: Common HTML entities used for typography

26. September 2008, 15:25:13

bhenick

Posts: 2

Supplementary: Common HTML entities used for typography

29. October 2008, 17:49:28

stonetownmike

Posts: 1

What about the ampersand? Is its HTML entity code no longer necessary?

1. February 2009, 21:59:30

scarby421

Posts: 16

Hi, I find the less-than and greater-than signs handy as well, e.g. <....>

24. February 2009, 04:16:59

daniiswara

Posts: 6

With an XHTML 1.1 doctype, it's better to use the Unicode value, I guess.

25. February 2009, 21:47:15

dorward

Posts: 16


"For the sake of portability, Unicode entity references should be reserved for use in documents certain to be written in the UTF-8 or UTF-16 character sets."

Which user agents get numeric character references wrong in non-UTF-8/16 documents?
David Dorward
http://dorward.me.uk/

25. February 2009, 21:49:53

dorward

Posts: 16

Originally posted by daniiswara:

With an XHTML 1.1 doctype, it's better to use the Unicode value, I guess.



XHTML is a tricky beast. One school of thought is that documents should be written to be parsed by generic, non-validating XML parsers. Since they are generic, they do not know the HTML named character references, and since they are non-validating, they don't parse the DTD to find them out either. Hence, numeric character references for everything except the five entities predefined by XML itself (&lt;, &gt;, &amp;, &quot; and &apos;).

... but if you subscribe to that school of thought, there is no point in limiting it to a specific version of XHTML. The logic either applies to all versions or none.
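
To make the point about generic parsers concrete, here is a minimal sketch. It assumes Python's standard-library xml.etree.ElementTree is a fair stand-in for a generic, non-validating XML parser; the helper name parses_as_xml and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup: str) -> bool:
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# HTML named reference: the parser has not read the XHTML DTD,
# so the name "eacute" is undefined as far as it is concerned.
named_ok = parses_as_xml("<p>Caf&eacute;</p>")

# Numeric character reference: needs no DTD at all.
numeric_ok = parses_as_xml("<p>Caf&#233;</p>")

# The five references that XML itself predefines.
xml_five_ok = parses_as_xml("<p>&lt; &gt; &amp; &quot; &apos;</p>")

print(named_ok, numeric_ok, xml_five_ok)
```

With this parser the named reference is rejected while the other two fragments parse. Nothing in the check depends on which XHTML version the doctype declares.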
David Dorward
http://dorward.me.uk/

10. June 2010, 14:03:45

nicholaswilson

Posts: 1

The comment about 'Unicode entities' is misleading, or at best confusing. This is the minimum amount of detail I feel is needed to discuss the issue properly:

"You have three choices for representing characters on the page: first, using the literal character; second, escaping it with a named entity; third, using a numeric entity. Each has its own problems. Literal characters are the shortest, canonical way to represent a character, but you have to be very clear in your mind that your content is in UTF-8, that your CMS is guaranteed to transfer it losslessly, and that your page encoding and HTTP headers are set up so that the content is correctly received at the other end. This ought to be the case already, since you have to be able to handle accents and non-Latin alphabets (what if a user of your site happens to be called Noël?). You should never paint your site management software into a corner where bulletproof basic language support becomes an add-on, so literal characters ought always to work. As a side note, make sure you never even try to store any content in ISO-8859-1/Latin-1 or Windows-1252; because of the way these encoding names are handled on the web, you can never safely transfer content using them.

"Secondly, there are named entities. These are somewhat useful if you find yourself writing raw code for your site. That ought never to happen, though, unless your CMS is dysfunctional, and you should be able to type literals in all Latin scripts and punctuation blocks effortlessly using your compose/option key, unless your operating system is dysfunctional too. Nevertheless, there is only one good reason to positively avoid named entities, and that is if there is the slightest chance that your data will ever be parsed by a general XML parser or a non-validating SGML parser (and a few other corner cases). Because browsers are clever, you nearly always get lucky and your named entities are replaced by their usual characters, and some other sorts of web user agents, like Googlebot, are also clever, but you cannot guarantee that everything will always work. The two most common cases where named entities can break are non-XML content bodged into an XML Atom container, and XHTML served as XML. For forward compatibility, it is generally best to avoid named entities. As a technical aside, note that this does not apply to the four entities &lt;, &gt;, &amp;, and &quot;, which are guaranteed to always work (obscure fact: beware &apos;, which is in fact not part of the HTML 4 DTDs).

"Thirdly, numeric entities will always work. They can be input on any keyboard, transferred using any character encoding, and handled by any user agent. Their big problem is that they are maximally user-unfriendly. Pasting the entities in by hand is too slow, and learning the numbers takes us back to the insane days of Windows forcing you to memorise alt-codes to enter accents. For paranoid software doing something crazy with encodings, or a CMS including content in pages which may be served up in multiple encodings, there is a use case for these, but there is generally no good reason for a user ever to need to mess with entities at all."
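
The three choices can be compared side by side. This is only a sketch: it assumes Python's standard-library html module as the decoder, picks the em dash (U+2014) as an arbitrary example character, and uses a made-up helper called can_encode. The last two lines consult html.entities.name2codepoint, Python's table of HTML 4 entity names, to check the aside about &apos;.

```python
import html
from html.entities import name2codepoint  # HTML 4 entity names only

literal = "\u2014"   # the em dash typed as a literal character
named = "&mdash;"    # named entity
numeric = "&#8212;"  # numeric entity (8212 is 0x2014 in decimal)


def can_encode(text: str, encoding: str) -> bool:
    """Return True if the text survives being encoded in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# All three forms end up as the same character once decoded.
all_decode_same = html.unescape(named) == html.unescape(numeric) == literal

# Both escaped forms are plain ASCII, so any encoding can carry them;
# the literal character only fits encodings that contain it.
print("literal in ISO-8859-1:", can_encode(literal, "iso-8859-1"))
print("literal in UTF-8:     ", can_encode(literal, "utf-8"))
print("named in ASCII:       ", can_encode(named, "ascii"))
print("numeric in ASCII:     ", can_encode(numeric, "ascii"))

# The four safe names are in the HTML 4 table; apos is not.
print("safe four present:", all(n in name2codepoint for n in ("lt", "gt", "amp", "quot")))
print("apos present:     ", "apos" in name2codepoint)
```

The encoding check says nothing about whether a given parser knows the name "mdash"; that is the separate XML problem described in the second paragraph of the quote.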

Note that this explanation is the minimum needed to discuss the situation, and ought to be entirely unnecessary for simply authoring content. Users should have their keyboards set up so they can type easily, and CMSs should be solidly architected to guarantee safe handling and transfer of content without worry or breaking corner cases. Unfortunately, on most users' computers and server software this is not the case, as Windows trails the world in keyboard drivers and leading CMSs produce unsafe or faulty output.
Nicholas Wilson
www.nicholaswilson.me.uk
