How to force correct rendering of U+200B?

7. April 2010, 17:36:52

toscho

Posts: 154

Hi,

U+200B is the ZERO WIDTH SPACE, and it’s well supported in all common browsers (IE 8, Safari 4, Firefox 3.6): it creates no visible space, but it allows a line break.

All except Opera.

<p>A&#x200B;B</p>

… and …
<p>AB</p>


… should be visually identical, but Opera inserts a normal space in the first variant. Is there any known workaround that doesn’t require extra markup? Is this a known bug, or should I file a new one?

8. April 2010, 12:47:31

Frenzie

Posts: 15541

Your sample document renders both ABs the same for me, except that when I crank up the zoom level, the AB with the zero width space gets a soft break.
The DnD Sanctuary — a safety net for My Opera's demise.

8. April 2010, 13:57:15

toscho

Posts: 154

Originally posted by GwenDragon:

With my Opera 10.51 Final Build 3315 i cannot see any space between them.



Haha, it depends on the font! All ›Vista fonts‹ are handled incorrectly while the old fonts render just fine.

Live: http://labs.toscho.de/test/thin-space.html

<!doctype html><meta charset="utf-8">
<title>Thin space</title>
<style>
p#calibri{font-family: Calibri;}
p#cambria{font-family: Cambria;}
p#candara{font-family: Candara;}
p#consolas{font-family: Consolas;}
p#corbel{font-family: Corbel;}
p#constantia{font-family: Constantia;}

p#georgia  {font-family: Georgia;}
p#arial  {font-family: Arial;}
p#times  {font-family: Times New Roman;}

p[id]::before{content:attr(id);display:block}
</style>

<p>A&#x200B;B</p>
<p id="arial">A&#x200B;B</p>
<p id="times">A&#x200B;B</p>
<p id="georgia">A&#x200B;B</p>


<p id="calibri">A&#x200B;B</p>
<p id="cambria">A&#x200B;B</p>
<p id="candara">A&#x200B;B</p>
<p id="consolas">A&#x200B;B</p>
<p id="corbel">A&#x200B;B</p>
<p id="constantia">A&#x200B;B</p>


Opera 10.52 (Build 3338) Win XP:

9. April 2010, 19:41:19 (edited)

deathshadow

Excitable Boy

Posts: 768

It's not the font rendering system, it's the font. Your last example renders identically here for me in IE, FF, Opera, Saffy and Chrome.

Character kerning for all characters, even zero width spaces, is set by the font used. Arial's and Times New Roman's zero width spaces are actually BROKEN, because instead of being zero width they have a NEGATIVE width. The behavior of the newer fonts is in fact the correct one as a zero width space should render IDENTICAL to how it looks if you don't put a space between them.

add this to your test:
<p id="arial">A&#x200B;B AB</p>
<p id="times">A&#x200B;B AB</p>

The second AB pairing on the line should render identically, but they do not. The old Microsoft fonts zwsp is broken. They're not rendering zero width, they're rendering NEGATIVE width.

But again, this is why you cannot rely upon certain characters for formatting - it's also why when I write my code the only reason I'll use UTF extended characters is for alternative languages - if I'm working in English or developing a site template I restrict myself to 7 bit ASCII and HTML entities... that way it doesn't matter what the character encoding is.

And 0x200B has no named entity associated with it - though &zwnj; (0x200C) or &zwj; (0x200D) could serve much the same purpose.

Basically, if it doesn't have a named entity equivalent, I wouldn't try to use it on a website... and if something like kerning is going to break something visually, you probably aren't designing for the web in the first place since you cannot rely on the font or rendering technology you design for always being available to the user.

Good reference of which ones have names:
http://htmlhelp.com/reference/html40/entities/special.html

I'd also be asking WHY you are using a zero width space - if you are using it JUST to remove that space between characters then it's semantically incorrect and I'd be looking at using a negative letter-spacing instead, since that WILL behave consistently cross-browser and cross-font and cross-rendering engine.
So what's wrong with YOUR website? (an ongoing series)
So what's wrong with HTML 5?
Javascript is to Java as Hamburger is to Ham

10. April 2010, 01:09:46

toscho

Posts: 154

Originally posted by deathshadow:

It's not the font rendering system, it's the font. Your last example renders identically here for me in IE, FF, Opera, Saffy and Chrome.


Last? I gave just one …

Anyway, Opera is the only browser on Win XP which creates a wide space between the letters. Opera 10.10 on Linux (Opensuse 11.1) renders the test case fine.

Originally posted by deathshadow:

Character kerning for all characters, even zero width spaces, is set by the font used. Arial's and Times New Roman's zero width spaces are actually BROKEN, because instead of being zero width they have a NEGATIVE width.


I’ve now checked this more thoroughly: neither Arial nor Times New Roman nor any of the Vista fonts even has U+200B. :)

But this doesn’t matter: Even with some fonts, which have U+200B, Opera (Win) renders a wide space. I’ve updated (and moved) the test case: http://labs.dev/test/thin-space/

Current result:



Originally posted by deathshadow:

The behavior of the newer fonts is in fact the correct one as a zero width space should render IDENTICAL to how it looks if you don't put a space between them.



Errm … that’s what I said. Opera doesn’t do that.

Originally posted by deathshadow:

add this to your test:

<p id="arial">A&#x200B;B AB</p>
<p id="times">A&#x200B;B AB</p>



Okay.

Originally posted by deathshadow:

The second AB pairing on the line should render identically, but they do not. The old Microsoft fonts zwsp is broken. They're not rendering zero width, they're rendering NEGATIVE width.


On my computer, they don’t have this character, and Opera renders it correctly as a null space (not negative).

Originally posted by deathshadow:

if I'm working in English or developing a site template I restrict myself to 7 bit ASCII and HTML entities... that way it doesn't matter what the character encoding is.



The encoding is not the problem: The available characters in HTML are always identical with the latest Unicode standard. X(HT)ML has some restrictions here (FORM FEED for example), but the encoding is just a notation, it doesn’t determine the range of visible characters.

Originally posted by deathshadow:

Basically, if it doesn't have a named entity equivalent, I wouldn't try to use it on a website...



Entities are useless, since we have UTF-8. They are not well supported in XML, and in HTML we write just the character. I literally never use them.

Originally posted by deathshadow:

I'd also be asking WHY you are using a zero width space


I want to allow a line break, where the Unicode Line Breaking Algorithm otherwise would forbid one.

10. April 2010, 04:34:18 (edited)

deathshadow

Excitable Boy

Posts: 768

Originally posted by toscho:


Last? I gave just one


Oh yeah, the other one was first responder, my bad.

Originally posted by toscho:


Anyway, Opera is the only browser on Win XP which creates a wide space between the letters. Opera 10.10 on Linux (Opensuse 11.1) renders the test case fine.


Which oddly I'm unable to recreate here using your example.. Admittedly the only thing I have here that still has XP on it is my M$ VirtualPC install for browser testing.

Originally posted by toscho:


I’ve now checked this more thoroughly: neither Arial nor Times New Roman nor any of the Vista fonts even has U+200B. :)


Odd, the older ones for me exist and have a negative kerning applied to them. Internationalization difference? Difference between Corporate/Business and Home maybe? That would take a bit more digging methinks.

Originally posted by toscho:


But this doesn’t matter: Even with some fonts, which have U+200B, Opera (Win) renders a wide space. I’ve updated (and moved) the test case: http://labs.dev/test/thin-space/


Broken URL - I assume you mean the same URL as the last testcase.

Originally posted by toscho:


Errm … that’s what I said. Opera doesn’t do that.


Does here... EVEN in XP (corporate) Wait, I'm looking at the code to your example... and christmas on a cracker, do you spend 90% of your time on the alt key or something? EVERYTHING you have has invalid/UTF only characters in it. Even your demo page is showing a bunch of those wonderful 'missing character' blocks and

Hmm... Idea... testing --- HAH!!! - add all the 'missing' tags you went 1990's transitional markup with in your testcase - basically put a XHTML 1.0 Strict doctype on it, a VALID encoding meta, and all those missing tags like HTML, HEAD, BODY that are 'optional'...

and Opera does something even more interesting - it shows the font's "Missing character" box instead of a space... I'm wondering if on your version it's not showing the missing character element due to some sort of OS level difference like international settings or something.

Hmm, and the "missing box" character suddenly shows up in every browser except firefox on XP here.

Rewritten with REAL markup (it's php so I can set the encoding since my server still does ISO 8859-1 as default):
http://www.cutcodedown.com/for_others/toscho/test1.php

Images showing how it renders in XP here:
http://www.cutcodedown.com/for_others/toscho/x200B_IE6.jpg
http://www.cutcodedown.com/for_others/toscho/x200B_Opera.jpg

... and this is why NOT using a full HTML spec document is a REALLY bad idea - and just part of why I HATE HTML 5 since it seems bound and determined to undo all the progress we've made using XHTML and STRICT doctypes. When testing, use a full document and don't leave out a half dozen tags that are the difference between valid code and just slapping it together any old way. It's often jaw-dropping the difference it makes in all the different browsers' rendering - even when the Transitional specification says said elements are 'optional'. Transitional markup - "It's a Trap"

Originally posted by toscho:

The encoding is not the problem


I beg to differ.

Originally posted by toscho:

The available characters in HTML are always identical with the latest Unicode standard.


SINCE WHEN?!? That's definitely news to me since I've rarely had it ever work right - which is why the ONLY reason I'll use UTF-8 is for language support. I certainly wouldn't try to use it for formatting since that's not even HTML's job. (That's what CSS is for!)...

Originally posted by toscho:

X(HT)ML has some restrictions here (FORM FEED for example), but the encoding is just a notation, it doesn’t determine the range of visible characters.


Ok, I see what you are saying, but you are missing what I'm saying. UTF-8 often declares a range of characters that EXCEEDS the capabilities of the declared font and font renderer. Half of the time you are lucky if the font designer completed the 7 bit ASCII set, much less the 8 bit -1252... a full UTF character set? PLEASE, as if. Hell, windows fonts are Windows-1252, not UTF-8 so naturally there will be gaps in the translation matrix - especially of rarely used formatting characters that really do things that don't belong in HTML CDATA - since HTML is for structure, CSS is for appearance - and CDATA is for data, NOT for formatting.

Originally posted by toscho:

Entities are useless


Work just fine in EVERY doctype since XHTML 1.0 inherits from HTML4 those as valid properties.

Originally posted by toscho:

They are not well supported in XML


Which means jack **** for web development since XHTML 1.1 is undeployable real world if you give a *** about supporting IE, as is using a XML mime-type.

Originally posted by toscho:

and in HTML we write just the character. I literally never use them.


Funny, every time I come across UTF 'special' characters I want to scream at the stupidity. It's nothing but a badly broken pain in the ass headache - from the bullshit "styled quote" nonsense to trying to dictate formatting in the CDATA - to unreliable translation maps of -1252 or -8859 fonts to the UTF matrix. It seems like you and I have entirely contrary experiences when it comes to dealing with UTF - every time I encounter it my experience is that it's more hassle than it's worth, even using the hex entities. Named entities work damned near everyplace since they are character encoding neutral - so if someone screws up and serves it as 8859-1 or win-1252 the page still works.
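
To illustrate the encoding-neutral argument, here is a minimal Python sketch (not from the thread; the sample string is made up). It uses numeric character references rather than named entities, since U+200B has no named entity in HTML 4, and it assumes the page might be mislabeled as ISO 8859-1 or Windows-1252:

```python
import html

# Hypothetical sample: a zero width space plus typographic quotes.
text = "A\u200bB \u201cquoted\u201d"

# Replace every non-ASCII character with a decimal numeric character reference.
ascii_markup = text.encode("ascii", "xmlcharrefreplace")
print(ascii_markup)

# Pure 7-bit ASCII bytes decode to the same markup under any of these labels.
decoded = {enc: ascii_markup.decode(enc) for enc in ("utf-8", "iso-8859-1", "cp1252")}
assert len(set(decoded.values())) == 1

# An HTML parser resolves the references back to the original characters.
assert html.unescape(decoded["utf-8"]) == text
```

This only shows that the markup survives a wrong encoding label; whether the browser then finds a glyph for the referenced character is a separate question.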

Originally posted by toscho:


I want to allow a line break, where the Unicode Line Breaking Algorithm otherwise would forbid one.


Which has what to do with a web browser or HTML exactly? That SOUNDS like playing with character encoding behaviors that have little if anything to do with HTML's behaviors - and in HTML its rules will trump the character encodings - which is WHY I restrict myself to the working entities list on that since THAT is part of the HTML rules... Especially since IE's UTF support is, well... lacking - and so is Opera's translation matrix it would seem.

Though if you are using the data in something OTHER than a browser I could see the point; but then I'd use something like php to neuter it down to avoid these types of issues - part of why I don't use 'true' XML for anything more than porting between database formats.

You know, that almost sounds like a job for a bit of CSS trickery - depending on what you are trying to show. A dummy span or other inline-level tag could solve a LOT of those issues.

H<span class="breakPoint">&#x200B;</span>B



.breakPoint {
	display:-moz-inline-block;
	display:-moz-inline-box;
	display:inline-block; /* so we can set a width */
	width:2em;
	margin-right:-2em; /* element now has zero render width */
	text-indent:-999em; /* if it wants to show the no character box, hide it */
}


It's ugly, but should work. Then you have the character for when CSS is not present but UTF is supported, and when CSS is working it makes the browsers like IE and Opera on XP also work.

I actually use a similar technique in a rewrite of SMF's 'forced break' handler, though I just have a normal space inside there since I wasn't that concerned about CSS off users having a word that's too long have an extra space thrown in. Beats the living tar out of well, let's just say that their code comment says it all:
// This is SADLY and INCREDIBLY browser dependent.

Made all the worse by browser sniffing and inlined style... But I'm the whackjob who would deprecate - ok, who are we kidding - I'd obsolete STYLE as both a attribute and a tag since IMHO that *** doesn't belong in the markup; EVER... Just as I don't see any good reason to use anything more than 7 bit ASCII in building an english language website - EVER.

10. April 2010, 04:55:53

toscho

Posts: 154

Originally posted by deathshadow:

Originally posted by toscho:


Anyway, Opera is the only browser on Win XP which creates a wide space between the letters. Opera 10.10 on Linux (Opensuse 11.1) renders the test case fine.


Which oddly I'm unable to recreate here using your example.. Admittedly the only thing I have here that still has XP on it is my M$ VirtualPC install for browser testing.


Maybe you have just the dozen pre-installed fonts on your system and Opera has no chance to choose the wrong font.

Originally posted by deathshadow:

Originally posted by toscho:


I’ve now checked this more thoroughly: neither Arial nor Times New Roman nor any of the Vista fonts even has U+200B. :)


Odd, the older ones for me exist and have a negative kerning applied to them. Internationalization difference? Difference between Corporate/Business and Home maybe? That would take a bit more digging methinks.


I don’t think so. I’ve checked it with the charmap, and none of the named fonts had U+200B.

Originally posted by deathshadow:

Broken URL - I assume you mean the same URL as the last testcase.


Sorry, it is http://labs.toscho.de/test/thin-space/ of course.

Originally posted by deathshadow:

Wait, I'm looking at the code to your example... and christmas on a cracker, do you spend 90% of your time on the alt key or something? EVERYTHING you have has invalid/UTF only characters in it. Even your demo page is showing a bunch of those wonderful 'missing character' blocks and



No, this page is valid UTF-8. Your system seems to be more broken than mine. ;)

Originally posted by deathshadow:

Hmm... Idea... testing --- HAH!!! - add all the 'missing' tags you went 1990's transitional markup with in your testcase - basically put a XHTML 1.0 Strict doctype on it, a VALID encoding meta, and all those missing tags like HTML, HEAD, BODY that are 'optional'...


XHTML is dead. This is valid HTML 5. The meta element is irrelevant anyway, the HTTP headers rule. I’ve created the minimal test case on purpose.

Originally posted by deathshadow:

and Opera does something even more interesting - it shows the font's "Missing character" box instead of a space... I'm wondering if on your version it's not showing the missing character element due to some sort of OS level difference like international settings or something.


Probably you just have no font with U+200B as fallback.

Originally posted by deathshadow:

Rewritten with REAL markup (it's php so I can set the encoding since my server still does ISO 8859-1 as default):
http://www.cutcodedown.com/for_others/toscho/test1.php


Apart from the fact that I don’t see any advantage in downgrading the test case to XHTML, I still see wide thin spaces.

Originally posted by deathshadow:

Images showing how it renders in XP here:
http://www.cutcodedown.com/for_others/toscho/x200B_IE6.jpg
http://www.cutcodedown.com/for_others/toscho/x200B_Opera.jpg


Obviously, you have none of the referenced fonts, not even the Vista fonts, and the markup doesn’t have any influence (maybe the MIME type has, but we both at least agree that text/html is better, so we should keep this out of this analysis). This doesn’t help narrow down the underlying factors.

Originally posted by deathshadow:

Originally posted by toscho:

The available characters in HTML are always identical with the latest Unicode standard.


SINCE WHEN?!? That's definitely news to me since I've rarely had it ever work right - which is why the ONLY reason I'll use UTF-8 is for language support.



Ahem, since HTML 4.

The ASCII character set is not sufficient for a global information system such as the Web, so HTML uses the much more complete character set called the Universal Character Set (UCS), defined in [ISO10646]. This standard defines a repertoire of thousands of characters used by communities all over the world.



Explained:

Also, this reference assumes that the character sets defined by ISO 10646 and Unicode remain character-by-character equivalent. This reference also includes future publications of other parts of 10646 (i.e., other than Part 1) that define characters in planes 1-16.



So, if you don’t write a HTML 3.2 document, you may use any character from the planes 1—16 of Unicode. As real character, numeric reference or as entity.
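
A small Python sketch of that last point, assuming the standard library's `html.unescape` as a stand-in for what an HTML parser does with character references (the sample characters are chosen for illustration only):

```python
import html

# Four notations for the same character, U+00E9.
notations = {
    "literal character": "\u00e9",
    "decimal reference": "&#233;",
    "hex reference": "&#xE9;",
    "named entity": "&eacute;",
}

# Resolve each notation the way a parser would.
resolved = {name: html.unescape(markup) for name, markup in notations.items()}
for name, char in resolved.items():
    print(f"{name:18} -> U+{ord(char):04X}")

# U+200B has no HTML 4 named entity, but the numeric reference works the same way.
zwsp = html.unescape("&#x200B;")
print(f"U+{ord(zwsp):04X}")
```

All notations end up as the same code point; the document encoding only affects how the literal form is stored as bytes.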

Originally posted by deathshadow:

UTF-8 often declares a range of characters that EXCEEDS the capabilities of the declared font and font renderer. Half of the time you are lucky if the font designer completed the 7 bit ASCII set, much less the 8 bit -1252... a full UTF character set?


This is the reason why glyph substitution exists. If a font lacks a character, the rendering engine takes it from another font. I’ve written in detail about this topic … in German. English is not my native language, so I can’t explain it as precisely as would be necessary.


Originally posted by deathshadow:

Originally posted by toscho:

Entities are useless


Work just fine in EVERY doctype since XHTML 1.0 inherits from HTML4 those as valid properties.


Especially in XHTML (served as real XHTML, not text/html) you cannot rely on entities: the DTD doesn’t contain the entities anymore, and the support is optional. Some (older) browsers ignore entities in XHTML (Opera 7 for example or early Safari builds). Entities are archaic.

Originally posted by deathshadow:

Originally posted by toscho:

They are not well supported in XML


Which means jack **** for web development since XHTML 1.1 is undeployable real world if you give a *** about supporting IE, as is using a XML mime-type.


Well, some of us deliver newsfeeds … or MathML … or JSON. I care. :)

Originally posted by deathshadow:

It's nothing but a badly broken pain in the ass headache - from the bullshit "styled quote" nonsense


These quotation marks are not ›styled‹ but typographically correct. That’s a difference.
In most countries ›"‹ stands for an inch, not for a quotation mark. And they’re ugly as hell.

Originally posted by deathshadow:

It seems like you and I have entirely contrary experiences when it comes to dealing with UTF - every time I encounter it my experience is that it's more hassle than it's worth, even using the hex entities. Named entities work damned near everyplace since they are character encoding neutral - so if someone screws up and serves it as 8859-1 or win-1252 the page still works.


I use UTF-8, because I can validate it, because I work often with different APIs (Twitter, Delicious, Google), because it is the default encoding for AJAX requests, and because I can detect different languages easier.
Try to find out if a string is encoded in iso-8859-1 or iso-8859-15. Impossible. A nightmare.
UTF-8 has so many advantages that I don’t even think about the use of any other encoding (besides UTF-16).
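
The validation point can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the poster's own tooling: `is_valid_utf8` is a hypothetical helper name, and the sample string is made up.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample: a price with a euro sign, stored as ISO 8859-15.
sample = "10 \u20ac".encode("iso-8859-15")  # the euro sign becomes byte 0xA4

# UTF-8 can be checked mechanically: a stray 0xA4 byte is rejected.
print(is_valid_utf8(sample))
print(is_valid_utf8("10 \u20ac".encode("utf-8")))

# The two ISO 8859 variants both accept the same bytes without complaint,
# so the bytes alone cannot tell you which one was meant.
as_latin1 = sample.decode("iso-8859-1")   # 0xA4 is the currency sign here
as_latin9 = sample.decode("iso-8859-15")  # 0xA4 is the euro sign here
print(repr(as_latin1), repr(as_latin9))
```

Every byte sequence is a valid ISO 8859-1 string, so there is nothing to validate against; with UTF-8, a wrong guess usually fails loudly.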

Originally posted by deathshadow:

Originally posted by toscho:

I want to allow a line break, where the Unicode Line Breaking Algorithm otherwise would forbid one.


Which has what to do with a web browser or HTML exactly? That SOUNDS like playing with character encoding behaviors that have little if anything to do with HTML's behaviors - and in HTML its rules will trump the character encodings - which is WHY I restrict myself to the working entities list on that since THAT is part of the HTML rules... Especially since IE's UTF support is, well... lacking - and so is Opera's translation matrix it would seem.



The Unicode line breaking algorithm applies to HTML too!
Example:

Pick a tld: .com, .de, .pro, .org, .net, .us, .nu, .museum.


The Ulba forbids a line break before the dot. Paste this line into an HTML document and narrow the window. In Opera you’ll see no line break between the TLDs.

Really, this is essential. It has nothing to do with the actual encoding, the document type declaration, entities or other incidentals. You can try to force line breaks via CSS, but at the risk of breaks between real letters or at the cost of additional and purely presentational markup. Fixing this on the character level is a much better approach.
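
Fixing it on the character level could look like the following Python sketch. It is only an illustration: the function name and the regular expression are assumptions, and it targets just the pattern from the TLD example (whitespace followed by a dot and a letter).

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def allow_break_before_dots(text: str) -> str:
    """Insert U+200B between whitespace and a following '.tld'-style token.

    The pattern matches the empty position after whitespace and before a dot
    that is followed by a letter, so sentence-ending dots are left alone.
    """
    return re.sub(r"(?<=\s)(?=\.[A-Za-z])", ZWSP, text)


line = "Pick a tld: .com, .de, .pro, .org, .net, .us, .nu, .museum."
fixed = allow_break_before_dots(line)

# Show the result with the invisible character made visible.
print(fixed.replace(ZWSP, "<U+200B>"))
```

The visible text is unchanged and no extra markup is needed; whether the inserted character renders with zero width still depends on the browser and the font, which is the Opera problem this thread is about.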

Originally posted by deathshadow:

A dummy span or other inline-level tag could solve a LOT of those issues.


Yes, I know the CSS ›solutions‹. But I prefer my documents to be fully accessible and usable without CSS. And we’re talking about semantics on the character level. Markup comes ›too late‹ in this sense.

10. April 2010, 09:36:37

Frenzie

Posts: 15541

Originally posted by toscho:

But this doesn’t matter: Even with some fonts, which have U+200B, Opera (Win) renders a wide space. I’ve updated (and moved) the test case: http://labs.dev/test/thin-space/


404

Originally posted by deathshadow:

Rewritten with REAL markup (it's php so I can set the encoding since my server still does ISO 8859-1 as default):

http://www.cutcodedown.com/for_others/toscho/test1.php


I don't actually have most of those fonts, but Arno Pro messes up. Chromium shows a "glyph not found" square while Fx correctly renders a (presumably replaced) zero width space.

Originally posted by deathshadow:

Just as I don't see any good reason to use anything more than 7 bit ASCII in building an english language website - EVER.


You do realize things like déjà vu are English too? Or perhaps someone might like to talk about an English author like Emily Brontë. Or...

Originally posted by toscho:

Originally posted by deathshadow:

Broken URL - I assume you mean the same URL as the last testcase.



Sorry, it is http://labs.toscho.de/test/thin-space/ of course.


Ah, that works. No visible difference with the testcase deathshadow posted. I didn't peek at the HTML.

12. April 2010, 13:13:29

deathshadow

Excitable Boy

Posts: 768

Originally posted by toscho:


Maybe you have just the dozen pre-installed fonts on your system and Opera has no chance to choose the wrong font.


Since HTML is about device independence, there should be no such thing as the wrong font... and frankly if you are designing around fonts other than the core ones - or even assuming that the behavior you expect of a font will be present in whatever the family list ends up cascading down to - you aren't developing your content for the realities of the web or the intent of HTML. (the intent being User agent can trump "designer" to match the capabilities of the device)

Originally posted by toscho:


XHTML is dead. This is valid HTML 5.


Ah, I see... Playing with a specification not even out of draft that even the project leader doesn't think will be completed until 2022... which frankly that in my mind doesn't make it real world deployable until 2030... Gee you'd think some jackasses decided we needed fifty new tags and attributes a third of which are presentational, half being redundant to existing tags and undoing all the progress of STRICT, the remainder being cute, but of course when people can't be bothered to use fieldset, label, legend, caption, th, thead, tbody - is adding more tags and attributes for them not to learn REALLY the answer? Sorry, NOT a fan of HTML 5 - much less even trying to deploy it DECADES before it's ready.

Originally posted by toscho:


Probably you just have no font with U+200B as fallback.


You mean like Arial... Oh wait, arial HAS that character with negative kerning and STILL shows the box on XP.

Originally posted by toscho:


Apart from the fact that I don’t see any advantage in downgrading the test case to XHTML, I still see wide thin spaces.


Odd that I don't in XP, and it behaves as expected in Win7.

Originally posted by toscho:


Obviously, you have none of the referenced fonts, not even the Vista fonts, and the markup doesn’t have any influence (maybe the MIME type has, but we both at least agree that text/html is better, so we should keep this out of this analysis). This doesn’t help narrow down the underlying factors.


Uhm, it's XP - if you are designing for the web you probably shouldn't assume the user is even going to have any of those and be developing with that in mind... Otherwise you just told all XP users to take a hike...

Originally posted by deathshadow:

Originally posted by toscho:

The available characters in HTML are always identical with the latest Unicode standard.


SINCE WHEN?!? That's definitely news to me since I've rarely had it ever work right - which is why the ONLY reason I'll use UTF-8 is for language support.



Originally posted by toscho:


Ahem, since HTML 4.
Sure, and how well is that specification implemented when half the default fonts are missing a slew of characters, the translation matrix used for character mapping ends up with gaps because of it, font renderers and kerning handlers on the host OS aren't equipped for it, etc, etc, etc. Remember, one of my number one complaints is browser developers shitting out HTML5 and CSS3 support when they don't even have HTML4/CSS2.1 polished off properly yet.

Originally posted by toscho:


So, if you don’t write a HTML 3.2 document, you may use any character from the planes 1—16 of Unicode. As real character, numeric reference or as entity.


Well, good luck getting that to work in every browser cross-platform, since they all have gaping holes in their HTML4 implementations - some of them even with bug reports still open after decades.

Originally posted by toscho:


Originally posted by deathshadow:

UTF-8 often declares a range of characters that EXCEEDS the capabilities of the declared font and font renderer. Half of the time you are lucky if the font designer completed the 7 bit ASCII set, much less the 8 bit -1252... a full UTF character set?


This is the reason why glyph substitution exists. If a font lacks a character, the rendering engine takes it from another font.


... and if the fonts the user has installed don't have the one you want, you're shit out of luck. Hey look, the translation matrix (an array used to point characters at glyphs across different fonts) is pointing the missing zero width space at the missing character box - who'd have thought.

Originally posted by toscho:


Especially in XHTML (served as real XHTML, not text/html) you cannot rely on entities: the DTD doesn’t contain the entities anymore, and the support is optional. Some (older) browsers ignore entities in XHTML (Opera 7 for example or early Safari builds). Entities are archaic.


Show me a browser released in the past five years that has that problem with them... show me a browser people still USE released in the past DECADE (other than beta's) that has a problem with them...

Originally posted by toscho:

These quotation marks are not ›styled‹ but typographically correct. That’s a difference.


Is that why I'm getting chevrons here? ;)

Originally posted by toscho:


In most countries ›"‹ stands for an inch, not for a quotation mark. And they’re ugly as hell.


Uhm, most countries don't even HAVE inches. ;)

Originally posted by toscho:


I use UTF-8, because I can validate it, because I work often with different APIs (Twitter, Delicious, Google), because it is the default encoding for AJAX requests, and because I can detect different languages easier.
Try to find out if a string is encoded in iso-8859-1 or iso-8859-15. Impossible. A nightmare.
UTF-8 has so many advantages that I don’t even think about the use of any other encoding (besides UTF-16).


While restricting myself to 7 bit ascii gives me guaranteed support no matter what character encoding is chosen.

Originally posted by deathshadow:

Originally posted by toscho:

I want to allow a line break, where the Unicode Line Breaking Algorithm otherwise would forbid one.


Which has what to do with a web browser or HTML exactly? That SOUNDS like playing with character encoding behaviors that have little if anything to do with HTML's behaviors - and in HTML its rules will trump the character encodings - which is WHY I restrict myself to the working entities list on that since THAT is part of the HTML rules... Especially since IE's UTF support is, well... lacking - and so is Opera's translation matrix it would seem.



Originally posted by toscho:


The Unicode line breaking algorithm applies to HTML too!
Example:

Pick a tld: .com, .de, .pro, .org, .net, .us, .nu, .museum.

The Ulba forbids a line break before the dot. Paste this line into an HTML document and narrow the window. In Opera you’ll see no line break between the TLDs.



Yeah, a behavior that is a colossal pain in the ass when it rears its ugly head, and more hindrance than help... especially since not all browsers even obey it. (hello IE and FF)

Originally posted by toscho:


Really, this is essential. It has nothing to do with the actual encoding, the document type declaration, entities or other incidentals. You can try to force line breaks via CSS, but at the risk of breaks between real letters or at the cost of additional and purely presentational markup. Fixing this on the character level is a much better approach.


To me, that's just an unnecessary mess... It's been the cause of more problems for me than it's been a help.

Originally posted by toscho:

Yes, I know the CSS ›solutions‹. But I prefer my documents to be fully accessible and usable without CSS. And we’re talking about semantics on the character level. Markup comes ›too late‹ in this sense.


Thing is, that's NOT semantics, it's presentation. Line breaks, special spacing - that's layout, NOT structure. Layout is presentation, it BELONGS in the CSS.

Oh, and I love how your UTF is getting mangled between the edits ;)

Originally posted by Frenzie:

You do realize things like déjà vu are English too?


Every time I see it styled that way, I feel like backhanding someone since to me, that's like when a news reporter is reading along in an American mid-western accent, comes across a Spanish name and suddenly they're members of the Latin Kings.

I know it's in the dictionary that way, still just feels wrong... maybe it's that pesky 10 years of using computers BEFORE we had anything more than 7 bit ascii and the decade after where nobody bothered using the 8 bit extended set either since you couldn't rely on it across different hardware :D

Which is why back then I restricted myself to 7 bit ascii as well, since all those cute IBM extended characters would end up as the wrong characters on a DEC Rainbow or worse, graphics on a Trash-80. Sometimes it's best to keep it simple and go for the lowest common denominator - and in most cases that's ASCII... Hell, there's a reason it's all you can use in your XML prolog and why it's generally not a good idea to use anything other than it until you get past your Content-Type meta, even WITH the right mime-type.

So what's wrong with YOUR website? (an ongoing series)
So what's wrong with HTML 5?
Javascript is to Java as Hamburger is to Ham

12. April 2010, 13:19:15

deathshadow

Excitable Boy

Posts: 768

Just had to add:

Originally posted by toscho:

Originally posted by deathshadow:

Originally posted by toscho:

Entities are useless


Work just fine in EVERY doctype, since XHTML 1.0 inherits those from HTML4 as valid properties.


Especially in XHTML (served as real XHTML, not text/html) you cannot rely on entities: the DTD doesn’t contain the entities anymore, and the support is optional. Some (older) browsers ignore entities in XHTML (Opera 7 for example or early Safari builds). Entities are arche.



You know, that statement is an absolute riot given that to even HAVE valid XHTML, you have to use the named entity every time you want an ampersand in the code...

Oh yeah, named entities are SO out in XHTML.

12. April 2010, 13:33:17

Frenzie

Posts: 15541

Originally posted by deathshadow:

Since HTML is about device independence, there should be no such thing as the wrong font... and frankly if you are designing around fonts other than the core ones - or even that the behavior you expect of a font will be present in whatever the family list ends up cascading down to - you aren't developing your content for the realities of the web or the intent of HTML. (the intent being User agent can trump "designer" to match the capabilities of the device)


Well spoken.

Originally posted by deathshadow:

Every time I see it styled that way, I feel like backhanding someone since to me, that's like when a news reporter is reading along in an American mid-western accent, comes across a Spanish name and suddenly they're members of the Latin Kings.

roflmao

I know it's in the dictionary that way, still just feels wrong... maybe it's that pesky 10 years of using computers BEFORE we had anything more than 7 bit ascii and the decade after where nobody bothered using the 8 bit extended set either since you couldn't rely on it across different hardware :D

I wouldn't be surprised if that's somehow related. :P Either way, you may not need it for English, but you need such characters for Dutch, with or without commonplace French expressions mixed in.

Which is why back then I restricted myself to 7 bit ascii as well, since all those cute IBM extended characters would end up as the wrong characters on a DEC Rainbow or worse, graphics on a Trash-80. Sometimes it's best to keep it simple and go for the lowest common denominator - and in most cases that's ASCII... Hell, there's a reason it's all you can use in your XML prolog and why it's generally not a good idea to use anything other than it until you get past your Content-Type meta, even WITH the right mime-type.


True.

Anyway, I just think it's an important point that 7-bit ASCII is insufficient for a proper typographical representation of English, even if it may be closer than for many other languages.

Originally posted by deathshadow:

You know, that statement is an absolute riot given that to even HAVE valid XHTML, you have to use the named entity every time you want an ampersand in the code...

Oh yeah, named entities are SO out in XHTML.


XHTML5 only has the XML entities though (i.e. I believe that's limited to quot, amp, apos, lt, gt and the #xxxx stuff)

12. April 2010, 13:39:37

Originally posted by deathshadow:

You know, that statement is an absolute riot given that to even HAVE valid XHTML, you have to use the named entity every time you want an ampersand in the code...



Oh yeah, named entities are SO out in XHTML.


There are five pre-defined named entities in XML: lt, gt, amp, quot and apos. Those are always safe to use, since every XML parser must support them.

All the named entities defined via the XHTML DTDs, on the other hand, are not (theoretically) safe, since a non-validating XML parser doesn't have to read the DTD. Browsers have a hard-coded list these days, but generic XML parsers do not.

Of course it's more or less a moot point since XHTML is unusable anyway and is virtually always served as HTML, which means it's parsed by the browsers' HTML parsers. Still, not all browsers support all named character entities. If you want to be on the safe side, use numeric character references (NCR).
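
A small Python sketch of the point above, using the standard library's `xml.etree.ElementTree` as a stand-in for a generic XML parser that is given no DTD: the five predefined entities and numeric character references parse, while an XHTML-only named entity such as `nbsp` does not. The helper name `parses` and the sample markup are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The five entities predefined by XML itself.
predefined_ok = parses("<p>&lt; &gt; &amp; &quot; &apos;</p>")

# A named entity that only the XHTML DTDs define; no DTD is supplied here.
named_ok = parses("<p>A&nbsp;B</p>")

# The same character written as a numeric character reference.
numeric_ok = parses("<p>A&#160;B</p>")

# The zero width space from the start of this thread, as an NCR.
zwsp_text = ET.fromstring("<p>A&#x200B;B</p>").text
```

Here `named_ok` comes out `False` because the parser reports an undefined entity, while the other two checks pass and `zwsp_text` contains the actual U+200B character.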
Tommy Olsson
