An entity reference is, well, a reference to a declared entity: a small code that is meant to be replaced (usually upon display, but also under other kinds of processing) by its declared content. There are a number of different kinds of entity references in SGML and XML, but for our purposes here there are just two relevant ones: named character entity references and numeric character entity references. Other types of SGML and XML entities are described on Wikipedia's SGML Entity page.

All entity references found within a document begin with an ampersand character "&" and end with a semicolon character ";". Generally speaking, tolerant software (including most Web browsers) will leave alone anything that merely looks a bit like an entity reference, e.g., a word that begins with an ampersand but is missing the semicolon; a strict XML parser, by contrast, will report an error. Strictly speaking, literal ampersand characters are not supposed to appear as text in SGML- or XML-based documents (e.g., HTML and XHTML), and should always be escaped by (yes) an entity reference to the ampersand character, "&amp;".
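As a minimal sketch of the escaping rule above, the following uses Python's standard `html` module to turn a literal ampersand into its entity reference and back again. The sample string is just an illustration, not something from any particular page.

```python
from html import escape, unescape

raw = "Fish & Chips"

# quote=False escapes only &, < and >, leaving quotation marks alone.
escaped = escape(raw, quote=False)
print(escaped)  # Fish &amp; Chips

# Unescaping resolves the reference back to the literal character.
round_trip = unescape(escaped)
print(round_trip == raw)  # True
```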

Named Character Entity References

A named character entity reference is a reference to an entity based on a name. In the aforementioned character entity reference for an ampersand, the name is "amp"; hence in use, "&amp;". In HTML and XHTML there are about 250 predeclared named entities, most of them special characters, diacriticals (marks used to denote a difference in the way of pronouncing a letter), and a variety of symbols. These are listed in the HTML and XHTML specifications, as well as on Wikipedia's full list of HTML/XHTML named character entities.

A few examples include the umlauted 'ü' (&uuml;), the em dash '—' (&mdash;), the Japanese yen symbol '¥' (&yen;), the copyright symbol '©' (&copy;), and the degree sign '°' (&deg;).
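To check what a name stands for, one option is Python's standard `html.entities` module, whose `name2codepoint` table maps HTML entity names to Unicode code points. The sketch below looks up the five names used in the examples above; the variable names are purely illustrative.

```python
from html import unescape
from html.entities import name2codepoint

names = ["uuml", "mdash", "yen", "copy", "deg"]

# Map each entity name to the character it stands for.
resolved = {name: chr(name2codepoint[name]) for name in names}

for name, char in resolved.items():
    # unescape() resolves the full reference, ampersand and semicolon included.
    assert unescape(f"&{name};") == char
    print(f"&{name}; -> {char} (U+{ord(char):04X})")
```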

Numeric Character Entity References

Numeric character entity references are very similar to the named entities, except that instead of a name we find a number. The number is called a code point, and is a position within the enormous body of characters defined by the Unicode standard, which includes a lot of special symbols and the scripts of most of the world's languages. (A script for Klingon was even proposed for inclusion, though the proposal was rejected, which might suggest a high level of geekiness around the Unicode consortium.)

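The lowest 128 code points of Unicode (0 through 127) coincide with the US ASCII standard, which is the basis for most of the world's computers. A numeric reference in that range therefore denotes the corresponding ASCII character, although most of the control characters below 32 are not permitted in XML documents.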
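The overlap with ASCII can be checked directly. This short sketch decodes all 128 ASCII byte values and confirms that each one lands on the Unicode code point with the same number.

```python
# Every possible 7-bit ASCII byte value, 0 through 127.
ascii_bytes = bytes(range(128))
decoded = ascii_bytes.decode("ascii")

# Each decoded character should sit at the code point equal to its byte value.
same = all(ord(char) == value for value, char in enumerate(decoded))
print(same)  # True
```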

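You can write a numeric reference in either decimal or hexadecimal: "&#" followed by the decimal number, or "&#x" followed by the hexadecimal number, each ending with the usual semicolon. For example, "&#65;" and "&#x41;" both denote an uppercase 'A'. The hexadecimal form is popular because the Unicode standard uses hexadecimal in its code tables.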
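A small sketch of going the other way, from a character to both forms of its numeric reference. The helper name `numeric_references` is made up for this example.

```python
from html import unescape

def numeric_references(char):
    """Return the (decimal, hexadecimal) numeric references for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"

for char in "A\u2663":  # uppercase A and the black club suit
    decimal, hexadecimal = numeric_references(char)
    # Both forms should resolve back to the same character.
    assert unescape(decimal) == unescape(hexadecimal) == char
    print(decimal, hexadecimal)
```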

The Unicode code table contains the entire list of Unicode characters. A small portion of the table looks like:

0403;CYRILLIC CAPITAL LETTER GJE;Lu;0;L;0413 0301;;;;N;;;;0453;
0405;CYRILLIC CAPITAL LETTER DZE;Lu;0;L;;;;;N;;;;0455;
0407;CYRILLIC CAPITAL LETTER YI;Lu;0;L;0406 0308;;;;N;;Ukrainian;;0457;
0408;CYRILLIC CAPITAL LETTER JE;Lu;0;L;;;;;N;;;;0458;
0409;CYRILLIC CAPITAL LETTER LJE;Lu;0;L;;;;;N;;;;0459;
040B;CYRILLIC CAPITAL LETTER TSHE;Lu;0;L;;;;;N;;Serbocroatian;;045B;
040C;CYRILLIC CAPITAL LETTER KJE;Lu;0;L;041A 0301;;;;N;;;;045C;
040E;CYRILLIC CAPITAL LETTER SHORT U;Lu;0;L;0423 0306;;;;N;;Byelorussian;;045E;
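Each row above is a list of semicolon-separated fields. The sketch below splits one row into named fields, assuming the excerpt follows the 15-field layout of the Unicode Character Database file UnicodeData.txt (which is what it appears to be). The field labels in `FIELDS` are informal names chosen for this example, not official identifiers.

```python
# Informal labels for the 15 fields; assumed layout, see the note above.
FIELDS = [
    "code", "name", "category", "combining", "bidi", "decomposition",
    "decimal", "digit", "numeric", "mirrored", "old_name", "comment",
    "uppercase", "lowercase", "titlecase",
]

def parse_line(line):
    """Split one table row into a dict and add the character itself."""
    record = dict(zip(FIELDS, line.split(";")))
    record["char"] = chr(int(record["code"], 16))
    return record

record = parse_line("0409;CYRILLIC CAPITAL LETTER LJE;Lu;0;L;;;;;N;;;;0459;")
print(record["char"], record["name"], record["lowercase"])
```

The first field is the hexadecimal code point, so it is the number you would place after "&#x" in a numeric reference.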

A few examples of numeric character entities:

Character  Unicode code point: hex (decimal)  In Use                  Description
A          U+0041 (65)                        &#x41; or &#65;         uppercase 'A'
Z          U+005A (90)                        &#x5A; or &#90;         uppercase 'Z'
&          U+0026 (38)                        &#x26; or &#38;         ampersand
\          U+005C (92)                        &#x5C; or &#92;         backslash
>          U+003E (62)                        &#x3E; or &#62;         greater-than sign
♣          U+2663 (9827)                      &#x2663; or &#9827;     black club suit
Љ          U+0409 (1033)                      &#x409; or &#1033;      Cyrillic capital letter LJE
か         U+304B (12363)                     &#x304B; or &#12363;    Hiragana letter KA

(Note that if you don't see a black club suit symbol, a Cyrillic capital letter LJE, or a Japanese hiragana KA in the first column of the table above, this simply means that the fonts necessary to display these characters aren't installed on your computer.)

Special Entity References

In SGML and XML there are a few "special" named entities, used within the markup language itself. These include:

Name  Character  Unicode code point: hex (decimal)  Description
quot  "          U+0022 (34)                        quotation mark
amp   &          U+0026 (38)                        ampersand
apos  '          U+0027 (39)                        apostrophe
lt    <          U+003C (60)                        less-than sign
gt    >          U+003E (62)                        greater-than sign

Note that these special characters are automatically escaped (turned into entity references) when they appear in wiki text; you don't have to escape them yourself.
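Outside the wiki, you may need to do this escaping yourself. The following is a rough sketch using `escape` from Python's standard `xml.sax.saxutils` module, which handles "&", "<" and ">" by default and accepts a dictionary of extra replacements. It is only an illustration and not how this wiki performs its own escaping; `EXTRA` and `escape_special` are names invented here.

```python
from xml.sax.saxutils import escape

# escape() covers &, < and > itself; quot and apos are supplied as extras.
EXTRA = {'"': "&quot;", "'": "&apos;"}

def escape_special(text):
    """Replace all five special characters with their named references."""
    return escape(text, EXTRA)

sample = '<a href="x">Tom & Jerry\'s</a>'
escaped_sample = escape_special(sample)
print(escaped_sample)
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&apos;s&lt;/a&gt;
```

The ampersand is replaced first, so the ampersands introduced by the other replacements are not escaped a second time.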

Font Support

The ability of a Web browser to display these references depends on the fonts installed on the machine doing the viewing, though nowadays this support is getting better. For example, many Linux distributions now include support by default for European, Arabic, Chinese, Japanese, and other Asian languages. Since you as author have no control over the fonts installed on the world's computers, you can only do your best in providing the entity references. This is actually better than using a character from a proprietary-encoded character set on a proprietary operating system, as nobody but other people using that OS will see the character correctly. With Unicode, everyone who has a suitable font installed does.

[NOTE: This page is a work in progress. -- MurrayAltheim]

See also: Wikipedia's HTML Character Entity Reference, or the full list of HTML/XHTML named character entities.

This page (revision-1) was last changed on 18-Apr-2007 15:31 by MurrayAltheim