Tips on Character Encoding

When typing in a word processor, you can easily add in all sorts of characters or glyphs like &, <, > and so on: basically whatever characters are contained within the fonts on your computer. It’s not necessarily so simple when you are authoring a web page. There are some extra considerations to take into account to ensure that all characters appear as intended.

Some basic keyboard characters—such as &, < and >—have special meaning in markup languages like HTML and XML. For example, if you want to display <p> on your web page, typing <p> in your code won’t work—because the greater than and less than symbols constitute markup to tell a browser to render what follows as a paragraph. You will need to do something different if you want <p> to appear on your web page. You may also want to display characters that do not appear on your keyboard, such as a Copyright symbol, Greek letters or mathematical symbols. This is where “character encoding”, “character references” and “entity references” come in.

For a long time I was confused about character encoding and the rest (and I can’t claim to know all about them now). It was a very frustrating exercise to try to get clear information on these topics. Here I attempt to offer some clarification on what it’s all about. (I’m happy to be corrected if I’ve made mistakes, too.)

Firstly, the terminology is very confusing—with lots of inconsistencies—so I would rather start by just demonstrating what to do, then look at naming afterwards.

Take the example of & (the “ampersand”), a regular keyboard character. To display this character on your page, you have a few choices when writing your code:

just type “&” (but beware of certain pitfalls)
replace & with &amp;
replace & with &#x0026; or &#x26;
replace & with &#38;

A few points about each option follow …

Just type the character

The first option (just typing an unencoded “&”) will generally work in HTML, though XML doesn’t like it. (An example where it will not work is when it appears in a URL. In that case, you will run into problems. You will need to use one of the other solutions listed above, which is otherwise known as “escaping” the character.)
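As a rough illustration of escaping an ampersand inside a URL, here is a small Python sketch. The URL and variable names are invented for the example; `html.escape` is a standard-library function, used here as one convenient way to do the replacement, not the only one.

```python
import html

# Hypothetical URL with an ampersand separating two query parameters.
url = "results.php?page=2&sort=date"

# html.escape replaces &, < and > (and, by default, quote characters)
# with references that are safe to place in markup.
escaped_url = html.escape(url)

link = f'<a href="{escaped_url}">Next page</a>'
print(link)
```

The browser converts the escaped ampersand back to a plain & before following the link, so the URL itself is unchanged.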

With other characters you may not be so lucky. For example, if you place the € (Euro) symbol in your HTML, it may or may not appear correctly on-screen. A determining factor may be the character encoding of your page.

Character Encoding

Two common character encodings are ISO-8859-1 and UTF-8.

How do you know if your page’s character encoding is ISO-8859-1, UTF-8 or something else? It is common to have something like this in the <head> section of your document:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

However, this may not represent the real encoding of your page. Normally, your server is set up to serve your pages with a pre-defined encoding, which may or may not be the same as the encoding in your page’s meta tag. The server’s encoding overrides the one in your meta tag, so the meta tag is generally only useful when your page is viewed offline.

Even if you do not have ready access to your server configuration, you can easily determine which encoding your server is using. One way is to use an online service such as Mozilla Web-sniffer. In the Opera browser, you can also find this information by clicking View > Toolbars > Panels, then the + at the bottom of the Panels bar (left of screen), and enabling Info. The Info tab displays page information, including the character encoding. (In Firefox you can also click View > Character Encoding.)
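If you would rather check this from a script, the encoding a server announces normally travels in the Content-Type response header. The sketch below only parses such a header value; the function name and the sample header strings are made up for illustration, and no request is actually sent.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset named in a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset parameter in lower case,
    # or None when the header does not name one.
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=ISO-8859-1"))
print(charset_from_content_type("text/html"))
```

With a live page, the response object returned by `urllib.request.urlopen` offers the same lookup through `response.headers.get_content_charset()`.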

If the character encoding being sent by your server is not to your liking, you can change this if you have access to your server configuration. If you don’t have such access, you can also specify your desired encoding via an .htaccess file.

UTF-8 and ISO-8859-1

OK, so what do UTF-8 and ISO-8859-1 stand for?

They both represent a defined set of characters. UTF-8 has a bigger set of characters than ISO-8859-1, so it allows you to add a greater range of unencoded (or “unescaped”) characters to your HTML than ISO-8859-1, which only contains “Latin” characters. (So, when you type the Euro symbol (€) into your code, it will appear on screen as intended if you are using UTF-8 encoding, but it would not render properly using ISO-8859-1, because that character is not included in the ISO-8859-1 list.)
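The Euro example can be checked with a few lines of Python. This is only a sketch of the idea: it tries to turn the character into bytes under each encoding and records whether that succeeds. The variable names are arbitrary.

```python
euro = "€"

# UTF-8 can represent the Euro sign (it takes three bytes).
utf8_bytes = euro.encode("utf-8")

# ISO-8859-1 has no slot for the Euro sign, so encoding fails.
try:
    euro.encode("iso-8859-1")
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False

print(utf8_bytes)
print(fits_in_latin1)
```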

The “set” of characters from which both UTF-8 and ISO-8859-1 are drawn is called the Universal Character Set (UCS). UCS is closely associated with Unicode, and the two are effectively one and the same. UCS and Unicode contain over 100,000 characters, encompassing alphabets from around the world, mathematical symbols and so on.

Generally speaking, UTF-8 is the best character encoding to use, as it is able to represent all of the available Unicode characters. Even so, if you are using an encoding like ISO-8859-1, you can still represent any UCS/Unicode character by using a special kind of reference code. Each UCS character has a special “character reference” that you can use. Using such a character reference helps to ensure that the character appears on your page, no matter what character encoding you are using.
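Python happens to have this fallback built in, which makes for a compact demonstration. In the sketch below (the sample sentence is invented), text is encoded as ISO-8859-1 and any character that does not fit is replaced by its numeric character reference through the standard `xmlcharrefreplace` error handler.

```python
text = "Price: 10 € (café)"

# "é" exists in ISO-8859-1 and is stored as a single byte;
# "€" does not, so it is written out as a character reference instead.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(latin1_bytes)
```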

Character References

Each UCS character is identified by a distinct number (also called a “code point”). This code point can be written in two forms: a “numeric character reference” (decimal) and a “hexadecimal character reference”. These character references can be used both to represent a myriad of uncommon symbols on-screen, and to “escape” common characters that may have special meaning for markup languages like HTML.

To go back to the example of &, if you look this up on the UCS chart, you find this information: U+0026 (38).

The first part (U+0026) is the hexadecimal reference, and the 38 is the numeric reference.

If you want to use the hexadecimal reference to represent &, you take the 0026 and present it like this: &#x0026; (or you can leave out the initial zeros and use this: &#x26;).

To use the numeric reference, you take the number 38 and present it like this: &#38;.
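The two forms can be derived mechanically from a character’s code point. Here is a minimal Python sketch; the two helper function names are my own, while `ord` and `html.unescape` are standard.

```python
import html


def decimal_reference(char):
    """Build the numeric (decimal) character reference for one character."""
    return f"&#{ord(char)};"


def hex_reference(char):
    """Build the hexadecimal character reference for one character."""
    return f"&#x{ord(char):X};"


amp_decimal = decimal_reference("&")
amp_hex = hex_reference("&")

print(amp_decimal, amp_hex)

# Decoding either form gives back the original character.
print(html.unescape(amp_decimal), html.unescape(amp_hex))
```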

Entity References

Some characters have what’s called an “entity reference” associated with them as well as character references. For example, to represent & on your page you can use &amp;, the ampersand’s entity reference. For the Euro symbol you can use &euro;, and for the copyright symbol (©) you can use &copy;. As you can see, entity references are characterized by a series of letters that in some way form a shorthand name of the character (hence they are also sometimes called “named entities”). There are some 250 such entity references.

Some other common entity references are &mdash; (an “em” dash), &ndash; (an “en” dash), &rdquo; (right-hand double quotes), &lt; (the “less than” symbol), &gt; (the “greater than” symbol), and so on.

The limitations here are that many useful symbols don’t have such an entity reference; and if you are working with XML, only five of these entity references can be used, namely &lt;, &gt;, &amp;, &quot;, and &apos;. (Just to top it all off, HTML 4 does not recognize this last one!)
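To see how entity names relate to code points, Python’s standard library ships a table of the HTML 4 entity names. The sketch below looks up a handful of them; the choice of names and the sample string are arbitrary.

```python
import html
from html.entities import name2codepoint

# Map a few entity names to their code points.
lookups = {name: name2codepoint[name] for name in ("amp", "euro", "copy", "mdash")}

for name, code_point in lookups.items():
    # Show the entity reference, the equivalent numeric reference,
    # and the character both of them stand for.
    print(f"&{name};  &#{code_point};  {chr(code_point)}")

# html.unescape turns entity references back into plain characters.
decoded = html.unescape("&lt;p&gt; &amp; &rdquo;")
print(decoded)
```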

It really doesn’t matter whether you use a character reference or an entity reference to represent a character (unless, of course, you are writing XML). Both are equally good (although there is one slight warning about hexadecimal character references noted below). An obvious advantage of entity references is that they are easier to read and remember.

The important point is that—in most if not all cases—it is a much safer practice to escape any character that does not appear on your keyboard, and any that do appear on your keyboard that have special meaning as markup.

Displaying code

One instance of when you must escape a character is when you are displaying code on your web pages, as mentioned above. If you want to display code like <div>, for example, you cannot simply type this into your HTML as is, since the browser will interpret it as a div element and try to render it accordingly. You must replace the < and > symbols with their encoded forms, meaning that you need to write either &lt;div&gt; or &#60;div&#62; in your code. It’s not much fun to type this up, but it’s essential.
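If the code you want to display is generated or stored somewhere, the typing can be left to a script. Here is a minimal Python sketch producing both encoded forms of the same snippet; the variable names are arbitrary, and the numeric version is hand-rolled purely for comparison.

```python
import html

snippet = "<div>"

# Entity references, via the standard library.
named = html.escape(snippet)

# Numeric character references for the same two characters.
numeric = snippet.replace("<", "&#60;").replace(">", "&#62;")

print(named)
print(numeric)
```

Either output can be pasted into a page, and the browser will show the snippet as text rather than treating it as markup.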

Warnings

Unfortunately, browsers are not always able to render all UCS characters. One reason for this is that your computer may not have a font installed that can represent the character. In other words, the display of special characters still depends on there being a font installed on a user’s computer that contains that character. Generally speaking, though, this is not likely to be a big problem, as most computers have a huge range of fonts installed for this purpose.

Another point to note is that hexadecimal references, such as &#x26;, may not work in older browsers.

If you copy text from a text editor and paste it into your UTF-8 page, some characters may not display correctly. This can happen when the copied text has been saved in an encoding other than UTF-8. To stop this happening, first save the original file as UTF-8.
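The general kind of mismatch behind this problem is easy to reproduce. In the Python sketch below (the sample word is arbitrary), text is stored as UTF-8 bytes and then wrongly read back as ISO-8859-1, which produces the familiar garbled characters. Reading the bytes with the right encoding restores the text.

```python
original = "café"

# Store the text as UTF-8 bytes ("é" becomes two bytes).
utf8_bytes = original.encode("utf-8")

# Read those bytes with the wrong encoding: each byte becomes its own character.
garbled = utf8_bytes.decode("iso-8859-1")

# Undo the mistake by reversing the wrong step and decoding correctly.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled)
print(repaired)
```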

You don’t have to worry about this if you are using entity or character references, because they are independent of a document’s character encoding. People often argue that it is pointless to use character references in a UTF-8 document, as they involve more code, they can get quite messy if you are using a lot of them, and they are easy to mis-type. However, character references do have their uses even in UTF-8 documents. As pointed out by Tommy Olsson in a SitePoint discussion, a few characters, such as the soft-hyphen and non-breaking space, are best inserted with their character references (&#173; and &#160; respectively).

Below I provide a few links to charts that lay out the UCS character references and the smaller number of entity (or named) references.

Acknowledgements & Links

For a nice starting list of character references, try this Wikipedia entry, or for the full list, explore the official Unicode Code Charts.
The HTML 4.01 Specification provides this full list of entity references.
The Leftlogic site lets you type in a character or character name and see the necessary code. You can also download a desktop widget that does the same thing. Really nice resource.
CopyPasteCharacter.com has a fun page that allows you to copy a range of characters as text or HTML code, which you can paste into a document. A similar service is found here.
Some nice entity reference lists are offered by EntityCode and Escape Codes.
Other useful Wikipedia entries include Unicode and Unicode and HTML.
Another useful resource is The Definitive Guide to Web Character Encoding, by Tommy Olsson.