Character Encoding
The translation of computer binary into human-readable characters.
Character sets
A character is any letter, number, space, punctuation mark, or symbol that can be typed on a computer.
A character set is a family of characters.
Character sets distinguish printable characters, which can be read from the screen or printed out on paper, from non-printable ones.
Examples of non-printable characters are control codes such as “Delete”, “Escape” and “Backspace”, which act as commands rather than visible symbols. (These date back to the days before graphical user interfaces, when the keyboard, rather than a mouse, was used to operate the computer.)
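As a side note, a minimal Python sketch can show the printable/non-printable split in practice (the character choices here are just illustrative):

```python
# Printable characters render as visible symbols; control codes do not.
for ch in ['a', '7', '!', '\x7f', '\x1b']:   # \x7f = Delete, \x1b = Escape
    print(repr(ch), 'printable:', ch.isprintable())
```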
Character encoding
Character encoding is the translation of binary to characters. Without character encoding, we humans would have to read and write binary in order to operate a computer.
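As a quick illustration, here is a minimal Python sketch of that translation in both directions:

```python
# Decoding: an encoding (here ASCII) gives a raw bit pattern its meaning.
raw = bytes([0b01100001])      # the byte 01100001
print(raw.decode('ascii'))     # -> a

# Encoding: the reverse direction, from character back to bytes.
print('a'.encode('ascii'))     # -> b'a'
```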
Types of character sets
ASCII character set
One of the first character sets was ASCII, standardized in the 1960s by the American Standards Association (the forerunner of today’s American National Standards Institute).
Below is an extract from the ASCII binary character table, which illustrates how a binary number converts to an ASCII character.
| Binary | Character |
|---|---|
| 01100001 | a |
| 01100010 | b |
| 01100011 | c |
| 01100100 | d |
| 01100101 | e |
| 01100110 | f |
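For the curious, the rows of this table can be reproduced with a short Python sketch:

```python
# Print the 8-bit binary ASCII code for each character in the table above.
for ch in 'abcdef':
    print(format(ord(ch), '08b'), '|', ch)
```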
The problem with ASCII is that all of the characters in the ASCII character set are limited to the English alphabet (the encoding is formally registered as “US-ASCII”). There are no characters for other languages, such as ä or é, so the ASCII character set can’t be used for translating binary to Swedish: instead of Hej världen, you would get Hej v�rlden.
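This failure is easy to reproduce in Python (a sketch; the exact number of � marks depends on how the bytes for ä were produced):

```python
text = 'Hej världen'

# ASCII simply has no code for 'ä', so encoding fails outright:
try:
    text.encode('ascii')
except UnicodeEncodeError as err:
    print(err)

# Decoding non-ASCII bytes as ASCII, replacing what cannot be read,
# produces the garbled text shown above ('ä' is two bytes in UTF-8,
# so it becomes two replacement marks here):
print(text.encode('utf-8').decode('ascii', errors='replace'))  # Hej v��rlden
```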
Unicode character set
Later came Unicode. This character set has been under development since 1991, and over the years it has continued to grow as more and more characters are added. For instance, in…
1991 – Characters from the Greek, Thai and Hebrew alphabets were added (e.g. Ω, β).
2003 – Ancient scripts such as the Cypriot syllabary were added.
Most recently, emoji icons (e.g. 😀 😎 🙀) were added.
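Each of these characters has a fixed Unicode code point, independent of any particular encoding; a small Python sketch:

```python
# Look up the Unicode code point of each character.
for ch in ['Ω', 'β', '😀']:
    print(ch, 'U+%04X' % ord(ch))   # Ω U+03A9, β U+03B2, 😀 U+1F600
```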
Types of encoding
The first character encoding was ASCII encoding, which uses the ASCII character set. Later came encodings that use the Unicode character set: UTF-7, UTF-8, UTF-16 and UTF-32.
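The practical difference between them is how many bytes each uses per character, as this Python comparison sketches (UTF-7 is omitted, as it is rarely used today):

```python
text = 'Hej!'
for enc in ('utf-8', 'utf-16', 'utf-32'):
    data = text.encode(enc)
    print(enc, len(data), 'bytes:', data.hex(' '))
# utf-8 needs 1 byte per ASCII character; utf-16 and utf-32 need 2 and 4,
# plus a leading byte-order mark (BOM) in Python's output.
```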
Which one do I use?
Not only are people using UTF-8 for their pages, but Unicode encodings are the basis of the Web itself. All browsers use Unicode internally, and convert all other encodings to Unicode for processing. As do all search engines. All modern operating systems also use Unicode internally. It has become part of the fabric of the Web. The W3C strongly recommends that content authors should only use the UTF-8 encoding for their documents. This is partly to avoid the security risks associated with some encodings, but also to ensure world-wide usability of Web pages. It also gives you much more flexibility about what characters you can include in your web page without special escapes, from copyright symbols to emoji.
So how do we declare the character encoding?
HTTP
Content-Type: text/html; charset=utf-8
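The header itself is set by the web server or application; for example, a minimal sketch with Python’s standard http.server (the port number is arbitrary):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Utf8Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # Declare the encoding in the Content-Type header, exactly as above.
        self.send_header('Content-Type', 'text/html; charset=utf-8')
        self.end_headers()
        self.wfile.write('<p>Hej världen</p>'.encode('utf-8'))

HTTPServer(('localhost', 8000), Utf8Handler).serve_forever()
```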
HTML
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
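In HTML5 the shorter form <meta charset="utf-8"> is equivalent, and is the form generally recommended today.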
CSS
@charset "UTF-8";
XML
<?xml version="1.0" encoding="utf-8"?>
Last modified: October 2, 2018
Mark Endley