[Encoding] Character Encoding



If you ever thought that:
plain text = ASCII = characters are 8 bits
you are SUPER wrong!


History of Character Encoding

ASCII (0-127)

Back when Unix was being invented, the only characters that mattered were unaccented English letters, and we had a code for them called ASCII. ASCII was able to represent every character using a number between 32 and 127.
E.g. space was 32, and the letter "A" was 65.
This could all be stored in 7 bits.

Most computers back then used 8-bit bytes, so not only could you store every possible ASCII character, you still had one bit to spare.
Codes below 32 were called unprintable. They were used for control characters: for example, 7 made your computer beep, and 12 caused the current page of paper to go flying out of the printer.
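A quick illustration in C (character literals in C are just their ASCII codes, so this prints the numbers mentioned above):

#include <stdio.h>

int main(void) {
    printf("' '  = %d\n", ' ');   /* 32: space */
    printf("'A'  = %d\n", 'A');   /* 65: the letter A */
    printf("'\\a' = %d\n", '\a'); /* 7: the bell control character (beep) */
    return 0;
}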

Everything was good - assuming you were an English speaker.

Extended ASCII (128-255)

Because bytes have room for 8 bits, lots of people started thinking about how they could use codes 128-255 for their own purposes - and many people had different ideas.
For example, the IBM PC came with something known as the OEM character set, which provided some accented characters for European languages and a bunch of line-drawing characters: horizontal and vertical bars for building tables, boxes, and simple drawings.

However, if you bought a computer outside of America, there were many different OEM character sets in use. For example, on some PCs the character code 130 would display as é (e with an acute accent), but on computers sold in Israel it was the Hebrew letter Gimel (ג).

ANSI Standard

Eventually, this OEM free-for-all got codified in the ANSI standard. In this standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 on up, depending on where you lived. These different systems were called code pages.

DBCS (Double byte character set)

In Asia, things were even crazier, because Asian alphabets have thousands of letters, which were never going to fit into 8 bits.
The solution was DBCS, the "double byte character set", in which some letters were stored in one byte and others took two.
It was easy to move forward in a string, but dang near impossible to move backwards.
Programmers were encouraged not to use s++ and s-- to move forwards and backwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev.

Unicode

When the internet happened and strings were routinely copied from one computer to another, character encoding became a real mess. Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet, and some make-believe ones like Klingon, too.

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore, there are 65,536 possible characters. This is not, actually, correct.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory.
Unicode actually has a different way of thinking about characters.

U+ means "Unicode", and the number that follows is hexadecimal.
You can find them all using the charmap utility on Windows 2000/XP or by visiting the Unicode web site (http://www.unicode.org/).

There is no real limit on the number of letters that Unicode can define.

Some background on bits and bytes:

A byte is 8 bits, so it can hold 2^8 = 256 distinct values (0-255). That is also why one byte is conventionally written as two hexadecimal digits (0-9, A-F).
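A tiny C check of that arithmetic (nothing Unicode-specific here):

#include <stdio.h>
#include <limits.h>

int main(void) {
    printf("bits per byte:   %d\n", CHAR_BIT);       /* 8 on essentially every modern platform */
    printf("values per byte: %d\n", 1 << CHAR_BIT);  /* 2^8 = 256, i.e. 0-255 */
    printf("255 in hex:      %X\n", 255);            /* FF - one byte is two hex digits */
    return 0;
}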

Code Point 

Every letter in every alphabet is assigned a magic number by the Unicode Consortium, which is written like this: U+0639.
This magic number is called a code point; each code point maps to one letter.

Example

For example, the string "Hello" corresponds in Unicode to these five code points:
U+0048 U+0065 U+006C U+006C U+006F.

(As we'll see below, UTF-8 happens to store exactly those values as bytes: 48 65 6C 6C 6F.)

Up to this point, though, they are really just a bunch of code points. How to store them as bytes - and how to do it efficiently - is explained later on.
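A minimal C sketch of that mapping. It only works this directly because every character of "Hello" is in the ASCII range, where the byte value and the Unicode code point happen to coincide:

#include <stdio.h>
#include <string.h>

int main(void) {
    const char *s = "Hello";
    /* For ASCII characters, the byte value equals the Unicode code point. */
    for (size_t i = 0; i < strlen(s); i++)
        printf("U+%04X ", (unsigned char)s[i]);
    printf("\n");   /* prints: U+0048 U+0065 U+006C U+006C U+006F */
    return 0;
}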

UTF-8 (Encoding)

We now explain UTF-8 - a system for storing your string of Unicode code points as bytes.

In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, or (in the original design) up to 6 bytes.

If code points 0-127 each fit in a single byte in UTF-8, why does one byte cover only 128 code points, rather than 256 (2^8)?

This has to do with the structural design put into UTF-8:
  1. Backwards compatibility
    • UTF-8 is designed to be backwards compatible with ASCII (which had the standard 128 characters).
  2. Prefix code
    • In UTF-8, the first few bits of the leading byte indicate how many bytes the particular character spans (see the encoder sketch after this list).
    • n-byte characters have the following formats:
      • 1-byte characters are in the form of 0xxx xxxx
        2-byte characters are in the form of 110x xxxx 10xx xxxx
        3-byte characters are in the form of 1110 xxxx 10xx xxxx 10xx xxxx
        4-byte characters are in the form of 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
  3. Self-synchronization (continuation bytes)
    • In a multi-byte UTF-8 sequence, the 2nd through 6th bytes all start with 10; these are called 'continuation bytes'. They tell us to keep reading bytes before stopping to interpret the character they represent (shown in the n-byte formats above).
    • This means a search will not accidentally find the sequence for one character starting in the middle of another character.
    • It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte.
    • An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
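To make the bit patterns above concrete, here is a small C sketch of a UTF-8 encoder for a single code point. It is simplified (it does not reject surrogate code points or other invalid input), and the function name utf8_encode is just made up for this example:

#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode code point into UTF-8. Returns the number of bytes
   written to out (1-4), or 0 if the code point is out of range.
   Simplified sketch: does not reject surrogates (U+D800-U+DFFF). */
static int utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {              /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {             /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {           /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

int main(void) {
    uint32_t examples[] = { 0x48, 0x639, 0x20AC };  /* 'H', Arabic letter Ain, Euro sign */
    for (int i = 0; i < 3; i++) {
        unsigned char buf[4];
        int n = utf8_encode(examples[i], buf);
        printf("U+%04X ->", (unsigned)examples[i]);
        for (int j = 0; j < n; j++) printf(" %02X", buf[j]);
        printf("\n");   /* U+0048 -> 48, U+0639 -> D8 B9, U+20AC -> E2 82 AC */
    }
    return 0;
}

Note how the ASCII code point stays a single byte while the other two grow to two and three bytes.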

Other unicode encodings

There are actually other ways to store Unicode.

  • UCS-2 is a two-byte method: every code point is stored in exactly 16 bits.
  • UTF-16 also uses 16-bit units, but combines two of them (a 'surrogate pair') to represent code points above U+FFFF.
  • UTF-7 exists as well. It is similar in spirit to UTF-8, but guarantees that the high bit of every byte is always zero.
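A small comparison of the two-byte and one-byte approaches, as a sketch. It assumes a C11 compiler (for char16_t and the u"..." literal) and uses only ASCII characters, so no surrogate pairs come into play:

#include <stdio.h>
#include <uchar.h>   /* char16_t (C11) */

int main(void) {
    const char16_t utf16[] = u"Hello";  /* one 16-bit code unit per character here */
    const char     utf8[]  = "Hello";   /* one byte per character, since all are ASCII */

    printf("UTF-16/UCS-2 code units:");
    for (int i = 0; utf16[i] != 0; i++) printf(" %04X", (unsigned)utf16[i]);
    printf("\nUTF-8 bytes:            ");
    for (int i = 0; utf8[i] != 0; i++) printf(" %02X", (unsigned char)utf8[i]);
    printf("\n");   /* 10 bytes of storage vs. 5 bytes for the same five code points */
    return 0;
}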

You can also encode Unicode text into traditional encodings such as ASCII or the old OEM code pages, but any character that has no equivalent in the target encoding will usually show up as a question mark (?), a box, or a diamond with a question mark.

The single most important fact about Encodings

"There ain't no such thing as plain text".

  • It doesn't make sense to have a string without knowing what encoding it uses.
  • If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.


Web and Email - How do we preserve the encoding of an email and website? 

Email

In an email message, you are expected to have a string in the header of the form
Content-Type: text/plain; charset="UTF-8"

Web Page

For a web page, the original idea was that the web server would return a similar Content-Type HTTP header along with the page itself - not in the HTML, but as one of the response headers. However, there is a problem: what if different pages on the same server use different encodings (because they come from different contributors)?

Therefore, encoding info is added into the HTML itself, specifically in the meta tag (inside head, inside html).


How can we read the HTML file if we don't know what encoding it's in?

Luckily, almost every encoding represents the characters between 32 and 127 the same way, which is enough to read as far as the meta tag containing the encoding info:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it’s going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

What do web browsers do if they don’t find any Content-Type, either in the http headers or the meta tag? 

Internet Explorer actually does something quite interesting: based on the frequency of various bytes, it tries to guess what encoding is used.

Because the various old 8 bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working.
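As a flavor of how such guessing can work, here is a much simpler heuristic than IE's histogram approach (the function name looks_like_utf8 and the sample data are made up for this sketch): if every multi-byte sequence in the data is structurally valid UTF-8, guess UTF-8; otherwise assume a legacy 8-bit code page.

#include <stdio.h>
#include <stddef.h>

/* Illustrative heuristic only - not any browser's real detection code. */
static int looks_like_utf8(const unsigned char *p, size_t n) {
    size_t i = 0;
    while (i < n) {
        unsigned char b = p[i];
        size_t extra;
        if (b < 0x80)                extra = 0;  /* 0xxxxxxx: ASCII */
        else if ((b & 0xE0) == 0xC0) extra = 1;  /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0) extra = 2;  /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0) extra = 3;  /* 11110xxx */
        else return 0;                           /* invalid leading byte */
        if (i + extra >= n) return 0;            /* truncated sequence */
        for (size_t k = 1; k <= extra; k++)
            if ((p[i + k] & 0xC0) != 0x80) return 0;  /* expected a 10xxxxxx continuation byte */
        i += extra + 1;
    }
    return 1;
}

int main(void) {
    /* Hebrew Gimel (U+05D2) in UTF-8, followed by ASCII text */
    const unsigned char utf8_sample[]   = { 0xD7, 0x92, 'H', 'i' };
    /* A lone high byte, typical of an old single-byte code page */
    const unsigned char legacy_sample[] = { 0x82, 'H', 'i' };

    printf("sample 1: %s\n", looks_like_utf8(utf8_sample, sizeof utf8_sample)
                                 ? "guess UTF-8" : "guess legacy code page");
    printf("sample 2: %s\n", looks_like_utf8(legacy_sample, sizeof legacy_sample)
                                 ? "guess UTF-8" : "guess legacy code page");
    return 0;
}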

A Use Case of Encoding: CityDesk

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t (“wide char”) instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it as so: L"Hello".

Note: in this code, char holds the ASCII character set (one byte per character), while wchar_t is the wide character type; on Windows a wchar_t is 2 bytes and holds UCS-2/UTF-16 code units.
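A minimal sketch of the narrow vs. wide string functions mentioned above (the exact size of wchar_t depends on the platform: 2 bytes on Windows, 4 on most Unix systems):

#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(void) {
    const char    *narrow = "Hello";   /* char string: one byte per character */
    const wchar_t *wide   = L"Hello";  /* wide string literal, as in the CityDesk example */

    printf("strlen(narrow) = %zu chars, %zu bytes\n",
           strlen(narrow), strlen(narrow) * sizeof(char));
    printf("wcslen(wide)   = %zu chars, %zu bytes\n",
           wcslen(wide), wcslen(wide) * sizeof(wchar_t));
    return 0;
}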

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.


Resources:
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
https://janav.wordpress.com/2013/04/27/unicode-and-utf-8/
https://en.wikipedia.org/wiki/UTF-8
