
ASCII

ord() and chr()

# a text like this one

text1 = "abcd\nefgh\n"

is actually encoded in memory according to ASCII:
https://www.man7.org/linux/man-pages/man7/ascii.7.html

# and to confirm that, we can use ord()
# which applies to a single char (and raises an exception otherwise)

for c in text1:
    print(f"{c} -> {ord(c)}")
a -> 97
b -> 98
c -> 99
d -> 100

 -> 10
e -> 101
f -> 102
g -> 103
h -> 104

 -> 10
# FYI, note that we also have the inverse function of ord(),
# which is called chr()

chr(97)
'a'
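as a quick sanity check, we can verify the round-trip on the whole string:

# chr(ord(c)) should give back c for every character
all(chr(ord(c)) == c for c in text1)
True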

write into a text file

from pathlib import Path
# remember to always use a with statement when dealing with files

with Path('encodings1-utf8').open('w', encoding='UTF-8') as f:
    f.write(text1)

inspect that file with a hex editor - e.g. you can use VS Code with the Hex Editor extension

you will see that, with ASCII-only characters, your file has exactly one byte per character
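if you don't have a hex editor at hand, here is a small sketch to check the same thing from Python, by reading the raw bytes back:

# reading in binary mode gives exactly one byte per character here
raw1 = Path('encodings1-utf8').read_bytes()
raw1, len(raw1) == len(text1)
(b'abcd\nefgh\n', True)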

non-ASCII & UTF-8

let us now consider a text with French accents and a cedilla - any non-ASCII character would do

text2 = "abçd\néfgh\n"

inspection

# notice the values > 127
# which are not supported in ASCII

for c in text2:
    print(f"{c} -> {ord(c)}")
a -> 97
b -> 98
ç -> 231
d -> 100

 -> 10
é -> 233
f -> 102
g -> 103
h -> 104

 -> 10
# focus on the 2 characters whose encoding is > 127

hex(231), hex(233)
('0xe7', '0xe9')
bin(231), bin(233)
('0b11100111', '0b11101001')

again, note that chr() and ord() are the inverse of one another

chr(231), ord('ç')
('ç', 231)

write into a text file

with Path('encodings2-utf8').open('w', encoding='UTF-8') as f:
    f.write(text2)

read back

# no encoding in binary mode - it would make no sense!

with Path('encodings2-utf8').open('rb') as f:
    raw = f.read()

and as we read bytes here, we get more bytes than the initial text had characters

len(raw), len(text2)
(12, 10)

it adds up, since each of the 2 alien characters needs 2 bytes to be encoded
(accented European characters usually take 2 bytes; some more exotic chars can take 3 or 4 bytes)
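to double-check that count, here is a small sketch that asks str.encode() for the UTF-8 bytes of each character:

# number of bytes each character needs in UTF-8
for c in text2:
    print(f"{c!r} -> {len(c.encode('utf-8'))} byte(s)")
'a' -> 1 byte(s)
'b' -> 1 byte(s)
'ç' -> 2 byte(s)
'd' -> 1 byte(s)
'\n' -> 1 byte(s)
'é' -> 2 byte(s)
'f' -> 1 byte(s)
'g' -> 1 byte(s)
'h' -> 1 byte(s)
'\n' -> 1 byte(s)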

UTF-8 logic

this table describes how the UTF-8 encoding works:

bytes   codepoint range       byte 1     byte 2     byte 3     byte 4
1       U+0000 .. U+007F      0xxxxxxx
2       U+0080 .. U+07FF      110xxxxx   10xxxxxx
3       U+0800 .. U+FFFF      1110xxxx   10xxxxxx   10xxxxxx
4       U+10000 .. U+10FFFF   11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

a visual example

on a sample 4-character string: été\n
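we can reproduce that visual example in code, by encoding the sample string and displaying each byte:

# 4 characters, but 6 bytes: each é uses 2 bytes
for b in "été\n".encode('utf-8'):
    print(f"byte {b} {hex(b)} {bin(b)}")
byte 195 0xc3 0b11000011
byte 169 0xa9 0b10101001
byte 116 0x74 0b1110100
byte 195 0xc3 0b11000011
byte 169 0xa9 0b10101001
byte 10 0xa 0b1010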

let’s check that

on our own data

# here it is again
raw
b'ab\xc3\xa7d\n\xc3\xa9fgh\n'

let us number the contents of raw

index:  0  1  2     3     4  5   6     7     8  9  10  11
byte:   a  b  \xc3  \xa7  d  \n  \xc3  \xa9  f  g  h   \n
# extract the 2-bytes areas for each alien character

ccedilla = raw[2:4]
eaccent = raw[6:8]
for b in ccedilla:
    print(f"byte {b} {hex(b)} {bin(b)}")
byte 195 0xc3 0b11000011
byte 167 0xa7 0b10100111
for b in eaccent:
    print(f"byte {b} {hex(b)} {bin(b)}")
byte 195 0xc3 0b11000011
byte 169 0xa9 0b10101001

sounds good

decode manually (gory details)

this is totally optional of course, but if we wanted to do the decoding ourselves...
(you may skip to the next section)

# we want 5 bits from the first byte and 6 from the second byte

on2bytes_0_len = 5
on2bytes_1_len = 6

# and that's what should occur in the remaining (left-hand-side) bits
on2bytes_0_pad = 0b110
on2bytes_1_pad = 0b10
def mask_from_len(length):
    """
    for e.g. len == 5, we compute a mask that has
    3 bits set and 5 bits unset (because 3+5=8)
    """
    return 2**8 - 2**length
# let us check that it works as advertised:

# e.g. for byte0
# the result allows us to separate
# the (3-bits) padding from 
# the (5-bits) payload

bin(mask_from_len(5))
'0b11100000'
# with that we can manually decode 2-byte UTF-8 characters!

on2bytes_0_mask = mask_from_len(on2bytes_0_len)
on2bytes_1_mask = mask_from_len(on2bytes_1_len)

def decode(on2bytes):
    b0, b1 = on2bytes
    # check masks
    # e.g. check that the 3 high bits in 0xc3 are indeed 0b110
    assert (b0 & on2bytes_0_mask) >> on2bytes_0_len == on2bytes_0_pad
    # same on byte 1
    assert (b1 & on2bytes_1_mask) >> on2bytes_1_len == on2bytes_1_pad
    # extract meaningful bits
    # for that we just need to invert the mask
    bits0 = b0 & ~ (on2bytes_0_mask)
    bits1 = b1 & ~ (on2bytes_1_mask)
    # assemble bits into a codepoint
    # b0 has the high bits so it needs to be shifted
    # by the number of meaningful bits in byte1
    codepoint = bits1 | bits0 << on2bytes_1_len
    return chr(codepoint)
# and indeed 
decode(eaccent), decode(ccedilla)
('é', 'ç')
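as a cross-check, Python's built-in bytes.decode() gives the same answer:

# the standard decoder agrees with our manual one
eaccent.decode('utf-8'), ccedilla.decode('utf-8')
('é', 'ç')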

exercise

use the table above to write a complete UTF-8 decoder, i.e. one that also handles 1-, 3- and 4-byte sequences

UTF-32

let us now take a quick look at the UTF-32 encoding
this is a fixed-size encoding, meaning each character will use 4 bytes
this is convenient e.g. when you need direct access to the n-th character in a file

write with UTF-32

let’s write our text with 2 alien characters in a second file

with Path("encodings2-utf32").open('w', encoding='utf-32') as f:
    f.write(text2)

size and BOM

however the total file size is not exactly 4*n, and this is due to something called the BOM (Byte Order Mark)

# computing file size: use pathlib!

p = Path("encodings2-utf32")
print(f"file has {p.stat().st_size} bytes")
file has 44 bytes
len(text2)
10

44 is because the file starts with an extra 4 bytes: 44 = 4 + 4*10

# read the first 4 bytes

with Path("encodings2-utf32").open('rb') as f:
    bom = f.read(4)
bom
b'\xff\xfe\x00\x00'

which indeed matches the UTF-32 little-endian (LE) BOM, as listed on the Wikipedia page about the Byte order mark
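we do not even have to compare by hand; the codecs module in the standard library exposes the usual BOMs as constants:

import codecs

# the 4 bytes we read are exactly the UTF-32 little-endian BOM
bom == codecs.BOM_UTF32_LE
True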

decoding is way easier

with that in mind, it is much easier to locate and decode any single character

for example: decode the ç in our initial string

# read the whole file

with Path("encodings2-utf32").open('rb') as f:
    raw = f.read()
# remember that ç is at index 2

index = 2

text2[index]
'ç'

so this means it gets encoded in the file as 4 bytes, starting at offset
4 + 4 * index (4 bytes for the BOM, plus 4 bytes per preceding character)

offset = 4 + 4*index

b4 = raw[offset:offset+4]

b4
b'\xe7\x00\x00\x00'

because it is little-endian - see the Wikipedia page on Endianness - the least significant byte comes first, so we have to read the data bytes in reverse order to get the actual value

# int.from_bytes knows how to transform a sequence
# of bytes into an int, given the endianness

int.from_bytes(b4, 'little')
231
# and indeed, that is what was encoded in the file!

chr(231)
'ç'
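and as a final sanity check, we can let Python decode the whole file; when opening in text mode with encoding='utf-32', the BOM is consumed automatically and we get our original string back:

# round-trip: the decoded file contents match text2 exactly
with Path("encodings2-utf32").open('r', encoding='utf-32') as f:
    print(f.read() == text2)
True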