
ASCII

ord() and chr()

# a text like this one

text1 = "abcd\nefgh\n"

is actually encoded in memory according to ASCII:
https://www.man7.org/linux/man-pages/man7/ascii.7.html

# and to confirm that, we can use ord()
# which applies to a single char (and raises an exception otherwise)

for c in text1:
    print(f"{c} -> {ord(c)}")
a -> 97
b -> 98
c -> 99
d -> 100

 -> 10
e -> 101
f -> 102
g -> 103
h -> 104

 -> 10
# FYI, note that we also have the inverse function of ord(),
# which is called chr()

chr(97)
'a'
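as a quick sanity check, we can verify the round-trip on the whole string:

# chr(ord(c)) should give back c for every character
all(chr(ord(c)) == c for c in text1)
True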

write into a text file

from pathlib import Path
# remember to always use a with statement when dealing with files

with Path('encodings1-utf8').open('w', encoding='UTF-8') as f:
    f.write(text1)

inspect that file with a hex editor - e.g. you can use VS Code with the Hex Editor extension

you will see that, with ASCII-only characters, your file has exactly one byte per character
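if you don't have a hex editor at hand, here is a small sketch to check the same thing from Python, by reading the raw bytes back:

# reading in binary mode gives exactly one byte per character here
raw1 = Path('encodings1-utf8').read_bytes()
raw1, len(raw1) == len(text1)
(b'abcd\nefgh\n', True)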

non-ASCII & UTF-8

let us now consider a text with French accents and a cedilla - any non-ASCII character would do

text2 = "abçd\néfgh\n"

inspection

# notice the values > 127
# which are not supported in ASCII

for c in text2:
    print(f"{c} -> {ord(c)}")
a -> 97
b -> 98
ç -> 231
d -> 100

 -> 10
é -> 233
f -> 102
g -> 103
h -> 104

 -> 10
# focus on the 2 characters whose encoding is > 127

hex(231), hex(233)
('0xe7', '0xe9')
bin(231), bin(233)
('0b11100111', '0b11101001')

again, note that chr() and ord() are the inverse of one another

chr(231), ord('ç')
('ç', 231)

write into a text file

with Path('encodings2-utf8').open('w', encoding='UTF-8') as f:
    f.write(text2)

read back

# no encoding in binary mode - it would make no sense!

with Path('encodings2-utf8').open('rb') as f:
    raw = f.read()

and as we read bytes here, we get more bytes than the initial text had characters

len(raw), len(text2)
(12, 10)

it adds up, since each of the 2 alien characters needs 2 bytes to be encoded
(accented European characters usually take 2 bytes; some more exotic chars can take 3 or 4 bytes)
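to double-check that count, here is a small sketch that asks str.encode() for the UTF-8 bytes of each character:

# number of bytes each character needs in UTF-8
for c in text2:
    print(f"{c!r} -> {len(c.encode('utf-8'))} byte(s)")
'a' -> 1 byte(s)
'b' -> 1 byte(s)
'ç' -> 2 byte(s)
'd' -> 1 byte(s)
'\n' -> 1 byte(s)
'é' -> 2 byte(s)
'f' -> 1 byte(s)
'g' -> 1 byte(s)
'h' -> 1 byte(s)
'\n' -> 1 byte(s)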

UTF-8 logic

this table describes how the UTF-8 encoding works:

bytes   codepoint range       byte 1     byte 2     byte 3     byte 4
1       U+0000 .. U+007F      0xxxxxxx
2       U+0080 .. U+07FF      110xxxxx   10xxxxxx
3       U+0800 .. U+FFFF      1110xxxx   10xxxxxx   10xxxxxx
4       U+10000 .. U+10FFFF   11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

a visual example

on a sample 4-character string: été\n
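we can reproduce that visual example in code, by encoding the sample string and displaying each byte:

# 4 characters, but 6 bytes: each é uses 2 bytes
for b in "été\n".encode('utf-8'):
    print(f"byte {b} {hex(b)} {bin(b)}")
byte 195 0xc3 0b11000011
byte 169 0xa9 0b10101001
byte 116 0x74 0b1110100
byte 195 0xc3 0b11000011
byte 169 0xa9 0b10101001
byte 10 0xa 0b1010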

let’s check that

on our own data

# here it is again
raw
b'ab\xc3\xa7d\n\xc3\xa9fgh\n'

let us number the contents of raw

index:  0  1  2     3     4  5   6     7     8  9  10  11
byte:   a  b  \xc3  \xa7  d  \n  \xc3  \xa9  f  g  h   \n
# extract the 2-bytes areas for each alien character

ccedilla = raw[2:4]
eaccent = raw[6:8]
for b in ccedilla:
    print(f"byte {b} {hex(b)} {bin(b)}")
byte 195 0xc3 0b11000011
byte 167 0xa7 0b10100111
for b in eaccent:
    print(f"byte {b} {hex(b)} {bin(b)}")
byte 195 0xc3 0b11000011
byte 169 0xa9 0b10101001

sounds good

decode manually (gory details)

this is totally optional of course, but if we wanted to do the decoding ourselves...
(you may skip to the next section)

# we want 5 bits from the first byte and 6 from the second byte

on2bytes_0_len = 5
on2bytes_1_len = 6

# and that's what should occur in the remaining (left-hand-side) bits
on2bytes_0_pad = 0b110
on2bytes_1_pad = 0b10
def mask_from_len(length):
    """
    for e.g. len == 5, we compute a mask that has
    3 bits set and 5 bits unset (because 3+5=8)
    """
    return 2**8 - 2**length
# let us check that it works as advertised:

# e.g. for byte0
# the result allows us to separate
# the (3-bits) padding from 
# the (5-bits) payload

bin(mask_from_len(5))
'0b11100000'
# with that we can manually decode 2-byte UTF-8 characters!

on2bytes_0_mask = mask_from_len(on2bytes_0_len)
on2bytes_1_mask = mask_from_len(on2bytes_1_len)

def decode(on2bytes):
    b0, b1 = on2bytes
    # check masks
    # e.g. check that the 3 high bits in 0xc3 are indeed 0b110
    assert (b0 & on2bytes_0_mask) >> on2bytes_0_len == on2bytes_0_pad
    # same on byte 1
    assert (b1 & on2bytes_1_mask) >> on2bytes_1_len == on2bytes_1_pad
    # extract meaningful bits
    # for that we just need to invert the mask
    bits0 = b0 & ~ (on2bytes_0_mask)
    bits1 = b1 & ~ (on2bytes_1_mask)
    # assemble bits into a codepoint
    # b0 has the high bits so it needs to be shifted
    # by the number of meaningful bits in byte1
    codepoint = bits1 | bits0 << on2bytes_1_len
    return chr(codepoint)
# and indeed 
decode(eaccent), decode(ccedilla)
('é', 'ç')
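as a cross-check, Python's built-in bytes.decode() gives the same answer:

# the standard decoder agrees with our manual one
eaccent.decode('utf-8'), ccedilla.decode('utf-8')
('é', 'ç')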

exercise

use the table above to write a complete UTF-8 decoder, i.e. one that also handles 1-, 3- and 4-byte sequences

UTF-32

let us now take a quick look at the UTF-32 encoding
this is a fixed-size encoding, meaning each character will use 4 bytes
this is convenient e.g. when you need direct access to the n-th character in a file

write with UTF-32

let’s write our text with 2 alien characters in a second file

with Path("encodings2-utf32").open('w', encoding='utf-32') as f:
    f.write(text2)

size and BOM

however the total file size is not exactly 4*n, and this is due to something called the BOM (Byte Order Mark)

# computing file size: use pathlib!

p = Path("encodings2-utf32")
print(f"file has {p.stat().st_size} bytes")
file has 44 bytes
len(text2)
10

44 is because the file starts with an extra 4 bytes: 44 = 4 + 4*10

# read the first 4 bytes

with Path("encodings2-utf32").open('rb') as f:
    bom = f.read(4)
bom
b'\xff\xfe\x00\x00'

which indeed matches the UTF-32 little-endian (LE) BOM, as listed on the Wikipedia page about the Byte order mark
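we do not even have to compare by hand; the codecs module in the standard library exposes the usual BOMs as constants:

import codecs

# the 4 bytes we read are exactly the UTF-32 little-endian BOM
bom == codecs.BOM_UTF32_LE
True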

decoding is way easier

with that in mind, it is much easier to locate and decode any single character

for example: decode the ç in our initial string

# read the whole file

with Path("encodings2-utf32").open('rb') as f:
    raw = f.read()
# remember that ç is at index 2

index = 2

text2[index]
'ç'

so this means it gets encoded in the file as 4 bytes, starting at offset
4 + 4 * index (4 bytes for the BOM, plus 4 bytes per preceding character)

offset = 4 + 4*index

b4 = raw[offset:offset+4]

b4
b'\xe7\x00\x00\x00'

because it is little-endian - see the Wikipedia page on Endianness - the least significant byte comes first, so we have to read the data bytes in reverse order to get the actual value

# int.from_bytes knows how to transform a sequence
# of bytes into an int, given the endianness

int.from_bytes(b4, 'little')
231
# and indeed, that is what was encoded in the file!

chr(231)
'ç'
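and as a final sanity check, we can let Python decode the whole file; when opening in text mode with encoding='utf-32', the BOM is consumed automatically and we get our original string back:

# round-trip: the decoded file contents match text2 exactly
with Path("encodings2-utf32").open('r', encoding='utf-32') as f:
    print(f.read() == text2)
True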