Encoding in Javascript: Part 2
This is a continuation of Encoding 1.
Raw data in Javascript ⛁
In the real world, we almost never type out 1s and 0s to represent binary data. We typically work with numbers or integers, which correspond to values of bytes or groups of bytes.
Although Javascript was designed to be a high level language, Brendan Eich included some bitwise operators in his original implementation of the language in 1995 (&, |, ^, ~, <<, >> and >>>) and their corresponding assignment operators (&=, |=, etc.). These operators work on numbers and were the only way to perform low level binary operations in the language until 2015. The small example below illustrates how each bitwise operator works. Note that Javascript performs bitwise operations on 32 bit integers, which means each operand is converted from a 64 bit floating point number to a 32 bit integer, and the result is converted back to 64 bits.
const a = 5; // Binary: 0101
const b = 3; // Binary: 0011
// AND
const andResult = a & b; // Binary: 0001, Decimal: 1
console.log(`AND: ${andResult}`);
// OR
const orResult = a | b; // Binary: 0111, Decimal: 7
console.log(`OR: ${orResult}`);
// XOR
const xorResult = a ^ b; // Binary: 0110, Decimal: 6
console.log(`XOR: ${xorResult}`);
// NOT
const notResult = ~a; // Binary: 1010 (in 32-bit: 11111111111111111111111111111010), Decimal: -6
console.log(`NOT: ${notResult}`);
// Left Shift
const leftShiftResult = a << 1; // Binary: 1010, Decimal: 10
console.log(`Left Shift: ${leftShiftResult}`);
// Right Shift
const rightShiftResult = a >> 1; // Binary: 0010, Decimal: 2
console.log(`Right Shift: ${rightShiftResult}`);
// Zero-fill Right Shift
const zeroFillRightShiftResult = a >>> 1; // Binary: 0010, Decimal: 2
console.log(`Zero-fill Right Shift: ${zeroFillRightShiftResult}`);
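The values above are small and positive, so the 32 bit conversion and the difference between the two right shifts are not visible. Here is a minimal sketch (the variable names are only illustrative) that uses a value too large for 32 bits and a negative number:

```javascript
// Bitwise operators first convert their operands to 32 bit integers
const big = 2 ** 32 + 5; // needs more than 32 bits
const truncated = big | 0; // 5 (the upper bits are discarded)
console.log(`Truncated: ${truncated}`);

// The highest of the 32 bits is the sign bit
console.log(`1 << 31: ${1 << 31}`); // -2147483648

const negative = -8;
// >> copies the sign bit in from the left, so the result stays negative
const signPropagating = negative >> 1; // -4
// >>> fills with zeros, so the result is read as a large unsigned number
const zeroFill = negative >>> 1; // 2147483644
console.log({ signPropagating, zeroFill });
```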
Over the years, Javascript has built on top of the foundation of bitwise operators to make working with binary data more versatile, using TypedArrays, ArrayBuffers and DataViews. Below is an example of how you can directly modify image data in Javascript using a kind of TypedArray called a Uint8ClampedArray. This example encodes the binary to base64 using HTMLCanvasElement.toDataURL(). Take a look:
const originalImg = document.querySelector("#original-img");
const modifiedImg = document.querySelector("#modified-img");
const canvas = document.querySelector("canvas");
const input = document.querySelector('input[type="range"]');
input.addEventListener("input", (e) => {
const value = parseInt(e.target.value);
const ctx = canvas.getContext("2d");
const img = new Image();
img.src = originalImg.src;
img.onload = () => {
canvas.width = img.width;
canvas.height = img.height;
ctx.drawImage(img, 0, 0);
const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height);
// getting Uint8ClampedArray from data source
const data = imageData.data;
// looping through four bytes at a time
for (let i = 0; i < data.length; i += 4) {
// manually modifying the raw bytes
data[i] = Math.min(data[i] + value, 255);
data[i + 1] = Math.min(data[i + 1] + value, 255);
data[i + 2] = Math.min(data[i + 2] + value, 255);
}
ctx.putImageData(imageData, 0, 0);
// encoding to base64 for the img element
modifiedImg.src = canvas.toDataURL();
};
});
Working with raw data enables us to modify an image regardless of its format.
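The same byte manipulation can be tried without a canvas. The sketch below uses a hypothetical helper, adjustBrightness(), on a hand-written array of two RGBA pixels. It also shows that a Uint8ClampedArray clamps values on its own, so an explicit Math.min() is a safeguard rather than a requirement:

```javascript
// Hypothetical helper: brightens (or darkens) RGBA pixel data in place
function adjustBrightness(data, amount) {
  for (let i = 0; i < data.length; i += 4) {
    data[i] += amount; // red
    data[i + 1] += amount; // green
    data[i + 2] += amount; // blue
    // data[i + 3] is the alpha channel and is left untouched
  }
  return data;
}

// Two pixels: one dark, one nearly white
const pixels = new Uint8ClampedArray([10, 20, 30, 255, 250, 251, 252, 255]);

adjustBrightness(pixels, 20);
const brightened = Array.from(pixels);
console.log(brightened); // [30, 40, 50, 255, 255, 255, 255, 255]

adjustBrightness(pixels, -100);
const darkened = Array.from(pixels);
console.log(darkened); // [0, 0, 0, 255, 155, 155, 155, 255]
```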
Raw data in Javascript can be allocated in memory with an ArrayBuffer, which represents a byte array. We cannot edit an ArrayBuffer directly. Instead, a TypedArray can be used as a view to read and write the bytes of an ArrayBuffer.
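As a small sketch of that relationship, the snippet below writes four bytes through a Uint8Array view and then reads them back through a DataView over the same buffer. The byte values are arbitrary and only chosen to make the byte order easy to see:

```javascript
const buffer = new ArrayBuffer(4); // 4 bytes, all initialized to 0
const bytes = new Uint8Array(buffer); // a view over the same memory
bytes.set([0x12, 0x34, 0x56, 0x78]);

// A DataView reads multi-byte values with an explicit byte order
const view = new DataView(buffer);
const bigEndian = view.getUint16(0); // big-endian is the default
const littleEndian = view.getUint16(0, true); // true means little-endian

console.log(bigEndian.toString(16)); // "1234"
console.log(littleEndian.toString(16)); // "3412"
console.log(buffer.byteLength); // 4
```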
A TypedArray is a special type of array that breaks up data into fixed size groups. The name of each type of TypedArray describes how the data is encoded and interpreted. For example, a Uint8Array contains unsigned 8 bit elements and an Int32Array contains signed 32 bit elements.
const binaryStr = "10010010 10000000 01000100 00010011";
const numbers = binaryStr.split(" ").map((str) => parseInt(str, 2));
// with unsigned 8 bits, the highest value is 255 (11111111)
// 256 does not fit in 8 bits, so it wraps around past 255 and lands on 0
// a Uint8ClampedArray would stop at 255 instead of wrapping
const uint8 = new Uint8Array([...numbers, 256]);
console.log(`Uint8Array bytes: ${uint8.byteLength}`); // 5 (4 from binaryStr, 1 from 256)
// A sign allows for negative numbers
// But allocating a bit for the sign means a range of -2147483648 to 2147483647
// Unsigned 32 bits would allow for values as high as 4294967295
const int32 = new Int32Array([55, -2147483648, 2000000000]);
console.log(`Int32Array bytes: ${int32.byteLength}`); // 12 (each value is 4)
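To make the difference between wrapping and clamping concrete, here is a short sketch that stores the same out-of-range values in both array types (the input values are arbitrary):

```javascript
const values = [255, 256, 300, -1];

// Uint8Array keeps only the lowest 8 bits, so values wrap around
const wrapped = Array.from(new Uint8Array(values));
// Uint8ClampedArray pins values to the 0-255 range
const clamped = Array.from(new Uint8ClampedArray(values));

console.log(wrapped); // [255, 0, 44, 255]
console.log(clamped); // [255, 255, 255, 0]
```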
Each TypedArray has its own constructor (new Uint8Array(), new Uint8ClampedArray(), new Int32Array(), etc.). Their constructors accept a variety of arguments, like a length, an array of numbers, another TypedArray or an ArrayBuffer.
const typedArr1 = new Uint8Array(7); // use length of 7 where all values are initialized at 0 (00000000)
const typedArr2 = new Uint8Array([1, 7, 11]); // use an array of numbers
const typedArr3 = new Uint8Array(typedArr2); // use another TypedArray
const buff = new ArrayBuffer(8);
const typedArr4 = new Uint8Array(buff); // use an ArrayBuffer
// when using a buffer you can enter a byteOffset and total length
const typedArr5 = new Uint8Array(typedArr3.buffer, 1, 2);
console.log({ typedArr1, typedArr2, typedArr3, typedArr4, typedArr5 });
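One detail worth noticing in the constructors above: passing another TypedArray copies its values, while passing its buffer creates a view that shares memory. A minimal sketch (with illustrative variable names) of the difference:

```javascript
const source = new Uint8Array([1, 7, 11]);
const copy = new Uint8Array(source); // copies the values into a new buffer
const shared = new Uint8Array(source.buffer, 1, 2); // view over source's buffer

source[1] = 99;

const copyValues = Array.from(copy);
const sharedValues = Array.from(shared);
console.log(copyValues); // [1, 7, 11] (unaffected by the change)
console.log(sharedValues); // [99, 11] (sees the change)
console.log(shared.buffer === source.buffer); // true
```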
If you want a more comprehensive reference, the Mozilla docs have detailed tables and resources about TypedArrays.
We'll work with more TypedArrays in the next section when looking at some common encoding methods.
Encoding arbitrary data 𝄃𝄃𝄂𝄂𝄀𝄁𝄃𝄂𝄂𝄃
Two effective and useful encodings for generic data are called hexadecimal and base64. They are convenient because they are easier to read and significantly more compact than storing binary string representations, but do not rely on a fixed character set like UTF-8, UTF-16 or GB 18030 (a Chinese standard). You'll never have to worry about unknown characters ���.
Hexadecimal (Hex) on the web
As the hexadecimal name implies, hex encoding takes 4 bits at a time and then represents each group with 16 possible values, digits 0-9 and letters A-F. The reason it doesn't directly use integers 0-15 is so that each bit group can be represented with a single character. The casing of the letters is irrelevant.
Binary | Decimal | Hex
----------------------
0000 | 0 | 0
0001 | 1 | 1
0010 | 2 | 2
0011 | 3 | 3
0100 | 4 | 4
0101 | 5 | 5
0110 | 6 | 6
0111 | 7 | 7
1000 | 8 | 8
1001 | 9 | 9
1010 | 10 | A
1011 | 11 | B
1100 | 12 | C
1101 | 13 | D
1110 | 14 | E
1111 | 15 | F
This means each byte is represented by 2 characters, which is great for consistency, but it becomes a little less intuitive with sequences like 01110000, where the decimal value is 112 but the hex is 70. See if you can figure out the hex for some of these values: 91, 106, 86, 109.
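The two-characters-per-byte rule can be sketched with the bitwise operators from earlier: shift to get the upper 4 bits, mask to get the lower 4 bits, then look each group up in the table. The helper name byteToHex() is only illustrative:

```javascript
const HEX_DIGITS = "0123456789abcdef";

// Illustrative helper: converts one byte (0-255) into two hex characters
function byteToHex(byte) {
  const high = byte >> 4; // upper 4 bits
  const low = byte & 0x0f; // lower 4 bits
  return HEX_DIGITS[high] + HEX_DIGITS[low];
}

console.log(byteToHex(0b01110000)); // "70"
console.log(byteToHex(255)); // "ff"
console.log(byteToHex(5)); // "05"
```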
Similar to binary, in Javascript we can use Number.prototype.toString(16) to create a hex string and then parseInt(hexStr, 16) to convert it back into a number. To convert a TypedArray, like a Uint8Array, into a hex string and back, we can use the following code:
const bytes = new Uint8Array([1, 2, 3, 255, 254, 253]);
const hexStr = Array.from(bytes)
.map((byte) => byte.toString(16).padStart(2, "0"))
.join("");
console.log(`Hex String: ${hexStr}`);
const backToBytes = new Uint8Array(
hexStr.match(/.{1,2}/g).map((byte) => parseInt(byte, 16)),
);
console.log(`Back to Bytes: ${backToBytes}`);
Javascript also supports hexadecimal literals, which have their use cases. Prefix any hex value with 0x and it will be treated like any other number.
// Addition
const sum = 0x1A + 0x0F; // 26 + 15 in decimal
console.log(`Sum: ${sum}`); // Sum: 41
console.log(`Sum Hex: 0x${sum.toString(16).toUpperCase()}`); // Sum Hex: 0x29
// Using a bitwise function
const andResult = 0x1A & 0x0F; // 26 & 15 in decimal
console.log(`Bitwise AND: ${andResult}`); // Bitwise AND: 10
console.log(`Bitwise AND Hex: 0x${andResult.toString(16).toUpperCase()}`); // Bitwise AND Hex: 0xA
// Multiplying a hexadecimal by a decimal
const product = 0x1A * 10;
console.log(`Product: ${product}`); // Product: 260
console.log(`Product Hex: 0x${product.toString(16).toUpperCase()}`); // Product Hex: 0x104
// Creating a Uint16Array with hexadecimal numbers
const hexArray = new Uint16Array([0x1A, 0x2B, 0x3C]);
console.log(`Uint16Array: ${hexArray}`); // Uint16Array: 26,43,60
You've probably seen hex encodings used to represent color data before, but because of their versatility, they appear in many other public facing applications like IPv6 addresses and MAC addresses.
// Color Codes
const white = new Uint8Array([255, 255, 255]);
const black = new Uint8Array([0, 0, 0]);
const whiteHex = Array.from(white, byte => byte.toString(16).padStart(2, '0')).join('');
const blackHex = Array.from(black, byte => byte.toString(16).padStart(2, '0')).join('');
const whiteCode = `#${whiteHex}`;
const blackCode = `#${blackHex}`;
console.log(whiteCode); // Output: #ffffff
console.log(blackCode); // Output: #000000
// IPv6 Address
const ipv6Address = new Uint8Array([
  32, 1, 13, 184, 133, 163, 0, 0, 0, 0, 138, 46, 3, 112, 115, 52
]);
// combine each pair of bytes into one 16 bit group (leading zeros are dropped)
const ipv6Groups = [];
for (let i = 0; i < ipv6Address.length; i += 2) {
  ipv6Groups.push(((ipv6Address[i] << 8) | ipv6Address[i + 1]).toString(16));
}
// collapse the first run of consecutive zero groups into "::"
const ipv6Hex = ipv6Groups.join(':').replace(/(^|:)0(:0)+(:|$)/, '::');
console.log(ipv6Hex); // Output: 2001:db8:85a3::8a2e:370:7334
// MAC Address
const macAddress = new Uint8Array([0, 26, 43, 60, 77, 94]);
const macHexArray = Array.from(macAddress, byte => byte.toString(16).padStart(2, '0'));
const macHex = macHexArray.join(':');
console.log(macHex); // Output: 00:1a:2b:3c:4d:5e
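Going the other way is just as short. As a rough sketch, a hypothetical hexColorToBytes() helper could validate a six digit color code and split it into its three bytes:

```javascript
// Hypothetical helper: converts "#rrggbb" (or "rrggbb") into 3 bytes
function hexColorToBytes(code) {
  const hex = code.replace(/^#/, "");
  if (!/^[0-9a-fA-F]{6}$/.test(hex)) {
    throw new Error(`Invalid color code: ${code}`);
  }
  // every 2 hex characters represent one byte
  return Uint8Array.from(hex.match(/.{2}/g), (pair) => parseInt(pair, 16));
}

console.log(Array.from(hexColorToBytes("#ff8800"))); // [255, 136, 0]
console.log(Array.from(hexColorToBytes("FFFFFF"))); // [255, 255, 255]
```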
While hex has its useful applications, and with practice you could learn to convert it in your head, base64 encodes into even fewer characters.
Base64 on the web
Base64 encodes six bits into one of 64 characters. Unlike hex, base64 is case sensitive: upper case and lower case letters represent different values. You can review the encodings below:
Binary | Decimal | Base64
--------------------------
000000 | 0 | A
000001 | 1 | B
000010 | 2 | C
000011 | 3 | D
000100 | 4 | E
000101 | 5 | F
000110 | 6 | G
000111 | 7 | H
001000 | 8 | I
001001 | 9 | J
001010 | 10 | K
001011 | 11 | L
001100 | 12 | M
001101 | 13 | N
001110 | 14 | O
001111 | 15 | P
010000 | 16 | Q
010001 | 17 | R
010010 | 18 | S
010011 | 19 | T
010100 | 20 | U
010101 | 21 | V
010110 | 22 | W
010111 | 23 | X
011000 | 24 | Y
011001 | 25 | Z
011010 | 26 | a
011011 | 27 | b
011100 | 28 | c
011101 | 29 | d
011110 | 30 | e
011111 | 31 | f
100000 | 32 | g
100001 | 33 | h
100010 | 34 | i
100011 | 35 | j
100100 | 36 | k
100101 | 37 | l
100110 | 38 | m
100111 | 39 | n
101000 | 40 | o
101001 | 41 | p
101010 | 42 | q
101011 | 43 | r
101100 | 44 | s
101101 | 45 | t
101110 | 46 | u
101111 | 47 | v
110000 | 48 | w
110001 | 49 | x
110010 | 50 | y
110011 | 51 | z
110100 | 52 | 0
110101 | 53 | 1
110110 | 54 | 2
110111 | 55 | 3
111000 | 56 | 4
111001 | 57 | 5
111010 | 58 | 6
111011 | 59 | 7
111100 | 60 | 8
111101 | 61 | 9
111110 | 62 | +
111111 | 63 | /
Because base64 works on six bit groups, but data is stored in bytes, the encoder takes 3 bytes (24 bits) of data at a time and encodes them into 4 characters. If the encoder gets to the end of a dataset and the last segment (sometimes referred to as the tail) is shorter than 3 bytes, it fills the last partial group with 0s and uses the = sign in place of the remaining empty groups.
3 Byte Tail | |
---|---|
Bytes | 01001101 01100001 01101110 |
6 bit groups | 010011 010110 000101 101110 |
Base64 | TWFu |
2 Byte Tail | |
Bytes | 01001101 01100001 |
6 bit groups | 010011 010110 000100 •••••• |
Base64 | TWE= |
1 Byte Tail | |
Bytes | 01001101 |
6 bit groups | 010011 010000 •••••• •••••• |
Base64 | TQ== |
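The grouping and padding rules in the table can be sketched by hand with bitwise operators. This is a teaching sketch rather than something to use in production, and the names ALPHABET and toBase64() are only illustrative:

```javascript
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Illustrative encoder: 3 bytes in, 4 characters out
function toBase64(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = Math.min(3, bytes.length - i);
    // pack up to 3 bytes into 24 bits, missing bytes count as 0
    const chunk =
      (bytes[i] << 16) | ((bytes[i + 1] ?? 0) << 8) | (bytes[i + 2] ?? 0);
    // read the 24 bits back out as four 6 bit groups
    out += ALPHABET[(chunk >> 18) & 63];
    out += ALPHABET[(chunk >> 12) & 63];
    out += remaining > 1 ? ALPHABET[(chunk >> 6) & 63] : "=";
    out += remaining > 2 ? ALPHABET[chunk & 63] : "=";
  }
  return out;
}

const man = new TextEncoder().encode("Man");
console.log(toBase64(man)); // "TWFu"
console.log(toBase64(man.subarray(0, 2))); // "TWE="
console.log(toBase64(man.subarray(0, 1))); // "TQ=="
```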
Most Javascript runtimes have a global function called atob(), which decodes a base64 string into a string where each character represents one byte (readable text for ASCII data). They also have a corresponding function called btoa(), which converts such a string into base64. Note that if you pass btoa() text containing characters outside the Latin-1 range (πρόβλεψη 👀), it throws an error.
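A quick sketch of both functions, including the failure case. It assumes a runtime where btoa() and atob() are available as globals, such as a browser or a recent version of Node:

```javascript
const encoded = btoa("hello");
console.log(encoded); // "aGVsbG8="

const decoded = atob(encoded);
console.log(decoded); // "hello"

let failed = false;
try {
  btoa("πρόβλεψη 👀");
} catch (err) {
  failed = true; // characters with code points above 255 are rejected
}
console.log(failed); // true
```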
We can work around this limitation of atob() and btoa() using some String functions. We can spread any Uint8Array into the static method String.fromCodePoint() to get a string where each character represents one byte, which can then be base64 encoded using btoa(). To convert back to a Uint8Array, we can use atob() combined with String.prototype.charCodeAt().
const uint8_1 = new TextEncoder().encode("(πρόβλεψη 👀)");
const base64Str = btoa(String.fromCodePoint(...uint8_1));
console.log(`Base64 String: ${base64Str}`);
const uint8_2 = Uint8Array.from(atob(base64Str), (c) => c.charCodeAt(0));
const str = new TextDecoder().decode(uint8_2);
console.log(`Text String: ${str}`);
Note that atob() will still throw an error for invalid base64 strings.
The problem with this method is that it's relatively slow because you are looping through the data multiple times. The fastest way to convert to and from base64 strings is using the FileReader and fetch APIs. You can read more about that in the Mozilla docs.
Future improvements to the web
In the near future, you should be able to use new Uint8Array methods called Uint8Array.prototype.toHex() and Uint8Array.prototype.toBase64(), which should simplify the conversion process and improve performance. You can read more about them in the proposal.
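Until those methods are available everywhere, one option is to feature-detect them and fall back to the techniques from the earlier sections. The wrapper names bytesToHex() and bytesToBase64() below are hypothetical, and the fallback spreads the whole array into one call, so it is only suited to small inputs:

```javascript
// Hypothetical wrapper: uses toHex() when the runtime provides it
function bytesToHex(bytes) {
  if (typeof bytes.toHex === "function") return bytes.toHex();
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join("");
}

// Hypothetical wrapper: uses toBase64() when the runtime provides it
function bytesToBase64(bytes) {
  if (typeof bytes.toBase64 === "function") return bytes.toBase64();
  return btoa(String.fromCharCode(...bytes));
}

const sample = new Uint8Array([77, 97, 110, 255]);
console.log(bytesToHex(sample)); // "4d616eff"
console.log(bytesToBase64(sample)); // "TWFu/w=="
```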
Hex and Base64 in Node.js
Arguably, encoding hex and base64 is simpler and faster in Node than on the web. Because Node was built to handle server functions years before web Javascript had many of the tools that could have handled them, Node created a robust Buffer API with intuitive data encoders. The current API allows us to create a Buffer (now a subclass of Uint8Array) directly from a string in a given text encoding using the static Buffer.from(), and re-encode it into many formats with Buffer.toString().
const utf8String = "שלום! 😄 אני שמח ש-Node.js קיים";
const buffer = Buffer.from(utf8String, 'utf8');
const hexString = buffer.toString('hex');
console.log('Hex String:', hexString);
const base64String = buffer.toString('base64');
console.log('Base64 String:', base64String);
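// toString('utf16le') would misread the UTF-8 bytes, so re-encode from the original string
const utf16leBuffer = Buffer.from(utf8String, 'utf16le');
console.log('UTF-16LE bytes (hex):', utf16leBuffer.toString('hex'));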
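Buffer.from() also accepts 'hex' and 'base64' as input encodings, so the round trip back to text is symmetrical. A minimal sketch, assuming it runs in Node where Buffer is a global:

```javascript
const original = "שלום! 😄";

// text -> bytes -> hex / base64
const hex = Buffer.from(original, "utf8").toString("hex");
const base64 = Buffer.from(original, "utf8").toString("base64");

// hex / base64 -> bytes -> text
const fromHex = Buffer.from(hex, "hex").toString("utf8");
const fromBase64 = Buffer.from(base64, "base64").toString("utf8");

console.log(fromHex === original); // true
console.log(fromBase64 === original); // true
```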
Encoding opens up a world of possibilities 🌍
We reviewed how text, numbers, images and raw data are encoded, and now the sky is the limit. Understanding data encoding will enable you to work with any kind of data you come across.
For the last example, you can see how the concepts you reviewed bring a topic as intimidating as cryptography within your reach:
You and your best friend, Bob, have been using an end to end encryption app. One day the French government and European Union decide to dismantle the app because they think private communication is too dangerous for the public. Bob sent you one last message, but your app stopped working before the message could be decrypted.
You look through the app directory and find a file called key.pem, which should be your private key. You find another file that you think is the last message he sent. Since you learned everything about coding on YouTube, you decide Javascript is the best option to try to decrypt Bob's last message. Can you do it?
In Javascript we can use the SubtleCrypto API to decrypt data. It asks for a CryptoKey and an ArrayBuffer. Let's start with the CryptoKey.
Can you tell which encoding method is used in the private key?
You can copy the code from the Base64 section to convert this base64 to raw data and then use crypto.subtle.importKey() to make the key.
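As a rough sketch of that step: a PEM file is base64 wrapped in header and footer lines, so we strip those and decode the rest. The helper names pemToBytes() and importPrivateKey() are hypothetical, and the key format (PKCS#8) and algorithm (RSA-OAEP with SHA-256) are assumptions, since they depend on how the app created the key. The demoPem value is placeholder data, not a real key:

```javascript
// Hypothetical helper: strips the PEM header/footer and decodes the base64 body
function pemToBytes(pem) {
  const base64 = pem
    .replace(/-----(BEGIN|END)[^-]+-----/g, "")
    .replace(/\s+/g, "");
  return Uint8Array.from(atob(base64), (c) => c.charCodeAt(0));
}

// Assumption: a PKCS#8 private key intended for RSA-OAEP with SHA-256
async function importPrivateKey(pem) {
  return crypto.subtle.importKey(
    "pkcs8",
    pemToBytes(pem),
    { name: "RSA-OAEP", hash: "SHA-256" },
    false, // the key does not need to be exportable
    ["decrypt"],
  );
}

// Placeholder data to show the decoding step only (not a real key)
const demoPem =
  "-----BEGIN PRIVATE KEY-----\nTWFu\nTQ==\n-----END PRIVATE KEY-----";
console.log(Array.from(pemToBytes(demoPem))); // [77, 97, 110, 77]
```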
Now look at the contents of the message file. Can you think of a way to read them in a more useful format?
With the CryptoKey made and the raw encrypted data, we can use crypto.subtle.decrypt() to get the decrypted ArrayBuffer.
Did you catch how we can decode raw data to UTF-8 text?
Using new TextDecoder().decode(), we can decode the raw data as UTF-8.
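Putting the last two steps together might look like the sketch below. The function name decryptMessage() is hypothetical and RSA-OAEP is an assumption that has to match however the key was imported. The final lines demonstrate only the decoding step, using known UTF-8 bytes in place of a real decrypted message:

```javascript
// Hypothetical helper. Assumption: the key was imported for RSA-OAEP
async function decryptMessage(privateKey, encryptedBytes) {
  const decrypted = await crypto.subtle.decrypt(
    { name: "RSA-OAEP" },
    privateKey,
    encryptedBytes,
  );
  // decrypt() resolves with an ArrayBuffer, which TextDecoder accepts directly
  return new TextDecoder().decode(decrypted);
}

// The decoding step on its own: "hi " followed by the 4 bytes of 👀
const messageBytes = new Uint8Array([104, 105, 32, 240, 159, 145, 128]);
const message = new TextDecoder().decode(messageBytes.buffer);
console.log(message); // "hi 👀"
```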