I'm bored on 4th of July holiday, so I made a wacky webpage: Deep Atlantic Storage. It is described as a free file storage service, where you can upload any file to be stored deep in the Atlantic Ocean, with no size limit and content restriction whatsoever. Since Chia currency farming became popular in May, hard drive prices went up significantly. How can I afford to operate an unlimited free storage service?
"Advanced Sorting Technology"
One of the benefits listed on Deep Atlantic Storage webpage is:
- Advanced sorting technology keeps your data neatly ordered.
What this means is that, content in the uploaded file would be sorted before being stored.
A sorting algorithm is an algorithm that puts elements of a list in a certain order, such as numerical order or lexicographical order. Every coder knows a few sorting algorithms, such as:
- quick sort
- bubble sort
- merge sort
- insertion sort
- selection sort
Most sorting algorithms are comparison sorts that rely on a comparison function to determine the relative order between two elements.
For example, the program below (try on Compiler Explorer) sorts a list of points on a two-dimensional Euclidean plane by its distance from the origin.
It uses std::sort
function from the C++ standard library, passing a custom comparison function that returns true
if the first point is closer to the origin point (0,0)
than the second point, or false
otherwise.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <iostream>
struct Point
{
double x;
double y;
};
int main() {
std::vector<Point> points{
{ 1.0, 2.0 },
{ 2.0, 0.9 },
{ 0.9, -2.0 },
{ 0.0, 0.0 },
{ -1.4, 0.0 },
{ -1.4, -0.7 },
};
std::sort(points.begin(), points.end(), [] (const Point& a, const Point& b) {
return std::sqrt(a.x * a.x + a.y * a.y) < std::sqrt(b.x * b.x + b.y * b.y);
});
for (const Point& point : points) {
std::printf("%+0.1f, %+0.1f\n", point.x, point.y);
}
}
The input and output of a sorting algorithm are both a list of elements. Deep Atlantic Storage deals with files. A file must first be turned into a list of elements before it can be sorted.
There are many ways to interpret a file as a list of elements. If the file is a database, following the database structure, each table in the database is a list of rows that can be sorted. If the file is plain text, the Unix sort command can read it as a list of text lines that can be sorted.
In Deep Atlantic Storage, I decided to use the most basic unit of information: bit. When you upload a file to my unlimited storage service, the bits contained in the file are sorted in ascending order. For example, suppose the file has the text:
@yoursunny
In binary form it is:
@ y o u r s n n n y
01000000 01111001 01101111 01110101 01110010 01110011 01110101 01101110 01101110 01111001
If I sort all the bits, it becomes:
00000000 00000000 00000000 00000000 00111111 11111111 11111111 11111111 11111111 11111111
Sorting Bits » Counting Bits
Naively, I can collect every bit in the input file into a list of bits, and sort them using a "normal" sort algorithm (try on RunKit):
const input = Buffer.from("@yoursunny");
const bits = [];
for (const b of input) {
for (let s = 0; s < 8; ++s) {
bits.push((b >> s) & 0x01);
}
}
let compares = 0;
bits.sort((a, b) => {
++compares;
return a - b;
});
console.log(
`${bits.length} elements`,
`${compares} compares`,
JSON.stringify(bits),
);
Array.prototype.sort() is a comparison sort algorithm. Theoretically, a comparison sort algorithm cannot perform better than O(n log n) comparisons, where n is the number of elements in the input list. For my 80-bit input, Node.js v16.3.0 invoked the comparison function 322 times. If the input was longer, considerably more comparisons would be needed.
Since there are only two possible values, 0
and 1
, for each bit, there is a better algorithm: counting sort.
Counting sort is an integer sorting algorithm suitable for a list of small non-negative integers.
It does not use a comparison function and hence is a non-comparison sorting algorithm.
Instead, counting sort first counts how many elements possess each distinct key value, then uses these counts to determine the positions of each key value in the output list.
Its time complexity is O(n+k), where n is the number of elements and k is the maximum integer key value in the list.
A counting sort on the same input can be written as (try on Go Playground):
package main
import (
"fmt"
)
func sortBits(bits []int) (sorted []int) {
m := make(map[int]int)
for _, bit := range bits {
m[bit]++
}
for bit := 0; bit <= 1; bit++ {
for i, n := 0, m[bit]; i < n; i++ {
sorted = append(sorted, bit)
}
}
return sorted
}
func main() {
var bits []int
for _, b := range []byte("@yoursunny") {
for s := uint(0); s < 8; s++ {
bit := (b >> s) & 0x01
bits = append(bits, int(bit))
}
}
sorted := sortBits(bits)
fmt.Println(sorted)
}
Sorted Bits » Bit Counts
A sorting algorithm does not change the size of the list being sorted. Suppose a 1GB file is uploaded to Deep Atlantic Storage, there are 8589934592 bits in this file before sorting, and there would be still 8589934592 bits after sorting. Storing a sorted file takes as much disk space as storing the original unsorted file.
Looking at the sorted bits, there is an important observation:
after sorting, all the 0
bits are together, and all the 1
bits are together!
00000000 00000000 00000000 00000000 00111111 11111111 11111111 11111111 11111111 11111111
\_____________ 34 zeros _____________/\____________________ 46 ones ____________________/
Instead of storing the same bit repeatedly, I only need to remember: "there are 34 zeros followed by 46 ones". This allows Deep Atlantic Storage to store sorted large files with considerably less disk space than the original files: any file, regardless of its size, can be represented by two numbers.
Given a list of sorted bits, I can iterate over the list and count the number of consecutive zeros and ones:
from itertools import groupby
bits = "00000000000000000000000000000000001111111111111111111111111111111111111111111111"
for bit, g in groupby(bits):
print(bit, len(list(g)))
This is, in fact, the basic idea of run-length encoding, a lossless data compression method.
However, it's unnecessary to run a sorting algorithm followed by a compression algorithm. Instead, I can let the counting sort algorithm return the counters for zeros and ones directly, skipping the unnecessary step of constructing a sorted list of bits.
Well, actually I don't even need to count both zeros and ones.
Since there are 8 bits in every byte, it suffices to only count the 1
bits, and I can calculate the number 0
bits to be 8 * bytes - ones
.
With that, our bit sorting counting algorithm becomes:
function countBits(input: Uint8Array): [cnt0: number, cnt1: number] {
let cnt = 0;
for (const b of input) {
for (let s = 0; s < 8; ++s) {
if ((b >> s) % 2 === 1) {
++cnt;
}
}
}
return [8 * input.length - cnt, cnt];
}
Bit Counting » Byte Counting
Looking at the bit counting algorithm, the inner loop that iterates over the bits within a byte would be executed once for every byte, which is a hot spot worth optimization. To optimize this code, I aim at eliminating the loop.
In a byte, there are 256 possible values between 0x00 and 0xFF.
The number of zeros and ones in each byte value never changes.
Therefore, it is unnecessary to loop over the bits every time.
Instead, I can build a lookup table that maps a byte value into the number of zeros and ones in that byte.
This code, executed during initialization, prepares the lookup table:
const ONES = [];
for (let b = 0x00; b <= 0xFF; ++b) {
let cnt = 0;
for (let s = 0; s < 8; ++s) {
if ((b >> s) % 2 === 1) {
++cnt;
}
}
ONES.push(cnt);
}
Using this lookup table, I can count bits in a file more efficiently:
function countBits(input: Uint8Array): [cnt0: number, cnt1: number] {
let cnt = 0;
for (const b of input) {
cnt += ONES[b];
}
return [8 * input.length - cnt, cnt];
}
As measured on JSBEN.CH, the lookup table approach is 3~5x faster than the previous algorithm.
Summary
In this article, I reviewed commonly used sorting algorithms, explained why counting sort is more efficient on a list of bits where each bit is either 0
or 1
, explored how to compactly store sorted bits as two numbers, and finally optimized the algorithm using a lookup table.
This article is the first of a 3-part series that reveals the secrets behind Deep Atlantic Storage. The next part in this series will explain how the bit sorting aka byte counting algorithm is used in a web application.