Creating a Lightning-Fast Append-Only Database with mmap in C++

Posted Jan 29, 2024 Updated Feb 24, 2024

By Fizzbuzzer 29 min read

Creating a Lightning-Fast Append-Only Database with mmap in C++

GitHub repository

In this code project, I demonstrate the utilization of mmap in C++ to construct a lightning fast append-only database, which I named TurboMap The underlying concept bears resemblance to that of a disk-based hash table, ensuring optimal performance and reliability. A hash table is a data structure that’s accessed by key, which is transformed by a hash function into an index for lookup in a collection of entries.

Append-only databases often exhibit superior performance and scalability characteristics compared to traditional databases that support updates and deletions. Since data is only appended, disk I/O operations can be optimized, reducing contention and enhancing overall throughput.

While the term database may appear ambitious for this implementation, it is essential to acknowledge its limitations. This project does not encompass the comprehensive functionality expected of a full-fledged database, lacking fundamental properties like atomicity. However, its primary objective remains to elucidate foundational concepts to readers, providing a springboard for further development and refinement.

You may be curious about the necessity of the second indirection of index when a straightforward approach, like pointing directly to the appropriate bucket using hash % NUMBER_OF_BUCKETS, seems feasible, as typically done in standard in-memory hash maps. The rationale behind this lies in our lack of precise knowledge regarding the exact location of each bucket within the entry file. This uncertainty arises because we continuously append new entries to the end of the entry file. While one potential solution could involve pre-allocating a large entry file and assigning a predefined segment to each bucket beforehand, such an approach lacks scalability as it imposes constraints on the size of each bucket.

The diagram provided below elucidates the fundamental concept.

The primary components include:

A hash function responsible for mapping keys to compact integers (referred to as buckets). An effective hash function ensures that keys are mapped to integers in a manner resembling randomness, facilitating an even distribution of bucket values.
An index utilized for associating each hash with the most recent entry within its corresponding bucket.
A file designated for storing map entries. Each entry comprises a key size, the key itself, a reference to the preceding bucket entry, and a pointer to the value record in the data file.
A data file housing the value records. Each record encompasses the value size, the actual value, a reference to the beginning of the respective record, and a pointer to the previous record associated with the same key.

Our classes interface will be outlined as follows:

  
/* TurboMap.h */

#ifndef TURBO_MAP_H
#define TURBO_MAP_H

#include <string>

#define MAX_BUCKETS 100
#define BUCKET_SIZE 1024 * 1024
#define ENTRIES_SIZE (MAX_BUCKETS * BUCKET_SIZE)
#define EMPTY_ENTRY '\0'

class TurboMap {
   private:
    using INDEX_TYPE = int;
    using INDEX_MAP_TYPE = INDEX_TYPE *;
    using ENTRIES_TYPE = unsigned char;
    using ENTRIES_MAP_TYPE = ENTRIES_TYPE *;
    using OFFSET_TYPE = int;
    using SIZE_TYPE = uint16_t;

    std::string m_data_dir;
    std::string m_index_fname;
    std::string m_entries_fname;
    std::string m_data_fname;
    INDEX_MAP_TYPE m_index_map;
    ENTRIES_MAP_TYPE m_entries_map;
    int m_index_fd;
    int m_entries_fd;
    int m_data_fd;
    size_t m_index_sz;
    size_t m_entries_sz;
    int m_last_data_pos;

    union Entry {
        SIZE_TYPE key_size;
        char *key;
    };

    struct Record {
        SIZE_TYPE size;
        char *value;
    };

    void Open();
    bool DoesFileExist(const std::string &fname);
    bool CreateIndexFile(const std::string &fname);
    bool CreateEntriesFile(const std::string &fname);
    bool CreateDataFile(const std::string &fname);
    bool LoadIndexFile(const std::string &fname);
    bool LoadEntriesFile(const std::string &fname);
    bool LoadDataFile(const std::string &fname);
    bool GrowEntriesMap(size_t grow_step);
    bool InsertIntoData(std::vector<unsigned char> &serialized_record, size_t sz);
    INDEX_TYPE Hash(const std::string &key);
    std::string ExtractKeyFromEntryAtKeyOffset(const OFFSET_TYPE &key_off);
    OFFSET_TYPE ExtractNextFromEntryAtKeyOffset(const OFFSET_TYPE &key_off);
    OFFSET_TYPE ExtractDataDownFromEntryAtKeyOffset(const OFFSET_TYPE &key_off);
    SIZE_TYPE GetSizeOfEntryAtKeyOffset(const OFFSET_TYPE &key_off);
    OFFSET_TYPE FindEntryForKeyInBucket(const std::string &key, const OFFSET_TYPE &pos);
    OFFSET_TYPE LastEntryPositionInBucket(const OFFSET_TYPE &key_off);
    SIZE_TYPE FillNewEntry(const std::string &key, const OFFSET_TYPE &next, const OFFSET_TYPE &data_down,
                           std::vector<ENTRIES_TYPE> &buf);
    SIZE_TYPE FillDataRecord(const std::string &value, const OFFSET_TYPE &data_next, std::vector<unsigned char> &buf);
    OFFSET_TYPE GetValueFromOffset(std::string &out, OFFSET_TYPE data_offset);
    OFFSET_TYPE GetLatestDataOffsetForKey(const std::string &key);

   public:
    TurboMap(const std::string &dir);
    bool InsertKeyValue(const std::string &key, const std::string &value);
    bool GetLatestValue(const std::string &key, std::string &out);
    void GetAllValues(const std::string &key, std::vector<std::string> &out);
    ~TurboMap();
};

#endif

This currently represents a rudimentary implementation of a disk-based hash map. Subsequently, I intend to refactor it into a class template to accommodate a wider range of key-value types. For the time being, the utilization of the std::string type should be adequate.

For our hash function, we’ll opt for a straightforward approach for now. However, you can easily swap it out later with a more optimized solution that yields uniformly distributed outputs and faster.

  
TurboMap::INDEX_TYPE TurboMap::Hash(const std::string &key) {
    INDEX_TYPE result = 0;
    for (size_t i = 0; i < key.length(); ++i) {
        result += key[i] * pow(31, i);
    }
    return result % MAX_BUCKETS;
}

You might have noticed a couple of support private functions. These are designed to assist in file creation and loading from disk.

I will only list the CreateIndexFile(...) function as CreateEntriesFile(...) is similar:

  
bool TurboMap::CreateIndexFile(const std::string &fname) {
    m_index_fd = open(fname.c_str(), O_RDWR | O_CREAT | O_TRUNC, (mode_t)0600);

    if (m_index_fd == -1) {
        perror("Error opening file for writing");
        return false;
    }

    m_index_sz = sizeof(INDEX_TYPE) * MAX_BUCKETS;

    // Stretch the file size to m_index_sz
    if (lseek(m_index_fd, m_index_sz - 1, SEEK_SET) == -1) {
        close(m_index_fd);
        perror("Error calling lseek() to 'stretch' the file");
        return false;
    }

    /* Something needs to be written at the end of the file to
     * have the file actually have the new size.
     * Just writing an empty string at the current file position will do.
     */
    if (write(m_index_fd, "", 1) == -1) {
        close(m_index_fd);
        perror("Error writing last byte of the file");
        return false;
    }

    // Now the file is ready to be mmapped.
    m_index_map =
        reinterpret_cast<INDEX_MAP_TYPE>(mmap(0, m_index_sz, O_RDWR | O_CREAT | O_TRUNC, MAP_SHARED, m_index_fd, 0));

    if (m_index_map == MAP_FAILED) {
        err(1, "mmap: %s", fname.c_str());
        close(m_index_fd);
        return false;
    }

    return true;
}

We won’t use memory-mapping for the data file. Instead, we’ll load it like this:

  
bool TurboMap::CreateDataFile(const std::string &fname) {
    m_data_fd = open(fname.c_str(), O_RDWR | O_CREAT | O_TRUNC, (mode_t)0600);
    if (m_data_fd == -1) {
        perror("Error opening file for writing");
        return false;
    }
    m_last_data_pos = 0;
    return true;
}

We won’t delve into the file loading process for now; the complete implementation will be provided later. Instead, I’ll elucidate on the primary function of this hash map: the component responsible for inserting and retrieving records.

Inserting Key/Value Pairs

  
bool TurboMap::InsertKeyValue(const std::string &key, const std::string &value) {
    // Create a new record
    std::vector<unsigned char> new_record;
    OFFSET_TYPE data_next = -1;
    SIZE_TYPE record_size = FillDataRecord(value, data_next, new_record);

    // Pointer to data file
    OFFSET_TYPE data_down = m_last_data_pos + record_size - 2 * sizeof(OFFSET_TYPE);

    // Get bucket for key
    INDEX_TYPE h = Hash(key);
    OFFSET_TYPE pos = m_index_map[h];  // Point to key of last bucket entry

    OFFSET_TYPE existing_key_entry_offset = -1; // Assume that no entry exists

    // Iterate over bucket (linked-list) and find entry with same key, if exists
    existing_key_entry_offset = FindEntryForKeyInBucket(key, pos);

    // Insert/Update Entry
    if (existing_key_entry_offset > 0) {  // We already have an entry for key
        // Update data next in record to older record data down
        data_next = ExtractDataDownFromEntryAtKeyOffset(existing_key_entry_offset);
        memcpy(&new_record[record_size - sizeof(OFFSET_TYPE)], &data_next, sizeof(OFFSET_TYPE));

        // Update data down pointer to point to new record in data file
        SIZE_TYPE key_len = sizeof(ENTRIES_TYPE) * (strlen(key.c_str()) + 1);
        memcpy(&m_entries_map[existing_key_entry_offset + key_len + sizeof(OFFSET_TYPE)], &data_down,
               sizeof(OFFSET_TYPE));
    } else {  // We don't have an entry for key. Append a new one to end of bucket

        SIZE_TYPE entries_end;
        memcpy(&(entries_end), &m_entries_map[0], sizeof(SIZE_TYPE));

        // Next points to last bucket entry or -1. This can be a colliding entry for a different key
        OFFSET_TYPE next = pos;

        // Create a new entry
        std::vector<ENTRIES_TYPE> new_entry;
        SIZE_TYPE entry_size = FillNewEntry(key, next, data_down, new_entry);

        // Insert entry
        memcpy(&m_entries_map[entries_end], reinterpret_cast<ENTRIES_MAP_TYPE>(&new_entry[0]), entry_size);

        // Point index to latest entry
        SIZE_TYPE bucket_pos = entries_end + sizeof(SIZE_TYPE);
        memcpy(&m_index_map[h], reinterpret_cast<INDEX_MAP_TYPE>(&bucket_pos), sizeof(SIZE_TYPE));

        // Note down new total entries size
        entries_end += entry_size;
        memcpy(&m_entries_map[0], reinterpret_cast<INDEX_MAP_TYPE>(&entries_end), sizeof(SIZE_TYPE));

        // Check if we need to stretch the entry file
        if (((entries_end) * 8 + 1024 * 1024) > m_entries_sz && !GrowEntriesMap(10)) {
            return false;
        }
    }

    // Insert Data Record
    if (InsertIntoData(new_record, record_size)) {
        m_last_data_pos += record_size;
        return true;
    } else {
        std::cerr << "Could not persist record" << std::endl;
        return false;
    }
}

Ok, this is alot of code to digest. So let’s break it down into more manageable parts for easier understanding.

All data that we intend to persist must be casted to unsigned char. To facilitate this, we’ll create a new buffer represented by a std::vector<unsigned char>. This buffer will hold the following data in the specified order:

The size of the value intended for storage
The value itself
A pointer (data down) to the start of this record in the data file
A pointer (data next) to any preceding record’s data down pointer, or -1 if no such record exists for our key

  
// Create a new record
std::vector<unsigned char> new_record;
OFFSET_TYPE data_next = -1;
SIZE_TYPE record_size = FillDataRecord(value, data_next, new_record);

Following this, we initiate the preparation of our entry element. The initial step involves determining the location to which this entry should point within the data file. This signifies the position of our newly created record that we intend to persist. This pointer will actually reference the data down pointer of our record, hence we subtract two times the size of OFFSET_TYPE.

  
// Pointer to data file
OFFSET_TYPE data_down = m_last_data_pos + record_size - 2 * sizeof(OFFSET_TYPE);

Now comes the interesting part. Like any hash function, collisions are inevitable. For each hash, we don’t just store one entry, but rather a linked list (bucket) of entries. When a new insertion is triggered, we must iterate over the corresponding entry list to determine if we already have an entry for that key. If an entry already exists, we simply update it to point to our latest record. Otherwise, we create a new entry and append it to the end of that list.

Initially, we need to ascertain the starting point of the bucket within our entry file:

  
// Get bucket for key
INDEX_TYPE h = Hash(key);
OFFSET_TYPE pos = m_index_map[h];  // Point to key of last bucket entry

OFFSET_TYPE existing_key_entry_offset = -1; // Assume that no entry exists

// Iterate over bucket (linked-list) and find entry with same key, if exists
existing_key_entry_offset = FindEntryForKeyInBucket(key, pos);

Let’s put aside the details of the FindEntryForKeyInBucket(...) function for the moment. Essentially, this method traverses the linked list, searching for an entry with the same key. If such an entry exists, it returns the offset of its key in the entry file. Otherwise, it returns -1.

Next, we address the scenario where we already have an existing entry, treating it separately from the case where no such entry exists. Let us focus on the later scenario first.

  
if (existing_key_entry_offset > 0) {  // We already have an entry for key
    // Update existing entry
    ...
} else {  // We don't have an entry for key. Append a new one to end of bucket

    // 1.
    SIZE_TYPE entries_end;
    memcpy(&(entries_end), &m_entries_map[0], sizeof(SIZE_TYPE));

    // 2.
    OFFSET_TYPE next = pos;

    // 3.
    std::vector<ENTRIES_TYPE> new_entry;
    SIZE_TYPE entry_size = FillNewEntry(key, next, data_down, new_entry);

    // 4.
    memcpy(&m_entries_map[entries_end], reinterpret_cast<ENTRIES_MAP_TYPE>(&new_entry[0]), entry_size);

    // 5.
    SIZE_TYPE bucket_pos = entries_end + sizeof(SIZE_TYPE);
    memcpy(&m_index_map[h], reinterpret_cast<INDEX_MAP_TYPE>(&bucket_pos), sizeof(SIZE_TYPE));

    // 6.
    entries_end += entry_size;
    memcpy(&m_entries_map[0], reinterpret_cast<INDEX_MAP_TYPE>(&entries_end), sizeof(SIZE_TYPE));

    // 7.
    if (((entries_end) * 8 + 1024 * 1024) > m_entries_sz && !GrowEntriesMap(10)) {
        return false;
    }
}

To prevent cluttering the code snippet with excessive comments, I’ve extracted the fundamental sections and provided more detailed explanations below:

The first value in our entries file always represents the total size of all entries contained within the file. This information is crucial because it indicates where we can begin appending our latest entry without overwriting existing data.
Given that each bucket is represented as a linked list that we traverse in reverse order, we ensure that the next pointer points to the current last entry of the bucket.
Much like our data record, our new entry will be encapsulated within a std::vector<unsigned char>, encompassing the key size, the actual key, a pointer to the previous entry (if any), and our data down pointer.
Append the newly created entry at the end of our entries file.
We update our index file to point to the key offset of our newly persisted entry in the entries file, essentially to the last entry of the bucket.
We increment the total size holder of all entries contained within the file.
After mapping a file into memory, we encounter a limitation where its size cannot be increased. To address this, we implement a check to determine if we need to expand our entries file. If expansion is necessary, we unmap it, extend its size by a specified step size, and then remap it into memory.

Let’s consider the scenario where we already have an entry for our key.

  
if (existing_key_entry_offset > 0) {  // We already have an entry for key
    // 1.
    data_next = ExtractDataDownFromEntryAtKeyOffset(existing_key_entry_offset);
    memcpy(&new_record[record_size - sizeof(OFFSET_TYPE)], &data_next, sizeof(OFFSET_TYPE));

    // 2.
    SIZE_TYPE key_len = sizeof(ENTRIES_TYPE) * (strlen(key.c_str()) + 1);
    memcpy(&m_entries_map[existing_key_entry_offset + key_len + sizeof(OFFSET_TYPE)], &data_down,
            sizeof(OFFSET_TYPE));
}

This is simple:

We read the data down pointer of the existing record in the data file and update the next pointer of our new record to point to it.
We update the existing entry to point to the data down pointer of the new record we are about to persist.

Finally, we need to persist our data record as well:

  
if (InsertIntoData(new_record, record_size)) {
    m_last_data_pos += record_size;
    return true;
} else {
    std::cerr << "Could not persist record" << std::endl;
    return false;
    }

Reading Data

Let us now direct our attention to retrieving values from our hash map.

  
bool TurboMap::GetLatestValue(const std::string &key, std::string &out) {
    OFFSET_TYPE data_offset = GetLatestDataOffsetForKey(key);
    if (data_offset >= 0) {
        GetValueFromOffset(out, data_offset);
        return true;
    }
    return false;
}

The function above invokes two additional functions. GetLatestDataOffsetForKey(...) is employed to obtain the offset of the most recent record associated with the specified key in the data file. It leverages the previously introduced function FindEntryForKeyInBucket(...) to locate the entry corresponding to the given key. Subsequently, it extracts and returns the data down pointer from the entry.

The function GetValueFromOffset(...) is dedicated to retrieving the value of the record from the data file situated at the provided offset, subsequently updating the designated output string with the retrieved value. It is responsible for deserializing the record found at the given offset within the data file. Subsequently, it returns the offset to the previous record for the key in the data file, that can be used for iterating down the value chain. If no previous record exists, it returns -1.

  
TurboMap::OFFSET_TYPE TurboMap::GetValueFromOffset(std::string &out, OFFSET_TYPE data_offset) {
    // Read data_down
    off_t pos = lseek(m_data_fd, data_offset, SEEK_SET);
    ssize_t bytes_read;
    OFFSET_TYPE buf[1];
    bytes_read = pread(m_data_fd, buf, sizeof(OFFSET_TYPE), pos);
    OFFSET_TYPE data_down = buf[0];

    bytes_read = pread(m_data_fd, buf, sizeof(OFFSET_TYPE), pos + sizeof(OFFSET_TYPE));
    OFFSET_TYPE data_next = buf[0];

    SIZE_TYPE size[1];
    bytes_read = pread(m_data_fd, size, sizeof(SIZE_TYPE), data_down);
    SIZE_TYPE value_size = size[0];

    // Read value
    SIZE_TYPE disk_record_sz = value_size + sizeof(SIZE_TYPE);
    std::vector<unsigned char> r(disk_record_sz);
    bytes_read = pread(m_data_fd, reinterpret_cast<unsigned char *>(&r[0]), disk_record_sz, data_down);
    unsigned char *off = reinterpret_cast<unsigned char *>(&r[0]);
    off += sizeof(SIZE_TYPE);
    std::string value(off, off + value_size - 1);
    out.assign(value);

    return data_next;
}

This encompasses the intricate details of our hash map. Let’s now delve into how to utilize it.

  
#include "TurboMap.h"

int main(int argc, char **argv) {
    TurboMap map("~/data/");

    std::string key = "Joe";
    std::string value = "Hello World!";

    map.InsertKeyValue(key, value);

    std::string latest;
    map.GetLatestValue(key, latest);
    std::cout << "Latest value for key '" << key << "': " << latest << std::endl;

    return 0;
}

Full Implementation

  
/* TurboMap.cpp */

#include "TurboMap.h"

#include <err.h>       // GNU C lib error messages: err
#include <fcntl.h>     // C POSIX libary file control options such as O_RDWR,  O_CREAT, O_TRUNC, etc
#include <sys/mman.h>  // memory management declarations: mmap, munmap
#include <sys/stat.h>  // stat
#include <unistd.h>    // C POSIX libary system calls: open, close

#include <iostream>  // std::cout
#include <sstream>   // std::ostringstream

TurboMap::TurboMap(const std::string &dir)
    : m_data_dir(dir),
      m_index_fname(m_data_dir + "/index.bin"),
      m_entries_fname(m_data_dir + "/entries.bin"),
      m_data_fname(m_data_dir + "/data.bin") {
    Open();
}

void TurboMap::Open() {
    if (!DoesFileExist(m_index_fname)) {
        CreateIndexFile(m_index_fname);
    } else {
        LoadIndexFile(m_index_fname);
    }

    if (!DoesFileExist(m_entries_fname)) {
        CreateEntriesFile(m_entries_fname);
    } else {
        LoadEntriesFile(m_entries_fname);
    }

    if (!DoesFileExist(m_data_fname)) {
        CreateDataFile(m_data_fname);
    } else {
        LoadDataFile(m_data_fname);
    }
}

/*
A hash function designed to generate an integer hash value for a given string key. The resulting hash is
subsequently employed as an offset in the index file.
*/
TurboMap::INDEX_TYPE TurboMap::Hash(const std::string &key) {
    INDEX_TYPE result = 0;
    for (size_t i = 0; i < key.length(); ++i) {
        result += key[i] * pow(31, i);
    }
    return result % MAX_BUCKETS;
}

/*
Returns true if file exists. Returns false otherwise.
*/
bool TurboMap::DoesFileExist(const std::string &fname) {
    struct stat buffer;
    return (stat(fname.c_str(), &buffer) == 0);
}

bool TurboMap::CreateIndexFile(const std::string &fname) {
    m_index_fd = open(fname.c_str(), O_RDWR | O_CREAT | O_TRUNC, (mode_t)0600);

    if (m_index_fd == -1) {
        perror("Error opening file for writing");
        return false;
    }

    m_index_sz = sizeof(INDEX_TYPE) * MAX_BUCKETS;

    // Stretch the file size to m_index_sz
    if (lseek(m_index_fd, m_index_sz - 1, SEEK_SET) == -1) {
        close(m_index_fd);
        perror("Error calling lseek() to 'stretch' the file");
        return false;
    }

    /* Something needs to be written at the end of the file to
     * have the file actually have the new size.
     * Just writing an empty string at the current file position will do.
     *
     * Note:
     *  - The current position in the file is at the end of the stretched
     *    file due to the call to lseek().
     *  - An empty string is actually a single '\0' character, so a zero-byte
     *    will be written at the last byte of the file.
     */
    if (write(m_index_fd, "", 1) == -1) {
        close(m_index_fd);
        perror("Error writing last byte of the file");
        return false;
    }

    // // Now the file is ready to be mmapped.
    m_index_map =
        reinterpret_cast<INDEX_MAP_TYPE>(mmap(0, m_index_sz, O_RDWR | O_CREAT | O_TRUNC, MAP_SHARED, m_index_fd, 0));

    if (m_index_map == MAP_FAILED) {
        err(1, "mmap: %s", fname.c_str());
        close(m_index_fd);
        return false;
    }

    return true;
}

bool TurboMap::CreateEntriesFile(const std::string &fname) {
    m_entries_fd = open(fname.c_str(), O_RDWR | O_CREAT | O_TRUNC, (mode_t)0600);

    if (m_entries_fd == -1) {
        perror("Error opening file for writing");
        return false;
    }

    m_entries_sz = sizeof(ENTRIES_TYPE) * ENTRIES_SIZE;

    if (lseek(m_entries_fd, m_entries_sz - 1, SEEK_SET) == -1) {
        close(m_entries_fd);
        perror("Error calling lseek() to 'stretch' the file");
        return false;
    }

    if (write(m_entries_fd, "", 1) == -1) {
        close(m_entries_fd);
        perror("Error writing last byte of the file");
        return false;
    }

    m_entries_map = reinterpret_cast<ENTRIES_MAP_TYPE>(
        mmap(0, m_entries_sz, O_RDWR | O_CREAT | O_TRUNC, MAP_SHARED, m_entries_fd, 0));

    if (m_entries_map == MAP_FAILED) {
        err(1, "mmap: %s", fname.c_str());
        close(m_entries_fd);
        return false;
    }

    // Set everything to empty
    for (int i = 0; i < ENTRIES_SIZE; ++i) {
        m_entries_map[i] = EMPTY_ENTRY;
    }

    // Pointer to where writing can continue
    SIZE_TYPE entries_end = sizeof(SIZE_TYPE);
    memcpy(&m_entries_map[0], reinterpret_cast<INDEX_MAP_TYPE>(&entries_end), sizeof(SIZE_TYPE));

    // Write it now to disk
    if (msync(m_entries_map, m_entries_sz, MS_SYNC) == -1) {
        perror("Could not sync the file to disk");
        return false;
    }

    return true;
}

bool TurboMap::CreateDataFile(const std::string &fname) {
    m_data_fd = open(fname.c_str(), O_RDWR | O_CREAT | O_TRUNC, (mode_t)0600);
    if (m_data_fd == -1) {
        perror("Error opening file for writing");
        return false;
    }
    m_last_data_pos = 0;
    return true;
}

bool TurboMap::LoadIndexFile(const std::string &fname) {
    struct stat fs;
    m_index_fd = open(fname.c_str(), O_RDWR, (mode_t)0600);
    if (m_index_fd == -1) {
        err(1, "open mmap: %s", fname.c_str());
        return false;
    }

    if (fstat(m_index_fd, &fs) == -1) {
        err(1, "stat:%s", fname.c_str());
        return false;
    }

    m_index_sz = fs.st_size;
    m_index_map =
        reinterpret_cast<INDEX_MAP_TYPE>(mmap(0, m_index_sz, PROT_READ | PROT_WRITE, MAP_SHARED, m_index_fd, 0));
    if (m_index_map == MAP_FAILED) {
        err(1, "mmap: %s", fname.c_str());
        close(m_index_fd);
        return false;
    }

    return true;
}

bool TurboMap::LoadEntriesFile(const std::string &fname) {
    struct stat fs;
    m_entries_fd = open(fname.c_str(), O_RDWR, (mode_t)0600);
    if (m_entries_fd == -1) {
        err(1, "open mmap: %s", fname.c_str());
        return false;
    }

    if (fstat(m_entries_fd, &fs) == -1) {
        err(1, "stat:%s", fname.c_str());
        return false;
    }

    /* fs.st_size could have been 0 actually */
    m_entries_sz = fs.st_size;
    m_entries_map =
        reinterpret_cast<ENTRIES_MAP_TYPE>(mmap(0, m_entries_sz, PROT_READ | PROT_WRITE, MAP_SHARED, m_entries_fd, 0));
    if (m_entries_map == MAP_FAILED) {
        err(1, "mmap: %s", fname.c_str());
        close(m_entries_fd);
        return false;
    }

    return true;
}

bool TurboMap::LoadDataFile(const std::string &fname) {
    struct stat fs;
    m_data_fd = open(fname.c_str(), O_RDWR, (mode_t)0600);
    if (m_data_fd == -1) {
        err(1, "open mmap: %s", fname.c_str());
        return false;
    }

    if (fstat(m_data_fd, &fs) == -1) {
        err(1, "stat:%s", fname.c_str());
        return false;
    }

    m_last_data_pos = lseek(m_data_fd, 0, SEEK_END);
    return true;
}

bool TurboMap::GrowEntriesMap(size_t grow_step) {
    std::cout << "Grow entries map" << std::endl;
    // grow file descriptor with ftruncate(fd, new_size);
    // unmap mapped file
    // map again with new size
    return true;
}

bool TurboMap::InsertIntoData(std::vector<unsigned char> &serialized_record, size_t sz) {
    lseek(m_data_fd, 0, SEEK_END);
    ssize_t bytes_written = write(m_data_fd, reinterpret_cast<unsigned char *>(&serialized_record[0]), sz);
    if (bytes_written == -1) {
        close(m_data_fd);
        perror("Error writing last byte of the file");
        return false;
    }
    return true;
};

/**
 * Returns key from entry at entry_offset
 */
std::string TurboMap::ExtractKeyFromEntryAtKeyOffset(const OFFSET_TYPE &key_off) {
    ENTRIES_MAP_TYPE entry_start = &m_entries_map[key_off - sizeof(SIZE_TYPE)];
    SIZE_TYPE entry_key_size;
    memcpy(&(entry_key_size), entry_start, sizeof(SIZE_TYPE));
    std::string key(entry_start + sizeof(SIZE_TYPE), entry_start + sizeof(SIZE_TYPE) + entry_key_size -
                                                         1);
    return key;
}

/**
 * Returns next pointer from entry at entry_offset
 */
TurboMap::OFFSET_TYPE TurboMap::ExtractNextFromEntryAtKeyOffset(const OFFSET_TYPE &key_off) {
    ENTRIES_MAP_TYPE entry_start = &m_entries_map[key_off - sizeof(SIZE_TYPE)];
    SIZE_TYPE entry_key_len;
    memcpy(&entry_key_len, entry_start, sizeof(SIZE_TYPE));

    OFFSET_TYPE next;
    memcpy(&next, entry_start + sizeof(SIZE_TYPE) + entry_key_len, sizeof(OFFSET_TYPE));
    return next;
}

/**
 * Returns data down pointer from entry at entry_offset
 */
TurboMap::OFFSET_TYPE TurboMap::ExtractDataDownFromEntryAtKeyOffset(const OFFSET_TYPE &key_off) {
    ENTRIES_MAP_TYPE entry_start = &m_entries_map[key_off - sizeof(SIZE_TYPE)];
    SIZE_TYPE entry_key_len;
    memcpy(&entry_key_len, entry_start, sizeof(SIZE_TYPE));

    OFFSET_TYPE data_down;
    memcpy(&data_down, entry_start + sizeof(SIZE_TYPE) + entry_key_len + sizeof(OFFSET_TYPE), sizeof(OFFSET_TYPE));
    return data_down;
}

/**
 * Returns the total size of an entry in bytes
 */
TurboMap::SIZE_TYPE TurboMap::GetSizeOfEntryAtKeyOffset(const OFFSET_TYPE &key_off) {
    ENTRIES_MAP_TYPE entry_start = &m_entries_map[key_off - sizeof(SIZE_TYPE)];
    SIZE_TYPE entry_key_len;
    memcpy(&entry_key_len, entry_start, sizeof(SIZE_TYPE));
    return sizeof(SIZE_TYPE) + entry_key_len + sizeof(OFFSET_TYPE) +
           sizeof(OFFSET_TYPE);  // sizeof(key_len) + key_len + sizeof(next) + sizeof(data_down)
}

/**
 * Returns offset of entry that matches the specified key. Return -1 if no such entry exists.
 */
TurboMap::OFFSET_TYPE TurboMap::FindEntryForKeyInBucket(const std::string &key, const OFFSET_TYPE &pos) {
    OFFSET_TYPE next = pos;
    while (next > 0) {
        std::string current_key = ExtractKeyFromEntryAtKeyOffset(next);
        if (current_key == key) {
            return next;
        }
        next = ExtractNextFromEntryAtKeyOffset(next);
    }
    return -1;
};

/**
 * Returns offset of last entry at bucket. -1 if no entry exists.
 */
TurboMap::OFFSET_TYPE TurboMap::LastEntryPositionInBucket(const OFFSET_TYPE &key_off) {
    ENTRIES_MAP_TYPE entry_start = &m_entries_map[key_off - sizeof(SIZE_TYPE)];
    SIZE_TYPE entry_key_size;
    memcpy(&(entry_key_size), entry_start, sizeof(SIZE_TYPE));
    if (entry_key_size > 0) {
        return key_off;
    }
    return -1;
}

TurboMap::SIZE_TYPE TurboMap::FillNewEntry(const std::string &key, const OFFSET_TYPE &next,
                                           const OFFSET_TYPE &data_down, std::vector<ENTRIES_TYPE> &buf) {
    SIZE_TYPE raw_data_size = sizeof(ENTRIES_TYPE) * (strlen(key.c_str()) + 1);

    buf.reserve(sizeof(SIZE_TYPE) + raw_data_size);
    buf.insert(buf.end(), reinterpret_cast<const unsigned char *>(&raw_data_size),
               reinterpret_cast<const unsigned char *>(&raw_data_size) + sizeof(SIZE_TYPE));
    const unsigned char *base(reinterpret_cast<const unsigned char *>(key.c_str()));
    buf.insert(buf.end(), base, base + raw_data_size);

    SIZE_TYPE entry_size = buf.size();

    buf.resize(entry_size + sizeof(OFFSET_TYPE));
    memcpy(&buf[entry_size], &next, sizeof(OFFSET_TYPE));
    entry_size += sizeof(OFFSET_TYPE);

    buf.resize(buf.size() + sizeof(OFFSET_TYPE));
    memcpy(&buf[entry_size], &data_down, sizeof(OFFSET_TYPE));
    entry_size += sizeof(OFFSET_TYPE);

    return entry_size;
}

TurboMap::SIZE_TYPE TurboMap::FillDataRecord(const std::string &value, const OFFSET_TYPE &data_next,
                                             std::vector<unsigned char> &buf) {
    SIZE_TYPE raw_data_size = sizeof(unsigned char) * (strlen(value.c_str()) + 1);

    buf.reserve(sizeof(SIZE_TYPE) + raw_data_size);
    buf.insert(buf.end(), reinterpret_cast<const unsigned char *>(&raw_data_size),
               reinterpret_cast<const unsigned char *>(&raw_data_size) + sizeof(SIZE_TYPE));
    const unsigned char *base(reinterpret_cast<const unsigned char *>(value.c_str()));
    buf.insert(buf.end(), base, base + raw_data_size);

    SIZE_TYPE record_size = buf.size();

    // Data down
    OFFSET_TYPE data_down = m_last_data_pos;  // points to start of data record
    buf.resize(record_size + sizeof(OFFSET_TYPE));
    memcpy(&buf[record_size], &data_down, sizeof(OFFSET_TYPE));
    record_size += sizeof(OFFSET_TYPE);

    buf.resize(buf.size() + sizeof(OFFSET_TYPE));
    memcpy(&buf[record_size], &data_next, sizeof(OFFSET_TYPE));
    record_size += sizeof(OFFSET_TYPE);
    return record_size;
}

bool TurboMap::InsertKeyValue(const std::string &key, const std::string &value) {
    // Create a new record
    std::vector<unsigned char> new_record;
    OFFSET_TYPE data_next = -1;
    SIZE_TYPE record_size = FillDataRecord(value, data_next, new_record);

    // Pointer to data file
    OFFSET_TYPE data_down = m_last_data_pos + record_size - 2 * sizeof(OFFSET_TYPE);

    // Get bucket for key
    INDEX_TYPE h = Hash(key);
    OFFSET_TYPE pos = m_index_map[h];  // Point to key of last bucket entry

    OFFSET_TYPE existing_key_entry_offset = -1;  // Assume that no entry exists

    // Iterate over bucket (linked-list) and find entry with same key, if exists
    existing_key_entry_offset = FindEntryForKeyInBucket(key, pos);

    // Insert/Update Entry
    if (existing_key_entry_offset > 0) {  // We already have an entry for key
        // Update data next in record to older record data down
        data_next = ExtractDataDownFromEntryAtKeyOffset(existing_key_entry_offset);
        memcpy(&new_record[record_size - sizeof(OFFSET_TYPE)], &data_next, sizeof(OFFSET_TYPE));

        // Update data down pointer to point to new record in data file
        SIZE_TYPE key_len = sizeof(ENTRIES_TYPE) * (strlen(key.c_str()) + 1);
        memcpy(&m_entries_map[existing_key_entry_offset + key_len + sizeof(OFFSET_TYPE)], &data_down,
               sizeof(OFFSET_TYPE));
    } else {  // We don't have an entry for key. Append a new one to end of bucket

        SIZE_TYPE entries_end;
        memcpy(&(entries_end), &m_entries_map[0], sizeof(SIZE_TYPE));

        // Next points to last bucket entry or -1. This can be a colliding entry for a different key
        OFFSET_TYPE next = pos;

        // Create a new entry
        std::vector<ENTRIES_TYPE> new_entry;
        SIZE_TYPE entry_size = FillNewEntry(key, next, data_down, new_entry);

        // Insert entry
        memcpy(&m_entries_map[entries_end], reinterpret_cast<ENTRIES_MAP_TYPE>(&new_entry[0]), entry_size);

        // Point index to latest entry
        SIZE_TYPE bucket_pos = entries_end + sizeof(SIZE_TYPE);
        memcpy(&m_index_map[h], reinterpret_cast<INDEX_MAP_TYPE>(&bucket_pos), sizeof(SIZE_TYPE));

        // Note down new total entries size
        entries_end += entry_size;
        memcpy(&m_entries_map[0], reinterpret_cast<INDEX_MAP_TYPE>(&entries_end), sizeof(SIZE_TYPE));

        // Check if we need to stretch the entry file
        if (((entries_end) * 8 + 1024 * 1024) > m_entries_sz && !GrowEntriesMap(10)) {
            return false;
        }
    }

    // Insert Data Record
    if (InsertIntoData(new_record, record_size)) {
        m_last_data_pos += record_size;
        return true;
    } else {
        std::cerr << "Could not persist record" << std::endl;
        return false;
    }
}

/**
 * Fill the out parameter with the value of the record at the given offset. Returns the data file offset to the previous
 * record. Returns -1 if no previous record exists.
 */
TurboMap::OFFSET_TYPE TurboMap::GetValueFromOffset(std::string &out, OFFSET_TYPE data_offset) {
    // Read data_down
    off_t pos = lseek(m_data_fd, data_offset, SEEK_SET);
    ssize_t bytes_read;
    OFFSET_TYPE buf[1];
    bytes_read = pread(m_data_fd, buf, sizeof(OFFSET_TYPE), pos);
    OFFSET_TYPE data_down = buf[0];

    bytes_read = pread(m_data_fd, buf, sizeof(OFFSET_TYPE), pos + sizeof(OFFSET_TYPE));
    OFFSET_TYPE data_next = buf[0];

    SIZE_TYPE size[1];
    bytes_read = pread(m_data_fd, size, sizeof(SIZE_TYPE), data_down);
    SIZE_TYPE value_size = size[0];

    // Read value
    SIZE_TYPE disk_record_sz = value_size + sizeof(SIZE_TYPE);
    std::vector<unsigned char> r(disk_record_sz);
    bytes_read = pread(m_data_fd, reinterpret_cast<unsigned char *>(&r[0]), disk_record_sz, data_down);
    unsigned char *off = reinterpret_cast<unsigned char *>(&r[0]);
    off += sizeof(SIZE_TYPE);
    std::string value(off, off + value_size - 1);
    out.assign(value);

    return data_next;
}

/**
 * Returns the data file offset of the latest entry for key. Returns -1 if no entry exists for key.
 */
TurboMap::OFFSET_TYPE TurboMap::GetLatestDataOffsetForKey(const std::string &key) {
    // Get bucket for key
    INDEX_TYPE h = Hash(key);
    OFFSET_TYPE pos = m_index_map[h];
    OFFSET_TYPE existing_entry_offset = FindEntryForKeyInBucket(key, pos);

    if (existing_entry_offset > 0) {
        return ExtractDataDownFromEntryAtKeyOffset(existing_entry_offset);
    }
    return -1;
}

/**
 * Returns data file offset of latest entry if key is present. Fills the out parameter with the value of that latest
 * data entry. Returns -1 otherwise.
 */
bool TurboMap::GetLatestValue(const std::string &key, std::string &out) {
    OFFSET_TYPE data_offset = GetLatestDataOffsetForKey(key);
    if (data_offset >= 0) {
        GetValueFromOffset(out, data_offset);
        return true;
    }
    return false;
}

void TurboMap::GetAllValues(const std::string &key, std::vector<std::string> &out) {
    std::string existing_value;
    OFFSET_TYPE data_down = GetLatestDataOffsetForKey(key);
    while (data_down >= 0) {
        data_down = GetValueFromOffset(existing_value, data_down);
        out.push_back(existing_value);
    }
}

TurboMap::~TurboMap() {
    if (munmap(m_index_map, m_index_sz) == -1) {
        perror("Error un-mmapping the file");
    }

    if (munmap(m_entries_map, m_entries_sz) == -1) {
        perror("Error un-mmapping the file");
    }

    // /* Un-mmaping doesn't close the file, so we still need to do that.*/
    close(m_index_fd);
    close(m_entries_fd);
    close(m_data_fd);
}

Final Thoughts

I trust you found this post informative. Please feel free to adapt this hash map to suit your project requirements. There are numerous enhancements that can be applied to this code, such as employing templates for more versatile key/value pairs and utilizing lambdas or policies for our hash and equality functions. However, for now, this should provide a solid foundation for understanding the core concepts. Be sure to explore the GitHub repository for this project, where I’ll be actively working on implementing these improvements in the future.

C++

This post is licensed under CC BY 4.0 by the author.

Creating a Lightning-Fast Append-Only Database with mmap in C++

Inserting Key/Value Pairs

Reading Data

Full Implementation

Final Thoughts

Trending Tags