Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement by Luc Perkins, Eric Redmond, and Jim R. Wilson
Book exploring seven different, mostly NoSQL databases. Each database gets about 40-50 pages and is split into three days: the first day covers basic operations like setup, inserting, querying, and deleting; the second day goes into more advanced operations like aggregation or some unique features of the database; the third day addresses multi-node setups and how the database handles replication and network failure.
The first edition of the book was published in 2012, so a few of the databases are quite obscure today; relatively few developers still reach for HBase or CouchDB. In the second edition, the authors removed Riak (probably the most obscure database in the first edition) and replaced it with DynamoDB. Overall, in a short span of pages, each chapter aims to cover the most important concepts of its database. This is enough to showcase the strengths and weaknesses and how each one offers a unique way to work with data.
PostgreSQL is often described as the most advanced open-source SQL database, and it will be familiar to anyone who has worked with relational databases like MySQL or SQLite. Its main advantage is query flexibility: you can design your data schema without worrying about how you’re going to query it later.
PostgreSQL is suitable for typical business applications: CRUD (Create, Read, Update, Delete), which encompasses most business logic. Each row should have a unique primary key that identifies it. Referential integrity helps maintain the table’s structure, like preventing duplicate primary keys or ensuring foreign key references actually exist.
One feature of relational databases is their ability to join tables together. An inner join, the most common type, returns rows only when a matching entry exists in both tables. An outer join returns all rows from one table (or both), filling in NULLs for columns where the other table has no matching record. Indexes can be created to speed up data retrieval from a column and can be based on B-trees or hash tables.
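The join behaviors above can be sketched with Python's built-in sqlite3 module (the semantics carry over to PostgreSQL; the tables and data here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cities (name TEXT PRIMARY KEY, country TEXT);
    CREATE TABLE venues (name TEXT, city TEXT REFERENCES cities(name));
    INSERT INTO cities VALUES ('Portland', 'USA'), ('Tokyo', 'Japan');
    INSERT INTO venues VALUES ('Crystal Ballroom', 'Portland'),
                              ('Powell''s Books', 'Portland'),
                              ('Mystery Spot', NULL);
""")

# Inner join: only venues whose city exists in both tables.
inner = conn.execute("""
    SELECT v.name, c.country FROM venues v
    JOIN cities c ON v.city = c.name
""").fetchall()
print(inner)  # 'Mystery Spot' is dropped

# Left outer join: every venue appears; unmatched city columns become NULL.
outer = conn.execute("""
    SELECT v.name, c.country FROM venues v
    LEFT JOIN cities c ON v.city = c.name
""").fetchall()
print(outer)  # 'Mystery Spot' appears with country = None

# An index on the join column speeds up lookups on large tables.
conn.execute("CREATE INDEX venues_city_idx ON venues (city)")
```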
Aggregate functions like count, min, or max compute a single value over a set of rows. More advanced window functions, using PARTITION BY, group rows into partitions and run the aggregation within each partition while keeping the individual rows; GROUP BY, by contrast, collapses each group into a single row.
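A minimal contrast of GROUP BY and PARTITION BY, again using sqlite3 (which supports window functions since SQLite 3.25), with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (city TEXT, attendees INTEGER);
    INSERT INTO events VALUES ('Portland', 100), ('Portland', 300),
                              ('Tokyo', 500);
""")

# GROUP BY: one output row per city.
grouped = conn.execute(
    "SELECT city, COUNT(*) FROM events GROUP BY city ORDER BY city"
).fetchall()
print(grouped)  # [('Portland', 2), ('Tokyo', 1)]

# Window function: the per-city count is attached to every row.
windowed = conn.execute(
    "SELECT city, attendees, COUNT(*) OVER (PARTITION BY city) FROM events"
).fetchall()
print(windowed)  # 3 rows, each carrying its partition's count
```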
Transactions ensure that your database maintains a consistent state: either all commands are executed or none, and any partial transaction is rolled back. Stored procedures let you create functions that run within the database, but this can result in tighter coupling between the application and the database. Triggers can be set up to execute a procedure based on an event, such as a row update.
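The all-or-nothing behavior can be sketched with sqlite3; the table and the deliberately failing statement are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'alice'")
    # Duplicate primary key -> IntegrityError, so the whole transfer is undone.
    conn.execute("INSERT INTO accounts VALUES ('alice', 999)")
    conn.commit()
except sqlite3.IntegrityError:
    conn.rollback()

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # the partial UPDATE was rolled back
```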
Creating a view lets you present the result of a query as if it were a table, concealing the specifics of the underlying tables. Rules allow queries to be rewritten before execution, and a view is a special case of a rule.
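A minimal view example in sqlite3 (PostgreSQL views behave the same at this level; names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE logs (event TEXT, level TEXT);
    INSERT INTO logs VALUES ('boot', 'INFO'), ('disk full', 'ERROR');
    -- The view looks like a table but stores no data of its own.
    CREATE VIEW errors AS SELECT event FROM logs WHERE level = 'ERROR';
""")

rows = conn.execute("SELECT * FROM errors").fetchall()
print(rows)
```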
PostgreSQL also offers text-search features, from basic wildcard matching with the LIKE keyword to regular expressions, Levenshtein distance, trigrams, and phonetic matching, all useful for handling partial matches and typos. It also provides simple NLP features to handle stemming (useful for plurals) and remove stopwords. With extensions such as pgvector, it can even store embeddings and retrieve rows by vector distance.
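As an example of the fuzzy-matching tools above, here is a pure-Python sketch of Levenshtein distance, the standard dynamic-programming algorithm (PostgreSQL exposes it as levenshtein() in the fuzzystrmatch extension):

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```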
HBase is a column-oriented NoSQL database modeled on Google’s Bigtable. It’s recommended when your data is big enough to require at least five nodes to store it, and it’s a bad idea for data on the scale of gigabytes rather than terabytes or petabytes.
The basic structure of HBase is centered on a row key. Rows and columns in HBase differ from those in other databases; you can think of each row as its own small key-value store: the row key locates it, and each column name acts as a key to a value. There aren’t any predefined data types, so values are just blobs of bytes.
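The row-key-centric layout can be modeled as nested dictionaries; this toy sketch (with an invented wiki example in the spirit of the book's) ignores HBase's per-cell timestamps and versioning:

```python
from collections import defaultdict

# row key -> column family -> qualifier -> value (raw bytes, no types)
table = defaultdict(lambda: defaultdict(dict))

# Roughly the spirit of: put 'wiki', 'Home', 'text:', 'Welcome!'
table["Home"]["text"][""] = b"Welcome!"
table["Home"]["revision"]["author"] = b"alice"

# Reading a cell is a lookup by row key, then family and qualifier.
print(table["Home"]["text"][""])
```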
You can modify a table’s columns with the HBase shell, a JRuby-based command-line tool, and bulk-insert rows by running JRuby scripts in that shell. One use case is indexing Wikipedia pages with their content and links. HBase uses a bloom filter to quickly check whether a row already exists. When a region grows too large, HBase automatically splits it and redistributes the data across nodes.
You can operate HBase on the cloud using AWS Elastic MapReduce (EMR) clusters. Its strength lies in scaling up to many nodes, but its capabilities are somewhat limited, so it’s not the best choice for smaller data sets.
MongoDB is in some ways the opposite of SQL: while SQL forces your data to fit a neatly defined schema, MongoDB makes very few assumptions about the form of the data you want to store (there is no fixed schema). This offers flexibility, but you might face issues later on: MongoDB won’t complain about typos in field names, for example.
When running MongoDB on multiple nodes, there’s always one primary, and the rest are replicas. If the primary becomes unreachable, for example during a network partition, the remaining nodes automatically elect a new, unique primary. An odd number of nodes is recommended so that a majority always exists to elect one. MongoDB also features GridFS, a distributed file system.
CouchDB is designed to be robust and consistent. It uses an append-only storage model, where for each update, you need to supply the previous revision of the document, and if the update doesn’t match (because somebody else wrote to the document before you), the update will fail since the revisions don’t match.
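A toy in-memory model of this revision check (real CouchDB does it over HTTP with _rev fields; the revision format here is invented):

```python
class ConflictError(Exception):
    pass

class TinyCouch:
    """Optimistic concurrency: an update must name the revision it was based on."""
    def __init__(self):
        self.docs = {}  # doc_id -> (rev, body)

    def put(self, doc_id, body, rev=None):
        current = self.docs.get(doc_id)
        if current and current[0] != rev:
            raise ConflictError("revision mismatch")
        n = int(rev.split("-")[0]) + 1 if rev else 1
        new_rev = f"{n}-x"
        self.docs[doc_id] = (new_rev, body)
        return new_rev

db = TinyCouch()
rev1 = db.put("band", {"name": "Beatles"})
rev2 = db.put("band", {"name": "The Beatles"}, rev=rev1)  # ok: based on rev1
try:
    db.put("band", {"name": "Fab Four"}, rev=rev1)  # stale revision
except ConflictError as err:
    print("update rejected:", err)
```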
It offers a RESTful interface for adding, modifying, and retrieving documents: you can make HTTP requests directly, passing query parameters to specify filtering options, or use a driver library (e.g., for Ruby). A view is implemented as a function that runs over all documents, applies a filter, and emits documents after transforming the data. More advanced views can add a reducer in a MapReduce-like framework, where the reducer processes all of the rows that share a key.
You can monitor the database for changes, either in a streaming fashion (e.g., long polling or a continuous feed) or with one-off requests that catch only the latest changes. CouchDB supports multi-master replication: instead of a single master with replicas, all nodes act as masters. When there’s a conflict, a deterministic algorithm chooses the winner; the losing state (the one not chosen in a conflict) is saved and can be recovered later using application-specific logic.
Neo4j is a graph database, so instead of rows of data in a table, it represents data as either nodes or relationships between nodes. Each node or relationship can have an arbitrary amount of other data attached to it. You can interact with the database using the Cypher language, where each statement has a pattern-matching syntax followed by instructions on what to do with nodes that match a given pattern, e.g., retrieving nodes that have a certain label. The database is mostly schemaless, but you can create an index to quickly retrieve nodes by a field or to ensure that a field is unique. Cypher is not the only way to interact with the database; you can also make HTTP requests or some combination of HTTP and Cypher.
The main strength of Neo4j is its ability to do complex pattern matching; e.g., the star pattern can match multiple hops between nodes, or you can find all nodes within a certain distance of a root node, without writing any graph-traversal logic. An example is finding what percentage of the graph is within two degrees of separation from a given node, which can be done in a single query.
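What such a query computes can be sketched in plain Python with a breadth-first search over a small invented graph; Neo4j expresses the same thing declaratively in Cypher:

```python
from collections import deque

# Undirected adjacency list for a small invented graph.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}

def within_hops(graph, root, max_hops):
    """Return the nodes reachable from root in at most max_hops edges."""
    seen = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue
        for nbr in graph[node]:
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    return set(seen) - {root}

reachable = within_hops(graph, "A", 2)
fraction = len(reachable) / (len(graph) - 1)
print(reachable, fraction)  # E is three hops away, so it is excluded
```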
For replication, Neo4j uses a single master pattern. You can spin up slaves to increase the capacity for reads but not writes. When the master goes down, the rest are able to automatically elect a new master. There is no sharding, so each node has a full copy of all the data.
DynamoDB is part of AWS, and in contrast to other DBs in the book, it is fully managed on the cloud, so you avoid database ops. This means a lot of its internals are a black box, and you just have to trust that it works. It is suitable for massive data on the scale of Amazon and other tech giants. In return, you give up a lot of query flexibility, so there are no concepts like joins, and many design decisions must be made upon database creation and cannot be reversed.
The data schema primarily uses a key-value system. You retrieve an item by a partition key or hash key. This has to be something that the application knows in advance, like the user ID. A second type of key is the range key or sort key, which allows you to do range queries, for example, retrieving all of the records within a time range. You can also create a composite key consisting of a partition key and a range key.
The partition key should be evenly distributed, since it determines how your data gets stored and spread among storage nodes. Beyond these keys, there are limited options for retrieving items. Local secondary indexes (LSIs) let you query within a partition by an attribute other than the range key, but they cannot be added or modified after the table has been created. Global secondary indexes (GSIs) let you retrieve items by a key other than the partition key and can be added after table creation. When reading data, you can choose strongly consistent reads at the expense of latency, or faster reads at the expense of slightly stale data.
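A toy model of the composite key: items live in a partition chosen by the partition key and are kept sorted by the sort key, which is what makes range queries cheap. All names and data here are invented:

```python
import bisect

table = {}  # partition key -> list of (sort_key, item), kept sorted

def put_item(pk, sk, item):
    # Assumes distinct sort keys within a partition (enough for this toy).
    partition = table.setdefault(pk, [])
    keys = [k for k, _ in partition]
    partition.insert(bisect.bisect_left(keys, sk), (sk, item))

def query_range(pk, lo, hi):
    """All items in one partition with lo <= sort key <= hi."""
    partition = table.get(pk, [])
    keys = [k for k, _ in partition]
    return [item for _, item in partition[bisect.bisect_left(keys, lo):
                                          bisect.bisect_right(keys, hi)]]

put_item("sensor-1", "2020-01-01T00:00", {"temp": 20})
put_item("sensor-1", "2020-01-01T00:01", {"temp": 21})
put_item("sensor-1", "2020-01-02T00:00", {"temp": 25})

day_one = query_range("sensor-1", "2020-01-01", "2020-01-01T23:59")
print(day_one)  # the two readings from Jan 1
```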
DynamoDB is rarely used by itself but is often chained together with other AWS services like Lambda and Kinesis. For example, IoT devices that report sensor values once every second can stream data into Kinesis, which then triggers a Lambda function that stores the data into DynamoDB. For more complex queries, you can set up a pipeline, import the data into Athena, and then query it with SQL.
Redis is an in-memory database that is primarily a key-value store, but it also supports more sophisticated data structures. The most basic operations are setting and getting a value by key; you can group multiple commands into a transaction (MULTI/EXEC) so they are sent and executed together, which also cuts down on round trips. This speed makes Redis useful as a caching mechanism, which is one of its most common use cases.
There are more complex data types as well: hashes store field-value pairs under a single key, but they can only nest one level deep. Lists can simulate a stack or a queue, and blocking list operations make a listener wait until a new item is inserted, which can be useful for a producer-consumer setup. There’s also an explicit publish-subscribe system where a single publisher may publish a message and have multiple subscribers receive it.
Sets can be either sorted or unsorted. Sorted sets let you do range queries and retrieve the top-scoring items, which is useful for a leaderboard system. Each type of data structure has a unique prefix to its command, so you can easily identify which structure it is operating on. A bloom filter lets you quickly check approximately whether an item exists, or you can also manipulate bits directly in the underlying data structure.
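A toy stand-in for the sorted-set leaderboard (real Redis commands would be ZADD and ZREVRANGE; this in-memory model is only for intuition):

```python
class TinyZSet:
    """Minimal sketch of sorted-set semantics: members carry a score
    and can be read back in score order."""
    def __init__(self):
        self.scores = {}

    def zadd(self, member, score):
        self.scores[member] = score

    def top(self, n):
        """Highest-scoring members first, like ZREVRANGE 0 n-1."""
        ranked = sorted(self.scores.items(), key=lambda kv: -kv[1])
        return [m for m, _ in ranked[:n]]

board = TinyZSet()
board.zadd("alice", 120)
board.zadd("bob", 300)
board.zadd("carol", 210)
print(board.top(2))  # ['bob', 'carol']
```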
Whenever you set a value, you may also set an expiry time, after which the key automatically unsets itself. There are also multiple database namespaces, numbered from 0; this is useful for running different applications on the same server while keeping their keys separate.
Redis has a few different options for persistence: it can live entirely in memory or snapshot to disk at regular intervals (e.g., every minute, every second, or after every 1,000 writes). Overall this is considered less durable than other databases, so storing critical data only in Redis is not advised. A cluster with slave replicas can be spun up easily to handle more read traffic, with the client responsible for deciding which replica to hit based on the key’s hash.
The final example in the book ties together several databases: it uses Redis to transform data to a format that is more useful to the application since it’s a fast database. The data is stored in CouchDB as a source of truth, then finally it is imported into Neo4j because the data consists of bands and band members, which forms a natural graph relationship.