SQL Antipatterns: Avoiding the Pitfalls of Database Programming by Bill Karwin
Book about different ways to misuse database schemas, intended for intermediate-level programmers, discusses various common patterns that junior programmers often employ when designing database schemas. I found the anti-patterns to be good, but the proposed solutions are often lacking; they tend to be clumsy, and when multiple solutions are proposed, there is unclear discussion of the trade-offs, with generally not much detail provided. Also, all the examples of non-SQL code are in PHP, not a more modern language.
Chapter 2. Storing multiple values in one row, separated by a comma, is problematic; it’s difficult to perform queries or updates, and there’s a risk of running out of space if the number of characters is limited. It’s better to create an intersection table, ensuring that each row stores only one value.
Chapter 3. There are various ways to represent tree structures in SQL. The adjacency list method involves linking to the parent node; however, querying all descendants of a node can be challenging. Path enumeration involves storing a unique path for each node and keeping the entire path in a field; this approach allows for clear identification of all descendants by the prefix of a given path. Nested sets are a clever method that represents a node’s position using two numbers: left and right. This approach is fast for querying subtrees but becomes quite complex for other operations, like inserting a node. Finally, the closure table method uses a new table where each row represents an edge. Instead of storing only the parent-child relationships, it includes all relationships between a parent and all its descendants, including itself. This method requires more data but is generally the simplest for all operations.
Chapter 4. Many developers opt for a pseudokey on all tables, creating a column named ‘id’ with automatically generated integers that serve as the primary key; however, this approach isn’t always appropriate. In some cases, it might be better to use a natural key or a compound key, which consists of several columns, as the primary key.
Chapter 5: Foreign key constraints exist to help you avoid mistakes when one row depends on another; otherwise, bugs might cause data inconsistencies or necessitate a script to find orphaned rows. When updating or deleting, you can cascade changes to child rows, restrict the operation, or set a default value.
Chapter 6: It’s problematic to have a table with a name column and a value column (EAV pattern), as it’s difficult to ensure that a value exists and is of the correct type. This issue usually arises when different documents have varying properties. There are several ways to avoid this: one approach is to have a table for each subtype, but this tends to duplicate logic. Another design involves having a common table and additional tables for columns that don’t always exist, linked with a foreign key to the base table; you then perform an outer join to gather all the data. A third option is to use semi-structured data, storing optional data in a text field; however, this limits the options for querying this additional data.
Chapter 7: Avoid using a column that has a polymorphic foreign key pointing to multiple tables depending on the row, as this is not supported in SQL. The solution involves using intersection tables; if a table may join with multiple child tables, create two intersection tables. Then, to merge these tables, you can use a union operation or coalesce. Another option is to create two columns, each with a foreign key to one of the child tables.
Chapter 8: It’s problematic to have multiple columns containing the same data, such as tag1, tag2, tag3, because this leads to complexity in deciding which column to use. Instead, create a new table with a foreign key to the original, modeling a one-to-many relationship.
Chapter 9: Avoid creating new tables based on data characteristics, such as making a new table for each year of information, to avoid large tables. This approach makes it difficult to keep data in the correct table, especially around boundary cases, and creates additional work. Instead, many databases support horizontal partitioning, which splits a table by rows, or vertical partitioning, which splits it by columns. Another strategy for saving space is to store large columns, like large block data, in a new table that is joined with a foreign key.
Chapter 10: Floating-point values are subject to various kinds of rounding errors and precision issues. Consider using the Numeric type, which has a fixed number of digits, or the Decimal type, which is always exact.
Chapter 11: Enums are problematic because it’s difficult to add and remove options consistently. Instead, recommend creating a new table with all the possible options and a foreign key to this table.
Chapter 12: A common pattern is storing a file, like an image, on a drive and only storing the path in the database. However, this risks getting out of sync with the database in cases of deletion, rollback, etc. An alternative is using a BLOB, which stores the file in the database directly.
Chapter 13: Creating an excessive number of indices without a clear understanding of their impact is not advisable. Instead, it’s important to understand which queries utilize which indices to avoid over-indexing. Measure the performance to identify slow queries and use the ‘EXPLAIN’ command to show the execution plan. Then, create indices based on this analysis.
Chapter 14: The concept of the ‘NULL’ value is counterintuitive and can lead to misuse. For instance, any comparison with a ‘NULL’ value always results in ‘NULL’, and is never ‘TRUE’ or ‘FALSE’. Even ‘NULL = NULL’ in SQL returns ‘NULL’ and not ‘TRUE’. To handle cases where data might be ‘NULL’, use special equality queries like ‘IS NULL’ and ‘IS DISTINCT FROM’.
Chapter 15: Some operations, like max over grouped rows, could be incorrect if the column for max is different from the group column. Most databases will raise an error under this condition, except for MySQL and SQLite. There are various solutions to work around this, each with its own performance and complexity trade-offs.
Chapter 16: A quick way to select a random row is to order by rand, but this is very slow because it requires a full table scan. There are various methods to sample a random row, such as getting the total number of rows, then selecting one with a random offset.
Chapter 17: Full text search is slow without an index, as it must perform a full table scan. Various types of SQL databases have their own methods of indexing text, including inverted index engines that are specialized for this purpose.
Chapter 18: It is tempting to structure a very complex query, but at some point, it can become hard to debug. It’s better to use multiple queries to return results when you need them or to have an external script that makes multiple queries.
Chapter 19: It’s not good practice to have implicit columns when using ‘select’ and ‘insert’ because mismatches can occur when a column is deleted or added. It’s better to name the columns explicitly, even though it requires more typing.
Chapter 20: Never store passwords in plain text. Instead, add a random salt to the password, then hash it with the salt and store both the hash and the salt in the database. Use a reset email rather than sending the password to the user.
Chapter 21: SQL injection is common when using string concatenation to dynamically construct queries. You need a quoting or escaping mechanism that is robust against any user input.
Chapter 22: Don’t try to fix gaps in pseudokeys; it’s okay for them to be either incrementing with gaps or just random.
Chapter 23: It’s bad practice to ignore error codes when dealing with databases, as this can make debugging more difficult.
Chapter 24: Document your tables and constraints; use migration scripts and seed data, and conduct some testing.
Chapter 25: When using MVC to interact with the database, avoid having classes that correspond directly to database rows. They should be at a higher level, corresponding to business logic, and a single method can interact with multiple databases to provide proper abstraction.
Appendix: Different types of normal forms and database normalization. In a relational database, rows and columns do not have any order, and duplicate rows are not allowed. Every column must have a consistent type.
1NF: A row cannot have any repeating columns, whether separated by a comma or in multiple columns with similar names.
2NF: Mostly the same as 1NF, but in cases with a compound primary key, a column cannot be associated with only one part of the primary key.
3NF: This forbids storing data that is unrelated to the primary key and duplicated elsewhere, e.g., storing a user’s name and email in a table that is not the user table.
4NF: Forbids intersection tables that intersect more than two tables at once.
Each level of normal form offers a stricter guarantee. It goes up to 6NF, but this level is rarely used because it requires a large number of tables and joins.