Book Summary: High Performance Python by Micha Gorelick and Ian Ozsvald

Posted on December 3, 2025
Topics: Software Engineering

Rating: 8.3/10.

A book covering some fairly advanced optimization techniques in Python: the different types of runtimes and ways to call into C; data structures that are memory efficient or improve cache locality; libraries that perform operations faster than plain Python; approaches to concurrency, multithreading, and multiprocessing; and a little about distributed systems as well. I learned a decent amount from this book, even having worked in Python for several years.

Chapter 1: Python performance involves understanding CPU, memory, communication, parallelism, and other factors. At the same time, Python is a high-level language that trades some runtime speed for developer productivity.

Chapter 2: Profiling, using Julia set generation as the running example. There are several ways to time operations: printing elapsed time directly (e.g. via a simple timing decorator), the timeit module, the system's time command (which is not Python-specific), and the built-in cProfile. cProfile is a popular profiler that reports timings at the function level, and its output can be visualized with tools such as SnakeViz; however, it offers no visibility into line-by-line execution within a function. For line-level profiling you need line_profiler.
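
To make the chapter concrete, here is a minimal timing-and-profiling sketch (my own example, not the book's code; the function busy_work is made up):

```python
import cProfile
import pstats
import time
from functools import wraps

def timefn(fn):
    """Decorator that prints wall-clock time for a single call."""
    @wraps(fn)
    def measured(*args, **kwargs):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.perf_counter() - t0:.4f}s")
        return result
    return measured

@timefn
def busy_work(n=1_000_000):
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    busy_work()
    # Function-level profile; the dump can be visualized with SnakeViz.
    cProfile.run("busy_work()", "profile.stats")
    pstats.Stats("profile.stats").sort_stats("cumulative").print_stats(5)
```

For line-level detail, line_profiler is typically run with `kernprof -l -v script.py` after decorating the target function with @profile.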

For diagnosing memory issues, memory_profiler tracks allocation line by line and over time. The dis module lets you inspect a function's bytecode; while shorter bytecode is sometimes faster, that is not always the case, so you must profile to be sure. It is important to use unit tests to maintain correctness while optimizing, and to control for confounding factors such as background tasks or the other parts of a larger program.
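
A quick illustration of bytecode inspection with dis (my sketch):

```python
import dis

def loop_concat(items):
    out = ""
    for s in items:
        out += s   # the += on strings shows up as its own bytecode op
    return out

# Print the bytecode. Fewer instructions is a hint, not a guarantee --
# profile before concluding one version is faster.
dis.dis(loop_concat)

# For line-by-line memory, memory_profiler works similarly: decorate the
# function with @profile and run: python -m memory_profiler script.py
```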

Chapter 3: Lists. Many list operations require a linear scan, and sometimes sorting once and then binary searching is faster. Lists are mutable, whereas tuples are static and immutable, which makes them faster. Lists use more memory than their element count requires because they allocate extra headroom every time they resize; tuples don't do this.
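
For example (a sketch, assuming the list is sorted once up front), the stdlib bisect module turns membership tests from O(n) scans into O(log n) binary searches:

```python
import bisect

haystack = sorted(range(0, 1_000_000, 3))

# O(n): the `in` operator scans a list linearly.
found = 999_999 in haystack

# O(log n): binary search on the already-sorted list.
def contains(sorted_list, value):
    i = bisect.bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value

found = contains(haystack, 999_999)
```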

Chapter 4: Dictionaries and sets are backed by hash tables, which allow O(1) lookup and require a hash function (usually predefined). Ideally, different items hash to different values; when collisions do occur, they must be resolved by probing for another slot, which is slow. Variable lookup itself happens through a series of dictionary lookups, which is why local variable access is faster than global or module-level access.
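
Two small sketches of these points (my examples): a custom key type needs consistent __hash__ and __eq__, and hoisting a global into a local name skips the lookup chain in hot loops:

```python
import math

# __hash__ picks the bucket; __eq__ resolves collisions within it.
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __hash__(self):
        return hash((self.x, self.y))
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)

grid = {Point(1, 2): "a"}
assert grid[Point(1, 2)] == "a"

def slow(n):
    # math.sin is re-resolved (module dict lookup) on every iteration
    return sum(math.sin(i) for i in range(n))

def fast(n, sin=math.sin):
    # sin was bound once at definition time and now reads as a local
    return sum(sin(i) for i in range(n))
```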

Chapter 5: Iterators decouple generating a sequence from processing it while saving memory compared to explicitly constructing a list: values are produced lazily, so the whole list is never materialized.
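
For instance (a sketch), a generator keeps memory flat where a list comprehension would materialize everything:

```python
def squares(n):
    for i in range(n):
        yield i * i   # produced one at a time, on demand

total = sum(squares(10_000_000))            # constant memory
# total = sum([i * i for i in range(10_000_000)])  # builds the full list first
```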

Chapter 6: Uses 2D array operations in a heat-equation simulation as the example. 2D lists in Python are slow because the data is not contiguous in memory (the perf tool shows a high number of cache misses), and plain Python has no native vectorization. Converting the code to NumPy yields a large speedup thanks to vectorized operations and memory locality, which means fewer cache misses. In-place NumPy operations are more efficient still, since they require fewer memory allocations, but they are harder to read. The numexpr library speeds NumPy up further using parallelism and cache-friendly chunking: you hand it a NumPy expression as a string and it evaluates it.
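
A sketch of in-place NumPy operations and numexpr (assumes the third-party numexpr package is installed):

```python
import numpy as np
import numexpr as ne  # pip install numexpr

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)

c = a + b             # allocates a fresh array on every call
np.add(a, b, out=a)   # in-place: reuses a's buffer, no new allocation

# numexpr compiles the string and evaluates it in cache-sized chunks,
# in parallel, without building large temporaries.
result = ne.evaluate("2*a + 3*b - 1")
```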

Pandas is a library for data frames; it has more complex types than NumPy, like nullable integers, and its columns are backed by NumPy arrays. In a least-squares example, sklearn is several times slower than a pure NumPy implementation because sklearn performs extra checks that save developer time but could be removed. Pandas performance notes: iterating over rows is more expensive than using apply, since iterrows must construct a Series object for every row. Building a Series one element at a time is also expensive, and chaining operations, where each intermediate step creates a new Series, costs more than a single function that returns only the final result.
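
A small illustration of those pandas costs (my example; the speed ordering is the point, not exact numbers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(100_000)})

# Slowest: iterrows() constructs a Series object for every row.
total = sum(row["x"] ** 2 for _, row in df.iterrows())

# Faster: apply avoids some of the per-row machinery...
total = df["x"].apply(lambda v: v ** 2).sum()

# ...but a vectorized expression on the whole column wins easily.
total = (df["x"] ** 2).sum()
```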

Chapter 7: Compiling code to C often yields large speedups. One route is Cython, which compiles Python code to C. Without annotations, much of the generated code still calls into the Python VM because operations remain dynamic, so you add Cython-specific type annotations such as cdef int; this makes the code much faster because the compiler knows the value is not dynamic (though the result is no longer a valid Python program).

Numba is a JIT that compiles individual functions and supports NumPy, but not most Python code, so you must rewrite functions to be compatible with nopython mode to get a speedup; otherwise it falls back to running them as regular Python. You can also parallelize loops with prange, which uses OpenMP.
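
A minimal Numba sketch (assumes the third-party numba package; row_sums is a made-up example):

```python
import numpy as np
from numba import njit, prange  # pip install numba

@njit(parallel=True)  # nopython mode: raises if it can't compile,
def row_sums(a):      # rather than silently running as plain Python
    out = np.zeros(a.shape[0])
    for i in prange(a.shape[0]):   # prange parallelizes the outer loop
        for j in range(a.shape[1]):
            out[i] += a[i, j]
    return out

row_sums(np.random.rand(1000, 1000))  # first call compiles; later calls are fast
```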

PyPy is a faster JIT for pure Python code that tends to lag a few CPython versions behind. There are some differences: garbage collection works differently, and it doesn't play well with C-extension code like NumPy, so it's best suited to pure Python programs.

Using the GPU: this can be done with PyTorch, which offers an interface similar to NumPy. It is great for highly parallel vector operations but less useful when the logic involves branching, and transfers between CPU and GPU memory cause performance problems, so avoid them when possible.
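
A sketch with PyTorch (assumes the torch package and, for the GPU path, a CUDA device):

```python
import torch  # pip install torch

device = "cuda" if torch.cuda.is_available() else "cpu"

a = torch.rand(4096, 4096, device=device)
b = torch.rand(4096, 4096, device=device)

c = a @ b                # matrix multiply runs on the GPU if available
total = c.sum().item()   # .item() copies one scalar back to the host --
                         # keep device<->host transfers rare and small
```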

There are several ways to call C code from Python. ctypes is the most basic interface, but it's quite low-level, since it requires manually defining mappings between C and Python types. CFFI is an easier way to call into C code: you define function signatures as C source strings. f2py generates an interface for interacting with Fortran code. The lowest-level method is writing a CPython extension module, which gives users the most seamless Python integration but is the hardest to maintain, so it's not recommended.
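
A side-by-side sketch of ctypes and CFFI calling sqrt from the C math library (assumes a Unix-like system where find_library("m") resolves; CFFI is third-party):

```python
import ctypes
import ctypes.util

# ctypes: load the library and declare the signature by hand.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.argtypes = [ctypes.c_double]
libm.sqrt.restype = ctypes.c_double
print(libm.sqrt(2.0))

# CFFI: the signature is plain C source in a string.
from cffi import FFI  # pip install cffi
ffi = FFI()
ffi.cdef("double sqrt(double x);")
libm2 = ffi.dlopen(ctypes.util.find_library("m"))
print(libm2.sqrt(2.0))
```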

Chapter 8: Asyncio provides concurrency by running other tasks while waiting for I/O to complete. This simplifies the callback systems used in earlier versions of Python. Asyncio relies on async/await syntax and on a single event loop that coordinates all of the async tasks. Earlier libraries include gevent, a simple library that lets you queue up many greenlets waiting on I/O at the same time, with a semaphore limiting how many run simultaneously (e.g. concurrent HTTP requests). Tornado was developed for web servers and puts more onto the event loop, but was designed before the modern async syntax; aiohttp is a web server built on the newer async features.
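
A minimal asyncio sketch (my example; asyncio.sleep stands in for a real network call):

```python
import asyncio

async def fetch(name, delay):
    await asyncio.sleep(delay)   # await suspends this coroutine so the
    return f"{name} done"        # event loop can run other tasks

async def main():
    # Three I/O-bound tasks run concurrently on one event loop.
    results = await asyncio.gather(fetch("a", 1), fetch("b", 1), fetch("c", 1))
    print(results)               # finishes in ~1s, not ~3s

asyncio.run(main())
```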

Handling tasks that have both I/O-bound and CPU-bound parts: one approach is to batch the I/O requests, which is often more efficient than issuing them one at a time and shrinks the proportion of time spent on I/O. Pure asyncio also works, but the CPU-bound code must defer to the event loop at least once in a while, by calling something like sleep(0), so that the I/O tasks can run.
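
A sketch of that yielding trick (my example):

```python
import asyncio

async def cpu_heavy(chunks):
    total = 0
    for i, chunk in enumerate(chunks):
        total += sum(chunk)          # CPU-bound work blocks the loop...
        if i % 100 == 0:
            await asyncio.sleep(0)   # ...so yield periodically to let
    return total                     # pending I/O tasks make progress
```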

Chapter 9: Multiprocessing lets you run CPU-bound tasks in parallel, with one interpreter per process, up to the number of CPUs; however, it tends to use a lot of memory. Multiple threads, by contrast, offer no speedup for CPU-bound tasks because of the GIL. Joblib is a replacement for the multiprocessing module that parallelizes over an iterator as input and supports caching identical calls.

A common strategy is to break the work into chunks, with some important details: randomize the sequence if tasks take varying amounts of time (e.g. checking a list of numbers to determine which ones are prime), and make the number of chunks a multiple of the number of CPUs; see the sketch below.
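
The sketch mentioned above, using primality checks as the varying-cost task (my example):

```python
import random
from multiprocessing import Pool, cpu_count

def is_prime(n):
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

if __name__ == "__main__":
    numbers = list(range(100_000, 1_000_000))
    random.shuffle(numbers)   # spread expensive items across workers
    with Pool(processes=cpu_count()) as pool:
        # chunksize tunes the trade-off: too small means IPC overhead,
        # too large means poor load balancing near the end.
        flags = pool.map(is_prime, numbers,
                         chunksize=len(numbers) // (4 * cpu_count()))
    print(sum(flags), "primes found")
```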

Queues for IPC are possible, but the communication cost is high if the unit of parallelized work is small, and you often need to send sentinel values back to the main process to indicate completion. There is also some complexity in pickling values to be sent across processes.
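
A sketch of the sentinel pattern (my example):

```python
import multiprocessing as mp

SENTINEL = None  # special value meaning "no more work"

def worker(inbox, outbox):
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)   # tell the consumer we're finished
            break
        outbox.put(item * item)    # every value is pickled both ways

if __name__ == "__main__":
    inbox, outbox = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(inbox, outbox))
    p.start()
    for i in range(5):
        inbox.put(i)
    inbox.put(SENTINEL)
    while (result := outbox.get()) is not SENTINEL:
        print(result)
    p.join()
```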

How to tell other processes in a parallel search that a value has been found and work should stop: you can use shared variables from the multiprocessing module, which incur locking costs, or Redis, which lives outside of Python, to store the shared flag. Alternatively, you can sometimes use a raw mmap'd value that any process can access as raw bytes without locking (correct if the value is only ever set and never unset). For NumPy, it's also possible for multiple processes to work on parts of a NumPy array simultaneously by creating a 1D mp.Array and pointing NumPy at it as a buffer.
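
A sketch of the mp.Array-as-NumPy-buffer trick (my example; each worker writes a disjoint slice, so no locking is needed):

```python
import multiprocessing as mp
import numpy as np

def fill(shared, start, stop):
    # Wrap the shared buffer as a NumPy array -- no copy is made.
    arr = np.frombuffer(shared.get_obj(), dtype=np.float64)
    arr[start:stop] = np.arange(start, stop)

if __name__ == "__main__":
    n = 1_000_000
    shared = mp.Array("d", n)   # flat shared array of C doubles
    mid = n // 2
    ps = [mp.Process(target=fill, args=(shared, 0, mid)),
          mp.Process(target=fill, args=(shared, mid, n))]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    result = np.frombuffer(shared.get_obj(), dtype=np.float64)
    print(result[:3], result[-3:])
```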

When a variable needs to be updated by multiple processes, the result will be incorrect without locking. Options include file-based locking, which has the advantage of working outside of Python, and mp.Value, which can be accessed from multiple processes but whose read-modify-write updates are only safe while holding its associated lock. The RawValue primitive skips synchronization entirely, which is faster but only correct when updates cannot race.
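
A sketch of the locked-increment pattern with mp.Value (my example):

```python
import multiprocessing as mp

def bump(counter, n):
    for _ in range(n):
        with counter.get_lock():   # += is a read-modify-write, so it
            counter.value += 1     # must happen under the lock

if __name__ == "__main__":
    counter = mp.Value("i", 0)     # shared int with an attached lock
    ps = [mp.Process(target=bump, args=(counter, 10_000)) for _ in range(4)]
    for p in ps:
        p.start()
    for p in ps:
        p.join()
    print(counter.value)           # reliably 40000 with the lock held
```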

Chapter 10: Clusters means running a job across multiple machines, which increases complexity substantially, e.g. dealing with failures and restarting gracefully (best to get things working robustly on one machine first). One simple option is IPython Parallel, which reuses some of the networking components behind Jupyter notebooks, lets you apply_async across multiple machines, and uses ZeroMQ behind the scenes. For Pandas, Dask parallelizes operations across cores or machines. More generally, a queue system like NSQ is a powerful model for distributed tasks: you can spin up producers or consumers as needed, and the queue system handles failures and retries if a worker goes down. The authors recommend Docker to easily replicate the environment and deploy it on a cluster.
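
A minimal Dask sketch (assumes the third-party dask package; the same code scales from local cores to a cluster by swapping the scheduler):

```python
import dask.dataframe as dd   # pip install "dask[dataframe]"
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.rand(1_000_000)})
ddf = dd.from_pandas(df, npartitions=8)   # split into 8 partitions

# Operations build a lazy task graph; compute() runs it in parallel.
print(ddf["x"].mean().compute())
```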

Chapter 11: Memory efficiency. Python lists are not memory efficient for storing primitive types like integers, since each entry is a separate object; it is much better to store them in an array.array or a NumPy array. NumPy creates large temporaries in memory during complex operations; numexpr evaluates a string expression to save memory in NumPy and Pandas. Storing a set of strings is more efficient with a trie library (like marisa-trie) than with a set or sorted list, and there is a trade-off between constructing the trie and persisting it after construction. A sparse matrix is often more memory efficient and faster when the matrix is mostly empty; scikit-learn's DictVectorizer, for example, uses a sparse representation.
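
A quick comparison of a list versus array.array (stdlib; my example):

```python
import array
import sys

n = 1_000_000
as_list = list(range(n))               # one full Python object per entry
as_array = array.array("l", range(n))  # packed C longs, one word each

print(sys.getsizeof(as_list))   # container only; each element is a
                                # separate ~28-byte int object on top
print(sys.getsizeof(as_array))  # container plus all of the data
```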

Probabilistic data structures can save a lot of memory by doing approximate operations: a Morris counter tracks how many times something has been seen using only one byte; KMinValues and HyperLogLog approximate how many unique values are in a set using hash functions. Bloom filters use hash functions to probabilistically determine whether we've seen an item before, and variants like the cuckoo filter can handle streaming, merging sets, or deletion. The choice among these structures usually comes down to the trade-off between memory and required accuracy.
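
A toy Bloom filter to show the mechanics (my sketch, not a production implementation; real use would pick m and k from the target false-positive rate):

```python
import hashlib

class BloomFilter:
    """k hashed bit positions per item over an m-bit array.
    Can return false positives, never false negatives."""
    def __init__(self, m=8 * 1024 * 1024, k=5):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("seen-this-url")
print("seen-this-url" in bf)   # True
print("brand-new-url" in bf)   # almost certainly False
```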

Chapter 12: The final chapter is a collection of guest articles from engineers across the industry, mostly from startups, about system design and optimization scenarios, plus some general advice.

Most similar books:

  • The Grid by Gretchen Bakke
  • Introduction to Computer and Network Security by Richard R. Brooks
  • High-Performance Web Apps with FastAPI by Malhar Lathkar
  • The Effective Hiring Manager by Mark Horstman
