High Performance Python
Notes from the book High Performance Python
Python is an interpreted language that is easy to learn, fast to develop in, and has many useful libraries. However, it has some performance shortcomings. We’ll discuss some of these below and how to work around them. It’s important to use a profiler and to consider the trade-off between code readability and optimization: Python places a large emphasis on readability, and heavily optimized code can become difficult to follow.
Some performance limitations
- The GIL (global interpreter lock) limits a Python process to running only one instruction at a time. In other words, even though a Python process may have access to multiple cores, only one core is executing a Python instruction at any given moment
- Python objects are not optimally laid out in memory, creating memory fragmentation that leads to CPU cache misses
- Python is dynamically typed, which means types are resolved at runtime, so the interpreter cannot make type-based optimizations ahead of time
- Python code is not compiled to optimized machine code, so it misses the optimizations a compiler usually provides
Profiling in Python
Profiling will inevitably slow down our code because of the overhead of reporting information. The trade-off to consider is the level of detail in the profiler’s output versus the speed of profiling. Timing a function is a simple way to profile our code and can be done with a decorator function or with the timeit module. The decorator has the advantage that we don’t have to scatter print statements around and delete them once we’re done profiling.
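For example, a simple timing decorator might look like the following sketch (timefn and slow_sum are illustrative names):

```python
import time
from functools import wraps

# A minimal sketch of a timing decorator: it wraps a function
# and prints how long each call took.
def timefn(fn):
    @wraps(fn)
    def measured(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.time() - start:.4f} s")
        return result
    return measured

@timefn
def slow_sum(n):
    return sum(i * i for i in range(n))

slow_sum(1_000_000)
```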
cProfile is a built-in profiling tool that hooks into the CPython virtual machine to measure the time it takes to run every function it sees. It does not let the user see what’s happening line by line. As an example, to profile a file test.py we can run $ python -m cProfile -s cumulative test.py. We can also pass a -o profile.stats flag to create a stats output file called profile.stats, which we can then analyze with the pstats module.
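A small sketch of analyzing that stats file with pstats might look like:

```python
import pstats

# Load the stats file produced by:
#   python -m cProfile -o profile.stats test.py
stats = pstats.Stats("profile.stats")
stats.strip_dirs()              # shorten file paths in the report
stats.sort_stats("cumulative")  # order by cumulative time
stats.print_stats(10)           # show the ten most expensive functions
```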
Once we’ve identified which functions are causing our code to slow down, we can use line_profiler to profile a function on a line-by-line basis. To run it, make sure line_profiler is installed, add a @profile decorator to the function we wish to profile, and run the script with $ kernprof -l -v test.py.
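As a rough illustration (the function below is hypothetical), a script prepared for line_profiler might look like this:

```python
# The @profile decorator is injected into builtins by kernprof,
# so this file is run with: kernprof -l -v test.py
@profile
def grid_sum(size):
    total = 0
    for i in range(size):
        for j in range(size):
            total += i * j   # line-by-line timings will highlight this hot loop
    return total

if __name__ == "__main__":
    grid_sum(1000)
```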
Can we use less RAM by re-writing a function to work more efficiently? Can we use more RAM to save CPU cycles by caching? The memory_profiler module measures memory usage on a line-by-line basis. Make sure memory_profiler is installed and run $ python -m memory_profiler test.py. To visualize the output, use mprof: $ mprof run test.py writes out a statistics file and $ mprof plot visualizes memory usage over time.
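A minimal sketch of a script instrumented for memory_profiler (the function is illustrative):

```python
from memory_profiler import profile

# memory_profiler's @profile decorator reports memory usage line by line
# when the script is run, e.g. with: python -m memory_profiler test.py
@profile
def build_and_drop():
    big = [n * n for n in range(1_000_000)]   # large allocation shows up here
    del big                                   # usage should fall after this line
    return None

if __name__ == "__main__":
    build_and_drop()
```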
Lists, tuples, and iterables
Differences between lists and tuples:
- lists are dynamic arrays, tuples are static arrays
- tuples are cached by the Python runtime, which means we don’t have the extra step of talking to the kernel to reserve memory every time we want to use one
- In general, use a list when we are constantly updating/changing objects; use a tuple if things are not changing in the array
Python lists by default use a sorting algorithm called Timsort (O(n log n) in the worst case). Once our list is sorted, an efficient way to search it is binary search instead of a linear scan. The bisect module can be used to efficiently add items to a list while keeping it sorted, or to find the entry closest to a given value.
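A minimal sketch of using bisect to keep a list sorted and to binary-search it:

```python
import bisect

# insort inserts while preserving order; bisect_left returns the index
# where a value would be placed, which is the basis of binary search.
numbers = [1, 4, 9, 16]
bisect.insort(numbers, 7)                 # numbers -> [1, 4, 7, 9, 16]
position = bisect.bisect_left(numbers, 9)
print(numbers, position)                  # [1, 4, 7, 9, 16] 3
```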
When we loop over a list, the list gets converted to an iterator, since the for loop requires an iterable. An iterator is an object that produces the items of a sequence one at a time, and a generator is one kind of iterator. In Python 2 we have both range and xrange. When writing a for loop, we tend to use the range function to specify the values we’d like to iterate over. However, Python 2’s range pre-creates a list of all the numbers in the specified range, stores it in RAM, and then converts the list into an iterator. This is an inefficient approach that will not scale to larger ranges if we only wish to iterate through the values once. xrange, on the other hand, returns an iterator instead of building a list (Python 3’s range behaves the same way): at any point during the iteration we only store the current value and cannot reference any other item in the sequence. A generator is a function written with the yield keyword instead of return. Each time the next value is requested, the function resumes, runs until it hits the next yield, and hands back that value; the code before the first yield therefore runs only once, when iteration begins.
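A minimal sketch of a generator that behaves like xrange:

```python
# Values are produced lazily with yield, so only the current value
# is held in memory at any point in the loop.
def lazy_range(stop):
    i = 0
    while i < stop:
        yield i      # execution pauses here until the next value is requested
        i += 1

for n in lazy_range(5):
    print(n)         # prints 0 1 2 3 4, one value at a time
```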
Generator comprehensions, similar to list comprehensions, can provide a more efficient use of memory if we don’t need to reference the list after the operation. For example, the generator version
divisible_by_three = sum((1 for n in list_of_numbers if n % 3 == 0))
feeds values into sum one at a time, as opposed to creating an intermediate list with
divisible_by_three = sum([1 for n in list_of_numbers if n % 3 == 0])
itertools provides many useful generator functions, such as islice, chain, cycle, and takewhile (and, in Python 2, lazy generator versions of map, filter, and zip named imap, ifilter, and izip).
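A small sketch combining a few itertools helpers lazily:

```python
from itertools import count, islice, takewhile

# count() yields 0, 1, 2, ... forever, islice takes a slice of it, and
# takewhile stops as soon as its predicate fails. No full list is built
# until we explicitly call list().
evens = (n for n in count() if n % 2 == 0)
first_five_evens = list(islice(evens, 5))                # [0, 2, 4, 6, 8]
below_100 = list(takewhile(lambda n: n < 100, count()))  # 0..99
print(first_five_evens, len(below_100))
```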
Matrix and vector computation
Python lists are not very efficient for matrix computations because lists store pointers to where the data can be found instead of storing the data itself. For a list of lists called grid, for example, to look up the value at grid[a][b] we first look up index a in the list grid, which returns a pointer to where that row is stored in memory. We then dereference the pointer, look up index b in the returned list, and get a pointer to where its data is stored. Finally, we dereference that pointer to get the value.
Since a list stores pointers to data instead of the actual data, the values are scattered throughout memory and cannot all be operated on at once. Each piece of data has to be moved individually to the CPU for computation, forcing the CPU to wait for the transfer to occur. The bus that transfers data from RAM to the CPU can only move contiguous chunks of memory, so data needs to be stored sequentially in RAM to be transferred efficiently. To deal with these problems we use numpy, which stores data in contiguous blocks of memory and supports vectorized operations on that data. As a result, any arithmetic we do on numpy arrays happens in chunks without us having to explicitly loop over each element. This can achieve around a 40x speed-up at times.
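A minimal sketch contrasting a plain Python loop with a vectorized numpy operation:

```python
import numpy as np

# The numpy version operates on one contiguous block of memory in bulk,
# instead of dereferencing a pointer per element as the list version does.
size = 1_000_000
py_list = list(range(size))
np_array = np.arange(size)

squared_list = [x * x for x in py_list]   # element-by-element in Python
squared_array = np_array * np_array       # single vectorized operation

assert squared_list[10] == squared_array[10] == 100
```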
Making numpy even faster with numexpr
One downfall of numpy’s optimization of vector operations is that it only occurs on one operation at a time. That is to say, when we compute A * B + C with numpy vectors, first the entire A * B operation completes and the result is stored in a temporary vector, and only then is this new vector added to C. numexpr is a module that can take an entire vector expression and compile it into very efficient code that is optimized to minimize cache misses and temporary space. It should be used only for very large matrices/vectors, larger than the CPU cache (probably around 1000x1000 or bigger), or else the overhead of numexpr may actually slow things down.
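A rough sketch of the same expression evaluated with and without numexpr (the array sizes are illustrative):

```python
import numpy as np
import numexpr as ne

# numexpr evaluates the whole expression in cache-sized chunks, avoiding
# the temporary array that plain numpy allocates for A * B before adding C.
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)
C = np.random.rand(1000, 1000)

result_numpy = A * B + C                    # builds a temporary for A * B
result_numexpr = ne.evaluate("A * B + C")   # fused, chunked evaluation

assert np.allclose(result_numpy, result_numexpr)
```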
Concurrency / Asynchronous Programming
Concurrency lets us utilize CPU time that is usually wasted waiting for I/O operations to complete.
Connecting to databases and crawling webpages are cases with slow I/O where asynchronous programming is desirable. When a program enters an I/O wait, its execution is paused and the kernel performs a context switch, saving the state of the program while the I/O request completes, then re-initializing it and resuming execution. This is quite an expensive operation. Instead of running code serially, we can re-write it to handle events so that different parts of the code run when different events happen: while one part of the code is waiting on an I/O operation to complete, another part gets a chance to execute. This paradigm is implemented using an event loop, which comes in two forms: callback functions and futures. With callbacks, a function passes its result to another function (the callback) instead of returning the value itself.
With futures, a function uses yield to pause itself and allow something else to run until the item it is waiting on becomes available.
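A rough sketch of the futures/coroutine style using Python’s standard asyncio event loop (the function names and delays are illustrative):

```python
import asyncio

# While one coroutine is paused on a simulated I/O wait, the event loop
# runs the other one instead of letting the CPU sit idle.
async def fetch(name, delay):
    await asyncio.sleep(delay)   # stand-in for a slow database or network call
    return f"{name} finished after {delay}s"

async def main():
    results = await asyncio.gather(
        fetch("db query", 1.0),
        fetch("web crawl", 2.0),
    )
    print(results)               # both complete in ~2s total, not 3s

asyncio.run(main())
```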
The multiprocessing library enables single-machine, multi-core parallelism. It uses processes to run multiple Python interpreters in parallel, allowing data to be shared between processes so that each worker can operate on one chunk of an array (see the pool sketch after the list below). In order to use multiprocessing effectively, we should:
- split jobs into independent units of work
- randomize the input if workers take varying amounts of time
- sort the work queue so that the slowest jobs go first
- think about overhead costs and make sure small workloads aren’t spending most of their time on startup costs
- consider the tradeoffs of using a pool vs. a queue. With a pool, we can split up a chunk of predefined work upfront among the available CPUs. This is less helpful for dynamic workloads, which is when a queue is useful. If the tasks have long completion times with a small amount of communication, then a queue approach might be the right answer.
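A minimal sketch of the pool approach, splitting a predefined batch of CPU-bound work (a primality check, chosen for illustration) across worker processes:

```python
from multiprocessing import Pool

# Each worker process runs its own Python interpreter, so the GIL of the
# parent process does not serialize this CPU-bound work.
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

if __name__ == "__main__":
    numbers = range(10_000, 10_200)
    with Pool(processes=4) as pool:
        flags = pool.map(is_prime, numbers)   # work is chunked across the pool
    print(sum(flags), "primes found")
```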
For multi-machine parallelism, check out dask.
Conclusion
To go even further, we could compile portions of the code down to C or even rewrite them in a lower-level language, but the goal here was to focus on what we can do natively in Python. Feedback/corrections are welcome in the form of GH issues!