High Performance Python
Notes from the book High Performance Python
Python is an interpreted language that is easy to learn, fast to develop in, and has many useful libraries. However, it has some performance shortcomings. We’ll discuss some of these below and how to work around them. It’s important to use a profiler and to consider the trade-off between code readability and optimization: Python places a large emphasis on readability, and heavily optimized code can become difficult to follow.
Some performance limitations
- The GIL (global interpreter lock) limits a Python process to running only one instruction at a time. In other words, even though a Python process may have access to multiple cores, only one core is executing a Python instruction at any given moment
- Python objects are not optimally laid out in memory, creating memory fragmentation that leads to CPU cache misses
- Python is dynamically typed, which means types are resolved at runtime, so the interpreter cannot make type-based optimizations ahead of time
- Python code is not compiled to optimized machine code, so it misses the optimizations a compiler usually provides
Profiling in Python
Profiling will inevitably slow down our code because of the overhead of reporting information. The trade-off to consider is the level of detail in the profiler’s output versus the speed of profiling. Timing a function is a simple way to profile our code and can be done with a decorator function or with the timeit module. The decorator has the advantage that we don’t have to scatter print statements around and delete them once we’re done profiling.
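For example, a simple timing decorator might look like the following sketch (timefn and slow_sum are illustrative names):

```python
import time
from functools import wraps

# A minimal sketch of a timing decorator: it wraps a function
# and prints how long each call took.
def timefn(fn):
    @wraps(fn)
    def measured(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.time() - start:.4f} s")
        return result
    return measured

@timefn
def slow_sum(n):
    return sum(i * i for i in range(n))

slow_sum(1_000_000)
```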
cProfile is a built-in profiling tool that hooks into the CPython virtual machine to measure the time it takes to run every function it sees. It does not let the user see what’s happening line by line. As an example, to profile a file test.py we can run $ python -m cProfile -s cumulative test.py. We can also pass a -o profile.stats flag to create a stats output file called profile.stats, which we can then analyze with the pstats module.
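A small sketch of analyzing that stats file with pstats might look like:

```python
import pstats

# Load the stats file produced by:
#   python -m cProfile -o profile.stats test.py
stats = pstats.Stats("profile.stats")
stats.strip_dirs()              # shorten file paths in the report
stats.sort_stats("cumulative")  # order by cumulative time
stats.print_stats(10)           # show the ten most expensive functions
```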
Once we’ve identified which functions are causing our code to slow down, we can use line_profiler to profile a function on a line-by-line basis. To run it, make sure line_profiler is installed, add a @profile decorator to the function we wish to profile, and run the script with $ kernprof -l -v test.py.
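As a rough illustration (the function below is hypothetical), a script prepared for line_profiler might look like this:

```python
# The @profile decorator is injected into builtins by kernprof,
# so this file is run with: kernprof -l -v test.py
@profile
def grid_sum(size):
    total = 0
    for i in range(size):
        for j in range(size):
            total += i * j   # line-by-line timings will highlight this hot loop
    return total

if __name__ == "__main__":
    grid_sum(1000)
```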
Can we use less RAM by re-writing a function to work more efficiently? Can we use more RAM to save CPU cycles by caching? The memory_profiler module measures memory usage on a line-by-line basis. Make sure memory_profiler is installed and run $ python -m memory_profiler test.py. To visualize the output, use mprof: $ mprof run test.py writes out a statistics file and $ mprof plot visualizes memory usage over time.
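A minimal sketch of a script instrumented for memory_profiler (the function is illustrative):

```python
from memory_profiler import profile

# memory_profiler's @profile decorator reports memory usage line by line
# when the script is run, e.g. with: python -m memory_profiler test.py
@profile
def build_and_drop():
    big = [n * n for n in range(1_000_000)]   # large allocation shows up here
    del big                                   # usage should fall after this line
    return None

if __name__ == "__main__":
    build_and_drop()
```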
Lists, tuples, and iterables
Differences between lists and tuples:
- lists are dynamic arrays, tuples are static arrays
- tuples are cached by the Python runtime, which means we don’t have the extra step of talking to the kernel to reserve memory every time we want to use one
- In general, use a list when we are constantly updating/changing objects; use a tuple if things are not changing in the array
Python lists by default use a sorting algorithm called Timsort (O(n log n) in the worst case). Once our list is sorted, an efficient way to search it is binary search instead of a linear scan. The bisect module can be used to efficiently add items to a list while keeping it sorted, or to find the entry closest to a given value.
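A minimal sketch of using bisect to keep a list sorted and to binary-search it:

```python
import bisect

# insort inserts while preserving order; bisect_left returns the index
# where a value would be placed, which is the basis of binary search.
numbers = [1, 4, 9, 16]
bisect.insort(numbers, 7)                 # numbers -> [1, 4, 7, 9, 16]
position = bisect.bisect_left(numbers, 9)
print(numbers, position)                  # [1, 4, 7, 9, 16] 3
```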
When we loop over a list, the list gets converted to an iterator, since the for loop requires an iterable. An iterator is an object that produces the items of a sequence one at a time, and a generator is one kind of iterator. In Python 2 we have both range and xrange. When writing a for loop, we tend to use the range function to specify the values we’d like to iterate over. However, Python 2’s range pre-creates a list of all the numbers in the specified range, stores it in RAM, and then converts the list into an iterator. This is an inefficient approach that will not scale to larger ranges if we only wish to iterate through the values once. xrange, on the other hand, returns an iterator instead of building a list (Python 3’s range behaves the same way): at any point during the iteration we only store the current value and cannot reference any other item in the sequence. A generator is a function written with the yield keyword instead of return. Each time the next value is requested, the function resumes, runs until it hits the next yield, and hands back that value; the code before the first yield therefore runs only once, when iteration begins.
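A minimal sketch of a generator that behaves like xrange:

```python
# Values are produced lazily with yield, so only the current value
# is held in memory at any point in the loop.
def lazy_range(stop):
    i = 0
    while i < stop:
        yield i      # execution pauses here until the next value is requested
        i += 1

for n in lazy_range(5):
    print(n)         # prints 0 1 2 3 4, one value at a time
```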
Generator comprehensions, similar to list comprehensions, can provide a more efficient use of memory if we don’t need to reference the list after the operation. For example, the generator version
divisible_by_three = sum((1 for n in list_of_numbers if n % 3 == 0))
feeds values into sum one at a time, as opposed to creating an intermediate list with
divisible_by_three = sum([1 for n in list_of_numbers if n % 3 == 0])
itertools provides many useful generator functions, such as islice, chain, cycle, and takewhile (and, in Python 2, lazy generator versions of map, filter, and zip named imap, ifilter, and izip).
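A small sketch combining a few itertools helpers lazily:

```python
from itertools import count, islice, takewhile

# count() yields 0, 1, 2, ... forever, islice takes a slice of it, and
# takewhile stops as soon as its predicate fails. No full list is built
# until we explicitly call list().
evens = (n for n in count() if n % 2 == 0)
first_five_evens = list(islice(evens, 5))                # [0, 2, 4, 6, 8]
below_100 = list(takewhile(lambda n: n < 100, count()))  # 0..99
print(first_five_evens, len(below_100))
```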
Matrix and vector computation
Python lists are not very efficient for matrix computations because lists store pointers to where the data can be found instead of storing the data itself. For a list of lists called grid, for example, to look up the value at grid[a][b] we first look up index a in the list grid, which returns a pointer to where that row is stored in memory. We then dereference the pointer, look up index b in the returned list, and get a pointer to where its data is stored. Finally, we dereference that pointer to get the value.
Since a list stores pointers to data instead of the actual data, the values are scattered throughout memory and cannot all be operated on at once. Each piece of data has to be moved individually to the CPU for computation, forcing the CPU to wait for the transfer to occur. The bus that transfers data from RAM to the CPU can only move contiguous chunks of memory, so data needs to be stored sequentially in RAM to be transferred efficiently. To deal with these problems we use numpy, which stores data in contiguous blocks of memory and supports vectorized operations on that data. As a result, any arithmetic we do on numpy arrays happens in chunks without us having to explicitly loop over each element. This can achieve around a 40x speed-up at times.
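A minimal sketch contrasting a plain Python loop with a vectorized numpy operation:

```python
import numpy as np

# The numpy version operates on one contiguous block of memory in bulk,
# instead of dereferencing a pointer per element as the list version does.
size = 1_000_000
py_list = list(range(size))
np_array = np.arange(size)

squared_list = [x * x for x in py_list]   # element-by-element in Python
squared_array = np_array * np_array       # single vectorized operation

assert squared_list[10] == squared_array[10] == 100
```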
Making numpy even faster with numexpr
One downfall of numpy’s optimization of vector operations is that it only occurs on one operation at a time. That is to say, when we compute A * B + C with numpy vectors, first the entire A * B operation completes and the result is stored in a temporary vector, and only then is this new vector added to C. numexpr is a module that can take an entire vector expression and compile it into very efficient code that is optimized to minimize cache misses and temporary space. It should be used only for very large matrices/vectors, larger than the CPU cache (probably around 1000x1000 or bigger), or else the overhead of numexpr may actually slow things down.
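A rough sketch of the same expression evaluated with and without numexpr (the array sizes are illustrative):

```python
import numpy as np
import numexpr as ne

# numexpr evaluates the whole expression in cache-sized chunks, avoiding
# the temporary array that plain numpy allocates for A * B before adding C.
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)
C = np.random.rand(1000, 1000)

result_numpy = A * B + C                    # builds a temporary for A * B
result_numexpr = ne.evaluate("A * B + C")   # fused, chunked evaluation

assert np.allclose(result_numpy, result_numexpr)
```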
Concurrency / Asynchronous Programming
Concurrency lets us utilize CPU time that is usually wasted waiting for I/O operations to complete.
Connecting to databases and crawling webpages are cases with slow I/O where asynchronous programming is desirable. When a program enters an I/O wait, its execution is paused and the kernel performs a context switch, saving the state of the program while the I/O request completes, then re-initializing it and resuming execution. This is quite an expensive operation. Instead of running code serially, we can re-write it to handle events so that different parts of the code run when different events happen: while one part of the code is waiting on an I/O operation to complete, another part gets a chance to execute. This paradigm is implemented using an event loop, which comes in two forms: callback functions and futures. With callbacks, a function passes its result to another function (the callback) instead of returning the value itself.
With futures, a function uses yield to pause itself and allow something else to run until the item it is waiting on becomes available.
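A rough sketch of the futures/coroutine style using Python’s standard asyncio event loop (the function names and delays are illustrative):

```python
import asyncio

# While one coroutine is paused on a simulated I/O wait, the event loop
# runs the other one instead of letting the CPU sit idle.
async def fetch(name, delay):
    await asyncio.sleep(delay)   # stand-in for a slow database or network call
    return f"{name} finished after {delay}s"

async def main():
    results = await asyncio.gather(
        fetch("db query", 1.0),
        fetch("web crawl", 2.0),
    )
    print(results)               # both complete in ~2s total, not 3s

asyncio.run(main())
```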
The multiprocessing library enables single-machine, multi-core parallelism. It uses processes to run multiple Python interpreters in parallel, allowing data to be shared between processes so that each worker can operate on one chunk of an array (see the pool sketch after the list below). In order to use multiprocessing effectively, we should:
- split jobs into independent units of work
- randomize the input if workers take varying amounts of time
- sort the work queue so that the slowest jobs go first
- think about overhead costs and make sure small workloads aren’t spending most of their time on startup costs
- consider the tradeoffs of using a pool vs. a queue. With a pool, we can split up a chunk of predefined work upfront among the available CPUs. This is less helpful for dynamic workloads, which is when a queue is useful. If the tasks have long completion times with a small amount of communication, then a queue approach might be the right answer.
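A minimal sketch of the pool approach, splitting a predefined batch of CPU-bound work (a primality check, chosen for illustration) across worker processes:

```python
from multiprocessing import Pool

# Each worker process runs its own Python interpreter, so the GIL of the
# parent process does not serialize this CPU-bound work.
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

if __name__ == "__main__":
    numbers = range(10_000, 10_200)
    with Pool(processes=4) as pool:
        flags = pool.map(is_prime, numbers)   # work is chunked across the pool
    print(sum(flags), "primes found")
```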
For multi-machine parallelism, check out dask.
Conclusion
To go even further, we could compile portions of the code down to C or even rewrite them in a lower-level language, but the goal here was to focus on what we can do natively in Python. Feedback/corrections are welcome in the form of GH issues!