- How do I parallelize a Python application?
- What is data parallelism?
- What is task parallelism?
- Rewrite a program in a vectorized form.
- Understand the difference between data-based and task-based parallel programming.
- Apply numba.jit to accelerate Python.
Parallelizing a Python application
To appreciate the advantages of parallelism we need an algorithm that is easy to parallelize, complex enough to take a few seconds of CPU time, understandable, and interesting enough not to scare away the learner. Estimating the value of \(\pi\) is a classical problem used to demonstrate parallel programming.
The algorithm we present is a classical demonstration of the power of Monte Carlo methods. This is a category of algorithms using random numbers to approximate exact results. This approach is simple and has a straightforward geometrical interpretation.
We can compute the value of \(\pi\) using a random number generator: we sample N points uniformly within a square and count the number M of points that fall inside the circle inscribed in that square. The ratio 4M/N then approximates \(\pi\).
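To see why this works, compare the area of the circle with that of the enclosing square. For a circle of radius \(r\) inscribed in a square of side \(2r\):
\[\frac{M}{N} \approx \frac{\text{area of circle}}{\text{area of square}} = \frac{\pi r^2}{(2r)^2} = \frac{\pi}{4}, \qquad \text{so} \qquad \pi \approx \frac{4M}{N}.\]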
Challenge: Implement the algorithm
Use only standard Python and the method random.uniform.
The function should have the following interface:
import random

def calc_pi(N):
    """Computes the value of pi using N random samples."""
    ...
    for i in range(N):
        # take a sample
        ...
    return ...
Also, make sure to time your function!
Solution
import random

def calc_pi(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)

        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return 4 * M / N
%timeit calc_pi(10**6)
676 ms ± 6.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
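As a quick sanity check (a minimal sketch; the exact digits will differ between runs), the estimate should move closer to \(\pi\) as N grows, since the Monte Carlo error shrinks roughly as \(1/\sqrt{N}\):

import math

# Larger sample sizes give estimates closer to math.pi.
for N in (10**4, 10**6):
    estimate = calc_pi(N)
    print(N, estimate, abs(estimate - math.pi))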
Before we parallelize this program, the inner function must be as efficient as we can make it. We show two techniques for doing this: vectorization using numpy, and native code generation using numba.
We first demonstrate a Numpy version of this algorithm:
import numpy as np

def calc_pi_numpy(N):
    # Simulate impact coordinates
    pts = np.random.uniform(-1, 1, (2, N))
    # Count number of impacts inside the circle
    M = np.count_nonzero((pts**2).sum(axis=0) < 1)
    return 4 * M / N
This is a vectorized version of the original algorithm. A problem written in a vectorized form becomes amenable to data parallelization, where each single operation is replicated over a large collection of data. Data parallelism contrasts with task parallelism, where different independent procedures are performed in parallel. An example of task parallelism is the pea-soup recipe in the introduction.
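As a rough illustration of the distinction (a sketch; the task functions below are made up and merely stand in for independent steps such as those in the pea-soup recipe): data parallelism applies the same operation to every element of an array, while task parallelism runs different, independent procedures at the same time, for example with concurrent.futures from the standard library.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Data parallelism: one operation replicated over many data elements.
a = np.arange(10**6)
roots = np.sqrt(a)              # the same sqrt is applied to each element

# Task parallelism: independent tasks running side by side.
def chop_vegetables():          # hypothetical task
    return "chopped"

def soak_peas():                # hypothetical task
    return "soaked"

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(chop_vegetables), pool.submit(soak_peas)]
    results = [f.result() for f in futures]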
This implementation is much faster than the ‘naive’ implementation above:
%timeit calc_pi_numpy(10**6)
25.2 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Discussion: is this all better?
What is the downside of the vectorized implementation?

- It uses more memory.
- It is less intuitive.
- It is a more monolithic approach, i.e., you cannot break it up in several parts.
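To illustrate the memory point: the vectorized version allocates the full (2, N) array of samples at once. A minimal sketch of a compromise (the chunk size of 10**5 is an arbitrary choice) processes the samples in batches, keeping memory use bounded while still benefiting from vectorization:

import numpy as np

def calc_pi_chunked(N, chunk=10**5):
    # Only a (2, chunk) array lives in memory at any time.
    M = 0
    for start in range(0, N, chunk):
        n = min(chunk, N - start)
        pts = np.random.uniform(-1, 1, (2, n))
        M += np.count_nonzero((pts**2).sum(axis=0) < 1)
    return 4 * M / N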
Challenge: Daskify
Write calc_pi_dask to make the Numpy version parallel. Compare its speed and memory performance with the Numpy version. NB: remember that the API of dask.array mimics that of Numpy.
Solution
import dask.array as da

def calc_pi_dask(N):
    # Simulate impact coordinates
    pts = da.random.uniform(-1, 1, (2, N))
    # Count number of impacts inside the circle
    M = da.count_nonzero((pts**2).sum(axis=0) < 1)
    return 4 * M / N
%timeit calc_pi_dask(10**6).compute()
4.68 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
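Dask splits the array into chunks and only evaluates them, in parallel, when compute() is called. If the default chunking does not suit your problem you can pass an explicit chunks argument; the chunk size below is an arbitrary choice for illustration:

import dask.array as da

def calc_pi_dask_chunked(N, chunk=10**5):
    # Each (2, chunk) block becomes an independent task that Dask
    # can schedule on a separate worker.
    pts = da.random.uniform(-1, 1, (2, N), chunks=(2, chunk))
    M = da.count_nonzero((pts**2).sum(axis=0) < 1)
    return 4 * M / N

# The result is lazy; compute() triggers the parallel evaluation.
pi_estimate = calc_pi_dask_chunked(10**6).compute()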
Using Numba to accelerate Python code
Numba makes it easier to create accelerated functions. You can activate it with the decorator numba.jit.
import numba

@numba.jit
def sum_range_numba(a):
    """Compute the sum of the numbers in the range [0, a)."""
    x = 0
    for i in range(a):
        x += i
    return x
Let’s time three versions of the same test. First, native Python iterators:
%timeit sum(range(10**7))
190 ms ± 3.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Second, with Numpy:
%timeit np.arange(10**7).sum()
17.5 ms ± 138 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Third, with Numba:
%timeit sum_range_numba(10**7)
162 ns ± 0.885 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Numba is orders of magnitude faster in this case! It gets this speedup with “just-in-time” compilation (JIT), that is, by compiling the Python function into machine code just before it is called, as the @numba.jit decorator indicates. Numba does not support every Python and Numpy feature, but functions written with a for-loop with a large number of iterations, like our sum_range_numba(), are good candidates.
Just-in-time compilation speedup
The first time you call a function decorated with @numba.jit, you may see no or little speedup. The function can then be much faster in subsequent calls. Also, timeit may throw this warning:
The slowest run took 14.83 times longer than the fastest. This could mean that an intermediate result is being cached.
Why does this happen? On the first call, the JIT compiler needs to compile the function. On subsequent calls, it reuses the function previously compiled. The compiled function can only be reused if the types of its arguments (int, float, and the like) are the same as at the point of compilation.
See this example, where sum_range_numba is timed once again with a float argument instead of an int:
%time sum_range_numba(10.**7)
%time sum_range_numba(10.**7)
CPU times: user 58.3 ms, sys: 3.27 ms, total: 61.6 ms
Wall time: 60.9 ms
CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 7.87 µs
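If this warm-up cost matters, Numba can also compile eagerly: passing an explicit type signature to the decorator triggers compilation at definition time instead of at the first call. A minimal sketch, assuming 64-bit integer input and output:

import numba

# Eager compilation: "int64(int64)" is compiled as soon as the function
# is defined, so the first call no longer pays the JIT cost. Arguments
# that cannot be safely converted to int64 are rejected rather than
# triggering a new compilation.
@numba.jit("int64(int64)", nopython=True)
def sum_range_eager(a):
    x = 0
    for i in range(a):
        x += i
    return x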
Challenge: Numbify calc_pi
Create a Numba version of calc_pi. Time it.
Solution
Add the @numba.jit decorator to the first ‘naive’ implementation.
@numba.jit
def calc_pi_numba(N):
    M = 0
    for i in range(N):
        # Simulate impact coordinates
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)

        # True if impact happens inside the circle
        if x**2 + y**2 < 1.0:
            M += 1
    return 4 * M / N
%timeit calc_pi_numba(10**6)
13.5 ms ± 634 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
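Numba can also parallelize loops itself: with parallel=True and numba.prange the iterations are distributed over multiple threads. A minimal sketch on the simpler sum example (scalar reductions such as x += i are supported inside prange; combining prange with random number generation has extra subtleties, so it is omitted here):

import numba

@numba.jit(nopython=True, parallel=True)
def sum_range_parallel(a):
    x = 0
    # prange marks the iterations as independent, so Numba may run them
    # on different threads; the += reduction is combined automatically.
    for i in numba.prange(a):
        x += i
    return x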
Measuring = knowing
Always profile your code to see which parallelization method works best.
numba.jit is not a magical command to solve your problems
Accelerating your code with Numba often outperforms other methods, but rewriting code to reap the benefits of Numba is not always trivial.
- Always profile your code to see which parallelization method works best.
- Vectorized algorithms are both a blessing and a curse.
- Numba can help you speed up code.