I came across a function called parallel in fastcore (the foundational library behind fastai), and it seems very interesting.

A Simple Example

from fastcore.all import parallel
from nbdev.showdoc import doc
doc(parallel)

parallel[source]

parallel(f, items, *args, n_workers=8, total=None, progress=None, pause=0, **kwargs)

Applies func in parallel to items, using n_workers


As the documentation states, the parallel function can run any Python function f over items using multiple workers, and collect the results.

Let's try a simple example:

import time

def f(x):
  time.sleep(1)
  return x * 2

numbers = list(range(10))
%%time

list(map(f, numbers))
print()
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 10 s
%%time

list(parallel(f, numbers))
print()
CPU times: user 32 ms, sys: 52 ms, total: 84 ms
Wall time: 2.08 s

The function f in this example is very simple: it sleeps for one second and then returns x * 2. Executed serially, it takes 10 seconds, which is exactly what we expect. With multiple workers (8 by default), it takes only about 2 seconds.

Dig into the Implementation

Let's see how parallel is implemented:

parallel??
Signature:
parallel(
    f,
    items,
    *args,
    n_workers=8,
    total=None,
    progress=None,
    pause=0,
    **kwargs,
)
Source:   
def parallel(f, items, *args, n_workers=defaults.cpus, total=None, progress=None, pause=0, **kwargs):
    "Applies `func` in parallel to `items`, using `n_workers`"
    if progress is None: progress = progress_bar is not None
    with ProcessPoolExecutor(n_workers, pause=pause) as ex:
        r = ex.map(f,items, *args, **kwargs)
        if progress:
            if total is None: total = len(items)
            r = progress_bar(r, total=total, leave=False)
        return L(r)
File:      /opt/conda/lib/python3.7/site-packages/fastcore/utils.py
Type:      function
??ProcessPoolExecutor
Init signature:
ProcessPoolExecutor(
    max_workers=8,
    on_exc=<built-in function print>,
    pause=0,
    mp_context=None,
    initializer=None,
    initargs=(),
)
Source:        
class ProcessPoolExecutor(concurrent.futures.ProcessPoolExecutor):
    "Same as Python's ProcessPoolExecutor, except can pass `max_workers==0` for serial execution"
    def __init__(self, max_workers=defaults.cpus, on_exc=print, pause=0, **kwargs):
        if max_workers is None: max_workers=defaults.cpus
        self.not_parallel = max_workers==0
        store_attr(self, 'on_exc,pause,max_workers')
        if self.not_parallel: max_workers=1
        super().__init__(max_workers, **kwargs)

    def map(self, f, items, *args, **kwargs):
        self.lock = Manager().Lock()
        g = partial(f, *args, **kwargs)
        if self.not_parallel: return map(g, items)
        try: return super().map(partial(_call, self.lock, self.pause, self.max_workers, g), items)
        except Exception as e: self.on_exc(e)
File:           /opt/conda/lib/python3.7/site-packages/fastcore/utils.py
Type:           type
Subclasses:     

As we can see in the source code, under the hood this uses Python's concurrent.futures.ProcessPoolExecutor class, via a thin fastcore subclass.
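One detail worth noting in that source: inside the map method, g = partial(f, *args, **kwargs) binds any extra arguments up front, so each worker only needs to be handed one item at a time. A minimal stdlib illustration of that binding (with a made-up scale function for demonstration):

```python
from functools import partial

def scale(x, factor, offset=0):
    return x * factor + offset

# Bind the extra arguments once, leaving only `x` free --
# exactly the one-argument shape that `ex.map` expects.
g = partial(scale, factor=3, offset=1)

print(list(map(g, range(5))))  # [1, 4, 7, 10, 13]
```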

Note that processes are fundamentally different from Python threads, which are subject to the Global Interpreter Lock (GIL).

The ProcessPoolExecutor class is an Executor subclass that uses a pool of processes to execute calls asynchronously. ProcessPoolExecutor uses the multiprocessing module, which allows it to side-step the Global Interpreter Lock but also means that only picklable objects can be executed and returned.

Use cases

This function can be quite useful for long-running tasks where you want to take advantage of multi-core CPUs to speed up your processing. For example, if you want to download a lot of images from the internet, you can use it to parallelize the download jobs.
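Here is a hedged sketch of that download pattern. The download_image function is a hypothetical stand-in that just sleeps to simulate network latency; a thread pool is used because downloads are I/O-bound, so the workers spend their time waiting rather than competing for the GIL:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a real HTTP fetch: sleeps to mimic latency.
def download_image(url):
    time.sleep(0.2)
    return f"saved {url}"

urls = [f"https://example.com/img_{i}.jpg" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as ex:
    saved = list(ex.map(download_image, urls))
elapsed = time.perf_counter() - start

print(len(saved))        # 8
print(elapsed < 1.6)     # True: far faster than the ~1.6 s serial time
```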

If your function f is very fast, the results can be surprising. Here is an example:


def f(x):
  return x * 2

numbers = list(range(10000))
%%time

list(map(f, numbers))
print()
CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 1.24 ms
%%time

list(parallel(f, numbers))
print()
CPU times: user 3.96 s, sys: 940 ms, total: 4.9 s
Wall time: 12.4 s

In the above example, f is very fast, and the overhead of creating and dispatching so many tasks outweighs the benefit of multiprocessing. So use this with caution, and always profile first.