If you work with big data sets, you probably remember the “aha” moment along your Python journey when you discovered the Pandas library. Pandas is a game-changer for data science and analytics.

That said, shouldn't Pandas be fast? In reality this is not always the case, especially when you run a Pandas apply function: it can take ages to finish. Fortunately, alternatives exist that can speed up the process, and I will share them in this article.

Table of Contents

  • Reasons for the low performance of Pandas DataFrame.apply()
  • Option 1: Dask Library
  • Option 2: Swifter Library
  • Option 3: Vectorise when possible
  • Conclusions
  • References

Reasons for the low performance of Pandas DataFrame.apply()

Pandas DataFrame.apply() is unfortunately still limited to working with a single core, which means the whole operation runs on one core without utilising any of the other cores of your machine.

Naturally, the first question that pops up is: how can we use all the cores of our machine/cluster to run Pandas DataFrame.apply() in parallel?


Option 1: Dask Library

A quick look at the Dask website turns up the following quote:

"Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love and is a flexible library for parallel computing in Python"

At the same time, the Dask DataFrame API mimics Pandas.

The simplest way is to use Dask's map_partitions. First you need to:

pip install dask

and also import the following:

import pandas as pd
import numpy as np
import dask.dataframe as dd
import multiprocessing

Below we run a script comparing the performance of Dask's map_partitions against DataFrame.apply(). The code is pretty simple: the apply statement is wrapped in a map_partitions call, there is a compute() at the end, and npartitions has to be initialised.
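The full script lives in the notebook linked at the end; a minimal sketch of the comparison, assuming a DataFrame with two numeric columns x and y and the y*(x**2+1) function used later in this article, could look like this:

import multiprocessing
import numpy as np
import pandas as pd
import dask.dataframe as dd

# Illustrative data: two numeric columns (the names are assumptions)
data = pd.DataFrame({'x': np.random.rand(1_000_000),
                     'y': np.random.rand(1_000_000)})

def func(row):
    return row['y'] * (row['x'] ** 2 + 1)

# Plain Pandas: runs on a single core
result_pandas = data.apply(func, axis=1)

# Dask: split the DataFrame into a few times more partitions than cores
ddf = dd.from_pandas(data, npartitions=4 * multiprocessing.cpu_count())

# map_partitions applies the Pandas apply to each partition in parallel;
# compute() triggers the (lazy) computation with the process-based scheduler
result_dask = ddf.map_partitions(
    lambda df: df.apply(func, axis=1)
).compute(scheduler='processes')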

Important things to notice:

  • The execution time was halved using Dask's map_partitions
  • dd.from_pandas(data, npartitions=4*multiprocessing.cpu_count()) is used to divide the pd.DataFrame into chunks. In simple terms, the npartitions property is the number of Pandas dataframes that compose a single Dask dataframe.
  • Generally you want a few times more partitions than you have cores. This affects performance in two main ways. 1) If you don't have enough partitions, you may not be able to use all of your cores effectively; for example, if your dask.dataframe has only one partition, then only one core can operate at a time. 2) If you have too many partitions, the scheduler may incur a lot of overhead deciding where to compute each task; every task takes up a few hundred microseconds in the scheduler.
  • map_partitions simply applies that lambda function to each partition.
  • compute() tells Dask to process everything that came before and deliver the end result. Many distributed libraries like Dask or Spark implement 'lazy evaluation': they build up a list of tasks and only execute it when prompted to do so. It is very important to set scheduler='processes', otherwise the computational time will increase dramatically.
  • Improper settings of the Dask parameters can lead to an increase in execution time.

Dask comes with four available schedulers:

  • "threads": a scheduler backed by a thread pool
  • "processes": a scheduler backed by a process pool (the preferred option on local machines, as it uses all CPUs)
  • "single-threaded" (aka "sync"): a synchronous scheduler, good for debugging
  • "distributed": a scheduler for executing graphs on multiple machines
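As a rough sketch of how the scheduler is chosen (reusing the ddf and func names from the example above), it can be set per compute() call or globally:

import dask

# Per call: use the process pool for this computation only
result = ddf.map_partitions(lambda df: df.apply(func, axis=1)).compute(scheduler='processes')

# Globally: every subsequent compute() will use the process-based scheduler
dask.config.set(scheduler='processes')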

Option 2: Swifter Library

Swifter advertises itself as:

"A package which efficiently applies any function to a pandas dataframe or series in the fastest available manner."

First you will need to pip install the library as follows:

pip install swifter

It works as a plugin for pandas, allowing you to reuse the apply function, so it is very easy to use, as shown below, and very fast:
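A minimal sketch of the usage, assuming the same illustrative DataFrame and function as above:

import numpy as np
import pandas as pd
import swifter  # importing swifter registers the .swifter accessor on Pandas objects

data = pd.DataFrame({'x': np.random.rand(1_000_000),
                     'y': np.random.rand(1_000_000)})

# The only change from plain Pandas is the .swifter accessor before .apply
result = data.swifter.apply(lambda row: row['y'] * (row['x'] ** 2 + 1), axis=1)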

Surprisingly, it runs very fast, and the reason is that the function we apply can be vectorised; Swifter is smart enough to figure that out.

However, in cases where the function we apply cannot be vectorised, Swifter automatically decides whether to go with the Pandas apply function or with Dask (without the user having to choose the number of partitions). Should we use parallel processing (which has some overhead), or a simple pandas apply (which only utilises 1 CPU, but has no overhead)?

In simple terms, swifter uses pandas apply when it is faster for small data sets, and converges to dask parallel processing when that is faster for large data sets. In this manner, the user doesn't have to think about which method to use, regardless of the size of the data set.


Option 3: Vectorise when possible

Personally speaking, if you think the function you apply can be vectorised, you should vectorise it (in our case y*(x**2+1) is trivially vectorised, but there are plenty of things that are impossible to vectorise). It is important to always write code with a vector mindset (broadcasting) rather than a scalar one. The serious time savings come from avoiding for loops and performing operations across the entire array.
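For reference, the vectorised version of that example is a one-liner, assuming the same column names as before:

# Broadcasting works on whole columns at once; no Python-level loop over rows
result = data['y'] * (data['x'] ** 2 + 1)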

As expected, it is faster than any other option, and its performance is pretty similar to Swifter's. The difference is attributed to the fact that Swifter needs some overhead time to identify whether the function can be vectorised.


Conclusions

Rather than thinking of how to get more computational power, we should think about how to use the hardware we do have as efficiently as possible. In this article, we walked through 3 different options on how we can speed up Pandas apply function by taking full advantage of our computational power.

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.

The complete Jupyter Notebook can be found on my GitHub page below:

https://github.com/geodra/Pandas-Tricks/tree/master

References