In this article, I will present some useful Pandas commands that I use in my day-to-day job. So let’s get started.


✏️ Table of Contents

  • Fix messy dataframe column names
  • Remove Outliers
  • Inspect CSV files
  • Calculate Percentage Change
  • Date Ranges
  • Left merge with indicator
  • Compress & Save Pandas Dataframe
  • Conclusion
The Hungry Panda
Photo by Sid Balachandran / Unsplash

Note: this article is also available on Medium.

✅ Fix messy dataframe column names

Sometimes dataframe columns have extra spaces or are just plain odd, even if they look normal. One simple way to fix them is:

df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_').str.replace('(', '').str.replace(')', '')

✂️ Remove Outliers

Sometimes you want to remove records where a column has extreme values. This can be done easily by running the following command:

train.loc[train[col_name] <
                               np.percentile(train[col_name],95)]

🔍 Inspect CSV files

Sometimes it is useful to inspect a CSV file before loading it using Pandas in case you need to skip some rows. This can be achieved by the following command:

! head -10 'file_name.csv'

So in this case you want to skip the first 10 rows when reading the csv file. This can be achieved as follow:

df = pd.read_csv('file_name.csv',skiprows=10)

🚀 For people who like video courses and want to kick-start a career in data science today, I highly recommend the below video course from Udacity:

Learn to Become a Data Scientist Online | Udacity | Udacity
Gain real-world data science experience with projects from industry experts. Take the first step to becoming a data scientist. Learn online, with Udacity.

📚 While for book lovers:


🛠 Calculate Percentage Change

df["perc_change"] = df[col_name].pct_change()

You can read the full documentation of the above pandas function using the link below:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html


🗓Date Ranges

You can use pd.date_range to fill the missing dates of a Pandas dataframe.

rng = pd.date_range('2015-02-24', periods=15, freq='W')
df = pd.DataFrame({ 'date': rng, 'val' : np.random.randn(len(rng))}) 
display(df.head())

idx = pd.date_range(df.date.min(),df.date.max())
df = df.set_index('date')
df_fill = df.reindex(idx, fill_value=0)
plt.figure(figsize=(15,10))
plt.xlabel('Dates')
plt.ylabel('Number of tweets')
plt.plot(df.index,df.val,label='unfilled')
plt.plot(df_fill.index,df_fill.val,label='filled')
plt.legend()
plt.grid()
plt.show()

You can read the full documentation of the above Pandas function using the link below:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html


🔗 Left merge with indicator

Two useful arguments of the pd.merge function are:

  • suffixes: You pass a tuple of (str, str), default (‘_x’, ‘_y’). Suffix applies to only column names that exist both in the left and right dataframe. To raise an exception use (False, False).
  • indicator: You pass a boolean value (True or False). If True it adds a column to output DataFrame called “_merge” with information on the source of each row. Possible values are: “left_only” for observations whose merge key only appears in ‘left’ DataFrame, “right_only” for observations whose merge key only appears in ‘right’ DataFrame, and “both” if the observation’s merge key is found in both.

💾 Compress & Save Pandas Dataframe

We usually save a pandas Dataframe as a CSV file by running the following command.

df.to_csv('dataset.csv', index=False)

However, using an additional argument compression='gzip', we can significantly reduce the file size of the CSV file:

df.to_csv('dataset.gz', compression='gzip', index=False)

If you are worried that it may be more complicated to load the data, then the answer is no. It is fairly easy to read the gzipped data as Pandas DataFrame as follow:

df = pd.read_csv('dataset.gz')

🤖 Conclusion

This brings us to the end of this article. Hope you learn some new useful pandas commands. As Roman Orac has highlighted in one of his posts:

If you are using for loops with pandas, there is most probably a better way to write it.

If you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you and also notify you for future articles.‌

💪💪💪💪 As always keep studying, keep creating 🔥🔥🔥🔥