Okay, so you have done some transformation to your initial data frame and now you have a column where every cell is a collection. Now you want to save that data in a nice format, perhaps CSV, or some normalized database, or you just simply want to apply more functions onto that column, but they do not work with collections. You are wondering how to split those rows into multiple rows, so that you only have a single value in each cell instead of that collection. In this short tutorial I will show you how to do that in probably the fastest and simplest way possible.
1. Example data
Let us start by creating an example data frame.
import pandas as pd
df = pd.DataFrame(data=[[[1, 2, 3], 'X'], [[4, 5, 6], 'Y']], index=[0, 1], columns=['A', 'B'])
print(df)
A B
0 [1, 2, 3] X
1 [4, 5, 6] Y
2. Flattening the example
The fastest way to flatten that data frame is to utilize built in python functions and pandas iteritems method, because collections are internal to python and they are not supported well by external C libraries, so anything that will try do many calls to pandas will possibly only slow down the computation due to context switching between Python and C. In the example below we only create a data frame when we build the flattened column data frame and when we join back onto the initial data frame.
flattened_col = pd.DataFrame([(index, value) for (index, values) in df['A'].iteritems() for value in values],
columns=['index', 'A']).set_index('index')
df = df.drop('A', axis=1).join(flattened_col)
print(df)
B A
0 X 1
0 X 2
0 X 3
1 Y 4
1 Y 5
1 Y 6
3. Making a flattening function
We can obviously make a function out of the code above and we will end up with something like this:
def flatten_columns(df, cols):
"""Flattens multiple columns in a data frame, cannot specify all columns!"""
flattened_cols = {}
for col in cols:
flattened_cols[col] = pd.DataFrame([(index, value) for (index, values) in df[col].iteritems() for values in values],
columns=['index', col]).set_index('index')
flattened_df = df.drop(cols, axis=1)
for col in cols:
flattened_df = flattened_df.join(flattened_col[col])
return flattened_df
I leave the optimization of joins on purpose to you, so you can tweak the code and play with it. I hope that this quick read was useful to you.