Complete Pandas Tutorial
pandas: The Python Data Analysis Library
- This Python package is built on top of the NumPy library.
- Another fundamental choice was to design ad hoc data structures for data analysis.
- These data structures are designed to work with relational or labeled data, allowing you to manage data with features similar to those of SQL relational databases and Excel spreadsheets.
- First of all, we need to import pandas and numpy.
import pandas as pd
import numpy as np
Introduction to pandas Data Structures
The heart of pandas is its two primary data structures:
- Series (a data structure designed to accommodate a sequence of one-dimensional data)
- DataFrame (a more complex data structure designed to accommodate several dimensions)
Series
The series is the object of the pandas library designed to represent one-dimensional data structures, similar to an array but with some additional features.
We can define a series by using pandas’ Series object.
>> s = pd.Series([12,-4,7,9])
>> s
0    12
1    -4
2     7
3     9
dtype: int64
You can also assign your own index.
>> s = pd.Series([12,-4,7,9], index=['a','b','c','d'])
>> s.values
array([12, -4,  7,  9])
>> s.index
Index(['a', 'b', 'c', 'd'], dtype='object')
We can select individual elements like in numpy arrays, specifying the key.
>> s[2]
7
We can also specify the label corresponding to the position in the index.
>> s['b']
-4
In the same way as with a NumPy array, we can select multiple items:
>> s[0:2]
a    12
b    -4
dtype: int64
Also using corresponding labels,
>> s[['b','c']]
b   -4
c    7
dtype: int64
After assigning 0 to s[1], we get:
>> s[1] = 0
>> s
a    12
b     0
c     7
d     9
dtype: int64
We can make an assignment in the same way using the label, s['b'].
>> s['b'] = 1
>> s
a    12
b     1
c     7
d     9
dtype: int64
We can define a new series starting from a NumPy array, an existing series, or a dictionary.
>> arr = np.array([1,2,3,4])
>> s1 = pd.Series(arr)
>> s1
0    1
1    2
2    3
3    4
dtype: int64
Using an existing series,
>> s2 = pd.Series(s1)
>> s2
0    1
1    2
2    3
3    4
dtype: int64
Using Dictionaries,
>> s3 = pd.Series({'red':1, 'green':2, 'blue':3})
>> s3
red      1
green    2
blue     3
dtype: int64
If you change a value in the original NumPy array or series, the newly created series also changes, because the values are passed by reference, not copied.
If you change arr[2] to 5, the value of s1[2] also changes to 5, since we created s1 from arr.
>> arr[2] = 5
>> s1
0    1
1    2
2    5
3    4
dtype: int64
And if you then change s1[3] to 6, s2[3] also changes to 6, since we created s2 from s1.
>> s1[3] = 6
>> s2
0    1
1    2
2    5
3    6
dtype: int64
If you need to know which elements are greater than 4, we can do:
>> s2[s2 > 4]
2    5
3    6
dtype: int64
Other operations such as +, -, *, / and the mathematical functions applicable to NumPy arrays can be extended to series.
>> s2/2
0    0.5
1    1.0
2    2.5
3    3.0
dtype: float64
>> np.log(s2)
0    0.000000
1    0.693147
2    1.609438
3    1.791759
dtype: float64
Evaluating Values:
>> colors = pd.Series([1,2,3,-1,5,6,3,2,2],
              index=['a','b','a','c','b','a','d','e','c'])
>> colors
a    1
b    2
a    3
c   -1
b    5
a    6
d    3
e    2
c    2
dtype: int64
In our colors series, we can see that there are duplicate indices as well as duplicate values. To find the unique items among the indices and values, we can use the unique() function.
>> colors.values
array([ 1,  2,  3, -1,  5,  6,  3,  2,  2])
>> colors.unique()
array([ 1,  2,  3, -1,  5,  6])
>> colors.index
Index(['a', 'b', 'a', 'c', 'b', 'a', 'd', 'e', 'c'], dtype='object')
>> colors.index.unique()
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Another function, value_counts(), not only returns the unique values but also calculates the occurrences within the series.
>> colors.value_counts()
 2    3
 3    2
 6    1
 5    1
 1    1
-1    1
dtype: int64
>> colors.index.value_counts()
a    3
c    2
b    2
e    1
d    1
dtype: int64
The isin() function evaluates whether the given values are contained in the series.
>> colors.isin([1,2])
a     True
b     True
a    False
c    False
b    False
a    False
d    False
e     True
c     True
dtype: bool
>> colors[colors.isin([1,2])]
a    1
b    2
e    2
c    2
dtype: int64
>> log_colors = np.log(colors)
>> log_colors
a    0.000000
b    0.693147
a    1.098612
c         NaN
b    1.609438
a    1.791759
d    1.098612
e    0.693147
c    0.693147
dtype: float64
Here, we can see that the log of a negative number is NaN. NaN values can appear in a series in many situations, and they must be managed in some way, especially during data analysis. Pandas allows us to identify which indices have values and which do not, using isnull() and notnull().
>> log_colors.isnull()
a    False
b    False
a    False
c     True
b    False
a    False
d    False
e    False
c    False
dtype: bool
>> log_colors.notnull()
a     True
b     True
a     True
c    False
b     True
a     True
d     True
e     True
c     True
dtype: bool
>> log_colors[log_colors.notnull()]
a    0.000000
b    0.693147
a    1.098612
b    1.609438
a    1.791759
d    1.098612
e    0.693147
c    0.693147
dtype: float64
>> log_colors[log_colors.isnull()]
c   NaN
dtype: float64
We can supply our own keys in a dictionary and create a series in which case, pandas controls the correspondence between the keys of the dict and that of our supplied indices. If there is a mismatch, pandas will add the NaN value.
>> s3
red      1
green    2
blue     3
dtype: int64
>> colors = ['green','blue','yellow','red']
>> s3_supply = pd.Series(s3, index=colors)
>> s3_supply
green     2.0
blue      3.0
yellow    NaN
red       1.0
dtype: float64
Dataframe
Dataframe is a tabular data structure very similar to a spreadsheet. It is designed to extend series to multiple dimensions.
Dataframe can also be understood as a dict of series, where the keys are the column names and the values are the series that will form the columns of the dataframe.
The most common way to create a new dataframe is to pass a dict object to DataFrame() constructor.
>> data = {'color':['blue','green','yellow','red','white'],
'object':['ball','pen','pencil','paper','mug'],
'price':[1.2,1.0,0.6,0.9,1.7]}
>> frame = pd.DataFrame(data)
>> frame
Suppose your dict has an extra ‘color’ column that you don’t want in your analysis. You can choose only the columns you want.
>> frame2 = pd.DataFrame(data,columns=['object','price'])
>> frame2
Pandas automatically assigns the indices for your dataframe. But you can assign your own indices by passing an array containing indices.
>> frame2 = pd.DataFrame(data, index=['one','two','three','four','five'])
>> frame2
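As an aside, a DataFrame can also be built from a list of records, one dict per row; any key missing from a record becomes NaN in its column. A small sketch with illustrative records:

```python
import pandas as pd

# Each dict is one row; keys become column names.
records = [
    {'object': 'ball', 'price': 1.2},
    {'object': 'pen', 'price': 1.0},
    {'object': 'mug'},  # no 'price' key -> NaN in that cell
]
frame_records = pd.DataFrame(records)
print(frame_records)
```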
If you want to create a DataFrame for testing quickly and easily, you can use a NumPy array.
>> frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
>> frame3
You can select and filter a DataFrame on the basis of its columns, its indices, and the loc attribute with the row you want to extract.
>> frame3.columns
Index(['ball', 'pen', 'pencil', 'paper'], dtype='object')
>> frame3.index
Index(['red', 'blue', 'yellow', 'white'], dtype='object')
>> frame3.values
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>> frame3['paper']
red        3
blue       7
yellow    11
white     15
Name: paper, dtype: int64
>> frame3.paper
red        3
blue       7
yellow    11
white     15
Name: paper, dtype: int64
>> frame3.loc['red']
ball      0
pen       1
pencil    2
paper     3
Name: red, dtype: int64
>> frame3.loc[['red','yellow']]
>> frame3[0:2]
>> frame3['paper'][0]
3
Once you can access various elements in the DataFrame you can use the same logic to add and change the values in it.
>> frame3.index.name = 'id'
>> frame3.columns.name = 'items'
>> frame3
You can add a new column and assign values to it. If you assign a single value, pandas’ broadcasting sets every entry in the column to that same value.
>> frame3['new'] = 12
>> frame3
You can also assign the values for each row of the new column.
>> frame3['new'] = [12,13,14,15]
>> frame3
You can also assign series as a new column to the dataframe.
>> series = pd.Series(np.arange(3,7), index=['red','blue','yellow','white'])
>> series
red       3
blue      4
yellow    5
white     6
dtype: int64
>> frame3['new'] = series
>> frame3['new']
id
red       3
blue      4
yellow    5
white     6
Name: new, dtype: int64
>> frame3
Membership of a Value
>> frame3.isin([4,5,6,7])
>> frame3[frame3.isin([4,5,6,7])]
If you want to delete an entire column and its contents, you can use the del command.
>> del frame3['new']
>> frame3
Filtering
>> frame3 < 5
>> frame3[frame3 < 5]
Dataframe from Nested Dict
>> nestdict = {'red': {2012: 22, 2013: 33},
'white': {2011: 13, 2012:22, 2013:16},
'blue': {2011:17, 2012:27, 2013:18}}
>> frame4 = pd.DataFrame(nestdict)
>> frame4
Transposition of Dataframe
>> frame4.T
Index Objects
The Index objects are responsible for the labels on the axes.
>> series = pd.Series([6,4,12,3,2,5], index=['red','red','green','blue','orange','yellow'])
>> series.index
Index(['red', 'red', 'green', 'blue', 'orange', 'yellow'], dtype='object')
Unlike other elements, Index objects are immutable. Once they are declared, they cannot be changed. Each Index object has a number of methods and properties that are useful when you need to know the values they contain.
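The immutability can be verified directly: item assignment on an Index raises a TypeError. A minimal sketch:

```python
import pandas as pd

idx = pd.Index(['red', 'green', 'blue'])

error_seen = False
# Index objects are immutable: item assignment raises a TypeError.
try:
    idx[0] = 'yellow'
except TypeError:
    error_seen = True

print(error_seen)  # the index itself is unchanged
```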
idxmin() returns the index with the lowest value.
>> series.idxmin()
'orange'
idxmax() returns the index with the highest value.
>> series.idxmax()
'green'
If your data structure is large, it becomes difficult to check whether your indices are unique. The is_unique attribute tells you if there are duplicate labels.
>> series.index.is_unique
False
To find which index labels are duplicated:
>> series.index.duplicated()
array([False,  True, False, False, False, False])
>> ~series.index.duplicated()
array([ True, False,  True,  True,  True,  True])
If you want series, without the duplicated rows, you can do the following,
>> series = series[~series.index.duplicated()]
>> series
red        6
green     12
blue       3
orange     2
yellow     5
dtype: int64
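duplicated() also accepts a keep parameter controlling which occurrence survives. A short sketch with an illustrative series:

```python
import pandas as pd

s = pd.Series([6, 4, 12, 3], index=['red', 'red', 'green', 'blue'])

# keep='first' (default) marks later occurrences as duplicates;
# keep='last' marks earlier ones instead.
first = s[~s.index.duplicated(keep='first')]
last = s[~s.index.duplicated(keep='last')]

print(first['red'])  # keeps the first 'red' value, 6
print(last['red'])   # keeps the last 'red' value, 4
```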
Reindexing
Once declared in a data structure, the Index object cannot be changed, but by executing a reindexing you can overcome this problem.
>> series.reindex(['green','blue','red','yellow','orange'])
green     12
blue       3
red        6
yellow     5
orange     2
dtype: int64
During reindexing, it is possible to change the order of the sequence of indexes, delete some of them, or add new ones. In the case of new ones, pandas will add NaN as the corresponding value.
>> series.reindex(['green','blue','red','pink'])
green    12.0
blue      3.0
red       6.0
pink      NaN
dtype: float64
Here, we removed ‘orange’ and added ‘pink’. Since ‘pink’ doesn’t have any associated value, Pandas sets its value as NaN.
Suppose you have a large series whose values repeat in an ordered way. It would be tedious to write the same value for index after index, so there are methods that allow you to fill in or interpolate values automatically.
>> series_continuous = pd.Series([1,2,3],index=[0,3,6])
>> series_continuous
0    1
3    2
6    3
dtype: int64
>> series_continuous.reindex(range(8), method='ffill')
0    1
1    1
2    1
3    2
4    2
5    2
6    3
7    3
dtype: int64
‘ffill’ stands for ‘forward fill’ and will propagate last valid observation forward.
>> series_continuous.reindex(range(8), method='bfill')
0    1.0
1    2.0
2    2.0
3    2.0
4    3.0
5    3.0
6    3.0
7    NaN
dtype: float64
‘bfill’ stands for ‘backward fill’ and uses the next valid observation to fill the gap.
>> frame
>> frame.reindex(range(5), method='ffill', columns=['object','price','new','color'])
The ‘new’ column, which would otherwise be NaN, is prefilled with the values of ‘color’, because alphabetically ‘color’ comes before ‘new’. If bfill were used, it would take the values of 'object'. Note that recent pandas versions require the target labels to be monotonically ordered when filling during reindex, so this call may raise an error there.
Dropping
Deleting a row or a column is simple thanks to the labels used for the indexes and column names. We can use the drop() method to return a new object without the items you want to delete.
>> series.drop('red')
green     12
blue       3
orange     2
yellow     5
dtype: int64
>> series.drop(['green','blue'])
red       6
orange    2
yellow    5
dtype: int64
The same applies to a DataFrame.
>> frame_drop = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
>> frame_drop
To delete the rows, you just pass the indexes of the row.
>> frame_drop.drop(['blue','yellow'])
To delete columns, you specify the columns, but you must also pass axis=1 to refer to the column axis.
>> frame_drop.drop(['pen','pencil'],axis=1)
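As an alternative to axis=1, recent pandas versions let you name the axis with the columns= and index= keywords of drop(). A minimal sketch:

```python
import numpy as np
import pandas as pd

frame_drop = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['red', 'blue', 'yellow', 'white'],
                          columns=['ball', 'pen', 'pencil', 'paper'])

# Equivalent to drop([...], axis=1): name the axis with a keyword.
by_columns = frame_drop.drop(columns=['pen', 'pencil'])
by_rows = frame_drop.drop(index=['blue', 'yellow'])

print(by_columns.columns.tolist())  # ['ball', 'paper']
print(by_rows.index.tolist())       # ['red', 'white']
```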
Arithmetic and Data Alignment
>> s1 = pd.Series([3,2,5,1],['blue','green','white','yellow'])
>> s2 = pd.Series([1,4,7,2,1],['black','blue','brown','white','yellow'])
>> s1
blue      3
green     2
white     5
yellow    1
dtype: int64
>> s2
black     1
blue      4
brown     7
white     2
yellow    1
dtype: int64
>> s1 + s2
black     NaN
blue      7.0
brown     NaN
green     NaN
white     7.0
yellow    2.0
dtype: float64
Only ‘blue’, ‘white’ and ‘yellow’ appear in both series, so pandas performs the sum only for these labels; the others are set to NaN.
In the case of DataFrame, although it may appear more complex, the alignment follows the same principle but is carried out both for the rows and for the columns.
>> f1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['blue','red','white','yellow'],
columns=['ball','paper','pen','pencil'])
>> f2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index=['blue','green','white','yellow'],
columns=['ball','mug','pen'])
>> f1
>> f2
>> f1 + f2
Operations between Data Structures
Flexible Arithmetic Methods
The same +, -, *, / operations can also be performed using the flexible methods add(), sub(), mul(), and div().
>> f1.add(f2)
Operations between DataFrame and Series.
>> frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','paper','pen','pencil'])
>> frame
>> series = pd.Series(np.arange(4),index=['ball','paper','pen','pencil'])
>> series
ball      0
paper     1
pen       2
pencil    3
dtype: int64
>> frame - series
In each row of DataFrame, 0,1,2,3 are subtracted from each column.
>> series['mug'] = 9
>> series
ball      0
paper     1
pen       2
pencil    3
mug       9
dtype: int64
>> frame - series
If the index is not present in one of the two data structures, the result will be a new column with that index with values NaN.
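If you instead want the series aligned with the rows rather than the columns, the flexible arithmetic methods accept an axis argument. A sketch using sub() with axis=0:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'paper', 'pen', 'pencil'])

row_series = pd.Series([0, 1, 2, 3], index=['red', 'blue', 'yellow', 'white'])

# axis=0 aligns the series with the row index instead of the columns,
# so each row has its own value subtracted from every column.
result = frame.sub(row_series, axis=0)
print(result)
```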
Function Application and Mapping
Universal functions (ufuncs) operate element by element on the data structure.
>> np.sqrt(f1)
>> f1.max()
ball      12
paper     13
pen       14
pencil    15
dtype: int64
>> f1.min()
ball      0
paper     1
pen       2
pencil    3
dtype: int64
Using the apply() function, you can apply user-defined functions.
>> f1.apply(lambda x: x.max() - x.min())
ball      12
paper     12
pen       12
pencil    12
dtype: int64
>> f1.apply(lambda x: x.max() - x.min(), axis=1)  # applied to each row
blue      3
red       3
white     3
yellow    3
dtype: int64
>> f1.apply(lambda x: pd.Series([x.min(), x.max()], index=['min','max']))
>> f1.apply(lambda x: pd.Series([x.min(), x.max()], index=['min','max']), axis=1)
Statistics Functions
>> f1.sum()
ball      24
paper     28
pen       32
pencil    36
dtype: int64
>> f1.mean()
ball      6.0
paper     7.0
pen       8.0
pencil    9.0
dtype: float64
>> f1.describe()
Sorting and Ranking
>> s1.sort_index()
blue      3
green     2
white     5
yellow    1
dtype: int64
>> s1.sort_index(ascending=False)
yellow    1
white     5
green     2
blue      3
dtype: int64
>> f1.sort_index()
>> f1.sort_index(axis=1)
>> s1.sort_values()
yellow    1
green     2
blue      3
white     5
dtype: int64
>> f1.sort_values(by='paper')
rank() assigns a rank to each element in the series.
>> s1.rank()
blue      3.0
green     2.0
white     4.0
yellow    1.0
dtype: float64
>> s1.rank(method='first')
blue      3.0
green     2.0
white     4.0
yellow    1.0
dtype: float64
>> s1.rank(ascending=False)
blue      2.0
green     3.0
white     1.0
yellow    4.0
dtype: float64
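Ties are resolved according to the method parameter. A small sketch comparing the default 'average' with 'min' and 'dense' on an illustrative series:

```python
import pandas as pd

s = pd.Series([7, 7, 5, 9])

# 'average' (default) splits tied ranks; 'min' gives ties the lowest
# rank in the group; 'dense' is like 'min' but leaves no gaps.
avg = s.rank()
low = s.rank(method='min')
dense = s.rank(method='dense')

print(list(avg))    # [2.5, 2.5, 1.0, 4.0]
print(list(dense))  # [2.0, 2.0, 1.0, 3.0]
```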
Correlation and Covariance
>> s2 = pd.Series([3,4,3,4,5,4,3,2],
        ['2006','2007','2008','2009','2010','2011','2012','2013'])
>> s3 = pd.Series([1,2,3,4,4,3,2,1],
        ['2006','2007','2008','2009','2010','2011','2012','2013'])
>> s2
2006    3
2007    4
2008    3
2009    4
2010    5
2011    4
2012    3
2013    2
dtype: int64
>> s3
2006    1
2007    2
2008    3
2009    4
2010    4
2011    3
2012    2
2013    1
dtype: int64
Two statistical calculations, correlation and covariance, are expressed in pandas as corr() and cov().
>> s2.corr(s3)
0.7745966692414834
>> s2.cov(s3)
0.8571428571428571
>> f2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
        index=['red','blue','yellow','white'],
        columns=['ball','pen','pencil','paper'])
>> f2
>> f2.corr()
>> f2.cov()
Using the corrwith() method, you can calculate the pairwise correlations between the columns or rows of a DataFrame and a series or another DataFrame.
>> s4 = pd.Series([0,1,2,3,9],
index=['red','blue','yellow','white','green'])
>> s4
red       0
blue      1
yellow    2
white     3
green     9
dtype: int64
>> f2.corrwith(s4)
ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64
NaN
Assigning a NaN value
>> s5 = pd.Series([0,1,2,np.nan,9],
        index=['red','blue','yellow','white','green'])
>> s5
red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64
Filtering Out NaN values
If the series contains many NaN values, it is tedious to remove the rows one by one, so the dropna() function comes to our aid.
>> s5.dropna()
red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64
You can also perform the filtering by selecting the notnull() values.
>> s5[s5.notnull()]
red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64
>> fnan = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
          index=['blue','green','red'],
          columns=['ball','mug','pen'])
>> fnan
>> fnan.dropna()
To avoid having entire rows and columns disappear completely, you have to specify the how option. how='all' tells dropna() to delete only the rows or columns in which all elements are NaN.
>> fnan.dropna(how='all')
>> fnan.dropna(how='all',axis=1) #Column
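dropna() also accepts a thresh parameter, keeping only rows with at least that many non-NaN values. A minimal sketch rebuilding the fnan frame from above:

```python
import numpy as np
import pandas as pd

fnan = pd.DataFrame([[6, np.nan, 6],
                     [np.nan, np.nan, np.nan],
                     [2, np.nan, 5]],
                    index=['blue', 'green', 'red'],
                    columns=['ball', 'mug', 'pen'])

# Keep only rows that have at least two non-NaN values.
kept = fnan.dropna(thresh=2)
print(kept.index.tolist())  # ['blue', 'red']
```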
Filling in NaN Occurrences
>> fnan.fillna(0)
Sometimes you want to fill the NaN values with the mean or median to maintain data consistency.
>> ball_median = fnan['ball'].median()
>> ball_median
4.0
>> ball_mean = fnan['ball'].mean()
>> ball_mean
4.0
>> fnan['ball'].fillna(ball_median)
blue     6.0
green    4.0
red      2.0
Name: ball, dtype: float64
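fillna() also accepts a dict mapping column names to fill values, so each column can get its own replacement. A sketch rebuilding the fnan frame:

```python
import numpy as np
import pandas as pd

fnan = pd.DataFrame([[6, np.nan, 6],
                     [np.nan, np.nan, np.nan],
                     [2, np.nan, 5]],
                    index=['blue', 'green', 'red'],
                    columns=['ball', 'mug', 'pen'])

# The dict maps each column name to the value used to fill its NaNs.
filled = fnan.fillna({'ball': fnan['ball'].mean(),
                      'mug': 0,
                      'pen': fnan['pen'].median()})
print(filled)
```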
Hierarchical Indexing and Leveling
Hierarchical indexing allows you to have multiple levels of indexes on a single axis. It gives you a way to work with multiple dimensions while continuing to work in a two-dimensional structure.
>> hier_ser = pd.Series(np.random.rand(8), index=[
['white','white','white','blue','blue','red','red','red'],
['up','down','right','up','down','up','down','left']
])
>> hier_ser
white  up       0.314473
       down     0.112632
       right    0.082628
blue   up       0.110578
       down     0.618985
red    up       0.028230
       down     0.647547
       left     0.077691
dtype: float64
>> hier_ser.index
MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],
           codes=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])
>> hier_ser['white']
up       0.314473
down     0.112632
right    0.082628
dtype: float64
>> hier_ser[:, 'right']
white    0.082628
dtype: float64
>> hier_ser['white','up']
0.3144728041930178
The data can be rearranged into a DataFrame with a special function called unstack().
>> hier_ser.unstack()
If you want to reverse the operation, converting a DataFrame to a series, you can use the stack() function.
>> hier_ser.unstack().stack()
blue   down     0.618985
       up       0.110578
red    down     0.647547
       left     0.077691
       up       0.028230
white  down     0.112632
       right    0.082628
       up       0.314473
dtype: float64
With DataFrame it is possible to define hierarchical index both for rows and columns.
>> hier_df = pd.DataFrame(np.random.randn(16).reshape(4,4),
index=[
['white','white','red','red'],
['up','down','up','down']
],
columns=[
['pen','pen','paper','paper'],
[1,2,1,2]
])
>> hier_df
>> hier_df.columns
MultiIndex(levels=[['paper', 'pen'], [1, 2]],
           codes=[[1, 1, 0, 0], [0, 1, 0, 1]])
>> hier_df.index
MultiIndex(levels=[['red', 'white'], ['down', 'up']],
           codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
Let’s give the levels an encompassing name.
>> hier_df.columns.names = ['objects','id']
>> hier_df.index.names = ['colors','status']
>> hier_df
If you want to return a new object with two levels interchanged, while leaving the data unmodified, you can use swaplevel().
>> hier_df.swaplevel('colors','status')
>> hier_df.sort_index(level='colors')
>> hier_df.sum(level='colors')
>> hier_df.sum(level='id',axis=1)
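In recent pandas versions the level argument of sum() has been removed; the equivalent is an explicit groupby on the index (or on the transposed frame for columns). A hedged sketch that rebuilds hier_df with fixed values instead of random ones, so the results are reproducible:

```python
import numpy as np
import pandas as pd

hier_df = pd.DataFrame(np.arange(16).reshape(4, 4),
                       index=[['white', 'white', 'red', 'red'],
                              ['up', 'down', 'up', 'down']],
                       columns=[['pen', 'pen', 'paper', 'paper'],
                                [1, 2, 1, 2]])
hier_df.index.names = ['colors', 'status']
hier_df.columns.names = ['objects', 'id']

# Modern equivalent of hier_df.sum(level='colors'):
by_colors = hier_df.groupby(level='colors').sum()

# Modern equivalent of hier_df.sum(level='id', axis=1): transpose,
# group on the 'id' level, then transpose back.
by_id = hier_df.T.groupby(level='id').sum().T
```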