Complete Pandas Tutorial
pandas: The Python Data Analysis Library
- This Python package is built on top of the NumPy library.
- Another fundamental choice was to design ad hoc data structures for data analysis.
- These data structures are designed to work with relational or labeled data, allowing you to manage data with features similar to those of SQL relational databases and Excel spreadsheets.
- First of all, we need to import pandas and numpy.
import pandas as pd
import numpy as np
Introduction to pandas Data Structures
The heart of pandas is its two primary data structures:
- Series (a data structure designed to accommodate a sequence of one-dimensional data)
- DataFrame (a more complex data structure designed to accommodate several dimensions)
Series
The series is the object of the pandas library designed to represent one-dimensional data structures, similar to an array but with some additional features.
We can define a series by using pandas’ Series object.
>> s = pd.Series([12,-4,7,9])
>> s
0    12
1    -4
2     7
3     9
dtype: int64
You can also assign your own index.
>> s = pd.Series([12,-4,7,9], index=['a','b','c','d'])
>> s.values
array([12, -4,  7,  9])
>> s.index
Index(['a', 'b', 'c', 'd'], dtype='object')
We can select individual elements like in numpy arrays, specifying the key.
>> s[2]
7
We can also specify the label corresponding to the position in the index.
>> s['b']
-4
In the same way as with a NumPy array, we can select multiple items:
>> s[0:2]
a    12
b    -4
dtype: int64
Also using corresponding labels,
>> s[['b','c']]
b   -4
c    7
dtype: int64
After assigning 0 to s[1], we get:
>> s[1] = 0
>> s
a    12
b     0
c     7
d     9
dtype: int64
We can make an assignment in the same way using the label, s['b'].
>> s['b'] = 1
>> s
a    12
b     1
c     7
d     9
dtype: int64
We can define a new series starting from a NumPy array, an existing series, or a dictionary.
>> arr = np.array([1,2,3,4])
>> s1 = pd.Series(arr)
>> s1
0    1
1    2
2    3
3    4
dtype: int64
Using an existing series,
>> s2 = pd.Series(s1)
>> s2
0    1
1    2
2    3
3    4
dtype: int64
Using Dictionaries,
>> s3 = pd.Series({'red':1, 'green':2, 'blue':3})
>> s3
red      1
green    2
blue     3
dtype: int64
If you change a value in the original NumPy array or series, the newly created series also changes, because the values are passed by reference, not copied.
If you change arr[2] to 5, the value of s1[2] also changes to 5, since we created s1 from arr.
>> arr[2] = 5
>> s1
0    1
1    2
2    5
3    4
dtype: int64
And if you then change s1[3] to 6, s2[3] also changes to 6, since we created s2 from s1.
>> s1[3] = 6
>> s2
0    1
1    2
2    5
3    6
dtype: int64
If you need to know which elements are greater than 4, we can do:
>> s2[s2 > 4]
2    5
3    6
dtype: int64
Other operations such as +, -, *, / and the mathematical functions applicable to NumPy arrays can be extended to series.
>> s2/2
0    0.5
1    1.0
2    2.5
3    3.0
dtype: float64
>> np.log(s2)
0    0.000000
1    0.693147
2    1.609438
3    1.791759
dtype: float64
Evaluating Values:
>> colors = pd.Series([1,2,3,-1,5,6,3,2,2],
              index=['a','b','a','c','b','a','d','e','c'])
>> colors
a    1
b    2
a    3
c   -1
b    5
a    6
d    3
e    2
c    2
dtype: int64
In our colors series, we can see that there are duplicate indices as well as duplicate values. To find the unique items among the indices and values, we can use the unique() function.
>> colors.values
array([ 1,  2,  3, -1,  5,  6,  3,  2,  2])
>> colors.unique()
array([ 1,  2,  3, -1,  5,  6])
>> colors.index
Index(['a', 'b', 'a', 'c', 'b', 'a', 'd', 'e', 'c'], dtype='object')
>> colors.index.unique()
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
Another function, value_counts(), not only returns the unique values but also calculates the occurrences within the series.
>> colors.value_counts()
 2    3
 3    2
 6    1
 5    1
 1    1
-1    1
dtype: int64
>> colors.index.value_counts()
a    3
c    2
b    2
e    1
d    1
dtype: int64
The isin() function evaluates whether the given values are contained in the series.
>> colors.isin([1,2])
a     True
b     True
a    False
c    False
b    False
a    False
d    False
e     True
c     True
dtype: bool
>> colors[colors.isin([1,2])]
a    1
b    2
e    2
c    2
dtype: int64
>> log_colors = np.log(colors)
>> log_colors
a    0.000000
b    0.693147
a    1.098612
c         NaN
b    1.609438
a    1.791759
d    1.098612
e    0.693147
c    0.693147
dtype: float64
Here, we can see that the log of a negative number is NaN. NaN values can appear in a series in many situations, and they must be managed in some way, especially during data analysis. Pandas allows us to identify which indices have values and which do not, using isnull() and notnull().
>> log_colors.isnull()
a    False
b    False
a    False
c     True
b    False
a    False
d    False
e    False
c    False
dtype: bool
>> log_colors.notnull()
a     True
b     True
a     True
c    False
b     True
a     True
d     True
e     True
c     True
dtype: bool
>> log_colors[log_colors.notnull()]
a    0.000000
b    0.693147
a    1.098612
b    1.609438
a    1.791759
d    1.098612
e    0.693147
c    0.693147
dtype: float64
>> log_colors[log_colors.isnull()]
c   NaN
dtype: float64
We can supply our own keys in a dictionary and create a series in which case, pandas controls the correspondence between the keys of the dict and that of our supplied indices. If there is a mismatch, pandas will add the NaN value.
>> s3
red      1
green    2
blue     3
dtype: int64
>> colors = ['green','blue','yellow','red']
>> s3_supply = pd.Series(s3, index=colors)
>> s3_supply
green     2.0
blue      3.0
yellow    NaN
red       1.0
dtype: float64
Dataframe
Dataframe is a tabular data structure very similar to a spreadsheet. It is designed to extend series to multiple dimensions.
Dataframe can also be understood as a dict of series, where the keys are the column names and the values are the series that will form the columns of the dataframe.
The most common way to create a new dataframe is to pass a dict object to DataFrame() constructor.
>> data = {'color':['blue','green','yellow','red','white'],
'object':['ball','pen','pencil','paper','mug'],
'price':[1.2,1.0,0.6,0.9,1.7]}
>> frame = pd.DataFrame(data)
>> frame
Suppose your dict has an extra ‘color’ column that you don’t want in your analysis. You can choose only the columns you want.
>> frame2 = pd.DataFrame(data,columns=['object','price'])
>> frame2
Pandas automatically assigns the indices for your dataframe. But you can assign your own indices by passing an array containing indices.
>> frame2 = pd.DataFrame(data, index=['one','two','three','four','five'])
>> frame2
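As an aside, a DataFrame can also be built from a list of records, one dict per row; any key missing from a record becomes NaN in its column. A small sketch with illustrative records:

```python
import pandas as pd

# Each dict is one row; keys become column names.
records = [
    {'object': 'ball', 'price': 1.2},
    {'object': 'pen', 'price': 1.0},
    {'object': 'mug'},  # no 'price' key -> NaN in that cell
]
frame_records = pd.DataFrame(records)
print(frame_records)
```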
If you want to create a DataFrame for testing quickly and easily, you can use a NumPy array.
>> frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
>> frame3
You can select and filter a DataFrame on the basis of its columns, its indices, and the loc attribute with the row you want to extract.
>> frame3.columns
Index(['ball', 'pen', 'pencil', 'paper'], dtype='object')
>> frame3.index
Index(['red', 'blue', 'yellow', 'white'], dtype='object')
>> frame3.values
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
>> frame3['paper']
red        3
blue       7
yellow    11
white     15
Name: paper, dtype: int64
>> frame3.paper
red        3
blue       7
yellow    11
white     15
Name: paper, dtype: int64
>> frame3.loc['red']
ball      0
pen       1
pencil    2
paper     3
Name: red, dtype: int64
>> frame3.loc[['red','yellow']]
>> frame3[0:2]
>> frame3['paper'][0]
3
Once you can access various elements in the DataFrame you can use the same logic to add and change the values in it.
>> frame3.index.name = 'id'
>> frame3.columns.name = 'items'
>> frame3
You can add a new column and assign values to it. If you assign a single value, pandas’ broadcasting sets every entry in the column to that same value.
>> frame3['new'] = 12
>> frame3
You can also assign the values for each row of the new column.
>> frame3['new'] = [12,13,14,15]
>> frame3
You can also assign series as a new column to the dataframe.
>> series = pd.Series(np.arange(3,7), index=['red','blue','yellow','white'])
>> series
red       3
blue      4
yellow    5
white     6
dtype: int64
>> frame3['new'] = series
>> frame3['new']
id
red       3
blue      4
yellow    5
white     6
Name: new, dtype: int64
>> frame3
Membership of a Value
>> frame3.isin([4,5,6,7])
>> frame3[frame3.isin([4,5,6,7])]
If you want to delete an entire column and its contents, you can use the del command.
>> del frame3['new']
>> frame3
Filtering
>> frame3 < 5
>> frame3[frame3 < 5]
Dataframe from Nested Dict
>> nestdict = {'red': {2012: 22, 2013: 33},
'white': {2011: 13, 2012:22, 2013:16},
'blue': {2011:17, 2012:27, 2013:18}}
>> frame4 = pd.DataFrame(nestdict)
>> frame4
Transposition of Dataframe
>> frame4.T
Index Objects
The Index objects are responsible for the labels on the axes.
>> series = pd.Series([6,4,12,3,2,5], index=['red','red','green','blue','orange','yellow'])
>> series.index
Index(['red', 'red', 'green', 'blue', 'orange', 'yellow'], dtype='object')
Unlike other elements, Index objects are immutable. Once they are declared, they cannot be changed. Each Index object has a number of methods and properties that are useful when you need to know the values they contain.
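The immutability can be verified directly: item assignment on an Index raises a TypeError. A minimal sketch:

```python
import pandas as pd

idx = pd.Index(['red', 'green', 'blue'])

error_seen = False
# Index objects are immutable: item assignment raises a TypeError.
try:
    idx[0] = 'yellow'
except TypeError:
    error_seen = True

print(error_seen)  # the index itself is unchanged
```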
idxmin() returns the index with the lowest value.
>> series.idxmin()
'orange'
idxmax() returns the index with the highest value.
>> series.idxmax()
'green'
If your data structure is large, it becomes difficult to check whether your indices are unique. The is_unique attribute tells you if there are duplicate labels.
>> series.index.is_unique
False
To find which index labels are duplicated:
>> series.index.duplicated()
array([False,  True, False, False, False, False])
>> ~series.index.duplicated()
array([ True, False,  True,  True,  True,  True])
If you want series, without the duplicated rows, you can do the following,
>> series = series[~series.index.duplicated()]
>> series
red        6
green     12
blue       3
orange     2
yellow     5
dtype: int64
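duplicated() also accepts a keep parameter controlling which occurrence survives. A short sketch with an illustrative series:

```python
import pandas as pd

s = pd.Series([6, 4, 12, 3], index=['red', 'red', 'green', 'blue'])

# keep='first' (default) marks later occurrences as duplicates;
# keep='last' marks earlier ones instead.
first = s[~s.index.duplicated(keep='first')]
last = s[~s.index.duplicated(keep='last')]

print(first['red'])  # keeps the first 'red' value, 6
print(last['red'])   # keeps the last 'red' value, 4
```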
Reindexing
Once declared in a data structure, the Index object cannot be changed, but by executing a reindexing you can overcome this problem.
>> series.reindex(['green','blue','red','yellow','orange'])
green     12
blue       3
red        6
yellow     5
orange     2
dtype: int64
During reindexing, it is possible to change the order of the sequence of indexes, delete some of them, or add new ones. In the case of new ones, pandas will add NaN as the corresponding value.
>> series.reindex(['green','blue','red','pink'])
green    12.0
blue      3.0
red       6.0
pink      NaN
dtype: float64
Here, we removed ‘orange’ and added ‘pink’. Since ‘pink’ doesn’t have any associated value, Pandas sets its value as NaN.
Suppose you have a large series whose values repeat in an ordered way. It would be tedious to write the same value for index after index, so there are methods that allow you to fill in or interpolate values automatically.
>> series_continuous = pd.Series([1,2,3],index=[0,3,6])
>> series_continuous
0    1
3    2
6    3
dtype: int64
>> series_continuous.reindex(range(8), method='ffill')
0    1
1    1
2    1
3    2
4    2
5    2
6    3
7    3
dtype: int64
‘ffill’ stands for ‘forward fill’ and will propagate last valid observation forward.
>> series_continuous.reindex(range(8), method='bfill')
0    1.0
1    2.0
2    2.0
3    2.0
4    3.0
5    3.0
6    3.0
7    NaN
dtype: float64
‘bfill’ stands for ‘backward fill’ and uses the next valid observation to fill the gap.
>> frame
>> frame.reindex(range(5), method='ffill', columns=['object','price','new','color'])
The ‘new’ column, which would otherwise be NaN, is prefilled with the values of ‘color’, because alphabetically ‘color’ comes before ‘new’. If bfill were used, it would take the values of 'object'. Note that recent pandas versions require the target labels to be monotonically ordered when filling during reindex, so this call may raise an error there.
Dropping
Deleting a row or a column is simple thanks to the labels used for the indexes and column names. We can use the drop() method to return a new object without the items you want to delete.
>> series.drop('red')
green     12
blue       3
orange     2
yellow     5
dtype: int64
>> series.drop(['green','blue'])
red       6
orange    2
yellow    5
dtype: int64
The same applies to a DataFrame.
>> frame_drop = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','pen','pencil','paper'])
>> frame_drop
To delete the rows, you just pass the indexes of the row.
>> frame_drop.drop(['blue','yellow'])
To delete columns, you specify the columns, but you must also pass axis=1 to refer to the column axis.
>> frame_drop.drop(['pen','pencil'],axis=1)
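As an alternative to axis=1, recent pandas versions let you name the axis with the columns= and index= keywords of drop(). A minimal sketch:

```python
import numpy as np
import pandas as pd

frame_drop = pd.DataFrame(np.arange(16).reshape((4, 4)),
                          index=['red', 'blue', 'yellow', 'white'],
                          columns=['ball', 'pen', 'pencil', 'paper'])

# Equivalent to drop([...], axis=1): name the axis with a keyword.
by_columns = frame_drop.drop(columns=['pen', 'pencil'])
by_rows = frame_drop.drop(index=['blue', 'yellow'])

print(by_columns.columns.tolist())  # ['ball', 'paper']
print(by_rows.index.tolist())       # ['red', 'white']
```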
Arithmetic and Data Alignment
>> s1 = pd.Series([3,2,5,1],['blue','green','white','yellow'])
>> s2 = pd.Series([1,4,7,2,1],['black','blue','brown','white','yellow'])
>> s1
blue      3
green     2
white     5
yellow    1
dtype: int64
>> s2
black     1
blue      4
brown     7
white     2
yellow    1
dtype: int64
>> s1 + s2
black     NaN
blue      7.0
brown     NaN
green     NaN
white     7.0
yellow    2.0
dtype: float64
Only ‘blue’, ‘white’ and ‘yellow’ appear in both series, so pandas performs the sum only for these labels; the others are set to NaN.
In the case of DataFrame, although it may appear more complex, the alignment follows the same principle but is carried out both for the rows and for the columns.
>> f1 = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['blue','red','white','yellow'],
columns=['ball','paper','pen','pencil'])
>> f2 = pd.DataFrame(np.arange(12).reshape((4,3)),
index=['blue','green','white','yellow'],
columns=['ball','mug','pen'])
>> f1
>> f2
>> f1 + f2
Operations between Data Structures
Flexible Arithmetic Methods
The same +, -, *, / operations can also be performed using the flexible methods add(), sub(), mul(), and div().
>> f1.add(f2)
Operations between DataFrame and Series.
>> frame = pd.DataFrame(np.arange(16).reshape((4,4)),
index=['red','blue','yellow','white'],
columns=['ball','paper','pen','pencil'])
>> frame
>> series = pd.Series(np.arange(4),index=['ball','paper','pen','pencil'])
>> series
ball      0
paper     1
pen       2
pencil    3
dtype: int64
>> frame - series
In each row of DataFrame, 0,1,2,3 are subtracted from each column.
>> series['mug'] = 9
>> series
ball      0
paper     1
pen       2
pencil    3
mug       9
dtype: int64
>> frame - series
If the index is not present in one of the two data structures, the result will be a new column with that index with values NaN.
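If you instead want the series aligned with the rows rather than the columns, the flexible arithmetic methods accept an axis argument. A sketch using sub() with axis=0:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'paper', 'pen', 'pencil'])

row_series = pd.Series([0, 1, 2, 3], index=['red', 'blue', 'yellow', 'white'])

# axis=0 aligns the series with the row index instead of the columns,
# so each row has its own value subtracted from every column.
result = frame.sub(row_series, axis=0)
print(result)
```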
Function Application and Mapping
Universal functions (ufuncs) operate element by element on the data structure.
>> np.sqrt(f1)
>> f1.max()
ball      12
paper     13
pen       14
pencil    15
dtype: int64
>> f1.min()
ball      0
paper     1
pen       2
pencil    3
dtype: int64
Using the apply() function, you can apply user-defined functions.
>> f1.apply(lambda x: x.max() - x.min())
ball      12
paper     12
pen       12
pencil    12
dtype: int64
>> f1.apply(lambda x: x.max() - x.min(), axis=1)  # applied to each row
blue      3
red       3
white     3
yellow    3
dtype: int64
>> f1.apply(lambda x: pd.Series([x.min(), x.max()], index=['min','max']))
>> f1.apply(lambda x: pd.Series([x.min(), x.max()], index=['min','max']), axis=1)
Statistics Functions
>> f1.sum()
ball      24
paper     28
pen       32
pencil    36
dtype: int64
>> f1.mean()
ball      6.0
paper     7.0
pen       8.0
pencil    9.0
dtype: float64
>> f1.describe()
Sorting and Ranking
>> s1.sort_index()
blue      3
green     2
white     5
yellow    1
dtype: int64
>> s1.sort_index(ascending=False)
yellow    1
white     5
green     2
blue      3
dtype: int64
>> f1.sort_index()
>> f1.sort_index(axis=1)
>> s1.sort_values()
yellow    1
green     2
blue      3
white     5
dtype: int64
>> f1.sort_values(by='paper')
rank() assigns a rank to each element in the series.
>> s1.rank()
blue      3.0
green     2.0
white     4.0
yellow    1.0
dtype: float64
>> s1.rank(method='first')
blue      3.0
green     2.0
white     4.0
yellow    1.0
dtype: float64
>> s1.rank(ascending=False)
blue      2.0
green     3.0
white     1.0
yellow    4.0
dtype: float64
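Ties are resolved according to the method parameter. A small sketch comparing the default 'average' with 'min' and 'dense' on an illustrative series:

```python
import pandas as pd

s = pd.Series([7, 7, 5, 9])

# 'average' (default) splits tied ranks; 'min' gives ties the lowest
# rank in the group; 'dense' is like 'min' but leaves no gaps.
avg = s.rank()
low = s.rank(method='min')
dense = s.rank(method='dense')

print(list(avg))    # [2.5, 2.5, 1.0, 4.0]
print(list(dense))  # [2.0, 2.0, 1.0, 3.0]
```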
Correlation and Covariance
>> s2 = pd.Series([3,4,3,4,5,4,3,2],
        ['2006','2007','2008','2009','2010','2011','2012','2013'])
>> s3 = pd.Series([1,2,3,4,4,3,2,1],
        ['2006','2007','2008','2009','2010','2011','2012','2013'])
>> s2
2006    3
2007    4
2008    3
2009    4
2010    5
2011    4
2012    3
2013    2
dtype: int64
>> s3
2006    1
2007    2
2008    3
2009    4
2010    4
2011    3
2012    2
2013    1
dtype: int64
Two statistical calculations, correlation and covariance, are expressed in pandas as corr() and cov().
>> s2.corr(s3)
0.7745966692414834
>> s2.cov(s3)
0.8571428571428571
>> f2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],
        index=['red','blue','yellow','white'],
        columns=['ball','pen','pencil','paper'])
>> f2
>> f2.corr()
>> f2.cov()
Using the corrwith() method, you can calculate the pairwise correlations between the columns or rows of a DataFrame and a series or another DataFrame.
>> s4 = pd.Series([0,1,2,3,9],
index=['red','blue','yellow','white','green'])
>> s4
red       0
blue      1
yellow    2
white     3
green     9
dtype: int64
>> f2.corrwith(s4)
ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64
NaN
Assigning a NaN value
>> s5 = pd.Series([0,1,2,np.nan,9],
        index=['red','blue','yellow','white','green'])
>> s5
red       0.0
blue      1.0
yellow    2.0
white     NaN
green     9.0
dtype: float64
Filtering Out NaN values
If the series contains many NaN values, it is tedious to remove the rows one by one, so the dropna() function comes to our aid.
>> s5.dropna()
red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64
You can also perform the filtering by selecting the notnull() values.
>> s5[s5.notnull()]
red       0.0
blue      1.0
yellow    2.0
green     9.0
dtype: float64
>> fnan = pd.DataFrame([[6,np.nan,6],[np.nan,np.nan,np.nan],[2,np.nan,5]],
          index=['blue','green','red'],
          columns=['ball','mug','pen'])
>> fnan
>> fnan.dropna()
To avoid having entire rows and columns disappear completely, you have to specify the how option. how='all' tells dropna() to delete only the rows or columns in which all elements are NaN.
>> fnan.dropna(how='all')
>> fnan.dropna(how='all',axis=1) #Column
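dropna() also accepts a thresh parameter, keeping only rows with at least that many non-NaN values. A minimal sketch rebuilding the fnan frame from above:

```python
import numpy as np
import pandas as pd

fnan = pd.DataFrame([[6, np.nan, 6],
                     [np.nan, np.nan, np.nan],
                     [2, np.nan, 5]],
                    index=['blue', 'green', 'red'],
                    columns=['ball', 'mug', 'pen'])

# Keep only rows that have at least two non-NaN values.
kept = fnan.dropna(thresh=2)
print(kept.index.tolist())  # ['blue', 'red']
```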
Filling in NaN Occurrences
>> fnan.fillna(0)
Sometimes you want to fill the NaN values with the mean or median to maintain data consistency.
>> ball_median = fnan['ball'].median()
>> ball_median
4.0
>> ball_mean = fnan['ball'].mean()
>> ball_mean
4.0
>> fnan['ball'].fillna(ball_median)
blue     6.0
green    4.0
red      2.0
Name: ball, dtype: float64
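fillna() also accepts a dict mapping column names to fill values, so each column can get its own replacement. A sketch rebuilding the fnan frame:

```python
import numpy as np
import pandas as pd

fnan = pd.DataFrame([[6, np.nan, 6],
                     [np.nan, np.nan, np.nan],
                     [2, np.nan, 5]],
                    index=['blue', 'green', 'red'],
                    columns=['ball', 'mug', 'pen'])

# The dict maps each column name to the value used to fill its NaNs.
filled = fnan.fillna({'ball': fnan['ball'].mean(),
                      'mug': 0,
                      'pen': fnan['pen'].median()})
print(filled)
```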
Hierarchical Indexing and Leveling
Hierarchical indexing allows you to have multiple levels of indexes on a single axis. It gives you a way to work with multiple dimensions while continuing to work in a two-dimensional structure.
>> hier_ser = pd.Series(np.random.rand(8), index=[
['white','white','white','blue','blue','red','red','red'],
['up','down','right','up','down','up','down','left']
])
>> hier_ser
white  up       0.314473
       down     0.112632
       right    0.082628
blue   up       0.110578
       down     0.618985
red    up       0.028230
       down     0.647547
       left     0.077691
dtype: float64
>> hier_ser.index
MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],
           codes=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])
>> hier_ser['white']
up       0.314473
down     0.112632
right    0.082628
dtype: float64
>> hier_ser[:, 'right']
white    0.082628
dtype: float64
>> hier_ser['white','up']
0.3144728041930178
The data can be rearranged into a DataFrame with a special function called unstack().
>> hier_ser.unstack()
If you want to reverse the operation, converting a DataFrame to a series, you can use the stack() function.
>> hier_ser.unstack().stack()
blue   down     0.618985
       up       0.110578
red    down     0.647547
       left     0.077691
       up       0.028230
white  down     0.112632
       right    0.082628
       up       0.314473
dtype: float64
With DataFrame it is possible to define hierarchical index both for rows and columns.
>> hier_df = pd.DataFrame(np.random.randn(16).reshape(4,4),
index=[
['white','white','red','red'],
['up','down','up','down']
],
columns=[
['pen','pen','paper','paper'],
[1,2,1,2]
])
>> hier_df
>> hier_df.columns
MultiIndex(levels=[['paper', 'pen'], [1, 2]],
           codes=[[1, 1, 0, 0], [0, 1, 0, 1]])
>> hier_df.index
MultiIndex(levels=[['red', 'white'], ['down', 'up']],
           codes=[[1, 1, 0, 0], [1, 0, 1, 0]])
Let’s give the levels an encompassing name.
>> hier_df.columns.names = ['objects','id']
>> hier_df.index.names = ['colors','status']
>> hier_df
If you want to return a new object with two levels interchanged, while leaving the data unmodified, you can use swaplevel().
>> hier_df.swaplevel('colors','status')
>> hier_df.sort_index(level='colors')
>> hier_df.sum(level='colors')
>> hier_df.sum(level='id',axis=1)
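In recent pandas versions the level argument of sum() has been removed; the equivalent is an explicit groupby on the index (or on the transposed frame for columns). A hedged sketch that rebuilds hier_df with fixed values instead of random ones, so the results are reproducible:

```python
import numpy as np
import pandas as pd

hier_df = pd.DataFrame(np.arange(16).reshape(4, 4),
                       index=[['white', 'white', 'red', 'red'],
                              ['up', 'down', 'up', 'down']],
                       columns=[['pen', 'pen', 'paper', 'paper'],
                                [1, 2, 1, 2]])
hier_df.index.names = ['colors', 'status']
hier_df.columns.names = ['objects', 'id']

# Modern equivalent of hier_df.sum(level='colors'):
by_colors = hier_df.groupby(level='colors').sum()

# Modern equivalent of hier_df.sum(level='id', axis=1): transpose,
# group on the 'id' level, then transpose back.
by_id = hier_df.T.groupby(level='id').sum().T
```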