Pandas Tutorial - Python Pandas - Python Tutorial

Pandas is a Python library for data analysis.
Wes McKinney began developing Pandas in 2008 while he was working at AQR Capital Management.
He convinced AQR to allow him to open source the library.
In 2012 another AQR employee, Chang She, joined as the second major contributor to the library.
Pandas is built on top of NumPy, which it uses for mathematical operations, and it relies on matplotlib for data visualization.
It acts as a wrapper over these libraries, allowing you to access many of matplotlib's and NumPy's methods with less code.
For instance, the .plot() method of a Series or DataFrame combines multiple matplotlib methods into a single call, enabling you to plot a chart in a few lines.
Before pandas, most analysts used Python for data munging and preparation and then switched to a more domain-specific language like R for the rest of their workflow.
Pandas introduced two new types of objects for storing data, which make analytical tasks easier and eliminate the need to switch tools.
They are Series, which have a list-like structure, and DataFrames, which have a tabular structure.

Features of Pandas

Data Aggregation & Data Transformation.
Data cleansing & Data adaptation.
Manipulation of data structures.
Selection of data according to definable criteria.
Handling of missing data.
Splitting of large amounts of data.
Processing & adjusting time series.
Reading data from different data formats.
Filter functions for data.

Pandas Data Structures

Pandas has two main data structures: Series and DataFrames.

Series

A Series is a one-dimensional, array-like object that can contain any data type: floats, integers, strings, Python objects, and so on.
It can be thought of as two associated arrays: one holds the index (the labels), and the other holds the actual data.
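
The index/data pairing described above can be seen by inspecting a small Series directly. This is a minimal sketch; the labels "a", "b", and "c" are made up for illustration.

```python
import pandas as pd

# A Series pairs an array of values with an array of labels (the index).
s = pd.Series([11, 28, 72], index=["a", "b", "c"])

print(s.index.tolist())   # the labels
print(s.values.tolist())  # the data
print(s["b"])             # look up a value by its label
```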

Sample Code

import pandas as pd

s = pd.Series([11, 28, 72, 3, 5, 8])
print(s)

Output

Data Frame

It is a two-dimensional data structure, like a relational table with rows and columns.
The idea of a DataFrame is based on spreadsheets, and its data structure looks just like one.
A DataFrame has both a row index and a column index, and it contains an ordered collection of columns, similar to an Excel sheet.
Different columns of a DataFrame can have different types: the first column may consist of strings, while the second one consists of boolean values, and so on.
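
As a rough illustration of columns with different types, the sketch below builds a small DataFrame and inspects its two indexes and its per-column types. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: a string column, a boolean column, and a float column.
df = pd.DataFrame({
    "name": ["Venkat", "Krishiv", "Siva"],
    "active": [True, False, True],
    "rating": [4.23, 3.24, 3.98],
})

print(df)
print(df.dtypes)            # one data type per column
print(df.index.tolist())    # row index
print(df.columns.tolist())  # column index
```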

Pandas Descriptive Statistics Functions

Pandas has an array of methods and functions that calculate descriptive statistics on DataFrame columns.
Most of them are aggregations such as sum() and mean(), which return a reduced result, but some of them, like cumsum(), produce an object of the same size.
These methods generally accept an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by integer or by name.
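
The difference between an aggregation, a same-size result, and the two ways of naming the axis can be sketched as follows. The small table here is invented for the example and is separate from the sample data used below.

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 26, 25], "Rating": [4.0, 3.0, 5.0]})

col_sums = df.sum()                # one value per column (axis=0 is the default)
row_sums = df.sum(axis=1)          # one value per row
col_means = df.mean(axis="index")  # axis given by name; same as axis=0
running = df.cumsum()              # same shape as df: running totals down each column

print(col_sums)
print(running)
```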

Sample Code

import pandas as pd

#Create a Dictionary of series
d = {'Name':pd.Series(['Venkat','Krishiv','Siva','Kishore','Nizar','Bharath','Karthik',
'Lokesh','Raja','Vijay','Suriya','Srikumar']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

#Create a DataFrame
df = pd.DataFrame(d)
print(df)

Output

Pandas Function Application

There are three important methods for applying your own or another library's functions to pandas objects.
Which one to use depends on whether the function operates on the entire DataFrame, row-wise or column-wise, or element-wise.

DataFrame-level function application: pipe()
Row/column-level function application: apply()
Element-level function application: applymap()

Data frame Function Application: pipe()

Custom functions can be applied to the whole DataFrame by passing the function, followed by any additional arguments it needs, to pipe(). The DataFrame itself is passed to the function as its first argument.
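
Because pipe() returns whatever the function returns, calls can be chained. The following is a small sketch of that pattern; add_constant and scale are hypothetical helper functions written for this example, not part of pandas.

```python
import pandas as pd

def add_constant(frame, value):
    """Add a constant to every element (hypothetical helper)."""
    return frame + value

def scale(frame, factor=1):
    """Multiply every element by a factor (hypothetical helper)."""
    return frame * factor

df = pd.DataFrame({"Col1": [1, 2], "Col2": [3, 4]})

# Same as scale(add_constant(df, 2), factor=10), but read left to right.
result = df.pipe(add_constant, 2).pipe(scale, factor=10)
print(result)
```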

Sample Code

import pandas as pd
import numpy as np

def adder(ele1, ele2):
    return ele1+ele2


df = pd.DataFrame(np.random.randn(5, 3), columns=['Col1', 'Col2', 'Col3'])
print(df)

# Equivalent to adder(df, 2): adds 2 to every element
df1 = df.pipe(adder, 2)
print(df1)

Output

Row/Column level Function Application: apply()

Arbitrary functions can be applied along the axes of a DataFrame by using the apply() method.
It takes an optional axis argument and can also be applied to a Series.
By default, the operation is performed column-wise, passing each column to the function as a Series.
The user passes a function, and apply() calls it on every column or row of the DataFrame, or on every value of a Series.
It can be used, for example, to label or segregate data according to given conditions, which makes it widely useful in data science and machine learning.
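
The effect of the axis argument can be sketched with a small, fixed table. The lambda functions below are only illustrations of what might be passed to apply().

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

# Column-wise (default, axis=0): the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# Row-wise (axis=1): the function receives each row as a Series.
row_total = df.apply(lambda row: row["col1"] + row["col2"], axis=1)

print(col_range)
print(row_total)
```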

Sample Code

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])

# Column-wise mean (axis=0 is the default)
print(df.apply(np.mean))

Output

Element level Function Application: applymap()

The applymap() method on a DataFrame takes a function that accepts a single value and returns a single value.
It applies that function to every element of the DataFrame. The map() method does the same for a Series.
Such functions are often written inline with the lambda keyword.
In pandas 2.1 and later, applymap() is deprecated in favor of the equivalent DataFrame.map().

Sample Code

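import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])

# Element-wise on a single column: Series.map()
print(df['col1'].map(lambda x: x * 100))

# Element-wise on the whole DataFrame: applymap()
# (pandas 2.1 and later prefer the equivalent DataFrame.map())
print(df.applymap(lambda x: x * 100))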

Output

Python Iteration Operations

Iterating over a dataset allows us to visit all of its values.
It lets us carry out more complex operations and improves our grasp of the data.
In pandas there are several ways to iterate over a DataFrame: column-wise, row-wise, or row by row in the form of tuples.

Iteration in pandas

With Pandas iteration, we can visit each element of the dataset in a sequential manner and even apply mathematical operations while iterating.
Before we start iterating, let us import the pandas library.

Syntax

import pandas as pd

Three ways to iterate in pandas

In pandas there are three ways to iterate over DataFrames:

iteritems(): iterates column-wise, yielding each column label with its column as a Series. It is named items() in current pandas; iteritems() was removed in pandas 2.0.
iterrows(): iterates row-wise, yielding each row index with its row as a Series.
itertuples(): iterates row-wise, yielding each row as a named tuple.

iteritems() in pandas

This function lets us visit every value of the dataset, one column at a time. It was removed in pandas 2.0, where the same behavior is available as items().

Sample Code

import pandas as pd

data = {
  "firstname": ["Venkat", "Krishiv", "Sowmith"],
  "age": [40, 20, 10]
}

df = pd.DataFrame(data)

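# items() replaces iteritems(), which was removed in pandas 2.0
for a, b in df.items():
  print(a)  # column label
  print(b)  # column values as a Series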

Output

iterrows() in pandas

With iterrows(), we can visit all the elements of a dataset row-wise.

Sample Code

import pandas as pd

data = {
  "firstname": ["Venkat", "Krishiv", "Sowmith"],
  "age": [40, 20, 10]
}

df = pd.DataFrame(data)

for index, row in df.iterrows():
  print(row["firstname"])

Output

itertuples() in pandas

The itertuples() function creates a named tuple for every row of the dataset.

Sample Code

import pandas as pd

data = {
  "firstname": ["Venkat", "Krishiv", "Sowmith"],
  "age": [40, 20, 10]
}

df = pd.DataFrame(data)

for row in df.itertuples():
  print(row)

Output

Pandas Sorting

In pandas there are two kinds of sorting available:

By label
By actual value

By label

A DataFrame can be sorted by its labels using the sort_index() method, optionally passing the axis argument and the order of sorting.
By default, sorting is done on row labels in ascending order.

Sample Code

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df=unsorted_df.sort_index()
print(sorted_df)

Output

Order of sorting

The order of the sorting can be controlled by passing a Boolean value to the ascending parameter.

Sample Code

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df = unsorted_df.sort_index(ascending=False)
print(sorted_df)

Output

Sort the columns

The sorting can be done on the column labels by passing the axis argument with a value of 1. With the default value of 0, the rows are sorted by their index labels.

Sample Code

import pandas as pd
import numpy as np
 
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
 
sorted_df=unsorted_df.sort_index(axis=1)

print(sorted_df)

Output

By actual value

sort_values() is the method for sorting by values, the counterpart of sort_index() for labels.
It accepts a 'by' argument, which takes the name of the DataFrame column (or a list of column names) by which the rows are to be sorted.
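
As a sketch of how 'by' and 'ascending' interact, the example below sorts the same small table in two ways: first by two columns, so that ties in col1 are broken by col2, and then by one column in descending order.

```python
import pandas as pd

df = pd.DataFrame({"col1": [2, 1, 1, 1], "col2": [1, 3, 2, 4]})

# Sort by col1 first, then break ties using col2.
by_two = df.sort_values(by=["col1", "col2"])

# Sort by col2 from largest to smallest.
descending = df.sort_values(by="col2", ascending=False)

print(by_two)
print(descending)
```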

Sample Code

import pandas as pd
import numpy as np

unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')

print(sorted_df)

Output

Pandas Missing Data

In real-life scenarios, missing data is always a problem.
Areas like machine learning and data mining face severe issues in the accuracy of their model predictions because of poor data quality caused by missing values.
In those areas, missing value treatment is a major point of focus for making models more accurate and valid.

Sample Code

# import the pandas library
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)

Output

Check for Missing Values

To make detecting missing values easier, Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects.
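
A common use of these checks is counting how many values are missing in each column. The sketch below uses a small hand-made table so that the positions of the missing values are known.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0], "two": [np.nan, np.nan, 6.0]})

missing_mask = df.isnull()              # DataFrame of True/False, same shape as df
missing_per_column = df.isnull().sum()  # True counts as 1, so this counts missing values
present = pd.notnull(df["one"])         # function form; same result as df["one"].notnull()

print(missing_per_column)
print(present)
```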

Sample Code

import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

print(df['one'].isnull())

Output

Replace NaN with a Scalar Value

Sample Code

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])

df = df.reindex(['a', 'b', 'c'])

print(df)
print ("NaN replaced with '0':")
print(df.fillna(0))

Output

Drop Missing Values

If we want to simply exclude the missing values, we use the dropna() function along with the axis argument.
By default, axis=0, which works along rows: if any value within a row is NA, the whole row is excluded.
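
The two axis settings can be compared on a small hand-made table with a single missing value. This is only a sketch of the default behavior, where any missing value is enough to drop the row or column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "one": [1.0, np.nan, 3.0],
    "two": [4.0, 5.0, 6.0],
})

rows_kept = df.dropna()        # axis=0: drop every row that contains a NaN
cols_kept = df.dropna(axis=1)  # axis=1: drop every column that contains a NaN

print(rows_kept)
print(cols_kept)
```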

Sample Code

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])

df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())

Output

Pandas Indexing and Selecting Data

The Python and NumPy indexing operator "[ ]" and the attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases.
However, because the type of the data to be accessed isn't known in advance, directly using these standard operators has some optimization limits.
For production code, we recommend that you take advantage of the optimized pandas data access methods explained in this chapter.
Pandas supports two types of multi-axis indexing, plus a third that older versions provided:

.loc[] - label based
.iloc[] - integer position based
.ix[] - both label and integer based (deprecated in pandas 0.20 and removed in pandas 1.0)

.loc[]

The .loc[] indexer is purely label based, and when slicing with labels both the start and the stop bound are included.
Integers are valid labels, but they refer to the label and not the position.
.loc[] accepts several kinds of input: a single scalar label, a list of labels, a slice object, or a boolean array.
It takes two single labels, lists, or ranges separated by ','; the first one selects rows and the second one selects columns.
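
Each of the access methods listed above can be sketched on a small labelled table. The row labels and values are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
    index=["a", "b", "c", "d"],
)

single = df.loc["b", "A"]            # scalar labels for row and column
some_rows = df.loc[["a", "c"], :]    # list of row labels, all columns
sliced = df.loc["b":"d", "B"]        # label slice: both "b" and "d" are included
filtered = df.loc[df["A"] > 2, "B"]  # boolean array selects the rows

print(some_rows)
print(sliced)
```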

Sample Code

#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])

#select all rows for a specific column
print(df.loc[:,'A'])

Output

.iloc[]

Pandas provides the .iloc[] indexer for purely integer-based indexing. As in Python and NumPy, positions are 0-based.
.iloc[] accepts several kinds of input: an integer, a list of integers, or a range of values.
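
The access methods listed above can be sketched as follows. Note that, unlike label slices, position slices exclude the stop position, as in ordinary Python slicing.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

first_row = df.iloc[0]         # a single integer: the first row as a Series
picked = df.iloc[[0, 2], [1]]  # lists of row positions and column positions
block = df.iloc[1:3, 0:2]      # position slices: rows 1-2, columns 0-1

print(first_row)
print(block)
```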

Sample Code

# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])

# select the first four rows (all columns)
print(df.iloc[:4])

Output

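.ix[]

Besides the pure label-based and integer-based indexers, older versions of pandas provided a hybrid indexer, .ix[], for selecting and subsetting an object by either label or position.
It was deprecated in pandas 0.20 and removed in pandas 1.0, so current code should use .loc[] or .iloc[] instead.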
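
Selections that mix positions and labels can still be written with the two remaining indexers. The sketch below shows one possible way, assuming we want the first two rows by position and column "A" by label; the table itself is invented for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]}, index=["a", "b", "c"])

# Turn the row positions into labels, then use .loc[]
via_loc = df.loc[df.index[:2], "A"]

# Or turn the column label into a position, then use .iloc[]
via_iloc = df.iloc[:2, df.columns.get_loc("A")]

print(via_loc)
```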

Pandas working with Text Data

Pandas provides a set of string functions that make it easy to operate on string data.
Most importantly, those functions ignore or exclude missing/NaN values.
They are available through the .str accessor of a Series or Index, for example s.str.lower(). The main functions and their descriptions are listed below.

lower()

Converts strings in the Series/Index to lower case.

upper()

Converts strings in the Series/Index to upper case.

len()

Computes the length of each string.

strip()

Strips whitespace (including newlines) from both sides of each string in the Series/Index.

split(pattern)

Splits each string on the given pattern.

cat(sep=' ')

Concatenates the Series/Index elements with the given separator.

get_dummies()

Returns a DataFrame with one-hot encoded values.

contains(pattern)

Returns a Boolean value for each element: True if the element contains the pattern, else False.

replace(a, b)

Replaces the value a with the value b.

repeat(value)

Repeats each element the specified number of times.

startswith(pattern)

Returns True if the element in the Series/Index starts with the pattern.

endswith(pattern)

Returns True if the element in the Series/Index ends with the pattern.

find(pattern)

Returns the position of the first occurrence of the pattern.

findall(pattern)

Returns a list of all occurrences of the pattern.

swapcase()

Swaps the case, lower to upper and upper to lower.

islower()

Checks whether all characters in each string in the Series/Index are in lower case. Returns a Boolean.

isnumeric()

Checks whether all characters in each string in the Series/Index are numeric. Returns a Boolean.

isupper()

Checks whether all characters in each string in the Series/Index are in upper case. Returns a Boolean.
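
A few of the functions above are sketched here on a small Series that includes one missing value. The sample strings are made up, and each function is called through the .str accessor of the Series.

```python
import numpy as np
import pandas as pd

s = pd.Series(["  Tom ", "william rick", np.nan, "ALBER@t"])

lowered = s.str.lower()         # the missing value stays missing
lengths = s.str.len()           # length of each string
stripped = s.str.strip()        # whitespace removed from both ends
has_at = s.str.contains("@")    # pattern test per element
starts = s.str.startswith("w")  # prefix test per element

joined = pd.Series(["a", "b", "c"]).str.cat(sep="-")  # one combined string
doubled = pd.Series(["ab", "cd"]).str.repeat(2)       # each element repeated twice

print(lowered)
print(joined)
```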