- pandas is a Python library for data analysis.
- pandas was initially developed in 2008 by Wes McKinney while he was working at AQR Capital Management.
- He convinced AQR to allow him to open source the library.
- In 2012 another AQR employee, Chang She, joined as the second major contributor to the library.
- pandas is built on top of two core Python libraries: matplotlib, which is used for data visualization, and NumPy, which is used for mathematical operations.
- It acts as a wrapper over these libraries, allowing you to access many of matplotlib's and NumPy's methods with less code.
- For instance, the pandas plot() method combines multiple matplotlib methods into a single method, enabling you to plot a chart in a few lines.
- Before pandas, most analysts used Python for data munging and preparation, and then switched to a more domain-specific language like R for the rest of their workflow.
- Pandas introduced two new types of objects for storing data which make analytical tasks easier and eliminate the need to switch tools.
- They are Series, which have a list-like structure, and DataFrames, which have a tabular structure.
- Data Aggregation & Data Transformation.
- Data cleansing & Data adaption.
- Manipulation of data structures.
- Selection of data according to definable criteria.
- Handling of missing data.
- Splitting of large amounts of data.
- Processing & adjusting time series.
- Reading data from different data formats.
- Filter functions for data.
- pandas has two main data structures: Series and DataFrames.
- A Series is a one-dimensional array-like object that can contain any data type, such as floats, integers, strings, Python objects, and so on.
- It can be thought of as two arrays: one holding the index (labels) and the other holding the actual data.
import pandas as pd
# Build a Series from a list; pandas adds a default integer index (0 to 5)
s = pd.Series([11, 28, 72, 3, 5, 8])
print(s)
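- A Series can be viewed as an index paired with an array of values. The following is a minimal sketch of that view; the labels and values are made up for illustration.

```python
import pandas as pd

# A Series pairs an index (labels) with an array of values.
s = pd.Series([11, 28, 72], index=["a", "b", "c"])

print(s.index.tolist())   # the labels
print(s.values.tolist())  # the underlying data
print(s["b"])             # look up a value by its label
```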
- A DataFrame is a two-dimensional data structure, like a relational table with rows and columns.
- The idea of a DataFrame is based on spreadsheets, and its structure looks just like one.
- A DataFrame has both a row index and a column index, and it contains an ordered collection of columns, similar to an Excel sheet.
- Different columns of a DataFrame can have different types: the first column may consist of strings, while the second consists of boolean values, and so on.
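- As a small sketch of a DataFrame whose columns hold different types, consider the example below. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data: each column holds a different type.
df = pd.DataFrame({
    "name": ["Venkat", "Krishiv", "Siva"],  # strings
    "active": [True, False, True],          # booleans
    "rating": [4.23, 3.24, 3.98],           # floats
})

print(df.dtypes)    # one dtype per column
print(df.index)     # the row index
print(df.columns)   # the column index
```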
Pandas Descriptive Statistics Functions
- pandas has an array of methods and functions that calculate descriptive statistics on DataFrame columns.
- Basic aggregation methods such as sum() and mean() reduce the data to a single value per column, but some methods, like cumsum(), produce an object of the same size.
- These methods accept an axis argument, just like the ndarray methods sum(), std(), and so on, but the axis can be specified by integer or by name.
import pandas as pd
# Create a dictionary of Series
d = {'Name':pd.Series(['Venkat','Krishiv','Siva','Kishore','Nizar','Bharath','Karthik',
'Lokesh','Raja','Vijay','Suriya','Srikumar']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}
#Create a DataFrame
df = pd.DataFrame(d)
print(df)
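- The difference between reducing methods and same-size methods, and the use of a named axis, can be sketched as follows. The small frame here is hypothetical, so that the results are easy to check by hand.

```python
import pandas as pd

# Small hypothetical frame so the results are easy to verify.
df = pd.DataFrame({"Age": [25, 26, 25], "Rating": [4.0, 3.0, 5.0]})

col_sums = df.sum()                  # reduces each column to one value
row_means = df.mean(axis="columns")  # axis given by name: one value per row
running = df.cumsum()                # same shape as df, cumulative sums

print(col_sums)
print(row_means)
print(running)
```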
Pandas Function Application
- pandas offers three important methods for applying functions to its objects.
- Which one to use depends on whether the function should operate on the entire DataFrame, row- or column-wise, or element-wise.
- Data frame Function Application: pipe ()
- Row/Column level Function Application: apply ()
- Element level Function Application: applymap ()
Data frame Function Application: pipe ()
- Custom functions can be applied by passing the function and its additional arguments to pipe(). The DataFrame itself is passed as the first argument, so the operation is performed on the whole DataFrame.
import pandas as pd
import numpy as np
# adder receives the DataFrame as ele1 and the extra pipe argument as ele2
def adder(ele1, ele2):
    return ele1 + ele2
df = pd.DataFrame(np.random.randn(5, 3), columns=['Col1', 'Col2', 'Col3'])
print(df)
df1= df.pipe(adder, 2)
print(df1)
Row/Column level Function Application: apply ()
- Arbitrary functions can be applied along the axes of a DataFrame by using the apply() method.
- It takes an optional axis argument and can also be applied to a Series.
- By default, the operation is performed column-wise, taking every column as an array-like Series.
- It lets the user pass a function and apply it to every column or row of the DataFrame, or to every value of a Series.
- Because the function can encode arbitrary conditions, apply() is convenient for segregating data according to given conditions in data science and machine learning work.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
# Column-wise mean (axis=0 is the default)
print(df.apply(np.mean))
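- The effect of the axis argument of apply() can be sketched as below. The data and the lambda functions are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 20, 30]})

# Default (axis=0): the function receives each column as a Series.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_total = df.apply(lambda row: row["col1"] + row["col2"], axis=1)

print(col_range)
print(row_total)
```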
Element level Function Application: applymap ()
- The applymap() method on a DataFrame accepts any Python function that takes a single value and returns a single value.
- The function is applied to every element of the DataFrame, so it must accept and return only one scalar value.
- Such functions are often defined inline with the lambda keyword. Series offers the equivalent map() method, and pandas 2.1 and later provide DataFrame.map() as the preferred name for applymap().
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5,3),columns=['col1','col2','col3'])
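# Custom function applied to each element of one column with Series.map
print(df['col1'].map(lambda x:x*100))
# Column-wise means of the unmodified DataFrame, for comparison
print(df.apply(np.mean))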
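- Applying a function to every element of a whole DataFrame might look like the sketch below. It assumes that either DataFrame.map (pandas 2.1 and later) or the older DataFrame.applymap is available and picks whichever exists; the data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [1.5, 2.5], "col2": [3.0, 4.0]})

# Pick whichever element-wise method this pandas version offers:
# DataFrame.map (pandas 2.1+) or the older DataFrame.applymap.
elementwise = getattr(df, "map", None) or df.applymap

# The lambda takes one scalar and returns one scalar.
scaled = elementwise(lambda x: x * 100)
print(scaled)
```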
Pandas Iteration Operations
- Iterating over a dataset allows us to visit all of its values.
- This lets us carry out more complex operations and helps us get a better grasp of the data.
- pandas offers several ways to iterate over a DataFrame: row-wise, column-wise, or over each row in the form of a tuple.
- With pandas iteration, we can visit each element of the dataset sequentially and even apply mathematical operations while iterating.
- Before we start iterating in pandas, let us import the pandas library.
import pandas as pd
3 ways for iteration in pandas
- In pandas there are three ways to iterate over DataFrames:
- items(): Iterates over the DataFrame column-wise, yielding each column label with its column as a Series. (Older releases called this iteritems(), which was removed in pandas 2.0.)
- iterrows(): Iterates row-wise, yielding each row's index with the row as a Series.
- itertuples(): Iterates row-wise, yielding each row as a named tuple.
- With items(), we can visit each and every value of the dataset, column by column.
import pandas as pd
data = {
"firstname": ["Venkat", "Krishiv", "Sowmith"],
"age": [40, 20, 10]
}
df = pd.DataFrame(data)
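# items() yields (column label, column Series) pairs;
# it replaces iteritems(), which was removed in pandas 2.0
for a, b in df.items():
    print(a)
    print(b)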
- With iterrows(), we can visit all the elements of a dataset row-wise.
import pandas as pd
data = {
"firstname": ["Venkat", "Krishiv", "Sowmith"],
"age": [40, 20, 10]
}
df = pd.DataFrame(data)
for index, row in df.iterrows():
    print(row["firstname"])
- The itertuples() function creates a tuple for every row in the dataset.
import pandas as pd
data = {
"firstname": ["Venkat", "Krishiv", "Sowmith"],
"age": [40, 20, 10]
}
df = pd.DataFrame(data)
for row in df.itertuples():
    print(row)
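- The tuples produced by itertuples() are named tuples, so fields can be read as attributes. The sketch below reuses the sample data from above; the age threshold of 18 is an arbitrary value chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"firstname": ["Venkat", "Krishiv", "Sowmith"],
                   "age": [40, 20, 10]})

# index=False leaves the row label out of each tuple.
adults = []
for row in df.itertuples(index=False):
    if row.age >= 18:                 # fields are accessed as attributes
        adults.append(row.firstname)

print(adults)
```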
- pandas offers two kinds of sorting: by label, with sort_index(), and by actual value, with sort_values().
- With sort_index(), a DataFrame can be sorted by passing the axis argument and the order of sorting.
- By default, sorting is done on row labels in ascending order.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
sorted_df=unsorted_df.sort_index()
print(sorted_df)
- The sorting order can be controlled by passing a Boolean value to the ascending parameter.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
sorted_df = unsorted_df.sort_index(ascending=False)
print(sorted_df)
- Sorting can be done on the column labels by passing axis=1. The default, axis=0, sorts on the row labels.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
sorted_df=unsorted_df.sort_index(axis=1)
print(sorted_df)
- Like index sorting, sort_values() is the method for sorting by values.
- It accepts a by argument, which names the column (or columns) of the DataFrame whose values are used for sorting.
import pandas as pd
import numpy as np
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1')
print(sorted_df)
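- The by argument of sort_values() also accepts a list of columns, and ascending accepts a matching list of Booleans. A minimal sketch, reusing the sample frame from above:

```python
import pandas as pd

unsorted_df = pd.DataFrame({"col1": [2, 1, 1, 1], "col2": [1, 3, 2, 4]})

# Sort by col1 ascending, then break ties by col2 descending.
sorted_df = unsorted_df.sort_values(by=["col1", "col2"],
                                    ascending=[True, False])
print(sorted_df)
```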
- In real-life scenarios, missing data is always a problem.
- Areas like machine learning and data mining face severe issues with the accuracy of their model predictions because of poor data quality caused by missing values.
- In these areas, missing value treatment is a major point of focus for making models more accurate and valid.
# import the pandas library
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df)
- To make detecting missing values easier, pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df['one'].isnull())
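- Because isnull() and notnull() return Boolean masks, they can also be used to count missing values. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan])

missing_mask = s.isnull()            # True where the value is missing
present_mask = s.notnull()           # the logical opposite
n_missing = int(missing_mask.sum())  # each True counts as 1

print(missing_mask.tolist())
print(n_missing)
```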
Replace NaN with a Scalar Value
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one',
'two', 'three'])
df = df.reindex(['a', 'b', 'c'])
print(df)
print ("NaN replaced with '0':")
print(df.fillna(0))
- If we simply want to exclude the missing values, we use the dropna() function along with the axis argument.
- By default, axis=0 (along rows), which means that if any value within a row is NA, the whole row is excluded.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f',
'h'],columns=['one', 'two', 'three'])
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df.dropna())
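- The effect of the axis argument of dropna() can be sketched with a small, made-up frame in which only one column contains a missing value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"one": [1.0, np.nan, 3.0],
                   "two": [4.0, 5.0, 6.0]})

rows_kept = df.dropna()        # axis=0: drop rows containing any NaN
cols_kept = df.dropna(axis=1)  # axis=1: drop columns containing any NaN

print(rows_kept)
print(cols_kept)
```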
Pandas Indexing and Selecting Data
- The Numpy and Python indexing operators "[ ]" and attribute operator "." provide quick and easy access to Pandas data structures across a wide range of use cases.
- However, directly using standard operators has some optimization limits, since the type of the data to be accessed isn’t known in advance.
- For production code, we recommend that you take advantage of the optimized pandas data access methods explained in this chapter.
- pandas has offered three types of multi-axes indexing:
- .loc: label based
- .iloc: integer-position based
- .ix: both label and integer based (deprecated in pandas 0.20 and removed in pandas 1.0)
- pandas provides various methods for purely label-based indexing. When slicing by label, both the start and the stop bound are included.
- Integers are valid labels, but they refer to the label and not the position.
- .loc has multiple access methods: a single scalar label, a list of labels, a slice object, or a boolean array.
- .loc takes two single/list/range operators separated by ','. The first one indicates the rows and the second one indicates the columns.
#import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4),
index = ['a','b','c','d','e','f','g','h'], columns = ['A', 'B', 'C', 'D'])
#select all rows for a specific column
print(df.loc[:,'A'])
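- The access methods of .loc listed above can be sketched as follows. The frame and its labels are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]},
                  index=["a", "b", "c", "d"])

single = df.loc["b", "A"]           # scalar label for row and column
subset = df.loc[["a", "c"], ["B"]]  # lists of labels
sliced = df.loc["b":"d", "A"]       # label slice: both ends are included
masked = df.loc[df["A"] > 2]        # boolean array

print(single, sliced.tolist())
print(subset)
print(masked)
```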
- pandas provides various methods for purely integer-based indexing. As in Python and NumPy, these use 0-based indexing.
- .iloc has several access methods: an integer, a list of integers, or a range of values.
# import the pandas library and aliasing as pd
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(8, 4), columns = ['A', 'B', 'C', 'D'])
# select the first four rows (positions 0 to 3)
print(df.iloc[:4])
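- The access methods of .iloc (an integer, a list of integers, a range) can be sketched as follows, again with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]})

first_row = df.iloc[0]     # a single integer: one row as a Series
picked = df.iloc[[0, 2]]   # a list of integer positions
block = df.iloc[1:3, 0:1]  # ranges: the end position is excluded

print(first_row.tolist())
print(picked)
print(block)
```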
- Besides pure label-based and integer-based indexing, older pandas releases provided a hybrid .ix operator for selecting and subsetting. It was removed in pandas 1.0, so .loc and .iloc should be used instead.
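- Since .ix is not available in current pandas releases (it was removed in pandas 1.0), mixed label and position selection can be expressed with .loc and .iloc. The sketch below shows one possible approach, not the only one; the frame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]},
                  index=["a", "b", "c"])

# Rows by position, column by label: translate positions to labels first.
by_loc = df.loc[df.index[[0, 2]], "B"]

# Row by label, column by position: translate the label to a position.
by_iloc = df.iloc[df.index.get_loc("b"), 1]

print(by_loc.tolist())
print(by_iloc)
```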
Pandas working with Text Data
- Pandas provides a set of string functions which make it easy to operate on string data.
- Most importantly, these functions ignore (or exclude) missing/NaN values.
- The following functions are available through the str accessor of a Series or Index:
- lower(): Converts strings in the Series/Index to lower case.
- upper(): Converts strings in the Series/Index to upper case.
- len(): Computes the length of each string.
- strip(): Strips whitespace (including newlines) from both sides of each string in the Series/Index.
- split(): Splits each string with the given pattern.
- cat(): Concatenates the Series/Index elements with the given separator.
- get_dummies(): Returns a DataFrame with one-hot encoded values.
- contains(): Returns a Boolean value, True for each element that contains the substring, else False.
- replace(): Replaces the value a with the value b.
- startswith(): Returns True if the element in the Series/Index starts with the pattern.
- endswith(): Returns True if the element in the Series/Index ends with the pattern.
- find(): Returns the position of the first occurrence of the pattern.
- findall(): Returns a list of all occurrences of the pattern.
- swapcase(): Swaps lower case and upper case.
- islower(): Checks whether all characters in each string in the Series/Index are lower case. Returns a Boolean.
- isnumeric(): Checks whether all characters in each string in the Series/Index are numeric. Returns a Boolean.
- isupper(): Checks whether all characters in each string in the Series/Index are upper case. Returns a Boolean.
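- A few of these string functions can be sketched as below. They are reached through the str accessor of a Series; the sample strings are made up, and the missing value is included to show that it is passed through rather than raising an error.

```python
import numpy as np
import pandas as pd

s = pd.Series(["  Tom ", "William Rick", np.nan, "ALBER@T"])

lowered = s.str.lower()                       # the missing value stays missing
lengths = s.str.len()                         # length of each string
stripped = s.str.strip()                      # whitespace removed on both sides
has_space = s.str.contains(" ", regex=False)  # plain substring test

print(stripped.tolist())
print(lengths.tolist())
print(has_space.tolist())
```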