How to Analyse Data with Pandas
Part 1: Introduction, Selection and Filtering
Pandas is an open-source data analysis and manipulation library built with Python. It was developed by Wes McKinney while he was working at the financial firm AQR Capital Management. The name pandas is derived from panel data, an econometrics term for multi-dimensional structured datasets, and is also a play on the phrase Python data analysis. Pandas integrates well with other libraries for machine learning, statistics, data visualization and more, and is used in a wide variety of domains such as finance, neuroscience, economics, statistics, advertising and web analytics.
The Pandas library can be used for various data analysis and manipulation tasks, some of which are outlined below:
Read and write data out of different formats like CSV, Excel, SQL etc.
Missing data can be easily handled
Slicing, Indexing and reshaping data
Creating additional columns and deleting columns
Aggregations and Window Functions
Merging and Joining datasets/tables
Handling time series data
Installation
Pandas can be installed with pip and conda package managers with the following syntax.
pip install pandas
or
conda install pandas
Pandas can also be installed directly from the package repositories of some Linux distributions, as shown below.
Debian/Ubuntu
sudo apt-get install python3-pandas
Fedora
dnf install python3-pandas
CentOS/RHEL
yum install python3-pandas
To start working with pandas, we need to import the package; a common way of doing this is import pandas as pd
Pandas Data Structures
The primary data structures of Pandas are Series (1-dimensional array) and DataFrame (2-dimensional tabular structure). Pandas data structures are flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and a Series is a container for scalar values.
Series is a one-dimensional labelled array capable of holding any data type (integers, strings, floating-point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.
pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=False)
data: Array-like, Iterable, dictionary, or a scalar value.
index: array-like or Index (1d). Values must be hashable and have the same length as data.
dtype: Data type for the output Series. It can be str, numpy.dtype, or ExtensionDtype, optional. Inferred from data if not specified.
name: The name to give to the Series. Hashable, default None
copy: boolean value, default False. Copy input data. Only affects Series or 1d ndarray input.
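As a small sketch (with made-up values), the parameters above can be combined like this:

```python
import pandas as pd

# A Series with a custom index, an explicit dtype and a name
s = pd.Series([10, 20, 30], index=["a", "b", "c"], dtype="float64", name="scores")
print(s)

# Labels from the custom index can be used for lookup
print(s["b"])
```

Because a custom index was supplied, the default RangeIndex (0, 1, 2, …) is replaced by the given labels.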
DataFrame is a 2-dimensional labelled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)
data: ndarray (structured or homogeneous), Iterable, dictionary, or DataFrame. Alignment is done on Series/DataFrame inputs.
index: Index to use for the resulting data frame. Index or array-like; defaults to RangeIndex if no indexing information is provided.
columns: Column labels to use for the resulting data frame. Defaults to RangeIndex(0, 1, 2, …, n). If data contains column labels, column selection will be performed instead.
dtype: Data type to force. Default None.
copy: Copy data from inputs. Boolean or None; default None.
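A quick sketch (using made-up values) of these constructor parameters in action:

```python
import pandas as pd

# DataFrame from a dict, with a custom row index and a forced dtype
df = pd.DataFrame(
    {"maths": [88, 92], "physics": [79, 85]},
    index=["Anil", "Hari"],
    dtype="float64",
)
print(df)

# When data already contains column labels, the columns parameter
# performs column selection instead of relabelling
df_sub = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]}, columns=["a", "c"])
print(df_sub)
```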
# Creating Series and DataFrames
# Importing Pandas library
import pandas as pd
# Creating Series and DataFrame objects from Lists
series1 = pd.Series([1,2,3,4])
series2 = pd.Series({'a':1, 'b':2, 'c':3})
df1 = pd.DataFrame(['a','b','c','d'])
df2 = pd.DataFrame({'data':[1,2,3,4]})
# Printing series1
print(series1)
0 1
1 2
2 3
3 4
dtype: int64
# Printing series2
print(series2)
a 1
b 2
c 3
dtype: int64
# Printing df1
print(df1)
0
0 a
1 b
2 c
3 d
# Printing df2
print(df2)
data
0 1
1 2
2 3
3 4
Let's explore how to use the Pandas library for data analysis and manipulation tasks with DataFrames.
Working with Pandas
Input and Output
Pandas supports many different file formats for reading and writing data. read_* methods provide functionality to read data from various sources, and to_* methods provide functionality for exporting data.
Some of the methods are shown below:
| File Formats | Read Methods | Write Methods |
| --- | --- | --- |
| CSV | pd.read_csv() | df.to_csv() |
| Excel | pd.read_excel() | df.to_excel() |
| JSON | pd.read_json() | df.to_json() |
| HTML | pd.read_html() | df.to_html() |
| Parquet | pd.read_parquet() | df.to_parquet() |
| ORC | pd.read_orc() | df.to_orc() |
| SQL | pd.read_sql() | df.to_sql() |
| Pickle | pd.read_pickle() | df.to_pickle() |
Reading a CSV file and writing data to a CSV file are shown below.
>>> import pandas as pd
>>> # Reading a CSV file into DataFrame
>>> file = "./data/Student_Behaviour.csv"
>>> df1 = pd.read_csv(file)
>>> # Writing dataframe to CSV file
>>> df2 = pd.DataFrame({"Names":["Anil", "Hari", "Naveen", "Srikanth", "Sunil"],
"Marks":[89,91,78,92,88]})
>>> df2.to_csv("./marks.csv")
read_* and to_* methods take other arguments apart from file names; for example, read_csv accepts delimiter, header, names, index_col and others. Refer to the comprehensive list of Pandas input and output methods at https://pandas.pydata.org/docs/reference/io.html
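A brief sketch of some of those read_csv arguments, using a small in-memory CSV with made-up values rather than a real file:

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data
raw = "id;name;marks\n1;Anil;89\n2;Hari;91\n"

# sep sets the delimiter, index_col promotes a column to the index,
# and dtype forces column types on read
df = pd.read_csv(io.StringIO(raw), sep=";", index_col="id", dtype={"marks": "int64"})
print(df)
```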
Basic Information
There are DataFrame methods and attributes that show basic information about the data. The CSV file used in the examples was downloaded from https://www.kaggle.com/datasets/gunapro/student-behavior
The following methods give basic information about the data frame.
| Method Name | Returns |
| --- | --- |
| df.head() | First n rows; default 5 |
| df.tail() | Last n rows; default 5 |
| df.info() | Summary of the DataFrame |
| df.describe() | Basic statistics like count, mean, standard deviation, min, max etc. |
| df.dtypes | Data types of the columns. By default, integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit). The object dtype can hold any Python object, including strings. Pandas mostly uses NumPy arrays and dtypes for Series and individual DataFrame columns; NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns]. |
| df.columns | Column names or labels |
| df.index | Index (row labels) of the DataFrame |
| df.shape | Number of rows and columns as a tuple (rows, columns) |
| df.memory_usage() | Memory usage of each column in bytes |
>>> import pandas as pd
>>> df = pd.read_csv("./data/Student_Behaviour.csv")
>>> df.head()
Certification Course Gender Department Height(CM) ... Travelling Time Stress Level Financial Status part-time job
0 No Male BCA 100.0 ... 30 - 60 minutes Bad Bad No
1 No Female BCA 90.0 ... 0 - 30 minutes Bad Bad No
2 Yes Male BCA 159.0 ... 30 - 60 minutes Awful Bad No
3 Yes Female BCA 147.0 ... 0 - 30 minutes Bad good No
4 No Male BCA 170.0 ... 30 - 60 minutes Good good No
>>> df.index
RangeIndex(start=0, stop=235, step=1)
>>> df.columns
Index(['Certification Course', 'Gender', 'Department', 'Height(CM)',
'Weight(KG)', '10th Mark', '12th Mark', 'college mark', 'hobbies',
'daily studing time', 'prefer to study in', 'salary expectation',
'Do you like your degree?',
'willingness to pursue a career based on their degree ',
'social medai & video', 'Travelling Time ', 'Stress Level ',
'Financial Status', 'part-time job'],
dtype='object')
>>> df.dtypes
Certification Course object
Gender object
Department object
Height(CM) float64
Weight(KG) float64
10th Mark float64
12th Mark float64
college mark float64
hobbies object
daily studing time object
prefer to study in object
salary expectation int64
Do you like your degree? object
willingness to pursue a career based on their degree object
social medai & video object
Travelling Time object
Stress Level object
Financial Status object
part-time job object
dtype: object
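The remaining inspection methods from the table can be sketched on a tiny synthetic frame (made-up values, since the Kaggle file may not be at hand):

```python
import pandas as pd

df = pd.DataFrame({"marks": [89, 91, 78, 92], "age": [20, 21, 20, 22]})

print(df.shape)           # (rows, columns) as a tuple
print(df.describe())      # count, mean, std, min, quartiles, max per numeric column
print(df.memory_usage())  # bytes used by the index and each column
```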
Selecting and Filtering
In data analysis, working with a subset of the columns/rows is common. With pandas, a subset of data can be selected with square brackets []. Inside the brackets you can use a single column/row label, a list of column/row labels, a conditional expression or a slice (colon). A single column may also be selected with dot (.) notation.
A new column can be created by assigning the output to the DataFrame with a new column name between the [].
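As a quick sketch of that new-column assignment (using made-up values):

```python
import pandas as pd

df = pd.DataFrame({"marks": [89, 91, 78]})

# Assigning to a new label between the brackets creates the column
df["passed"] = df["marks"] >= 80
print(df)
```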
# For selecting a single column with Brackets
>>> df["Department"]
0 BCA
1 BCA
2 BCA
3 BCA
4 BCA
...
230 B.com ISM
231 B.com ISM
232 BCA
233 Commerce
234 B.com ISM
Name: Department, Length: 235, dtype: object
# For selecting a single column with dot notation
>>> df.Department
0 BCA
1 BCA
2 BCA
3 BCA
4 BCA
...
230 B.com ISM
231 B.com ISM
232 BCA
233 Commerce
234 B.com ISM
Name: Department, Length: 235, dtype: object
# For Selecting multiple columns, list of columns inside brackets
>>> df[["Department","Gender"]]
Department Gender
0 BCA Male
1 BCA Female
2 BCA Male
3 BCA Female
4 BCA Male
.. ... ...
230 B.com ISM Male
231 B.com ISM Male
232 BCA Female
233 Commerce Female
234 B.com ISM Female
[235 rows x 2 columns]
| Method | Description |
| --- | --- |
| df.loc[] | Access a group of rows and columns by label(s) or a boolean array. |
| df.iloc[] | Integer-location-based indexing for selection by position. |
| df.insert(loc, column, value[, ...]) | Insert column into DataFrame at the specified location. |
| df.items() | Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series. |
| df.iterrows() | Iterate over DataFrame rows as (index, Series) pairs. |
| df.where(cond[, other, inplace, ...]) | Replace values where the condition is False. |
| df.mask(cond[, other, inplace, axis, ...]) | Replace values where the condition is True. |
| df.drop([labels, axis, index, ...]) | Drop specified labels from rows or columns. |
| df.filter([items, like, regex, axis]) | Subset the data frame rows or columns according to the specified index labels. |
| df.query(expr, *[, inplace]) | Query the columns of a DataFrame with a boolean expression. |
| df.isin(values) | Whether each element in the DataFrame is contained in values. |
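Some of the methods in the table that are not demonstrated below can be sketched on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"Gender": ["Male", "Female", "Male"], "Marks": [89, 91, 78]},
)

# iloc selects by position, regardless of the index labels
print(df.iloc[0])

# drop removes rows or columns and returns a new DataFrame
print(df.drop(columns=["Gender"]))

# filter keeps columns whose names contain the given substring
print(df.filter(like="Mark"))
```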
# Selecting data with label/index
>>> df.loc[1]
Certification Course No
Gender Female
Department BCA
Height(CM) 90.0
Weight(KG) 40.0
10th Mark 70.0
12th Mark 80.0
college mark 70.0
hobbies Cinema
daily studing time 30 - 60 minute
prefer to study in Morning
salary expectation 15000
Do you like your degree? Yes
willingness to pursue a career based on their degree 75%
social medai & video 1 - 1.30 hour
Travelling Time 0 - 30 minutes
Stress Level Bad
Financial Status Bad
part-time job No
Name: 1, dtype: object
# Selecting multiple data with labels/index
# df.loc[[list_of_rows]]
>>> df.loc[[1,2,3]]
Certification Course Gender Department Height(CM) ... Travelling Time Stress Level Financial Status part-time job
1 No Female BCA 90.0 ... 0 - 30 minutes Bad Bad No
2 Yes Male BCA 159.0 ... 30 - 60 minutes Awful Bad No
3 Yes Female BCA 147.0 ... 0 - 30 minutes Bad good No
# Selecting multiple data with labels and columns
# df.loc[[list_of_rows], [list_of_columns]]
>>> df.loc[[1,2,3], ["Gender", "Stress Level "]]
Gender Stress Level
1 Female Bad
2 Male Awful
3 Female Bad
Note: Column headers are case-sensitive. This dataset contains columns with trailing spaces.
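One way to avoid the trailing-space pitfall is to normalise the labels once after loading. A minimal sketch on a small made-up frame, using `str.strip` to remove leading and trailing whitespace from every column name:

```python
import pandas as pd

df = pd.DataFrame({"Stress Level ": ["Bad", "Good"], "Gender": ["Male", "Female"]})

# Strip whitespace from every column label in one pass
df.columns = df.columns.str.strip()
print(df.columns)
```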
For filtering data, a condition can be given inside the brackets; the rows for which the condition evaluates to True are returned.
# Filtering data with a condition
>>> df[df["Gender"]=="Male"]
Certification Course Gender Department Height(CM) ... Travelling Time Stress Level Financial Status part-time job
0 No Male BCA 100.0 ... 30 - 60 minutes Bad Bad No
2 Yes Male BCA 159.0 ... 30 - 60 minutes Awful Bad No
4 No Male BCA 170.0 ... 30 - 60 minutes Good good No
6 Yes Male BCA 165.0 ... 1 - 1.30 hour Good good No
7 No Male BCA 152.0 ... 1 - 1.30 hour Good good No
.. ... ... ... ... ... ... ... ... ...
223 Yes Male Commerce 170.0 ... 0 - 30 minutes fabulous Bad No
227 Yes Male B.com ISM 167.0 ... 0 - 30 minutes Awful Bad No
228 Yes Male B.com ISM 155.0 ... 1 - 1.30 hour Good good No
230 Yes Male B.com ISM 170.0 ... 30 - 60 minutes Bad Bad No
231 Yes Male B.com ISM 172.0 ... 30 - 60 minutes Good good No
[156 rows x 19 columns]
# Filtering data with a query
# The query expression must be a string: df.query("col_name == value")
>>> df.query("Gender == 'Male'")
Certification Course Gender Department Height(CM) ... Travelling Time Stress Level Financial Status part-time job
0 No Male BCA 100.0 ... 30 - 60 minutes Bad Bad No
2 Yes Male BCA 159.0 ... 30 - 60 minutes Awful Bad No
4 No Male BCA 170.0 ... 30 - 60 minutes Good good No
6 Yes Male BCA 165.0 ... 1 - 1.30 hour Good good No
7 No Male BCA 152.0 ... 1 - 1.30 hour Good good No
.. ... ... ... ... ... ... ... ... ...
223 Yes Male Commerce 170.0 ... 0 - 30 minutes fabulous Bad No
227 Yes Male B.com ISM 167.0 ... 0 - 30 minutes Awful Bad No
228 Yes Male B.com ISM 155.0 ... 1 - 1.30 hour Good good No
230 Yes Male B.com ISM 170.0 ... 30 - 60 minutes Bad Bad No
231 Yes Male B.com ISM 172.0 ... 30 - 60 minutes Good good No
[156 rows x 19 columns]
# Filtering based on multiple conditions
>>> df[(df["Department"]=="BCA") & (df["Gender"]=="Male")]
Certification Course Gender Department Height(CM) ... Travelling Time Stress Level Financial Status part-time job
0 No Male BCA 100.0 ... 30 - 60 minutes Bad Bad No
2 Yes Male BCA 159.0 ... 30 - 60 minutes Awful Bad No
4 No Male BCA 170.0 ... 30 - 60 minutes Good good No
6 Yes Male BCA 165.0 ... 1 - 1.30 hour Good good No
7 No Male BCA 152.0 ... 1 - 1.30 hour Good good No
.. ... ... ... ... ... ... ... ... ...
146 No Male BCA 165.0 ... 0 - 30 minutes fabulous Fabulous Yes
150 Yes Male BCA 175.0 ... 1 - 1.30 hour Good Bad No
151 No Male BCA 175.0 ... 1.30 - 2 hour Good good No
152 No Male BCA 160.0 ... 1 - 1.30 hour Awful good No
156 Yes Male BCA 175.0 ... 2 - 2.30 hour Good good No
[94 rows x 19 columns]
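Conditions can also be combined with | (or) and negated with ~, with each condition in its own parentheses. A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame(
    {"Department": ["BCA", "Commerce", "BCA"], "Gender": ["Male", "Female", "Female"]}
)

# OR: rows matching either condition
either = df[(df["Department"] == "Commerce") | (df["Gender"] == "Male")]
print(either)

# NOT: rows where the condition is False
not_bca = df[~(df["Department"] == "BCA")]
print(not_bca)
```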
# Filtering data with isin()
# The values argument must be list-like
>>> df[df.Gender.isin(["Male"])]
Certification Course Gender Department Height(CM) ... Travelling Time Stress Level Financial Status part-time job
0 No Male BCA 100.0 ... 30 - 60 minutes Bad Bad No
2 Yes Male BCA 159.0 ... 30 - 60 minutes Awful Bad No
4 No Male BCA 170.0 ... 30 - 60 minutes Good good No
6 Yes Male BCA 165.0 ... 1 - 1.30 hour Good good No
7 No Male BCA 152.0 ... 1 - 1.30 hour Good good No
.. ... ... ... ... ... ... ... ... ...
223 Yes Male Commerce 170.0 ... 0 - 30 minutes fabulous Bad No
227 Yes Male B.com ISM 167.0 ... 0 - 30 minutes Awful Bad No
228 Yes Male B.com ISM 155.0 ... 1 - 1.30 hour Good good No
230 Yes Male B.com ISM 170.0 ... 30 - 60 minutes Bad Bad No
231 Yes Male B.com ISM 172.0 ... 30 - 60 minutes Good good No
[156 rows x 19 columns]
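isin() is most useful when checking against several values at once, which would otherwise need a chain of | conditions. A sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"Department": ["BCA", "Commerce", "B.com ISM", "BCA"]})

# Keep rows whose Department is any of the listed values
subset = df[df["Department"].isin(["BCA", "Commerce"])]
print(subset)
```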
This covers some basics of working with the Pandas library. The next article will cover other topics like handling null values, duplicate values, aggregations etc.
Link for Pandas cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf
Link for Pandas API: https://pandas.pydata.org/docs/reference/index.html