How to Analyse Data with Pandas

Part 1 : Introduction, Selection and Filtering

Pandas is an open-source data analysis and manipulation library built with Python language. Pandas Library was developed by Wes McKinney when he was working at a financial firm called AQR Capital Management. The panda's name is derived from panel data, an econometrics term for multi-dimensional structured datasets, and a play on the phrase Python data analysis. Pandas collaborate well with other libraries for machine learning, statistics, data visualization and more. Pandas is used in a wide variety of domains like Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics etc.

Pandas library can be used for various data analyses and manipulations and some of them outlined below,

  • Read and write data out of different formats like CSV, Excel, SQL etc.

  • Missing data can be easily handled

  • Slicing, Indexing and reshaping data

  • Creating additional columns and deleting columns

  • Aggregations and Window Functions

  • Merging and Joining datasets/tables

  • Handling time series data

Installation

Pandas can be installed with pip and conda package managers with the following syntax.

  • pip install pandas or conda install pandas

Pandas can be installed directly in some Linux OS as mentioned below

  • Debian/Ubuntu sudo apt-get install python3-pandas

  • Fedora dnf install python3-pandas

  • CentOS/RHEL yum install python3-pandas

To start working with pandas, we need to import the package and a common way of doing it with import pandas as pd

Pandas Data Structures

The primary data structures of Pandas are Series (1-dimensional array) and DataFrame (2-dimensional tabular structure). Pandas data structures are flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and a Series is a container for scalar values.

Series is one dimensional labelled array capable of holding any type of data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=False)

  • data: Array-like, Iterable, dictionary, or a scalar value.

  • index: array-like or Index (1d), Values must be hashable and have the same length as data.

  • dtype: Data type for the output Series. It can be str, numpy.dtype, or ExtensionDtype, optional. Inferred from data if not specified.

  • name: The name to give to the Series. Hashable, default None

  • copy: boolean value, default False. Copy input data. Only affects Series or 1d ndarray input.

DataFrame is a 2-dimensional labelled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)

  • data: A dictionary, Series, arrays, constants, data class or list-like objects. Alignment is done on Series/DataFrame inputs. ndarray (structured or homogeneous), Iterable, dictionary, or DataFrame.

  • index: Index to use for resulting data frame. Index or array-like and default to RangeIndex if no information is provided.

  • columns: Column labels to use for the resulting data frame. Default to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.

  • dtype: Data type to force. default None.

  • copy: Copy data from inputs. boolean or None and default None.

# Creating Series and DataFrames
# Importing Pandas librray
import pandas as pd

# Creating Series and DataFrame objects from Lists
series1 = pd.Series([1,2,3,4])
series2 = pd.Series({'a':1, 'b':2, 'c':3})
df1 = pd.DataFrame(['a','b','c','d'])
df2 = pd.DataFrame({'data':[1,2,3,4]})

# Printing series1
print(series1)
0    1
1    2
2    3
3    4
dtype: int64

# Printing series2
print(series2)
a    1
b    2
c    3
dtype: int64

# Printing df1
print(df1)
   0
0  a
1  b
2  c
3  d

# Printing df2
   data
0     1
1     2
2     3
3     4

Let's explore how to utilise the Pandas library for doing data analysis and manipulation tasks, with data frames.

Working with Pandas

Input and Output

Pandas support many different file formats for reading and writing data. read_* methods provide functionality to read data from various sources and to_* methods provide functionality for exporting data.

Some of the Methods shown below,

File FormatsRead MethodsWrite Methods
CSVpd.read_csv()df.to_csv()
Excelpd.read_excel()df.to_excel()
JSONpd.read_json()df.to_json()
HTMLpd.read_html()df.to_html()
Parquetpd.read_parquet()df.to_parquet()
ORCpd.read_orc()df.to_orc()
SQLpd.read_sql()df.to_sql()
Picklepd.read_pickle()df.to_pickle()

Explaining reading CSV file and writing data to CSV file below.

>>> import pandas as pd

>>> # Reading a CSV file into DataFrame
>>> file = "./data/Student_Behaviour.csv"
>>> df1 = pd.read_csv(file)

>>> # Writing dataframe to CSV file
>>> df2 = pd.DataFrame({"Names":["Anil", "Hari", "Naveen", "Srikanth", "Sunil"],
       "Marks":[89,91,78,92,88]})
>>> df2.to_csv("./marks.csv")

read_* and to_* methods take other arguments apart from file names, like delimiter, header, names, index_col and others for the read_csv method. Refer to the link for comprehensive input and output methods of Pandas at https://pandas.pydata.org/docs/reference/io.html

Basic Information

There are DataFrame methods that show basic information about the data. CSV file used for explanation downloaded from https://www.kaggle.com/datasets/gunapro/student-behavior

The following methods give basic information about the data frame.

Method NameReturns
df.head()First n rows, default value 5
df.tail()Last n rows, default value 5
df.info()Summary of a DataFrame.
df.describe()Basic statistics like count, mean, sum, standard deviation etc.
df.dtypesData types of columns in the DataFrame. By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit). object dtype, which can hold any Python object, including strings. Mostly pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy provides support for float, int, bool, timedelta64[ns] and datetime64[ns].
df.columnsColumn names or labels
df.indexIndex or row labels of the DataFrame
df.shapeNumber of Rows and columns in tuple format (rows, columns)
df.memory_usage()Memory usage of each column in bytes
>>> import pandas as pd
>>> df = pd.read_csv("./data/Student_Behaviour.csv")
>>> df.head()
Certification Course  Gender Department  Height(CM)  ...  Travelling Time   Stress Level   Financial Status  part-time job
0                   No    Male        BCA       100.0  ...   30 - 60 minutes            Bad               Bad             No
1                   No  Female        BCA        90.0  ...    0 - 30 minutes            Bad               Bad             No
2                  Yes    Male        BCA       159.0  ...   30 - 60 minutes          Awful               Bad             No
3                  Yes  Female        BCA       147.0  ...    0 - 30 minutes            Bad              good             No
4                   No    Male        BCA       170.0  ...   30 - 60 minutes           Good              good             No

>>> df.index
RangeIndex(start=0, stop=235, step=1)

>>> df.columns
Index(['Certification Course', 'Gender', 'Department', 'Height(CM)',
       'Weight(KG)', '10th Mark', '12th Mark', 'college mark', 'hobbies',
       'daily studing time', 'prefer to study in', 'salary expectation',
       'Do you like your degree?',
       'willingness to pursue a career based on their degree  ',
       'social medai & video', 'Travelling Time ', 'Stress Level ',
       'Financial Status', 'part-time job'],
      dtype='object')

>>> df.dtypes
Certification Course                                       object
Gender                                                     object
Department                                                 object
Height(CM)                                                float64
Weight(KG)                                                float64
10th Mark                                                 float64
12th Mark                                                 float64
college mark                                              float64
hobbies                                                    object
daily studing time                                         object
prefer to study in                                         object
salary expectation                                          int64
Do you like your degree?                                   object
willingness to pursue a career based on their degree       object
social medai & video                                       object
Travelling Time                                            object
Stress Level                                               object
Financial Status                                           object
part-time job                                              object
dtype: object

Selecting and Filtering

In data analysis activities, working with a subset of columns/data is common, with pandas a subset of data can be selected with square brackets []. Inside the brackets, a single column/row label or list of columns/row labels, conditional expression or colon can be used for a subset of data. A single column may also be selected with dot(.) notation also.

A new column can be created by assigning the output to the DataFrame with a new column name in between the [].

# For selecting a single column with Brackets
>>> df["Department"]
0            BCA
1            BCA
2            BCA
3            BCA
4            BCA
         ...
230    B.com ISM
231    B.com ISM
232          BCA
233     Commerce
234    B.com ISM
Name: Department, Length: 235, dtype: object

For selecting a single column with dot notation
>>> df.Department
0            BCA
1            BCA
2            BCA
3            BCA
4            BCA
         ...
230    B.com ISM
231    B.com ISM
232          BCA
233     Commerce
234    B.com ISM
Name: Department, Length: 235, dtype: object

# For Selecting multiple columns, list of columns inside brackets
>>> df[["Department","Gender"]]
    Department  Gender
0          BCA    Male
1          BCA  Female
2          BCA    Male
3          BCA  Female
4          BCA    Male
..         ...     ...
230  B.com ISM    Male
231  B.com ISM    Male
232        BCA  Female
233   Commerce  Female
234  B.com ISM  Female

[235 rows x 2 columns]
MethodDescription
df.loc[]Access a group of rows and columns by label(s) or a boolean array.
df.iloc[]Integer-location-based indexing for selection by position.
df.insert(loc, column, value[, ...])Insert column into DataFrame at the specified location.
df.items()Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
df.iterrows()Iterate over DataFrame rows as (index, Series) pairs.
df.where(cond[, other, inplace, ...])Replace values where the condition is False.
df.mask(cond[, other, inplace, axis, ...])Replace values where the condition is True.
df.drop([labels, axis, index, ...])Drop specified labels from rows or columns.
df.filter([items, like, regex, axis])Subset the data frame rows or columns according to the specified index labels.
df.query(expr, *[, inplace])Query the columns of a DataFrame with a boolean expression.
df.isin(values)Whether each element in the DataFrame is contained in values.
# Selecting data with label/index 
>>> df.loc[1]
Certification Course                                                  No
Gender                                                            Female
Department                                                           BCA
Height(CM)                                                          90.0
Weight(KG)                                                          40.0
10th Mark                                                           70.0
12th Mark                                                           80.0
college mark                                                        70.0
hobbies                                                           Cinema
daily studing time                                        30 - 60 minute
prefer to study in                                               Morning
salary expectation                                                 15000
Do you like your degree?                                             Yes
willingness to pursue a career based on their degree                 75%
social medai & video                                       1 - 1.30 hour
Travelling Time                                           0 - 30 minutes
Stress Level                                                         Bad
Financial Status                                                     Bad
part-time job                                                         No
Name: 1, dtype: object

# Selecting multiple data with labels/index
# df.loc[[list_of _rows]]
>>> df.loc[[1,2,3]]
  Certification Course  Gender Department  Height(CM)  ...  Travelling Time   Stress Level   Financial Status  part-time job
1                   No  Female        BCA        90.0  ...    0 - 30 minutes            Bad               Bad             No
2                  Yes    Male        BCA       159.0  ...   30 - 60 minutes          Awful               Bad             No
3                  Yes  Female        BCA       147.0  ...    0 - 30 minutes            Bad              good             No

# Selecting multiple data with labels and columns
# df.loc[[list_of _rows], ['list_of_columns']]
>>> df.loc[[1,2,3], ["Gender", "Stress Level "]]
   Gender Stress Level
1  Female           Bad
2    Male         Awful
3  Female           Bad

Note: Column headers are case-sensitive. This dataset contains columns with trailing spaces.

For filtering data based on conditions, conditions can be given in the brackets, where the results will be shown based on conditions evaluated to True.

# Filtering data with a condition
>>> df[df["Gender"]=="Male"]
    Certification Course Gender Department  Height(CM)  ...  Travelling Time   Stress Level   Financial Status  part-time job
0                     No   Male        BCA       100.0  ...   30 - 60 minutes            Bad               Bad             No
2                    Yes   Male        BCA       159.0  ...   30 - 60 minutes          Awful               Bad             No
4                     No   Male        BCA       170.0  ...   30 - 60 minutes           Good              good             No
6                    Yes   Male        BCA       165.0  ...     1 - 1.30 hour           Good              good             No
7                     No   Male        BCA       152.0  ...     1 - 1.30 hour           Good              good             No
..                   ...    ...        ...         ...  ...               ...            ...               ...            ...
223                  Yes   Male   Commerce       170.0  ...    0 - 30 minutes       fabulous               Bad             No
227                  Yes   Male  B.com ISM       167.0  ...    0 - 30 minutes          Awful               Bad             No
228                  Yes   Male  B.com ISM       155.0  ...     1 - 1.30 hour           Good              good             No
230                  Yes   Male  B.com ISM       170.0  ...   30 - 60 minutes            Bad               Bad             No
231                  Yes   Male  B.com ISM       172.0  ...   30 - 60 minutes           Good              good             No

[156 rows x 19 columns]

# Filtering data with a query
# Query should be string only df.query("col_name == value")
>>> df.query("Gender == 'Male'")
    Certification Course Gender Department  Height(CM)  ...  Travelling Time   Stress Level   Financial Status  part-time job
0                     No   Male        BCA       100.0  ...   30 - 60 minutes            Bad               Bad             No
2                    Yes   Male        BCA       159.0  ...   30 - 60 minutes          Awful               Bad             No
4                     No   Male        BCA       170.0  ...   30 - 60 minutes           Good              good             No
6                    Yes   Male        BCA       165.0  ...     1 - 1.30 hour           Good              good             No
7                     No   Male        BCA       152.0  ...     1 - 1.30 hour           Good              good             No
..                   ...    ...        ...         ...  ...               ...            ...               ...            ...
223                  Yes   Male   Commerce       170.0  ...    0 - 30 minutes       fabulous               Bad             No
227                  Yes   Male  B.com ISM       167.0  ...    0 - 30 minutes          Awful               Bad             No
228                  Yes   Male  B.com ISM       155.0  ...     1 - 1.30 hour           Good              good             No
230                  Yes   Male  B.com ISM       170.0  ...   30 - 60 minutes            Bad               Bad             No
231                  Yes   Male  B.com ISM       172.0  ...   30 - 60 minutes           Good              good             No

[156 rows x 19 columns]

# Filtering based on multiple conditions
>>> df[(df["Department"]=="BCA") & (df["Gender"]=="Male")]
    Certification Course Gender Department  Height(CM)  ...  Travelling Time   Stress Level   Financial Status  part-time job
0                     No   Male        BCA       100.0  ...   30 - 60 minutes            Bad               Bad             No
2                    Yes   Male        BCA       159.0  ...   30 - 60 minutes          Awful               Bad             No
4                     No   Male        BCA       170.0  ...   30 - 60 minutes           Good              good             No
6                    Yes   Male        BCA       165.0  ...     1 - 1.30 hour           Good              good             No
7                     No   Male        BCA       152.0  ...     1 - 1.30 hour           Good              good             No
..                   ...    ...        ...         ...  ...               ...            ...               ...            ...
146                   No   Male        BCA       165.0  ...    0 - 30 minutes       fabulous          Fabulous            Yes
150                  Yes   Male        BCA       175.0  ...     1 - 1.30 hour           Good               Bad             No
151                   No   Male        BCA       175.0  ...     1.30 - 2 hour           Good              good             No
152                   No   Male        BCA       160.0  ...     1 - 1.30 hour          Awful              good             No
156                  Yes   Male        BCA       175.0  ...     2 - 2.30 hour           Good              good             No

[94 rows x 19 columns]

# Filtering data with isin()
# Values to be a list
>>> df[df.Gender.isin(["Male"])]
    Certification Course Gender Department  Height(CM)  ...  Travelling Time   Stress Level   Financial Status  part-time job
0                     No   Male        BCA       100.0  ...   30 - 60 minutes            Bad               Bad             No
2                    Yes   Male        BCA       159.0  ...   30 - 60 minutes          Awful               Bad             No
4                     No   Male        BCA       170.0  ...   30 - 60 minutes           Good              good             No
6                    Yes   Male        BCA       165.0  ...     1 - 1.30 hour           Good              good             No
7                     No   Male        BCA       152.0  ...     1 - 1.30 hour           Good              good             No
..                   ...    ...        ...         ...  ...               ...            ...               ...            ...
223                  Yes   Male   Commerce       170.0  ...    0 - 30 minutes       fabulous               Bad             No
227                  Yes   Male  B.com ISM       167.0  ...    0 - 30 minutes          Awful               Bad             No
228                  Yes   Male  B.com ISM       155.0  ...     1 - 1.30 hour           Good              good             No
230                  Yes   Male  B.com ISM       170.0  ...   30 - 60 minutes            Bad               Bad             No
231                  Yes   Male  B.com ISM       172.0  ...   30 - 60 minutes           Good              good             No

[156 rows x 19 columns]

This is some basic working with the Pandas library, in the next article, will cover other topics like handling null values, duplicate values, aggregations etc.

Link for Pandas cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

Link for Pandas API: https://pandas.pydata.org/docs/reference/index.html

Thank you for reading, please like and share.