Python Pandas Full Course For Beginners



Introduction To Pandas

Pandas is one of the core libraries of the Python ecosystem, widely used by data scientists and machine learning engineers for data analysis tasks.
Pandas can handle large data sets with far more ease than Excel. The main uses of the pandas library in a data science project are data manipulation, data cleaning, data visualization, and more.

What You Will Learn After Completion Of This Tutorial:

  • You will be able to handle large data sets
  • You will be able to clean many types of data sets
  • You will be able to use pandas in any of your data science projects
  • You will be able to do data manipulation tasks with pandas
So let's begin.

Installation Of Pandas

To install pandas, just open your command prompt and type:
pip install pandas
If you have already installed the Anaconda distribution on your system, you don't need to install pandas again, because Anaconda ships with it.

IDE used in this tutorial

We use Jupyter Notebook in this tutorial, and I recommend you always use Jupyter Notebook so that you have the best experience. Jupyter Notebook comes with the Anaconda package; if you don't know how to install Anaconda, click here to install it on your system and then come back to this tutorial. After you have successfully installed Anaconda, go to the search bar in Windows, type "jupyter", and click the result to open Jupyter Notebook in your default browser. I hope you now have Jupyter Notebook open on your system, so let's start exploring pandas.

Import pandas: 

Whenever we use pandas in a project, the first thing we should do is import it in our notebook: import pandas. If you want to use pandas under a short name (the conventional alias), you can do that as follows: import pandas as pd
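A quick way to confirm the import worked is to print the installed version (the exact number will differ on your system):

```python
import pandas as pd

# Sanity check: if this prints a version string, pandas is installed correctly
print(pd.__version__)
```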

Create a Series

If you want to create a Series with pandas, you can do it as follows:
s = pd.Series(['element1', 'element2', 'element3'],
              index=[0, 1, 2],
              name='column_1')
print(s)
    
Output:
0    element1
1    element2
2    element3
Name: column_1, dtype: object
So the above code creates a single column (a Series) with a unique index.
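With the series in hand, you can look up values by index label and inspect its attributes; a minimal sketch using the same series as above:

```python
import pandas as pd

# The same series as in the example above
s = pd.Series(['element1', 'element2', 'element3'],
              index=[0, 1, 2],
              name='column_1')

print(s[0])      # look up a value by its index label
print(s.name)    # the name we gave the series
```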

Create A Data Frame

If you want to create a data frame with multiple rows and columns, you can do it as follows. We create a simple data frame:
data = [[2,3,4,5],[5,6,7,4],[7,6,5,4],[4,5,6,7]]
data_frame = pd.DataFrame(data,
                          index = [0,1,2,3],
                          columns=['col_1','col_2','col_3','col_4'])
print(data_frame)
Output:
   col_1  col_2  col_3  col_4
0      2      3      4      5
1      5      6      7      4
2      7      6      5      4
3      4      5      6      7
In the above code, we have created a data frame with 4 columns and a numeric index. Now let's create another data frame from a dictionary: we will build a small data frame of student data as follows:
student_id = [1,2,3]
student_name = ['nomi', 'rizu', 'sajnu']
marks = [444, 467,390]
data_frame = pd.DataFrame({
    "student_id": student_id,
    "student_name": student_name,
    "marks": marks
})
print(data_frame)
Output:
   student_id student_name  marks
0           1         nomi    444
1           2         rizu    467
2           3        sajnu    390
So the above code shows a small data frame that we have created. Try building some data frames yourself to get comfortable with the syntax.
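Once the frame exists, you can already ask questions of it; for instance, finding the student with the highest marks (idxmax returns the index label of the maximum value):

```python
import pandas as pd

# The same student data as above
df = pd.DataFrame({'student_id': [1, 2, 3],
                   'student_name': ['nomi', 'rizu', 'sajnu'],
                   'marks': [444, 467, 390]})

# idxmax gives the row label of the highest marks; loc then reads that row
top = df.loc[df['marks'].idxmax(), 'student_name']
print(top)
```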

Data Set

We will use an existing data set in this tutorial for the best experience with pandas. First, download the iris data set; you can download it from here.
The iris data set is a CSV file, so let's see how to bring it into our notebook.

Read CSV File

As you know, most of the time we have data in CSV form, and most data sets for data science projects are CSVs, so let's see how we can open or read a CSV file in our Jupyter Notebook. There are many ways to do it; one of them is as follows:
data = pd.read_csv(r'data url \ data_name.csv')
The above is just the general structure. To load a CSV, call read_csv and, inside the parentheses, make sure you either prefix the path string with r (a raw string) or use double backslashes in your path, then specify the file name ending in .csv, as below:
data = pd.read_csv(r'D:\online courses\Data Sets\iris.csv')
I have given the path on my system, so you should give the path according to where your data is. If you are still stuck, you can watch the video tutorial and follow along with me.
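If you don't have the file on disk yet, you can still try read_csv, because it accepts any file-like object, not just a path; here is a sketch with a tiny in-memory CSV (hypothetical values, in the iris column format):

```python
import io
import pandas as pd

# A tiny CSV held in memory stands in for a file on disk
csv_text = """Sepal.Length,Sepal.Width,Species
5.1,3.5,setosa
4.9,3.0,setosa
"""
data = pd.read_csv(io.StringIO(csv_text))
print(data.shape)   # (2, 3): two rows, three columns
```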

Selecting rows and columns

show first 5 rows

data.head()
Output: the first five rows of the iris data set.
It will show you the first five rows of your data, but if you want 2 or 3 rows, or any number of your choice, pass that number between the parentheses, e.g.:
data.head(3)
It will give you the first three rows of your data set.

Show last 10 rows

If you want to check the last 10 rows of your data set, you can do that as follows:
data.tail(10)
Output: the last ten rows of the iris data set.
Select single column:

data['Sepal.Length']
OutPut:
0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ... 
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: Sepal.Length, Length: 150, dtype: float64
So we have successfully grabbed the Sepal.Length column.

Select multiple columns

You can also select multiple columns by passing a list of column names:
data[['Sepal.Length','Sepal.Width']]
Output:
Sepal.Length	Sepal.Width
0	5.1	3.5
1	4.9	3.0
2	4.7	3.2
3	4.6	3.1
4	5.0	3.6
...	...	...
145	6.7	3.0
146	6.3	2.5
147	6.5	3.0
148	6.2	3.4
149	5.9	3.0
150 rows × 2 columns

Select rows by index values:

If you want to select rows by index label, you can use loc[]. In the following example we grab the first row, at index 0:
data.loc[0]
OutPut:
Sepal.Length       5.1
Sepal.Width        3.5
Petal.Length       1.4
Petal.Width        0.2
Species         setosa
Name: 0, dtype: object
You can also access several rows at once, e.g. data.loc[[0, 3]].

Select rows by position:

You can also select rows by position using iloc[]. Note that iloc slicing is position-based and excludes the end point, while loc slicing is label-based and includes it, so data.loc[1:2] and data.iloc[1:3] both return the second and third rows here:
data.iloc[1:3]
Output:
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
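The difference between loc and iloc matters once the index labels stop matching the positions; a minimal sketch with a hypothetical non-default index:

```python
import pandas as pd

# loc selects by label (end-inclusive); iloc by position (end-exclusive)
df = pd.DataFrame({'val': [10, 20, 30, 40]}, index=[100, 101, 102, 103])

by_label = df.loc[101:102]     # rows labelled 101 and 102 (inclusive)
by_position = df.iloc[1:3]     # rows at positions 1 and 2 (3 excluded)

print(by_label['val'].tolist())      # [20, 30]
print(by_position['val'].tolist())   # [20, 30]: same rows, reached two ways
```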

Data wrangling

Filter by value:

If you want to grab rows matching a comparison operator, you can filter with a boolean condition. In this case I will grab all rows whose value in the first column is greater than 4.6:
data[data['Sepal.Length']>4.6]
Output:
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
0	5.1	3.5	1.4	0.2	setosa
1	4.9	3.0	1.4	0.2	setosa
2	4.7	3.2	1.3	0.2	setosa
4	5.0	3.6	1.4	0.2	setosa
5	5.4	3.9	1.7	0.4	setosa
...	...	...	...	...	...
145	6.7	3.0	5.2	2.3	virginica
146	6.3	2.5	5.0	1.9	virginica
147	6.5	3.0	5.2	2.0	virginica
148	6.2	3.4	5.4	2.3	virginica
149	5.9	3.0	5.1	1.8	virginica
141 rows × 5 columns
So it gives us only the rows in which Sepal.Length is greater than 4.6. Note that the index is no longer consecutive, because the non-matching rows were dropped; you can check this in your own notebook. Practice filtering on other columns by yourself.
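You can also combine several conditions; each one needs its own parentheses, joined with & (and) or | (or). A sketch on a hypothetical slice of the iris columns:

```python
import pandas as pd

# Hypothetical stand-in for the iris data
df = pd.DataFrame({'Sepal.Length': [4.6, 5.1, 6.7, 4.4],
                   'Sepal.Width':  [3.1, 3.5, 3.0, 2.9]})

# Each condition is parenthesized, then combined with &
mask = (df['Sepal.Length'] > 4.6) & (df['Sepal.Width'] >= 3.0)
print(df[mask]['Sepal.Length'].tolist())   # [5.1, 6.7]
```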

Sort by one column

If you want to sort your data set by a column, you can do that as follows:
data.sort_values(['Sepal.Length'])
OutPut:
13	4.3	3.0	1.1	0.1	setosa
42	4.4	3.2	1.3	0.2	setosa
38	4.4	3.0	1.3	0.2	setosa
8	4.4	2.9	1.4	0.2	setosa
41	4.5	2.3	1.3	0.3	setosa
...	...	...	...	...	...
122	7.7	2.8	6.7	2.0	virginica
118	7.7	2.6	6.9	2.3	virginica
117	7.7	3.8	6.7	2.2	virginica
135	7.7	3.0	6.1	2.3	virginica
131	7.9	3.8	6.4	2.0	virginica
150 rows × 5 columns
So it gives you the data frame completely sorted by Sepal.Length in ascending order.
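sort_values can also sort in descending order, or by several columns at once; a sketch with a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 1, 2], 'b': [9, 8, 7]})

desc = df.sort_values('a', ascending=False)   # largest first
print(desc['a'].tolist())                     # [3, 2, 1]

multi = df.sort_values(['a', 'b'])            # sort by a, break ties with b
print(multi['a'].tolist())                    # [1, 2, 3]
```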

Identify duplicate rows:

You can also identify whether there are any duplicated rows in your data frame as follows: data.duplicated() Output:
0      False
1      False
2      False
3      False
4      False
       ...  
145    False
146    False
147    False
148    False
149    False
Length: 150, dtype: bool
So it gives you a boolean result for each row; a True means that row is a duplicate of an earlier one.
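A sketch with a hypothetical frame that actually contains a duplicate, plus drop_duplicates to remove it:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2], 'y': ['a', 'a', 'b']})

print(df.duplicated().tolist())   # [False, True, False]: row 1 repeats row 0
deduped = df.drop_duplicates()    # keeps the first copy of each duplicate group
print(len(deduped))               # 2 rows remain
```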

Grouping Data In Pandas:

Suppose you have a data frame for a cricket league and you want to pull out all the information for a season just by entering the year:
psl_data = {'Team': ['peshawar', 'karachi', 'multan', 'lahore', 'queta',
         'islamabad','peshawar','multan','lahore', 'queta', 'islamabad','peshawar','islamabad'],
         'Ranking': [1,3,2,4,5,6,4,2,5,1,3,2,4],
         'Year': [2016,2017,2016,2018,2017,2019,2020,2017,2021,2020,2019,2021,2016],
         'Points':[412,450,430,489,500,432,432,476,487,544,423,544,534]}
df = pd.DataFrame(psl_data)
So we have created a data frame. Now we want to get the rows for one specific group; in this case we group by year: grouped = df.groupby('Year')
grouped.get_group(2016)
Output: the rows whose Year is 2016.
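The grouped object supports aggregation as well as get_group; a runnable sketch with a shortened, hypothetical version of the league data:

```python
import pandas as pd

# A shortened, hypothetical version of the league data above
df = pd.DataFrame({'Team': ['peshawar', 'karachi', 'multan', 'peshawar'],
                   'Year': [2016, 2017, 2016, 2017],
                   'Points': [412, 450, 430, 432]})

grouped = df.groupby('Year')
print(grouped.get_group(2016))   # only the rows where Year is 2016
print(grouped['Points'].sum())   # total points per year
```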

The shape of Rows and Columns

If you want to know how many rows and columns there are in your data set, you can do that as: data.shape
Output:
(150, 5)
So in this case, it shows that we have 150 rows and 5 columns in our iris data set.

Describe index

If you want to know what the index of your data frame looks like, you can do that as: data.index Output:
RangeIndex(start=0, stop=150, step=1)

Describe the Names of DataFrame Columns

If you want to get the names of all the columns in the data set, you can do that as: data.columns Output:
Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width',
       'Species'],
      dtype='object')
So it gives you the exact names of all the columns in the iris data set.

Information About DataFrame

If you want to know the important information about your data frame, you can do that as follows: data.info() Output:

RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Sepal.Length  150 non-null    float64
 1   Sepal.Width   150 non-null    float64
 2   Petal.Length  150 non-null    float64
 3   Petal.Width   150 non-null    float64
 4   Species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB

Number of Values Without NaN

If you want to know how many non-NaN values there are in your data frame, you can do that as follows: data.count() Output:
Sepal.Length    150
Sepal.Width     150
Petal.Length    150
Petal.Width     150
Species         150
dtype: int64
So it shows you how many values you have in each column. If you add another .sum(), it will give you the total number of values in your data set, i.e.: data.count().sum() Output:
750

Summary About DataFrame

Sum of Values

If you want to get the sum of all columns in your data set, you can do it as follows: data.sum() Output:
Sepal.Length                                                876.5
Sepal.Width                                                 458.6
Petal.Length                                                563.7
Petal.Width                                                 179.9
Species         setosasetosasetosasetosasetosasetosasetosaseto...
dtype: object
So we have got the sum for each column, as you can see above. Note that for the text column Species, "sum" concatenates the strings, which is rarely what you want.
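If you want to skip the text column instead, recent pandas versions accept numeric_only=True; a sketch with a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.0, 2.0], 'name': ['a', 'b']})

# numeric_only=True sums only the numeric columns, skipping text
print(df.sum(numeric_only=True)['x'])   # 3.0
```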

The cumulative sum of values

If you want a cumulative sum, do it as: data.cumsum() Output:
The result is too large to show here; try it in your own notebook.
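On a frame small enough to see, cumsum computes a running total down each column:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Each entry is the sum of itself and everything above it
print(df.cumsum()['a'].tolist())   # [1, 3, 6]
```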

Find Minimum/maximum values

If you want to get the minimum value in each of your columns, do: data.min() Output:
Sepal.Length       4.3
Sepal.Width          2
Petal.Length         1
Petal.Width        0.1
Species         setosa
dtype: object
For the maximum values, use data.max().

Description of your Data

If you want to get summary statistics like the count, mean, standard deviation, minimum, quartiles, and maximum all at once, you can do that as follows: data.describe() Output:
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
count	150.000000	150.000000	150.000000	150.000000
mean	5.843333	3.057333	3.758000	1.199333
std	0.828066	0.435866	1.765298	0.762238
min	4.300000	2.000000	1.000000	0.100000
25%	5.100000	2.800000	1.600000	0.300000
50%	5.800000	3.000000	4.350000	1.300000
75%	6.400000	3.300000	5.100000	1.800000
max	7.900000	4.400000	6.900000	2.500000

Mean of values

If you want to get the mean of all of your numeric columns, you can do that as follows: data.mean(numeric_only=True) (recent pandas versions require numeric_only=True to skip the text Species column) Output:
Sepal.Length    5.843333
Sepal.Width     3.057333
Petal.Length    3.758000
Petal.Width     1.199333
dtype: float64

Median of values

If you want to get the median of all of your numeric columns, you can do that as follows: data.median(numeric_only=True) Output:
Sepal.Length    5.80
Sepal.Width     3.00
Petal.Length    4.35
Petal.Width     1.30
dtype: float64

Asking for help

If you want to get help with anything, you can do that by typing help() and giving, inside the parentheses, the object you want help on, as below: help(data)

Applying Functions

Let's assume you have a function and you want to apply it to all of the values in your data set. You can apply it column by column as follows: f = lambda x: x*2 data.apply(f) Output:
The above lambda function is applied to each column in turn, so every value in our data set ends up doubled.
But if you want to apply a function explicitly element-wise, you can do that as follows: data.applymap(f) (in pandas 2.1 and later this method has been renamed to data.map(f))
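A runnable sketch of apply on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# With apply, x is a whole column (a Series); x * 2 doubles it vectorized
doubled = df.apply(lambda x: x * 2)
print(doubled['a'].tolist())   # [2, 4]
print(doubled['b'].tolist())   # [6, 8]
```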

Find Null Values Information

If you want to find null values (missing values) in your data set, you can do that as follows: data.isnull().sum() Output:
Sepal.Length    0
Sepal.Width     0
Petal.Length    0
Petal.Width     0
Species         0
dtype: int64
In this case, we have no null values in our data set. You can also chain another .sum() to get the total over all columns at once: data.isnull().sum().sum() Output:
0
meaning we have zero null values.

Drop Null Values

In case we have null values in our data set, we can apply the following method to remove every row containing one: data.dropna() We have no null values in this data set, but you can try it with another data set to gain experience.
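Since the iris data has no missing values, here is a sketch with a hypothetical frame that does:

```python
import numpy as np
import pandas as pd

# One deliberately missing value in column a
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

print(df.isnull().sum().sum())   # 1 missing value in total
cleaned = df.dropna()            # drops every row that contains a NaN
print(len(cleaned))              # 2 rows survive
```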

Swap rows and columns:

If you want to turn rows into columns and columns into rows, you can use the transpose method: data.transpose() This will convert your rows into columns and your columns into rows.

Drop a column:

If you want to remove a column from your data set, you can do it as follows: data.drop('Sepal.Length', axis=1) It will drop the Sepal.Length column from your data set. Remember to always define axis=1 here: as you know, we have two axes, axis=0 for rows and axis=1 for columns, as used above. Also note that drop returns a new data frame; assign the result (or pass inplace=True) to keep the change.

Clone a data frame:

If you want to clone, i.e. make a copy of, your original data set, you can do that as follows: new_data = data.copy()

Connect multiple data frames vertically:

If you want to concatenate two data frames, you can do it with the following code: pd.concat([data_1, data_2]) To try it, load a second data set into your notebook and then concat the two together as above.
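A self-contained sketch with two small hypothetical frames; ignore_index=True rebuilds the row index instead of repeating 0, 1, 0, 1:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# Stack df2 below df1 and renumber the index from 0
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked['a'].tolist())   # [1, 2, 3, 4]
```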

Make New Columns

If you want to insert or add a new column into your data set, you can do that as follows. Let's assume our new column doubles Sepal.Length: new_column = data['Sepal.Length']*2 Now add this column to our iris data set: data['New_column'] = new_column
So the new column is added to your iris data frame alongside the existing columns.
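The same pattern on a tiny hypothetical frame, so you can see it end to end:

```python
import pandas as pd

df = pd.DataFrame({'Sepal.Length': [5.1, 4.9]})

# Assign a derived column in one step
df['New_column'] = df['Sepal.Length'] * 2
print(df['New_column'].tolist())
```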

Pivot Table

Pivot

Pivot is used to transform or reshape your data. Example:
df.pivot(index = 'column name as a row', columns = 'Col_Name')
In the above example, the column passed as index supplies the row labels of the reshaped frame, while the values of the columns column become the new column headers.
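As a concrete sketch, assuming hypothetical long-format temperature data with one row per (date, city) pair:

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, city) pair
df = pd.DataFrame({'date': ['d1', 'd1', 'd2', 'd2'],
                   'city': ['NY', 'LA', 'NY', 'LA'],
                   'temp': [30, 70, 32, 68]})

# Each distinct date becomes a row, each distinct city a column
wide = df.pivot(index='date', columns='city', values='temp')
print(wide)
```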

Pivot Table

A pivot table is used to summarize and aggregate data inside the data frame. For example, the following finds the sum for each group in the summarized data frame; you can also apply mean, median, or any other aggregation function:
df.pivot_table(index = 'Date' , columns = 'City', aggfunc = 'sum')
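A runnable sketch with hypothetical sales data; note the repeated (Date, City) pair, which is what pivot_table aggregates and plain pivot would reject:

```python
import pandas as pd

# Hypothetical sales data with a repeated (Date, City) pair
df = pd.DataFrame({'Date': ['d1', 'd1', 'd1', 'd2'],
                   'City': ['NY', 'NY', 'LA', 'LA'],
                   'Sales': [10, 20, 5, 7]})

# The two NY rows on d1 are summed into a single cell
table = df.pivot_table(index='Date', columns='City', aggfunc='sum')
print(table)
```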

Stacking & Unstacking

If your data frame has two or more column levels and you want some of the columns to be shown as rows, you can stack your data frame:
df.stack()
If you want to undo it, use:
df.unstack()
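A minimal round trip on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])

stacked = df.stack()                # column labels move into an inner row index level
print(stacked.loc[('r1', 'a')])     # the value that was at row r1, column a
restored = stacked.unstack()        # move that level back out to columns
print(restored.equals(df))          # unstack reverses stack
```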
