Pandas
Introduction To Pandas
Pandas is one of the core libraries of the Python ecosystem, widely used by data scientists and machine learning engineers for data analysis tasks.
Compared to Excel, pandas handles large data sets with extreme ease. The main uses of the pandas library in a data science project are data manipulation, data cleaning, and preparing data for visualization.
What You Will Learn After Completing This Tutorial:
- how to handle large data sets
- how to clean many kinds of data sets
- how to use pandas in any of your data science projects
- how to perform data manipulation tasks with pandas
Installation Of Pandas
To install pandas, open your command prompt (cmd) and type:
pip install pandas
If you have already installed the Anaconda distribution on your system, you don't need to install pandas separately; it comes bundled with Anaconda.
IDE used in this tutorial
We use Jupyter Notebook in this tutorial, and I recommend you always use it so you have the best experience. Jupyter Notebook comes with the Anaconda package; if you don't know how to install Anaconda, click here to install it on your system, then come back to this tutorial. After you have successfully installed Anaconda, go to the search bar in Windows, type "jupyter", and click the result to open Jupyter Notebook in your default browser. I hope you have now successfully opened Jupyter Notebook on your system, so let's explore pandas.
Import pandas
Whenever we use pandas in a project, the first thing we should do is import it into our notebook, as below:
import pandas
But if you want to use pandas under a short alias, you can do that as follows:
import pandas as pd
Create a Series
If you want to create a Series with pandas, do as follows:
s = pd.Series(['element1', 'element2', 'element3'],
              index=[0, 1, 2],
              name='column_1')
print(s)
Output:
0    element1
1    element2
2    element3
Name: column_1, dtype: object
So the above code creates one column with a unique index.
Create A Data Frame
If you want to create a data frame with multiple rows and columns, you can do as follows. We create a simple data frame:
data = [[2,3,4,5],[5,6,7,4],[7,6,5,4],[4,5,6,7]]
data_frame = pd.DataFrame(data,
index = [0,1,2,3],
columns=['col_1','col_2','col_3','col_4'])
Output:
   col_1  col_2  col_3  col_4
0      2      3      4      5
1      5      6      7      4
2      7      6      5      4
3      4      5      6      7
In the above code, we have created a data frame with 4 columns and an index. Now let's create another data frame from a dictionary: we will create a small data frame which includes student data, as follows:
student_id = [1,2,3]
student_name = ['nomi', 'rizu', 'sajnu']
marks = [444, 467,390]
data_frame = pd.DataFrame({
"student_id": student_id,
"student_name": student_name,
"marks": marks
})
print(data_frame)
Output:
   student_id student_name  marks
0           1         nomi    444
1           2         rizu    467
2           3        sajnu    390
So the above code shows the small data frame we have created. Try some data frames yourself so you get comfortable with the syntax.
Data Set
We will use an already prepared data set in this tutorial for the best experience with pandas. First we will download the iris data set; you can download it from here. This iris data set is a CSV file, so follow along to see how to bring it into our notebook.
Read CSV File
As you know, most of the time we have data in CSV form, and most data sets for data science projects are CSVs, so let's see how we can open or read a CSV file in our Jupyter notebook. There are many ways to do it; one of them is as follows:
data = pd.read_csv(r'data url \ data_name.csv')
The above is just the general structure. To load a CSV, call read_csv and, inside the parentheses, make sure you either prefix the path with r (a raw string) or use double backslashes, then specify the file name ending in .csv, like below:
data = pd.read_csv(r'D:\online courses\Data Sets\iris.csv')
I have given my system's path, so give the path according to where your data is. If you are still stuck, you can watch the video tutorial and follow along with me.
Selecting rows and columns
Show first 5 rows
data.head()
Output: it will show you the first five rows of your data. But if you want 2 or 3 rows, or any number of your choice, you can pass that number between the parentheses, i.e.:
data.head(3)
It will give you the first three rows of your data set.
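If you don't have the iris file handy yet, here is a minimal, self-contained sketch of head() using a small made-up frame (the column name x is just an example):

```python
import pandas as pd

# A small made-up frame to try head() on
df = pd.DataFrame({"x": range(10)})

print(df.head())    # first five rows by default
print(df.head(3))   # first three rows
```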
Show last 10 rows
If you want to check the last 10 rows of your data set, you can do that as follows:
data.tail(10)
Output:
Select single column:
data['Sepal.Length']
Output:
0      5.1
1      4.9
2      4.7
3      4.6
4      5.0
      ...
145    6.7
146    6.3
147    6.5
148    6.2
149    5.9
Name: Sepal.Length, Length: 150, dtype: float64
So we have successfully grabbed the Sepal.Length column.
Select multiple columns
You can also select multiple columns as follows:
data[['Sepal.Length','Sepal.Width']]
Output:
     Sepal.Length  Sepal.Width
0             5.1          3.5
1             4.9          3.0
2             4.7          3.2
3             4.6          3.1
4             5.0          3.6
..            ...          ...
145           6.7          3.0
146           6.3          2.5
147           6.5          3.0
148           6.2          3.4
149           5.9          3.0

150 rows × 2 columns
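One detail worth knowing here: single brackets return a Series, while double brackets return a DataFrame. A small sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

single = df["a"]           # single brackets  -> Series
multiple = df[["a", "b"]]  # double brackets  -> DataFrame

print(type(single).__name__)    # Series
print(type(multiple).__name__)  # DataFrame
```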
Select rows by index values:
If you want to select rows by their index label, you can use loc[] to do that. In the following example we grab the first row, at index 0:
data.loc[0]
Output:
Sepal.Length       5.1
Sepal.Width        3.5
Petal.Length       1.4
Petal.Width        0.2
Species         setosa
Name: 0, dtype: object
You can also access many rows at once, like:
data.loc[[0,3]]
Select rows by position:
You can also select rows by their position. Note that data.loc[1:2] is label-based and includes both end labels, so it returns the rows labelled 1 and 2; for purely positional slicing, use data.iloc[1:3], which selects the same rows here because the labels match the positions:
data.loc[1:2]
This will give you the second and third rows.
Output:
   Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
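To make the loc/iloc difference concrete, here is a small self-contained sketch (the frame and column name v are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"v": [10, 20, 30, 40]})

# loc is label-based and INCLUDES the end label
print(df.loc[1:2])   # rows labelled 1 and 2

# iloc is position-based and EXCLUDES the end position
print(df.iloc[1:3])  # rows at positions 1 and 2 -- the same rows here
```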
Data wrangling
Filter by value:
If you want to grab rows using comparison operators, you can filter by value. In this case I will grab all rows whose value in the first column is greater than 4.6:
data[data['Sepal.Length']>4.6]
Output:
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
0             5.1          3.5           1.4          0.2     setosa
1             4.9          3.0           1.4          0.2     setosa
2             4.7          3.2           1.3          0.2     setosa
4             5.0          3.6           1.4          0.2     setosa
5             5.4          3.9           1.7          0.4     setosa
..            ...          ...           ...          ...        ...
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica

141 rows × 5 columns
So it gives me the rows in which Sepal.Length is greater than 4.6. Check the full output in your own notebook, and practice filtering on more columns yourself.
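If you want to combine several conditions, use & and | with each condition wrapped in parentheses. A minimal sketch with made-up columns standing in for the iris ones:

```python
import pandas as pd

# Made-up columns standing in for Sepal.Length / Sepal.Width
df = pd.DataFrame({"length": [4.5, 5.1, 6.3], "width": [3.0, 3.5, 2.5]})

long_rows = df[df["length"] > 4.6]                              # single condition
long_and_wide = df[(df["length"] > 4.6) & (df["width"] > 3.0)]  # combined with &
print(long_and_wide)
```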
Sort by one column
If you want to sort your data set by a column, you can do that as follows:
data.sort_values(['Sepal.Length'])
Output:
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width    Species
13            4.3          3.0           1.1          0.1     setosa
42            4.4          3.2           1.3          0.2     setosa
38            4.4          3.0           1.3          0.2     setosa
8             4.4          2.9           1.4          0.2     setosa
41            4.5          2.3           1.3          0.3     setosa
..            ...          ...           ...          ...        ...
122           7.7          2.8           6.7          2.0  virginica
118           7.7          2.6           6.9          2.3  virginica
117           7.7          3.8           6.7          2.2  virginica
135           7.7          3.0           6.1          2.3  virginica
131           7.9          3.8           6.4          2.0  virginica

150 rows × 5 columns
So it gives you the data frame sorted by that column in ascending order.
Identify duplicate rows:
You can also identify whether there are any duplicated rows in your data as follows:
data.duplicated()
Output:
0      False
1      False
2      False
3      False
4      False
       ...
145    False
146    False
147    False
148    False
149    False
Length: 150, dtype: bool
So it gives you a boolean result; a True means you have a duplicated row.
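To see duplicated() actually flag something, and to remove the flagged rows with drop_duplicates(), here is a small sketch with a made-up frame that contains one repeated row:

```python
import pandas as pd

# Made-up frame where row 2 repeats row 1
df = pd.DataFrame({"a": [1, 2, 2, 3], "b": ["x", "y", "y", "z"]})

print(df.duplicated())       # True only for the repeated row
print(df.drop_duplicates())  # the repeated row is removed
```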
Grouping Data In Pandas:
Suppose you have a data frame for a cricket league and you want to get all the information related to a given year just by entering that year:
psl_data = {'Team': ['peshawar', 'karachi', 'multan', 'lahore', 'queta',
                     'islamabad','peshawar','multan','lahore', 'queta', 'islamabad','peshawar','islamabad'],
'Ranking': [1,3,2,4,5,6,4,2,5,1,3,2,4],
'Year': [2016,2017,2016,2018,2017,2019,2020,2017,2021,2020,2019,2021,2016],
'Points':[412,450,430,489,500,432,432,476,487,544,423,544,534]}
df = pd.DataFrame(psl_data)
So we have created a data frame. Now we want to get information for a specific group; in this case we will use the year as our group:
grouped = df.groupby('Year')
grouped.get_group(2016)
Output:
         Team  Ranking  Year  Points
0    peshawar        1  2016     412
2      multan        2  2016     430
12  islamabad        4  2016     534
The Shape of Rows and Columns
If you want to know how many rows and columns there are in your data set, you can do that as:
data.shape
Output:
(150, 5)
So in this case it shows that we have 150 rows and 5 columns in our iris data set.
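Coming back to grouping for a moment: the whole workflow can be condensed into a runnable sketch. This uses a shortened, made-up version of the psl_data frame and also shows an aggregation per group:

```python
import pandas as pd

# Shortened, made-up version of the psl_data frame above
df = pd.DataFrame({
    "Team": ["peshawar", "multan", "islamabad", "karachi"],
    "Year": [2016, 2016, 2016, 2017],
    "Points": [412, 430, 534, 450],
})

grouped = df.groupby("Year")
print(grouped.get_group(2016))    # only the 2016 rows

totals = grouped["Points"].sum()  # total points per year
print(totals)
```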
Describe index
If you want to know what the index of your data frame looks like, you can do that as:
data.index
Output:
RangeIndex(start=0, stop=150, step=1)
Describe Name's Of DataFrame columns
If you want to get the names of all the columns in the data set, you can do that as:
data.columns
Output:
Index(['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'], dtype='object')
So it gives you the exact names of all the columns in the iris data set.
Information About DataFrame
If you want to know the important information about your data frame, you can do that as follows:
data.info()
Output:
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Sepal.Length  150 non-null    float64
 1   Sepal.Width   150 non-null    float64
 2   Petal.Length  150 non-null    float64
 3   Petal.Width   150 non-null    float64
 4   Species       150 non-null    object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
Number of Values Without NAN
If you want to know how many non-NaN values there are in your data frame, you can do that as follows:
data.count()
Output:
Sepal.Length    150
Sepal.Width     150
Petal.Length    150
Petal.Width     150
Species         150
dtype: int64
So it shows you how many values you have in each column. If you add .sum(), it will show you the total number of values in your data set, i.e.:
data.count().sum()
Output:
750
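To see count() actually skip missing values, here is a small sketch with a made-up frame containing one NaN:

```python
import pandas as pd
import numpy as np

# Made-up frame with a single missing value in column a
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

print(df.count())        # a: 2, b: 3 -- the NaN is not counted
print(df.count().sum())  # 5 non-NaN values in the whole frame
```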
Summary About DataFrame
Sum OF Values
If you want to get the sum of all columns in your data set, you can do as follows:
data.sum()
Output:
Sepal.Length                                            876.5
Sepal.Width                                             458.6
Petal.Length                                            563.7
Petal.Width                                             179.9
Species         setosasetosasetosasetosasetosasetosaseto...
dtype: object
So we have the sum of each column, as you can see above. Note that for the text column Species, "summing" simply concatenates the strings.
The cumulative sum of values
If you want to get a cumulative sum, then do:
data.cumsum()
Output:
The result is too large to show here; try it in your own notebook.
Find Minimum/maximum values
If you want to get the minimum value in each of your columns, then do:
data.min()
Output:
Sepal.Length       4.3
Sepal.Width          2
Petal.Length         1
Petal.Width        0.1
Species         setosa
dtype: object
For the maximum values:
data.max()
Description of your Data
If you want to get summary statistics like the count, mean, standard deviation, and quartiles (the 50% row is the median) all at once, you can do that as follows:
data.describe()
OutPut:
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
Mean of values
If you want to get the mean of all of your numeric columns, you can do that as follows (on recent pandas versions you may need data.mean(numeric_only=True), since the Species column is text):
data.mean()
Output:
Sepal.Length    5.843333
Sepal.Width     3.057333
Petal.Length    3.758000
Petal.Width     1.199333
dtype: float64
Median of values
If you want to get the median of all of your numeric columns, you can do that as follows:
data.median()
Output:
Sepal.Length    5.80
Sepal.Width     3.00
Petal.Length    4.35
Petal.Width     1.30
dtype: float64
Asking for help
If you want help with anything, you can type help() and, inside the parentheses, give the variable name you want help for, as below:
help(data)
Applying Functions
Let's assume you have a function and want to apply it to all the values in your data set. You can do that for the whole data frame as follows:
f = lambda x: x*2
data.apply(f)
Output:
The above lambda function will be applied to every value in our data set (apply() actually passes each column as a whole Series, which is then multiplied by 2). But if you want to apply the function explicitly element-wise, you can do that as follows:
f = lambda x: x*2
data.applymap(f)
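One caveat worth knowing: applymap() is deprecated in pandas 2.1+ in favour of DataFrame.map(). A small self-contained sketch that works on either version (the frame is made up):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# apply() works column by column: each x it receives is a whole Series
doubled = df.apply(lambda x: x * 2)

# element-wise application: applymap() historically; pandas >= 2.1
# renames it to DataFrame.map(), so we pick whichever exists
if hasattr(pd.DataFrame, "map"):
    doubled_elementwise = df.map(lambda x: x * 2)
else:
    doubled_elementwise = df.applymap(lambda x: x * 2)

print(doubled.equals(doubled_elementwise))  # True for this numeric frame
```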
Find Null Values Information
If you want to find null values (empty values) in your data set, you can do that as follows:
data.isnull().sum()
Output:
Sepal.Length    0
Sepal.Width     0
Petal.Length    0
Petal.Width     0
Species         0
dtype: int64
In this case we have no null values in our data set. You can also chain another .sum() to get the total over all columns at once:
data.isnull().sum().sum()
Output:
0
This means we have zero null values.
Drop Null Values
In case we have null values in our data set, we can apply the following method to remove all rows with null values at once:
data.dropna()
We have no null values in this data set, but you can try it on another data set to gain experience. Note that dropna() returns a new data frame; the original stays unchanged unless you reassign it.
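To see dropna() actually remove something, here is a small sketch with a made-up frame containing missing values:

```python
import pandas as pd
import numpy as np

# Made-up frame with one NaN in each column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

print(df.isnull().sum())  # NaN count per column
clean = df.dropna()       # drops every row containing a NaN
print(clean)              # only the first row survives
```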
Swap rows and columns:
If you want to change rows into columns and columns into rows, you can do so using the transpose method:
data.transpose()
This will convert your rows into columns and columns into rows
Drop a column:
If you want to remove a column from your data set, you can do as follows:
data.drop('Sepal.Length', axis = 1)
It will drop the Sepal.Length column from your data set. Remember to always specify axis=1 for columns; as you know, we have two axes: axis=0 for rows and axis=1 for columns, as used above.
Clone a data frame:
If you want to clone or make a copy of your original data set, you can do that as follows:
new_data = data.copy()
Connect multiple data frames vertically:
If you want to concatenate two data frames vertically, you can do that with the following code:
pd.concat([data_1,data_2])
To try it, load a second data set into your notebook and then concatenate the two together as above.
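A minimal sketch of vertical concatenation with two made-up frames; note that concat keeps the original indexes unless you pass ignore_index=True:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2]})
df2 = pd.DataFrame({"a": [3, 4]})

stacked = pd.concat([df1, df2])                        # keeps indexes: 0, 1, 0, 1
renumbered = pd.concat([df1, df2], ignore_index=True)  # fresh index: 0..3
print(renumbered)
```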
Make New Columns
new_column = data['Sepal.Length']*2
data['New_column'] = new_column
The new column New_column will be added to your iris data set alongside the existing columns.
Pivot Table
Pivot
pivot is used to transform or reshape your data. Example:
df.pivot(index = 'column name as a row', columns = 'Col_Name')
In the above example, index is set to the column whose values will become the row labels, while columns names the column whose values will become the new column headers.
Pivot Table
A pivot table is used to summarize and aggregate data inside the data frame. For example, this will compute the sum for your summarized data frame; you can also apply mean, median, or any other aggregation function:
df.pivot_table(index = 'Date' , columns = 'City', aggfunc = 'sum')
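A runnable sketch of pivot_table; the Date, City, and Sales columns here are made-up example data, not part of the iris set:

```python
import pandas as pd

# Made-up sales data; 'Date' and 'City' match the column names used above
df = pd.DataFrame({
    "Date": ["2021-01", "2021-01", "2021-02", "2021-02"],
    "City": ["lahore", "karachi", "lahore", "karachi"],
    "Sales": [100, 200, 150, 250],
})

# One row per Date, one column per City, cells hold the summed Sales
table = df.pivot_table(index="Date", columns="City", aggfunc="sum")
print(table)
```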
Stacking & Unstacking
If you have two or more header levels and want some of the columns to be shown as rows, i.e. you want to stack your data frame, you can do that as:
df.stack()
If you want to unstack it again, use:
df.unstack()
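A small round-trip sketch of stack() and unstack() with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

stacked = df.stack()          # column labels move into a second index level
print(stacked)

restored = stacked.unstack()  # moves them back out; the original shape returns
print(restored.equals(df))    # True
```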