If you are an aspiring data scientist or interested in data analysis, you need to know how to work with Pandas. So in this Python Pandas tutorial, I will break down the basics and show you how to work with Series and DataFrames.
Moreover, I will show you how to use Pandas to read, clean, transform and store data.
What is Pandas?
Pandas is an open-source Python library for data analysis and manipulation. Wes McKinney started developing it in 2008, and it is built on top of NumPy.
Read: NumPy Tutorial for Beginners – Arrays
Some of the key features of Pandas include:
- Provides fast performance to process large data sets.
- Able to load data from a wide variety of sources.
- Offers data structures and operations to manipulate large data sets.
We will learn about other important features as we go along.
Installing Pandas
Before we continue with our Pandas tutorial, I am assuming that you have basic Python knowledge.
Read: The Ultimate Python Cheat Sheet – An Essential Reference for Python Developers
To install Pandas, I recommend that you download the Anaconda distribution of Python.
This distribution is an open-source data science platform that comes with Pandas & other scientific libraries.
If you do not want to use Anaconda, you can also install Pandas from your terminal using the commands below. The first command is optional, because pip installs NumPy automatically as a dependency of Pandas:
pip install numpy
pip install pandas
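Once the installation finishes, a quick way to confirm that everything works is to import both libraries and print their versions. This is only a minimal check; the version numbers you see will depend on your environment.

```python
import numpy as np
import pandas as pd

# Print the installed versions to confirm both libraries import correctly.
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)
```

If either import fails with a ModuleNotFoundError, the package is not installed in the Python environment you are running.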
Other than that, Google has a free Jupyter notebook platform known as Google Colab that already provides Pandas and other Python data science libraries. You can use that as well.
Series
In this part of our Pandas tutorial, I will talk about Series.
A Pandas Series is a one-dimensional labeled array that can hold any data type, such as integers, floats, strings, and Python objects.
In short, you can think of a Series as a single column in an Excel sheet. A Series represents a single column of data held in memory.
Let’s see how we can create a Series.
First, we have to import Pandas using the import keyword.
import pandas as pd
Now, type the following to create a Series:
mySeries = pd.Series([0,1,2,3,4,5,10])
print(mySeries)
Output:
0 0
1 1
2 2
3 3
4 4
5 5
6 10
dtype: int64
Also, you can give your Series a name.
mySeries = pd.Series([0,1,2,3,4,5,10], name='Num')
print(mySeries)
Output:
0 0
1 1
2 2
3 3
4 4
5 5
6 10
Name: Num, dtype: int64
Since Pandas works on top of Numpy, the Series contains a Numpy array within itself.
Related: NumPy Array Indexing & Slicing Explained
And we can extract the NumPy array using the values attribute.
mySeries.values
Output:
array([ 0, 1, 2, 3, 4, 5, 10])
As you can see, the output is a NumPy array.
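As a small sketch of the point above: the same array is also available through the to_numpy() method, which the Pandas documentation recommends over values. The variable names fromValues and fromToNumpy are only illustrative.

```python
import numpy as np
import pandas as pd

mySeries = pd.Series([0, 1, 2, 3, 4, 5, 10], name='Num')

# Both calls hand back the Series data as a NumPy array.
fromValues = mySeries.values
fromToNumpy = mySeries.to_numpy()

print(type(fromValues))

# NumPy methods work on the extracted array.
print(fromToNumpy.sum())
```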
A Series also uses index positions.
mySeries.index
Output:
RangeIndex(start=0, stop=7, step=1)
To extract values at certain index positions, I can type:
print(mySeries[6])
print(mySeries[4])
The values at index positions 6 & 4:
10
4
Like other Python sequences, a Series can be sliced. For instance, I want to extract the values at index positions 5 and 6. For that, we slice from 5 up to, but not including, 7:
mySeries[5:7]
Output:
5 5
6 10
Name: Num, dtype: int64
Remember that the column on the left holds the index values or positions.
We can also assign names to the indexes when we create a Series.
So let’s create a new Series with student scores. Besides, I will also pass the index parameter.
scores = pd.Series([70,90,80,100,95,85], index=['Sam','Andrea','Marcos','Peng','Karen','Chen'])
print(scores)
Output:
Sam 70
Andrea 90
Marcos 80
Peng 100
Karen 95
Chen 85
dtype: int64
I passed an index parameter, and this parameter is a list object. Furthermore, this list holds all of our index names or labels.
The left column is the index column or, you can refer to it as the label column. On the right, we have all our values.
If you wish to extract all the index labels, then type:
scores.index
Output:
Index(['Sam', 'Andrea', 'Marcos', 'Peng', 'Karen', 'Chen'], dtype='object')
According to the dtype, we have a collection of Python objects.
Indexing & Slicing Series
I did touch a little bit on indexing & slicing Series objects previously. However, let’s try out some more examples to see how they really work.
Let’s perform some indexing and slicing operations using numbers on the Series scores.
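For example, I want to check the score of Karen at index position 4. Older Pandas versions accepted scores[4] here, but position-based lookup with plain square brackets on a labeled Series is deprecated in recent releases, so I use the iloc indexer (covered in more detail below):
scores.iloc[4]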
Output:
95
Furthermore, we can slice the Series using numbers as well. Let’s say I want the scores of Andrea, Marcos, Peng & Karen.
Therefore, I have to slice from index position 1 to 5:
scores[1:5]
Output:
Andrea 90
Marcos 80
Peng 100
Karen 95
dtype: int64
When we slice by position, the ending number is exclusive: Python slices up to that position but does not include it.
For example, here, we are slicing up to index position 5, so the value at position 5 (Chen) is left out of the result.
You can also slice a Series using the index labels.
For example:
scores['Sam':'Peng']
Output:
Sam 70
Andrea 90
Marcos 80
Peng 100
dtype: int64
Notice how slicing with index labels is different compared to the way we slice using numbers.
In contrast to using numerical indexes, you can see that the value of the ending index label (Peng) is in our result.
iloc & loc
Pandas has a built-in indexer known as iloc that extracts data using the integer-based index.
So, let’s see how we can use iloc to extract information from scores:
scores.iloc[0:2]
Output:
Sam 70
Andrea 90
dtype: int64
Again, iloc tells Pandas to extract data using the integer-based index.
Similarly, we also have the loc indexer at our disposal to extract data using a label-based index.
This is how it works:
scores.loc['Sam':'Peng']
Output:
Sam 70
Andrea 90
Marcos 80
Peng 100
dtype: int64
You will learn more about iloc & loc when we get to the DataFrame section. But for now, let’s keep it simple.
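To make the difference between the two indexers concrete, here is a minimal sketch using the same scores Series. It assumes nothing beyond the data shown earlier; byPosition and byLabel are illustrative names.

```python
import pandas as pd

scores = pd.Series([70, 90, 80, 100, 95, 85],
                   index=['Sam', 'Andrea', 'Marcos', 'Peng', 'Karen', 'Chen'])

# iloc slices by position: start at 1, stop before 5 (end excluded).
byPosition = scores.iloc[1:5]

# loc slices by label: both 'Andrea' and 'Karen' are included.
byLabel = scores.loc['Andrea':'Karen']

# Both slices select the same four students.
print(byPosition.equals(byLabel))

# Single values: position 4 and label 'Karen' point at the same entry.
print(scores.iloc[4], scores.loc['Karen'])
```

Being explicit with iloc or loc avoids any ambiguity about whether a number means a position or a label.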
Before I go to the DataFrame section of this Pandas tutorial, I would like to show you two more ways to create Series.
You can also create a Series from a NumPy array. First, import NumPy and Pandas:
import numpy as np
import pandas as pd
Then create a Numpy array:
data = np.array(['X','Y','Z'])
To convert data into a Pandas Series, I can type:
mySeries = pd.Series(data)
print(mySeries)
Output:
0 X
1 Y
2 Z
dtype: object
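As with a list, you can pass an index argument when building a Series from a NumPy array. The sketch below uses made-up labels ('first', 'second', 'third') purely for illustration.

```python
import numpy as np
import pandas as pd

data = np.array(['X', 'Y', 'Z'])

# Supply custom labels instead of the default 0, 1, 2 positions.
labeled = pd.Series(data, index=['first', 'second', 'third'])

print(labeled)

# Look up a value by its label.
print(labeled.loc['second'])
```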
Series from Dictionaries
We can also create Pandas Series from Python dictionaries.
So let’s start by creating a dictionary:
dictScores = {'Jacob':98,'Mae':70,'Kayla':95}
Now pass dictScores into the pd.Series() constructor:
mySeries = pd.Series(dictScores)
print(mySeries)
Output:
Jacob 98
Mae 70
Kayla 95
dtype: int64
When creating a Series with a dictionary, you should remember that the dictionary keys are index labels.
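Because the keys act as labels, you can also pass an index argument together with a dictionary to choose which keys to keep and in what order. The following is a small sketch; the name 'Noah' is a hypothetical label that is deliberately absent from the dictionary.

```python
import pandas as pd

dictScores = {'Jacob': 98, 'Mae': 70, 'Kayla': 95}

# The index argument selects and orders keys. 'Noah' is not in the
# dictionary, so Pandas fills that entry with NaN (a missing value).
picked = pd.Series(dictScores, index=['Kayla', 'Jacob', 'Noah'])

print(picked)

# isna() marks which entries are missing.
print(picked.isna())
```

Note that the presence of NaN turns the values into floats, since NaN is a floating-point value.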
Well, that’s it for Series. In the next part of the Pandas tutorial, we will talk about DataFrames.
DataFrames
Great job if you have made it through Series. In this part of the Pandas tutorial, we will learn about DataFrames and the multiple ways we can create them. Not only that, but you will also learn how to select, add, reorder and delete columns and rows on a DataFrame.
Now, what is a DataFrame in Pandas?
A DataFrame is a two-dimensional data structure with labeled rows and columns. The rows and the columns are its two labeled axes.
Each column of a DataFrame is a Series object. In short, one of the ways you can create a DataFrame is by combining two or more Series.
Coming back to the axes, a DataFrame has two of them: axis 0 & axis 1.
In simple terms, axis 0 runs along the rows and axis 1 runs along the columns. Sometimes I like to think of a Pandas DataFrame as an Excel sheet.
Anyway, let’s get our hands dirty with some coding.
First, I will show you how to create DataFrames using a dictionary.
Go ahead and import the following libraries:
import numpy as np
import pandas as pd
Next, create a dictionary object called scores:
scores = {
'Name':['Jake','Chan','Alex'],
'Age': [15,16,17],
'Grade': ['C','B','A']
}
The keys of our dictionary will be the column names of our DataFrame. And the values of the dictionary will be the list of items under those columns.
Now convert scores into a DataFrame:
df = pd.DataFrame(scores)
print(df)
Output:
Name Age Grade
0 Jake 15 C
1 Chan 16 B
2 Alex 17 A
As you see, we got ourselves a nice-looking DataFrame.
And this DataFrame looks similar to an excel sheet.
Also, notice that the keys of our dictionary became the column names. Then the values are the list of items under the columns.
On the far left side of the DataFrame, we have the index column. Each index value represents a row.
If you want to check the column names, then you can type:
df.columns
Output:
Index(['Name', 'Age', 'Grade'], dtype='object')
Here, dtype='object' describes the column labels, which are Python strings.
By the way, each row has a label in the index of our DataFrame. So, to see the row index, I can type:
df.index
Output:
RangeIndex(start=0, stop=3, step=1)
The output shows that our index starts at 0 and stops at 3 with a default step of 1. As with Python’s range, the stop value is excluded, so the row labels are 0, 1 and 2.
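Coming back to the two axes described earlier, the short sketch below inspects the same DataFrame. It only uses the shape and dtypes attributes and the mean() method; nothing here is specific to this data set.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jake', 'Chan', 'Alex'],
    'Age': [15, 16, 17],
    'Grade': ['C', 'B', 'A'],
})

# shape is (number of rows, number of columns), i.e. (axis 0, axis 1).
print(df.shape)

# dtypes lists the data type of each column.
print(df.dtypes)

# Aggregating along axis 0 walks down the rows, giving one result per column.
print(df[['Age']].mean(axis=0))
```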
DataFrames from Series
So far, we have learned how to create a Pandas DataFrame using a Python dictionary.
Now, I will show you how to create DataFrames using Series.
Step 1 – Create Dictionaries
We will start by creating two dictionaries, with each having yearly scores of students.
So the first one holds student score information from the year 2020. Then the second tells us their score from 2021.
scores2020 = {'Jake':90,'Kayla':85,'Muhammad':95,'Alexis':98}
scores2021 = {'Jake':85,'Kayla':95,'Muhammad':90,'Alexis':97}
Step 2 – Convert the Dictionaries into Series
We have created two dictionaries, and their keys will become the index labels. The next step is to convert them into Series objects using pd.Series().
seriesOne = pd.Series(scores2020)
seriesTwo = pd.Series(scores2021)
Step 3 – Create a New Dictionary Using the Series Objects
In this step, we will create a new dictionary that we will pass into pd.DataFrame() later.
Let’s create newDict first:
newDict = {'2020': seriesOne,'2021': seriesTwo}
The keys of newDict are the column names. And the values (Series objects) are the data each column holds.
Step 4 – Convert the New Dictionary into a DataFrame
Lastly, we will create the DataFrame by converting newDict.
For this to work, I have to call pd.DataFrame() and pass in newDict.
Here’s what I mean:
yearlyScores = pd.DataFrame(newDict)
print(yearlyScores)
Output:
2020 2021
Jake 90 85
Kayla 85 95
Muhammad 95 90
Alexis 98 97
Take a look at what’s going on here. I called pd.DataFrame() and passed in newDict. Its keys became the column names, and the index labels of the Series (the student names) became the row labels.
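The four steps above can also be condensed into a single expression. This is just a more compact sketch of the same idea, not a different technique: the dictionaries are wrapped in pd.Series() inline instead of being stored in separate variables.

```python
import pandas as pd

# Build each column as a Series and hand the whole dictionary to DataFrame.
yearlyScores = pd.DataFrame({
    '2020': pd.Series({'Jake': 90, 'Kayla': 85, 'Muhammad': 95, 'Alexis': 98}),
    '2021': pd.Series({'Jake': 85, 'Kayla': 95, 'Muhammad': 90, 'Alexis': 97}),
})

print(yearlyScores)
```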
Working with Columns
As a data scientist, you will work with Pandas a lot. As a result, you also need to work with columns to make changes or retrieve specific information.
So, let me show you the ways you can work with the columns of a DataFrame.
We will use the yearlyScores DataFrame that we just created.
To select a column, type this:
yearlyScores['2020']
Output:
Jake 90
Kayla 85
Muhammad 95
Alexis 98
Name: 2020, dtype: int64
Similarly, we can also select the year 2021 using the same process:
yearlyScores['2021']
Output:
Jake 85
Kayla 95
Muhammad 90
Alexis 97
Name: 2021, dtype: int64
You can also see that we have the index column on the left. So, no matter what column you select, the output will always have an index column as a default.
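Selecting one column returns a Series, while passing a list of column names returns a DataFrame. Here is a minimal sketch of the difference; the DataFrame is rebuilt from plain lists so the block runs on its own, and oneColumn and twoColumns are illustrative names.

```python
import pandas as pd

yearlyScores = pd.DataFrame(
    {'2020': [90, 85, 95, 98], '2021': [85, 95, 90, 97]},
    index=['Jake', 'Kayla', 'Muhammad', 'Alexis'],
)

# A single column name in brackets gives back a Series.
oneColumn = yearlyScores['2020']

# A list of column names gives back a DataFrame, in the order listed.
twoColumns = yearlyScores[['2021', '2020']]

print(type(oneColumn).__name__)
print(type(twoColumns).__name__)
```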
It’s time to see how we can create a new column and add it to an existing DataFrame.
I will first create a new Pandas Series, which is technically a column, and then add it to the DataFrame yearlyScores.
Creating a Series from a dictionary:
scores2019 = {'Jake':78,'Kayla':90,'Muhammad':96,'Alexis':100}
seriesThree = pd.Series(scores2019)
seriesThree
Output:
Jake 78
Kayla 90
Muhammad 96
Alexis 100
dtype: int64
After that, I can add this Series to our DataFrame as a new column:
yearlyScores['2019'] = seriesThree
print(yearlyScores)
So we have our usual DataFrame, and I have used the square brackets to name our new column '2019'. Then I have assigned seriesThree to yearlyScores['2019']. As a result, it adds a new column to the DataFrame.
Output:
2020 2021 2019
Jake 90 85 78
Kayla 85 95 90
Muhammad 95 90 96
Alexis 98 97 100
But before we move further, we have a problem with our DataFrame: we want column 2019 to come first instead of last.
We have to reorder the columns. To do that, I can type:
yearlyScores = yearlyScores[['2019','2020','2021']]
Let’s print it out:
print(yearlyScores)
Output:
2019 2020 2021
Jake 78 90 85
Kayla 90 85 95
Muhammad 96 95 90
Alexis 100 98 97
The order of the columns is now changed.
A useful aspect of Pandas is that when we add a Series to a DataFrame as a column, it automatically aligns the values by index label rather than by position.
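The index matching can be seen more clearly with a Series whose labels are in a different order and incomplete. In this sketch, scores2018 is made-up data used only for illustration, and 'Alexis' is left out on purpose.

```python
import pandas as pd

yearlyScores = pd.DataFrame(
    {'2020': [90, 85, 95, 98], '2021': [85, 95, 90, 97]},
    index=['Jake', 'Kayla', 'Muhammad', 'Alexis'],
)

# Hypothetical scores: labels are shuffled and 'Alexis' is missing.
scores2018 = pd.Series({'Muhammad': 88, 'Jake': 70, 'Kayla': 91})

# Pandas matches each value to its row by label, not by position.
yearlyScores['2018'] = scores2018

print(yearlyScores)
```

Each value lands in the row with the matching name, and the row without a match (Alexis) receives NaN.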
You can also delete columns. For instance, I want to delete the column 2019 from our DataFrame. To do that, I can use the del keyword:
del yearlyScores['2019']
Print out yearlyScores:
print(yearlyScores)
Output:
2020 2021
Jake 90 85
Kayla 85 95
Muhammad 95 90
Alexis 98 97
You can see that column 2019 is now gone.
Often you may also need to delete multiple columns. We can do that using the drop() method.
Here’s how it works:
yearlyScores.drop(['2020','2021'], axis=1,inplace=True)
So, I want to delete the columns 2020 & 2021. That’s why I passed them as a list. Since we are dealing with columns, I also set axis=1. Lastly, I set inplace=True, which means the change is applied directly to yearlyScores and nothing is returned. If it is False, then the drop() method leaves yearlyScores untouched and returns a new DataFrame with the columns removed.
Also, note that inplace is False by default.
Let’s print it out:
print(yearlyScores)
Output:
Empty DataFrame
Columns: []
Index: [Jake, Kayla, Muhammad, Alexis]
All of our columns are now gone. Pandas tells us that it is an empty DataFrame.
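If you would rather keep the original DataFrame intact, you can leave inplace at its default and store the returned result instead. The sketch below also uses the columns keyword of drop(), which is an alternative to passing axis=1; only2021 is an illustrative name.

```python
import pandas as pd

yearlyScores = pd.DataFrame(
    {'2020': [90, 85, 95, 98], '2021': [85, 95, 90, 97]},
    index=['Jake', 'Kayla', 'Muhammad', 'Alexis'],
)

# Without inplace=True, drop() returns a new DataFrame.
only2021 = yearlyScores.drop(columns=['2020'])

print(only2021.columns.tolist())

# The original still has both columns.
print(yearlyScores.columns.tolist())
```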
Similar to working with columns, you can also select, add and delete rows.
Since we dropped all of our columns from the DataFrame, we have to recreate yearlyScores.
Use the following code to do that:
scores2020 = {'Jake':90,'Kayla':85,'Muhammad':95,'Alexis':98}
scores2021 = {'Jake':85,'Kayla':95,'Muhammad':90,'Alexis':97}
seriesOne = pd.Series(scores2020)
seriesTwo = pd.Series(scores2021)
newDict = {'2020': seriesOne,'2021': seriesTwo}
yearlyScores = pd.DataFrame(newDict)
print(yearlyScores)
Output:
2020 2021
Jake 90 85
Kayla 85 95
Muhammad 95 90
Alexis 98 97
So, we have our yearlyScores DataFrame back. I kept all the data the same as before for the sake of this tutorial.
Working with Rows
Okay, let’s see how to work with rows.
There are two ways you can select a row: by index label or by integer position.
To select rows using labels, I can use the loc indexer and pass the row label.
For example:
yearlyScores.loc['Muhammad']
Output:
2020 95
2021 90
Name: Muhammad, dtype: int64
The result that we got is a Series. Moreover, notice that the column names of our DataFrame became the index labels.
Then we have the iloc indexer, which selects rows by their integer position.
For instance, I want to extract Kayla’s scores. Her row sits at index position 1. As a result, I have to pass 1 to iloc.
Here’s how it works:
yearlyScores.iloc[1]
Output:
2020 85
2021 95
Name: Kayla, dtype: int64
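Both indexers also accept a row and a column together, which returns a single cell instead of a whole row. A minimal sketch, with the DataFrame rebuilt from plain lists so the block is self-contained:

```python
import pandas as pd

yearlyScores = pd.DataFrame(
    {'2020': [90, 85, 95, 98], '2021': [85, 95, 90, 97]},
    index=['Jake', 'Kayla', 'Muhammad', 'Alexis'],
)

# loc takes a row label and a column label.
byLabel = yearlyScores.loc['Kayla', '2021']

# iloc takes a row position and a column position.
byPosition = yearlyScores.iloc[1, 1]

print(byLabel, byPosition)
```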
Besides, we can also slice multiple rows using the colon operator.
For example, I want to extract all the scores starting from Kayla to Alexis.
And the procedure is the same as how you would slice a Python list.
So, I will type:
yearlyScores[1:4]
Output:
2020 2021
Kayla 85 95
Muhammad 95 90
Alexis 98 97
The output is a DataFrame containing the rows that we sliced, starting from index position 1 up to, but not including, index position 4.
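The same rows can be selected more explicitly with iloc or loc, and loc also accepts a list of labels when the rows you want are not next to each other. This is a sketch under the same data; rowsByPosition, rowsByLabel and pickedRows are illustrative names.

```python
import pandas as pd

yearlyScores = pd.DataFrame(
    {'2020': [90, 85, 95, 98], '2021': [85, 95, 90, 97]},
    index=['Jake', 'Kayla', 'Muhammad', 'Alexis'],
)

# Position-based slice: rows 1, 2 and 3 (position 4 is excluded).
rowsByPosition = yearlyScores.iloc[1:4]

# Label-based slice: the end label 'Alexis' is included.
rowsByLabel = yearlyScores.loc['Kayla':'Alexis']

print(rowsByPosition.equals(rowsByLabel))

# A list of labels picks rows that are not adjacent.
pickedRows = yearlyScores.loc[['Jake', 'Alexis']]
print(pickedRows)
```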
Moreover, we can also add new rows to a DataFrame using the pd.concat() function. Older versions of Pandas also offered a DataFrame.append() method for this, but it was deprecated in Pandas 1.4 and removed in Pandas 2.0.
However, I will create a new DataFrame first. You’ll see why in a bit.
You can create a DataFrame in any way you like. But, I will use Series.
So, this DataFrame will have two new students with their scores.
To create the DataFrame, I can type:
# Create the Series objects
scores2020 = pd.Series({'Lori':80,'Vince':82})
scores2021 = pd.Series({'Lori':90,'Vince':92})
# Pass and convert our Series objects to DataFrame
newStudentsDf = pd.DataFrame({'2020':scores2020,'2021':scores2021})
print(newStudentsDf)
Output:
2020 2021
Lori 80 90
Vince 82 92
In the end, we have our new DataFrame. So, let’s add the rows of newStudentsDf to our previous DataFrame yearlyScores.
Type the following:
yearlyScores = pd.concat([yearlyScores, newStudentsDf])
print(yearlyScores)
Output:
2020 2021
Jake 90 85
Kayla 85 95
Muhammad 95 90
Alexis 98 97
Lori 80 90
Vince 82 92
To sum up, first, I created a new DataFrame, newStudentsDf. This DataFrame holds two new students and their scores. Then, I passed both DataFrames as a list to pd.concat(), which stacks the rows of newStudentsDf below those of yearlyScores.
You may need to go over the whole procedure a few times to wrap your head around it. In contrast to columns, rows may be complex.
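Rows can be combined with pd.concat(), which is the approach to use on Pandas 2.0 and later, where DataFrame.append() is no longer available. The sketch below is self-contained and also checks that the two input DataFrames are left unchanged; combined is an illustrative name.

```python
import pandas as pd

yearlyScores = pd.DataFrame(
    {'2020': [90, 85, 95, 98], '2021': [85, 95, 90, 97]},
    index=['Jake', 'Kayla', 'Muhammad', 'Alexis'],
)
newStudentsDf = pd.DataFrame(
    {'2020': [80, 82], '2021': [90, 92]},
    index=['Lori', 'Vince'],
)

# Stack the rows of both DataFrames; columns are matched by name.
combined = pd.concat([yearlyScores, newStudentsDf])

print(combined.shape)
print(combined.index.tolist())

# concat returns a new object, so the inputs keep their original sizes.
print(yearlyScores.shape, newStudentsDf.shape)
```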
We are almost at the end of this Pandas tutorial. If you have been able to follow along till this point, then well done!
The last concept that I want to talk about is deleting rows.
Deleting rows is similar to deleting columns. Like before, we can use the drop() method to delete rows in Pandas.
For instance, I want to delete Jake and Lori’s information from our DataFrame. As simple as it sounds, I will type:
yearlyScores.drop(['Jake','Lori'],inplace=True)
print(yearlyScores)
Setting inplace=True applies the change directly to yearlyScores instead of returning a modified copy.
So, here is what we have:
2020 2021
Kayla 85 95
Muhammad 95 90
Alexis 98 97
Vince 82 92
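One thing to watch for when deleting rows: by default, drop() raises a KeyError if a label does not exist. The following sketch shows that behavior and the errors='ignore' option that skips missing labels. The raisedKeyError flag and the remaining variable are illustrative names.

```python
import pandas as pd

yearlyScores = pd.DataFrame(
    {'2020': [85, 95, 98], '2021': [95, 90, 97]},
    index=['Kayla', 'Muhammad', 'Alexis'],
)

# Dropping a label that does not exist raises a KeyError by default.
try:
    yearlyScores.drop(index=['Lori'])
    raisedKeyError = False
except KeyError:
    raisedKeyError = True
print(raisedKeyError)

# errors='ignore' skips missing labels and drops the ones that exist.
remaining = yearlyScores.drop(index=['Kayla', 'Lori'], errors='ignore')
print(remaining.index.tolist())
```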
Conclusion
Finally, we are at the end of our Pandas tutorial. I hope you learned some of the valuable concepts on how to work with Pandas.
I recommend going over the tutorial a few times to understand the concepts thoroughly, although it may take a while. To this day, I still have to practice Pandas regularly to keep my skills sharp.
Working with DataFrames can be complex. That’s why I suggest following the order that I have used in this tutorial when learning Pandas. Besides, the best way to get good at a skill is by practicing.
Interested in becoming a Data Scientist or Machine Learning Engineer? Well, I have an in-depth career guide that you might want to read:
- How to Become a Data Scientist – The Sexiest Job of 21st Century
- How to Become a Machine Learning Engineer
If you have any questions regarding this Pandas tutorial, then feel free to comment below.
How do you plan to use Pandas? Is there any part of this Pandas tutorial that was confusing? What other Python libraries can you think of that are great for data analysis?