Pandas Tutorial for Beginners - The Ultimate Guide

If you are an aspiring data scientist or are interested in data analysis, you need to know how to work with Pandas. So in this Python Pandas tutorial, I will break down the basics and show you how to work with Series and DataFrames.

Moreover, I will show you how to use Pandas to read, clean, transform and store data.

Check out this course on Pandas from LinkedIn Learning if you want to learn through video:

Pandas Tutorial – Essential Training

What is Pandas?

Pandas is an open-source Python library for data analysis and manipulation. Wes McKinney started developing it in 2008, and it is built on top of NumPy.

Read: NumPy Tutorial for Beginners – Arrays

Some of the key features of Pandas include:

Provides fast performance to process large data sets.
Able to load data from a wide variety of sources.
Offers data structures and operations to manipulate large data sets.

We will learn about other important features as we go along.

Installing Pandas

Before we continue with our Pandas tutorial, I am assuming that you have basic Python knowledge.

Read: The Ultimate Python Cheat Sheet – An Essential Reference for Python Developers

To install Pandas, I recommend that you download the Anaconda distribution of Python.

This distribution is an open-source data science platform that comes with Pandas & other scientific libraries.

If you do not want to use Anaconda, you can also install Pandas from your terminal with pip. Installing Pandas pulls in NumPy automatically, so the first command below is optional:

pip install numpy
pip install pandas
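Once the installation finishes, a quick way to confirm that Pandas is available is to import it and print its version. This is only a minimal check, and the version you see depends on your own setup:

```python
import pandas as pd

# If the import succeeds, Pandas is installed in the active environment
installedVersion = pd.__version__
print(installedVersion)
```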

Other than that, Google has a free Jupyter notebook platform known as Google Colab that already provides Pandas and other Python data science libraries. You can use that as well.

Visit here to learn more about Google Colab:

https://colab.research.google.com/

Series

In this part of our Pandas tutorial, I will talk about Series.

A Pandas Series is a one-dimensional labeled array that can hold any data type, such as integers, floats, strings, and Python objects.

In short, you can think of a Series as a single column in an Excel sheet.

Let’s see how we can create a Series.

First, we have to import Pandas using the import keyword.

import pandas as pd

Now, type the following to create a Series:

mySeries = pd.Series([0,1,2,3,4,5,10])
print(mySeries)

Output:

0     0
1     1
2     2
3     3
4     4
5     5
6    10
dtype: int64

Also, you can give your Series a name.

mySeries = pd.Series([0,1,2,3,4,5,10], name='Num')
print(mySeries)

Output:

0     0
1     1
2     2
3     3
4     4
5     5
6    10
Name: Num, dtype: int64

Since Pandas is built on top of NumPy, a Series like this one stores its data in a NumPy array.

We can extract that NumPy array using the values attribute. Note that values is an attribute, not a method, so it takes no parentheses.

mySeries.values

Output:

array([ 0,  1,  2,  3,  4,  5, 10])

As you can see, the output is a NumPy array.
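As a small aside, Pandas also provides a to_numpy() method that returns the same kind of array. Here is a minimal sketch, assuming Pandas 0.24 or later; the variable names arrFromValues and arrFromMethod are only for illustration:

```python
import numpy as np
import pandas as pd

mySeries = pd.Series([0, 1, 2, 3, 4, 5, 10], name='Num')

# values is an attribute (no parentheses) holding the underlying data
arrFromValues = mySeries.values

# to_numpy() is a method that also returns a NumPy array
arrFromMethod = mySeries.to_numpy()

print(type(arrFromMethod))
print(arrFromMethod.sum())  # 0+1+2+3+4+5+10 = 25
```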

A Series also has an index that labels each position.

mySeries.index

Output:

RangeIndex(start=0, stop=7, step=1)

To extract values at certain index positions, I can type:

print(mySeries[6])
print(mySeries[4])

The values at index position 6 & 4:

10
4

Similar to other Python sequences, a Series can also be sliced. For instance, I want to extract the values from index position 5 up to, but not including, position 7. For that, we can type:

mySeries[5:7]

Output:

5     5
6    10
Name: Num, dtype: int64

Remember that the column on the left holds the index values or positions.

We can also assign names to the indexes when we create a Series.

So let’s create a new Series with student scores. This time, I will also pass the index parameter.

scores = pd.Series([70,90,80,100,95,85], index=['Sam','Andrea','Marcos','Peng','Karen','Chen'])
print(scores)

Output:

Sam        70
Andrea     90
Marcos     80
Peng      100
Karen      95
Chen       85
dtype: int64

I passed an index parameter, and this parameter is a list object. Furthermore, this list holds all of our index names or labels.

The left column is the index column, which you can also refer to as the label column. On the right, we have all our values.

If you wish to extract all the index labels, then type:

scores.index

Output:

Index(['Sam', 'Andrea', 'Marcos', 'Peng', 'Karen', 'Chen'], dtype='object')

According to the dtype, the index labels are stored as Python objects (strings, in this case).

Indexing & Slicing Series

I did touch a little bit on indexing & slicing Series objects previously. However, let’s try out some more examples to see how they really work.

Let’s perform some indexing and slicing operations using numbers on the Series scores.

For example, I want to check the score of Karen at the index position 4. So, I can type:

scores[4]

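Output:

95

Note that recent Pandas versions may warn about, or reject, an integer position inside plain square brackets when the Series has label-based indexes. In that case, scores.iloc[4] gives the same result.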
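Here is a small sketch of the same lookup written with explicit position-based and label-based access, so that it does not depend on how plain square brackets interpret an integer. The names byPosition and byLabel are only for illustration:

```python
import pandas as pd

scores = pd.Series([70, 90, 80, 100, 95, 85],
                   index=['Sam', 'Andrea', 'Marcos', 'Peng', 'Karen', 'Chen'])

# Position-based lookup: Karen is the fifth entry, so her position is 4
byPosition = scores.iloc[4]

# Label-based lookup: ask for Karen by name
byLabel = scores['Karen']

print(byPosition, byLabel)
```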

Furthermore, we can slice the Series using numbers as well. Let’s say I want the scores of Andrea, Marcos, Peng & Karen.

Therefore, I have to slice from index position 1 up to 5:

scores[1:5]

Output:

Andrea     90
Marcos     80
Peng      100
Karen      95
dtype: int64

When we slice with numbers, the ending number marks where the slice stops, and the value at that position is not included.

For example, here we slice up to index position 5, so the value at position 5 (Chen’s score) is left out.

You can also slice a Series using the index labels.

For example:

scores['Sam':'Peng']

Output:

Sam        70
Andrea     90
Marcos     80
Peng      100
dtype: int64

Notice how slicing with index labels is different compared to the way we slice using numbers.

In contrast to using numerical indexes, you can see that the value of the ending index label (Peng) is in our result.
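To make the difference concrete, here is a minimal sketch that slices the same Series both ways. I use iloc for the position-based slice to keep the intent explicit; the names byPosition and byLabel are only for illustration:

```python
import pandas as pd

scores = pd.Series([70, 90, 80, 100, 95, 85],
                   index=['Sam', 'Andrea', 'Marcos', 'Peng', 'Karen', 'Chen'])

# Position slice: stops before position 3, so Peng is left out
byPosition = scores.iloc[0:3]

# Label slice: the end label is included, so Peng is part of the result
byLabel = scores['Sam':'Peng']

print(len(byPosition))  # 3 rows: Sam, Andrea, Marcos
print(len(byLabel))     # 4 rows: Sam, Andrea, Marcos, Peng
```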

iloc & loc

Pandas has a built-in indexer known as iloc to extract data using integer-based positions.

So, let’s see how we can use the iloc to extract information from scores:

scores.iloc[0:2]

Output:

Sam       70
Andrea    90
dtype: int64

Again, iloc tells Pandas to extract data by integer position.

Similarly, we also have the loc indexer at our disposal to extract data using label-based indexes.

This is how it works:

scores.loc['Sam':'Peng']

Output:

Sam        70
Andrea     90
Marcos     80
Peng      100
dtype: int64
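Both indexers also accept a list when you want entries that are not next to each other. A minimal sketch, where pickedByLabel and pickedByPosition are illustrative names:

```python
import pandas as pd

scores = pd.Series([70, 90, 80, 100, 95, 85],
                   index=['Sam', 'Andrea', 'Marcos', 'Peng', 'Karen', 'Chen'])

# loc with a list of labels returns those entries in the order requested
pickedByLabel = scores.loc[['Chen', 'Sam']]

# iloc with a list of positions; -1 counts from the end, like a Python list
pickedByPosition = scores.iloc[[0, -1]]

print(pickedByLabel)
print(pickedByPosition)
```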

You will learn more about iloc & loc when we get to the DataFrame section. But for now, let’s keep it simple.

Before I go to the DataFrame section of this Pandas tutorial, I would like to show you two more ways to create Series.

You can also create a Series from a NumPy array. First, import NumPy and Pandas:

import numpy as np
import pandas as pd

Then create a NumPy array:

data = np.array(['X','Y','Z'])

To convert data into a Pandas Series, I can type:

mySeries = pd.Series(data)
print(mySeries)

Output:

0    X
1    Y
2    Z
dtype: object

Series from Dictionaries

We can also create Pandas Series from Python dictionaries.

So let’s start by creating a dictionary:

dictScores = {'Jacob':98,'Mae':70,'Kayla':95}

Now pass dictScores into the pd.Series() constructor:

mySeries = pd.Series(dictScores)
print(mySeries)

Output:

Jacob    98
Mae      70
Kayla    95
dtype: int64

When creating a Series from a dictionary, remember that the dictionary keys become the index labels.
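Because the keys act as labels, you can also pass an index list to pick and order the keys you want. Here is a minimal sketch; 'Noah' is a made-up name that is not in the dictionary, added only to show what happens with a missing key:

```python
import pandas as pd

dictScores = {'Jacob': 98, 'Mae': 70, 'Kayla': 95}

# Only the listed labels are kept, in the order given.
# 'Noah' is hypothetical and has no entry in dictScores, so it becomes NaN.
selected = pd.Series(dictScores, index=['Kayla', 'Jacob', 'Noah'])

print(selected)
```

Since NaN is a floating-point value, the resulting Series uses a float dtype instead of int64.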

Well, that’s it for Series. In the next part of the Pandas tutorial, we will talk about DataFrames.

DataFrames

Great job if you have made it through Series. In this part of the Pandas tutorial, we will learn about DataFrames and the multiple ways we can create them. Not only that, but you will also learn how to select, add, reorder and delete columns and rows on a DataFrame.

Now, what is a DataFrame in Pandas?

A DataFrame is a two-dimensional data structure that has rows and columns. Both the rows and the columns carry labels, which is why they are known as labeled axes.

Each column of a DataFrame is a Series object. In short, one of the ways you can create a DataFrame is by combining two or more Series.

Coming back to the axes, a DataFrame has two of them: axis 0 and axis 1.

In simple terms, axis 0 runs along the rows and axis 1 runs along the columns. Sometimes I like to think of a Pandas DataFrame as an Excel sheet.

Anyway, let’s get our hands dirty with some coding.

First, I will show you how to create DataFrames using a dictionary.

Go ahead and import the following libraries:

import numpy as np
import pandas as pd

Next, create a dictionary object called scores:

scores = {
    'Name': ['Jake', 'Chan', 'Alex'],
    'Age': [15, 16, 17],
    'Grade': ['C', 'B', 'A']
}

The keys of our dictionary will be the column names of our DataFrame. And the values of the dictionary will be the list of items under those columns.

Now convert scores into a DataFrame:

df = pd.DataFrame(scores)
print(df)

Output:

   Name  Age Grade
0  Jake   15     C
1  Chan   16     B
2  Alex   17     A

As you can see, we got ourselves a nice-looking DataFrame that looks similar to an Excel sheet.

Also, notice that the keys of our dictionary became the column names, and the values became the items under those columns.

On the far left side of the DataFrame, we have the index column. Each index value represents a row.

If you want to check the column names, then you can type:

df.columns

Output:

Index(['Name', 'Age', 'Grade'], dtype='object')

Remember that the column names are stored in an Index object, and the dtype tells us how the names themselves are stored.

By the way, each row has an index value on our DataFrame. So, to see the index of all the rows, I can type:

df.index

Output:

RangeIndex(start=0, stop=3, step=1)

The output shows that our index starts at 0 and stops before 3, with a default step of 1. In other words, the rows are numbered 0, 1 and 2.
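If you want a quick summary of a DataFrame's structure, the shape attribute together with the index and columns covers most of it. A minimal sketch using the same data; the names numRows, numCols, rowLabels and columnNames are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Jake', 'Chan', 'Alex'],
    'Age': [15, 16, 17],
    'Grade': ['C', 'B', 'A']
})

# shape is a (rows, columns) tuple
numRows, numCols = df.shape

# Converting the Index objects to plain lists makes them easy to read
rowLabels = list(df.index)
columnNames = list(df.columns)

print(numRows, numCols)
print(rowLabels)
print(columnNames)
```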

DataFrames from Series

So far, we have learned how to create a Pandas DataFrame using a Python dictionary.

Now, I will show you how to create DataFrames using Series.

Step 1 – Create Dictionaries

We will start by creating two dictionaries, with each having yearly scores of students.

So the first one holds student score information from the year 2020. Then the second tells us their score from 2021.

scores2020 = {'Jake':90,'Kayla':85,'Muhammad':95,'Alexis':98}
scores2021 = {'Jake':85,'Kayla':95,'Muhammad':90,'Alexis':97}

Step 2 – Convert the Dictionaries into Series

The keys in our dictionaries will become the index labels. Now that we have created two dictionaries, the next step is to convert them into Series objects using pd.Series().

seriesOne = pd.Series(scores2020)
seriesTwo = pd.Series(scores2021)

Step 3 – Create a New Dictionary Using the Series Objects

In this step, we will create a new dictionary that we will pass into pd.DataFrame() later.

Let’s create a newDict first:

newDict = {'2020': seriesOne,'2021': seriesTwo}

The keys of the newDict are the column names. And the values (Series objects) are the data each column holds.

Step 4 – Convert the New Dictionary into a DataFrame

Lastly, we will create the DataFrame by converting the newDict.

For this to work, I have to use pd.DataFrame() and pass in our newDict.

Here’s what I mean:

yearlyScores = pd.DataFrame(newDict)
print(yearlyScores)

Output:

          2020  2021
Jake        90    85
Kayla       85    95
Muhammad    95    90
Alexis      98    97

Take a look at what’s going on here. I used pd.DataFrame() and passed in our newDict. The keys became the column names, the index labels of the Series became the row labels, and the Series values filled in the cells.
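Now that we have a numeric DataFrame, it is a good moment to see axis 0 and axis 1 in action. This is a minimal sketch using the built-in sum() and mean() methods; totalPerYear and averagePerStudent are illustrative names:

```python
import pandas as pd

seriesOne = pd.Series({'Jake': 90, 'Kayla': 85, 'Muhammad': 95, 'Alexis': 98})
seriesTwo = pd.Series({'Jake': 85, 'Kayla': 95, 'Muhammad': 90, 'Alexis': 97})
yearlyScores = pd.DataFrame({'2020': seriesOne, '2021': seriesTwo})

# axis=0 collapses the rows, leaving one total per column (per year)
totalPerYear = yearlyScores.sum(axis=0)

# axis=1 collapses the columns, leaving one average per row (per student)
averagePerStudent = yearlyScores.mean(axis=1)

print(totalPerYear)
print(averagePerStudent)
```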

Working with Columns

As a data scientist, you will work with Pandas a lot. As a result, you also need to work with columns to make changes or retrieve specific information.

So, let me show you the ways you can work with the columns of a DataFrame.

We will use the yearlyScores DataFrame that we just created.

To select a column, type this:

yearlyScores['2020']

Output:

Jake        90
Kayla       85
Muhammad    95
Alexis      98
Name: 2020, dtype: int64

Similarly, we can also select the year 2021 using the same process:

yearlyScores['2021']

Output:

Jake        85
Kayla       95
Muhammad    90
Alexis      97
Name: 2021, dtype: int64

You can also see that we have the index column on the left. So, no matter what column you select, the output will always have an index column as a default.
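To select more than one column at a time, pass a list of column names inside the square brackets. A minimal sketch; oneColumn and twoColumns are illustrative names:

```python
import pandas as pd

yearlyScores = pd.DataFrame({
    '2020': pd.Series({'Jake': 90, 'Kayla': 85, 'Muhammad': 95, 'Alexis': 98}),
    '2021': pd.Series({'Jake': 85, 'Kayla': 95, 'Muhammad': 90, 'Alexis': 97}),
})

# A single column name returns a Series
oneColumn = yearlyScores['2020']

# A list of column names returns a DataFrame, in the order listed
twoColumns = yearlyScores[['2021', '2020']]

print(type(oneColumn))
print(type(twoColumns))
```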

It’s time to see how we can create a new column and add it to an existing DataFrame.

I will first create a new Pandas Series, which is technically a column, and then add it to the DataFrame yearlyScores.

Creating a Series from a dictionary:

scores2019 = {'Jake':78,'Kayla':90,'Muhammad':96,'Alexis':100}
seriesThree = pd.Series(scores2019)
seriesThree

Output:

Jake         78
Kayla        90
Muhammad     96
Alexis      100
dtype: int64

After that, I can add this Series to our DataFrame as a new column:

yearlyScores['2019'] = seriesThree
print(yearlyScores)

Here, I used square brackets on our DataFrame to name the new column '2019' and assigned the seriesThree object to it. As a result, Pandas adds a new column to the DataFrame.

Output:

          2020  2021  2019
Jake        90    85    78
Kayla       85    95    90
Muhammad    95    90    96
Alexis      98    97   100

But before we move further, we have a problem with our DataFrame: we want the column 2019 to come first instead of last.

We have to reorder the columns. To do that, I can type:

yearlyScores = yearlyScores[['2019','2020','2021']]

Let’s print it out:

print(yearlyScores)

Output:

          2019  2020  2021
Jake        78    90    85
Kayla       90    85    95
Muhammad    96    95    90
Alexis     100    98    97

The order of the columns is now changed.
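If you already know where a new column should go, the DataFrame insert() method can place it there directly, which avoids the separate reordering step. A minimal sketch of that alternative:

```python
import pandas as pd

yearlyScores = pd.DataFrame({
    '2020': pd.Series({'Jake': 90, 'Kayla': 85, 'Muhammad': 95, 'Alexis': 98}),
    '2021': pd.Series({'Jake': 85, 'Kayla': 95, 'Muhammad': 90, 'Alexis': 97}),
})
seriesThree = pd.Series({'Jake': 78, 'Kayla': 90, 'Muhammad': 96, 'Alexis': 100})

# insert(position, column name, values) modifies the DataFrame in place
yearlyScores.insert(0, '2019', seriesThree)

print(list(yearlyScores.columns))
```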

A useful aspect of Pandas is that when we add a Series to a DataFrame as a column, it automatically matches the values by index label.
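Here is a minimal sketch of that label matching. The Series partialScores is a made-up example that covers only two students and lists them in a different order:

```python
import pandas as pd

yearlyScores = pd.DataFrame({
    '2020': pd.Series({'Jake': 90, 'Kayla': 85, 'Muhammad': 95, 'Alexis': 98}),
    '2021': pd.Series({'Jake': 85, 'Kayla': 95, 'Muhammad': 90, 'Alexis': 97}),
})

# Hypothetical partial data: two students only, in a different order
partialScores = pd.Series({'Alexis': 100, 'Jake': 78})

# Values are placed by label, not by position; missing labels become NaN
yearlyScores['2019'] = partialScores

print(yearlyScores)
```

Jake and Alexis receive their own scores even though the order differs, while Kayla and Muhammad get NaN because partialScores has no entry for them.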

You can also delete columns. For instance, I want to delete the column 2019 from our DataFrame. And to do that, I can use the del keyword:

del yearlyScores['2019']

Print out yearlyScores:

print(yearlyScores)

Output:

          2020  2021
Jake        90    85
Kayla       85    95
Muhammad    95    90
Alexis      98    97

You can see that column 2019 is now gone.

Often you may also need to delete multiple columns. We can do that using the drop() method.

Here’s how it works:

yearlyScores.drop(['2020','2021'], axis=1,inplace=True)

So, I want to delete the columns 2020 & 2021. That’s why I passed them as a list. Since we are dealing with columns, I also set axis=1. Lastly, I set inplace=True, which modifies yearlyScores directly. If it is False, drop() leaves yearlyScores untouched and returns a new DataFrame without those columns.

Also, note that inplace is False by default.
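To see the default behavior, here is a minimal sketch that drops a column without inplace. It uses the columns keyword of drop(), which is an alternative to passing axis=1; the name trimmed is only for illustration:

```python
import pandas as pd

yearlyScores = pd.DataFrame({
    '2020': pd.Series({'Jake': 90, 'Kayla': 85, 'Muhammad': 95, 'Alexis': 98}),
    '2021': pd.Series({'Jake': 85, 'Kayla': 95, 'Muhammad': 90, 'Alexis': 97}),
})

# Without inplace=True, drop() returns a new DataFrame
trimmed = yearlyScores.drop(columns=['2020'])

print(list(trimmed.columns))       # the copy has lost '2020'
print(list(yearlyScores.columns))  # the original is unchanged
```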

Let’s print it out:

print(yearlyScores)

Output:

Empty DataFrame
Columns: []
Index: [Jake, Kayla, Muhammad, Alexis]

All of our columns are now gone. Pandas tells us that it is an empty DataFrame.

Similar to working with columns, you can also select, add and delete rows.

Since we dropped all of our columns from the DataFrame, we have to recreate yearlyScores.

Use the following code to do that:

scores2020 = {'Jake':90,'Kayla':85,'Muhammad':95,'Alexis':98}
scores2021 = {'Jake':85,'Kayla':95,'Muhammad':90,'Alexis':97}

seriesOne = pd.Series(scores2020)
seriesTwo = pd.Series(scores2021)

newDict = {'2020': seriesOne,'2021': seriesTwo}

yearlyScores = pd.DataFrame(newDict)
print(yearlyScores)

Output:

          2020  2021
Jake        90    85
Kayla       85    95
Muhammad    95    90
Alexis      98    97

So, we have our yearlyScores DataFrame back. I kept all the data the same as before for the sake of this tutorial.

Working with Rows

Okay, let’s see how to work with rows.

There are two ways you can select a row: using the index labels or using the integer positions.

To select rows using labels, I can use the loc indexer and pass the row label in square brackets.

For example:

yearlyScores.loc['Muhammad']

Output:

2020    95
2021    90
Name: Muhammad, dtype: int64

The result that we got is a Series. Moreover, notice that the column names of our DataFrame became the index labels.

Then we have the iloc indexer, which selects rows by passing the integer-based position.

For instance, I want to extract Kayla’s scores. Her row is at position 1. As a result, I have to pass 1 to iloc.

Here’s how it works:

yearlyScores.iloc[1]

Output:

2020    85
2021    95
Name: Kayla, dtype: int64

Besides, we can also slice multiple rows using the colon operator. For example, I want to extract all the scores starting from Kayla to Alexis.

The procedure is the same as how you would slice a Python list.

So, I will type:

yearlyScores[1:4]

Output:

          2020  2021
Kayla       85    95
Muhammad    95    90
Alexis      98    97

The output is a DataFrame containing the rows that we sliced, starting from index position 1 up to, but not including, index position 4.
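The same rows can be selected more explicitly with iloc and loc, and both indexers can also reach a single cell by row and column. A minimal sketch; byPosition, byLabel and kayla2021 are illustrative names:

```python
import pandas as pd

yearlyScores = pd.DataFrame({
    '2020': pd.Series({'Jake': 90, 'Kayla': 85, 'Muhammad': 95, 'Alexis': 98}),
    '2021': pd.Series({'Jake': 85, 'Kayla': 95, 'Muhammad': 90, 'Alexis': 97}),
})

# Position slice: rows 1, 2 and 3 (position 4 is excluded)
byPosition = yearlyScores.iloc[1:4]

# Label slice: the end label 'Alexis' is included
byLabel = yearlyScores.loc['Kayla':'Alexis']

# A single cell: row label first, then column label
kayla2021 = yearlyScores.loc['Kayla', '2021']

print(byPosition.equals(byLabel))
print(kayla2021)
```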

Moreover, we can also add new rows to a DataFrame using the pd.concat() function. Older versions of Pandas offered an append() method on DataFrames for this, but it was deprecated and then removed in Pandas 2.0.

However, I will create a new DataFrame first. You’ll see why in a bit.

You can create a DataFrame in any way you like. But, I will use Series.

So, this DataFrame will have two new students with their scores.

To create the DataFrame, I can type:

# Create the Series objects
scores2020 = pd.Series({'Lori':80,'Vince':82})
scores2021 = pd.Series({'Lori':90,'Vince':92})

# Pass and convert our Series objects to DataFrame
newStudentsDf = pd.DataFrame({'2020':scores2020,'2021':scores2021})
print(newStudentsDf)

Output:

       2020  2021
Lori     80    90
Vince    82    92

In the end, we have our new DataFrame. So, let’s add the rows of newStudentsDf to our previous DataFrame yearlyScores.

Type the following:

# Stack the rows of both DataFrames; the columns are matched by name
yearlyScores = pd.concat([yearlyScores, newStudentsDf])
print(yearlyScores)

Output:

          2020  2021
Jake        90    85
Kayla       85    95
Muhammad    95    90
Alexis      98    97
Lori        80    90
Vince       82    92

To sum up, first, I created a new DataFrame, newStudentsDf. This DataFrame holds two new students and their scores. Then, I passed both DataFrames as a list to pd.concat(), which stacked the rows of newStudentsDf below those of yearlyScores.
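When only a single row needs to be added, assigning to a new label through loc is a shorter route. This is a minimal sketch of that alternative, and it starts again from the original four students:

```python
import pandas as pd

yearlyScores = pd.DataFrame({
    '2020': pd.Series({'Jake': 90, 'Kayla': 85, 'Muhammad': 95, 'Alexis': 98}),
    '2021': pd.Series({'Jake': 85, 'Kayla': 95, 'Muhammad': 90, 'Alexis': 97}),
})

# Assigning to a label that does not exist yet adds a new row.
# The list must hold one value per column, in column order.
yearlyScores.loc['Lori'] = [80, 90]

print(yearlyScores)
```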

You may need to go over the whole procedure a few times to wrap your head around it. Compared to columns, rows can be more complex.

We are almost at the end of this Pandas tutorial. If you have been able to follow along till this point, then well done!

The last concept that I want to talk about is deleting rows.

Deleting rows is similar to deleting columns. Like before, we can use the drop() method to delete rows in Pandas.

For instance, I want to delete Jake and Lori’s information from our DataFrame. To do that, I will type:

yearlyScores.drop(['Jake','Lori'],inplace=True)
print(yearlyScores)

Since drop() works on rows (axis=0) by default, we don’t need to pass the axis parameter. We still have to use the inplace parameter to modify yearlyScores directly.

So, here is what we have:

          2020  2021
Kayla       85    95
Muhammad    95    90
Alexis      98    97
Vince       82    92
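If you would rather keep the original DataFrame intact, you can leave out inplace and work with the returned copy. This minimal sketch also uses errors='ignore', which skips labels that do not exist; 'Zoe' is a made-up label added only to demonstrate that:

```python
import pandas as pd

yearlyScores = pd.DataFrame({
    '2020': pd.Series({'Jake': 90, 'Kayla': 85, 'Lori': 80}),
    '2021': pd.Series({'Jake': 85, 'Kayla': 95, 'Lori': 90}),
})

# The index keyword names the rows to drop; a new DataFrame is returned
remaining = yearlyScores.drop(index=['Jake', 'Lori'])

# 'Zoe' is hypothetical and not in the index; errors='ignore' skips it
safeDrop = yearlyScores.drop(index=['Jake', 'Zoe'], errors='ignore')

print(list(remaining.index))
print(list(safeDrop.index))
```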

Conclusion

Finally, we are at the end of our Pandas tutorial. I hope you learned some valuable concepts on how to work with Pandas.

I recommend going over the tutorial a few times to understand the concepts clearly, although it may take a while. To this day, I still have to practice Pandas regularly to keep my skills sharp.

Working with DataFrames can be complex. That’s why I suggest following the order used in this tutorial when learning Pandas. Besides, the best way to get good at a skill is by practicing.

If you have any questions regarding this Pandas tutorial, then feel free to comment below.

How do you plan to use Pandas? Is there any part of this Pandas tutorial that was confusing? What other Python libraries can you think of that are great for data analysis?