Pandas CSV Tutorial – Reading CSV Files into Python Using Pandas

Knowing how to read CSV files with Pandas is a critical skill for every data scientist. In the real world, a lot of projects start by loading a DataFrame from a CSV file.

So in this article, I will show you how to use Pandas to read CSV files into DataFrames.

Related: Pandas Tutorial for Beginners – The Ultimate Guide

Apart from reading CSVs, Pandas also allows us to read data from different sources such as JSON, Excel, and even SQL database tables.
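For reference, here is a quick, hedged sketch of what those other readers look like. The file names and the SQLite database below are hypothetical placeholders, not files used in this tutorial:

import sqlite3
import pandas as pd

# Each reader returns a DataFrame, just like pd.read_csv()
df_json = pd.read_json('data.json')      # hypothetical JSON file
df_excel = pd.read_excel('data.xlsx')    # hypothetical Excel file (requires openpyxl)

# For SQL, pass a query together with an open database connection
conn = sqlite3.connect('data.db')        # hypothetical SQLite database
df_sql = pd.read_sql('SELECT * FROM apps', conn)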

So without wasting any more time, let’s open a Jupyter notebook and see how Pandas CSV works.

Before we continue, I am assuming that you know your way around Python and Jupyter notebooks.

Read: The Ultimate Python Cheat Sheet – An Essential Reference for Python Developers

Interested in in-depth video tutorials on Pandas? Then check out the Pandas courses on LinkedIn Learning.

Installing Pandas

To install Pandas, I recommend that you download the Anaconda distribution of Python from their official website.

The Anaconda distribution is an open-source data science platform that comes with Pandas and other scientific libraries preinstalled.

You can also install Pandas from your terminal using the commands below (installing Pandas pulls in NumPy as a dependency, but you can install both explicitly):

pip install numpy
pip install pandas

Alternatively, Google offers a free hosted Jupyter notebook platform called Google Colab. It already comes with Pandas and the other Python data science libraries, so you can use that as well.

Related: NumPy Tutorial for Beginners – Arrays

You can learn more about Google Colab on its official site.
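If you do work in Colab, keep in mind that it does not automatically see files on your computer. Here is a minimal sketch for uploading the CSV into your Colab session, assuming the google.colab helper module that Colab provides:

from google.colab import files

# Opens a file picker; choose googleplaystore.csv from your machine
uploaded = files.upload()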

Important: I will be using the googleplaystore.csv dataset from Kaggle for this Pandas CSV tutorial. Download it from Kaggle so you can follow along.

Make sure to put the googleplaystore.csv file under the same directory as your Jupyter notebook.
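If you are not sure the file is in the right place, here is a quick sanity check you can run in a notebook cell (just a sketch using the standard pathlib module):

from pathlib import Path

# Should print True if googleplaystore.csv sits next to your notebook
print(Path('googleplaystore.csv').exists())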

Reading CSV Files Using Pandas

Let’s start by importing the necessary libraries first:

import numpy as np
import pandas as pd

After that, we will read googleplaystore.csv into a DataFrame and assign it to a variable.

So, type the following to read the data:

newDf = pd.read_csv('googleplaystore.csv')

The function we used to read our data is pd.read_csv(). It reads a CSV file into a DataFrame that we can then work with.

You may have already noticed that I passed the name of our CSV file as an argument.

Keep in mind that if my Jupyter notebook and the CSV file were in separate directories, I would have to pass the full path to the file instead of just its name.
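For illustration, here is a hedged sketch of what that would look like. The path below is a made-up placeholder, so substitute the real location of your file:

# Hypothetical absolute path - replace it with wherever your CSV actually lives
newDf = pd.read_csv('/home/user/datasets/googleplaystore.csv')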

Now, let’s see what we have so far. Type the following:

newDf.head()

Output:

(Screenshot of the first five rows of the DataFrame.)

The result is a nice-looking DataFrame. It gives us information about the app names, categories, ratings, and more.

Read: JSON with Python – Reading & Writing (With Examples)

Shape Attribute

Moving on, let’s see how we can verify the number of rows and columns in this DataFrame. The way to do that is to use the shape attribute.

This is how it works:

newDf.shape

Output:

(10841, 13)

The output that we get is a tuple. The number 10841 represents the number of rows, and the 13 represents the number of columns. 

If you ever work as a data analyst, you may need to compare the DataFrame against the original data file to make sure the data has been read and imported accurately.

That’s one of the aspects where the shape attribute comes in handy.
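Since shape is just a tuple, you can unpack it and even use it as a quick guard against an incomplete import. A small sketch, with the expected row count taken from the output above:

# Unpack the tuple into two variables
rows, cols = newDf.shape
print(rows, cols)   # 10841 13

# Fail loudly if the import did not bring in every row
assert rows == 10841, 'Row count does not match the source file'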


Head & Tail Function

Next in this Pandas CSV tutorial, I will show you how to use the head() and tail() functions. They are very handy for quickly inspecting a DataFrame loaded from a CSV.

You probably already saw me using the head() function before when I created our newDf DataFrame. 

In short, this function returns the first 5 rows of a DataFrame.

On the other hand, the tail() function returns the last 5 rows.

For example:

newDf.head()

Output:

(Screenshot of the first five rows of the DataFrame.)

And then the tail() function to view the last 5 rows:

newDf.tail()

Output:

(Screenshot of the last five rows of the DataFrame.)

By default, the head() and tail() functions display the first and the last five rows. However, we can change that by passing a number as an argument.

For instance, if I want to view the first ten rows of the newDf DataFrame, I can call the head() function and pass in the number 10.

Here’s how it looks:

newDf.head(10)

Output:

(Screenshot of the first ten rows of the DataFrame.)

As a result, you can see that it displayed the first ten rows since we passed 10 as a parameter.

Similarly, we can display the last ten rows:

newDf.tail(10)

Output:

(Screenshot of the last ten rows of the DataFrame.)

Now you can see the last 10 rows of the DataFrame.

Info Function

Next in this Pandas CSV tutorial, I want to discuss the info() function.

The info() function provides a summary of a DataFrame. We can use it to check the number of entries, the data type of each column, and whether any data is missing.

Let’s see it in action:

newDf.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB

The result is a concise summary. For example, you can see right away that the Rating column has only 9367 non-null entries out of 10841, which means it contains missing values.
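If you want the missing values as explicit numbers rather than reading them off the non-null counts, a quick follow-up sketch using standard Pandas:

# Number of missing values per column; Rating should show 1474 (10841 - 9367)
newDf.isnull().sum()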

Counting Values

Another useful method when working with CSV data in Pandas is value_counts(). It returns the counts of the unique values in a particular column.

In other words, it returns a Series with the unique values as its index and their counts as its values, sorted by count.

For instance, I want to check the count of each category in our DataFrame.

Thus, I can type:

newDf.Category.value_counts()

Output:

FAMILY                 1972
GAME                   1144
TOOLS                   843
MEDICAL                 463
BUSINESS                460
PRODUCTIVITY            424
PERSONALIZATION         392
COMMUNICATION           387
SPORTS                  384
LIFESTYLE               382
FINANCE                 366
HEALTH_AND_FITNESS      341
PHOTOGRAPHY             335
SOCIAL                  295
NEWS_AND_MAGAZINES      283
SHOPPING                260
TRAVEL_AND_LOCAL        258
DATING                  234
BOOKS_AND_REFERENCE     231
VIDEO_PLAYERS           175
EDUCATION               156
ENTERTAINMENT           149
MAPS_AND_NAVIGATION     137
FOOD_AND_DRINK          127
HOUSE_AND_HOME           88
LIBRARIES_AND_DEMO       85
AUTO_AND_VEHICLES        85
WEATHER                  82
ART_AND_DESIGN           65
EVENTS                   64
PARENTING                60
COMICS                   60
BEAUTY                   53
1.9                       1
Name: Category, dtype: int64

You can see that I got a Series containing the counts of each type of category. Also, note that the first value that value_counts() returned is FAMILY.

That is because FAMILY is the most frequently occurring category, so more apps in our DataFrame fall under it than under any other category.

After that comes the GAME category, the second most common in our DataFrame.

You can see that the number decreases as we go down the Series object with the least occurring category at the very end.

However, you can also reverse this order.

To reverse the order, I can type:

newDf.Category.value_counts(ascending=True)

Output:

1.9                       1
BEAUTY                   53
COMICS                   60
PARENTING                60
EVENTS                   64
ART_AND_DESIGN           65
WEATHER                  82
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       85
HOUSE_AND_HOME           88
FOOD_AND_DRINK          127
MAPS_AND_NAVIGATION     137
ENTERTAINMENT           149
EDUCATION               156
VIDEO_PLAYERS           175
BOOKS_AND_REFERENCE     231
DATING                  234
TRAVEL_AND_LOCAL        258
SHOPPING                260
NEWS_AND_MAGAZINES      283
SOCIAL                  295
PHOTOGRAPHY             335
HEALTH_AND_FITNESS      341
FINANCE                 366
LIFESTYLE               382
SPORTS                  384
COMMUNICATION           387
PERSONALIZATION         392
PRODUCTIVITY            424
BUSINESS                460
MEDICAL                 463
TOOLS                   843
GAME                   1144
FAMILY                 1972
Name: Category, dtype: int64

Now we have FAMILY at the bottom of the Series, since the counts are in ascending order, and the least common category is at the top.

By default, the results are in descending order. Here we changed that by passing ascending=True to the method.
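As a side note, value_counts() can also report relative frequencies instead of raw counts. A brief sketch using its normalize parameter:

# Fraction of apps per category; FAMILY comes out to roughly 0.18
newDf.Category.value_counts(normalize=True)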

Sorting Values

We are almost at the end of this Pandas CSV tutorial. But before I conclude, I also want to discuss the sort_values() method.

The sort_values() method sorts the contents of a column in either ascending or descending order. 

For example, I want to sort the Rating column in ascending order. 

So, I can type:

newDf.Rating.sort_values()

Output:


8820     1.0
7144     1.0
10400    1.0
10591    1.0
5151     1.0
        ... 
10824    NaN
10825    NaN
10831    NaN
10835    NaN
10838    NaN
Name: Rating, Length: 10841, dtype: float64

As a result, the Rating column is sorted in ascending order. We can also sort it in descending order:

newDf.Rating.sort_values(ascending=False)

Output:

10472    19.0
7435      5.0
8058      5.0
8234      5.0
8230      5.0
         ... 
10824     NaN
10825     NaN
10831     NaN
10835     NaN
10838     NaN
Name: Rating, Length: 10841, dtype: float64

Now the order is reversed, and the values of the Rating column decrease as we go down the Series.

For your information, the ascending parameter is set to True by default. So, to sort the values in descending order, you have to change it to False.
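You may also have noticed that the NaN (missing) ratings sit at the bottom in both outputs. That is because sort_values() places missing values last by default. A minimal sketch of the na_position parameter, in case you ever want them first:

# Missing ratings appear at the top instead of the bottom
newDf.Rating.sort_values(na_position='first')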

Moreover, we can also sort a column in alphabetical order.

Type the following:

newDf.Category.sort_values()

Output:

10472               1.9
0        ART_AND_DESIGN
35       ART_AND_DESIGN
36       ART_AND_DESIGN
37       ART_AND_DESIGN
              ...      
3645            WEATHER
3646            WEATHER
3647            WEATHER
8291            WEATHER
8168            WEATHER
Name: Category, Length: 10841, dtype: object

Notice that the Series is now in alphabetical order, and strings that begin with digits come before letters. That’s why 1.9 is at the top.

Next, I also want to discuss the by parameter. Let’s see it in action and then learn how it works.

newDf.sort_values(by=['Category','Rating'])

Output:

(Screenshot of the DataFrame sorted by Category and Rating in ascending order.)

All I did was use the by parameter with a list containing the Category and Rating columns. The result is the whole DataFrame sorted by Category and then by Rating, both in ascending order.

We can also sort them in descending order:

newDf.sort_values(by=['Category','Rating'],ascending=False)

Output:

(Screenshot of the DataFrame sorted by Category and Rating in descending order.)

And you can see that the order is reversed.
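In case you need a mix, the ascending parameter also accepts a list with one flag per column. A brief sketch:

# Category alphabetically, but Rating from highest to lowest within each category
newDf.sort_values(by=['Category', 'Rating'], ascending=[True, False])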

However, this change is not permanent; sort_values() returns a new, sorted DataFrame and leaves newDf untouched. If you wish to keep the sorted order in newDf itself, you have to use the inplace parameter and set it to True.

For example, let’s make the change permanent:

newDf.sort_values(by=['Category','Rating'],ascending=False,inplace=True)

Now the change is permanent. If you type newDf in your Jupyter notebook, you will see the difference: the Category and Rating columns are in descending order.
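As an alternative to inplace=True, you can simply assign the sorted result back to the variable; many people find this easier to read. Just a stylistic sketch with the same end result:

# Equivalent to the inplace version above
newDf = newDf.sort_values(by=['Category', 'Rating'], ascending=False)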

Conclusion

To summarize, utilizing Pandas CSV to read data is very straightforward.

First, you import Pandas and then use the pd.read_csv() function to read your data into a DataFrame.

Also, we discussed the shape attribute to check the number of rows and columns, and then saw how to work with functions such as head(), tail(), and info().

In a separate tutorial, I probably will talk about how to write to CSV files using Pandas. But for now, let’s keep it simple and start here.

In my experience, it’s also beneficial to learn through video tutorials, so I do recommend checking out the Pandas course from LinkedIn Learning.


Are you a data scientist or an analyst who uses Pandas? Do you find working with Pandas CSV effective when it comes to reading CSVs? 
