Knowing how to use Pandas' CSV functionality is essential for every data scientist. After all, in the real world, many projects begin by reading a DataFrame from a CSV file.
So in this article, I will show you how to use Pandas to read CSV files.
Related: Pandas Tutorial for Beginners – The Ultimate Guide
Apart from reading CSVs, Pandas also allows us to read data from different sources such as JSON, Excel, and even SQL database tables.
So without wasting any more time, let’s open a Jupyter notebook and see how Pandas CSV works.
Before we continue, I'll assume that you know your way around Python and Jupyter notebooks.
Read: The Ultimate Python Cheat Sheet – An Essential Reference for Python Developers
Installing Pandas
To install Pandas, I recommend that you download the Anaconda distribution of Python from their official website.
The Anaconda distribution is an open-source data science platform that comes with Pandas and other scientific libraries.
You can also install Pandas from your terminal using the commands below in the following order:
pip install numpy
pip install pandas
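Once installed, you can quickly check that Pandas imports correctly from your terminal:

```shell
# Quick sanity check that Pandas installed and imports cleanly;
# this prints the installed version number
python -c "import pandas; print(pandas.__version__)"
```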
Besides, Google also has a free Jupyter notebook platform known as Google Colab. It comes with Pandas and other Python data science libraries preinstalled, so you can use that as well.
Related: NumPy Tutorial for Beginners – Arrays
Important: I will be using a CSV file from Kaggle for this Pandas CSV tutorial. Here’s the link to the file:
Make sure to put the googleplaystore.csv file in the same directory as your Jupyter notebook.
Reading CSV Files Using Pandas
Let’s start by importing the necessary libraries first:
import numpy as np
import pandas as pd
After that, we will read googleplaystore.csv into a DataFrame and assign it to a variable.
So, type the following to read the data:
newDf = pd.read_csv('googleplaystore.csv')
The method we used to read our data is pd.read_csv(). It reads a CSV file into a DataFrame that we can then work with.
You may have already noticed that I passed the name of our CSV file as a parameter.
Keep in mind that if my Jupyter notebook and the CSV file were in separate directories, I would have to pass the full path to the file instead of just its name.
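As a quick sketch of what that looks like (the path below is made up for illustration), note that read_csv() also accepts any file-like object, which lets us try it here without a file on disk:

```python
import io
import pandas as pd

# If the CSV lived elsewhere, you would pass its full path instead of
# just the file name (this path is hypothetical):
# newDf = pd.read_csv('/home/user/data/googleplaystore.csv')

# read_csv also accepts any file-like object, so we can demo it without
# a file on disk (these two columns are made up for illustration):
csv_text = "App,Rating\nChess,4.5\nMaps,4.1\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 2)
```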
Now, let’s see what we have so far. Type the following:
newDf.head()
Output:
The result is a nice-looking DataFrame. It gives us information about the names of the apps, their categories, ratings, and more.
Read: JSON with Python – Reading & Writing (With Examples)
Shape Attribute
Moving on, let’s see how we can verify the number of rows and columns in this DataFrame. The way to do that is to use the shape attribute.
This is how it works:
newDf.shape
Output:
(10841, 13)
The output we get is a tuple: 10841 is the number of rows, and 13 is the number of columns.
If you ever work as a data analyst, you may need to check the DataFrame against the original data file to make sure the data has been read and imported accurately.
That’s one of the situations where the shape attribute comes in handy.
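Since shape is a plain tuple, you can also unpack it directly into two variables. Here is a minimal sketch on a tiny stand-in DataFrame (the real googleplaystore.csv has 10841 rows and 13 columns):

```python
import pandas as pd

# A tiny stand-in DataFrame for illustration
df = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.1, 3.9, 4.7]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
rows, cols = df.shape
print(rows, cols)  # 3 2
```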
Head & Tail Functions
Next in this Pandas CSV tutorial, I will show you how to use the head() and tail() functions. They are very handy for quickly inspecting a DataFrame you have just read from a CSV.
You probably already saw me use the head() function earlier when I created our newDf DataFrame.
In short, this function returns the first 5 rows of a DataFrame.
On the other hand, the tail() function returns the last 5 rows.
For example:
newDf.head()
Output:
And then the tail() function to view the last 5 rows:
newDf.tail()
Output:
By default, the head() and tail() functions display the first and last five rows. However, we can change that by passing a parameter.
For instance, say I want to view the first ten rows of the newDf DataFrame. I can call the head() function and pass in the number 10.
Here’s how it looks:
newDf.head(10)
Output:
As a result, you can see that it displayed the first ten rows, since we passed 10 as a parameter.
Similarly, we can display the last ten rows:
newDf.tail(10)
Output:
Now you can see the last 10 rows of the DataFrame.
Info Function
Next in this Pandas CSV tutorial, I want to discuss the info() function.
The info() function provides a summary of a DataFrame. We can use it to check the number of entries, the data type of each column, and whether any data is missing.
Let’s see it in action:
newDf.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 App 10841 non-null object
1 Category 10841 non-null object
2 Rating 9367 non-null float64
3 Reviews 10841 non-null object
4 Size 10841 non-null object
5 Installs 10841 non-null object
6 Type 10840 non-null object
7 Price 10841 non-null object
8 Content Rating 10840 non-null object
9 Genres 10841 non-null object
10 Last Updated 10841 non-null object
11 Current Ver 10833 non-null object
12 Android Ver 10838 non-null object
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
So, the result is a compact summary with insights about the structure of our data.
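If you only care about the missing-value part of that summary, a quick follow-up is to count the nulls per column with isnull() and sum(). Here is a minimal sketch on a tiny made-up DataFrame (the real data's gaps are in columns like Rating):

```python
import numpy as np
import pandas as pd

# Stand-in data with one missing Rating, mirroring the gaps info() reports
df = pd.DataFrame({"App": ["A", "B", "C"], "Rating": [4.1, np.nan, 4.7]})

# isnull() marks missing cells; sum() counts them per column
missing = df.isnull().sum()
print(missing["Rating"])  # 1
```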
Counting Values
Another useful method when working with CSV data is value_counts(). It returns the count of each unique value in a particular column.
In other words, it returns a Series with the unique values as its index, sorted by their counts.
For instance, I want to check the count of each category in our DataFrame.
Thus, I can type:
newDf.Category.value_counts()
Output:
FAMILY 1972
GAME 1144
TOOLS 843
MEDICAL 463
BUSINESS 460
PRODUCTIVITY 424
PERSONALIZATION 392
COMMUNICATION 387
SPORTS 384
LIFESTYLE 382
FINANCE 366
HEALTH_AND_FITNESS 341
PHOTOGRAPHY 335
SOCIAL 295
NEWS_AND_MAGAZINES 283
SHOPPING 260
TRAVEL_AND_LOCAL 258
DATING 234
BOOKS_AND_REFERENCE 231
VIDEO_PLAYERS 175
EDUCATION 156
ENTERTAINMENT 149
MAPS_AND_NAVIGATION 137
FOOD_AND_DRINK 127
HOUSE_AND_HOME 88
LIBRARIES_AND_DEMO 85
AUTO_AND_VEHICLES 85
WEATHER 82
ART_AND_DESIGN 65
EVENTS 64
PARENTING 60
COMICS 60
BEAUTY 53
1.9 1
Name: Category, dtype: int64
You can see that I got a Series containing the count of each category. Also, note that the first value that value_counts() returned is FAMILY.
That’s because FAMILY is the most frequently occurring category. As a result, we can see that the majority of the apps in our DataFrame fall under it.
After that, there is the GAME category, the second most frequent in our DataFrame.
You can see that the number decreases as we go down the Series object with the least occurring category at the very end.
However, you can also reverse this order.
To reverse the order, I can type:
newDf.Category.value_counts(ascending=True)
Output:
1.9 1
BEAUTY 53
COMICS 60
PARENTING 60
EVENTS 64
ART_AND_DESIGN 65
WEATHER 82
AUTO_AND_VEHICLES 85
LIBRARIES_AND_DEMO 85
HOUSE_AND_HOME 88
FOOD_AND_DRINK 127
MAPS_AND_NAVIGATION 137
ENTERTAINMENT 149
EDUCATION 156
VIDEO_PLAYERS 175
BOOKS_AND_REFERENCE 231
DATING 234
TRAVEL_AND_LOCAL 258
SHOPPING 260
NEWS_AND_MAGAZINES 283
SOCIAL 295
PHOTOGRAPHY 335
HEALTH_AND_FITNESS 341
FINANCE 366
LIFESTYLE 382
SPORTS 384
COMMUNICATION 387
PERSONALIZATION 392
PRODUCTIVITY 424
BUSINESS 460
MEDICAL 463
TOOLS 843
GAME 1144
FAMILY 1972
Name: Category, dtype: int64
Now FAMILY is at the bottom of the Series since the counts are ascending, and the least frequent category is at the top.
By default, the results are in descending order. Here, we changed that by passing ascending=True to the method.
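value_counts() also accepts a normalize parameter, which returns each value's share of the total instead of a raw count. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up categories for illustration
df = pd.DataFrame({"Category": ["GAME", "GAME", "GAME", "TOOLS", "FAMILY"]})

# normalize=True divides each count by the total number of rows
shares = df.Category.value_counts(normalize=True)
print(shares["GAME"])  # 0.6
```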
Sorting Values
We are almost at the end of this Pandas CSV tutorial. But before I conclude, I also want to discuss the sort_values() method.
The sort_values() method sorts the contents of a column in either ascending or descending order.
For example, I want to sort the Rating column in ascending order.
So, I can type:
newDf.Rating.sort_values()
Output:
8820 1.0
7144 1.0
10400 1.0
10591 1.0
5151 1.0
...
10824 NaN
10825 NaN
10831 NaN
10835 NaN
10838 NaN
Name: Rating, Length: 10841, dtype: float64
As a result, the Rating column is sorted in ascending order. We can perform the same operation in descending order as well:
newDf.Rating.sort_values(ascending=False)
Output:
10472 19.0
7435 5.0
8058 5.0
8234 5.0
8230 5.0
...
10824 NaN
10825 NaN
10831 NaN
10835 NaN
10838 NaN
Name: Rating, Length: 10841, dtype: float64
Now the order is changed to descending, and the values of the Rating column decrease as we go down the Series.
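Notice that the NaN values appear at the end in both outputs: sort_values() places missing values last by default. That behavior can be changed with the na_position parameter, which is not covered above; a minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, np.nan, 1.0])

# Missing values go last by default; na_position='first' moves them
# to the top instead
sorted_s = s.sort_values(ascending=False, na_position="first")
```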
For your information, the ascending parameter is set to True by default. So, to sort the values in descending order, you have to set it to False.
Moreover, we can also sort a column in alphabetical order.
Type the following:
newDf.Category.sort_values()
Output:
10472 1.9
0 ART_AND_DESIGN
35 ART_AND_DESIGN
36 ART_AND_DESIGN
37 ART_AND_DESIGN
...
3645 WEATHER
3646 WEATHER
3647 WEATHER
8291 WEATHER
8168 WEATHER
Name: Category, Length: 10841, dtype: object
Notice that the Series is in alphabetical order, with numbers sorting before letters. That’s why 1.9 is at the top.
Next, I also want to discuss the by parameter. Let’s see it in action and then learn how it works.
newDf.sort_values(by=['Category','Rating'])
Output:
All I did was use the by parameter with a list containing the columns Category and Rating. As a result, we get the whole DataFrame with the Category and Rating columns sorted in ascending order.
Moreover, we can sort them in descending order as well:
newDf.sort_values(by=['Category','Rating'],ascending=False)
Output:
And you can see that the orders are changed.
However, this change is not permanent. If you wish to keep the new order in the DataFrame itself, you have to use the inplace parameter and set it to True.
For example, let’s make the change permanent:
newDf.sort_values(by=['Category','Rating'],ascending=False,inplace=True)
Now the change is permanent. If you type newDf in your Jupyter notebook, you will see the difference: the Category and Rating columns are in descending order.
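As an aside, a common alternative to inplace=True is simply reassigning the sorted result back to the variable, which has the same effect. A minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"Category": ["B", "A", "B"], "Rating": [3.0, 5.0, 4.0]})

# Reassigning has the same effect as inplace=True: df now holds the
# DataFrame sorted by Category, then Rating, both descending
df = df.sort_values(by=["Category", "Rating"], ascending=False)
print(list(df.Rating))  # [4.0, 3.0, 5.0]
```

Many Pandas users prefer reassignment over inplace=True because it reads the same as every other DataFrame operation that returns a new object.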
Conclusion
To summarize, reading CSV data with Pandas is very straightforward.
First, you import Pandas and then use the pd.read_csv() function to read your data.
We also discussed the shape attribute for checking the number of rows and columns, and saw how to work with functions such as head(), tail(), and info().
In a separate tutorial, I probably will talk about how to write to CSV files using Pandas. But for now, let’s keep it simple and start here.
In my experience, it’s also beneficial to learn through video tutorials. As a result, I do recommend checking out the Pandas course from Linkedin Learning.
Are you a data scientist or an analyst who uses Pandas? Do you find working with Pandas CSV effective when it comes to reading CSVs?