Read a web page’s tabular data using Pandas

html page

Read a web page’s tabular data using Pandas

So you have seen a tabular data on one of the internet page, say on wiki. You were amazed to see some data facts mentioned by some fellow mate. You wanted to bring that data onto your computer. May be you wanted to perform analysis or visualization or even make some predictions using machine learning.

This post is all about helping you read the tabular data from a web page and perform some basic pandas operations on the data.

Lets say you are fascinated by an info on a wiki page that speaks of “List of Domesticated Animals” (Click here to view the page). You wanted to make a presentation or analyze the data from this web page. The web page looks like this:

In this web page you will find multiple tabular data. Now the question arises as to, “How to read a specific table from a web page using pandas and python?”

First, you would import the pandas library.

#First import the necessary libraries
import pandas as pd

Using read_html(), read the html web page. As a parameter, pass the link which you want to read. This would read all the tabular data in the form of list of individual dataframes.

animal_table = pd.read_html("https://en.wikipedia.org/wiki/List_of_domesticated_animals")


#To choose the first table of the html page, then use the index 0.

#To choose the second table of the html page, then use the index 1.

Let’s take the first tabular dataframe into account for which we would perform some visualizations, and read it into a variable “domestic_animal”. You might want to separate the names of each animal under the column “Species and subspecies” from the unwanted characters. If you observe every value in that column, you will see that every individual value after the first “(” bracket, is something that you can remove. So all we have to do is, convert the entire column into string and split every value based on the first occurrence of “(“. Since we have to save this list of every animal, we will create an empty list and append the names of all the animals.

lAnimal_List = []
for i in domestic_animal['Species and subspecies'].str.split("("):

Now merge back the list that we created (merge based on index) and drop the column “Species and Subspecies”, as it has unwanted characters.

domestic_animal = pd.merge(pd.DataFrame(pd.Series(lAnimal_List), columns=['Animal']),
                          domestic_animal.drop(columns=['Species and subspecies']),
                          left_index=True, right_index=True)

#By performing groupby, we are grouping all the columns based on "Location
#of origin"
s = domestic_animal.groupby(['Location of origin']).count()

#Importing matplotlib to plot the graph.
import matplotlib.pyplot as plt

#Change the size of the resultant figure
fig = plt.figure(figsize=(50,50))
plt.barh(s.index, s['Animal'])
plt.xlabel("Name of the country")
plt.title("Country vs. Animal")
plt.savefig("country vs animal.png")

So what all did you learn today?

  1. How to read a web page using read_html() of pandas.
  2. Cleaning the data using Python.
  3. Visualization using matplotlib.pyplot.