Extract Table Data From Wikipedia Using Web Scraping With Python

In this blog post we will scrape a web page with Python and convert an HTML table into a pandas dataframe. After that we will analyze the data. The data comes from the Wikipedia article "Economic development in India", available at https://en.wikipedia.org/wiki/Economic_development_in_India.

We will extract an HTML table from that Wikipedia page and analyze its data using the pandas and matplotlib modules.

What is requests?

Requests is an elegant and simple HTTP library for Python, built for human beings. To install Requests, run this command in your terminal:

python -m pip install requests

Making a request with Requests is simple: import the module and call requests.get() with the page URL.

import requests

r = requests.get('WEB-PAGE-URL')

Now, we have a Response object called r. We can get all the information we need from this object.

Import modules

In [1]:
import requests

Get webpage content using requests

In [2]:
page_url = "https://en.wikipedia.org/wiki/Economic_development_in_India"

We fetch the webpage content with the requests.get() function, passing the timeout and verify parameters.

Requests verifies SSL certificates for HTTPS requests by default, because verify is set to True. Setting verify to False makes Requests skip that verification. The verify option only applies to host certificates.

Most requests to external servers should have a timeout attached, in case the server does not respond in a timely manner. By default, Requests does not time out unless a timeout value is set explicitly. Without a timeout, our code may hang for minutes or more.

The connect timeout is the number of seconds Requests will wait for your client to establish a connection to a remote machine (corresponding to the connect() call on the socket). It is good practice to set connect timeouts.

The timeout value will be applied to both the connect and the read timeouts. Specify a tuple if you would like to set the values separately.
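
As a minimal sketch of the tuple form, the helper below passes separate connect and read timeouts and turns request failures into a None result. The name fetch_html, the 3.05 and 10 second values, and the injectable get argument are assumptions made for illustration; they are not part of the original notebook.

```python
import requests


def fetch_html(url, connect_timeout=3.05, read_timeout=10.0, get=requests.get):
    """Return the page body as text, or None if the request fails.

    Hypothetical helper: the timeout values are illustrative, and `get`
    is injectable only so the function can be exercised without a network.
    """
    try:
        # A (connect, read) tuple sets the two timeouts separately.
        response = get(url, timeout=(connect_timeout, read_timeout))
        # Raise requests.exceptions.HTTPError for 4xx/5xx status codes.
        response.raise_for_status()
    except requests.exceptions.Timeout:
        print(f"Timed out while fetching {url}")
        return None
    except requests.exceptions.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text
```

With the page_url defined above, fetch_html(page_url) should return the page text, or None if the server is unreachable or too slow.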

In [3]:
html = requests.get(page_url, timeout = 5, verify = True)

Display first 1000 characters

In [4]:
html.text[:1000]
Out[4]:
'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Economic development in India - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"e8600076-9b66-4751-8181-a16d947012b9","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Economic_development_in_India","wgTitle":"Economic development in India","wgCurRevisionId":1056799682,"wgRevisionId":1056799682,"wgArticleId":13770627,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["Webarchive template wayback links","CS1 maint: archived copy as title","CS1 errors: missing periodical",'

Load html in pandas dataframe

In [5]:
import pandas as pd

pandas.read_html

pandas.read_html(io, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True)

Read HTML tables into a list of DataFrame objects.

Parameters

io: str, path object or file-like object

    A URL, a file-like object, or a raw string containing HTML. Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.

match: str or compiled regular expression, optional

    The set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple you will probably need to pass a non-empty string here. Defaults to ‘.+’ (match any non-empty string). The default value will return all tables contained on a page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.


flavor: str, optional

    The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.

header: int or list-like, optional

    The row (or list of rows for a MultiIndex) to use to make the columns headers.

index_col: int or list-like, optional

    The column (or list of columns) to use to create the index.

skiprows: int, list-like or slice, optional

    Number of rows to skip after parsing the column integer. 0-based. If a sequence of integers or a slice is given, will skip the rows indexed by that sequence. Note that a single element sequence means ‘skip the nth row’ whereas an integer means ‘skip n rows’.

attrs: dict, optional

    This is a dictionary of attributes that you can pass to use to identify the table in the HTML.

parse_dates: bool, optional

    See read_csv() for more details.

thousands: str, optional

    Separator to use to parse thousands. Defaults to ','.

encoding: str, optional

    The encoding used to decode the web page. Defaults to None.

decimal: str, default '.'

    Character to recognize as decimal point.

converters: dict, default None

    Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the cell (not column) content, and return the transformed content.

na_values: iterable, default None

    Custom NA values.

keep_default_na: bool, default True

    If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to.

displayed_only: bool, default True

    Whether elements with "display: none" should be parsed.

Returns

    dfs

        A list of DataFrames.
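
To illustrate the match and attrs parameters described above, here is a small sketch that parses a hand-written HTML snippet rather than the live Wikipedia page. The sample markup is made up for the example. The string is wrapped in io.StringIO because recent pandas versions deprecate passing literal HTML strings directly, and the sketch assumes a parser such as lxml is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables; only the second mentions "Growth".
sample_html = """
<table class="notice">
  <tr><th>Note</th></tr>
  <tr><td>This article needs to be updated.</td></tr>
</table>
<table class="wikitable">
  <tr><th>Year</th><th>Growth (real) (%)</th></tr>
  <tr><td>2000</td><td>3.841</td></tr>
  <tr><td>2001</td><td>4.824</td></tr>
</table>
"""

# match keeps only tables whose text matches the given string or regex.
growth_tables = pd.read_html(StringIO(sample_html), match="Growth")

# attrs keeps only tables whose HTML attributes match the dictionary.
wiki_tables = pd.read_html(StringIO(sample_html), attrs={"class": "wikitable"})

growth_df = growth_tables[0]
print(len(growth_tables), list(growth_df.columns))
```

Filtering this way returns a shorter list, so there is less to search through than when every table on the page is returned.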

In [6]:
dfs = pd.read_html(html.text)
dfs
Out[6]:
[    0                                                  1
 0 NaN  This article has multiple issues. Please help ...
 1 NaN  This article needs to be updated. Please help ...
 2 NaN  The lead section of this article may need to b...
 3 NaN  This article's lead section may be too long fo...,
     0                                                  1
 0 NaN  This article needs to be updated. Please help ...,
     0                                                  1
 0 NaN  The lead section of this article may need to b...,
     0                                                  1
 0 NaN  This article's lead section may be too long fo...,
     Year  Growth (real) (%)
 0   2000              3.841
 1   2001              4.824
 2   2002              3.804
 3   2003              7.860
 4   2004              7.923
 5   2005              7.923
 6   2006              8.061
 7   2007              7.661
 8   2008              3.087
 9   2009              7.862
 10  2010              8.498
 11  2011              5.241
 12  2012              5.456
 13  2013              6.386
 14  2014              7.410
 15  2015              7.996
 16  2016              8.170
 17  2017              7.168
 18  2018              6.982,
    World Rank                          Company  Logo                Industry  \
 0         142              Reliance Industries   NaN    Oil & Gas Operations   
 1         152              State Bank of India   NaN                 Banking   
 2         183  Oil and Natural Gas Corporation   NaN    Oil & Gas Operations   
 3         263                      Tata Motors   NaN                     NaN   
 4         283                       ICICI Bank   NaN                 Banking   
 5         431                             NTPC   NaN               Utilities   
 6         463                       Tata Steel   NaN               Materials   
 7         349           Indian Oil Corporation   NaN    Oil & Gas Operations   
 8         485                             HDFC   NaN                 Banking   
 9         485                              TCS   NaN  Information Technology   
 
    Revenue (billion $)  Profits (billion $)  Assets (billion $)  \
 0                71.70                 3.70               76.60   
 1                40.80                 2.30              400.60   
 2                28.70                 4.40               59.30   
 3                42.30                 2.70               34.70   
 4                14.20                 1.90              124.80   
 5                12.90                 1.90               35.40   
 6                32.77                 3.08               31.16   
 7                74.30                 1.20               44.70   
 8                 8.40                 1.40               84.30   
 9                15.10                 3.50               11.00   
 
    Market Value (billion $)  
 0                     42.90  
 1                     33.00  
 2                     43.70  
 3                     28.80  
 4                     30.00  
 5                     20.20  
 6                      2.46  
 7                     14.60  
 8                     41.60  
 9                     80.30  ,
    .mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}...vte Economy of India  \
 0                                           Companies
 1                                          Governance
 2                                            Currency
 3                                  Financial services
 4                                             History
 5                                              People
 6                                              States
 7                                             Sectors
 8                                           Regulator
 9                                               Other
 10                      Category  Commons  Wikiquotes

    .mw-parser-output .navbar{display:inline;font-size:88%;font-weight:normal}...vte Economy of India.1
 0   BSE SENSEX NIFTY 50 Government-owned companies...
 1   Ministry of Finance Finance ministers Ministry...
 2   Indian rupee Sign History Historical Forex Coi...
 3   Banking Banks Insurance Multi Commodity Exchan...
 4   COVID-19 impact Economic development Liberalis...
 5   Billionaires Businesspeople Demography Income ...
 6   Andhra Pradesh Assam Bihar Delhi Goa Gujarat H...
 7   Agriculture Livestock Fishing Automotive Defen...
 8                                      IRDAI RBI SEBI
 9   NCLT NCLAT BIFR IBBI IBC SARFESI Act Income Ta...
 10                      Category  Commons  Wikiquotes  ]

read_html returned several tables. Most of them are article notices and navigation boxes, so we need to pick out the one that holds the growth data.

Get the number of tables and find the correct index

In [7]:
len(dfs)
Out[7]:
7

After checking, the table at index 4 contains the data we need.
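
Rather than reading through the printed output to find the right index, the table can also be located by its column names. The sketch below assumes a hypothetical helper, find_table_index, and uses two small stand-in dataframes in place of the real dfs list.

```python
import pandas as pd


def find_table_index(tables, required_columns):
    """Return the index of the first table containing all required columns.

    Hypothetical helper, not part of pandas. Returns None if nothing matches.
    """
    required = set(required_columns)
    for index, table in enumerate(tables):
        # Column labels may be integers (0, 1, ...) when a table has no header.
        if required.issubset(set(table.columns)):
            return index
    return None


# Stand-ins for the list returned by pd.read_html().
tables = [
    pd.DataFrame({0: [float("nan")], 1: ["This article needs to be updated."]}),
    pd.DataFrame({"Year": [2000, 2001], "Growth (real) (%)": [3.841, 4.824]}),
]

growth_index = find_table_index(tables, ["Year", "Growth (real) (%)"])
print(growth_index)
```

Applied to the real list, find_table_index(dfs, ["Year", "Growth (real) (%)"]) would be expected to return 4, matching the manual check.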

In [8]:
dfs[4].head()
Out[8]:
   Year  Growth (real) (%)
0  2000              3.841
1  2001              4.824
2  2002              3.804
3  2003              7.860
4  2004              7.923

Let us save it in a new dataframe

In [9]:
df = dfs[4]
df.head()
Out[9]:
   Year  Growth (real) (%)
0  2000              3.841
1  2001              4.824
2  2002              3.804
3  2003              7.860
4  2004              7.923
In [10]:
df.shape
Out[10]:
(19, 2)
In [11]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               19 non-null     int64  
 1   Growth (real) (%)  19 non-null     float64
dtypes: float64(1), int64(1)
memory usage: 432.0 bytes
In [12]:
df.columns
Out[12]:
Index(['Year', 'Growth (real) (%)'], dtype='object')

Plot Year and Growth columns

In [13]:
import matplotlib.pyplot as plt
In [14]:
plt.figure(figsize=(10,6))
plt.plot(df['Year'], df['Growth (real) (%)'])

plt.title('GDP Growth Rate in %', color = "red", fontsize = 20)
plt.xlabel("Year")
plt.ylabel("Growth (real) (%)")
plt.show()

Bar chart

In [15]:
plt.figure(figsize=(10,6))
plt.bar(df['Year'], df['Growth (real) (%)'])

plt.title('GDP Growth Rate in %', color = "red", fontsize = 20)
plt.xlabel("Year")
plt.ylabel("Growth (real) (%)")
plt.show()

Pie chart

In [16]:
plt.figure(figsize=(15,8))
plt.pie(df['Growth (real) (%)'], labels = df['Year'], autopct='%1.1f%%')

plt.title('GDP Growth Rate in %', color = "red", fontsize = 20)
plt.legend(loc ="center right", fontsize = 10)
plt.show()

Take another HTML table and convert it into a dataframe

In [17]:
dfs[5].head()
Out[17]:
   World Rank                          Company  Logo              Industry  Revenue (billion $)  Profits (billion $)  Assets (billion $)  Market Value (billion $)
0         142              Reliance Industries   NaN  Oil & Gas Operations                 71.7                  3.7                76.6                      42.9
1         152              State Bank of India   NaN               Banking                 40.8                  2.3               400.6                      33.0
2         183  Oil and Natural Gas Corporation   NaN  Oil & Gas Operations                 28.7                  4.4                59.3                      43.7
3         263                      Tata Motors   NaN                   NaN                 42.3                  2.7                34.7                      28.8
4         283                       ICICI Bank   NaN               Banking                 14.2                  1.9               124.8                      30.0
In [18]:
df1 = dfs[5]
In [19]:
df1.head(10)
Out[19]:
   World Rank                          Company  Logo                Industry  Revenue (billion $)  Profits (billion $)  Assets (billion $)  Market Value (billion $)
0         142              Reliance Industries   NaN    Oil & Gas Operations                71.70                 3.70               76.60                     42.90
1         152              State Bank of India   NaN                 Banking                40.80                 2.30              400.60                     33.00
2         183  Oil and Natural Gas Corporation   NaN    Oil & Gas Operations                28.70                 4.40               59.30                     43.70
3         263                      Tata Motors   NaN                     NaN                42.30                 2.70               34.70                     28.80
4         283                       ICICI Bank   NaN                 Banking                14.20                 1.90              124.80                     30.00
5         431                             NTPC   NaN               Utilities                12.90                 1.90               35.40                     20.20
6         463                       Tata Steel   NaN               Materials                32.77                 3.08               31.16                      2.46
7         349           Indian Oil Corporation   NaN    Oil & Gas Operations                74.30                 1.20               44.70                     14.60
8         485                             HDFC   NaN                 Banking                 8.40                 1.40               84.30                     41.60
9         485                              TCS   NaN  Information Technology                15.10                 3.50               11.00                     80.30
In [20]:
df1.shape
Out[20]:
(10, 8)
In [21]:
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   World Rank                10 non-null     int64  
 1   Company                   10 non-null     object 
 2   Logo                      0 non-null      float64
 3   Industry                  9 non-null      object 
 4   Revenue (billion $)       10 non-null     float64
 5   Profits (billion $)       10 non-null     float64
 6   Assets (billion $)        10 non-null     float64
 7   Market Value (billion $)  10 non-null     float64
dtypes: float64(5), int64(1), object(2)
memory usage: 768.0+ bytes

Plot 10 leading companies by assets

In [22]:
plt.figure(figsize=(12,8))
plt.bar(df1['Company'], df1['Assets (billion $)'])

plt.title('10 Leading Companies by Assets', fontsize = 20)
plt.xlabel("Companies", fontsize = 20)
plt.ylabel("Assets in billion '$'", fontsize = 20)
plt.xticks(rotation=45, ha='right')
plt.show()

Pie chart

In [23]:
plt.figure(figsize=(15,8))
plt.pie(df1['Assets (billion $)'], labels = df1['Company'], autopct='%1.1f%%')

plt.legend(title = "Companies", loc ="upper right")
plt.title('10 Leading Companies by Assets', fontsize = 22)

plt.show()

We can see that, measured by assets, "State Bank of India" is the largest of these companies.
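The same reading can be checked numerically instead of by eye. The sketch below is only an illustration: it rebuilds the Company and Assets columns by hand from the table above, and the names `assets`, `largest` and `share` are hypothetical helpers that do not appear in the notebook. It uses idxmax() to find the row with the highest assets and then computes that company's share of the total.

```python
import pandas as pd

# Values copied by hand from the table above (assumption: only two columns are needed here).
assets = pd.DataFrame({
    "Company": [
        "Reliance Industries", "State Bank of India",
        "Oil and Natural Gas Corporation", "Tata Motors", "ICICI Bank",
        "NTPC", "Tata Steel", "Indian Oil Corporation", "HDFC", "TCS",
    ],
    "Assets (billion $)": [
        76.60, 400.60, 59.30, 34.70, 124.80,
        35.40, 31.16, 44.70, 84.30, 11.00,
    ],
})

# idxmax() returns the index label of the largest value; .loc fetches that row.
largest = assets.loc[assets["Assets (billion $)"].idxmax()]

# Share of the ten companies' combined assets, in percent.
share = largest["Assets (billion $)"] / assets["Assets (billion $)"].sum() * 100

print(largest["Company"], round(share, 1))
```

In the notebook itself, the same two lines should work on df1 directly, without rebuilding the data.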

Plot industries by market value

First we need to check whether the Industry column contains any NaN values.

In [24]:
df1['Industry'].isna().sum()
Out[24]:
1
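The count tells us that one value is missing, but not which row it belongs to. A boolean filter on isna() is one way to find out. The snippet below is a minimal sketch on a hand-built sample (the `sample` frame and the `missing` name are hypothetical and only mirror four rows of the table above).

```python
import numpy as np
import pandas as pd

# Small stand-in for df1, with one missing Industry value as in the scraped table.
sample = pd.DataFrame({
    "Company": ["Reliance Industries", "State Bank of India", "Tata Motors", "ICICI Bank"],
    "Industry": ["Oil & Gas Operations", "Banking", np.nan, "Banking"],
})

# Keep only the rows where Industry is NaN.
missing = sample[sample["Industry"].isna()]

print(missing["Company"].tolist())
```

Applied to the notebook's data, the equivalent filter would be df1[df1['Industry'].isna()].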

Looking at the data, the company "Tata Motors" has no Industry value, so let's fill it in.

In [25]:
import numpy as np
In [26]:
# Set Industry to 'Automotive' for Tata Motors and keep every other row unchanged
df1['Industry'] = np.where(df1['Company'] == 'Tata Motors', 'Automotive', df1['Industry'])
df1['Industry']
Out[26]:
0      Oil & Gas Operations
1                   Banking
2      Oil & Gas Operations
3                Automotive
4                   Banking
5                 Utilities
6                 Materials
7      Oil & Gas Operations
8                   Banking
9    Information Technology
Name: Industry, dtype: object
In [27]:
df1
Out[27]:
   World Rank                          Company  Logo                Industry  Revenue (billion $)  Profits (billion $)  Assets (billion $)  Market Value (billion $)
0         142              Reliance Industries   NaN    Oil & Gas Operations                71.70                 3.70               76.60                     42.90
1         152              State Bank of India   NaN                 Banking                40.80                 2.30              400.60                     33.00
2         183  Oil and Natural Gas Corporation   NaN    Oil & Gas Operations                28.70                 4.40               59.30                     43.70
3         263                      Tata Motors   NaN              Automotive                42.30                 2.70               34.70                     28.80
4         283                       ICICI Bank   NaN                 Banking                14.20                 1.90              124.80                     30.00
5         431                             NTPC   NaN               Utilities                12.90                 1.90               35.40                     20.20
6         463                       Tata Steel   NaN               Materials                32.77                 3.08               31.16                      2.46
7         349           Indian Oil Corporation   NaN    Oil & Gas Operations                74.30                 1.20               44.70                     14.60
8         485                             HDFC   NaN                 Banking                 8.40                 1.40               84.30                     41.60
9         485                              TCS   NaN  Information Technology                15.10                 3.50               11.00                     80.30
In [28]:
plt.figure(figsize=(10,6))

plt.bar(df1['Industry'], df1['Market Value (billion $)'])

plt.title('Market Value (billion $) by Industry', fontsize = 22)
plt.xlabel("Industry", fontsize = 20)
plt.ylabel("Market Value (billion $)", fontsize = 20)
plt.show()
In [29]:
plt.figure(figsize=(15,8))
plt.pie(df1['Market Value (billion $)'], labels = df1['Industry'], autopct='%1.1f%%')

plt.legend(loc ="upper right", fontsize = 10)
plt.title('Market Value (billion $) by Industry', fontsize = 22)

plt.show()
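One caveat with the two plots above: several companies share an industry, so the Industry column contains repeated labels. With repeated category labels, plt.bar() draws the bars on top of each other at the same position instead of adding them up, and plt.pie() draws one wedge per company. If the goal is a total per industry, the values could be aggregated first. The sketch below is an assumption about what such a step might look like, not part of the original notebook: it rebuilds the two relevant columns by hand from the table above (`market` and `by_industry` are hypothetical names) and sums market value per industry with groupby().

```python
import pandas as pd

# Industry and Market Value columns copied by hand from the table above,
# with Tata Motors already labelled 'Automotive'.
market = pd.DataFrame({
    "Industry": [
        "Oil & Gas Operations", "Banking", "Oil & Gas Operations", "Automotive",
        "Banking", "Utilities", "Materials", "Oil & Gas Operations",
        "Banking", "Information Technology",
    ],
    "Market Value (billion $)": [
        42.90, 33.00, 43.70, 28.80, 30.00,
        20.20, 2.46, 14.60, 41.60, 80.30,
    ],
})

# Sum the market value of all companies in each industry, largest first.
by_industry = (
    market.groupby("Industry")["Market Value (billion $)"]
    .sum()
    .sort_values(ascending=False)
)

print(by_industry)
```

The resulting Series has one entry per industry, so plt.bar(by_industry.index, by_industry.values) or plt.pie(by_industry.values, labels=by_industry.index) would then show six bars or wedges rather than ten.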