Basic web scraping with BeautifulSoup4

Introduction

BeautifulSoup is a Python library for pulling data out of HTML and XML files. It provides idiomatic ways of navigating, searching, and modifying the parse tree, which makes it useful for scraping websites and extracting information. For example, you can use BeautifulSoup to extract product reviews from Amazon and gauge the overall sentiment toward a particular type of product.

The examples below scrape this practice page: https://pythonprogramming.net/parsememcparseface/

In [1]:
import bs4 as bs
import urllib.request

Fetching the source code of the webpage and parsing it

In [2]:
sauce = urllib.request.urlopen('https://pythonprogramming.net/parsememcparseface/').read()
In [3]:
soup = bs.BeautifulSoup(sauce,'lxml')
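
The two cells above can be wrapped in a small helper that adds a timeout and basic error handling. This is only a sketch: the name fetch_soup, the User-Agent string, and the default values are hypothetical choices, and it defaults to Python's built-in 'html.parser' instead of 'lxml' so that it runs without extra packages.

```python
import urllib.error
import urllib.request

import bs4 as bs


def fetch_soup(url, parser="html.parser", timeout=10):
    """Download `url` and return a BeautifulSoup tree, or None on failure.

    The function name, User-Agent value, and defaults are hypothetical.
    """
    request = urllib.request.Request(
        url, headers={"User-Agent": "bs4-tutorial-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            html = response.read()
    except urllib.error.URLError as error:
        # HTTPError subclasses URLError, so HTTP error statuses and
        # connection failures both end up here.
        print(f"Could not fetch {url}: {error}")
        return None
    return bs.BeautifulSoup(html, parser)
```

Calling fetch_soup('https://pythonprogramming.net/parsememcparseface/', parser='lxml') should give a tree equivalent to the soup object built above, assuming lxml is installed.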

Print title of webpage

In [4]:
soup.title.text
Out[4]:
'Python Programming Tutorials'

Print first paragraph

In [5]:
soup.p.text
Out[5]:
'Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.'
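
Attribute access such as soup.title or soup.p returns only the first matching tag, and returns None when no such tag exists. The sketch below uses a small inline HTML string (made up for illustration) to show both behaviours.

```python
import bs4 as bs

# Hypothetical HTML used only for this illustration.
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p>First</p><p>Second</p></body></html>"
)
soup = bs.BeautifulSoup(html, "html.parser")

# soup.p is shorthand for soup.find('p'): both return the first <p> tag.
first = soup.find("p")
print(first.text)   # First
print(soup.p.text)  # First

# A tag that is not in the document comes back as None,
# so check before reading .text to avoid an AttributeError.
table = soup.table
table_text = table.text if table is not None else ""
```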

Print all paragraphs

In [6]:
for paragraph in soup.find_all('p'):
    print(paragraph.text)
Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.
The following table gives some general information for the following programming languages:
I think it's clear that, on a scale of 1-10, python is:
Javascript (dynamic data) test:
y u bad tho?
Whᶐt hαppéns now¿
sitemap

Cancel  
						Login


Cancel  
								Sign Up

Contact: Harrison@pythonprogramming.net.
Programming is a superpower.

Not all of the text on the page is inside 'p' tags; some of it is inside 'pre' tags, for example. Here is how to extract the text of the whole webpage:

In [7]:
print(soup.get_text())
Python Programming Tutorials

search

Home
+=1
Store
Community
Log in
Sign up

Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.
The following table gives some general information for the following programming languages:

Python
Pascal
Lisp
D#
Cobol
Fortran
Haskell



Program Name
Internet Points
Kittens?


Python
932914021
Definitely


Pascal
532
Unlikely


Lisp
1522
Uncertain


D#
12
Possibly


Cobol
3
No.


Fortran
52124
Yes.


Haskell
24
lol.


I think it's clear that, on a scale of 1-10, python is:

Javascript (dynamic data) test:
y u bad tho?

     document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
  

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Whᶐt hαppéns now¿
sitemap




Login


Username

Password


Cancel  
						Login

Legal stuff:

Terms and Conditions
Privacy Policy


Programming is a superpower.


            © OVER 9000! PythonProgramming.net
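
The raw get_text() output above contains many blank lines and a line of JavaScript. One possible way to tidy it up, sketched here on a made-up HTML snippet, is to remove script and style tags first and then drop the empty lines. Depending on the BeautifulSoup version, script contents may already be excluded from get_text(), in which case the removal step is simply harmless.

```python
import bs4 as bs

# Hypothetical HTML used only for this illustration.
html = """
<html><body>
  <p>Visible paragraph.</p>

  <script>document.title = 'hidden';</script>
  <pre>Beautiful is better than ugly.</pre>
</body></html>
"""
soup = bs.BeautifulSoup(html, "html.parser")

# Remove elements whose contents are code or styling rather than page text.
for tag in soup.find_all(["script", "style"]):
    tag.decompose()

# Strip each line and keep only the non-empty ones.
lines = (line.strip() for line in soup.get_text().splitlines())
clean_text = "\n".join(line for line in lines if line)
print(clean_text)
```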

Print all hyperlinks on the webpage

In [8]:
for url in soup.find_all('a'):
    print(url.get('href'))
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
/sitemap.xml
#
#
#
/support-donate/
/consulting/
https://www.facebook.com/pythonprogramming.net/
https://plus.google.com/+sentdex
/about/tos/
/about/privacy-policy/
https://xkcd.com/353/
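
Most of the links printed above are relative ('/community/'), some are placeholders ('#'), and several are repeated. A minimal sketch of turning them into a clean list of absolute URLs, using urllib.parse.urljoin from the standard library on a made-up set of anchors:

```python
from urllib.parse import urljoin

import bs4 as bs

base_url = "https://pythonprogramming.net/parsememcparseface/"

# Hypothetical anchors imitating the kinds of links found on the page.
html = (
    '<a href="/">Home</a><a href="#">Top</a>'
    '<a href="/community/">Community</a>'
    '<a href="https://xkcd.com/353/">xkcd</a>'
    '<a>No href</a><a href="/">Home again</a>'
)
soup = bs.BeautifulSoup(html, "html.parser")

absolute_links = []
for anchor in soup.find_all("a", href=True):  # skips <a> tags without href
    href = anchor["href"]
    if href.startswith("#"):
        continue  # in-page placeholder, not a real destination
    full_url = urljoin(base_url, href)
    if full_url not in absolute_links:  # drop duplicates, keep order
        absolute_links.append(full_url)

print(absolute_links)
```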

Navigation bar links

In [10]:
nav = soup.nav
In [12]:
for url in nav.find_all('a'):
    print(url.get('href'))
/
#
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
/
/+=1/
/store/python-hoodie/
/community/
/login/
/register/
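
Calling find_all on a tag rather than on the whole soup limits the search to that tag's descendants. The same restriction can be written as a CSS selector with select, assuming a BeautifulSoup version recent enough to ship with CSS selector support. The HTML below is made up for illustration.

```python
import bs4 as bs

# Hypothetical HTML: two links inside <nav>, one outside it.
html = """
<nav><a href="/">Home</a><a href="/store/">Store</a></nav>
<div><a href="/about/">About</a></div>
"""
soup = bs.BeautifulSoup(html, "html.parser")

# Search only inside the first <nav> tag.
nav_links = [a.get("href") for a in soup.nav.find_all("a")]

# Equivalent CSS selector: every <a> that is a descendant of a <nav>.
css_links = [a.get("href") for a in soup.select("nav a")]

print(nav_links)  # ['/', '/store/']
print(css_links)  # ['/', '/store/']
```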

Body tag

In [13]:
body = soup.body
In [14]:
for paragraph in body.find_all('p'):
    print(paragraph.text)
Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.
The following table gives some general information for the following programming languages:
I think it's clear that, on a scale of 1-10, python is:
Javascript (dynamic data) test:
y u bad tho?
Whᶐt hαppéns now¿
sitemap

Return everything within the div elements that have the class 'body'

In [15]:
for div in soup.find_all('div',class_='body'):
    print(div.text)
Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.
The following table gives some general information for the following programming languages:

Python
Pascal
Lisp
D#
Cobol
Fortran
Haskell



Program Name
Internet Points
Kittens?


Python
932914021
Definitely


Pascal
532
Unlikely


Lisp
1522
Uncertain


D#
12
Possibly


Cobol
3
No.


Fortran
52124
Yes.


Haskell
24
lol.


I think it's clear that, on a scale of 1-10, python is:

Javascript (dynamic data) test:
y u bad tho?

     document.getElementById('yesnojs').innerHTML = 'Look at you shinin!';
  

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!
Whᶐt hαppéns now¿
sitemap
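
The keyword is spelled class_ because class is a reserved word in Python. It matches any div whose class attribute contains 'body', even when other classes are present. A short sketch on made-up HTML:

```python
import bs4 as bs

# Hypothetical HTML used only for this illustration.
html = """
<div class="body">Main text</div>
<div class="body highlight">Also matched</div>
<div class="footer">Not matched</div>
"""
soup = bs.BeautifulSoup(html, "html.parser")

# class_ matches against each individual class of a multi-class attribute.
by_keyword = [div.text for div in soup.find_all("div", class_="body")]

# The attrs dictionary form is equivalent and avoids the trailing underscore.
by_attrs = [div.text for div in soup.find_all("div", attrs={"class": "body"})]

print(by_keyword)  # ['Main text', 'Also matched']
```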

Scraping tables

tr tag: table row

th tag: table header

td tag: table data

In [17]:
table = soup.table
In [18]:
table
In [19]:
table_rows = table.find_all('tr')
In [20]:
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
[]
['Python', '932914021', 'Definitely']
['Pascal', '532', 'Unlikely']
['Lisp', '1522', 'Uncertain']
['D#', '12', 'Possibly']
['Cobol', '3', 'No.']
['Fortran', '52124', 'Yes.']
['Haskell', '24', 'lol.']
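
The empty list at the top of the output comes from the header row, whose cells are th tags rather than td tags. One way to keep the header, sketched on a shortened copy of the table, is to search for both tag names in each row.

```python
import bs4 as bs

# Shortened copy of the table from the practice page.
html = """
<table>
  <tr><th>Program Name</th><th>Internet Points</th><th>Kittens?</th></tr>
  <tr><td>Python</td><td>932914021</td><td>Definitely</td></tr>
  <tr><td>Pascal</td><td>532</td><td>Unlikely</td></tr>
</table>
"""
soup = bs.BeautifulSoup(html, "html.parser")

rows = []
for tr in soup.table.find_all("tr"):
    # Passing a list matches either tag name, so header cells are included.
    cells = tr.find_all(["th", "td"])
    rows.append([cell.get_text(strip=True) for cell in cells])

# The first row is the header; the rest are data rows.
header, *data = rows
print(header)
print(data)
```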

You can also scrape tables using pandas

In [21]:
import pandas as pd
In [24]:
dfs = pd.read_html('https://pythonprogramming.net/parsememcparseface/', header=0)
In [25]:
for df in dfs:
    print(df)
  Program Name  Internet Points    Kittens?
0       Python        932914021  Definitely
1       Pascal              532    Unlikely
2         Lisp             1522   Uncertain
3           D#               12    Possibly
4        Cobol                3         No.
5      Fortran            52124        Yes.
6      Haskell               24        lol.
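
pd.read_html relies on an HTML parser backend such as lxml being installed. Where that is not available, rows extracted with BeautifulSoup can be loaded into a DataFrame by hand. The sketch below assumes the header and rows have already been scraped as lists of strings (only three rows are reproduced here) and converts the numeric column so it can be sorted or summed.

```python
import pandas as pd

# Values as they would come out of a BeautifulSoup table scrape: all strings.
header = ["Program Name", "Internet Points", "Kittens?"]
rows = [
    ["Python", "932914021", "Definitely"],
    ["Pascal", "532", "Unlikely"],
    ["Lisp", "1522", "Uncertain"],
]

df = pd.DataFrame(rows, columns=header)

# Scraped cells are text, so convert the numeric column explicitly.
df["Internet Points"] = pd.to_numeric(df["Internet Points"])

# With a numeric dtype, sorting is by value rather than alphabetical.
top_program = df.sort_values("Internet Points", ascending=False).iloc[0]["Program Name"]
print(top_program)  # Python
```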

Scraping XML

The sitemap is an XML file where you can get all the links on the site.

In [26]:
sauce_xml = urllib.request.urlopen('https://pythonprogramming.net/sitemap.xml').read()
In [29]:
soup_xml = bs.BeautifulSoup(sauce_xml,'xml')
In [31]:
for url in soup_xml.find_all('loc'):
    print(url.text)
https://pythonprogramming.net/machine-learning-clustering-introduction-machine-learning-tutorial/
https://pythonprogramming.net/targets-for-machine-learning-labels-python-programming-for-finance/
https://pythonprogramming.net/preprocessing-for-machine-learning-python-programming-for-finance/
https://pythonprogramming.net/combining-alpha-factors-quantopian-python-programming-for-finance/
https://pythonprogramming.net/encryption-and-decryption-in-python-code-example-with-explanation/
https://pythonprogramming.net/r-squared-coefficient-of-determination-machine-learning-tutorial/
https://pythonprogramming.net/hierarchical-clustering-machine-learning-python-scikit-learn/
https://pythonprogramming.net/python-programming-finance-machine-learning-classifier-sets/
https://pythonprogramming.net/support-vector-machine-parameters-machine-learning-tutorial/
https://pythonprogramming.net/tensorflow-neural-network-session-machine-learning-tutorial/
https://pythonprogramming.net/recurrent-neural-network-rnn-lstm-machine-learning-tutorial/
https://pythonprogramming.net/training-self-driving-car-neural-network-python-plays-gta-v/
https://pythonprogramming.net/handling-stock-data-graphing-python-programming-for-finance/
https://pythonprogramming.net/more-stock-data-manipulation-python-programming-for-finance/
https://pythonprogramming.net/tensorflow-deep-neural-network-machine-learning-tutorial/
https://pythonprogramming.net/parsing-comments-python-reddit-api-wrapper-praw-tutorial/
https://pythonprogramming.net/values-from-multiprocessing-intermediate-python-tutorial/
https://pythonprogramming.net/creating-pygame-environment-intermediate-python-tutorial/
https://pythonprogramming.net/headlines-function-alexa-skill-flask-ask-python-tutorial/
https://pythonprogramming.net/scikit-learn-tutorials-machine-learning-python-investing/
https://pythonprogramming.net/training-dataset-chatbot-deep-learning-python-tensorflow/
https://pythonprogramming.net/soft-margin-kernel-cvxopt-svm-machine-learning-tutorial/
https://pythonprogramming.net/weighted-bandwidth-mean-shift-machine-learning-tutorial/
https://pythonprogramming.net/sp500-company-price-data-python-programming-for-finance/
https://pythonprogramming.net/testing-deploying-alexa-skill-flask-ask-python-tutorial/
https://pythonprogramming.net/sqlite-part-2-dynamically-inserting-database-timestamps/
https://pythonprogramming.net/concatenate-append-data-analysis-python-pandas-tutorial/
https://pythonprogramming.net/rolling-statistics-data-analysis-python-pandas-tutorial/
https://pythonprogramming.net/graphing-matplotlib-python-part-3-colors-line-thickness/
https://pythonprogramming.net/python-pro
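
BeautifulSoup's 'xml' parser depends on lxml. As an alternative sketch that uses only the standard library, the same loc elements can be read with xml.etree.ElementTree instead of BeautifulSoup. The sitemap below is a made-up two-entry example with example.com URLs; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap used only for this illustration.
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

# ElementTree needs the namespace spelled out; 'sm' is an arbitrary prefix.
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(sitemap_xml)
locations = [loc.text for loc in root.findall("sm:url/sm:loc", NAMESPACE)]

for location in locations:
    print(location)
```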
