5/16/2020

Python - Web Scraping with BeautifulSoup and Requests


==========================================

C:\Users\purunet>pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading beautifulsoup4-4.9.0-py3-none-any.whl (109 kB)
     |████████████████████████████████| 109 kB 8.9 kB/s
Collecting soupsieve>1.2
  Downloading soupsieve-2.0-py2.py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.0 soupsieve-2.0

==========================================

C:\Users\purunet>pip install lxml
Collecting lxml
  Downloading lxml-4.5.0-cp38-cp38-win32.whl (3.3 MB)
     |████████████████████████████████| 3.3 MB 1.3 MB/s
Installing collected packages: lxml
Successfully installed lxml-4.5.0

==========================================

C:\Users\purunet>pip install html5lib
Collecting html5lib
  Downloading html5lib-1.0.1-py2.py3-none-any.whl (117 kB)
     |████████████████████████████████| 117 kB 233 kB/s
Collecting webencodings
  Downloading webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: six>=1.9 in c:\users\purunet\appdata\roaming\python\python38\site-packages (from html5lib) (1.14.0)
Installing collected packages: webencodings, html5lib
Successfully installed html5lib-1.0.1 webencodings-0.5.1

==========================================

C:\Users\purunet>pip install requests
Collecting requests
  Downloading requests-2.23.0-py2.py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 153 kB/s
Collecting idna<3,>=2.5
  Downloading idna-2.9-py2.py3-none-any.whl (58 kB)
     |████████████████████████████████| 58 kB 454 kB/s
Collecting certifi>=2017.4.17
  Downloading certifi-2020.4.5.1-py2.py3-none-any.whl (157 kB)
     |████████████████████████████████| 157 kB 547 kB/s
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.9-py2.py3-none-any.whl (126 kB)
     |████████████████████████████████| 126 kB 2.2 MB/s
Collecting chardet<4,>=3.0.2
  Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
     |████████████████████████████████| 133 kB 3.3 MB/s
Installing collected packages: idna, certifi, urllib3, chardet, requests
Successfully installed certifi-2020.4.5.1 chardet-3.0.4 idna-2.9 requests-2.23.0 urllib3-1.25.9

==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

print(soup)

---------------------------------




---------------------------------

==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

print(soup.prettify())

---------------------------------




---------------------------------


==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

match = soup.title

print(match)


---------------------------------

<title>Test - A Sample Website</title>

---------------------------------

==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

match = soup.title.text

print(match)

---------------------------------

Test - A Sample Website

---------------------------------

==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

match = soup.div

print(match)


---------------------------------




---------------------------------


==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

match = soup.find('div')

print(match)


---------------------------------




---------------------------------

==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

match = soup.find('div', class_='footer')

print(match)


---------------------------------




---------------------------------

==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

article = soup.find('div', class_='article')

print(article)

---------------------------------




---------------------------------

==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

article = soup.find('div', class_='article')

headline = article.h2.a.text

print(headline)

---------------------------------

Article 1 Headline

---------------------------------

==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

article = soup.find('div', class_='article')

headline = article.h2.a.text
print(headline)

summary = article.p.text
print(summary)

---------------------------------

Article 1 Headline
This is a summary of article 1

---------------------------------

==========================================

from bs4 import BeautifulSoup

import requests

with open('simple.html') as html_file:
    soup = BeautifulSoup(html_file, 'lxml')

# find_all returns every matching tag, so the loop visits each article div
for article in soup.find_all('div', class_='article'):
    headline = article.h2.a.text
    print(headline)

    summary = article.p.text
    print(summary)

    print()

---------------------------------

Article 1 Headline
This is a summary of article 1

Article 2 Headline
This is a summary of article 2


---------------------------------
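
The contents of simple.html are not shown in these notes, so the loop above cannot be run as-is. The sketch below uses a hypothetical stand-in document (SAMPLE_HTML and the extract_articles helper are assumptions, shaped only to match the printed output) and parses it from a string. It uses Python's built-in 'html.parser' rather than 'lxml' so that only beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for simple.html; the real file is not shown in these notes.
SAMPLE_HTML = """
<html>
<head><title>Test - A Sample Website</title></head>
<body>
  <div class="article">
    <h2><a href="article_1.html">Article 1 Headline</a></h2>
    <p>This is a summary of article 1</p>
  </div>
  <div class="article">
    <h2><a href="article_2.html">Article 2 Headline</a></h2>
    <p>This is a summary of article 2</p>
  </div>
  <div class="footer"><p>Footer Information</p></div>
</body>
</html>
"""


def extract_articles(html):
    """Return (headline, summary) pairs for every div with class 'article'."""
    # 'html.parser' ships with Python; the notes above use 'lxml' instead.
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for article in soup.find_all('div', class_='article'):
        results.append((article.h2.a.text, article.p.text))
    return results


articles = extract_articles(SAMPLE_HTML)
for headline, summary in articles:
    print(headline)
    print(summary)
    print()
```

Collecting the pairs in a list before printing keeps the parsing step separate from the output step, which makes it easier to write the results to a file later.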

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://linuxerhan.blogspot.com').text
soup = BeautifulSoup(source, 'lxml')

print(soup.prettify())

---------------------------------




---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://linuxerhan.blogspot.com').text
soup = BeautifulSoup(source, 'lxml')

span = soup.find('span')

print(span)
print('-'*80)
print(span.prettify())

---------------------------------




---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://linuxerhan.blogspot.com').text
soup = BeautifulSoup(source, 'lxml')

span = soup.find('span')

headline = span.a.text
print(headline)

---------------------------------

skip to main

---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')


article = soup.find('article')

headline = article.h2.a.text

print(headline)

---------------------------------

Python Tutorial: Zip Files – Creating and Extracting Zip Archives

---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')


article = soup.find('article')

print(article.prettify())

---------------------------------




---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')


article = soup.find('article')

summary = article.find('div', class_='entry-content').p.text

print(summary)

---------------------------------

In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…

---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')


article = soup.find('article')


vid_src = article.find('iframe', class_ = 'youtube-player')
print(vid_src)

---------------------------------




---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')


article = soup.find('article')


vid_src = article.find('iframe', class_ = 'youtube-player')['src']
print(vid_src)


---------------------------------

https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent


---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')


article = soup.find('article')


vid_src = article.find('iframe', class_ = 'youtube-player')['src']

vid_id = vid_src.split('/')
print(vid_id)

---------------------------------

['https:', '', 'www.youtube.com', 'embed', 'z0gguhEmWiY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent']

---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')


article = soup.find('article')


vid_src = article.find('iframe', class_ = 'youtube-player')['src']

vid_id = vid_src.split('/')[4]
print(vid_id)

---------------------------------

z0gguhEmWiY?version=3&rel=1&fs=1&autohide=2&showsearch=0&showinfo=1&iv_load_policy=1&wmode=transparent

---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')


article = soup.find('article')


vid_src = article.find('iframe', class_ = 'youtube-player')['src']

vid_id = vid_src.split('/')[4]

vid_id = vid_id.split('?')[0]

print(vid_id)


---------------------------------

z0gguhEmWiY

---------------------------------

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')


article = soup.find('article')


vid_src = article.find('iframe', class_ = 'youtube-player')['src']

vid_id = vid_src.split('/')[4]

vid_id = vid_id.split('?')[0]


yt_link = f'https://youtube.com/watch?v={vid_id}'
print(yt_link)

---------------------------------

https://youtube.com/watch?v=z0gguhEmWiY

---------------------------------
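
The two split() calls above depend on the video ID sitting at index 4 of the slash-separated URL. As an alternative sketch (not the approach used in these notes), the standard library's urllib.parse can separate the path from the query string first. The helper name video_id_from_embed is an assumption, and the embed URL is a shortened copy of the one printed earlier.

```python
from urllib.parse import urlsplit


def video_id_from_embed(src):
    """Return the last path segment of an embed URL, ignoring the query string."""
    path = urlsplit(src).path            # e.g. '/embed/z0gguhEmWiY'
    return path.rstrip('/').split('/')[-1]


embed_src = 'https://www.youtube.com/embed/z0gguhEmWiY?version=3&rel=1&fs=1'
vid_id = video_id_from_embed(embed_src)
watch_link = f'https://youtube.com/watch?v={vid_id}'

print(vid_id)
print(watch_link)
```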

==========================================

from bs4 import BeautifulSoup
import requests

source = requests.get('http://coreyms.com').text
soup = BeautifulSoup(source, 'lxml')

for article in soup.find_all('article'):
    headline = article.h2.a.text
    print(headline)

    summary = article.find('div', class_='entry-content').p.text
    print(summary)

    # Posts without an embedded video have no matching iframe,
    # so the lookup fails and the link falls back to None.
    try:
        vid_src = article.find('iframe', class_='youtube-player')['src']
        vid_id = vid_src.split('/')[4]
        vid_id = vid_id.split('?')[0]

        yt_link = f'https://youtube.com/watch?v={vid_id}'
    except Exception as e:
        yt_link = None

    print(yt_link)

    print()

---------------------------------

Python Tutorial: Zip Files – Creating and Extracting Zip Archives
In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
https://youtube.com/watch?v=z0gguhEmWiY

Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey
In this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…
https://youtube.com/watch?v=_P7X8tMplsw

Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module
In this Python Programming video, we will be learning how to run code in parallel using the multiprocessing module. We will also look at how to process multiple high-resolution images at the same time using a ProcessPoolExecutor from the concurrent.futures module. Let’s get started…
https://youtube.com/watch?v=fKl2JW_qrso

Python Threading Tutorial: Run Code Concurrently Using the Threading Module
In this Python Programming video, we will be learning how to run threads concurrently using the threading module. We will also look at how to download multiple high-resolution images online using a ThreadPoolExecutor from the concurrent.futures module. Let’s get started…
https://youtube.com/watch?v=IEEhzQoKtQU

Update (2019-09-03)
Hey everyone. I wanted to give you an update on my videos. I will be releasing videos on threading and multiprocessing within the next week. Thanks so much for your patience. I currently have a temporary recording studio setup at my Airbnb that will allow me to record and edit the threading/multiprocessing videos. I am going to be moving into my new house in 10 days and once I have my recording studio setup then you can expect much faster video releases. I really appreciate how patient everyone has been while I go through this move, especially those of you who are contributing monthly through YouTube
None

Python Quick Tip: The Difference Between “==” and “is” (Equality vs Identity)
In this Python Programming Tutorial, we will be learning the difference between using “==” and the “is” keyword when doing comparisons. The difference between these is that “==” checks to see if values are equal, and the “is” keyword checks their identity, which means it’s going to check if the values are identical in terms of being the same object in memory. We’ll learn more in the video. Let’s get started…
https://youtube.com/watch?v=mO_dS3rXDIs

Python Tutorial: Calling External Commands Using the Subprocess Module
In this Python Programming Tutorial, we will be learning how to run external commands using the subprocess module from the standard library. We will learn how to run commands, capture the output, handle errors, and also how to pipe output into other commands. Let’s get started…
https://youtube.com/watch?v=2Fp1N6dof0Y

Visual Studio Code (Windows) – Setting up a Python Development Environment and Complete Overview
In this Python Programming Tutorial, we will be learning how to set up a Python development environment in VSCode on Windows. VSCode is a very nice free editor for writing Python applications and many developers are now switching over to this editor. In this video, we will learn how to install VSCode, get the Python extension installed, how to change Python interpreters, create virtual environments, format/lint our code, how to use Git within VSCode, how to debug our programs, how unit testing works, and more. We have a lot to cover, so let’s go ahead and get started…
https://youtube.com/watch?v=-nh9rCzPJ20

Visual Studio Code (Mac) – Setting up a Python Development Environment and Complete Overview
In this Python Programming Tutorial, we will be learning how to set up a Python development environment in VSCode on MacOS. VSCode is a very nice free editor for writing Python applications and many developers are now switching over to this editor. In this video, we will learn how to install VSCode, get the Python extension installed, how to change Python interpreters, create virtual environments, format/lint our code, how to use Git within VSCode, how to debug our programs, how unit testing works, and more. We have a lot to cover, so let’s go ahead and get started…
https://youtube.com/watch?v=06I63_p-2A4

Clarifying the Issues with Mutable Default Arguments
In this Python Programming Tutorial, we will be clarifying the issues with mutable default arguments. We discussed this in my last video titled “5 Common Python Mistakes and How to Fix Them”, but I received many comments from people who were still confused. So we will be doing a deeper dive to explain exactly what is going on here. Let’s get started…
https://youtube.com/watch?v=_JGmemuINww

---------------------------------
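
In the loop above, find() returns None for a post without a matching iframe, and indexing None with ['src'] raises a TypeError, which the broad except turns into yt_link = None (the "Update (2019-09-03)" entry). A minimal sketch of the same fallback with explicit checks follows. youtube_link is a hypothetical helper, and plain dicts stand in for BeautifulSoup Tag objects here, on the assumption that only their .get() method is needed.

```python
def youtube_link(iframe):
    """Build a watch URL from an iframe-like object, or return None."""
    if iframe is None:                 # find() found no matching tag
        return None
    src = iframe.get('src')            # Tag objects and dicts both offer .get()
    if not src:
        return None
    parts = src.split('/')
    if len(parts) < 5:                 # not shaped like https://host/embed/<id>
        return None
    vid_id = parts[4].split('?')[0]
    return f'https://youtube.com/watch?v={vid_id}'


# Dict stand-ins for the iframe tags found in each article
with_video = {'src': 'https://www.youtube.com/embed/IEEhzQoKtQU?version=3'}
print(youtube_link(with_video))
print(youtube_link(None))
```

Explicit checks make it clear which situations yield None, whereas except Exception would also hide unrelated errors such as a typo in a variable name.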

==========================================

from bs4 import BeautifulSoup
import requests
import csv


source = requests.get('http://coreyms.com').text

soup = BeautifulSoup(source, 'lxml')

# newline='' stops the csv module from writing blank rows on Windows
csv_file = open('cms_scrape.csv', 'w', newline='', encoding='UTF-8')

csv_writer = csv.writer(csv_file)
csv_writer.writerow(['headline', 'summary', 'video_link'])


for article in soup.find_all('article'):
    headline = article.h2.a.text
    print(headline)

    summary = article.find('div', class_='entry-content').p.text
    print(summary)

    try:
        vid_src = article.find('iframe', class_='youtube-player')['src']

        vid_id = vid_src.split('/')[4]
        vid_id = vid_id.split('?')[0]

        yt_link = f'https://youtube.com/watch?v={vid_id}'

    except Exception as e:
        yt_link = None

    print(yt_link)

    print()

    # One CSV row per article; a None link is written as an empty field
    csv_writer.writerow([headline, summary, yt_link])

csv_file.close()

---------------------------------

Python Tutorial: Zip Files – Creating and Extracting Zip Archives
In this video, we will be learning how to create and extract zip archives. We will start by using the zipfile module, and then we will see how to do this using the shutil module. We will learn how to do this with single files and directories, as well as learning how to use gzip as well. Let’s get started…
https://youtube.com/watch?v=z0gguhEmWiY

Python Data Science Tutorial: Analyzing the 2019 Stack Overflow Developer Survey
In this Python Programming video, we will be learning how to download and analyze real-world data from the 2019 Stack Overflow Developer Survey. This is terrific practice for anyone getting into the data science field. We will learn different ways to analyze this data and also some best practices. Let’s get started…
https://youtube.com/watch?v=_P7X8tMplsw

Python Multiprocessing Tutorial: Run Code in Parallel Using the Multiprocessing Module
In this Python Programming video, we will be learning how to run code in parallel using the multiprocessing module. We will also look at how to process multiple high-resolution images at the same time using a ProcessPoolExecutor from the concurrent.futures module. Let’s get started…
https://youtube.com/watch?v=fKl2JW_qrso

Python Threading Tutorial: Run Code Concurrently Using the Threading Module
In this Python Programming video, we will be learning how to run threads concurrently using the threading module. We will also look at how to download multiple high-resolution images online using a ThreadPoolExecutor from the concurrent.futures module. Let’s get started…
https://youtube.com/watch?v=IEEhzQoKtQU

Update (2019-09-03)
Hey everyone. I wanted to give you an update on my videos. I will be releasing videos on threading and multiprocessing within the next week. Thanks so much for your patience. I currently have a temporary recording studio setup at my Airbnb that will allow me to record and edit the threading/multiprocessing videos. I am going to be moving into my new house in 10 days and once I have my recording studio setup then you can expect much faster video releases. I really appreciate how patient everyone has been while I go through this move, especially those of you who are contributing monthly through YouTube
None

Python Quick Tip: The Difference Between “==” and “is” (Equality vs Identity)
In this Python Programming Tutorial, we will be learning the difference between using “==” and the “is” keyword when doing comparisons. The difference between these is that “==” checks to see if values are equal, and the “is” keyword checks their identity, which means it’s going to check if the values are identical in terms of being the same object in memory. We’ll learn more in the video. Let’s get started…
https://youtube.com/watch?v=mO_dS3rXDIs

Python Tutorial: Calling External Commands Using the Subprocess Module
In this Python Programming Tutorial, we will be learning how to run external commands using the subprocess module from the standard library. We will learn how to run commands, capture the output, handle errors, and also how to pipe output into other commands. Let’s get started…
https://youtube.com/watch?v=2Fp1N6dof0Y

Visual Studio Code (Windows) – Setting up a Python Development Environment and Complete Overview
In this Python Programming Tutorial, we will be learning how to set up a Python development environment in VSCode on Windows. VSCode is a very nice free editor for writing Python applications and many developers are now switching over to this editor. In this video, we will learn how to install VSCode, get the Python extension installed, how to change Python interpreters, create virtual environments, format/lint our code, how to use Git within VSCode, how to debug our programs, how unit testing works, and more. We have a lot to cover, so let’s go ahead and get started…
https://youtube.com/watch?v=-nh9rCzPJ20

Visual Studio Code (Mac) – Setting up a Python Development Environment and Complete Overview
In this Python Programming Tutorial, we will be learning how to set up a Python development environment in VSCode on MacOS. VSCode is a very nice free editor for writing Python applications and many developers are now switching over to this editor. In this video, we will learn how to install VSCode, get the Python extension installed, how to change Python interpreters, create virtual environments, format/lint our code, how to use Git within VSCode, how to debug our programs, how unit testing works, and more. We have a lot to cover, so let’s go ahead and get started…
https://youtube.com/watch?v=06I63_p-2A4

Clarifying the Issues with Mutable Default Arguments
In this Python Programming Tutorial, we will be clarifying the issues with mutable default arguments. We discussed this in my last video titled “5 Common Python Mistakes and How to Fix Them”, but I received many comments from people who were still confused. So we will be doing a deeper dive to explain exactly what is going on here. Let’s get started…
https://youtube.com/watch?v=_JGmemuINww

---------------------------------
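
The script above opens the CSV file by hand and closes it at the end. A small sketch of the same write step using a with block follows, so the file is closed even if a row fails to write. The rows list and the temporary file path are assumptions made for illustration, and the summary column is left out to keep the example short.

```python
import csv
import os
import tempfile

# Illustrative rows: (headline, video link); None marks a post without a video
rows = [
    ('Clarifying the Issues with Mutable Default Arguments',
     'https://youtube.com/watch?v=_JGmemuINww'),
    ('Update (2019-09-03)', None),
]

# Write into a temporary directory so the example leaves no file behind in the project
path = os.path.join(tempfile.mkdtemp(), 'cms_scrape_demo.csv')

# newline='' is what the csv module expects; without it Windows gets blank rows
with open(path, 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(['headline', 'video_link'])
    for headline, link in rows:
        csv_writer.writerow([headline, link])

# Read the file back to check what was stored
with open(path, newline='', encoding='utf-8') as csv_file:
    read_back = list(csv.reader(csv_file))

for row in read_back:
    print(row)
```

Reading the file back shows that csv.writer stores None as an empty string, so a missing video link appears as an empty field in the CSV.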