Scrape the HTML content of a website
Here’s a basic Python script that uses the requests and BeautifulSoup libraries to scrape the HTML content of a website:
import requests
from bs4 import BeautifulSoup

# Set the URL to scrape
url = 'https://www.example.com'

# Make an HTTP GET request to the website
response = requests.get(url)

# Parse the HTML content of the website
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the information you want from the website
title = soup.find('title').get_text()
print(title)

# Extract all the links from the website
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
This script makes an HTTP GET request to the specified URL, retrieves the HTML content of the website, and then uses the BeautifulSoup library to parse the HTML. The find() method is used to locate the title tag and extract its text, and find_all() is used to locate all a tags and extract their href attributes.
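One detail worth knowing: find() returns None when no matching tag exists, and a tags do not always carry an href attribute. The following is a minimal sketch of a more defensive version of the same extraction, not a definitive implementation. The function name extract_title_and_links and the sample_html string are hypothetical, introduced only for illustration; the sample string stands in for response.content so the example runs without a network request.

```python
from bs4 import BeautifulSoup


def extract_title_and_links(html):
    """Return (title, hrefs) for an HTML string; title is None if absent."""
    soup = BeautifulSoup(html, 'html.parser')

    # find() returns None when no <title> tag exists, so check before get_text()
    title_tag = soup.find('title')
    title = title_tag.get_text(strip=True) if title_tag is not None else None

    # href=True keeps only the <a> tags that actually have an href attribute
    hrefs = [a['href'] for a in soup.find_all('a', href=True)]
    return title, hrefs


# Hypothetical sample document standing in for response.content
sample_html = """
<html><head><title> Example Domain </title></head>
<body>
  <a href="https://www.example.com/about">About</a>
  <a name="top">No href here</a>
  <a href="/contact">Contact</a>
</body></html>
"""

title, hrefs = extract_title_and_links(sample_html)
print(title)
print(hrefs)
```

With a live page, you would pass response.content from the script above instead of sample_html.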
You can modify the script to extract other information from the website by changing the find() and find_all() calls and their parameters to locate different elements in the HTML.
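As a rough illustration of what changing those parameters can look like, the sketch below runs a few variations against a small HTML string. The markup, the id main-heading, and the class names post and summary are all made up for this example; a real page will need selectors that match its own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only to demonstrate different lookups
sample_html = """
<html><body>
  <h1 id="main-heading">Latest posts</h1>
  <div class="post"><h2>First post</h2><p class="summary">Intro text.</p></div>
  <div class="post"><h2>Second post</h2><p class="summary">More text.</p></div>
  <img src="/logo.png" alt="Logo">
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Find a single tag by name and id attribute
heading = soup.find('h1', id='main-heading').get_text()

# Find all tags with a given CSS class (class_ avoids the Python keyword)
summaries = [p.get_text() for p in soup.find_all('p', class_='summary')]

# Find all tags of a different type
post_titles = [h2.get_text() for h2 in soup.find_all('h2')]

# Read a different attribute, here src on <img> tags
image_sources = [img.get('src') for img in soup.find_all('img')]

# Cap the number of results with limit
first_post_only = soup.find_all('div', class_='post', limit=1)

print(heading)
print(summaries)
print(post_titles)
print(image_sources)
print(len(first_post_only))
```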