For parsing web pages with Python, BeautifulSoup is a helpful library. To install:
pip install beautifulsoup4
You also need a way to fetch the page. The standard library urllib works fine for simple cases; requests is more ergonomic for anything complex:
pip install requests
Basic Usage: Extract all links
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
# Ignore SSL certificate errors (convenient for quick scraping,
# but insecure; avoid this when handling anything sensitive)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
url = input('Enter URL: ')
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all anchor tags
tags = soup('a')  # calling the soup object is shorthand for soup.find_all('a')
for tag in tags:
    print(tag.get('href', None))
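The hrefs printed above are often relative paths. urllib.parse (already imported in the example) provides urljoin to resolve them against the page URL; a minimal sketch, with an invented base URL:

```python
from urllib.parse import urljoin

base = 'https://example.com/docs/'  # hypothetical page URL

# Relative paths resolve against the base
print(urljoin(base, 'page.html'))            # https://example.com/docs/page.html
# Root-relative paths replace everything after the host
print(urljoin(base, '/img/logo.png'))        # https://example.com/img/logo.png
# Absolute URLs pass through unchanged
print(urljoin(base, 'https://other.org/x'))  # https://other.org/x
```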
Finding Elements
BeautifulSoup provides several ways to locate content:
# First matching element
soup.find('h1')
soup.find('div', class_='content')
soup.find('a', id='main-link')
# All matching elements (returns a list-like ResultSet)
soup.find_all('p')
soup.find_all('a', class_='external')
# CSS selector syntax
soup.select('div.article > p')
soup.select_one('#footer a')
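To see these lookups side by side, here is a self-contained sketch; the HTML snippet is invented for illustration:

```python
from bs4 import BeautifulSoup

html = """
<div class="article">
  <h1>Title</h1>
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
  <a class="external" href="https://example.org">out</a>
</div>
<div id="footer"><a href="/about">About</a></div>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('h1').text)                  # Title
print(len(soup.find_all('p')))               # 2
print(len(soup.select('div.article > p')))   # 2
print(soup.select_one('#footer a')['href'])  # /about
```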
Extracting Text and Attributes
tag = soup.find('a')
tag.text         # text content with markup removed
tag.get_text()   # same; also accepts separator= and strip= arguments
tag['href'] # attribute access (raises KeyError if missing)
tag.get('href') # safe attribute access (returns None if missing)
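A quick sketch of the difference between these accessors, on an invented one-tag document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/home">Go <b>home</b></a>', 'html.parser')
tag = soup.find('a')

print(tag.text)          # Go home (text of nested tags included)
print(tag['href'])       # /home
print(tag.get('title'))  # None -- no KeyError for the missing attribute
```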
Using requests Instead of urllib
import requests
from bs4 import BeautifulSoup
response = requests.get('https://example.com')
soup = BeautifulSoup(response.text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
Note: always check response.status_code == 200 (or call response.raise_for_status()) before parsing, and respect the site's robots.txt and rate limits.
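Putting that advice together, a minimal sketch (extract_links and fetch_links are hypothetical helper names, not part of either library):

```python
import requests
from bs4 import BeautifulSoup

def extract_links(html):
    """Return the href values of all anchor tags that have one."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)]

def fetch_links(url, timeout=10):
    """Fetch a page and return its links; raises on 4xx/5xx responses."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # equivalent to checking status_code yourself
    return extract_links(response.text)
```

Separating the parsing from the fetching also makes the parsing half easy to test without network access.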