Master Python Web Scraping on an Ubuntu VPS: A Beginner’s Guide
Web scraping is a valuable skill that enables you to collect and analyze large amounts of data from web pages with minimal manual effort.
In this comprehensive guide, we will walk you through the steps of web scraping using Python on an Ubuntu VPS. Whether you’re a beginner or have some experience, this article will equip you with the necessary skills to scrape data effectively.
Table of Contents
- Prerequisites
- Setting Up Your Environment for Python Web Scraping
- Making Your First Request
- Extracting Data with Beautiful Soup
- Parsing HTML and Navigating the DOM Tree
- Storing Scraped Data
- Using Regular Expressions to Scrape Data
- Handling Dynamic Content and Ajax Calls
- Error Handling and Logging
- Creating a Simple Web Scraper Application
- Python Web Scraping FAQ
Prerequisites
To follow along with this Python web scraping tutorial, you will need:
- An Ubuntu VPS with Python, pip, and the venv module installed.
- Secure Shell (SSH) access to the VPS.
- Basic knowledge of how to run commands in a terminal.
While you can technically write Python code for web scraping without a VPS, it’s recommended to use one, especially for beginners. A VPS hosting plan provides more stability and better performance for web scraping, particularly for large-scale tasks. If your script takes too long to execute on your local machine or consumes too much memory, switching to a VPS can enhance your scraping capabilities.
Setting Up Your Environment for Python Web Scraping
Start by logging into your VPS via SSH. You can use an SSH client such as PuTTY for this purpose: download and install PuTTY, launch it, enter the IP address of your VPS, and log in with your credentials. If you are a Hostinger user, find your VPS login credentials by navigating to hPanel → VPS → Manage → SSH access.
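If you’re connecting from a Linux or macOS machine instead, the built-in ssh command works without any extra software (replace the placeholder with your VPS user and IP address):
ssh root@your-vps-ip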
Once logged in, run the following commands in order to set up your environment:
python3 -m venv myenv
The above command creates a new virtual environment. To activate it, use:
source myenv/bin/activate
Now, install Beautiful Soup and Requests, two essential libraries for web scraping:
pip install requests beautifulsoup4
Making Your First Request
Python’s Requests library provides a user-friendly interface for interacting with web servers through HTTP requests. To make your first request, let’s create a Python script:
touch temp.py
Open the file in your favorite code editor, and paste the following code:
import requests
# Send a GET request to the URL
response = requests.get('https://www.hostinger.com/tutorials/how-to-run-a-python-script-in-linux')
# Print the HTML content of the page
print(response.text)
Run the script with the following command:
python3 temp.py
After running the command, you should see a large amount of HTML content printed on your console.
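Before parsing anything, it’s worth confirming that the request actually succeeded. You can extend temp.py with a couple of extra print statements, for example:
# Check the HTTP status code before working with the response body
print('Status code:', response.status_code)
# Response headers such as Content-Type can also be inspected
print('Content-Type:', response.headers.get('Content-Type'))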
Extracting Data with Beautiful Soup
Now that we have a basic understanding of HTTP requests, let’s see how we can extract data from the retrieved response object using Beautiful Soup.
Create a new script named scraping.py:
touch scraping.py
Open the file in your editor and paste the following code:
import requests
from bs4 import BeautifulSoup
# Send a GET request to the URL
response = requests.get('https://www.hostinger.com/tutorials/how-to-run-a-python-script-in-linux')
# Parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')
# Find all elements with a specific tag (e.g., all h2 headings)
titles = soup.find_all('h2')
# Extract text content from each H2 element
for title in titles:
    print(title.text)
To execute the script, run:
python3 scraping.py
You should see a list of all H2 headings from the visited URL.
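Often you’ll want to narrow your search beyond a bare tag name. Beautiful Soup can filter by attributes or CSS selectors; the following sketch shows both approaches (the class name post-title is just an illustrative placeholder):
# Filter by tag plus class attribute (class name is a placeholder)
headings = soup.find_all('h2', class_='post-title')
# Or use a CSS selector for the same result
headings = soup.select('h2.post-title')
for heading in headings:
    print(heading.text.strip())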
Parsing HTML and Navigating the DOM Tree
Beautiful Soup allows us to extract data by finding specific elements. Websites often have complex structures, and understanding how to navigate the Document Object Model (DOM) is crucial.
Let’s update our scraping.py script to explore different HTML structures:
import requests
from bs4 import BeautifulSoup
response = requests.get('https://www.hostinger.com/tutorials/how-to-run-a-python-script-in-linux')
soup = BeautifulSoup(response.text, 'html.parser')
# Find the first <h1> tag
first_header = soup.find('h1')
print('First <h1> tag text:', first_header.text)
# Find all <a> tags (links)
all_links = soup.find_all('a')
print('All <a> tag hrefs:')
for link in all_links:
    print(link.get('href'))
Run the script again:
python3 scraping.py
This will print out the first H1 tag and all the links found on the page.
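Beyond find() and find_all(), Beautiful Soup also lets you move through the tree relative to an element, for example to its parent or siblings. A small sketch building on the soup object above:
# Navigate relative to the first <h1> element
first_header = soup.find('h1')
# The element that directly contains the <h1>
print('Parent tag:', first_header.parent.name)
# The next tag at the same level of the tree
next_tag = first_header.find_next_sibling()
if next_tag is not None:
    print('Next sibling tag:', next_tag.name)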
Storing Scraped Data
After scraping data, it’s often necessary to store it for later use. You can save it to a CSV file or to a database such as MongoDB; the example below assumes a MongoDB server is already running locally on your VPS. First, install the pymongo library:
pip install pymongo
Now, add the following code to your scraping.py file:
import csv
import sys
from pymongo import MongoClient

# Use the first link found by the earlier code
first_link = all_links[0]

# Collect the data we want to store
data_to_store = {
    'First_h1_tag': first_header.text,
    'First_link_text': first_link.text,
    'First_link_href': first_link.get('href')
}

# Store the data in a CSV file
csv_file = 'scraped_data.csv'
with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=data_to_store.keys())
    writer.writeheader()
    writer.writerow(data_to_store)
print('Data saved to CSV file:', csv_file)

# Store the data in MongoDB
try:
    client = MongoClient('mongodb://localhost:27017/')
    db = client['scraping_db']
    collection = db['scraped_data']
    collection.insert_one(data_to_store)
    print('Data saved to MongoDB collection: scraped_data')
except Exception as e:
    print('Error:', e)
    print('Failed to connect to MongoDB. Exiting...')
    sys.exit(1)
Upon successful execution, you should see confirmation messages for data saving.
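If you want to confirm that the MongoDB insert worked, a quick read-back (assuming the same local MongoDB instance as above) looks like this:
from pymongo import MongoClient
# Fetch one stored document from the collection used above
client = MongoClient('mongodb://localhost:27017/')
document = client['scraping_db']['scraped_data'].find_one()
print(document)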
Using Regular Expressions to Scrape Data
Regular expressions can be useful when extracting data that follows a specific pattern. Create a new file named regex.py:
touch regex.py
Paste the following code to find email addresses:
import requests
from bs4 import BeautifulSoup
import re
response = requests.get('https://webscraper.io/test-sites/e-commerce/static/phones')
soup = BeautifulSoup(response.text, 'html.parser')
# Define a regex pattern for matching email addresses
email_pattern = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
# Find all text in the HTML that matches the email pattern
emails = soup.find_all(string=email_pattern)
# Print all matched email addresses
for email in emails:
    print(email)
Execute the script with:
python3 regex.py
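Note that soup.find_all(string=...) only returns the text nodes that contain a match, and the test page above may not contain any email addresses at all, in which case nothing is printed. If you prefer to scan the raw HTML directly, the same compiled pattern can be applied with findall:
# Apply the same pattern to the raw HTML instead of individual text nodes
emails_in_html = email_pattern.findall(response.text)
# De-duplicate before printing
for email in sorted(set(emails_in_html)):
    print(email)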
Handling Dynamic Content and Ajax Calls
When web scraping, you may encounter dynamic content loaded via AJAX. To handle this, you can use Selenium, a tool for automating web browsers.
Install Selenium:
pip install selenium
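Selenium also needs a browser and a matching driver available on the VPS. On Ubuntu, one common approach (package names and installation mechanisms can vary between releases) is to install Chromium and its driver with apt:
sudo apt update
sudo apt install -y chromium-browser chromium-chromedriver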
Here’s a sample Selenium script:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Define the URL to scrape
url = 'https://www.example.com'
# Create a new WebDriver instance (Selenium 4 expects the driver path via a Service object)
service = Service('/path/to/chromedriver')
driver = webdriver.Chrome(service=service)
# Navigate to the URL
driver.get(url)
# Explicit wait for content to load
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'unique_element_id')))
# Locate the desired elements and report how many were found
view_details_buttons = driver.find_elements(By.CLASS_NAME, 'view_details_button')
print(f'Found {len(view_details_buttons)} matching elements')
# Close the browser window
driver.quit()
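On a VPS there is usually no graphical display, so Chrome needs to run in headless mode. Here’s a minimal sketch (the extra flags are commonly needed in a bare server environment, and example.com is just a placeholder):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
# Configure Chrome to run without a visible window
options = Options()
options.add_argument('--headless=new')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=options)
driver.get('https://www.example.com')
print(driver.title)
driver.quit()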
Error Handling and Logging
Implementing error handling and logging is crucial in web scraping. Here’s how you can enhance your script:
import logging
import sys

import requests

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

try:
    response = requests.get('https://www.hostinger.com/tutorials/how-to-run-a-python-script-in-linux')
    response.raise_for_status()
except requests.RequestException as e:
    logging.error(f"Request failed: {e}")
    sys.exit(1)
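For flaky networks or rate-limited sites, you can go a step further and retry failed requests before giving up. The following is a small sketch (not part of the original script; the retry count and backoff values are arbitrary choices):
import logging
import time

import requests

def fetch_with_retries(url, retries=3, backoff=2):
    # Fetch a URL, retrying with exponential backoff on failure
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as e:
            logging.warning(f"Attempt {attempt} failed: {e}")
            if attempt < retries:
                time.sleep(backoff ** attempt)
    return None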
Creating a Simple Web Scraper Application
Now, let’s create a simple web scraper application that takes a keyword and a URL as input, calculates how many times the keyword appears on the page, and lists all external links it finds.
touch app.py
Paste the following code into app.py:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

def get_html(url):
    # Fetch the page and return its HTML, or None if the request fails
    try:
        response = requests.get(url)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as e:
        print('Error fetching the URL:', e)
        return None

def count_keyword(html, keyword):
    # Count case-insensitive occurrences of the keyword in the page text
    keyword = keyword.lower()
    text = BeautifulSoup(html, 'html.parser').get_text().lower()
    return text.count(keyword)

def find_external_links(html, base_url):
    # Collect links that point to a different domain than the base URL
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.find_all('a', href=True)
    external_links = []
    base_domain = urlparse(base_url).netloc
    for link in links:
        href = link.get('href')
        parsed_href = urlparse(href)
        if parsed_href.netloc and parsed_href.netloc != base_domain:
            external_links.append(href)
    return external_links

def main():
    keyword = input("Enter the keyword: ")
    url = input("Enter the URL: ")
    html = get_html(url)
    if html is None:
        print('Failed to retrieve HTML content.')
        return
    keyword_count = count_keyword(html, keyword)
    external_links = find_external_links(html, url)
    print(f"\nKeyword '{keyword}' found {keyword_count} times in the page.")
    print("\nExternal Links:")
    for link in external_links:
        print(link)

if __name__ == '__main__':
    main()
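Run the application and enter a keyword and a URL when prompted:
python3 app.py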
Python Web Scraping FAQ
Which Python libraries are commonly used for web scraping?
The Python libraries most commonly used for web scraping are Beautiful Soup, Requests, and Scrapy.
Is it possible to scrape websites that require login credentials using Python?
It is technically possible to scrape websites that require login credentials. However, you may need to replicate the login process, which can be complex. Be sure that web scraping doesn’t violate the website’s terms of service.
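As a rough illustration (the URLs and form field names below are hypothetical and depend entirely on the target site), a requests.Session can submit a login form and keep the resulting cookies for later requests:
import requests

# Hypothetical endpoints; replace with the target site's real login and account URLs
LOGIN_URL = 'https://example.com/login'
PROTECTED_URL = 'https://example.com/account'

session = requests.Session()

# The payload keys must match the names of the site's login form fields
payload = {'username': 'your_username', 'password': 'your_password'}
session.post(LOGIN_URL, data=payload)

# The session reuses the login cookies, so this request is authenticated
response = session.get(PROTECTED_URL)
print(response.status_code)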
Can I extract images or media files with Python web scraping?
Yes, you can extract multimedia files from websites using Python web scraping. The process involves retrieving the HTML of the relevant page, using Beautiful Soup to find the appropriate HTML tag (e.g., <img> for images), extracting the URL, and using the Requests library to download the image or media file.
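As a rough sketch (example.com is a placeholder, and in practice you would filter for the specific image you need), downloading the first image on a page could look like this:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = 'https://www.example.com'  # placeholder page
response = requests.get(page_url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the first <img> tag and resolve its src against the page URL
img = soup.find('img')
if img and img.get('src'):
    img_url = urljoin(page_url, img['src'])
    img_data = requests.get(img_url).content
    with open('downloaded_image.jpg', 'wb') as f:
        f.write(img_data)
    print('Saved image from', img_url)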
Web scraping vs. API: which is better?
Both web scraping and APIs have their merits, and the better choice depends on your specific needs. If an API is available and offers the data you need in a structured format, it’s generally the preferred option due to its ease of use and standardization. However, if you need specific data, and the website doesn’t offer an API, web scraping may be your only option.
If you’re ready to dive into the world of web scraping using Python, consider using Hostinger for your VPS hosting needs. With dedicated resources, you can enhance your scraping capabilities!
Conclusion
Web scraping is a fantastic way to process, analyze, and aggregate large volumes of online data. In this guide, you learned how to make your first request in Python, use Beautiful Soup to extract data, parse HTML and navigate the DOM tree, leverage regular expressions, handle dynamic content, and store scraped data in a file and a database. We hope you found this tutorial useful!