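Exploring Web Scraping Techniques: A Deep Dive with Python

Introduction

In the digital age, data is abundant, but accessing it efficiently can be a challenge. Web scraping, the process of extracting information from websites, offers a solution. This article takes a close look at web scraping with Python, using three different websites (Politifact, Altnews, and Mastodon) to explore different tools and approaches. For this project, I obtained a basic code template from ChatGPT, updated it to fit my requirements, and applied my own approach in the code.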

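Required Packages

Make sure the following packages are installed in your Python environment using pip, Python's package manager. You can install them with this command: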

pip install selenium webdriver_manager requests beautifulsoup4
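
After installing, it can help to confirm that the packages are actually importable in the interpreter you plan to run. The sketch below is only an illustration: the helper name missing_modules is hypothetical, and the list uses import names, which differ from the pip names in one case (beautifulsoup4 is imported as bs4).

```python
import importlib.util

# Import names, not pip names: beautifulsoup4 is imported as bs4.
REQUIRED_MODULES = ["selenium", "webdriver_manager", "requests", "bs4"]


def missing_modules(module_names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```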

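Scraping Politifact with BeautifulSoup

Politifact is a rich source of articles and fact-checks, which makes it a good candidate for scraping. We will use BeautifulSoup, a Python library for parsing HTML and XML, to navigate the site's structure.

Challenges

Large volume of articles: The Politifact website contains a large number of articles and fact-checks, which makes scraping all of the data efficiently a challenge.
Sequential scraping: Initially, scraping each article one after another took a long time, especially because all of the links had to be retrieved before any individual article could be scraped.
Efficiency and speed: Given the volume of data, keeping the scraping process fast and efficient became essential.
Parsing complex HTML: Politifact's pages can have a complex HTML structure that must be parsed carefully to extract the required information accurately.

Approach

Collecting article and fact-check links: We use BeautifulSoup to extract links from the article and fact-check listing pages.
Parallel execution: To speed up scraping, we use ThreadPoolExecutor to fetch the listing pages in parallel and then scrape the articles concurrently.
Extracting data: With BeautifulSoup, we pull various details from each article, including the title, content, images, author information, and sources.
Saving data: Finally, we store the scraped data in a JSON file for further analysis.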
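
The parallel step can be sketched in isolation. The version below is an illustration rather than the code used in this project: it uses submit and as_completed so that one failing page does not abort the whole run, and fake_fetch is a hypothetical stand-in for a real request so the example runs without network access. Note that results arrive in completion order, not input order.

```python
import concurrent.futures


def fetch_all(fetch, links, max_workers=8):
    """Run fetch(link) for every link in a thread pool.

    Returns (results, failures): results holds the truthy return values,
    failures maps each failing link to the exception it raised.
    """
    results, failures = [], {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_link = {executor.submit(fetch, link): link for link in links}
        for future in concurrent.futures.as_completed(future_to_link):
            link = future_to_link[future]
            try:
                result = future.result()
            except Exception as exc:  # keep going if a single page fails
                failures[link] = exc
            else:
                if result:
                    results.append(result)
    return results, failures


def fake_fetch(link):
    # Hypothetical stand-in for a real HTTP request and parse step.
    if "broken" in link:
        raise ValueError(f"could not parse {link}")
    return {"link": link}


results, failures = fetch_all(fake_fetch, ["a", "broken", "b"])
print(len(results), "pages scraped;", len(failures), "failed")
```

Capping max_workers is a design choice worth keeping: it bounds how many requests hit the site at once.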

Code

# Import necessary libraries
import requests
from bs4 import BeautifulSoup
import json
import concurrent.futures

# Function to fetch article details; returns None if the page cannot be fetched or parsed
def fetch_article_details(link):
    response = requests.get(link, timeout=30)
    if response.status_code != 200:
        print("Failed to retrieve the page. Status code:", response.status_code)
        return None
    soup = BeautifulSoup(response.content, "html.parser")
    title = soup.find("div", class_="m-statement__quote")
    body = soup.find("article", class_="m-textblock")
    # Skip pages that do not match the expected layout instead of raising AttributeError
    if title is None or body is None:
        print("Unexpected page layout, skipping:", link)
        return None
    article = {}
    # Extract relevant details such as title, content, images, author, etc.
    article["title"] = title.text.strip()
    article["content"] = body.text.strip()
    article["images"] = [img["src"] for img in body.find_all("img") if img.get("src")]
    # Extract author details
    authors = []
    for author in soup.find_all("div", class_="m-author"):
        a = {}
        avatar = author.find("img")
        if avatar is not None:
            a["avatar"] = avatar["src"]
        content = author.find("div", class_="m-author__content copy-xs u-color--chateau")
        profile_link = content.find("a") if content is not None else None
        if profile_link is not None:
            a["name"] = profile_link.text.strip()
            a["profile"] = "https://www.politifact.com" + profile_link["href"]
        authors.append(a)
    article["authors"] = authors
    return article

# Function to fetch article links
def fetch_article_links(page_number):
    url = f"https://www.politifact.com/article/list/?page={page_number}"
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, "html.parser")
        articles = soup.find_all("div", class_="m-teaser")
        return ["https://www.politifact.com" + article.find("h3", class_="m-teaser__title").find("a")["href"] for article in articles]
    else:
        print(f"Failed to retrieve the page {url}. Status code:", response.status_code)
        return []

# Fetch article links in parallel
with concurrent.futures.ThreadPoolExecutor() as executor:
    article_links = []
    results = executor.map(fetch_article_links, range(1, 10))  # Adjust range as needed
    for result in results:
        article_links.extend(result)

articles = []

# Fetch article details concurrently
with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(fetch_article_details, article_links)
    for result in results:
        if result:
            articles.append(result)

# Save the results to a JSON file
with open('politifact_articles.json', 'w') as json_file:
    json.dump(articles, json_file, indent=4)

print("Data saved to politifact_articles.json")
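
The code above builds absolute URLs by prepending the site root to each href. That works as long as every href is site-relative. As a hedged alternative, the standard library's urljoin also handles hrefs that are already absolute or protocol-relative. The helper name absolute_url and the example hrefs are illustrative only.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.politifact.com"


def absolute_url(href, base=BASE_URL):
    """Resolve an href found on a page against the site's base URL."""
    return urljoin(base, href)


# A site-relative path is resolved against the base.
relative = absolute_url("/article/list/?page=2")
# An href that is already absolute is returned unchanged.
already_absolute = absolute_url("https://example.com/story")
# A protocol-relative href inherits the scheme of the base.
protocol_relative = absolute_url("//static.example.com/a.png")

print(relative)
print(already_absolute)
print(protocol_relative)
```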

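Dynamic Scraping with Selenium: An Altnews Case Study

Altnews poses a distinct challenge because it loads content dynamically. To handle this, we turn to Selenium, a powerful browser automation tool.

Challenges

Dynamically loaded content: The Altnews website loads content as the user scrolls down the page, which makes it hard to extract all of the relevant data.
Scrolling and page interaction: Selenium has to simulate user interactions such as scrolling to trigger the loading of additional content, which requires careful scripting.
Parsing dynamic HTML: Extracting data from dynamically loaded HTML elements can be difficult because the structure may change as content loads.
Efficiency and speed: Dynamic content must be handled efficiently while keeping the scraping process fast and reliable.

Approach

Handling dynamic content: Selenium lets us load additional content by scrolling the page, simulating human interaction.
Extracting links: Once scrolling has finished and the page has loaded, we use Selenium to extract the links to the individual articles.
Scraping articles: With BeautifulSoup, we scrape each article to retrieve information such as the title, author, content, images, and videos.
Parallel processing: To improve efficiency, we use ThreadPoolExecutor to fetch and process multiple articles concurrently.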
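
The scrolling step can also be written so that it stops as soon as the page height stops growing, rather than always scrolling a fixed number of times. The sketch below is one possible variant, not the method used in this project. It relies only on the driver's execute_script method; FakeDriver is a hypothetical stand-in so the example runs without a browser, and with Selenium you would pass the real driver instead.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_scrolls=10):
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scrolls performed (at most max_scrolls).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls


class FakeDriver:
    """Hypothetical stand-in that reports a scripted sequence of page heights."""

    def __init__(self, heights):
        self._heights = heights
        self._index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self._heights[min(self._index, len(self._heights) - 1)]
        self._index += 1  # each scroll "loads" the next height
        return None


scrolls = scroll_until_stable(FakeDriver([1000, 2000, 3000, 3000]), pause=0)
print("Scrolled", scrolls, "times")
```

A fixed pause is still a guess about how long loading takes. If the page loads slowly, the height may not have changed yet when it is checked, so the loop can stop early.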

Code

# Import necessary libraries
import json
import pprint
import re
import time
import concurrent.futures

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode
chrome_options.add_argument("--disable-gpu")

# Initialize WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
url = 'https://www.altnews.in/'

# Open the website
driver.get(url)

try:
    num_scrolls = 10
    # Scroll the page multiple times to load more content
    for _ in range(num_scrolls):
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for some time for the content to load
        time.sleep(2)

    links = driver.find_elements(By.CSS_SELECTOR, 'h4 a')
    # Filter links that match the specified pattern (anchors without an href are skipped)
    hrefs = [link.get_attribute("href") for link in links]
    filtered_links = [href for href in hrefs if href and re.match(r'^https://www\.altnews\.in/.+$', href)]

    def fetch_post_data(link):
        response = requests.get(link, timeout=30)
        if response.status_code != 200:
            print(f"Failed to retrieve {link}. Status code:", response.status_code)
            return None
        soup = BeautifulSoup(response.content, 'html.parser')
        author_detail = soup.find('span', class_='author vcard')
        title = soup.find('h1', class_='headline entry-title e-t')
        element = soup.find('div', class_='entry-content e-c m-e-c guten-content')
        # Skip pages that are not regular posts (missing author, title, or body)
        if author_detail is None or title is None or element is None:
            return None
        post = {}
        post["title"] = title.get_text()
        avatar = author_detail.find('img')
        if avatar is not None:
            post["author_avatar"] = avatar['src']
        author_link = author_detail.find('a', class_="url fn n")
        if author_link is not None:
            post["author_name"] = author_link.get_text()
            post["author_profile"] = author_link['href']
        images = [img['src'] for img in element.find_all('img') if img.get('src')]
        if images:
            post["images"] = images
        videos = [vid['src'] for vid in element.find_all('video') if vid.get('src')]
        if videos:
            post["videos"] = videos
        post["content"] = element.get_text()
        post["link"] = link
        pprint.pprint(post)
        return post

    # Concurrently fetch data from multiple links, dropping pages that yielded no data
    with concurrent.futures.ThreadPoolExecutor() as executor:
        posts = [post for post in executor.map(fetch_post_data, filtered_links) if post]

    # Write to JSON file
    with open('altnews.json', 'w') as json_file:
        json.dump(posts, json_file, indent=4)

finally:
    driver.quit()

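Handling Mastodon's Dynamic Interface

Mastodon poses its own challenges because of its dynamic interface and continuously updating content. We use Selenium again for this task, with some additional considerations.

Challenges

Infinite scrolling: Mastodon's interface uses infinite scrolling, loading new posts dynamically as the user scrolls down, which complicates the scraping process.
Duplicate content: Because the page is dynamic, repeated scrolling can cause the same content to be scraped more than once, so duplicates must be handled carefully.
Parsing HTML attributes: To avoid scraping the same content twice, each post has to be identified by a unique attribute.
Data extraction and storage: Extracting varied data elements (such as images, videos, and post details) while keeping storage and processing efficient is a challenge in itself.

Approach

Incremental scrolling: To handle Mastodon's dynamic interface, we break scrolling into smaller increments so that we capture new posts while avoiding redundancy.
Unique post identification: Mastodon assigns each post a unique data-id attribute, which lets us tell new posts apart from ones already captured.
Extracting post data: With Selenium, we extract various details from each post, including author details, content, media files (images and videos), status cards, and engagement metrics.
Data export: We export the scraped data in both CSV and JSON formats for further analysis and visualization.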
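
The deduplication idea can be shown without a browser. In the sketch below, plain dictionaries stand in for the posts visible after each scroll, and the "id" key plays the role of the data-id attribute. The function name collect_new_posts and the sample data are hypothetical.

```python
def collect_new_posts(visible_posts, seen_ids, collected):
    """Append posts whose id has not been seen yet; return how many were added."""
    added = 0
    for post in visible_posts:
        post_id = post.get("id")
        if post_id is None or post_id in seen_ids:
            continue  # no identifier, or already captured on an earlier pass
        seen_ids.add(post_id)
        collected.append(post)
        added += 1
    return added


seen_ids, collected = set(), []

# Two overlapping scroll passes: post "102" is visible in both.
pass_1 = [{"id": "101", "content": "first"}, {"id": "102", "content": "second"}]
pass_2 = [{"id": "102", "content": "second"}, {"id": "103", "content": "third"}]

added_1 = collect_new_posts(pass_1, seen_ids, collected)
added_2 = collect_new_posts(pass_2, seen_ids, collected)
print(added_1, added_2, [post["id"] for post in collected])
```

Keeping every seen id in a set makes the membership check cheap and means a post cannot be captured twice, however many scroll passes show it.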

Code

# Import necessary libraries
import csv
import json
import pprint
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
import time
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

# Set up Chrome options
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run Chrome in headless mode
chrome_options.add_argument("--disable-gpu")

# Initialize WebDriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=chrome_options)
url = 'https://mastodon.social/'

# Open the website
driver.get(url)

try:
    visited = set()  # data-id values of posts that were already captured
    num_scrolls = 30

    # Find the <body> element
    body = driver.find_element(By.TAG_NAME, 'body')
    posts = []

    # Scroll using keyboard keys
    for _ in range(num_scrolls):
        body.send_keys(Keys.PAGE_DOWN)  # Scroll down one page
        time.sleep(1)  # Adjust sleep time as needed

        for article in driver.find_elements(By.CSS_SELECTOR, '.status.status-public'):
            post_id = article.get_attribute("data-id")
            if post_id is None or post_id in visited:
                continue
            # Start a fresh dict for every post so entries in `posts` do not share state
            curr_post = {}
            # Extract post details
            curr_post["avatar"] = article.find_elements(By.CSS_SELECTOR, '.status__avatar img')[0].get_attribute("src")
            curr_post["name"] = article.find_elements(By.CSS_SELECTOR, '.display-name__html')[0].text
            curr_post["account"] = "\n".join([i.text for i in article.find_elements(By.CSS_SELECTOR, '.display-name__account')])
            curr_post["content"] = "\n".join([i.text for i in article.find_elements(By.CSS_SELECTOR, '.status__content__text.status__content__text--visible.translate')])
            # Extract media (images and videos)
            media = article.find_elements(By.CSS_SELECTOR, 'img')
            if media:
                curr_post["pictures"] = [i.get_attribute("src") for i in media]
            media = article.find_elements(By.CSS_SELECTOR, 'video')
            if media:
                curr_post["videos"] = [i.get_attribute("src") for i in media]
            # Extract hashtags, if available (class selector; assumes the bar uses the hashtag-bar class)
            hashtags = article.find_elements(By.CSS_SELECTOR, '.hashtag-bar')
            if hashtags:
                curr_post["hashtags"] = [i.text for i in hashtags]
            # Extract status card content and link
            if article.find_elements(By.CSS_SELECTOR, '.status-card__content'):
                curr_post["status_card"] = "\n".join([i.text for i in article.find_elements(By.CSS_SELECTOR, '.status-card__content')])
                cards = article.find_elements(By.CSS_SELECTOR, '.status-card.expanded')
                if cards:
                    curr_post["status_card_link"] = cards[0].get_attribute("href")
            # Extract engagement metrics (replies, boosts, favorites, in that order)
            counters = article.find_elements(By.CSS_SELECTOR, '.icon-button__counter')
            if len(counters) >= 3:
                curr_post["replies"] = counters[0].text
                curr_post["boosts"] = counters[1].text
                curr_post["favs"] = counters[2].text
            pprint.pprint(curr_post)
            visited.add(post_id)
            posts.append(curr_post)

    # Write to CSV file (fieldnames must list every key a post may contain)
    with open('mastodon_posts.csv', mode='w', newline='', encoding='utf-8') as file:
        fieldnames = ['avatar', 'name', 'account', 'content', 'pictures', 'videos', 'hashtags', 'status_card', 'status_card_link', 'boosts', 'favs', 'replies']
        writer = csv.DictWriter(file, fieldnames=fieldnames)
        writer.writeheader()
        for post in posts:
            writer.writerow(post)
    
    # Write to JSON file
    with open('mastodon_posts.json', 'w') as json_file:
        json.dump(posts, json_file, indent=4)

finally:
    driver.quit()
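
One detail of the CSV export is worth illustrating: fields such as pictures and videos hold lists, and csv.DictWriter writes a list as its Python string form. A possible refinement, sketched here with hypothetical names (flatten_row, posts_to_csv) and made-up sample data, is to join list values with a separator and to ignore keys that are not in the column list.

```python
import csv
import io

FIELDNAMES = ["name", "content", "pictures", "hashtags"]


def flatten_row(post, fieldnames, separator=" | "):
    """Build a CSV-ready row: missing fields become '', lists are joined."""
    row = {}
    for field in fieldnames:
        value = post.get(field, "")
        if isinstance(value, (list, tuple)):
            value = separator.join(str(item) for item in value)
        row[field] = value
    return row


def posts_to_csv(posts, fieldnames=FIELDNAMES):
    """Return the posts as CSV text, one row per post."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    for post in posts:
        writer.writerow(flatten_row(post, fieldnames))
    return buffer.getvalue()


# Made-up sample post; the "extra" key is not a column and is dropped.
sample = [{"name": "Ada", "content": "hello", "pictures": ["a.png", "b.png"], "extra": 1}]
csv_text = posts_to_csv(sample)
print(csv_text)
```

Writing to an in-memory buffer keeps the example self-contained. To write a file instead, pass an open file handle to csv.DictWriter as in the code above.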

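Conclusion

As our journey through web scraping draws to a close, we can look back on the many challenges and successes along the way. From the structural constraints of Politifact to the dynamic landscapes of Altnews and Mastodon Social, each site presented its own obstacles that tested our persistence and creativity. By applying Python libraries, automation tools, and a deliberate strategy, we worked past the limits of dynamic content and came away with a rich set of valuable data. As we plan new projects and set out on future explorations, may this experience serve as inspiration for fellow travelers in the expanding world of web scraping.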