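Web scraping is an automated, programmatic process for repeatedly "scraping" data off web pages. Also known as screen scraping or web harvesting, it can provide instant data from any publicly accessible web page. Note that scraping some websites may be illegal.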

1: Scraping with the Scrapy framework

First, you have to set up a new Scrapy project. Change into a directory where you'd like to store your code and run:

scrapy startproject stackoverflow_project

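To scrape we need a spider. A spider defines how a certain site will be scraped. Here is the code for a spider that follows the links to the top-voted questions on StackOverflow and scrapes some data from each page: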

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'  # each spider has a unique name
    start_urls = ['http://stackoverflow.com/questions?sort=votes']  # the crawl starts from these URLs

    def parse(self, response):
        # Extract the question links with a CSS selector
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.get())  # build the absolute question URL
            yield scrapy.Request(full_url, callback=self.parse_question)

        # Handle pagination by following the "next" link
        next_page = response.css('.pager .next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def parse_question(self, response):
        # Extract the details of a single question
        yield {
            'title': response.css('h1 a::text').get(),  # question title
            'votes': response.css('.question .vote-count-post::text').get(),  # vote count
            'body': response.css('.question .post-text').get(),  # question body (HTML)
            'tags': response.css('.question .post-tag::text').getall(),  # question tags
            'link': response.url,  # question URL
        }

Save your spider classes in the projectName\spiders directory. In this case:

stackoverflow_project\stackoverflow_project\spiders\stackoverflow_spider.py

Now you can use your spider. For example, try running (in the project's directory):

cd stackoverflow_project
scrapy crawl stackoverflow -o output.json

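This saves the scraped items to output.json. Note that -o appends to an existing file, which can leave a JSON file invalid after repeated runs; recent Scrapy versions also accept -O to overwrite the file instead.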
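
Once the crawl finishes, the exported file can be post-processed with the standard library. The sketch below assumes output.json holds a JSON array of items with the fields yielded by parse_question, and that votes is exported as a string. The helper names parse_votes and top_questions and the inline sample data are hypothetical, used only for illustration.

```python
import json


def parse_votes(item):
    """Convert the scraped vote count (a string, or None if the selector missed) to an int."""
    try:
        return int(item.get("votes") or 0)
    except ValueError:
        return 0  # e.g. an abbreviated count such as "1.2k"


def top_questions(items, n=3):
    """Return the n items with the highest vote counts."""
    return sorted(items, key=parse_votes, reverse=True)[:n]


# Inline sample standing in for the contents of output.json (hypothetical data).
sample = json.loads("""
[
  {"title": "Question A", "votes": "12", "tags": ["python"]},
  {"title": "Question B", "votes": "340", "tags": ["git"]},
  {"title": "Question C", "votes": null, "tags": []}
]
""")

for question in top_questions(sample, n=2):
    print(parse_votes(question), question["title"])
```

To run this against a real crawl, replace the sample with the result of json.load on the opened output.json.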

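2: Scraping with Selenium WebDriver

Some websites don't like to be scraped. In these cases you may need to simulate a real user working with a browser. Selenium launches and controls a web browser.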

from selenium import webdriver
from selenium.webdriver.common.by import By

# Selenium 4 API: find_element(By.CSS_SELECTOR, ...) replaces the removed find_element_by_* helpers
browser = webdriver.Firefox()  # launch the Firefox browser

try:
    browser.get('http://stackoverflow.com/questions?sort=votes')  # load the URL

    title = browser.find_element(By.CSS_SELECTOR, 'h1').text  # page title (first h1 element)

    questions = browser.find_elements(By.CSS_SELECTOR, '.question-summary')  # question list

    for question in questions:  # iterate over the questions
        question_title = question.find_element(By.CSS_SELECTOR, '.summary h3 a').text
        question_excerpt = question.find_element(By.CSS_SELECTOR, '.summary .excerpt').text
        question_vote = question.find_element(By.CSS_SELECTOR, '.stats .vote .votes .vote-count-post').text

        # Print inside the loop so that every question is reported, not just the last one
        print("%s\n%s\n%s votes\n-----------\n" % (question_title, question_excerpt, question_vote))
finally:
    browser.quit()  # close the browser even if a lookup fails

Selenium can do much more. It can modify the browser's cookies, fill in forms, simulate mouse clicks, take screenshots of web pages, and run custom JavaScript.

3: A basic example of scraping data with requests and lxml

import lxml.etree
import lxml.html
import requests

def main():
    try:
        # Send an HTTP GET request (with a timeout so the call cannot hang forever)
        url = "https://httpbin.org"
        r = requests.get(url, timeout=10)

        # Check the response status code
        if r.status_code == 200:
            html_source = r.text
            root_element = lxml.html.fromstring(html_source)

            # Extract the page title with XPath
            page_title = root_element.xpath('/html/head/title/text()')

            if page_title:
                print("Page Title:", page_title[0])
            else:
                print("No title found on the page.")
        else:
            print(f"Failed to retrieve the page. Status code: {r.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"Error occurred while trying to fetch the URL: {e}")
        print("Please check the URL's validity and your network connection. Try again later.")
    except lxml.etree.ParserError as e:
        print(f"Error occurred while parsing the HTML content: {e}")
        print("The HTML content may be malformed or incomplete.")

if __name__ == '__main__':
    main()

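4: Maintaining a web-scraping session with requests

It is a good idea to maintain a web-scraping session so that cookies and other parameters persist between requests. It can also improve performance, because requests.Session reuses the underlying TCP connection to a host: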

import requests

with requests.Session() as session:
  # all requests through session now have User-Agent header set
  # update() keeps the session's other default headers instead of replacing them all
  session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
  
  # set cookies
  session.get('http://httpbin.org/cookies/set?key=value')

  # get cookies
  response = session.get('http://httpbin.org/cookies')
  print(response.text)

5: Scraping with BeautifulSoup4

from bs4 import BeautifulSoup
import requests

def fetch_codechef_problems():
    url = "https://www.codechef.com/problems/easy"
    try:
        # Send an HTTP GET request
        res = requests.get(url, timeout=10)  # set a timeout to avoid waiting indefinitely
        res.raise_for_status()  # raise an exception if the request failed

        # Create the BeautifulSoup object
        page = BeautifulSoup(res.text, 'lxml')  # use the lxml parser

        # Use a CSS selector to get the problem list
        datatable_tags = page.select('table.dataTable')  # the problem list is in a <table> tag with class="dataTable"

        if not datatable_tags:
            print("No table with class 'dataTable' found on the page.")
            return []

        datatable = datatable_tags[0]
        prob_tags = datatable.select('a > b')  # the problem names are in <b> tags inside <a> tags
        prob_names = [tag.getText().strip() for tag in prob_tags]

        return prob_names

    except requests.exceptions.RequestException as e:
        print(f"Error occurred while fetching the URL: {e}")
        print("Please check the URL's validity and your network connection.")
        print("If the problem persists, try accessing the URL directly in your browser.")
        return []

# Call the function and print the results
if __name__ == "__main__":
    problem_names = fetch_codechef_problems()
    if problem_names:
        print("Problem Names:")
        for name in problem_names:
            print(name)
    else:
        print("Failed to retrieve problem names.")

6: Downloading simple web content with urllib.request

The standard library module urllib.request can be used to download web content:

from urllib.request import urlopen

response = urlopen('http://stackoverflow.com/questions?sort=votes')
data = response.read()

# The received bytes should usually be decoded according to the response's character set;
# fall back to UTF-8 when the server does not declare one
encoding = response.info().get_content_charset() or 'utf-8'
html = data.decode(encoding)

A similar module is also available in Python 2.
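
The decoding step can be tried without a network connection. The object returned by response.info() is an http.client.HTTPMessage, a subclass of email.message.Message, so the same get_content_charset() call works on a Message built by hand. The sketch below relies on that; decode_body is a hypothetical helper name, and falling back to UTF-8 for a missing or unknown charset is an assumption, not a rule every site follows.

```python
from email.message import Message


def decode_body(data, content_type, default="utf-8"):
    """Decode response bytes using the charset declared in a Content-Type header value."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset() or default
    try:
        return data.decode(charset, errors="replace")
    except LookupError:
        # The server declared a charset name that Python does not know.
        return data.decode(default, errors="replace")


# A declared charset is honoured; a missing one falls back to the default.
print(decode_body("café".encode("latin-1"), "text/html; charset=ISO-8859-1"))
print(decode_body("café".encode("utf-8"), "text/html"))
```

With a live response, the header value would come from response.headers.get('Content-Type', '').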

7: Modifying the Scrapy user agent

Sometimes the default Scrapy user agent ("Scrapy/VERSION (+http://scrapy.org)") is blocked by the host. To change the default user agent, open settings.py, then uncomment and edit the following line:

#USER_AGENT = 'projectName (+http://www.yourdomain.com)'

For example:

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'

8: Scraping with curl

Imports:

from subprocess import Popen, PIPE
from lxml import etree
from io import StringIO

Downloading:

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'
url = 'http://stackoverflow.com'
get = Popen(['curl', '-s', '-A', user_agent, url], stdout=PIPE)
result = get.stdout.read().decode('utf8')

-s: silent download
-A: user agent flag

Parsing:

tree = etree.parse(StringIO(result), etree.HTMLParser())
divs = tree.xpath('//div')
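
The download step can also be written with subprocess.run, which waits for curl to exit and can raise an exception on a non-zero exit status. This is only a sketch under stated assumptions: build_curl_command and fetch_with_curl are hypothetical helper names, and the extra curl flags -L (follow redirects) and --max-time (overall time limit in seconds) are additions that the example above does not use.

```python
import subprocess


def build_curl_command(url, user_agent, max_seconds=30):
    """Build the curl argument list: silent, follow redirects, time-limited, custom user agent."""
    return ["curl", "-s", "-L", "--max-time", str(max_seconds), "-A", user_agent, url]


def fetch_with_curl(url, user_agent):
    """Run curl and return the page as text; raises CalledProcessError if curl fails."""
    completed = subprocess.run(
        build_curl_command(url, user_agent),
        capture_output=True,  # collect stdout and stderr instead of printing them
        check=True,           # raise if curl exits with a non-zero status
    )
    return completed.stdout.decode("utf-8", errors="replace")


# Inspect the command without running it (no network access needed).
print(build_curl_command("http://stackoverflow.com", "Mozilla/5.0"))
```

The string returned by fetch_with_curl can be passed to the same etree.parse call used in the parsing step. Passing the arguments as a list avoids shell quoting problems with the long user agent string.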



    
        
Python Loose Notes, 92: Web Scraping with Python
