Python + Requests + BeautifulSoup: Build Your First Web Scraper in 10 Minutes
This article is for everyone who wants to learn web scraping but doesn't know where to start, especially programming beginners and developers who want a fast path into data collection.

**The problem it solves:** you want to pull information off a web page but don't know how to write a scraper, and complex HTTP requests and HTML parsing leave you stuck.

**Why I wrote it:** when I started learning scraping I hit plenty of potholes myself, from environment setup to code, and took many detours. Here I share that experience so you can skip the detours and have your first scraper running in 10 minutes.

## Why Beginners Find Scraping Hard

1. **Complex environment setup**: not knowing which libraries to install, version compatibility issues, tangled dependencies.
2. **HTTP requests are confusing**: unfamiliarity with HTTP basics, difficulty constructing request parameters, messy response parsing.
3. **HTML parsing is hard**: complex HTML structures, no idea how to locate elements precisely, muddled extraction logic.

All of these problems have mature solutions. Below we build your first scraper step by step, in the simplest way possible.

## Environment Setup

Required Python libraries:

```bash
# Requests - for sending HTTP requests
pip install requests
# BeautifulSoup4 - for parsing HTML
pip install beautifulsoup4
# lxml - a faster HTML parser backend
pip install lxml
```

Verify the installation:

```python
import requests
from bs4 import BeautifulSoup
import lxml

print("All libraries installed successfully")
```

Recommended development environment: Python 3.6 or later, VS Code or PyCharm as the editor, and Chrome for debugging.

## Step-by-Step Walkthrough

### Step 1: Send your first HTTP request

We start with a simple page and fetch its content:

```python
import requests

# Send a GET request
url = "http://httpbin.org/get"
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    print("Request succeeded")
    print("Response body:")
    print(response.text[:500])  # show only the first 500 characters
else:
    print(f"Request failed, status code: {response.status_code}")
```

Notes: `requests.get()` sends a GET request; `response.status_code` holds the HTTP status code; 200 means success.

### Step 2: Set request headers

To avoid being flagged as a bot, send realistic request headers:

```python
import requests

# Headers that mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

url = "http://httpbin.org/get"
response = requests.get(url, headers=headers)

print("Headers set successfully")
print("Headers echoed by the server:")
print(response.json()["headers"])
```

Notes: `User-Agent` mimics a real browser; `Accept` tells the server which response types we accept; `Accept-Language` states a language preference.

### Step 3: Parse an HTML page

Now let's parse a real HTML page, using a site built specifically for scraping practice:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page
url = "http://quotes.toscrape.com/"  # a site built for scraping practice
response = requests.get(url)

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, "lxml")

# Find all quotes
quotes = soup.find_all("div", class_="quote")

print(f"Found {len(quotes)} quotes")
print("-" * 50)

for i, quote in enumerate(quotes[:3], 1):  # show only the first 3
    text = quote.find("span", class_="text").text
    author = quote.find("small", class_="author").text
    print(f"{i}. {text}")
    print(f"   —— {author}")
    print()
```

Notes: `BeautifulSoup(response.text, "lxml")` parses the HTML; `find_all()` returns every matching element; `find()` returns the first match; `.text` extracts the text content.

### Step 4: Extract and save the data

Let's save the extracted data to a CSV file:

```python
import requests
from bs4 import BeautifulSoup
import csv

# Fetch and parse the page
url = "http://quotes.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

# Collect the records
quotes_data = []
quotes = soup.find_all("div", class_="quote")

for quote in quotes:
    text = quote.find("span", class_="text").text
    author = quote.find("small", class_="author").text
    tags = [tag.text for tag in quote.find_all("a", class_="tag")]
    quotes_data.append({
        "text": text,
        "author": author,
        "tags": ", ".join(tags),
    })

# Save to a CSV file
filename = "quotes.csv"
with open(filename, "w", newline="", encoding="utf-8-sig") as csvfile:
    fieldnames = ["text", "author", "tags"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(quotes_data)

print(f"Data saved to {filename}")
print(f"Saved {len(quotes_data)} quotes")
```

Notes: `csv.DictWriter` writes dictionaries to CSV; `encoding="utf-8-sig"` makes Chinese text display correctly (e.g. in Excel); `newline=""` prevents blank lines in the CSV file.

## Complete Code

```python
import requests
from bs4 import BeautifulSoup
import csv


def scrape_quotes(url):
    """Scrape all quotes from the quotes site.

    Args:
        url (str): the URL to scrape

    Returns:
        list: a list of quote records
    """
    try:
        # Send the HTTP request
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
        }
        response = requests.get(url, headers=headers, timeout=10)

        # Check whether the request succeeded
        if response.status_code != 200:
            print(f"Request failed, status code: {response.status_code}")
            return []

        # Parse the HTML
        soup = BeautifulSoup(response.text, "lxml")

        # Extract the data
        quotes_data = []
        quotes = soup.find_all("div", class_="quote")
        for quote in quotes:
            text = quote.find("span", class_="text").text.strip()
            author = quote.find("small", class_="author").text.strip()
            tags = [tag.text.strip() for tag in quote.find_all("a", class_="tag")]
            quotes_data.append({
                "text": text,
                "author": author,
                "tags": ", ".join(tags),
            })
        return quotes_data

    except requests.exceptions.RequestException as e:
        print(f"Request error: {e}")
        return []
    except Exception as e:
        print(f"Unexpected error: {e}")
        return []


def save_to_csv(data, filename):
    """Save records to a CSV file.

    Args:
        data (list): the records to save
        filename (str): the output file name
    """
    if not data:
        print("No data to save")
        return
    try:
        with open(filename, "w", newline="", encoding="utf-8-sig") as csvfile:
            fieldnames = ["text", "author", "tags"]
            writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(data)
        print(f"Data saved to {filename}")
        print(f"Saved {len(data)} records")
    except Exception as e:
        print(f"Error while saving: {e}")


def main():
    """Entry point."""
    url = "http://quotes.toscrape.com/"
    print("Scraping quotes...")
    print(f"Target URL: {url}")
    print("-" * 50)

    # Scrape the data
    quotes_data = scrape_quotes(url)

    if quotes_data:
        # Save it
        save_to_csv(quotes_data, "quotes.csv")

        # Show the first 5 records
        print("\nSample of scraped data (first 5):")
        print("-" * 50)
        for i, quote in enumerate(quotes_data[:5], 1):
            print(f"{i}. {quote['text']}")
            print(f"   —— {quote['author']}")
            print(f"   Tags: {quote['tags']}")
            print()
    else:
        print("Scraping failed; check your network connection and the URL")


if __name__ == "__main__":
    main()
```

GitHub repo: https://github.com/your-username/python-crawler-tutorial

## Pitfall Guide

### Pitfall 1: Chinese encoding problems

**Problem:** the program raises `UnicodeDecodeError` or Chinese text comes out garbled.

**Symptom:** `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 0`

**Cause:** the page encoding and the program's encoding disagree.

**Fix:**

```python
# Set the encoding after the request
response = requests.get(url)
response.encoding = response.apparent_encoding  # auto-detect the encoding

# or force UTF-8
response.encoding = "utf-8"
```

### Pitfall 2: User-Agent detected

**Problem:** the site detects the scraper and returns a 403 error or a CAPTCHA.

**Symptom:** `requests.exceptions.HTTPError: 403 Client Error: Forbidden`

**Cause:** the default User-Agent identifies the client as a bot.

**Fix:**

```python
# Use a more realistic User-Agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}

# Or rotate through a pool of User-Agents
import random

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
]
headers["User-Agent"] = random.choice(user_agents)
```

### Pitfall 3: HTML element not found

**Problem:** `find()` can't locate the element you asked for.

**Symptom:** `AttributeError: 'NoneType' object has no attribute 'text'`

**Cause:** the page structure changed, or the selector is wrong.

**Fix:**

```python
# Check that the element exists first
quote = soup.find("div", class_="quote")
if quote:
    text = quote.find("span", class_="text").text
    print(text)
else:
    print("Quote element not found")

# Or use the more flexible CSS selectors
quotes = soup.select("div.quote")
for quote in quotes:
    text = quote.select_one("span.text").text
    print(text)
```

### Pitfall 4: Connection timeouts

**Problem:** a request gets no response for a long time and the program hangs.

**Symptom:** the program waits and eventually raises a timeout exception.

**Cause:** an unstable network or a slow target site.

**Fix:**

```python
# Set a timeout
response = requests.get(url, timeout=10)  # 10-second timeout

# Handle the exceptions with try/except
try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise on HTTP error status codes
except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.ConnectionError:
    print("Connection error")
except requests.exceptions.RequestException as e:
    print(f"Request error: {e}")
```

### Pitfall 5: Saving data fails

**Problem:** the CSV file fails to save, or the data comes out malformed.

**Symptom:** `UnicodeEncodeError: 'gbk' codec can't encode character`

**Cause:** a file-encoding problem or badly shaped data.

**Fix:**

```python
# Use the right encoding
with open("quotes.csv", "w", newline="", encoding="utf-8-sig") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["text", "author", "tags"])  # header row
    writer.writerows(data)  # data rows

# Or use pandas
import pandas as pd

df = pd.DataFrame(data)
df.to_csv("quotes.csv", index=False, encoding="utf-8-sig")
```

## Results

Running the scraper produces output like this:

```
Scraping quotes...
Target URL: http://quotes.toscrape.com/
--------------------------------------------------
Data saved to quotes.csv
Saved 10 records

Sample of scraped data (first 5):
--------------------------------------------------
1. The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.
   —— Albert Einstein
   Tags: change, deep-thoughts, thinking, world

2. It is our choices, Harry, that show what we truly are, far more than our abilities.
   —— J.K. Rowling
   Tags: choices

3. There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.
   —— Albert Einstein
   Tags: inspirational, life, live, miracle, miracles

4. The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.
   —— Jane Austen
   Tags: classic, literature

5. Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.
   —— Marilyn Monroe
   Tags: be-yourself, inspirational
```

The generated CSV file looks like this:

```
text,author,tags
The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.,Albert Einstein,"change, deep-thoughts, thinking, world"
"It is our choices, Harry, that show what we truly are, far more than our abilities.",J.K. Rowling,choices
There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.,Albert Einstein,"inspirational, life, live, miracle, miracles"
"The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.",Jane Austen,"classic, literature"
"Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.",Marilyn Monroe,"be-yourself, inspirational"
```

## Wrap-Up

Today we built a first web scraper end to end, from environment setup through saving data. Simple as it is, it covers the core elements of scraper development: HTTP requests, HTML parsing, data extraction, and file storage.
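
One optional refinement not in the tutorial above: because the extraction logic needs network access, it is awkward to test. Factoring the BeautifulSoup parsing into a pure function that takes HTML text lets you unit-test it against a saved snippet. The `SAMPLE_HTML` below is made up for illustration; it only mirrors the structure of quotes.toscrape.com:

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet mirroring the structure of quotes.toscrape.com
SAMPLE_HTML = """
<div class="quote">
  <span class="text">Simplicity is the soul of efficiency.</span>
  <small class="author">Austin Freeman</small>
  <a class="tag">simplicity</a>
  <a class="tag">efficiency</a>
</div>
"""


def parse_quotes(html):
    """Extract quote records from HTML text (same selectors as the tutorial)."""
    # "html.parser" is the stdlib backend; pass "lxml" instead if it is installed
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for quote in soup.find_all("div", class_="quote"):
        records.append({
            "text": quote.find("span", class_="text").text.strip(),
            "author": quote.find("small", class_="author").text.strip(),
            "tags": [t.text.strip() for t in quote.find_all("a", class_="tag")],
        })
    return records
```

With this split, the network code only fetches `response.text` and hands it to `parse_quotes`, so a change in the site's markup can be debugged against a saved HTML file.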
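
The tutorial scrapes only the first page, but quotes.toscrape.com paginates its quotes behind a "Next" button. As a next step, here is a hedged sketch of following that link until it disappears; the `li.next > a` selector matches that site's markup today and will need adjusting for other sites:

```python
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def find_next_href(html):
    """Return the href of the pagination 'Next' link, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next > a")  # pagination markup on quotes.toscrape.com
    return link["href"] if link else None


def scrape_all_pages(start_url, delay=1.0):
    """Fetch every page by following 'Next' links, pausing between requests."""
    url = start_url
    pages = []
    while url:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        pages.append(response.text)
        next_href = find_next_href(response.text)
        # urljoin resolves relative links like "/page/2/" against the current URL
        url = urljoin(url, next_href) if next_href else None
        if url:
            time.sleep(delay)  # be polite: pause before the next request
    return pages
```

Calling `scrape_all_pages("http://quotes.toscrape.com/")` should walk through every page of that site; the per-request delay keeps the crawl gentle on the server.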