In this web scraping tutorial we'll take a deep dive into crawling with Python – a powerful form of web scraping that not only collects data but also figures out how to find it. The main appeal of crawling is its broad applicability – a single crawler can implicitly handle many different domains and document structures. This makes it a great tool for two purposes:
- Indexing, e.g. building search engines and discovering specific web pages.
- Broad scraping, meaning scraping many different websites with the same scraper program.
In this Python tutorial we'll take a look at common crawling concepts and challenges. To solidify all of this knowledge, we'll write our own example project by creating a crawler for any Shopify-powered website, such as the NYTimes store!
What is Crawling and Scraping?
In essence, crawling is web scraping with an exploration capability. When we scrape, we usually work from a well-defined list of URLs, e.g. "scrape these product pages of this e-commerce store". When we crawl, on the other hand, we work with a looser set of rules, e.g. "find all product pages on any of these websites and scrape them". The key difference is that crawlers are intelligent explorers, while web scrapers are focused workers.
If crawling is so great, why don't we crawl everything?
Crawling is simply more resource-intensive and harder to develop, because we have to deal with a whole new set of problems introduced by the exploration component.
What are common uses of crawling in scraping?
The most common use of crawling in web scraping is target discovery when a website offers no directory of targets or sitemap. For example, if an e-commerce website has no product directory, we can crawl all of its pages and find every product through backlinks such as "related products" sections.
What is broad crawling?
Broad crawling is a form of crawling where the crawler navigates many different domains rather than a single domain or website. Broad crawlers have to be extra diligent: they must account for many different web technologies and be able to avoid spam, invalid documents and even resource attacks.
Setup
In this tutorial we'll be using a few tools for Python web crawler development:
- httpx as our HTTP client to retrieve URLs. Alternatively, feel free to follow along with requests, which is a popular alternative.
- parsel to parse the HTML tree. Alternatively, feel free to follow along with beautifulsoup, which is a popular alternative.
- w3lib and tldextract to parse URL structures.
- loguru for nicely formatted logs, so the output is easier to follow.
These Python packages can be installed via the pip install console command:
```
$ pip install httpx parsel w3lib tldextract loguru
```
We'll also be using asynchronous Python to speed up our crawler, since web crawling is very connection-intensive.
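To illustrate why async matters here, below is a minimal sketch (not part of the final crawler; the httpbin.org URLs are just placeholder targets) that fetches a handful of pages concurrently with httpx.AsyncClient and asyncio.gather – the same pattern our crawler will rely on later:

```python
import asyncio

import httpx


async def fetch_all(urls):
    # a single shared client reuses connections; all requests run concurrently
    async with httpx.AsyncClient(timeout=httpx.Timeout(30.0)) as client:
        tasks = [client.get(url, follow_redirects=True) for url in urls]
        return await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    urls = [f"http://httpbin.org/links/10/{i}" for i in range(3)]
    for result in asyncio.run(fetch_all(urls)):
        print(result)
```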
Crawler Components
The most important part of a web crawler is its exploration mechanism, which introduces many new components such as URL discovery and filtering. To understand it, let's take a look at the general flow of a crawl loop:
The crawler starts with a pool of URLs (the initial seeds are usually called start urls) and scrapes their responses (HTML data). Then one or two processing steps are performed:
- The responses are parsed for more URLs to follow, which are filtered and added to the pool of the next crawl loop iteration.
- Optionally, a callback is fired to process the response for indexing, archiving or just general data parsing.
This loop repeats until no new URLs are found or an end condition is met, such as reaching a maximum crawl depth or a certain count of collected URLs. That sounds pretty straightforward, so let's build a crawler ourselves! We'll start with the parser and the filter, since these two make up the most important part of our crawler: the exploration.
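Before diving into the individual components, here's the whole loop condensed into a bare-bones synchronous sketch – a toy version using httpbin.org link pages as a stand-in target, not the real implementation (which adds URL filtering, async requests and callbacks below):

```python
from urllib.parse import urljoin

import httpx
from parsel import Selector


def crawl(start_urls, max_depth=3):
    """toy crawl loop: fetch the url pool, collect new links, repeat"""
    url_pool, seen, depth = list(start_urls), set(start_urls), 0
    while url_pool and depth <= max_depth:
        # 1. retrieve responses for the current url pool
        responses = [httpx.get(url, follow_redirects=True) for url in url_pool]
        # 2. parse the responses for more urls to follow
        found = set()
        for response in responses:
            for href in Selector(text=response.text).xpath("//a/@href").getall():
                found.add(urljoin(str(response.url), href.strip()))
        # 3. filter: only keep urls we haven't visited yet
        url_pool = [url for url in found if url not in seen]
        seen.update(url_pool)
        depth += 1
    return seen


print(crawl(["http://httpbin.org/links/5/0"]))
```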
HTML Parsing and URL Filtering
We can easily extract all URLs from an HTML document by collecting the href attribute of every <a> node. For this we can use the parsel package with either CSS or XPath selectors. Let's add a simple URL extractor function:
```python
from typing import List
from urllib.parse import urljoin

import httpx
from parsel import Selector


def extract_urls(response: httpx.Response) -> List[str]:
    tree = Selector(text=response.text)
    # using XPath
    urls = tree.xpath('//a/@href').getall()
    # or CSS
    urls = tree.css('a::attr(href)').getall()
    # we should turn all relative urls to absolute, e.g. /foo.html to https://domain.com/foo.html
    urls = [urljoin(str(response.url), url.strip()) for url in urls]
    return urls
```
Example run and output:
```python
response = httpx.get("http://httpbin.org/links/10/1")
for url in extract_urls(response):
    print(url)
```
```
http://httpbin.org/links/10/0
http://httpbin.org/links/10/2
http://httpbin.org/links/10/3
http://httpbin.org/links/10/4
http://httpbin.org/links/10/5
http://httpbin.org/links/10/6
http://httpbin.org/links/10/7
http://httpbin.org/links/10/8
http://httpbin.org/links/10/9
```
Our URL extractor is pretty primitive, and we can't use it in our crawler as is, because it produces duplicate and non-crawlable URLs (such as downloadable files). The next component we need is a filter that can:
- Normalize the found URLs and deduplicate them.
- Filter out offsite URLs (URLs pointing to different domains, like links to social media etc.)
- Filter out URLs that cannot be crawled (like file download links)
For this we'll use the w3lib and tldextract libraries, which offer great utility functions for dealing with URLs. Let's use them to write our URL filter, which will weed out bad and already-seen URLs:
```python
from typing import List, Pattern
import posixpath
from urllib.parse import urlparse

from tldextract import tldextract
from w3lib.url import canonicalize_url
from loguru import logger as log


class UrlFilter:
    IGNORED_EXTENSIONS = [
        # archives
        '7z', '7zip', 'bz2', 'rar', 'tar', 'tar.gz', 'xz', 'zip',
        # images
        'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
        'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg', 'cdr', 'ico',
        # audio
        'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',
        # video
        '3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
        'm4a', 'm4v', 'flv', 'webm',
        # office suites
        'xls', 'xlsx', 'ppt', 'pptx', 'pps', 'doc', 'docx', 'odt', 'ods', 'odg', 'odp',
        # other
        'css', 'pdf', 'exe', 'bin', 'rss', 'dmg', 'iso', 'apk',
    ]

    def __init__(self, domain: str = None, subdomain: str = None, follow: List[Pattern] = None) -> None:
        # restrict filtering to a specific registered domain
        self.domain = domain or ""
        # restrict filtering to a specific subdomain
        self.subdomain = subdomain or ""
        self.follow = follow or []
        log.info(f"filter created for domain {self.subdomain}.{self.domain} with follow rules {follow}")
        self.seen = set()

    def is_valid_ext(self, url):
        """ignore non-crawlable documents"""
        # splitext returns the extension with a leading dot, e.g. ".pdf"
        ext = posixpath.splitext(urlparse(url).path)[1].lower().lstrip(".")
        return ext not in self.IGNORED_EXTENSIONS

    def is_valid_scheme(self, url):
        """ignore non http/s links"""
        return urlparse(url).scheme in ['https', 'http']

    def is_valid_domain(self, url):
        """ignore offsite urls"""
        parsed = tldextract.extract(url)
        return parsed.registered_domain == self.domain and parsed.subdomain == self.subdomain

    def is_valid_path(self, url):
        """ignore urls of undesired paths"""
        if not self.follow:
            return True
        path = urlparse(url).path
        for pattern in self.follow:
            if pattern.match(path):
                return True
        return False

    def is_new(self, url):
        """ignore visited urls (in canonical form)"""
        return canonicalize_url(url) not in self.seen

    def filter(self, urls: List[str]) -> List[str]:
        """filter list of urls"""
        found = []
        for url in urls:
            if not self.is_valid_scheme(url):
                log.debug(f"drop ignored scheme {url}")
                continue
            if not self.is_valid_domain(url):
                log.debug(f"drop domain mismatch {url}")
                continue
            if not self.is_valid_ext(url):
                log.debug(f"drop ignored extension {url}")
                continue
            if not self.is_valid_path(url):
                log.debug(f"drop ignored path {url}")
                continue
            if not self.is_new(url):
                log.debug(f"drop duplicate {url}")
                continue
            self.seen.add(canonicalize_url(url))
            found.append(url)
        return found
```
Example run and output:
```python
from typing import List
from urllib.parse import urljoin

import httpx
from parsel import Selector


def extract_urls(response: httpx.Response) -> List[str]:
    tree = Selector(text=response.text)
    # using XPath
    urls = tree.xpath('//a/@href').getall()
    # or CSS
    urls = tree.css('a::attr(href)').getall()
    # we should turn all relative urls to absolute, e.g. /foo.html to https://domain.com/foo.html
    urls = [urljoin(str(response.url), url.strip()) for url in urls]
    return urls


nytimes_filter = UrlFilter("nytimes.com", "store")
response = httpx.get("https://store.nytimes.com")
urls = extract_urls(response)

filtered = nytimes_filter.filter(urls)
filtered_2nd_page = nytimes_filter.filter(urls)
print(filtered)
print(filtered_2nd_page)
```
Note that the second filter() call produces no results, because those URLs have already been seen and are filtered out:
```
['https://store.nytimes.com/collections/best-sellers', 'https://store.nytimes.com/collections/gifts-under-25', 'https://store.nytimes.com/collections/gifts-25-50', 'https://store.nytimes.com/collections/gifts-50-100', 'https://store.nytimes.com/collections/gifts-over-100', 'https://store.nytimes.com/collections/gift-sets', 'https://store.nytimes.com/collections/apparel', 'https://store.nytimes.com/collections/accessories', 'https://store.nytimes.com/collections/babies-kids', 'https://store.nytimes.com/collections/books', 'https://store.nytimes.com/collections/home-office', 'https://store.nytimes.com/collections/toys-puzzles-games', 'https://store.nytimes.com/collections/wall-art', 'https://store.nytimes.com/collections/sale', 'https://store.nytimes.com/collections/cooking', 'https://store.nytimes.com/collections/black-history', 'https://store.nytimes.com/collections/games', 'https://store.nytimes.com/collections/early-edition', 'https://store.nytimes.com/collections/local-edition', 'https://store.nytimes.com/collections/pets', 'https://store.nytimes.com/collections/the-verso-project', 'https://store.nytimes.com/collections/custom-books', 'https://store.nytimes.com/collections/custom-reprints', 'https://store.nytimes.com/products/print-newspapers', 'https://store.nytimes.com/collections/special-sections', 'https://store.nytimes.com/pages/corporate-gifts', 'https://store.nytimes.com/pages/about-us', 'https://store.nytimes.com/pages/contact-us', 'https://store.nytimes.com/pages/faqs', 'https://store.nytimes.com/pages/return-policy', 'https://store.nytimes.com/pages/terms-of-sale', 'https://store.nytimes.com/pages/terms-of-service', 'https://store.nytimes.com/pages/image-licensing', 'https://store.nytimes.com/pages/privacy-policy', 'https://store.nytimes.com/search', 'https://store.nytimes.com/', 'https://store.nytimes.com/account/login', 'https://store.nytimes.com/cart', 'https://store.nytimes.com/products/the-custom-birthday-book', 'https://store.nytimes.com/products/new-york-times-front-page-reprint', 'https://store.nytimes.com/products/stacked-logo-baseball-cap', 'https://store.nytimes.com/products/new-york-times-front-page-jigsaw', 'https://store.nytimes.com/products/new-york-times-swell-water-bottle', 'https://store.nytimes.com/products/super-t-sweatshirt', 'https://store.nytimes.com/products/cooking-apron', 'https://store.nytimes.com/products/new-york-times-travel-tumbler', 'https://store.nytimes.com/products/debossed-t-mug', 'https://store.nytimes.com/products/herald-tribune-breathless-t-shirt', 'https://store.nytimes.com/products/porcelain-logo-mug', 'https://store.nytimes.com/products/the-ultimate-birthday-book', 'https://store.nytimes.com/pages/shipping-processing']
[]
```
This generic filter will ensure our crawler avoids redundant or invalid targets. We could extend it further with more rules, like a link scoring system or explicit follow rules, but for now let's take a look at the rest of our crawler.
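For instance, the follow parameter our filter already accepts can act as a simple explicit-follow rule set. Here's a sketch of restricting the filter to collection and product paths only – the regex patterns and example URLs below are purely illustrative:

```python
import re

# only follow collection and product pages on store.nytimes.com
strict_filter = UrlFilter(
    domain="nytimes.com",
    subdomain="store",
    follow=[re.compile(r"/collections/.+"), re.compile(r"/products/.+")],
)
print(strict_filter.filter([
    "https://store.nytimes.com/collections/apparel",  # kept: matches a follow pattern
    "https://store.nytimes.com/pages/about-us",       # dropped: path not in follow rules
    "https://www.nytimes.com/section/world",          # dropped: different subdomain
]))
```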
The Crawl Loop
Now that we have our exploration logic ready, all we're missing is a crawl loop that takes advantage of it. First, we need a client to retrieve the page data. Most commonly, an HTTP client like httpx or requests can be used to scrape any HTML page. However, with an HTTP client alone we might not be able to scrape highly dynamic content such as javascript-powered web apps or single page applications (SPAs). To scrape such targets we need a javascript execution context like a headless web browser, which can be run through browser automation tools such as Playwright, Selenium or Puppeteer. For now, let's stick with httpx:
```python
import asyncio
from typing import Callable, Dict, List, Optional, Pattern, Tuple
from urllib.parse import urljoin

import httpx
from parsel import Selector
from loguru import logger as log

from snippet2 import UrlFilter  # the UrlFilter class defined in the previous section


class Crawler:
    def __init__(self, filter: UrlFilter, callbacks: Optional[Dict[Pattern, Callable]] = None) -> None:
        self.url_filter = filter
        self.callbacks = callbacks or {}

    async def __aenter__(self):
        self.session = await httpx.AsyncClient(
            # when crawling we should use a generous timeout
            timeout=httpx.Timeout(60.0),
            # we should also limit our connections to not put too much stress on the server
            limits=httpx.Limits(max_connections=5),
            # we should use common web browser header values to avoid being blocked
            headers={
                "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
                "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                "accept-language": "en-US;en;q=0.9",
                "accept-encoding": "gzip, deflate, br",
            },
        ).__aenter__()
        return self

    async def __aexit__(self, *args, **kwargs):
        await self.session.__aexit__(*args, **kwargs)

    def parse(self, responses: List[httpx.Response]) -> List[str]:
        """find valid urls in responses"""
        all_unique_urls = set()
        for response in responses:
            sel = Selector(text=response.text, base_url=str(response.url))
            _urls_in_response = set(
                urljoin(str(response.url), url.strip())
                for url in sel.xpath("//a/@href").getall()
            )
            all_unique_urls |= _urls_in_response
        urls_to_follow = self.url_filter.filter(all_unique_urls)
        log.info(f"found {len(urls_to_follow)} urls to follow (from total {len(all_unique_urls)})")
        return urls_to_follow

    async def scrape(self, urls: List[str]) -> Tuple[List[httpx.Response], List[Exception]]:
        """scrape urls and return their responses"""
        responses = []
        failures = []
        log.info(f"scraping {len(urls)} urls")

        async def _scrape(url):
            return await self.session.get(url, follow_redirects=True)

        tasks = [_scrape(url) for url in urls]
        for result in await asyncio.gather(*tasks, return_exceptions=True):
            if isinstance(result, httpx.Response):
                responses.append(result)
            else:
                failures.append(result)
        return responses, failures

    async def run(self, start_urls: List[str], max_depth=5) -> None:
        """crawl target to maximum depth or until no more urls are found"""
        url_pool = start_urls
        depth = 0
        while url_pool and depth <= max_depth:
            responses, failures = await self.scrape(url_pool)
            log.info(f"depth {depth}: scraped {len(responses)} pages and failed {len(failures)}")
            url_pool = self.parse(responses)
            await self.callback(responses)
            depth += 1

    async def callback(self, responses):
        for response in responses:
            for pattern, fn in self.callbacks.items():
                if pattern.match(str(response.url)):
                    log.debug(f"found matching callback for {response.url}")
                    fn(response=response)
```
Above, we defined our Crawler object, which implements all of the steps from our crawl loop diagram:
- The scrape method retrieves URLs through httpx's asynchronous HTTP client.
- The parse method parses responses for more URLs to follow using parsel's XPath selectors.
- The callback method implements product parsing functionality, since some of the pages we crawl are products.
- The run method ties everything together in the crawl loop.
To understand this further, let's take our crawler for a spin with an example project!
Example Crawler Project: Crawling Shopify
Crawlers are great for scraping generic websites whose exact structure we don't know. In particular, crawlers let us easily scrape websites built on the same web framework or web platform. Write once – apply everywhere! In this tutorial, let's take a look at how we can use the crawler we've built to crawl almost any Shopify-powered e-commerce website. For example, let's start with the NYTimes store, which is powered by Shopify. We can begin by identifying our goal – a product for sale, such as this Stacked Logo Shirt T-shirt. Just by looking at the URL we can see that every product URL contains a /products/ part – and that is how we'll tell our crawler which responses to parse with a callback: every URL that contains this text:
```python
import asyncio
import re


def parse_product(response):
    print(f"found product: {response.url}")


async def run():
    callbacks = {
        # any url that contains "/products/" is a product page
        re.compile(".+/products/.+"): parse_product
    }
    url_filter = UrlFilter(domain="nytimes.com", subdomain="store")
    async with Crawler(url_filter, callbacks=callbacks) as crawler:
        await crawler.run(["https://store.nytimes.com/"])


if __name__ == "__main__":
    asyncio.run(run())
```
In the example above, we added a callback for any scraped response whose URL contains /products/. If we run it, we'll see hundreds of lines with product URLs printed. Next, let's take a look at how to parse product information during the crawl.
Parsing Crawled Data
Usually we don't know what kind of content structure our Python crawler will encounter, so when parsing crawled content we should look for generic parsing algorithms. In Shopify's case, product data usually appears in the HTML body as a JSON object embedded in a <script> node. This means we can design a generic parser that extracts JSON objects from every <script> tag containing known keys. If we take a look at a product object from the NYTimes store, we can see some common patterns:
{ "id": 6708867694662, "title": "Stacked Logo Shirt", "handle": "stacked-logo-shirt", "published_at": "2022-07-15T11:36:23-04:00", "created_at": "2021-10-20T11:10:55-04:00", "vendor": "NFS", "type": "Branded", "tags": [ "apparel", "branded", "category-apparel", "discontinued", "discountable-product", "gifts-25-50", "price-50", "processing-nfs-regular", "recipient-men", "recipient-women", "sales-soft-goods", "sizeway" ], "price": 2600, "price_min": 2600, "price_max": 2600, "available": true, "..." }
All products contain keys like "published_at" or "price". With a bit of parsing magic we can easily extract objects that contain such keys:
```python
import json
from typing import Dict, List

import httpx
from parsel import Selector


def extract_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and yield the decoded JSON data"""
    pos = 0
    while True:
        match = text.find('{', pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            yield result
            pos = match + index
        except ValueError:
            pos = match + 1


def find_json_in_script(response: httpx.Response, keys: List[str]) -> List[Dict]:
    """find all json objects in HTML <script> tags that contain specified keys"""
    scripts = Selector(text=response.text).xpath('//script/text()').getall()
    objects = []
    for script in scripts:
        if not all(f'"{k}"' in script for k in keys):
            continue
        objects.extend(extract_json_objects(script))
    return [obj for obj in objects if all(k in str(obj) for k in keys)]
```
Example run and output:
url = "https://store.nytimes.com/collections/apparel/products/a1-stacked-logo-shirt" response = httpx.get(url) products = find_json_in_script(response, ["published_at", "price"]) print(json.dumps(products, indent=2, ensure_ascii=False)[:500])
Which scrapes results like (truncated):
{ "id": 6708867694662, "title": "Stacked Logo Shirt", "handle": "stacked-logo-shirt", "description": "\u003cp\u003eWe’ve gone bigger and bolder with the iconic Times logo, spreading it over three lines on this unisex T-shirt so our name can be seen from a distance when you walk down the streets of Brooklyn or Boston.\u003c/p\u003e\n\u003c!-- split --\u003e\n\u003cp\u003eThese days T-shirts are the hippest way to make a statement or express an emotion, and our Stacked Logo Shirt lets you show your support for America’s preeminent newspaper.\u003c/p\u003e\n\u003cp\u003eOur timeless masthead logo is usually positioned on one line, but to increase the lettering our designers have stacked the words. The result: The Times name is large, yet discreet, so you can keep The Times close to your heart without looking like a walking billboard.\u003c/p\u003e\n\u003cp\u003eThis shirt was made by Royal Apparel, who launched in the early '90s on a desk in the Garment District of Manhattan. As a vast majority of the fashion industry moved production overseas, Royal Apparel stayed true to their made in USA mission and became a leader in American-made and eco-friendly garment production in the country.\u003c/p\u003e", "published_at": "2022-07-15T11:36:23-04:00", "created_at": "2021-10-20T11:10:55-04:00", "vendor": "NFS", "type": "Branded", "tags": [ "apparel", "branded", "category-apparel", "discontinued", "discountable-product", "gifts-25-50", "price-50", "processing-nfs-regular", "recipient-men", "recipient-women", "sales-soft-goods", "sizeway" ], "price": 2600, "price_min": 2600, "price_max": 2600, "available": true, ... }
This approach can help us extract data from many different Shopify-powered websites across the web! Finally, let's plug it into our crawler:
```python
# ...
import asyncio

results = []


def parse_product(response):
    products = find_json_in_script(response, ["published_at", "price"])
    results.extend(products)
    if not products:
        log.warning(f"could not find product data in {response.url}")


async def run():
    callbacks = {
        # any url that contains "/products/" is a product page
        re.compile(".+/products/.+"): parse_product
    }
    url_filter = UrlFilter(domain="nytimes.com", subdomain="store")
    async with Crawler(url_filter, callbacks=callbacks) as crawler:
        await crawler.run(["https://store.nytimes.com/"])
    print(results)


if __name__ == "__main__":
    asyncio.run(run())
```
Now, if we run our crawler again, we won't just explore and find products – we'll scrape all of their data as well!
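Since results is just a list of plain dicts, persisting the crawl output is straightforward. As a minimal illustration (not part of the script above; the products.json path is arbitrary), the collected objects could be dumped to a JSON file at the end of run():

```python
import json


def save_results(products, path="products.json"):
    """write the collected product objects to disk as pretty-printed JSON"""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(products, f, indent=2, ensure_ascii=False)


# e.g. call save_results(results) inside run(), after `await crawler.run(...)`
```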
The Scrapy Framework for Crawling
Scrapy is a popular web scraping framework in Python, and it comes with a strong crawling feature set. Scrapy's web spider class CrawlSpider implements the same crawling algorithm we covered in this article. Scrapy ships with many batteries-included features, such as retrying bad responses and efficient request scheduling. However, being a complete framework, Scrapy can be difficult to patch and to integrate with other Python technologies.
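For comparison, a minimal CrawlSpider that roughly mirrors our Shopify example might look like the sketch below – the class name, rules and parsed fields are illustrative, not a drop-in replacement for the crawler we built:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class NYTStoreSpider(CrawlSpider):
    name = "nytimes-store"
    allowed_domains = ["store.nytimes.com"]
    start_urls = ["https://store.nytimes.com/"]
    rules = [
        # parse pages whose url contains /products/ and keep following their links
        Rule(LinkExtractor(allow=r"/products/"), callback="parse_product", follow=True),
        # follow all other on-site links without a callback
        Rule(LinkExtractor(), follow=True),
    ]

    def parse_product(self, response):
        yield {"url": response.url, "title": response.css("h1::text").get()}
```

Saved as a standalone file, such a spider can be launched with the scrapy runspider command.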
For more on Scrapy, see our full introduction article, which covers all of these concepts from that web scraping framework's perspective.
FAQ
To wrap up this article, let's take a look at some frequently asked questions about web crawling in Python:
What's the difference between scraping and crawling?
Crawling is web scraping with exploration capability. Web scrapers are programs with explicit scraping rules, while crawlers tend to have more creative navigation algorithms. Crawlers are often used for broad crawls – where many different domains are crawled by the same program.
What is crawling used for?
Crawling is commonly used to collect generic datasets, e.g. for data science or machine learning training. It's also used to generate web indexes for search engines such as Google. Finally, crawling is used in web scraping where target discovery isn't possible through a sitemap or directory system – the crawler simply explores every link on the website to find, say, all of the product links.
How to speed up crawling?
The best way to speed up crawling is to convert your crawler into an asynchronous program. Crawling performs many more requests than targeted web scraping, so crawlers suffer heavily from IO blocking – in other words, a crawler often sits idle waiting for web servers to respond. Good asynchronous program design can speed the program up a thousandfold!
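One caveat: raw concurrency also needs a cap so the crawler doesn't overwhelm the target server. A common pattern (shown here as a sketch; our Crawler class above achieves the same effect through httpx.Limits) is an asyncio.Semaphore around each request:

```python
import asyncio

import httpx


async def fetch_all(urls, max_concurrency=5):
    # cap the number of requests in flight so we don't overload the server
    semaphore = asyncio.Semaphore(max_concurrency)
    async with httpx.AsyncClient(timeout=httpx.Timeout(60.0)) as client:

        async def fetch(url):
            async with semaphore:
                return await client.get(url, follow_redirects=True)

        return await asyncio.gather(*[fetch(url) for url in urls], return_exceptions=True)
```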
Can I crawl dynamic javascript websites or SPAs?
Yes! However, to crawl dynamic javascript websites and SPAs the crawler needs to be able to execute javascript. The easiest way to do that is with a headless browser automated through the javascript rendering capabilities of Playwright, Selenium or Puppeteer. For more, see our introduction tutorial on how to scrape dynamic websites with a headless web browser.
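As a rough illustration (a minimal Playwright sketch, not integrated with the crawler built above), retrieving the rendered HTML of a page could look like this:

```python
import asyncio

from playwright.async_api import async_playwright


async def render(url: str) -> str:
    # launch a headless Chromium browser, load the page and return the rendered HTML
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        html = await page.content()
        await browser.close()
        return html


# html = asyncio.run(render("https://store.nytimes.com/"))
# the rendered HTML can then be fed to the same URL extraction and parsing logic as before
```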
Summary and Further Reading
In this article we took a deep dive into what web crawling is and how it differs from web scraping. We built our own crawler in just a few lines of code using popular Python packages, and to solidify that knowledge we built an example crawler that can crawl almost any Shopify-powered website by writing a generic parser for JSON-structured data.