How to Rate Limit Async Requests in Python

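When web scraping, we are often constrained by the technical capacity of the target website. We cannot scrape too fast without getting blocked or overwhelming smaller websites.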

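Async HTTP clients like httpx let us easily make hundreds of requests per second, but they provide no way to fine-tune scraping speed. So in this quick tutorial, we'll look at how to rate limit async HTTP connections to slow down our scraper.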

Python httpx

HTTPX is one of the most popular async HTTP clients in Python and can be installed with the pip install terminal command:

$ pip install httpx

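HTTPX supports both sync and async HTTP clients. We're unlikely to need to throttle the sync client, since sync requests run one at a time and are slow to begin with, so let's look at the throttling options for the async client.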

To limit the httpx client, we can use the httpx.Limits object:

import httpx
session = httpx.AsyncClient(
    limits=httpx.Limits(
        max_connections=5  # we can change max connection count here
    )
)

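However, throttling a scraper by connection count is very imprecise: on a slow website, 5 connections might manage only a few requests per minute, while on a fast website they could reach hundreds of requests per minute.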
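
To put rough numbers on this, here is a back-of-the-envelope sketch. The requests_per_minute helper and both response times are made up for illustration, and the formula assumes every connection is kept busy, so treat the results as upper bounds rather than measurements:

```python
def requests_per_minute(max_connections: int, avg_response_seconds: float) -> float:
    """Rough upper bound: each connection completes one request per response time."""
    return max_connections * 60 / avg_response_seconds


# Same connection limit, very different request rates (latencies are made up):
slow_site = requests_per_minute(max_connections=5, avg_response_seconds=60.0)
fast_site = requests_per_minute(max_connections=5, avg_response_seconds=0.5)
print(f"slow site: ~{slow_site:.0f} requests/minute")
print(f"fast site: ~{fast_site:.0f} requests/minute")

# will print:
# slow site: ~5 requests/minute
# fast site: ~600 requests/minute
```

The connection limit is identical in both cases; only the website's response time changes the resulting request rate.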

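To throttle an httpx-powered scraper, we need an extra layer that tracks the requests themselves rather than the connections. Let's take a look at the most popular throttling library: aiometer.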
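
Before reaching for a library, the idea of tracking requests rather than connections can be sketched with plain asyncio. This is only a minimal illustration, not how aiometer is implemented: RequestThrottle and demo are hypothetical names, and the sketch simply spaces task starts one interval apart:

```python
import asyncio
import time


class RequestThrottle:
    """Hypothetical helper: spaces task starts 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second: float):
        self._interval = 1.0 / max_per_second
        self._next_slot = 0.0  # monotonic time of the next free start slot

    async def wait(self) -> None:
        # Reserve the next slot before awaiting, so concurrent callers queue up
        # one interval apart instead of all starting at once.
        now = time.monotonic()
        slot = max(now, self._next_slot)
        self._next_slot = slot + self._interval
        await asyncio.sleep(slot - now)


async def demo(max_per_second: float, count: int) -> list:
    throttle = RequestThrottle(max_per_second)
    began = time.monotonic()

    async def task() -> float:
        await throttle.wait()
        # A real scraper would call `await session.get(url)` here.
        return time.monotonic() - began

    # Each entry is the offset (in seconds) at which a task was allowed to start.
    return await asyncio.gather(*(task() for _ in range(count)))


if __name__ == "__main__":
    offsets = asyncio.run(demo(max_per_second=20, count=5))
    print([f"{offset:.2f}s" for offset in offsets])
```

The start offsets should come out roughly 0.05 seconds apart no matter how fast or slow each request is, which is the property a connection limit cannot give us.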

Python aiometer

The most popular way to throttle async tasks in Python is aiometer, which can be installed with the pip install terminal command:

$ pip install aiometer

Then we can schedule all requests to run through the aiometer limiter:

import asyncio
from time import time

import aiometer
import httpx

session = httpx.AsyncClient()


async def scrape(url):
    response = await session.get(url)
    return response


async def run():
    _start = time()
    urls = ["http://httpbin.org/html" for _ in range(10)]
    # run_on_each only runs the tasks and returns None; use aiometer.run_all
    # or aiometer.amap when the responses need to be collected
    await aiometer.run_on_each(
        scrape,
        urls,
        max_per_second=1,  # here we can set max rate per second
    )
    print(f"finished {len(urls)} requests in {time() - _start:.2f} seconds")


if __name__ == "__main__":
    asyncio.run(run())

# will print:
# finished 10 requests in 9.54 seconds

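In our small example scraper, we use the aiometer.run_on_each function to limit 10 scrape requests to 1 request per second. With this function, we can throttle our scraper to a precise requests-per-second rate.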