Web scraping at scale usually requires a large pool of proxies to avoid rate limiting, throttling or even blocking. Since proxies are an expensive resource, allocating them correctly can make a huge difference in overall resource usage. In this article we'll take a look at what proxy rotation is, how it helps to avoid blocking, and some common rotation patterns, tips and tricks.
Why Rotate Proxy IPs?
When web scraping with a proxy pool, our chances of a successful connection can be affected by our connection patterns. In other words, if proxy A connects to 50 pages within 5 seconds, it's likely to get throttled or blocked. But if we have proxies A, B and C and let them take turns, we avoid pattern-based rate limiting and blocking. So, what's the best way to rotate proxies?
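The "taking turns" idea above is just round-robin rotation. As a minimal sketch (with placeholder addresses), it can be expressed with `itertools.cycle`:

```python
from itertools import cycle

# hypothetical proxy pool - addresses are placeholders
proxies = ["xx.xx.123.1", "xx.xx.123.2", "xx.xx.123.3"]
proxy_pool = cycle(proxies)  # cycles endlessly: A, B, C, A, B, C...

def get_proxy():
    """return the next proxy in round-robin order"""
    return next(proxy_pool)

for _ in range(6):
    print(get_proxy())
```

Round-robin spreads load evenly, but as we'll see next, predictable ordering has weaknesses of its own, which is why smarter distribution strategies are worth the effort.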
Proxy Distribution and Ordering
The way we distribute proxies in our rotation can make a huge difference. For example, a common approach is to simply pick a random proxy from the pool for every request, but that's not a very smart or efficient method. The first thing we should do is rotate proxies by subnet or proxy origin. For example, say we have a proxy pool:
```python
import random

proxies = [
    "xx.xx.123.1",
    "xx.xx.123.2",
    "xx.xx.123.3",
    "xx.xx.124.1",
    "xx.xx.124.2",
    "xx.xx.125.1",
    "xx.xx.125.2",
]

def get_proxy():
    return random.choice(proxies)

print("got 10 proxies:")
for i in range(10):
    print(get_proxy())
```
If we pick a random proxy for every web scraping request like this, proxies from the same subnet are likely to appear back to back.
Since many throttling/blocking services take subnets into account in their logic, this would put our proxy rotator at a major disadvantage. Instead, it's best to make sure we randomize by subnet:
```python
import random

proxies = [
    "xx.xx.123.1",
    "xx.xx.123.2",
    "xx.xx.123.3",
    "xx.xx.124.1",
    "xx.xx.124.2",
    "xx.xx.125.1",
    "xx.xx.125.2",
]

last_subnet = ""

def get_proxy():
    global last_subnet
    while True:
        ip = random.choice(proxies)
        # the third octet identifies the subnet in this pool
        ip_subnet = ip.split('.')[2]
        if ip_subnet != last_subnet:
            last_subnet = ip_subnet
            return ip

print("got 10 proxies:")
for i in range(10):
    print(get_proxy())
```
Now our proxy selector will never return two proxies from the same subnet in a row. Subnets aren't the only axis we can randomize by, either. Proxies come with plenty of metadata, such as the IP address ASN (Autonomous System Number, essentially the "proxy IP owner ID"), location and so on. Depending on the web scraping target, it's a good idea to rotate proxies based on all of these features.
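For instance, the same avoid-the-last-one pattern works for ASN metadata too. Here's a minimal sketch; the proxy entries, ASN values and countries are all hypothetical placeholders:

```python
import random

# hypothetical proxy metadata - IPs, ASNs and countries are placeholders
proxies = [
    {"ip": "xx.xx.123.1", "asn": "AS1111", "country": "US"},
    {"ip": "xx.xx.123.2", "asn": "AS1111", "country": "US"},
    {"ip": "xx.xx.124.1", "asn": "AS2222", "country": "DE"},
    {"ip": "xx.xx.125.1", "asn": "AS3333", "country": "US"},
]

last_asn = None

def get_proxy():
    """pick a random proxy, avoiding the ASN used last time"""
    global last_asn
    while True:
        proxy = random.choice(proxies)
        if proxy["asn"] != last_asn:
            last_asn = proxy["asn"]
            return proxy["ip"]

for _ in range(5):
    print(get_proxy())
```

The same structure extends to any metadata field: swap `"asn"` for `"country"` (or check several fields at once) to rotate along whichever axis the target is most sensitive to.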
Tracking Proxy Performance
Once our proxies get rolling, some of them will start getting blocked. Some will be more successful than others, and some blocked ones will recover at some point. We can extend our rotator further to track proxy performance. For example, we can mark dead proxies to prevent them from being selected, giving them time to recover:
```python
import random
import requests
from datetime import datetime, timedelta

proxies = [
    "xx.xx.123.1",
    "xx.xx.123.2",
    "xx.xx.123.3",
    "xx.xx.124.1",
    "xx.xx.124.2",
    "xx.xx.125.1",
    "xx.xx.125.2",
]

last_subnet = ""
dead_proxies = {}


class RetriesExceeded(Exception):
    pass


def get_proxy():
    global last_subnet
    while True:
        ip = random.choice(proxies)
        ip_subnet = ip.split('.')[2]
        if ip_subnet == last_subnet:
            continue
        if ip in dead_proxies:
            if datetime.utcnow() - dead_proxies[ip] < timedelta(seconds=30):
                # proxy has not recovered yet - skip it
                continue
            else:
                # proxy has recovered - set it free!
                del dead_proxies[ip]
        last_subnet = ip_subnet
        return ip


def scrape(url, retries=0):
    proxy = get_proxy()
    response = requests.get(url, proxies={"http": proxy, "https": proxy})
    if response.status_code == 200:
        return response
    # mark dead proxies and retry
    dead_proxies[proxy] = datetime.utcnow()
    retries += 1
    if retries > 3:
        raise RetriesExceeded()
    return scrape(url, retries=retries)
```
Above, our simple proxy rotator distributes proxies randomly by subnet and keeps track of dead ones.
Weighted Random Proxy Rotator Example
Let's put together everything we've learned and build a weighted random rotator. For this we'll use Python's weighted random selection function random.choices. random.choices picks a random element, but its best feature is that we can assign custom weights to the available choices. This allows us to prioritize some proxies over others. For example:
```python
from collections import Counter
import random

proxies = [
    ("xx.xx.123.1", 10),
    ("xx.xx.123.2", 1),
    ("xx.xx.123.3", 1),
    ("xx.xx.124.1", 10),
    ("xx.xx.124.2", 1),
    ("xx.xx.125.1", 10),
    ("xx.xx.125.2", 1),
]

counter = Counter()
for i in range(1000):
    choice = random.choices(
        [p[0] for p in proxies],
        [p[1] for p in proxies],
        k=1,
    )
    counter[choice[0]] += 1

for proxy, used_count in counter.most_common():
    print(f"{proxy} was used {used_count} times")
```
In the example above, we weighed every proxy ending in .1 ten times more heavily than the others, which means they get picked roughly 10 times more often:
```
xx.xx.125.1 was used 298 times
xx.xx.124.1 was used 292 times
xx.xx.123.1 was used 283 times
xx.xx.125.2 was used 38 times
xx.xx.124.2 was used 34 times
xx.xx.123.3 was used 30 times
xx.xx.123.2 was used 25 times
```
Weighted randomization gives us huge creative freedom in designing our proxy rotator! For example, we can:
- penalize recently used proxies
- penalize proxies that fail
- promote healthy or fast proxies
- promote one type of proxy over another (e.g. residential proxies over datacenter ones)
This approach lets us create an unpredictable yet smart proxy rotator. Let's take a look at an example implementing this logic:
```python
import random
from time import time
from typing import List, Literal


class Proxy:
    """container for a proxy"""

    def __init__(self, ip, type_="datacenter") -> None:
        self.ip: str = ip
        self.type: Literal["datacenter", "residential"] = type_
        # strip the port (if any) and unpack the last two octets
        _, _, self.subnet, self.host = ip.split(":")[0].split(".")
        self.status: Literal["alive", "unchecked", "dead"] = "unchecked"
        self.last_used: int = None

    def __repr__(self) -> str:
        return self.ip

    def __str__(self) -> str:
        return self.ip


class Rotator:
    """weighted random proxy rotator"""

    def __init__(self, proxies: List[Proxy]):
        self.proxies = proxies
        self._last_subnet = None

    def weigh_proxy(self, proxy: Proxy):
        weight = 1_000
        if proxy.subnet == self._last_subnet:
            weight -= 500
        if proxy.status == "dead":
            weight -= 500
        if proxy.status == "unchecked":
            weight += 250
        if proxy.type == "residential":
            weight += 250
        if proxy.last_used:
            _seconds_since_last_use = time() - proxy.last_used
            weight += _seconds_since_last_use
        return weight

    def get(self):
        proxy_weights = [self.weigh_proxy(p) for p in self.proxies]
        proxy = random.choices(
            self.proxies,
            weights=proxy_weights,
            k=1,
        )[0]
        proxy.last_used = time()
        self._last_subnet = proxy.subnet
        return proxy
```
Example Run Code and Output
We can mock-run our Rotator to see the proxy distribution:
```python
from collections import Counter

if __name__ == "__main__":
    proxies = [
        # these will be used more often
        Proxy("xx.xx.121.1", "residential"),
        Proxy("xx.xx.121.2", "residential"),
        Proxy("xx.xx.121.3", "residential"),
        # these will be used less often
        Proxy("xx.xx.122.1"),
        Proxy("xx.xx.122.2"),
        Proxy("xx.xx.123.1"),
        Proxy("xx.xx.123.2"),
    ]
    rotator = Rotator(proxies)

    # let's mock some runs:
    _used = Counter()
    _failed = Counter()

    def mock_scrape():
        proxy = rotator.get()
        _used[proxy.ip] += 1
        if proxy.host == "1":
            # simulate proxies ending in .1 being significantly worse
            _fail_rate = 60
        else:
            _fail_rate = 20
        if random.randint(0, 100) < _fail_rate:
            # simulate some failures
            _failed[proxy.ip] += 1
            proxy.status = "dead"
            mock_scrape()
        else:
            proxy.status = "alive"
        return

    for i in range(10_000):
        mock_scrape()

    for proxy, count in _used.most_common():
        print(f"{proxy} was used {count:>5} times")
        print(f"  failed {_failed[proxy]:>5} times")
```
Now when we run our script, we can see that through weighted randomization our rotator picks residential proxies more often and proxies ending in ".1" less often:
```
xx.xx.121.2 was used  2629 times
  failed   522 times
xx.xx.121.3 was used  2603 times
  failed   508 times
xx.xx.123.2 was used  2321 times
  failed   471 times
xx.xx.122.2 was used  2302 times
  failed   433 times
xx.xx.121.1 was used  1941 times
  failed  1187 times
xx.xx.122.1 was used  1629 times
  failed   937 times
xx.xx.123.1 was used  1572 times
  failed   939 times
```
In this example, we saw how simple probability adjustments let us rotate proxies intelligently with very little supervision or tracking code.
Summary
In this article, we took a look at common proxy rotation strategies and how they help us avoid blocking when web scraping. We covered some popular rotation patterns, such as rotating based on proxy details and performance score tracking. Finally, we built an example proxy rotator that uses weighted random proxy selection to intelligently pick a proxy for our web scraping connections.