In this tutorial, we'll take a look at how to scrape ZoomInfo for public company data.
We'll start with an overview of how Zoominfo.com works so that we can find all public company pages. Then we'll use Python and a few community packages to scrape the company data.
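Why Scrape Zoominfo?
Zoominfo.com hosts millions of public company profiles that contain company credentials, financial data and contact details. Company overview data can be used in business intelligence and market analysis. Company contact and employee details can be used for lead generation and in the employment market.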
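Project Setup
In this tutorial, we'll be using Python with a couple of popular community packages:
- httpx – an HTTP client library that lets us communicate with zoominfo.com's servers
- parsel – an HTML parsing library, though we'll do very little HTML parsing in this tutorial and will mostly work with JSON data directly.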
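These packages can be installed with the pip command (the http2 extra pulls in the h2 package, which httpx needs when a client is created with http2=True, as in the scraper code in this tutorial):
$ pip install "httpx[http2]" parsel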
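Alternatively, feel free to swap httpx for any other HTTP client package such as requests, since we only need basic HTTP functions, which are almost interchangeable between libraries. As for parsel, another good alternative is the beautifulsoup package.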
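Scraping Zoominfo Company Data
To scrape a company profile listed on Zoominfo, let's first take a look at the company page itself. For example, let's see this page for Tesla Inc.: zoominfo.com/c/tesla-inc/104333869
The visible HTML contains the data. However, instead of parsing it directly, we can look at the page source, where the same data is embedded as a quoted (escaped) JSON document inside a script element.
So, instead of parsing the HTML, we pick up this JSON directly: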
import asyncio
import json
from pathlib import Path

import httpx
from parsel import Selector


def _unescape_angular(text):
    """Helper function to unescape Angular quoted text"""
    ANGULAR_ESCAPE = {
        "&a;": "&",
        "&q;": '"',
        "&s;": "'",
        "&l;": "<",
        "&g;": ">",
    }
    for from_, to in ANGULAR_ESCAPE.items():
        text = text.replace(from_, to)
    return text


def parse_company(selector: Selector):
    """parse Zoominfo company page for company data"""
    # the page state is stored as escaped JSON in <script id="app-root-state">
    data = selector.css("script#app-root-state::text").get()
    data = _unescape_angular(data)
    data = json.loads(data)["cd-pageData"]
    return data


async def scrape_company(url: str, session: httpx.AsyncClient) -> dict:
    """scrape zoominfo company page"""
    response = await session.get(url)
    assert response.status_code == 200, "request was blocked, see the avoid blocking section for more info"
    # response.url is an httpx.URL object, so convert it to a plain string for parsel
    return parse_company(Selector(text=response.text, base_url=str(response.url)))
Run code and example output
async def run():
    BASE_HEADERS = {
        "accept-language": "en-US,en;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
    }
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5),
        timeout=httpx.Timeout(15.0),
        headers=BASE_HEADERS,
        http2=True,
    ) as session:
        data = await scrape_company("https://www.zoominfo.com/c/tesla-inc/104333869", session=session)
        print(json.dumps(data, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(run())

{
  "companyId": "104333869",
  "url": "www.tesla.com",
  "foundingYear": "2003",
  "totalFundingAmount": "13763790",
  "isPublic": "Public",
  "name": "Tesla",
  "names": [
    "Tesla Inc",
    "..."
  ],
  "logo": "https://res.cloudinary.com/zoominfo-com/image/upload/w_70,h_70,c_fit/tesla.com",
  "ticker": "NASDAQ: TSLA",
  "website": "//www.tesla.com",
  "displayLink": "www.tesla.com",
  "revenue": "53823001",
  "numberOfEmployees": "99290",
  "fullName": "Tesla, Inc.",
  "companyIds": [
    "104333869",
    "..."
  ],
  "industries": [
    {
      "name": "Manufacturing",
      "link": "/companies-search/industry-manufacturing",
      "primary": true
    },
    "..."
  ],
  "socialNetworkUrls": [
    {
      "socialNetworkType": "LINKED_IN",
      "socialNetworkUrl": "https://www.linkedin.com/company/tesla-motors/"
    },
    "..."
  ],
  "address": {
    "street": "1 Tesla Road",
    "city": "Austin",
    "state": "Texas",
    "country": "United States",
    "zip": "78725"
  },
  "phone": "(512) 516-8177",
  "techUsed": [
    {
      "id": 92112,
      "name": "Microsoft SQL Server Reporting",
      "logo": "https://storage.googleapis.com/datanyze-data//technologies/17480e9fd49bbff12f7c482210d0060cf8f97713.png",
      "vendorFullName": "Microsoft Corporation",
      "vendorDisplayName": "Microsoft",
      "vendorId": 24904409
    },
    "..."
  ],
  "techOwned": [],
  "description": "Founded in 2003, Tesla is an electric vehicle and clean energy company that offers products including electric cars, battery energy storage from home to grid-scale, solar panels, solar roof tiles, and other related products and services.",
  "competitors": [
    {
      "id": "407578600",
      "name": "NIO",
      "employees": 9834,
      "revenue": "720117",
      "logo": "https://res.cloudinary.com/zoominfo-com/image/upload/w_70,h_70,c_fit/nio.com",
      "index": 0
    },
    "..."
  ],
  "fundings": [
    {
      "amount": "2000000",
      "date": "Feb 13, 2020",
      "type": "Stock Issuance/Offering",
      "investors": [
        "Elon Musk",
        "Larry Ellison"
      ]
    },
    "..."
  ],
  "acquisitions": [],
  "claimed": false,
  "sic": [
    "37",
    "..."
  ],
  "naics": [
    "44",
    "..."
  ],
  "success": true,
  "chartData": {
    "chartEmployeeData": [
      {
        "date": "'21 - Q1",
        "value": 45000000
      },
      "..."
    ],
    "chartRevenueData": [
      {
        "date": "'21 - Q1",
        "value": 24578000000
      },
      "..."
    ],
    "twitter": [],
    "facebook": []
  },
  "executives": {
    "CEO": {
      "personId": "3201848920",
      "fullName": "Elon Musk",
      "title": "Co-Founder & Chief Executive Officer",
      "picture": "https://www.jingzhengli.com/wp-content/uploads/2023/07/elon-musk-neuralink-portrait.jpg",
      "personUrl": "/p/Elon-Musk/3201848920",
      "orgChartTier": 1
    },
    "CFO": {
      "personId": "3744260195",
      "fullName": "Zachary Kirkhorn",
      "title": "Master of Coin & Chief Financial Officer",
      "picture": "https://www.jingzhengli.com/wp-content/uploads/2023/07/photo-1024x768.jpg",
      "personUrl": "/p/Zachary-Kirkhorn/3744260195",
      "orgChartTier": 2
    }
  },
  "orgChart": {
    "title": "Tesla's Org Chart",
    "btnContent": "See Full Org Chart",
    "personCardActions": {
      "nameAction": "OrgChartContact",
      "imageAction": "OrgChartContact",
      "emailAction": "OrgChartContactInfo",
      "phoneAction": "OrgChartContactInfo"
    },
    "firstTier": {
      "personId": "3201848920",
      "fullName": "Elon Musk",
      "title": "Co-Founder & Chief Executive Officer",
      "picture": "https://www.jingzhengli.com/wp-content/uploads/2023/07/elon-musk-neuralink-portrait.jpg",
      "personUrl": "/p/Elon-Musk/3201848920",
      "orgChartTier": 1
    },
    "secondTier": [
      {
        "personId": "3744260195",
        "fullName": "Zachary Kirkhorn",
        "title": "Master of Coin & Chief Financial Officer",
        "picture": "https://www.jingzhengli.com/wp-content/uploads/2023/07/photo-1024x768.jpg",
        "personUrl": "/p/Zachary-Kirkhorn/3744260195",
        "orgChartTier": 2
      },
      "..."
    ]
  },
  "pic": [
    {
      "personId": "-2033294111",
      "fullName": "Emmanuelle Stewart",
      "title": "Deputy General Counsel",
      "picture": "",
      "personUrl": "/p/Emmanuelle-Stewart/-2033294111",
      "orgChartTier": 3
    },
    "..."
  ],
  "ceo": {
    "personId": "3201848920",
    "fullName": "Elon Musk",
    "title": "Co-Founder & Chief Executive Officer",
    "picture": "https://www.jingzhengli.com/wp-content/uploads/2023/07/elon-musk-neuralink-portrait.jpg",
    "personUrl": "/p/Elon-Musk/3201848920",
    "orgChartTier": 1,
    "rating": {
      "great": 18,
      "good": 1,
      "ok": 1,
      "bad": 2
    },
    "company": {
      "name": "Tesla",
      "id": "104333869",
      "country": "US",
      "logo": "https://res.cloudinary.com/zoominfo-com/image/upload/w_70,h_70,c_fit/tesla.com",
      "fullName": "Tesla, Inc.",
      "claimed": false,
      "domain": "www.tesla.com",
      "numberOfEmployees": "99290",
      "industries": [
        {
          "name": "Manufacturing",
          "link": "/companies-search/industry-manufacturing",
          "primary": true
        },
        "..."
      ],
      "address": {
        "street": "1 Tesla Road",
        "city": "Austin",
        "state": "Texas",
        "country": "United States",
        "zip": "78725"
      }
    }
  },
  "newsFeed": [
    {
      "url": "https://www.inferse.com/152631/tesla-issues-another-over-the-air-recall-on-a-small-number-of-cars-in-the-us-electrek/",
      "title": "Tesla issues another over-the-air recall on a small number of cars in the US - Inferse.com",
      "content": "March 25 Fred Lambert - Mar. 25th 2022 11:23 am PT",
      "date": "2022-07-18T02:09:11Z",
      "domain": "www.inferse.com",
      "isComparablyNews": false
    },
    "..."
  ],
  "user": {
    "country": "US"
  },
  "emailPatterns": [
    {
      "value": "tesla.com",
      "rank": 0,
      "rawpatternstring": "0:tesla.com:0.61:0.61:0.98:25574",
      "sampleemail": "[email protected]",
      "usagePercentage": 59.8,
      "format": "first initials + last"
    },
    "..."
  ]
}
As we can see, with this approach our Zoominfo scraper stays short, efficient and simple: one request and one JSON parse per company page.
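The embedded-state extraction can also be tried offline. The following is a minimal sketch that uses only the standard library: it swaps parsel for a regular expression and runs against a made-up HTML snippet (SAMPLE_HTML and the extract_state helper are illustrative assumptions, not part of Zoominfo or of the scraper above). It also unescapes in a single regex pass, so that an escaped ampersand followed by "q;" is not unescaped twice.

```python
import json
import re

# Same escape table as in the scraper above
ANGULAR_ESCAPE = {"&a;": "&", "&q;": '"', "&s;": "'", "&l;": "<", "&g;": ">"}


def unescape_angular(text: str) -> str:
    """Unescape Angular-quoted text in one pass over the input."""
    return re.sub(r"&[aqslg];", lambda m: ANGULAR_ESCAPE[m.group(0)], text)


def extract_state(html: str) -> dict:
    """Pull the escaped JSON out of <script id="app-root-state"> and decode it."""
    match = re.search(
        r'<script id="app-root-state"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    if match is None:
        # Typically means the layout changed or the request was blocked
        raise ValueError("app-root-state script not found")
    return json.loads(unescape_angular(match.group(1)))["cd-pageData"]


# Hypothetical page source, reduced to two fields for illustration
SAMPLE_HTML = (
    '<html><body><script id="app-root-state" type="application/json">'
    "{&q;cd-pageData&q;:{&q;name&q;:&q;Tesla&q;,&q;ticker&q;:&q;NASDAQ: TSLA&q;}}"
    "</script></body></html>"
)

company = extract_state(SAMPLE_HTML)
print(company["name"], company["ticker"])
```

Raising an explicit error when the script element is missing makes blocked or changed pages easier to spot than a failure deep inside json.loads.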
Now that we know how to scrape a single company page, let's look at how to find company page URLs so that we can collect all public company data from Zoominfo.
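Finding Zoominfo Company Pages
Unfortunately, Zoominfo doesn't provide a publicly accessible sitemap directory like many other websites do. So we either need to explore the directories by location/industry or search for companies by name. Let's take a look at these two discovery techniques.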
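Scraping Directories and Crawling
Zoominfo.com has public company directory pages for many locations and industry types. However, these directories are limited to 100 results (5 pages) per query. For example, to find "software companies in Los Angeles" we can use this directory page:
zoominfo.com/companies-search/location-usa--california--los-angeles-industry-software
Taking the first 100 results from each directory page gives us a good amount of results, and it's easy to scrape: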
import httpx
from parsel import Selector
from urllib.parse import urljoin
from typing import List


def scrape_directory(url: str, scrape_pagination=True) -> List[str]:
    """Scrape Zoominfo directory page"""
    response = httpx.get(url)
    assert response.status_code == 200  # check whether we're blocked
    # parse first page of the results
    selector = Selector(text=response.text, base_url=url)
    companies = selector.css("div.tableRow_companyName_nameAndLink>a::attr(href)").getall()
    # parse other pages of the results
    if scrape_pagination:
        other_pages = selector.css("div.pagination>a::attr(href)").getall()
        for page_url in other_pages:
            # pagination links may be relative, so resolve them against the current URL
            companies.extend(scrape_directory(urljoin(url, page_url), scrape_pagination=False))
    return companies


print(scrape_directory("https://www.zoominfo.com/companies-search/location-usa--california--los-angeles-industry-software"))
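In our short scraper above, we pick up all 5 pages of the directory. To extend this, we can adopt a crawling technique and explore the related companies of every company we scrape. If we look at the dataset we scraped earlier, we can see that each company page contains a list of up to six competitor companies: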
"competitors": [
  {
    "id": "407578600",
    "name": "NIO",
    "employees": 9834,
    "revenue": "720117",
    "logo": "https://res.cloudinary.com/zoominfo-com/image/upload/w_70,h_70,c_fit/nio.com",
    "index": 0
  },
  "..."
],
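So, by scraping all companies in a directory and then their competitors, we can reach fairly high coverage. This approach is generally called crawling: we start from one or a few URLs, and every page we scrape gives us more URLs to follow.
By combining these two techniques, our Zoominfo scraper gets decent discovery coverage even with the 5-page pagination limit per directory.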
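The crawl loop described here can be sketched as a breadth-first traversal over competitor ids. This is only an illustration under stated assumptions: crawl_companies, its fetch_company callback and the FAKE_SITE data are hypothetical names and values. In a real run, fetch_company would wrap a company page scraper, and building a company page URL from a bare competitor id is left out because the URL slug is not part of the competitor record.

```python
from collections import deque
from typing import Callable, Dict, Iterable


def crawl_companies(
    seed_ids: Iterable[str],
    fetch_company: Callable[[str], dict],
    max_companies: int = 100,
) -> Dict[str, dict]:
    """Breadth-first crawl: scrape the seeds, then follow their competitors."""
    queue = deque(seed_ids)
    seen = set(queue)  # ids already queued, so nothing is fetched twice
    results: Dict[str, dict] = {}
    while queue and len(results) < max_companies:
        company_id = queue.popleft()
        data = fetch_company(company_id)
        results[company_id] = data
        for competitor in data.get("competitors", []):
            if not isinstance(competitor, dict):
                continue  # skip placeholders such as the "..." string
            competitor_id = competitor["id"]
            if competitor_id not in seen:
                seen.add(competitor_id)
                queue.append(competitor_id)
    return results


# Hypothetical in-memory stand-in for scraped company pages
FAKE_SITE = {
    "104333869": {"name": "Tesla", "competitors": [{"id": "407578600", "name": "NIO"}]},
    "407578600": {
        "name": "NIO",
        "competitors": [
            {"id": "104333869", "name": "Tesla"},
            {"id": "1", "name": "Hypothetical Motors"},
        ],
    },
    "1": {"name": "Hypothetical Motors", "competitors": []},
}

crawled = crawl_companies(["104333869"], FAKE_SITE.__getitem__)
print([company["name"] for company in crawled.values()])
```

The max_companies cap keeps the crawl bounded, since competitor links can lead far away from the original directory.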
To complement this, let's look next at how we can use the search system to find even more company pages.
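Scraping Zoominfo Search
Zoominfo also offers a quick search system that returns up to 3 results for a given search query.
So, if we have a set of company names, we can find their Zoominfo pages using the following search: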
import asyncio
import json

import httpx


async def search(query, session: httpx.AsyncClient):
    """query Zoominfo's quick search endpoint"""
    url = "https://directory-api.zoominfo.com/api/zoominfo/dual-quick-search"
    resp = await session.post(url, json={"searchString": query})
    data = resp.json()
    return data
Run code and example output
async def run():
    BASE_HEADERS = {
        "accept-language": "en-US,en;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-encoding": "gzip, deflate, br",
    }
    async with httpx.AsyncClient(
        limits=httpx.Limits(max_connections=5),
        timeout=httpx.Timeout(15.0),
        headers=BASE_HEADERS,
        http2=True,
    ) as session:
        return await search("tesla", session=session)


if __name__ == "__main__":
    print(json.dumps(asyncio.run(run()), indent=2, ensure_ascii=False))

{
  "people": [
    {
      "id": "3201848920",
      "name": {
        "first": "Elon",
        "last": "Musk"
      },
      "picture": "https://www.jingzhengli.com/wp-content/uploads/2023/07/elon-musk-neuralink-portrait.jpg",
      "jobTitle": "Co-Founder & Chief Executive Officer",
      "companyName": "Tesla"
    },
    {
      "id": "3744260195",
      "name": {
        "first": "Zachary",
        "last": "Kirkhorn"
      },
      "picture": "https://www.jingzhengli.com/wp-content/uploads/2023/07/photo-1024x768.jpg",
      "jobTitle": "Master of Coin & Chief Financial Officer",
      "companyName": "Tesla"
    },
    {
      "id": "7719115377",
      "name": {
        "first": "Alisher",
        "last": "Valikhanov"
      },
      "picture": "",
      "jobTitle": "Chief Executive Officer",
      "companyName": "Tesla-TAN"
    }
  ],
  "companies": [
    {
      "id": "104333869",
      "name": "Tesla",
      "url": "www.tesla.com",
      "headquarters": {
        "city": "Austin",
        "state": "Texas",
        "country": "United States"
      }
    },
    {
      "id": "430439652",
      "name": "Tesla-TAN",
      "url": "www.teslatan.kz",
      "headquarters": {
        "city": "Atyrau",
        "state": "Atyrau",
        "country": "Kazakhstan"
      }
    },
    {
      "id": "112033901",
      "name": "TESLA ENGINEERING",
      "url": "www.tesla.co.uk",
      "headquarters": {
        "city": "Storrington",
        "state": "West Sussex",
        "country": "United Kingdom"
      }
    }
  ],
  "success": true
}
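Using Zoominfo's search endpoint, we can find company pages as long as we know the company names, and there are plenty of public company databases that can help us with that.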
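Since a query such as "tesla" returns several loosely related companies, some filtering is usually needed before the results are used. Here is a small sketch that picks the entry whose name matches exactly, ignoring case. The find_company helper is a hypothetical name, and SAMPLE_RESULT is a trimmed copy of the example output above rather than a live response.

```python
from typing import Optional


def find_company(search_result: dict, name: str) -> Optional[dict]:
    """Return the first company whose name matches exactly (case-insensitive)."""
    wanted = name.strip().casefold()
    for company in search_result.get("companies", []):
        if company.get("name", "").strip().casefold() == wanted:
            return company
    return None


# Trimmed copy of the example search output shown above
SAMPLE_RESULT = {
    "companies": [
        {"id": "104333869", "name": "Tesla", "url": "www.tesla.com"},
        {"id": "430439652", "name": "Tesla-TAN", "url": "www.teslatan.kz"},
        {"id": "112033901", "name": "TESLA ENGINEERING", "url": "www.tesla.co.uk"},
    ],
    "success": True,
}

match = find_company(SAMPLE_RESULT, "tesla")
print(match["id"], match["url"])
```

Exact matching is a deliberately strict choice for this sketch. Looser matching (for example on the company website domain) may work better when the input names are not normalized.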
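FAQ
To wrap up this guide, let's take a look at some frequently asked questions about web scraping Zoominfo.com:
Is it legal to scrape Zoominfo.com?
Yes. The data displayed on Zoominfo is publicly available, and we're not extracting anything private. Scraping Zoominfo.com at a slow, respectful rate falls under the definition of ethical scraping.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data (such as data about people).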
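One way to keep the scraping rate slow is to cap concurrency and pause after every request. The following is a minimal sketch using only asyncio. The scrape_politely helper, the fake_fetch stand-in, the example.com URLs and the chosen limits are all assumptions for illustration. In a real scraper, fetch would be an actual page-scraping coroutine and the delay would likely be much longer.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def scrape_politely(
    urls: List[str],
    fetch: Callable[[str], Awaitable[dict]],
    max_concurrent: int = 2,
    delay: float = 1.0,
) -> List[dict]:
    """Run fetch() over urls with at most max_concurrent requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(url: str) -> dict:
        async with semaphore:
            result = await fetch(url)
            # Hold the slot a little longer so requests are spaced out
            await asyncio.sleep(delay)
            return result

    # gather() returns results in the same order as the input urls
    return await asyncio.gather(*(worker(url) for url in urls))


async def fake_fetch(url: str) -> dict:
    """Hypothetical stand-in for a real scraping coroutine."""
    await asyncio.sleep(0)
    return {"url": url}


urls = [f"https://example.com/company/{i}" for i in range(4)]
start = time.monotonic()
results = asyncio.run(scrape_politely(urls, fake_fetch, max_concurrent=2, delay=0.05))
elapsed = time.monotonic() - start
print(len(results), "pages scraped")
```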
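Zoominfo Scraping Summary
In this tutorial, we built a company data scraper for Zoominfo.com. We took a look at how to scrape company pages by extracting the embedded state data rather than parsing the HTML. We also looked at how to find company pages using Zoominfo's directory pages or its search system. For all of this, we used Python with a few community packages such as httpx.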