In this tutorial, we'll take a look at how to scrape Wellfound.com (previously known as angel.co) — a major directory of tech startups and job listing data. Wellfound is a popular web scraping target for collecting data related to jobs and the tech industry. In this guide we'll cover scraping job listings and company information, including the following data fields:
- Company overviews, such as funding details, performance and culture
- Funding details
- Job listings

We'll use Python together with the hidden web data scraping technique, which lets us scrape Wellfound in just a few lines of code. However, since Wellfound is notorious for blocking all web scrapers, we'll also use Scrapfly as an example of how to get around its anti-scraping protection. Let's dive in!
Why Scrape Wellfound?
Wellfound (previously AngelList) contains a huge amount of data related to tech startups. By scraping details such as company information, employee data, company culture, funding and jobs, we can build powerful business intelligence datasets. These can be used for competitive advantage or general market analysis. Job data and company contacts are also used by recruiters to generate business leads for growth hacking.
Project Setup
In this guide we'll be using Python (3.7+) with the ScrapFly SDK package — this will let us bypass the numerous anti-scraping technologies Wellfound employs so we can retrieve the public HTML data. Optionally, for this tutorial we'll also use loguru — a pretty logging library that helps us keep track of what's going on with nice, colorful logs. These packages can be easily installed via the pip command:
$ pip install scrapfly-sdk loguru
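To verify the setup, here is a minimal sketch, assuming you have a Scrapfly API key (replace `YOUR_SCRAPFLY_KEY`; the target URL is just an example), that fetches a single page synchronously and parses its title:

```python
from scrapfly import ScrapeConfig, ScrapflyClient

# hypothetical setup check - substitute your real Scrapfly API key
client = ScrapflyClient(key="YOUR_SCRAPFLY_KEY")
result = client.scrape(ScrapeConfig(
    url="https://wellfound.com/role/python-developer",  # example search page
    asp=True,  # enable anti-scraping protection bypass
))
# the response exposes a parsel selector for HTML parsing
print(result.selector.css("title::text").get())
```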
Finding Wellfound Companies and Jobs
Let's start our Wellfound scraper by scraping the search system. This will let us find companies and jobs listed on the website. There are several ways to find these details on wellfound.com, but we'll take a look at the two most popular ones — searching by role and/or location:
- To find jobs by role, the `/role/<role name>` endpoint can be used, e.g.: wellfound.com/role/python-developer
- For locations, a similar `/location/<location name>` endpoint is used, e.g.: wellfound.com/location/france
- We can combine both using the `/role/l/<role name>/<location name>` endpoint, e.g.: https://wellfound.com/role/l/python-developer/san-francisco

As we adjust the search, we can see the URL change accordingly (the three patterns are captured in the small helper sketch below).
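As a quick illustration, the three URL patterns above can be wrapped in a small helper. This is just a sketch — the same logic appears inside the scraper further down:

```python
def build_search_url(role: str = "", location: str = "") -> str:
    """build a wellfound.com search URL from a role and/or location slug"""
    if role and location:
        return f"https://wellfound.com/role/l/{role}/{location}"
    if role:
        return f"https://wellfound.com/role/{role}"
    if location:
        return f"https://wellfound.com/location/{location}"
    raise ValueError("either role or location is required")


print(build_search_url(role="python-developer", location="san-francisco"))
# https://wellfound.com/role/l/python-developer/san-francisco
```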
Scraping Wellfound Search
To scrape the search, let's first take a look at the contents of a single search page: where is the data we need located, and how can we extract it from the HTML page? If we take a look at a search page like wellfound.com/role/l/python-developer/san-francisco and view the page source, we can see the search result data embedded in a javascript variable:
This is a common pattern for GraphQL-powered websites, where the page cache is stored in the HTML as JSON. Wellfound (angel.co) in particular is powered by Apollo GraphQL.
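To see this for yourself without any scraping infrastructure, you can save the page HTML from your browser and pull the embedded cache out of the `__NEXT_DATA__` script tag. Here's a minimal sketch using parsel (installed alongside scrapfly-sdk; the file name is just an example):

```python
import json

from parsel import Selector

# hypothetical: page source saved from the browser as search.html
html = open("search.html", encoding="utf-8").read()
selector = Selector(html)

# the Apollo GraphQL cache lives inside the Next.js __NEXT_DATA__ script tag
next_data = json.loads(selector.css("script#__NEXT_DATA__::text").get())
graph = next_data["props"]["pageProps"]["apolloState"]["data"]

# the cache is keyed by object type and id, e.g. "StartupResult:6427941"
print(list(graph)[:10])
```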
This is very convenient for our AngelList web scraper, as we don't need to parse the HTML and can pick up all of the data at once. Let's take a look at how to scrape it:
```python
import json
import asyncio

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from loguru import logger as log


def extract_apollo_state(result: ScrapeApiResponse):
    """extract apollo state graph from a page"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    graph = data["props"]["pageProps"]["apolloState"]["data"]
    return graph


async def scrape_search(session: ScrapflyClient, role: str = "", location: str = ""):
    """scrape wellfound.com search"""
    # wellfound.com has 3 types of search urls: for roles, for locations and for combination of both
    if role and location:
        url = f"https://wellfound.com/role/l/{role}/{location}"
    elif role:
        url = f"https://wellfound.com/role/{role}"
    elif location:
        url = f"https://wellfound.com/location/{location}"
    else:
        raise ValueError("need to pass either role or location argument to scrape search")

    log.info(f'scraping search of "{role}" in "{location}"')
    scrape = ScrapeConfig(
        url=url,  # url to scrape
        asp=True,  # this will enable anti-scraping protection bypass
    )
    result = await session.async_scrape(scrape)
    graph = extract_apollo_state(result)
    return graph
```
Let's run this code and take a look at the results it generates:
```python
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result = asyncio.run(scrape_search(session, role="python-developer"))
        print(json.dumps(result, indent=2, ensure_ascii=False))
```
```json
{
  "props": {
    "pageProps": {
      "page": null,
      "role": "python-developer",
      "apollo": null,
      "apolloState": {
        "data": {
          ...
          "StartupResult:6427941": {
            "id": "6427941",
            "badges": [
              {
                "type": "id",
                "generated": false,
                "id": "Badge:ACTIVELY_HIRING",
                "typename": "Badge"
              }
            ],
            "companySize": "SIZE_11_50",
            ...
          "JobListingSearchResult:2275832": {
            "autoPosted": false,
            "atsSource": null,
            "description": "**Company: Capitalmind**\n\nAt Capitalmind we ...",
            "jobType": "full-time",
            "liveStartAt": 1656420205,
            "locationNames": { "type": "json", "json": ["Bengaluru"] },
            "primaryRoleTitle": "DevOps",
            "remote": false,
            "slug": "python-developer",
            "title": "Python Developer",
            "compensation": "₹50,000 – ₹1L",
            ...
```
The first thing we notice is that there are a lot of results in a rather complicated format. The data we receive here is a data graph — a data storage format where various data objects are connected through references. To make better sense of it, let's parse it into a familiar flat structure:
```python
from copy import deepcopy


def unpack_node_references(node, graph, debug=False):
    """
    unpacks references in a graph node to a flat node structure:

    >>> unpack_node_references({"field": {"id": "reference1", "type": "id"}}, graph={"reference1": {"foo": "bar"}})
    {'field': {'foo': 'bar'}}
    """
    def flatten(value):
        try:
            if value["type"] != "id":
                return value
        except (KeyError, TypeError):
            return value
        data = deepcopy(graph[value["id"]])
        # flatten nodes too:
        if data.get("node"):
            data = flatten(data["node"])
        if debug:
            data["__reference"] = value["id"]
        return data

    node = flatten(node)
    for key, value in node.items():
        if isinstance(value, list):
            node[key] = [flatten(v) for v in value]
        elif isinstance(value, dict):
            node[key] = unpack_node_references(value, graph)
    return node
```
Above, we defined a function that flattens the complex graph structure. It works by replacing every reference with the data it points to. In our case, we want to take the Company objects from the graph set together with all the related objects such as jobs, people and so on:
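Here's a toy illustration of what the unpacking does, using a made-up two-node graph shaped like the Wellfound cache (the keys and values are hypothetical):

```python
graph = {
    "StartupResult:1": {
        "name": "Example Startup",
        # a reference node pointing at the job object below
        "highlightedJobListings": [{"type": "id", "id": "JobListingSearchResult:2"}],
    },
    "JobListingSearchResult:2": {"title": "Python Developer", "remote": True},
}

company = unpack_node_references(graph["StartupResult:1"], graph)
print(company)
# {'name': 'Example Startup',
#  'highlightedJobListings': [{'title': 'Python Developer', 'remote': True}]}
```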
The illustration above helps visualize the reference unpacking. Next, let's add this graph parsing to our scraper together with pagination support, so we can collect nicely formatted company data from all of the job pages:
```python
from typing import Dict, List, Tuple
from typing_extensions import TypedDict


class JobData(TypedDict):
    """type hint for scraped job result data"""
    id: str
    title: str
    slug: str
    remote: bool
    primaryRoleTitle: str
    locationNames: Dict
    liveStartAt: int
    jobType: str
    description: str
    # there are more fields, but these are basic ones


class CompanyData(TypedDict):
    """type hint for scraped company result data"""
    id: str
    badges: list
    companySize: str
    highConcept: str
    highlightedJobListings: List[JobData]
    logoUrl: str
    name: str
    slug: str
    # there are more fields, but these are basic ones


async def scrape_search(session: ScrapflyClient, role: str = "", location: str = "") -> List[CompanyData]:
    """scrape wellfound.com search"""
    # wellfound.com has 3 types of search urls: for roles, for locations and for combination of both
    if role and location:
        url = f"https://wellfound.com/role/l/{role}/{location}"
    elif role:
        url = f"https://wellfound.com/role/{role}"
    elif location:
        url = f"https://wellfound.com/location/{location}"
    else:
        raise ValueError("need to pass either role or location argument to scrape search")

    async def scrape_search_page(page_numbers: List[int]) -> Tuple[List[CompanyData], Dict]:
        """scrape search pages concurrently"""
        companies = []
        log.info(f"scraping search of {role} in {location}; pages {page_numbers}")
        search_meta = None
        async for result in session.concurrent_scrape(
            [ScrapeConfig(url + f"?page={page}", asp=True, cache=True) for page in page_numbers]
        ):
            graph = extract_apollo_state(result)
            search_meta = graph[next(key for key in graph if "seoLandingPageJobSearchResults" in key)]
            companies.extend(
                [unpack_node_references(graph[key], graph) for key in graph if key.startswith("StartupResult")]
            )
        return companies, search_meta

    # scrape first page
    first_page_companies, pagination_meta = await scrape_search_page([1])
    # scrape other pages
    pages_to_scrape = list(range(2, pagination_meta["pageCount"] + 1))
    other_page_companies, _ = await scrape_search_page(pages_to_scrape)
    return first_page_companies + other_page_companies
```
```python
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result = asyncio.run(scrape_search(session, role="python-developer"))
        print(json.dumps(result, indent=2, ensure_ascii=False))
```
```json
[
  {
    "id": "6427941",
    "badges": [
      {
        "id": "ACTIVELY_HIRING",
        "name": "ACTIVELY_HIRING_BADGE",
        "label": "Actively Hiring",
        "tooltip": "Actively processing applications",
        "avatarUrl": null,
        "rating": null,
        "__typename": "Badge"
      }
    ],
    "companySize": "SIZE_11_50",
    "highConcept": "India's First Digital Asset Management Company",
    "highlightedJobListings": [
      {
        "autoPosted": false,
        "atsSource": null,
        "description": "**Company: Capitalmind**\n\nAt Capitalmind <...truncated...>",
        "jobType": "full-time",
        "liveStartAt": 1656420205,
        "locationNames": { "type": "json", "json": ["Bengaluru"] },
        "primaryRoleTitle": "DevOps",
        "remote": false,
        "slug": "python-developer",
        "title": "Python Developer",
        "compensation": "₹50,000 – ₹1L",
        "id": "2275832",
        "isBookmarked": false,
        "__typename": "JobListingSearchResult"
      }
    ],
    "logoUrl": "https://photos.wellfound.com/startups/i/6427941-9e4960b31904ccbcfe7e3235228ceb41-medium_jpg.jpg?buster=1539167505",
    "name": "Capitalmind",
    "slug": "capitalmindamc",
    "__typename": "StartupResult"
  },
  ...
]
```
Our updated scraper is now capable of scraping all of the search pages and flattening the graph data into something much more readable. We could parse it further to strip out unwanted fields, but we'll leave that up to you (a small trimming sketch follows below). One thing to note is that the company and job data here is incomplete: while there's plenty of it, there's even more in the full datasets available on the /company/ endpoint pages. Next, let's take a look at how to scrape those!
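For instance, to trim a flattened search result down to just the basic fields of our CompanyData hint, a simple filter could look like this (a sketch; the field names come from the TypedDicts defined above):

```python
def trim_company(company: dict) -> CompanyData:
    """keep only the basic CompanyData fields from a flattened search result"""
    wanted = ["id", "badges", "companySize", "highConcept",
              "highlightedJobListings", "logoUrl", "name", "slug"]
    return {key: company.get(key) for key in wanted}


# usage: trimmed = [trim_company(company) for company in search_results]
```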
Scraping Wellfound Companies and Jobs
Company pages on wellfound.com contain even more details than what we saw during search. For example, if we take a look at a page like wellfound.com/company/moxion-power-co, we can see a lot more data available in the visible part of the page:
We can apply the same scraping techniques we used for search to the company pages as well. Let's take a look at how:
```python
def parse_company(result: ScrapeApiResponse) -> CompanyData:
    """parse company data from wellfound.com company page"""
    graph = extract_apollo_state(result)
    company = None
    for key in graph:
        if key.startswith("Startup:"):
            company = graph[key]
            break
    else:
        raise ValueError("no embedded company data could be found")
    return unpack_node_references(company, graph)


async def scrape_companies(company_ids: List[str], session: ScrapflyClient) -> List[CompanyData]:
    """scrape wellfound.com companies"""
    urls = [f"https://wellfound.com/company/{company_id}/jobs" for company_id in company_ids]
    companies = []
    async for result in session.concurrent_scrape([ScrapeConfig(url, asp=True, cache=True) for url in urls]):
        companies.append(parse_company(result))
    return companies
```
Running the code and example output:
```python
if __name__ == "__main__":
    with ScrapflyClient(key="YOUR_SCRAPFLY_KEY", max_concurrency=2) as session:
        result = asyncio.run(scrape_companies(["moxion-power-co"], session=session))
        print(json.dumps(result[0], indent=2, ensure_ascii=False))
```
```json
{
  "id": "8281817",
  "__typename": "Startup",
  "slug": "moxion-power-co",
  "completenessScore": 92,
  "currentUserCanEditProfile": false,
  "currentUserCanRecruitForStartup": false,
  "completeness": {"score": 95},
  "name": "Moxion Power",
  "logoUrl": "https://photos.wellfound.com/startups/i/8281817-91faf535f176a41dc39259fc232d1b4e-medium_jpg.jpg?buster=1619536432",
  "highConcept": "Zero-Emissions Temporary Power as a Service",
  "hiring": true,
  "isOperating": null,
  "companySize": "SIZE_11_50",
  "totalRaisedAmount": 13225000,
  "companyUrl": "https://www.moxionpower.com/",
  "twitterUrl": "https://twitter.com/moxionpower",
  "blogUrl": "",
  "facebookUrl": "",
  "linkedInUrl": "https://www.linkedin.com/company/moxion-power-co/",
  "productHuntUrl": "",
  "public": true,
  "published": true,
  "quarantined": false,
  "isShell": false,
  "isIncubator": false,
  "currentUserCanUpdateInvestors": false,
  "jobPreamble": "Moxion is looking to hire a diverse team across several disciplines, currently focusing on engineering and production.",
  "jobListingsConnection({\"after\":\"MA==\",\"filters\":{\"jobTypes\":[],\"locationIds\":[],\"roleIds\":[]},\"first\":20})": {
    "totalPageCount": 3,
    "pageSize": 20,
    "edges": [
      {
        "id": "2224735",
        "public": true,
        "primaryRoleTitle": "Product Designer",
        "primaryRoleParent": "Designer",
        "liveStartAt": 1653724125,
        "descriptionSnippet": "<ul>\n<li>Conduct user research to drive design decisions</li>\n<li>Design graphics to be vinyl printed onto physical hardware and signage</li>\n</ul>\n",
        "title": "Senior UI/UX Designer",
        "slug": "senior-ui-ux-designer",
        "jobType": "full_time",
        ...
}
```
With just a few additional lines of code, we can collect each company's job, people, culture and funding details. And since we used a generic approach for scraping Apollo GraphQL-powered websites like wellfound.com, we can easily apply it to many other pages! Let's wrap up by taking a look at the full scraper code along with some additional tips and tricks for scraping this target.
FAQ
To wrap this guide up, let's take a look at some frequently asked questions about web scraping wellfound.com:
Is it legal to scrape Wellfound (aka AngelList)?
Yes. Wellfound's data is publicly available, and we're not extracting anything private. Scraping wellfound.com at slow, respectful rates is both ethical and legal. That being said, attention should be paid to GDPR compliance in the EU when dealing with scraped personal data, such as personally identifiable employee data.
How to find all company pages on Wellfound?
Finding company pages that have no job listings is a bit more difficult, as wellfound.com doesn't provide a site directory or sitemap for crawlers. For this, wellfound.com/search can be used. Alternatively, we can take advantage of public search indexes such as google.com or bing.com with queries like: site:wellfound.com inurl:/company/
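Another option is to reuse the search scraper from this article: every flattened search result already carries the company's slug, which is exactly what the company scraper expects. A short sketch, assuming the scrape_search and scrape_companies functions defined above:

```python
async def scrape_companies_from_search(session: ScrapflyClient, role: str, location: str = "") -> List[CompanyData]:
    """discover company slugs through search, then scrape their full /company/ pages"""
    search_results = await scrape_search(session, role=role, location=location)
    # each flattened StartupResult carries a "slug" field usable as a company id
    slugs = {company["slug"] for company in search_results if company.get("slug")}
    return await scrape_companies(list(slugs), session=session)
```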
Can Wellfound be crawled?
Yes. Crawling is an alternative web scraping technique where the scraper explores the website to discover links and follows them. It's a great way to find new pages and scrape them. However, since Wellfound serves hidden web data, it's much easier to scrape it explicitly as covered in this tutorial.
Wellfound Scraping Summary
In this tutorial we built a wellfound.com scraper. We've taken a look at how to discover company pages through Wellfound's search functionality. Then we wrote a generic dataset parser for GraphQL-powered websites and applied it to parsing wellfound.com search results and company data. For this we used Python with a few community packages included in the scrapfly-sdk, and to prevent being blocked we used ScrapFly's API, which smartly configures every web scraper connection to avoid blocking.
Full Wellfound Scraper Code
Finally, let's put everything together: finding companies using search, scraping their info, and viewing the data, with ScrapFly integration:
```python
import asyncio
import json
from copy import deepcopy
from pathlib import Path
from typing import Dict, List, Optional, Tuple

from loguru import logger as log
from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient
from typing_extensions import TypedDict

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY", max_concurrency=5)


def unpack_node_references(node: Dict, graph: Dict, debug: bool = False) -> Dict:
    """
    unpacks references in a graph node to a flat node structure:

    >>> unpack_node_references({"field": {"id": "reference1", "type": "id"}}, graph={"reference1": {"foo": "bar"}})
    {'field': {'foo': 'bar'}}
    """
    def flatten(value):
        try:
            if value["type"] != "id":
                return value
        except (KeyError, TypeError):
            return value
        data = deepcopy(graph[value["id"]])
        # flatten nodes too:
        if data.get("node"):
            data = flatten(data["node"])
        if debug:
            data["__reference"] = value["id"]
        return data

    node = flatten(node)
    for key, value in node.items():
        if isinstance(value, list):
            node[key] = [flatten(v) for v in value]
        elif isinstance(value, dict):
            node[key] = unpack_node_references(value, graph)
    return node


def extract_apollo_state(result: ScrapeApiResponse):
    """extract apollo state graph from a page"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    graph = data["props"]["pageProps"]["apolloState"]["data"]
    return graph


class JobData(TypedDict):
    """type hint for scraped job result data"""
    id: str
    title: str
    slug: str
    remote: bool
    primaryRoleTitle: str
    locationNames: Dict
    liveStartAt: int
    jobType: str
    description: str
    # there are more fields, but these are basic ones


class CompanyData(TypedDict):
    """type hint for scraped company result data"""
    id: str
    badges: list
    companySize: str
    highConcept: str
    highlightedJobListings: List[JobData]
    logoUrl: str
    name: str
    slug: str
    # there are more fields, but these are basic ones


def parse_company(result: ScrapeApiResponse) -> CompanyData:
    """parse company data from wellfound.com (previously angel.co) company page"""
    graph = extract_apollo_state(result)
    for key in graph:
        if key.startswith("Startup:"):
            company = graph[key]
            break
    else:
        raise ValueError("no embedded company data could be found")
    flat_dataset = unpack_node_references(company, graph)
    return flat_dataset


async def scrape_companies(company_ids: List[str]) -> List[CompanyData]:
    """scrape wellfound.com (previously angel.co) companies"""
    log.info(f"scraping {len(company_ids)} companies: {company_ids}")
    companies = []
    to_scrape = [ScrapeConfig(f"https://wellfound.com/company/{company_id}/jobs", asp=True) for company_id in company_ids]
    async for result in scrapfly.concurrent_scrape(to_scrape):
        companies.append(parse_company(result))
    return companies


async def scrape_search(role: str = "", location: str = "", max_pages: Optional[int] = None) -> List[CompanyData]:
    """scrape wellfound.com (previously angel.co) search"""
    # wellfound.com has 3 types of search urls: for roles, for locations and for combination of both
    if role and location:
        url = f"https://wellfound.com/role/l/{role}/{location}"
    elif role:
        url = f"https://wellfound.com/role/{role}"
    elif location:
        url = f"https://wellfound.com/location/{location}"
    else:
        raise ValueError("need to pass either role or location argument to scrape search")

    async def scrape_search_page(page_numbers: List[int]) -> Tuple[List[CompanyData], Dict]:
        """scrape search pages concurrently"""
        companies = []
        log.info(f"scraping search of {role} in {location}; pages {page_numbers}")
        search_meta = {}
        async for result in scrapfly.concurrent_scrape(
            [ScrapeConfig(url + f"?page={page}", asp=True, cache=True) for page in page_numbers]
        ):
            graph = extract_apollo_state(result)
            for k, v in graph["ROOT_QUERY"]["talent"].items():
                if k.startswith("seoLandingPageJobSearchResults"):
                    search_meta = v
                    break
            else:
                print("Could not find search meta information for paginating search")
                search_meta = {}
            companies.extend(
                [unpack_node_references(graph[key], graph) for key in graph if key.startswith("StartupResult")]
            )
        return companies, search_meta

    # scrape first page and figure out how many pages are there
    first_page_companies, pagination_meta = await scrape_search_page([1])
    # scrape other pages concurrently
    total_pages = pagination_meta["pageCount"]
    if max_pages and total_pages > max_pages:
        total_pages = max_pages
    pages_to_scrape = list(range(2, total_pages + 1))
    other_page_companies, _ = await scrape_search_page(pages_to_scrape)
    return first_page_companies + other_page_companies


async def example_run():
    """
    This example run scrapes:
    - a single company page and saves results to /results/companies.json
    - a single search page query including pagination and saves results to /results/search.json
    """
    out = Path(__file__).parent / "results"
    out.mkdir(exist_ok=True)
    result_companies = await scrape_companies(["moxion-power-co"])
    out.joinpath("companies.json").write_text(json.dumps(result_companies, indent=2, ensure_ascii=False))
    result_search = await scrape_search("python-developer", "san-francisco")
    out.joinpath("search.json").write_text(json.dumps(result_search, indent=2, ensure_ascii=False))


if __name__ == "__main__":
    asyncio.run(example_run())
```