In this web scraping tutorial, we'll take a look at how to scrape job listing data from Indeed.com. Indeed.com is one of the most popular job listing websites, and it's fairly easy to scrape. In this tutorial we'll build our scraper with just a few lines of Python code: we'll take a look at how Indeed's search works, replicate it in our scraper, and extract the job data from embedded javascript variables. Let's dive in!
Project Setup
For this web scraper we only need an HTTP client library such as httpx, which can be installed through the pip console command:
```shell
$ pip install httpx
```
There are many HTTP clients in Python: requests, httpx, aiohttp and so on. However, we recommend httpx because it supports the http2 protocol, which makes it the least likely to be blocked. httpx also supports asynchronous Python, which means we can scrape data very quickly!
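For illustration, here is a minimal sketch of an async httpx client with HTTP/2 enabled. Note that HTTP/2 support is an optional extra that needs to be installed separately (e.g. `pip install "httpx[http2]"`), and indeed.com may still block requests that lack browser-like headers:

```python
import asyncio

import httpx


async def check_http2():
    # http2=True requires the optional h2 dependency: pip install "httpx[http2]"
    async with httpx.AsyncClient(http2=True) as client:
        response = await client.get("https://www.indeed.com/")
        # response.http_version reports "HTTP/2" when the server negotiated it
        print(response.status_code, response.http_version)


asyncio.run(check_http2())
```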
Finding Indeed Jobs
First, let's take a look at how to find job listings on Indeed.com. If we go to the homepage and submit a search, we can see that Indeed redirects us to a search URL with a couple of key parameters:
https://www.indeed.com/jobs?q=python&l=Texas
So, to find Python jobs in Texas, all we have to do is send a request to this URL with the q=Python and l=Texas parameters:
```python
import httpx

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}

response = httpx.get("https://www.indeed.com/jobs?q=python&l=Texas", headers=HEADERS)
print(response)
```
We get a page containing 15 job listings! Before we collect the remaining pages, let's see how to parse the job listing data from this response. We could parse the HTML document with CSS or XPath selectors, but there's an even easier way: all of the job listing data is hidden deep in the HTML as a JSON document, assigned to the javascript variable window.mosaic.providerData["mosaic-provider-jobcards"].
So, let's parse this data using a simple regular expression pattern:
```python
import httpx
import re
import json

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}


def parse_search_page(html: str):
    data = re.findall(r'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', html)
    data = json.loads(data[0])
    return {
        "results": data["metaData"]["mosaicProviderJobCardsModel"]["results"],
        "meta": data["metaData"]["mosaicProviderJobCardsModel"]["tierSummaries"],
    }


response = httpx.get("https://www.indeed.com/jobs?q=python&l=Texas", headers=HEADERS)
print(parse_search_page(response.text))
```
In the code above, we use a regular expression pattern to select the mosaic-provider-jobcards variable value, load it as a Python dictionary, and parse out the results and the pagination metadata. Now that we have the first page of results and the total amount of results, we can retrieve the remaining pages:
```python
import asyncio
import json
import re
from urllib.parse import urlencode

import httpx


def parse_search_page(html: str):
    data = re.findall(r'window.mosaic.providerData\["mosaic-provider-jobcards"\]=(\{.+?\});', html)
    data = json.loads(data[0])
    return {
        "results": data["metaData"]["mosaicProviderJobCardsModel"]["results"],
        "meta": data["metaData"]["mosaicProviderJobCardsModel"]["tierSummaries"],
    }


async def scrape_search(client: httpx.AsyncClient, query: str, location: str, max_results: int = 50):
    def make_page_url(offset):
        parameters = {"q": query, "l": location, "filter": 0, "start": offset}
        return "https://www.indeed.com/jobs?" + urlencode(parameters)

    print(f"scraping first page of search: {query=}, {location=}")
    response_first_page = await client.get(make_page_url(0))
    data_first_page = parse_search_page(response_first_page.text)

    results = data_first_page["results"]
    total_results = sum(category["jobCount"] for category in data_first_page["meta"])
    # there's a page limit on indeed.com of 1000 results per search
    if total_results > max_results:
        total_results = max_results
    print(f"scraping remaining {(total_results - 10) / 10} pages")
    other_pages = [make_page_url(offset) for offset in range(10, total_results + 10, 10)]
    for response in await asyncio.gather(*[client.get(url=url) for url in other_pages]):
        results.extend(parse_search_page(response.text)["results"])
    return results
```
Run code and example output
```python
async def main():
    # we need to use browser-like headers to avoid being blocked instantly:
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
    }
    async with httpx.AsyncClient(headers=HEADERS) as client:
        search_data = await scrape_search(client, query="python", location="texas")
        print(json.dumps(search_data, indent=2))


asyncio.run(main())
```
This results in search data similar to:
[ { "company": "Apple", "companyBrandingAttributes": { "headerImageUrl": "https://d2q79iu7y748jz.cloudfront.net/s/_headerimage/1960x400/ecdb4796986d27b654fe959e2fdac201", "logoUrl": "https://d2q79iu7y748jz.cloudfront.net/s/_squarelogo/256x256/86583e966849b2f081928769a6abdb09" }, "companyIdEncrypted": "c1099851e9794854", "companyOverviewLink": "/cmp/Apple", "companyOverviewLinkCampaignId": "serp-linkcompanyname", "companyRating": 4.1, "companyReviewCount": 11193, "companyReviewLink": "/cmp/Apple/reviews", "companyReviewLinkCampaignId": "cmplinktst2", "displayTitle": "Software Quality Engineer, Apple Pay", "employerAssistEnabled": false, "employerResponsive": false, "encryptedFccompanyId": "6e7b40121fbb5e2f", "encryptedResultData": "VwIPTVJ1cTn5AN7Q-tSqGRXGNe2wB2UYx73qSczFnGU", "expired": false, "extractTrackingUrls": "https://jsv3.recruitics.com/partner/a51b8de1-f7bf-11e7-9edd-d951492604d9.gif?client=3427&rx_c=&rx_campaign=indeed16&rx_group=130795&rx_source=Indeed&job=200336736-2&rx_r=none&rx_ts=20220831T001748Z&rx_pre=1&indeed=sp", "extractedEntities": [], "fccompanyId": -1, "featuredCompanyAttributes": {}, "featuredEmployer": false, "featuredEmployerCandidate": false, "feedId": 2772, "formattedLocation": "Austin, TX", "formattedRelativeTime": "Today", "hideMetaData": false, "hideSave": false, "highVolumeHiringModel": { "highVolumeHiring": false }, "highlyRatedEmployer": false, "hiringEventJob": false, "indeedApplyEnabled": false, "indeedApplyable": false, "isJobSpotterJob": false, "isJobVisited": false, "isMobileThirdPartyApplyable": true, "isNoResumeJob": false, "isSubsidiaryJob": false, "jobCardRequirementsModel": { "additionalRequirementsCount": 0, "requirementsHeaderShown": false }, "jobLocationCity": "Austin", "jobLocationState": "TX", "jobTypes": [], "jobkey": "5b47456ae8554711", "jsiEnabled": false, "locationCount": 0, "mobtk": "1gbpe4pcikib6800", "moreLocUrl": "", "newJob": true, "normTitle": "Software Quality Engineer", "openInterviewsInterviewsOnTheSpot": false, "openInterviewsJob": false, "openInterviewsOffersOnTheSpot": false, "openInterviewsPhoneJob": false, "overrideIndeedApplyText": true, "preciseLocationModel": { "obfuscateLocation": false, "overrideJCMPreciseLocationModel": true }, "pubDate": 1661835600000, "redirectToThirdPartySite": false, "remoteLocation": false, "resumeMatch": false, "salarySnippet": { "salaryTextFormatted": false }, "saved": false, "savedApplication": false, "showCommutePromo": false, "showEarlyApply": false, "showJobType": false, "showRelativeDate": true, "showSponsoredLabel": false, "showStrongerAppliedLabel": false, "smartFillEnabled": false, "snippet": "<ul style=\"list-style-type:circle;margin-top: 0px;margin-bottom: 0px;padding-left:20px;\"> \n <li style=\"margin-bottom:0px;\">At Apple, new ideas become extraordinary products, services, and customer experiences.</li>\n <li>We have the rare and rewarding opportunity to shape upcoming products\u2026</li>\n</ul>", "sourceId": 2700, "sponsored": true, "taxoAttributes": [], "taxoAttributesDisplayLimit": 5, "taxoLogAttributes": [], "taxonomyAttributes": [ { "attributes": [], "label": "job-types" }, "..."], "tier": { "matchedPreferences": { "longMatchedPreferences": [], "stringMatchedPreferences": [] }, "type": "DEFAULT" }, "title": "Software Quality Engineer, Apple Pay", "translatedAttributes": [], "translatedCmiJobTags": [], "truncatedCompany": "Apple", "urgentlyHiring": false, "viewJobLink": "...", "vjFeaturedEmployerCandidate": false }, ]
We've successfully scraped a huge amount of data with just a few lines of Python code! Next, let's take a look at how to retrieve the remaining details of a job listing (like the full description) by scraping the job pages themselves.
Scraping Indeed Jobs
Our search results contain almost all of the job listing data except for a few details such as the full job description. To scrape those, we need the job ID, which can be found in the jobkey field of our search results:
{ "jobkey": "a82cf0bd2092efa3", }
Using the jobkey we can request the full job details page and, just like with the search, parse the embedded data rather than the HTML. All of the job and page information is hidden in a javascript variable called _initialData, which we can extract with a simple regular expression pattern:
```python
import re
import json
import asyncio
from typing import List

import httpx


def parse_job_page(html):
    """parse job data from job listing page"""
    data = re.findall(r"_initialData=(\{.+?\});", html)
    data = json.loads(data[0])
    return data["jobInfoWrapperModel"]["jobInfoModel"]


async def scrape_jobs(client: httpx.AsyncClient, job_keys: List[str]):
    """scrape job details from job page for given job keys"""
    urls = [f"https://www.indeed.com/m/basecamp/viewjob?viewtype=embedded&jk={job_key}" for job_key in job_keys]
    scraped = []
    for response in await asyncio.gather(*[client.get(url=url) for url in urls]):
        scraped.append(parse_job_page(response.text))
    return scraped
```
Run code and example output
```python
async def main():
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Connection": "keep-alive",
        "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
    }
    async with httpx.AsyncClient(headers=HEADERS) as client:
        job_data = await scrape_jobs(client, ["a82cf0bd2092efa3"])
        print(job_data[0]['sanitizedJobDescription']['content'])
        print(job_data)


asyncio.run(main())
```
This will scrape results similar to:
[ { "jobInfoHeaderModel": { "...", "companyName": "ExxonMobil", "companyOverviewLink": "https://www.indeed.com/cmp/Exxonmobil?campaignid=mobvjcmp&from=mobviewjob&tk=1gbpekba3is92800&fromjk=9dacdef3068a1d25", "companyReviewLink": "https://www.indeed.com/cmp/Exxonmobil/reviews?campaignid=mobvjcmp&cmpratingc=mobviewjob&from=mobviewjob&tk=1gbpekba3is92800&fromjk=9dacdef3068a1d25&jt=Geoscience+Technician", "companyReviewModel": { "companyName": "ExxonMobil", "desktopCompanyLink": "https://www.indeed.com/cmp/Exxonmobil/reviews?campaignid=viewjob&cmpratingc=mobviewjob&from=viewjob&tk=1gbpekba3is92800&fromjk=9dacdef3068a1d25&jt=Geoscience+Technician", "mobileCompanyLink": "https://www.indeed.com/cmp/Exxonmobil/reviews?campaignid=mobvjcmp&cmpratingc=mobviewjob&from=mobviewjob&tk=1gbpekba3is92800&fromjk=9dacdef3068a1d25&jt=Geoscience+Technician", "ratingsModel": { "ariaContent": "3.9 out of 5 stars from 4,649 employee ratings", "count": 4649, "countContent": "4,649 reviews", "descriptionContent": "Read what people are saying about working here.", "rating": 3.9, "showCount": true, "showDescription": true, "size": null } }, "disableAcmeLink": false, "employerActivity": null, "employerResponsiveCardModel": null, "formattedLocation": "Spring, TX 77389", "hideRating": false, "isDesktopApplyButtonSticky": false, "isSimplifiedHeader": false, "jobTitle": "Geoscience Technician", "openCompanyLinksInNewTab": false, "parentCompanyName": null, "preciseLocationModel": null, "ratingsModel": null, "remoteWorkModel": null, "subtitle": "ExxonMobil - Spring, TX 77389", "tagModels": null, "viewJobDisplay": "DESKTOP_EMBEDDED" }, "sanitizedJobDescription": { "content": "<p></p>\n<div>\n <div>\n <div>\n <div>\n <h2 class='\"jobSectionHeader\"'><b>Education and Related Experience</b></h2>\n </div>\n <div>\n ...", "contentKind": "HTML" }, "viewJobDisplay": "DESKTOP_EMBEDDED" } ]
If we run this scraper, we should see the full job description printed out.
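The description in sanitizedJobDescription comes back as HTML. If plain text is preferred, one possible approach is a small standard-library sketch like the one below (just one of several ways to strip the markup):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def text(self) -> str:
        return " ".join(part.strip() for part in self.parts if part.strip())


def html_to_text(html: str) -> str:
    # feed the HTML through the parser and join the collected text nodes
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()


# e.g. html_to_text(job_data[0]["sanitizedJobDescription"]["content"])
```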
Indeed Scraping Summary
In this short web scraping tutorial, we took a look at scraping Indeed.com job listing search. We built a search URL from custom search parameters and parsed the job data out of the embedded JSON with regular expressions.
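To tie both steps together, here is a minimal end-to-end sketch that reuses the scrape_search and scrape_jobs functions defined above (the max_results value and the header set are just examples):

```python
import asyncio
import json

import httpx

# the same browser-like headers used throughout the tutorial
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}


async def run_full_scrape():
    async with httpx.AsyncClient(headers=HEADERS) as client:
        # step 1: scrape the search result listings
        search_data = await scrape_search(client, query="python", location="texas", max_results=20)
        # step 2: scrape the full job pages for every found job key
        job_keys = [result["jobkey"] for result in search_data]
        jobs = await scrape_jobs(client, job_keys)
        print(json.dumps(jobs, indent=2))


asyncio.run(run_full_scrape())
```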