Glassdoor is mostly known for company reviews by past and current employees, though it contains much more data, such as company metadata, salary information, and job listings. This makes Glassdoor a great public data target for web scraping! In this hands-on web scraping tutorial, we'll take a look at glassdoor.com and how we can scrape company information, job listings, and reviews. We'll do this in Python using a few popular community packages, so let's dive in.
Project Setup
In this tutorial we'll be using Python with a couple of popular community packages:
- httpx – an HTTP client library that will let us communicate with glassdoor.com's servers
- parsel – an HTML parsing library, though we'll be doing very little HTML parsing in this tutorial and will mostly work with JSON data directly.
These packages can be easily installed via the pip command:
$ pip install httpx parsel
Alternatively, feel free to swap httpx out with any other HTTP client package, such as requests, as we'll only need basic HTTP functions, which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package.
Dealing with Glassdoor Overlays
After browsing Glassdoor for a while, we're bound to run into an overlay that asks us to log in:
All of the content is still there, just covered by the overlay, so when scraping, our parsing tools can still find this data:
import httpx
from parsel import Selector

response = httpx.get(
    "https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm"
)
selector = Selector(response.text)
# find description in the HTML:
print(selector.css('[data-test="employerDescription"]::text').get())
# will print:
# eBay is where the world goes to shop, sell, and give. Every day, our professionals
# connect millions of buyers and sellers around the globe, empowering people and
# creating opportunity. We're on a mission to build a better, more connected form of
# commerce that benefits individuals
That said, when developing web scrapers we want to view and inspect the web page. We can easily remove the overlay with a bit of JavaScript:
function addGlobalStyle(css) {
    var head, style;
    head = document.getElementsByTagName('head')[0];
    if (!head) { return; }
    style = document.createElement('style');
    style.type = 'text/css';
    style.innerHTML = css;
    head.appendChild(style);
}
addGlobalStyle("#HardsellOverlay {display:none !important;}");
addGlobalStyle("body {overflow:auto !important; position: initial !important}");
window.addEventListener("scroll", event => event.stopPropagation(), true);
window.addEventListener("mousemove", event => event.stopPropagation(), true);
This script sets a few global CSS styles that hide the overlay, and it can be executed through the web browser's developer tools console (F12 key, Console tab). Alternatively, it can be added as a bookmarklet in the bookmarks toolbar: just drag this link: glassdoor overlay remover to the bookmarks toolbar and click it anytime to remove the overlay.
Selecting a Region
Glassdoor operates all around the world and most of its content is region-aware. For example, if we're viewing eBay's Glassdoor profile on glassdoor.co.uk, we'll only see job listings relevant to the United Kingdom. To select a region when web scraping, we can supply a cookie with the selected region's ID:
from parsel import Selector
import httpx

france_location_cookie = {"tldp": "6"}
response = httpx.get(
    "https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.11,15.htm",
    cookies=france_location_cookie,
    follow_redirects=True,
)
selector = Selector(response.text)
# find employee count in the HTML:
print(selector.css('[data-test="employer-size"]::text').get())
# will print:
# Plus de 10 000 employés
How do we get country IDs? All country IDs are present in every Glassdoor page's HTML and can be extracted with a simple regular expression pattern:

import re
import json
import httpx

response = httpx.get(
    "https://www.glassdoor.com/",
    follow_redirects=True,
)
country_data = re.findall(r'"countryMenu\\":.+?(\[.+?\])', response.text)[0].replace('\\', '')
country_data = json.loads(country_data)
for country in country_data:
    print(f"{country['textKey']}: {country['id']}")

Note that these IDs are unlikely to change, so here's the full output:

Argentina: 13
Australia: 5
Belgique (Français): 15
België (Nederlands): 14
Brasil: 9
Canada (English): 3
Canada (Français): 19
Deutschland: 7
España: 8
France: 6
Hong Kong: 20
India: 4
Ireland: 18
Italia: 23
México: 12
Nederland: 10
New Zealand: 21
Schweiz (Deutsch): 16
Singapore: 22
Suisse (Français): 17
United Kingdom: 2
United States: 1
Österreich: 11
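For convenience, we can wrap this lookup into a small helper that turns a country name into the region cookie used earlier. This is just an illustrative sketch; the get_region_cookie() helper is our own invention, not part of Glassdoor's API:

import re
import json
import httpx

def get_region_cookie(country_name: str) -> dict:
    """build a {"tldp": <id>} cookie dict for a given country name, e.g. "France" """
    response = httpx.get("https://www.glassdoor.com/", follow_redirects=True)
    country_data = re.findall(r'"countryMenu\\":.+?(\[.+?\])', response.text)[0].replace('\\', '')
    lookup = {country["textKey"]: str(country["id"]) for country in json.loads(country_data)}
    return {"tldp": lookup[country_name]}

# usage:
# httpx.get(url, cookies=get_region_cookie("France"), follow_redirects=True)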
However, some pages use URL parameters instead, which we'll cover in the respective web scraping sections of this article.
Scraping Glassdoor Company Data
In this tutorial we'll focus on scraping company information such as company overview, job listings, reviews, etc. That said, the techniques covered in this section can be applied to almost any other data page on glassdoor.com.
Company ID
Before we scrape any specific company's data, we need to know their internal Glassdoor ID and name. For that, we can use Glassdoor's search page suggestions. For example, if we search for "eBay", we'll see a list of companies and their IDs:
To scrape this in Python, we can use the typeahead API endpoint:
import json
import httpx

def find_companies(query: str):
    """find company Glassdoor ID and name by query. e.g. "ebay" will return "eBay" with ID 7853"""
    result = httpx.get(
        url=f"https://www.glassdoor.com/searchsuggest/typeahead?numSuggestions=8&source=GD_V2&version=NEW&rf=full&fallback=token&input={query}",
    )
    data = json.loads(result.content)
    return data[0]["suggestion"], data[0]["employerId"]

print(find_companies("ebay"))
# ('eBay', '7853')
Now that we can easily retrieve a company's name ID and numeric ID, we can start scraping company job listings, reviews, salaries, and more.
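For example, since Glassdoor URLs embed both the company name and its numeric ID, we can combine find_companies() with a tiny URL builder. The url_overview() helper below is a hypothetical convenience function that mirrors the overview URL template used throughout this article:

def url_overview(company_name: str, company_id: str) -> str:
    """build a company overview URL from the name and ID returned by find_companies()"""
    return f"https://www.glassdoor.com/Overview/Working-at-{company_name}-EI_IE{company_id}.htm"

name, employer_id = find_companies("ebay")
print(url_overview(name, employer_id))
# https://www.glassdoor.com/Overview/Working-at-eBay-EI_IE7853.htm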
Company Overview
Let's kick off our scraper by scraping company overview data:
To scrape these details, we need either the company's page URL, or we can generate the URL ourselves from the company's ID name and number.
import httpx
from parsel import Selector

company_name = "eBay"
company_id = "7853"

url = f"https://www.glassdoor.com/Overview/Working-at-{company_name}-EI_IE{company_id}.htm"
response = httpx.get(
    url,
    cookies={"tldp": "1"},  # use cookies to force US location
    follow_redirects=True,
)
sel = Selector(response.text)
print(sel.css("h1::text").get())
To parse company data, we could use traditional HTML parsing tools like BeautifulSoup to parse the rendered HTML. However, since Glassdoor is powered by Apollo GraphQL, we can extract the hidden JSON web data from the page source instead. The advantage of scraping hidden web data is that it's the complete dataset of everything available on the page. This means we can extract more data than is visible on the page, and it's already structured for us. Let's see how to do that in Python:
import re
import json
import httpx

def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    return json.loads(data)

def scrape_overview(company_name: str, company_id: int) -> dict:
    url = f"https://www.glassdoor.com/Overview/Working-at-{company_name}-EI_IE{company_id}.htm"
    response = httpx.get(url, cookies={"tldp": "1"}, follow_redirects=True)
    apollo_state = extract_apollo_state(response.text)
    return next(v for k, v in apollo_state.items() if k.startswith("Employer:"))

print(json.dumps(scrape_overview("eBay", 7853), indent=2))
示例输出 { "__typename": "Employer", "id": 7853, "awards({\"limit\":200,\"onlyFeatured\":false})": [ { "__typename": "EmployerAward", "awardDetails": null, "name": "Best Places to Work", "source": "Glassdoor", "year": 2022, "featured": true }, "... truncated for preview ..." ], "shortName": "eBay", "links": { "__typename": "EiEmployerLinks", "reviewsUrl": "/Reviews/eBay-Reviews-E7853.htm", "manageoLinkData": null }, "website": "www.ebayinc.com", "type": "Company - Public", "revenue": "$10+ billion (USD)", "headquarters": "San Jose, CA", "size": "10000+ Employees", "stock": "EBAY", "squareLogoUrl({\"size\":\"SMALL\"})": "https://media.glassdoor.com/sqls/7853/ebay-squareLogo-1634568971365.png", "primaryIndustry": { "__typename": "EmployerIndustry", "industryId": 200063, "industryName": "Internet & Web Services" }, "yearFounded": 1995, "overview": { "__typename": "EmployerOverview", "description": "eBay is where the world goes to shop, sell, and give. Every day, our professionals connect millions of buyers and sellers around the globe, empowering people and creating opportunity. We're on a mission to build a better, more connected form of commerce that benefits individuals, businesses, and society. We create stronger connections between buyers and sellers, offering product experiences that are fast, mobile and secure. At eBay, we develop technologies that enable connected commerce and make every interaction effortless\u2014and more human. And we are doing it on a global scale, providing everyone with the chance to participate and create value.", "mission": "We connect people and build communities to create economic opportunity for all." }, "bestProfile": { "__ref": "EmployerProfile:7925" }, "employerManagedContent({\"parameters\":[{\"divisionProfileId\":961530,\"employerId\":7853}]})": [ { "__typename": "EmployerManagedContent", "diversityContent": { "__typename": "DiversityAndInclusionContent", "programsAndInitiatives": { "__ref": "EmployerManagedContentSection:0" }, "goals": [] } } ], "badgesOfShame": [] }
By parsing the embedded GraphQL data, we can easily extract entire company datasets with just a few lines of code! Next, let's take a look at how we can use this technique to scrape other details, such as jobs and reviews.
Scraping Glassdoor Job Listings
To scrape job listings, we'll also look at the embedded GraphQL data, though this time we'll parse the GraphQL cache rather than the state data. For that, let's take a look at eBay's Glassdoor jobs page:
If we take a look at the page source, we can see that all of the job data is present in a window.appCache JavaScript variable inside a hidden <script> node:
To extract it, we can use a few common parsing algorithms:
import json
from collections import defaultdict
from parsel import Selector

def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data and its ID"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            # backtrack to find the key/identifier for this json object:
            key_end = text.rfind('"', 0, match)
            key_start = text.rfind('"', 0, key_end)
            key = text[key_start + 1 : key_end]
            yield key, result
            pos = match + index
        except ValueError:
            pos = match + 1

def extract_apollo_cache(html):
    """Extract apollo graphql cache data from HTML source"""
    selector = Selector(text=html)
    script_with_cache = selector.xpath("//script[contains(.,'window.appCache')]/text()").get()
    cache = defaultdict(list)
    for key, data in find_json_objects(script_with_cache):
        cache[key].append(data)
    return cache
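To get a feel for what this generator yields, here's a quick toy run on a made-up, cache-like string (hypothetical data, not a real Glassdoor payload):

# toy demonstration of find_json_objects() on a made-up cache-like string:
sample = '"JobListing:1": {"title": "Engineer"}, "JobListing:2": {"title": "Designer"}'
for key, obj in find_json_objects(sample):
    print(key, obj)
# JobListing:1 {'title': 'Engineer'}
# JobListing:2 {'title': 'Designer'}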
Our parser above takes the HTML text, finds the <script> node containing the window.appCache variable, and extracts the cache objects. Let's see how it handles Glassdoor's jobs page:
import asyncio
import json
import math
from collections import defaultdict
from typing import Dict, List

import httpx
from parsel import Selector

session = httpx.AsyncClient(
    timeout=httpx.Timeout(30.0),
    cookies={"tldp": "1"},
    follow_redirects=True,
)

def find_json_objects(text: str, decoder=json.JSONDecoder()):
    """Find JSON objects in text, and generate decoded JSON data and its ID"""
    pos = 0
    while True:
        match = text.find("{", pos)
        if match == -1:
            break
        try:
            result, index = decoder.raw_decode(text[match:])
            # backtrack to find the key/identifier for this json object:
            key_end = text.rfind('"', 0, match)
            key_start = text.rfind('"', 0, key_end)
            key = text[key_start + 1 : key_end]
            yield key, result
            pos = match + index
        except ValueError:
            pos = match + 1

def extract_apollo_cache(html):
    """Extract apollo graphql cache data from HTML source"""
    selector = Selector(text=html)
    script_with_cache = selector.xpath("//script[contains(.,'window.appCache')]/text()").get()
    cache = defaultdict(list)
    for key, data in find_json_objects(script_with_cache):
        cache[key].append(data)
    return cache

def parse_jobs(html) -> List[Dict]:
    """parse jobs page for job data"""
    cache = extract_apollo_cache(html)
    return [v["jobview"] for v in cache["JobListingSearchResult"]]

def parse_job_page_count(html) -> int:
    """parse job page count from pagination details in Glassdoor jobs page"""
    _total_results = Selector(html).css(".paginationFooter::text").get()
    if not _total_results:
        return 1
    _total_results = int(_total_results.split()[-1])
    _total_pages = math.ceil(_total_results / 40)
    return _total_pages

async def scrape_jobs(employer_name: str, employer_id: str):
    """Scrape job listings"""
    # scrape first page of jobs:
    first_page = await session.get(
        url=f"https://www.glassdoor.com/Jobs/{employer_name}-Jobs-E{employer_id}.htm?filter.countryId={session.cookies.get('tldp') or 0}",
    )
    jobs = parse_jobs(first_page.text)
    total_pages = parse_job_page_count(first_page.text)
    print(f"scraped first page of jobs, scraping remaining {total_pages - 1} pages")
    other_pages = [
        session.get(
            url=str(first_page.url).replace(".htm", f"_P{page}.htm"),
        )
        for page in range(2, total_pages + 1)
    ]
    for page in await asyncio.gather(*other_pages):
        jobs.extend(parse_jobs(page.text))
    return jobs

async def main():
    jobs = await scrape_jobs("eBay", "7853")
    print(json.dumps(jobs, indent=2))

asyncio.run(main())
Above, we have a scraper that follows the classic pagination scraping algorithm:
- scrape the first jobs page
- extract the GraphQL cache for the job data
- parse the HTML for the total page count
- scrape the remaining pages concurrently
If we run our scraper, we should get all of the job listings in no time!
示例输出 [ { "header": { "adOrderId": 1281260, "advertiserType": "EMPLOYER", "ageInDays": 0, "easyApply": false, "employer": { "id": 7853, "name": "eBay inc.", "shortName": "eBay", "__typename": "Employer" }, "goc": "machine learning engineer", "gocConfidence": 0.9, "gocId": 102642, "jobLink": "/partner/jobListing.htm?pos=140&ao=1281260&s=21&guid=0000018355c715f3b12a6090d334a7dc&src=GD_JOB_AD&t=ESR&vt=w&cs=1_9a5bdc18&cb=1663591454509&jobListingId=1008147859269&jrtk=3-0-1gdase5jgjopr801-1gdase5k2irmo800-952fa651f152ade0-", "jobTitleText": "Sr. Manager, AI-Guided Service Products", "locationName": "Salt Lake City, UT", "divisionEmployerName": null, "needsCommission": false, "payCurrency": "USD", "payPercentile10": 75822, "payPercentile25": 0, "payPercentile50": 91822, "payPercentile75": 0, "payPercentile90": 111198, "payPeriod": "ANNUAL", "salarySource": "ESTIMATED", "sponsored": true, "__typename": "JobViewHeader" }, "job": { "importConfigId": 322429, "jobTitleText": "Sr. Manager, AI-Guided Service Products", "jobTitleId": 0, "listingId": 1008147859269, "__typename": "JobDetails" }, "jobListingAdminDetails": { "cpcVal": null, "jobListingId": 1008147859269, "jobSourceId": 0, "__typename": "JobListingAdminDetailsVO" }, "overview": { "shortName": "eBay", "squareLogoUrl": "https://media.glassdoor.com/sql/7853/ebay-squareLogo-1634568971326.png", "__typename": "Employer" }, "__typename": "JobView" }, "..." ]
Now that we understand how Glassdoor works, let's take a look at how we can grab other details available in the GraphQL cache, such as reviews.
Glassdoor Company Reviews
To scrape reviews, we'll take a look at another GraphQL feature: the page state data. Just like we found the GraphQL cache in the page HTML, we can also find the GraphQL state:
So, to scrape reviews, we can parse the GraphQL state data, which contains the payload with all of the reviews, review metadata, and other data details:
import asyncio
import re
import json
from typing import Dict

import httpx

def extract_apollo_state(html):
    """Extract apollo graphql state data from HTML source"""
    data = re.findall(r'apolloState":\s*({.+})};', html)[0]
    data = json.loads(data)
    return data

def parse_reviews(html) -> Dict:
    """parse reviews page for review data"""
    cache = extract_apollo_state(html)
    xhr_cache = cache["ROOT_QUERY"]
    reviews = next(v for k, v in xhr_cache.items() if k.startswith("employerReviews") and v.get("reviews"))
    return reviews

async def scrape_reviews(employer: str, employer_id: str, session: httpx.AsyncClient):
    """Scrape company reviews"""
    # scrape first page of reviews:
    first_page = await session.get(
        url=f"https://www.glassdoor.com/Reviews/{employer}-Reviews-E{employer_id}.htm",
    )
    reviews = parse_reviews(first_page.text)
    # find total amount of pages and scrape remaining pages concurrently
    total_pages = reviews["numberOfPages"]
    print(f"scraped first page of reviews, scraping remaining {total_pages - 1} pages")
    other_pages = [
        session.get(
            url=str(first_page.url).replace(".htm", f"_P{page}.htm"),
        )
        for page in range(2, total_pages + 1)
    ]
    for page in await asyncio.gather(*other_pages):
        page_reviews = parse_reviews(page.text)
        reviews["reviews"].extend(page_reviews["reviews"])
    return reviews
We can run our scraper just like before:
async def main():
    async with httpx.AsyncClient(
        timeout=httpx.Timeout(30.0),
        cookies={"tldp": "1"},
        follow_redirects=True,
    ) as client:
        reviews = await scrape_reviews("eBay", "7853", client)
        print(json.dumps(reviews, indent=2))

asyncio.run(main())
This will produce results similar to the following:
{ "__typename": "EmployerReviews", "filteredReviewsCountByLang": [ { "__typename": "ReviewsCountByLanguage", "count": 4109, "isoLanguage": "eng" }, "..." ], "employer": { "__ref": "Employer:7853" }, "queryLocation": null, "queryJobTitle": null, "currentPage": 1, "numberOfPages": 411, "lastReviewDateTime": "2022-09-16T14:51:36.650", "allReviewsCount": 5017, "ratedReviewsCount": 4218, "filteredReviewsCount": 4109, "ratings": { "__typename": "EmployerRatings", "overallRating": 4.1, "reviewCount": 4218, "ceoRating": 0.87, "recommendToFriendRating": 0.83, "cultureAndValuesRating": 4.2, "diversityAndInclusionRating": 4.3, "careerOpportunitiesRating": 3.8, "workLifeBalanceRating": 4.1, "seniorManagementRating": 3.7, "compensationAndBenefitsRating": 4.1, "businessOutlookRating": 0.66, "ceoRatingsCount": 626, "ratedCeo": { "__ref": "Ceo:768619" } }, "reviews": [ { "__typename": "EmployerReview", "isLegal": true, "reviewId": 64767391, "reviewDateTime": "2022-05-27T08:41:43.217", "ratingOverall": 5, "ratingCeo": "APPROVE", "ratingBusinessOutlook": "POSITIVE", "ratingWorkLifeBalance": 5, "ratingCultureAndValues": 5, "ratingDiversityAndInclusion": 5, "ratingSeniorLeadership": 5, "ratingRecommendToFriend": "POSITIVE", "ratingCareerOpportunities": 5, "ratingCompensationAndBenefits": 5, "employer": { "__ref": "Employer:7853" }, "isCurrentJob": true, "lengthOfEmployment": 1, "employmentStatus": "REGULAR", "jobEndingYear": null, "jobTitle": { "__ref": "JobTitle:60210" }, "location": { "__ref": "City:1139151" }, "originalLanguageId": null, "pros": "Thorough training, compassionate and very patient with all trainees. Benefits day one. Inclusive and really work with their employees to help them succeed in their role.", "prosOriginal": null, "cons": "No cons at all! So far everything is great!", "consOriginal": null, "summary": "Excellent Company", "summaryOriginal": null, "advice": null, "adviceOriginal": null, "isLanguageMismatch": false, "countHelpful": 2, "countNotHelpful": 0, "employerResponses": [], "isCovid19": false, "divisionName": null, "divisionLink": null, "topLevelDomainId": 1, "languageId": "eng", "translationMethod": null }, "..." ] }
Other Details
The embedded GraphQL data parsing techniques we learned while scraping job and review data can be applied to scrape other company details such as salaries, interviews, benefits, and photos. For example, salary data can be found the same way we found the reviews:
...

def parse_salaries(html) -> Dict:
    """parse salaries page for salary data"""
    cache = extract_apollo_state(html)
    xhr_cache = cache["ROOT_QUERY"]
    salaries = next(v for k, v in xhr_cache.items() if k.startswith("salariesByEmployer") and v.get("results"))
    return salaries

async def scrape_salaries(employer: str, employer_id: str, session: httpx.AsyncClient):
    """Scrape salary listings"""
    # scrape first page of salaries:
    first_page = await session.get(
        url=f"https://www.glassdoor.com/Salaries/{employer}-Salaries-E{employer_id}.htm",
    )
    salaries = parse_salaries(first_page.text)
    total_pages = salaries["pages"]
    print(f"scraped first page of salaries, scraping remaining {total_pages - 1} pages")
    other_pages = [
        session.get(
            url=str(first_page.url).replace(".htm", f"_P{page}.htm"),
        )
        for page in range(2, total_pages + 1)
    ]
    for page in await asyncio.gather(*other_pages):
        page_salaries = parse_salaries(page.text)
        salaries["results"].extend(page_salaries["results"])
    return salaries
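We can run it the same way as the reviews scraper. Here's a minimal sketch, assuming parse_salaries() and scrape_salaries() from above are defined in the same file:

async def main():
    async with httpx.AsyncClient(
        timeout=httpx.Timeout(30.0),
        cookies={"tldp": "1"},
        follow_redirects=True,
    ) as client:
        salaries = await scrape_salaries("eBay", "7853", client)
        print(json.dumps(salaries, indent=2))

asyncio.run(main())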
FAQ
To wrap up this guide, let's take a look at some frequently asked questions about web scraping glassdoor.com:
Is it legal to scrape Glassdoor?
Yes. The data displayed on Glassdoor is publicly available and we're not extracting anything private. Scraping glassdoor.com at slow, respectful rates falls under the ethical scraping definition. That being said, attention should be paid to GDPR compliance in the EU when scraping user-submitted data such as reviews.
How to find all company pages listed on Glassdoor?
Glassdoor contains over half a million US companies alone, but it has no sitemap. It does, however, contain several limited directory pages, such as the directory of US companies. Unfortunately, the directory pages are limited to a few hundred pages, though through scraping and filtering all of the pages can be discovered. Another approach, which we covered in this tutorial, is to discover company pages via company IDs. Every company on Glassdoor is assigned an incremental ID in the 1000-1,000,000+ range. Using HEAD-type requests, we can easily poke each of these IDs to see whether they lead to a company page.
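Here's a minimal sketch of that ID-probing idea. Note that the URL pattern and status-code check below are illustrative assumptions; Glassdoor may redirect, rate-limit, or block rapid probing, so this should be run slowly and respectfully:

import asyncio
import httpx

async def probe_company_ids(start: int, end: int) -> list:
    """check which IDs in a range lead to valid Glassdoor company pages (illustrative sketch)"""
    found = []
    async with httpx.AsyncClient(follow_redirects=True, timeout=10.0) as client:
        for company_id in range(start, end):
            # assumption: a valid ID resolves to a company overview page with status 200
            response = await client.head(
                f"https://www.glassdoor.com/Overview/Working-at-EI_IE{company_id}.htm"
            )
            if response.status_code == 200:
                found.append(company_id)
    return found

print(asyncio.run(probe_company_ids(1000, 1010)))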
Glassdoor Scraping Summary
In this web scraping tutorial, we took a look at how to scrape company details displayed on glassdoor.com, such as metadata, reviews, job listings, and salaries. We did this by taking advantage of the GraphQL cache and state data, which we extracted with a few generic web scraping algorithms in plain Python. By understanding how Glassdoor's web infrastructure works, we can easily collect all of the public company datasets available on Glassdoor.