Yelp.com is one of the oldest and best-known yellow-pages websites. It contains business information such as addresses, websites and locations, as well as user reviews of these businesses.
In this web scraping tutorial we'll take a look at how to scrape yelp.com with Python. We'll start with some reverse engineering of the search functionality so we can find businesses, then we'll scrape and parse the business data itself. Finally, we'll look at how to avoid our scraper getting blocked when scraping at scale, as Yelp is notorious for blocking web scrapers.
Project Setup
In this tutorial we'll be using Python with a couple of popular community packages:
- httpx – an HTTP client library that will let us communicate with yelp.com's servers.
- parsel – an HTML parsing library that will help us parse the scraped HTML files for yelp data.
We can install them easily using the pip command:
$ pip install httpx parsel
Alternatively, feel free to swap httpx out for any other HTTP client package, such as requests, as we'll only need basic HTTP functions that are almost interchangeable across libraries. As for parsel, another great alternative is the beautifulsoup package, or anything else that supports CSS selectors, which is what we'll be using in this tutorial.
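For instance, the yelp-biz-id meta tag we extract later in this tutorial with parsel's .css() could just as easily be read with beautifulsoup. A quick sketch of the equivalence:

import httpx
from bs4 import BeautifulSoup

response = httpx.get("https://www.yelp.com/biz/capri-laguna-laguna-beach")
soup = BeautifulSoup(response.text, "html.parser")
# .select_one() accepts the same CSS selector syntax as parsel's .css()
print(soup.select_one('meta[name="yelp-biz-id"]')["content"])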
Discovering Yelp Business Pages
To start scraping, we need a way to discover businesses on yelp.
Unfortunately, if we take a look at yelp.com/robots.txt, we can see that yelp.com doesn't provide a sitemap or any directory pages that might list all of the businesses. This means we'll have to reverse engineer their search functionality and replicate it in our yelp scraper.
Scraping Search
Let's start by taking a look at yelp's front page and see what happens when we submit a search:
We can see that after entering the search details, we're redirected to a URL containing our search keywords:
https://www.yelp.com/search?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=220
This is our search seed request, but we can go even further and find the data request by inspecting the pagination. Let's click the next page link and watch the XHR tab of our browser's network inspector:
https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
We've found the data endpoint of yelp's backend API. We can see that the /search/snippet endpoint takes a few parameters and returns search results with business IDs and preview details, for example:
{
    // Business ID which we'll need later
    "bizId": "oIff0iLkEiPsWcDATe6mfA",
    // Business preview data
    "searchResultBusiness": {
        "ranking": null,
        "isAd": true,
        "renderAdInfo": true,
        "name": "Smooth Air",
        "alternateNames": [],
        "businessUrl": "/adredir?ad_business_id=oIff0iLkEiPsWcDATe6mfA&campaign_id=VcMvmxKjXiH2peL8g1c_jw&click_origin=search_results&placement=carousel_0&placement_slot=0&redirect_url=https%3A%2F%2Fwww.yelp.com%2Fbiz%2Fsmooth-air-brampton&request_id=daed206f44c35b85&signature=e537121fa6eb5d95fe240274d63ae189267de71994e5908c824eab5cea323c55&slot=1",
        "categories": [
            {"title": "Plumbing", "url": "/search?cflt=plumbing&find_loc=Toronto%2C+Ontario%2C+Canada"},
            {"title": "Heating & Air Conditioning/HVAC", "url": "/search?cflt=hvac&find_loc=Toronto%2C+Ontario%2C+Canada"},
            {"title": "Water Heater Installation/Repair", "url": "/search?cflt=waterheaterinstallrepair&find_loc=Toronto%2C+Ontario%2C+Canada"}
        ],
        "priceRange": "",
        "rating": 0.0,
        "reviewCount": 0,
        "formattedAddress": "",
        "neighborhoods": [],
        "phone": "",
        "serviceArea": null,
        "parentBusiness": null,
        "servicePricing": null,
        "bizSiteUrl": "https://biz.yelp.com"
    }
}
So, we can use this API endpoint to find all of the business IDs for a given location and search keyword. With this information we can start working on our web scraper.
We can start our web scraper by replicating the search request we saw earlier:
import asyncio

import httpx


async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user",
        },
    )
    assert resp.status_code == 200
    return resp.json()
Note: we're using async Python so that later on we can schedule multiple requests concurrently, which will give us a huge speed boost.
In the script above we're replicating the /search/snippet endpoint request, which returns the search result data of a single search page. Let's take a look at the results this scraper generates:
Run code & example output
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await _search_yelp_page('plumbers', 'Toronto, Ontario, Canada', session=session)
        print(results)


if __name__ == "__main__":
    asyncio.run(run())

Example output:

{
    "pageTitle": ["Yelp"],
    "loggingConfig": {
        "sitRepConfig": {
            "isSitRepEnabled": true,
            "enabledSitRepChannels": {
                "vertical_search_reservation": true,
                "vertical_search_platform": true,
                "frontend_performance": true,
                "search_suggest_events": true,
                "vertical_search_waitlist": true,
                "ad_syndication_cookie_sync_errors": true,
                "traffic_quality": true,
                "search_ux": true,
                "message_the_business": true,
                "ytp_session_events": true,
                "ad_syndication": true
            }
        }
    },
    "searchPageProps": { ... },
    ...
}
Next, we need to parse this search data and implement the ability to scrape all of the pages. Let's start with parsing:
from typing import Dict, List, Tuple


def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results['searchPageProps']['mainContentComponentsListProps']
    businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
    search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
    return businesses, search_meta
Backend APIs often contain a lot of metadata, including ads, tracking info and so on. However, we only need the business info and the total amount of results of the search query, so we can retrieve all of the results.
Finally, let's wrap everything up with a loop function that scrapes all of the available search pages. We'll scrape the first page and then scrape the rest of the pages asynchronously:
async def yelp_search_all(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    # get the first page data
    first_page = await _search_yelp_page(keyword, location, session=session)
    # parse first page for the first page of businesses and the total amount of results
    businesses, search_meta = parse_search(first_page)
    # scrape remaining pages asynchronously
    tasks = []
    for page in range(10, search_meta['totalResults'], 10):
        tasks.append(
            _search_yelp_page(keyword, location, session=session, offset=page)
        )
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])
    return businesses
This common pagination scraping idiom lets us greatly speed up web scraping with asynchronous requests: we retrieve the first page to get the total number of results, then schedule concurrent requests for the remaining pages.
import asyncio
from typing import Dict, List, Tuple

import httpx


def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """
    Parses yelp search results for business results
    Returns list of businesses and search metadata
    """
    results = search_results['searchPageProps']['mainContentComponentsListProps']
    businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
    search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
    return businesses, search_meta


async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
    """scrape single page of yelp search"""
    # final url example:
    # https://www.yelp.com/search/snippet?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada&ns=1&start=210&parent_request_id=54233ce74d09d270&request_origin=user
    resp = await session.get(
        "https://www.yelp.com/search/snippet",
        params={
            "find_desc": keyword,
            "find_loc": location,
            "start": offset,
            "parent_request": "",
            "ns": 1,
            "request_origin": "user",
        },
    )
    assert resp.status_code == 200
    return resp.json()


async def yelp_search_all(keyword: str, location: str, session: httpx.AsyncClient):
    """scrape all pages of yelp search for business preview data"""
    first_page = await _search_yelp_page(keyword, location, session=session)
    businesses, search_meta = parse_search(first_page)
    tasks = []
    for page in range(10, search_meta['totalResults'], 10):
        tasks.append(
            _search_yelp_page(keyword, location, session=session, offset=page)
        )
    for result in await asyncio.gather(*tasks):
        businesses.extend(parse_search(result)[0])
    return businesses
Run code & example output
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await yelp_search_all('plumbers', 'Toronto, Ontario, Canada', session=session)
        print(results)


if __name__ == "__main__":
    asyncio.run(run())
Scraping Yelp Business Data
Now that we have a business discovery scraper, we can go further and retrieve the details of each business we find. To do that, we'll scrape each business's URL.
Let's start by taking a look at the business page itself and where the data is located:
We can see that the HTML contains all of the business data we might need, like phone number, address and so on. However, if we fire up the web inspector, we can see that the structure itself is not very tidy:
Complex class names like these indicate that they're dynamically generated, meaning we cannot rely on class names in our HTML parsing selectors, or at least have to use them very defensively. Instead, we'll build selectors around text matching. In other words, we'll find keyword text such as "Get Directions" and navigate the tree to the address value:
We can achieve this easily by taking advantage of XPath's contains() function and the .. parent selector:
//a[contains(text(),"Get Directions")]/../following-sibling::p/text()
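To see how this selector behaves, here's a minimal, self-contained sketch; the HTML snippet below is a simplified stand-in, not Yelp's actual markup:

from parsel import Selector

# simplified stand-in for Yelp's address markup (hypothetical structure)
html = """
<div>
  <div><a href="/directions">Get Directions</a></div>
  <p>305 Fleetwood Crescent, Brampton, ON</p>
</div>
"""
sel = Selector(text=html)
# match the link by its text, step up to its parent with "..",
# then take the text of the sibling <p> element
address = sel.xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()').get()
print(address)  # 305 Fleetwood Crescent, Brampton, ON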
We'll be using this technique to get most of the values, so let's get to it. For our XPath selectors we'll use the parsel HTML parsing library:
$ pip install parsel
Using parsel and XPath we can fully extract all of the visible details on the page:
import asyncio
import json
from typing import Dict, List

import httpx
from parsel import Selector


def parse_company(resp: httpx.Response):
    sel = Selector(text=resp.text)
    xpath = lambda xp: sel.xpath(xp).get(default="").strip()
    open_hours = {}
    for day in sel.xpath('//th/p[contains(@class,"day-of-the-week")]'):
        name = day.xpath('text()').get().strip()
        value = day.xpath('../following-sibling::td//p/text()').get().strip()
        open_hours[name.lower()] = value
    return dict(
        name=xpath('//h1/text()'),
        website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
        phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
        address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
        logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
        claim_status=''.join(sel.xpath('//span[contains(@class,"claim-text")]/text()').getall()).strip().lower(),
        open_hours=open_hours,
    )


async def _scrape_companies_by_url(company_urls: List[str], session: httpx.AsyncClient) -> List[Dict]:
    """Scrape yelp company details from given yelp company urls"""
    responses = await asyncio.gather(*[
        session.get(url) for url in company_urls
    ])
    results = []
    for resp in responses:
        results.append(parse_company(resp))
    return results
Here we've added our parse_company function, in which we use the XPath text-matching technique we covered earlier to extract the highlighted fields. If we run this scraper, we'll see results similar to these:
Run code
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await _scrape_companies_by_url(
            ["https://www.yelp.com/biz/smooth-air-brampton"], session=session
        )
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Example output:

[{
    "name": "Smooth Air",
    "website": "https://www.smoothairhvac.com",
    "phone": "(647) 828-6789",
    "address": "305 Fleetwood Crescent Brampton, ON L6T 2E7 Canada",
    "logo": "https://s3-media0.fl.yelpcdn.com/businessregularlogo/c90545xfS2yr7R7yKe9gZg/ms.jpg",
    "claim_status": "claimed",
    "open_hours": {
        "mon": "Open 24 hours",
        "tue": "Open 24 hours",
        "wed": "Open 24 hours",
        "thu": "Open 24 hours",
        "fri": "Open 24 hours",
        "sat": "Open 24 hours",
        "sun": "Open 24 hours"
    }
},
...
]
Scraping Yelp Reviews
To scrape a Yelp business's reviews, we'll have to take a look at another hidden API request. The easiest way to find this API endpoint is to simply click the second review page and watch the web inspector for outgoing requests:
Here we can see a request being made to the /review_feed endpoint. It takes a few parameters, such as the BUSINESS_ID we scraped earlier in the Yelp search step.
How to find a Yelp business ID?
A Yelp business ID can also be found in the HTML source of the business page itself:
import httpx
from parsel import Selector


def scrape_business_id(url):
    response = httpx.get(url)
    selector = Selector(text=response.text)
    return selector.css('meta[name="yelp-biz-id"]::attr(content)').get()


print(scrape_business_id("https://www.yelp.com/biz/capri-laguna-laguna-beach"))
"Yz7qwi0GipbeLBFAjSr_PQ"
For example, let's take this business: https://www.yelp.com/biz/Yz7qwi0GipbeLBFAjSr_PQ.
Its reviews are located at https://www.yelp.com/biz/Yz7qwi0GipbeLBFAjSr_PQ/review_feed?rl=en&q=&sort_by=relevance_desc&start=10. We can even click this link and view the JSON results right in the browser.
Let's take a look at how to scrape yelp reviews in Python:
import asyncio
import json
from typing import List, TypedDict

import httpx
from parsel import Selector


class Review(TypedDict):
    id: str
    userId: str
    business: dict
    user: dict
    comment: dict
    rating: int
    ...


async def scrape_reviews(business_url: str, session: httpx.AsyncClient) -> List[Review]:
    # first find business ID from business URL
    response_business = await session.get(business_url)
    selector = Selector(text=response_business.text)
    business_id = selector.css('meta[name="yelp-biz-id"]::attr(content)').get()
    # then scrape the first review page to get the total review count
    first_page = await session.get(
        f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start=0"
    )
    first_page_data = json.loads(first_page.text)
    reviews = first_page_data["reviews"]
    total_reviews = first_page_data["pagination"]["totalResults"]
    print(f"scraping {total_reviews} reviews of business {business_id}")
    # finally, scrape the remaining review pages concurrently
    to_scrape = [
        session.get(
            f"https://www.yelp.com/biz/{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={offset}"
        )
        for offset in range(10, total_reviews + 10, 10)
    ]
    for page in asyncio.as_completed(to_scrape):
        response = await page
        data = json.loads(response.text)
        reviews.extend(data["reviews"])
    return reviews
Run code & example output
BASE_HEADERS = {
    "authority": "www.yelp.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
        results = await scrape_reviews("https://www.yelp.com/biz/capri-laguna-laguna-beach", session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Example output:

[
  {
    "id": "eB6j_V2LILubb2i0O6pODw",
    "userId": "ANYfELwm1rX-Z__Ryi_pQQ",
    "business": {
      "id": "Yz7qwi0GipbeLBFAjSr_PQ",
      "alias": "capri-laguna-laguna-beach",
      "name": "Capri Laguna",
      "photoSrc": "https://s3-media0.fl.yelpcdn.com/bphoto/nLCrCo0iobpoB6dIpEifXw/60s.jpg"
    },
    "user": {
      "link": "HIDDEN",
      "src": "HIDDEN",
      "srcSet": null,
      "markupDisplayName": "HIDDEN",
      "displayLocation": "HIDDEN",
      "altText": "HIDDEN",
      "userUrl": "HIDDEN",
      "partnerAlias": null,
      "friendCount": 0,
      "photoCount": 3,
      "reviewCount": 1,
      "eliteYear": null
    },
    "comment": {
      "text": "Very nice getaway for the family! I have been in Capri Laguna three times this summer already and the place never fails to amaze me. <br>The hotel has the best view over the ocean. You can watch the sunset from any deck or terrace in this hotel. As well some rooms have private balconies. <br>The service was great. The rooms are very clean and comfortable. The area is so calm and relaxing, you can sleep peacefully and comfortably. <br>The staff is so welcoming and respectful. Bachir was great, he is kind, friendly and very professional. Amazing customer service. Thank you Bachir!",
      "language": "en"
    },
    "localizedDate": "9/23/2022",
    "localizedDateVisited": null,
    "rating": 5,
    "photos": [
      {
        "src": "https://s3-media0.fl.yelpcdn.com/bphoto/0OUID9ZaHT89dEgZpU9wmA/180s.jpg",
        "caption": null,
        ...
      }
      ...
    ],
    "lightboxMediaItems": [
      { ... },
    ],
    "photosUrl": "/biz_photos/capri-laguna-laguna-beach?userid=ANYfELwm1rX-Z__Ryi_pQQ",
    "totalPhotos": 3,
    "feedback": {
      "counts": { "useful": 0, "funny": 0, "cool": 0 },
      "userFeedback": { "useful": false, "funny": false, "cool": false },
      "voterText": null
    },
    "isUpdated": false,
    "businessOwnerReplies": null,
    "appreciatedBy": null,
    "previousReviews": null,
    "tags": [
      {
        "label": "3 photos",
        "title": null,
        "href": "HIDDEN",
        "iconName": "18x18_camera",
        "iconColor": ""
      }
    ]
  },
  ...
]
In our scraper above, to download the yelp review data we first scrape the business ID from the business's profile page. Then we use this ID to scrape the first review page to find the review count, and scrape the remaining review pages concurrently.
The snippet above retrieved over 500 yelp reviews in just a few seconds! That's because the hidden API is much faster than the HTML pages.
Bypassing Yelp Blocking
Yelp.com is a major web scraping target, which means they employ many techniques to block web scrapers at scale. To retrieve the pages we did use custom headers that replicate a common web browser, but if we were to scale this scraper up to thousands of businesses, Yelp would eventually catch on and block us.
Once Yelp realizes the client is a web scraper, it will start redirecting all requests to a "This page is not available" web page. How can we avoid this?
There are many things we can do to avoid scraper blocking; for all of the details, see our in-depth guide on web scraping blocking.
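As a small taste of what that involves, here's a minimal sketch of one common technique: rotating the User-Agent header between requests so they don't all share an identical fingerprint. The user-agent strings below are illustrative placeholders, and the proxy URL in the comment is hypothetical:

import random

import httpx

# illustrative pool of browser user-agent strings (placeholders)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36",
]


def random_headers() -> dict:
    """build request headers with a randomly chosen user-agent"""
    return {
        "user-agent": random.choice(USER_AGENTS),
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-language": "en-US;en;q=0.9",
    }


async def scrape_with_rotation(urls: list):
    results = []
    for url in urls:
        # a fresh client per request rotates both headers and connections;
        # a real scraper would also rotate proxies, e.g.:
        # httpx.AsyncClient(proxies="http://user:pass@proxy-host:8000", ...)
        async with httpx.AsyncClient(headers=random_headers()) as client:
            results.append(await client.get(url))
    return results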
FAQ
To wrap this guide up, let's take a look at some frequently asked questions about web scraping Yelp:
Is web scraping yelp.com legal?
Yes. Yelp only hosts public data, and we're not extracting anything personal or private. Scraping yelp.com at a slow, respectful rate falls under the definition of ethical scraping. That being said, when scraping Yelp reviews we should make sure not to collect any personal data in GDPR-protected countries, or consult a lawyer first.
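For example, one simple way to keep a scraper at a slow, respectful rate is to cap the number of concurrent requests with an asyncio semaphore and pause between them. This is a minimal sketch; the limit of 3 and the 1-second delay are arbitrary illustrative values:

import asyncio

import httpx

# allow at most 3 requests in flight at any time (arbitrary illustrative limit)
semaphore = asyncio.Semaphore(3)


async def polite_get(url: str, session: httpx.AsyncClient) -> httpx.Response:
    """fetch a url while respecting the global concurrency limit"""
    async with semaphore:
        response = await session.get(url)
        # small pause between requests to reduce server load
        await asyncio.sleep(1)
        return response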
How to scrape Yelp reviews?
To retrieve the reviews of a business page, we need to replicate another backend API request. If we click the second page in the review container, we can see a request being made to https://www.yelp.com/biz/BUSINESS_ID/review_feed?rl=en&q=&sort_by=relevance_desc&start=10, where BUSINESS_ID is the ID we extracted earlier in the search step, or an ID that can be found in the HTML source of the business page itself.
For example, for https://www.yelp.com/biz/capri-laguna-laguna-beach the reviews would be located under https://www.yelp.com/biz/Yz7qwi0GipbeLBFAjSr_PQ/review_feed?rl=en&q=&sort_by=relevance_desc&start=10
Yelp Scraping Summary
In this tutorial we built a small yelp.com scraper which discovers businesses based on provided keyword and location input, and retrieves their contact details such as phone numbers, website and other information fields. For this we used Python with the httpx and parsel packages.