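In this web scraping tutorial, we'll take a look at how to scrape Yelp using Python. We'll start by reverse engineering the search functionality so that we can find businesses, and then we'll scrape and parse the business data itself. Finally, we'll look at how to avoid our scraper being blocked when scraping at scale, as Yelp is notorious for blocking web scrapers.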
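In this tutorial we'll be using Python together with a few popular community packages:
- httpx – HTTP client library, which will let us communicate with Yelp's servers.
- parsel – HTML parsing library, which will help us parse the scraped HTML files for Yelp data.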
$ pip install httpx parsel
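Alternatively, httpx can be swapped for any other HTTP client package, such as requests, since we only need basic HTTP functions that are almost interchangeable across libraries. As for parsel, another good alternative is the beautifulsoup package, or anything else that supports CSS selectors, which we use in this tutorial. Note that beautifulsoup does not support XPath, so the XPath selectors used later would have to be rewritten as CSS selectors.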
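Discovering Yelp Company Pages
To start scraping, we need a way to discover businesses on Yelp.
Unfortunately, Yelp doesn't provide a sitemap or any directory pages that might list all businesses. This means we have to reverse engineer its search functionality and replicate it in our Yelp scraper.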
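Let's start with Yelp's front page and see what happens when we submit a search.
We can see that after entering the search details, we are redirected to a URL that carries our search keywords as query parameters.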
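As a rough illustration of how such a search URL is put together, the sketch below encodes a keyword and a location into a query string using only the standard library. The parameter names `find_desc` and `find_loc` are the ones used by the scraper code in this tutorial; the `/search` path is an assumption based on the category links that appear in the API response, and `build_search_url` is a hypothetical helper, not part of any library.

```python
from urllib.parse import urlencode


def build_search_url(keyword: str, location: str) -> str:
    """Hypothetical helper: build a relative Yelp search URL from a keyword and location."""
    # urlencode() percent-encodes special characters and turns spaces into "+"
    query = urlencode({"find_desc": keyword, "find_loc": location})
    return f"/search?{query}"


print(build_search_url("plumbers", "Toronto, Ontario, Canada"))
# /search?find_desc=plumbers&find_loc=Toronto%2C+Ontario%2C+Canada
```

In the scraper itself we don't need to do this by hand, because httpx encodes the `params` dictionary for us.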
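This is our search seed request, but we can go one step further and find the underlying data request by inspecting pagination. Let's click the next page link and watch the XHR tab of our browser's network inspector.
We've found the data endpoint of Yelp's backend API. The /search/snippet endpoint takes a few parameters and returns search results with business IDs and preview details, for example: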
    {
      // Business ID which we'll need later
      "bizId": "oIff0iLkEiPsWcDATe6mfA",
      // Business preview data
      "searchResultBusiness": {
        "ranking": null,
        "isAd": true,
        "renderAdInfo": true,
        "name": "Smooth Air",
        "alternateNames": [],
        "businessUrl": "/adredir?ad_business_id=oIff0iLkEiPsWcDATe6mfA&campaign_id=VcMvmxKjXiH2peL8g1c_jw&click_origin=search_results&placement=carousel_0&placement_slot=0&",
        "categories": [
          {
            "title": "Plumbing",
            "url": "/search?cflt=plumbing&find_loc=Toronto%2C+Ontario%2C+Canada"
          },
          {
            "title": "Heating & Air Conditioning/HVAC",
            "url": "/search?cflt=hvac&find_loc=Toronto%2C+Ontario%2C+Canada"
          },
          {
            "title": "Water Heater Installation/Repair",
            "url": "/search?cflt=waterheaterinstallrepair&find_loc=Toronto%2C+Ontario%2C+Canada"
          }
        ],
        "priceRange": "",
        "rating": 0.0,
        "reviewCount": 0,
        "formattedAddress": "",
        "neighborhoods": [],
        "phone": "",
        "serviceArea": null,
        "parentBusiness": null,
        "servicePricing": null,
        "bizSiteUrl": ""
      }
    }
So, we can use this API endpoint to find all business IDs for a given location and search term. With this information, we can start working on our web scraper.
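To make the shape of these results more concrete, here is a minimal sketch that reduces search result records to the business ID and a couple of preview fields. The sample data is trimmed down from the response shown above (the second record is invented for illustration), and `extract_previews` is a hypothetical helper name.

```python
from typing import Dict, List

# Trimmed-down sample shaped like the search result shown above (values are illustrative)
sample_results: List[Dict] = [
    {
        "bizId": "oIff0iLkEiPsWcDATe6mfA",
        "searchResultBusiness": {"name": "Smooth Air", "isAd": True, "rating": 0.0},
    },
    {
        "bizId": "hypothetical-id-2",
        "searchResultBusiness": {"name": "Example Plumbing", "isAd": False, "rating": 4.5},
    },
]


def extract_previews(results: List[Dict]) -> List[Dict]:
    """Hypothetical helper: reduce each search result to its ID and a few preview fields."""
    previews = []
    for result in results:
        business = result.get("searchResultBusiness") or {}
        previews.append({
            "id": result["bizId"],
            "name": business.get("name"),
            "is_ad": business.get("isAd", False),
        })
    return previews


for preview in extract_previews(sample_results):
    print(preview)
```

The `id` value is what we will need later when requesting company details and reviews.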
    import asyncio
    import httpx


    async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
        """scrape single page of yelp search"""
        resp = await session.get(
            # the /search/snippet endpoint found above; the www.yelp.com host is assumed
            "https://www.yelp.com/search/snippet",
            params={
                "find_desc": keyword,
                "find_loc": location,
                "start": offset,
                "parent_request": "",
                "ns": 1,
                "request_origin": "user",
            },
        )
        assert resp.status_code == 200, f"unexpected status code {resp.status_code}"
        return resp.json()
Note: we're using asynchronous Python so that later we can schedule multiple requests concurrently, which gives us a significant speed boost.
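To see why concurrency helps, consider the small sketch below. It does not touch the network: `fake_request` is a stand-in that simply sleeps for 0.1 seconds, which is an arbitrary figure chosen for illustration.

```python
import asyncio
import time


async def fake_request(page: int) -> int:
    """Stand-in for an HTTP request: waits 0.1 seconds, then returns the page number."""
    await asyncio.sleep(0.1)
    return page


async def scrape_concurrently(pages: int) -> list:
    # asyncio.gather schedules all coroutines at once and returns results in input order
    return await asyncio.gather(*[fake_request(page) for page in range(pages)])


start = time.perf_counter()
results = asyncio.run(scrape_concurrently(10))
elapsed = time.perf_counter() - start
# ten 0.1s "requests" finish in roughly 0.1s rather than 1s
print(results, f"{elapsed:.2f}s")
```

Awaiting the same ten calls one after another would take about a second, since each call would have to finish before the next one starts.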
    BASE_HEADERS = {
        "authority": "",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
        "accept-encoding": "gzip, deflate, br",
    }


    async def run():
        async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
            results = await _search_yelp_page('plumbers', 'Toronto, Ontario, Canada', session=session)
            print(results)


    if __name__ == "__main__":
        asyncio.run(run())

Example output (abbreviated):

    {
      "pageTitle": ["Yelp"],
      "loggingConfig": {
        "sitRepConfig": {
          "isSitRepEnabled": true,
          "enabledSitRepChannels": {
            "vertical_search_reservation": true,
            "vertical_search_platform": true,
            "frontend_performance": true,
            "search_suggest_events": true,
            "vertical_search_waitlist": true,
            "ad_syndication_cookie_sync_errors": true,
            "traffic_quality": true,
            "search_ux": true,
            "message_the_business": true,
            "ytp_session_events": true,
            "ad_syndication": true
          },
          ...
        }
      },
      "searchPageProps": { ... },
      ...
    }
    from typing import Dict, List, Tuple


    def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
        """
        Parses yelp search results for business results
        Returns list of businesses and search metadata
        """
        results = search_results['searchPageProps']['mainContentComponentsListProps']
        # keep only real business results, skipping ads
        businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
        # the pagination component holds the total result count
        search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
        return businesses, search_meta
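Backend APIs often include a lot of metadata, such as ads, tracking information and so on. However, we only need the business information and the total number of results for the search query, so that we can retrieve all of the results.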
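To illustrate what this filtering does, the sketch below runs the same logic as the tutorial's parse_search function against a miniature, made-up response. Only the key names are taken from the code above; the values and IDs are invented.

```python
from typing import Dict, List, Tuple


def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
    """Same filtering logic as the tutorial's parse_search function."""
    results = search_results["searchPageProps"]["mainContentComponentsListProps"]
    businesses = [r for r in results if r.get("searchResultBusiness") and not r.get("adLoggingInfo")]
    search_meta = next(r for r in results if r.get("type") == "pagination")["props"]
    return businesses, search_meta


# Made-up miniature response: one ad, one organic result and the pagination component
mock_response = {
    "searchPageProps": {
        "mainContentComponentsListProps": [
            {
                "bizId": "ad-1",
                "searchResultBusiness": {"name": "Sponsored Co"},
                "adLoggingInfo": {"placement": "carousel_0"},
            },
            {"bizId": "biz-1", "searchResultBusiness": {"name": "Organic Co"}},
            {"type": "pagination", "props": {"totalResults": 42}},
        ]
    }
}

businesses, search_meta = parse_search(mock_response)
print([b["bizId"] for b in businesses])  # ['biz-1']
print(search_meta["totalResults"])  # 42
```

The ad entry is dropped because it carries `adLoggingInfo`, and the pagination component is skipped because it has no `searchResultBusiness` key.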
    async def yelp_search_all(keyword: str, location: str, session: httpx.AsyncClient):
        """scrape all pages of yelp search for business preview data"""
        # get the first page data
        first_page = await _search_yelp_page(keyword, location, session=session)
        # parse first page for the first batch of businesses and the total result count
        businesses, search_meta = parse_search(first_page)
        # scrape remaining pages concurrently, 10 results per page
        tasks = []
        for page in range(10, search_meta['totalResults'], 10):
            tasks.append(
                _search_yelp_page(keyword, location, session=session, offset=page)
            )
        for result in await asyncio.gather(*tasks):
            businesses.extend(parse_search(result)[0])
        return businesses
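The pagination loop above steps through result offsets rather than page numbers. As a small worked example, the sketch below computes the offsets that remain after the first page has been scraped. The `page_offsets` helper is hypothetical, and the page size of 10 is taken from the step used in the loop.

```python
from typing import List


def page_offsets(total_results: int, page_size: int = 10) -> List[int]:
    """Offsets of every page after the first one (the first page is offset 0)."""
    return list(range(page_size, total_results, page_size))


print(page_offsets(35))  # [10, 20, 30]
print(page_offsets(10))  # []
print(page_offsets(11))  # [10]
```

With 35 results we need three more requests after the first page, while a search that returns 10 results or fewer needs none.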
Putting it all together, our complete Yelp search scraper looks like this:

    import asyncio
    from typing import Dict, List, Tuple

    import httpx


    def parse_search(search_results: Dict) -> Tuple[List[Dict], Dict]:
        """
        Parses yelp search results for business results
        Returns list of businesses and search metadata
        """
        results = search_results['searchPageProps']['mainContentComponentsListProps']
        businesses = [r for r in results if r.get('searchResultBusiness') and not r.get('adLoggingInfo')]
        search_meta = next(r for r in results if r.get('type') == 'pagination')['props']
        return businesses, search_meta


    async def _search_yelp_page(keyword: str, location: str, session: httpx.AsyncClient, offset=0):
        """scrape single page of yelp search"""
        resp = await session.get(
            # the /search/snippet endpoint found above; the www.yelp.com host is assumed
            "https://www.yelp.com/search/snippet",
            params={
                "find_desc": keyword,
                "find_loc": location,
                "start": offset,
                "parent_request": "",
                "ns": 1,
                "request_origin": "user",
            },
        )
        assert resp.status_code == 200, f"unexpected status code {resp.status_code}"
        return resp.json()


    async def yelp_search_all(keyword: str, location: str, session: httpx.AsyncClient):
        """scrape all pages of yelp search for business preview data"""
        first_page = await _search_yelp_page(keyword, location, session=session)
        businesses, search_meta = parse_search(first_page)
        tasks = []
        for page in range(10, search_meta['totalResults'], 10):
            tasks.append(
                _search_yelp_page(keyword, location, session=session, offset=page)
            )
        for result in await asyncio.gather(*tasks):
            businesses.extend(parse_search(result)[0])
        return businesses
To run the scraper:

    BASE_HEADERS = {
        "authority": "",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
        "accept-encoding": "gzip, deflate, br",
    }


    async def run():
        async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
            results = await yelp_search_all('plumbers', 'Toronto, Ontario, Canada', session=session)
            print(results)


    if __name__ == "__main__":
        asyncio.run(run())
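Scraping Yelp Company Data
Now that we have our company discovery scraper, we can go further and retrieve the details of each company we found. For this, we need to scrape each company's URL.
We can see that the HTML contains all of the business data we might need, such as the phone number, address and so on. However, if we fire up the web inspector, we can see that the structure itself isn't very tidy.
Such complex class names suggest that they are dynamically generated, which means we can't rely on class names in our HTML parsing selectors, or at least have to use them very cautiously. Instead, we'll build selectors around text matching. In other words, we'll find keyword text such as "Get Directions" and then navigate the tree to the address value.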
We can achieve this by taking advantage of XPath's contains() function and the following-sibling axis:
//a[contains(text(),"Get Directions")]/../following-sibling::p/text()
We'll use this technique to get most of the values, so let's get started. For our XPath selectors, we'll use the parsel HTML parsing library:
$ pip install parsel
Using parsel and XPath, we can extract all of the visible details on the page:
    import asyncio
    import json
    from typing import Dict, List

    import httpx
    from parsel import Selector


    def parse_company(resp: httpx.Response) -> Dict:
        sel = Selector(text=resp.text)
        # helper: first match of an XPath query as stripped text ("" when nothing matches)
        xpath = lambda xp: sel.xpath(xp).get(default="").strip()
        open_hours = {}
        for day in sel.xpath('//th/p[contains(@class,"day-of-the-week")]'):
            name = day.xpath('text()').get(default="").strip()
            value = day.xpath('../following-sibling::td//p/text()').get(default="").strip()
            open_hours[name.lower()] = value

        return dict(
            name=xpath('//h1/text()'),
            website=xpath('//p[contains(text(),"Business website")]/following-sibling::p/a/text()'),
            phone=xpath('//p[contains(text(),"Phone number")]/following-sibling::p/text()'),
            address=xpath('//a[contains(text(),"Get Directions")]/../following-sibling::p/text()'),
            logo=xpath('//img[contains(@class,"businessLogo")]/@src'),
            claim_status=''.join(sel.xpath('//span[contains(@class,"claim-text")]/text()').getall()).strip().lower(),
            open_hours=open_hours,
        )


    async def _scrape_companies_by_url(company_urls: List[str], session: httpx.AsyncClient) -> List[Dict]:
        """Scrape yelp company details from given yelp company urls"""
        responses = await asyncio.gather(*[
            session.get(url) for url in company_urls
        ])
        results = []
        for resp in responses:
            results.append(parse_company(resp))
        return results
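Above, we defined the parse_company function, where we use the XPath techniques covered earlier to extract the fields. If we run this scraper, we'll see results similar to: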
    BASE_HEADERS = {
        "authority": "",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
        "accept-encoding": "gzip, deflate, br",
    }


    async def run():
        async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
            # the list holds the Yelp company page URLs to scrape
            results = await _scrape_companies_by_url([""], session=session)
            print(json.dumps(results, indent=2))


    if __name__ == "__main__":
        asyncio.run(run())

Example output:

    [
      {
        "name": "Smooth Air",
        "website": "",
        "phone": "(647) 828-6789",
        "address": "305 Fleetwood Crescent Brampton, ON L6T 2E7 Canada",
        "logo": "",
        "claim_status": "claimed",
        "open_hours": {
          "mon": "Open 24 hours",
          "tue": "Open 24 hours",
          "wed": "Open 24 hours",
          "thu": "Open 24 hours",
          "fri": "Open 24 hours",
          "sat": "Open 24 hours",
          "sun": "Open 24 hours"
        }
      },
      ...
    ]
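Scraping Yelp Reviews
To scrape Yelp company reviews, we have to look at another hidden API request. The easiest way to find this API endpoint is to click on the second page of reviews and watch the web inspector for outgoing requests.
We can see a request being made to the review_feed endpoint. We can scrape this endpoint too, though it needs a few parameters, such as the business ID we scraped earlier in the Yelp search step.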
How to find a Yelp business ID?
The Yelp business ID can also be found in the HTML source of the business page itself:
    import httpx
    from parsel import Selector


    def scrape_business_id(url):
        response = httpx.get(url)
        selector = Selector(text=response.text)
        # the business ID is stored in a <meta name="yelp-biz-id"> tag
        return selector.css('meta[name="yelp-biz-id"]::attr(content)').get()


    print(scrape_business_id(""))
    # "Yz7qwi0GipbeLBFAjSr_PQ"
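For this business, the second page of reviews is located at a URL ending in Yz7qwi0GipbeLBFAjSr_PQ/review_feed?rl=en&q=&sort_by=relevance_desc&start=10, which we can even open in the browser to see the JSON results.
Let's take a look at how to scrape Yelp reviews with Python: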
    import asyncio
    import json
    from typing import List, TypedDict

    import httpx
    from parsel import Selector


    class Review(TypedDict):
        id: str
        userId: str
        business: dict
        user: dict
        comment: dict
        rating: int
        # ... and more fields, see the example output


    async def scrape_reviews(business_url: str, session: httpx.AsyncClient) -> List[Review]:
        # first find business ID from business URL
        response_business = await session.get(business_url)
        selector = Selector(text=response_business.text)
        business_id = selector.css('meta[name="yelp-biz-id"]::attr(content)').get()

        # then scrape first page
        # note: the part of the URL in front of the business ID is left out of this listing
        first_page = await session.get(
            f"{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start=0"
        )
        first_page_data = json.loads(first_page.text)
        reviews = first_page_data["reviews"]
        total_reviews = first_page_data["pagination"]["totalResults"]
        print(f"scraping {total_reviews} reviews of business {business_id}")

        # the first page covers offset 0, so the remaining pages start at offset 10
        to_scrape = [
            session.get(
                f"{business_id}/review_feed?rl=en&q=&sort_by=relevance_desc&start={offset}"
            )
            for offset in range(10, total_reviews, 10)
        ]
        for page in asyncio.as_completed(to_scrape):
            response = await page
            data = json.loads(response.text)
            reviews.extend(data["reviews"])
        return reviews
    BASE_HEADERS = {
        "authority": "",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9",
        "accept-encoding": "gzip, deflate, br",
    }


    async def run():
        async with httpx.AsyncClient(headers=BASE_HEADERS) as session:
            results = await scrape_reviews("", session=session)
            print(json.dumps(results, indent=2))


    if __name__ == "__main__":
        asyncio.run(run())

Example output:

    [
      {
        "id": "eB6j_V2LILubb2i0O6pODw",
        "userId": "ANYfELwm1rX-Z__Ryi_pQQ",
        "business": {
          "id": "Yz7qwi0GipbeLBFAjSr_PQ",
          "alias": "capri-laguna-laguna-beach",
          "name": "Capri Laguna",
          "photoSrc": ""
        },
        "user": {
          "link": "HIDDEN",
          "src": "HIDDEN",
          "srcSet": null,
          "markupDisplayName": "HIDDEN",
          "displayLocation": "HIDDEN",
          "altText": "HIDDEN",
          "userUrl": "HIDDEN",
          "partnerAlias": null,
          "friendCount": 0,
          "photoCount": 3,
          "reviewCount": 1,
          "eliteYear": null
        },
        "comment": {
          "text": "Very nice getaway for the family! I have been in Capri Laguna three times this summer already and the place never fails to amaze me. <br>The hotel has the best view over the ocean. You can watch the sunset from any deck or terrace in this hotel. As well some rooms have private balconies. <br>The service was great. The rooms are very clean and comfortable. The area is so calm and relaxing, you can sleep peacefully and comfortably. <br>The staff is so welcoming and respectful. Bachir was great, he is kind, friendly and very professional. Amazing customer service. Thank you Bachir!",
          "language": "en"
        },
        "localizedDate": "9/23/2022",
        "localizedDateVisited": null,
        "rating": 5,
        "photos": [
          {
            "src": "",
            "caption": null,
            ...
          },
          ...
        ],
        "lightboxMediaItems": [
          {
            ...
          }
        ],
        "photosUrl": "/biz_photos/capri-laguna-laguna-beach?userid=ANYfELwm1rX-Z__Ryi_pQQ",
        "totalPhotos": 3,
        "feedback": {
          "counts": {
            "useful": 0,
            "funny": 0,
            "cool": 0
          },
          "userFeedback": {
            "useful": false,
            "funny": false,
            "cool": false
          },
          "voterText": null
        },
        "isUpdated": false,
        "businessOwnerReplies": null,
        "appreciatedBy": null,
        "previousReviews": null,
        "tags": [
          {
            "label": "3 photos",
            "title": null,
            "href": "HIDDEN",
            "iconName": "18x18_camera",
            "iconColor": ""
          }
        ]
      },
      ...
    ]
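In the scraper above, to download Yelp review data we first scrape the business ID from the business profile page. Then we use this ID to scrape the first page of reviews to find the total review count, and scrape the remaining review pages concurrently.
The snippet above retrieved over 500 Yelp reviews in just a few seconds! That's largely because the hidden API returns compact JSON, which is much faster to fetch than full HTML pages, and because the pages are requested concurrently.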
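The raw review records carry many more fields than most projects need. As a rough sketch, the hypothetical `simplify_review` helper below keeps a handful of fields and converts the `<br>` tags in the comment text into newlines. The sample record is shortened from the example output above, and unescaping HTML entities is an extra precaution rather than something the output above shows to be necessary.

```python
import html
import re
from typing import Dict, List


def simplify_review(review: Dict) -> Dict:
    """Hypothetical helper: keep a few fields and turn <br> tags into newlines."""
    text = review["comment"]["text"]
    text = re.sub(r"<br\s*/?>", "\n", text)  # line breaks arrive as HTML tags
    return {
        "id": review["id"],
        "rating": review["rating"],
        "date": review["localizedDate"],
        "text": html.unescape(text),  # assumption: entities such as &amp; may appear
    }


# Shortened sample based on the example review output
sample_reviews: List[Dict] = [
    {
        "id": "eB6j_V2LILubb2i0O6pODw",
        "rating": 5,
        "localizedDate": "9/23/2022",
        "comment": {"text": "Very nice getaway! <br>The service was great.", "language": "en"},
    }
]

for review in sample_reviews:
    print(simplify_review(review))
```

Dropping the `user` block at this stage also keeps personal details out of the stored dataset.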
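Bypassing Yelp Blocking
Yelp is a major web scraping target, which means it employs many techniques to block web scrapers at scale. To retrieve pages we did use custom headers that replicate a common web browser, but if we were to scale this scraper to thousands of companies, Yelp would eventually catch up with us and block us.
Once Yelp realizes the client is a web scraper, it starts redirecting all requests to a "This page is not available" web page. How can we avoid this?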
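One modest starting point is simply to be gentler: limit how many requests run at once, and back off and retry when the block page comes back. The sketch below illustrates the idea under stated assumptions. `fetch_with_retries` is a hypothetical helper, `fake_fetch` stands in for a real `session.get()` call, and the block is detected by looking for the "This page is not available" text, whose exact wording on the live site is assumed. It will not defeat more advanced fingerprinting on its own.

```python
import asyncio
import random
from typing import Awaitable, Callable, List


async def fetch_with_retries(
    fetch: Callable[[str], Awaitable[str]],
    url: str,
    limiter: asyncio.Semaphore,
    retries: int = 3,
    base_delay: float = 0.01,
) -> str:
    """Call fetch(url) with limited concurrency, backing off when a block page is returned."""
    for attempt in range(retries + 1):
        async with limiter:  # at most N requests run at the same time
            body = await fetch(url)
        if "This page is not available" not in body:
            return body
        # exponential backoff with a little jitter before trying again
        await asyncio.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise RuntimeError(f"still blocked after {retries} retries: {url}")


async def demo() -> List[str]:
    calls = {}

    async def fake_fetch(url: str) -> str:
        # stand-in for session.get(): every URL is "blocked" on its first attempt
        calls[url] = calls.get(url, 0) + 1
        return "This page is not available" if calls[url] == 1 else f"content of {url}"

    limiter = asyncio.Semaphore(2)  # allow two requests in flight at once
    urls = [f"page-{i}" for i in range(5)]
    return await asyncio.gather(*[fetch_with_retries(fake_fetch, u, limiter) for u in urls])


print(asyncio.run(demo()))
```

In a real scraper the delays would be seconds rather than hundredths of a second, and the semaphore size would be tuned to stay at a slow, respectful request rate.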
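To wrap up this guide, let's take a look at some frequently asked questions about web scraping Yelp:
Is web scraping Yelp legal?
Yes. Yelp hosts only public data, and we're not extracting anything personal or private. Scraping Yelp at a slow, respectful rate falls under the definition of ethical scraping. That said, when scraping Yelp reviews we should make sure that we don't collect personal data of people in GDPR-protected countries, or otherwise seek further legal advice.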
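How to scrape Yelp reviews?
To retrieve the reviews of a business page, we need to replicate another backend API request. If we click on the second page in the review container, we can see a request being made to the review_feed endpoint of that business. The request URL contains a BUSINESS_ID, which is the ID we extracted earlier in the search step; the ID can also be found in the HTML source of the business page itself.
For example, the second page of reviews would be located under a URL ending in BUSINESS_ID/review_feed?rl=en&q=&sort_by=relevance_desc&start=10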
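Yelp Scraping Summary
In this tutorial, we built a small yelp.com scraper that discovers companies from a given keyword and location, and retrieves their contact details such as phone number, website and other information fields. For this, we used Python with the httpx and parsel packages.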