In this tutorial, we'll take a look at how to scrape YellowPages.com – an online directory of various US businesses.
YellowPages is the digital version of the telephone directories known as yellow pages. It contains business information such as phone numbers, websites and addresses, as well as business reviews.
In this tutorial, we'll be using Python to scrape all of that business and review information. Let's dive in!
Why Scrape YellowPages.com?
YellowPages.com contains millions of businesses and their details such as phone numbers, websites and locations. All of this data can be used in various market and business analyses to gain a competitive edge or generate leads. On top of that, YellowPages also contains review data, business images and service menus, which can be used for further market analysis.
Project Setup
First, we should note that yellowpages.com is only accessible from US IP addresses. So, if you are located outside of the US, you'll need a US-based proxy or VPN to access yellowpages.com.
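For example, a US proxy could be plugged into httpx roughly like this – a minimal hedged sketch where the proxy URL and credentials are placeholders for your own provider's details:

import httpx

# hypothetical proxy endpoint - substitute your own US proxy provider's URL and credentials
US_PROXY = "http://username:password@us-proxy.example.com:8000"

# recent httpx versions accept a `proxy` argument (older releases used `proxies` instead)
client = httpx.AsyncClient(proxy=US_PROXY)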
As for the code, in this tutorial we'll be using Python and two main community packages:
- httpx – an HTTP client library that will let us communicate with YellowPages.com's servers
- parsel – an HTML parsing library that will help us parse the scraped HTML files. In this tutorial we'll stick with CSS selectors, as YellowPages' HTML is quite simple.
Optionally, we'll also use loguru – a pretty logging library that will help us keep track of what's going on.
These packages can be easily installed via the pip command:
$ pip install httpx parsel loguru
Alternatively, feel free to swap httpx out for any other HTTP client package, such as requests, as we'll only need basic HTTP functions which are almost interchangeable in every library. As for parsel, another great alternative is the beautifulsoup package, or anything else that supports CSS selectors, which is what we'll be using in this tutorial. A hedged example of this swap is shown below.
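For illustration, here's a hedged sketch of the same fetch-and-parse step using those alternatives (assuming requests and beautifulsoup4 are installed; the URL and the a.business-name selector are the ones we'll use later in this tutorial):

import requests
from bs4 import BeautifulSoup

# fetch one search page with requests instead of httpx
response = requests.get(
    "https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA",
    headers={"User-Agent": "Mozilla/5.0"},
)
soup = BeautifulSoup(response.text, "html.parser")
# BeautifulSoup's .select() takes CSS selectors just like parsel's .css()
for link in soup.select("a.business-name"):
    print(link.get_text(strip=True))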
Finding Yellowpages Companies
Our first goal is to figure out how to find the companies we want to scrape on YellowPages.com. There are a few ways to achieve this. First, if we want to scrape all businesses in a particular area, we can use the yellowpages.com/sitemap page, which contains links to all categories and locations.
However, we'll use a more flexible and easier approach by scraping the YellowPages search:
We can see that when we submit a search request, YellowPages takes us to a new URL containing pages of results. Let's see how we can approach scraping it.
Scraping Yellowpages Search
To scrape YellowPages search, we'll form a search URL from the given parameters and then iterate through multiple page URLs to collect all of the business listings.
If we take a look at the URL format:
We can see that it accepts a few key parameters: the query (e.g. "Japanese Restaurant"), the location and the page number. Let's take a look at how we can scrape it efficiently; the sketch below shows how such a URL can be composed.
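As a quick hedged illustration, these parameters can be composed into a search URL with Python's standard urlencode (this make_search_url helper mirrors the one we'll write later inside the search function):

from urllib.parse import urlencode

def make_search_url(query: str, location: str, page: int = 1) -> str:
    # the parameter names come straight from the yellowpages.com search URL
    base_url = "https://www.yellowpages.com/search?"
    parameters = {"search_terms": query, "geo_location_terms": location, "page": page}
    return base_url + urlencode(parameters)

print(make_search_url("Japanese Restaurants", "San Francisco, CA"))
# https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA&page=1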
First, let's take a look at scraping a single search page, e.g. Japanese restaurants in San Francisco, California:
yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA
For this, we'll use the parsel package with a few CSS selectors:
import asyncio
import math
from urllib.parse import urlencode, urljoin

import httpx
from parsel import Selector
from loguru import logger as log
from typing_extensions import TypedDict
from typing import Dict, List, Optional


class Preview(TypedDict):
    """Type hint container for business preview data.
    This object just helps us keep track of what results we'll be getting."""

    name: str
    url: str
    links: Dict[str, str]
    phone: str
    categories: List[str]
    address: str
    location: str
    rating: str
    rating_count: str


def parse_search(response) -> List[Preview]:
    """parse yellowpages.com search page for business preview data"""
    sel = Selector(text=response.text)
    parsed = []
    for result in sel.css(".organic div.result"):
        links = {}
        for link in result.css("div.links>a"):
            name = link.xpath("text()").get()
            url = link.xpath("@href").get()
            links[name] = url
        first = lambda css: result.css(css).get("").strip()
        many = lambda css: [value.strip() for value in result.css(css).getall()]
        parsed.append(
            {
                "name": first("a.business-name ::text"),
                "url": urljoin("https://www.yellowpages.com/", first("a.business-name::attr(href)")),
                "links": links,
                "phone": first("div.phone::text"),
                "categories": many(".categories>a::text"),
                "address": first(".adr .street-address::text"),
                "location": first(".adr .locality::text"),
                "rating": first("section.ratings .rating-stars::attr(class)").split(" ", 1)[-1],
                "rating_count": first("section.ratings .count::text").strip("()"),
            }
        )
    return parsed
In the code above, we first isolate each result box and iterate through the 30 result boxes found on the page:
In each iteration, we use the relevant CSS selectors to collect business preview information such as the phone number, rating, name and, most importantly, the link to the full company information page.
Let's run our scraper and see the results it generates:
import asyncio
import json

# to avoid being instantly blocked we should use request headers of a common web browser:
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

# to run our scraper we need to start an httpx session:
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        response = await session.get(
            "https://www.yellowpages.com/search?search_terms=Japanese+Restaurants&geo_location_terms=San+Francisco%2C+CA"
        )
        result_search = parse_search(response)
        print(json.dumps(result_search, indent=2))

if __name__ == "__main__":
    asyncio.run(run())

Which produces results like:

[
  {
    "name": "Ichiraku",
    "url": "https://www.yellowpages.com/san-francisco-ca/mip/ichiraku-6317061",
    "links": {
      "View Menu": "/san-francisco-ca/mip/ichiraku-6317061#open-menu"
    },
    "phone": "(415) 668-9918",
    "categories": [
      "Japanese Restaurants",
      "Asian Restaurants",
      "Take Out Restaurants"
    ],
    "address": "3750 Geary Blvd",
    "location": "San Francisco, CA 94118",
    "rating": "four half",
    "rating_count": "13"
  },
  {
    "name": "Benihana",
    "url": "https://www.yellowpages.com/san-francisco-ca/mip/benihana-458857411",
    "links": {
      "Website": "http://www.benihana.com/locations/sanfrancisco-ca-sf"
    },
    "phone": "(415) 563-4844",
    "categories": [
      "Japanese Restaurants",
      "Bar & Grills",
      "Restaurants"
    ],
    "address": "1737 Post St",
    "location": "San Francisco, CA 94115",
    "rating": "three half",
    "rating_count": "10"
  },
  ...
]
We can scrape a single search page, so now all we have to do is wrap this logic in a scraping loop:
async def search(query: str, session: httpx.AsyncClient, location: Optional[str] = None) -> List[Preview]:
    """search yellowpages.com for business preview information scraping all of the pages"""

    def make_search_url(page):
        base_url = "https://www.yellowpages.com/search?"
        parameters = {"search_terms": query, "geo_location_terms": location, "page": page}
        return base_url + urlencode(parameters)

    log.info(f'scraping "{query}" in "{location}"')
    # scrape the first page to learn the total number of results
    first_page = await session.get(make_search_url(1))
    sel = Selector(text=first_page.text)
    total_results = int(sel.css(".pagination>span::text").re(r"of (\d+)")[0])
    total_pages = int(math.ceil(total_results / 30))
    log.info(f"{query} in {location}: scraping {total_pages} business preview pages")
    previews = parse_search(first_page)
    # then scrape the remaining pages concurrently
    for result in await asyncio.gather(*[session.get(make_search_url(page)) for page in range(2, total_pages + 1)]):
        previews.extend(parse_search(result))
    log.info(f"{query} in {location}: scraped {len(previews)} business previews in total")
    return previews
Run the code:
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_search = await search("japanese restaurants", location="San Francisco, CA", session=session)
        print(json.dumps(result_search, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
The function above implements a complete scraping loop. We generate the search URL from the given query and location parameters. Then, we scrape the first results page to extract the total number of results and scrape the remaining pages concurrently. This is a common pagination idiom in web scraping.
Now that we know how to find businesses and their preview data, let's take a look at how to scrape the full business data by scraping each of the company pages we've found.
Scraping Yellowpages Company Data
To scrape company data, we need to request each company's YellowPages URL that we found earlier. Let's start with an example URL of a restaurant business: ozumo-japanese-restaurant-8083027
We can see that the page contains plenty of business data we can scrape. Let's grab these marked fields:
import httpx
from parsel import Selector
from typing_extensions import TypedDict
from typing import Dict, List, Optional
from loguru import logger as log


class Company(TypedDict):
    """type hint container for company data found on yellowpages.com"""

    name: str
    categories: List[str]
    rating: str
    rating_count: str
    phone: str
    website: str
    address: str
    work_hours: Dict[str, str]


def parse_company(response) -> Company:
    """extract company details from yellowpages.com company's page"""
    sel = Selector(text=response.text)
    # here we define some lambda shortcuts for parsing common data:
    # selecting the first element, many elements, or joining all elements together
    first = lambda css: sel.css(css).get("").strip()
    many = lambda css: [value.strip() for value in sel.css(css).getall()]
    together = lambda css, sep=" ": sep.join(sel.css(css).getall())

    # to parse working hours we need to do a bit of complex string parsing
    def _parse_datetime(values: List[str]):
        """
        parse datetime from yellow pages datetime strings

        >>> _parse_datetime(["Fr-Sa 12:00-22:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00'}
        >>> _parse_datetime(["Fr 12:00-22:00"])
        {'Fr': '12:00-22:00'}
        >>> _parse_datetime(["Fr-Sa 12:00-22:00", "We 10:00-18:00"])
        {'Fr': '12:00-22:00', 'Sa': '12:00-22:00', 'We': '10:00-18:00'}
        """
        WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]
        results = {}
        for text in values:
            days, hours = text.split(" ")
            if "-" in days:
                day_start, day_end = days.split("-")
                for day in WEEKDAYS[WEEKDAYS.index(day_start) : WEEKDAYS.index(day_end) + 1]:
                    results[day] = hours
            else:
                results[days] = hours
        return results

    return {
        "name": first("h1.business-name::text"),
        "categories": many(".categories>a::text"),
        "rating": first(".rating .result-rating::attr(class)").split(" ", 1)[-1],
        "rating_count": first(".rating .count::text").strip("()"),
        "phone": first("#main-aside .phone>strong::text").replace("(", "").replace(")", ""),
        "website": first("#main-aside .website-link::attr(href)"),
        "address": together("#main-aside .address ::text"),
        "work_hours": _parse_datetime(many(".open-details tr time::attr(datetime)")),
    }


async def scrape_company(url: str, session: httpx.AsyncClient) -> Company:
    """scrape yellowpage.com company page details"""
    first_page = await session.get(url)
    return parse_company(first_page)
As you can see, most of this code is HTML parsing. We retrieve the business URL, build a parsel.Selector from the HTML, and then use a few CSS selectors to extract the specific fields we marked earlier.
We can also see how easy it is to clean up data in Python:
- We cleaned up the scraped phone number.
- We expanded weekday ranges like Mo-We into individual values like Mo, Tu, We (see the standalone sketch below for an example).
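To make the weekday expansion concrete, here is a small standalone sketch of that range-expansion logic, pulled out of _parse_datetime above for illustration (expand_days is a hypothetical helper name used only here):

WEEKDAYS = ["Mo", "Tu", "We", "Th", "Fr", "Sa", "Su"]

def expand_days(days: str, hours: str) -> dict:
    # "Mo-We" becomes {"Mo": hours, "Tu": hours, "We": hours}; a single day stays as-is
    if "-" in days:
        start, end = days.split("-")
        return {day: hours for day in WEEKDAYS[WEEKDAYS.index(start) : WEEKDAYS.index(end) + 1]}
    return {days: hours}

print(expand_days("Mo-We", "16:00-22:00"))
# {'Mo': '16:00-22:00', 'Tu': '16:00-22:00', 'We': '16:00-22:00'}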
Let's run our YellowPages scraper and see the results it produces:
import httpx
import json
import asyncio

BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_company = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027",
            session=session,
        )
        print(json.dumps(result_company))

if __name__ == "__main__":
    asyncio.run(run())

Which produces a result like:

{
  "name": "Ozumo Japanese Restaurant",
  "categories": [
    "Japanese Restaurants",
    "Asian Restaurants",
    "Caterers",
    "Japanese Restaurants",
    "Asian Restaurants",
    "Caterers",
    "Family Style Restaurants",
    "Restaurants",
    "Sushi Bars"
  ],
  "rating": "three half",
  "rating_count": "72",
  "phone": "(415) 882-1333",
  "website": "http://www.ozumo.com",
  "address": "161 Steuart St San Francisco, CA 94105",
  "work_hours": {
    "Mo": "16:00-22:00",
    "Tu": "16:00-22:00",
    "We": "16:00-22:00",
    "Th": "16:00-22:00",
    "Fr": "12:00-22:00",
    "Sa": "12:00-22:00",
    "Su": "12:00-21:00"
  }
}
As we can see, we can scrape all of this information with just a few lines of code! There are a few more interesting data points on the page, such as menu details and photos, but let's stick to the basics in this tutorial and move on to reviews.
Scraping Yellowpages Reviews
To scrape business reviews, we have to make a few extra requests, as they are spread across multiple pages. For example, if we go back to our Japanese restaurant listing and scroll all the way to the bottom, we can see the review page URL format:
We can see that for the next page, all we need to do is add a ?page=2 parameter. Since we know the total number of results, we can scrape the reviews the same way we scraped the search results:
import asyncio
import math
from typing import List
from typing_extensions import TypedDict
from urllib.parse import urlencode

import httpx
from parsel import Selector


class Review(TypedDict):
    """type hint for yellowpages.com scraped review"""

    id: str
    author: str
    source: str
    date: str
    stars: int
    title: str
    text: str


def parse_reviews(response) -> List[Review]:
    """parse company page for visible reviews"""
    sel = Selector(text=response.text)
    reviews = []
    for box in sel.css("#reviews-container>article"):
        first = lambda css: box.css(css).get("").strip()
        many = lambda css: [value.strip() for value in box.css(css).getall()]
        reviews.append(
            {
                "id": box.attrib.get("id"),
                "author": first("div.author::text"),
                "source": first("span.attribution>a::text"),
                "date": first("p.date-posted>span::text"),
                "stars": len(many(".result-ratings ul>li.rating-star")),
                "title": first(".review-title::text"),
                "text": first(".review-response p::text"),
            }
        )
    return reviews


class CompanyData(TypedDict):
    info: Company
    reviews: List[Review]


# Now we can extend our company scraper to pick up reviews as well!
async def scrape_company(url, session: httpx.AsyncClient, get_reviews=True) -> CompanyData:
    """scrape yellowpage.com company page details"""
    first_page = await session.get(url)
    sel = Selector(text=first_page.text)
    if not get_reviews:
        return parse_company(first_page)
    reviews = parse_reviews(first_page)
    if reviews:
        # find the total review count and paginate through the remaining review pages
        total_reviews = int(sel.css(".pagination-stats::text").re(r"of (\d+)")[0])
        total_pages = int(math.ceil(total_reviews / 20))
        for response in await asyncio.gather(
            *[session.get(url + "?" + urlencode({"page": page})) for page in range(2, total_pages + 1)]
        ):
            reviews.extend(parse_reviews(response))
    return {
        "info": parse_company(first_page),
        "reviews": reviews,
    }
Above, we combined what we learned from scraping search – paginating through multiple pages – with what we learned from scraping company info – parsing HTML with CSS selectors. With these additions, our scraper can collect both company information and review data. Let's take it for a spin:
import asyncio
import json
import math
from typing import Dict, List, Optional
from urllib.parse import urlencode, urljoin

import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict

BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}

async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result_company = await scrape_company(
            "https://www.yellowpages.com/san-francisco-ca/mip/ozumo-japanese-restaurant-8083027",
            session=session,
        )
        print(json.dumps(result_company, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    asyncio.run(run())

Which produces a result like:

{
  "info": {
    "name": "Ozumo Japanese Restaurant",
    "categories": [
      "Japanese Restaurants",
      "Asian Restaurants",
      "Caterers",
      "Japanese Restaurants",
      "Asian Restaurants",
      "Caterers",
      "Family Style Restaurants",
      "Restaurants",
      "Sushi Bars"
    ],
    "rating": "three half",
    "rating_count": "72",
    "phone": "(415) 882-1333",
    "website": "http://www.ozumo.com",
    "address": "161 Steuart St San Francisco, CA 94105",
    "work_hours": {
      "Mo": "16:00-22:00",
      "Tu": "16:00-22:00",
      "We": "16:00-22:00",
      "Th": "16:00-22:00",
      "Fr": "12:00-22:00",
      "Sa": "12:00-22:00",
      "Su": "12:00-21:00"
    }
  },
  "reviews": [
    {
      "id": "<redacted for blog use>",
      "author": "<redacted for blog use>",
      "source": "Citysearch",
      "date": "03/18/2010",
      "stars": 5,
      "title": "Mindblowing Japanese!",
      "text": "Wow what a dinner! I went to Ozumo last night with a friend for a complimentary meal I had won by being a Citysearch Dictator. It was AMAZING! We ordered the Hanabi (halibut) and Dohyo (ahi tuna) small plates as well as the Gindara (black cod) and Burikama (roasted yellowtail). Everything was absolutely delicious. They paired our meal with a variety of unique wines and sakes. The manager, Hiro, and our waitress were extremely knowledgeable about the food and how it was prepared. We started to tease the manager that he had a story for everything. His most boring story, he said, was about edamame. It was a great experience!"
    },
    ...
  ]
}
FAQ
To wrap up this guide, let's take a look at some frequently asked questions about web scraping YellowPages.com:
Is it legal to scrape YellowPages.com?
Yes. The data on YellowPages is publicly available, and we're not extracting anything personal or private. Scraping YellowPages.com at a slow, respectful rate falls under the definition of ethical scraping.
That being said, attention should be paid to GDPR compliance in the EU when scraping personal data, such as reviewer details from the reviews section.
Is there a YellowPages API?
No, unfortunately YellowPages.com does not offer an API. However, as we covered in this tutorial, this website is easy to scrape with Python!
YellowPages Scraping Summary
In this tutorial, we built a YellowPages data scraper in Python. We started by familiarizing ourselves with how the website works using the browser's developer tools. Then we replicated the search system in our scraper to find business listings for a given query.
To scrape the business data itself, we built CSS selectors for fields such as business ratings and contact details (phone number, website, etc.), using Python together with the httpx and parsel packages. A short end-to-end sketch combining these pieces is shown below.
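As a closing hedged sketch (assuming the search and scrape_company functions and BASE_HEADERS defined earlier in this tutorial are in scope), here is how the pieces could be combined to scrape a handful of companies from a search and save them to a JSON file; the output filename and the limit of 10 companies are arbitrary choices for illustration:

import asyncio
import json
import httpx

async def run_all(query: str, location: str, max_companies: int = 10):
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        # find business previews through search, then scrape the first few company pages concurrently
        previews = await search(query, location=location, session=session)
        companies = await asyncio.gather(
            *[scrape_company(preview["url"], session=session) for preview in previews[:max_companies]]
        )
        with open("yellowpages_results.json", "w", encoding="utf-8") as f:
            json.dump(companies, f, indent=2, ensure_ascii=False)

if __name__ == "__main__":
    asyncio.run(run_all("japanese restaurants", "San Francisco, CA"))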