In this tutorial, we'll take a look at how to scrape information from Amazon - the world's largest e-commerce website! Amazon contains millions of products and operates in many different countries, which makes it a prime target for public market analytics data. To scrape Amazon product data, prices and reviews, we'll be using Python together with a few community packages and some common Amazon web scraping idioms. So, let's dive in!
Why Scrape Amazon.com?
Amazon contains a wealth of valuable e-commerce data: product details, prices and reviews. It is the leading e-commerce platform in many regions around the world, which makes its public data ideal for market analysis, business intelligence and many areas of data science. Companies also frequently use Amazon to track the performance of products sold by third-party resellers. Needless to say, there are almost countless ways to put this public data to use!
Project Setup
In this tutorial we'll be using Python with two main community packages:
- httpx – an HTTP client library that will let us communicate with amazon.com's servers
- parsel – an HTML parsing library that will help us parse the HTML files we scrape. In this tutorial we'll be using a mix of CSS selectors and XPath selectors to parse HTML – both are supported by parsel.
Optionally, we'll also use loguru – a pretty logging library that will help us keep track of what's going on. These packages can be easily installed via the pip install command:
```
$ pip install httpx parsel loguru
```
Alternatively, feel free to swap httpx out for any other HTTP client package, such as requests, as we only need basic HTTP functions that are almost interchangeable across libraries. As for parsel, another great alternative is the beautifulsoup package, or anything else that supports CSS selectors, which is what we'll mostly be using in this tutorial.
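For a quick comparison, here's a minimal sketch (not part of the scraper built below) showing the same CSS selector being run through both libraries:

```python
from parsel import Selector
from bs4 import BeautifulSoup

html = '<div class="product"><h2>Kindle</h2></div>'

# parsel - what we use in this tutorial:
print(Selector(text=html).css(".product h2::text").get())  # Kindle

# beautifulsoup4 also supports CSS selectors via select/select_one:
print(BeautifulSoup(html, "html.parser").select_one(".product h2").text)  # Kindle
```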
Finding Amazon Products
There are several ways to find products on Amazon, and the most flexible and powerful one is Amazon's search system:
```
https://www.amazon.com/s?k=<search query>
```
We can see that when we enter a search term, Amazon redirects us to a search page, which we can reuse in our scraper.
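One detail worth noting: multi-word queries have to be URL-encoded before being inserted into this URL. Here's a small sketch of a helper that could handle this (make_search_url is our own illustrative name, not part of the scraper below):

```python
from urllib.parse import urlencode

def make_search_url(query: str, page: int = 1) -> str:
    # urlencode takes care of escaping spaces and special characters
    return "https://www.amazon.com/s?" + urlencode({"k": query, "page": page})

print(make_search_url("kindle paperwhite case"))
# https://www.amazon.com/s?k=kindle+paperwhite+case&page=1
```

With that in mind, let's build the search scraper itself: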
```python
import asyncio

import httpx
from loguru import logger as log
from parsel import Selector


def parse_search(response):
    pass  # we'll fill this in later


async def search(query: str, session: httpx.AsyncClient):
    """Search for amazon products using searchbox"""
    log.info(f"{query}: scraping first page")
    # first, let's scrape the first query page to find out how many pages we have in total:
    first_page = await session.get(f"https://www.amazon.com/s?k={query}&page=1")
    sel = Selector(text=first_page.text)
    _page_numbers = sel.xpath(
        '//a[has-class("s-pagination-item")][not(has-class("s-pagination-separator"))]/text()'
    ).getall()
    total_pages = max(int(number) for number in _page_numbers)
    # now we can scrape the remaining pages concurrently
    log.info(f"{query}: found {total_pages} pages, scraping them concurrently")
    other_pages = await asyncio.gather(
        *[session.get(f"https://www.amazon.com/s?k={query}&page={page}") for page in range(2, total_pages + 1)]
    )
    # parse all of the search pages for product preview data:
    previews = []
    for response in [first_page, *other_pages]:
        previews.extend(parse_search(response))
    log.info(f"{query}: found total of {len(previews)} product previews")
    return previews
```
Here, in our search function, we collect the first result page of a given query. Then we find the total number of pages this query contains and scrape the remaining pages concurrently. This is a common pagination scraping idiom: whenever we can find the total number of pages, we can take advantage of concurrent web scraping. Next, let's parse the search page HTML we've collected for product preview data:
```python
from typing import List
from urllib.parse import urljoin

from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict


class ProductPreview(TypedDict):
    """result generated by search scraper"""
    url: str
    title: str
    price: str
    real_price: str
    rating: str
    rating_count: str


def parse_search(resp) -> List[ProductPreview]:
    """Parse search result page for product previews"""
    previews = []
    sel = Selector(text=resp.text)
    # find boxes of each product preview
    product_boxes = sel.css("div.s-result-item[data-component-type=s-search-result]")
    for box in product_boxes:
        url = urljoin(str(resp.url), box.css("h2>a::attr(href)").get()).split("?")[0]
        if "/slredirect/" in url:  # skip ads etc.
            continue
        previews.append(
            {
                "url": url,
                "title": box.css("h2>a>span::text").get(),
                # big price text is the discounted price
                "price": box.css(".a-price[data-a-size=xl] .a-offscreen::text").get(),
                # small price text is the "real" price
                "real_price": box.css(".a-price[data-a-size=b] .a-offscreen::text").get(),
                "rating": (box.css("span[aria-label~=stars]::attr(aria-label)").re(r"(\d+\.*\d*) out") or [None])[0],
                "rating_count": box.css("span[aria-label~=stars] + span::attr(aria-label)").get(),
            }
        )
    log.debug(f"found {len(previews)} product listings in {resp.url}")
    return previews
```
We're using parsel's CSS selector functionality to select the product preview containers and iterate through each one: every container holds a product's preview information, which we can extract with a few relevant CSS selectors. Let's run our current Amazon scraper and see the results it generates.
Run code and example output:
```python
import asyncio
import json

import httpx

# We need to use browser-like headers for our requests to avoid being blocked
# here we set headers of a Chrome browser on Windows:
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await search("kindle", session=session)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
```

```json
[
  {
    "url": "https://www.amazon.com/Kindle-Now-with-Built-in-Front-Light/dp/B07978J597/ref=sr_1_1",
    "title": "Kindle - With a Built-in Front Light - Black",
    "price": "$59.99",
    "real_price": "$89.99",
    "rating": "4.6",
    "rating_count": "36,856"
  },
  {
    "url": "https://www.amazon.com/All-new-Kindle-Paperwhite-adjustable-Ad-Supported/dp/B08KTZ8249/ref=sr_1_2",
    "title": "Kindle Paperwhite (8 GB) \u2013 Now with a 6.8\" display and adjustable warm light",
    "price": "$139.99",
    "real_price": null,
    "rating": "4.7",
    "rating_count": "10,775"
  },
  ...
]
```
Now that we can find products efficiently, let's take a look at how we can scrape the product data itself.
Scraping Amazon Products
To scrape product information, we'll retrieve each product's HTML page and parse it with our parsel package. For this, we'll be using parsel's CSS selector functionality.
Scraping Product Info
To retrieve product data, all we usually need is the ASIN (Amazon Standard Identification Number) code. This unique 10-character identifier is assigned to every product and product variation on Amazon, and we can usually extract it straight from a product's URL, e.g. from the /dp/<ASIN> part of https://www.amazon.com/dp/B07L5G6M1Q.
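As a quick illustration, here's how an ASIN could be pulled out of a product URL with a small regex (extract_asin is our own illustrative helper; the pattern assumes the common /dp/<ASIN> and /gp/product/<ASIN> URL layouts):

```python
import re

def extract_asin(url: str) -> str:
    # ASINs are 10 alphanumeric characters following /dp/ or /gp/product/
    match = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    return match.group(1) if match else None

print(extract_asin("https://www.amazon.com/Kindle-Oasis/dp/B07L5G6M1Q/ref=sr_1_1"))
# B07L5G6M1Q
```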
This also means that as long as we know a product's ASIN code, we can find the URL of any product. Let's give it a try:
```python
import json
import re
from typing import List

import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict


class ProductInfo(TypedDict):
    """type hint for our scraped product result"""
    name: str
    stars: str
    rating_count: str
    features: List[str]
    images: list


def parse_product(response) -> ProductInfo:
    """parse Amazon's product page (e.g. https://www.amazon.com/dp/B07KR2N2GF) for essential product data"""
    sel = Selector(text=response.text)
    # images are stored in javascript state data found in the html
    # for this we can use a simple regex pattern:
    image_data = json.loads(re.findall(r"colorImages':.*'initial':\s*(\[.+?\])},\n", response.text)[0])
    # the other fields can be extracted with simple css selectors:
    return {
        "name": sel.css("#productTitle::text").get("").strip(),
        "stars": sel.css("i[data-hook=average-star-rating] ::text").get("").strip(),
        "rating_count": sel.css("div[data-hook=total-review-count] ::text").get("").strip(),
        "features": sel.css("#feature-bullets li ::text").getall(),
        "images": image_data,
    }


async def scrape_product(asin: str, session: httpx.AsyncClient) -> ProductInfo:
    log.info(f"scraping {asin}")
    response = await session.get(f"https://www.amazon.com/dp/{asin}")
    return parse_product(response)
```
Above, we defined our Amazon product scraper, which retrieves the product page for a given ASIN code and parses essential information such as the name, rating and so on. Let's run it.
Run code and example output:
```python
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_product("B07L5G6M1Q", session=session)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
```

```json
{
  "name": "Kindle Oasis \u2013 Now with adjustable warm light - Without Lockscreen Ads",
  "stars": "4.6 out of 5 stars",
  "rating_count": "19,779 global ratings",
  "features": [
    " Our best 7\", 300 ppi flush-front Paperwhite display. ",
    " Adjustable warm light to shift screen shade from white to amber. ",
    " Waterproof (IPX8) so you can read in the bath or by the pool. Your Kindle has been tested to withstand accidental immersion in water. ",
    " Thin and light ergonomic design with page turn buttons. ",
    " Reads like real paper with the latest e-ink technology for fast page turns. ",
    " Instant access to millions of books, newspapers, and audiobooks. ",
    " Works with Audible - pair with Bluetooth headphones or speakers to switch seamlessly between reading and listening. "
  ],
  "images": [
    {
      "hiRes": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SL1000_.jpg",
      "thumb": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_US40_.jpg",
      "large": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_.jpg",
      ...
}
```
However, this code is missing one important detail – the price! For that, let's take a look at how Amazon.com prices its products and how we can scrape that information.
Scraping Product Variants and Pricing
Every product on Amazon can have multiple variants. For example, let's take a look at this product:
We can see that this product can be customized through several options. Each combination of these options is represented by its own ASIN identifier, so if we look at the page source and find all of this product's identifiers, we can see multiple ASIN codes:
We can see that the variant ASIN codes and descriptions live in a javascript variable hidden in the page's HTML source – more precisely, in the dimensionValuesDisplayData field of a dataToReturn variable. We can easily extract it with a small regular expression pattern:
```python
import re

import httpx

product_html = httpx.get("https://www.amazon.com/dp/B07F7TLZF4").text
# this pattern selects the value between curly braces that follows the dimensionValuesDisplayData key:
variant_data = re.findall(r'dimensionValuesDisplayData"\s*:\s*({.+?}),\n', product_html)
print(variant_data)
```
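Since the scraper below relies on the fact that the keys of this JSON mapping are the variant ASIN codes, here's a quick look at how the extracted value can be decoded:

```python
import json

# variant_data comes from the snippet above
variants = json.loads(variant_data[0])
# the keys are variant ASIN identifiers, the values describe each option combination:
for asin, description in variants.items():
    print(asin, description)
```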
Now we can implement this logic into our scraper by extracting the variant ASIN identifiers and scraping each variant's price details. Using this, we can extract the price of every product variant, so let's extend our product scraping function with variant scraping logic (the helpers it relies on are sketched out right below):
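The parse_variant and scrape_variant helpers (and the ProductData type hint) used by the extended scraper aren't shown in the original snippet, so here's a minimal sketch of what they could look like - note that the price selector is our assumption based on Amazon's page markup and may need adjusting:

```python
from typing import List

import httpx
from parsel import Selector
from typing_extensions import TypedDict


class VariantData(TypedDict):
    """price data of a single product variant"""
    asin: str
    price: str


class ProductData(TypedDict):
    """full product dataset: product info and all of its variant prices"""
    info: ProductInfo
    variants: List[VariantData]


def parse_variant(response) -> VariantData:
    """parse a product page for the variant's ASIN and price"""
    sel = Selector(text=response.text)
    return {
        # the ASIN is the path segment that follows /dp/ in the url:
        "asin": str(response.url).split("/dp/")[-1].split("/")[0].split("?")[0],
        # assumption: the displayed price lives in the hidden .a-offscreen element
        "price": sel.css(".a-price .a-offscreen::text").get(),
    }


async def scrape_variant(asin: str, session: httpx.AsyncClient) -> VariantData:
    """scrape a single variant's product page for its price details"""
    response = await session.get(f"https://www.amazon.com/dp/{asin}")
    return parse_variant(response)
```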
```python
async def scrape_product(asin: str, session: httpx.AsyncClient, reviews=True, pricing=True) -> ProductData:
    log.info(f"scraping {asin}")
    response_product = await session.get(f"https://www.amazon.com/dp/{asin}")
    # parse the current page as the first variant
    variants = [parse_variant(response_product)]
    # if the product has more variants - we want to scrape all of them
    _variation_data = re.findall(r'dimensionValuesDisplayData"\s*:\s*({.+?}),\n', response_product.text)
    if _variation_data:
        variant_asins = list(json.loads(_variation_data[0]))
        log.info(f"scraping {len(variant_asins)} variants: {variant_asins}")
        variants.extend(await asyncio.gather(*[scrape_variant(asin, session) for asin in variant_asins]))
    return {
        "info": parse_product(response_product),
        "variants": variants,
    }
```
One interesting detail to note here is that not every product has multiple variants, but every product has at least one variant. Let's take this scraper for a spin.
Run code and example output:
```python
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_product("B07L5G6M1Q", session=session, reviews=True)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
```

```json
{
  "info": {
    "name": "Kindle Oasis \u2013 Now with adjustable warm light - Without Lockscreen Ads",
    "stars": "4.6 out of 5 stars",
    "rating_count": "19,779 global ratings",
    "features": [
      " Our best 7\", 300 ppi flush-front Paperwhite display. ",
      " Adjustable warm light to shift screen shade from white to amber. ",
      " Waterproof (IPX8) so you can read in the bath or by the pool. Your Kindle has been tested to withstand accidental immersion in water. ",
      " Thin and light ergonomic design with page turn buttons. ",
      " Reads like real paper with the latest e-ink technology for fast page turns. ",
      " Instant access to millions of books, newspapers, and audiobooks. ",
      " Works with Audible - pair with Bluetooth headphones or speakers to switch seamlessly between reading and listening. "
    ],
    "images": [
      {
        "hiRes": "https://m.media-amazon.com/images/I/614TlIaYBvL._AC_SL1000_.jpg",
        "thumb": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_US40_.jpg",
        "large": "https://m.media-amazon.com/images/I/41XunSGTqvL._AC_.jpg",
        ...
      },
      ...
    ]
  },
  "variants": [
    {
      "asin": "B07L5G6M1Q",
      "price": "$299.99"
    },
    {
      "asin": "B07F7TLZF4",
      "price": "$249.99"
    },
    ...
  ]
}
```
We can see that our scraper now generates the product information together with a list of variant data points, where each data point contains a price and its own ASIN identifier. The only detail we're still missing is product reviews, so next, let's take a look at how to scrape Amazon product reviews.
Scraping Amazon Reviews
To scrape Amazon product reviews, let's first see where we can find them. If we scroll to the bottom of a product page, we can see a link that says "See all reviews", and if we click it, we're taken to a new location that follows this URL format:

```
https://www.amazon.com/product-reviews/<ASIN>/
```
We can see that, just like with product information, all we need is the ASIN identifier to find a product's review pages. Let's add this logic to our scraper:
```python
import asyncio
import math
from typing import List
from urllib.parse import urljoin

import httpx
from loguru import logger as log
from parsel import Selector
from typing_extensions import TypedDict


class ReviewData(TypedDict):
    """storage type hint for amazon's review objects"""
    title: str
    text: str
    location_and_date: str
    verified: bool
    rating: str


def parse_reviews(response) -> List[ReviewData]:
    """parse reviews from a single review page"""
    sel = Selector(text=response.text)
    review_boxes = sel.css("#cm_cr-review_list div.review")
    parsed = []
    for box in review_boxes:
        parsed.append({
            "text": "".join(box.css("span[data-hook=review-body] ::text").getall()).strip(),
            "title": box.css("*[data-hook=review-title]>span::text").get(),
            "location_and_date": box.css("span[data-hook=review-date] ::text").get(),
            "verified": bool(box.css("span[data-hook=avp-badge] ::text").get()),
            "rating": box.css("*[data-hook*=review-star-rating] ::text").re(r"(\d+\.*\d*) out")[0],
        })
    return parsed


async def scrape_reviews(asin, session: httpx.AsyncClient) -> List[ReviewData]:
    """scrape all reviews of a given ASIN of an amazon product"""
    url = f"https://www.amazon.com/product-reviews/{asin}/"
    log.info(f"scraping review page: {url}")
    # find the first page
    first_page = await session.get(url)
    sel = Selector(text=first_page.text)
    # find the total amount of pages (10 reviews per page)
    total_reviews = sel.css("div[data-hook=cr-filter-info-review-rating-count] ::text").re(r"(\d+,*\d*)")[1]
    total_reviews = int(total_reviews.replace(",", ""))
    total_pages = int(math.ceil(total_reviews / 10.0))
    log.info(f"found total {total_reviews} reviews across {total_pages} pages -> scraping")
    _next_page_href = sel.css(".a-pagination .a-last>a::attr(href)").get()
    if _next_page_href:
        _next_page = urljoin(url, _next_page_href)
        next_page_urls = [_next_page.replace("pageNumber=2", f"pageNumber={i}") for i in range(2, total_pages + 1)]
        assert len(set(next_page_urls)) == len(next_page_urls)
        other_pages = await asyncio.gather(*[session.get(url) for url in next_page_urls])
    else:
        other_pages = []
    reviews = []
    for response in [first_page, *other_pages]:
        reviews.extend(parse_reviews(response))
    log.info(f"scraped total {len(reviews)} reviews")
    return reviews
```
In the scraper above, we put together everything we've learned in this tutorial:
- To scrape the pagination, we used the same technique we used for scraping search: scrape the first page, find the total number of pages and scrape the rest of the pages concurrently.
- Parsing the reviews also uses the same technique we used when parsing search: iterate through each box containing a review and parse the data with CSS selectors.
Let's run this scraper and take a look at the output it generates.
Run code and example output:
```python
async def run():
    limits = httpx.Limits(max_connections=5)
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        data = await scrape_reviews("B07L5G6M1Q", session=session)
        print(json.dumps(data, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
```

```json
[
  {
    "text": "I have the previous generation oasis as well (on the left side in the pics) and wanted this one for reading at night. Overall there's not many differences between the two, so if the light tone customizability isn't important to you I wouldn't particularly recommend this one over the 9th gen. However, the lighting is noticeably more even with the 10th gen (my older one visibly fades from one side to the other) and there's a ton of variability in the tone of the screen. Overall, for me it was worth it, but your mileage may vary if you don't read in a dark room (so as not to wake the spouse) before bed very often.",
    "title": "Loving it so far",
    "location_and_date": "Reviewed in the United States on July 29, 2019",
    "verified": true,
    "rating": "5.0"
  },
  {
    "text": "So I've been using a Kindle Paperwhite since 2014 and absolutely loved it. Despite it being five years old, it still worked great and has been a pleasure as a reading device. ",
    "title": "From 2014 Paperwhite to 2019 Oasis",
    "location_and_date": "Reviewed in the United States on August 9, 2019",
    "verified": true,
    "rating": "3.0"
  },
  ...
]
```
With this, we've covered how to find products on Amazon and how to scrape their description, pricing and review data.
FAQ
To wrap up this guide, let's take a look at some frequently asked questions about web scraping Amazon:
Is it legal to scrape Amazon.com?
Yes. Amazon's data is publicly available, and we're not extracting anything personal or private. Scraping Amazon.com at slow, respectful rates falls under the ethical scraping definition. That being said, attention should be paid to GDPR compliance in the EU when scraping personal data, such as personal details from the review section.
How to crawl Amazon.com?
Thanks to the extensive related-product and recommendation systems found on every page, Amazon products are easy to crawl. In other words, we can write a crawler that takes a seed of Amazon product URLs, scrapes them, extracts more product URLs from the related products sections, and keeps doing this in a loop.
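As a rough sketch of that idea (crawl_products and the related-product link selector are our own illustrative assumptions, not tested code from this tutorial):

```python
from urllib.parse import urljoin

import httpx
from parsel import Selector


async def crawl_products(seed_urls, session: httpx.AsyncClient, max_products=100):
    """breadth-first crawl: scrape product pages and queue related product links"""
    seen, queue, results = set(), list(seed_urls), []
    while queue and len(seen) < max_products:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        response = await session.get(url)
        results.append(parse_product(response))  # parser defined earlier in this tutorial
        sel = Selector(text=response.text)
        # assumption: related/recommended product links all point at /dp/ pages
        for href in sel.css("a[href*='/dp/']::attr(href)").getall():
            queue.append(urljoin(url, href).split("?")[0])
    return results
```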
Summary
In this tutorial, we built an Amazon product scraper by studying how the website functions so that we could replicate its functionality in our web scraper. First, we replicated the search function to find products, then we scraped product information and variant data, and finally, we scraped all the reviews of each product. As we can see, web scraping amazon with Python is pretty easy thanks to great community tools like httpx and parsel.