
How to Scrape Walmart.com Product Data

Walmart.com is one of the biggest retailers in the world, with a major online presence in the United States. Because of that reach, Walmart's public product data is often needed for competitive intelligence analysis. So, how can we scrape this valuable product data?

In this web scraping tutorial, we'll take a look at how to scrape Walmart product data using Python.

We'll start by looking at how to find product URLs using sitemaps, category links or Walmart's search API.
Then we'll move on to scraping the Walmart products themselves: how to quickly and easily collect lots of product data using a common hidden-JavaScript-data parsing technique.

Finally, we'll look at how to avoid scraper blocking, something Walmart is notorious for.

Project Setup

For our Walmart scraper we'll be using Python with a few community libraries:

  • httpx – an HTTP client library that will let us communicate with Walmart.com's servers
  • parsel – an HTML parsing library that will help us parse the scraped HTML files for product data
  • loguru [optional] – for prettier logging, so we can follow along more easily

We can easily install all of them using pip:

$ pip install httpx parsel loguru

Alternatively, feel free to swap httpx out for any other HTTP client package such as requests, as we only need basic HTTP functionality which is mostly interchangeable across libraries.
As for parsel, another great alternative is the beautifulsoup package, or anything else that supports CSS selectors, which is what we'll be using in this tutorial.
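
For instance, the same CSS selector query looks almost identical in both libraries. Here's a tiny illustration (not part of our scraper, just to show how interchangeable they are):

from parsel import Selector
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '<div class="price">$3.48</div>'

# parsel - what we use in this tutorial
print(Selector(text=html).css(".price::text").get())  # "$3.48"
# beautifulsoup equivalent
print(BeautifulSoup(html, "html.parser").select_one(".price").text)  # "$3.48"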

Finding Walmart Products

To start scraping Walmart, we first have to find a way to discover Walmart products. There are two common ways to achieve this.

The easiest approach is to take advantage of Walmart's sitemaps. If we take a look at the crawling rules in walmart.com/robots.txt, we can see that there are multiple sitemaps:

Sitemap: https://www.walmart.com/sitemap_browse.xml
Sitemap: https://www.walmart.com/sitemap_category.xml
Sitemap: https://www.walmart.com/sitemap_store_main.xml

Sitemap: https://www.walmart.com/help/sitemap_gm.xml
Sitemap: https://www.walmart.com/sitemap_browse_fst.xml
Sitemap: https://www.walmart.com/sitemap_store_dept.xml

Sitemap: https://www.walmart.com/sitemap_bf_2020.xml
Sitemap: https://www.walmart.com/sitemap_tp_legacy.xml
...

Unfortunately, this doesn't give us much room to filter results. By the looks of it, we can only filter results by category using the walmart.com/sitemap_category.xml sitemap:

<url>
<loc>https://www.walmart.com/cp/-degree/928899</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-depend/1092729</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-hungergames/1095300</loc>
<lastmod>2022-04-01</lastmod>
</url>
<url>
<loc>https://www.walmart.com/cp/-jackson/1103987</loc>
<lastmod>2022-04-01</lastmod>
</url>

Each URL in this sitemap takes us to a single category's pagination page, which we can further customize using additional filters:

walmart.com filters for a search query

Alternatively, we can use the search system ourselves, which takes us to the same kind of filterable page:
https://www.walmart.com/search?q=spider&sort=price_low&page=2&affinityOverride=default

So whichever of these two approaches we take, we end up having to parse the same kind of page, which is great: we can write a single scraper function that handles both cases.

In this tutorial let's stick with parsing the search pages; to parse category pages, all we'd have to do is swap out the scraped URL. To begin, let's pick an example search page, such as a search for the word "spider":

https://www.walmart.com/search?q=spider&sort=price_low&page=1&affinityOverride=default

We can see that this URL contains a few parameters, such as:

  • q stands for the search query; in this case it's the word "spider"
  • page stands for the page number; in this case it's the first page
  • sort stands for the sort order; in this case price_low, meaning sort by ascending price

Now, since our scraper doesn't execute JavaScript, dynamically rendered content would be invisible to us. Instead, let's open up the page source and search for some product details; we can see that there's state data hidden in the page:

<script id="__NEXT_DATA__">{"...PRODUCT_PAGINATION_DATA..."}</script>

Highly dynamic websites (especially those powered by the React/Next.js framework) often embed their data in the HTML and then unpack it into rendered HTML with JavaScript at load time.

Hidden web data is great news for us, as it makes scraping Walmart products really easy!
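
The general recipe for extracting this hidden data is always the same: select the __NEXT_DATA__ script element and load its contents as JSON. Roughly like this (we'll reuse this exact pattern in both of our parsers below):

import json
from parsel import Selector


def extract_next_data(html_text: str) -> dict:
    """sketch: pull the hidden page state out of a Next.js-powered page"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    return json.loads(data)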

First, let's start with our search scraper:

import asyncio
import json
import math
import httpx
from parsel import Selector
from urllib.parse import urlencode


async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: httpx.AsyncClient):
    """scrape single walmart search page"""
    url = "https://www.walmart.com/search?" + urlencode(
        {
            "q": query,
            "sort": sort,
            "page": page,
            "affinityOverride": "default",
        },
    )
    resp = await session.get(url)
    assert resp.status_code == 200
    return resp


def parse_search(html_text: str):
    """extract search results from search HTML response"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)

    total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
    results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
    return results, total_results

In this scraper we start off with two functions:

  • The asynchronous _search_walmart_page() builds the query URL from the given parameters and scrapes the search page's HTML
  • parse_search() takes the search page HTML, finds the hidden __NEXT_DATA__ page data, and then parses out the search results as well as the total result count.

We now have a way to retrieve results from a single search page, but we can improve it to scrape the results of all 25 pages:

async def discover_walmart(search:str, session:httpx.AsyncClient):
    _resp_page1 = await _search_walmart_page(query=search, session=session)
    results, total_items = parse_search(_resp_page1.text)
    max_page = math.ceil(total_items / 40)
    if max_page > 25:
        max_page = 25
    for response in await asyncio.gather(
        *[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
    ):
        results.extend(parse_search(response.text)[0])
    return results

Here we've added a wrapper function that first scrapes the first page and finds the total page count, then scrapes the remaining pages concurrently (which is very fast).

Let's run our current scraper code and take a look at the results it generates:

Run code & example output

BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    # limit connection speed to prevent scraping too fast
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await discover_walmart("spider", session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Note: our runner code applies some custom headers to our web connection session to avoid scraper blocking. We create an async httpx client and call our discovery function to find all of the results for this query:

[
  {
    "__typename": "Product",
    "availabilityStatusDisplayValue": "In stock",
    "productLocationDisplayValue": null,
    "externalInfoUrl": "",
    "canonicalUrl": "/ip/Eliminator-Ant-Roach-Spider-Killer4-20-oz-Kills-Insects-Spiders/795033156",
    "canAddToCart": true,
    "showOptions": false,
    "showBuyNow": false,
    "description": "<li>KILLS ON CONTACT: Eliminator Ant, Roach & Spider Killer4 kills cockroaches, ants, carpenter ants, crickets, firebrats, fleas, silverfish and spiders</li><li>NON-STAINING: This water-based product</li>",
    "flag": "",
    "badge": {
      "text": "",
      "id": "",
      "type": "",
      "key": ""
    },
    "fulfillmentBadges": [
      "Pickup",
      "Delivery",
      "1-day shipping"
    ],
    "fulfillmentIcon": {
      "key": "SAVE_WITH_W_PLUS",
      "label": "Save with"
    },
    "fulfillmentBadge": "Tomorrow",
    "fulfillmentSpeed": [
      "TOMORROW"
    ],
    "fulfillmentType": "FC",
    "groupMetaData": {
      "groupType": null,
      "groupSubType": null,
      "numberOfComponents": 0,
      "groupComponents": null
    },
    "id": "5D3NBXRMIZK4",
    "itemType": null,
    "usItemId": "795033156",
    "image": "https://i5.walmartimages.com/asr/c9c0c51c-f30f-4eb2-aaf1-88f599167584.d824f7ff13f10b3dcfb9dadd2a04686d.jpeg?odnHeight=180&odnWidth=180&odnBg=ffffff",
    "isOutOfStock": false,
    "esrb": "",
    "mediaRating": "",
    "name": "Eliminator Ant, Roach & Spider Killer4, 20 oz, Kills Insects & Spiders",
    "price": 3.48,
    "preOrder": {
      "isPreOrder": false,
      "preOrderMessage": null,
      "preOrderStreetDateMessage": null
    },
    "..."
]

Dealing With Pagination Limits

There's one small issue with our search discovery approach: the page limit. Walmart returns at most 25 pages (1,000 products) per query, so what if our query matches more than that?

The best way to handle this is to split our query into multiple smaller queries, which we can do by applying filters:

Walmart filters suitable for query splitting

The first thing we can do is reverse the sorting: we can scrape results sorted from lowest to highest price and then the other way around, doubling our coverage to 50 pages, or 2,000 products!

On top of that, we can use single-choice filters (radio buttons) such as "Department" to split the query into smaller ones, or go even further with price ranges.

With some clever query splitting, this 2,000-product limit doesn't look so intimidating anymore!
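
As a rough illustration of the idea, we could generate several smaller search queries by flipping the sort order and slicing results into price brackets. Note that the price_high sort value and the min_price/max_price parameters here are assumptions for illustration; double-check the exact parameter names Walmart's search page uses before relying on them:

from urllib.parse import urlencode


def build_query_variants(query: str, price_brackets=((0, 10), (10, 25), (25, 50), (50, None))):
    """sketch: split one big query into smaller ones that each stay under the 25 page cap"""
    for sort in ("price_low", "price_high"):  # reversing the sort order doubles coverage
        for low, high in price_brackets:
            params = {"q": query, "sort": sort, "min_price": low}
            if high is not None:
                params["max_price"] = high
            yield "https://www.walmart.com/search?" + urlencode(params)


for url in build_query_variants("spider"):
    print(url)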

Walmart Product Scraper

Our scraper can already use Walmart's search to discover product previews that contain the price, some images, the product URL and part of the description.
To collect the full product data, we need to scrape each product URL individually, so let's extend our scraper with this capability.

from typing import Dict, List

def parse_product(html_text: str) -> Dict:
    ...  # implemented below

async def _scrape_products_by_url(urls: List[str], session: httpx.AsyncClient):
    responses = await asyncio.gather(*[session.get(url) for url in urls])
    results = []
    for resp in responses:
        assert resp.status_code == 200
        results.append(parse_product(resp.text))
    return results

For parsing, we can apply the same strategy we used when parsing search: extracting the __NEXT_DATA__ JSON state object. For product pages it conveniently contains all of the product data in JSON format:

def parse_product(html_text: str) -> Dict:
    """parse walmart product"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)
    _product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
    # There's a lot of product data, including private meta keywords, so we need to do some filtering:
    wanted_product_keys = [
        "availabilityStatus",
        "averageRating",
        "brand",
        "id",
        "imageInfo",
        "manufacturerName",
        "name",
        "orderLimit",
        "orderMinLimit",
        "priceInfo",
        "shortDescription",
        "type",
    ]
    product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
    reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
    return {"product": product, "reviews": reviews_raw}

In this parsing function we pick up the __NEXT_DATA__ object and parse it for product information. There's a lot of data in there, so we use a key whitelist to select only the most important fields such as the product name, price, description and media.

Run code & example output

BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    # limit connection speed to prevent scraping too fast
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await _scrape_products_by_url(["Some product url"], session=session)
        print(json.dumps(results, indent=2))


if __name__ == "__main__":
    asyncio.run(run())
{
  "product": {
    "availabilityStatus": "IN_STOCK",
    "averageRating": 2.3,
    "brand": "Sony Pictures Entertainment",
    "shortDescription": "It's great to be Spider-Man (Andrew Garfield). for Peter Parker, there's no feeling quite like swinging between skyscrapers, embracing being the hero, and spending time with Gwen (Emma Stone). But being Spider-Man comes at a price: only Spider-Man can protect his fellow New Yorkers from the formidable villains that threaten the city. With the emergence of Electro (Jamie Foxx), Peter must confront a foe far more powerful than himself. And as his old friend, Harry Osborn (Dane DeHaan), returns, Peter comes to realize that all of his enemies have one thing in common: Oscorp.",
    "id": "43N352NZTVIQ",
    "imageInfo": {
      "allImages": [
        {
          "id": "E832A8930EF64D37B408265925B61573",
          "url": "https://i5.walmartimages.com/asr/175ec13e-b95a-4e6f-a57c-13e9ea7b703e_1.1047259e6d0655743cac1997d2b1a16e.jpeg",
          "zoomable": false
        },
        {
          "id": "A2C3299D21A34FADB84047E627CFD9E4",
          "url": "https://i5.walmartimages.com/asr/c8f793a3-5ebf-4f83-a2e1-a71fda15dbd3_1.f8b6234fb668f7c4f8d72f1a1c0f21c4.jpeg",
          "zoomable": false
        }
      ],
      "thumbnailUrl": "https://i5.walmartimages.com/asr/175ec13e-b95a-4e6f-a57c-13e9ea7b703e_1.1047259e6d0655743cac1997d2b1a16e.jpeg"
    },
    "manufacturerName": "Sony",
    "name": "The Amazing Spider-Man 2 (Blu-ray + DVD)",
    "orderMinLimit": 1,
    "orderLimit": 5,
    "priceInfo": {
      "priceDisplayCodes": {
        "clearance": null,
        "eligibleForAssociateDiscount": true,
        "finalCostByWeight": null,
        "priceDisplayCondition": null,
        "reducedPrice": null,
        "rollback": null,
        "submapType": null
      },
      "currentPrice": {
        "price": 7.82,
        "priceString": "$7.82",
        "variantPriceString": "$7.82",
        "currencyUnit": "USD"
      },
      "wasPrice": {
        "price": 14.99,
        "priceString": "$14.99",
        "variantPriceString": null,
        "currencyUnit": "USD"
      },
      "unitPrice": null,
      "savings": null,
      "subscriptionPrice": null,
      "priceRange": {
        "minPrice": null,
        "maxPrice": null,
        "priceString": null,
        "currencyUnit": null,
        "denominations": null
      },
      "capType": null,
      "walmartFundedAmount": null
    },
    "type": "Movies"
  },
  "reviews": {
    "averageOverallRating": 2.3333,
    "customerReviews": [
      {
        "rating": 1,
        "reviewSubmissionTime": "9/7/2019",
        "reviewText": "I received this and was so disappointed.  the pic advertised shows a digital copy is included,  but it's just the Blu-ray.  immediately returned bc that is not what I ordered nor does it match the photo shown.",
        "reviewTitle": "no digital copy",
        "userNickname": "Tbaby",
        "photos": [],
        "badges": null,
        "syndicationSource": null
      },
      {
        "rating": 1,
        "reviewSubmissionTime": "3/11/2019",
        "reviewText": "Advertised as \"VUDU Instawatch Included\", this is not true.\nPicture shows BluRay + DVD + Digital HD, what actually ships is just the BluRay + DVD.",
        "reviewTitle": "WARNING: You don't get what's advertised.",
        "userNickname": "Reviewer",
        "photos": [],
        "badges": null,
        "syndicationSource": null
      },
      {
        "rating": 5,
        "reviewSubmissionTime": "1/4/2021",
        "reviewText": null,
        "reviewTitle": null,
        "userNickname": null,
        "photos": [],
        "badges": [
          {
            "badgeType": "Custom",
            "id": "VerifiedPurchaser",
            "contentType": "REVIEW",
            "glassBadge": {
              "id": "VerifiedPurchaser",
              "text": "Verified Purchaser"
            }
          }
        ],
        "syndicationSource": null
      }
    ],
    "ratingValueFiveCount": 1,
    "ratingValueFourCount": 0,
    "ratingValueOneCount": 2,
    "ratingValueThreeCount": 0,
    "ratingValueTwoCount": 0,
    "roundedAverageOverallRating": 2.3,
    "topNegativeReview": null,
    "topPositiveReview": null,
    "totalReviewCount": 3
  }
}

The Final Walmart Scraper

We can discover Walmart products using search and scrape each individual product, so let's put the two together into our final web scraping script:

Full scraper code

import asyncio
import json
import math
from typing import Dict, List, Tuple
from urllib.parse import urlencode, urljoin

import httpx
from loguru import logger as log
from parsel import Selector


async def _search_walmart_page(query: str = "", page=1, sort="price_low", *, session: httpx.AsyncClient) -> httpx.Response:
    """scrape single walmart search page"""
    url = "https://www.walmart.com/search?" + urlencode(
        {
            "q": query,
            "sort": sort,
            "page": page,
            "affinityOverride": "default",
        },
    )
    log.debug(f'searching walmart page {page} of "{query}" sorted by {sort}')
    resp = await session.get(url)
    assert resp.status_code == 200
    return resp


def parse_search(html_text: str) -> Tuple[Dict, int]:
    """extract search results from search HTML response"""
    log.debug(f"parsing search page")
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)

    total_results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["count"]
    results = data["props"]["pageProps"]["initialData"]["searchResult"]["itemStacks"][0]["items"]
    # there are other results types such as ads or placeholders - filter them out:
    results = [result for result in results if result["__typename"] == "Product"]
    log.info(f"parsed {len(results)} search product previews")
    return results, total_results


async def discover_walmart(search: str, session: httpx.AsyncClient) -> List[Dict]:
    log.info(f"searching walmart for {search}")
    _resp_page1 = await _search_walmart_page(query=search, session=session)
    results, total_items = parse_search(_resp_page1.text)
    max_page = math.ceil(total_items / 40)
    log.info(f"found total {max_page} pages of results ({total_items} products)")
    if max_page > 25:
        max_page = 25
    for response in await asyncio.gather(
        *[_search_walmart_page(query=search, page=i, session=session) for i in range(2, max_page + 1)]
    ):
        results.extend(parse_search(response.text)[0])
    log.info(f"parsed total {len(results)} product previews ({total_items} products)")
    return results


def parse_product(html_text):
    """parse walmart product"""
    sel = Selector(text=html_text)
    data = sel.xpath('//script[@id="__NEXT_DATA__"]/text()').get()
    data = json.loads(data)
    _product_raw = data["props"]["pageProps"]["initialData"]["data"]["product"]
    wanted_product_keys = [
        "availabilityStatus",
        "averageRating",
        "brand",
        "id",
        "imageInfo",
        "manufacturerName",
        "name",
        "orderLimit",
        "orderMinLimit",
        "priceInfo",
        "shortDescription",
        "type",
    ]
    product = {k: v for k, v in _product_raw.items() if k in wanted_product_keys}
    reviews_raw = data["props"]["pageProps"]["initialData"]["data"]["reviews"]
    return {"product": product, "reviews": reviews_raw}


async def _scrape_products_by_url(urls: List[str], session: httpx.AsyncClient):
    """scrape walmart products by urls"""
    log.info(f"scraping {len(urls)} product urls (in chunks of 50)")
    results = []
    # we chunk requests to reduce memory usage and scraping speeds
    for i in range(0, len(urls), 50):
        log.debug(f"scraping product chunk: {i}:{i+50}")
        chunk = urls[i : i + 50]
        responses = await asyncio.gather(*[session.get(url) for url in chunk])
        for resp in responses:
            assert resp.status_code == 200
            results.append(parse_product(resp.text))
    return results


async def scrape_walmart(search: str, session: httpx.AsyncClient):
    """scrape walmart products by search term"""
    search_results = await discover_walmart(search, session=session)
    product_urls = [
        urljoin("https://www.walmart.com/", product_preview["canonicalUrl"]) for product_preview in search_results
    ]
    return await _scrape_products_by_url(product_urls, session=session)


BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US;en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}


async def run():
    limits = httpx.Limits(max_keepalive_connections=5, max_connections=5)
    async with httpx.AsyncClient(headers=BASE_HEADERS, limits=limits) as session:
        results = await scrape_walmart("spider", session=session)
        print(json.dumps(results, indent=2))
        return results


if __name__ == "__main__":
    asyncio.run(run())

In this short scraper we've implemented two core functions:

  • discover_walmart(), which finds products for a given keyword in the form of product previews that contain basic product information and, most importantly, the product page URLs.
  • _scrape_products_by_url(), which scrapes the full product data from those discovered URLs.

As for parsing, we take advantage of the fact that Walmart's frontend stores its state in the __NEXT_DATA__ HTML/JS variable, which we extract and parse as a JSON object reduced to a whitelisted set of keys. This approach is much easier to implement and maintain than HTML parsing.
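
Finally, since Walmart is notorious for blocking scrapers, in practice it also helps to retry failed requests rather than crashing on the first failed assert. Here's one simple, generic way to do that (the retry count and delays are arbitrary examples, not a Walmart-specific recipe):

import asyncio
import httpx


async def get_with_retries(session: httpx.AsyncClient, url: str, retries: int = 3) -> httpx.Response:
    """sketch: retry a request a few times with a growing delay before giving up"""
    for attempt in range(retries):
        resp = await session.get(url)
        if resp.status_code == 200:
            return resp
        # likely blocked or throttled - wait a bit longer before trying again
        await asyncio.sleep(2 ** attempt)
    resp.raise_for_status()
    return resp

Swapping this helper in for the plain session.get() calls (and keeping the connection limits low) goes a long way toward maintaining a slow, respectful scraping pace.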

FAQ

To wrap up this guide, let's take a look at some frequently asked questions about web scraping walmart.com:

Is it legal to scrape walmart.com?

Yes. Walmart's product data is publicly available, and we're not extracting anything personal or private. Scraping walmart.com at slow, respectful rates falls under the ethical scraping definition.

Walmart Scraping Summary

In this tutorial we built a small https://www.walmart.com/ scraper that uses search to discover products and then scrapes them all quickly while avoiding blocking. For this we used Python with the httpx and parsel packages.

Written by 河小马

河小马 is an accomplished digital marketing industry leader and a key member of the 广告中国 (Advertising China) forum. His expertise spans PPC advertising, domain parking, web development, affiliate marketing and cross-border e-commerce consulting. As a seasoned software developer, he combines strong technical skills with more than 13 years of experience in international online marketing.