
How to Scrape TripAdvisor.com Data

In this web scraping guide we'll be scraping TripAdvisor.com, one of the biggest service portals in the hospitality industry, containing hotel, activity, and restaurant data. In this tutorial we'll take a look at how to scrape TripAdvisor reviews as well as other details like hotel information and pricing, and how to find hotel pages by scraping search. Everything we learn can be applied to other TripAdvisor targets such as restaurants, tours, and activities.

Why Scrape TripAdvisor?

TripAdvisor is one of the biggest sources of hospitality-industry data. Most people are interested in scraping TripAdvisor reviews, but this public resource also contains data such as hotel, tour, and restaurant information and prices. So, by scraping TripAdvisor we can collect not only information about hospitality-industry targets but also public opinion about them! All of this data is enormously valuable in business intelligence, such as market and competitor analysis. In other words, the data available on TripAdvisor gives us a window into the hospitality industry that can be used to generate leads and improve business performance.

Project Setup

In this tutorial we'll be using Python with two community packages:

  • httpx – an HTTP client library that will let us communicate with TripAdvisor.com's servers
  • parsel – an HTML parsing library that will help us parse the scraped HTML files.

These packages can be easily installed via the pip command:

$ pip install "httpx[http2,brotli]" parsel

Alternatively, feel free to swap httpx out for any other HTTP client package, such as requests, since we only need basic HTTP functions, which are mostly interchangeable across libraries. As for parsel, another popular alternative is the beautifulsoup package.

Finding TripAdvisor Hotels

First, let's take a look at how to find hotels on TripAdvisor. To do that, let's see how TripAdvisor's search functionality works:

Observing the search in the browser, we can see that when we type in a search query, a GraphQL-powered POST request is being made in the background. This request returns search page recommendations, each containing preview data for hotels, restaurants, or tours. Let's replicate this GraphQL request in our Python-based scraper. To do that, we'll establish an HTTP connection session and submit a POST request mimicking the one we observed above:

import asyncio
import json
import random
import string
from typing import List, TypedDict

import httpx
from loguru import logger as log


class LocationData(TypedDict):
    """result dataclass for tripadvisor location data"""

    localizedName: str
    url: str
    HOTELS_URL: str
    ATTRACTIONS_URL: str
    RESTAURANTS_URL: str
    placeType: str
    latitude: float
    longitude: float



async def scrape_location_data(query: str, client: httpx.AsyncClient) -> List[LocationData]:
    """
    scrape search location data from a given query.
    e.g. "New York" will return us TripAdvisor's location details for this query
    """
    log.info(f"scraping location data: {query}")
    # the graphql payload that defines our search
    # note: that changing values outside of expected ranges can block the web scraper
    payload = [
        {
            # Every graphql query has a query ID that doesn't change often:
            "query": "5eec1d8288aa8741918a2a5051d289ef",
            # the variables define the search
            "variables": {
                "request": {
                    "query": query,
                    "limit": 10,
                    "scope": "WORLDWIDE",
                    "locale": "en-US",
                    "scopeGeoId": 1,
                    "searchCenter": None,
                    # we can define search result types, in this case we want to search locations
                    "types": [
                        "LOCATION",
                        #   "QUERY_SUGGESTION",
                        #   "USER_PROFILE",
                        #   "RESCUE_RESULT"
                    ],
                    # we can further narrow down locations to
                    "locationTypes": [
                        "GEO",
                        "AIRPORT",
                        "ACCOMMODATION",
                        "ATTRACTION",
                        "ATTRACTION_PRODUCT",
                        "EATERY",
                        "NEIGHBORHOOD",
                        "AIRLINE",
                        "SHOPPING",
                        "UNIVERSITY",
                        "GENERAL_HOSPITAL",
                        "PORT",
                        "FERRY",
                        "CORPORATION",
                        "VACATION_RENTAL",
                        "SHIP",
                        "CRUISE_LINE",
                        "CAR_RENTAL_OFFICE",
                    ],
                    "userId": None,
                    "articleCategories": [
                        "default",
                        "love_your_local",
                        "insurance_lander",
                    ],
                    "enabledFeatures": ["typeahead-q"],
                }
            },
        }
    ]

    # we need to generate a random request ID for this request to succeed
    random_request_id = "".join(
        random.choice(string.ascii_lowercase + string.digits) for i in range(180)
    )
    headers = {
        "X-Requested-By": random_request_id,
        "Referer": "https://www.tripadvisor.com/Hotels",
        "Origin": "https://www.tripadvisor.com",
    }
    result = await client.post(
        url="https://www.tripadvisor.com/data/graphql/ids",
        json=payload,
        headers=headers,
    )
    data = json.loads(result.content)
    results = data[0]["data"]["Typeahead_autocomplete"]["results"]
    results = [r["details"] for r in results]  # strip metadata
    log.info(f"found {len(results)} results")
    return results

# To avoid being instantly blocked we'll be using request headers that
# mimic Chrome browser on Windows
BASE_HEADERS = {
    "authority": "www.tripadvisor.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
# start HTTP session client with our headers and HTTP2
client = httpx.AsyncClient(
    http2=True,  # http2 connections are significantly less likely to get blocked
    headers=BASE_HEADERS,
    timeout=httpx.Timeout(15.0),
    limits=httpx.Limits(max_connections=5),
)


async def run():
    result = await scrape_location_data("Malta", client)
    print(json.dumps(result, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

This GraphQL request might look complex, but we're mostly using values taken straight from the browser, changing only the query string itself. A few things to note here:

  • The Referer and Origin headers are required to avoid being blocked by TripAdvisor
  • The X-Requested-By header is a tracking ID header; in this case, we simply generate a string of random characters for it.

We're also using httpx with http2 enabled to make our requests faster and less likely to be blocked. Let's take our scraper for a spin and see what it finds for the "Malta" keyword:

Example output

{
  "localizedName": "Malta",
  "localizedAdditionalNames": {
    "longOnlyHierarchy": "Europe"
  },
  "streetAddress": {
    "street1": null
  },
  "locationV2": {
    "placeType": "COUNTRY",
    "names": {
      "longOnlyHierarchyTypeaheadV2": "Europe"
    },
    "vacationRentalsRoute": {
      "url": "/VacationRentals-g190311-Reviews-Malta-Vacation_Rentals.html"
    }
  },
  "url": "/Tourism-g190311-Malta-Vacations.html",
  "HOTELS_URL": "/Hotels-g190311-Malta-Hotels.html",
  "ATTRACTIONS_URL": "/Attractions-g190311-Activities-Malta.html",
  "RESTAURANTS_URL": "/Restaurants-g190311-Malta.html",
  "placeType": "COUNTRY",
  "latitude": 35.892,
  "longitude": 14.42979,
  "isGeo": true,
  "thumbnail": {
    "photoSizeDynamic": {
      "maxWidth": 2880,
      "maxHeight": 1920,
      "urlTemplate": "https://dynamic-media-cdn.tripadvisor.com/media/photo-o/21/66/c5/99/caption.jpg?w={width}&h={height}&s=1&cx=1203&cy=677&chk=v1_cf397a9cdb4fbd9239a9"
    }
  }
}

We can see that we received URLs for hotel, restaurant, and attraction searches! We can use these URLs to scrape the search results ourselves.
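As a quick illustration, the relative URLs from the location data can be turned into absolute search URLs with nothing but the standard library (the location dict below is trimmed from the example output above):

```python
from urllib.parse import urljoin

# trimmed location data, as returned by the search query above
location = {
    "HOTELS_URL": "/Hotels-g190311-Malta-Hotels.html",
    "RESTAURANTS_URL": "/Restaurants-g190311-Malta.html",
    "ATTRACTIONS_URL": "/Attractions-g190311-Activities-Malta.html",
}
base = "https://www.tripadvisor.com"
# urljoin resolves the relative path against the site root
hotel_search_url = urljoin(base, location["HOTELS_URL"])
print(hotel_search_url)
```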

Now that we've figured out how to use TripAdvisor's search suggestions to find search pages, let's scrape those pages for hotel preview data such as links and names. Let's see how to do that by extending our scraping code:

import asyncio
import json
import math
from typing import List, Optional, TypedDict
from urllib.parse import urljoin

import httpx
from loguru import logger as log
from parsel import Selector
from snippet1 import scrape_location_data, client


class Preview(TypedDict):
    url: str
    name: str


def parse_search_page(response: httpx.Response) -> List[Preview]:
    """parse result previews from TripAdvisor search page"""
    log.info(f"parsing search page: {response.url}")
    parsed = []
    # Search results are contained in boxes which can be in two locations.
    # this is location #1:
    selector = Selector(response.text)
    for box in selector.css("span.listItem"):
        title = box.css("div[data-automation=hotel-card-title] a ::text").getall()[1]
        url = box.css("div[data-automation=hotel-card-title] a::attr(href)").get()
        parsed.append(
            {
                "url": urljoin(str(response.url), url),  # turn url absolute
                "name": title,
            }
        )
    if parsed:
        return parsed
    # location #2
    for box in selector.css("div.listing_title>a"):
        parsed.append(
            {
                "url": urljoin(
                    str(response.url), box.xpath("@href").get()
                ),  # turn url absolute
                "name": box.xpath("text()").get("").split(". ")[-1],
            }
        )
    return parsed


async def scrape_search(query: str, max_pages: Optional[int] = None) -> List[Preview]:
    """scrape search results of a search query"""
    # first scrape location data and the first page of results
    log.info(f"{query}: scraping first search results page")
    try:
        location_data = (await scrape_location_data(query, client))[0]  # take first result
    except IndexError:
        log.error(f"could not find location data for query {query}")
        return []
    hotel_search_url = "https://www.tripadvisor.com" + location_data["HOTELS_URL"]

    log.info(f"found hotel search url: {hotel_search_url}")
    first_page = await client.get(hotel_search_url)
    assert first_page.status_code == 200, "scraper is being blocked"

    # parse first page
    results = parse_search_page(first_page)
    if not results:
        log.error("query {} found no results", query)
        return []

    # extract pagination metadata to scrape all pages concurrently
    page_size = len(results)
    selector = Selector(first_page.text)
    total_results = selector.xpath("//span/text()").re(
        r"(\d*,*\d+) properties"
    )[0]
    total_results = int(total_results.replace(",", ""))
    next_page_url = selector.css(
        'a[aria-label="Next page"]::attr(href)'
    ).get()
    next_page_url = urljoin(hotel_search_url, next_page_url)  # turn url absolute
    total_pages = int(math.ceil(total_results / page_size))
    if max_pages and total_pages > max_pages:
        log.debug(
            f"{query}: only scraping {max_pages} max pages from {total_pages} total"
        )
        total_pages = max_pages

    # scrape remaining pages
    log.info(
        f"{query}: found {total_results=}, {page_size=}. Scraping {total_pages} pagination pages"
    )
    other_page_urls = [
        # note: "oa" stands for "offset anchors"
        next_page_url.replace(f"oa{page_size}", f"oa{page_size * i}")
        for i in range(1, total_pages)
    ]
    # we use assert to ensure that we don't accidentally produce duplicates which means something went wrong
    assert len(set(other_page_urls)) == len(other_page_urls)

    to_scrape = [client.get(url) for url in other_page_urls]
    for response in asyncio.as_completed(to_scrape):
        results.extend(parse_search_page(await response))
    return results

# example use:
if __name__ == "__main__":

    async def run():
        result = await scrape_search("Malta")
        print(json.dumps(result, indent=2))

    asyncio.run(run())

Example output

[
  {
    "id": "573828",
    "url": "/Hotel_Review-g230152-d573828-Reviews-Radisson_Blu_Resort_Spa_Malta_Golden_Sands-Mellieha_Island_of_Malta.html",
    "name": "Radisson Blu Resort & Spa, Malta Golden Sands"
  },
  ...
]

Here, we created a scrape_search() function that takes a query and finds the correct search page. Then we scrape the whole search listing, which spans multiple pagination pages. With the preview results in hand, we can scrape each TripAdvisor hotel listing's information, pricing, and review data. Let's take a look at how to do that.
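The pagination pattern used by scrape_search() (fetch the first page to learn the totals, then fetch the rest concurrently) can be sketched in isolation; fetch() below is a hypothetical stand-in for a real HTTP request:

```python
import asyncio
import math

PAGE_SIZE = 10
TOTAL_RESULTS = 50  # pretend the first page reports 50 results in total

async def fetch(page: int) -> list:
    """hypothetical stand-in for an HTTP request returning one page of results"""
    await asyncio.sleep(0)  # simulate network I/O
    return [f"item-{page}-{i}" for i in range(PAGE_SIZE)]

async def scrape_all() -> list:
    # 1. fetch the first page to learn the total amount of results
    results = await fetch(1)
    total_pages = math.ceil(TOTAL_RESULTS / PAGE_SIZE)
    # 2. fetch the remaining pages concurrently for a big speed boost
    other_pages = await asyncio.gather(*(fetch(p) for p in range(2, total_pages + 1)))
    for page in other_pages:
        results.extend(page)
    return results

results = asyncio.run(scrape_all())
```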

Scraping TripAdvisor Hotel Data

To scrape hotel information, we have to collect each hotel page we found through search. Before we start scraping, let's take a look at an individual hotel page to see where the data is located on the page itself. For example, let's take this 1926 Hotel & Spa hotel. If we look at this page's source in our browser, we can see a GraphQL cache state containing loads of data:

(Page source illustration: hotel data hidden in a javascript variable, viewable by browsing the page source in a browser)

Since TripAdvisor is a highly dynamic website, it stores its data both in the visible part of the page (the HTML) and in a hidden part (the javascript page state). The latter often contains more data than what's displayed in the visible part of the page, and it's usually easier to parse as well, which is perfect for our scraper! We can easily extract all of this hidden data by pulling out the hidden JSON web data and parsing it in Python:

import json
import re

def extract_page_manifest(html):
    """extract javascript state data hidden in TripAdvisor HTML pages"""
    data = re.findall(r"pageManifest:({.+?})};", html, re.DOTALL)[0]
    return json.loads(data)
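To see what the regular expression is matching, here is the same function applied to a made-up miniature page source (the real pageManifest object is far larger):

```python
import json
import re

def extract_page_manifest(html):
    """extract javascript state data hidden in TripAdvisor HTML pages"""
    data = re.findall(r"pageManifest:({.+?})};", html, re.DOTALL)[0]
    return json.loads(data)

# hypothetical miniature page source mimicking the real script tag
html = '<script>window.__WEB_CONTEXT__={pageManifest:{"urqlCache":{"results":{}}}};</script>'
manifest = extract_page_manifest(html)
```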

Using this simple regular expression pattern, we can extract the page manifest data from any TripAdvisor page. Let's put this function to use in our hotel data scraper:

import asyncio
import json
import httpx
import re


def extract_page_manifest(html):
    """extract javascript state data hidden in TripAdvisor HTML pages"""
    data = re.findall(r"pageManifest:({.+?})};", html, re.DOTALL)[0]
    return json.loads(data)

def extract_named_urql_cache(urql_cache: dict, pattern: str):
    """extract named urql response cache from hidden javascript state data"""
    data = json.loads(next(v["data"] for k, v in urql_cache.items() if pattern in v["data"]))
    return data

async def scrape_hotel(url, session):
    """Scrape TripAdvisor's hotel information"""
    first_page = await session.get(url)
    page_data = extract_page_manifest(first_page.text)
    hotel_cache = extract_named_urql_cache(page_data["urqlCache"]["results"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]
    return hotel_info
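To clarify what extract_named_urql_cache() is doing, here it is run on a made-up miniature cache; the real cache maps query keys to much larger GraphQL response strings:

```python
import json

def extract_named_urql_cache(urql_cache: dict, pattern: str):
    """extract named urql response cache from hidden javascript state data"""
    return json.loads(next(v["data"] for k, v in urql_cache.items() if pattern in v["data"]))

# hypothetical miniature cache: each entry stores a GraphQL response as a JSON string
cache = {
    "query1": {"data": json.dumps({"locations": [{"locationDescription": "a hotel"}]})},
    "query2": {"data": json.dumps({"somethingElse": True})},
}
# pick out the response that contains the "locationDescription" field
hotel_cache = extract_named_urql_cache(cache, '"locationDescription"')
```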

If we run our scraper now, we can see hotel information results similar to the following:

Run code and example output

async def run():
    limits = httpx.Limits(max_connections=5)
    BASE_HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
        "accept-encoding": "gzip, deflate, br",
    }
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )
        print(json.dumps(result, indent=2))


if __name__ == "__main__":
    asyncio.run(run())

Which results in a dataset similar to:

{
  "locationId": 264936,
  "name": "1926 Hotel & Spa",
  "accommodationType": "T_HOTEL",
  "parent": {
    "locationId": 190320
  },
  "parents": [
    {
      "name": "Sliema",
      "hotelsUrl": "/Hotels-g190327-Sliema_Island_of_Malta-Hotels.html"
    },
    "..."
  ],
  "locationDescription": "Inspired by the life and passions of one man and featuring a touch of the roaring twenties, 1926 Hotel & Spa offers luxury rooms and suites in the central city of Sliema. The hotel is located 200 meters from the seafront and also offers a splendid Beach Club on the water’s edge as well as a luxury SPA. Beach club is located 200 meters away from the hotel and is a seasonal operation. Our concept of ‘Lean Luxury’ includes the following: • Luxury rooms at affordable prices • Uncomplicated comfort and a great sleep • Smart design technology • Raindance showerheads • Flat screens • SuitePad Tablets • Self check in and check out (if desired) • Coffee & tea making facilities",
  "businessAdvantageData": {
    "specialOffer": null,
    "contactLinks": [
      {
        "contactLinkType": "PHONE",
        "linkUrl": null
      },
      "..."
    ]
  },
  "writeUserReviewUrl": "/UserReview-g190327-d264936-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
  "reviewSummary": {
    "rating": 4.5,
    "count": 955
  },
  "accommodationCategory": "HOTEL",
  "popIndexDetails": {
    "popIndexRank": 5,
    "popIndexTotal": 29
  },
  "detail": {
    "hotelAmenities": {
      "highlightedAmenities": {
        "roomFeatures": [
          {
            "tagId": 18898,
            "amenityNameLocalized": "Bathrobes",
            "amenityCategoryName": "Comfort",
            "amenityIcon": "hotels"
          },
          "..."
        ],
        "roomTypes": [
          {
            "tagId": 9184,
            "amenityNameLocalized": "Non-smoking rooms",
            "amenityCategoryName": "RoomTypes",
            "amenityIcon": "hotels"
          },
          "..."
        ],
        "propertyAmenities": [
          {
            "tagId": 18970,
            "amenityNameLocalized": "Free public parking nearby",
            "amenityCategoryName": "Parking",
            "amenityIcon": "parking"
          },
          "..."
        ]
      },
      "nonHighlightedAmenities": {
        "roomFeatures": [
          {
            "tagId": 19104,
            "amenityNameLocalized": "Telephone",
            "amenityCategoryName": "RoomAmenities",
            "amenityIcon": "hotels"
          },
          "..."
        ],
        "roomTypes": [],
        "propertyAmenities": [
          {
            "tagId": 19052,
            "amenityNameLocalized": "Paid private parking nearby",
            "amenityCategoryName": "Parking",
            "amenityIcon": "parking"
          },
          "..."
        ]
      },
      "languagesSpoken": [
        {
          "tagId": 18950,
          "amenityNameLocalized": "English"
        },
        "..."
      ]
    },
    "userPartialFilterMatch": {
      "locations": []
    },
    "starRating": [],
    "styleRankings": [
      {
        "tagId": 6216,
        "tagName": "Family",
        "geoRanking": 1,
        "translatedName": "Family",
        "score": 0.8039135983441473
      },
      "..."
    ],
    "hotel": {
      "reviewSubratingAvgs": [
        {
          "avgRating": 4.368532,
          "questionId": 10
        },
        "..."
      ],
      "greenLeader": null
    }
  },
  "heroPhoto": {
    "id": 599471239
  }
}

In this section, we scraped a hotel's information just by extracting the javascript state data and parsing it in Python. We can take this technique further to retrieve the hotel's pricing data. Let's take a look at how to do that.

Scraping TripAdvisor Hotel Price Data

For pricing information, it would seem that we need to supply check-in and check-out dates. However, an easier approach is to explore the pricing calendar, which contains several months of pricing data. For the pricing calendar information, let's dig deeper into our javascript state cache. An easy way to find it is to search the cached data for one of the dates present in the calendar (e.g. just ctrl+f for "2022-06-20"):

(Page source illustration: pricing calendar data present in the page state variable)

This means we can extract the hotel pricing data using the same technique we used to parse the hotel information:

import asyncio
import json
import httpx
from snippet3 import extract_named_urql_cache, extract_page_manifest

async def scrape_hotel(url: str, session: httpx.AsyncClient):
    """Scrape hotel data: information and pricing"""
    first_page = await session.get(url)
    page_data = extract_page_manifest(first_page.text)

    # price data keys are dynamic, so first we need to find the full key name
    _pricing_key = next(
        key for key in page_data["redux"]["api"]["responses"] 
        if "/hotelDetail" in key and "/heatMap" in key
    )
    pricing_details = page_data["redux"]["api"]["responses"][_pricing_key]["data"]["items"]

    hotel_cache = extract_named_urql_cache(page_data["urqlCache"]["results"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]

    hotel_data = {
        "price": pricing_details,
        "info": hotel_info,
    }
    return hotel_data

If we run our scraper now, we can see several months of pricing data, which looks something like this:

Run code and example output

async def run():
    limits = httpx.Limits(max_connections=5)
    BASE_HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
        "accept-encoding": "gzip, deflate, br",
    }
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )
        print(json.dumps(result['price'], indent=2))


if __name__ == "__main__":
    asyncio.run(run())

[
  {
    "date": "2022-08-31",
    "priceUSD": 13852,
    "priceDisplay": "USD 138.52"
  },
  {
    "date": "2022-08-30",
    "priceUSD": 14472,
    "priceDisplay": "USD 144.72"
  },
  ...
]

With this part of our scraper complete, we have scraping functionality for hotel information and hotel prices; all we're missing is hotel reviews. So, let's take a look at how to scrape hotel review data.

Scraping TripAdvisor Hotel Reviews

Finally, to scrape TripAdvisor reviews in Python, we'll continue using our javascript state cache scraping approach. However, since hotel reviews are spread across multiple pages, we'll have to make a few extra requests. Our scraping flow will look like this:

  1. Scrape the hotel page
  2. Extract reviews from the first page
  3. Extract the total page count
  4. Scrape the remaining review pages and parse them
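For step 4, review pages are addressed by offsetting the URL with an "-or{offset}-" segment ("or" stands for "offset reviews"), 10 reviews per page. A quick sketch using the hotel URL from earlier:

```python
page_size = 10
url = "/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html"
# pages 2 and 3 offset the reviews by 10 and 20 respectively
review_page_urls = [
    url.replace("-Reviews-", f"-Reviews-or{page_size * i}-") for i in range(1, 3)
]
```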

Let's update our scrape_hotel() function with review scraping logic:

import asyncio
import json
import math
from typing import List, TypedDict

import httpx
from snippet3 import extract_named_urql_cache, extract_page_manifest


class Review(TypedDict):
    """storage type hint for review data"""
    id: str
    date: str
    rating: str
    title: str
    text: str
    votes: int
    url: str
    language: str
    platform: str
    author_id: str
    author_name: str
    author_username: str


def parse_reviews(html) -> List[Review]:
    """Parse reviews from a review page"""
    page_data = extract_page_manifest(html)
    review_cache = extract_named_urql_cache(page_data["urqlCache"]["results"], '"reviewListPage"')
    parsed = []
    # review data contains loads of information, let's parse only the basic in this tutorial
    for review in review_cache["locations"][0]["reviewListPage"]["reviews"]:
        parsed.append(
            {
                "id": review["id"],
                "date": review["publishedDate"],
                "rating": review["rating"],
                "title": review["title"],
                "text": review["text"],
                "votes": review["helpfulVotes"],
                "url": review["route"]["url"],
                "language": review["language"],
                "platform": review["publishPlatform"],
                "author_id": review["userProfile"]["id"],
                "author_name": review["userProfile"]["displayName"],
                "author_username": review["userProfile"]["username"],
            }
        )
    return parsed


async def scrape_hotel(url, session):
    """Scrape all hotel data: information, pricing and reviews"""
    first_page = await session.get(url)
    page_data = extract_page_manifest(first_page.text)
    _pricing_key = next(
        (key for key in page_data["redux"]["api"]["responses"] if "/hotelDetail" in key and "/heatMap" in key)
    )
    pricing_details = page_data["redux"]["api"]["responses"][_pricing_key]["data"]["items"]
    hotel_cache = extract_named_urql_cache(page_data["urqlCache"]["results"], '"locationDescription"')
    hotel_info = hotel_cache["locations"][0]

    # ------- NEW CODE ----------------
    # for reviews we first need to scrape multiple pages
    # so, first let's find total amount of pages
    total_reviews = hotel_info["reviewSummary"]["count"]
    _review_page_size = 10
    total_review_pages = int(math.ceil(total_reviews / _review_page_size))
    # then we can scrape all review pages concurrently
    # note: in review url "or" stands for "offset reviews"
    review_urls = [
        url.replace("-Reviews-", f"-Reviews-or{_review_page_size * i}-") for i in range(1, total_review_pages)
    ]
    assert len(set(review_urls)) == len(review_urls)
    review_responses = await asyncio.gather(*[session.get(url) for url in review_urls])
    reviews = []
    for response in [first_page, *review_responses]:
        reviews.extend(parse_reviews(response.text))
    # ---------------------------------

    hotel_data = {
        "price": pricing_details,
        "info": hotel_info,
        "reviews": reviews,
    }
    return hotel_data

Above, we used the same technique as for scraping hotel information: we extract the initial review data from the javascript state, then iterate through all the remaining pages and collect the rest of the reviews the same way. One thing to note here is that we're using a common pagination scraping idiom: we retrieve the first page to get the total result count, then collect the remaining pages concurrently. This approach lets us scrape many pagination pages at the same time, giving us a huge speed boost!

Run code and example output

async def run():
    limits = httpx.Limits(max_connections=5)
    BASE_HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
        "accept-encoding": "gzip, deflate, br",
    }
    async with httpx.AsyncClient(limits=limits, timeout=httpx.Timeout(15.0), headers=BASE_HEADERS) as session:
        result = await scrape_hotel(
            "https://www.tripadvisor.com/Hotel_Review-g190327-d264936-Reviews-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
            session,
        )
        print(json.dumps(result['reviews'], indent=2))


if __name__ == "__main__":
    asyncio.run(run())

This should result in a review dataset similar to:

[
    {
      "id": 843669952,
      "date": "2022-06-20",
      "rating": 5,
      "title": "A birthday to remember",
      "text": "Memorable visit for a special birthday. Room was just perfect. Staff were lovely and on the whole very helpful. Used the beach club and loved it. Lovely hotel to spend some time with friends and so handy for sight seeing and local bars and restaurants.",
      "votes": 0,
      "url": "/ShowUserReviews-g190327-d264936-r843669952-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
      "language": "en",
      "platform": "OTHER",
      "author_id": "removed from blog for privacy purposes",
      "author_name": "removed from blog for privacy purposes",
      "author_username": "removed from blog for privacy purposes"
    },
    {
      "id": 843644452,
      "date": "2022-06-19",
      "rating": 5,
      "title": "t mini bre",
      "text": "We stayed here for a friends wedding and it was lovely staff were great. Breakfast had a good range of food and drink. Couldn\u2019t fault the hotel had everything you needed. Beach club was really good and served lovely food. ",
      "votes": 0,
      "url": "/ShowUserReviews-g190327-d264936-r843644452-1926_Hotel_Spa-Sliema_Island_of_Malta.html",
      "language": "en",
      "platform": "OTHER",
      "author_id": "removed from blog for privacy purposes",
      "author_name": "removed from blog for privacy purposes",
      "author_username": "removed from blog for privacy purposes"
    }
...
]

With this last feature, we have a complete TripAdvisor scraper that can scrape hotel information, pricing data, and reviews. We can easily apply the same scraping logic to other TripAdvisor details, such as activity and restaurant data, since the underlying web technology is the same.


FAQ

To wrap up this guide, let's take a look at some frequently asked questions about web scraping tripadvisor.com:

Is it legal to scrape TripAdvisor?

Yes. TripAdvisor's data is publicly available, and we're not extracting anything personal or private. Scraping tripadvisor.com at slow, respectful rates falls under the ethical scraping definition. That said, when scraping reviews, we should avoid collecting personal information such as user names in GDPR-compliant countries (like the EU).

Why scrape TripAdvisor instead of using TripAdvisor's API?

Unfortunately, TripAdvisor's API is difficult to use and very limited. For example, it only provides 3 reviews per location. By scraping the public TripAdvisor pages, we can collect all of the reviews and hotel details that aren't available to us through the TripAdvisor API.

Written by 河小马

河小马 is a prominent leader in the digital marketing industry and a key member of the 广告中国 (Advertising China) forum. His expertise spans PPC advertising, domain parking, website development, affiliate marketing, and cross-border e-commerce consulting. As a seasoned software developer, he combines strong technical skills with over 13 years of experience in overseas digital marketing.