
How to Scrape Booking.com Data

Booking.com is the biggest travel reservation service out there, containing public data on thousands of hotels, resorts, airbnbs, and more. In this tutorial, we'll take a look at how to scrape booking.com using the Python programming language. We'll start with a quick overview of booking.com's website functionality. Then we'll replicate its behavior in our Python scraper to scrape hotel information and pricing data. Finally, we'll wrap everything up with some tips and tricks, along with the challenges frequently encountered when web scraping booking.com. So, let's dive in!

Project Setup

In this tutorial, we'll use Python with two packages:

  • httpx – an HTTP client library that lets us communicate with Booking.com's servers
  • parsel – an HTML parsing library that will help us parse the HTML files scraped from the web for hotel data

Both of these packages can be easily installed via the pip command:

$ pip install "httpx[http2,brotli]" parsel

Alternatively, feel free to swap httpx out for any other HTTP client package, such as requests (https://pypi.org/project/requests/), as we only need basic HTTP functionality, which is available in virtually every HTTP client library. As for parsel, another great alternative is the beautifulsoup package.

Finding Booking.com Hotels

Our first step is to figure out how to discover hotel pages so we can start scraping their data. On the Booking.com platform, there are a few ways to achieve this.

Using Sitemaps

Booking.com is easily accessible through its vast sitemap system. Sitemaps are highly compressed XML files containing all of the URLs available on the website, categorized by subject. To find the sitemaps, we first have to visit the /robots.txt page, where we can find the sitemap links:

Sitemap: https://www.booking.com/sitembk-airport-index.xml
Sitemap: https://www.booking.com/sitembk-articles-index.xml
Sitemap: https://www.booking.com/sitembk-attractions-index.xml
Sitemap: https://www.booking.com/sitembk-beaches-index.xml
Sitemap: https://www.booking.com/sitembk-beach-holidays-index.xml
Sitemap: https://www.booking.com/sitembk-cars-index.xml
Sitemap: https://www.booking.com/sitembk-city-index.xml
Sitemap: https://www.booking.com/sitembk-country-index.xml
Sitemap: https://www.booking.com/sitembk-district-index.xml
Sitemap: https://www.booking.com/sitembk-hotel-index.xml
Sitemap: https://www.booking.com/sitembk-landmark-index.xml
Sitemap: https://www.booking.com/sitembk-region-index.xml
Sitemap: https://www.booking.com/sitembk-tourism-index.xml
Sitemap: https://www.booking.com/sitembk-themed-city-villas-index.xml
Sitemap: https://www.booking.com/sitembk-themed-country-golf-index.xml
Sitemap: https://www.booking.com/sitembk-themed-region-budget-index.xml
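Since these Sitemap declarations are plain text lines in the robots.txt body, they can be pulled out with a few lines of standard-library Python. A minimal sketch (the robots.txt text here is a shortened, illustrative sample, not the full file):

```python
# shortened, illustrative sample of a robots.txt body
robots_txt = """
User-agent: *
Disallow: /reviewlist
Sitemap: https://www.booking.com/sitembk-hotel-index.xml
Sitemap: https://www.booking.com/sitembk-city-index.xml
"""


def parse_sitemap_urls(robots: str) -> list:
    """return all sitemap URLs declared in a robots.txt document"""
    urls = []
    for line in robots.splitlines():
        # sitemap declarations look like "Sitemap: <url>"
        if line.lower().startswith("sitemap:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls
```

In a real scraper the robots.txt body would come from a GET request to https://www.booking.com/robots.txt.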

Here, we can see URLs categorized by city, landmark, and even theme. For example, if we take a look at https://www.booking.com/sitembk-hotel-index.xml, we can see that it splits into yet another set of sitemaps (since a single sitemap is only allowed to contain 50,000 results):

<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0037.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0036.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
...

Here, we have 1,710 sitemaps, which amounts to 85 million links to various hotel pages. Of course, not all of the hotel pages are unique (some are duplicates), but this is an easy way to discover hotel listings on booking.com. Using sitemaps is an efficient and simple way to find hotels, but it's not a flexible discovery method. To scrape hotels available in a specific area, or ones that offer certain features, we need to scrape Booking.com's search system instead. So, let's take a look at how to do that.
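As a sketch of working with these indexes, the sitemap XML above can be parsed with Python's standard library alone. The sample document below mirrors the structure shown earlier; in a real scraper the text would come from an HTTP response, and each child .xml.gz file would need gzip.decompress before parsing:

```python
import xml.etree.ElementTree as ET

# shortened sample of booking.com's hotel sitemap index shown above
sitemap_index = """<sitemapindex xmlns="http://www.google.com/schemas/sitemap/0.9">
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0037.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
<sitemap>
<loc>https://www.booking.com/sitembk-hotel-zh-tw.0036.xml.gz</loc>
<lastmod>2022-05-17</lastmod>
</sitemap>
</sitemapindex>"""


def parse_sitemap_index(xml_text: str) -> list:
    """return URLs of child sitemaps listed in a sitemap index document"""
    ns = {"sm": "http://www.google.com/schemas/sitemap/0.9"}
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:sitemap/sm:loc", ns)]
```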

Alternatively, we can take advantage of the search system running on booking.com, just like a human user would.

Illustration of booking.com's search bar
booking.com's search system finding hotels located in London

Booking's search might look complex at first because of its long URLs, but if we dig in a bit deeper, we'll find that it's rather simple, as most of the URL parameters are optional. For example, let's take a look at a query for "Hotels in London":

https://www.booking.com/searchresults.html?label=gen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ
&sid=51b2c8cd7b3c8377e83692903e6f19ca
&sb=1
&sb_lp=1
&src=index
&src_elem=sb
&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1DCAEoggI46AdIM1gEaN0BiAEBmAExuAEKyAEF2AED6AEB-AECiAIBqAIDuAK-u5eUBsACAdICJGRlN2VhYzYyLTJiYzItNDE0MS1iYmY4LWYwZjkxNTc0OGY4ONgCBOACAQ%26sid%3D51b2c8cd7b3c8377e83692903e6f19ca%26sb_price_type%3Dtotal%26%26
&ss=London%2C+Greater+London%2C+United+Kingdom
&is_ski_area=
&ssne=London
&ssne_untouched=London
&checkin_year=2022
&checkin_month=6
&checkin_monthday=9
&checkout_year=2022
&checkout_month=6
&checkout_monthday=11
&group_adults=2
&group_children=0
&no_rooms=1
&b_h4u_keep_filters=
&from_sf=1
&search_pageview_id=f25c2a9ee3630134
&ac_suggestion_list_length=5
&ac_suggestion_theme_list_length=0
&ac_position=0
&ac_langcode=en
&ac_click_type=b
&dest_id=-2601889
&dest_type=city
&iata=LON
&place_id_lat=51.507393
&place_id_lon=-0.127634
&search_pageview_id=f25c2a9ee3630134
&search_selected=true
&ss_raw=London

That's a lot of scary-looking parameters! Fortunately, we can distill them down to a few mandatory ones in our Python web scraper. Let's write the first part of our scraper: a function that retrieves a single search results page:

import asyncio
from urllib.parse import urlencode
from httpx import AsyncClient


async def search_page(
    query,
    session: AsyncClient,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
    offset: int = 0,
):
    """scrapes a single hotel search page of booking.com"""
    checkin_year, checkin_month, checkin_day = checkin.split("-") if checkin else ("", "", "")
    checkout_year, checkout_month, checkout_day = checkout.split("-") if checkout else ("", "", "")

    url = "https://www.booking.com/searchresults.html"
    url += "?" + urlencode(
        {
            "ss": query,
            "checkin_year": checkin_year,
            "checkin_month": checkin_month,
            "checkin_monthday": checkin_day,
            "checkout_year": checkout_year,
            "checkout_month": checkout_month,
            "checkout_monthday": checkout_day,
            "no_rooms": number_of_rooms,
            "offset": offset,
        }
    )
    return await session.get(url, follow_redirects=True)

# Example use:
# first we need to imitate web browser headers to avoid getting blocked instantly
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Accept-Language": "en-US,en;q=0.9,lt;q=0.8,et;q=0.7,de;q=0.6",
}


async def run():
    async with AsyncClient(headers=HEADERS) as session:
        await search_page("London", session)

if __name__ == "__main__":
    asyncio.run(run())

Here, we defined a function that requests a single search results page from a given search query and check-in dates. We also used a couple of common scraping idioms:

  • We set our HTTP client's HEADERS to those of a common web browser to avoid being blocked immediately. In this case, we're using Chrome on the Windows operating system.
  • We used the follow_redirects keyword to automatically follow all 30X responses, since our generated query parameters omit some optional values.

Another key parameter here is offset, which controls search result pagination. Providing an offset tells booking.com we want 25 results starting from point X of the result set. So, let's use it to implement full pagination and hotel preview data parsing:

from parsel import Selector


def parse_search_total_results(html: str):
    """parse total number of results from search page HTML"""
    sel = Selector(text=html)
    # parse total amount of pages from heading1 text:
    # e.g. "London: 1,232 properties found"
    total_results = int(sel.css("h1").re(r"([\d,]+) properties found")[0].replace(",", ""))
    return total_results


def parse_search_page(html: str):
    """parse hotel preview data from search page HTML"""
    sel = Selector(text=html)

    hotel_previews = {}
    for hotel_box in sel.xpath('//div[@data-testid="property-card"]'):
        url = hotel_box.xpath('.//h3/a[@data-testid="title-link"]/@href').get("").split("?")[0]
        hotel_previews[url] = {
            "name": hotel_box.xpath('.//h3/a[@data-testid="title-link"]/div/text()').get(""),
            "location": hotel_box.xpath('.//span[@data-testid="address"]/text()').get(""),
            "score": hotel_box.xpath('.//div[@data-testid="review-score"]/div/text()').get(""),
            "review_count": hotel_box.xpath('.//div[@data-testid="review-score"]/div[2]/div[2]/text()').get(""),
            "stars": len(hotel_box.xpath('.//div[@data-testid="rating-stars"]/span').getall()),
            "image": hotel_box.xpath('.//img[@data-testid="image"]/@src').get(),
        }
    return hotel_previews


async def scrape_search(
    query,
    session: AsyncClient,
    checkin: str = "",
    checkout: str = "",
    number_of_rooms=1,
):
    """scrape all hotel previews from a given search query"""
    first_page = await search_page(
        query=query, session=session, checkin=checkin, checkout=checkout, number_of_rooms=number_of_rooms
    )
    total_results = parse_search_total_results(first_page.text)
    other_pages = await asyncio.gather(
        *[
            search_page(
                query=query,
                session=session,
                checkin=checkin,
                checkout=checkout,
                number_of_rooms=number_of_rooms,
                offset=offset,
            )
            for offset in range(25, total_results, 25)
        ]
    )
    hotel_previews = {}
    for response in [first_page, *other_pages]:
        hotel_previews.update(parse_search_page(response.text))
    return hotel_previews

There's quite a bit of code here, so let's unpack it bit by bit:

First, we define our scrape_search() function, which loops through the search_page() function we defined earlier to scrape all pages rather than just the first one. We do this by employing a common web scraping idiom for paginations of known size: we scrape the first page, find the number of results, and scrape the remaining pages concurrently.

Then, we parse the preview data from each result page using XPath selectors. For that, we iterate through the 25 hotel preview boxes present on the page and extract details such as the name, location, score, URL, review count, and star rating.

Let's run our search scraper:

Run code & example output
import json

async def run():
    async with AsyncClient(headers=HEADERS) as session:
        results = await scrape_search("London", session)
        print(json.dumps(results, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
{
  "https://www.booking.com/hotel/gb/nobu-hotel-london-portman-square.html": {
    "name": "Nobu Hotel London Portman Square",
    "location": "Westminster Borough, London",
    "score": "8.9",
    "review_count": "445 reviews",
    "stars": 5,
    "image": "https://cf.bstatic.com/xdata/images/hotel/square200/339532965.webp?k=ba363634cf1e7c91ac2e64f701bf702d520b133c311ac91e2b3df118d0570aaa&o=&s=1"
  },
...

We've successfully scraped booking.com's search pages to discover hotels located in London. Moreover, we've picked up some valuable metadata along with the URLs of the hotel pages themselves, so next, we can scrape detailed hotel data and prices!
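As an aside, the offset arithmetic driving scrape_search() above is worth seeing in isolation: given the total result count, the remaining pages start at offsets 25, 50, 75 and so on. A small helper sketch (the 25-results-per-page size matches what booking.com's search returns):

```python
def page_offsets(total_results: int, page_size: int = 25) -> list:
    """offsets of every results page after the first one"""
    return list(range(page_size, total_results, page_size))

# e.g. a "London: 1,232 properties found" search needs 49 more requests,
# at offsets 25, 50, 75, ..., 1225
```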

Scraping Booking.com Hotel Data

Now that we have a scraper that collects booking.com's hotel preview data, we can go a step further and collect the rest of the hotel details, such as the description, address, and feature list, by scraping each individual hotel URL.

Hotel page parsing markup
Hotel page field markup: these are the fields we will scrape

We'll keep using httpx to connect to booking.com and parsel to process the hotel HTML pages:

import asyncio
import re
from collections import defaultdict
from typing import List
from parsel import Selector
from httpx import AsyncClient

def parse_hotel(html: str):
    sel = Selector(text=html)
    css = lambda selector, sep="": sep.join(sel.css(selector).getall()).strip()
    css_first = lambda selector: sel.css(selector).get("")
    # get latitude and longitude of the hotel address:
    lat, lng = css_first(".show_map_hp_link::attr(data-atlas-latlng)").split(",")
    # get hotel features by type
    features = defaultdict(list)
    for feat_box in sel.css("[data-capla-component*=FacilitiesBlock]>div>div>div"):
        type_ = feat_box.xpath('.//span[contains(@data-testid, "facility-group-icon")]/../text()').get()
        feats = [f.strip() for f in feat_box.css("li ::text").getall() if f.strip()]
        features[type_] = feats
    data = {
        "title": css("h2#hp_hotel_name::text"),
        "description": css("div#property_description_content ::text", "\n"),
        "address": css(".hp_address_subtitle::text"),
        "lat": lat,
        "lng": lng,
        "features": dict(features),
        "id": re.findall(r"b_hotel_id:\s*'(.+?)'", html)[0],
    }
    return data

async def scrape_hotels(urls: List[str], session: AsyncClient):
    async def scrape_hotel(url: str):
        resp = await session.get(url)
        hotel = parse_hotel(resp.text)
        hotel["url"] = str(resp.url)
        return hotel

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels

Here, we defined our hotel page scraping logic. Our scrape_hotels function takes a list of hotel URLs, which we scrape with simple GET requests for the HTML data. Then, we use our HTML parsing library to extract the hotel information using CSS selectors. Let's test our scraper:

Run code & example output
async def run():
    async with AsyncClient(headers=HEADERS) as session:
        hotels = await scrape_hotels(["https://www.booking.com/hotel/gb/gardencourthotel.html"], session)
        print(json.dumps(hotels, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "title": "Garden Court Hotel",
    "description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is superbly located in Kensington Gardens Square. It offers stylish, family-run accommodations, a short walk from Bayswater Underground Station.\n\n\nEach comfortable room is individually designed, with an LCD Freeview cable TV. All rooms have their own private internal private bathrooms, except for a few cozy single rooms which have their own exclusive private external bathrooms.\n\n\nFree Wi-Fi internet access is available throughout the hotel, and there is also free luggage storage and a safe for guests to use at the 24-hour reception.\n\n\nThe hotel is located in fashionable Notting Hill, close to Portobello Antiques Markets and the Royal Parks. Kings Cross Station is 4.8 km away.",
    "address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
    "lat": "51.51431706",
    "lng": "-0.19066349",
    "features": {
      "Food & Drink": [ "Breakfast in the room" ],
      "Internet": [],
      "Parking": [ "Electric vehicle charging station", "Street parking" ],
      "Front Desk Services": [ "Invoice provided", "..." ],
      "Cleaning Services": [ "Daily housekeeping" ],
      "Business Facilities": [ "Fax/Photocopying" ],
      "Safety & security": [ "Fire extinguishers", "..." ],
      "General": [ "Shared lounge/TV area", "..." ],
      "Languages Spoken": [ "Arabic", "..." ]
    },
    "id": "102764",
    "url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd"
  }
]

There's more data available on the page, but to keep this tutorial brief, we focus on just a few example fields. However, we're missing one very important detail: the price! For that, we'll need to modify our scraper to make an additional request.

Scraping Booking.com Hotel Pricing

Booking.com's hotel pages don't include pricing in the HTML data, so we have to make an additional request to retrieve the pricing calendar data. If we scroll down on a hotel page and open our network inspector, we can see how Booking.com populates its pricing calendar:

Illustration of booking.com's hotel pricing calendar
The pricing calendar request of a booking.com hotel

Here, we can see a background request being made that retrieves pricing data for 61 days! Let's add this functionality to our scraper:

async def scrape_hotels(urls: List[str], session: AsyncClient, price_start_dt: str, price_n_days=30):
    async def scrape_hotel(url: str):
        resp = await session.get(url)
        hotel = parse_hotel(resp.text)
        hotel["url"] = str(resp.url)
        # for background requests we need to find cross-site-reference token
        csrf_token = re.findall(r"b_csrf_token:\s*'(.+?)'", resp.text)[0]
        hotel['price'] = await scrape_prices(csrf_token=csrf_token, hotel_id=hotel['id'])
        return hotel

    async def scrape_prices(hotel_id, csrf_token):
        data = {
            "name": "hotel.availability_calendar",
            "result_format": "price_histogram",
            "hotel_id": hotel_id,
            "search_config": json.dumps({
                # we can adjust pricing configuration here but this is the default
                "b_adults_total": 2,
                "b_nr_rooms_needed": 1,
                "b_children_total": 0,
                "b_children_ages_total": [],
                "b_is_group_search": 0,
                "b_pets_total": 0,
                "b_rooms": [{"b_adults": 2, "b_room_order": 1}],
            }),
            "checkin": price_start_dt,
            "n_days": price_n_days,
            "respect_min_los_restriction": 1,
            "los": 1,
        }
        resp = await session.post(
            "https://www.booking.com/fragment.json?cur_currency=usd",
            headers={**session.headers, "X-Booking-CSRF": csrf_token},
            data=data,
        )
        return resp.json()["data"]

    hotels = await asyncio.gather(*[scrape_hotel(url) for url in urls])
    return hotels

We extended our scrape_hotels function with price scraping by replicating the pricing calendar request we saw in the network inspector. If we run this code, our results should contain pricing data similar to this:

Run code & example output
async def run():
    async with AsyncClient(headers=HEADERS) as session:
        hotels = await scrape_hotels(["https://www.booking.com/hotel/gb/gardencourthotel.html"], session, '2022-05-20')
        print(json.dumps(hotels, indent=2))

if __name__ == "__main__":
    asyncio.run(run())
[
  {
    "title": "Garden Court Hotel",
    "description": "You're eligible for a Genius discount at Garden Court Hotel! To save at this property, all you have to do is \nsign in\n.\n\n\nThe 19th-century Garden Court Hotel is ...",
    "address": "30-31 Kensington Gardens Square, Notting Hill, Westminster Borough, London, W2 4BG, United Kingdom",
    "lat": "51.51431706",
    "lng": "-0.19066349",
    "features": {
      "Food & Drink": [
        "Breakfast in the room"
      ],
      "Internet": [],
      "Parking": [ "Electric vehicle charging station", "Street parking" ],
      "Front Desk Services": [ "Invoice provided", ""],
      "Cleaning Services": [ "Daily housekeeping", "Ironing service" ],
      "Business Facilities": [ "Fax/Photocopying" ],
      "Safety & security": [ "Fire extinguishers", "..."],
      "General": [ "Shared lounge/TV area", "..." ],
      "Languages Spoken": [ "Arabic", "English", "Spanish", "French", "Romanian" ]
    },
    "id": "102764",
    "url": "https://www.booking.com/hotel/gb/gardencourthotel.html?cur_currency=usd",
    "price": {
      "min_los": 1,
      "days": [
        {
          "b_is_weekend": 0,
          "b_month": "05",
          "b_month_name": "May",
          "b_price_pretty": "USD\u00a0276,69",
          "checkin": "2022-05-20",
          "b_full_year": "2022",
          "b_length_of_stay": 1,
          "b_price": 276.6988213958,
          "b_day": "20",
          "b_avg_price_pretty": "276",
          "b_short_month_name": "May",
          "b_checkout": "2022-05-21",
          "b_weekday": 5,
          "b_avg_price_raw": "276",
          "b_available": 1,
          "b_epoch": 1652997600,
          "b_weekday_name": "Fr",
          "b_min_length_of_stay": 1,
          "b_url_hp": "/hotel/gb/gardencourthotel.html?label=gen173nr-1DEghmcmFnbWVudCiCAjjoB0gzWARo3QGIAQGYATG4ARfIAQzYAQPoAQH4AQOIAgGoAgO4AoLPnZQGwAIB0gIkNzllMjNlMDItMjRlNC00M2U0LTk0YzYtY2JlNDlkMjA5NzI52AIE4AIB&sid=3af1cb864972f2e88ef99b900927c6f1&checkin=2022-05-20&checkout=2022-05-21&room1=A%2CA%2C&#maxotel_rooms",
          "b_checkin": "2022-05-20"
        },
     ...

We can have this request generate price and availability data for every day in the calendar range we specify.
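Since the days array is quite verbose, it can be handy to condense it into a simple date-to-price lookup. A minimal sketch, assuming the calendar dict carries the checkin, b_price, and b_available fields shown in the output above (the sample data here is illustrative):

```python
# mimics the "price" structure returned by the fragment.json endpoint
calendar = {
    "min_los": 1,
    "days": [
        {"checkin": "2022-05-20", "b_price": 276.69, "b_available": 1},
        {"checkin": "2022-05-21", "b_price": 290.00, "b_available": 0},
    ],
}


def price_by_date(calendar: dict) -> dict:
    """map check-in date to price for days that are actually bookable"""
    return {
        day["checkin"]: day["b_price"]
        for day in calendar["days"]
        if day["b_available"]
    }
```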

Scraping Booking.com Hotel Reviews

To scrape booking.com's hotel reviews, let's take a look at what happens when we page through the reviews. Let's click on page 2 and observe what happens in the browser's web inspector (F12 in most browsers).

We can see a request being made to the reviewlist.html endpoint, which returns an HTML page containing 10 reviews. We can easily replicate this in our scraper:

def parse_reviews(html: str) -> List[dict]:
    """parse review page for review data """
    sel = Selector(text=html)
    parsed = []
    for review_box in sel.css('.review_list_new_item_block'):
        get_css = lambda css: review_box.css(css).get("").strip()
        parsed.append({
            "id": review_box.xpath('@data-review-url').get(),
            "score": get_css('.bui-review-score__badge::text'),
            "title": get_css('.c-review-block__title::text'),
            "date": get_css('.c-review-block__date::text'),
            "user_name": get_css('.bui-avatar-block__title::text'),
            "user_country": get_css('.bui-avatar-block__subtitle::text'),
            "text": ''.join(review_box.css('.c-review__body ::text').getall()),
            "lang": review_box.css('.c-review__body::attr(lang)').get(),
        })
    return parsed


async def scrape_reviews(hotel_id: str, session) -> List[dict]:
    """scrape all reviews of a hotel"""
    async def scrape_page(page, page_size=25):  # 25 is largest possible page size for this endpoint
        url = "https://www.booking.com/reviewlist.html?" + urlencode(
            {
                "type": "total",
                # we can configure language preference
                "lang": "en-us",
                # we can configure sorting order here, in this case recent reviews are first
                "sort": "f_recent_desc",
                "cc1": "gb",
                "dist": 1,
                "pagename": hotel_id,
                "rows": page_size,
                "offset": page * page_size,
            }
        )
        return await session.get(url)

    first_page = await scrape_page(1)
    page_numbers = Selector(text=first_page.text).css(".bui-pagination__link::attr(data-page-number)").getall()
    total_pages = max((int(page) for page in page_numbers), default=1)
    other_pages = await asyncio.gather(*[scrape_page(i) for i in range(2, total_pages + 1)])

    results = []
    for response in [first_page, *other_pages]:
        results.extend(parse_reviews(response.text))
    return results

In the scraper code above, we applied what we learned earlier: we fetch the first page to extract the total page count, then scrape the remaining pages concurrently. Another thing to note is that we can tweak the default URL parameters to our liking. Above, we use a page size of 25 instead of the default 10, which means we have to make fewer requests to retrieve all the reviews.

Finally, our scraper can discover hotels, extract hotel preview data, and then scrape each hotel page for hotel information, price data, and reviews!

FAQ

To wrap up this guide, let's take a look at some frequently asked questions about web scraping Booking.com:

Is it legal to scrape Booking.com?

Yes. Booking.com's hotel data is publicly available, and we're not extracting anything personal or private. Scraping booking.com at a slow, respectful rate falls within the definition of ethical scraping.

How to change currency when scraping booking.com?

Booking.com automatically picks a currency based on the geographic location of the web scraper's IP address. The easiest way around this is to use a proxy located in a specific country.

How to scrape more than 1,000 booking.com hotels?

Like many paginated result systems, Booking.com's search returns a limited number of results. In this case, 1,000 results per query may not be enough to fully cover some broader queries. The best approach here is to split a query into several smaller ones. For example, instead of searching for "London", we can split the search by scraping each of London's neighbourhoods:

Illustration of splitting a booking.com search
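Because parse_search_page() keys the previews by hotel URL, merging the results of several narrower queries deduplicates hotels automatically. A sketch of this approach, using illustrative, hypothetical preview data:

```python
def merge_preview_batches(batches: list) -> dict:
    """combine hotel preview dicts from several queries, deduplicating by URL"""
    merged = {}
    for batch in batches:
        # previews are keyed by hotel URL, so duplicates simply overwrite
        merged.update(batch)
    return merged


# hypothetical previews from two neighbourhood-level searches:
camden = {"/hotel/gb/a.html": {"name": "Hotel A"}, "/hotel/gb/b.html": {"name": "Hotel B"}}
soho = {"/hotel/gb/b.html": {"name": "Hotel B"}, "/hotel/gb/c.html": {"name": "Hotel C"}}
```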

Booking.com Scraping Summary

In this web scraping tutorial, we built a small Booking.com scraper that uses search to discover hotel listing previews and then scrapes the hotel data and pricing information. For this, we used Python with the httpx and parsel packages.

Written by 河小马

河小马 is a distinguished digital marketing industry leader and an active member of the Advertising China forum. His expertise spans PPC advertising, domain parking, web development, affiliate marketing, and cross-border e-commerce consulting. As a seasoned software developer, he combines strong technical skills with over 13 years of experience in overseas online marketing.