How to Scrape Zoopla Real Estate Data with Python

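In this web scraping tutorial, we'll look at how to scrape Zoopla, a popular UK real estate marketplace.

We'll scrape the real estate data shown on Zoopla property pages, such as pricing information, addresses, photos and phone numbers. To scrape Zoopla properties we'll use the hidden web data scraping approach, since the website is powered by Next.js. We'll also look at how to find properties using Zoopla's search and sitemap systems so that we can collect all available property data.

Finally, we'll cover property tracking by continuously scraping for newly listed properties, which gives us an edge in real estate market awareness. We'll be using Python with a few community libraries. Let's get started!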

Why Scrape Zoopla.com?

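Zoopla.com is one of the biggest real estate websites in the United Kingdom, which makes it one of the largest public real estate datasets. It contains fields such as property prices, listing locations and sale dates, as well as general property information.

This is valuable information for market analysis, real estate industry research and a general overview of competitors.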

Available Zoopla Data Fields

We can scrape data for several popular real estate targets from Zoopla:

  • Properties for sale
  • Properties to rent
  • Estate agent information

In this guide, we'll focus on scraping property (rental and sale) data for popular data fields such as:

  • Price
  • Photos
  • Agent contact details
  • Property features
  • Property metadata and performance

For more, see this example dataset, which contains all of the fields we'll scrape in this guide:

Example scraper output

{
"id": 63422316,
"title": "3 bed maisonette for sale",
"description": "Internal:<br><br>Entrance - Access is made via its own rear entrance door from the private outdoor terrace, opening to the first floor hall.<br><br>Hall - With solid wood flooring, the carpeted staircase leading up to the second floor landing and doors to the lounge, the kitchen, bedroom three and the bathroom.<br><br>Lounge - Offering generous space for furniture with a double glazed window with wooden shutters, solid wood flooring and a feature period fireplace with a decorative surround, mantelpiece and hearth.<br><br>Kitchen/Diner - Fitted with a range of modern black wall and base units with complementing wood worktops over, two double glazed windows and tiled flooring and splashbacks. Inset one and a half stainless steel sink basin with a drainer and mixer tap and an integrated fridge-freezer and an electric oven with a countertop gas hob and overhead extractor hood, with space for further appliances and for a dining table and chairs.<br><br>Bedroom Three - Can be used as a double size bedroom or as an additional reception room or study, with a double glazed window with wooden shutters, solid wood flooring and a closed corner fireplace.<br><br>Bathroom - Modern white suite comprising a push-button WC, a wash hand basin and a panelled bath with an overhead shower and glass screen. 
Obscure double glazed window, tiled flooring and partly tiled walls.<br><br>Second Floor Landing - With a Velux window, carpeted flooring, a storage cupboard, eaves storage and doors to bedrooms one and two.<br><br>Bedroom One - Large bedroom providing ample space for furniture, with a double glazed window, carpeted flooring, a closed fireplace and a storage cupboard.<br><br>Bedroom Two - Double size bedroom with a double glazed window with wooden shutters, carpeted flooring and a storage cupboard.<br><br>External:<br><br>The property benefits from a spacious and low-maintenance private terrace to the rear first floor.<br><br>Additional information:<br><br>Council Tax Band: B<br><br>Local Authority: Sutton<br><br>Lease: 125 years from 25 March 1988<br><br>Annual Ground Rent: £50 per<br><br>This information is to be confirmed by the vendor's solicitor.<br><br>Early viewing is highly recommended due to the property being realistically priced.<br><br>Disclaimer:<br><br>These particulars, whilst believed to be accurate are set out as a general guideline and do not constitute any part of an offer or contract. Intending Purchasers should not rely on them as statements of representation of fact, but must satisfy themselves by inspection or otherwise as to their accuracy. Please note that we have not tested any apparatus, equipment, fixtures, fittings or services including gas central heating and so cannot verify they are in working order or fit for their purpose. Furthermore, Solicitors should confirm moveable items described in the sales particulars and, in fact, included in the sale since circumstances do change during the marketing or negotiations. Although we try to ensure accuracy, if measurements are used in this listing, they may be approximate. Therefore if intending Purchasers need accurate measurements to order carpeting or to ensure existing furniture will fit, they should take such measurements themselves. 
Photographs are reproduced general information and it must not be inferred that any item is included for sale with the property.<br><strong>Tenure<br></strong><br><br>To be confirmed by the Vendor’s Solicitors<br><strong>Possession<br></strong><br><br>Vacant possession upon completion<br><strong>Viewing<br></strong><br><br>Viewing strictly by appointment through The Express Estate Agency",
"url": "/for-sale/details/63422316/",
"price": "£220,000",
"type": "maisonette",
"date": "2022-12-08T02:28:04",
"category": "residential",
"section": "for-sale",
"features": [
  "*Guide Price £220,000 - £240,000*",
  "**cash buyers only**",
  "Duplex Maisonette With Private Entrance",
  "Vacant with No Onward Chain",
  "Three Double Size Bedrooms",
  "Spacious Lounge with Feature Fireplace",
  "Modern Kitchen/Diner with Appliances",
  "Modern Bathroom Suite",
  "Private Rear Terrace & Entrance Door",
  "Ideal Rental Investment"
],
"floor_plan": { "filename": null, "caption": null },
"nearby": [
  { "title": "Malmesbury Primary School", "distance": 0.3 },
  "etc. (reduced for blog)"
],
"coordinates": {
  "lat": 51.383345,
  "lng": -0.190118
},
"photos": [
  {
    "filename": "5f2cbcd9866478e716b32aa9af78e59a2c3645ce.jpg",
    "caption": null
  },
  "etc. (reduced for blog)"
],
"details": {
  "__typename": "ListingAnalyticsTaxonomy",
  "location": "Sutton",
  "regionName": "London",
  "section": "for-sale",
  "acorn": 36,
  "acornType": 36,
  "areaName": "Sutton",
  "bedsMax": 3,
  "bedsMin": 3,
  "branchId": 15566,
  "branchLogoUrl": "https://www.jingzhengli.com/wp-content/uploads/2023/06/zoopla_static_agent_logo_654815.png",
  "branchName": "Express Estate Agency",
  "brandName": "Express Estate Agency",
  "chainFree": true,
  "companyId": 18887,
  "countryCode": "gb",
  "countyAreaName": "London",
  "currencyCode": "GBP",
  "displayAddress": "Rosehill, Sutton SM1",
  "furnishedState": "",
  "groupId": null,
  "hasEpc": true,
  "hasFloorplan": true,
  "incode": "3HE",
  "isRetirementHome": false,
  "isSharedOwnership": false,
  "listingCondition": "pre-owned",
  "listingId": 63422316,
  "listingsCategory": "residential",
  "listingStatus": "for_sale",
  "memberType": "agent",
  "numBaths": 1,
  "numBeds": 3,
  "numImages": 14,
  "numRecepts": 1,
  "outcode": "SM1",
  "postalArea": "SM",
  "postTownName": "Sutton",
  "priceActual": 220000,
  "price": 220000,
  "priceMax": 220000,
  "priceMin": 220000,
  "priceQualifier": "guide_price",
  "propertyHighlight": "",
  "propertyType": "maisonette",
  "sizeSqFeet": "",
  "tenure": "leasehold",
  "zindex": 255069
},
"agency": {
  "__typename": "AgentBranch",
  "branchDetailsUri": "/find-agents/branch/express-estate-agency-manchester-15566/",
  "branchId": "15566",
  "branchResultsUri": "/for-sale/branch/express-estate-agency-manchester-15566/",
  "logoUrl": "https://www.jingzhengli.com/wp-content/uploads/2023/06/zoopla_static_agent_logo_654815.png",
  "phone": "0333 016 5458",
  "name": "Express Estate Agency",
  "memberType": "agent",
  "address": "St George's House, 56 Peter Street, Manchester",
  "postcode": "M2 3NQ"
}
}

Project Setup

In this tutorial, we'll be using Python with three community packages:

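  • httpx – an HTTP client library that lets us communicate with Zoopla.com's servers.
  • parsel – an HTML parsing library that helps us parse scraped HTML using CSS selectors or XPath.
  • jmespath – a JSON parsing library that lets us write XPath-like rules for JSON.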

These packages can be easily installed with the pip install command:

$ pip install httpx parsel jmespath

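Alternatively, feel free to swap httpx for any other HTTP client package, such as requests, since we only need basic HTTP functions, which are nearly interchangeable across libraries. As for parsel, another great alternative is the beautifulsoup package.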

Scraping Zoopla Property Data

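Let's start by looking at how to scrape property data from a single listing page.

Zoopla uses Next.js to render its pages, which generates hidden web data for each property page. Instead of parsing the HTML with traditional tools like beautifulsoup, we'll extract the hidden JSON data and parse it with jmespath.

Let's pick a random property listing as our test target. To find the hidden data, let's take a look at the page source: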

Page source of a Zoopla property page with the hidden data highlighted

We can see that a large property dataset sits in a <script id="__NEXT_DATA__"> element. Let's scrape it:

import asyncio
import json
from typing import List, Optional, TypedDict

import jmespath
from httpx import AsyncClient, Response
from parsel import Selector

session = AsyncClient(
    headers={
        # use same headers as a popular web browser (Chrome on macOS in this case)
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
    }
)

def extract_next_data(response: Response) -> Optional[dict]:
    """Find the hidden __NEXT_DATA__ JSON in the response HTML body."""
    selector = Selector(text=response.text)
    data = selector.css("script#__NEXT_DATA__::text").get()
    if not data:
        print(f"page {response.url} is not a property listing page")
        return None
    data = json.loads(data)
    return data["props"]["pageProps"]


async def scrape_properties(urls: List[str]):
    """Scrape Zoopla property pages for their hidden property data."""
    to_scrape = [session.get(url) for url in urls]
    properties = []
    for response in asyncio.as_completed(to_scrape):
        data = extract_next_data(await response)
        # skip pages without hidden data instead of failing on None
        if data and "listingDetails" in data:
            properties.append(data["listingDetails"])
    return properties

Run code

if __name__ == "__main__":
    urls = [
        "https://www.zoopla.co.uk/for-sale/details/63422872/",
        "https://www.zoopla.co.uk/for-sale/details/63422422/",
        "https://www.zoopla.co.uk/for-sale/details/63422316/",
        "https://www.zoopla.co.uk/for-sale/details/63422320/",
        "https://www.zoopla.co.uk/for-sale/details/63422282/",
        "https://www.zoopla.co.uk/for-sale/details/63422152/",
        "https://www.zoopla.co.uk/for-sale/details/63422228/",
        "https://www.zoopla.co.uk/for-sale/details/63422243/",
        "https://www.zoopla.co.uk/for-sale/details/63422274/",
        "https://www.zoopla.co.uk/for-sale/details/63422422/",
    ]
    properties = asyncio.run(scrape_properties(urls))
    print(json.dumps(properties, indent=2))

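Above, we wrote a small web scraper for Zoopla properties. Let's go over the key points of what we did here.

First, we established an httpx session with browser-like default headers to avoid being blocked. Then, to extract the hidden web data, we loaded the HTML as a parsel.Selector object and used a CSS selector to select the <script> element with id=__NEXT_DATA__. This script contains the property data in JSON format, so we load it as a Python dictionary and return the result.

This basic Python Zoopla scraper gives us the whole available property dataset, but it is full of data fields we don't need. Let's clean it up by parsing out the most important fields with JMESPath.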
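
To see the extraction step in isolation, here is a minimal offline sketch. It assumes a tiny hand-written HTML page (the SAMPLE_HTML string and its values are made up for illustration) and swaps parsel for the standard library's html.parser, so it runs without network access or extra packages. The helper names NextDataParser and extract_page_props are hypothetical, not part of any library:

```python
import json
from html.parser import HTMLParser
from typing import List, Optional

# Hypothetical, hand-written page standing in for a real Zoopla response.
SAMPLE_HTML = """
<html><body>
<h1>3 bed flat for sale</h1>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"listingDetails": {"listingId": 1, "title": "3 bed flat for sale"}}}}
</script>
</body></html>
"""


class NextDataParser(HTMLParser):
    """Collects the text found inside <script id="__NEXT_DATA__">."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_page_props(html: str) -> Optional[dict]:
    """Return props.pageProps from the hidden JSON, or None if it is absent."""
    parser = NextDataParser()
    parser.feed(html)
    raw = "".join(parser.chunks).strip()
    if not raw:
        return None
    return json.loads(raw)["props"]["pageProps"]


page_props = extract_page_props(SAMPLE_HTML)
print(page_props["listingDetails"]["title"])
```

Returning None for pages without the script element lets the caller skip non-listing pages instead of crashing on them.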

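Parsing Property JSON

Let's reduce this dataset to the most important fields using the JMESPath query language, which lets us write query paths for JSON much like XPath or CSS selectors do for HTML.

Let's update our scraper with property parsing: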

import asyncio
import json
from typing import List, Optional, TypedDict, Dict

import jmespath
from httpx import AsyncClient, Response
from parsel import Selector

session = AsyncClient(
    headers={
        # use same headers as a popular web browser (Chrome on macOS in this case)
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


class PropertyResult(TypedDict):
    """Type hint for scraped property data just so we can visualize it better"""
    id: str
    title: str
    description: str
    url: str
    price: str
    photos: List[dict]
    agency: Dict[str, str]
    features: List[str]
    ...  # and much more


def parse_property(response: Response) -> Optional[PropertyResult]:
    data = _extract_next_data(response)
    if not data:
        return
    result = jmespath.search(
        """root.{
        id: listingId,
        title: title,
        description: detailedDescription,
        url: listingUris.detail,
        price: pricing.label,
        type: propertyType,
        date: publishedOn,
        category: category,
        section: section,
        features: features.bullets,
        floor_plan: floorPlan.image.{filename:filename, caption: caption}, 
        nearby: pointsOfInterest[].{title: title, distance: distanceMiles},
        coordinates: location.coordinates.{lat:latitude, lng: longitude},
        photos: propertyImage[].{filename: filename, caption: caption},
        details: analyticsTaxonomy,
        agency: branch
    }""", {"root": data["listingDetails"]})
    return result


def _extract_next_data(response: Response) -> Optional[dict]:
    """find NEXT_DATA in given response HTML body"""
    selector = Selector(text=response.text)
    data = selector.css("script#__NEXT_DATA__::text").get()
    if not data:
        print(f"page {response.url} is not a property listing page")
        return
    data = json.loads(data)
    return data["props"]["pageProps"]

async def scrape_properties(urls: List[str]):
    """Scrape Zoopla's property pages for property data"""
    to_scrape = [session.get(url) for url in urls]
    properties = []
    for response in asyncio.as_completed(to_scrape):
        properties.append(parse_property(await response))
    return properties

Example output

{
  "id": "63422152",
  "title": "3 bed flat for sale",
  "description": "Featuring a fabulous private roof terrace and secure underground parking, this superb 3 bedroom apartment provides stylish lateral living space within a popular riverside development.<br><br>The River Thames is moments away and lots of on-site amenities can be found throughout the development, including convenience store, cafe and dentist. The shops, cafes and restaurants of Putney town centre and Wandsworth Town are all within easy reach.<br><br>Please use the reference CHPK4316397 when contacting Foxtons.",
  "url": "/for-sale/details/63422152/",
  "price": "£900,000",
  "type": "flat",
  "date": "2022-12-08T01:15:06",
  "category": "residential",
  "section": "for-sale",
  "features": [
    "Secure entrance and lift to 4th floor",
    "Sun-filled reception room with door to terrace",
    "Sleek open-plan kitchen with integrated appliances",
    "Main bedroom with private balcony and en suite",
    "2 additional good-sized bedrooms",
    "Smart main bathroom",
    "Private roof terrace with space for dining, relaxing and storage",
    "Extra wide underground parking space"
  ],
  "floor_plan": {
    "filename": null,
    "caption": null
  },
  "nearby": [
    {
      "title": "Wandsworth Riverside Quarter Pier",
      "distance": 0.1
    },
    {
      "title": "The Roche School",
      "distance": 0.2
    },
    {
      "title": "St Joseph's RC Primary School",
      "distance": 0.2
    },
    {
      "title": "Wandsworth Town",
      "distance": 0.4
    }
  ],
  "coordinates": {
    "lat": 51.461368,
    "lng": -0.197922
  },
  "photos": [
    {
      "filename": "8b8f71df67b294ad3d114708603af4146096b7dc.jpg",
      "caption": null
    },
    {
      "filename": "002e46a10979e1a6b9c691f2a7ba221ba6aca215.jpg",
      "caption": null
    },
    {
      "filename": "3c2aef9eafa4e81aca16c814fff9fdbbb6d7ea14.jpg",
      "caption": null
    },
    {
      "filename": "ffa96fada68131ee28f2ce8d6f127303af60b136.jpg",
      "caption": null
    },
    {
      "filename": "e8ec10ce0aad02214f9bdda1290459e4e461bcc5.jpg",
      "caption": null
    },
    {
      "filename": "cf1e876b476195faab1e29cca417d7fd8fa3cd49.jpg",
      "caption": null
    },
    {
      "filename": "7bbb8e63ff8483ae3af0709d8d40de8b3bfeda26.jpg",
      "caption": null
    },
    {
      "filename": "4438e37536d2e82ef430cdf21f16ab4da809d0bc.jpg",
      "caption": null
    },
    {
      "filename": "6bf71a57c2400c2c4375894dad7c5361afa5dd98.jpg",
      "caption": null
    },
    {
      "filename": "13762722024be732299a94c94929ed8025e77bb1.jpg",
      "caption": null
    },
    {
      "filename": "06e89441cfd613b9fd186ceeb4b2b9ab320e4295.jpg",
      "caption": null
    },
    {
      "filename": "5a3a7393c90974e02a3cbfaa1809814685d26bb8.jpg",
      "caption": null
    },
    {
      "filename": "6c04fb22d82715221f6dfc53ff682d4b361cd246.jpg",
      "caption": null
    },
    {
      "filename": "1cf7e428cf5b2ac9f12b8ec09886e4804c85c584.jpg",
      "caption": null
    },
    {
      "filename": "68ebc853bce8112c41df4dc6b92b042cd682f9e1.jpg",
      "caption": null
    },
    {
      "filename": "5218cd282da9ea0213d78d639ef07fca68b7aba1.jpg",
      "caption": null
    }
  ],
  "details": {
    "__typename": "ListingAnalyticsTaxonomy",
    "location": "London",
    "regionName": "London",
    "section": "for-sale",
    "acorn": 15,
    "acornType": 15,
    "areaName": "London",
    "bedsMax": 3,
    "bedsMin": 3,
    "branchId": 2888,
    "branchLogoUrl": "https://www.jingzhengli.com/wp-content/uploads/2023/06/zoopla_static_agent_logo_592983.png",
    "branchName": "Foxtons - Putney",
    "brandName": "Foxtons",
    "chainFree": false,
    "companyId": 1370,
    "countryCode": "gb",
    "countyAreaName": "London",
    "currencyCode": "GBP",
    "displayAddress": "Knightley Walk, Wandsworth, London SW18",
    "furnishedState": "",
    "groupId": 267,
    "hasEpc": false,
    "hasFloorplan": true,
    "incode": "1HB",
    "isRetirementHome": false,
    "isSharedOwnership": false,
    "listingCondition": "pre-owned",
    "listingId": 63422152,
    "listingsCategory": "residential",
    "listingStatus": "for_sale",
    "memberType": "agent",
    "numBaths": 2,
    "numBeds": 3,
    "numImages": 16,
    "numRecepts": 1,
    "outcode": "SW18",
    "postalArea": "SW",
    "postTownName": "London",
    "priceActual": 900000,
    "price": 900000,
    "priceMax": 900000,
    "priceMin": 900000,
    "priceQualifier": "",
    "propertyHighlight": "",
    "propertyType": "flat",
    "sizeSqFeet": "1022",
    "tenure": null,
    "zindex": 628988
  },
  "agency": {
    "__typename": "AgentBranch",
    "branchDetailsUri": "/find-agents/branch/foxtons-putney-london-2888/",
    "branchId": "2888",
    "branchResultsUri": "/for-sale/branch/foxtons-putney-london-2888/",
    "logoUrl": "https://www.jingzhengli.com/wp-content/uploads/2023/06/zoopla_static_agent_logo_592983.png",
    "phone": "020 3542 2189",
    "name": "Foxtons - Putney",
    "memberType": "agent",
    "address": "175 Putney High Street, London",
    "postcode": "SW15 1RT"
  }
}

Here, we've updated our Python Zoopla scraper with JSON parsing logic by defining the parsing paths with JMESPath.

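Finding Zoopla Properties

To find property listings on Zoopla we have two options: scrape the sitemaps to find all listed properties, or use Zoopla's search system to scrape listings by location.

To scrape Zoopla's search, let's first look at how it works. If we submit a search query such as "Islington, London", we can see that we are redirected to a URL containing the search results: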

https://www.zoopla.co.uk/for-sale/property/london/islington/?q=Islington%2C%20London&search_source=home

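But how do we create this URL from a search query? Let's take a look at what the web page does when we submit a search:

Using Chrome developer tools to find Zoopla's search API

We can see that Zoopla uses a backend API that redirects us from the query to the search page: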

https://www.zoopla.co.uk/search/?view_type=list&section=for-sale&q=Islington%2C%20London&geo_autocomplete_identifier=&search_source=home
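
The query string of this URL can be assembled from a plain search phrase with the standard library. The sketch below is one possible way to do it; build_search_url is a hypothetical helper name, and the parameter set simply mirrors the URL observed above:

```python
from urllib.parse import quote, urlencode


def build_search_url(query: str, section: str = "for-sale") -> str:
    """Build the search URL, percent-encoding every parameter value."""
    params = {
        "view_type": "list",
        "section": section,
        "q": query,
        "geo_autocomplete_identifier": "",
        "search_source": "home",
    }
    # quote_via=quote encodes spaces as %20 (the default would use "+")
    return "https://www.zoopla.co.uk/search/?" + urlencode(params, quote_via=quote)


print(build_search_url("Islington, London"))
```

Encoding the query this way keeps characters such as commas, spaces or ampersands in a location name from corrupting the URL.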

Let's replicate this in our scraper:

import math
from typing import Literal

...

async def find_properties(query: str, query_type: Literal["for-sale", "to-rent"] = "for-sale"):
    # scrape first results page to start:
    first_page = await session.get(
        url=f"https://www.zoopla.co.uk/search/?view_type=list&section={query_type}&q={query}&geo_autocomplete_identifier=&search_source=home&sort=newest_listings",
        follow_redirects=True,
    )
    # extract next.js data and the listings of the first page
    # (extract_next_data is the helper from our first scraper)
    data = extract_next_data(first_page)["initialProps"]["searchResults"]
    listings = data["listings"]["regular"]
    if not listings:
        # nothing found, so there are no further pages to scrape
        return listings
    # then extract total pages, capped at the highest page number Zoopla serves
    total_results = data["pagination"]["totalResults"]
    total_pages = math.ceil(total_results / len(listings))
    total_pages = min(total_pages, data["pagination"]["pageNumberMax"])

    # scrape remaining pages concurrently:
    print(f"total {total_results} results, {total_pages} pages")
    other_pages = [session.get(url=str(first_page.url) + f"&pn={page}") for page in range(2, total_pages + 1)]
    for response in asyncio.as_completed(other_pages):
        data = extract_next_data(await response)["initialProps"]["searchResults"]
        listings.extend(data["listings"]["regular"])
    return listings

Run code

if __name__ == "__main__":
    results = asyncio.run(find_properties("Islington, London"))
    print(results)

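To explain the Zoopla scraper above: we first send our query to Zoopla's search API, which redirects us to the first results page. Then we use the same technique as for property pages to extract the hidden web data. This data contains the total result count, which we use to scrape the remaining pages of the result pagination.

We can further enrich the search results by scraping each property page with the property scraper we wrote earlier.

Next, let's look at another discovery method, one that can be used to find all properties on Zoopla: sitemaps.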
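
The pagination arithmetic is easy to get wrong, so here it is as a standalone sketch. plan_pages is a hypothetical helper, and the numbers in the example calls are invented rather than taken from real Zoopla responses:

```python
import math
from typing import List


def plan_pages(total_results: int, page_size: int, page_number_max: int) -> List[int]:
    """Return the page numbers still to scrape after the first page."""
    if page_size <= 0:
        return []  # an empty first page means there is nothing to paginate
    total_pages = math.ceil(total_results / page_size)
    # the site may refuse to serve pages beyond a fixed maximum
    total_pages = min(total_pages, page_number_max)
    return list(range(2, total_pages + 1))


print(plan_pages(total_results=110, page_size=25, page_number_max=40))
print(len(plan_pages(total_results=5000, page_size=25, page_number_max=40)))
```

The first call yields pages 2 through 5, since 110 results at 25 per page need five pages. The second call is capped at page 40, leaving 39 pages to fetch after the first one.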

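Scraping Zoopla Sitemaps

Sitemaps are collections of files that contain the URLs of a website's pages, whether those are property listings, blog posts or standalone pages.

For our Python Zoopla scraper to find all properties using sitemaps, we first have to find the sitemaps themselves. For that, we can check the /robots.txt endpoint, which contains various instructions for web scrapers: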

Sitemap: https://www.zoopla.co.uk/xmlsitemap/sitemap/index.xml.gz

This central sitemap acts as a hub for all of the other sitemaps, which are categorized by topic:

<sitemap>
  <loc>https://www.zoopla.co.uk/xmlsitemap/sitemap/for_sale_details_001.xml.gz</loc>
  <lastmod>2022-12-08T09:25:08+00:00</lastmod>
</sitemap>
<sitemap>
  <loc>https://www.zoopla.co.uk/xmlsitemap/sitemap/to_rent_details_001.xml.gz</loc>
  <lastmod>2022-12-08T09:25:08+00:00</lastmod>
</sitemap>
<sitemap>
  <loc>https://www.zoopla.co.uk/xmlsitemap/sitemap/for_sale_flats_001.xml.gz</loc>
  <lastmod>2022-12-08T09:25:08+00:00</lastmod>
</sitemap>
...

Each sitemap is limited to 50,000 URLs, which is why they are split into several parts.

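For example, to scrape all rental properties we can find all of their URLs by scraping the to_rent_ sitemaps.

Let's take a quick look at how to scrape a sitemap file with Python: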

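import asyncio
import gzip

from httpx import AsyncClient
from parsel import Selector

session = AsyncClient()

async def scrape_feed(url):
    resp = await session.get(url)
    body = resp.content
    # .xml.gz files arrive gzip-compressed; gzip data starts with bytes 1f 8b
    if body[:2] == b"\x1f\x8b":
        body = gzip.decompress(body)
    selector = Selector(text=body.decode("utf-8"))
    # every <loc> element holds one URL
    return selector.xpath("//loc/text()").getall()

# example run:
urls = asyncio.run(
    scrape_feed("https://www.zoopla.co.uk/xmlsitemap/sitemap/to_rent_details_001.xml.gz")
)
print(len(urls))

  • Sitemap files are sometimes gzip-compressed. Decompress the content with the gzip.decompress() function before passing it to the selector.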

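Since sitemaps are XML files, we can parse them with the same tools we use for parsing HTML. In the example above, we retrieve the sitemap page and extract the URLs using parsel's XPath selectors.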
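
The decompress-and-parse step can also be tried offline. The sketch below uses only the standard library (gzip and xml.etree.ElementTree instead of parsel) on a made-up sitemap index with example.com URLs. parse_locs is a hypothetical helper name:

```python
import gzip
import xml.etree.ElementTree as ET
from typing import List

# XML namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# Hypothetical sitemap index, compressed the way a .xml.gz file would be.
SAMPLE_INDEX = gzip.compress(
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    b"<sitemap><loc>https://example.com/sitemap/for_sale_details_001.xml.gz</loc></sitemap>"
    b"<sitemap><loc>https://example.com/sitemap/to_rent_details_001.xml.gz</loc></sitemap>"
    b"<sitemap><loc>https://example.com/sitemap/to_rent_details_002.xml.gz</loc></sitemap>"
    b"</sitemapindex>"
)


def parse_locs(body: bytes) -> List[str]:
    """Return every <loc> URL from a sitemap, gzip-compressed or not."""
    if body[:2] == b"\x1f\x8b":  # gzip magic bytes
        body = gzip.decompress(body)
    root = ET.fromstring(body)
    return [el.text.strip() for el in root.iter(SITEMAP_NS + "loc") if el.text]


all_sitemaps = parse_locs(SAMPLE_INDEX)
# keep only the rental property sitemaps
to_rent = [url for url in all_sitemaps if "/to_rent_details_" in url]
print(to_rent)
```

Filtering the index by file name prefix is how a scraper would narrow a full crawl down to one category, such as rentals.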

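Tracking New Zoopla Listings

Now that we know how to find property listings, we can also track new Zoopla listings by scraping either the search or the sitemaps.

If we want to keep our whole listing dataset up to date, we can track the new_home_details_x sitemaps found in Zoopla's sitemap index, which we covered earlier.

However, these sitemaps are updated only once a day. What if we want to know about new listings as soon as possible? For that, we can scrape search queries sorted by "most recent", which is exactly how we coded our search scraper.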
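
One way to turn repeated search scrapes into a tracker is to remember which listing IDs have already been seen. The following is a minimal sketch under the assumption that each search result carries a listingId key (the key name is an assumption, so check it against real search data). pick_new_listings is a hypothetical helper:

```python
from typing import Dict, Iterable, List, Set


def pick_new_listings(listings: Iterable[Dict], seen_ids: Set[str]) -> List[Dict]:
    """Return listings whose ID was not seen before and remember their IDs."""
    fresh = []
    for listing in listings:
        listing_id = str(listing["listingId"])  # assumed key name
        if listing_id not in seen_ids:
            seen_ids.add(listing_id)
            fresh.append(listing)
    return fresh


seen: Set[str] = set()
first_run = [{"listingId": 101}, {"listingId": 100}]   # made-up results
second_run = [{"listingId": 102}, {"listingId": 101}]  # one new, one repeat

print(len(pick_new_listings(first_run, seen)))
print(pick_new_listings(second_run, seen))
```

On the second run only listing 102 is reported, because 101 was already recorded. In a long-running tracker the seen set would be persisted to disk or a database between runs.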

FAQ

To wrap up this guide, let's take a look at some frequently asked questions about scraping data from Zoopla:

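Is it legal to scrape Zoopla.com?

Yes. Zoopla's data is publicly available, and scraping Zoopla at slow, respectful rates falls under the definition of ethical scraping. That said, be mindful of GDPR compliance in the EU when storing personal data, such as agents' personal details (names, phone numbers).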
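
What "slow and respectful" means in code is a judgment call. One common pattern is to cap concurrency and pause between requests, sketched below with asyncio only. fetch_politely and fake_fetch are hypothetical names, the default limits are arbitrary assumptions, and the stub stands in for a real session.get() call:

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_politely(
    urls: List[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 2,
    delay: float = 1.0,
) -> List[str]:
    """Fetch URLs with at most max_concurrency requests in flight,
    pausing `delay` seconds after each request before freeing its slot."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def worker(url: str) -> str:
        async with semaphore:
            result = await fetch(url)
            await asyncio.sleep(delay)
            return result

    # gather() returns results in the same order as the input URLs
    return await asyncio.gather(*(worker(url) for url in urls))


async def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request; returns a dummy body."""
    return f"body of {url}"


pages = asyncio.run(fetch_politely(["a", "b", "c"], fake_fetch, delay=0.01))
print(pages)
```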

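Is there a Zoopla API?

Yes, although it is private and limited to a specific set of data fields (for example, it doesn't include agent contact details). Fortunately, as covered in this article, we can scrape Zoopla with Python.

How to crawl Zoopla.com?

To crawl Zoopla, we can adapt the scraping techniques covered in this article. In particular, the recommended/similar properties data field can be used to develop crawling logic. That said, with Zoopla's extensive sitemap system crawling is unnecessary, as we can scrape all properties directly.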
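
For readers who still want to experiment with crawling, the core of it is a breadth-first traversal that follows "similar property" links while skipping pages already visited. The sketch below runs on a made-up link graph. The crawl function and the FAKE_SIMILAR data are hypothetical, and a real crawler would replace the lookup with a scrape of each property page:

```python
from collections import deque
from typing import Callable, Dict, List, Set

# Hypothetical link graph: listing URL -> URLs of its "similar properties".
FAKE_SIMILAR: Dict[str, List[str]] = {
    "/for-sale/details/1/": ["/for-sale/details/2/", "/for-sale/details/3/"],
    "/for-sale/details/2/": ["/for-sale/details/1/", "/for-sale/details/4/"],
    "/for-sale/details/3/": [],
    "/for-sale/details/4/": ["/for-sale/details/2/"],
}


def crawl(start: str, get_similar: Callable[[str], List[str]], limit: int = 100) -> List[str]:
    """Breadth-first crawl that visits each listing once, up to `limit` pages."""
    seen: Set[str] = {start}
    queue = deque([start])
    visited: List[str] = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for next_url in get_similar(url):
            if next_url not in seen:
                seen.add(next_url)
                queue.append(next_url)
    return visited


order = crawl("/for-sale/details/1/", lambda url: FAKE_SIMILAR.get(url, []))
print(order)
```

The seen set is what prevents the crawler from looping forever when listings point back at each other.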

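Zoopla Scraping Summary

In this guide, we wrote a Zoopla scraper for real estate data using nothing but Python and a few community packages: httpx, parsel and jmespath.

To scrape property data, we used parsel to extract the data hidden in an HTML script element. We then cleaned it up, parsing out the most important fields with the JMESPath query language.

To find properties to scrape, we also explored how to scrape Zoopla's sitemaps and search system, and covered how to use search scraping to track new properties as they are listed.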

Written by 河小马

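河小马 is a prominent leader in the digital marketing industry and a key member of the 广告中国 forum. His expertise spans PPC advertising, domain parking, website development, affiliate marketing and cross-border e-commerce consulting. A veteran software developer, he combines strong technical skills with more than 13 years of experience in overseas online marketing.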