How to Scrape Idealista.com Real Estate Data with Python

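In this web scraping tutorial, we'll scrape idealista.com, the largest real estate marketplace in Spain, Portugal and Italy.

In this guide, we'll explore real estate data scraping by taking a look at Idealista.com. We'll collect common property data points such as price, address, photos and agent phone numbers.

As far as web scraping goes, Idealista.com is a traditional scraping target. To scrape it, we'll cover popular web scraping techniques used in Python, such as HTML parsing with CSS selectors and concurrent requests with asyncio.

Finally, we'll also cover tracking to scrape newly listed properties, giving us an edge in real estate discovery and bidding.

In this article, we'll focus on the Spanish version of the website (Idealista.com), though the Italian and Portuguese versions work the same way, so our scraping code should apply to those as well.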

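Why Scrape Idealista.com?

Idealista.com is one of the biggest real estate websites in Spain (as well as Italy and Portugal), which makes it one of the largest public real estate datasets for these regions. It contains fields such as property prices, listing locations and sale dates, as well as general property information.

This is valuable information for market analysis, housing industry research and a general overview of competitors.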

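Project Setup

In this tutorial, we'll be using Python with two community packages:

  • httpx – an HTTP client library that lets us communicate with Idealista.com's servers
  • parsel – an HTML parsing library that helps us parse the scraped HTML files using CSS selectors and XPath selectors.

These packages can be easily installed with the pip install command: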

$ pip install httpx parsel

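Alternatively, feel free to swap httpx for any other HTTP client package such as requests, since we only need basic HTTP functions, which are nearly interchangeable across libraries. As for parsel, another good alternative is the beautifulsoup package.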

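Scraping Idealista Property Data

Let's start by looking at how to scrape a single property on Idealista. In later sections, we'll also see how to find properties and scrape them with this property scraper.

First, let's look at a listing page to see where all of the information is stored on the page. Let's pick a random property listing, for example: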

idealista.com/en/inmueble/94156485/

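To parse data on Idealista we'll use CSS selectors, so let's mark the fields we want to scrape:

Screenshot of an Idealista property page with markup

In this example, we'll scrape the fields highlighted in blue

Idealista is a plain HTML website with very convenient style markup that we can take advantage of in our scraper. For example, if we right-click the price and inspect the HTML element, we can see how clean the HTML structure is:

Illustration of the Idealista page source

We can see that all data points sit under clearly named classes, such as info-data-price for the price or main-info__title-main for the property title.

Let's scrape it: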

import asyncio
import json
import re
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urljoin

import httpx
from parsel import Selector
from typing_extensions import TypedDict

# Establish a persistent HTTPX session with browser-like headers to avoid blocking
BASE_HEADERS = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "accept-language": "en-US,en;q=0.9",
    "accept-encoding": "gzip, deflate, br",
}
session = httpx.AsyncClient(headers=BASE_HEADERS, follow_redirects=True)

# type hints for the expected results so we can visualize our scraper more easily:
class PropertyResult(TypedDict):
    url: str
    title: str
    location: str
    price: int
    currency: str
    description: str
    updated: str
    features: Dict[str, List[str]]
    images: Dict[str, List[str]]
    plans: List[str]


def parse_property(response: httpx.Response) -> PropertyResult:
    """parse Idealista.com property page"""
    # load response's HTML tree for parsing:
    selector = Selector(text=response.text)
    css = lambda x: selector.css(x).get("").strip()
    css_all = lambda x: selector.css(x).getall()

    data = {}
    # Meta data
    data["url"] = str(response.url)

    # Basic information
    data['title'] = css("h1 .main-info__title-main::text")
    data['location'] = css(".main-info__title-minor::text")
    data['currency'] = css(".info-data-price::text")
    data['price'] = int(css(".info-data-price span::text").replace(",", ""))
    data['description'] = "\n".join(css_all("div.comment ::text")).strip()
    data["updated"] = selector.xpath(
        "//p[@class='stats-text']"
        "[contains(text(),'updated on')]/text()"
    ).get("").split(" on ")[-1]

    # Features
    data["features"] = {}
    #  first we extract each feature block like "Basic Features" or "Amenities"
    for feature_block in selector.css(".details-property-h3"):
        # then for each block we extract all bullet points underneath them
        label = feature_block.xpath("text()").get()
        features = feature_block.xpath("following-sibling::div[1]//li")
        data["features"][label] = [
            ''.join(feat.xpath(".//text()").getall()).strip()
            for feat in features
        ]

    # Images
    # the images are tucked away in a javascript variable.
    # We can use regular expressions to find the variable and parse it as a dictionary:
    image_data = re.findall(
        r"fullScreenGalleryPics\s*:\s*(\[.+?\]),",
        response.text
    )[0]
    # we also need to replace unquoted keys to quoted keys (i.e. title -> "title"):
    images = json.loads(re.sub(r'(\w+?):([^/])', r'"\1":\2', image_data))
    data['images'] = defaultdict(list)
    data['plans'] = []
    for image in images:
        url = urljoin(str(response.url), image['imageUrl'])
        if image['isPlan']:
            data['plans'].append(url)
        else:
            data['images'][image['tag']].append(url)
    return data


async def scrape_properties(urls: List[str]) -> List[PropertyResult]:
    """Scrape Idealista.com properties"""
    properties = []
    to_scrape = [session.get(url) for url in urls]
    # tip: asyncio.as_completed allows concurrent scraping - super fast!
    for response in asyncio.as_completed(to_scrape):
        response = await response
        if response.status_code != 200:
            print(f"can't scrape property: {response.url}")
            continue
        properties.append(parse_property(response))
    return properties

Run code and example output

async def run():
    urls = ["https://www.idealista.com/en/inmueble/97028172/"]
    data = await scrape_properties(urls)
    print(json.dumps(data, indent=2, ensure_ascii=False))

if __name__ == "__main__":
    asyncio.run(run())

This will produce a dataset similar to this:

[
  {
    "title": "Penthouse for sale in La Dreta de l'Eixample",
    "location": "Eixample, Barcelona",
    "price": 5200000,
    "currency": "€",
    "description": "This stunning penthouse hosts 269 m2 distributed across two floors and a turret with 360º exposures. With straight access from the main lift, we walk through a hall that leads to a large central space composed by the living area and a dining room and an access to a terrace at the same level.  A full equipped and red lacquered kitchen, is directly connected to the dining room and features a large window framing Gaudi's masterpiece, Sagrada Familia. On the same floor there are three bedrooms, one en-suite and two double bedrooms with their own bathroom. All the rooms are exterior facing and are surrounded by terraces. Moreover, oversize windows allow for abundant light to stream across the interiors with high-ceilings. \nOn the upper floor we find a room with access to 200 m2 of terraces hosting chill-out areas, a swimming pool and a jacuzzi.  In addition, an interior spiral staircase on the same floor, leads to a turret on a third level spanning 360º views over Barcelona. \nThe penthouse is well preserved with high quality finishes, air conditioning and heating, but it also offers the opportunity have the interiors renovated to contemporary standards, to convert it into one-of-a-kind piece in Barcelona city. \nContact us for more information or to arrange a viewing.",
    "features": {
      "Basic features": [
        "367 m² built",
        "5 bedrooms",
        "4 bathrooms",
        "Terrace",
        "Second hand/good condition",
        "Fitted wardrobes",
        "Built in 1954"
      ],
      "Building": [
        "exterior",
        "With lift"
      ],
      "Amenities": [
        "Air conditioning",
        "Swimming pool"
      ],
      "Energy performance certificate": [
        "Not indicated"
      ]
    },
    "updated": "2 November",
    "url": "https://www.idealista.com/en/inmueble/97028172/",
    "images": {
      "Communal areas": [
        "https://www.idealista.com/inmueble/97028172/foto/1/",
        "https://www.idealista.com/inmueble/97028172/foto/3/",
        "https://www.idealista.com/inmueble/97028172/foto/5/",
        "https://www.idealista.com/inmueble/97028172/foto/6/",
        "https://www.idealista.com/inmueble/97028172/foto/9/",
        "https://www.idealista.com/inmueble/97028172/foto/10/",
        "https://www.idealista.com/inmueble/97028172/foto/11/"
      ],
      "Swimming pool": [
        "https://www.idealista.com/inmueble/97028172/foto/2/",
        "https://www.idealista.com/inmueble/97028172/foto/4/",
        "https://www.idealista.com/inmueble/97028172/foto/7/",
        "https://www.idealista.com/inmueble/97028172/foto/8/"
      ],
      "Views": [
        "https://www.idealista.com/inmueble/97028172/foto/12/",
        "https://www.idealista.com/inmueble/97028172/foto/28/",
        "https://www.idealista.com/inmueble/97028172/foto/48/"
      ],
      "Living room": [
        "https://www.idealista.com/inmueble/97028172/foto/13/",
        "https://www.idealista.com/inmueble/97028172/foto/14/",
        "https://www.idealista.com/inmueble/97028172/foto/16/",
        "https://www.idealista.com/inmueble/97028172/foto/17/",
        "https://www.idealista.com/inmueble/97028172/foto/18/",
        "https://www.idealista.com/inmueble/97028172/foto/19/"
      ],
      "Dining room": [
        "https://www.idealista.com/inmueble/97028172/foto/15/",
        "https://www.idealista.com/inmueble/97028172/foto/25/"
      ],
      "Terrace": [
        "https://www.idealista.com/inmueble/97028172/foto/20/",
        "https://www.idealista.com/inmueble/97028172/foto/21/",
        "https://www.idealista.com/inmueble/97028172/foto/22/",
        "https://www.idealista.com/inmueble/97028172/foto/24/",
        "https://www.idealista.com/inmueble/97028172/foto/36/",
        "https://www.idealista.com/inmueble/97028172/foto/40/",
        "https://www.idealista.com/inmueble/97028172/foto/41/",
        "https://www.idealista.com/inmueble/97028172/foto/42/"
      ],
      "Bedroom": [
        "https://www.idealista.com/inmueble/97028172/foto/23/",
        "https://www.idealista.com/inmueble/97028172/foto/31/",
        "https://www.idealista.com/inmueble/97028172/foto/34/",
        "https://www.idealista.com/inmueble/97028172/foto/35/",
        "https://www.idealista.com/inmueble/97028172/foto/38/",
        "https://www.idealista.com/inmueble/97028172/foto/39/",
        "https://www.idealista.com/inmueble/97028172/foto/43/"
      ],
      "Kitchen": [
        "https://www.idealista.com/inmueble/97028172/foto/26/",
        "https://www.idealista.com/inmueble/97028172/foto/27/",
        "https://www.idealista.com/inmueble/97028172/foto/29/",
        "https://www.idealista.com/inmueble/97028172/foto/30/"
      ],
      "Bathroom": [
        "https://www.idealista.com/inmueble/97028172/foto/32/",
        "https://www.idealista.com/inmueble/97028172/foto/37/",
        "https://www.idealista.com/inmueble/97028172/foto/44/"
      ],
      "Office": [
        "https://www.idealista.com/inmueble/97028172/foto/33/",
        "https://www.idealista.com/inmueble/97028172/foto/46/"
      ],
      "Staircase": [
        "https://www.idealista.com/inmueble/97028172/foto/45/",
        "https://www.idealista.com/inmueble/97028172/foto/47/"
      ],
      "Reception": [
        "https://www.idealista.com/inmueble/97028172/foto/49/"
      ]
    },
    "plans": [
      "https://www.idealista.com/inmueble/97028172/foto/50/",
      "https://www.idealista.com/inmueble/97028172/foto/51/"
    ]
  }
]

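In this demo, we used a few CSS and XPath selectors with parsel to extract property details such as price, description and features.

Images, however, are where things get a bit more complicated. For image carousels, many websites use JavaScript to generate dynamic HTML on demand. Idealista is no exception: it hides all image URLs in a JavaScript variable and then displays them using JavaScript.
To scrape it, we used a regular expression pattern to find the hidden JavaScript variable, then loaded it into Python objects and sorted the entries into images and floor plans.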
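
As a minimal sketch of the regular expression step described above, the snippet below runs the same two patterns on a hand-written sample. The SAMPLE_HTML fragment, the adMultimediasInfo wrapper name and the extract_gallery helper are assumptions made up for illustration; the real page markup may differ.

```python
import json
import re

# Hypothetical page fragment shaped like the JavaScript variable described above.
SAMPLE_HTML = """
<script>
var adMultimediasInfo = {
    fullScreenGalleryPics: [{imageUrl: "/inmueble/1/foto/1/", tag: "Kitchen", isPlan: false}, {imageUrl: "/inmueble/1/foto/2/", tag: "Plan", isPlan: true}],
    other: 1
};
</script>
"""


def extract_gallery(html: str) -> list:
    """Find the gallery array in the page source and load it as Python objects."""
    # non-greedy capture of the array literal that follows the variable name
    match = re.search(r"fullScreenGalleryPics\s*:\s*(\[.+?\]),", html)
    if match is None:
        return []
    # quote the bare keys (imageUrl -> "imageUrl") so the text becomes valid JSON;
    # the [^/] guard skips the colon of an absolute URL such as https://
    as_json = re.sub(r'(\w+?):([^/])', r'"\1":\2', match.group(1))
    return json.loads(as_json)


if __name__ == "__main__":
    for picture in extract_gallery(SAMPLE_HTML):
        kind = "plan" if picture["isPlan"] else "image"
        print(kind, picture["tag"], picture["imageUrl"])
```

Returning an empty list when the variable is missing keeps the parser from crashing on pages that have no gallery.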

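For the scraping itself, we used httpx's asynchronous functionality together with asyncio.as_completed to schedule multiple properties concurrently, which makes our scraper very fast!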
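
To illustrate why asyncio.as_completed speeds things up, here is a small self-contained sketch that swaps the real HTTP requests for timed sleeps. The fake_fetch helper and the page names are hypothetical stand-ins for session.get() and real URLs.

```python
import asyncio
from typing import List


async def fake_fetch(url: str, delay: float) -> str:
    """Stand-in for session.get(): sleeps instead of doing network I/O."""
    await asyncio.sleep(delay)
    return url


async def fetch_all() -> List[str]:
    jobs = [
        fake_fetch("page-1", 0.3),
        fake_fetch("page-2", 0.1),
        fake_fetch("page-3", 0.2),
    ]
    finished = []
    # as_completed hands back awaitables in completion order, not submission order,
    # so all three "requests" wait at the same time
    for next_done in asyncio.as_completed(jobs):
        finished.append(await next_done)
    return finished


if __name__ == "__main__":
    # the job with the shortest delay finishes first
    print(asyncio.run(fetch_all()))
```

Because the waits overlap, the whole batch takes roughly as long as the slowest job rather than the sum of all delays.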

Next, let's look at how to expand this scraper by implementing discovery functionality.

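Finding Idealista Properties

There are several ways to find properties listed on Idealista. The most popular and reliable one is exploring by area. In this section, we'll look at how to scrape property listings with a bit of crawling: we'll explore the location directory.

To find the location directory, we can scroll to the bottom of the page:

Screenshot of the Idealista location directory page
The location directory is at the bottom of the page.

Each link leads to a province listing URL, which in turn leads to area listing URLs. We can easily scrape these with the same CSS selector technique we used before: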

def parse_province(response: httpx.Response) -> List[str]:
    """parse province page for area search urls"""
    selector = Selector(text=response.text)
    urls = selector.css("#location_list li>a::attr(href)").getall()
    return [urljoin(str(response.url), url) for url in urls]


async def scrape_provinces(urls: List[str]) -> List[str]:
    """
    Scrape province pages like:
    https://www.idealista.com/en/venta-viviendas/malaga-provincia/con-chalets/municipios
    for search page urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    """
    to_scrape = [session.get(url) for url in urls]
    search_urls = []
    for response in asyncio.as_completed(to_scrape):
        search_urls.extend(parse_province(await response))
    return search_urls

Run code and example output

async def run():
    data = await scrape_provinces([
        "https://www.idealista.com/en/venta-viviendas/balears-illes/con-chalets/municipios"
    ])
    print(json.dumps(data, indent=2))

if __name__ == "__main__":
    asyncio.run(run())

This will produce a dataset similar to this:

[
  "https://www.idealista.com/en/venta-viviendas/alaior-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/alaro-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/alcudia-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/algaida-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/andratx-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/ariany-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/arta-balears-illes/con-chalets/",
  "https://www.idealista.com/en/venta-viviendas/santa-maria-del-cami-balears-illes/con-chalets/",
  ...
]

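This scraper will scrape all area pages of the given provinces. To discover all property listings, all we have to do is scrape every province. Next, let's scrape the search result pages themselves: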

import math


def parse_search(response: httpx.Response) -> List[str]:
    """Parse search result page for 30 listings"""
    selector = Selector(text=response.text)
    urls = selector.css("article.item .item-link::attr(href)").getall()
    return [urljoin(str(response.url), url) for url in urls]


async def scrape_search(url: str, paginate=True) -> List[str]:
    """
    Scrape search urls like:
    https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/
    for property urls
    """
    first_page = await session.get(url)
    property_urls = parse_search(first_page)
    if not paginate:
        return property_urls
    # the total result count is read from the page heading
    first_page_selector = Selector(text=first_page.text)
    total_results = first_page_selector.css("h1#h1-container").re(": (.+) houses")[0]
    total_pages = math.ceil(int(total_results.replace(",", "")) / 30)
    if total_pages > 60:
        print(f"search contains more than max page limit ({total_pages}/60)")
        total_pages = 60
    print(f"scraping {total_pages} of search results concurrently")
    to_scrape = [
        session.get(str(first_page.url) + f"pagina-{page}.htm")
        for page in range(2, total_pages + 1)
    ]
    for response in asyncio.as_completed(to_scrape):
        property_urls.extend(parse_search(await response))
    return property_urls

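To scrape paginated content like the area result pages, we first scrape the first page to extract the total number of results. Then we can scrape the remaining pages concurrently and retrieve all listings in a few seconds!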
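
The page arithmetic above can be sketched on its own, without any network access. The build_page_urls helper is hypothetical; the page size of 30, the cap of 60 pages and the pagina-N.htm suffix are taken from the scraper code above and are assumptions about how the site behaves.

```python
import math
from typing import List

RESULTS_PER_PAGE = 30  # page size assumed by the search scraper above
MAX_PAGES = 60         # page cap assumed by the search scraper above


def build_page_urls(search_url: str, total_results: int) -> List[str]:
    """Return the URLs of pages 2..N for a search with `total_results` listings."""
    total_pages = min(math.ceil(total_results / RESULTS_PER_PAGE), MAX_PAGES)
    base = search_url.rstrip("/") + "/"  # tolerate a missing trailing slash
    # page 1 is the search URL itself, so numbering starts at 2
    return [f"{base}pagina-{page}.htm" for page in range(2, total_pages + 1)]


if __name__ == "__main__":
    example = "https://www.idealista.com/en/venta-viviendas/marbella-malaga/con-chalets/"
    for page_url in build_page_urls(example, total_results=95):
        print(page_url)
```

With 95 results this yields pages 2 through 4, since 95 / 30 rounds up to 4 pages and the first page has already been scraped.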


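With this discovery scraper and our earlier property scraper, we can collect all existing property data on Idealista.com. But what if we want to be the first to know about new property listings? Next, let's look at how to track new listings on Idealista.

Tracking New Idealista Listings

To track new listings we can rely on Idealista's result sorting and reuse our search scraper. Every search results page on Idealista can be sorted by "most recent", and we can scrape it continuously to be the first to know when a new property is listed.

For example, let's look at properties in Eixample, Barcelona: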

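Screenshot of the Idealista most recent listings page

If we click the "most recent" button, we can see that each results page can be sorted via the ordenado-por URL parameter, which in the case of "most recent" is ordenado-por=fecha-publicacion-desc. Let's take advantage of this fact and build a tracker scraper.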
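
As a small sketch of how that parameter could be attached to any search URL, the snippet below uses only the standard library. The sort_by_newest helper is a hypothetical name; the parameter name and value are the ones observed above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_by_newest(search_url: str) -> str:
    """Return `search_url` with the sort order set to most recently published."""
    parts = urlsplit(search_url)
    query = dict(parse_qsl(parts.query))
    # overwrite any existing sort order instead of appending a second one
    query["ordenado-por"] = "fecha-publicacion-desc"
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(sort_by_newest("https://www.idealista.com/en/venta-viviendas/barcelona/eixample/"))
```

Rebuilding the query string rather than concatenating text keeps the helper safe to call on URLs that already carry a sort parameter.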

To scrape this in Python, we can start an infinite loop that keeps checking this page for new listings:

...  # include code from previous sections
from pathlib import Path


async def track_search(url: str, output: Path, interval=60):
    """Track Idealista.com results page, scrape new listings and append them as JSON to the output file"""
    seen = set()
    output.touch(exist_ok=True)  # create file if it doesn't exist
    try:
        while True:
            results = []
            # only the first page is needed: it holds the newest listings
            properties = await scrape_search(url=url, paginate=False)
            # check deduplication filter
            properties = [prop for prop in properties if prop not in seen]
            if properties:
                # scrape properties and save to file - 1 property as JSON per line
                results = await scrape_properties(properties)
                with output.open("a") as f:
                    for result in results:
                        f.write(json.dumps(result) + "\n")

                # add seen to deduplication filter
                seen.update(properties)
            print(f"scraped {len(results)} new properties; waiting {interval} seconds")
            await asyncio.sleep(interval)
    except (KeyboardInterrupt, asyncio.CancelledError):
        # on newer Python versions Ctrl+C reaches the coroutine as CancelledError
        print("stopping listing tracking")


# Example run:
asyncio.run(track_search(
    "https://www.idealista.com/en/venta-viviendas/barcelona/eixample/?ordenado-por=fecha-publicacion-desc",
    Path("new-barcelona-eixample-area-properties.jsonl"),
))

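This short tracker scraper keeps scraping the given results page for new listings. It keeps a memory of listings it has already seen to prevent duplicates and appends the results to a JSON Lines file (one JSON object per line).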
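
One limitation of keeping the seen set in memory is that it is lost when the tracker restarts. As a possible extension (not part of the tracker above), the sketch below rebuilds it from the JSON Lines output file. The load_seen_urls helper is hypothetical and assumes every line holds an object with a url field, as the property scraper produces.

```python
import json
from pathlib import Path
from typing import Set


def load_seen_urls(output: Path) -> Set[str]:
    """Collect the `url` field of every record in a JSON Lines file."""
    seen: Set[str] = set()
    if not output.exists():
        return seen
    with output.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                seen.add(json.loads(line)["url"])
    return seen


if __name__ == "__main__":
    path = Path("new-barcelona-eixample-area-properties.jsonl")
    print(f"{len(load_seen_urls(path))} listings already saved")
```

Reading the file line by line works because each line is a complete JSON object, which is the main convenience of the JSON Lines format.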

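We've written property discovery, scraping and tracking; all that's left is to scale up our scraper.

FAQ

To wrap up this guide, let's take a look at some frequently asked questions about web scraping Idealista.com data: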

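Is it legal to scrape Idealista.com?

Yes. Idealista.com's data is publicly available, and we're not extracting anything personal or private. Scraping Idealista.com at a slow, respectful rate is perfectly legal and ethical.
That said, attention should be paid to GDPR compliance in the EU when scraping personal data such as seller names and phone numbers.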
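
One way to keep the scraping rate slow is to cap how many requests run at once and pause after each one. The sketch below illustrates this with asyncio.Semaphore; gather_throttled and its limit and delay parameters are hypothetical names, and suitable values are an assumption that depends on the site.

```python
import asyncio
from typing import Any, Awaitable, Callable, List


async def gather_throttled(
    jobs: List[Callable[[], Awaitable[Any]]],
    limit: int = 2,
    delay: float = 1.0,
) -> List[Any]:
    """Run zero-argument coroutine factories with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(job: Callable[[], Awaitable[Any]]) -> Any:
        async with semaphore:
            result = await job()
            # hold the slot a little longer so requests are spaced out
            await asyncio.sleep(delay)
            return result

    # gather keeps results in the same order as `jobs`
    return await asyncio.gather(*(run_one(job) for job in jobs))
```

With the scraper from this guide, the jobs could be built as [lambda u=url: session.get(u) for url in urls], so that each request is only created once a slot is free.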

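Does Idealista.com have a public API?

No, Idealista.com (and its sister websites) doesn't offer a public API for property data. However, as this guide shows, it's easy to scrape and crawl with a bit of Python.

Idealista Scraping Summary

In this web scraping tutorial, we wrote a short Idealista scraper for real estate data. We started by scraping a single property page and parsing its details with CSS and XPath selectors.

Then, we looked at how to find properties using Idealista's directory and search system. We wrote a small web crawler that can crawl and scrape all property listings for a given province in Spain.

Finally, we looked at how to track new listings posted on Idealista by creating a looping scraper that keeps checking for new listings.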

Written by 河小马

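河小马 is a prominent figure in the digital marketing industry and a key member of the 广告中国 forum, with expertise spanning PPC advertising, domain parking, website development, affiliate marketing and cross-border e-commerce consulting. A seasoned developer, he combines strong technical skills with more than 13 years of experience in overseas online marketing.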