In this web scraping tutorial, we'll take a look at how to scrape Redfin, a popular real estate listing website. We'll scrape real estate data displayed on Redfin property pages, such as prices, addresses, photos, and phone numbers. To scrape Redfin properties we'll use the hidden web data scraping approach. We'll also take a look at how to find properties using Redfin's search and sitemap systems, letting us collect the entire real estate dataset available on the website. Finally, we'll cover property tracking by continuously scraping newly listed or updated listings, giving us the upper hand in real estate bidding. We'll be using Python with a few community libraries. Let's dive in!
Why Scrape Redfin.com?
Redfin.com is one of the biggest real estate websites in the United States, making it one of the biggest public real estate datasets. It contains fields like real estate prices, listing locations and sale dates, and general property information. This is valuable information for market analysis, research into the housing industry, and a general overview of the competition. By web scraping Redfin we can easily access this major real estate dataset.

Available Redfin Data Fields
We can scrape Redfin for several popular real estate data fields and targets:

- Properties for sale
- Land for sale
- Open house events
- Rental properties
- Real estate agent info
Project Setup
In this tutorial we'll be using Python with three community packages:

- httpx – an HTTP client library that will let us communicate with Redfin.com's servers
- parsel – an HTML parsing library that will help us parse the scraped HTML files
- jmespath – a JSON parsing library that allows writing XPath-like rules for JSON
These packages can be easily installed via the pip install command:

```shell
$ pip install httpx parsel jmespath
```

Alternatively, feel free to swap httpx out with any other HTTP client package, such as requests, as we'll only need basic HTTP functions which are almost interchangeable across libraries. As for parsel, another great alternative is the beautifulsoup package.
Scraping Redfin Property Data
To start, let's take a look at how to scrape the property data of a single listing page. Redfin uses Next.js to render its pages. We can take advantage of this fact and scrape the hidden web data instead of parsing the HTML directly. This can look a bit complicated, so if you're unfamiliar with hidden web data scraping, see our introduction article on the subject. Redfin's hidden dataset contains all of the property data and more. In this case, the property data is located in the javascript variable __reactServerState.InitialContext:
To extract the whole dataset we will:

- Find the script element containing this javascript variable
- Use a regular expression to find the variable's value
- Load it as a Python dictionary and clean up the dataset
```python
import asyncio
import json
import re
from typing import List

from httpx import AsyncClient, Response
from parsel import Selector

session = AsyncClient(
    headers={
        # use the same headers as a popular web browser (Chrome on Mac in this case)
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


def extract_cache(react_initial_context):
    """extract microservice cache data from the react server agent"""
    result = {}
    for name, cache in react_initial_context["ReactServerAgent.cache"]["dataCache"].items():
        # first we retrieve the cached response and see whether it's a success
        try:
            cache_response = cache["res"]
        except KeyError:  # empty cache
            continue
        if cache_response.get("status") != 200:
            print("skipping non 200 cache")
            continue
        # then extract the cached response body and interpret it as JSON
        cache_data = cache_response.get("body", {}).get("payload")
        if not cache_data:
            cache_data = json.loads(cache_response["text"].split("&&", 1)[-1]).get("payload")
        if not cache_data:  # skip empty caches
            continue
        # for Redfin we can clean up cache names for home data endpoints:
        if "/home/details" in name:
            name = name.split("/home/details/")[-1]
        result[name.replace("/", "")] = cache_data
        # ^note: we sanitize names to avoid slashes as they are not allowed in JMESPath
    return result


def parse_property(response: Response):
    selector = Selector(response.text)
    script = selector.xpath('//script[contains(.,"ServerState.InitialContext")]/text()').get()
    initial_context = re.findall(r"ServerState.InitialContext = (\{.+\});", script)
    if not initial_context:
        print(f"page {response.url} is not a property listing page")
        return
    return extract_cache(json.loads(initial_context[0]))


async def scrape_properties(urls: List[str]) -> List[dict]:
    to_scrape = [session.get(url) for url in urls]
    properties = []
    for response in asyncio.as_completed(to_scrape):
        properties.append(parse_property(await response))
    return properties
```

Run Code

To run our scraper, all we have to do is call the asyncio coroutine:
```python
urls = [
    "https://www.redfin.com/FL/Cape-Coral/402-SW-28th-St-33914/home/61856041",
    "https://www.redfin.com/FL/Cape-Coral/4202-NW-16th-Ter-33993/home/62053611",
    "https://www.redfin.com/FL/Cape-Coral/1415-NW-38th-Pl-33993/home/62079956",
]

if __name__ == "__main__":
    asyncio.run(scrape_properties(urls))
```

Above, we used httpx to retrieve the HTML page and loaded it as a parsel.Selector. Then, we found the script element containing the javascript cache variable. To extract the cache, we used a simple regular expression that captures the text between the InitialContext keyword and the }; characters. This gives us a huge Redfin property dataset, and since it's an internal web dataset it's full of technical data fields, so we'll have to do some cleanup. Let's parse it!
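The regex extraction step can be seen in isolation on a synthetic script body (the one-line snippet below is illustrative; the real variable holds megabytes of data):

```python
import json
import re

# a tiny synthetic script body mimicking the structure of Redfin's page script
script = 'root.__reactServerState.InitialContext = {"ReactServerAgent.cache": {"dataCache": {}}};'

# capture everything between "ServerState.InitialContext = " and the closing "};"
match = re.findall(r"ServerState.InitialContext = (\{.+\});", script)
data = json.loads(match[0])
print(data)  # -> {'ReactServerAgent.cache': {'dataCache': {}}}
```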
Parsing Redfin Data
The dataset we scraped is huge and contains loads of useless information. To parse it down to something we can digest, we'll use JMESPath, a popular JSON parsing syntax. JMESPath is a bit similar to XPath or CSS selectors, but for JSON. Using it, we can create path rules describing where to find the data fields we want to keep. For example, for the price we'll use the JMESPath path aboveTheFold.addressSectionInfo.priceInfo.amount. Let's take a look at the whole parser:
```python
from typing import Dict, List, TypedDict

import jmespath


class PropertyResult(TypedDict):
    """type hint for the property result, i.e. defines what fields are expected in the property dataset"""
    photos: List[str]
    videos: List[str]
    price: int
    info: Dict[str, str]
    amenities: List[Dict[str, str]]
    records: Dict[str, str]
    history: Dict[str, str]
    floorplan: Dict[str, str]
    activity: Dict[str, str]


def parse_redfin_property_cache(data_cache) -> PropertyResult:
    """parse Redfin's cache data for property information"""
    # here we define a field name to JMESPath mapping
    parse_map = {
        # from the top area of the page: basic info, videos and photos
        "photos": "aboveTheFold.mediaBrowserInfo.photos[*].photoUrls.fullScreenPhotoUrl",
        "videos": "aboveTheFold.mediaBrowserInfo.videos[*].videoUrl",
        "price": "aboveTheFold.addressSectionInfo.priceInfo.amount",
        "info": """aboveTheFold.addressSectionInfo.{
            bed_num: beds,
            bath_numr: baths,
            full_baths_num: numFullBaths,
            sqFt: sqFt,
            year_built: yearBuitlt,
            city: city,
            state: state,
            zip: zip,
            country_code: countryCode,
            fips: fips,
            apn: apn,
            redfin_age: timeOnRedfin,
            cumulative_days_on_market: cumulativeDaysOnMarket,
            property_type: propertyType,
            listing_type: listingType,
            url: url
        }""",
        # from the bottom area of the page: amenities, records and event history
        "amenities": """belowTheFold.amenitiesInfo.superGroups[].amenityGroups[].amenityEntries[].{
            name: amenityName, values: amenityValues
        }""",
        "records": "belowTheFold.publicRecordsInfo",
        "history": "belowTheFold.propertyHistoryInfo",
        # other: sometimes there are floorplans
        "floorplan": r"listingfloorplans.floorPlans",
        # and there's always internal Redfin performance info: views, saves, etc.
        "activity": "activityInfo",
    }
    results = {}
    for key, path in parse_map.items():
        value = jmespath.search(path, data_cache)
        results[key] = value
    return results
```

Example Output
```json
{
  "photos": [
    "https://ssl.cdn-redfin.com/photo/192/bigphoto/966/222084966_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_1_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_2_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_3_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_4_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_5_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_6_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_7_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_8_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_9_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_10_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_11_0.jpg",
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_12_0.jpg"
  ],
  "videos": [],
  "price": 311485,
  "info": {
    "bed_num": 3,
    "bath_numr": 2.5,
    "full_baths_num": 2,
    "sqFt": {"displayLevel": 1, "value": 1636},
    "year_built": null,
    "city": "Cape Coral",
    "state": "FL",
    "zip": "33909",
    "country_code": "US",
    "fips": "12071",
    "apn": "304324C2137060780",
    "redfin_age": 489909873,
    "cumulative_days_on_market": 6,
    "property_type": 13,
    "listing_type": 1,
    "url": "/FL/Cape-Coral/1445-Weeping-Willow-Ct-33909/home/178539241"
  },
  "amenities": [
    {"name": "Parking", "values": ["2+ Spaces", "Driveway Paved"]},
    {
      "name": "Amenities",
      "values": [
        "Basketball", "Business Center", "Clubhouse", "Community Pool",
        "Community Room", "Community Spa/Hot tub", "Exercise Room", "Pickleball",
        "Play Area", "Sidewalk", "Tennis Court", "Underground Utility", "Volleyball"
      ]
    }
    // ... truncated for blog
  ],
  "records": {
    "basicInfo": {
      "propertyTypeName": "Townhouse",
      "lotSqFt": 1965,
      "apn": "304324C2137060780",
      "propertyLastUpdatedDate": 1669759070845,
      "displayTimeZone": "US/Eastern"
    },
    "taxInfo": {},
    "allTaxInfo": [],
    "addressInfo": {
      "isFMLS": false,
      "street": "1445 Weeping Willow Ct",
      "city": "Cape Coral",
      "state": "FL",
      "zip": "33909",
      "countryCode": "US"
    },
    "mortgageCalculatorInfo": {
      "displayLevel": 1,
      "dataSourceId": 192,
      "listingPrice": 311485,
      "downPaymentPercentage": 20.0,
      "monthlyHoaDues": 312,
      "propertyTaxRate": 1.29,
      "homeInsuranceRate": 1.17,
      "mortgageInsuranceRate": 0.75,
      "creditScore": 740,
      "loanType": 1,
      "mortgageRateInfo": {
        "fifteenYearFixed": 5.725,
        "fiveOneArm": 5.964,
        "thirtyYearFixed": 6.437,
        "isFromBankrate": true
      },
      "countyId": 471,
      "stateId": 19,
      "countyName": "Lee County",
      "stateName": "Florida",
      "mortgageRatesPageLinkText": "View all rates",
      "baseMortgageRatesPageURL": "/mortgage-rates?location=33909&locationType=4&locationId=14465",
      "zipCode": "33909",
      "isCoop": false
    },
    "countyUrl": "/county/471/FL/Lee-County",
    "countyName": "Lee County",
    "countyIsActive": true,
    "sectionPreviewText": "County data refreshed on 11/29/2022"
  },
  "history": {
    "isHistoryStillGrowing": true,
    "hasAdminContent": false,
    "hasLoginContent": false,
    "dataSourceId": 192,
    "canSeeListing": true,
    "listingIsNull": false,
    "hasPropertyHistory": true,
    "showLogoInLists": false,
    "definitions": [],
    "displayTimeZone": "US/Eastern",
    "isAdminOnlyView": false,
    "events": [
      {
        "isEventAdminOnly": false,
        "price": 311485,
        "isPriceAdminOnly": false,
        "eventDescription": "Listed",
        "mlsDescription": "Active",
        "source": "BEARMLS",
        "sourceId": "222084966",
        "dataSourceDisplay": {
          "dataSourceId": 192,
          "dataSourceDescription": "Bonita Springs Association of Realtors (BEARMLS)",
          "dataSourceName": "BEARMLS",
          "shouldShowLargerLogo": false
        },
        "priceDisplayLevel": 1,
        "historyEventType": 1,
        "eventDate": 1669708800000
      }
    ],
    "mediaBrowserInfoBySourceId": {},
    "addressInfo": {
      "isFMLS": false,
      "street": "1445 Weeping Willow Ct",
      "city": "Cape Coral",
      "state": "FL",
      "zip": "33909",
      "countryCode": "US"
    },
    "isFMLS": false,
    "historyHasHiddenRows": false,
    "priceEstimates": {
      "displayLevel": 1,
      "priceHomeUrl": "/what-is-my-home-worth?estPropertyId=178539241&src=ldp-estimates"
    },
    "sectionPreviewText": "Details will be added when we have them"
  },
  "floorplan": [
    "https://www.jingzhengli.com/wp-content/uploads/2023/06/222084966_1_0.jpg"
  ],
  "activity": {
    "viewCount": 28,
    "favoritesCount": 1,
    "totalFavoritesCount": 1,
    "xOutCount": 0,
    "totalXOutCount": 0,
    "tourCount": 0,
    "totalTourCount": 0,
    "addressInfo": {
      "isFMLS": false,
      "street": "1445 Weeping Willow Ct",
      "city": "Cape Coral",
      "state": "FL",
      "zip": "33909",
      "countryCode": "US"
    },
    "sectionPreviewText": "1 people favorited this home"
  }
}
```

Using JMESPath and Python, we reduced the thousands-of-lines-long Redfin property dataset to just a few dozen of the most important fields. We can see how easy it is to scrape modern websites with modern scraping tools: to scrape Redfin properties we only needed a few lines of Python code. Next, let's take a look at how to find listings to scrape.
Finding Redfin Properties
There are several ways to find listings to scrape on Redfin, though the most obvious and fastest way is to use Redfin's sitemaps. Redfin offers an extensive sitemap system with sitemaps for listings by US state, neighborhood, school district, and more. To find them, let's take a look at the /robots.txt page, specifically its sitemap section. There we can see sitemaps for all location directories, sitemaps for rental data, and finally sitemaps for non-listing objects such as Agents.

Following Redfin Listing Changes
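Since the sitemap locations (including the listing-change feeds below) can change over time, they can be discovered from the robots.txt body itself rather than hardcoded. A minimal sketch, where the helper name and the sample sitemap filenames are our own illustrations rather than Redfin's actual entries:

```python
from typing import List


def parse_sitemap_urls(robots_txt: str) -> List[str]:
    """extract Sitemap: entries from a robots.txt body"""
    urls = []
    for line in robots_txt.splitlines():
        if line.lower().startswith("sitemap:"):
            # split only on the first colon so the URL's "https:" stays intact
            urls.append(line.split(":", 1)[-1].strip())
    return urls


# illustrative robots.txt body; the real one comes from https://www.redfin.com/robots.txt
sample = """User-agent: *
Sitemap: https://www.redfin.com/sitemap-example-locations.xml
Sitemap: https://www.redfin.com/sitemap-example-rentals.xml
"""
print(parse_sitemap_urls(sample))
```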
To track new Redfin listings, we can use the sitemap feeds for the newest and recently updated listings. To find new listings and updates, we'll scrape these two sitemaps, which provide listing URLs together with timestamps of when they were listed or updated:

```xml
<url>
  <loc>https://www.redfin.com/NH/Boscawen/1-Sherman-Dr-03303/home/96531826</loc>
  <lastmod>2022-12-01T00:53:20.426-08:00</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
```

Note that this sitemap uses the UTC-8 timezone, indicated by the last digits of the datetime string: -08:00. To scrape these Redfin feeds in Python, we'll use the libraries we used earlier, httpx and parsel:

```python
from datetime import datetime
from typing import Dict

import arrow  # for handling datetime: pip install arrow
from httpx import AsyncClient
from parsel import Selector

session = AsyncClient(
    headers={
        # use the same headers as a popular web browser (Chrome on Mac in this case)
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "Accept-Language": "en-US,en;q=0.9",
    }
)


async def scrape_feed(url) -> Dict[str, datetime]:
    """scrape a Redfin sitemap and return a url:datetime dictionary"""
    result = await session.get(url)
    selector = Selector(text=result.text, type="xml")
    results = {}
    for item in selector.xpath("//url"):
        url = item.xpath("loc/text()").get()
        pub_date = item.xpath("lastmod/text()").get()
        results[url] = arrow.get(pub_date).datetime
    return results
```

Then we can use the Python Redfin scraper we wrote earlier to scrape these URLs for the property datasets.
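With scrape_feed returning a url:datetime mapping, listing tracking reduces to remembering when we last checked the feed and keeping only newer entries. A minimal, illustrative filter (the helper name and sample data are our own, not part of Redfin's API):

```python
from datetime import datetime, timezone
from typing import Dict, List


def filter_new_listings(feed: Dict[str, datetime], last_checked: datetime) -> List[str]:
    """keep only listing URLs modified after our previous check"""
    return [url for url, modified in feed.items() if modified > last_checked]


# illustrative data; real values would come from scrape_feed()
feed = {
    "https://www.redfin.com/NH/Boscawen/1-Sherman-Dr-03303/home/96531826":
        datetime(2022, 12, 1, 0, 53, 20, tzinfo=timezone.utc),
    "https://www.redfin.com/FL/Cape-Coral/402-SW-28th-St-33914/home/61856041":
        datetime(2022, 11, 20, 12, 0, 0, tzinfo=timezone.utc),
}
last_checked = datetime(2022, 11, 30, 0, 0, 0, tzinfo=timezone.utc)
print(filter_new_listings(feed, last_checked))
```

Running this filter on a schedule (e.g. every few minutes) gives us a continuous stream of new and updated listings to feed into the property scraper.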