How to Scrape Web Microformats

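Scraping web microformats is one of the easiest ways to scrape public data. In this tutorial, we'll look at how to apply this web scraping technique in Python using the extruct library.

Web microformats are a set of standardized metadata formats that can be embedded in HTML pages to provide structured data about various types of content such as products, people, and organizations.

By scraping microformats, we can easily collect public data in a predictable format, since microformats usually follow strict schema definitions defined by schema.org.

Today we'll cover the common microformat types and how to scrape them, and then walk through an example scraping project that targets Etsy.com.

What Are Microformats?

Microformats were created to standardize the representation of important web data objects so that they are machine-readable. Most commonly, microformats are used to create preview cards for web pages, providing data views to search engines, social networks, and other communication channels.

In practice, most people know microformats through the website preview feature of social media and communication platforms such as Slack. When you post a link to a website, the platform's server scrapes the microformat data to generate a small website preview.

The only downside of microformats is that they often don't contain the entire dataset available on the page. When web scraping, we may need to extend the microformat parser with additional HTML parsing, using an HTML parsing tool such as beautifulsoup, or CSS selector and XPath parsers.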
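As a rough sketch of that idea, the snippet below fills in a field that is missing from the microformat data by parsing the HTML directly. To stay dependency-free it uses Python's standard-library html.parser instead of beautifulsoup. The sample HTML, the "shipping" class name, and the ClassTextParser helper are all hypothetical and only illustrate the pattern.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ClassTextParser(HTMLParser):
    """Collect the text of the first element that carries a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self._depth = 0  # greater than 0 while inside the matching element
        self._done = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._done or tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1
            self._done = self._depth == 0

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)

    @property
    def text(self):
        # Join the collected pieces and normalize whitespace.
        return " ".join("".join(self._chunks).split())


html = """
<div itemscope itemtype="http://schema.org/Product">
  <h1 itemprop="name">Gold Pendant</h1>
  <p class="shipping">Ships in <b>2 days</b></p>
</div>
"""

# Data as a microformat parser might return it: the shipping note is not marked up.
product = {"name": "Gold Pendant"}

parser = ClassTextParser("shipping")
parser.feed(html)
parser.close()
product["shipping"] = parser.text
print(product)
```

In a real scraper the same merge step would more likely use CSS selectors or XPath, but the shape stays the same: take the microformat data first, then patch in whatever it lacks.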

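Setup

To scrape microformats, we'll use Python with the extruct library, which uses HTML parsing tools to extract microformat data.

It can be installed using the pip install terminal command: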

$ pip install extruct

Schema.org

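Schema.org is a collaborative initiative between major search engines and other tech industry leaders that aims to provide standard data types for web content.

Schema.org contains schemas (data object rules and definitions) for popular data objects such as people, websites, articles, and companies. These standard, static schema types simplify web automation.

These schemas were created for microformats, but not all microformats have to use schema.org object definitions.

Next, let's learn about microformat types by exploring the schema.org/Person object.

Microformat Types

Several microformat data type standards are used on the web. They are very similar and differ only in markup and use cases.

Let's look at the popular microformat types and how to extract them in Python using extruct.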

JSON-LD

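JSON-LD is the most popular modern microformat. It uses embedded JSON documents that directly represent schema.org objects.

Here is an example of JSON-LD markup and how to parse it using extruct: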

html = """
<script type="application/ld+json">
{
  "@context": "http://schema.org",
  "@type": "Person",
  "name": "John Doe",
  "image": "johndoe.jpg",
  "jobTitle": "Software Engineer",
  "telephone": "(555) 555-5555",
  "email": "[email protected]",
  "address": {
    "@type": "PostalAddress",
    "streetAddress": "123 Main St",
    "addressLocality": "Anytown",
    "addressRegion": "CA",
    "postalCode": "12345"
  }
}
</script>
"""

from extruct.jsonld import JsonLdExtractor

data = JsonLdExtractor().extract(html)
print(data)
[
  {
    "@context": "http://schema.org",
    "@type": "Person",
    "name": "John Doe",
    "image": "johndoe.jpg",
    "jobTitle": "Software Engineer",
    "telephone": "(555) 555-5555",
    "email": "[email protected]",
    "address": {
        "@type": "PostalAddress",
        "streetAddress": "123 Main St",
        "addressLocality": "Anytown",
        "addressRegion": "CA",
        "postalCode": "12345",
    },
  }
]

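This is an example of a Person object (indicated by the @type meta field), and we can find the schema details at schema.org/Person.

JSON-LD is easy to implement and use, but because it is a dataset separate from the visible data on the page, it may not match the page data.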
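To make the "embedded JSON document" idea concrete, here is a minimal standard-library sketch of what any JSON-LD extractor has to do: find the script tags with the application/ld+json type and decode their contents. This is not how extruct is implemented; the JsonLdScriptParser class is a hypothetical helper written only for illustration.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptParser(HTMLParser):
    """Collect decoded <script type="application/ld+json"> documents."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.documents.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed JSON instead of failing the whole page


sample_html = """
<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "Person", "name": "John Doe"}
</script>
"""

jsonld_parser = JsonLdScriptParser()
jsonld_parser.feed(sample_html)
jsonld_parser.close()
print(jsonld_parser.documents)
```

Because the JSON lives in its own script tag, nothing ties it to the visible markup, which is exactly why it can drift from what the page displays.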

Microdata

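Microdata is the second most popular microformat. It uses HTML attributes to mark up microformat data fields. This microformat is great for web scraping because it covers the visible page data, which means we get exactly what we see on the page.

Here is an example of microdata markup and how to parse it using extruct: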

html = """
<div itemscope itemtype="http://schema.org/Person">
  <h1 itemprop="name">John Doe</h1>
  <img itemprop="image" src="johndoe.jpg" alt="John Doe">
  <p itemprop="jobTitle">Software Engineer</p>
  <p itemprop="telephone">(555) 555-5555</p>
  <p itemprop="email"><a href="mailto:[email protected]">[email protected]</a></p>
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <p><span itemprop="streetAddress">123 Main St</span>, <span itemprop="addressLocality">Anytown</span>, <span itemprop="addressRegion">CA</span> <span itemprop="postalCode">12345</span></p>
  </div>
</div>
"""

from extruct.w3cmicrodata import MicrodataExtractor

data = MicrodataExtractor().extract(html)
print(data)
[
  {
    "type": "http://schema.org/Person",
    "properties": {
      "name": "John Doe",
      "image": "johndoe.jpg",
      "jobTitle": "Software Engineer",
      "telephone": "(555) 555-5555",
      "email": "[email protected]",
      "address": {
        "type": "http://schema.org/PostalAddress",
        "properties": {
          "streetAddress": "123 Main St",
          "addressLocality": "Anytown",
          "addressRegion": "CA",
          "postalCode": "12345",
        },
      },
    },
  }
]

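Microdata uses the itemprop attribute to specify field keys and the inner HTML data as values. This format is a bit more complex, but it is closer to the real source since it is the same data that is displayed on the web page.

Microdata is usually marked up with schema.org schemas, but because it is very flexible, other markup that doesn't match schema.org schemas is also possible.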
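The nested "type" and "properties" wrappers in the microdata output can be awkward to work with. The helper below is a small sketch that collapses them into plain dictionaries; flatten_microdata is a hypothetical name, and the input is a trimmed copy of the output shown above rather than a live extraction.

```python
def flatten_microdata(item):
    """Recursively turn {"type": ..., "properties": {...}} items into plain dicts."""
    if isinstance(item, list):
        return [flatten_microdata(value) for value in item]
    if isinstance(item, dict) and "properties" in item:
        flat = {"@type": item.get("type")}
        for key, value in item["properties"].items():
            flat[key] = flatten_microdata(value)
        return flat
    return item  # plain values (strings, numbers) are returned unchanged


microdata_items = [
    {
        "type": "http://schema.org/Person",
        "properties": {
            "name": "John Doe",
            "address": {
                "type": "http://schema.org/PostalAddress",
                "properties": {"streetAddress": "123 Main St"},
            },
        },
    }
]

flat_items = flatten_microdata(microdata_items)
print(flat_items)
```

The item type is kept under an "@type" key so the flattened result resembles the JSON-LD layout, which makes it easier to handle both formats with the same downstream code.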

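RDFa

RDFa is similar to Microdata: it uses HTML attribute markup to provide additional microformat data. It is almost identical to the Microdata format and shares the same advantage of marking up data that is visible on the page.

Here is an example of RDFa markup and how to parse it using extruct: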

html = """
<div vocab="http://schema.org/" typeof="Person">
  <h1 property="name">John Doe</h1>
  <img property="image" src="johndoe.jpg" alt="John Doe"/>
  <p property="jobTitle">Software Engineer</p>
  <p property="telephone">(555) 555-5555</p>
  <p property="email"><a href="mailto:[email protected]">[email protected]</a></p>
  <div property="address" typeof="PostalAddress">
    <p><span property="streetAddress">123 Main St</span>, <span property="addressLocality">Anytown</span>, <span property="addressRegion">CA</span> <span property="postalCode">12345</span></p>
  </div>
</div>
"""

from extruct.rdfa import RDFaExtractor

data = RDFaExtractor().extract(html)
print(data)

[
    {"@id": "", "http://www.w3.org/ns/rdfa#usesVocabulary": [{"@id": "http://schema.org/"}]},
    {
        "@id": "_:Naa49dc28a80f47119694913cd98fc5dc",
        "@type": ["http://schema.org/Person"],
        "http://schema.org/address": [{"@id": "_:Nb8c8aea8ce7d434989a88308e1a12e7e"}],
        "http://schema.org/email": [{"@value": "[email protected]"}],
        "http://schema.org/image": [{"@id": "johndoe.jpg"}],
        "http://schema.org/jobTitle": [{"@value": "Software Engineer"}],
        "http://schema.org/name": [{"@value": "John Doe"}],
        "http://schema.org/telephone": [{"@value": "(555) 555-5555"}],
    },
    {
        "@id": "_:Nb8c8aea8ce7d434989a88308e1a12e7e",
        "@type": ["http://schema.org/PostalAddress"],
        "http://schema.org/addressLocality": [{"@value": "Anytown"}],
        "http://schema.org/addressRegion": [{"@value": "CA"}],
        "http://schema.org/postalCode": [{"@value": "12345"}],
        "http://schema.org/streetAddress": [{"@value": "123 Main St"}],
    },
]
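Notice that the RDFa output is a flat list of nodes: the Person refers to its address only through a generated "_:" identifier. The sketch below shows one way to nest such blank nodes back into the nodes that reference them. The resolve_blank_nodes helper is hypothetical, the shortened identifiers are made up for readability, and the code assumes every property value is a list of dictionaries, as in the output above.

```python
def resolve_blank_nodes(nodes):
    """Nest blank nodes ("_:..." ids) inside the nodes that reference them."""
    blank = {n["@id"]: n for n in nodes if n.get("@id", "").startswith("_:")}
    referenced = set()

    def resolve_value(value, seen):
        if "@value" in value:
            return value["@value"]  # plain literal
        ref = value.get("@id")
        if ref in blank and ref not in seen:  # reference to another blank node
            referenced.add(ref)
            return resolve_node(blank[ref], seen | {ref})
        # Ordinary URL, an already-visited node, or an unknown structure.
        return ref if ref is not None else value

    def resolve_node(node, seen):
        resolved = {}
        for key, values in node.items():
            if key in ("@id", "@type"):
                resolved[key] = values
            else:
                resolved[key] = [resolve_value(v, seen) for v in values]
        return resolved

    roots = [resolve_node(n, {n.get("@id")}) for n in nodes]
    # Drop nodes that are now nested inside another node.
    return [n for n in roots if n.get("@id") not in referenced]


rdfa_nodes = [
    {"@id": "", "http://www.w3.org/ns/rdfa#usesVocabulary": [{"@id": "http://schema.org/"}]},
    {
        "@id": "_:person",
        "@type": ["http://schema.org/Person"],
        "http://schema.org/name": [{"@value": "John Doe"}],
        "http://schema.org/address": [{"@id": "_:address"}],
    },
    {
        "@id": "_:address",
        "@type": ["http://schema.org/PostalAddress"],
        "http://schema.org/addressLocality": [{"@value": "Anytown"}],
    },
]

resolved_nodes = resolve_blank_nodes(rdfa_nodes)
person = resolved_nodes[1]
print(person["http://schema.org/address"][0]["http://schema.org/addressLocality"])
```

Values are left as lists because an RDFa property can legitimately appear more than once on the same node.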

OpenGraph

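Facebook's Open Graph is another popular microformat, mainly used for generating preview cards in social media posts. Although Open Graph can describe several object types (such as article, profile, and video), it is rarely used to mark up anything beyond basic website preview information.

Here is an example of Open Graph markup and how to parse it using extruct: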

html = """
<head>
  <meta property="og:type" content="profile" />
  <meta property="og:title" content="John Doe" />
  <meta property="og:image" content="johndoe.jpg" />
  <meta property="og:description" content="Software Engineer" />
  <meta property="og:phone_number" content="(555) 555-5555" />
  <meta property="og:email" content="[email protected]" />
  <meta property="og:street-address" content="123 Main St" />
  <meta property="og:locality" content="Anytown" />
  <meta property="og:region" content="CA" />
  <meta property="og:postal-code" content="12345" />
  <meta property="og:country-name" content="USA" />
</head>
"""

from extruct.opengraph import OpenGraphExtractor

data = OpenGraphExtractor().extract(html)
print(data)
[
  {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
      ("og:type", "profile"),
      ("og:title", "John Doe"),
      ("og:image", "johndoe.jpg"),
      ("og:description", "Software Engineer"),
      ("og:phone_number", "(555) 555-5555"),
      ("og:email", "[email protected]"),
      ("og:street-address", "123 Main St"),
      ("og:locality", "Anytown"),
      ("og:region", "CA"),
      ("og:postal-code", "12345"),
      ("og:country-name", "USA"),
    ],
  }
]

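Open Graph is very similar to JSON-LD in that it is not part of the natural page content. This means Open Graph information may differ from the data displayed on the page.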
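The Open Graph output stores properties as a list of key and value pairs rather than a dictionary, because a key such as og:image may appear several times. As a small sketch, the hypothetical opengraph_to_dict helper below turns that list into a dictionary and groups repeated keys into lists; the sample data is made up to show the repeated-key case.

```python
def opengraph_to_dict(item):
    """Convert ("key", "value") pairs into a dict; repeated keys become lists."""
    result = {}
    for key, value in item["properties"]:
        if key not in result:
            result[key] = value
        elif isinstance(result[key], list):
            result[key].append(value)
        else:
            result[key] = [result[key], value]
    return result


og_item = {
    "namespace": {"og": "http://ogp.me/ns#"},
    "properties": [
        ("og:type", "profile"),
        ("og:title", "John Doe"),
        ("og:image", "johndoe.jpg"),
        ("og:image", "johndoe-large.jpg"),
    ],
}

og_data = opengraph_to_dict(og_item)
print(og_data)
```

If you would rather have a uniform shape, storing every value as a list is a reasonable alternative at the cost of slightly noisier lookups.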

Microformat

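Microformat is one of the oldest markups and predates schema.org objects. Instead, microformats have their own schema definitions for marking up people, organizations, events, locations, blog posts, products, reviews, resumes, recipes, and so on.

Here is an example of microformat markup and how to parse it using extruct: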

html = """
<div class="h-card">
  <h1 class="fn">John Doe</h1>
  <img class="photo" src="johndoe.jpg" alt="John Doe">
  <p class="title">Software Engineer</p>
  <p class="tel">(555) 555-5555</p>
  <a class="email" href="mailto:[email protected]">[email protected]</a>
  <div class="adr">
    <span class="street-address">123 Main St</span>, 
    <span class="locality">Anytown</span>, 
    <span class="region">CA</span>
    <span class="postal-code">12345</span>
  </div>
</div>
"""

from extruct.microformat import MicroformatExtractor

data = MicroformatExtractor().extract(html)
print(data)
[
  {
    "type": ["h-card"],
    "properties": {
      "name": ["John Doe"],
      "photo": ["johndoe.jpg"],
      "job-title": ["Software Engineer"],
      "tel": ["(555) 555-5555"],
      "email": ["mailto:[email protected]"],
      "adr": [
        {
          "type": ["h-adr"],
          "properties": {"name": ["123 Main St, Anytown, CA 12345"]},
          "value": "123 Main St, Anytown, CA 12345",
        }
      ],
    },
  }
]

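Microformat Scraper Example

Let's take a look at microformat scraping through an example scraper. We'll scrape some popular websites that use microformats to mark up their data.

We'll use the ScrapFly SDK, which will help us retrieve HTML pages without being blocked, and extruct to parse the microformat data.

Both libraries can be installed using the pip install command: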

$ pip install "scrapfly-sdk[all]" extruct

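For our first example, let's take a look at scraping Etsy.com, a popular e-commerce website that specializes in handmade and vintage goods.

For example, let's take this jewelry product, etsy.com/listing/1214112656, and see what we can scrape from it using extruct: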

import json
import os
import extruct

from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient(os.environ["SCRAPFLY_KEY"])
result = scrapfly.scrape(ScrapeConfig(
    url="https://www.etsy.com/listing/1214112656/",
    asp=True,
))

micro_data = extruct.extract(result.content)

Example output

{
  "microdata": [],
  "json-ld": [
    {
      "@type": "Product",
      "@context": "https://schema.org",
      "url": "https://www.etsy.com/listing/1214112656/9k-tiny-solid-yellow-gold-coin-pendant",
      "name": "9K TINY Solid Yellow Gold Coin Pendant, Gold Disk Necklace, 9k Gold Coin Necklace, Solid Gold Rose Necklace, Christmas Gift for Her",
      "sku": "1214112656",
      "gtin": "n/a",
      "description": "----------------🌸🌸🌸Welcome to MissFlorenceJewelry🌸🌸🌸---------------\nDetails:\n· Material: 9K Solid Yellow Gold\n· Measurement: Pendant approx 6.5*6.5mm. Pendant hole 3.5*1.5mm\n· Please not that a chain is NOT included. If you are interested to get one, check out this chain listing: \nhttps://www.etsy.com/listing/1199362536/14k-gold-chain-necklace-simple-cable?click_key=83fddaad04776d7bd8f6d871ccac4e71a3f96272%3A1199362536&click_sum=5d2ef6ce&ref=shop_home_active_11&frs=1\n· All jewelry is custom handcrafted with Love and Care. ❤️\n· All items are custom made to order, about 2 weeks.\n\n\nShipping :\n· It takes 1-2 business days to ship the item to you, and 7-10 days additionally for the USPS to deliver the package.\n· Packing: The item will be presented in a beautiful box. Complimentary gift wrapping and gift tags available.\n\n\nReturns and Exchanges:\n· I gladly accept returns and exchanges, just contact me within 15 days of delivery.\n· Buyers are responsible for return shipping costs. If the item is not returned in its original condition, the buyer is responsible for any loss in value.\n\n---------------🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸🌸-------------------\n\n· If you can't find the information you need, Please feel free to contact us.😊\n· Thank you so much for your visit and hope you have a happy shopping here.❤️",
      "image": [
        {
          "@type": "ImageObject",
          "@context": "https://schema.org",
          "author": "MissFlorenceJewelry",
          "contentURL": "https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_fullxfull.3856672332_2867.jpg",
          "description": null,
          "thumbnail": "https://i.etsystatic.com/34276015/c/650/516/41/108/il/1fbd0e/3856672332/il_340x270.3856672332_2867.jpg"
        },
        {
          "@type": "ImageObject",
          "@context": "https://schema.org",
          "author": "MissFlorenceJewelry",
          "contentURL": "https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_fullxfull.3904169515_9c7r.jpg",
          "description": null,
          "thumbnail": "https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_340x270.3904169515_9c7r.jpg"
        }
      ],
      "category": "Jewelry < Necklaces < Pendants",
      "brand": {
        "@type": "Brand",
        "@context": "https://schema.org",
        "name": "MissFlorenceJewelry"
      },
      "logo": "https://i.etsystatic.com/isla/862c6f/58067961/isla_fullxfull.58067961_aiop800d.jpg?version=0",
      "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": "4.8",
        "reviewCount": 25
      },
      "offers": {
        "@type": "AggregateOffer",
        "offerCount": 8,
        "lowPrice": "62.00",
        "highPrice": "167.00",
        "priceCurrency": "USD",
        "availability": "https://schema.org/InStock"
      },
      "review": [
        {
          "@type": "Review",
          "reviewRating": {
            "@type": "Rating",
            "ratingValue": 5,
            "bestRating": 5
          },
          "datePublished": "2022-11-05",
          "reviewBody": "Thank you, just perfect although I would have liked the 14k on the backside, but it’s precious. I will wear it everyday with my other pendants.",
          "author": {
            "@type": "Person",
            "name": "Juanita Bell"
          }
        }
      ]
    },
    "..."
  ],
  "opengraph": [
    {
      "namespace": {
        "og": "http://ogp.me/ns#",
        "product": "http://ogp.me/ns/product#"
      },
      "properties": [
        [
          "og:title",
          "9K TINY Solid Yellow Gold Coin Pendant Gold Disk Necklace 9k - Etsy South Korea"
        ],
        [
          "og:description",
          "This Pendants item by MissFlorenceJewelry has 45 favorites from Etsy shoppers. Ships from United States. Listed on Jan 31, 2023"
        ],
        [
          "og:type",
          "product"
        ],
        [
          "og:url",
          "https://www.etsy.com/listing/1214112656/9k-tiny-solid-yellow-gold-coin-pendant?utm_source=OpenGraph&utm_medium=PageTools&utm_campaign=Share"
        ],
        [
          "og:image",
          "https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_1080xN.3856672332_2867.jpg"
        ],
        [
          "product:price:amount",
          "62.00"
        ],
        [
          "product:price:currency",
          "USD"
        ]
      ]
    },
    "..."
  ],
  "microformat": [],
  "rdfa": [
    {
      "@id": "",
      "al:android:app_name": [
        {
          "@value": "Etsy"
        }
      ],
      "al:android:package": [
        {
          "@value": "com.etsy.android"
        }
      ],
      "al:android:url": [
        {
          "@value": "etsy://listing/1214112656?ref=applinks_android"
        }
      ],
      "al:ios:app_name": [
        {
          "@value": "Etsy"
        }
      ],
      "al:ios:app_store_id": [
        {
          "@value": "477128284"
        }
      ],
      "al:ios:url": [
        {
          "@value": "etsy://listing/1214112656?ref=applinks_ios"
        }
      ],
      "http://ogp.me/ns#description": [
        {
          "@value": "This Pendants item by MissFlorenceJewelry has 45 favorites from Etsy shoppers. Ships from United States. Listed on Jan 31, 2023"
        }
      ],
      "http://ogp.me/ns#image": [
        {
          "@value": "https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_1080xN.3856672332_2867.jpg"
        }
      ],
      "http://ogp.me/ns#title": [
        {
          "@value": "9K TINY Solid Yellow Gold Coin Pendant Gold Disk Necklace 9k - Etsy South Korea"
        }
      ],
      "http://ogp.me/ns#type": [
        {
          "@value": "product"
        }
      ],
      "http://ogp.me/ns#url": [
        {
          "@value": "https://www.etsy.com/listing/1214112656/9k-tiny-solid-yellow-gold-coin-pendant?utm_source=OpenGraph&utm_medium=PageTools&utm_campaign=Share"
        }
      ],
      "https://www.facebook.com/2008/fbmlapp_id": [
        {
          "@value": "89186614300"
        }
      ],
      "product:price:amount": [
        {
          "@value": "62.00"
        }
      ],
      "product:price:currency": [
        {
          "@value": "USD"
        }
      ]
    },
    "..."
  ],
  "dublincore": [
    {
      "namespaces": {},
      "elements": [
        {
          "name": "description",
          "content": "This Pendants item by MissFlorenceJewelry has 45 favorites from Etsy shoppers. Ships from United States. Listed on Jan 31, 2023",
          "URI": "http://purl.org/dc/elements/1.1/description"
        }
      ],
      "terms": []
    },
    "..."
  ]
}

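We can see that Etsy contains many different formats, but when it comes to product data, json-ld is the clear winner, as it contains most of the product details: sku, name, price, description, and even review metadata: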

{
  "@type": "Product",
  "@context": "https://schema.org",
  "url": "https://www.etsy.com/listing/1214112656/9k-tiny-solid-yellow-gold-coin-pendant",
  "name": "9K TINY Solid Yellow Gold Coin Pendant, Gold Disk Necklace, 9k Gold Coin Necklace, Solid Gold Rose Necklace, Christmas Gift for Her",
  "sku": "1214112656",
  "gtin": "n/a",
  "description": "----------------\ud83c\udf38\ud83c\udf38\ud83c\udf38Welcome to MissFlorenceJewelry\ud83c\udf38\ud83c\udf38\ud83c\udf38---------------\nDetails:\n\u00b7 Material: 9K Solid Yellow Gold\n\u00b7 Measurement: Pendant approx 6.5*6.5mm. Pendant hole 3.5*1.5mm\n\u00b7 Please not that a chain is NOT included. If you are interested to get one, check out this chain listing: \nhttps://www.etsy.com/listing/1199362536/14k-gold-chain-necklace-simple-cable?click_key=83fddaad04776d7bd8f6d871ccac4e71a3f96272%3A1199362536&click_sum=5d2ef6ce&ref=shop_home_active_11&frs=1\n\u00b7 All jewelry is custom handcrafted with Love and Care. \u2764\ufe0f\n\u00b7 All items are custom made to order, about 2 weeks.\n\n\nShipping :\n\u00b7 It takes 1-2 business days to ship the item to you, and 7-10 days additionally for the USPS to deliver the package.\n\u00b7 Packing: The item will be presented in a beautiful box. Complimentary gift wrapping and gift tags available.\n\n\nReturns and Exchanges:\n\u00b7 I gladly accept returns and exchanges, just contact me within 15 days of delivery.\n\u00b7 Buyers are responsible for return shipping costs. If the item is not returned in its original condition, the buyer is responsible for any loss in value.\n\n---------------\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38-------------------\n\n\u00b7 If you can't find the information you need, Please feel free to contact us.\ud83d\ude0a\n\u00b7 Thank you so much for your visit and hope you have a happy shopping here.\u2764\ufe0f",
  "image": [
    {
      "@type": "ImageObject",
      "@context": "https://schema.org",
      "author": "MissFlorenceJewelry",
      "contentURL": "https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_fullxfull.3856672332_2867.jpg",
      "description": null,
      "thumbnail": "https://i.etsystatic.com/34276015/c/650/516/41/108/il/1fbd0e/3856672332/il_340x270.3856672332_2867.jpg"
    },
    {
      "@type": "ImageObject",
      "@context": "https://schema.org",
      "author": "MissFlorenceJewelry",
      "contentURL": "https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_fullxfull.3904169515_9c7r.jpg",
      "description": null,
      "thumbnail": "https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_340x270.3904169515_9c7r.jpg"
    }
  ],
  "category": "Jewelry < Necklaces < Pendants",
  "brand": {
    "@type": "Brand",
    "@context": "https://schema.org",
    "name": "MissFlorenceJewelry"
  },
  "logo": "https://i.etsystatic.com/isla/862c6f/58067961/isla_fullxfull.58067961_aiop800d.jpg?version=0",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.8",
    "reviewCount": 25
  },
  "offers": {
    "@type": "AggregateOffer",
    "offerCount": 8,
    "lowPrice": "62.00",
    "highPrice": "167.00",
    "priceCurrency": "USD",
    "availability": "https://schema.org/InStock"
  },
  "review": [
    {
      "@type": "Review",
      "reviewRating": {
        "@type": "Rating",
        "ratingValue": 5,
        "bestRating": 5
      },
      "datePublished": "2022-11-05",
      "reviewBody": "Thank you, just perfect although I would have liked the 14k on the backside, but it\u2019s precious. I will wear it everyday with my other pendants.",
      "author": {
        "@type": "Person",
        "name": "Juanita Bell"
      }
    }
  ]
}

With a bit of data flattening code, we can use extruct to extract a nice dataset in just a few lines of code:

import json
import extruct
from scrapfly import ScrapflyClient, ScrapeConfig

scrapfly = ScrapflyClient("YOUR SCRAPFLY KEY")
result = scrapfly.scrape(ScrapeConfig(url="https://www.etsy.com/listing/1214112656/"))

micro_data = extruct.extract(result.content)
product = next(data for data in micro_data['json-ld'] if data['@type'] == "Product")
parsed = {
    # copy basic fields over
    "url": product["url"],
    "name": product["name"],
    "sku": product["sku"],
    "description": product["description"],
    "category": product["category"],
    # flatten complex fields:
    "store": product["brand"]["name"],
    "review_count": product["aggregateRating"]["reviewCount"],
    "review_avg": product["aggregateRating"]["ratingValue"],
    "price_min": product["offers"]["lowPrice"],
    "price_max": product["offers"]["highPrice"],
    "images": [img['contentURL'] for img in product['image']]
}
print(json.dumps(parsed, indent=2))

This will output:

{
  "url": "https://www.etsy.com/es/listing/1214112656/colgante-de-monedas-de-oro-amarillo",
  "name": "Colgante de monedas de oro amarillo s\u00f3lido 9K TINY, collar de disco de oro, collar de monedas de oro de 9k, collar de rosas de oro macizo, regalo de Navidad para ella",
  "sku": "1214112656",
  "description": "----------------\ud83c\udf38\ud83c\udf38\ud83c\udf38Bienvenido a MissFlorenceJewelry\ud83c\udf38\ud83c\udf38\ud83c\udf38---------------\nDetalles:\n\u00b7 Material: oro amarillo macizo de 9 quilates\n\u00b7 Medida: Colgante aprox 6.5 * 6.5mm. Agujero colgante 3.5 * 1.5mm\n\u00b7 Por favor, tenga en cuenta que una cadena NO est\u00e1 incluida. Si est\u00e1 interesado en obtener uno, consulte esta lista de cadenas:\nhttps://www.etsy.com/listing/1199362536/14k-gold-chain-necklace-simple-cable?click_key=83fddaad04776d7bd8f6d871ccac4e71a3f96272%3A1199362536&click_sum=5d2ef6ce&ref=shop_home_active_11&frs=1\n\u00b7 Todas las joyas est\u00e1n hechas a mano con amor y cuidado. \u2764\ufe0f\n\u00b7 Todos los art\u00edculos est\u00e1n hechos a medida a pedido, aproximadamente 2 semanas.\n\n\nNaviero:\n\u00b7 Se tarda de 1 a 2 d\u00edas h\u00e1biles en enviarle el art\u00edculo, y de 7 a 10 d\u00edas adicionales para que el USPS entregue el paquete.\n\u00b7 Embalaje: El art\u00edculo se presentar\u00e1 en una hermosa caja. Envoltura de regalo de cortes\u00eda y etiquetas de regalo disponibles.\n\n\nDevoluciones y cambios:\n\u00b7 Con mucho gusto acepto devoluciones y cambios, solo cont\u00e1ctame dentro de los 15 d\u00edas posteriores a la entrega.\n\u00b7 Los compradores son responsables de los gastos de env\u00edo de devoluci\u00f3n. Si el art\u00edculo no se devuelve en su estado original, el comprador es responsable de cualquier p\u00e9rdida de valor.\n\n---------------\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38\ud83c\udf38-------------------\n\n\u00b7 Si no puede encontrar la informaci\u00f3n que necesita, no dude en contactarnos. \ud83d\ude0a\n\u00b7 Muchas gracias por su visita y espero que tenga una feliz compra aqu\u00ed. \u2764\ufe0f",
  "category": "Joyer\u00eda < Collares < Colgantes",
  "store": "MissFlorenceJewelry",
  "review_count": 25,
  "review_avg": "4.8",
  "price_min": "62.00",
  "price_max": "167.00",
  "images": [
    "https://i.etsystatic.com/34276015/r/il/1fbd0e/3856672332/il_fullxfull.3856672332_2867.jpg",
    "https://i.etsystatic.com/34276015/r/il/01c4eb/3904169515/il_fullxfull.3904169515_9c7r.jpg"
  ]
}
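One caveat about the lookup in the scraper above: it assumes every JSON-LD entry is a dictionary with a string @type, and it raises StopIteration when no Product is found. Pages in the wild may also use a list of types or wrap objects in an @graph array. The find_by_type helper below is a hypothetical sketch of a more forgiving lookup, run against made-up sample data rather than a live Etsy response.

```python
def find_by_type(items, wanted):
    """Yield every JSON-LD object of the given @type, searching @graph too."""
    for item in items:
        if not isinstance(item, dict):
            continue  # skip stray strings or other non-object entries
        types = item.get("@type", [])
        if isinstance(types, str):
            types = [types]
        if wanted in types:
            yield item
        graph = item.get("@graph", [])
        if isinstance(graph, dict):
            graph = [graph]
        yield from find_by_type(graph, wanted)


json_ld = [
    {"@type": "WebSite", "name": "Example"},
    {"@graph": [{"@type": ["Product", "Thing"], "sku": "1214112656"}]},
]

# Passing a default to next() avoids StopIteration when nothing matches.
product = next(find_by_type(json_ld, "Product"), None)
recipe = next(find_by_type(json_ld, "Recipe"), None)
print(product, recipe)
```

With a None default, the scraper can decide what to do when the expected object is missing, for example falling back to another microformat or to plain HTML parsing.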

 

FAQ

Before we wrap up our introduction to web scraping microformat data, let's take a look at some frequently asked questions:

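Is it legal to scrape microformats?

Yes. Microformats were created to make web scraping easier, and it is perfectly legal to scrape any public page that contains microformat data.

Which microformats are best for web scraping?

JSON-LD and Microdata are the most common microformats in web scraping. JSON-LD is the most popular microformat on the web, though Microdata is preferred for web scraping because it usually contains higher-quality data.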
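That preference can be expressed in code as a simple priority order over the dictionary that extruct.extract returns. The sketch below is only one possible policy: pick_preferred is a hypothetical helper, the default order is an assumption based on the preference described above, and the sample dictionary is made up.

```python
def pick_preferred(micro_data, order=("microdata", "json-ld", "rdfa", "opengraph", "microformat")):
    """Return (syntax, items) for the first syntax in `order` that has any data."""
    for syntax in order:
        items = micro_data.get(syntax) or []
        if items:
            return syntax, items
    return None, []


sample_micro_data = {
    "microdata": [],
    "json-ld": [{"@type": "Product", "sku": "1214112656"}],
    "opengraph": [{"properties": [("og:type", "product")]}],
}

syntax, items = pick_preferred(sample_micro_data)
print(syntax, items)
```

Here microdata is empty, so the helper falls through to json-ld, which mirrors what we saw on the Etsy page.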

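Microformat Scraping Summary

In this quick introduction to web scraping microformat data, we took a look at the popular microformat types: JSON-LD, Microdata, RDFa, Open Graph, and Microformat. These formats usually contain schema.org typed data, which makes scraping predictable datasets a breeze.

To illustrate this, we wrapped up our tutorial with a quick Etsy.com scraper, where we scraped product data using just a few lines of code!

Scraping microformats is one of the easiest ways to scrape public data, and although not all page data is available for extraction, it is a great starting point for any scraper.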

Written by 河小马

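河小马 is an accomplished leader in the digital marketing industry and a prominent member of the 广告中国 forum, with expertise spanning PPC advertising, domain parking, website development, affiliate marketing, and cross-border e-commerce consulting. As a senior developer, he combines strong technical skills with more than 13 years of experience in overseas online marketing.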