How to Break into the Sneaker Resale Market with Web Scraping

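The sneaker market is one of the fastest-growing segments of fashion apparel, and a bit of web scraping makes it easy to understand. The web is full of public footwear product data that we can use for free with Python and web scraping libraries. In this overview, we'll use web scraping to demonstrate product performance analysis. We'll scrape Goat.com, one of the most popular footwear marketplaces, and make sense of the data using Python and Jupyter Notebooks. This short demo should give you a taste of what web scraping and data analysis can do.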

Project Setup

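We'll use a basic Jupyter notebook. If you're not familiar with Jupyter, it's a great tool for data analysis and visualization; you can learn more at jupyter.org. Our setup for this article is very simple, and we'll use a few basic Python libraries:

pandas as our data analysis library
matplotlib as our data visualization library
ipywidgets as our interactive widget library, for customizing data visualizations interactively

All of these can be installed with the pip console command: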

$ pip install notebook pandas matplotlib ipywidgets

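Scraping the Dataset

For this tutorial we'll use Goat.com data, as the site is particularly popular for footwear. For our example dataset we'll focus on a single shoe model, the "Air Jordan 3", which can be scraped with a Goat.com scraper. Note that Goat.com uses Cloudflare Bot Management to block web scraping, so collecting the dataset quickly without getting blocked requires additional tooling. Here we'll use the Scrapfly SDK as an example, because it handles bypassing anti-scraping protection for us. The full scraper code: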

import asyncio
import json
import math
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional
from urllib.parse import quote, urlencode
from uuid import uuid4

from scrapfly import ScrapeApiResponse, ScrapeConfig, ScrapflyClient

scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")


def find_hidden_data(result: ScrapeApiResponse) -> dict:
    """extract hidden NEXT_DATA from page html"""
    data = result.selector.css("script#__NEXT_DATA__::text").get()
    data = json.loads(data)
    return data


async def scrape_products(urls: List[str]) -> List[dict]:
    """scrape goat.com product pages for product data; failed pages are skipped"""
    to_scrape = [ScrapeConfig(url=url, cache=True, asp=True) for url in urls]
    products = []
    async for result in scrapfly.concurrent_scrape(to_scrape):
        try:
            data = find_hidden_data(result)
            product = data["props"]["pageProps"]["productTemplate"]
            product["offers"] = data["props"]["pageProps"]["offers"]["offerData"]
            products.append(product)
        except Exception as e:
            print(f"Failed to scrape {result.context['url']}; got {e}")
    return products


async def scrape_search(query: str, max_pages: Optional[int] = 10) -> List[Dict]:
    """scrape goat.com search results for product listing data, up to max_pages pages"""

    def make_page_url(page: int = 1):
        params = {
            "c": "ciojs-client-2.29.12",  # this is hardcoded API version
            "key": "key_XT7bjdbvjgECO5d8",  # API key which is hardcoded in the client
            "i": str(uuid4()),  # unique id for each request, generated by UUID4
            "s": "2",
            "page": page,
            "num_results_per_page": "24",
            "sort_by": "relevance",
            "sort_order": "descending",
            # repeated parameters are given as lists: a dict literal keeps only the last duplicate key
            "fmt_options[hidden_fields]": ["gp_lowest_price_cents_3", "gp_instant_ship_lowest_price_cents_3"],
            "fmt_options[hidden_facets]": ["gp_lowest_price_cents_3", "gp_instant_ship_lowest_price_cents_3"],
            # current timestamp in milliseconds; a naive now() is local time, which timestamp() converts correctly
            "_dt": int(datetime.now().timestamp() * 1000),
        }
        # doseq=True expands list values into repeated query parameters
        return f"https://ac.cnstrc.com/search/{quote(query)}?{urlencode(params, doseq=True)}"

    url_first_page = make_page_url(page=1)
    print(f"scraping product search paging {url_first_page}")
    # scrape first page
    result_first_page = await scrapfly.async_scrape(ScrapeConfig(url=url_first_page, asp=True))
    first_page = json.loads(result_first_page.content)["response"]
    results = [result["data"] for result in first_page["results"]]

    # find total page count
    total_pages = math.ceil(first_page["total_num_results"] / 24)
    if max_pages and max_pages < total_pages:
        total_pages = max_pages

    # scrape remaining pages
    print(f"scraping remaining total pages: {total_pages-1} concurrently")
    to_scrape = [ScrapeConfig(make_page_url(page=page), asp=True) for page in range(2, total_pages + 1)]
    async for result in scrapfly.concurrent_scrape(to_scrape):
        data = json.loads(result.content)
        items = [result["data"] for result in data["response"]["results"]]
        results.extend(items)

    return results
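
The find_hidden_data helper above works because the product page ships its state as JSON inside a script tag with the id __NEXT_DATA__. As a rough illustration of that idea without the Scrapfly SDK, here is a minimal sketch that uses only the standard library. The sample HTML, the extract_next_data name, and the regular expression are assumptions made for this example; a real page carries a far larger payload and an HTML parser would be more robust than a regex.

```python
import json
import re

# Hypothetical, trimmed-down page used only for illustration
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"productTemplate": {"slug": "example-shoe", "color": "White"},
 "offers": {"offerData": [{"size": 9.0, "price": "210.00"}]}}}}
</script>
</body></html>
"""

# Capture everything between the opening and closing tags of the __NEXT_DATA__ script
NEXT_DATA_RE = re.compile(
    r'<script[^>]*id="__NEXT_DATA__"[^>]*>(.*?)</script>', re.DOTALL
)


def extract_next_data(html: str) -> dict:
    """Return the parsed __NEXT_DATA__ JSON, or raise ValueError if it is missing."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script found; the page may be blocked or restructured")
    return json.loads(match.group(1))


# Same reshaping as scrape_products: attach the offers to the product record
page_props = extract_next_data(SAMPLE_HTML)["props"]["pageProps"]
product = dict(page_props["productTemplate"])
product["offers"] = page_props["offers"]["offerData"]
print(product["slug"], product["offers"][0]["price"])
```

Raising an explicit error when the script tag is missing makes a blocked or redesigned page easier to notice than a bare JSON decoding failure.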

To collect our dataset, we can use the following scraping code:

# Scraper code from
# https://www.jingzhengli.com/how-to-scrape-goat-com-fashion-apparel/#full-scraper-code
import json
from pathlib import Path

async def scrape_all_search(query):
    search = await scrape_search(query, max_pages=30)
    product_urls = [f"https://www.goat.com/{item['product_type']}/{item['slug']}" for item in search]
    search_products = await scrape_products(product_urls)
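    # save as e.g. "air-jordan-3.json" so the file name matches the one loaded in the analysis
    filename = query.lower().replace(" ", "-") + ".json"
    Path(filename).write_text(json.dumps(search_products, indent=2, ensure_ascii=False), encoding="utf-8")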

asyncio.run(scrape_all_search("Air Jordan 3"))

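This yields over 600 footwear products, all variants of the "Air Jordan 3" sneaker. Let's see what we can learn from this dataset.

Market Distribution

Market distribution is one of the simplest analyses we can run, and it gives us an overview of the market. We can analyze distribution by product color, designer, and other product features. To begin, let's load the dataset into a pandas DataFrame: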

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_json('air-jordan-3.json')
# convert datetime string to datetime object
data['releaseDate'] = pd.to_datetime(data['releaseDate'])
# convert price strings to floats
for _, row in data.iterrows():
    for size_obj in row['offers']:
        # update the offer dict in place; rebinding size_obj would leave the data unchanged
        size_obj['price'] = float(size_obj['price'])
data.head()

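This gives us a DataFrame with over 600 product rows and many data field columns such as price, color, and release date. Now we can start analyzing. Let's begin with a basic question: which colors does the Air Jordan 3 come in?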

# Color distribution by percentage
color_counts = data['color'].value_counts().head(10)
color_percentages = (color_counts / color_counts.sum()) * 100
ax = color_counts.plot(kind='bar', title='Top 10 Most Common Shoe Colors', figsize=(10, 6))
for i, p in enumerate(ax.patches):
    # .iloc gives positional access; the Series itself is indexed by color name
    ax.annotate(f"{color_percentages.iloc[i]:.1f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='baseline', fontsize=10, color='black', xytext=(0, 5),
                 textcoords='offset points')
plt.xlabel('Color')
plt.ylabel('Shoe count')
plt.show()

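Here we can see the color distribution of the Air Jordan 3 listings on Goat.com. White and black appear to be by far the most common colors. Next, let's see which Air Jordan 3 designer is the most common: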

designer_counts = data['designer'].value_counts().head(10)
designer_counts_perc = (designer_counts / designer_counts.sum()) * 100
ax = designer_counts.plot(kind='bar', title='Top 10 Shoe Designers by Number of Shoes Designed', figsize=(10, 6))
for i, p in enumerate(ax.patches):
    # .iloc gives positional access; the Series itself is indexed by designer name
    ax.annotate(f"{designer_counts_perc.iloc[i]:.1f}%", (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='baseline', fontsize=10, color='black', xytext=(0, 5),
                 textcoords='offset points')
plt.xlabel('Designer')
plt.ylabel('Number of Shoes Designed')
plt.show()

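Here we can see that Tinker Hatfield's Air Jordan 3 designs are by far the most common version of this sneaker.

Distribution analysis is a great way to get a quick read on a market. We could also use this data to build a simple recommendation system that suggests products based on a user's preferences.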
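
To make the recommendation idea a little more concrete, here is a minimal sketch that ranks products by how many of a user's preferred attributes they match. The recommend function, the toy DataFrame, and its values are assumptions made for this example; only the color and designer column names come from the dataset used in this article.

```python
import pandas as pd

# Toy stand-in for the scraped DataFrame; all values are invented for illustration
products = pd.DataFrame({
    "slug": ["shoe-a", "shoe-b", "shoe-c", "shoe-d"],
    "color": ["White", "Black", "White", "Red"],
    "designer": ["Tinker Hatfield", "Tinker Hatfield", "Other Designer", "Other Designer"],
})


def recommend(df: pd.DataFrame, preferences: dict, top_n: int = 3) -> pd.DataFrame:
    """Rank products by the number of preferred attribute values they match."""
    score = pd.Series(0, index=df.index)
    for column, wanted in preferences.items():
        # each exact match on a preferred attribute adds one point
        score += (df[column] == wanted).astype(int)
    ranked = df.assign(score=score).sort_values("score", ascending=False)
    # drop products that match nothing at all
    return ranked[ranked["score"] > 0].head(top_n)


picks = recommend(products, {"color": "White", "designer": "Tinker Hatfield"})
print(picks[["slug", "score"]])
```

Counting exact matches is about the simplest scoring rule possible; weighting attributes differently or adding a price budget would be natural next steps.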

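Price Distribution

The most attractive data field is, of course, price. We can run price-based analyses that apply to product reselling, or even to flipping products between different storefronts. Because our Goat.com dataset contains pricing by shoe size, color, and other features, we can work out which products are in the highest demand. For example, the average price for each shoe size: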

sizes_list = []
prices_list = []

for _, row in data.iterrows():
    for size_obj in row['offers']:
        if not(6 < size_obj['size'] < 16):  # only men sized shoes
            continue
        sizes_list.append(size_obj['size'])
        prices_list.append(float(size_obj['price']))

size_price_df = pd.DataFrame({'size': sizes_list, 'price': prices_list})
average_price_by_size = size_price_df.groupby('size')['price'].mean().sort_values(ascending=False)

# Create the bar chart
average_price_by_size.plot(kind='bar', figsize=(10, 6))
plt.title('Average Price per Shoe Size')
plt.xlabel('Shoe Size')
plt.ylabel('Average Price')
plt.show()

Figure: Air Jordan 3 price distribution by shoe size on Goat.com

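Here we can see that the price distribution is fairly chaotic, but there are clear signs that the smallest sizes, 6.5 and 7, are the cheapest. We can go a step further and see how the price distribution varies with shoe color. Let's add a slider widget that shows the average price of each shoe color for a given shoe size: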

import ipywidgets as widgets
from IPython.display import display

def plot_avg_price_by_color(shoe_size):
    colors_list = []
    prices_list = []

    for _, row in data.iterrows():
        for size_obj in row['offers']:
            if size_obj['size'] != shoe_size:
                continue
            colors_list.append(row['color'])
            prices_list.append(float(size_obj['price']))

    color_price_df = pd.DataFrame({'color': colors_list, 'price': prices_list})
    average_price_by_color = color_price_df.groupby('color')['price'].mean().sort_values(ascending=False)

    # Create the bar chart
    average_price_by_color.plot(kind='bar', figsize=(10, 6))
    plt.title(f'Average Price per Shoe Color for size {shoe_size}')
    plt.xlabel('Shoe Color')
    plt.ylabel('Average Price')
    plt.ylim(0, 4000)
    plt.show()

# Create slider widget
shoe_size_slider = widgets.FloatSlider(value=6, min=6, max=14, step=0.5, description='Shoe Size:')
widgets.interact(plot_avg_price_by_color, shoe_size=shoe_size_slider)

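Here we can see how ipywidgets lets us create interactive visualizations from custom input with very little effort. We can use this data to quickly find and visualize market outliers.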
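
One possible way to find such outliers is the interquartile range (IQR) rule applied per shoe size. The sketch below is an assumption about how this could be done, not part of the original analysis: the toy data is invented, and the 1.5 multiplier is just the conventional default for this rule.

```python
import pandas as pd

# Toy stand-in for the scraped dataset: one row per product with nested offers.
# All values are invented for illustration.
products = pd.DataFrame({
    "slug": ["shoe-a", "shoe-b", "shoe-c", "shoe-d", "shoe-e"],
    "offers": [
        [{"size": 9.0, "price": "200"}],
        [{"size": 9.0, "price": "210"}],
        [{"size": 9.0, "price": "190"}],
        [{"size": 9.0, "price": "205"}],
        [{"size": 9.0, "price": "900"}],
    ],
})

# Flatten to one row per (product, size) offer
offers = products.explode("offers", ignore_index=True)
offers["size"] = offers["offers"].map(lambda o: float(o["size"]))
offers["price"] = offers["offers"].map(lambda o: float(o["price"]))
offers = offers.drop(columns="offers")

# Quartiles computed within each shoe size and broadcast back to every row
grouped = offers.groupby("size")["price"]
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# IQR rule: flag prices more than 1.5 * IQR outside the middle 50% of their size group
offers["outlier"] = (offers["price"] < q1 - 1.5 * iqr) | (offers["price"] > q3 + 1.5 * iqr)
print(offers[offers["outlier"]])
```

Grouping by size first matters here, since a price that is ordinary for a rare size could look extreme when compared against the whole catalog.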

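Price History Analysis

Another way to understand the market is to look at how a product changes over time. We can track price changes to spot deals or even to predict future price movements. We can also track changes in product features and, in some cases, product stock to anticipate shifts in supply and demand. To do any of this we need to scrape our target continuously, because most websites, Goat.com included, don't provide historical data. Going back to our Goat.com scraper example, we can write a scraper that collects data every day and stores it for further analysis: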

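import asyncio
import json
from pathlib import Path
from datetime import datetime, timezone

async def scrape_daily(url):
    # scrape_products() returns a list; it is empty if the scrape failed
    products = await scrape_products([url])
    if not products:
        return
    product = products[0]
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    Path(f"{product['slug']}_{now}.json").write_text(json.dumps(product, indent=2, ensure_ascii=False))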

asyncio.run(scrape_daily("https://www.goat.com/sneakers/air-jordan-3-retro-white-cement-reimagined-dn3707-100"))

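Each time it runs, this scraper creates a JSON file with the current date stamp in its file name. If we schedule it to run daily, we'll have a complete history of the product: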

air-jordan-3-retro-white-cement-reimagined-dn3707-100_2021-04-16.json
air-jordan-3-retro-white-cement-reimagined-dn3707-100_2021-04-17.json
air-jordan-3-retro-white-cement-reimagined-dn3707-100_2021-04-18.json
air-jordan-3-retro-white-cement-reimagined-dn3707-100_2021-04-19.json
air-jordan-3-retro-white-cement-reimagined-dn3707-100_2021-04-20.json
air-jordan-3-retro-white-cement-reimagined-dn3707-100_2021-04-21.json
air-jordan-3-retro-white-cement-reimagined-dn3707-100_2021-04-22.json
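
How the daily run is scheduled is left open here; a system scheduler such as cron is a common choice. As an alternative that stays inside Python, the following is a minimal sketch of a long-running asyncio loop. The names seconds_until_next_run and run_daily, and the 06:00 UTC run time, are assumptions made for this example.

```python
import asyncio
from datetime import datetime, timedelta, timezone
from typing import Awaitable, Callable, Optional


def seconds_until_next_run(now: datetime, hour: int = 6) -> float:
    """Seconds from `now` until the next occurrence of `hour`:00 in now's timezone."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # today's slot has already passed, so wait for tomorrow's
        target += timedelta(days=1)
    return (target - now).total_seconds()


async def run_daily(job: Callable[[], Awaitable[None]], hour: int = 6,
                    max_runs: Optional[int] = None) -> None:
    """Await `job()` once per day at `hour`:00 UTC; stop after `max_runs` runs if given."""
    runs = 0
    while max_runs is None or runs < max_runs:
        await asyncio.sleep(seconds_until_next_run(datetime.now(timezone.utc), hour))
        await job()
        runs += 1


# 22:30 UTC is 7.5 hours before the next 06:00 UTC slot
example_now = datetime(2021, 4, 16, 22, 30, tzinfo=timezone.utc)
print(seconds_until_next_run(example_now))
```

With the scrape_daily function from this article, the loop could be started with asyncio.run(run_daily(lambda: scrape_daily(url))), where url is the product page to track.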

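Let's see how to use these daily snapshots to analyze the price and change history of the "Air Jordan 3 Retro White Cement Reimagined" sneaker. First, let's load the daily datasets into a pandas DataFrame: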

from pathlib import Path

import json
import pandas as pd
import matplotlib.pyplot as plt

price_data = []

# match only the dated daily files (slug_YYYY-MM-DD.json), not the air-jordan-3.json search dataset
for file in sorted(Path().glob("air-jordan-3-*_*.json")):
    product = json.loads(file.read_text())
    date = file.name.split('_')[-1].split('.')[0]
    date = pd.to_datetime(date, format='%Y-%m-%d')

    for offer in product["offers"]:
        size = offer["size"]
        price = float(offer["price"])
        price_data.append({"date": date, "size": size, "price": price})
price_df = pd.DataFrame(price_data)
price_df.head()

With this data we can do a lot of interesting price analysis. First, let's see how the price changes over time:

# Pivot the DataFrame to have shoe sizes as columns and dates as index
price_pivot_df = price_df.pivot_table(index="date", columns="size", values="price")

# Create a line chart for price trends
price_pivot_df.plot.line(figsize=(12, 8), marker='o')
plt.title("Price Trend for Air Jordan 3 Retro White Cement Reimagined (DN3707-100)")
plt.xlabel("Date")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.legend(title="Shoe Size")
plt.show()

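Price trend analysis can give us a lot of insight into the market. For example, we can see that the price of this shoe has risen sharply. We can also see that prices for some sizes deviate less than for others, so let's take a look at price volatility: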

price_volatility = price_pivot_df.std()
price_volatility.plot.bar(figsize=(10, 6))
plt.title("Price Volatility for Each Shoe Size")
plt.xlabel("Shoe Size")
plt.ylabel("Standard Deviation")
plt.xticks(rotation=0)
plt.show()

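In this example, we can see that the more common men's shoe sizes tend to be less volatile in price than the rarer sizes. This suggests that the market for more common sizes is more stable, likely because of higher trading volume. To help anticipate future price trends, we can also plot the price distribution for each size: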

price_pivot_df.plot.box(figsize=(12, 8))
plt.title("Price Distribution for Each Shoe Size")
plt.xlabel("Shoe Size")
plt.ylabel("Price")
plt.xticks(rotation=0)
plt.show()
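
One caveat with the volatility chart is that a standard deviation is measured in absolute price units, so expensive sizes can look more volatile simply because they cost more. A relative measure such as the coefficient of variation (standard deviation divided by mean) may give a fairer comparison. The sketch below uses invented prices shaped like price_pivot_df; the numbers are assumptions chosen to make the effect visible.

```python
import pandas as pd

# Invented daily prices per shoe size, shaped like price_pivot_df
prices = pd.DataFrame(
    {8.0: [100.0, 110.0, 90.0], 13.0: [400.0, 410.0, 390.0]},
    index=pd.to_datetime(["2021-04-16", "2021-04-17", "2021-04-18"]),
)

# Both sizes swing by the same absolute amount...
std = prices.std()
# ...but relative to its price level, the cheaper size moves four times as much
cv = std / prices.mean()

print(pd.DataFrame({"std": std, "cv": cv}))
```

In this toy data both sizes have the same standard deviation, yet the coefficient of variation shows that the cheaper size is the more volatile one in relative terms.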

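These are just simple examples of how easy it is to scrape and analyze footwear market data. The same techniques can be applied to other markets such as clothing, accessories, or even collectibles.

Summary

In this quick overview, we looked at how web scraping and basic Python analysis tools can be used to dig into the footwear market. We scraped Air Jordan 3 sneaker data from Goat.com and analyzed the market distribution by product features such as shoe size and color. We then reviewed a week of collected price data for a single product and found that some shoe sizes are more volatile than others. Web scraping is a great resource for market research and analysis because the web is full of free public data. With Python and existing web scraping and data analysis tools, we can do quite a lot of market analysis in just a few lines of code!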

Written by 河小马
