In this Python web scraping tutorial we'll explore Instagram - one of the largest social media websites. We'll take a look at how to scrape Instagram user and post data. We'll also highlight some tips and tricks for accessing these endpoints efficiently, avoiding scraper blocking, and retrieving all of this information without logging in to Instagram. So, let's dive in!
Project Setup
In this Instagram web scraping tutorial we'll be using Python with the httpx HTTP client library, which will power all of our interactions with Instagram's servers. We'll also use the jmespath JSON parsing library, which will help us reduce the huge datasets we receive from Instagram to just the most important bits: photo URLs, comments, counts and so on.
We can install all of these packages with a single console command:
```shell
$ pip install httpx jmespath
```
Note - Login Requirements
Many Instagram endpoints require a login, but not all of them. In this tutorial we'll only cover endpoints that are publicly accessible to everyone without logging in. Scraping Instagram while logged in can have serious unintended consequences: your account can be blocked, and Instagram can even pursue legal action, as scraping behind a login explicitly violates its Terms of Service. As we'll see throughout this tutorial, logging in is usually unnecessary, so let's take a look at how to scrape Instagram without an account and without risking a suspension.
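Even without logging in, it helps to pace our requests and back off when Instagram starts returning errors, so our scraper doesn't get blocked. Below is a minimal, hypothetical sketch of such a helper (the `polite_get` and `backoff_delays` names are our own, not part of any library); it works with any client object that exposes a `.get()` method, such as `httpx.Client`:

```python
import time

def backoff_delays(retries, base=1.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def polite_get(client, url, retries=3):
    """GET a URL, backing off between failed attempts.

    `client` is any object with a .get(url) method returning a response
    with a .status_code attribute (e.g. httpx.Client).
    """
    resp = None
    for delay in backoff_delays(retries):
        resp = client.get(url)
        if resp.status_code == 200:
            return resp
        time.sleep(delay)  # wait before retrying to avoid hammering the server
    return resp  # caller can inspect the last failed response
```

On success the helper returns immediately; on repeated failures the wait time doubles with every attempt, which is a common way to be respectful to the target server.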
Scraping Instagram User Data
Let's start by scraping user profiles. For that, we'll use Instagram's backend API endpoint, which is triggered by the browser when it loads a profile URL. For example, here is Google's Instagram profile page: https://www.instagram.com/google/
This endpoint is called on page load and returns a JSON dataset containing all of the user's data. We can use it to scrape Instagram user data without logging in to Instagram:
```python
import json

import httpx

client = httpx.Client(
    headers={
        # this is an internal ID of an Instagram backend app. It doesn't change often.
        "x-ig-app-id": "936619743392459",
        # use browser-like features
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "*/*",
    }
)

def scrape_user(username: str):
    """Scrape Instagram user's data"""
    result = client.get(
        f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
    )
    data = json.loads(result.content)
    return data["data"]["user"]

print(scrape_user("google"))
```
Example Output

This approach will return Instagram user data such as bio description, follower counts, profile pictures etc:

```json
{
  "biography": "Google unfiltered—sometimes with filters.",
  "external_url": "https://linkin.bio/google",
  "external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=ATOaH1Vrx_TkkMUhpCCh1_PM-C1k5t35gAtJ0eBjTPE84RItj-cCFdqRoRHwlbiCSrB5G_v6MgjePl1SQN4vTw&s=1",
  "edge_followed_by": {
    "count": 13015078
  },
  "fbid": "17841401778116675",
  "edge_follow": {
    "count": 33
  },
  "full_name": "Google",
  "highlight_reel_count": 5,
  "id": "1067259270",
  "is_business_account": true,
  "is_professional_account": true,
  "is_supervision_enabled": false,
  "is_guardian_of_viewer": false,
  "is_supervised_by_viewer": false,
  "is_embeds_disabled": false,
  "is_joined_recently": false,
  "guardian_id": null,
  "is_verified": true,
  "profile_pic_url": "https://instagram.furt1-1.fna.fbcdn.net/v/t51.2885-19/126151620_3420222801423283_6498777152086077438_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.furt1-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=bmDCZ2Q8wTkAX-Ilbqq&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9pRKzLtnysPjhclN6TprCd9FBWo2ABbn9cRICPhbQZcA&oe=62882D44&_nc_sid=7bff83",
  "username": "google",
  ...
}
```

This is a great, easy method to scrape Instagram profiles - it even includes the details of the first 12 posts, including photos and videos!

Parsing Instagram User Data

The user dataset we scraped can be a bit daunting as it contains a lot of data. To reduce it to the most important bits, we can use jmespath.
```python
from typing import Dict

import jmespath

def parse_user(data: Dict) -> Dict:
    """Parse Instagram user's hidden web dataset for user's data"""
    result = jmespath.search(
        """{
        name: full_name,
        username: username,
        id: id,
        category: category_name,
        business_category: business_category_name,
        phone: business_phone_number,
        email: business_email,
        bio: biography,
        bio_links: bio_links[].url,
        homepage: external_url,
        followers: edge_followed_by.count,
        follows: edge_follow.count,
        facebook_id: fbid,
        is_private: is_private,
        is_verified: is_verified,
        profile_image: profile_pic_url_hd,
        video_count: edge_felix_video_timeline.count,
        videos: edge_felix_video_timeline.edges[].node.{
            id: id,
            title: title,
            shortcode: shortcode,
            thumb: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            duration: video_duration
        },
        image_count: edge_owner_to_timeline_media.count,
        images: edge_owner_to_timeline_media.edges[].node.{
            id: id,
            title: title,
            shortcode: shortcode,
            src: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            accessibility_caption: accessibility_caption,
            duration: video_duration
        },
        saved_count: edge_saved_media.count,
        collections_count: edge_saved_media.count,
        related_profiles: edge_related_profiles.edges[].node.username
    }""",
        data,
    )
    return result
```
This function takes the full dataset and reduces it to a much flatter structure containing only the important fields. We're using JMESPath's reshaping feature, which lets us distill the dataset into a brand-new structure.
Scraping Instagram Post Data
To scrape Instagram post data we'll use the same approach as before, but this time with the post endpoint. To generate post views dynamically, Instagram uses a GraphQL backend query that returns post data, comments, likes and other information. We can use this endpoint to scrape post data. All Instagram GraphQL endpoints are accessed through:

https://www.instagram.com/graphql/query/?query_hash=<>&variables=<>
Where the query hash and variables define the query functionality. For example, to scrape post data we'll use the following query hash and variables:
```python
{
    "query_hash": "b3055c01b4b222b8a47dc12b090e4e64",  # the post query hash, which doesn't change
    "variables": {
        "shortcode": "CQYQ1Y1nZ1Y",  # post shortcode (from the URL)
        # how many and what comments to include
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    },
}
```
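The variables are serialized to JSON and URL-encoded into the query string. As a quick standard-library-only sketch of just that encoding step (using the shortcode from the snippet above):

```python
import json
from urllib.parse import quote

variables = {"shortcode": "CQYQ1Y1nZ1Y", "child_comment_count": 20}
# serialize to JSON, then percent-encode for use in the query string
encoded = quote(json.dumps(variables))
url = (
    "https://www.instagram.com/graphql/query/"
    "?query_hash=b3055c01b4b222b8a47dc12b090e4e64"
    f"&variables={encoded}"
)
print(url)
```

The resulting URL contains the variables as a percent-encoded JSON blob, which is exactly what Instagram's GraphQL endpoint expects.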
So, to scrape it in Python, we'll use the following code:
```python
import json
from typing import Dict
from urllib.parse import quote

import httpx

INSTAGRAM_APP_ID = "936619743392459"  # this is the public app id for instagram.com

def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")
    variables = {
        "shortcode": shortcode,
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    }
    url = "https://www.instagram.com/graphql/query/?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables="
    result = httpx.get(
        url=url + quote(json.dumps(variables)),
        headers={"x-ig-app-id": INSTAGRAM_APP_ID},
    )
    data = json.loads(result.content)
    return data["data"]["shortcode_media"]

# Example usage:
post = scrape_post("https://www.instagram.com/p/CQYQ1Y1nZ1Y/")
print(json.dumps(post, indent=2, ensure_ascii=False))
```
This scraping approach returns the entire post dataset, which includes many useful fields such as the post caption, comments, likes and other details. However, it also includes many flags and unnecessary fields that usually aren't very useful. To reduce the scraped dataset, let's take a look at JSON parsing with jmespath next.
Parsing Instagram Post Data
Instagram post data is even more complex than the user profile data, so we'll use jmespath to reduce it once again.
```python
from typing import Dict

import jmespath

def parse_post(data: Dict) -> Dict:
    """Parse Instagram post dataset, reducing it to the most useful fields"""
    result = jmespath.search(
        """{
        id: id,
        shortcode: shortcode,
        dimensions: dimensions,
        src: display_url,
        src_attached: edge_sidecar_to_children.edges[].node.display_url,
        has_audio: has_audio,
        video_url: video_url,
        views: video_view_count,
        plays: video_play_count,
        likes: edge_media_preview_like.count,
        location: location.name,
        taken_at: taken_at_timestamp,
        related: edge_web_media_to_related_media.edges[].node.shortcode,
        type: product_type,
        video_duration: video_duration,
        music: clips_music_attribution_info,
        is_video: is_video,
        tagged_users: edge_media_to_tagged_user.edges[].node.user.username,
        captions: edge_media_to_caption.edges[].node.text,
        related_profiles: edge_related_profiles.edges[].node.username,
        comments_count: edge_media_to_parent_comment.count,
        comments_disabled: comments_disabled,
        comments_next_page: edge_media_to_parent_comment.page_info.end_cursor,
        comments: edge_media_to_parent_comment.edges[].node.{
            id: id,
            text: text,
            created_at: created_at,
            owner: owner.username,
            owner_verified: owner.is_verified,
            viewer_has_liked: viewer_has_liked,
            likes: edge_liked_by.count
        }
    }""",
        data,
    )
    return result
```
Here, just like before, we use jmespath to pick out the most useful data fields from the huge JSON response our scraper receives. Note that different post types (reels, images, videos etc.) have different fields available.
Scraping All User Posts
To retrieve a user's posts and post comments, we'll use yet another GraphQL endpoint, which requires three variables: the user's ID (which we got from scraping the user's profile earlier), the page size and the page offset cursor:
```json
{
  "id": "NUMERIC USER ID",
  "first": 12,
  "after": "CURSOR ID FOR PAGING"
}
```
For example, if we wanted to retrieve all Instagram posts created by Google, we'd first have to retrieve that user's ID and then compile our GraphQL request.
In Google's case, the GraphQL URL would be:

https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={"id":1067259270,"first":12}

We can try it in our browser, and we should see a JSON response containing the data of the first 12 posts, which includes details such as:
- Post photos and videos
- The first page of post comments
- Post metadata such as view and comment counts
However, to retrieve all posts we need to implement pagination logic, since the information is spread across multiple pages.
```python
import json
from urllib.parse import quote

import httpx

def scrape_user_posts(user_id: str, session: httpx.Client, page_size=12, page_limit: int = None):
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        resp = session.get(base_url + quote(json.dumps(variables)))
        data = resp.json()
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])  # note: we're using the parse_post function from the previous chapter
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        if page_limit and _page_number >= page_limit:
            break  # stop early when an optional page limit is given
        variables["after"] = page_info["end_cursor"]
        _page_number += 1

# Example run:
if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        posts = list(scrape_user_posts("1067259270", session, page_limit=3))
        print(json.dumps(posts, indent=2, ensure_ascii=False))
```
Example Output

```json
[
  {
    "__typename": "GraphImage",
    "id": "2890253001563912589",
    "dimensions": {
      "height": 1080,
      "width": 1080
    },
    "display_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-C93CjLzMapgPHOinoltBXypU_wi7s6zzLj1th-s9p-Q&oe=62E80627&_nc_sid=86f79a",
    "display_resources": [
      {
        "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
        "config_width": 640,
        "config_height": 640
      },
      "..."
    ],
    "is_video": false,
    "tracking_token": "eyJ2ZXJzaW9uIjo1LCJwYXlsb2FkIjp7ImlzX2FuYWx5dGljc190cmFja2VkIjp0cnVlLCJ1dWlkIjoiOWJiNzUyMjljMjU2NDExMTliOGI4NzM5MTE2Mjk4MTYyODkwMjUzMDAxNTYzOTEyNTg5In0sInNpZ25hdHVyZSI6IiJ9",
    "edge_media_to_tagged_user": {
      "edges": [
        {
          "node": {
            "user": {
              "full_name": "Jahmar Gale | Data Analyst",
              "id": "51661809026",
              "is_verified": false,
              "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/284007837_5070066053047326_6283083692098566083_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=KXI8oOdZRb4AX8w28nr&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-4iYsawdTCHI5a2zD_PF9F-WCyKnTIPuvYwVAQo82l_w&oe=62E7609B&_nc_sid=86f79a",
              "username": "datajayintech"
            },
            "x": 0.68611115,
            "y": 0.32222223
          }
        },
        "..."
      ]
    },
    "accessibility_caption": "A screenshot of a tweet from @DataJayInTech, which says: \"A recruiter just called me and said The Google Data Analytics Certificate is a good look. This post is to encourage YOU to finish the course.\" The background of the image is red with white, yellow, and blue geometric shapes.",
    "edge_media_to_caption": {
      "edges": [
        {
          "node": {
            "text": "Ring, ring — opportunity is calling📱\nStart your Google Career Certificate journey at the link in bio. #GrowWithGoogle"
          }
        },
        "..."
      ]
    },
    "shortcode": "CgcPcqtOTmN",
    "edge_media_to_comment": {
      "count": 139,
      "page_info": {
        "has_next_page": true,
        "end_cursor": "QVFCaU1FNGZiNktBOWFiTERJdU80dDVwMlNjTE5DWTkwZ0E5NENLU2xLZnFLemw3eTJtcU54ZkVVS2dzYTBKVEppeVpZbkd4dWhQdktubW1QVzJrZXNHbg=="
      },
      "edges": [
        {
          "node": {
            "id": "18209382946080093",
            "text": "@google your company is garbage for meddling with supposedly fair elections...you have been exposed",
            "created_at": 1658867672,
            "did_report_as_spam": false,
            "owner": {
              "id": "39246725285",
              "is_verified": false,
              "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/115823005_750712482350308_4191423925707982372_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=104&_nc_ohc=4iOCWDHJLFAAX-JFPh7&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9sH7npBTmHN01BndUhYVreHOk63OqZ5ISJlzNou3QD8A&oe=62E87360&_nc_sid=86f79a",
              "username": "bud_mcgrowin"
            },
            "viewer_has_liked": false
          }
        },
        "..."
      ]
    },
    "edge_media_to_sponsor_user": {
      "edges": []
    },
    "comments_disabled": false,
    "taken_at_timestamp": 1658765028,
    "edge_media_preview_like": {
      "count": 9251,
      "edges": []
    },
    "gating_info": null,
    "fact_check_overall_rating": null,
    "fact_check_information": null,
    "media_preview": "ACoqbj8KkijDnBOfpU1tAkis8mcL2H0zU8EMEqh1Dc56H0/KublclpoejKoo3WtylMgQ4HeohW0LKJ+u7PueaX+z4v8Aa/OmoNJJ6kqtG3UxT0pta9xZRxxswzkDjJrIoatuawkpq6NXTvuN9f6VdDFeAMAdsf8A16oWDKFYMQMnuR6e9Xd8f94fmtax2OGqnzsk3n/I/wDsqN7f5H/2VR74/wC8PzWlEkY7g/iv+NVcys+wy5JML59P89zWDW3dSx+UwGMnjjH9KxKynud1BWi79wpQM+g+tJRUHQO2+4pCuO4pKKAFFHP+RSUUgP/Z",
    "owner": {
      "id": "1067259270",
      "username": "google"
    },
    "location": null,
    "viewer_has_liked": false,
    "viewer_has_saved": false,
    "viewer_has_saved_to_collection": false,
    "viewer_in_photo_of_you": false,
    "viewer_can_reshare": true,
    "thumbnail_src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
    "thumbnail_resources": [
      {
        "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9nmASHsbmNWUQnwOdkGE4PvE8b27MqK-gbj5z0YLu8qg&oe=62E80627&_nc_sid=86f79a",
        "config_width": 150,
        "config_height": 150
      },
      "..."
    ]
  },
  ...
]
```
Building a Profile - Hashtag Mentions
Now that we can scrape all of a user's posts, we can try a popular analytics exercise: scrape all posts and extract hashtag mentions. For that, let's scrape all posts, extract the hashtags mentioned in the post descriptions, and tally everything up:
```python
import re
from collections import Counter

import httpx

def scrape_hashtag_mentions(user_id, session: httpx.Client, page_limit: int = None):
    """find all hashtags the user mentioned in their posts"""
    hashtags = Counter()
    hashtag_pattern = re.compile(r"#(\w+)")
    for post in scrape_user_posts(user_id, session=session, page_limit=page_limit):
        desc = "\n".join(post["captions"])
        found = hashtag_pattern.findall(desc)
        for tag in found:
            hashtags[tag] += 1
    return hashtags
```
Run Code & Example Output

```python
import json

import httpx

if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        # if we only know the username but not the user id, we can scrape
        # the user profile to find the id:
        user_id = scrape_user("google")["id"]  # will result in: 1067259270
        # then we can scrape the hashtag profile
        hashtags = scrape_hashtag_mentions(user_id, session, page_limit=5)
        # order results and print them as JSON:
        print(json.dumps(dict(hashtags.most_common()), indent=2, ensure_ascii=False))
```

```json
{
  "MadeByGoogle": 10,
  "TeamPixel": 5,
  "GrowWithGoogle": 4,
  "Pixel7": 3,
  "LifeAtGoogle": 3,
  "SaferWithGoogle": 3,
  "Pixel6a": 3,
  "DoodleForGoogle": 2,
  "MySuperG": 2,
  "ShotOnPixel": 1,
  "DayInTheLife": 1,
  "DITL": 1,
  "GoogleAustin": 1,
  "Austin": 1,
  "NestWifi": 1,
  "NestDoorbell": 1,
  "GoogleATAPAmbientExperiments": 1,
  "GoogleATAPxKOCHE": 1,
  "SoliATAP": 1,
  "GooglePixelWatch": 1,
  "Chromecast": 1,
  "DooglersAroundTheWorld": 1,
  "GoogleSearch": 1,
  "GoogleSingapore": 1,
  "InternationalDogDay": 1,
  "Doogler": 1,
  "BlackBusinessMonth": 1,
  "PixelBuds": 1,
  "HowTo": 1,
  "Privacy": 1,
  "Settings": 1,
  "GoogleDoodle": 1,
  "NationalInternDay": 1,
  "GoogleInterns": 1,
  "Sushi": 1,
  "StopMotion": 1,
  "LetsInternetBetter": 1
}
```
With this simple analytics script we've collected profile hashtags, which we can use to determine the interests of any public Instagram account.
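From here, the Counter can be turned into simple statistics - for example, the share of total mentions each tag accounts for. The numbers below reuse a few counts from the example output above:

```python
from collections import Counter

# a few counts from the example output above, for illustration
hashtags = Counter({"MadeByGoogle": 10, "TeamPixel": 5, "GrowWithGoogle": 4, "Pixel7": 3})

total = sum(hashtags.values())  # 22 mentions in total
# (tag, count, percentage of all mentions) for the top 3 tags
top = [
    (tag, count, round(count / total * 100, 1))
    for tag, count in hashtags.most_common(3)
]
print(top)  # [('MadeByGoogle', 10, 45.5), ('TeamPixel', 5, 22.7), ('GrowWithGoogle', 4, 18.2)]
```

Percentages like these make it easier to compare accounts with very different posting volumes.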
FAQ
To wrap up this guide, let's take a look at some frequently asked questions about web scraping instagram.com:

Is web scraping instagram.com legal?
Yes. The data on Instagram is publicly available, so scraping instagram.com at a slow, respectful rate falls under the definition of ethical scraping. However, when dealing with personal data we need to be aware of local copyright and user data laws, such as the GDPR in the EU. For more information, see our article Is Web Scraping Legal?

How to get an Instagram user ID from a username?
To get a user's private ID from a public username, we can scrape the user profile with our scrape_user function; the private ID will be in the id field:
```python
with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
    user_id = scrape_user("google")["id"]
    print(user_id)
```
How to get an Instagram username from a user ID?
To get a public username from a private Instagram user ID, we can take advantage of the public iPhone API at https://i.instagram.com/api/v1/users/<USER_ID>/info/:
```python
import httpx

iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90 (iPhone9,4; iOS 10_3_3; en_US; en-US; scale=2.61; gamut=wide; 1080x1920)"
resp = httpx.get(iphone_api.format("1067259270"), headers={"User-Agent": iphone_user_agent})
print(resp.json()["user"]["username"])
```
The magic parameter __a=1 no longer works?
Instagram keeps rolling out changes and is gradually retiring this feature. However, in this article we've covered two alternatives to the ?__a=1 feature: the better-performing /v1/ API endpoints and the GraphQL endpoints!