In this Python web scraping tutorial we'll explore Instagram - one of the largest social media websites. We'll take a look at how to scrape Instagram user and post data. We'll also highlight some tips and tricks for accessing these endpoints efficiently, avoiding scraper blocking, and retrieving all of this information without logging in to Instagram. So, let's dive in!
Project Setup
In this Instagram web scraping tutorial we'll be using Python with the httpx HTTP client library, which will power all of our interactions with Instagram's servers. We'll also use the jmespath JSON parsing library, which will help us reduce the huge datasets we receive from Instagram to just the most important bits: photo URLs, comments, counts and so on.
We can install all of these packages with a single console command:
```shell
$ pip install httpx jmespath
```
Note - Login Requirements
Many Instagram endpoints require a login, but not all of them. In this tutorial we'll only cover endpoints that are publicly accessible to everyone without logging in. Scraping Instagram while logged in can have serious unintended consequences: your account can be blocked, and Instagram can even pursue legal action, as scraping behind a login explicitly violates its Terms of Service. As we'll see throughout this tutorial, logging in is usually unnecessary, so let's take a look at how to scrape Instagram without an account and without risking a suspension.
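Even without logging in, it helps to pace our requests and back off when Instagram starts returning errors, so our scraper doesn't get blocked. Below is a minimal, hypothetical sketch of such a helper (the `polite_get` and `backoff_delays` names are our own, not part of any library); it works with any client object that exposes a `.get()` method, such as `httpx.Client`:

```python
import time

def backoff_delays(retries, base=1.0):
    """Exponential backoff schedule: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(retries)]

def polite_get(client, url, retries=3):
    """GET a URL, backing off between failed attempts.

    `client` is any object with a .get(url) method returning a response
    with a .status_code attribute (e.g. httpx.Client).
    """
    resp = None
    for delay in backoff_delays(retries):
        resp = client.get(url)
        if resp.status_code == 200:
            return resp
        time.sleep(delay)  # wait before retrying to avoid hammering the server
    return resp  # caller can inspect the last failed response
```

On success the helper returns immediately; on repeated failures the wait time doubles with every attempt, which is a common way to be respectful to the target server.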
Scraping Instagram User Data
Let's start by scraping user profiles. For that, we'll use Instagram's backend API endpoint, which is triggered by the browser when it loads a profile URL. For example, here is Google's Instagram profile page: https://www.instagram.com/google/
This endpoint is called on page load and returns a JSON dataset containing all of the user's data. We can use it to scrape Instagram user data without logging in to Instagram:
```python
import json

import httpx

client = httpx.Client(
    headers={
        # this is an internal ID of an Instagram backend app. It doesn't change often.
        "x-ig-app-id": "936619743392459",
        # use browser-like features
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9,ru;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept": "*/*",
    }
)

def scrape_user(username: str):
    """Scrape Instagram user's data"""
    result = client.get(
        f"https://i.instagram.com/api/v1/users/web_profile_info/?username={username}",
    )
    data = json.loads(result.content)
    return data["data"]["user"]

print(scrape_user("google"))
```
Example Output

This approach will return Instagram user data such as bio description, follower counts, profile pictures etc:

```json
{
  "biography": "Google unfiltered—sometimes with filters.",
  "external_url": "https://linkin.bio/google",
  "external_url_linkshimmed": "https://l.instagram.com/?u=https%3A%2F%2Flinkin.bio%2Fgoogle&e=ATOaH1Vrx_TkkMUhpCCh1_PM-C1k5t35gAtJ0eBjTPE84RItj-cCFdqRoRHwlbiCSrB5G_v6MgjePl1SQN4vTw&s=1",
  "edge_followed_by": {
    "count": 13015078
  },
  "fbid": "17841401778116675",
  "edge_follow": {
    "count": 33
  },
  "full_name": "Google",
  "highlight_reel_count": 5,
  "id": "1067259270",
  "is_business_account": true,
  "is_professional_account": true,
  "is_supervision_enabled": false,
  "is_guardian_of_viewer": false,
  "is_supervised_by_viewer": false,
  "is_embeds_disabled": false,
  "is_joined_recently": false,
  "guardian_id": null,
  "is_verified": true,
  "profile_pic_url": "https://instagram.furt1-1.fna.fbcdn.net/v/t51.2885-19/126151620_3420222801423283_6498777152086077438_n.jpg?stp=dst-jpg_s150x150&_nc_ht=instagram.furt1-1.fna.fbcdn.net&_nc_cat=1&_nc_ohc=bmDCZ2Q8wTkAX-Ilbqq&edm=ABfd0MgBAAAA&ccb=7-4&oh=00_AT9pRKzLtnysPjhclN6TprCd9FBWo2ABbn9cRICPhbQZcA&oe=62882D44&_nc_sid=7bff83",
  "username": "google",
  ...
}
```

This is a great, easy method to scrape Instagram profiles - it even includes the details of the first 12 posts, including photos and videos!

Parsing Instagram User Data

The user dataset we scraped can be a bit daunting as it contains a lot of data. To reduce it to the most important bits, we can use jmespath.
```python
from typing import Dict

import jmespath

def parse_user(data: Dict) -> Dict:
    """Parse Instagram user's hidden web dataset for user's data"""
    result = jmespath.search(
        """{
        name: full_name,
        username: username,
        id: id,
        category: category_name,
        business_category: business_category_name,
        phone: business_phone_number,
        email: business_email,
        bio: biography,
        bio_links: bio_links[].url,
        homepage: external_url,
        followers: edge_followed_by.count,
        follows: edge_follow.count,
        facebook_id: fbid,
        is_private: is_private,
        is_verified: is_verified,
        profile_image: profile_pic_url_hd,
        video_count: edge_felix_video_timeline.count,
        videos: edge_felix_video_timeline.edges[].node.{
            id: id,
            title: title,
            shortcode: shortcode,
            thumb: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            duration: video_duration
        },
        image_count: edge_owner_to_timeline_media.count,
        images: edge_owner_to_timeline_media.edges[].node.{
            id: id,
            title: title,
            shortcode: shortcode,
            src: display_url,
            url: video_url,
            views: video_view_count,
            tagged: edge_media_to_tagged_user.edges[].node.user.username,
            captions: edge_media_to_caption.edges[].node.text,
            comments_count: edge_media_to_comment.count,
            comments_disabled: comments_disabled,
            taken_at: taken_at_timestamp,
            likes: edge_liked_by.count,
            location: location.name,
            accessibility_caption: accessibility_caption,
            duration: video_duration
        },
        saved_count: edge_saved_media.count,
        collections_count: edge_saved_media.count,
        related_profiles: edge_related_profiles.edges[].node.username
    }""",
        data,
    )
    return result
```
This function takes the full dataset and reduces it to a much flatter structure containing only the important fields. We're using JMESPath's reshaping feature, which lets us distill the dataset into a brand-new structure.
Scraping Instagram Post Data
To scrape Instagram post data we'll use the same approach as before, but this time with the post endpoint. To generate post views dynamically, Instagram uses a GraphQL backend query that returns post data, comments, likes and other information. We can use this endpoint to scrape post data. All Instagram GraphQL endpoints are accessed through:

https://www.instagram.com/graphql/query/?query_hash=<>&variables=<>
Where the query hash and variables define the query functionality. For example, to scrape post data we'll use the following query hash and variables:
```python
{
    "query_hash": "b3055c01b4b222b8a47dc12b090e4e64",  # the post query hash, which doesn't change
    "variables": {
        "shortcode": "CQYQ1Y1nZ1Y",  # post shortcode (from the URL)
        # how many and what comments to include
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    },
}
```
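The variables are serialized to JSON and URL-encoded into the query string. As a quick standard-library-only sketch of just that encoding step (using the shortcode from the snippet above):

```python
import json
from urllib.parse import quote

variables = {"shortcode": "CQYQ1Y1nZ1Y", "child_comment_count": 20}
# serialize to JSON, then percent-encode for use in the query string
encoded = quote(json.dumps(variables))
url = (
    "https://www.instagram.com/graphql/query/"
    "?query_hash=b3055c01b4b222b8a47dc12b090e4e64"
    f"&variables={encoded}"
)
print(url)
```

The resulting URL contains the variables as a percent-encoded JSON blob, which is exactly what Instagram's GraphQL endpoint expects.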
So, to scrape it in Python, we'll use the following code:
```python
import json
from typing import Dict
from urllib.parse import quote

import httpx

INSTAGRAM_APP_ID = "936619743392459"  # this is the public app id for instagram.com

def scrape_post(url_or_shortcode: str) -> Dict:
    """Scrape single Instagram post data"""
    if "http" in url_or_shortcode:
        shortcode = url_or_shortcode.split("/p/")[-1].split("/")[0]
    else:
        shortcode = url_or_shortcode
    print(f"scraping instagram post: {shortcode}")
    variables = {
        "shortcode": shortcode,
        "child_comment_count": 20,
        "fetch_comment_count": 100,
        "parent_comment_count": 24,
        "has_threaded_comments": True,
    }
    url = "https://www.instagram.com/graphql/query/?query_hash=b3055c01b4b222b8a47dc12b090e4e64&variables="
    result = httpx.get(
        url=url + quote(json.dumps(variables)),
        headers={"x-ig-app-id": INSTAGRAM_APP_ID},
    )
    data = json.loads(result.content)
    return data["data"]["shortcode_media"]

# Example usage:
post = scrape_post("https://www.instagram.com/p/CQYQ1Y1nZ1Y/")
print(json.dumps(post, indent=2, ensure_ascii=False))
```
This scraping approach returns the entire post dataset, which includes many useful fields such as the post caption, comments, likes and other details. However, it also includes many flags and unnecessary fields that usually aren't very useful. To reduce the scraped dataset, let's take a look at JSON parsing with jmespath next.
Parsing Instagram Post Data
Instagram post data is even more complex than the user profile data, so we'll use jmespath to reduce it once again.
```python
from typing import Dict

import jmespath

def parse_post(data: Dict) -> Dict:
    """Parse Instagram post dataset, reducing it to the most useful fields"""
    result = jmespath.search(
        """{
        id: id,
        shortcode: shortcode,
        dimensions: dimensions,
        src: display_url,
        src_attached: edge_sidecar_to_children.edges[].node.display_url,
        has_audio: has_audio,
        video_url: video_url,
        views: video_view_count,
        plays: video_play_count,
        likes: edge_media_preview_like.count,
        location: location.name,
        taken_at: taken_at_timestamp,
        related: edge_web_media_to_related_media.edges[].node.shortcode,
        type: product_type,
        video_duration: video_duration,
        music: clips_music_attribution_info,
        is_video: is_video,
        tagged_users: edge_media_to_tagged_user.edges[].node.user.username,
        captions: edge_media_to_caption.edges[].node.text,
        related_profiles: edge_related_profiles.edges[].node.username,
        comments_count: edge_media_to_parent_comment.count,
        comments_disabled: comments_disabled,
        comments_next_page: edge_media_to_parent_comment.page_info.end_cursor,
        comments: edge_media_to_parent_comment.edges[].node.{
            id: id,
            text: text,
            created_at: created_at,
            owner: owner.username,
            owner_verified: owner.is_verified,
            viewer_has_liked: viewer_has_liked,
            likes: edge_liked_by.count
        }
    }""",
        data,
    )
    return result
```
Here, just like before, we use jmespath to pick out the most useful data fields from the huge JSON response our scraper receives. Note that different post types (reels, images, videos etc.) have different fields available.
Scraping All User Posts
To retrieve a user's posts and post comments, we'll use yet another GraphQL endpoint, which requires three variables: the user's ID (which we got from scraping the user's profile earlier), the page size and the page offset cursor:
```json
{
  "id": "NUMERIC USER ID",
  "first": 12,
  "after": "CURSOR ID FOR PAGING"
}
```
For example, if we wanted to retrieve all Instagram posts created by Google, we'd first have to retrieve that user's ID and then compile our GraphQL request.
In Google's case, the GraphQL URL would be:

https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables={"id":1067259270,"first":12}

We can try it in our browser, and we should see a JSON response containing the data of the first 12 posts, which includes details such as:
- Post photos and videos
- The first page of post comments
- Post metadata such as view and comment counts
However, to retrieve all posts we need to implement pagination logic, since the information is spread across multiple pages.
```python
import json
from urllib.parse import quote

import httpx

def scrape_user_posts(user_id: str, session: httpx.Client, page_size=12, page_limit: int = None):
    base_url = "https://www.instagram.com/graphql/query/?query_hash=e769aa130647d2354c40ea6a439bfc08&variables="
    variables = {
        "id": user_id,
        "first": page_size,
        "after": None,
    }
    _page_number = 1
    while True:
        resp = session.get(base_url + quote(json.dumps(variables)))
        data = resp.json()
        posts = data["data"]["user"]["edge_owner_to_timeline_media"]
        for post in posts["edges"]:
            yield parse_post(post["node"])  # note: we're using the parse_post function from the previous chapter
        page_info = posts["page_info"]
        if _page_number == 1:
            print(f"scraping total {posts['count']} posts of {user_id}")
        else:
            print(f"scraping page {_page_number}")
        if not page_info["has_next_page"]:
            break
        if variables["after"] == page_info["end_cursor"]:
            break
        if page_limit and _page_number >= page_limit:
            break  # stop early when an optional page limit is given
        variables["after"] = page_info["end_cursor"]
        _page_number += 1

# Example run:
if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        posts = list(scrape_user_posts("1067259270", session, page_limit=3))
        print(json.dumps(posts, indent=2, ensure_ascii=False))
```
Example Output

```json
[
  {
    "__typename": "GraphImage",
    "id": "2890253001563912589",
    "dimensions": {
      "height": 1080,
      "width": 1080
    },
    "display_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-C93CjLzMapgPHOinoltBXypU_wi7s6zzLj1th-s9p-Q&oe=62E80627&_nc_sid=86f79a",
    "display_resources": [
      {
        "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
        "config_width": 640,
        "config_height": 640
      },
      "..."
    ],
    "is_video": false,
    "tracking_token": "eyJ2ZXJzaW9uIjo1LCJwYXlsb2FkIjp7ImlzX2FuYWx5dGljc190cmFja2VkIjp0cnVlLCJ1dWlkIjoiOWJiNzUyMjljMjU2NDExMTliOGI4NzM5MTE2Mjk4MTYyODkwMjUzMDAxNTYzOTEyNTg5In0sInNpZ25hdHVyZSI6IiJ9",
    "edge_media_to_tagged_user": {
      "edges": [
        {
          "node": {
            "user": {
              "full_name": "Jahmar Gale | Data Analyst",
              "id": "51661809026",
              "is_verified": false,
              "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/284007837_5070066053047326_6283083692098566083_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=106&_nc_ohc=KXI8oOdZRb4AX8w28nr&edm=APU89FABAAAA&ccb=7-5&oh=00_AT-4iYsawdTCHI5a2zD_PF9F-WCyKnTIPuvYwVAQo82l_w&oe=62E7609B&_nc_sid=86f79a",
              "username": "datajayintech"
            },
            "x": 0.68611115,
            "y": 0.32222223
          }
        },
        "..."
      ]
    },
    "accessibility_caption": "A screenshot of a tweet from @DataJayInTech, which says: \"A recruiter just called me and said The Google Data Analytics Certificate is a good look. This post is to encourage YOU to finish the course.\" The background of the image is red with white, yellow, and blue geometric shapes.",
    "edge_media_to_caption": {
      "edges": [
        {
          "node": {
            "text": "Ring, ring — opportunity is calling📱\nStart your Google Career Certificate journey at the link in bio. #GrowWithGoogle"
          }
        },
        "..."
      ]
    },
    "shortcode": "CgcPcqtOTmN",
    "edge_media_to_comment": {
      "count": 139,
      "page_info": {
        "has_next_page": true,
        "end_cursor": "QVFCaU1FNGZiNktBOWFiTERJdU80dDVwMlNjTE5DWTkwZ0E5NENLU2xLZnFLemw3eTJtcU54ZkVVS2dzYTBKVEppeVpZbkd4dWhQdktubW1QVzJrZXNHbg=="
      },
      "edges": [
        {
          "node": {
            "id": "18209382946080093",
            "text": "@google your company is garbage for meddling with supposedly fair elections...you have been exposed",
            "created_at": 1658867672,
            "did_report_as_spam": false,
            "owner": {
              "id": "39246725285",
              "is_verified": false,
              "profile_pic_url": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-19/115823005_750712482350308_4191423925707982372_n.jpg?stp=dst-jpg_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=104&_nc_ohc=4iOCWDHJLFAAX-JFPh7&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9sH7npBTmHN01BndUhYVreHOk63OqZ5ISJlzNou3QD8A&oe=62E87360&_nc_sid=86f79a",
              "username": "bud_mcgrowin"
            },
            "viewer_has_liked": false
          }
        },
        "..."
      ]
    },
    "edge_media_to_sponsor_user": {
      "edges": []
    },
    "comments_disabled": false,
    "taken_at_timestamp": 1658765028,
    "edge_media_preview_like": {
      "count": 9251,
      "edges": []
    },
    "gating_info": null,
    "fact_check_overall_rating": null,
    "fact_check_information": null,
    "media_preview": "ACoqbj8KkijDnBOfpU1tAkis8mcL2H0zU8EMEqh1Dc56H0/KublclpoejKoo3WtylMgQ4HeohW0LKJ+u7PueaX+z4v8Aa/OmoNJJ6kqtG3UxT0pta9xZRxxswzkDjJrIoatuawkpq6NXTvuN9f6VdDFeAMAdsf8A16oWDKFYMQMnuR6e9Xd8f94fmtax2OGqnzsk3n/I/wDsqN7f5H/2VR74/wC8PzWlEkY7g/iv+NVcys+wy5JML59P89zWDW3dSx+UwGMnjjH9KxKynud1BWi79wpQM+g+tJRUHQO2+4pCuO4pKKAFFHP+RSUUgP/Z",
    "owner": {
      "id": "1067259270",
      "username": "google"
    },
    "location": null,
    "viewer_has_liked": false,
    "viewer_has_saved": false,
    "viewer_has_saved_to_collection": false,
    "viewer_in_photo_of_you": false,
    "viewer_can_reshare": true,
    "thumbnail_src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s640x640_sh0.08&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT8aF_4X2Ix9neTg1obSzOBgZW83oMFSNb-i5uqZqRqLLg&oe=62E80627&_nc_sid=86f79a",
    "thumbnail_resources": [
      {
        "src": "https://scontent-atl3-2.cdninstagram.com/v/t51.2885-15/295343605_719605135806241_7849792612912420873_n.webp?stp=dst-jpg_e35_s150x150&_nc_ht=scontent-atl3-2.cdninstagram.com&_nc_cat=101&_nc_ohc=cbVYU-YGD04AX9-DGya&edm=APU89FABAAAA&ccb=7-5&oh=00_AT9nmASHsbmNWUQnwOdkGE4PvE8b27MqK-gbj5z0YLu8qg&oe=62E80627&_nc_sid=86f79a",
        "config_width": 150,
        "config_height": 150
      },
      "..."
    ]
  },
  ...
]
```
Building a Profile - Hashtag Mentions
Now that we can scrape all of a user's posts, we can try a popular analytics exercise: scrape all posts and extract hashtag mentions. For that, let's scrape all posts, extract the hashtags mentioned in the post descriptions, and tally everything up:
```python
import re
from collections import Counter

import httpx

def scrape_hashtag_mentions(user_id, session: httpx.Client, page_limit: int = None):
    """find all hashtags the user mentioned in their posts"""
    hashtags = Counter()
    hashtag_pattern = re.compile(r"#(\w+)")
    for post in scrape_user_posts(user_id, session=session, page_limit=page_limit):
        desc = "\n".join(post["captions"])
        found = hashtag_pattern.findall(desc)
        for tag in found:
            hashtags[tag] += 1
    return hashtags
```
Run Code & Example Output

```python
import json

import httpx

if __name__ == "__main__":
    with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
        # if we only know the username but not the user id, we can scrape
        # the user profile to find the id:
        user_id = scrape_user("google")["id"]  # will result in: 1067259270
        # then we can scrape the hashtag profile
        hashtags = scrape_hashtag_mentions(user_id, session, page_limit=5)
        # order results and print them as JSON:
        print(json.dumps(dict(hashtags.most_common()), indent=2, ensure_ascii=False))
```

```json
{
  "MadeByGoogle": 10,
  "TeamPixel": 5,
  "GrowWithGoogle": 4,
  "Pixel7": 3,
  "LifeAtGoogle": 3,
  "SaferWithGoogle": 3,
  "Pixel6a": 3,
  "DoodleForGoogle": 2,
  "MySuperG": 2,
  "ShotOnPixel": 1,
  "DayInTheLife": 1,
  "DITL": 1,
  "GoogleAustin": 1,
  "Austin": 1,
  "NestWifi": 1,
  "NestDoorbell": 1,
  "GoogleATAPAmbientExperiments": 1,
  "GoogleATAPxKOCHE": 1,
  "SoliATAP": 1,
  "GooglePixelWatch": 1,
  "Chromecast": 1,
  "DooglersAroundTheWorld": 1,
  "GoogleSearch": 1,
  "GoogleSingapore": 1,
  "InternationalDogDay": 1,
  "Doogler": 1,
  "BlackBusinessMonth": 1,
  "PixelBuds": 1,
  "HowTo": 1,
  "Privacy": 1,
  "Settings": 1,
  "GoogleDoodle": 1,
  "NationalInternDay": 1,
  "GoogleInterns": 1,
  "Sushi": 1,
  "StopMotion": 1,
  "LetsInternetBetter": 1
}
```
With this simple analytics script we've collected profile hashtags, which we can use to determine the interests of any public Instagram account.
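From here, the Counter can be turned into simple statistics - for example, the share of total mentions each tag accounts for. The numbers below reuse a few counts from the example output above:

```python
from collections import Counter

# a few counts from the example output above, for illustration
hashtags = Counter({"MadeByGoogle": 10, "TeamPixel": 5, "GrowWithGoogle": 4, "Pixel7": 3})

total = sum(hashtags.values())  # 22 mentions in total
# (tag, count, percentage of all mentions) for the top 3 tags
top = [
    (tag, count, round(count / total * 100, 1))
    for tag, count in hashtags.most_common(3)
]
print(top)  # [('MadeByGoogle', 10, 45.5), ('TeamPixel', 5, 22.7), ('GrowWithGoogle', 4, 18.2)]
```

Percentages like these make it easier to compare accounts with very different posting volumes.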
FAQ
To wrap up this guide, let's take a look at some frequently asked questions about web scraping instagram.com:

Is web scraping instagram.com legal?
Yes. The data on Instagram is publicly available, so scraping instagram.com at a slow, respectful rate falls under the definition of ethical scraping. However, when dealing with personal data we need to be aware of local copyright and user data laws, such as the GDPR in the EU. For more information, see our article Is Web Scraping Legal?

How to get an Instagram user ID from a username?
To get a user's private ID from a public username, we can scrape the user profile with our scrape_user function; the private ID will be in the id field:
```python
with httpx.Client(timeout=httpx.Timeout(20.0)) as session:
    user_id = scrape_user("google")["id"]
    print(user_id)
```
How to get an Instagram username from a user ID?
To get a public username from a private Instagram user ID, we can take advantage of the public iPhone API at https://i.instagram.com/api/v1/users/<USER_ID>/info/:
```python
import httpx

iphone_api = "https://i.instagram.com/api/v1/users/{}/info/"
iphone_user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.3.8 (KHTML, like Gecko) Mobile/14G60 Instagram 12.0.0.16.90 (iPhone9,4; iOS 10_3_3; en_US; en-US; scale=2.61; gamut=wide; 1080x1920)"
resp = httpx.get(iphone_api.format("1067259270"), headers={"User-Agent": iphone_user_agent})
print(resp.json()["user"]["username"])
```
The magic parameter __a=1 no longer works?
Instagram keeps rolling out changes and is gradually retiring this feature. However, in this article we've covered two alternatives to the ?__a=1 feature: the better-performing /v1/ API endpoints and the GraphQL endpoints!