A personal collection of an AI product manager.
Let's face the future together and embrace the AIGC era.

Resource Compilation | 32 Python Crawler Projects to Satisfy Your Appetite!

Today, I have compiled a list of 32 Python web scraping projects for everyone.

I’ve gathered these projects because web scraping is a simple and fast way to get started with Python, and it’s also great for beginners to build confidence. All the links point to GitHub, so have fun exploring! O(∩_∩)O~

  1. WechatSogou [1] – A WeChat public account crawler based on Sogou’s WeChat search. It can be expanded to a Sogou search-based crawler, returning a list where each item is a dictionary of detailed public account information.
  2. DouBanSpider [2] – A Douban book crawler. It can scrape all books under Douban’s book tags, rank them by rating, and store them in Excel for easy filtering, such as finding highly-rated books with over 1,000 reviewers. Different topics can be saved in separate sheets. It uses User Agent spoofing and random delays to mimic browser behavior and avoid being blocked.
  3. zhihu_spider [3] – A Zhihu crawler. This project scrapes user information and social network relationships on Zhihu. It uses the Scrapy framework and stores data in MongoDB.
  4. bilibili-user [4] – A Bilibili user crawler. Total records scraped: 20,119,918 users. Fields include user ID, nickname, gender, avatar, level, experience points, follower count, birthday, location, registration time, signature, and more. After scraping, it generates a Bilibili user data report.
  5. SinaSpider [5] – A Sina Weibo crawler. It mainly scrapes user personal information, posts, followers, and followings. It uses Sina Weibo cookies for login and supports multiple accounts to avoid anti-scraping measures. It primarily uses the Scrapy framework.
  6. distribute_crawler [6] – A distributed novel download crawler. It uses Scrapy, Redis, MongoDB, and graphite to implement a distributed web crawler. The underlying storage is a MongoDB cluster, distributed via Redis, and the crawler status is displayed using graphite. It mainly targets a novel website.
  7. CnkiSpider [7] – A CNKI (China National Knowledge Infrastructure) crawler. After setting search conditions, it executes src/CnkiSpider.py to scrape data, stored in the /data directory. The first line of each data file contains the field names.
  8. LianJiaSpider [8] – A Lianjia crawler. It scrapes historical second-hand housing transaction records in Beijing. It includes all the code from the Lianjia simulated login article.
  9. scrapy_jingdong [9] – A JD.com crawler based on Scrapy. Data is saved in CSV format.
  10. QQ-Groups-Spider [10] – A QQ group crawler. It batch scrapes QQ group information, including group name, group number, member count, group owner, group description, etc., and generates XLS(X) / CSV result files.
  11. wooyun_public [11] – A WooYun crawler. It scrapes WooYun’s public vulnerabilities and knowledge base. All public vulnerabilities are stored in MongoDB, taking up about 2GB. If the entire site, including text and images, is scraped for offline querying, it requires about 10GB of space and 2 hours (10M broadband). The knowledge base takes up about 500MB. Vuln search uses Flask as the web server and Bootstrap for the frontend.
  12. spider [12] – A hao123 website crawler. Starting from hao123, it follows outbound links, collecting URLs and recording each page's title along with its counts of internal and external links. Tested on 32-bit Windows 7, it collects roughly 100,000 URLs per 24 hours.
  13. findtrip [13] – A flight ticket crawler (Qunar and Ctrip). Findtrip is a Scrapy-based flight ticket crawler, currently integrating data from two major ticket websites in China (Qunar + Ctrip).
  14. 163spider [14] – A NetEase client content crawler based on requests, MySQLdb, and torndb.
  15. doubanspiders [15] – A collection of Douban crawlers for movies, books, groups, albums, and more.
  16. QQSpider [16] – A QQ Zone crawler, including logs, posts, personal information, etc. It can scrape 4 million pieces of data per day.
  17. baidu-music-spider [17] – A Baidu MP3 site crawler, using Redis for resumable scraping.
  18. tbcrawler [18] – A Taobao and Tmall crawler. It can scrape page information based on search keywords and item IDs, with data stored in MongoDB.
  19. stockholm [19] – A stock data (Shanghai and Shenzhen) crawler and stock selection strategy testing framework. It scrapes stock data for all stocks in the Shanghai and Shenzhen markets over a selected date range. It supports defining stock selection strategies using expressions and multi-threading. Data is saved in JSON and CSV files.
  20. BaiduyunSpider [20] – A Baidu Cloud crawler.
  21. Spider [21] – A social data crawler. It supports Weibo, Zhihu, and Douban.
  22. proxy pool [22] – A Python crawler proxy IP pool.
  23. music-163 [23] – A crawler for scraping comments on all songs from NetEase Cloud Music.
  24. jandan_spider [24] – A crawler for scraping images from Jiandan.
  25. CnblogsSpider [25] – A Cnblogs list page crawler.
  26. spider_smooc [26] – A crawler for scraping videos from MOOC.
  27. CnkiSpider [27] – A CNKI crawler.
  28. knowsecSpider2 [28] – A Knownsec crawler project.
  29. aiss-spider [29] – A crawler for scraping images from the Aiss app.
  30. SinaSpider [30] – A crawler that uses dynamic IPs to bypass Sina’s anti-scraping mechanism for quick content scraping.
  31. csdn-spider [31] – A crawler for scraping blog articles from CSDN.
  32. ProxySpider [32] – A crawler for scraping and validating proxy IPs from Xici.
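Several of the projects above (DouBanSpider, for example) avoid being blocked by spoofing the User-Agent header and inserting random delays between requests. The sketch below illustrates that general technique only; it is not taken from any of the listed repositories, and the User-Agent strings and helper names are illustrative assumptions:

```python
import random
import time

# Illustrative pool of browser User-Agent strings (any realistic set works).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def random_headers():
    """Pick a random User-Agent so successive requests don't share one fingerprint."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay(low=1.0, high=3.0):
    """Sleep for a random interval to mimic human pacing; returns the delay used."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Usage (live network call omitted here):
# import requests
# resp = requests.get("https://example.com", headers=random_headers())
# polite_delay()
```

Rotating the header and randomizing the pacing are the two cheapest defenses against naive rate-based blocking; real projects usually layer proxies and login cookies on top.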
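Projects 22 and 32 revolve around collecting and validating proxy IPs. A minimal validation sketch, not drawn from either repository, might look like the following; the function names and the httpbin.org test URL are assumptions for illustration:

```python
import requests

def check_proxy(proxy, test_url="https://httpbin.org/ip", timeout=5.0):
    """Return True if `proxy` ("host:port") can complete a simple HTTP request.

    httpbin.org is just a convenient echo service; any stable URL works.
    """
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        resp = requests.get(test_url, proxies=proxies, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        # Dead, refused, or timed-out proxies all land here.
        return False

def filter_alive(candidates):
    """Keep only the proxies that pass the liveness check."""
    return [p for p in candidates if check_proxy(p)]
```

A real proxy pool would also re-check entries periodically and score them by latency, but the keep-or-drop decision above is the core of it.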

Update:

webspider [33] – A job-posting data crawler built primarily with Python 3, Celery, and requests. It implements scheduled tasks, error retries, logging, and automatic cookie rotation, and uses ECharts + Bootstrap frontend pages to display the scraped data.
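The scheduled-task-with-retries pattern that webspider describes maps naturally onto Celery. The configuration sketch below is a generic illustration, not webspider's actual code: the broker URL, task name, schedule, and target URL are all assumptions.

```python
# Minimal Celery sketch: an hourly crawl task that retries on failure.
# Assumes Celery is installed and a Redis broker runs at the default URL.
from celery import Celery
from celery.schedules import crontab

app = Celery("crawler_demo", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def fetch_jobs(self, page):
    """Fetch one listing page; on network errors, re-queue up to 3 times."""
    import requests
    try:
        resp = requests.get(f"https://example.com/jobs?page={page}", timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        raise self.retry(exc=exc)  # retried after default_retry_delay seconds

# Celery Beat schedule: run the task at the top of every hour.
app.conf.beat_schedule = {
    "crawl-hourly": {
        "task": "crawler_demo.fetch_jobs",
        "schedule": crontab(minute=0),
        "args": (1,),
    },
}
```

Run with `celery -A crawler_demo worker` plus `celery -A crawler_demo beat`; Celery Beat enqueues the task on schedule and the worker handles execution and retries.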

Do not reproduce without permission: AIPMClub » Resource Compilation | 32 Python Crawler Projects to Satisfy Your Appetite!