A personal collection of an AI product manager.
Let's face the future together and embrace the AIGC era.

Resource Compilation | 32 Python Crawler Projects to Satisfy Your Appetite!

Today, I have compiled a list of 32 Python web scraping projects for everyone.

I’ve gathered these projects because web scraping is a simple and fast way to get started with Python, and it’s also great for beginners to build confidence. All the links point to GitHub, so have fun exploring! O(∩_∩)O~

  1. WechatSogou [1] – A WeChat public account crawler based on Sogou’s WeChat search. It can be expanded to a Sogou search-based crawler, returning a list where each item is a dictionary of detailed public account information.
  2. DouBanSpider [2] – A Douban book crawler. It can scrape all books under Douban’s book tags, rank them by rating, and store them in Excel for easy filtering, such as finding highly-rated books with over 1,000 reviewers. Different topics can be saved in separate sheets. It uses User Agent spoofing and random delays to mimic browser behavior and avoid being blocked.
  3. zhihu_spider [3] – A Zhihu crawler. This project scrapes user information and social network relationships on Zhihu. It uses the Scrapy framework and stores data in MongoDB.
  4. bilibili-user [4] – A Bilibili user crawler. Total records scraped: 20,119,918 users. Fields include user ID, nickname, gender, avatar, level, experience points, follower count, birthday, location, registration time, signature, and more. After scraping, it generates a Bilibili user data report.
  5. SinaSpider [5] – A Sina Weibo crawler. It mainly scrapes user personal information, posts, followers, and followings. It uses Sina Weibo cookies for login and supports multiple accounts to avoid anti-scraping measures. It primarily uses the Scrapy framework.
  6. distribute_crawler [6] – A distributed novel download crawler. It uses Scrapy, Redis, MongoDB, and graphite to implement a distributed web crawler. The underlying storage is a MongoDB cluster, distributed via Redis, and the crawler status is displayed using graphite. It mainly targets a novel website.
  7. CnkiSpider [7] – A CNKI (China National Knowledge Infrastructure) crawler. After setting search conditions, it executes src/CnkiSpider.py to scrape data, stored in the /data directory. The first line of each data file contains the field names.
  8. LianJiaSpider [8] – A Lianjia crawler. It scrapes historical second-hand housing transaction records in Beijing. It includes all the code from the Lianjia simulated login article.
  9. scrapy_jingdong [9] – A JD.com crawler based on Scrapy. Data is saved in CSV format.
  10. QQ-Groups-Spider [10] – A QQ group crawler. It batch scrapes QQ group information, including group name, group number, member count, group owner, group description, etc., and generates XLS(X) / CSV result files.
  11. wooyun_public [11] – A WooYun crawler. It scrapes WooYun’s public vulnerabilities and knowledge base. All public vulnerabilities are stored in MongoDB, taking up about 2GB. If the entire site, including text and images, is scraped for offline querying, it requires about 10GB of space and 2 hours (10M broadband). The knowledge base takes up about 500MB. Vuln search uses Flask as the web server and Bootstrap for the frontend.
  12. spider [12] – A hao123 website crawler. Starting from hao123, it follows outbound links, collecting URLs and recording each page's title along with its counts of internal and external links. Tested on 32-bit Windows 7, it collects roughly 100,000 URLs per 24 hours.
  13. findtrip [13] – A flight ticket crawler (Qunar and Ctrip). Findtrip is a Scrapy-based flight ticket crawler, currently integrating data from two major ticket websites in China (Qunar + Ctrip).
  14. 163spider [14] – A NetEase client content crawler based on requests, MySQLdb, and torndb.
  15. doubanspiders [15] – A collection of Douban crawlers for movies, books, groups, albums, and more.
  16. QQSpider [16] – A QQ Zone crawler, including logs, posts, personal information, etc. It can scrape 4 million pieces of data per day.
  17. baidu-music-spider [17] – A Baidu MP3 site crawler, using Redis for resumable scraping.
  18. tbcrawler [18] – A Taobao and Tmall crawler. It can scrape page information based on search keywords and item IDs, with data stored in MongoDB.
  19. stockholm [19] – A stock data (Shanghai and Shenzhen) crawler and stock selection strategy testing framework. It scrapes stock data for all stocks in the Shanghai and Shenzhen markets over a selected date range. It supports defining stock selection strategies using expressions and multi-threading. Data is saved in JSON and CSV files.
  20. BaiduyunSpider [20] – A Baidu Cloud crawler.
  21. Spider [21] – A social data crawler. It supports Weibo, Zhihu, and Douban.
  22. proxy pool [22] – A Python crawler proxy IP pool.
  23. music-163 [23] – A crawler for scraping comments on all songs from NetEase Cloud Music.
  24. jandan_spider [24] – A crawler for scraping images from Jiandan.
  25. CnblogsSpider [25] – A Cnblogs list page crawler.
  26. spider_smooc [26] – A crawler for scraping videos from MOOC.
  27. CnkiSpider [27] – A CNKI crawler.
  28. knowsecSpider2 [28] – A Knownsec crawler project.
  29. aiss-spider [29] – A crawler for scraping images from the Aiss app.
  30. SinaSpider [30] – A crawler that uses dynamic IPs to bypass Sina’s anti-scraping mechanism for quick content scraping.
  31. csdn-spider [31] – A crawler for scraping blog articles from CSDN.
  32. ProxySpider [32] – A crawler for scraping and validating proxy IPs from Xici.
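Several of the projects above (DouBanSpider, for example) avoid being blocked by spoofing the User-Agent header and inserting random delays between requests. The sketch below illustrates that general technique only; it is not taken from any of the listed repositories, and the User-Agent strings and helper names are illustrative assumptions:

```python
import random
import time

# Illustrative pool of browser User-Agent strings (any realistic set works).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def random_headers():
    """Pick a random User-Agent so successive requests don't share one fingerprint."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_delay(low=1.0, high=3.0):
    """Sleep for a random interval to mimic human pacing; returns the delay used."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay

# Usage (live network call omitted here):
# import requests
# resp = requests.get("https://example.com", headers=random_headers())
# polite_delay()
```

Rotating the header and randomizing the pacing are the two cheapest defenses against naive rate-based blocking; real projects usually layer proxies and login cookies on top.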
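Projects 22 and 32 revolve around collecting and validating proxy IPs. A minimal validation sketch, not drawn from either repository, might look like the following; the function names and the httpbin.org test URL are assumptions for illustration:

```python
import requests

def check_proxy(proxy, test_url="https://httpbin.org/ip", timeout=5.0):
    """Return True if `proxy` ("host:port") can complete a simple HTTP request.

    httpbin.org is just a convenient echo service; any stable URL works.
    """
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    try:
        resp = requests.get(test_url, proxies=proxies, timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        # Dead, refused, or timed-out proxies all land here.
        return False

def filter_alive(candidates):
    """Keep only the proxies that pass the liveness check."""
    return [p for p in candidates if check_proxy(p)]
```

A real proxy pool would also re-check entries periodically and score them by latency, but the keep-or-drop decision above is the core of it.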

Update:

webspider [33] – A job-posting data crawler built primarily with Python 3, Celery, and requests. It implements scheduled tasks, error retries, logging, and automatic cookie rotation, and uses ECharts + Bootstrap frontend pages to display the scraped data.
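The scheduled-task-with-retries pattern that webspider describes maps naturally onto Celery. The configuration sketch below is a generic illustration, not webspider's actual code: the broker URL, task name, schedule, and target URL are all assumptions.

```python
# Minimal Celery sketch: an hourly crawl task that retries on failure.
# Assumes Celery is installed and a Redis broker runs at the default URL.
from celery import Celery
from celery.schedules import crontab

app = Celery("crawler_demo", broker="redis://localhost:6379/0")

@app.task(bind=True, max_retries=3, default_retry_delay=60)
def fetch_jobs(self, page):
    """Fetch one listing page; on network errors, re-queue up to 3 times."""
    import requests
    try:
        resp = requests.get(f"https://example.com/jobs?page={page}", timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException as exc:
        raise self.retry(exc=exc)  # retried after default_retry_delay seconds

# Celery Beat schedule: run the task at the top of every hour.
app.conf.beat_schedule = {
    "crawl-hourly": {
        "task": "crawler_demo.fetch_jobs",
        "schedule": crontab(minute=0),
        "args": (1,),
    },
}
```

Run with `celery -A crawler_demo worker` plus `celery -A crawler_demo beat`; Celery Beat enqueues the task on schedule and the worker handles execution and retries.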

Do not reproduce without permission: AIPMClub » Resource Compilation | 32 Python Crawler Projects to Satisfy Your Appetite!