Distscrape is built around three primary goals:
- Be as fast as possible
  - Nearly everything is asynchronous, letting a crawl run as fast as the connection between the scraper and the target allows (a sketch follows this list).
- Guarantee no loss of data
  - All work is "checked out" from the Tracker, so even if a worker crashes, the URLs/IDs it was working on can easily be requeued.
  - All Tracker operations are atomic, so no two workers can work on the same item, and no item can accidentally be dropped.
- Remain flexible for all kinds of scraping and crawling workloads
  - Discrete component separation supports everything from site mirroring to tasks like ID discovery, simply by swapping out the implementations used (see below).
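
To make the first goal concrete, here is a minimal sketch (not distscrape's actual API; every name is illustrative) of the asynchronous worker pattern the design implies: many requests in flight at once, bounded only by a concurrency cap and the connection itself.

```python
# Illustrative only: a pool of async workers draining a shared URL queue.
import asyncio
import aiohttp

CONCURRENCY = 32  # hypothetical cap on in-flight requests

async def worker(session: aiohttp.ClientSession, queue: asyncio.Queue) -> None:
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as resp:
                body = await resp.text()
                print(f"fetched {url}: {len(body)} bytes")
        finally:
            queue.task_done()

async def main(urls: list[str]) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(session, queue))
                   for _ in range(CONCURRENCY)]
        await queue.join()   # wait until every URL has been processed
        for w in workers:
            w.cancel()       # tear down the now-idle workers

asyncio.run(main(["https://example.com/"]))
```

Because the workers only ever wait on I/O, adding more of them costs almost nothing until the network saturates.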
Distscrape has four main components:
- Scraper
- Tracker
- Saver
- Crawl manager
Each of these is explained in detail below:
Scrapers implement the logic that does the actual scraping. This includes downloading content and searching it for more links, IDs, etc. to crawl.
SimpleScraper
: Scrapes each page for links matching the provided `link_regex` and adds all results back into the work queue.

IDScraper
: Scrapes each page for IDs matching the provided `id_regex`, and downloads items by GETing the specified `download_url_fmt`.
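
As a rough illustration of the SimpleScraper behavior described above, the sketch below downloads a page and runs a configured `link_regex` over it. The class name, method names, and use of `urllib` are assumptions for illustration, not distscrape's interface.

```python
# Illustrative only: regex-based link extraction in the spirit of SimpleScraper.
import re
import urllib.request

class RegexLinkScraper:
    def __init__(self, link_regex: str):
        # Compile once; the pattern is assumed to match absolute URLs.
        self.link_regex = re.compile(link_regex)

    def scrape(self, url: str) -> tuple[bytes, list[str]]:
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        # Every match in the page becomes a new candidate URL for the queue.
        links = self.link_regex.findall(body.decode("utf-8", errors="replace"))
        return body, links
```

For example, constructing it with a pattern like `r"https?://[^\s\"'<>]+"` would collect every absolute URL on a page.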
The tracker tracks what all of the workers are doing: it is the central authority for what is going on at any given time. It gives workers their IDs, leases work out to them, adds new items the workers find, and ensures no URLs/IDs are lost along the way.
InMemoryTracker
: A simple tracker that uses Python `set`s to track items. Useful for smaller scrapes.

RedisTracker
: A tracker backed by a Redis server. Useful for larger scrapes, or scrapes where many machines are required (e.g. due to per-IP restrictions).
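
The checkout/lease semantics above can be sketched with plain Python sets and a lock, in the spirit of InMemoryTracker. Every method name here is illustrative rather than distscrape's actual interface, but each operation is atomic under the lock, matching the guarantees described earlier.

```python
# Illustrative only: an atomic check-out/complete/requeue life cycle.
import threading
from typing import Optional

class InMemoryTrackerSketch:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.pending: set[str] = set()
        self.leased: dict[str, str] = {}  # item -> ID of the worker holding it
        self.done: set[str] = set()

    def add(self, item: str) -> None:
        """Add newly discovered work, ignoring items already seen."""
        with self._lock:
            if item not in self.done and item not in self.leased:
                self.pending.add(item)

    def checkout(self, worker_id: str) -> Optional[str]:
        """Atomically move one item from pending to leased."""
        with self._lock:
            if not self.pending:
                return None
            item = self.pending.pop()
            self.leased[item] = worker_id  # still tracked if the worker dies
            return item

    def complete(self, item: str) -> None:
        with self._lock:
            self.leased.pop(item, None)
            self.done.add(item)

    def requeue_worker(self, worker_id: str) -> None:
        """On worker crash, return all of its leased items to pending."""
        with self._lock:
            for item, owner in list(self.leased.items()):
                if owner == worker_id:
                    del self.leased[item]
                    self.pending.add(item)
```

Because an item is always in exactly one of the three states, a crashed worker's items sit in `leased` until `requeue_worker` moves them back, so nothing is ever silently dropped.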
Savers, as you may expect, save the content that is downloaded by the scrapers.
FileItemSaver
: Saves results to files on disk based on a configurable file name scheme.

RedisItemSaver
: Saves results into a Redis server based on a configurable key scheme.

TarItemSaver
: Saves results into a tar(.gz|.bz2) archive.
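
A FileItemSaver-style saver might look roughly like the sketch below, with the "file name scheme" modeled as a format string. The constructor argument and method name are assumptions for illustration.

```python
# Illustrative only: a disk-backed saver with a configurable naming scheme.
import os

class FileSaverSketch:
    def __init__(self, path_fmt: str = "items/{item_id}.html"):
        # The "file name scheme": a format string with an {item_id} placeholder.
        self.path_fmt = path_fmt

    def save(self, item_id: str, content: bytes) -> str:
        path = self.path_fmt.format(item_id=item_id)
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            f.write(content)
        return path
```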
The crawl manager is responsible for coordinating all of the other parts of distscrape. It contains the main crawling loop and proxies requests from one component to another.
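
Putting it together, the sketch below shows the shape of such a coordinating loop, wiring together the tracker, scraper, and saver sketches from earlier. It is a schematic of the data flow, not distscrape's actual crawl manager.

```python
# Illustrative only: check out -> scrape -> save -> feed discoveries back -> done.
def crawl(tracker: "InMemoryTrackerSketch",
          scraper: "RegexLinkScraper",
          saver: "FileSaverSketch",
          worker_id: str = "worker-1") -> None:
    while (url := tracker.checkout(worker_id)) is not None:
        try:
            body, links = scraper.scrape(url)
            saver.save(url.replace("/", "_"), body)  # crude item-ID derivation
            for link in links:
                tracker.add(link)  # discovered work flows back to the tracker
            tracker.complete(url)
        except Exception:
            # A failed item goes back to pending instead of being lost.
            tracker.requeue_worker(worker_id)
            raise
```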