Crawlgo is a crawler written in Go. It aims to be an extensible, scalable, and high-performance distributed crawler system.
Using PhantomJS, crawlgo can crawl web pages rendered with JavaScript.
- Linux OS
- phantomjs: the phantomjs binary should be runnable from the PATH environment variable. It can be downloaded here.
go get github.com/tossmilestone/crawlgo
cd ${GOPATH}/src/github.com/tossmilestone/crawlgo
sudo make install
The above commands will install crawlgo in ${GOPATH}/go/bin.
crawlgo [flags]
Flags:
--download-selector string The DOM selector to query the links that will be downloaded from the site
--enable-profile Enable profiling of the program; starts a pprof HTTP server on localhost:6360
-h, --help help for crawlgo
--save-dir string The directory to save downloaded files. (default "./crawlgo")
--site string The site to crawl
--version version for crawlgo
--workers int The number of workers to run the crawl tasks. If not set, defaults to runtime.NumCPU()
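For example, a typical invocation might look like the following; the site URL, selector, and directory below are illustrative placeholders, not values from the project:

```shell
# Crawl a site and download every resource matched by a CSS selector.
# The URL and selector here are placeholders for your own target site.
crawlgo --site https://example.com \
        --download-selector "a.download" \
        --save-dir ./downloads \
        --workers 4
```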
Crawlgo uses file names to identify downloaded links. If a link's file already exists in the save directory, the link is assumed to have been downloaded already.