spidernews

Crawler implementation in Python to collect pages from local news sites, searching for keywords about violence and crime.

Installation steps

  • Install MongoDB

https://docs.mongodb.com/manual/administration/install-community/

  • Install pymongo (this assumes you already have pip and Python installed)

pip install pymongo

Depending on your system, you may also have to enable and start the MongoDB service.
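On Linux distributions that use systemd, for example, this can be done with:

sudo systemctl enable mongod
sudo systemctl start mongod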

  • In the database folder, run:

python initdb.py

  • In the root folder, run:

python spider.py

About the code

The database folder contains the news-site initialization and an API to add, get, and reset data and collections in the database; a minimal sketch of such an API follows the list below. Available collections:

  • keywords: keywords to be searched for in pages
  • seeds: homepages of the news sites
  • frontier: links still to be explored, extended with each parsed URL
  • repository: links whose pages contain at least one keyword from the keywords collection
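The following is a minimal sketch of what such an API could look like with pymongo; the connection URI, the database name spidernews, and the function names are assumptions for illustration, not taken from the repository:

from pymongo import MongoClient

# Hypothetical sketch of the database API described above; names and
# structure are assumptions, not the repository's actual code.
client = MongoClient("mongodb://localhost:27017/")  # assumes a local MongoDB instance
db = client["spidernews"]                           # database name is an assumption

def add(collection, documents):
    # Insert one or more documents into the given collection.
    db[collection].insert_many(documents)

def get(collection, query=None):
    # Return all documents in the collection matching the query.
    return list(db[collection].find(query or {}))

def reset(collection):
    # Remove every document from the collection.
    db[collection].delete_many({})

With an API like this, initdb.py could, for instance, reset each collection and then add the initial seeds and keywords.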

crawler.py is where the magic happens. Its main functions, sketched after this list, are:

  • parserURL returns the HTML and a list of links for a given URL
  • parserHTML returns the keywords found in a given HTML document
  • spider crawls recursively from a base URL up to a maximum number of pages; if a page contains one of the keywords, its link is saved in the repository collection
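Below is a simplified, iterative sketch of how these pieces could fit together. It assumes requests as the HTTP client and a regex-based link extractor, and it keeps matched links in memory instead of writing to MongoDB, so it is an illustration of the approach rather than the repository's actual crawler.py:

import re
import requests

def parserURL(url):
    # Fetch a URL and return its HTML plus the absolute links it contains.
    html = requests.get(url, timeout=10).text
    links = re.findall(r'href=["\'](https?://[^"\']+)["\']', html)
    return html, links

def parserHTML(html, keywords):
    # Return the subset of keywords that appear in the HTML.
    text = html.lower()
    return [kw for kw in keywords if kw.lower() in text]

def spider(baseurl, keywords, max_pages=100):
    # Crawl from baseurl, visiting at most max_pages pages and
    # collecting the URLs of pages that match at least one keyword.
    frontier, visited, repository = [baseurl], set(), []
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html, links = parserURL(url)
        except requests.RequestException:
            continue  # skip pages that fail to download
        if parserHTML(html, keywords):
            repository.append(url)  # the real code saves this to the repository collection
        frontier.extend(links)
    return repository

In the real project, the frontier and repository lists would live in the MongoDB collections of the same names, so a crawl can be interrupted and resumed.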