This project was created alongside my master's thesis in computer science at the MLU.
It enables the user to build and maintain a dynamic database of the data in the DFG Gepris Foerderkatalog.
The plan is to have a public API running, so that others will be able to visualize and use the scraped data further.
First, we need to configure a few things like passwords and ports. These settings live in the .env file. In the beginning it is fine to just use the example environment variables:
cp example.env .env
For most use cases you need some external services. To set them up, you need to have Docker and docker-compose installed.
We use Docker-Compose profiles for this application. There are three of them:
- development (scrapyd, database, adminer)
- test (test-database, adminer)
- production (scrapyd, database, adminer, deployer)
Choose the one you need, build it and start it:
docker-compose --profile development build
docker-compose --profile development up
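If you prefer to keep the services running in the background, you can start them detached and stop them again later. These are standard docker-compose commands (support for the --profile flag depends on your docker-compose version):
# start the development services in the background
docker-compose --profile development up -d
# stop and remove the containers again
docker-compose --profile development down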
To use the database later on, please install some basic dependencies on your machine. This is only needed if you want to use the test or development profile.
# this is only for Debian-like distros; please file a request if you need it on another system
sudo apt-get install libpq-dev gcc
The corresponding profile is development.
First go ahead and create a virtual environment:
python3 -m venv venv
Now activate it:
# for bash/zsh
source venv/bin/activate
# for fish
source venv/bin/activate.fish
and install dependencies:
pip install -r requirements.txt
You can already try a first command. Let's check the latest data monitor results of GEPRIS:
# this even works without running docker services
scrapy crawl data_monitor -s NO_DB=True -O test.json
If everything is correctly set up, there should now be a test.json file with content like:
[
{"last_update": "2021-10-12", "last_approval": "2021-08-12", "gepris_version": "18.5.0", "current_index_version": "63037efd-37e0-424a-a956-438bfe91dc9d", "current_index_date": "2021-10-12 10:05:44", "finished_project_count": 34874, "project_count": 136266, "person_count": 87475, "institution_count": 37472, "humanities_count": 24936, "life_count": 48182, "natural_count": 34897, "engineering_count": 25362, "infrastructure_count": 11055}
]
Please look into the Running the spiders chapter for detailed info about the spiders.
You also need several environment variables to be set. Do NOT source the .env file directly; instead, always run:
# you can only do this in this directory, so please navigate here before running the command
source outside_docker.sh
# if you are using the fish shell, please run instead:
exec bash -c "source outside_docker.sh; exec fish"
You can now run any spider you want with:
scrapy crawl SPIDERNAME [-a ARG=ARG_VALUE, ...] [-s SETTING=SETTING_VALUE, ...]
To override settings, use the -s flag. Useful settings are:
- HTTPCACHE_ENABLED=True enables the HTTP cache (in .scrapy/httpcache/SPIDERNAME.db)
- HTTPCACHE_FORCE_REFRESH=True forces an overwrite of the cached pages when the HTTP cache is enabled
- NO_DB=True makes sure the spider does not read from or write to the database (disables some functionality)
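For example, a data monitor run that uses the HTTP cache and skips the database could look like this (the output file name is arbitrary):
# illustrative run: cache enabled, no database access
scrapy crawl data_monitor -s HTTPCACHE_ENABLED=True -s NO_DB=True -O monitor.json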
To quickly experiment, you can also open a Scrapy shell without database access, spider middlewares, or the HTTP cache, for example to check your outgoing IP:
scrapy shell https://www.whatismyip.com/ -s "NO_DB=True" -s "SPIDER_MIDDLEWARES={}" -s "HTTPCACHE_ENABLED=False"
If you want to run spiders from your IDE, please use the runner.py file for it.
The cache of the crawled websites is stored in .scrapy/httpcache/ (provided you have filled the cache before). You will find a folder for each spider, each containing a SPIDERNAME.db file. This is a GNU DBM file.
Details about its content can be found here.
Use the cache_control.py script to read, delete, and debug data from the cache.
The corresponding profile is development.
Run the tests from the project's root directory with:
python -m unittest
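For more detailed per-test output, add unittest's verbose flag:
python -m unittest -v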
Some of the tests are heavily mocked. Some require local .html files. Some are not implemented yet and will fail on purpose.
Look into the test module for more information.
In production, a cronjob is running that schedules the spiders regularly. Look into the cronfile to see what exactly is happening.
To run it on an AMD64 architecture, use the production profile.
If you are running this on an ARM architecture, please use the docker-compose.arm.yml file like this:
docker-compose -f docker-compose.yml -f docker-compose.arm.yml --profile production up --build
It is possible to receive automatic email messages for important spider runs. Please fill in the specified entries in your .env file.
The commands to run the spiders in development and production should be clear by this point.
We currently have 3 maintained and tested spiders to scrape items from Gepris.
Some of them require arguments to work.
This spider fetches the latest data from https://gepris.dfg.de/gepris/OCTOPUS?task=showMonitor
It does not use or require any arguments and only requests a single page.
This spider fetches all search results for the search at https://gepris.dfg.de/gepris/OCTOPUS?task=showSearchSimple
It requires the argument context (str), which can be projekt, person or institution.
It takes an optional argument items (int), which is the number of results displayed per page (so fewer items per page means smaller but more documents to fetch). It defaults to 1000.
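An illustrative invocation (the output file name is arbitrary):
# fetch all project search results, 500 results per page
scrapy crawl search_results -a context=projekt -a items=500 -O search_results_projekt.json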
This spider fetches the details pages for the given ids, for example: https://gepris.dfg.de/gepris/projekt/216628603
Some pages also have results ("Projektergebnisse"), like https://gepris.dfg.de/gepris/projekt/234920277 . In this case, the result is also fetched and added to the scraped item.
Each page is fetched in German and in English. So we have to fetch 2 or 4 documents per ID (depending on whether a results page exists) and produce a single item for each ID.
It requires the argument context (str), which can be projekt, person or institution.
It requires the argument ids (str), which tells the spider which ids to scrape (example invocations follow after the list). It can be either:
- [ID1,ID2,...]: an explicit list of ids
- file.json: in this case the file.json has to contain a single valid JSON array of objects, each of which has the key id. This can, for example, be the output of the search_results spider.
- db:all:LIMIT: fetches the LIMIT (a number) latest scraped item ids for this context from the database
- db:needed:LIMIT: fetches the LIMIT (a number) latest scraped item ids for this context from the database that require a refresh
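For illustration, assuming the details spider is registered under the name details (check scrapy list for the exact name; output file names are arbitrary), the three forms could be used like this:
# explicit list of ids (quoted so the shell does not expand the brackets)
scrapy crawl details -a context=projekt -a "ids=[216628603,234920277]" -O details_projekt.json
# ids taken from the output of a previous search_results run
scrapy crawl details -a context=person -a ids=search_results_person.json
# the 500 ids for this context that need a refresh, taken from the database
scrapy crawl details -a context=projekt -a ids=db:needed:500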
There is the option to use proxies. We currently only support proxies from webshare.io.
To use them, register yourself on the website, buy a plan and then head to your proxy list overview, press the "Download Proxy List" button and copy the link.
This link has to be set in your .env file under the key WEBSHARE_PROXY_LIST_URL. You will then be using the webshare proxies.
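For illustration, the .env entry is a plain key/value line; the value is the link you copied from the webshare dashboard (placeholder shown here):
# placeholder, replace with the link from "Download Proxy List"
WEBSHARE_PROXY_LIST_URL=<your proxy list download link>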
- A good strategy would be to crawl only N projects every day. For each project, we crawl its references to person and institution.
- Choose N so that we average around a fixed number of total requests.
- What about subinstitutions?