I am running MASTER.sh to download all data from the NREL GitHub organization (which has 350 repos), but it's taking a very long time and I'm not sure whether this is normal. For most repositories in the org, the query returns in under a second. The script also appears to be scraping over 4,000 repositories in total (possibly dependencies?).
For some repositories, the query seems to take much longer, and the script prints warning-like messages such as:
```
Sending REST query...
Checking response...
HTTP/1.1 202 Accepted
API Status {"limit": 5000, "remaining": 4414, "reset": 1607114323}
Query accepted but not yet processed. Trying again in 3sec...
```
Also, for a very small minority of repos, I get the following error-like message:
These two errors do not seem to occur simultaneously.
The script is still humming along, and I will let it finish, but am wondering if these errors can simply be ignored.
Update: The script has finished and I am able to view the data using the Jekyll dev server. However, it appears that at least 3 repositories (out of 350) were skipped.
Steps to reproduce:

1. Remove all data from explore/github_data.
2. Remove all repos and orgs from _explore/input_lists.json, and add "NREL" as an org (a sample file is sketched after this list).
3. Create a Python environment and install the dependencies from requirements.txt.
4. Set the GITHUB_API_TOKEN environment variable.
5. Run ./MASTER.sh.
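For reference, a minimal _explore/input_lists.json after step 2 might look like the sketch below. The exact schema is an assumption here (top-level "orgs" and "repos" arrays, inferred from the file's name and usage); check the file that ships with the repository for the real field names.

```json
{
    "orgs": ["NREL"],
    "repos": []
}
```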
The update can take a long time. Our current daily update typically runs for about an hour.
The warning messages with the 202 Accepted response typically come from the commit activity query and should be expected. That response means this particular data requires some additional internal processing on GitHub's side before it can be returned. The initial query triggers that processing, and the script then repeats the query after allowing it time to finish, at which point the desired data comes back. The commit activity result is then cached on GitHub's side, so it is served immediately for roughly the rest of the day (about 24 hours).
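For anyone curious what that retry loop looks like in practice, here is a minimal Python sketch of polling a GitHub statistics endpoint that answers 202 Accepted while the data is still being computed. It mirrors the log output above but is not the script's actual code; the function name, retry count, and token handling are illustrative.

```python
import os
import time

import requests

def fetch_commit_activity(owner, repo, max_tries=10, wait_sec=3):
    """Poll GitHub's commit-activity stats until they are ready."""
    url = f"https://api.github.com/repos/{owner}/{repo}/stats/commit_activity"
    headers = {"Authorization": f"token {os.environ['GITHUB_API_TOKEN']}"}
    for _ in range(max_tries):
        resp = requests.get(url, headers=headers)
        if resp.status_code == 202:
            # 202 Accepted: GitHub has queued the statistics computation
            # but has no data yet, so wait briefly and ask again.
            time.sleep(wait_sec)
            continue
        resp.raise_for_status()
        # Once computed, GitHub caches the result, so later calls
        # return immediately.
        return resp.json()
    raise RuntimeError(f"{owner}/{repo}: stats not ready after {max_tries} tries")
```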
The generic GraphQL API error message means something went wrong on GitHub's side. Sometimes these are intermittent issues, in which case the script will attempt the query again. Other times they can be caused by something like an empty repository. A closer examination of _explore/LAST_MASTER_UPDATE.log may reveal what is happening in these cases.
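Similarly, here is a minimal sketch of retrying a query against GitHub's GraphQL endpoint, under the assumption that intermittent server-side failures surface as an "errors" array in the JSON response. Again, this is illustrative rather than the script's actual logic:

```python
import os
import time

import requests

GRAPHQL_URL = "https://api.github.com/graphql"

def run_graphql(query, variables=None, retries=3, wait_sec=5):
    """Run a GraphQL query, retrying when GitHub reports an error."""
    headers = {"Authorization": f"bearer {os.environ['GITHUB_API_TOKEN']}"}
    last_errors = None
    for _ in range(retries):
        resp = requests.post(
            GRAPHQL_URL,
            json={"query": query, "variables": variables or {}},
            headers=headers,
        )
        payload = resp.json()
        if resp.ok and "errors" not in payload:
            return payload["data"]
        # Intermittent server-side failures may succeed on a retry;
        # persistent ones (e.g. an empty repository) will keep failing
        # and should be chased down in the log instead.
        last_errors = payload.get("errors", resp.status_code)
        time.sleep(wait_sec)
    raise RuntimeError(f"GraphQL query failed after {retries} attempts: {last_errors}")
```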