I am running MASTER.sh to download all data from the NREL GitHub organization (which has 350 repos), but it's taking a very long time and I'm not sure whether this is normal. For most repositories in the org, the query returns in under a second. The script also appears to be scraping over 4,000 repositories in total (possibly dependencies?).
For some repositories, the query seems to take much longer, and the script prints warning-like messages such as:
```
Sending REST query...
Checking response...
HTTP/1.1 202 Accepted
API Status {"limit": 5000, "remaining": 4414, "reset": 1607114323}
Query accepted but not yet processed. Trying again in 3sec...
```
Also, for a very small minority of repos, I get the following error-like message:
These two errors do not seem to occur simultaneously.
The script is still humming along, and I will let it finish, but am wondering if these errors can simply be ignored.
Update: The script has finished and I am able to view the data using the Jekyll dev server. However, it appears that at least 3 repositories (out of 350) were skipped.
Steps to reproduce:

1. Remove all data from explore/github_data.
2. Remove all repos and orgs from _explore/input_lists.json, and add "NREL" as an org (a sample file is sketched after this list).
3. Create a Python environment and install the dependencies from requirements.txt.
4. Set the GITHUB_API_TOKEN environment variable.
5. Run ./MASTER.sh.
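For reference, a minimal _explore/input_lists.json after step 2 might look like the sketch below. The exact schema is an assumption here (top-level "orgs" and "repos" arrays, inferred from the file's name and usage); check the file that ships with the repository for the real field names.

```json
{
    "orgs": ["NREL"],
    "repos": []
}
```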
The update can take a long time. Our current daily update typically runs for about an hour.
The warning messages with the 202 Accepted response typically come from the commit activity query and should be expected. That response means this particular data requires some additional internal processing on GitHub's side before it can be returned. The initial query triggers that processing, and the script then repeats the query after allowing it time to finish, at which point the desired data comes back. The commit activity result is then cached on GitHub's side, so it is served immediately for roughly the rest of the day (about 24 hours).
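For anyone curious what that retry loop looks like in practice, here is a minimal Python sketch of polling a GitHub statistics endpoint that answers 202 Accepted while the data is still being computed. It mirrors the log output above but is not the script's actual code; the function name, retry count, and token handling are illustrative.

```python
import os
import time

import requests

def fetch_commit_activity(owner, repo, max_tries=10, wait_sec=3):
    """Poll GitHub's commit-activity stats until they are ready."""
    url = f"https://api.github.com/repos/{owner}/{repo}/stats/commit_activity"
    headers = {"Authorization": f"token {os.environ['GITHUB_API_TOKEN']}"}
    for _ in range(max_tries):
        resp = requests.get(url, headers=headers)
        if resp.status_code == 202:
            # 202 Accepted: GitHub has queued the statistics computation
            # but has no data yet, so wait briefly and ask again.
            time.sleep(wait_sec)
            continue
        resp.raise_for_status()
        # Once computed, GitHub caches the result, so later calls
        # return immediately.
        return resp.json()
    raise RuntimeError(f"{owner}/{repo}: stats not ready after {max_tries} tries")
```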
The generic GraphQL API error message means something went wrong on GitHub's side. Sometimes these are intermittent issues, in which case the script will attempt the query again. Other times they can be caused by something like an empty repository. A closer examination of _explore/LAST_MASTER_UPDATE.log may reveal what is happening in these cases.
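Similarly, here is a minimal sketch of retrying a query against GitHub's GraphQL endpoint, under the assumption that intermittent server-side failures surface as an "errors" array in the JSON response. Again, this is illustrative rather than the script's actual logic:

```python
import os
import time

import requests

GRAPHQL_URL = "https://api.github.com/graphql"

def run_graphql(query, variables=None, retries=3, wait_sec=5):
    """Run a GraphQL query, retrying when GitHub reports an error."""
    headers = {"Authorization": f"bearer {os.environ['GITHUB_API_TOKEN']}"}
    last_errors = None
    for _ in range(retries):
        resp = requests.post(
            GRAPHQL_URL,
            json={"query": query, "variables": variables or {}},
            headers=headers,
        )
        payload = resp.json()
        if resp.ok and "errors" not in payload:
            return payload["data"]
        # Intermittent server-side failures may succeed on a retry;
        # persistent ones (e.g. an empty repository) will keep failing
        # and should be chased down in the log instead.
        last_errors = payload.get("errors", resp.status_code)
        time.sleep(wait_sec)
    raise RuntimeError(f"GraphQL query failed after {retries} attempts: {last_errors}")
```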