Optimise `CrawlProjectsJob` to use fewer GitHub `core` resources #281

dabico · 2024-01-11T14:28:18Z

Although we substantially cut `core` resource usage from the GitHub API back in #152, I believe it can be optimized further. Since a repository can have an arbitrary number of labels, languages and topics¹, we have relied solely on the REST APIs to retrieve them, as pagination ensures that they can all be retrieved, 100 results at a time. This means that for every repository mined, we have to perform at least 3 additional REST calls, and that is only the best case. However, should a repository have at most 100 of each, we could retrieve the complete result lists in GraphQL. This, in turn, would eliminate the need to make any further REST requests for that repository. At the same time, it would not increase the overall cost of the GraphQL query we currently use.
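A minimal sketch of the proposed decision logic, assuming the GraphQL query requests the first 100 entries of each connection together with its `totalCount`. The fragment uses connection fields from GitHub's public GraphQL schema, but the query the crawler actually sends is not shown here, and `Connection`, `needs_rest_fallback` and `relations_needing_rest` are hypothetical names introduced purely for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List

# Assumed fragment, to be placed inside a `repository { ... }` selection.
# The real crawler query may be structured differently.
REPOSITORY_FRAGMENT = """
labels(first: 100) { totalCount nodes { name } }
languages(first: 100) { totalCount nodes { name } }
repositoryTopics(first: 100) { totalCount nodes { topic { name } } }
"""


@dataclass
class Connection:
    """Hypothetical holder for one parsed GraphQL connection."""
    total_count: int   # value of `totalCount` reported by GraphQL
    names: List[str]   # names extracted from the returned `nodes`


def needs_rest_fallback(connection: Connection) -> bool:
    # The first page is complete only if it holds every entry that exists.
    return connection.total_count > len(connection.names)


def relations_needing_rest(connections: Dict[str, Connection]) -> List[str]:
    # Names of the relations that still require paginated REST retrieval.
    return sorted(
        relation
        for relation, connection in connections.items()
        if needs_rest_fallback(connection)
    )
```

Under these assumptions, a repository with 9 labels needs no extra call, while one with 2010 labels would still go through the existing paginated REST path for labels only.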

To grasp the impact of switching to the methodology described above, I decided to look at the average and maximum number of children present in each relationship. I ran the following query:

WITH git_repo_label_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_label
    GROUP BY repo_id
), git_repo_language_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_language
    GROUP BY repo_id
), git_repo_topic_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_topic
    GROUP BY repo_id
), git_repo_label_statistics AS (
    SELECT
        AVG(count) AS average,
        MAX(count) AS max
    FROM git_repo_label_count
), git_repo_language_statistics AS (
    SELECT
        AVG(count) AS average,
        MAX(count) AS max
    FROM git_repo_language_count
), git_repo_topic_statistics AS (
    SELECT
        AVG(count) AS average,
        MAX(count) AS max
    FROM git_repo_topic_count
), git_repo_statistics AS (
    SELECT 'label' AS relation, average, max
    FROM git_repo_label_statistics
    UNION ALL
    SELECT 'language' AS relation, average, max
    FROM git_repo_language_statistics
    UNION ALL
    SELECT 'topic' AS relation, average, max
    FROM git_repo_topic_statistics
)
SELECT relation, average, max
FROM git_repo_statistics;

Which produced the following result:

| relation | average | max |
|---|---|---|
| `label` | 9.9397 | 2010 |
| `language` | 2.9821 | 358 |
| `topic` | 6.4378 | 21 |

The results were in line with expectations. On average, each repository contains 9 to 10 labels, which corresponds to the labels generated by default when creating a new repository. At the same time, the number of labels used within a project can reach well into the thousands, reinforcing the need to rely on REST as a fallback when the preliminary GraphQL retrieval proves insufficient.

Languages follow a similar trend, with the average number of languages in a project being about 3, and the maximum exceeding 350. Since the number of languages is directly tied to project complexity, as reflected by the repository's code contents, it makes sense that both the average and the maximum are lower than those of a user-defined characteristic like labels. Still, we may need to rely on the old REST method for complex projects.

Finally, we have the topic counts. Although the average number of topics lies between 6 and 7, the maximum of 21 came as a bit of a surprise. GitHub currently restricts the number of assignable topics to 20, which points to a possible issue with our current implementation. Discrepancy aside, this restriction implies that all repository topics can be retrieved in a single REST request. Even so, one should keep in mind that GitHub is a constantly evolving platform. As such, the paginated retrieval of topics through REST should be retained in case GitHub ever raises the topic upper bound.

Since the initial findings proved promising, I decided to delve further. Given that at most 100 labels, languages or topics can be retrieved in a single request, I wanted to see the percentage of projects whose label and language counts exceed 100. Since topics currently cap at 20, they were not considered. Executing the following query:

WITH git_repo_count AS (
    SELECT COUNT(*) AS count
    FROM git_repo
), git_repo_label_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_label
    GROUP BY repo_id
), git_repo_language_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_language
    GROUP BY repo_id
)
SELECT
    'label' AS relation,
    100.0 * SUM(IF(count > 100, 1, 0)) / (SELECT count FROM git_repo_count) AS percentage
FROM git_repo_label_count
UNION ALL
SELECT
    'language' AS relation,
    100.0 * SUM(IF(count > 100, 1, 0)) / (SELECT count FROM git_repo_count) AS percentage
FROM git_repo_language_count;

Produced the following output:

| relation | percentage |
|---|---|
| `label` | 0.10929 |
| `language` | 0.00070 |

Note that the values displayed are percentages. In conclusion, roughly 99.9% of GraphQL requests will contain all the information we need. For the remaining 0.1%, we will fall back on the REST method that we have used thus far. This optimization could lead to massive savings, which opens the door to other enhancements, such as parallel miners that divide the mining load, or another background job that can freely use the APIs (proposed in #166).
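The coverage figure can be checked with a short calculation. The sketch below assumes the two percentages reported in the output above and applies a simple union bound, since the query does not tell us how many repositories exceed the limit for both labels and languages. The function names are hypothetical:

```python
# Percentage of repositories with more than 100 entries, copied from the output above.
OVER_LIMIT_PERCENT = {"label": 0.10929, "language": 0.00070}


def fallback_upper_bound(percentages):
    """Upper bound (in percent) on repositories needing any REST fallback.

    Summing the shares over-counts repositories that exceed the limit in
    more than one relation, so the true value can only be lower.
    """
    return sum(percentages.values())


def covered_lower_bound(percentages):
    """Lower bound (in percent) on repositories fully served by GraphQL."""
    return 100.0 - fallback_upper_bound(percentages)


if __name__ == "__main__":
    print(f"fallback needed for at most {fallback_upper_bound(OVER_LIMIT_PERCENT):.5f}%")
    print(f"fully covered for at least {covered_lower_bound(OVER_LIMIT_PERCENT):.2f}%")
```

Under this bound, at most about 0.11% of repositories would need the REST fallback, so at least about 99.89% would be served by GraphQL alone.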

¹ Although GitHub currently imposes a double-digit hard limit on topics, what if this limitation changes in the future?


dabico added the `enhancement`, `refactoring` and `server` labels Jan 11, 2024

dabico self-assigned this Jan 11, 2024

dabico mentioned this issue Jan 16, 2024

Crawler resource consumption enhancement + refactoring #285

Merged

dabico closed this as completed in #285 Jan 16, 2024
