Optimise CrawlProjectsJob to use fewer GitHub core resources #281

Closed
dabico opened this issue Jan 11, 2024 · 0 comments · Fixed by #285
Labels
enhancement Improvements or additions to features refactoring Improvements to code structure server Concerning the server

dabico commented Jan 11, 2024

Although we substantially cut our use of GitHub API core resources back in #152, I believe it can be optimized further. Since a repository can have an arbitrary number of labels, languages and topics (see footnote 1), we have relied solely on REST APIs to retrieve them, as pagination ensures that they can all be retrieved, 100 results at a time. This means that for every repository mined, we have to perform at least 3 additional REST calls, one per relationship, and that is only the best case. However, should a repository have no more than 100 of each, we could retrieve the complete lists within the GraphQL query itself. This, in turn, would eliminate the need to make any further REST requests for that repository. At the same time, this would not increase the overall cost of the GraphQL query we currently use.
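The decision described above can be sketched roughly as follows. This is a minimal illustration and not the actual CrawlProjectsJob code: the payload shape (`totalCount` plus a list of `nodes` with a `name` field) and the helper name `extract_or_fallback` are assumptions made for the example.

```python
from typing import Any

PAGE_LIMIT = 100  # largest number of items a single request is assumed to return


def extract_or_fallback(connection: dict[str, Any]) -> tuple[list[str], bool]:
    """Return (names, needs_rest_fallback) for one connection-shaped payload.

    The shape {"totalCount": int, "nodes": [{"name": str}, ...]} is an
    assumption made for this sketch.
    """
    names = [node["name"] for node in connection["nodes"]]
    # If the repository holds more items than were returned, the GraphQL
    # result is incomplete and the paginated REST path is still required.
    needs_fallback = connection["totalCount"] > len(names)
    return names, needs_fallback


# A repository with few labels: the GraphQL payload is already complete.
small = {"totalCount": 3, "nodes": [{"name": "bug"}, {"name": "docs"}, {"name": "question"}]}
# A repository with many labels: only the first page is present.
large = {"totalCount": 2010, "nodes": [{"name": f"label-{i}"} for i in range(PAGE_LIMIT)]}
```

Comparing the reported total against the number of returned nodes keeps the check independent of the page size, so it would continue to work if the limit were ever changed.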

To grasp the impact of switching to the methodology described above, I decided to look at the average and maximum number of children present in each relationship. I ran the following query:

WITH git_repo_label_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_label
    GROUP BY repo_id
), git_repo_language_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_language
    GROUP BY repo_id
), git_repo_topic_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_topic
    GROUP BY repo_id
), git_repo_label_statistics AS (
    SELECT
        AVG(count) AS average,
        MAX(count) AS max
    FROM git_repo_label_count
), git_repo_language_statistics AS (
    SELECT
        AVG(count) AS average,
        MAX(count) AS max
    FROM git_repo_language_count
), git_repo_topic_statistics AS (
    SELECT
        AVG(count) AS average,
        MAX(count) AS max
    FROM git_repo_topic_count
), git_repo_statistics AS (
    SELECT 'label' AS relation, average, max
    FROM git_repo_label_statistics
    UNION ALL
    SELECT 'language' AS relation, average, max
    FROM git_repo_language_statistics
    UNION ALL
    SELECT 'topic' AS relation, average, max
    FROM git_repo_topic_statistics
)
SELECT relation, average, max
FROM git_repo_statistics;

Which produced the following result:

relation  average  max
label     9.9397   2010
language  2.9821   358
topic     6.4378   21

The results are in line with expectations. On average, each repository has about 9 to 10 labels, which corresponds to the labels generated by default when creating a new repository. At the same time, the number of labels in a single project can reach into the thousands (the maximum observed is 2010), reinforcing the need to keep REST as a fallback when the preliminary GraphQL retrieval proves insufficient.
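To illustrate what the REST fallback costs for the largest repository observed, here is a minimal pagination sketch. The `fetch_page` callable and the fake data source are hypothetical stand-ins for the real REST client, which is not shown in this issue.

```python
from typing import Callable


def fetch_all(fetch_page: Callable[[int, int], list[str]], per_page: int = 100) -> list[str]:
    """Collect every item by requesting pages until a short page is returned."""
    items: list[str] = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        items.extend(batch)
        if len(batch) < per_page:  # a short (or empty) page marks the end
            return items
        page += 1


# Fake data source standing in for the REST endpoint (hypothetical).
LABELS = [f"label-{i}" for i in range(2010)]
call_log: list[int] = []  # records every page number that was requested


def fake_fetch_page(page: int, per_page: int) -> list[str]:
    call_log.append(page)
    start = (page - 1) * per_page
    return LABELS[start:start + per_page]


all_labels = fetch_all(fake_fetch_page)
print(f"{len(all_labels)} labels retrieved in {len(call_log)} requests")
```

With 2010 labels and 100 results per page, the loop issues 21 requests: 20 full pages followed by one page of 10.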

Languages follow a similar trend, with an average of about 3 languages per project and a maximum exceeding 350. Since the number of languages is tied to project complexity, as reflected by the repository's code contents, it makes sense that both the average and the maximum are lower than for a user-defined characteristic like labels. Still, we may need to rely on the old REST method for complex projects.

Finally, we have the topic counts. Although the average number of topics is between 6 and 7, the maximum of 21 came as a bit of a surprise. GitHub currently restricts the number of topics assignable to a repository to 20, which points to a possible issue with our current implementation. Discrepancy aside, the restriction implies that all repository topics can be retrieved in a single REST request. Even so, one should keep in mind that GitHub is a constantly evolving platform. As such, the paginated retrieval of topics through REST should be retained in case GitHub ever raises the upper bound on assignable topics.

Since the initial findings proved promising, I decided to delve further. Given that at most 100 labels, languages or topics can be retrieved in a single request, I wanted to see the percentage of projects whose label and language counts exceed 100. Since topics currently cap at 20, they were not considered. Executing the following query:

WITH git_repo_count AS (
    SELECT COUNT(*) AS count
    FROM git_repo
), git_repo_label_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_label
    GROUP BY repo_id
), git_repo_language_count AS (
    SELECT repo_id, COUNT(*) AS count
    FROM git_repo_language
    GROUP BY repo_id
)
SELECT
    'labels' AS relation,
    100.0 * SUM(IF(count > 100, 1, 0)) / (SELECT count FROM git_repo_count) AS percentage
FROM git_repo_label_count
UNION ALL
SELECT
    'languages' AS relation,
    100.0 * SUM(IF(count > 100, 1, 0)) / (SELECT count FROM git_repo_count) AS percentage
FROM git_repo_language_count;

Produced the following output:

relation  percentage
label     0.10929
language  0.00070

Note that the values above are already percentages, not fractions. In conclusion, roughly 99.9% of GraphQL responses will contain all the information we need. For the remaining 0.1% or so, we will fall back on the REST method that we have used thus far. Since every repository currently costs at least 3 extra REST calls, this optimization could lead to substantial savings in core resources, which opens the door to other enhancements, such as parallel miners that divide the mining load, or another background job that can freely use the APIs (proposed in #166).
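The figures above can be double-checked with a short calculation. This is a rough sketch, assuming the two reported percentages are the only inputs and treating them as a union bound; topics are left out because they cap at 20.

```python
# Percentages of repositories exceeding 100 items, taken from the output above.
over_100_labels_pct = 0.10929
over_100_languages_pct = 0.00070

# Union bound: at most this share of repositories needs any REST fallback
# (a repository exceeding both limits would be counted twice).
max_fallback_pct = over_100_labels_pct + over_100_languages_pct
min_graphql_only_pct = 100.0 - max_fallback_pct

print(f"fallback needed for at most {max_fallback_pct:.5f}% of repositories")
print(f"GraphQL alone suffices for at least {min_graphql_only_pct:.5f}%")
```

The bound works out to about 0.11% of repositories needing a fallback, which is consistent with the rounded 99.9% and 0.1% figures.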

Footnotes

  1. Although GitHub currently imposes a double-digit hard limit, what if this limitation changes in the future?

@dabico dabico added enhancement Improvements or additions to features refactoring Improvements to code structure server Concerning the server labels Jan 11, 2024
@dabico dabico self-assigned this Jan 11, 2024