Large open source projects on GitHub have intimidatingly long lists of problems that require addressing. To make it easier to spot the most pressing among them, GitHub recently introduced the “good first issues” feature, which matches contributors with project issues that are likely to fit their interests. The initial version, which launched in May 2019, surfaced recommendations based on labels applied to issues by project maintainers. But an updated release that shipped last month incorporates an AI algorithm that GitHub says surfaces issues in about 70% of the repositories it recommends to users.
GitHub says it’s the first deep-learning-enabled product to launch on GitHub.com.
According to senior machine learning engineer Tiferet Gazit, GitHub last year used analysis and manual curation to create a list of roughly 300 label names used by popular open source repositories. (All were synonyms for either “good first issue” or “documentation,” like “beginner friendly,” “easy bug fix,” and “low-hanging-fruit.”) But relying on these labels meant that only about 40% of the recommended repositories had issues that could be surfaced. Plus, it left project maintainers with the burden of triaging and labeling issues themselves.
By contrast, the new AI recommender system is largely automatic. But building it required crafting an annotated training set of hundreds of thousands of samples.
GitHub began with issues that had any of the roughly 300 labels in its curated list, which the company supplemented with a few sets of issues that were also likely to be beginner-friendly. (These included issues that were closed by a user who had never previously contributed to the repository, as well as closed issues whose fix touched only a few lines of code in a single file.) After detecting and removing near-duplicate issues, GitHub separated the training, validation, and test sets across repositories to prevent data leakage from similar content. It trained the AI system using only preprocessed and denoised issue titles and bodies, so that the system can detect good issues as soon as they’re opened.
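To make the repository-level split concrete, here is a minimal Python sketch of one way to keep every issue from a given repository in the same set. It is not GitHub's code: the hash-based assignment, the 10% validation and test shares, the function names, and the "repo" field are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def assign_split(repo: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a repository name to one split using a stable hash.

    The percentages are illustrative assumptions, not GitHub's values.
    """
    digest = hashlib.sha256(repo.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # deterministic bucket in [0, 100)
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_by_repository(issues):
    """Group issues into splits so that no repository spans two splits.

    Each issue is a dict with a hypothetical "repo" key.
    """
    splits = defaultdict(list)
    for issue in issues:
        splits[assign_split(issue["repo"])].append(issue)
    return dict(splits)
```

Because the split is decided by the repository name alone, two similar issues from the same project can never land on opposite sides of the train and test boundary, which is the kind of leakage the article describes.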
In production, each issue for which the recommender predicts a probability above the required threshold is slated for recommendation, with a confidence score equal to its predicted probability. Open issues from non-archived public repositories that have at least one of the labels from the curated label list are given a confidence score based on the relevance of their labels, with synonyms of “good first issue” given higher confidence than synonyms of “documentation.” Within each repository, all detected issues are then ranked primarily based on their confidence score (though label-based detections are generally given higher confidence than ML-based detections), along with a penalty on issue age.
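The ranking described above could be sketched roughly as follows. This is an illustration under stated assumptions rather than GitHub's implementation: the threshold, the per-label confidence values, the linear age penalty, and every name in the code are hypothetical, since the article does not give the actual numbers or formula.

```python
from dataclasses import dataclass
from typing import List, Optional

# All constants below are illustrative assumptions, not GitHub's values.
ML_THRESHOLD = 0.8
LABEL_CONFIDENCE = {"good first issue": 1.0, "documentation": 0.9}
AGE_PENALTY_PER_DAY = 0.001


@dataclass
class Issue:
    title: str
    age_days: int
    label_kind: Optional[str] = None  # which curated label family matched, if any
    ml_probability: float = 0.0       # model's predicted probability


def confidence(issue: Issue) -> Optional[float]:
    """Return a confidence score, or None if the issue is not recommended."""
    if issue.label_kind in LABEL_CONFIDENCE:
        return LABEL_CONFIDENCE[issue.label_kind]  # label-based detection
    if issue.ml_probability > ML_THRESHOLD:
        return issue.ml_probability                # ML-based detection
    return None


def rank_issues(issues: List[Issue]) -> List[Issue]:
    """Rank one repository's detected issues by confidence minus an age penalty."""
    scored = []
    for issue in issues:
        score = confidence(issue)
        if score is not None:
            scored.append((score - AGE_PENALTY_PER_DAY * issue.age_days, issue))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [issue for _, issue in scored]
```

In this sketch a fresh documentation-labeled issue can outrank a very old “good first issue,” which is one plausible reading of ranking "primarily" by confidence with a penalty on age.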
Data acquisition, training, and inference pipelines run daily, according to Gazit, using scheduled workflows to ensure the results remain “fresh” and “relevant.” In the future, GitHub plans to add better signals to its repository recommendations and a mechanism for maintainers and triagers to approve or remove AI-based recommendations in their repositories. It also plans to extend issue recommendations to offer personalized suggestions on next issues to tackle for anyone who has already made contributions to a project.