Skip to main content

Predicting Open Source Project Activeness Based on OSS Compass Indicators

Zheng Liu,Xiaolan Zu,Xingyu Luo,Zihang Wang,Jierui zhang,Yehui wang, Liang Wang, Xianping Tao Department of Computer Science and Technology, State Key Laboratory for Novel Software Technology, Nanjing University Huawei Technologies Co, Ltd. OSS Compass

Abstract

This study proposes a method to predict the future activeness of open source projects using OSS Compass indicators. It employs a feature-based time series classification approach, extracting statistical features from OSS Compass indicator series and using machine learning algorithms for prediction. The method demonstrated an accuracy of nearly 90% in cross-validation on a dataset of around 600 projects and about 80% accuracy on a larger set of over 10,000 repositories, indicating its practical applicability. The results partially reflect the future health status of open source projects, demonstrating the effectiveness of the OSS Compass indicator system in measuring the health of open source software. The method could provide valuable insights for users, developers, investors, and managers of open source software.

1 OSS Compass Open Source Project Health Metrics Dataset

1.1 Data Preparation

This research encompasses two principal datasets, designed to provide comprehensive information to support our study. Initially, we have the expansive dataset A, which includes data from 20,213 repositories. This dataset's vast scope encompasses a broad spectrum of open-source projects, offering a rich pool of reference materials for our analysis. Furthermore, we have curated a more focused subset, dataset B. This subset, extracted from the larger dataset A, was selected based on key metrics such as star and fork counts of open-source projects, thereby representing the most influential and exemplary projects within the open-source community. Dataset B comprises 735 repositories, whose data are pivotal for extracting features relevant to our research. We then gathered metrics time-series data for each repository, categorized into four dimensions and comprising 80 metrics (including a time dimension), with each metric forming a time series directly corresponding to each repository (zero imputation was employed for missing data).

1.2 Data Cleaning

Upon an intensive review of the collected metric data, we engaged in a rigorous data filtering process to ensure the integrity and reliability of the datasets. Initially, repositories with challenging data retrieval, either due to data scarcity or the presence of irregular, 'dirty' data, were excluded. This filtration resulted in a refined dataset A, now consisting of 20,181 repositories, replete with all requisite metric information. Similarly, dataset B was condensed to 600 repositories, all demonstrating exemplary data quality and usability, laying a solid foundation for further analysis.

1.3 Open Source Repository Selection

Considering the specific requirements for time series length in our predictive tasks, we implemented several data selection and processing steps to ensure the quality and appropriateness of our datasets. Repositories with insufficient time series length, including empty repositories, were excluded as they could not provide adequate data for reliable predictions. Additionally, our focus was on utilizing time series leading up to a project's transition to inactivity; hence, repositories that remained inactive throughout the recorded period were also discarded to ensure the feasibility and accuracy of model training and predictions.

After these meticulous selection and processing steps, we finalized dataset A, comprising 19,413 repositories, and dataset B, consisting of 585 repositories, both tailored to meet the requirements of our predictive tasks. Subsequent analyses were conducted to determine the activity status of these repositories, with specific criteria outlined in section 1.4. In dataset A, 5,411 repositories were deemed active and 14,002 inactive, reflecting an uneven distribution of labels, which aligns with the real-world scenario of fewer active projects. In contrast, dataset B exhibited a more balanced label distribution, with 325 active and 260 inactive repositories, beneficial for subsequent predictive tasks as it enables the model to more effectively learn and predict performance across different repository states.

1.4 Criteria for Determining Active/Inactive Status

The aforementioned criteria serve as independent variables in our predictive analysis. To establish the dependent variable, each repository sample was analyzed and labeled as inactive or active, facilitating the training and prediction process. A repository was classified as inactive if it met either of the following conditions:

1.A lack of activity, defined as commits, branch creations, issue tracking, or pull requests, over a period exceeding one year

2.Minimal commit activity within the past year, indicating sporadic engagement insufficient to maintain active status.

Based on these definitions, each repository was analyzed and accordingly labeled, ensuring an objective, quantifiable basis for subsequent analysis and predictions.

2 Feature-Based Method for Predicting Project Activeness

1.png

Figure2-1 Predictive Method Based on Feature Extraction and Classical Machine Learning Techniques

As illustrated in Figure 2-1, the feature-based approach is a classic machine learning methodology typically employed for processing time series data. The crux of this approach is to transform time series data into a set of features, which include, but are not limited to, statistical metrics like mean, median, maximum, minimum, and variance. By extracting these features, we convert the original time series data into a non-time series CSV data table, enabling the application of traditional data analysis and machine learning techniques such as KNN and Decision Trees. The advantage of this method lies in its ability to simplify complex time series data into fixed-dimension feature vectors, facilitating subsequent data processing and modeling. Additionally, feature extraction helps reduce noise and redundancy in the data, enhancing model stability and performance. The typical steps in the feature-based approach include:

  1. Data Collection: Initially, time series data is collected, which may originate from sensors, financial markets, meteorological stations, or other domains.

  2. Feature Extraction: This involves the computation of various statistical metrics such as mean, median, standard deviation, kurtosis, skewness, and the application of signal processing techniques like Fourier and wavelet transforms for frequency domain feature extraction.

  3. Data Transformation: The extracted features are compiled into a tabular dataset, typically saved as a CSV file, with each row representing a time series sample and each column a feature.

  4. Data Preprocessing: This step includes handling missing values, standardization, normalization, etc., to ensure data stability and usability in modeling.

  5. Modeling and Analysis: Traditional machine learning algorithms such as Linear Regression, Decision Trees, Random Forests, Support Vector Machines, etc., are employed for modeling and analysis of the feature-based data.

  6. Model Evaluation: The model's performance is assessed using techniques like cross-validation and metrics such as accuracy, mean squared error, ROC curve, etc.

Key steps in this process include feature extraction and the design of learning algorithms, elaborated as follows.

2.1 Extracted Feature Set

In this work, we extracted features from 72 indices across four dimensions:

  1. Activity Score: 19 indicators related to activity, such as active_C2_contributor_count_activity, active_C1_pr_create_contributor_activity, etc. (refer to Appendix B.1).

  2. Code Quality Guarantee: 25 indicators including contributor_count_codequality, contributor_count_bot_codequality, etc. (see Appendix B.2).

  3. Community Service and Support: 15 indicators such as issue_first_reponse_avg_community, issue_first_reponse_mid_community, etc. (refer to Appendix B.3).

  4. Group Activity: 13 indicators like contributor_count_group_activity, contributor_count_bot_group_activity, etc. (see Appendix B.4).

Additionally, basic project information such as the project name and grimoire_creation_date were included in the feature computation. However, since these do not fall under the indicators and models proposed by OSS Compass, they were not subjected to feature extraction. For the numerical features in the 72 indices, multiple statistical functions were selected, including length, large standard deviation, mean, maximum, minimum, variance, etc., resulting in 16 representative statistical quantities (detailed in Appendix C). These were used to compute corresponding feature vectors. Further, based on the preferences of different classifiers, two sets of typical feature collections were selected through feature selection:

  1. Feature Set 1: Comprising 596 distinct features for classifiers like XGBoost, RandomForest, AdaBoost, etc.

  2. Feature Set 2: Including 134 features for classifiers such as KNN, Logistic, SVM, etc.

Details of these two feature sets can be found in Appendix B.

2.2 Classification and Predictive Algorithms

  1. XGBoost (Extreme Gradient Boosting): XGBoost enhances model performance by combining multiple decision trees, offering remarkable accuracy and robustness. It iteratively refines the model by optimizing the loss function, preventing overfitting, and supports feature selection. Widely employed in data science challenges like Kaggle competitions, XGBoost stands out for its efficiency in handling complex datasets.

  2. RandomForest: As an ensemble learning method based on multiple decision trees, RandomForest classifies or regresses based on collective tree voting. It introduces randomness to mitigate overfitting risks, boasting excellent generalization capabilities and automatic feature selection. Suitable for a variety of data types, RandomForest is user-friendly and demands minimal hyperparameter tuning.

  3. AdaBoost (Adaptive Boosting): An iterative learning algorithm, AdaBoost enhances model performance by integrating multiple weak learners. It adjusts sample weights based on previous learning errors, focusing more on incorrectly classified samples. Commonly used for binary classification problems, AdaBoost in this project employs decision trees as base learners, denoted as AdaBoost + DecisionTree.

  4. Logistic Regression: A widely used linear model for classification problems, Logistic Regression employs a logistic function to estimate the relationship between input features and a binary target. Known for its simplicity and interpretability, it's frequently applied in probability-based predictions like customer churn or disease diagnosis.

  5. SVM (Support Vector Machine): SVM is a powerful algorithm for classification and regression. It maximizes the margin between different categories by finding the optimal hyperplane. Excellently performing in high-dimensional spaces, SVM can adapt to various data types with different kernel functions.

  6. KNN (K-Nearest Neighbors): KNN, an instance-based learning algorithm, is used for classification and regression. It makes predictions by measuring the nearest K neighbors to a query point. While KNN is straightforward and easy to comprehend, it may be less efficient for large datasets.

Built upon a multitude of classical machine learning algorithms, feature-based classification prediction methods offer diversity, simplicity in implementation, and a wealth of tools. Moreover, by extracting features with clear physical significance to characterize projects, these methods provide high interpretability.

3 Empirical Study Design and Results

This section, grounded on the aforementioned methods, conducts empirical research to validate the effectiveness of the OSS Compass indicator system in predicting future activity and health of projects, supported by machine learning and AI methodologies.

3.1 Methodology of Empirical Study

This subsection introduces the datasets, methods, and evaluation metrics used in the empirical research.

3.1.1 Dataset

Two datasets, A and B as introduced above, were utilized. We initially trained and predicted on dataset B, observing commendable performance. Then, we applied the model to dataset A to assess its generalization capability, i.e., how the model performs in unseen real-world contexts.

3.1.2 Validation Method

  1. Ten-Fold Cross-Validation: Employing ten-fold cross-validation ensured robustness and generalizability in evaluating our classification models. The dataset was divided into ten subsets, with nine serving as training sets and one as a test set, in a repeated process to mitigate the impact of randomness on performance assessment. This method reliably gauges model performance across different data subsets, averting overfitting.

  2. Feature Selection Experiment: Utilizing Tsfresh's select_features function and adjusting the fdr_level parameter, we obtained a carefully curated feature set. These features significantly contributed to classification tasks, enhancing model performance. Our selection, based on statistical significance, managed FDR (False Discovery Rate) stringency. A lower fdr_level ensured highly significant features were chosen, aiding in developing high-performance, reliable classification models.

3.1.3 Evaluation Metrics

The following metrics are commonly used to evaluate the performance of classification models and to verify their effectiveness in processing data:

  1. Accuracy:
  • Accuracy is a fundamental metric for assessing the performance of a classification model.

  • It represents the ratio of the number of correctly classified samples to the total number of samples.

  • Formula: Accuracy= (TP+TN) / (TP+TN+FP+FN)

    TP:True Positive (Number of samples correctly predicted as positive)

    TN:True Negative (Number of samples correctly predicted as negative)

    FP:False Positive (Number of samples incorrectly predicted as positive)

    FN:False Negative (Number of samples incorrectly predicted as negative)

  1. Precision:
  • Precision measures the proportion of samples that are actually positive among those predicted as positive.
  • ormula: Precision = TP/(TP+FP)
  1. Recall::
  • Recall measures the proportion of positive samples correctly identified by the model out of the total positive samples.

  • Formula: Recall = TP/(TP+FN)

  1. F1 Score:
  • The F1 Score is the harmonic mean of precision and recall, used for a comprehensive assessment of model performance.
  • Formula: F1 Score = 2PR/(P+R)
  1. AUC(Area Under the Curve):
  • AUC is the area under the ROC (Receiver Operating Characteristic) curve, used to measure the performance of a model at different thresholds.
  • The ROC curve plots FPR (False Positive Rate) on the x-axis against TPR (True Positive Rate, or Recall) on the y-axis, illustrating classification performance at various thresholds.
  • An AUC value closer to 1 indicates better model performance.
  1. Confusion Matrix:
  • The Confusion Matrix is a matrix that displays the classification results of a model.
  • It includes the counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
  • It is usually presented in a tabular form for a clear, visual representation of model performance.

3.2 Assessment Results: Validation Results on Dataset B

Utilizing ten-fold cross-validation, we employed classifiers including XGBoost, RandomForest, AdaBoost, SVM, KNN, Logistic Regression, and their ensemble models for prediction. The results were measured comprehensively using various metrics as described earlier. XGBoost, RandomForest, and AdaBoost demonstrated effective performance, with accuracies nearing 90%. For the Logistic Regression method, performance was weaker in Feature Set 1 due to a high number of features leading to predictions tending towards all-ones, likely due to overfitting. However, in Feature Set 2, its accuracy reached 86%. SVM and KNN showed moderate performance, sensitive to data and requiring parameter tuning and specific data-condition optimization. Ultimately, we selected XGBoost, RandomForest, and AdaBoost for multiple prediction integrations, achieving an overall prediction accuracy of 90%. Combining multiple models' predictions can enhance overall performance, reduce the risk of overfitting, and decrease the number of prediction errors by a single classifier. The performance of each classifier is detailed below:

3.2.1 Classification Prediction Performance Assessment Based on Feature Set 1

As described in section 2.1, Feature Set 1 includes 596 different features used for classifiers such as XGBoost, RandomForest, and AdaBoost.

2.png

Figure 3-1: Performance Assessment of Open Source Project Activity Classification Prediction Based on XGBoost

As shown in Figure 3-1, using XGBoost on a dataset of 585 open source projects with ten-fold cross-validation, the performance was: Accuracy: 0.8940, Precision: 0.8973, Recall: 0.9138, F1 Score: 0.9055, AUC: 0.8915. The confusion matrix reveals that XGBoost can accurately predict the future activity of projects based on OSS Compass indices.

4.png

Figure 3-2: Performance Assessment of Open Source Project Activity Classification Prediction Based on Random Forest

Figure 3-2 shows that using Random Forest, we achieved classification prediction performance close to that of XGBoost. With ten-fold cross-validation, the performance was: Accuracy: 0.8923, Precision: 0.9068, Recall: 0.8985, F1 Score: 0.9026, AUC: 0.8915.

6.png

Figure 3-3: Performance Assessment of Open Source Project Activity Classification Prediction Based on AdaBoost + DecisionTree

As depicted in Figure 3-3, the AdaBoost + DecisionTree classifier also demonstrated strong predictive capability: Accuracy: 0.8906, Precision: 0.8967, Recall: 0.9077, F1 Score: 0.9021, AUC: 0.8885.

The performances of XGBoost, Random Forest, and AdaBoost were consistent, all achieving close to 90% classification accuracy. Minor differences in classifier performance, as shown, might result from randomness in the ten-fold cross-validation process.

3.2.2 Classification Prediction Performance Assessment Based on Feature Set 2

As mentioned in section 2.1, Feature Set 2 comprises 134 features, used for classifiers including KNN, Logistic Regression, and SVM. Compared to Feature Set 1, these classifiers achieved better results with Feature Set 2.

8.png

Figure 3-4: Performance Assessment of Open Source Project Activity Classification Prediction Based on KNN

As shown in Figure 3-4, the KNN method effectively predicted future activity levels of open source projects: Accuracy: 0.7521, Precision: 0.7632, Recall: 0.8031, F1 Score: 0.7826, AUC: 0.7457.

8.png

Figure 3-5: Performance Assessment of Open Source Project Activity Classification Prediction Based on SVM

As illustrated in Figure 3-5, SVM's classification prediction capability was relatively weak: Accuracy: 0.6291, Precision: 0.6812, Recall: 0.6246, F1 Score: 0.6517, AUC: 0.6296.

10.png

Figure 3-6: Performance Assessment of Open Source Project Activity Classification Prediction Based on Logistic Regression

As depicted in Figure 3-6, Logistic Regression showed superior performance to KNN and SVM in Feature Set 2: Accuracy: 0.8496, Precision: 0.8835, Recall: 0.84, F1 Score: 0.8611, AUC: 0.8508.

Compared to Feature Set 1, classifiers corresponding to Feature Set 2 were relatively weaker, but Logistic Regression approached the performance of Feature Set 1 classifiers.

3.2.3 Ensemble Learning and Overall Results

12.png

Figure 3-7: Performance Assessment After Ensemble of XGBoost, AdaBoost, and RandomForest Classifiers

Based on the aforementioned results, we opted for a majority-vote ensemble learning approach with the three classifiers from section 2.1 using Feature Set 1, achieving the best results: Accuracy: 0.9009, Precision: 0.9159, Recall: 0.9046, F1 Score: 0.9102, AUC: 0.9004.

14.png

Figure 3-8: Overview of Prediction Results

Finally, Figure 3-8 summarizes the validation results on dataset B, listing the performance of different classifiers including accuracy, precision, recall, and F1 score. The results show that, except for KNN and SVM, the remaining classifiers achieved over 85% classification accuracy. The ensemble learning method combining multiple classifiers demonstrated the best classification prediction performance, showing promise for real-world application.

3.3 Assessment Results: Generalization Results on Dataset A

Utilizing the models trained on dataset B as described in section 3.2, which included XGBoost, RandomForest, AdaBoost, SVM, KNN, Logistic Regression, and their ensemble models, we tested these models on dataset A to evaluate their generalization capabilities. The performance of each classifier is detailed below:

3.3.1 Classification Prediction Performance Assessment Based on Feature Set 1

15.png

Figure 3-9: Performance Assessment of Open Source Project Activity Classification Prediction Based on XGBoost

As shown in Figure 3-9, using XGBoost on dataset A, the performance was: Accuracy: 0.8233, Precision: 0.6250, Recall: 0.9148, F1 Score: 0.7426, AUC: 0.8513.

15.png

Figure 3-10: Performance Assessment of Open Source Project Activity Classification Prediction Based on RandomForest

As depicted in Figure 3-10, using RandomForest on dataset A, the performance was: Accuracy: 0.8330, Precision: 0.6703, Recall: 0.7888, F1 Score: 0.7247, AUC: 0.8194.

16.png

Figure 3-11: Performance Assessment of Open Source Project Activity Classification Prediction Based on AdaBoost

Figure 3-11 shows that using AdaBoost on dataset A, the performance was: Accuracy: 0.8940, Precision: 0.7946, Recall: 0.6106, F1 Score: 0.6636, AUC: 0.7738.

3.3.2 Classification Prediction Performance Assessment Based on Feature Set 2

17.png

Figure 3-12: Performance Assessment of Open Source Project Activity Classification Prediction Based on KNN

As shown in Figure 3-12, using KNN on dataset A, the performance was: Accuracy: 0.8439, Precision: 0.7253, Recall: 0.7080, F1 Score: 0.7165, AUC: 0.8022.

18.png

Figure 3-13: Performance Assessment of Open Source Project Activity Classification Prediction Based on SVM

Figure 3-13 reveals that using SVM on dataset A, the performance was: Accuracy: 0.7827, Precision: 0.6276, Recall: 0.5422, F1 Score: 0.5818, AUC: 0.7089.

19.png

Figure 3-14: Performance Assessment of Open Source Project Activity Classification Prediction Based on Logistic Regression

As depicted in Figure 3-14, using Logistic Regression on dataset A, the performance was: Accuracy: 0.7652, Precision: 0.5676, Recall: 0.6622, F1 Score: 0.6112, AUC: 0.7336.

3.3.3 Ensemble Learning and Overall Results

19.png

Figure 3-15: Performance Assessment After Ensemble of XGBoost, AdaBoost, and RandomForest Classifiers

As shown in Figure 3-15, using an ensemble of XGBoost, AdaBoost, and RandomForest on dataset A, the performance was: Accuracy: 0.8348, Precision: 0.6587, Recall: 0.8451, F1 Score: 0.7404, AUC: 0.8380.

21.png

Figure 3-16: Overview of Prediction Results

Finally, Figure 3-16 summarizes the generalization results on dataset A, listing the performance of different classifiers including accuracy, precision, recall, and F1 score. The results indicate that KNN and the ensemble learning method combining multiple classifiers exhibit strong generalization capabilities and effective classification prediction. These two models are recommended for future predictions.

3.4 Threats to Validity

Firstly, concerning internal validity threats, although mature machine learning models were employed, the effectiveness of these models is not extensively validated due to the limited data volume of the dataset. Thus, their practical applicability remains to be further experimented and proven in real-world scenarios. On another note, this method was exclusively applied to the model indices of OSS-Compass. While theoretically suitable for all time series prediction scenarios, its applicability in other contexts requires further validation.

Based on the experimental results, it is recommended to use KNN and an ensemble learning approach combining multiple classifiers as the learning models. After comprehensive training on the large-scale dataset A, these models should be deployed on practical platforms for actual use.

Acknowledgments

Gratitude is extended to the contributors of the open-source community and the OSS Compass platform for providing the data. Thanks are also due to the guiding teachers and fellow students. This work was supported by the Nanjing University Undergraduate Innovation Training Program "Research on Health Prediction Technology of Open Source Software Projects Based on Metric Indicators."

Appendix A: Feature Sets

A.1 Feature Set 1 (Comprising 596 features, named as: Index Name__Feature Function)

  1. commit_frequency_activity__minimum
  2. commit_frequency_codequality__minimum
  3. commit_frequency_without_bot_activity__minimum
  4. commit_frequency_without_bot_codequality__minimum
  5. pr_commit_count_codequality__minimum
  6. is_maintained_codequality__minimum
  7. lines_added_frequency_codequality__minimum
  8. LOC_frequency_codequality__minimum
  9. updated_since_activity__variance
  10. updated_since_activity__mean
  11. active_C2_contributor_count_codequality__minimum
  12. active_C2_contributor_count_activity__minimum
  13. is_maintained_codequality__sum_values
  14. lines_removed_frequency_codequality__minimum
  15. updated_since_activity__maximum
  16. code_quality_guarantee_codequality__minimum
  17. is_maintained_codequality__mean
  18. community_support_score_community__sum_values
  19. issue_open_time_avg_community__length
  20. pr_issue_linked_ratio_codequality__length
  21. commit_frequency_group_activity__length
  22. commit_frequency_without_bot_activity__length
  23. issue_first_reponse_mid_community__length
  24. active_C1_issue_create_contributor_activity__length
  25. community_support_score_community__length
  26. bug_issue_open_time_mid_community__length
  27. issue_open_time_mid_community__length
  28. pr_open_time_mid_community__length
  29. contribution_last_group_activity__length
  30. code_review_ratio_codequality__length
  31. contributor_count_bot_codequality__length
  32. bug_issue_open_time_avg_community__length
  33. lines_removed_frequency_codequality__length
  34. pr_first_response_time_avg_community__length
  35. commit_frequency_org_group_activity__length
  36. commit_frequency_without_bot_codequality__length
  37. commit_frequency_bot_activity__length
  38. code_review_count_community__length
  39. activity_score_activity__length
  40. code_merge_ratio_codequality__length
  41. contributor_count_without_bot_codequality__length
  42. code_review_count_activity__length
  43. commit_frequency_percentage_group_activity__length
  44. org_count_activity__length
  45. closed_issues_count_activity__length
  46. pr_count_codequality__length
  47. updated_since_activity__length
  48. contributor_count_without_bot_activity__length
  49. pr_first_response_time_mid_community__length
  50. org_count_group_activity__length
  51. comment_frequency_community__length
  52. recent_releases_count_activity__length
  53. active_C1_pr_create_contributor_codequality__length
  54. contributor_count_activity__length
  55. active_C2_contributor_count_activity__length
  56. updated_issues_count_activity__length
  57. updated_issues_count_community__length
  58. is_maintained_codequality__length
  59. active_C1_pr_comments_contributor_activity__length
  60. comment_frequency_activity__length
  61. active_C1_pr_create_contributor_activity__length
  62. code_quality_guarantee_codequality__length
  63. contributor_count_codequality__length
  64. commit_frequency_inside_codequality__length
  65. contributor_count_group_activity__length
  66. contributor_count_without_bot_group_activity__length
  67. contributor_count_bot_group_activity__length
  68. commit_frequency_codequality__length
  69. git_pr_linked_ratio_codequality__length
  70. commit_frequency_inside_without_bot_codequality__length
  71. commit_frequency_inside_bot_codequality__length
  72. commit_frequency_bot_codequality__length
  73. active_C1_pr_comments_contributor_codequality__length
  74. commit_frequency_bot_group_activity__length
  75. contributor_count_bot_activity__length
  76. pr_commit_count_codequality__length
  77. lines_added_frequency_codequality__length
  78. pr_open_time_avg_community__length
  79. commit_frequency_without_bot_group_activity__length
  80. contributor_org_count_group_activity__length
  81. issue_first_reponse_avg_community__length
  82. LOC_frequency_codequality__length
  83. active_C2_contributor_count_codequality__length
  84. commit_frequency_activity__length
  85. organizations_activity_group_activity__length
  86. active_C1_issue_comments_contributor_activity__length
  87. closed_prs_count_community__length
  88. pr_merged_count_codequality__length
  89. pr_commit_linked_count_codequality__length
  90. commit_frequency_org_percentage_group_activity__length
  91. community_support_score_community__minimum
  92. updated_since_activity__mean_abs_change
  93. activity_score_activity__sum_values
  94. closed_prs_count_community__minimum
  95. code_quality_guarantee_codequality__sum_values
  96. commit_frequency_org_percentage_group_activity__maximum
  97. activity_score_activity__minimum
  98. pr_merged_count_codequality__minimum
  99. commit_frequency_codequality__sum_values
  100. commit_frequency_activity__sum_values
  101. contributor_count_codequality__minimum
  102. commit_frequency_without_bot_codequality__sum_values
  103. commit_frequency_without_bot_activity__sum_values
  104. contributor_count_without_bot_codequality__minimum
  105. community_support_score_community__mean
  106. pr_commit_count_codequality__sum_values
  107. active_C2_contributor_count_codequality__sum_values
  108. active_C2_contributor_count_activity__sum_values
  109. closed_prs_count_community__sum_values
  110. pr_count_codequality__minimum
  111. commit_frequency_org_percentage_group_activity__mean
  112. commit_frequency_codequality__mean
  113. commit_frequency_activity__mean
  114. commit_frequency_without_bot_codequality__mean
  115. commit_frequency_without_bot_activity__mean
  116. activity_score_activity__mean
  117. contributor_org_count_group_activity__sum_values
  118. updated_since_activity__sum_values
  119. pr_count_codequality__sum_values
  120. pr_commit_count_codequality__mean
  121. commit_frequency_org_group_activity__absolute_sum_of_changes
  122. commit_frequency_codequality__absolute_sum_of_changes
  123. commit_frequency_activity__absolute_sum_of_changes
  124. commit_frequency_without_bot_activity__absolute_sum_of_changes
  125. commit_frequency_without_bot_codequality__absolute_sum_of_changes
  126. commit_frequency_org_group_activity__sum_values
  127. closed_prs_count_community__absolute_sum_of_changes
  128. pr_count_codequality__absolute_sum_of_changes
  129. closed_prs_count_community__mean
  130. code_quality_guarantee_codequality__mean
  131. active_C2_contributor_count_codequality__mean
  132. active_C2_contributor_count_activity__mean
  133. pr_merged_count_codequality__sum_values
  134. pr_commit_count_codequality__absolute_sum_of_changes
  135. pr_merged_count_codequality__absolute_sum_of_changes
  136. commit_frequency_org_group_activity__mean_abs_change
  137. pr_open_time_avg_community__variance
  138. commit_frequency_activity__mean_abs_change
  139. is_maintained_codequality__maximum
  140. commit_frequency_codequality__mean_abs_change
  141. commit_frequency_org_percentage_group_activity__last_location_of_maximum
  142. active_C1_pr_comments_contributor_activity__sum_values
  143. active_C1_pr_comments_contributor_codequality__sum_values
  144. commit_frequency_without_bot_activity__mean_abs_change
  145. commit_frequency_without_bot_codequality__mean_abs_change
  146. pr_open_time_mid_community__mean_abs_change
  147. commit_frequency_org_group_activity__mean
  148. pr_open_time_mid_community__variance
  149. pr_count_codequality__mean
  150. pr_open_time_avg_community__mean_abs_change
  151. active_C1_pr_comments_contributor_activity__minimum
  152. active_C1_pr_comments_contributor_codequality__minimum
  153. active_C1_pr_create_contributor_activity__minimum
  154. active_C1_pr_create_contributor_codequality__minimum
  155. pr_merged_count_codequality__mean
  156. contributor_count_codequality__sum_values
  157. contributor_count_without_bot_codequality__sum_values
  158. closed_prs_count_community__mean_abs_change
  159. contributor_org_count_group_activity__absolute_sum_of_changes
  160. pr_commit_count_codequality__mean_abs_change
  161. pr_merged_count_codequality__mean_abs_change
  162. pr_count_codequality__mean_abs_change
  163. code_review_count_community__sum_values
  164. active_C2_contributor_count_codequality__absolute_sum_of_changes
  165. active_C2_contributor_count_activity__absolute_sum_of_changes
  166. pr_count_codequality__maximum
  167. active_C1_pr_comments_contributor_codequality__mean
  168. active_C1_pr_comments_contributor_activity__mean
  169. commit_frequency_percentage_group_activity__mean
  170. closed_prs_count_community__maximum
  171. commit_frequency_codequality__maximum
  172. commit_frequency_activity__maximum
  173. pr_open_time_avg_community__maximum
  174. pr_open_time_mid_community__maximum
  175. commit_frequency_without_bot_activity__maximum
  176. commit_frequency_without_bot_codequality__maximum
  177. pr_merged_count_codequality__maximum
  178. contributor_org_count_group_activity__mean_abs_change
  179. active_C1_pr_create_contributor_codequality__sum_values
  180. active_C1_pr_create_contributor_activity__sum_values
  181. active_C1_pr_comments_contributor_codequality__absolute_sum_of_changes
  182. active_C1_pr_comments_contributor_activity__absolute_sum_of_changes
  183. pr_commit_count_codequality__maximum
  184. community_support_score_community__maximum
  185. closed_issues_count_activity__minimum
  186. pr_first_response_time_mid_community__last_location_of_minimum
  187. code_review_count_community__mean
  188. pr_open_time_mid_community__mean
  189. commit_frequency_org_group_activity__variance
  190. active_C1_pr_comments_contributor_codequality__maximum
  191. active_C1_pr_comments_contributor_activity__maximum
  192. contributor_org_count_group_activity__mean
  193. code_review_count_community__last_location_of_minimum
  194. active_C2_contributor_count_activity__maximum
  195. active_C2_contributor_count_codequality__maximum
  196. contributor_count_codequality__mean
  197. contributor_count_without_bot_codequality__mean
  198. active_C2_contributor_count_codequality__mean_abs_change
  199. active_C2_contributor_count_activity__mean_abs_change
  200. commit_frequency_org_percentage_group_activity__skewness
  201. code_review_ratio_codequality__last_location_of_minimum
  202. commit_frequency_org_group_activity__maximum
  203. code_review_count_activity__last_location_of_minimum
  204. pr_merged_count_codequality__variance
  205. code_merge_ratio_codequality__minimum
  206. activity_score_activity__skewness
  207. pr_count_codequality__variance
  208. activity_score_activity__maximum
  209. updated_since_activity__kurtosis
  210. contributor_org_count_group_activity__variance
  211. contributor_count_codequality__absolute_sum_of_changes
  212. contributor_count_without_bot_codequality__absolute_sum_of_changes
  213. code_review_ratio_codequality__minimum
  214. active_C2_contributor_count_codequality__last_location_of_minimum
  215. active_C2_contributor_count_activity__last_location_of_minimum
  216. active_C1_pr_comments_contributor_activity__mean_abs_change
  217. active_C1_pr_comments_contributor_codequality__mean_abs_change
  218. active_C1_pr_create_contributor_activity__mean
  219. active_C1_pr_create_contributor_codequality__mean
  220. code_quality_guarantee_codequality__maximum
  221. pr_issue_linked_ratio_codequalitylarge_standard_deviationr_0.05
  222. active_C1_pr_create_contributor_codequality__absolute_sum_of_changes
  223. active_C1_pr_create_contributor_activity__absolute_sum_of_changes
  224. contributor_org_count_group_activity__maximum
  225. pr_first_response_time_mid_community__minimum
  226. pr_first_response_time_avg_community__minimum
  227. closed_prs_count_community__variance
  228. pr_issue_linked_ratio_codequalitylarge_standard_deviationr_0.1
  229. code_review_count_activity__minimum
  230. commit_frequency_activity__last_location_of_minimum
  231. commit_frequency_codequality__last_location_of_minimum
  232. commit_frequency_without_bot_activity__last_location_of_minimum
  233. commit_frequency_without_bot_codequality__last_location_of_minimum
  234. code_review_count_community__minimum
  235. contributor_count_codequality__maximum
  236. contributor_count_without_bot_codequality__maximum
  237. pr_open_time_mid_community__absolute_sum_of_changes
  238. lines_removed_frequency_codequality__sum_values
  239. commit_frequency_codequality__variance
  240. commit_frequency_activity__variance
  241. commit_frequency_percentage_group_activity__last_location_of_maximum
  242. commit_frequency_without_bot_codequality__variance
  243. LOC_frequency_codequality__sum_values
  244. commit_frequency_without_bot_activity__variance
  245. contributor_count_activity__minimum
  246. contributor_count_without_bot_activity__minimum
  247. lines_added_frequency_codequality__sum_values
  248. code_review_count_activity__sum_values
  249. lines_removed_frequency_codequality__absolute_sum_of_changes
  250. code_review_count_community__absolute_sum_of_changes
  251. bug_issue_open_time_mid_community__last_location_of_minimum
  252. pr_first_response_time_avg_community__last_location_of_minimum
  253. updated_since_activity__mean_change
  254. closed_issues_count_activity__sum_values
  255. pr_issue_linked_ratio_codequality__sum_values
  256. pr_commit_linked_count_codequality__absolute_sum_of_changes
  257. active_C1_pr_comments_contributor_activity__variance
  258. active_C1_pr_comments_contributor_codequality__variance
  259. LOC_frequency_codequality__absolute_sum_of_changes
  260. lines_added_frequency_codequality__absolute_sum_of_changes
  261. pr_commit_count_codequality__variance
  262. updated_issues_count_activity__sum_values
  263. updated_issues_count_community__sum_values
  264. pr_commit_linked_count_codequality__sum_values
  265. bug_issue_open_time_avg_community__last_location_of_minimum
  266. pr_issue_linked_ratio_codequality__mean
  267. active_C1_issue_create_contributor_activity__minimum
  268. recent_releases_count_activity__minimum
  269. active_C1_pr_create_contributor_activity__maximum
  270. active_C1_pr_create_contributor_codequality__maximum
  271. lines_removed_frequency_codequality__mean
  272. updated_issues_count_activity__minimum
  273. updated_issues_count_community__minimum
  274. LOC_frequency_codequality__mean
  275. lines_added_frequency_codequality__mean
  276. code_review_ratio_codequality__sum_values
  277. lines_removed_frequency_codequality__mean_abs_change
  278. updated_since_activity__last_location_of_minimum
  279. pr_issue_linked_ratio_codequality__absolute_sum_of_changes
  280. code_review_count_community__maximum
  281. pr_commit_linked_count_codequality__maximum
  282. pr_issue_linked_ratio_codequality__mean_abs_change
  283. code_review_count_activity__mean
  284. pr_commit_linked_count_codequality__mean_abs_change
  285. LOC_frequency_codequality__mean_abs_change
  286. updated_issues_count_community__absolute_sum_of_changes
  287. updated_issues_count_activity__absolute_sum_of_changes
  288. contributor_count_without_bot_activity__sum_values
  289. contributor_count_activity__sum_values
  290. lines_added_frequency_codequality__mean_abs_change
  291. pr_open_time_avg_community__absolute_sum_of_changes
  292. active_C1_pr_comments_contributor_codequality__last_location_of_minimum
  293. active_C1_pr_comments_contributor_activity__last_location_of_minimum
  294. pr_issue_linked_ratio_codequality__first_location_of_maximum
  295. pr_issue_linked_ratio_codequality__last_location_of_maximum
  296. comment_frequency_activity__minimum
  297. comment_frequency_community__minimum
  298. pr_commit_count_codequality__last_location_of_minimum
  299. active_C1_issue_comments_contributor_activity__minimum
  300. pr_commit_linked_count_codequality__mean
  301. pr_open_time_avg_community__mean
  302. updated_since_activity__skewness
  303. contributor_count_codequality__mean_abs_change
  304. contributor_count_without_bot_codequality__mean_abs_change
  305. pr_open_time_mid_community__sum_values
  306. active_C1_issue_create_contributor_activity__sum_values
  307. pr_issue_linked_ratio_codequality__last_location_of_minimum
  308. lines_removed_frequency_codequality__maximum
  309. closed_issues_count_activity__absolute_sum_of_changes
  310. code_review_count_community__mean_abs_change
  311. pr_issue_linked_ratio_codequality__variance
  312. commit_frequency_percentage_group_activity__skewness
  313. active_C1_pr_create_contributor_codequality__mean_abs_change
  314. active_C1_pr_create_contributor_activity__mean_abs_change
  315. updated_since_activity__absolute_sum_of_changes
  316. pr_commit_linked_count_codequality__variance
  317. pr_issue_linked_ratio_codequality__maximum
  318. closed_issues_count_activity__mean
  319. LOC_frequency_codequality__maximum
  320. pr_first_response_time_avg_community__first_location_of_minimum
  321. issue_open_time_avg_community__variance
  322. active_C1_issue_create_contributor_activity__absolute_sum_of_changes
  323. code_review_count_activity__first_location_of_minimum
  324. code_review_count_community__first_location_of_minimum
  325. active_C2_contributor_count_activity__variance
  326. active_C2_contributor_count_codequality__variance
  327. code_quality_guarantee_codequality__variance
  328. pr_issue_linked_ratio_codequality__first_location_of_minimum
  329. lines_added_frequency_codequality__maximum
  330. lines_removed_frequency_codequality__variance
  331. code_merge_ratio_codequality__last_location_of_minimum
  332. active_C1_issue_comments_contributor_activity__sum_values
  333. commit_frequency_percentage_group_activity__absolute_sum_of_changes
  334. issue_open_time_mid_community__variance
  335. code_review_ratio_codequality__first_location_of_minimum
  336. bug_issue_open_time_avg_community__minimum
  337. updated_issues_count_activity__mean
  338. updated_issues_count_community__mean
  339. code_review_ratio_codequality__mean
  340. updated_since_activity__minimum
  341. LOC_frequency_codequality__variance
  342. issue_open_time_avg_community__mean_abs_change
  343. activity_score_activity__mean_change
  344. bug_issue_open_time_mid_community__minimum
  345. issue_open_time_avg_community__maximum
  346. is_maintained_codequalitylarge_standard_deviationr_0.1
  347. contributor_count_without_bot_activity__absolute_sum_of_changes
  348. updated_issues_count_activity__last_location_of_minimum
  349. updated_issues_count_community__last_location_of_minimum
  350. contributor_count_activity__absolute_sum_of_changes
  351. issue_first_reponse_mid_community__mean_abs_change
  352. lines_added_frequency_codequality__variance
  353. code_quality_guarantee_codequality__mean_abs_change
  354. issue_open_time_mid_community__mean_abs_change
  355. is_maintained_codequalitylarge_standard_deviationr_0.05
  356. issue_open_time_mid_community__maximum
  357. pr_commit_linked_count_codequality__minimum
  358. recent_releases_count_activity__sum_values
  359. pr_first_response_time_mid_community__first_location_of_minimum
  360. issue_first_reponse_mid_community__variance
  361. community_support_score_community__mean_abs_change
  362. is_maintained_codequality__last_location_of_maximum
  363. pr_merged_count_codequality__last_location_of_minimum
  364. updated_issues_count_activity__maximum
  365. updated_issues_count_community__maximum
  366. recent_releases_count_activity__absolute_sum_of_changes
  367. issue_first_reponse_avg_community__mean_abs_change
  368. contributor_count_activity__mean
  369. contributor_count_without_bot_activity__mean
  370. active_C1_issue_comments_contributor_activity__absolute_sum_of_changes
  371. code_review_count_activity__absolute_sum_of_changes
  372. active_C1_issue_create_contributor_activity__mean
  373. issue_first_reponse_mid_community__maximum
  374. closed_issues_count_activity__maximum
  375. code_review_count_activitylarge_standard_deviationr_0.05
  376. code_review_count_activitylarge_standard_deviationr_0.1
  377. code_review_count_communitylarge_standard_deviationr_0.05
  378. code_review_count_communitylarge_standard_deviationr_0.1
  379. recent_releases_count_activity__mean
  380. closed_issues_count_activity__mean_abs_change
  381. contributor_count_codequality__variance
  382. contributor_count_without_bot_codequality__variance
  383. code_review_ratio_codequalitylarge_standard_deviationr_0.05
  384. code_review_ratio_codequalitylarge_standard_deviationr_0.1
  385. pr_first_response_time_avg_communitylarge_standard_deviationr_0.1
  386. commit_frequency_org_percentage_group_activity__absolute_sum_of_changes
  387. pr_first_response_time_mid_communitylarge_standard_deviationr_0.05
  388. pr_first_response_time_avg_communitylarge_standard_deviationr_0.05
  389. issue_first_reponse_avg_community__last_location_of_maximum
  390. pr_first_response_time_mid_communitylarge_standard_deviationr_0.1
  391. code_merge_ratio_codequality__first_location_of_maximum
  392. community_support_score_community__variance
  393. code_merge_ratio_codequality__variance
  394. contributor_count_without_bot_group_activity__absolute_sum_of_changes
  395. recent_releases_count_activity__maximum
  396. issue_first_reponse_mid_community__mean
  397. commit_frequency_org_percentage_group_activity__sum_values
  398. contributor_count_group_activity__absolute_sum_of_changes
  399. updated_issues_count_activity__mean_abs_change
  400. updated_issues_count_community__mean_abs_change
  401. active_C1_pr_create_contributor_activity__variance
  402. active_C1_pr_create_contributor_codequality__variance
  403. closed_prs_count_community__last_location_of_minimum
  404. org_count_group_activity__sum_values
  405. org_count_activity__sum_values
  406. contributor_count_group_activity__sum_values
  407. contributor_count_without_bot_group_activity__sum_values
  408. organizations_activity_group_activity__sum_values
  409. comment_frequency_community__sum_values
  410. comment_frequency_activity__sum_values
  411. contributor_count_group_activity__maximum
  412. contributor_count_without_bot_group_activity__maximum
  413. recent_releases_count_activity__mean_abs_change
  414. contribution_last_group_activity__sum_values
  415. org_count_group_activity__maximum
  416. org_count_activity__maximum
  417. active_C1_issue_create_contributor_activity__maximum
  418. pr_issue_linked_ratio_codequality__skewness
  419. community_support_score_community__mean_change
  420. activity_score_activity__absolute_sum_of_changes
  421. code_review_count_community__first_location_of_maximum
  422. organizations_activity_group_activity__absolute_sum_of_changes
  423. organizations_activity_group_activity__maximum
  424. active_C1_issue_comments_contributor_activity__mean
  425. git_pr_linked_ratio_codequality__last_location_of_maximum
  426. contributor_count_without_bot_activity__maximum
  427. contributor_count_activity__maximum
  428. issue_first_reponse_avg_community__variance
  429. bug_issue_open_time_avg_community__sum_values
  430. contribution_last_group_activity__absolute_sum_of_changes
  431. org_count_group_activity__absolute_sum_of_changes
  432. commit_frequency_org_percentage_group_activity__first_location_of_maximum
  433. org_count_activity__absolute_sum_of_changes
  434. contribution_last_group_activity__maximum
  435. contributor_count_without_bot_group_activity__mean_abs_change
  436. org_count_group_activity__mean
  437. org_count_activity__mean
  438. contributor_count_group_activity__mean
  439. code_review_count_activity__maximum
  440. contributor_count_without_bot_group_activity__mean
  441. contributor_count_group_activity__mean_abs_change
  442. issue_first_reponse_mid_community__absolute_sum_of_changes
  443. commit_frequency_group_activity__absolute_sum_of_changes
  444. organizations_activity_group_activity__mean
  445. commit_frequency_without_bot_group_activity__absolute_sum_of_changes
  446. commit_frequency_group_activity__sum_values
  447. commit_frequency_without_bot_group_activity__sum_values
  448. lines_removed_frequency_codequality__last_location_of_minimum
  449. active_C1_pr_comments_contributor_codequality__first_location_of_minimum
  450. active_C1_pr_comments_contributor_activity__first_location_of_minimum
  451. contributor_count_group_activity__variance
  452. contributor_count_without_bot_group_activity__variance
  453. contribution_last_group_activity__mean
  454. pr_open_time_avg_community__minimum
  455. lines_added_frequency_codequality__last_location_of_minimum
  456. issue_first_reponse_avg_community__maximum
  457. code_review_count_activity__mean_abs_change
  458. commit_frequency_group_activity__maximum
  459. commit_frequency_without_bot_group_activity__maximum
  460. active_C1_issue_create_contributor_activity__mean_abs_change
  461. issue_first_reponse_avg_community__first_location_of_minimum
  462. updated_issues_count_activity__mean_change
  463. updated_issues_count_community__mean_change
  464. commit_frequency_group_activity__mean
  465. commit_frequency_group_activity__mean_abs_change
  466. commit_frequency_without_bot_group_activity__mean
  467. commit_frequency_without_bot_group_activity__mean_abs_change
  468. code_merge_ratio_codequality__last_location_of_maximum
  469. pr_commit_linked_count_codequality__last_location_of_minimum
  470. updated_issues_count_activity__variance
  471. updated_issues_count_community__variance
  472. commit_frequency_percentage_group_activity__mean_abs_change
  473. contribution_last_group_activity__mean_abs_change
  474. commit_frequency_without_bot_group_activity__last_location_of_minimum
  475. organizations_activity_group_activity__last_location_of_minimum
  476. closed_issues_count_activity__last_location_of_minimum
  477. commit_frequency_group_activity__last_location_of_minimum
  478. commit_frequency_group_activity__variance
  479. commit_frequency_without_bot_group_activity__variance
  480. LOC_frequency_codequality__last_location_of_minimum
  481. pr_first_response_time_mid_community__first_location_of_maximum
  482. active_C1_issue_comments_contributor_activity__maximum
  483. code_review_count_community__variance
  484. issue_open_time_mid_community__mean
  485. org_count_group_activity__mean_abs_change
  486. org_count_activity__mean_abs_change
  487. organizations_activity_group_activity__mean_abs_change
  488. contribution_last_group_activity__variance
  489. recent_releases_count_activity__last_location_of_minimum
  490. contributor_count_codequality__last_location_of_minimum
  491. is_maintained_codequality__variance
  492. pr_first_response_time_avg_community__last_location_of_maximum
  493. contributor_count_without_bot_codequality__last_location_of_minimum
  494. org_count_group_activity__variance
  495. org_count_activity__variance
  496. contributor_count_group_activity__last_location_of_minimum
  497. commit_frequency_without_bot_group_activitylarge_standard_deviationr_0.1
  498. commit_frequency_group_activitylarge_standard_deviationr_0.1
  499. pr_count_codequality__last_location_of_minimum
  500. commit_frequency_group_activitylarge_standard_deviationr_0.05
  501. commit_frequency_without_bot_group_activitylarge_standard_deviationr_0.05
  502. contributor_count_without_bot_group_activitylarge_standard_deviationr_0.1
  503. contributor_count_group_activitylarge_standard_deviationr_0.1
  504. git_pr_linked_ratio_codequality__maximum
  505. contributor_count_without_bot_group_activity__last_location_of_minimum
  506. issue_first_reponse_mid_community__sum_values
  507. contribution_last_group_activity__minimum
  508. commit_frequency_org_percentage_group_activity__minimum
  509. organizations_activity_group_activitylarge_standard_deviationr_0.1
  510. contribution_last_group_activity__last_location_of_minimum
  511. recent_releases_count_activity__variance
  512. commit_frequency_without_bot_group_activity__first_location_of_minimum
  513. contributor_count_group_activitylarge_standard_deviationr_0.05
  514. contributor_count_without_bot_group_activitylarge_standard_deviationr_0.05
  515. bug_issue_open_time_avg_community__mean
  516. is_maintained_codequality__mean_abs_change
  517. contribution_last_group_activitylarge_standard_deviationr_0.1
  518. contribution_last_group_activitylarge_standard_deviationr_0.05
  519. organizations_activity_group_activitylarge_standard_deviationr_0.05
  520. commit_frequency_group_activity__first_location_of_minimum
  521. code_merge_ratio_codequality__maximum
  522. contributor_org_count_group_activitylarge_standard_deviationr_0.05
  523. organizations_activity_group_activity__minimum
  524. contributor_count_group_activity__minimum
  525. closed_issues_count_activity__variance
  526. org_count_activity__minimum
  527. org_count_group_activity__minimum
  528. commit_frequency_group_activity__minimum
  529. commit_frequency_percentage_group_activity__first_location_of_maximum
  530. git_pr_linked_ratio_codequality__minimum
  531. contributor_org_count_group_activitylarge_standard_deviationr_0.1
  532. issue_first_reponse_avg_community__mean
  533. organizations_activity_group_activity__first_location_of_minimum
  534. is_maintained_codequality__skewness
  535. contributor_count_without_bot_activity__mean_abs_change
  536. contributor_count_bot_activity__sum_values
  537. org_count_activitylarge_standard_deviationr_0.1
  538. org_count_group_activitylarge_standard_deviationr_0.1
  539. contributor_count_activity__mean_abs_change
  540. pr_merged_count_codequality__first_location_of_maximum
  541. community_support_score_community__last_location_of_minimum
  542. git_pr_linked_ratio_codequality__variance
  543. issue_first_reponse_avg_community__absolute_sum_of_changes
  544. contribution_last_group_activity__first_location_of_minimum
  545. bug_issue_open_time_mid_community__sum_values
  546. comment_frequency_activity__variance
  547. comment_frequency_community__variance
  548. code_review_count_activity__variance
  549. commit_frequency_without_bot_group_activity__last_location_of_maximum
  550. commit_frequency_group_activity__last_location_of_maximum
  551. contributor_count_without_bot_group_activity__minimum
  552. contributor_org_count_group_activity__last_location_of_minimum
  553. bug_issue_open_time_avg_community__first_location_of_minimum
  554. pr_first_response_time_avg_community__sum_values
  555. org_count_activitylarge_standard_deviationr_0.05
  556. org_count_group_activitylarge_standard_deviationr_0.05
  557. commit_frequency_without_bot_group_activity__minimum
  558. commit_frequency_percentage_group_activity__sum_values
  559. commit_frequency_percentage_group_activity__minimum
  560. contributor_count_without_bot_group_activity__first_location_of_minimum
  561. code_review_ratio_codequality__last_location_of_maximum
  562. org_count_group_activity__last_location_of_minimum
  563. org_count_activity__last_location_of_minimum
  564. code_review_ratio_codequality__first_location_of_maximum
  565. contributor_count_group_activity__first_location_of_minimum
  566. contributor_count_bot_activity__mean
  567. commit_frequency_org_group_activity__first_location_of_maximum
  568. active_C1_issue_comments_contributor_activity__mean_abs_change
  569. recent_releases_count_activity__first_location_of_maximum
  570. code_merge_ratio_codequality__sum_values
  571. contributor_count_without_bot_group_activity__first_location_of_maximum
  572. contributor_count_group_activity__first_location_of_maximum
  573. active_C1_pr_comments_contributor_activitylarge_standard_deviationr_0.05
  574. active_C1_pr_comments_contributor_codequalitylarge_standard_deviationr_0.05
  575. active_C1_pr_comments_contributor_activitylarge_standard_deviationr_0.1
  576. active_C1_pr_comments_contributor_codequalitylarge_standard_deviationr_0.1
  577. commit_frequency_percentage_group_activitylarge_standard_deviationr_0.05
  578. active_C1_pr_create_contributor_codequality__last_location_of_minimum
  579. active_C1_pr_create_contributor_activity__last_location_of_minimum
  580. bug_issue_open_time_mid_communitylarge_standard_deviationr_0.05
  581. bug_issue_open_time_mid_communitylarge_standard_deviationr_0.1
  582. bug_issue_open_time_avg_communitylarge_standard_deviationr_0.05
  583. bug_issue_open_time_avg_communitylarge_standard_deviationr_0.1
  584. activity_score_activity__last_location_of_minimum
  585. contributor_count_bot_activity__maximum
  586. bug_issue_open_time_mid_community__first_location_of_maximum
  587. contributor_org_count_group_activity__first_location_of_maximum
  588. organizations_activity_group_activity__first_location_of_maximum
  589. commit_frequency_without_bot_group_activity__first_location_of_maximum
  590. code_review_count_activity__first_location_of_maximum
  591. commit_frequency_group_activity__first_location_of_maximum
  592. bug_issue_open_time_mid_community__first_location_of_minimum
  593. commit_frequency_org_percentage_group_activitylarge_standard_deviationr_0.1
  594. commit_frequency_org_percentage_group_activitylarge_standard_deviationr_0.05
  595. pr_open_time_mid_community__minimum
  596. organizations_activity_group_activity**last_location_of_maximum

A.2 Feature Set 2 (Comprising 134 features, named as: Index Name__Feature Function)

  1. commit_frequency_activity__minimum
  2. commit_frequency_codequality__minimum
  3. commit_frequency_without_bot_activity__minimum
  4. commit_frequency_without_bot_codequality__minimum
  5. pr_commit_count_codequality__minimum
  6. is_maintained_codequality__minimum
  7. lines_added_frequency_codequality__minimum
  8. LOC_frequency_codequality__minimum
  9. updated_since_activity__variance
  10. updated_since_activity__mean
  11. active_C2_contributor_count_codequality__minimum
  12. active_C2_contributor_count_activity__minimum
  13. is_maintained_codequality__sum_values
  14. lines_removed_frequency_codequality__minimum
  15. updated_since_activity__maximum
  16. code_quality_guarantee_codequality__minimum
  17. is_maintained_codequality__mean
  18. community_support_score_community__sum_values
  19. issue_open_time_avg_community__length
  20. pr_issue_linked_ratio_codequality__length
  21. commit_frequency_group_activity__length
  22. commit_frequency_without_bot_activity__length
  23. issue_first_reponse_mid_community__length
  24. active_C1_issue_create_contributor_activity__length
  25. community_support_score_community__length
  26. bug_issue_open_time_mid_community__length
  27. issue_open_time_mid_community__length
  28. pr_open_time_mid_community__length
  29. contribution_last_group_activity__length
  30. code_review_ratio_codequality__length
  31. contributor_count_bot_codequality__length
  32. bug_issue_open_time_avg_community__length
  33. lines_removed_frequency_codequality__length
  34. pr_first_response_time_avg_community__length
  35. commit_frequency_org_group_activity__length
  36. commit_frequency_without_bot_codequality__length
  37. commit_frequency_bot_activity__length
  38. code_review_count_community__length
  39. activity_score_activity__length
  40. code_merge_ratio_codequality__length
  41. contributor_count_without_bot_codequality__length
  42. code_review_count_activity__length
  43. commit_frequency_percentage_group_activity__length
  44. org_count_activity__length
  45. closed_issues_count_activity__length
  46. pr_count_codequality__length
  47. updated_since_activity__length
  48. contributor_count_without_bot_activity__length
  49. pr_first_response_time_mid_community__length
  50. org_count_group_activity__length
  51. comment_frequency_community__length
  52. recent_releases_count_activity__length
  53. active_C1_pr_create_contributor_codequality__length
  54. contributor_count_activity__length
  55. active_C2_contributor_count_activity__length
  56. updated_issues_count_activity__length
  57. updated_issues_count_community__length
  58. is_maintained_codequality__length
  59. active_C1_pr_comments_contributor_activity__length
  60. comment_frequency_activity__length
  61. active_C1_pr_create_contributor_activity__length
  62. code_quality_guarantee_codequality__length
  63. contributor_count_codequality__length
  64. commit_frequency_inside_codequality__length
  65. contributor_count_group_activity__length
  66. contributor_count_without_bot_group_activity__length
  67. contributor_count_bot_group_activity__length
  68. commit_frequency_codequality__length
  69. git_pr_linked_ratio_codequality__length
  70. commit_frequency_inside_without_bot_codequality__length
  71. commit_frequency_inside_bot_codequality__length
  72. commit_frequency_bot_codequality__length
  73. active_C1_pr_comments_contributor_codequality__length
  74. commit_frequency_bot_group_activity__length
  75. contributor_count_bot_activity__length
  76. pr_commit_count_codequality__length
  77. lines_added_frequency_codequality__length
  78. pr_open_time_avg_community__length
  79. commit_frequency_without_bot_group_activity__length
  80. contributor_org_count_group_activity__length
  81. issue_first_reponse_avg_community__length
  82. LOC_frequency_codequality__length
  83. active_C2_contributor_count_codequality__length
  84. commit_frequency_activity__length
  85. organizations_activity_group_activity__length
  86. active_C1_issue_comments_contributor_activity__length
  87. closed_prs_count_community__length
  88. pr_merged_count_codequality__length
  89. pr_commit_linked_count_codequality__length
  90. commit_frequency_org_percentage_group_activity__length
  91. community_support_score_community__minimum
  92. updated_since_activity__mean_abs_change
  93. activity_score_activity__sum_values
  94. closed_prs_count_community__minimum
  95. code_quality_guarantee_codequality__sum_values
  96. commit_frequency_org_percentage_group_activity__maximum
  97. activity_score_activity__minimum
  98. pr_merged_count_codequality__minimum
  99. commit_frequency_codequality__sum_values
  100. commit_frequency_activity__sum_values
  101. contributor_count_codequality__minimum
  102. commit_frequency_without_bot_codequality__sum_values
  103. commit_frequency_without_bot_activity__sum_values
  104. contributor_count_without_bot_codequality__minimum
  105. community_support_score_community__mean
  106. pr_commit_count_codequality__sum_values
  107. active_C2_contributor_count_codequality__sum_values
  108. active_C2_contributor_count_activity__sum_values
  109. closed_prs_count_community__sum_values
  110. pr_count_codequality__minimum
  111. commit_frequency_org_percentage_group_activity__mean
  112. commit_frequency_codequality__mean
  113. commit_frequency_activity__mean
  114. commit_frequency_without_bot_codequality__mean
  115. commit_frequency_without_bot_activity__mean
  116. activity_score_activity__mean
  117. contributor_org_count_group_activity__sum_values
  118. updated_since_activity__sum_values
  119. pr_count_codequality__sum_values
  120. pr_commit_count_codequality__mean
  121. commit_frequency_org_group_activity__absolute_sum_of_changes
  122. commit_frequency_codequality__absolute_sum_of_changes
  123. commit_frequency_activity__absolute_sum_of_changes
  124. commit_frequency_without_bot_activity__absolute_sum_of_changes
  125. commit_frequency_without_bot_codequality__absolute_sum_of_changes
  126. commit_frequency_org_group_activity__sum_values
  127. closed_prs_count_community__absolute_sum_of_changes
  128. pr_count_codequality__absolute_sum_of_changes
  129. closed_prs_count_community__mean
  130. code_quality_guarantee_codequality__mean
  131. active_C2_contributor_count_codequality__mean
  132. active_C2_contributor_count_activity__mean
  133. pr_merged_count_codequality__sum_values
  134. pr_commit_count_codequality__absolute_sum_of_changes

Appendix B: Indices for Feature Extraction

Four aspects encompassing 72 indices were involved in the computation of features:

  1. contributor_count_activity,
  2. contributor_count_bot_activity,
  3. contributor_count_without_bot_activity,
  4. active_C2_contributor_count_activity,
  5. active_C1_pr_create_contributor_activity,
  6. active_C1_pr_comments_contributor_activity,
  7. active_C1_issue_create_contributor_activity,
  8. active_C1_issue_comments_contributor_activity,
  9. commit_frequency_activity,
  10. commit_frequency_bot_activity,
  11. commit_frequency_without_bot_activity,
  12. org_count_activity,
  13. comment_frequency_activity,
  14. code_review_count_activity,
  15. updated_since_activity,
  16. closed_issues_count_activity,
  17. updated_issues_count_activity,
  18. recent_releases_count_activity,
  19. activity_score_activity,
  1. contributor_count_codequality,
  2. contributor_count_bot_codequality,
  3. contributor_count_without_bot_codequality,
  4. active_C2_contributor_count_codequality,
  5. active_C1_pr_create_contributor_codequality,
  6. active_C1_pr_comments_contributor_codequality,
  7. commit_frequency_codequality,
  8. commit_frequency_bot_codequality,
  9. commit_frequency_without_bot_codequality,
  10. commit_frequency_inside_codequality,
  11. commit_frequency_inside_bot_codequality,
  12. commit_frequency_inside_without_bot_codequality,
  13. is_maintained_codequality,
  14. LOC_frequency_codequality,
  15. lines_added_frequency_codequality,
  16. lines_removed_frequency_codequality,
  17. pr_issue_linked_ratio_codequality,
  18. code_review_ratio_codequality,
  19. code_merge_ratio_codequality,
  20. pr_count_codequality,
  21. pr_merged_count_codequality,
  22. pr_commit_count_codequality,
  23. pr_commit_linked_count_codequality,
  24. git_pr_linked_ratio_codequality,
  25. code_quality_guarantee_codequality,
  1. issue_first_reponse_avg_community,
  2. issue_first_reponse_mid_community,
  3. issue_open_time_avg_community,
  4. issue_open_time_mid_community,
  5. bug_issue_open_time_avg_community,
  6. bug_issue_open_time_mid_community,
  7. pr_open_time_avg_community,
  8. pr_open_time_mid_community,
  9. pr_first_response_time_avg_community,
  10. pr_first_response_time_mid_community,
  11. comment_frequency_community,
  12. code_review_count_community,
  13. updated_issues_count_community,
  14. closed_prs_count_community,
  15. community_support_score_community,
  1. contributor_count_group_activity,
  2. contributor_count_bot_group_activity,
  3. contributor_count_without_bot_group_activity,
  4. contributor_org_count_group_activity,
  5. commit_frequency_group_activity,
  6. commit_frequency_bot_group_activity,
  7. commit_frequency_without_bot_group_activity,
  8. commit_frequency_org_group_activity,
  9. commit_frequency_org_percentage_group_activity,
  10. commit_frequency_percentage_group_activity,
  11. org_count_group_activity,
  12. contribution_last_group_activity,
  13. organizations_activity_group_activity

Appendix C: 16 Feature Functions

  1. length (Length): This function calculates the length of the dataset, i.e., the number of data points or samples contained.

  2. large_standard_deviation (Large Standard Deviation): Detects the proportion of values in the data that are greater than a given standard deviation. Different threshold values (r values) can be set to determine what constitutes a large standard deviation.

  3. mean (Mean): Calculates the average of all data points in the dataset.

  4. maximum (Maximum): Identifies the maximum value in the dataset.

  5. minimum (Minimum): Identifies the minimum value in the dataset.

  6. sum_values (Sum of Values): Computes the total sum of all data points in the dataset.

  7. variance (Variance): Measures the variance of the dataset, indicating the degree of dispersion of the data points from the mean.

  8. skewness (Skewness): Skewness measures the degree of asymmetry of data distribution, indicating whether the data is left-skewed (negative skewness) or right-skewed (positive skewness).

  9. kurtosis (Kurtosis): Kurtosis assesses the peakedness or flatness of the data distribution, reflecting the shape of the distribution.

  10. absolute_sum_of_changes (Absolute Sum of Changes): Calculates the sum of absolute changes between adjacent data points.

  11. mean_abs_change (Mean Absolute Change): Computes the average absolute change between adjacent data points in the dataset.

  12. mean_change (Mean Change): Calculates the average change between adjacent data points in the dataset.

  13. first_location_of_maximum (First Location of Maximum): Identifies the first occurrence of the maximum value in the dataset.

  14. first_location_of_minimum (First Location of Minimum): Identifies the first occurrence of the minimum value in the dataset.

  15. last_location_of_maximum (Last Location of Maximum): Determines the last occurrence of the maximum value in the dataset.

  16. last_location_of_minimum (Last Location of Minimum): Determines the last occurrence of the minimum value in the dataset.

Copyright © 2023 OSS compass. All Rights Reserved.