Dataset collection for automatic generation of commit messages
https://doi.org/10.32362/2500-316X-2025-13-2-7-17
EDN: OQUHWL
Abstract
Objectives. In contemporary software development, version control systems are widely used to manage the development process. They allow developers to track changes to the codebase and to convey the context of those changes through commit messages. Writing relevant, high-quality commit messages generally demands considerable competence and time from the developer. However, modern machine learning methods can automate this task. The work therefore sets out to provide a statistical and comparative analysis of a collected data sample comprising sets of program code changes and their natural-language descriptions.
Methods. The study used a comprehensive approach that included data collection from popular GitHub repositories, preliminary data processing and filtering, statistical analysis, and a natural language processing method (text vectorization). Cosine similarity was used to assess the semantic proximity between the first sentence and the full text of each commit message.
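The similarity measurement named in the Methods can be illustrated with a minimal sketch. It assumes plain term-count vectors and a simple first-line/first-sentence heuristic; the abstract does not state which vectorizer or sentence splitter the authors used, so the function names and tokenization rules below are hypothetical, and the values this toy version produces are not comparable to the figure reported in the Conclusions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def first_sentence(message):
    """Return the first sentence of a commit message (heuristic, an assumption)."""
    stripped = message.strip()
    if not stripped:
        return ""
    first_line = stripped.splitlines()[0]
    # Cut at the first sentence-ending mark that is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", first_line, maxsplit=1)[0]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * vec_b[token] for token, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction, so report no similarity
    return dot / (norm_a * norm_b)


def first_sentence_similarity(message):
    """Compare the first sentence of a message with its full text."""
    head = Counter(tokenize(first_sentence(message)))
    full = Counter(tokenize(message))
    return cosine_similarity(head, full)


if __name__ == "__main__":
    example = "Fix parser crash\n\nHandle empty input in the tokenizer."
    print(round(first_sentence_similarity(example), 4))
```

Averaging `first_sentence_similarity` over all messages in a dataset would give a single corpus-level score of the kind the abstract reports. A weighted scheme such as TF-IDF could replace the raw counts without changing the cosine step.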
Results. A comprehensive study of the structure and quality of commit messages encompassed data collection from GitHub repositories and preliminary data cleansing. The research involved text vectorization of commit messages and evaluation of semantic similarity between the first sentences and full texts of messages using cosine similarity. The comparative analysis of message quality in the collected dataset and several analogous datasets used classification based on the CodeBERT model.
Conclusions. The analysis revealed a low cosine similarity (0.0969) between the first sentences and the full texts of commit messages, indicating a weak semantic relationship between them and refuting the hypothesis that first sentences serve as summaries of message content. The proportion of empty messages in the collected dataset, 0.0007%, was significantly lower than expected, indicating high-quality data collection. Classification analysis showed that 16.82% of messages in the collected dataset were categorized as “poor”, substantially lower than in the other datasets, where the figure ranged from 34.75% to 54.26%. This underscores the high quality of the collected dataset and its suitability for use in automatic commit message generation systems.
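The two dataset-level percentages quoted above are simple shares, which the following sketch computes. It assumes one classifier label per message; the CodeBERT-based classifier itself is not reproduced here, and the function names, the dictionary keys, and every label other than "poor" are hypothetical.

```python
from collections import Counter


def share_percent(count, total):
    """Return count as a percentage of total (0.0 for an empty dataset)."""
    return 100.0 * count / total if total else 0.0


def dataset_quality_summary(messages, labels):
    """Compute the share of empty messages and of messages labeled "poor".

    `labels` is assumed to come from an external quality classifier,
    one label per message.
    """
    if len(messages) != len(labels):
        raise ValueError("messages and labels must have the same length")
    total = len(messages)
    # A message counts as empty if it contains only whitespace.
    empty = sum(1 for message in messages if not message.strip())
    poor = Counter(labels)["poor"]
    return {
        "empty_pct": share_percent(empty, total),
        "poor_pct": share_percent(poor, total),
    }


if __name__ == "__main__":
    toy_messages = ["Fix bug", "", "wip", "Add tests for parser"]
    toy_labels = ["good", "poor", "poor", "good"]
    print(dataset_quality_summary(toy_messages, toy_labels))
```

Running the same summary over several datasets gives directly comparable "poor" shares, which is the form of comparison the Conclusions describe.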
About the Authors
Ivan A. Kosyanenko
Russian Federation
Ivan A. Kosyanenko, Postgraduate Student, Department of Instrumental and Applied Software, Institute of Information Technologies
78, Vernadskogo pr., Moscow, 119454
Competing Interests:
The authors declare no conflicts of interest.
Roman G. Bolbakov
Russian Federation
Roman G. Bolbakov, Cand. Sci. (Eng.), Associate Professor, Head of the Department of Instrumental and Applied Software, Institute of Information Technologies
78, Vernadskogo pr., Moscow, 119454
Scopus Author ID 57202836952
Supplementary files
1. Distribution of the number of tokens in the commit messages (Type: Research instruments; 62KB)
Review
For citations:
Kosyanenko I.A., Bolbakov R.G. Dataset collection for automatic generation of commit messages. Russian Technological Journal. 2025;13(2):7–17. https://doi.org/10.32362/2500-316X-2025-13-2-7-17. EDN: OQUHWL