Dataset collection for automatic generation of commit messages

Ivan A. Kosyanenko; Roman G. Bolbakov

doi:10.32362/2500-316X-2025-13-2-7-17

EDN: OQUHWL

Abstract

Objectives. In contemporary software development, version control systems are widely used to manage the development process. Such systems allow developers to track changes in the codebase and to convey the context of these changes through commit messages. Writing relevant, high-quality descriptions of changes generally demands considerable competence and time from the developer; modern machine learning methods, however, can automate this task. The work therefore sets out to collect a dataset of code changes paired with their natural-language descriptions and to carry out a statistical and comparative analysis of it.

Methods. The study used a comprehensive approach that included data collection from popular GitHub repositories, preliminary data processing and filtering, statistical analysis, and a natural language processing method (text vectorization). Cosine similarity was used to assess the semantic proximity between the first sentence and the full text of each commit message.
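The similarity measure named in Methods can be illustrated with a minimal sketch. The abstract does not specify the vectorizer or the rule for extracting the first sentence, so the bag-of-words term counts, the regex tokenizer, the sentence-splitting rule, and all function names below are assumptions made for illustration, not the authors' pipeline.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def first_sentence(message):
    """Return the text up to the first sentence-ending mark or line break (assumed rule)."""
    parts = re.split(r"(?<=[.!?])\s+|\n", message.strip(), maxsplit=1)
    return parts[0]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def first_sentence_similarity(message):
    """Similarity between a commit message's first sentence and its full text."""
    head = Counter(tokenize(first_sentence(message)))
    full = Counter(tokenize(message))
    return cosine_similarity(head, full)
```

Under these assumptions a one-sentence message scores approximately 1.0, and the score falls as the body introduces vocabulary that the first sentence does not share. A weighted scheme such as TF-IDF would change the absolute values but not the overall procedure.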

Results. A comprehensive study of the structure and quality of commit messages encompassed data collection from GitHub repositories and preliminary data cleansing. The research involved text vectorization of commit messages and evaluation of semantic similarity between the first sentences and full texts of messages using cosine similarity. The comparative analysis of message quality in the collected dataset and several analogous datasets used classification based on the CodeBERT model.

Conclusions. The analysis revealed a low level of cosine similarity (0.0969) between the first sentences and full texts of commit messages, indicating a weak semantic relationship between them and refuting the hypothesis that first sentences serve as summaries of message content. The proportion of empty messages in the collected dataset, 0.0007%, was significantly lower than expected, indicating high-quality data collection. The classification analysis showed that the proportion of messages categorized as “poor” in the collected dataset was 16.82%, substantially lower than in other datasets, where it ranged from 34.75% to 54.26%. This underscores the high quality of the collected dataset and its suitability for further application in automatic commit message generation systems.
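The percentages reported here are shares over a set of messages. The following sketch shows how such shares might be computed. It uses toy data only; the `share` helper, the definition of an empty message, and the label strings are hypothetical and not taken from the paper.

```python
def share(items, predicate):
    """Percentage of items for which predicate is true; 0.0 for an empty input."""
    items = list(items)
    if not items:
        return 0.0
    return 100.0 * sum(1 for x in items if predicate(x)) / len(items)


def is_empty(message):
    """Treat None and whitespace-only messages as empty (assumed definition)."""
    return message is None or not message.strip()


# Toy data for illustration only; not taken from the collected dataset.
messages = ["Fix null check in parser", "", "update", "   ", "Add retry logic"]
labels = ["good", "poor", "poor", "poor", "good"]  # hypothetical classifier output

empty_pct = share(messages, is_empty)
poor_pct = share(labels, lambda label: label == "poor")

# Filtering step: keep only messages that carry some text.
non_empty = [m for m in messages if not is_empty(m)]
```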

Keywords

commit message generation, version control systems, description of changes in software code, cosine similarity, data filtering, text vectorization, dataset, machine learning

About the Authors

Ivan A. Kosyanenko

MIREA – Russian Technological University
Russian Federation

Ivan A. Kosyanenko, Postgraduate Student, Department of Instrumental and Applied Software, Institute of Information Technologies

78, Vernadskogo pr., Moscow, 119454

Competing Interests:

The authors declare no conflicts of interest.

Roman G. Bolbakov

MIREA – Russian Technological University
Russian Federation

Roman G. Bolbakov, Cand. Sci. (Eng.), Associate Professor, Head of the Department of Instrumental and Applied Software, Institute of Information Technologies

78, Vernadskogo pr., Moscow, 119454

Scopus Author ID 57202836952

Supplementary files

	1. Distribution of the number of tokens in the commit messages
	Type	Research tools

For citations:

Kosyanenko I.A., Bolbakov R.G. Dataset collection for automatic generation of commit messages. Russian Technological Journal. 2025;13(2):7-17. https://doi.org/10.32362/2500-316X-2025-13-2-7-17. EDN: OQUHWL

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2782-3210 (Print)
ISSN 2500-316X (Online)

	Title	Distribution of the number of tokens in the commit messages
	Type	Research tools
	Date	2025-04-09
