New article in Big Data and Cognitive Computing

July 8, 2025

Our colleagues Philip König, Sebastian Raubitzek, Dennis Toth, Fabian Obermann, and Kevin Mallinger published a new paper, "Boost-Classifier-Driven Fault Prediction Across Heterogeneous Open-Source Repositories", in the journal Big Data and Cognitive Computing.

In this paper, they analyzed over 2.4 million commits from 33 open-source projects and built a CatBoost-based model to predict which code changes are most likely to introduce bugs. By combining process, size, and entropy metrics, their approach not only achieved strong accuracy but also revealed why certain files are more fault-prone. These insights can help developers prioritize testing and strengthen software reliability across diverse projects.

Abstract

Ensuring reliability, availability, and security in modern software systems hinges on early fault detection, yet predicting which parts of a codebase are most at risk remains a significant challenge. In this paper, we analyze 2.4 million commits drawn from 33 heterogeneous open-source projects, spanning healthcare, security tools, data processing, and more. By examining each repository per file and per commit, we derive process metrics (e.g., churn, file age, revision frequency) alongside size metrics and entropy-based indicators of how scattered changes are over time. We train and tune a gradient boosting model to classify bug-prone commits under realistic class-imbalance conditions, achieving robust predictive performance across diverse repositories. Moreover, a comprehensive feature-importance analysis shows that files with long lifespans (high age), frequent edits (revision count), and widely scattered changes (entropy metrics) are especially vulnerable to defects. These insights can help practitioners and researchers prioritize testing and tailor maintenance strategies, ultimately strengthening software dependability.
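To give a feel for the kind of per-file metrics the abstract mentions, here is a minimal Python sketch. It is not the authors' implementation: the tuple layout of `commits`, the function names `change_entropy` and `file_metrics`, the one-week bucket size, and the choice of Shannon entropy over time buckets as the "scatter" measure are all assumptions made for illustration.

```python
import math
from collections import Counter

WEEK = 7 * 24 * 3600  # assumed bucket length in seconds (hypothetical choice)


def change_entropy(timestamps, period_seconds=WEEK):
    """Shannon entropy (in bits) of a file's changes over fixed-length periods.

    0.0 means all changes fall into one period; higher values mean the
    changes are spread more evenly over many periods.
    """
    if not timestamps:
        return 0.0
    buckets = Counter(int(t // period_seconds) for t in timestamps)
    total = sum(buckets.values())
    # sum of p * log2(1/p), written with total/n to avoid a negative zero
    return sum((n / total) * math.log2(total / n) for n in buckets.values())


def file_metrics(commits, period_seconds=WEEK):
    """Derive simple process metrics per file.

    `commits` is a list of (timestamp, path, lines_added, lines_deleted)
    tuples; this layout is an assumption, not the paper's data format.
    """
    if not commits:
        return {}
    times, churn = {}, Counter()
    for ts, path, added, deleted in commits:
        times.setdefault(path, []).append(ts)
        churn[path] += added + deleted
    latest = max(ts for ts, _, _, _ in commits)
    return {
        path: {
            "revisions": len(ts_list),          # revision frequency
            "churn": churn[path],               # lines added + deleted
            "age_seconds": latest - min(ts_list),  # time since first change
            "entropy": change_entropy(ts_list, period_seconds),
        }
        for path, ts_list in times.items()
    }
```

Under these assumptions, a file edited once a week for four weeks scores higher on entropy than a file edited four times within a single week, even though both have the same revision count. Feature vectors like these could then be passed to a gradient boosting classifier.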

Authors

Philip König, Sebastian Raubitzek, Alexander Schatten, Dennis Toth, Fabian Obermann, Caroline König, and Kevin Mallinger

Links

Full Article
CORE Research Group
