News

New article in Big Data and Cognitive Computing

Our colleagues Philip König, Sebastian Raubitzek, Dennis Toth, Fabian Obermann and Kevin Mallinger published a new paper, "Boost-Classifier-Driven Fault Prediction Across Heterogeneous Open-Source Repositories", in Big Data and Cognitive Computing.

In this paper, they analyzed 2.4 million commits from 33 open-source projects and built a CatBoost-based model to predict which code changes are most likely to introduce bugs. By combining process, size, and entropy metrics, their approach not only achieved strong predictive performance under class imbalance but also revealed which file characteristics (age, revision count, and how scattered the changes are) make files more fault-prone. These insights can help developers prioritize testing and strengthen software reliability across diverse projects.

Abstract

Ensuring reliability, availability, and security in modern software systems hinges on early fault detection, yet predicting which parts of a codebase are most at risk remains a significant challenge. In this paper, we analyze 2.4 million commits drawn from 33 heterogeneous open-source projects, spanning healthcare, security tools, data processing, and more. By examining each repository per file and per commit, we derive process metrics (e.g., churn, file age, revision frequency) alongside size metrics and entropy-based indicators of how scattered changes are over time. We train and tune a gradient boosting model to classify bug-prone commits under realistic class-imbalance conditions, achieving robust predictive performance across diverse repositories. Moreover, a comprehensive feature-importance analysis shows that files with long lifespans (high age), frequent edits (revision count), and widely scattered changes (entropy metrics) are especially vulnerable to defects. These insights can help practitioners and researchers prioritize testing and tailor maintenance strategies, ultimately strengthening software dependability.
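
To make the per-file metrics mentioned in the abstract more concrete, here is a minimal Python sketch of how revision count, churn, file age, and a change-entropy value could be derived from a file's commit history. It is an illustration under assumptions, not the authors' implementation: the function names `change_entropy` and `file_process_metrics`, the 30-day bucket width, the `(timestamp, lines_added, lines_deleted)` input format, and the use of Shannon entropy over time buckets are all hypothetical choices, and the paper may define its metrics differently.

```python
import math
from collections import Counter
from datetime import datetime


def change_entropy(timestamps, bucket_days=30):
    """Shannon entropy (in bits) of a file's changes over fixed-width time buckets.

    0.0 means every change fell into a single bucket; higher values mean the
    changes are spread more evenly over more buckets. The bucket width is an
    assumption made for this sketch.
    """
    if not timestamps:
        return 0.0
    start = min(timestamps)
    # Count how many changes fall into each bucket_days-wide window.
    buckets = Counter((t - start).days // bucket_days for t in timestamps)
    total = sum(buckets.values())
    entropy = 0.0
    for count in buckets.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def file_process_metrics(commits, bucket_days=30):
    """Derive simple process metrics for one file.

    `commits` is a list of (timestamp, lines_added, lines_deleted) tuples,
    a hypothetical input format chosen for this example.
    """
    if not commits:
        return {"revisions": 0, "churn": 0, "age_days": 0, "change_entropy": 0.0}
    timestamps = [ts for ts, _, _ in commits]
    return {
        "revisions": len(commits),                                 # revision frequency
        "churn": sum(added + deleted for _, added, deleted in commits),
        "age_days": (max(timestamps) - min(timestamps)).days,      # file age
        "change_entropy": change_entropy(timestamps, bucket_days),
    }


# Made-up history for a single file, for illustration only.
history = [
    (datetime(2024, 1, 1), 10, 2),
    (datetime(2024, 2, 15), 5, 5),
    (datetime(2024, 4, 1), 1, 0),
    (datetime(2024, 6, 1), 20, 7),
]
metrics = file_process_metrics(history)
print(metrics)
```

Feature rows of this kind, one per file and commit, could then be passed to a gradient boosting classifier such as CatBoost, with class weighting or a similar technique to account for the imbalance between bug-introducing and clean commits.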

Authors

Philip König, Sebastian Raubitzek, Alexander Schatten, Dennis Toth, Fabian Obermann, Caroline König, and Kevin Mallinger