Statistical modelling of somatic cell counts using the classification tree technique
Abstract. The study covered a sample of 455 Polish Holstein-Friesian Black and White cows. Its aim was to apply and compare two statistical methods, classification trees and logistic regression, in examining the impact of selected lactation-related factors (successive lactation, herd size and production level, year of calving, calving season, test-day season, lactation phase and the amount of milk obtained in a test milking) on somatic cell counts. Two splitting criteria were used to build the classification trees: entropy reduction and the Gini coefficient. The quality of the classification tree and logistic regression models was compared using the following criteria: average squared error, cumulative lift, the Kolmogorov-Smirnov statistic and the area under the ROC curve. Of the methods applied, the classification tree with the entropy-based splitting criterion modelled the level of somatic cell counts best. According to the results, somatic cell counts were differentiated by the following factors, in decreasing order of importance: herd production level, year of calving, successive lactation, calving season, day of test milking, herd size and month of milk sampling. Taking somatic cell count as an indicator of udder health, the cows requiring particular attention in udder diagnosis are those in herds with high milk production levels that individually produce up to 15 kg of milk.
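The two splitting criteria named in the abstract can be illustrated with a minimal sketch. The abstract does not state which software or exact formulas were used, so the textbook definitions of Shannon entropy and Gini impurity are assumed here. The function names and the toy labels (1 = elevated somatic cell count, 0 = normal) are hypothetical and are not taken from the study data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of the class labels in a node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of the class labels in a node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_reduction(parent, left, right, impurity):
    """Parent impurity minus the size-weighted impurity of the two child nodes."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


# Hypothetical toy node: 1 = elevated somatic cell count, 0 = normal.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
# Hypothetical candidate split, e.g. on a herd production level threshold.
left = [1, 1, 1, 0]
right = [1, 0, 0, 0]

entropy_gain = impurity_reduction(parent, left, right, entropy)
gini_gain = impurity_reduction(parent, left, right, gini)
print(f"entropy reduction: {entropy_gain:.4f}")
print(f"Gini reduction:    {gini_gain:.4f}")
```

At each node a tree-growing procedure of this kind would evaluate every candidate split and keep the one with the largest reduction. The two criteria often agree on the chosen split but can rank candidates differently, which is one reason the resulting trees may differ in quality.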