Comparison of different statistical-genetic approaches of QTL detection by Evaluating results from a real dairy cattle data set

Recent reports on estimating QTL positions and effects on milk production traits show several chromosomal regions, for example on BTA6, being putative QTL regions, for milk yield and traits of ingredients. The statistical methods and genetic models on which these results were based on, show different advantages and limits. Thus, it is sometimes difficult to compare and evaluate such results. Confirmation studies are inevitable, before drawing conclusions towards finemapping analyses or practical use of such results. We compared three published approaches, realizing up to five different genetic models of QTL estimation analysing the traits fat yield (kg) and fat content (%) based on a real data set of five granddaughter families from the German Holstein dairy cattle population. The marker map contained 16 microsatellite markers, distributed across chromosome BTA6. The focus was mainly on the most likely map positions of the putative QTL and on its character. The significant results obtained from using different methods and adapting several genetic models were conclusive and comparable to QTL positions reported elsewhere.


Introduction
Worldwide, many scientists on the field of statistical genetics are concerned with creating new and more appropriate solutions of QTL estimation.Therefore, when focusing on a QTL study, it is often a question to decide, which approach is most suited for the task one is confronted with.The current solutions partly show different facilities and it depends most likely on the data structure, which genetic model and statistical method is preferable in the special case.In animal breeding, we find the highest number of publications with respect to milk production traits.Usually, the underlying experimental design for QTL estimation in dairy cattle is a half sib structure, either a granddaughter or a daughter design (WELLER et al., 1990).The number of successful QTL studies is steadily growing since the beginning nineties.The type of analyses changed during this time.First, e.g. when focusing on BTA6, there was a concentration on the casein gene variants for estimating direct effects on milk traits (e.g.BOVENHUIS et al., 1992;BOICHARD et al., 1994;LIEN et al., 1995;ORTNER et al., 1995;PANICKE et al., 1996 ).Then, the casein genes were used as markers in a single marker segregation analysis (e.g.van ARENDONK et al., 1994;LIU, 1994).The next step was covering the whole chromosome with several markers, mostly in a distance of ~20 cM (e.g.GEORGES et al., 1995;BOVENHUIS and WELLER, 1994;SPELMAN et al., 1996 ) in order to obtain a more precise result for a putative QTL region.However, these early studies were based on single traitsingle QTL models.ZHANG et al. (1998) and further studies based on modeling more than one QTL, what came closer to the real situation.The focus on multivariate estimation of QTL, taking the information of correlated traits into account, is new (e.g.ALMASY et al., 1997;SØRENSEN et al., 2003).A comprehensive overview on methods of QTL estimation was given by HOESCHELE (2001).In dairy cattle breeding, there are only very few published results based on simultaneous multivariate and multi-QTL-analyses (FREYER et al., 2003).Obtaining results from analyses using different statistical methods, besides from different data sets, is valuable for reasons of confirming a QTL.This means that the own results can be verified and also compared with QTL findings obtained elsewhere.Mostly, current approaches for QTL estimation in animal breeding have been created with respect to the conditions of a special task.They do not allow for general use, e.g. they need a specified family design, limited on two generations genotyped, or excluding animals with incomplete marker information.QTL estimates might reflect these properties.Thus, the goal of this paper was to compare and to evaluate three different statisticalgenetic approaches of QTL estimation, partly based on the same genetic modelling in order to enable direct comparisons, and to show some directions in adapting them, monitoring the method-specific results.We demonstrate this here with results from analysing real data, because of combining the comparison of methods with their practical application.The focus was on finding and comparing QTL map positions.In particular, results obtained with recent reports (FREYER et al., 2002(FREYER et al., , 2003) ) were lighted up, for the two measures yield and percentage of basically the same substratemilk fat, focusing on methodical effects of estimation.

Statistical methods
Our comparison of methods for multi-QTL estimation is based on three different advanced solutions, each created for the specific conditions of dairy cattle breeding.They all allow for the same family structure, which is basically a half sib design, were the (grand)sires are assumed to be unrelated.Further, they allow for a nearly unlimited number of markers on a chromosome, whose information is used for calculating the probability of the unknown QTL alleles.The way of using this information in particular, e.g.within IBD matrices, alters somewhat between the methods.In all cases, the test statistics may be calculated at 1 cM intervals within a marked region of a chromosome.They all allow for incomplete marker information as well.But the effect of this surely depends on the number of markers and their information content.The effect of using DYD or EBV on QTL estimates of QTL positions did not show any difference between the methods.We can provide a brief informative description of the statistical methods only.For a more extensive explanation see FREYER et al. (2003) and the relevant original papers cited, respectively.
Method 1: Least squares method (LS, e.g.HOESCHELE, 2001) The model including n QTL based on the granddaughter design was where y ij was the trait value of son j of (grand)sire i, µ was the overall mean, s i was a fixed effect, b ikq was a regression coefficient for QTL q nested within sire i at position k, P ijkl means the probability of inheriting a QTL allele l of QTL q from sire i at position k, n was the number of QTL fitted at a time (1 or 2) and e ij residual with variance approximately equal to variance/z ij .Here, z ij is a weighting factor according to the reliability of EBV of son j of sire i.Using the model above with q = 1, an Fstatistic for H 0 (no QTL) versus H A (one QTL present) was adapted.The test of the one-QTL model versus the two-QTL model was performed using F-statistics as well.Within a set of computer programs (DU and HOESCHELE, 1999) the LS analysis, optional incorporating epistatic effects between the two QTL, could be performed.Boots-trapping and permutation test of the original data to obtain confidence intervals CI (95%) of QTL positions and a chromosome-wide significance threshold was included.
Method 2: Residual maximum likelihood method (REML, GRIGNOLA et al., 1996;ZHANG et al., 1998) The residual maximum likelihood method) based on variance component analysis (VC) was implemented in a computer program by ZHANG and HOESCHELE (1998).The model including q QTL was according to GRIGNOLA et al. (1997): with variance var(e ij )= σ e 2 /z ij , where in addition to the symbols described for [1] u ij was the additive genetic effect of son j of sire i, v qij k means an effect of allele l (l=1,2) at QTL q of son j of sire i, and σ e 2 is the residual variance.The likelihood ratio test (LR) was used to test for presence of a putative QTL and to discriminate between oneand two-QTL models.The distribution of the LR test statistics from REML analysis follows approximately a chi-square distribution (with 1 to 2 degree of freedom) used for testing, as most authors of such solutions presuppose.
Method 3: Multivariate VC method (AIREML, JENSEN et al., 1997 ;SØRENSEN et al., 2003) This approach is based on (co)variance component analysis and a random QTL model, and was realized within the DMU package (JENSEN and MADSEN, 1994).The data was analysed by fitting the multivariate mixed model including the effect of n q QTL, where y was a vector of fat yield and fat content observed for each son, b was a vector of fixed overall mean effects, a q was a vector of random effects for the QTL q assumed to be normally distributed with mean and variances (0, K i Q 0 ⊗ i|M,p ); correspondingly u was a vector of random breeding values (0, G⊗A) and e a vector of random residual effects (0, j n 1 E⊗I), adjusted for the number of sons daughters (n j ).

K i
0 was an unknown trait specific (co)variance matrix.G and E were unknown (co)variance matrices of polygenic and residual effects, X, W i , Z, are known design matrices associating the traits for each son with the fixed and random effects.A is the numerator relationship matrix and I an identity matrix.Q i|M,k is the identity by descent (IBD) matrix of the QTL q, a function of marker data (M) and the position (k q ) of the QTL q on the chromosome (e.g.GEORGE et al., 2000;SØRENSEN et al., 2003).The hypothesis tests H1, H2, H3 and H4 using the LR test again, assuming to be χ 2 - distributed, based on five different models (see simplified in Table 1).K ij -element in the genetic (co)variance matrix K (dimension=2) for traits i and j, K i (dimension=1) -element in the genetic variance for trait i, k 1 and k 2 -QTL positions for trait i = 1 and j = 2, for hypothesis testing in a likelihood ratio (LR) test: H1: L 00 and L 11 with 3 degrees of freedom (df); H2: L 00 and L 10 for fat yield (L 01 for fat content) with 2 df each; H3: L 00 and L 11 with 3 df; H4: L 11 and L 10 (L 01 ) with 2 df each, if H1 results in a significant finding.
The Akaike information criterion (AIC = -2 ln MLH + 2 * number of parameters in the model) was used for identifying the more parsimonious model in comparison of the close linkage model and the pleiotropic model, because of the non-nested nature of both models.Our comparison is not based on a simulated dataset, because these approaches have been verified and reviewed elsewhere.Here, an application to real conditions is of interest for practical demonstration.The focus was on comparing results from analysing fat yield and fat content, which are positively correlated traits, with r p ranging from 0.23 to 0.55 (a cross section of the international literature) under a model-and method-analytical view.The computations were supported by VPI&SU, USA, and DIAS, DK, on the basis of a collaboration (FREYER et al., 2002(FREYER et al., , 2003)).

Material
We show the analysis based on phenotypic data of 641 sons of five German Holstein sire families, in milk fat traits yield and content.The data was provided by the German Computing Centre for animal breeding purposes, VIT Verden.The standard deviation was 17 kg fat, varying from 11.1 (in family fa 4) to 18.4 (in family fa 2) and 0.28 % fat (varying from 0.22 in family fa 5 to 0.29 in family fa 1), for estimated breeding values (EBV), respectively.Briefly, the marker map for BTA6 contained 16 micro satellite markers (KÜHN et al., 1999 extended) of the order and map position in cM Kosambi: 0,16,40,43,58,59,64,66,67,71,76,79,89,95,102, and 124 cM.The mean number of genotyped sons per sire was 60, with EBV from averagely 278 daughters per son (further details in FREYER et al., 2003).

Single family analyses
For fat yield, we notice an agglomeration of family specific 'spots with highest test statistic values' in the test statistic profile between 62 and 70 cM, exceeding the threshold at 68 cM in family fa 2 (Figure 1).Further, a QTL for fat yield at 95 cM was suggested from fa 1.The same family yielded in a significant QTL estimate for fat content, at 46 cM (Figure 2), whereas the steep peak obtained from fa 3, at 60 cM, was not significant.The single family QTL analysis of fat yield, based on a two-QTL, model was inferior in fa 2. The two suggested QTL positions confirmed the same QTL (Table 2).The analysis result from fa 3 showed QTL positions at 47 and 62 cM for fat yield, which were suggested to be significant here, and a superiority of the epistasis model.

Across family analyses
Analysing all families together by method 1, we found the QTL position at 46 cM for fat content, as for fa 1. Method 2 placed this QTL at 41 cM, respective 43 cM (from the two-QTL model, Table 3).A quite interesting result from method 2 is, that QTL positions at 41 and 68 cM were estimated for fat yield.The two-QTL model realised by method 3 yielded one significant QTL position at 46 cM and a further QTL suggestion at 95 cM for fat content.The first QTL at 46 cM for fat content was confirmed by the one-QTL model.The confidence intervals CI(95%) obtained by bootstrapping were comparably high, likely because of several family-specific QTL.But this was also reported by others (e.g.RON et al., 2001).   1 indicates exceeding of 0.05 threshold (the ln "maximum likelihood" from the two QTL model was tested against the ln "maximum likelihood" from both univariate analyses each, not described in detail) 2 highest F-value over the test statistic profile along the marked chromosome

Bivariate analyses
The results of bivariate analyses, on both fat traits, in principle confirm the previous significant findings, accompanied by additional information.The profile obtained from the pleiotropic model analysis shows its highest test statistic levels between 46 and 58 cM, suggesting a QTL both traits simultaneously (Figure 3).This area contains the estimated QTL position for fat content.The QTL position for fat yield at 68 cM, significantly suggested by fa 2, was not included.The profile of the simple bivariate analysis of fat yield shows a flat course with high test statistic values between 52 and 70 cM, highest at 68 cM.This was covered partly by the profile obtained from the pleiotropic model analysis.The 'common part' of the pleiotropic profile for both traits was found between 52 and 58 cM.The QTL variance components were smallest when the data were analysed 'simply-bivariately' (Table 4).The errors of pleiotropic estimates were fairly high.In case of the close linkage model, the area with higher test statistic values was large for fat yield, roughly covering the region between 45 and 70 cM, and a second QTL for fat content at 95 cM seems to be likely (Figure 4).When comparing the AIC for both decision models (pleiotropic and close linkage model), we notice nearly equal AIC.The basic LR for these models (H1 and H3) were formally significant and similar.We evaluated the simple-bivariate models to be the models of fit here (Table 5).This means, both 'strong' QTL for each of the single traits, from two different families, 'got through' in the bivariate analysis.There are some additional spots in between the 22 cM distance, likely reflecting 'inferior family specific QTL'.The test of H4 did not fully support a pleiotropic QTL.But the difference was small and the close linkage model received its highest test statistic values at 46 cM (for content of fat) and 50 cM (yield of fat) and also at 95 cM for fat content in combination with 68 cM for fat yield.Discussion Confirmation studies and fine mapping of QTL are inevitably to avoid conclusions for practical purposes from results yielded by chance.Analyzing data repeatedly on the basis of varying advanced model assumptions and different statistical-genetic methods may be one step towards verifying QTL results.A QTL segregating in all families can not be presupposed.Here, the single families were analysed effectively by adapting the LS method.It enables distinguishing between putative family specific QTL.Furthermore, an overall analysis is better interpretable with knowledge about the single families.Different QTL suggestions between the families showed up in our example, taken from the family specific test statistic profiles (Figure 1 for fat yield, and Figure 2 for fat content) and from some significant results of the two-QTL model analyses (Table 2).For fat yield in fa 2 (significant at 68 cM from the one-QTL model 1), the two QTL model analysis resulted insignificantly in 64 and 66 cM basically for the same QTL.A small distance can not be verified at this level of QTL search.In case of estimating two significant QTL within a relatively small region of 15 cM, we surely may expect epistatic interactions of both QTL, as found in family fa 3 for fat yield (Table 2).But the test statistic profile (Figure 1) shows only one single peak at 62 cM and thus, it is hard to believe, that a second QTL should be located in this region.This particular result should be better interpreted in the way: If there would be two QTL, they show epistatic interactions.On the basis of three different statistical approaches adapted to a real data set, we can show agreeing results.Mostly, the two-QTL model results were confirming the one position found by the one-QTL model analysis (Table 3).In single family analyses, we found QTL positions at 47 cM, for fat yield in family fa 3 and at 46 cM for fat content in family fa 1.The results of the overall analyses based on method 2 (or method 3) were similar (Table 3), suggesting the QTL at positions 43 cM (fat content) and 68 cM (fat yield).Additionally, we notice estimates for a second fat content QTL at 103 (or 95) cM.The result of the multivariate analysis was most interesting.The increasing power by using extended information from correlated traits compared to univariate VC analysis was proved e.g. by KOROL et al. (1995) and SØRENSEN et al. (2003).Furthermore, this type of analysis enables discriminating between pleiotropic and closely linked trait specific QTL.The test statistic profile of the simple bivariate analysis of fat yield was flat between 52 and 70 cM (Figure 3) and reflecting the range of family specific QTL estimates.The overlapping area between 52 and 58 cM, suggesting one QTL for both traits, obtained by the pleiotropic model analysis in comparison to the simple bivariate analyses, could not be confirmed, neither by any single family result (Table 2), nor by any methodical result (Table 3).But astonishingly, this particular region is in conformance to other reports: RON et al. (2001) found a QTL for both traits at BM143, corresponding to 59 cM in our marker map.ZHANG et al. (1998) reported a QTL for fat content near TGLA37, at 66 cM.The strongest QTL estimates dominated the close linkage model result partly.The very similar AIC values from the pleiotropic and the close linkage model at 58 cM might be a suggestion of a further (pleiotropic) QTL, that would not be unexpected in the case of these both traits describing milk fat production.The results underline that further investigations, especially on finemapping QTL, are necessary for characterizing the QTL more precisely.

Table 1
Models and their parameters specified in the bivariate analysis in Method 3 (Modelle und deren spezifizierte Parameter der bivariaten Analyse inMethod 3, SØRENSEN et al., 2003)

Table 2 Single
family results of the two QTL model analysis (Method 1) of fat yield FY and fat content FC (Ergebnisse von Einzel-Familien-Analysen mit dem Zwei-QTL-Modell in Methode 1 für Fettmenge FY und Fettgehalt FC)