Bayesian analysis of ordinal categorical data under a mixed inheritance model

The effectiveness of proposed Gibbs sampling (GS) algorithm to detect single loci determining livestock threshold traits under a different hypothetical breeding and statistical modeling scenarios was examined. The following factors were included into the analysis: the presence of fixed effects, knowledge of one threshold, the size of the population (1 212 and 3 070 pedigreed individuals, respectively) and proportions of individuals in three genotypic classes. Five threshold and one linear unitrait animal model were employed to analysis of these datasets. The GS algorithm was applied to estimate fixed effects (optionally), additive polygenic variance, single allele frequencies, genotypic effects and one threshold (optionally). For each case, 2 000 000 rounds of GS were conducted. The first 1 000 000 steps were discarded as a burn-in-period. The results were collected from every 20th iteration. In general, the accuracy of parameter estimates is not satisfactory. However, taking into account the scant amount of information provided by the ordinal categorical data, it seems that such an analysis is a good first approach. Except for one case in which the estimate was very close to the true value, in all the other cases the estimated gene effect was smaller than the true effect. In general, the algorithm proposed does not provide overestimated effects of single locus.


Introduction
In the past decades of the 20th century, applied genetic improvement programs in livestock have been targeted at increasing production traits.So far, a considerably phenotypic and genetic gain has been registered.Unfortunately, the production characters are negatively correlated with functional ones, for instance fertility, fecundity, some disease resistances etc.A majority of them is categorical and often binary recorded.In fact, they are difficult to analyze statistically, since two scales (unobserved continuous liability and observed discrete phenotypes) should be statistically modeled.For simplicity classical linear models have been sometimes employed for genetic evaluation in livestock (Szwaczkowski et al. 2002).Furthermore, low heritabilities estimated for functional traits have greatly limited its inclusion in selection programs.Hence, new statistical tools are still developed for analysis of categorical data (Gianola 1982, Molinski et al. 2003).Comparative studies (Casselas 2007, Silvestre 2007) indicate threshold methodology with Gibbs sampling algorithm to analyze these traits.
A number of authors suggested that utilization of marker-assisted selection (Meuwissen & Van Arendonk 1992, Liu & Mathur 2005) and marker-assisted introgression (Dominik et al. 2007), especially for traits with low heritability, can considerably increase selection efficiency and maximize genetic progress.Unfortunately, molecular detection of important loci is still relatively expensive and labor-consuming.Furthermore, the methods used for estimating effects of quantitative trait loci, based on molecular data, often overestimate and or misestimate these effects (Hocking 2005).Similar conclusions were drawn at the 13th Quantitative Trait Locus and Marker Assisted Selection Workshop in 20-21 April, 2009 in Wageningen (Mucha, personal communication) where additionally the usefulness of Bayesian methods was stated.Hence, it seems that advanced molecular works should be preceded by marker-free segregation analysis.A first approach to segregation analysis is described by Elston & Steward (1971).Next, these methods were extended among others by Janss et al. (1995) and Guo & Thompson (1994).However, the majority of these approaches are mainly focused on continuous traits.Recently, the Bayesian marker-free segregation analysis has been more widely employed to threshold traits (Kadarmideen & Janss, 2005;Skotarczak et al. 2008, Sørensen et al. 1995).From the perspective of further molecular and marker segregation analysis, these algorithms can supply a number of useful information on genotype effects, single gene variance, allele frequency and polygenic variance.Furthermore, for the genetic analyses of the threshold characters the fixation and/or estimation (optionally, in case of more than two phenotypic classes) of the thresholds is required.Additionally, sensitivity of statistical inferences about segregation of single genes is influenced by the size and structure of population.It can also be determined by real genotypic effects and their frequencies as well as numerical properties of applied algorithms.This paper is a continuation of an earlier study by Molinski et al. (2003) and Skotarczak et al. (2007Skotarczak et al. ( , 2008) ) concerning Bayesian detection of single loci under threshold animal model.No literature on statistical effectiveness of proposed tools is available.
The main objective of this paper is to check the effectiveness of the described algorithm in detecting single loci determining livestock threshold traits.

Data simulation
Simulation studies enabled us to examine the influences of several factors, such as: the presence of fixed effects, a knowledge of one threshold, the size of the dataset and the proportion of individuals in the three classes of the analyzed variable.More details are listed in Table 1.Three categories for data, fixed effects included, both thresholds unknown 3.
Two categories for data, fixed effects included, fixed threshold i.e. t=0 4.

Continuous variable
To verify the correctness of the described algorithm some simulation studies have been provided.To be as close to reality as possible, the calculations were done for two different real pedigrees.The first one (P1), contained 1 212 individuals with 307 founders and 905 observed items.The average number of observation per individual was 3. The second pedigree (P2), consisted of 3 070 individuals with 454 founders and 2 040 observed animals.The mean replication number was about 5.5.To simulate the data we took two sets of parameters σ 2 a , ƒ A 2 and µ A 1 A 1 , namely S1={0.4,0.2, 0.8} and S2={0.8,0.2, 1.5}.In the case of both simulated data sets the polygenic heritability coefficient equals 0.3.The assumption concerning the allele frequency and the structure of the relationship matrices gave in P1 757 individuals with genotype A 1 A 1 , 338 of type A 1 A 2 and 117 of type A 2 A 2 .For the P2 these numbers were respectively equal to 2 138, 771 and 161.
In each analyzed case, 2 000 000 rounds of Gibbs sampling were conducted.The first 1 000 000 steps were discarded as a burn-in period.The important results were collected from every 20th iteration.The means of the posterior distributions were calculated as the point estimators of the unknown parameters.The statistical significance of the single gene effect was verified by the 95 % Highest Posterior Density Regions -HPDR (Scott 1992).If the HPDR included the 0 value it was stated that the single gene effect has no statistical meaning as opposed to the case when HPDR did not include 0.

Model and estimation of parameters
In the following analyses we assume that the ordinal categorical data y are described by a model where u is the s vector of unobserved variables (i.e.liability), s is the number of recorded individuals, X is the s × b design matrix relating fixed nongenetic effects to observations, β is the b vector of fixed nongenetic effects, Z is the s × q design matrix relating polygenic and single locus effects to observations, q is the number of all individuals and W is the q × 3 unknown matrix containing information on the genotype of each individual; each row of W has one of the following forms: , a is the q vector of random additive polygenic effects and e is the s vector of random effects.The relation between observed phenotypes y i and the liability u i is conditioned by the thresholds t 1 and t 2 , namely parameters.The statistical significance of the single gene effect was verified by the 95% Highest Posterior Density Regions -HPDR (SCOTT, 1992).If the HPDR included the 0 value it was stated that the single gene effect has no statistical meaning as opposed to the case when HPDR did not include 0.

Model and estimation of parameters
In the following analyses we assume that the ordinal categorical data y are described by a model where u is the s vector of unobserved variables (i.e.liability), s is the number of recorded individuals, X is the s x b design matrix relating fixed nongenetic effects to observations, β is the b vector of fixed nongenetic effects, Z is the s x q design matrix relating polygenic and single locus effects to observations, q is the number of all individuals, W is the q x 3 unknown matrix containing information on the genotype of each individual; each row of W has one of the following forms: [1, 0, 0], [0, 1, 0] or [0, 0, 1] corresponding to genotypes is the effect of genotype A 1 A 1 , and - is the effect of A 2 A 2 , a is the q vector of random additive polygenic effects and e is the s vector of random effects.The relation between observed phenotypes y i and the liability u i is conditioned by the thresholds 1 t and 2 t , namely For the realization of Gibbs sampling procedure it is necessary to define the prior distributions for all model parameters.Uniform improper distributions were assumed for vectors β and , the random vectors a and e were normally distributed: a ~ N(0, A 2 a σ ), where A is the q x q relationship matrix, e ~ N(0, s I ) and an inverted chi-square distribution was assumed for the component 2 a σ .Moreover, the uniform distribution on the interval was supposed for the unknown thresholds and the interval [0, 1] for allele frequency.The conditional posterior distributions for vectors β, , a in the presented models are normal with means equal to the solutions of the appropriate mixed model equations (SØRENSEN and GIANOLA, 2002).The following formulas were used to calculate the expected values and variances at every step of Gibbs sampling procedure: for the vector of fixed effects β: ( ) for the vector of additive genetic effects a: For the realization of Gibbs sampling procedure it is necessary to define the prior distributions for all model parameters.Uniform improper distributions were assumed for vectors β and μ, the random vectors a and e were normally distributed: a ~ N(0, Aσ 2 a ), where A is the q × q relationship matrix, e ~ N(0, I s ) and an inverted chi-square distribution was assumed for the component σ 2 a .Moreover, the uniform distribution on the interval [0, t max ] was supposed for the unknown thresholds and the interval [0, 1] for allele frequency.
The conditional posterior distributions for vectors β, μ, a in the presented models are normal with means equal to the solutions of the appropriate mixed model equations (Sørensen & Gianola 2002).
The following formulas were used to calculate the expected values and variances at every step of Gibbs sampling procedure: for the vector of fixed effects β: where x i is the i-th column of X, X −i is matrix X without the i-th column, β −i is vector β without the i-th element, i=1, …, b; for the vector of additive genetic effects a: is the i-th row of matrix vector a without the i-th element, q i ,..., 1 = .Further, the elements of vector  were generated according to the following formula: .In every step of Gibbs sampling procedure it was fixed that 1 1 A A µ > 0. As it was mentioned above we assumed that The values of the liability were generated from a truncated normal distribution with the truncation point defined by the actual threshold value: where z i is the i-th column of Z, A i,i is the element of A −1 in the i-th row and i-th column, A i,−i is the i-th row of matrix A −1 without the i-th element, a −i is vector a without the i-th element, i=1, ..., q.
Further, the elements of vector μ were generated according to the following formula: is the i-th row of matrix Further, the elements of vector  were generated according to the following formula: . In every step of Gibbs sampling procedure it was fixed that 1 1 A A µ > 0. As it was mentioned above we assumed that The values of the liability were generated from a truncated normal distribution with the truncation point defined by the actual threshold value: where denote the density and the cumulative distribution function of the normal distribution, respectively.Threshold 1 t was generated from the following uniform distribution: where denotes the minimum value of the liabilities within observations in the second category; similarly is the maximum value of the liabilities for observations in the first category.Threshold 2 t was generated similarly.The additive variance component was generated from the following inverted chi-square distributions: where 2 , a a S v are the hyperparameters ( a v denotes the degrees of freedom and S is a scale parameter).
According to Guo and Thompson (1994), the elements of the unknown genotypes in table G were generated from the following formula:     ( ) (5 ) where (zw) i is the i-th column of matrix ZW, i=1, 2, 3.In every step of Gibbs sampling procedure it was fixed that μ A 1 A 1 >0.As it was mentioned above we assumed that μ A 1 A 2 =0 and The values of the liability were generated from a truncated normal distribution with the truncation point defined by the actual threshold value: where x R i is the i-th row of matrix X, (zw) R i is the i-th row of matrix ZW, z R i is the i-th row of matrix Z, i=1, ..., s, G is the table of genotypes determining structure of W.
Moreover, Φ (.) and ф (.) denote the density and the cumulative distribution function of the normal distribution, respectively.
Threshold t 1 was generated from the following uniform distribution: where min (u | y=2) denotes the minimum value of the liabilities within observations in the second category; similarly max (u | y=1) is the maximum value of the liabilities for observations in the first category.Threshold t 2 was generated similarly.The additive variance component was generated from the following inverted chi-square distributions: A in the i-th row and ith column, is the i-th row of matrix vector a without the i-th element, q i ,..., 1 = .Further, the elements of vector  were generated according to the following formula: . In every step of Gibbs sampling procedure it was fixed that 1 1 A A µ > 0. As it was mentioned above we assumed that The values of the liability were generated from a truncated normal distribution with the truncation point defined by the actual threshold value: denote the density and the cumulative distribution function of the normal distribution, respectively.Threshold 1 t was generated from the following uniform distribution: where denotes the minimum value of the liabilities within observations in the second category; similarly is the maximum value of the liabilities for observations in the first category.Threshold 2 t was generated similarly.The additive variance component was generated from the following inverted chi-square distributions: where 2 , a a S v are the hyperparameters ( a v denotes the degrees of freedom and 2 a S is a scale parameter).
According to Guo and Thompson (1994)      A without the i-th element, i − a i vector a without the i-th element, q i ,..., 1 = .Further, the elements of vector  were generated according to the following formula: .In every step of Gibb sampling procedure it was fixed that 1 1 A A µ > 0. As it was mentioned above we assumed that The values of the liability were generated from a truncated normal distribution with the truncation point defined by the actual threshold value: where denote the density and the cumulative distribution function of the normal distribution, respectively.Threshold 1 t was generated from the following uniform distribution: where denotes the minimum value of the liabilities within observations in the second category; similarly    , , , | ( where v a , S 2 a are the hyperparameters (v a denotes the degrees of freedom and S 2 a is a scale parameter).
According to Guo and Thompson (1994), the elements of the unknown genotypes in table G were generated from the following formula: are the genotypes of the i-th individual's parents, i = 1,…,q.When the individual is not observed, the last term will be substituted by 1.In the first step of Gibbs sampling, it was assumed that matrix W has the following form: . Further, for the single gene Mendelian transmission probabilities were assumed.To estimate the genotypes for the individuals, the frequency of alleles among the groups of founders is required.The allele frequency was generated from the beta distribution according to the following formula (KADARMIDEEN and is the i-th row of matrix 1 − A without the vector a without the i-th element, q i ,..., 1 = .Further, the elements of vector  were generated acco formula: .In sampling procedure it was fixed that 1 1 A A µ > 0. As it was assumed that The values of the liability were generated from a truncate with the truncation point defined by the actual threshold va ( ) ( )    where G i is the genotype of the i-th individual, G −i is the table of the genotypes of all individuals excluding the i-th individual, G o i refers to the genotype of the progeny of the i-th individual, G m i is the genotype of the i-th individual's mate, G s i , G d i are the genotypes of the i-th individual's parents, i=1, ..., q.When the individual is not observed, the last term will be substituted by 1.In the first step of Gibbs sampling, it was assumed that matrix W Because in the real data the fixed effects are usually present, in the subsequent analysis we have introduced three fixed effects.The estimates of β were usually wrong, but what is very interesting and important, in all the analyzed cases, the contrasts between them were correctly estimated.For the other parameters, a positive influence of an assumption about the zero value of the first threshold, the second being unknown, is observable.Also a positive influence of the magnitude of pedigree and consequently the number of observations is true.The analysis of estimates of ƒ A 2 indicates model 2.1 as a best solution and again the estimate for P2 being better than that for P1.
A subsequent problem is the division of observations into the three groups.A positive influence of a certain balance is seen only for the bigger set of data (P2), in which the major gene effect occurs to be significant.
When the number of observations in the three observed classes is very unbalanced, the question arises whether it is better to reduce the number of classes.Some comparisons of the results obtained for data grouped into two categories show that it is better not to join the classes and proceed with original data.
As usual in the case of discontinuous variables it may also be interesting to check whether it is not better to adopt one of the well known transformations leading to continuity and analyze the data as observations of a continuous variable.When the proportion as the transformation is used the obtained results indicate that such procedure deteriorates the estimates.

Discussion
A detection of single loci with very large effects, determining continuous traits, with balanced size of genotypic classes segregating in a big population is very easy.Some major genes (e.g.Booroola gene -FecB) were identified many years ago by the use of relatively simple methods.They are effectively implemented into genetic improvement programs in livestock and poultry.König et al. (2009) concluded that economic efficiency and increased annual genetic gain in dairy cattle breeding programs are possible due to the replacement of classical procedure based on progeny testing by genome wide selection.It is almost sure that similar tendencies will be registered for other livestock populations.
Although a number of molecular and mathematical methods have been further developed, the detection of important single loci still seems difficult for at least three reasons.Firstly, when a trait of economic interest has a major gene segregating in a given population, the gene can easily be captured by selection.Secondly, the number of important traits (e.g.reproductive ability, disease resistance) has typical discrete vs binary phenotypes.Thirdly, some of the major genes are segregating in small unique populations.
As it has already been mentioned, many single loci have been detected for animal reproductive ability.These genes determining a given trait often have different chromosomal localizations as well as various effects.For instance, eleven different genes are known for fecundity in sheep (Davis 2005).Also single loci have been detected for other binary characters (e.g.disease resistance).However, some of them still have the status of putative quantitative trait loci (Casas & Stone 2005).Physical identification of a single locus with considerable effects requires more efforts and relatively high costs, because genotyping of many individuals is still expensive.Hence, the implementation of molecular information in breeding programs can be effectively preceded by results of the marker-free segregation analysis, mainly hypothetical genotypic effects, allele frequencies as well as quantitative genetic parameters.This corresponds with the results obtained in the present study, in which the proposed Bayesian algorithm does not lead to overestimation of parameters, mainly genotypic effects.Therefore, our results can be used as initiative data prior to classical marker segregation analysis.
Indeed, the proposed method assumed only a biallelic locus without dominance, although these assumptions are quite realistic (Kadarmideen & Janss 2005).Inheritance models for some traits are more complicated, as they include dominance, epistasis or genomic imprinting.Molecular identification of important single loci, especially determined discrete traits, seems to be more difficult (Blasco 2008).
Twenty different scenarios described by six models for two types of populations have been analyzed.To our knowledge, no reports on simulation studies on properties of the Bayesian algorithm to detect single loci by the use of non-marker data under a threshold animal model are available in literature.However, several papers concern the numerical comparison of non-Bayesian methods for both polygenic and mixed inheritance models and influence of data structure on the accuracy of the estimated genetic parameters under a polygenic inheritance model (Le Roy & Elsen 1995, Kominakis 2008).It is well known that statistical inference based on larger samples leads to more accurate estimation.Such dependence has been confirmed in the present study.Estimates obtained for P2 data are usually more accurate compared to P1 data.Generally, the problem of data structure is connected with field collected data.It varies across populations, species etc.Our investigations are addressed for small livestock populations.
From the previous remarks it is obvious that the proposed analysis should be treated as a first step analysis.The accuracy of estimates is not satisfactory.However, taking into account the scant amount of information provided by the categorical data, it seems that such an analysis is a good first approach in the estimation procedure.
A very important result is connected with the estimate of the major gene effect.Except for one case in which the estimate was very close to the true value, in all the other cases the estimated gene effect was less than the true effect.Consequently, it seems to us, that an estimated gene effect, proved to be significant by, for example, the Highest Posterior Density Regions is indicative of the presence of a segregating gene which conditions the analyzed trait.
genotypes of all individuals excluding the i-th individual, i o G refers to the genotype of the progeny of the i-th individual, mi G is the genotype of the i-th individual's mate,

Figure 1
Figure 1Comparison of major gene effect estimates when the true value is equal to 0.8

Figure 3
Figure 3Comparison of genetic additive polygenic variance estimates when the true value is equal to 0.4