TIGER : A software system for fine-mapping quantitative trait loci

The localisation of quantitative trait loci which contribute significantly to phenotype variation of economically important traits in domestic species has become an important goal in animal genomics. Several such loci have been roughly identified using linkage analyses; however the focus has now shifted towards fine mapping and pinpointing causal mutations. In the context of a cooperative national research project, the software system TIGER was developed. TIGER is a UNIX script linking several individual Fortran programmes and is used for comprehensive variance component analysis of fine mapping data. Starting with raw genotype data, pedigree and marker map information and ending with a residual maximum likelihood-based test for each putative quantitative trait locus position, the software provides the user with an ‘all in one’ package capable of linkage analysis, linkage disequilibrium analysis and combined linkage/linkage disequilibrium analysis. The software system has been employed in 4 fine mapping projects on 4 distinct cattle chromosomes.


Introduction
The vast majority of economically important traits in plants and animals are quantitative in nature and can be influenced by many individual chromosomal regions (quantitative trait loci, QTL).The localisation of QTL which contribute significantly to phenotypic variation is an important goal of both the scientific and the agricultural community, although mapping these QTL is only the first step in the search for the underlying genes (COPPIETERS et al., 1999).Several QTL for economically important traits in domestic species have been roughly positioned using linkage analysis (LA).However, the confidence intervals achieved in such studies contain far too many genes to allow the identification of causal mutations and the results are family-specific.The focus has now shifted to fine mapping these QTL as the next step toward identifying genes underlying complex traits.Methods that utilize historical recombination information on the basis of linkage disequilibrium (LD) allow implementation of fine mapping analyses on existing outbred populations.These methods have become popular for QTL fine mapping in livestock, where the creation of mapping populations such as advanced intercross lines (DARVASI and SOLLER, 1995) or others (DARVASI, 1998) is not practicable.The pedigree structure of typical dairy cattle populations is ideally suited for QTL mapping.Through the wide-scale implementation of artificial insemination, the paternal ancestry of many animals in the population can be traced back to a handful of bulls.The effective population size is small, although, as for example in the Holstein Friesian breed, many millions of animals exist.Furthermore, the phenotypic data of female animals for most relevant traits has been systematically recorded for use in routine breeding value estimation.The performance of the sires' progeny can thereby be included in the calculation of the sires' breeding value.This 'granddaughter design' (Figure 1) lends itself to use in mapping QTL and requires less genotyped animals than a 'daughter design', in which only large sets of paternal half sisters are used (GEORGES et al., 1995, WELLER et al., 1990).Various statistical models can be utilized to refine QTL positioning in mapping experiments (Freyer et al. 2003), whereby both regression-based models (RM) and variance component models (VC) are usually preferred.Single-marker or multi-marker association studies have been documented (XU, 2003;GRAPES et al., 2004;Zhao et al., 2007).Although the RM method is robust, disequilibrium between specific haplotypes and the putative QTL may be larger than between single markers and QTL in some cases.
In VC analyses, a random additive allelic effect is estimated for every unique haplotype and a random polygenic effect for each animal is included to correct for correlations between polygenic and allelic coancestry.The covariance between the effects of any two haplotypes in a pedigree is assumed equivalent to the probability that the haplotypes have identical origin (identical by descent, IBD) and are included in the gametic relationship matrix G (FERNANDO and GROSSMAN, 1989).Founder haplotypes in LA (i.e.maternal haplotypes of sires and maternal and paternal haplotypes of grandsires in a granddaughter design) are considered unrelated and their pair-wise IBD probabilities (covariances) are set to zero.For fine mapping experiments, MEUWISSEN andGODDARD (2000, 2001) suggested incorporating historical recombination as a means of exploiting the non-random allelic association between QTL and closely linked markers (i.e.population-wide LD).The authors showed how to compute IBD sub-matrices containing IBD covariances between founder haplotypes conditional on marker data (see Figure 2).This method has consequently been extended to include linkage information (MEUWISSEN et al., 2002;BLOTT et al., 2003;OLSEN et al., 2004).Variance components and individual haplotype effects can be estimated by incorporating the IBD sub-matrix into the gametic relationship matrix G and applying restricted maximum likelihood methods (REML).The disadvantage of this method is that, as opposed to RM methods, certain assumptions regarding population structure must be made (i.e.number of generations since the mutation and effective population size of founder population, selection, etc.).More importantly, an accurate marker map is mandatory, especially with regard to the correct order of markers.Although the first few drafts of bovine linkage maps appeared as early as 1994(i.e. BARDENSE et al., 1994;;BISHOP et al., 1994;GEORGES et al., 1995, etc.), with later versions using improved techniques (i.e.KAPPES et al., 1997;IHARA et al., 2004;EVERTS-VAN DER WIND et al., 2005), gaps and areas of uncertainty in publicly available maps still persist.A whole genome scan for QTL affecting functional traits including udder health was conducted in 2001 in cooperation with German AI and breeding organizations, scientific institutes for animal breeding and animal computing centers (THOMSEN et al., 2001).This project, instigated by the German cattle breeders' federation (ADR), enabled the identification of several QTL affecting functional traits in the German Holstein population, including those on Bos taurus autosomes 2, 18 and 27 affecting somatic cell score in milk (KÜHN et al., 2003;XU et al., 2006;KÜHN et al., 2008).Although extensive information was gained from these experiments, no candidate genes could be identified due to the relatively large confidence intervals typical of QTL mapping experiments.In 2005, a successor project (FUGATO M.A.S.-Net) was initiated with the goal of identifying the positional and functional candidate genes associated with SCS in the regions previously identified.In the context of this second project, a comprehensive software package was developed to process the genotypic data resulting from these efforts using VC methods.The aim of this paper is to present and describe the software system TIGER, which was designed for conducting LA, LD and combined LA/LD analyses for fine mapping experiments.The package is composed of both new and modified pre-existing Fortran programmes within a UNIX shell script.The system has been used for fine mapping on 4 distinct bos Taurus chromosomes.

Program sequence
The system operates in 6 main steps: 1 Allele frequencies and transmitting probabilities for each putative QTL position are determined.2 The IBD sub-matrices for each putative QTL position are calculated and are stored.3 The IBD sub-matrices are tested for positive definiteness; matrices with negative determinants are bent until all eigenvalues are positive.All IBD submatrices are then inverted and stored.Finally, 3 loops are conducted over all putative QTL positions.The first loop conducts LA, the second LD analysis and the third and final loop provides the used with combined LA / LD analysis information, however the same programmes are used for each loop: 4 In the first loop, LA is conducted by setting up and inverting the condensed gametic relationship matrix (BAES et al., 2007;TUCHSCHERER et al., 2004) using transmitting probabilities calculated in step 1, merging the gametic index information with the phenotypes and running ASReml.
5 In the second loop, the IBD sub-matrices calculated in step 2 are incorperated into the gametic relationship matrix.Transmitting probabilities of non-founder gametes (i.e.values calculated in step 1) are set to 0.5 (LD).The gametic index information is then merged with the phenotypes and ASReml is run.In the third and final loop, the IBD sub-matrices calculated in step 2 are again incorporated into the condensed gametic relationship matrices and the transmitting probabilities of non-founder gametes are calculated as in step 1.As in steps 4 and 5, the gametic index information is then merged with the phenotypes and ASReml is run.A schematic flowchart of the package is displayed in Figure 3.The program sequence also describes the acronym 'TIGER', which stands for Transmitting probabilities, IBD sub-matrices, G * matrix and Estimation of variance components using REML.The following sections offer a more detailed description of each step.

Transmitting probabilities and allele frequencies
Although QTL genotypes are unobservable, markers linked to the putative QTL can be genotyped and used to infer the unknown QTL IBD status.The problem is that marker information is often incomplete.Unknown linkage phases, non-informative markers and missing marker genotypes complicate the calculation of transmitting probabilities, which refers to the probability that the sire passed his first (second) allele at a given locus to his offspring.These probabilities are required for calculation of G , which is in turn required for solving the mixed model equations in VC analysis for LA, LD and LD/LA studies.In typical fine mapping experiments, markers are dense.Furthermore, due to the size of the half sib families, reconstruction of haplotypes is fairly precise for both parents and progeny.Computation of probabilities for multi-marker haplotypes is implemented as described in REINSCH (2002); the program implemented was developed especially for use in designs in which 2 generations are genotyped (i.e.half sib families, backcross, etc.).Input files include genotype information, a short pedigree (only sires and grandsires) and a file containing marker and putative QTL positions.Once the input files are read in, genotypes of offspring are checked for compatibility with those of the parents using Mendelian laws of inheritance.The pedigree is tested for parentoffspring conflicts and the origin of progeny alleles is determined for each marker locus.The recombination frequencies for all intervals are calculated using Kosambi's map function.The most probable paternal and maternal haplotypes for parents are then reconstructed.Probabilities for linkage phase 1 are computed from genotypes and pedigree information.After all possible haplotypes of the sires are recorded, the most probable haplotype (linkage phase) produced by the candidates parents is calculated assuming haplotypes of the parents are known and correct.Apart from their use in the recursive calculation of G, the transmitting probabilities also provide a measure of marker informativity.The program was further modified to estimate population-wide allele frequencies using all possible maternal haplotypes of offspring: 1 frequency( )

∑
where: k = a given marker allele, n = the number of final progeny, δ = an indicator variable equal to 1 if allele k is present in a given haplotype and zero otherwise and h = the probability of haplotype j occurring for animal i.The most probable maternal haplotypes are also saved to file for use in calculation of the IBD sub-matrices.Calculation of the IBD sub-matrices MEUWISSEN and GODDARD (2000) described a method to calculate the IBD submatrix based on deterministic predictions which takes the number of markers flanking the putative QTL position, the extent of LD in the population based on the expectation under finite population size, and the time point of mutation occurrence into account.A Fortran (90/95) program based on the gene dropping procedure described by MEUWISSEN andGODDARD (2000, 2001) was developed and incorporated into the TIGER system.The method we employed involves setting up IBD sub-matrices for the maternal haplotypes of the sires and both maternal and paternal haplotypes of the grandsires in the real data set being examined.Parameter options allow the user to define the effective population size, the number of generations of gene-dropping, the number of repetitions and the mapping function to be used.The number of markers left and right of the putative QTL can also be user-defined.GRAPES et al. (2006) found that fitting a 4-6 marker haplotype as a sliding window across the region resulted in the greatest accuracy compared to that of the other haplotype sizes tested.
However for practical applications it may be advantageous to define the size of the 'window' surrounding the putative QTL in cM instead of the number of markers.This allows for more flexibility when the marker distribution over the chromosome is not constant.Haplotypes were sorted into groups according to the number of identical markers to the left and to the right of the putative QTL.
The method and basic principles of gene dropping are relatively simple.Since each parent possesses two alleles at a given locus and since only one of these alleles will be passed on to an offspring, each allele has a one in two chance of being transmitted to each descendant, this probability being independent of the heritage of the other parent.
In other words, this is quite simply a simulation of the Mendelian transmission of a gene at a given locus.The method can be extended to the case of several loci, thus making it possible to measure the change in linkage disequilibrium.Application of this method, however, rests on certain assumptions.Selection at the simulated locus is not taken into account, since we are seeking only to replicate the effect of change due to the Mendelian transmission of genes.In reality, besides random variations, allele frequencies are subject to evolutionary forces such as migration, mutation, and natural selection, which produce specific and non-random variation in allele frequencies.The effect of selection can be considered negligible; however, it is important to realize that it is not accounted for here.One IBD sub-matrix is calculated for every putative QTL locus given by the user.These matrices are stored individually and are dense.The average inbreeding coefficient within simulated generations and the Monte Carlo errors of each haplotype constellation are saved to file for control purposes.

Positive definiteness of IBD coefficient matrices
The numeric instability of the IBD sub-matrices was mentioned but not discussed in depth by MEUWISSEN and GODDARD (2000).The IBD coefficient sub-matrices in LD and LA/LD analyses are very dense because all founder alleles (paternal and maternal alleles of grandsires and maternal alleles of sires) are assigned covariances.Furthermore, these matrices are not necessarily positive definite, which impedes the inversion step required for their use as a part of the gametic relationship matrix.However, IBD coefficient matrices with determinants less than one can be 'bent' until the lowest eigenvalue exceeds a preset limit (i.e.zero to achieve positive definiteness) (HAYES and HILL, 1981;ESSL 1991).We adopted the BENDPDF program of ESSL (1991) and added the matrix inversion subroutine DKMXHF of MEYER (2008, personal communication) to ensure positive definiteness of the IBD sub-matrices, proper calculation of matrix determinants and correct inversion.The numbers of negative eigenvalues in each sub-matrix as well as the matrix determinants are saved to file for control purposes.In our experience, approximately 20 % of all IBD sub-matrices within must be bent; the number of negative eigenvalues in our submatrices normally ranged from from 0 to 4.5 % per matrix.

The G matrix
Because allelic (gametic) effects are random, their covariance must be included for estimation in a variance component model.This information is included in the gametic relationship matrix G. TUCHSCHERER et al. (2004)  ) can be given and all transmission probabilities below 0.03 are 'condensed' to 0 and all those above 0.97 to 1. Haplotypes corresponding to such transmission probabilities are considered 'non-unique'.The resulting condensed gametic relationship matrix, G * , can have between n and 2n rows and columns, depending on how many unique gametes occur, whereby n is the number of genotyped animals in the pedigree.This method prevents numerical difficulties due to quasilinear dependencies which may occur due to the inclusion of almost identical rows and columns in the gametic relationship matrix.The condensed covariance matrix therefore contains one row and one column for every unique gamete (haplotype) of genotyped animals and is sorted according to gametic generation (i.e.founder gametes followed by gametes with known ancestors).If only grandsires and sires are genotyped, all maternal gametes of sires are considered founders and therefore unrelated; the pairwise IBD probabilities of these alleles are zero.COBRA (BAES and REINSCH, 2007) was designed to calculate the condensed matrix G * based on the algorithm of TUCHSCHERER et al. (2004).The COBRA program was slightly modified to include IBD sub-matrices described above for LD and combined LA/LD analyses (see Figure 2 for the arrangement of haplotype effects in G * ).An ASReml parameter file corresponding to the data on the given marker interval (i.e.number of QTL effects in the G * matrix, etc.) is created for each position.

Likelihood calculation
Variance component analyses for continuous traits are normally based on the mixed model, and maximum likelihood or related methods of inference are employed.In particular, the so-called residual or restricted maximum likelihood (REML) is widely used for analyses of continuous traits.The popularity of mixed model analyses by REML has been furthered by the availability of appropriate software.Although implemented in general statistical packages, specific programs for VC estimation in an animal breeding context have been developed (GILMOUR et al., 2006;KOVACS and GROENEVELD, 2003;MADSEN and JENSEN, 2006).These programs use a mixture of Fisher's scoring and Newton-Raphson to maximize the restricted likelihood, called the average information restricted maximum-likelihood (AI-REML) algorithm.Due to the somewhat delicate procedure required for setting up and inverting G * , a program allowing user-defined covariance structures and their inverses was required.This prerequisite was filled by ASReml (GILMOUR et al., 2006).The following mixed model equations were applied: Where y (m×1) = a vector of m DYDs for n animals, µ = a fixed effect common to all observations, u (n×1) = a vector of random polygenic effects and v (gam×1) = a vector of random gametic effects of a marked QTL that is linked to a single polymorphic marker locus with gam equal to the number of unique gametes.Subscripts in parenthesis of the vectors and matrices denote their dimensions.X (m×1) , Z (m×n) and W (m×gam) = known incidence matrices, 1 u (n×n) − A = the inverse of the relationship matrix and * 1 v (gam×gam) − G = the inverse of the condensed gametic QTL relationship matrix.D -1 = the inverse of a diagonal matrix containing weights for each observation.Expectations of u, v and e and covariances between them are assumed to be zero.The model depicted above contains only the fixed effect µ, however LD can easily be included by regression on one or more markers instead of through the IBD sub-matrix.The number of regression coefficients is then the number of different marker alleles of the given marker(s).

Conclusion
A comprehensive software package was developed to process and analyse the genotypic data within the context of the FUGATO M.A.S.-Net project using VC methods.This paper presented the software system TIGER, which was designed for conducting linkage, linkage disequilibrium and combined linkage/linkage disequilibrium analyses for fine mapping experiments.The TIGER system employs both new and modified pre-existing Fortran programmes within a UNIX shell script and is an 'all in one' solution for analysing genotypic data.

Fig. 2 :
Fig. 2: Schematic representation of the gametic relationship matrix.Haplotypes are sorted according to generations with founder haplotypes to the left (blocks A and B) and haplotypes with known ancestors to the right (block C).The blocks marked with horizontal lines (A and B, IBD submatrix) are empty in a linkage analysis (i.e.elements = 0) and are dense for linkage disequilibrium and combined linkage/linkage disequilibrium experiments (i.e.elements ≠ 0).Block C is always dense.(Vereinfachte Darstellung der gametischen Verwand-schaftsmatrix.Haplotypen werden nach Generationen sortiert, wobei Gründerhaplotypen den Blöcken A und B und Haplotypen mit bekannten Vorfahren dem Block C zugeordnet sind.)

Fig. 3 :
Fig. 3: Schematic representation of the TIGER system (Vereinfachte Darstellung des TIGER-Systems) showed how to eliminate possible linear dependencies in this matrix by considering haplotypes with extremely high or extremely low transmitting probabilities identical to those of the respective parent.If transmission probabilities very close to one or zero occur, a predefined threshold (e.g.