Review of the assessment of animal welfare with special emphasis on the “Welfare Quality® animal welfare assessment protocol for growing pigs”

This paper discusses the emerging need for an objective yet feasible, reliable and valid method for assessing animal welfare on farms. Animal welfare has become especially important since the industrialisation of animal housing after the Second World War and as public awareness and concern have increased. Simultaneously, the public has become alienated from agriculture as the population has moved increasingly from rural areas to towns. This has led to a very emotional discussion concerning the welfare of farmed animals, and thus a need has arisen not only for a clear definition but also for a way of measuring welfare objectively. Animal welfare is probably best defined as the sum of three conceptions: health, natural behaviour and positive affective state. In the last few years, different methods for an objective assessment have been developed; however, all of them still face great challenges in their practical implementation and acceptance. The most promising method is probably the Welfare Quality (WQ) approach, especially as it concentrates on animal-based parameters. The development of the WQ protocols emphasised not only the different conceptions of animal welfare but also especially the feasibility, reliability and validity of the parameters to be included. One of the main challenges of these protocols remains, however, the final aggregation of the results to a welfare score. Furthermore, a thorough cost–benefit analysis has not been carried out so far. Even more importantly, only a few studies have addressed the general reliability and validity of the complete protocols, and those studies that have addressed these issues have also revealed challenges concerning the interobserver and test–retest reliability of some of the included parameters. As an example, this is discussed in detail for the “Welfare Quality® animal welfare assessment protocol for growing pigs”.
In conclusion, the WQ approach can be seen as promising, but it has also revealed that there are still a considerable number of challenges that need to be addressed in further studies on the WQ protocols in order to achieve constant improvement. These challenges should be borne in mind in the application of these protocols, which should not be simply referred to as a gold standard.


Introduction
Animal welfare has become an important topic of public and political debate in the last few decades (Hobbs et al., 2002; Fraser, 2009). However, there is a lack of agreement concerning the question of what animal welfare should comprise (Broom, 1988). Therefore, the ongoing discussion is carried out in a very emotional and subjective way. This has led to the necessity of an objective and generally accepted method for assessing the welfare situation of animals (Webster, 2005). One of the best-known methods, which has gained great importance in the field of animal welfare science in the last few years, is the Welfare Quality® (WQ) approach, which promises to be objective, feasible, valid and reliable (Veissier et al., 2013). The following review discusses in Sect. 2 the definition of the term "animal welfare", how it has become of increasing importance in the last few decades and what the public's current concerns are. Throughout this discussion, the need for a method to objectively assess animal welfare becomes obvious. Suitable parameters and existing assessment systems are introduced in Sect. 3, and their particular advantages and disadvantages are described. Section 4 describes in detail the development of the "Welfare Quality® animal welfare assessment protocols", their construction, how they function and also the current challenges this system still faces. One of the most important challenges is that only a few studies have been carried out so far on the feasibility, reliability and validity of the complete protocols in their on-farm use. As an example, an overview of the few studies addressing the "Welfare Quality® animal welfare assessment protocol for growing pigs" and the existing challenges revealed in these studies is provided in Sect. 5.

Definition
Although most people have an opinion on animal welfare and how animals should be treated (Keeling et al., 2011), there is a lack of consensus on a clear definition of the term itself (Moberg, 2000). Furthermore, in the literature, the term well-being is often used; however, according to Fraser (1998), well-being and welfare can probably be used as synonyms. One has to bear in mind that the term animal welfare is not scientifically based but rather a term that arose from a debate in society (Keeling, 2005), as described in detail later on. Animal welfare can best be described by contemplating three different conceptions: (1) basic health and biologic functioning, which includes the physical well-being of the animal; (2) natural living, which concerns the possibility to express normal behaviour; and (3) the affective state, especially concerning positive emotions (Fraser, 2008). A veterinarian, on the one hand, may well focus on the health of the animals and thus claim that indoor housing systems provide better welfare as the animals can be kept free from parasitic infections more easily. An ethologist, on the other hand, may think of good welfare more in terms of the expression of normal behaviour, thus putting more emphasis on free range systems and considering parasitic infections as minor welfare problems. Despite this disagreement, all defenders of different conceptions may agree on certain factual issues such as the mortality rate (Fraser, 2008). The challenge of a comprehensive definition is, therefore, to find a basis of agreement on these three conceptions, which most scientific definitions try to take into account. For example, the Brambell report (1965) emphasised the psychological aspect of animal welfare, defined as "a wide term that embraces both the physical and mental well-being of the animal. Any attempt to evaluate welfare, therefore, must take into account the scientific evidence available concerning the feelings of animals". Hughes (1976) also defined animal welfare comprehensively as "a state of complete mental and physical health, where the animal is in harmony with its environment". It is striking that this definition is not far from the definition of human health given by the World Health Organization (World Health Organization, 1946), in which health is described as a "state of complete physical, mental and social well-being and not merely the absence of disease or infirmity".

Importance
In the last five decades, a considerable increase in interest, a new critical attitude and concern towards animal welfare have developed on the side of society. Nevertheless, a lack of consensus remains due to different attitudes, background knowledge and personal interpretations.
On the market, there is increasing demand from the public for products that originate from animal-friendly housing (Roex and Miele, 2005), and it is especially the public, i.e. the customers, most of them without expert background knowledge of animals, who think the affective state to be the most important part of animal welfare in general (Vanhonacker et al., 2008).
Although the truth about animal welfare is probably a compromise based on all three conceptions, i.e. basic health and biologic functioning, natural living and positive affective state, there is still an emotional discussion about what animal welfare really is. This is especially due to the lack of consensus among different groups of society, each defending its own conception of welfare (Lassen et al., 2006). Therefore, an increasing need has arisen to overcome the highly emotional and partly also anthropomorphic discussion by finding a way of objectively describing and assessing animal welfare (Webster, 2005). This becomes even more important when one looks at the rising demand for animal-welfare-friendly products on the market described above. There is, however, still a gap between the customers' rising interest and their willingness to pay more for animal-friendly products (Bennett et al., 2002). Trustworthy labelling would help customers to find the products they really want. This can only be realised by a reliable and valid animal welfare assessment, which somehow comprises all the different aspects and understandings of animal welfare (Blokhuis et al., 2013a).

3 Methods for the assessment of animal welfare

Parameters
All parameters used for such an assessment must provide good accuracy, validity, reliability and feasibility (Velarde and Geers, 2007). Accuracy describes how close the measured value is to its true value. The validity of a parameter refers to the extent to which it actually measures what it is supposed to measure, in this case animal welfare (Velarde and Geers, 2007). For instance, positive and negative social patterns of behaviour are relevant to welfare, as positive social behaviour is known to provide a rewarding function at least for the receiver and furthermore to reinforce and stabilise social relationships (Sato, 1984; Sato et al., 1991). Negative social behaviour, on the other hand, may cause stress and injuries in the recipient animal (Tuchscherer et al., 1998; Menke et al., 1999). It can, however, be critically questioned whether the assessment of this behaviour in a relatively short interval of time with an observer effect in the stable reflects the real behaviour of the animals, and thus whether the methodology as described is adequate and valid.
Arch. Anim. Breed., 58, 237-249, 2015 www.arch-anim-breed.net/58/237/2015/
Reliability refers to the repeatability of a measure, i.e. the relative similarity of repeated measurements on the same object. It can be divided into the interobserver, intraobserver and test-retest reliability (de Passille and Rushen, 2005). The interobserver reliability describes how well different observers agree in their findings, assessing the same objects at the same time and under the same conditions. Intraobserver reliability measures the agreement that the same observer achieves when assessing exactly the same objects on repeated occasions.
Interobserver and intraobserver reliability can be influenced by the training of the observers (Wirtz and Caspar, 2002). Intraobserver reliability in terms of animal welfare can only be measured by the assessment of pictures or video sequences, because only pictures and videos ensure the assessment of exactly the same images. On-farm assessments, on the other hand, are always studies assessing the test-retest reliability of the method, because it can never be ensured that truly the same objects are assessed in the practical situation of farms, as different individuals may be sampled. Even if the same individuals are assessed, there are still changes in the individuals such as weight gain or pregnancy status. The test-retest reliability thus refers to the tested method and its capability to produce consistent results despite routine procedures and minor changes in the object that are not of interest in terms of the assessment (Temple et al., 2013). Finally, feasibility means that the monitoring method produces reliable results at an affordable cost, and thus it analyses the cost-benefit ratio (Velarde and Geers, 2007).
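As an illustration of how interobserver reliability can be quantified, the following minimal sketch computes Cohen's kappa for two observers scoring the same animals. Cohen's kappa is a common agreement statistic but is an assumption here, as the text above does not name a specific statistic; the scores and the three-point scale are hypothetical.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa for two observers scoring the same animals."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both observers must score the same non-empty set of animals")
    n = len(scores_a)
    # Observed agreement: share of animals given the same score by both observers.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement, derived from each observer's marginal score frequencies.
    freq_a = Counter(scores_a)
    freq_b = Counter(scores_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both observers used one and the same category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical scores (0 = absent, 1 = moderate, 2 = severe) for ten pigs.
observer_1 = [0, 0, 1, 2, 0, 1, 0, 0, 2, 1]
observer_2 = [0, 0, 1, 1, 0, 1, 0, 1, 2, 1]
kappa = cohens_kappa(observer_1, observer_2)
print(f"kappa = {kappa:.2f}")
```

In this hypothetical example, the observers agree on eight of ten animals, but part of that agreement would be expected by chance alone, which is why kappa is lower than the raw agreement of 0.8.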
Parameters that have been found to be valid, reliable and feasible by the various animal welfare studies in the last few decades and are now available for the assessment of animal welfare can be divided into resource-based, management-based and animal-based parameters (Blokhuis et al., 2010). Resource-based parameters are, for instance, the evaluation of the availability of space or the measuring of the width of slats in the case of slatted flooring. These parameters are relatively easily accessible and are usually of excellent feasibility and reliability (Winckler, 2006). But simply because the resources provide good welfare does not mean that the animal itself experiences welfare (Napolitano et al., 2009). For instance, a lame animal suffering from an infectious disease will not feel well even if it lives in the best possible environment. Management-based parameters are parameters concerning the management routines, e.g. castration or tail-docking procedures as well as vaccination schemes. They, too, are relatively easily accessible, but mainly through interviewing farm managers (Welfare Quality®, 2009). Thus, one has to rely on them telling the truth about the farm's management procedures. Of course, castration and tail-docking procedures can be assessed directly on the animals; however, whether pain relief has been provided can scarcely be proven. According to Rousing et al. (2001), resource- and management-based parameters represent an assessment of the risk of potential welfare problems rather than an actual assessment of the state of welfare. In contrast, animal-based parameters are parameters concerning the animal itself, such as the evaluation of lameness or the results of lung examinations at the slaughterhouse. They reflect the actual effect of the resource- and management-based parameters on the animal. This is especially important concerning the animal's ability to cope with its environment (Smulders and Algers, 2009). These are, however, probably also those parameters that are most time-consuming to evaluate (Bartussek, 1999) and, furthermore, their reliability is often questionable (Veissier et al., 2013). Moreover, animal-based parameters may remain ambiguous when interpreted in isolation (Bracke, 2007); for example, laying hens in caged systems may show better health parameters but do not necessarily show natural behaviour (Keeling, 2005). It becomes clear that, in terms of a valid, reliable and feasible assessment, the focus should be on animal-based parameters, though without relying on them alone, but rather considering a combination of these three different types of parameters (Hewson, 2003).

Methods for the overall assessment of welfare
In the last few decades, different methods for the overall assessment of animal welfare have been proposed and published (Johnsen et al., 2001). In the following, examples of overall welfare assessment systems taking into account all the aspects of welfare (health and biologic functioning, natural behaviour, positive affective state) are presented.
Different checklists exist which are often a combination of resource-, management- and animal-based parameters. The focus is especially directed towards a feasible assessment, and the main aim is either labelling by associations or self-checks by the farmer to detect areas of concern. Examples are the checklist of labelling for the eco-label "Bioland" (Schumacher et al., 2007) or the informative leaflet for farmers published by the German Association for Agriculture (Deutsche Landwirtschaftsgesellschaft) (Pelzer and Kaufmann, 2012).
The concept of critical control points is derived from food hygiene, in which it is used in terms of process control to identify weak points (Mortimore and Wallace, 2013). It has been adapted to animal welfare science in various approaches during the last few years. The most famous approach of critical control points is probably the control system for slaughterhouses developed by Grandin (2006). The principle is based on checklists consisting of questions on management, environment or the animal itself. These can be answered either simply with "yes" or "no" (Grandin, 2010) or else with a traffic-light system, with green meaning that everything is in order in terms of this control point, yellow indicating minor or initial problems that should be observed further and red referring to unacceptable conditions that need to be eliminated immediately (von Borell et al., 2001). All control points have to be fulfilled in terms of quality assurance schemes; there is no distinction made concerning their importance. A good critical control point should be specific and measure many aspects (Grandin, 2006).
The animal needs index (ANI) was invented by Bartussek (1985) and was further advanced and refined in the following years. Evaluation schemes for cattle, pigs and poultry (Bartussek, 1995a, b) are now widely available. Basically, five areas influencing the animals were identified: freedom of movement, social contacts, flooring, light and air conditions, and intensity of care. For all of these areas, parameters to be measured in the stable were identified, these being mostly resource- and management-based parameters. Minimum requirements in all the influencing areas have to be fulfilled. Thereafter, the interaction between incriminating and exonerating factors is considered by allowing higher values in one area to compensate for lower values in another. This is done in order to take into account the animal's interaction with its environment and its ability to cope with its condition of life (Haiger et al., 1988). The different groups of parameters are aggregated to a total sum, thereby taking into account multifactorial, interdisciplinary and all-embracing approaches. Despite its good feasibility and broad application, especially in Austria, the ANI has been criticised since evaluation remains subjective for some parameters due to it being based on the practical experience of the assessor (von Borell, 2001) and furthermore not considering animal-based parameters to a sufficient degree (Blokhuis et al., 2013b).
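The two-step logic described for the ANI (minimum requirements per area first, then a compensating sum across areas) can be sketched as follows. This is only an illustration under stated assumptions: the area keys, the point values and the minimum thresholds are hypothetical and are not taken from the published ANI evaluation schemes.

```python
# The five influencing areas named in the text; the identifiers are hypothetical.
AREAS = ("movement", "social_contact", "flooring", "light_and_air", "care")

# Hypothetical minimum points that every area has to reach.
MINIMUMS = {area: 0.5 for area in AREAS}


def ani_total(points, minimum_per_area):
    """Return the summed score, or None if any area misses its minimum."""
    for area in AREAS:
        if points[area] < minimum_per_area[area]:
            return None  # minimum requirements are not open to compensation
    # Above the minimum, a high value in one area offsets a low value in another.
    return sum(points[area] for area in AREAS)


# Farm A: weak on movement, strong on social contact; the sum compensates.
farm_a = {"movement": 1.0, "social_contact": 4.0, "flooring": 2.0,
          "light_and_air": 3.0, "care": 2.5}
# Farm B: higher raw sum, but movement falls below the minimum.
farm_b = {"movement": 0.0, "social_contact": 5.0, "flooring": 3.0,
          "light_and_air": 3.0, "care": 3.0}

print(ani_total(farm_a, MINIMUMS))
print(ani_total(farm_b, MINIMUMS))
```

The sketch makes one design consequence visible: compensation only operates above the minimum requirements, so a farm with a higher raw sum can still fail the assessment.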
Decision support systems were first introduced into animal welfare science by Bracke (2001). This overall welfare assessment system was first tested on husbandry systems for pigs (Bracke et al., 2002) and adapted to poultry (de Mol et al., 2006), cattle (Ursinus et al., 2009) and recently to salmon (Stien et al., 2013). It was invented as a tool to actually make decisions concerning animal welfare (Bracke et al., 2002). In terms of decision support systems, certain needs of the animals are identified, i.e. needs that have to be fulfilled in order to provide good welfare. For instance, for sows, needs were identified for food and water, resting, social contact, kinesis, exploration, body care, territorialism, thermal comfort, good air condition, health, safety and nest-building behaviour, and maternal care. In a further step, parameters were found to which good and bad levels were assigned according to expert opinion. Weights were attributed to these parameters and, finally, with the help of a computer model, values for the tested housing condition in terms of welfare can now be calculated (Bracke et al., 2002). The main disadvantage of this welfare assessment tool is that parameters are mostly resource-based (Blokhuis et al., 2013c). Thus, they describe rather the potential of a housing system to provide for a good welfare state. For instance, it could be reliably proven for dairy cows that loose housing systems provide more welfare than tie stalls (Bracke et al., 2002). But just because resources provide good welfare does not mean that the animal itself indeed experiences welfare, as there are many other influencing factors. Also, it is probably not suitable for assessing welfare differences between two farms of the same housing system. This can only be analysed by the use of animal-based parameters (Blokhuis et al., 2013c).
The Bristol Welfare Assurance Programme started when the initiators of "Freedom Food", a label for organic farm products in Great Britain, realised around the year 2000 that their method of certification by checklists, which only contained resource-based parameters, was no longer up to date concerning the assessment of welfare. They therefore started a project in cooperation with the University of Bristol with the aim of developing a certification tool that emphasised animal-based parameters and thus truly reflected the welfare status of a farm. Firstly, animal-based parameters with the potential capability to evaluate "Freedom Food" farms in terms of their welfare status were defined by expert opinion (Main et al., 2003; Whay et al., 2007). Existing animal welfare assessment techniques were adapted to farming and certification in the ongoing research (Leeb et al., 2004). Special attention was paid to the relevance of the chosen parameters and to the legislative background concerning the situation of farm animals in Great Britain (Whay et al., 2003). Nowadays, evaluation sheets for the on-farm welfare assessment of laying hens, cattle and pigs are available. All concepts of animal welfare are taken into consideration by behavioural observations, scans of a sample of animals especially concerning the health status, and qualitative evaluations, the latter being based on the overall impression of the observer. Results are expressed mainly as percentages of affected animals, and these percentages are compared to threshold values. If threshold values are exceeded, guidelines of intervention are accessible, which have, however, not yet been sufficiently validated (Leeb et al., 2014). The observer needs background knowledge of the husbandry of farm animals and their behaviour, as some parameters include subjective assessment. Intensive training of the observers and strict testing of their repeatability is urgently needed in order to provide comparable results (Leeb et al., 2014). Given all these conditions and requirements, it becomes quite obvious that reliable results in terms of the Bristol Welfare Assurance Programme still remain a great challenge, especially due to the problem of subjective evaluations. Application up to now has been limited to Great Britain, as it is specifically designed for British animal protection laws (Leeb et al., 2014).
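The reporting step described above (percentages of affected animals compared to threshold values) can be sketched as follows. The parameter names, counts and threshold values are hypothetical and are not taken from the Bristol Welfare Assurance Programme documentation.

```python
def flag_exceeded(counts_affected, animals_scored, thresholds_pct):
    """Return {parameter: prevalence in %} for parameters above their threshold."""
    if animals_scored <= 0:
        raise ValueError("at least one animal must be scored")
    flagged = {}
    for name, affected in counts_affected.items():
        prevalence = 100.0 * affected / animals_scored
        # Only parameters exceeding the threshold would trigger an intervention guideline.
        if prevalence > thresholds_pct[name]:
            flagged[name] = prevalence
    return flagged


# Hypothetical on-farm sample of 50 animals.
counts = {"lameness": 6, "tail_lesions": 2}
thresholds = {"lameness": 10.0, "tail_lesions": 5.0}
flagged = flag_exceeded(counts, animals_scored=50, thresholds_pct=thresholds)
print(flagged)
```

In this hypothetical sample, lameness (12 % of animals) exceeds its threshold while tail lesions (4 %) do not, so only lameness would be flagged.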
WQ was an interdisciplinary and international research project dating from the years 2004 to 2009 with the aim of developing reliable animal welfare assessment tools on farms and at slaughterhouses, starting out with tools for poultry, cattle and pigs. In the basic definition of welfare, opinions of the public, farmers and scientists were taken into account (Blokhuis, 2008). In order to consider all aspects of welfare, animal welfare was defined in terms of four principles: good feeding, good health, good housing and appropriate behaviour. Twelve criteria were then assigned to these principles, and for each of the criteria, parameters to be measured in the stable were identified. Animal-based parameters were chosen whenever possible in terms of validity, reliability and feasibility. Behavioural observations and examinations of single animals were performed using a specified sample size. The outcomes were first presented as percentages of affected animals, which were then further aggregated to grades between 0 and 100 (with 0 describing the worst and 100 the best possible condition of that parameter) by different mathematical procedures at the level of criteria and principles. Training of the observers was mandatory, but no special background knowledge was needed (Welfare Quality®, 2009). The main problem of these welfare assessment systems, however, is that they take a relatively long time for the on-farm sampling, i.e. 4 to 8 h depending on the animal species, the size of the farm and the arrangements of the buildings. Furthermore, reliability is not assured for all the parameters in their on-farm use and for the protocols themselves. However, within the WQ network, the protocols are still undergoing a continuous process of revision and improvement as research and knowledge continue to grow (Winckler and Knierim, 2014).

Starting point of WQ
The above-mentioned pressing need to ensure the welfare of farm animals and, what is even more important, to measure it objectively using a scientific basis, led to the WQ project. This project answered the growing societal need of consumers and citizens for high welfare standards in terms of food quality and an increased transparency of the production chain. The first approaches were formulated in response to the European Commission's call for proposals aiming at "improving animal production methods that take into account consumer demands for high standards of animal welfare, health and food quality", according to Blokhuis et al. (2013a). The research proposal containing these approaches was successful, and thus the project "Integration of animal welfare in the food quality chain: from public concern to improved welfare and transparent quality" started in May 2004 and lasted until December 2009. During that time, it developed into the largest international network of scientists and stakeholders ever to have worked together in an interdisciplinary manner. About 200 scientists from 43 different institutes in 13 European and 4 Latin American countries were involved. The scientists, among whom were mathematicians as well as animal and social scientists, were integrated with farmers, processors, slaughterhouse managers, retailers, animal protection organisations (non-government organisations, NGOs) and members of the public. The project was thus a good example of science and society working together to improve the welfare of farm animals. This great number of people involved required a strong management structure, which was implemented by a supervising steering committee. The main aim became the development and establishment of a generally accepted, valid, reliable and feasible system for assessing animal welfare on-farm and at slaughterhouses (Blokhuis et al., 2003).

Definition of welfare in terms of WQ
As WQ aimed at a balanced welfare assessment to satisfy public, industry, political and scientific concerns, a holistic definition of the term was also needed. Thus, the views of customers, industrialists, farmers, legislators and scientists were considered and a dialogue between science and society was initiated. From studies based on interviewing different groups of the society, it turned out that each group considered different aspects of animal welfare to be more or less important. But, in general, it was possible to obtain good agreement concerning the basic definition of animal welfare (Bock and van Leeuwen, 2005; Bock and Van Huik, 2007; Buller and Roe, 2008). The process of finding an all-embracing definition began with animal scientists proposing a definition of animal welfare based on three conceptions (good health and biologic functioning, natural living and positive affective state; Fraser, 2008) as well as on the five freedoms of the Farm Animal Welfare Council (FAWC, 1979). As a result, four main principles were identified: (1) good feeding, (2) good housing, (3) good health and (4) appropriate behaviour. At this stage, 10 associated criteria were defined that formed the underlying contents of these principles. Before these elements were finally included in an animal welfare assessment scheme, this first draft of a definition was presented to citizen focus groups. Focus groups were different groups of the public (e.g. urban mothers, seniors, young singles, vegetarians, hunters, gourmets) who were given background knowledge and then encouraged to critically discuss the proposed definition of animal welfare (Blokhuis et al., 2013c). Furthermore, surveys of randomly chosen members of the population in different countries, based on computer-assisted telephone interviews, were carried out in order to find out their opinions on animal welfare (Kjaernes and Lavik, 2008). It was concluded that, in general, scientists thought it basically important to avoid negative impacts on welfare such as diseases, while society in general put more emphasis on the positive aspects such as ensuring positive emotional states, thus allowing the animals to feel comfortable and content. The influence of public opinion resulted in the inclusion of two further criteria. The criteria (1) absence of prolonged hunger and (2) absence of prolonged thirst were assigned to the principle of good feeding. For the principle of good housing, the criteria (3) comfort around resting, (4) thermal comfort and (5) ease of movement were identified. The criteria (6) absence of injuries, (7) absence of disease and (8) absence of pain induced by management procedures were allocated to the principle of good health, and the criteria (9) expression of social behaviour, (10) expression of other behaviour, (11) good human-animal relationship and (12) positive emotional state to the principle of appropriate behaviour. An overview of the principles, criteria and parameters included subsequently is presented in Table 1.

Development of welfare measures and protocols
After the final definition, measures to assess these criteria had to be found. To guarantee the feasibility and the general acceptance, it was determined that the whole assessment must be finished within one day by only one observer. As already mentioned, one of the main aims was to place the focus on animal-based parameters to truly assess the welfare state of the animals and to make the protocol internationally applicable in all different kinds of housing systems. Moreover, the overall system, and thus every single measure, had to provide good validity, reliability and feasibility. Therefore, some resource- and management-based parameters were also included if no suitable animal-based parameter could be found.
The selection of appropriate measures started out with a thorough review of literature in search of suitable welfare parameters (Veissier et al., 2013). Measures were considered further if they had a good validity in terms of the criterion they were supposed to assess, which was defined and decided by experts (Scott et al., 2001). Furthermore, the measures had to provide applicability on-farm in different housing systems or at slaughterhouses. During this review of the literature, it was discovered that few welfare parameters had actually been tested accurately for reliability (Engel et al., 2003; Knierim and Winckler, 2009). The finally extracted measures to be further considered were divided into three groups regarding their already defined validity and reliability: the first group included parameters that had not been previously validated further in terms of criterion validity, i.e. the relationship of a tested measure to an already approved measure, or concept validity, i.e. the experimental proof that the measure was related to what it was supposed to measure; in our case, this considers whether it is related to the welfare state. Parameters were sorted into the second group if they had already undergone validity testing. These parameters were then tested for their reliability, which was done by observing video clips and pictures (Courboulay et al., 2009; Forkman and Keeling, 2009; Leruste et al., 2009; Schulze Westerath et al., 2009; Plesch et al., 2010). Less effort was devoted to the field of clinical examinations, which were simply expected to be of good reliability. The third group contained parameters for which validated procedures were available and had previously been considered to be reliable. Studies on the choice and extraction of parameters can be found in the WQ reports (Forkman and Keeling, 2009; Forkman and Keeling, 2009a, b).
Measures that were now found to be sufficiently valid and reliable were evaluated in terms of the information they provided in relation to all other potential parameters.Therefore, analyses of correlation and association between different animal-based parameters or else between animal-based and resource-or management-based parameters were carried out.Furthermore, before the final exclusion of a parameter, a calibration of the simplified version of the monitoring system was made against the full version.In order to simplify the use and thus provide better reliability, all parameters with more scoring possibilities than three were minimised to the following three categories: "absence", "low affection" and "high affection".This was done, for instance, in terms of the categorisation system for the parameter bursitis in pigs, which was proven to be reliable by Lyons et al. (1995) using a five-point scale.These scientifically approved versions of protocol assessment were again given to the focus groups mentioned earlier as well as to farmers, and their opinions on certain parameters were considered before a decision was made on a final version (Veissier et al., 2013).
Using this procedure, protocols were developed that promised to provide a valid, reliable and feasible animal-based welfare assessment which was, furthermore, generally accepted, as all groups of society were involved in the decision-making process.

Integration of data to generate an overall assessment of welfare
A scoring model was designed to translate the judgements at parameter level into condensed and easily understandable information about the overall welfare state (Botreau et al., 2008, 2009). This model needed to be sufficiently sensitive to identify and quantify variations and differences among farms. The multidimensional nature of animal welfare posed several challenges for the scientists.
(1) Ethical dilemmas exist, such as which should be considered more important: the good health of an animal or its appropriate behaviour. Questions arise such as whether it is better for animals to show frightened and fearful behaviour than to be sick (Fraser, 1995), or whether one should predominantly consider the small number of animals in an extremely poor condition or the majority of the animals, i.e. the average. Consequently, one has to decide whether a farm housing a considerable number of animals affected by moderate lameness is rated better than a farm housing only a small number of animals exhibiting severe lameness. Moreover, there is the general question of whether one aspect of welfare can be compensated for by another.
(2) Furthermore, the final protocols incorporate numerous measurements that differ greatly from each other, and the data are expressed on different scale levels. Hence, completely different things need to be aggregated.
(3) Even if this challenge of aggregation has been successfully mastered, there remains the question of what is to be considered good welfare (Veissier et al., 2011).
(1) For the ethical decisions, a flexible model was designed to represent as many ethical positions as possible. Different data sets were shown to a group of scientists, who were asked to assign a value between 0 and 100 to these theoretical data, with 0 representing the worst possible level; 20 an acceptable level, indicating that legislative requirements are met; 50 a welfare state that is neither good nor bad; and 100 the best possible level. These assigned values were later discussed in juries made up of citizens and farmers. It was concluded that more emphasis was put on animals in a poor condition than on those in a good condition, but also that the whole group was more important than individual animals (Miele et al., 2011).
(2) The calculation of scores followed a bottom-up approach: based on the measurement results, scores were first calculated for the 12 criteria and then for the four principles. If the results were expressed at farm level and there were a limited number of categories, a decision tree (Magerman, 1995) was applied to calculate the scores. If just one measure with several possible degrees belonged to a criterion, the results were expressed as the percentage of affected animals for each degree. To calculate the criterion score, a weight was assigned to each degree, with the weight increasing with severity. If the measures assigned to a criterion resulted in data expressed on different scales, these were compared to alarm thresholds and, finally, the number of alarms was evaluated. The scores that the scientists had assigned for the ethical decisions were used to define functions that transform the data into scores. The scoring did not follow a linear reasoning. For instance, a farm with 10 % lame cows was judged to be of a far lower welfare standard than a farm with 0 % lame cows, but a farm with 70 % lame cows was not scored much better than a farm with 80 % lame cows. Therefore, non-linear functions, i.e. cubic I-spline functions (Curry and Schoenberg, 1966), were used for the assignment of weights. However, if many different items needed to be aggregated, the experts were unable to assign weights. This is especially the case for the calculation of principle scores from the criterion scores. In this case, blocks of measures were first considered and Choquet integrals (Grabisch and Roubens, 2000) were used for further aggregation. Although some measures could theoretically be related to different criteria, they were only considered once in order to avoid double counting.
(3) After the scores at criterion and principle level had been calculated through these different mathematical procedures, there still remained the question of what is supposed to be good welfare. One possibility was to define as normal the average score that common farms usually achieve (Whay et al., 2003), but this approach was strongly criticised, as the welfare level could be generally bad (Bekoff, 2008). Instead, taking into account the opinions of animal scientists as well as of social scientists representing the opinion of the public, it was decided to define theoretical thresholds for a farm to be considered acceptable, enhanced or excellent. As for the ethical decisions, on a scale between 0 and 100, with 0 representing the worst and 100 the best theoretically possible welfare state, 20 was set as the limit of acceptability, representing legislative requirements being met, 55 as the aspiration value for categorisation as enhanced, and 80 for categorisation as excellent. Together with farms that fall below the acceptability limit, this yields four categories, which were chosen to meet the requirements of all potential applications of the protocols (e.g. compulsory or voluntary labelling, self-assessment and research). Initially, it was chosen to rely on unanimity; thus, these aspiration values had to be met in all four principles. However, this turned out to be too strict and unrealistic in practical use at the time, as not a single farm was scored as excellent in a study on the practical implementation of the protocol. Therefore, an indifference threshold of 5 was defined, meaning that, for instance, a score of 50 was not regarded as significantly different from 55.
Furthermore, it was determined that a farm is scored as excellent if it reaches values greater than 55 on all principles and greater than 80 on two of them. It is scored as enhanced if all principles exceed a value of 20 and two of them exceed a value of 55. The criterion of acceptability is met if a value greater than 10 is achieved on all principles and a value greater than 20 on three of them. Thus, some compensation between principles was ultimately allowed. This adjustment of the model was made to strike a balance between theoretical expectations and what can realistically be achieved at the moment, i.e. between theory and pragmatism (Botreau et al., 2007, 2009; Veissier et al., 2011).
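The aggregation steps in (2) and the classification rules in (3) can be illustrated with a minimal Python sketch. It is not the published WQ implementation: the function names, the severity weights, the capacity values of the Choquet integral and the criterion labels are hypothetical, the I-spline transformation is omitted, and the indifference threshold of 5 is not modelled. Only the cut-off values 10, 20, 55 and 80 are taken from the text above.

```python
from typing import FrozenSet, Mapping, Sequence


def weighted_severity_index(
    percent_by_degree: Mapping[str, float],
    weight_by_degree: Mapping[str, float],
) -> float:
    """Severity-weighted percentage of affected animals (0 means none affected)."""
    return sum(weight_by_degree[d] * pct for d, pct in percent_by_degree.items())


def choquet_integral(
    scores: Mapping[str, float],
    capacity: Mapping[FrozenSet[str], float],
) -> float:
    """Discrete Choquet integral of non-negative scores with respect to a capacity.

    `capacity` must hold a value for every subset met in the loop below,
    with the full set of items mapped to 1.0.
    """
    remaining = set(scores)
    previous = 0.0
    total = 0.0
    # Walk through the scores from lowest to highest; each increment is
    # weighted by the capacity of the items that reach at least that level.
    for name in sorted(scores, key=lambda k: scores[k]):
        total += (scores[name] - previous) * capacity[frozenset(remaining)]
        previous = scores[name]
        remaining.remove(name)
    return total


def classify_farm(principle_scores: Sequence[float]) -> str:
    """Apply the category rules described in the text to four principle scores."""
    if len(principle_scores) != 4:
        raise ValueError("expected one score for each of the four principles")

    def count_above(threshold: float) -> int:
        return sum(score > threshold for score in principle_scores)

    if count_above(55) == 4 and count_above(80) >= 2:
        return "excellent"
    if count_above(20) == 4 and count_above(55) >= 2:
        return "enhanced"
    if count_above(10) == 4 and count_above(20) >= 3:
        return "acceptable"
    return "not classified"


# Hypothetical weights: severe cases count twice as much as moderate ones.
lameness_index = weighted_severity_index(
    {"moderate": 10.0, "severe": 5.0},
    {"moderate": 0.5, "severe": 1.0},
)

# Hypothetical capacity for two criteria: each alone carries little weight,
# so a low score on one criterion is only partly compensated by the other.
capacity = {
    frozenset({"criterion_a", "criterion_b"}): 1.0,
    frozenset({"criterion_a"}): 0.3,
    frozenset({"criterion_b"}): 0.3,
}
principle_score = choquet_integral(
    {"criterion_a": 40.0, "criterion_b": 80.0}, capacity
)

category = classify_farm([25.0, 60.0, 70.0, 30.0])
```

With these illustrative values, the Choquet integral yields a principle score of about 52, below the arithmetic mean of 60, which shows how this operator limits compensation between criteria. The example farm is classified as enhanced because all four principles exceed 20 and two of them exceed 55.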

Challenges of the protocols
A critical component of the reliability of an assessment system is, of course, the assessors themselves, who need to be credible and competent (Butterworth, 2009). Although the protocols are freely accessible and the procedure for taking the measures is well described, training is needed in order to carry out the assessments correctly (Welfare Quality®, 2009). The training programme needs to be delivered in a standardised way internationally so that reliable results are obtained. Furthermore, trainers should be retrained after a certain time to ensure that they continue to work reliably (Velarde et al., 2010). So far, training has been carried out by members of the WQ group, who travel to the companies requesting it. However, until recently, certification of trainers had not been available, nor had there been a model to ensure that the protocols are still used correctly at some point after training (Butterworth et al., 2013).
The main concern regarding feasibility is that a protocol assessment takes a very long time (between 4 and 8 h, depending on the species, farm size and distance between farm buildings) and, consequently, that implementation is too costly. During the development of the protocols, feasibility was always taken into account when decisions were made about the inclusion or exclusion of certain measurements. Together, the measurements currently included provide the best possible comprehensive animal-based welfare assessment in terms of validity, reliability and feasibility. Therefore, it does not seem possible simply to exclude time-consuming measurements (Veissier et al., 2013). However, ongoing research offers the potential to automate some measures, e.g. lameness scoring (Chapinal et al., 2010). Furthermore, implementation concepts are conceivable in which the whole protocol is not always carried out: after an initial full assessment, follow-up assessments would cover only those parts of the protocol that revealed negative or problematic issues on that particular farm (Veissier et al., 2013). However, up to now, there has been no agreement on how often protocol assessments need to be carried out at all. Furthermore, comprehensive cost-benefit analyses of the application of the WQ protocols are lacking (Manteca and Jones, 2013).
There also remains a problem with the general reliability of the protocols. This might seem surprising, as an enormous effort was made to evaluate the validity, reliability and feasibility of every single measurement before it was included in the protocols. However, many of the included parameters were modified, for example in their scaling, in order to make them more feasible to apply. For instance, the parameter bursitis, a measure of comfort around resting in pigs, was proven to be reliable in the study of Lyons et al. (1995), who used a five-point scale for the assessment. In the "Welfare Quality® animal welfare assessment protocol for growing pigs", this five-point scale was cut down to a three-point scale without further testing. Moreover, most reliability studies carried out during the development of the protocols were based on video clips and pictures rather than on on-farm assessments. Therefore, there is a lack of studies concerning the reliability of the complete WQ protocols in on-farm use. Some studies have indicated that this may be problematic, especially concerning consistency over time (Botreau et al., 2013; Temple et al., 2013).
Although the protocols were developed with the aim of being applicable in all housing systems worldwide, their use turned out to be problematic in extensive outdoor production systems. Behavioural observations are difficult to carry out in a large field, and the animals are often not used to close observation; they need to be grouped to be observed, which requires greater input from the farmer and causes greater disturbance to the animals (Turner and Dwyer, 2007). Furthermore, the outcomes may depend strongly on the weather; for instance, animals are known to change their activity patterns in particularly hot or cold weather conditions (Hahn, 1999; Tucker et al., 2007). Another challenge in assessing extensive systems is that, for some parameters, an interpretation that is obvious in intensive systems becomes ambiguous, such as foraging behaviour in pigs, which could be assigned to either feeding or exploratory behaviour.
Potential applications of the WQ protocols include evaluating whether legislative requirements are met on a farm or whether the demands of voluntary labelling schemes are fulfilled. Moreover, they would be useful for self-assessment or for evaluating the welfare potential of new farming systems or breeds (Botreau et al., 2013). However, the general question is how to use the information obtained to promote and support management decisions and practices that improve the welfare situation on farms. Of course, the results need to be reported to the farmer, and practical advice needs to be provided on what problems exist, how they are caused and how they can be tackled (Manteca and Jones, 2013). Botreau et al. (2013) reported that only a minority of farmers who had received feedback and advice actually implemented the improvement strategies, mainly because of financial, practical and motivational problems. Therefore, improvement strategies must be practicable, robust, safe, affordable, easy to implement and in the long-term interest of the farmers. Integrated approaches that include environmental, management and genetic strategies should be chosen (Boissy et al., 2005; Jones et al., 2005).
As a certification tool, the WQ protocols have great potential, as there is a need for homogeneous and transparent certification (Evans and Miele, 2007). However, several studies have revealed that the European public had expected higher welfare standards than those reflected in the WQ categories (Miele et al., 2011; Evans and Miele, 2012). The aggregation procedure as a whole still seems unsatisfactory, as only a few scientists were involved in assigning the weights to the individual measurements, as revealed by de Vries et al. (2012). They concluded that the role of expert opinion and the type of algorithm operator used for the aggregation of measures should be reconsidered.

Use of the "Welfare Quality® animal welfare assessment protocol for growing pigs"
The following provides an overview of studies concerning the "Welfare Quality® protocol for growing pigs" in the form published in 2009. The first study on the finished protocol for growing pigs was carried out within the WQ project. In this study, the protocol was tested on 71 growing pig farms in different countries with varying management practices and farm sizes. It was concluded that, although some adaptations of the protocol may be necessary, it can feasibly be applied under a variety of conditions (Veissier et al., 2013). Temple et al. (2011a) applied the protocol to 30 conventional farms in Spain to estimate its feasibility and sensitivity. The protocol assessment was found to be easy to perform and took about 6 h. For each animal-based parameter, confidence limits were estimated, helping to identify farms with a poor welfare status. In general, sensitivity was demonstrated; however, the causes of the variability and differences in the results were found to be difficult to interpret. In another study, Temple et al. (2011b) focused on the behaviour principle of the protocol and compared the outcomes of the measurements between intensively and extensively kept pigs. Differences between the housing systems were found in the positive emotional state, which was scored higher in the extensive systems, and in social behaviour, which was increased in intensive systems. This was interpreted as a coping strategy of the animals in intensive systems to enhance positive emotions through increased social behaviour. The interpretation of the behavioural observations of the protocol is thus not always straightforward.
In 2012, Temple et al. published two further studies (Temple et al., 2012a, b) in which the welfare of growing pigs in five different housing systems in two different countries was compared and factors influencing the outcomes of animal-based parameters were identified. These factors were the age of the animals, feeding system, stocking density and type of flooring. All these studies showed that differences between farms could be detected very well by applying the "Welfare Quality® animal welfare assessment protocol for growing pigs". However, interpretation of the outcomes was not always straightforward. Furthermore, the studies revealed some of the challenges of the protocols already mentioned, such as the difficulties in use under extensive conditions. Identifying the factors that causally affect the outcomes is of great importance: they need to be considered in the further use of the protocol, since neglecting them would reduce consistency over time. One of the effects identified was, for instance, the age of the pigs.
A study on test-retest reliability (Temple et al., 2013) dealt with exactly this problem of consistency over time. Although measures were corrected for the previously described effect of age, sufficient test-retest reliability was found for only two parameters. Nevertheless, farms with persistent welfare problems could be safely identified. Test-retest reliability clearly needs to be improved for future implementations of the protocols, and the above-mentioned challenges, such as the question of how often the protocol needs to be carried out, have to be resolved. The problems already revealed for this protocol also highlight the need for further studies on the reliability of the entire protocols in on-farm use.
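One common way to quantify consistency over time is a rank correlation between the farm-level results of two visits. The following Python sketch computes Spearman's rank correlation coefficient for one parameter. It is an illustration only: the choice of statistic is an assumption and is not taken from Temple et al. (2013), and the prevalence values are invented.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upwards, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values starting at i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_correlation(first: Sequence[float], second: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two samples of equal length with at least two farms")
    rx, ry = average_ranks(first), average_ranks(second)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one visit has no variation")
    return cov / (var_x * var_y) ** 0.5


# Invented prevalence (%) of one parameter on five farms at two visits.
visit_1 = [2.0, 5.0, 9.0, 14.0, 20.0]
visit_2 = [3.0, 4.0, 12.0, 10.0, 25.0]
rho = spearman_correlation(visit_1, visit_2)
```

A coefficient close to 1 indicates that farms keep their relative ranking between visits. Any cut-off for "sufficient" reliability would have to be chosen and justified separately.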

Conclusions
The aim of the present paper was to discuss the rising need for an objective and generally accepted way of measuring animal welfare. Furthermore, existing measurement methods were described with special emphasis on the WQ approach, highlighting in particular the existing challenges. Animal welfare can best be defined by considering three different concepts, i.e. health, natural behaviour and positive affective state. It can best be measured by animal-based parameters, as these consider the actual state of the animals, whereas resource- and management-based parameters are rather a risk assessment of the potential of the environment to provide for good welfare. The WQ protocols are based on such a broad definition and, furthermore, rely on animal-based parameters whenever feasible. However, a number of challenges remain concerning the reliability and validity of the protocols in on-farm use and the aggregation of parameters into a final welfare score. These challenges need to be addressed in further studies to allow for continuous improvement of the protocols. Even more importantly, they have to be borne in mind when the protocols are applied for welfare assessment, to avoid misinterpretations of the welfare situation of the animals.

Table 1. Principles, criteria and parameters of the "Welfare Quality® animal welfare assessment protocol for growing pigs".