
Sensory Evaluation Methods

Lesson 2: Univariate Statistics and Experimental Design


Contents

Topic 2.1: Univariate Statistics (Terms; Summary Statistics)
Topic 2.2: Central Tendency and Dispersion
Topic 2.3: The Null Hypothesis and Type I and Type II Errors (The Null Hypothesis (H0); Type I and Type II Errors)
Topic 2.4: Basic Statistical Concepts and Associated Tests (Degrees of Freedom; Confidence Interval; One-tailed or Two-tailed Test?; The Normal Distribution; Central Limit Theorem; The Binomial Test; The Chi-Square Test; Student's t-test; Statistical versus Practical Significance)
Topic 2.5: Correlation and Regression
Topic 2.6: Analysis of Variance (Judges: Are They a Random or Fixed Effect?)
Topic 2.7: Multiple Mean Comparison Tests
Topic 2.8: Nonparametric Statistics (Topic for discussion: How carefully should you check the assumptions behind the statistical tests you use?)
Topic 2.9: Experimental Design (Significance, Power and Precision; Sample Size Determination; Randomization; Types of Designs; Balanced Incomplete Block Design (BIB); Crossover Design; Factorial Designs; Which Design to Choose?)
References
Tables
Copyright The Regents of the University of California 2006 Copyright Dr. Jean-Xavier Guinard 2006


Topic 2.1: Univariate Statistics


Lesson Objectives
In this lesson we will cover basic statistics and the principles of experimental design. For some, this will be a review, and for others, it will represent a significant amount of new material. The goal here is not so much to go into the theory or the mathematics of statistical tests and experimental design, but rather to show their application to the design and analysis of sensory tests. Some theoretical background is required, though, to fully grasp the nature of these statistical protocols and the type of questions they are suited to answer. You will be given that background. More importantly, you will become sufficiently knowledgeable about experimental design and basic statistics to be able to select the appropriate design and statistical procedure to analyze data from a given sensory test, and to interpret the meaning of the results.

The working title for this lesson includes the term 'univariate statistics' (in contrast with multivariate statistics, which we will cover in Lesson 6), because we will focus mostly on situations where one variable is analyzed (or two, in the case of correlation and regression). Most of the assignments for this lesson will have you run statistical tests on actual data. We are providing a set of guidelines on how to run each test with various software programs, or even by hand, if you are so inclined. You can find this univariate statistics tutorial on the topic outline for this lesson.

Objectives:
1. To provide a detailed overview of relevant statistical principles, and of statistical tests, their applications, and interpretation.
2. To describe and explain the basic concepts of experimental design.

To cite Michael O'Mahony (1986) from his statistics textbook, "Because behavioral and biological data are so inherently variable, statistical analysis is a useful tool for pinpointing and clarifying trends which otherwise might be obscured by a welter of numbers."
But we could also cite Benjamin Disraeli here, "There are three kinds of lies: lies, damned lies, and statistics." The message to take away is that statistics can be manipulated to the point of reversing the outcome of a test. So beware... For example, one might treat judges as a so-called fixed effect in an analysis of variance of scaling data and conclude that there is a significant difference among the samples for the scaled attribute, and just as easily (and validly) reach the opposite conclusion (no difference among the samples) by treating judges as a so-called random effect. [We will explain the difference between random and fixed effects later in this lesson.] So consider both the message, and the messenger (and if you are the messenger, behave as a rigorous and honest statistician - hence the need for some working knowledge of statistics).



It is often said that the most important part of a research experiment is the experimental design. This is because a poorly designed experiment will produce data of limited validity. We will examine the basic principles of experimental design and go over the types of designs suited for your experimental needs. Another misconception we must dismiss right off the bat is that the numerical, and therefore complex, nature of statistics and experimental design makes them boring and somewhat scary. Experimental design and statistical analysis are exercises in logic and as such can be rather entertaining. We aim to have everyone enjoy designing sensory tests and running statistical analyses by the end of this course (and doing it right!). Because experimental design requires some knowledge and understanding of basic statistics, we will begin this lesson with basic statistics and end it with experimental design.

Terms
First, let's go over some terms we will use often.
1. Descriptive statistics are used to describe the data (e.g., graphs, tables, averages, ranges, etc.).
2. Inferential statistics are used to infer, from a sample, facts about the population it came from.
It follows that a parameter is a fact regarding a population, whereas a statistic is a fact regarding a sample. Statistical tests for inferential statistics are divided into parametric and nonparametric tests. Parametric tests are used to analyze data from interval or ratio scales (continuously distributed, following a normal distribution), while nonparametric tests are designed to handle ordinal data (ranks) and nominal data (categories).

Summary Statistics
Summary statistics fall into two major categories: measures of central tendency (i.e., where do most of the numbers fall?) and measures of dispersion (i.e., how much spread is there in the numbers?).

The following section will discuss these topics in greater detail.


Topic 2.2: Central Tendency and Dispersion


Measures of central tendency are intended to provide a feel for the average value of a data set. There are three commonly used measures of central tendency:
1. The mean is the average of all data points.
2. The median is the middle number of a set of numbers arranged in order.
3. The mode is the value which occurs most frequently in the data set.
While a data set can have only one mean and one median, it can potentially be multimodal (have several modes). Can you think of an example of a multimodal distribution from the first course?

The median has the advantage of not being influenced by outliers in the data set. That is not true of the mean, however. So watch out for outliers if, like most people, your favorite measure of central tendency is the mean. Because it is possible to draw inferences about population means from sample means, the mean is the most widely used measure of central tendency. For example, by recording the age of a sample of 300 undergraduate students at a public university and calculating the mean age of that sample, we can assume that the mean age of college students across the nation won't be too far off that number.

Measures of dispersion indicate how scattered the data are. There are five commonly used measures of dispersion:
1. The range is the difference between the highest value and the lowest value in the data set: range = maximum - minimum.
2. The interquartile range (IQR) is the range between the lower and upper quartiles (25th and 75th percentiles): IQR = 75th percentile - 25th percentile.
3. The variance is the average of the squared differences between the observed values and the mean:
   Population variance: σ² = Σ(X - μ)² / N
   Sample variance: s² = Σ(X - x̄)² / (n - 1)
   where X is an observed value, μ and x̄ are the population and sample means respectively, N is the size of the population, and n is the size of the sample.
4. The standard deviation is the square root of the variance: σ = √σ² (population), s = √s² (sample).
5. The coefficient of variation equals the standard deviation divided by the mean: CV = s / x̄.

The range of the sample is not a good estimate of the range of the population; it is too small. This limitation is reduced by increasing the size of the sample. The interquartile range eliminates the effect of outliers on the range.
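These summary statistics can be sketched with Python's standard `statistics` module. The ratings below are hypothetical, chosen so that one outlier (13) pulls the mean above the median:

```python
import statistics as stats

# Hypothetical intensity ratings; the outlier (13) pulls the mean above the median
ratings = [3, 4, 4, 5, 5, 5, 6, 7, 8, 13]

mean = stats.mean(ratings)          # central tendency: the average
median = stats.median(ratings)      # middle value of the ordered data
mode = stats.mode(ratings)          # most frequent value

rng = max(ratings) - min(ratings)   # range = maximum - minimum
var_s = stats.variance(ratings)     # sample variance (divides by n - 1)
sd = stats.stdev(ratings)           # standard deviation = sqrt(variance)
cv = sd / mean                      # coefficient of variation

print(mean, median, mode, rng, round(sd, 2), round(cv, 2))
```

Note how the median (5) resists the outlier while the mean (6) is pulled upward, which is why outliers deserve a look before you report means.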



We should also define the standard error of the mean (s.e.m.) as the standard deviation divided by the square root of N, the number of observations from which the mean was derived: s.e.m. = s / √N. In most of the figures you will find in published research, means are plotted along with error bars based on the s.e.m.
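As a quick sketch (with made-up panel scores), the s.e.m. is just the sample standard deviation divided by the square root of the number of observations:

```python
import math
import statistics as stats

scores = [6.1, 5.8, 6.4, 6.0, 5.9, 6.3, 6.2, 5.7]   # hypothetical panel ratings

# s.e.m. = s / sqrt(N)
sem = stats.stdev(scores) / math.sqrt(len(scores))
print(round(sem, 3))
```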

You should familiarize yourself with how summary statistics are calculated from data spreadsheets in your favorite office software (e.g., Microsoft Excel) and/or statistical software (Minitab, SAS, SPSS, etc.). Also consult the univariate statistics tutorial on the lesson outline.


Topic 2.3: The Null Hypothesis and Type I and Type II Errors
The Null Hypothesis (H0)
The Null Hypothesis (H0) may be a hypothesis stating that there is NO DIFFERENCE:
- Between two samples
- Between the means of two sets of numbers
- Between the number of people or things in various categories

Type I and Type II Errors


A Type I Error is one we commit if we reject the Null Hypothesis when it is actually true. In a difference test, for example, that means concluding that the two samples are different when they are not perceptibly different. The risk of committing a Type I Error is called alpha (α). A Type II Error is one we commit if we fail to reject the Null Hypothesis when it (i.e., the Null Hypothesis) is false. In a difference test, that means concluding that the two samples are not different when they are perceptibly different. Beta (β) is the risk of NOT finding a difference when one actually exists. Most statistical tests are designed around the significance level alpha. Traditionally, alpha levels of 5%, 1% and 0.1%, expressed as p < 0.05, p < 0.01, and p < 0.001 respectively, are used to decide whether to reject the Null Hypothesis.

P = 1 - β

The power of a statistical test P is defined as 1 - β. In discrimination testing, P is the probability of finding a difference if one actually exists, or the probability of making the correct decision that the two samples are perceptibly different. The power of the test P depends on:
- The magnitude of the difference between the samples,
- The size of alpha, and
- The number of judges performing the test.

In practice, we set the desired level of P to determine how many judges should be recruited to conduct the test. This is what is known as power analysis. It is an important tool for anyone involved in experimental design and data collection, and one that we will examine in more detail in our discrimination testing lesson.
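To illustrate the idea of power analysis, here is a minimal Python sketch for a one-tailed binomial discrimination test. The chance level of 1/3 (as in a triangle test, covered in Lesson 4) and the assumed discriminator performance of 0.6 are illustrative assumptions, not values fixed by the lesson:

```python
from math import comb

def binomial_power(n, p_alt, alpha=0.05, p_null=1/3):
    """Power of a one-tailed binomial test with n judges.

    p_null: chance probability of a correct answer (1/3 for a triangle test);
    p_alt:  assumed probability of a correct answer if a difference exists.
    Both values are illustrative assumptions for this sketch.
    """
    def upper_tail(x, p):
        # P(X >= x) for a binomial(n, p) distribution
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

    # Smallest number of correct answers that reaches significance under H0
    x_crit = next(x for x in range(n + 1) if upper_tail(x, p_null) <= alpha)
    # Power = probability of reaching that criterion when p_alt is true
    return upper_tail(x_crit, p_alt)

# With more judges, power (1 - beta) increases for the same underlying difference:
print(round(binomial_power(30, 0.6), 2), round(binomial_power(60, 0.6), 2))
```

Running the sketch with different panel sizes shows why power analysis is used to decide how many judges to recruit.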


Topic 2.4: Basic Statistical Concepts and Associated Tests


Degrees of Freedom
Basically, the number of degrees of freedom refers to the number of categories to which data can be assigned 'freely', without being predetermined. For example, suppose we produced 10 samples using one of 3 processing methods. If 4 samples were manufactured with process A and 4 samples with process B, then we know that the remaining 2 had to be manufactured with process C. Of the three categories, we could only assign samples freely to two; for the third, we had no choice. So in most cases, the number of degrees of freedom is n - 1, where n is the number of categories.

Confidence Interval
A confidence interval gives a range within which the population mean is likely to fall, given the sample mean. It is typically set at +/- 2 s.e.m. (Standard Error of the Mean) around the sample mean for a 95% probability level. It is derived from the critical value of t at p < 0.05 (for a high number of degrees of freedom).

As mentioned above, the standard error of the mean is equal to the standard deviation divided by the square root of N, the number of observations from which the mean is derived.
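A rough sketch with hypothetical ratings, using the +/- 2 s.e.m. approximation described above:

```python
import math
import statistics as stats

ratings = [5.2, 6.1, 5.8, 6.4, 5.5, 6.0, 5.9, 6.3, 5.6, 6.2]  # hypothetical data
mean = stats.mean(ratings)
sem = stats.stdev(ratings) / math.sqrt(len(ratings))

# Approximate 95% confidence interval: mean +/- 2 s.e.m.
low, high = mean - 2 * sem, mean + 2 * sem
print(round(low, 2), round(mean, 2), round(high, 2))
```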

One-tailed or two-tailed Test?


This is a question that often arises in the use of statistical tests of the null hypothesis. This basically has to do with whether there is only one correct 'direction' or 'answer' for a test with two alternative outcomes, or whether both can be envisioned. An example from difference testing can help here. In a paired comparison (a type of difference test - see Lesson 4), we may compare two samples for the intensity of one attribute ("which sample is stronger for attribute X?"), or for preference ("which do you prefer?"). In the first scenario, there is only one correct answer (one of the samples MUST be stronger than the other), so we use one-tailed probabilities to analyze the results. In the second case, the person is free to prefer one sample or the other. Both alternatives are 'acceptable,' and so we use two-tailed probabilities. Some statisticians argue that in most scenarios, you may not know whether one-tailed or two-tailed probabilities are warranted, and that you should always use two-tailed probabilities. We concur, except in those (few) instances where you have a very good reason for using a one-tailed test.



The Normal Distribution
The normal distribution is a symmetrical, bell-shaped curve which can vary in height and width, and in which the mean and the mode coincide. The characteristics of a normal distribution (what is required to draw the curve) are the mean (μ) and the standard deviation (σ). Whatever the values of μ and σ, 68% of the values fall within one standard deviation of the mean, and 95% fall within two standard deviations. Normal (also called Gaussian) distributions are important because they often approximate distributions occurring in nature, and because parametric tests are designed for normally-distributed data. Here is an example showing the lifetimes of one hundred light bulbs. If we plot the data, we find that they are normally distributed (a bell-shaped distribution).



Mathematically, a normal or Gaussian distribution is defined as follows:

f(x) = (1 / (σ√(2π))) exp(-(x - μ)² / (2σ²))

Central Limit Theorem


This theorem states that sampling distributions, provided that the sample size is large enough, will approach a normal distribution. For example, if you plot the ages of the students in an undergraduate class of 20 at UC Davis, the distribution may or may not be normal. But if you plot the data from 10 classes of 20 together, they will most likely be normally distributed. To standardize normally distributed data, individual values can be converted into z-scores:

z = (X - μ) / σ

A z-score is really the distance from the mean in terms of standard deviations.
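A small sketch with hypothetical light-bulb lifetimes (echoing the example above); each z-score is the value's distance from the mean in standard-deviation units:

```python
import statistics as stats

lifetimes = [980, 1020, 1010, 995, 1050, 960, 1005, 1030]  # hypothetical data
mu = stats.mean(lifetimes)       # treat this data set as the full population
sigma = stats.pstdev(lifetimes)  # population standard deviation

# z = (X - mu) / sigma
z_scores = [(x - mu) / sigma for x in lifetimes]
print([round(z, 2) for z in z_scores])
```

After conversion, the z-scores themselves have mean 0 and standard deviation 1.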



The Binomial Test
A binomial test is used to determine whether the number of observations falling into each of two categories departs from what chance would predict. We use the so-called binomial expansion to calculate probabilities for various problems, such as tossing coins and drawing cards from a deck. A common use of the binomial test in sensory evaluation is the analysis of difference tests, where the two categories are judges able to discriminate vs. judges unable to discriminate between the samples.
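A sketch of the binomial calculation using only the standard library; the counts (17 correct answers out of 36, with chance probability 1/3 as in a triangle test) are hypothetical:

```python
from math import comb

def binomial_tail(n, x, p=1/3):
    """One-tailed P(X >= x) for n trials with chance probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(x, n + 1))

# Hypothetical difference test: 17 of 36 judges pick the odd sample (chance p = 1/3)
p_value = binomial_tail(36, 17)
print(round(p_value, 3))  # compare to alpha = 0.05
```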

The Chi-Square Test


The chi-square test is used to test hypotheses about frequencies of occurrence. Just as there are normal and binomial distributions, there is a chi-square distribution. The statistic is calculated as:

Chi-square = Σ (O - E)² / E

where O is an observed frequency and E is the frequency expected under the null hypothesis. In practice, we compare the calculated chi-square value to the critical value of the chi-square distribution (found in tables for various levels of significance and degrees of freedom). The chi-square test is very powerful (it readily rejects H0), so it guards well against Type II errors, but it does not guard well against Type I errors (see below). The chi-square test requires that all observations be independent.
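A minimal sketch of the calculation; the 60/40 preference split tested against an expected 50/50 split is invented for illustration:

```python
def chi_square(observed, expected):
    """Chi-square = sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical preference counts: 60 choose product A, 40 choose product B,
# against an even 50/50 split under the null hypothesis
stat = chi_square([60, 40], [50, 50])
print(stat)  # -> 4.0; compare to the tabled critical value 3.84 (df = 1, p < 0.05)
```

Since 4.0 exceeds 3.84, the null hypothesis of no preference would be rejected at the 5% level in this made-up example.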



Student's t-test
The comparison of the means of two sets of observations (e.g., two products or two panels) is a very common task in sensory evaluation. Are they statistically different, or more or less the same? This question is answered using the t-statistic, whose distribution resembles that of a z-score, except that it is wider and flatter. We compare the calculated t-value for the two means to tabled values of t at the usual probability levels of alpha (5%, 1%, and 0.1%). There are three variations of the t-test: 1. One-sample t-test. This test determines whether a sample (we mean a statistical sample here) with a given mean x̄ comes from a population with a given mean mu (μ).

2. Two-sample t-test, related samples. This test determines whether two samples were drawn from the same population (means not significantly different) or from different populations (means significantly different), in a related-samples design. We would use this t-test to compare the mean ratings for a panel of judges in two conditions - for example, rating the intensity of an attribute in a sample under white light vs. red light.

3. Two-sample t-test, independent (unrelated) samples. This test determines whether two samples were drawn from the same population (means not significantly different) or from different populations (means significantly different), in an independent-samples design - for example, comparing the mean ratings given to a sample by two different panels.
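As a sketch of variation 3 (independent samples), with made-up ratings from two panels; equal variances are assumed, as in the classic pooled-variance t-test:

```python
import math
import statistics as stats

def t_independent(a, b):
    """Two-sample t statistic for independent samples (equal variances assumed)."""
    na, nb = len(a), len(b)
    # Pooled variance combines both samples' sums of squares
    sp2 = ((na - 1) * stats.variance(a) + (nb - 1) * stats.variance(b)) / (na + nb - 2)
    t = (stats.mean(a) - stats.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    df = na + nb - 2
    return t, df

# Hypothetical mean sweetness ratings from two independent panels
panel1 = [6.2, 5.8, 6.5, 6.0, 6.3, 5.9]
panel2 = [5.1, 5.4, 4.9, 5.6, 5.2, 5.0]
t, df = t_independent(panel1, panel2)
print(round(t, 2), df)  # compare |t| to the tabled two-tailed value for df = 10
```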

Statistical versus Practical Significance


A consideration in evaluating significance is the role of the number of cases or people involved in the test. With large N's, such as tests with 100 or more observations or respondents, you may reach statistical significance at the .05 level with very small differences between the means. Thus, although statistical significance may be obtained, the actual difference between the means may be inconsequential relative to other factors which could influence decision-making. This "practical" significance should be taken into account when making recommendations from a sensory test.

Topic 2.5: Correlation and Regression


Linear correlation is a measure of the degree of association between two data sets; it indicates to what degree a plot of the two sets of data fits on a straight line. Linear regression is the technique whereby a straight line that best fits the data set is drawn as close as possible to all the points. The formula used to calculate a Pearson's Product-Moment correlation coefficient r is:

r = Σ(X - x̄)(Y - ȳ) / √[Σ(X - x̄)² Σ(Y - ȳ)²]

The calculated value for r is compared to a table of critical values for Pearson's Product-Moment correlation coefficient. Those are given for various significance levels, and for one-tailed and two-tailed scenarios. Note that the degrees of freedom for a correlation coefficient are n - 2, where n is the number of pairs of X and Y observations used to calculate r. This is the one time when degrees of freedom are not equal to n - 1!


Both linear correlation and linear regression are based on the assumption that there is a linear relationship between the data sets. THIS IS AN IMPORTANT AND SOMETIMES OVERLOOKED ASSUMPTION. Linear correlation is based on four additional assumptions:
- Each pair of X and Y values must be independent.
- The data must come from a bivariate normal distribution.
- Generally, X and Y should be randomly sampled.
- X and Y should be homoscedastic (of equal variance).

In practice, these assumptions are rarely checked. A significant correlation between two variables X and Y may imply that:
1. X causes Y
2. Y causes X
3. X and Y are both caused by some other factor
4. None of the above

That is, correlation does NOT necessarily imply CAUSALITY. Another consideration in looking at the significance of correlation coefficients is that tables only tell us the level of confidence with which we can say that r is not zero. They do not tell us that this r will be of practical value. For example, an r of 0.2 could be significantly different from zero at the .05 level, but this would only mean that X accounts for four percent of the variance in Y. (Note: r² is called the coefficient of determination and indicates the proportion of the variance of Y accounted for by X; in this example 0.2² = 0.04, or 4%.)
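A sketch of the Pearson r computation; the sucrose concentration vs. sweetness-rating pairs are invented for illustration:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))  # sum of cross-products
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical sucrose concentration (%) vs. mean sweetness rating
conc = [2, 4, 6, 8, 10]
sweet = [1.8, 3.5, 5.1, 6.4, 8.2]
r = pearson_r(conc, sweet)
print(round(r, 3), round(r**2, 3))  # r, and the coefficient of determination r^2
```

Remember that the degrees of freedom for testing r against the table are n - 2 (here, 3).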


Topic 2.6: Analysis of Variance


Analysis of variance is by far the most common statistical test performed on sensory evaluation and consumer testing data, where more than two products are compared using scaled responses, such as the intensity of an attribute in descriptive analysis, or the degree of liking of a product in consumer testing. It provides a very sensitive tool for assessing whether treatment variables, such as ingredient or processing variables, have an effect on the sensory properties and/or acceptability of a product.

Analysis of variance is a method for finding and quantifying variation that can be attributed to specific (assignable) causes, against a background of existing variation due to other (unexplained and non-assignable) causes. These other unexplained causes account for the so-called experimental error or noise in the data. What the analysis of variance algorithm essentially does is compare the size of the variance from each assignable cause to that of the experimental error. If the ratio of the two is large, it is likely that the assignable cause is a significant source of variation in the data (i.e., it accounts for a significant chunk of the variance). If the ratio is small, that variable does not contribute much of the variance in the data, and it is not a significant source of variation. So this is all about sorting and separating the 'big chunks' from the 'small chunks' among the sources of variation and their interactions. The problem usually lies in figuring out where the medium-size chunks belong - with the large or the small ones?

Analysis of variance estimates the variance or squared deviations attributable to each factor in the design. This is the degree to which each factor or variable moves the data away from the overall mean of the data set. It also estimates the amount of variance represented by the error. (Think of the error as the 'other' variance, not attributable to the factors we manipulated in our experiment.)
Then, analysis of variance examines the ratio of each factor's variance to the error variance. This ratio follows the distribution of an F-statistic, and it is called an F-ratio. Mathematically, it is the ratio of the mean-squared differences among treatments over the mean-squared error. In a two-product situation, it is simply the square of the t-value (see Student's t-test above). A significant F-ratio for a given factor states that the means for that factor were significantly different - all it takes for that to happen is for two of the means to be 'different'. Note that an F-ratio has two degrees of freedom: one for the numerator (number of treatment levels minus one), and one for the denominator (the error's).
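The F-ratio logic above can be sketched for a simple one-way layout. The liking scores for three products are invented, and a full sensory model with judges and replications would add more terms, but the between/error partition is the same idea:

```python
import statistics as stats

def one_way_anova(groups):
    """F-ratio for a one-way ANOVA: between-treatments MS over error MS."""
    all_data = [x for g in groups for x in g]
    grand_mean = stats.mean(all_data)
    k, n_total = len(groups), len(all_data)
    # Between-treatments sum of squares (the assignable cause)
    ss_between = sum(len(g) * (stats.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-treatments (error) sum of squares (the unexplained noise)
    ss_error = sum(sum((x - stats.mean(g)) ** 2 for x in g) for g in groups)
    df_between, df_error = k - 1, n_total - k
    ms_between, ms_error = ss_between / df_between, ss_error / df_error
    return ms_between / ms_error, df_between, df_error

# Hypothetical liking scores for three products, five consumers each
products = [[6, 7, 6, 8, 7], [4, 5, 5, 4, 6], [7, 8, 8, 7, 9]]
f, df1, df2 = one_way_anova(products)
print(round(f, 2), df1, df2)  # compare F to the tabled value for (2, 12) df
```

A large F here would say that 'product' is a significant source of variation, i.e., at least two product means differ.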



To summarize, the question we want to answer here is: Is there a significant difference among the means for a given treatment, relative to the error? The question can also be phrased as: Is the treatment or variable a significant source of variation in the data? To answer that question:
1. We calculate the variances (squared standard deviations).
2. We compare the variance due to each treatment to the error variance using F-ratios.
3. If the F-ratio for a treatment is significant (traditionally we look at 5%, 1%, and 0.1% significance levels), we conclude that the treatment is a significant source of variation (the means for that treatment are significantly different, and we specify at which level of alpha, e.g., p < 0.05, 0.01, or 0.001). The next step is to compare the means for the treatment levels to each other using any of a number of multiple mean comparison tests (see below).

We mentioned above that we were interested in examining not only individual sources of variation or treatments, but their interactions as well. What is an interaction, and what are the implications of a significant interaction? An interaction between two treatments/variables means that the effect of one variable differs depending on the level of the other variable. Let's use a descriptive analysis example to demonstrate this, and assume that our ANOVA (i.e., analysis of variance) model includes samples, judges, replications, and their interactions as sources of variation. If we have a panel of 10 judges scaling the intensity of attribute X in samples A and B on a 15-point scale, we might have a situation where all the judges give more or less the same rating of 3 to sample A and 6 to sample B. (See left panel on the slide - that's ideal; it means the panel was well trained and all the judges use the scale in exactly the same way.) We may also have all judges agree that sample B is stronger than sample A, but not rate the difference in quite the same way. (See center panel on the slide.) But we can have a situation where most judges rate B as stronger than A, while some rate it the other way around, so there will be an interaction between the 'judges' and 'samples' variables. (See right panel on the slide - this would be an indication of poor concept alignment, and it would warrant additional training for those judges going against the grain.)



Judges: are they a random or fixed effect?
This is a somewhat controversial issue which is going to take us back to our definition of the individuals involved in sensory tests. Are the (trained) judges doing a sensory evaluation test? Or are they human subjects whose thresholds are getting measured, or perhaps consumers giving hedonic ratings? If the individuals have been randomly sampled from the general population (and hence are representative of that population), then they should be treated as a random effect. This is the case when a consumer panel of 300 housewives is assembled to evaluate the quality of TV dinners. They represent the general population of housewives at large. This is also the case with a sample of 300 overweight human subjects recruited randomly from the at-large population of overweight individuals for a clinical study of a diet pill. In sensory analytical tests (sensory evaluation) however, judges are TRAINED, which causes them to no longer represent a random sample from the population. In an analysis of variance design that includes judges as a source of variation, we recommend that they be treated as a fixed effect. But to be fair, we should note that many statisticians and sensory scientists will argue otherwise, and treat judges in a descriptive analysis as a random effect. (You may want to consult some of the main sensory evaluation textbooks on the subject or discuss this with the company's statistician(s), and make up your own mind.) Does it make a difference anyway? Well yes, it can. The difference resides in the calculation of the F-ratio for the other fixed effects. In a design with judges, samples, replications, and their interactions as sources of variation, the sample F-ratio will be: MSSamples / MSError if judges are treated as a fixed effect, or MSSamples / MSJudgesXSamples if they are treated as a random effect, where MS stands for Mean Squares.

Sometimes, this can make a difference in the significance of the F-ratio. How do we run an analysis of variance and then read and interpret the results? This will be an important component of the tutorial at the end of the lesson, and the focus of one of the exercises for this lesson. We will use SAS (Statistical Analysis Systems - PC Version) to run our ANOVA. But you may use any other software with that capability. We usually present the results as a table showing F-ratios and their significance for the sources of variation (treatments) and their interactions as shown in the table below from your reading assignment.


Topic 2.7: Multiple Mean Comparison Tests


In an analysis of variance, we determine whether sources of variation are significant or not by examining F-ratios. For example, if the F-ratio for a treatment in the design is significant, it means that overall, the means for that treatment are significantly different. At this point, there is a need to determine how individual means for that treatment differ from each other. This is done with multiple mean comparison tests, most of which are based on the t-test. The rationale here is to avoid the inflated risk of a Type I error that is inherent in making paired comparisons between means with multiple t-tests. There are different methods for multiple mean comparisons, including Duncan's studentized range test, Fisher's Least Significant Difference (LSD) test, Newman-Keuls' test, Scheffe's test, and Tukey's Honestly Significant Difference (HSD) test. Scheffe's test is the most conservative (i.e., least likely to produce significant differences among means), and Fisher's LSD the least (i.e., it requires the smallest difference between means to establish significance). The Duncan test guards well against Type I error among a set of paired comparisons, after a significant F-ratio has been found in the ANOVA for that treatment. We recommend using either Fisher's LSD test or Duncan's test depending on the circumstances. Fisher's Least Significant Difference is calculated as follows:

LSD = t x sqrt(2 x MSE / n)

Where: MSE = the error mean square from the ANOVA; n = the number of observations behind each treatment mean (e.g., the number of judges, samples or replications) in a one-way ANOVA or one-factor repeated measures analysis of variance; t = the t-value for a two-tailed test with the degrees of freedom of the error term (that value is about 1.98 with a high enough number of degrees of freedom) and an alpha level of 5, 1 or 0.1%. If the difference between two treatment means is larger than the LSD, then the means are deemed significantly different at the corresponding alpha level (5, 1 or 0.1%). Duncan's studentized range statistic q goes as follows:

q = (larger mean - smaller mean) / sqrt(MSE / n)

This q value must exceed a tabulated critical value that depends on the number of means being compared. All of these multiple mean comparison tests are typically included as options in the statistical software you use; it is simply a matter of adding the one or two lines in your program that request these comparisons after the main ANOVA procedure. We will go through that procedure with our ANOVA example in the tutorial, and you will make multiple mean comparisons in an ANOVA assignment as well.
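As an illustration (the course tutorial runs these comparisons in SAS, but the arithmetic is the same anywhere), here is a minimal sketch of the LSD calculation in Python. It assumes `scipy` is available for the critical t-value; the function name and the example numbers are ours, not from the text:

```python
import math
from scipy import stats

def fisher_lsd(mse, n_per_mean, df_error, alpha=0.05):
    """Least Significant Difference for comparing two treatment means.

    mse        : error mean square from the ANOVA
    n_per_mean : number of observations behind each treatment mean
    df_error   : degrees of freedom of the ANOVA error term
    """
    t_crit = stats.t.ppf(1 - alpha / 2, df_error)  # two-tailed critical t
    return t_crit * math.sqrt(2 * mse / n_per_mean)

# Example: MSE = 2.0, 10 observations per mean, 120 error df
lsd = fisher_lsd(2.0, 10, 120)
# Any pair of treatment means differing by more than `lsd` is
# declared significantly different at the 5% level.
```

A stricter alpha level widens the LSD, so fewer pairs of means reach significance.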

Topic 2.8: Nonparametric Statistics


The t-test, analysis of variance, and multiple mean comparisons are examples of parametric statistics. These methods are suited to situations where the variables under study are continuous, as with rating scales (i.e., an intensity rating on a 0-10 numerical scale varies continuously from 0 to 10). There are many instances in sensory evaluation, however, when we categorize performance into right and wrong answers, or when we count the number of people who choose one product over another. Such data are discrete, or categorical, and to analyze them properly we must use nonparametric statistics. We have already introduced the binomial and chi-square distributions and tests and their applications.

Other nonparametric tests of interest in sensory evaluation are rank order tests. Ranking is a form of difference testing in which two or more samples are ranked in order of intensity of an attribute, or in order of preference. The simplest case of ranking is the paired comparison. A simple nonparametric test of difference with paired data is the sign test. It only involves counting the direction of paired scores, assuming a 50/50 split under the null hypothesis. Once the + and - signs have been counted (no ties allowed), we compute the ratio of pluses to total pairs and check the corresponding probability level in a two-tailed binomial probability table. An alternative to the independent samples t-test is the Mann-Whitney U test. This test can be used in a situation where two sets of data are to be compared and the level of measurement is ordinal. Two tests are commonly used to analyze ranking data with multiple products. One is the Friedman analysis of variance. The second is Kramer's rank sum test. Both will be covered in some detail in the lesson on difference testing (Lesson 4).

Finally, we should mention an alternative to Pearson's product-moment correlation coefficient - the Spearman rank order correlation. It is a recommended alternative for data with a high degree of skew or with outliers, or for data from ordinal scales. The Spearman rank order correlation (rho) asks whether the two variables line up in similar rankings across a set of observations. Tables of significance indicate whether an association exists on the basis of these rankings. The data must be converted to ranks first (if they are not ranking data to begin with), and a difference score (d) calculated for each pair of ranks:

rho = 1 - (6 x sum of d squared) / (N x (N squared - 1))

Where: rho = the Spearman rank order correlation coefficient; d = the difference between the two ranks for a given case (the formula uses the sum of the squared differences); N = the number of cases. The table below helps summarize the parallels between parametric and nonparametric statistics.
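The sign test and the Spearman correlation described above can both be sketched in a few lines of Python (using `scipy`; the data are invented for illustration). With no ties, the hand formula for rho matches `scipy.stats.spearmanr` exactly:

```python
from scipy import stats

# Sign test: of 15 untied paired judgments, 12 favored sample A.
# Under H0 the split is 50/50, so we use a two-tailed binomial test.
sign_result = stats.binomtest(12, n=15, p=0.5, alternative='two-sided')

# Spearman rank order correlation, computed two ways.
sourness_rank = [1, 2, 3, 4, 5, 6]            # already ranks
hedonic = [8.1, 7.4, 6.9, 6.2, 5.0, 5.5]      # ratings, to be ranked

# Convert ratings to ranks (ascending; no ties in this toy data),
# then apply rho = 1 - 6*sum(d^2) / (N*(N^2 - 1)).
hedonic_rank = [sorted(hedonic).index(x) + 1 for x in hedonic]
d2 = sum((a - b) ** 2 for a, b in zip(sourness_rank, hedonic_rank))
n = len(hedonic)
rho_manual = 1 - 6 * d2 / (n * (n ** 2 - 1))

rho_scipy, p_value = stats.spearmanr(sourness_rank, hedonic)
```

The strong negative rho here reflects the inverse sourness/liking relation used as an example later in this lesson.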

Food for thought: How carefully should you check the assumptions behind the statistical tests you use?
"Very carefully!" is going to be the answer from your statistician. The livelihood of a statistician is indeed to make sure that the statistics are done right, and that means checking all the assumptions behind a statistical test. You should never lose sight of the bigger question, however (i.e., the one you are trying to answer with the sensory test you carried out). If you follow sound experimental design and test execution practices, it is rather unlikely that the conclusions you reach will be completely flawed because one of the many assumptions underlying the particular statistical test you used to analyze the data was violated. For example, let's examine the following question: Is there a correlation between the acceptability of frozen yogurt (as measured by hedonic ratings from consumers) and acidity (as measured by pH or by sensory ratings of sourness)? If the two variables are indeed related (most likely inversely so), the fact that the two sets of observations you collect for a set of yogurt samples varying in acidity may not have quite the same variance should not lead you to the wrong conclusions. (To figure out which assumption(s) may have been violated here, see the section on correlation and regression above.)

Topic 2.9: Experimental Design


The experimental design of a research study is its most critical component. We can define experimental design as an organized approach to the collection of experimental data. Ideally, this approach should define the population to be studied, the randomization process, the administration of treatments, the sample size requirement, and the method(s) of statistical analysis. In statistical terms, the experimental design defines:
1. The size and number of the experimental units.
2. The manner in which the treatments are allotted to the units.
3. The appropriate type and grouping of the experimental units.

The experimental unit is the smallest subdivision of the experimental material such that any two could be assigned to different treatments.

Significance, Power and Precision


Hypotheses and estimates in the experimental 'world' are usually subject to error. An experiment should therefore be designed in such a way that:
1. The probability (i.e., the alpha value) of rejecting the null hypothesis H0 (mu1 = mu2) when it is true is low (typically 5%, 1%, or 0.1%) = significance level
2. The probability of rejecting the null hypothesis when it is false (1 - beta) is high (typically 90%) = power
3. The estimate of the difference between mu1 and mu2 should (with high confidence - 95%) be within +/- 10% of being correct = precision of estimate
To accomplish this, one first needs an estimate of the expected standard deviation of the intended experiment. This estimate does not need to be ironclad; any preliminary experiment or literature information is usable. Next, one needs to decide, in light of the experimental objective, what size of difference is 'important' enough to be worth detecting. This is NOT a statistical concept, but rather an educated decision based on the background and implications of your results and conclusions.

Sample Size Determination

The issue of the number of judges, subjects, or consumers required for a given sensory test is not always dealt with properly. Whereas many sensory scientists are satisfied to use what is considered a 'typical' number of judges for a given test (e.g., 30 for a difference test, 10-15 for descriptive analysis, and 100-300 for a consumer test), there are statistical means to determine what an appropriate number of subjects should be for a given application. These means are collectively known as 'power analysis.' Power analysis requires estimates of the variance and of the within-subject correlations for the measurement to be carried out in the sensory test (e.g., intensity scaling, hedonic scaling, etc.), and of the minimum size of the difference required to establish significance (e.g., 1 point on a 10-point intensity scale or a 9-point hedonic scale). Note that power analysis is more critical for groups of human subjects or consumers randomly selected from the population they represent. In the case of judges recruited for descriptive analysis, because they are trained to use a scale in a specific way, those estimates of variance and within-judge correlations are low and high, respectively, and the required sample size for a panel goes down dramatically. We therefore know (and experience confirms this) that a panel size of 10-15 judges is amply sufficient for descriptive analysis (provided they are properly trained).

Now, let's look at an example, based on a clinical study with human subjects. We completed a study of the effect of dietary fat on preferences for fat in selected foods (Guinard et al., 1999). The main hypothesis was that a shift in energy from fat in the diet would result in a corresponding shift in hedonic response to fat in foods. A power analysis was carried out to estimate the size of the sample that would be adequate to test our hypothesis. We took estimates for the variance (MSE = 1.89) and the within-subject correlation (rho = 0.66) from a prior study of animal fat acceptability among athletes, conducted with 40 subjects on 29 meat and dairy products using the 9-point hedonic scale. Based on that study, we deemed a minimum detectable difference of 0.8 on the 9-point scale adequate to verify our hypothesis. With a power of 90%, and a two-tailed test with alpha = 0.05, we found that minimum detectable differences of 0.9, 0.8 and 0.7 would be achieved with 20, 25 or 30 subjects, respectively. We recruited 25 subjects for our experiment. For details on how to factor these estimates into the calculation of the required sample size, consult Fleiss (1986), Schlesselman (1973), or any experimental design textbook.
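For the curious, a rough version of this sample-size arithmetic can be sketched in Python. This is a generic normal-approximation formula for a paired/within-subject comparison (the variance of a difference taken as 2 x sigma^2 x (1 - rho)), NOT the exact calculation of Schlesselman (1973) used in the study, so it only lands in the same ballpark as the 25 subjects reported:

```python
import math
from scipy.stats import norm

def n_paired(sigma2, rho, diff, alpha=0.05, power=0.90):
    """Approximate number of subjects needed to detect a mean difference
    `diff` in a paired/within-subject comparison, treating the variance
    of a difference score as 2 * sigma2 * (1 - rho)."""
    z_a = norm.ppf(1 - alpha / 2)   # two-tailed significance
    z_b = norm.ppf(power)
    return math.ceil((z_a + z_b) ** 2 * 2 * sigma2 * (1 - rho) / diff ** 2)

# Estimates from the fat-preference study: MSE = 1.89, rho = 0.66,
# minimum detectable difference of 0.8 on the 9-point hedonic scale.
n = n_paired(1.89, 0.66, 0.8)
```

Note how the high within-subject correlation (rho = 0.66) shrinks the required sample; with rho = 0, the same formula would call for roughly three times as many subjects.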

Lesson 2 Page 23 of 34
Copyright The Regents of the University of California 2006 Copyright Dr. Jean-Xavier Guinard 2006

Sensory Evaluation Methods

Lesson 2: Univariate Statistics & Experimental Design

Topic 2.9: Experimental Design


Randomization
Randomization is the process whereby we ensure that each experimental unit has an equal probability of being assigned to each treatment. The purpose of randomization is to guarantee that the statistical test will have a valid significance level. In sensory testing, a common form of randomization is the random assignment of panelists to a specific group. By doing this, the uncontrolled variation among panelists is distributed across the treatment groups, so all treatments are affected similarly and the panelist effect cancels out of the overall variation. Another form of randomization is the random ordering of sample presentation in a sensory test. It is recommended that the sample order be counterbalanced, with each serving sequence occurring an equal number of times (a fully counterbalanced design). It is also recommended to use completely balanced serving sequences when the possibility of carry-over effects between samples exists. The Mutually Orthogonal Latin Squares (MOLS) design developed by Wakeling and MacFie (1995) is an example of a design in which every sample occurs in every position in the sequence (first, second, ...) the same number of times, as well as before and after every other sample in the design the same number of times. This is particularly useful in preference mapping applications where consumers see several samples, and first-order and/or carry-over effects are a possibility.
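A full MOLS construction is beyond a few lines, but the basic idea of position balance can be sketched with a cyclic Latin square in Python (a helper of our own, not the Wakeling and MacFie construction; it balances serving position, not carry-over):

```python
def latin_square_orders(treatments):
    """Cyclic Latin square of serving orders: panelist i starts at
    position i and wraps around, so every sample appears in every
    serving position exactly once across t panelists (or groups)."""
    t = len(treatments)
    return [[treatments[(i + j) % t] for j in range(t)] for i in range(t)]

orders = latin_square_orders(['A', 'B', 'C', 'D'])
# Each column (serving position) contains every sample exactly once.
```

Panelists beyond the first t can simply be assigned the t rows again, keeping position balance intact.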

Types of Designs
Completely Randomized Design (CRD)

This design is well suited to a situation where the number of samples is small and all samples can be evaluated by all the panelists in a single session. In this case, each panelist receives all the samples, in a completely randomized order. This would be the case for some central location consumer tests, where each consumer is asked to indicate his/her degree of liking of a set of products. In sensory evaluation with trained judges, because of the need for replicate judgments, it may not be possible to have each judge evaluate all the samples in replicate in a single session. Such a situation warrants the use of the next type of design, the RCB design.

Randomized Complete Block Design (RCB)

This is the most widely used design when there are more than two variables or treatments to be compared. In the RCB design, the rows are known as blocks, and the columns as treatments. The control of extraneous variation due to rows is known as 'blocking.' It is common in sensory evaluation experimental design. Even when they are trained, judges do not generally use the scale uniformly: some judges use the ends of the scale, others stick to the middle; some score high and others low. By using judges as a block, all comparisons are made within judges, and are more precise because judge-to-judge variation does not enter into the calculation of treatment differences. We can write the statistical model as follows:

Xij = mu + Ti + Bj + Eij

Where: Xij is the observed value for the ith treatment and jth judge; mu is the overall mean; Ti is the effect of the ith treatment; Bj is the effect of the jth block (judge); Eij are random errors assumed to be normally and independently distributed with variance sigma2e. The variance sigma2e includes the variation due to panelists, experimental materials, and other errors not controlled by the design.
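To make the blocking concrete, here is a hand-computed RCB ANOVA sketch in Python (the scores are invented for illustration; the course tutorial does this in SAS). The total sum of squares partitions into treatment, block (judge), and error components:

```python
# Scores from 3 judges (blocks) on 4 samples (treatments), one score per cell.
scores = [
    [6.0, 7.5, 5.0, 8.0],   # judge 1
    [5.5, 7.2, 4.4, 7.6],   # judge 2
    [6.4, 8.0, 5.3, 8.7],   # judge 3
]
b, t = len(scores), len(scores[0])
grand = sum(sum(row) for row in scores) / (b * t)

# Partition: SS_total = SS_treatment + SS_block + SS_error
ss_total = sum((x - grand) ** 2 for row in scores for x in row)
ss_block = t * sum((sum(row) / t - grand) ** 2 for row in scores)
treat_means = [sum(row[j] for row in scores) / b for j in range(t)]
ss_treat = b * sum((m - grand) ** 2 for m in treat_means)
ss_error = ss_total - ss_block - ss_treat

# F-ratio for treatments: MS_treatment / MS_error
ms_treat = ss_treat / (t - 1)
ms_error = ss_error / ((b - 1) * (t - 1))
f_treat = ms_treat / ms_error
```

Because judge variation goes into SS_block rather than SS_error, the error mean square is small and the treatment F-ratio is correspondingly large.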

The layout for a RCB design is:

The corresponding ANOVA table is:

The RCB design is frequently used when trained panels must evaluate several samples in replicate (not feasible in one single session). In this case, it is best to have each judge evaluate all samples in a single session, and then return to evaluate them again in another session, etc. In this type of study, the blocks are the sessions, and the samples are randomized across judges within each block.

Balanced Incomplete Block Design (BIB)
This is an extension of the RCB design, with the following parameters:
t = number of treatments
k = number of experimental units per block
r = number of replications of each treatment
b = number of blocks (judges)
lambda = number of blocks in which each pair of treatments is compared

These parameters are not independent and the following requirements apply:

rt = bk = N and lambda(t - 1) = r(k - 1)
where N is the total number of observations in the experiment. The layout for a BIB design is:

The corresponding ANOVA table is:

A BIB design is used when there are too many treatments in the experiment for the judges to evaluate all the samples in a single session (block). In this case, judges evaluate subsets of samples during different sessions.
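The two arithmetic requirements above are easy to check mechanically before committing to a BIB layout; a small Python helper of our own, for illustration:

```python
def is_valid_bib(t, k, r, b, lam):
    """Check the two arithmetic requirements for a Balanced Incomplete
    Block design: r*t = b*k (= N) and lam*(t - 1) = r*(k - 1)."""
    return r * t == b * k and lam * (t - 1) == r * (k - 1)

# The classic (t=7, k=3) plan: 7 judges each see 3 of 7 samples,
# every sample appears r=3 times, every pair of samples meets once.
ok = is_valid_bib(t=7, k=3, r=3, b=7, lam=1)
```

If either equation fails, no balanced assignment of samples to sessions exists for those numbers, and a parameter (usually b or r) must be adjusted.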

Crossover Design
A crossover design is a plan in which the response of judges/consumers is measured for two treatments/products, each evaluated in sequence. Crossover designs are not recommended for use in sensory evaluation (with trained judges), but they are used extensively in clinical research and have their application in consumer home-use tests, where consumers are asked to use one product for some period of time and then use the other product, after which they complete a questionnaire regarding the products. The layout of the crossover design is shown below:

In designing a crossover home-use test, two groups of consumers are formed - I and II. Consumers are assigned randomly to the two groups. Group I uses product A in the first period and product B in the second; Group II does the reverse. Two assumptions of this model, which may not always be met in a home-use test, are that (1) there is no product-by-period interaction; that is, the difference between products A and B is the same regardless of the sequence in which they were evaluated; and (2) there are no order or carry-over effects from one product to the other. Neither assumption is likely to hold true, however.
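The random assignment of consumers to the two sequence groups can be sketched as follows (a hypothetical Python helper; the fixed seed is used only to make the example reproducible):

```python
import random

def assign_crossover(consumer_ids, seed=None):
    """Randomly split consumers into two equal sequence groups:
    'AB' uses product A then B; 'BA' uses product B then A."""
    ids = list(consumer_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return {'AB': ids[:half], 'BA': ids[half:]}

groups = assign_crossover(range(1, 41), seed=7)
# 20 consumers per sequence; the analysis can then test for a
# sequence (carry-over) effect before pooling the two groups.
```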

Factorial Designs
Factorial designs are plans used to study the effects of two or more factors on product attributes, where the levels of each factor are varied simultaneously with those of the other factors in the experiment. Chief among factorial designs is the 2^k factorial design. It is the foundation of response surface methodology, an optimization technique we will discuss in Courses 3 and 4. In the 2^k factorial design, there are k factors (e.g., A, B, C, ...), each at two levels - low and high. In a two-factor experiment, the response Y consists of the effects of factors A and B, the interaction AB, and the residual E (error). Note that without replications, the AB interaction cannot be separated from the error. The statistical model is written as:

Yijk = mu + Ai + Bj + (AB)ij + Eijk


Where: i = 1, 2 (low, high); j = 1, 2 (low, high); k = 1, 2, ..., r replications. A center point can be added to the 2^k factorial design; it is located half-way between the low and high levels of each variable. The 3^k factorial design is another option, in which we consider low, medium, and high levels of each variable.

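The run list for a two-level factorial design (in coded -1/+1 units, optionally with a center point) can be generated mechanically; a brief Python sketch, with names of our own choosing:

```python
from itertools import product

def two_level_factorial(k, center_point=False):
    """All 2**k runs of a two-level factorial design in coded units
    (-1 = low, +1 = high), optionally with one center point (0, ..., 0)."""
    runs = [list(levels) for levels in product([-1, 1], repeat=k)]
    if center_point:
        runs.append([0] * k)
    return runs

runs = two_level_factorial(3, center_point=True)  # 2**3 corner runs + center
```

Mapping the coded levels back to real factor settings (e.g., -1 = 0.2% acid, +1 = 0.5% acid) is then a simple linear rescaling per factor.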
Which Design to Choose?
A premier advantage of blocking designs is that they enable the experimenter to assign variance estimates to variables which cannot necessarily be controlled, but can be fixed, and therefore blocked. A disadvantage in the practical use of these various designs is that researchers will often choose a design for its own sake rather than for the needs of the particular experiment. The designs are efficient only if the separate variance estimates are important. If a variable which is blocked does not contribute importantly to the overall variance, its inclusion via a specific design lowers efficiency. Factorial designs also develop information on interactions between variables. In many sensory experiments, it is the interaction between variables which is the most important conclusion to be drawn from the study. A factorial design therefore allows you to develop estimates of interaction(s) which can be tested for significance. Clearly, the most important prerequisite to using these experimental designs is to know intimately the objectives of your experiment and the role of each variable in it. The easiest way to do this is to write everything out in terms of a model. At this stage, you should look at every variable and decide whether it needs to be included in the experiment, whether it represents a treatment which should be examined at different levels, whether it interacts with other variables, and whether it can be blocked or should be randomized. With this information, you are in a position to decide what kind of design you need, based on the constraints on your variables, time, resources, etc. For more information on experimental designs in sensory evaluation and consumer testing, consult M. C. Gacula Jr.'s Design and Analysis of Sensory Optimization (1993), Food & Nutrition Press, Trumbull, CT. For a comprehensive look at experimental designs, consult the 'bible' - Cochran and Cox's Experimental Designs (1957), Wiley, New York.

OK, is everyone still standing (or sitting)? We just went through a lot of complex statistical concepts, and it is now time to digest them through a series of entertaining assignments!

References (and useful resources)


Cochran, W. G., & Cox, G. M. (1957). Experimental Designs (2nd ed.). New York: Wiley.
Fleiss, J. L. (1986). The Design and Analysis of Clinical Experiments. New York: John Wiley and Sons.
Gacula, M. C. Jr. (1993). Design and Analysis of Sensory Optimization. Trumbull, CT: Food & Nutrition Press.
Guinard, J.-X., & Cliff, M. C. (1987). Descriptive analysis of Pinot noir wines from Carneros, Napa and Sonoma. American Journal of Enology and Viticulture, 38, 211-215.
Guinard, J.-X., Sechevich, P., Meaker, K., Jonnalagadda, S. S., & Kris-Etherton, P. (1999). Sensory responses to fat are not affected by varying dietary energy intake from fat and saturated fat over ranges common in the American diet. Journal of the American Dietetic Association, 99(6), 690-696.
Lawless, H. T., & Heymann, H. (1998). Sensory Evaluation of Food: Principles and Practices. New York: Chapman & Hall.
O'Mahony, M. (1986). Sensory Evaluation of Food: Statistical Methods and Procedures. New York: Marcel Dekker.
Schlesselman, J. J. (1973). Planning a longitudinal study. I. Sample size determination. Journal of Chronic Diseases, 26, 553-560.
Wakeling, I., & MacFie, H. G. H. (1995). Designing consumer trials balanced for first and higher orders of carry-over effect when only a subset of k samples from t may be tested. Food Quality and Preference, 6, 299-308.

Tables
Table of critical values for chi-square
Table of critical values for Pearson's product-moment correlation coefficient
Tables of Spearman rank order correlation values
Table of critical values of t (Student's t-test)

Univariate Statistics Tutorial


Most of the assignments for this lesson will have you run statistical tests on actual data. We are providing a set of guidelines here on how to run the tests with Excel or PC SAS. The first tutorial covers the mean, variance, standard deviation, the t-test, the chi-square test, and correlation and regression in Excel. The second tutorial covers analysis of variance in SAS.

Stats Tutorial 1
Stats Tutorial 2 (Analysis of Variance Tutorial)
