Interestingly, although increasing the sample size will not change the underlying distribution of the population, it can often go a long way toward correcting for skewness in the test statistic. Thus, the t -test often becomes valid, even with fairly skewed data, if the sample size is large enough. In fact, using data from Figure 5, we did a simulation study and determined that the sampling distribution for the difference in means is indeed approximately normal with a sample size of 30 (data not shown).

In contrast, carrying out a simulation with a sample size of only 15 did not yield a normal distribution of the test statistic and thus the t -test would not have been valid. Unfortunately, there is no simple rule for choosing a minimum sample size to protect against skewed data, although some textbooks recommend Even a sample size of 30, however, may not be sufficient to correct for skewness or kurtosis in the test statistic if the sample data i. The bottom line is that there are unfortunately no hard and fast rules to rely on.
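The effect of sample size on the shape of a sampling distribution can be explored with a small simulation. The sketch below is not the authors' simulation: it assumes an exponential population as a stand-in for skewed data, looks at the mean of a single sample rather than a difference in means, and uses hypothetical function names. It only illustrates that skewness in the sampling distribution shrinks as the sample size grows.

```python
import random
import statistics


def skewness(values):
    """Skewness: mean cubed deviation divided by the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def simulated_mean_skewness(sample_size, n_sims=5000, seed=1):
    """Skewness of simulated sample means drawn from a skewed population.

    Assumption: the population is exponential (strongly right-skewed).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_sims):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return skewness(means)


# Compare how skewed the sampling distribution is at several sample sizes.
for n in (2, 15, 30):
    print(n, round(simulated_mean_skewness(n), 2))
```

A skewness near zero is consistent with approximate normality, but how close is close enough remains a judgment call, as the text notes.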

Thus, if you have reason to suspect that your samples or the populations from which they are derived are strongly skewed, consider consulting your nearest statistician for advice on how to proceed. In the end, given a sufficient sample size, you may be cleared for the t -test. Alternatively, several classes of nonparametric tests can potentially be used (Section 6). Although these tests tend to be less powerful than the t -test at detecting differences, the statistical conclusions drawn from these approaches will be much more valid.

Furthermore, the computationally intensive method bootstrapping retains the power of the t -test but doesn't require a normal distribution of the test statistic to yield valid results. In some cases, it may also be reasonable to assume that the population distributions are normal enough. Normality, or something quite close to it, is typically found in situations where many small factors serve to contribute to the ultimate distribution of the population.

Because such effects are frequently encountered in biological systems, many natural distributions may be normal enough with respect to the t -test. Another way around this conundrum is to simply ask a different question, one that doesn't require the t -test approach. In fact, the western blot example is one where many of us would intuitively look toward analyzing the ratios of band intensities within individual blots (discussed in Section 6).

You may be surprised to learn that nothing can stop you from running a t -test with sample sizes of two. Of course, you may find it difficult to convince anyone of the validity of your conclusion, but run it you may! Another problem is that very low sample sizes will render any test much less powerful. What this means in practical terms is that to detect a statistically significant difference with small sample sizes, the difference between the two means must be quite large.

In cases where the inherent difference is not large enough to compensate for a low sample size, the P -value will likely be above the critical threshold. In this event, you might state that there is insufficient evidence to indicate a difference between the populations, although there could be a difference that the experiment failed to detect.

Alternatively, it may be tempting to continue gathering samples to push the P -value below the traditionally acceptable threshold of 0. As to whether this is a scientifically appropriate course of action is a matter of some debate, although in some circumstances it may be acceptable. However, this general tactic does have some grave pitfalls, which are addressed in later sections e.

One good thing about working with C. The same cannot be said for many other biological or experimental systems. This advantage should theoretically allow us to determine if our data are normal enough or to simply not care about normality since our sample sizes are high. In any event, we should always strive to take advantage of this aspect of our system and not short-change our experiments. Of course, no matter what your experimental system might be, issues such as convenience and expense should not be principal driving forces behind the determination of sample size.

Rather, these decisions should be based on logical, pragmatic, and statistically informed arguments (see Section 6). Nevertheless, there are certain kinds of common experiments, such as qRT-PCR, where a sample size of three is quite typical. Of course, by three we do not mean three worms. Here, three refers to the number of biological replicates.

In such cases, it is generally understood that worms for the three extracts may have been grown in parallel but were processed for mRNA isolation and cDNA synthesis separately. Better yet, the templates for each biological replicate may have been grown and processed at different times. Here, three or more equal-sized aliquots of cDNA from the same biological replicate are used as the template in individual PCR reactions. Of course, the data from technical replicates will nearly always show less variation than data from true biological replicates.

Importantly, technical replicates should never be confused with biological replicates. In the case of qRT-PCR, the former are only informative as to the variation introduced by the pipetting or amplification process. As such, technical replicates should be averaged, and this value treated as a single data point. In this case, suppose for the sake of discussion that each replicate contains extracts from 5, worms.

If all 15, worms can be considered to be from a single population (at least with respect to the mRNA of interest), then each observed value is akin to a mean from a sample of 5, In that case, one could likely argue that the three values do come from a normal population (the Central Limit Theorem having done its magic on each), and so a t -test using the mean of those three values would be acceptable from a statistical standpoint. It might still suffer from a lack of power, but the test itself would be valid. Similarly, western blot quantitation values, which average proteins from many thousands of worms, could also be argued to fall under this umbrella.
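The bookkeeping for technical versus biological replicates described above can be sketched as follows. The measurement values and names are made up for illustration. The point is only that technical replicates are collapsed to one value, so the sample size equals the number of biological replicates.

```python
from statistics import fmean

# Hypothetical qRT-PCR measurements: three biological replicates,
# each assayed with three technical replicates (values are made up).
technical_values = {
    "bio_rep_1": [1.02, 0.98, 1.00],
    "bio_rep_2": [1.31, 1.29, 1.30],
    "bio_rep_3": [0.79, 0.81, 0.80],
}


def collapse_technical_replicates(measurements):
    """Average technical replicates so each biological replicate is one data point."""
    return {name: fmean(values) for name, values in measurements.items()}


data_points = collapse_technical_replicates(technical_values)
sample_size = len(data_points)  # n counts biological replicates, not PCR wells
print(data_points, sample_size)
```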

The paired t -test is a powerful way to detect differences in two sample means, provided that your experiment has been designed to take advantage of this approach. In our example of embryonic GFP expression, the two samples were independent in that the expression within any individual embryo was not linked to the expression in any other embryo. For situations involving independent samples, the paired t -test is not applicable; we carried out an unpaired t -test instead.

For the paired method to be valid, data points must be linked in a meaningful way. If you remember from our first example, worms that have a mutation in b show lower expression of the a ::GFP reporter. In this example of a paired t -test, consider a strain that carries a construct encoding a hairpin dsRNA corresponding to gene b. Using a specific promoter and the appropriate genetic background, the dsRNA will be expressed only in the rightmost cell of one particular neuronal pair, where it is expected to inhibit the expression of gene b via the RNAi response.

In contrast, the neuron on the left should be unaffected. In addition, this strain carries the same a ::GFP reporter described above, and it is known that this reporter is expressed in both the left and right neurons at identical levels in wild type. The experimental hypothesis is therefore that, analogous to what was observed in embryos, fluorescence of the a ::GFP reporter will be weaker in the right neuron, where gene b has been inhibited. In this scenario, the data are meaningfully paired in that we are measuring GFP levels in two distinct cells, but within a single worm.

We then collect fluorescence data from 14 wild-type worms and 14 b RNAi worms.

A visual display of the data suggests that expression of a ::GFP is perhaps slightly decreased in the right cell where gene b has been inhibited, but the difference between the control and experimental dataset is not very impressive (Figure 9A, B). Furthermore, whereas the means of GFP expression in the left neurons in wild-type and b RNAi worms are nearly identical, the mean of GFP expression in the right neurons in wild type is a bit higher than that in the right neurons of b RNAi worms. For our t -test analysis, one option would be to ignore the natural pairing in the data and treat left and right cells of individual animals as independent.

In doing so, however, we would hinder our ability to detect real differences. The reason is as follows. We already know that GFP expression in some worms will happen to be weaker or stronger (resulting in a dimmer or brighter signal) than in other worms. This variability, along with a relatively small mean difference in expression, may preclude our ability to support differences statistically.

Figure 9. Representation of paired data.

Here the wild-type and b RNAi strains have been separated, and we are specifically comparing expression in the left and right neurons for each genotype. In addition, lines have been drawn between left and right data points from the same animal. Two things are quite striking. One is that worms that are bright in one cell tend to be bright in the other.

Second, looking at b RNAi worms, we can see that within individuals, there is a strong tendency to have reduced expression in the right neuron as compared with its left counterpart (Figure 9D). However, because of the inherent variability between worms, this difference was largely obscured when we failed to make use of the paired nature of the experiment. This wasn't a problem in the embryonic analysis, because the difference between wild type and b mutants was large enough relative to the variability between embryos.

In the case of neurons (and the use of RNAi), the difference was, however, much smaller and thus fell below the level necessary for statistical validation. The rationale behind using the paired t -test is that it takes meaningfully linked data into account when calculating the P -value. The paired t -test works by first calculating the difference between each individual pair.

Then a mean and variance are calculated for all the differences among the pairs. Finally, a one-sample t -test is carried out where the null hypothesis is that the mean of the differences is equal to zero. Furthermore, the paired t -test can be one- or two-tailed, and arguments for either are similar to those for two independent means. Of course, standard programs will do all of this for you, so the inner workings are effectively invisible.
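The steps just described (pairwise differences, their mean and variance, then a one-sample test against zero) can be sketched in a few lines. The fluorescence values below are made up, and the function name is hypothetical. The sketch stops at the t statistic and degrees of freedom, because converting these to a P -value requires the t distribution, which the Python standard library does not provide.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """Return (t, degrees_of_freedom) for paired measurements."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    # Step 1: difference within each pair.
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    # Step 2: mean and sample standard deviation of the differences.
    mean_diff = statistics.fmean(differences)
    sd_diff = statistics.stdev(differences)  # uses the n - 1 denominator
    # Step 3: one-sample t statistic for the null hypothesis mean_diff == 0.
    standard_error = sd_diff / math.sqrt(n)
    return mean_diff / standard_error, n - 1


# Made-up fluorescence values for left and right neurons of four worms.
left = [10.0, 12.0, 9.0, 11.0]
right = [8.0, 11.0, 7.0, 10.0]
t_value, df = paired_t_statistic(left, right)
print(t_value, df)
```

In practice a statistics package would do this in one call. SciPy's `scipy.stats.ttest_rel`, for example, returns both the statistic and the P -value.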

Given the enhanced power of the paired t -test to detect differences, it is often worth considering how the statistical analysis will be carried out at the stage when you are developing your experimental design. Then, if it's feasible, you can design the experiment to take advantage of the paired t -test method. Some textbooks, particularly older ones, present a method known as the critical value approach in conjunction with the t -test. This method, which traditionally involves looking up t -values in lengthy appendices, was developed long before computer software was available to calculate precise P -values.

Part of the reason this method may persist, at least in some textbooks, is that it provides authors with a vehicle to explain the basis of hypothesis testing along with certain technical aspects of the t -test. As a modern method for analyzing data, however, it has long since gone the way of the dinosaur. Feel no obligation to learn this.

Gossett was allowed an exception, but the higher-ups insisted that he use a pseudonym. In this case the mean and median of the data set are identical. For left-skewed distributions, the mean is less than the median and the skewness will be a negative number.

For right-skewed distributions, the mean is more than the median and the skewness will be a positive number. In the case of a normal distribution, this number is zero. Distributions with relatively sharp peaks and long tails will have a positive kurtosis value whereas distributions with relatively flat peaks and short tails will have a negative kurtosis value.

The test ultimately generates an approximate P-value where the null hypothesis is that the data are derived from a population that is normal. The conclusion of non-normality can also be reached informally by a visual inspection of the histograms. The Anderson-Darling test does not indicate whether test statistics generated by the sample data will be sufficiently normal. For an illustration of this phenomenon for proportions, see Figure 12 and discussion thereof.

Why many? This is so mainly because details of the proof of the theorem depend on the particular statistical context. Judging this requires experience, but, in essence, the larger the sample size, the less normal the distribution can be without causing much concern.

The two-sample t -test works well in situations where we need to determine if differences exist between two populations for which we have sample means. But what happens when our analyses involve comparisons between three or more separate populations? Here things can get a bit tricky. Such scenarios tend to arise in one of several ways.

Most common to our field is that we have obtained a collection of sample means that we want to compare with a single standard. For example, we might measure the body lengths of young adult-stage worms grown on RNAi-feeding plates, each targeting one of different collagen genes. In this case, we would want to compare mean lengths from animals grown on each plate with a control RNAi that is known to have no effect on body length. On the surface, the statistical analysis might seem simple: just carry out two-sample t -tests where the average length from each collagen RNAi plate is compared with the same control.

The problem with this approach is the unavoidable presence of false-positive findings also known as Type I errors. The more t -tests you run, the greater the chance of obtaining a statistically significant result through chance sampling. This type of multiple comparisons problem is common in many of our studies and is a particularly prevalent issue in high-throughput experiments such as microarrays, which typically involve many thousands of comparisons. Taking a slightly different angle, we can calculate the probability of incurring at least one false significance in situations of multiple comparisons.

For example, with just two t -tests and a significance threshold of 0. Using probability calculators available on the web also see Section 4. Thinking about it this way, we might be justifiably concerned that our studies may be riddled with incorrect conclusions! Furthermore, reducing the chosen significance threshold to 0. Moreover, by reducing our threshold, we run the risk of discarding results that are both statistically and biologically significant.
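The probability of at least one false positive across several tests follows from the complement rule: it is one minus the probability that every test avoids a false positive. The sketch below assumes the tests are independent and uses 0.05 purely as an illustrative threshold. The function name is hypothetical.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """P(at least one Type I error) across independent tests, when all nulls are true."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Illustrative values only: alpha = 0.05 with a growing number of tests.
for k in (1, 2, 10, 100):
    print(k, round(prob_at_least_one_false_positive(0.05, k), 4))
```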

A related but distinct situation occurs when we have a collection of sample means, but rather than comparing each of them to a single standard, we want to compare all of them to each other. As with multiple comparisons to a single control, the problem lies in the sheer number of tests required.
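The number of all-against-all comparisons among n sample means is the number of unordered pairs, n(n - 1)/2. A minimal sketch, with group counts chosen only for illustration:

```python
import math


def pairwise_comparisons(n_groups):
    """Number of distinct pairs among n_groups sample means: n * (n - 1) / 2."""
    return math.comb(n_groups, 2)


# The count grows roughly with the square of the number of groups.
for n in (3, 10, 50):
    print(n, pairwise_comparisons(n))
```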

With sample means, that number skyrockets to 4, Based on a significance threshold of 0. Obviously, both common sense, as well as the use of specialized statistical methods, will come into play when dealing with these kinds of scenarios. In the sections below, we discuss some of the underlying concepts and describe several practical approaches for handling the analysis of multiple means. Before delving into some of the common approaches used to cope with multiple comparisons, it is worth considering an experimental scenario that would likely not require specialized statistical methods.

For example, if 36 RNAi clones are ultimately identified that lead to resistance to a particular bacterial pathogen from a set of 17, clones tested, how does one analyze this result? No worries: the methodology is not composed of 17, statistical tests each with some chance of failing. That's because the final reported tally, 36 clones, was presumably not the result of a single round of screening.

A second or third round of screening would effectively eliminate virtually all of the false positives, reducing the number of clones that show a consistent biological effect to In other words, secondary and tertiary screening would reduce to near zero the chance that any of the clones on the final list are in error because the chance of getting the same false positives repeatedly would be very slim. Perhaps more than anything else, carrying out independent repeats is often the best way to solidify results and avoid the presence of false positives within a dataset.
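The reason repeated screening removes false positives can be shown with a rough calculation. The sketch assumes that every clone is truly negative, that each round is independent, and that each round has the same per-test false-positive rate. The clone count and rate below are hypothetical and are not the figures from the screen described above.

```python
def expected_false_positives(n_clones, alpha, n_rounds):
    """Expected number of true-negative clones scoring positive in every round."""
    return n_clones * alpha ** n_rounds


# Hypothetical screen: 10,000 clones, per-round false-positive rate of 0.05.
for rounds in (1, 2, 3):
    print(rounds, expected_false_positives(10_000, 0.05, rounds))
```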

To help grapple with the problems inherent to multiple comparisons, statisticians came up with something called the family-wise error rate. This is also sometimes referred to as the family-wide error rate , which may provide a better description of the underlying intent. The basic idea is that rather than just considering each comparison in isolation, a statistical cutoff is applied that takes into account the entire collection of comparisons. In the family-wise error rate approach, the criterion used to judge the statistical significance of any individual comparison is made more stringent as a way to compensate for the total number of comparisons being made.

We can use our example of the collagen experiment to further illustrate the meaning of the family-wise error rate. Suppose we test genes and apply a family-wise error rate cutoff of 0. Perhaps this leaves us with a list of 12 genes that lead to changes in body size that are deemed statistically significant. Several techniques for applying the family-wise error rate are described below. The Bonferroni method , along with several related techniques, is conceptually straightforward and provides conservative family-wise error rates.

To use the Bonferroni method, one simply divides the chosen family-wise error rate e. Going back to our example of the collagen genes, if the desired family-wise error rate is 0. Thus, individual t -tests yielding P -values as low as 0. This may sound rather severe. In fact, a real problem with the Bonferroni method is that for large numbers of comparisons, the significance threshold may be so low that one may fail to detect a substantial proportion of true positives within a data set.
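The uniform Bonferroni adjustment amounts to a single division. The sketch below uses hypothetical function names and made-up P -values. The family-wise rate of 0.05 is an illustrative default, not a recommendation.

```python
def bonferroni_threshold(family_wise_rate, n_comparisons):
    """Per-comparison significance cutoff under the uniform Bonferroni method."""
    return family_wise_rate / n_comparisons


def bonferroni_significant(p_values, family_wise_rate=0.05):
    """Flag each P-value that falls below the Bonferroni-adjusted cutoff."""
    cutoff = bonferroni_threshold(family_wise_rate, len(p_values))
    return [p < cutoff for p in p_values]


# Made-up P-values from three comparisons; the cutoff is 0.05 / 3.
print(bonferroni_threshold(0.05, 3))
print(bonferroni_significant([0.001, 0.02, 0.04]))
```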

For this reason, the Bonferroni method is widely considered to be too conservative in situations with large numbers of comparisons. Another variation on the Bonferroni method is to apply significance thresholds for each comparison in a non-uniform manner. For example, with a family-wise error rate of 0. Another way to think about this is that the sum of the 10 individual cutoffs must add up to 0. Interestingly, the integrity of the family-wise error rate is not compromised if one were to apply a 0.

This is because 0. The rub, however, is that the decision to apply non-uniform significance cutoffs cannot be made post hoc based on how the numbers shake out! For this method to be properly implemented, researchers must first prioritize comparisons based on the perceived importance of specific tests, such as if a negative health or environmental consequence could result from failing to detect a particular difference.
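The constraint on non-uniform cutoffs is simply that they sum to no more than the family-wise error rate. The check below is a sketch with a hypothetical function name and made-up cutoffs. It cannot enforce the more important requirement that the cutoffs be assigned before the data are seen.

```python
import math


def cutoffs_respect_family_wise_rate(cutoffs, family_wise_rate):
    """True if the per-test cutoffs sum to no more than the family-wise rate."""
    # A tiny tolerance absorbs floating-point rounding in the sum.
    return math.fsum(cutoffs) <= family_wise_rate + 1e-12


# Made-up allocation for 10 tests: one prioritized test, nine with stricter cutoffs.
prioritized = [0.023] + [0.003] * 9
uniform = [0.005] * 10
print(cutoffs_respect_family_wise_rate(prioritized, 0.05))
print(cutoffs_respect_family_wise_rate(uniform, 0.05))
```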

For example, it may be more important to detect a correlation between industrial emissions and childhood cancer rates than to detect effects on the rate of tomato ripening. This may all sound rather arbitrary, but it is nonetheless statistically valid. As discussed above, the Bonferroni method runs into trouble in situations where many comparisons are being made because a substantial proportion of true positives are likely to be discarded for failing to score below the adjusted significance threshold. Stated another way, the power of the experiment to detect real differences may become unacceptably low.

Benjamini and Hochberg are credited with introducing the idea of the false discovery rate (FDR), which has become an indispensable approach for handling the statistical analysis of experiments composed of large numbers of comparisons. Importantly, the FDR method has greater power than does the Bonferroni method.

The latter method starts from the position that no differences exist. The FDR method does not suppose this. The FDR method is carried out by first making many pairwise comparisons and then ordering them according to their associated P -values, with lowest to highest displayed in a top to bottom manner.

In the examples shown in Table 3 , this was done with only 10 comparisons for three different data sets , but this method is more commonly applied to studies involving hundreds or thousands of comparisons. What makes the FDR method conceptually unique is that each of the test-derived P -values is measured against a different significance threshold. In the example with 10 individual tests, the one giving the lowest P -value is measured against 30 0. Conversely, the highest P -value is measured against 0. With ten comparisons, the other significance thresholds simply play out in ordered increments of 0.

For example, the five significance thresholds starting from the top of the list would be 0. If comparisons were being made, the highest threshold would still be 0. Having paired off each experimentally derived P -value with a different significance threshold, one checks to see if the P -value is less than the prescribed threshold. If so, then the difference is declared to be statistically significant a discovery , at which point one moves on to the next comparison, involving the second-lowest P -value.

This process continues until a P -value is found that is higher than the corresponding threshold. At that point, this and all remaining results are deemed not significant. Table 3. Illustration of FDR method, based on artificial P -values from 10 comparisons. The highlighted values indicate the first P -value that is larger than the significance threshold i. Examples of how this can play out are shown in Table 3. Note that even though some of the comparisons below the first failed test may themselves be less than their corresponding significance thresholds Data Set 3 , these tests are nevertheless declared not significant.
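The ranking procedure as described in the text can be sketched as follows. This follows the sequential walk given above, starting from the lowest P -value and stopping at the first failure. It is worth knowing that the Benjamini-Hochberg procedure as usually implemented in software works upward from the largest P -value and can therefore declare more discoveries on the same data. The function name and P -values are made up, and 0.05 is an illustrative FDR.

```python
def sequential_fdr(p_values, fdr=0.05):
    """Rank P-values and test each against fdr * rank / m, stopping at the first failure.

    Returns a list of (p_value, threshold, significant) in ascending P-value order.
    """
    m = len(p_values)
    results = []
    still_passing = True
    for rank, p in enumerate(sorted(p_values), start=1):
        threshold = fdr * rank / m
        # Once one comparison fails, it and all later ones are not significant.
        if still_passing and p >= threshold:
            still_passing = False
        results.append((p, threshold, still_passing))
    return results


# Made-up P-values: the third (0.028) is below its own threshold (0.03)
# but is still declared not significant because the second one failed.
for row in sequential_fdr([0.9, 0.001, 0.03, 0.025, 0.028]):
    print(row)
```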

This may seem vexing, but without this property the test would not work. Put another way, that test, along with all those below it on the list, are declared persona non grata and asked to leave the premises! Although the FDR approach is not hugely intuitive, and indeed the logic is not easily tractable, it is worth considering several scenarios to see how the FDR method might play out. For example, with independent tests of two populations that are identical, chance sampling would be expected to result on average with a single t -test having an associated P -value of 0. However, given that the corresponding significance threshold would be 0.

Next, consider the converse situation: t -tests carried out on two populations that are indeed different. Furthermore, based on the magnitude of the difference and the chosen sample size, we would expect to obtain an average P -value of 0. Of course, chance sampling will lead to some experimental differences that result in P -values that are higher or lower than 0. Because this is less than the cutoff of 0. But compared with the Bonferroni method, where the significance threshold always corresponds to the lowest FDR cutoff, the proportion of these errors will be much smaller. This section will contain only three paragraphs.

This is in part because of the view of some statisticians that ANOVA techniques are somewhat dated or at least redundant with other methods such as multiple regression (see Section 5). In addition, a casual perusal of the worm literature will uncover relatively scant use of this method. Traditionally, an ANOVA answers the following question: are any of the mean values within a dataset likely to be derived from populations that are truly different? Correspondingly, the null hypothesis for an ANOVA is that all of the samples are derived from populations whose means are identical and that any differences in their means are due to chance sampling.

Thus, an ANOVA will implicitly compare all possible pairwise combinations of samples to each other in its search for differences. Notably, in the case of a positive finding, an ANOVA will not directly indicate which of the populations are different from each other. An ANOVA tells us only that at least one sample is likely to be derived from a population that is different from at least one other population. Because such information may be less than totally satisfying, an ANOVA is often used in a two-tiered fashion with other tests; these latter tests are sometimes referred to as post hoc tests.

In cases where an ANOVA suggests the presence of different populations, t -tests or other procedures described below can be used to identify differences between specific populations. Moreover, so long as the P -value associated with the ANOVA is below the chosen significance threshold, the two means that differ by the greatest amount are assured of being supported by further tests. The correlate, however, is not true. Thus, ANOVA will provide a more conservative interpretation than t -tests using chosen pairs of means. Of course, focusing on certain comparisons may be perfectly valid in some instances see discussion of planned comparisons below.

In fact, it is generally only in situations where there is insufficient structure among treatment groups to inspire particular comparisons where ANOVA is most applicable. In cases of a positive ANOVA finding, a commonly used post hoc method is Tukey's test , which goes by a number of different names including Tukey's honest significant difference test and the Tukey-Kramer test. As is the case for other methods for multiple comparisons, the chance of obtaining false negatives increases with the number of populations being tested, and, with post hoc ANOVA methods, this increase is typically exponential.

Tukey's test does have more power than the Bonferroni method but does not generate precise P -values for specific comparisons. To get some idea of significance levels, however, one can run Tukey's test using several different family-wise significance thresholds 0. Thus if your analyses take you heavily into the realm of the ANOVA, it may be necessary to educate yourself about the differences between these approaches. Figure 10 provides a visual summary of the multiple comparisons methods discussed above.

As can be seen, the likelihood of falsely declaring a result to be statistically significant is highest when conducting multiple t -tests without corrections and lowest using Bonferroni-type methods. Conversely, incorrectly concluding no significant difference even when one exists is most likely to occur using the Bonferroni method. Thus the Bonferroni method is the most conservative of the approaches discussed, with FDR occupying the middle ground.

Additionally, there is no rule as to whether the uniform or non-uniform Bonferroni method will be more conservative as this will always be situation dependent. Though discussed above, ANOVA has been omitted from Figure 10 since this method does not apply to individual comparisons. Nevertheless, it can be posited that ANOVA is more conservative than uncorrected multiple t -tests and less conservative than Bonferroni methods. Finally, we can note that the statistical power of an analysis is lowest when using approaches that are more conservative discussed further in Section 6.

Figure 10. Strength versus weakness comparison of statistical methods used for analyzing multiple means.

There is no law that states that all possible comparisons must be made. It is perfectly permissible to choose a small subset of the comparisons for analysis, provided that this decision is made prior to generating the data and not afterwards based on how the results have played out!

In addition, with certain datasets, only certain comparisons may make biological sense or be of interest. Thus one can often focus on a subset of relevant comparisons. As always, common sense and a clear understanding of the biology are essential. These situations are sometimes referred to as planned comparisons, thus emphasizing the requisite premeditation.

An example might be testing for the effect on longevity of a particular gene that you have reason to believe controls this process. The fact that you included all of these conditions in the same experimental run, however, would not necessarily obligate you to compensate for multiple comparisons when analyzing your data. In addition, when the results of multiple tests are internally consistent, multiple comparison adjustments are often not needed. For example, if you are testing the ability of gene X loss of function to suppress a gain-of-function mutation in gene Y, you may want to test multiple mutant alleles of gene X as well as RNAi targeting several different regions of X.

In such cases, you may observe varying degrees of genetic suppression under all the tested conditions. Here you need not adjust for the number of tests carried out, as all the data are supporting the same conclusion. In the same vein, it could be argued that suppression of a mutant phenotype by multiple genes within a single pathway or complex could be exempt from issues of multiple comparisons.

Finally, as discussed above, carrying out multiple independent tests may be sufficient to avoid having to apply statistical corrections for multiple comparisons. Imagine that you have written up a manuscript that contains fifteen figures (fourteen of which are supplemental). Embedded in those figures are 23 independent t -tests, none of which would appear to be obvious candidates for multiple comparison adjustments.

However, you begin to worry. Since the chosen significance threshold for your tests was 0. Thinking about this more, you realize that over the course of your career you hope to publish at least 50 papers, each of which could contain an average of 20 statistical tests. This would mean that over the course of your career you are To avoid this humiliation, you decide to be proactive and impose a career-wise Bonferroni correction to your data analysis.

Going through your current manuscript, you realize that only four of the 23 tests will meet your new criteria. With great sadness in your heart, you move your manuscript into the trash folder on your desktop. Although the above narrative may be ridiculous (indeed, it is meant to be so), the underlying issues are very real. Conclusions based on single t -tests, which are not supported by additional complementary data, may well be incorrect.

Thus, where does one draw the line? One answer is that no line should be drawn, even in situations where multiple comparison adjustments would seem to be warranted. Results can be presented with corresponding P -values, and readers can be allowed to make their own judgments regarding their validity. For larger data sets, such as those from microarray studies, an estimation of either the number or proportion of likely false positives can be provided to give readers a feeling for the scope of the problem.

Even without this, readers could in theory look at the number of comparisons made, the chosen significance threshold, and the number of positive hits to come up with a general idea about the proportion of false positives. Although many reviewers and readers may not be satisfied with this kind of approach, know that there are professional statisticians who support this strategy. Perhaps most importantly, understand that whatever approaches are used, data sets, particularly large ones, will undoubtedly contain errors, including both false positives and false negatives.

Wherever possible, seek to confirm your own results using multiple independent methods so that you are less likely to be fooled by chance occurrence.

Sections 2 and 3 dealt exclusively with issues related to means. For many experiments conducted in our field, however, mean values are not the end goal. For example, we may seek to determine the frequency of a particular defect in a mutant background, which we typically report as either a proportion e. Moreover, we may want to calculate CIs for our sample percentages or may use a formal statistical test to determine if there is likely to be a real difference between the frequencies observed for two or more samples.

In other cases, our analyses may be best served by determining ratios or fold changes, which may require specific statistical tools. Finally, it is often useful, particularly when carrying out genetic experiments, to be able to calculate the probabilities of various outcomes. This section will cover major points that are relevant to our field when dealing with these common situations.

Most readers are likely proficient at calculating the probability of two independent events occurring through application of the multiplication rule.

More practically, we may wish to estimate the frequency of EcoRI restriction endonuclease sites in the genome. Because the EcoRI binding motif is GAATTC and each nucleotide has a roughly one-in-four chance of occurring at each position, then the chance that any six-nucleotide stretch in the genome will constitute a site for EcoRI is 0. Of course, if all nucleotides are not equally represented or if certain sequences are more or less prevalent within particular genomes, then this will change the true frequency of the site.
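The multiplication rule for the EcoRI example can be written out directly. This is a minimal sketch that assumes, as the text does, that each nucleotide occurs independently with probability one-in-four; the genome length is a hypothetical value used only for illustration.

```python
# EcoRI recognition sequence, as given in the text.
site = "GAATTC"

# Assumption from the text: each base has a one-in-four chance at each position.
p_per_base = 0.25

# Multiplication rule: independent events multiply.
p_site = p_per_base ** len(site)

# Average spacing between sites, in base pairs.
expected_spacing = 1 / p_site

# HYPOTHETICAL genome length, for illustration only.
genome_size = 100_000_000
expected_sites = genome_size * p_site

print("probability per 6-nt window:", p_site)
print("expected spacing (bp):", expected_spacing)
print("expected sites in hypothetical genome:", expected_sites)
```

If base composition is uneven, `p_per_base` would have to be replaced by per-base frequencies, which is the caveat raised in the paragraph above.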

Thus, even when calculating straightforward probabilities, one should be careful not to make false assumptions regarding the independence of events.

In carrying out genetic studies, we will often want to determine the likelihood of obtaining a desired genotype. For example, if we are propagating an unbalanced recessive lethal mutation (let), we will need to pick phenotypically wild-type animals at each generation and then assess for the presence of the lethal mutation in the first-generation progeny.

Basic Mendelian genetics as applied to C. Thus, picking five wild-type animals will nearly guarantee that at least one of the F1 progeny is of our desired genotype. Furthermore, there is a 0. To calculate probabilities for more-complex problems, it is often necessary to account for the total number of combinations or permutations that are possible in a given situation. Thus, for permutations the order matters, whereas for combinations it does not. Depending on the situation, either combinations or permutations may be most relevant.
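The claim that picking five wild-type animals nearly guarantees a carrier can be checked with a short calculation. The sketch assumes standard Mendelian segregation from a self-fertilizing heterozygous parent, so that two-thirds of the viable, phenotypically wild-type progeny are heterozygous; that fraction is stated here as an assumption because the corresponding figure is cut off in the text. The function name is hypothetical.

```python
# ASSUMPTION: among viable (wild-type) progeny of a let/+ parent,
# 2/3 are let/+ and 1/3 are +/+.
P_HETEROZYGOTE = 2 / 3

def p_at_least_one_carrier(n_picked, p_carrier=P_HETEROZYGOTE):
    """Probability that at least one of n_picked animals carries the mutation."""
    # Complement rule: 1 minus the chance that every pick is a non-carrier.
    return 1 - (1 - p_carrier) ** n_picked

for n in range(1, 6):
    print(n, round(p_at_least_one_carrier(n), 4))
```

The complement rule used here is often the simplest route to "at least one" questions, since the alternative is to sum over every possible number of carriers.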

To illustrate the process of calculating combinations and permutations, we'll first use an example involving peptides. If each of the twenty standard amino acids (aa) is used only once in the construction of a 20-aa peptide, how many distinct sequences can be assembled? We start by noting that the order of the amino acids will matter, and thus we are concerned with permutations. In addition, given the setup where each amino acid can be used only once, we are sampling without replacement.

For example, 5! Also by convention, 1! Note that because we were sampling without replacement, the incremental decrease with each multiplier was necessary to reflect the reduced choice of available amino acids at each step. One thought would be to use the multiplication rule where we multiply 0. If this seems a bit lower than you might expect, your instincts are correct. For this reason, it underestimates the true frequency of interest, since there are multiple ways of getting the same combination.

Thus, we must take permutations into account in order to determine the frequency of the generic combination. Because deriving permutations by hand as we did above becomes cumbersome, if not impossible, very quickly, one can use the equation n! / (n1! × n2! × ...), where n is the total number of items, with n1 that are alike, n2 that are alike, etc. Thus plugging in the numbers for our example, we would have 5! This illustrates a more general rule regarding the probability (Pr) of obtaining specific combinations. Note, however, that we may often be interested in a slightly different question than the one just posed.

In this case, we would have to sum the probabilities for three out of five [ 5! The ability to calculate permutations can also be used to determine the number of different nucleotide sequences in a mer where each of the four nucleotides G, A, T, C is used five times. Namely, 20! Finally, we can calculate the number of different peptides containing five amino acids where each of the twenty amino acids is chosen once without replacement.

In this case, we can use a generic formula where n is the total number of items from which we select r items without replacement. This would give us 20! Thus, using just a handful of formulas, we are empowered with the ability to make a wide range of predictions for the probabilities that we may encounter. Specifically, the Poisson distribution can be used to predict the probability that a given number of events will occur over a stipulated interval of time, distance, space, or other related measure, when said events occur independently of one another.
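The counting rules discussed above (permutations of distinct items, permutations with repeated items, and ordered selections without replacement) can be checked with Python's standard library. This is a sketch; the helper name `permutations_with_repeats` is hypothetical, and the 3-and-2 split at the end is an illustrative case rather than a value taken from the text.

```python
from math import factorial, perm

def permutations_with_repeats(counts):
    """Distinct orderings of n items where counts[i] items are alike: n!/(n1! n2! ...)."""
    result = factorial(sum(counts))
    for c in counts:
        result //= factorial(c)
    return result

# Each of the twenty amino acids used exactly once: 20! orderings.
peptides_all_twenty = factorial(20)

# Each of the four nucleotides used five times: 20!/(5! 5! 5! 5!).
nucleotide_sequences = permutations_with_repeats([5, 5, 5, 5])

# Five amino acids chosen from twenty without replacement, order matters: 20!/15!.
peptides_five_of_twenty = perm(20, 5)

print(peptides_all_twenty)
print(nucleotide_sequences)
print(peptides_five_of_twenty)

# Illustrative case: five items, three alike and two alike.
print(permutations_with_repeats([3, 2]))
```

Integer arithmetic is used throughout so that the large factorials stay exact.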

For example, given a known forward mutation rate caused by a chemical mutagen, what is the chance that three individuals from a collection of 1, F1s derived from mutagenized P0 parents will contain a mutation in gene X? Also, what is the chance that any F1 worm would contain two or more independent mutations within gene X? For this formula to predict probabilities accurately, it is required that the events be independent of each other and occur at a constant average rate over the given interval.

If these criteria are violated, then the Poisson distribution will not provide a valid model. For example, imagine that we want to calculate the likelihood that a mutant worm that is prone to seizures will have two seizures i. For this calculation, we rely on previous data showing that, on average, mutant worms have 6. Note that if we were to follow 20 different worms for 5 minutes and observed six of them to have two seizures, this would suggest that the Poisson distribution is not a good model for our particular event. Rather, the data would suggest that multiple consecutive seizures occur at a frequency that is higher than predicted by the Poisson distribution, and thus the seizure events are not independent.
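A Poisson calculation of the kind described above might look like the following sketch. The mean number of events per interval is a hypothetical value, since the rate quoted in the text is truncated; only the cohort size of 20 worms comes from the text.

```python
from math import exp, factorial

def poisson_pmf(k, mean):
    """Probability of exactly k events when events occur independently at a constant average rate."""
    return mean ** k * exp(-mean) / factorial(k)

# HYPOTHETICAL mean number of seizures per worm in the observation window.
mean_events = 0.5

p_exactly_two = poisson_pmf(2, mean_events)
p_two_or_more = 1 - poisson_pmf(0, mean_events) - poisson_pmf(1, mean_events)

# Expected number of worms, out of 20, showing exactly two events.
expected_worms_with_two = 20 * p_exactly_two

print("P(exactly 2):", round(p_exactly_two, 4))
print("P(2 or more):", round(p_two_or_more, 4))
print("expected worms out of 20:", round(expected_worms_with_two, 2))
```

Comparing the expected count with the observed count is the informal model check the text describes: a large excess of observed double events would argue against independence.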

In contrast, had only one or two of the 20 worms exhibited two seizures within the time interval, this would be consistent with a Poisson model. Another useful strategy for calculating probabilities, as well as other parameters of interest that are governed by chance, is sometimes referred to as the intuitive approach. This includes the practice of plugging hypothetical numbers into scenarios to maximize the clarity of the calculations and conclusions.

Our example here will involve efforts to maximize the efficiency of an F2-clonal genetic screen to identify recessive maternal-effect lethal or sterile mutations (Figure). For this experiment, we will specify that P0 adults are to be cloned singly onto plates following mutagenesis. The question is: what is the optimal number of F2s to single-clone from each F1? But will the returns prove diminishing and, if so, what is the most efficient practice?

Figure. Schematic diagram of F2-clonal genetic screen for recessive mutations in C.

The first column shows the number of F2 animals picked per F1, which ranges from one to six. As expected, the likelihood increases with larger numbers of F2s, but diminishing returns are evident as the number of F2s increases. Columns 3-5 tabulate the number of worm plates required, the implication being that more plates are both more work and more expense.

Here, a higher frequency would imply that the desired mutations of interest are more common.

Table 4. Intuitive approach to determine the maximum efficiency of an F2-clonal genetic screen.

From this we can see that either two or three F2s is the most efficient use of plates and possibly time, although other factors could potentially factor into the decision of how many F2s to pick.

We can also see that the relative efficiencies are independent of the frequency of the mutation of interest. Importantly, this potentially useful insight was accomplished using basic intuition and a very rudimentary knowledge of probabilities. Of course, the outlined intuitive approach failed to address whether the optimal number of cloned F2s is 2.
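The intuitive approach can be mimicked with a few lines of code. The sketch below rests on assumptions that are not spelled out in the surviving text: each F1 of interest is heterozygous, so one-quarter of its F2 progeny are homozygous for the mutation, and the cost is one plate for the F1 plus one plate per cloned F2. The function names are hypothetical.

```python
def p_at_least_one_homozygote(n_f2):
    """Chance that at least one of n_f2 cloned F2s is homozygous (ASSUMES 1/4 per F2)."""
    return 1 - 0.75 ** n_f2

def plates_per_f1(n_f2):
    """ASSUMED cost model: one plate for the F1 plus one plate per cloned F2."""
    return 1 + n_f2

def efficiency(n_f2, mutation_frequency=1.0):
    """Expected mutations recovered per plate; the frequency only rescales the result."""
    return mutation_frequency * p_at_least_one_homozygote(n_f2) / plates_per_f1(n_f2)

for n_f2 in range(1, 7):
    print(n_f2, round(p_at_least_one_homozygote(n_f2), 3), round(efficiency(n_f2), 4))

best = max(range(1, 7), key=efficiency)
print("most efficient under this cost model:", best)
```

Under this simplified cost model, two and three F2s come out nearly tied, and because the mutation frequency enters only as a multiplier, the ranking does not depend on it.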

We note that an online tool has been created by Shai Shaham Shaham, that allows users to optimize the efficiency of genetic screens in C. To use the tool, users enter several parameters that describe the specific genetic approach e. The website's algorithm then provides a recommended F2-to-F1 screening ratio. Entering parameters that match the example used above, the website suggests picking two F2s for each F1, which corresponds to the number we calculated using our intuitive approach.

In addition, the website provides a useful tool for calculating the screen size necessary to achieve a desired level of genetic saturation. In many situations, the likelihood of two events occurring is not independent. This does not mean that the two events need be totally interdependent or mutually exclusive, just that one event occurring may increase or decrease the likelihood of the other.

Put another way, having prior knowledge of one outcome may change the effective probability of a second outcome. The area of statistics that handles such situations is known as Bayesian analysis or inference, after an early pioneer in this area, Thomas Bayes. More generally, conditional probability refers to the probability of an event occurring based on the condition that another event has occurred. Although conditional probabilities are extremely important in certain types of biomedical and epidemiological research, such as predicting disease states given a set of known factors 39 , this issue doesn't arise too often for most C.

Bayesian models and networks have, however, been used in the worm field for applications that include phylogenetic gene tree construction (Hoogewijs et al.). Bayesian statistics is also used quite extensively in behavioral neuroscience (Knill and Pouget; Vilares and Kording), which is a growing area in the C. We refer interested readers to textbooks or the web for additional information (see Appendix A).

It is common in our field to generate data that take the form of binomial proportions. Examples would include the percentage of mutant worms that arrest as embryos or that display ectopic expression of a GFP reporter.

As the name implies, binomial proportions arise from data that fall into two categories such as heads or tails, on or off, and normal or abnormal. More generically, the two outcomes are often referred to as a success or failure. To properly qualify, data forming a binomial distribution must be acquired by random sampling, and each outcome must be independent of all other outcomes. Coin flips are a classic example where the result of any given flip has no influence on the outcome of any other flip. Also, when using a statistical method known as the normal approximation (discussed below), the binomial dataset should contain a minimum of ten outcomes in each category (although some texts may recommend a more relaxed minimum of five).

This is generally an issue only when relatively rare events are being measured. For example, flipping a coin 50 times would certainly result in at least ten heads or ten tails, whereas a phenotype with very low penetrance might be detected only in three worms from a sample of In this latter case, a larger sample size would be necessary for the approximation method to be valid. As we often deal with theoretical populations that are effectively infinite in size, however, this stipulation is generally irrelevant.

An aside on the role of normality in binomial proportions is also pertinent here. It might seem counterintuitive, but the distribution of sample proportions arising from data that are binary does have, with sufficient sample size, an approximately normal distribution. As can be seen, the distribution becomes more normal with increasing sample size.


How large a sample is required, you ask? The requirements are reasonably met by the aforementioned minimum of ten rule.


Illustration of the Central Limit Theorem for binomial proportions. Panels A-D show results from a computational sampling experiment where the proportion of successes in the population is 0. The x axes indicate the proportions obtained from sample sizes of 10, 20, 40, and The y axes indicate the number of computational samples obtained for a given proportion. As expected, larger-sized samples give distributions that are closer to normal in shape and have a narrower range of values.
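A computational sampling experiment of the kind described in the figure legend can be sketched as follows. The population proportion and the largest sample size are hypothetical choices, since both values are truncated in the text, and the helper name is an assumption.

```python
import random
from statistics import mean, pstdev

def simulate_sample_proportions(p, sample_size, n_samples, rng):
    """Draw n_samples random samples and return the proportion of successes in each."""
    return [
        sum(rng.random() < p for _ in range(sample_size)) / sample_size
        for _ in range(n_samples)
    ]

rng = random.Random(0)  # fixed seed so the run is repeatable
p_true = 0.25           # HYPOTHETICAL population proportion of successes

spread = {}
for sample_size in (10, 20, 40, 80):  # 80 is an assumed fourth sample size
    proportions = simulate_sample_proportions(p_true, sample_size, 2000, rng)
    spread[sample_size] = pstdev(proportions)
    print(sample_size, round(mean(proportions), 3), round(spread[sample_size], 3))
```

The standard deviation of the sample proportions shrinks as the sample size grows, which corresponds to the narrower range of values noted in the legend; plotting a histogram of each list would show the shape approaching normal.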

To address the accuracy of proportions obtained through random sampling, we will typically want to provide an accompanying CI. As previously discussed in the context of means, determining CIs for sample proportions is important because in most cases we can never know the true proportion of the population under study. As for means, lower CIs e. Perhaps surprisingly, there is no perfect consensus among statisticians as to which of several methods is best for calculating CIs for binomial proportions. Thus, different textbooks or websites may describe several different approaches.

That said, for most purposes we can recommend a test that goes by several names including the adjusted Wald, the modified Wald, and the Agresti-Coull (A-C) method (Agresti and Coull; Agresti and Caffo). Furthermore, even though this approach is based on the normal approximation method, the minimum of ten rule can be relaxed. It then uses the doctored numbers, together with the normal approximation method, to determine the CI for the population proportion. For example, if in real life you assayed 83 animals and observed larval arrest in 22, you would change the total number of trials to 87 and the number of arrested larvae to In addition, depending on the software or websites used, you may need to choose the normal approximation method and not something called the exact method for this to work as intended.
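The adjusted Wald calculation for the larval arrest example can be sketched as below. It assumes the common 95% form of the method, in which two successes and two failures are added before applying the normal approximation; this matches the change from 83 to 87 trials given in the text. The function name and the default z value of 1.96 are assumptions.

```python
from math import sqrt

def adjusted_wald_ci(successes, trials, z=1.96):
    """Approximate 95% CI after adding two successes and two failures (ASSUMED form)."""
    adj_successes = successes + 2
    adj_trials = trials + 4
    p_adj = adj_successes / adj_trials
    margin = z * sqrt(p_adj * (1 - p_adj) / adj_trials)
    # Clip to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

observed_successes, observed_trials = 22, 83
low, high = adjusted_wald_ci(observed_successes, observed_trials)

# Report the observed proportion, not the adjusted one.
print("observed proportion:", round(observed_successes / observed_trials, 3))
print("approximate 95% CI:", round(low, 3), "to", round(high, 3))
```

Only the interval uses the adjusted counts; the proportion and sample size that are reported remain the ones actually observed.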

Importantly, the proportion and sample size that you report should be the actual proportion and sample size from what you observed; the doctored i. Nevertheless, unless the obtained percentage is 0 or , we do not recommend doing anything about this as measures used to compensate for this phenomenon have their own inherent set of issues.

Very often we will want to compare two proportions for differences. Is this difference significant from a statistical standpoint? To answer this, two distinct tests are commonly used. These are generically known as the normal approximation and exact methods. In fact, many website calculators or software programs will provide the P -value calculated by each method as a matter of course, although in some cases you may need to select one method.

The approximation method based on the so-called normal distribution has been in general use much longer, and the theory behind this method is often outlined in some detail in statistical texts. The major reason for the historical popularity of the approximation method is that prior to the advent of powerful desktop computers, calculations using the exact method simply weren't feasible. Its continued use is partly due to convention, but also because the approximation and exact methods typically give very similar results. Unlike the normal approximation method, however, the exact method is valid in all situations, such as when the number of successes is less than five or ten, and can thus be recommended over the approximation method.

Regardless of the method used, the P-value derived from a test for differences between proportions will answer the following question: If the two experimental samples were derived from the same population, what is the probability of observing a difference between the sample proportions at least as large as the one obtained? Put another way, the null hypothesis would state that both samples are derived from a single population and that any differences between the sample proportions are due to chance sampling.
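To make the two approaches concrete, the sketch below computes a two-tailed P-value for a difference between two proportions using a pooled normal approximation and using Fisher's exact test, which is one common exact method. Both functions are minimal standard-library implementations written for illustration, not the algorithm of any particular website or package, and the counts in the example are hypothetical.

```python
from math import comb, erf, sqrt

def normal_approx_p(x1, n1, x2, n2):
    """Two-tailed P-value from the pooled normal approximation (fails if pooled is 0 or 1)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    # Area in both tails of the standard normal beyond |z|.
    return 1 - erf(abs(z) / sqrt(2))

def fisher_exact_p(x1, n1, x2, n2):
    """Two-tailed Fisher's exact P-value: sum over tables no more probable than the observed one."""
    successes = x1 + x2
    total = n1 + n2

    def table_prob(a):
        return comb(n1, a) * comb(n2, successes - a) / comb(total, successes)

    observed = table_prob(x1)
    lowest = max(0, successes - n2)
    highest = min(n1, successes)
    p = sum(
        table_prob(a)
        for a in range(lowest, highest + 1)
        if table_prob(a) <= observed * (1 + 1e-9)  # tolerance for floating-point ties
    )
    return min(1.0, p)

# HYPOTHETICAL counts: 22 of 83 animals versus 40 of 90 animals.
print("normal approximation:", normal_approx_p(22, 83, 40, 90))
print("exact method:", fisher_exact_p(22, 83, 40, 90))
```

With counts this large the two P-values should be similar; the exact method remains usable when a category contains only a handful of outcomes.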

Much like statistical tests for differences between means, proportions tests can be one- or two-tailed, depending on the nature of the question. For the purpose of most experiments in basic research, however, two-tailed tests are more conservative and tend to be the norm. In addition, analogous to tests with means, one can compare an experimentally derived proportion against a historically accepted standard, although this is rarely done in our field and comes with the possible caveats discussed in Section 2.

A question that may arise when comparing more than two binomial proportions is whether or not multiple comparisons should be factored into the statistical analysis. The issues here are very similar to those discussed in the context of comparing multiple means Section 3. In the case of proportions, rather than carrying out an ANOVA, a Chi-square test discussed below could be used to determine if any of the proportions are significantly different from each other.

Like an ANOVA, however, this may be a somewhat less-than-satisfying test in that a positive finding would not indicate which particular proportions are significantly different. In addition, FDR and Bonferroni-type corrections could also be applied at the level of P -value cutoffs, although these may prove to be too conservative and could reduce the ability to detect real differences i.

In general, we can recommend that for findings confirmed by several independent repeats, corrections for multiple comparisons may not be necessary. We illustrate our rationale with the following example. Suppose you were to carry out a genome-wide RNAi screen to identify suppressors of larval arrest in the mutant Y background. With retesting of these 1, clones, most of the false positives from the first round will fail to suppress in the second round and will be thrown out. A third round of retesting will then likely eliminate all but a few false positives, leaving mostly valid ones on the list.

In addition, let's imagine that real positives would also be identified giving us 1, positives in total. Admittedly, at this point, the large majority of identified clones would be characterized as false positives. In the second round of tests, however, the large majority of true positives would again be expected to exhibit statistically significant suppression, whereas only 50 of the 1, false positives will do so.

Following the third round of testing, all but two or three of the false positives will have been eliminated.
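The attrition of false positives across rounds of retesting follows from repeated multiplication by the significance threshold. The sketch assumes a threshold of 0.05 and 1,000 false positives entering the second round; both are assumptions chosen to be consistent with the figure of 50 survivors quoted above, because the original numbers are truncated.

```python
def expected_false_positives(n_initial, alpha, retest_rounds):
    """Expected false positives remaining after independent retests at threshold alpha."""
    return n_initial * alpha ** retest_rounds

alpha = 0.05            # ASSUMED significance threshold
first_round_false = 1000  # ASSUMED false positives after the first round

for retests in range(0, 3):
    remaining = expected_false_positives(first_round_false, alpha, retests)
    print("after round", retests + 1, ":", remaining)
```

The calculation treats each retest as independent, which is the premise behind the argument that repeats can stand in for formal multiple-comparison corrections.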


Thus, by carrying out several experimental repeats, additional correction methods are not needed. At times one may be interested in calculating the probability of obtaining a particular proportion or one that is more extreme, given an expected frequency. For example, what are the chances of tossing a coin times and getting heads 55 times?

This can be calculated using a small variation on the formulae already presented above. Here n is the number of trials, Y is the number of positive outcomes or successes, and p is the probability of a success occurring in each trial. Thus we can determine that the chance of getting exactly 55 heads is quite small, only 4.

Nevertheless, given an expected proportion of 0. In fact, we are probably most interested in knowing the probability of getting a result at least as or more extreme than 55 whether that be 55 heads or 55 tails. Thus our probability calculations must also include the results where we get 56, 57, 58… heads as well as 45, 44, 43 …0 heads. Adding up these probabilities then tells us that we have a Rather than having to calculate each probability and adding them up, however, a number of websites will do this for you.
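The coin-toss probabilities can be computed directly from the binomial formula using the symbols defined above (n trials, Y successes, success probability p). The number of tosses is assumed to be 100, since the figure is cut off in the text; that value is consistent with the symmetric tails of 55 or more and 45 or fewer heads.

```python
from math import comb

def binomial_pmf(y, n, p):
    """Probability of exactly y successes in n independent trials with success probability p."""
    return comb(n, y) * p ** y * (1 - p) ** (n - y)

n_tosses = 100  # ASSUMED number of tosses
p_heads = 0.5

p_exactly_55 = binomial_pmf(55, n_tosses, p_heads)

# Results at least as extreme as 55 in either direction: 55 or more, or 45 or fewer heads.
p_at_least_as_extreme = sum(
    binomial_pmf(y, n_tosses, p_heads)
    for y in range(n_tosses + 1)
    if y >= 55 or y <= 45
)

print("P(exactly 55 heads):", round(p_exactly_55, 4))
print("P(at least as extreme):", round(p_at_least_as_extreme, 4))
```

Summing both tails corresponds to a two-tailed question; a one-tailed version would keep only the `y >= 55` terms.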

One of the assumptions for using the binomial distribution is that our population size must be very large relative to our sample size.
