Why Not All Studies On Exercise Are Created Equal

Or how I learned to stop just looking at the results

Earlier this week, I made mention of a study on kettlebell training. It had some interesting outcomes, but that study also illustrates a key point that I need to discuss in the name of intellectual honesty.

You see, while the study was an interesting one and the findings were fascinating, there was a problem that I failed to consider while discussing it, and it’s time to set the record straight.


When discussing whether kettlebell training is useful for gaining strength, I mentioned one particular study: a paper by Wade et al. that looked at kettlebell training with Air Force personnel.


At the time, I found the results pretty interesting. In particular:


Now, understand that the subjects were performing one-handed swings. They weren’t adding in any calisthenics movements, yet there was apparently some improvement, though not a statistically significant one.


Despite this, the two kettlebell training groups improved their maximal push-up performance to some degree, seemingly more than the PT group did.


Now, one would expect subjects doing PT training (typically a replication of the testing itself) to show improvement on maximal push-ups, but the kettlebell groups?

The problem was I took the results at face value.

I shouldn’t have.

You see, Wade’s findings are perfectly legitimate, but they should also be viewed somewhat skeptically. The reason? The sample size. From Wade’s article:

Thirty active-duty members between the ages of 18 and 40 years were recruited for participation in this study. Subjects were asked to complete a medical screening questionnaire and were cleared by the research medical monitor before participation. Only data from the subjects who completed both pretesting and posttesting were reported.

Thirty people.

And that was at the start of the study. Based on the table describing the pretest characteristics, it looks like only 20 completed the study.

So we’re looking at results for 20 folks total.

Now, let’s remember a key point about the results. This is also from Wade’s study:


Results: Twenty subjects completed the study. There were no statistically significant changes in 1.5-mile run time between or within groups. The 40-yard dash significantly improved within the KB swing (p ≤ .05) and KB + run group (p ≤ .05); however, there were no significant differences in the traditional PT group (p ≤ .05) or between groups. Maximal push-ups significantly improved in the KB + run group (p ≤ .05) and trends toward significant improvements in maximal push-ups were found in both the KB (p = .057) and traditional PT (p = .067) groups.

Note the difference in push-ups? The kettlebell + run group improved on push-ups despite not doing any. Fascinating, right?

The problem was, that particular group consisted of just six people (two males, four females).

Why does this matter?

Simple. With such a small sample size, it only takes one outlier to completely skew the findings.
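To make that concrete, here’s a quick toy simulation. The numbers below are entirely made up for illustration, not anything from Wade’s data; they just show how a single unusual score in a six-person group can drag the group’s average improvement far from where it would otherwise land:

```python
# Toy illustration only: hypothetical push-up improvements for a
# six-person group. None of these values come from Wade's study.

typical_group = [1, 2, 1, 0, 2, 1]    # modest, believable improvements
one_outlier   = [1, 2, 1, 0, 2, 12]   # same group, but one person bombed the pretest

def mean(values):
    return sum(values) / len(values)

print(f"Average improvement, typical group: {mean(typical_group):.2f}")  # ~1.17 push-ups
print(f"Average improvement, with outlier:  {mean(one_outlier):.2f}")    # 3.00 push-ups
```

One person, one bad pretest, and the group’s average improvement nearly triples. With a couple hundred subjects, that same outlier would barely move the needle.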

A few years back, a man by the name of John Bohannon made an awful lot of headlines. He also made an awful lot of science journalists look pretty damn stupid.

You see, Bohannon was part of an effort to illustrate a flaw with science writing, and he did it by taking part in a study that apparently found chocolate could help you lose weight. After a time, he wrote about that and how he pulled it off.

Part of what he and his partner did was use a small sample size.


I know what you’re thinking. The study did show accelerated weight loss in the chocolate group—shouldn’t we trust it? Isn’t that how science works?


Here’s a dirty little science secret: If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result. Our study included 18 different measurements—weight, cholesterol, sodium, blood protein levels, sleep quality, well-being, etc.—from 15 people. (One subject was dropped.) That study design is a recipe for false positives.


Think of the measurements as lottery tickets. Each one has a small chance of paying off in the form of a “significant” result that we can spin a story around and sell to the media. The more tickets you buy, the more likely you are to win. We didn’t know exactly what would pan out—the headline could have been that chocolate improves sleep or lowers blood pressure—but we knew our chances of getting at least one “statistically significant” result were pretty good.

Whenever you hear that phrase, it means that some result has a small p value. The letter p seems to have totemic power, but it’s just a way to gauge the signal-to-noise ratio in the data. The conventional cutoff for being “significant” is 0.05, which means that there is just a 5 percent chance that your result is a random fluctuation. The more lottery tickets, the better your chances of getting a false positive. So how many tickets do you need to buy?


P(winning) = 1 – (1 – p)^n


With our 18 measurements, we had a 60% chance of getting some “significant” result with p < 0.05. (The measurements weren’t independent, so it could be even higher.) The game was stacked in our favor.
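Just to spell out that “lottery ticket” math, here’s a minimal sketch of the same calculation. It assumes the measurements are independent, which Bohannon admits they weren’t, so the real odds were even worse:

```python
# Chance of at least one false positive when running n independent
# tests, each with a significance cutoff of p.
def prob_at_least_one_false_positive(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

# Bohannon's setup: 18 measurements at the conventional 0.05 cutoff.
print(f"{prob_at_least_one_false_positive(0.05, 18):.0%}")  # roughly 60%

# The more "tickets" you buy, the better your odds of a spurious "win":
for n in (1, 5, 18, 50):
    print(f"{n:>2} measurements -> {prob_at_least_one_false_positive(0.05, n):.0%}")
```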

Now, understand that I’m not accusing Wade or her colleagues of any such thing. I believe it was a good-faith effort to determine whether the kettlebell is a valid means of training military personnel.

However, I still shouldn’t have used the study.

The problem is that the p-value for push-up improvement is effectively meaningless in the grand scheme of things. One person having an off day during pre-testing could easily have skewed the numbers to show greater improvement with an exercise not even remotely similar to anything that had been trained.

Further, why would the kettlebell + run group show so much more improvement than the kettlebell-only and standard PT groups? The standard PT group was doing push-ups, for crying out loud! How does that make any sense?

That’s what had me racking my brain for the last few days, which is why I finally looked at the sample size, something I should have done in the first place and didn’t.

So why didn’t I?

To be honest, I think I checked out the results, read that there were 30 subjects (which Bohannon notes is the current cut-off for being taken seriously by many journals), and soldiered on because the results told me what I wanted to hear.

Obviously, that’s a mistake I shouldn’t have made. I didn’t do it consciously, but nonetheless…I need to own up to it.

Yet it also gives us a great opportunity to discuss one of the problems with some of these studies. So many of them use small samples, probably because it’s difficult to recruit people who will agree to train a certain way for a length of time in full compliance with the experiment’s protocols. A thousand people may like the idea of doing that experiment, but if only five will actually finish, your study is kind of hosed.

That’s why it becomes important for us, as people who use this information, to maintain skepticism. That’s especially true when a study tells us exactly what we want to hear.

However, here’s where I didn’t screw up completely.

You see, I used this in conjunction with other studies to support my position. In other words, I didn’t use this and only this to “prove” that kettlebells could get you strong.

Further, the small sample size doesn’t mean what she and her colleagues found wasn’t accurate. After all, she found what she found, right? It’s entirely possible that her findings weren’t the product of statistical noise. It’s even likely that they weren’t. Outliers, by definition, aren’t common.

That’s not to excuse my lack of understanding earlier, mind you. I’m only mentioning it because this newfound understanding doesn’t nuke my underlying point in that post.

But like I mentioned previously, studies are difficult when it comes to kettlebells. Frankly, so many of them are limited in so many ways. If they don’t betray the study designer’s lack of understanding of kettlebells, like the study by Otto and colleagues, they seem to lack a decently sized sample.

That said, I won’t deflect from my mea culpa on this. It’s my bad, and while Wade’s study is interesting and I may refer back to it, I’ll make it a point to mention the problems with the study going forward.

Author: Tom

Tom is a husband, father, novelist, opinion writer, and former Navy Corpsman currently living in Georgia. He's also someone who has lost almost 60 pounds in a safe, sustainable way, so he knows what he's talking about.
