Let’s assume, for the moment, that you want a device that measures your running power. Yes, there are reasonable questions and spirited debates that verge on the philosophical about what running power really means, and whether it offers anything that you couldn’t get from a GPS watch or a heart-rate monitor. But as I discussed in the March issue of Outside, lots of runners are leaving those questions behind and wondering instead about more practical issues—like which running power device they should spring for.

That’s what a research team at the University of Murcia in Spain, led by Jesús Pallarés, decided to explore in a new study published in the European Journal of Sport Science. They report no outside sponsorship and no conflicts of interest. (Neither do I.) They recruited 12 trained runners, outfitted them with equipment from the four main players in the running power market, and put them through a series of tests to assess how the various power meters performed.

The power meters they used were: a Stryd footpod linked to either a phone or a Garmin watch; a pair of RunScribe footpods linked to a Garmin watch; the Garmin Running Power app using a Forerunner 935 and a chest-mounted heart rate monitor equipped with accelerometers; and Polar’s watch-only estimate of running power. Bear in mind that because of the lag between experiment and publication, these likely aren’t the current versions of any of these devices.

The runners did four days of testing: two identical days on an indoor treadmill, and two identical days on an outdoor track. (The Polar device was only used outdoors, since it makes its estimates based on GPS data.) By comparing the data from nominally identical sessions, the researchers were able to calculate various measures of repeatability: if you measure the same thing twice, how close do you come to getting the same answer? This is obviously a pretty important characteristic if you want to base any training or racing decisions on your power data.

There are various ways to measure repeatability, and the Stryd device came out on top in all of them. For example, the coefficient of variation should generally be less than 5 percent to get meaningful data from exercise tests. In the outdoor tests, Stryd came in at 4.3 percent, compared to 7.7 percent for Garmin, 14.5 percent for Polar, and 14.8 percent for RunScribe. Even for Stryd, that variation was the equivalent of 12.5 watts, suggesting that you shouldn’t get too stressed if your power output fluctuates by a few watts from one day to the next.
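If you’re curious what that statistic actually captures, the math is simple enough to sketch. Here’s a minimal illustration in Python of how a between-day coefficient of variation is typically computed, using made-up power numbers rather than the study’s raw data (which isn’t public):

```python
import statistics

# Hypothetical average power (watts) from two nominally identical
# sessions, one value per runner -- NOT the study's actual data.
day1 = [285, 310, 272, 298, 305, 290]
day2 = [291, 302, 280, 289, 312, 284]

# Per-runner differences between the two days.
diffs = [a - b for a, b in zip(day1, day2)]

# A common between-day CV: the standard deviation of the day-to-day
# differences, scaled by sqrt(2) to recover the within-runner SD,
# divided by the overall mean and expressed as a percentage.
within_sd = statistics.stdev(diffs) / 2 ** 0.5
grand_mean = statistics.mean(day1 + day2)
cv_percent = 100 * within_sd / grand_mean

print(f"CV = {cv_percent:.1f}%")
# At Stryd's reported 4.3% CV, an average around 290 W implies roughly
# 0.043 * 290 ≈ 12.5 W of day-to-day noise, matching the paper's figure.
```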

The other set of tests involved comparing running power to oxygen consumption, or VO2, which is a proxy measure for how much energy you’re burning (at least during relatively easy running). Here, much as I’d love to avoid it, it’s worth dipping back into those arguments about the meaning of running power.

As I wrote in 2018, the concept of power has no useful intrinsic definition in running, since each stride consists of a mishmash of positive, negative, internal, and external power as your legs and arms swing backwards and forwards, your tendons stretch and recoil, and so on. Instead, what people think of as running power is basically an analogy to cycling power, where the power applied to the pedals has a consistent relationship to how much energy you’re burning and thus how sustainable your effort is. As a result, my conclusion in 2018 was that a running power meter is useful only insofar as it successfully tracks VO2—which, as it happens, was precisely what Stryd was trying to tune its algorithm to do.

Not everyone agrees with that definition. While reporting my recent magazine piece on running power, I went back and forth with an engineer at Garmin about the goal of its running power app. Their algorithm, they insisted, is not designed to track VO2. Instead, it’s designed to estimate the power applied by your foot to the road. I still can’t quite figure out why you’d care about that number in isolation, if it doesn’t also tell you something about how much energy you’re burning, like it does in cycling. Be that as it may, it’s worth noting that the VO2 tests below are only relevant if you think (as I do) that VO2 matters.

They did three sets of VO2 tests, each of which involved three-minute bouts of running separated by four-minute bouts of rest. The first test started at just under 11-minute mile pace and got progressively faster with each stage until the runners were no longer running aerobically (meaning that VO2 would no longer provide a useful estimate of energy consumption). The second test stayed at about 9:30 mile pace, but subsequent stages added vests weighing 2.5 then 5 kilograms. The third test, which was only performed indoors, varied the slope between -6 percent and +6 percent in five stages.

Here’s a set of graphs showing the relationship between running power (on the horizontal axis) and oxygen consumption (on the vertical axis) for each of the devices for the running speed test. If running power is indeed a good proxy for energy consumption at various speeds, you’d expect all the dots to fall along a nice straight line.

(Courtesy European Journal of Sport Science)

Once again, you can see that the Stryd data is pretty tightly clustered around the straight line. Their calculated standard error is 6.5 percent when connected to the phone app and 7.3 percent when connected to the Garmin watch. (For what it’s worth, I see no reason that the Stryd device should give different data based on what it’s connected to, so I assume those results are equivalent.) The picture gets a little uglier for the other devices: 9.7 percent for Polar, 12.9 percent for Garmin, and 14.5 percent for RunScribe.
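For the curious, those percentages are standard errors of the estimate from a linear fit: regress oxygen consumption against reported power, then measure how far the dots scatter from the line, as a fraction of typical VO2. Here’s a rough sketch of that calculation with invented numbers (again, not the study’s data):

```python
import numpy as np

# Hypothetical (power, VO2) pairs across the speed stages --
# illustrative values only, not the study's measurements.
power = np.array([210.0, 240.0, 265.0, 290.0, 320.0])  # watts
vo2 = np.array([33.0, 38.5, 42.0, 46.5, 51.0])         # ml/kg/min

# Fit VO2 as a linear function of power.
slope, intercept = np.polyfit(power, vo2, 1)
predicted = slope * power + intercept

# Standard error of the estimate: RMS residual with n - 2 degrees of
# freedom, expressed as a percentage of mean VO2.
residuals = vo2 - predicted
see = np.sqrt(np.sum(residuals**2) / (len(vo2) - 2))
see_percent = 100 * see / vo2.mean()

print(f"SEE = {see_percent:.1f}% of mean VO2")
```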

When you vary the weight or the slope, the Stryd remains just as consistent, with standard errors of 6.3 and 6.9 percent, respectively. But the other devices don’t handle it as well, particularly when the slope varies: Garmin’s standard error balloons to 19.0 percent and RunScribe’s to 18.5 percent. Polar doesn’t get a score for slope at all, since its GPS-based estimate doesn’t work on the treadmill.

A side note: Polar does reasonably well in the VO2 test, and it’s worth pausing to understand why. The other three devices all use accelerometers to estimate the accelerations and forces of your feet smacking into the ground, then feed that data into an algorithm that essentially estimates VO2. Polar skips the middleman entirely: it doesn’t even try to estimate forces and accelerations. It just uses the speed measured by your GPS and the slope measured by a barometer, along with other personal data you’ve entered. In a sense, it’s taking my claim that running power is only useful as a VO2 estimator to its logical conclusion—though calling its calculation a “power” seems a little cheeky.
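To make that concrete, here’s what a purely speed-and-grade power calculation might look like. To be clear, this is a hypothetical sketch of the general approach, not Polar’s actual formula (which is proprietary), and the coefficients are placeholder assumptions of my own:

```python
def estimated_running_power(mass_kg, speed_mps, grade):
    """Illustrative speed-and-grade power estimate in watts.

    NOT Polar's proprietary algorithm -- just a sketch of the idea:
    skip force measurement entirely and compute a power number from
    pace, slope, and body mass alone.
    """
    # Assumed flat-ground cost of ~1.0 W per kg per (m/s), roughly in
    # line with the numbers footpod power meters report.
    flat_cost = 1.0
    # Climbing adds the rate of change of potential energy (per unit
    # mass and speed, that's g * grade); descents are credited at an
    # assumed reduced rate, since braking downhill still costs energy.
    g = 9.81
    grade_term = g * grade * (1.0 if grade >= 0 else 0.5)
    return mass_kg * speed_mps * (flat_cost + grade_term)

# A 70 kg runner at 4 m/s (about 4:10/km), flat vs. a 6 percent climb:
print(estimated_running_power(70, 4.0, 0.0))   # ~280 W
print(estimated_running_power(70, 4.0, 0.06))  # ~445 W
```

The point of the sketch is simply that nothing in this kind of calculation ever measures a force: the “power” number is pace and gravity dressed up in watts.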

A couple of other caveats to consider. One is that they forced everyone to maintain the same cadence (based on their individual cadences during an initial familiarization run) throughout all the test sessions to “improve the quality of the repeatability.” This strikes me as bizarre: one of the main points of the study was to find out how repeatable the measurements were, so eliminating one of the potential sources of variation sort of defeats the purpose. Maybe one of the devices gives terrible data when you change your cadence due to natural variations in pace or slope, while the others handle it fine. If so, that would be worth knowing.

The other caveat, as I mentioned above, is that all of these devices and algorithms continue to evolve. My article in the print magazine focused on how the latest Stryd devices can now measure and account for wind, a pretty cool new feature that arrived too recently to make it into this study, and Stryd’s competitors aren’t standing still either. So this isn’t the final word on the topic. But for now, if you’re in the market for a running power device—and if what you really mean by that is a consistently repeatable estimate of oxygen consumption—this data suggests that Stryd is your best bet.