Tuesday, 7 October 2014

More bad statistics

Every time I open a newspaper I seem to come across badly presented or misleading statistics. Often the reporting of statistical findings is so bad that it is impossible to work out what is meant. Here are two examples I have noticed recently of dubious statistical inference.

1. House prices in National Parks in England and Wales.
A report in The Times yesterday discussed the premium that people pay for a house in a national park, noting that, of all the National Parks in England and Wales, houses in Snowdonia carry the lowest premium over houses in the surrounding area. The findings were based on the average prices of houses in the various parks. I assume the 'average' being referred to is the arithmetic mean. The report concluded with the following puzzling statement:

Homes in the New Forest, Hampshire, were found to be the least affordable in the national parks, commanding the highest price premium ... (so far so good, I can understand that) ... with an average price starting at more than £500,000. 

What? An average price 'starting at'? Surely an average price is an average price; a set of data cannot have a range of averages!

So, do they actually just mean 'an average price of more than £500,000'? If so, then the 'starting at' is misleading. Or do they mean that they have worked out the average prices for a number of different categories of houses and the cheapest category has an average price of more than £500,000? That might be what is meant, but no other references to average prices in the article suggest this.

So, once again I am left not knowing what is being asserted by a statistical statement, other than a general sense that I probably could not afford a house in the New Forest!

2. Incidents of sexual abuse on trains in the UK.
A recent report in my daily newspaper invited me to be horrified at the rise in incidents of sexual abuse on trains. I am aware that this is, rightly, a sensitive subject, so let me state quite clearly that, of course, even a single case of sexual abuse on a train journey is one too many and should be condemned; and steps should be taken to prevent such a thing happening. But this report claimed that there had been a 20% increase in such cases over the five years from 2008 to 2013. It was at this huge increase that the reader was being encouraged to be horrified.

But the data provided in the article did not support the conclusion that there was a huge increase in such incidents. There may well have been, but there was no way of actually drawing this conclusion from the information given. For a start, there was insufficient detail about how the data was collected to know whether the comparison being used was valid. It could be, for example, that there have been social or procedural changes since 2008 that make it easier for victims of sexual abuse to report what had happened. 'More reported incidents' is not the same as 'more incidents'. An increase in the number of incidents being reported could be perceived as a good result because it could lead to a decrease in the number of incidents.

But there was also an inbuilt misunderstanding of the idea of a statistical variable in this newspaper report. Closer reading of the article revealed that they were just comparing the number of incidents reported in 2008 with the number reported in 2013. Is a rise of 20% from one to the other such a dramatic event as the headline suggested? Purely in statistical terms, no ... it is not, for two reasons.

First, no information was provided about the actual numbers being compared. So this means we have no idea as to whether this rise of 20% is significant or not. For example, if there had been only 5 incidents in 2008 and then 6 in 2013, that would have been a 20% increase, but somehow just one more incident does not seem like a dramatic increase requiring a headline.
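The arithmetic behind this point is easy to check. The counts below are my own hypothetical figures from the example above, not numbers from the article:

```python
# Hypothetical counts -- the article gave no actual figures.
incidents_2008 = 5
incidents_2013 = 6

increase = (incidents_2013 - incidents_2008) / incidents_2008
print(f"Rise: {increase:.0%}")  # Rise: 20%
```

A 20% rise sounds dramatic; "one extra incident" does not, yet here they are the same fact.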

Second, we are not told anything about how much variation there is in this statistic (the number of incidents per year). If, for example, there happens to be a very high variance in the number of incidents, then it could be that 2008 was one of the years towards the lower end of the range of values of this statistic and 2013 was one of the years towards the upper end. In the years in between, the statistic might have gone up and down quite a bit. The article asserted that there had been an increase over the five years, giving the impression that the number of incidents had been gradually going up year by year. But that conclusion cannot be drawn from the result of comparing 2008 with 2013. A cynical reader (who, me?) might even wonder whether the two years being compared had been chosen to generate the greatest possible difference and therefore the most dramatic headline.
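To illustrate how misleading an endpoint comparison can be, here is an entirely made-up yearly series (not data from the article) that bounces around with no steady trend, yet still yields the headline figure if you compare only the two end years:

```python
# Purely hypothetical yearly counts -- invented for illustration only.
counts = {2008: 50, 2009: 62, 2010: 48, 2011: 66, 2012: 45, 2013: 60}

# Comparing only the endpoints gives the headline-friendly rise:
endpoint_change = (counts[2013] - counts[2008]) / counts[2008]
print(f"2008 vs 2013: {endpoint_change:+.0%}")  # 2008 vs 2013: +20%

# But a different pair of adjacent years in the same series shows a fall:
adjacent_change = (counts[2012] - counts[2011]) / counts[2011]
print(f"2011 vs 2012: {adjacent_change:+.0%}")  # a fall of roughly 30%
```

In a series this noisy, the "+20%" says more about which two years were picked than about any underlying trend.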
