Crowdsourced data is not a substitute for real statistics

Guest Beneblog by Patrick Ball, Jeff Klingner, and Kristian Lum

After the earthquake in Haiti, Ushahidi organized a centralized text messaging system to allow people to inform others about people trapped under damaged buildings and other humanitarian crises. This system was extremely effective at communicating specific needs in a timely way that required very little additional infrastructure. We think that this is important and valuable. However, we worry that crowdsourced data are not a good data source for doing statistics or finding patterns.

An analysis team from the European Commission's Joint Research Centre (JRC) analyzed the text messages gathered through Ushahidi together with data on damaged buildings collected by the World Bank and the UN from satellite images. They then used spatial statistical techniques to show that the pattern of aggregated text messages predicted where the damaged buildings were concentrated.

Ushahidi member Patrick Meier interpreted the JRC results as suggesting that "unbounded crowdsourcing (non-representative sampling) largely in the form of SMS from the disaster affected population in Port-au-Prince can predict, with surprisingly high accuracy and statistical significance, the location and extent of structural damage post-earthquake."

One problem with this conclusion is that there are important areas of building damage where very few text messages were recorded, such as the neighborhood of Saint Antoine, east of the National Palace. But even the overall statistical correlation of text messages and building damage is not useful, because the text messages are really just reflecting the underlying building density.

Benetech statistical consultant Dr. Kristian Lum has analyzed data from the same sources that the JRC team used. She found that after controlling for the prior existence of buildings in a particular location, the text message stream adds little to no useful information to the prediction of patterns of damaged building locations. This is not surprising, as most of the text messages in this data set were requests for food, water, or medical help, rather than reports of damage.

In fact, once you control for the presence of any buildings (damaged or undamaged), the text message stream seems to have a weak negative correlation with the presence of damaged buildings. That is, the presence of text messages suggests there are fewer (not more) damaged buildings in a particular area. It may be that people move away from damaged buildings (perhaps to places where humanitarian assistance is being given) before texting.
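To illustrate what "controlling for the presence of buildings" means, here is a minimal sketch on synthetic data. It is not the Haiti data and not necessarily the method Dr. Lum used; the grid cells, the rates, and the helper `partial_corr` are all assumptions made up for the example. In this toy world both damage and text messages simply scale with the number of buildings in a cell, and the messages carry no extra information about damage.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    Z = np.column_stack([np.ones(len(z)), z])            # intercept + control
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]    # residual of x given z
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]    # residual of y given z
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
n_cells = 2000  # hypothetical map grid cells

# Synthetic building counts per cell (assumed distribution, for illustration only).
buildings = rng.poisson(rng.gamma(shape=2.0, scale=25.0, size=n_cells)).astype(float)

# Damage and SMS volume each depend only on how many buildings a cell has.
damaged = rng.binomial(buildings.astype(int), 0.3).astype(float)
sms = rng.poisson(0.05 * buildings).astype(float)

raw = np.corrcoef(sms, damaged)[0, 1]
partial = partial_corr(sms, damaged, buildings)

print(f"raw correlation, SMS vs damage:       {raw:.2f}")
print(f"partial correlation given buildings:  {partial:.2f}")
```

In this simulation the raw correlation is strong while the partial correlation is close to zero: the messages look predictive only because they track building density. The real Haiti data went one step further and showed a weak negative relationship, which this simple sketch does not attempt to reproduce.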

Here's the bottom line: based on the Haiti data presented, if you have a map of buildings from before the earthquake, you already know more about the likely location of damaged buildings than an SMS stream would tell you. That is, to find the most damaged buildings, you should go to where there are the most buildings! The text message stream doesn't help the decision process. Indeed, it would seem slightly more likely to lead you to areas that have fewer damaged buildings. Crowdsourcing has many valuable uses in a crisis, but identifying spatial patterns of damaged buildings isn't one of them.
