Tips and tricks to automate data for DDJ projects

15/07/2018

When you wish to create an automation project that relies on data, there are several points you should not miss before going deeper into the process. Because it is both a technical and a journalistic challenge, I have developed an assessment methodology that aims to address both. This framework was tested within a web application that provides real-time news about air quality in Brussels, which was also developed for a wider investigative project. Because it is better to prevent than to cure, and because any DDJ project needs to be fed accurate and reliable data, here are a few simple ways to make sure the data will fit your journalistic needs. The big deal is not the process but the data.

First of all, you have to assess the data quality in a formal way. The purpose here is to ensure that the data can be automated once collected. Check that you have the right to use the data, and that there are no encoding problems, no HTML overload (if the data are collected from the web), no duplicates, no missing values, and that the labelling is explicit and consistently spelled. If standards are not followed, it will be up to you to correct the data. If metadata exist, they will in many cases tell you how to interpret the data (values and labelling).
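These formal checks are easy to script. As a minimal sketch (the station names and values below are hypothetical, standing in for a freshly collected dataset), a few lines of pandas can surface duplicates, missing values and inconsistent labelling in one pass:

```python
import pandas as pd

# Hypothetical sample of collected data: note the duplicate row,
# the missing values and the inconsistent spelling of "Molenbeek".
df = pd.DataFrame({
    "station": ["Molenbeek", "Molenbeek", "molenbek", "Uccle", None],
    "no2":     [41.0, 41.0, 38.5, None, 25.0],
})

report = {
    # Fully identical rows, a frequent artifact of repeated scraping
    "duplicate_rows": int(df.duplicated().sum()),
    # Missing values per column (the NULL problem discussed below)
    "missing_values": df.isna().sum().to_dict(),
    # Case-folded label variants: more variants than real stations
    # is a sign of orthographic incoherence to fix before automating
    "label_variants": sorted(df["station"].dropna().str.lower().unique()),
}
print(report)
```

Running a report like this on every new delivery, before any automation step, turns the yes/no quality questions into something you can verify mechanically.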

From a journalistic point of view, you will have to ask yourself whether it is a primary source (reliability), whether the dataset is appropriate and complete, and whether it is relevant for your story. Accuracy, correctness and precision are the three other criteria to meet. The "completeness" indicator may be harder to assess because of the NULL value, which can be interpreted in several ways: the data exist but are not known, the data are not relevant for that variable, the data are relevant but do not exist, or the value simply equals zero.
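The ambiguity of NULL is easiest to see in raw data. In this sketch (the CSV content and field names are invented for illustration), three rows carry an empty field, an "N/A" marker and a literal zero; a naive load would blur them together, yet they mean three different things:

```python
import csv
import io

# Hypothetical raw CSV: one empty field, one "N/A", one genuine zero.
raw = "station,pm10\nUccle,\nMolenbeek,N/A\nIxelles,0\n"

def interpret(value):
    """Classify a raw field into one of the NULL interpretations."""
    if value == "":
        return "unknown"         # data exist but are not known
    if value.upper() in {"N/A", "NA"}:
        return "not_applicable"  # variable not relevant for this row
    return float(value)          # an actual measurement, possibly zero

readings = {row["station"]: interpret(row["pm10"])
            for row in csv.DictReader(io.StringIO(raw))}
print(readings)
```

Deciding explicitly, per dataset, which of these meanings a blank or a zero carries is part of assessing completeness; the answer usually sits in the metadata, or requires a call to the data provider.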

Most of these assessments can be answered with a simple yes or no. But you will have to dig a bit deeper to examine, for example, the relationship between the data provider and the primary source of the data (reliability and trustworthiness), and whether the data are fit for automation given the problems you have detected and the ways to solve them. Above all, ask yourself first what the added value of your automation project is in journalistic terms.

When data are provided in real time, this framework can be completed by an understanding of the data lifecycle, because data evolve over time and how they evolve depends on the application domain. For example, air quality data only become a fixed average after 24 hours. Whether the data are provided in real time or not, don't forget to record them: the history of the values can lead to a good story. And if the data are a total mess, that in itself may tell you something about how the data producer manages them (a good story could be hidden there too).
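Recording that history can be as simple as appending a timestamped snapshot to a CSV file each time you collect. In this sketch, `fetch_air_quality` is a stand-in for a real API call (the endpoint, station and field names are hypothetical); run on a schedule, it builds exactly the kind of value history that can later reveal a story:

```python
import csv
import datetime
import os
import tempfile

def fetch_air_quality():
    """Stand-in for a real API call; station and fields are hypothetical."""
    return {"station": "Molenbeek", "no2": 41.0}

def record_snapshot(path):
    """Append the current reading with a UTC timestamp, creating the file if needed."""
    reading = fetch_air_quality()
    reading["retrieved_at"] = datetime.datetime.now(datetime.timezone.utc).isoformat()
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["retrieved_at", "station", "no2"])
        if new_file:
            writer.writeheader()
        writer.writerow(reading)

# Two collections, e.g. an hour apart when run from a scheduler such as cron.
history = os.path.join(tempfile.mkdtemp(), "air_quality_history.csv")
record_snapshot(history)
record_snapshot(history)
```

Appending rather than overwriting is the key design choice: even if the provider only ever publishes the latest value, your archive keeps every intermediate reading, including the provisional ones that later get revised into a fixed average.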


To sum up: get to know your dataset well. If you see problems, think about how you can solve or prevent them (it is always better to prevent than to solve), then automate. As data evolve over time (even when they are not provided in real time), keep a record of them to give yourself a better chance of catching a good story.