Data is the newest of a number of things that we seem to prefer open rather than closed. We have always liked open minds and open hearts better than their opposites, and we like open doors (at least when outside looking in) more than closed ones. Open data, however, is the contemporary spearhead of a more recent line of developments whose timeline includes open source software, open hardware and open access scientific literature. In all these cases our desire for openness expresses the inadequacy of our established ways of coping with new realities.
Advances in information technology have lowered the costs of all aspects of data generation and management, and open data is the best-formulated project of this new data reality. The broader societal impact of this technological change has caused the barriers between content producers and content consumers to practically vanish, leading to a world of big data and social media, a largely digital world. The implication for research is that we can now access much more data than before: on the one hand we are 50 years into the digitization process, and on the other much of the data is now born digital (in step with the arrival of digital natives, who live large parts of their lives online rather than off). Sadly, this does not mean we are anywhere near having sufficient data. The state of economic measurement, for example, lags terribly behind technology, as can be seen in our dire inability to nowcast the economy.
Empirical research has never been as important or attractive as it is today. The popular press's ever-increasing appetite for research results pollutes the fact pool with misunderstandings, flat-out wrong statements, fabricated results, stylized facts, ideology and all kinds of inaccuracies. To keep it all trustworthy, consistent and meaningful, and to prevent it from imploding, we need to reform our ways with regard to data. We need to invent ways to link data to research papers, we need persistent identifiers for data and data citation conventions, and we need the data to be available, which is the minimum of what open should mean.
Data suitable for empirical research is diverse and heterogeneous and cannot all be opened in the same way. In mathematics we do not require that the notes taken during research be published, because once the theorem is written down it stands to be judged on its own merits. In empirical research, by contrast, we need to keep and curate the data, document how we collected it, and provide metadata and paradata. We need to describe the collection method in sufficient detail to make the resulting variables meaningful and the research results based on them worthy of scientific discourse.
In short, open data should be the criterion that separates scientific assertions from scientific hearsay. Whenever the data is not available to the research community under the same terms and conditions as it was to the authors of a research paper, the paper is scientific hearsay at best. When the data is available, we have scientific research containing scientific assertions worthy of debate and discourse.
The problem of open data will not be solved at once and may never be solved completely. It is encouraging, however, that just as the most exciting data is being amassed in closed, proprietary troves in the form of big data, we are starting to comprehend the importance of open data, or at least of available data.