more data

Empirical scientists (especially economists) are typically forced to work with too little data from which they want to extract too much. On the one hand, this is because they think simplicity is important. In the best of cases this is a legacy inherited from information theory, but it often degenerates into believing that an economy “consists of” 2-3 aggregates such as GDP and the like. On the other hand, of course, they have no choice but to do so: data that fits the empirical questions is hard to come by. This is reflected, for example, in the use of “proxies” in place of the actual variables, and in their varying proximity to what they are supposed to proxy. In addition, intuition (the “story”) is used to fill the gap that the lack of data opens. A good example is the research around so-called network effects. If you want to know what effect friendship has on such things as unemployment, salaries, life satisfaction etc., you are basically on your own, because there is no data which could help you with this, other than, say, proxying friendship with “Facebook friendship” and collecting the rest yourself. The situation is better in some fields. So what can be done to remedy this?

One major step would be to have no publicly financed project in any modern society without a plan for data generation, retention and re-usability: secondary data use should be a primary concern. In other words, if you are going to build anything, start a program, design your administrative data systems (tax, social security,…) etc., consult empirical scientists and ask them to help you answer such questions as: Where do I need what kind of measurements, how often, and in what detail, for which possible purposes? Which data do I keep, and for how long? And last but not least: how do I avoid making the data useless for purposes I cannot think of now? This is an issue of maximizing return on investment. The complexity of modern societies is such that not doing this is simply negligent.

The fact that a preliminary and often inaccurate quarterly GDP figure is known with a six-week delay, in societies whose functioning relies critically on information, is similar to a nuclear power plant that knows the temperature of its core with a delay of two days.

Empirical scientists have so far fought their fight on the front of access to existing data, but I do not think this is enough. I think they should now target the lawmaker and demand that projects have built-in concepts for data production and retention. It is no longer just about getting access to existing data but about influencing what kind of data is produced and retained.

