What is Needed to Ensure that the Boom in Big Data Lasts?

KONISHI Yoko
Fellow, RIETI

Year 2013: The dawn of big data

Until now, the general public has thought of data, statistical surveys, and statistics as difficult, mechanical, and old-fashioned. In 2013, however, terms like big data, statistics, and data scientist became commonplace, and they now sound advanced, even glamorous. Many people have decided that they need to become familiar with data and study statistics. Business magazines have put together special features to meet this demand, and authors are writing practical guides on statistics and data analysis. Many who work in statistics perhaps feel that their discipline is finally getting its day in the sun. However, there are also signs that this could be a transient boom, and some worry that interest will fade after a year, as happens with other fads.

There are two reasons for this concern. First, neither surveys nor statistics is exactly glamorous; it is the results that should be glamorous. The processes involved in obtaining those results (survey-based data collection, dataset construction, and analysis with statistical methods) require sustained, intensive work. Data are still just data, even if they are big. Indeed, as data grow in scale, analyzing them may raise more challenges and make these processes more complex. The second reason is that, frankly, there is still no clear answer as to how to use the big data in front of us or what kinds of techniques need to be developed to do so.

Characteristics of big data from social and economic activities

While there are various definitions of big data, the term generally refers to data too large to be stored on a conventional server or data management system. In addition, the enormous number of records and the variety of variables may be collected in differing data formats. For these reasons, big data are said to be difficult to process with common software when conducting data analysis. The spread of information technology supports our day-to-day activities and enables the collection and storage of such data, which are then used to shape business strategies and to create new systems and business opportunities. Familiar examples include supermarket point-of-sale (POS) data and ridership information for public transportation. Data that in the past would have been discarded because of high storage costs are now being retained without any sifting. This "without any sifting" is one way in which modern data storage differs from that of the past. In this respect, the following points should be borne in mind when conducting statistical analysis of big data.

1) At the time of data collection, it is hard to identify what one wants to analyze or what the data will reveal
If big data were widely available, researchers and analysts would want to get at them quickly and try to discover something, given how much larger they are in size and information volume than conventional data. However, no matter how big the data, analysts who begin a statistical analysis simply because the data exist will always find something missing, and in many cases their analyses must rest on strong assumptions because of limitations in the data.

2) Compared with conventional statistical surveys, there are more cases in which individuals cannot be identified
For example, although a massive amount of data can be generated from just one day of monitoring train passengers in the Tokyo metropolitan area, no personal information is available on passengers who use printed tickets instead of transit smart cards. Of course, an analyst could still compare day-to-day statistical indicators over time, but he or she would not be able to account for individual differences in, for example, ridership forecasts. If individuals could be identified, as they are in many government statistics, then not only differences across individuals but also the accumulation of data on the same individual over time could be used as information. One solution to this problem is the retail industry's ID-POS data, which use reward cards to tie individual characteristics to transaction records.
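As a rough illustration of the idea behind ID-POS data, the sketch below (with hypothetical column names and made-up records, not an actual retail dataset) links reward-card IDs in purchase records to a member registry, so that individual attributes can be attached to each transaction; purchases made without a card remain anonymous.

    # A minimal sketch (hypothetical column names) of how ID-POS data link
    # individual attributes to transactions via a reward-card ID.
    import pandas as pd

    # Anonymous POS records: one row per purchase
    pos = pd.DataFrame({
        "card_id": ["A01", "A01", "B02", None],   # None = no reward card presented
        "date":    ["2013-01-05", "2013-01-12", "2013-01-05", "2013-01-06"],
        "amount":  [1200, 800, 3400, 500],
    })

    # Reward-card registry: individual characteristics
    members = pd.DataFrame({
        "card_id": ["A01", "B02"],
        "age":     [34, 52],
        "gender":  ["F", "M"],
    })

    # Linking the two turns anonymous transactions into individual-level data;
    # purchases without a card (card_id is None) cannot be attributed to anyone.
    id_pos = pos.merge(members, on="card_id", how="left")
    print(id_pos)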

3) No data are included unless the subject acts
In the above example, data are collected only on people who ride the trains. Big data accumulate the results of events and actions in which individuals take part. If a person does not take part, or the action takes place in a region or store other than the one being monitored, that person's data are not included. For that reason, the analyst has to be aware of the sampling bias this creates when conducting statistical analysis, as the simulation below illustrates. This missing-value problem can also occur in conventional statistical surveys, but there it can be avoided by carefully selecting the survey population or through the design of the survey form.
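A minimal simulation of this point, with arbitrary assumed numbers rather than real ridership data: if only people whose preference for rail travel exceeds some threshold ever appear in the data, statistics computed from the observed records no longer describe the population as a whole.

    # A small simulation (assumed numbers, not real data) of the selection
    # bias described above: only people who act are recorded in big data.
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical population: willingness to commute by rail (arbitrary scale)
    preference = rng.normal(loc=0.0, scale=1.0, size=100_000)

    # Only people with a strong enough preference actually ride, so only they
    # appear in the ridership data.
    observed = preference[preference > 0.5]

    print("population mean:", preference.mean())   # close to 0
    print("observed mean:  ", observed.mean())     # well above 0 -> selection bias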

What economics does well

Government agencies collect statistics by surveying industries and individuals; their purpose is to learn and keep a record of national conditions and to help set policy. Researchers and businesses design survey forms and collect data according to their own interests. In other words, they decide first what they want to know, and then collect the information needed for that purpose within limited time and cost. This is in contrast to big data. In the field of economics, researchers have long conducted empirical analysis using such data. They spend a great deal of time deciding what they want to learn, what their hypothesis is, and what model they should build, which helps them take full advantage of limited data and limited opportunities for data collection. Moreover, to account for the impact that individuals' characteristics have on their economic activities, statistical techniques have been developed for panel data, in which individual attributes are linked to activity data; such techniques can be applied to the ID-POS data described in 2) above (a rough sketch follows below). Statistical techniques for policy evaluation address the problems described in 3), namely the missing-value problem and the sample selection problem that arises when people choose whether to participate in an activity. Many other statistical techniques have been developed to identify causal relationships or to fit economic phenomena, such as estimation methods for economic activities that are determined endogenously and simultaneously.
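As an illustration of the panel-data idea (a sketch under assumed, made-up numbers, not a method prescribed in this column), the snippet below applies a within, or fixed-effects, transformation: demeaning each individual's observations removes time-invariant individual characteristics before the relationship between the variables is estimated.

    # A minimal sketch of the panel-data idea: a within (fixed-effects)
    # transformation removes time-invariant individual characteristics.
    # Column names and values are hypothetical.
    import pandas as pd

    # Toy panel: repeated observations of the same individuals over time
    panel = pd.DataFrame({
        "person": ["i1", "i1", "i1", "i2", "i2", "i2"],
        "x":      [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
        "y":      [2.1, 4.0, 6.2, 7.9, 10.1, 11.8],
    })

    # Demean x and y within each person to sweep out individual fixed effects
    demeaned = panel.groupby("person")[["x", "y"]].transform(lambda s: s - s.mean())

    # OLS on the demeaned data gives the within (fixed-effects) slope estimate
    slope = (demeaned["x"] * demeaned["y"]).sum() / (demeaned["x"] ** 2).sum()
    print("within estimate of the effect of x on y:", round(slope, 2))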

Expectations of big data

The accumulation of such knowledge can also contribute to the analysis of big data. Disciplines other than economics have many field-specific statistical theories and techniques of their own. Yet even while recognizing their importance for advancing scholarship, we have not done well at sharing knowledge and conducting joint research across fields; obstacles include how problems are defined, the lack of common terminology, and differences in methodology. At the same time, once the technology for collecting and storing big data has matured to a certain level, it will be necessary to proactively explain its value to society. Going forward, as systems become established, the technology develops further, and progress is made toward data that record all the activities of people's lives (so-called lifelogs), analysis confined to any single academic field will likely run into limits. I expect that, with statistics as the common language, the existence of big data will help bring about the field-transcending, interdisciplinary research that people have long sought, and that new theories and statistical techniques will be developed as a result. These breakthroughs should in turn raise the value people place on big data, so that analysts can actively collect the information they need for their analytical objectives.

January 28, 2013

January 28, 2014