RIETI - Can Big Data Change Official Statistics: Learning from advanced overseas cases

Date	March 14, 2019
Speaker	KONISHI Yoko (Senior Fellow, RIETI)
Moderator	MUKAI Kentaro (Deputy Director, General Coordination Office, Research and Statistics Department, METI)
Language(s)	Japanese
Announcement	With the accumulation of large amounts of data and the development of new technologies as a consequence of the dissemination of big data, artificial intelligence (AI) and IoT technologies in recent years, there is a growing interest in applying big data to official statistics. What degree of benefit would be seen in statistical surveys and their accuracy if technology were applied to data collection in order to reduce statistical survey costs, or if private sector big data and administrative record information could be used in creating statistical indicators? In this report, advanced examples in the UK, the Netherlands and Singapore are introduced as outcomes of overseas fact-finding tours made as a part of METI's 2018 big data project, along with discussions about potential advantages or challenges in applying private sector data to official statistics, which is a topic METI is currently working on.

Summary

Introduction

Today, I would like to talk about the positive effects that big data and AI-related technologies can have on official statistics, which have been prominent in the news media since 2018, and to offer a bright outlook for the future. Specifically, I will introduce initiatives undertaken by the world's leading big data user countries, based on an overseas investigation report we made as part of the "2018 Project by the Ministry of Economy, Trade and Industry (METI) for Undertaking/Reviewing Results of the Current Survey of Commerce Using Big Data and Developing New Indices," in addition to some results from the Ministry's "New Index Development Project."

Tailwind for big data, AI, and statistics booms

First, to help explain the recent growth of interest in official statistics, I would like to describe their current circumstances. AI has been a subject of study since as early as the 1950s. However, at that time there were no computers capable of running the algorithms that had already been developed, and even once computers became capable enough, there was insufficient data to advance the study of AI further until the early 2010s. Machine learning, which is one approach to AI, has been flourishing since the 1990s, but it encountered the same problem: the amount of available data was too small to qualify as "big data," which is what has driven the intense boom in machine learning today. During those years, researchers pursued R&D on deep learning and other topics, laying the basis for the current AI boom. In 2012, a deep learning technology won first prize in an image recognition contest. As a result, AI attracted the world's attention, which triggered the third AI boom in Japan starting in 2013. The AI boom was fueled by the big data boom that occurred in the same period, in 2012.

In 2012, after a long period of disarray, AI, machine learning, and big data booms started to move in a concerted way, generating a growing interest in statistics. An increase in the number of data users resulted in more people paying attention to official statistics, which had previously been of interest only to a limited number of experts.

Currently, a number of factors—the dissemination of AI and IoT technologies, development of new technologies through the use of big data, and the new demand from industries wishing to apply these technologies—provide unprecedented opportunities, and a significant boom has been observed, along with its impact on society as a whole. With this in mind, in 2014, METI launched a big data project. I have participated in the project since 2016, and act as its Chair for this year. The project team aims to further develop official statistics and create new, data-based businesses in Japan by linking big data and new technologies.

Difficulties in producing official statistics

Recently, there has been a growing demand for more accurate official statistics. At the same time, the survey environment has been deteriorating. This means that it has become difficult to maintain the quality of official statistics using only traditional statistical survey approaches, which rely on data obtained from such sources as households and reports submitted by corporations. For example, the rapid emergence of new types of businesses, represented by the sharing economy, has made it difficult for official statistics producers to identify and classify industrial structures in a timely manner. Further, as a result of changes in corporate activities (e.g. servicification of the manufacturing industry and manufacturing by the service industry), it has become difficult to identify industrial structures based on the existing industry classification approach. Some companies have no specific office address and can be contacted only by email or mobile phone. This diversification has made it challenging to collect questionnaire responses and to identify types of operations using conventional approaches.

In response to this situation, our project team has been working to use big data and new technologies in the production of official statistics. The longevity of this project is partly attributable to the government's mention of big data as a new source of data for official statistics in the Basic Policy for the Fundamental Reform of Economic Statistics, which was adopted at a meeting of the Council on Economic and Fiscal Policy in December 2016.

How will the official statistics workflow change?—Initiatives under the project

We, the project team, are considering the roles to be played by big data and new technologies in the current workflow of statistical surveys (see the figure below). There is considerable scope for applying new technologies to the current workflow. To give a few examples among many, we can use: 1) AI technologies to input data and find outliers in questionnaire responses and aggregate data; 2) data science technologies to aggregate and process data; 3) robotic process automation (RPA) to tabulate data or transform data into graphs and text; and 4) digital dashboards to publish results for the convenience of users.

The project team also undertook an initiative which replaces the paper- and web-based questionnaire responses with big data held by private sector data vendors. Specifically, after obtaining approval from the Minister of Internal Affairs and Communications in July 2018, we applied this new survey approach to METI's Current Survey of Commerce and conducted the Survey as one of the General Statistical Surveys in accordance with the Statistics Act. The results were published in February 2019. Although it was applied only to a part of the data collection process, as shown in the chart below, we succeeded in developing a new survey approach based on approval from the scheme owner, which means that we have taken the first major step forward.

[Figure: Workflow of a statistical survey. Labels recovered from the figure: Receiving/Organizing; Data input; Questionnaires; Adjustment of questionnaire data; Aggregation; Summary examination; Finalization of aggregated data; Tabulation/Transformation into graphs and text; Analysis/Secondary processing/Editing]

Fact-finding tour to countries that are most advanced in using big data

We embarked on an overseas fact-finding tour in December 2018. On the tour, we visited multiple cities and organizations in the UK, the Netherlands, and Singapore. These countries were selected from among those that had referred to the promotion of digital government and the use of big data; had examples of empirical studies; and implemented or used big data in official statistics.

During interviews, we checked whether they were engaged in the development of additional human resources responsible for the production of statistics, as well as systems put into place to better the analytical and publishing processes. The term "big data" here refers to both data held by private sector companies as well as administrative record information held by the government.

Use of big data in official statistics

First, in relation to our test survey, in which the private sector's big data was aggregated and used in official statistics instead of paper- and web-based responses, we conducted interviews on the use of big data in official statistics and the implementation status of statistical surveys. From the interviews, we found that while big data was used to determine a portion of the consumer price index (CPI), none of the three countries was engaged in an initiative designed to replace the survey itself with big data.

On the other hand, we identified many cases where big data was partly used for the creation of indices.

In the UK, the government created the UK House Price Index (UK HPI) as an official statistic in cooperation with multiple institutions. Other examples included the use of Google Street View information to measure the extent of greening, as well as the use of ships' transportation data to forecast GDP.

In the Netherlands, the government works in partnership with private sector firms during the index-development period and uses their big data (e.g. mobile phone location information) free of charge for research purposes. The government's policy is that if the big data contains any information that is highly accurate and useful, it should be approved as official statistics.

In Singapore, we found that administrative record information was used in a highly advanced way. Tax information was anonymized so that it could be used for statistics production purposes, even within a limited environment. However, with private sector think tanks and research companies already conducting extensive surveys and rapidly publishing a wide range of indices, the government appeared less motivated to use private sector big data for official statistics.

Development of human resources in the field of statistics production

In terms of human resource development, the UK was the most advanced among the three countries. The UK government provided a detailed definition of "data scientist" and established the Data Science Campus within the Office for National Statistics (ONS) in 2017. On the Campus, 40 data scientists work as faculty to educate 500 data scientists by 2021. One of the characteristics of the Campus is its learning environment, in which experts are provided with advice on their career path and scope of work after the completion of the course, so that they can learn without worrying about these issues.

In the Netherlands, the interview session was conducted in the form of a workshop to allow participants from both sides to mutually report on their activities. Most of the officials who participated from the Netherlands' side had a PhD degree in statistics, physics, or economics, indicating the high level of research ability held by the staff working in the field of statistics.

The Singapore government has concluded an MOU with the National University of Singapore (NUS), setting a goal of making government employees digitally literate, by teaching basic digital literacy skills to all of them and providing data analysis/data science training to 20,000 of them by 2023.

In short, we found that all three countries were engaged in the development of data scientists by setting high numerical targets and concluding MOUs with academic and public research institutions.

Framework for producing official statistics

In the UK, as a result of recent revisions to statistics-related laws, access to administrative record information, tax information, and private sector data for the purpose of producing statistics is now allowed. In conjunction with this, relevant organizations have been joined by privacy protection and inter-organization data transmission legal experts. In this context, personnel who can act as coordinators between experts have become valuable. These staff members must have sufficient knowledge to understand technical terms as well as strong communication skills. In Japan, it is often the case that one single person plays a number of roles. However, in the UK, coordinators are valued and their positions are highly regarded, which leads to people working smoothly as a team.

The Netherlands established the Center for Big Data and Statistics (CBDS) in 2016, and concluded partnerships with 45 organizations, including the University of Amsterdam, Leiden University, private sector companies such as IBM and Microsoft, and foreign bureaus of statistics. In addition, the government employs statistics officials who have a PhD in statistics, physics, or economics and are capable of performing advanced statistical analyses, including the use of AI technologies. They are active in conducting analyses that are useful for the development of new statistical indicators and policies. The results of the analyses are proactively released as beta versions on the CBDS's website.

In Singapore, Data.gov.sg was established in 2014 as an organization that reports directly to the Prime Minister, separately from the existing Department of Statistics. While the government pays less to its statistics officials than GAFA, it competes with the big four tech companies in terms of HR development, and offers women-friendly working conditions to attract talented data scientists.

Improving the method of publishing statistical information

When producing official statistics, most of the time is spent finalizing the survey results. However, in order to disseminate the results, it is indispensable to develop systems designed to improve the publication method. We investigated each country's efforts on publication methods, an area in which Japan seems to be lagging behind.

The UK seemed less proficient at disclosing information online, but once statistical surveys were produced, they were translated into different source codes and shared among relevant personnel.

The Netherlands is very active in analyzing and publishing statistics, and effectively undertakes PR activities using a variety of media, including the website, Facebook, Twitter, Instagram, RSS, newsletters, and videos. The background to this is that, the Netherlands being a multi-ethnic country, a wide range of languages is spoken there, and images and pictures can therefore sometimes be more effective than words for communicating information. In addition, because some generations are not accustomed to accessing paper-based media or websites, the government is eager to use social network services to disseminate information. While information providers may be required to perform multiple tasks, the continued use of different media seems to be important in publishing survey results nationwide. Another major characteristic of the country is that the government sticks to in-house production with the aim of preventing outsourcing-derived complexities from interfering in daily work.

Singapore actively uses data visualization. It has published 1,691 datasets and 13 APIs on GovTech's Data.gov.sg portal, in a manner that is friendly to those who use them for statistical analysis and data creation. The government saves development time and costs by assigning an in-house team to the system development and by using open source software. One of its soon-to-be-achieved goals is to provide statistical data immediately in an easy-to-use state (namely, in an integrated data format) upon a user's request, just like water coming out of the faucet.

Using big data as data sources for the Current Survey of Commerce: Test survey

As mentioned earlier, we conducted a test survey in which the Current Survey of Commerce was partly replaced with big data and published as official statistics. None of the three countries we visited has conducted such an initiative, indicating that Japan is one step ahead of them. In the following paragraphs, I would like to introduce our initiatives.

First, in 2017, we created the "Scanner data-based Index of Commerce at Large-scale Specialty Retailers for Home Electric Appliances" for the purpose of identifying weekly sales trends using POS (point of sale) data collected by large-scale electric appliance retailers. In cooperation with GfK Marketing Service Japan (hereinafter "GfK"), we developed a sales trend index by collecting, cleaning, and aggregating data based on the same standards as those of existing commercial trend statistics. The index differed slightly from existing statistics but captured the trends almost as accurately. In view of the success of this attempt, we conducted a statistical survey using a new approach, which was designed to obtain information from POS data collected by large-scale electric appliance retailers and to develop questionnaire data for the electric appliances category in the Current Survey of Commerce. In the past, each survey participant company submitted questionnaire responses to METI. Under the new approach, however, the questionnaire data is input by entrusted private enterprises that already do data business with the survey participant companies. After the scheme was approved by the Minister of Internal Affairs and Communications in July 2018, the survey was conducted as a General Statistical Survey in accordance with the Statistics Act. Its results were published in February 2019. This may seem to be a small step, but it is a significant one. Now that private sector companies holding big data have been authorized as reporters for official statistics, the following will become possible: 1) reduction in the burdens on survey participants (reporting companies); 2) product classification using big data; 3) more detailed area classification; 4) more timely and accurate data aggregation/publication; and 5) creation of business opportunities for data vendors.

Characteristics of test survey results (Category: large-scale electric appliance retailers)

The advantages of creating a sales trend index for the Current Survey of Commerce using POS data include the following: 1) increased frequency of data aggregation (weekly instead of monthly); 2) acceleration of publication; 3) more flexible aggregation than the Standard Industrial Classification (because POS data is classified by commodity) including the availability of quantity-based information in addition to price-based information; 4) reduction in the burdens on survey participants; and 5) more efficient performance of statistics tasks.
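As a rough illustration of the first and third advantages, weekly aggregation by commodity can be sketched as follows. This is only a minimal sketch under stated assumptions: the record layout, the sample figures, the use of ISO weeks, and the function names `weekly_totals` and `index_series` are all hypothetical and are not taken from the project itself.

```python
from collections import defaultdict
from datetime import date

# Hypothetical POS records: (sale date, commodity, sales value in yen, quantity).
pos_records = [
    (date(2018, 7, 2), "air_conditioner", 120000, 1),
    (date(2018, 7, 4), "air_conditioner", 240000, 2),
    (date(2018, 7, 10), "air_conditioner", 540000, 4),
    (date(2018, 7, 11), "refrigerator", 150000, 1),
]


def weekly_totals(records):
    """Sum sales value and quantity per (ISO year, ISO week, commodity)."""
    totals = defaultdict(lambda: [0, 0])
    for sale_date, commodity, value, quantity in records:
        iso_year, iso_week, _weekday = sale_date.isocalendar()
        key = (iso_year, iso_week, commodity)
        totals[key][0] += value
        totals[key][1] += quantity
    return {key: tuple(pair) for key, pair in totals.items()}


def index_series(totals, commodity, base_key):
    """Express weekly sales value as an index where the base week equals 100."""
    base_value = totals[base_key][0]
    return {
        key[:2]: 100.0 * value / base_value
        for key, (value, _quantity) in sorted(totals.items())
        if key[2] == commodity
    }


totals = weekly_totals(pos_records)
ac_index = index_series(totals, "air_conditioner", (2018, 27, "air_conditioner"))
```

Because each record carries both a value and a quantity, the same grouping yields value-based and quantity-based series, which is the flexibility that commodity-level POS data provides.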

The test survey results obtained in the project will make it possible to aggregate data into statistics tables on electric appliance sales trends on a weekly basis, which may lead to earlier publication of statistics. In addition, users will be able to view each commodity's sales records by prefecture, in more detailed commodity breakdowns. Furthermore, users will be able to obtain e-commerce-based sales data, which could not be separated previously. In the following sections, I will introduce new indices that relate to the use examples.

Examples of using test survey results and creation of new indices

For example, users can analyze weather data and sales trends by reviewing the weekly air conditioner sales data. In 2018, Japan recorded the highest temperature after mid-July since 1964, when weather statistics were first developed. In a typical year, sales of air conditioners reach a peak only once in early July; however, we found that there were two peaks in 2018—the first one in early July and the second and higher one in late July.

Similarly, in December 2018, a large-scale cashback campaign was run for product purchases. If a year-on-year comparison had been done using only nationwide monthly data, the impact of the campaign would have been visible only at the national level, with no finer granularity. In fact, the comparison was made using prefectural weekly data, which showed that the impact was most evident in Tokyo. As this shows, when data is categorized by period, area, or commodity, users can measure the impact of an event, policy, natural disaster, etc. in a more detailed manner.
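The difference between a national comparison and a prefectural one can be sketched in a few lines. The figures below are invented purely for illustration, the function names are hypothetical, and matching the same week number across years is a simplifying assumption rather than the method used in the survey.

```python
# Hypothetical weekly sales values (million yen), keyed by (prefecture, year, week).
weekly_sales = {
    ("Tokyo", 2017, 50): 800.0,
    ("Tokyo", 2018, 50): 1200.0,
    ("Osaka", 2017, 50): 500.0,
    ("Osaka", 2018, 50): 550.0,
}


def yoy_change(sales, prefecture, year, week):
    """Year-on-year change (%) for one prefecture and week; None if data is missing."""
    current = sales.get((prefecture, year, week))
    previous = sales.get((prefecture, year - 1, week))
    if current is None or not previous:
        return None
    return 100.0 * (current - previous) / previous


def yoy_by_prefecture(sales, year, week):
    """Year-on-year change (%) for every prefecture observed in the given week."""
    prefectures = sorted({p for (p, y, w) in sales if y == year and w == week})
    return {p: yoy_change(sales, p, year, week) for p in prefectures}


def national_yoy(sales, year, week):
    """Year-on-year change (%) after summing all prefectures."""
    current = sum(v for (p, y, w), v in sales.items() if y == year and w == week)
    previous = sum(v for (p, y, w), v in sales.items() if y == year - 1 and w == week)
    return 100.0 * (current - previous) / previous if previous else None


by_prefecture = yoy_by_prefecture(weekly_sales, 2018, 50)
national = national_yoy(weekly_sales, 2018, 50)
```

With these invented figures, the national rate sits between the two prefectural rates, so a concentration of the effect in one prefecture is visible only in the disaggregated view.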

The new indicator, which I helped develop together with GfK, uses each electric appliance product's country-of-origin information. This enabled us to identify whether a product was made in Japan or in another country, so we calculated the ratio of made-in-Japan products to imports on both product-by-product and monthly bases. Because the data was POS-based, we were able to make computations based on both sales value and sales quantity. As a more advanced attempt, we focused only on domestic products and estimated the amount of domestic production using the data on made-in-Japan products purchased by consumers. Specifically, we estimated the final Indices of Industrial Production (IIP) using domestic products' sales value and sales quantity data for eight items included in the Consumer Electric and Electronic Appliances category. Since we were using sales data, we assumed that the data lagged the production period by one month. We then compared the trends and found that the estimate could be used as a "nowcast" of final IIP figures and that the publication date could be earlier, if only slightly.
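The two calculations described here, the made-in-Japan share and the one-month lag between production and sales, can be sketched as follows. The talk does not describe the estimation method itself, so this covers only the share computation and the lag alignment. The data layout, the sample figures, and the function names are hypothetical.

```python
# Hypothetical monthly POS aggregates for one product category.
# month -> {origin: (sales value, sales quantity)}; "JP" marks made-in-Japan products.
monthly_sales = {
    "2018-01": {"JP": (300.0, 30), "other": (700.0, 100)},
    "2018-02": {"JP": (450.0, 45), "other": (550.0, 80)},
}


def domestic_share(month_data):
    """Return (value share, quantity share) of made-in-Japan products."""
    jp_value, jp_quantity = month_data["JP"]
    total_value = sum(value for value, _quantity in month_data.values())
    total_quantity = sum(quantity for _value, quantity in month_data.values())
    return jp_value / total_value, jp_quantity / total_quantity


def shift_month(month, lag):
    """Return the 'YYYY-MM' label that lies `lag` months before `month`."""
    year, month_number = map(int, month.split("-"))
    index = year * 12 + (month_number - 1) - lag
    return f"{index // 12:04d}-{index % 12 + 1:02d}"


def production_proxy(sales, lag=1):
    """Attribute domestic sales quantity in month t to production in month t - lag."""
    return {shift_month(month, lag): data["JP"][1] for month, data in sales.items()}


shares = {month: domestic_share(data) for month, data in monthly_sales.items()}
proxy = production_proxy(monthly_sales)
```

The resulting proxy series could then be compared with the published IIP for the same months. How closely the two track each other is an empirical question that this sketch does not answer.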

Future prospects and challenges

The advantages of using private sector big data include the following: 1) publication can be accelerated; 2) the frequency of aggregation can be increased; 3) aggregation can be done more flexibly than under the Standard Industrial Classification, as the data is commodity- or behavior-based; 4) the burden on survey participants can be reduced; and 5) statistical tasks can be performed more efficiently. The disadvantages, on the other hand, are the following: 1) it is difficult to control accuracy and bias; and 2) continuous availability of data cannot be guaranteed, because private sector data holders may merge or go bankrupt. These disadvantages do not apply to administrative record information (e.g. tax, registration, and vehicle-inspection information), as it is collected by public agencies.

If the project's test survey is upgraded from a General Statistical Survey to a Fundamental Statistical Survey, Japan may become the leading country in the field of using big data in official statistics. To achieve this, it is necessary to strengthen cooperation between relevant ministries and private sector companies, obtain administrative record information and big data at low cost, and use them more actively. We consider that there is an urgent need to enhance the implementation structure, provide training and educational opportunities, and develop human resources, including the use of external staff.

Conclusion

What is the answer to the title question, "Can big data change official statistics"? I think the answer is yes. To change official statistics using big data, we need to actively learn from overseas cases and continue the efforts to use data that we have on hand. While it may be possible and meaningful to use private sector big data and administrative record information in statistical surveys, such use requires budgetary reallocation, additional procedures, and appropriate human resources. However, what's more important is the eagerness to use big data, the discovery of talented people who have new and unique ideas, and an environment that is supportive of these people. I strongly hope that in the future Japan can play a leading role in this field.

Q&A Session

Q: I would like to ask about the use of data obtained from private sector service providers. I think that it is often the case that a private company produces statistics and sells the results. What do you think about the division of roles between the government's official statistics and private sector statistics? Has this topic been discussed in other countries?
A: Recently, the media has highlighted the private sector's data businesses. Some argue that official statistics may be unnecessary if the government encourages competition between companies and collects reliable data from the winner, or that the production of official statistics may be outsourced to private sector companies. However, I think that at this point in time, there is a difference in quality between the government's official statistics and commercial statistics produced by private firms in accordance with their customers' needs. Singapore seems to have a realistic approach to the division of roles. For example, short-period statistics such as monthly or weekly reports on economic trends are published by the private sector, because shopping malls have a significant amount of commercial data and think tanks have high analytical capabilities. On the other hand, the government produces official statistics with longer publishing periods.
Q: When producing Fundamental Statistics, necessary data is provided free of charge, because data providers are legally required to do so. However, in the case of General Statistics, the government needs to pay outsourcing fees to private sector companies. What measures are taken in other countries to address cost-related issues?
A: We did not ask questions about actual costs incurred to conduct respective surveys. We hope we can carry out additional research on this point in the future. In the UK and the Netherlands, companies are legally required to submit data if their cooperation is requested by the bureau of statistics for the purpose of producing official statistics. Our investigation found that the bureau obtains data based on agreements, or using amicable approaches. It seemed to me that these laws played a significant role in supporting the statistics production team.
Q: I think that there is useful big data that remains untapped. Could you tell us whether there are any underused areas of data that could be used for official statistics?
A: In other countries, we did not find any cases in which official statistics were directly replaced by private sector statistics. However, they were quick to use extremely detailed big data in policy development. For example, they decided the type and location of new schools based on data about the school distribution that was best for both parents and children. That data was obtained by combining data on actual commuting-to-school distances with data on the subjects being learned by the children. In Singapore, where traffic congestion and terrorism are major concerns, they actively used traffic volume, car movement, and parking lot data (whether state-run or privately run) to predict traffic jams, plan new roads, and prevent terrorist attacks by identifying unusual patterns of congestion. Japan is also trying to measure power demand using smart data.

*This summary was compiled by RIETI Editorial staff.