Abstract:
· The granularity and multidimensionality of big data offer advantages to economists in identifying economic trends as they occur (« nowcasting »), testing previously untestable theories of agent behavior, and creating a set of tools for manipulating and analyzing this data.
· Economists face three challenges: gaining access to this data, ensuring the replicability of analyses based on it, and developing the technical skills to manipulate it.
· Greater integration of training in computer science and advanced statistics therefore appears to be a public policy priority, as does the development of public research laboratories focused on big data.
· Similarly, closer collaboration between companies that possess massive amounts of data and researchers working on Big Data would be highly beneficial for the advancement of the discipline of economics.
« Big Data. » Behind this highly publicized term lies the emergence of unprecedented volumes of data over the last decade, from digital processes to social media exchanges to the Internet of Things (systems, sensors, mobile devices, etc.).
To give a sense of scale, the capacity to store information has grown exponentially in recent decades (the average information storage capacity per capita has doubled every 40 months[1] since the 1980s), and in 2012 some 2.5 exabytes (2.5×10^18 bytes) of data were generated every day[2].
These unprecedented new volumes of data are already opening up new opportunities in fields ranging from genetics (drastic reduction in human genome sequencing time) to social sciences (notably with all the data coming from social networks), as well as business analytics software and forecasting algorithms (e.g., Google’s search engine, Apple’s auto-complete feature, online advertising services, insurer risk scoring, credit card company underwriting activities, etc.).
– A concrete example given by Liran Einav and Jonathan Levin[3] (Stanford University) shows the scale of this phenomenon in the Internet age: in retail, before the emergence of the Internet, data collection was often limited to daily sales, broken down by product at best; now, thanks to data obtained through optical scanning and online commerce, the buyer’s entire journey and behavior can be traced, from their purchase history to their queries and exposure to advertising.
Similar examples can be provided for inventories, online transactions, or even public service data (taxes, social benefit programs, etc.), as well as employment data (online employment giants such as LinkedIn and Monster.com aggregate data on job titles, candidates’ skills, their previous employers, their professional level, etc. which has the advantage of providing an unprecedented level of granularity compared to traditional survey data).
But beyond the debates on what this « new asset class » (as the World Economic Forum calls it) can bring to the economy, we may wonder how this data can improve the way we analyze economic activity, and how the development of new methods of data analysis and predictive modeling in statistics and computer science can be useful in economic analysis.
As Taylor, Schroeder, and Meyer[5] show, economics’ place at the intersection of academia and applied business sciences, as well as its substantial theoretical and methodological corpus, make it an ideal candidate for the use of larger and richer data sets while maintaining the robustness and representativeness that characterize the discipline.
– As Schroeder[6] describes, Big Data applied to economics represents a radical change in the scale and scope of resources (and the tools to manipulate them) available for the subject of study; this definition differs from the more practical one used in the business world, where the concepts of « volume, variety, and velocity » of data help to build a competitive advantage.
The application of Big Data to economic analysis
The application of Big Data to economic analysis could therefore be associated with the concepts of:
1) « multidimensionality »: in terms of the number of variables per observation, the number of observations, or both
2) « granularity »: massive data sets often provide microeconomic data that is useful for analyzing the behavior of agents.
8 advantages for research and economic policy:
1) Improving the monitoring and forecasting of economic activity at the government level. Central and local governments collect vast amounts of microeconomic administrative data in areas such as tax collection, social programs, education, and demographics, among others.
2) Expanding the scope of panel analyses. The growing influence of this data can be illustrated by the increasing resonance of articles using it, notably that of Piketty and Saez[7], whose analysis of Internal Revenue Service (IRS) data to highlight economic inequalities has sparked numerous debates on the subject.
3) A level of periodicity and granularity often higher than that of traditional survey data. New tools that monitor private-sector economic activity, sometimes in near real time (such as MIT’s Billion Prices Project[8], which collects prices from several hundred online retail sites to obtain an accurate proxy for inflation, or MasterCard’s SpendingPulse[9] tool, which tracks household consumption via credit card payments), are powerful instruments for following economic activity at a finer temporal and product-level resolution than traditional surveys allow.
4) Proxies for economic indicators. Indirect measures such as online searches or social media posts can also serve as proxies for economic indicators such as employment or household confidence (see, for example, Choi and Varian’s paper[10] on using Google Trends to « predict the present », suggesting that Google searches for a specific product accurately reflect demand for that product; a minimal sketch of such a nowcasting regression appears after this list). The availability of « real-time » data can thus offer an advantage in terms of « nowcasting », that is, identifying economic trends as they unfold.
5) A significant improvement in measurement. The gradual availability of large-scale administrative and private data could provide better ways of measuring economic effects through more extensive and granular data, particularly with regard to the behavior of individual agents (what MIT’s Brynjolfsson[11] calls « nano-data »); the sheer size of the new databases could also ease the statistical problem of limited observations and make analyses more robust and precise.
6) A better understanding of the effects of different policies and economic shocks. These new data could encourage economists to ask new questions and explore new research topics in areas as varied as labor market dynamics (Choi and Varian[12]), the effects of preschool education on future income (Chetty et al., see below), stock market dynamics (Moat et al.[13]), and the functioning of online markets (Einav et al.[14]). The ability to combine different databases also broadens the scope of research, as shown, for example, by the study by Chetty, Friedman, and Rockoff[15], which links administrative data on 2.5 million New York schoolchildren with their incomes as adults twenty years later to show the « added value » of having had a « good » teacher; in this case, the high level of granularity in the data makes it possible to match individual school test scores with the corresponding tax records for a large sample, which would have been impossible with aggregated data or a smaller sample. Many aspects of individual behavior, such as social relationships (using data from social networks) or geolocation, could also become easier to observe and analyze. The example of Scott Keeter[16] of the Pew Research Center, who proposes using data collected on social networks to supplement or even replace public survey data, clearly illustrates this idea.
7) Enabling « natural experiments ». For example, switching from weekly data to much higher-frequency data (down to the minute), or to data on individual consumers or products, can reveal details or variations at the micro level that would be more difficult to isolate and exploit with more aggregated data. The study by Einav, Farronato, and Levin[17], which analyzes pricing and sales strategies on the Internet, is a concrete example of the advantage of using granular data to obtain rich information about the individuals studied and to explore a variety of consequences of a given experiment (e.g., substitution toward other products in the event of a price change). These advantages are particularly interesting for businesses, especially online platforms, for which experimentation is becoming ever easier and cheaper: they can run granular, personalized pricing strategies and rely on increasingly automated methods for capturing (and applying) the results of these experiments.
8) New opportunities could also arise from new statistical and machine learning techniques[18], which can help build more robust predictive models, particularly in empirical microeconomics. The study by Einav, Jenkins, and Levin[19] is an example of the use of Big Data techniques in predictive modeling to incorporate heterogeneity into an econometric model; in this study, predictive modeling techniques are used to construct « credit risk scores » that help the researchers model consumer borrowing behavior and how lenders should price loans and set borrowing limits for different types of borrowers based on their risk of default. Capturing heterogeneity through big data techniques and new statistical methodologies could also become advantageous for many other sectors, because it allows going beyond the measurement of « average effects » and linking measurable heterogeneity to specific treatment effects and optimal policies. The example of the Safeway grocery chain[20], which offers each customer specific discounts based on individual price elasticities, shows the growing ability of companies to go beyond simple aggregate elasticities in their pricing policy and to develop algorithms that estimate the elasticity and optimal prices specific to each type of consumer (a sketch of this kind of per-segment elasticity estimation appears after this list). The same applies to governments in the design of economic policy, with the possibility of developing policies that are more tailored to users (e.g., health policies tailored to the medical environment and patient characteristics, education policies tailored to the level, the teacher, or the mix of students, etc.).
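To make point 4 more concrete, here is a minimal sketch, in Python, of the kind of « nowcasting » regression popularized by Choi and Varian: an autoregressive baseline for a monthly indicator, augmented with a contemporaneous search-volume index. The file and column names (claims_and_trends.csv, claims, trends_index) are hypothetical placeholders, and the specification only illustrates the spirit of the approach rather than reproducing their paper.

```python
# Illustrative nowcasting regression in the spirit of Choi and Varian:
# an AR baseline for a monthly indicator, augmented with a search index.
# File name and column names are hypothetical placeholders.
import pandas as pd
import statsmodels.api as sm

# claims: official monthly indicator (e.g., initial unemployment claims)
# trends_index: search-volume index for a related query, same frequency
df = pd.read_csv("claims_and_trends.csv", parse_dates=["month"], index_col="month")

df["y_lag1"] = df["claims"].shift(1)    # y_{t-1}
df["y_lag12"] = df["claims"].shift(12)  # y_{t-12}, seasonal lag
df = df.dropna()

# Baseline: the indicator regressed on its own lags only
base = sm.OLS(df["claims"], sm.add_constant(df[["y_lag1", "y_lag12"]])).fit()

# Augmented: add the search index, which is available in near real time
aug = sm.OLS(df["claims"], sm.add_constant(df[["y_lag1", "y_lag12", "trends_index"]])).fit()

# If the search data carry a signal, the augmented model should fit better,
# both in-sample and (more importantly) out-of-sample.
print(base.rsquared, aug.rsquared)
```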
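The personalized-pricing logic mentioned in point 8 can be sketched in the same spirit: a log-log demand regression estimated separately for each consumer segment yields a segment-specific price elasticity, from which an illustrative markup rule can be derived. The dataset, column names, and marginal cost below are hypothetical, and the sketch deliberately ignores the endogeneity issues a serious study would have to address.

```python
# Illustrative per-segment elasticity estimation and pricing rule.
# The transaction file and its columns (segment, price, quantity) are hypothetical.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

tx = pd.read_csv("transactions.csv")  # columns: segment, price, quantity

elasticities = {}
for seg, grp in tx.groupby("segment"):
    # Log-log demand: the coefficient on log(price) is the price elasticity.
    fit = smf.ols("np.log(quantity) ~ np.log(price)", data=grp).fit()
    elasticities[seg] = fit.params["np.log(price)"]

# Illustrative monopoly markup rule (Lerner): with constant marginal cost c
# and elasticity e < -1, the profit-maximizing price is p = c * e / (1 + e).
c = 2.0  # hypothetical marginal cost
for seg, e in elasticities.items():
    if e < -1:
        print(seg, round(c * e / (1 + e), 2))
```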
Challenges and caveats
However, while these new databases and statistical techniques open up many opportunities, they also present many challenges for economists.
1) Access to the data: a significant portion of the new data that researchers work with belongs to companies (which aggregate it from their customer base), and the benefit these companies derive from researchers’ insights into their data is not always commensurate with the cost of disclosing it.
2) The unstructured nature of the data, which poses an econometric challenge: simply disentangling the dependencies between the series studied is the most significant technical difficulty with this type of data and requires the development of new regression tools (see the variable-selection sketch at the end of this list).
3) The need for economists using this data to develop new skills—specifically in advanced software and languages (SQL, R) as well as machine learning algorithms—in order to be able to combine the conceptual framework of economic research with the ability to apply ideas to massive databases. The highly publicized profession of « data scientist, » which involves analyzing data to find empirical models, lies at the intersection of computer science and econometric analysis. Extracting and synthesizing different variables and searching for relationships between them will therefore become an important part of economists’ work and will require new skills in computer science and databases.
4) As this article has described, we can expect the emergence of Big Data to significantly change the landscape of economic research and policy. However, this development cannot replace economic theory; as Sascha Becker[21] argues, the usual practice of economic forecasting (theory – simulation – calibration – forecasting) remains unchanged because « we need theory to understand the mechanisms or at least to suggest what we would hope to find in the first place. » Indeed, even if big data is very useful for detecting correlations, including subtle correlations that an analysis of smaller databases might miss, it does not tell us which ones are relevant; likewise, the sheer volume of data can produce « misleading » correlations between series that have nothing in common. In short, Big Data cannot replace the theoretical research phase: no economic problem can be solved by simple « data crunching », and there is always a need to understand the problem at hand beforehand.
5) The ability of Big Data and its associated statistical techniques to reduce large data sets to a single statistic gives only an appearance of accuracy and therefore does not replace in-depth scientific analysis.
Over-reliance on big data can even have adverse effects, as these databases often aggregate data that was collected in different ways and for different purposes. This risk is particularly significant for data collected from Internet searches, as the example of Google Flu Trends shows: Harvard statistician Kaiser Fung[22] attributed its relative failure largely to the way the data were collected. Over-reliance on web data has even given rise to what Marcus and Davis[23] call an « echo chamber effect », with Google Translate results being based on translations of Wikipedia pages… and vice versa. Another danger lies in the growing difficulty of reproducing the data and programs behind research papers as data becomes ever more massive, as Barry Eichengreen[24] points out in Project Syndicate, calling for a greater focus on the historical analysis of economic phenomena rather than on the development of increasingly sophisticated statistical methods.
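As an illustration of the « new regression tools » mentioned in challenge 2, and of the variable-selection methods surveyed in the paper cited in note [18], the sketch below applies a cross-validated LASSO to a setting with far more candidate predictors than observations; the data are simulated purely for illustration.

```python
# Illustrative variable selection with the LASSO on simulated data:
# 500 candidate predictors, 200 observations, only 5 of them relevant.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]  # only the first 5 predictors matter
y = X @ beta + rng.normal(scale=0.5, size=n)

# Cross-validated LASSO shrinks most coefficients exactly to zero,
# leaving a sparse set of selected predictors.
lasso = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(lasso.coef_)
print("selected predictors:", selected)
```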
Recommendations
As described in this article, there are many advantages to using Big Data for economic analysis. In terms of public policy and education recommendations, the « gold mine » of Big Data is fully in line with the exponential development of new information and communication technologies in everyday life and is yet another argument for developing computer science education, particularly in university programs in economics and sociology. The recent inclusion of a « Big Data » module in the CFA® exam is just one illustration of this trend[25]. The development of public research laboratories focused on Big Data could also help remedy the under-representation of this field among researchers.
Similarly, closer collaboration between researchers and companies possessing mass data would be beneficial to all stakeholders, allowing companies to benefit from external perspectives and essential decision-making assistance, while economists would benefit from usable « material » for developing new models and testing new theories.
[6] http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780199661992.001.0001/acprof-9780199661992-chapter-11
[11] http://digital.mit.edu/bigdata/agenda/slides/Brynjolfsson%20Big%20Data%20MIT%20CDB%202012-12-12.pdf
[18] Readers interested in these new techniques may consult Hal Varian’s paper, « Big Data: New Tricks for Econometrics » (http://pubs.aeaweb.org/doi/pdfplus/10.1257/jep.28.2.3), in which he describes in depth the new tools for analyzing and manipulating massive data, such as new methods for selecting variables (given that there are more potential predictors), as well as new ways of modeling complex relationships (using machine learning techniques such as decision trees, support vector machines, neural networks, deep learning, etc.).
