Selecting Proxies for Inputs with Limited Data in Data Envelopment Analysis

Model selection is an important issue in Data Envelopment Analysis. A specific case is choosing proxies for inputs or outputs when the required data are not available. When several candidates in the data could capture the characteristics of a theoretical variable, the researcher usually chooses a proxy based on experience. However, such choices are often regarded as subjective and lacking theoretical grounds. This paper adopts the benefit-of-the-doubt principle to explore systematic ways of selecting a proper proxy for an input or output. We observe that this line of literature selects a proxy by choosing the candidate that brings the data closer to the empirical production frontier. Following this line of research, this paper suggests three approaches for selecting a proxy from several candidates. When one candidate dominates the others as a proxy for a variable, our method selects it objectively. All approaches discussed in this paper are applied to three industries in China from 2017 to 2019. Three alternatives are considered as an input proxy for capital: total assets, non-current assets and current assets. Although non-current assets might be expected to be an appropriate proxy for capital, it is overwhelmingly outperformed by total assets and current assets. Since these three variables are the most commonly available proxies for capital in published data, our empirical results are valuable to applied researchers of the Chinese economy.

Keywords - Model selection; goodness-of-fit measure; selecting input/output proxy; Data Envelopment Analysis

I. INTRODUCTION

Model selection is an important issue in Data Envelopment Analysis (DEA). Since DEA is nonparametric in nature and relies heavily on linear programming techniques, conventional techniques of regression analysis cannot be applied to estimate the production technology and explore the properties of the model. Early researchers such as Golany and Roll (1989) discussed some general guidelines for selecting variables. Such guidelines are useful but incomplete. One issue has not been addressed in the literature: selecting a proxy for a theoretical variable from several choices. In empirical studies, researchers sometimes need to select a proxy for a theoretical variable from several competing candidates. Making such a decision is difficult, yet it is important for deriving correct policy implications. When there are several candidates for approximating a certain variable, researchers have no tools to guide the choice. For example, Stefko, Gavurova and Kocisova (2018) considered three candidate measures of medical devices: the number of computed tomography (CT) devices, the number of magnetic resonance (MR) devices, and the number of all medical devices. Although the results are similar in their case, problems arise when such candidates give different results. Some studies