### Study region and period

The lessons and experience from China are valuable for pandemic control at an early stage. In particular, the overlap between the traditional Chinese Spring Festival, with its corresponding national holiday, and the development of the COVID-19 epidemic in China offers a “natural experiment” on how the epidemic evolves. This overlap enables us to separate cases of imported transmission from cases of local transmission, which is essential to our analysis.

In the first wave of the large-scale outbreak, the COVID-19 epidemic in China in early 2020 can be divided into three stages. The first stage is the period before the Spring Festival (Chinese Lunar New Year), i.e., before January 24, 2020, in which the epidemic gradually worsened in Wuhan and began to spread just before the national vacation break. The second stage, from January 24 to January 30, 2020, is the period of the Spring Festival vacation, which was later officially extended to February 2; some companies extended it even further, to February 9. This division arises naturally from the holiday vacation period, which involves the largest scale of human migration in the world. Although the pattern of this temporary migration is rather complicated, in general, people who work in big cities travel to their hometowns, typically small cities or rural areas, before the holiday and return to the big cities afterward. From a social perspective, this migration pattern is easily understood, but when it is combined with the spread of COVID-19, the resulting transmission pattern becomes much harder to trace.

In the first stage, COVID-19 infection spread from a single epicenter, Wuhan in Hubei Province, to various locations in China. Wuhan is the epicenter, where most cases of COVID-19 infection were confirmed. Other cities in Hubei, of which Wuhan is the provincial capital, can be considered the first ring in the transmission belt. Most of the people who left Wuhan are located in this region. According to official reports, more than 5 million people rushed out of Wuhan before the city was locked down^{37}. Because of the limited time window and transportation modes, most of them could not go very far, so they remained in other cities in Hubei.

In the second stage, the whole country was alerted to the spread of the novel coronavirus, and many cities in China became subject to strict regulations. Mobility was therefore reduced to a minimal level, as people were asked to stay home to avoid possible infection, which simplifies our analysis. In other words, if new cases of COVID-19 infection were confirmed in that period, they are most likely due to local transmission from travelers who had returned from Wuhan before the city was locked down.

However, the real challenge comes in the third stage, when containment of the virus was necessary to avoid further spread. If people infected with SARS-CoV-2 in small cities or rural areas return to big cities across the country, the outbreak can evolve from a single epicenter to multiple epicenters, which would greatly complicate containment.

As shown in Fig. 1a, the transmission of COVID-19 in the first stage is relatively simple, with only one epicenter in China, i.e., Wuhan. In this figure, *Wu* is Wuhan, *B* is the metropolitan agglomeration of Beijing and Tianjin, *S* is the metropolitan agglomeration of Shanghai, *G* is the metropolitan agglomeration of Guangzhou and Shenzhen, and *C* is the metropolitan agglomeration of Chengdu and Chongqing. These four urban agglomerations, together with the Wuhan region, are the most populated regions in mainland China.

In addition, N1–N4 in Fig. 1 represent the many midsize and small cities, as well as rural areas. As illustrated in Fig. 1b, the possible transmission pattern of COVID-19 in the third stage becomes much more complicated. With just four examples of small cities shown in the illustration, the transmission network is already very complex. In the real world, small cities might number in the hundreds or even thousands, making the containment of SARS-CoV-2 infection even more difficult. Please note that Fig. 1a,b have no probabilistic meaning or structure and are not based on real-world data. They are simple graphical illustrations of how the pandemic would evolve, and the general transmission pattern they show can be applied to other infectious diseases in any other region.

That being the case, why are we so concerned about the second stage? As of January 23, 2020, Wuhan had already been locked down, and if the local transmission in that period is low, then we do not need to worry too much about a possible second wave in China. However, if the transmission in the second stage is severe, then concern over the third stage is warranted.

### Study design

This study matches reported information on the epidemic with the characteristics of cities in China that had COVID-19 cases in the first wave of the outbreak. The introduction of urban characteristics to the analysis of the COVID-19 epidemic has already been discussed comprehensively^{36}. In the first wave of the epidemic in China, for most of the cities outside Hubei province, the number of infections stabilized in early February 2020, so later numbers (many of which are newly imported cases) do not affect our analysis. However, the scale of infection in Hubei province, especially in Wuhan city, was not clear until early June 2020, after a city-level PCR test for COVID-19. Therefore, to show the spread of infection, our data set does not include the data from Hubei province.

To introduce spatial analysis, we add geographic information, such as the GPS coordinates of the cities. We obtain the number of confirmed COVID-19 cases from DingXiang Yuan^{38} at 8:19 a.m. Beijing time, on February 10, 2020. We use this particular date so that we can isolate the scenarios of epidemic transmission in stage 2 from those in stage 3 mentioned earlier. In addition, data on urban characteristics come from the China City Statistical Yearbook 2018^{39}. Finally, the GPS coordinates of the cities are from *Google Earth*. The distance between cities is then calculated using the haversine formula, in which the earth’s radius is set as 6371 km^{40}.
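
The distance calculation described above can be sketched as follows. This is a minimal illustration of the standard haversine formula with the 6371 km radius stated in the text; the function name and the city coordinates are illustrative assumptions (approximate values, not the coordinates used in the study).

```python
import math

EARTH_RADIUS_KM = 6371.0  # earth radius as stated in the text


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # Haversine of the central angle between the two points
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


# Approximate, illustrative coordinates (assumptions, not the study's data)
wuhan = (30.59, 114.31)
beijing = (39.90, 116.41)
d_wuhan_beijing = haversine_km(*wuhan, *beijing)
```

With these approximate coordinates, the Wuhan to Beijing distance comes out at roughly a thousand kilometers, which matches the order of magnitude one would expect.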

Table 1 lists the descriptive statistics of all the variables used in this study. We only use data on cities outside Hubei below. Because Wuhan, the capital of Hubei, was the epicenter in China in the first wave of the outbreak, its reported number of COVID-19 cases lagged owing to the technical difficulties of laboratory confirmation in the early stage^{36}. This is common: when a large-scale outbreak occurs, the reporting of symptomatic cases is usually subject to delay^{41}.

### Statistical analysis

In this section, we apply a set of spatial statistical models. Just as time-series techniques deal with dependence over time, spatial statistical models deal with correlation across space. The two most fundamental and frequently used model specifications are as follows^{42,43}.

$$y = \rho \times W \times y + X \times \beta + e,$$

(1)

$$y = X \times \beta + u, \quad u = \lambda \times W \times u + e.$$

(2)

Equation (1) is the spatial mixed autoregressive model (SAR), and Eq. (2) is the spatial autoregressive error model (SEM). Although more advanced spatial techniques, such as the three-dimensional spatial weight matrix, were discussed later^{40}, these two fundamental specifications are sufficient for this study. In these equations, *y* is the dependent variable, and *X* is a vector of explanatory variables. *W* is the spatial weight matrix, each element of which is typically the reciprocal of the distance between two cities *i* and *j*, i.e., \(1/d_{ij}\). If the data set has *n* cities, *W* is *n* by *n* (for a computer-based simulation of *W*, see Fig. 2, which illustrates the spatial weight matrix commonly used in spatial econometrics). Because travel between cities can take place by car, bus, train, airplane, or even ship, travel times between the same pair of cities can differ by mode. Distance is therefore a simple, general way to represent the spatial relationship, and it might be more appropriate in our setting than travel time, which is also commonly used in current GIS techniques.
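
As a minimal sketch of the inverse-distance construction described above, the following builds *W* from a matrix of pairwise distances. The function name, the toy distances, and the optional row standardization are assumptions for illustration; the text does not state whether *W* is row-standardized in this study.

```python
import numpy as np


def inverse_distance_weights(dist, row_standardize=False):
    """Build W with w_ij = 1/d_ij for i != j and zeros on the diagonal.

    Assumes distinct cities, i.e. all off-diagonal distances are positive.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    if dist.shape != (n, n):
        raise ValueError("dist must be a square n-by-n matrix")
    w = np.zeros_like(dist)
    off_diag = ~np.eye(n, dtype=bool)
    w[off_diag] = 1.0 / dist[off_diag]
    if row_standardize:
        # Optional (assumption): scale each row to sum to one
        w = w / w.sum(axis=1, keepdims=True)
    return w


# Hypothetical pairwise distances (km) among three cities
dist = np.array([[0.0, 100.0, 200.0],
                 [100.0, 0.0, 400.0],
                 [200.0, 400.0, 0.0]])
W = inverse_distance_weights(dist)
W_std = inverse_distance_weights(dist, row_standardize=True)
```

Nearer cities receive larger weights, so in this toy example the first city is weighted twice as heavily toward the second city as toward the third.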

In addition, *e* and *u* are stochastic error terms. Finally, the most important components of these equations are *ρ* and *λ*, which measure the spatial dependence of the dependent variable and the error term, respectively. Simply speaking, spatial dependence, also known as the spillover effect, can appear in either the dependent variable or the error term. In practice, the parameters are often estimated by maximum likelihood estimation (MLE) or by the generalized method of moments (GMM).
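
To make the roles of *ρ* and *λ* concrete, the sketch below solves Eqs. (1) and (2) for *y* on toy data using their reduced forms, \(y = (I - \rho W)^{-1}(X\beta + e)\) and \(y = X\beta + (I - \lambda W)^{-1}e\). It is not the estimation procedure (MLE or GMM) used in this study; the function names, the toy data, and the row-standardized *W* are all assumptions made for illustration.

```python
import numpy as np


def sar_reduced_form(W, X, beta, rho, e):
    """Solve y = rho*W*y + X*beta + e for y, i.e. y = (I - rho*W)^{-1}(X*beta + e)."""
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - rho * W, X @ beta + e)


def sem_reduced_form(W, X, beta, lam, e):
    """Solve u = lam*W*u + e for u, then return y = X*beta + u."""
    n = W.shape[0]
    u = np.linalg.solve(np.eye(n) - lam * W, e)
    return X @ beta + u


# Hypothetical toy setup: 5 "cities" on a line with inverse-distance weights,
# row-standardized (an assumption) so that |rho| < 1 keeps I - rho*W invertible.
rng = np.random.default_rng(0)
n = 5
pos = np.arange(n, dtype=float)
dist = np.abs(pos[:, None] - pos[None, :])
W = np.zeros((n, n))
off_diag = dist > 0
W[off_diag] = 1.0 / dist[off_diag]
W = W / W.sum(axis=1, keepdims=True)

X = np.column_stack([np.ones(n), rng.standard_normal(n)])  # intercept + one covariate
beta = np.array([1.0, 0.5])   # illustrative coefficients
e = rng.standard_normal(n)    # stochastic error term

y_sar = sar_reduced_form(W, X, beta, rho=0.4, e=e)
y_sem = sem_reduced_form(W, X, beta, lam=0.4, e=e)
```

In the SAR case a shock in one city feeds into the outcomes of its neighbors through *ρW*, whereas in the SEM case only the unobserved error is spatially correlated.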

In this study, *y* is the number of confirmed COVID-19 cases. *X* consists mainly of urban characteristics, including the length of the urban subway, the density of the local population, the wastewater discharged annually, the annual residential waste, and the public green space per capita. Here, population density is closely related to total population, but it is more meaningful in the analysis of an infectious disease. In addition, the distance to Wuhan might or might not be included in *X*, depending on the performance of the empirical models with different practical meanings. For the convenience of applying probabilities below, we do not use the log form of the number of COVID-19 cases. However, we do use the log form for many of the explanatory variables for higher precision. Please note that the variable selection here follows Liu^{36}, which built an urban analytical framework for large-scale infectious diseases such as COVID-19. The analysis in this study, however, is new.

As this study tries to isolate local transmission from the imported transmission of COVID-19 infection, we herein propose a novel explanation of the probability of COVID-19 transmission among the local population in the target city based on a spatial statistical analysis of viral transmission.

First, to quantify the spread of COVID-19 among cities in China, we propose the following equation:

$$P_{i} = P_{\text{epicenter}} \times P_{\text{out}} \times P_{\text{import}} \times P_{\text{local}},$$

(3)

where *P*_{epicenter} is the probability of confirmed COVID-19 cases in the local population at the epicenter of the outbreak, i.e., Wuhan in the case of China in the first wave. *P*_{out} is the probability of people’s departure from Wuhan before the city was locked down. *P*_{import} is the probability of COVID-19 infection among people in the target city *i* from external sources. In addition, *P*_{local} is the probability of the transmission of the novel coronavirus in the target city *i*. Finally, *P*_{i} is the probability of confirmation of COVID-19 cases in the local population in the target city *i*.

Equation (3) is the novel contribution of this study. Although it resembles the SEIR models in epidemiology, the two are different. In *SEIR* models, *S* means “susceptible,” *E* is “exposed,” *I* stands for “infectious,” and *R* indicates “recovered”. Essentially, an SEIR model is a system of four differential equations, which can also be interpreted as the product of four corresponding probabilities. In this study, as illustrated in Eq. (3), the chain of probabilities rests on the assumption that the component events are independent, so that the joint probability is the product of the individual probabilities. Unfortunately, the probabilities in Eq. (3) are in fact unknown. Therefore, this study seeks reasonable proxies for them. *P*_{import} and *P*_{epicenter} are two steps in the transmission chain. While *P*_{epicenter} is completely exogenous in this model, *P*_{import} may be affected by several factors that are discussed later in this study. Assuming independent and identically distributed (*i.i.d.*) variables is a common practice in setting up a joint distribution; otherwise, the model would become unnecessarily complex.

Second, Eq. (3) can be rearranged to express the probability of COVID-19 infection at the epicenter of the outbreak in a region in terms of the other probabilities. Solving Eq. (3) for *P*_{epicenter}, we have:

$$P_{\text{epicenter}} = \frac{P_{i}}{P_{\text{out}} \times P_{\text{import}} \times P_{\text{local}}},$$

(4)

which offers a probabilistic perspective on the scale of the outbreak at the epicenter of the COVID-19 epidemic in that region. Understanding this scale is crucial for controlling the virus.
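
The relationship between Eqs. (3) and (4) can be checked numerically with a short sketch. The function names and all probability values below are hypothetical and chosen only to illustrate the arithmetic; they are not estimates from this study.

```python
def p_target_city(p_epicenter, p_out, p_import, p_local):
    """Eq. (3): probability of confirmed cases in target city i,
    assuming the four component probabilities are independent."""
    return p_epicenter * p_out * p_import * p_local


def p_epicenter_from_target(p_i, p_out, p_import, p_local):
    """Eq. (4): back out the epicenter probability from the target city."""
    denom = p_out * p_import * p_local
    if denom <= 0:
        raise ValueError("P_out, P_import and P_local must all be positive")
    return p_i / denom


# Hypothetical values, for illustration only
p_epicenter, p_out, p_import, p_local = 0.01, 0.4, 0.05, 0.5

p_i = p_target_city(p_epicenter, p_out, p_import, p_local)
recovered = p_epicenter_from_target(p_i, p_out, p_import, p_local)
```

Applying Eq. (4) to the output of Eq. (3) recovers the assumed epicenter probability up to floating-point error. In practice the inputs are unknown and must be replaced by proxies, so the result is only as reliable as those proxies.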