May 1, 2020

When the COVID-19 pandemic will end

Dear all,

I was the head of Data Science at British Transport Police, and one of our department tasks is to efficiently allocate staff, depending on the crime rates, which correlate to passenger flow. As you understand, the passenger flow will undertake significant change as soon as the Government decides to cancel quarantine or stop some limitations. The question naturally arises: when will the pandemic end and how to prepare for a return to normal life.

A few weeks ago I began to deal with the issue of forecasting the pandemic flow, and having tried many models, from generally accepted SARS to a variety of ‘exotic’ ones, I came to to the conclusion that having significant uncertainty in the source data and various approaches in disease identification (and these approaches may change over time for the same country), the principle Zen of Python - “simple is better than complex, and complex is better than complicated.” will be best fitting here.

Therefore, accepting two hypotheses: (a) disease distribution and death rate obey the law of large numbers and might be described by symmetric normal distribution, and (b) we will not have a second peak of infection after quarantine removal, I built a simple interactive model also available on GitHub.

This model reads data from the WHO website, checking for updates every half hour (checking a hash of the files), then uses the curve_fit function from the scipy.optimize package trying to fit the normal distribution curve (more precisely, logistic distribution for the data, minimizing the standard deviation (other optimization methods are available too.)

Working with the curve_fit function it might be useful to set a few non-default parameters like the following

fitC = curve_fit(logistic_model, x, cum, bounds=(0, [10, 200, p]), maxfev=1e5)

where the first three parameters are function, x and y to be fit; bounds are numbers within which python will try to fit the parameters and maxfev is the number of attempts to fit a curve.

The Logistic Probability Density Function is:

$f(t) = \frac{e^z}{\sigma(1 + e^z)^2}$

where $ z = \frac{t - \mu}{\sigma} $

as $\sigma$ is the scale parameter, the result of the function will hardly meet huge values, so I have added a multiplication mult, so finally we multiply f(x) by mult:

def logistic(x,mu,s,mult):
    return mult * (np.exp(-(x - mu) / s)) / (s * (1 + np.exp (-(x - mu) / s))**2)

The very first model had no restrictions, so it sometimes showed quite inadequate data. Therefore, it was decided to limit the search depth (which nevertheless was set more than the standard one) and limit the error in determining the infection peak to ten days (before that I tried to set a limit by the number of cases equal to the population of the country, but this method did not work correctly).

Additionally, it seemed reasonable to limit the scope of the chart to a peak with deviations of plus or minus three sigmas.

The Logistic Standard Deviation will be used for calculation of chart visibility area and is $$display$$ \sigma_T = \sigma \pi \frac{\sqrt{3}}{3} $$display$$

as we used the three-sigma range, deviation from the peak is $$display$$ \sigma_T = \sigma \pi \sqrt{3} $$display$$

As a result, the script began to generate the forecasts shown below.

Forecast example

It should be noted that the script forms a daily (non-cumulative) forecast, therefore to calculate the total number of patients a cumulative function is used.

$ F(x, \mu, \sigma) = \frac{1}{1+e^{-\frac{(x-\mu)}{s}}} $

def logistic_model(x,a,b,c):
    return c/(1+np.exp(-(x-b)/a))

The script has shown its accuracy and applicability both for the organization I work for, and for the UK. It seems to me that the forecast formed for other countries looks quite accurate too. It is written using Dash framework, which is based on their own Plotly library and Flask web framework.

I would be happy to hear any feedback and/or constructive criticism.