COVID-19 hackathon “Ustawka 2020” — SEIR and ARIMA approach

Jul 31, 2020

Written by: Lukasz Cmielowski, PhD, Jan Wasilewski, Patryk Wielopolski, Robert Benke

Hackathon

I had the pleasure of serving on the jury for this year’s edition of the “Ustawka 2020” hackathon for students in Poland. The hackathon was organised by IBM Poland and the University of Warsaw. This year’s topic was the prediction of daily confirmed COVID-19 cases, deaths and recoveries. Predictions had to be made for each day of a 14-day period (June 15–26).

Fifteen teams representing universities from across Poland took part in the competition. I asked the highly ranked teams to share their solutions in this and the following stories. Let’s start with the “śpiewające fortepiany” team, represented by Jan Wasilewski, Robert Benke and Patryk Wielopolski. You can try this modelling approach yourself in the notebook environment in Watson Studio. The notebook link can be found in the references section at the bottom.

COVID-19 modelling using SEIR and ARIMA

The solution combines a well-known epidemiological model, SEIR, with a time series model, ARIMA. In our case the classical SEIR model (Susceptible-Exposed-Infectious-Recovered) has been extended by two new stages, diagnosed and dead, and we call the result SEID-RDe. The team used the first model to predict future trends and the second one to explain daily fluctuations.

Assumptions

The population is divided into 6 disjoint groups, with transitions between them.
The susceptible stage (S) contains all people living in Poland who can get infected in the future. We assume that a person is immune after the first infection and cannot get COVID-19 again. Therefore, the size of this stage is non-increasing in time.
The second stage, exposed (E), is a temporary stage for people who are infected but do not yet pass COVID-19 on.
The infectious stage (I) contains all people who can infect others and have not been diagnosed yet.
(D) stands for the diagnosed group. Another quite strong assumption is made at this step: we assume that all infected people are diagnosed at some point. That means we do not allow a transition from Infectious directly to either of the final stages. From D one moves to one of two states, R or De: recovered and dead, respectively.
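The transitions listed above can be written as a system of ordinary differential equations. The sketch below shows one plausible form, not the team's actual equations: the rate names (beta, sigma, delta, gamma, mu), their values and the initial counts are all hypothetical, and the population figure is only a rough illustration. It integrates the system with odeint from scipy.integrate.

```python
import numpy as np
from scipy.integrate import odeint


def seid_rde_rhs(y, t, beta, sigma, delta, gamma, mu):
    """Right-hand side of an assumed SEID-RDe system (all rates hypothetical)."""
    S, E, I, D, R, De = y
    N = S + E + I + D + R + De            # total population, constant over time
    new_exposed = beta * S * I / N        # only undiagnosed infectious people transmit
    dS = -new_exposed                     # S can only shrink (no reinfection)
    dE = new_exposed - sigma * E          # exposed become infectious at rate sigma
    dI = sigma * E - delta * I            # every infectious person is diagnosed eventually
    dD = delta * I - (gamma + mu) * D     # diagnosed either recover or die
    dR = gamma * D
    dDe = mu * D
    return [dS, dE, dI, dD, dR, dDe]


def simulate(y0, days, params):
    """Integrate the system on a daily grid from day 0 to `days`."""
    t = np.arange(days + 1, dtype=float)
    return t, odeint(seid_rde_rhs, y0, t, args=params)


# Made-up initial state, for illustration only.
population = 38_000_000
E0, I0, D0, R0, De0 = 2000.0, 1500.0, 9000.0, 4000.0, 700.0
S0 = population - (E0 + I0 + D0 + R0 + De0)   # S is everyone not in another state

params = (0.25, 0.2, 0.15, 0.05, 0.002)        # beta, sigma, delta, gamma, mu
t, trajectory = simulate([S0, E0, I0, D0, R0, De0], days=14, params=params)
```

Each row of trajectory holds the six group sizes for one day. Because every outflow from one group is an inflow to another, the row sums stay equal to the population.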

Algorithms

The approach consists of two steps: general trend estimation with SEID-RDe and residual estimation with an ARIMA model.

Bayesian SEIR
The first part of our model is the epidemiological model SEID-RDe, which describes the dynamics of the epidemic by tracking the population groups described above. Changes in those groups are expressed as a system of differential equations whose parameters have to be estimated from data. The solution of this system is a set of continuous functions reflecting the size of each group over time. We treat these functions as trends. Because relatively little current data was available, and because knowledge from past pandemics can be encoded in priors, Bayesian estimation seems well suited to this problem. We chose standard normal and negative binomial priors (for more details see the references section).

After reviewing the timeline of all restrictions introduced by the government, we chose the 4th of May as the first day of our training set. This decision was motivated by changing restrictions, which might distort earlier data. It also raised new questions about the initial states of SEID-RDe. Initialising SEIR states at the beginning of an epidemic is trivial, but it becomes a serious problem later on. Fortunately, not all stages were equally hard to estimate. Stage S is simply the whole population minus the people in the other states. Stages D, R and De can be read from the aggregated historical data (from the beginning up to the 4th of May). Finally, for E and I, we used information from the SEIR model built for the previous time interval.

With the initial values, priors and model in place, we were able to sample from the posterior distribution. We ran 100k iterations of Random Walk MCMC, as implemented in tensorflow-probability, and the chains converged to the posterior distributions.
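To make the sampling step concrete, here is a plain NumPy sketch of random-walk Metropolis, the basic algorithm behind Random Walk MCMC. It is not the tensorflow-probability implementation, and the one-parameter target (a standard normal prior with a unit-variance Gaussian likelihood on made-up data) is only a toy stand-in. In the real model, each log-posterior evaluation would involve solving the differential equations.

```python
import numpy as np


def random_walk_metropolis(log_post, x0, n_steps, step_size, rng):
    """Sample from exp(log_post) using Gaussian random-walk proposals."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    lp = log_post(x)
    samples = np.empty((n_steps, x.size))
    accepted = 0
    for i in range(n_steps):
        proposal = x + step_size * rng.standard_normal(x.size)
        lp_proposal = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < lp_proposal - lp:
            x, lp = proposal, lp_proposal
            accepted += 1
        samples[i] = x
    return samples, accepted / n_steps


rng = np.random.default_rng(0)
data = rng.normal(1.5, 1.0, size=50)   # synthetic observations, illustration only


def log_post(theta):
    """Standard normal prior plus unit-variance Gaussian likelihood (toy target)."""
    log_prior = -0.5 * theta[0] ** 2
    log_lik = -0.5 * np.sum((data - theta[0]) ** 2)
    return log_prior + log_lik


samples, acceptance_rate = random_walk_metropolis(
    log_post, [0.0], n_steps=5000, step_size=0.3, rng=rng
)
posterior = samples[1000:, 0]                     # drop burn-in
analytic_mean = data.sum() / (len(data) + 1)      # closed form for this toy target
```

For this conjugate toy target the posterior mean is known in closed form, which gives a quick check that the chain has settled in the right place.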

Although the SEID-RDe model was designed for epidemic modelling, it assumes that its parameters (conditions) are constant in time, which is not the case here because the government kept changing restrictions. To correct this potential bias, to account for seasonality (we observed that the number of tests done during weekends is significantly lower) and to capture other relations that are present but not modelled by SEID-RDe, we fit another model to the residuals.

Bayesian SIR trend analysis (yellow lines represent predictions with confidence intervals).

ARIMA
Autoregressive integrated moving average (ARIMA) is a widely used model for time series forecasting. It consists of three parts. The AR part of ARIMA indicates that the evolving variable of interest is regressed on its own lagged (i.e., prior) values. The MA part indicates that the regression error is actually a linear combination of error terms whose values occurred contemporaneously and at various times in the past. The I (for “integrated”) indicates that the data values have been replaced with the difference between their values and the previous values (and this differencing process may have been performed more than once). The purpose of each of these features is to make the model fit the data as well as possible.
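The differencing behind the “I” part is easy to show on a tiny made-up series. This is only an illustrative sketch using NumPy's diff; the helper names are hypothetical.

```python
import numpy as np


def difference(series, d):
    """Apply first-order differencing d times."""
    return np.diff(np.asarray(series, dtype=float), n=d)


def undo_first_difference(first_value, diffs):
    """Rebuild the original series from its first value and its first differences."""
    return np.concatenate([[first_value], first_value + np.cumsum(diffs)])


cumulative = [10, 12, 15, 19, 24]      # made-up cumulative counts
once = difference(cumulative, 1)       # [2, 3, 4, 5]
twice = difference(cumulative, 2)      # [1, 1, 1]
restored = undo_first_difference(cumulative[0], once)
```

Differencing once turns a cumulative count into daily increments, and differencing twice removes the linear growth in those increments, leaving a constant series.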

The model is generally denoted ARIMA(p, d, q), where p is the number of lags (the order of the autoregressive part), d is the number of times the data has been differenced and q is the order of the moving-average part. The standard ARMA model, which is a special case of ARIMA and the basis for it, can be written as follows:
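In standard textbook notation, which is assumed here because the symbols are not defined elsewhere in the text, an ARMA(p, q) model for a series X_t is:

```latex
X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}
```

Here c is a constant, \varphi_i are the autoregressive coefficients, \theta_j are the moving-average coefficients and \varepsilon_t is white noise. ARIMA(p, d, q) applies this model to the series after it has been differenced d times.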

In our case we fitted three separate models, for infected, recovered and dead. In each case we searched for the optimal p, d and q. The goal was to predict and explain daily fluctuations, so we subtracted the trend (SEID-RDe) from the original data and used the resulting residuals for ARIMA modelling. Predictions were made in the same manner: we predict the trend with the epidemiological model and then add the predictions from the residual model.
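The trend-plus-residual scheme can be sketched in a few lines. To keep the example dependency-free, a plain AR(p) model fitted by least squares stands in for the ARIMA search, so this is a simplification and not the team's pipeline. The linear trend, the weekly pattern and all numbers are synthetic assumptions.

```python
import numpy as np


def fit_ar(residuals, p):
    """Least-squares fit of an AR(p) model without intercept."""
    r = np.asarray(residuals, dtype=float)
    # Column k holds the values lagged by k + 1 steps.
    lagged = np.column_stack([r[p - k - 1 : len(r) - k - 1] for k in range(p)])
    coef, *_ = np.linalg.lstsq(lagged, r[p:], rcond=None)
    return coef


def forecast_ar(residuals, coef, steps):
    """Iterate the fitted AR model forward, feeding forecasts back in as lags."""
    history = [float(v) for v in residuals]
    p = len(coef)
    forecasts = []
    for _ in range(steps):
        lags = history[-1 : -p - 1 : -1]       # most recent value first
        nxt = float(np.dot(coef, lags))
        forecasts.append(nxt)
        history.append(nxt)
    return np.array(forecasts)


rng = np.random.default_rng(1)
days = np.arange(56)
trend = 300.0 + 2.0 * days                      # stand-in for the epidemiological trend
weekly = 30.0 * np.sin(2 * np.pi * days / 7)    # stand-in for the weekend effect
observed = trend + weekly + rng.normal(0.0, 1.0, size=days.size)

residuals = observed - trend                    # what the trend model does not explain
coef = fit_ar(residuals, p=7)

future_days = np.arange(56, 70)
trend_forecast = 300.0 + 2.0 * future_days
final_forecast = trend_forecast + forecast_ar(residuals, coef, steps=14)
```

The final forecast is the sum of the two parts, which mirrors the way the predictions are combined in the text. A full ARIMA model would additionally handle differencing and moving-average terms.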

Real observations vs. final predictions (yellow lines).

Evaluation error

Evaluating the model on data from the period of the IBM hackathon shows that we achieved great results for new cases and deaths, but did considerably worse on recoveries.

New confirmed cases: the daily prediction error was less than 25 on 9 out of 12 days of the hackathon.
Deaths: these had much smaller variance and did not play a large role in the final score. Still, it seems impressive that on more than half of the days we were less than 5 deaths from the true value. The minimal error we recorded for deaths and for new cases was below 2.
Recoveries: there was a radical change in the magnitude of recoveries during the challenge, which SEID-RDe was not able to predict. Despite the predictive ability of ARIMA models, we were not fast enough to follow the new trend. As a result, our prediction of the number of recoveries was terrible: the mean absolute error was above 100.
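The figures above are daily absolute errors, summarised as a count of days under a threshold, a minimum and a mean. A minimal sketch of that computation follows. The helper names are hypothetical and the numbers are invented for illustration; they are not the hackathon data.

```python
import numpy as np


def daily_abs_errors(actual, predicted):
    """Absolute difference between observed and predicted values, per day."""
    return np.abs(np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float))


def summarize(actual, predicted, threshold):
    """Mean absolute error, number of days below the threshold, and smallest error."""
    errors = daily_abs_errors(actual, predicted)
    return {
        "mae": float(errors.mean()),
        "days_within": int((errors < threshold).sum()),
        "min_error": float(errors.min()),
    }


# Invented numbers, not the real observations or predictions.
actual = [396, 407, 506, 314, 352]
predicted = [380, 415, 450, 330, 350]
summary = summarize(actual, predicted, threshold=25)
# errors are 16, 8, 56, 16, 2 -> mae 19.6, 4 days within 25, minimum error 2
```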

Python packages

Fitting the SEIR model requires solving a system of differential equations at every step of sampling from the posterior distribution. We used the MCMC implementation from TensorFlow Probability and the odeint function from scipy.integrate to find a numerical solution of the differential equations. Our second model was fitted with auto_arima from pmdarima.arima. To reproduce our results you would also need:

TensorFlow in version 2.1.0+
pandas
NumPy
matplotlib
functools

Watson Studio notebook runtime

For teamwork we used the Watson Studio notebook runtime on IBM Cloud. It allowed us to easily communicate, collaborate and share our results. It comes with a notebook execution scheduler that was extremely useful for long-running jobs.

Summary

This approach has its strengths and weaknesses. On the one hand, it uses an epidemiological model that is broadly accepted by experts for modelling the spread of COVID-19 and adapts it to the requirements of short-term prediction. Both SEID-RDe and ARIMA have just a few parameters, so those parameters can be estimated with relatively low variance. On the other hand, fitting two models separately may cause error propagation, and the parameters estimated for SEID-RDe may no longer be appropriate in the second half of June. Nevertheless, the results seem quite good and worth further exploration.