Building COVID-19 models with Watson AutoAI SDK
As a continuation of the epidemic models comparative analysis, we want to examine one more regression model, this time created with Watson Studio AutoAI. We will use the new Python API to define and trigger an AutoAI experiment. The Jupyter notebook with all steps can be found here.
COVID-19 data
We are using the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University.
As a working environment we use the Watson Studio notebook runtime (a free plan is offered). In the first step we load the confirmed cases data into a pandas DataFrame for further preprocessing.
In the next step we:
- filter the DataFrame to Poland cases only
- reshape it to contain only a Day count column (starting from Jan 22nd) and a Cases column
- remove the last 5 records from the training set (keeping them aside for test purposes).
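The steps above can be sketched with pandas. This is a minimal, illustrative example assuming the CSSE wide format (one row per country/region, one column per date); the real file has more columns (province, latitude, longitude) that would be dropped the same way:

```python
import pandas as pd

# Toy frame mimicking the CSSE wide format: one row per region,
# one column per date starting Jan 22nd (values are made up).
df = pd.DataFrame({
    "Country/Region": ["Poland", "Germany"],
    "1/22/20": [0, 0],
    "1/23/20": [0, 1],
    "1/24/20": [1, 3],
    "1/25/20": [2, 4],
    "1/26/20": [3, 7],
    "1/27/20": [5, 12],
    "1/28/20": [8, 20],
})

# 1) Keep Poland cases only
poland = df[df["Country/Region"] == "Poland"]

# 2) Reshape to two columns: Day count since Jan 22nd, and Cases
cases = poland.drop(columns=["Country/Region"]).T.reset_index(drop=True)
cases.columns = ["Cases"]
cases.insert(0, "Day", range(len(cases)))

# 3) Set the last 5 records aside for local testing
train_df, test_df = cases.iloc[:-5], cases.iloc[-5:]
```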
The preprocessed data set is stored as a CSV file in Cloud Object Storage for further analysis and sharing. If you want to know the exact steps for working with COS in Watson Studio, please refer to this documentation.
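As a sketch, inside a Watson Studio notebook the file can be persisted with the pre-installed project_lib package; the file name below is a placeholder, and the commented calls assume project access has been configured as described in the documentation:

```python
import pandas as pd

# Tiny stand-in for the preprocessed training frame
train_df = pd.DataFrame({"Day": [0, 1, 2], "Cases": [0, 0, 1]})

# Serialise to CSV in memory
csv_bytes = train_df.to_csv(index=False).encode("utf-8")

# In a Watson Studio notebook, project_lib can save the bytes to the
# project's Cloud Object Storage bucket (sketch; needs project credentials):
# from project_lib import Project
# project = Project(project_id="***", project_access_token="***")
# project.save_data("covid_poland.csv", csv_bytes, overwrite=True)
```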
AutoAI experiment
AutoAI is available within Watson Machine Learning, with a graphical interface through IBM Watson Studio. You need to create an instance of Watson Machine Learning first (a free plan is offered as well).
I will be using the Python API to work with AutoAI; the graphical interface is also offered for non-programmers. To work with an AutoAI experiment programmatically we need to install the watson-machine-learning-client-V4 package, available on PyPI.
!pip install -U watson-machine-learning-client-V4
Now we are ready to define the AutoAI optimizer for our use case. First, we need to provide connection details to our training data set (the CSV file uploaded to COS).
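A sketch of the data connection; the helper class names (DataConnection, S3Connection, S3Location) are from the V4 client as I recall them, and all credential and bucket values below are placeholders, so check them against your client version:

```python
# Placeholder COS connection details (replace with your own service values)
cos_credentials = {
    "endpoint_url": "https://s3.eu-geo.objectstorage.softlayer.net",
    "access_key_id": "***",
    "secret_access_key": "***",
}
training_location = {
    "bucket": "covid-19-data",      # placeholder bucket name
    "path": "covid_poland.csv",     # the CSV uploaded earlier
}

# With the V4 client helpers this becomes (sketch):
# from watson_machine_learning_client.helpers import (
#     DataConnection, S3Connection, S3Location)
# training_data_connection = DataConnection(
#     connection=S3Connection(**cos_credentials),
#     location=S3Location(**training_location),
# )
```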
Next, we can define the optimizer. The following information is required:
- name: experiment name
- prediction_type: type of the problem (regression)
- prediction_column: target column name ("Cases")
- scoring: optimisation metric (MSE)
We can preview the configuration by calling optimizer.get_params().
To speed up our test we have reduced the number of estimators to two. Now we can trigger the experiment in interactive mode.
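Putting the definition and the trigger together, a sketch could look as follows; the sklearn-style scoring name for MSE and the fit argument names are my recollection of the V4 client, so treat them as assumptions and verify with get_params():

```python
# Settings for the optimizer; "neg_mean_squared_error" is the sklearn-style
# name for the MSE metric (an assumption -- check your client version).
optimizer_params = {
    "name": "COVID-19 Poland cases",      # experiment name (placeholder)
    "prediction_type": "regression",      # type of the problem
    "prediction_column": "Cases",         # target column
    "scoring": "neg_mean_squared_error",  # optimisation metric (MSE)
}

# With an AutoAI experiment object from the V4 client this becomes (sketch):
# pipeline_optimizer = experiment.optimizer(**optimizer_params)
# pipeline_optimizer.get_params()         # preview the configuration
# pipeline_optimizer.fit(
#     training_data_reference=[training_data_connection],
#     background_mode=False,              # interactive mode: stream progress
# )
```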
Now we can list the trained pipelines with their evaluation metrics, in the form of a pandas DataFrame, by calling the summary() method. You can use the DataFrame to compare all of the discovered pipelines and select the one you like for further testing. With the default number of examined estimators (1) we received four pipeline models. Let's use pandas plotting capabilities to compare the R2 metric value for all of them.
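An illustrative sketch of that comparison; the scores below are made up for the example and the real column names returned by summary() may differ by client version:

```python
import pandas as pd

# Illustrative stand-in for pipeline_optimizer.summary(); real column
# names and scores depend on your experiment and client version.
summary = pd.DataFrame(
    {"holdout_r2": [0.91, 0.94, 0.96, 0.985]},
    index=["Pipeline_1", "Pipeline_2", "Pipeline_3", "Pipeline_4"],
)

# Bar chart via pandas plotting (requires matplotlib):
# summary["holdout_r2"].plot.bar(title="Holdout R2 by pipeline")

# Pick the pipeline with the best holdout R2 score
best_pipeline = summary["holdout_r2"].idxmax()
```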
As we can see, Pipeline_4 shows the best R2 score on the holdout data: over 98%.
You can easily load an AutoAI model into your local runtime by calling the get_pipeline() method. By default, the best model is returned in scikit-learn enriched format. The enrichments allow you to visualize the model as a graph or pretty-print the model definition code. As we can see, Ridge has been chosen as the best estimator here.
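To give a feel for what such a pipeline looks like locally, here is a hand-built scikit-learn stand-in with Ridge as the estimator, trained on a toy cumulative-cases series; the actual pipeline returned by get_pipeline() includes AutoAI's own preprocessing steps:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy cumulative-cases series: Day -> Cases (illustrative, roughly exponential)
X = np.arange(30).reshape(-1, 1)
y = (1.3 ** np.arange(30)).cumsum()

# Hand-built stand-in for an AutoAI pipeline with Ridge as the estimator
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X[:-5], y[:-5])        # train without the 5 held-back records
preds = model.predict(X[-5:])    # predict the held-back days
```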
Let's use the plotly package to visualize the total number of confirmed cases together with the predictions, to see how well our model fits the real data. The red dot represents the holdout record; the green ones are the test records that were kept for local evaluation before the data was submitted to AutoAI.
Model inference
In this section I will show how easily you can expose your model as a web service. You can then integrate it with your shiny application by sending prediction requests to it and presenting the results back in your app.
You just need to:
- point to the pipeline model you want to deploy, either by providing the model object or its name
- set the web service name
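As a sketch, the two settings above can be collected and passed to the client's deployment API; the WebService class and argument names are my recollection of the V4 client, so treat them as assumptions:

```python
# Deployment settings; the pipeline name comes from the summary() listing
# and the service name is a placeholder.
deployment_params = {
    "model": "Pipeline_4",
    "deployment_name": "COVID-19 Poland web service",
}

# With the V4 client this becomes (sketch; needs WML credentials):
# from watson_machine_learning_client.deployment import WebService
# service = WebService(wml_credentials)
# service.create(**deployment_params)
```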
After the web service is created you can retrieve all its details by calling:
service.get_params()
If you want to get predictions for the next days, you just need to make a scoring request against the web service; a pandas DataFrame with the feature columns is supported as input. The score call returns the list of predicted cases.
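A sketch of the scoring request; the day counts are illustrative, and the score call and response shape are my recollection of the V4 client's web service API:

```python
import pandas as pd

# Feature frame for the next five days (day counts are illustrative)
payload_df = pd.DataFrame({"Day": [95, 96, 97, 98, 99]})

# Against the deployed web service (sketch; `service` comes from the
# deployment step, and the response layout may differ by client version):
# predictions = service.score(payload=payload_df)
# predictions["predictions"][0]["values"]   # list of predicted case counts
```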
Next steps
Pipeline model refinement (tuning) with the semi-automated data science library Lale. Please check out Kiran's story on that: "Refining Watson AutoAI Output Pipelines".