Reproducible Research at ENAR

I gave a talk at the Spring ENAR meetings this morning on some of the technical aspects of creating the book. The session was on reproducible research and the slides are here.

I was dinged for not using Git for version control (we used Dropbox for simplicity), but overall the comments were good. There was a small panel at the end to answer questions, which were mostly about proprietary systems (e.g., SAS).

I was also approached by an editor for Computational Statistics about writing all of this up, which I will do when I get a free moment.

Confidence in Prediction

A few colleagues have just published a paper on measuring confidence in the predictions of regression models ("Interpretable, Probability-Based Confidence Metric for Continuous Quantitative Structure-Activity Relationship Models"). The idea is related to applicability domains: the region of the predictor space where the model can produce reliable predictions.

Historically, the primary method for computing the applicability domain has been to judge the similarity of new samples to the training set, in order to characterize whether the model would need to extrapolate for those samples. That approach doesn't take into account the training set outcomes or, more importantly, any regions inside the training set space where the model fits the data poorly.
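As a minimal sketch of that distance-based idea, here is one way to flag likely extrapolations; the use of average distance to the k nearest training neighbors as the similarity measure, and the quantile-based cutoff, are illustrative assumptions rather than a specific published recipe:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def applicability_flags(X_train, X_new, k=5, quantile=0.95):
    """Flag new samples whose average distance to their k nearest
    training neighbors is atypically large for the training set."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_train)
    # Distance of each training point to its k neighbors (column 0 is
    # the point itself, so it is skipped).
    train_dist, _ = nn.kneighbors(X_train)
    train_avg = train_dist[:, 1:].mean(axis=1)
    cutoff = np.quantile(train_avg, quantile)

    new_dist, _ = nn.kneighbors(X_new, n_neighbors=k)
    new_avg = new_dist.mean(axis=1)
    return new_avg, new_avg > cutoff  # True = likely extrapolation
```

Note that this scheme only looks at the predictors; it would happily trust a sample that lands squarely in a region where the model's fit is bad.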

The approach this paper takes is to compute a local root mean squared error (RMSE) that is weighted by the distance of the new sample to its nearest training set neighbors (inspired by the approach Quinlan (1993) used for instance-based corrections in Cubist).
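Here is a rough sketch of what a distance-weighted local RMSE might look like; the inverse-distance weighting, the neighbor count, and the function names are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_rmse(X_train, resid_train, X_new, k=10, eps=1e-8):
    """Distance-weighted local RMSE: residuals of nearby training
    samples count more, so the estimate reflects how well the model
    fits in the neighborhood of each new sample."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, idx = nn.kneighbors(X_new)
    # Inverse-distance weights, normalized to sum to one per sample
    # (eps avoids division by zero when a new point matches exactly).
    w = 1.0 / (dist + eps)
    w /= w.sum(axis=1, keepdims=True)
    # Weighted mean of the neighbors' squared residuals, per new sample.
    return np.sqrt((w * resid_train[idx] ** 2).sum(axis=1))
```

Unlike a purely distance-based applicability domain, this estimate grows in regions where the training residuals are large, even for samples that sit well inside the training set space.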

We examine methods for assessing confidence in predictions in our book, in the section "When Should You Trust Your Model’s Prediction?"