How does machine learning identify bad data?

Wednesday, September 18, 2019
September 11: On The Respect of Data Research Town Hall, Duke Penn Pavilion

“Dear DOSI”: Questions from the Audience of our Research Town Halls

David Carlson, PhD, Assistant Professor of Civil and Environmental Engineering and of Biostatistics and Bioinformatics at Duke, answers the numerous questions inspired by his presentation on the Nuts and Bolts of Respecting Research Data.


Is there a resource for machine learning best practices?

Duke as a whole is attempting to make data science and machine learning more accessible; available resources include In-Person Learning Experiences on a variety of topics.

Is anyone expected to understand big data?

In many fields the scale and scope of data is reaching a point where no one person can go through it all and understand every nuance. Many of the “big data” tools are designed to help digest this information, including many data visualization techniques. Additionally, while much of machine learning is designed around “black-box” techniques, there are many novel interpretable machine learning techniques being proposed and increasingly used in practice.

How does machine learning identify bad data/missing values affecting outcomes/prediction? If it doesn't, how do we?

There are machine learning techniques for detecting outliers that can be used to explore potential data quality issues in a dataset. However, whether it is appropriate to screen out data is highly application dependent. For example, these techniques can be very helpful for identifying sensor failures; on the other hand, if we remove patients from an analysis just because they are atypical, we risk biasing the data.
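As a minimal sketch of what such outlier screening can look like, the example below flags anomalous points with a robust z-score (based on the median and median absolute deviation, which resist distortion by the outliers themselves). The data here are hypothetical simulated sensor readings, not from the presentation.

```python
import numpy as np

# Hypothetical sensor data: 99 normal readings plus one obvious failure (500.0).
rng = np.random.default_rng(0)
readings = np.append(rng.normal(loc=20.0, scale=1.0, size=99), 500.0)

# Robust z-score: deviations from the median, scaled by the median absolute
# deviation (MAD). The 0.6745 factor makes the MAD comparable to a standard
# deviation under normality.
median = np.median(readings)
mad = np.median(np.abs(readings - median))
robust_z = 0.6745 * (readings - median) / mad

# Flag any reading whose robust z-score exceeds a conventional threshold.
outliers = np.abs(robust_z) > 3.5
print("flagged readings:", readings[outliers])  # the 500.0 failure is flagged
```

Whether the flagged points should actually be removed remains the analyst's call, for exactly the reasons above: a flagged sensor reading may be a failure, but a flagged patient may simply be atypical.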

Is there a good standard on the ratio of data elements and number of data points/ number of observations to run these models?

There is not a great standard on how much data is necessary. With fewer than 100 data points, these methods are extremely challenging, but not impossible. As the data get bigger, the methods become more and more feasible.  Note that many of the state-of-the-art examples are trained on truly “big data.” The famed ImageNet dataset used in many image processing benchmarks contains over 1 million images (and companies with the best-performing methods are rumored to augment it with hundreds of millions of additional images), and the latest natural language models are trained on billions of text examples.  Luckily, a lot of the effort in these models is transferable, which means that we can take a method initially trained on these large corpora and adapt it for use on novel domains, such as medical imaging, with significantly fewer examples while still inheriting a lot of the benefits. This allows us to make the techniques work on “moderate” data.

Is there best practice of data splits? E.g., 60% train 20% validation 20% test?

Using 60-20-20 or 70-10-20 splits is a common rule of thumb.  As a best practice, you can use statistical power calculations to estimate the uncertainty interval or statistical power that a test set of a given size would provide. That is not particularly common in practice, though.
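A 60-20-20 split can be implemented by shuffling the example indices once and slicing them into three disjoint groups. The sketch below uses a toy dataset size of 1,000 examples, which is purely illustrative.

```python
import numpy as np

n = 1000  # toy dataset size (illustrative)
rng = np.random.default_rng(42)

# Shuffle indices once so the three sets are random but non-overlapping.
idx = rng.permutation(n)

n_train = int(0.6 * n)  # 60% train
n_val = int(0.2 * n)    # 20% validation

train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]  # remaining 20% test

print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

Shuffling before slicing matters when the raw data are ordered (e.g., by collection date); for grouped data such as repeated measurements per patient, the split should instead be done at the group level to avoid leakage between sets.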

What are the differences between statistics and data science?

Data science is a very broad term with no consensus definition.  In my personal opinion, which I expect would be debated, it is an umbrella term that includes statistics and machine learning analyses, as well as much of the data extraction and preprocessing and the end implementations that put the learned data tools to use in practice.

It's fascinating to learn about prediction. Do most traditional biostatisticians use these methods?

It is still relatively uncommon today, but in the words of our Biostatistics and Bioinformatics Chair, Dr. Page, it is happening “more and more.”

To learn more, see the tutorial The Secrets of Machine Learning: Ten Things You Wish You Had Known Earlier to be More Effective at Data Analysis, by David Carlson and Cynthia Rudin.