120 Forsyth St
Boston, MA 02115
Women In Engineering
“Small Data: A Big Challenge for Classical Machine Learning”
Prof. Sarah Ostadabbas, Assistant Professor, ECE
Sarah Ostadabbas is a first-year assistant professor in the Electrical and Computer Engineering Department of Northeastern University (NEU). Sarah joined NEU from Georgia Tech, where she was a post-doctoral researcher after completing her PhD at the University of Texas at Dallas in 2014. At NEU, Sarah has recently formed the Augmented Cognition Laboratory (ACLab) with the goal of enhancing human information-processing capabilities through the design of adaptive interfaces via physical, physiological, and cognitive state estimation.
We live in a world inundated with data: websites, mobile devices, security systems, and even small wireless sensor systems constantly collect data. In fact, such systems often collect so much data that traditional data processing techniques are insufficient. This is the big data problem. Sometimes, however, things go the other way: a critical constriction in the size of the data at some point in the processing pipeline prevents traditional machine learning techniques from working. We call this the small data problem.

For all but the simplest classification and regression models, traditional machine learning requires a considerable amount of training data to learn a model without overfitting. Unfortunately, many datasets, especially in medical applications, do not contain enough training data for these models to work. There are a number of common reasons for this constriction, including: plenty of healthy subjects but not enough sick subjects, per-person differences that require individualized training, and a high cost of features (such as tests or data that take substantial money or time to collect). As an example, collecting data from the following subjects becomes increasingly problematic as we proceed from left to right: rats, apes, healthy adults, sick adults, healthy children, and sick children. Ideally, then, you would learn as much as possible from subjects toward the left of the list and try to generalize that knowledge to subjects on the right. But this introduces another aspect of the small data problem: it is not just the amount of data, but the quality. A model learned from rats might break down quickly when applied to a sick child. Yet there may not be enough data from sick children to adequately train the full model. How do we solve these problems?
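To make the overfitting point concrete, here is a minimal sketch (not from the talk; the linear ground truth, sample sizes, noise level, and polynomial degrees are all invented for illustration). A flexible model has enough capacity to memorize ten noisy training points, while a heavily constrained model generalizes better from the same small dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small-data regime: the true relationship is linear
# (y = 2x plus noise), but only 10 training samples are available.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.1, size=10)
x_test = np.linspace(0.05, 0.95, 50)
y_test = 2 * x_test

# A flexible model: a degree-9 polynomial can interpolate all 10
# points exactly, so it fits the noise as well as the signal.
flexible = np.polyfit(x_train, y_train, deg=9)
# A constrained model: a straight line has only 2 parameters to estimate.
constrained = np.polyfit(x_train, y_train, deg=1)

# Mean squared error on held-out points from the same input range.
err_flexible = np.mean((np.polyval(flexible, x_test) - y_test) ** 2)
err_constrained = np.mean((np.polyval(constrained, x_test) - y_test) ** 2)
```

With ample training data the flexible model would eventually win; in the small-data regime, the constrained model's lower test error is exactly the trade-off the abstract describes.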
Like terrible scientists, but great engineers, we cheat: we use as much pre-existing or inexpensively obtained knowledge as possible to constrain the problem to the point where the model is simple enough to be trained correctly with the available data.
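One common way to "cheat" in this sense is to encode prior knowledge as a penalty. The sketch below is my own illustration rather than the speaker's method; the source/target weight vectors, sample sizes, and regularization strength are all assumed. It mimics the rats-to-sick-children idea: fit a tiny "target" dataset while shrinking the solution toward weights learned from a plentiful, related "source" dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scenario: a source population with plentiful data and a
# related target population with only a handful of samples.
w_source_true = np.array([1.0, -2.0, 0.5])
w_target_true = np.array([1.2, -1.8, 0.4])  # similar, but not identical

def make_data(w, n, noise=0.1):
    X = rng.normal(size=(n, 3))
    y = X @ w + rng.normal(0, noise, size=n)
    return X, y

Xs, ys = make_data(w_source_true, 1000)  # plenty of source data
Xt, yt = make_data(w_target_true, 5)     # small target data

# Step 1: learn source weights from the big dataset (ordinary least squares).
w_source = np.linalg.lstsq(Xs, ys, rcond=None)[0]

# Step 2: fit the target model, but penalize deviation from the source
# weights instead of from zero -- prior knowledge as a constraint:
#   minimize ||Xt w - yt||^2 + lam * ||w - w_source||^2
# The closed-form solution is (Xt'Xt + lam*I) w = Xt'yt + lam*w_source.
lam = 1.0
A = Xt.T @ Xt + lam * np.eye(3)
b = Xt.T @ yt + lam * w_source
w_target = np.linalg.solve(A, b)
```

The penalty biases the target model toward the source knowledge, which is exactly the danger the abstract flags (rats are not sick children), but with only five target samples that bias is often a worthwhile trade against variance.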
Egan Research Center
120 Forsyth Street, Boston, MA 02115
Egan Conference Rooms: Egan 306