Detecting Data-Code Mismatches in Machine Learning Pipelines


Speaker: Stefanie Scherzinger, Professor at the University of Passau
Date: 27.06.2023
Time: 10.00 to 11.00
Location: Payatakes Seminar Room - FORTH, Main Building, 1st floor
Host: Prof. Constantinos Marias

Abstract:

Deploying a machine learning pipeline is a resource-demanding task that requires both data engineering and software engineering expertise. Even with meticulous testing, run-time errors during pipeline operation remain a significant risk. In this presentation, we focus on run-time errors caused by mismatches between the data to be processed and the code that processes it. We present strategies for detecting these mismatches early through static and dynamic analysis, leveraging techniques from JSON Schema reasoning. Specifically, we showcase our recent contributions to three essential tasks: schema validation, schema extraction, and checking schema containment. We also provide an outlook on the challenges introduced by the latest drafts of JSON Schema. We conclude with a discussion of application domains for our contributions beyond hardening machine learning pipelines against run-time errors.
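To illustrate the kind of data-code mismatch the abstract refers to, the following is a minimal sketch of a dynamic check that compares an incoming record with the structure a pipeline step expects. It is not the speaker's tooling. The schema, the field names (`id`, `age`) and the helper `find_mismatches` are hypothetical, and only a small subset of JSON Schema keywords (`type`, `required`, `properties`) is handled.

```python
# Illustrative sketch only: a tiny subset of JSON Schema, not a full validator.
# All names below (EXPECTED_INPUT, find_mismatches, the fields) are assumptions.

TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
}


def find_mismatches(value, schema, path="$"):
    """Return a list of readable mismatches between value and schema."""
    errors = []
    expected = schema.get("type")
    if expected is not None and not TYPE_CHECKS[expected](value):
        errors.append(f"{path}: expected {expected}, got {type(value).__name__}")
        return errors  # do not descend into a value of the wrong type
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}: missing required field '{key}'")
        for key, subschema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(
                    find_mismatches(value[key], subschema, f"{path}.{key}")
                )
    return errors


# Hypothetical input structure that a pipeline step relies on.
EXPECTED_INPUT = {
    "type": "object",
    "required": ["id", "age"],
    "properties": {
        "id": {"type": "string"},
        "age": {"type": "number"},
    },
}

record_ok = {"id": "a1", "age": 42}
record_bad = {"id": "a2", "age": "42"}  # age arrives as a string

print(find_mismatches(record_ok, EXPECTED_INPUT))
print(find_mismatches(record_bad, EXPECTED_INPUT))
```

Running such a check before the processing code turns a late crash inside the pipeline into an early, descriptive report of which field deviates from what the code assumes.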

Bio:

Stefanie Scherzinger is a professor at the University of Passau where she chairs the "Scalable Database Systems" group.

Her research has traditionally focused on schema evolution, motivated by her earlier work as a software engineer, first at IBM and then at Google. Lately, she has gotten hooked on building well-principled tools for handling the JSON Schema language.
