Stefan Schmid - Computer Science

Predictable big data analytics in unpredictable environments

To begin with: where are you from?

I was born and studied in Switzerland. I received my PhD from ETH Zurich and spent the last 6 years in Berlin, Germany, working as a Senior Research Scientist at the Telekom Innovation Laboratories (T-Labs) and TU Berlin.

Why did you apply to be a part of the talent programme?

As a new Associate Professor at AAU, the programme will help me bootstrap my career in Denmark and establish my own research group. Moreover, it provides me with the necessary resources and flexibility to collaborate with other researchers within AAU and in Denmark in general. Last but not least, I hope that through the talent management programme, I will acquire additional competences in advising, leading and promoting students and research teams, as well as in managing finances.

What is your research project more specifically about?

The amount of data collected by sensors, smart devices, social software, the Internet of Things, as well as research institutions and organizations operating at a global scale, is approaching petabytes a day. Accordingly, the efficient analysis of such big data sets and streams (called big data analytics) has become mission-critical, not only for scientists, but also for global corporations, policy makers, and everyday users.

Big data analytics requires large-scale distributed computing infrastructures (such as the Cloud), whose performance today, however, is not well understood, subject to resource interference, and unpredictable. This hurts user experience and business profits.

The goal of my project, “PreLytics”, is to build a “predictable whole” (in terms of big data analytics performance) out of less-predictable parts (in terms of resource interference and resource demand). The project relies on the insight that, for the first time, given modern virtualization technology, the computing infrastructure is no longer necessarily something fixed: rather, the infrastructure can in principle be tailored to the application needs. Accordingly, my idea is to leverage the numerous unexplored resource allocation flexibilities and redundancies available in virtualized infrastructures, to anticipate and compensate for execution uncertainties, but also to render performance adaptable, at runtime, depending on the users’ needs.
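To make the idea of using redundancy to compensate for unpredictability concrete, here is a toy sketch (an illustration of one classic technique, “hedged” or speculative execution, not the project’s actual method): the same task is launched on several virtual workers, and the result of whichever replica finishes first is used, so a single slow, interference-affected worker no longer determines completion time. All names and numbers below are hypothetical.

```python
# Toy illustration of redundancy masking unpredictable performance:
# launching a task on several replicas and taking the fastest one
# ("hedged" execution). Runtimes are hypothetical measurements.

def sequential_runtime(replica_runtimes):
    """Completion time with no redundancy: we are stuck with the
    single worker we happened to pick (here: the first one)."""
    return replica_runtimes[0]

def hedged_runtime(replica_runtimes):
    """Completion time when the task runs on all replicas in
    parallel and we use whichever replica finishes first."""
    return min(replica_runtimes)

# One worker suffers resource interference and is ~10x slower.
runtimes = [10.0, 1.2, 1.1]  # seconds per replica (hypothetical)

print(sequential_runtime(runtimes))  # 10.0 -- at the mercy of one worker
print(hedged_runtime(runtimes))      # 1.1  -- the redundant copies mask the straggler
```

The trade-off, of course, is that redundancy consumes extra resources; deciding when and how much to hedge, given the allocation flexibilities of virtualized infrastructures, is exactly the kind of question such research addresses.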

Have you thought about how your project can contribute to bringing knowledge into the world?

The project will contribute to our fundamental understanding of dynamic and interdependent yet predictable distributed systems: a field of increasing importance but one which currently lacks scientific foundations. Indeed, in this area it is necessary to go from theory to practice rather than vice versa: distributed systems are notoriously difficult to reason about and debug, and due to their large scale, even “rare” events are likely to eventually occur. The scientific insights obtained in this project will then be used to develop industry-relevant algorithms and technologies. For a large company like Google or Microsoft, even a 1% improvement in resource utilization can entail cost reductions on the order of millions of DKK.

How do you plan to spend the money? 

The money will be invested in PhD students as well as in research visits.

What does it mean to you to be accepted into the programme?

I feel fortunate and excited about the opportunities this programme opens for me, and am looking forward to my new responsibilities.

What do you hope to achieve with your research project?

I hope to be able to build a small but internationally leading research group in this area.

Who is your role model within the research world?

I am fortunate to know many excellent and inspiring researchers all over the world. To name just two, I very much like the work of George Varghese and Bruce Maggs.

What do you see as the most interesting research result within your field of research?

The specific area in which this project is situated is still young; however, the fundamental underlying question of how to efficiently process large amounts of data in a distributed system is of course not new. Indeed, the current research builds upon a large number of wonderful results obtained over the last decades, and it is difficult to give a fair answer here.