Research Portfolio
My research addresses foundations of algorithmic and statistical data analysis. In much of my work, I model and analyze large data sets from a graph or network perspective. I also develop new data science techniques for complex relational data, and apply them in the context of software engineering, information systems and computational social science. My approach is quantitative, data-driven and interdisciplinary, combining methods from computer science, mathematics and physics.
In September 2014 I was awarded a Junior Fellowship by the German Informatics Society (GI e.V.).
Some facets of my research are summarized in the self-portrayal Understanding Complex Systems: When Big Data Meets Network Science. Below I outline some recent research interests and key findings.
Statistical Relational Learning
Relational data mining techniques play an important role in many disciplines, such as information science, sociology, bioinformatics and economics. They provide new ways to explore large corpora of data which capture dyadic relationships, interactions or links between documents, humans, genes, or financial institutions. However, we now increasingly have access to complex data that capture more than just dyadic relations. Examples include multi-relational data, time-stamped relations, relational data with noise, or sequential data. The question of when a graph abstraction of such complex relational data is justified has not been answered satisfactorily.
To address this problem, I develop new algorithmic and statistical data mining techniques for relational data with complex characteristics. I am particularly interested in new ways to infer patterns in sequential data on networks. In recent work, I developed a new method (i) to test when a network abstraction of such data is justified, and (ii) to infer optimal higher-order graphical models which generalize network-analytic methods. It has been implemented in the Python package pathpy, which is available on GitHub.
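To illustrate the kind of question such a test asks, the following minimal sketch compares a first-order and a second-order Markov model of paths by their maximum-likelihood fit. It is a simplified illustration, not the pathpy implementation and not the multi-order method itself: the function names and the toy data are hypothetical, and the degrees of freedom needed to turn the statistic into a p-value are deliberately left out.

```python
import math
from collections import Counter, defaultdict


def log_likelihood(paths, order, max_order=2):
    """Log-likelihood of a Markov model of the given order, fitted by maximum likelihood.

    Only transitions with at least `max_order` predecessors are scored, so that
    models of different order are compared on exactly the same observations.
    """
    counts = defaultdict(Counter)
    for path in paths:
        for i in range(max_order, len(path)):
            context = tuple(path[i - order:i])  # the last `order` nodes
            counts[context][path[i]] += 1
    total = 0.0
    for successors in counts.values():
        n_context = sum(successors.values())
        for n in successors.values():
            total += n * math.log(n / n_context)
    return total


def likelihood_ratio_statistic(paths):
    """2 * (logL of second-order model - logL of first-order model).

    Non-negative up to floating-point rounding, because the models are nested.
    """
    return 2.0 * (log_likelihood(paths, 2) - log_likelihood(paths, 1))


# Hypothetical toy data: every path through c "remembers" where it came from.
ordered_paths = [("a", "c", "d")] * 2 + [("b", "c", "e")] * 2
# Same links, but the step after c is independent of the step before c.
memoryless_paths = [("a", "c", "d"), ("a", "c", "e"),
                    ("b", "c", "d"), ("b", "c", "e")]

print(likelihood_ratio_statistic(ordered_paths))     # clearly positive
print(likelihood_ratio_statistic(memoryless_paths))  # approximately zero
```

Both toy data sets contain the same links, so a plain network cannot tell them apart. Only the first one yields a large statistic, which is the situation in which a network abstraction discards relevant information.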
Exemplary publications
- I Scholtes: When is a Network a Network? Multi-Order Graphical Model Selection in Pathways and Temporal Networks, KDD'17 - Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, Nova Scotia, Canada, August 13-17, 2017
- G Casiraghi, V Nanumyan, I Scholtes, F Schweitzer: From Relational Data to Graphs: Inferring Significant Links using Generalized Hypergeometric Ensembles, Social Informatics. SocInfo 2017, Lecture Notes in Computer Science, Vol. 10540, September 2017
- G Casiraghi, V Nanumyan, I Scholtes, F Schweitzer: Generalized Hypergeometric Ensembles: Statistical Hypothesis Testing in Complex Networks, arXiv 1607.02441, July 2016
Network Analytics for Time Series Data
Graph analytics and (social) network analysis have become cornerstones of data science. They are widely applied to relational data studied in disciplines such as computer science, physics, systems biology, social science and economics. However, we are increasingly confronted with high-frequency, time-resolved data which tell us not only who is related to whom, but also when and in which order these relations occurred. Analyzing such data remains a challenge. A naive application of network analysis and modeling techniques discards information on the timing and ordering of relations. Yet this information determines the so-called causal or time-respecting paths, i.e. it is needed to answer the question of who can influence whom.
In my research, I study the effects of temporal ordering in time-resolved relational data from real-world systems. Using a combination of information-theoretic and statistical methods, we demonstrated that temporal correlations in data from social and biological systems break the transitivity of causal paths. We further showed that applying network-based data analysis and modeling techniques, as well as algebraic methods, to time-stamped data yields wrong results when this ordering is ignored.
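The role of ordering for time-respecting paths can be made concrete with a small example. The sketch below is an illustration only, with a hypothetical function name and toy data; it computes who can influence whom when consecutive links on a path must have strictly increasing time stamps, and it ignores refinements such as a maximum waiting time between links.

```python
from collections import defaultdict
from itertools import groupby


def causal_reachability(temporal_edges):
    """Map each node to the set of nodes it can influence via time-respecting paths.

    temporal_edges: iterable of (source, target, timestamp) tuples.
    Consecutive links on a path must have strictly increasing timestamps.
    """
    # influencers[v] = nodes that can reach v using the links processed so far
    influencers = defaultdict(set)
    ordered = sorted(temporal_edges, key=lambda e: e[2])
    for _, group in groupby(ordered, key=lambda e: e[2]):
        # Build all updates from the state *before* this timestamp, so that
        # two links with the same timestamp cannot be chained.
        updates = [(v, {u} | influencers[u]) for u, v, _ in group]
        for v, sources in updates:
            influencers[v] |= sources
    reach = defaultdict(set)
    for target, sources in influencers.items():
        for source in sources:
            if source != target:
                reach[source].add(target)
    return dict(reach)


# Two hypothetical data sets with identical links a-b and b-c, in different order.
forward = [("a", "b", 1), ("b", "c", 2)]
backward = [("a", "b", 2), ("b", "c", 1)]

print(causal_reachability(forward))   # a can influence b and c
print(causal_reachability(backward))  # a can influence b only
```

Both data sets produce the same static network, in which a path from a via b to c exists. Only in the first one does this path respect time, which is why transitivity of paths in the static network cannot be taken for granted.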
Addressing the problem that common graphical representations of relational data discard information on the temporal ordering of relations, we developed a data analysis framework based on higher-order graphical models. Extending the common network perspective, it allows us to combine information on both topological and temporal characteristics of time-resolved relational data into compact probabilistic graphical models. This approach provides new ways to (i) model dynamical processes like diffusion, cascades or epidemic spreading, (ii) detect temporal-topological clusters based on higher-order Laplacians and spectral methods, (iii) assess the importance of nodes, and (iv) study the controllability of complex systems. This research aims at methodological advances which not only provide us with novel data mining techniques, but whose impact reaches beyond computer science, with applications in the modeling of complex systems in physics, systems biology, social science and economics.
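As a rough illustration of the idea, a second-order model can be represented as a network whose nodes are links and whose edges are observed paths of length two. The following sketch is a simplified stand-in for the framework described above, with hypothetical function names and toy data; it builds such a network from a set of paths and normalises the edge weights into random-walk transition probabilities.

```python
from collections import Counter, defaultdict


def second_order_network(paths):
    """Weighted second-order network built from observed paths.

    Nodes are links (a, b); an edge (a, b) -> (b, c) counts how often
    the two-step path a -> b -> c was observed.
    """
    edges = Counter()
    for path in paths:
        for a, b, c in zip(path, path[1:], path[2:]):
            edges[((a, b), (b, c))] += 1
    return edges


def transition_probabilities(edges):
    """Row-normalise edge weights into random-walk transition probabilities."""
    out_weight = defaultdict(int)
    for (source, _), weight in edges.items():
        out_weight[source] += weight
    return {(s, t): w / out_weight[s] for (s, t), w in edges.items()}


# Hypothetical toy data: paths arriving at c from a continue to d,
# paths arriving at c from b continue to e.
paths = [("a", "c", "d")] * 2 + [("b", "c", "e")] * 2

edges = second_order_network(paths)
for (source, target), p in transition_probabilities(edges).items():
    print(source, "->", target, p)
```

In the corresponding first-order network, a random walker at c would move to d or e with equal probability. The second-order representation retains that a walker arriving from a never continues to e, which is the kind of temporal-topological information that changes diffusion, clustering and centrality results.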
Exemplary publications
- Y Zhang, A Garas and I Scholtes: Controllability of temporal networks: An analysis using higher-order networks, arXiv 1701.06331, January 2017
- I Scholtes, N Wider, A Garas: Higher-Order Aggregate Networks in the Analysis of Temporal Networks: Path structures and centralities, European Physical Journal B, March 2016
- I Scholtes, N Wider, R Pfitzner, A Garas, CJ Tessone, F Schweitzer: Causality-driven slow-down and speed-up of diffusion in non-Markovian temporal networks, Nature Communications, September 2014
- R Pfitzner, I Scholtes, A Garas, CJ Tessone, F Schweitzer: Betweenness Preference: Quantifying Correlations in the Topological Dynamics of Temporal Networks, Physical Review Letters, May 2013
Data Science in Software Engineering
Software systems are at the heart of the digital society: They control critical infrastructures like communication or energy systems, fuel the increasing automation in industrial manufacturing and are key drivers of the digital economy. Despite this importance, the development of complex software systems is still a fundamental challenge. Credible reports indicate that the majority of software projects run over time or budget, or fail altogether, resulting in billions of dollars wasted every year. And while technical aspects such as programming techniques, testing methods, or developer support tools have improved significantly over the past years, our understanding of how human and social factors contribute to the success or failure of software projects is still in its infancy.
Addressing these challenges, I use data science to quantitatively study collaborative software engineering processes. As an example, we use network analysis and statistical modeling to study the evolution of software architectures based on large-scale data from software repositories. This allows us not only to trace the maintainability of software systems, but also to assist developers in refactoring code. We further extract large data sets from online support tools, and analyze them to better understand how social factors influence software development processes. This approach has helped us to uncover social mechanisms at work in software development, to quantify risks in Open Source communities, and to improve information systems used by software development teams.
Exemplary publications
- MS Zanetti, CJ Tessone, I Scholtes, F Schweitzer: Automated Software Re-modularization Based on Move Refactoring, International Conference on Modularity, April 2014
- MS Zanetti, I Scholtes, CJ Tessone, F Schweitzer: Categorizing Bugs with Social Networks: A Case Study on Four Open Source Software Communities, International Conference on Software Engineering (ICSE), May 2013
- MS Zanetti, I Scholtes, CJ Tessone, F Schweitzer: The Rise and Fall of a Central Contributor: Centralization and Performance in the Gentoo Community, International Workshop on Cooperative and Human Aspects in Software Engineering (ICSE CHASE), May 2013
Computational Social Science
The increasing volume of available data on social systems opens new opportunities for large-scale, quantitative studies of social phenomena. Such studies can help us to better understand how humans communicate and collaborate, what makes teams productive, which mechanisms are at work in successful social organizations, and how technology shapes human behavior. This research not only offers new ways to address long-standing issues in the social sciences, but is also crucial for modeling, designing and managing socio-technical systems.
Addressing these questions, I use data science techniques to study social organizations. In a large-scale analysis of data on more than 30,000 developers in 58 Open Source Software projects, we validated and quantified the Ringelmann effect known from social psychology and organizational theory. We also showed how coordination structures in software development teams influence the productivity of team members. Studying large bibliographic data sets, we further showed how social mechanisms influence editorial processes and citation practices. These works provide actionable insights for project management and policy-making.
Exemplary publications
- I Scholtes, P Mavrodiev, F Schweitzer: From Aristotle to Ringelmann: a large-scale analysis of productivity and coordination in Open Source Software projects, Empirical Software Engineering, March 2016
- E Sarigöl, D Garcia, I Scholtes, F Schweitzer: Quantifying the effect of editor-author relations on manuscript handling times, Scientometrics, March 2017
- E Sarigöl, R Pfitzner, I Scholtes, A Garas, F Schweitzer: Predicting Scientific Success Based on Coauthorship Networks, EPJ Data Science, September 2014