Explainability, Interpretability and Visualization of Machine Learning Models

Claudio T. Silva - New York University

Abstract: This talk is about the understanding and interpretation of machine learning (ML) models. We have been pursuing techniques and systems for local and global explainability of black-box machine learning models, such as deep neural networks (DNNs) and gradient-boosted trees (GBTs). The goal of our work is to support different user categories, including model builders and practitioners, in the tasks required to evaluate global and local model dynamics and to drill into explanations for specific model predictions. The primary intent of the research is to foster more effective diagnosis, uncover bias, and surface opportunities for feature selection in previously opaque models. First, we will present GALE (Globally Assessing Local Explanations), a technique to quantitatively measure the similarity between sets of explanations through topology. GALE relies on MAPPER, a technique that extracts a low-dimensional structure from the underlying data; in GALE’s case, the underlying data is the set of feature attributions produced by a local explanation method, and we produce the graphical representation using the model predictions as a filter function. Effectively, GALE produces a domain-agnostic signature of the output of any local explanation method. From these signatures we compute the corresponding persistence diagrams and measure the distance between signatures; explanation methods that produce similar manifolds should yield small distances to one another. GALE was recently presented as a spotlight talk at the Topology, Algebra, and Geometry in Machine Learning (TAG-ML) workshop at ICML 2022, and the system can be obtained at https://github.com/pnxenopoulos/gale. Then, we will introduce Calibrate, an interactive visual analytics system to analyze model calibration. Calibration concerns the ability of a model to produce probabilistic predictions that reflect real occurrences.
Oftentimes, machine learning models are accurate, yet the probabilities they predict do not reflect the true frequency of outcomes. Using these uncalibrated probabilities, particularly when an end user interprets them for decision-making, can be dangerous. Furthermore, their use may exacerbate inequalities, such as those found in healthcare, finance, or criminal justice, domains that routinely rely on model-predicted probabilities for critical decisions. We developed Calibrate through a requirements analysis with data scientists who routinely analyze model calibration. Calibrate supports interactive data-subset creation and lets users inspect selected prediction regions both locally, by looking at individual observations, and globally, through common performance representations such as a confusion matrix. Calibrate also implements learned reliability diagrams (LRDs), our proposed improvement on traditional calibration visualization. Our full paper on Calibrate was presented at the IEEE VIS 2022 conference and will appear in the IEEE Transactions on Visualization and Computer Graphics; the system is available at https://github.com/VIDA-NYU/pycalibrate.
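To make the calibration idea concrete, here is a minimal sketch of a traditional reliability diagram (the representation LRDs improve upon), using scikit-learn's calibration_curve on synthetic data; it is illustrative only and not part of the Calibrate system:

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic example: a model whose predicted probabilities are
# systematically overconfident.
rng = np.random.default_rng(0)
p_pred = rng.uniform(0, 1, 5000)
# The true outcome occurs less often than the model predicts.
y_true = (rng.uniform(0, 1, 5000) < 0.7 * p_pred).astype(int)

# A reliability diagram bins the predictions and compares the mean
# predicted probability in each bin to the observed outcome frequency;
# for a calibrated model the two match.
frac_pos, mean_pred = calibration_curve(y_true, p_pred, n_bins=10)

for mp, fp in zip(mean_pred, frac_pos):
    print(f"predicted {mp:.2f} -> observed {fp:.2f}")
```

For this overconfident model the observed frequencies fall below the predicted probabilities, which is exactly the gap a reliability diagram makes visible.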

Short-bio: Cláudio T. Silva is an Institute Professor at New York University jointly appointed in the Center for Data Science and the Tandon School of Engineering. He is also affiliated with the Center for Urban Science and Progress (which he helped co-found in 2012) and the Courant Institute of Mathematical Sciences. His research interests include visualization, visual analytics, machine learning, reproducibility and provenance, geometric computing, and computer graphics. He has put his work to practice in urban and sports-related applications. He received his BS in mathematics from the Universidade Federal do Ceará (Brazil) in 1990, and his MS and PhD in computer science from the State University of New York at Stony Brook in 1996. Claudio has advised 20+ PhD and 10 MS students, and mentored 20+ post-doctoral associates. He has over 300 publications, including 20 that have received best paper awards. According to Google Scholar, his h-index is 73 and his papers have received over 23,000 citations. Claudio was the elected Chair of the IEEE Technical Committee on Visualization and Computer Graphics (2015–18), is a Fellow of the IEEE, and received the IEEE Visualization Technical Achievement Award. He was the senior technology consultant (2012–17) for MLB Advanced Media’s Statcast player tracking system, which received a 2018 Technology & Engineering Emmy Award from the National Academy of Television Arts & Sciences (NATAS). His work has been covered by The New York Times, The Economist, ESPN, and other major news media.

Dataset Search for Data Discovery, Augmentation and Explanation

Juliana Freire - New York University

Abstract: Recent years have seen an explosion in our ability to collect and catalog immense amounts of data about our environment, society, and populace. Moreover, with the push towards transparency and open data, scientists, governments, and organizations are increasingly making structured data available on the Web and in various repositories and data lakes. Combined with advances in analytics and machine learning, the availability of such data should in theory allow us to make progress on many of our most important scientific and societal questions. However, this opportunity is often missed due to a central technical barrier: it is currently nearly impossible for domain experts to weed through the vast amount of available information to discover datasets that are needed for their specific application. While search engines have addressed the discovery problem for Web documents, there are many new challenges involved in supporting the discovery of structured data---from crawling the Web in search of datasets, to the need for dataset-oriented queries and new strategies to rank and display results. I will discuss these challenges and present our recent work in this area. In particular, I will introduce a new class of data-relationship queries that, given a dataset, identifies related datasets; I will describe a collection of methods that efficiently support different kinds of relationships that can be used for data explanation and augmentation; and I will demonstrate Auctus, an open-source dataset search engine that we have developed at the NYU VIDA Center.
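One concrete signal behind data-relationship queries is column containment: how much of a query column's values appear in a candidate dataset's column, which suggests the two tables could be joined for augmentation. The sketch below is purely illustrative (the datasets and ranking are invented) and is not Auctus's actual method:

```python
def containment(query_col, candidate_col):
    """Fraction of the query column's distinct values found in the
    candidate column: a common signal for join-based augmentation."""
    q, c = set(query_col), set(candidate_col)
    return len(q & c) / len(q) if q else 0.0

# Hypothetical query table keyed by zip code, and a small catalog.
query = ["10001", "10002", "10003", "10004"]
catalog = {
    "taxi_trips": ["10001", "10002", "10003", "99999"],
    "weather": ["A", "B"],
}

# Rank candidate datasets by how joinable they are with the query.
ranked = sorted(catalog, key=lambda n: containment(query, catalog[n]),
                reverse=True)
print(ranked, containment(query, catalog["taxi_trips"]))
```

A real dataset search engine would compute such signals at scale with sketches and indexes rather than exact set intersections.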

Short-bio: Juliana Freire is a Professor of Computer Science and Data Science at New York University. She was the elected chair of the ACM Special Interest Group on Management of Data (SIGMOD), served as a council member of the Computing Research Association’s Computing Community Consortium (CCC), and was the NYU lead investigator for the Moore-Sloan Data Science Environment. She develops methods and systems that enable a wide range of users to obtain trustworthy insights from data. This spans topics in large-scale data analysis and integration, visualization, machine learning, provenance management, and web information discovery, and different application areas, including urban analytics, predictive modeling, and computational reproducibility. Freire has co-authored over 200 technical papers (including 11 award-winning publications), several open-source systems, and is an inventor of 12 U.S. patents. According to Google Scholar, her h-index is 64 and her work has received over 17,000 citations. She is an ACM Fellow, a AAAS Fellow, and a recipient of an NSF CAREER, two IBM Faculty awards, and a Google Faculty Research award. She received the ACM SIGMOD Contributions Award in 2020. Her research has been funded by the National Science Foundation, DARPA, Department of Energy, National Institutes of Health, Sloan Foundation, Gordon and Betty Moore Foundation, W. M. Keck Foundation, Google, Amazon, AT&T Research, Microsoft Research, Yahoo! and IBM. She received a B.S. degree in computer science from the Federal University of Ceara (Brazil), and M.Sc. and Ph.D. degrees in computer science from the State University of New York at Stony Brook.

Graph Signal Processing: from data science to machine learning

Luis G. Nonato - University of São Paulo

Abstract: Graph Signal Processing (GSP) is a recently developed methodology for analyzing signals (scalar fields) defined on the nodes and edges of a graph, providing mechanisms to reduce noise, detect "discontinuities", and extract patterns from the signals. In this talk we will show how GSP can be used to analyze urban phenomena related to mobility and crime. Moreover, we will also discuss how GSP can be used to build Graph Neural Network models that generate latent representations of data defined on graphs and predict real-world phenomena.
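As a minimal illustration of the noise-reduction mechanism mentioned above, the sketch below low-pass filters a noisy signal on a small path graph using the eigenvectors of the graph Laplacian (the graph Fourier basis); the graph and signal are made up for the example:

```python
import numpy as np

# Path graph with 6 nodes.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = np.diag(A.sum(axis=1)) - A          # combinatorial graph Laplacian

# Graph Fourier basis: Laplacian eigenvectors, ordered by frequency
# (eigenvalue); smooth signals concentrate on the low frequencies.
eigvals, U = np.linalg.eigh(L)

rng = np.random.default_rng(1)
smooth = np.linspace(0.0, 1.0, n)       # slowly varying graph signal
noisy = smooth + 0.3 * rng.standard_normal(n)

# Low-pass filter: keep only the 3 lowest graph frequencies.
coeffs = U.T @ noisy                    # graph Fourier transform
coeffs[3:] = 0.0
denoised = U @ coeffs                   # inverse transform

print(np.round(denoised, 2))
```

The same transform-filter-invert pattern underlies spectral graph convolutions in graph neural networks, where the filter response is learned instead of fixed.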

Short-bio: Luis Gustavo Nonato received the PhD degree in applied mathematics from the Pontifícia Universidade Católica do Rio de Janeiro, Rio de Janeiro, Brazil, in 1998. He is a professor at the Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo, São Carlos, Brazil. Nonato was a visiting professor at the Center for Data Science, New York University, from 2016 to 2018, and a visiting scholar at the SCI Institute, University of Utah, from 2008 to 2010. He has served on several program committees, including IEEE SciVis, IEEE InfoVis, and EuroVis; he was an associate editor of the Computer Graphics Forum and IEEE Transactions on Visualization and Computer Graphics journals, General Chair of the IEEE Visualization conference in 2021, and editor-in-chief of the SBMAC SpringerBriefs in Applied Mathematics and Computational Sciences. Nonato's main research interests include geometric computing, data science, machine learning, and visualization.

Uncertainty in Deep Learning

Raul Queiroz Feitosa - Pontifical Catholic University of Rio de Janeiro

Abstract: Despite the great success of Deep Learning (DL) based solutions in several areas, their use in some applications is still limited by the difficulty of estimating how trustworthy their results are. The lecture addresses this issue and discusses methods for evaluating uncertainty in DL-based solutions. The lecture will consist of three blocks. It will first introduce the concept of uncertainty in DL and present the different sources and main types of uncertainty. The central block then describes the main techniques for measuring uncertainty. Finally, the lecture shows examples of uncertainty estimates used in active learning schemes.
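One widely used way to measure uncertainty in DL is to run several stochastic forward passes (e.g., with Monte Carlo dropout) and decompose the predictive entropy into an aleatoric and an epistemic part. The sketch below uses hand-made prediction arrays purely for illustration:

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats, guarding against log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=axis)

# Simulated softmax outputs from 10 stochastic forward passes
# for a single input with 3 classes.
confident = np.array([[0.9, 0.05, 0.05]] * 10)       # passes agree
uncertain = np.array([[0.8, 0.1, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.1, 0.1, 0.8]] * 3 + [[1/3, 1/3, 1/3]])

results = {}
for name, preds in [("confident", confident), ("uncertain", uncertain)]:
    mean_p = preds.mean(axis=0)
    total = entropy(mean_p)              # predictive (total) uncertainty
    aleatoric = entropy(preds).mean()    # expected per-pass entropy
    epistemic = total - aleatoric        # disagreement between passes
    results[name] = (total, epistemic)
    print(f"{name}: total={total:.3f} epistemic={epistemic:.3f}")
```

The epistemic term (mutual information between prediction and model) is the quantity typically used to select informative samples in active learning, since it isolates uncertainty that more training data could reduce.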

Short-bio: Raul holds a degree in Electronic Engineering from the Instituto Tecnológico de Aeronáutica (ITA) (1979), a Master's degree in Electronic Engineering from ITA (1983), and a PhD in Computer Science from the University of Erlangen-Nürnberg, Germany (1988). He completed a post-doctoral internship at the University of Hanover, Germany, in 2015. He is currently an Associate Professor in the postgraduate program of the Department of Electrical Engineering at the Pontifical Catholic University of Rio de Janeiro (PUC-Rio). He is a senior member of the IEEE Geoscience and Remote Sensing Society (IEEE-GRSS) and a member of the International Society for Photogrammetry and Remote Sensing (ISPRS) and of SELPER-Brasil. He was vice president of ISPRS Commission I between 2016 and 2022, founder and president of the Brazilian chapter of IEEE-GRSS between 2015 and 2017, and has been, since June 2022, the ISPRS Regional Representative for Latin America. His areas of interest include image analysis, remote sensing, pattern recognition, biometrics, and computer vision. He has supervised about three dozen master's dissertations and a dozen doctoral theses, and has authored or co-authored around 60 journal publications and over 180 papers in scientific conference proceedings, as well as book chapters.

Computer Vision and Machine Learning for Optimal Farm Management

João Dorea - University of Wisconsin-Madison

Abstract: Artificial Intelligence (AI) can be described as the science and engineering of making intelligent machines and computer programs. The advance of AI systems in different fields of science has created incredible opportunities for the new generation of students and scientists to answer research questions that would not have been possible before the recent progress toward more intelligent systems. AI technology such as computer vision, natural language processing, and robotics has become a real component of our lives through well-known applications such as face recognition, speech-to-text, robotics, and virtual reality. Agriculture has leveraged AI developments from other scientific domains, and livestock systems have gradually seen the implementation of modern solutions to critical problems related to animal monitoring for health and welfare, greenhouse gas emissions, animal traceability, and labor shortages. In this talk, we will discuss some examples of AI technologies with the potential to revolutionize livestock systems in the next decades, such as computer vision systems and mixed reality. We will discuss how these AI examples relate to real-world challenges currently faced by farmers, industry, and the scientific community.

Short-bio: Joao Dorea is an Assistant Professor in precision agriculture and data analytics. He obtained his master’s degree and PhD in animal science from the University of Sao Paulo in Brazil. Joao spent two years coordinating dairy and beef research in Latin America for DSM, a global supplier of animal health and nutrition products. In 2017, he joined the University of Wisconsin-Madison as a Postdoc, and in 2019, he was hired as an Assistant Professor in the Department of Animal and Dairy Sciences at the UW-Madison. Joao develops research focused on digital technology and predictive analytics to optimize farm management decisions. His research group is interested in large-scale development and implementation of computer vision systems, wearable sensors, and infrared spectroscopy to monitor nutrition, health, and welfare of livestock animals.

Machine learning to approach sustainability using scarcely labeled or unlabeled Earth observation data

Dário Oliveira - Getulio Vargas Foundation

Abstract: In recent decades, a debate on a responsible, sustainable human presence on Earth has emerged strongly. With climate change and the overwhelming economic pressure on Nature, empowering procedures for efficient resource use with recent advances in artificial intelligence is vital to create adequate policies and trigger warning alerts accordingly. Remotely sensing natural dynamic phenomena, like phenological crop cycles or deforestation processes, is challenging. Such phenomena are usually governed by continuous, smooth physical processes, but remote sensing involves sensors of very different natures, scales, and revisit rates, corrupted by stochastic events, resulting in highly complex multimodal, multitemporal, and multi-scale datasets. Moreover, data labeling in Earth Observation (EO) applications is usually scarce or unavailable, due to the massive amount of data continuously acquired and to well-known fieldwork limitations. This presentation discusses machine learning approaches for Earth observation data with scarce labeling, aiming to develop environmental protection solutions, promote efficient EO-based tools for adapting to the effects of global warming, and support efficient agricultural practices with a lower data-annotation burden.

Short-bio: Dário Oliveira received his M.Sc. (2009) and Ph.D. (2013) degrees in Electrical Engineering from the Pontifical Catholic University of Rio de Janeiro, Brazil. He was a visiting scholar at the Instituto Superior Técnico in Lisbon, Portugal (2008-2009) and at the Leibniz University of Hannover in Germany (2011-2012). As a postdoctoral fellow, he studied machine learning applied to neuroscience at the University of Sao Paulo, Brazil (2014-2015) and applied to dairy science at the University of Wisconsin, USA (2020-2021). From 2015 to 2021, he worked in industry at the General Electric Global Research Center in Rio de Janeiro and later at the IBM Research lab in São Paulo, Brazil. From 2021 to 2022 he was a Guest Professor at the AI4EO lab at the Technical University of Munich, Germany. He currently works at the School of Applied Mathematics, Getulio Vargas Foundation, Rio de Janeiro, Brazil.

3D structural modeling of whole interactomes leads to better understanding of disease mechanisms and better drug design

Haiyuan Yu - Cornell University

Abstract: Protein-protein interactions facilitate much of known cellular function. While simply knowing which proteins interact provides valuable information, far more specific hypotheses can be generated if structural details of the interactions are known. However, co-crystal structures and homology models cover only <10% of all known human interactions. To address this gap, we developed a unified machine learning framework that we used to create the first multi-scale, whole-proteome structural interactome in humans for all known protein interactions reported in the literature. Combining our full-coverage 3D human interactome with the recent AlphaFold2 database, we compiled a complete repository of the structure of every single protein, as well as the binding interfaces for all known interacting protein pairs in humans. We then developed a computational tool, named NetFlow3D, that integrates spatial cluster identification with a 3D structurally-informed protein network model to create a multiscale functional map of somatic mutations in cancer. By applying NetFlow3D to 415,017 somatic protein-altering mutations from 5,950 TCGA tumors across 19 cancer types, we identified 1,656 intra- and 3,343 inter-protein mutation clusters, of which ~50% would not have been found using only experimentally-determined protein structures. Moreover, studying the global organization of local spatial mutation clusters led to a 5.5-fold increase in the number of significantly dysregulated protein subnetworks, the majority of which were previously blurred by non-clustered background mutations in standard network analyses. During the COVID-19 pandemic, we used the latest quantitative proteomics approaches to experimentally generate a comprehensive SARS-CoV-2-human protein-protein interactome, and applied our 3D structural modeling tools to all available SARS-CoV-2-human interactions. Our 3D models help provide insight into SARS-CoV-2 etiology by identifying protein-protein interactions enriched for recent sequence deviation between SARS-CoV-1 and SARS-CoV-2. By comparing the predicted binding sites on human proteins for SARS-CoV-2 proteins with those for FDA-approved drugs, we were able to prioritize a set of drugs as potential antiviral treatments.

Short-bio: Haiyuan Yu performs research in the broad area of network systems biology. The Yu group uses integrated computational-experimental systems biology approaches to determine protein interactions and complex structures on the scale of the whole cell. In particular, his group focuses on protein-protein and gene regulatory networks and seeks to understand how such intricate systems evolve and how their perturbations lead to human disease, especially Autism Spectrum Disorder and cancer. Towards these goals, Haiyuan led his group to develop the concept of “3D structurally-resolved interactome networks”, in which they integrate multi-scale structural modeling, machine learning, and high-throughput genomics/proteomics experiments to determine protein interactions and their binding interfaces on the whole-proteome scale. More recently, in close collaboration with John Lis and his group, the Yu group demonstrated that the novel PRO-cap assay, which detects enhancer RNAs (eRNAs), is a critical assay for identifying active enhancers genome-wide: among RNA-sequencing assays, PRO-cap has great sensitivity and specificity for detecting eRNAs (and thus active enhancers) across the whole genome at high resolution. Yu is the Tisch University Professor in the Department of Computational Biology and the Weill Institute for Cell and Molecular Biology, and the founding Director of the Center for Innovative Proteomics (CIP) at Cornell University.

Prediction of bacterial phenotypes from genomes using machine learning

João Carlos Setubal - University of São Paulo

Abstract: The last 20 years have seen an explosion of genome sequencing, particularly of bacteria. Currently, there are almost half a million prokaryotic genomes available from GenBank. Many of these genomes are MAGs: Metagenome-Assembled Genomes. For most of these MAGs, all we know is a genome sequence and the environment from where it was sampled. In this talk, I will describe ongoing work that aims to predict a number of bacterial phenotypes using only the genome sequence and its annotation as input, with standard machine learning as basic methodology. The phenotypes to be predicted are limited by the number of experimentally-proven phenotypes available. Nevertheless, so far we have obtained encouraging results for about 10 phenotypes, including optimal growth pH, optimal growth temperature, ability to sporulate, and ability to fix nitrogen. Joint work with Bruno Iha.
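The "standard machine learning as basic methodology" pipeline can be sketched roughly as follows. Everything here is a synthetic stand-in (random gene-family presence/absence features and an invented phenotype rule), not the actual data or models of the ongoing work:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical encoding: each genome becomes a binary vector of
# annotated gene-family presence/absence; the label is a phenotype
# such as "can fix nitrogen".
n_genomes, n_families = 300, 50
X = rng.integers(0, 2, size=(n_genomes, n_families))
# Synthetic phenotype driven by a few marker families.
y = ((X[:, 0] & X[:, 1]) | X[:, 2]).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```

In practice the limiting factor is, as the abstract notes, the supply of experimentally-proven phenotype labels, not the classifier itself.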

Short-bio: João Carlos Setubal is Full Professor at the University of São Paulo, Chemistry Institute, Department of Biochemistry. Setubal has a PhD in Computer Science from the University of Washington, where he also spent a sabbatical year at the Department of Genome Sciences. Setubal is a bioinformatician, working primarily on the development of computational tools for and analysis of bacterial genomes and metagenomes.

Network Medicine: Integrating data towards a better understanding of human diseases and vaccination

Helder Nakaya - University of São Paulo

Abstract: Diseases are mostly a consequence of abnormalities in multiple genes. Network Medicine investigates how molecules interact with each other in complex intracellular and intercellular networks. The talk will show recent advances in this emerging field of research and its impact on precision medicine, vaccinology, and drug repositioning.

Short-bio: Dr. Helder Nakaya is a former professor at the University of São Paulo and former deputy director of its School of Pharmaceutical Sciences. He is now a senior researcher at the Hospital Israelita Albert Einstein in São Paulo. He is also an adjunct professor in the Department of Pathology at the Emory University School of Medicine in the US. Dr. Nakaya is an affiliate member of the Brazilian Academy of Sciences and a principal investigator at the Center for Research in Inflammatory Diseases and at the Scientific Platform Pasteur-USP.

Understanding Societies from their Digital Records

Pedro Vaz de Melo - Federal University of Minas Gerais

Abstract: The immense availability of mobile computing technologies such as smartphones and tablets and the worldwide adoption of social applications such as Facebook and Twitter have allowed people to be continuously connected to the Internet. In this scenario, people act as social sensors, voluntarily providing data that captures their daily experiences from observations of the physical and online world. In addition, open data initiatives from various sectors of society, such as the Open Data portal of the Chamber of Deputies of Brazil, publish structured data that can be freely used by anyone to foster the development of intelligent tools. In today's world, every movement or action generates a digital record, and this gigantic database can be seen as a digital, structured representation of the world and its societies. In this talk, I will describe some simple computational methods that make use of this public data to discover knowledge in large-scale complex social systems. In particular, I will show how this data can be used to understand cultural behaviors and to promote transparency in political systems and activities, and that these methods are capable of discovering surprising features and patterns at a lower cost and faster than traditional methods.

Short-bio: Pedro O.S. Vaz de Melo is an associate professor in the Computer Science Department (DCC) of the Federal University of Minas Gerais (UFMG). He received his bachelor's degree (2003) and Master's (2007) in Computer Science from the Pontifical Catholic University of Minas Gerais. He got his Ph.D. at the Federal University of Minas Gerais (UFMG), with a one-year period as a visiting researcher at Carnegie Mellon University and a five-month period as a visiting researcher at INRIA Lyon. His research interest is mostly focused on knowledge discovery and data mining in complex and distributed systems. Pedro has published more than ninety peer-reviewed papers in journals, magazines, and conferences, such as ACM SIGKDD, NeurIPS, WWW, NAACL, ICWSM, ACM TKDD, PLOS ONE, and IEEE ComMag. He has also received several grants from CNPq and Fapemig, two Google Research Awards for Latin America, and the 2020 CNIL-Inria Privacy Award.

Measure Vectorization for Automatic Topologically-Oriented Learning with guarantees

Frédéric Chazal - INRIA Saclay

Abstract: Robust topological information commonly comes in the form of a set of persistence diagrams, which can be seen as discrete measures and are not easy to use in generic machine learning frameworks. In this talk we will introduce a fast, learnt, unsupervised vectorization method for measures in Euclidean spaces, named ATOL, and use it to reflect underlying changes in topological behaviour in machine learning contexts. The algorithm is simple and efficiently discriminates important space regions where meaningful differences from the mean measure arise. We will show that it is provably able to separate clusters of persistence diagrams, and we will illustrate the strength and robustness of the approach on a few synthetic and real data sets.
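A simplified sketch of the underlying idea, not the authors' actual algorithm: learn center points from the pooled diagram points without supervision, then map each diagram (viewed as a discrete measure) to a fixed-size vector of its mass near each center. The "diagrams" below are synthetic point clouds:

```python
import numpy as np
from sklearn.cluster import KMeans

def vectorize(diagrams, centers, scale=1.0):
    """Map each diagram (a finite point set / discrete measure) to a
    fixed-size vector: Gaussian-weighted mass near each learned center."""
    vecs = []
    for d in diagrams:
        dists = np.linalg.norm(d[:, None, :] - centers[None, :, :], axis=-1)
        vecs.append(np.exp(-(dists / scale) ** 2).sum(axis=0))
    return np.array(vecs)

rng = np.random.default_rng(0)
# Two groups of synthetic "persistence diagrams": points near (0, 1)
# versus points near (2, 3).
dgms_a = [rng.normal([0, 1], 0.1, size=(20, 2)) for _ in range(5)]
dgms_b = [rng.normal([2, 3], 0.1, size=(20, 2)) for _ in range(5)]

# Unsupervised step: learn centers from the pooled points.
pooled = np.vstack(dgms_a + dgms_b)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pooled)

X = vectorize(dgms_a + dgms_b, km.cluster_centers_)
print(X.round(2))
```

The resulting vectors cleanly separate the two groups and can be fed to any standard classifier or clustering method, which is the practical appeal of measure vectorization.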

Short-bio: Frederic Chazal has been Directeur de Recherche (senior researcher) at INRIA Saclay Île-de-France since 2007 and Director of the DATAIA Institute at Université Paris-Saclay since 2021. After a PhD in pure mathematics, he oriented his research to computational geometry and topology for data sciences. He leads the DataShape team at INRIA, a group working on Topological Data Analysis (TDA), a recent and rapidly growing field at the crossing of mathematics, statistics, machine learning, and computer science. Frederic's contributions to the field range from fundamental mathematical aspects to algorithmic and applied problems. He has published more than 90 papers in major computer science conferences and mathematics journals, and he has co-authored 2 reference books and 3 patents. He is the Editor-in-Chief of the Journal of Applied and Computational Topology (Springer), and he is, or has been, an associate editor of 3 other international journals: Discrete and Computational Geometry (Springer), SIAM Journal on Imaging Sciences, and Graphical Models (Elsevier). In recent years Frederic has headed several national and international research projects on geometric and topological methods in statistics, machine learning, and AI. He is also the scientific head of joint industrial research projects between Inria and several companies, such as Fujitsu (TDA, machine learning, and explainable AI) and the French SME Sysnav.

Provably expressive temporal graph networks

Diego Mesquita - Getulio Vargas Foundation

Abstract: Temporal graph networks (TGNs) have gained prominence as models for embedding dynamic interactions, but little is known about their theoretical underpinnings. We establish fundamental results about the representational power and limits of the two main categories of TGNs: those that aggregate temporal walks (WA-TGNs), and those that augment local message passing with recurrent memory modules (MP-TGNs). Specifically, novel constructions reveal the inadequacy of MP-TGNs and WA-TGNs, proving that neither category subsumes the other. We extend the 1-WL (Weisfeiler-Leman) test to temporal graphs, and show that the most powerful MP-TGNs should use injective updates, as in this case they become as expressive as the temporal WL. Also, we show that sufficiently deep MP-TGNs cannot benefit from memory, and MP/WA-TGNs fail to compute graph properties such as girth. These theoretical insights lead us to PINT -- a novel architecture that leverages injective temporal message passing and relative positional features. Importantly, PINT is provably more expressive than both MP-TGNs and WA-TGNs. PINT significantly outperforms existing TGNs on several real-world benchmarks.
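For readers unfamiliar with the Weisfeiler-Leman machinery the abstract extends to temporal graphs, here is a minimal sketch of static 1-WL color refinement; the graphs and the hashing scheme are illustrative only:

```python
from collections import Counter

def wl_colors(adj, rounds=3):
    """1-WL (color refinement): repeatedly replace each node's color by a
    hash of its current color and the multiset of its neighbors' colors."""
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        # Re-index signatures to compact integer colors.
        palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
        colors = {v: palette[sigs[v]] for v in adj}
    return colors

# 1-WL distinguishes a triangle from a 3-node path by color histograms.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
print(Counter(wl_colors(triangle).values()),
      Counter(wl_colors(path).values()))
```

The test's well-known blind spot, also echoed in the abstract's girth result, is that it cannot tell apart regular graphs such as a 6-cycle and two disjoint triangles, since every node always carries the same color.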

Short-bio: Diego Mesquita earned his Ph.D. (2021) in Computer Science from Aalto University. Prior to that, he earned his bachelor's (2016) and master's (2017) degrees from the Universidade Federal do Ceará, also in computer science. Diego's research centers on fundamental machine learning (ML), with a focus on probabilistic methods, graph neural networks, and explainability. His work has been featured in prestigious ML venues such as AISTATS, UAI, and NeurIPS. Since 2022, he has been a faculty member at FGV EMAp.