# Mini-Courses

### Deep Learning

#### Moacir Ponti

This series of lectures presents basic concepts of machine learning and deep learning, in particular as applied to the problem of learning features from raw data. First, we introduce the fundamental definitions of learning/processing units, as well as architectures that arrange those units in layers. We then combine them into neural networks and explain how to train those networks via optimization: learning is achieved by optimizing an objective function that depends on the task one wishes to solve. Second, we explain dense learning units, which process all input data to produce their results, as well as convolutional ones, which process data locally and take the spatial configuration into account, a property that is useful, for example, in image recognition. Both supervised networks for classification and unsupervised networks for reconstruction are presented throughout the lectures. In addition, we describe recurrent units, such as long short-term memory (LSTM), that make it possible to take into account data in sequences, e.g. in time. Using the presented methods, one can design powerful models to solve different tasks such as classification and feature learning, and also combine them to create even more complex recognition systems. Throughout the lectures, examples using TensorFlow and Keras are given to illustrate the concepts.

**Slides:** [moacir1.pdf] [moacir2.pdf] [moacir3.pdf]
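
To make the contrast between dense and convolutional units concrete, here is a minimal sketch in plain Python. It is not taken from the lecture material, which uses TensorFlow and Keras; the function names, weights, and input values are illustrative assumptions, and ReLU is an arbitrary choice of activation.

```python
def dense_unit(x, w, b):
    """Dense unit: one weight per input, so every input contributes to the output."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, z)  # ReLU activation (an assumption for this sketch)


def conv1d_unit(x, kernel, b):
    """Convolutional unit: one small kernel is reused on each local window.

    Computed as a sliding dot product with stride 1 and no padding, which is
    how deep learning frameworks usually define "convolution".
    """
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        z = sum(kernel[j] * x[i + j] for j in range(k)) + b
        out.append(max(0.0, z))
    return out


x = [1.0, 2.0, 3.0, 4.0]  # hypothetical input signal

# The dense unit needs 4 weights for 4 inputs and returns a single value.
dense_out = dense_unit(x, w=[0.5, -0.5, 0.5, -0.5], b=1.0)

# The convolutional unit needs only 2 weights and returns one value per window.
conv_out = conv1d_unit(x, kernel=[-1.0, 1.0], b=0.0)

print(dense_out)  # 0.0
print(conv_out)   # [1.0, 1.0, 1.0]
```

The kernel `[-1.0, 1.0]` responds to the local difference between neighbouring values, which is why each output is 1.0 on this evenly increasing input.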

### Natural Language Processing - Dealing with unstructured data

#### Renata Vieira

In this course, we will present an overview of the area of NLP, its fundamentals, and current trends. We will present some important applications and recent results of our research at PUCRS. The course will address issues related to unstructured data (information in various textual forms), with an emphasis on information extraction.

**Slides:** [renata1.pdf] [renata2.pdf] [joaquim.pdf]

# Lectures

### Spatio-Temporal Data Analytics via Graph Signal Processing

#### Luis Gustavo Nonato

Signal processing has long been a fundamental tool in fields such as image processing, computer vision, and computer graphics, leveraging the development of filtering mechanisms designed to tackle problems ranging from denoising to object registration. More recently, the signal processing machinery has been extended to unstructured domains such as graphs, fostering a multitude of new theoretical developments and applications. In this talk, we show how graph signal processing is being applied to assist in the analysis of spatiotemporal data, leveraging the development of a number of visual analytic tools. In particular, we present examples of applications involving taxi data analysis, identification of crime patterns and study of dynamic networks. The design of filters such as edge-detection and feature preserving smoothing in graphs will also be discussed.

**Slides:** [gustavo.pdf]
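
As a rough illustration of what filtering a signal on a graph means, the sketch below applies the combinatorial graph Laplacian L = D - A to a signal defined on the nodes of a small graph. This is a generic textbook construction, not the speaker's method: the graph, the signal values, and the step size `alpha` are hypothetical, and the smoothing shown is plain Laplacian smoothing rather than the feature-preserving variant mentioned in the abstract.

```python
def laplacian_apply(adj, x):
    """Return L x for the combinatorial Laplacian L = D - A.

    `adj` maps each node index to the list of its neighbours.
    Large absolute values mark nodes that differ from their neighbours,
    so L x acts as a simple high-pass (edge-like) response.
    """
    return [len(adj[i]) * x[i] - sum(x[j] for j in adj[i]) for i in range(len(x))]


def smooth(adj, x, alpha=0.25):
    """One low-pass step x - alpha * L x: each node moves toward its neighbours."""
    lx = laplacian_apply(adj, x)
    return [xi - alpha * li for xi, li in zip(x, lx)]


# Hypothetical path graph 0 - 1 - 2 with a spike on the middle node.
adj = {0: [1], 1: [0, 2], 2: [1]}
signal = [0.0, 3.0, 0.0]

high_pass = laplacian_apply(adj, signal)
smoothed = smooth(adj, signal)

print(high_pass)  # [-3.0, 6.0, -3.0]
print(smoothed)   # [0.75, 1.5, 0.75]
```

On an undirected graph the smoothing step spreads the spike to the neighbouring nodes while keeping the sum of the signal unchanged.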

### The dispersion of crime concentration during a period of crime increase

#### Joana Monteiro

Extensive empirical evidence shows that crime concentrates in place, with these findings being important for helping to target police resources. Little is known, however, about whether these crime concentration areas are where crime increases the most during a period of crime increase. Using data from the seven largest cities in the state of Rio de Janeiro, Brazil, we show that during a period of crime increase, the locations most responsible for the increases were the micro-places where crime previously concentrated. We argue that the increases in crime in areas of crime concentration were mainly due to these places offering stable favorable conditions for crime. The study introduces a simple index—the Crime Concentration Dispersion Index—which helps police agencies determine where to target resources during a period of crime increase and offers results that provide an important Latin American urban perspective to the literature on crime concentration.

**Slides:** [johana.pdf]

### Gradient Boosting for inverse problems

#### Yuri Saporito

In this work, we propose a novel non-parametric method to solve inverse problems. The method is based on gradient boosting, from the statistical learning literature, and uses smooth transition trees as base learners. The smoothness and robustness of the method generate well-behaved solutions of inverse problems. We will apply the method to the estimation of local volatility functions, a very well-known problem in Quantitative Finance. The method generates well-behaved local volatility functions, capable of replicating vanilla option prices and the implied volatility surface. Furthermore, the method proved to be useful for pricing exotic options.

**Slides:** [yuri.pdf]

### Optimal Invariant Tests in IV Regression

#### Marcelo Moreira

We will go over the concept of using model symmetries to construct tests with correct size and good power. Special attention will be devoted to instrumental variable (IV) regression. Contrary to popular belief, we show that there exist model symmetries when equation errors are heteroskedastic and autocorrelated (HAC). Our theory is consistent with existing results for the homoskedastic model, but in general uses information on the structural parameter beyond the Anderson-Rubin, score, and rank statistics. This suggests that tests based only on the Anderson-Rubin and score statistics discard information on the causal parameter of interest. We apply our theory to construct designs in which these tests indeed have power arbitrarily close to size. Other tests, including other adaptations of the CLR test, do not suffer from the same deficiencies. Finally, we use the model symmetries to propose a novel weighted-average power test for the HAC-IV model.

### Understanding the shape of data: a brief introduction to Topological Data Analysis

#### Frédéric Chazal

Topological Data Analysis (TDA) is a recent and fast-growing field at the intersection of mathematics, computer science, and statistics. It is mainly motivated by the idea that topology and geometry provide a powerful approach to infer, analyze, and exploit robust qualitative and quantitative information about the structure of data represented, in general, as point clouds or samples in Euclidean or more general metric spaces. With the emergence and development of persistent homology theory, computational topology and geometry have brought new efficient mathematical and computational tools to infer, analyze, and exploit the topological and geometric structure of complex data. The goal of this talk is to provide a short introduction to TDA and persistent homology through the presentation of a few problems, results, and concrete examples.

**Slides:** [frederic.pdf]

### Big Data and Artificial Intelligence for Learning and Wellbeing

#### Helen Meng

Advancements in ICT, such as smart devices, IoT, and cloud computing, have coalesced to offer powerful new channels for data capture and analytics. We are now able to examine Big Data to uncover hidden patterns, unknown correlations, and other useful information. Furthermore, Big Data and powerful machine learning algorithms are fueling Artificial Intelligence (AI), which can automate many pre-formulated processes and bring about disruptive transformations. This talk presents several ongoing investigations, covering the domains of learning and health, where we apply Big Data analytics and AI to examine the world around us and to unravel the complexities and intricacies in the data, in order to shape technological innovations that have significant implications for ourselves and for future generations.

### Machine learning algorithms for making inferences on networks and answering questions in biology and medicine

#### Alberto Paccanaro

An important idea that has emerged recently is that a cell can be viewed as a set of complex networks of interacting bio-molecules, and that genetic disease is the result of abnormal interactions within these networks. In this talk, I will present novel machine learning algorithms for solving problems in systems biology and medicine that can be phrased in terms of inference in such large-scale networks. I will begin by describing a method to accurately quantify a distance between disease modules on the human interactome that uses only disease phenotype information. I will then show how this measure can be exploited by a semi-supervised learning algorithm for inferring disease genes for heritable diseases. Importantly, our approach allows the prediction of disease genes for diseases for which no disease gene is already known. Finally, I will present a method for the prediction of drug side effects. This algorithm, which is based on matrix factorization, is the first that can predict the frequency of drug side effects in the population.

**Slides:** [alberto.pdf]

### AutoML: Towards Automated Machine Learning

#### Andre Carvalho

As the number of successful applications of Machine Learning algorithms grows, so does the need to make these algorithms easily accessible to users without Machine Learning expertise. There have been several efforts in this direction, involving the recommendation not only of the most suitable algorithm, but also of its most appropriate hyper-parameter values. These efforts started a new research area, named Automated Machine Learning (AutoML), which has attracted the attention of researchers and practitioners not only from academia, but also from several companies working with data science. This talk will present the main approaches and recent advances in this area, also covering work carried out in the Analytics Laboratory at USP São Carlos.

**Slides:** [andre.pdf]

### City Data Science and Technology

#### Fabio Kon

Most of the world's population lives in cities. In Brazil, over 85% of the population lives in cities, and this number is growing. With new Information and Communication Technologies, citizens carry one or more computers at all times, while things in the city are connected to the Internet. Modern Data Science enables computation over huge amounts of data and the execution of sophisticated algorithms, e.g., for machine learning. In this talk, we will discuss how large amounts of georeferenced data from different sources can be processed with data science tools and combined into a variety of applications, ranging from map-based dashboards for city management and public policymaking to citizen apps.

**Slides:** [fabio.pdf]

### Sports Data Science

#### Claudio Silva

While there has always been interest in analyzing sports data, this research area has received significantly more attention in recent years due to both the recognition of the importance of objective statistics and the proliferation of available data. New technology is starting to enable the capture of gameplay at unprecedented levels of detail, including the tracking of the positions of all players and of game events at all times. Instead of being starved for data, analysts now have access to volumes of highly accurate gameplay data. This data deluge requires the development of novel visualization and machine learning tools and is leading to major new developments in sports data science. In this talk, we will review recent developments in this area and the enabling technologies. We will also cover our recent work, including the development of the Statcast Baseball Metrics Engine (BME) and related data science tools and techniques. This is joint work with Dr. Carlos Dietrich and many others at NYU and MLB Advanced Media.