The Theory-Practice Interplay in Machine Learning - Emerging Theoretical Challenges

Shai Ben David

University of Waterloo, Canada

Statistical machine learning is a fast-growing area, focused on the automated detection of meaningful patterns in large and complex data sets. Theoretical analysis has played a major role in some of the most prominent practical successes in this field. However, mainstream machine learning theory rests on strong simplifying assumptions that are often unrealistic. In the past decade, the practice of machine learning has led to the development of various heuristic paradigms that answer the needs of a vastly growing range of applications. Many such useful paradigms fall beyond the scope of the currently available analysis, raising the need for major extensions of the common theoretical models.

In this talk, I will survey some of these application-motivated challenges. In particular, I will discuss recent developments in the theoretical analysis of semi-supervised learning, multi-task learning, "learning to learn", privacy-preserving learning and more.



Development of New Computational Approaches for the Analysis of Gene Expression Datasets and Discovering Significant Network Genes

Dr. Malik Yousef

Institute of Applied Research, The Galilee Society, Israel

Background: Classification using microarray datasets is usually based on a small number of samples involving tens of thousands of gene expression measurements, usually obtained concurrently. Selecting the significant genes poses a challenging issue in the analysis and interpretation of such high-dimensional data. Classification based on groups of genes that interact with each other sometimes performs better than classification using single genes; this was demonstrated in the previous SVM-RCE (Recursive Cluster Elimination) research, where a correlation metric was used for gene clustering. Recently, large databases of gene-interaction networks have been generated, and they have become an extremely important resource for the analysis of genetic phenomena. In this talk I will demonstrate that integrating this prior biological network knowledge into an SVM-based recursive feature elimination algorithm yields good performance and improves the interpretability of the results.

Results: Initially, one thousand genes are selected by t-test from the training set. The genes are then filtered so that only genes that map to the gene-network database remain. The Gene Expression Network Analysis Tool (GXNA) is applied to the remaining genes to form n clusters of genes that are highly connected in the network. A linear SVM is used to classify the samples using these clusters, and a weight is assigned to each cluster based on its importance to the classification. The least informative clusters are removed, while the remainder are retained for the next step. This process is repeated until an optimal classification is obtained.
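The pipeline above can be sketched as follows. This is a minimal illustration, not the authors' code: all names are mine, the cluster-scoring callback stands in for the linear-SVM weighting, and the GXNA clustering step is assumed to have been run already.

```python
import math

def rank_genes_by_ttest(X, y, k=1000):
    """Rank genes by a two-sample t statistic and keep the top-k indices.

    X is a list of samples (each a list of expression values); y holds
    binary class labels (0/1) for the samples."""
    def column(j, label):
        return [row[j] for row, lab in zip(X, y) if lab == label]

    def tstat(j):
        a, b = column(j, 0), column(j, 1)
        ma, mb = sum(a) / len(a), sum(b) / len(b)
        va = sum((v - ma) ** 2 for v in a) / len(a)
        vb = sum((v - mb) ** 2 for v in b) / len(b)
        return abs(ma - mb) / (math.sqrt(va / len(a) + vb / len(b)) + 1e-12)

    return sorted(range(len(X[0])), key=tstat, reverse=True)[:k]

def recursive_cluster_elimination(clusters, score_cluster, drop_frac=0.3):
    """Repeatedly drop the lowest-scoring clusters until one remains.

    `clusters` maps a cluster id to its member genes; `score_cluster`
    returns a classification-based weight (higher = more informative),
    standing in for the linear-SVM cluster weighting from the talk."""
    surviving = dict(clusters)
    while len(surviving) > 1:
        scores = {cid: score_cluster(genes) for cid, genes in surviving.items()}
        n_drop = max(1, int(len(surviving) * drop_frac))
        # Remove the clusters with the lowest classification weights.
        for cid in sorted(scores, key=scores.get)[:n_drop]:
            del surviving[cid]
    return surviving
```

In practice `score_cluster` would train a linear SVM on each cluster's genes and return cross-validated accuracy or the cluster's weight magnitude; any monotone importance score fits the same loop.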

Conclusion: More than 90% accuracy can be obtained in the classification of microarray datasets by integrating the prior-knowledge interaction network into the microarray analysis.



Vox Populi: On Learning from Crowds

Ohad Shamir

The Hebrew University of Jerusalem

With the emergence of large-scale search engines and crowdsourcing websites, machine learning practitioners increasingly have to cope with datasets labeled by a large number of teachers. Unlike traditional datasets, which are usually labeled by a small, select group of experts, here no a priori knowledge about the teachers is available, and their quality usually ranges from excellent to downright malicious. Often, there is no control over the assignment of instances to teachers, and common multi-teacher techniques can be impossible or very wasteful to implement. As a result, these kinds of datasets pose new challenges to both the theory and the practice of machine learning.

In this talk, I'll discuss some novel approaches for learning from such data, which use the collective "wisdom of the crowd" to mitigate the effect of low-quality and malicious teachers.
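As a toy illustration of the "wisdom of the crowd" idea (a generic sketch, not the method from the talk): estimate each teacher's reliability by agreement with the current consensus, then re-weight their votes, so that a teacher who consistently contradicts the crowd loses influence.

```python
def weighted_consensus(labels, n_rounds=5):
    """`labels` maps teacher -> {item: binary label in {0, 1}}.
    Returns the consensus labels and the per-teacher weights."""
    items = {i for votes in labels.values() for i in votes}
    # Start from an unweighted majority vote per item.
    est = {i: round(sum(v[i] for v in labels.values() if i in v)
                    / sum(1 for v in labels.values() if i in v))
           for i in items}
    for _ in range(n_rounds):
        # Teacher weight = fraction of that teacher's votes agreeing
        # with the current consensus (0 for a fully adversarial teacher).
        w = {t: sum(v[i] == est[i] for i in v) / len(v)
             for t, v in labels.items()}
        # Re-vote, weighting each teacher by its estimated reliability.
        est = {i: int(sum(w[t] for t, v in labels.items()
                          if i in v and v[i] == 1)
                      > sum(w[t] for t, v in labels.items()
                            if i in v and v[i] == 0))
               for i in items}
    return est, w
```

With two reliable teachers and one malicious one, the malicious teacher's weight drops to zero after the first round and stops affecting the consensus.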


Joint work with Ofer Dekel


Machine Learning on Physical Robots

Peter Stone

University of Texas at Austin

As robot technology advances, we are approaching the day when robots will be deployed widely in uncontrolled, unpredictable environments: the proverbial "real world".  As this happens, it will be essential for these robots to be able to adapt autonomously to their changing environment.  For a robot to learn to improve its performance based entirely on real-world environmental feedback, the robot's behavior specification and learning algorithm must be constructed so as to enable data-efficient learning.  This talk presents three examples of machine learning on physical robots.

First, for a robot, the ability to get from one place to another is one of the most basic skills.  However, locomotion on legged robots is a challenging multidimensional control problem.  We present a machine learning approach to legged locomotion, with all training done on the physical robots.  The resulting learned walk is considerably faster than all previously reported hand-coded walks for the same robot platform.

Second, robots whose main sensor is a digital camera are often equipped with a color map that maps each pixel value to an associated color label.  Typically, these color maps are created with the help of extensive and time-consuming manual labeling, and they are sensitive to illumination changes.  We present a method for automatic color-map learning and illumination invariance on camera-based mobile robots.

Third, we present a technique for Autonomous Sensor and Actuator Model Induction (ASAMI) on a mobile robot.  While previous approaches to calibration make use of an independent source of feedback, ASAMI is unsupervised, in that it does not receive any well-calibrated feedback about its location.  Starting with only an inaccurate action model, it learns accurate relative action and sensor models. Furthermore, ASAMI is fully autonomous, in that it operates with no human supervision.

All of these methods are fully implemented and tested on a Sony Aibo ERS-7 robot.



Predictive Information, Learning, and the Perception Action Cycle

Naphtali Tishby

The Hebrew University of Jerusalem

Learning is an important component of a wider cycle of functions that characterize autonomous systems, often called the Perception-Action Cycle. The perception-action cycle was described by the neuroscientist J. M. Fuster as the circular flow of information that takes place between the organism and its environment in the course of a sensory-guided sequence of behavior towards a goal. When the environment is (asymptotically mean) stationary, the performance of the cycle is determined by the ability of the organism to efficiently extract information from the past that is valuable for the organism in the future, on multiple time scales. This observation suggests an intriguing rigorous analogy between the perception-action cycle and Shannon's classical model of communication. I will present this analogy and discuss some of its consequences for optimal biological adaptation and performance. More specifically, I will present some recent applications of this theoretical framework to auditory perception and language.



Collaborative Filtering with Temporal Dynamics

Yehuda Koren


Customer preferences for products drift over time. Product perception and popularity are constantly changing as new selections emerge. Similarly, customer inclinations evolve, leading them to redefine their tastes. Thus, modeling temporal dynamics should be a key consideration when designing recommender systems or general customer preference models. However, this raises unique challenges: within the ecosystem connecting multiple products and customers, many different characteristics shift simultaneously, many of them influence each other, and often those shifts are delicate and associated with only a few data instances. This distinguishes the problem from concept-drift explorations, where mostly a single concept is tracked. Classical time-window or instance-decay approaches cannot work, as they lose too much signal when discarding data instances. A more sensitive approach is required, one that can better distinguish transient effects from long-term patterns. The paradigm we offer is to create a model that tracks time-changing behavior throughout the life span of the data. This allows us to exploit the relevant components of all data instances, while discarding only what is modeled as irrelevant. Accordingly, we revamp two leading collaborative filtering recommendation approaches. Evaluation is made on a large movie-rating dataset by Netflix. Results are encouraging and better than those previously reported on this dataset.
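As a hedged illustration of one ingredient of this approach (a simplified sketch, not the model from the talk): rather than discarding old ratings with a time window, keep all of them and let an item's baseline bias vary across time bins, so each era of the data contributes to its own component of the model.

```python
from collections import defaultdict

def fit_time_binned_item_bias(ratings, n_bins, t_max):
    """`ratings` is a list of (item, rating, t) triples with 0 <= t < t_max.
    Returns the global mean mu and per-(item, bin) offsets from mu."""
    mu = sum(r for _, r, _ in ratings) / len(ratings)
    buckets = defaultdict(list)
    for item, r, t in ratings:
        b = min(int(t * n_bins / t_max), n_bins - 1)   # which time bin t falls in
        buckets[(item, b)].append(r - mu)
    return mu, {key: sum(v) / len(v) for key, v in buckets.items()}

def predict(mu, bias, item, t, n_bins, t_max):
    """Baseline prediction: global mean plus the item's bias in t's bin."""
    b = min(int(t * n_bins / t_max), n_bins - 1)
    return mu + bias.get((item, b), 0.0)
```

A full model would add user terms and gradual user drift on top of these binned item biases; the point of the sketch is that no data instance is discarded, each simply informs the bin it belongs to.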



Learning when the Training and Application Domains Differ - Extending Theory to "Dirty" Realistic Scenarios

Shai Ben David

University of Waterloo, Canada

Machine learning enjoys a deep and powerful theory that has led to a wide variety of highly successful practical tools. However, most of this theory is developed under simplifying assumptions that clearly fail in the real world. In particular, a fundamental assumption of the theory is that the data available for training and the data of the target application come from the same source. In this work, we have extended existing theory to model realistic scenarios in which the learner has to settle for data that is inherently different from the data to which the learned program will be applied. Our extension allows us to analyze some existing practical heuristics, as well as to guide the development of novel techniques aimed at handling such data discrepancies.


The talk is based on joint work with John Blitzer, Koby Crammer and Fernando Pereira.



Active Online Learning via a Maximally Informative Classifier

Noam Slonim


I will describe an online classification scheme based on a simple information-theoretic principle. I will further show that this scheme can be combined with an active learning approach in order to maximize classification accuracy for a given training set size. The proposed method is highly efficient in terms of run time and memory footprint. Experimental results in text classification tasks over five real-world corpora demonstrate the validity and the advantages of this approach.


Joint work with Elad Yom-Tov

Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree

Daphna Weinshall

The Hebrew University of Jerusalem

Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on "incongruent events" - when "general-level" and "specific-level" classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of a label hierarchy, we show how a partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is one where the probability computed at some more specific level (in accordance with the partial order) is much smaller than the probability computed at some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. For these two types, respectively, we show promising results with real data on two specific problems: out-of-vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.
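The core incongruence test can be sketched as follows. This is a minimal illustration: the threshold ratio and the use of a max over specific-level models are my assumptions, not the paper's exact statistic.

```python
def is_incongruent(p_general, p_specific, ratio=10.0):
    """Flag an event as incongruent when a general-level classifier
    assigns it high probability, yet every specific-level model under
    that general class assigns a far lower probability.

    `p_general` is the general-level probability; `p_specific` is a list
    of probabilities from all known specific-level models (sub-classes)."""
    return p_general > ratio * max(p_specific)
```

For instance, a face detector firing confidently while every known individual's face model assigns near-zero probability suggests the face of a new individual, i.e., a candidate new sub-class rather than noise.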


All Learning is Robust

Shie Mannor


Controlling "overfitting" (when the learned decision rules fit the training samples extremely well but fail to "generalize", i.e., perform poorly under the true distribution) is a long-standing topic of study in machine learning. Regularization is a widely used technique to control overfitting. The success of regularization in a host of different algorithms is usually interpreted as coming from penalizing the complexity of the resulting decision rules. In this talk we explain it from a robust optimization perspective: assuming that each sample is subject to a certain disturbance, we find the best decision under the most adversarial disturbance. We show that the resulting class of decision rules is richer than that of regularization, and that a special choice of disturbance exactly recovers the solution obtained by penalizing complexity via regularization. We show that Support Vector Machines and Lasso can be re-derived from a robust optimization perspective. This equivalence between regularization and robustness gives a physical interpretation of the regularization process. Moreover, we are able to explain from a robustness point of view why support vector machines are consistent, and why Lasso produces sparse solutions. We finally mention some recent results showing that robustness is a necessary and sufficient condition for the consistency of learning algorithms.
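Schematically, the SVM instance of this equivalence can be written as follows (notation and the exact form of the uncertainty set are mine, a sketch of the idea rather than the talk's precise theorem): with hinge loss and an adversary whose per-sample perturbations $\delta_i$ share a total dual-norm budget $c$,

```latex
\min_{w} \; \max_{\sum_i \|\delta_i\|_* \le c} \;
  \sum_i \bigl[\, 1 - y_i \langle w,\, x_i + \delta_i \rangle \,\bigr]_+
\;=\;
\min_{w} \; \sum_i \bigl[\, 1 - y_i \langle w,\, x_i \rangle \,\bigr]_+ \;+\; c\,\|w\| .
```

The worst-case adversary concentrates its budget against the direction of $w$, contributing exactly the norm penalty $c\,\|w\|$; the regularizer thus appears as the price of robustness rather than an ad hoc complexity penalty.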



Anytime Learning of Anycost Classifiers

Shaul Markovitch


The classification of new cases using a predictive model incurs two types of costs: testing costs and misclassification costs. Recent research efforts have resulted in several novel algorithms that attempt to produce learners minimizing both types of costs simultaneously. In many real-life scenarios, however, we cannot afford to conduct all the tests required by the predictive model. For example, a medical center might have a fixed, predetermined budget for diagnosing each patient. Due to their structure, decision trees are considered attractive for cost-bounded classification, as they require only the tests along a single path.

In this work we present an anytime framework for producing decision-tree-based classifiers that can make accurate decisions within a strict bound on testing costs. These bounds can be known to the learner, known to the classifier but not to the learner, or not predetermined at all. Extensive experiments with a variety of datasets show that our proposed framework produces trees with lower misclassification costs across a wide range of testing cost bounds.
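Budget-aware classification with such a tree can be sketched as follows. The node layout and the fall-back rule (predict the current node's majority label when the next test is unaffordable) are illustrative assumptions, not the framework's actual algorithm.

```python
class Node:
    """A decision-tree node; `feature is None` marks a leaf."""
    def __init__(self, majority, feature=None, cost=0.0,
                 threshold=None, left=None, right=None):
        self.majority = majority      # fallback prediction at this node
        self.feature = feature        # index of the test's feature, or None
        self.cost = cost              # cost of performing this node's test
        self.threshold, self.left, self.right = threshold, left, right

def classify_within_budget(node, x, budget):
    """Walk the tree, paying each test's cost from `budget`; if the next
    test would exceed the remaining budget, stop and return the current
    node's majority label instead of performing the test."""
    while node.feature is not None:
        if node.cost > budget:
            return node.majority
        budget -= node.cost
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.majority
```

Because every internal node carries a usable fallback prediction, the classifier degrades gracefully: a larger budget buys a deeper, more informed path, while a tight budget still yields an answer.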


This work is done in collaboration with Saher Esmeir



Convex Duality in Learning and Inference Algorithms

Amir Globerson

The Hebrew University of Jerusalem

Machine learning algorithms are often used to extract rules and structure from large datasets, and have been successfully used in fields such as machine vision, natural language processing and computational biology. With the growing availability of text and images on the web, and high-throughput experiments in biology, learning algorithms are becoming a key tool for organizing, searching, and interpreting this data.

Learning algorithms are typically required to work with very large datasets of high dimensional signals. Thus scalability of algorithms is a key issue that must be addressed. In this talk I will show how the concept of convex duality can be used to design simple and scalable algorithms, and to help understand convergence of previously existing ones.

I will first address the problem of probabilistic inference in multi-variate distributions, and show how convex duality results in simple convergent message passing algorithms for this problem. Specifically, I will show how such algorithms may be used for obtaining exact solutions to difficult inference problems such as those that arise in protein design.

I will next address the problem of learning classifiers from labeled data, and present an exponentiated gradient algorithm that is also derived using convex duality. The algorithm can be shown to converge, and its convergence rate can be analyzed. I will conclude by showing an application of the algorithm to a large scale natural language parsing task, and demonstrate that it converges significantly faster than previous state of the art algorithms.
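The basic exponentiated-gradient update, in its standard textbook form (a sketch for intuition, not the parsing-specific algorithm from the talk): weights live on the probability simplex and are updated multiplicatively by the exponentiated negative gradient, then renormalized.

```python
import math

def eg_step(w, grad, eta):
    """One exponentiated-gradient update on the simplex:
    w_i <- w_i * exp(-eta * g_i), followed by renormalization."""
    scaled = [wi * math.exp(-eta * gi) for wi, gi in zip(w, grad)]
    z = sum(scaled)                      # normalizer keeps weights summing to 1
    return [s / z for s in scaled]
```

The multiplicative form keeps all weights strictly positive and summing to one without any explicit projection, which is what makes the scheme attractive for large-scale dual optimization problems.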



User-Centric Design of Mobile Applications – Industry Perspective

Jonna Häkkilä
Nokia Research Center, Finland

Mobile phones have become one of the truly ubiquitous devices in our everyday lives, offering access to numerous types of applications and services virtually anytime, anywhere. The characteristics of these devices pose challenges to human-computer interaction (HCI) designers, not only because of the restrictions set by a mobile phone's physical limitations and limited computing power, but also because of the demand to create products for large, multicultural audiences. The intention of this lecture is to offer insight into the practicalities of user-centric design of mobile applications, and to look through examples of hands-on HCI work in the mobile phone industry.