Embeddings are core components of modern model-based Collaborative Filtering (CF) methods, such as Matrix Factorization (MF) and its Deep Learning variations. In essence, embeddings map the original sparse representation of categorical features (e.g., users and items) to dense low-dimensional representations. A well-known limitation of such methods is that the learned embeddings are opaque and hard to explain to users. On the other hand, a key feature of simpler KNN-based CF models (a.k.a. user/item-based CF) is that they naturally yield similarity-based explanations, i.e., similar users/items serve as evidence to support model recommendations. Unlike related works that try to attribute explicit meaning (via metadata) to the learned embeddings, in this paper we propose to equip the learned embeddings of MF with meaningful similarity-based explanations. First, we show that the learned user/item …
In explainable face verification, given a face matcher, the task is to determine how relevant each part of a probe image is to establishing the match with an enrolled image. In many cases, however, the trained models cannot be manipulated and must be treated as "black boxes". In this paper, we present six different saliency maps that can be used to explain any face verification algorithm without modifying the face recognition model. The key idea of the methods is to observe how the matching score of the two face images changes when the probe is perturbed. The proposed methods remove and aggregate different parts of the face, and measure the contributions of these parts both individually and in collaboration. We test and compare our proposed methods in three different scenarios: synthetic images with different qualities and occlusions; real face images with different facial expressions, poses, and occlusions; and faces from different demographic groups. In our experiments, five different face verification algorithms are used: ArcFace, Dlib, FaceNet (trained on VGGface2 and Casia-WebFace), and LBP. We conclude that one of the proposed methods achieves saliency maps that are stable and interpretable to humans. In addition, our method, in combination with a new contour-based visualization of saliency maps, shows promising results in comparison with other state-of-the-art methods. This paper offers insight into any face verification algorithm by showing clearly which face areas the algorithm relies on most during the recognition process.
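The perturbation idea described above (occlude part of the probe, then observe how the matching score changes) can be illustrated with a minimal sketch. This is not one of the six methods from the paper: it shows only the simplest single-region removal, and the names `cosine_matcher` and `occlusion_saliency`, the patch size, and the fill value are all assumptions introduced for illustration. The matcher is a toy stand-in; any black-box function that returns a score for two images could be passed instead.

```python
import numpy as np

def cosine_matcher(a, b):
    """Toy stand-in for a black-box face matcher (hypothetical, for illustration)."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def occlusion_saliency(probe, enrolled, match_score, patch=4, fill=0.0):
    """Score drop caused by occluding each patch of the probe image.

    The matcher is only called, never inspected, so it is treated as a black box.
    Larger values mean the occluded region mattered more for the match.
    """
    base = match_score(probe, enrolled)
    saliency = np.zeros(probe.shape, dtype=float)
    h, w = probe.shape
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            perturbed = probe.copy()
            perturbed[top:top + patch, left:left + patch] = fill
            drop = base - match_score(perturbed, enrolled)
            saliency[top:top + patch, left:left + patch] = drop
    return saliency

# Synthetic 16x16 "images": a faint background plus one bright, distinctive region.
rng = np.random.default_rng(0)
enrolled = rng.random((16, 16)) * 0.1
enrolled[4:8, 4:8] = 1.0
probe = enrolled.copy()

sal = occlusion_saliency(probe, enrolled, cosine_matcher)
row, col = np.unravel_index(sal.argmax(), sal.shape)
print("most relevant pixel:", (int(row), int(col)))
```

In this toy setting the largest score drop falls inside the bright region, since removing it changes the similarity the most. Measuring the joint contribution of several regions, as the abstract describes, would require occluding combinations of patches rather than one at a time.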
This presentation addresses explainable AI. More specifically, we deal with numerical attribution scores that are assigned to the feature values of an entity under classification, in order to identify and rank their importance for the obtained classification label. We concentrate on the popular SHAP score that can be applied with black-box and open models. We show that, in contrast to its general #P-hardness, it can be computed in polynomial time for classifiers that are based on decomposable and deterministic Boolean decision circuits. This class of classifiers includes decision trees and ordered binary decision diagrams. This result was established in . The presentation illustrates how the proof heavily relies on the connection to SAT-related computational problems.
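For reference, the SHAP score is commonly defined as a Shapley value over the set of features. The notation below is an assumption introduced here and is not taken from the presentation: $X$ is the set of features, $M$ the classifier, $e$ the entity under classification, $x \in X$ the feature being scored, and the expectation is over a fixed distribution on entities.

```latex
\mathrm{SHAP}(M, e, x) \;=\;
  \sum_{S \subseteq X \setminus \{x\}}
  \frac{|S|!\,\bigl(|X| - |S| - 1\bigr)!}{|X|!}
  \,\Bigl( \phi_e\bigl(S \cup \{x\}\bigr) - \phi_e(S) \Bigr),
\qquad
\phi_e(S) \;=\; \mathbb{E}\bigl[\, M(e') \;\big|\; e'_S = e_S \,\bigr]
```

The sum ranges over exponentially many subsets $S$, so a direct evaluation is infeasible in general; this is the difficulty that the polynomial-time result for decomposable and deterministic circuits avoids.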
Large-scale pre-trained language models have shown remarkable results in diverse NLP applications. Unfortunately, these performance gains have been accompanied by a significant increase in computation time and model size, stressing the need to develop new or complementary strategies to increase the efficiency of these models. In this paper, we propose DACT-BERT, a differentiable adaptive computation time strategy for BERT-like models. DACT-BERT adds an adaptive computational mechanism to BERT's regular processing pipeline, which controls the number of Transformer blocks that need to be executed at inference time. By doing this, the model learns to combine the most appropriate intermediate representations for the task at hand. Our experiments demonstrate that our approach, when compared to the baselines, excels in a reduced computational regime and is competitive in less restrictive ones.
Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level representations. In this work, we propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. As a result, our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network. By experimenting with benchmarks designed to evaluate discourse-related knowledge using pre-trained sentence representations, we demonstrate that our approach improves performance in 6 out of 11 tasks by excelling in discourse relationship detection.
Recently, few-shot video classification has received increasing interest. Current approaches mostly focus on effectively exploiting the temporal dimension in videos to improve learning under low-data regimes. However, most works have largely ignored that videos are often accompanied by rich textual descriptions that can also be an essential source of information for handling few-shot recognition cases. In this paper, we propose to leverage these human-provided textual descriptions as privileged information when training a few-shot video classification model. Specifically, we formulate a text-based task conditioner to adapt video features to the few-shot learning task. Furthermore, our model follows a transductive setting to improve its task-adaptation ability by using the support textual descriptions and query instances to update a set of class prototypes. Our model achieves state-of-the-art performance on four challenging benchmarks commonly used to evaluate few-shot video action classification models.