Marcelo Mendoza

Specialty: Natural language processing, social network analysis, text information retrieval
Marcelo is an associate professor in the Department of Computer Science at the School of Engineering of the Pontificia Universidad Católica de Chile. He holds a PhD in computer science from the Universidad de Chile, and a master's degree and an engineering degree from the Universidad Técnica Federico Santa María.

PUBLICATIONS

Publisher: arXiv

ABSTRACT

Word embeddings are vital descriptors of words in unigram representations of documents for many tasks in natural language processing and information retrieval. Representing queries has been one of the most critical challenges in this area because queries consist of only a few terms and have little descriptive capacity. Strategies such as averaging word embeddings can enrich the descriptive capacity of queries, since the continuous vector representations that characterize these approaches favor the identification of related terms. We propose a data-driven strategy to combine word embeddings. We use IDF combinations of embeddings to represent queries, showing that these representations outperform the average word embeddings recently proposed in the literature. Experimental results on benchmark data show that our proposal performs well, suggesting that data-driven combinations of word embeddings are a promising line of research in ad-hoc information retrieval.
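
The abstract above does not give the exact combination formula, so the following is only a minimal sketch of one common reading: a query vector built as an IDF-weighted average of its terms' embeddings, compared with the plain average. The function names, the idf(t) = log(N / df(t)) definition, and the toy corpus and two-dimensional vectors are assumptions made for illustration, not the authors' implementation.

```python
import math


def idf_weights(docs):
    """Compute idf(t) = log(N / df(t)) over a tokenized corpus (assumed definition)."""
    n_docs = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


def combine(query, embeddings, weights=None):
    """Weighted average of the query terms' vectors.

    With weights=None every term counts equally (plain average).
    Terms without an embedding are skipped; returns None if none remain.
    """
    terms = [t for t in query if t in embeddings]
    if not terms:
        return None
    dim = len(embeddings[terms[0]])
    w = [1.0 if weights is None else weights.get(t, 0.0) for t in terms]
    total = sum(w)
    if total == 0:
        # Every term has zero weight: fall back to the plain average.
        w = [1.0] * len(terms)
        total = float(len(terms))
    return [
        sum(wi * embeddings[t][k] for wi, t in zip(w, terms)) / total
        for k in range(dim)
    ]


# Hypothetical toy corpus and embeddings, for illustration only.
docs = [["the", "cat"], ["the", "dog"], ["the", "bird"], ["cat", "food"]]
embeddings = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}

idf = idf_weights(docs)
avg_vec = combine(["the", "cat"], embeddings)
idf_vec = combine(["the", "cat"], embeddings, weights=idf)
```

In this toy setting the frequent term "the" receives a lower IDF than "cat", so the IDF-weighted query vector moves toward the embedding of the rarer, more descriptive term, whereas the plain average treats both terms equally.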


Publisher:

ABSTRACT

Medical images are an essential input for the timely diagnosis of pathologies. Despite their wide use in the area, searching for images that can reveal valuable information to support decision-making is difficult and expensive. However, making large repositories of images available for search by content opens up possibilities that are still largely unexplored. We designed a content-based image retrieval system for medical imaging, which reduces the gap between access to information and the availability of useful repositories to meet these needs. The system operates on the principle of query-by-example, in which users provide medical images, and the system displays a set of related images. Unlike searches driven by metadata matching, our system searches by content. This allows the system to conduct searches on repositories of medical images that do not necessarily have complete and curated metadata. We explore our system's feasibility on computed tomography (CT) slices for SARS-CoV-2 infection (COVID-19), showing that our proposal obtains promising results and compares favorably with other search methods.
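
The query-by-example principle described above can be sketched as a nearest-neighbor search over image feature vectors. The abstract does not name the feature extractor or the similarity measure, so this sketch assumes the features are already computed and uses cosine similarity; the function names, the image identifiers, and the three-dimensional vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def query_by_example(query_vec, index, k=3):
    """Return the k indexed images most similar to the query, best first."""
    scored = [(cosine(query_vec, vec), image_id) for image_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(image_id, score) for score, image_id in scored[:k]]


# Hypothetical precomputed feature vectors for three CT slices.
index = {
    "ct_001": [0.9, 0.1, 0.0],
    "ct_002": [0.1, 0.9, 0.0],
    "ct_003": [0.8, 0.2, 0.1],
}

# The query is itself an image, represented here by its (assumed) feature vector.
results = query_by_example([1.0, 0.0, 0.0], index, k=2)
```

Because the ranking depends only on the feature vectors, a search of this kind works on repositories whose metadata is missing or incomplete, which is the property the abstract emphasizes.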


Publisher: Revista Bits de Ciencia

ABSTRACT

Bots have a harmful effect on the spread of misleading or biased information on social networks [1]. Their goal is to amplify the reach of campaigns by artificially turning messages into trends. To that end, the accounts that support campaigns get themselves followed by accounts run by algorithms. Many of the accounts that follow high-profile public figures are bots, which support those figures' messages with likes and retweets. When these messages show an unusual level of reactions, they become trends, which increases their visibility even further. Once they become trends, their influence on the network grows, producing a snowball effect.

 

Publisher: arXiv

ABSTRACT

Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level representations. In this work, we propose to use ideas from predictive coding theory to augment BERT-style language models with a mechanism that allows them to learn suitable discourse-level representations. As a result, our proposed approach is able to predict future sentences using explicit top-down connections that operate at the intermediate layers of the network. By experimenting with benchmarks designed to evaluate discourse-related knowledge using pre-trained sentence representations, we demonstrate that our approach improves performance in 6 out of 11 tasks by excelling in discourse relationship detection.


Publisher: Diagnostics

ABSTRACT

Medical imaging is essential nowadays throughout medical education, research, and care. Accordingly, international efforts have been made to set large-scale image repositories for these purposes. Yet, to date, browsing of large-scale medical image repositories has been troublesome, time-consuming, and generally limited by text search engines. A paradigm shift, by means of a query-by-example search engine, would alleviate these constraints and beneficially impact several practical demands throughout the medical field. The current project aims to address this gap in medical imaging consumption by developing a content-based image retrieval (CBIR) system, which combines two image processing architectures based on deep learning. Furthermore, a first-of-its-kind intelligent visual browser was designed that interactively displays a set of imaging examinations with similar visual content on a similarity map, making it possible to search for and efficiently navigate through a large-scale medical imaging repository, even if it has been set with incomplete and curated metadata. Users may, likewise, provide text keywords, in which case the system performs a content- and metadata-based search. The system was fashioned with an anonymizer service and designed to be fully interoperable according to international standards, to stimulate its integration within electronic healthcare systems and its adoption for medical education, research and care. Professionals of the healthcare sector, by means of a self-administered questionnaire, underscored that this CBIR system and intelligent interactive visual browser would be highly useful for these purposes. Further studies are warranted to complete a comprehensive assessment of the performance of the system through case description and protocolized evaluations by medical imaging specialists.


Publisher: Scientific Reports

ABSTRACT

The rise of bots that mimic human behavior represents one of the most pressing threats to healthy information environments on social media. Many bots are designed to increase the visibility of low-quality content, spread misinformation, and artificially boost the reach of brands and politicians. These bots can also disrupt civic action coordination, such as by flooding a hashtag with spam and undermining political mobilization. Social media platforms have recognized these malicious bots’ risks and implemented strict policies and protocols to block automated accounts. However, effective bot detection methods for Spanish are still in their early stages. Many studies and tools used for Spanish are based on English-language models and lack performance evaluations in Spanish. In response to this need, we have developed a method for detecting bots in Spanish called Botcheck. Botcheck was trained on a collection of Spanish-language accounts annotated in Twibot-20, a large-scale dataset featuring thousands of accounts annotated by humans in various languages. We evaluated Botcheck’s performance on a large set of labeled accounts and found that it outperforms other competitive methods, including deep learning-based methods. As a case study, we used Botcheck to analyze the 2021 Chilean Presidential elections and discovered evidence of bot account intervention during the electoral term. In addition, we conducted an external validation of the accounts detected by Botcheck in the case study and found our method to be highly effective. We have also observed differences in behavior among the bots that are following the social media accounts of official presidential candidates.

Publisher: Publications

ABSTRACT

The evaluation of research proposals and academic careers is subject to indicators of scientific productivity. Citations are critical signs of impact for researchers, and many indicators are based on these data. The literature shows that there are differences in citation patterns between areas. The scope and depth that these differences may have motivate extending these studies to consider types of articles and age groups of researchers. In this work, we conducted an exploratory study to elucidate what evidence there is for the existence of these differences in citation patterns. To perform this study, we collected historical data from Scopus. Analyzing these data, we evaluated whether there are measurable differences in citation patterns. This study shows that there are evident differences in citation patterns between areas, types of publications, and age groups of researchers that may be relevant when carrying out researchers' academic evaluation.


Publisher:

ABSTRACT

Social networks are used every day to report daily events, although the information published on them often corresponds to fake news. Detecting fake news has become a research topic that can be approached using deep learning. However, most of the current research on the topic is available only for the English language. When working on fake news detection in other languages, such as Spanish, one of the barriers is the small number of labeled datasets available in Spanish. Hence, we explore whether it is worthwhile to translate an English dataset into Spanish using Statistical Machine Translation. We use the translated dataset to evaluate the accuracy of several deep learning architectures and compare the results from the translated dataset and the original dataset in fake news classification. Our results suggest that the approach is feasible, although it requires high-quality translation techniques, such as those provided by neural translation models.


Publisher: arXiv

ABSTRACT

The field of natural language understanding has experienced exponential progress in the last few years, with impressive results in several tasks. This success has motivated researchers to study the underlying knowledge encoded by these models. Despite this, attempts to understand their semantic capabilities have not been successful, often leading to inconclusive or contradictory conclusions among different works. Via a probing classifier, we extract the underlying knowledge graph of nine of the most influential language models of recent years, including word embeddings, text generators, and context encoders. This probe is based on concept relatedness, grounded on WordNet. Our results reveal that all the models encode this knowledge, but suffer from several inaccuracies. Furthermore, we show that the different architectures and training strategies lead to different model biases. We conduct a systematic evaluation to discover specific factors that explain why some concepts are challenging. We hope our insights will motivate the development of models that capture concepts more precisely.

