CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval

M. M. Abootorabi and E. Asgari. CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval. In Proceedings of the 47th European Conference on Information Retrieval (ECIR 2025), Lucca, Italy, April 2025. (accepted, to be published)

Paper Link     Code     Models     Newly Introduced Dataset

CLASP is a multilingual, multimodal representation model designed for audio-text information retrieval. It uses contrastive learning to bridge the gap between the language and speech domains and is trained on a diverse speech-text dataset, setting new benchmarks in retrieval metrics across multiple languages. By creating a shared embedding space for the speech and text modalities, CLASP bypasses the need for Automatic Speech Recognition (ASR) models in several downstream tasks.
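
To make the core idea concrete, here is a minimal sketch of the CLIP-style symmetric contrastive objective commonly used to align paired speech and text embeddings; the function and tensor names are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, text_emb, temperature=0.07):
    """InfoNCE over a batch of paired speech/text embeddings."""
    # L2-normalize so dot products are cosine similarities.
    speech_emb = F.normalize(speech_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(speech_i, text_j).
    logits = speech_emb @ text_emb.t() / temperature

    # Matched pairs lie on the diagonal of the batch.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric loss: speech-to-text and text-to-speech retrieval.
    loss_s2t = F.cross_entropy(logits, targets)
    loss_t2s = F.cross_entropy(logits.t(), targets)
    return (loss_s2t + loss_t2s) / 2
```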

Under the Supervision of Dr. Ehsaneddin Asgari.

March 2023 – May 2024


Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

M. M. Abootorabi, A. Zobeiri, M. Dehghani, M. Mohammadkhani, B. Mohammadi, O. Ghahroodi, M. S. Baghshah, and E. Asgari. Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025). (submitted)

Preprint Link     GitHub Repository

A comprehensive survey on Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, and loss functions in detail, and explore the diverse Multimodal RAG scenarios. In addition, we outline open challenges and future directions to guide research in this evolving field.
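
As an illustration of the retrieve-fuse-generate loop the survey covers, here is a hypothetical skeleton; the `retriever`, `fusion`, and `generator` interfaces are placeholders, not any specific system's API:

```python
def multimodal_rag(query, retriever, fusion, generator, k=5):
    # Retrieval: fetch the top-k pieces of cross-modal evidence
    # (text passages, images, audio clips) relevant to the query.
    evidence = retriever.search(query, top_k=k)

    # Fusion/augmentation: combine the query with the retrieved
    # multimodal context into a single conditioning input.
    augmented_input = fusion.combine(query, evidence)

    # Generation: condition the (multimodal) language model
    # on the augmented input to produce the final answer.
    return generator.generate(augmented_input)
```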

Under the Supervision of Dr. Ehsaneddin Asgari and Prof. Mahdieh Soleymani Baghshah.

July 2024 – Feb. 2025


Emotion Classification in Code‑Mixed Conversations and Multimodal Emotion Cause Pair Extraction Within Conversational Contexts

M. M. Abootorabi, N. Ghazizadeh, S. A. Dalili, A. Ghahramani Kure, M. Dehghani, and E. Asgari. AIMA at SemEval‑2024 task 10: History‑based emotion recognition in Hindi‑English code‑mixed conversations. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval‑2024), pages 1704–1710, Mexico City, Mexico, June 2024. Association for Computational Linguistics     Paper Link

A. Ghahramani Kure, M. Dehghani, M. M. Abootorabi, N. Ghazizadeh, S. A. Dalili, and E. Asgari. AIMA at SemEval‑2024 task 3: Simple yet powerful emotion cause pair analysis. In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval‑2024), pages 1698–1703, Mexico City, Mexico, June 2024. Association for Computational Linguistics     Paper Link

We published two papers in the Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024).

One paper proposed a novel approach for emotion recognition in code-mixed conversations. It uses pre-trained large models together with GRU networks to integrate both the previous and future context of the current utterance, as well as the sequential information of the conversation up to that point, to recognize each utterance's emotion.
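
A minimal sketch of this history-based idea, assuming per-utterance embeddings from a pre-trained encoder; the dimensions and module names are illustrative, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ContextEmotionClassifier(nn.Module):
    def __init__(self, emb_dim=768, hidden=256, num_emotions=8):
        super().__init__()
        # A bidirectional GRU lets each utterance's representation see
        # both the preceding history and the following context.
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True,
                          bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_emotions)

    def forward(self, utterance_embs):
        # utterance_embs: (batch, num_utterances, emb_dim), e.g. produced
        # by applying a pre-trained encoder to each utterance separately.
        context, _ = self.gru(utterance_embs)
        # One emotion prediction per utterance in the conversation.
        return self.classifier(context)
```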

The other paper focused on extracting exact emotion-cause pairs from conversations using textual, audio, and visual cues; its pipeline consists of embedding extraction, emotion classification, and cause analysis via question-answering (QA) techniques.
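
As a rough illustration of the QA-based cause analysis step, one could frame the emotional utterance as the question and the conversation as the context for an extractive QA model; the prompt format and model choice below are assumptions, not the paper's exact setup:

```python
from transformers import pipeline

# Any extractive QA checkpoint works here; this one is just an example.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

def find_cause(emotion_utterance, emotion, conversation_text):
    # Ask the QA model to point at the span that caused the emotion.
    question = (
        f"Why does the speaker express {emotion} "
        f"when saying: '{emotion_utterance}'?"
    )
    answer = qa(question=question, context=conversation_text)
    return answer["answer"], answer["score"]
```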

Under the Supervision of Dr. Ehsaneddin Asgari.

Aug. 2023 – March 2024


Developing Automated Medical Report Generation for Fundus Fluorescein Angiography Images (A Novel Approach in Ophthalmology Research)

Remote Research Assistant at the University of New South Wales (Summer internship)

This project introduces a model that utilizes the FFA-IR dataset, which comprises Fundus Fluorescein Angiography (FFA) images and their corresponding reports. We encourage the model to rely on medical information rather than optimizing only for language metrics when generating reports, and we also modified the reinforcement learning (RL) algorithm to improve performance. The model generates a report for each case based on the patient's images.
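
A hedged sketch of the reward-shaping idea: mix a language-quality score with a medical-content score before the RL update. Both scoring functions and the weighting below are illustrative placeholders, not the project's actual implementation:

```python
def language_metric(generated, reference):
    # Placeholder language-quality score: token-level F1 overlap
    # (stands in for a CIDEr/BLEU-style metric).
    gen, ref = generated.lower().split(), reference.lower().split()
    if not gen or not ref:
        return 0.0
    common = len(set(gen) & set(ref))
    if common == 0:
        return 0.0
    p, r = common / len(gen), common / len(ref)
    return 2 * p * r / (p + r)

def mixed_reward(generated, reference, medical_terms, alpha=0.5):
    # Language-quality term.
    lang_score = language_metric(generated, reference)

    # Medical-content term: recall of reference clinical keywords.
    gen_text = generated.lower()
    hits = sum(1 for term in medical_terms if term.lower() in gen_text)
    med_score = hits / max(len(medical_terms), 1)

    # The mixed reward is what the RL (e.g. self-critical) update optimizes.
    return alpha * lang_score + (1 - alpha) * med_score
```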

Under the Supervision of Dr. Imran Razzak and Dr. Usman Naseem.

June 2023 – Sep. 2023 (Summer internship)


Developing Models & Pipelines For Text Localization In English & Persian

This research involves developing a multimodal model that uses contrastive learning and other paradigms to encode text and audio into a shared embedding space, improving the efficiency of text localization in audio streams across various applications. Given a text query, the model finds the relevant parts of a long speech recording.
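
A minimal sketch of how such a shared embedding space enables localization, assuming hypothetical contrastively trained `audio_encoder` and `text_encoder` functions that map inputs to fixed-size vectors:

```python
import torch
import torch.nn.functional as F

def localize(audio_windows, query_text, audio_encoder, text_encoder, top_k=3):
    # audio_windows: list of waveform tensors for overlapping segments
    # sliced from the long recording.
    window_embs = F.normalize(
        torch.stack([audio_encoder(w) for w in audio_windows]), dim=-1
    )
    query_emb = F.normalize(text_encoder(query_text), dim=-1)

    # Cosine similarity between the text query and each audio window.
    scores = window_embs @ query_emb

    # Indices of the most relevant segments in the long recording.
    return scores.topk(min(top_k, len(audio_windows))).indices
```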

Under the Supervision of Dr. Ehsaneddin Asgari.

Jan. 2023 – March 2023