Name:
Yuexi Zhang
Title:
Human Action and Event Detection by Leveraging Multi-modality Techniques
Date:
8/19/2024
Time:
10:00:00 AM
Committee Members:
Prof. Octavia Camps (Advisor)
Prof. Mario Sznaier
Prof. Sarah Ostadabbas
Abstract:
Human action and event analysis with multiple modalities has emerged as a critical area of research in computer vision and machine learning, driven by the need to understand complex human behaviors in diverse environments.
A significant advantage of multi-modal analysis is its application to cross-view action recognition, where activities are observed from different viewpoints. To tackle this problem, we propose a flexible framework that integrates diverse modalities (RGB pixels, 2D/3D key points, etc.) to overcome the limitations of single-modal approaches. It consists of two branches: a Dynamic Invariant Representation (DIR) branch identifies view-invariant properties from key-point trajectories, while a Context Invariant Representation (CIR) branch captures pixel-level view-invariant features. In addition, our approach leverages contrastive learning to improve recognition accuracy, enabling the model to learn more discriminative, view-invariant features by contrasting positive pairs against negative pairs. The fusion of multi-modal data, coupled with contrastive learning, leads to improved accuracy in recognizing actions across diverse views and environments. Extensive experiments demonstrate the effectiveness of our approach across these modalities.

Another promising application of multi-modal techniques is zero-shot action detection, which aims to recognize actions the model has not been explicitly trained on. With the rapid development of large language models (LLMs), leveraging them in this context has shown significant potential, as these models can bridge the gap between seen and unseen actions by understanding and generalizing from textual descriptions. To explore this problem further, we propose a transformer encoder-decoder architecture with global and local text prompts, allowing the model to infer the characteristics of unseen actions from different textual attributes. We evaluate our approach on multiple benchmarks to demonstrate its advantages.
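
To make the contrastive component concrete, the following is a minimal sketch of a cross-view, InfoNCE-style objective, assuming features of the same action captured from two viewpoints form positive pairs and the other samples in the batch serve as negatives. The function name, temperature value, and the encoder/fusion calls in the comments are hypothetical, not the thesis implementation.

```python
# Minimal sketch of a cross-view contrastive objective (illustrative only).
import torch
import torch.nn.functional as F

def cross_view_infonce(feat_view_a: torch.Tensor,
                       feat_view_b: torch.Tensor,
                       temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: the same action seen from two views is a positive
    pair; all other samples in the batch act as negatives."""
    a = F.normalize(feat_view_a, dim=-1)            # (B, D) features from view A
    b = F.normalize(feat_view_b, dim=-1)            # (B, D) features from view B
    logits = a @ b.t() / temperature                # (B, B) pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric loss: match A -> B and B -> A.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical usage, assuming branch outputs are fused before the loss:
# dir_feat = dir_branch(keypoint_trajectories)     # dynamic (key-point) branch
# cir_feat = cir_branch(rgb_frames)                # context (pixel-level) branch
# fused = torch.cat([dir_feat, cir_feat], dim=-1)
# loss = cross_view_infonce(fused_view1, fused_view2)
```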
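
Similarly, the sketch below illustrates how zero-shot class scores could combine global and local text prompts, assuming a CLIP-style joint video-text embedding space in which each class has one global prompt embedding and several local attribute-prompt embeddings; the actual transformer encoder-decoder and prompt design in the thesis may differ.

```python
# Illustrative sketch of zero-shot scoring with global and local text prompts.
import torch
import torch.nn.functional as F

def zero_shot_scores(video_feat: torch.Tensor,
                     global_text_feat: torch.Tensor,
                     local_text_feat: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Score each (possibly unseen) action class by mixing similarity to a
    global class-level prompt with averaged local attribute prompts."""
    v = F.normalize(video_feat, dim=-1)                    # (B, D) video queries
    g = F.normalize(global_text_feat, dim=-1)              # (C, D) one global prompt per class
    l = F.normalize(local_text_feat.mean(dim=1), dim=-1)   # (C, K, D) -> (C, D) averaged attributes
    # Weighted combination of global and local similarities -> (B, C) class scores.
    return alpha * (v @ g.t()) + (1.0 - alpha) * (v @ l.t())
```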