Yuexi Zhang PhD Dissertation Defense

August 19, 2024 @ 10:00 am - 11:00 am

Name:
Yuexi Zhang

Title:
Human Action and Event Detection by Leveraging Multi-modality Techniques

Date:
8/19/2024

Time:
10:00 AM

Committee Members:
Prof. Octavia Camps (Advisor)

Prof. Mario Sznaier

Prof. Sarah Ostadabbas

Abstract:
Human action and event analysis using multiple modalities has emerged as a critical area of research in computer vision and machine learning, driven by the need to understand complex human behaviors in diverse environments.

A significant advantage of multi-modal analysis is its application to cross-view action recognition, where activities are observed from different viewpoints. To tackle this problem, we propose a flexible framework that integrates diverse modalities (RGB pixels, 2D/3D keypoints, etc.) to overcome the limitations of single-modal approaches. It consists of two branches: a Dynamic Invariant Representation (DIR) branch that identifies view-invariant properties from keypoint trajectories, and a Context Invariant Representation (CIR) branch that captures pixel-level view-invariant features. In addition, our approach leverages contrastive learning to improve recognition accuracy, enabling the model to learn more discriminative, view-invariant features by contrasting positive pairs against negative pairs. The fusion of multi-modal data, coupled with contrastive learning, leads to improved accuracy in recognizing actions across various views and environments. Extensive experiments demonstrate the effectiveness of our approach across diverse modalities.

Furthermore, another promising application of multi-modal techniques is zero-shot action detection, which aims to recognize actions the model has not been explicitly trained on. With the rapid development of large language models (LLMs), leveraging them in this context has shown significant potential, as these models can bridge the gap between seen and unseen actions by understanding and generalizing from textual descriptions. To explore this problem further, we propose a transformer encoder-decoder architecture with global and local text prompts, allowing the model to infer the characteristics of unseen actions from different textual attributes. We evaluate our approach on multiple benchmarks to demonstrate its advantages.
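To make the cross-view idea concrete, below is a minimal sketch of a two-branch model trained with an InfoNCE-style contrastive loss, where embeddings of the same action seen from different views are pulled together. All module names, feature dimensions, and the exact loss form are illustrative assumptions, not the dissertation's actual implementation.

```python
# Sketch only: a toy two-branch model with a contrastive objective.
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """InfoNCE-style loss: (z_a[i], z_b[i]) are embeddings of the same
    action from two views (positives); all other pairs are negatives."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature              # (B, B) similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)

class TwoBranchModel(torch.nn.Module):
    """Hypothetical stand-ins for the DIR/CIR branches: DIR consumes
    keypoint trajectories, CIR consumes pooled pixel features, and both
    map into a shared embedding space."""
    def __init__(self, kp_dim=34, px_dim=2048, emb_dim=128):
        super().__init__()
        self.dir_branch = torch.nn.Sequential(        # keypoints -> embedding
            torch.nn.Linear(kp_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, emb_dim))
        self.cir_branch = torch.nn.Sequential(        # pixels -> embedding
            torch.nn.Linear(px_dim, 256), torch.nn.ReLU(),
            torch.nn.Linear(256, emb_dim))

    def forward(self, keypoints, pixel_feats):
        return self.dir_branch(keypoints), self.cir_branch(pixel_feats)

# Toy usage: a batch of 8 actions, keypoints from view 1, pixels from view 2.
model = TwoBranchModel()
kp_v1, px_v2 = torch.randn(8, 34), torch.randn(8, 2048)
z_dir, z_cir = model(kp_v1, px_v2)
loss = info_nce(z_dir, z_cir)   # pull same-action cross-view pairs together
loss.backward()
```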

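The zero-shot part can likewise be sketched as prompt-based matching: class names are turned into global and local (attribute-level) text prompts, and a video embedding is scored against each unseen class's prompt embeddings. The prompt templates, averaging scheme, and dimensions below are assumptions for illustration; the dissertation's transformer encoder-decoder architecture is not reproduced here.

```python
# Sketch only: toy zero-shot scoring with global and local text prompts.
import torch
import torch.nn.functional as F

def build_prompts(action):
    """Hypothetical templates: one whole-action (global) prompt plus
    attribute-level (local) prompts for a given action name."""
    global_prompt = f"a video of a person {action}"
    local_prompts = [f"the person's arms while {action}",
                     f"the person's legs while {action}"]
    return [global_prompt] + local_prompts

def zero_shot_scores(video_emb, class_text_embs):
    """Cosine similarity between one video embedding (D,) and per-class
    prompt embeddings (C, P, D), averaged over the P prompts per class."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    return (t @ v).mean(dim=1)              # (C,): one score per unseen class

# Toy usage with random stand-ins for encoder outputs.
classes = ["climbing", "juggling"]
prompts = build_prompts(classes[0])         # would feed a text encoder in practice
D = 128
video_emb = torch.randn(D)
text_embs = torch.randn(len(classes), 3, D) # 1 global + 2 local prompts per class
pred = classes[zero_shot_scores(video_emb, text_embs).argmax()]
print(pred)  # the unseen class whose prompts best match the video
```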
Details

Department:
Electrical and Computer Engineering

Topics:
MS/PhD Thesis Defense

Audience:
PhD