32nd MM 2024: Melbourne, VIC, Australia
- Jianfei Cai, Mohan S. Kankanhalli, Balakrishnan Prabhakaran, Susanne Boll, Ramanathan Subramanian, Liang Zheng, Vivek K. Singh, Pablo César, Lexing Xie, Dong Xu:
Proceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024 - 1 November 2024. ACM 2024, ISBN 979-8-4007-0686-8
Keynote Talks
- Pascale Fung:
From Assistants to Agents in the LLM Era. 1
- Benoit Huet:
Revolutionizing Lung Cancer Diagnostics with eyonis™ LCS: Cutting-edge AI/ML Technology-based SaMD for Enhanced Patient Care. 2-3
- Judy Kay:
Empowering People to Harness and Control their Multimodal Data in Scrutable User models. 4-5
- Jiebo Luo:
Large Multimodal Models as Social Multimedia Analysis Engines. 6-7
Oral Session 1: Large Language Models & Applications 1
- Haicheng Liao, Yongkang Li, Chengyue Wang, Yanchen Guan, Kahou Tam, Chunlin Tian, Li Li, Chengzhong Xu, Zhenning Li:
When, Where, and What? A Benchmark for Accident Anticipation and Localization with Large Language Models. 8-17
- Haonan Zheng, Xinyang Deng, Wen Jiang, Wenrui Li:
A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models. 18-27
- Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, Yu Cheng:
Not All Inputs Are Valid: Towards Open-Set Video Moment Retrieval using Language. 28-37
- Huishan Ji, Qingyi Si, Zheng Lin, Weiping Wang:
Towards Flexible Evaluation for Generative Visual Question Answering. 38-47
- Jiaqi Zhu, Shaofeng Cai, Fang Deng, Beng Chin Ooi, Junran Wu:
Do LLMs Understand Visual Anomalies? Uncovering LLM's Capabilities in Zero-shot Anomaly Detection. 48-57
- Yudong Li, Xianxu Hou, Dezhi Zheng, Linlin Shen, Zhe Zhao:
FLIP-80M: 80 Million Visual-Linguistic Pairs for Facial Language-Image Pre-Training. 58-67
Oral Session 2: Large Language Models & Applications 2
- Esmée Henrieke Anne de Haas, Lik-Hang Lee, Yiming Huang, Carlos Bermejo, Pan Hui, Zijun Lin:
Towards Trustworthy MetaShopping: Studying Manipulative Audiovisual Designs in Virtual-Physical Commercial Platforms. 68-77
- Weiqi Li, Shijie Zhao, Bin Chen, Xinhua Cheng, Junlin Li, Li Zhang, Jian Zhang:
ResVR: Joint Rescaling and Viewport Rendering of Omnidirectional Images. 78-87
- Yunqiang Pei, Kaiyue Zhang, Hongrong Yang, Yong Tao, Qihang Tang, Jialei Tang, Guoqing Wang, Zhitao Liu, Ning Xie, Peng Wang, Yang Yang, Hengtao Shen:
Improving Interaction Comfort in Authoring Task in AR-HRI through Dynamic Dual-Layer Interaction Adjustment. 88-97
- Yang Lu, Junxian Li, Zhitong Cui, Jiapeng Hu, Yanna Lin, Shijian Luo:
Designing Spatial Visualization and Interactions of Immersive Sankey Diagram in Virtual Reality. 98-107
- Zhang Wan, Sheng Tang, Jiawei Wei, Ruize Zhang, Juan Cao:
DragEntity: Trajectory Guided Video Generation using Entity and Positional Relationships. 108-116
- Kento Shigyo, Yifan Cao, Kentaro Takahira, Mingming Fan, Huamin Qu:
VR-Mediated Cognitive Defusion: A Comparative Study for Managing Negative Thoughts. 117-126
Oral Session 3: Novel Multimedia Applications 1
- Yinxuan Gui, Bin Zhu, Jingjing Chen, Chong Wah Ngo, Yu-Gang Jiang:
Navigating Weight Prediction with Diet Diary. 127-136
- Feiyu Chen, Cong Xu, Qi Jia, Yihua Wang, Yuhan Liu, Haotian Zhang, Endong Wang:
Egocentric Vehicle Dense Video Captioning. 137-146
- Jinyue Chen, Lingyu Kong, Haoran Wei, Chenglong Liu, Zheng Ge, Liang Zhao, Jianjian Sun, Chunrui Han, Xiangyu Zhang:
OneChart: Purify the Chart Structural Extraction via One Auxiliary Token. 147-155
- Jiawei Lin, Zhaoyun Jiang, Jiaqi Guo, Shizhao Sun, Ting Liu, Zijiang Yang, Jian-Guang Lou, Dongmei Zhang:
IconDM: Text-Guided Icon Set Expansion Using Diffusion Models. 156-165
- Haipeng Zhou, Hongqiu Wang, Tian Ye, Zhaohu Xing, Jun Ma, Ping Li, Qiong Wang, Lei Zhu:
Timeline and Boundary Guided Diffusion Network for Video Shadow Detection. 166-175
- Yichang Qu, Bing Li, Jie Huang, Feng Zhao:
Training Pansharpening Networks at Full Resolution Using Degenerate Invariance. 176-185
Oral Session 4: Graph and Diffusion Models
- Jielong Lu, Zhihao Wu, Zhaoliang Chen, Zhiling Cai, Shiping Wang:
Towards Multi-view Consistent Graph Diffusion. 186-195
- Liyuan Ma, Xueji Fang, Guo-Jun Qi:
Equilibrated Diffusion: Frequency-aware Textual Embedding for Equilibrated Image Customization. 196-204
- Weilun Feng, Chuanguang Yang, Zhulin An, Libo Huang, Boyu Diao, Fei Wang, Yongjun Xu:
Relational Diffusion Distillation for Efficient Image Generation. 205-213
- Hongjie Wu, Linchao He, Mingqin Zhang, Dongdong Chen, Kunming Luo, Mengting Luo, Jizhe Zhou, Hu Chen, Jiancheng Lv:
Diffusion Posterior Proximal Sampling for Image Restoration. 214-223
- Yiheng Huang, Hui Yang, Chuanchen Luo, Yuxi Wang, Shibiao Xu, Zhaoxiang Zhang, Man Zhang, Junran Peng:
StableMoFusion: Towards Robust and Efficient Diffusion-based Motion Generation Framework. 224-232
- Yichi Zhang, Zhuo Chen, Lingbing Guo, Yajing Xu, Wen Zhang, Huajun Chen:
Making Large Language Models Perform Better in Knowledge Graph Completion. 233-242
Oral Session 5: Multimodal Models and Applications
- Rishikesh Devanathan, Apoorva Singh, A. S. Poornash, Sriparna Saha:
Seeing Beyond Words: Multimodal Aspect-Level Complaint Detection in Ecommerce Videos. 243-252
- Hsiang-Hui Hung, Huu-Phu Do, Yung-Hui Li, Ching-Chun Huang:
TimeNeRF: Building Generalizable Neural Radiance Fields across Time from Few-Shot Input Views. 253-262
- Xiaoxuan Shen, Fenghua Yu, Yaqi Liu, Ruxia Liang, Qian Wan, Kai Yang, Jianwen Sun:
Revisiting Knowledge Tracing: A Simple and Powerful Model. 263-272
- Peiming Li, Ziyi Wang, Mengyuan Liu, Hong Liu, Chen Chen:
ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models. 273-281
- Bochao Liu, Pengju Wang, Weijia Guo, Yong Li, Liansheng Zhuang, Weiping Wang, Shiming Ge:
Private Gradient Estimation is Useful for Generative Modeling. 282-290
- Ke Zhu, Liang Zhao, Zheng Ge, Xiangyu Zhang:
Self-Supervised Visual Preference Alignment. 291-300
Oral Session 6: Innovations in Medical Imaging and Physiological Measurement
- Yuxin Hong, Xiao Zhang, Xin Zhang, Joey Tianyi Zhou:
Evolution-aware VAriance (EVA) Coreset Selection for Medical Image Classification. 301-310
- Ruiqi Wang, Jinyang Huang, Jie Zhang, Xin Liu, Xiang Zhang, Zhi Liu, Peng Zhao, Sigui Chen, Xiao Sun:
FacialPulse: An Efficient RNN-based Depression Detection via Temporal Facial Landmarks. 311-320
- Wei Zhang, En Zhu, Juan Chen, YunPeng Li:
MDDR: Multi-modal Dual-Attention aggregation for Depression Recognition. 321-329
- Wei Qian, Kun Li, Dan Guo, Bin Hu, Meng Wang:
Cluster-Phys: Facial Clues Clustering Towards Efficient Remote Physiological Measurement. 330-339
- Zhenxi Song, Ruihan Qin, Huixia Ren, Zhen Liang, Yi Guo, Min Zhang, Zhiguo Zhang:
EEG-MACS: Manifold Attention and Confidence Stratification for EEG-based Cross-Center Brain Disease Diagnosis under Unreliable Annotations. 340-349
- Xueyuan Xu, Li Zhuo, Jinxin Lu, Xia Wu:
WSEL: EEG Feature Selection with Weighted Self-expression Learning for Incomplete Multi-dimensional Emotion Recognition. 350-359
Oral Session 7: Imaging, Computer Vision & Graphics
- Yuanbo Wen, Tao Gao, Ting Chen:
Unpaired Photo-realistic Image Deraining with Energy-informed Diffusion Model. 360-369
- Zeyu Li, Ruitong Gan, Chuanchen Luo, Yuxi Wang, Jiaheng Liu, Ziwei Zhu, Qing Li, Xucheng Yin, Man Zhang, Zhaoxiang Zhang, Junran Peng:
MaterialSeg3D: Segmenting Dense Materials from 2D Priors for 3D Assets. 370-379
- Xiao Han, Yiming Ren, Peishan Cong, Yujing Sun, Jingya Wang, Lan Xu, Yuexin Ma:
Gait Recognition in Large-scale Free Environment via Single LiDAR. 380-389
- Tang Tao, Longfei Gao, Guangrun Wang, Yixing Lao, Peng Chen, Hengshuang Zhao, Dayang Hao, Xiaodan Liang, Mathieu Salzmann, Kaicheng Yu:
LiDAR-NeRF: Novel LiDAR View Synthesis via Neural Radiance Fields. 390-398
- Mu Chen, Zhedong Zheng, Yi Yang:
Transferring to Real-World Layouts: A Depth-aware Framework for Scene Adaptation. 399-408
- Yujian Mo, Yan Wu, Junqiao Zhao, Zhenjie Hou, Weiquan Huang, Yinghao Hu, Jijun Wang, Jun Yan:
Sparse Query Dense: Enhancing 3D Object Detection with Pseudo Points. 409-418
Oral Session 8: Multimodal Reasoning & Inference
- Changmeng Zheng, Dayong Liang, Wengyu Zhang, Xiaoyong Wei, Tat-Seng Chua, Qing Li:
A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning. 419-428
- Qian Guo, Xinyan Liang, Yuhua Qian, Zhihua Cui, Jie Wen:
A Progressive Skip Reasoning Fusion Method for Multi-Modal Classification. 429-437
- Wenxin Xu, Hexin Jiang, Xuefeng Liang:
Leveraging Knowledge of Modality Experts for Incomplete Multimodal Learning. 438-446
- Bo Xu, Junzhe Zheng, Jiayuan He, Yuxuan Sun, Hongfei Lin, Liang Zhao, Feng Xia:
Generating Multimodal Metaphorical Features for Meme Understanding. 447-455
- Junjie Shi, Caozhi Shang, Zhaobin Sun, Li Yu, Xin Yang, Zengqiang Yan:
PASSION: Towards Effective Incomplete Multi-Modal Medical Image Segmentation with Imbalanced Missing Rates. 456-465
- Mengze Li, Kairong Han, Jiahe Xu, Yueying Li, Tao Wu, Zhou Zhao, Jiaxu Miao, Shengyu Zhang, Jingyuan Chen:
Cross-modal Observation Hypothesis Inference. 466-475
Oral Session 9: Image, Video, and Multimedia Processing
- Jiyang Li, Lechao Cheng, Zhangye Wang, Tingting Mu, Jingxuan He:
LoopGaussian: Creating 3D Cinemagraph with Multi-view Images via Eulerian Motion Field. 476-485
- Chaofeng Chen, Sensen Yang, Haoning Wu, Liang Liao, Zicheng Zhang, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin:
Q-Ground: Image Quality Grounding with Large Multi-modality Models. 486-495
- Cheng Ye, Weidong Chen, Jingyu Li, Lei Zhang, Zhendong Mao:
Dual-path Collaborative Generation Network for Emotional Video Captioning. 496-505
- Hu Lin, Chengjiang Long, Yifeng Fei, Qianchen Xia, Erwei Yin, Baocai Yin, Xin Yang:
Exploring Matching Rates: From Keypoint Selection to Camera Relocalization. 506-514
- Zhihong Zhu, Xuxin Cheng, Zhaorun Chen, Yuyan Chen, Yunyan Zhang, Xian Wu, Yefeng Zheng, Bowen Xing:
InMu-Net: Advancing Multi-modal Intent Detection via Information Bottleneck and Multi-sensory Processing. 515-524
- Chaoya Jiang, Hongrui Jia, Mengfan Dong, Wei Ye, Haiyang Xu, Ming Yan, Ji Zhang, Shikun Zhang:
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models. 525-534
Oral Session 10: Speech and Audio in Multimedia Processing
- Zhongxu Wang, Yujia Wang, Mingzhu Li, Hua Huang:
ArtSpeech: Adaptive Text-to-Speech Synthesis with Articulatory Representations. 535-544
- Shuai Yu, Xiaoliang He, Ke Chen, Yi Yu:
HKDSME: Heterogeneous Knowledge Distillation for Semi-supervised Singing Melody Extraction Using Harmonic Supervision. 545-553
- Yixuan Zhou, Xiaoyu Qin, Zeyu Jin, Shuoyi Zhou, Shun Lei, Songtao Zhou, Zhiyong Wu, Jia Jia:
VoxInstruct: Expressive Human Instruction-to-Speech Generation with Unified Multilingual Codec Language Modelling. 554-563
- Navonil Majumder, Chia-Yu Hung, Deepanway Ghosal, Wei-Ning Hsu, Rada Mihalcea, Soujanya Poria:
Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization. 564-572
- Xihua Wang, Yuyue Wang, Yihan Wu, Ruihua Song, Xu Tan, Zehua Chen, Hongteng Xu, Guodong Sui:
TiVA: Time-Aligned Video-to-Audio Generation. 573-582
- Alejandro Galán-Cuenca, Jose J. Valero-Mas, Juan C. Martinez-Sevilla, Antonio Hidalgo-Centeno, Antonio Pertusa, Jorge Calvo-Zaragoza:
MUSCAT: A Multimodal mUSic Collection for Automatic Transcription of Real Recordings and Image Scores. 583-591
Oral Session 11: Emotion & Sentiment
- Jianing Zhao, Jingjing Wang, Yujie Jin, Jiamin Luo, Guodong Zhou:
Hawkeye: Discovering and Grounding Implicit Anomalous Sentiment in Recon-videos via Scene-enhanced Video Large Language Model. 592-601
- Daiqing Wu, Dongbao Yang, Yu Zhou, Can Ma:
Bridging Visual Affective Gap: Borrowing Textual Knowledge by Learning from Noisy Image-Text Pairs. 602-611
- Tan Yu, Jingjing Wang, Jiawen Wang, Jiamin Luo, Guodong Zhou:
Towards Emotion-enriched Text-to-Motion Generation via LLM-guided Limb-level Emotion Manipulating. 612-621
- Wenjie Zheng, Jianfei Yu, Rui Xia:
A Unimodal Valence-Arousal Driven Contrastive Learning Framework for Multimodal Multi-Label Emotion Recognition. 622-631
- Xinji Mai, Junxiong Lin, Haoran Wang, Zeng Tao, Yan Wang, Shaoqi Yan, Xuan Tong, Jiawen Yu, Boyang Wang, Ziheng Zhou, Qing Zhao, Shuyong Gao, Wenqiang Zhang:
All rivers run into the sea: Unified Modality Brain-Inspired Emotional Central Mechanism. 632-641
- Xin Li, Shangfei Wang, Xuandong Huang:
Temporal Enhancement for Video Affective Content Analysis. 642-650
Poster Session 1
- Pei He, Licheng Jiao, Lingling Li, Xu Liu, Fang Liu, Wenping Ma, Shuyuan Yang, Ronghua Shang:
Domain Generalization-Aware Uncertainty Introspective Learning for 3D Point Clouds Segmentation. 651-660
- Yi Ma, Peiqi Duan, Yuchen Hong, Chu Zhou, Yu Zhang, Jimmy S. J. Ren, Boxin Shi:
Color4E: Event Demosaicing for Full-color Event Guided Image Deblurring. 661-670
- Jiajie Zhu, Xia Du, Jizhe Zhou, Chi-Man Pun, Qizhen Xu, Xiaoyuan Liu:
DP-RAE: A Dual-Phase Merging Reversible Adversarial Example for Image Privacy Protection. 671-680
- Xinyi Zhang, Qinpeng Cui, Qiqi Bao, Wenming Yang, Qingmin Liao:
Geometry-Guided Diffusion Model with Masked Transformer for Robust Multi-View 3D Human Pose Estimation. 681-690
- Meiqi Cao, Rui Yan, Xiangbo Shu, Guangzhao Dai, Yazhou Yao, Guo-Sen Xie:
AdaFPP: Adapt-Focused Bi-Propagating Prototype Learning for Panoramic Activity Recognition. 691-700
- Junsheng Wang, Tiantian Gong, Yan Yan:
Partially Aligned Cross-modal Retrieval via Optimal Transport-based Prototype Alignment Learning. 701-709
- Hu Gao, Bowen Ma, Ying Zhang, Jingfan Yang, Jing Yang, Depeng Dang:
Learning Enriched Features via Selective State Spaces Model for Efficient Image Deblurring. 710-718
- Hangjun Che, Xinyu Pu, Deqiang Ouyang, Beibei Li:
Enhanced Tensorial Self-representation Subspace Learning for Incomplete Multi-view Clustering. 719-728
- Jian-Jun Qiao, Meng-Yu Duan, Xiao Wu, Yu-Pei Song:
CartoonNet: Cartoon Parsing with Semantic Consistency and Structure Correlation. 729-737
- Qianyu Guo, Jieji Ren, Haofen Wang, Tianxing Wu, Weifeng Ge, Wenqiang Zhang:
Visual-Language Collaborative Representation Network for Broad-Domain Few-Shot Image Classification. 738-747
- Wenzhuo Xu, Kai Chen, Ziyi Gao, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang:
Highly Transferable Diffusion-based Unrestricted Adversarial Attack on Pre-trained Vision-Language Models. 748-757
- Hongzhi Wang, Xiubo Liang, Tao Zhang, Yue Gu, Weidong Geng:
PSSD-Transformer: Powerful Sparse Spike-Driven Transformer for Image Semantic Segmentation. 758-767
- Zengsheng Kuang, Changxing Ding, Huan Yao:
Learning Context with Priors for 3D Interacting Hand-Object Pose Estimation. 768-777
- Yang Chen, Jingcai Guo, Tian He, Xiaocheng Lu, Ling Wang:
Fine-Grained Side Information Guided Dual-Prompts for Zero-Shot Skeleton Action Recognition. 778-786
- Shuo Zhang, Yupeng Zhai, Jilin Mei, Yu Hu:
FusionOcc: Multi-Modal Fusion for 3D Occupancy Prediction. 787-796
- Shaokun Wang, Yifan Yu, Yuhang He, Yihong Gong:
Enhancing Pre-trained ViTs for Downstream Task Adaptation: A Locality-Aware Prompt Learning Method. 797-806
- Fangming Cui, Xun Yang, Chao Wu, Liang Xiao, Xinmei Tian:
Advancing Prompt Learning through an External Layer. 807-816
- Hanzi Wang, Jiamin Ren, Yifeng Ding, Lei Ren, Huixing Jiang, Wei Chen, Fangxiang Feng, Xiaojie Wang:
Q-MoE: Connector for MLLMs with Text-Driven Routing. 817-825
- Guozhen Peng, Yunhong Wang, Yuwei Zhao, Shaoxiong Zhang, Annan Li:
GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild. 826-835
- Qiang Wang, Yuning Cui, Yawen Li, Yaping Ruan, Ben Zhu, Wenqi Ren:
RFFNet: Towards Robust and Flexible Fusion for Low-Light Image Denoising. 836-845
- Minghe Gao, Shuang Chen, Liang Pang, Yuan Yao, Jisheng Dang, Wenqiao Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang, Tat-Seng Chua:
Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales. 846-855
- Yue Zhang, Parisa Kordjamshidi:
Narrowing the Gap between Vision and Action in Navigation. 856-865
- Zequn Zeng, Jianqiao Sun, Hao Zhang, Tiansheng Wen, Yudi Su, Yan Xie, Zhengjue Wang, Bo Chen:
HICEScore: A Hierarchical Metric for Image Captioning Evaluation. 866-875
- Chen Feng, Georgios Tzimiropoulos, Ioannis Patras:
CLIPCleaner: Cleaning Noisy Labels with CLIP. 876-885
- Haochen Zhao, Hui Meng, Deqian Yang, Xiaozheng Xie, Xiaoze Wu, Qingfeng Li, Jianwei Niu:
GuidedNet: Semi-Supervised Multi-Organ Segmentation via Labeled Data Guide Unlabeled Data. 886-895
- Kin-Chung Chan, Jun Xiao, Hana Lebeta Goshu, Kin-Man Lam:
Point Cloud Densification for 3D Gaussian Splatting from Sparse Input Views. 896-904
- Xiaorui Huang, Gen Luo, Chaoyang Zhu, Bo Tong, Yiyi Zhou, Xiaoshuai Sun, Rongrong Ji:
Deep Instruction Tuning for Segment Anything Model. 905-914
- Ziyi Wang, Yiming Rong, Deyang Jiang, Haoran Wu, Shiyu Zhou, Bo Xu:
CIEASR: Contextual Image-Enhanced Automatic Speech Recognition for Improved Homophone Discrimination. 915-924
- Jinxu Zhang, Yongqi Yu, Yu Zhang:
CREAM: Coarse-to-Fine Retrieval and Multi-modal Efficient Tuning for Document VQA. 925-934
- Hebaixu Wang, Hao Zhang, Xunpeng Yi, Xinyu Xiang, Leyuan Fang, Jiayi Ma:
TeRF: Text-driven and Region-aware Flexible Visible and Infrared Image Fusion. 935-944
- Ruonan Zhang, Ziwei Shang, Fengjuan Wang, Zhaoqilin Yang, Shan Cao, Yigang Cen, Gaoyun An:
Synergetic Prototype Learning Network for Unbiased Scene Graph Generation. 945-954
- Jiawei Zhu, Yishu Liu, Huanjia Zhu, Hui Lin, Yuncheng Jiang, Zheng Zhang, Bingzhi Chen:
Combating Visual Question Answering Hallucinations via Robust Multi-Space Co-Debias Learning. 955-964
- Qian Cao, Xu Chen, Ruihua Song, Xiting Wang, Xinting Huang, Yuchen Ren:
See or Guess: Counterfactually Regularized Image Captioning. 965-974