My research focuses on AI for Audio, Speech, and Natural Language Processing, including multi-modal large language models, foundation models for audio and speech AI, audio-visual multi-modal AI, speech AI systems for health applications, and secure and trustworthy speech AI.
My recent research is summarized in my talk at the 2024 MIT Imagination In Action Summit.
Bio
I am a Member of Technical Staff at xAI. I am a core contributor to Grok Advanced Voice Mode, the first author of LTU (the first audio large language model), and the first author of AST (a widely used audio classifier).
Before joining xAI, I was a Research Scientist at the MIT CSAIL Spoken Language Systems Group (SLS), working with Dr. James Glass. Prior to that, I earned my Ph.D. in Computer Science from the University of Notre Dame, where I was supervised by Dr. Christian Poellabauer. In the summer of 2019, I was an Applied Scientist Intern working on clinical text mining with the AWS Comprehend Medical team, supervised by Mohammed Khalilia and Parminder Bhatia. Before coming to Notre Dame, I received my B.Sc. degree in Electrical Engineering (Biomedical Engineering major) from Fudan University in 2015. My research advisors were Dr. Yuanyuan Wang (ultrasound image denoising) and Dr. Yuedong Xu (network science).
Education
2020.7 Ph.D., Computer Science and Engineering, University of Notre Dame, IN, USA (GPA: 4.0/4.0)
2015.7 B.Sc., Electrical Engineering (Biomedical Engineering major), Fudan University, Shanghai, China
Experience
2024.8 - Present Member of Technical Staff, xAI Corp., Palo Alto, USA
2023.8 - 2024.8 Research Scientist II, Massachusetts Institute of Technology, Cambridge, USA
2020.8 - 2023.7 Postdoc Research Associate, Massachusetts Institute of Technology, Cambridge, USA
2015.8 - 2020.7 Graduate Research Assistant, University of Notre Dame, Notre Dame, USA
2019.5 - 2019.8 Applied Scientist Intern, Amazon Web Services, Seattle, USA
2014.6 - 2015.7 Undergraduate Research Assistant, Fudan University, Shanghai, China
2012.7 - 2012.8 Intern, Philips Healthcare, Shanghai, China
Awards
ASRU 2023 Best Paper Finalist (top 3% paper, 12/435)
ICASSP 2023 Outstanding Reviewer
INTERSPEECH 2019 Best Student Paper Award Nomination
Depression Detection Challenge Winner, the 7th ACM Multimedia Audio/Visual Emotion Challenge and Workshop (AVEC 2017)
IJCAI, ISCA, ICHI, and NSF Travel Grants
Outstanding Graduate of Fudan University (2015), Fudan First Prize Scholarship (Top 3%, 2014), Outstanding Student of Dept. of Information Technology (2013), Outstanding Student of Fudan University (2012)
Publications
2025
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment Edson Araujo, Andrew Rouditchenko, Yuan Gong, Saurabhchand Bhati, Samuel Thomas, Brian Kingsbury, Leonid Karlinsky, Rogerio Feris, James R. Glass, Hilde Kuehne
Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, USA, June 2025 (CVPR 2025, accepted, to appear)
UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation Alexander H. Liu, Sang-gil Lee, Chao-Han Huck Yang, Yuan Gong, Yu-Chiang Frank Wang, James R. Glass, Rafael Valle, Bryan Catanzaro
Proceedings of the 13th International Conference on Learning Representations, Singapore, May 2025 (ICLR 2025, accepted, to appear) Paper
AER-LLM: Ambiguity-aware Emotion Recognition Leveraging Large Language Models Xin Hong, Yuan Gong, Vidhyasaharan Sethu, Ting Dang
Proceedings of the 50th International Conference on Acoustics, Speech, & Signal Processing, Hyderabad, India, April 2025 (ICASSP 2025) Paper
Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction Yuanchao Li, Yuan Gong, Chao-Han Huck Yang, Peter Bell, Catherine Lai
Proceedings of the 50th International Conference on Acoustics, Speech, & Signal Processing, Hyderabad, India, April 2025 (ICASSP 2025) Paper
2024
Listen, Think, and Understand Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James Glass
Proceedings of the 12th International Conference on Learning Representations, Vienna, Austria, May 2024 (ICLR 2024) Paper | Interactive Demo | Code | 5-Min Video (Chrome only)
Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
Andrew Rouditchenko, Yuan Gong, Samuel Thomas, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, James Glass
Proceedings of the 25th Conference of the International Speech Communication Association, Kos Island, Greece, September 2024 (Interspeech 2024) Paper | Code
Automatic Prediction of Amyotrophic Lateral Sclerosis Progression using Longitudinal Speech Transformer
Liming Wang, Yuan Gong, Nauman Dawalatabad, Marco Vilela, Katerina Placek, Brian Tracey, Yishu Gong, Alan Premasiri, Fernando Vieira, James Glass
Proceedings of the 25th Conference of the International Speech Communication Association, Kos Island, Greece, September 2024 (Interspeech 2024) Paper
Natural Language Embedded Programs for Hybrid Language Symbolic Reasoning
Tianhua Zhang, Jiaxin Ge, Hongyin Luo, Yung-Sung Chuang, Mingye Gao, Yuan Gong, Xixin Wu, Yoon Kim, Helen Meng, and James Glass
Findings of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Mexico City, Mexico, June 2024 (Findings of NAACL 2024) Paper | Code | MIT News
DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners
Saurabhchand Bhati, Yuan Gong, Leonid Karlinsky, Hilde Kuehne, Rogerio Feris, and James Glass
Proceedings of the 2024 IEEE Spoken Language Technology Workshop, Macao, China, December 2024 (SLT 2024) Paper
Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition
Chao-Han Huck Yang, Taejin Park, Yuan Gong, Yuanchao Li, Zhehuai Chen, Yen-Ting Lin, Chen Chen, Yuchen Hu, Kunal Dhawan, Piotr Żelasko, Chao Zhang, Yun-Nung Chen, Yu Tsao, Jagadeesh Balam, Boris Ginsburg, Sabato Marco Siniscalchi, Eng Siong Chng, Peter Bell, Catherine Lai, Shinji Watanabe, Andreas Stolcke
Proceedings of the 2024 IEEE Spoken Language Technology Workshop, Macao, China, December 2024 (SLT 2024) Paper
2023
Joint Audio and Speech Understanding (top 3% paper, best paper finalist) Yuan Gong, Alexander H. Liu, Hongyin Luo, Leonid Karlinsky, and James Glass
Proceedings of the 2023 IEEE Automatic Speech Recognition and Understanding Workshop, Taipei, December 2023 (ASRU 2023) Paper | Interactive Demo | Code | Poster
SAIL: Search-Augmented Instruction Learning
Hongyin Luo, Yung-Sung Chuang, Yuan Gong, Tianhua Zhang, Yoon Kim, Xixin Wu, Helen Meng, and James Glass
Findings of the 2023 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP 2023), Singapore, December 2023 Paper | Code
Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers Yuan Gong, Sameer Khurana, Leonid Karlinsky, and James Glass
Proceedings of the 24th Conference of the International Speech Communication Association, Dublin, Ireland, August 2023 (Interspeech 2023) Paper | Code | Interactive Demo | Poster | Blog in Chinese | Code Walkthrough in Chinese
Contrastive Audio-Visual Masked Autoencoder (notable-top-25% paper) Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, and James Glass
Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, May 2023 (ICLR 2023) Paper | Code | Video | Slides | Poster | MIT News
Improving Computational Efficiency of Voice Anti-Spoofing Models
Jian Yang, Bryan Ning Xia, John Bailey, Yuan Gong, John Michael Templeton, and Christian Poellabauer
Proceedings of the 2023 IEEE 20th International Conference on Mobile Ad Hoc and Smart Systems (MASS 2023) Paper | Code
2022
UAVM: Towards Unifying Audio and Visual Models Yuan Gong, Alexander H. Liu, Andrew Rouditchenko, and James Glass
IEEE Signal Processing Letters, 2022 Paper | Code
Detecting Dementia from Long Neuropsychological Interviews
Nauman Dawalatabad, Yuan Gong, Sameer Khurana, Rhoda Au, and James Glass
Findings of the 2022 Conference on Empirical Methods in Natural Language Processing (Findings of EMNLP 2022), Abu Dhabi, December 2022 Paper
Vocalsound: A Dataset For Improving Human Vocal Sounds Recognition Yuan Gong, Jin Yu, and James Glass
Proceedings of the 47th International Conference on Acoustics, Speech, & Signal Processing (ICASSP 2022), Singapore, May 2022 Paper | Dataset & Code | Video | Slides
Transformer-Based Multi-Aspect Multi-Granularity Non-Native English Speaker Pronunciation Assessment Yuan Gong, Ziyi Chen, Iek-Heng Chu, Peng Chang, and James Glass
Proceedings of the 47th International Conference on Acoustics, Speech, & Signal Processing (ICASSP 2022), Singapore, May 2022 Paper | Code | Video | Slides | Blog in Chinese
SSAST: Self-Supervised Audio Spectrogram Transformer Yuan Gong, Cheng-I Jeff Lai, Yu-An Chung, and James Glass
Proceedings of the 36th AAAI Conference on Artificial Intelligence (AAAI 2022), Vancouver, Canada, February-March 2022 Paper | Code | Slides
2021
AST: Audio Spectrogram Transformer Yuan Gong, Yu-An Chung, and James Glass
Proceedings of the 22nd Conference of the International Speech Communication Association (Interspeech 2021), Brno, Czech Republic, August-September 2021 Paper | Code | Talk | Blog in Chinese
PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation Yuan Gong, Yu-An Chung, and James Glass
IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021 Paper | Code | Video | Slides | Blog in Chinese
2020
Detecting Replay Attacks Using Multi-Channel Audio: A Neural Network-Based Method Yuan Gong, Jian Yang, and Christian Poellabauer
IEEE Signal Processing Letters, 2020 Paper | Code
2019
Second-order Non-local Attention Networks for Person Re-identification
Bryan Xia, Yuan Gong, Yizhe Zhang, and Christian Poellabauer
Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV 2019), Seoul, Korea, October-November 2019 Paper | Blog
ReMASC: Realistic Replay Attack Corpus for Voice Controlled Systems (best student paper award nomination) Yuan Gong, Jian Yang, Jacob Huber, Mitchell MacKnight, Christian Poellabauer
Proceedings of the 20th Conference of the International Speech Communication Association (Interspeech 2019), Graz, Austria, September 2019 Paper | Dataset
Real-time Adversarial Attacks Yuan Gong, Boyang Li, Christian Poellabauer, and Yiyu Shi
Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019), Macao, China, August 2019. Paper | Code | Media
2018
Deep Obfuscation: Precise Masking of Sensitive Information to Protect Against Machine Learning Adversaries (Poster) Yuan Gong and Christian Poellabauer
Proceedings of the 2018 Annual Computer Security Applications Conference Poster Session, San Juan, Puerto Rico, December 2018.
Crafting Adversarial Examples For Speech Paralinguistics Applications Yuan Gong and Christian Poellabauer
Proceedings of the DYnamic and Novel Advances in Machine Learning and Intelligent Cyber Security (DYNAMICS) Workshop, San Juan, Puerto Rico, December 2018. Paper
Impact of Aliasing on Deep CNN-Based End-to-End Acoustic Models Yuan Gong, Kevin Shin, and Christian Poellabauer
Proceedings of the 19th Conference of the International Speech Communication Association (Interspeech 2018), Hyderabad, India, September 2018. Paper
Improving LIWC Using Soft Word Matching (Poster) Yuan Gong, Hasini Yatawatte, Christian Poellabauer, Sandra Schneider, and Susan Latham
Proceedings of the 9th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), Washington, DC, August-September 2018. Paper
Automatic Autism Spectrum Disorder Detection Using Everyday Vocalizations Captured by Smart Devices Yuan Gong and Christian Poellabauer
Proceedings of the 9th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB), Washington, DC, August-September 2018. Paper
Protecting Voice Controlled Systems Using Sound Source Identification Based on Acoustic Cues Yuan Gong and Christian Poellabauer
Proceedings of the 27th International Conference on Computer Communications and Networks (ICCCN), Hangzhou, China, July-August 2018. Paper
An Overview of Vulnerabilities of Voice Controlled Systems Yuan Gong and Christian Poellabauer
Proceedings of the 1st International Workshop on Security and Privacy for the Internet-of-Things (IoTSec), Orlando, FL, April 2018. Paper
2017
Topic Modeling Based Multi-modal Depression Detection (depression challenge winner) Yuan Gong and Christian Poellabauer
Proceedings of the 7th Audio/Visual Emotion Challenge and Workshop (AVEC) in conjunction with ACM Multimedia (ACM-MM), Mountain View, CA, October 2017. Paper
Continuous Assessment of Children's Emotional States using Acoustic Analysis Yuan Gong and Christian Poellabauer
Proceedings of the 5th IEEE International Conference on Healthcare Informatics (ICHI), Park City, UT, August 2017. Paper
2015
A Smart Low-Power-Consumption ECG Monitor Based on MSP430F5529 and CC2540 (winner of the 2014 TI national (China) biomedical device design contest) Yuan Gong, Jin Cao, Zehui Luo, Guohui Zhou
Chinese Journal of Medical Instrumentation 39.4, 2015 Paper
Preprint
CMKD: CNN/Transformer-Based Cross-Model Knowledge Distillation for Audio Classification Yuan Gong, Sameer Khurana, Andrew Rouditchenko, and James Glass Paper
Talks
From Audio Perception to Understanding: A Path Towards Audio AGI Amazon GenAI, 4/24/2024.
General Audio Processing: Perception, Understanding, and Generation MIT 6.8620/HST.728 Spoken Language Processing (Guest Lecture). 4/16/2024. [slides]
How We Evaluate Our Audio and Speech Large Language Model? ASRU SPARKS Workshop, 12/16/2023. [slides]
Recent Progress of MIT SLS's Research on Audio Classification and Understanding (two-hour tutorial talk) Amazon Alexa, 12/15/2023. [slides]
Contrastive Audio-Visual Masked Autoencoder IBM, 7/28/2023; Hong Kong University of Science and Technology, 10/11/2023.
Large Language Models that Listen Takeda, 5/30/2023; Signify, 7/21/2023.
Introduction of Audio Spectrogram Transformer - Architecture, Training, and Pre-training Mitsubishi Electric Research Laboratories, 6/8/2022; ByteDance, 6/14/2022; Adobe, 7/12/2022.
Introduction of Audio Spectrogram Transformer - Architecture, Training, and Pre-training AI Time. 5/26/2022. [video in Mandarin] [slides]
General Audio Processing MIT 6.345/HST.728 Spoken Language Processing (Guest Lecture). 4/19/2022.
Audio Spectrogram Transformer MIT Embodied Intelligence Seminar. 10/14/2021. [video]
Audio Spectrogram Transformer for Audio Scene Analysis ISCA SIGML Seminar. 6/16/2021. [video] [slides]
Win the cat and mouse game: ensuring the security of speech processing systems against real-world threats University of Notre Dame CSE60641 (Guest Lecture). 10/31/2019. [slides]
Speech Processing: Machine Learning Approaches, Novel Applications, and New Security Concerns University of Notre Dame CSE60641 (Guest Lecture, Sole Instructor). 9/20/2018. [slides]
Contact
Please feel free to reach out (yuangong@mit.edu) if you have any questions about my work. I do not use WeChat. As a research scientist, I am not involved in student/researcher recruitment at CSAIL; please contact a PI directly for such inquiries.