I am a first-year PhD candidate at the Centre for Language Studies (CLS), Radboud University, supervised by M.A. Larson (Martha) and co-supervised by C. Tejedor Garcia (Cristian) and L.F.M. ten Bosch (Louis). Before that, I was a visiting researcher at the Audio Information Research (AIR) Lab of the University of Rochester, and earlier a senior researcher at the AI-Generated Content (AIGC) Center of Tencent AI Lab, doing applied research in speech synthesis.

My research interests include, but are not limited to, speech-based diagnostics for neurodegenerative diseases, Responsible AI, and spoken dialogue systems.


🔥 News

  • 2024.07: Became a PhD candidate at the Centre for Language Studies, Radboud University
  • 2024.05: One paper accepted at INTERSPEECH 2024
  • 2023.12: One paper accepted at ICASSP 2024 (Oral)
  • 2023.05: One paper accepted at INTERSPEECH 2023

📖 Education

  • 2024 - present, PhD Candidate, Centre for Language Studies, Radboud University
  • 2016 - 2018, M.Sc. in Computer Science, Rutgers University–New Brunswick, NJ, USA
  • 2011 - 2015, B.Sc. in Information and Computing Science, Beijing Jiaotong University, Beijing, China

💻 Professional Experience

  • 2024.7 - present, PhD Candidate, Centre for Language Studies, Radboud University
  • 2023.7 - 2023.10, Visiting Researcher, Audio Information Research (AIR) Lab, University of Rochester
  • 2021 - 2023.7, Senior Researcher, AI-Generated Content (AIGC) Center, Tencent AI Lab
  • 2018 - 2021, Machine Learning Engineer, 17 Education & Tech Group Inc. (IPO on NASDAQ)
  • 2017 - 2018, Machine Learning Engineer Intern, Learnable.ai

📄 Professional Service

Review Committee: ICASSP 2024, ISMIR 2023-2024

📝 Publications

INTERSPEECH 2023
EE-TTS: Emphatic Expressive TTS with Linguistic Information

Yi Zhong, Chen Zhang, Xule Liu, Chenxi Sun, Weishan Deng, Haifeng Hu, Zhongqian Sun. [Demo Page]

Contributions:

  • EE-TTS can identify appropriate emphasis positions from text and synthesize expressive speech with emphasis and linguistic information.
  • It outperforms the baseline, improving expressiveness MOS from 3.76 to 4.25 and naturalness MOS from 3.67 to 4.34.
  • EE-TTS helped build AI playmate services for Honor of Kings, the world's most-played mobile MOBA game (100+ million DAU).

ICASSP 2024 (Oral)
SynthTab: Leveraging Synthesized Data for Guitar Tablature Transcription

Yongyi Zang*, Yi Zhong*, Frank Cwitkowitz, Zhiyao Duan. (*equal contribution) [Demo Page]

INTERSPEECH 2024
GTR-Voice: Articulatory Phonetics Informed Controllable Expressive Speech Synthesis

Zehua Kcriss Li, Meiying Melissa Chen, Yi Zhong, Pinxin Liu, Zhiyao Duan. [Demo Page]

🗂️ Selected Projects

🎙 Speech Synthesis

Few-shot Voice Cloning and Style Transfer

  • Achieved few-shot voice cloning from 20 utterances: the timbre similarity MOS reached 4.6 with a MOS of 3.8, plus a clear pronunciation-correction effect for L2 English speakers. [patented]
  • Used a pre-train/fine-tune paradigm with frame-level pitch modeling to achieve few-shot style transfer from 20 utterances, improving style SMOS from 3.5 to 4.5 while keeping naturalness MOS above 4.0.

🎼 Music

Probabilistic Topic Model Based Music Recommendation System (supervisor: Vladimir Pavlovic)

  • Leveraged a CRNN for music tagging, and applied Latent Dirichlet Allocation (LDA) and Hierarchical Dirichlet Process (HDP) probabilistic topic models for music topic modelling.
  • Used KL divergence to compare song-topic distributions for recommendation, as sketched below.
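
For illustration, here is a minimal sketch of that similarity step; the function names, the smoothing constant, and the choice of symmetrized KL are my own assumptions, not the project's exact implementation:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete topic distributions."""
    p = np.asarray(p, dtype=float) + eps   # smooth zeros so the log is defined
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()        # renormalize after smoothing
    return float(np.sum(p * np.log(p / q)))

def recommend(query_topics, catalog_topics, k=10):
    """Rank catalog songs by symmetrized KL distance to the query song."""
    dists = [kl_divergence(query_topics, t) + kl_divergence(t, query_topics)
             for t in catalog_topics]
    return np.argsort(dists)[:k]           # indices of the k most similar songs
```

Symmetrizing matters because KL is asymmetric (KL(p‖q) ≠ KL(q‖p)), so ranking on one direction alone can behave inconsistently across songs.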

💬 Speech Recognition & Evaluation

Recognition and Evaluation of Oral English

  • Designed and optimized the Goodness of Pronunciation (GOP) feature; implemented and tuned LR, XGBoost, and LSTM classifiers (a GOP sketch follows this list). Attained a state-of-the-art consistency rate for spoken-English evaluation. [patented]
  • Built the full chain-model training and optimization pipeline on the Kaldi framework, including corpus crawling, language- and acoustic-model training, Bi-RNN implementation, and RNN rescoring.
  • Achieved 5%-10% WER on various benchmark datasets, outperforming the Google ASR API on children's speech datasets.
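
As a rough illustration of the posterior-based GOP idea (the array shapes and names here are hypothetical; the patented feature design is more elaborate), each canonical phone is scored by its average log posterior over the frames that forced alignment assigns to it:

```python
import numpy as np

def gop_scores(log_posteriors, alignment):
    """Frame-averaged log-posterior GOP for each aligned phone.

    log_posteriors: (T, P) array of per-frame log phone posteriors
                    from the acoustic model.
    alignment:      iterable of (phone_id, start_frame, end_frame)
                    from forced alignment against the canonical text.
    """
    scores = []
    for phone_id, start, end in alignment:
        frames = log_posteriors[start:end, phone_id]
        scores.append(frames.mean())  # higher = closer to the canonical phone
    return np.asarray(scores)
```

Per-phone scores like these would then feed the LR/XGBoost/LSTM classifiers mentioned above.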

🗣️ Voice Conversion

Voice Conversion Timbre Similarity Improvement

  • Method: optimized the hidden-representation bottleneck of an any-to-one PPG-pipeline VC system (see the skeleton after this list). [patented]
  • Result: improved the timbre similarity MOS from 3.9 to 4.3.
  • Implemented many-to-many VC models such as VQ-VAE and StarGAN-VC for comparison.
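
For context, a minimal skeleton of what an any-to-one PPG-pipeline model can look like is below; the module layout and layer sizes are illustrative assumptions, not the patented system:

```python
import torch.nn as nn

class PPGConverter(nn.Module):
    """Illustrative any-to-one PPG-pipeline VC model.

    A pretrained ASR model first turns source speech into largely
    speaker-independent phonetic posteriorgrams (PPGs); this module
    maps those PPGs to the single target speaker's mel spectrogram,
    which a vocoder then renders as audio.
    """
    def __init__(self, ppg_dim=144, bottleneck_dim=64, mel_dim=80):
        super().__init__()
        # The narrow bottleneck squeezes out residual source-speaker
        # detail; its design is the component the project optimized.
        self.bottleneck = nn.Sequential(nn.Linear(ppg_dim, bottleneck_dim),
                                        nn.Tanh())
        self.decoder = nn.GRU(bottleneck_dim, 256, batch_first=True)
        self.out = nn.Linear(256, mel_dim)

    def forward(self, ppg):                # ppg: (batch, frames, ppg_dim)
        h, _ = self.decoder(self.bottleneck(ppg))
        return self.out(h)                 # (batch, frames, mel_dim)
```

The intuition behind tuning the bottleneck is that a representation that is too wide leaks source-speaker timbre, while one that is too narrow loses phonetic detail, so its dimensionality trades off similarity against intelligibility.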