Suwon Shon

32 Vassar St., 32-G436
Cambridge, MA, USA

swshon (at) csail (dot) mit (dot) edu


Resume / CSAIL Profile / Google Scholar

I'm a research scientist in the Spoken Language Systems group of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), working with Dr. James Glass. I received B.S. and integrated Ph.D. degrees in electrical engineering from Korea University in South Korea in 2010 and 2017, respectively, and joined MIT as a postdoctoral associate in 2017. My research focuses on machine learning for speech signal processing, and I have been working on speaker and language recognition and related pre-processing techniques.

Recent work and projects

  • (Apr. 2019) Organizing the Arabic Dialect Identification (ADI) track of the Fifth Edition of the Multi-Genre Broadcast Challenge (MGB-5) [website] [baseline] [ADI17 dataset]
    • We aim to present a fine-grained analysis of Arabic dialects.
    • Over 3,000 hours of Arabic dialect speech data from 17 Arabic countries, collected from YouTube.
    • (Dec. 2019) A challenge special session was held at ASRU 2019.
    • Challenge overview slides are HERE.
  • (May-Oct. 2018) Participated in the NIST Speaker Recognition Evaluation (SRE) 2018 as a member of the JHU-MIT team [system description]
  • (May 2018) Organizing MCE 2018 [website] [plan] [code] [dataset]
  • (Mar. 2018) Organizing a task in VarDial 2018 [code]
    • If you want to start the Arabic Dialect Identification task with dialect embeddings, you can download them here
    • The complete program and papers are available from the workshop at COLING 2018 [link]
  • (Feb. 2018) Real-time Arabic dialect identification is online! You can find the system here
  • (Dec. 2017) I led the MIT-QCRI team for the 3rd Multi-Genre Broadcast (MGB-3) Challenge on the Arabic Dialect Identification (ADI) task, and we won the challenge!
    • The MIT-QCRI team paper was presented at ASRU 2017
    • We achieved 75% overall accuracy, significantly higher than the second-place team (70%); detailed results can be found in the summary paper.
    • We further improved the system, which now reaches 81% on the MGB-3 test set, the best accuracy reported to date. See paper [14] below. (Feb. 28, 2018)

Presentations

  • "The Fifth Edition of Multi-genre Broadcast Challenge: MGB-5", IEEE ASRU 2019 Special Session Overview, Sentosa, Singapore, Dec. 17, 2019 [slides]
  • "Overview of Automatic Speaker Recognition", Network course at MIT Beaverworks, MIT Lincoln Lab., Cambridge, MA, USA, Aug. 16, 2019
  • "Deep learning models for voice identity", Amazon, Sunnyvale, CA, USA, Jul. 24, 2019
    • The same topic was presented at Apple, Cupertino, CA, USA (Jul. 26, 2019); ASAPP, New York City, NY, USA (Aug. 1, 2019); and Facebook, Menlo Park, CA, USA (Aug. 26, 2019)
  • "Robust Speaker Recognition influenced by noise and face", ETRI, Daejeon, South Korea, Apr. 29, 2019
    • The same topic was presented at Naver (Apr. 24, 2019), KAIST (Apr. 17, 2019), and Korea University (Apr. 16, 2019).
  • "Analyzing hidden representation of end-to-end speaker recognition system", KAIST, Daejeon, South Korea, Jul. 5, 2018
    • The same topic was presented at Kookmin University (Jul. 7, 2018), Korea University (Jul. 7, 2018), NCSOFT (Jul. 6, 2018), and Naver (Jul. 11, 2018).
  • "Recent Speaker Recognition progress", Philips Visit Day, Cambridge, MA, USA, Apr. 11, 2018
  • "Speaker / Dialect Recognition under Limited Resources", Qatar Computing Research Institute, Doha, Qatar, Nov. 14, 2017
  • “Autoencoder based Domain Adaptation for Speaker Recognition under Insufficient Channel Information”, Interspeech 2017, Stockholm, Sweden, Aug. 22, 2017

Publications (Peer-reviewed)

[29] Suwon Shon, James Glass, “Multimodal Association for Speaker Verification”, to appear in Interspeech 2020, Shanghai, China, October 2020 [preprint]
[28] Shammur A. Chowdhury, Ahmed Ali, Suwon Shon, James Glass, “What does an End-to-end Dialect Identification Model Learn about Non-dialectal Information?”, to appear in Interspeech 2020, Shanghai, China, October 2020 [preprint]
[27] Suwon Shon, Ahmed Ali, Younes Samih, Hamdy Mubarak, James Glass, “ADI17: A Fine-grained Arabic Dialect Identification Dataset”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8239-8243, Barcelona, Spain (Virtually presented due to Covid-19), May 2020 [dataset][paper][slides]
[26] Ahmed Ali, Suwon Shon, Younes Samih, Hamdy Mubarak, Ahmed Abdelali, James Glass, Steve Renals, Khalid Choukri, “The MGB-5 Challenge: Recognition and Dialect Identification of Dialectal Arabic Speech”, IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, pp. 1026-1033, Singapore, December 2019 [website][paper][slides]
[25] Suwon Shon, Hao Tang, James Glass, “VoiceID Loss: Speech Enhancement for Speaker Verification”, Interspeech, pp. 2888-2892, Graz, Austria, September 2019 [paper][arxiv][demo][slides]
[24] Suwon Shon, Younggun Lee, Taesu Kim, “Large-scale Speaker Retrieval on Random Speaker Variability Subspace”, Interspeech, pp. 2963-2967, Graz, Austria, September 2019 [paper][arxiv][poster]
[23] Suwon Shon, Najim Dehak, Douglas Reynolds, James Glass, “MCE 2018: The 1st Multi-target Speaker Detection and Identification Challenge Evaluation”, Interspeech, pp. 356-360, Graz, Austria, September 2019 [paper][arxiv][poster]
[22] Jesús Villalba, Nanxin Chen, David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Jonas Borgstrom, Fred Richardson, Suwon Shon, François Grondin, Réda Dehak, Leibny Paola García-Perera, Daniel Povey, Pedro A. Torres-Carrasquillo, Sanjeev Khudanpur, Najim Dehak, “State-of-the-art Speaker Recognition for Telephone and Video Speech: the JHU-MIT Submission for NIST SRE18”, Interspeech, pp. 1488-1492, Graz, Austria, September 2019 [paper]
[21] Achintya K. Sarkar, Zheng-Hua Tan, Hao Tang, Suwon Shon and James Glass, “Time-Contrastive Learning Based Deep Bottleneck Features for Text-Dependent Speaker Verification”, IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1267-1279, August 2019 [arxiv]
[20] Suwon Shon, Tae-Hyun Oh, James Glass, “Noise-tolerant Audio-Visual Online Person Verification using an Attention-based Neural Network Fusion”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3995-3999, Brighton, UK, May 2019 [paper][arxiv][poster][Voxceleb2 test trials]
[19] Suwon Shon, Ahmed Ali, James Glass, “Domain Attentive Fusion for End-to-end Dialect Identification with Unknown Target Domain”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5951-5955, Brighton, UK, May 2019 [paper][arxiv][poster]
[18] Seongkyu Mun, Suwon Shon, “Domain Mismatch Robust Acoustic Scene Classification using Channel Information Conversion”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 845-849, Brighton, UK, May 2019 [paper][arxiv]
[17] Suwon Shon, Wei-Ning Hsu and James Glass, “Unsupervised Representation Learning of Speech for Dialect Identification”, IEEE Workshop on Spoken Language Technology (SLT), pp. 105-111, Athens, Greece, Dec. 2018 [paper][arxiv][poster]
[16] Suwon Shon, Hao Tang and James Glass, “Frame-level Speaker Embeddings for Text-independent Speaker Recognition and Analysis of End-to-end Model”, IEEE Workshop on Spoken Language Technology (SLT), pp. 1007-1013, Athens, Greece, Dec. 2018 [paper][arxiv][poster][Supplementary figures]
[15] Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Ahmed Ali, Suwon Shon, James Glass and others, “Language Identification and Morphosyntactic Tagging: Second VarDial Evaluation Campaign”, The Fifth Workshop on NLP for Languages, Varieties and Dialects (VarDial) of COLING, pp. 1-17, Santa Fe, USA, Aug. 2018 [paper]
[14] Suwon Shon, Ahmed Ali and James Glass, “Convolutional Neural Networks and Language Embeddings for End-to-End Dialect Recognition”, Speaker Odyssey: The Speaker and Language Recognition Workshop, pp. 98-104, Les Sables d'Olonne, France, June 2018 [paper][poster][code]
[13] Maryam Najafian, Sameer Khurana, Suwon Shon, Ahmed Ali and James Glass, “Exploiting Convolutional Neural Network for Phonotactic based Dialect Identification”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5174-5178, Calgary, Canada, April 2018 [paper] [poster]
[12] Suwon Shon, Ahmed Ali and James Glass, “MIT-QCRI Arabic Dialect Identification System for the 2017 Multi-Genre Broadcast Challenge”, IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop, pp. 374-380, Okinawa, Japan, December 2017 [paper] [poster] [code]
[11] Suwon Shon, Seongkyu Mun, Wooil Kim and Hanseok Ko, “Autoencoder based Domain Adaptation for Speaker Recognition under Insufficient Channel Information”, Interspeech, pp. 1014-1018, Stockholm, Sweden, August 2017 [paper] [slide]
[10] Suwon Shon, Seongkyu Mun and Hanseok Ko, “Recursive Whitening Transformation for Speaker Recognition on Language Mismatched Condition”, Interspeech, pp. 2869-2873, Stockholm, Sweden, August 2017 [paper] [poster]
[9] Seongkyu Mun, Suwon Shon, Wooil Kim, David Han and Hanseok Ko, “Deep Neural Network based Learning and Transferring Mid-level Auto Features for Acoustic Scene Classification”, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 796-800, New Orleans, USA, March 2017 [paper]
[8] Seongkyu Mun, Suwon Shon, Wooil Kim and Hanseok Ko, “Deep Neural Network Bottleneck Features for Acoustic Event Recognition”, Interspeech, pp. 2954-2957, San Francisco, CA, USA, September 2016 [paper]
[7] Suwon Shon, Seongkyu Mun, David Han and Hanseok Ko, “A non-negative matrix factorization based subband decomposition for acoustic source localization”, Electronics Letters, Vol. 51, No. 22, pp. 1723-1724, 2015 [paper]
[6] Suwon Shon, Seungkyu Mun, David Han, Hanseok Ko, “Maximum Likelihood Linear Dimension Reduction of Heteroscedastic Feature for Robust Speaker Recognition”, IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), Karlsruhe, Germany, August 25-28, 2015 [paper]
[5] Seungkyu Mun, Suwon Shon, Wooil Kim, Hanseok Ko, “Generalized cross-correlation based noise robust abnormal acoustic event localization utilizing non-negative matrix factorization”, IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), Seoul, South Korea, September 26-29, 2014 [paper]
[4] Suwon Shon, David K. Han, and Hanseok Ko, “Abnormal Acoustic Event Localization based on Selective Frequency Bin in High Noise Environment for Audio Surveillance”, IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), pp. 87-92, Krakow, Poland, August 2013 [paper]
[3] Suwon Shon, David K. Han, Jounghoon Beh and Hanseok Ko, “Full Azimuth Multiple Sound Source localization with 3-channel microphone array”, IEICE Trans. on Fundamentals, Vol. E95-A, No. 4, pp. 745-750, April 2012 [paper]
[2] Suwon Shon, Eric Kim, Jongsung Yoon and Hanseok Ko, “Sudden Noise Source Localization System for intelligent Automobile application with Acoustic Sensors”, IEEE International Conference on Consumer Electronics, pp. 237-238, Las Vegas, NV, USA, January 2012 [paper]
[1] Suwon Shon, Jounghoon Beh, Cheoljong Yang, David K. Han and Hanseok Ko, “Motion Primitives for Designing Flexible Gesture Set in Human-Robot Interface”, International Conference on Control, Automation and Systems, pp. 1501-1504, Il-san, South Korea, October 2011 [paper]

Manuscripts (Non-peer-reviewed)

[4] Younggun Lee, Suwon Shon, Taesu Kim, “Learning pronunciation from a foreign language in speech synthesis networks”, in preparation [arxiv]
[3] Jesús Villalba, Nanxin Chen, David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Jonas Borgstrom, Fred Richardson, Suwon Shon, François Grondin, Réda Dehak, Leibny Paola García-Perera, Pedro A. Torres-Carrasquillo, and Najim Dehak, “The JHU-MIT System Description for NIST SRE18”, Proc. NIST Speaker Recognition Evaluation Workshop, Athens, Greece, December 2018 [paper]
[2] Suwon Shon, Najim Dehak, Douglas Reynolds, James Glass, “MCE 2018: The 1st Multi-target Speaker Detection and Identification Challenge Evaluation (MCE) Plan”, MCE 2018 plan description [arxiv] [website]
[1] Suwon Shon and Hanseok Ko, “KU-ISPL Speaker Recognition Systems under Language Mismatch Condition for NIST 2016 Speaker Recognition Evaluation”, NIST SRE16 workshop, San Diego, USA, December 2016 [arxiv] [poster]