Pronunciation Quality Evaluation Approach Based on Bimodal Fusion with Noise Adaptive Weight

Xi-Bin Jia, Kewei Zhang, Yanfang Han, David Powers

    Research output: Contribution to conference › Paper › peer-review


    To meet the need of virtual pedagogy applications to evaluate English learners' pronunciation quality, this paper proposes an automatic assessment method based on a bimodal fusion decision algorithm. Pronunciation is scored by comparing the learner's audio and video speech signals with those of a standard speaker, separately for each modality. The final score is obtained by fusing the two scores through a linear weighted combination. Because visual speech is known to aid audio in human speech perception, especially in noisy environments, the paper proposes a noise-adaptive weighting strategy for the fusion step. To handle differences in utterance length caused by varying speaking rates, the paper adopts a dynamic warping algorithm to time-align the test utterances with the standard ones. Data selected from the Australian audio-visual speech corpus (AVOZES) are used to test the automatic evaluation system. The experimental results show that, compared with audio-only evaluation, fusing audio and visual speech improves the rationality of the automatic pronunciation assessment system by making full use of the correlated and complementary information in acoustic and visual speech.
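
The abstract gives no formulas, so the following is only a minimal sketch of the pipeline it describes, under stated assumptions: the "dynamic warping algorithm" is taken to be classic dynamic time warping, the distance-to-score mapping is an arbitrary exponential, and the noise-adaptive weight is a hypothetical linear function of an estimated signal-to-noise ratio (SNR). The function names, the `scale` parameter, and the SNR thresholds are all illustrative and do not come from the paper.

```python
import math


def dtw_distance(test, ref):
    """Length-normalised dynamic time warping distance between two
    sequences of feature vectors (assumed stand-in for the paper's
    'dynamic warping algorithm')."""
    n, m = len(test), len(ref)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(test[i - 1], ref[j - 1])  # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # skip a test frame
                                 cost[i][j - 1],      # skip a reference frame
                                 cost[i - 1][j - 1])  # match frames
    return cost[n][m] / (n + m)


def similarity_score(distance, scale=1.0):
    """Map a distance to a 0..100 score (hypothetical mapping)."""
    return 100.0 * math.exp(-distance / scale)


def audio_weight(snr_db, low=0.0, high=30.0):
    """Hypothetical noise-adaptive weight: the audio weight grows
    linearly with SNR and is clamped to [0, 1]."""
    t = (snr_db - low) / (high - low)
    return min(1.0, max(0.0, t))


def fused_score(audio_score, video_score, snr_db):
    """Linear weighted combination of the two modality scores."""
    w = audio_weight(snr_db)
    return w * audio_score + (1.0 - w) * video_score
```

With these illustrative thresholds, `fused_score(80.0, 60.0, 15.0)` weights both modalities equally and returns 70.0, while at 0 dB or below the score falls back entirely on the visual channel. The weighting function actually used in the paper may differ.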

    Original language: English
    Number of pages: 4
    Publication status: Published - 1 Dec 2012
    Event: International Conference on Computing and Convergence Technology
    Duration: 9 Dec 2012 → …




    • bimodal fusion
    • pronunciation evaluation
    • timing alignment
    • visual speech


