Spatial and temporal visual speech feature for Chinese phonemes

Xibin Jia, Yanfang Han, David Powers, Xiyuan Bao

    Research output: Contribution to journal › Article › peer-review

    1 Citation (Scopus)


    This paper proposes a practical set of features for representing the visual speech of Chinese phonemes. The state, and hence visibility, of the teeth and tongue plays an important role in pronunciation, but discriminating them in images or video is difficult. The paper introduces the concept of inner appearance features based on structural analysis. Our experimental results provide preliminary evidence that describing the pixel distribution of the upper and lower inner mouth separately can improve the ability to discriminate useful facial features as well as individual phonemes. The Chinese phonemes defined in the SAPI Speech Interface generally correspond to one character or morpheme, and our dynamic feature is based on the traditional division of these syllabic phonemes into a consonant-like onset and a vowel- and/or nasal-like coda. Features are established by combining a series of frames and identifying the most salient change frame as the key frame, to provide an objective framework for phoneme onset recognition. Our work provides a basis for bimodal AudioVisual Chinese speech recognition as well as unimodal Visual speech reading, and is also targeted at AudioVisual speaking face/talking head synthesis.
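The two ideas in the abstract (describing the upper and lower inner mouth separately, and picking the most salient change frame as the key frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so the grayscale input, the even split into halves, the intensity histograms, the L1 change measure, and the names `inner_mouth_histograms` and `key_frame_index` are all assumptions made for illustration.

```python
import numpy as np


def inner_mouth_histograms(mouth, bins=8):
    """Concatenate normalized intensity histograms of the upper and lower
    halves of a grayscale inner-mouth region (values assumed in 0..255).

    Assumption: the region is split at its middle row; the paper's actual
    structural analysis is not described in the abstract.
    """
    mouth = np.asarray(mouth, dtype=float)
    mid = mouth.shape[0] // 2
    feats = []
    for half in (mouth[:mid], mouth[mid:]):
        hist, _ = np.histogram(half, bins=bins, range=(0, 256))
        feats.append(hist / max(hist.sum(), 1))  # guard against empty halves
    return np.concatenate(feats)


def key_frame_index(frames, bins=8):
    """Return the index of the frame whose feature vector differs most from
    the previous frame's (L1 distance). Requires at least two frames.

    Assumption: "most salient change" is approximated by the largest
    frame-to-frame feature difference.
    """
    feats = np.stack([inner_mouth_histograms(f, bins) for f in frames])
    change = np.abs(np.diff(feats, axis=0)).sum(axis=1)
    return int(np.argmax(change)) + 1


# Usage: a dark (closed) mouth for two frames, then bright pixels appear
# in the upper half, as they might when the upper teeth become visible.
closed = np.zeros((4, 4))
teeth = np.zeros((4, 4))
teeth[:2] = 255
print(key_frame_index([closed, closed, teeth, teeth]))
```

Keeping the two halves as separate histograms means a change confined to the upper teeth is not averaged away by an unchanged lower region, which is the intuition the abstract appeals to.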

    Original language: English
    Pages (from-to): 4177-4185
    Number of pages: 9
    Journal: Journal of Information & Computational Science
    Issue number: 14
    Publication status: Published - 15 Nov 2012


    • Chinese phoneme
    • Dynamic feature
    • Inner lip appearance

