Unsupervised Learning of Linguistic Structure: An Empirical Evaluation

Research output: Contribution to journal › Article › peer-review

13 Citations (Scopus)


Computational Linguistics and Natural Language have long been targets for Machine Learning, and a variety of learning paradigms and techniques have been employed with varying degrees of success. In this paper, we review approaches which have adopted an unsupervised learning paradigm, explore the assumptions which underlie the techniques used, and develop an approach to empirical evaluation. We concentrate on a statistical framework based on N-grams, although we seek to maintain neurolinguistic plausibility. The model we adopt places putative linguistic units in focus and associates them with a characteristic vector of statistics derived from occurrence frequency. These vectors are treated as defining a hyperspace, within which we demonstrate a technique for examining the empirical utility of the various metrics and normalization, visualization, and clustering techniques proposed in the literature. We conclude with an evaluation of the relative utility of a large array of different metrics and processing techniques in relation to our defined performance criteria.
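As a rough illustration of the model described above, the sketch below associates each word with a characteristic vector of neighbour counts taken from bigrams, then compares two vectors with cosine similarity. The abstract does not specify which statistics, normalizations, or metrics the paper uses, so the choice of bigram contexts, the cosine metric, the toy sentence, and the names `context_vectors` and `cosine` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math

def context_vectors(tokens):
    """Map each word to counts of its left and right neighbours (bigram contexts).

    Assumption: immediate-neighbour counts stand in for the paper's
    'characteristic vector of statistics'; the abstract does not define them.
    """
    vectors = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        vectors[left][("R", right)] += 1   # `right` occurs just after `left`
        vectors[right][("L", left)] += 1   # `left` occurs just before `right`
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (one possible metric)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus, invented for this example.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
vecs = context_vectors(tokens)

sim_cat_dog = cosine(vecs["cat"], vecs["dog"])  # words sharing the same contexts
sim_cat_on = cosine(vecs["cat"], vecs["on"])    # words with disjoint contexts
```

In this toy corpus, "cat" and "dog" occur in identical contexts and so lie in the same direction of the vector space, while "cat" and "on" share no context at all. Clustering or visualization techniques of the kind the abstract mentions would operate on vectors like these.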

Original language: English
Pages (from-to): 91-131
Number of pages: 41
Journal: International Journal of Corpus Linguistics
Issue number: 1
Publication status: Published - Jan 1997


  • Classification
  • Feature maps
  • Multidimensional scaling
  • Orthography
  • Phonology
  • Self-organization
  • Singular value decomposition
  • Spearman Rank Correlation
  • Syntax
  • Tagging
  • Unsupervised learning

