11-661/761 Language and Statistics
Fall 2019

Internet search, speech recognition, machine translation, question answering, information retrieval, biological sequence analysis -- are all at the forefront of this century’s information revolution. In addition to their use of machine learning, these technologies rely heavily on classic statistical estimation techniques. Yet most CS and engineering undergraduate programs do not prepare students in this area beyond an introductory probability & statistics course. This course is designed to address this gap.

The goal of "Language and Statistics" is to ground the data-driven techniques used in language technologies in sound statistical methodology. We start by formulating various language technology problems in both an information theoretic framework (the source-channel paradigm) and a Bayesian framework (the Bayes classifier). We then discuss the statistical properties of words, sentences, documents and whole languages, and the various computational formalisms used to represent language. These discussions naturally lead to specific concepts in statistical estimation.

Topics include: Zipf's distribution and type-token curves; point estimators, Maximum Likelihood estimation, bias and variance, sparseness, smoothing and clustering; interpolation, shrinkage, and backoff; entropy, cross entropy and mutual information; decision tree models applied to language; latent variable models and the EM algorithm; hidden Markov models; exponential models and the maximum entropy principle; semantic modeling and dimensionality reduction; probabilistic context-free grammars and syntactic language models.

“Language and Statistics” is designed for LTI and other SCS graduate students, but others are also welcome. CS undergraduate upperclassmen who have taken it have done well, although they found it challenging.

The Master-level version of this course (11-661) does not require the course project.

Instructor: Bhiksha Raj


Lecture: Monday and Wednesday, 12:00 noon - 1.20pm

Location: Gates-Hillman Complex GHC 4307

Recitation: TBD

Office hours:


Strong quantitative aptitude. Familiarity and comfort with basic undergraduate-level probability. Some programming skill.


This course is worth 12 units.

Course Work

Assignments 7 assignments, total contribution to grade 70%
Final Project 1 project, total contribution to grade 30% (only for those taking 11-761; otherwise, renormalize)

Late Policy

Extensions: If you have an unavoidable conflict that prevents you from completing an assignment on time (such as travel to a conference or a medical emergency), please send an email to the TA in charge of that assignment, as soon as you become aware of the problem, briefly stating the circumstances and how much more time you need. The TA-in-charge is authorized to grant an extension as long as you requested it promptly. Do not send extension requests to the Instructor, nor to the other TAs.

Assignments turned late without prior approval will incur a 25% penalty per 24-hour period or any part thereof.


The course will not follow a specific book, but will draw from a number of sources. We list relevant books at the end of this page. We will also put up links to relevant reading material for each class. Students are expected to familiarize themselves with the material before the class. The readings will sometimes be arcane and difficult to understand; if so, do not worry, we will present simpler explanations in class.

Discussion board: Piazza

We will use Piazza for discussions. Here is the link. Please sign up.

Academic Integrity

You are expected to comply with the University Policy on Academic Integrity and Plagiarism.
  • You are allowed to talk with / work with other students on homework assignments
  • You can share ideas but not code, you should submit your own code
Your course instructor reserves the right to determine an appropriate penalty based on the violation of academic dishonesty that occurs. Violations of the university policy can result in severe penalties including failing this course and possible expulsion from Carnegie Mellon University. If you have any questions about this policy and any work you are doing in the course, please feel free to contact your instructor for help.

Schedule of Classes (This schedule is only tentative and subject to change)

Lectures Dates Topics Required Reading (due BEFORE class) Lecture notes/Slides Additional readings, if any Assignments
1,2 August 29, September 5
  • Course Overview
  • Statistical Language Modeling
  • Computational Linguistics
  • Statistical Decision Making
  • Source-Channel Paradigm
  • Heterogeneity of language
  • Word types and tokens, type token curves
  • Zipf's law, Mandelbrot's distribution
[MS] 1.1 - 1.3, 2.1
  • Historical overview by Lafferty
  • [BCW] ch. 4, Zipf's Law
  • HW 1 out on 09/08
    3,4,5 September 10, September 12, September 17
    • Probability Review
    • Parameter Estimation
    • Desiderata for Estimators
    [MS] 1.4 HW1 due on 09/14
    6,7 September 19, September 24, September 26
    • Unigrams: Statistical Estimation
    • Maximum Likelihood Estimates
    [MS] 6.2.1 [mD] ch. 6, esp. 6.5 HW2 out on 09/23. Latex source available here
    October 1 Class Cancelled
    8 October 3
    • Sparseness
    • Smoothing
    [MS] 6.2.2 Slides included in the Maximum Likelihood Estimation deck Good-Turing HW2 due on 09/30
    9 Oct 5 @ NSH 3002 (Make up class)
    • Probability Simplex
    • N-grams: Linear Interpolation
    • Pseudocounts
    • MAP
    • Conjugate Priors
    • [MS] 6.1, 6.3
    • [sK]
    Slides included in the Maximum Likelihood Estimation deck Chen & Goodman 98(pp. 1-21) HW 3 out on 10/08
    10,11 Oct 8, Oct 10
    • MAP cont'd
    • Expectation Maximization (EM) I
    • EM II (formal)
    • Gaussian Mixture Models (GMMs)
    Slides included in the Maximum Likelihood Estimation deck More advanced EM notes by John Lafferty
  • HW 3 due on 10/15
  • HW 4 out on 10/22. LaTeX source available here
  • 12,13 Oct 15, Oct 17
    • Whole Sentence Models
    • Ngrams: Shrinkage and EM for Shrinkage
    • Backoff
    • Information Theory I
    14,15 Oct 22, Oct 24
    • Information Theory II
    • Perplexity, Feature Selection
    • HW 4 due on 10/29
    16,17 Oct 29, Oct 31
    • Clustering
    • Trees
    18,19 Nov 5, Nov 7
    • Hidden Markov Models - I and II
    • Exam Review
    [MS] ch. 9 Larry Rabiner's classic HMM tutorial
    20 Nov 12
    • Hidden Markov Models - III
    [MS] ch. 9 Larry Rabiner's classic HMM tutorial
    • HW 5 due on 11/12
    • HW 6 out on 11/13
    Nov 15 - Nov 17
    • Take Home Exam
    21,22,23 Nov 14, Nov 19, Nov 26
    • Exponential Models - I and II
    • Maximum Entropy Models
    • HW 6 due on 11/22
    • HW 7 out on 11/23
    24 November 28
    • Semantic Models
    • Probabilistic Context Free Grammars (PCFG)
    • The Inside-Outside Algorithm
    [MS] 11.1-11.4
    25,26 Dec 3, Dec 5
      Overflow Topics:
    • Latent Semantic Analysis
    • Dimensionality Reduction
    • cf. Word2vec
    • Syntactic Language Models
    • Presentations - Mandatory Class Attendance
    HW 7 due on 12/07

    Abbreviations (in order of appearance):

    Abbreviation Book
    [MS] Manning and Schutze, Foundations of Statistical Natural Language Processing.
    [BCW] Bell, Cleary and Witten, Text Compression.
    [mD] Morris DeGroot, Probability and Statistics, 2nd edition.
    [sK] Slava M. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer", IEEE Transactions on Acoustic, Speech and Signal Processing, vol. 35, no. 3, pp. 400-401, 1987.
    [BBDM] L. Bahl, P. Brown, P. de Souza and R. Mercer, "A tree-based statistical language model for natural language speech recognition", IEEE Transactions on Acoustic, Speech and Signal Processing, vol. 37, no. 7, pp. 1001-1008, 1989.
    [rR] Roni Rosenfeld, "A maximum entropy approach to adaptive statistical language modeling", Speech and Language, vol. 10, pp. 187-228, 1996.