
CS 272: Foundations of Learning, Decisions, and Games


Notes for Nika Haghtalab’s Fall 2025 course. A work in progress.

Lecture 2 (September 2nd, 2025)

  • What is an algorithm that learns the class of 2D axis-aligned rectangles in the consistency model? What is the runtime of such an algorithm?
  • What is True Error? What is Empirical Error?
  • When does an algorithm PAC-learn a concept class? What is the difference between proper and improper PAC-learnability?
  • Prove the PAC-learnability of finite concept classes.

Defining a Formal Model for Statistical Learning

  • Domain, $\mathcal X$, the set of all possible instances. Each instance is a vector of features.
  • Label set, $\mathcal Y$. Assume for now that $\mathcal Y = \set{0, 1}$.
  • Concept class, $\mathcal C$, where a concept is a mapping $c: \mathcal X \to \mathcal Y$. Example concepts: Boolean expressions of the features, look-up tables, etc. A concept class is a restriction on the search space of possible classifiers.

The Consistency Model

Definition (Consistency). A concept, $c$, is consistent with a set of samples, $S = \lbrace (x_1, y_1), \ldots, (x_m, y_m) \rbrace$, if $c(x_i) = y_i$ for all $(x_i, y_i) \in S$.

In the Consistency Model, an algorithm $\mathcal A$ learns $\mathcal C$ if, given $S$,

\[\mathcal A(S) = \begin{cases}c \in \mathcal C & \text{such that $c$ is consistent with } S, \text{ if such a concept exists}\\ \text{DNE} & \text{if no such $c \in \mathcal C$ exists}\end{cases}\]
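As a concrete illustration of the model (and of the rectangle question in the Lecture 2 list above), here is a minimal Python sketch of a consistency-model learner for 2D axis-aligned rectangles; the function and variable names are my own, not from the course. It returns the tightest rectangle containing all positive examples if that rectangle is consistent with the negative examples, and DNE otherwise, in $\mathcal O(m)$ time for $m$ samples.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, x_max, y_min, y_max)

def learn_rectangle(samples: List[Tuple[Point, int]]) -> Optional[Rect]:
    """Consistency-model learner for 2D axis-aligned rectangles.

    Returns the tightest rectangle containing all positive points if it is
    consistent with the negative points, and None ("DNE") otherwise.
    Runs in O(m) time for m samples.
    """
    positives = [x for x, y in samples if y == 1]
    if not positives:
        # No positive examples: a degenerate "empty" rectangle is consistent.
        return (0.0, -1.0, 0.0, -1.0)  # x_min > x_max encodes the empty rectangle

    x_min = min(p[0] for p in positives)
    x_max = max(p[0] for p in positives)
    y_min = min(p[1] for p in positives)
    y_max = max(p[1] for p in positives)

    # The tightest fit is consistent iff no negative example falls inside it.
    for (px, py), label in samples:
        if label == 0 and x_min <= px <= x_max and y_min <= py <= y_max:
            return None  # no consistent rectangle exists in the class
    return (x_min, x_max, y_min, y_max)


# Example usage: two positives and one negative outside their bounding box.
print(learn_rectangle([((1, 1), 1), ((3, 2), 1), ((5, 5), 0)]))  # (1, 3, 1, 2)
```

If the tightest fit misclassifies a negative point, then every rectangle containing all positives also contains that point, so no consistent rectangle exists in the class.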

Training Data

The training data for an algorithm is a set of labeled examples, $S \subseteq \mathcal X \times \mathcal Y$. Often, it is assumed that the instances in the training data are sampled IID from a data distribution, $\mathcal D$, which is a probability distribution over $\mathcal X$. For now, assume there also exists a true labeling function, $f: \mathcal X \to \mathcal Y$, so that $y_i = f(x_i)$ for every $(x_i, y_i) \in S$.

Assumption (Realizability of $\mathcal C$). Under the Realizability Assumption, there exists a concept, $c \in \mathcal C$, such that $\text{err}_{\mathcal D}(c) = 0$. It follows that, with probability 1 over the draw of $S$, $c(x_i) = f(x_i) = y_i$ for all $(x_i, y_i) \in S$, so $\text{err}_S(c) = 0$.

Both $\mathcal D$ and $f$ are unknown to the learning algorithm, $\mathcal A$.

Measures of Error

Definition (True Error). The True Error is calculated over the data distribution, and is defined as

\[\text{err}_{\mathcal D}(c) = \mathbb P_{x \sim \mathcal D} \left[ c(x) \neq f(x) \right]\]

Definition (Empirical Error). The Empirical Error is calculated over the training data, and is defined as

\[\text{err}_S(c) = \frac{1}{m} \sum_{i=1}^m \mathbb 1\set{c(x_i) \neq y_i}\]
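A direct translation of these two definitions into code can make the distinction concrete. This is a small sketch with illustrative names (not from the course), where the normally unknown $\mathcal D$ and $f$ are chosen synthetically so that the true error can be estimated by Monte Carlo:

```python
import random
from typing import Callable, List, Tuple

def empirical_error(c: Callable[[float], int],
                    S: List[Tuple[float, int]]) -> float:
    """err_S(c): fraction of the m training samples that c mislabels."""
    return sum(1 for x, y in S if c(x) != y) / len(S)

def estimate_true_error(c: Callable[[float], int],
                        f: Callable[[float], int],
                        sample_from_D: Callable[[], float],
                        n: int = 100_000) -> float:
    """Monte Carlo estimate of err_D(c) = P_{x ~ D}[c(x) != f(x)]."""
    draws = [sample_from_D() for _ in range(n)]
    return sum(1 for x in draws if c(x) != f(x)) / n

# Example: D = Uniform[0, 1], f thresholds at 0.5, c thresholds at 0.6,
# so the true error is the mass of [0.5, 0.6), i.e. 0.1.
f = lambda x: int(x >= 0.5)
c = lambda x: int(x >= 0.6)
S = [(x, f(x)) for x in (random.random() for _ in range(100))]
print(empirical_error(c, S), estimate_true_error(c, f, random.random))
```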

PAC-Learnability

Conceptually, a concept class $\mathcal C$ is PAC-learnable if a sufficiently accurate classifier can be learned, with sufficiently high probability, from a sufficiently large training dataset.

Definition (PAC-Learnability). Formally, a concept class $\mathcal C$ is PAC-learnable if there exist an algorithm $\mathcal A$ and a function $m: (0, 1)^2 \to \mathbb N$ such that, for every $(\epsilon, \delta) \in (0, 1)^2$ and every $(\mathcal D, f)$ for which Realizability of $\mathcal C$ holds, when $\mathcal A$ is given a sample $S$ of $\lvert S \rvert \geq m(\epsilon, \delta)$ examples drawn IID from $\mathcal D$ and labeled by $f$, it outputs a concept $c = \mathcal A(S)$ with $\text{err}_{\mathcal D}(c) \leq \epsilon$, with probability at least $1 - \delta$ over the draw of $S$.

Theorem (PAC-Learnability of Finite Concept Classes). If $\mathcal A$ learns $\mathcal C$ in the Consistency Model, where $\mathcal C$ is a finite concept class, then $\mathcal A$ PAC-learns $\mathcal C$ with sample complexity

\[m(\epsilon, \delta) = \mathcal O\left(\frac{\ln(\lvert \mathcal C \rvert / \delta)}{\epsilon} \right)\]
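One way to see this is the standard union-bound argument: any fixed concept with true error greater than $\epsilon$ is consistent with $m$ IID samples with probability at most $(1 - \epsilon)^m$, so

\[\mathbb P \left[ \exists\, c \in \mathcal C: \text{err}_{\mathcal D}(c) > \epsilon \text{ and } \text{err}_S(c) = 0 \right] \leq \lvert \mathcal C \rvert (1 - \epsilon)^m \leq \lvert \mathcal C \rvert e^{-\epsilon m}\]

Setting the right-hand side to at most $\delta$ and solving gives $m \geq \frac{1}{\epsilon} \ln\left(\frac{\lvert \mathcal C \rvert}{\delta}\right)$; with this many samples, any consistent concept that $\mathcal A$ returns has true error at most $\epsilon$ with probability at least $1 - \delta$.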

Lecture 3 (September 4th, 2025)

  • What is a growth function? Consider the example for 2D axis-aligned rectangles.
  • What is the Fundamental Theorem of PAC-Learning?

Lecture 4 (September 9th, 2025)

Lecture 5 (September 11th, 2025)

  • Mistake bound model

Lecture 6 (September 16th, 2025)

  • Relax the consistency assumption
  • What is the Majority/Halving Algorithm?
  • What is the Weighted Majority Algorithm? (See the sketch after this list.)
  • Experts setting: $n$ experts; we do not want to regret having ignored a good recommendation
    • costs are chosen by an adversary (a cost function $c^t$ per round)
    • goal: keep the algorithm's cumulative cost as close as possible to the best expert's cumulative cost
  • Randomized Weighted Majority
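A minimal Python sketch of the deterministic Weighted Majority Algorithm for binary predictions (names and data layout are my own, not from the course); with $\epsilon = 1$ and a perfect expert it reduces to the Halving Algorithm, which simply discards every expert that errs:

```python
from typing import List

def weighted_majority(expert_preds: List[List[int]],
                      true_labels: List[int],
                      epsilon: float = 0.5) -> int:
    """Deterministic Weighted Majority on binary predictions.

    expert_preds[t][i] is expert i's prediction in round t, labels in {0, 1}.
    Each mistaken expert has its weight multiplied by (1 - epsilon); with
    epsilon = 1 this eliminates mistaken experts (the Halving Algorithm).
    Returns the number of mistakes the algorithm makes.
    """
    n = len(expert_preds[0])
    weights = [1.0] * n
    mistakes = 0
    for preds, y in zip(expert_preds, true_labels):
        # Predict with the side holding more total weight.
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(weights) - vote_one
        y_hat = 1 if vote_one >= vote_zero else 0
        if y_hat != y:
            mistakes += 1
        # Penalize every expert that predicted incorrectly this round.
        weights = [w * (1.0 - epsilon) if p != y else w
                   for w, p in zip(weights, preds)]
    return mistakes
```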

Theorem. For any $T > 0$ and any (possibly adaptive) adversary, Randomized Weighted Majority with parameter $\epsilon \in (0, 1)$ satisfies

\[\mathbb E \left[ \sum_{t=1}^T c^t(i^t) \right] < \frac{1}{1-\epsilon}\, \mathbb E \left[ \min_{i \in [n]} \sum_{t=1}^T c^t(i) \right] + \frac{1}{\epsilon} \ln n\]

A key step in the proof bounds the total weight $W^t = \sum_{i \in [n]} w_i^t$ after all $T$ rounds:

\[\ln \left( W^{T+1} \right) \leq \ln n - \epsilon\, \mathbb E \left[ \sum_{t=1}^T c^t (i^t) \right]\]
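For concreteness, here is a minimal Python sketch of Randomized Weighted Majority in the setting of this theorem (my own names; it assumes costs in $[0, 1]$ and uses the multiplicative update $w_i^{t+1} = w_i^t (1-\epsilon)^{c^t(i)}$):

```python
import random
from typing import Callable, List

def randomized_weighted_majority(n: int,
                                 T: int,
                                 epsilon: float,
                                 get_costs: Callable[[int, List[int]], List[float]]) -> float:
    """Run Randomized Weighted Majority for T rounds.

    get_costs(t, past_choices) returns the cost vector c^t in [0, 1]^n; an
    adaptive adversary may base it on the algorithm's choices in rounds < t.
    Returns the algorithm's realized total cost, sum_t c^t(i^t).
    """
    weights = [1.0] * n          # w_i^1 = 1 for every expert
    history: List[int] = []      # i^1, ..., i^{t-1}
    total_cost = 0.0
    for t in range(1, T + 1):
        costs = get_costs(t, history)   # adversary commits to c^t first
        # Sample expert i^t with probability proportional to its weight.
        i_t = random.choices(range(n), weights=weights, k=1)[0]
        history.append(i_t)
        total_cost += costs[i_t]
        # Multiplicative update: w_i^{t+1} = w_i^t * (1 - epsilon)^{c^t(i)}
        weights = [w * (1.0 - epsilon) ** c for w, c in zip(weights, costs)]
    return total_cost


# Example usage: 3 experts, oblivious adversary that always charges expert 0.
costs = lambda t, hist: [1.0, 0.0, 0.5]
print(randomized_weighted_majority(n=3, T=1000, epsilon=0.1, get_costs=costs))
```

Sampling $i^t$ proportionally to the weights, rather than taking a deterministic weighted vote, is what allows the expected-cost guarantee above to hold against an adversary.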