Introduction to Machine Learning (COMP90049)

Ruochen ChenMarch 2, 2025About 49 min

Introduction to Machine Learning (COMP90049)

Week 1

Machine learning is a method of teaching software to learn from data and make decisions on their own, without being explicitly programmed.

Three ingredients for machine learning

Data

Discrete vs continuous vs ...
Big data vs small data
Labeled data vs unlabeled data
Public vs sensitive data

Models

function mapping from inputs to outputs
probabilistic machine learning models
geometric machine learning models
parameters of the function are unknown

Learning

Improving (on a task) after data is taken into account
Finding the best model parameters (for a given task)
Supervised vs. unsupervised learning

Info

Supervised Learning:监督学习使用带有标签的数据进行训练，模型学习输入（features）到输出（labels）之间的映射关系。 Unsupervised Learning:无监督学习使用没有标签的数据，模型通过分析数据的模式和结构来进行学习。

Linear Algebra Review

Matrices

Matrices addition/subtraction: Add(Subtract) correspond ingentries in A and B
Matrix multiplication: Multiply corresponding entries in A and B and sum the products
Matrix transpose: Transpose of a matrix is obtained by interchanging its rows and columns. Matrix is symmetric if it is equal to its transpose.
Matrix inverse: The inverse of a matrix A is denoted by A^-1 and is obtained by multiplying A by its inverse.
A matrix cannot be inverted if: More rows than columns, More columns than rows,Redundant rows/columns (linear independence)

Vectors

A vector is a matrix with several rows and one column
Vector addition/subtraction: Add(Subtract) corresponding entries in A and B
Vector inner product: Multiply corresponding entries in A and B and sum the products
Vector Euclidean norm: The square root of the sum of the squares of the entries in the vector.
Vector inner product: The dot product of two vectors A and B is the sum of the products of their corresponding entries.
The cosine of the angle between two vectors can be found by using norms and the inner product

Instances, Attributes, and Learning Paradigms (Supervised vs. Unsupervised Learning)

In ML terminology examples are called Instances
Each instance can have some Features or Attributes
Concepts are things that we aim to learn. Generally, in the form of labels or classes

Unsupervised do not have access to an inventory of classes and instead discover groups of ‘similar’ examples in a given dataset.

Clustering is unsupervised — the learner operates without a set of labelled training data
Success is often measured subjectively; evaluation is problematic

Supervised methods have prior knowledge of classes and set out to discover and categorise new instances according to those classes

Classification learning is supervised
In Classification, we can exhaustively list/enumerate all possible labels for a given instance; a correct prediction entails mapping an instance to the label which is truly correct
Regression learning is supervised
In Regression,"infinitely" many labels are possible, we cannot conceivably enumerate them; a “correct” prediction is when the numeric value is acceptably close to the true value

Featured Data Types

Discrete: Nominal (Categorical)

Values are distinct symbols, values themselves serve only as labels or names
No relation is implied among nominal values (no ordering or distance measure)
Only equality tests can be performed
e.g. Student Number

Ordinal

An explicit order is imposed on the values
Addition and subtraction does not make sense
e.g. Educational Level

Continuous: Numeric

Numeric quantities are real-valued attributes
All mathematical operations are allowed

Equal Width vs. Equal Frequency vs. Clustering

Method	Equal Width Binning	Equal Frequency Binning	Clustering
Definition	Each bin has the same width	Each bin contains the same number of data points	Groups data points based on similarity
Type	Data discretization	Data discretization	Unsupervised learning
Advantages	Easy to compute, simple	Suitable for skewed distributions	Can detect natural groupings in data
Disadvantages	Sparse or dense bins if data density varies	Uneven bin width, harder to interpret	May require tuning (e.g., number of clusters)
Common Algorithms	Fixed width intervals	Quantiles-based binning	K-Means, DBSCAN, Hierarchical Clustering
Use Cases	Histogram creation, feature engineering	Handling skewed data in ML models	Customer segmentation, anomaly detection

Info

Equal Width Binning：用于简单离散化，每个 bin 宽度相同，但可能会导致数据密度不均衡。 Equal Frequency Binning：每个 bin 的数据量相等，适合处理偏态数据，但 bin 的宽度不一致，可能难以解释。 Clustering（聚类）：用于无监督学习，根据数据点的相似性自动分组，适合发现隐藏模式，但通常需要调整参数（如 k 值）。

Standardization vs. Normalization

Method	Standardization (Z-score)	Min-Max Normalization
Formula	( X' = \frac{X - \mu}{\sigma} )	( X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} )
Range	Mean = 0, Std = 1	[0,1] or [-1,1]
Best for	Normally distributed data	Data with fixed bounds
Sensitive to outliers?	Less sensitive	More sensitive
Formula	(X - mean) / std	(X - min) / (max - min)

Info

Standardization：将数据标准化到均值为 0，标准差为 1 的分布，适用于正态分布的数据。 Normalization：将数据缩放到 [0,1] 或 [-1,1] 范围内，适用于数据范围固定的数据。

Week 2

K-Nearest Neighbors (KNN)

supervied learning algorithm

KNN Classification

Return the most common class label among neighbors
Example: cat vs dog images; text classification; ...

KNN Regression

Return the average value of among K nearest neighbors
Example: housing price prediction;

To measure categorical distance, we can use:

Hamming distance: number of positions where the two strings differ
Jaccard Similarity: intersection over union of two sets

To measure numerical distance, we can use:

Manhattan distance: sum of absolute differences between corresponding components
Euclidean distance: square root of the sum of the squares of the differences between corresponding components
Cosine distance: 1 minus the cosine of the angle between two vectors

To measure oridinal distance, we can use:

Normalized Ranks: rank each value and normalize them to [0, 1]

Inverse Distance

Give more weight to the nearer neighbors rather than quantity.
The bigger the weight, the more important the neighbor is.

Inverse Linear Distance

Give more weight to the nearer neighbors, but with a decreasing slope.
The bigger the weight, the more important the neighbor is.

方法名称	是否考虑距离	权重形式	特点
Majority Vote	❌ 不考虑	所有邻居权重相等	简单、容易实现，容易受异常值影响
Inverse Distance	✅ 强调距离	权重 = 1 / 距离	近邻影响大，远邻影响小，权重下降快
Inverse Linear Distance	✅ 强调距离	权重 = 1 - d/D（D是最大距离，d是当前距离）	更平滑下降，抗异常能力更强

Value of K

K Value	Bias	Variance	Overfitting	Underfitting	Best For
Small K (e.g., K=1, K=3)	Low Bias: The model can closely follow the data.	High Variance: Sensitive to noise and outliers.	Likely to overfit due to high sensitivity to small fluctuations in the training data.	Unlikely to underfit unless the data is too noisy or simple.	- Complex data with clear patterns - When the dataset is relatively small.
Large K (e.g., K=10, K=20)	High Bias: The model becomes less sensitive to variations in the data.	Low Variance: Smoothing out the noise by considering more neighbors.	Less likely to overfit as it smooths out fluctuations.	Might underfit if the data has complex relationships or non-linear patterns.	- Noisy data - When a generalization is more important than capturing every detail.
Medium K (e.g., K=5, K=7)	A balanced approach with moderate bias.	Balanced variance, aiming for generalization.	Minimizes both overfitting and underfitting.	Good compromise between bias and variance.	- Standard choice for most datasets, balancing generalization and accuracy.

Why KNN

Pros
Intuitive and simple
No assumptions
Supports classification and regression
No training: new data →evolve and adapt immediately
Cons
How to decide on best distance functions?
How to combine multiple neighbors?
How to select K ?
Expensive with large (or growing) data sets

Lazy Learning vs. Eager Learning

Criteria	Lazy Learning (e.g., KNN)	Eager Learning
Definition	Delays learning until a query is made	Learns from the training data immediately
Training Phase	Fast (no model building)	Slow (model is built during training)
Prediction Phase	Slow (requires processing the entire dataset)	Fast (uses the pre-built model)
Memory Requirement	High (stores the entire training dataset)	Lower (only stores the model)
Flexibility	High (can adapt to new data easily)	Low (requires retraining for new data)
Example	K-Nearest Neighbors (KNN)	Decision Trees, Neural Networks

Probility

P(A=a): the probability that random variable A takes value a
0 <= P(A=a) <= 1
P(True) = 1
P(False) = 0

Joint Probability

P(A, B): joint probability of two events A and B
the probability of both A and B occurring = P(A ∩ B)

Conditional Probability

P(A|B): the probability of A occurring given that B has occurred
P(A|B) = P(A ∩ B) / P(B)

Independent Probability

Two events A and B are independent if P(A|B) = P(A)
P(A, B) = P(A) * P(B)

Info

Bayes' Rule

P(A|B) = ( P(B|A) * P(A) ) / P(B)
Bayes’ Rule allows us to compute P(A|B) given knowledge of the ‘inverse’ probability P(B|A).

Probability Distributions

Probability distributions can be discrete or continuous.
Discrete Random Variable: Takes on a countable number of distinct values (e.g., number of heads in coin flips).
Continuous Random Variable: Takes on an infinite number of possible values (e.g., height of students).

Distribution	Type	Range	Parameters	Formula	Example	Use Cases
Normal	Continuous	−∞ to +∞	Mean μ, Variance σ²	`P(x) = (1 / √(2πσ²)) * exp(-((x - μ)² / (2σ²)))`	Human height, exam scores	Linear regression, Gaussian models
Bernoulli	Discrete	0, 1	Probability p	`P(X = k) = p^k (1 - p)^(1 - k)`	Coin flip	Binary classification
Binomial	Discrete	0 to n	Number of trials n, Success probability p	`P(k) = C(n, k) * p^k * (1 - p)^(n - k)`	Number of heads in 10 coin flips	Binary classification, hypothesis testing
Multinomial	Discrete	0 to n for each category	Number of trials n, Probabilities p₁, ..., pₖ	`P(x₁, ..., xₖ) = (n! / (x₁!x₂!...xₖ!)) * ∏(pᵢ^xᵢ)`	Rolling a dice multiple times	Text classification, NLP
Categorical	Discrete	1 to k	Probabilities p₁, ..., pₖ	`P(X = i) = pᵢ`	Choosing a color from a set of options	Classification, clustering

Week 3

Zero-R

A simple baseline model that predicts the most frequent class in the training data.

One-R

Also known as Decision stom
Uses only one feature (“best” feature) to build a model

Desicion Trees

Choose an attribute to partition the data at the node such that each partition is as pure (homogeneous) as possible.
In each partition, most of the instances should belong to as few classes as possible
Each partition should be as large as possible.

ID3 (Iterative Dichotomiser 3)

A top-down approach that splits the data into smaller subsets based on the value of a chosen feature.

Entropy (measure of uncertainty. The expected (average) level of uncertainty (surprise))

For a Low probability event: if it happens, it’s big news! Big surprise! High information!
For a High probability event: it was likely to happen anyway. Not very surprising. Low information!
Higher H means more uncertain.
Low entropy means high certainty
High entropy means high uncertainty

Conditional Entropy measures the amount of uncertainty in X given Y.

Information Gain (measure of the reduction in entropy after splitting)

Information gain measures the reduction in entropy about the target variable achieved by partitioning the data based on a given feature.
Choose the largest as information gain to split the tree.

Shortcomings of IG

Overfitting: Greedy algorithm may choose a feature that is too specific and does not generalize well to unseen data.
C4.5: use gain ratio to choose the best feature to split the tree.
- Gain ratio (GR) reduces the bias for information gain towards highly branching attributes by normalising relative to the split information
- Split info (SI) is the entropy of a given split (evenness of the distribution of instances to attribute values)
- GR = IG / SI
- Choose the largest as GR to split the tree.
- 如果下面的attribute的选项是pure的，就不选择这个选项作为继续的分支，然后在剩余的attribute里选择GR最大的attribute作为继续的分支。如果下面的attribute全是pure的，就到达了叶子节点。

Info

计算 信息增益（Information Gain, IG）、增益率（Gain Ratio, GR） 和 分裂信息（Split Information, SI），我们需要以训练集中的“PLAY”列为目标变量（label）来进行分类评估。

下面是逐步讲解和计算方法：

ID	Outlook	Temp	Humidity	Wind	PLAY
A	s	h	h	F	N
B	s	h	h	T	N
C	o	h	h	F	Y
D	r	m	h	F	Y
E	r	c	n	F	Y
F	r	c	n	T	N

✅ 2. 计算总熵 Entropy(S)

$$ Entropy(S) = -p_+\log_2(p_+) - p_-\log_2(p_-) $$

$$ p_+ = 3/6 = 0.5,\quad p_- = 0.5 $$

$$ Entropy(S) = -0.5\log_2(0.5) - 0.5\log_2(0.5) = 1 $$

✅ 3. 对每个属性计算信息增益（以 Outlook 为例）

Outlook 的取值有：`s`, `o`, `r`

s（A, B）：PLAY = N, N → Entropy = 0
o（C）：PLAY = Y → Entropy = 0
r（D, E, F）：PLAY = Y, Y, N →
$$ p_+ = 2/3, p_- = 1/3
\Rightarrow Entropy = -\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3} ≈ 0.918 $$

加权总熵：

$$ Entropy_{Outlook} = \frac{2}{6} \cdot 0 + \frac{1}{6} \cdot 0 + \frac{3}{6} \cdot 0.918 ≈ 0.459 $$

信息增益：

$$ IG(Outlook) = Entropy(S) - Entropy_{Outlook} = 1 - 0.459 = 0.541 $$

✅ 4. Split Information（SI）

$$ SI(Outlook) = -\sum_i \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|} $$

s: 2/6 → log₂(2/6)
o: 1/6 → log₂(1/6)
r: 3/6 → log₂(3/6)

$$ SI = -(\frac{2}{6} \log_2 \frac{2}{6} + \frac{1}{6} \log_2 \frac{1}{6} + \frac{3}{6} \log_2 \frac{3}{6})
\approx -(0.389 + 0.431 + 0.5) = 1.32 $$

✅ 5. Gain Ratio（GR）

$$ GR = \frac{IG}{SI} = \frac{0.541}{1.32} ≈ 0.41 $$

✅ 类似方法可以应用于其他特征：

按照属性值分组，计算每组的熵和加权熵
再算信息增益（IG）
再算分裂信息（SI）
最后算增益率（GR）

Naive Bayes Theory

Info

arg max: argument of maximum value

Supervied ML method

Example:

If any term P(xm|y ) = 0 then the class probability P(y|x ) = 0
To solve this: use Laplace smoothing.

First Solution: We can assign a (small) positive probability 𝜀 to every unseen class-feature combination
Second Solution: We can add a “pseudocount” α to each feature count observed during training, often is 1.

Probabilities are changed drastically when there are few instances; with a large number of instances, the changes are small
Laplace smoothing (and smoothing in general) reduces variance of the NB classifier because it reduces sensitivity to individual (non-)observations in the training data

Different Naive Bayes

Naïve Bayes classifiers have several key variants that differ based on how they model the distribution of features. Below is a comparison of the most common types:

Variant	Assumption on Feature Distribution	Use Case
Gaussian Naïve Bayes (GNB)	Assumes features follow a Gaussian (normal) distribution.	Suitable for continuous data, often used in text classification and real-world datasets with normally distributed features.
Multinomial Naïve Bayes (MNB)	Assumes feature counts follow a multinomial distribution.	Best for text classification (e.g., spam detection, document classification) where features are word counts or term frequencies.
Bernoulli Naïve Bayes (BNB)	Assumes binary feature presence (1 = present, 0 = absent).	Used in binary text classification (e.g., sentiment analysis, spam filtering), where features represent whether a word appears in a document.
Complement Naïve Bayes (CNB)	A modification of Multinomial Naïve Bayes, designed to handle class imbalances.	Works better for imbalanced datasets and improves accuracy by adjusting feature probabilities.
Categorical Naïve Bayes	Assumes features are discrete categorical variables.	Used for classification tasks with categorical inputs that are not necessarily text-based.

Each variant modifies the way probabilities are calculated based on the data's nature, making Naïve Bayes a flexible and effective algorithm for different types of classification tasks.

Conclusion of Naive Bayes

Why does it work given that it’s a blatantly wrong model of the data?

we don’t need the true distribution over P(y|x ), we just need to be able to identify the most likely outcome

Advantages of Naive Bayes

easy to build and estimate
easy to scale to many feature dimensions (e.g., words in the vocabulary) and data sizes
reasonably easy to explain why a specific class was predicted
good starting point for a classification project

Info

How NB Works

Calculate Prior Probabilities: Compute the probability of each class based on the training data.
Compute Likelihood: Estimate the probability of features given each class using the conditional probability formula.
Apply Bayes’ Theorem: Use Bayes' rule to compute the posterior probability for each class.
Classify: Assign the class with the highest posterior probability to the new instance.

Comparison	Naïve Bayes	Logistic Regression
Model Type	Generative Model	Discriminative Model
Probability Learned	( P(x, y) ) (Joint Probability)	( P(y
Assumptions	Assumes feature independence	No specific feature independence assumption
Computational Complexity	Low, fast computation	Higher, requires gradient descent
Use Cases	Text classification (e.g., spam detection)	Tasks requiring feature relationship modeling
Suitable for Large Datasets?	Yes, simple computation	Yes, but computationally more intensive

Understanding Odds and Log Odds in Logistic Regression

1. What are Odds?

Odds represent the ratio of the probability of an event occurring to the probability of it not occurring. Mathematically, odds are defined as:

[ \text{odds} = \frac{P}{1 - P} ]

where:

( P ) is the probability of the event occurring.
( 1 - P ) is the probability of the event not occurring.

Types of Odds

Odds against an event: When ( 0 < \text{odds} < 1 ), meaning the event is less likely to happen than not.
Odds in favor of an event: When ( \text{odds} > 1 ), meaning the event is more likely to happen than not.

Example:

If an event has a 60% chance of occurring (( P = 0.6 )), the odds are: [ \text{odds} = \frac{0.6}{1 - 0.6} = \frac{0.6}{0.4} = 1.5 ] This means the event is 1.5 times more likely to occur than not.

2. Why Use Log Odds (Logit Function)?

Since odds can range from 0 to infinity, they are not ideal for direct modeling in a regression setting. Instead, we take the logarithm of odds, known as the logit function:

[ \text{log odds} = \log \left(\frac{P}{1 - P}\right) ]

Advantages of Log Odds:

Transforms probability into an unrestricted range
- ( P ) is always between ( 0 ) and ( 1 ), but log odds can take any value from (-\infty) to (+\infty).
- This makes it easier to model using linear regression techniques.
Handles Non-Linearity in Probability
- The relationship between probability and log odds is non-linear, but when transformed to log odds, it becomes linear.

Example Calculation:

If the event has a probability of ( P = 0.8 ): [ \text{odds} = \frac{0.8}{1 - 0.8} = \frac{0.8}{0.2} = 4 ]
- Taking the natural logarithm: [ \log(4) \approx 1.386 ]
- Now, instead of dealing with probabilities, we can work with a linear scale.

3. Connection to Logistic Regression

In logistic regression, we model the probability of an event occurring using the equation:

[ \log \left(\frac{P}{1 - P}\right) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_n x_n ]

This means:

Instead of predicting probabilities directly, logistic regression predicts log odds, which follow a linear relationship with input features.
We then use the sigmoid function to convert log odds back into probabilities.

4. Conclusion

Odds measure how likely an event is compared to it not happening.
Log odds (logit function) transform probabilities into a linear form, making them easier to model.
Logistic regression leverages log odds to predict probabilities effectively.

Understanding this transformation is key to interpreting logistic regression results and making informed predictions.

Logit function: The logarithm of the odds.

Softmax function: The function that converts a vector of real numbers to a probability distribution.

Logistic Regression Summary

Pros

Probabilistic interpretation
No restrictive assumptions on features
Often outperforms Naive Bayes
Particularly suited to frequency-based features (so, popular in NLP)

Cons

Can only learn linear feature-data relationships
Some feature scaling issues
Often needs a lot of data to work well
Regularisation a nuisance, but important since overfitting can be a big problem

Week 5

Support Vector Machines (SVM)

Classifier

A linear classifier (through origin) with parameters divides the space into positive and negative halves

Hard Margin

The margin is the distance between the decision boundary and the closest data point.
The goal is to maximize the margin while keeping the data points as far away from the decision boundary as possible.
The decision boundary is the line that separates the positive and negative data points.
It is called hard margin because the data points must be correctly classified on both sides of the decision boundary.

Soft Margin

The margin is softened by introducing a penalty term that depends on the distance of the data points from the decision boundary.
The goal is to minimize the margin while keeping the data points as far away from the decision boundary as possible.
The decision boundary is the line that separates the positive and negative data points.
It is called soft margin because the data points can be misclassified on both sides of the decision boundary.

Info

Kernel Trick

The kernel trick is a way to transform non-linear data into a higher-dimensional space where it becomes linearly separable.
The kernel function is a similarity function that measures the similarity between two data points in the original space.
The kernel trick allows us to use non-linear models in a linear model.

Applications of SVM

Multiclass problems

Introduce parameter vectors: If there are ( k ) categories, we introduce a parameter vector ( \theta_1, \theta_2, \ldots, \theta_k ) for each category.
Jointly learn parameters: Jointly learn these parameters by ensuring that the discriminant function associated with the correct category has the highest value. The goal is to minimize the following expression: [ \frac{1}{2} \sum_{y=1}^{k} | \theta_y |^2 ] The constraint is that for all ( y' \neq y_i ) and ( i = 1, \ldots, n ), it satisfies: [ (\theta_{y_i} \cdot x_i) \geq (\theta_{y'} \cdot x_i) + 1 ]
Predict new examples: Predict the label of a new example based on the following formula: [ \hat{y} = \arg\max_{y=1,\ldots,k} (\theta^*_y \cdot x) ]

Ordinal regression problems

Target variable: The target variable in ordinal regression is an ordinal variable, where the categories have a certain order, such as education level (elementary, middle school, high school, university, graduate) or movie ratings (1 star to 5 stars).
Model purpose: Unlike multiclass classification problems, ordinal regression models not only predict the correct category but also the order relationship between categories.

Translate each rating into a set of binary labels

We can convert these labels into binary labels. For example, we can divide the ratings into two categories: "like" and "dislike". Here is a simple conversion example:
1 star and 2 stars: Dislike (represented by 0)
3 stars: May require additional processing as it can be considered neutral or context-dependent
4 stars and 5 stars: Like (represented by 1)
Alternatively, we can create a binary label for each rating level as follows:
1 star: [1, 0, 0, 0, 0]
2 stars: [0, 1, 0, 0, 0]
3 stars: [0, 0, 1, 0, 0]
4 stars: [0, 0, 0, 1, 0]
5 stars: [0, 0, 0, 0, 1]

Ranking

In machine learning, ranking tasks typically involve ordering a set of items such that certain items have higher priority over others. SVM can address ranking problems by modifying the standard binary classification SVM, a method commonly known as RankSVM. RankSVM learns a ranking function through pairwise comparisons, ensuring that positive examples rank higher than negative ones.

Structured prediction

Structured prediction is a branch of machine learning that involves predicting structured outputs, such as sequences, trees, or graphs. SVM can be extended to handle these problems through Structured SVM. Structured SVM learns a prediction function by maximizing the margin, which can output entire structures rather than simple class labels. This approach is particularly useful in fields like natural language processing, bioinformatics, and computer vision.

The core idea of Support Vector Machines (SVM) is to find an optimal hyperplane that best separates data points of different classes while maximizing the margin between the classes. This margin is the distance from the hyperplane to the nearest training data point. The goal of SVM is to find a hyperplane with the largest margin, which can improve the model's generalization ability and reduce the risk of overfitting.

Week 6

Info

方法	描述	优点	缺点
Holdout	简单划分为训练+测试	快速、实现简单	结果可能不稳定
Stratification	分层抽样，保持类别分布	保证各类别比例均衡	通常与其他方法结合使用
Cross-validation	多次划分训练+验证，轮流训练	结果稳健、泛化更好	计算成本较高（训练多次模型）

Scenario	Holdout	Cross-Validation
Large dataset (millions of records)	✅ Works well	❌ Too slow
Small dataset (few thousand samples)	❌ Might not be reliable	✅ More accurate
Need fast evaluation	✅ Quick & simple	❌ Computationally heavy
Hyperparameter tuning	❌ Risky (prone to overfitting)	✅ More stable
Deep learning models (large compute cost)	✅ Saves time	❌ Too expensive

Evaluation

if our model will perform effectively on unseen data

Holdout

Split the data randomly into two parts: a training set and a test set.
Train the model on the training set and evaluate it on the test set.

Holdout Weakness

The size of the split affects the estimate of the model’s behaviour: Lots of test instances, and few training instances → , the learner doesn’t have enough information to build an accurate model. Lots of training instances, and few test instances → learner builds an accurate model, but test data might not be representative (so estimates of performance can be too high/too low)
Bias in Sampling: Random sampling of data can lead to different distribution in train and test datasets

Stratification

Stratification ensures that each fold or partition of the data maintains the same class distribution as the original dataset

Stratification Weakness

Complexity: Can be more complex to implement compared to simple random sampling.
Inefficient Resource Utilization: some data is only used for training and some only for testing

k-fold Cross-validation

Divide the data into k equal parts, and use k-1 parts for training and 1 part for testing.
Repeat k times, each time using a different part for testing and the remaining parts for training.
The average performance of the k models is the estimate of the performance of the model on the entire dataset.

k-fold Cross-validation Weakness

Fewer folds: more instances per partition, more variance in performance estimates
More folds: fewer instances per partition, less variance but slower

Hyperparameter Tuning

If the Decision Tree is too short (Depth is too small), The DT would be too simple (Something like 1-R)
If the Decision Tree is too long (Depth is too big), The DT would be too complex and cannot generalize well
To decide about the best depth for a tree (and other hyperparameters), we need a way to measure how well our tree can - edict labels for new, unseen data.
But… If we use our Test data for hyperparameter tuning, then we don’t have any other (unseen) data to test the performance of the final model (after the hyperparameter tuning) → It will cause data leakage
We need a third set → Validation data, or use cross-validation to split the data into training, validation, and test sets.

Classification Evaluation Metrics

We can present all possible classification results a two-class problem in the following confusion matrix.
Confusion matrix is for classification problems.
True Positive (TP): The model correctly predicts the positive class.
False Positive (FP): The model incorrectly predicts the positive class.
True Negative (TN): The model correctly predicts the negative class.
False Negative (FN): The model incorrectly predicts the negative class.

Recall

Recall: From the cases that are actually positive, what percentage has been correctly identified (predicted positive) by our classifier. Recall = TP / (TP + FN)

Precision

From the cases that are predicted positive, what percentage are actually positive. Precision = TP / (TP + FP)

F1-score

A popular metric that can help with finding a balance between Precision & Recall is F1-Score. F1-Score is the harmonic mean of Precision and Recall. F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

Averaging methods

In multi-class problems, Macro Averaging, Micro Averaging, and Weighted Averaging are three methods for calculating the average precision or recall. These methods are particularly useful in evaluating the performance of classification models, especially in datasets with imbalanced classes. Here is a brief description of these three methods:

1. Macro Averaging

Macro Averaging first calculates the metrics (such as precision, recall, or F1 score) for each class, and then calculates the simple arithmetic average of these metrics. It treats each class as equally important, without considering the actual frequency or size of the classes. The calculation formula is as follows: $$ \text{Macro Average} = \frac{1}{N} \sum_{i=1}^{N} \text{metric}_i $$ Where, $N$ is the total number of classes, and $\text{metric}_i$ is the value of the metric for the $i$-th class. Advantages: Each class is treated equally, so this may be a better metric for imbalanced datasets. Disadvantages: If some classes are very few, they may disproportionately affect the overall performance evaluation.

2. Micro Averaging

Micro Averaging first calculates the total precision, total recall, and total F1 score for all classes, and then calculates the arithmetic average of these totals. It calculates the metrics by considering individual predictions in each class, so it takes into account the actual frequency of the classes. The calculation formula is as follows: $$ \text{Micro Average Precision} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} TP_i + \sum_{i=1}^{N} FP_i} $$ $$ \text{Micro Average Recall} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} TP_i + \sum_{i=1}^{N} FN_i} $$ Where, $TP_i$, $FP_i$, and $FN_i$ are the numbers of true positives, false positives, and false negatives, respectively, for the $i$-th class. Advantages: It takes into account the true distribution of each class, so it is usually a more reliable metric for imbalanced datasets. Disadvantages: It may overlook the imbalance between classes, as the contributions of all classes are averaged.

3. Weighted Averaging

Weighted Averaging is a variant of Macro Averaging, which takes into account the support of each class (i.e., the number of instances in each class). The value of the metric for each class is multiplied by the support of that class, and then the average of these weighted values is calculated. The calculation formula is as follows: $$ \text{Weighted Average} = \frac{\sum_{i=1}^{N} (\text{metric}_i \times \text{support}i)}{\sum^{N} \text{support}_i} $$ Where, $\text{metric}_i$ is the value of the metric for the $i$-th class, and $\text{support}_i$ is the support of the $i$-th class. Advantages: It takes into account the relative size of each class, so it is a fairer metric for imbalanced datasets. Disadvantages: Like Macro Averaging, it may give disproportionate weight to minority classes, which may distort the overall performance evaluation. The choice of which averaging method to use depends on the specific application scenario and the characteristics of the dataset. In the case of class imbalance, Weighted Averaging is often considered the best choice, as it simultaneously considers the performance and relative importance of each class.

Info

If we want to prioritize small classes and ensure each class is given equal importance, macro averaging is the better choice.
If we care more about overall system performance, where larger classes influence the results more (common in real-world applications like fraud detection), micro averaging is preferable.

Regression Performance Metrics

MSE, RMSE, and MAE (Mean Squared Error, Root Mean Squared Error, and Mean Absolute Error)

Sum of Squared Errors (SSE) is the sum of the squared differences between the predicted and actual values.
MSE emphasizes larger errors due to squaring and is sensitive to outliers.
RMSE is the square root of MSE, providing a more interpretable metric in the same units as the target variable.
MAE treats all errors equally, is less sensitive to outliers, and provides a straightforward average error measure.

Evidence of overfitting: large gap between training and test performance
Evidence of underfitting: High error rate for both test and training set

训练误差高，测试误差高 → 欠拟合
训练误差低，测试误差高 → 过拟合
训练误差高，测试误差低 → 非常罕见，通常意味着数据泄露或其他问题

Learning Curves

A learning curve is a plot of learning performance over experience or time
y-axis: Performance measured by accuracy, error rate or other metrics
x-axis: conditions, e.g., sizes of training sets, model complexity, number of iterations…
The learning curve can be used to identify overfitting, underfitting, or a suitable trade-off between the two.
Training learning curve: calculated from the training set that shows how well the model is learning.
Validation learning curve: calculated from a holdout set that shows how well the model is generalising.

data trade-off

More training instances? → (usually) better model
More evaluation instances? → more reliable estimate of effectivenes

Bias and Variance

Model bias: the tendency of our model to make systematically wrong predictions
Evaluation bias: the tendency of our evaluation strategy to over- or under-estimate the effectiveness of our model
Sampling bias: if our training or evaluation dataset isn’t representative of the population.
Model variance: Sensitivity of a machine learning model's predictions to small changes in the training data, leading to different outcomes when the model is trained on different subsets of the data.
Evaluation variance: Variability in the performance metrics of a model (such as accuracy, precision, or recall) when evaluated across different test datasets or under different evaluation conditions.

Week 7: Feature Selection

Univariate Methods

Concept: Evaluate each feature individually in relation to the target variable.
Advantages:
- Intuitive approach to assess the "goodness" of each attribute.
- Model-independent.
- Linear time complexity with respect to the number of attributes.
Good Features: Features that are "correlated" with the label and can predict it effectively.
Methods:
- Signal-to-Noise Ratio (SNR): Select features with high SNR.
- Mutual Information (MI): Measures the dependence between the target variable and each feature. Select features with high MI.
- Chi-Squared Test: Determines if there is a significant association between categorical variables (for classification tasks).
- ANOVA F-value: Evaluates the significance of individual features in a dataset (for classification tasks with continuous features).
- Correlation-based Feature Selection (CFS): Measures the correlation between each feature and the target variable. Selects features with high correlation.

Multivariate Methods

Concept: Consider subsets of features together to capture their combined predictive power.
Limitations of Univariate Methods: Sometimes, single features may not provide good classification, but a combination of features can lead to better decision boundaries.
Complexity: Choosing the optimal subset of attributes that gives the best performance on the validation data.
Feature Subset Assessment:
- Split data into training, validation, and test sets.
- Train a classifier on each feature subset using the training data.
- Select the feature subset that performs best on the validation data.
- Test the selected subset on the test data.
Algorithms:

Filter Methods

Concept: Rank features (or feature subsets) independently of the classifier based on a predefined criterion.
Criterion: Measure the "relevance" of each feature (or feature subset) with respect to the target variable.
Search Strategy: Typically involves sorting features based on individual rankings or nested subsets.
Assessment: No direct assessment, but cross-validation can be used to determine the optimal number of features to select.

Wrapper Methods

Concept: Use a classifier to evaluate the "usefulness" of different feature subsets.
Criterion: Measure the predictive power of each feature subset.
Search Strategy: Search through the space of all possible feature subsets. Train a new classifier for each candidate subset.
Assessment: Use cross-validation to evaluate the performance of each subset.

Embedded Methods

Concept: Similar to wrapper methods, but utilize the classifier's internal knowledge to guide the search for the best feature subset, avoiding the need to train a new classifier for every candidate subset.
Criterion: Measure the "usefulness" of each feature subset.
Search Strategy: Guided search based on the classifier's knowledge.
Assessment: Use cross-validation.

Forward Selection (Wrapper Method)

Running Time: Increases with the number of attributes.
Advantages: Performs well when the optimal subset is small.
Disadvantages: May converge to a suboptimal solution and is not feasible for large datasets.

Forward Selection (Embedded Method)

Running Time: Faster than the wrapper method.
Advantages: Uses the classifier's knowledge to evaluate candidate features.

Backward Elimination (Wrapper Method)

Running Time: Increases with the number of attributes.
Advantages: Removes the most irrelevant attributes at the beginning.
Disadvantages: May converge to a suboptimal solution and is not feasible for large datasets.

Backward Elimination (Embedded Method)

Running Time: Faster than the wrapper method.
Advantages: Uses the classifier's knowledge to evaluate candidate features.

L0-Norm Regularization

Concept: Directly minimize the number of features used by the algorithm by promoting zero entries in the weight vector.
Objective Function: Minimize the number of non-zero entries in the weight vector.
Challenges: The objective function is non-continuous and non-convex, making the optimization problem combinatorially hard.

L1-Norm Regularization

Concept: Use a convex function to promote sparsity in the weight vector, resulting in fewer non-zero entries.
Objective Function: Minimize the L1 norm of the weight vector (taxicab norm or Manhattan norm).
Advantages: The optimization problem is convex, making it easier to solve.
Result: The solution typically has fewer non-zero entries, effectively performing feature selection.

主成分分析 (PCA)

目标：找到一组正交的基向量，能够最大限度地解释数据方差。
步骤：
1. 数据中心化：将数据集中的每个特征减去其均值，消除均值的影响。
2. 计算协方差矩阵：计算中心化数据矩阵的协方差矩阵。
3. 特征值分解/奇异值分解：计算协方差矩阵的特征值和特征向量，或者对中心化数据矩阵进行 SVD。
4. 选择主成分：选择前 m 个特征值最大的特征向量作为主成分。
5. 数据降维：将原始数据投影到主成分构成的子空间中，实现降维。

PCA 的优势：

简化数据：减少变量数量，降低模型复杂度。
揭示潜在结构：找到数据背后的潜在维度，帮助理解数据本质。
提高效率：降低计算成本，提高模型训练和预测速度。

PCA 的局限性：

无法利用类别标签信息：PCA 无法区分不同类别。
最大化方差：PCA 最大化方差，但不一定能够区分不同类别。

PCA 的应用：

图像压缩：将图像分割成小块，并将每个小块投影到低维子空间中，实现压缩。
面部表情识别：将面部图像投影到低维子空间中，提取关键特征，用于表情识别。 PCA 的评估：
避免破坏交叉验证：不要在整个数据集上进行降维，否则会破坏交叉验证。
在交叉验证的训练集上进行降维：在交叉验证的每个训练集上进行降维，并在验证集上进行评估。

Week 8

朴素贝叶斯和逻辑回归都是基于概率模型的分类算法，但朴素贝叶斯假设特征之间相互独立，而逻辑回归则没有这个限制。
感知器不使用概率模型，而是直接优化分类错误率，更加简单直观。

ANN

Artificial neural network is a network of processing elements
Each element converts inputs to output
The output is a function (called activation function) of a weighted sum of inputs
Training an ANN means adjusting weights for training data given a pre-defined network topology

Perceptron

The Perceptron is a minimal neural network
neural networks are composed of neurons
A neuron is defined as y = f(θ0 + θ1 x1 +...+ θd xd)

Perceptron Algorithm

supervised classification algorithm
感知器通过调整权重来优化性能，并通过比较预测输出与真实输出之间的差异来更新权重。如果预测正确，则不做任何改变；如果预测错误，则根据实际情况增加或减少权重。
一个epoch就是完成一次完整的训练过程
if mistakes: 𝑦(𝜃 ⋅ 𝒙) ≤ 0, we need to update
感知器算法通过迭代地更新权重来最小化误分类的数量。每次迭代中，如果发现某个样本被误分类，就根据学习率和样本的特征更新权重。这个过程重复进行直到达到预定的epoch数。
can set learning rate 𝜂 as 1 for simplicity

Online learning vs. Batch learning

The perceptron algorithm is an online algorithm: update weights after each training example
In contrast, Naive Bayes and logistic regression (with Gradient Descent) are batch algorithms

Multi-layer perceptron

Input layer with input units x: the first layer, takes features x as inputs
Output layer with output units y: the last layer, has one unit per possible output (e.g., 1 unit for binary classification)
Hidden layers with hidden units h: all layers in between.

Multi-layer Perceptron vs. other Neural Networks

Multi-layer Perceptron

One specific type of neural network
Feed-forward
Fully connected
Supervised learner

Other types of neural networks

Convolutional neural networks
Recurrent neural networks
Autoencoder (unsupervised)

Linear vs. Non-linear classification

Linear classification

The perceptron, naive bayes, logistic regression are linear classifiers
Decision boundary is a linear combination of features σ𝑖 𝜃𝑖𝑋𝑖
Cannot learn‘feature interactions’ naturally

Non-linear classification

Neural networks with at least 1 hidden layer and non-linear activations
Decision boundary is a non-linear function of the inputs

Feature engineering vs feature learning

项目	Feature Engineering（特征工程）	Feature Learning（特征学习）
使用方式	人工设计并提取特征作为模型输入	模型从原始数据中自动学习特征
典型模型	感知机、朴素贝叶斯、逻辑回归、决策树等	神经网络（MLP、CNN、RNN、Transformer等）
数据输入	需要是人类理解的、高质量的“特征”（如 sunny, overcast 等）	原始的数值数据，如图像像素、文本词向量、音频信号
是否需要领域知识	✅ 需要，特征提取要依赖专家经验	❌ 不需要，模型自动学出与任务相关的特征
控制权/可解释性	高（你知道模型在用什么特征）	低（中间特征难解释，但通常更有效）
对调参的依赖	相对较少（因为特征是固定的）	高（要调网络结构、激活函数、学习率等超参数）

Activation Functions

Logistic or Sigmoid Function: The output range of this function is between 0 and 1, commonly used in the output layer for binary classification problems. But will cause gradient vanishing problem, not commonly used.
Hyperbolic Tangent Function (tanh): The output range of this function is between -1 and 1, more commonly used than the sigmoid function because its output mean is 0, which helps alleviate the gradient vanishing problem.
Rectified Linear Unit (ReLU): Most commonly used activation function This function directly returns the input value when the input is greater than 0, otherwise returns 0. It is one of the most popular activation functions currently, due to its simplicity, efficiency, and ability to accelerate the training process of neural networks.

Note that the activation functions must be non-linear, as without this, the model is simply a (complex) linear model

Info

感知机学习规则

二分类问题：sigmoid
多分类问题：softmax
多标签问题：sigmoid
线性回归: 线性函数

Formula

🧮 公式 1：线性组合（Perceptron 中的加权求和）

$$ \Sigma = \theta_0 + \theta_1 \cdot x_1 + \theta_2 \cdot x_2 $$

✅ 含义：

$\Sigma$：感知机的中间结果（还没经过激活函数）
$\theta_0$：偏置项（bias）
$\theta_1, \theta_2$：输入特征的权重
$x_1, x_2$：输入特征值

这是感知机对一个输入样本做出的“线性打分”。

🔁 公式 2：感知机权重更新规则（Perceptron Learning Rule）

$$ \theta_j^{(t)} \leftarrow \theta_j^{(t-1)} + \eta \cdot (y^{(i)} - \hat{y}^{(i,t)}) \cdot x_j^{(i)} $$

✅ 含义：

$\theta_j^{(t)}$：第 $t$ 次更新后第 $j$ 个权重
$\theta_j^{(t-1)}$：上一次的权重值
$\eta$：学习率（learning rate）
$y^{(i)}$：第 $i$ 个训练样本的真实标签
$\hat{y}^{(i,t)}$：当前模型对样本 $i$ 的预测（经过激活函数）
$x_j^{(i)}$：样本 $i$ 的第 $j$ 个特征

只在预测错误（即 $y \ne \hat{y}$）时才更新。

这两个公式一起构成了感知机的核心工作机制：

先用第一个公式计算预测，再根据第二个公式更新权重。

When is Linear Classification Enough?

If we know our classes are linearly (approximately) separable
If the feature space is (very) high-dimensional ...i.e., the number of features exceeds the number of training instances (kernel trick)
If the traning set is small
If interpretability is important, i.e., understanding how (combinations of) features explain different predictions
…but with increasing availability of data, and more powerful computers, non-linear models are gaining popularity, with currently Neural Networks being their most popular variant.

Pros and Cons of Neural Networks

Pros

Powerful tool!
Neural networks with at least 1 hidden layer can approximate any (continuous) function. They are universal approximators
Automatic feature learning
Empirically, very good performance for many diverse tasks

Cons

Powerful model increases the danger of ‘overfitting’
Requires large training data sets
Often requires powerful compute resources (GPUs)
Lack of interpretability

Week 9

Characteristics	Activation Function	Step Function
Definition	Introduces non-linear factors, determines whether a node is activated	Takes different constant values in different intervals
Mathematical Properties	Usually continuous and differentiable	Discontinuous, with jumps only at specific points
Applications	Introduces non-linearity in neural networks, increasing model complexity	Signal processing, control theory, early perceptron models
Common Types	Sigmoid, Tanh, ReLU, Leaky ReLU, Softmax	Heaviside step function
Advantages	Improves model expressiveness, suitable for complex pattern recognition	Simple, easy to understand and implement
Disadvantages	May cause gradient disappearance or explosion, requires selection of suitable functions	Discontinuous, not suitable for gradient-based optimization algorithms
Use in Neural Networks	Widely used, a core component of modern neural networks	Rarely used, mainly in early models
Summary: Activation functions and step functions have different applications and properties in mathematics and engineering. Activation functions are mainly used in neural networks to introduce non-linearity and improve model expressiveness, while step functions are used in signal processing, control theory, and other fields. In neural networks, step functions are not suitable as activation functions due to their discontinuity.

Why cannot we use the step functon?

Derivative is not defined at 0. When it’s not at 0, the derivative is 0. = No way to update weights and learn anything!

A problem encountered when training a multilayer perceptron (MLP) and the method to solve it.

The update rule depends on the actual target output y: When training a neural network, we typically use an update rule to adjust the network's weights, which depends on the actual target output. However, in a multilayer network, we can only directly obtain the actual output of the final layer, but not the actual output of the hidden layers.
We can only access the actual output of the final layer: In a multilayer network, we can only directly observe the output of the final layer, while the output of the hidden layers is not directly observable.
We do not know the actual activation of the hidden layers: Since we cannot directly observe the actual output of the hidden layers, we do not know the actual activation state of the hidden layers. To solve these problems, backpropagation (Backpropagation) provides an effective method. Backpropagation is an algorithm for training multilayer neural networks, which updates the weights in the network by calculating the partial derivatives of the error. This method allows us to effectively calculate the error partial derivatives for each individual weight, thereby updating the weights of the hidden layers, not just the outermost layer. This way, we can still effectively train the entire network without knowing the actual activation of the hidden layers.

Comparison of Perceptron and Backpropagation

感知机更新	Backprop
只有一层（输出层）	多层（含隐藏层）
权重更新基于误差 $y - \hat{y}$	基于损失函数对每个权重的导数
没有链式法则	使用链式法则（链式求导）
不需要激活函数的导数	必须使用激活函数的导数（如 sigmoid'）
简单快速，但只能处理线性可分	更复杂，但可以拟合非线性函数

Backpropagation

训练多层感知机（MLP） 的标准流程，用于理解 反向传播（Backpropagation）+ 梯度下降（Gradient Descent） 的全过程。

Step 1: Forward propagate an input x

输入 $\mathbf{x}$（训练样本）进入神经网络：

一层一层地执行线性变换和激活函数
对每一层 $l$：
$$ \mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \ \mathbf{a}^{(l)} = f(\mathbf{z}^{(l)}) $$
其中 $f$ 是激活函数（如 ReLU、Sigmoid）

✅ 最终得到输出 $\hat{y}$

Step 2: Compute the predicted output $\hat{y}$

这是前向传播的最后结果：

$$ \hat{y} = \text{MLP}(\mathbf{x}) $$

可能是：

标量（用于二分类）
向量（用于多分类，比如经过 softmax）

Step 3: Compare $\hat{y}$ with true output $y$, compute error

定义一个 损失函数（Loss Function）：

常用的损失函数：

均方误差（MSE）：$\frac{1}{2} (y - \hat{y})^2$
交叉熵（Cross-Entropy）用于分类

记为：

$$ \mathcal{L}(\hat{y}, y) $$

我们现在知道当前预测有多“错”。

Step 4: Modify each weight to reduce the error (Gradient Descent)

这一步就是 反向传播的核心：

通过 链式法则 计算损失函数对每一层权重的梯度：

$$ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} $$

使用 梯度下降 更新每一层权重：

$$ \mathbf{W}^{(l)} \leftarrow \mathbf{W}^{(l)} - \eta \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} $$

$$ \mathbf{b}^{(l)} \leftarrow \mathbf{b}^{(l)} - \eta \cdot \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} $$

其中 $\eta$ 是学习率。

🔁 每一层的梯度依赖于后一层的梯度：这就是反向传播（误差从输出层传回隐藏层）

Step 5: Repeat for all data and multiple epochs

对训练集中的每个样本重复上面过程（可能使用 mini-batch）
进行多个轮（epochs），直到损失不再明显下降

✨ 总结流程图

Input x
  ↓
[Forward Propagation]
  ↓
Predicted Output ŷ
  ↓
Compare with True Output y
  ↓
[Compute Loss]
  ↓
[Backpropagation: compute gradients]
  ↓
Update weights with Gradient Descent
  ↓
Repeat

Generalized Delta Rule

The Generalized Delta Rule 是 反向传播算法（Backpropagation） 的数学核心，用于训练多层感知机（MLP）。它是一种将感知机的简单“误差更新规则”推广到多层神经网络中的方法。

🧠 一句话解释：

Generalized Delta Rule 就是用链式法则计算隐藏层中每个神经元的误差信号，并据此更新权重。

🧮 回顾感知机的基本更新：

在单层感知机中，如果输出是：

$$ \hat{y} = f(\theta^T x) $$

误差是：

$$ \delta = y - \hat{y} $$

然后我们更新权重：

$$ \theta \leftarrow \theta + \eta \cdot \delta \cdot x $$

但对于多层神经网络（有隐藏层），我们不能直接用 $y - \hat{y}$ 来更新隐藏层的权重。

🚀 Generalized Delta Rule 的核心思想：

对于任意神经元 $j$ 的权重 $w_{ji}$，更新公式为：

$$ w_{ji} \leftarrow w_{ji} + \eta \cdot \delta_j \cdot x_i $$

其中：

$\eta$：学习率
$\delta_j$：第 $j$ 个神经元的“误差信号”
$x_i$：输入（来自上一层）

❗关键点在于怎么定义 $\delta_j$（误差信号）

👉 对于输出层的神经元：

$$ \delta_j = (y_j - \hat{y}_j) \cdot f'(z_j) $$

$f'(z_j)$：激活函数的导数
$z_j$：未激活前的线性输入

👉 对于隐藏层的神经元：

$$ \delta_j = f'(z_j) \cdot \sum_k \delta_k \cdot w_{kj} $$

$\delta_k$：来自下一层（靠近输出层）的误差信号
$w_{kj}$：从本层神经元 $j$ 到下一层神经元 $k$ 的权重

🧠 总结

项目	感知机	Generalized Delta Rule (Backprop)
适用网络	单层	多层（含隐藏层）
误差传播	直接用 $y - \hat{y}$	使用链式法则逐层传播
激活函数导数	不需要	必须计算 $f'(z)$
支持非线性	❌	✅
支持隐藏层训练	❌	✅

Discriminative and Generative

"Discriminative" and "Generative" are two different model approaches in machine learning used for classification or prediction tasks, but their modeling ideas and goals differ. Here are their main differences:

🔍 One-sentence Difference:

Discriminative Models: Directly learn the mapping from input to output (i.e., learn $P(y \mid x)$).
Generative Models: Learn the data generation process (i.e., learn $P(x, y)$, which can derive $P(x \mid y)$ and $P(y)$).

📚 More Specifically:

Attribute	Discriminative Models	Generative Models
What is learned?	Learn $P(y \mid x)$: the probability of class y given input x	Learn $P(x \mid y)$ and $P(y)$: how input x is generated by class y
Examples	Logistic Regression, SVM, Random Forest, BERT	Naive Bayes, HMM, GMM, GAN, VAE
Capabilities	Classification, regression	Classification, sample generation (images, text), missing value imputation, data simulation
Model input distribution?	❌ Does not model the distribution of $x$	✅ Models the generation mechanism of $x$
Can generate data?	❌ No	✅ Can generate new samples (images, text, etc.)
Typical performance	More accurate for classification (because it directly optimizes the classification task)	More robust with less data, but may have slightly weaker classification performance

🎓 Classic Examples:

Naive Bayes (Generative)

Assume you know the word frequency for each class (e.g., "spam" or "normal email").
You can first generate a spam email and then see if a new email resembles spam.

Logistic Regression (Discriminative)

Only learns to directly determine from the features of an email whether it is spam or normal.

✅ When to Use Which?

Use Discriminative:
- Classification is the sole task (e.g., sentiment classification, image recognition)
- Sufficient data is available, aiming for the best prediction performance
Use Generative:
- Want to generate data (images, speech, text, etc.)
- There is missing data, few-shot learning, or a need for stronger expressive capability

Week 10

Unsupervised Learning

Finding the correct label for an instance can be hard and expensive (the labelling usually should be done manually)
Labelled data is limited
On the other hand, we have access to many more datasets that are not labelled

Types of unsupervised learning

PCA: simplify complex datasets with many features by extracting the most essential information
Clustering: Finding groups of items that are similar
Anomaly detection: detecting rare occurrences that seem suspicious because they're different from the established pattern
Associate Rule Mining: Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Clustering

Dimension	Type	Description	Example Algorithm
Membership	Exclusive	Each point belongs to only one cluster	K-Means
	Overlapping	A point can belong to multiple clusters	FCM, GMM
Assignment Method	Deterministic	Clear membership	K-Means
	Probabilistic	Probabilistic membership	GMM, FCM
Structure	Hierarchical	Tree structure	Hierarchical Clustering
	Partitioning	Flat partitioning	K-Means
Coverage	Partial	Not all points need to be covered	Clustering with outlier handling
	Complete	All points are classified	Most traditional clustering
Intra-cluster Characteristics	Homogeneous	Clusters are as similar as possible	Most clustering algorithms
	Heterogeneous	Allows significant differences within clusters	Used for specific needs

K-means

K-means clustering process (exclusive, complete and partitioning clustering method)

Ask the user how many clusters they would like (K)
Randomly pick K points (as seeds or initial cluster centroids)
Repeat
- Assign each instance to the cluster with the nearest centroid
- Recompute the centroid of each cluster
Until the centroids don’t change (or until changes are smaller than a threshold, or until relatively few points change clusters)

K-means limitaion

“mean” is ill-defined for categorical attributes
Sensitive to outliers
Sensitive to Initial Centroids
Not able to handle non-spherical clusters
Not able to handle clusters of differing sizes
Not able to handle clusters with different densities
Requires Predefined Number of Clusters (K): Determining the optimal number of clusters can be a challenging task

K-means strengths

Simplicity: straightforward and easy-to-understand algorithm.
Scalability: computationally efficient and can converge relatively quickly. Suitable for clustering tasks involving a large number of data points.
Interpretable Results: Clusters can be easily interpreted and visualised.
Applicability: K-means is applicable to a wide range of clustering tasks (often good enough).

Elbow Method (to find optimal K)

A “good” cluster should have High cluster cohesion (Intra-cluster distances are minimized) and High Cluster Separation (Inter-cluster distances are maximized)
To measure Cohesion: Within-cluster Sum of Squares (WCSS): the sum of the squared distance between each point and the centroid in a cluster (Cluster Cohesion)
Look for a point where the reduction in WCSS significantly slows down: that is the Elbow point

Hierarchical Clustering

Agglomerative (Bottom-up) clustering
- Start with the points (instances) as individual clusters
- At each step, merge the closest pair of clusters until only one cluster (or k clusters) left

Single Linkage: Distance between the closest members of the cluster
- Can handle non-similar shapes
- Sensitive to noise and outliers
Complete Linkage: Distance between the furthest members of the cluster
- Less sensitive to noise and outliers
- Tends to create spherical clusters with a consistent diameter.
- Tends to break large clusters
Centroid Linkage: Distance between the two centroids

Strength

No assumption of any particular number of clusters
- Any desired number of clusters can be obtained by 'cutting' the dendrogram at the appropriate level
No assumption of cluster shape
- not assume specific shapes or distributions for the clusters.
- can handle clusters of various shapes, sizes, and densities
They may correspond to meaningful taxonomies
- Examples in biological sciences (e.g., animal kingdom, phylogeny reconstruction, …)

Weakness

Once a decision is made to combine two clusters, it cannot be undone, i.e., an object that is in the wrong cluster will always stay there.
No objective function is directly minimised, i.e., no 'relative' quality measure

Clustering Metrics

WCSS and BCSS

The two main concepts used to evaluate the goodness of a clustering structure without relying on external information (i.e., unsupervised evaluation) are:
- Cohesion (Intra-cluster Similarity) – Measures how closely related the points within the same cluster are. A good clustering structure should have high cohesion, meaning data points within a cluster should be densely packed and similar to one another. This is typically evaluated using metrics like within-cluster sum of squares (WCSS).
- Separation (Inter-cluster Dissimilarity) – Measures how well-separated different clusters are from each other. A strong clustering structure should have high separation, meaning clusters are distinct and far apart from one another. This is often evaluated using metrics like between-cluster sum of squares (BCSS).
These two principles ensure that clusters are both internally coherent and externally distinct, leading to a well-formed clustering solution.

Calinski-Harabasz Index

Homogeneity

Whether each cluster contains only data points that are all from the same class. Homogeneity checks if all the items in a cluster are the same type.

Completeness

Whether all data points of a particular class are assigned to the same cluster.

Why Do We Need Both Metrics?

Homogeneity alone can be misleading. It only checks whether all instances within a cluster share the same label but does not ensure that instances of the same class are grouped together.
Thus, a good clustering solution should aim for both high homogeneity and high completeness to ensure meaningful and accurate grouping of data points.

数学公式 & 如何计算

这两个指标都基于 条件熵（conditional entropy） 的概念，并且和 V-measure 有关联。

我们用以下变量表示：

$ C $：真实的类别（class）
$ K $：聚类的结果（cluster）

1. Homogeneity (h)

$$ h = \begin{cases} 1 - \frac{H(C|K)}{H(C)} & \text{if } H(C) \neq 0 \ 1 & \text{otherwise} \end{cases} $$

其中：

$ H(C|K) $：给定聚类结果的类别条件熵（表示每个簇内部的类别分布混乱程度）
$ H(C) $：类别的总熵（原始类别分布的不确定性）

💡 含义：如果每个簇里的类别都很单一（$ H(C|K) \approx 0 $），则 h ≈ 1，说明同质性很好。

2. Completeness (c)

$$ c = \begin{cases} 1 - \frac{H(K|C)}{H(K)} & \text{if } H(K) \neq 0 \ 1 & \text{otherwise} \end{cases} $$

其中：

$ H(K|C) $：给定真实类别的聚类条件熵（表示每个类别在聚类结果中被打散的程度）
$ H(K) $：聚类结果的总熵（聚类分布的不确定性）

💡 含义：如果每个类别都被集中在某一个簇中（$ H(K|C) \approx 0 $），则 c ≈ 1，说明完整性很好。

举个例子说明怎么算

示例输入：

数据点	真实类别（Class）	聚类结果（Cluster）
A	Cat	1
B	Cat	1
C	Dog	1
D	Dog	2
E	Bird	2

我们可以看到：

Cluster 1 包含 Cat, Cat, Dog → 不同类 → 同质性差
Cat 分布在 Cluster 1 → 完整性 OK
Dog 分布在 Cluster 1 和 Cluster 2 → 完整性差

手动计算（简化理解）：

同质性（Homogeneity）：
- Cluster 1: 两个猫 + 一个狗 → 不纯
- Cluster 2: 一个狗 + 一个鸟 → 更不纯
- 所以 h < 1
完整性（Completeness）：
- 猫都在 cluster 1 → OK
- 狗分布在两个 cluster → 不完整
- 鸟在 cluster 2 → OK
- 所以 c < 1

Issues of unsupervised learning

Ambiguity in Interpretation
- Clusters or embeddings may not align with human-understandable or task-relevant categories.
- Clusters or latent features are often hard to interpret or evaluate objectively
Wrong Direction
- The model may find structure, but there's no guarantee it's meaningful or useful for our task

Semi-supervised learning

learning from both labelled and unlabelled data, often more unlabelled
Two common strategies within semi-supervised learning are self-training and active learning

Info

类型	是否需要标签	应用场景
Supervised	✅ 是	分类、回归
Semi-supervised	✅ 部分有标签	资源有限但数据多
Unsupervised	❌ 否	聚类、异常检测、降维等

Self-tranining

Repeat
Train a model f on L using any supervised learning method
Apply f to predict the labels on each instance in U
Identify a subset of U with “high confidence” labels
Remove them from U and add them to L with the classifier predictions as the “ground-truth”
Until L does not change

Problems and Solution

错误传播 (Classification errors can propagate)：这是自训练的一个主要风险。如果模型在预测未标记数据的标签时犯了错误（比如给了一个高置信度的错误标签），并且这个错误的数据被加入了 L，那么在下一轮训练中，这个错误的数据就会被当作“正确”的标签来训练模型。这会导致模型在新数据上表现更差，错误会像滚雪球一样越滚越大。
高置信度阈值：一个常用的策略是，模型在预测未标记数据标签时，只有当它对预测结果非常“自信”（置信度高于某个预设的阈值）时，才把这个预测的标签和对应的数据加入到 L 中。如果模型对预测结果不太确定（置信度低），就认为这个预测不可靠，不把它加入 L，从而避免引入错误标签。

Active Learning

pose queries (unlabelled instances) for labelling by an oracle (e.g. a human annotator)

Uncertain Sampling

Least Confidence: Choose samples with the smallest predicted probability for the most likely class
Margin Sampling: Choose samples where the difference between the top two predicted class probabilities is smallest
Entropy sampling: Select the sample with the highest prediction entropy

Query Strategies

Query by Committee(QBC) If the committee disagrees on an instance, the model is uncertain and would benefit most from seeing the true label.

Info

策略	关键点	优势	劣势
Uncertainty Sampling	利用单个模型的不确定性	简单高效，易实现	可能偏向异常点
QBC	利用多个模型的不一致	更稳健，捕捉“争议点”	算法复杂度高，需要训练多个模型

Data Augmentation

Bootstrap sampling: create “new” datasets by resampling existing data, with or without replacement
Cross validation / repeated random subsampling are based on the same idea
Each “batch” has a slightly different distribution of instances, forces model to use different features and not get stuck in local minima
Advantages
- More data nearly always improves learning
- Most learning algorithms have some robustness to noise
Disadvantages
- Biased training data
- May introduce features that don’t exist in the real world
- May propagate errors
- Increases problems with interpretability and transparency

Unsupervised pre-training

BERT

pretrained language model developed by Google that understands text by looking at both left and right context simultaneously
Masked Language Modelling (MLM): Predict missing words in a sentence. "The cat [MASK] on the mat."
Next Sentence Prediction (NSP): Predict if one sentence follows another.

Info

-预训练（Pre-training）：模型先在大规模的无标签数据上训练，比如让语言模型预测句子中的下一个词，或者让图像模型识别图像中的特征。这个阶段模型学习的是通用的“知识”或“表示”，但没有特定任务的标签指导。

迁移（Transfer）：把预训练得到的模型参数（权重）拿过来，作为一个“起点”或者“初始模型”。
微调（Fine-tuning）：然后用你手头具体的有标签数据（比如分类任务的标签）对这个模型进行针对性的训练，让模型学会具体任务的要求。这个过程会稍微调整预训练时学到的参数，使模型更适合你的特定任务。

Week 11

Parametric Models and Non-parametric Models:

Feature	Parametric Models	Non-parametric Models
Assumptions	Strong assumptions about data distribution (e.g., normal distribution, linear relationship)	Almost no assumptions about data distribution
Number of Parameters	Fixed number of parameters, usually few	Number of parameters is not fixed, can be many
Model Complexity	Relatively low model complexity	Model complexity can be high
Training Data Needs	Usually requires less training data	May require a large amount of training data
Generalization Ability	Depends on the accuracy of assumptions	Generally strong generalization ability as it does not rely on specific assumptions
Overfitting Risk	Risk of overfitting, especially if assumptions do not match the true distribution	Also has overfitting risk, but can usually be controlled through model selection and regularization
Computational Efficiency	Usually high computational efficiency	Computational efficiency can be low, especially for large datasets
Flexibility	Low flexibility, constrained by assumptions	High flexibility, able to adapt to complex data structures and relationships
Examples	Linear regression, logistic regression, Naive Bayes	K-nearest neighbors, decision trees, random forests, support vector machines
Application Scenarios	Suitable for situations where assumptions are reasonable and data volume is small	Suitable for situations with large data volumes, complex relationships, or unknown assumptions
Model Interpretability	Usually has good interpretability	Interpretability may be poor, especially for complex models

Ensembles

Ensemble learning (aka. Classifier combination): constructs a set of base classifiers from a given set of training data and aggregates the outputs into a single meta-classifier
Intuition 1: the combination of lots of weak classifiers can be at least as good as one strong classifier
Intuition 2: the combination of a selection of strong classifiers is (usually) at least as good as the best of the base classifiers
Ensembles are effective when individual classifiers are slightly better than random (error < 0.5).

When does ensemble learning work?

The classifiers should not make the same mistakes (not the same bias)
The base classifiers are reasonably accurate (better than chance), slightly better than random (error < 0.5).

Different classifiers (different models or the same model with feature manipulation): Stacking

1. Stacking

Description:

Different classifiers (different models or the same model with feature manipulation): Stacking, also known as stacked generalization, involves training on multiple different models (these models can be different algorithms or the same algorithm with different features or hyperparameters). The predictions of these models are then used as input features to train a higher-level model (meta-learner, usually a simple linear model like logistic regression). How it works:

Level 0 models: Train multiple base models on the complete dataset.
Level 1 model: Use the predictions of the base models as input features to train a higher-level model (meta-learner). Key Points:

Diversity: The effectiveness of stacking relies on the diversity of the base models.
Complexity: It can be more complex compared to other ensemble methods, and implementing and tuning parameters can be more difficult.
Application Scenarios: Stacking is useful when you can use multiple different models or when you want to leverage the strengths of different algorithms. In short: Stacking uses multiple different models to make predictions and uses these predictions as features for another model to obtain the final prediction.

Same classifier, instance manipulation: Bagging (primarily targets variance reduction)

2. Bagging

Description:

Same classifier, instance manipulation: Bagging, or Bootstrap Aggregating, primarily reduces variance by manipulating data instances. It is often used to reduce the variance of models, especially those complex models sensitive to small changes in the training data (e.g., decision trees). How it works:

Bootstrap Sampling: Perform sampling with replacement from the original dataset to create multiple different training sets (called bootstrap samples).
Parallel Training: Independently train the same model on each bootstrap sample.
Aggregate Predictions: Average the predictions (regression) or use majority voting (classification) for all models. Key Points:

Variance Reduction: Bagging primarily improves model stability by reducing variance.
Parallelism: Models in bagging are trained in parallel.
Application Scenarios: Suitable for models that are very sensitive to small changes in the training data, such as decision trees. In short: Bagging improves model stability by repeatedly sampling the original data, training the same model on each sample, and then aggregating the predictions of these models.

Info

Random Forest is considered a "Bagging" (Bootstrap Aggregating) method because it uses bootstrap sampling techniques when constructing each decision tree. Bagging is an ensemble learning method whose core idea is to create multiple subsets from the original dataset through bootstrap sampling, train a base learner (e.g., decision tree) on each subset, and finally improve the overall model's performance by combining the predictions of these base learners. In Random Forest, the specific application of Bagging is reflected in the following aspects:

Bootstrap Sampling: For the original dataset, Random Forest performs multiple bootstrap samplings, each generating a subset of the same size as the original dataset but with possibly repeated elements. These subsets are used to train different decision trees.
Independent Training: Each decision tree is independently trained on its respective bootstrap sample, meaning they do not share the same training data.
Combine Predictions: In the prediction phase, Random Forest combines the predictions of all decision trees. For classification problems, majority voting is usually used; for regression problems, the average prediction value is used. Through this method, Random Forest uses Bagging techniques to reduce model variance, improve generalization ability, and enhance robustness to noise and outliers. Each tree is trained on a slightly different data subset, which helps introduce diversity, making the overall model more stable and accurate. Therefore, Random Forest is classified as a Bagging method.

Random Forest adopts both feature manipulation and instance manipulation approaches.

Same classifier, algorithm manipulation: Boosting (primarily targets bias reduction)

3. Boosting

Description:

Same classifier, algorithm manipulation: Boosting primarily reduces bias by gradually adjusting the model during training. It weights training samples so that the model focuses more on previously misclassified samples. How it works:

Initialization: Initialize the weights of all samples equally.
Iterative Training: Gradually train the model, adjusting the weights of the samples based on the error rate of the previous model so that the weights of misclassified samples increase.
Combine Models: Combine all models, usually weighting them based on each model's accuracy. Key Points:

Bias Reduction: Boosting primarily improves model accuracy by reducing bias.
Sequentiality: Models in boosting are trained sequentially, with each model depending on the output of the previous model.
Application Scenarios: Suitable for tasks where performance can be improved by focusing on difficult samples. In short: Boosting improves model accuracy by gradually adjusting the training process so that the model focuses more on samples misclassified by previous models. Summary:
Stacking uses multiple different models and inputs their outputs as features into another model.
Bagging uses the same model but trains on different data subsets to reduce variance.
Boosting uses the same model but gradually adjusts the model during training to reduce bias.

Feature	Stacking	Bagging	Boosting
Model Type	Different classifiers (different models or the same model with different feature processing)	Same classifier	Same classifier
Objective	Combine predictions of multiple models to improve overall performance	Primarily reduce variance	Primarily reduce bias
Instance Processing	Use the original dataset	Perform bootstrap sampling (sampling with replacement) on training data	Adjust sample weights based on errors from the previous round of models
Model Processing	Use different models or the same model with different feature processing	Use the same model	Use the same model
Algorithm Processing	Use a meta-learner to combine predictions of multiple models	No specific algorithm processing	Adjust model weights through boosting algorithms (e.g., AdaBoost, Gradient Boosting)
Advantages	Combine advantages of different models to improve generalization ability	Reduce overfitting, improve model stability	Improve prediction accuracy, reduce bias
Disadvantages	High computational complexity, requires training multiple models and a meta-learner	May increase model variance, sensitive to outliers	May lead to overfitting, sensitive to noise
Typical Applications	Combine different types of models (e.g., decision trees, neural networks, SVM)	Random Forest	AdaBoost, Gradient Boosting Machines (GBM)
Applicable Scenarios	Scenarios requiring the combination of multiple models to improve performance	Scenarios requiring variance reduction and improved model stability	Scenarios requiring bias reduction and improved prediction accuracy

Summary

Stacking improves overall performance by combining predictions of different models, suitable for scenarios requiring the combination of multiple models to improve performance.
Bagging reduces model variance and improves stability through bootstrap sampling, suitable for scenarios requiring variance reduction and improved model stability.
Boosting reduces model bias and improves prediction accuracy by adjusting sample weights, suitable for scenarios requiring bias reduction and improved prediction accuracy.

Anomaly Detection

Anomalies are different from noise
- Noise is random error or variance in a measured variable
- Noise should be removed before anomaly detection

Types of Anomaly

Global Anomaly

significantly deviates from the rest of the data

Contextual Anomaly

deviates significantly based on a selected context
- Contextual attributes: define the context, e.g., time & location
- Behavioural attributes: characteristics of the object, used in anomaly evaluation, e.g., temperature

Collective Anomaly

deviates significantly from the majority of the data

Anomaly Detection Algorithms

Supervised Anomaly Detection

Use labelled data to train a model to detect anomalies
Challenges
- Labeling data is expensive and time-consuming
- Anomalies are often rare and difficult to detect
- Anomalies may change over time
- Cannot detect unknown and emerging anomalies
- Anomalies may be specific to a particular context
- Anomalies may be specific to a particular user or group of users

Semi-Supervised Anomaly Detection

Labels are available only for normal data
apply the trained model to both unlabelled and labelled data
Any instance that falls outside the learned decision boundary is marked as a potential anomaly
Challenges
- Require labels from the normal class
- Possible high false alarm rate - previously unseen (yet legitimate) data records may be recognised as anomalies

Unsupervised Anomaly Detection

Assume the normal objects are somewhat "clustered" into multiple groups, each having some distinct features
An outlier is expected to be far away from any groups of normal objects
Challenges
- Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area
- Ex. In intrusion detection, normal activities are diverse
- Unsupervised methods may have a high false positive rate but still miss many real outliers

Unsupervised Anomaly Detection Approaches

Statistical (model-based)
- Assume that normal data follow some statistical model, such as Gaussian distribution
Proximity-based
- An object is an outlier if the nearest neighbours of the object are far away
Density-based
- Outliers are objects in regions of low density
Clustering-based
- Normal data belong to large and dense clusters

Statistical (model-based)

Pros
- Statistical tests are well-understood and well-validated.
- Quantitative measure of the degree to which an object is an outlier.
Cons
- Data may be hard to model parametrically.
- Multiple modes
- Variable density
- In high dimensions, data may be insufficient to estimate the true distribution.

Proximity-based

Pros
- Easier to define a proximity measure for a dataset than to determine its statistical distribution.
- Quantitative measure of the degree to which an object is an outlier.
- Deals naturally with multiple modes.
Cons
- O(n^2) complexity.
- Score sensitive to choice of k.
- Does not work well if the data has widely variable density.

Density-based

Pros
- Quantitative measure of the degree to which an object is an outlier.
- It can work well even if the data has variable density.
Cons
- O(n^2) complexity
- Must choose parameters
- k for nearest neighbours
- d for distance threshold

Clustering-based

Pros
- Some clustering techniques have O(n) complexity.
- Extends the concept of outlier from single objects to groups of objects.
Cons
- Requires thresholds for large and small clusters.
- Sensitive to the number of clusters chosen.
- Outliers may affect the initial formation of clusters.

Humans tend to perceive out-group members as less nuanced than in-group members.

Correlation Fallacy

Humans have a tendency to mistake correlation (two co-incidentally co-occurring events) with causation.

Historical Bias

A randomly sampled data set, reflects the world as it was including existing biases which should not be carried forward

Measurement Bias

Noisy measurement → errors or missing data points which are not randomly distributed
Mistaking a (noisy) proxy for a label of interest
Oversimplification of the quantity of interest
should 1. Know your domain 2. Know your task 3. Know your data

Model Bias

Weak models: high bias – low variance
Unjustified model assumptions
Blind loss functions: Blind to certain types of errors
should 1. Carefully consider model assumptions 2. Carefully choose loss functions 3. Model groups separately 4. Represent groups fairly in the data

Evaluation / Deployment Bias

Test set not representative
Overfit to a test set
Evaluation metrics may not capture all quantities of interest (disregard minority groups or average effects)
Use of systems in ways they were not intended to use. Lack of education of end-users.