<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Machine Learning Articles on Lukas Hofbauer</title>
    <link>https://hofbauer.tech/ml-blog/</link>
    <description>Recent content in Machine Learning Articles on Lukas Hofbauer</description>
    <generator>Hugo</generator>
    <language>en</language>
    <lastBuildDate>Thu, 11 Sep 2025 18:04:53 +0200</lastBuildDate>
    <atom:link href="https://hofbauer.tech/ml-blog/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Adam Optimizer</title>
      <link>https://hofbauer.tech/ml-blog/adam/</link>
      <pubDate>Thu, 11 Sep 2025 18:04:53 +0200</pubDate>
      <guid>https://hofbauer.tech/ml-blog/adam/</guid>
      <description>&lt;p&gt;When training deep neural networks, choosing the right optimizer can make the difference between fast, stable convergence and hours of frustration. One of the most widely used algorithms is &lt;strong&gt;Adam (Adaptive Moment Estimation)&lt;/strong&gt;, introduced by &lt;a href=&#34;https://arxiv.org/abs/1412.6980&#34;&gt;Diederik P. Kingma and Jimmy Ba in 2014&lt;/a&gt;. Adam has become a default choice in all major frameworks (PyTorch, TensorFlow, JAX) and is still at the heart of cutting-edge models like transformers.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;the-idea-behind-adam&#34;&gt;The Idea Behind Adam&lt;/h2&gt;
&lt;p&gt;Adam combines two key ideas from earlier optimizers: &lt;strong&gt;momentum&lt;/strong&gt;, which keeps an exponentially decaying average of past gradients (the first moment), and &lt;strong&gt;RMSProp&lt;/strong&gt;-style adaptive learning rates, which scale each parameter&amp;rsquo;s step by a decaying average of squared gradients (the second moment).&lt;/p&gt;
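&lt;p&gt;As a minimal sketch of the resulting update rule (illustrative NumPy code, not the exact implementation from the full post; the default hyperparameters follow the paper):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Update biased first and second moment estimates
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    # Correct the bias introduced by the zero initialization of m and v
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # Per-parameter step size based on the gradient history
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
&lt;/code&gt;&lt;/pre&gt;</description>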
    </item>
    <item>
      <title>LoRA from Scratch</title>
      <link>https://hofbauer.tech/ml-blog/lora/</link>
      <pubDate>Sat, 23 Aug 2025 13:03:08 +0700</pubDate>
      <guid>https://hofbauer.tech/ml-blog/lora/</guid>
      <description>&lt;h1 id=&#34;lora-low-rank-adaptation&#34;&gt;LoRA (Low-Rank Adaptation)&lt;/h1&gt;
&lt;p&gt;LoRA, short for &lt;strong&gt;Low-Rank Adaptation&lt;/strong&gt;, is one of the most popular &lt;em&gt;parameter-efficient fine-tuning&lt;/em&gt; (PEFT) methods. It was first proposed by &lt;strong&gt;&lt;a href=&#34;https://arxiv.org/pdf/2106.09685&#34;&gt;Hu et al., 2021&lt;/a&gt;&lt;/strong&gt;, and has become a go-to technique when adapting large pretrained models to new tasks.&lt;/p&gt;
&lt;p&gt;Why do we need PEFT methods in the first place?
Fine-tuning large language models in the traditional way (updating all of their billions of parameters) is simply too expensive in terms of compute, memory, and storage. Researchers realized that we don’t actually need to change every parameter of a pretrained model to make it useful for new tasks.&lt;/p&gt;
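&lt;p&gt;The core trick: keep the pretrained weight matrix frozen and learn only a low-rank update. A minimal NumPy sketch (dimensions and names are illustrative, following the initialization in the paper):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

d_out, d_in, r = 512, 512, 8         # rank r is much smaller than d
W = np.random.randn(d_out, d_in)     # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection, zero-init
alpha = 16                           # scaling hyperparameter

def lora_forward(x):
    # Frozen path plus scaled low-rank update: (W + (alpha / r) * B @ A) @ x
    return W @ x + (alpha / r) * (B @ (A @ x))

print(lora_forward(np.random.randn(d_in)).shape)  # (512,)
&lt;/code&gt;&lt;/pre&gt;</description>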
    </item>
    <item>
      <title>Neural Network</title>
      <link>https://hofbauer.tech/ml-blog/neural-network/</link>
      <pubDate>Sun, 27 Jul 2025 18:17:11 +0700</pubDate>
      <guid>https://hofbauer.tech/ml-blog/neural-network/</guid>
      <description>&lt;h1 id=&#34;build-a-neural-network-from-scratch&#34;&gt;Build a Neural Network from Scratch&lt;/h1&gt;
&lt;p&gt;In this post, we&amp;rsquo;ll walk through how to build a simple neural network from scratch using just NumPy. No high-level libraries like TensorFlow or PyTorch, just the fundamentals.&lt;/p&gt;
&lt;hr&gt;
&lt;h2 id=&#34;what-is-a-neural-network&#34;&gt;What &lt;em&gt;is&lt;/em&gt; a Neural Network?&lt;/h2&gt;
&lt;p&gt;A neural network is a set of interconnected layers of simple computational units called &lt;strong&gt;neurons&lt;/strong&gt;. Each neuron receives inputs and returns an output value.
It does this by multiplying each input by a learned weight and adding a bias term. The resulting value is then passed through a nonlinear activation function. We&amp;rsquo;ll explain later why this last step is necessary.&lt;/p&gt;
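&lt;p&gt;In code, a single neuron is only a few lines. A minimal NumPy sketch (the names and the choice of sigmoid are illustrative, not taken from the post):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def neuron(x, w, b):
    # Weighted sum of the inputs plus a bias, passed through a nonlinearity
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])  # inputs
w = np.array([0.1, 0.4, -0.2])  # learned weights
print(neuron(x, w, b=0.3))      # a single output value
&lt;/code&gt;&lt;/pre&gt;</description>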
    </item>
    <item>
      <title>Precision Recall and other Classification Metrics</title>
      <link>https://hofbauer.tech/ml-blog/precision-recall/</link>
      <pubDate>Fri, 06 Jun 2025 17:36:17 +0900</pubDate>
      <guid>https://hofbauer.tech/ml-blog/precision-recall/</guid>
      <description>&lt;p&gt;When evaluating a classification model, accuracy alone isn’t enough. To understand how well your model is really performing, we need to dig deeper into metrics like &lt;strong&gt;precision&lt;/strong&gt;, &lt;strong&gt;recall&lt;/strong&gt;, and &lt;strong&gt;F1 score&lt;/strong&gt;, and into performance curves like &lt;strong&gt;ROC&lt;/strong&gt; and &lt;strong&gt;Precision-Recall (PR)&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ll start by using the same classifier as in the &lt;a href=&#34;https://hofbauer.tech/ml-blog/logistic-regression/&#34;&gt;Logistic
Regression&lt;/a&gt; post.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sklearn&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;datasets&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;from&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;sklearn.linear_model&lt;/span&gt; &lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LogisticRegression&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;numpy&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;np&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;kn&#34;&gt;import&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class=&#34;k&#34;&gt;as&lt;/span&gt; &lt;span class=&#34;nn&#34;&gt;plt&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;iris&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;datasets&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;load_iris&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;X&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;iris&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;data&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;][:,&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;reshape&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;-&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;mi&#34;&gt;1&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;y&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;iris&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;[&lt;/span&gt;&lt;span class=&#34;s2&#34;&gt;&amp;#34;target&amp;#34;&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;]&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;==&lt;/span&gt; &lt;span class=&#34;mi&#34;&gt;0&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;astype&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;nb&#34;&gt;int&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;log_reg&lt;/span&gt; &lt;span class=&#34;o&#34;&gt;=&lt;/span&gt; &lt;span class=&#34;n&#34;&gt;LogisticRegression&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;&lt;span class=&#34;n&#34;&gt;log_reg&lt;/span&gt;&lt;span class=&#34;o&#34;&gt;.&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;fit&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;(&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;X&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;,&lt;/span&gt;&lt;span class=&#34;n&#34;&gt;y&lt;/span&gt;&lt;span class=&#34;p&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;&lt;img alt=&#34;png&#34; loading=&#34;lazy&#34; src=&#34;https://hofbauer.tech/ml-blog/precision-recall/output_1_0.png&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;the-confusion-matrix&#34;&gt;The Confusion Matrix&lt;/h2&gt;
&lt;p&gt;Everything starts with the &lt;strong&gt;confusion matrix&lt;/strong&gt;, which keeps track of the four possible outcomes in binary classification: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).&lt;/p&gt;
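&lt;p&gt;Precision and recall follow directly from those four counts. A short sketch using scikit-learn, continuing with the classifier fitted above:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

y_pred = log_reg.predict(X)

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y, y_pred))
print(&#34;precision:&#34;, precision_score(y, y_pred))  # TP / (TP + FP)
print(&#34;recall:&#34;, recall_score(y, y_pred))        # TP / (TP + FN)
print(&#34;F1 score:&#34;, f1_score(y, y_pred))          # harmonic mean of the two
&lt;/code&gt;&lt;/pre&gt;</description>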
    </item>
    <item>
      <title>Softmax</title>
      <link>https://hofbauer.tech/ml-blog/softmax/</link>
      <pubDate>Wed, 04 Jun 2025 14:35:57 +0900</pubDate>
      <guid>https://hofbauer.tech/ml-blog/softmax/</guid>
      <description>&lt;p&gt;Yesterday, we explored how to train a binary classifier using &lt;a href=&#34;https://hofbauer.tech/ml-blog/logistic-regression/&#34;&gt;&lt;strong&gt;logistic regression&lt;/strong&gt;&lt;/a&gt;. Today, we’ll generalize that idea to handle &lt;strong&gt;multiple classes&lt;/strong&gt;. This generalization is known as &lt;strong&gt;softmax regression&lt;/strong&gt;, or &lt;strong&gt;multinomial logistic regression&lt;/strong&gt;.&lt;/p&gt;
&lt;h3 id=&#34;the-idea&#34;&gt;The Idea&lt;/h3&gt;
&lt;p&gt;In binary logistic regression, we used a single score function to compute the probability of a class. For multiclass classification, we extend this by defining one &lt;strong&gt;score function&lt;/strong&gt; per class:&lt;/p&gt;
&lt;p&gt;$$
s_k(x) = \theta_k^T x
$$&lt;/p&gt;
&lt;p&gt;Here, $s_k(x)$ is the score for class $k$, and $\theta_k$ is the parameter vector for that class.&lt;/p&gt;
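&lt;p&gt;The scores are then turned into class probabilities with the softmax function. A minimal NumPy sketch (illustrative, for a single sample):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def softmax(s):
    # Subtract the max for numerical stability; the outputs sum to 1
    e = np.exp(s - np.max(s))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # s_k(x) for three classes
print(softmax(scores))              # approximately [0.659 0.242 0.099]
&lt;/code&gt;&lt;/pre&gt;</description>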
    </item>
    <item>
      <title>Logistic Regression</title>
      <link>https://hofbauer.tech/ml-blog/logistic-regression/</link>
      <pubDate>Tue, 03 Jun 2025 19:16:22 +0900</pubDate>
      <guid>https://hofbauer.tech/ml-blog/logistic-regression/</guid>
      <description>&lt;p&gt;Regression methods aren’t just for predicting continuous values—they can also be used for classification. The simplest example of this is &lt;strong&gt;Logistic Regression&lt;/strong&gt;, where we train a linear model to separate two classes in feature space. The goal is for the model to output &lt;code&gt;1&lt;/code&gt; if an input belongs to our target class and &lt;code&gt;0&lt;/code&gt; otherwise.&lt;/p&gt;
&lt;p&gt;The formula for logistic regression should look familiar if you’ve seen the post on &lt;a href=&#34;https://hofbauer.tech/ml-blog/linear-regression/&#34;&gt;Linear Regression&lt;/a&gt;:&lt;/p&gt;
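&lt;p&gt;In its standard form, the model passes the familiar linear combination through the sigmoid function:&lt;/p&gt;
&lt;p&gt;$$ \hat p = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} $$&lt;/p&gt;
&lt;p&gt;A minimal NumPy sketch of that prediction step (names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict_proba(x, theta):
    # Linear score squashed to a probability in (0, 1)
    return sigmoid(np.dot(theta, x))

def predict(x, theta, threshold=0.5):
    return int(predict_proba(x, theta) &gt;= threshold)
&lt;/code&gt;&lt;/pre&gt;</description>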
    </item>
    <item>
      <title>Regularized Linear Models</title>
      <link>https://hofbauer.tech/ml-blog/regularized-linear-models/</link>
      <pubDate>Tue, 03 Jun 2025 01:57:00 +0900</pubDate>
      <guid>https://hofbauer.tech/ml-blog/regularized-linear-models/</guid>
      <description>&lt;p&gt;Last time we saw how &lt;a href=&#34;https://hofbauer.tech/ml-blog/polynomial-regression/&#34;&gt;Polynomial regression&lt;/a&gt; can fit complex patterns, but as we increased the degree of the polynomial, we encountered &lt;strong&gt;overfitting&lt;/strong&gt;; the model performs well on the training data but poorly on unseen test data.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Regularization&lt;/strong&gt; helps combat overfitting by adding a penalty term to the loss function, discouraging overly complex models.
We&amp;rsquo;ll explore three common regularization techniques: &lt;strong&gt;Ridge&lt;/strong&gt;, &lt;strong&gt;Lasso&lt;/strong&gt;, and &lt;strong&gt;Elastic Net&lt;/strong&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h3 id=&#34;ridge-regression&#34;&gt;Ridge Regression&lt;/h3&gt;
&lt;p&gt;Ridge adds a penalty proportional to the &lt;strong&gt;squared magnitude&lt;/strong&gt; of coefficients:&lt;/p&gt;
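&lt;p&gt;In the usual formulation, the cost becomes:&lt;/p&gt;
&lt;p&gt;$$ J(\theta) = \mathrm{MSE}(\theta) + \alpha \sum_{i=1}^{n} \theta_i^2 $$&lt;/p&gt;
&lt;p&gt;All three techniques are one-liners in scikit-learn. A short sketch on synthetic data (the hyperparameter values are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0)                       # L2 penalty on the coefficients
lasso = Lasso(alpha=0.1)                       # L1 penalty, drives some weights to 0
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)  # blend of L1 and L2

for model in (ridge, lasso, elastic):
    model.fit(X, y)
    print(type(model).__name__, model.coef_.round(2))
&lt;/code&gt;&lt;/pre&gt;</description>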
    </item>
    <item>
      <title>Polynomial Regression</title>
      <link>https://hofbauer.tech/ml-blog/polynomial-regression/</link>
      <pubDate>Sun, 01 Jun 2025 20:49:07 +0900</pubDate>
      <guid>https://hofbauer.tech/ml-blog/polynomial-regression/</guid>
      <description>&lt;p&gt;In the &lt;a href=&#34;https://hofbauer.tech/ml-blog/linear-regression/&#34;&gt;&lt;strong&gt;Linear Regression&lt;/strong&gt;&lt;/a&gt; notebook, we saw how to model relationships where the target variable depends linearly on the input features. But what if the relationship is &lt;strong&gt;non-linear&lt;/strong&gt;? Does that mean we need an entirely different type of model?&lt;/p&gt;
&lt;p&gt;Surprisingly, no. We can still use linear regression to model non-linear relationships, by transforming the input features.&lt;/p&gt;
&lt;p&gt;Imagine you&amp;rsquo;re trying to predict the price of a house based on the size of its plot. If the plot is rectangular and your dataset includes only the &lt;strong&gt;length&lt;/strong&gt; and &lt;strong&gt;width&lt;/strong&gt;, there&amp;rsquo;s no single feature that directly tells you the area. But since &lt;strong&gt;area = length $\cdot$ width&lt;/strong&gt;, we could manually create a new feature called &lt;code&gt;area&lt;/code&gt;.&lt;/p&gt;
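&lt;p&gt;scikit-learn automates exactly this kind of feature construction with &lt;code&gt;PolynomialFeatures&lt;/code&gt;. A short sketch (the sample values are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two raw features per sample: length and width of the plot
X = np.array([[20.0, 30.0],
              [15.0, 40.0]])

# degree=2 adds squares and the interaction term length * width (the area)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
print(poly.get_feature_names_out())  # ['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
&lt;/code&gt;&lt;/pre&gt;</description>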
    </item>
    <item>
      <title>Gradient Descent</title>
      <link>https://hofbauer.tech/ml-blog/gradient-descent/</link>
      <pubDate>Sat, 31 May 2025 20:56:40 +0900</pubDate>
      <guid>https://hofbauer.tech/ml-blog/gradient-descent/</guid>
      <description>&lt;p&gt;Gradient descent is a general-purpose optimization algorithm that lies at the heart of many machine learning applications. The idea is to iteratively adjust a set of parameters, $\theta$, to minimize a given cost function.&lt;/p&gt;
&lt;p&gt;Like a ball rolling downhill, gradient descent uses the local gradient of the cost function with respect to $\theta$ to guide its steps in the direction of steepest descent.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&#34;png&#34; loading=&#34;lazy&#34; src=&#34;https://hofbauer.tech/ml-blog/gradient-descent/output_1_0.png&#34;&gt;&lt;/p&gt;
&lt;h2 id=&#34;the-role-of-the-learning-rate&#34;&gt;The Role of the Learning Rate&lt;/h2&gt;
&lt;p&gt;The most critical hyperparameter in gradient descent is the &lt;strong&gt;learning rate&lt;/strong&gt;: too small and convergence crawls, too large and the steps overshoot the minimum and can diverge.&lt;/p&gt;
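&lt;p&gt;To make this concrete, here is a minimal sketch of gradient descent on a simple quadratic cost (illustrative, not the post&amp;rsquo;s exact example); try varying &lt;code&gt;lr&lt;/code&gt; to see both failure modes:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
def grad(theta):
    return 2 * (theta - 3.0)

theta = 0.0
lr = 0.1  # learning rate: try 1.1 (diverges) or 0.001 (crawls)
for step in range(50):
    theta = theta - lr * grad(theta)
print(theta)  # approaches the minimum at theta = 3
&lt;/code&gt;&lt;/pre&gt;</description>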
    </item>
    <item>
      <title>Linear Regression</title>
      <link>https://hofbauer.tech/ml-blog/linear-regression/</link>
      <pubDate>Thu, 22 May 2025 12:05:03 +0900</pubDate>
      <guid>https://hofbauer.tech/ml-blog/linear-regression/</guid>
      <description>&lt;p&gt;Linear regression is a fundamental supervised learning algorithm used to model the relationship between a dependent variable $y$ and one or more independent variables $x$. In its simplest form (univariate linear regression), it assumes that the relationship between $x$ and $y$ is linear and can be described by the equation:&lt;/p&gt;
&lt;p&gt;$$ \hat y = k \cdot x + d $$&lt;/p&gt;
&lt;p&gt;But we can have arbitrarily many input features, as long as the model remains a linear combination of the form:
$$ \hat y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n $$&lt;/p&gt;
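&lt;p&gt;A minimal NumPy sketch of fitting such a model by least squares (the data here is synthetic and illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.5 * x + 1.0 + rng.normal(scale=0.5, size=100)  # true k = 2.5, d = 1.0

# Least squares on a design matrix with a bias column of ones
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # approximately [1.0, 2.5]
&lt;/code&gt;&lt;/pre&gt;</description>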
    </item>
  </channel>
</rss>
