Introduction

We are interested in comparing and visualizing the difference between using the L1 norm and the L2 norm as the loss function for simple regression tasks.

Using the L1 norm as the loss function is equivalent to solving the following optimization problem:

$$ \alpha^+ = \arg\min_x \Vert Ax - b\Vert_1 $$

And for the L2 norm loss function, we have:

$$ \alpha^+ = \arg\min_x \Vert Ax - b\Vert_2^2 $$
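For reference, the L2 problem has a closed-form least-squares solution; here is a minimal sketch using NumPy (the function name L2RegFit is mine and does not come from the repo):

import numpy as np

def L2RegFit(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Minimize ||Xa - y||_2^2 and return the coefficient vector a."""
    coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefficients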

Here is the list of topics we are going to cover on this website.

A repo with all the code involved can be viewed here.

Phrasing L1 Loss as a Linear Programming Problem

$$\begin{aligned} \min_{x} \Vert Ax - b\Vert_1 &= \min_{x} \sum_{i=1}^n \left\vert \sum_{j = 1}^m a_{i, j}x_j - b_i\right\vert \\ &= \min_{t, x} \left\lbrace \sum_{i=1}^n t_i : -t_i \le \sum_{j = 1}^m a_{i, j}x_j - b_i \le t_i, \; \forall\; 1 \le i \le n \right\rbrace \\ &= \min_{v, x} \left\lbrace \mathbf{1}^T v : -v \le Ax - b \le v \right\rbrace \end{aligned}$$

where \(x \in \mathbb{R}^m\) and \(v \in \mathbb{R}^n\) are vectors and the inequalities hold elementwise.
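Concretely, stacking the variables, the problem can be handed to any generic LP solver in the standard inequality form below; this is just an illustration of the reformulation, not the exact matrices used later:

$$ \min_{x, v} \; \mathbf{1}^T v \quad \text{subject to} \quad \begin{bmatrix} A & -I \\ -A & -I \end{bmatrix} \begin{bmatrix} x \\ v \end{bmatrix} \le \begin{bmatrix} b \\ -b \end{bmatrix} $$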

This is neither the most efficient nor the most standard way of doing it; I will talk more about that later.

There are other ways of solving this problem; some of the alternatives include:


CVXOPT

To solve the system, I used CVXOPT, available here. It uses an interior-point method, implemented in Python on top of BLAS, which I believe is a reasonable choice. It is currently maintained by M. Andersen and L. Vandenberghe at UCLA.

CVXOPT also provides a formulation of the L1 loss function as an LP that exploits the symmetric structure of the matrix to improve the complexity of the algorithm, which I did not use. See here for more.

Here is the code that uses CVXOPT to get the coefficients for the polynomial fit on the data with the L1 norm loss:


import matplotlib.pyplot as plt
import numpy as np
import cvxopt as cvx
import cvxopt.modeling as cvxm
from cvxopt import solvers

# tighten the interior-point solver tolerances and silence its progress output
solvers.options["show_progress"] = False
solvers.options["abstol"] = 1e-15
solvers.options["reltol"] = 1e-10
solvers.options["refinement"] = 10

from tqdm import tqdm

def L1RegFit(X: np.ndarray, y: np.ndarray):
    """Minimize ||Xx - y||_1 by solving the equivalent LP with CVXOPT."""
    assert X.ndim == 2 and X.shape[0] == y.shape[0]
    m, n = X.shape
    A = cvx.matrix(X)
    b = cvx.matrix(y)
    x = cvxm.variable(n)                      # polynomial coefficients
    Epsilon = cvxm.variable(m)                # epigraph variables, one per residual
    Lp = cvxm.op(cvxm.sum(Epsilon))           # objective: sum of absolute residuals
    Lp.addconstraint(-Epsilon <= b - A * x)   # -epsilon_i <= b_i - (Ax)_i
    Lp.addconstraint(b - A * x <= Epsilon)    #  b_i - (Ax)_i <= epsilon_i
    Lp.solve()
    return np.array(x.value).reshape(-1)
            
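For illustration, here is how the fitting function might be called on a degree-4 polynomial design matrix; this usage snippet is my own sketch and is not part of the repo:

xs = np.linspace(-1, 1, 100)
ys = np.cos(np.pi * xs) + 0.1 * np.random.randn(100)   # noisy samples of cos(pi x)
V = np.vander(xs, 5)                                    # degree-4 Vandermonde design matrix
coefficients = L1RegFit(V, ys)                          # polynomial coefficients, highest degree first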

Bootstrap for Confidence Intervals

The bootstrap is one of the ways to estimate the variance of a model when we don't know enough about the original generative process behind the data and getting new data is impossible. We use the non-parametric bootstrap, to be specific.

Given a training data matrix \(A\), let \(\tilde{A}\) be a matrix created by sampling rows of \(A\) with replacement \(k\) times. This means it is entirely possible for all rows of \(\tilde{A}\) to be the same row, but that is very unlikely when matrix \(A\) has many rows.
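In code, a single bootstrap resample of the rows can be drawn like this; a minimal NumPy sketch, assuming \(A\) is stored as a 2-D array A:

k = A.shape[0]                                   # resample as many rows as A has
Selection = np.random.randint(0, A.shape[0], k)  # row indices drawn with replacement
A_tilde = A[Selection]                           # bootstrap sample of the rows of A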

Bootstrap Pseudocode

Let \(\mathbb{M}\) be a set that holds the models trained in the bootstrap loop.
Let \(X\) be the Vandermonde matrix for the testing data set.
For \(i = 1\) to \(B\):
    Make bootstrap data \(\tilde{A}\) from matrix \(A\), resampled with replacement \(k\) times.
    Fit model \(\alpha\) using \(\tilde{A}\).
    Add \(\alpha\) to \(\mathbb{M}\).
Let \(\hat{Y}\) be the matrix whose rows are the predictions made by the models in \(\mathbb{M}\):
For \(i\), \(\alpha\) in Enumerate(\(\mathbb{M}\)):
    \(\hat{Y}[i, :] := (X\alpha)^T\)
Let \(\hat{Y}_{\text{Lower}}\) be the row vector of the 2.5th percentiles of the columns of \(\hat{Y}\).
Let \(\hat{Y}_{\text{Upper}}\) be the row vector of the 97.5th percentiles of the columns of \(\hat{Y}\).

Code Implementation


def RegBoopStrap(
                X: np.ndarray,             # regression design matrix
                y: np.ndarray,             # labels
                fxn: callable,             # model training function, returns coefficients
                queries=None,              # the points at which to query the confidence interval
                trials=None,               # number of bootstrap models to train
                boostrapSize: int = None,  # rows drawn per bootstrap sample
                alpha: float = 5           # total tail mass in percent; 5 gives a 95% band
                ):
    """
        Performs a non-parametric bootstrap for regression models.
    :param X: design matrix, one row per observation
    :param y: label vector
    :param fxn: training function mapping (X, y) to a coefficient vector
    :param queries: query points for the confidence band; defaults to column 1 of X
    :param trials: number of bootstrap resamples; defaults to the number of rows of X
    :param boostrapSize: size of each bootstrap sample; defaults to the number of rows of X
    :return: array of shape (2, len(queries)) with the lower and upper band
    """
    assert X.ndim == 2 and X.shape[0] == y.shape[0] and y.ndim == 1
    assert 0 < alpha < 50
    m, n = X.shape
    trials = m if trials is None else trials
    # copy to avoid sorting a view of X in place
    queries = X[:, 1].copy() if queries is None else np.asarray(queries, dtype=float)
    queries.sort()
    bootstrapSize = m if boostrapSize is None else boostrapSize
    assert trials > 20
    assert bootstrapSize <= m, "don't recommend bootstrap size to be larger than sample size"
    BootstrapYhat = []
    for _ in tqdm(range(trials)):
        # draw row indices with replacement and fit a model on the resampled data
        Selection = np.random.randint(0, m, bootstrapSize)
        Coefficient = fxn(X[Selection], y[Selection])
        # evaluate the fitted polynomial at the query points
        V = np.vander(queries, Coefficient.shape[0])
        Yhat = (V @ Coefficient[:, np.newaxis]).reshape(-1)
        BootstrapYhat.append(Yhat)
    OutputMatrix = np.array(BootstrapYhat)
    # pointwise percentiles across the bootstrap predictions give the band
    ConfidenceBand = np.percentile(OutputMatrix, [alpha / 2, 100 - alpha / 2], axis=0)
    return ConfidenceBand
            
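Here is how the function might be called on the polynomial data from earlier; the explicit query grid and the alpha value are my own choices for illustration, not taken from the repo:

V = np.vander(xs, 5)                                  # design matrix from earlier
band = RegBoopStrap(V, ys, L1RegFit,
                    queries=np.linspace(-1, 1, 200),  # fine grid of query points
                    trials=300,
                    alpha=1)                          # alpha=1 yields a 99% confidence band
lower, upper = band[0], band[1]                       # band has shape (2, len(queries))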

Results

The generative model we used is \(\cos(\pi x)\) on the interval (-1, 1), with noise of constant variance 0.01, and 20% of the data are outliers set to a value of exactly -3.

Models with the L1 and L2 norm as loss function are trained, with 300 bootstrapped models each and \(k = n\), where \(n\) is the number of rows of matrix \(A\).

The degree of the polynomial we use to fit the generative model is 4.

Then, a 99% confidence interval is produced, with a fine grid of query points.
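For concreteness, the data-generating process described above might look like the following sketch; the sample size and the RNG seed are my assumptions, everything else follows the description:

rng = np.random.default_rng(42)                     # seed is an arbitrary choice
n_samples = 100                                     # sample size is assumed, not stated above
xs = rng.uniform(-1, 1, n_samples)
ys = np.cos(np.pi * xs) + rng.normal(0.0, np.sqrt(0.01), n_samples)  # constant variance 0.01
outlier_mask = rng.random(n_samples) < 0.2          # roughly 20% of points become outliers
ys[outlier_mask] = -3.0                             # outliers pinned at exactly -3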

These are the results:

L1, L2 Confidence Band

Observe that the confidence band of the model with the L1 norm as loss function is narrower and is centered around the ground-truth generative model, compared to the L2 model, which is much more susceptible to the effect of the outliers below the curve.