
   
Logistic Regression

``Logistic regression'', like Fisher's Method and the Perceptron (Section 3.1.1), is a supervised method for the two-class classification problem [16]. Although it uses a different model, it can be shown that logistic discrimination and Fisher discrimination coincide when sampling from multivariate distributions with common covariance matrices [17].

Logistic regression models the logarithmic odds ratio for the classification (variable Y) as a linear function of the p ``input'' variables $\vec{X}=\{X_1,X_2,\ldots,X_p\}$, where $\vec{\beta}$ is the (p+1)-dimensional coefficient vector:


\begin{displaymath}
\log\left[ \frac{f( Y=1 \vert \vec{X})}{f( Y=0 \vert \vec{X})} \right]
= \beta_0 + X_1 \beta_1 + \ldots + X_p \beta_p
= \beta_0 + \vec{X}\,^T \vec{\beta}
\end{displaymath} (15)

The odds ratio is the factor by which the event (Y=1) is more likely to occur than the event (Y=0), given knowledge of $\vec{X}$. Taking the logarithm maps the values from $(0,\infty)$ to $(-\infty,\infty)$. Since:

\begin{displaymath}
f( Y=1 \vert \vec{X}) > f( Y=0 \vert \vec{X}) \quad \Leftrightarrow \quad
\log\left[ \frac{f( Y=1 \vert \vec{X})}{f( Y=0 \vert \vec{X})} \right] > 0
\end{displaymath}

we can see the similarity to Fisher's method and the Perceptron in how the classification is made. In logistic discrimination the log-odds ratio of the conditional classification, and therefore indirectly the conditional probabilities $f( Y=1 \vert \vec{X})$ and $f( Y=0 \vert \vec{X})$, are modeled. For classification purposes we only need to know which of the two probabilities is the higher one. This means our decision surface reduces to:

\begin{displaymath}
w := \;\beta_0 + X_1 \beta_1 + \ldots + X_p \beta_p \;
\left\{ \begin{array}{ll}
> 0 & classify\quad 1 \\
\le 0 & classify\quad 0
\end{array}\right.
\end{displaymath}

which is the same (p-1)-dimensional hyperplane used by the linear classifiers. In principle, all kinds of functions could be used to model the logarithmic odds ratio. We could also weight the classification so that an entity is only classified as ``1'' if the probability of this event is higher than some given probability; this just means changing the 0 in the above formula to a different value (as sketched below).
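To make the decision rule concrete, the following is a minimal sketch in Python, assuming the coefficients have already been estimated; the names classify, beta0, beta and threshold are illustrative and not part of the original text. Requiring the classification probability to exceed a given value q corresponds to setting the threshold to $\log(q/(1-q))$ instead of 0.

\begin{verbatim}
import numpy as np

def classify(x, beta0, beta, threshold=0.0):
    """Classify one entity with the linear decision rule.

    x         -- vector of the p input variables
    beta0     -- intercept coefficient beta_0
    beta      -- vector of coefficients (beta_1, ..., beta_p)
    threshold -- 0 classifies by the higher probability; use
                 log(q / (1 - q)) to demand a probability above q
    """
    w = beta0 + np.dot(x, beta)      # w = beta_0 + X^T beta
    return 1 if w > threshold else 0
\end{verbatim}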

In standard logistic regression the model parameters $\beta_i$ are obtained via maximum likelihood estimation. By transforming the model (15) for the log-odds ratio, and using $f(Y=1 \vert \vec{X}) = 1-f(Y=0 \vert \vec{X})$, we get:

\begin{displaymath}
\pi(\vec{X}) := f(Y=1 \vert \vec{X}) =
\frac{\exp(\beta_0 + X_1 \beta_1 + \ldots + X_p \beta_p)}
{1 + \exp(\beta_0 + X_1 \beta_1 + \ldots + X_p \beta_p)}
\end{displaymath} (16)
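The step from (15) to (16) can be made explicit. Writing $\eta := \beta_0 + X_1 \beta_1 + \ldots + X_p \beta_p$ for the linear predictor (a shorthand introduced here) and using $f(Y=0 \vert \vec{X}) = 1 - \pi(\vec{X})$, solving for $\pi(\vec{X})$ gives:

\begin{displaymath}
\log\left[ \frac{\pi(\vec{X})}{1-\pi(\vec{X})} \right] = \eta
\quad\Leftrightarrow\quad
\frac{\pi(\vec{X})}{1-\pi(\vec{X})} = \exp(\eta)
\quad\Leftrightarrow\quad
\pi(\vec{X}) = \frac{\exp(\eta)}{1+\exp(\eta)}
\end{displaymath}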

Assuming that all data entities are independent, the joint probability distribution P of our n training entities is the product of the individual distributions. Let $(\vec{x_i},y_i),\;1 \le i \le n$ be a training tuple from the data set, where $\vec{x_i} \in dom(X)$ are the values of the p input variables and $y_i \in dom(Y)=\{0,1\}$ is the corresponding correct classification:

\begin{displaymath}
\prod^n_{i=1} \pi\left(\vec{x_i}\right)^{y_i} \cdot
\left(1-\pi\left(\vec{x_i}\right)\right)^{1-y_i}
\;=\; P\left(\:(\vec{x_i},y_i)_{i=1,\ldots,n} \:;\; \vec{\beta} \:\right)
\end{displaymath}
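Maximizing this product P is equivalent to maximizing its logarithm, which is easier to differentiate. As a worked step (the symbol $\ell$ for the log-likelihood is notation introduced here):

\begin{displaymath}
\ell(\vec{\beta}) \;=\; \log P \;=\;
\sum^n_{i=1} \Big[\, y_i \log \pi(\vec{x_i}) \,+\, (1-y_i) \log\big(1-\pi(\vec{x_i})\big) \,\Big]
\end{displaymath}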

This joint distribution $P(\:(\vec{x_i},y_i)_{i=1,\ldots,n} \:;\; \vec{\beta} \:)$ depends on the model parameters $\vec{\beta} = (\beta_0,\ldots,\beta_p)$ and on our training set $(\vec{x_i},y_i)_{i=1,\ldots, n}$, $\vec{x_i} = (x_{i1},\ldots,x_{ip}) \in dom(X)$. Given our training data we want to adjust the model parameters $\vec{\beta}$ such that the joint probability is maximized (maximum likelihood of our training data). This is done by the standard calculus for finding maxima of a function (differentiating, etc.), or numerically for more difficult model functions.
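As an illustration of the numerical route, the following is a minimal sketch (not the author's implementation) that maximizes the log-likelihood by simple gradient ascent; the function name, step size and iteration count are arbitrary choices made here for the example.

\begin{verbatim}
import numpy as np

def fit_logistic(X, y, steps=5000, lr=0.1):
    """Maximum likelihood fit by gradient ascent on the log-likelihood.

    X -- (n, p) array of input variables
    y -- (n,)   array of class labels in {0, 1}
    Returns the estimated coefficient vector (beta_0, ..., beta_p).
    """
    n, p = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])      # prepend a 1-column for beta_0
    beta = np.zeros(p + 1)
    for _ in range(steps):
        pi = 1.0 / (1.0 + np.exp(-Xb @ beta)) # pi(x_i), equation (16)
        grad = Xb.T @ (y - pi)                # gradient of the log-likelihood
        beta += lr / n * grad
    return beta
\end{verbatim}

The returned coefficients can then be used directly in the decision rule sketched earlier in this section.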

