Motivation. The inspiration comes from Maximum Likelihood Estimators. Consider classifying a datapoint $x$ into two classes $C_1$ and $C_2$. Treating the weights as a parameter, the classification output $p(C_k \mid x)$ is a probability distribution over the classes. Two distributions are in play:

  • $p(C_k \mid x)$: given datapoint $x$, the probability of it being in class $C_k$. The pdf of this is what we want to find.
  • $p(x \mid C_k)$: given that the data is in class $C_k$, what is the distribution of the datapoints? We know this from the data. We will construct the former from the latter.

$p(x \mid C_k)$ as a Multivariate Normal Distribution

If we model $p(x \mid C_k)$ as a multivariate normal distribution, it has the probability density function:

$$p(x \mid C_k) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu_k)^{T} \Sigma^{-1} (x-\mu_k)\right)$$

where

  • $\mu_k$ is the mean point of the datapoints in class $C_k$
  • $\Sigma$ is the covariance matrix for this distribution

Both of these we do not know yet. Then, using Bayes' Rule:

$$p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)} = \frac{e^{a_k}}{\sum_j e^{a_j}} = \operatorname{softmax}(a)_k, \qquad a_k = \ln\!\big(p(x \mid C_k)\,p(C_k)\big)$$
where

  • See ^unj2yy for the definition
  • Softmax ensures this is a probability distribution.

Now we calculate $a_k$. Substituting the multivariate normal pdf into $a_k$:

$$a_k = \ln p(x \mid C_k) + \ln p(C_k) = -\frac{1}{2}(x-\mu_k)^{T}\Sigma^{-1}(x-\mu_k) - \frac{d}{2}\ln(2\pi) - \frac{1}{2}\ln|\Sigma| + \ln p(C_k)$$

With $\Sigma$ shared across classes, every term that does not depend on $k$ (including the quadratic $-\frac{1}{2}x^{T}\Sigma^{-1}x$ from expanding the square) is a common offset and cancels inside the softmax, leaving a linear function:

$$a_k = w_k^{T} x + b_k, \qquad w_k = \Sigma^{-1}\mu_k, \qquad b_k = -\frac{1}{2}\mu_k^{T}\Sigma^{-1}\mu_k + \ln p(C_k)$$
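As a quick numerical sanity check on this step (the parameter values below are made-up illustrative numbers, and `softmax` is a small helper defined here, not something from this note): with a shared covariance, the softmax of the full $a_k = \ln p(x \mid C_k) + \ln p(C_k)$ matches the softmax of the linear form $w_k^{T}x + b_k$, because the $k$-independent terms cancel.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtract max for numerical stability
    return e / e.sum()

# Made-up illustrative parameters: two classes in 2-D, shared covariance
mus = np.array([[0.0, 0.0], [2.0, 1.0]])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = np.array([0.6, 0.4])
x = np.array([1.0, 0.5])

Sigma_inv = np.linalg.inv(Sigma)
d = 2

# Full a_k = ln p(x | C_k) + ln p(C_k), including the k-independent terms
def a_full(k):
    diff = x - mus[k]
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(priors[k]))

# Linear form: w_k = Sigma^{-1} mu_k, b_k = -1/2 mu_k^T Sigma^{-1} mu_k + ln p(C_k)
def a_linear(k):
    w = Sigma_inv @ mus[k]
    b = -0.5 * mus[k] @ Sigma_inv @ mus[k] + np.log(priors[k])
    return w @ x + b

full = softmax(np.array([a_full(0), a_full(1)]))
linear = softmax(np.array([a_linear(0), a_linear(1)]))
assert np.allclose(full, linear)  # quadratic and constant terms cancelled
```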
Finally, we plug this into our softmax function to get:

$$p(C_k \mid x) = \frac{e^{w_k^{T} x + b_k}}{\sum_j e^{w_j^{T} x + b_j}}$$

i.e., the posterior is a softmax of a linear function of $x$.
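Putting the derivation together, here is a minimal sketch of the resulting classifier, assuming a shared covariance; the means, covariance, priors, and the helper names (`predict_proba`, `gaussian_pdf`) are illustrative choices, not from this note. It also cross-checks the linear-softmax posterior against Bayes' Rule applied to the full Gaussian densities.

```python
import numpy as np

# Illustrative (assumed) parameters: two classes in 2-D, shared covariance
mus = np.array([[0.0, 0.0], [2.0, 1.0]])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = np.array([0.6, 0.4])            # p(C_k)

Sigma_inv = np.linalg.inv(Sigma)

# w_k = Sigma^{-1} mu_k (rows of W), b_k = -1/2 mu_k^T Sigma^{-1} mu_k + ln p(C_k)
W = mus @ Sigma_inv
b = -0.5 * np.einsum('ki,ij,kj->k', mus, Sigma_inv, mus) + np.log(priors)

def predict_proba(x):
    """p(C_k | x) = softmax(w_k^T x + b_k)."""
    a = W @ x + b
    e = np.exp(a - a.max())              # subtract max for numerical stability
    return e / e.sum()

x = np.array([1.0, 0.5])
p = predict_proba(x)

# Cross-check against Bayes' Rule with the full multivariate normal densities
def gaussian_pdf(x, mu):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ Sigma_inv @ diff) / norm

joint = np.array([gaussian_pdf(x, mu) * pi for mu, pi in zip(mus, priors)])
assert np.allclose(p, joint / joint.sum())
```

The means, covariance, and priors here are given by hand; estimating them from data is exactly the job of the next section.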
Finding the Optimal Parameters