Comparing SVM and Logistic Regression
What Are SVM and Logistic Regression?
SVM
Determines the binary classification result by finding the optimal decision boundary (maximum-margin hyperplane) between the two classes.
Non-linear feature mapping via the kernel trick, e.g. polynomial or radial basis function (RBF) kernels.
Discriminative model: infers the output class y based on the evidence x.
The raw output of an SVM lies in the range (-∞, ∞); a positive sign represents the positive class and a negative sign the negative class.
Note: SVM can also output a probability after performing calibration (e.g. Platt scaling), as sketched below.
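A minimal scikit-learn sketch of the behaviour above (the dataset, kernel, and parameters are illustrative assumptions): `decision_function` returns the signed score in (-∞, ∞), and `probability=True` enables Platt-scaling calibration so `predict_proba` becomes available.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Signed distance to the decision boundary; the sign determines the predicted class.
svm = SVC(kernel="rbf").fit(X, y)
print(svm.decision_function(X[:3]))   # values in (-inf, inf)
print(svm.predict(X[:3]))             # signs mapped to class labels

# probability=True runs Platt-scaling calibration internally.
svm_prob = SVC(kernel="rbf", probability=True).fit(X, y)
print(svm_prob.predict_proba(X[:3]))  # calibrated probabilities in [0, 1]
```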
Logistic Regression
Determines the binary classification result by estimating the probability of the positive class.
Non-linear feature mapping via the kernel trick for Kernel Logistic Regression, e.g. polynomial or radial basis function (RBF) kernels.
Note: Kernel Logistic Regression is not supported by scikit-learn.
Discriminative model: infers the output class y based on the evidence x.
The output of Logistic Regression lies in the range [0, 1] and represents the probability of the positive class.
Note: A more precise probability can be obtained after performing calibration.
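A minimal scikit-learn sketch of the probabilistic output described above (the dataset, solver, and calibration method are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# predict_proba returns values in [0, 1] interpreted as P(y = 1 | x).
lr = LogisticRegression().fit(X, y)
print(lr.predict_proba(X[:3]))

# Optional calibration step (isotonic regression here) for better-calibrated probabilities.
lr_cal = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=5).fit(X, y)
print(lr_cal.predict_proba(X[:3]))
```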
How to Choose Between SVM and Logistic Regression
Visualize the Data Density on the Decision Boundary
- Low data density on the decision boundary -> SVM
Note: Low data density means classification near the boundary is close to a black-or-white decision, so it is reasonable to optimize the margin around the decision boundary.
- High data density on the decision boundary -> Logistic Regression
Note: High density indicates ambiguity in whether the output should be positive or negative, so a probability estimate is more informative.
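One assumed way to quantify this density check (not a prescribed recipe): fit a quick linear SVM and count how many samples fall inside its margin band.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# Samples whose |decision_function| < 1 sit inside the margin band.
margins = clf.decision_function(X)
near_boundary = np.abs(margins) < 1.0
print(f"{near_boundary.mean():.1%} of samples lie close to the decision boundary")
# Small ratio -> clean separation, SVM is a good fit.
# Large ratio -> ambiguity, Logistic Regression probabilities are more informative.
```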
Decide a Linear or Non-Linear Kernel
- Small feature size and large data -> non-linear kernel.
Note: Because the data set is large enough, we can introduce a more complex model to get better performance.
- Large feature size and small data -> linear kernel.
Note:
- Because the data size is small, it is better to choose a simple model to prevent over-fitting.
- If the feature size is too large, it is recommended to visualize the data distribution in the feature space to see whether the decision boundary will be smooth or not.
- Small feature size and extra large data -> linear kernel.
Note: Solving the quadratic programming problem of a kernel SVM on an extra-large data set would incur substantial computational cost, so a linear model is preferred; see the sketch below.
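The heuristics above map roughly to the following scikit-learn choices (the specific estimators are assumptions for illustration):

```python
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier

# Small feature size, large data -> non-linear kernel.
clf_nonlinear = SVC(kernel="rbf")

# Large feature size, small data -> linear kernel (simpler model, less over-fitting).
clf_linear = LinearSVC()

# Small feature size, extra-large data -> linear model trained with SGD,
# which avoids the cost of kernel-SVM quadratic programming.
clf_large_scale = SGDClassifier(loss="hinge")
```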
Ways of Implementation
SVM
Scikit-Learn
- Optimization: quadratic programming
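A minimal fit with scikit-learn's SVC, whose libsvm backend solves the dual quadratic programming problem (the toy data is an assumption):

```python
from sklearn.svm import SVC

X = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]]
y = [0, 0, 1, 1]

clf = SVC(kernel="rbf")
clf.fit(X, y)
print(clf.support_vectors_)  # training points that define the margin
```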
Pytorch
- Model: linear layer
- Loss function: hinge loss
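A minimal PyTorch sketch of the linear-layer-plus-hinge-loss setup (layer size, random data, and training loop details are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(in_features=5, out_features=1)          # raw score f(x) in (-inf, inf)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

X = torch.randn(64, 5)
y = torch.randint(0, 2, (64, 1)).float() * 2 - 1          # labels in {-1, +1}

for _ in range(100):
    scores = model(X)
    # Hinge loss: mean(max(0, 1 - y * f(x))).
    loss = torch.clamp(1 - y * scores, min=0).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

pred = torch.sign(model(X))  # positive sign -> positive class
```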
Logistic Regression
Scikit-Learn
- Optimization: SGD (SGDClassifier with logistic loss) or batch solvers such as lbfgs (the LogisticRegression default)
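A sketch of both options (the data is assumed; the loss name "log_loss" assumes scikit-learn >= 1.1, older versions use "log"):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

lr = LogisticRegression(solver="lbfgs").fit(X, y)    # batch solver
sgd_lr = SGDClassifier(loss="log_loss").fit(X, y)    # SGD on the logistic loss
print(lr.predict_proba(X[:3]))
```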
Pytorch
- Model: linear layer + sigmoid layer
- Loss function: binary cross entropy
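A minimal PyTorch sketch of the linear-plus-sigmoid model trained with binary cross entropy (sizes, random data, and training loop are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(5, 1), nn.Sigmoid())   # outputs P(y = 1 | x) in [0, 1]
loss_fn = nn.BCELoss()                                  # binary cross entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(64, 5)
y = torch.randint(0, 2, (64, 1)).float()                # labels in {0, 1}

for _ in range(100):
    prob = model(X)
    loss = loss_fn(prob, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# In practice, nn.BCEWithLogitsLoss on the raw linear output is more numerically stable.
```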