Marginal Imbalance:
A dataset is marginally imbalanced if one class is rare compared to the other class:
$\Pr(Y=1) \approx 0$
Conditional Imbalance:
A dataset is conditionally imbalanced when the correct label is easy to predict for most cases, i.e. the class probabilities are close to 0 or 1 given the features:
$\Pr(Y=1 \mid X=0) \approx 0 \quad \text{and} \quad \Pr(Y=1 \mid X=1) \approx 1$
Sub-Sampling (Down-sampling):
Subsampling is a method for reducing data size by selecting a subset of the original data, typically by down-sampling the majority class. The imbalance is meant to be kept constant for both the training and test sets.
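Down-sampling the majority class can be sketched as follows with NumPy; the class sizes (1000 controls, 50 cases) and the single feature are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 1000 controls (y=0) and 50 cases (y=1); one feature.
y = np.array([0] * 1000 + [1] * 50)
X = rng.normal(loc=y, size=y.size)

# Down-sample: keep a random subset of the majority class
# equal in size to the minority class.
minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]
keep = rng.choice(majority_idx, size=minority_idx.size, replace=False)
idx = np.concatenate([keep, minority_idx])

X_down, y_down = X[idx], y[idx]
print(y_down.size, y_down.mean())  # 100 samples, event rate 0.5
```

Note that `replace=False` discards most of the majority class, so down-sampling trades information for balance.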
Up-Sampling:
In this approach, cases from the minority class are sampled with replacement until each class has approximately the same size.
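A minimal up-sampling sketch in the same style (the class sizes are again invented): minority cases are drawn with replacement until they match the majority-class count.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced data: 1000 controls (y=0) and 50 cases (y=1); one feature.
y = np.array([0] * 1000 + [1] * 50)
X = rng.normal(loc=y, size=y.size)

minority_idx = np.where(y == 1)[0]
majority_idx = np.where(y == 0)[0]

# Up-sample: draw minority cases WITH replacement up to the majority size.
extra = rng.choice(minority_idx, size=majority_idx.size, replace=True)
idx = np.concatenate([majority_idx, extra])

X_up, y_up = X[idx], y[idx]
print(y_up.size, y_up.mean())  # 2000 samples, event rate 0.5
```

Because sampling is with replacement, the up-sampled data contains exact duplicates of minority cases, which can encourage overfitting.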
SMOTE (Synthetic Minority Over-sampling Technique):
Instead of duplicating minority cases, SMOTE creates synthetic minority examples by interpolating between a minority case and one of its nearest minority-class neighbours.
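The SMOTE interpolation step can be sketched as below. This is a bare-bones illustration, not the reference implementation: the function name `smote_sample`, the brute-force neighbour search, and the parameter defaults are all choices made for this sketch.

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=100, rng=None):
    """Generate synthetic minority points by interpolating between a
    randomly chosen minority point and one of its k nearest minority
    neighbours (a simplified SMOTE sketch)."""
    rng = rng or np.random.default_rng()
    # Pairwise Euclidean distances within the minority class (brute force).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude each point as its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbours
    base = rng.integers(0, len(X_min), size=n_new)   # anchor minority points
    nb = nn[base, rng.integers(0, k, size=n_new)]    # one random neighbour each
    gap = rng.random((n_new, 1))                     # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[nb] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))                 # 20 minority cases, 2 features
X_syn = smote_sample(X_min, k=3, n_new=50, rng=rng)
print(X_syn.shape)                               # (50, 2)
```

Each synthetic point lies on the line segment between a real minority point and a neighbour, so SMOTE fills in the minority region rather than replaying the same points.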
After fitting a logistic model on resampled (balanced) data, the intercept can be corrected for the true prevalence. On the resampled data the fitted intercept satisfies
$$ \tilde\pi \approx \frac{e^{\hat\beta_0}}{1+e^{\hat\beta_0}} \quad\Longrightarrow\quad \hat\beta_0 \approx \log\Big(\frac{\tilde\pi}{1-\tilde\pi}\Big) $$
$$ \hat\beta_0^* \approx \log\Big(\frac{\pi}{1-\pi}\Big) $$
$$ \hat\beta_0^* = \hat\beta_0 - \log\Big(\frac{\tilde\pi}{1-\tilde\pi}\Big) + \log\Big(\frac{\pi}{1-\pi}\Big) $$
Here $\tilde\pi$ is the estimated event probability in the resampled data and $\pi$ is the prevalence of MI:
$$ \pi = \frac{n_{cases}}{n_{controls}+n_{cases}} $$
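The intercept correction above can be checked numerically; the counts (1000 controls, 50 cases) and the perfectly balanced resample ($\tilde\pi = 0.5$) are assumptions for illustration:

```python
import numpy as np

# Assumed values for illustration:
pi_tilde = 0.5                   # event rate in the balanced (resampled) data
pi = 50 / (1000 + 50)            # prevalence: n_cases / (n_controls + n_cases)

# Intercept implied by the balanced data (here log(1) = 0).
beta0_hat = np.log(pi_tilde / (1 - pi_tilde))

# Shift the intercept from the resampled log-odds to the true log-odds.
beta0_star = beta0_hat - np.log(pi_tilde / (1 - pi_tilde)) + np.log(pi / (1 - pi))

print(beta0_star)  # equals log(50/1000), roughly -2.996
```

Only the intercept changes; the slope coefficients fitted on the resampled data are left as they are.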