A clustering method.
Can also be used as a generative model.
Represents a given dataset as a combination of multiple Gaussian distributions.
Provides a probability density function, which explains its use as a generative model.
The number of components (clusters) is a hyperparameter that must be chosen, though it can be selected automatically with criteria such as BIC or AIC.
Yields the distribution of the latent variable, i.e. which cluster each sample belongs to; a quick usage sketch follows below.
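As a quick, hedged illustration of these points (not part of the original write-up), the sketch below uses scikit-learn's GaussianMixture, which implements the same model; the toy data, n_components, and random_state values are made up for the example.

import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 2-D data drawn from two blobs (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(4, 1, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)             # clustering: hard cluster assignments
log_density = gmm.score_samples(X)  # the model is an explicit probability density
new_points, _ = gmm.sample(10)      # generative use: draw new points
print(gmm.bic(X))                   # BIC can be compared across n_components to choose K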
$$
N(x\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\}
$$
$$
N(\mathbf x\mid\boldsymbol\mu,\Sigma)=\frac{1}{(2\pi)^{\frac{m}{2}}}\frac{1}{|\Sigma|^{\frac{1}{2}}}\exp\left\{-\frac{1}{2}(\mathbf x-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf x-\boldsymbol\mu)\right\}\quad\left(\mathbf x:\text{random variable vector},\ \boldsymbol\mu:\text{mean vector},\ \Sigma:\text{covariance matrix}\right)
$$
$$
p(\mathbf x)=\sum_{k=1}^{K}\pi_k N(\mathbf x\mid\boldsymbol\mu_k,\Sigma_k)\quad\left(K:\text{number of Gaussian components},\ k:\text{index of the }k\text{-th component},\ \pi_k:\text{mixture coefficient, the weight of the }k\text{-th Gaussian},\ \sum_{k=1}^{K}\pi_k=1\right)
$$
Reference: https://datachemeng.com/wp-content/uploads/gaussianmixturemodel.pdf
The distribution of interest here is "which cluster a sample belongs to, given that sample." If we define the latent variable as $\mathbf z$, then $\mathbf z$ is
a one-hot vector: the entry for the cluster the sample belongs to is 1 and all other entries are 0.
When nothing is known about the sample, the prior probability that $z_k=1$ is set to the mixture coefficient:
$$
p(z_k=1)=\pi_k
$$
If the mixture coefficients were not introduced, the mixture density would not integrate to 1 (simply summing the component densities would clearly give a total greater than 1); a small numerical check follows below.
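A minimal 1-D numerical check of this normalization (not in the original; the weights, means, and standard deviations below are arbitrary), assuming SciPy is available:

import numpy as np
from scipy.stats import norm

# Arbitrary 1-D mixture whose weights sum to 1
pis = [0.3, 0.5, 0.2]
mus = [-2.0, 0.0, 3.0]
sds = [0.8, 1.0, 1.5]

xs = np.linspace(-15, 15, 20001)
dx = xs[1] - xs[0]
mixture = sum(p * norm.pdf(xs, m, s) for p, m, s in zip(pis, mus, sds))
print(np.sum(mixture) * dx)  # approximately 1.0

# Summing the components without the weights integrates to 3, not 1
unweighted = sum(norm.pdf(xs, m, s) for m, s in zip(mus, sds))
print(np.sum(unweighted) * dx)  # approximately 3.0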
Using Bayes' theorem, the probability that $z_k=1$ given a sample $\mathbf x$ is
$$
p(z_k=1\mid\mathbf x)=\frac{p(z_k=1)\,p(\mathbf x\mid z_k=1)}{\sum_{i=1}^{K}p(z_i=1)\,p(\mathbf x\mid z_i=1)}=\frac{\pi_k\,p(\mathbf x\mid z_k=1)}{\sum_{i=1}^{K}\pi_i\,p(\mathbf x\mid z_i=1)}=\frac{\pi_k N(\mathbf x\mid\boldsymbol\mu_k,\Sigma_k)}{\sum_{i=1}^{K}\pi_i N(\mathbf x\mid\boldsymbol\mu_i,\Sigma_i)}
$$
From this,
$$
k^\star=\arg\max_k\,p(z_k=1\mid\mathbf x)
$$
By taking this argmax, the cluster that a given point $\mathbf x$ belongs to can be estimated; a small worked example follows below.
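A minimal 1-D worked example of this assignment rule (not in the original; the two components, their weights, and the query point are made up), assuming SciPy is available:

import numpy as np
from scipy.stats import norm

# Two 1-D Gaussian components with mixture weights pi
pi = np.array([0.6, 0.4])
mu = np.array([0.0, 3.0])
sd = np.array([1.0, 0.5])

x = 2.0                            # query point
numer = pi * norm.pdf(x, mu, sd)   # pi_k * N(x | mu_k, sigma_k)
posterior = numer / numer.sum()    # p(z_k = 1 | x) via Bayes' theorem
k_star = np.argmax(posterior)      # cluster with the highest posterior probability
print(posterior, k_star)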
import numpy as np


def gaussian_density(x, mu, sigma):
    # Multivariate Gaussian density N(x | mu, Sigma)
    diff = x - mu
    sigma_inv = np.linalg.inv(sigma)
    sigma_det = np.linalg.det(sigma)
    z = np.exp(-np.dot(diff.T, np.dot(sigma_inv, diff)) / 2)
    return z / np.sqrt(np.power(2 * np.pi, len(mu)) * sigma_det)


def mixture_gaussian_density(x, mu_list, sigma_list, pi_list):
    # Mixture density p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
    z = 0
    for i in range(len(pi_list)):
        z += pi_list[i] * gaussian_density(x, mu_list[i], sigma_list[i])
    return z
Tip
Only the probability density function has been implemented so far, so separate code is needed to draw samples from it.
def sample_multivariate_gaussian(mu, sigma, num_samples=1):
    # Draw samples from a single multivariate Gaussian
    return np.random.multivariate_normal(mu, sigma, num_samples)


def sample_mixture_gaussian(mu_list, sigma_list, pi_list, num_samples=100):
    # Decide how many samples each cluster gets (multinomial with weights pi),
    # then draw that many samples from the corresponding Gaussian
    samples = []
    num_clusters = len(pi_list)
    cluster_sizes = np.random.multinomial(num_samples, pi_list)
    for i in range(num_clusters):
        cluster_samples = sample_multivariate_gaussian(mu_list[i], sigma_list[i], cluster_sizes[i])
        samples.append(cluster_samples)
    return np.vstack(samples)
# True parameters of the mixture used to generate the data
mu0 = np.array([0, -0.5])
sigma0 = np.array([[1.0, 0], [0, 1.0]])
mu1 = np.array([2.5, 2])
sigma1 = np.array([[0.5, 0.3], [0.3, 0.7]])
mu2 = np.array([-2, 1.5])
sigma2 = np.array([[1.2, 0.2], [0.2, 0.4]])
mu_list = [mu0, mu1, mu2]
sigma_list = [sigma0, sigma1, sigma2]
pi_list = [0.45, 0.25, 0.3]

NUM_DATA = 500
samples = sample_mixture_gaussian(mu_list, sigma_list, pi_list, num_samples=NUM_DATA)
samples[:10]
import matplotlib.pyplot as plt


def plot_mixture_gaussian(mu_list, sigma_list, pi_list, samples=None, figsize=(8, 6)):
    # Evaluate the mixture density on a grid
    x_range = np.linspace(-5, 5, 100)
    y_range = np.linspace(-5, 5, 100)
    X, Y = np.meshgrid(x_range, y_range)
    Z = np.zeros((len(x_range), len(y_range)))
    for i in range(len(x_range)):
        for j in range(len(y_range)):
            x = np.array([x_range[i], y_range[j]])
            Z[i, j] = mixture_gaussian_density(x, mu_list, sigma_list, pi_list)
    plt.figure(figsize=figsize)
    if samples is not None:
        # Scatter axes are swapped to stay consistent with the orientation of Z
        # (Z[i, j] holds the density at (x_range[i], y_range[j]), which contour displays transposed)
        plt.scatter(samples[:, 1], samples[:, 0], color='red', s=10, label='Samples')
    plt.contour(X, Y, Z, cmap='viridis')
    plt.colorbar(label='Density')
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title('Mixture Gaussian Distribution')
    plt.show()


plot_mixture_gaussian(mu_list, sigma_list, pi_list, samples)
To be considered later
"Likelihood maximization" is one of the methods to estimate the distribution from the sample data.
$$
\text{Likelihood function: }{\cal L}(\theta;\mathbf x)=\prod_{i=1}^{n}P(x_i;\theta),\quad\text{Maximum likelihood estimator: }\theta^\star=\arg\max_\theta{\cal L}(\theta;\mathbf x)\quad(n:\text{number of samples})
$$
Reference: https://cochineal19.hatenablog.com/entry/2021/11/08/003751
The likelihood function is hard to work with directly because it is a product of many probabilities, so the log-likelihood is commonly used instead.
This has no effect on where the maximum is attained, because the logarithm is a monotonically increasing function; the small sketch below illustrates the idea for a single Gaussian.
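A minimal sketch (not in the original): maximizing the log-likelihood of i.i.d. Gaussian samples over the mean, with the variance held fixed, recovers the sample mean. The true parameters, sample size, and search grid are arbitrary choices for illustration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
data = rng.normal(loc=1.5, scale=2.0, size=1000)

def log_lik(mu):
    # log-likelihood of the data as a function of the mean (scale fixed at 2.0)
    return np.sum(norm.logpdf(data, loc=mu, scale=2.0))

grid = np.linspace(-5, 5, 2001)
mu_hat = grid[np.argmax([log_lik(m) for m in grid])]
print(mu_hat, data.mean())  # both close to the true mean of 1.5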
$$
{\cal L}(\mu,\Sigma,\pi;\mathbf x)=\prod_{i=1}^{n}p(\mathbf x_i;\mu,\Sigma,\pi)=\prod_{i=1}^{n}\sum_{k=1}^{K}\pi_k N(\mathbf x_i\mid\boldsymbol\mu_k,\Sigma_k)\\
\to\log{\cal L}(\mu,\Sigma,\pi;\mathbf x)=\log\prod_{i=1}^{n}\sum_{k=1}^{K}\pi_k N(\mathbf x_i\mid\boldsymbol\mu_k,\Sigma_k)=\sum_{i=1}^{n}\log\sum_{k=1}^{K}\pi_k N(\mathbf x_i\mid\boldsymbol\mu_k,\Sigma_k)
$$
def log_likelihood(mu_list, sigma_list, pi_list, sample):
    # sum_i log p(x_i) under the given mixture parameters
    ll = 0
    for i in range(len(sample)):
        ll += np.log(mixture_gaussian_density(sample[i], mu_list, sigma_list, pi_list))
    return ll
Since we need the posterior distribution $p(z\mid x)$, obtained from $p(x\mid z)$ via Bayes' theorem, the posterior probability of each cluster for a given data point $\mathbf x_i$ is expressed as follows:
$$
p_{\mu,\Sigma,\pi}(z_{ik}=1\mid\mathbf x_i)=\frac{\pi_k{\cal N}(\mathbf x_i;\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j{\cal N}(\mathbf x_i;\mu_j,\Sigma_j)}\equiv\gamma(z_{ik})
$$
Intuitively, this measures how well each cluster's distribution explains the given data point.
def responsibility(data, mu_list, sigma_list, pi_list):
    # gamma[i, k] is proportional to pi_k * N(x_i | mu_k, Sigma_k), normalized over k
    gamma = np.zeros((len(data), len(pi_list)))
    for i in range(len(data)):
        for j in range(len(pi_list)):
            gamma[i, j] = pi_list[j] * gaussian_density(data[i], mu_list[j], sigma_list[j])
        gamma[i] /= np.sum(gamma[i])
    return gamma
To perform classification, we need to find:
$$
k_i^\star=\arg\max_k\gamma(z_{ik})
$$
This gives us the cluster with the highest responsibility for each data point.
import pandas as pd
import seaborn as sns

# Columns are swapped to match the plotting orientation used above
df = pd.DataFrame(samples[:, [1, 0]], columns=['x', 'y'])
gamma = responsibility(samples, mu_list, sigma_list, pi_list)
df['gamma0'] = gamma[:, 0]
df['gamma1'] = gamma[:, 1]
df['gamma2'] = gamma[:, 2]
# Hard assignment: the cluster with the highest responsibility
df['z_star'] = df[['gamma0', 'gamma1', 'gamma2']].idxmax(axis=1)

x_range = np.linspace(-5, 5, 100)
y_range = np.linspace(-5, 5, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = np.zeros((len(x_range), len(y_range)))
for i in range(len(x_range)):
    for j in range(len(y_range)):
        x = np.array([x_range[i], y_range[j]])
        Z[i, j] = mixture_gaussian_density(x, mu_list, sigma_list, pi_list)

sns.scatterplot(x='x', y='y', hue='z_star', data=df, palette='viridis')
plt.contour(X, Y, Z, cmap='viridis')
plt.colorbar(label='Density')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Mixture Gaussian Distribution')
plt.show()
To maximize the log-likelihood, we need to solve:
$$
\arg\max_{\mu,\Sigma,\pi}\log{\cal L}(\mu,\Sigma,\pi;\mathbf x)=\arg\max_{\mu,\Sigma,\pi}\sum_{i=1}^{n}\log\sum_{k=1}^{K}\pi_k N(\mathbf x_i\mid\boldsymbol\mu_k,\Sigma_k)
$$
In this case, the log of a sum makes an analytical solution difficult, so we optimize it with the Expectation-Maximization (EM) algorithm.
Initialization: set the parameters $\mu_k$, $\Sigma_k$, $\pi_k$ randomly.
E step: calculate the responsibility of each cluster for each data point under the current parameters:
$$
p_{\mu,\Sigma,\pi}(z_{ik}=1\mid\mathbf x_i)=\frac{\pi_k{\cal N}(\mathbf x_i;\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j{\cal N}(\mathbf x_i;\mu_j,\Sigma_j)}\equiv\gamma(z_{ik})
$$
M step: update the parameters to the values that maximize the likelihood given the responsibilities:
$$
\begin{aligned}
&\frac{\partial{\cal L}}{\partial\mu_k}=0\to\mu_k=\frac{1}{N_k}\sum_{i=1}^{n}\gamma(z_{ik})\mathbf x_i\\
&\frac{\partial{\cal L}}{\partial\Sigma_k}=0\to\Sigma_k=\frac{1}{N_k}\sum_{i=1}^{n}\gamma(z_{ik})(\mathbf x_i-\mu_k)(\mathbf x_i-\mu_k)^T
\end{aligned}
\quad\left(N_k=\sum_{i=1}^{n}\gamma(z_{ik})\right)
$$
For $\pi_k$, since $\sum_{k=1}^K\pi_k=1$, we use the Lagrange multiplier method to maximize the likelihood (a vectorized sketch of all three updates appears after this list):
$$
\frac{\partial G}{\partial\pi_k}=0\to\pi_k=\frac{N_k}{\sum_{k=1}^{K}N_k}\quad\left(G={\cal L}+\lambda\left(\sum_{k=1}^{K}\pi_k-1\right)\right)
$$
Convergence check: if the change in the log-likelihood falls below a predefined threshold $\epsilon$:
$$
{\cal L}_{\mathrm{new}}-{\cal L}_{\mathrm{old}}<\epsilon
$$
then stop the iterations; otherwise, repeat the E and M steps.
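Before the loop implementation below, here is a hedged, vectorized sketch of the M-step updates above (not part of the original; the function name m_step and its argument layout are assumptions for illustration):

import numpy as np

def m_step(samples, gamma):
    # samples: (n, d) data matrix, gamma: (n, K) responsibilities
    n_k = gamma.sum(axis=0)                           # N_k = sum_i gamma(z_ik)
    pi_next = n_k / n_k.sum()                         # pi_k = N_k / n
    mu_next = (gamma.T @ samples) / n_k[:, None]      # mu_k = (1/N_k) sum_i gamma(z_ik) x_i
    diff = samples[None, :, :] - mu_next[:, None, :]  # shape (K, n, d)
    # Sigma_k = (1/N_k) sum_i gamma(z_ik) (x_i - mu_k)(x_i - mu_k)^T
    sigma_next = np.einsum('kn,kni,knj->kij', gamma.T, diff, diff) / n_k[:, None, None]
    return mu_next, sigma_next, pi_next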
K = 3

# Random initialization of the parameters
mu_list = [np.random.randn(2) for _ in range(K)]
sigma_list = []
for _ in range(K):
    A = np.random.randn(2, 2)
    sigma = np.dot(A, A.T)  # A @ A.T is symmetric positive semi-definite
    sigma_list.append(sigma)
pi_list = np.random.rand(K)
pi_list = pi_list / np.sum(pi_list)
print("mu_list:", mu_list)
print("sigma_list:", sigma_list)
print("pi_list:", pi_list)

n_iter = 0
# Average log-likelihood per sample under the initial parameters
likely = log_likelihood(mu_list, sigma_list, pi_list, samples) / NUM_DATA
print('Iteration: {0}, log_likelihood: {1}'.format(n_iter, likely))
plot_mixture_gaussian(mu_list, sigma_list, pi_list, samples, figsize=(4, 3))
import copy

th = 0.0001  # convergence threshold on the change in average log-likelihood
while True:
    n_iter += 1
    # E step: responsibilities under the current parameters
    gamma = responsibility(samples, mu_list, sigma_list, pi_list)
    # M step: update pi, mu and sigma using the responsibilities
    n_k = np.sum(gamma, axis=0)
    pi_list_next = (n_k / n_k.sum()).tolist()
    mu_list_next = list(((samples.T @ gamma) / n_k).T)
    sigma_list_next = []
    for k in range(len(pi_list)):
        sigma_k = np.zeros_like(sigma_list[k], dtype=float)
        for i in range(len(samples)):
            # Use the newly updated mean, as in the M-step formula
            sigma_k += gamma[i, k] * np.matmul(
                (samples[i] - mu_list_next[k]).reshape(-1, 1),
                (samples[i] - mu_list_next[k]).reshape(1, -1)
            )
        sigma_list_next.append(sigma_k / n_k[k])
    mu_list = copy.deepcopy(mu_list_next)
    sigma_list = copy.deepcopy(sigma_list_next)
    pi_list = copy.deepcopy(pi_list_next)
    # Convergence check on the average log-likelihood
    likely_before = likely
    likely = log_likelihood(mu_list, sigma_list, pi_list, samples) / NUM_DATA
    print('Iteration: {0}, log_likelihood: {1}'.format(n_iter, likely))
    plot_mixture_gaussian(mu_list, sigma_list, pi_list, samples, figsize=(4, 3))
    delta = likely - likely_before
    if delta < th:
        break

print("mu_list:", mu_list)
print("sigma_list:", sigma_list)
print("pi_list:", pi_list)