
Density Estimation with High-dimensional Data

Objectives

Combine principal component analysis with the Gaussian mixture model to solve high-dimensional density estimation problems.


In this hands-on activity we are going to create a model that can sample hand-written digits. To achieve this, we are going to use PCA to reduce the dimensionality of the MNIST images and then apply Gaussian mixture density estimation to the principal components. The resulting model will not be perfect, but it is very simple and a decent start. For simplicity, we are going to work only with the threes.

Start by loading the data and extracting the threes:


import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
threes = x_train[y_train == 3]
vectorized_threes = threes.reshape((threes.shape[0], threes.shape[1] * threes.shape[2]))

Apply PCA:

from sklearn.decomposition import PCA
# How many PCA components you want to keep:
num_components = 2
pca = PCA(n_components=num_components, whiten=True).fit(vectorized_threes)
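To get a feel for what a two-component, whitened PCA keeps and discards, here is a minimal sketch. It assumes the small 8x8 digits dataset bundled with scikit-learn as a stand-in for MNIST so that it runs without a download; the names `small_threes`, `demo_pca`, and `Z_demo` are illustrative and not part of the activity.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

# Stand-in data (assumption): the 8x8 digits bundled with scikit-learn,
# used here instead of MNIST so that the sketch runs without a download.
digits = load_digits()
small_threes = digits.data[digits.target == 3]  # shape (n_samples, 64)

demo_pca = PCA(n_components=2, whiten=True).fit(small_threes)
Z_demo = demo_pca.transform(small_threes)

# Fraction of the total variance captured by the two components
explained = demo_pca.explained_variance_ratio_.sum()
print(f"Variance explained by 2 components: {explained:.2%}")

# With whiten=True each principal component has unit sample variance
component_var = Z_demo.var(axis=0, ddof=1)
print("Variance of each whitened component:", component_var)

# Map the 2D codes back to pixel space and measure the reconstruction error
reconstructed = demo_pca.inverse_transform(Z_demo)
rmse = np.sqrt(np.mean((small_threes - reconstructed) ** 2))
print(f"Reconstruction RMSE: {rmse:.3f}")
```

Two components discard a large share of the pixel variance, so reconstructions are blurry. Increasing `num_components` trades a harder density estimation problem for sharper images.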

Now use the Gaussian mixture model on the principal components:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
# The principal components:
Z = pca.transform(vectorized_threes)
# Fit mixtures with 1 to 9 components and record the BIC of each
bics = []
models = []
for nmc in range(1,10):
    m = GaussianMixture(n_components=nmc).fit(Z)
    bics.append(m.bic(Z))
    models.append(m)
bics = np.array(bics)
fig, ax = plt.subplots(dpi=150)
ax.bar(range(1, 10), bics)
ax.set_ylabel('BIC Score')
ax.set_xlabel('Number of mixture components');

Output: [Figure: bar chart of BIC score versus number of mixture components]
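For reference, the BIC score used here is -2 log L + k log n, where L is the maximized likelihood, k is the number of free parameters, and n is the number of data points; lower values are better. The following sketch recomputes it by hand and compares it with `GaussianMixture.bic`. It assumes synthetic two-dimensional data and the default full covariance matrices; the names `Z_demo`, `gmm`, and `manual_bic` are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2D data (assumption): two well-separated Gaussian blobs
Z_demo = np.vstack([
    rng.normal(loc=-3.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=3.0, scale=1.0, size=(200, 2)),
])

n_samples, dim = Z_demo.shape
num_mix = 2
gmm = GaussianMixture(n_components=num_mix, random_state=0).fit(Z_demo)

# Total log-likelihood of the data under the fitted mixture
log_likelihood = gmm.score_samples(Z_demo).sum()

# Number of free parameters when every component has a full covariance
num_params = (
    num_mix * dim                      # component means
    + num_mix * dim * (dim + 1) // 2   # symmetric covariance matrices
    + (num_mix - 1)                    # mixture weights (they sum to one)
)

manual_bic = -2.0 * log_likelihood + num_params * np.log(n_samples)
print("Manual BIC: ", manual_bic)
print("sklearn BIC:", gmm.bic(Z_demo))
```

The penalty term grows with the number of mixture components, so adding components only lowers the BIC when the gain in log-likelihood outweighs the extra parameters.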



Let's pick the mixture model with the lowest BIC:

model = models[np.argmin(bics)]
print(model)

Output:

GaussianMixture(n_components=6)
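With a mixture model selected, new digits can be generated by sampling principal components from the mixture and mapping them back to pixel space with the inverse PCA transform. The following is a minimal sketch of that idea, not a definitive implementation. It assumes the bundled 8x8 scikit-learn digits as a stand-in for MNIST and a fixed three-component mixture chosen only for illustration; `demo_pca`, `demo_model`, and `sampled_images` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Stand-in data (assumption): 8x8 digits bundled with scikit-learn
digits = load_digits()
small_threes = digits.data[digits.target == 3]

# Same pipeline as in the activity: whitened PCA, then a Gaussian mixture
demo_pca = PCA(n_components=2, whiten=True).fit(small_threes)
Z_demo = demo_pca.transform(small_threes)
# Three components is an arbitrary choice for this sketch
demo_model = GaussianMixture(n_components=3, random_state=0).fit(Z_demo)

# sample() returns the sampled points and the mixture component of each one
z_samples, component_ids = demo_model.sample(5)

# Map the sampled principal components back to pixel space
sampled_vectors = demo_pca.inverse_transform(z_samples)
sampled_images = sampled_vectors.reshape(-1, 8, 8)
print(sampled_images.shape)
```

For the MNIST data used in this activity, the same steps would use `pca` and `model` from above and reshape the sampled vectors to 28x28 images.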
