Clustering

Visitor clustering lets you use customer characteristics to dynamically categorize visitors and generate cluster sets from selected data inputs, identifying groups with similar interests and behaviors for customer analysis and targeting.

Clustering Process

The clustering process requires you to identify metrics and dimension elements to use as inputs, and lets you choose a specific target population to which those elements are applied to create the specified clusters. When you run the clustering process, the system uses the metric and dimension inputs to determine appropriate initial centers for the specified number of clusters. These centers are then used as the starting point for the K-Means algorithm.



The K-Means algorithm proceeds as follows:

  1. The initial centers are intelligently chosen via a Canopy Clustering pass.
  2. Data clusters are created by associating every data point with its nearest center.
  3. The mean of each of the K clusters becomes the new center.
  4. Steps 2 and 3 are repeated until convergence is reached. This can take multiple passes.
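To make these steps concrete, the following is a minimal sketch of the K-Means loop in Python. It is illustrative only, not the product's implementation; points is assumed to be an array of visitor feature vectors, centers an array of initial centers (for example, from a canopy-style pre-pass), and the tol and max_iterations parameters are hypothetical names.

    import numpy as np

    def kmeans(points, centers, tol=1e-4, max_iterations=100):
        """Basic K-Means loop following steps 2-4 above (illustrative sketch)."""
        for _ in range(max_iterations):
            # Step 2: associate every data point with its nearest center.
            distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = distances.argmin(axis=1)

            # Step 3: the mean of each of the K clusters becomes the new center.
            new_centers = np.array([
                points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
                for k in range(len(centers))
            ])

            # Step 4: stop when the centers no longer move (convergence).
            if np.linalg.norm(new_centers - centers) < tol:
                return new_centers, labels
            centers = new_centers
        return centers, labels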

The Maximum Iterations setting in the Options menu lets the analyst specify the maximum number of iterations the clustering algorithm performs. Setting this option can make the clustering process finish faster by capping the number of iterations, at the expense of exact convergence of the cluster centers.
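Continuing the hypothetical kmeans sketch above, the trade-off of an iteration cap can be seen by running the same data with different caps; the data and parameter names here are illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    points = rng.normal(size=(1000, 3))                    # hypothetical visitor feature vectors
    initial = points[rng.choice(len(points), 4, replace=False)]

    # A low cap finishes sooner, but the returned centers may not have fully converged.
    capped_centers, _ = kmeans(points, initial, max_iterations=5)

    # A higher cap lets the loop run until the convergence test is met.
    converged_centers, _ = kmeans(points, initial, max_iterations=200)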

Note: Once the clusters have been defined, the Cluster Dimension can be saved for use just like any other dimension. It can also be loaded into the Cluster Explorer to examine the separation of cluster centers.

KMeans Algorithms

In the Cluster Builder, you can now select Options > Algorithm to choose which algorithm to use when defining clusters.

  • KMeans. This algorithm uses canopy clustering to define the initial centers of the clusters.
  • KMeans++. This algorithm expedites cluster building when running against large sets of data.

KMeans++ is an improved implementation of the KMeans clustering algorithm because it provides better initialization of the initial k centers. (The original KMeans algorithm chooses initial centers randomly.) KMeans++ selects the first center randomly. The remaining k-1 centers are then chosen one by one based on each data point's distance to the closest existing center: data points far from the existing centers have a better chance of being chosen as a new center than nearby points. After the initial centers are chosen, the procedure is exactly the same as the original KMeans clustering.
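As an illustration, here is a minimal sketch of KMeans++-style seeding. It follows the description above; standard k-means++ weights the selection probability by squared distance to the nearest existing center, and the product's exact weighting may differ. The function name and parameters are hypothetical.

    import numpy as np

    def kmeans_plus_plus_init(points, k, seed=None):
        """Choose k initial centers: the first at random, the rest weighted by
        distance to the closest already-chosen center (illustrative sketch)."""
        rng = np.random.default_rng(seed)
        centers = [points[rng.integers(len(points))]]       # first center: uniform random
        for _ in range(k - 1):
            # Squared distance from each point to its nearest existing center.
            dist_sq = np.min(
                np.linalg.norm(points[:, None, :] - np.asarray(centers)[None, :, :], axis=2) ** 2,
                axis=1,
            )
            # Far-away points are more likely to be picked as the next center.
            probs = dist_sq / dist_sq.sum()
            centers.append(points[rng.choice(len(points), p=probs)])
        return np.asarray(centers)

The resulting centers can then be handed to the same K-Means loop sketched earlier, since everything after seeding proceeds as in the original KMeans clustering.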

The workflow for KMeans++ is exactly the same as the workflow for KMeans clustering, except that you select Options > Algorithm > KMeans++ in the Cluster Builder.

Note: Each DPU runs its own KMeans++ procedure on its own portion of the data. If the DPU has enough available memory (the ratio is configurable in the PAServer.cfg file), the data for the variables involved is brought into memory. The selection of the remaining k-1 initial centers and the converging iterations then all happen in memory, which is faster than the previous KMeans clustering.