Abstract
Clustering high-dimensional data, which now arises in almost every field, is an increasingly difficult task. Its central obstacle is the curse of dimensionality: as the dimensionality of a dataset grows, the data points become sparse and local densities drop, making the data hard to cluster and degrading the performance of traditional clustering algorithms. Semi-supervised clustering algorithms aim to improve clustering results using limited supervision. The supervision is generally given as pairwise constraints; such constraints are natural for graphs, yet most semi-supervised clustering algorithms are designed for data represented as vectors [2]. In this paper, we unify vector-based and graph-based approaches. We first show that a recently proposed objective function for semi-supervised clustering based on Hidden Markov Random Fields, with squared Euclidean distance and a certain class of constraint penalty functions, can be expressed as a special case of the global kernel k-means objective [3]. A recent theoretical connection between global kernel k-means and several graph clustering objectives then enables us to perform semi-supervised clustering of data. In particular, several methods have been proposed for semi-supervised clustering based on pairwise similarity or dissimilarity information. We propose a kernel approach for semi-supervised clustering and present two special cases of this approach in detail. The semi-supervised clustering problem is thus formulated as a kernel-learning optimization problem [4]. An attractive property of this optimization problem is that it is convex and hence has no local optima. While a closed-form solution exists for the first special case, the second is solved with an iterative majorization procedure that approaches the optimal solution asymptotically. Experimental results on both synthetic and real-world data show that this kernel approach is promising for semi-supervised clustering [5]. We also consider the problem of clustering a given dataset into k clusters subject to an additional set of constraints on relative distance comparisons between the data items. These constraints capture side information that is not directly expressed in the feature vectors. Relative comparisons can express structure at a finer level of detail than the must-link (ML) and cannot-link (CL) constraints commonly used for semi-supervised clustering [6], and they are particularly useful in settings where giving an ML or CL constraint is difficult because the granularity of the true clustering is unknown. Our main contribution here is an efficient algorithm for learning a kernel matrix under the log-determinant (LogDet) divergence, a variant of the Bregman divergence, subject to a set of relative distance constraints. Given the learned kernel matrix, a clustering can be obtained by any suitable algorithm, such as kernel k-means. We show empirically that kernels found by our algorithm yield clusterings of higher quality than existing approaches that either use ML/CL constraints or implement supervision through relative comparisons by other means [7]. The proposed algorithm detects arbitrarily shaped clusters and improves cluster quality by minimizing intra-cluster distances and maximizing inter-cluster distances.
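To make the objects named above concrete, the following display sketches the standard kernel k-means objective, the LogDet divergence used for kernel learning, and the form that a relative distance-comparison constraint takes on the kernel matrix. The notation here is illustrative and is not taken verbatim from the cited papers. Kernel k-means, with feature map \(\phi\) and clusters \(\pi_1, \dots, \pi_k\), minimizes

\[ J(\{\pi_c\}) = \sum_{c=1}^{k} \sum_{x_i \in \pi_c} \bigl\| \phi(x_i) - m_c \bigr\|^2, \qquad m_c = \frac{1}{|\pi_c|} \sum_{x_j \in \pi_c} \phi(x_j), \]

which depends on the data only through the kernel entries \(K_{ij} = \phi(x_i)^{\top} \phi(x_j)\). The LogDet divergence between \(n \times n\) positive-definite kernel matrices \(K\) and \(K_0\) is

\[ D_{\mathrm{ld}}(K, K_0) = \operatorname{tr}\bigl(K K_0^{-1}\bigr) - \log\det\bigl(K K_0^{-1}\bigr) - n, \]

and a relative comparison "\(x_i\) is closer to \(x_j\) than to \(x_l\)," expressed through squared kernel distances, becomes the linear constraint

\[ K_{ii} + K_{jj} - 2K_{ij} < K_{ii} + K_{ll} - 2K_{il}. \]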