Abstract
Data clustering is an essential tool for perceiving the structure of a data set, and it plays a crucial initial role in machine learning, data mining, and information retrieval. Traditional algorithms are intended for numerical data, where distances between feature vectors can be measured directly; they cannot be applied to categorical data, whose domain values are discrete and have no inherent ordering. The final partition produced by such algorithms carries incomplete information: the core ensemble information matrix records only cluster–data-point relations, leaving many entries unknown and degrading the quality of the resulting clusters [2]. This paper discusses a method for clustering a high-dimensional dataset using dimensionality reduction and context dependency measures (CDM). First, the dataset is partitioned into a predefined number of clusters using CDM [5]. Then, CDM is combined with several dimensionality reduction techniques, and for each choice the dataset is clustered again. The results are combined by a cluster ensemble approach. Finally, the Rand index is used to measure the extent to which the clustering of the original dataset (by CDM alone) is preserved by the cluster ensemble. Cluster ensembles offer a solution to challenges arising from the ill-posed nature of clustering [7]. They can provide robust and stable solutions by leveraging the consensus across multiple clustering results, while averaging out spurious structures that arise from the biases to which each participating algorithm is tuned. In this paper, we address the problem of combining multiple weighted clusters that belong to different subspaces of the input space, leveraging the diversity of the input clusterings to generate a consensus partition superior to the participating ones [9].
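The Rand index used here can be computed directly from its definition: the fraction of point pairs on which two partitions agree (both partitions place the pair in the same cluster, or both place it in different clusters). The sketch below is a minimal illustration of that definition, not the paper's implementation:

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Rand index: fraction of point pairs on which two partitions agree.

    A pair agrees if both partitions put it in the same cluster,
    or both put it in different clusters. Value is in [0, 1].
    """
    assert len(labels_a) == len(labels_b)
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:  # both "same cluster" or both "different"
            agree += 1
    return agree / len(pairs)

# Identical partitions (up to relabeling) score 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # → 1.0
```

In this setting, `labels_a` would be the CDM-alone partition of the original dataset and `labels_b` the consensus partition from the cluster ensemble; a value near 1 indicates the ensemble preserves the original clustering.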
Since we are dealing with weighted clusters, our consensus function makes use of the weight vectors associated with the clusters. The experimental results show that our ensemble technique is capable of producing a partition that is as good as or better than the best individual clustering [10]. Experiments on three real data sets were conducted with three data generation methods and three consensus functions; ensemble clustering with FastMap projection outperformed ensemble clustering with random sampling and with random projection [2]. The proposed approach produced more efficient clustering with negligible overlap. We evaluated it on the Iris data set, where it produced more efficient results than the alternatives [4]. We also propose a soft feature selection procedure, called LAC (Locally Adaptive Clustering), that assigns weights to features according to the local correlation of the data along each dimension: dimensions along which the data are loosely correlated receive a small weight, which has the effect of elongating distances along that dimension [6].
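The LAC-style weighting described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact procedure: it derives each dimension's weight from an exponential of the within-cluster spread along that dimension (the bandwidth `h` is a hypothetical parameter), and normalizes the weight vector to unit Euclidean norm so weights are comparable across clusters.

```python
import math

def lac_weights(cluster_points, h=1.0):
    """Per-dimension weights from within-cluster spread (LAC-style sketch).

    Dimensions with large spread (loosely correlated data) get a small
    weight, which elongates distances along that dimension; tightly
    correlated dimensions get a large weight. `h` is an assumed
    bandwidth parameter controlling how sharply weights fall off.
    """
    dims = len(cluster_points[0])
    n = len(cluster_points)
    centroid = [sum(p[d] for p in cluster_points) / n for d in range(dims)]
    # Average squared deviation from the centroid along each dimension.
    spread = [sum((p[d] - centroid[d]) ** 2 for p in cluster_points) / n
              for d in range(dims)]
    raw = [math.exp(-s / h) for s in spread]
    # Normalize so the weight vector has unit Euclidean norm.
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

def weighted_distance(x, y, weights):
    """Euclidean distance with per-dimension weights."""
    return math.sqrt(sum(w * (a - b) ** 2
                         for w, a, b in zip(weights, x, y)))

# Dimension 0 is tight, dimension 1 is spread out,
# so dimension 0 receives the larger weight.
w = lac_weights([[0.0, -5.0], [0.1, 5.0], [-0.1, 0.0]])
```

A consensus function over such weighted clusters can then compare points using `weighted_distance` with each cluster's own weight vector, so that each participating clustering contributes its subspace structure to the final partition.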