Objectives:- To aid data analysis and maintenance, a number of clustering algorithms have been proposed to partition a large data source into meaningful clusters. Selecting a clustering algorithm that genuinely helps in understanding a large data source remains a challenging issue.
Functional Specifications:- The effectiveness of a clustering algorithm may be influenced by a number of different factors. For a given data source, however, a single factor such as cluster quality can be used to judge how effective an algorithm is. The project provides a comparative analysis of four clustering algorithms, namely K-Means, Partitioning Around Medoids, Minimum Spanning Tree and Nearest Neighbor, applied to a diabetes dataset. Besides rapidly generating the clusters, the analysis provides a basis for determining the quality of the clusters generated and helps in identifying the algorithm that generates good-quality clusters. Because data characterization is a summarization of the general characteristics or features of a target class of data, the characteristics of the diabetes data are also analyzed using Attribute-Oriented Induction, taking the positively tested records as the target class.
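The comparison described above can be illustrated with a minimal sketch of one of the four algorithms, K-Means, together with one possible quality measure. The sketch is in Python for brevity rather than in the preferred technologies listed below. The function names, the choice of the first k records as starting centroids, and the use of within-cluster sum of squared errors (lower is better) as the quality score are all assumptions made for illustration; the project does not specify how quality is computed.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100):
    """Plain K-Means. Returns (assignment, centroids).

    Assumption: the first k records seed the centroids, which keeps the
    sketch deterministic; real implementations usually seed randomly.
    """
    centroids = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each record goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no record changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


def within_cluster_sse(points, assignment, centroids):
    """Hypothetical quality score: total squared distance to the assigned centroid."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


# Made-up two-attribute records, not taken from the diabetes dataset.
records = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
labels, centers = k_means(records, k=2)
quality = within_cluster_sse(records, labels, centers)
```

Running the other three algorithms on the same records and scoring each result with the same function would give one simple basis for the comparison, provided every algorithm is asked for the same number of clusters.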
User Interface:- Windows-based user interface with ease of use.
Preferred Technologies:- Java (Applets, AWT Events and Swing) or C#.NET 2.0 or VB.NET 2.0
About clustering:-
      Cluster computing is not a new area of computing. It is, however, evident that there is a growing interest in its usage in all areas where applications have traditionally used parallel or distributed computing platforms. The mounting interest has been fuelled in part by the availability of powerful microprocessors and high-speed networks as off-the-shelf commodity components as well as in part by the rapidly maturing software components available to support high performance and high availability applications. This rising interest in clusters led to the formation of an IEEE Computer Society Task Force on Cluster Computing (TFCC) in early 1999.
     A “commodity cluster” is a local computing system comprising a set of independent computers and a network interconnecting them. A cluster is local in that all of its component subsystems are supervised within a single administrative domain, usually residing in a single room and managed as a single computer system. The constituent computer nodes are commercial-off-the-shelf (COTS), are capable of full independent operation as is, and are of a type ordinarily employed individually for standalone mainstream workloads and applications. The nodes may incorporate a single microprocessor or multiple microprocessors in a symmetric multiprocessor (SMP) configuration. The interconnection network employs COTS local area network (LAN) or system area network (SAN) technology and may be a hierarchy of network structures or multiple separate ones. A cluster network is dedicated to the integration of the cluster compute nodes and is separate from the cluster’s external (worldly) environment. A cluster may be employed in many modes including but not limited to: high capability or sustained performance on a single problem, high capacity or throughput on a job or process workload, high availability through redundancy of nodes, or high bandwidth through multiplicity of disks and disk access or I/O channels.
A “Beowulf-class system” is a cluster with nodes that are personal computers (PCs) or small symmetric multiprocessors (SMPs) of PCs integrated by COTS local area networks (LAN) or system area networks (SAN), and hosting an open source Unix-like node operating system. A Windows-Beowulf system also exploits low-cost mass-market PC hardware, but instead of hosting an open source Unix-like operating system, it runs the widely distributed mass-market Microsoft Windows and NT operating systems.
A “Constellation” differs from a commodity cluster in that the number of processors in its node SMPs exceeds the number of SMPs comprising the system and the integrating network interconnecting the SMP nodes may be of custom technology and design. Definitions such as these are useful in that they provide guidelines and help focus analysis. But they can also be overly constraining in that they inadvertently rule out some particular system that intuition dictates should be included in the set. Ultimately, common sense must prevail.
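The size criterion in the Constellation definition can be written as a simple check, shown here as a rough sketch only. The function name and parameters are hypothetical, and the sketch deliberately ignores the network-technology part of the definition, so it is a guideline of the kind the text describes rather than a definitive test.

```python
def classify_system(processors_per_node, node_count):
    """Apply only the size rule: a Constellation has more processors in
    each SMP node than there are SMP nodes in the whole system.

    Assumption: every node has the same number of processors.
    """
    if processors_per_node > node_count:
        return "constellation"
    return "commodity cluster"


# A few large SMP nodes versus many small nodes.
few_large_nodes = classify_system(processors_per_node=16, node_count=4)
many_small_nodes = classify_system(processors_per_node=2, node_count=64)
```

Borderline systems, for example one with as many processors per node as it has nodes, fall on the commodity-cluster side of this rule because the definition says "exceeds"; such cases are where judgment has to take over from the rule.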