ECCC
Electronic Colloquium on Computational Complexity
Login | Register | Classic Style



REPORTS > DETAIL:

Paper:

TR04-058 | 28th May 2004 00:00

Identifying Clusters from Positive Data

RSS-Feed

Abstract:
The present work studies clustering from an abstract point of view and investigates its properties in the framework of inductive inference. Any class $S$ considered is given by a numbering $A_0,A_1,...$ of nonempty subsets of the natural numbers or the rational k-dimensional vector space as a hypothesis space. A clustering task is a finite and nonempty set of indices of pairwise disjoint sets. The class $S$ is said to be clusterable if there is an algorithm which, for every clustering task $I$, converges in the limit on any text for the union of the $A_i$ with $i$ in $I$ to a finite set $J$ of indices of pairwise disjoint clusters having the same union. A class is called semiclusterable if there is such an algorithm which finds a $J$ with the last condition relaxed to the union of the $A_j$ with $j$ in $J$ is a superset of the union of the $A_i$ with $i$ in $I$. The relationship between natural topological properties and clusterability is investigated. Topological properties can provide sufficient or necessary conditions for clusterability but they cannot characterize it. On one hand, many interesting conditions make use of both the topological structure of the class and a well-chosen numbering. On the other hand, the clusterability of a class does not depend on the decision which numbering of the class is used as a hypothesis space for the clusterer. These ideas are demonstrated in the context of geometrically defined classes. Clustering of many of these classes requires besides the text for the clustering task some additional information: the class of convex hulls of finitely many points in a rational vector space can be clustered with the number of clusters as additional information. Interestingly, the class of polygons (together with their interiors) is clusterable if the number of clusters and the overall number of vertices of these clusters is given to the clusterer as additional information. Intriguingly this additional information is not sufficient for classes including figures with holes. While some classes are unclusterable due to their topological structure, others are only computationally intractable. Oracles can be used to distinguish between both cases: the former cannot be clustered using any oracle while the latter can be clustered using some oracle. It is shown that there are maximal oracles that allow clustering of all problems that can be clustered with the help of some oracle. In particular, $E$ is maximal iff either $K$ is Turing reducible to $E$ or $K'$ is Turing reducible to $E'$. Furthermore, no 1-generic oracle below $K$ and no 2-generic oracle permits to cluster any class which is not clusterable without an oracle.


ISSN 1433-8092 | Imprint