Contents

  • 1  Notebook configuration

  • 2  Overview

  • 3  The input data interface

    • 3.1  Examples

Usage of different data input formats and clustering recipes

[3]:
import sys

import matplotlib as mpl
import numpy as np

from cnnclustering import cluster
from cnnclustering import _types, _fit
[2]:
print(sys.version)
3.8.8 (default, Mar 11 2021, 08:58:19)
[GCC 8.3.0]

Notebook configuration

[5]:
# Matplotlib configuration
mpl.rc_file(
    "../../matplotlibrc",
    use_default_template=False
)
[6]:
# Axis property defaults for the plots
ax_props = {
    "xlabel": None,
    "ylabel": None,
    "xlim": (-2.5, 2.5),
    "ylim": (-2.5, 2.5),
    "xticks": (),
    "yticks": (),
    "aspect": "equal"
}

# Line plot property defaults
line_props = {
    "linewidth": 0,
    "marker": '.',
}

Overview

Common-nearest-neighbours clustering can be applied to data in a variety of input formats, and the execution of the procedure itself can vary. A typical case is to use the coordinates of a number of points in some data space. These coordinates may be stored in a 2-dimensional (NumPy) array, but they could also be held in a database. Instead of point coordinates, the clustering can also start from pre-computed pairwise distances between the points. The present implementation in the cnnclustering package aims to be generic and largely agnostic about the source of the input data (see also the explanation of the algorithm in the reference). This is achieved by wrapping the input data structure in an InputData object that complies with a universal input data interface. Similarly, the way neighbourhoods are calculated and represented during the clustering is not hard-coded in the implementation. It can be modified through the choice of Neighbours and NeighboursGetter objects with a matching interface. The following sections describe the types of objects used and how to compose them in a Clustering object. The described components can be found in the _types submodule.

The individual component objects may be instances of regular Python classes (inheriting from a corresponding abstract base class). Alternatively, they may be instantiated from Cython extension types.
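To make the different input formats more concrete, the following minimal sketch derives pairwise distances and neighbour lists from a small set of point coordinates using only the standard library. The point values and the radius r are made up for illustration. Whether a point counts as its own neighbour, and whether the radius comparison is strict, are assumptions of this sketch and not statements about cnnclustering.

```python
import math

# Four hypothetical 2-dimensional points (illustrative values only)
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
n_points = len(points)

# "distances": full matrix of pairwise Euclidean distances
distances = [
    [math.dist(a, b) for b in points]
    for a in points
]

# "neighbours": for each point, the indices of all other points
# closer than r (self-exclusion and strict "<" are assumptions here)
r = 1.5  # hypothetical neighbour search radius
neighbours = [
    [j for j in range(n_points) if j != i and distances[i][j] < r]
    for i in range(n_points)
]

print(neighbours)  # [[1, 2], [0, 2], [0, 1], []]
```

All three structures (points, distances, neighbours) describe the same data set, which is why a common interface in front of them is useful.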

The input data interface

Input data objects should expose the following (typed) attributes:

  • n_points (int): The total number of points in the data set.

  • n_dim (int): The number of dimensions per data point.

  • data (any): If applicable, a representation of the underlying data, preferably as a NumPy array. Can be omitted.

  • meta (dict): A Python dictionary storing meta-information about the data. Used keys are for example:

    • "kind": One of ["points", "distances", "neighbours"], revealing the kind of input data stored.

    • "edges": If the stored data points belong to more than one data source, a list of integers can state the number of points per part.
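As an illustration of how an "edges" entry can be read, the sketch below splits a concatenated data set back into its parts. The helper split_by_edges and the example values are hypothetical and not part of cnnclustering.

```python
# Hypothetical example: two data sources with 3 and 2 points, concatenated
points = [[0.0], [0.1], [0.2], [5.0], [5.1]]
meta = {"kind": "points", "edges": [3, 2]}


def split_by_edges(data, edges):
    """Split `data` into consecutive parts of the lengths given in `edges`."""
    if sum(edges) != len(data):
        raise ValueError("edges must sum to the number of points")
    parts, start = [], 0
    for length in edges:
        parts.append(data[start:start + length])
        start += length
    return parts


parts = split_by_edges(points, meta["edges"])
print([len(part) for part in parts])  # [3, 2]
```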

Additional object-specific attributes may be present. Interaction with the input data object (e.g. from a NeighboursGetter) should go through one of the following methods:

  • float get_component(int, int): Takes a point and a dimension index and returns the corresponding value (float).

  • int get_n_neighbours(int): Takes a point index and returns the total number of neighbours for this point.

  • int get_neighbour(int, int): Takes a point and a member index and returns the index of the corresponding member in the data set.

Not all of the above may be meaningful, depending on the nature of the stored data. If an attribute or method is not applicable, it should still be present but return 0 for consistency.
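The interface described above could be mirrored by a plain Python class along the following lines. This is a minimal sketch for pre-computed neighbour lists. The class name NeighbourListInputData is hypothetical, and the class does not inherit from the package's abstract base class. Setting n_dim to 0 follows the "return 0 if not applicable" convention and is an assumption.

```python
class NeighbourListInputData:
    """Hypothetical input data type holding pre-computed neighbour lists.

    Not part of cnnclustering; it only mirrors the described interface.
    """

    def __init__(self, neighbourhoods, meta=None):
        self.data = [list(n) for n in neighbourhoods]
        self.n_points = len(self.data)
        self.n_dim = 0  # assumption: not meaningful for neighbour lists
        self.meta = {"kind": "neighbours"}
        if meta is not None:
            self.meta.update(meta)

    def get_component(self, point, dimension):
        # No coordinates are stored: present for consistency, returns 0
        return 0.0

    def get_n_neighbours(self, point):
        # Total number of neighbours of the point with index `point`
        return len(self.data[point])

    def get_neighbour(self, point, member):
        # Index of the `member`-th neighbour of the point with index `point`
        return self.data[point][member]


input_data = NeighbourListInputData([[1, 2], [0, 2], [0, 1], []])
print(input_data.n_points)             # 4
print(input_data.get_n_neighbours(0))  # 2
print(input_data.get_neighbour(1, 1))  # 2
```

A caller that only uses the three methods does not need to know whether the object stores coordinates or neighbour lists.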

Currently supported realisations of the input data interface are:

  • InputData: A Python abstract base class defining the input data interface.

  • InputDataNeighbours: Neighbours of points stored as sequences (no type inference).

  • InputDataExtNeighboursMemoryview: Neighbours of points exposed in a zero-padded 2-dimensional memoryview.

  • InputDataExtPointsMemoryview: Point coordinates exposed in a 2-dimensional memoryview.
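Neighbour lists usually differ in length, so a rectangular 2-dimensional layout requires padding. The sketch below shows one way to zero-pad ragged neighbour lists. The helper pad_neighbourhoods is hypothetical. Because 0 is also a valid point index, the sketch keeps the true neighbour counts separately; how the extension type itself tracks these counts is not stated here, so this is an assumption.

```python
def pad_neighbourhoods(neighbourhoods, fill=0):
    """Pad ragged neighbour lists to equal length and return the true counts."""
    width = max((len(n) for n in neighbourhoods), default=0)
    padded = [list(n) + [fill] * (width - len(n)) for n in neighbourhoods]
    n_neighbours = [len(n) for n in neighbourhoods]
    return padded, n_neighbours


padded, n_neighbours = pad_neighbourhoods([[1, 2], [0], []])
print(padded)        # [[1, 2], [0, 0], [0, 0]]
print(n_neighbours)  # [2, 1, 0]
```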

Examples

[16]:
original_points = np.array([[0, 0, 0],
                            [1, 1, 1]], dtype=float)
input_data = _types.InputDataExtPointsMemoryview(original_points)
print(
    f"data:\n{input_data.data}\n"
    f"n_points:\n{input_data.n_points}\n"
    f"component (1, 2):\n{input_data.get_component(1, 2)}\n"
)
data:
[[0. 0. 0.]
 [1. 1. 1.]]
n_points:
2
component (1, 2):
1.0