Demonstration of (generic) interfaces

Notebook configuration

[3]:
import sys

import numpy as np

import cnnclustering
from cnnclustering import cluster
from cnnclustering import _types, _fit

Print Python and package version information:

[4]:
# Version information
print("Python: ", *sys.version.split("\n"))

print("Packages:")
for package in [np, cnnclustering]:
    print(f"    {package.__name__}: {package.__version__}")
Python:  3.8.8 (default, Mar 11 2021, 08:58:19)  [GCC 8.3.0]
Packages:
    numpy: 1.20.1
    cnnclustering: 0.4.3

Labels

_types.Labels stores cluster label assignments together with a per-point consider indicator and meta information. It also provides a few methods to transform the labels.

Initialise Labels as:

  • Labels(labels)

  • Labels(labels, consider=consider)

  • Labels(labels, consider=consider, meta=meta)

  • Labels.from_sequence(labels_list, consider=consider_list, meta=meta)

Technically, Labels is not used as a generic class. A clustering, i.e. the assignment of cluster labels to points through a fitter (which itself relies on several generic interfaces), uses an instance of Labels by directly modifying the underlying array of labels. This array is a Cython memoryview that can be accessed from the C level as Labels._labels. Labels.labels provides a NumPy array view of Labels._labels.
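
The relation between a label buffer and a NumPy view of it can be illustrated without the package. The following is a minimal sketch, assuming only that the view shares memory with the buffer; the names buffer and view are hypothetical, and a plain NumPy array stands in for the Cython memoryview.

```python
import numpy as np

# Stand-in for the C-level label storage (hypothetical; the package
# uses a Cython memoryview, here a NumPy array plays that role).
buffer = np.zeros(6, dtype=np.intp)

# Wrapping a memoryview of the buffer yields an array sharing its memory.
view = np.asarray(memoryview(buffer))

# A "fitter" that writes labels into the buffer in place ...
buffer[:2] = 1
buffer[2:5] = 2

# ... is immediately visible through the view, without any copy.
print(view)
print(np.shares_memory(buffer, view))
```

Because no copy is involved, label assignments made during a fit are visible through the view as soon as they are written.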

Examples:

[5]:
# Requires labels to be initialised
labels = _types.Labels()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-a465ac6e3f53> in <module>
      1 # Requires labels to be initialised
----> 2 labels = _types.Labels()

src/cnnclustering/_types.pyx in cnnclustering._types.Labels.__cinit__()

TypeError: __cinit__() takes exactly 1 positional argument (0 given)
[13]:
labels = _types.Labels(np.array([1, 1, 2, 2, 2, 0]))
labels
[13]:
Labels([1, 1, 2, 2, 2, 0])
[15]:
labels = _types.Labels.from_sequence([1, 1, 2, 2, 2, 0])
labels
[15]:
Labels([1, 1, 2, 2, 2, 0])
[16]:
print(labels)
[1 1 2 2 2 0]
[17]:
labels.labels
[17]:
array([1, 1, 2, 2, 2, 0])
[18]:
labels.consider
[18]:
array([1, 1, 1, 1, 1, 1], dtype=uint8)
[19]:
labels.meta
[19]:
{}
[21]:
labels.set
[21]:
{0, 1, 2}
[22]:
labels.mapping
[22]:
defaultdict(list, {1: [0, 1], 2: [2, 3, 4], 0: [5]})
[25]:
labels.sort_by_size()
print(labels)
[2 2 1 1 1 0]
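
What mapping and sort_by_size() compute can be reproduced in plain Python. This is a sketch under assumptions, not the package's implementation: the helpers label_mapping and sorted_by_size are hypothetical, label 0 is assumed to be left untouched (as the output above suggests), ties between equally sized clusters are broken by the original label, and a relabelled copy is returned instead of modifying the labels in place.

```python
from collections import Counter, defaultdict

import numpy as np


def label_mapping(labels):
    """Map each label to the list of point indices carrying it."""
    mapping = defaultdict(list)
    for index, label in enumerate(labels):
        mapping[int(label)].append(index)
    return mapping


def sorted_by_size(labels):
    """Return a relabelled copy: largest cluster becomes 1, next 2, ...

    Label 0 is kept as is. Ties are broken by the original label,
    which is an assumption of this sketch.
    """
    labels = np.asarray(labels)
    counts = Counter(int(label) for label in labels if label != 0)
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    translate = {old: new for new, old in enumerate(ranked, start=1)}
    translate[0] = 0
    return np.array([translate[int(label)] for label in labels])


labels = np.array([1, 1, 2, 2, 2, 0])
print(dict(label_mapping(labels)))  # {1: [0, 1], 2: [2, 3, 4], 0: [5]}
print(sorted_by_size(labels))       # [2 2 1 1 1 0]
```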

Cluster parameters

An instance of _types.ClusterParameters is used during a clustering to pass around cluster parameters.

Initialise ClusterParameters as:

  • ClusterParameters(radius_cutoff)

  • ClusterParameters(radius_cutoff, similarity_cutoff)

ClusterParameters is a simple struct-like class that bundles the cluster parameters so that they can be accessed and passed around together.

Examples:

[27]:
# Requires at least a radius
_types.ClusterParameters()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-27-d3c71fb983e3> in <module>
      1 # Requires at least a radius
----> 2 _types.ClusterParameters()

src/cnnclustering/_types.pyx in cnnclustering._types.ClusterParameters.__cinit__()

TypeError: __cinit__() takes at least 1 positional argument (0 given)
[29]:
cluster_params = _types.ClusterParameters(1)
print(cluster_params)
{'radius_cutoff': 1.0, 'similarity_cutoff': 0, 'similarity_cutoff_continuous': 0.0, 'n_member_cutoff': 0, 'current_start': 1}
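
The struct-like character can be mimicked with a dataclass. The class SketchClusterParameters below is a hypothetical stand-in, not the extension type itself: its field names and default values are copied from the printed output above, and how the package derives them internally is not shown here.

```python
from dataclasses import asdict, dataclass


@dataclass
class SketchClusterParameters:
    """Hypothetical stand-in mirroring the fields printed above."""

    radius_cutoff: float
    similarity_cutoff: int = 0
    similarity_cutoff_continuous: float = 0.0
    n_member_cutoff: int = 0
    current_start: int = 1

    def __post_init__(self):
        # Coerce the radius to float, as the printed value 1.0 suggests
        # (an assumption of this sketch).
        self.radius_cutoff = float(self.radius_cutoff)


params = SketchClusterParameters(1)
print(asdict(params))
```

A single object of this kind can be handed from one component to the next instead of passing every parameter individually.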

Input data

Common-nearest-neighbours clustering can be done on data in a variety of input formats, with variations in how the procedure is actually executed. A typical case would be to use the coordinates of a number of points in some feature space. These coordinates may be stored in a 2-dimensional (NumPy) array, but they could also be held in a database. Instead of point coordinates, the clustering can also start from pre-computed pairwise distances between the points.

The implementation in the cnnclustering package aims to be generic and largely agnostic about the source of input data. This is achieved by wrapping the input data structure into an input data object that complies with a universal input data interface.

On the Python level, the input data interface is defined through the abstract base class _types.InputData and specialised through its abstract subclasses InputDataComponents, InputDataPairwiseDistances, InputDataPairwiseComputer, InputDataNeighbourhoods, and InputDataNeighbourhoodsComputer. Valid input data types inherit from one of these abstract types and provide concrete implementations of the required methods.

On the Cython level, the input data interface is universally defined through _types.InputDataExtInterface. Realisations of the interface by Cython extension types inherit from InputDataExtInterface and should be registered as a concrete implementation of one of the Python abstract base classes.
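
To make the different input formats concrete, the following sketch derives pairwise distances and neighbourhoods from point coordinates using NumPy only. It does not use the package; the variable names, the Euclidean metric, the radius value, and the convention that a point is not its own neighbour are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((5, 3))  # components: 5 points in 3 dimensions

# Pairwise Euclidean distances via broadcasting, shape (5, 5)
diff = points[:, None, :] - points[None, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Neighbourhoods: indices of all other points within the radius
radius = 0.5
neighbourhoods = []
for i, row in enumerate(distances):
    mask = row <= radius
    mask[i] = False  # exclude the point itself (assumption)
    neighbourhoods.append(np.flatnonzero(mask))

print(distances.shape)
print([n.tolist() for n in neighbourhoods])
```

Each of the three representations (components, distances, neighbourhoods) carries enough information to proceed with the clustering, which is why an interface that hides the concrete storage is useful.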

InputData objects should expose the following (typed) attributes:

  • data (any): If applicable, a representation of the underlying data, preferably as a NumPy array. Not strictly required for the clustering.

  • n_points (int): The total number of points in the data set.

  • meta (dict): A Python dictionary storing meta-information about the data. Used keys are for example:

    • "access_coords": Can point coordinates be retrieved from the input data (bool)?

    • "edges": If the stored points belong to more than one data source, a list of integers can state the number of points per part.

  • (InputData) get_subset(indices: Container): Return an instance of the same type holding only a subset of points (as given by indices). Used by Clustering.isolate().
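
How an "edges" entry could be used to recover the individual parts of a concatenated data set is sketched below. The data and the meta dictionary are made up for illustration, and the package itself is not involved.

```python
import numpy as np

# 5 points in 2 dimensions, concatenated from two hypothetical sources
data = np.arange(10).reshape(5, 2)
meta = {"access_coords": True, "edges": [2, 3]}  # 2 points, then 3 points

# Cumulative sums of the part lengths give the split positions
boundaries = np.cumsum(meta["edges"])[:-1]
parts = np.split(data, boundaries)

for part in parts:
    print(part.shape)
```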

InputDataComponents objects should expose the following additional attributes:

  • n_dim (int): The total number of dimensions.

  • (float) get_component(point: int, dimension: int): Return one component of a point with respect to a given dimension.

  • (NumPy ndarray) to_components_array(): Transform/return underlying data as a 2D NumPy array.
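
The attribute lists above can be turned into a pure-Python sketch of a conforming type. The classes SketchInputData and SketchComponentsArray are hypothetical and do not inherit from the package's abstract base classes; they only illustrate what a concrete implementation wrapping a 2D NumPy array might look like.

```python
from abc import ABC, abstractmethod

import numpy as np


class SketchInputData(ABC):
    """Hypothetical base interface following the attribute list above."""

    @property
    @abstractmethod
    def data(self): ...

    @property
    @abstractmethod
    def n_points(self): ...

    @property
    @abstractmethod
    def meta(self): ...

    @abstractmethod
    def get_subset(self, indices): ...


class SketchComponentsArray(SketchInputData):
    """Point coordinates held in a 2D NumPy array."""

    def __init__(self, data, meta=None):
        self._data = np.asarray(data, dtype=float)
        if self._data.ndim != 2:
            raise ValueError("expected a 2D array of shape (n_points, n_dim)")
        self._meta = {"access_coords": True}
        if meta is not None:
            self._meta.update(meta)

    @property
    def data(self):
        return self._data

    @property
    def n_points(self):
        return self._data.shape[0]

    @property
    def n_dim(self):
        return self._data.shape[1]

    @property
    def meta(self):
        return self._meta

    def get_component(self, point, dimension):
        # One coordinate of one point
        return float(self._data[point, dimension])

    def to_components_array(self):
        return self._data

    def get_subset(self, indices):
        # New instance of the same type holding only the selected points
        return type(self)(self._data[list(indices)])


input_data = SketchComponentsArray(np.arange(12).reshape(4, 3))
subset = input_data.get_subset([0, 2])
print(input_data.n_points, input_data.n_dim)  # 4 3
print(input_data.get_component(1, 2))         # 5.0
print(subset.to_components_array())
```

Code that only relies on n_points, n_dim and get_component() never needs to know whether the coordinates live in an array, a database, or elsewhere.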

InputDataExtComponentsMemoryview

Examples:

[32]:
# Requires data to initialise
_types.InputDataExtComponentsMemoryview()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-32-41a19cb81e61> in <module>
      1 # Requires data to initialise
----> 2 _types.InputDataExtComponentsMemoryview()

src/cnnclustering/_types.pyx in cnnclustering._types.InputDataExtComponentsMemoryview.__cinit__()

TypeError: __cinit__() takes exactly 1 positional argument (0 given)
[34]:
input_data = _types.InputDataExtComponentsMemoryview(np.random.random(size=(10, 3)))
print(input_data)
InputDataExtComponentsMemoryview
[36]:
input_data.data
[36]:
array([[0.15423156, 0.048149  , 0.21238066],
       [0.31544151, 0.45775574, 0.61957889],
       [0.56523987, 0.25913205, 0.89349825],
       [0.13423745, 0.81121165, 0.73824816],
       [0.40574509, 0.27321913, 0.03709493],
       [0.31003679, 0.03195195, 0.29738916],
       [0.16060228, 0.12021594, 0.53725757],
       [0.64273307, 0.32431991, 0.17237345],
       [0.46686891, 0.8965295 , 0.52424868],
       [0.84518244, 0.49240724, 0.18182637]])
[37]:
input_data.meta
[37]:
{'access_coords': True}
[35]:
input_data.n_points
[35]:
10
[38]:
input_data.n_dim
[38]:
3

Clustering

For more details on Clustering initialisation refer to the Advanced usage tutorial.