Demonstration of (generic) interfaces¶
Notebook configuration¶
[3]:
import sys
import numpy as np
import cnnclustering
from cnnclustering import cluster
from cnnclustering import _types, _fit
Print Python and package version information:
[4]:
# Version information
print("Python: ", *sys.version.split("\n"))
print("Packages:")
for package in [np, cnnclustering]:
    print(f" {package.__name__}: {package.__version__}")
Python: 3.8.8 (default, Mar 11 2021, 08:58:19) [GCC 8.3.0]
Packages:
numpy: 1.20.1
cnnclustering: 0.4.3
Labels¶
_types.Labels is used to store cluster label assignments alongside a consider indicator and meta information. It also provides a few transformational methods.
Initialise Labels as:
Labels(labels)
Labels(labels, consider=consider)
Labels(labels, consider=consider, meta=meta)
Labels.from_sequence(labels_list, consider=consider_list, meta=meta)
Technically, Labels is not used as a generic class. A clustering, i.e. the assignment of cluster labels to points through a fitter (using a set of generic interfaces), uses an instance of Labels by directly modifying the underlying array of labels, a Cython memoryview that can be accessed from the C level as Labels._labels. Labels.labels provides a NumPy array view to Labels._labels.
Examples:
[5]:
# Requires labels to be initialised
labels = _types.Labels()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-a465ac6e3f53> in <module>
1 # Requires labels to be initialised
----> 2 labels = _types.Labels()
src/cnnclustering/_types.pyx in cnnclustering._types.Labels.__cinit__()
TypeError: __cinit__() takes exactly 1 positional argument (0 given)
[13]:
labels = _types.Labels(np.array([1, 1, 2, 2, 2, 0]))
labels
[13]:
Labels([1, 1, 2, 2, 2, 0])
[15]:
labels = _types.Labels.from_sequence([1, 1, 2, 2, 2, 0])
labels
[15]:
Labels([1, 1, 2, 2, 2, 0])
[16]:
print(labels)
[1 1 2 2 2 0]
[17]:
labels.labels
[17]:
array([1, 1, 2, 2, 2, 0])
[18]:
labels.consider
[18]:
array([1, 1, 1, 1, 1, 1], dtype=uint8)
[19]:
labels.meta
[19]:
{}
[21]:
labels.set
[21]:
{0, 1, 2}
[22]:
labels.mapping
[22]:
defaultdict(list, {1: [0, 1], 2: [2, 3, 4], 0: [5]})
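The set and mapping attributes shown above can be reproduced from a plain list of labels. The following is a minimal sketch in pure Python, not the package's implementation; the helper name label_mapping is hypothetical.

```python
from collections import defaultdict


def label_mapping(labels):
    """Map each cluster label to the indices of the points carrying it."""
    mapping = defaultdict(list)
    for index, label in enumerate(labels):
        mapping[label].append(index)
    return mapping


labels_list = [1, 1, 2, 2, 2, 0]
mapping = label_mapping(labels_list)
label_set = set(labels_list)

print(dict(mapping))  # {1: [0, 1], 2: [2, 3, 4], 0: [5]}
print(label_set)
```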
[25]:
labels.sort_by_size()
print(labels)
[2 2 1 1 1 0]
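Judging from the output above, sort_by_size() relabels the clusters so that the largest one receives label 1. The effect can be sketched in pure Python as follows. This is an illustration only, not the package's code: the function name sort_labels_by_size is hypothetical, and it assumes that label 0 marks noise and is left untouched.

```python
from collections import Counter


def sort_labels_by_size(labels, noise_label=0):
    """Relabel clusters so that label 1 is the largest, 2 the next, etc."""
    sizes = Counter(label for label in labels if label != noise_label)
    # most_common() orders by descending count; ties keep first-seen order
    ranked = [label for label, _ in sizes.most_common()]
    new_label = {old: new for new, old in enumerate(ranked, start=1)}
    new_label[noise_label] = noise_label  # assumption: noise stays as it is
    return [new_label[label] for label in labels]


print(sort_labels_by_size([1, 1, 2, 2, 2, 0]))  # [2, 2, 1, 1, 1, 0]
```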
Cluster parameters¶
An instance of _types.ClusterParameters is used during a clustering to pass around cluster parameters.
Initialise ClusterParameters as:
ClusterParameters(radius_cutoff)
ClusterParameters(radius_cutoff, similarity_cutoff)
…
ClusterParameters is a simple struct-like class that offers collective access to and passing of cluster parameters.
Examples:
[27]:
# Requires at least a radius
_types.ClusterParameters()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-27-d3c71fb983e3> in <module>
1 # Requires at least a radius
----> 2 _types.ClusterParameters()
src/cnnclustering/_types.pyx in cnnclustering._types.ClusterParameters.__cinit__()
TypeError: __cinit__() takes at least 1 positional argument (0 given)
[29]:
cluster_params = _types.ClusterParameters(1)
print(cluster_params)
{'radius_cutoff': 1.0, 'similarity_cutoff': 0, 'similarity_cutoff_continuous': 0.0, 'n_member_cutoff': 0, 'current_start': 1}
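The idea of a struct-like parameter container can be sketched with a dataclass. ClusterParametersSketch below is a hypothetical stand-in, not the extension type itself. Its field names follow the printed output above, and its default values are an assumption taken from that single output.

```python
from dataclasses import asdict, dataclass


@dataclass
class ClusterParametersSketch:
    """Hypothetical stand-in; field names follow the printed output above."""

    radius_cutoff: float  # required, like the radius in the example above
    similarity_cutoff: int = 0  # assumed default
    similarity_cutoff_continuous: float = 0.0  # assumed default
    n_member_cutoff: int = 0  # assumed default
    current_start: int = 1  # assumed default

    def __post_init__(self):
        # Mirror the conversion seen above (1 is reported as 1.0)
        self.radius_cutoff = float(self.radius_cutoff)


params = ClusterParametersSketch(1)
print(asdict(params))
```

All parameters travel together in one object, so a function that needs them can accept a single argument instead of five.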
Input data¶
Common-nearest-neighbours clustering can be done on data in a variety of input formats, with variations in how the procedure is actually executed. A typical case, for example, would be to use the coordinates of a number of points in some feature space. These coordinates may be stored in a 2-dimensional (NumPy) array, but they could also be held in a database. Instead of point coordinates, the clustering can also start from pre-computed pairwise distances between the points. The present implementation in the cnnclustering package aims to be generic and largely agnostic about the source of input data. This is achieved by wrapping the input data structure into an input data object that complies with a universal input data interface. On the Python level, the input data interface is defined through the abstract base class _types.InputData and specialised through its abstract subclasses InputDataComponents, InputDataPairwiseDistances, InputDataPairwiseDistancesComputer, InputDataNeighbourhoods, and InputDataNeighbourhoodsComputer. Valid input data types inherit from one of these abstract types and provide concrete implementations of the required methods. On the Cython level, the input data interface is universally defined through _types.InputDataExtInterface. Realisations of the interface by Cython extension types inherit from InputDataExtInterface and should be registered as a concrete implementation of one of the Python abstract base classes.
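Registering a type with an abstract base class relies on the standard abc machinery of Python. The following minimal sketch shows the mechanism with hypothetical stand-in classes (InputDataSketch, ExtensionTypeSketch) rather than the package's own types.

```python
from abc import ABC, abstractmethod


class InputDataSketch(ABC):
    """Hypothetical stand-in for an abstract input data type."""

    @property
    @abstractmethod
    def n_points(self):
        """Total number of points in the data set."""


class ExtensionTypeSketch:
    """Plays the role of a type that does not inherit from the ABC."""

    def __init__(self, data):
        self._data = data

    @property
    def n_points(self):
        return len(self._data)


# Register as a virtual subclass: isinstance/issubclass checks now pass.
# Note that register() does not verify that the abstract methods exist.
InputDataSketch.register(ExtensionTypeSketch)

wrapped = ExtensionTypeSketch([[0.0, 1.0], [2.0, 3.0]])
print(isinstance(wrapped, InputDataSketch))  # True
print(wrapped.n_points)  # 2
```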
InputData objects should expose the following (typed) attributes:
data (any): If applicable, a representation of the underlying data, preferably as a NumPy array. Not strictly required for the clustering.
n_points (int): The total number of points in the data set.
meta (dict): A Python dictionary storing meta-information about the data. Keys used are, for example:
    "access_coords": Can point coordinates be retrieved from the input data (bool)?
    "edges": If the stored input data points actually belong to more than one data source, a list of integers can state the number of points per part.
(InputData) get_subset(indices: Container): Return an instance of the same type holding only a subset of points (as given by indices). Used by Clustering.isolate().
InputDataComponents objects should expose the following additional attributes:
n_dim (int): The total number of dimensions.
(float) get_component(point: int, dimension: int): Return one component of a point with respect to a given dimension.
(NumPy ndarray) to_components_array(): Transform/return underlying data as a 2D NumPy array.
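To make the attribute lists above more concrete, here is a rough sketch of a type that exposes them. ComponentsListSketch is hypothetical and deliberately stand-alone: it does not inherit from the package's abstract base classes, and it stores the data as nested lists, so to_components_array() returns a nested list where the interface asks for a 2D NumPy array.

```python
class ComponentsListSketch:
    """Hypothetical input data type backed by a list of coordinate lists."""

    def __init__(self, data, meta=None):
        self.data = [list(point) for point in data]
        self.meta = {"access_coords": True}
        if meta is not None:
            self.meta.update(meta)

    @property
    def n_points(self):
        return len(self.data)

    @property
    def n_dim(self):
        return len(self.data[0]) if self.data else 0

    def get_component(self, point, dimension):
        """Return one component of a point for the given dimension."""
        return self.data[point][dimension]

    def get_subset(self, indices):
        """Return a new instance of the same type with the chosen points."""
        return type(self)([self.data[i] for i in indices])

    def to_components_array(self):
        # Stand-in: a nested list copy instead of a 2D NumPy array
        return [list(point) for point in self.data]


input_sketch = ComponentsListSketch(
    [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
)
print(input_sketch.n_points, input_sketch.n_dim)  # 3 3

subset = input_sketch.get_subset([0, 2])
print(subset.get_component(1, 0))  # 6.0
```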
InputDataExtComponentsMemoryview¶
Examples:
[32]:
# Requires data to initialise
_types.InputDataExtComponentsMemoryview()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-32-41a19cb81e61> in <module>
1 # Requires data to initialise
----> 2 _types.InputDataExtComponentsMemoryview()
src/cnnclustering/_types.pyx in cnnclustering._types.InputDataExtComponentsMemoryview.__cinit__()
TypeError: __cinit__() takes exactly 1 positional argument (0 given)
[34]:
input_data = _types.InputDataExtComponentsMemoryview(np.random.random(size=(10, 3)))
print(input_data)
InputDataExtComponentsMemoryview
[36]:
input_data.data
[36]:
array([[0.15423156, 0.048149 , 0.21238066],
[0.31544151, 0.45775574, 0.61957889],
[0.56523987, 0.25913205, 0.89349825],
[0.13423745, 0.81121165, 0.73824816],
[0.40574509, 0.27321913, 0.03709493],
[0.31003679, 0.03195195, 0.29738916],
[0.16060228, 0.12021594, 0.53725757],
[0.64273307, 0.32431991, 0.17237345],
[0.46686891, 0.8965295 , 0.52424868],
[0.84518244, 0.49240724, 0.18182637]])
[37]:
input_data.meta
[37]:
{'access_coords': True}
[35]:
input_data.n_points
[35]:
10
[38]:
input_data.n_dim
[38]:
3
Clustering¶
For more details on Clustering initialisation, refer to the Advanced usage tutorial.