cluster

The functionality of this module is primarily exposed and bundled by the Clustering class. An instance of this class aggregates various types (defined in _types).

Clustering

class cnnclustering.cluster.Clustering(input_data=None, fitter=None, predictor=None, labels=None, alias: unicode = 'root', parent=None, **kwargs)

Represents a clustering endeavour

A clustering object is made by aggregation of all necessary parts to carry out a clustering of input data points.

Keyword Arguments
  • input_data – Any object implementing the input data interface. Represents the data points to be clustered. If this is not a valid (registered) concrete implementation of InputData, this invokes the creation of a clustering via ClusteringBuilder.

  • fitter – Any object implementing the fitter interface. Executes the clustering procedure.

  • predictor – Any object implementing the predictor interface. Translates a clustering result to another Clustering object with different input_data.

  • labels – An instance of Labels holding cluster label assignments for points in input_data. If this is not an instance of cnnclustering._types.Labels, a corresponding initialisation is attempted.

  • alias – A descriptive string identifier associated with this clustering.

  • parent – An instance of Clustering of which this clustering is a child.

Note

A clustering instance may also be created using the clustering builder ClusteringBuilder, e.g. as

clustering = ClusteringBuilder(data).build().

property children

Return a mapping of child cluster labels to cnnclustering.cluster.Clustering instances representing the children of this clustering.

evaluate(self, ax=None, clusters: Optional[Container[int]] = None, original: bool = False, unicode plot_style: str = u'dots', parts: Optional[Tuple[Optional[int]]] = None, points: Optional[Tuple[Optional[int]]] = None, dim: Optional[Tuple[int, int]] = None, mask: Optional[Sequence[Union[bool, int]]] = None, ax_props: Optional[dict] = None, annotate: bool = True, unicode annotate_pos: str = u'mean', annotate_props: Optional[dict] = None, plot_props: Optional[dict] = None, plot_noise_props: Optional[dict] = None, hist_props: Optional[dict] = None, free_energy: bool = True)

Returns a 2D plot of an original data set or a cluster result

Args:

ax:

The Axes instance to which to add the plot. If None, a new Figure with Axes will be created.

clusters:

Cluster numbers to include in the plot. If None, consider all.

original:

If True, plots the original data instead of a cluster result. Overrides clusters. Will be considered True if no cluster result is present.

plot_style:

The kind of plotting method to use.

  • “dots”, ax.plot()

  • “scatter”, ax.scatter()

  • “contour”, ax.contour()

  • “contourf”, ax.contourf()

parts:

Use a slice (start, stop, stride) on the data parts before plotting. Will be applied before a slice on points.

points:

Use a slice (start, stop, stride) on the data points before plotting.

dim:

Use these two dimensions for plotting. If None, uses (0, 1).

mask:

Sequence of boolean or integer values used for optional fancy indexing on the point data array. Note that this is applied after regular slicing (e.g. via points) and requires a copy of the indexed data (may be slow and memory intensive for big data sets).

annotate:

If there is a cluster result, plot the cluster numbers. Uses annotate_pos to determine the position of the annotations.

annotate_pos:

Where to put the cluster number annotation. Can be one of:

  • “mean”, Use the cluster mean

  • “random”, Use a random point of the cluster

Alternatively a list of x, y positions can be passed to set a specific point for each cluster (Not yet implemented)

annotate_props:

Dictionary of keyword arguments passed to ax.annotate().

ax_props:

Dictionary of ax properties to apply after plotting via ax.set(**ax_props). If None, uses defaults that can also be defined in the configuration file (Not yet implemented).

plot_props:

Dictionary of keyword arguments passed to various functions (plot.plot_dots() etc.), where they have different meanings, to format cluster plotting. If None, uses defaults that can also be defined in the configuration file (Not yet implemented).

plot_noise_props:

Like plot_props but for formatting noise point plotting.

hist_props:

Dictionary of keyword arguments passed to functions that involve the computing of a histogram via numpy.histogram2d.

free_energy:

If True, converts computed histograms to pseudo free energy surfaces.
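The conversion from a histogram to a pseudo free energy surface is not spelled out here. A minimal sketch of one common convention follows, assuming energies in units of kT, a shift so that the most populated bin sits at zero, and infinite energy for empty bins. The helper name pseudo_free_energy is hypothetical and the library's exact convention may differ.

```python
import math


def pseudo_free_energy(histogram):
    """Convert 2D bin counts to -ln(count / max_count).

    Assumptions (not taken from the library): energies are in units
    of kT, the most populated bin is shifted to 0, and empty bins
    map to +inf.
    """
    populated = [c for row in histogram for c in row if c > 0]
    if not populated:
        return [[math.inf for _ in row] for row in histogram]
    max_count = max(populated)
    return [
        [-math.log(c / max_count) if c > 0 else math.inf for c in row]
        for row in histogram
    ]


counts = [[0, 2], [8, 4]]          # toy 2x2 histogram of bin counts
surface = pseudo_free_energy(counts)
```

In this toy input the bin with 8 counts becomes the zero of the surface, and the bin with 2 counts ends up at ln(4).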

Returns

Figure, Axes and a list of plotted elements
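The order in which points and mask are applied can be illustrated on a plain list. This is a sketch only: select_points is a hypothetical helper, the data is a toy list of 2D tuples rather than a registered input data type, and the parts slice is left out.

```python
def select_points(data, points=None, mask=None):
    """Apply a (start, stop, stride) slice, then an optional mask."""
    if points is not None:
        data = data[slice(*points)]        # regular slicing comes first
    if mask is None:
        return data
    if all(isinstance(m, bool) for m in mask):
        # Boolean mask: keep entries flagged True (builds a new list)
        return [p for p, keep in zip(data, mask) if keep]
    # Integer mask: pick entries by position (also builds a new list)
    return [data[i] for i in mask]


data = [(float(i), float(i * i)) for i in range(10)]

every_second = select_points(data, points=(None, None, 2))
picked = select_points(data, points=(None, None, 2), mask=[0, 4])
flagged = select_points(
    data, points=(0, 4, None), mask=[True, False, False, True]
)
```

Because the mask indexes into the already sliced data, mask=[0, 4] above refers to the first and fifth of the strided points, not of the full data set.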

fit(self, double radius_cutoff: float, cnn_cutoff: int, member_cutoff: int = None, max_clusters: int = None, cnn_offset: int = None, sort_by_size: bool = True, info: bool = True, record: bool = True, record_time: bool = True, v: bool = True, purge: bool = False) → None

Execute clustering procedure

Parameters
  • radius_cutoff – Neighbour search radius.

  • cnn_cutoff – Similarity criterion.

  • member_cutoff – Valid clusters need to have at least this many members. Passed on to Labels.sort_by_size() if sort_by_size is True. Has no effect otherwise and valid clusters have at least one member.

  • max_clusters – Keep only the largest max_clusters clusters. Passed on to Labels.sort_by_size() if sort_by_size is True. Has no effect otherwise.

  • cnn_offset – Exists for compatibility reasons and is subtracted from cnn_cutoff. If cnn_offset = 0, two points need to share at least cnn_cutoff neighbours, not counting the two points themselves, to be part of the same cluster. Former versions of the clustering included self-counting, so cnn_cutoff = 2 in those versions is equivalent to cnn_cutoff = 0 in this version.

  • sort_by_size – Whether to sort (and trim) the created Labels instance. See also Labels.sort_by_size().

  • info – Whether to modify Labels.meta information for this clustering.

  • record – Whether to create a Record instance for this clustering which is appended to the Summary.

  • record_time – Whether to time clustering execution.

  • v – Be chatty.

  • purge – If True, force re-initialisation of cluster label assignments.
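The combined effect of sort_by_size, member_cutoff and max_clusters on a list of labels can be sketched as below. This is an illustration under assumptions, not the code of Labels.sort_by_size(): it assumes label 0 denotes noise, that clusters are renumbered from 1 in order of decreasing size, and that clusters dropped by either cutoff are reassigned to noise.

```python
from collections import Counter


def sort_by_size(labels, member_cutoff=None, max_clusters=None):
    """Relabel clusters by decreasing size; trimmed clusters become noise (0)."""
    if member_cutoff is None:
        member_cutoff = 1
    sizes = Counter(label for label in labels if label != 0)
    # Largest cluster first; ties are broken by the old label
    ranked = sorted(sizes, key=lambda label: (-sizes[label], label))
    kept = [label for label in ranked if sizes[label] >= member_cutoff]
    if max_clusters is not None:
        kept = kept[:max_clusters]
    new_label = {old: new for new, old in enumerate(kept, start=1)}
    return [new_label.get(label, 0) for label in labels]


labels = [3, 3, 1, 0, 2, 2, 2, 3, 3]

only_sorted = sort_by_size(labels)
min_two_members = sort_by_size(labels, member_cutoff=2)
largest_only = sort_by_size(labels, max_clusters=1)
```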

fit_hierarchical(self, radius_cutoff: Union[float, List[float]], cnn_cutoff: Union[int, List[int]], member_cutoff: int = None, max_clusters: int = None, cnn_offset: int = None)

Execute hierarchical clustering procedure

property fitter
classmethod get_builder_kwargs(cls)
get_child(self, label)
property hierarchy_level

The level of this clustering in the hierarchical tree of clusterings (0 for the root instance).

info(self)
property input_data
isolate(self, bool purge: bool = True, bool isolate_input_data: bool = True)

Create child clusterings from cluster labels

Parameters
  • purge – If True, creates a new mapping for the children of this clustering.

  • isolate_input_data – If True, attaches a subset of the input data of this clustering to the child.
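How children, hierarchy_level and isolate() relate can be sketched with a toy class. ToyClustering is a hypothetical stand-in and not the library class; it assumes plain lists for data and labels, and that every label (including noise, label 0) gets its own child.

```python
class ToyClustering:
    """Hypothetical stand-in used only to illustrate the hierarchy."""

    def __init__(self, input_data, labels=None, alias="root", parent=None):
        self.input_data = input_data
        self.labels = labels
        self.alias = alias
        self.parent = parent
        self.children = {}

    @property
    def hierarchy_level(self):
        # Count the steps up to the root instance (level 0)
        level, node = 0, self
        while node.parent is not None:
            level, node = level + 1, node.parent
        return level

    def isolate(self, purge=True, isolate_input_data=True):
        if purge:
            self.children = {}             # start from a fresh mapping
        for label in sorted(set(self.labels)):
            subset = None
            if isolate_input_data:
                # Attach only the points carrying this label to the child
                subset = [
                    point
                    for point, point_label in zip(self.input_data, self.labels)
                    if point_label == label
                ]
            self.children[label] = ToyClustering(
                subset, alias=str(label), parent=self
            )


root = ToyClustering([(0, 0), (0, 1), (5, 5), (9, 9)], labels=[1, 1, 2, 0])
root.isolate()
child = root.children[1]
```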

property labels

Direct access to cnnclustering._types.Labels.labels holding cluster label assignments for points in input_data.

make_parameters(self, double radius_cutoff: float, cnn_cutoff: int, current_start: int) → Type[ClusterParameters]
pie(self, ax=None, pie_props=None)
predict(self, other: Type[u'Clustering'], double radius_cutoff: float, cnn_cutoff: int, clusters: Optional[Sequence[int]] = None, cnn_offset: Optional[int] = None, info: bool = True, record: bool = True, record_time: bool = True, v: bool = True, purge: bool = False)

Execute prediction procedure

Parameters
  • other – cnnclustering.cluster.Clustering instance for which cluster labels should be predicted.

  • radius_cutoff – Neighbour search radius.

  • cnn_cutoff – Similarity criterion.

  • clusters – Sequence of cluster labels that should be included in the prediction.

  • cnn_offset – Exists for compatibility reasons and is subtracted from cnn_cutoff. If cnn_offset = 0, two points need to share at least cnn_cutoff neighbours, not counting the two points themselves, to be part of the same cluster. Former versions of the clustering included self-counting, so cnn_cutoff = 2 in those versions is equivalent to cnn_cutoff = 0 in this version.

  • purge – If True, force re-initialisation of predicted cluster labels.
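The shared-neighbour criterion behind radius_cutoff, cnn_cutoff and cnn_offset can be checked on a handful of points. The sketch below is a brute-force illustration, not the library's fitter or predictor: the helper names are hypothetical, Euclidean distance is assumed, and a point counts as a neighbour when its distance is less than or equal to radius_cutoff.

```python
import math


def neighbours(points, i, radius_cutoff, count_self=False):
    """Indices of points within radius_cutoff of point i."""
    return {
        j
        for j, other in enumerate(points)
        if (count_self or j != i)
        and math.dist(points[i], other) <= radius_cutoff
    }


def shared_count(points, a, b, radius_cutoff, count_self=False):
    """Number of neighbours that points a and b have in common."""
    shared = neighbours(points, a, radius_cutoff, count_self) & neighbours(
        points, b, radius_cutoff, count_self
    )
    if not count_self:
        shared -= {a, b}                   # never count the pair itself
    return len(shared)


def is_similar(points, a, b, radius_cutoff, cnn_cutoff, cnn_offset=0):
    """True if a and b share at least cnn_cutoff - cnn_offset neighbours."""
    return shared_count(points, a, b, radius_cutoff) >= cnn_cutoff - cnn_offset


points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5), (0.5, -0.5), (5.0, 5.0)]

without_self = shared_count(points, 0, 1, radius_cutoff=1.0)
with_self = shared_count(points, 0, 1, radius_cutoff=1.0, count_self=True)
```

For two points that lie within radius_cutoff of each other, self-counting adds exactly the two points themselves to the shared set, which is why a former cnn_cutoff of 2 corresponds to 0 here.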

reel(self, depth: Optional[int] = None) → None

Wrap up label assignments of lower hierarchy levels

Parameters
  • depth – How many lower levels to consider. If None, consider all.

summarize(self, ax=None, unicode quantity: str = u'execution_time', treat_nan: Optional[Any] = None, convert: Optional[Any] = None, ax_props: Optional[dict] = None, contour_props: Optional[dict] = None, unicode plot_style: str = u'contourf')

Generate a 2D plot of record values

Record values (“time”, “clusters”, “largest”, “noise”) are plotted against cluster parameters (radius cutoff r and cnn cutoff c).

Parameters
  • ax – Matplotlib Axes to plot on. If None, a new Figure with Axes will be created.

  • quantity

    Record value to visualise:

    • “time”

    • “clusters”

    • “largest”

    • “noise”

  • treat_nan – If not None, use this value to pad nan-values.

  • ax_props – Used to style ax.

  • contour_props – Passed on to contour.

property summary

Return an instance of cnnclustering.cluster.Summary collecting clustering results for this clustering.

to_nx_DiGraph(self, ignore=None)

Convert cluster hierarchy to networkx DiGraph

Keyword Arguments

ignore – A set of labels not to include in the graph. Use for example to exclude noise (label 0).

tree(self, ax=None, ignore=None, pos_props=None, draw_props=None)
trim_shrinking_leafs(self)
trim_trivial_leafs(self)

Scan cluster hierarchy for removable nodes

If the cluster label assignments on a clustering are all zero (noise), the clustering is considered trivial. In this case, the labels and children are reset to None.
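The rule for trivial leaves can be sketched as a recursive walk over a toy hierarchy. ToyNode and trim_trivial are hypothetical names, and the traversal order (children first) is an assumption; only the reset rule itself is taken from the description above.

```python
class ToyNode:
    """Hypothetical node holding labels and a label -> child mapping."""

    def __init__(self, labels=None, children=None):
        self.labels = labels
        self.children = children


def trim_trivial(node):
    """Reset labels and children on nodes whose labels are all noise."""
    for child in (node.children or {}).values():
        trim_trivial(child)
    if node.labels is not None and all(label == 0 for label in node.labels):
        node.labels = None
        node.children = None


root = ToyNode(
    labels=[1, 1, 2, 0],
    children={
        1: ToyNode(labels=[0, 0]),         # only noise: trivial
        2: ToyNode(labels=[1]),            # has a cluster: kept
    },
)
trim_trivial(root)
```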

class cnnclustering.cluster.ClusteringBuilder(data, preparation_hook=None, registered_recipe_key=None, clustering_type=None, alias=None, parent=None, **recipe)

Orchestrate correct initialisation of a clustering

Parameters

data – Data that should be clustered in a format compatible with ‘input_data’ specified in the building recipe. May go through preparation_hook to establish compatibility.

Keyword Arguments
  • preparation_hook – A function that takes input data as a single argument and returns the (optionally) reformatted data plus additional information (e.g. “meta”) in the form of an argument tuple and a keyword argument dictionary that can be used to initialise an input data type. If None, uses _default_preparation_hook.

  • recipe – Building instructions for a clustering initialisation. Should be a mapping of component keyword arguments to component type details.
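A preparation hook with the shape described above might look like the following sketch. The function name toy_preparation_hook and the keys stored under “meta” are hypothetical; only the return shape (an argument tuple plus a keyword argument dictionary) follows the description.

```python
def toy_preparation_hook(data):
    """Hypothetical hook: nested sequences -> float tuples plus meta info."""
    points = [tuple(float(x) for x in row) for row in data]
    meta = {
        "n_points": len(points),                     # illustrative keys only
        "n_dim": len(points[0]) if points else 0,
    }
    # Argument tuple and keyword argument dictionary for an input data type
    return (points,), {"meta": meta}


args, kwargs = toy_preparation_hook([[0, 1], [2, 3], [4, 5]])
```

A hook like this would be passed as preparation_hook when constructing the builder, so that the data arrives in a format the chosen input data type accepts.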

aggregate_components(self)
build(self)

Initialise clustering with data and components

Records and summary

class cnnclustering.cluster.Record(n_points=None, radius_cutoff=None, cnn_cutoff=None, member_cutoff=None, max_clusters=None, n_clusters=None, ratio_largest=None, ratio_noise=None, execution_time=None)

Cluster result container

cnnclustering.cluster.Record instances can be created during cnnclustering.cluster.Clustering.fit() and are collected in cnnclustering.cluster.Summary.

to_dict(self)
class cnnclustering.cluster.Summary(iterable=None)

List-like container for cluster results

Stores instances of cnnclustering.cluster.Record.

insert(self, index, item)
to_DataFrame(self)

Convert list of records to (typed) pandas.DataFrame

Returns

pandas.DataFrame

cnnclustering.cluster.make_typed_DataFrame(columns, dtypes, content=None)

Construct pandas.DataFrame with typed columns

cnnclustering.cluster.timed(function)

Decorator to measure execution time
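How timed reports the measured time is not stated here. As a generic illustration of the technique, a timing decorator can be sketched as follows; timed_sketch is a hypothetical name, and returning the elapsed seconds alongside the result is an assumption made for this example.

```python
import functools
import time


def timed_sketch(function):
    """Wrap function so that it returns (result, elapsed_seconds)."""

    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = function(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return result, elapsed

    return wrapper


@timed_sketch
def total(n):
    return sum(range(n))


value, seconds = total(1000)
```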