Clusters-Features’s documentation

Welcome to the official documentation of the Python package Clusters-Features.

Package features

ClustersCharacteristics

class ClustersFeatures.ClustersCharacteristics(pd_df_, **args)

Class Author: BERTRAND Simon - simonbertrand.contact@gmail.com

Made for preparing the summer mission with iCube, Strasbourg (D-IR on FoDoMust). This class has been made in order to facilitate the manipulation of clusters generated by unsupervised techniques. It computes many scores and indices to evaluate the generated clusters. Utility tools such as data visualisation are also implemented.

Parameters
  • pd_df (pd.DataFrame) – Dataframe to analyse, concatenated with the target vector

  • target (str) – The name of the target column of the pd_df dataframe

Returns

ClustersCharacteristics Instance

>>> CC=ClustersCharacteristics(pd_df,label_target="target")

Many features are available as instance variables, here is the list:

InstVar self.num_clusters

Returns the number of clusters

InstVar self.num_observations

Returns the number of observations (pd_df.shape[0])

InstVar self.num_observation_for_specific_cluster

Returns a dict with cluster as key and number of observations as value

InstVar self.data_dimension

Returns the number of features/directions/dimensions (pd_df.shape[1]-1)

InstVar self.labels_clusters

Returns a list of all clusters labels

InstVar self.label_target

Returns the given argument “target” used in the initialisation of the ClustersCharacteristics instance

InstVar self.data_clusters

Returns a dict with cluster label as key and the sub-dataframe of observations with that target label as value

InstVar self.data_centroids

Returns a dict with cluster label as key and the centroid point (pandas Series) as value

InstVar self.data_barycenter

Returns a Series of the dataframe barycenter

InstVar self.data_radiuscentroid

Returns a dict with “max”, “75p”, “median”, “mean”, “min” as keys and, for each key, a dict with clusters as keys and the corresponding centroid radius as value

InstVar self.data_target

Returns the target vector

InstVar self.data_frame

Returns the dataframe without the target vector

InstVar self.data_features

Returns the dataframe with the target vector (pd_df)

InstVar self.data_every_element_distance_to_every_element

Returns the pairwise distances between elements (generated with SciPy)

InstVar self.data_every_element_distance_to_centroids

Returns the distance between each element of the dataset and each centroid

InstVar self.data_every_possible_cluster_pairs

Returns all the possible pairs of clusters

InstVar self.data_every_cluster_element_distance_to_centroids

Returns, for each cluster, the distances between the elements belonging to the cluster and its centroid

For example:

>>> CC.num_clusters
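
Other instance variables can be read in the same way; for instance (illustrative only, assuming CC has been initialised as above):

>>> CC.num_observation_for_specific_cluster
>>> CC.data_centroids[CC.labels_clusters[0]]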

Data

class ClustersFeatures.src._data.__Data

The ClustersCharacteristics object creates attributes that define clusters. We can find them in the Data subclass. To use these methods, you need to initialise a ClustersCharacteristics instance and then call the corresponding methods:

For example:

>>> CC=ClustersCharacteristics(pd_df,"target")
>>> CC.data_intercentroid_distance_matrix()
data_intercentroid_distance(Cluster1, Cluster2)

Computes the distance between the centroid of Cluster1 and the centroid of Cluster2.

Parameters
  • Cluster1 – Cluster1 label name

  • Cluster2 – Cluster2 label name

Returns

float

>>> CC.data_intercentroid_distance(CC.labels_clusters[0], CC.labels_clusters[1])
data_intercentroid_distance_matrix(**args)

Computes the distance between each pair of centroids and returns the matrix of this general term.

Returns a symmetric matrix (x_ij) where x_ij is the distance between the centroids of clusters i and j.

Parameters

target= (bool) – Concatenate the output with the data target

Returns

A symmetric pandas dataframe with the computed distances between each pair of centroids

>>> CC.data_intercentroid_distance_matrix()
data_interelement_distance_between_elements_of_two_clusters(Cluster1, Cluster2)

Returns every pairwise distance between elements belonging to Cluster1 or Cluster2.

If Cluster1 is equal to Cluster2, then these distances are intra-cluster and the output is symmetric. Otherwise, they are inter-cluster distances and the output is not symmetric.

Parameters
  • Cluster1 – Cluster1 label name

  • Cluster2 – Cluster2 label name

Returns

A pandas dataframe with the pairwise element distances for the given clusters

>>> CC.data_interelement_distance_between_elements_of_two_clusters(CC.labels_clusters[0], CC.labels_clusters[1])
data_interelement_distance_for_clusters(**args)

Returns a dataframe with two columns. The first column contains the distance for each pair of elements belonging to the clusters given in the “clusters=” list argument. The second column is a boolean column equal to True when both elements are inside the same cluster. Pandas MultiIndexes are used to allow users to link the Distance column with the dataset points.

Parameters

clusters= – labels of the clusters for which to compute pairwise distances

Returns

A pandas dataframe with two columns: one for the distance and the other, named ‘Same Cluster ?’, equal to True if both elements belong to the same cluster

Computing all the distances between the first 3 clusters of the dataframe:

>>> CC.data_interelement_distance_for_clusters(clusters=CC.labels_clusters[0:3])
data_interelement_distance_for_two_element(ElementId1, ElementId2)

Returns the distance between ElementId1 and ElementId2.

Parameters
  • ElementId1 – First element pandas index

  • ElementId2 – Second element pandas index

Returns

float

>>> CC.data_interelement_distance_for_two_element(CC.data_features.index[0],CC.data_features.index[1])
data_radius_selector_specific_cluster(Query, Cluster)

Returns the radius of a given cluster according to the specified query.

Parameters
  • Query (str) – in the list [‘max’, ‘min’, ‘median’, ‘mean’] or “XXp” for the XXth radius percentile or “XX%” for a percentage of the max radius.

  • Cluster – The cluster label

Returns

a float.
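
For instance, an illustrative call requesting the 90th radius percentile of the first cluster (the query string follows the format described above):

>>> CC.data_radius_selector_specific_cluster("90p", CC.labels_clusters[0])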

data_same_target_for_pairs_elements_matrix()

Returns a boolean matrix whose general term is True when the index element belongs to the same cluster as the column element

Returns

A boolean pandas dataframe with shape (num_observations,num_observations)

>>> CC.data_same_target_for_pairs_elements_matrix()

Score

This section allows users to evaluate their clustering by checking the values of the indices below.

References :

Clustering Indices - Bernard Desgraupes (University Paris Ouest, Lab Modal’X) - 2017

Study on Different Cluster Validity Indices - Shyam Kumar K, Dr. Raju G (NSS College Rajakumari, Idukki & Kannur University, Kannur in Kerala, India) - 2018

Understanding of Internal Clustering Validation Measures - Yanchi Liu, Zhongmou Li, Hui Xiong, Xuedong Gao, Junjie Wu - 2010

Scatter Score

class ClustersFeatures.src._score.__Score
scatter_matrix_T()

Returns the total dispersion matrix: it is self.num_observations times the variance-covariance matrix of the dataset.

Returns

a Pandas dataframe.

scatter_matrix_WG()

Returns the sum of scatter_matrix_specific_cluster_WGk over all k; it is also called the within-group matrix.

Returns

a Pandas dataframe.

scatter_matrix_between_group_BG()

Returns the matrix of the dispersion between the centroids and the barycenter.

Returns

a Pandas dataframe.
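
As a hedged sanity check, the total scatter matrix is expected to decompose into the within-group and between-group matrices (Huygens decomposition), up to floating-point error; this sketch assumes the three dataframes share the same index and columns:

>>> T = CC.scatter_matrix_T()
>>> WG = CC.scatter_matrix_WG()
>>> BG = CC.scatter_matrix_between_group_BG()
>>> (T - (WG + BG)).abs().max().max()  # expected to be close to 0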

scatter_matrix_specific_cluster_WGk(Cluster)

Returns the within-cluster dispersion for a specific cluster (the sum of squared distances between the cluster’s elements and the centroid of the concerned cluster).

Parameters

Cluster – Cluster label name.

Returns

a Pandas dataframe.

score_between_group_dispersion()

Returns the between-group dispersion; it can also be seen as the trace of the between-group matrix.

Returns

float.

score_mean_quadratic_error()

Returns the mean quadratic error; it is the same as score_pooled_within_cluster_dispersion / num_observations.

Returns

float.

score_pooled_within_cluster_dispersion()

Returns the sum of score_within_cluster_dispersion for each cluster.

Returns

float.

score_totalsumsquare()

Returns the trace of scatter_matrix_T; it can also be computed differently by using the variance function.

Returns

float.

score_within_cluster_dispersion(Cluster)

Returns the trace of the WGk matrix for a specific cluster. It is the same as score_totalsumsquare but computed with the WGk matrix’s coefficients.

Parameters

Cluster – Cluster label name.

Returns

float.
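
The scalar scores mirror the matrix traces above; an illustrative check, under the same decomposition assumption:

>>> WGSS = CC.score_pooled_within_cluster_dispersion()
>>> BGSS = CC.score_between_group_dispersion()
>>> abs(CC.score_totalsumsquare() - (WGSS + BGSS))  # expected to be close to 0
>>> abs(CC.score_mean_quadratic_error() - WGSS / CC.num_observations)  # close to 0 by definition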

Index

class ClustersFeatures.src._score_index.__ScoreIndex
score_index_Log_Det_ratio()

Defined in the first reference.

Returns NaN when the WG matrix or the total scatter matrix is not invertible.

Returns

float.

score_index_PBM()

Defined in the first reference.

Returns

float.

score_index_SD()

Defined in the first reference.

Since we do not have different numbers of clusters, we cannot compute the weighting coefficient: the average scattering for clusters and the total separation between clusters are returned as a tuple.

Returns

A tuple of float that are (Scattering, Separation).

score_index_ball_hall()

Returns the Ball Hall index defined in the first reference.

Returns

float.

score_index_banfeld_Raftery()

Defined in the first reference.

Returns

float.

score_index_c()

Defined in the first reference.

Returns

float.

score_index_c_for_each_cluster(Cluster)

A variant of the C Index for each cluster. The main difference is that we do not take the sum over all pairs of points but directly take the number of pairs for the given cluster.

Parameters

Cluster – Cluster label name.

Returns

float.

score_index_calinski_harabasz()

Defined in the first reference.

Returns

float.

score_index_davies_bouldin()

Defined in the first reference.

It is the mean of score_index_davies_bouldin_for_each_cluster.

Returns

float.

score_index_davies_bouldin_for_each_cluster()

Defined in the first reference.

Returns

np.array of the Davies-Bouldin score for each cluster.

score_index_det_ratio()

Defined in the first reference.

Returns NaN when the WG matrix or the total scatter matrix is not invertible.

Returns

float.

score_index_dunn()

Defined in the first reference.

Returns

float.

score_index_generalized_dunn(**args)

Returns one of the 18 generalized Dunn indices.

Parameters
  • wc_distance (int) – within-cluster distance index according to the main reference. Integer included in [1,2,3].

  • bc_distance (int) – between-cluster distance index according to the main reference. Integer included in [1,2,3,4,5,6].

Returns

float.
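
An illustrative call, using the first within-cluster and between-cluster distance codes:

>>> CC.score_index_generalized_dunn(wc_distance=1, bc_distance=1)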

score_index_generalized_dunn_matrix()

Returns the 18 generalized Dunn indices defined in the first reference.

Returns

A pandas dataframe with shape (6,3).

score_index_log_ss_ratio()

Defined in the first reference.

Returns

float.

score_index_mclain_rao()

Defined in the first reference.

Returns

float

score_index_point_biserial()

Defined in the first reference.

Returns

float.

score_index_ratkowsky_lance()

Defined in the first reference.

Returns

float.

score_index_ray_turi()

Defined in the first reference.

Returns

float.

score_index_scott_symons()

Defined in the first reference.

Returns NaN if one of the WGk matrices is not invertible.

Returns

float.

score_index_silhouette()

Uses the scikit-learn library to quickly compute the silhouette score.

Returns

float.

score_index_silhouette_for_every_cluster()

Uses the scikit-learn library to quickly compute the mean silhouette score for each cluster.

Returns

A pandas Series with silhouette score for each cluster.

score_index_trace_WiB()

Defined in the first reference.

Returns NaN if WG matrix is not invertible.

Returns

float.

score_index_wemmert_gancarski()

A special thanks to M. Gançarski, who recruited me for my first traineeship at iCube, Strasbourg; his index has been implemented here:

Defined in the first reference.

Returns

float.

score_index_xie_beni()

Defined in the first reference.

Returns

float.

IndexCore

In this library, there are two ways to calculate these scores: using IndexCore, which automatically caches the already computed indices, or calling the score_index methods directly. The second approach can lead to computing the same index repeatedly, which can be very slow since some of these indices have a very high computational complexity.

Warning

Take special care with the indices.json structure. The whole IndexCore class is based on this JSON structure. Modifying the layout of indices.json implies modifying the structure of many functions in this document. In other words, it is strongly discouraged to modify the global layout of the JSON without having done a thorough analysis of the program. To add an index, it is important to add data to the JSON following its current structure.

indices.json structure dependency: indices.json, _info.py, __init__

class ClustersFeatures.index_core.__IndexCore
IndexCore_compute_every_index()

Computes all the indices and saves them to the cache.

Returns

A dict with all the index values.

IndexCore_generate_output_by_info_type(board_type, indices_type, code)

Returns the queried index. If it has already been computed, then the cached result is returned.

Parameters

board_type: (str) –

A str in the following list [‘general’, ‘radius’, ‘clusters’]. ‘general’ shows indices that are computed for the entire dataset. ‘radius’ shows information about the distribution of radii, and ‘clusters’ allows users to check the indices for each cluster.

Parameters

indices_type: (str) –

A str in the following list [‘max’, ‘min’, ‘max diff’, ‘min diff’]. If ‘max’ (resp. ‘min’), then the higher (resp. lower) the score, the better the clustering. ‘max diff’ and ‘min diff’ are useful when you need to find the best number of clusters: ‘max diff’ corresponds to the maximum difference between clustering 1 with K clusters and clustering 2 with K’ clusters (K != K’). See the Bernard Desgraupes reference for more explanations.

Parameters

code: (str) –

A str corresponding to one of the codes inside the indices.json file. Check these codes with IndexCore_get_all_index().

Returns

list or float or pandas dataframe or pandas series.
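
For example, querying the between-group total dispersion through its code (codes can be listed with IndexCore_get_all_index(); this call is illustrative):

>>> CC.IndexCore_generate_output_by_info_type('general', 'max', 'G-Max-01')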

IndexCore_get_all_index()

Returns a dict with all the indices and their corresponding codes.

Returns

dict

>>> CC.IndexCore_get_all_index()
{'general': {'max': {'Between-group total dispersion': 'G-Max-01', 'Mean quadratic error': 'G-Max-02', 'Silhouette Index': 'G-Max-03', 'Dunn Index': 'G-Max-04', 'Generalized Dunn Indexes': 'G-Max-GDI', 'Wemmert-Gancarski Index': 'G-Max-05', 'Calinski-Harabasz Index': 'G-Max-06', 'Ratkowsky-Lance Index': 'G-Max-07', 'Point Biserial Index': 'G-Max-08', 'PBM Index': 'G-Max-09'}, 'max diff': {'Trace WiB Index': 'G-MaxD-01', 'Trace W Index': 'G-MaxD-02'}, 'min': {'Banfeld-Raftery Index': 'G-Min-01', 'Ball Hall Index': 'G-Min-02', 'C Index': 'G-Min-03', 'Ray-Turi Index': 'G-Min-04', 'Xie-Beni Index': 'G-Min-05', 'Davies Bouldin Index': 'G-Min-06', 'SD Index': 'G-Min-07', 'Mclain-Rao Index': 'G-Min-08', 'Scott-Symons Index': 'G-Min-09'}, 'min diff': {'Det Ratio Index': 'G-MinD-01', 'Log BGSS/WGSS Index': 'G-MinD-02', 'S_Dbw Index': 'G-MinD-03', 'Nlog Det Ratio Index': 'G-MinD-04'}}, 'clusters': {'max': {'Centroid distance to barycenter': 'C-Max-01', 'Between-group Dispersion': 'C-Max-02', 'Average Silhouette': 'C-Max-03', 'KernelDensity mean': 'C-Max-04', 'Ball Hall Index': 'C-Max-05'}, 'min': {'Within-Cluster Dispersion': 'C-Min-01', 'Largest element distance': 'C-Min-02', 'Inter-element mean distance': 'C-Min-03', 'Davies Bouldin Index': 'C-Min-04', 'C Index': 'C-Min-05'}}, 'radius': {'min': {'Radius min': 'R-Min-01', 'Radius mean': 'R-Min-02', 'Radius median': 'R-Min-03', 'Radius 75th Percentile': 'R-Min-04', 'Radius max': 'R-Min-05'}}}
IndexCore_get_number_of_index()

Returns the number of indices inside the indices.json file.

Returns

int

Confusion Hypersphere

The Confusion Hypersphere subclass counts the number of elements contained inside an n-dimensional sphere (hypersphere) of given radius centred on each cluster centroid. The given radius is the same for each hypersphere.

class ClustersFeatures.src._confusion_hypersphere.__ConfusionHypersphere
confusion_hyperphere_around_specific_point_for_two_clusters(point, Cluster1, Cluster2, radius)

This function returns the number of elements belonging to Cluster1 or Cluster2 that are contained in the hypersphere of the specified radius centred on the given point.

Parameters
  • point (list,np.ndarray) – The point on which the hypersphere will be centred.

  • Cluster1 – Cluster1 label name.

  • Cluster2 – Cluster2 label name.

  • radius (float) – The radius of the hypersphere.

Returns

int
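
An illustrative call, centring the hypersphere on the dataset barycenter (used here as an array-like point) with an arbitrary radius of 10:

>>> CC.confusion_hyperphere_around_specific_point_for_two_clusters(CC.data_barycenter, CC.labels_clusters[0], CC.labels_clusters[1], 10)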

confusion_hypersphere_for_linspace_radius_each_element(**args)

This method returns the results of the above method for a linearly spaced range of radii.

Parameters

n_pts (int) – Allows users to set the number of points in the radius range.

Returns

A pandas dataframe
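
An illustrative call with an arbitrary number of 10 points in the radius range:

>>> CC.confusion_hypersphere_for_linspace_radius_each_element(n_pts=10)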

confusion_hypersphere_matrix(**args)

Returns the confusion hypersphere matrix.

Parameters
  • radius_choice (float) – The radius of the hypersphere.

  • counting_type (str) – a str in [‘including’, ‘excluding’]. If ‘including’, then the elements belonging to cluster i and contained inside the hypersphere of centroid i are counted (for i=j). If ‘excluding’, they are not counted.

  • proportion (bool) – If True, returns the proportion.

Returns

A pandas dataframe.

If (x_ij) is the returned matrix, then it can be described as follows:

  • for proportion = False: x_ij is the number of elements belonging to cluster j that are contained inside (Euclidean norm) the hypersphere of cluster i with the specified radius

  • for proportion = True: x_ij is the number of elements belonging to cluster j that are contained inside (Euclidean norm) the hypersphere of cluster i with the specified radius, divided by the number of elements inside cluster j

>>> CC.confusion_hypersphere_matrix(radius=35, counting_type="including", proportion=True)

Info

The Info subclass shows two different informative boards that give many kinds of information about the dataset in general and about the clusters.

class ClustersFeatures.src._info.__Info
clusters_info(**args)

Generates a board that gives information about the different clusters.

Parameters

scaler (str) – Returns the scaled output. Available scalers: ‘min_max’, ‘robust’, ‘standard’.

Returns

A pandas dataframe.

>>> CC.clusters_info()
general_info(**args)

Generates a board that gives general information about the dataset.

Parameters

hide_nan (bool) – Show the NaN indices and their corresponding codes. If True, they are hidden.

Returns

A pandas dataframe.

>>> CC.general_info(hide_nan=False)

Density

class ClustersFeatures.src._density.__Density
density_estimation(method, **args)

Returns an estimation of the density obtained by summing n-dimensional Gaussian distributions. Since creating an n-dimensional meshgrid has a very high computational complexity, we can only estimate the density at the observations of the dataset. We define a density function that outputs a density estimate for a given n-dimensional coordinate and then apply it to the coordinates of the dataframe points.

Parameters
  • method (str) – a str contained in the list [‘intra’, ‘inter’, ‘total’]. “intra” specifies the density of each observation relative to each cluster. “total” estimates the density of each observation relative to all clusters at the same time. “inter” estimates the total density of each cluster relative to the total density of another cluster; for this argument, the returned matrix is symmetric.

  • clusters (list) – List of the clusters for which to estimate the density.

Returns

A pandas dataframe depending on the given “method” argument.
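
An illustrative call estimating the intra-cluster density for every cluster (the method and cluster selection are arbitrary here):

>>> CC.density_estimation("intra", clusters=CC.labels_clusters)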

density_projection_2D(reduction_method, percentile, **args)

The density projection uses a reduction method to estimate the density with a 2D Meshgrid.

We estimate the density by summing num_observations 2D Gaussian distributions, each centred on an element of the dataset. The percentile argument sets the minimum density contour to select. For percentile=99, only the 1% densest regions are selected.

Parameters
  • reduction_method (str) – “UMAP” or “PCA”. Reduces the total dimension of the dataframe to 2.

  • percentile (int) – Sets the minimum density contour to select as a percentile of the current density distribution.

  • cluster= (list) – A list of clusters to estimate density.

  • return_clusters_density= (bool) – Adds a key in the returned dict with a meshgrid of Z values for each cluster.

  • return_data (bool) – Returns the reduction data. It’s the same as self.utils_PCA(2) or self.utils_UMAP() but packed in the returned dict.

Returns

A dict containing all the data.

>>> CC.density_projection_2D("PCA", 99, cluster=CC.labels_clusters, return_data=False, return_clusters_density=True)
density_projection_2D_generate_png(reduction_method, percentile, **args)

This method generates a PNG where each density shape is observable.

We use the PIL library to generate this PNG.

Parameters
  • reduction_method (str) – “UMAP” or “PCA”

  • percentile (int) – Sets the minimum density contour to select as a percentile of the current density distribution.

  • show_image (bool) – Shows the generated image with Plotly. If Plotly is not installed, it is recommended to set this argument to False.

Returns

A dict containing all the data.

>>> CC.density_projection_2D_generate_png("PCA", 99, show_image=False)
density_projection_3D(percentile, **args)

The density projection uses 3D PCA reduction method to estimate the density with a 3D Meshgrid.

We estimate the density by summing num_observations 3D Gaussian distributions, each centred on an element of the dataset. The percentile argument sets the minimum density contour to select. For percentile=99, only the 1% densest regions are selected.

Parameters
  • percentile (int) – Sets the minimum density contour to select as a percentile of the current density distribution.

  • cluster= (list) – A list of clusters to estimate density. It is forbidden to pass more than 2 distinct clusters. Leaving this argument empty results in estimating each cluster as a single density.

  • return_clusters_density= (bool) – Adds a key in the returned dict with the density values for each cluster.

  • return_grid (bool) – Adds a key in the returned dict with the full 3D meshgrid.

Returns

A dict containing all the data.

>>> CC.density_projection_3D(99, cluster=CC.labels_clusters, return_grid=False, return_clusters_density=True)

Utils

class ClustersFeatures.src._utils.__Utils
utils_ClustersRank(**args)

Defines a mean rank for each cluster based on the min/max indices of the clusters board.

The method uses the min-max scaler to put each row of the clusters_info board on the same scale. The min-type and max-type indices are separated in order to output a rank for each index. If a min index is the lowest of all clusters, then its rank is the self.num_observations-th. To generate the final rank, we compute the mean rank of each cluster over the min-type and the max-type indices, then sum the two mean ranks. As we want a rank where the first position is the best, we invert the above sum to get the final rank. The mean rank for each cluster can also be returned by passing cluster_rank=True.

Parameters

cluster_rank= (bool) – Returns the mean rank for each cluster

Returns

The final leaderboard.

>>> CC.utils_ClustersRank(mean_cluster_rank=True)
utils_KernelDensity(**args)

Returns a kernel density estimation computed with the best bandwidth.

Parameters
  • return_KDE (bool) – If return_KDE is True, the KDE model is also returned so that samples can be generated later. It uses the scikit-learn library.

  • clusters (list) – List of clusters on which to evaluate the kernel density; order is not important. If no clusters are specified, the kernel density is computed on the entire data set.

Returns

  • An estimation of the kernel density for each sample if return_KDE is False

  • A tuple with the kernel density estimation for each sample and the KDE model if return_KDE is True
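
An illustrative call restricted to the first two clusters, also retrieving the fitted model:

>>> estimation, kde_model = CC.utils_KernelDensity(clusters=CC.labels_clusters[0:2], return_KDE=True)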

utils_PCA(n_components)

Principal Component Analysis: uses the scikit-learn library.

Parameters

n_components – number of data dimension after reduction

Returns

An n_components-dimensional projection of the whole data set
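
For instance, reducing the whole data set to two dimensions, as used internally by the 2D density projection:

>>> CC.utils_PCA(2)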

utils_UMAP(**args)

Uniform Manifold Approximation and Projection: uses the umap-learn library.

The result is cached to avoid repeating the same calculations.

Parameters

show_target (bool) – Concatenate target to output dataframe

Returns

A pandas dataframe with the 2D projection of the whole data set.
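
For instance, computing the 2D UMAP projection together with the target column:

>>> CC.utils_UMAP(show_target=True)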

utils_ts_filtering(filter, **args)

Filters a time series with different filters from statsmodels.

The col argument can be specified to filter a column of the instance dataframe. Otherwise, you can directly specify a time series with the data argument.
Parameters

filter (str) – Type of filter. Has to be in the list [‘STL’, ‘HP’, ‘BK’, ‘CF’], respectively for the STL (seasonal-trend decomposition), Hodrick-Prescott, Baxter-King and Christiano-Fitzgerald filters from statsmodels.

Parameters
  • periods= (int/float) – Specify the period between each sample.

  • col= (str/int) – Required if data is None: specifies the column of the instance data set to filter.

  • data= (list/np.ndarray) – Required if col is None: specifies the data to filter.

Returns

A pandas dataframe with columns as the decomposed signals.
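
An illustrative call applying the Hodrick-Prescott filter to the first feature column (the column choice is arbitrary here):

>>> CC.utils_ts_filtering('HP', col=CC.data_frame.columns[0])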

Graph

class ClustersFeatures.src._graph.__Graph
graph_PCA_3D()

Shows the 3D PCA reduction graph with Plotly.

Returns

Plotly figure instance

>>> CC.graph_PCA_3D()
graph_boxplots_distances_to_centroid(Cluster)

Shows a box plot of the distances between all elements and the centroid of the given cluster.

Parameters

Cluster – The cluster whose centroid is used to evaluate the distances of the elements.

Returns

Plotly figure instance.

>>> CC.graph_boxplots_distances_to_centroid(CC.labels_clusters[0])
graph_confusion_hypersphere_evolution_for_linspace_radius(n_pts, proportion)

Returns a Plotly animation with dataframes generated by the Confusion Hypersphere for different radii.

This animation allows users to understand which clusters are more confused with each other. You can also interpret compactness as follows: the diagonal term (when proportion is True) that first reaches the value 1 corresponds to the most compact cluster in the dataset.

Parameters
  • n_pts (int) – Number of points for the radius linspace.

  • proportion (bool) – Passes the value of proportion to the Confusion Hypersphere arguments.

Returns

Plotly figure instance.

>>> CC.graph_confusion_hypersphere_evolution_for_linspace_radius(50, True)
graph_projection_2D(feature1, feature2)

A simple 2D projection on two given features with Plotly.

Parameters
  • feature1 – The first dataframe column to project

  • feature2 – The second dataframe column to project

Returns

Plotly figure instance.
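
An illustrative call projecting the data on its first two feature columns:

>>> CC.graph_projection_2D(CC.data_frame.columns[0], CC.data_frame.columns[1])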

graph_reduction_2D(reduction_method)

Shows the 2D reduction graph with Plotly.

Parameters

reduction_method (str) – “UMAP” or “PCA”

Returns

Plotly figure instance

>>> CC.graph_reduction_2D("UMAP")
graph_reduction_density_2D(reduction_method, percentile, graph)

Shows the result of the 2D density estimation with Plotly.

Parameters
  • reduction_method (str) – “UMAP” or “PCA”. Reduces the total dimension of the dataframe to 2.

  • percentile (int) – Sets the minimum density contour to select as a percentile of the current density distribution.

  • graph (str) – “interactive” or “contour”. Shows different ways to visualize the density.

Returns

Plotly figure instance.
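
An illustrative call using the PCA reduction and the contour view:

>>> CC.graph_reduction_density_2D("PCA", 99, "contour")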

graph_reduction_density_3D(percentile, **args)

Shows the result of 3D PCA density estimation with Plotly.

Parameters
  • percentile (int) – Sets the minimum density contour to select as a percentile of the current density distribution.

  • cluster= (list) – A list of clusters to estimate density.

Returns

Plotly figure instance

>>> CC.graph_reduction_density_3D(99,cluster=CC.labels_clusters[:2])
>>> CC.graph_reduction_density_3D(99,cluster=CC.labels_clusters[0])
>>> CC.graph_reduction_density_3D(99)
