package sklearn

module Parallel : sig ... end
val additive_chi2_kernel : ?y:Ndarray.t -> x:Ndarray.t -> unit -> Ndarray.t

Computes the additive chi-squared kernel between observations in X and Y

The chi-squared kernel is computed between each pair of rows in X and Y. X and Y have to be non-negative. This kernel is most commonly applied to histograms.

The chi-squared kernel is given by::

k(x, y) = -Sum [(x - y)^2 / (x + y)]

It can be interpreted as a weighted difference per entry.

Read more in the :ref:`User Guide <chi2_kernel>`.

Notes ----- As the negative of a distance, this kernel is only conditionally positive definite.

Parameters ---------- X : array-like of shape (n_samples_X, n_features)

Y : array of shape (n_samples_Y, n_features)

Returns ------- kernel_matrix : array of shape (n_samples_X, n_samples_Y)

References ---------- * Zhang, J. and Marszalek, M. and Lazebnik, S. and Schmid, C. Local features and kernels for classification of texture and object categories: A comprehensive study International Journal of Computer Vision 2007 https://research.microsoft.com/en-us/um/people/manik/projects/trade-off/papers/ZhangIJCV06.pdf

See also -------- chi2_kernel : The exponentiated version of the kernel, which is usually preferable.

sklearn.kernel_approximation.AdditiveChi2Sampler : A Fourier approximation to this kernel.
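
The formula above can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: the helper names `additive_chi2` and `additive_chi2_matrix` are hypothetical, and skipping entries where both values are zero (to avoid 0/0) is an assumption of this sketch.

```python
def additive_chi2(x, y):
    """Additive chi-squared kernel between two non-negative vectors."""
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # assumption: entries where both values are 0 contribute nothing
            total += (a - b) ** 2 / (a + b)
    return 0.0 - total  # negative of the chi-squared distance


def additive_chi2_matrix(X, Y):
    """Kernel matrix of shape (len(X), len(Y)), one entry per pair of rows."""
    return [[additive_chi2(x, y) for y in Y] for x in X]


X = [[1.0, 0.0, 3.0], [2.0, 2.0, 0.0]]
K = additive_chi2_matrix(X, X)
# Diagonal entries are 0 (a row compared with itself); off-diagonal entries are negative.
```

The result is symmetric, and every entry is at most 0 because the kernel is the negative of a distance.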

val check_array : ?accept_sparse: [ `String of string | `Bool of bool | `StringList of string list ] -> ?accept_large_sparse:bool -> ?dtype: [ `String of string | `Dtype of Py.Object.t | `TypeList of Py.Object.t | `None ] -> ?order:[ `F | `C | `None ] -> ?copy:bool -> ?force_all_finite:[ `Bool of bool | `Allow_nan ] -> ?ensure_2d:bool -> ?allow_nd:bool -> ?ensure_min_samples:int -> ?ensure_min_features:int -> ?warn_on_dtype:[ `Bool of bool | `None ] -> ?estimator:[ `String of string | `Estimator of Py.Object.t ] -> array:Py.Object.t -> unit -> Py.Object.t

Input validation on an array, list, sparse matrix or similar.

By default, the input is checked to be a non-empty 2D array containing only finite values. If the dtype of the array is object, attempt converting to float, raising on failure.

Parameters ---------- array : object Input object to check / convert.

accept_sparse : string, boolean or list/tuple of strings (default=False) Strings representing allowed sparse matrix formats, such as 'csc', 'csr', etc. If the input is sparse but not in the allowed format, it will be converted to the first listed format. True allows the input to be any format. False means that a sparse matrix input will raise an error.

accept_large_sparse : bool (default=True) If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by accept_sparse, accept_large_sparse=False will cause it to be accepted only if its indices are stored with a 32-bit dtype.

.. versionadded:: 0.20

dtype : string, type, list of types or None (default="numeric") Data type of result. If None, the dtype of the input is preserved. If "numeric", dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.

order : 'F', 'C' or None (default=None) Whether an array will be forced to be fortran or c-style. When order is None (default), then if copy=False, nothing is ensured about the memory layout of the output array; otherwise (copy=True) the memory layout of the returned array is kept as close as possible to the original array.

copy : boolean (default=False) Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

force_all_finite : boolean or 'allow-nan', (default=True) Whether to raise an error on np.inf and np.nan in array. The possibilities are:

  • True: Force all values of array to be finite.
  • False: accept both np.inf and np.nan in array.
  • 'allow-nan': accept only np.nan values in array. Values cannot be infinite.

For object dtyped data, only np.nan is checked and not np.inf.

.. versionadded:: 0.20 ``force_all_finite`` accepts the string ``'allow-nan'``.

ensure_2d : boolean (default=True) Whether to raise a value error if array is not 2D.

allow_nd : boolean (default=False) Whether to allow array.ndim > 2.

ensure_min_samples : int (default=1) Make sure that the array has a minimum number of samples in its first axis (rows for a 2D array). Setting to 0 disables this check.

ensure_min_features : int (default=1) Make sure that the 2D array has some minimum number of features (columns). The default value of 1 rejects empty datasets. This check is only enforced when the input data has effectively 2 dimensions or is originally 1D and ``ensure_2d`` is True. Setting to 0 disables this check.

warn_on_dtype : boolean or None, optional (default=None) Raise DataConversionWarning if the dtype of the input data structure does not match the requested dtype, causing a memory copy.

.. deprecated:: 0.21 ``warn_on_dtype`` is deprecated in version 0.21 and will be removed in 0.23.

estimator : str or estimator instance (default=None) If passed, include the name of the estimator in warning messages.

Returns ------- array_converted : object The converted and validated array.

val check_non_negative : x:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> whom:string -> unit -> Py.Object.t

Check if there is any negative value in an array.

Parameters ---------- X : array-like or sparse matrix Input data.

whom : string Who passed X to this function.

val check_paired_arrays : x:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> y:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> unit -> Py.Object.t * Py.Object.t

Set X and Y appropriately and checks inputs for paired distances

All paired distance metrics should use this function first to assert that the given parameters are correct and safe to use.

Specifically, this function first ensures that both X and Y are arrays, then checks that they are at least two dimensional while ensuring that their elements are floats. Finally, the function checks that the size of the dimensions of the two arrays are equal.

Parameters ---------- X : array-like, sparse matrix, shape (n_samples_a, n_features)

Y : array-like, sparse matrix, shape (n_samples_b, n_features)

Returns ------- safe_X : array-like, sparse matrix, shape (n_samples_a, n_features) An array equal to X, guaranteed to be a numpy array.

safe_Y : array-like, sparse matrix, shape (n_samples_b, n_features) An array equal to Y if Y was not None, guaranteed to be a numpy array. If Y was None, safe_Y will be a pointer to X.

val check_pairwise_arrays : ?precomputed:bool -> ?dtype: [ `String of string | `Dtype of Py.Object.t | `TypeList of Py.Object.t | `None ] -> ?accept_sparse: [ `String of string | `Bool of bool | `StringList of string list ] -> ?force_all_finite:[ `Bool of bool | `Allow_nan ] -> ?copy:bool -> x:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> y:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> unit -> Py.Object.t * Py.Object.t

Set X and Y appropriately and checks inputs

If Y is None, it is set as a pointer to X (i.e. not a copy). If Y is given, this does not happen. All distance metrics should use this function first to assert that the given parameters are correct and safe to use.

Specifically, this function first ensures that both X and Y are arrays, then checks that they are at least two dimensional while ensuring that their elements are floats (or dtype if provided). Finally, the function checks that the size of the second dimension of the two arrays is equal, or the equivalent check for a precomputed distance matrix.

Parameters ---------- X : array-like, sparse matrix, shape (n_samples_a, n_features)

Y : array-like, sparse matrix, shape (n_samples_b, n_features)

precomputed : bool True if X is to be treated as precomputed distances to the samples in Y.

dtype : string, type, list of types or None (default=None) Data type required for X and Y. If None, the dtype will be an appropriate float type selected by _return_float_dtype.

.. versionadded:: 0.18

accept_sparse : string, boolean or list/tuple of strings Strings representing allowed sparse matrix formats, such as 'csc', 'csr', etc. If the input is sparse but not in the allowed format, it will be converted to the first listed format. True allows the input to be any format. False means that a sparse matrix input will raise an error.

force_all_finite : boolean or 'allow-nan', (default=True) Whether to raise an error on np.inf and np.nan in array. The possibilities are:

  • True: Force all values of array to be finite.
  • False: accept both np.inf and np.nan in array.
  • 'allow-nan': accept only np.nan values in array. Values cannot be infinite.

.. versionadded:: 0.22 ``force_all_finite`` accepts the string ``'allow-nan'``.

copy : bool Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

.. versionadded:: 0.22

Returns ------- safe_X : array-like, sparse matrix, shape (n_samples_a, n_features) An array equal to X, guaranteed to be a numpy array.

safe_Y : array-like, sparse matrix, shape (n_samples_b, n_features) An array equal to Y if Y was not None, guaranteed to be a numpy array. If Y was None, safe_Y will be a pointer to X.

val chi2_kernel : ?y:Ndarray.t -> ?gamma:float -> x:Ndarray.t -> unit -> Ndarray.t

Computes the exponential chi-squared kernel between X and Y.

The chi-squared kernel is computed between each pair of rows in X and Y. X and Y have to be non-negative. This kernel is most commonly applied to histograms.

The chi-squared kernel is given by::

k(x, y) = exp(-gamma Sum [(x - y)^2 / (x + y)])

It can be interpreted as a weighted difference per entry.

Read more in the :ref:`User Guide <chi2_kernel>`.

Parameters ---------- X : array-like of shape (n_samples_X, n_features)

Y : array of shape (n_samples_Y, n_features)

gamma : float, default=1. Scaling parameter of the chi2 kernel.

Returns ------- kernel_matrix : array of shape (n_samples_X, n_samples_Y)

References ---------- * Zhang, J. and Marszalek, M. and Lazebnik, S. and Schmid, C. Local features and kernels for classification of texture and object categories: A comprehensive study International Journal of Computer Vision 2007 https://research.microsoft.com/en-us/um/people/manik/projects/trade-off/papers/ZhangIJCV06.pdf

See also -------- additive_chi2_kernel : The additive version of this kernel

sklearn.kernel_approximation.AdditiveChi2Sampler : A Fourier approximation to the additive version of this kernel.

val cosine_distances : ?y:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> x:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> unit -> Py.Object.t

Compute cosine distance between samples in X and Y.

Cosine distance is defined as 1.0 minus the cosine similarity.

Read more in the :ref:`User Guide <metrics>`.

Parameters ---------- X : array_like, sparse matrix with shape (n_samples_X, n_features).

Y : array_like, sparse matrix (optional) with shape (n_samples_Y, n_features).

Returns ------- distance matrix : array An array with shape (n_samples_X, n_samples_Y).

See also -------- sklearn.metrics.pairwise.cosine_similarity scipy.spatial.distance.cosine : dense matrices only

val cosine_similarity : ?y:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> ?dense_output:bool -> x:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> unit -> Py.Object.t

Compute cosine similarity between samples in X and Y.

Cosine similarity, or the cosine kernel, computes similarity as the normalized dot product of X and Y:

K(X, Y) = <X, Y> / (||X||*||Y||)

On L2-normalized data, this function is equivalent to linear_kernel.

Read more in the :ref:`User Guide <cosine_similarity>`.

Parameters ---------- X : ndarray or sparse array, shape: (n_samples_X, n_features) Input data.

Y : ndarray or sparse array, shape: (n_samples_Y, n_features) Input data. If ``None``, the output will be the pairwise similarities between all samples in ``X``.

dense_output : boolean (optional), default True Whether to return dense output even when the input is sparse. If ``False``, the output is sparse if both input arrays are sparse.

.. versionadded:: 0.17 parameter ``dense_output`` for dense output.

Returns ------- kernel matrix : array An array with shape (n_samples_X, n_samples_Y).
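
As a rough illustration of the formula above, the sketch below computes the cosine similarity of a single pair of vectors in pure Python and derives the cosine distance as 1.0 minus the similarity. The helper name `cosine_similarity_pair` is hypothetical, and the sketch assumes both vectors are non-zero.

```python
import math


def cosine_similarity_pair(x, y):
    """Normalized dot product of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


sim_parallel = cosine_similarity_pair([1.0, 2.0], [2.0, 4.0])    # same direction, close to 1
sim_orthogonal = cosine_similarity_pair([1.0, 0.0], [0.0, 3.0])  # orthogonal, exactly 0
dist_orthogonal = 1.0 - sim_orthogonal                           # cosine distance
```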

val delayed : ?check_pickle:Py.Object.t -> function_:Py.Object.t -> unit -> Py.Object.t

Decorator used to capture the arguments of a function.

val distance_metrics : unit -> Py.Object.t

Valid metrics for pairwise_distances.

This function simply returns the valid pairwise distance metrics. It exists to allow for a description of the mapping for each of the valid strings.

The valid distance metrics, and the function they map to, are:

===============   ========================================
metric            Function
===============   ========================================
'cityblock'       metrics.pairwise.manhattan_distances
'cosine'          metrics.pairwise.cosine_distances
'euclidean'       metrics.pairwise.euclidean_distances
'haversine'       metrics.pairwise.haversine_distances
'l1'              metrics.pairwise.manhattan_distances
'l2'              metrics.pairwise.euclidean_distances
'manhattan'       metrics.pairwise.manhattan_distances
'nan_euclidean'   metrics.pairwise.nan_euclidean_distances
===============   ========================================

Read more in the :ref:`User Guide <metrics>`.

val effective_n_jobs : ?n_jobs:Py.Object.t -> unit -> Py.Object.t

Determine the number of jobs that can actually run in parallel

n_jobs is the number of workers requested by the caller. Passing n_jobs=-1 requests all available workers, for instance as many as there are CPU cores on the worker host(s).

This method should return a guesstimate of the number of workers that can actually perform work concurrently with the currently enabled default backend. The primary use case is to make it possible for the caller to know in how many chunks to slice the work.

In general working on larger data chunks is more efficient (less scheduling overhead and better use of CPU cache prefetching heuristics) as long as all the workers have enough work to do.

Warning: this function is experimental and subject to change in a future version of joblib.

.. versionadded:: 0.10

val euclidean_distances : ?y:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> ?y_norm_squared:Ndarray.t -> ?squared:bool -> ?x_norm_squared:Ndarray.t -> x:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> unit -> Ndarray.t

Considering the rows of X (and Y=X) as vectors, compute the distance matrix between each pair of vectors.

For efficiency reasons, the euclidean distance between a pair of row vector x and y is computed as::

dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))

This formulation has two advantages over other ways of computing distances. First, it is computationally efficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then `dot(x, x)` and/or `dot(y, y)` can be pre-computed.

However, this is not the most precise way of doing this computation, and the distance matrix returned by this function may not be exactly symmetric as required by, e.g., ``scipy.spatial.distance`` functions.

Read more in the :ref:`User Guide <metrics>`.

Parameters ---------- X : array-like, sparse matrix, shape (n_samples_1, n_features)

Y : array-like, sparse matrix, shape (n_samples_2, n_features)

Y_norm_squared : array-like, shape (n_samples_2, ), optional Pre-computed dot-products of vectors in Y (e.g., ``(Y**2).sum(axis=1)``) May be ignored in some cases, see the note below.

squared : boolean, optional Return squared Euclidean distances.

X_norm_squared : array-like of shape (n_samples,), optional Pre-computed dot-products of vectors in X (e.g., ``(X**2).sum(axis=1)``) May be ignored in some cases, see the note below.

Notes ----- To achieve better accuracy, `X_norm_squared` and `Y_norm_squared` may be unused if they are passed as ``float32``.

Returns ------- distances : array, shape (n_samples_1, n_samples_2)

Examples --------
>>> from sklearn.metrics.pairwise import euclidean_distances
>>> X = [[0, 1], [1, 1]]
>>> # distance between rows of X
>>> euclidean_distances(X, X)
array([[0., 1.],
       [1., 0.]])
>>> # get distance to origin
>>> euclidean_distances(X, [[0, 0]])
array([[1.        ],
       [1.41421356]])

See also -------- paired_distances : distances between pairs of elements of X and Y.
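
The expanded formulation `sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y))` can be sketched in pure Python as follows. The function name `euclidean_distances_sketch` is hypothetical, and clamping small negative values to zero before the square root is an assumption of this sketch, added to absorb rounding error.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def euclidean_distances_sketch(X, Y):
    # The squared norms are computed once per row and reused for every pair.
    x_norms = [dot(x, x) for x in X]
    y_norms = [dot(y, y) for y in Y]
    D = []
    for x, xx in zip(X, x_norms):
        row = []
        for y, yy in zip(Y, y_norms):
            sq = xx - 2.0 * dot(x, y) + yy
            row.append(math.sqrt(max(sq, 0.0)))  # clamp rounding noise below zero
        D.append(row)
    return D


X = [[0, 1], [1, 1]]
D = euclidean_distances_sketch(X, X)                  # distances between rows of X
to_origin = euclidean_distances_sketch(X, [[0, 0]])   # distances to the origin
```

Precomputing `x_norms` and `y_norms` is what makes the formulation attractive when one argument stays fixed across calls.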

val gen_batches : ?min_batch_size:Py.Object.t -> n:int -> batch_size:Py.Object.t -> unit -> Py.Object.t

Generator to create slices containing batch_size elements, from 0 to n.

The last slice may contain fewer than batch_size elements when batch_size does not divide n.

Parameters ---------- n : int

batch_size : int Number of elements in each batch.

min_batch_size : int, default=0 Minimum batch size to produce.

Yields ------ slice of batch_size elements

Examples --------
>>> from sklearn.utils import gen_batches
>>> list(gen_batches(7, 3))
[slice(0, 3, None), slice(3, 6, None), slice(6, 7, None)]
>>> list(gen_batches(6, 3))
[slice(0, 3, None), slice(3, 6, None)]
>>> list(gen_batches(2, 3))
[slice(0, 2, None)]
>>> list(gen_batches(7, 3, min_batch_size=0))
[slice(0, 3, None), slice(3, 6, None), slice(6, 7, None)]
>>> list(gen_batches(7, 3, min_batch_size=2))
[slice(0, 3, None), slice(3, 7, None)]
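
The batching behaviour described above, including the effect of `min_batch_size`, can be reproduced with a short generator. This is a sketch under the assumption that a trailing batch smaller than `min_batch_size` is merged into the previous one; `batch_slices` is a hypothetical name, not part of the library.

```python
def batch_slices(n, batch_size, min_batch_size=0):
    """Yield slices covering range(n) in chunks of batch_size elements."""
    start = 0
    for _ in range(n // batch_size):
        end = start + batch_size
        # Skip this cut if the remainder would be smaller than min_batch_size;
        # the remainder is then merged into the final slice below.
        if end + min_batch_size > n:
            continue
        yield slice(start, end)
        start = end
    if start < n:
        yield slice(start, n)


data = list(range(7))
chunks = [data[s] for s in batch_slices(7, 3, min_batch_size=2)]
```

Each yielded slice can be applied directly to a list or array, as in the `chunks` line.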

val gen_even_slices : ?n_samples:Py.Object.t -> n:int -> n_packs:Py.Object.t -> unit -> Py.Object.t

Generator to create n_packs slices going up to n.

Parameters ---------- n : int

n_packs : int Number of slices to generate.

n_samples : int or None (default = None) Number of samples. Pass n_samples when the slices are to be used for sparse matrix indexing; slicing off-the-end raises an exception, while it works for NumPy arrays.

Yields ------ slice

Examples --------
>>> from sklearn.utils import gen_even_slices
>>> list(gen_even_slices(10, 1))
[slice(0, 10, None)]
>>> list(gen_even_slices(10, 10))
[slice(0, 1, None), slice(1, 2, None), ..., slice(9, 10, None)]
>>> list(gen_even_slices(10, 5))
[slice(0, 2, None), slice(2, 4, None), ..., slice(8, 10, None)]
>>> list(gen_even_slices(10, 3))
[slice(0, 4, None), slice(4, 7, None), slice(7, 10, None)]
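
One way to obtain the slices shown in the examples is to give each pack `n // n_packs` elements and hand one extra element to the first `n % n_packs` packs. The generator below is a sketch of that idea; `even_slices` is a hypothetical name, and capping the end of each slice at `n_samples` is an assumption based on the parameter description.

```python
def even_slices(n, n_packs, n_samples=None):
    """Yield n_packs slices of near-equal length covering range(n)."""
    start = 0
    for pack_num in range(n_packs):
        this_n = n // n_packs
        if pack_num < n % n_packs:
            this_n += 1  # the first n % n_packs packs take one extra element
        if this_n > 0:
            end = start + this_n
            if n_samples is not None:
                end = min(n_samples, end)  # never slice past the end of the data
            yield slice(start, end)
            start = end


slices = list(even_slices(10, 3))
```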

val get_chunk_n_rows : ?max_n_rows:Py.Object.t -> ?working_memory:Py.Object.t -> row_bytes:int -> unit -> Py.Object.t

Calculates how many rows can be processed within working_memory

Parameters ---------- row_bytes : int The expected number of bytes of memory that will be consumed during the processing of each row.

max_n_rows : int, optional The maximum return value.

working_memory : int or float, optional The number of rows to fit inside this number of MiB will be returned. When None (default), the value of ``sklearn.get_config()['working_memory']`` is used.

Returns ------- int or the value of n_samples

Warns ----- Issues a UserWarning if ``row_bytes`` exceeds ``working_memory`` MiB.
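
The arithmetic behind this function can be sketched as follows: convert the memory budget from MiB to bytes, divide by the per-row cost, and cap the result. This is an illustrative sketch, not the library's code; `chunk_n_rows` is a hypothetical name, the budget is passed explicitly instead of being read from the global configuration, and falling back to one row after the warning is an assumption.

```python
import warnings


def chunk_n_rows(row_bytes, working_memory_mib, max_n_rows=None):
    """Number of rows that fit into working_memory_mib MiB."""
    n_rows = int(working_memory_mib * 2 ** 20 // row_bytes)  # 1 MiB = 2**20 bytes
    if max_n_rows is not None:
        n_rows = min(n_rows, max_n_rows)
    if n_rows < 1:
        warnings.warn(
            "A single row needs more than %s MiB of working memory."
            % working_memory_mib
        )
        n_rows = 1  # assumption: always process at least one row
    return n_rows


# 1000 float64 values per row cost 8000 bytes; budget of 1 MiB.
rows = chunk_n_rows(row_bytes=8 * 1000, working_memory_mib=1)
```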

val haversine_distances : ?y:Ndarray.t -> x:Ndarray.t -> unit -> Ndarray.t

Compute the Haversine distance between samples in X and Y

The Haversine (or great circle) distance is the angular distance between two points on the surface of a sphere. The first coordinate of each point is assumed to be the latitude, the second is the longitude, both given in radians. The dimension of the data must be 2.

.. math:: D(x, y) = 2\arcsin\sqrt{\sin^2((x1 - y1) / 2) + \cos(x1)\cos(y1)\sin^2((x2 - y2) / 2)}

Parameters ---------- X : array_like, shape (n_samples_1, 2)

Y : array_like, shape (n_samples_2, 2), optional

Returns ------- distance : array, shape (n_samples_1, n_samples_2)

Notes ----- As the Earth is nearly spherical, the haversine formula provides a good approximation of the distance between two points of the Earth surface, with a less than 1% error on average.

Examples -------- We want to calculate the distance between the Ezeiza Airport (Buenos Aires, Argentina) and the Charles de Gaulle Airport (Paris, France)

>>> from sklearn.metrics.pairwise import haversine_distances
>>> from math import radians
>>> bsas = [-34.83333, -58.5166646]
>>> paris = [49.0083899664, 2.53844117956]
>>> bsas_in_radians = [radians(_) for _ in bsas]
>>> paris_in_radians = [radians(_) for _ in paris]
>>> result = haversine_distances([bsas_in_radians, paris_in_radians])
>>> result * 6371000/1000  # multiply by Earth radius to get kilometers
array([[    0.        , 11099.54035582],
       [11099.54035582,     0.        ]])
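
For readers who want to check the formula by hand, the sketch below evaluates it for a single pair of points using only the standard library. The function name `haversine` is hypothetical; the coordinates are those of the airport example, and the Earth radius of 6371 km is the same approximation used there.

```python
from math import asin, cos, radians, sin, sqrt


def haversine(p, q):
    """Angular distance between two (latitude, longitude) points in radians."""
    lat1, lon1 = p
    lat2, lon2 = q
    h = (sin((lat1 - lat2) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon1 - lon2) / 2) ** 2)
    return 2 * asin(sqrt(h))


bsas = [radians(v) for v in (-34.83333, -58.5166646)]
paris = [radians(v) for v in (49.0083899664, 2.53844117956)]
km = haversine(bsas, paris) * 6371.0  # multiply by Earth radius to get kilometers
```

The value of `km` should agree with the off-diagonal entries of the matrix in the example, about 11099.54.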

val is_scalar_nan : x:Py.Object.t -> unit -> Py.Object.t

Tests if x is NaN

This function is meant to overcome the issue that np.isnan does not allow non-numerical types as input, and that np.nan is not np.float('nan').

Parameters ---------- x : any type

Returns ------- boolean

Examples --------
>>> is_scalar_nan(np.nan)
True
>>> is_scalar_nan(float("nan"))
True
>>> is_scalar_nan(None)
False
>>> is_scalar_nan("")
False
>>> is_scalar_nan([np.nan])
False
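
The behaviour in the examples can be approximated without NumPy by first checking that the value is a real number and only then testing for NaN. This is a sketch with a hypothetical name, `is_scalar_nan_sketch`; it is not the library's implementation.

```python
import math
import numbers


def is_scalar_nan_sketch(x):
    """True only for real-valued scalars that are NaN."""
    # The isinstance check short-circuits for None, strings, lists and so on,
    # so math.isnan is never called on a non-numeric value.
    return isinstance(x, numbers.Real) and math.isnan(x)


checks = [is_scalar_nan_sketch(v) for v in (float("nan"), None, "", [float("nan")], 1.0)]
```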

val issparse : x:Py.Object.t -> unit -> Py.Object.t

Is x of a sparse matrix type?

Parameters ---------- x object to check for being a sparse matrix

Returns ------- bool True if x is a sparse matrix, False otherwise

Notes ----- issparse and isspmatrix are aliases for the same function.

Examples --------
>>> from scipy.sparse import csr_matrix, isspmatrix
>>> isspmatrix(csr_matrix([[5]]))
True

>>> from scipy.sparse import isspmatrix
>>> isspmatrix(5)
False

val kernel_metrics : unit -> Py.Object.t

Valid metrics for pairwise_kernels

This function simply returns the valid pairwise distance metrics. It exists, however, to allow for a verbose description of the mapping for each of the valid strings.

The valid distance metrics, and the function they map to, are:

===============   ========================================
metric            Function
===============   ========================================
'additive_chi2'   sklearn.pairwise.additive_chi2_kernel
'chi2'            sklearn.pairwise.chi2_kernel
'linear'          sklearn.pairwise.linear_kernel
'poly'            sklearn.pairwise.polynomial_kernel
'polynomial'      sklearn.pairwise.polynomial_kernel
'rbf'             sklearn.pairwise.rbf_kernel
'laplacian'       sklearn.pairwise.laplacian_kernel
'sigmoid'         sklearn.pairwise.sigmoid_kernel
'cosine'          sklearn.pairwise.cosine_similarity
===============   ========================================

Read more in the :ref:`User Guide <metrics>`.

val laplacian_kernel : ?y:Ndarray.t -> ?gamma:float -> x:Ndarray.t -> unit -> Ndarray.t

Compute the laplacian kernel between X and Y.

The laplacian kernel is defined as::

K(x, y) = exp(-gamma ||x-y||_1)

for each pair of rows x in X and y in Y. Read more in the :ref:`User Guide <laplacian_kernel>`.

.. versionadded:: 0.17

Parameters ---------- X : array of shape (n_samples_X, n_features)

Y : array of shape (n_samples_Y, n_features)

gamma : float, default None If None, defaults to 1.0 / n_features

Returns ------- kernel_matrix : array of shape (n_samples_X, n_samples_Y)
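
The definition above, together with the default `gamma = 1.0 / n_features`, can be sketched in pure Python. The function name `laplacian_kernel_sketch` is hypothetical and the code is only an illustration of the formula.

```python
import math


def laplacian_kernel_sketch(X, Y, gamma=None):
    """exp(-gamma * L1 distance) for each pair of rows x in X and y in Y."""
    if gamma is None:
        gamma = 1.0 / len(X[0])  # default: 1 / n_features
    return [
        [math.exp(-gamma * sum(abs(a - b) for a, b in zip(x, y))) for y in Y]
        for x in X
    ]


# Two features, so gamma defaults to 0.5; the second row is at L1 distance 2 from the origin.
K = laplacian_kernel_sketch([[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0]])
```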

val linear_kernel : ?y:Ndarray.t -> ?dense_output:bool -> x:Ndarray.t -> unit -> Py.Object.t

Compute the linear kernel between X and Y.

Read more in the :ref:`User Guide <linear_kernel>`.

Parameters ---------- X : array of shape (n_samples_1, n_features)

Y : array of shape (n_samples_2, n_features)

dense_output : boolean (optional), default True Whether to return dense output even when the input is sparse. If ``False``, the output is sparse if both input arrays are sparse.

.. versionadded:: 0.20

Returns ------- Gram matrix : array of shape (n_samples_1, n_samples_2)

val manhattan_distances : ?y:Ndarray.t -> ?sum_over_features:bool -> x:Ndarray.t -> unit -> Ndarray.t

Compute the L1 distances between the vectors in X and Y.

With sum_over_features equal to False it returns the componentwise distances.

Read more in the :ref:`User Guide <metrics>`.

Parameters ---------- X : array_like An array with shape (n_samples_X, n_features).

Y : array_like, optional An array with shape (n_samples_Y, n_features).

sum_over_features : bool, default=True If True the function returns the pairwise distance matrix else it returns the componentwise L1 pairwise-distances. Not supported for sparse matrix inputs.

Returns ------- D : array If sum_over_features is False shape is (n_samples_X * n_samples_Y, n_features) and D contains the componentwise L1 pairwise-distances (ie. absolute difference), else shape is (n_samples_X, n_samples_Y) and D contains the pairwise L1 distances.

Notes -------- When X and/or Y are CSR sparse matrices and they are not already in canonical format, this function modifies them in-place to make them canonical.

Examples --------
>>> from sklearn.metrics.pairwise import manhattan_distances
>>> manhattan_distances([[3]], [[3]])
array([[0.]])
>>> manhattan_distances([[3]], [[2]])
array([[1.]])
>>> manhattan_distances([[2]], [[3]])
array([[1.]])
>>> manhattan_distances([[1, 2], [3, 4]], [[1, 2], [0, 3]])
array([[0., 2.],
       [4., 4.]])
>>> import numpy as np
>>> X = np.ones((1, 2))
>>> y = np.full((2, 2), 2.)
>>> manhattan_distances(X, y, sum_over_features=False)
array([[1., 1.],
       [1., 1.]])
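
The two output shapes can be made concrete with a small pure-Python sketch. The function names `manhattan_matrix` and `manhattan_componentwise` are hypothetical; the second one mirrors the `sum_over_features=False` case by returning one row of absolute differences per (x, y) pair.

```python
def manhattan_matrix(X, Y):
    """Pairwise L1 distances, shape (len(X), len(Y))."""
    return [[sum(abs(a - b) for a, b in zip(x, y)) for y in Y] for x in X]


def manhattan_componentwise(X, Y):
    """Absolute differences per feature, shape (len(X) * len(Y), n_features)."""
    return [[abs(a - b) for a, b in zip(x, y)] for x in X for y in Y]


D = manhattan_matrix([[1, 2], [3, 4]], [[1, 2], [0, 3]])
C = manhattan_componentwise([[1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]])
```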

val nan_euclidean_distances : ?y:Ndarray.t -> ?squared:bool -> ?missing_values:[ `Int of int | `PyObject of Py.Object.t ] -> ?copy:bool -> x:Ndarray.t -> unit -> Ndarray.t

Calculate the euclidean distances in the presence of missing values.

Compute the euclidean distance between each pair of samples in X and Y, where Y=X is assumed if Y=None. When calculating the distance between a pair of samples, this formulation ignores feature coordinates with a missing value in either sample and scales up the weight of the remaining coordinates:

dist(x,y) = sqrt(weight * sq. distance from present coordinates)

where, weight = Total # of coordinates / # of present coordinates

For example, the distance between ``[3, na, na, 6]`` and ``[1, na, 4, 5]`` is:

.. math:: \sqrt{\frac{4}{2}((3-1)^2 + (6-5)^2)}

If all the coordinates are missing or if there are no common present coordinates then NaN is returned for that pair.

Read more in the :ref:`User Guide <metrics>`.

.. versionadded:: 0.22

Parameters ---------- X : array-like, shape=(n_samples_1, n_features)

Y : array-like, shape=(n_samples_2, n_features)

squared : bool, default=False Return squared Euclidean distances.

missing_values : np.nan or int, default=np.nan Representation of missing value

copy : boolean, default=True Make and use a deep copy of X and Y (if Y exists)

Returns ------- distances : array, shape (n_samples_1, n_samples_2)

Examples --------
>>> from sklearn.metrics.pairwise import nan_euclidean_distances
>>> nan = float("NaN")
>>> X = [[0, 1], [1, nan]]
>>> nan_euclidean_distances(X, X) # distance between rows of X
array([[0.        , 1.41421356],
       [1.41421356, 0.        ]])

>>> # get distance to origin
>>> nan_euclidean_distances(X, [[0, 0]])
array([[1.        ],
       [1.41421356]])

References ---------- * John K. Dixon, "Pattern Recognition with Partly Missing Data", IEEE Transactions on Systems, Man, and Cybernetics, Volume: 9, Issue: 10, pp. 617 - 621, Oct. 1979. http://ieeexplore.ieee.org/abstract/document/4310090/

See also -------- paired_distances : distances between pairs of elements of X and Y.
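
The weighting rule described above can be checked on a single pair of samples with a short pure-Python sketch. The function name `nan_euclidean` is hypothetical, and missing values are assumed to be encoded as NaN.

```python
import math


def nan_euclidean(x, y, squared=False):
    """Euclidean distance that ignores coordinates missing in either sample."""
    present = [(a, b) for a, b in zip(x, y)
               if not (math.isnan(a) or math.isnan(b))]
    if not present:
        return float("nan")  # no common present coordinates
    weight = len(x) / len(present)  # total coordinates / present coordinates
    sq = weight * sum((a - b) ** 2 for a, b in present)
    return sq if squared else math.sqrt(sq)


nan = float("nan")
d = nan_euclidean([3, nan, nan, 6], [1, nan, 4, 5])  # sqrt(4/2 * (2**2 + 1**2))
```

Only the first and last coordinates are present in both samples, so the weight is 4 / 2 and the result is the square root of 10.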

val normalize : ?norm:[ `L1 | `L2 | `Max | `PyObject of Py.Object.t ] -> ?axis:Py.Object.t -> ?copy:bool -> ?return_norm:bool -> x:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> unit -> Py.Object.t * Py.Object.t

Scale input vectors individually to unit norm (vector length).

Read more in the :ref:`User Guide <preprocessing_normalization>`.

Parameters ---------- X : array-like, sparse matrix, shape [n_samples, n_features] The data to normalize, element by element. scipy.sparse matrices should be in CSR format to avoid an unnecessary copy.

norm : 'l1', 'l2', or 'max', optional ('l2' by default) The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).

axis : 0 or 1, optional (1 by default) axis used to normalize the data along. If 1, independently normalize each sample, otherwise (if 0) normalize each feature.

copy : boolean, optional, default True set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix and if axis is 1).

return_norm : boolean, default False whether to return the computed norms

Returns ------- X : array-like, sparse matrix, shape [n_samples, n_features] Normalized input X.

norms : array, shape [n_samples] if axis=1 else [n_features] An array of norms along given axis for X. When X is sparse, a NotImplementedError will be raised for norm 'l1' or 'l2'.

See also -------- Normalizer: Performs normalization using the ``Transformer`` API (e.g. as part of a preprocessing :class:`sklearn.pipeline.Pipeline`).

Notes ----- For a comparison of the different scalers, transformers, and normalizers, see :ref:`examples/preprocessing/plot_all_scaling.py <sphx_glr_auto_examples_preprocessing_plot_all_scaling.py>`.
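
To make the three norms concrete, the sketch below normalizes each row (the `axis=1` case) of a small dense matrix in pure Python. The function name `normalize_rows` is hypothetical. Leaving all-zero rows unchanged and using absolute values for the 'max' norm are assumptions of this sketch, not statements about the library.

```python
import math


def normalize_rows(X, norm="l2"):
    """Scale each row of X to unit norm."""
    out = []
    for row in X:
        if norm == "l1":
            scale = sum(abs(v) for v in row)
        elif norm == "l2":
            scale = math.sqrt(sum(v * v for v in row))
        elif norm == "max":
            scale = max(abs(v) for v in row)
        else:
            raise ValueError("unsupported norm: %r" % norm)
        if scale == 0:
            scale = 1.0  # assumption: all-zero rows are left unchanged
        out.append([v / scale for v in row])
    return out


l2_rows = normalize_rows([[3.0, 4.0], [0.0, 0.0]])
l1_rows = normalize_rows([[3.0, 4.0]], norm="l1")
max_rows = normalize_rows([[3.0, 4.0]], norm="max")
```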

val paired_cosine_distances : x:Ndarray.t -> y:Ndarray.t -> unit -> Ndarray.t

Computes the paired cosine distances between X and Y

Read more in the :ref:`User Guide <metrics>`.

Parameters ---------- X : array-like, shape (n_samples, n_features)

Y : array-like, shape (n_samples, n_features)

Returns ------- distances : ndarray, shape (n_samples, )

Notes ----- The cosine distance is equivalent to half the squared euclidean distance if each sample is normalized to unit norm.
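
The note follows from expanding the squared distance: for unit vectors, ||x - y||^2 = ||x||^2 - 2 <x, y> + ||y||^2 = 2 - 2 <x, y>, so half of it equals 1 - <x, y>, which is the cosine distance. The sketch below checks this numerically on one pair; the helper `unit` is a hypothetical name.

```python
import math


def unit(v):
    """Scale a non-zero vector to unit L2 norm."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]


x = unit([3.0, 4.0])
y = unit([1.0, 0.0])

cosine_distance = 1.0 - sum(a * b for a, b in zip(x, y))
half_sq_euclidean = 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))
# Both quantities are approximately 0.4 for this pair.
```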

val paired_distances : ?metric:[ `String of string | `Callable of Py.Object.t ] -> ?kwds:(string * Py.Object.t) list -> x:Py.Object.t -> y:Py.Object.t -> unit -> Py.Object.t

Computes the paired distances between X and Y.

Computes the distances between (X[0], Y[0]), (X[1], Y[1]), etc.

Read more in the :ref:`User Guide <metrics>`.

Parameters ---------- X : ndarray (n_samples, n_features) Array 1 for distance computation.

Y : ndarray (n_samples, n_features) Array 2 for distance computation.

metric : string or callable The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options specified in PAIRED_DISTANCES, including "euclidean", "manhattan", or "cosine". Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.

Returns ------- distances : ndarray (n_samples, )

Examples --------
>>> from sklearn.metrics.pairwise import paired_distances
>>> X = [[0, 1], [1, 1]]
>>> Y = [[0, 1], [2, 1]]
>>> paired_distances(X, Y)
array([0., 1.])

See also -------- pairwise_distances : Computes the distance between every pair of samples
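
Paired distances compare row i of X only with row i of Y, so the result has one value per row instead of a full matrix. The sketch below shows this for the euclidean metric in pure Python; `paired_euclidean` is a hypothetical name, and raising `ValueError` on mismatched lengths is an assumption of the sketch.

```python
import math


def paired_euclidean(X, Y):
    """Distance between X[i] and Y[i] for every i."""
    if len(X) != len(Y):
        raise ValueError("X and Y must contain the same number of rows")
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        for x, y in zip(X, Y)
    ]


X = [[0, 1], [1, 1]]
Y = [[0, 1], [2, 1]]
distances = paired_euclidean(X, Y)
```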

val paired_euclidean_distances : x:Ndarray.t -> y:Ndarray.t -> unit -> Py.Object.t

Computes the paired euclidean distances between X and Y

Read more in the :ref:`User Guide <metrics>`.

Parameters ---------- X : array-like, shape (n_samples, n_features)

Y : array-like, shape (n_samples, n_features)

Returns ------- distances : ndarray (n_samples, )

val paired_manhattan_distances : x:Ndarray.t -> y:Ndarray.t -> unit -> Py.Object.t

Compute the L1 distances between the vectors in X and Y.

Read more in the :ref:`User Guide <metrics>`.

Parameters ---------- X : array-like, shape (n_samples, n_features)

Y : array-like, shape (n_samples, n_features)

Returns ------- distances : ndarray (n_samples, )

val pairwise_distances : ?y:Ndarray.t -> ?metric:[ `String of string | `Callable of Py.Object.t ] -> ?n_jobs:[ `Int of int | `None ] -> ?force_all_finite:[ `Bool of bool | `Allow_nan ] -> ?kwds:(string * Py.Object.t) list -> x:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> unit -> Py.Object.t

Compute the distance matrix from a vector array X and optional Y.

This method takes either a vector array or a distance matrix, and returns a distance matrix. If the input is a vector array, the distances are computed. If the input is a distances matrix, it is returned instead.

This method provides a safe way to take a distance matrix as input, while preserving compatibility with many other algorithms that take a vector array.

If Y is given (default is None), then the returned matrix is the pairwise distance between the arrays from both X and Y.

Valid values for metric are:

  • From scikit-learn: 'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'. These metrics support sparse matrix inputs. 'nan_euclidean' is also available, but it does not yet support sparse matrices.
  • From scipy.spatial.distance: 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule' See the documentation for scipy.spatial.distance for details on these metrics. These metrics do not support sparse matrix inputs.

Note that in the case of 'cityblock', 'cosine' and 'euclidean' (which are also valid scipy.spatial.distance metrics), the scikit-learn implementation is used, which is faster and supports sparse matrices (except for 'cityblock'). For a verbose description of the metrics from scikit-learn, see the __doc__ of the sklearn.metrics.pairwise.distance_metrics function.

Read more in the :ref:`User Guide <metrics>`.

Parameters ---------- X : array [n_samples_a, n_samples_a] if metric == "precomputed", or [n_samples_a, n_features] otherwise Array of pairwise distances between samples, or a feature array.

Y : array [n_samples_b, n_features], optional An optional second feature array. Only allowed if metric != "precomputed".

metric : string, or callable The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is "precomputed", X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.

n_jobs : int or None, optional (default=None) The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel.

``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

force_all_finite : boolean or 'allow-nan', (default=True) Whether to raise an error on np.inf and np.nan in array. The possibilities are:

  • True: Force all values of array to be finite.
  • False: accept both np.inf and np.nan in array.
  • 'allow-nan': accept only np.nan values in array. Values cannot be infinite.

.. versionadded:: 0.22

**kwds : optional keyword parameters Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Returns ------- D : array [n_samples_a, n_samples_a] or [n_samples_a, n_samples_b] A distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then D_{i, j} is the distance between the ith array from X and the jth array from Y.

See also -------- pairwise_distances_chunked : performs the same calculation as this function, but returns a generator of chunks of the distance matrix, in order to limit memory usage.

paired_distances : Computes the distances between corresponding elements of two arrays
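The shape of the returned matrix can be illustrated with a small pure-Python sketch. It is not how the library computes distances; the names `euclidean` and `pairwise` are hypothetical, and the callable-metric convention (one value per pair of rows) follows the description of the metric parameter above.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise(X, Y=None, metric=euclidean):
    # With Y omitted, distances are taken among the rows of X itself.
    if Y is None:
        Y = X
    # Entry [i][j] is the distance between row i of X and row j of Y.
    return [[metric(x, y) for y in Y] for x in X]

X = [[0.0, 0.0], [3.0, 4.0]]
Y = [[0.0, 0.0], [6.0, 8.0], [3.0, 0.0]]
D_xx = pairwise(X)     # shape (2, 2)
D_xy = pairwise(X, Y)  # shape (2, 3)
```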

val pairwise_distances_argmin : ?axis:int -> ?metric:[ `String of string | `Callable of Py.Object.t ] -> ?metric_kwargs:Py.Object.t -> x:Ndarray.t -> y:Ndarray.t -> unit -> Py.Object.t

Compute minimum distances between one point and a set of points.

This function computes for each row in X, the index of the row of Y which is closest (according to the specified distance).

This is mostly equivalent to calling:

pairwise_distances(X, Y=Y, metric=metric).argmin(axis=axis)

but uses much less memory, and is faster for large arrays.

This function works with dense 2D arrays only.

Parameters ---------- X : array-like of shape (n_samples1, n_features) Array containing points.

Y : array-like of shape (n_samples2, n_features) Array containing points.

axis : int, optional, default 1 Axis along which the argmin and distances are to be computed.

metric : string or callable metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy's metrics, but is less efficient than passing the metric name as a string.

Distance matrices are not supported.

Valid values for metric are:

  • from scikit-learn: 'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'
  • from scipy.spatial.distance: 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'

See the documentation for scipy.spatial.distance for details on these metrics.

metric_kwargs : dict keyword arguments to pass to specified metric function.

Returns ------- argmin : numpy.ndarray Y[argmin[i], :] is the row in Y that is closest to X[i, :].

See also -------- sklearn.metrics.pairwise_distances sklearn.metrics.pairwise_distances_argmin_min
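The role of the axis parameter can be sketched in pure Python. This is an illustrative version only: `argmin_sketch` and `euclidean` are hypothetical names, and the sketch assumes the Euclidean metric. It mirrors `pairwise_distances(X, Y).argmin(axis=axis)` without ever holding the full distance matrix.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def argmin_sketch(X, Y, axis=1):
    # axis=1: for each row of X, index of the closest row of Y.
    # axis=0: the roles of X and Y are swapped.
    if axis == 0:
        X, Y = Y, X
    # min() returns the first minimal index, so ties resolve like argmin.
    return [min(range(len(Y)), key=lambda j: euclidean(x, Y[j])) for x in X]

X = [[0, 0], [5, 5]]
Y = [[4, 4], [1, 0], [6, 5]]
closest_in_Y = argmin_sketch(X, Y, axis=1)
closest_in_X = argmin_sketch(X, Y, axis=0)
```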

val pairwise_distances_argmin_min : ?axis:int -> ?metric:[ `String of string | `Callable of Py.Object.t ] -> ?metric_kwargs:Py.Object.t -> x:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> y:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> unit -> Py.Object.t * Py.Object.t

Compute minimum distances between one point and a set of points.

This function computes for each row in X, the index of the row of Y which is closest (according to the specified distance). The minimal distances are also returned.

This is mostly equivalent to calling:

(pairwise_distances(X, Y=Y, metric=metric).argmin(axis=axis), pairwise_distances(X, Y=Y, metric=metric).min(axis=axis))

but uses much less memory, and is faster for large arrays.

Parameters ---------- X : array-like, sparse matrix, shape (n_samples1, n_features) Array containing points.

Y : array-like, sparse matrix, shape (n_samples2, n_features) Array containing points.

axis : int, optional, default 1 Axis along which the argmin and distances are to be computed.

metric : string or callable, default 'euclidean' metric to use for distance computation. Any metric from scikit-learn or scipy.spatial.distance can be used.

If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them. This works for Scipy's metrics, but is less efficient than passing the metric name as a string.

Distance matrices are not supported.

Valid values for metric are:

  • from scikit-learn: 'cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'
  • from scipy.spatial.distance: 'braycurtis', 'canberra', 'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard', 'kulsinski', 'mahalanobis', 'minkowski', 'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener', 'sokalsneath', 'sqeuclidean', 'yule'

See the documentation for scipy.spatial.distance for details on these metrics.

metric_kwargs : dict, optional Keyword arguments to pass to specified metric function.

Returns ------- argmin : numpy.ndarray Y[argmin[i], :] is the row in Y that is closest to X[i, :].

distances : numpy.ndarray distances[i] is the distance between the i-th row in X and the argmin[i]-th row in Y.

See also -------- sklearn.metrics.pairwise_distances sklearn.metrics.pairwise_distances_argmin
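The relationship between the two returned arrays can be sketched as follows. This is a pure-Python illustration assuming the Euclidean metric and axis=1; `argmin_min_sketch` is a hypothetical name. Only one row of the distance matrix exists at any time, which is the memory saving described above.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def argmin_min_sketch(X, Y):
    indices, distances = [], []
    for x in X:
        row = [euclidean(x, y) for y in Y]  # a single row of the distance matrix
        j = min(range(len(row)), key=row.__getitem__)
        indices.append(j)
        distances.append(row[j])  # distances[i] belongs to Y[indices[i]]
    return indices, distances

X = [[0, 0], [3, 4]]
Y = [[3, 0], [0, 1], [3, 1]]
indices, distances = argmin_min_sketch(X, Y)
```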

val pairwise_distances_chunked : ?y:Ndarray.t -> ?reduce_func:Py.Object.t -> ?metric:[ `String of string | `Callable of Py.Object.t ] -> ?n_jobs:[ `Int of int | `None ] -> ?working_memory:int -> ?kwds:(string * Py.Object.t) list -> x:Py.Object.t -> unit -> Py.Object.t

Generate a distance matrix chunk by chunk with optional reduction

In cases where not all of a pairwise distance matrix needs to be stored at once, this is used to calculate pairwise distances in ``working_memory``-sized chunks. If ``reduce_func`` is given, it is run on each chunk and its return values are concatenated into lists, arrays or sparse matrices.

Parameters ---------- X : array [n_samples_a, n_samples_a] if metric == "precomputed", or [n_samples_a, n_features] otherwise Array of pairwise distances between samples, or a feature array.

Y : array [n_samples_b, n_features], optional An optional second feature array. Only allowed if metric != "precomputed".

reduce_func : callable, optional The function which is applied on each chunk of the distance matrix, reducing it to needed values. ``reduce_func(D_chunk, start)`` is called repeatedly, where ``D_chunk`` is a contiguous vertical slice of the pairwise distance matrix, starting at row ``start``. It should return one of: None; an array, a list, or a sparse matrix of length ``D_chunk.shape[0]``; or a tuple of such objects. Returning None is useful for in-place operations, rather than reductions.

If None, pairwise_distances_chunked returns a generator of vertical chunks of the distance matrix.

metric : string, or callable The metric to use when calculating distance between instances in a feature array. If metric is a string, it must be one of the options allowed by scipy.spatial.distance.pdist for its metric parameter, or a metric listed in pairwise.PAIRWISE_DISTANCE_FUNCTIONS. If metric is "precomputed", X is assumed to be a distance matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays from X as input and return a value indicating the distance between them.

n_jobs : int or None, optional (default=None) The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel.

``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

working_memory : int, optional The sought maximum memory for temporary distance matrix chunks. When None (default), the value of ``sklearn.get_config()['working_memory']`` is used.

`**kwds` : optional keyword parameters Any further parameters are passed directly to the distance function. If using a scipy.spatial.distance metric, the parameters are still metric dependent. See the scipy docs for usage examples.

Yields ------ D_chunk : array or sparse matrix A contiguous slice of distance matrix, optionally processed by ``reduce_func``.

Examples -------- Without reduce_func:

>>> import numpy as np
>>> from sklearn.metrics import pairwise_distances_chunked
>>> X = np.random.RandomState(0).rand(5, 3)
>>> D_chunk = next(pairwise_distances_chunked(X))
>>> D_chunk
array([[0.  ..., 0.29..., 0.41..., 0.19..., 0.57...],
       [0.29..., 0.  ..., 0.57..., 0.41..., 0.76...],
       [0.41..., 0.57..., 0.  ..., 0.44..., 0.90...],
       [0.19..., 0.41..., 0.44..., 0.  ..., 0.51...],
       [0.57..., 0.76..., 0.90..., 0.51..., 0.  ...]])

Retrieve all neighbors and average distance within radius r:

>>> r = .2
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r) for d in D_chunk]
...     avg_dist = (D_chunk * (D_chunk < r)).mean(axis=1)
...     return neigh, avg_dist
>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func)
>>> neigh, avg_dist = next(gen)
>>> neigh
[array([0, 3]), array([1]), array([2]), array([0, 3]), array([4])]
>>> avg_dist
array([0.039..., 0.        , 0.        , 0.039..., 0.        ])

Where r is defined per sample, we need to make use of ``start``:

>>> r = [.2, .4, .4, .3, .1]
>>> def reduce_func(D_chunk, start):
...     neigh = [np.flatnonzero(d < r[i])
...              for i, d in enumerate(D_chunk, start)]
...     return neigh
>>> neigh = next(pairwise_distances_chunked(X, reduce_func=reduce_func))
>>> neigh
[array([0, 3]), array([0, 1]), array([2]), array([0, 3]), array([4])]

Force row-by-row generation by reducing ``working_memory``:

>>> gen = pairwise_distances_chunked(X, reduce_func=reduce_func,
...                                  working_memory=0)
>>> next(gen)
[array([0, 3])]
>>> next(gen)
[array([0, 1])]
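The chunking idea itself can be sketched without NumPy. The following is a simplified illustration, not the library's implementation: `chunked_distances` is a hypothetical name, and `rows_per_chunk` is a hypothetical stand-in for the row count that the real function derives from ``working_memory``. It shows how ``start`` lets ``reduce_func`` map chunk-local rows back to global row indices.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def chunked_distances(X, rows_per_chunk, reduce_func=None):
    # Yield the distance matrix a few rows at a time instead of all at once.
    for start in range(0, len(X), rows_per_chunk):
        block = X[start:start + rows_per_chunk]
        D_chunk = [[euclidean(x, y) for y in X] for x in block]
        yield D_chunk if reduce_func is None else reduce_func(D_chunk, start)

def neighbors_within(radius):
    def reduce_func(D_chunk, start):
        # start + i recovers the global row index of chunk-local row i.
        return [
            (start + i, [j for j, d in enumerate(row) if d < radius])
            for i, row in enumerate(D_chunk)
        ]
    return reduce_func

X = [[0.0], [1.0], [5.0], [5.5]]
chunks = list(chunked_distances(X, rows_per_chunk=2,
                                reduce_func=neighbors_within(1.5)))
```

Each yielded item covers two rows here; setting `rows_per_chunk=1` in this sketch corresponds to the row-by-row generation shown above.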

val pairwise_kernels : ?y:Ndarray.t -> ?metric:[ `String of string | `Callable of Py.Object.t ] -> ?filter_params:bool -> ?n_jobs:[ `Int of int | `None ] -> ?kwds:(string * Py.Object.t) list -> x:[ `Ndarray of Ndarray.t | `PyObject of Py.Object.t ] -> unit -> Py.Object.t

Compute the kernel between arrays X and optional array Y.

This method takes either a vector array or a kernel matrix, and returns a kernel matrix. If the input is a vector array, the kernels are computed. If the input is a kernel matrix, it is returned instead.

This method provides a safe way to take a kernel matrix as input, while preserving compatibility with many other algorithms that take a vector array.

If Y is given (default is None), then the returned matrix is the pairwise kernel between the arrays from both X and Y.

Valid values for metric are: 'additive_chi2', 'chi2', 'linear', 'poly', 'polynomial', 'rbf', 'laplacian', 'sigmoid', 'cosine'

Read more in the :ref:`User Guide <metrics>`.

Parameters ---------- X : array [n_samples_a, n_samples_a] if metric == "precomputed", or [n_samples_a, n_features] otherwise Array of pairwise kernels between samples, or a feature array.

Y : array [n_samples_b, n_features] A second feature array only if X has shape [n_samples_a, n_features].

metric : string, or callable The metric to use when calculating kernel between instances in a feature array. If metric is a string, it must be one of the metrics in pairwise.PAIRWISE_KERNEL_FUNCTIONS. If metric is "precomputed", X is assumed to be a kernel matrix. Alternatively, if metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two rows from X as input and return the corresponding kernel value as a single number. This means that callables from :mod:`sklearn.metrics.pairwise` are not allowed, as they operate on matrices, not single samples. Use the string identifying the kernel instead.

filter_params : boolean Whether to filter invalid parameters or not.

n_jobs : int or None, optional (default=None) The number of jobs to use for the computation. This works by breaking down the pairwise matrix into n_jobs even slices and computing them in parallel.

``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary <n_jobs>` for more details.

**kwds : optional keyword parameters Any further parameters are passed directly to the kernel function.

Returns ------- K : array [n_samples_a, n_samples_a] or [n_samples_a, n_samples_b] A kernel matrix K such that K_{i, j} is the kernel between the ith and jth vectors of the given matrix X, if Y is None. If Y is not None, then K_{i, j} is the kernel between the ith array from X and the jth array from Y.

Notes ----- If metric is 'precomputed', Y is ignored and X is returned.
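The three ways of specifying metric (a kernel name, "precomputed", or a callable on pairs of rows) can be sketched as a small dispatcher. This is a simplified pure-Python illustration: `KERNELS` is a hypothetical two-entry registry standing in for PAIRWISE_KERNEL_FUNCTIONS, and unlike the library, the sketch evaluates named kernels pair by pair rather than on whole matrices.

```python
import math

def linear(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical registry for illustration only.
KERNELS = {"linear": linear, "rbf": rbf}

def pairwise_kernels_sketch(X, Y=None, metric="linear", **kwds):
    if metric == "precomputed":
        return X  # X is already a kernel matrix; Y is ignored
    if Y is None:
        Y = X
    func = metric if callable(metric) else KERNELS[metric]
    return [[func(x, y, **kwds) for y in Y] for x in X]

X = [[1, 0], [0, 2]]
K_linear = pairwise_kernels_sketch(X, metric="linear")
K_same = pairwise_kernels_sketch(X, metric="precomputed")
# A callable receives two rows and returns a single number.
K_custom = pairwise_kernels_sketch(
    X, metric=lambda x, y: sum(min(a, b) for a, b in zip(x, y)))
```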

val polynomial_kernel : ?y:Ndarray.t -> ?degree:int -> ?gamma:float -> ?coef0:float -> x:Ndarray.t -> unit -> Py.Object.t

Compute the polynomial kernel between X and Y::

K(X, Y) = (gamma <X, Y> + coef0)^degree

Read more in the :ref:`User Guide <polynomial_kernel>`.

Parameters ---------- X : ndarray of shape (n_samples_1, n_features)

Y : ndarray of shape (n_samples_2, n_features)

degree : int, default 3

gamma : float, default None If None, defaults to 1.0 / n_features

coef0 : float, default 1

Returns ------- Gram matrix : array of shape (n_samples_1, n_samples_2)
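A worked instance of the formula may help. The sketch below is a pure-Python illustration rather than the library code; `polynomial_kernel_sketch` is a hypothetical name. It applies the default gamma = 1.0 / n_features documented above.

```python
def polynomial_kernel_sketch(X, Y=None, degree=3, gamma=None, coef0=1.0):
    if Y is None:
        Y = X
    if gamma is None:
        gamma = 1.0 / len(X[0])  # default: 1.0 / n_features
    return [
        [(gamma * sum(a * b for a, b in zip(x, y)) + coef0) ** degree for y in Y]
        for x in X
    ]

X = [[1.0, 2.0]]
Y = [[3.0, 4.0], [0.0, 0.0]]
# gamma = 0.5, <x, y0> = 11, so K[0][0] = (0.5 * 11 + 1) ** 2
K = polynomial_kernel_sketch(X, Y, degree=2)
```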

val rbf_kernel : ?y:Ndarray.t -> ?gamma:float -> x:Ndarray.t -> unit -> Ndarray.t

Compute the rbf (gaussian) kernel between X and Y::

K(x, y) = exp(-gamma ||x-y||^2)

for each pair of rows x in X and y in Y.

Read more in the :ref:`User Guide <rbf_kernel>`.

Parameters ---------- X : array of shape (n_samples_X, n_features)

Y : array of shape (n_samples_Y, n_features)

gamma : float, default None If None, defaults to 1.0 / n_features

Returns ------- kernel_matrix : array of shape (n_samples_X, n_samples_Y)
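The formula can be checked by hand with a small pure-Python sketch. It is illustrative only: `rbf_kernel_sketch` is a hypothetical name, and the default gamma follows the parameter description above.

```python
import math

def rbf_kernel_sketch(X, Y=None, gamma=None):
    if Y is None:
        Y = X
    if gamma is None:
        gamma = 1.0 / len(X[0])  # default: 1.0 / n_features
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y))) for y in Y]
        for x in X
    ]

X = [[0.0, 0.0]]
Y = [[0.0, 0.0], [1.0, 1.0]]
# gamma = 0.5; squared distances are 0 and 2, giving exp(0) and exp(-1).
K = rbf_kernel_sketch(X, Y)
```

Identical rows always give a kernel value of 1, and the value decays toward 0 as rows move apart.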

val row_norms : ?squared:Py.Object.t -> x:Ndarray.t -> unit -> Py.Object.t

Row-wise (squared) Euclidean norm of X.

Equivalent to np.sqrt((X * X).sum(axis=1)), but also supports sparse matrices and does not create an X.shape-sized temporary.

Performs no input validation.

Parameters ---------- X : array_like The input array

squared : bool, optional (default = False) If True, return squared norms.

Returns ------- array_like The row-wise (squared) Euclidean norm of X.
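The effect of the squared flag can be sketched in pure Python. This is an illustration for dense lists of rows only, with the hypothetical name `row_norms_sketch`; the sparse-matrix support mentioned above is not modelled.

```python
import math

def row_norms_sketch(X, squared=False):
    sums = [sum(v * v for v in row) for row in X]
    # Skipping the square root is cheaper when only comparisons are needed.
    return sums if squared else [math.sqrt(s) for s in sums]

X = [[3.0, 4.0], [0.0, 0.0], [1.0, 2.0]]
norms = row_norms_sketch(X)
squared_norms = row_norms_sketch(X, squared=True)
```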

val safe_sparse_dot : ?dense_output:Py.Object.t -> a:[ `Ndarray of Ndarray.t | `SparseMatrix of Csr_matrix.t ] -> b:Py.Object.t -> unit -> Py.Object.t

Dot product that handles the sparse matrix case correctly

Parameters ---------- a : array or sparse matrix

b : array or sparse matrix

dense_output : boolean, (default=False) When False, ``a`` and ``b`` both being sparse will yield sparse output. When True, output will always be a dense array.

Returns ------- dot_product : array or sparse matrix sparse if ``a`` and ``b`` are sparse and ``dense_output=False``.

val sigmoid_kernel : ?y:Ndarray.t -> ?gamma:float -> ?coef0:float -> x:Ndarray.t -> unit -> Py.Object.t

Compute the sigmoid kernel between X and Y::

K(X, Y) = tanh(gamma <X, Y> + coef0)

Read more in the :ref:`User Guide <sigmoid_kernel>`.

Parameters ---------- X : ndarray of shape (n_samples_1, n_features)

Y : ndarray of shape (n_samples_2, n_features)

gamma : float, default None If None, defaults to 1.0 / n_features

coef0 : float, default 1

Returns ------- Gram matrix : array of shape (n_samples_1, n_samples_2)
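For a concrete instance of the formula, here is a pure-Python sketch. It is not the library's implementation; `sigmoid_kernel_sketch` is a hypothetical name, and the defaults follow the parameter list above.

```python
import math

def sigmoid_kernel_sketch(X, Y=None, gamma=None, coef0=1.0):
    if Y is None:
        Y = X
    if gamma is None:
        gamma = 1.0 / len(X[0])  # default: 1.0 / n_features
    return [
        [math.tanh(gamma * sum(a * b for a, b in zip(x, y)) + coef0) for y in Y]
        for x in X
    ]

X = [[1.0, 2.0]]
Y = [[3.0, 4.0], [-2.0, 0.0]]
# gamma = 0.5; inner products are 11 and -2, so the tanh arguments are 6.5 and 0.
K = sigmoid_kernel_sketch(X, Y)
```

Because tanh is bounded, every entry lies strictly between -1 and 1.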
