
What are some of the deciding factors to consider when choosing a similarity index? In what cases is Euclidean distance preferred over Pearson correlation, and vice versa?


A similarity measure should be invariant under the admissible transformations of the data, that is, under the changes of scale that the measurement level allows. A measure designed for interval data, such as the correlation coefficient, automatically disregards differences between variables that can be attributed to differences in scale. Recall that all valid interval scales, applied to the same objects, can be translated into each other by a linear transformation. So, to see how similar two interval variables are, you must first remove the differences in scale, either by standardizing the data (this is what the correlation coefficient does) or by finding the constants m and b such that the transformed variable mX+b is as similar as possible to Y and then reporting the resulting similarity (this is what the r-square measure of regression does). Likewise, a measure designed for ordinal data should respond only to differences in rank order, not to the absolute size of the scores. A measure designed for ratio data should control for differences due to a multiplicative factor.
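The two routes described above (standardizing, or fitting m and b) can be sketched roughly as follows. This is a minimal illustration using NumPy; the function names and the sample values are made up for the example and are not part of the original answer.

```python
import numpy as np

def pearson_via_standardizing(x, y):
    """Pearson r computed as the mean product of z-scores."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()   # remove location and scale of x
    zy = (y - y.mean()) / y.std()   # remove location and scale of y
    return float(np.mean(zx * zy))

def r_squared_via_fit(x, y):
    """R^2 of the least-squares line y ~ m*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, b = np.polyfit(x, y, 1)      # best-fitting m and b
    ss_res = np.sum((y - (m * x + b)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative data only (assumed values).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

r = pearson_via_standardizing(x, y)
r2 = r_squared_via_fit(x, y)
# A positive linear change of scale on x leaves r unchanged.
r_rescaled = pearson_via_standardizing(10 * x + 3, y)
```

For a straight-line fit with an intercept, `r2` should equal `r ** 2` up to rounding, which is why both routes report the same similarity.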

Correlation is unit independent: if you multiply one of the objects by ten, the Euclidean distance changes but the correlation distance stays the same. Correlation metrics are therefore a good fit when you want to measure the distance between objects such as genes defined by their expression profiles.

Often the absolute or squared correlation is used to build the distance metric (for example 1 - |r| or 1 - r^2), because we are more interested in the strength of the relationship than in its sign.
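As a rough sketch of sign-insensitive correlation distances, assuming the forms 1 - |r| and 1 - r^2 (the function names are hypothetical):

```python
import numpy as np

def abs_correlation_distance(a, b):
    """1 - |r|: zero for perfectly correlated or anti-correlated vectors."""
    r = np.corrcoef(a, b)[0, 1]
    return float(1.0 - abs(r))

def squared_correlation_distance(a, b):
    """1 - r^2: same idea, but penalizes weak correlations more heavily."""
    r = np.corrcoef(a, b)[0, 1]
    return float(1.0 - r ** 2)

# Illustrative profile and its mirror image (assumed values).
profile = np.array([1.0, 2.0, 4.0, 3.0, 5.0])
mirrored = -profile

d_abs = abs_correlation_distance(profile, mirrored)
d_sq = squared_correlation_distance(profile, mirrored)
d_signed = float(1.0 - np.corrcoef(profile, mirrored)[0, 1])
```

The mirrored profile is treated as identical by both sign-insensitive variants, whereas the signed version 1 - r places it at the maximum distance of 2.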

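However, correlation is only meaningful for high-dimensional data; there is hardly any point in calculating it for two- or three-dimensional data points. With only two coordinates per point, for instance, the Pearson correlation between any two points is always +1 or -1 (or undefined when a point's two coordinates are equal).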
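The low-dimensional problem is easy to see numerically. The sketch below, assuming NumPy and randomly generated points, computes the Pearson correlation between every pair of two-dimensional points:

```python
import numpy as np

# Six random 2-D points (illustrative data, fixed seed for repeatability).
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))

# Correlation between each pair of points, treating the two
# coordinates of a point as its "profile".
correlations = [
    float(np.corrcoef(points[i], points[j])[0, 1])
    for i in range(len(points))
    for j in range(i + 1, len(points))
]
```

Every value in `correlations` comes out as +1 or -1 up to rounding, so the measure carries no information beyond whether the two coordinates are ordered the same way.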

Also note that the "Pearson distance" is a weighted form of Euclidean distance; it is not the "correlation distance" based on the Pearson correlation coefficient.

Euclidean distance is usually the "default" distance in, e.g., K-nearest neighbors (classification) or K-means (clustering), where it is used to find the "k closest points" to a particular sample point. Another prominent example is hierarchical (agglomerative) clustering with complete or single linkage, where you need the distance between clusters.
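Both uses of Euclidean distance mentioned above can be sketched in a few lines. This assumes NumPy, hypothetical function names, and made-up points; single linkage is taken as the smallest pairwise distance between two clusters and complete linkage as the largest.

```python
import numpy as np

def k_nearest(points, query, k):
    """Indices and distances of the k points closest to `query`."""
    points = np.asarray(points, dtype=float)
    query = np.asarray(query, dtype=float)
    distances = np.linalg.norm(points - query, axis=1)
    order = np.argsort(distances)[:k]
    return order, distances[order]

def linkage_distances(cluster_a, cluster_b):
    """(single, complete) linkage distances between two clusters."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    # Pairwise distances: one row per point in a, one column per point in b.
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    return float(pairwise.min()), float(pairwise.max())

# Illustrative data only (assumed values).
points = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [2.0, 0.0], [9.0, 9.0]])
idx, dist = k_nearest(points, [0.5, 0.0], k=3)

single, complete = linkage_distances([[0.0, 0.0], [1.0, 0.0]],
                                     [[3.0, 0.0], [5.0, 0.0]])
```

In practice a library such as scikit-learn or SciPy would normally handle this; the sketch only shows where the Euclidean distance enters.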