Medoids are representative objects of a data set, or of a cluster within a data set, whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but they are always restricted to be members of the data set. They are most commonly used when a mean or centroid cannot be defined, as with graphs. They are also used where the centroid is not representative of the data set, as with images, 3-D trajectories, and gene expression (where the data are sparse but the medoid need not be). Medoids are also of interest when a representative is sought under a distance other than squared Euclidean distance (for instance, in movie ratings). For some data sets there may be more than one medoid, as with medians.

A common application of the medoid is the k-medoids clustering algorithm, which is similar to the k-means algorithm but works when a mean or centroid is not definable. The algorithm works roughly as follows. First, a set of medoids is chosen at random. Second, the distances to the other points are computed. Third, the data are clustered according to the medoid they are most similar to. Fourth, the medoid set is optimized through an iterative process.

Note that a medoid is not equivalent to a median, a geometric median, or a centroid. A median is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a norm (such as the Manhattan distance or Euclidean distance). A geometric median is defined in any dimension, but unlike a medoid, it is not necessarily a point from within the original data set.

== Definition ==

Let $\mathcal{X} := \{x_1, x_2, \dots, x_n\}$ be a set of $n$ points in a space with a distance function $d$. The medoid is defined as

$$x_{\text{medoid}} = \arg\min_{y \in \mathcal{X}} \sum_{i=1}^{n} d(y, x_i).$$

== Clustering with medoids ==

Medoids are a popular replacement for the cluster mean when the distance function is not (squared) Euclidean distance, or not even a metric (the medoid does not require the triangle inequality). When partitioning the data set into clusters, the medoid of each cluster can be used as its representative.

Clustering algorithms based on the idea of medoids include:

* Partitioning Around Medoids (PAM), the standard k-medoids algorithm
* Hierarchical Clustering Around Medoids (HACAM), which uses medoids in hierarchical clustering

== Algorithms to compute the medoid of a set ==

From the definition above, the medoid of a set $\mathcal{X}$ can be computed after computing all pairwise distances between points in the ensemble. This takes $O(n^2)$ distance evaluations (with $n = |\mathcal{X}|$). In the worst case, the medoid cannot be computed with fewer distance evaluations. However, many approaches compute medoids either exactly or approximately in sub-quadratic time under different statistical models.

If the points lie on the real line, computing the medoid reduces to computing the median, which can be done in $O(n)$ expected time with Hoare's quickselect algorithm. In higher-dimensional real spaces, however, no linear-time algorithm is known.

RAND is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of $O\left(\frac{n \log n}{\epsilon^2}\right)$ distance computations to approximate the medoid within a factor of $(1 + \epsilon\Delta)$ with high probability, where $\Delta$ is the maximum distance between two points in the ensemble.
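
The two ideas described so far, the exhaustive $O(n^2)$ computation and estimation from a random subset of reference points, can be illustrated with a minimal Python sketch. The function names and the `num_refs` parameter are hypothetical, chosen for this illustration. The sampling routine only follows the spirit of RAND: it does not reproduce the published sample-size rule or its approximation guarantee.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def exact_medoid(points: Sequence[T], dist: Callable[[T, T], float]) -> int:
    """Return the index of a medoid using all O(n^2) pairwise distances."""
    totals = [sum(dist(y, x) for x in points) for y in points]
    # min() keeps the first minimiser, so ties resolve to the lowest index.
    return min(range(len(points)), key=totals.__getitem__)


def sampled_medoid(
    points: Sequence[T],
    dist: Callable[[T, T], float],
    num_refs: int,
    rng: random.Random,
) -> int:
    """Estimate a medoid from average distances to a random reference subset.

    Illustrative only: num_refs is a free parameter here (an assumption),
    not the sample size prescribed by the RAND analysis.
    """
    if not 1 <= num_refs <= len(points):
        raise ValueError("num_refs must be between 1 and len(points)")
    # One shared subset of reference indices, drawn without replacement.
    refs = rng.sample(range(len(points)), num_refs)
    estimates = [
        sum(dist(y, points[j]) for j in refs) / num_refs for y in points
    ]
    return min(range(len(points)), key=estimates.__getitem__)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
    print(exact_medoid(pts, math.dist))  # index 4, the point (0.5, 0.5)
    print(sampled_medoid(pts, math.dist, 3, random.Random(0)))
```

With `num_refs` equal to the number of points, the estimate uses every distance and coincides with the exact computation; smaller values trade accuracy for fewer distance evaluations.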
Note that RAND is an approximation algorithm, and moreover $\Delta$ may not be known a priori.

RAND was leveraged by TOPRANK, which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points exactly, and picks the minimum of those. TOPRANK needs $O(n^{5/3} \log^{4/3} n)$ distance computations to find the exact medoid with high probability under a distributional assumption on the average distances.

trimed finds the medoid with $O(n^{3/2} 2^{\Theta(d)})$ distance evaluations under a distributional assumption on the points, where $d$ in the exponent denotes the dimension of the space rather than the distance function. The algorithm uses the triangle inequality to cut down the search space.

Meddit leverages a connection between medoid computation and multi-armed bandits, using an upper-confidence-bound type of algorithm that takes $O(n \log n)$ distance evaluations under statistical assumptions on the points.

Correlated Sequential Halving also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm provably yields a drastic improvement (usually around 1-2 orders of magnitude) in both the number of distance computations needed and wall-clock time.

== Implementations ==

An implementation of RAND, TOPRANK, and trimed can be found here. An implementation of Meddit can be found here and here. An implementation of Correlated Sequential Halving can be found here.

== Medoids in text and natural language processing (NLP) ==

Medoids can be applied to various text and NLP tasks to improve the efficiency and accuracy of analyses. By clustering text data based on similarity, medoids can help identify representative examples within the data set, leading to better understanding and interpretation of the data.

=== Text clustering ===

Text clustering is the process of grouping similar text or documents together based on their content. Medoid-based clustering algorithms can be employed to partition large amounts of text into clusters, with each cluster represented by a medoid document. This technique helps in organizing, summarizing, and retrieving information from large collections of documents, such as in search engines, social media analytics, and recommendation systems.

=== Text summarization ===

Text summarization aims to produce a concise and coherent summary of a larger text by extracting the most important and relevant information. Medoid-based clustering can be used to identify the most representative sentences in a document or a group of documents, which can then be combined to create a summary. This approach is especially useful for extractive summarization tasks, where the goal is to generate a summary by selecting the most relevant sentences from the original text.

=== Sentiment analysis ===

Sentiment analysis involves determining the sentiment or emotion expressed in a piece of text, such as positive, negative, or neutral. Medoid-based clustering can be applied to group text data based on similar sentiment patterns. By analyzing the medoid of each cluster, researchers can gain insights into the predominant sentiment of the cluster, helping in tasks such as opinion mining, customer feedback analysis, and social media monitoring.

=== Topic modeling ===

Topic modeling is a technique used to discover abstract topics that occur in a collection of documents. Medoid-based clustering can be applied to group documents with similar themes or topics. By analyzing the medoids of these clusters, researchers can gain an understanding of the underlying topics in the text corpus, facilitating tasks such as document categorization, trend analysis, and content recommendation.

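
The idea shared by these applications, representing a group of texts by its medoid document, can be sketched in a few lines of Python. The sketch below is illustrative only: the whitespace tokenizer and the Jaccard distance on token sets are assumptions chosen to keep the example dependency-free, and the function names are hypothetical.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lower-case whitespace tokenization (a deliberately crude assumption)."""
    return set(text.lower().split())


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def medoid_document(docs: List[str]) -> str:
    """Return the document with the smallest total distance to all others."""
    bags = [tokens(d) for d in docs]
    totals = [sum(jaccard_distance(a, b) for b in bags) for a in bags]
    # Ties resolve to the earliest document in the list.
    return docs[min(range(len(docs)), key=totals.__getitem__)]


if __name__ == "__main__":
    cluster = [
        "the cat sat on the mat",
        "the cat sat",
        "a cat sat on a mat",
        "stock prices fell sharply",
    ]
    print(medoid_document(cluster))  # the cat sat on the mat
```

Because the medoid is always one of the input documents, the result can be shown to a reader directly, which is what makes it usable as a cluster label or an extracted summary sentence.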
=== Techniques for measuring text similarity in medoid-based clustering ===

When applying medoid-based clustering to text data, it is essential to choose an appropriate similarity measure to compare documents effectively. Each technique has its advantages and limitations, and the choice of the similarity measure should be based on the specific requirements and characteristics of the text data being analyzed. The following are common techniques for measuring text similarity in medoid-based clustering:

==== Cosine similarity ====

Cosine similarity is a widely used measure to compare the similarity between two pieces of text. It calculates the cosine of the angle between two document vectors in a high-dimensional space. Cosine similarity ranges between -1 and 1, where a value closer to 1 indicates higher similarity, and a value closer to -1 indicates lower similarity. By visualizing two lines originating from the origin and extending to the respective points of interest, and then measuring the angle between these lines, one can determine the similarity between the associated points. Cosine similarity is less affected by document length, so it may be better at producing medoids that are representative of the content of a cluster instead of the lengt