Let's talk about Clustering


Published: Apr 25, 2019
Link: https://blog.imalan.cn/archives/320/
Title: Let's talk about Clustering
Word count: 1376 characters

Summary: A brief introduction to traditional clustering algorithms, and a new method: clustering by fast search and find of density peaks.

This post is about clustering. I will give a short introduction to traditional methods, then talk about a magical algorithm published by Alex Rodriguez and Alessandro Laio in Science in 2014. I call it magical because it is simple yet powerful, and super intuitive: the principle behind it almost explains itself.

Clustering, as a kind of unsupervised learning method, aims to divide a dataset into several pieces (i.e. "classes"), making the intra-class difference[1] as small as possible while keeping the inter-class difference[2] as large as possible.

For example, the picture below shows 5000 points scattered on a 2-D plane. We humans can easily assign them to 15 groups, because there is an obvious pattern in how these points are distributed. The question is, how can computers finish this job?

5000 points scattered on a 2-D plane

Traditional methods

In the past few decades many algorithms have been developed just to finish this "simple" task, and some of them are quite delicate and beautiful. However, depending on the rationale behind them, they all have shortcomings, like any other algorithm does: they are either slow, too complicated to implement, or simply cannot finish the job.

The C-Means algorithm might be the most famous one of its kind. We randomly choose some points as cluster centers at first, then assign each point to a cluster according to its "distance" to the cluster centers, and then calculate new centers to replace the old ones. The procedure repeats these steps until the cluster centers no longer move. This method is inspirational, and many similar methods have been developed, such as Fuzzy C-Means, which uses a fuzzy membership matrix to weight the target function. Algorithms of this kind are easy to implement and simple to understand, but as mentioned above, the initial cluster centers are very important, and the results vary under different initial states. That makes these algorithms not robust enough, and thus not reliable in some cases.
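For concreteness, here is a minimal MATLAB sketch of that loop; it is not the post's code. The function name cMeansSketch, the iteration cap, the convergence tolerance and the empty-cluster handling are my own illustrative choices, and pdist2 comes from the Statistics and Machine Learning Toolbox (which also ships a ready-made kmeans function).

% Minimal C-Means / K-Means sketch: random centers, assign, recompute, repeat.
function [centers, labels] = cMeansSketch(X, k)
    % X: N * D data matrix, k: number of clusters
    N = size(X, 1);
    centers = X(randperm(N, k), :);        % pick k random points as initial centers
    for iter = 1:100                       % iterate until centers stop moving
        % assign every point to its nearest center
        [~, labels] = min(pdist2(X, centers), [], 2);
        newCenters = centers;
        for c = 1:k
            members = X(labels == c, :);
            if ~isempty(members)           % keep the old center if a cluster goes empty
                newCenters(c, :) = mean(members, 1);
            end
        end
        if max(abs(newCenters(:) - centers(:))) < 1e-6
            centers = newCenters;
            break;                         % centers have effectively stopped moving
        end
        centers = newCenters;
    end
end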

Another kind of algorithm is built upon statistics, such as the Gaussian Mixture Model (GMM) algorithm. It assumes that the data points are samples drawn from a mixture of Gaussian models, so the clustering task is equivalent to estimating those Gaussian models in a way that makes the posterior probability as large as possible. The theoretical foundation of these algorithms is clear, and the mathematics makes them look reliable. However, the initial state still affects the result, and most importantly, the basic assumption that the data points are samples from a mixture of Gaussians isn't always true. Another problem is that this algorithm can be very slow on large datasets.
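As a hedged sketch of how this looks in practice: with the Statistics and Machine Learning Toolbox, a Gaussian mixture can be fitted and used for clustering in a couple of lines. Here X stands for the N x 2 data matrix, and the choices of k = 15, the regularization value and the number of restarts are illustrative assumptions, not part of the original post.

% GMM clustering sketch (requires the Statistics and Machine Learning Toolbox).
k = 15;                                               % assumed number of clusters
gm = fitgmdist(X, k, 'RegularizationValue', 0.01, 'Replicates', 5);
labels = cluster(gm, X);                              % hard assignment by posterior probability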

Some other algorithms exist, like Hierarchical Clustering, which works fine even with discrete feature values. It regards each point as a single cluster at first, then merges two clusters at a time, repeating this step until the total number of clusters drops to a certain level. It sounds simple, but again, it is slow on large datasets and not robust enough.
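A minimal sketch of that bottom-up merging with the same toolbox, again with X as the data matrix; the Ward linkage and the target of 15 clusters are illustrative choices, not the post's.

% Agglomerative hierarchical clustering sketch.
Z = linkage(X, 'ward');                  % repeatedly merge the two closest clusters
labels = cluster(Z, 'maxclust', 15);     % cut the tree once 15 clusters remain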

A new method

The algorithms mentioned above are highlights among clustering methods. Most of them are iterative and can be time consuming on large datasets. People have settled on an accepted principle: a good clustering method is an algorithm that fully summarizes the common features of the points in each cluster while keeping different clusters far away from each other.

A new method was introduced by Alex Rodriguez and Alessandro Laio in 2014; their paper was published in Science. This already indicates that their algorithm must have something special, since papers in this field are rarely seen in journals like Science. It is worth mentioning that deep learning research was already extremely popular in 2014, which makes this algorithm even more admirable: it has absolutely nothing to do with deep learning.

This algorithm is based on two principles, only two:

  1. Cluster centers are points with high "Local Density (ρ)"
  2. Cluster centers are points with high "Relative Distance (δ)"

Local Density

How do we understand these two principles? Well, the first is relatively easy: cluster centers tend to have more points around them than other, regular non-center points. Imagine the leader of a team, or the head of a gang: they are always surrounded by many other "little guys".

For each point, we simply count the number of points around it, i.e. those closer than a predefined cutoff distance d_c:

$$\rho_i = \sum_{j} \chi(d_{ij} - d_c)$$

And we have:

$$\chi(x) = \begin{cases} 1, & x < 0 \\ 0, & x \geq 0 \end{cases}$$

Geometrically, this step is equivalent to drawing a circle (or a sphere in higher dimensions) with radius d_c centered at point P_i, then counting the number of points inside the circle.
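In MATLAB this definition of ρ is only a couple of lines. The snippet below is just a sketch mirroring the formula above (the full implementation appears later in the post); X is the N x D data matrix, dc the cutoff distance, and pdist/squareform come from the Statistics and Machine Learning Toolbox.

% Local density sketch: count, for every point, how many other points
% fall strictly inside its dc-circle.
D = squareform(pdist(X));           % N x N matrix of pairwise Euclidean distances
rho = sum(D > 0 & D < dc, 2);       % D > 0 excludes the point itself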

Relative Distance

This one is a little harder to understand. "Relative distance" actually means the minimum distance to points with higher local density. That is, for each point P_i with local density ρ_i, we calculate the distances between P_i and every other point P_j whose local density ρ_j > ρ_i, then use the minimal such distance as the "Relative Distance" of P_i. But WHY?

We can easily understand why we choose points with high Local Density (ρ) as cluster center candidates: the points around them are compact. But don't forget, another criterion is also important for cluster centers: the distance between clusters should be as large as possible. It is reasonable; think about it: if cluster centers are too close to each other, there must be a huge overlap between the clusters, which usually isn't the ideal result.

Again, imagine a man who wants to be the leader of a team. First he gathers a lot of "little guys", then he must keep his distance from the "bigger bosses", otherwise he might lose the competition. The Relative Distance (δ) measures the distance between P_i and the other cluster center candidates, and as explained, this value should be as large as possible for a center. Therefore, δ_i for P_i can be calculated as:

$$\delta_i = \min_{j:\,\rho_j > \rho_i} d_{ij}$$

If P_i already has the biggest ρ, just let:

$$\delta_i = \max_{j \neq i} d_{ij}$$

Here d_ij means the Euclidean distance between P_i and P_j.
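As a sketch, δ can be read directly off the distance matrix D and the densities rho from the snippet above; the full implementation below does the same thing, only recomputing the distances explicitly. This is not the post's code, just an illustration of the two formulas.

% Relative distance sketch: for each point, the distance to the nearest
% point of higher density; the densest point gets the maximum distance.
N = numel(rho);
delta = zeros(N, 1);
[~, order] = sort(rho, 'descend');
delta(order(1)) = max(D(order(1), :));
for m = 2:N
    i = order(m);
    higher = order(1:m-1);              % indices of points with higher density
    delta(i) = min(D(i, higher));
end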

Implementation

I implemented this algorithm in Matlab. Guess how many lines of code it needs?

function [centers_o, Rho_o, Delta_o] = fsfdp(dataset_i, NumCategory_i, dc_i)
    %%%%%
    % Clustering by Fast search and find of density peaks
    % AlanDecode
    %
    % dataset_i: N * D
    % NumCategory_i: number of categories
    % dc_i: cut distance
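    %
    % Note: pdist and squareform below come from the Statistics and
    % Machine Learning Toolbox.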

    %% Dataset
    dataset = dataset_i;
    datasetSize = size(dataset);
    NumCategory = NumCategory_i;

    %% Algorithm

    % Calc distance between points
    distMat = pdist(dataset);
    distMat = squareform(distMat);

    % Calc Rho (local density)
    Rho = zeros(datasetSize(1), 1);
    for idx = 1:datasetSize(1)
        dists = distMat(idx, :);
        Rho(idx) = size(dists(dists > 0 & dists < dc_i), 2);
    end

    % Sort
    [~, rhoIndex] = sort(Rho, 'descend');

    % Calc Delta (minimum distance between one point
    % and any other point with higher density)
    Delta = zeros(datasetSize(1), 1);
    for idx = 1:datasetSize(1)
        if idx == 1
            % point with highest rho
            index = rhoIndex(idx);

            % use highest distance as Delta
            dists = distMat(index, :);
            Delta(index) = max(dists);
        else
            % current point
            index = rhoIndex(idx);

            % points with higher rho
            rhoIndexSlice = rhoIndex(1:idx - 1);
            p1 = dataset(rhoIndexSlice, :);

            % calc distance
            p2 = repmat(dataset(index, :), idx-1, 1);
            dists = sqrt(sum((p1 - p2).^2, 2));

            % minimal distance
            Delta(index) = min(dists);
        end
    end

    % Find centers, composite Rho and Delta
    Temp = Rho.^2 + Delta.^2;
    [~, tempIndex] = sort(Temp, 'descend');
    % highest [NumCategory] values
    tempIndexSlice = tempIndex(1:NumCategory);
    centers = dataset(tempIndexSlice, :);

    % Return
    Rho_o = Rho;
    Delta_o = Delta;
    centers_o = centers;
end

You can see the core part contains fewer than 50 lines of code; it is so concise that I even doubted myself. But the result reassured me:

Result of FSFDP algorithm

Cluster centers are drawn as blue circles in the right-hand figure, and we can clearly see 15 points with high ρ and δ in the left-hand figure. This algorithm is not iterative; the most time-consuming step is the sorting. That is inevitable, though, because the sorted order is necessary for calculating δ.


The dataset used in this post can be downloaded here; just load it in Matlab. It contains the coordinates of 5000 2-D points.
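For completeness, a usage sketch (not from the original post): assuming the downloaded file is named dataset.mat and loads a 5000 x 2 matrix called points (the real file and variable names may differ), and picking an arbitrary cut distance, the fsfdp function above can be called and visualized like this.

% Usage sketch for fsfdp. The file name, variable name and dc value are assumptions.
load('dataset.mat');                            % assumed to provide a 5000 x 2 matrix 'points'
[centers, Rho, Delta] = fsfdp(points, 15, 1.0); % 15 clusters, dc = 1.0 (tune to the data scale)

figure;
subplot(1, 2, 1);                               % decision graph: centers have high rho and delta
scatter(Rho, Delta, 10, 'filled');
xlabel('\rho'); ylabel('\delta');
subplot(1, 2, 2);                               % data with the found centers circled in blue
scatter(points(:, 1), points(:, 2), 5, [0.7 0.7 0.7], 'filled');
hold on;
scatter(centers(:, 1), centers(:, 2), 60, 'b');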


  [1] the differences among data points in the same class
  [2] the differences among data points in different classes

Author: 熊猫小 A

