matlab - Irregular plot of k-means clustering, outlier removal


Hi, I'm working on trying to cluster network data from the 1999 DARPA data set. Unfortunately I'm not getting clustered data, not compared to the literature, even though I am using the same techniques and methods.

My data comes out like this:

[MATLAB figure 1: scatter plot of the data, showing no clear clusters]

As you can see, it is not clustered. This is due to a lot of outliers (noise) in the dataset. I have looked at outlier removal techniques, but nothing I have tried so far cleans the data. One of the methods I have tried:

    %% When an outlier is considered to be more than 3 standard deviations
    %% away from the mean, determine the number of outliers in each column
    %% of the count matrix:
    mu = mean(data)
    sigma = std(data)
    [n,p] = size(data);
    % Create a matrix of mean values by replicating the mu vector for n rows
    meanmat = repmat(mu,n,1);
    % Create a matrix of standard deviation values by replicating the sigma vector for n rows
    sigmamat = repmat(sigma,n,1);
    % Create a matrix of zeros and ones, where the ones indicate the location of outliers
    outliers = abs(data - meanmat) > 3*sigmamat;
    % Calculate the number of outliers in each column
    nout = sum(outliers)
    % Remove any rows of data containing outliers
    data(any(outliers,2),:) = [];

In the first run, it removed 48 rows from the 1000 normalized random rows selected from the full dataset.

This is the full script I used on the data:

    %% load data
    %# read the list of features
    fid = fopen('kddcup.names','rt');
    c = textscan(fid, '%s %s', 'delimiter',':', 'headerlines',1);
    fclose(fid);

    %# determine the type of features
    c{2} = regexprep(c{2}, '.$','');              %# remove "." at the end
    attribnom = [ismember(c{2},'symbolic');true]; %# nominal features

    %# build the format string used to read/parse the actual data
    frmt = cell(1,numel(c{1}));
    frmt( ismember(c{2},'continuous') ) = {'%f'}; %# numeric features: read as number
    frmt( ismember(c{2},'symbolic') ) = {'%s'};   %# nominal features: read as string
    frmt = [frmt{:}];
    frmt = [frmt '%s'];                           %# add the class attribute

    %# read the dataset
    fid = fopen('kddcup.data_10_percent_corrected','rt');
    c = textscan(fid, frmt, 'delimiter',',');
    fclose(fid);

    %# convert nominal attributes to numeric
    ind = find(attribnom);
    g = cell(numel(ind),1);
    for i=1:numel(ind)
        [c{ind(i)},g{i}] = grp2idx( c{ind(i)} );
    end

    %# all-numeric dataset
    fulldata = cell2mat(c);

    %% dimensionality reduction
    columns = 6;
    [u,s,v] = svds(fulldata,columns);

    %% randomly select from the dataset
    rows = 1000;
    columns = 6;

    %# pick random rows
    indx = randperm( size(fulldata,1) );
    indx = indx(1:rows)';

    %# pick random columns
    indy = randperm( size(u,2) );   %# define the column indices before use
    indy = indy(1:columns);

    %# filter the data
    data = u(indx,indy);

    % apply the normalization method to every cell
    maxdata = max(max(data));
    mindata = min(min(data));
    data = ((data-mindata)./(maxdata));

    % output the matching data
    datasample = fulldata(indx, :)

    %% When an outlier is considered to be more than 2.5 standard deviations
    %% away from the mean, use the following syntax to determine the number
    %% of outliers in each column of the count matrix:
    mu = mean(data)
    sigma = std(data)
    [n,p] = size(data);
    % create a matrix of mean values by replicating the mu vector for n rows
    meanmat = repmat(mu,n,1);
    % create a matrix of standard deviation values by replicating the sigma vector for n rows
    sigmamat = repmat(sigma,n,1);
    % create a matrix of zeros and ones, where the ones indicate the location of outliers
    outliers = abs(data - meanmat) > 2.5*sigmamat;
    % calculate the number of outliers in each column
    nout = sum(outliers)
    % remove any rows of data containing outliers
    data(any(outliers,2),:) = [];

    %% generate sample data
    k = 6;
    numobservations = size(data, 1);
    dimensions = 3;

    %% cluster
    opts = statset('maxiter', 100, 'display', 'iter');
    [clustidx, clusters, interclustsum, dist] = kmeans(data, k, 'options',opts, ...
        'distance','sqeuclidean', 'emptyaction','singleton', 'replicates',3);

    %% plot data+clusters
    figure, hold on
    scatter3(data(:,1),data(:,2),data(:,3), 5, clustidx, 'filled')
    scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 100, (1:k)', 'filled')
    hold off, xlabel('x'), ylabel('y'), zlabel('z')
    grid on
    view([90 0]);

    %% plot cluster quality
    figure
    [silh,h] = silhouette(data, clustidx);
    avrgscore = mean(silh);

This gives 2 distinct clusters as output:

[figure: the two distinct clusters after outlier removal]

As you can see, the data looks cleaner and more clustered than the original. However, I still think a better method can be used.

For instance, observing the overall clustering, I still have a lot of noise (outliers) in the dataset. This can be seen here:

[figure: the overall clustering, still containing a lot of noise]

I need the outlier rows put into a separate dataset for later classification (they should only be removed for the clustering).
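Something like this is what I mean; a minimal sketch reusing the `data` and `outliers` variables from the snippet above (`outlierrows` and `outlierdata` are just names I've made up):

    % keep the outlier rows in a separate matrix before deleting them,
    % so they can be used later for classification
    outlierrows = any(outliers,2);       % rows containing at least one outlier
    outlierdata = data(outlierrows,:);   % set aside for the classification step
    data(outlierrows,:) = [];            % remove them from the clustering set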

Here is a link to the DARPA dataset. Please note that the 10% data set has had a significant reduction in columns: the majority of columns, which had only 0's or 1's running throughout, have been removed (from 42 columns down to 6 columns):

http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

Edit

The columns kept in the dataset are:

    src_bytes: continuous.
    dst_bytes: continuous.
    count: continuous.
    srv_count: continuous.
    dst_host_count: continuous.
    dst_host_srv_count: continuous.

Re-edit:

Based on discussions with Anony-Mousse and his answer, there may be a way of reducing the noise in the clustering by using k-medoids: http://en.wikipedia.org/wiki/k-medoids. I'm hoping there isn't much of a change to the code I already have, but as of yet I do not know how to implement it to test whether this will reduce the noise. So, providing someone can show me a working example, this will be accepted as the answer.
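To be explicit about what I'm after, this is roughly the shape I imagine, as an untested sketch only. It assumes the kmedoids function from the Statistics Toolbox (available in R2014b and later), as a drop-in replacement for the kmeans call in my script:

    % untested sketch: swap the kmeans call for k-medoids, which uses actual
    % data points as cluster centres and is therefore less sensitive to outliers;
    % requires kmedoids from the Statistics Toolbox (R2014b or later)
    k = 6;
    opts = statset('maxiter',100, 'display','iter');
    [clustidx, medoids] = kmedoids(data, k, 'options',opts, ...
        'distance','sqeuclidean', 'replicates',3);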

First things first: you're asking a lot here. For future reference: try to break your problem up into smaller chunks, and post them as several questions. This increases your chances of getting answers (and doesn't cost you 400 reputation!).

Luckily for you, I understand your predicament, and love this sort of question!

Apart from this dataset's possible issues with k-means, the question is still generic enough to apply to other datasets too (and to any googlers ending up here looking for a similar thing), so let's go ahead and get this solved.

My suggestion is that we edit this answer until you get reasonably satisfactory results.

Number of clusters

Step 1 of any clustering problem: how many clusters should you choose? There are a few methods I know of with which you can select the proper number of clusters. There is a nice wiki page about this, containing all of the methods below (and a few more).

Visual inspection

It might seem silly, but if you have well-separated data, a simple plot can tell you (approximately) how many clusters you'll need, just by looking.

Pros:

  • Quick
  • Simple
  • Works on well-separated clusters in relatively small datasets

Cons:

  • Quick and dirty
  • Requires user interaction
  • It's easy to miss smaller clusters
  • Data with less-well separated clusters, or with many of them, makes this method hard to use
  • It is rather subjective -- the next person might select a different amount than you did

Silhouettes plot

As indicated in one of your other questions, making a silhouettes plot will help you make a better decision about the proper number of clusters in your data (a minimal sketch is given after the pros and cons below).

Pros:

  • Relatively simple
  • Reduces subjectivity by using statistical measures
  • An intuitive way to represent the quality of your choice

Cons:

  • Requires user interaction
  • In the limit, if you take as many clusters as there are datapoints, the silhouettes plot will tell you that that is the best choice
  • It is still rather subjective, not based on statistical means
  • It can be computationally expensive
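Here is that minimal sketch of the silhouette approach, using kmeans and silhouette from the Statistics Toolbox (the range 2:10 is an arbitrary choice on my part, not a recommendation):

    % compute the mean silhouette value for a range of cluster counts
    krange = 2:10;
    avgsilh = zeros(size(krange));
    for ii = 1:numel(krange)
        cidx = kmeans(data, krange(ii), 'replicates',3, 'emptyaction','singleton');
        silh = silhouette(data, cidx);   % one value in [-1,1] per observation
        avgsilh(ii) = mean(silh);
    end
    % higher mean silhouette values indicate better-separated clusters
    [~, best] = max(avgsilh);
    bestk = krange(best)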

Elbow method

As with the silhouettes plot approach, you run kmeans repeatedly, each time with a larger amount of clusters, and you see how much of the total variance in the data is explained by the clusters chosen by this kmeans run. There will be a number of clusters where the amount of explained variance increases a lot less than in any of the previous choices of the number of clusters (the "elbow"). The elbow is then, statistically speaking, the best choice for the number of clusters (a sketch of this follows the pros and cons below).

Pros:

  • No user interaction required -- the elbow can be selected automatically
  • Statistically more sound than any of the aforementioned methods

Cons:

  • Somewhat complicated
  • Still subjective, since the definition of the "elbow" depends on subjectively chosen parameters
  • It can be computationally expensive
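And a minimal sketch of the elbow method. It uses the within-cluster sums of squares that kmeans returns as its third output, and leaves the actual elbow selection to the eye (automating that selection is exactly the "subjectively chosen parameters" part mentioned above):

    % fraction of total variance explained, as a function of the number of clusters
    krange = 1:10;
    explained = zeros(size(krange));
    totalss = sum( sum( bsxfun(@minus, data, mean(data)).^2 ) );  % total sum of squares
    for ii = 1:numel(krange)
        [~, ~, sumd] = kmeans(data, krange(ii), 'replicates',3, 'emptyaction','singleton');
        explained(ii) = 1 - sum(sumd)/totalss;  % sumd: within-cluster sums of squares
    end
    % the "elbow" is where this curve levels off
    figure, plot(krange, explained, 'o-')
    xlabel('number of clusters'), ylabel('fraction of variance explained')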

Outliers

Once you have chosen the number of clusters with any of the methods above, it is time to do outlier detection to see if the quality of your clusters improves.

I would start with a two-step iterative approach, using the elbow method. In pseudo-MATLAB:

    data = initial dataset
    datamod = initial dataset
    max = number of clusters chosen by visual inspection

    while (forever)

        for n = max-5 : max+5
            if (n < 1), continue, end
            perform k-means with n clusters on datamod
            if (the variance explained shows a jump)
                break
            end
        end

        if (you are satisfied)
            break
        end

        for i = 1:n
            extract all points from cluster i
            find the centroid (let k-means do that for you)
            calculate the standard deviation of distances to the centroid
            mark points further away than 3 sigma as possible outliers
        end

        datamod = data with marked points removed

    end

The tough part is obviously determining whether you are satisfied. This is the key to the algorithm's effectiveness. The rough structure of this part

    if (you are satisfied)
        break
    end

would be something like this:

    if (the situation has improved)
        data = datamod
    elseif (the situation is the same or worse)
        datamod = data
        break
    end

The situation has improved when there are fewer outliers, or the variance explained by all choices of n is better than during the previous loop in the while. This is also something to fiddle with.
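To make the marking step concrete, here is how one pass of the inner for loop could look in real MATLAB. The function and variable names are mine, and I'm reading "further away than 3 sigma" as "more than 3 standard deviations above the mean distance to the centroid"; adjust to taste:

    function [datamod, outliers] = markclusteroutliers(data, n)
        % one pass: cluster the data, then flag points that lie suspiciously
        % far from their own cluster's centroid
        [cidx, centroids] = kmeans(data, n, 'replicates',3, 'emptyaction','singleton');
        outliers = false(size(data,1),1);
        for i = 1:n
            members = find(cidx == i);
            % distances of this cluster's points to its centroid
            d = sqrt(sum( bsxfun(@minus, data(members,:), centroids(i,:)).^2, 2 ));
            cutoff = mean(d) + 3*std(d);             % the "3 sigma" threshold
            outliers(members(d > cutoff)) = true;    % mark possible outliers
        end
        datamod = data(~outliers,:);                 % dataset with marked points removed
    end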

Anyway, I wouldn't call this anything more than a first attempt. If anyone sees incompletenesses, flaws or loopholes here, please comment or edit.

