Add option to use multiple inits in kmeans? #291

@dahong67

As far as I can tell, kmeans currently performs only a single run of K-means from one initialization. However, it is often helpful to run K-means from multiple initializations and keep the best result (the run with the lowest total cost), since any single run can get stuck in a bad local minimum.

I currently handle this by defining a batch version in my research code, something like the following:

# Logging provides with_logger/NullLogger; Random provides Random.seed!
using Clustering, Logging, ProgressLogging, Random
function batchkmeans(X, k, args...; nruns=100, kwargs...)
    runs = @withprogress map(1:nruns) do idx
        # Run K-means
        Random.seed!(idx)  # set seed for reproducibility
        result = with_logger(NullLogger()) do
            kmeans(X, k, args...; kwargs...)
        end

        # Log progress and return result
        @logprogress idx/nruns
        return result
    end

    # Print how many converged
    nconverged = count(run -> run.converged, runs)
    @info "$nconverged/$nruns runs converged"

    # Return runs sorted best to worst
    return sort(runs; by=run->run.totalcost)
end

I think it'd be great to have something like this functionality (not necessarily how I've done it above) built into kmeans!

For reference, scikit-learn provides an n_init argument for their K-means implementation with a default value of 1 when using k-means++ to initialize and 10 when using a random initialization. See here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
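To make the proposal concrete, here is a minimal sketch of what an n_init-style option could look like as a thin wrapper around the existing kmeans. The name kmeans_ninit and the n_init keyword are hypothetical (borrowed from scikit-learn's naming) and are not part of Clustering.jl; the sketch relies only on the totalcost field of the result that the batch version above already uses, and it assumes each call draws a fresh initialization from the global RNG.

```julia
using Clustering

# Hypothetical wrapper, NOT part of Clustering.jl: run kmeans `n_init` times
# and keep the run with the lowest total cost.
function kmeans_ninit(X, k, args...; n_init::Integer=10, kwargs...)
    n_init >= 1 || throw(ArgumentError("n_init must be at least 1"))

    # The first run provides the baseline to beat.
    best = kmeans(X, k, args...; kwargs...)

    # Each further call re-initializes (assumption: no fixed seed or fixed
    # initial centers are passed in kwargs, otherwise all runs are identical).
    for _ in 2:n_init
        result = kmeans(X, k, args...; kwargs...)
        if result.totalcost < best.totalcost
            best = result
        end
    end
    return best
end
```

Unlike the batch version, this keeps only the best run in memory rather than returning all runs sorted, which is closer to how scikit-learn's n_init behaves; returning every run could be a separate option.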

If there's interest, I'd be happy to put together a PR!
