guanlin-he/clustering-release

Overview

Welcome! This repository provides the code related to Guanlin HE's doctoral dissertation entitled "Parallel algorithms for clustering large datasets on CPU-GPU heterogeneous architectures" (defended on October 19th, 2022 at Université Paris-Saclay). The code includes several optimized parallel implementations of clustering algorithms:

  1. Parallel k-means(++) clustering on CPU
  2. Parallel k-means(++) clustering on GPU
  3. Parallel spectral clustering on CPU
  4. Parallel spectral clustering on GPU
  5. Parallel representative-based spectral clustering on CPU / on GPU / on CPU+GPU

Although the code contains comments, the dissertation explains our parallelization strategies, implementation details and optimization techniques in more depth.

Structure

  • The main.cc file in the home directory contains the top-level function main().
  • All the other .cc and .cu files are in the modules folder, and the .h files are in the include folder. Within each folder, files are grouped by subject into second-level directories.
  • The characteristics and locations of the benchmark datasets are defined in the include/datasets.h file.
  • The default settings of various program parameters are defined in the include/config.h file.
  • The compilation settings are specified in Makefile_definitions in the home directory and in the Makefile present at each directory level.
  • The .txt files (e.g. cluster labels) generated by the clustering program are stored in the output folder by default.
  • The python folder contains some .py files used to generate synthetic data, plot data and clustering, evaluate clustering quality, etc.

Compilation

The code was developed under Linux and is written mainly in C, with OpenMP and CUDA. It is intended to be compiled with gcc and nvcc. Detailed compilation settings can be found in Makefile_definitions and Makefile.

To compile, run make in the home directory. Running make clean before each compilation is recommended. Compilation may take about 1-2 minutes.

Execution

Compilation produces an executable file named Clustering. You can then run Clustering with various arguments (run ./Clustering -h to list them).

Examples:

  • parallel k-means clustering on CPU using 40 threads:
./Clustering -algo 1 -cpu-nt 40
  • parallel k-means++ clustering on GPU:
./Clustering -algo 2 -seeding-km-gpu 2
  • parallel spectral clustering on GPU (using Gaussian similarity with $\sigma=0.02$ and threshold 0.1 for similarity):
./Clustering -algo 4 -sigma 0.02 -thold-sim 0.1
  • parallel representative-based spectral clustering on CPU+GPU (using the k-means algorithm to extract representatives, with the same connectivity parameters as above):
./Clustering -algo 5 -chain 3 -er 2 -sigma 0.02 -thold-sim 0.1

Note that, for performance reasons, some parameters are defined as constants in include/config.h and include/datasets.h, such as the number of representatives (NB_REPS) and the way of computing similarity (GAUSS_SIM_WITH_THOLD). They cannot be set through command-line arguments, so the code must be recompiled after modifying the values of these constants.
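
To illustrate why a compile-time constant can help performance, here is a minimal sketch, not taken from the repository. The default value 4, the #ifndef guard and the function nearest_rep are assumptions made for this example; only the name NB_REPS comes from include/config.h.

```c
#include <stddef.h>

/* Hypothetical default for illustration; the real value is set in
 * include/config.h and may be defined differently there. */
#ifndef NB_REPS
#define NB_REPS 4
#endif

/* Index of the representative nearest to a 1-D point x.
 * Since NB_REPS is known at compile time, the array size is fixed and
 * the compiler is free to unroll the loop. */
static int nearest_rep(const float reps[NB_REPS], float x)
{
    int best = 0;
    float best_d = (reps[0] - x) * (reps[0] - x);
    for (int r = 1; r < NB_REPS; r++) {
        float d = (reps[r] - x) * (reps[r] - x);
        if (d < best_d) {
            best_d = d;
            best = r;
        }
    }
    return best;
}
```

The trade-off is the one described above: changing the constant means editing the header and rebuilding, rather than passing a new argument at run time.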
