Welcome! This repository provides the code related to Guanlin HE's doctoral dissertation entitled "Parallel algorithms for clustering large datasets on CPU-GPU heterogeneous architectures" (defended on October 19th, 2022 at Université Paris-Saclay). The code includes several optimized parallel implementations of clustering algorithms:
- Parallel k-means(++) clustering on CPU
- Parallel k-means(++) clustering on GPU
- Parallel spectral clustering on CPU
- Parallel spectral clustering on GPU
- Parallel representative-based spectral clustering on CPU / on GPU / on CPU+GPU
Although the code contains comments, referring to the dissertation can help you understand our parallelization strategies, implementation details and optimization techniques.
- The `main.cc` file in the home directory contains the top-level function `main()`.
- All the other `.cc` and `.cu` files lie in the `modules` folder, while the `.h` files are put into the `include` folder. They are further classified by subject in second-level directories.
- The characteristics and locations of benchmark datasets are defined in the `include/datasets.h` file.
- The default settings of various program parameters are defined in the `include/config.h` file.
- The compilation settings are specified in `Makefile_definitions` under the home directory and in the `Makefile` files under the different levels of directories.
- The `.txt` files (e.g. cluster labels) generated by the clustering program are stored in the `output` folder by default.
- The `python` folder contains some `.py` files used to generate synthetic data, plot data and clustering results, evaluate clustering quality, etc.
The code was developed under Linux and is written mainly in C, OpenMP and CUDA. The target compilation tools are `gcc` and `nvcc`. Detailed compilation settings can be found in `Makefile_definitions` and `Makefile`.
Running `make clean` is recommended before each compilation with the `make` command in the home directory. The compilation process may take about 1-2 minutes.
After compilation, an executable named `Clustering` is produced. You can then run `Clustering` with various arguments (enter `./Clustering -h` to see the usage of each argument).
Examples:
- parallel k-means clustering on CPU using 40 threads:
  `./Clustering -algo 1 -cpu-nt 40`
- parallel k-means++ clustering on GPU:
  `./Clustering -algo 2 -seeding-km-gpu 2`
- parallel spectral clustering on GPU (using Gaussian similarity with $\sigma=0.02$ and similarity threshold 0.1):
  `./Clustering -algo 4 -sigma 0.02 -thold-sim 0.1`
- parallel representative-based spectral clustering on CPU+GPU (using the k-means algorithm to extract representatives, and the values above for connectivity parameters):
  `./Clustering -algo 5 -chain 3 -er 2 -sigma 0.02 -thold-sim 0.1`
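For reference, the `-sigma` and `-thold-sim` arguments control a thresholded Gaussian similarity. In its commonly used form (the exact expression used by the code is given in the dissertation; thresholding sparsifies the similarity matrix):

$$
s_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right),
\qquad
s_{ij} \leftarrow \begin{cases} s_{ij} & \text{if } s_{ij} \ge \text{threshold}, \\ 0 & \text{otherwise.} \end{cases}
$$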
Note that, for performance reasons, some parameters are defined as constants in `include/config.h` and `include/datasets.h`, such as the number of representatives (`NB_REPS`) and the way of computing similarity (`GAUSS_SIM_WITH_THOLD`). They cannot be changed through command-line arguments, so the code must be recompiled after modifying the values of these constants.