Welcome! This repository provides the code related to Guanlin HE's doctoral dissertation entitled "Parallel algorithms for clustering large datasets on CPU-GPU heterogeneous architectures" (defended on October 19th, 2022 at Université Paris-Saclay). The code includes several optimized parallel implementations of clustering algorithms:
- Parallel k-means(++) clustering on CPU
- Parallel k-means(++) clustering on GPU
- Parallel spectral clustering on CPU
- Parallel spectral clustering on GPU
- Parallel representative-based spectral clustering on CPU / on GPU / on CPU+GPU
Although the code contains comments, referring to the dissertation can help you understand our parallelization strategies, implementation details and optimization techniques.
- The `main.cc` file in the home directory contains the top-level function `main()`.
- All the other `.cc` and `.cu` files lie in the `modules` folder, while the `.h` files are put into the `include` folder. They are further classified by subject in second-level directories.
- The characteristics and locations of benchmark datasets are defined in the `include/datasets.h` file.
- The default settings of various program parameters are defined in the `include/config.h` file.
- The compilation settings are specified in `Makefile_definitions` under the home directory and in the `Makefile` files under the different levels of directories.
- The `.txt` files (e.g. cluster labels) generated by the clustering program are stored in the `output` folder by default.
- The `python` folder contains some `.py` files used to generate synthetic data, plot data and clustering results, evaluate clustering quality, etc.
The code was developed under Linux and is written mainly in C, OpenMP and CUDA. The target compilation tools are `gcc` and `nvcc`. Detailed compilation settings can be found in `Makefile_definitions` and `Makefile`.
Running `make clean` is recommended before each compilation with the `make` command in the home directory. The compilation process may take about 1-2 minutes.
After compilation, an executable named `Clustering` is produced. You can then run `Clustering` with various arguments (enter `./Clustering -h` to see the usage of each argument).
Examples:
- parallel k-means clustering on CPU using 40 threads:
  `./Clustering -algo 1 -cpu-nt 40`
- parallel k-means++ clustering on GPU:
  `./Clustering -algo 2 -seeding-km-gpu 2`
- parallel spectral clustering on GPU (using Gaussian similarity with $\sigma=0.02$ and similarity threshold 0.1):
  `./Clustering -algo 4 -sigma 0.02 -thold-sim 0.1`
- parallel representative-based spectral clustering on CPU+GPU (using the k-means algorithm to extract representatives, and the values above for connectivity parameters):
  `./Clustering -algo 5 -chain 3 -er 2 -sigma 0.02 -thold-sim 0.1`
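For reference, the `-sigma` and `-thold-sim` arguments control a thresholded Gaussian similarity. In its commonly used form (the exact expression used by the code is given in the dissertation; thresholding sparsifies the similarity matrix):

$$
s_{ij} = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right),
\qquad
s_{ij} \leftarrow \begin{cases} s_{ij} & \text{if } s_{ij} \ge \text{threshold}, \\ 0 & \text{otherwise.} \end{cases}
$$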
Note that, for performance reasons, some parameters are defined as constants in `include/config.h` and `include/datasets.h`, such as the number of representatives (`NB_REPS`) and the way of computing similarity (`GAUSS_SIM_WITH_THOLD`). They cannot be changed through command-line arguments, so the code must be recompiled after modifying the values of these constants.