Lexicon Induction from Continuous Sign Language Corpus

Overview

lexicon-induction is a project for inducing a lexicon from continuous sign language corpora. It combines large datasets of sign language videos with pose estimation, automatic sign language segmentation, and machine learning models to analyze and categorize sign language data.

Datasets Used

DGS Corpus¹ - Development of a corpus-based electronic dictionary German Sign Language/German.
Corpus NGT² - An online corpus for professionals and laymen in Dutch Sign Language (NGT).
BSL Corpus³ - Building the British Sign Language Corpus.

Workflow

1. Pose Estimation

For each video in corpus/videos, we run pose estimation:

video_to_pose -i sign.mp4 --format mediapipe -o sign.pose

The resulting pose files are stored in the poses directory.
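
Running the command by hand does not scale to a full corpus, so the step above can be sketched as a small batch script. This is a minimal sketch, not the project's own code: the function names pose_command and estimate_poses are hypothetical, and it assumes the videos are .mp4 files and that video_to_pose is available on the PATH with the flags shown above.

```python
import subprocess
from pathlib import Path


def pose_command(video: Path, pose_dir: Path) -> list[str]:
    """Build the video_to_pose call for one video, mirroring the command above."""
    pose_path = pose_dir / video.with_suffix(".pose").name
    return ["video_to_pose", "-i", str(video), "--format", "mediapipe", "-o", str(pose_path)]


def estimate_poses(video_dir: Path, pose_dir: Path, dry_run: bool = False) -> list[list[str]]:
    """Run pose estimation for every .mp4 that has no pose file yet.

    Returns the commands that were (or, with dry_run=True, would be) executed.
    """
    pose_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for video in sorted(video_dir.glob("*.mp4")):
        command = pose_command(video, pose_dir)
        if Path(command[-1]).exists():
            continue  # skip videos that were already processed
        commands.append(command)
        if not dry_run:
            # check=True stops the batch on the first failing video
            subprocess.run(command, check=True)
    return commands
```

Calling estimate_poses(Path("corpus/videos"), Path("poses")) would process the whole corpus. Skipping existing outputs makes the script safe to re-run after an interruption, and dry_run=True lists the pending commands without executing them.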

2. Segmentation

Pose sequences are automatically segmented using the sign language segmentation tool:

pose_to_segments -i sign.pose -o sign.eaf --video sign.mp4

Segmentation outputs (ELAN files) are stored in the segments directory.
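
ELAN files are XML, so the segment boundaries can be read with the Python standard library alone. The sketch below shows one way to do that. It assumes the sign-level segments live on a tier named SIGN, which may not match the tier names the segmentation tool actually writes, and read_segments is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET


def read_segments(eaf_source, tier_id="SIGN"):
    """Return sorted (start_ms, end_ms) pairs for every annotation on one tier.

    eaf_source may be a file path or an open file object.
    The default tier name "SIGN" is an assumption.
    """
    root = ET.parse(eaf_source).getroot()
    # ELAN stores times indirectly: annotations point at named time slots.
    slots = {
        slot.get("TIME_SLOT_ID"): int(slot.get("TIME_VALUE"))
        for slot in root.iter("TIME_SLOT")
        if slot.get("TIME_VALUE") is not None
    }
    segments = []
    for tier in root.iter("TIER"):
        if tier.get("TIER_ID") != tier_id:
            continue
        for annotation in tier.iter("ALIGNABLE_ANNOTATION"):
            start = slots.get(annotation.get("TIME_SLOT_REF1"))
            end = slots.get(annotation.get("TIME_SLOT_REF2"))
            if start is not None and end is not None:
                segments.append((start, end))
    return sorted(segments)
```

For example, read_segments("segments/sign.eaf") would return the millisecond bounds of each detected sign, ready to be mapped onto pose frames.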

3. Sign Language Recognition

For each segment in the ELAN file:

Crop the corresponding pose.
Run the cropped pose through a sign language recognition model.

We use the output of the model's softmax layer as a probability distribution over labels.
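
The two operations in this step, cropping by time and turning model scores into a label distribution, can be illustrated without any model. The sketch below is illustrative only: segment_to_frames and softmax are hypothetical helpers, the frame list stands in for real pose data, and the logits and the frame rate of 25 are made up.

```python
import math


def segment_to_frames(start_ms, end_ms, fps):
    """Convert a segment's millisecond bounds to a half-open frame range."""
    first = int(math.floor(start_ms * fps / 1000))
    last = int(math.ceil(end_ms * fps / 1000))
    return first, max(last, first + 1)  # always keep at least one frame


def softmax(logits):
    """Numerically stable softmax over a list of raw model scores."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical stand-ins: per-frame pose data and made-up logits.
frames = list(range(100))
first, last = segment_to_frames(400, 1000, fps=25)
cropped = frames[first:last]  # frames 10 to 24 at 25 fps
distribution = softmax([2.0, 1.0, 0.1])  # non-negative, sums to 1
```

Keeping the full distribution rather than only the top label preserves the model's uncertainty, which is what the clustering step operates on.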

4. Clustering

Signs are clustered based on their softmax output vectors.
The clustering assumes a Zipfian distribution over sign usage (evidence).
Cluster sizes are expected to reflect this distribution.
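
The source does not name a clustering algorithm, so the sketch below swaps in a deliberately simple one, greedy leader clustering with cosine similarity, purely to show how softmax vectors can be grouped and how the resulting cluster sizes can be ranked for comparison against a Zipfian curve. The threshold of 0.9, the function names, and the vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leader_cluster(vectors, threshold=0.9):
    """Assign each vector to the first cluster whose leader is similar enough."""
    leaders, labels = [], []
    for vector in vectors:
        for index, leader in enumerate(leaders):
            if cosine(vector, leader) >= threshold:
                labels.append(index)
                break
        else:
            leaders.append(vector)  # no close leader: start a new cluster
            labels.append(len(leaders) - 1)
    return labels


def ranked_sizes(labels):
    """Cluster sizes from largest to smallest, as in a rank-frequency table."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return sorted(counts.values(), reverse=True)


# Made-up softmax outputs over three labels, for illustration only.
vectors = [
    [0.90, 0.05, 0.05], [0.85, 0.10, 0.05], [0.80, 0.10, 0.10], [0.88, 0.07, 0.05],
    [0.05, 0.90, 0.05], [0.10, 0.85, 0.05],
    [0.05, 0.05, 0.90],
]
labels = leader_cluster(vectors)
sizes = ranked_sizes(labels)
```

For these toy vectors the ranked cluster sizes are 4, 2, and 1. Under a Zipfian distribution with exponent 1, the cluster at rank r would be expected to hold roughly 1/r of the items in the largest cluster, so plotting ranked sizes against rank on log-log axes is one quick sanity check.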

5. Evaluation

The clustering is evaluated on the pre-annotated DGS Corpus.
We use metrics that are relevant for clustering under a Zipfian distribution.
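
The source does not say which metrics are used. As one possible illustration, the sketch below computes purity and inverse purity, two common clustering scores that compare predicted clusters against gold annotations. The choice of these metrics, the function names, and the gloss labels are assumptions, not the project's evaluation protocol.

```python
from collections import Counter, defaultdict


def purity(predicted, gold):
    """Fraction of items that carry the majority gold label of their cluster."""
    groups = defaultdict(list)
    for cluster, label in zip(predicted, gold):
        groups[cluster].append(label)
    majority = sum(Counter(labels).most_common(1)[0][1] for labels in groups.values())
    return majority / len(gold)


def inverse_purity(predicted, gold):
    """Purity with roles swapped: how well each gold sign stays in one cluster."""
    return purity(gold, predicted)


# Toy data: cluster ids from a clustering step, made-up glosses as gold labels.
predicted = [0, 0, 0, 0, 1, 1, 2]
gold = ["HOUSE", "HOUSE", "HOUSE", "TREE", "TREE", "TREE", "BOOK"]
scores = (purity(predicted, gold), inverse_purity(predicted, gold))
```

Both scores are averaged over items, so under a Zipfian distribution they are dominated by the few most frequent signs. Reporting a per-sign average alongside them is one way to make performance on rare signs visible.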

References

¹ Prillwitz, Siegmund, et al. "DGS Corpus project--development of a corpus based electronic dictionary German Sign Language/German." Sign-lang at LREC. 2008.
² Crasborn, Onno, and Inge Zwitserlood. "The Corpus NGT: An online corpus for professionals and laymen." 2008.
³ Schembri, Adam, et al. "Building the British sign language corpus." Language Documentation & Conservation, vol. 7, 2013, pp. 136-154.
