
Adding gene and cell type tokenizers #95

Merged

matanninio merged 5 commits into main from Adding-gene-and-cell-type-tokenizers on Jan 10, 2024

Conversation

matanninio (Collaborator)

Added a script and supporting function for adding a new tokenizer to an existing one. Updated the create-tokenizer and update-special-tokens scripts (and renamed them to reflect what they do).

Now supporting extended tokenizers, i.e., adding large tokenizers past the end of the regular tokenizer ID range (id=5000) in a way that allows maintaining two consistent tokenizers, one with and one without the large tokenizer(s).

The README has been updated.
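To make the ID-consistency idea concrete, here is a minimal sketch (the names and the helper are hypothetical, not the actual ModularTokenizer API): the regular tokenizer owns all IDs below a fixed boundary, and extended tokenizers are mapped past it, so regular-token IDs stay identical whether or not the extended vocabulary is loaded.

```python
# Hypothetical sketch of the extended-tokenizer ID layout described above.
from typing import Dict, List, Optional

EXTENDED_ID_START = 5000  # boundary taken from the PR description


def build_id_map(
    base_vocab: List[str], extended_vocab: Optional[List[str]] = None
) -> Dict[str, int]:
    """Assign stable IDs: base tokens first, extended tokens past the boundary."""
    id_map = {tok: i for i, tok in enumerate(base_vocab)}
    if extended_vocab is not None:
        for i, tok in enumerate(extended_vocab):
            id_map[tok] = EXTENDED_ID_START + i
    return id_map


base = ["<PAD>", "<UNK>", "A", "C", "G", "T"]
genes = ["GENE:TP53", "GENE:BRCA1"]  # e.g. a large gene/cell-type vocabulary

small = build_id_map(base)
large = build_id_map(base, genes)

# Regular-token IDs agree between the two tokenizers:
assert all(small[t] == large[t] for t in base)
```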

@floccinauc floccinauc self-requested a review January 7, 2024 07:37
@floccinauc (Collaborator) left a comment

Overall looks good. See some minor comments inline

@@ -751,24 +755,95 @@ def set_field(tokenizers_info_cfg: List, name: str, key: str, val: Any) -> List:
with open(os.path.join(path, "config.yaml"), "w") as f:
OmegaConf.save(tokenizer_config_overall, f)

def _add_single_tokenizer(
def update_special_tokens(self, added_tokens: List, save_tokenizer_path: str = None):
floccinauc (Collaborator)

Are you sure there's a point in adding this function to the class?

matanninio (Collaborator, Author)

Actually, this was done intentionally. The function modifies the given modular tokenizer internally, so it should live inside the class. But the real reason is that the old version would modify the tokenizer and then return it, and this behavior seems to hint that the returned tokenizer is a different one, as was evident from the usage code. This way, it is clearer that the function operates on the object and modifies it in place.
And finally, it's now used in two different scripts, and importing between scripts does not feel like good practice.
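A hypothetical illustration of that point (not the repo's actual code): a free function that mutates its argument and then returns it reads as if it produced a new tokenizer, while a method makes the in-place mutation explicit at the call site.

```python
# Old style: the return value suggests a fresh object, though it is the same one.
def update_special_tokens_old(tokenizer, added_tokens):
    tokenizer.special_tokens.extend(added_tokens)
    return tokenizer


# New style: clearly an in-place operation on self.
class ModularTokenizer:
    def __init__(self):
        self.special_tokens = []

    def update_special_tokens(self, added_tokens, save_tokenizer_path=None):
        # mutates this tokenizer; the caller is responsible for saving it
        self.special_tokens.extend(added_tokens)
```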

floccinauc (Collaborator)

Agree

)
added_tokens += new_special_tokens

# we update the special tokens but do not save here. remember to save yourself.
floccinauc (Collaborator)

Love the phrasing. Getting spiritual vibes from "remember to save yourself"

matanninio (Collaborator, Author)

“Everything not saved will be lost.” – Nintendo “Quit Screen” message

added_tokens += new_special_tokens

# we update the special tokens but do not save here. remember to save yourself.
self.update_special_tokens(
floccinauc (Collaborator)

I think it's best to put self.update_special_tokens after self.build_inner_decoder. It should be fine as is, but self.update_special_tokens assumes that the modular tokenizer is consistent and usable, which may not be the case before build_inner_decoder is called.
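A sketch of the suggested call order (method bodies here are stand-ins, not the real implementation): build_inner_decoder first makes the modular tokenizer consistent and usable, and only then is update_special_tokens called.

```python
class ModularTokenizer:
    def __init__(self, added_tokens):
        self.decoder = None
        self.special_tokens = []
        self.build_inner_decoder()                 # 1. make the tokenizer consistent
        self.update_special_tokens(added_tokens)   # 2. now safe to call

    def build_inner_decoder(self):
        self.decoder = {}  # stand-in for the real id -> token mapping

    def update_special_tokens(self, added_tokens):
        assert self.decoder is not None, "requires a consistent tokenizer"
        self.special_tokens.extend(added_tokens)
```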

matanninio (Collaborator, Author)

Done

floccinauc (Collaborator)

Did you push it? I still see update_special_tokens before build_inner_decoder

@matanninio matanninio requested a review from floccinauc January 9, 2024 12:26
@floccinauc (Collaborator) left a comment

Looks great!

@matanninio matanninio merged commit 3ea1d9c into main Jan 10, 2024
4 checks passed
@matanninio matanninio deleted the Adding-gene-and-cell-type-tokenizers branch January 10, 2024 13:59