
Adding gene and cell type tokenizers #95

Merged

matanninio merged 5 commits into main from Adding-gene-and-cell-type-tokenizers on Jan 10, 2024

Conversation

matanninio (Collaborator)

Added a script and supporting function for adding a new tokenizer to an existing one. Updated the create-tokenizer and update-special-tokens scripts (and renamed them to reflect what they do).

Now supporting extended tokenizers, i.e., adding large tokenizers past the end of the regular tokenizer ID range (id=5000) in a way that allows maintaining two consistent tokenizers, one with and one without the large tokenizer(s).

The README has been updated.
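To make the ID-consistency idea concrete, here is a minimal sketch (the names and the helper are hypothetical, not the actual ModularTokenizer API): the regular tokenizer owns all IDs below a fixed boundary, and extended tokenizers are mapped past it, so regular-token IDs stay identical whether or not the extended vocabulary is loaded.

```python
# Hypothetical sketch of the extended-tokenizer ID layout described above.
from typing import Dict, List, Optional

EXTENDED_ID_START = 5000  # boundary taken from the PR description


def build_id_map(
    base_vocab: List[str], extended_vocab: Optional[List[str]] = None
) -> Dict[str, int]:
    """Assign stable IDs: base tokens first, extended tokens past the boundary."""
    id_map = {tok: i for i, tok in enumerate(base_vocab)}
    if extended_vocab is not None:
        for i, tok in enumerate(extended_vocab):
            id_map[tok] = EXTENDED_ID_START + i
    return id_map


base = ["<PAD>", "<UNK>", "A", "C", "G", "T"]
genes = ["GENE:TP53", "GENE:BRCA1"]  # e.g. a large gene/cell-type vocabulary

small = build_id_map(base)
large = build_id_map(base, genes)

# Regular-token IDs agree between the two tokenizers:
assert all(small[t] == large[t] for t in base)
```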

@floccinauc floccinauc self-requested a review January 7, 2024 07:37
@floccinauc (Collaborator) left a comment

Overall looks good. See some minor comments inline

@@ -751,24 +755,95 @@ def set_field(tokenizers_info_cfg: List, name: str, key: str, val: Any) -> List:
with open(os.path.join(path, "config.yaml"), "w") as f:
OmegaConf.save(tokenizer_config_overall, f)

def _add_single_tokenizer(
def update_special_tokens(self, added_tokens: List, save_tokenizer_path: str = None):
floccinauc (Collaborator)

Are you sure there's a point in adding this function to the class?

matanninio (Collaborator, Author)

Actually, this was done intentionally. The function modifies the given modular tokenizer internally, so it should live inside the class. But the real reason is that the old version would modify the tokenizer and then return it, and this behavior seems to hint that the returned tokenizer is a different one, as was evident from the usage code. This way, it is clearer that the function operates on the object and modifies it in place.
And finally, it's now used in two different scripts, and importing between scripts does not feel like good practice.
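A hypothetical illustration of that point (not the repo's actual code): a free function that mutates its argument and then returns it reads as if it produced a new tokenizer, while a method makes the in-place mutation explicit at the call site.

```python
# Old style: the return value suggests a fresh object, though it is the same one.
def update_special_tokens_old(tokenizer, added_tokens):
    tokenizer.special_tokens.extend(added_tokens)
    return tokenizer


# New style: clearly an in-place operation on self.
class ModularTokenizer:
    def __init__(self):
        self.special_tokens = []

    def update_special_tokens(self, added_tokens, save_tokenizer_path=None):
        # mutates this tokenizer; the caller is responsible for saving it
        self.special_tokens.extend(added_tokens)
```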

floccinauc (Collaborator)

Agree

)
added_tokens += new_special_tokens

# we update the special tokens but do not save here. remember to save yourself.
floccinauc (Collaborator)

Love the phrasing. Getting spiritual vibes from "remember to save yourself"

matanninio (Collaborator, Author)

“Everything not saved will be lost.” – Nintendo “Quit Screen” message

added_tokens += new_special_tokens

# we update the special tokens but do not save here. remember to save yourself.
self.update_special_tokens(
floccinauc (Collaborator)

I think it's best to put self.update_special_tokens after self.build_inner_decoder. It should be fine as is, but self.update_special_tokens assumes that the modular tokenizer is consistent and usable, which may not be the case before build_inner_decoder is called.
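A sketch of the suggested call order (method bodies here are stand-ins, not the real implementation): build_inner_decoder first makes the modular tokenizer consistent and usable, and only then is update_special_tokens called.

```python
class ModularTokenizer:
    def __init__(self, added_tokens):
        self.decoder = None
        self.special_tokens = []
        self.build_inner_decoder()                 # 1. make the tokenizer consistent
        self.update_special_tokens(added_tokens)   # 2. now safe to call

    def build_inner_decoder(self):
        self.decoder = {}  # stand-in for the real id -> token mapping

    def update_special_tokens(self, added_tokens):
        assert self.decoder is not None, "requires a consistent tokenizer"
        self.special_tokens.extend(added_tokens)
```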

matanninio (Collaborator, Author)

Done

floccinauc (Collaborator)

Did you push it? I still see update_special_tokens before build_inner_decoder

@matanninio matanninio requested a review from floccinauc January 9, 2024 12:26
@floccinauc (Collaborator) left a comment

Looks great!

@matanninio matanninio merged commit 3ea1d9c into main Jan 10, 2024
4 checks passed
@matanninio matanninio deleted the Adding-gene-and-cell-type-tokenizers branch January 10, 2024 13:59