Adding gene and cell type tokenizers #95
Conversation
…onfig to create the extended tokenizer
Overall looks good. See some minor comments inline
@@ -751,24 +755,95 @@ def set_field(tokenizers_info_cfg: List, name: str, key: str, val: Any) -> List:
    with open(os.path.join(path, "config.yaml"), "w") as f:
        OmegaConf.save(tokenizer_config_overall, f)

def _add_single_tokenizer(
def update_special_tokens(self, added_tokens: List, save_tokenizer_path: str = None):
Are you sure there's a point in adding this function to the class?
Actually, this was done intentionally. The function modifies the given modular tokenizer internally, so it should be inside the class. But the real reason is that the old version would modify the tokenizer and then return it, and that behavior seems to hint that the returned tokenizer is a different one, as was evident from the usage code. This way, it is clearer that the function operates on the object and modifies it.
And finally, it's now used in two different scripts, and importing between scripts does not feel like good practice.
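A minimal sketch of the contrast being described, with hypothetical names (only update_special_tokens appears in the diff; the old free-function version is a reconstruction for illustration):

from typing import List, Optional

class ModularTokenizer:
    def update_special_tokens(
        self, added_tokens: List, save_tokenizer_path: Optional[str] = None
    ) -> None:
        # Mutates this tokenizer in place and returns nothing, so callers
        # cannot mistake the result for a new, independent tokenizer.
        ...

# Old style (hypothetical reconstruction): returning the tokenizer hints,
# misleadingly, that a different object comes back.
def update_special_tokens_old(
    tokenizer: ModularTokenizer, added_tokens: List
) -> ModularTokenizer:
    ...  # modifies `tokenizer` internally
    return tokenizer  # easy to misread as a fresh instance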
Agree
)
added_tokens += new_special_tokens

# we update the special tokens but do not save here. remember to save yourself.
Love the phrasing. Getting spiritual vibes from "remember to save yourself"
“Everything not saved will be lost.” – Nintendo “Quit Screen” message.
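For illustration, a hedged usage sketch of the contract that code comment describes; the save call below is an assumption, not necessarily this repo's API:

# Hypothetical caller code. update_special_tokens mutates the tokenizer in
# place but writes nothing to disk, so the caller must save explicitly.
tokenizer.update_special_tokens(added_tokens=new_special_tokens)
tokenizer.save(path="tokenizers/extended")  # assumed save method; nothing is persisted otherwise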
added_tokens += new_special_tokens

# we update the special tokens but do not save here. remember to save yourself.
self.update_special_tokens(
I think it's best to put self.update_special_tokens after self.build_inner_decoder. It should be fine as is, but self.update_special_tokens assumes that the modular tokenizer is consistent and usable, which may not be the case before build_inner_decoder is called.
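A small sketch of the suggested ordering, as a skeleton class with stub methods (not the actual repository code):

from typing import List

class ModularTokenizerSketch:
    def __init__(self, added_tokens: List) -> None:
        # build_inner_decoder first: it is what makes the modular tokenizer
        # consistent and usable...
        self.build_inner_decoder()
        # ...and only then update the special tokens, which assumes a
        # consistent, usable tokenizer.
        self.update_special_tokens(added_tokens=added_tokens)

    def build_inner_decoder(self) -> None:
        pass  # stub

    def update_special_tokens(self, added_tokens: List) -> None:
        pass  # stub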
Done
Did you push it? I still see update_special_tokens before build_inner_decoder
Looks great!
…onflict in special tokens
Added a script and a supporting function for adding a new tokenizer to an existing one. Updated the create-tokenizer and update-special-tokens scripts (and renamed them to reflect what they do).
Extended tokenizers are now supported: large tokenizers can be added past the end of the regular tokenizer's ID range (id=5000), in a way that allows keeping two consistent tokenizers, one with and one without the large tokenizer(s).
The README has been updated.
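To make the ID-range idea concrete, here is a rough sketch under the description's assumptions (illustrative token names and variables; the repo's actual mechanism may differ):

REGULAR_ID_LIMIT = 5000  # regular tokenizer IDs stay below this boundary

regular_vocab = {"[PAD]": 0, "[UNK]": 1, "C": 2, "N": 3}           # IDs < 5000
large_vocab = {"GENE_TP53": 0, "GENE_BRCA1": 1, "CELL_T_CELL": 2}  # local IDs

# Append the large tokenizer past the regular range, leaving regular IDs untouched.
extended_vocab = dict(regular_vocab)
extended_vocab.update({tok: REGULAR_ID_LIMIT + i for tok, i in large_vocab.items()})

# Every ID below 5000 means the same thing in both tokenizers, so the version
# with the large tokenizer stays consistent with the one without it.
assert all(extended_vocab[t] == regular_vocab[t] for t in regular_vocab)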