[WIP] add text augmentor #8
base: main
Conversation
Force-pushed from 858aadd to 8861279
Very nice. Thanks @sitongye
Perhaps you can put all code in a file called
The .md file (documentation) should be at
Force-pushed from 4a29ae6 to 722cf20
@sitongye I made some fixes. Now there are "only" pylint issues remaining.
transformer_tools/augmentation.py
Outdated
def _swap_with_weights(self, ori_word, prob):
    # swap a word with candidates or stay the same depending on given probability
    # prob: probability of being swapped
    swap = self.candidate_dict.get(ori_word)  # FIXME: where is candidate_dict set?
@sitongye can you please check this? I think candidate_dict is not available.
TODO
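A minimal sketch of how candidate_dict could be populated and used, assuming it maps each word to a list of replacement candidates; the dictionary contents and the random-based draw are assumptions for illustration, not this PR's actual code:

import random

# Illustrative only: candidate_dict would need to be built in __init__,
# e.g. from nearest neighbours in a word-embedding model.
candidate_dict = {"Haus": ["Gebäude", "Heim"], "schnell": ["rasch", "flott"]}

def swap_with_weights(ori_word, prob):
    """Return a candidate with probability prob, otherwise the original word."""
    candidates = candidate_dict.get(ori_word)
    if candidates and random.random() < prob:
        return random.choice(candidates)
    return ori_word

print(swap_with_weights("Haus", prob=0.8))  # "Gebäude", "Heim" or "Haus"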
transformer_tools/augmentation.py
Outdated
self.mid2ori_model = self._load_transmodel(
    self.mid_ori_model_path, self.mid_ori_checkpoints
)
@sitongye this needs a 3rd param
transformer_tools/augmentation.py
Outdated
self.model = AutoModelForMaskedLM.from_pretrained(local_model_path)
self.nr_candidates = nr_candidates

def _generate(self, sequence):
@sitongye just rename sequence to sent?
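For context, a self-contained sketch of what masked-LM candidate generation along these lines can look like. The pipeline("fill-mask") usage is standard transformers API; the model name, function shape, and example sentence are illustrative assumptions:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-german-cased")

def generate_candidates(sent, word, nr_candidates=5):
    """Mask one occurrence of word in sent and return the top replacements."""
    masked = sent.replace(word, fill_mask.tokenizer.mask_token, 1)
    preds = fill_mask(masked, top_k=nr_candidates)
    return [p["token_str"].strip() for p in preds]

print(generate_candidates("Das ist ein schönes Haus.", "schönes"))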
Hey @sitongye
transformer_tools/augmentation.py
Outdated
def map_apostrophe(string):
    """Replace special short forms in German back to original forms."""
    # rule based; longer forms first so "'ne" does not shadow "'nem" etc.
    mapping = {
        "'nem": " einem",
        "'nen": " einen",
        "'ner": " einer",
        "'ne": " eine",
        "'n": " ein",
        "'s": " es",
    }
    for key, value in mapping.items():
        string = re.sub(key, value, string)
    return string
Please check if we really need this.
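For illustration, a possible call (the input text is made up). Note the doubled space that comes from the leading space in the mapped values; the augmentation code collapses space runs later with re.sub(" +", " ", ...):

import re  # map_apostrophe uses re.sub

text = "Wie geht's mit 'nem Beispiel?"
print(map_apostrophe(text))
# -> "Wie geht es mit  einem Beispiel?"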
Force-pushed from 1be9435 to f83c233
Rebased on main.
Force-pushed from 77116ba to 4c88aa2
Resolved merge conflict.
I think the tests fail because some pip packages cannot be installed. Maybe fasttext...
Force-pushed from 01d0e7d to 7e80452
@PhilipMay I think it's because I imported "math" but did not use it. Last time, unused library imports did not cause a problem. I will use this afternoon to check it and also to test the functionality.
For some reason spacy fails to install. Now that I removed it, the tests pass and only the linting has some issues. When spacy is in the optional dependencies, it does not even get installed in the CI pipeline.
Force-pushed from 7e80452 to 95deded
Force-pushed from 95deded to 6f33414
out = " ".join(aug_text).strip() | ||
out = re.sub(" +", " ", out) | ||
out = re.sub(" ,", ",", out) | ||
out = re.sub(r" \.", ".", out) |
@sitongye instead of this we should be able to use the whitespace_ attribute of the spaCy token. That would be a cleaner way to concatenate the text. See: https://spacy.io/api/token#attributes
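A sketch of that cleaner concatenation, assuming each original spaCy token is paired with its (possibly swapped) replacement string; aug_pairs is an illustrative name:

# Token.whitespace_ holds each token's trailing whitespace, so the original
# spacing (including no space before "," and ".") is preserved without any
# regex cleanup.
out = "".join(new_text + token.whitespace_ for token, new_text in aug_pairs)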
@PhilipMay do you think it's worth it if I transform the fasttext embeddings into the spacy format so that we can simply swap the tokens? https://spacy.io/usage/linguistic-features#vectors-similarity
Not sure. Maybe we should write an abstraction to change the way we swap tokens?
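One possible shape for such an abstraction (purely illustrative, not this PR's API): a small interface so fasttext neighbours, spaCy vector similarity, or a masked LM could be plugged in interchangeably as candidate sources:

from abc import ABC, abstractmethod

class TokenSwapper(ABC):
    """Produces replacement candidates for a token."""

    @abstractmethod
    def candidates(self, word, n=5):
        """Return up to n candidate replacements for word."""

class FasttextSwapper(TokenSwapper):
    def __init__(self, ft_model):
        self.ft_model = ft_model  # a loaded fasttext model

    def candidates(self, word, n=5):
        # fasttext returns (similarity, word) pairs for nearest neighbours
        return [w for _, w in self.ft_model.get_nearest_neighbors(word, k=n)]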
I think fairseq has issues with py 3.9.
@PhilipMay Hi Philip, it seems the transformers import failed: the name "transformers.py" under "transformer_tools" conflicts with the "transformers" library. How shall I resolve that?
Do you mean the failure in the CI pipeline here in GH? Our transformers module should be used as transformer_tools.transformers while the other is just transformers.
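For reference, Python 3 imports are absolute by default, so inside the package the installed library and the local module can be disambiguated explicitly; the shadowing usually only bites when the package directory itself ends up on sys.path (e.g. when running a script from inside it). A minimal sketch:

import transformers  # the Hugging Face library
from transformer_tools import transformers as tt_transformers  # the local module

model = transformers.AutoModelForMaskedLM.from_pretrained("bert-base-german-cased")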
add initial version of text augmentor + documentation as readme.md