This repository has been archived by the owner on May 28, 2023. It is now read-only.

[WIP] add text augmentor #8

Open · wants to merge 30 commits into main from text_augmentation
Conversation

@sitongye

add initial version of text augmentor + documentation as readme.md

@PhilipMay PhilipMay force-pushed the main branch 3 times, most recently from 858aadd to 8861279 Compare July 28, 2021 11:40
@PhilipMay (Member)

Very nice. Thanks @sitongye

@PhilipMay (Member)

Perhaps you can put all code in a file called augmentation.py in the main package folder.

@PhilipMay (Member)

The .md file (documentation) should be at docs/source/.
Later we can build a Sphinx page from it.

@PhilipMay (Member)

Text_Augmentation_Class_Diagram.png can be at docs/source/imgs please

@sitongye sitongye force-pushed the text_augmentation branch from 4a29ae6 to 722cf20 Compare July 28, 2021 13:53
@PhilipMay (Member)

@sitongye I made some fixes

@PhilipMay (Member)

Now there are "only" pylint issues remaining.
You can see them with pylint <filename>.
Maybe you can check them.

def _swap_with_weights(self, ori_word, prob):
    # swap a word with a candidate or keep it, depending on the given probability
    # prob: probability of being swapped
    swap = self.candidate_dict.get(ori_word)  # FIXME: where is candidate_dict set?
Member
@sitongye can you please check this. I think candidate_dict is not available.
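One hypothetical way to resolve this is to pass the candidate dictionary in via the constructor. The class and attribute names below are illustrative sketches, not code from the PR:

```python
# Hypothetical sketch: candidate_dict must be set before _swap_with_weights
# can look up replacements, e.g. in __init__.
import random


class WordSwapper:  # illustrative class, not the PR's actual one
    def __init__(self, candidate_dict):
        # word -> list of replacement candidates
        self.candidate_dict = candidate_dict

    def _swap_with_weights(self, ori_word, prob):
        """Swap a word with one of its candidates with probability `prob`."""
        candidates = self.candidate_dict.get(ori_word)
        if candidates and random.random() < prob:
            return random.choice(candidates)
        return ori_word


swapper = WordSwapper({"gut": ["prima"]})
print(swapper._swap_with_weights("gut", 1.0))  # prima
```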

@PhilipMay (Member)

TODO

  • extend 3rd party libs file

Comment on lines 316 to 274
        self.mid2ori_model = self._load_transmodel(
            self.mid_ori_model_path, self.mid_ori_checkpoints
        )
Member

@sitongye this needs a 3rd param

        self.model = AutoModelForMaskedLM.from_pretrained(local_model_path)
        self.nr_candidates = nr_candidates

    def _generate(self, sequence):
Member

@sitongye just rename sequence to sent?

@PhilipMay (Member)

Hey @sitongye,
the static code checks are almost OK.
Can you please have a look at the last two inline comments I made?

@PhilipMay PhilipMay changed the title from "add text augmentor" to "[WIP] add text augmentor" Aug 3, 2021
Comment on lines 26 to 39
def map_apostrophe(string):
    """Replace special German short forms with their original forms."""
    # rule based
    mapping = {
        "'s": " es",
        "'nem": " einem",
        "'ne": " eine",
        "'ner": " einer",
        "'nen": " einen",
        "'n": " ein",
    }
    for key, value in mapping.items():
        string = re.sub(key, value, string)
    return string
Member

Please check if we really need this.
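For reference, the rule can be exercised in isolation. The function below is copied from the diff above, with the stray trailing apostrophe removed from the "'nen'" key so that form actually matches:

```python
# Self-contained run of the apostrophe mapping on a simple input,
# to help judge whether the rule set is needed.
import re


def map_apostrophe(string):
    """Replace special German short forms with their original forms."""
    mapping = {
        "'s": " es",
        "'nem": " einem",
        "'ne": " eine",
        "'ner": " einer",
        "'nen": " einen",
        "'n": " ein",
    }
    for key, value in mapping.items():
        string = re.sub(key, value, string)
    return string


print(map_apostrophe("Wie geht's?"))  # Wie geht es?
```

Note that the keys are interpreted as regular expressions by re.sub and that dict insertion order matters (longer keys like "'nem" must come before "'n").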

@PhilipMay PhilipMay force-pushed the text_augmentation branch 2 times, most recently from 1be9435 to f83c233 Compare August 7, 2021 13:00
@PhilipMay
Copy link
Member

rebased on main

@PhilipMay
Copy link
Member

resolved merge conflict

@PhilipMay
Copy link
Member

I think the tests fail because it does not manage to install some pip packages. Maybe fasttext....

@sitongye
Copy link
Author

@PhilipMay I think it's because I imported "math" but did not use it. Last time the library imports did not cause a problem. I will use this afternoon to check it and also test the functionality.

@PhilipMay
Copy link
Member

> @PhilipMay I think it's because I imported "math" but did not use it. Last time the library imports did not cause a problem. I will use this afternoon to check it and also test the functionality.

For some reason spacy fails to install. Now that I removed it the tests pass, and only the linting has some issues. Even when spacy is listed as an optional dependency, it does not install in the CI pipeline.

Comment on lines +183 to +186
out = " ".join(aug_text).strip()
out = re.sub(" +", " ", out)
out = re.sub(" ,", ",", out)
out = re.sub(r" \.", ".", out)
Member

@sitongye instead of this we should be able to use the whitespace_ attribute of the spacy token.
That would be a cleaner way to concatenate the text.
See: https://spacy.io/api/token#attributes
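A minimal sketch of this suggestion, using a hypothetical stand-in class instead of real spacy tokens (in spacy, Token.whitespace_ holds the trailing whitespace of each token from the original text):

```python
# Sketch: rebuild the sentence from tokens via the whitespace_ attribute,
# so no regex post-processing of spaces before "," and "." is needed.
from dataclasses import dataclass


@dataclass
class Tok:  # hypothetical stand-in for a spacy Token
    text: str
    whitespace_: str


def detokenize(tokens):
    """Concatenate tokens exactly as the original text was spaced."""
    return "".join(tok.text + tok.whitespace_ for tok in tokens)


tokens = [Tok("Hallo", " "), Tok("Welt", ""), Tok(",", " "), Tok("wie", " "), Tok("geht's", "")]
print(detokenize(tokens))  # Hallo Welt, wie geht's
```

With real spacy, the same join works directly on the tokens of a Doc, replacing only the .text of swapped tokens.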

@sitongye (Author) Aug 13, 2021

@PhilipMay do you think it's worth transforming the fasttext embeddings into the spacy format so that we can simply swap the tokens? https://spacy.io/usage/linguistic-features#vectors-similarity

Member

Not sure. Maybe we should write an abstraction to change the way we swap tokens?
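Such an abstraction could look roughly like this. All names are hypothetical sketches, not code from the PR; concrete strategies (fasttext nearest neighbours, spacy similarity, ...) would each implement one method:

```python
# Sketch of a strategy interface for token swapping.
from abc import ABC, abstractmethod
from typing import List


class TokenSwapper(ABC):  # hypothetical name
    @abstractmethod
    def candidates(self, word: str) -> List[str]:
        """Return replacement candidates for a word."""


class DictSwapper(TokenSwapper):
    """Trivial strategy backed by a plain dict, e.g. for tests."""

    def __init__(self, table):
        self.table = table

    def candidates(self, word):
        return self.table.get(word, [])


swapper = DictSwapper({"gut": ["prima", "toll"]})
print(swapper.candidates("gut"))  # ['prima', 'toll']
```

The augmentor would then take a TokenSwapper instance instead of hard-coding one embedding backend.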

@PhilipMay (Member)

I think fairseq has issues with py 3.9.
Maybe we should remove it for now and focus on fasttext word replacement first.

@sitongye (Author)

@PhilipMay Hi Philip, it seems the transformers import failed: the file "transformers.py" under "transformer_tools" conflicts with the "transformers" library. How shall I resolve that?

@PhilipMay (Member) Aug 15, 2021

> @PhilipMay Hi Philip, it seems the transformers import failed: the file "transformers.py" under "transformer_tools" conflicts with the "transformers" library. How shall I resolve that?

Do you mean the failure in the CI pipeline here in GH?

Our transformers module should be imported as transformer_tools.transformers, while the library is just transformers.
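The shadowing effect behind this can be simulated with the stdlib alone; the fake module below stands in for a local transformers.py that wins the name lookup:

```python
# Whatever sits in sys.modules under a name is what `import` returns,
# so a local module named "transformers" shadows the real library.
import sys
import types

local = types.ModuleType("transformers")  # stands in for a local transformers.py
local.origin = "local file"
sys.modules["transformers"] = local

import transformers  # resolves to the fake, not the Hugging Face library

print(transformers.origin)  # local file
```

Always using the fully qualified transformer_tools.transformers import for the local module avoids the clash.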
