[WIP] add text augmentor #8

sitongye · 2021-07-28T09:21:49Z

add initial version of text augmentor + documentation as readme.md

PhilipMay · 2021-07-28T11:44:11Z

Very nice. Thanks @sitongye

PhilipMay · 2021-07-28T11:46:50Z

Perhaps you can put all code in a file called augmentation.py in the main package folder.

PhilipMay · 2021-07-28T11:48:10Z

The .md file (documentation) should be at docs/source/.
Later we can build a Sphinx page from it.

PhilipMay · 2021-07-28T11:49:21Z

Text_Augmentation_Class_Diagram.png can be at docs/source/imgs please

PhilipMay · 2021-07-29T10:03:37Z

@sitongye I made some fixes

PhilipMay · 2021-07-29T11:21:32Z

now there are "only" pylint issues remaining.
You can see them with pyling <filename>.
Maybe you can check them.

PhilipMay · 2021-07-29T13:00:30Z

transformer_tools/augmentation.py

+    def _swap_with_weights(self, ori_word, prob):
+        # swap a word with candidates or stay the same depending on given probability
+        # prob: probability of being swapped
+        swap = self.candidate_dict.get(ori_word)  # FIXME: where os candidate_dict set?


@sitongye can you please check this. I think candidate_dict is not available.

PhilipMay · 2021-07-29T13:47:01Z

TODO

extend 3rd party libs file

PhilipMay · 2021-08-02T15:19:33Z

transformer_tools/augmentation.py

+        self.mid2ori_model = self._load_transmodel(
+            self.mid_ori_model_path, self.mid_ori_checkpoints
+        )


@sitongye this needs a 3rd param

PhilipMay · 2021-08-02T15:28:36Z

transformer_tools/augmentation.py

+            self.model = AutoModelForMaskedLM.from_pretrained(local_model_path)
+        self.nr_candidates = nr_candidates
+
+    def _generate(self, sequence):


@sitongye just rename sequence to sent?

PhilipMay · 2021-08-02T15:29:57Z

Hey @sitongye
the static code checks are almost ok.
Can you please have a look at the last 2 inline comments I made?

PhilipMay · 2021-08-04T11:27:51Z

transformer_tools/augmentation.py

+def map_apostrophe(string):
+    """Replace special short forms in german back to original forms."""
+    # rule based
+    mapping = {
+        "'s": " es",
+        "'nem": " einem",
+        "'ne": " eine",
+        "'ner": " einer",
+        "'nen'": " einen",
+        "'n": " ein",
+    }
+    for key, value in mapping.items():
+        string = re.sub(key, value, string)
+    return string


please check if we realy need this

PhilipMay · 2021-08-07T13:00:56Z

rebased on main

PhilipMay · 2021-08-12T13:12:17Z

resolved merge conflict

PhilipMay · 2021-08-13T07:05:35Z

I think the tests fail because it does not manage to install some pip packages. Maybe fasttext....

sitongye · 2021-08-13T08:01:23Z

@PhilipMay I think it's that I imported "math" but did not use it. last time the imports of libraries did not throw a problem. I will use this afternoon to check it and also test the functionalities

PhilipMay · 2021-08-13T08:06:19Z

@PhilipMay I think it's that I imported "math" but did not use it. last time the imports of libraries did not throw a problem. I will use this afternoon to check it and also test the functionalities

For some reason spacy has problems to be installed. Now that I removed it the tests pass but only the linting has some issues. When spacy is in optional dependencies it does not even install in the CI pipeline.

PhilipMay · 2021-08-13T12:57:16Z

transformer_tools/augmentation.py

+        out = " ".join(aug_text).strip()
+        out = re.sub(" +", " ", out)
+        out = re.sub(" ,", ",", out)
+        out = re.sub(r" \.", ".", out)


@sitongye instead of this we should be able to use the whitespace_ attribute of the spacy token.
That would be a cleaner way to concatenate the text.
See: https://spacy.io/api/token#attributes

@PhilipMay do you think it's worth it if I transform the fasttext embedding to be in the spacy format so that we can simply swap the tokens? https://spacy.io/usage/linguistic-features#vectors-similarity

Not sure. Maybe we should write an abstraction to change the way how we swap token?

PhilipMay · 2021-08-13T16:50:54Z

I think fairseq has issues with py 3.9.
Maybe we should remove it for now and focus on fasttext word replacement first.

sitongye · 2021-08-14T13:56:27Z

@PhilipMay Hi philip, seems the transformers import failed, since naming of the "transformers.py" under "transformer_tools" conflicts with the "transformers" library, how shall I resolve that?

PhilipMay · 2021-08-15T05:54:56Z

@PhilipMay Hi philip, seems the transformers import failed, since naming of the "transformers.py" under "transformer_tools" conflicts with the "transformers" library, how shall I resolve that?

Do you mean the failure in the CI pipeline here in GH?

our transformers module should be used as transformer.transformers while the other is just transformers.

PhilipMay force-pushed the main branch 3 times, most recently from 858aadd to 8861279 Compare July 28, 2021 11:40

sitongye force-pushed the text_augmentation branch from 4a29ae6 to 722cf20 Compare July 28, 2021 13:53

PhilipMay reviewed Jul 29, 2021

View reviewed changes

PhilipMay reviewed Aug 2, 2021

View reviewed changes

PhilipMay changed the title ~~add text augmentor~~ [WIP] add text augmentor Aug 3, 2021

PhilipMay force-pushed the main branch from 9e1a6d0 to 3deaca3 Compare August 4, 2021 09:25

PhilipMay reviewed Aug 4, 2021

View reviewed changes

PhilipMay force-pushed the text_augmentation branch 2 times, most recently from 1be9435 to f83c233 Compare August 7, 2021 13:00

PhilipMay force-pushed the text_augmentation branch from 77116ba to 4c88aa2 Compare August 12, 2021 13:11

PhilipMay force-pushed the main branch from 9f72bba to d183718 Compare August 12, 2021 14:17

PhilipMay force-pushed the text_augmentation branch from 01d0e7d to 7e80452 Compare August 13, 2021 07:54

sitongye force-pushed the text_augmentation branch from 7e80452 to 95deded Compare August 13, 2021 08:08

PhilipMay force-pushed the text_augmentation branch from 95deded to 6f33414 Compare August 13, 2021 12:27

PhilipMay and others added 20 commits August 13, 2021 14:27

add fasttext as opt. dep.

007c0ec

add some lint fixes

9d1035f

ignore yaml typong

87210f1

add some lint fixes

5f4e8b9

improve map_apostrophe function

0e05dfe

add some lint fixes

0169acd

add some lint fixes

18aff6a

fix pylint errors

9514326

add online download modes for every involved models

a920cea

fix errors

c49a128

mute import-outside-toplevel

93de3be

variable fix

33ced4a

fix black format and docstyle

4cf05a8

modify functionalities

ebcd03b

modify swap_dice logic

389583e

debug one variable!

59e6e0a

empty commit to trigger CI

4ea0011

test CI by removing spacy

6f33414

remove unused import

60950e3

fix linting issue

72222e1

PhilipMay reviewed Aug 13, 2021

View reviewed changes

PhilipMay added 2 commits August 13, 2021 18:39

enable spacy again

e588e2a

temp. disable py 3.9 for tests

848a973

enable py 3.9 for tests again - this will break pytest pipeline

b4819e9

sitongye and others added 3 commits September 23, 2021 22:16

debug local loading

b199b01

debug token generation

a9742d2

fix all bugs in textembedding

9b8f3e1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] add text augmentor #8

[WIP] add text augmentor #8

sitongye commented Jul 28, 2021

PhilipMay commented Jul 28, 2021

PhilipMay commented Jul 28, 2021

PhilipMay commented Jul 28, 2021

PhilipMay commented Jul 28, 2021

PhilipMay commented Jul 29, 2021

PhilipMay commented Jul 29, 2021

PhilipMay Jul 29, 2021

PhilipMay commented Jul 29, 2021

PhilipMay Aug 2, 2021

PhilipMay Aug 2, 2021

PhilipMay commented Aug 2, 2021

PhilipMay Aug 4, 2021

PhilipMay commented Aug 7, 2021

PhilipMay commented Aug 12, 2021

PhilipMay commented Aug 13, 2021

sitongye commented Aug 13, 2021

PhilipMay commented Aug 13, 2021

PhilipMay Aug 13, 2021

sitongye Aug 13, 2021 •

edited

Loading

PhilipMay Aug 13, 2021

PhilipMay commented Aug 13, 2021

sitongye commented Aug 14, 2021

PhilipMay commented Aug 15, 2021 •

edited

Loading

[WIP] add text augmentor #8

Are you sure you want to change the base?

[WIP] add text augmentor #8

Conversation

sitongye commented Jul 28, 2021

PhilipMay commented Jul 28, 2021

PhilipMay commented Jul 28, 2021

PhilipMay commented Jul 28, 2021

PhilipMay commented Jul 28, 2021

PhilipMay commented Jul 29, 2021

PhilipMay commented Jul 29, 2021

PhilipMay Jul 29, 2021

Choose a reason for hiding this comment

PhilipMay commented Jul 29, 2021

TODO

PhilipMay Aug 2, 2021

Choose a reason for hiding this comment

PhilipMay Aug 2, 2021

Choose a reason for hiding this comment

PhilipMay commented Aug 2, 2021

PhilipMay Aug 4, 2021

Choose a reason for hiding this comment

PhilipMay commented Aug 7, 2021

PhilipMay commented Aug 12, 2021

PhilipMay commented Aug 13, 2021

sitongye commented Aug 13, 2021

PhilipMay commented Aug 13, 2021

PhilipMay Aug 13, 2021

Choose a reason for hiding this comment

sitongye Aug 13, 2021 • edited Loading

Choose a reason for hiding this comment

PhilipMay Aug 13, 2021

Choose a reason for hiding this comment

PhilipMay commented Aug 13, 2021

sitongye commented Aug 14, 2021

PhilipMay commented Aug 15, 2021 • edited Loading

sitongye Aug 13, 2021 •

edited

Loading

PhilipMay commented Aug 15, 2021 •

edited

Loading