workflows: add manual merger #2904

ammirate · 2017-10-25T13:58:57Z

Description

Add a new workflow to perform a manual merge between two records.
The workflow behaves in the following way:

read the records from the DB,
check their sources,
merge them and always halt,
when resumed, save the roots, add some metadata, and save the merged record.
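
The steps above can be sketched in plain Python, without the real workflow engine. Everything in this block is a hypothetical stand-in (the `ManualMergeSketch` class, the dict-based "DB", the naive merge where the update's keys win); it only illustrates the halt-then-resume shape, not the code in this PR.

```python
class ManualMergeSketch(object):
    """Toy model of the flow: read, check sources, merge, halt, finalize."""

    def __init__(self, db):
        self.db = db            # hypothetical store: {recid: record dict}
        self.roots = {}         # hypothetical store: {(recid, source): record}
        self.extra_data = {}
        self.data = None
        self.halted = False

    def run(self, head_recid, update_recid):
        # 1. read the records from the DB
        head = self.db[head_recid]
        update = self.db[update_recid]
        # 2. check their sources
        self.extra_data['head_source'] = head.get('source')
        self.extra_data['update_source'] = update.get('source')
        # 3. merge them (naively: update's keys win) and always halt
        merged = dict(head)
        merged.update(update)
        self.data = merged
        self.extra_data['head_recid'] = head_recid
        self.extra_data['update_recid'] = update_recid
        self.halted = True

    def resume(self):
        # 4. when resumed: save the roots, add metadata, save the merged record
        if not self.halted:
            raise RuntimeError('workflow is not halted')
        head_recid = self.extra_data['head_recid']
        update_recid = self.extra_data['update_recid']
        self.roots[(head_recid, self.extra_data['head_source'])] = dict(self.db[head_recid])
        self.roots[(head_recid, self.extra_data['update_source'])] = dict(self.db[update_recid])
        self.data['merged_from'] = update_recid
        self.db[head_recid] = self.data
        self.halted = False
        return self.db[head_recid]
```

Nothing is written back to the store until `resume()` is called, which is the property that lets a curator inspect the merged result while the workflow is halted.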

Motivation and Context

Checklist:

I have all the information that I need (if not, move to RFC and look for it).
I linked the related issue(s) in the corresponding commit logs.
I wrote good commit log messages.
My code follows the code style of this project.
I've added any new docs if API/utils methods were added.
I have updated the existing documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.

Signed-off-by: Antonio Cesarano [email protected]

kaplun · 2017-10-27T12:22:06Z

inspirehep/modules/workflows/actions/merge_approval.py

+        delayed = True
+        if obj.workflow.name == 'manual_merge':
+            # the manual merge wf should be sync
+            delayed = False


This is really sad. But I see why it's needed. We need to ensure that whatever happens afterwards in the workflow is trivial and can complete quickly. Otherwise the request will end up in a timeout.
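
One possible way to keep this special case in one place is a small lookup instead of an inline `if`. This is only a sketch: `SYNCHRONOUS_WORKFLOWS`, `should_delay` and `resume` are hypothetical names, and the two callables stand in for "run in the request" and "hand over to a background worker".

```python
# Assumption: manual_merge is the only workflow that must resume synchronously.
SYNCHRONOUS_WORKFLOWS = frozenset(['manual_merge'])


def should_delay(workflow_name):
    """Return True when resuming should be handed to a background worker."""
    return workflow_name not in SYNCHRONOUS_WORKFLOWS


def resume(workflow_name, run_now, enqueue):
    """Resume inline for synchronous workflows, otherwise enqueue the job."""
    if should_delay(workflow_name):
        return enqueue()
    return run_now()
```

Any workflow added to the synchronous set still has to finish within the request timeout, so the set should stay small.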

kaplun · 2017-10-27T12:22:34Z

inspirehep/modules/workflows/actions/merge_approval.py

@@ -32,4 +34,16 @@ class MergeApproval(object):
    @staticmethod
    def resolve(obj, *args, **kwargs):
        """Resolve the action taken in the approval action."""
-        pass
+
+        obj.extra_data["approved"] = True


So we are sure that it's always approved?

For now yes, but in the future we can of course add the option to reject the merge.

kaplun · 2017-10-27T12:23:32Z

inspirehep/modules/workflows/models.py

+            'postgresql',
+        ),
+        default=lambda: dict(),
+        nullable=True


Why nullable?

right, going to remove it 👍

@ammirate did you forget to remove it?

apparently yes!

kaplun · 2017-10-27T12:24:14Z

inspirehep/modules/workflows/tasks/manual_merging.py

+
+@with_debug_logging
+def merge_records(obj, eng):
+    """Merge the records which ids are defined in the `obj` parameter and store


which -> whose

kaplun · 2017-10-27T12:26:38Z

inspirehep/modules/workflows/tasks/manual_merging.py

+    new `head`.
+    """
+
+    head = get_db_record('lit', obj.extra_data['head_control_number'])


Do we need to hardcode lit? In the future you will need to use the same functionality for merging duplicate authors...

We should actually change it everywhere, e.g. in the inspire-json-merger API, to match. I don't think it will be hard to adapt when it is needed...
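
A minimal sketch of what lifting the hardcoded `'lit'` into a parameter could look like. The `(pid_type, recid)` call shape is taken from the diff above; the function name `load_records_to_merge`, the injected `get_db_record` argument and the `update_control_number` key are assumptions made so the sketch runs without a database.

```python
def load_records_to_merge(extra_data, get_db_record, pid_type='lit'):
    """Fetch head and update using a configurable pid type.

    ``get_db_record`` is passed in so this sketch needs no database;
    ``update_control_number`` is an assumed key mirroring the head's.
    """
    head = get_db_record(pid_type, extra_data['head_control_number'])
    update = get_db_record(pid_type, extra_data['update_control_number'])
    return head, update
```

Keeping `'lit'` as the default leaves existing callers unchanged, while a future author merge could pass a different pid type.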

kaplun · 2017-10-27T12:28:11Z

inspirehep/modules/workflows/tasks/manual_merging.py

+    head.clear()
+    head.update(obj.data)    # head's content will be replaced by merged
+    update.merge(head)       # update's uuid will point to head's uuid
+    update.delete()          # mark update record as deleted


~~Do you need to explicitly delete? Isn't this part of merge()?~~ it's correct.

Yeah, actually it should be

But it's not, right? As far as I can see it just handles pids from the pidstore.

Right ⬆️

kaplun · 2017-10-27T12:30:07Z

inspirehep/modules/workflows/tasks/manual_merging.py

+    # add schema contents to refer deleted record to the merged one
+    update['new_record'] = get_record_ref(
+        head['control_number'],
+        endpoint='record''record'


kaplun · 2017-10-27T12:36:02Z

@ammirate I found only a small typo-induced bug. Besides that, only small nits. Re: the synchronicity of resuming a workflow, I confirm that what you did would indeed be synchronous.

fixed

jacquerie · 2017-11-22T12:19:40Z

inspirehep/modules/editor/views.py

+def start_manual_merge():
+    """Initiate manual merge on two records."""
+    assert request.json['head_recid']
+    assert request.json['update_recid']


We need to do more validation than this: with the current code any user who can use the editor API can merge any two records, even if they can't read/edit them!

What is the current best practice for restricting a view to catalogers? This is definitely a very common use case.

I see the editor uses: @editor_use_api_permission.require(http_exception=403). As ugly as it is, can we re-use it?

There is also @editor_permission, which sounds like a good candidate.
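
Independently of which permission decorator is chosen, the two `assert` statements are weak input validation: assertions are skipped when Python runs with `-O`, and a failed one typically surfaces as a server error rather than a client error. A minimal sketch of explicit payload validation follows; the function name `parse_merge_request` and the rule that recids are positive integers are assumptions, not taken from the PR.

```python
def parse_merge_request(payload):
    """Return (head_recid, update_recid) or raise ValueError with a reason."""
    if not isinstance(payload, dict):
        raise ValueError('request body must be a JSON object')
    recids = []
    for key in ('head_recid', 'update_recid'):
        value = payload.get(key)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError('%s must be a positive integer' % key)
        recids.append(value)
    if recids[0] == recids[1]:
        raise ValueError('cannot merge a record with itself')
    return tuple(recids)
```

A view could catch the `ValueError` and answer with a 400, keeping the permission check as a separate concern.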

jacquerie · 2017-11-22T12:21:43Z

inspirehep/modules/workflows/tasks/manual_merging.py

+
+from invenio_db import db
+
+from inspire_json_merger.inspire_json_merger import inspire_json_merge


The API and module structure of inspire-json-merger need some love. This should really be

from inspire_json_merger import merge

~~opening an issue against inspire-json-merger~~ I already did this... inspirehep/inspire-json-merger#35.

I agree, it's in the TODO list 👍

jacquerie · 2017-11-22T12:31:30Z

inspirehep/modules/workflows/workflows/manual_merge.py

+    wf_id = workflow_object.id    # to retrieve it later
+    workflow_object.extra_data.update(data)
+
+    # preparing identifiers in order to do less requests possible later


jacquerie · 2017-11-22T12:40:58Z

inspirehep/utils/record.py

+                    return False
+                if is_complete(publication_info):
+                    return False
+    return had_at_least_one_journal_title


Wrong rebase! This died in 6c0db18 and should stay dead.

jacquerie · 2017-11-22T12:50:46Z

inspirehep/modules/workflows/tasks/merging.py

+        record_id=str(record_uuid),
+        source=source.lower()
+    ).one_or_none()
+    return entry


A function that does exactly one function call and returns smells of ravioli code. I see that you are using it two times here and several times in the tests: perhaps it's time to graduate it to an official util, since, in essence, this function is the workflow analogue of get_db_record.

I created workflow.py in utils, wdyt?

moved to workflows/utils.

jacquerie · 2017-11-22T12:52:46Z

inspirehep/modules/workflows/tasks/manual_merging.py

+        head['control_number'],
+        endpoint='record'
+    )
+    _add_deleted_records(head, update)


_add_deleted_records is used exactly once (here), and I don't understand the point: in the lines above you modify records directly in the body of this function, while here another function is called to do essentially the same kind of job.

jacquerie · 2017-11-22T12:54:07Z

inspirehep/modules/workflows/tasks/manual_merging.py

+def _get_head_and_update(obj):
+    head = obj.extra_data['head']
+    update = obj.extra_data['update']
+    return head, update


Similar observation as for _add_deleted_records. You can inline this function where it's used:

head, update = obj.extra_data['head'], obj.extra_data['update']

jacquerie · 2017-11-22T13:07:15Z

inspirehep/modules/workflows/tasks/merging.py

+    if arxiv_root:
+        return 'arxiv'
+
+    return None


Alternative algorithm: get all workflow record sources associated with a certain UUID, and iterate over them to see which sources are present. The current code is doing two queries where one would be enough.
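
A sketch of the single-query variant, with the database left out: `sources` stands in for the result of one query filtered by record UUID only. The name `pick_root_source` and the priority order in `SOURCE_PRIORITY` are assumptions; the visible part of the diff only shows that `'arxiv'` and `None` are possible return values.

```python
# Assumed preference order; the diff does not confirm it.
SOURCE_PRIORITY = ('publisher', 'arxiv')


def pick_root_source(sources):
    """Given every source stored for one record UUID, return the preferred one.

    ``sources`` is an iterable of source strings, as one query filtered by
    ``record_id`` alone would provide.
    """
    present = {source.lower() for source in sources}
    for candidate in SOURCE_PRIORITY:
        if candidate in present:
            return candidate
    return None
```

The database is hit once, and the choice between sources becomes an in-memory membership test.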

jacquerie · 2017-11-22T13:11:30Z

inspirehep/utils/record.py

+
+def get_source(record):
+    """Return the ``source`` of ``acquisition_source`` of a record."""
+    return get_value(record, 'acquisition_source.source')


Nit: let's keep these utils alphabetically sorted (this should be above get_title).

jacquerie · 2017-11-22T13:14:09Z

setup.py

@@ -60,6 +60,7 @@
    'inspire-crawler~=1.0',
    'inspire-dojson~=53.0,>=53.0.0',
    'inspire-matcher~=0.0,>=0.3.3',
+    'inspire-json-merger~=2.0,>=2.0.4',


Nit: let's keep these requirements alphabetically sorted (this should be above inspire-dojson).

you mean...below? d -> j
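
The placement can be checked mechanically. The sketch below sorts the four requirement strings from the diff by package name; `package_name` is a hypothetical helper written for this example, not something in `setup.py`.

```python
requirements = [
    'inspire-crawler~=1.0',
    'inspire-dojson~=53.0,>=53.0.0',
    'inspire-matcher~=0.0,>=0.3.3',
    'inspire-json-merger~=2.0,>=2.0.4',
]


def package_name(requirement):
    """Strip the version specifier, keeping only the distribution name."""
    for separator in ('~=', '>=', '==', '<=', '<', '>'):
        requirement = requirement.split(separator)[0]
    return requirement.strip()


ordered = sorted(requirements, key=package_name)
```

Sorted this way, `inspire-json-merger` lands after `inspire-dojson` and before `inspire-matcher`.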

jacquerie · 2017-11-22T13:37:39Z

tests/integration/workflows/test_manual_merge_workflow.py

+
+def test_manual_merge_existing_records(workflow_app):
+    # celery task, to import locally
+    from inspirehep.modules.migrator.tasks import record_insert_or_replace


Are you sure we need this internal import? If you're not executing this with delay it doesn't go to Celery... For example

inspire-next/tests/integration/refextract/test_refextract_tasks.py

Line 32 in d711dd7

from inspirehep.modules.migrator.tasks import record_insert_or_replace

jacquerie · 2017-11-26T02:38:56Z

This requires the general dependency bump done in #2992.

Signed-off-by: Zacharias Zacharodimos <[email protected]>

* Add new tasks for merging records
* Add utils to operate with the `workflow_record_source` table
* Add unit and integration tests for those functions

Signed-off-by: Antonio Cesarano <[email protected]>
Signed-off-by: Riccardo Candido <[email protected]>

Signed-off-by: Chris Aslanoglou <[email protected]>

jacquerie · 2017-11-29T14:18:06Z

Superseded by #3005.

kaplun · 2017-11-29T14:19:36Z

~~Damn. For a second I thought it had been merged 🤣~~

Wowow! It has been merged!! @ammirate we shall party!

ammirate added the Status: WIP label Oct 25, 2017

ammirate self-assigned this Oct 25, 2017

ammirate requested a review from david-caro October 25, 2017 13:58

ammirate added this to the Inspire Json Merger milestone Oct 25, 2017

kaplun reviewed Oct 27, 2017

kaplun previously requested changes Oct 27, 2017

ammirate force-pushed the manual_merger_workflow branch from 0532e5d to 483f808 Compare October 27, 2017 15:18

ammirate added Status: Ready for review and removed Status: WIP labels Oct 27, 2017

ammirate force-pushed the manual_merger_workflow branch from 483f808 to 0857635 Compare October 27, 2017 15:20

ammirate force-pushed the manual_merger_workflow branch from 0857635 to 029f574 Compare October 30, 2017 08:27

ghost assigned chris-asl Oct 30, 2017

ammirate force-pushed the manual_merger_workflow branch 11 times, most recently from 314596e to e53f848 Compare November 1, 2017 16:12

ammirate force-pushed the manual_merger_workflow branch 2 times, most recently from 8d1ae03 to b182e8c Compare November 22, 2017 12:18

jacquerie reviewed Nov 22, 2017

ammirate force-pushed the manual_merger_workflow branch 2 times, most recently from 0abc3b6 to da3787d Compare November 23, 2017 10:21

jacquerie mentioned this pull request Nov 26, 2017

hal: use inspire-utils to parse names #2992

Merged

8 tasks

jacquerie mentioned this pull request Nov 26, 2017

elements: urls.value should not be validated as URI inspirehep/inspire-schemas#280

Closed

ammirate force-pushed the manual_merger_workflow branch 3 times, most recently from fb89a88 to 5b07171 Compare November 28, 2017 08:03

zzacharo and others added 3 commits November 28, 2017 11:28

alembic: create workflow_record_resources table

bd1b2d2

Signed-off-by: Zacharias Zacharodimos <[email protected]>

editor: add manual_merge endpoint

081b6c8

Signed-off-by: Chris Aslanoglou <[email protected]>

ammirate force-pushed the manual_merger_workflow branch from 5b07171 to 081b6c8 Compare November 28, 2017 10:28

jacquerie mentioned this pull request Nov 28, 2017

workflows: add manual_merge workflow #3005

Merged

8 tasks

jacquerie closed this Nov 29, 2017

ghost removed the Status: Ready for review label Nov 29, 2017


