Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workflows: add manual merger #2904

Closed
wants to merge 3 commits into from

Conversation

ammirate
Copy link
Contributor

@ammirate ammirate commented Oct 25, 2017

Description

Add a new workflow to perform a manual merge between two records.
The workflow behaves in the following way:

  • read the records from the DB,
  • check their sources
  • merge them and halt, always
  • when resumed, saves the roots, add some metadata, and save the merged record.

Motivation and Context

Checklist:

  • I have all the information that I need (if not, move to RFC and look for it).
  • I linked the related issue(s) in the corresponding commit logs.
  • I wrote good commit log messages.
  • My code follows the code style of this project.
  • I've added any new docs if API/utils methods were added.
  • I have updated the existing documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

Signed-off-by: Antonio Cesarano [email protected]

@ammirate ammirate self-assigned this Oct 25, 2017
@ammirate ammirate added this to the Inspire Json Merger milestone Oct 25, 2017
delayed = True
if obj.workflow.name == 'manual_merge':
# the manual merge wf should be sync
delayed = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really sad. But I see why it's needed. We need to ensure that whatever happens afterwards in the workflow is peanut and can complete quickly. Otherwise the request will end-up in a Timeout.

@@ -32,4 +34,16 @@ class MergeApproval(object):
@staticmethod
def resolve(obj, *args, **kwargs):
"""Resolve the action taken in the approval action."""
pass

obj.extra_data["approved"] = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are sure that it's always approved?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now yes, but in the future, of course, we can have the chance to reject the merge.

'postgresql',
),
default=lambda: dict(),
nullable=True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why nullable?

Copy link
Contributor Author

@ammirate ammirate Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, going to remove it 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ammirate did you forget to remove it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apparently yes!


@with_debug_logging
def merge_records(obj, eng):
"""Merge the records which ids are defined in the `obj` parameter and store
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which -> whose

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doh!

new `head`.
"""

head = get_db_record('lit', obj.extra_data['head_control_number'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to hardcode lit? In the future you will need to use the same functionality for merging duplicate authors...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should change it everywhere actually, e.g in the inspire-json-merger API in according to that. I think it's not that hard to adapt it when it will be needed...

head.clear()
head.update(obj.data) # head's content will be replaced by merged
update.merge(head) # update's uuid will point to head's uuid
update.delete() # mark update record as deleted
Copy link
Contributor

@kaplun kaplun Oct 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to explicitly delete? Isn't this part of merge()? it's correct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, actually it should be

Copy link
Contributor

@david-caro david-caro Nov 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it's not, right? As far as I can see it just handles pids from the pidstore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right ⬆️

kaplun
kaplun previously requested changes Oct 27, 2017
# add schema contents to refer deleted record to the merged one
update['new_record'] = get_record_ref(
head['control_number'],
endpoint='record''record'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

@kaplun
Copy link
Contributor

kaplun commented Oct 27, 2017

@ammirate I found only a small typo-induced bug. Beside that only small nits. Re: the synchronousicityrily of resuming a workflow I confirm what you did would indeed be synchronous.

@ammirate ammirate force-pushed the manual_merger_workflow branch 2 times, most recently from 8d1ae03 to b182e8c Compare November 22, 2017 12:18
def start_manual_merge():
"""Initiate manual merge on two records."""
assert request.json['head_recid']
assert request.json['update_recid']
Copy link
Contributor

@jacquerie jacquerie Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to do more validation than this: with the current code any user who can use the editor API can merge any two records, even if they can't read/edit them!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the current best-practice to protect a view to catalogers? This is definitively a very common use case?

I see the editor uses: @editor_use_api_permission.require(http_exception=403). As ugly as it is, can we re-use it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also @editor_permission which sounds a good candidate.


from invenio_db import db

from inspire_json_merger.inspire_json_merger import inspire_json_merge
Copy link
Contributor

@jacquerie jacquerie Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The API and module structure of inspire-json-merger need some love. This should really be

from inspire_json_merger import merge

opening an issue against inspire-json-merger I already did this... inspirehep/inspire-json-merger#35.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it's in the TODO list 👍

wf_id = workflow_object.id # to retrieve it later
workflow_object.extra_data.update(data)

# preparing identifiers in order to do less requests possible later
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stannis-grammar

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

return False
if is_complete(publication_info):
return False
return had_at_least_one_journal_title
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong rebase! This died in 6c0db18 and should stay dead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doh!

record_id=str(record_uuid),
source=source.lower()
).one_or_none()
return entry
Copy link
Contributor

@jacquerie jacquerie Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A function that does exactly one function call and returns smells of ravioli code. I see that you are using it two times here and several times in the tests: perhaps it's time to graduate it to an official util, since, in essence, this function is the workflow analogue of get_db_record.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I created workflow.py in utils, wdyt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to workflows/utils.

head['control_number'],
endpoint='record'
)
_add_deleted_records(head, update)
Copy link
Contributor

@jacquerie jacquerie Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_add_deleted_records is used exactly once (here), and I don't understand the point: in the lines above you modify records directly in the body of this function, while here another function is called to do essentially the same kind of job.

def _get_head_and_update(obj):
head = obj.extra_data['head']
update = obj.extra_data['update']
return head, update
Copy link
Contributor

@jacquerie jacquerie Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar observation as for _add_deleted_records. You can inline this function where it's used:

head, update = obj.extra_data['head'], obj.extra_data['update']

if arxiv_root:
return 'arxiv'

return None
Copy link
Contributor

@jacquerie jacquerie Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative algorithm: get all workflow record sources associated with a certain UUID, and iterate over them to see which sources are present. The current code is doing two queries where one would be enough.


def get_source(record):
"""Return the ``source`` of ``acquisition_source`` of a record."""
return get_value(record, 'acquisition_source.source')
Copy link
Contributor

@jacquerie jacquerie Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: let's keep these utils alphabetically sorted (this should be above get_title).

setup.py Outdated
@@ -60,6 +60,7 @@
'inspire-crawler~=1.0',
'inspire-dojson~=53.0,>=53.0.0',
'inspire-matcher~=0.0,>=0.3.3',
'inspire-json-merger~=2.0,>=2.0.4',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: let's keep these requirements alphabetically sorted (this should be above inspire-dojson).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean...below? d -> j


def test_manual_merge_existing_records(workflow_app):
# celery task, to import locally
from inspirehep.modules.migrator.tasks import record_insert_or_replace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure we need this internal import? If you're not executing this with delay it doesn't go to Celery... For example

from inspirehep.modules.migrator.tasks import record_insert_or_replace

@ammirate ammirate force-pushed the manual_merger_workflow branch 2 times, most recently from 0abc3b6 to da3787d Compare November 23, 2017 10:21
@jacquerie
Copy link
Contributor

This requires the general dependencies bumping done in #2992.

zzacharo and others added 3 commits November 28, 2017 11:28
* Add new tasks for merging records
* Add utils to operate with the `workflow_record_source` table
* Add unit and integration tests for those functions

Signed-off-by: Antonio Cesarano <[email protected]>
Signed-off-by: Riccardo Candido <[email protected]>
Signed-off-by: Chris Aslanoglou <[email protected]>
@jacquerie
Copy link
Contributor

Superseded by #3005.

@kaplun
Copy link
Contributor

kaplun commented Nov 29, 2017

Cazzo. Per un secondo ho pensato che fosse stato mergiato 🤣

Wowow! It has been merged!! @ammirate we shall party!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants