-
Notifications
You must be signed in to change notification settings - Fork 69
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
workflows: add manual merger #2904
Conversation
delayed = True | ||
if obj.workflow.name == 'manual_merge': | ||
# the manual merge wf should be sync | ||
delayed = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really sad. But I see why it's needed. We need to ensure that whatever happens afterwards in the workflow is peanut and can complete quickly. Otherwise the request will end-up in a Timeout.
@@ -32,4 +34,16 @@ class MergeApproval(object): | |||
@staticmethod | |||
def resolve(obj, *args, **kwargs): | |||
"""Resolve the action taken in the approval action.""" | |||
pass | |||
|
|||
obj.extra_data["approved"] = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are sure that it's always approved?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now yes, but in the future, of course, we can have the chance to reject the merge.
'postgresql', | ||
), | ||
default=lambda: dict(), | ||
nullable=True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why nullable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, going to remove it 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ammirate did you forget to remove it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apparently yes!
|
||
@with_debug_logging | ||
def merge_records(obj, eng): | ||
"""Merge the records which ids are defined in the `obj` parameter and store |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which -> whose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doh!
new `head`. | ||
""" | ||
|
||
head = get_db_record('lit', obj.extra_data['head_control_number']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to hardcode lit
? In the future you will need to use the same functionality for merging duplicate authors...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should change it everywhere actually, e.g in the inspire-json-merger
API in according to that. I think it's not that hard to adapt it when it will be needed...
head.clear() | ||
head.update(obj.data) # head's content will be replaced by merged | ||
update.merge(head) # update's uuid will point to head's uuid | ||
update.delete() # mark update record as deleted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you need to explicitly delete? Isn't this part of it's correct.merge()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, actually it should be
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it's not, right? As far as I can see it just handles pids from the pidstore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right ⬆️
# add schema contents to refer deleted record to the merged one | ||
update['new_record'] = get_record_ref( | ||
head['control_number'], | ||
endpoint='record''record' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
@ammirate I found only a small typo-induced bug. Beside that only small nits. Re: the synchronousicityrily of resuming a workflow I confirm what you did would indeed be synchronous. |
0532e5d
to
483f808
Compare
483f808
to
0857635
Compare
0857635
to
029f574
Compare
314596e
to
e53f848
Compare
8d1ae03
to
b182e8c
Compare
def start_manual_merge(): | ||
"""Initiate manual merge on two records.""" | ||
assert request.json['head_recid'] | ||
assert request.json['update_recid'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to do more validation than this: with the current code any user who can use the editor API can merge any two records, even if they can't read/edit them!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the current best-practice to protect a view to catalogers? This is definitively a very common use case?
I see the editor uses: @editor_use_api_permission.require(http_exception=403)
. As ugly as it is, can we re-use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also @editor_permission
which sounds a good candidate.
|
||
from invenio_db import db | ||
|
||
from inspire_json_merger.inspire_json_merger import inspire_json_merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API and module structure of inspire-json-merger
need some love. This should really be
from inspire_json_merger import merge
opening an issue against I already did this... inspirehep/inspire-json-merger#35.inspire-json-merger
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it's in the TODO list 👍
wf_id = workflow_object.id # to retrieve it later | ||
workflow_object.extra_data.update(data) | ||
|
||
# preparing identifiers in order to do less requests possible later |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
inspirehep/utils/record.py
Outdated
return False | ||
if is_complete(publication_info): | ||
return False | ||
return had_at_least_one_journal_title |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong rebase! This died in 6c0db18 and should stay dead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doh!
record_id=str(record_uuid), | ||
source=source.lower() | ||
).one_or_none() | ||
return entry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A function that does exactly one function call and returns smells of ravioli code. I see that you are using it two times here and several times in the tests: perhaps it's time to graduate it to an official util, since, in essence, this function is the workflow analogue of get_db_record
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created workflow.py
in utils, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved to workflows/utils
.
head['control_number'], | ||
endpoint='record' | ||
) | ||
_add_deleted_records(head, update) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_add_deleted_records
is used exactly once (here), and I don't understand the point: in the lines above you modify records directly in the body of this function, while here another function is called to do essentially the same kind of job.
def _get_head_and_update(obj): | ||
head = obj.extra_data['head'] | ||
update = obj.extra_data['update'] | ||
return head, update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar observation as for _add_deleted_records
. You can inline this function where it's used:
head, update = obj.extra_data['head'], obj.extra_data['update']
if arxiv_root: | ||
return 'arxiv' | ||
|
||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternative algorithm: get all workflow record sources associated with a certain UUID, and iterate over them to see which sources are present. The current code is doing two queries where one would be enough.
inspirehep/utils/record.py
Outdated
|
||
def get_source(record): | ||
"""Return the ``source`` of ``acquisition_source`` of a record.""" | ||
return get_value(record, 'acquisition_source.source') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: let's keep these utils alphabetically sorted (this should be above get_title
).
setup.py
Outdated
@@ -60,6 +60,7 @@ | |||
'inspire-crawler~=1.0', | |||
'inspire-dojson~=53.0,>=53.0.0', | |||
'inspire-matcher~=0.0,>=0.3.3', | |||
'inspire-json-merger~=2.0,>=2.0.4', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: let's keep these requirements alphabetically sorted (this should be above inspire-dojson
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean...below? d
-> j
|
||
def test_manual_merge_existing_records(workflow_app): | ||
# celery task, to import locally | ||
from inspirehep.modules.migrator.tasks import record_insert_or_replace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure we need this internal import? If you're not executing this with delay
it doesn't go to Celery... For example
from inspirehep.modules.migrator.tasks import record_insert_or_replace |
0abc3b6
to
da3787d
Compare
This requires the general dependencies bumping done in #2992. |
fb89a88
to
5b07171
Compare
Signed-off-by: Zacharias Zacharodimos <[email protected]>
* Add new tasks for merging records * Add utils to operate with the `workflow_record_source` table * Add unit and integration tests for those functions Signed-off-by: Antonio Cesarano <[email protected]> Signed-off-by: Riccardo Candido <[email protected]>
Signed-off-by: Chris Aslanoglou <[email protected]>
5b07171
to
081b6c8
Compare
Superseded by #3005. |
Wowow! It has been merged!! @ammirate we shall party! |
Description
Add a new workflow to perform a manual merge between two records.
The workflow behaves in the following way:
Motivation and Context
Checklist:
RFC
and look for it).Signed-off-by: Antonio Cesarano [email protected]