-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issue with outdated tag set in document index (#1331) #1332
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indexing on taggit relationship post save does seem logical... I'm puzzled about the order you're reporting; could it be that the tags are cached somehow when we save the document?
I just checked, parasolr is listening to the post_save
signal on the object itself. I don't know what order that happens in, or why there would be a discrepancy between them.
Possibly this stack overflow is related? Seems to be discussing updating m2m on post save, but the clear behavior may also be relevant for what we're trying todo. https://stackoverflow.com/questions/1925383/issue-with-manytomany-relationships-not-updating-immediately-after-save
Yeah, I'm at a bit of a loss here! It does indeed seem to be referring to an old/cached copy of the relationship on |
Codecov Report
@@ Coverage Diff @@
## develop #1332 +/- ##
===========================================
+ Coverage 98.63% 98.75% +0.11%
===========================================
Files 187 197 +10
Lines 10341 11523 +1182
===========================================
+ Hits 10200 11379 +1179
- Misses 141 144 +3 |
@rlskoeser I have a working version of this now, but with one pitfall: because of the You can see an example of this in the document merge code changes. Is this acceptable? I couldn't figure out any other way for the tag set to be up-to-date when the indexing code is called. In all other contexts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this, m2m relationships and signals are tricky to think about and handle properly in terms of indexing. (Wondering if there's anything here we could adapt for common parasolr indexing behavior or at least document in the examples...)
one question for you about the refresh_from_db
call, otherwise I think this is good
geniza/corpus/models.py
Outdated
"tags": { | ||
"post_save": DocumentSignalHandlers.related_save, | ||
"pre_delete": DocumentSignalHandlers.related_delete, | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need these handlers, though, right? If someone renames a tag without changing any m2m relationships, we should reindex all the documents with that renamed tag so they reflect the update
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Added back.
@@ -485,6 +486,15 @@ def unidecode_tag(sender, instance, **kwargs): | |||
"""Convert saved tags to ascii, stripping diacritics.""" | |||
instance.name = unidecode(instance.name) | |||
|
|||
@staticmethod | |||
def tagged_item_change(sender, instance, action, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the instance here is the document with the m2m relation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Added comment to clarify.
geniza/corpus/models.py
Outdated
# add any tags from merge document tags to primary doc | ||
# NOTE: This must be done before any other changes to self because it will fire | ||
# m2m_changed signal which calls self.refresh_from_db() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this refresh from db likely to be problematic elsewhere? it seems like it could lead to unexpected behavior that might be hard to troubleshoot
in the signal handler, instead of refreshing, could we get a new instance from the db without modifying the local copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea, much better than refresh from db, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, this (or adding back the related_save
?) broke a unit test where a shelfmark in index data was relied upon, and the way to fix it was switching the order of these two lines in conftest, such that the tags.add
line happens first:
geniza/geniza/corpus/tests/conftest.py
Lines 60 to 61 in 565be05
doc.tags.add("bill of sale", "real estate") | |
TextBlock.objects.create(document=doc, fragment=fragment) |
Not sure what's going on here because it seems like index_data
should have been called after all of these lines anyway with the most recent copy of the document. Indeed, logging the results of index_data
actually showed that the shelfmark is present in the index data for all three records even before making this change. But I can't test it more thoroughly because my local Solr (on v9) is still returning 0 results for every query.
Also, add back indexing for related_save/delete on tags ref #1331
In this PR
Per #1331:
Questions
previous notes
After investigating #1331, it seems that when
index_data
is called after a document is saved, the updated tag set is not yet present inself.tags.all()
.My attempt at a workaround is to check specifically for the related change via a change in
TaggedItem
. This does trigger a reindex with the correct tag set when you change tags on a document. However, it seems thatindex_data
is called again for the document save after it's called for theTaggedItem
save, which means that it overwrites the index data to be the same as before (outdated tag set).Is there some way to ensure the
TaggedItem
related save indexing trigger always happens after the document save indexing trigger? Or is the approach wrong?(Not exactly the same issue, but also wondering if it's related to this: jazzband/django-taggit#818)