Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with outdated tag set in document index (#1331) #1332

Merged
merged 8 commits into from
Jul 21, 2023
5 changes: 4 additions & 1 deletion geniza/corpus/apps.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
from django.apps import AppConfig
from django.db.models.signals import pre_save
from django.db.models.signals import m2m_changed, pre_save


class CorpusAppConfig(AppConfig):
Expand All @@ -12,4 +12,7 @@ def ready(self):
from geniza.corpus.models import TagSignalHandlers

pre_save.connect(TagSignalHandlers.unidecode_tag, sender="taggit.Tag")
m2m_changed.connect(
TagSignalHandlers.tagged_item_change, sender="taggit.TaggedItem"
)
return super().ready()
26 changes: 17 additions & 9 deletions geniza/corpus/models.py
Original file line number Diff line number Diff line change
Expand Up @@ -428,6 +428,7 @@ class DocumentSignalHandlers:
"fragment": "fragments",
"tag": "tags",
"document type": "doctype",
"tagged item": "tagged_items",
"Related Fragment": "textblock", # textblock verbose name
"footnote": "footnotes",
"source": "footnotes__source",
Expand Down Expand Up @@ -485,6 +486,15 @@ def unidecode_tag(sender, instance, **kwargs):
"""Convert saved tags to ascii, stripping diacritics."""
instance.name = unidecode(instance.name)

@staticmethod
def tagged_item_change(sender, instance, action, **kwargs):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the instance here is the document with the m2m relation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! Added comment to clarify.

"""Ensure document is indexed after the tags m2m relationship is saved and the list
of tags is refreshed from the database, on any tag change."""
if action in ["post_add", "post_remove", "post_clear"]:
logger.debug("taggit.TaggedItem %s, reindexing related document", action)
instance.refresh_from_db()
ModelIndexable.index_items([instance])


class DocumentQuerySet(MultilingualQuerySet):
def metadata_prefetch(self):
Expand Down Expand Up @@ -1219,10 +1229,6 @@ def index_data(self):
"post_save": DocumentSignalHandlers.related_save,
"pre_delete": DocumentSignalHandlers.related_delete,
},
"tags": {
"post_save": DocumentSignalHandlers.related_save,
"pre_delete": DocumentSignalHandlers.related_delete,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need these handlers, though, right? If someone renames a tag without changing any m2m relationships, we should reindex all the documents with that renamed tag so they reflect the update

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Added back.

"doctype": {
"post_save": DocumentSignalHandlers.related_save,
"pre_delete": DocumentSignalHandlers.related_delete,
Expand Down Expand Up @@ -1262,9 +1268,6 @@ def merge_with(self, merge_docs, rationale, user=None):
metadata into this document, adds the merged documents into
list of old PGP IDs, and creates a log entry documenting
the merge, including the rationale."""
# initialize old pgpid list if previously unset
if self.old_pgpids is None:
self.old_pgpids = []

# if user is not specified, log entry will be associated with
# script and document will be flagged for review
Expand All @@ -1287,10 +1290,15 @@ def merge_with(self, merge_docs, rationale, user=None):
needs_review = [self.needs_review] if self.needs_review else []

for doc in merge_docs:
# add merge id to old pgpid list
self.old_pgpids.append(doc.id)
# add any tags from merge document tags to primary doc
# NOTE: This must be done before any other changes to self because it will fire
# m2m_changed signal which calls self.refresh_from_db()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this refresh from db likely to be problematic elsewhere? it seems like it could lead to unexpected behavior that might be hard to troubleshoot

in the signal handler, instead of refreshing, could we get a new instance from the db without modifying the local copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea, much better than refresh from db, thanks!

Copy link
Contributor Author

@blms blms Jul 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, this (or adding back the related_save?) broke a unit test where a shelfmark in index data was relied upon, and the way to fix it was switching the order of these two lines in conftest, such that the tags.add line happens first:

doc.tags.add("bill of sale", "real estate")
TextBlock.objects.create(document=doc, fragment=fragment)

Not sure what's going on here because it seems like index_data should have been called after all of these lines anyway with the most recent copy of the document. Indeed, logging the results of index_data actually showed that the shelfmark is present in the index data for all three records even before making this change. But I can't test it more thoroughly because my local Solr (on v9) is still returning 0 results for every query.

self.tags.add(*doc.tags.names())
# initialize old pgpid list if previously unset
if self.old_pgpids is None:
self.old_pgpids = []
# add merge id to old pgpid list
self.old_pgpids.append(doc.id)
# add description if set and not duplicated
# for all supported languages
for lang_code in language_codes:
Expand Down
16 changes: 16 additions & 0 deletions geniza/corpus/tests/test_corpus_signals.py
Original file line number Diff line number Diff line change
Expand Up @@ -103,3 +103,19 @@ def test_unidecode_tags():
# pre_save signal should strip diacritics from tag and convert to ASCII
tag = Tag.objects.create(name="mu'ālim", slug="mualim")
assert tag.name == "mu'alim"


@pytest.mark.django_db
@patch.object(ModelIndexable, "index_items")
def test_tagged_item_change(mock_indexitems, document):
tag_count = document.tags.count()
tag = Tag.objects.create(name="mu'ālim", slug="mualim")
tag2 = Tag.objects.create(name="tag2", slug="tag2")
# should reindex document with the updated set of tags on save
document.tags.add(tag)
document.tags.add(tag2)
document.save()
# should be called at least once for the document post-save & once for the tags M2M change
assert mock_indexitems.call_count >= 2
# most recent call should have the full updated set of tags
assert mock_indexitems.call_args.args[0][0].tags.count() == tag_count + 2