Versioning Dynamic Resources

(Quoting discussion with Gunn Inger and Dieter)

What should get a PID (and how often)

Example 1: A newspaper corpus has new newspaper material added to it every night. We make this corpus available in downloadable form by adding new material once a year, and the new version then gets a PID. The corpus has one metadata file, and we foresee that information about the new, annual version is expressed within this one metadata file. Sounds OK, or sounds like a «deprecated solution»?

Such a yearly snapshot with its own PID sounds reasonable to me. The question (without an easy answer, I think) is then how it relates to the metadata. One option would be to create a small hierarchy, with a generic metadata file for the corpus and then a separate child metadata file for each year, which then includes a link to the PID of the corpus itself:

[CMDI] Newspaper
|
+-- [CMDI] Newspaper 2014 -> [.tar.gz] hdl:123/v2014
+-- [CMDI] Newspaper 2015 -> [.tar.gz] hdl:123/v2015

The user can then choose to cite the generic corpus (root node) or the specific year (subnode). For processing (and replication purposes) the subnode could be cited.
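The hierarchy above could be modelled roughly as follows. This is only a sketch: the class names, the `cite` method and the citation format are made up for illustration, and the PIDs are the placeholder values from the diagram, not real handles.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SnapshotRecord:
    """Child metadata record for one yearly snapshot (hypothetical model)."""
    year: int
    data_pid: str  # PID of the archived .tar.gz for that year


@dataclass
class CorpusRecord:
    """Generic (root) metadata record for the corpus as a whole."""
    title: str
    snapshots: Dict[int, SnapshotRecord] = field(default_factory=dict)

    def add_snapshot(self, year: int, data_pid: str) -> None:
        # A snapshot is immutable once published, so refuse to overwrite it.
        if year in self.snapshots:
            raise ValueError(f"snapshot for {year} already exists")
        self.snapshots[year] = SnapshotRecord(year, data_pid)

    def cite(self, year: Optional[int] = None) -> str:
        # No year: cite the generic corpus (root node).
        if year is None:
            return self.title
        # With a year: cite the specific snapshot (subnode), e.g. for replication.
        snap = self.snapshots[year]
        return f"{self.title} {snap.year} ({snap.data_pid})"


corpus = CorpusRecord("Newspaper")
corpus.add_snapshot(2014, "hdl:123/v2014")
corpus.add_snapshot(2015, "hdl:123/v2015")
```

Calling `corpus.cite()` gives the generic reference, while `corpus.cite(2015)` pins the citation to one snapshot.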

We also think that the principle of snapshots (i.e. always the whole dataset, not just the new additions) is sensible. How to publish it technically depends on other things, but we choose to make separate metadata records for these objects (snapshots) for several reasons:

- Mostly because of the [persistent citation] principle, so that each snapshot is always clearly identified, not only technically but also from the user's perspective.
- We also find that we usually make other changes too, especially over such a long time frame, e.g. we analyse the data with a newer tagger. This again should show up in the "description" metadata field and possibly in metadata that facilitates linking between records.
- There is also a technical limitation in DSpace, which assigns PIDs only to records, not to bitstreams. This could be changed, but there is no reason to. Users who need to group all versions together can always create a collection. Collections do get their own PIDs, and their metadata can be cited.

Directly relevant to the topic of versioning a resource, see also the Figshare solution (look for "DOI Versioning"). Their solution has a major problem, though:

The semantics of the DOI (PID) change once a resource has more than one version. At first the PID points persistently to the same data. However, when a resource has several versions this is no longer true. The PID points to the latest version, which means that if somebody uses (or has used) this PID without the version suffix, it is not clear which version of the data they used and cited.
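To make the problem concrete, here is a toy resolver that behaves in the way described above. It is not Figshare's implementation; the class, the `.vN` suffix convention and the identifier `doi:10.0000/example` are all assumptions made for this sketch.

```python
from typing import Dict, List


class LatestPointingResolver:
    """Toy resolver: the bare PID always resolves to the newest version."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[str]] = {}  # base PID -> payloads, oldest first

    def publish(self, base_pid: str, payload: str) -> str:
        """Store a new version and return its versioned PID (base + '.vN')."""
        self._versions.setdefault(base_pid, []).append(payload)
        return f"{base_pid}.v{len(self._versions[base_pid])}"

    def resolve(self, pid: str) -> str:
        base, sep, suffix = pid.rpartition(".v")
        if sep and suffix.isdigit() and base in self._versions:
            # Versioned PID: always the same data.
            return self._versions[base][int(suffix) - 1]
        # Bare PID: whatever happens to be the latest version right now.
        return self._versions[pid][-1]


resolver = LatestPointingResolver()
cited = "doi:10.0000/example"  # hypothetical PID, cited without a version suffix

v1 = resolver.publish(cited, "data as of 2014")
seen_by_author = resolver.resolve(cited)   # what the citing author actually used

v2 = resolver.publish(cited, "data as of 2015")
seen_by_reader = resolver.resolve(cited)   # what a later reader gets for the same PID
```

After the second `publish`, `seen_by_author` and `seen_by_reader` differ although the cited string is identical, whereas `resolver.resolve(v1)` still returns the original data.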

### Our solution:

Whenever there is a new version of a resource that authors want to be cited, they make [a new record as a version of the older one][new-version]
- When no change of metadata is needed but a particular change/version still needs to be cited, using a public version control system (such as a repository at GitHub, Bitbucket, etc.) and citing a concrete revision is an option. However, we usually find that when a specific version needs to be cited, there is a reason for it. That reason belongs in a description, i.e. in metadata, i.e. in a new record.

Complex set of interrelated and continuously updated data

preservation
working copy, history
what to publish, when, how

Example 2: A C centre develops a grammar for Saami, with grammars, morphological analyzers, lexicons and corpora. They would like to deposit (a subset of) their resources at the CLARIN Bergen Centre, since they cannot themselves guarantee stable storage, PIDs etc. Their resources, locally, are integrated in their svn update system, and the resources are continuously updated, but the changes may be extremely minor. What is the best solution for this case with respect to metadata and PIDs? The Bergen Centre cannot download new versions from the C centre every day. Is it OK to view a corpus (or a lexicon) from this C centre as an «abstract» corpus which is described in one metadata file, with one resource PID, even though the actual user has to go to the C centre to get the most updated version? It does not seem to make much sense to create a downloadable snapshot at the Bergen Centre for such resources. (We can, but does it make sense? A user will probably prefer to have the most recent version, and the changes from yesterday’s version to today’s version may be so minor that we cannot create versions/update metadata every time.)

That's a complex case :) In principle there is nothing wrong with referring the user to the original host. And so I would absolutely include a link in the metadata file you host.

But depositing is more than just creating a metadata file, since (as you write) it is also about safeguarding a copy to ensure long-term availability. Whether updates should happen on a regular basis would depend on the amount of change over time.

You could also think about hosting a mirror of the SVN server (http://www.cardinalpath.com/how-to-use-svnsync-to-create-a-mirror-backup-of-your-subversion-repository/) at your side and referring to that from the CMDI file. Then the users always have full access to all versions and the data is still safely hosted at your centre.
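As a rough illustration of the mirror idea, the helper below only builds the `svnadmin`/`svnsync` command lines without running them. The function name, the source URL and the mirror path are hypothetical, and the sketch assumes a POSIX absolute path without characters that would need URL escaping.

```python
from typing import List


def mirror_commands(source_url: str, mirror_dir: str) -> List[List[str]]:
    """Return the commands for creating and filling a read-only SVN mirror.

    Nothing is executed here; pass each list to subprocess.run() if desired.
    """
    if not mirror_dir.startswith("/"):
        raise ValueError("mirror_dir must be an absolute POSIX path")
    mirror_url = "file://" + mirror_dir
    return [
        # 1. Create an empty repository that will hold the mirror.
        ["svnadmin", "create", mirror_dir],
        # Before step 2, the mirror needs a pre-revprop-change hook that
        # exits successfully, because svnsync stores its bookkeeping in
        # revision properties.
        # 2. Tie the empty mirror to the source repository.
        ["svnsync", "initialize", mirror_url, source_url],
        # 3. Copy all revisions; rerun this step (e.g. nightly) to catch up.
        ["svnsync", "synchronize", mirror_url],
    ]


commands = mirror_commands("https://svn.example.org/repo", "/srv/svn/mirror")
```

Only the last command needs to be repeated on a schedule, so the hosting centre can choose how often to pull without asking the C centre to create versions.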

Updating data that is publicly available in a service

Again what (when) to archive, give it a PID and make it citable

Example 3: In the INESS treebank infrastructure (clarin.uib.no/iness), the treebanks may change from time to time (they are sometimes reparsed). Thus far, one treebank has one PID, which points to the most recent version. We don’t have a clear solution yet for cases where researchers need a PID for a resource version with a specific date.

Thoughts?

Although it is good practice to keep a PID per version, it is also an option to indicate that the PID just points to the latest version of the resource, and that it can change. This is the "strict versioning" policy you can find in the centre registry: if that is false, a PID can point to newer versions over time. In that case this should be mentioned in the metadata (description or so) I would think.

This is especially tricky if a resource is only accessible via a web application. In that case, whether it makes sense to build versioning into the application depends on your own assessment.
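If versioning were built into the application, one possible shape is a dated history per treebank from which the version current on a given date can be looked up. This is only a sketch under that assumption; the function name, the history layout and the PIDs are invented and do not describe how INESS works.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Tuple


def version_at(history: List[Tuple[date, str]], when: date) -> str:
    """Return the PID of the version that was current on `when`.

    `history` holds (archive date, PID) pairs sorted by date, oldest first.
    """
    dates = [archived for archived, _ in history]
    # Index of the first entry archived strictly after `when`.
    i = bisect_right(dates, when)
    if i == 0:
        raise LookupError("no version was archived on or before that date")
    return history[i - 1][1]


# Hypothetical history of one treebank that was reparsed once.
history = [
    (date(2014, 3, 1), "hdl:0000/treebank@2014-03-01"),
    (date(2015, 6, 15), "hdl:0000/treebank@2015-06-15"),
]
```

A researcher who ran an experiment on 1 January 2015 would then cite `version_at(history, date(2015, 1, 1))`, i.e. the 2014 version, rather than the PID that follows the latest reparse.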
