DESY FTP #73
Sorry - misunderstanding. For Elsevier, World Scientific, APS and PoS the publisher data are currently harvested at CERN. Springer serves their data on their FTP server (ftp.springer-dds.com), so there is no need to copy it to DESY once the harvesting is done at CERN. PTEP and Acta Physica Polonica B send emails with attachments. Other emails are only alerts that trigger a web-crawl program. |
I think the easiest thing would be for you to indeed store those attachments in a shared space such as the mentioned DESY FTP server. For the triggers... Mmh... So, hepcrawl does have an interface to trigger a crawl; @david-caro might provide more information about it. Basically you could then send an HTTP POST request to hepcrawl to trigger the harvesting of the corresponding journal. |
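To make the trigger idea concrete, here is a minimal sketch in Python, assuming hepcrawl is exposed through a scrapyd-style service; the host, project name, spider name and the extra argument are hypothetical placeholders, not the actual hepcrawl configuration:

```python
import requests

# Hypothetical scrapyd-style endpoint; host/port, project, spider and
# argument names are assumptions for illustration only.
response = requests.post(
    "http://hepcrawl.example.org:6800/schedule.json",
    data={
        "project": "hepcrawl",                   # assumed project name
        "spider": "desy",                        # assumed spider for this feed
        "source_folder": "/ftp/desy/springer",   # hypothetical spider argument
    },
)
response.raise_for_status()
print(response.json())  # scrapyd-style services answer with a job id on success
```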
Last week we agreed to create a simple interface to allow hepcrawl to harvest MARCXML records from DESY. That way we are not hurried by the legacy shutdown into implementing any DESY-side flows, and that work can be done calmly and bit by bit. So in order to bootstrap that conversation, I propose to add a folder on the DESY FTP with the records to harvest, and hepcrawl will pick them up periodically. The records should be separated into subfolders by source, so hepcrawl knows where they originally come from (Springer, Elsevier...). What do you think? |
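For illustration, the proposed FTP layout could look roughly like this (directory and file names are made up):

```
desy-ftp/
├── springer/
│   └── 2017-05-22_0001.xml
├── elsevier/
├── aps/
├── worldscientific/
└── ...          one subfolder per original source
```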
Creating a subfolder on the DESY FTP server where CERN can pick up MARCXML to feed hepcrawl is a very good idea. But why does hepcrawl need to know where they came from? It is converted INSPIRE MARCXML. Instead of 50 different subfolders it might be easier to add that info to the metadata if necessary. E.g. for the abstract we add the source (= publisher) to 520__9 anyhow. |
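For reference, such an abstract field with the publisher recorded in subfield 9 looks roughly like this in MARCXML (values are illustrative):

```xml
<datafield tag="520" ind1=" " ind2=" ">
  <subfield code="9">Springer</subfield>
  <subfield code="a">Abstract text as delivered by the publisher...</subfield>
</datafield>
```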
It needs the source of the crawl for various reasons:
* In order to display it properly in the holding pen, sort/search/facets/...
* So we can properly match it with the last update from that source (not yet there, but will be needed).
* Tracking purposes: a record coming from a spider that crawls publisher A directly is not the same as one coming from DESY, even if both originally came from the same publisher A.
But yes, having it in the metadata somehow might be enough; I just proposed the directory structure for easy organization and implementation. 50 dirs is not that many, and it easily allows seeing whether any provider source is empty or not being crawled properly, whereas adding it to the metadata means having to check the contents of the files every time you want to know something similar.
The key point being, we need a stable and reliable way of knowing the origin of the record.
|
The origin of the record is 'DESY'
This workflow via DESY can be a short-term solution for the bigger publishers. Only for the small and infrequent publishers will we need it for a longer period. There it doesn't help to know that a folder is still empty, since that might be correct. Florian and I would suggest leaving the responsibility for whether the harvest/conversion went fine with DESY, and just processing what is in the metadata. |
Ideally it would be great to have the real source (i.e. the name of the publisher), so that later, when a crawler is ported from DESY to INSPIRE, it is possible to compare apples with apples. |
@david-caro I think this should indeed be good enough for hepcrawl to infer the source. After all, the source doesn't need to be associated with one and only one hepcrawl spider. |
Then how do we differentiate the DESY ones from the non-DESY ones? |
@kaplun: don't mix source (the way to harvest) and publisher (metadata). @david-caro: wrt. source, you don't have that info for 1M records in INSPIRE. |
@ksachs in inspire-schema we call source the origin of truth, i.e. the publisher. How things reach us is of lesser importance and it goes into
Sure, but anyway we should start somewhere, and updates from publishers will most often be about papers that reached us within the last year as preprints. So if we start to have clear data from now onwards, we are going to be in regime within one year (i.e. much less pain for catalogers from unresolved conflicts caused by missing/untraceable history). |
Maybe we are not talking about the same thing. A video meeting might be helpful. |
Is there a show-stopper if you just convert the MARC to JSON as for existing INSPIRE records + acquisition_source = DESY? |
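As a rough sketch of that suggestion: keep whatever dojson produces from the MARCXML and stamp it with DESY as the acquisition source. The exact field names inside acquisition_source below are assumptions based on this thread, not a verified copy of the inspire-schema:

```python
import json

# Illustrative dojson output for one record (the real payload would carry
# the full converted metadata).
record = {
    "titles": [{"title": "Example title from the DESY feed"}],
}

# Mark DESY as the acquisition source; field names are assumed, not verified.
record["acquisition_source"] = {
    "source": "desy",
    "method": "hepcrawl",
    "datetime": "2017-07-03T14:12:00",
}

print(json.dumps(record, indent=2))
```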
Ok, so in the end, the
And the data of the record will be exactly whatever is passed from DESY (the output of dojson on the XML). Does anyone disagree? |
And, the topic of the issue: the FTP will just be a folder with individual XML files, one per record. They will be removed upon ingestion (I recommend moving them to a temporary dir that gets cleaned up periodically, though that should probably be done on the server side if you want it, just in case we want to rerun anything). |
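A minimal sketch of that housekeeping, assuming it runs on the server side; paths and retention policy are hypothetical:

```python
import shutil
from pathlib import Path

INCOMING = Path("/ftp/desy/springer")           # hypothetical incoming folder
INGESTED = Path("/ftp/desy/springer.ingested")  # hypothetical holding area

def archive_ingested(filename: str) -> None:
    """Move an already-harvested XML file out of the incoming folder
    instead of deleting it, so a harvest can be rerun if needed."""
    INGESTED.mkdir(parents=True, exist_ok=True)
    shutil.move(str(INCOMING / filename), str(INGESTED / filename))

# A periodic job (e.g. cron) would then purge INGESTED after some weeks.
```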
I am not sure one XML file per record is the easiest on the DESY side. What about the possibility of grouping multiple records in one MARCXML file? (Normally multiple MARCXML records are grouped into a <collection> element.) |
Right, it would be easier if we could pass on collections of records in a file. |
Hmm, then in order to parse them we would have to iterate over each record in every file...
That might be messy on the scrapy side.
|
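For what it's worth, splitting a MARCXML <collection> into individual records on the harvesting side is only a few lines. A minimal sketch with lxml; the namespace is the standard MARC21 slim one, while the file path is made up:

```python
from lxml import etree

MARC_NS = "http://www.loc.gov/MARC21/slim"

def iter_records(path):
    """Yield each <record> of a MARCXML <collection> as a standalone XML string."""
    tree = etree.parse(path)
    for record in tree.findall(".//{%s}record" % MARC_NS):
        yield etree.tostring(record, encoding="unicode")

for snippet in iter_records("/ftp/desy/springer/2017-07-03.xml"):
    print(snippet[:80])
```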
If needed, we can also split the XML on the DESY side - no problem. |
The publishers from which we get fulltexts will run via HEPCrawl. For all these smaller publishers, for which we need the DESYmarcxmlSpider, the only fulltexts are OA, for which the XML would contain a weblink. |
There will be an overlap period during which some big publishers will still run on DESY (Springer, for example), so we should support those too, right? |
During the INSPIRE Week it has been agreed that DESY would make available through FTP the different feeds that are then loaded into INSPIRE.
I'd propose that the FTP is divided into one directory per feed.
@ksachs @fschwenn can you detail which feeds you would actually put there? I guess a spider per feed will need to be written, correct?