DESY FTP #73
Sorry - misunderstanding. For Elsevier, World Scientific, APS and PoS the publisher data are currently harvested at CERN. Springer serves their data on their FTP server (ftp.springer-dds.com), so there is no need to copy it to DESY once the harvesting is done at CERN. PTEP and Acta Physica Polonica B send emails with attachments. Other emails are only alerts that trigger a web-crawl program. |
I think the easiest thing would be for you to indeed store those attachments in a shared space such as the mentioned DESY FTP server. For the triggers... Mmh... So, hepcrawl does have an interface to trigger a crawl; @david-caro might provide more information about it. Basically you could then send an HTTP POST request to hepcrawl to trigger the harvesting of the corresponding journal. |
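To make the trigger idea concrete, here is a minimal sketch in Python, assuming hepcrawl is exposed through a scrapyd-style service; the host, project name, spider name and the extra argument are hypothetical placeholders, not the actual hepcrawl configuration:

```python
import requests

# Hypothetical scrapyd-style endpoint; host/port, project, spider and
# argument names are assumptions for illustration only.
response = requests.post(
    "http://hepcrawl.example.org:6800/schedule.json",
    data={
        "project": "hepcrawl",                   # assumed project name
        "spider": "desy",                        # assumed spider for this feed
        "source_folder": "/ftp/desy/springer",   # hypothetical spider argument
    },
)
response.raise_for_status()
print(response.json())  # scrapyd-style services answer with a job id on success
```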
Last week we agreed to create a simple interface to allow hepcrawl to harvest MARCXML records from DESY. That way we are not hurried by the legacy shutdown into implementing any DESY-side flows, and that work can be done calmly and bit by bit. So in order to bootstrap that conversation, I propose to add a folder on the DESY FTP with the records to harvest, and hepcrawl will pick them up periodically. The records should be separated into subfolders by source, so hepcrawl knows where they originally come from (Springer, Elsevier...). What do you think? |
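For illustration, the proposed FTP layout could look roughly like this (directory and file names are made up):

```
desy-ftp/
├── springer/
│   └── 2017-05-22_0001.xml
├── elsevier/
├── aps/
├── worldscientific/
└── ...          one subfolder per original source
```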
Creating a subfolder on the DESY FTP server where CERN can pick up MARCXML to feed hepcrawl is a very good idea. But why does hepcrawl need to know where they came from? It is converted INSPIRE MARCXML. Instead of 50 different subfolders it might be easier to add that info to the metadata if necessary. E.g. for the abstract we add the source (= publisher) to 520__9 anyhow. |
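For reference, such an abstract field with the publisher recorded in subfield 9 looks roughly like this in MARCXML (values are illustrative):

```xml
<datafield tag="520" ind1=" " ind2=" ">
  <subfield code="9">Springer</subfield>
  <subfield code="a">Abstract text as delivered by the publisher...</subfield>
</datafield>
```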
It needs the source of the crawl for various reasons:
* In order to display it properly in the holding pen, sort/search/facets/...
* So we can properly match it with the last update from that source (not yet there, but will be needed).
* Tracking purposes: a record coming from a spider that crawls publisher A directly is not the same as one coming from DESY, even if both originally came from the same publisher A.
But yes, having it in the metadata somehow might be enough; I just proposed the directory structure for easy organization and implementation. 50 dirs is not that many, and it easily allows seeing whether any provider source is empty or not being crawled properly, whereas adding it to the metadata means having to check the contents of the files every time you want to know something similar.
The key point being, we need a stable and reliable way of knowing the origin of the record.
|
The origin of the record is 'DESY'
This workflow via DESY can be a short-term solution for the bigger publishers. Only for the small and infrequent publishers will we need it for a longer period. There it doesn't help to know that a folder is still empty, since that might be correct. Florian and I would suggest leaving the responsibility for whether the harvest/conversion went fine with DESY, and just processing what is in the metadata. |
Ideally it would be great to have the real source (i.e. the name of the publisher), so that later, when a crawler is ported from DESY to INSPIRE, it is possible to compare apples with apples. |
@david-caro I think this should indeed be good enough for hepcrawl to infer the source. After all, the source doesn't need to be associated with one and only one hepcrawl spider. |
Then how do we differentiate the DESY ones from the non-DESY ones? |
@kaplun: don't mix source (the way to harvest) and publisher (metadata). @david-caro: wrt. source, you don't have that info for 1M records in INSPIRE. |
@ksachs in inspire-schema we call source the origin of truth, i.e. the publisher. How things reach us is of lesser importance and it goes into
Sure, but anyway we should start somewhere, and updates from publishers will most often be about papers that reached us within the last year as preprints. So if we start to have clear data from now onwards, we are going to be in regime within one year (i.e. much less pain for catalogers from unresolved conflicts caused by missing/untraceable history). |
Maybe we are not talking about the same thing. A video meeting might be helpful. |
Is there a show-stopper if you just convert the MARC to JSON as for existing INSPIRE records + acquisition_source = DESY? |
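As a rough sketch of that suggestion: keep whatever dojson produces from the MARCXML and stamp it with DESY as the acquisition source. The exact field names inside acquisition_source below are assumptions based on this thread, not a verified copy of the inspire-schema:

```python
import json

# Illustrative dojson output for one record (the real payload would carry
# the full converted metadata).
record = {
    "titles": [{"title": "Example title from the DESY feed"}],
}

# Mark DESY as the acquisition source; field names are assumed, not verified.
record["acquisition_source"] = {
    "source": "desy",
    "method": "hepcrawl",
    "datetime": "2017-07-03T14:12:00",
}

print(json.dumps(record, indent=2))
```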
Ok, so in the end, the
And the data of the record will be exactly whatever is passed from DESY (the output of dojson on the XML). Does anyone disagree? |
And, the topic of the issue: the FTP will just be a folder with individual XML files, one per record. They will be removed upon ingestion (I recommend moving them to a temporary dir that gets cleaned up periodically, though that should probably be done on the server side if you want it, just in case we want to rerun anything). |
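A minimal sketch of that housekeeping, assuming it runs on the server side; paths and retention policy are hypothetical:

```python
import shutil
from pathlib import Path

INCOMING = Path("/ftp/desy/springer")           # hypothetical incoming folder
INGESTED = Path("/ftp/desy/springer.ingested")  # hypothetical holding area

def archive_ingested(filename: str) -> None:
    """Move an already-harvested XML file out of the incoming folder
    instead of deleting it, so a harvest can be rerun if needed."""
    INGESTED.mkdir(parents=True, exist_ok=True)
    shutil.move(str(INCOMING / filename), str(INGESTED / filename))

# A periodic job (e.g. cron) would then purge INGESTED after some weeks.
```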
I am not sure one XML file per record is the easiest on the DESY side. What about the possibility of grouping multiple records in one MARCXML file? (Normally multiple MARCXML records are grouped into a <collection> element.) |
Right, it would be easier if we could pass on collections of records in a file. |
Hmm, then in order to parse them we would have to iterate over each record in every file...
That might be messy on the scrapy side.
|
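For what it's worth, splitting a MARCXML <collection> into individual records on the harvesting side is only a few lines. A minimal sketch with lxml; the namespace is the standard MARC21 slim one, while the file path is made up:

```python
from lxml import etree

MARC_NS = "http://www.loc.gov/MARC21/slim"

def iter_records(path):
    """Yield each <record> of a MARCXML <collection> as a standalone XML string."""
    tree = etree.parse(path)
    for record in tree.findall(".//{%s}record" % MARC_NS):
        yield etree.tostring(record, encoding="unicode")

for snippet in iter_records("/ftp/desy/springer/2017-07-03.xml"):
    print(snippet[:80])
```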
If needed, we can also split the XML on the DESY side - no problem. |
The publishers from which we get fulltexts will run via HEPCrawl. For all these smaller publishers, for which we need the DESYmarcxmlSpider, the only fulltexts are OA, for which the XML would contain a weblink. |
There will be an overlap period during which some big publishers will still run on DESY (Springer, for example), so we should support those too, right? |
During the INSPIRE Week it has been agreed that DESY would make available through FTP the different feeds that are then loaded into INSPIRE.
I'd propose that the FTP is divided into one directory per feed.
@ksachs @fschwenn can you detail which feeds you would actually put there? I guess a spider per feed will need to be written, correct?