The Crossref Text and Data Mining API is designed to allow researchers to easily harvest full text documents from all participating publishers regardless of their business model (e.g. open access, subscription). The publisher remains responsible for actually delivering the full text of the content requested. Thus, open access publishers can simply deliver the requested content, while subscription-based publishers continue to support subscriptions using their existing access control systems.
Here is a worked example of how to use CrossRef’s metadata services to perform text and data mining. First you should have:
- A list of DOIs that you want to download
- A white-list of licenses that you accept
The way you create these lists is up to you. You might want to start with the list of DOIs, get the licenses and decide which ones you want to agree to. You may get the list from your institution. It is you, the Researcher, who will decide what to do with each license. This is essentially the same mechanism that is widely used when auditing open source software projects for license compliance.
For each DOI you should:
- Use content negotiation to get the metadata for the DOI.
- Check to see if there’s license and full text metadata.
- Check the license against your whitelist.
- If you agree to the license, follow the link and download the full text of the article.
The absence of a license does not mean that the full text can be used without one. Publishers should deposit both the license and the full-text link at the same time.
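The per-DOI steps above can be sketched as a small decision function. This is only an illustration, not part of the Crossref API: the `record` dictionary, its `licenses` and `fulltext_links` keys, and the name `plan_download` are hypothetical, and the sketch conservatively assumes that every license listed for a DOI must be on your white-list.

```python
def plan_download(record, accepted_licenses):
    """Decide what to do with the metadata of one DOI.

    `record` is a hypothetical dict holding two lists: 'licenses'
    (license URIs) and 'fulltext_links' (full text URLs).
    Returns an (action, links) tuple.
    """
    licenses = record.get("licenses", [])
    links = record.get("fulltext_links", [])
    if not licenses or not links:
        # A missing license does not mean the text is free to use: skip it.
        return ("skip", [])
    if all(uri in accepted_licenses for uri in licenses):
        return ("download", links)
    # At least one license has not been agreed to yet: set it aside.
    return ("review", [])
```

A record that comes back as "review" can be set aside until you have read its license and decided whether to add it to your white-list.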
Step by step
The examples below use the curl utility. The same requests should be straightforward to integrate into your own text and data mining software.
1 – Fetch the Metadata
In the simplest case, a researcher can issue an HTTP GET request for a CrossRef DOI and use DOI content negotiation. For example, the following curl command retrieves the metadata for the DOI 10.5555/515151:
curl -L -iH "Accept: application/vnd.crossref.unixsd+xml" http://dx.doi.org/10.5555/515151
This will return the metadata for the specified DOI as well as a link header which points to several representations of the full text on the publisher’s site:
HTTP/1.1 200 OK
Date: Wed, 31 Jul 2013 11:24:14 GMT
Server: Apache/2.2.3 (CentOS)
Link: <http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf>; rel="http://id.crossref.org/schema/fulltext"; type="application/pdf", <http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml>; rel="http://id.crossref.org/schema/fulltext"; type="application/xml"
Vary: Accept
Content-Length: 2189
Status: 200 OK
Connection: close
Content-Type: application/vnd.crossref.unixsd+xml;charset=utf-8
The following code shows how to access this full text link information using Ruby:
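require 'open-uri'
# URI.open is required on Ruby 3.0 and later, where Kernel#open no longer accepts URLs
r = URI.open("http://dx.doi.org/10.5555/515151", "Accept" => "application/vnd.crossref.unixsd+xml")
# Print the Link header of the response
puts r.meta['link']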
The same in Python:
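# Python 3 version (urllib2 became urllib.request)
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('Accept', 'application/vnd.crossref.unixsd+xml')]
r = opener.open('http://dx.doi.org/10.5555/515151')
# Print the Link header of the response
print(r.info()['Link'])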
The same in R:
library(httr)
r = content(GET('http://dx.doi.org/10.5555/515151', add_headers(Accept = 'application/vnd.crossref.unixsd+xml')))
r
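Once you have the Link header as a string, you still need to pull the individual full text URLs out of it. The following is a minimal sketch that assumes the header has the shape shown in the response above, with every parameter value in double quotes. The function name `parse_fulltext_links` is illustrative, and this is not a complete parser for the Link header syntax.

```python
import re

# The rel value that marks a full text link in the response shown above.
FULLTEXT_REL = "http://id.crossref.org/schema/fulltext"


def parse_fulltext_links(link_header):
    """Return a {content_type: url} dict for the full text entries of a Link header."""
    links = {}
    # Each entry looks like: <url>; rel="..."; type="..."
    entry = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
    for url, params in entry.findall(link_header):
        attrs = dict(re.findall(r'([\w-]+)="([^"]*)"', params))
        if attrs.get("rel") == FULLTEXT_REL:
            links[attrs.get("type", "unknown")] = url
    return links
```

Keying the result by content type lets you pick the representation you prefer, for example XML when it is offered and PDF otherwise.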
Note that, if present, the full text URI will also be returned in the metadata for the DOI. So, for instance, in the native CrossRef unixref schema, you would also see this in the returned metadata:
2 – Deciding what to do
Publishers who participate in CrossRef Text and Data Mining Services will also be required to register a stable license URI using the new <license_ref> element which points to the license applying to that CrossRef DOI. So, for example, the following unixref example extract would show that the DOI in question was licensed under the well-recognized Creative Commons CC-BY license:
Whereas the following would indicate that the DOI in question was licensed under a publisher’s proprietary license:
The license that the URI points to does not have to be machine readable. We expect that you will match the license URI to your whitelist. If you agree to it, you can proceed. If you don’t, you can put it in a list of licenses to review later and add to your whitelist (or blacklist).
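The matching described above can be as simple as an exact string comparison of each license URI against your lists. The sketch below assumes you keep the whitelist and blacklist as plain sets of URI strings. The function name `triage_licenses` is illustrative.

```python
def triage_licenses(license_uris, whitelist, blacklist):
    """Sort license URIs into accepted, rejected and needs-review lists."""
    accepted, rejected, review = [], [], []
    for uri in license_uris:
        if uri in whitelist:
            accepted.append(uri)
        elif uri in blacklist:
            rejected.append(uri)
        else:
            # Unknown license: keep it for a human to read later.
            review.append(uri)
    return accepted, rejected, review
```

Because the comparison is exact, two URIs that differ only in a trailing slash or in the scheme are treated as different licenses. Whether to normalize them first is a decision for you to make.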
A slight complication arises when the documents associated with DOIs are under embargoes. In this case, the publisher is able to use a start_date attribute on the <license_ref> element in order to convey simple embargo scenarios. For example, the following records that the respective DOI is under a proprietary license for a year after its publication date, after which it is licensed under a CC-BY license:
<license_ref start_date="2013-02-03">http://www.crossref.org/license</license_ref>
<license_ref start_date="2014-02-03">http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>
Text and data mining tools can easily use a combination of the <license_ref> element(s) and the start_date attribute to determine if the document pointed to by the DOI is currently under embargo.
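One way to make that determination is to find the license whose start_date is the most recent one that is not in the future. The sketch below assumes that the (start_date, URI) pairs have already been extracted from the metadata, that every <license_ref> carries a start_date, and that dates are in YYYY-MM-DD form. The function names are illustrative.

```python
from datetime import date


def license_in_effect(license_refs, on_date):
    """Return the license URI that applies on `on_date`, or None.

    `license_refs` is a list of (start_date, uri) pairs, with start_date
    given as a 'YYYY-MM-DD' string.
    """
    current = None
    # Walk the licenses in chronological order and keep the latest started one.
    for start, uri in sorted((date.fromisoformat(s), u) for s, u in license_refs):
        if start <= on_date:
            current = uri
    return current


def is_usable_on(license_refs, on_date, accepted_licenses):
    """True if the license in effect on `on_date` is one you have accepted."""
    return license_in_effect(license_refs, on_date) in accepted_licenses
```

With the two license_ref values from the example above, a check dated during the first year would see the proprietary license, and a check dated on or after 2014-02-03 would see the CC-BY license.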
Note that if you are NOT interested in receiving the metadata for the DOI, you can simply issue an HTTP HEAD request and you will get the Link header without the rest of the DOI record.
Or you can use the CrossRef REST APIs
The CrossRef REST APIs can also be used to provide cross-publisher support for text and data mining applications. This demonstration is a bit of a paradox as it is targeted at a non-technical audience who wants to understand a little bit about the technical infrastructure that researchers can leverage for text and data mining applications. A more complete explanation is available here.
Finding out what is in the CrossRef system
How many members does CrossRef have?
Who are they? Let’s look at the first 100 members
And the second 100 members
How many DOI records does CrossRef have?
What content types does CrossRef have?
How many journal article DOIs does CrossRef have?
How many proceedings articles DOIs does CrossRef have?
But eventually you will probably want to start looking at metadata records. Let’s search for records that have the word “blood” in the metadata and see how many there are.
Let’s look at some of the results.
Now let’s look at one of the records
Interesting. The record has ORCIDs, fulltext links, and license links. You need license and fulltext links to text and data mine the content.
How many works have license information?
How many license types are there?
How many works have a CC-BY license?
OK, let’s see how many records with the word “blood” in the metadata have license information and full text links
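As a rough illustration of what such a query might look like in code, the sketch below builds a REST API URL for works that match a search term and have both license information and full text links. The base URL `https://api.crossref.org`, the `/works` route and the `has-license` and `has-full-text` filter names are assumptions about the public REST API and are not taken from this page, so check them against the REST API documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed public endpoint of the REST API (not stated on this page).
API_BASE = "https://api.crossref.org"


def works_query_url(term, rows=20):
    """Build a /works query URL for records that match `term` and carry
    both license information and full text links (assumed filter names)."""
    params = {
        "query": term,
        "filter": "has-license:true,has-full-text:true",
        "rows": rows,
    }
    return API_BASE + "/works?" + urlencode(params)
```

Calling `works_query_url("blood")` only builds the URL. Fetching it and reading the JSON response is left to your HTTP client of choice.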
Let’s download the results and then fetch the content locally for text and data mining
You can watch a presentation of CrossRef’s Geoffrey Bilder demonstrating this process at the Crossref Workshops 2014.
3 – Fetching the full text
You can now perform a standard GET request on the URL to download the full text from the publisher’s site.
Rate limiting headers
Because the bulk-downloading of large numbers of publications may put a strain on the publisher’s servers, we have defined the following HTTP headers:
|HEADER NAME||EXAMPLE VALUE||EXPLANATION|
|CR-TDM-Rate-Limit||1500||Maximum number of full text downloads that are allowed to be performed in the defined rate limit window|
|CR-TDM-Rate-Limit-Remaining||76||Number of downloads left for the current rate limit window|
|CR-TDM-Rate-Limit-Reset||1375427514||Remaining time (in UTC epoch seconds) before the rate limit resets and a new rate limit window is started|
You are not obliged to test for and act on these headers, and not all publishers will use these headers. However, doing so will avoid surprises.
An Example session using Rate Limiting
curl -k "https://annalsofpsychoceramics.labs.crossref.org/fulltext/515151" -D - -L -O
HTTP/1.1 200 OK
Date: Fri, 02 Aug 2013 07:10:53 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.13
CR-TDM-Client-Token: hZqJDbcbKSSRgRG_PJxSBA
CR-TDM-Rate-Limit: 5
CR-TDM-Rate-Limit-Remaining: 4
CR-TDM-Rate-Limit-Reset: 1375427514
X-Content-Type-Options: nosniff
Last-Modified: Tue, 23 Apr 2013 15:52:01 GMT
Status: 200
Content-Length: 9426
Content-Type: application/pdf
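A download loop can use these headers to pause when the allowance is used up. The sketch below assumes the response headers are available as a plain dictionary with the exact header names shown above, and that CR-TDM-Rate-Limit-Reset is an absolute epoch timestamp, which is what the example session suggests (its value falls about a minute after the Date header). The function name `seconds_until_reset` is illustrative.

```python
import time


def seconds_until_reset(headers, now=None):
    """Return how many seconds to wait before the next download (0 if none)."""
    remaining = headers.get("CR-TDM-Rate-Limit-Remaining")
    reset = headers.get("CR-TDM-Rate-Limit-Reset")
    if remaining is None or reset is None:
        # Not all publishers send these headers: nothing to act on.
        return 0
    if int(remaining) > 0:
        # Downloads are still available in the current window.
        return 0
    now = time.time() if now is None else now
    # Assumption: the reset value is an epoch timestamp, not a duration.
    return max(0, int(reset) - now)
```

A caller could pass the returned value to `time.sleep` before requesting the next full text link. The optional `now` argument exists only to make the function easy to test.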
Problems accessing full text URIs using the CrossRef Text and Data Mining API
If you are having trouble accessing the full text URIs returned to you in the Link header, this may be because either:
- You have hit a rate limit (see above)
- You are trying to access content that requires you to accept an additional text and data mining license.
If you have encountered the second issue, then you may want to consider modifying your tools to work with the click-through service.