The CrossRef Common API is designed to allow researchers to easily harvest full text documents from all participating publishers regardless of their business model (e.g. open access, subscription). It makes use of CrossRef DOI content negotiation to provide researchers with links to the full text of content located on the publisher’s site. The publisher remains responsible for actually delivering the full text of the content requested. Thus, open access publishers can simply deliver the requested content, while subscription-based publishers continue to support subscriptions using their existing access control systems.
Blah, Blah, Blah. How do I get started?
In the simplest case, a researcher can simply issue an HTTP GET request for a CrossRef DOI using DOI content negotiation. So, for example, the following curl command will retrieve the metadata for the DOI 10.5555/515151:
```
curl -L -iH "Accept: text/turtle" http://dx.doi.org/10.5555/515151
```
This will return the metadata for the specified DOI as well as a link header which points to several representations of the full text on the publisher’s site:
```
HTTP/1.1 200 OK
Date: Wed, 31 Jul 2013 11:24:14 GMT
Server: Apache/2.2.3 (CentOS)
Link: <http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf>; rel="http://id.crossref.org/schema/fulltext"; type="application/pdf", <http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml>; rel="http://id.crossref.org/schema/fulltext"; type="application/xml"
Vary: Accept
Content-Length: 2189
Status: 200 OK
Connection: close
Content-Type: text/turtle;charset=utf-8
```
The following code shows how to access this full text link information using Ruby:
```ruby
require 'open-uri'

r = open("http://dx.doi.org/10.5555/515151", "Accept" => "text/turtle")
puts r.meta['link']
```
The same in Python:
```python
import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('Accept', 'text/turtle')]
r = opener.open('http://dx.doi.org/10.5555/515151')
print r.info()['Link']
```
The same in R:
```r
library(httr)

r = content(GET('http://dx.doi.org/10.5555/515151', add_headers(Accept = 'text/turtle')))
r
```
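Whichever client you use, the raw `Link` header is just a string and still needs to be parsed into individual links. The following sketch (Python 3, standard library only) shows one way to do that for the header format shown above; the regular expressions here are a simplification, not a full RFC-compliant Link header parser:

```python
import re

def parse_link_header(link_header):
    """Split an HTTP Link header into a list of (url, params) tuples."""
    links = []
    # Each entry looks like: <url>; rel="..."; type="..."
    for entry in re.split(r',\s*(?=<)', link_header):
        match = re.match(r'<([^>]+)>(.*)', entry)
        if not match:
            continue
        url, rest = match.groups()
        params = dict(re.findall(r'(\w+)="([^"]*)"', rest))
        links.append((url, params))
    return links

# The Link header from the example response above
header = ('<http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf>; '
          'rel="http://id.crossref.org/schema/fulltext"; type="application/pdf", '
          '<http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml>; '
          'rel="http://id.crossref.org/schema/fulltext"; type="application/xml"')

# Keep only the links whose rel identifies them as full text
fulltext = [url for url, params in parse_link_header(header)
            if params.get('rel') == 'http://id.crossref.org/schema/fulltext']
```

Filtering on the `rel` value rather than taking every link means the code keeps working if publishers add other link relations to the header later.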
Note that, if present, the full text URI will also be returned in the metadata for the DOI. So, for instance, in the native CrossRef unixref schema, you would also see this in the returned metadata:
```xml
<collection property="text-mining" setbyID="creftest">
  <item>
    <resource mime_type="application/pdf">http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf</resource>
  </item>
  <item>
    <resource mime_type="application/xml">http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml</resource>
  </item>
</collection>
```
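If you work from the metadata rather than the Link header, the full text URIs can be pulled out of the `collection` element with any XML library. A minimal sketch in Python, operating on the fragment above (note that a full unixref record uses XML namespaces, which this simplified example ignores):

```python
import xml.etree.ElementTree as ET

# The text-mining collection fragment from the unixref example above
unixref_fragment = """
<collection property="text-mining" setbyID="creftest">
  <item>
    <resource mime_type="application/pdf">http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf</resource>
  </item>
  <item>
    <resource mime_type="application/xml">http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml</resource>
  </item>
</collection>
"""

collection = ET.fromstring(unixref_fragment)
# Map each MIME type to its full text URL
fulltext_urls = {res.get('mime_type'): res.text.strip()
                 for res in collection.iter('resource')}
```

Keying on `mime_type` lets a TDM tool pick the representation it prefers (e.g. XML over PDF) without caring about the order of the `item` elements.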
How do I know what I am allowed to do with the full text that I retrieve?
It isn’t much good having a common API for downloading the full text of a document if there is no easy way of telling what you are permitted to do with said document. To that end, publishers who participate in TDM will also be required to register a stable URI, using the new `<license_ref>` element, which points to the license applying to that CrossRef DOI. So, for example, the following unixref extract would show that the DOI in question was licensed under the well-recognized Creative Commons CC-BY license:
```xml
<program name="AccessIndicators">
  <license_ref>http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>
</program>
```
Whereas the following would indicate that the DOI in question was licensed under a publisher’s proprietary license:
```xml
<program name="AccessIndicators">
  <license_ref>http://www.annalsofpsychoceramics.org/art_license.html</license_ref>
</program>
```
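Extracting the license URI from the `AccessIndicators` program is a one-liner with an XML library. A sketch in Python against the fragment above (again ignoring the namespaces a full unixref record would carry):

```python
import xml.etree.ElementTree as ET

# The AccessIndicators fragment from the CC-BY example above
fragment = """
<program name="AccessIndicators">
  <license_ref>http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>
</program>
"""

program = ET.fromstring(fragment)
# A DOI may carry several license_ref elements, so collect them all
licenses = [ref.text for ref in program.iter('license_ref')]
```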
The license that the URI points to does not have to be machine readable. It is expected that TDM researchers will check the recorded license against a whitelist that either they or their institution compiles and maintains. Simply knowing that a given URI is not included in your whitelist means that the license will need to be reviewed manually and:
- If approved, added to the existing whitelist.
- If rejected, added to a blacklist.
This is essentially the same mechanism that is widely used when auditing open source software projects for license compliance. The URIs of well-known licenses are recorded in the headers of source files, making source trees easily auditable against organisational or third-party curated open source license whitelists.
A slight complication arises when the documents associated with DOIs are under embargoes. In this case, the publisher is able to use a `start_date` attribute on the `<license_ref>` element in order to convey simple embargo scenarios. For example, the following records that the respective DOI is under a proprietary license for a year after its publication date, after which it is licensed under a CC-BY license:
```xml
<program name="AccessIndicators">
  <license_ref start_date="2013-02-03">http://www.crossref.org/license</license_ref>
</program>
<program name="AccessIndicators">
  <license_ref start_date="2014-02-03">http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>
</program>
```
The researcher’s TDM tools can easily use a combination of the `<license_ref>` element(s) and the `start_date` attribute to determine whether the document pointed to by the DOI is currently under embargo.
Note that if you are NOT interested in receiving the metadata for the DOI, you can simply issue an HTTP HEAD request and you will get the Link header without the rest of the DOI record.
Being Polite: Implement Rate Limiting Headers (optional)
Many publisher platforms are designed and scaled to handle typical interactive browsing and downloading behaviour. The process of bulk-downloading full text for TDM purposes could potentially put a major strain on servers that are not architected to handle automated processes. So some publishers will need a way to rate-limit TDM tools so as to ensure their site performance doesn’t degrade.
It would be tedious if each publisher implemented rate limiting differently, so CrossRef has defined a set of standard HTTP headers that can be used by servers to convey rate-limiting information to automated TDM tools. Well-behaved TDM tools can simply look for these headers when they query publisher sites in order to understand how best to adjust their behaviour so as not to affect the performance of the site. The headers allow a publisher to define a “rate limit window”, which is basically a time span (e.g. a minute, an hour, a day). The publisher can then specify:
| Header Name | Example Value | Explanation |
|---|---|---|
| CR-Prospect-Rate-Limit | 1500 | Maximum number of full text downloads that are allowed to be performed in the defined rate limit window |
| CR-Prospect-Rate-Limit-Remaining | 76 | Number of downloads left for the current rate limit window |
| CR-Prospect-Rate-Limit-Reset | 1378072800 | The time (in UTC epoch seconds) at which the current rate limit window resets and a new one starts |
It will be entirely up to the publisher to implement rate limiting should they require it. It will also be up to the publisher to define a rate limit that is appropriate for their servers. The CrossRef service itself will play no role in enforcing or providing this rate limiting; it simply defines the set of standard headers that should be used by servers implementing rate limiting, so that TDM tools have a common mechanism for adjusting their behaviour on sites that may otherwise struggle to serve bulk requests for full text downloads.
Similarly, it is entirely optional for a TDM tool to use the rate limiting headers, but be aware that if you try to harvest the full text of DOIs and you are getting errors or timeouts, it may be because you are attempting to access DOIs on a publisher site that has implemented rate limiting.
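A polite harvester can check the headers after each download and pause until the window resets once no downloads remain. A minimal sketch, assuming the headers arrive as a plain dict of strings (the function name and defaults are illustrative):

```python
import time

def wait_if_exhausted(headers, now=None):
    """If the rate limit window is exhausted, sleep until it resets.
    Returns the number of seconds waited (0 if downloads remain)."""
    now = time.time() if now is None else now
    # Assume one download remains if the header is absent
    remaining = int(headers.get('CR-Prospect-Rate-Limit-Remaining', 1))
    reset = int(headers.get('CR-Prospect-Rate-Limit-Reset', 0))
    if remaining > 0:
        return 0
    delay = max(0, reset - now)
    time.sleep(delay)  # could be a long pause for day-long windows
    return delay
```

Calling this between downloads keeps the tool inside the publisher’s declared limits instead of discovering them through errors and timeouts.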
An Example session using Rate Limiting
```
curl -k "https://annalsofpsychoceramics.labs.crossref.org/fulltext/515151" -D - -L -O
```
```
HTTP/1.1 200 OK
Date: Fri, 02 Aug 2013 07:10:53 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.13
CR-Prospect-Client-Token: hZqJDbcbKSSRgRG_PJxSBA
CR-Prospect-Rate-Limit: 5
CR-Prospect-Rate-Limit-Remaining: 4
CR-Prospect-Rate-Limit-Reset: 1375427514
X-Content-Type-Options: nosniff
Last-Modified: Tue, 23 Apr 2013 15:52:01 GMT
Status: 200
Content-Length: 9426
Content-Type: application/pdf
```
Problems accessing full text URIs using the CrossRef Metadata API
If you are having trouble accessing the full text URIs returned to you in the Link header, this may be because either:
- You have hit a rate limit (see above)
- You are trying to access content that requires you to accept an additional text and data mining license.
If you have encountered the second issue, then you may want to consider modifying your tools to work with the click-through service.