Researchers: The CrossRef Metadata API

The CrossRef Common API is designed to allow researchers to easily harvest full text documents from all participating publishers regardless of their business model (e.g. open access, subscription). It makes use of CrossRef DOI content negotiation to provide researchers with links to the full text of content located on the publisher’s site. The publisher remains responsible for actually delivering the full text of the content requested. Thus, open access publishers can simply deliver the requested content while subscription-based publishers continue to support subscriptions using their existing access control systems.

Blah, Blah, Blah. How do I get started?

In the simplest case, a researcher can issue an HTTP GET request for a CrossRef DOI and use DOI content negotiation. So, for example, the following curl command will retrieve the metadata for the DOI 10.5555/515151:

curl -L  -iH "Accept: text/turtle" http://dx.doi.org/10.5555/515151

This will return the metadata for the specified DOI as well as a Link header which points to several representations of the full text on the publisher’s site:

HTTP/1.1 200 OK
Date: Wed, 31 Jul 2013 11:24:14 GMT
Server: Apache/2.2.3 (CentOS)
Link: <http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf>; rel="http://id.crossref.org/schema/fulltext"; type="application/pdf", <http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml>; rel="http://id.crossref.org/schema/fulltext"; type="application/xml"
Vary: Accept
Content-Length: 2189
Status: 200 OK
Connection: close
Content-Type: text/turtle;charset=utf-8

The following code shows how to access this full text link information using Ruby:

require 'open-uri'
# URI.open replaces the bare Kernel#open, which no longer fetches URLs in Ruby 3.0+
r = URI.open("http://dx.doi.org/10.5555/515151", "Accept" => "text/turtle")
puts r.meta['link']

The same in Python:

# Python 3 (urllib2 was merged into urllib.request)
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('Accept', 'text/turtle')]
r = opener.open('http://dx.doi.org/10.5555/515151')
print(r.info()['Link'])

The same in R:

library(httr)
r <- GET('http://dx.doi.org/10.5555/515151', add_headers(Accept = 'text/turtle'))
# headers() returns the response headers; the Link header holds the full text links
headers(r)$link
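
Whichever language you use, the Link header arrives as a single string that still has to be split into individual links. The following is a minimal Python sketch of one way to do that, using the header value from the example response above. The helper name parse_link_header is hypothetical, and the regular expression only handles the quoted-parameter form shown in this document, not every variation a Link header can take.

```python
import re

FULLTEXT_REL = "http://id.crossref.org/schema/fulltext"


def parse_link_header(value):
    """Split a Link header value into a list of dicts: 'url' plus its parameters."""
    links = []
    # Each entry looks like: <URL>; name="value"; name="value"
    pattern = r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)'
    for match in re.finditer(pattern, value):
        url, raw_params = match.groups()
        params = dict(re.findall(r'([\w-]+)\s*=\s*"([^"]*)"', raw_params))
        links.append({"url": url, **params})
    return links


# Header value copied from the example response above.
header = (
    '<http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf>; '
    'rel="http://id.crossref.org/schema/fulltext"; type="application/pdf", '
    '<http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml>; '
    'rel="http://id.crossref.org/schema/fulltext"; type="application/xml"'
)

# Keep only the full text links, keyed by their MIME type.
fulltext_by_type = {
    link.get("type"): link["url"]
    for link in parse_link_header(header)
    if link.get("rel") == FULLTEXT_REL
}
print(fulltext_by_type["application/pdf"])
```

Keying the result by MIME type lets a tool pick the representation it prefers (for example XML over PDF) without re-parsing the header.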

Note that, if present, the full text URI will also be returned in the metadata for the DOI. So, for instance, in the native CrossRef unixref schema, you would also see this in the returned metadata:

<collection property="text-mining" setbyID="creftest">
  <item>
    <resource mime_type="application/pdf">http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf</resource>
  </item>
  <item>
    <resource mime_type="application/xml">http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml</resource>
  </item>
</collection>
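
If you are already parsing the metadata, the same URIs can be read from that fragment instead of the Link header. The sketch below uses Python's standard XML parser on the extract shown above. The function name text_mining_resources is hypothetical, and the code assumes the elements carry no XML namespace, as in the extract; a complete unixref record may declare namespaces that would need to be handled.

```python
import xml.etree.ElementTree as ET

# The extract shown above, used here as sample input.
UNIXREF_FRAGMENT = """
<collection property="text-mining" setbyID="creftest">
  <item>
    <resource mime_type="application/pdf">http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf</resource>
  </item>
  <item>
    <resource mime_type="application/xml">http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml</resource>
  </item>
</collection>
"""


def text_mining_resources(root):
    """Return {mime_type: url} for every resource in a text-mining collection."""
    resources = {}
    # iter() also visits the root element itself, so a bare <collection> works.
    for collection in root.iter("collection"):
        if collection.get("property") != "text-mining":
            continue
        for resource in collection.iter("resource"):
            resources[resource.get("mime_type")] = (resource.text or "").strip()
    return resources


resources = text_mining_resources(ET.fromstring(UNIXREF_FRAGMENT))
print(resources["application/xml"])
```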

How do I know what I am allowed to do with the full text that I retrieve?

It isn’t much good having a common API to allow you to download the full text of a document if you do not have any easy way of telling what you are permitted to do with said document. To that end, publishers who participate in text and data mining (TDM) will also be required to register a stable URI using the new <license_ref> element which points to the license applying to that CrossRef DOI. So, for example, the following unixref example extract would show that the DOI in question was licensed under the well-recognized Creative Commons CC-BY license:

<program name="AccessIndicators">
  <license_ref>http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>
</program>

Whereas the following would indicate that the DOI in question was licensed under a publisher’s proprietary license:

<program name="AccessIndicators">
  <license_ref>http://www.annalsofpschoceramics.org/art_license.html</license_ref>
</program>

The license that the URI points to does not have to be machine readable. It is expected that TDM researchers will check the recorded license against a whitelist that either they or their institution compiles and maintains. Simply knowing that a given URI is not included in your whitelist means that the license will need to be reviewed manually and:

  1. If approved, added to the existing whitelist.
  2. If rejected, added to a blacklist.

This is essentially the same mechanism that is widely used when auditing open source software projects for license compliance. The URIs of well-known licenses are recorded in the headers of source files, making source trees easily auditable against organisational or third-party curated open source license whitelists.
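
The whitelist check described above amounts to a simple set lookup. The following Python sketch illustrates it; the function name classify_license, the three status strings, and the contents of the two lists are all hypothetical and only stand in for whatever a researcher or institution actually maintains.

```python
def classify_license(license_uri, whitelist, blacklist):
    """Return 'allowed', 'denied', or 'needs-review' for a license URI."""
    if license_uri in whitelist:
        return "allowed"
    if license_uri in blacklist:
        return "denied"
    # Unknown URI: a person has to read the license and update one of the lists.
    return "needs-review"


# Hypothetical lists; in practice these would be loaded from a maintained file.
whitelist = {"http://creativecommons.org/licenses/by/3.0/deed.en_US"}
blacklist = set()

for uri in (
    "http://creativecommons.org/licenses/by/3.0/deed.en_US",
    "http://www.annalsofpschoceramics.org/art_license.html",
):
    print(uri, classify_license(uri, whitelist, blacklist))
```

Because the comparison is on the exact URI string, a tool built this way treats two different URIs for the same license as different entries, so each variant has to be reviewed and listed separately.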

A slight complication arises when the documents associated with DOIs are under embargoes. In this case, the publisher is able to use a start_date attribute on the <license_ref> element in order to convey simple embargo scenarios. For example, the following records that the respective DOI is under a proprietary license for a year after its publication date, after which it is licensed under a CC-BY license:

<program name="AccessIndicators">
  <license_ref start_date="2013-02-03">http://www.crossref.org/license</license_ref>
</program>
<program name="AccessIndicators">
  <license_ref start_date="2014-02-03">http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>
</program>

The researcher’s TDM tools can easily use a combination of the <license_ref> element(s) and the start_date attribute to determine if the document pointed to by the DOI is currently under embargo.
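
As an illustration, a tool could pick the license in effect on a given day by taking the <license_ref> with the latest start_date that is not in the future. The Python sketch below does this for the embargo example above. The <access> wrapper element and the function name license_in_effect are hypothetical (the wrapper is only there so the two fragments parse as one document), and treating a <license_ref> without a start_date as always applicable is an assumption, not something this document specifies.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The embargo example above, wrapped in a hypothetical <access> root element.
RECORD = """
<access>
  <program name="AccessIndicators">
    <license_ref start_date="2013-02-03">http://www.crossref.org/license</license_ref>
  </program>
  <program name="AccessIndicators">
    <license_ref start_date="2014-02-03">http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>
  </program>
</access>
"""


def license_in_effect(root, on_date):
    """Return the license URI with the latest start_date not after on_date, or None."""
    current = None
    for ref in root.iter("license_ref"):
        start_text = ref.get("start_date")
        # Assumption: no start_date means the license applies from the beginning.
        start = date.fromisoformat(start_text) if start_text else date.min
        if start <= on_date and (current is None or start >= current[0]):
            current = (start, (ref.text or "").strip())
    return current[1] if current else None


root = ET.fromstring(RECORD)
print(license_in_effect(root, date(2013, 6, 1)))  # during the embargo year
print(license_in_effect(root, date(2014, 6, 1)))  # after the embargo ends
```

The returned URI can then be checked against the whitelist in the usual way, so an embargoed document simply shows up as being under the proprietary license until the later start_date is reached.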

Note that if you are NOT interested in receiving the metadata for the DOI, you can simply issue an HTTP HEAD request and you will get the Link header without the rest of the DOI record.
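
A HEAD request of this kind could be sketched in Python as follows. The function names build_head_request and fetch_link_header are hypothetical. Only the request construction is exercised here; fetch_link_header needs network access, and how redirects are handled for HEAD requests is left to urllib's defaults, which may vary between Python versions.

```python
import urllib.request


def build_head_request(doi, accept="text/turtle"):
    """Build (but do not send) a HEAD request for a DOI via the resolver used above."""
    return urllib.request.Request(
        "http://dx.doi.org/" + doi,
        headers={"Accept": accept},
        method="HEAD",
    )


def fetch_link_header(doi):
    """Send the HEAD request and return the Link header, or None if it is absent."""
    with urllib.request.urlopen(build_head_request(doi)) as response:
        return response.headers.get("Link")


request = build_head_request("10.5555/515151")
print(request.get_method(), request.full_url)
```

Calling fetch_link_header("10.5555/515151") would then perform the request itself.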

Being Polite: Implement Rate Limiting Headers (optional)

Many publisher platforms are designed and scaled to handle typical interactive browsing and downloading behaviour. The process of bulk-downloading full text for TDM purposes could potentially put a major strain on servers that are not architected to handle automated processes. So some publishers will need a way to rate-limit TDM tools so as to ensure their site performance doesn’t degrade.

It would be tedious if each publisher implemented rate limiting differently, so CrossRef has defined a set of standard HTTP headers that can be used by servers to convey rate-limiting information to automated TDM tools. Well-behaved TDM tools can simply look for these headers when they query publisher sites in order to understand how best to adjust their behaviour so as not to affect the performance of the site. The headers allow a publisher to define a “rate limit window”, which is basically a time span (e.g. a minute, an hour, a day). The publisher can then specify:

Header Name                       Example Value  Explanation
CR-Prospect-Rate-Limit            1500           Maximum number of full text downloads allowed in the defined rate limit window
CR-Prospect-Rate-Limit-Remaining  76             Number of downloads left in the current rate limit window
CR-Prospect-Rate-Limit-Reset      1378072800     Time (in UTC epoch seconds) at which the rate limit resets and a new rate limit window starts

It will be entirely up to the publisher to implement rate limiting should they require it. It will also be up to the publisher to define a rate limit that is appropriate for their servers. The CrossRef service itself will play no role in enforcing or providing this rate limiting. It simply defines the set of standard headers that should be used by servers implementing rate limiting, so that TDM tools can use a common mechanism for adjusting their behaviour on sites that may otherwise struggle to serve bulk requests for full text downloads.

Similarly, it is entirely optional for a TDM tool to use the rate limiting headers, but be aware that if you do try to harvest the full text of DOIs and you are getting errors or timeouts, it may be because you are attempting to access DOIs on a publisher site that has implemented rate limiting.

An Example session using Rate Limiting

curl -k "https://annalsofpsychoceramics.labs.crossref.org/fulltext/515151" -D - -L -O
HTTP/1.1 200 OK
Date: Fri, 02 Aug 2013 07:10:53 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.13
CR-Prospect-Client-Token: hZqJDbcbKSSRgRG_PJxSBA
CR-Prospect-Rate-Limit: 5
CR-Prospect-Rate-Limit-Remaining: 4
CR-Prospect-Rate-Limit-Reset: 1375427514
X-Content-Type-Options: nosniff
Last-Modified: Tue, 23 Apr 2013 15:52:01 GMT
Status: 200
Content-Length: 9426
Content-Type: application/pdf
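
A TDM tool could honour headers like those in the session above by pausing once the remaining count reaches zero. The following Python sketch shows one possible approach. The function name seconds_to_wait is hypothetical, and the code assumes that CR-Prospect-Rate-Limit-Reset holds an absolute time in UTC epoch seconds, which is what the example values suggest (the reset value above falls about a minute after the response's Date header).

```python
import time


def seconds_to_wait(headers, now=None):
    """Return how long to pause before the next download, based on rate-limit headers."""
    remaining = headers.get("CR-Prospect-Rate-Limit-Remaining")
    reset = headers.get("CR-Prospect-Rate-Limit-Reset")
    if remaining is None or reset is None:
        return 0.0  # The server sent no rate-limit information.
    if int(remaining) > 0:
        return 0.0  # Downloads are still available in this window.
    now = time.time() if now is None else now
    # Assumption: the reset value is an absolute UTC epoch timestamp.
    return max(0.0, float(reset) - now)


# Header values taken from the example session above.
headers = {
    "CR-Prospect-Rate-Limit": "5",
    "CR-Prospect-Rate-Limit-Remaining": "4",
    "CR-Prospect-Rate-Limit-Reset": "1375427514",
}
# 1375427453 corresponds to the Date header of the example response.
print(seconds_to_wait(headers, now=1375427453))

# The same window once every download has been used up.
exhausted = dict(headers, **{"CR-Prospect-Rate-Limit-Remaining": "0"})
print(seconds_to_wait(exhausted, now=1375427453))
```

In a download loop, the tool would call time.sleep() with the returned value before issuing its next request. A plain dict is used here for simplicity; real HTTP header lookups are case-insensitive.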

Problems accessing full text URIs using the CrossRef Metadata API

If you are having trouble accessing the full text URIs returned to you in the Link header, this may be because either:

  1. You have hit a rate limit (see above)
  2. You are trying to access content that requires you to accept an additional text and data mining license.

If you have encountered the second issue, then you may want to consider modifying your tools to work with the click-through service.

Error messages

(Coming soon)