Text and Data Mining for Researchers

The CrossRef Text and Data Mining API is designed to allow researchers to easily harvest full text documents from all participating publishers regardless of their business model (e.g. open access, subscription). The publisher remains responsible for actually delivering the full text of the content requested. Thus, open access publishers can simply deliver the requested content, while subscription-based publishers continue to support subscriptions using their existing access control systems.

Here is a worked example of how to use CrossRef’s metadata services to perform text and data mining. First you should have:

  1. A list of DOIs that you want to download
  2. A whitelist of licenses that you accept

The way you create these lists is up to you. You might want to start with the list of DOIs, get the licenses and decide which ones you want to agree to. You may get the list from your institution. It is you, the Researcher, who will decide what to do with each license. This is essentially the same mechanism that is widely used when auditing open source software projects for license compliance.

You can get a list of DOIs from citations, our Metadata search, our Metadata API or any other source.

For each DOI you should:

  1. Use content negotiation to get the metadata for the DOI.
  2. Check to see if there’s license and full text metadata.
  3. Check the license against your whitelist.
  4. If you agree to the license, follow the link and download the full text of the article.

The absence of a license does not mean that the full text can be used without one. Publishers should deposit both the license and the full-text link at the same time.
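
The per-DOI steps above can be sketched in Python as a small planning function that works on metadata you have already fetched. This is only an illustration: the record layout (a dict with "licenses" and "links" lists) and the names plan_downloads, records and accepted_licenses are assumptions made for this sketch, not part of any CrossRef API.

```python
def plan_downloads(records, accepted_licenses):
    """Decide which full text links may be fetched.

    records: dict mapping a DOI to {"licenses": [...], "links": [...]}
             (a hypothetical layout for already-parsed metadata; the
             licenses are assumed to be the ones currently in effect).
    accepted_licenses: set of license URIs on your whitelist.
    Returns (to_download, skipped); to_download maps each DOI to its links.
    """
    to_download = {}
    skipped = []
    for doi, record in records.items():
        licenses = record.get("licenses", [])
        links = record.get("links", [])
        # Step 2: both a license and a full text link must be present.
        # A missing license is never treated as permission.
        if not licenses or not links:
            skipped.append(doi)
            continue
        # Step 3: at least one license must be on the whitelist.
        if not any(uri in accepted_licenses for uri in licenses):
            skipped.append(doi)
            continue
        # Step 4: the links can now be followed.
        to_download[doi] = list(links)
    return to_download, skipped
```

DOIs that end up in the skipped list are the ones whose licenses you would review by hand before extending your whitelist.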

Step by step

The examples below use the curl utility. You should be able to integrate the API with your own text and data mining software very easily.

1 – Fetch the Metadata

In the simplest case, a researcher can issue an HTTP GET request for a CrossRef DOI and use DOI content negotiation. So, for example, the following curl command will retrieve the metadata for the DOI 10.5555/515151:

curl -L  -iH "Accept: application/vnd.crossref.unixsd+xml" http://dx.doi.org/10.5555/515151

This will return the metadata for the specified DOI as well as a link header which points to several representations of the full text on the publisher’s site:

HTTP/1.1 200 OK
Date: Wed, 31 Jul 2013 11:24:14 GMT
Server: Apache/2.2.3 (CentOS)
Link: <http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf>; rel="http://id.crossref.org/schema/fulltext"; type="application/pdf", <http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml>; rel="http://id.crossref.org/schema/fulltext"; type="application/xml"
Vary: Accept
Content-Length: 2189
Status: 200 OK
Connection: close
Content-Type: application/vnd.crossref.unixsd+xml;charset=utf-8

The following code shows how to access this full text link information using Ruby:

require 'open-uri'
# URI.open is needed on current Ruby versions; a bare open() no longer accepts URLs
r = URI.open("http://dx.doi.org/10.5555/515151", "Accept" => "application/vnd.crossref.unixsd+xml")
puts r.meta['link']

The same in Python:

# Python 3: urllib2 was merged into urllib.request
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('Accept', 'application/vnd.crossref.unixsd+xml')]
r = opener.open('http://dx.doi.org/10.5555/515151')
print(r.info()['Link'])

The same in R:

library(httr)
r = content(GET('http://dx.doi.org/10.5555/515151', add_headers(Accept = 'application/vnd.crossref.unixsd+xml')))
r
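
Once you have the Link header value, you will usually want to pick out the full text URIs by content type. The following Python sketch does this with regular expressions. It assumes the header has the shape shown in the example response above (each URI in angle brackets, followed by quoted rel and type parameters); the function name parse_fulltext_links is hypothetical, and a general-purpose Link header parser would need to handle more cases.

```python
import re

FULLTEXT_REL = "http://id.crossref.org/schema/fulltext"

# Matches one `<uri>; name="value"; ...` entry of a Link header.
_LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
# Matches a single quoted parameter such as type="application/pdf".
_PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_fulltext_links(link_header):
    """Return {mime_type: uri} for every full text entry in a Link header value."""
    links = {}
    for uri, raw_params in _LINK_RE.findall(link_header):
        params = dict(_PARAM_RE.findall(raw_params))
        # Keep only entries flagged with the CrossRef full text relation.
        if params.get("rel") == FULLTEXT_REL:
            links[params.get("type", "")] = uri
    return links
```

A text mining tool that prefers XML over PDF could then look up links.get("application/xml") first and fall back to links.get("application/pdf").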

Note that, if present, the full text URI will also be returned in the metadata for the DOI. So, for instance, in the native CrossRef unixref schema, you would also see this in the returned metadata:

http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.pdf
http://annalsofpsychoceramics.labs.crossref.org/fulltext/10.5555/515151.xml

2 – Deciding what to do

Publishers who participate in CrossRef Text and Data Mining Services will also be required to register a stable license URI using the new <license_ref> element which points to the license applying to that CrossRef DOI. So, for example, the following unixref example extract would show that the DOI in question was licensed under the well-recognized Creative Commons CC-BY license:

<license_ref>http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>

Whereas the following would indicate that the DOI in question was licensed under a publisher’s proprietary license:

<license_ref>http://www.annalsofpschoceramics.org/art_license.html</license_ref>

The license that the URI points to does not have to be machine readable. We expect that you will match the license URI to your whitelist. If you agree to it, you can proceed. If you don’t, you can put it in a list of licenses to review later and add to your whitelist (or blacklist).
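
As a minimal sketch of this matching step, the following Python function sorts a license URI into one of three outcomes. The function name triage_license and the outcome labels are hypothetical; the comparison is a plain exact string match, so it assumes your lists contain the URIs exactly as publishers deposit them.

```python
def triage_license(license_uri, whitelist, blacklist):
    """Classify a license URI as "accept", "reject" or "review".

    whitelist and blacklist are sets of license URIs that you maintain.
    """
    if license_uri in whitelist:
        return "accept"   # already agreed to: proceed to download
    if license_uri in blacklist:
        return "reject"   # already declined: skip this DOI
    return "review"       # unknown license: queue it for a human decision
```

Anything that comes back as "review" can be collected in a list and, once you have read the license, moved to the whitelist or the blacklist.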

A slight complication arises when the documents associated with DOIs are under embargo. In this case, the publisher can use a start_date attribute on the <license_ref> element to convey simple embargo scenarios. For example, the following records indicate that the respective DOI is under a proprietary license for a year after its publication date, after which it is licensed under a CC-BY license:

<license_ref start_date="2013-02-03">http://www.crossref.org/license</license_ref>
<license_ref start_date="2014-02-03">http://creativecommons.org/licenses/by/3.0/deed.en_US</license_ref>

Text and data mining tools can easily use a combination of the <license_ref> element(s) and the start_date attribute to determine if the document pointed to by the DOI is currently under embargo.
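
One way to make that determination is sketched below in Python. It takes the <license_ref> values as (start_date, URI) pairs and returns the license in effect on a given day. The function name license_in_effect is hypothetical, and the rule it applies (the license with the latest start_date that is not in the future wins, and a missing start_date means "always in effect") is an assumption about how to read simple embargo scenarios, not a rule stated by CrossRef.

```python
from datetime import date


def license_in_effect(license_refs, today):
    """Return the license URI that applies on `today`, or None.

    license_refs: list of (start_date, uri) pairs, where start_date is an
                  ISO date string such as "2013-02-03", or None if absent.
    today: a datetime.date.
    """
    current_uri = None
    current_start = None
    for start, uri in license_refs:
        # Assumption: a license without a start_date has always applied.
        start_day = date.fromisoformat(start) if start else date.min
        if start_day > today:
            continue  # this license has not started yet
        if current_start is None or start_day >= current_start:
            current_uri, current_start = uri, start_day
    return current_uri
```

With the two example records above, a date in mid-2013 yields the proprietary license (so the document is still under embargo as far as CC-BY is concerned), while a date after 2014-02-03 yields the CC-BY license.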

Note that if you are NOT interested in receiving the metadata for the DOI, you can simply issue an HTTP HEAD request and you will get the Link header without the rest of the DOI record.

Or you can use the CrossRef REST APIs

The CrossRef REST APIs can also be used to provide cross-publisher support for text and data mining applications. The demonstration that follows is something of a paradox: it is aimed at a non-technical audience that wants to understand a little about the technical infrastructure researchers can use for text and data mining. A more complete explanation is available here.

Finding out what is in the CrossRef system

How many members does CrossRef have?

http://api.crossref.org/members?rows=0

Who are they? Let’s look at the first 100 members

http://api.crossref.org/members?rows=100

And the second 100 members

http://api.crossref.org/members?rows=100&offset=100
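
The rows and offset parameters shown above generalize to any number of pages. The Python sketch below generates the page URLs for a given total; the function name member_page_urls is hypothetical, and it assumes you have already obtained the total number of members (for instance from the rows=0 query).

```python
from urllib.parse import urlencode

MEMBERS_URL = "http://api.crossref.org/members"


def member_page_urls(total, rows=100):
    """Yield one URL per page of `rows` members until `total` is covered."""
    for offset in range(0, total, rows):
        params = {"rows": rows}
        if offset:
            # The first page needs no offset, matching the example above.
            params["offset"] = offset
        yield MEMBERS_URL + "?" + urlencode(params)
```

For a total of 250 members this yields three URLs, the first two of which are exactly the ones shown above.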

How many DOI records does CrossRef have?

http://api.crossref.org/works?rows=0

What content types does CrossRef have?

http://api.crossref.org/types

How many journal article DOIs does CrossRef have?

http://api.crossref.org/types/journal-article/works?rows=0

How many proceedings article DOIs does CrossRef have?

http://api.crossref.org/types/proceedings-article/works?rows=0

But eventually you will probably want to start looking at metadata records. Let’s search for records that have the word “blood” in the metadata and see how many there are.

http://api.crossref.org/works?query=%22blood%22&rows=0

Let’s look at some of the results.

http://api.crossref.org/works?query=%22blood%22&

Now let’s look at one of the records

http://api.crossref.org/works/10.1155/2014/413629

Interesting. The record has ORCIDs, fulltext links, and license links. You need license and fulltext links to text and data mine the content.

How many works have license information?

http://api.crossref.org/works?filter=has-license:true&rows=0

How many license types are there?

http://api.crossref.org/licenses?rows=0

How many works have a CC-BY license?

http://api.crossref.org/works?rows=0&filter=license.url:http://creativecommons.org/licenses/by/3.0/

OK, let’s see how many records with the word “blood” in the metadata have license information and full text links

http://api.crossref.org/works?filter=has-license:true,has-full-text:true&query=blood&rows=0
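
If you are building such queries from a program, the filter, query and rows parameters can be assembled as in the following Python sketch. The function name build_works_url is hypothetical; the filter names are simply passed through, so only filters shown in this document (such as has-license and has-full-text) are known to work.

```python
from urllib.parse import urlencode

WORKS_URL = "http://api.crossref.org/works"


def build_works_url(query=None, filters=None, rows=None):
    """Build a /works URL from a free-text query, a dict of filters and a row count."""
    params = {}
    if filters:
        # Filters are joined as name:value pairs separated by commas.
        params["filter"] = ",".join(
            "{}:{}".format(name, value) for name, value in filters.items()
        )
    if query is not None:
        params["query"] = query
    if rows is not None:
        params["rows"] = rows
    # Keep ':' and ',' unescaped so the URL reads like the examples here.
    return WORKS_URL + "?" + urlencode(params, safe=":,")
```

Calling build_works_url("blood", {"has-license": "true", "has-full-text": "true"}, 0) reproduces the URL shown above.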

Let’s fetch the full result set so that we can download the content locally for text and data mining

http://api.crossref.org/works?filter=has-license:true,has-full-text:true&query=blood&rows=884
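
Once the results have been fetched and parsed as JSON, each record has to be reduced to the two things you need: its license URIs and its full text links. The Python sketch below does this under a loudly stated assumption about the response layout, namely that the records sit in message.items and that each record carries a "DOI" string, a "license" list and a "link" list whose entries have a "URL" key. Check this against a real record, such as the one linked above, before relying on it. The function name tdm_candidates is hypothetical.

```python
def tdm_candidates(response, accepted_licenses):
    """Return [(doi, [full text URLs])] for works under a whitelisted license.

    response: the parsed JSON of a /works query (layout assumed, see above).
    accepted_licenses: iterable of license URIs you have agreed to.
    """
    accepted = set(accepted_licenses)
    candidates = []
    for item in response.get("message", {}).get("items", []):
        license_urls = {entry.get("URL") for entry in item.get("license", [])}
        link_urls = [entry["URL"] for entry in item.get("link", []) if "URL" in entry]
        # Keep the work only if it has links and a license you accept.
        if link_urls and license_urls & accepted:
            candidates.append((item.get("DOI"), link_urls))
    return candidates
```

The resulting list is what you would hand to your downloader in step 3.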

You can watch a presentation of CrossRef’s Geoffrey Bilder demonstrating this process at the CrossRef Workshops 2014.

3 – Fetching the full text

You can now perform a standard GET request on the URL to download the full text from the publisher’s site.

Rate limiting headers

Because the bulk-downloading of large numbers of publications may put a strain on the publisher’s servers, we have defined the following HTTP headers:

Header name                  Example value  Explanation
CR-TDM-Rate-Limit            1500           Maximum number of full text downloads allowed in the defined rate limit window
CR-TDM-Rate-Limit-Remaining  76             Number of downloads left in the current rate limit window
CR-TDM-Rate-Limit-Reset      1378072800     Time (in UTC epoch seconds) at which the rate limit resets and a new rate limit window starts

You are not obliged to test for and act on these headers, and not all publishers will use these headers. However, doing so will avoid surprises.
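
A downloader that wants to honor these headers might use something like the following Python sketch. It assumes that CR-TDM-Rate-Limit-Reset is an absolute time in epoch seconds, as the example values suggest, and it treats missing headers as "no limit advertised". The function name seconds_to_wait is hypothetical.

```python
def seconds_to_wait(headers, now):
    """Return how many seconds to pause before the next download (0 if none).

    headers: any mapping with a .get() method holding the response headers.
    now: the current time in epoch seconds, e.g. time.time().
    """
    remaining = headers.get("CR-TDM-Rate-Limit-Remaining")
    reset = headers.get("CR-TDM-Rate-Limit-Reset")
    if remaining is None or reset is None:
        return 0  # this publisher does not send the rate limit headers
    if int(remaining) > 0:
        return 0  # downloads are still available in the current window
    # The window is exhausted: wait until it resets (never a negative time).
    return max(0, int(reset) - int(now))
```

With Python’s urllib you could pass response.headers directly, since its get() lookup is case-insensitive, and then call time.sleep() with the result before issuing the next request.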

An example session using rate limiting

curl -k "https://annalsofpsychoceramics.labs.crossref.org/fulltext/515151" -D - -L -O
	
HTTP/1.1 200 OK
Date: Fri, 02 Aug 2013 07:10:53 GMT
Server: Apache/2.2.22 (Ubuntu)
X-Powered-By: Phusion Passenger (mod_rails/mod_rack) 3.0.13
CR-TDM-Client-Token: hZqJDbcbKSSRgRG_PJxSBA
CR-TDM-Rate-Limit: 5
CR-TDM-Rate-Limit-Remaining: 4
CR-TDM-Rate-Limit-Reset: 1375427514
X-Content-Type-Options: nosniff
Last-Modified: Tue, 23 Apr 2013 15:52:01 GMT
Status: 200
Content-Length: 9426
Content-Type: application/pdf

Using the Click-Through Service

Some publishers will require you to use the CrossRef click-through service. This allows you to agree to supplementary licenses. For more information see the Click-Through Service documentation. When you use the click-through service you will be given a token. You should supply this token as a header when you request the full text. Here is an example request using a click-through service token:

curl -H "CR-Clickthrough-Client-Token: hZqJDbcbKSSRgRG_PJxSBAx" -k "https://annalsofpsychoceramics.labs.crossref.org/fulltext/515151" -D - -L -O
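
The same request can be prepared in Python with the standard library, as in the sketch below. The header name is taken from the curl example above; the function name build_fulltext_request is hypothetical. The function only builds the request object, so that you can inspect it before sending it with urllib.request.urlopen.

```python
import urllib.request

TOKEN_HEADER = "CR-Clickthrough-Client-Token"


def build_fulltext_request(url, clickthrough_token=None):
    """Build a GET request for a full text URL, adding the token if given."""
    headers = {}
    if clickthrough_token:
        # Header names are case-insensitive in HTTP, so urllib may
        # normalize the capitalization when it stores the header.
        headers[TOKEN_HEADER] = clickthrough_token
    return urllib.request.Request(url, headers=headers)
```

Leaving the token out produces a plain request, which is all that is needed for publishers that do not use the click-through service.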

Problems accessing full text URIs using the CrossRef Text and Data Mining API

If you are having trouble accessing the full text URIs returned to you in the Link header, this may be because either:

  • You have hit a rate limit (see above).
  • You are trying to access content that requires you to accept an additional text and data mining license.

If you have encountered the second issue, you may want to consider modifying your tools to work with the click-through service.