Text and Data Mining for Publishers

For an introduction, read Text and Data Mining with Crossref. This page gives the technical details of how to participate as a publisher.

As a publisher you need to do two things to participate in the Crossref text and data mining service:

  1. Deposit a full-text link in the metadata for each DOI, so that researchers can follow it to retrieve the full text at that URI
  2. Deposit a license URI in the metadata for each DOI, so that researchers can check whether they have permission to text and data mine the content
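
To show how a mining tool might consume these two pieces of metadata, here is a minimal sketch. The sample record, the DOI 10.5555/12345678, the example.org URLs, and the helper name text_mining_info are all hypothetical. The field names ("link", "license", "intended-application") follow the shape of a work record as we understand the REST API to return it, and should be treated as assumptions rather than a definitive schema.

```python
def text_mining_info(record):
    """Return (full_text_links, license_uris) found in a work record.

    Assumes a dict shaped like a REST API work record; field names are
    assumptions for this sketch.
    """
    links = [
        (link["URL"], link.get("content-type", "unspecified"))
        for link in record.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]
    licenses = [lic["URL"] for lic in record.get("license", [])]
    return links, licenses


# Hypothetical record for illustration only.
sample_record = {
    "DOI": "10.5555/12345678",
    "link": [
        {
            "URL": "https://example.org/fulltext/12345678.pdf",
            "content-type": "application/pdf",
            "intended-application": "text-mining",
        },
        {
            "URL": "https://example.org/fulltext/12345678.xml",
            "content-type": "application/xml",
            "intended-application": "text-mining",
        },
    ],
    "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
}

full_text_links, license_uris = text_mining_info(sample_record)
for url, mime_type in full_text_links:
    print(mime_type, url)
print("licenses:", license_uris)
```

A record with no full-text link or no license URI simply yields empty lists, which is how a tool would notice that a DOI has not been prepared for text and data mining.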

If you’re interested in a basic run-through of how the service works from the researcher-side, you can view a short walk-through here (or download the .mp4 file here).

For a more extensive demonstration, Geoffrey Bilder’s presentation from the 2014 CrossRef Workshops can be viewed here: http://river-valley.zeeba.tv/text-data-mining-api-researcher-use/.

Metadata

The Crossref REST API works across all publishers regardless of their business model (open access, subscription, or a combination). It makes use of Crossref DOI content negotiation to provide researchers with links to the full text of content located on the publisher’s site. The publisher remains responsible for actually delivering the full text of the content requested. Thus, open access publishers can simply deliver the requested content, while subscription-based publishers continue to control access using their existing access control systems. In both cases, publishers will be able to use their existing site statistics packages (e.g. COUNTER) to measure use of content accessed by text and data mining tools using the API.
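
As a rough illustration of the researcher's side of content negotiation, the sketch below builds (but does not send) a metadata request for a DOI. The DOI 10.5555/12345678, the helper name build_metadata_request, and the choice of the CSL JSON media type are assumptions made for this example; a given tool may prefer a different representation.

```python
from urllib.parse import quote
from urllib.request import Request


def build_metadata_request(doi):
    """Build a content negotiation request for a DOI without sending it.

    The Accept header asks the DOI resolver for machine-readable metadata
    instead of a redirect to the landing page. The media type used here
    is an assumption for this sketch.
    """
    url = "https://doi.org/" + quote(doi, safe="/")
    return Request(url, headers={"Accept": "application/vnd.citationstyles.csl+json"})


# Hypothetical DOI for illustration only.
request = build_metadata_request("10.5555/12345678")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual lookup. The full text itself is still fetched from the publisher's site afterwards, so existing access control applies unchanged.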

There is only one step that is required of all publishers wishing to participate in Crossref text and data mining: the registration of text and data mining-specific metadata. Publishers who are concerned about the impact of automated text and data mining harvesters on their site performance may optionally want to implement Standard Rate Limiting Headers.

If you are an open access publisher, or if your existing subscription licenses already allow text and data mining of subscribed full text, then depositing the metadata described above is the ONLY thing you need to do to enable text and data mining of your content via the Crossref REST API.

NB: You can upload this information to Crossref via a Resource Only Deposit or by uploading a .csv file containing the URI links and the related DOIs. Note that the .csv upload criteria were updated in January 2015 to allow publishers to deposit different full-text MIME types.

Providing the full text

You should provide the full text of the article at the URL you provided. If you have access control systems in place you won’t need to change them.

Rate limiting

Text and data mining may change the volume of traffic that your servers have to handle as researchers download large numbers of files in bulk. You can mitigate performance issues with rate limiting.

We have defined a set of standard HTTP headers that servers can use to convey rate-limiting information to automated text and data mining tools. Well-behaved text and data mining tools can look for these headers when they query publisher sites and adjust their behaviour so as not to affect the performance of the site. The headers allow a publisher to define a “rate limit window”, which is a time span (e.g. a minute, an hour, a day). The publisher can then specify:

HEADER NAME                  EXAMPLE VALUE  EXPLANATION
CR-TDM-Rate-Limit            1500           Maximum number of full text downloads that are allowed to be performed in the defined rate limit window
CR-TDM-Rate-Limit-Remaining  76             Number of downloads left for the current rate limit window
CR-TDM-Rate-Limit-Reset      1378072800     Remaining time (in UTC epoch seconds) before the rate limit resets and a new rate limit window is started
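
A minimal sketch of how a server might produce these headers with a fixed rate limit window is given below. The class name RateLimitWindow and its parameters are hypothetical. The sketch keeps one counter for the whole site rather than one per client, and it assumes that CR-TDM-Rate-Limit-Reset carries the UTC epoch second at which the current window ends, which is how we read the example value 1378072800. It is meant only to illustrate the mechanism, not as production code.

```python
import time


class RateLimitWindow:
    """Fixed-window download counter that emits the CR-TDM headers."""

    def __init__(self, limit, window_seconds, clock=time.time):
        self.limit = limit                    # downloads allowed per window
        self.window_seconds = window_seconds  # length of the window
        self.clock = clock                    # injectable for testing
        self.window_start = int(clock())
        self.used = 0

    def _start_new_window_if_due(self):
        now = int(self.clock())
        if now >= self.window_start + self.window_seconds:
            self.window_start = now
            self.used = 0

    def register_download(self):
        """Count one download attempt; return (allowed, headers)."""
        self._start_new_window_if_due()
        allowed = self.used < self.limit
        if allowed:
            self.used += 1
        headers = {
            "CR-TDM-Rate-Limit": str(self.limit),
            "CR-TDM-Rate-Limit-Remaining": str(self.limit - self.used),
            # Assumption: absolute epoch second at which the window ends.
            "CR-TDM-Rate-Limit-Reset": str(self.window_start + self.window_seconds),
        }
        return allowed, headers


# Example: 1500 downloads per hour.
window = RateLimitWindow(limit=1500, window_seconds=3600)
allowed, headers = window.register_download()
print(allowed, headers)
```

What the server returns when allowed is False is left open here. Responding with HTTP status 429 (Too Many Requests) is one common choice, but that is an assumption of this sketch and not part of the header definitions above.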

It is entirely up to the publisher to implement rate limiting if they require it, and to define a rate limit that is appropriate for their servers. Crossref plays no role in enforcing or providing this rate limiting. The guidelines above simply define the set of standard headers that servers implementing rate limiting should use, so that text and data mining tools have a common mechanism for adjusting their behaviour on sites that may otherwise struggle to serve bulk requests for full-text downloads.
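
For the other side of this arrangement, here is a minimal sketch of how a well-behaved mining tool might honour the headers. The function name seconds_to_wait is hypothetical, and the sketch again assumes that CR-TDM-Rate-Limit-Reset is the UTC epoch second at which the window ends. A response without the headers is treated as not rate limited.

```python
import time


def seconds_to_wait(response_headers, now=None):
    """Return how long a client should pause before its next download.

    response_headers is any mapping of header name to value. Names are
    compared case-insensitively, as HTTP header names are.
    """
    now = time.time() if now is None else now
    headers = {name.lower(): value for name, value in response_headers.items()}
    try:
        remaining = int(headers["cr-tdm-rate-limit-remaining"])
        reset_at = int(headers["cr-tdm-rate-limit-reset"])
    except (KeyError, ValueError):
        return 0.0  # headers absent or malformed: no rate limit information
    if remaining > 0:
        return 0.0  # downloads are still available in this window
    return max(0.0, reset_at - now)


# Example using the values from the table above.
example_headers = {
    "CR-TDM-Rate-Limit": "1500",
    "CR-TDM-Rate-Limit-Remaining": "76",
    "CR-TDM-Rate-Limit-Reset": "1378072800",
}
print(seconds_to_wait(example_headers, now=1378072770))
```

A tool would call this after each full-text response and sleep for the returned number of seconds before requesting the next file.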

Example Code

Crossref has created an example publisher site, Tinypub, that implements the Crossref REST API, including rate limiting and IP-based subscription access. The code for the example site can be downloaded from GitHub for reference. Please note that this code is only meant to illustrate the workings of the system. It is not in any way intended for production.

Supplementary licenses

Your licensing arrangements may require researchers to agree to extra licenses. The Crossref click-through service allows you to provide these licenses and allows researchers to read and agree to them. More information can be found in the Click-Through Service help.