Text and data mining (TDM) is the automatic analysis and extraction of information from large numbers of documents. Researchers are increasingly interested in performing text and data mining on scholarly content. This requires automated access to the full-text content of large numbers of articles. Crossref metadata helps researchers get access to this content and enables publishers to provide it.
Crossref maintains the database of DOIs for its 4000+ publisher members. Every DOI has bibliographic metadata associated with it, describing various pieces of information about a piece of content, be that a journal article, book chapter or conference proceeding. The metadata deposited can be expanded to identify where the full text of a piece of content can be found, and this information can then be used by researchers interested in text and data mining.
A Common System for Publishers and Researchers
Text and data mining of published scholarly content poses technical and logistical problems for researchers and publishers alike. It is impractical for researchers to negotiate multiple bilateral agreements with subscription-based publishers in order to get authorisation to text and data mine subscribed content, and negotiating those same agreements with large numbers of researchers takes time and effort on the part of the publisher. In short, all parties would benefit from support for standard APIs and data representations that enable text and data mining across both open access and subscription-based publishers.
What does Crossref provide?
The main component of Crossref’s text and data mining services is the Crossref Metadata API, which researchers can use to access the full text of content identified by Crossref DOIs across publisher sites, regardless of the publisher's business model. The API is free for researchers and the public to use.
The Crossref Metadata API has three basic sub-components:
- A common mechanism for providing automated text and data mining tools with direct links to full text on the publisher’s site
- A common mechanism for recording license information in Crossref metadata
- An optional common mechanism for rate-limiting automated text and data mining tools using HTTP headers
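The three sub-components above can be illustrated with a minimal Python sketch. It is not Crossref's implementation. It assumes that a metadata record carries full-text links in a `link` list and license information in a `license` list, and that a publisher signals rate limits with headers named `CR-TDM-Rate-Limit-Remaining` and `CR-TDM-Rate-Limit-Reset` (the reset value taken to be a Unix timestamp in seconds). The sample record, DOI, URLs and helper function names are all hypothetical, chosen only for illustration.

```python
from typing import Any, List, Mapping, Optional

# Hypothetical, hand-written record shaped like the metadata a researcher
# might receive for one DOI. The DOI and URLs are placeholders.
SAMPLE_RECORD = {
    "DOI": "10.5555/12345678",
    "link": [
        {"URL": "https://publisher.example/fulltext/12345678.xml",
         "content-type": "application/xml"},
        {"URL": "https://publisher.example/fulltext/12345678.pdf",
         "content-type": "application/pdf"},
    ],
    "license": [
        {"URL": "https://creativecommons.org/licenses/by/4.0/"},
    ],
}


def full_text_links(record: Mapping[str, Any],
                    content_type: Optional[str] = None) -> List[str]:
    """Return full-text URLs, optionally restricted to one content type."""
    urls = []
    for link in record.get("link", []):
        if content_type is None or link.get("content-type") == content_type:
            urls.append(link["URL"])
    return urls


def license_urls(record: Mapping[str, Any]) -> List[str]:
    """Return the license URLs recorded in the metadata, if any."""
    return [lic["URL"] for lic in record.get("license", []) if "URL" in lic]


def seconds_to_wait(headers: Mapping[str, str], now: float) -> float:
    """Return how long a polite client should pause before its next request.

    Assumes (not confirmed by this document) that the publisher sends the
    remaining request count and a reset time as a Unix timestamp.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    remaining = lowered.get("cr-tdm-rate-limit-remaining")
    reset = lowered.get("cr-tdm-rate-limit-reset")
    if remaining is None or reset is None:
        return 0.0  # rate limiting is optional; no headers means no wait
    if int(remaining) > 0:
        return 0.0
    return max(0.0, float(reset) - now)


if __name__ == "__main__":
    print(full_text_links(SAMPLE_RECORD, "application/xml"))
    print(license_urls(SAMPLE_RECORD))
    print(seconds_to_wait({"CR-TDM-Rate-Limit-Remaining": "0",
                           "CR-TDM-Rate-Limit-Reset": "1000"}, now=990))
```

A mining tool built along these lines would check the license URLs against the licenses it is permitted to use, fetch the full text from the link in its preferred format, and pause whenever the response headers indicate that its request allowance is exhausted.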
Why is this necessary?
In the past, researchers who wished to text and data mine published literature had no common or simple way of accessing the full text of the content they wanted to mine. This was true of both subscription-based and open access content. Consequently, text and data mining users accessed the content in one of two ways:
- Negotiating with publishers to have the content delivered to them, either via physical media or bulk data transfer (e.g. FTP)
- “Screen-scraping” the publisher’s website.
The first option doesn’t scale well across multiple publishers and researchers. It also presents synchronisation problems if the researchers want an ongoing feed of refreshed content.
The issue with the second option is that “screen-scraping” is an inefficient, fragile and error-prone mechanism for identifying and downloading full text. Screen scrapers put a large performance burden on websites and, at the same time, any slight change to the website can break the tool that is doing the screen-scraping.
Crossref text and data mining provides a common solution which works across open access and subscription-based publishers and is free for anyone to use.
With the launch of Crossref Text and Data Mining Services on 28th May 2014, the working group that has overseen the pilot is becoming a committee that advises on the development of the production service. Publishers and vendors who are represented on the committee and are working with Crossref to move this service forward are:
- American Institute of Physics (AIP)
- American Physical Society (APS)
- American Psychological Association (APA)
- HighWire Press
- Institute of Electrical & Electronics Engineers (IEEE)
- Institute of Physics (IoPP)
- Korean Association of Medical Journal Editors (KAMJE)
- Taylor & Francis
- Walter de Gruyter
The additional metadata required to make content available via the service has been deposited by:
- Elsevier
- Hindawi
- All journals published by the Korean Association of Medical Journal Editors (KAMJE)
- The Korean Society of Emergency Medicine
- The Korean Society of Medical Education
- The Korean Society for Preventative Medicine
- The Korean Society of Environmental Health and Toxicology
- The Korean Society of Epidemiology
- The Korean Council of Science Editors
- The Korean Movement Disorders Society
- The Korean Cancer Association
- The Korean Society of Exercise Rehabilitation
- The Korean Society of Ultrasound in Medicine
- The Asian-Australasian Association of Animal Production Societies
- The National Health Personnel Licensing Examination Board of the Republic of Korea
- Korean Acupuncture & Moxibustion Medicine Society
- The Korean Society for Stem Cell Research
- Pensoft Publishers
How does it work?