How much overlap do we have with the HathiTrust?


Since ZSR Library started looking at the HathiTrust as a potential source of out-of-copyright electronic texts, people have been asking, ‘How does our collection compare to HT?’ The short answer, based on a comparison of valid OCLC numbers, is that 366,800 of our titles match among HT's 8.5 million titles. That is about 29% of our collection, slightly less than the average coverage recently reported for research libraries in general. Granted, this is a simple first pass using OCLC numbers, and it most likely leaves out a number of titles. It is interesting to note that of the 1.5M 035A entries in our database, nearly a third contain numbers that are not OCLC numbers.
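The OCLC-number comparison described above can be sketched roughly as follows. This is a minimal illustration, not the actual process used here: the function names `normalize_oclc` and `match_holdings` are hypothetical, and it assumes that 035A values holding OCLC numbers carry the `(OCoLC)` prefix, optionally followed by `ocm`, `ocn`, or `on` and leading zeros. Anything else is treated as a non-OCLC identifier and skipped.

```python
import re

# Assumed shape of an OCLC number in 035A: "(OCoLC)" prefix, an optional
# ocm/ocn/on tag, optional leading zeros, then the digits themselves.
_OCLC_RE = re.compile(r"^\(OCoLC\)\s*(?:ocm|ocn|on)?0*(\d+)$", re.IGNORECASE)


def normalize_oclc(value):
    """Return the bare OCLC number as a string, or None if value is not one."""
    match = _OCLC_RE.match(value.strip())
    return match.group(1) if match else None


def match_holdings(local_035a, hathi_oclc_numbers):
    """Return the set of OCLC numbers present in both collections.

    local_035a: iterable of raw 035A strings from the local catalog.
    hathi_oclc_numbers: set of bare OCLC number strings from HathiTrust data.
    """
    local = {n for n in map(normalize_oclc, local_035a) if n is not None}
    return local & hathi_oclc_numbers


if __name__ == "__main__":
    sample = ["(OCoLC)ocm00012345", "(OCoLC)999", "(DLC)   2001012345"]
    print(match_holdings(sample, {"12345", "777"}))
```

Records whose 035A values fail the pattern simply drop out of the comparison, which is one reason a first pass like this undercounts the true overlap.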

A second question that has been raised is, “What is the copyright status of these matched titles?” Unfortunately, over 87% of our matched titles are in copyright, which means that although they are digitized, we can’t use them. Our matching process found that only about 11% of our matched titles are in the public domain (out of a database of 2.2M public domain records in the HathiTrust). This indicates that there are some good opportunities to expand our representation of public domain digitized resources in our catalog. As you might expect, however, our MARC records for public domain resources are also more likely to lack OCLC numbers. For instance, our Eighteenth Century Collections Online MARC records do not have OCLC numbers. For the rest of the codes listed in the table below, you can visit the HathiTrust site.

This comparison is just our first attempt at matching with the HathiTrust. HT offers some sophisticated data APIs and a wide range of identifiers that we can work with to see how our holdings compare. What we do about this is an open question, but I thought it would be interesting to get an initial look at how our collection compares.

Want to see the data? I can export the matches for you so you can run your own reports. Curious about the process? Visit my office and I can show you the database that was used to run the comparisons.

Table of copyright policy for matched titles

Rights code	Number of records	% of matched records
cc-by	2	0.0005%
cc-by-nc	16	0.0044%
cc-by-nc-n	1	0.0003%
cc-by-nc-s	24	0.0065%
ic	320274	87.1446%
nobody	31	0.0084%
opb	137	0.0373%
pd	32100	8.7342%
pdus	9489	2.5819%
und	4809	1.3085%
world	637	0.1733%
Total matches	367520	100.0000%
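
As a quick check on the percentages, the counts in the table can be tallied in a few lines of Python. The rights codes are copied as they appear in the table. Grouping `pd` and `pdus` together as public domain is an assumption made for this illustration.

```python
# Counts copied from the table above; codes are kept as printed there.
counts = {
    "cc-by": 2,
    "cc-by-nc": 16,
    "cc-by-nc-n": 1,
    "cc-by-nc-s": 24,
    "ic": 320274,
    "nobody": 31,
    "opb": 137,
    "pd": 32100,
    "pdus": 9489,
    "und": 4809,
    "world": 637,
}

total = sum(counts.values())

# Share of matched records for each rights code, as a percentage.
shares = {code: 100 * n / total for code, n in counts.items()}

# Assumption: treat pd and pdus together as "public domain".
public_domain_share = shares["pd"] + shares["pdus"]

if __name__ == "__main__":
    for code, pct in shares.items():
        print(f"{code}\t{counts[code]}\t{pct:.4f}%")
    print(f"Total matches\t{total}")
    print(f"Public domain (pd + pdus)\t{public_domain_share:.2f}%")
```

Under that grouping, public domain titles come to roughly 11.3% of matches and in-copyright (`ic`) titles to about 87.1%.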

2 Comments on ‘How much overlap do we have with the HathiTrust?’

Katherine

I wager that if Special Collections materials were completely included in your calculations, which uneven cataloguing makes difficult, the percentage would rise by at least a few points. When I find a book in the Babcock or Smith collections that looks interesting and useful, I often find that it has been included in the Hathitrust or Google Books collections.

I find your research very interesting and helpful.

Katherine

6:02PM, 5/6/2011

Lynn

These are data I’ve been wanting to see. I will want to continue the conversation when I get back!

6:02PM, 5/6/2011
