Last month, I was happy to announce the availability of Open Access Theses and Dissertations (OATD), otherwise known as the 1.6-million-record bibliographic database I’ve been building on my laptop, and that we’re now hosting at http://oatd.org/.
It won’t be news to anyone that libraries and grad schools have worked hard over the last decade or more to begin publishing electronic theses and dissertations (ETDs). At many schools – including Wake – this includes putting an open access copy of the ETD in an institutional repository. The problem that has bothered me for a long time is that we’ve had to focus so much on the policy, advocacy, and workflow issues that no one has had time to really work on the problem of discovery for this unique and valuable content. The available search services either focus on commercially available, closed-access content (ProQuest Dissertations and Theses); toss ETDs in with an overwhelming number of other records (Google Scholar); have oddly incomplete records and/or idiosyncratic search interfaces that many users find difficult to use well (Elsevier’s Scirus and a VTLS Visualizer, two “semi-official” ETD services); or mix together open- and closed-access ETDs with no way to tell them apart (Elsevier and VTLS again).
OATD’s aim is to be the best possible resource for finding open access graduate theses and dissertations published around the world. It currently includes metadata for 1.6 million records from over 800 universities.
OATD’s main components are:
- A harvester, which pulls metadata from about 350 repositories around the world using a standard from the Open Archives Initiative called the Protocol for Metadata Harvesting (OAI-PMH). This is a way for repositories to share their metadata openly.
- A Solr index. This is the same open source indexing program that VuFind uses.
- A web search interface.
- A feature I think of as “full text turbo boost”: I run a lightweight web crawler to grab an ETD’s full text, pull screenshots of the first few pages, pull sample images, and index the first thirty pages in order to show search hits in context. This isn’t really an attempt to make a full-text index, but a way to add features to the search hits you get through a good citation + abstract record.
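To make the harvesting step in the list above a little more concrete, here is a minimal Python sketch of what talking OAI-PMH looks like. This is not OATD’s actual code: the function names, the `repo.example.edu` URL, and the sample response are all made up for illustration. It assumes the repository exposes the standard `oai_dc` (Dublin Core) format, builds a `ListRecords` request URL, and parses one page of a response, skipping deleted records and picking up the `resumptionToken` that points to the next page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request URL (hypothetical helper).

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, metadataPrefix must not be sent again.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def dc_values(record, tag):
    """Return all non-empty Dublin Core values for one tag in a record."""
    found = record.iterfind(".//dc:" + tag, NS)
    return [el.text.strip() for el in found if el.text and el.text.strip()]


def parse_list_records(xml_text):
    """Parse one page of a ListRecords response.

    Returns (records, resumption_token); the token is None on the last page.
    """
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        header = rec.find("oai:header", NS)
        if header is not None and header.get("status") == "deleted":
            continue  # the repository has withdrawn this record
        records.append({
            "oai_id": header.findtext("oai:identifier", namespaces=NS)
            if header is not None else None,
            "titles": dc_values(rec, "title"),
            "creators": dc_values(rec, "creator"),
            "publishers": dc_values(rec, "publisher"),
            "identifiers": dc_values(rec, "identifier"),
        })
    token_el = root.find(".//oai:resumptionToken", NS)
    token = None
    if token_el is not None and token_el.text and token_el.text.strip():
        token = token_el.text.strip()
    return records, token


# A made-up response, trimmed to the parts the parser looks at.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repo.example.edu:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A Made-Up Thesis Title</dc:title>
          <dc:creator>Doe, Jane</dc:creator>
          <dc:publisher>Example University</dc:publisher>
          <dc:identifier>https://repo.example.edu/handle/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <record>
      <header status="deleted"><identifier>oai:repo.example.edu:2</identifier></header>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

if __name__ == "__main__":
    records, token = parse_list_records(SAMPLE_RESPONSE)
    print(records)
    print(list_records_url("https://repo.example.edu/oai", token))
```

A real harvester would fetch each URL over HTTP and keep looping until no token comes back; I’ve left the network part out so the sketch runs on its own.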
Here’s what I’ve learned:
A lot of repositories have really lousy metadata. Not WakeSpace, of course, but – you know – other schools. Name something you’d expect in a record for a thesis, and I’ve seen examples where it’s missing or stuffed into the wrong field. The name of the school (chronically missing: you get a lot of “publishers” named eScholar-tastic DigiSpace @ Tech!); any URL for the record; the degree and department; advisor names; even the author’s name have all come up missing. At least the no-authors school responded with a rueful “Oops” when I let them know. OATD has a lot of ad hoc code to make specific schools’ records look as good as they can, but there are still a lot of sites that could easily communicate more about their content.
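For a flavor of what that kind of per-school cleanup can look like, here is a small Python sketch. It is an illustration under my own assumptions, not the code OATD runs: the field names, the `SCHOOL_BY_REPOSITORY` table, and the example URLs are all hypothetical. The idea is simply to override a branded “publisher” with the real school name from a hand-maintained table, pick a usable URL out of the identifiers, and report which expected fields are still missing.

```python
# Hand-maintained table (hypothetical): repository endpoint -> real school name.
SCHOOL_BY_REPOSITORY = {
    "https://repo.example.edu/oai": "Example University",
}

# Fields a thesis record should have before it is worth showing to a searcher.
REQUIRED_FIELDS = ("title", "authors", "publisher", "url")


def clean_record(record, repository_url):
    """Return a cleaned copy of a harvested record; the input is not modified."""
    cleaned = dict(record)

    # Replace a repository brand name with the actual school, when we know it.
    school = SCHOOL_BY_REPOSITORY.get(repository_url)
    if school:
        cleaned["publisher"] = school

    # If no URL field was supplied, use the first identifier that looks like one.
    if not cleaned.get("url"):
        for identifier in cleaned.get("identifiers", []):
            if identifier.startswith(("http://", "https://")):
                cleaned["url"] = identifier
                break

    return cleaned


def missing_fields(record):
    """List the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


if __name__ == "__main__":
    raw = {
        "title": "A Made-Up Thesis Title",
        "authors": [],  # the "no-authors school" case
        "publisher": "eScholar-tastic DigiSpace @ Tech!",
        "identifiers": ["hdl:1234/5", "https://repo.example.edu/handle/1234/5"],
    }
    fixed = clean_record(raw, "https://repo.example.edu/oai")
    print(fixed["publisher"], fixed["url"], missing_fields(fixed))
```

Code like this can paper over a wrong publisher or a misplaced URL, but it can’t invent a missing author, which is why the list of missing fields is worth sending back to the school.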
A lot of schools honestly never thought about their metadata actually being used by anyone or anything outside their own system.
When Googlebot and Bingbot aren’t hitting OATD hard enough to crash it (we keep tweaking the settings to let them index the site without killing it), people are actually using it. Google Analytics has an addictive real-time display: at the moment, 17 people are using the site, not just in the U.S. but also in Spain, Canada, Lithuania, Madagascar (really!), and a handful of other countries.
When I ask people what they’d like to see in OATD, librarians almost always say browsing by author, but no one else does. I don’t (yet?) have a good way to implement browsing, but I’m not sure it’s a real-world need.