On Friday, July 26, I attended the ML4ARC: Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives workshop at the University of North Carolina, Chapel Hill, an event that focused on “applications of machine learning, deep learning, and natural language processing to support use, access, and analysis of digital primary source materials.” The daylong series of talks took place in the Pleasants Family Room at Wilson Library and featured speakers from institutions around the country, including Yale, UNC, Duke, Library of Congress, Open Preservation Foundation, and the National Archives. It was hosted by the RATOM: Review, Appraisal, and Triage of (online) Mail project, which launched in 2019 and is led by PI Dr. Cal Lee.
The first set of talks focused on Data Visualization and Access. UNC Professor Ray Wang’s work in applying machine learning to large archival document collections was a highlight of the session. As archives acquire more born-digital content and item-level classification becomes unwieldy, machine learning can support collections staff in managing large collections and in understanding what’s in their databases.
In the New Workflows session, Emily Higgs, who has just wrapped up a term as NC State Library Fellow, described her work developing named entity recognition (NER) functionality in the library’s born-digital processing workflows. Evolving NER tools (such as Stanford’s online tool or the commercial Semantria) can help collections managers identify the people and places in large-scale digitized collections more quickly and accurately than manual processing.
In her presentation, “I don’t understand, Wikileaks was able to get everyone else’s records up very quickly,” Kathleen Jordan (Library of Virginia) described her team’s challenges in providing public access to the email records from recent VA governors’ administrations. One of the highlights (lowlights?) of her talk was a video recording she shared of VA legislators discussing government email records and the apparent bottleneck created by the immense challenge of processing hundreds of thousands of emails from Kaine and prior governors’ offices, many of which contain PII and raise other privacy issues. The legislators suggested giving the Library of Virginia a hard deadline for processing them: “Nothing motivates like a deadline,” they said. (Um…true, but…) The team is working through the technical and logistical challenges of providing ongoing access to these critical public records without a guarantee that the infrastructure and funding needed to support the work will continue to be available.
I feel certain that the name of the next session, Interoperability and System Dependencies, was not designed to win at the polls. It was, however, full of Grape Nuts-quality digital nutrition. Matthew Farrell (Duke) talked about his work with the OSSArcFlow project, which recently released a comprehensive survey of institutional digital processing workflows. Justin Simpson (Artefactual Systems) examined possible strategies for making preservation action repositories replicable across systems (think Preservica <-> Archivematica). And Euan Cochrane (Yale) talked about his work with the EaaSI project, which aims to develop preservation infrastructure for software.
Given the topics of the day, it seems email might be here to stay after 40+ years as our information overlord. It’s also fairly evident from the talks that email’s creators are sadistic monsters who have no idea how difficult it is to record-manage the stuff. If the Library of Congress and National Archives are throwing up their hands and saying, sure, you can just send it to us as PDF files, clearly, we’re on the wrong side of history?
Hand-wringing aside, one of the talks I appreciated was by Glynn Edwards (Stanford), who delved into some of the issues involved in performing natural language processing (NLP) with a software tool called ePADD. Tools in this domain are still very much in development, which is one of the reasons the RATOM project sought to bring together experts working in this area. The TOMES project out of the NC State Archives is another recent email archiving “big project” that may result in some best (or maybe just better?) practices for government agencies and state libraries. The “capstone” approach, in which the most “critical” email accounts in an organization (think top of the org chart) are prioritized for email accessioning, has become a respected approach that enables archives to dispose of unneeded email records more decisively.
Kam Woods (UNC) finished up the day with an overview of RATOM and its goals, and participants had a chance to select a breakout group for extended discussions about the day’s topics. Overall, ML4ARC touched on quite a few emerging and important topics, and I look forward to following the RATOM project as it establishes strategies and tools for some of records management’s most stubborn challenges.