TPD @ CNI: I Am Big Data and So Can You!

Back on the train this year. I tried to fly to DC last year and the weather gods weren’t having it (yes, I caused the 2018 Snowtastrophe – sorry, WTL committee).

The CNI fall meeting continues to be the best three-day library tech conference that you can pack into 26 hours. They have changed the format a little so that a lot of the hour-long slots are divided among two or more shorter, related presentations. This goes a long way to reduce the times when similar presentations are happening simultaneously (and also the times when the second half hour of a presentation drags just a bit…). The result is a de facto set of subject tracks.

First, though, was Cliff Lynch’s regular 30,000-foot overview of the landscape, which by itself is basically worth the price of admission. Things he’s following this year: What are the roles for libraries/archives in an age of increasingly ephemeral records (Twitter, Instagram, et al.)? What happens when our ambitions to digitize deteriorating print materials run up against newfangled facial recognition, handwriting recognition, etc.? What are our opportunities/responsibilities to record 3D places (this in the year that Notre Dame burned in Paris, and weeks after Venice suffered historically damaging floods)? As quantum computing becomes more of a real thing, a likely outcome is that current encryption methods will break; how do we keep data secure? Content is increasingly marketed to consumers in formats that aren’t sold to or cannot work in libraries and archives; how do we address that? As more content becomes online-only, how do we address the fact that rural/urban (and rich neighborhood/poor neighborhood) bandwidth discrepancies haven’t gotten better? And, to wrap up, a handful of conflicts between geopolitics (with borders) and the Internet (generally without): different privacy laws; nation-level firewalls; content licenses with geographic restrictions.

On to the sessions. For me, a lot of AI and machine learning, mostly centered around the realization that Our Stuff can be seen to constitute Big Data. This has upsides. This has downsides.

First, a couple of warning notes from NISO’s Jason Griffey. As the current hot items for software development, artificial intelligence and machine learning are advancing quickly; they also take advantage of the constant advance in hardware capabilities. Or in other words, today’s AI is the least capable it will ever be and going forward it will continually get faster/smarter/cheaper/easier to use…and better? But on the other hand, we are on the verge of putting together AI platforms that can: do a literature search; compose summaries or whole papers from those results; and edit and proofread that paper. Voila, instant scholarship.

A quick side trip through AI platforms more of us know and love: another presentation on shoehorning library-related services into our voice-operated digital assistants. Or in other words, “Alexa, ask [CorporateThirdPartyIntegrator] what hours the Flurgelheim Library is open today.” Still pretty clunky, IMO, still Alexa-only, and still requires getting into bed with third-party integrators about whom we might not know much… But if the implementation I saw last summer at ALA was half baked, this is at least five-eighths baked. In the long run, though, my money is on well-structured web pages to feed this information directly through Alexa, Siri, and the Google Assistant.
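For the curious, here is roughly what I mean by "well-structured." This is a minimal sketch, assuming the library publishes its hours as schema.org JSON-LD in the page; the library name, URL, and hours are all made up, and whether any given assistant actually reads this markup is my assumption, not something shown at the session.

```python
import json


def library_hours_jsonld(name, url, hours):
    """Build a schema.org Library description with opening hours.

    `hours` maps a day name (e.g. "Monday") to an (opens, closes) pair
    of "HH:MM" strings. Every value used below is hypothetical.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Library",
        "name": name,
        "url": url,
        "openingHoursSpecification": [
            {
                "@type": "OpeningHoursSpecification",
                "dayOfWeek": f"https://schema.org/{day}",
                "opens": opens,
                "closes": closes,
            }
            for day, (opens, closes) in hours.items()
        ],
    }


def as_script_tag(data):
    """Wrap the JSON-LD in the script tag that would sit in the page's HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical library and hours, for illustration only.
doc = library_hours_jsonld(
    "Flurgelheim Library",
    "https://library.example.edu/",
    {"Monday": ("08:00", "22:00"), "Saturday": ("10:00", "18:00")},
)
print(as_script_tag(doc))
```

The appeal is that the hours live on a page the library already controls, with no third-party integrator in the middle.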

The good people at Virginia Tech and Old Dominion are grappling with the realization that 30,000+ open-access electronic theses and dissertations (ETDs) make for a serious research corpus. The tools of large-scale text mining can reveal fascinating connections, but first you have to pick apart thousands of book-length PDFs and programmatically identify structures like chapter divisions; tables, charts, and figures; and cited references. Some AI-driven subject classification would be a big help also.
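To give a feel for why that "pick apart" step is hard, here is a deliberately naive sketch, assuming the PDF text has already been extracted to plain text. The pattern and the `find_headings` helper are my own illustration, not the presenters' method, and real theses will defeat this heuristic in all sorts of ways.

```python
import re

# Hypothetical heuristic: a heading is a line that starts with "Chapter"
# plus an Arabic or Roman numeral, or is exactly "References"/"Bibliography".
HEADING_RE = re.compile(
    r"^\s*(chapter\s+(?:\d+|[ivxlc]+)\b.*|references|bibliography)\s*$",
    re.IGNORECASE,
)


def find_headings(text):
    """Return (line_number, heading) pairs for lines that look like headings."""
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = HEADING_RE.match(line)
        if match:
            found.append((number, match.group(1).strip()))
    return found


# Made-up sample standing in for text extracted from a thesis PDF.
sample = """Chapter 1: Introduction
Text mining at scale needs structure.
CHAPTER II Methods
We parsed many PDFs.
References
Smith, J. (2010). A paper.
"""

headings = find_headings(sample)
for number, heading in headings:
    print(number, heading)
```

Multiply this by every formatting convention used across decades of dissertations, and you can see why people are reaching for machine learning instead of regular expressions.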

Which brings us to images. We have lots of them, both scanned and born digital. We work at full capacity to add thousands of them to a digital collection, and too often they end up with descriptive text like “tree in front of building” or “telephone operators.” This is why at least three schools were presenting at CNI about automatic annotation, or “having software describe images for you.”

There is an obvious appeal to dumping a set of 10,000 images into an app, going to lunch, and coming back to a set of good image descriptions. There are, however, some potential pitfalls. A couple of them involve the issue of training data: AI apps learn how to describe images by looking at lots and lots of other images. But the major apps in this field have been designed to compare Instagram-y pictures against other Instagram-y pictures and may find themselves lost trying to describe your daguerreotypes of 1850s civic leaders, or your black and white WPA photos of Dust Bowl-era farmers.

There’s also the problem of inherent bias in the selection of training data. Teach your AI to identify airplanes by only showing it modern jets, and it will fall down trying to identify a 1920s biplane. Teach it to identify human faces by only showing it white people, and the results can be more than embarrassing. Human oversight won’t be going away for a while yet.

The closing plenary was a talk by Kate Eichhorn, author of The End of Forgetting: Growing Up with Social Media. Her book, and her talk, ask people who went through their awkward adolescence before 2004 (the start of Facebook), or 2006 (Twitter), or 2010 (Instagram), to think about being a youth today. Remember that night you and your buddies went cow tipping, or the year you tried really hard to look like Johnny Rotten? How about that big Confederate flag you used to have in your room, which no one at the time thought was a big deal? Or that party where you got really drunk and started doing _____ with _____? Aren’t you glad those things are pretty much forgotten? Because adolescents these days are still as susceptible to dumb ideas and peer pressure as we were, only their misadventures are all recorded and will live online forever, to be unearthed by future admissions officers, employers, dates, journalists, congressional committees, etc. We’re only starting to see what happens when a generation leaves a permanent photographic record of their lives, in painfully personal detail.