As I work with librarians at BYU's Harold B. Lee Library in revising the Mormon Literature Database, and as I work with the media board of BYU Studies as we contemplate how best to do our online academic publishing, it is becoming clearer to me that those developing digital resources for the study of Mormonism would be wise to adopt common metadata, so that data developed independently can be aggregated. I haven't been quite sure how to go about that, but I feel much closer today following our weekly MLDB development meeting (attended by Janet Bradford, Ruth Eldredge, Jesse ____, and myself).
Bill Lund, head of IT for the library, attended, and he educated us on how we could ready our database for a more meaningful interface with the web. After his explanations and some research I did after the meeting, I believe I can now better articulate to others the need for standard metadata within Mormon Studies. Something of a position paper follows--perhaps the draft of a paper for a conference.
Preparing for the Harvest:
Why Metadata Matters to Mormon Studies
Gideon Burton
Early in the days of the Internet there was an intellectual gold rush as libraries, scholars, and individuals hurried to digitize anything that would hold still under a scanner. Just "getting something up" on the web seemed self-evidently worthwhile as we marveled at what could be placed or discovered online. Websites and databases were popping up everywhere. Hooray! And if the amount of data seemed bewildering, we placed our faith in Google and other search engines to lead researchers to our sites.
Online resources for the study of Mormonism have proliferated, set up and maintained by organizations, scholars, librarians, fans, critics, and volunteers. As with other fields of study, subject portals have emerged to keep up with the variety and number of resources. Email lists and now blogs help spread the word about what's available.
But this democratic approach to developing web resources has its dark side. The authority and quality of sites are an ongoing concern, a problem directly related to search engines. General search engines give general results, casting the net wide enough to include frivolous and irrelevant websites along with substantive studies.
Specific fields of study are best researched through discipline-specific databases whose contents are not always visible to Google. There are also important archival materials not indexed in databases of published scholarship. Of course there are subject gateways and finding aids prepared by librarians, and these remain important, but too many costly repositories and databases languish in isolation as "information silos": because their records are not item-searchable through Google, people don't know they exist.
Semantic Web
Enter the "Semantic Web." That's a fancy term referring to the way that machines are becoming able to "read" the content of various data sources as long as those sources employ data about their data--"metadata." Metadata is not a new idea. Keywords have been buried in HTML code for a long time as an aid to webcrawlers. But metadata has gotten smarter. This is because of XML
XML
Extensible Markup Language (XML) is a standard syntax by which one labels one's data. Data properly marked up with XML is "machine readable" because the information is not format-dependent. This makes it possible to reformat, repurpose, and apply data in ways not anticipated by the original data creators. Data without XML markup might work fine within its native environment, but it has no passport beyond itself. Proper markup with XML multiplies the usability and value of the data: not only can the data be better found, it can be applied in creative ways, combined with other data streams, and ported over to other environments where it takes on additional significance. XML resembles HTML, except that you are free to invent your own tags and mark up your documents however you want. There are, however, best practices to observe--practices which invite multiple parties to agree upon categories of metadata, as I will explain.
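To make this concrete, here is a minimal sketch of a bibliographic record marked up with XML. The element names and values are my own invention for illustration; the point is that the labels travel with the data, so any machine that encounters the record can tell a title from a date:

    <article>
      <title>An Example Title</title>
      <author>Doe, Jane</author>
      <journal>An Example Journal</journal>
      <year>1999</year>
      <language>eng</language>
    </article>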
Bringing Data Out of Exile
As people create web resources and databases, they create fields appropriate to their subject matter. However, not everyone who would benefit from that data will think to look for it by way of the resource in question; they might be unaware of its existence. How do people find the data if they don't know about the database and its item-level records don't come up through Google? Through metadata that is compliant with the Dublin Core metadata standard (and possibly other XML schema), as that metadata is made available to an Aggregator/Harvester through an OAI interface or Responder. I will explain:
Metadata
This is data about data. It is used not just to mark up digital objects (such as images), but to make databases machine-readable, enabling an Aggregator/Harvester to understand what collections are available, and what categories of data are available to harvest. This requires reference to a metadata standard, Dublin Core, and optionally, to one or more XML Schema. Dublin Core consists of 15 simple metadata elements such as "Title," "Date," and "Language."
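Expressed in the simple Dublin Core elements (here in the oai_dc XML format commonly used for harvesting), the hypothetical record sketched above would look something like this:

    <oai_dc:dc
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>An Example Title</dc:title>
      <dc:creator>Doe, Jane</dc:creator>
      <dc:date>1999</dc:date>
      <dc:language>eng</dc:language>
      <dc:type>Text</dc:type>
    </oai_dc:dc>

Notice that the home-grown labels ("author," "year") have been traded for the shared ones ("creator," "date") that any harvester will recognize.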
The Aggregator/Harvester
An aggregator harvests data from across multiple data sources and serves as something like a database of databases. Like a web crawler, it searches for data, but only seeks OAI Interfaces (or "responders") that are associated with data sources (explained below). The aggregator queries these interfaces by way of metadata to find just those subjects or types of content it is interested in. There are three types of aggregators, so far as I have discovered:
- General Aggregators. Examples include OAIster, Infomine, and Arc -- all tools for searching across diverse digital archives.
- Subject-Specific Aggregators. A good example of an aggregator devoted to a specific subject is Avano, a marine and aquatic sciences OAI harvester. It harvests 64 different open archives, making it possible to do a federated search across all those archives at once. For example, I searched using the terms "coral," "reef," "global," and "warming" and received results from an E-Repository of Ocean Publications called OceanDocs; from an institutional repository at Stockholm University; and from an Oxford academic journal, Integrative & Comparative Biology.
- Media-specific Aggregators. A good example of an aggregator devoted to a specific medium is the Sheet Music Consortium of UCLA, which indexes sheet music from seven different collections, such as the Library of Congress and Special Collections at Duke University. Searching "Scott Joplin" retrieved records from Indiana University and from The Maine Music Box, an archive of digitized musical scores.
In each case, the aggregators return search results that reference items found in the contributing data sources.
OAI Interface / Responder
An Open Archives Initiative interface is sometimes called a responder because it answers queries made by an aggregator/harvester. The two machines conduct a kind of conversation, made possible through a communications protocol known as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The aggregator/harvester says, "Hey, what kind of data do you guys have?" Through the OAI interface the response comes back, "Well, we've got collections about X, Y, and Z." "Oh," says the aggregator/harvester, "tell me just about Z." The OAI interface (through an XML document with proper metadata) indicates what sorts of data are there for the taking: digital objects, types of information, etc. Based on this, the aggregator/harvester then calls or references that data when users employ the aggregator for federated searching.
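In practice the conversation is carried out as a handful of HTTP requests and XML responses. The sketch below assumes a hypothetical repository at example.org; the verbs (Identify, ListSets, ListRecords) are the standard OAI-PMH ones, and the response has been trimmed to its skeleton:

    <!-- The harvester's questions, sent as simple HTTP requests:
         http://example.org/oai?verb=Identify
         http://example.org/oai?verb=ListSets
         http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc -->
    <!-- A trimmed sketch of the repository's answer to ListRecords -->
    <OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
      <responseDate>2007-01-01T00:00:00Z</responseDate>
      <request verb="ListRecords" metadataPrefix="oai_dc">http://example.org/oai</request>
      <ListRecords>
        <record>
          <header>
            <identifier>oai:example.org:record-1</identifier>
            <datestamp>2006-12-31</datestamp>
          </header>
          <metadata>
            <!-- the Dublin Core record shown earlier would sit here -->
          </metadata>
        </record>
      </ListRecords>
    </OAI-PMH>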
XML Schema
XML Schema are metadata standards more specific than Dublin Core. All aggregators will employ Dublin Core to find basic data types ("Creator," "Title," "Description"), but some will also seek metadata more specific to a given subject or purpose. For example, librarians have long used MARC records to catalogue their holdings. These provide a host of very specific database fields, and unsurprisingly a metadata standard based on MARC records has been developed (MARCXML). There are subsets and refinements of this, such as MODS (Metadata Object Description Schema), a bibliographic metadata schema that carries selected data from MARC 21 records, and MADS (Metadata Authority Description Schema), meant for use alongside MODS. Unlike Dublin Core, MADS can be used to provide metadata not just about a "Creator," for example, but about various agents (people, organizations), events, etc.
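To give a sense of the added granularity, here is the same hypothetical record sketched in MODS. Where Dublin Core has one flat "Creator" element, MODS distinguishes the name, its type, and the role the person played:

    <mods xmlns="http://www.loc.gov/mods/v3">
      <titleInfo>
        <title>An Example Title</title>
      </titleInfo>
      <name type="personal">
        <namePart>Doe, Jane</namePart>
        <role>
          <roleTerm type="text">author</roleTerm>
        </role>
      </name>
      <originInfo>
        <dateIssued>1999</dateIssued>
      </originInfo>
      <language>
        <languageTerm type="code">eng</languageTerm>
      </language>
    </mods>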
In short, XML schema like these are essential for coordinating the maximum interoperability of open-access data sources because they make possible more granular searching and finding through federated means. Roy Tennant of the California Digital Library advises data providers to "Provide metadata in a variety of formats; at minimum, the required Dublin Core and the richest, most granular form of metadata available." This means using multiple XML schema appropriate to the data, the institution, and the purpose of the archive or database in question. Because there are variations in metadata standards, it is now the custom to create a crosswalk to navigate among them.
Crosswalk
A crosswalk is a way of merging or relating differing metadata schemes. Usually this begins by taking the very general Dublin Core standard, then listing the fields from one's more specific schema and relating those fields to the general ones within Dublin Core. Additional schema can be similarly mapped.
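A crosswalk can live in a simple table, but it can also be made machine-actionable. As a rough sketch, the XSLT stylesheet below maps a few MODS fields onto their Dublin Core counterparts (only three mappings are shown; a real crosswalk would account for every field):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:mods="http://www.loc.gov/mods/v3"
        xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <xsl:template match="/mods:mods">
        <oai_dc:dc>
          <!-- MODS titleInfo/title maps to Dublin Core title -->
          <dc:title>
            <xsl:value-of select="mods:titleInfo/mods:title"/>
          </dc:title>
          <!-- MODS name/namePart maps to Dublin Core creator -->
          <dc:creator>
            <xsl:value-of select="mods:name/mods:namePart"/>
          </dc:creator>
          <!-- MODS originInfo/dateIssued maps to Dublin Core date -->
          <dc:date>
            <xsl:value-of select="mods:originInfo/mods:dateIssued"/>
          </dc:date>
        </oai_dc:dc>
      </xsl:template>
    </xsl:stylesheet>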
Application to Mormon Studies (forthcoming)