Jan 21 2012
I’m obsessing about a talk I’m giving at one of my favorite conferences, code4lib. My talk proposal is about how we deal with whatever metadata comes our way. For those of you not inside my head at this moment, “we” is where I work, part of which involves developing software for and maintaining a Digital Asset Management System (DAMS). A digital asset is just a computer file, or set of files – often a picture, sound file, PDF, or video – that you have some desire to promote beyond just sitting on one person’s computer, unmanaged. We all have computer files coming out of our ears, but we know there are some that are more “valuable” than others and that we’d like to give special treatment. So we call them digital assets and get them moved into some sort of management system beyond the random file systems on our desktops. This system is a DAMS.
I see a DAMS as a secure, reliable file system set up with good organization rules, and a goal of making the assets easy to find. Here are some of the rules I like to see followed:
- Describe the assets as thoroughly as possible and/or practicable. This is your metadata.
- Save this metadata and keep it associated with the assets.
- Keep the metadata in a system that is flexible enough to accommodate attributes that you didn’t know about when you started – people change their minds about what they want all the time. We chose RDF for this and this is what my talk will be about.
- Create unique identifiers for the assets. We chose the ARK spec for this.
- Store the assets as simply as possible – don’t create a new file system because no one is going to understand it 30 years from now. We chose the Pairtree spec for this.
- Back your assets up – think like an IT person and have a data lifecycle plan.
- Preserve your assets and metadata – think like a Librarian/Archivist and store the assets in at least three, geographically separated places. We use Chronopolis for this.
- Export your metadata to the file system on a regular basis – this way the metadata becomes a digital asset/computer file too.
- Put all that lovely metadata into a full text search index. We use Solr for this.
- Tie all of the metadata to the unique identifier for the assets.
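To make the Pairtree rule above a little more concrete, here is a rough Python sketch of how an identifier such as an ARK can be turned into a directory path. It follows the character cleaning and two-character splitting described in the Pairtree spec, but it is only an illustration, not the code our DAMS runs. The function names are made up, and the sample ARK is the example used in the Pairtree spec rather than one of our assets.

```python
import re

# Single-character swaps from the Pairtree identifier cleaning step.
SINGLE_CHAR_MAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_identifier(identifier: str) -> str:
    """Make an identifier safe to use as file system path segments."""

    def hex_encode(match):
        # Encode each byte of the matched character as ^hh.
        return "".join("^%02x" % byte for byte in match.group(0).encode("utf-8"))

    # Step 1: hex-encode non-visible characters and the ones that tend to
    # cause trouble in file names.
    encoded = re.sub(r'[^\x21-\x7e]|["*+,<=>?\\^|]', hex_encode, identifier)
    # Step 2: swap the common-but-awkward characters for safe ones.
    return encoded.translate(SINGLE_CHAR_MAP)


def pairtree_path(identifier: str) -> str:
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_identifier(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return "/".join(pairs) + "/"


print(pairtree_path("ark:/13030/xt12t3"))  # ar/k+/=1/30/30/=x/t1/2t/3/
```

The appeal is that the result is nothing but ordinary directories, so anyone with a file browser can still find an asset decades from now.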
As I noted above, I’m only talking about RDF and metadata at code4lib. What I’m obsessing about is that the talk is only 20 minutes. I usually talk about our DAMS in about an hour, and I’m only getting warmed up in the first 20 minutes. So I’ve got to empty my head of all this other DAMS stuff and laser down on just the RDF and metadata part.
We didn’t choose RDF because of the newish Linked Open Data (LOD) movement. Our (now retired) architect, Chris Frymann, was aware of the possibility, but this was nearly ten years ago and LOD was barely a twinkle on the horizon. Before this job, I had been working in industry for years, so this approach looked silly and academic. Once Chris had me drink the RDF Kool-Aid, we envisioned a system that embraced flexibility from the start.
RDF is so simple and so terrifyingly different from the fixed database world that I was used to. Instead of a well defined table, or tables, we had millions of triples. We didn’t even have a triple store, just three columns in an SQL database.
What is wonderful about this approach is that each triple is somewhat self documenting. A triple is made up of a Subject, a Predicate, and an Object.
We use our asset’s unique identifier, the ARK, as our Subject. Now we needed to describe the assets with their metadata – so we started creating Predicates that could hold types of metadata. Three years later…. no really. This was probably one of the hardest things to do, and I’m not sure we’ll ever stop doing it. Some of our original assets had MARC records, and there were ways to convert MARC to RDF. Lots of deep discussions among metadata librarians, asset owner librarians, and the tech folks came to the conclusion that we wanted to cast our metadata into specific namespaces, namely MODS, PREMIS, and MIX. This was way beyond the Dublin Core defaults that other products were using, but we knew RDF was flexible enough to accommodate just about anything, so we just did it.
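For anyone who has never seen triples kept in a plain relational table, a minimal sketch of the idea might look like the following. It uses Python’s built-in sqlite3 module purely for illustration. The table name, the ARK, the predicate names (mods:title and friends) and the values are all hypothetical stand-ins, not our real schema or vocabulary.

```python
import sqlite3

# An in-memory database standing in for the real SQL server.
conn = sqlite3.connect(":memory:")

# The whole "schema": three columns, one row per triple.
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

ARK = "ark:/99999/fk4demo01"  # made-up identifier used as the Subject

# Hypothetical predicates, prefixed by the namespace they are cast into.
rows = [
    (ARK, "mods:title", "Campus aerial photograph"),
    (ARK, "premis:size", "1048576"),
    (ARK, "mix:imageWidth", "4000"),
]
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)

# Someone changes their mind and wants a new attribute: it is just one more
# row with a new predicate, with no change to the table definition.
conn.execute(
    "INSERT INTO triples VALUES (?, ?, ?)",
    (ARK, "mods:note", "Scanned from a glass negative"),
)


def describe(subject):
    """Return every (predicate, object) pair recorded for one asset."""
    cursor = conn.execute(
        "SELECT predicate, object FROM triples WHERE subject = ? ORDER BY predicate",
        (subject,),
    )
    return cursor.fetchall()


for predicate, obj in describe(ARK):
    print(predicate, "->", obj)
```

Each row reads almost like a sentence (this asset, has this property, with this value), which is the self-documenting quality mentioned above.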
Guided by the head of our Metadata Analysis and Specification Unit (MASU - lots of great detail at that link), Brad Westbrook, we started spec’ing out what the metadata needs were for each asset. Ok, that’s a lie… We did it per “collection”, which was how we actually received assets from the librarians. Our DAMS works at the asset level, but our librarians normally think at the collection level. This was just another layer of translation, and the MASU group stepped up to play a liaison role in getting the assets ingested. Over time, this became a workflow where:
- Collections are identified/approved for ingestion into the DAMS. (The project management and institutional buy in on this process is another talk!)
- MASU creates an Assembly Plan that maps the assets into collections and specifies what namespaces the metadata pieces are placed in, and hands it off to IT Development.
- IT Development creates mapping scripts from what the Assembly Plan calls for into RDF. This is done in XSL.
- IT Development ingests the assets into the storage system and parses the metadata into RDF in the triple store.
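The mapping step in that workflow can be sketched in a few lines. The real scripts are written in XSL; this is a Python stand-in that only shows the shape of the idea, which is a lookup from source fields to namespaced predicates. The plan contents, field names, predicate names, and the ARK are all invented for the example.

```python
# Hypothetical slice of an Assembly Plan: source field -> namespaced predicate.
ASSEMBLY_PLAN = {
    "title": "mods:title",
    "creator": "mods:name",
    "file_size": "premis:size",
    "width": "mix:imageWidth",
}


def record_to_triples(ark, record, plan=ASSEMBLY_PLAN):
    """Turn one source record into (subject, predicate, object) triples.

    Fields the plan does not mention are returned separately so that a
    person can decide where they belong instead of silently losing them.
    """
    triples = []
    unmapped = []
    for field, value in record.items():
        predicate = plan.get(field)
        if predicate is None:
            unmapped.append(field)
            continue
        triples.append((ark, predicate, str(value)))
    return triples, unmapped


# A made-up source record as it might arrive with a collection.
record = {"title": "Campus aerial photograph", "width": 4000, "donor": "unknown"}
triples, unmapped = record_to_triples("ark:/99999/fk4demo01", record)

for triple in triples:
    print(triple)
print("needs a decision:", unmapped)
```

Reporting the unmapped fields is a design choice in this sketch: it mirrors the back and forth between the metadata folks and the developers whenever a collection shows up with something new.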
Ok, someone tell me how to get all that into 20 minutes… The Assembly Plan alone is an intense spreadsheet and text document that explains what is needed. Then the translation scripts are another challenge to present without everyone going cross-eyed. Not to mention this thing of beauty!
That’s all of the metadata and relationships of one asset. Maybe I’ll just put that on the screen and take questions for 20 minutes…