DataRescue Philly: Environmental Data Archiving, Workflows, and Description

Today’s post is from Rachel Appel, Digital Projects & Services Librarian at Temple University.

On January 13-14, I participated in DataRescue Philly at the University of Pennsylvania, one event in a long series of grassroots DataRefuge events. These events aim to capture and archive federal environmental data for long-term access and preservation, both to counter the incoming administration’s efforts to deny climate change and to address the need for ongoing management of digital data. The event was organized by the Penn Program in Environmental Humanities and the University of Pennsylvania Libraries.¹ I attended because of my commitment to climate change awareness and action, and because I wanted to learn more about data archiving for a project I am working on to preserve civic data accessed through OpenDataPhilly.org.²

The first day acted as an orientation to data management and archiving and included a Teach-In on Data Refuge and Environmental Justice, DataRescue Guide Training, and Roundtable on DataRefuge Value and Vulnerability.

The second day was DataRescue: A Creative Coding and Archive-a-thon. DataRescue Philly focused on archiving NOAA (National Oceanic and Atmospheric Administration) data.³ There were six DataRefugePaths⁴ for participants to join:

Seeders: Enter seeds, or individual site URLs, into the Internet Archive’s End of Term Archive.⁵
Baggers: Use the tool BagIt⁶ to bag breakdowns of web pages that the Internet Archive is unable to archive.
Metadata: Work on creating a descriptive metadata standard and entering metadata for bags.
Tool Builders: Create tools to assist the Baggers.
Storytelling: Capture the event on social media and develop documentation.
Long Trail: Strategize DataRefuge into the future.
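
For readers unfamiliar with what a "bag" is, here is a minimal sketch in Python using only the standard library. It is not the tooling the Baggers used at the event; it only illustrates the basic BagIt layout (a data/ payload directory, a bagit.txt declaration, and a checksum manifest). The function names and the sample file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(bag_dir: Path, files: dict[str, bytes]) -> Path:
    """Write payload files under data/ plus a declaration and a manifest.

    `files` maps flat file names (no subdirectories) to their contents.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    manifest_lines = []
    for name, content in sorted(files.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line pairs a checksum with a path relative to the bag.
        manifest_lines.append(f"{digest}  data/{name}")

    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical payload standing in for a harvested dataset.
        bag = make_minimal_bag(Path(tmp) / "sample_bag", {"readings.csv": b"a,b\n1,2\n"})
        print(verify_bag(bag))
```

The manifest is what makes a bag useful for preservation: anyone who later downloads the bag can recompute the checksums and confirm the payload has not changed.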

I participated in the Metadata Path. I was one of the Guides for the group, and my main role was to facilitate the group and develop a qualified Dublin Core standard for the descriptive metadata for bags, which were then uploaded to an S3 bucket and linked to from the DataRefuge CKAN page (datarefuge.org). The hardest part was constructing a workflow with the Baggers and the S3 bucket uploaders. Fortunately, the University of Michigan had developed a way to automate the capture of some preservation metadata in a JSON file.⁷ We then had to reconcile those fields with CKAN’s fields and with the fields we thought were pertinent to description and discovery. We developed a schema for CKAN and were able to work around the software’s limitations by adding custom fields. As soon as data had been bagged, we uploaded it to S3 and then created a record in CKAN, entering the metadata and linking to the file. This is still a work in progress, and we hope to have a more streamlined workflow for future events to use and build upon. This model can be applied to a number of fields, not just climate change.
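
To make the field-reconciliation step more concrete, here is a rough Python sketch of how descriptive and preservation metadata might be combined into a single record. This is an illustration under stated assumptions, not the schema we developed: the field names, the `dc:` and `preservation:` prefixes, the sample values, and the example.org URL are all hypothetical. The record is shaped loosely like a CKAN dataset, with built-in fields at the top level and everything else carried as custom key/value fields.

```python
import json


def build_record(descriptive: dict, preservation: dict, file_url: str) -> dict:
    """Merge descriptive and preservation metadata into one dataset record."""
    # Fields assumed to map onto built-in dataset fields.
    builtin = ("identifier", "title", "description")
    record = {
        "name": descriptive["identifier"],
        "title": descriptive["title"],
        "notes": descriptive.get("description", ""),
    }

    # Remaining descriptive fields become custom fields.
    extras = [
        {"key": f"dc:{key}", "value": str(value)}
        for key, value in sorted(descriptive.items())
        if key not in builtin
    ]
    # Preservation fields (e.g. from an automatically generated JSON file)
    # are carried along under their own prefix so they stay distinguishable.
    extras += [
        {"key": f"preservation:{key}", "value": str(value)}
        for key, value in sorted(preservation.items())
    ]
    record["extras"] = extras

    # Link the record to the uploaded bag.
    record["resources"] = [{"url": file_url, "name": descriptive["title"]}]
    return record


if __name__ == "__main__":
    # Hypothetical sample values for illustration only.
    descriptive = {
        "identifier": "sample-noaa-dataset",
        "title": "Sample NOAA dataset",
        "description": "Placeholder description.",
        "creator": "NOAA",
    }
    preservation = json.loads('{"harvest_date": "2017-01-14", "checksum_algorithm": "sha256"}')
    record = build_record(descriptive, preservation, "https://example.org/bags/sample.zip")
    print(json.dumps(record, indent=2))
```

Keeping the two sources of metadata under separate prefixes is one way to avoid collisions when an automated field and a hand-entered field happen to share a name.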

By the end of the Archive-a-thon, we had archived nearly 4,000 seeds and over 21 GB of bagged data.

To learn more about the project, please visit the DataRescue Philly site at ppehlab.org/datarefuge or the GitHub repo at github.com/datarefugephilly. We are continuously updating the documentation.

List of upcoming DataRefuge events:

January 27-28, 2017 Ann Arbor: #DataRescueAnnArbor
February 4, 2017 New York: #DataRescueNYC
February 12, 2017 Boston: #DataRescueBoston

I encourage everyone to try to attend these events, especially if one is hosted near you. You can bring a multitude of skills, technical and non-technical, and help preserve climate data so we can still access it in the years to come.