
Data Management

for Research

Aaron Collie, MSU Libraries


Lisa Schmidt, University Archives
Introductions
 Please tell us your name and
department
 A brief description of your primary
research area
 What do you consider to be your
research data?

 Optional:
 Experience managing research data?
 Experience writing a data
management plan?

(cc) http://www.flickr.com/photos/quinnanya/
Agenda

• Introductions
• Background
• Definitions
• Upfront Decisions
• Data Sharing Impacts
• Fundamental Practices
• File Organization
• Data Documentation
• Reliable Backup
• Data Lifecycle Strategy
Why are we here?
But why are we really here?
 An Impetus: NSF recently released a mandate that all grant
applications submitted after January 18th, 2011 must include a
supplemental “Data Management Plan”
 An Effect: This mandate from NSF has had a domino effect,
and many funders now require or state guidelines for
data management of grant-funded research
 A Challenge: Data management (and oftentimes research
methods in general) is an area that has not traditionally
received a full treatment in most graduate and doctoral
curricula
What is meant by “data management”?

Fundamental Practices
 File Organization
 Data Documentation
 Reliable Backups

Data Lifecycle
 Digital Sustainability
 Scholarly Communication
 Data Publishing
 Research Impact
 Effective January 18, 2011
 NSF will not evaluate any proposal missing a DMP
 May be up to two pages long
 PI may state that project will not generate data or
samples
 DMP is reviewed as part of intellectual merit or
broader impacts of application, or both
 Costs to implement DMP may be included in
proposal’s budget
NSF’s Data Management Guidelines
 Policies for re-use, re-distribution, and creation of
derivatives
 Plans for archiving data, samples, and other research
outcomes, maintaining access
 Types of data, samples, physical collections, software
generated
 Standards for data and metadata format and content
 Access and sharing policies, with stipulations for
privacy, confidentiality, security, intellectual property,
or other rights or requirements
Other Federal Policies
“expects the timely release and sharing of final research
data”
“…should describe how the project team will manage
and disseminate data generated by the project”
“requires that data…be submitted to and archived by
designated national data centers.”

NASA “promotes the full and open sharing of all data”

"IMLS encourages sharing of research data."


Upfront Decisions for Researchers
 What is the expected lifespan of the data?
 Besides the researcher(s) on the project, who else
should be given access to the data?
 Does the dataset include any sensitive information?
 Who owns or controls the research data?
 Should any restrictions be placed on the dataset?
 How are the data stored and preserved?
Upfront Decisions for Researchers (cont.)
 How might the data be used, reused, and
repurposed?
 How are the data described and organized?
 Who are the expected and potential audiences for
the datasets?
 What publications or discoveries have resulted from
the datasets?
 How should the data be made accessible?
Data Sharing Impacts
(cc) http://www.flickr.com/photos/pinchof_10/

 Reinforces open scientific
inquiry
 Encourages diversity of
analysis and opinion
 Promotes new research,
testing of new or alternative
hypotheses and methods of
analysis
 Supports studies on data
collection methods and
measurement
Data Sharing Impacts (cont.)
 Facilitates education of
new researchers
 Enables exploration of
topics not envisioned
by initial investigators
 Permits creation of
new datasets by
combining data from
multiple sources
File Organization Practices: Overview
1. Create a file plan for your research project
2. Design a file naming convention that works for your project
3. Agree on a version control method to assist with file synchronization
4. Carefully choose file formats to maximize usefulness

“When I was a freshmen I named my assignments Paper Paperr Paperrr Paperrrr”
-Undergrad
1. Create a file plan for your research project

 File plan as a classification system
 Indexed – makes it easier to locate folders/files
 Primary subjects – main functions of research project
 Secondary subjects – more specific activities of project,
including research data
 Tertiary subjects – limit by date or equivalent
 File name (naming conventions)
1. Create a file plan for your research project (cont.)

Example documentation of Directory Hierarchy:


 /[Project]/[Grant Number]/[Event]/[Date]
Example documentation of File Naming Convention:
 [investigator]_[method]_[descriptor]_[YYYYMMDD]_[version].[ext]
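
The two templates above can be filled in by a small script so that every path is built the same way. This is a minimal sketch, not a prescribed tool: the function names, the sample values, and the choice of YYYYMMDD for the [Date] folder are assumptions made for illustration.

```python
from datetime import date
from pathlib import PurePosixPath


def project_dir(project: str, grant_number: str, event: str, when: date) -> PurePosixPath:
    """Build /[Project]/[Grant Number]/[Event]/[Date] as a path.

    Assumption: the [Date] folder uses the same YYYYMMDD form as file names.
    """
    return PurePosixPath("/", project, grant_number, event, when.strftime("%Y%m%d"))


def data_file_name(investigator: str, method: str, descriptor: str,
                   when: date, version: str, ext: str) -> str:
    """Build [investigator]_[method]_[descriptor]_[YYYYMMDD]_[version].[ext]."""
    stem = "_".join([investigator, method, descriptor, when.strftime("%Y%m%d"), version])
    return f"{stem}.{ext}"


# Hypothetical project values, for illustration only.
folder = project_dir("krillSurvey", "GRANT0001", "cruise01", date(2011, 1, 17))
name = data_file_name("sharpeW", "micrograph", "backscatter3", date(2011, 1, 17), "v001", "tif")
print(folder / name)
# /krillSurvey/GRANT0001/cruise01/20110117/sharpeW_micrograph_backscatter3_20110117_v001.tif
```

Generating names from one function keeps the field order and date format from drifting between team members.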
2. Design a file naming convention that works
for your project

 Why file naming conventions?


 Enable better access/retrieval of files
 Create logical sequences for file sorting
 More easily identify what you’re searching for
2. Design a file naming convention that works
for your project (cont.)

 Meaningful but short (255 character limit)


 Descriptive while still making sense
 Capital letters or underscores differentiate
between words
 Surname first followed by initials of first name
 More on handout
2. Design a file naming convention that works
for your project (cont.)
This: sharpeW_krillMicrograph_backscatter3_20110117.tif
Not This: KrillData2011.tif

This: borgesJ_collocation_20080414.xml
Not This: Borges_Textbase.xml
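
A convention is easier to keep when file names can be checked automatically. The sketch below encodes one possible reading of the rules on the previous slides (surname plus initials, underscore-separated descriptors, a YYYYMMDD date, an optional version tag, 255 characters at most). The regular expression and the function name are assumptions, not part of any standard.

```python
import re

# Assumed pattern: surnameInitials_descriptor[_descriptor...]_YYYYMMDD[_v###|_d###].ext
NAME_PATTERN = re.compile(
    r"^[a-z]+[A-Z]{1,2}"   # surname followed by initial(s), e.g. sharpeW
    r"(?:_[A-Za-z0-9]+)+"  # one or more descriptive parts
    r"_\d{8}"              # date written as YYYYMMDD
    r"(?:_[vd]\d{3})?"     # optional version (v) or draft (d) number
    r"\.[A-Za-z0-9]+$"     # file extension
)


def follows_convention(file_name: str) -> bool:
    """Return True if the name fits the assumed pattern and the 255-character limit."""
    return len(file_name) <= 255 and NAME_PATTERN.match(file_name) is not None


for candidate in [
    "sharpeW_krillMicrograph_backscatter3_20110117.tif",
    "KrillData2011.tif",
    "borgesJ_collocation_20080414.xml",
    "Borges_Textbase.xml",
]:
    print(candidate, follows_convention(candidate))
```

The two "This" names pass and the two "Not This" names fail. A project with different fields would need its own pattern.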
3. Agree on a version control method to assist
with file synchronization
 Version number of record indicated in file name
with “v” followed by version number
 Letter “d” indicates draft
Examples of simple version control:
waltM_lakeLansing_fieldNotes_20091012_v002.doc
petersK_OrgChart2009_d001.svg
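
Saving a new version then amounts to incrementing the number in the file name. A minimal sketch, assuming the tag is always "_v" or "_d" plus three digits placed directly before the extension. The function name next_version is hypothetical.

```python
import re

# Matches "_v002" or "_d001" when it sits directly before the file extension.
VERSION_TAG = re.compile(r"_(?P<kind>[vd])(?P<number>\d{3})(?=\.[^.]+$)")


def next_version(file_name: str) -> str:
    """Return the file name with its version or draft number increased by one."""
    match = VERSION_TAG.search(file_name)
    if match is None:
        raise ValueError(f"No version tag found in {file_name!r}")
    bumped = int(match["number"]) + 1
    tag = f"_{match['kind']}{bumped:03d}"
    return file_name[:match.start()] + tag + file_name[match.end():]


print(next_version("waltM_lakeLansing_fieldNotes_20091012_v002.doc"))
# waltM_lakeLansing_fieldNotes_20091012_v003.doc
print(next_version("petersK_OrgChart2009_d001.svg"))
# petersK_OrgChart2009_d002.svg
```

Zero-padded numbers keep versions in the right order when files are sorted by name.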
4. Carefully choose file formats to maximize
usefulness
• Non-proprietary
• Open, documented standard
• Common usage by research community
• Standard representation (ASCII, Unicode)
• Unencrypted
• Uncompressed
Documentation Practices: Overview

1. At minimum create a README file that you can use to document your project
2. Utilize standards for describing data including Metadata Standards
3. If applicable, use in-line code commentary to explain code

(cc) Will Scullin
1. At minimum create a README file that you
can use to document your project
 At minimum, store documentation in readme.txt file or
equivalent, with data
 Resource: http://libraries.mit.edu/guides/subjects/data-management/metadata.html
2. Utilize standards for describing data including
Metadata Standards

 “Data about data”


 Standardized way of describing data
 Explains who, what, where, when of data creation
and methods of use
 Provides the essential tools for discovery, such as
a bibliographic citation
2. Utilize standards for describing data including
Metadata Standards

Basic project metadata:

• Title • Language • File Formats
• Creator • Dates • File Structure
• Identifier • Location • Variable List
• Subject • Methodology • Code Lists
• Funders • Data Processing • Versions
• Rights • Sources • Checksums
• Access Information • List of File Names
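
Two items in this list, the list of file names and the checksums, can be generated rather than typed. The sketch below is one way to do it with Python's standard library. SHA-256, the manifest layout, and the function names are assumptions, not requirements of any funder or standard.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_manifest(data_dir: Path) -> list[tuple[str, str]]:
    """Return (relative file name, checksum) pairs for every file under data_dir."""
    files = sorted(p for p in data_dir.rglob("*") if p.is_file())
    return [(p.relative_to(data_dir).as_posix(), sha256_of(p)) for p in files]


def write_manifest(data_dir: Path, manifest_path: Path) -> None:
    """Write one 'checksum  file name' line per file to manifest_path."""
    lines = [f"{checksum}  {name}" for name, checksum in file_manifest(data_dir)]
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Calling write_manifest(Path("data"), Path("MANIFEST.txt")) with a hypothetical data folder produces a file that can sit next to the README. Keeping the manifest outside the data folder stops it from listing itself on the next run.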
Documentation Practices: Example Metadata Standards

 Dublin Core
Easy-to-create-and-maintain descriptive format to
facilitate cross-domain resource discovery on the Web
 Darwin Core
Facilitates reference and sharing of biological diversity
datasets
 Data Documentation Initiative (DDI)
Methodology for content, presentation, transport, and
preservation of metadata about datasets in the social
and behavioral sciences
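
Of the standards above, Dublin Core is simple enough to sketch in a few lines. The example writes a handful of Dublin Core elements as XML using Python's standard library. The "metadata" wrapper element and all field values are hypothetical, and a real repository may expect a different container or additional elements.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, str]) -> str:
    """Serialize element-name/value pairs as a simple Dublin Core XML record."""
    root = ET.Element("metadata")  # assumed wrapper element, not part of Dublin Core
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical values, for illustration only.
record = dublin_core_record({
    "title": "Lake Lansing field notes",
    "creator": "Walt, M.",
    "date": "2009-10-12",
    "format": "application/msword",
    "rights": "Contact the creator for reuse terms",
})
print(record)
```

The element names used here (title, creator, date, format, rights) map directly onto items in the basic project metadata list.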
Documentation Practices: Example Metadata Standards

 Directory Interchange Format


Descriptive format for exchanging information about
earth science data
 ISO 19115:2003
Describes geographic data such as maps and charts
 PBCore
Supports description and exchange of media assets,
including both individual clips and full, edited, aired
productions
Documentation Practices: Example Metadata Standards

 Science Data Literacy Project


Metadata for astronomy, biology, ecology and
oceanography
 VRACore
Data standard for description of works of visual culture
as well as images that document them
3. If applicable, use in-line code commentary to explain code

Example of R code commentary


# Cumulative distribution function of the standard normal,
# evaluated at -1.96, 0, and 1.96
pnorm(c(-1.96, 0, 1.96))
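
The same habit carries over to other languages. As a rough Python counterpart of the R line above (an illustration, not part of the original slides), the comments state what is computed and why the values are of interest:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Cumulative probability P(Z <= z) at three points. About 95% of the
# distribution lies between -1.96 and 1.96, so the outer values are
# close to 0.025 and 0.975.
probabilities = [standard_normal.cdf(z) for z in (-1.96, 0.0, 1.96)]
print([round(p, 3) for p in probabilities])
# [0.025, 0.5, 0.975]
```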
Backup Practices: Overview

1. Avoid single points of failure


2. Understand the different types of storage
3. Ensure data redundancy
4. Aim for geographic distribution of data
1. Avoid single points of failure

A single point of failure occurs when it would only take one
event to destroy all data on a device (e.g. dropped hard drive)

Good practices for avoiding single points of failure:


 Use managed networked storage whenever possible
 Move data off of portable media
 Never rely on one copy of data
 Do not rely on CD or DVD copies to be readable
 Be wary of software lifespans (e.g. ANGEL)
2. Understand the different types of storage

• Flash Drives
• Internal Hard Drives
• External Hard Drives
• Server and Web Storage
• Managed Networked Storage
• Cloud Storage
3. Ensure data redundancy

Backup Do’s:
 Make 3 copies
 E.g. original + external/local + external/remote
 E.g. original + 2 formats on 2 drives in 2 locations
 Geographically distribute and secure
 Local vs. remote, depending on needed recovery time
 Personal computer, external hard drives,
departmental, or university servers may be used
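
The "3 copies" rule can be scripted so that each copy is verified as it is made. A minimal sketch using only Python's standard library: the function names are hypothetical, and the destination folders stand in for whatever external drive or university server is actually available.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(original: Path, destinations: list[Path]) -> list[Path]:
    """Copy a file into each destination folder and confirm every copy matches."""
    expected = sha256_of(original)
    copies = []
    for folder in destinations:
        folder.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(original, folder))  # copy2 also keeps timestamps
        if sha256_of(copy) != expected:
            raise OSError(f"Checksum mismatch for {copy}")
        copies.append(copy)
    return copies
```

Passing two destinations, for example a mounted external drive and a mounted network share, yields the original plus two verified copies. Whether those copies are geographically separate depends on where the destinations physically live, which the script cannot check.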
3. Ensure data redundancy (cont.)

Backup Don’ts:
 Do not rely on one copy
 Do not use CDs and DVDs
 Do not rely on ANGEL

(cc) George Ornbo


3. Ensure data redundancy (cont.)

Backup Maybe:
 Cloud storage
 Amazon S3
 Google enterprise cloud
 MS Azure
 DuraCloud
 Rackspace

Note that many storage services charge for data transfers in and out
Research is…
[Figure: Define a question → Gather information → Form a hypothesis → Test the hypothesis → Analyze the data → Interpret the data → Publish results → Retest]
[Figure: the same steps shown out of sequence, with a question mark]
The scientific method “is
often misrepresented as a
fixed sequence of steps,”
rather than being seen for
what it truly is, “a highly
variable and creative
process” (AAAS 2000:18).
Gauch, Hugh G. Scientific Method in Practice. New York: Cambridge University Press, 2010. Print. (Emphasis added)
The Research Depth Chart
More Generic
  Scientific Method
  Research Method
  Research Design
  Research Tasks
More Specific
The Data Management Depth Chart
Research Data Lifecycle Model

Source: DDI Structural Reform Group. “Overview of the DDI Version 3.0 Conceptual Model.” DDI Alliance. 2004.
http://opendatafoundation.org/ddi/srg/Papers/DDIModel_v_4.pdf
The Data Management Depth Chart
Research Data Lifecycle Model

???

???

Research Data Management Tasks


The Data Management Depth Chart
Research Data Lifecycle Model

Data Management Plan

???

Research Data Management Tasks


 http://www.lib.msu.edu/about/diginfo/ldmp.jsp
Data are brainstormed
Lifecycle stage: Study Concept

DMP • Data type, purpose & value
MSU • University Research Council guidelines
    • Research Facilitation and Dissemination
    • Lifecycle Data Management Planning
    • Research Data Management Guidance
YOU • Start your Data Management Plan!


Data are collected and secured
Lifecycle stages: Study Concept → Data Collection

DMP • Data format, size & short term storage
MSU • ATS Andrew File System (AFS)
    • Institute for Cyber Enabled Research
    • MSU Libraries Data Services
    • MSU Libraries Campus Data Resources
YOU • File Plan, File Naming, Backup Plan


Data are normalized and processed
Lifecycle stages: Study Concept → Data Collection → Data Processing

DMP • Data transformations & structures
MSU • LCT Computing Courses
    • High Performance Computing Center
    • Consortium of Research Consulting Services
YOU • Documentation, Methodology


Data are distributed
Lifecycle stages: Study Concept → Data Collection → Data Processing → Data Distribution

DMP • Data sharing, security & rights
MSU • Human Research Protection Program
    • University Research Council guidelines
    • MSU Libraries Copyright Permissions Center
    • MSU Google Apps
YOU • Roles, Responsibilities, Resources


Data are discoverable
Lifecycle stages: Study Concept → Data Collection → Data Processing → Data Distribution → Data Discovery

DMP • Data publishing & metadata
MSU • Development of Copyrighted Materials
    • MSU Libraries Data Citation Guide
YOU • README, Metadata Standard


Data are analyzed
Lifecycle stages: Study Concept → Data Collection → Data Processing → Data Distribution → Data Discovery → Data Analysis

DMP • Standards & workflow documentation
MSU • Center for Statistical Training and Consulting
    • Statistical Consulting Services
YOU • Code Commentary, Documentation


Data are stored and preserved
Lifecycle stages: Study Concept → Data Collection → Data Processing → Data Distribution → Data Discovery → Data Analysis, with Data Archiving

DMP • Long term storage & management
MSU • VPRGS Repositories and Archives
    • Lifecycle Data Management Planning
    • Databib.org!
YOU • Embrace stewardship


Data can be used and reused
Lifecycle stages: Study Concept → Data Collection → Data Processing → Data Distribution → Data Discovery → Data Analysis, with Data Archiving and Repurposing

DMP • Broader impact
MSU • Research Data Management CAFE
    • MSU Research Centers and Institutes
    • MSU Libraries Data Citation Guide
YOU • Publish your data!


Research Data Management Guidance
 Face-to-face Advising
 Writing Data Management Plans
 Planning for Digital Projects
 Managing Digital Information
 Group Training
 New Faculty Orientation
 Faculty Seminars
 Classroom Instruction
lib.msu.edu/about/rdmg
In Conclusion…
 Upfront Decisions Researchers Need to Make
 General Good Practices for Managing Research Data
 NSF, NIH, IMLS and Other Funders’ Requirements
 Lifecycle of Research Data
Contact
Lisa M. Schmidt
Electronic Records Archivist
University Archives & Historical Collections
lschmidt@ais.msu.edu

Aaron Collie
Digital Curation Librarian
MSU Libraries
collie@msu.edu
