Research data management
"Research data management" bundle created by Jez
- ADMIRe
- RoaDMap blog
- REDm-MED
- Research360@Bath
- Modus Operandi for Repository Deposits
- Open Exeter: Human Factors in Research Data Management
- DaMaRO
- Data Management Planning for Secure Services (DMP-SS)
- Managing Research Data: a pilot study in Health and Life Sciences
- BRISSkit
- CERIF for Datasets
- data.bris
- Creating data management plans online
- JISC DataPool Project
- DATUM
- Data Management Planning & Storage for Psychology
- Research Information Network
- History DMP
- Iridium
- JISC MRD: Evidence Gathering
- Orbital
- PIMMS - Blog
- Rapid Organization of Health Research Data
- Sustainable Management of Digital Music Research Data blogs
- University of Hertfordshire - Research Data Toolkit
- JISC Managing Research Data Programme
- Research Data @Essex
- Scholarly Output Notification and Exchange (SONEX)
- SWORD
- Digital Curation News
- EDINA and Data Library: News
- MiSS
- KAPTUR
- REWARD - Events
- REWARD - News
OpenAIREplus is a large-scale EU project bringing together 41 pan-European partners, including three cross-disciplinary research communities. OpenAIREplus aims to:
“…create a robust, participatory service for the cross-linking of peer-reviewed scientific publications and associated datasets.”
The 30-month project launched in December 2011 (see Bill’s post on this launch), and on 11 June the team will present an OpenAIREplus workshop in conjunction with the Nordbib Conference 2012 (Copenhagen, June 11-13, 2012). The OpenAIREplus workshop “Linking Open Access publications to data – policy development and implementation” has a very interesting programme, and I am hoping they will make the workshop presentations and outputs available after the event.
The workshop is aimed at anyone with an interest in this topic, and will be of interest to library managers, researchers, research funders, repository managers, journal editors and publishers, and research administrators. Topics covered include:
- Preparing and writing institutional data management policies
- An overview of funders’ responsibilities and requirements regarding data availability and management
- An overview of linking research publications and data
- The research data landscape
Follow developments and news items on the OA EU infrastructure on Twitter @OpenAIRE_eu
Links of interest
I was invited to speak on our Essex project last week at the DCC Roadshow at Imperial College (21 May 2012). The workshop was well attended and the discussion useful. You can download a copy of my presentation, but here are some of the more interesting points on which I want to reflect (briefly, don’t worry).
The developments in MRD projects’ emerging Research Data policies are welcome, and particularly noticeable are the differences in the approaches being taken: top down, bottom up, or dwindling in between and waiting for the right pathway to emerge.
At Essex we are drafting our policy, as a matter of priority, for the purposes of the EPSRC Framework requirements; a great stepping stone to a grander University Policy. We have based ours on the Edinburgh 10 point model, and it is in the queue to be agreed by two high level Committees.
In my DCC talk I mentioned that our project had four key players in the University which were critical in designing a path of least resistance to MRD policy: the UK Data Archive; our Research and Enterprise Office (REO); our Records Management department and our Information Systems Services (ISS). Our Library is not part of this work as it does not have any digital data expertise nor any particular interest in the research data management space. They look after books and journals and also do not run our University Eprints Publications’ Repository (that has been set up and managed by the REO).
However, the crux of getting our RDM Policy to work ultimately lies with gaining agreement from two main areas of University Policy – Research Strategy and ICT Strategy. We are very fortunate to have players from these worlds in this project and drafting our policies. However, keeping longer-term data preservation and storage space out of the equation for now, for me, one of the most critical roles is that of the University records manager who is, ultimately, the person at the end of the FOI line; and who has the greatest knowledge about the safe keeping of ‘records’.
What about the data?
We have completed our work with pilot departments using our data inventory (that Tom has already blogged on 2 May 2012) and have gathered representative samples of data. We have classified our data into four broad types:
• Qualitative: collected through interview, recorded as audio (and sometimes video) and transcribed as text. A transcript often has an associated annotation file, which is highly valued by the researcher
• Numerical, tabular: typically handled in MS Excel (sometimes SPSS and other statistical packages), this kind of data forms the basis of a great deal of empirical research
• Machine output: logs or raw instrument-generated data. Often saved in proprietary formats, poorly documented, and hard to interpret without specialist knowledge
• Cross-discipline collections: an emerging theme in a variety of disciplines, typified by interdisciplinary research such as climate science. Researchers producing these kinds of collection often bring the datasets together at a specific location
Here is a summary of our test data so far:
| Department | Area | Data |
|---|---|---|
| Biological Sciences | Proteomics | mass spectrometry data from tumour tissue samples |
| | Bio-imaging | high resolution image data collected to examine cellular structure |
| Business School | Management | Football manager performance exploring managerial succession in an environment of instant public access to performance metrics |
| Linguistics | Second language acquisition | audio and transcripts of classroom second language learners |
| | Sociolinguistics | audio and transcripts of interviews with multiple generations of Indian English speakers |
| Computing and Electronic Engineering | Artificial intelligence | crowd sourced AI scripts, and results from competition between these AIs in a game environment |
RDM Support
In December we established an informal Data Manager’s Forum, inviting all Research Directors to the first meeting. They were certainly keen to hear more, attended where they could, and launched many questions at us. A formalised Forum was not considered useful as a regular standing meeting, but Research Directors were quite keen to take on a local information role and to meet when critical matters arose.
We also met with all the staff from the REO who support research grant submission and award management. While they had not encountered any problems with Essex researchers submitting DMPs for RC applications, they were interested in our DM Costing Tool, and were keen to incorporate this on their grants’ information web pages. We are currently drafting high level content for the REO pages on RDM, but, for now, will point to the UK Data Archive Creating & Managing web site which contains a lot of introductory detail on MRD. We further agreed that some centralised training could take place, such as ‘how to cost research proposals to include RDM’ and ‘writing a DMP’ which would slot into the REO's existing support and outreach programme.
The Archive already runs MRD courses at Essex and we have run one generic course on campus this year. Successful training in the future is most likely to be through these generic events, aimed at postgraduates and staff, and some bespoke training within faculties on demand.
Our test Eprints system
Our Eprints work is going very well. As you may have seen from Tom’s great blogs (the two before this) we have spent time investigating metadata for our Eprints system, as we want to get it right (or thereabouts) first time.
From our experience of running two data catalogues - the ESDS catalogue and our first generation Fedora self-archive, UKDA store - we have a good idea which fields are essential for describing research data (we curate over 5000 collections between these two systems). Both of our catalogues use Dublin Core and DDI as the metadata base, and include all of the Datacite core fields (Identifier; Creator; Title; Publisher; Publication Year) plus more, and specifically, the two optional properties we would recommend for all IRs, Resource type and Version.
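As a rough illustration of the field list above, here is a minimal Python sketch that checks a record for the five core fields and the two optional properties we recommend. The dictionary layout, the lower-case field keys and the example record are assumptions made for this sketch; they are not the structure of either of our catalogues.

```python
# Field names follow the list in the post; the key spellings are hypothetical.
MANDATORY = ("identifier", "creator", "title", "publisher", "publication_year")
RECOMMENDED = ("resource_type", "version")


def missing_fields(record):
    """Return (missing mandatory, missing recommended) field names."""
    def absent(names):
        gaps = []
        for name in names:
            value = record.get(name)
            # Treat None and blank strings as missing.
            if value is None or not str(value).strip():
                gaps.append(name)
        return gaps

    return absent(MANDATORY), absent(RECOMMENDED)


# Illustrative record only; 10.5072 is used here as a placeholder prefix.
example = {
    "identifier": "10.5072/example-1",
    "creator": "Example, A.",
    "title": "Interview transcripts (illustrative record)",
    "publisher": "Example Repository",
    "publication_year": 2012,
    "resource_type": "Dataset",
}

mandatory_gaps, recommended_gaps = missing_fields(example)
```

A check of this kind could run at deposit time so that a depositor is told which fields still need filling in before a record is accepted.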
We believe that the metadata we have chosen for our own trial Eprints system will cover all collections, or at least make them sufficiently described. Do take a look at our metadata mapping (Eprints default, DDI 2.1 & 3.1, Inspire, Datacite, JISC Datashare) and Eprints mock up (metadata fields and anticipated rendering) that we have set out for our own Repository (in Tom’s blog below!). Of course, we are keenly watching the RepositoryNet+ space to see which metadata elements they will choose for harvesting from IRs.
We hope to discuss more about metadata at the BL Datacite meeting planned for July.
But, are there enough ‘archivists’?
What I mean is, are there enough ‘archivists’ involved in the MRD projects? I don’t know the answer, but I had a discussion with an archivist after my talk at Imperial, where they expressed concern that the MRD community was not, on the whole, calling upon ‘archivists’ to provide expert guidance on many of the issues that all MRD projects are grappling with and that archivists live and breathe: appraisal, records management, accession and disposal, collection description, access control and so on. Most Universities will have an archivist or at least a records manager who could be consulted on data policies and on the policies and procedures for IRs being set up.
Many archivists already look after ‘data’, typically in the form of papers of well-known scientists or anthropologists, or other academics who have left their paper windfalls. These collections often contain ‘data’, if not digital. My call to all MRD projects is to Involve the Archivist, if they are not already.
Building on the previous post which described our IDMB-inspired basic metadata model, this post will present some more tangible results of our work so far.
Toward a standard metadata schema for research data
We thought the best place to start in developing our metadata schema would be to assess the strengths and weaknesses of a bunch of different metadata specifications. This exercise would help inform the development of our generic schema for describing research data.
niallco over at the iridium project blog posted back in December on metadata for a research data repository. He suggested a three level taxonomy of sorts, which complements the IDMB-based model described in the previous post:
- The top level relates to the Discoverability of the Resource and could be based around the 15 Dublin Core elements
- The second level is contextual and essentially covers the elements in CERIF (notably: outcomes – publications, patents etc.; funding – research council grant; person – project team members; organisation – Uni, collaborators).
- Finally a specific level giving the minutiae.
Levels 1 and 2 are fairly well covered by default EPrints metadata. The rather more vague third point demands more consideration - which 'minutiae' should we provide? A mapping (or crosswalk) of candidate schema is attached below, and includes an early (draft) version of the proposed RDE metadata schema (mapped to EPrints default). This draws on elements from DataCite, INSPIRE, DDI and DataShare schema.
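To show what a crosswalk does in practice, here is a small Python sketch that maps a handful of Dublin Core elements onto DataCite property names and reports anything left unmapped. The table below is a partial, illustrative mapping assumed for this example; it is not the attached RDE crosswalk, and it does not use EPrints field names.

```python
# Dublin Core element -> DataCite property.
# Partial and illustrative only; an assumption, not the project's mapping.
DC_TO_DATACITE = {
    "identifier": "Identifier",
    "creator": "Creator",
    "title": "Title",
    "publisher": "Publisher",
    "date": "PublicationYear",
    "type": "ResourceType",
}


def crosswalk(dc_record):
    """Split a Dublin Core record into mapped and unmapped elements.

    Values are copied unchanged; a real crosswalk would also need to
    normalise them (for example, reducing a full date to a year).
    """
    mapped, unmapped = {}, {}
    for element, value in dc_record.items():
        target = DC_TO_DATACITE.get(element)
        if target is None:
            unmapped[element] = value
        else:
            mapped[target] = value
    return mapped, unmapped


mapped, unmapped = crosswalk({
    "title": "Classroom recordings (illustrative)",
    "creator": "Example, A.",
    "coverage": "Essex",
})
```

The unmapped remainder is the interesting part: those are the elements where a decision is needed about whether the target schema should grow, or whether the detail belongs in a separate discipline-specific file.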
Designing an interface for research data eprints
The image below shows a mockup of a generic data eprint, and the way we hope to present the metadata above. This design illustrates how data, documentation and the various levels of metadata will appear to a user. Significant changes are required to the way in which the items (i.e. files) are displayed, in order to make the meaning of the various elements of a complex collection intelligible.
A trial implementation is in production and will likely be refined and improved as we progress.
The Institutional Data Management Blueprint (IDMB) project ran during the JISC Managing Research Data programme 2009-2011. The team, based at the University of Southampton, developed a suite of re-usable tools and templates for institutional data management. These were designed to provide a framework for the seamless integration of research data management into the workflows of all those involved in the data lifecycle. It is important that the community build on work such as this rather than unnecessary re-invention of the wheel (a consistent standard across institutions would be no bad thing either), and with this in mind we will be looking at how the IDMB toolkit can be used in our work on the Research Data at Essex project. This also provides a more general comment on its potential for use in similar scenarios at other institutions. There are several parts to the IDMB ‘blueprint’, which I will discuss over a series of blog posts as the project progresses.
In this post we will be looking at metadata. Based on close work with a pilot department at the University of Southampton, the blueprint proposes a three part metadata model for documenting research data. We like this model, and have developed our own (working) version of this model (see diagram below). Functionally the two models are similar, the most significant difference being that ours detaches all discipline-specific metadata from EPrints formal schemas. Based on another recommendation of the IDMB project, this allows the user to upload their own discipline specific metadata as a separate file - a flexible and pragmatic approach we feel. We will further extend/refine this model as the project progresses and we implement the first version of our metadata schema (more to come on this). We will also be examining the process of adding metadata at pre-deposit, deposit and ingest stages. It will be important to clarify the relationships between data objects, particularly regarding how we present and link datasets, documentation and metadata. This is something which IDMB should be able to provide a rough framework for, with their model of data, metadata, and readme. Not yet explored, though, is how metadata could be attached to high level ‘containers’ that group related studies.
‘Coin press at New Orleans Mint Museum’
Some rights reserved by Ted Drake
Last week’s DataCite workshop was a really good opportunity to ask questions about DataCite at The British Library, how to mint a DOI (Digital Object Identifier), and to discuss challenges with citing research data.
Data Citation
The day started with a challenge to the presenters – what is data? This discussion had echoes of KAPTUR’s own research question – what is visual arts research data? (Environmental Assessment report). It seems almost impossible to define research data due to its diversity, but a working definition is obviously necessary; a good example is from University of Bristol’s Glossary.
The British Library’s Head of Scientific, Technical & Medical Information, Lee-Ann Coleman, spoke about the importance of making research data available, mentioning examples including the virologist Ilaria Capua, who opened up worldwide access to Avian flu virus data sequences, and the open-data journal GigaScience’s research into E. coli. A recent addition, ISO 26324:2012 for DOIs, was mentioned. Garfield’s 15 reasons ‘when/why to cite?’ was a useful point of reference too:
- Paying homage to pioneers.
- Giving credit for related work (homage to peers).
- Identifying methodology, equipment etc.
- Providing background reading.
- Correcting one’s own work.
- Correcting the work of others.
- Criticizing previous work.
- Substantiating claims.
- Alerting researchers to forthcoming work.
- Providing leads to poorly disseminated, poorly indexed, or uncited work.
- Authenticating data and classes of fact – physical constants, etc.
- Identifying original publications in which an idea or concept was discussed.
- Identifying the original publication describing an eponymic [sic] concept or term as, e.g., Hodgkin’s disease, Pareto’s Law, Friedel-Crafts Reaction, etc.
- Disclaiming work or ideas of others (negative claims).
- Disputing priority claims of others (negative homage).
Garfield, E., 1996. When to Cite. In: Library Quarterly 66 (4), 449-458. Available from: http://www.garfield.library.upenn.edu/papers/libquart66(4)p449y1996.pdf [Accessed 25 May 2012].
What is DataCite?
Elizabeth Newbold provided an introduction to DataCite, a not-for-profit international registration agency for DOIs that facilitates the citing of research data. Founded in December 2009, it consists of a Managing Agent (currently the German National Library of Science and Technology (TIB)) and regional Members. In the UK the regional Member is The British Library, which works with ‘Data Clients’ such as the UK Data Archive and other data centres and repositories. DOIs are assigned between the regional Member (e.g. The British Library) and its Data Clients (e.g. UK Data Archive), i.e. on an institution-to-institution basis. An individual researcher who wants a DOI needs to contact the appropriate Data Client for their subject discipline; a list of some existing and potential future Data Clients is maintained on the DataCite website. Data Clients must fulfil a number of requirements and pay an annual fee to The British Library.
Some of the requirements for Data Clients:
- DOIs must resolve to a publicly accessible landing page even if the data itself is not open; the landing page can be an existing set of Web pages with the Data Client’s style so long as it is updated to include the DataCite information.
- Mandatory metadata fields: 4 fields (5 if you include the DOI itself) – these should be subject discipline agnostic: http://schema.datacite.org/
- The mandatory metadata must be freely available for discovery purposes, specifically under a Creative Commons CC0 licence; there was some interesting discussion around this and some issues to be resolved.
- Data Clients should have a formal data preservation plan (this may include disposal policies and so on); an operational service level agreement (SLA); and a clear intention in a mission statement to preserve and maintain the DOIs, this could include reference to an EPSRC Roadmap. Action: DataCite will share a draft SLA with the attendees.
How to mint a DOI – case study
Louise Corti of the UK Data Archive provided a very useful mini-case study and I’ll link to her presentation here when it is available. As a data provider, the UK Data Archive wants to use citations to improve resource access and discovery. It was really interesting to hear how DOIs are affected by changes to the research data: at the UK Data Archive minor changes (e.g. a spelling mistake or typo) are documented in their Change Log but the DOI version number stays the same; major changes (such as an updated dataset) are documented in the Change Log field and the DOI is also given a new version number at the end. Challenges for the future include citing parts or fragments of research data, and also issues around describing relationships between data. Look out for a forthcoming UK Data Archive and ESRC brochure on citing data, aimed at the Social Science community.
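The minor/major rule described above can be sketched in a few lines of Python. The assumption here, made purely for illustration, is that the version number is a trailing "-N" on the DOI; the talk only said that the version number sits at the end. The function name and the example DOI are hypothetical.

```python
def next_doi(doi, change):
    """Return the DOI to use after a change to the dataset.

    change is 'minor' (Change Log entry only, DOI unchanged) or
    'major' (Change Log entry and a new version number on the DOI).
    Assumes, for illustration, that the DOI ends in '-<version number>'.
    """
    stem, separator, version = doi.rpartition("-")
    if not separator or not version.isdigit():
        raise ValueError("expected a DOI ending in '-<version number>'")
    if change == "minor":
        return doi
    if change == "major":
        return f"{stem}-{int(version) + 1}"
    raise ValueError("change must be 'minor' or 'major'")


current = "10.5072/EXAMPLE-SN-1234-1"  # placeholder DOI, not a real record
after_typo_fix = next_doi(current, "minor")
after_new_data = next_doi(current, "major")
```

Either way the change is recorded in the Change Log; only a major change produces a new citable identifier.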
How to mint a DOI – the technical bit
An illuminating presentation from Ed Zukowski described the following components of the DataCite systems:
The Data Client will be provided with information from the regional Member in order to make use of the Metadata Store and the facility to mint DOIs; technical knowledge is required to use the API for bulk registration. For minting one DOI, an XML file is required with at least the four mandatory fields of metadata using the DataCite Schema.
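For a sense of what such an XML file contains, here is a hedged Python sketch that builds a minimal record holding the DOI plus the four mandatory fields. The element names are the ones I understand the DataCite kernel to use; the schema namespace and version attributes are deliberately left out, so a real file should be checked against the current schema at http://schema.datacite.org/ before submission. The function name and all values are placeholders.

```python
import xml.etree.ElementTree as ET


def build_datacite_xml(doi, creators, title, publisher, year):
    """Build a minimal metadata record: the DOI plus four mandatory fields.

    Sketch only: the schema namespace declaration is omitted, so this
    output would not validate as-is.
    """
    resource = ET.Element("resource")

    identifier = ET.SubElement(resource, "identifier", identifierType="DOI")
    identifier.text = doi

    creators_el = ET.SubElement(resource, "creators")
    for name in creators:
        creator = ET.SubElement(creators_el, "creator")
        ET.SubElement(creator, "creatorName").text = name

    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title

    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)

    return ET.tostring(resource, encoding="unicode")


xml_text = build_datacite_xml(
    doi="10.5072/example-1",  # placeholder DOI
    creators=["Example, A.", "Sample, B."],
    title="Illustrative dataset",
    publisher="Example Repository",
    year=2012,
)
```

Generating the file from structured values like this, rather than by hand, is also a natural first step towards the bulk registration that the API supports.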
The user will resolve a DOI (e.g. using a system such as http://dx.doi.org/) through the Global Handle Registry; this includes information from the Handle Server hosted by the DataCite Managing Agent. Resolving a DOI takes the user to a landing page and collects statistics about how many times a DOI has been resolved.
There is a free search of existing DataCite DOIs. From the top right of the Search page select ‘Options’ and ‘enable’ the Filter Preview, then when you do a search it is possible to filter by individual regional Member (‘allocator’) and Data Client (‘datacentre’).
The OAI-PMH Data Provider is available here: http://oai.datacite.org/
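Harvesting over OAI-PMH comes down to sending a verb and its arguments as a query string. The Python sketch below only builds the request URL and does not contact the service. ListRecords and the oai_dc metadata prefix come from the OAI-PMH protocol itself, but the "/oai" endpoint path is an assumption, so check the provider page above for the actual base URL.

```python
from urllib.parse import urlencode


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# The endpoint path is assumed for illustration.
url = oai_request_url(
    "http://oai.datacite.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
)
```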
http://data.datacite.org/ - provides two ways of exposing metadata held in the Metadata Store:
- HTML links i.e. hyperlinks in a standard Web browser.
- HTTP Content Negotiation – ‘I say what I want and in what priority’ e.g. ‘I want a PDF version of the research data but if there is a HTML version I’ll take it’ – if there is a PDF version available content negotiation will take you straight to the PDF rather than to the landing page for example.
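The ‘I say what I want and in what priority’ idea is carried in the HTTP Accept header, where each media type can be given a quality value. Here is a minimal Python sketch that turns an ordered list of preferences into such a header and attaches it to a request object without sending it. The DOI in the URL is a placeholder, and which formats are actually available for any given DOI is not something this sketch knows.

```python
from urllib.request import Request


def accept_header(preferences):
    """Build an Accept header from media types, most preferred first.

    The first type gets the default quality of 1; each later type gets
    a lower q-value (0.9, 0.8, ...), never dropping below 0.1.
    """
    parts = []
    for rank, media_type in enumerate(preferences):
        q = max(10 - rank, 1) / 10
        if q == 1:
            parts.append(media_type)
        else:
            parts.append(f"{media_type};q={q:.1f}")
    return ", ".join(parts)


header = accept_header(["application/pdf", "text/html"])

# Built but not sent; the DOI below is a placeholder.
request = Request(
    "http://data.datacite.org/10.5072/example-1",
    headers={"Accept": header},
)
```

With this header the server is asked for a PDF first and HTML as a fallback, which matches the example given in the talk.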
Contact datasets@bl.uk to ask for access to the test site which enables you to mint ‘temporary’ DOIs. See also: https://github.com/datacite
A really useful tool to format DOIs into Harvard system citations (and other citation systems) in multiple languages: http://crosscite.org/citeproc/
Breakout groups on challenges with citing research data (some questions):
- Selection process – what about raw data? when does data become citable?
- Why not use DOIs for Ph.D. theses?
- Do you need to mint DOIs before you publish the journal article so you can link to them? – could start minting DOIs at collection level then move into additional specific parts nearer to publication of the journal article?
- A need to define roles and responsibilities.
- What about changes to Data Clients or funding bodies?
- How does versioning work with DOIs? (note UK Data Archive case study above)
- What is a citable unit of research data?
- What about cross-institutional, international, or cross-disciplinary research? Who mints the DOIs?
- A need for DataCite to provide case studies, perhaps with future workshops.
- It is only possible to describe one resource type per DOI (and this is a fixed controlled list e.g. Image, Film, etc) – this may be problematic with visual arts e.g. an exhibition; how do you describe complex relationships?
For cost/charge plans – discuss with the UK regional Member via datasets@bl.uk
The next DataCite workshop will be on metadata on Friday 6th July; details will be published online in due course.
Some other links:
- http://www.dcc.ac.uk/resources/how-guides/cite-datasets
- http://www.rin.ac.uk/our-work/using-and-accessing-information-resources/patterns-information-use-and-exchange-case-studie
Filed under: events Tagged: citation, Creative Commons, DataCite, jiscmrd, metrics, The British Library, UKDataArchive

Goldsmiths College, University of London.
Photo: MTG
All the members of the project team met at Goldsmiths College, University of London last week in order to collaborate on two aspects of the project: the development of the RDM policies, and the promotion of KAPTUR.
Key points:
- The Project Manager is working with Angus Whyte and Andrew McHugh from the Digital Curation Centre on two small events: Selecting and Appraising Research Data; and Using the OAIS Functional Model.
- The Project Officers are each raising awareness of the importance of effective Research Data Management at their institutions; developments seem to be quite diverse between the four institutions although the KAPTUR project represents an opportunity to learn from each other and the nuances of each different institution.
- The Technical Manager will be installing DataFlow’s DataStage onto his laptop so we can have a demonstration at the next meeting.
- The Project Officers and Project Manager are working on publicity to further promote the Environmental Assessment report.
- The Project Director reported that KAPTUR’s abstract for the Digital Humanities Congress 2012, University of Sheffield, had been successful.
Filed under: meetings Tagged: jiscmrd, RDM policies, ukdcc
The following blog post is adapted from the Conclusion and Recommendations section of the Technical Analysis report (PDF):
The KAPTUR Technical Manager investigated 17 different types of software which were compared to the requirements of the four partner institutions (details and appendices in the report). The next stage of the research reduced the choice of software to five options: DataFlow, DSpace, EPrints, Fedora, Figshare. These were all found to be suitable for managing research data in the visual arts; through a further selection process EPrints, Figshare, and DataFlow were identified as the strongest contenders.
[...] it is recommended that two pilots occur side by side: an integration of EPrints with Figshare and a separate piece of work linking DataFlow’s DataStage with EPrints. By integrating EPrints with Figshare, the project can take advantage of a system which has been built with, and for, researchers to handle research data specifically, and has a user-friendly visual interface (which is constantly evolving and enhanced by Figshare directly). [...] By integrating DataStage with EPrints the research data storage and software will be hosted within each institution, providing them with better control over the type of data that can be stored, published and managed. The integration will also enable content uploaded in DataStage to be securely backed up by the institution and accessible from anywhere in the world. A ‘Dropbox’-like tool is featured in the latest beta version, providing a user-friendly interface which will benefit visual arts researchers. EPrints will effectively provide the role of DataFlow’s DataBank.
Filed under: resources Tagged: DataFlow, DataStage, DSpace, EPrints, Fedora, Figshare, jiscmrd, technical analysis
Other JISC MRD projects or those working with ‘big data’ may be interested in a case study that has been written for Open Exeter by Dr Jacq Christmas (http://hdl.handle.net/10036/3556).
The case study documents the process of reviewing, preparing, uploading and describing multiple large video files. The project that generated the files is investigating the behaviour of crickets through analysis of thousands of hours of motion-triggered video.
The project is interesting to us for a number of reasons:
• It is a cross-disciplinary/cross-departmental project – these sorts of projects are becoming increasingly common at Exeter and do throw up interesting questions around the area of ‘ownership’
• Huge amounts of data have been and continue to be produced
• Storage is a problem due to the number and size of files – most files are stored on external hard drives held in various places
• As there is no central storage system, secure backup can be a problem
• Ditto secure sharing
• The first batch of video is in a proprietary format that requires specific software in order to be viewable
The case study sets out quite clearly the thought that should be given to selecting and preparing files for upload to a repository. We are looking at how the procedures described can be adapted as templates to guide researchers from other disciplines through the deposit process, some aspects of which will always be generic, for example:
• Listing and explaining the various file formats and how they are related
• Selecting a set of metadata fields to describe the files
• Thinking about the structure of the data in the repository and how it links to related resources, projects and collections
One issue that has arisen from this case study, that we were already well aware of, is the preference to deposit research in a project or research group collection rather than a generic departmental or College collection. In many cases the sense of belonging to or affinity with a group is stronger than departmental ties. This is a tricky one for us: DSpace structure centres on a hierarchy of communities, sub-communities and collections; once these have been set up and start to be populated, it is difficult to make significant changes. Add to that the fact that our CRIS, Symplectic, has been painstakingly mapped across to all our existing communities and collections and any structural changes become even more problematic. For the moment we are looking at a possible metadata solution (dc****.research group ??). I’d be interested to hear how others deal with the research project/group requirement.
We’re about to start a similar test case study with Astrophysics and later in the year with an AHRC-funded project based in Classics and Ancient History. It will be interesting to see if the approaches taken in these areas are significantly different, or given different emphasis.
I won’t say that our first case study has allowed us to resolve the many issues raised yet but we are at least more aware of what is important to researchers and can start to take steps to find solutions.
Open access is finally attracting high-level attention from national governments, but full open access has been a long time arriving despite extensive funding, development and the commitment of many people. As much of that effort switches towards the implementation of repositories to store, share and publish the research data that informs publications, we are considering what lessons might be learned from open access repositories, so that the path to effective data repositories might be shorter and less fraught. In part 1 the factors considered included policy, infrastructure, workflow and curation. Here in part 2 we look at rights and user interfaces.
Rights
Since open access is indelibly associated with publication, one of the primary impediments to providing open access is transfer of rights to publishers, a practice that has failed to adapt to the digital switch. Research data is not so encumbered now, and with care data creators can deploy rights more effectively because they begin in the digital era.
It has often been argued that open access repositories failed to adopt or compete with Web 2.0 services. Quite what this means is not clear, but one aspect might be social and user engagement for the purpose of growing content. Well-known services that became associated with Web 2.0 are YouTube and Flickr, so the case might be that OA repositories were not as successful in attracting content as these services. There is one key point that differentiates these services from open access: prior to Web photo and video services, there were no simple publication outlets for this type of content for non-professional or non-broadcast works. In the case of open access there are pre-existing publications, such as journals and conference proceedings. Open access repositories do not seek to eliminate the journals, but to supplement the access they provide. There is thus another party with a vested interest in ownership of this content.
This is why open access can get mired in discussions about rights. Creative Commons (CC) licences were designed for content to be shared on the Web and communicate how creators are prepared to share their rights with users to open and extend use of their content. For research papers, however, since long before the Web, publishers have required a transfer of rights from the author in return for publication – hence the ownership issue. Unlike CC, these rights can be used to lock down access and reuse, for commercial purposes.
There is a form of open access, ‘gold’ OA journals (in BOAI this is complementary to ‘green’ OA repositories), which may be accompanied by release of commercial rights using CC but often at a cost for publication. In other words, publication is paid for financially rather than in a transfer of rights. Such journals present this as an advantage over non-OA journals and OA repositories, and this can be beneficial for text mining and other applications. While this form of OA publishing has been growing in recent years, it remains to be seen how quickly it can replace or adapt key high-impact journals, and at what cost.
Broadly, research data are not yet subject to publication rights. Publications are a highly processed form of research data, in the form of tables and graphs, for example. Typically the data targeted by data repositories precedes the refined and summarised publication versions, and is therefore not covered by the same rights transfer. That could change if expanded publications requiring data deposit or third-party service providers seek to obtain rights in return for these services.
Strictly, while institutions where research has been performed inherently own the rights to that work, they have been reluctant to exercise those rights in ways that would restrict a researcher’s choice of publication, or to require or even advise authors on some retention of rights or amendments to rights agreements. Unlike with peer reviewed papers, where precedent is more strongly established, it is possible that institutions will seek to impose more control of rights where research data is concerned. Recently reported cases show how a university’s allocation of control of rights within research teams, the special case at Purdue University notwithstanding, can have consequences for publication. What data creators and authors will be concerned about is whether the exercise of those rights by institutions is commensurate with the services that are provided in return. There may be resistance if established academic freedoms are constrained and research impact is reduced as a result, but with the right services and effective exercise of rights impact can be increased by sharing research data openly.
The lesson of open access is that rights matter, that the traditional all-rights transfer for academic publication is no longer appropriate for or conducive to fully exploiting new forms of digital dissemination, but also that established practices can be slow to change. Institutions and authors should be careful not to let rights to research data slip away as they did for publications in another era, but equally they must be careful to work together to use those rights in ways that maximise the benefits and impact for them and for research.
User interfaces
Users of repository services are both those who provide the data and those who consume it. The features that define and characterise repositories are the interfaces through which users can perform these actions, but are these interfaces flexible or adaptable enough to serve all those who might want to use repositories for publications or data?
Within this analysis (including part 1) it has been suggested that OA repositories may have overlooked workflow, and Web 2.0 developments with regard to content growth, services and engagement with users. In fact, some helpful developments can be found buried deep within repository software, but to see where these might impact users more directly we have to look away from the familiar repository interfaces. This critical development is called SWORD (Simple Web service Offering Repository Deposit), and it will impact on data repositories as well, in ways that we have not yet seen implemented on a large scale, even for OA repositories.
As the name indicates, SWORD is focussed on one of the actions that a repository supports: deposit, that is, getting new content into a repository or updating existing content. The updating feature became available recently, with SWORD version 2.
SWORD frees the user deposit interface from the repository software and the specific instance of a repository. As the number and types of repositories have grown, some authors may wish to deposit in more than one place. SWORD can help with that. If the repository deposit interface demands too many keystrokes (metadata), or too few, so that it does not allow all the metadata you want to record, SWORD can help there as well.
The deposit still needs to reach a repository (‘endpoint’) so SWORD and repository softwares are working together on this, not competing. All major repository softwares support SWORD, and the most recent releases support SWORD v2. What’s needed are more SWORD client interfaces, as there have been relatively few examples to date.
It is easy to see that data repositories can benefit from SWORD in the same way as open access – deposit in many places from a single interface. When it comes to scoping metadata within a deposit interface, given the wide disparity in describing different data types in different disciplines with metadata, SWORD begins to appear essential for data deposit. These are just the services we can anticipate now.
With SWORDv2 we can envisage taking deposit out of the forms-based deposit approach and into different applications. One that may work for data deposit is a DropBox-like application for file-based deposit. With this application ‘dropping’ a file to a specified directory in a file manager on a laptop, say, synchronises and copies subsequent versions of that file to a repository (Figure 3), or potentially to a remote storage service, which can be accessed by the user logging on to the storage site using any Web-connected device. Data can thus be accessed and shared, or published in open access repositories. Using SWORDv2, file manager-based services could be used for simple deposit of research data files in conjunction with storage services, and could also fulfil automated deposit cases.

Figure 3. Dragging an image copies it to the selected repository
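To make the ‘drop a file, deposit a file’ idea a little more concrete, here is a minimal sketch in Python of what the client side of such an application might do. It is an illustration only, not how any existing client works: the function names are hypothetical, the folder-polling approach is an assumption, and the header names and packaging identifier follow the SWORD v2 profile as we understand it. Nothing is sent over the network; a real client would POST each changed file, with these headers, to a repository’s collection URL.

```python
import hashlib
from pathlib import Path

# Packaging identifier for an opaque binary file, as we understand the SWORD v2 profile.
BINARY_PACKAGING = "http://purl.org/net/sword/package/Binary"


def build_deposit_headers(path: Path, in_progress: bool = True) -> dict:
    """Return HTTP headers for depositing one file (hypothetical helper)."""
    payload = path.read_bytes()
    return {
        "Content-Type": "application/octet-stream",
        "Content-Disposition": f"attachment; filename={path.name}",
        # Checksum lets the repository verify the file arrived intact.
        "Content-MD5": hashlib.md5(payload).hexdigest(),
        "Packaging": BINARY_PACKAGING,
        # "true" signals that further versions or metadata may follow.
        "In-Progress": "true" if in_progress else "false",
    }


def changed_files(folder: Path, seen: dict) -> list:
    """List files in `folder` whose content differs from the checksum in `seen`.

    `seen` maps file names to the checksum recorded at the last check and is
    updated in place, so calling this repeatedly yields only new versions.
    """
    changed = []
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        if seen.get(path.name) != digest:
            seen[path.name] = digest
            changed.append(path)
    return changed
```

Calling changed_files on the watched folder at intervals, and building headers for each file it returns, covers both the manual ‘drop’ case and the automated case where an instrument writes files into the folder.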
The DataFlow workflow illustrated in part 1 uses SWORD as the transfer mechanism between the user’s local storage and the curated institutional storage, in essence using it to capture additional metadata.
Another demonstrated application of a SWORDv2-based interface works within desktop authoring tools, such as a word processor or other office applications.
What these applications portend is that data repositories can fill the workflow gap, which we recognised was missing from open access repositories, and which looks to be potentially more complex for data repositories. We can begin to support deposit of data to a schedule that need not be based on the same frequency and mode as publication but is more flexible. As well as needing more SWORD client interfaces, however, another open question is how repository softwares designed for publication can adapt to support two different paradigms: managed storage as well as publication.
There are only two reasons for data creators to deposit in data repositories: they want to (share, publish, good academic practice, etc.), or they have to (policy). By focussing on services that are adaptable enough to serve users, building on SWORD to support flexible workflow and bringing deposit into automated or even more creative applications, research data repositories have the chance to support both motivations, instead of being left to emphasise policy as the primary motivator, as has happened for open access repositories.
Summary
Establishing and growing open access content is taking longer and proving harder than was anticipated back in 2000. As we consider how to extend open access repositories to manage research data, are we learning the right lessons from open access? Have we covered all the important issues, or are we missing key factors? Research data repositories bring challenges that are distinct from open access. What are the new challenges, and which of these will have most impact on the success of research data repositories?
In this analysis the factors we have considered include policy, infrastructure, workflow, curation, rights and user interfaces. We haven’t covered preservation, but digital preservation is served by a comprehensive selection of tools that can be applied to repositories, and one lesson seems to be that repositories will move to be preservation-ready when content volumes and risk-analysis demand.
Open access began with the principle that it is good for researchers to share findings, and that digital networks enable that to happen more widely and at lower cost, ultimately free to users. It was anticipated that users would want to take advantage of this, as physicists already did with arXiv, and when this model failed to take off to the same degree in other disciplines, eventually institutional repositories emerged to encourage further growth of open access. As that growth appeared to hit a ceiling, research funders and institutions began to step in with open access policy. In other words, principle – whichever principle you prefer, returns to taxpayers, for example, or productivity of research, or escalating journal costs – was used to justify and frame policy for users. Users themselves, so it seems based on unmandated rates of open access deposit, have been less keen to put principle into practice.
In hindsight there are lessons that could have been learned to speed up the process. Progress with data repositories need not suffer the same mistakes or the same delays. Data repositories might occupy a more pragmatic, less emotional space than open access. Unlike for open access there is no single or easily defined target for research data repositories – what is data? continues to be a perennial question – so policy and requirements might be broader. Perhaps this time content deposited in data repositories can be driven by services that attract users, as well as by policy. In this case, the aim of data repositories must be to find those users who want these services, and then to make those services work better for them.
We’ve chosen DSpace as the repository to hold our research data. Much of the work to date has revolved around the issue of submitting large datasets to the repository.
We’re looking at using SWORD, and possibly SWORD 2.0. We’ve also taken the opportunity to update our current DSpace installations to the latest version of DSpace (1.8.2) and switched to an Oracle database, from Postgres. This gives us 24/7 support and allows us to use the latest version of SWORD, which only works on DSpace 1.8. We would also have had to upgrade our version of Postgres to allow us to use the latest version of DSpace, which helped toward our decision to move to Oracle.
For you techies out there, this process of updating has not been straightforward and is full of pitfalls. There is currently no easy process for cloning a Postgres database and creating an Oracle version, so it all has to be done by hand to ensure the database integrity remains high. However, once the database is switched over, upgrading from DSpace 1.6.2 to DSpace 1.7.2 is quite straightforward.
BUT, DSpace 1.8.2 has some important differences to 1.7, most notably the devolvement of the main dspace.cfg file into several smaller configs.
So, a long winded process but we are near the end now. The test DSpace install is fully functional and so we now have the capability to upgrade the live version.
Ian – Technical Developer
One would think that there would be a simple solution to the problem of linking data and metadata but it turns out that Ben Goldacre's maxim "I think you'll find it's a bit more complicated than that" holds true here too.
the Problem…
PIMMS needs to build a tool to link the metadata records we collect to the data that they are describing. Sounds simple enough, and it would be if everyone used the same archive structure and naming conventions, but we don't have that kind of control over the organisation of the data in A N Other institution that has decided to install PIMMS. So we need a generic method of linking metadata and data.
the Solution…
We reckon the solution will be some method of tagging the metadata and data records but to do that we need:
- tagging - some method or methods of tagging the data and metadata records
- tag store - some place to store our list of related tags
- tag search - some way of finding the tags
We also need to decide
- what kind of tag - how do we tag the data?
- where to tag - to what data granularity should we add tags?
- when to tag - when in the workflow do we add tags and link tags together?
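As a starting point for discussion, the three components listed above (tagging, tag store, tag search) can be sketched in a few lines of Python. This is only a toy, not the PIMMS design: the class name, the record locations and the choice of an in-memory dictionary are all assumptions made for illustration, and it leaves the harder questions of tag type, granularity and timing open.

```python
from collections import defaultdict


class TagStore:
    """Toy in-memory tag store linking metadata and data records (hypothetical)."""

    KINDS = ("metadata", "data")

    def __init__(self):
        # tag -> set of (kind, location) pairs carrying that tag
        self._records_by_tag = defaultdict(set)

    def tag(self, tag: str, kind: str, location: str) -> None:
        """Tagging: attach `tag` to a metadata or data record."""
        if kind not in self.KINDS:
            raise ValueError(f"kind must be one of {self.KINDS}")
        self._records_by_tag[tag].add((kind, location))

    def search(self, tag: str, kind: str = None) -> list:
        """Tag search: return locations carrying `tag`, optionally of one kind."""
        records = self._records_by_tag.get(tag, set())
        return sorted(loc for k, loc in records if kind is None or k == kind)

    def linked_data(self, metadata_location: str) -> list:
        """Return data locations sharing at least one tag with a metadata record."""
        found = set()
        for records in self._records_by_tag.values():
            if ("metadata", metadata_location) in records:
                found.update(loc for k, loc in records if k == "data")
        return sorted(found)


store = TagStore()
store.tag("run-42", "metadata", "pimms:record/17")      # illustrative identifiers
store.tag("run-42", "data", "/archive/run42/output.nc")
print(store.linked_data("pimms:record/17"))
```

Because the store only knows about tags and opaque locations, it makes no assumption about how another institution organises or names its archive, which is the constraint described in the problem statement.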
the Big Question…
Has this problem already been solved? The JISC MRD community has lots of librarians; surely librarians have addressed the problem of how to match up books (data) with their library records (metadata). And there are such things as inter-library loans so there must also be a solution out there for linking together different data archives.
what PIMMS is doing…
We're scoping out the requirements for a tool that we will build to link metadata and data on our wiki. We've come up with ideas for tagging, ideas for storing tags, and requirements for interacting with the tool. We're soliciting advice from everyone we know and it would be great to find out what the JISC MRD have to say!
Thank you so much to all of you who contributed to making UWE's MRD event yesterday so interesting and enjoyable. The speakers and attendees all helped to make the day inspiring and enormously useful. Thank you also to Liz here at UWE for her excellent organisation and execution of the day.
We are pleased that the event successfully highlighted key RDM approaches and issues and gave you opportunities to discuss RDM ideas, problems and achievements.
I drew a number of summary points from the day:
- The problem is getting bigger
- Data curation means different things to different institutions (and services therein)
- There is no one-size fits all approach
- There are many useful reports, tools and products out there for efficient and effective re-use
- Engagement within your institution is key
- It's easy to get sidetracked on problems that are not yours
- Sustainability is everyone's problem
- Working with big / research intensive institutions can be intimidating.
So don't be intimidated, be inspired!
If you attended, please do complete the evaluation you have been sent. In addition, we would be interested in any comments you have on practical actions you have taken away as a result of your attendance. Be nice to have some comments to a blog post...
This month has been a month of thesis writing. As a disciple of engineering, I have not really considered the output of my hours of writing, producing an accessible front end to my research, as data. However, I am told that it is. Personally, I see data as the pages and pages of numbers produced either through physical experimentation or through a computer program; in a format that generally means nothing to anyone apart from its creator. You could extend this definition to describe the figures, graphs etc. that you derive from these pages of nonsense. But what about the findings, the conclusions, the evaluation…is that also data? Well, I suppose they are. Without these how does anyone know what your data has been used for, or its relevance to current research?
But then you have to also consider the framing: The metadata. How did you get your results? What assumptions did you make? What formulas did you use? All of these give validity to your data and provide valuable information and a platform for its future use. Without this platform, the data is arguably worthless to anyone else.
In the past these thoughts about data have been insignificant to me. But now that I am a PG researcher, I have a responsibility to store and share my data on a professional level. The Open Exeter project has helped me to think about data; how I define it, store it, save it, protect it and, in the end, share it. This level of consideration is being demanded by funding bodies and institutions nationally, who expect that the resultant output of taxpayers’ money is available to all. And as professionals, it is our duty to meet these expectations. And don’t forget, it benefits us in the long run by providing a whole new platform for developing new research using the data provided by others…without all the red tape.
We would like to thank everyone who participated in our online research data management survey and announce that Elif Gozler, a PhD student in the Institute of Arab and Islamic Studies, was the lucky winner of the Kindle in our prize draw.
Elif was randomly picked from the nearly 300 participants in our research data management survey and was presented with the prize today by Afzal Hasan, Subject Librarian for the IAIS and Politics, and Open Exeter’s Holistic Librarian. Elif also enjoyed a cream tea with the Open Exeter team.
Congratulations Elif!

Institutional research data repositories follow in the wake of the widespread adoption of open access repositories across UK institutions during the last decade. What can these new repositories learn from the experiences of open access, and what pointers can we find for the development of data repositories? In the first part of this post we will consider factors such as policy, infrastructure, workflow and curation. In part 2 we will extend the analysis to rights and user interfaces.
It may be a timely moment to reflect. A recent speech by the UK government’s science minister David Willetts prompted renewed excitement over open access, with a forthcoming report to advise on specific actions to be taken to realise more open access. Less remarked on, apart from comment about the undefined but potentially high-profile role of Wikipedia founder Jimmy Wales, was the bigger picture view that anticipates stronger integration and linking between research publications, research information for reporting and assessment, and research data for data mining but also for research testing and validation.
Open access (OA) repositories, which principally provide free access to an author’s version of published research papers, effectively began with the physics arXiv in 1991. Institutional repositories, which switch the focus of coverage from the subject to the place of authorship, emerged in 2001 following the Open Archives Initiative (OAI). To complete the record, the term ‘open access’ was defined by the Budapest Open Access Initiative (BOAI) in 2002.
So institutional OA repositories have up to a decade head start on proposed institutional research data repositories. The University of Southampton, home of the DataPool project, has hosted a leading OA repository since 2005, so the project team has long experience of running a repository.
As with OA repositories, there are plenty of examples of subject-focussed research data repositories, but here we focus on factors affecting institutional repositories (IRs).
Policy
For OA IRs, technology and infrastructure preceded policy. First impressions are that for data IRs this will be the other way round. As with OA, data policies in the UK are being driven both by research funders and institutions.
OA policies focus on the need to expand full-text content collections held in repositories and typically require (mandate) or encourage authors to deposit versions of their published papers. The first university-wide mandatory OA policy was implemented at Queensland University of Technology in Australia, in 2004, according to the site EnablingOpenScholarship. This site also shows graphically how the number of institutional policies began to accelerate from the first quarter of 2009, some 5 years or so since the growth of IRs saw similar acceleration, although it remains a minority of institutions that have such policies. It has been calculated that OA mandate policies can increase deposit rates to above 60% of eligible papers from the average of 20%. In this respect, the lack of a suitable policy could be seen to hinder an institutional OA repository.
Emerging UK institutional data policies by comparison have focussed on requiring researchers to create data management plans and data records, and emphasise sustainable practices in managing and storing data for the purpose of access, stopping short of requiring open access or of institutional deposit of actual data that would then need to be supported by the institution. This might be because institutions have still to calculate and cost the storage infrastructure needed, whether managed locally or in the ‘cloud’, because institutions are unclear what value they can bring to data management – or even where the value is in the data they seek to help support, or because there is not yet any consensus on whether data repositories should be subject-based, or institutional, an issue which OA repositories have still not fully resolved. Institutional data policies have in turn been driven and directed by research funders’ data policies, principally RCUK and EPSRC (Jones 2012) setting principles and expectations of institutional compliance within a specified timescale (for EPSRC, by 2015).
Data policies may benefit from being instituted ahead of developing infrastructure for collecting, managing and presenting data. However, the few early policies available suggest little common purpose – we are clearly some way from having a best-practice data policy template for others to follow, as has evolved for OA repositories. To serve even the limited requirements of these early policies, institutions will need to connect decisions on infrastructure and understand patterns of workflow that produce research data, as we shall see below.
Infrastructure
By infrastructure we mean the technical capability to support distinctive requirements. While OA repository infrastructure is well established, it has not had to tackle the challenge of large-scale storage that is likely to characterise data repositories.
The essential infrastructure that led to OA repositories was put in place by OAI: this was a protocol for metadata harvesting, OAI-PMH. This allowed individual repositories to be viewed collectively through services – search being the most prominent service, at a time when Google was new and relatively little known – based on OAI-PMH. Immediately, software emerged for setting up institutional repositories, first EPrints and later DSpace and others. These repository softwares now also bring a range of integral services established over a decade that can be utilised to manage a range of data types, including research data.
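For readers who have not seen OAI-PMH in action, the following Python sketch shows the shape of a harvesting request and how a service might pull titles out of the response. It is a simplified illustration under stated assumptions: the base URL and the sample record are made up, the helper names are hypothetical, and a real harvester would also fetch the response over HTTP and follow resumption tokens for large result sets.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Made-up, trimmed response in the shape returned for verb=ListRecords.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc", set_spec: str = None) -> str:
    """Build a ListRecords request URL for a repository's OAI-PMH endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_titles(response_xml: str) -> list:
    """Return the Dublin Core titles found in a ListRecords response."""
    root = ET.fromstring(response_xml.strip())
    titles = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        for title in record.iter(f"{{{DC_NS}}}title"):
            titles.append((title.text or "").strip())
    return titles


print(list_records_url("https://repository.example.org/oai"))
print(extract_titles(SAMPLE_RESPONSE))
```

Because every compliant repository answers the same request in the same format, a single service can run this against many repositories and present them as one collection, which is what made cross-repository search possible.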
Hence this same infrastructure, with modification, is being used to serve data repositories. There is, however, one new infrastructure component that data repositories will need to introduce – large-scale data storage. While content volumes for OA repositories do not test conventional storage systems, data repositories will inevitably provide much bigger challenges to storage and curation. To get a sense of the scale of the problem, Figure 1 compares data volumes at different stages, and is taken from a presentation about scoping curation for digital repositories. It is notable that data generation volumes cannot be visualised on the same scale as the other stages, since these are orders of magnitude larger. We might call this the data curation gap. Rosenthal has recently questioned assumptions that all data generated might be kept ‘forever’, indicating the need to fill the curation gap: “Assuming (data) growth continues, endowing 2012’s data will consume 19% of Gross World Product (GWP). On these trends, endowing 2018’s data will consume more than the entire GWP for the year.”
Institutions appear to have two choices to serve this level of storage: locally managed, or remote storage in the cloud. It is likely there will be a preference or a requirement to exert institutional control over storage (for example, at the University of Brighton: “we currently have a policy of not hosting staff data outside of the institution”), even in the case of cloud storage, hence developments such as the JISC UMF Cloud Pilot managed by Eduserv.
They could instead opt to advise researchers and data producers on selecting their own storage, from data archiving services such as UK Data Archive and the Archaeology Data Service, or data publication repositories such as Figshare, Dryad and other data repositories listed by DataCite, or even commercial cloud storage services (although a colleague noted that risk-averse advice might wish to start with where not to store data). Apart from the data archiving services, it remains to be seen whether these repositories can provide resilient, cost-effective, sustainable storage over an extended period, where content can be shared collaboratively during development and later made open access.
Workflow
OA repositories were designed from the outset for a publication mode of delivery that does not attempt to capture and support earlier phases in the workflow of writing a research paper. Given the more complex workflow (or life cycle) of research data, and the need to capture data at different stages of production and processing, the single publication mode may be inadequate for data repositories.
As the Web gained popularity in the mid-1990s all sorts of content began to appear, including digital versions of research papers published in what were then still largely paper journals. Authors were simply loading digital versions on to Web servers wherever these happened to be available, usually within their institutions, whether these servers were provided for this purpose or not.
OA institutional repositories served a simple purpose – to provide these authors with a more reliable, managed, services-based Web server to provide access to this digital content over a long timeframe. In this respect the designers behind these repositories over-estimated the number of authors that would use such services and the number of papers that would appear in repositories. Further, because the target content was papers due for peer-reviewed publication, the concept of workflow was barely considered beyond the expectation that the process of repository deposit would happen at the completion of writing the paper and in parallel with submission for formal publication. Thus OA repositories were designed for a one-stage deposit workflow, and no prior contact with authors while a paper was in preparation.
It has been suggested that, by failing to engage authors at a sufficiently early stage and not providing support services for writing papers, OA repositories have lost out to the more established process at the completion of a paper – publication. Further, by the time IRs were widespread, most journals were producing digital versions, so that was no longer a factor for authors posting Web copies of their papers, even if those journal digital versions still mostly stood behind subscription barriers.
While it is in principle simple to upload a completed paper from a local file store to a repository, it has been argued that a restraint to this happening is the requirement by many repositories for extensive accompanying ‘keystrokes’ or metadata. Competition with publishers for keystrokes at the point of completion and submission, lack of clarity in the benefits of OA repositories, and the failure to integrate with workflow may have been factors in preventing OA repositories from growing content to the levels anticipated, and led directly to the mandate policies described above.
The lesson for data repositories is clear: to capture content from data creators you must provide useful services that will become an integral part of the workflow of creating the data. It will not work to isolate particular requirements, such as records creation, from other needs such as storage services. Data does not appear with the same mode and frequency as published papers, so workflow must accommodate many different patterns. Research data is often produced by machines, so deposit workflow must allow scope for non-manual intervention.
While workflow involved in the production of research data is more complex and less easy to classify than for OA publications, one helpful representation of this workflow has been illustrated by the University of Oxford (Figure 2). This shows how a project begins with a bid for funding and in future will invariably be accompanied by a data management plan (DMP), a data roadmap for the project to follow. If a workflow begins with a successful proposal and a DMP, it will lead to data and, increasingly, from policy or from users, a requirement for managed data storage with the ability to support controlled access for collaboration, and discovery for wider access. Figure 2 is taken from this presentation by the DataFlow Project.
Effective institutional data services will need to span this whole workflow and engage data creators at all stages. Lessons on workflow from open access suggest that for research data providing separate services for creating data records and storage, for example, will be insufficient. Data creators and authors will not engage in processes that do not enhance their work.
Curation
Digital curation is defined by Wikipedia as the “selection, preservation, maintenance, collection and archiving of digital assets”. For open access, selection is pre-determined – the target content is peer reviewed and published research papers. Further selection of such content for curation purposes is not merited by the data volumes involved. As we have seen with the ‘curation gap’ above, this does not hold for research data. As a result, more attention will need to be paid to curation for research data, and the line between simple user-managed storage and assisted curation will need to be more flexible.
Where that line might be drawn is thus open to question. It is drawn in principle by the strategy exemplified by DataFlow in Figure 2, which has two stages representing user-managed workspace and storage (the stage to Local Storage & Retrieval in Figure 2), with a transition to an institutionally curated space (Institutional Storage, or DataBank in Oxford’s system). The question remains as to what drives that transition. Such spaces are likely to have different curation criteria in different institutions, and will need to take account of researcher, policy and publication requirements, as well as costs.
An example of research data management that has optimised workflow, metadata collection and records creation, data curation, aggregation, discovery and access is eCrystals at Southampton, now extended to a federation led by the UK National Crystallography Service.
Interim summary
There are more lessons to learn from experience with open access that we can apply to research data repositories. In part 2 we will extend the analysis to rights and user interfaces.
As part of Orbital, we’ve started to seriously consider how RDM (Research Data Management) fits into the whole range of University systems to support research information. In some contexts (usually research support/administration), we might use the term RIM (“Research Information Management”); in other, academic, contexts, we might talk about a VRE (“Virtual Research Environment”) – but I prefer to think of both as functions of the same set of systems.
Below is a first attempt to model various University research systems and what information might be shared or passed between them. This is a first draft only and will be discussed/developed over time – it’s here for comments!
It’s available to view on Lucidchart.
The Orbital project team met today (24 May 2012) and agreed the following:
- Documentation
- User documentation will focus on the “why”s of Research Data Management, rather than being a point-and-click guide to the Orbital UI (which should not require detailed explanations).
- JW will create a changelog (human readable text file) for each major release of Orbital, so that documentation for each feature is reviewed if that feature is updated.
- PS will lead on writing documentation (as HTML pages, stored in the GitHub repository), with documentation for release v0.N completed and available by the launch of v0.N+1
- PS will email colleagues from the Library and Research/Enterprise for assistance on writing documentation.
- Training
- JW will invite Melanie Bullock and David Sheppard on to the Orbital working group. He is meeting Annalisa Jones to discuss RDM training for staff.
- Releases/development
- Orbital v0.1.1 (including bug fixes) met all of the initial ‘minimum viable product’ requirements specified by Dr Tom Duckett, and also includes the basics of project administration.
- v0.2 will include improvements to the file upload/management, project management, and license management interfaces, as well as clearer distinction between language files and operating code.
- NJ demoed the current version of Orbital to Siemens staff. He now has access to Siemens machine data for testing within Orbital.
- The group discussed the LNCD plans for internal servers/private cloud, and about the disk space requirements and costs.
- Integration
- The current version of the DMPOnline tool has been installed on a test server. The group discussed our approach to integration between external tools/software (such as DMPOnline, R, Gephi) and Orbital.
- NJ is going to email Adrian Richardson at the DCC to ask when the DMPOnline APIs will become available.
- RDM policy
- JW presented the draft policy to the University RIEC committee. The committee have been asked to send comments to Joss. (One comment at the committee meeting was that our having a policy too geared around the requirements of the Research Councils may not be appropriate for Lincoln, which generates a lot of non-RC income. However it was noted that the good practice specified by the RCs is good practice for management of all research data, whatever the funding source.)
- Conferences and meetings
- JW, NJ and HN reported on the recent MRD hack days.
- JW recently attended an event organised by the UWE in Bristol (Raising your ReDMan:Approaches to Research Data Management), at which he spoke about Orbital’s minimum viable product approach, and about how Lincoln’s user-led development means we have to take a close interest in RDM issues.
- NJ and JW are presenting a paper at OR12 (Open Repositories 2012). PS will attend for ~1 day of the conference.
- JW is chairing a planning event with Ken Evans, looking at data/schema modelling for all the University’s research information systems (Orbital, EPrints, the Awards Management System, people systems) using an Object Role Modelling (ORM) approach.
- Data Asset Framework survey
- The group discussed the recent DAF survey which we conducted at the University of Lincoln.
- JW will convene a sub-group to consider the responses in detail, and plan follow-up interviews.
- Business case
- JW is currently gathering costs for long-term data storage. This will form the first strand of the Orbital business case, which will be presented to University SMT (along with the agreed RDM policy) in September 2012.
The development team of the MiSS project recently attended the excellent DevSCI Managing Data Hack Day in Manchester. The aim of the event was to bring together technical and non-technical people with an interest in managing research data.
We were all encouraged to enter into a collegiate and informal atmosphere to discuss, implement and research the areas of importance around the managing research data sphere.
The main thrust of the event as I saw it was to share what your ideas and problems were and see if they chimed with other people’s. This is at first a rather daunting task because one can’t help thinking ‘my institution has its own distinct problems – why would they marry with other institutions?’ As ever this line of thought is only partially right – of course other institutions have the same problems. Ish…
We all raised the issues we wanted to discuss and it quickly became apparent that the most pertinent issues to our work at MiSS were those to do with data and not metadata – the issue of transferring data, transferring large files, drop-box analogies and so on. (That’s not to say we’re not interested in metadata – it’s just that there is only so much time a developer can have!)
Drop Box – everyone likes it, we don’t care.
The rise of cloud based personal storage systems like Drop Box has been huge recently. Researchers are seeing it as an entirely viable way of storing personal research data safely on a day to day basis. It is perfect for storing data during the processing stage of an experiment, for example, when the data owner is the only person who needs to access it.
And although discussing this was a useful activity, there are some fundamental problems with either using Drop Box or creating an analogue of it. The first is that Drop Box’s ease of use is its downfall in managing research data. The point of Drop Box is that one drags some files into it, and pulls them out when one needs them. There are no metadata. There is no way of associating stored files with a project. A researcher could store files in a named folder in Drop Box pertaining to their current project, but it would then be the researcher managing their data, not a service like MiSS.
If Drop Box is what the researchers want, give them Drop Box. Unfortunately what they want and what they need is not always the same thing…
Moving files – A Land of Limited Opportunities.
There was a lot of discussion on moving large files, and moving files in general. There are inherent problems with even the most basic file movement methods, such as moving a file from a researcher’s PC to a remote server using HTTP. How do we version the file? Where are the metadata stored and how are they associated with the file?
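One possible answer to those two questions, offered purely as a sketch and not as anything MiSS has decided on, is to label each stored copy with a short hash of its contents and to keep the metadata in a small JSON "sidecar" file next to it. The function name `store_version`, the sidecar naming and the metadata fields below are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def store_version(source: Path, store: Path, project: str) -> Path:
    """Copy `source` into `store` under a content-derived name and
    write a JSON sidecar that ties the copy to a project."""
    # Reading the whole file is fine for a sketch; a real service
    # would hash large files in blocks.
    digest = hashlib.sha256(source.read_bytes()).hexdigest()
    version_id = digest[:12]  # short, content-based version label
    store.mkdir(parents=True, exist_ok=True)

    stored = store / f"{source.stem}.{version_id}{source.suffix}"
    shutil.copy2(source, stored)

    # Hypothetical metadata fields; a real schema would be agreed
    # with researchers and the repository.
    sidecar = stored.with_name(stored.name + ".meta.json")
    sidecar.write_text(json.dumps({
        "project": project,
        "original_name": source.name,
        "sha256": digest,
        "size_bytes": stored.stat().st_size,
    }, indent=2))
    return stored


# Two saves of the same file with different contents give two versions.
workdir = Path(tempfile.mkdtemp())
data_file = workdir / "results.csv"
data_file.write_text("sample,value\na,1\n")
first = store_version(data_file, workdir / "store", project="demo-project")
data_file.write_text("sample,value\na,2\n")
second = store_version(data_file, workdir / "store", project="demo-project")
```

Because the version label comes from the contents, saving an unchanged file twice lands on the same name, while any edit produces a new one. The sidecar is what a bare drag-and-drop folder lacks: something that says which project a file belongs to.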
And when we move onto transferring larger files then more problems arise. If we send one birthday card to our Granny using the Royal Mail we can expect it to get there fairly safely with one postman and be expecting our ‘thank you’ letter any day. If we send 2,147,483 birthday cards to our Granny (rough estimate at a Gb of data!) we can be waiting a fair while for our thank you card – especially if any went missing halfway.
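The birthday card picture can be sketched in a few lines. This is only an illustration of the general idea (split the data into pieces, record a checksum per piece, and resend just the pieces that went missing or arrived damaged); it is not the approach of any of the tools discussed at the hack day, and the tiny chunk size and function names are assumptions chosen to keep the example readable.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Return a list of (index, sha256 hex digest, chunk bytes) tuples."""
    chunks = []
    for index, start in enumerate(range(0, len(data), chunk_size)):
        chunk = data[start:start + chunk_size]
        chunks.append((index, hashlib.sha256(chunk).hexdigest(), chunk))
    return chunks


def missing_or_corrupt(manifest, received):
    """Compare the sender's manifest with what arrived and list the
    indexes of chunks that need to be sent again."""
    to_resend = []
    for index, digest in manifest:
        chunk = received.get(index)
        if chunk is None or hashlib.sha256(chunk).hexdigest() != digest:
            to_resend.append(index)
    return to_resend


payload = b"happy birthday granny"
chunks = split_into_chunks(payload)
manifest = [(i, d) for i, d, _ in chunks]

# Simulate a lossy delivery: chunk 2 never arrives, chunk 4 is damaged.
received = {i: c for i, _, c in chunks if i != 2}
received[4] = b"????"
resend = missing_or_corrupt(manifest, received)

# Send only the bad chunks again, then put everything back in order.
for i, _, c in chunks:
    if i in resend:
        received[i] = c
reassembled = b"".join(received[i] for i in sorted(received))
```

The point is that a lost card costs one resend rather than starting the whole delivery again, which matters more and more as the file grows.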
There were many solutions suggested that may help our plight. These included using SparkleShare, Git repositories, leveraging BitTorrent protocols etc. This led onto another discussion about how some institutions didn’t like having two separate solutions for moving data – one for large files and one for smaller files. To my mind, however, having two separate solutions is the best way; that was the opinion I had formed by the end of the hack day. How exactly we’ve yet to decide…
Jon Besson – jonathan.besson@manchester.ac.uk
As is common for researchers dealing with human research subjects, security of sensitive data is one of my main concerns. While developing my data management plan, it became clear that I would need to encrypt my data storage devices, including my computer and backup external hard drive.
Encryption, put simply, is the conversion of data into a format that requires a key in order to be deciphered. If you lose your laptop, external hard drive or USB key, encryption would at least make it very hard for the person who finds it to access your data in an intelligible format.
I have been making enquiries about encryption solutions, and found out that the university’s IT department uses TrueCrypt, which is free open-source software compatible with various operating systems, including Windows 7, Vista, XP and Mac OS X. More information and guidelines on TrueCrypt, offered by the University of Exeter’s IT department, can be found at:
http://as.exeter.ac.uk/it/regulations/infosec/encryptionforlaptops/usingtruecrypt/
Note that if you use a university loaned laptop, you may have to request approval prior to encrypting the device. Laptops on loan are not necessarily encrypted, but I was informed that this may change in the future. Very importantly, make sure to remember your key and keep it safe!
Another concern of mine is to ensure that the data temporarily stored on my iPad – which I use as a data collection tool (notes, audio, fieldwork photos) – is secure. General opinion seems to be that an iPad is password protected and therefore relatively secure; I would be interested to hear what others think about this. I have also noticed an option to wipe the data on an iPad/iPhone should authentication fail 10 times in a row.
I am also currently wondering about the safety of cloud storage services such as Dropbox and iCloud. I have been assured by an IT technician who has read a lot about Dropbox that it can be considered safe. I do not have a problem with putting my own documents in there… But what about sensitive research data? I would be interested to know what others make of cloud storage.
Synching
I am a bit of a research data hoarder. I am so scared of losing my preciously gathered data that I tend to save it everywhere, with different backups. That is fine if your devices are properly protected, unless, like me, you perpetually have to manually update your documents stored in multiple locations. Aside from the time-consuming nature of this process, I have in the past occasionally forgotten to update a backup file, and so lost track of the most up to date version of my document. With my current project, I have decided to investigate ways of automating the synching process, to make sure my files are saved and backed up, saving me time and avoiding missed updates. There is a simple solution that has been discussed by another member of the Open Exeter project in his private blog, which is SyncToy. This same free software solution was also recommended by a technician from the University’s IT department. All I need to do now is go on a data/folder cleaning spree, properly plan out my storage destinations and process, and get the synching going. It will take a bit of time, but as with many things, investing the energy now might save me much more in the future.
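For anyone curious what automated synching boils down to, here is a minimal sketch of a one-way backup: it copies files that are new or have changed from a working folder to a backup folder and never deletes anything. It is not how the tool mentioned above works internally (I have not looked), and the folder layout and the function name `one_way_sync` are assumptions made for the example.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in blocks so large files do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def one_way_sync(source: Path, backup: Path) -> list:
    """Copy new or changed files from `source` to `backup`.
    Nothing in `backup` is ever deleted. Returns the relative paths copied."""
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source)
        dst = backup / rel
        if not dst.exists() or file_digest(src) != file_digest(dst):
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            copied.append(rel.as_posix())
    return copied


# Demo in a throwaway directory.
root = Path(tempfile.mkdtemp())
work, backup = root / "work", root / "backup"
(work / "notes").mkdir(parents=True)
(work / "notes" / "interview1.txt").write_text("draft")

first_run = one_way_sync(work, backup)    # copies the new file
second_run = one_way_sync(work, backup)   # nothing changed, nothing copied
(work / "notes" / "interview1.txt").write_text("revised")
third_run = one_way_sync(work, backup)    # picks up the edit
```

Comparing contents rather than relying on memory is what removes the "did I update that copy?" problem. Note that a backup made this way is only as safe as the device it sits on, so the encryption points above still apply.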
Atira PURE is a current research information system (CRIS) that has been adopted by around 20 UK HEIs. The UK PURE user group works closely with Atira to define requirements and maintain a unified data model across all UK implementations. The user group met last week at The University of Aberdeen, and several institutions with JISCMRD projects were represented. The meeting was preoccupied with the present, in the form of mock Research Excellence Framework assessments, but there was also discussion of the product roadmap and some items of interest for those with a foot in the research data management camp.
CERIF2: PURE’s data model will be continually adjusted to match CERIF developments.
OpenAIRE Compliance: support for the OpenAIRE format will be added to PURE’s OAI-PMH harvesting interface.
PURE as a repository: a new player in the market? PURE currently supports ‘connectors’ to DSpace, EPrints, and Equella, so that research outputs originating in the CRIS feed through to an existing repository system. Whilst making a clear commitment to maintaining these interfaces, Atira restated their belief that PUREPortal offers an alternative that could replace a traditional repository system in full. The best example of this is at Aalborg University. At University of Hertfordshire we maintain a DSpace repository, but our PURE CRIS is now the primary source for almost all our repository content. This is a similar position to University of Edinburgh and several others. We have reasons for keeping DSpace at the moment, not least because it is open source and offers the opportunity to be hacked to try out new initiatives, such as publishing data. There are several new PURE repositories about to go live, mainly among Universities who do not have an existing public presence. It will be interesting to see if it continues to gain traction among those of us who already have systems online. I think they may struggle to penetrate further until the REF is concluded and everyone has time to breathe, reflect and address new projects. (RDTK is already experiencing inertia due to the REF, which is an overriding priority for researchers and administrators alike.)
PURE and Datasets: there was quite a lot of discussion about data, with two tangled – but, with hindsight, distinct – threads: the first about data as primary research output in the REF, and the other about the new imperative to publish data in support of traditional publications. The first thread came up when the meeting was considering how PURE currently expresses non-textual outputs, including physical art outputs, events, source code, and data. This naturally drifted to a discussion about metadata, wherein I began to fear we would be mired for the rest of the meeting; but CERIF gurus rescued the day with a timely intervention about the likely outcomes of CERIF for Datasets (C4D). By this route we arrived neatly back at the first point above. Atira have previously told me that they are waiting on the inclusion of a metadata model for data in CERIF, and will implement this when it arrives. I pointed out that in order to fulfil their aspiration as a repository vendor they will also have to address more than just the metadata issues, for example, in the way that @mire have done with their media streaming plugin for DSpace. (As an aside, @mire tell me the DSpace developer community is also taking a keen interest in C4D.)
Data working group: the conclusion of these discussions was that a working group should be convened to report on data publishing issues at future UK PURE user group meetings. If anyone in JISCMRD who is not in the PURE user group would like to feed into this, then #rdtk_herts can facilitate.





