Research data management
"Research data management" bundle created by Jez
- ADMIRe
- RoaDMap blog
- REDm-MED
- Research360@Bath
- Modus Operandi for Repository Deposits
- Open Exeter: Human Factors in Research Data Management
- DaMaRO
- Data Management Planning for Secure Services (DMP-SS)
- Managing Research Data: a pilot study in Health and Life Sciences
- BRISSkit
- CERIF for Datasets
- data.bris
- Creating data management plans online
- JISC DataPool Project
- DATUM
- Data Management Planning & Storage for Psychology
- Research Information Network
- History DMP
- Iridium
- JISC MRD: Evidence Gathering
- Orbital
- PIMMS - Blog
- Rapid Organization of Health Research Data
- Sustainable Management of Digital Music Research Data blogs
- University of Hertfordshire - Research Data Toolkit
- JISC Managing Research Data Programme
- Research Data @Essex
- Scholarly Output Notification and Exchange (SONEX)
- SWORD
- Digital Curation News
- EDINA and Data Library: News
- MiSS
- KAPTUR
- REWARD - Events
- REWARD - News
ADMIRe induction and #gettingtogrips
I will be starting full-time with ADMIRe next week, so this month has mainly been taken up with ADMIRe induction meetings, project planning with Tom and Bill, allocating tasks from the work packages, preparing presentations, and desk-based research on research data management (RDM) and data information literacy (DIL).
Research Data Management Training Materials
I have been looking at a very wide range of topics/issues related to research data management and have found the outputs from the JISC Research data management training materials (RDMTrain) projects particularly useful. Tomorrow, Tom and I will be joining Wendy (Faculty Team Leader, Medicine and Health Sciences) and colleagues from the Graduate School to explore the potential of the Research Data MANTRA course. This course is designed for PhD students and others who are planning a research project using digital data.
MANTRA is an Open Educational Resource (OER) that may be freely used by anyone. It is available through an open license for re-using, rebranding, and re-purposing. MANTRA is one of the key outputs from the first phase of the JISCMRD programme and has been produced by EDINA and Data Library, a division of Information Services, University of Edinburgh. Further information on the project is available from here.
Today I did a presentation for the Information Literacy Development Group (ILDG) on the issue of RDM and data information literacy skills (DIL). For some useful information on the role of data information literacy and libraries, the presentations from the recent Research Libraries UK (RLUK) event are all available here. This event aimed to clarify the research library agenda with regard to RDM.
DOIs for Research Data
I came across this interesting article today (full-text freely available), published in the May/June 2012 issue of the D-Lib Magazine:
‘Implementing DOIs for Research Data’, Natasha Simons, Griffith University, Australia. Natasha concludes that implementing DOIs has “raised governance questions common to other institutions that encouraged discussion and collaboration.”
Orbital v0.1 was released on 16 May 2012. Every two weeks, staff working on Orbital meet with Dr Bingo Wing-Kuen Ling and Dr Chunmei Qing to discuss their research and RDM practice. Until now these meetings have been all about requirements-gathering – today was the first opportunity for some real, hands-on user testing with the alpha release of Orbital.
The notes below have been turned into tasks on the Orbital project Pivotal Tracker site.
BL = Bingo Wing-Kuen Ling.
- BL successfully viewed Orbital v0.1 in Internet Explorer 7 on the UoL corporate desktop and was able to sign in and grant access to the application using his UoL credentials. BL was able to create and describe a new project.
- BL tried to upload a file from his desktop to Orbital using IE7 and received an error (this is a known bug with Orbital in Internet Explorer). He was then unable to delete this file.
- Switching to Firefox, BL uploaded multiple files from his desktop to his project in Orbital (it wasn’t clear from the page that this was possible). The upload completed successfully, but because the file sizes were small, he did not receive any feedback on his upload.
- Returning to the original file upload screen, BL had to manually refresh the page to view the changes made (files uploaded). Files scheduled for processing are marked as ‘queued’; however, this status does not update automatically without refreshing.
- Joss Winn demonstrated the file and project metadata pages, citable URLs for files, and Google Analytics on projects. The display of file metadata needs to be more complete, and G.A. needs a better explanation and links to sources of help.
- The group discussed BL’s requirements around project calendars/timelines. BL wants to be able to view project events (meetings, deadlines, etc.) for each project (but not aggregated) and is not particularly concerned about notifications on activity/changes to files. The group discussed this and will explore ways of presenting timelines made up of three sorts of events (project events, activity stream, and comments) with each type of event suppressible in the timeline. A timeline overview will be displayed on the Orbital ‘front page’ once a user has logged in.
- BL also would like to be able to organise project and data files in all Orbital workspaces using folders/tags, and to allow bundled file download by organising files into collections.
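The timeline idea discussed with BL above (three sorts of events, each type suppressible) could be modelled roughly as follows. This is only an illustrative sketch in Python, not Orbital's own code (Orbital is written in PHP); the names `TimelineEvent` and `build_timeline`, and the three event-type labels, are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List

# Hypothetical labels for the three sorts of events mentioned in the notes.
EVENT_TYPES = frozenset({"project_event", "activity", "comment"})


@dataclass(frozen=True)
class TimelineEvent:
    when: datetime   # when the event happened or is scheduled
    kind: str        # one of EVENT_TYPES
    summary: str     # short human-readable description


def build_timeline(events: Iterable[TimelineEvent],
                   hidden: Iterable[str] = ()) -> List[TimelineEvent]:
    """Return events in chronological order, leaving out suppressed types."""
    hidden_set = set(hidden)
    unknown = hidden_set - EVENT_TYPES
    if unknown:
        raise ValueError(f"unknown event types: {sorted(unknown)}")
    visible = [e for e in events if e.kind not in hidden_set]
    return sorted(visible, key=lambda e: e.when)
```

A user who is not concerned about file activity, as BL said he was not, would call `build_timeline(events, hidden={"activity"})` and see only project events and comments.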
You can read about Orbital v0.1 in this blog post, and about the roadmap for development and release of future versions, here.
Raising your ReDMan: Approaches to Research Data Management
UWE Bristol on Wednesday 23rd May.
The first release of Orbital (v0.1) was on May 16th 2012. Orbital is a research and development project that has initial funding until March 2013. Given that timeframe, what follows is a high-level roadmap of where we expect development to go over the next few months.
We plan to release a new, working version of Orbital around the beginning of each month, up to December 2012, when we hope to hit v1.0. We plan to have a further point release in February, based on user feedback and then give ourselves a month to tidy things up and wind down this phase of the project.
You can read our implementation plan, check out our responses to user feedback and watch our project tracker for more details. The code can be downloaded here and here. Be aware that the roadmap may change at any time, based on user requirements, illness and natural disaster.
v0.1 (May)
See announcement. Authentication, user profiles, projects, upload of flat file data, a license picker, private and public archives, publish to the web, stable identifiers, download of flat files, analytics, and a way to give feedback.
v0.2 (June)
Documentation (for both developers and users), activity data, working/dynamic datasets + APIs, app access (OAuth), bug fixes, responses to feedback.
v0.3 (July)
User workspaces, version control, data staging (workflow), data snapshots, tools for basic analysis of datasets.
v0.4 (August)
EPrints integration (SWORD2), integration with Lincoln’s Award Management System (Worktribe), APIs for discovery, full role/permissions management, documentation.
v0.5 (September)
Plugins for third-party tools (e.g. Matlab)
v0.6/0.7/1.0 (Oct/Nov/Dec)
Network drive for desktop workspaces, full documentation for RDM, DataCite, ORCID, DMPOnline integration, improve design/branding, federated access.
Happy Christmas!
This is a post about our first release of Orbital.
About a month ago, Dr. Tom Duckett, Reader in The Department of Computing and Informatics approached the Orbital project because he urgently wanted to publish around 20GB of data for Long-term mobile robot operations. That afternoon, we gave Tom and Feras Dayoub, his Research Assistant, space on one of our servers and they uploaded a bunch of HTML pages and the zipped up data. We minted a proxy URL for them and advised them on an appropriate data license to choose. We also set up Google Analytics, so they could see what interest in his data there was.
Job done. For the time being.
What Tom really wanted was to be able to email a link to his data to a robotics mailing list and tell an international community of likeminded researchers and manufacturers that the data was available to use. He says that long-term datasets for mobile robots are quite rare in his community, so there was a good chance people would be interested in them. He also wanted to be able to demonstrate his work when writing an EU bid. There will be a follow up blog post about what impact this has had on Tom’s research.
That afternoon got us thinking: What is the minimal set of functions that a researcher like Tom requires of a Research Data Management tool?
Tom wanted access (sign in) to a server (hosting) where he could upload his data (storage) and describe it so that other people could understand and download it (publish) under an appropriate license. The URL pointing to the data should be persistent, even if the data itself is migrated from one system to another. The impact (analytics) of the data should also be measurable.
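The persistence requirement can be pictured as a small lookup table sitting between the public identifier and wherever the data currently lives. The sketch below is a minimal illustration in Python under that assumption, not a description of how Orbital or id.lincoln.ac.uk actually works; the class `IdentifierRegistry`, its method names and the example.org URLs are all hypothetical.

```python
class IdentifierRegistry:
    """Maps stable public identifiers to the current location of the data."""

    def __init__(self) -> None:
        self._locations: dict[str, str] = {}

    def mint(self, identifier: str, location: str) -> None:
        """Register a new identifier; an identifier is never re-minted."""
        if identifier in self._locations:
            raise ValueError(f"{identifier} already exists")
        self._locations[identifier] = location

    def migrate(self, identifier: str, new_location: str) -> None:
        """Point an existing identifier at the data's new home."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_location

    def resolve(self, identifier: str) -> str:
        """Return where the data lives right now."""
        return self._locations[identifier]


registry = IdentifierRegistry()
# Hypothetical identifier and storage URLs, for illustration only.
registry.mint("dataset/robot-logs", "https://old-server.example.org/robot-logs/")
registry.migrate("dataset/robot-logs", "https://new-store.example.org/robot-logs/")
```

The link sent to a mailing list is the identifier, not the storage URL, so it keeps working after the migration.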
Tom’s chance intervention in our project made us focus on Orbital v0.1 as the ‘minimum viable product’ for researchers who need to publish open data. We thought his requirements were a great opportunity to release something early and start getting direct user feedback on our product. We decided to set a release date for Orbital v0.1 a month ahead and aim to deliver everything that Tom asked of us in this first release.
A Minimum Viable Product has just those features that allow the product to be deployed, and no more.
Today, we released Orbital v0.1 and it does everything described above. It’s an alpha release, but we’ve been testing it like crazy and we also had Feras test it. We’ve been pushing code through Jenkins since the beginning of the project, so we know it passes our QA checks and we think it’s stable enough for use. From this point forward, Orbital and the URIs it mints will persist, too.
From today, a researcher at the University of Lincoln can sign in to Orbital, create and describe a project, upload their data to the project, choose a license for the data and add a Google Analytics code to measure project analytics (we’re also tracking each button click to better understand how people use Orbital). The data is published at an id.lincoln.ac.uk URI, which will persist indefinitely. At this stage, until we’ve got an approved business case for scaling it up and out to all academics, we’ll be limiting uploads on a case-by-case basis. You can view and request what other features we develop for Orbital on UserVoice, or in more detail on our project tracker. We’ve also written a basic development roadmap.
For developers, here are the basic technical details. You might also want to trawl through our implementation plan and the collected blog posts at the bottom of the plan.
Orbital is written in PHP using the CodeIgniter development framework. It’s split into two main pieces of functionality. Orbital Core (database and APIs) is currently hosted on a Linux box on Rackspace’s cloud. Orbital Manager (the User Interface) is likewise hosted on Rackspace. A user signs in to Orbital Manager via OAuth 2.0 using their university credentials. Orbital Manager is using Twitter’s Bootstrap framework. The project metadata is stored in a MySQL database. Files are uploaded to Rackspace’s cloud files storage using Andrew Valums’s AJAX Uploader. APIs are exposed using Phil Sturgeon’s CodeIgniter REST server.
Orbital is licensed under the GNU Affero GPL 3 license and you can download, fork it and create pull requests on Github:
New contributors to Orbital will be ritually applauded each weekday morning
Thanks.
The Digital Preservation Coalition (DPC) and Digital Curation Centre (DCC) are delighted to announce a new issue of our joint newsletter ‘What’s New’.
Since the last progress report DaMaRO has undergone a change of management (which I’m using as an excuse for this special 2-in-1 update). Many thanks are due to Pip Willcox for guiding the project through its early stages and really laying the foundations for the technical development work that is now commencing. Thanks to her hard work and that of her colleagues at the Bodleian Libraries, we have a good idea of the metadata standards we need to employ, a clear vision of where we are going, and a plan for getting there.
With the completion of the VIDaaS Project at the end of March, various members of Oxford University Computing Services have now had their time freed up to dedicate themselves to DaMaRO, so I’d like to welcome our analyst, Dr. Meriel Patrick, and our software developer, Asif Akram, to the team, along with myself as the new project manager. Our final team member, Bhavana Ananda, will be joining us tomorrow.
Over the last couple of months we’ve refined the project plan into a more detailed delivery plan, defined our processes for software development (adopting an Agile approach with 3-week timeboxing), and compiled a prioritized list of requirements. Meriel has produced the project’s training strategy and has begun compiling all of the training materials produced by the various JISC MRD-sponsored projects that have taken place since Sudamih, so that we can assess which bits we want to repurpose for use within Oxford. We will, of course, also be developing original materials during DaMaRO and ensuring that what we produce can be fed back to the community.
I have been doing a bit of work on confidentiality and open access of research outcomes recently including considering how these factors apply to datasets.
I have been involved in FoI at the University of Glasgow since being part of the original implementation team.
Today I attended an FoI update by our University Legal representatives, which reminded me to ensure that we are happy with how our policies explicitly state that one of the main planned outcomes for research is to publish results.
This may add support to any case considering exemption under section 27 (Freedom of Information Act Scotland) or section 22 (Freedom of Information Act – covering England and Wales)
We spoke about the case by case nature of enquiries and that the consideration is whether there is a true claim to withholding the information.
We touched on the Protection of Freedoms Bill which is being prepared for publication and it will be interesting to see how this develops.
I also have actions on our Data Management Roadmap to:
- Involve our Intellectual Property Team in discussions regarding metadata recording for datasets as well as for other entities that may potentially have confidential aspects.
- Ensure our IP and Business Development representatives are involved in discussions regarding policy and procedure development in relation to any data we may hold on behalf of third parties and that the guidance around such datasets is clear.
The Open Exeter project is pleased to invite all UoE researchers to Discuss Debate Disseminate: A discussion of the issues around the management of your research materials and data and an opportunity to network with other researchers. PhD students and early career researchers from all disciplines welcome.
The event will take place on 22nd June 09:00 – 12:30 in the Upper Lounge of Reed Hall on the Streatham campus.
Programme
09:00 – 09:15: Arrival coffee/tea
09:15 – 09:30: Welcome
09:30 – 10:30: Session 1: Delete, Keep or Share?: Each researcher brings one example of research material or data (this could be, for example, in electronic or paper format). In groups you will describe your research material or data briefly before discussing whether you would delete it, keep it, or share it, and why.
10:30-10:45: Coffee/tea break
10:45-11:30: Session 2: “Speed data dating”: Meet and get to know other researchers and the issues that they face with their research materials. Are there any common problems or solutions?
11:30-12:15: Session 3: PhD student panel session: Open Exeter PhD students answer your research materials management questions.
12:15-12:30: Feedback and Close
Please register for the event via email to h.lloyd-jones@exeter.ac.uk
For event details see: https://www.facebook.com/events/407590612590904/
Once datasets are fully supported in CERIF and CERIF datastores, they will be a formidable information resource that will serve a variety of users and uses.
C4D will develop an ontology and ontology driven interface to CERIF data stores which will provide key features of import, discovery, exploration and linking.
D3.1 addresses issues faced in C4D and proposes an architecture to solve these issues.
D3.1_Ontology_Upload_Requirements_v2.2a
The Digital Curation Centre is pleased to announce the launch of DMP Online v3.0
This new release marks a major progression in the software’s functionality. For the first time users can create data management plans incorporating multiple templates, so if your institution, your funder and your publisher all require data management plans, you can now create a single plan to satisfy them all.
The Relu Data Support Service (Relu-DSS) helped researchers of the Rural Economy and Land Use (Relu) Programme manage their data throughout the research lifecycle. For the duration of the Relu Programme, the Relu-DSS provided proactive data management guidance and support to the Programme's researchers and co-ordinated the archiving of interdisciplinary data collections.
Relu was the first cross-council multi-million pound research programme to fund a dedicated support service, co-ordinated between the UK Data Archive and the Centre for Ecology and Hydrology (CEH) at Lancaster, to realise its data sharing policy. It also promoted the first programme-level research data policy which expected all its researchers - with support from the Relu-DSS - to plan a data management strategy for their research and manage their data well throughout their research, with data archived at data centres funded by the Research Councils (ESRC and NERC) when the research finished.
All research projects funded by the Relu Programme developed a data management plan. They were also given bespoke advice on preparing and implementing data management plans.
In their Relu data management plan researchers had to describe:
- the need for access to existing data sources
- data planned to be produced by the research project
- planned quality assurance and back-up procedures for data
- plans for managing and archiving research data
- expected difficulties in making data available for secondary research (through data archiving) and measures to overcome such difficulties
- who holds copyright and Intellectual Property Rights of the data
- data management responsibility roles within the research team
View data management plan template (see Section 3 as DMP combined with Stakeholder and Communications Plan)
Relu-DSS reviewed data management plans, made an assessment and gave feedback and support to the research teams when needed. During the review we considered whether:
- information provided on the data planned to be produced is adequate and realistic according to the proposed research and methodology
- all relevant data management aspects have been considered, with meaningful information provided in the plan
- where difficulties are anticipated to make data available for archiving, possible solutions have been suggested
- all possible obstacles to sharing data have been considered, such as ethical limitations and copyright ownership
- a team member with data management responsibility is in place at each participating institution
For each of the six DMP questions (3.2 to 3.7) we scored 1-3, with
1= Insufficient: severely lacking clarity or detail
2= Adequate, but more information needed
3= Excellent, no more information required
Each initial plan submitted could score a maximum of 18 and a minimum of 6. We did not sign off any plans until we were happy that all had eventually scored 18! In some cases this required iteration and, rarely, some coercion from the Programme Director for them to complete the information.
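The arithmetic behind these totals is simple: six questions, each scored from 1 to 3, give a range of 6 to 18. A minimal sketch in Python, assuming the scores are recorded per question number; the function names and the dictionary layout are illustrative and not part of the Relu-DSS process.

```python
# Question numbers from the DMP template that were scored (3.2 to 3.7).
QUESTIONS = ("3.2", "3.3", "3.4", "3.5", "3.6", "3.7")
MIN_SCORE = 1  # Insufficient
MAX_SCORE = 3  # Excellent


def total_score(scores: dict) -> int:
    """Sum the per-question scores after checking they are complete and in range."""
    if set(scores) != set(QUESTIONS):
        raise ValueError("a score is needed for each of questions 3.2 to 3.7")
    for question, score in scores.items():
        if not MIN_SCORE <= score <= MAX_SCORE:
            raise ValueError(f"score for {question} must be 1, 2 or 3")
    return sum(scores.values())


def ready_for_sign_off(scores: dict) -> bool:
    """A plan was only signed off once every question scored 3, i.e. 18 in total."""
    return total_score(scores) == MAX_SCORE * len(QUESTIONS)
```

A plan scoring 2 on every question would total 12 and go back to the research team for more information.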
We also assessed which existing data sources they stated were to be used (Q 3.1) to check whether researchers were seeking to purchase expensive third party data sources, and assess whether there may be a case for purchase of a programme-wide license.
Our view in administering the whole process is that this system works very well, and that the assessment should be done by someone who is familiar with the kinds of data being produced to see whether any shortcomings in the plan are apparent. For this particular Programme, mid-term checks on progress with DMPs were not done; such checks would be advisable to prevent any problems with data sharing at the end of the award periods.
The data.bris team have compiled a glossary defining some of the terms which we will be using on the data.bris website and institutional policy documents. It is published under a Creative Commons licence and may be found at http://vocab.bris.ac.uk/data/glossary/.
One of the goals we set for our project's two-month extension was to explore the use of DataFlow's DataStage in combination with DSpace (see this post). Now that a beta version of DataStage has been released, and the SWORD2 support has been implemented, we gave it a try and, with some tweaks, we managed to publish a dataset from DataStage on DSpace. This is a short account of the steps and tweaks involved.
I’ve been asked to highlight the benefits of running the Orbital project so far. We’re seven months or so into Orbital, which is an 18 month project. In our initial Project Plan, we identified the anticipated impact of the project and so I’ll use that list in this blog post to reflect on the impact and benefits so far. The headings and text in italics are copied from the Project Plan.
Research practices
Researchers’ data management practices will change, supported by technologies that encourage new processes in the administration and dissemination of data.
We’ve had very little impact in this area so far. It’s too early to impact on researchers’ practices when we’re still developing our own knowledge and the infrastructure to support RDM. Changing researchers’ practices takes time. However, there are indications that it will happen. Engagement with our user groups and ad hoc requests for help from researchers who know we’re working on the project has shown us that researchers do want to change their practices. Our recent DAF survey also told us that researchers know that their practices could be improved and where they need support to do so.
Internal auditing
Greater oversight and analysis of research data created by researchers will be possible.
We’ve had no impact here and won’t until the Orbital software is built. We will be working on this in the next version of Orbital and our recent effort at the MRD Hack Day around activity data was a precursor to this work. We are increasingly seeing activity data as a key component of Orbital and this was underlined early on in the project when Mansur Darlington from the ERIM and REDm-MED projects stressed the importance of capturing contextual metadata.
Research governance
Improved methods of auditing research undertaken by the university will be possible, enabling greater cross-disciplinary work.
This relates to the benefit above around capturing activity data and improving ‘business intelligence’. As yet no real impact, but we still anticipate Orbital being a useful tool for reporting and enabling greater cross-disciplinary collaboration through greater transparency. Related to this is the creation of RDM Policy, which we have begun and which has resulted in a statement being made for the EPSRC RDM ‘Road Map’.
Integrated services
Research data management will be integrated into existing systems, such as staff profiles, the institutional repository, blogs and calendars. Towards a Virtual Research Environment.
We have had a direct impact on the creation of new staff profiles at the university. Nick Jackson, Lead Developer on Orbital has been working with Alex Bilbie in the ICT Online Services Team to create an aggregated profile for staff. We blogged about it earlier and you can see my example. Profiles of staff are now aggregated from different systems and stored as RDF Linked Data and we intend to pull in activity data from Orbital to further enrich staff profiles. In this way, our work is being recognised and valued by other teams in the university. Furthermore, the university has recently procured a new ‘Awards Management System’, which will provide data to Orbital about funded research projects and we intend to couple Orbital with EPrints using SWORD2.
What is clear from our discussions with users is that they expect Orbital to do more than simply store and publish research data. Without using the term, they are effectively asking for a Virtual Research Environment (VRE). This is something we did anticipate, and we have always planned for Orbital to be a tool for both analysing and publishing research data. When discussing ‘Research Data Management’, there is a fine line between DMP planning, research project management, team workspaces and public web publishing, and while we need to be careful that the scope of Orbital does not creep, we are sensitive to on-going user requirements.
FOI compliance
Will make FOI requests easier to respond to or unnecessary.
We’ve had no impact here so far, nor are we yet in a position to.
Open Data
Will promote and enable the production of public data sets.
As I will discuss in a forthcoming blog post about our first release of Orbital, we have had some impact here and have witnessed the benefits. In summary, a researcher contacted us for help with publishing some data, the result of which was an invitation to write a journal article about the data, offers of collaboration and the strengthening of an EU grant application.
Our workshop on open licensing has also led to a further meeting between myself (Joss Winn, Orbital PM), and the university’s IP Manager. A further follow up meeting is planned to draft guidance for staff on the use of open licenses for source code and data. Furthermore, research staff are being directed to the Orbital team for informal advice on open licensing. In this sense, we are beginning to improve the awareness and understanding of open licenses among researchers.
The innovation cycle
Will embed new technologies and culture change among professional staff at the university and lead to further innovation in our services.
We are having some impact here and have had meetings with central ICT staff about integrating our server farm with cloud services. We are currently developing the Orbital software using Rackspace, but have recently ordered hardware, partly paid for by the project, to establish a private cloud, running on OpenStack, for research and development at the university. In addition to this, our development toolchain has changed and we now have tools and processes in place that we did not have six months ago. These are being adopted outside of the Orbital project by staff within the ICT Online Services Team and other projects we are running. In addition to this, we intend that the changes we make to our own R&D tools and processes, are made available to other researchers and students. Over the summer, we will set up and maintain a university-wide Gitorious source code repository service (similar to Github), where staff and students who write code can form teams, manage their source code, and publish it if they wish. We also intend to run a Jenkins server for similar purposes so that all staff and students can benefit from source code control and the quality assurance processes that we have implemented through Orbital. Orbital is now a driver for a general R&D infrastructure for Academic Computing that project members and wider members of LNCD are building.
I will write more about this at a later date because, for someone who manages R&D infrastructure projects at the university and wishes to engage staff and students in our work, I am excited to be able to integrate this into academic programmes and the work of other researchers.
I want to also stress that like all of our projects, the benefits, however slight, spill into other aspects of our work. Being a large project, Orbital has allowed us to concentrate on developing our toolchain and development environment across other projects, it’s given us time to learn new skills and share our learning with colleagues. In this way, it has been pivotal in the way we work and the future direction of our work.
Recruitment
Will build capacity for local development of innovative services
Orbital has allowed us to recruit two full-time Developers (Nick Jackson and Harry Newton). We are therefore two staff up and it is my intention to try and keep it that way.
Staff skills
Will improve staff skills and experience
Yes, we are benefitting in this way. The Orbital project team are now the RDM ‘experts’ in the university and despite being novices in this regard, over the course of the project staff working in the Library, ICT, Research and Enterprise Office and Centre for Educational Research and Development, are each developing their understanding of the processes and implications of RDM.
Clearly, an 18-month project (at least the way I run them!) allows staff to learn new tools and skills, experiment with new methods of working and disseminate this learning to other staff. This is one constant that I value highly about our project work. Despite the stop-and-start nature of project work, and the fact that not all of our work eventually makes it into a fully fledged university service, the tools and learning go beyond the confines of the project, especially as we engage more with academic programmes, and this is most satisfying.
I have written more about my interest in how hackers learn and the university as a hackerspace.
Culture change
Will change the research culture of the university by improving the tools available for managing and sharing data.
From the point-of-view of RDM, this is closely related to the first anticipated impact/benefit. We cannot claim to have any real evidence of benefits or impact at this stage on how we manage research data. However, as I’ve noted several times above, our use of the cloud, our advocacy of open licensing, our implementation of new R&D tools and processes, are also part of ‘culture change’ at the university. Furthermore, due to the DAF survey, the Orbital project is now widely known by researchers beyond our initial user group in the School of Engineering, and through our reporting to the university Research, Innovation and Enterprise Committee, staff at all levels are made aware of our work. Gradually, the idea of ‘research data management’ is being understood.
Technology choices
Will influence future choices in technologies (both locally developed and outsourced).
Yes! See above.
HE sector R&D
Contributes to innovative R&D in the HE sector
Yes, I think we are beginning to do this and the benefits so far are around shared learning among developers across different projects. We were instrumental in early discussions about the DevCSI MRD Hack Day and three of us contributed to the two day event. We blog regularly on this site (around 50 posts so far) and share our work with anyone who is interested (see the links in the sidebar).
Public Sector data management
As yet, the Orbital project can only claim to have resulted in one research dataset being published (again, more on this soon as I want to explain it in more detail). However, Orbital has grown out of our work over the last couple of years around managing and re-using institutional data, resulting in data.lincoln.ac.uk. We are also active members of the data.ac.uk initiative and I chaired the data.ac.uk panel at Dev8D this year.
Efficient re/use of resources
Demonstrably re-uses and builds on previous work, both funded and non-funded projects.
Yes, this was an early benefit of the project. We are building on our previous work and what we have learned from it in past projects. Our use of MongoDB, our work on staff profiles, our use of OAuth, and our API-driven approach to development, all build on past projects, funded and un-funded.
If you are interested in the use of SWORD for the deposit of research data, take a look at the DevCSI report from their recent ‘Managing Research Data Hack Day’ event:
There is an interesting discussion and associated video about using SWORD v2 to deposit files using BitTorrent.
Nick, Harry and I attended the MRD Hack Day, organised by the DevCSI project. You can read about the event and our contributions over on the event report. Lots of videos, presentations and interviews around the technical challenges of managing research data.
This is the second half of my post on the MRD Hackday. You can find the first part here.
In my first post I mentioned that the group I was part of was concerned with the issue of large uploads. Here I’ll explain the specific issue we identified, and our proposed solution. You can find our slides from the final presentation here.
SWORD is a protocol for making and updating deposits via HTTP to a repository. This provides, for example, front end software with a non-proprietary means to work with repositories. SWORD is also used, as I understand it, to move data within Oxford’s Data Flow system: taking research files from a local managed system (Datastage) to a deposit system (Databank).
In data.bris SWORD support hasn’t been a high priority since live research data and the repository live on the same hardware: deposit is just a snapshot or copy. However there are other potential uses for SWORD so I’ve investigated implementing it. What struck me about SWORD was that at heart it’s an HTTP upload to a server, and HTTP uploads have never seemed very reliable from my browser.
At the hack day Steve Welburn was quizzing Tim Brody from EPrints about his issues with large deposits, SWORD and DSpace. From this discussion it became clear that there was an infelicity in the protocol, and we worked on a solution with input from Sander van der Waal (of Dataflow) who completed our group.
A very brief introduction to SWORD
SWORD is a profile of the Atom publishing protocol. Atom’s roots are in blog publishing: how can one publish a blog post via HTTP? Atom’s answer is as follows (HTTP verbs in capitals):
- GET a service document from the server, which describes the publishing service.
- Locate the publishing url in this document.
- POST the blog content, plus metadata (title, author etc) to that url.
- Server will respond (201 created) with a url for the blog post.
- Updates to the post can be PUT to that url, or the post can be DELETEd.
SWORD expands the metadata significantly, and the content won’t be a blog post of course, but otherwise everything remains the same.
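As a rough illustration of steps 1 to 3 above, here is a minimal Python sketch. The service document, the repo.example.org URLs and the function names are all made up for the example, and no request is actually sent: the code only locates the publishing url and prepares the POST. A real SWORD client would add the metadata and packaging headers its repository expects.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Namespace used by Atom publishing protocol service documents.
APP_NS = "http://www.w3.org/2007/app"

# Hypothetical service document, as a server might return it for step 1.
SERVICE_DOCUMENT = """
<service xmlns="http://www.w3.org/2007/app"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <workspace>
    <atom:title>Example repository</atom:title>
    <collection href="http://repo.example.org/collection/data">
      <atom:title>Datasets</atom:title>
    </collection>
  </workspace>
</service>
"""


def find_collection_urls(service_xml):
    """Step 2: return the publishing urls listed in a service document."""
    root = ET.fromstring(service_xml)
    return [c.get("href") for c in root.iter("{%s}collection" % APP_NS)]


def build_deposit_request(collection_url, payload, filename):
    """Step 3: prepare (but do not send) a POST of the content."""
    return urllib.request.Request(
        collection_url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/octet-stream",
            "Content-Disposition": "attachment; filename=%s" % filename,
        },
    )


urls = find_collection_urls(SERVICE_DOCUMENT)
request = build_deposit_request(urls[0], b"example bytes", "data.bin")
```

Sending the request (for example with urllib.request.urlopen) would be step 4, where the server answers 201 with the url of the new deposit.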
The problem
Any network communication is at risk of interruption, from major hardware issues like a cable being severed, to simple forgetfulness when the uploader closes their laptop lid. Large uploads are more prone to interruption simply because they take longer.
HTTP does have a way to continue sending or receiving content using the range header. You may have noticed browser downloads may be continued if incomplete, which works by requesting (via range) the rest of the download. That same header can be used to continue uploads, although that feature is rarely implemented (Tim has worked on this for EPrints).
However as normally used it’s impossible to recover a SWORD upload. Look again at steps 3 and 4 above: the server only returns the url for the deposit once the upload has finished. Without that url we can’t tell the server which content we are attempting to complete.
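To make the range idea concrete, here is a small Python sketch of the headers involved. The download form (Range: bytes=N-) is standard HTTP. The upload form using Content-Range is an assumption about how a server might accept a resumed upload; it is not something every server supports, and the function names are invented for the example.

```python
def download_range_header(bytes_received):
    """Header asking the server for everything after what we already have."""
    if bytes_received < 0:
        raise ValueError("bytes_received must not be negative")
    return {"Range": "bytes=%d-" % bytes_received}


def upload_content_range_header(bytes_sent, total_size):
    """Header describing the remaining part of an interrupted upload.

    Assumes the server understands Content-Range on uploads.
    """
    if not 0 <= bytes_sent < total_size:
        raise ValueError("bytes_sent must be within the file")
    last_byte = total_size - 1
    return {
        "Content-Range": "bytes %d-%d/%d" % (bytes_sent, last_byte, total_size)
    }
```

Either header only helps if the client knows which url to send it to, which is exactly what a SWORD client lacks until the first upload has completed.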
We can make use of SWORD in a more robust (that is, resumable) way by uploading virtually nothing, then uploading the full content later. This works because the deposit is created with a url, ready to be modified. However it’s not very elegant, and we had more radical ideas.
Don’t call me, I’ll call you
Suppose that, rather than sending content, we send a pointer to the content. In that case the repository would be able to pull the bytes itself, and ensure all the bytes arrived safely. It could use HTTP, but it could also use the ftp or even rsync protocols.
However we decided to use BitTorrent:
- Unlike rsync, ftp, or http, there are many peer implementations, with nice GUIs, for a variety of platforms, in a number of languages.
- Handles partial downloading with ease.
- No packaging required: moving directories is as easy as individual files.
To make a transfer each party needs a bittorrent client, and we need a tracker to coordinate the download swarm. BitTorrent works by moving chunks between clients. Some clients have all the content (seeders), some don’t (leechers). Clients talk to the tracker to locate other clients interested in the same download (trackers provide a very simple HTTP api for this purpose).
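The tracker's HTTP api is simple enough to sketch in a few lines of Python. This builds the kind of announce url a client sends to a tracker; the tracker address, peer id and info hash below are placeholders invented for the example, and nothing is sent over the network.

```python
import hashlib
from urllib.parse import urlencode


def announce_url(tracker, info_hash, peer_id, port, left):
    """Build the GET url a client uses to announce itself to a tracker."""
    query = urlencode({
        "info_hash": info_hash,  # 20-byte SHA-1 identifying the torrent
        "peer_id": peer_id,      # 20-byte id chosen by the client
        "port": port,            # port the client listens on
        "uploaded": 0,
        "downloaded": 0,
        "left": left,            # bytes still needed; 0 for a seeder
    })
    return tracker + "?" + query


# Placeholder values, not a real torrent or tracker.
example_hash = hashlib.sha1(b"example info dictionary").digest()
url = announce_url(
    "http://tracker.example.org/announce",
    example_hash,
    b"-XX0001-123456789012",
    6881,
    1024,
)
```

The tracker replies with a list of other clients in the swarm, and the chunks themselves then move directly between clients.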
SWORD deposit with BitTorrent
So what does deposit look like?
- Uploader opens BitTorrent client, and creates a torrent file for a file or directory.
- The tracker used may be the repository itself, or an institutional tracker.
- SWORD deposit is made as usual, but the content is a torrent file.
- Content will be deposited.
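For a feel of what the torrent file in the first step contains, here is a minimal Python sketch that builds a single-file torrent in memory. It is an illustration under assumptions, not what we wrote at the hack day: the tracker url, file name and piece length are invented, and a real client would handle directories and much larger piece sizes.

```python
import hashlib


def bencode(value):
    """Encode ints, strings, bytes, lists and dicts in BitTorrent's bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, str):
        return bencode(value.encode("utf-8"))
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys must appear in sorted order.
        items = sorted((k.encode("utf-8"), v) for k, v in value.items())
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def make_torrent(name, data, announce, piece_length):
    """Return (torrent file bytes, info hash) for a single in-memory file."""
    # One SHA-1 digest per piece lets the downloader verify each chunk.
    pieces = b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )
    info = {
        "name": name,
        "length": len(data),
        "piece length": piece_length,
        "pieces": pieces,
    }
    torrent = bencode({"announce": announce, "info": info})
    info_hash = hashlib.sha1(bencode(info)).digest()
    return torrent, info_hash


torrent, info_hash = make_torrent(
    "dataset.bin", b"x" * 40, "http://tracker.example.org/announce", 16
)
```

The resulting bytes are what would be sent as the content of the SWORD deposit. Because they are tiny compared with the dataset, an interrupted upload is cheap to retry, and the per-piece hashes let the repository check that every chunk arrived intact.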
During the hackday we – and by that I mean Tim and Steve – got some of the way along the implementation. Tim added the facility to spot torrent file uploads and passed them off to transmission-cli and vuze, while Steve began an integrated SWORD torrent uploader. We also learned rather more about BitTorrent than we were planning to.
Conclusions and future work
Firstly this also helps with another issue of large datasets: downloading. Torrent clients work in a swarm, which relieves pressure on the repository if a particular dataset becomes popular.
Secondly (and closely related to the first point) a picture emerges where deposits are held throughout the network. One can imagine deposits being located in institutional and subject repositories, and perhaps others. Replication, redundancy and distributed downloads fall naturally out of torrents.
We have some plans to finish this work, and happily (though I missed this) won some sort of vote for the most promising / finish-able idea, which may provide some means to continue it. I’d like to try adding a ‘make sword deposit’ button to an existing client, and I know Tim would like to finish his work.
Useful meeting today with William Nixon, our ePrints expert, about a new ePrints test server for looking at a datasets plugin (among other things).
All went a bit Mission Impossible, with workers abseiling down the side of the library (which, if you know the library here, is a brave thing to do on such a windy day!)
David