How to Manage Your Research Data?

Empirical research often proceeds through numerous iterations, involving lots of source documents, datasets and other files. Unless you are prepared to do quite a bit of digital housekeeping, burgeoning materials can easily spin out of control, turning research from an intellectual adventure into an administrative nightmare. In this post, I describe a simple system for taking care of raw data and datasets and, finally, for archiving analyses.

My motivation to consolidate a set of personal practices into a system emerged from frustration with increasing digital housekeeping and an aspiration to fully exploit my datasets wherever I work in the future. Yet, the word ‘system’ sounds somewhat bureaucratic and tends to put some people off — I acknowledge that under certain circumstances the system may not be suitable for you. If you know for sure that you are only ever going to do a single research project, ad hoc is probably a better approach for you than any rule-based way of organizing materials.

I wanted a system that is both robust and shamelessly practical. Most importantly, it should be possible to take care of any foreseeable project within the same framework. The system should also be relatively agnostic to philosophical, methodological and ethical questions related to research data. This is not to say those matters are unimportant, just that it would be nice to have a system that can digest all kinds of empirical investigations. These are admittedly ambitious aims that could easily lead to complex rules and lots of meta-work, which, in turn, would defeat the purpose of my system.

The solution is to approach data management as file management. Whether you work within positivist, realist or interpretivist methodology, your empirical research is going to involve lots of files. The filing system should make it easy to work with multiple collaborators and large datasets consisting of many different kinds of files and observations; materials should be stored so that they make analyses replicable, and, finally, the system should be independent of organizational support and free of technological lock-ins. Overall, digital housekeeping should take as little time as possible away from intellectual work, maximize data reusability and help construct a chain of evidence from the results back to the data.

But isn’t there an app for that…

Complex technologies such as version control systems can be indispensable for a research project, but they are not a substitute for mastering the research workflow. This is not to say that data management does not depend on some technological foundations. Indeed, those foundations must be carefully chosen to be as future-proof as possible.

The workflow of empirical research

Figure 1. Schematic representation of research workflow

A linear, waterfall-like description of empirical research is, of course, a gross simplification and cannot capture the multiplicity of practical research experience. In reality, research may stumble forward through dead-ends, endless iterations and sidetracks until its results suddenly start to crystallize. It is important to understand that Figure 1 is not an attempt to summarize how empirical investigations (should) proceed in practice. Instead, it depicts schematic steps that can help manage materials throughout the investigation by linking them always to preceding operations.

The point of thinking about research as if it were a linear process is that it helps to link together the specific operations that, in the end, produced the results. A sceptical colleague who wants to assess or replicate the process does not want to repeat all the iterations and dead ends that you as the original investigator had to take. Rather, he or she wants to see clear steps connecting the results to the data. The chain of effective research operations can be constructed by linking each research operation backward to its predecessor(s) and finally backtracking the chain from the results to the data. This is best done as you stumble forward in your research. Trying to reconstruct the chain once you have arrived at the results is more laborious and easily misses important details.

MIDAS system

Rule 1: Organize all materials into packages that are labelled with stable unique identifiers.

MIDAS depends on basic features available in all common filesystems, which makes it free from technological lock-ins. There are merely three more rules that specify how to use packages and identifiers in more detail. The rest can be adapted to your own circumstances.

Package

The contents of individual packages can be organized to suit the type of study; it would be very difficult to design generally applicable rules for the myriad of materials and types of studies different researchers engage with during their careers. However, you must strive to make packages as self-explanatory as possible. This means that each package must provide enough information on how its contents were created, and connect to previous research operations using package identifiers. The metadata should be understandable to the sceptical colleague or, at minimum, to yourself years after you completed the research. Once a package has been created, any change to its contents may invalidate an inbound reference from a later package.

Rule 2: Do not change package contents (unless you know that it does not invalidate any incoming references).
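
To make the metadata idea concrete, here is a minimal sketch of recording provenance inside a package as a plain text note. The helper name, the file name NOTES.txt and the example identifiers are my own choices for illustration; MIDAS itself only requires that the metadata links back to earlier packages.

```python
# Minimal sketch: record provenance inside a package as a plain text file.
# The file name NOTES.txt and the helper are illustrative, not part of MIDAS.
from pathlib import Path


def write_package_notes(package: Path, description: str, sources: list[str]) -> None:
    """Write a small note describing how the package was created."""
    derived = ", ".join(sources) if sources else "(external raw data)"
    text = (
        f"Package: {package.name}\n"
        f"Description: {description}\n"
        f"Derived from: {derived}\n"
    )
    (package / "NOTES.txt").write_text(text, encoding="utf-8")


# Example with hypothetical identifiers:
# write_package_notes(
#     Path("datasets/20160301-interview-codes"),
#     "Thematic codes assigned to interview transcripts",
#     ["20160115-interview-transcripts"],
# )
```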

Text files are the best way to store material. Note that ‘text file’ is an umbrella term that covers a broad variety of file formats such as plain text, eXtensible Markup Language (XML), Comma-Separated Values (CSV), HyperText Markup Language (HTML), Rich Text Format (RTF), Structured Query Language (SQL), JavaScript Object Notation (JSON), TeX, and many others. What is common to all of these is that if you open them in a text editor, their contents are more or less human-readable. Some types of data, such as images and audio, cannot be stored as text files. For such files, it is recommended to use open, standard file formats that are widely supported by many different applications. For instance, using PDF (ISO 32000-1:2008) and JPEG (ISO/IEC 10918) for images should be fairly safe options.

Identifier

Rule 3: Do not give the same name to different packages.
Rule 4: Do not change an identifier once it has been created.

I have a habit of naming my packages yyyymmdd-name, such as ‘20150707-enwiki-dump’. The date in front (note the leading zeros) makes it easy to ensure that the identifier is unique. It also provides useful information about the package and enables sorting packages chronologically in case filesystem timestamps get accidentally updated. Using dashes instead of spaces makes the identifier more compatible with URLs, and dashes also help the package stand out in prose. Compare “we used enwiki dump to…” with “we used 20150707-enwiki-dump to…”. In the former case, it is not clear whether enwiki dump is a specific set of files or refers to such dumps in general, whereas 20150707-enwiki-dump points clearly to a specific set of files.
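
As an illustration, here is a minimal sketch of creating a package directory following that naming convention. The helper name and the base directory argument are hypothetical; only the identifier format (date prefix with leading zeros, dashes instead of spaces) comes from the rules above.

```python
# Minimal sketch: create a package directory named yyyymmdd-name.
# The helper name and arguments are illustrative; only the identifier
# format follows the convention described above.
from datetime import date
from pathlib import Path


def new_package(base_dir: str, name: str) -> Path:
    """Create a new package directory and return its path."""
    slug = name.strip().lower().replace(" ", "-")
    identifier = f"{date.today():%Y%m%d}-{slug}"
    package = Path(base_dir) / identifier
    # Rules 3 and 4: the identifier must be unique and must not change,
    # so refuse to reuse an existing directory.
    package.mkdir(parents=True, exist_ok=False)
    return package


# Example: new_package("rawdata", "enwiki dump") creates something like
# rawdata/20150707-enwiki-dump (with today's date in front).
```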

Filesystem structure

Figure 2. Filesystem structure. Note that I have not changed old package names (identifiers) containing spaces and underscores, despite recently opting for dashes. The rawdata directory contains three Wikipedia database dumps whose names already include the date, so I decided not to prepend it again.

rawdata contains raw data and datasets from external sources. The rationale is to put here material that I cannot reproduce from its source. For example, in my Wikipedia research I store database dumps that I have downloaded from the Wikimedia Foundation servers in this directory. If I were to do some interviews, I would store the audio recordings here.

datasets contains datasets that have been processed from external sources, from packages in the rawdata directory, and from other packages in the datasets directory. For example, interview transcripts processed from audio recordings in rawdata would be stored here. If I then further code the interview transcripts, the resulting dataset would also be stored here. Package identifiers make it easy to trace data processing steps back to earlier packages, raw data and external observations (note that the step “Processing data into research datasets” in Figure 1 is iterative).
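
To show what tracing back can look like in practice, here is a minimal sketch that follows the “Derived from:” lines of the hypothetical NOTES.txt format from the earlier sketch. The function name, the note format and the example identifiers are all assumptions of mine, not part of MIDAS.

```python
# Minimal sketch: follow "Derived from:" lines in each package's NOTES.txt
# (the hypothetical note format from the earlier sketch) to trace a dataset
# back towards the raw data.
from pathlib import Path


def trace_back(identifier: str, roots=("datasets", "rawdata")) -> list[str]:
    """Return the chain of package identifiers behind a package."""
    chain = [identifier]
    for root in roots:
        notes = Path(root) / identifier / "NOTES.txt"
        if not notes.exists():
            continue
        for line in notes.read_text(encoding="utf-8").splitlines():
            if line.startswith("Derived from:") and "(external raw data)" not in line:
                for source in line.removeprefix("Derived from:").split(","):
                    chain += trace_back(source.strip(), roots)
    return chain


# Example with hypothetical identifiers:
# trace_back("20160301-interview-codes")
# -> ["20160301-interview-codes", "20160115-interview-transcripts", ...]
```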

anarchive is abbreviated from ‘analysis archive’. My practice is to archive analyses that underpin published research and identify the archival package with the corresponding publication. This way the archive maps directly to claims that I have made in public and may have to defend in the future. The archive package should allow replicating the analysis by storing all relevant materials and by adding necessary metadata. You can copy and link to other packages, write notes about the research process, describe the environment in which research operations took place — whatever helps understand and trace back the steps that were taken to achieve the results.
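
As a sketch of how the three directories work together, the snippet below archives an analysis under anarchive and records which packages it drew on. The directory names rawdata, datasets and anarchive come from the structure above; the helper name, the publication label, the file names and the example paths are hypothetical.

```python
# Minimal sketch: archive an analysis under anarchive/ together with a note
# listing the packages it used. Paths and identifiers are illustrative only.
import shutil
from datetime import date
from pathlib import Path


def archive_analysis(publication: str, used_packages: list[str], analysis_dir: str) -> Path:
    """Create an anarchive package tied to a publication."""
    archive = Path("anarchive") / f"{date.today():%Y%m%d}-{publication}"
    # Copy the scripts and outputs that produced the published results.
    shutil.copytree(analysis_dir, archive / "analysis")
    # Record which rawdata/datasets packages the analysis drew on.
    notes = [f"Publication: {publication}", "Packages used:"]
    notes += [f"  - {p}" for p in used_packages]
    (archive / "NOTES.txt").write_text("\n".join(notes) + "\n", encoding="utf-8")
    return archive


# Example with hypothetical names:
# archive_analysis(
#     "wikipedia-coordination-paper",
#     ["20150707-enwiki-dump", "20160301-interview-codes"],
#     "work/wikipedia-analysis",
# )
```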

Limitations

The system is also not completely free from philosophical ideals. Most obviously, I assume that replicability is worth pursuing in research. This is probably not too controversial and, I would argue, a relatively harmless assumption. Perfect replicability is in any case merely an ideal, since it is impossible to store the entire research environment inside a package and thus enable replicability in practice under all circumstances. Nevertheless, an archival package should allow replicability in principle by describing how the analysis could be replicated if the environment were available.

Finally, the system does not say anything about information security and privacy. You need to protect your data processing and storage environment from unauthorized access and malfunctions up to an appropriate standard. However, the appropriate standard depends on so many things that I felt unable to give any guidance on the topic. There are lots of generic information security materials that provide a good starting point for your individual assessment.

There are probably many other limitations, such as the fact that the system is strictly limited to the part of the research process that involves actual empirical data. It says nothing about the design and planning of empirical research operations. Let me stress that the objective of MIDAS is to provide only core principles that hopefully allow an individual researcher to develop an effective data management practice.

Final thoughts

I would like to thank Attila Marton from Copenhagen Business School and Niccoló Tempini from University of Exeter for their helpful comments on a draft version of this post.

If you have a better system, please let me know!

