The way you manage your data during analysis depends entirely on the type of data you’re using and what you’re doing with it. There are, however, several strategies you can adopt to avoid disaster, save time, and improve your ability to make sense of your work later on.
Keep your data secure.
Save your raw data. It is vitally important to maintain a copy of your data in its rawest, least processed form. This allows you to start over if something goes wrong, or to re-analyze the same dataset to test different variables or protocols.
- Consider saving snapshots of your data at a number of different stages (e.g., raw, cleaned up, subsetted).
- Distinguish between these datasets in the file names and/or documentation.
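These stage labels can be encoded directly in file names so that raw, cleaned, and subsetted copies are never confused. A minimal Python sketch of that convention (the file names and `snapshot` helper are illustrative, not a standard tool):

```python
from pathlib import Path
import shutil

def snapshot(src: Path, stage: str) -> Path:
    """Copy a dataset to a stage-labeled file, e.g. survey_cleaned.csv."""
    dest = src.with_name(f"{src.stem}_{stage}{src.suffix}")
    shutil.copy2(src, dest)  # copy2 also preserves file timestamps
    return dest

# Example: keep the raw file untouched and do all work on labeled copies
raw = Path("survey_raw.csv")
raw.write_text("id,score\n1,42\n2,37\n")   # stand-in for real raw data
cleaned = snapshot(raw, "cleaned")          # -> survey_raw_cleaned.csv
```

The raw file is only ever read, never overwritten; every processing stage gets its own distinctly named copy.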
Control your versions. You can track file versions with consistently applied naming conventions. In projects that involve code or software development, frequent edits, or multiple contributors, consider using a dedicated version control system. Git is a popular choice, but your research community or lab may have a preferred environment.
Back things up. Proper storage and backup strategies are key to preventing catastrophic data loss due to things like hardware failure, natural disaster, computer viruses, or theft. Maintaining working copies of your data requires thoughtful consideration of hardware, redundant storage locations, and a disaster plan.
- LOCKSS (“lots of copies keep stuff safe”) is a helpful motto to remember. The more copies of your data, the better...as long as they’re not all in the same place.
- Use the 3-2-1 backup rule as a rule of thumb: 3 copies, on 2 different types of storage media, 1 off-site.
- Test your system frequently to make sure it’s working.
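Testing can be as simple as confirming that each backup copy still matches the working copy, for example by comparing checksums. A hedged Python sketch (the file paths are stand-ins for real backup locations):

```python
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash a file in chunks so large datasets don't exhaust memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_stale_backups(working: Path, backups: list[Path]) -> list[Path]:
    """Return the backup copies whose contents no longer match the original."""
    expected = sha256(working)
    return [b for b in backups if sha256(b) != expected]

# Example with throwaway files standing in for real backup locations
working = Path("results.csv"); working.write_bytes(b"a,b\n1,2\n")
good = Path("backup1.csv");    good.write_bytes(b"a,b\n1,2\n")
bad = Path("backup2.csv");     bad.write_bytes(b"a,b\n9,9\n")
stale = find_stale_backups(working, [good, bad])
```

Running a check like this on a schedule catches silent corruption or failed syncs before you actually need the backup.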
Document your steps.
Whether for your future self or other researchers, it is crucial that you describe the process of your analysis. This can mean taking good notes, saving log files, or capturing your every step in an electronic lab notebook. Be sure to keep a copy together with any data or code you produce so that you can follow your trail later on.
- Scan paper notebooks, especially if they contain sketches or annotations that may not be captured by transcription.
- Include any pre-processing or data-cleaning steps to ensure reproducibility.
- Electronic Lab Notebooks (ELNs) can help you automate the process.
Tools and Resources
- The Texas Advanced Computing Center (TACC) offers resources and expertise in high performance computing (HPC), visualization, data analysis, and cloud computing.
- Information Technology Services (ITS) offers robust, mature, and secure data management resources and services including common good services, data storage, data security, network access, virtual machine hosting, and information security.
- The Data Lab at UT Libraries hosts 15 iMacs with a wide array of data analysis, visualization, and processing software. It is open to all UT staff, faculty, and students on a first-come, first-served basis.
- The Department of Statistics and Data Sciences offers free statistical consulting services, organizes Software Short Courses, and provides software packages for use through terminal server technology.
- Qualtrics, the preferred tool for campus surveys, is available for use by faculty, staff, and students and is approved for most Confidential data (including HIPAA, FERPA, and IRB).
- This database of software tools from DataONE provides useful descriptions and recommendations (with links to resources) for a wide variety of tools.