Best Practices for Managing a Statistical Analysis Project
I work at a university doing public health research. I'm a hired gun; when the lead researcher needs to have some analyses run for her paper, she gets in touch with me. She lays out the analyses/tables she wants me to run and I run them. That's when the problems start.
If you try to read up on managing files when doing a data analysis project, the results are thin. You tend to wind up at something Hadley Wickham has written. Here is a short list of other information I've found that may be useful:
- https://stackoverflow.com/questions/1429907/workflow-for-statistical-analysis-and-report-writing
- https://www.r-statistics.com/2010/09/managing-a-statistical-analysis-project-guidelines-and-best-practices/
- https://swcarpentry.github.io/r-novice-inflammation/06-best-practices-R/
They all seem to focus on the plight of the single data analyst -- the person doing a self-contained analysis on their own machine for a client who will receive a finalized report. The folder structure they recommend looks something like this:
- Main Project Folder
  - data (the folder where your analysis data lives)
  - src (the location of all your R code)
    - load.R
    - clean.R
    - functions.R
    - work.R
  - graphs (where you cram all your beautiful graphs)
  - tables (where you cram all your beautiful tables)
You read this when you are looking for answers and it all seems so clean and straightforward. You think "Why did I not think of this? My life is going to be so much better from here on out!" Then reality hits and you're mired back in file managing hell. See, what the above structure fails to account for is some supervisor/boss/PI asking you to tweak the results in multiple different ways. It assumes you've completed the work to your satisfaction and that your satisfaction is good enough. Assuming your job was to run a logistic regression, you probably fit multiple models, looked at all of them, and selected the model you think fits the data best. Easy peasy. The final model you select makes it into the work.R file and you can recreate the analysis in the future.
When you work for someone else, the reality is a little different. And, when you work in a large organization, things are even more different. Here is a short list of the myriad ways in which the above structure can be insufficient:
- Your boss wants you to run a logistic regression multiple ways, and you have no way of knowing which model they will ultimately want to use. In fact, they may pick a certain model and then change their mind later for any number of reasons (a reviewer wants a covariate added to the model, new research suggests moving in a different direction, etc.).
- Your boss wants to change the bar chart you created... but then on second thought, they preferred the original... but then on third thought, they actually preferred the edited version with just a few tweaks.
- And the worst horror: Someone in charge of the files you link to for all of your analyses has just decided to reorganize their folder structure without telling you, breaking all of your scripts.
Here's the thing. I think it's super important to avoid naming files things like "final_barchart_v2_cj_new.pdf". This helps nobody and is always super-confusing. You even hate yourself as you are working on the file. But given the reality of an indecisive supervisor, what do you do? You clearly have to hold on to multiple versions of the same work. (Believe me, going back through hundreds of commented-out lines of code trying to find the ones that generate the specific numbers in a random-ass version of a spreadsheet your boss just sent you is no fun.) But you also can't just start naming files things like "barchart_new". Here's the solution I propose:
- ~
  - src (files to be sourced/used by all projects)
    - paths.R (paths to the datasets you will use in analysis)
  - projects (all your projects)
    - project1
      - src
        - load.R
        - clean.R
        - functions.R
      - data (any data specific to project1)
      - tasks
        - regression model (git repo)
          - reg_model.R
          - reg_model.csv
        - barchart (git repo)
          - bargraph.R
          - bargraph.pdf
Hopefully that's all pretty clear, but I'll talk about it as well. First, you need a place to store universal R code. Specifically for me, it was an R file that provided a place to make a single edit to the location of a dataset and have that dataset load correctly. So an example of the contents of ~/src/paths.R would be:
population.data <- "/Users/Bob/work/data/population.csv"
This way, when Bob moves his files one night when he's drunk and bored, it doesn't affect you much. You would just edit your ~/src/paths.R file, source it whenever you needed to load that data file, and all the code that uses that file would continue to work. Within each project, I think it's reasonable to have a single load.R, clean.R, and functions.R file without needing multiple versions of them. There is a data folder for storing data files that will only be used for that specific project. The next part is a deviation from the standard advice, though.
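To make the paths.R idea concrete, here is a minimal sketch of how a project's load.R might use it. Everything except the `population.data` variable is an assumption for illustration: a temporary directory stands in for the home directory so the example runs anywhere, and the tiny dataset is made up.

```r
# Sketch only: a temp directory stands in for "~" so this runs on any machine.
# In real use the file would live at ~/src/paths.R.
home <- file.path(tempdir(), "home_sketch")
dir.create(file.path(home, "src"), recursive = TRUE, showWarnings = FALSE)

# Stand-in for Bob's dataset (made-up values, for illustration only).
data_file <- file.path(home, "population.csv")
write.csv(data.frame(region = c("A", "B"), n = c(10, 20)),
          data_file, row.names = FALSE)

# paths.R holds one variable per dataset; it is the only place a path is typed.
paths_file <- file.path(home, "src", "paths.R")
writeLines(sprintf("population.data <- %s", deparse(data_file)), paths_file)

# What a project's load.R would do: source the paths, then read by variable name.
source(paths_file)
stopifnot(file.exists(population.data))  # fail early if the file has moved
population <- read.csv(population.data)
```

The `stopifnot()` check is optional, but it turns a moved file into an immediate, obvious error at load time instead of a confusing failure halfway through an analysis.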
I've never been a fan of having code live in one folder and output live in another folder. It just makes it that much harder to figure out what code produces what output. I like the code and the output to live in the same folder and, ideally, to have the same name. So I came up with the idea of having a tasks folder, with each discrete task having its own subfolder. These subfolders are under version control, so I can have multiple versions of the same basic idea. It's easy in git to label exactly what is different about each version and include things like the dates and times of the emails a particular version was created in response to. However, you don't necessarily need version control here. If you don't know git, you could just have a well-curated file naming system in each folder, backed up with readme.docx files (e.g., model_1.R, model_2.R, and model_3.R, along with a readme.docx file detailing exactly why and when each version was created).
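One way to keep code and output together under the same name is to derive the output path from the script's own path. The sketch below is an illustration, not a prescribed method: `output_for()` is a hypothetical helper, a temporary directory stands in for the tasks folder, and the model fitted on R's built-in `mtcars` data is only a placeholder for the real analysis.

```r
# Hypothetical helper: derive an output path from a script path, so that
# reg_model.R always writes reg_model.csv (or .pdf) in the same folder.
output_for <- function(script_path, ext) {
  base <- tools::file_path_sans_ext(basename(script_path))
  file.path(dirname(script_path), paste0(base, ".", ext))
}

# A temp directory stands in for projects/project1/tasks so this runs anywhere.
task_dir <- file.path(tempdir(), "tasks", "regression model")
dir.create(task_dir, recursive = TRUE, showWarnings = FALSE)
script_path <- file.path(task_dir, "reg_model.R")

# Placeholder analysis on a built-in dataset; the real model comes from the project.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
results <- as.data.frame(coef(summary(fit)))

# The coefficient table lands next to the script, under the same base name.
out_file <- output_for(script_path, "csv")
write.csv(results, out_file)
```

Because the output name is computed rather than typed, renaming the script renames its output on the next run, and the two stay paired in the task's git history.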