Data Request Tips
By: Vincent Brandon, Data Coordinator
March 4, 2020
Data collection is one of the most challenging components of research. While literature and reports are readily available online, the raw data needed to investigate a specific hypothesis is often sensitive and difficult to procure. The UDRC makes collecting and matching data across disparate sensitive datasets much more accessible to academic and citizen researchers. To get the most out of our services, follow these tips:
- Start your literature review
- Strip down your hypothesis to the maximum viable population
- Know what ‘Long’ vs ‘Wide’ data means
- There is no One Report to Rule Them All
The UDRC does not require a comprehensive literature review, but you should read enough to start pinning down your methods and visuals. Requesting a broad dataset is a sure way to run up against obfuscation requirements and longer turnaround times. Browse a few papers or blog posts with graphs and consider which data points you need to begin answering your question.
Stripping your hypothesis down to the maximum viable population sounds counterintuitive, but it is important. Unless you have reason to believe that an effect is present in one subset of the population but not the rest, we recommend avoiding small subsets. Obfuscation applies to any identifying fields, including but not limited to demographics. It can also help to bin similar groups together based on regional context. Doing so can greatly improve both the quality of the data and the speed at which you receive it. If your findings warrant further analysis once you have the data, put in another request. You will have a better idea of precisely what you need to enrich your research, and we will be able to work from your prior request to add the granularity you need for a deep dive.
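To see why binning helps, you can count group sizes before and after combining groups. The sketch below is illustrative only: the county names, the region mapping, the synthetic records, and the `MIN_GROUP_SIZE` threshold are all assumptions made for the example, not UDRC rules.

```python
from collections import Counter

# Hypothetical mapping from county to a broader region; real groupings
# should come from your own regional context.
REGION_OF = {
    "County A": "North",
    "County B": "North",
    "County C": "South",
    "County D": "South",
}

# Synthetic records (one county per person), not real data.
records = (
    ["County A"] * 40 + ["County B"] * 3 + ["County C"] * 25 + ["County D"] * 2
)

# Assumed threshold for illustration only; it is not a UDRC rule.
MIN_GROUP_SIZE = 10

# Count people per county, then per region after binning.
by_county = Counter(records)
by_region = Counter(REGION_OF[county] for county in records)

# Groups below the threshold are the ones most likely to be obfuscated.
small_counties = sorted(c for c, n in by_county.items() if n < MIN_GROUP_SIZE)
small_regions = sorted(r for r, n in by_region.items() if n < MIN_GROUP_SIZE)

print("Small groups by county:", small_counties)
print("Small groups by region:", small_regions)
```

In this made-up example, two counties fall below the assumed threshold, while no region does once the counties are binned.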
Do not create extra work for yourself by gluing columns together in a wide, shallow report. In wide data, each subject has a single row with one column per measurement; in long data, each measurement gets its own row. We can create synthetic data tables for you (CSV files) in real time. Take advantage of them to nail down the structure your analysis needs. You can load them into your favorite statistics package, your language of choice, or a spreadsheet program and build mockups. This is also a good time to note whether some data needs to be delivered in a different format.
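As a rough illustration of the difference, the sketch below reshapes a small wide table into long form using only the Python standard library. The table contents, the column names (`student_id`, `wage_2017`, and so on), and the helper `wide_to_long` are hypothetical; they do not reflect the layout of any actual UDRC table.

```python
import csv
import io

# Hypothetical synthetic table in wide form: one row per student,
# one column per year. All names and values are made up.
WIDE_CSV = """student_id,wage_2017,wage_2018,wage_2019
s1,21000,23500,25000
s2,18000,,19500
"""


def wide_to_long(rows, id_field, variable_name="variable", value_name="value"):
    """Turn every non-id column of each wide row into its own long row."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            long_rows.append(
                {id_field: row[id_field], variable_name: column, value_name: value}
            )
    return long_rows


# Read the wide table, then reshape it so each measurement has its own row.
wide_rows = list(csv.DictReader(io.StringIO(WIDE_CSV)))
long_rows = wide_to_long(wide_rows, "student_id", value_name="wage")

for long_row in long_rows:
    print(long_row)
```

Missing values stay as empty strings in the long table, which makes gaps easy to spot in a mockup before you settle on a format for the real request.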
Different questions, techniques, and programs use data differently. You want just the right data in just the right format for the analysis you are going to run. Beyond loading constraints, carrying extraneous data through an analysis is an easy way to find out how sensitive and buggy statistical packages and programming languages can be. Mockups with synthetic data go a long way toward making sure you aren’t waiting for data that needs significant cleaning and filtering before use. If you need more than one table, please ask.
We look forward to helping you with your data request!