Seeking

Once you’ve gotten your tools set up and have set the stage for your project, the next core task will be to acquire the data you will be analyzing. These data can come from many places – from measuring botanical diversity across a tallgrass prairie with a field campaign, to measuring the health and density of vegetation across a mountain range using the Normalized Difference Vegetation Index (NDVI) derived from LANDSAT satellite data, to quantifying the makeup of microbial communities in the soil of an agricultural field with DNA sequencing.

Some of the data sets you will work with will be rather small, in relative terms. They will easily fit on the hard drive of your personal computer and likely also in its memory, which means that working with them will likely be fast and less technically challenging. We’ll start by addressing best practices for working with this kind of data in 6  Importing Data I: Local Computing.
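As a sketch of what this looks like in practice, the snippet below loads a small tabular dataset with pandas and checks how much memory it actually occupies once loaded. The file contents and column names here are hypothetical stand-ins for a real field-survey file:

```python
import io
import pandas as pd

# A tiny stand-in for a local field-survey CSV (hypothetical columns);
# in practice you would point pd.read_csv at a file path instead.
csv_text = """plot_id,species,count
1,Andropogon gerardii,12
1,Sorghastrum nutans,7
2,Andropogon gerardii,9
"""

df = pd.read_csv(io.StringIO(csv_text))

# A quick check of how much memory the loaded table occupies:
mem_bytes = df.memory_usage(deep=True).sum()
print(f"{len(df)} rows, about {mem_bytes} bytes in memory")
```

When the whole table fits comfortably in memory like this, most analysis operations will be fast and you rarely need to think about data size at all.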

Other datasets that you may want to analyze are much larger. These data sets may come from instruments you have deployed in the field or from a lab experiment, but often in environmental data science, the data sets that you work with are pulled together from online sources and repositories that host the information. Many of these online sources make their data available through something called an application programming interface (API). Basically, this is a structured way for one computer to talk to another. The idea is that these sites have been set up in such a way that they can be queried using a specific syntax and will then provide data that matches the request. Knowing how to interact with these APIs to assemble the various pieces of your data analysis project is a key focus in 7  Gathering structured data from across the internet.
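At its core, an API request is just a URL built from an endpoint plus a query string encoding your request. The example below sketches that pattern using a hypothetical endpoint and parameter names (no real service is assumed):

```python
from urllib.parse import urlencode

# Hypothetical API endpoint and query parameters -- these illustrate the
# pattern, not any real data service.
base_url = "https://api.example.org/v1/observations"
params = {
    "station": "PRAIRIE-01",
    "start": "2023-06-01",
    "end": "2023-06-30",
    "format": "json",
}

# The structured query is just the endpoint plus an encoded query string.
query_url = f"{base_url}?{urlencode(params)}"
print(query_url)
# In practice you would then fetch this URL, e.g. with
# requests.get(query_url).json(), and parse the response.
```

Real APIs differ in their parameter names, authentication requirements, and response formats, but this build-a-query, fetch, parse cycle is the common core.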

We’ve all heard “big data” used as a buzzword, but what does it really mean in the case of your scientific data analysis? It turns out that what constitutes “big” is relative to the computational resources you have available for the project. The amount of data available both from modern instruments and from internet sources means that for many projects, the amount of data you would like to use for your analysis exceeds the capacity of the computer you are using. This is particularly the case if you are doing most of your analysis on your personal laptop. 8  Importing Data III: When Your Data Appetite is Bigger than Your Laptop focuses on these scenarios. It starts by discussing the characteristics of a computer that make it more or less suited to different types of data analysis projects. It then goes through the different options for using remote cloud computing resources to process data sets that exceed the capacity of your local machine.
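One common way to work with a file that won’t fit in memory, even before reaching for cloud resources, is to process it in chunks so only one piece is resident at a time. The sketch below simulates this with an in-memory CSV standing in for a large file; the column name is illustrative:

```python
import io
import pandas as pd

# Stand-in for a file too large to load at once (contents are hypothetical).
big_csv = "value\n" + "\n".join(str(i) for i in range(10_000))

# Read and process the data 1,000 rows at a time, accumulating a running
# total, so only one chunk occupies memory at any moment.
total = 0
for chunk in pd.read_csv(io.StringIO(big_csv), chunksize=1_000):
    total += chunk["value"].sum()

print(total)  # same answer as loading everything at once
```

Many analyses (sums, counts, filters, groupwise aggregations) can be rephrased in this streaming style, which often defers the need for a bigger machine entirely.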

However, before you download a fresh dataset onto the computer you’re going to use for analysis, you’re going to need to make some decisions about how those data will be analyzed, processed, and potentially shared. It’s rather important to make these kinds of decisions early on in your project, because some data sets need to be handled in a way that respects their protected characteristics. Think, for example, about health data or data from tribal governments, both of which need to be handled with appropriate care. In the realm of environmental data, sometimes things like the GPS coordinates of a remnant population of a highly endangered species need to be thoughtfully managed as well. If your data do not have these types of restrictions, however, the recent movement toward open data in scientific research and publishing means that the expectation is that the datasets you work with will be shared upon publication of the related research, if not earlier. In order to make the shared data usable to others, providing appropriate documentation and metadata is key. In 9  Tearing down data walls, we go over the walls that can keep data locked away, and how to embrace open science principles to maximize the ability of others to build on your work.
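One lightweight way to provide that documentation is a machine-readable metadata “sidecar” file that travels alongside the data. The fields below are purely illustrative; real projects often follow an established standard such as Ecological Metadata Language (EML) or schema.org/Dataset:

```python
import json

# A minimal, hypothetical metadata record describing a shared dataset.
metadata = {
    "title": "Tallgrass prairie plant survey",
    "creator": "Your Name",
    "date_collected": "2023-06-15",
    "license": "CC-BY-4.0",
    "variables": {
        "plot_id": "integer plot identifier",
        "species": "scientific name of the plant observed",
        "count": "number of individuals observed in the plot",
    },
}

# Serialize to JSON so the record can sit next to the data file
# (e.g. survey.csv alongside survey.metadata.json).
sidecar = json.dumps(metadata, indent=2)
print(sidecar)
```

Even this small amount of structure (what the columns mean, who collected the data, and under what license it may be reused) goes a long way toward making a shared dataset genuinely reusable.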

Finally, we end this section in Chapters 10  Data Science Workflows II: Managing your computational environment and 11  Environments for Data Science II: Supplemental languages for Environmental Data Science by talking about how to track and manage your computational environment as an important component of the inputs to your analysis. Being able to re-run the code you write and get the same output on a different computer is not a given. Different operating system versions, different programming languages, and different software package versions can all make it difficult to re-run something that initially worked smoothly. This is certainly true if you are sharing your work upon publication and want to make it easier for reviewers or collaborators to understand how and what you did, but it is also perhaps most true for your own most important collaborator – yourself. We can’t tell you how many times we’ve set aside a project, only to come back to it six months or several years later and spend days getting the code functional again because the computational environment wasn’t recorded.
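A minimal first step toward recording that environment is to capture the language version and platform your analysis ran on, as sketched below; fuller solutions (discussed in those chapters) record every package version as well, for example via `pip freeze` or a conda environment file:

```python
import platform
import sys

# Record the basics of the computational environment so a future you
# (or a collaborator) can see what the analysis originally ran on.
env_record = {
    "python_version": sys.version.split()[0],
    "platform": platform.platform(),
}
print(env_record)
```

Saving a record like this with each project costs seconds now and can save days of debugging when you return to the code months later.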

By the end of this section, you should have a grasp of the major ways to work with data sets of any size, either locally or in the cloud. You’ll also understand new ways in which you can ensure your data analysis efforts are reusable and reproducible, and some best practices for handling your code and data while you’re working with them so as to facilitate sharing and archiving throughout your project.