8 Importing Data III: When Your Data Appetite is Bigger than Your Laptop
We discuss ways to work with datasets that are too big to store in memory or to analyze on a local computer. We introduce working with remote public data from governmental agencies (e.g. NASA and NOAA) on remote infrastructure like AWS or Google Earth Engine, and tools like ssh, JupyterLab, and RStudio Server.
Sometimes the data sets that you want to work with are too big for your laptop or local computer. Your computational requirements for a particular analysis could exceed the:
- capacity and/or read/write speed of the hard disk
- amount of RAM for in-memory analyses
- speed of the CPU (and/or the number of cores for parallelizable analyses)
- presence and capabilities of a GPU (particularly relevant in the age of AI)
- network bandwidth to download and/or upload data
We’ll talk briefly about each of these constraints in turn and then introduce some alternatives to using local computing for your environmental data science projects.
What qualifies as “big data” depends on many factors. An analysis of every Landsat image ever taken would be considered big by anyone’s estimation, but big data could also simply be any analysis or data set that is too large to work with effectively on your own personal computer. If you are a graduate student who has just received an email from your local genomic sequencing core with a link to download the results for chapter 1 of your dissertation project, or a postdoc who has just developed code to model eddy flux measurements across dozens or hundreds of flux tower sites, this chapter is for you.
The first constraint to think carefully about is the amount of disk space available on your machine. While you’re at it, it also makes sense to consider the speed of your hard disk. Both factors influence the size of data set you can comfortably work with on your computer. At the most basic level, you need to be able to download and store the data you want to work with. It is also important to consider that many data sets are compressed when you download them, and may need to be uncompressed for parsing and analysis. This is not always the case: some tools can decompress data on the fly during analysis, which saves disk space, but this is not a universal capability of every software tool you might need to use. If the amount of data that you need to work with at once is too big to fit into memory (see the discussion of memory below), then you will have to work with it in chunks or stream through it in order to process and analyze it. In that case, the read and write speeds of your disk may be the primary bottleneck in your analysis. There is a core trade-off between cheaper, larger, mechanical spinning-disk hard drives and newer, faster, more expensive solid-state drives.
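The chunked/streaming approach described above can be sketched with Python’s standard csv module. This is a minimal, hedged example; the column name and data are made up for illustration:

```python
import csv
import io

# Compute the mean of a numeric column by streaming through the file
# one row at a time, so memory use stays constant no matter how
# large the file is.
def streaming_mean(lines, column):
    reader = csv.DictReader(lines)
    total, count = 0.0, 0
    for row in reader:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")

# In practice you would pass an open file handle; an in-memory
# buffer is used here so the sketch is self-contained.
sample = io.StringIO("site,temp_c\nA,10.0\nB,12.0\nC,14.0\n")
print(streaming_mean(sample, "temp_c"))  # 12.0
```

The same pattern (read a piece, update a running summary, discard the piece) generalizes to any aggregation that does not need all rows at once.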
In some instances, it may be possible to work around the constraint of the drive inside of your computer by buying an external hard drive to store data while you analyze it. In other cases, however, the data will be too big even for a very large external drive. In this case, you may want to consider moving your analysis to either a local on-premise workstation or some sort of cloud computing. While they are uncommon in personal computers (laptops and consumer desktops), RAIDs (Redundant Arrays of Inexpensive Disks) are an approach that combines multiple smaller discs into one larger virtual disc in order to remove some of the constraints of smaller drive sizes. Depending on how the RAID is configured, they can also offer redundancy, which helps with protecting against data loss in the event of disk failure by writing the same data to two different physical drives simultaneously, and an approach called striping, which can substantially increase the speed of reading and writing data to the drive by reading and writing different parts of your data to multiple drives at the same time. This latter approach doubles the risk of data loss since all data is lost if either drive fails, so there are also RAID configurations that combine redundancy and striping.
A second constraint to consider is the amount of memory (RAM) in your computer. Some types of analysis require the complete data set to be in memory at once. For cases like this, you may need to use a server that is specially configured with a very large amount of memory, since the motherboards in consumer computers are generally limited in how much RAM they can accept. Unlike with external hard disks, many newer laptops in particular have the RAM soldered in place, so it cannot be added to or replaced after purchase. If you are working with tabular data (like very large CSV files) in a scripting language like R or Python, you may be required to load the complete data set from disk into memory before you can work with it.
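A rough way to check whether a data set is likely to fit in memory is to compare its on-disk size against total system RAM. This is a hedged sketch: the 2x inflation factor and the 50% headroom are assumptions (parsed data structures often occupy more RAM than the raw bytes on disk), and the sysconf calls are POSIX/Linux-specific:

```python
import os

# Hedged sketch: will this file plausibly fit in memory? The 2x
# inflation factor is an assumption -- parsed data structures often
# occupy more RAM than the raw bytes on disk.
def estimated_in_memory_bytes(path, inflation=2.0):
    return os.path.getsize(path) * inflation

def total_ram_bytes():
    # POSIX/Linux-specific; other platforms need a different approach.
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")

def fits_in_memory(path):
    # Leave headroom for the OS and other processes (another guess).
    return estimated_in_memory_bytes(path) < 0.5 * total_ram_bytes()
```

If `fits_in_memory` returns `False`, a chunked or streaming approach, or a machine with more RAM, is warranted.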
The third constraint to consider is CPU speed as well as the number of available cores. While the clock speed of CPUs has not increased drastically over the last decade, there has been a shift toward larger numbers of cores in both consumer machines and remote servers. This means that for analyses that can be parallelized, analysis time can be substantially reduced relative to machines with fewer cores at a given CPU or ‘clock’ speed. However, not all analyses are amenable to parallelization; some must be performed sequentially, and for those, single-core speed may be more important. As data sets get larger, it becomes more important to consider parallelized approaches, and potentially to switch to either an on-premises workstation or cloud compute with a very high number of cores. For processing that exceeds the capacity of even a very large single machine or node, high-performance computing approaches (computer clusters) may be an option. In these systems a large number of nodes (individual servers) are linked together, and processing jobs can be split up and distributed among the nodes to run in parallel before being recombined upon completion.
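A minimal sketch of a parallelizable analysis using Python’s standard multiprocessing module; `process_cell` is a hypothetical stand-in for an expensive per-grid-cell computation:

```python
from multiprocessing import Pool

# Hypothetical embarrassingly parallel task: apply the same function
# independently to many "grid cells".
def process_cell(cell_id):
    # Stand-in for an expensive per-cell computation.
    return cell_id * cell_id

def run_parallel(cells, workers=4):
    # Each cell is processed on a separate worker process, then the
    # results are recombined in order.
    with Pool(processes=workers) as pool:
        return pool.map(process_cell, cells)

if __name__ == "__main__":
    print(run_parallel(range(8)))  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The same split/process/recombine pattern underlies HPC cluster workflows; there the “workers” are whole nodes rather than cores on one machine.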
For analyses that include machine learning in particular, the graphics processing unit (GPU) capability of your machine is going to be very important. The speed and number of cores in the GPU, as well as the GPU memory, will constrain the types of analysis you can do. In an era of explosive growth in large language models, it can be quite difficult and expensive to get access to top-tier, high-capability GPU cards. For some types of analysis, including training machine learning models, strong GPU capability is essential.
The last major constraint to consider is the network bandwidth available to your machine. While your home Wi-Fi network may be sufficient for browsing or streaming movies, downloading a multi-terabyte data set is likely to be challenging. In cases like this, a high-speed, multi-gigabit wired Ethernet connection will make the transfer of very large data sets faster and more reliable.
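Back-of-the-envelope transfer times can clarify whether downloading is even feasible. A minimal sketch, assuming the nominal link speed is fully available (in practice real throughput is usually lower):

```python
def transfer_time_hours(size_gb, bandwidth_mbps):
    # Bandwidth is quoted in megabits per second; divide by 8 to get
    # megabytes per second. Treat the result as a best case, since
    # real-world throughput rarely matches the nominal link speed.
    seconds = (size_gb * 1000) / (bandwidth_mbps / 8)
    return seconds / 3600

# e.g. a 2 TB dataset over a 100 Mbps home connection:
print(round(transfer_time_hours(2000, 100), 1))  # 44.4 (hours)
```

Numbers like this are often what tip the decision toward bringing the compute to the data rather than downloading.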
When you may want to consider finding a bigger computer to use for your analyses
1. Possible constraints
1. Disk space
2. RAM
3. CPU speed/number of cores
4. GPU
5. Network bandwidth
6. Disk speed (read/write speed)
7. Time needed to run analyses
8. Think about: is it going to be easier/faster to bring the data to the compute (i.e. download the data onto your computer or to a specific server) or to bring the compute to the data (i.e. do analyses on the servers where the data are already archived so the download/transfer step can be skipped)?
Options for big compute
1. A large machine/server used by your research group/lab
2. A campus or company cluster (HPC)
3. The cloud (infrastructure)
a. Commercial options
i. Amazon – AWS
ii. Google Cloud
iii. Microsoft Azure
b. NSF-funded infrastructure for research
i. ACCESS
4. The tools (software interfaces to this infrastructure)
a. Google Earth Engine
b. Cyverse
c. Etc
How to connect?
1. Command line
a. ssh
b. sftp
2. Web browser interfaces
a. RStudio Server
b. JupyterHub
c. Google Colab
d. GitHub Codespaces
Case studies of different datasets and where analysis might be most appropriate
Once you have determined that the dataset(s) you want or need to work with are beyond the capacity of your local computer, you’ll need to start thinking about other compute options to enable your analyses. The size of the data and the complexity of the analyses will determine which option you choose. We’ll give some specific case studies of types of data and the compute infrastructure that may be needed to work with them effectively.
The simplest option might be to use or request access to a larger local computer. This could be a large desktop machine or a server used by your lab group or research team. This larger computer might have enough disk space, RAM, and CPU/GPU compute to make your analyses possible.
One step up from that would be an on-premises HPC (High-Performance Computing) cluster. These generally consist of dozens to hundreds of ‘nodes’ (individual computers) connected together, enabling analyses to be distributed across a huge number of machines in parallel. This approach enables analyses that would take weeks, months, or even years(!) on a single machine. Common examples are computations that need to be performed on hundreds or thousands of map grid cells, or analyses where the same dataset is processed repeatedly with different parameters.
At some point, though, you may exhaust on-premises resources and need to look for options elsewhere. These remote computing resources may be either commercial or government-funded (e.g. through agencies like the National Science Foundation in the United States). Examples of commercial cloud computing options include Amazon Web Services (AWS), Google Cloud, and Microsoft Azure. Each of these is essentially a service where the company has purchased and set up a huge number of machines in giant warehouses, and you can rent any number of machines at the size you need by the hour. These services provide the majority of the infrastructure behind the modern web, hosting most commercial websites and applications.
If your research is non-commercial (e.g. if you are at a university, research institute, or non-profit organization) you may be eligible to apply for access to remote computing resources funded through national research agencies. In the US, one example is ACCESS (Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support), which is funded by the US National Science Foundation.
Case studies:
- A dataset that fits in memory and can be analyzed on your laptop.
- A dataset that fits on disk but not in memory (streaming analyses, a local database approach).
- A massive NASA dataset on Google Earth Engine (where the compute happens where the data already live, without downloading or transferring them elsewhere).
- A large but not super large (~200 GB) dataset that can’t fit into laptop memory but could fit into memory on a server with lots of RAM.
- A dataset that is not huge but where analysis requires a huge amount of parallel compute (thus necessitating a cluster/HPC).
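The “fits on disk but not in memory” case above can be sketched with Python’s built-in sqlite3 module: load the rows into an on-disk database once, then let the database engine scan the file rather than holding everything in RAM. The table and values here are made up for illustration:

```python
import sqlite3

# Use a file path (e.g. "obs.db") for real data; an in-memory
# database keeps this sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (site TEXT, temp_c REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?)",
    [("A", 10.0), ("B", 12.0), ("C", 14.0)],
)
# The aggregation happens inside SQLite; only the result comes back
# to Python, so the full table never needs to fit in your process.
(mean_temp,) = conn.execute("SELECT AVG(temp_c) FROM obs").fetchone()
print(mean_temp)  # 12.0
```

For tabular data this pattern often postpones the need for a bigger machine considerably, since the database only pages in what each query touches.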
Scoping your own project’s needs
When you are starting a new data analysis project or ramping up the scope of what you’ve been doing, it can be worth assessing each of the constraints discussed above (disk space and speed, RAM, CPU, GPU, and network bandwidth) against what your planned analyses will require.
Notes:
- How to test your bandwidth? How to assess your disk speed? How to check your CPU speed?
- How to see how hard your computer is working? (Exercise with htop.)
- Follow-on question: practice scoping a case study.
- Example with Google Earth Engine raster data: do you have what’s needed to process it locally? How would you know?
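As one hedged sketch of the “assess your disk speed” question, you can time writes and reads of a temporary file. Results are rough: OS caching (especially for the read pass), other running processes, and the chosen file size all affect the numbers:

```python
import os
import tempfile
import time

def disk_speed_mb_s(size_mb=64):
    # Time a sequential write of size_mb megabytes, then a sequential
    # read of the same file. fsync forces the write to actually reach
    # the disk; the read may still be served from the page cache, so
    # treat the read figure as an upper bound.
    data = os.urandom(1024 * 1024)  # 1 MB of random bytes
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        t0 = time.perf_counter()
        for _ in range(size_mb):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())
        write_s = time.perf_counter() - t0
    t0 = time.perf_counter()
    with open(path, "rb") as f:
        while f.read(1024 * 1024):
            pass
    read_s = time.perf_counter() - t0
    os.remove(path)
    return size_mb / write_s, size_mb / read_s

write_speed, read_speed = disk_speed_mb_s()
print(f"write: {write_speed:.0f} MB/s, read: {read_speed:.0f} MB/s")
```

Comparing these numbers against the rough throughput of spinning disks versus SSDs can help you guess which class of drive you have and whether disk I/O is likely to be your bottleneck.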