11 Environments for Data Science II: Supplemental languages for Environmental Data Science
This chapter picks up where Chapter 3 left off. That chapter covered common programming languages for environmental data science, with a particular focus on R and Python. Here we cover additional programming languages and tools that you may want to add to your environmental data science tool belt to supplement those core scripting languages. These tools are more context-specific: they make great add-ons for projects that need them, whereas the core scripting languages covered earlier are general-purpose. This chapter also discusses in more detail the advantages and disadvantages of different approaches to applying the skills covered in the previous chapters.
Sometimes all you need is a hammer (or a Swiss army knife), but other times it’s useful to have a more complete set of tools. Having a solid working knowledge of one of the core data science scripting languages (R, Python, Julia) is an important first step, and one of those languages by itself is often enough to solve a given EDS challenge. Finding a package for one of those languages that does what you want can save an enormous amount of time, especially if the alternative is to code something up from scratch on your own. However, there are also scenarios for which a scripting language is not the best tool for the job.
Additional languages/tools for data science projects can include:
- Compiled languages (C, C++, Java, etc.)
- Web languages (JavaScript)
- Databases and their associated languages (MySQL, PostgreSQL, DuckDB, MongoDB, etc.)
- Some flavor of a *nix (Unix, Linux) command line shell (bash, zsh, etc.)
When to consider using a compiled language
- Speed is the main reason; another is the need to build an application for end users (Swift, Java, etc.)
When to consider using a web-first language like JavaScript
- D3.js
- other examples: https://julien-blanchard.github.io
- Pyodide (Python in the web browser)
- Node.js
When to consider using a relational database approach – SQL
- Data is already in a database, perhaps on a remote server
- By learning the syntax for SQL queries, you can get the data you want directly
- Data is well structured and very large, so wrangling it and subsetting it can’t be done in memory, but could be done efficiently if the data were loaded into a database
- You are designing a database for a new project to relate the different data sets in a framework that allows them to be linked to each other programmatically
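To make the "get the data you want directly" point concrete, here is a minimal sketch of a SQL query that subsets and summarizes data inside the database rather than in memory. It uses Python's built-in sqlite3 module with an in-memory database so that it runs anywhere. The table name, column names, site labels, and temperature values are all made up for illustration; a real project would more likely connect to an existing database on a server.

```python
import sqlite3

# Hypothetical example data: (site label, observation date, temperature in deg C).
# These values are invented for illustration only.
readings = [
    ("A", "2023-06-30", 19.0),
    ("A", "2023-07-01", 24.0),
    ("A", "2023-07-02", 26.0),
    ("B", "2023-07-01", 21.0),
    ("B", "2023-07-02", 23.0),
    ("C", "2023-07-01", 28.0),
]

# An in-memory SQLite database stands in for a real (possibly remote) database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (site TEXT, obs_date TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", readings)

# The query filters by date and averages by site inside the database,
# so only the small summary table is returned to Python.
query = """
    SELECT site, AVG(temp_c) AS mean_temp_c
    FROM readings
    WHERE obs_date >= ?
    GROUP BY site
    ORDER BY site
"""
site_means = conn.execute(query, ("2023-07-01",)).fetchall()
conn.close()

for site, mean_temp_c in site_means:
    print(f"{site}: {mean_temp_c:.1f}")
```

The `?` placeholder passes the date as a parameter instead of pasting it into the query string, which is the safer habit when any part of a query comes from user input.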
When to consider using a command line shell to script analyses
- Across a complex analytical workflow or project, the best (or only) software tools may come from different types of programming languages or environments. A bash script can bring together, in a structured way, the parts of an analysis that are written in R, those that are written in Python, and those that are compiled C code. Some of these pieces of software may be scripts that you wrote, while others may be command line programs made available by others.
- When working with a remote server and using software tools that primarily have a textual rather than a graphical user interface. For computationally intensive tasks, the go-to tools are often compiled programs that are much faster than scripted approaches, but these programs must be run from a command line with the proper set of options or parameters to produce the desired output.
- The next step up from a bash script would be using a specialized workflow tool, like make or snakemake – we’ll discuss these tools in more depth in Chapter 16 (data science workflows 3).
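The idea of a script that glues together separate command line programs can be sketched as follows. The text above describes doing this in bash; the sketch below shows the same pattern in Python using the standard subprocess module. The two steps here are placeholder one-line commands so that the example runs anywhere; in a real project each step might instead be something like `Rscript clean.R` or a compiled binary (those file names are hypothetical).

```python
import subprocess
import sys

# Hypothetical pipeline: each step is a separate command line program.
# Both steps are tiny Python one-liners here so the sketch runs anywhere;
# in practice they could be R scripts, Python scripts, or compiled programs.
steps = [
    [sys.executable, "-c", "print('step 1: clean data')"],
    [sys.executable, "-c", "print('step 2: fit model')"],
]


def run_pipeline(commands):
    """Run each command in order, stopping at the first failure."""
    outputs = []
    for cmd in commands:
        # check=True raises CalledProcessError if a step exits with a
        # non-zero status, similar in spirit to `set -e` in a bash script.
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        outputs.append(result.stdout.strip())
    return outputs


if __name__ == "__main__":
    for line in run_pipeline(steps):
        print(line)
```

Stopping at the first failed step matters because later steps usually depend on files that earlier steps produce.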
old notes
- This chapter could take several possible themes/approaches:
- Repeat the same workflow (grab data from NEON, clean, and plot?) a la 3 different ways?
- Finding an example that also has SQL might be tricky (?)
- To me, R and Python seem to be interchangeable - perhaps easier to plot with R.
- Or we unify this around a workflow (Acquire → Read in → Wrangle → Plot)?
- Go back to the ChatGPT example from Environments for Data Science I?