Predict Water Quality with Random Forests Coding Tutorial

Science Buddies

30 Jul 202422:12

Summary

TLDRTracy from Science Buddies introduces a coding tutorial on predicting water quality, specifically dissolved oxygen levels, using machine learning. The tutorial covers collecting data from the USGS National Water dashboard, pre-processing it in Google Sheets, and employing a random forest model in Google Colab for analysis. The project aims to manage algae blooms and protect aquatic life and human health by forecasting low dissolved oxygen levels.

Takeaways

🌿 The goal of the random forest model is to predict future dissolved oxygen levels in water bodies.
🐟 Dissolved oxygen (DO) is crucial for the survival of aquatic life and low levels can lead to hypoxia and harmful algal blooms.
🌳 Predicting DO levels can help manage algal blooms and prevent harmful water conditions, protecting aquatic ecosystems and human health.
🌳 A decision tree makes decisions by splitting data into branches based on questions about the data at decision points or nodes.
🌲 A random forest is an ensemble of decision trees, and its output is the average of the values predicted by the individual trees.
💻 Google Colab is a platform for writing, running, and sharing code, which is used in this tutorial for the hands-on portion.
📊 Data for the model is gathered from the USGS National Water Dashboard, focusing on water quality parameters like DO, temperature, turbidity, pH, and specific conductance.
📈 The data is organized and formatted in Google Sheets, calculating averages and preparing it for machine learning modeling.
📊 The tutorial includes creating sheets for current and future DO levels to train the model to predict one day, one week, and four weeks ahead.
📈 The final dataset is compiled into a CSV file, which is then uploaded to Google Drive and used to train the random forest model in a Python notebook.
📊 The model's accuracy is evaluated using mean absolute error and mean absolute percentage error, with visualizations showing the relationship between predicted and actual DO levels.

Q & A

What is the main goal of the random forest model in the water quality prediction project?
-The main goal of the random forest model in the water quality prediction project is to predict future dissolved oxygen levels.
Why is dissolved oxygen important to predict in water bodies?
-Dissolved oxygen is important to predict because it is essential for the survival of aquatic organisms like fish and invertebrates. Low levels can lead to hypoxia, causing algae blooms and potentially harmful conditions for both aquatic life and human health.
How does a decision tree model make decisions?
-A decision tree model makes decisions by splitting the data into branches at different decision points or nodes. Each node represents a question about the data, and each branch represents the possible answers to those questions.
What is a random forest in the context of machine learning?
-A random forest is an ensemble of multiple decision trees, and the average of those decision trees' predictions is the value that the random forest outputs.
How can one collect water quality data for machine learning analysis?
-One can collect water quality data for machine learning analysis from sources like the USGS National Water dashboard, where you can select specific stations and parameters such as dissolved oxygen, temperature, turbidity, pH, and specific conductance.
What are the steps to format the collected water quality data for machine learning?
-The steps to format the collected water quality data include selecting a location with the required variables, setting the time span, downloading the data, organizing it in a spreadsheet, separating the data into separate columns, calculating the average values for each day, and deleting unnecessary rows and columns.
How does one create future dissolved oxygen levels data for the model's predictions?
-To create future dissolved oxygen levels data, one shifts the time span by one day, one week, and four weeks into the future and repeats the data collection and formatting process for these new time spans.
What is the purpose of calculating the mean absolute error for the random forest model?
-The mean absolute error is calculated to measure the accuracy of the model in predicting numerical outcomes. It represents the average magnitude of errors in a set of predictions.
How can one visualize the accuracy of the random forest model's predictions?
-One can visualize the accuracy of the random forest model's predictions by creating scatter plots that compare the actual values of dissolved oxygen with the predicted values, as well as by examining the R-squared value, which measures how well the model's predictions match the actual data.
What does the feature importance graph in the random forest model represent?
-The feature importance graph ranks each water quality property based on its ability to predict future dissolved oxygen levels, indicating which properties are most influential in the model's predictions.