## Welcome to the WhoseEgg App

WhoseEgg is an R Shiny app for predicting the identification of fish eggs with an objective of detecting invasive carp (Bighead, Grass, and Silver Carp) in the Upper Mississippi River basin. Users are able to provide the required fish egg characteristics to the app, and the predicted family, genus, and species taxonomy levels will be returned. The predictions are made using random forests that are based on the models developed in Camacho et al. (2019) and validated in Goode et al. (2021).

See the first tab below for information on how to use the app. The other tabs below describe the locations where eggs were collected for training the random forests and the species present in the training data. We caution against using WhoseEgg with eggs collected in different locations or when other species are believed to be present.

See the help page for information about the random forest models used by WhoseEgg, the definitions of the egg characteristics, and recommendations for how to handle data from different locations or containing different species.

Follow the steps below to obtain predictions. Additional instructions are included on the page corresponding to each step.

Egg Collection Locations

The data used to train the random forests in WhoseEgg are from fish eggs collected in the Upper Mississippi River basin at the locations shown in the image below. Validation of models from Camacho et al. (2019) showed great performance at locations included in the training data, but performance decreased on locations outside of the training data. For more details on the validation process, see Goode et al. (2021).

As a result, we caution against using WhoseEgg to predict eggs collected outside of the training data egg collection region without prior model validation.

For more information on using WhoseEgg with data collected in different regions, see the FAQ on the help page.

Species in Training Data

Random forests are only able to make predictions for fish species included in the data used to train the models. The family, genus, and species of fish included in the data used to train the models are listed below in alphabetical order. Note: Invasive carp include Bighead, Grass, and Silver Carp.

We caution against using WhoseEgg, without prior model validation, on data believed to contain species not included in the training data.

For more information on using WhoseEgg with data in which additional species may be present, see the FAQ on the help page.

• WhoseEgg may be used on any device with a browser, but we recommend using a computer for the best experience.

• Zoom in/out using control (Windows) or command (Mac) and the +/- key (or compatible technique on user's computer).

• The figures in the app can be saved by right-clicking on the image and selecting an option to save the image.

• For tables of data printed in the app, use the search box to help find observations of interest. This is especially useful for finding problematic observations that result in errors when the data is uploaded.

WhoseEgg was created by Katherine Goode, Dr. Michael Weber, and Dr. Philip Dixon at Iowa State University.

The creators would like to thank Carlos Camacho for his assistance, especially with the content describing the morphological variables on the help page.

For questions and feedback regarding WhoseEgg, please email whoseegg@iastate.edu.

Funding for WhoseEgg was provided by the U.S. Fish and Wildlife Service through Grant #F20AP11535-00.

Data privacy statement: Data uploaded to WhoseEgg will not be saved by WhoseEgg or distributed.

## Input of Egg Characteristics

### Overview

This page contains the tools for providing the fish egg characteristics that will be used by the random forests to predict the fish taxonomies. To provide the egg characteristics, follow the instructions in the sidebar panel to the left.

The egg characteristic data must be formatted appropriately to work with WhoseEgg and correctly obtain predictions. Follow the guidelines in the Spreadsheet Specifications tabs below. Once the egg characteristic spreadsheet is uploaded, several additional variables will be computed based on the input values to be used by the random forests: Julian_Day, Membrane_SD, Membrane_CV, Embryo_SD, Embryo_CV, and Embryo_to_Membrane_Ratio. The uploaded variables of Year and Day are only used by WhoseEgg to compute Julian_Day and are excluded from the processed data.
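As an illustration of the derived variables listed above, the following Python sketch computes Julian_Day, the two coefficients of variation, and Embryo_to_Membrane_Ratio for a single egg. It is not the app's own code (WhoseEgg is written in R). The function name `derive_variables`, the dictionary input, and the example values are hypothetical. Julian_Day is assumed to mean the day of the year, and the standard deviations are taken as given because this page does not describe how they are obtained.

```python
from datetime import date


def derive_variables(egg):
    """Return the derived predictors for one egg (hypothetical helper).

    `egg` is a dict keyed by spreadsheet variable names. The SD values are
    assumed to be available already; how WhoseEgg obtains them is not
    described on this page.
    """
    # Day of the year (1 = January 1); an invalid date raises ValueError.
    julian_day = date(egg["Year"], egg["Month"], egg["Day"]).timetuple().tm_yday
    return {
        "Julian_Day": julian_day,
        # Coefficient of variation = standard deviation / average.
        "Membrane_CV": egg["Membrane_SD"] / egg["Membrane_Ave"],
        "Embryo_CV": egg["Embryo_SD"] / egg["Embryo_Ave"],
        # Ratio of average embryo diameter to average membrane diameter.
        "Embryo_to_Membrane_Ratio": egg["Embryo_Ave"] / egg["Membrane_Ave"],
    }


# Made-up measurements for illustration only.
example = {
    "Year": 2014, "Month": 4, "Day": 23,
    "Membrane_Ave": 4.0, "Membrane_SD": 0.2,
    "Embryo_Ave": 2.0, "Embryo_SD": 0.1,
}
derived = derive_variables(example)
print(derived["Julian_Day"])  # 113
```

Year and Day feed only into Julian_Day in this sketch, which mirrors the statement above that they are excluded from the processed data.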

Under Egg Characteristics, see the 'Input Data' tab to view data in the uploaded spreadsheet and the 'Processed Data' tab for the set of predictor variables to be used by the random forest plus the Egg ID.

See the 'Random Forest Details' tab on the help page for a full list of the predictor variables used by the random forests in WhoseEgg.

• Fill in all variables (egg_ID and the 13 egg characteristics)

• Use the helpers in the template to correctly enter the variable values (see the 'Template Helpers' tab for more info)

• See the help page for detailed definitions of the egg characteristics (includes example photos)

• Variable names must be exactly as they appear in the template

• All variables (egg_ID and the 13 egg characteristics) must be filled in for WhoseEgg to return a prediction

• Observations with missing variable values will be excluded from the processed data for prediction but will be included in the final dataset for download without random forest predictions

• At least one egg observation is required

• There is no maximum number of observations that may be included, but the time it takes to compute predictions will increase as the number of observations increases
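The handling of incomplete observations described above can be pictured as a simple split. The sketch below is Python, not the app's R code. The function names `is_missing` and `split_complete`, and the convention that a missing value appears as `None` or a blank string, are assumptions made for illustration.

```python
def is_missing(value):
    """Treat None and blank strings as missing (an assumed convention)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def split_complete(rows, required):
    """Split rows into those usable for prediction and those set aside.

    rows: list of dicts, one per egg.
    required: variable names that must be filled in.
    Incomplete rows are kept so they can still appear in the downloaded
    data, just without predictions.
    """
    complete, incomplete = [], []
    for row in rows:
        if any(is_missing(row.get(name)) for name in required):
            incomplete.append(row)
        else:
            complete.append(row)
    return complete, incomplete


# Hypothetical two-variable example; the real template has egg_ID plus 13 variables.
rows = [
    {"egg_ID": "a", "Temperature": 21.5},
    {"egg_ID": "b", "Temperature": ""},
]
complete, incomplete = split_complete(rows, ["egg_ID", "Temperature"])
print([r["egg_ID"] for r in complete])    # ['a']
print([r["egg_ID"] for r in incomplete])  # ['b']
```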

• Select a cell in the Excel template with a variable name to see required format, units, and/or accepted levels

• Cells of categorical variables contain drop down options

• Errors or warnings will appear if a value is not formatted correctly

• Helpers may be turned off by using the data validation option in Excel

## Results from Random Forests

### Overview

This page provides the ability to compute and display the random forest predictions for the egg data provided via the 'Data Input' tab. To obtain the predictions, follow the instructions in the sidebar panel to the left. The sections below provide tools for viewing and exploring the predictions.

See the Table of Predictions below for the random forest predictions and corresponding probabilities for each fish egg. The Family Pred, Genus Pred, and Species Pred columns contain, for each egg, the taxonomic level with the highest random forest probability. The Family Prob, Genus Prob, and Species Prob columns contain the corresponding random forest probabilities. A random forest probability is the proportion of trees in the random forest that predict a certain level. See the 'Random Forest Details' tab on the help page for information on how random forest predictions and probabilities are determined.
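To make the relationship between the Pred and Prob columns concrete, the Python sketch below picks the level with the highest probability for one egg. The probabilities and the function name `top_level` are made up for illustration and are not output from WhoseEgg.

```python
def top_level(probabilities):
    """Return (level, probability) for the level with the highest probability."""
    level = max(probabilities, key=probabilities.get)
    return level, probabilities[level]


# Hypothetical family-level probabilities for a single egg.
family_probs = {"Invasive Carp": 0.81, "Sciaenidae": 0.12, "Cyprinidae": 0.07}
family_pred, family_prob = top_level(family_probs)
print(family_pred, family_prob)  # Invasive Carp 0.81
```

Ties are broken here by dictionary order, which is an arbitrary choice of this sketch.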

See the Visualizations of Predictions below for various visualizations of the random forest predictions.

### Visualizations of Predictions

Frequency of Predictions per Taxonomic Level

Each plot shows the levels of family, genus, and species included in the predictions. The length of each bar represents the total number of eggs classified within a level by the random forest.

Random Forest Probabilities for a Specified Egg

The random forests return probabilities for all taxonomic levels in the training data for each egg observation. These graphics show the probabilities for each taxonomic level for an egg. The taxonomic levels are ordered from top to bottom by highest to lowest random forest probability.

### Overview

The data available for download include:

• all initial variables uploaded to WhoseEgg,
• variables computed to generate random forest predictions,
• the random forest predictions, and
• the random forest probabilities for all taxonomic levels.

## Help Page

This page contains additional information to assist with the use of WhoseEgg. The tabs of Environmental Variables and Morphological Variables contain information about the egg characteristics used in WhoseEgg including their definitions and required spreadsheet formats. The Random Forest Details tab contains information on random forests in general and the random forests used for prediction in WhoseEgg. The FAQ tab contains answers to common questions users may have such as how to handle data collected at locations outside the region where the training data were collected. For questions that are not answered by the content provided here, please email whoseegg@iastate.edu.


### Day

Definition: Day of the month when the fish egg is collected

Spreadsheet Variable Name: Day

Format: Integer between 1 and 31, respective to the month

Random Forest Predictor Variable: No

### Conductivity

Definition: Conductivity ($$\mu$$S/cm) of the water where the egg is collected

Spreadsheet Variable Name: Conductivity

Format: Continuous variable greater than 0

Random Forest Predictor Variable: Yes

Additional Information: Training data conductivity values range between 274 $$\mu$$S/cm and 781 $$\mu$$S/cm

### Julian Day

Definition: Julian day when the fish egg is collected

Spreadsheet Variable Name: Julian_Day

Format: Integer between 1 and 365

Random Forest Predictor Variable: Yes

Additional Information: Julian days in training data range between 113 and 243

### Month

Definition: Month when the fish egg is collected

Spreadsheet Variable Name: Month

Format: Integer between 1 and 12

Random Forest Predictor Variable: Yes

Additional Information: Months in training data are 4, 5, 6, 7, 8

### Temperature

Definition: Temperature (degrees Celsius) of the water where the egg is collected

Spreadsheet Variable Name: Temperature

Format: Continuous variable

Random Forest Predictor Variable: Yes

Additional Information: Training data temperature values range between 11 °C and 30.7 °C

### Year

Definition: Year when the fish egg is collected

Spreadsheet Variable Name: Year

Format: YYYY

Random Forest Predictor Variable: No


### Compact or Diffuse

Definition: Specifies whether the egg collected is compact or diffuse

Spreadsheet Variable Name: Compact_Diffuse

Format: C or D

Random Forest Predictor Variable: Yes

Examples of compact eggs: The embryo entity is clearly identifiable. Note that it is difficult to see the identifiable embryo of the egg in the top right corner. When the egg was viewed under a microscope, the embryo would roll around within the membrane. If you look closely at the right side of the embryo, you can see the space between the embryo and membrane. The embryo is compact but nearly as big as the membrane.

Examples of diffuse eggs: The embryo is not in a tightly compact entity within the membrane. The membrane appears to be filled with smoke or the embryo material is scrambled within the membrane.

### Deflated

Definition: Specifies whether the egg is deflated or not

Spreadsheet Variable Name: Deflated

Format: N or Y

Random Forest Predictor Variable: Yes

Examples of deflated eggs. None of these membranes is smooth and spherical. Notice the dents and folds indicating the membrane is smaller now than when sampled. Some of this may be due to broken membranes from egg handling or desiccation from the ethanol. Most eggs in the training data were not ripped or broken. The membranes appeared intact but shriveled due to water being drawn out by the ethanol during preservation.

Examples of eggs that are not deflated. Notice all the membranes are spherical and smooth. You do not see folds or dents.

### Egg Pigment

Definition: Specifies whether the egg has pigment or not

Spreadsheet Variable Name: Pigment

Format: N or Y

Random Forest Predictor Variable: Yes

Examples of eggs with pigment. Arrows point at chromatophores (cells containing pigment). Notice in the bottom middle image that the larger dark circles are eyes.

Examples of eggs without pigment. There are no chromatophores present.

### Egg Stage

Definition: Stage of the egg when collected - either 1 through 8, broken, or diffuse

Spreadsheet Variable Name: Egg_Stage

Format: One of 1, 2, 3, 4, 5, 6, 7, 8, BROKEN, D (where BROKEN indicates that the egg is broken and D indicates the egg is diffuse)

Random Forest Predictor Variable: Yes

Additional Information: Below are examples of the stages. Most of these pictures of the stages are not great representations. Assessing the stage is accomplished best with the egg under a microscope, so that it can be moved around to inspect all sides and angles.

All diagrams of egg stages included below are from Kelso and Rutherford (1996). Permission to use the images was granted by the American Fisheries Society.

Egg Stage 1 (early cleavage): The blastomeres will look like prongs (typically 4) pointing in one direction. The arrows in the images point to the prongs. The top left is the best image showing the 4 prongs. They are not as pronounced, but the cleavage separating each blastomere is still apparent. The bottom left corner picture shows a side view of 2 prongs. It is difficult to get a picture looking directly into the prongs.

Egg Stage 2 (morula): The arrows point to blastomeres.

Egg Stage 3 (blastula):

Egg Stage 4 (gastrula): The arrows point to germ rings.

Egg Stage 5 (early embryo): The arrows point to embryonic axis (spine and back forming). This is best described as a ridge line that sticks out of the embryo.

Egg Stage 6 (tail-bud stage): Out of the embryonic axis, the head and tail will form a rounded end. The tail will start to protrude away from the embryo.

Egg Stage 7 (tail-free stage): The tail continues to protrude away from the embryo and is no longer touching the embryo. The arrows point to the tail not touching the embryo.

Egg Stage 8 (late embryo): The embryo is almost fully developed. Myomeres, eyes, and auditory vesicles are all present. The egg is close to hatching.

### Embryo Diameter Average

Definition: Average of four diameter measurements (mm) taken from the embryo with starting points that are equally spaced around the circumference as defined by Kelso and Rutherford (1996)

Spreadsheet Variable Name: Embryo_Ave

Format: Positive continuous variable

Random Forest Predictor Variable: Yes

Additional Information: The training data averages of embryo diameters range between 0.434 mm and 4.371 mm

The images below show examples of four equally spaced embryo diameter measurements.

### Embryo Diameter Coefficient of Variation

Definition: Coefficient of variation (standard deviation / average) of four diameter measurements (mm) taken from the embryo with starting points that are equally spaced around the circumference as defined by Kelso and Rutherford (1996)

Spreadsheet Variable Name: Embryo_CV

Format: Positive continuous variable

Random Forest Predictor Variable: Yes

Additional Information: The training data coefficients of variation of embryo diameters range between 0.003 and 0.724. See the figures included underneath Embryo Diameter Average for examples of how the embryo diameters are measured.

### Embryo Diameter Standard Deviation

Definition: Standard deviation of four diameter measurements (mm) taken from the embryo with starting points that are equally spaced around the circumference as defined by Kelso and Rutherford (1996)

Spreadsheet Variable Name: Embryo_SD

Format: Positive continuous variable

Random Forest Predictor Variable: Yes

Additional Information: The training data standard deviations of embryo diameters range between 0.005 mm and 1.377 mm. See the figures included underneath Embryo Diameter Average for examples of how the embryo diameters are measured.

### Embryo to Membrane Ratio

Definition: Ratio of average embryo diameter to average membrane diameter

Spreadsheet Variable Name: Embryo_to_Membrane_Ratio

Format: Positive continuous variable

Random Forest Predictor Variable: Yes

Additional Information: The training data embryo to membrane ratios range between 0.257 and 1.137. See the figures included underneath Membrane Diameter Average and Embryo Diameter Average for examples of how the membrane and embryo diameters are measured.

### Larval Length

Definition: Total length measurement (mm) along the midline from all of the late stage embryos (eggs in stages 6-8)

Spreadsheet Variable Name: Larval_Length

Format: Non-negative continuous variable (set to 0 if egg is in stage 5 or less or egg is diffuse)

Random Forest Predictor Variable: Yes

Additional Information: The training data larval lengths range between 0 mm and 5.089 mm.
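The zero rule in the format above lends itself to a quick consistency check before upload. This Python sketch is an illustration only. The function name `larval_length_ok` is hypothetical, and because the rule does not mention eggs marked BROKEN, the sketch leaves them unconstrained apart from requiring a non-negative value.

```python
# Stages for which the format above says Larval_Length is set to 0.
ZERO_LENGTH_STAGES = {"1", "2", "3", "4", "5", "D"}


def larval_length_ok(egg_stage, larval_length):
    """Check Larval_Length against Egg_Stage (hypothetical helper)."""
    if larval_length < 0:
        return False
    if str(egg_stage) in ZERO_LENGTH_STAGES:
        return larval_length == 0
    return True


print(larval_length_ok(3, 0))      # True
print(larval_length_ok("D", 1.2))  # False
print(larval_length_ok(7, 4.1))    # True
```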

Below is an example showing how larval length is measured.

### Membrane Diameter Average

Definition: Average of four diameter measurements (mm) taken from the membrane with starting points that are equally spaced around the circumference as defined by Kelso and Rutherford (1996)

Spreadsheet Variable Name: Membrane_Ave

Format: Positive continuous variable

Random Forest Predictor Variable: Yes

Additional Information: The training data averages of membrane diameters range between 0.728 mm and 5.492 mm

The images below show examples of four equally spaced membrane diameter measurements.

### Membrane Diameter Coefficient of Variation

Definition: Coefficient of variation (standard deviation / average) of four diameter measurements (mm) taken from the membrane with starting points that are equally spaced around the circumference as defined by Kelso and Rutherford (1996)

Spreadsheet Variable Name: Membrane_CV

Format: Positive continuous variable

Random Forest Predictor Variable: Yes

Additional Information: The training data coefficients of variation of membrane diameters range between 0.001 and 0.52. See the figures included underneath Membrane Diameter Average for examples of how the membrane diameters are measured.

### Membrane Diameter Standard Deviation

Definition: Standard deviation of four diameter measurements (mm) taken from the membrane with starting points that are equally spaced around the circumference as defined by Kelso and Rutherford (1996)

Spreadsheet Variable Name: Membrane_SD

Format: Positive continuous variable

Random Forest Predictor Variable: Yes

Additional Information: The training data standard deviations of membrane diameters range between 0.001 mm and 1.472 mm. See the figures included underneath Membrane Diameter Average for examples of how the membrane diameters are measured.

### Sticky Debris

Definition: Specifies whether there is debris on the egg

Spreadsheet Variable Name: Sticky_Debris

Format: N or Y

Random Forest Predictor Variable: Yes

Examples of eggs with sticky debris. In all pictures, the debris is adhered to the eggs. Some fish have sticky eggs to keep them from drifting. Most of the debris is wood, but the top left shows sand can also stick to eggs.

Examples of eggs without sticky debris. There is no debris or sand on the membranes.


### General Information

Random forests are machine learning models that use an ensemble of classification trees (with categorical response variables) or regression trees (with continuous response variables) to provide predictions. The term random is used because two forms of randomness are introduced when a tree is fit:

1. Each tree in the ensemble is trained using an independent random bootstrap sample from the training data.
2. When a variable is being chosen for a split in a tree, only a randomly selected subset of predictor variables is considered. For example, when the WhoseEgg models were trained, the number of predictor variables considered at a split was equal to the square root of the total number of predictor variables.
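The two sources of randomness above can be sketched in a few lines of Python. This is an illustration rather than the randomForest implementation: the function names are hypothetical, the seed is arbitrary, the predictor names are placeholders, and the square root is rounded down, which matches the randomForest default for classification.

```python
import math
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_predictors(predictors, rng):
    """Pick the random subset of predictors considered at one split."""
    n_try = math.floor(math.sqrt(len(predictors)))
    return rng.sample(predictors, n_try)


rng = random.Random(0)  # arbitrary seed so the sketch is repeatable
predictors = [f"x{i}" for i in range(1, 18)]  # 17 placeholder predictor names
sample = bootstrap_indices(10, rng)
subset = candidate_predictors(predictors, rng)
print(len(sample), len(subset))  # 10 4
```

With 17 predictor variables, the number listed for the WhoseEgg models, four candidates are considered at each split under this rounding.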

Typically, many trees (such as 500) are trained and make up the forest. To get predictions, the random forest obtains a prediction from each tree and either

• computes an average of the tree predictions (for regression problems), or
• computes the proportion of trees that predict each response variable level and determines the level with the highest proportion of “votes” (for classification problems).

The diagram below shows a very simple example of a random forest for classification. The model has four predictor variables and a categorical response variable with three levels (species). The random forest is made up of three trees. The circles in the trees represent the features chosen by the tree, and the rectangles represent the classification at the end of a path. The bold lines represent the paths corresponding to an observation of interest. In a classification example such as this, the random forest returns two quantities:

1. A probability for each response variable level.
   • In the example below, the probability for species 1 is 2/3 since two of the three trees returned a prediction of species 1.
2. A prediction.
   • In the example below, the prediction is species 1 since it is the species with the highest random forest probability.
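The arithmetic in this example can be reproduced with a short Python sketch. The three tree votes are chosen to match the description (two of three trees predicting species 1). Which species the third tree predicts is an assumption, and the function name `forest_vote` is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def forest_vote(tree_predictions):
    """Return (probabilities, prediction) from a list of per-tree predictions."""
    counts = Counter(tree_predictions)
    n_trees = len(tree_predictions)
    # Probability of a level = proportion of trees that voted for it.
    probabilities = {level: Fraction(c, n_trees) for level, c in counts.items()}
    # Prediction = level with the highest proportion of votes.
    prediction = max(probabilities, key=probabilities.get)
    return probabilities, prediction


votes = ["species 1", "species 2", "species 1"]
probabilities, prediction = forest_vote(votes)
print(probabilities["species 1"], prediction)  # 2/3 species 1
```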

For more information on random forests, see the following resource: Cutler et al. (2007)

### Random Forests in WhoseEgg

WhoseEgg uses three random forest models (one for each taxonomic level). The models are similar to the augmented models described in Goode et al. (2021) and based on the models developed in Camacho et al. (2019). The models, code for training the models, and the training data are available on the GitHub repository for WhoseEgg.

Model structures:

• Trained using the randomForest package in R (Liaw 2002)
• All use 1000 trees
• All other tuning parameters are set to randomForest defaults

Response variables of random forest models (all three group Bighead, Grass, and Silver Carp as one category called invasive carp):

• Family
• Genus
• Species

Predictor variables:

• Compact_Diffuse
• Conductivity
• Deflated
• Egg_Stage
• Embryo_Ave
• Embryo_CV
• Embryo_SD
• Embryo_to_Membrane_Ratio
• Julian_Day
• Larval_Length
• Membrane_Ave
• Membrane_CV
• Membrane_SD
• Month
• Pigment
• Sticky_Debris
• Temperature


### What can I do if I am having difficulty uploading data to WhoseEgg?

If you are having difficulty uploading a spreadsheet to WhoseEgg, first go through the following checklist to ensure that all steps have been taken correctly:

• Make sure you are uploading a file that ends in .csv, .xlsx, or .xls.
• Make sure there are no empty rows in the spreadsheet.

If all of the above are met, try one of these suggestions below:

• Try a different file format to see if it fixes the problem (e.g. switch from .csv to .xlsx).
• Try a different browser.

### How can I proceed if I have data from a different region than the training data?

The random forests used by WhoseEgg to make predictions have been validated for eggs collected at the locations shown in the figure below. In particular, the data collected in 2014 and 2015 were used to train the random forests, and the data collected in 2016 were used to validate the models. The models trained using data from 2014 and 2015 performed well on the data from 2016 in locations that were sampled in 2014 and/or 2015 but were less successful in locations not previously sampled. However, the sample size in 2016 was smaller than the original dataset. See Goode et al. (2021) for more details. These results suggest that the models in WhoseEgg may not perform well on data collected in different geographic regions and that additional validation is needed. Note that the final models used in WhoseEgg were trained using all three years of data (2014-2016) to improve the performance for future predictions.

If there is interest in using WhoseEgg to make predictions on data collected in different geographic regions, we recommend the following as possible options:

• Be cautious interpreting the predictions from WhoseEgg if the models are applied to data collected in different geographic regions, especially if the regions have a different fish species composition.

• Compare the predictor variable values to those used to train the random forests in WhoseEgg. If your predictor variable values differ from the WhoseEgg data (especially values outside the range of the variable values), the models will have to extrapolate to make predictions. This often leads to untrustworthy predictions.

• Perform your own validation of the random forests by applying WhoseEgg to eggs that have been genetically identified. Compare the predictions from WhoseEgg to the genetic identifications to determine if the WhoseEgg predictions are reasonably trustworthy for the new region. See Goode et al. (2021) for an example model validation.

• If familiar with R, try updating the WhoseEgg models by training your own random forests based on the code and data available at the WhoseEgg GitHub repository. Add your data to the WhoseEgg training data and train new random forest models.

• If you try validating or updating the WhoseEgg models, the creators of WhoseEgg would be interested to hear about your results. Let us know by emailing whoseegg@iastate.edu.
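The range comparison suggested above can be done with a small script before uploading data. The Python sketch below uses the training ranges reported on the help page for four of the continuous predictors. The function name `outside_training_range` and the example egg are hypothetical, and the remaining predictors would be added in the same way.

```python
# Training-data ranges (minimum, maximum) as reported on the help page.
TRAINING_RANGES = {
    "Conductivity": (274, 781),      # microsiemens per cm
    "Temperature": (11, 30.7),       # degrees Celsius
    "Julian_Day": (113, 243),
    "Membrane_Ave": (0.728, 5.492),  # mm
}


def outside_training_range(egg):
    """List the variables whose values fall outside the training ranges."""
    flagged = []
    for name, (low, high) in TRAINING_RANGES.items():
        value = egg.get(name)
        if value is not None and not (low <= value <= high):
            flagged.append(name)
    return flagged


# Hypothetical egg collected in colder, less conductive water.
egg = {"Conductivity": 150, "Temperature": 9.5, "Julian_Day": 130, "Membrane_Ave": 3.2}
print(outside_training_range(egg))  # ['Conductivity', 'Temperature']
```

Any flagged variable means the random forests would be extrapolating for that egg, so its prediction deserves extra scrutiny.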

### What can I do if I believe I collected fish eggs containing species not included in the training data?

Random forests are only able to make predictions for response variable levels included in the training data. See the table below for a list of the family, genus, and species levels included in the WhoseEgg training data. If you believe that your data contains a level not present in the training data, we caution against using WhoseEgg. If you would still like to apply WhoseEgg to your data, we recommend the following as possible options:

• The random forests will classify observations based on predictor variable similarity to those in the training data. Think about whether the different species that may be present in your data are similar to any species in the training data. Check to see if these species appear in the predictions made by WhoseEgg.

• Determine if the species have similar egg characteristics to invasive carp. If they are different from invasive carp, then it may be okay to proceed using WhoseEgg if your main objective is to identify invasive carp.

| Family | Genus | Common Name | Number of Eggs in Training Data |
|---|---|---|---|
| Catostomidae | Carpiodes | Carpsuckers sp. | 1 |
| Catostomidae | Carpiodes | Quillback | 1 |
| Catostomidae | Carpiodes | River Carpsucker | 8 |
| Catostomidae | Ictiobus | Bigmouth Buffalo | 7 |
| Catostomidae | Ictiobus | Black Buffalo | 1 |
| Catostomidae | Ictiobus | Buffalo sp. | 10 |
| Catostomidae | Ictiobus | Smallmouth Buffalo | 2 |
| Cyprinidae | Cyprinella | Spotfin Shiner | 6 |
| Cyprinidae | Luxilus | Common Shiner | 1 |
| Cyprinidae | Macrhybopsis | Silver Chub | 36 |
| Cyprinidae | Macrhybopsis | Speckled Chub | 28 |
| Cyprinidae | Notropis | Channel Shiner | 32 |
| Cyprinidae | Notropis | Emerald Shiner | 201 |
| Cyprinidae | Notropis | River Shiner | 16 |
| Cyprinidae | Notropis | Sand Shiner | 1 |
| Cyprinidae | Notropis | Shiner sp. | 69 |
| Hiodontidae | Hiodon | Goldeye | 7 |
| Invasive Carp | Invasive Carp | Invasive Carp | 782 |
| Moronidae | Morone | Striped Bass | 17 |
| Moronidae | Morone | White Bass | 1 |
| Percidae | Etheostoma | Banded Darter | 1 |
| Percidae | Percina | Common Logperch | 1 |
| Percidae | Sander | Walleye | 2 |
| Sciaenidae | Aplodinotus | Freshwater Drum | 733 |

### Will WhoseEgg be updated to contain data from different geographic regions and with more species?

The creators of WhoseEgg are interested in updating the models to contain data from different geographic regions and with more species, but there are no plans to do so at this time.

### What can I do if I am interested in using WhoseEgg to predict fish species other than invasive carp?

The validation of the random forests used by WhoseEgg focused on the classification of invasive carp. If you would like to use WhoseEgg to identify other fish species, please take into account the following considerations:

### Why don’t my extra variables show up in the processed data tab?

While it is okay to upload extra variables to WhoseEgg, these variables will not be used by the random forests to make predictions. As a result, they are excluded from the processed data tab, which only contains the variables that will be used to make predictions. However, these variables will be included in the spreadsheet with predictions available for download. See the preview of the table with data for download on the ‘Downloads’ page.