1 - Data Preparation

Marina Papadopoulou

1.1 Input data - trackdf

The swaRmverse package uses the trackdf package to standardize the input dataset. Data are expected to be trajectories (id, x, y, t) generated by GPS or video tracking. First, let’s load some example data shipped with trackdf:

library(swaRmverse)

raw <- read.csv(system.file("extdata/video/01.csv", package = "trackdf"))
raw <- raw[!raw$ignore, ]
head(raw)
##   id         x        y size frame track ignore track_fixed
## 1  1  629.3839 882.4783 1154     1     1  FALSE           1
## 2  2 1056.1692 656.5207 1064     1     2  FALSE           2
## 3  3  508.0092 375.2451 1624     1     3  FALSE           3
## 4  4 1277.6466 373.7491 1443     1     4  FALSE           4
## 5  5 1379.2844 343.0853 1431     1     5  FALSE           5
## 6  6 1137.1378 174.5110 1321     1     6  FALSE           6

1.2 Transform data

trackdf takes as input a vector for each positional time series (x, y), along with a vector of ids and a vector of times. Time is converted to the date-time POSIXct format. Without additional information, the package uses UTC as the time zone, the current time as the origin of the experiment, and 1 second as the sampling step (the time between observations). If your t column already holds real time (rather than frame numbers or sampling steps, e.g., c(1, 2, 3, 4)), the period does not have to be specified. For more details, see https://swarm-lab.github.io/trackdf/index.html. For now, let’s specify these attributes and create our main dataset (as a data frame):

data_df <- set_data_format(raw_x = raw$x,
                          raw_y = raw$y,
                          raw_t = raw$frame,
                          raw_id = raw$track_fixed,
                          origin = "2020-02-1 12:00:21",
                          period = "0.04S",
                          tz = "America/New_York"
                          )

head(data_df)
## Track table [6 observations]
## Number of tracks:  6 
## Dimensions:  2D 
## Geographic:  FALSE 
## Table class:  data frame
##   id                   t         x        y        set
## 1  1 2020-02-01 12:00:21  629.3839 882.4783 2020-02-01
## 2  2 2020-02-01 12:00:21 1056.1692 656.5207 2020-02-01
## 3  3 2020-02-01 12:00:21  508.0092 375.2451 2020-02-01
## 4  4 2020-02-01 12:00:21 1277.6466 373.7491 2020-02-01
## 5  5 2020-02-01 12:00:21 1379.2844 343.0853 2020-02-01
## 6  6 2020-02-01 12:00:21 1137.1378 174.5110 2020-02-01

Note that a ‘set’ column has been added to the dataset. swaRmverse uses this column as the main unit for grouping the tracks into separate events. By default, the day of data collection is used.
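To make the conversion concrete, here is a minimal base-R sketch of what happens to a frame-based t column and how a day-based set label could be derived from it. It is an illustration only, not the code that trackdf or swaRmverse run internally: the toy vector `frames`, the variable names, and the assumption that each timestamp equals the origin plus the frame number times the sampling step (rather than counting from frame 0) are ours.

```r
# Toy frame numbers (hypothetical); frames 1 and 26 are 25 steps (1 s) apart.
frames <- c(1, 2, 3, 26)

# Same origin, time zone and sampling step as in the call above.
origin <- as.POSIXct("2020-02-01 12:00:21", tz = "America/New_York")
step_s <- 0.04  # seconds between consecutive frames

# Assumed conversion: origin + frame number * sampling step.
times <- origin + frames * step_s

# Default grouping unit: the calendar day of each observation,
# evaluated in the time zone attached to the timestamps.
set <- format(times, "%Y-%m-%d")

toy <- data.frame(frame = frames, t = times, set = set)
print(toy)
```

Because the day is read in the time zone of the data, recordings made close to midnight can end up in different sets depending on the tz argument, so it is worth setting it explicitly.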

1.3 Multi-species or multi-context data

As mentioned above, swaRmverse uses the date as the default unit of data organization. However, if several separate observations are conducted on the same day, or an additional label such as context or species is needed, extra information can be given to the function. For instance, let’s assume that the dataset contains data from 2 different contexts:

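# dummy column: first half of the rows is "ctx1", the rest "ctx2"
# (computing the second count from the first also works for an odd number of rows)
half <- floor(nrow(raw) / 2)
raw$context <- rep(c("ctx1", "ctx2"), times = c(half, nrow(raw) - half))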

We can pass any additional vector to the function, and it will be combined with the date to form the set label:

data_df <- set_data_format(raw_x = raw$x,
                          raw_y = raw$y,
                          raw_t = raw$frame,
                          raw_id = raw$track_fixed,
                          origin = "2020-02-1 12:00:21",
                          period = "0.04 seconds",
                          tz = "America/New_York",
                          raw_context = raw$context
                          )

head(data_df)
## Track table [6 observations]
## Number of tracks:  6 
## Dimensions:  2D 
## Geographic:  FALSE 
## Table class:  data frame
##   id                   t         x        y             set
## 1  1 2020-02-01 12:00:21  629.3839 882.4783 2020-02-01_ctx1
## 2  2 2020-02-01 12:00:21 1056.1692 656.5207 2020-02-01_ctx1
## 3  3 2020-02-01 12:00:21  508.0092 375.2451 2020-02-01_ctx1
## 4  4 2020-02-01 12:00:21 1277.6466 373.7491 2020-02-01_ctx1
## 5  5 2020-02-01 12:00:21 1379.2844 343.0853 2020-02-01_ctx1
## 6  6 2020-02-01 12:00:21 1137.1378 174.5110 2020-02-01_ctx1
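The combined label in the output above (date, underscore, context) can be mimicked with a toy example. This is only a sketch of the resulting grouping, assuming the two parts are simply pasted together; the vectors `day` and `context` are made up for illustration, and the actual implementation inside set_data_format may differ.

```r
# Hypothetical day and context of four observations.
day <- c("2020-02-01", "2020-02-01", "2020-02-01", "2020-02-02")
context <- c("ctx1", "ctx1", "ctx2", "ctx2")

# Combine both labels into one grouping key, as in the 'set' column above.
set <- paste(day, context, sep = "_")

# Number of observations that fall into each set.
print(table(set))

# Row indices of the observations belonging to each set.
groups <- split(seq_along(set), set)
print(names(groups))
```

Each distinct combination of day and context becomes its own event, so observations from the same day but different contexts are grouped separately.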

With this dataset, we can move on to analyzing the collective motion in the data.