It can be useful to have a data set with a known distribution for testing modeling approaches. It’s also useful to be able to clearly conceptualize that data set. spect can generate synthetic time-to-event data for this purpose without relying on a potentially unknown external data set.

Creating synthetic data

The create_synthetic_data() function will produce a single, relational data set where each row represents a fictional subscriber to a theoretical streaming service. spect can be used to model the time to the cancellation of the service. If no parameters are passed, then all defaults are invoked. The resulting data set contains two modeling variables:

incomes - the average household income of the subscriber
watchtimes - the average number of weekly hours the subscriber used the service in the prior month

It also contains the following columns:

total_months - the earlier of the time to cancellation and the end of the study. If the study ends first, the observation is censored (the event did not occur during the study).
cancel_event_detected - an indicator variable. 0 means that the event did not occur (the observation is censored). 1 means that the event (cancellation) was observed.
baseline_time_to_cancel - given by a simple but non-linear formula: B = 26 + W^2 - (I / 10000), where W is watchtimes and I is incomes. This can be thought of as the “ground truth” for the cancellation event time.
perturbed_baseline - baseline_time_to_cancel with the random noise controlled by perturbation_shift applied, if passed. When no shift is passed, the two columns are equal.


set.seed(42)

data <- create_synthetic_data()
#> INFO [2025-04-06 20:25:29] Creating 250 income samples from normal distribution of median 50000, variance 10000 
#>             and watchtimes samples from uniform distribution with min: 0 and max: 6
head(data)
#>    incomes watchtimes total_months cancel_event_detected
#> 1 63709.58  0.8190312     20.29985                     1
#> 2 44353.02  1.0628185     22.69428                     1
#> 3 53631.28  3.1173627     30.35482                     1
#> 4 56328.63  4.8667247     44.05215                     1
#> 5 54042.68  0.6921721     21.07483                     1
#> 6 48938.75  5.3605307     48.00000                     0
#>   baseline_time_to_cancel perturbed_baseline
#> 1                20.29985           20.29985
#> 2                22.69428           22.69428
#> 3                30.35482           30.35482
#> 4                44.05215           44.05215
#> 5                21.07483           21.07483
#> 6                49.84141           49.84141
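
As a quick sanity check, the baseline formula can be recomputed by hand from the rows printed above. The sketch below uses base R only and copies the six displayed rows into a small data frame. The names rows, recomputed, and max_abs_diff are illustrative and not part of spect.

```r
# First six rows copied from the head(data) output above.
rows <- data.frame(
  incomes    = c(63709.58, 44353.02, 53631.28, 56328.63, 54042.68, 48938.75),
  watchtimes = c(0.8190312, 1.0628185, 3.1173627, 4.8667247, 0.6921721, 5.3605307),
  baseline_time_to_cancel = c(20.29985, 22.69428, 30.35482,
                              44.05215, 21.07483, 49.84141)
)

# B = 26 + W^2 - (I / 10000)
rows$recomputed <- 26 + rows$watchtimes^2 - rows$incomes / 10000

# Any remaining difference comes from the rounding of the printed values.
max_abs_diff <- max(abs(rows$recomputed - rows$baseline_time_to_cancel))
print(max_abs_diff)
```

In the printed rows, row 6 has a baseline of about 49.84 months but a total_months of 48 with cancel_event_detected equal to 0, which is what a censored observation looks like in this data set.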

Modifying the distribution

Since a distribution that matches a formula exactly may not be adequate for testing a model, some optional parameters are provided to perturb the cancellation event distribution in a structured way. In particular, the user can specify the minimum, median, and variance of the income distribution and the minimum and maximum watchtimes.

Additionally, it’s possible to set a censorship percentage within a given minimum and maximum amount, and to adjust the length of the study (i.e. the maximum total months). Finally, the perturbation_shift argument adds some random noise to the total_months column of the data, which helps to prevent instant overfitting.


data <- create_synthetic_data(sample_size = 500
                      , minimum_income = 10000
                      , median_income = 40000
                      , income_variance = 10000
                      , min_watchhours = 2
                      , max_watchhours = 10
                      , censor_percentage = .2
                      , min_censor_amount = 3
                      , max_censor_amount = 3
                      , study_time_in_months = 60
                      , perturbation_shift = 5
                      )
#> INFO [2025-04-06 20:25:29] Creating 500 income samples from normal distribution of median 40000, variance 10000 
#>             and watchtimes samples from uniform distribution with min: 2 and max: 10

head(data)
#>    incomes watchtimes total_months cancel_event_detected
#> 1 50291.41   9.919725     60.00000                     0
#> 2 49147.75   5.507949     55.75277                     1
#> 3 39975.44   7.599226     60.00000                     0
#> 4 41360.10   9.112616     60.00000                     0
#> 5 32798.46   8.673276     60.00000                     0
#> 6 38018.76   7.875372     60.00000                     0
#>   baseline_time_to_cancel perturbed_baseline
#> 1               119.37180          114.97561
#> 2                51.42273           55.75277
#> 3                79.75069           78.24010
#> 4               104.90375          104.02173
#> 5                97.94587          102.55733
#> 6                84.21960           84.53033
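
To see how the columns relate in this second example, the printed rows can be checked with base R. This sketch reflects only the six rows shown above. It assumes that total_months is capped at study_time_in_months and that the noise stays within perturbation_shift. The source does not state the exact noise distribution, and the random censoring controlled by censor_percentage may change other rows. The names rows, study_time, and shift are illustrative.

```r
# Six rows copied from the head(data) output above.
rows <- data.frame(
  total_months            = c(60, 55.75277, 60, 60, 60, 60),
  cancel_event_detected   = c(0, 1, 0, 0, 0, 0),
  baseline_time_to_cancel = c(119.37180, 51.42273, 79.75069,
                              104.90375, 97.94587, 84.21960),
  perturbed_baseline      = c(114.97561, 55.75277, 78.24010,
                              104.02173, 102.55733, 84.53033)
)
study_time <- 60  # study_time_in_months used in the call above
shift <- 5        # perturbation_shift used in the call above

# Assumption: the observed time is the perturbed time, capped at the study end.
expected_total <- pmin(rows$perturbed_baseline, study_time)
# Assumption: the event is observed only if it happens before the study ends.
expected_event <- as.numeric(rows$perturbed_baseline < study_time)
# Size of the noise added to each baseline value.
shift_sizes <- abs(rows$perturbed_baseline - rows$baseline_time_to_cancel)

matches_total <- isTRUE(all.equal(rows$total_months, expected_total))
matches_event <- all(rows$cancel_event_detected == expected_event)
within_shift  <- all(shift_sizes <= shift)
print(c(matches_total, matches_event, within_shift))
```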

It may also be useful to visualize the data distributions. plot_synthetic_data() handles this straightforwardly. Here, it’s easy to see the impact of the perturbation and censorship by comparing the “cancel_months” graph to the “final_cancel_months” graph. Also, note that incomes are roughly normally distributed, while watchtimes are roughly uniformly distributed.

plot_synthetic_data(data)
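
For readers who want to look at the two input distributions without spect, the following base R sketch draws samples in the way the default log message describes and plots their histograms. It is an independent illustration, not the package's plotting code. One assumption is worth flagging: the value reported as "variance 10000" is passed here as the standard deviation, because that is what reproduces the incomes printed in the first example under set.seed(42). The watchtimes drawn here are not claimed to match the package's values.

```r
set.seed(42)

n <- 250
# Assumption: the "variance" of 10000 in the log message acts as a standard deviation.
incomes <- rnorm(n, mean = 50000, sd = 10000)
# Uniform watch hours between the default min (0) and max (6) from the log message.
watchtimes <- runif(n, min = 0, max = 6)

# The first six simulated incomes, rounded as in the printed output.
first_incomes <- round(incomes[1:6], 2)
print(first_incomes)

# Side-by-side histograms: roughly bell-shaped incomes, roughly flat watchtimes.
old_par <- par(mfrow = c(1, 2))
hist(incomes, main = "incomes", xlab = "household income")
hist(watchtimes, main = "watchtimes", xlab = "weekly hours")
par(old_par)
```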