% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/modelMissLinSub.R
\name{modelMissLinSub}
\alias{modelMissLinSub}
\title{Subsampling under linear regression for a potentially misspecified model}
\usage{
modelMissLinSub(r0,rf,Y,X,N,Alpha,proportion,model="Auto")
}
\arguments{
\item{r0}{sample size for initial random sample}

\item{rf}{final sample size, including the initial (r0) and optimal (r) samples}

\item{Y}{response data or Y}

\item{X}{covariate data or X matrix that has all the covariates (first column is for the intercept)}

\item{N}{size of the big data}

\item{Alpha}{scaling factor when using Log Odds or Power functions to magnify the probabilities}

\item{proportion}{proportion of the big data used to help estimate AMSE values from the subsamples}

\item{model}{formula for the model used in the GAM or the default choice}
}
\value{
The output of \code{modelMissLinSub} gives a list of

\code{Beta_Estimates} estimated model parameters after subsampling

\code{Variance_Epsilon_Estimates} matrix of estimated variance for epsilon after subsampling

\code{Utility_Estimates} estimated A-, L- and L1- optimality values for the obtained subsamples

\code{AMSE_Estimates} matrix of estimated AMSE values after subsampling

\code{Sample_A-Optimality} list of indexes for the initial and optimal samples obtained based on A-Optimality criteria

\code{Sample_L-Optimality} list of indexes for the initial and optimal samples obtained based on L-Optimality criteria

\code{Sample_L1-Optimality} list of indexes for the initial and optimal samples obtained based on L1-Optimality criteria

\code{Sample_RLmAMSE} list of indexes for the optimal samples obtained based on RLmAMSE

\code{Sample_RLmAMSE_Log_Odds} list of indexes for the optimal samples obtained based on RLmAMSE with Log Odds function

\code{Sample_RLmAMSE_Power} list of indexes for the optimal samples obtained based on RLmAMSE with Power function

\code{Subsampling_Probability} matrix of calculated subsampling probabilities
}
\description{
This function draws subsamples from big data under linear regression for a potentially misspecified model.
Subsampling probabilities are obtained based on the A-, L- and L1- optimality criteria
with the RLmAMSE (Reduction of Loss by minimizing the Average Mean Squared Error).
}
\details{
\strong{The article for this function is in preparation for publication. Please be patient.}

Two-stage subsampling algorithm for big data under linear regression with potential model misspecification.

The first stage obtains a random sample of size \eqn{r_0} and estimates the model parameters.
Using the estimated parameters, subsampling probabilities are evaluated for the A-, L- and L1-optimality criteria,
RLmAMSE and enhanced RLmAMSE (log-odds and power) subsampling methods.

Using the estimated subsampling probabilities, a sample of size \eqn{r \ge r_0} is obtained.
Finally, the two samples are combined and the model parameters are estimated for A-, L-, L1-optimality,
RLmAMSE and enhanced RLmAMSE (log-odds and power).
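
The two stages can be illustrated with a minimal base-R sketch. It is not the
implementation used by \code{modelMissLinSub}: the simulated data, the
L-optimality-style probabilities (absolute residual times the norm of the covariate vector),
the sampling with replacement and the inverse-probability weights are all assumptions made for illustration.

\preformatted{
set.seed(1)
N <- 1000; r0 <- 50; r <- 100

# Simulated big data (assumption): intercept plus two uniform covariates
X <- cbind(1, matrix(stats::runif(2 * N, -1, 1), ncol = 2))
Y <- drop(crossprod(t(X), c(-1, 0.75, 0.75))) + stats::rnorm(N, sd = 0.7)

# Stage 1: uniform random sample of size r0 and a pilot estimate
idx0  <- sample(N, r0)
beta0 <- stats::lm.fit(X[idx0, ], Y[idx0])$coefficients

# Assumed L-optimality-style probabilities, normalised to sum to one
res <- Y - drop(crossprod(t(X), beta0))
p   <- abs(res) * sqrt(rowSums(X^2))
p   <- p / sum(p)

# Stage 2: draw r points with replacement using p
idx1 <- sample(N, r, replace = TRUE, prob = p)

# Combine both samples and weight each point by the inverse of its
# sampling probability (1/N for the uniform first stage)
idx  <- c(idx0, idx1)
w    <- 1 / c(rep(1 / N, r0), p[idx1])
beta <- stats::lm.wfit(X[idx, ], Y[idx], w)$coefficients
}

Here \code{beta} holds the weighted least squares estimate from the combined sample of size \code{r0 + r}.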

\strong{NOTE}: If the input parameters do not satisfy the domain conditions below,
an error message is produced and the function stops.

If \eqn{r \ge r_0} is not satisfied then an error message will be produced.

If the big data \eqn{X,Y} has any missing values then an error message will be produced.

The big data size \eqn{N} is compared with the sizes of \eqn{X} and \eqn{Y}, and
if they are not aligned an error message will be produced.

If the scaling factor does not satisfy \eqn{\alpha > 1}, an error message will be produced.

If \code{proportion} is not in the interval \eqn{(0,1]}, an error message will be produced.
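
The numeric checks listed above could be sketched as follows. The helper name
\code{check_inputs} and the error messages are hypothetical, and the optimal sample size is
taken to be \code{rf - r0}, following the description of \code{rf}; the validation actually
performed inside \code{modelMissLinSub} may differ.

\preformatted{
# Hypothetical helper, for illustration only
check_inputs <- function(r0, rf, Y, X, N, Alpha, proportion) {
  if (anyNA(X) || anyNA(Y)) stop("X or Y has missing values")
  if (N != nrow(X) || N != NROW(Y)) stop("N does not match the sizes of X and Y")
  if (any(rf - r0 < r0)) stop("r >= r0 is not satisfied")
  if (Alpha <= 1) stop("Alpha must be greater than 1")
  if (proportion <= 0 || proportion > 1) stop("proportion must be in (0, 1]")
  invisible(TRUE)
}
}

For instance, calling \code{check_inputs} with \code{Alpha = 1} stops with an error, while
inputs such as \code{r0 = 40}, \code{rf = c(80, 120)}, \code{Alpha = 10} and
\code{proportion = 0.5} pass when \code{X} and \code{Y} are complete and have \code{N} rows.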

\code{model} is a formula input formed from the covariates through spline terms (\code{s()}),
squared terms (\code{I()}), interaction terms (\code{lo()}) or automatically. If \code{model} is empty,
\code{NA}, \code{NaN} or not one of the defined inputs, an error message is printed. By default,
\code{model="Auto"}, which is the main effects model with the spline terms.
}
\examples{
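# True parameters and settings for the simulated big data
Beta <- c(-1, 0.75, 0.75, 1); Var_Epsilon <- 0.5
family <- "linear"; N <- 500
X_1 <- replicate(2, stats::runif(n = N, min = -1, max = 1))

# Misspecification term: standardised row-wise product of the two covariates
Temp <- Rfast::rowprods(X_1)
Misspecification <- (Temp - mean(Temp)) / sqrt(mean(Temp^2) - mean(Temp)^2)
X_Data <- cbind(X0 = 1, X_1)

Full_Data <- GenModelMissGLMdata(N, X_Data, Misspecification, Beta, Var_Epsilon, family)

# Initial sample size and 50 final sample sizes alternating between 80 and 120
r0 <- 40; rf <- rep(10 * c(8, 12), 25)

# Drop the last column of the complete data before subsampling
Original_Data <- Full_Data$Complete_Data[, -ncol(Full_Data$Complete_Data)]

# The first column is the response, the remaining columns are the covariates
Results <- modelMissLinSub(r0 = r0, rf = rf,
                           Y = as.matrix(Original_Data[, 1]),
                           X = as.matrix(Original_Data[, -1]),
                           N = N, Alpha = 10, proportion = 0.5)

plot_Beta(Results)
plot_AMSE(Results)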


}
\references{
\insertRef{adewale2009robust}{NeEDS4BigData}

\insertRef{adewale2010robust}{NeEDS4BigData}

\insertRef{Amalan2025Misspecification}{NeEDS4BigData}
}
