
<!-- README.md is generated from README.Rmd. Please edit that file -->

# ORscraper: An R Package for extracting data from Oncomine Reporter’s clinical reports

## Overview

ORscraper is an R package designed to extract relevant medical
information from clinical reports generated by the Oncomine Reporter
software. This package is intended for healthcare professionals and
researchers working with genetic data who need to automate the
extraction and processing of information from report files. ORscraper
provides tools to identify biopsies, extract genetic variants and
pathogenicity classifications, filter relevant data, and query databases
such as NCBI ClinVar.

## Installation

Install the released version of ORscraper from CRAN:

``` r
install.packages("ORscraper")
```

Alternatively, you can install ORscraper from GitHub using the following R code:

``` r
# Install devtools if not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
}

# Install ORscraper from GitHub
devtools::install_github("SamuelGonzalez0204/ORscraper")
```

## Basic Usage

Below is a basic example of how to use ORscraper to extract information
from PDF files:

``` r
library(ORscraper)

# Read content from a PDF file
example_pdf <- system.file("extdata", "100.1-example.pdf", package = "ORscraper")
lines <- read_pdf_content(example_pdf)

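# Read the gene list from the bundled Excel file
# (read_excel() comes from the readxl package, so it is called with its namespace)
genesFile <- system.file("extdata", "Genes.xlsx", package = "ORscraper")
genes <- readxl::read_excel(genesFile)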
mutations <- unique(genes$GEN)

# Extract mutation values from the extracted text
genes_mut <- c()
pathogenicities <- c()
tableValues <- extract_values_from_tables(lines, mutations)
genes_mut <- c(genes_mut, tableValues[1])
pathogenicities <- c(pathogenicities, tableValues[2])

# Filter only pathogenic mutations
pathogenic_mutations <- filter_pathogenic_only(pathogenicities, genes_mut)

print(pathogenic_mutations)
```
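
The filtering step at the end of the example can be pictured with plain
base R. The sketch below is not ORscraper’s implementation: it assumes,
purely for illustration, that gene names and pathogenicity labels are
parallel character vectors. `keep_pathogenic()` is a hypothetical helper
and the data are made up.

``` r
# Hypothetical helper, not part of ORscraper: keep the genes whose label
# is exactly "Pathogenic" (assumes two parallel character vectors).
keep_pathogenic <- function(genes, labels) {
  stopifnot(length(genes) == length(labels))
  # Treat missing labels as "not pathogenic" instead of propagating NA
  genes[!is.na(labels) & labels == "Pathogenic"]
}

# Made-up example data, for illustration only
example_genes  <- c("KRAS", "TP53", "EGFR")
example_labels <- c("Pathogenic", "Benign", "Pathogenic")

kept <- keep_pathogenic(example_genes, example_labels)
print(kept)
```

The value returned by `filter_pathogenic_only()` may be structured
differently; check the function’s help page for the actual format.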

## Main Functions

The ORscraper package includes several key functions:

- `classify_biopsy()`: Analyzes biopsy identifiers and categorizes them
  based on predefined rules.

- `extract_chip_id()`: Extracts chip values from filenames matching
  specific patterns.

- `extract_fusions()`: Identifies and extracts fusion variants from text
  lines.

- `extract_intermediate_values()`: Searches for a specific text pattern
  and extracts consecutive values.

- `extract_values_from_tables()`: Extracts information such as
  mutations, pathogenicity, and frequencies from tables in reports.

- `extract_values_start_end()`: Extracts values based on start and end
  markers.

- `filter_pathogenic_only()`: Filters mutations, retaining only those
  marked as “Pathogenic.”

- `read_pdf_content()`: Extracts the content of a PDF and splits it into
  individual lines.

- `read_pdf_files()`: Scans a directory and retrieves all PDF files.

- `search_ncbi_clinvar()`: Queries the NCBI ClinVar database for
  germline classifications.
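
To make the idea of marker-based extraction more concrete, here is a
minimal base R sketch of pulling out the lines that sit between a start
and an end marker. It does not reproduce `extract_values_start_end()`:
the helper name `lines_between()`, its arguments, and the sample lines
are all assumptions made for illustration.

``` r
# Hypothetical helper, not part of ORscraper: return the lines strictly
# between the first line containing `start_marker` and the next line
# containing `end_marker`. Markers are matched as literal text.
lines_between <- function(lines, start_marker, end_marker) {
  start_idx <- grep(start_marker, lines, fixed = TRUE)
  if (length(start_idx) == 0) return(character(0))
  first <- start_idx[1]

  end_idx <- grep(end_marker, lines, fixed = TRUE)
  end_idx <- end_idx[end_idx > first]
  if (length(end_idx) == 0) return(character(0))
  last <- end_idx[1]

  # Nothing to return when the markers are on adjacent lines
  if (last - first < 2) return(character(0))
  lines[(first + 1):(last - 1)]
}

# Made-up report lines, for illustration only
report_lines <- c("Header", "BEGIN", "value A", "value B", "END", "Footer")
between <- lines_between(report_lines, "BEGIN", "END")
print(between)
```

Returning an empty character vector when a marker is missing keeps the
helper safe to call on reports that lack the section in question.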
