NEWS
====

Versioning
----------

Releases will be numbered with the following semantic versioning format:

`<major>.<minor>.<patch>`

And constructed with the following guidelines:

* Breaking backward compatibility bumps the major (and resets the minor and patch)
* New additions without breaking backward compatibility bump the minor (and reset the patch)
* Bug fixes and misc changes bump the patch

textshape 1.7.2 - 1.7.3
----------------------------------------------------------------

BUG FIXES

* `split_token` would keep quote apostrophes with the leading/ending words of the quote. These are now stripped off as individual tokens. Apostrophes are still preserved inside of words (e.g., possessives and contractions).

textshape 1.6.1 - 1.7.1
----------------------------------------------------------------

BUG FIXES

* `as_list` contained a bug because it relies on `apply`, which can simplify lists to a data.frame when the lists contain equal numbers of each element.
* `split_run` relies on **stringi**, which had a change in API that caused the strings to no longer split correctly (a space was added at the end). This has been fixed.
* `split_token` would keep quote apostrophes with the leading/ending words of the quote. These are now stripped off as individual tokens. Apostrophes are still preserved inside of words (e.g., possessives and contractions).

NEW FEATURES

* `grab_index` added to grab from a start up to an ending index.
* `grab_match` added to grab from a start up to an ending regex match.

IMPROVEMENTS

* `split_transcript` now has more accurate regex group capturing.

textshape 1.5.1 - 1.6.0
----------------------------------------------------------------

BUG FIXES

* `split_match_regex_to_transcript` gave warnings about `perl = TRUE` that were due to `gsub` with a `fixed = TRUE` clashing with the `perl = TRUE`.

NEW FEATURES

* `flatten` added for flattening nested, named lists into single tiered lists using the concatenated list/atomic vector names as the names of the single tiered list.
* `unnest_text` added to locate and un-nest nested text columns in a data.frame.

IMPROVEMENTS

* `tidy_dtm`/`tidy_tdm` did not order unnamed matrices as expected (e.g., `{1, 2, ..., 1}` was ordered as `{1, 10, 2, ...}`). This has been corrected.

textshape 1.4.0 - 1.5.0
----------------------------------------------------------------

BUG FIXES

* `split_sentence`, `split_word`, `split_token`, & `split_speaker` did not handle single row data.frames properly, resulting in loss of data. This has been fixed.

NEW FEATURES

* `split_sentence_token` added as a shortcut to split into sentences, add a sentence index, and then split into tokens and add a token index.
* `tidy_matrix` and `tidy_adjacency_matrix` added to provide easy tidy representations of these data types.
* `cluster_matrix` added for reordering the columns/rows of matrices via hierarchical clustering.

IMPROVEMENTS

* `split_sentence` now handles digit(s) + inch (in.) abbreviation if not followed by a capital letter. Previously, this was split on. Additionally, post script (p.s.) is no longer split on.

textshape 1.1.0 - 1.3.0
----------------------------------------------------------------

BUG FIXES

* `tidy_list` did not add `content.attribute.name` for lists of named vectors.

MINOR FEATURES

* `split_match_regex` added as a version of `split_match` with `regex = TRUE` by default. This makes it easier to reason about what the function call is doing.
* `split_match_regex_to_transcript` added to directly split a text by a person regex and convert to a two column transcript of person and dialogue.

IMPROVEMENTS

* `tidy_list` now uses **data.table**'s `rbind` for lists of `data.frame`s. This means column ordering does not need to match and missing columns are automatically filled with `NA`s.
* `split_sentence` has better handling for the 'No.' abbreviation that distinguishes between 'No.' followed by digits (assumed to be an abbreviation) and when no digits follow (assumed to be a complete sentence).
* `split_sentence` has better handling for quoted material (i.e., a punctuation mark followed by single or double quotes that is not followed by a comma).
* `split_sentence` has better handling for single and double middle names presented as initials.
* `split_sentence` has better handling for abbreviated English units of measure.

CHANGES

* `combine.default` included element names by default. This has been removed to include only the elements.

textshape 1.0.2
----------------------------------------------------------------

BUG FIXES

* `tidy_list` with a list of unnamed `data.frame`s resulted in an error (see issue #7). This issue has been fixed.
* `split_word.data.frame` and `split_token.data.frame` both used an incorrect column naming of `sentence_id` for word and token index respectively. These columns are now renamed to `word_id` and `token_id` respectively.
* `split_token` gets a more robust splitting algorithm.

NEW FEATURES

* `column_to_rownames` added to enable one to quickly add a column as rownames within a pipeline. This is useful when turning a `data.frame` into a `matrix`.
* `tidy_list` picks up the ability to tidy a list of named vectors into three columns.

CHANGES

* `as.tibble` removed from all function arguments. This was a nice interactive feature that made programming very difficult to reason about. Having an environment dependent output would result in no adoption of the **textshape** package as a dependency. Additionally, `set_output` and `tibble_output`, two complementary functions, have been removed without being deprecated. The problem was so egregious and the package young enough that removal without deprecation was warranted.

textshape 1.0.1
----------------------------------------------------------------

NEW FEATURES

* Users can now globally select a **tibble** output rather than a **data.table** output for all functions that outputted a **data.table**. This can be set globally via `set_output`.
If the user does not set the output type, **textshape** tries to infer it based on whether or not the user has **dplyr** loaded. If **dplyr** is loaded then **tibble** is the default output.
* `set_output` and `tibble_output` added to globally set the output type (**tibble** or **data.table**) and to check/infer the desired output type.

textshape 1.0.0
----------------------------------------------------------------

CHANGES

* `bind_list`, `bind_table`, & `bind_vector` have been renamed to the more meaningful forms of `tidy_list`, `tidy_table`, & `tidy_vector`. The former versions are now deprecated. This bumps the version to 1.0.0 as this is a major change that breaks backward compatibility.

textshape 0.1.0 - 0.2.0
----------------------------------------------------------------

NEW FEATURES

* `bind_list` added to `rbind` a `list` of named `data.frame`s or `vector`s.
* `split_transcript` added to split a transcript style vector (e.g., `c("greg: Who me", "sarah: yes you!")`) into a name and dialogue vector that is coerced to a `data.table`.
* `change_index` added for extracting the indices of changes in runs within an atomic vector. Pairs well with `split_index`.
* `bind_vector` added to `cbind` a named atomic `vector`'s names and values.
* `bind_table` added to `cbind` a `table`'s names and values.
* `duration` method for numeric vectors added as well as a `starts` and `ends` function for calculating start and end times from a numeric vector.
* `from_to` added to prepare speaker data for a network plot given the flowing nature of discourse.
* `tidy_dtm` & `tidy_tdm` added to convert a `DocumentTermMatrix` or `TermDocumentMatrix` into a tidied `data.frame`.
* `tidy_colo_dtm` & `tidy_colo_tdm` added to convert a `DocumentTermMatrix` or `TermDocumentMatrix` into a collocation matrix and then a tidied `data.frame`.
* `unique_pairs` added to complement the output of `tidy_colo_dtm` & `tidy_colo_tdm`.
Enables the removal of duplicated collocating pairs caused by symmetrical mirroring of the upper and lower triangle of the collocation matrix.

CHANGES

* `split_index` now uses `change_index(x)` as the default when `x` is an atomic vector.

textshape 0.0.1
----------------------------------------------------------------

Tools that can be used to reshape text data.