text2vec 0.6.5 (2023-10-16)
- fix test discovered with 
Matrix==1.6-2 release 
text2vec 0.6.4 (2023-02-15)
- update dependency 
Matrix>=1.5-2, fixes #338 
text2vec 0.6.2 (2022-09-11)
- removed test which is not needed with Matrix package v 1.5
 
text2vec 0.6
- 2019-12-17
- breaking change - removed construction of a
vocabulary in parallel on windows
 
- use 
rsparse package for SVD and GloVe
factorizations 
- updated RWMD implementation (hopefully bug free)
 
 
- 2018-09-10
- breaking change - changed IDF formula - see #280
for details.
 
 
- 2018-05-28
- Added 
postag_lemma_tokenizer() (wrapper around
udpipe::udpipe_annotate). Can be used as a drop-in
replacement for more simple tokenizers in text2vec. 
 
- 2018-05-25
- Made 
combine_vocabularies() part of public API - see
#260 for details. 
 
- 2018-05-10
- Added 
coherence() function for comprehensive coherence
metrics. Thanks to Manuel Bickel ( @manuelbickel ) for conrtibution. 
 
- 2018-05-02
- Fixed bug LSA model - document embeddings calculated as left
singular vectors multiplied by singular values (not square root of
values as before). Thanks to Sloane Simmons ( @singularperturbation )
 
- Now 
fit_transform and transform methods in
LDA model produce same results. Thanks to @jiunsiew for reporting. Also now LDA has
n_iter_inference parameter. It controls number of the
samples from converged distribution for document-topic inference. This
leads to more robust document-topic probabilities (reduced variance).
Default value is 10. 
 
- 2018-01-17
- more numerically robust PMI, LFMD - thanks to @andland. Also adds iteration number
iter to collocation_stat. iter
shows iteration number when collocation stats (and counters) were
calculated. 
 
text2vec 0.5.1 [2018-01-10]
- 2018-01-10
- removed rank* columns from 
collocation_stat - were
never used internally. Users can easily calculate ranks themselves 
 
- 2018-01-09
- Added Bi-Normal Separation transformation, thanks to Pavel Shashkin
( @pshashk )
 
- Added Dunning’s log-likelihood ratio for collocations, thanks to
Chris Lee ( @Chrisss93 )
 
- Early stopping for collocations learning
 
 
- 2017-12-18
- fixed several bugs #219 #217 #205
 
- decreased number of dependencies - no more 
magrittr,
uuid, tokenizers 
- removed distributed LDA which didn’t work correctly
 
 
- 2017-10-18
- Now tokenization is based on tokenizers and
THE stringi packages.
 
- models API follow mlapi package. No API
changes on 
text2vec side - we just put abstract
scikit-learn-like classes to a separate package in order to
make them more reusable. 
 
text2vec 0.5.0
- 2017-06-12
- Add additional filters to 
prune_vocabulary - filter by
document counts 
- Clean up LSA, fixed transform method. Added option to use randomized
SVD algorithm from 
irlba. 
 
- 2017-05-17
 
- 2017-05-17
- API breaking change - vocabulary format change -
now plain 
data.frame with meta-information in attributes
(stopwords, ngram, number of docs, etc). 
 
- 2017-03-25
- No more rely on RcppModules
 
- API breaking change - removed 
lda_c
from formats in DTM construction 
- added 
ifiles_parallel, itoken_parallel
high-level functions for parallel computing 
- API breaking change 
chunks_numer
parameter renamed to n_chunks 
 
- 2017-01-02
- API breaking change - removed
create_corpus from public API, moved co-occurence related
optons to create_tcm from vecorizers 
- add ability to add custom weights for co-occurence statistics
calculations
 
 
- 2016-12-30
- Noticeable speedup (1.5x) and even
more noticeable improvement on memory usage (2x less!)
for 
create_dtm, create_tcm . Now package
relies on sparsepp
library for underlying hash maps. 
 
- 2016-10-30
- Collocations - detection of multi-word phrases using differend
heuristics - PMI, gensim, LFMD.
 
 
- 2016-10-20
- Fixed bug in 
as.lda_c() function 
 
text2vec 0.4.0
2016-10-03. See 0.4 milestone
tags.
- Now under GPL (>= 2) Licence
 
- “immutable” iterators - no need to reinitialize them
 
- unified models interface
 
- New models: LSA, LDA, GloVe with L1 regularization
 
- Fast similarity and distances calculation: Cosine, Jaccard, Relaxed
Word Mover’s Distance, Euclidean
 
- Better hadnling UTF-8 strings, thanks to @qinwf
 
- iterators and models rely on 
R6 package 
text2vec 0.3.0
- 2016-01-13 fix for #46, thanks to @buhrmann for reporting
 
- 2016-01-16 format of vocabulary changed.
- do not keep 
doc_proportions. see #52. 
- add 
stop_words argument to
prune_vocabulary. signature also was changed. 
 
- 2016-01-17 fix for #51. if iterator over tokens returns list with
names, these names will be:
- stored as 
attr(corpus, 'ids') 
- rownames in dtm
 
- names for dtm list in 
lda_c format 
 
- 2016-02-02 high level function for corpus and vocabulary
construction.
- construction of vocabulary from list of 
itoken. 
- construction of dtm from list of 
itoken. 
 
- 2016-02-10 rename transformers
- now all transformers starts with 
transform_* - more
intuitive + simpler usage with autocompletion 
 
- 2016-03-29 (accumulated since 2016-02-10)
- rename 
vocabulary to
create_vocabulary. 
- new functions 
create_dtm, create_tcm. 
- All core functions are able to benefit from multicore machines (user
have to register parallel backend themselves)
 
- Fix for progress bars. Now they are able to reach 100% and ticks
increased after computation.
 
ids argument to itoken. Simplifies
assignement of ids to rows of DTM 
create_vocabulary now can handle
stopwords 
- see all updates here
 
 
- 2016-03-30 more robust 
split_into() util. 
text2vec 0.2.0 (2016-01-10)
First CRAN release of text2vec.
- Fast text vectorization with stable streaming API on arbitrary
n-grams.
- Functions for vocabulary extraction and management
 
- Hash vectorizer (based on digest murmurhash3)
 
- Vocabulary vectorizer
 
 
- GloVe algorithm word embeddings.
- Fast term-co-occurence matrix factorization via parallel async
AdaGrad.
 
 
- All core functions written in C++.