pysummarization.vectorizabletoken package

Submodules

pysummarization.vectorizabletoken.dbm_like_skip_gram_vectorizer module

class pysummarization.vectorizabletoken.dbm_like_skip_gram_vectorizer.DBMLikeSkipGramVectorizer(token_list, document_list=[], traning_count=100, batch_size=20, learning_rate=1e-05, feature_dim=100)[source]

Bases: pysummarization.vectorizable_token.VectorizableToken

Vectorize tokens with a Deep Boltzmann Machine (DBM).

Note that this class employs an original, library-specific method built on an intuition and analogy about skip-gram, whereby n-grams are still stored to model the language but tokens are allowed to be skipped.
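
A minimal usage sketch, assuming a pre-tokenized corpus and that training is driven by the constructor (as the traning_count parameter in the signature above suggests); the token values below are stand-ins:

    from pysummarization.vectorizabletoken.dbm_like_skip_gram_vectorizer import DBMLikeSkipGramVectorizer

    # Stand-in corpus: in practice, build this list with your own tokenizer.
    token_list = ["natural", "language", "processing", "summarizes", "documents"]

    # Note the spelling `traning_count` in the signature above.
    vectorizer = DBMLikeSkipGramVectorizer(
        token_list=token_list,
        traning_count=100,
        batch_size=20,
        feature_dim=100
    )

    # Tokens -> feature vectors ...
    vector_list = vectorizer.vectorize(token_list=["language", "processing"])

    # ... and vectors -> token again.
    token = vectorizer.tokenize(vector_list=vector_list)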

convert_tokens_into_matrix(token_list)[source]

Create matrix of sentences.

Parameters:token_list – The list of tokens.
Returns:2-D np.ndarray of sentences. Each row represents the one-hot vectors of one sentence.
get_token_arr()[source]

getter

get_token_list()[source]

getter

set_readonly(value)[source]

setter

set_token_arr(value)[source]

setter

token_arr

getter

token_list

getter

tokenize(vector_list)[source]

Tokenize vector.

Parameters:vector_list – The list of vectors, one vector per token.
Returns:token
vectorize(token_list)[source]

Vectorize token list.

Parameters:token_list – The list of tokens.
Returns:[vector of token, vector of token, vector of token, …]

pysummarization.vectorizabletoken.encoder_decoder module

class pysummarization.vectorizabletoken.encoder_decoder.EncoderDecoder[source]

Bases: pysummarization.vectorizable_token.VectorizableToken

Vectorize tokens with an LSTM-based Encoder/Decoder.

This library provides an Encoder/Decoder based on LSTM, which is a reconstruction model that makes it possible to extract series features embedded in deeper layers. The LSTM encoder learns a fixed-length vector representation of the observed time-series data points, and the LSTM decoder uses this representation to reconstruct the time-series from the current hidden state and the value inferred at the previous time step.
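
As a conceptual sketch of this reconstruction scheme (illustrative code only, not this library's API; a plain recurrent cell stands in for the LSTM, and all names are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    dim, hidden = 5, 8
    W_in = rng.normal(scale=0.1, size=(dim, hidden))
    W_rec = rng.normal(scale=0.1, size=(hidden, hidden))
    W_out = rng.normal(scale=0.1, size=(hidden, dim))

    def encode(sequence):
        # Fold the whole series into one fixed-length hidden vector.
        h = np.zeros(hidden)
        for x in sequence:
            h = np.tanh(x @ W_in + h @ W_rec)
        return h

    def decode(h, steps):
        # Reconstruct step by step, feeding the value inferred at the
        # previous time step back in as the next input.
        x, outputs = np.zeros(dim), []
        for _ in range(steps):
            h = np.tanh(x @ W_in + h @ W_rec)
            x = h @ W_out
            outputs.append(x)
        return outputs

    sequence = [rng.normal(size=dim) for _ in range(4)]
    reconstruction = decode(encode(sequence), steps=len(sequence))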

controller

getter

get_controller()[source]

getter

learn(sentence_list, token_master_list, hidden_neuron_count=200, epochs=100, batch_size=100, learning_rate=1e-05, learning_attenuate_rate=0.1, attenuate_epoch=50, bptt_tau=8, weight_limit=0.5, dropout_rate=0.5, test_size_rate=0.3)[source]

Learn.

Parameters:
  • sentence_list – The list of tokenized sentences. [[token, token, token, …], [token, token, token, …], [token, token, token, …]]
  • token_master_list – Unique list of tokens.
  • hidden_neuron_count – The number of units in the hidden layer.
  • epochs – Epochs of mini-batch training.
  • batch_size – Batch size of mini-batch training.
  • learning_rate – Learning rate.
  • learning_attenuate_rate – Attenuate the learning_rate by a factor of this value every attenuate_epoch.
  • attenuate_epoch – Attenuate the learning_rate by a factor of learning_attenuate_rate every attenuate_epoch. Additionally, in relation to regularization, this class constrains the weight matrices every attenuate_epoch.
  • bptt_tau – The maximum time step t referred to in Backpropagation Through Time (BPTT).
  • weight_limit – Regularization for the weight matrix: repeatedly multiply the weight matrix by 0.9 until $\sum_{j=0}^{n} w_{ji}^2 < \text{weight\_limit}$ (see the sketch after this parameter list).
  • dropout_rate – The probability of dropout.
  • test_size_rate – Size of the test data set. If this value is 0, the validation will not be executed.
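
The weight_limit constraint can be read as the following sketch (illustrative code, not the library's internals):

    import numpy as np

    def constrain_weights(weights, weight_limit=0.5):
        # Repeatedly scale the weight matrix by 0.9 until the squared norm of
        # each unit's incoming weights falls below `weight_limit`.
        while (np.sum(weights ** 2, axis=0) >= weight_limit).any():
            weights = weights * 0.9
        return weights
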
set_readonly(value)[source]

setter

vectorize(token_list)[source]

Vectorize token list.

Parameters:token_list – The list of tokens.
Returns:[vector of token, vector of token, vector of token, …]
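
Putting learn and vectorize together, a minimal usage sketch (the sentences below are stand-ins for your own tokenized corpus):

    from pysummarization.vectorizabletoken.encoder_decoder import EncoderDecoder

    # Each inner list is one tokenized sentence.
    sentence_list = [
        ["natural", "language", "processing"],
        ["processing", "summarizes", "documents"],
    ]
    # Unique list of all tokens.
    token_master_list = list(set(t for s in sentence_list for t in s))

    encoder_decoder = EncoderDecoder()
    encoder_decoder.learn(
        sentence_list=sentence_list,
        token_master_list=token_master_list,
        hidden_neuron_count=200,
        epochs=100
    )

    # Feature vectors extracted from the learned representation.
    vector_list = encoder_decoder.vectorize(token_list=sentence_list[0])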

pysummarization.vectorizabletoken.skip_gram_vectorizer module

class pysummarization.vectorizabletoken.skip_gram_vectorizer.SkipGramVectorizer(token_list, epochs=300, skip_n=1, batch_size=50, feature_dim=20, scale=1e-05, learning_rate=1e-05, auto_encoder=None)[source]

Bases: pysummarization.vectorizable_token.VectorizableToken

Vectorize tokens by skip-gram.
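
A minimal usage sketch, assuming a pre-tokenized corpus (the token values are stand-ins):

    from pysummarization.vectorizabletoken.skip_gram_vectorizer import SkipGramVectorizer

    token_list = ["natural", "language", "processing", "summarizes", "documents"]

    vectorizer = SkipGramVectorizer(
        token_list=token_list,
        epochs=300,
        skip_n=1,          # presumably the number of tokens that may be skipped
        feature_dim=20
    )
    vectorizer.learn()

    vector_list = vectorizer.vectorize(token_list=["language", "processing"])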

auto_encoder

getter

convert_tokens_into_matrix(token_list)[source]

Create matrix of sentences.

Parameters:token_list – The list of tokens.
Returns:2-D np.ndarray of sentences. Each row represents the one-hot vectors of one sentence.
get_auto_encoder()[source]

getter

get_token_arr()[source]

getter

get_token_list()[source]

getter

learn()[source]

Learn.

set_auto_encoder(value)[source]

setter

set_readonly(value)[source]

setter

set_token_arr(value)[source]

setter

token_arr

getter

token_list

getter

tokenize(vector_list)[source]

Tokenize vector.

Parameters:vector_list – The list of vectors, one vector per token.
Returns:token
vectorize(token_list)[source]

Vectorize token list.

Parameters:token_list – The list of tokens.
Returns:[vector of token, vector of token, vector of token, …]

pysummarization.vectorizabletoken.t_hot_vectorizer module

class pysummarization.vectorizabletoken.t_hot_vectorizer.THotVectorizer(token_list)[source]

Bases: pysummarization.vectorizable_token.VectorizableToken

Vectorize tokens by t-hot vectorization.
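
A minimal usage sketch of the round trip between tokens and t-hot vectors:

    from pysummarization.vectorizabletoken.t_hot_vectorizer import THotVectorizer

    token_list = ["natural", "language", "processing"]
    vectorizer = THotVectorizer(token_list=token_list)

    # Tokens -> vectors ...
    vector_list = vectorizer.vectorize(token_list=["language"])

    # ... and vectors -> token again.
    token = vectorizer.tokenize(vector_list=vector_list)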

convert_tokens_into_matrix(token_list)[source]

Create matrix of sentences.

Parameters:token_list – The list of tokens.
Returns:2-D np.ndarray of sentences. Each row represents the one-hot vectors of one sentence.
get_token_arr()[source]

getter

set_token_arr(value)[source]

setter

token_arr

getter

tokenize(vector_list)[source]

Tokenize vector.

Parameters:vector_list – The list of vectors, one vector per token.
Returns:token
vectorize(token_list)[source]

Vectorize token list.

Parameters:token_list – The list of tokens.
Returns:[vector of token, vector of token, vector of token, …]

pysummarization.vectorizabletoken.tfidf_vectorizer module

class pysummarization.vectorizabletoken.tfidf_vectorizer.TfidfVectorizer(token_list_list)[source]

Bases: pysummarization.vectorizable_token.VectorizableToken

Vectorize tokens by TF-IDF.
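
A minimal usage sketch; note that the constructor takes a list of token lists (one per document), while vectorize takes a single token list. The weighting presumably follows the usual TF-IDF scheme, roughly tf(t, d) * log(N / df(t)) over N documents, though the exact variant is not specified here:

    from pysummarization.vectorizabletoken.tfidf_vectorizer import TfidfVectorizer

    # One token list per document.
    token_list_list = [
        ["natural", "language", "processing"],
        ["processing", "summarizes", "documents"],
    ]
    vectorizer = TfidfVectorizer(token_list_list=token_list_list)

    # TF-IDF weights for the tokens of one document.
    vector_list = vectorizer.vectorize(token_list=token_list_list[0])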

vectorize(token_list)[source]

Vectorize token list.

Parameters:token_list – The list of tokens.
Returns:[vector of token, vector of token, vector of token, …]

Module contents