pysummarization package

Submodules

pysummarization.abstractable_doc module

class pysummarization.abstractable_doc.AbstractableDoc[source]

Bases: object

Automatic abstraction and summarization. This is the filtering approach.

Reference.

This interface is designed around the Strategy pattern.

filter(scored_list)[source]

Filter sentences.

Parameters:scored_list – The list of statistical information derived from word frequency and distribution.
Returns:The list of filtered sentences.
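
The Strategy pattern mentioned above can be illustrated with a minimal, self-contained sketch. It does not import pysummarization: FilterStrategy, TopNFilter, ThresholdFilter and summarize_indices are hypothetical names, and the (sentence_index, score) shape of scored_list is an assumption made only for this example.

```python
from abc import ABC, abstractmethod


class FilterStrategy(ABC):
    """Hypothetical stand-in for an AbstractableDoc-style interface."""

    @abstractmethod
    def filter(self, scored_list):
        """Return the entries of scored_list that survive filtering."""


class TopNFilter(FilterStrategy):
    """Illustrative strategy: keep the n highest-scoring entries."""

    def __init__(self, top_n=2):
        self.top_n = top_n

    def filter(self, scored_list):
        # Assumed shape of scored_list: [(sentence_index, score), ...]
        ranked = sorted(scored_list, key=lambda pair: pair[1], reverse=True)
        return ranked[: self.top_n]


class ThresholdFilter(FilterStrategy):
    """Illustrative strategy: keep entries scoring at least min_score."""

    def __init__(self, min_score):
        self.min_score = min_score

    def filter(self, scored_list):
        return [pair for pair in scored_list if pair[1] >= self.min_score]


def summarize_indices(scored_list, strategy):
    # The caller depends only on filter(), so strategies are interchangeable.
    return sorted(index for index, _ in strategy.filter(scored_list))


scored = [(0, 1.5), (1, 4.0), (2, 0.5), (3, 3.0)]
top_two = summarize_indices(scored, TopNFilter(top_n=2))
above_one = summarize_indices(scored, ThresholdFilter(min_score=1.0))
```

Swapping TopNFilter for ThresholdFilter changes which sentences are kept without touching the calling code, which is the point of exposing filtering through a single interface.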

pysummarization.n_gram module

class pysummarization.n_gram.Ngram[source]

Bases: object

N-gram

generate_ngram_data_set(token_list, n=2)[source]

Generate N-gram pairs.

Parameters:
  • token_list – The list of tokens.
  • n – N, the number of tokens in each N-gram.
Returns:

zip of Tuple(Training N-gram data, Target N-gram data)

generate_skip_gram_data_set(token_list)[source]

Generate skip-gram pairs.

Parameters:token_list – The list of tokens.
Returns:zip of Tuple(Training N-gram data, Target N-gram data)

generate_tuple_zip(token_list, n=2)[source]

Generate the N-gram.

Parameters:
  • token_list – The list of tokens.
  • n – N, the number of tokens in each N-gram.
Returns:

zip of Tuple(N-gram)
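
As a rough illustration of the three methods above, the following standalone functions build N-grams with zip. This is a sketch, not the package's code. It rests on two assumptions: that a "training" and "target" pair means two consecutive N-grams, and that the skip-gram variant uses a window of one token on each side.

```python
def generate_tuple_zip(token_list, n=2):
    # Zip n staggered views of the list; each resulting tuple is one N-gram.
    return zip(*[token_list[i:] for i in range(n)])


def generate_ngram_data_set(token_list, n=2):
    # Assumption: the target is the N-gram that follows the training N-gram.
    n_grams = list(generate_tuple_zip(token_list, n))
    return zip(n_grams, n_grams[1:])


def generate_skip_gram_data_set(token_list):
    # Assumption: pair each center token with its left and right neighbor.
    centers, contexts = [], []
    for pre, center, post in generate_tuple_zip(token_list, 3):
        centers.extend([center, center])
        contexts.extend([pre, post])
    return zip(centers, contexts)


tokens = ["the", "cat", "sat", "down"]
bigrams = list(generate_tuple_zip(tokens, n=2))
bigram_pairs = list(generate_ngram_data_set(tokens, n=2))
skip_pairs = list(generate_skip_gram_data_set(tokens))
```

Because zip returns a lazy iterator in Python 3, each result can be consumed only once; wrap it in list() if it needs to be reused.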

pysummarization.nlp_base module

class pysummarization.nlp_base.NlpBase[source]

Bases: object

The base class for NLP.

delimiter_list

getter

get_delimiter_list()[source]

getter

get_token()[source]

getter

get_tokenizable_doc()[source]

getter

listup_sentence(data, counter=0)[source]

Split a string into a list of sentences.

Parameters:
  • data – string.
  • counter – recursive counter.
Returns:

List of sentences.
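
Sentence splitting driven by a delimiter list might look like the sketch below. It is iterative rather than recursive, so it has no counter argument, and the default delimiters are an assumption chosen for the example rather than the package's defaults.

```python
def listup_sentence(data, delimiter_list=(".", "!", "?")):
    sentences = [data]
    for delimiter in delimiter_list:
        next_sentences = []
        for chunk in sentences:
            parts = chunk.split(delimiter)
            # Re-attach the delimiter to every part except the trailing remainder.
            for part in parts[:-1]:
                next_sentences.append(part + delimiter)
            next_sentences.append(parts[-1])
        sentences = next_sentences
    # Drop empty fragments and surrounding whitespace.
    return [s.strip() for s in sentences if s.strip()]


sentence_list = listup_sentence("Hello world. How are you? Fine!")
```

Each pass splits the current fragments on one delimiter, so the order of delimiter_list does not affect the final result.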

set_delimiter_list(value)[source]

setter

set_token(value)[source]

setter

set_tokenizable_doc(value)[source]

setter

token

getter

tokenizable_doc

getter

tokenize(data)[source]

Tokenize sentence and set the list of tokens to self.token.

Parameters:data – string.
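
The relationship between tokenize, token and tokenizable_doc can be sketched as plain composition. The example assumes that tokenize delegates to whatever object is stored in tokenizable_doc; NlpSketch and WhitespaceTokenizer are hypothetical names and are not part of the package.

```python
class WhitespaceTokenizer:
    """Hypothetical tokenizer exposing a tokenize(sentence_str) method."""

    def tokenize(self, sentence_str):
        return sentence_str.split()


class NlpSketch:
    """Hypothetical stand-in that mirrors the accessors listed above."""

    def __init__(self):
        self._tokenizable_doc = None
        self._token = []

    @property
    def tokenizable_doc(self):
        return self._tokenizable_doc

    @tokenizable_doc.setter
    def tokenizable_doc(self, value):
        # Reject objects that cannot tokenize, so errors surface early.
        if not hasattr(value, "tokenize"):
            raise TypeError("tokenizable_doc must provide tokenize()")
        self._tokenizable_doc = value

    @property
    def token(self):
        return self._token

    def tokenize(self, data):
        # Assumption: the work is delegated and the result is stored, not returned.
        self._token = self._tokenizable_doc.tokenize(data)


nlp = NlpSketch()
nlp.tokenizable_doc = WhitespaceTokenizer()
nlp.tokenize("summaries should be short")
```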

pysummarization.readable_web_pdf module

class pysummarization.readable_web_pdf.ReadableWebPDF[source]

Bases: object

Read strings in PDF documents

is_pdf_url(url)[source]

Check PDF format.

Parameters:url – URL
Returns:True if the URL points to a PDF, False otherwise.
Return type:bool

url_to_text(url)[source]

Transform PDF documents to strings.

Parameters:url – URL
Returns:string.
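
One simple way to check whether a URL looks like a PDF is to inspect the file extension in its path. The sketch below is a heuristic under that assumption and may differ from what the package does; a server can serve a PDF from any URL, so a stricter check would inspect the response's content type.

```python
from urllib.parse import urlparse


def looks_like_pdf_url(url):
    # Look only at the path so query strings and fragments are ignored.
    path = urlparse(url).path
    return path.lower().endswith(".pdf")


is_paper = looks_like_pdf_url("https://example.com/docs/paper.pdf")
is_page = looks_like_pdf_url("https://example.com/index.html")
```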

pysummarization.similarity_filter module

class pysummarization.similarity_filter.SimilarityFilter[source]

Bases: object

Abstract class for filtering mutually similar sentences.

calculate(token_list_x, token_list_y)[source]

Calculate similarity.

Abstract method.

Parameters:
  • token_list_x – [token, token, token, …]
  • token_list_y – [token, token, token, …]
Returns:

Similarity.

count(token_list)[source]

Count the number of tokens in token_list.

Parameters:token_list – The list of tokens.
Returns:{token: the number of occurrences}
Return type:dict

get_nlp_base()[source]

getter

get_similarity_limit()[source]

getter

nlp_base

getter

set_nlp_base(value)[source]

setter

set_similarity_limit(value)[source]

setter

similar_filter_r(sentence_list)[source]

Filter mutually similar sentences.

Parameters:sentence_list – The list of sentences.
Returns:The list of filtered sentences.

similarity_limit

getter

unique(token_list_x, token_list_y)[source]

Remove duplicated elements.

Parameters:
  • token_list_x – [token, token, token, …]
  • token_list_y – [token, token, token, …]
Returns:

Tuple(token_list_x, token_list_y)
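
To make the roles of count, unique, calculate and similar_filter_r concrete, here is a standalone sketch. Since calculate is abstract, the Jaccard coefficient is used as a stand-in measure. The whitespace tokenizer, the greedy keep-first policy and the default limit of 0.5 are all assumptions made for the example.

```python
from collections import Counter


def count(token_list):
    # Map each token to the number of times it occurs.
    return dict(Counter(token_list))


def unique(token_list_x, token_list_y):
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(token_list_x)), list(dict.fromkeys(token_list_y))


def calculate(token_list_x, token_list_y):
    # Stand-in similarity: Jaccard coefficient |X & Y| / |X | Y|.
    x, y = unique(token_list_x, token_list_y)
    set_x, set_y = set(x), set(y)
    union = set_x | set_y
    if not union:
        return 0.0
    return len(set_x & set_y) / len(union)


def similar_filter(sentence_list, similarity_limit=0.5, tokenize=str.split):
    # Keep a sentence only if it is below the limit against every kept sentence.
    kept, kept_tokens = [], []
    for sentence in sentence_list:
        tokens = tokenize(sentence)
        if all(calculate(tokens, prev) < similarity_limit for prev in kept_tokens):
            kept.append(sentence)
            kept_tokens.append(tokens)
    return kept


sentences = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "dogs bark loudly",
]
filtered = similar_filter(sentences, similarity_limit=0.5)
```

The second sentence shares five of six distinct tokens with the first, so its similarity exceeds the limit and it is dropped.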

pysummarization.tokenizable_doc module

class pysummarization.tokenizable_doc.TokenizableDoc[source]

Bases: object

Tokenize string.

tokenize(sentence_str)[source]

Tokenize a string.

Parameters:sentence_str – The string to tokenize.
Returns:[token, token, token, …]
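
A concrete tokenizer honoring this contract could be as small as the sketch below. RegexTokenizer is a hypothetical class, and lowercasing and matching word characters only are assumptions made for the example; a production tokenizer would typically rely on a language-specific library.

```python
import re


class RegexTokenizer:
    """Hypothetical tokenizer: one token per run of word characters."""

    def __init__(self, pattern=r"\w+"):
        self._pattern = re.compile(pattern)

    def tokenize(self, sentence_str):
        # Lowercase first so "Cat" and "cat" become the same token.
        return self._pattern.findall(sentence_str.lower())


tokens = RegexTokenizer().tokenize("Summarize this, please.")
```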

pysummarization.vectorizable_sentence module

class pysummarization.vectorizable_sentence.VectorizableSentence[source]

Bases: object

Vectorize sentence.

vectorize(sentence_list)[source]

Vectorize the list of tokenized sentences.

Parameters:sentence_list – The list of tokenized sentences: [[token, token, token, …], [token, token, token, …], [token, token, token, …]]
Returns:[vector of token, vector of token, vector of token]
Return type:np.ndarray
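
One possible reading of sentence vectorization is a bag-of-words count vector per sentence, sketched here with plain lists so that the example has no dependencies. The bag-of-words encoding is an assumption, and the documented return type is np.ndarray rather than a list.

```python
def vectorize_sentences(sentence_list):
    # Build a sorted vocabulary so the column order is deterministic.
    vocabulary = sorted({token for sentence in sentence_list for token in sentence})
    index_of = {token: i for i, token in enumerate(vocabulary)}
    matrix = []
    for sentence in sentence_list:
        row = [0] * len(vocabulary)
        for token in sentence:
            row[index_of[token]] += 1
        matrix.append(row)
    return vocabulary, matrix


vocabulary, matrix = vectorize_sentences([["a", "b", "a"], ["b", "c"]])
```

Each row corresponds to one sentence and each column to one vocabulary entry, so the result converts directly to a two-dimensional array.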

pysummarization.vectorizable_token module

class pysummarization.vectorizable_token.VectorizableToken[source]

Bases: object

Vectorize token.

vectorize(token_list)[source]

Vectorize the list of tokens.

Parameters:token_list – The list of tokens.
Returns:[vector of token, vector of token, vector of token, …]
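
As a minimal illustration of the input and output shapes, the sketch below maps each token to a one-hot vector. One-hot encoding is an assumption chosen for simplicity; an implementation of this interface could equally return learned embeddings.

```python
def vectorize_tokens(token_list):
    # One column per distinct token, in sorted order.
    vocabulary = sorted(set(token_list))
    index_of = {token: i for i, token in enumerate(vocabulary)}
    vectors = []
    for token in token_list:
        vector = [0] * len(vocabulary)
        vector[index_of[token]] = 1
        vectors.append(vector)
    return vectors


token_vectors = vectorize_tokens(["b", "a", "b"])
```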

pysummarization.web_scraping module

class pysummarization.web_scraping.WebScraping[source]

Bases: object

Web scraping object.

This is only a demo.

get_readable_web_pdf()[source]

getter

readable_web_pdf

getter

scrape(url)[source]

Execute web scraping. The target DOM objects are listed in self.__dom_object_list.

Parameters:url – Web site url.
Returns:The result, as a string.

@TODO(chimera0): check URLs format.
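
Extracting text from a list of target elements can be sketched with the standard library alone. The example works on an HTML string instead of fetching a URL. TextExtractor and extract_text are hypothetical names, and using tag names such as "p" as the target DOM objects is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text that appears inside any of the target tags."""

    def __init__(self, target_tags):
        super().__init__()
        self.target_tags = set(target_tags)
        self.depth = 0  # how many target tags are currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.target_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.target_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, target_tags=("p",)):
    parser = TextExtractor(target_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<p>First paragraph.</p><div>skip</div><p>Second one.</p>"
text = extract_text(page)
```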

set_readable_web_pdf(value)[source]

setter

Module contents