capreolus.extractor.deeptileextractor
Module Contents
Classes
DeepTileExtractor(): Creates a text tiling matrix. Used by the DeepTileBars reranker.
-
class capreolus.extractor.deeptileextractor.DeepTileExtractor
Bases: capreolus.extractor.Extractor
Creates a text tiling matrix. Used by the DeepTileBars reranker.
-
extract_segment(self, doc_toks, ttt, slicelen=20)
- Tries to extract segments using nltk.TextTilingTokenizer (an instance is passed in as ttt).
- If that fails, simply splits the tokens into segments of slicelen tokens each (20 by default).
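The try-then-fall-back behaviour described above can be sketched as follows. This is a minimal illustration, not the Capreolus implementation: the function name extract_segment_sketch, the assumption that a failed tiling attempt surfaces as ValueError, and the FailingTokenizer stand-in are all hypothetical. ttt is assumed to be any object exposing a tokenize(text) method that returns a list of strings.

```python
def extract_segment_sketch(doc_toks, ttt, slicelen=20):
    """Segment a token list with `ttt`, or fall back to fixed-size slices."""
    text = " ".join(doc_toks)
    try:
        # Assumption: the tokenizer returns a list of segment strings.
        segments = [seg.strip() for seg in ttt.tokenize(text)]
    except ValueError:
        # Assumption: a failed tiling attempt raises ValueError.
        segments = [
            " ".join(doc_toks[i:i + slicelen])
            for i in range(0, len(doc_toks), slicelen)
        ]
    return segments


class FailingTokenizer:
    """Hypothetical stand-in that always fails, to exercise the fallback."""

    def tokenize(self, text):
        raise ValueError("text too short to tile")


toks = [f"w{i}" for i in range(45)]
fallback_segments = extract_segment_sketch(toks, FailingTokenizer())
print(len(fallback_segments))  # 3 slices: 20 + 20 + 5 tokens
```

Passing the tokenizer in as an argument, as the real method does with ttt, means one instance can be reused across many documents.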
-
clean_segments(self, segments, p_len=30)
- Pads segments if the list is too short.
- If it is too long, collapses the extra text into the last element.
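A minimal sketch of the pad-or-collapse step, assuming that "too short" and "too long" refer to the number of segments relative to p_len. The function name clean_segments_sketch and the pad_tok value are hypothetical; the padding value Capreolus actually uses is not stated here.

```python
def clean_segments_sketch(segments, p_len=30, pad_tok="<pad>"):
    """Return exactly p_len segments by padding or collapsing the tail."""
    if len(segments) < p_len:
        # Too short: append padding entries (pad_tok is an assumed placeholder).
        return list(segments) + [pad_tok] * (p_len - len(segments))
    if len(segments) > p_len:
        # Too long: keep the first p_len - 1 segments and merge the rest
        # into a single final element.
        head = list(segments[:p_len - 1])
        tail = " ".join(segments[p_len - 1:])
        return head + [tail]
    return list(segments)


padded = clean_segments_sketch(["a", "b"], p_len=4)
collapsed = clean_segments_sketch(["a", "b", "c", "d", "e"], p_len=3)
print(padded)     # ['a', 'b', '<pad>', '<pad>']
print(collapsed)  # ['a', 'b', 'c d e']
```

Collapsing rather than truncating keeps all of the document text, at the cost of a longer final segment.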
-
color_grid(self, q_tok, topic_segment, embeddings_matrix)
See the section titled “Coloring” in the original paper: https://arxiv.org/pdf/1811.00606.pdf
Calculates the TF, IDF and max Gaussian for the given q_tok <> topic_segment pair.
:param q_tok: List of tokens in a query
:param topic_segment: A single segment, as a string. (A document can have multiple segments.)
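The three values computed for a query token and segment pair can be illustrated roughly as below. This sketch rests on several assumptions that are not confirmed by this page: q_tok is treated as a single query token, the Gaussian term is taken to be exp(-||v_q - v_t||^2) maximised over the segment's terms, and plain dictionaries (embeddings, idf) stand in for embeddings_matrix and the IDF lookup. All names are hypothetical.

```python
import math
from collections import Counter


def color_tile_sketch(q_tok, topic_segment, embeddings, idf):
    """Return [tf, idf, max_gaussian] for one query token and one segment."""
    seg_toks = topic_segment.split()
    # Channel 0: how often the query token occurs in the segment.
    tf = Counter(seg_toks)[q_tok]
    # Channel 2: best Gaussian-kernel match between the query token and
    # any segment token (assumed kernel form; see the lead-in).
    q_vec = embeddings.get(q_tok)
    best = 0.0
    if q_vec is not None:
        for tok in seg_toks:
            t_vec = embeddings.get(tok)
            if t_vec is None:
                continue  # skip out-of-vocabulary tokens
            sq_dist = sum((a - b) ** 2 for a, b in zip(q_vec, t_vec))
            best = max(best, math.exp(-sq_dist))
    # Channel 1: IDF of the query token, 0.0 if unknown.
    return [float(tf), idf.get(q_tok, 0.0), best]


embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
idf = {"cat": 2.0}
tile = color_tile_sketch("cat", "cat dog cat", embeddings, idf)
print(tile)  # [2.0, 2.0, 1.0]
```

An exact match gives a squared distance of zero, so the Gaussian channel reaches its maximum of 1.0 whenever the query token itself appears in the segment.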
-
create_visualization_matrix(self, query_toks, document_segments, embeddings_matrix)
Returns a tensor of shape (1, maxqlen, passagelen, channels).
The first dimension (of size 1) is a dummy dimension and can be ignored.
The second and third dimensions (maxqlen and passagelen) together index a “tile” between a query token and a passage (i.e. a document segment).
Each tile has up to 3 channels: the TF of the query term in that passage, the IDF of the query term, and the maximum word2vec similarity between the query term and any term in the passage.
:param query_toks: A list of tokens in the query, e.g. ['hello', 'world']
:param document_segments: List of segments in a document. Each segment is a string.
:param embeddings_matrix: Used to look up word2vec embeddings
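To make the output shape concrete, the sketch below builds a nested list with the layout (1, maxqlen, passagelen, channels). It is illustrative only: the function name, the tile_fn callback, the small default sizes, and the use of zero-filled tiles for positions beyond the actual query or document length are assumptions, and plain Python lists stand in for a tensor.

```python
def build_matrix_sketch(query_toks, document_segments, tile_fn,
                        maxqlen=4, passagelen=3, channels=3):
    """Build a (1, maxqlen, passagelen, channels) nested list of tiles."""
    grid = []
    for qi in range(maxqlen):
        row = []
        for si in range(passagelen):
            if qi < len(query_toks) and si < len(document_segments):
                # One tile per (query token, segment) pair.
                row.append(list(tile_fn(query_toks[qi], document_segments[si])))
            else:
                # Assumed padding: all-zero tile for unused positions.
                row.append([0.0] * channels)
        grid.append(row)
    return [grid]  # leading dummy dimension of size 1


def tf_only_tile(q_tok, segment):
    """Hypothetical tile: TF in channel 0, the other channels left at zero."""
    return [float(segment.split().count(q_tok)), 0.0, 0.0]


matrix = build_matrix_sketch(
    ["hello", "world"],
    ["hello there", "world hello world"],
    tf_only_tile,
)
shape = (len(matrix), len(matrix[0]), len(matrix[0][0]), len(matrix[0][0][0]))
print(shape)  # (1, 4, 3, 3)
```

Here matrix[0][1][1] is the tile for the query token 'world' against the second segment, so its first channel holds a term frequency of 2.0.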
-