:mod:`capreolus.extractor.deeptileextractor`
============================================

.. py:module:: capreolus.extractor.deeptileextractor


Module Contents
---------------

Classes
~~~~~~~

.. autoapisummary::

   capreolus.extractor.deeptileextractor.DeepTileExtractor


.. data:: logger


.. data:: CACHE_BASE_PATH


.. py:class:: DeepTileExtractor

   Bases: :class:`capreolus.extractor.Extractor`

   Creates a text tiling matrix. Used by the DeepTileBars reranker.

   .. attribute:: module_name
      :annotation: = deeptiles

   .. attribute:: pad
      :annotation: = 0

   .. attribute:: pad_tok
      :annotation: =

   .. attribute:: embed_paths

   .. attribute:: requires_random_seed
      :annotation: = True

   .. attribute:: dependencies

   .. attribute:: config_spec

   .. method:: load_state(self, qids, docids)

   .. method:: cache_state(self, qids, docids)

   .. method:: get_tf_feature_description(self)

   .. method:: create_tf_feature(self)

   .. method:: parse_tf_example(self, example_proto)

   .. method:: extract_segment(self, doc_toks, ttt, slicelen=20)

      1. Tries to extract segments using ``nltk.TextTilingTokenizer`` (an instance is passed as the ``ttt`` argument).
      2. If that fails, simply splits the document into segments of ``slicelen`` tokens each (20 by default).

   .. method:: clean_segments(self, segments, p_len=30)

      1. Pads the list of segments if it is too short.
      2. If the list is too long, collapses the extra text into the last element.

   .. method:: gaussian(self, x1, z1)

   .. method:: color_grid(self, q_tok, topic_segment, embeddings_matrix)

      See the section titled "Coloring" in the original paper: https://arxiv.org/pdf/1811.00606.pdf

      Calculates the TF, the IDF, and the maximum Gaussian similarity for the given q_tok <> topic_segment pair.

      :param q_tok: List of tokens in a query
      :param topic_segment: A single segment. String. (A document can have multiple segments)

   .. method:: create_visualization_matrix(self, query_toks, document_segments, embeddings_matrix)

      Returns a tensor of shape (1, maxqlen, passagelen, channels).

      The first dimension (i.e., 1) is a dummy dimension and can be ignored.
      The second and third dimensions (i.e., maxqlen and passagelen) together
      index a "tile" between a query token and a passage (i.e., a document segment).
      Each tile holds up to three channel values: the TF of the query term in
      that passage, the IDF of the query term, and the maximum word2vec
      similarity between the query term and any term in the passage.

      :param query_toks: A list of tokens in the query. E.g.: ['hello', 'world']
      :param document_segments: List of segments in a document. Each segment is a string
      :param embeddings_matrix: Used to look up word2vec embeddings

   .. method:: exist(self)

   .. method:: preprocess(self, qids, docids, topics)

   .. method:: id2vec(self, qid, posdocid, negdocid=None)
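
The segmentation fallback, the segment clean-up, and the tile layout described
above can be illustrated with a small standalone sketch. It is not the
Capreolus implementation: the function names ``split_fixed``, ``color_tile``
and ``visualization_matrix``, the plain ``dict`` stand-ins for the IDF table
and the embeddings, the ``"<pad>"`` padding string, and the Gaussian bandwidth
``sigma=1.0`` are all assumptions made for illustration.

.. code-block:: python

   import numpy as np


   def split_fixed(doc_toks, slicelen=20):
       """Fallback segmentation: join every `slicelen` tokens into one segment string."""
       return [" ".join(doc_toks[i:i + slicelen]) for i in range(0, len(doc_toks), slicelen)]


   def clean_segments(segments, p_len=30, pad_tok="<pad>"):
       """Pad a short segment list, or fold the overflow into the last kept segment."""
       if len(segments) < p_len:
           return segments + [pad_tok] * (p_len - len(segments))
       if len(segments) > p_len:
           return segments[:p_len - 1] + [" ".join(segments[p_len - 1:])]
       return list(segments)


   def gaussian(x1, z1, sigma=1.0):
       """Gaussian (RBF) similarity between two vectors; sigma is an assumed bandwidth."""
       diff = np.asarray(x1, dtype=float) - np.asarray(z1, dtype=float)
       return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2)))


   def color_tile(q_tok, segment, idf, embeddings):
       """Return [tf, idf, max similarity] for one query token and one segment."""
       seg_toks = segment.split()
       tf = float(seg_toks.count(q_tok))
       # Only tokens with a known embedding can contribute to the similarity channel.
       known = [t for t in seg_toks if t in embeddings]
       sim = 0.0
       if q_tok in embeddings and known:
           sim = max(gaussian(embeddings[q_tok], embeddings[t]) for t in known)
       return [tf, idf.get(q_tok, 0.0), sim]


   def visualization_matrix(query_toks, segments, idf, embeddings, maxqlen, passagelen):
       """Build a (1, maxqlen, passagelen, 3) array; unused rows stay zero."""
       matrix = np.zeros((1, maxqlen, passagelen, 3), dtype=np.float32)
       for qi, q_tok in enumerate(query_toks[:maxqlen]):
           for si, seg in enumerate(segments[:passagelen]):
               matrix[0, qi, si] = color_tile(q_tok, seg, idf, embeddings)
       return matrix


   # Toy usage with made-up data.
   toks = "the quick brown fox jumps over the lazy dog".split()
   segments = clean_segments(split_fixed(toks, slicelen=4), p_len=2)
   embeddings = {
       "fox": np.array([1.0, 0.0]),
       "dog": np.array([1.0, 0.0]),
       "quick": np.array([0.0, 1.0]),
   }
   idf = {"fox": 1.5}
   m = visualization_matrix(["fox"], segments, idf, embeddings, maxqlen=3, passagelen=2)
   print(segments)
   print(m.shape)

In this toy run the nine tokens are first split into three fixed-length
segments, and the third is folded into the second because ``p_len=2``. Rows of
the matrix beyond the actual query length are left as zeros, which plays the
role of query padding in this sketch.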