DeepTileExtractor(config=None, provide=None, share_dependency_objects=False, build=True)
Creates a text tiling matrix. Used by the DeepTileBars reranker.
extract_segment(self, doc_toks, ttt, slicelen=20)
- Tries to extract segments using nltk's TextTilingTokenizer (an instance is passed in as the ttt argument)
- If that fails, simply splits the document into segments of slicelen tokens each (20 by default)
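The fallback behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the extractor's actual code: the name `extract_segments_sketch`, the decision to catch `ValueError`, and joining tokens with single spaces are all assumptions. `ttt` can be any object with a `tokenize(text)` method, such as an nltk `TextTilingTokenizer` instance; `FailingTokenizer` is a hypothetical stand-in used only to trigger the fallback.

```python
def extract_segments_sketch(doc_toks, ttt, slicelen=20):
    """Segment a tokenized document, falling back to fixed-size slices."""
    text = " ".join(doc_toks)
    try:
        # Assumption: the tokenizer signals failure by raising ValueError
        segments = ttt.tokenize(text)
        if segments:
            return list(segments)
    except ValueError:
        pass
    # Fallback: consecutive slices of `slicelen` tokens, each joined into a string
    return [
        " ".join(doc_toks[i : i + slicelen])
        for i in range(0, len(doc_toks), slicelen)
    ]


class FailingTokenizer:
    """Hypothetical tokenizer that can never segment its input."""

    def tokenize(self, text):
        raise ValueError("no paragraph breaks found")


toks = [f"w{i}" for i in range(45)]
fallback_segments = extract_segments_sketch(toks, FailingTokenizer())
print([len(s.split()) for s in fallback_segments])  # [20, 20, 5]
```

With 45 tokens and the default slicelen, the fallback yields two full segments and one shorter trailing segment.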
clean_segments(self, segments, p_len=30)
- Pads the list of segments if it is too short
- If it is too long, collapses the extra text into the last element
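A minimal sketch of the two rules above, assuming that "too short" and "too long" are measured against `p_len` as a number of segments. The function name `clean_segments_sketch` and the `"<pad>"` token are hypothetical; the real extractor may use a different pad value.

```python
def clean_segments_sketch(segments, p_len=30, pad_tok="<pad>"):
    """Return exactly p_len segments by padding or by merging the overflow."""
    segments = list(segments)
    if len(segments) < p_len:
        # Too few segments: append pad entries until there are p_len of them
        segments += [pad_tok] * (p_len - len(segments))
    elif len(segments) > p_len:
        # Too many segments: merge everything from position p_len - 1 onward
        # into a single final segment
        segments = segments[: p_len - 1] + [" ".join(segments[p_len - 1 :])]
    return segments


padded = clean_segments_sketch(["a b", "c d"], p_len=4)
collapsed = clean_segments_sketch(["a", "b", "c", "d", "e"], p_len=3)
print(padded)     # ['a b', 'c d', '<pad>', '<pad>']
print(collapsed)  # ['a', 'b', 'c d e']
```

Either way the result has a fixed length, which is what allows the segments to fill a fixed-size passage dimension.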
color_grid(self, q_tok, topic_segment, embeddings_matrix)
See the section titled “Coloring” in the original paper: https://arxiv.org/pdf/1811.00606.pdf
Calculates the TF, IDF and max Gaussian for the given (q_tok, topic_segment) pair.
:param q_tok: List of tokens in a query
:param topic_segment: A single segment. String. (A document can have multiple segments)
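To make the three values concrete, here is a rough sketch of how one tile could be "colored". Several things are assumptions made for illustration: `q_tok` is treated as a single query token, embeddings are looked up in a plain dict rather than an embeddings matrix, IDF comes from a precomputed dict, and the Gaussian is a kernel over squared Euclidean distance with sigma = 1. The names `gaussian_similarity` and `color_grid_sketch` are hypothetical.

```python
import math


def gaussian_similarity(u, v, sigma=1.0):
    """Gaussian kernel on the squared Euclidean distance between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def color_grid_sketch(q_tok, topic_segment, embeddings, idf):
    """Return [tf, idf, max_gaussian] for one (query token, segment) pair."""
    seg_toks = topic_segment.split()
    tf = seg_toks.count(q_tok)
    q_vec = embeddings.get(q_tok)
    sims = []
    if q_vec is not None:
        # Compare the query token with every segment token that has a vector
        sims = [
            gaussian_similarity(q_vec, embeddings[t])
            for t in seg_toks
            if t in embeddings
        ]
    return [float(tf), idf.get(q_tok, 0.0), max(sims, default=0.0)]


# Toy data, invented for the example
embeddings = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
idf = {"cat": 2.5}
tile = color_grid_sketch("cat", "dog cat cat", embeddings, idf)
print(tile)  # [2.0, 2.5, 1.0]
```

An exact match gives a similarity of 1.0, and the value decays towards 0 as the embedding of the closest segment token moves away from the query token.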
create_visualization_matrix(self, query_toks, document_segments, embeddings_matrix)
Returns a tensor of shape (1, maxqlen, passagelen, channels).
The first dimension (i.e. 1) is a dummy dimension and can be ignored.
The 2nd and 3rd dimensions (i.e. maxqlen and passagelen) together index a “tile” between a query token and a passage (i.e. a document segment).
Each tile has up to 3 channels: the TF of the query term in that passage, the IDF of the query term, and the max word2vec similarity between the query term and any term in the passage.
:param query_toks: A list of tokens in the query. E.g. [‘hello’, ‘world’]
:param document_segments: List of segments in a document. Each segment is a string
:param embeddings_matrix: Used to look up word2vec embeddings
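The layout of the returned structure can be illustrated with nested lists in place of a tensor. This is only a sketch: the function name, the `tile_fn` callback, the `"<pad>"` token, and the choice to fill padded query positions with zero tiles are assumptions, and `tf_only_tile` is a deliberately simplified hypothetical tile function that fills only the TF channel.

```python
def visualization_matrix_sketch(query_toks, document_segments, tile_fn,
                                maxqlen, channels=3, pad_tok="<pad>"):
    """Build a nested list of shape (1, maxqlen, passagelen, channels)."""
    # Truncate or pad the query to exactly maxqlen positions
    padded_query = list(query_toks)[:maxqlen]
    padded_query += [pad_tok] * (maxqlen - len(padded_query))
    grid = []
    for q_tok in padded_query:
        row = []
        for segment in document_segments:
            if q_tok == pad_tok:
                row.append([0.0] * channels)  # assumption: padding gives zero tiles
            else:
                row.append(list(tile_fn(q_tok, segment)))
        grid.append(row)
    return [grid]  # leading dummy dimension of size 1


def shape(nested):
    """Report the dimensions of a regular nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


def tf_only_tile(q_tok, segment):
    """Hypothetical tile function: TF in channel 0, zeros elsewhere."""
    return [float(segment.split().count(q_tok)), 0.0, 0.0]


matrix = visualization_matrix_sketch(
    ["hello", "world"],
    ["hello there", "world world", "nothing"],
    tf_only_tile,
    maxqlen=4,
)
print(shape(matrix))       # (1, 4, 3, 3)
print(matrix[0][1][1][0])  # 2.0
```

Here `matrix[0][i][j]` is the tile for query position `i` and document segment `j`, so the second and third dimensions together select one tile.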
id2vec(self, qid, posdocid, negdocid=None, **kwargs)
Creates a feature from the (qid, docid) pair. If negdocid is supplied, it is also included in the feature (needed for training with pairwise hinge loss).
The label is a vector of shape [num_classes] and is supplied only when using pointwise training (i.e. cross entropy). With pointwise samples, negdocid is None, and the label is [0, 1] if the document represented by posdocid is relevant, or [1, 0] if it is irrelevant.
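The difference between pairwise and pointwise samples can be sketched as below. The dictionary keys, the function names, and the `is_relevant` flag are hypothetical and chosen only for illustration; the fields of the real feature are not specified here. Only the label convention ([0, 1] for relevant, [1, 0] for irrelevant) comes from the description above.

```python
def pointwise_label(is_relevant):
    """Two-class one-hot label: [0, 1] if relevant, [1, 0] if irrelevant."""
    return [0, 1] if is_relevant else [1, 0]


def build_feature_sketch(qid, posdocid, negdocid=None, is_relevant=None):
    """Assemble a toy feature dict for pairwise or pointwise training."""
    feature = {"qid": qid, "posdocid": posdocid}  # hypothetical key names
    if negdocid is not None:
        # Pairwise (hinge loss): the negative document travels with the sample
        feature["negdocid"] = negdocid
    if is_relevant is not None:
        # Pointwise (cross entropy): no negative document, but a label vector
        feature["label"] = pointwise_label(is_relevant)
    return feature


pairwise = build_feature_sketch("q1", "d7", negdocid="d9")
pointwise = build_feature_sketch("q1", "d7", is_relevant=False)
print(pairwise)   # {'qid': 'q1', 'posdocid': 'd7', 'negdocid': 'd9'}
print(pointwise)  # {'qid': 'q1', 'posdocid': 'd7', 'label': [1, 0]}
```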