capreolus.extractor.deeptileextractor

Module Contents

Classes

DeepTileExtractor Creates a text tiling matrix. Used by the DeepTileBars reranker.
capreolus.extractor.deeptileextractor.logger[source]
capreolus.extractor.deeptileextractor.CACHE_BASE_PATH[source]
class capreolus.extractor.deeptileextractor.DeepTileExtractor(config=None, provide=None, share_dependency_objects=False, build=True)[source]

Bases: capreolus.extractor.Extractor

Creates a text tiling matrix. Used by the DeepTileBars reranker.

module_name = deeptiles[source]
pad = 0[source]
pad_tok = <pad>[source]
embed_paths[source]
requires_random_seed = True[source]
dependencies[source]
config_spec[source]
load_state(self, qids, docids)[source]
cache_state(self, qids, docids)[source]
get_tf_feature_description(self)[source]
create_tf_feature(self)[source]
parse_tf_example(self, example_proto)[source]
extract_segment(self, doc_toks, ttt, slicelen=20)[source]
  1. Tries to extract segments using nltk's TextTilingTokenizer (an instance is passed as the ttt argument)
  2. If that fails, falls back to splitting the document into fixed slices of slicelen (default: 20) tokens each
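The two-step strategy above can be sketched as follows. This is a minimal illustration, not the library's implementation: the ttt argument is assumed to be an nltk.tokenize.TextTilingTokenizer (or any object with a compatible tokenize method), and the exact exception handling is an assumption.

```python
def extract_segment(doc_toks, ttt, slicelen=20):
    # `ttt` is assumed to behave like nltk.tokenize.TextTilingTokenizer.
    doc = " ".join(doc_toks)
    try:
        # TextTiling relies on paragraph structure and can fail on short docs
        return ttt.tokenize(doc)
    except Exception:
        # Fallback: fixed-width slices of `slicelen` tokens each
        return [" ".join(doc_toks[i:i + slicelen])
                for i in range(0, len(doc_toks), slicelen)]
```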
clean_segments(self, segments, p_len=30)[source]
  1. Pads the list of segments with the pad token if it is shorter than p_len
  2. If it is longer than p_len, collapses the extra text into the last element
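A minimal sketch of the padding/collapsing behavior described above, assuming the segment list is normalized to exactly p_len entries and that the pad token is the class's pad_tok (here passed explicitly as a parameter):

```python
def clean_segments(segments, p_len=30, pad_tok="<pad>"):
    segments = list(segments)
    if len(segments) < p_len:
        # Too short: pad with the pad token up to p_len segments
        segments += [pad_tok] * (p_len - len(segments))
    elif len(segments) > p_len:
        # Too long: collapse the overflow into the last kept element
        segments[p_len - 1] = " ".join(segments[p_len - 1:])
        segments = segments[:p_len]
    return segments
```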
gaussian(self, x1, z1)[source]
color_grid(self, q_tok, topic_segment, embeddings_matrix)[source]

See the section titled “Coloring” in the original paper: https://arxiv.org/pdf/1811.00606.pdf

Calculates the TF, IDF, and max Gaussian similarity for the given q_tok <> topic_segment pair.

:param q_tok: List of tokens in a query
:param topic_segment: A single segment (a string; a document can have multiple segments)
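The Coloring computation above can be sketched as below. This is a hedged illustration, not the library's code: embeddings is assumed to be a token-to-vector mapping standing in for embeddings_matrix, idf is a hypothetical token-to-float mapping, and the Gaussian kernel and sigma value are assumptions.

```python
import math

def color_grid(q_tok, topic_segment, embeddings, idf, sigma=1.0):
    # RBF (Gaussian) similarity between two embedding vectors (assumed kernel)
    def gaussian(x1, z1):
        dist_sq = sum((a - b) ** 2 for a, b in zip(x1, z1))
        return math.exp(-dist_sq / (2 * sigma ** 2))

    seg_toks = topic_segment.split()
    tiles = []
    for q in q_tok:
        tf = float(seg_toks.count(q))  # term frequency in this segment
        sims = [gaussian(embeddings[q], embeddings[t])
                for t in seg_toks if q in embeddings and t in embeddings]
        # Channels: TF, IDF, max Gaussian similarity to any segment term
        tiles.append([tf, idf.get(q, 0.0), max(sims, default=0.0)])
    return tiles
```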

create_visualization_matrix(self, query_toks, document_segments, embeddings_matrix)[source]

Returns a tensor of shape (1, maxqlen, passagelen, channels). The first dimension (i.e. 1) is a dummy; ignore it. The 2nd and 3rd dimensions (i.e. maxqlen and passagelen) together represent a “tile” between a query token and a passage (i.e. a doc segment). Each tile has up to 3 channels: the TF of the query term in that passage, the IDF of the query term, and the max word2vec similarity between the query term and any term in the passage.

:param query_toks: A list of tokens in the query, e.g. [‘hello’, ‘world’]
:param document_segments: List of segments in a document; each segment is a string
:param embeddings_matrix: Used to look up word2vec embeddings
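The tensor layout described above can be sketched in a few lines. This is an illustrative assumption, not the library's implementation: the tile argument is a hypothetical callable standing in for the per-pair channel computation (as color_grid performs), and padding/truncation to maxqlen and passagelen is omitted for brevity.

```python
def create_visualization_matrix(query_toks, document_segments, tile):
    # `tile(q, seg)` is assumed to return the per-channel list
    # (TF, IDF, max similarity) for one query-token/segment pair.
    # The leading dimension of 1 is the dummy batch axis from the docstring.
    return [[[tile(q, seg) for seg in document_segments]
             for q in query_toks]]
```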

exist(self)[source]
preprocess(self, qids, docids, topics)[source]
id2vec(self, qid, posdocid, negdocid=None, **kwargs)[source]

Creates a feature from the (qid, docid) pair. If negdocid is supplied, it is also included in the feature (needed for training with pairwise hinge loss). The label is a vector of shape [num_classes] and is supplied only when using pointwise training (i.e. cross entropy). When using pointwise samples, negdocid is None and the label is either [0, 1] or [1, 0], depending on whether the document represented by posdocid is relevant or irrelevant, respectively.
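The pointwise label convention above can be made concrete with a one-line helper (a hypothetical name, not part of the API):

```python
def pointwise_label(relevant):
    # Convention from the docstring: [0, 1] if the posdocid document
    # is relevant, [1, 0] if it is irrelevant.
    return [0, 1] if relevant else [1, 0]
```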