`capreolus.extractor.deeptileextractor`

Module Contents

Classes

DeepTileExtractor

Creates a text tiling matrix. Used by the DeepTileBars reranker.

Attributes

`logger`
`CACHE_BASE_PATH`

capreolus.extractor.deeptileextractor.logger

capreolus.extractor.deeptileextractor.CACHE_BASE_PATH

class capreolus.extractor.deeptileextractor.DeepTileExtractor(config=None, provide=None, share_dependency_objects=False, build=True)

Bases: capreolus.extractor.Extractor

Creates a text tiling matrix. Used by the DeepTileBars reranker.

module_name = 'deeptiles'

pad = 0

pad_tok = '<pad>'

embed_paths

requires_random_seed = True

dependencies

config_spec

load_state(qids, docids)

cache_state(qids, docids)

abstract get_tf_feature_description()

abstract create_tf_feature()

abstract parse_tf_example(example_proto)

extract_segment(doc_toks, ttt, slicelen=20)

Tries to extract segments using nltk's TextTilingTokenizer (an instance is passed as the ttt argument).
If that fails, simply splits the document into segments of slicelen tokens each (20 by default).
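The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the Capreolus implementation: `extract_segments_sketch` and `FailingTiler` are hypothetical names, and the sketch assumes the tokenizer signals failure by raising `ValueError`.

```python
def extract_segments_sketch(doc_toks, tiler, slicelen=20):
    """Hypothetical helper: try the tiler first, fall back to fixed-size slices."""
    text = " ".join(doc_toks)
    try:
        # Assumption: the tiler exposes tokenize(text) and returns a list of strings.
        return tiler.tokenize(text)
    except ValueError:
        # Fallback: consecutive slices of `slicelen` tokens, joined back into strings.
        return [
            " ".join(doc_toks[i:i + slicelen])
            for i in range(0, len(doc_toks), slicelen)
        ]


class FailingTiler:
    """Stand-in tokenizer that always fails, used to exercise the fallback."""

    def tokenize(self, text):
        raise ValueError("no paragraph breaks found")


toks = [f"t{i}" for i in range(45)]
segments = extract_segments_sketch(toks, FailingTiler())
```

With 45 tokens and the default slicelen, the fallback yields two full segments of 20 tokens and a final segment holding the remaining 5.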

clean_segments(segments, p_len=30)

Pads the list of segments if it is too short (fewer than p_len entries).
If it is too long, collapses the extra text into the last element.
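A minimal sketch of this pad-or-collapse step is shown below. It assumes segments are strings, that padding uses the `'<pad>'` token listed as `pad_tok` above, and that overflow segments are joined with spaces. The function name `clean_segments_sketch` is hypothetical and the real method may differ in detail.

```python
PAD_TOK = "<pad>"  # mirrors the pad_tok attribute documented above


def clean_segments_sketch(segments, p_len=30):
    """Hypothetical helper: return exactly p_len segments."""
    segments = list(segments)
    if len(segments) < p_len:
        # Too short: append pad tokens until the list has p_len entries.
        return segments + [PAD_TOK] * (p_len - len(segments))
    if len(segments) > p_len:
        # Too long: fold everything from position p_len - 1 onward into one segment.
        return segments[:p_len - 1] + [" ".join(segments[p_len - 1:])]
    return segments


padded = clean_segments_sketch(["a", "b"], p_len=3)
collapsed = clean_segments_sketch(["a", "b", "c", "d", "e"], p_len=3)
```

Either way the result has a fixed length, which is what lets every document map onto a matrix with the same passage dimension.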

gaussian(x1, z1)

color_grid(q_tok, topic_segment, embeddings_matrix)

See the section titled “Coloring” in the original paper: https://arxiv.org/pdf/1811.00606.pdf
Calculates the TF, IDF and maximum Gaussian similarity for the given q_tok <> topic_segment pair.

:param q_tok: List of tokens in a query
:param topic_segment: A single segment. String. (A document can have multiple segments)
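To make the three values concrete, here is a small pure-Python sketch of "coloring" one query term against one segment. Everything in it is an assumption for illustration: the function names, the use of a raw term count for TF, the dictionary-based `idf` and `embeddings` lookups, and the Gaussian kernel with sigma = 1. The actual method may scale or look up these quantities differently.

```python
import math


def gaussian_similarity(x, z, sigma=1.0):
    """Gaussian kernel on the squared Euclidean distance between two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def color_tile_sketch(q_tok, segment, idf, embeddings):
    """Hypothetical helper returning [tf, idf, max gaussian similarity]."""
    seg_toks = segment.split()
    tf = float(seg_toks.count(q_tok))  # assumption: raw count, no scaling
    q_vec = embeddings.get(q_tok)
    sims = []
    if q_vec is not None:
        # Compare the query term with every segment term that has an embedding.
        sims = [
            gaussian_similarity(q_vec, embeddings[t])
            for t in seg_toks
            if t in embeddings
        ]
    return [tf, idf.get(q_tok, 0.0), max(sims, default=0.0)]


# Toy data, invented for this example only.
embeddings = {"cat": [1.0, 0.0], "feline": [1.0, 0.0], "dog": [0.0, 1.0]}
idf = {"cat": 2.0}
tile = color_tile_sketch("cat", "feline dog", idf, embeddings)
```

In this toy case the query term never occurs in the segment (TF is 0), yet the similarity channel is at its maximum because "feline" has the same vector as "cat".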

create_visualization_matrix(query_toks, document_segments, embeddings_matrix)

Returns a tensor of shape (1, maxqlen, passagelen, channels).
The first dimension (i.e. 1) is a dummy dimension and can be ignored.
The 2nd and 3rd dimensions (i.e. maxqlen and passagelen) together index a “tile” between a query token and a passage (i.e. a document segment).
Each tile has up to 3 channels: the TF of the query term in that passage, the IDF of the query term, and the maximum word2vec similarity between the query term and any term in the passage.

:param query_toks: A list of tokens in the query. E.g. [‘hello’, ‘world’]
:param document_segments: List of segments in a document. Each segment is a string
:param embeddings_matrix: Used to look up word2vec embeddings
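The layout of the returned tensor can be illustrated with nested lists. This is a sketch under stated assumptions: `build_matrix_sketch` and `toy_tile` are hypothetical names, the tile function is a stand-in for the real coloring step, and positions beyond the actual query or document length are assumed to be zero-filled.

```python
def build_matrix_sketch(query_toks, segments, tile_fn, maxqlen, passagelen, channels=3):
    """Hypothetical helper: nested lists shaped (1, maxqlen, passagelen, channels)."""
    rows = []
    for qi in range(maxqlen):
        row = []
        for si in range(passagelen):
            if qi < len(query_toks) and si < len(segments):
                # One tile per (query token, segment) pair.
                row.append(list(tile_fn(query_toks[qi], segments[si])))
            else:
                # Assumption: out-of-range positions are zero-filled.
                row.append([0.0] * channels)
        rows.append(row)
    return [rows]  # leading dummy dimension of size 1


def toy_tile(q_tok, segment):
    """Stand-in tile: raw TF plus constant placeholders for the other channels."""
    return [float(segment.split().count(q_tok)), 1.0, 0.0]


matrix = build_matrix_sketch(
    ["hello", "world"], ["hello there", "world world"], toy_tile,
    maxqlen=4, passagelen=3,
)
```

Here `matrix[0][1][1]` is the tile for the query token "world" against the second segment, so its first channel holds a term frequency of 2.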

exist()

preprocess(qids, docids, topics)

id2vec(qid, posdocid, negdocid=None, *args, **kwargs)

Creates a feature from the (qid, docid) pair.
If negdocid is supplied, it is also included in the feature (needed for training with pairwise hinge loss).
Label is a vector of shape [num_classes], and is supplied only when using pointwise training (i.e. cross entropy).
When using pointwise samples, negdocid is None, and label is either [0, 1] or [1, 0], depending on whether the document represented by posdocid is relevant or irrelevant, respectively.
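The pointwise label convention described above amounts to a two-class one-hot vector. The helper below is purely illustrative (`pointwise_label` is a hypothetical name and not part of the extractor's API):

```python
def pointwise_label(is_relevant):
    """Hypothetical helper: [0, 1] for a relevant document, [1, 0] otherwise."""
    return [0, 1] if is_relevant else [1, 0]


relevant_label = pointwise_label(True)
irrelevant_label = pointwise_label(False)
```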
