Extracts passages from the document to be later consumed by a BERT-based model.
- class capreolus.extractor.bertpassage.BertPassage(config=None, provide=None, share_dependency_objects=False, build=True)
Extracts passages from the document to be later consumed by a BERT-based model. Does NOT use all the passages. The first passage is always used; use the prob config to control the probability of a passage being selected. Gotcha: in TensorFlow, the train tfrecords have shape (batch_size, maxseqlen), while the dev tfrecords have shape (batch_size, num_passages, maxseqlen). This is because, during inference, we want to pool over the scores of the passages belonging to a doc.
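The selection rule described above (first passage always kept, others kept with probability prob) can be sketched as follows. This is a minimal illustration, not the code used by BertPassage: the helper name `select_passages` and its `rng` parameter are hypothetical, and the real extractor may sample differently.

```python
import random


def select_passages(passages, prob, rng=random):
    """Hypothetical helper: keep the first passage unconditionally and
    keep each later passage independently with probability `prob`."""
    if not passages:
        return []
    selected = [passages[0]]  # the first passage is always used
    for passage in passages[1:]:
        # rng.random() is uniform in [0.0, 1.0), so prob=0.0 keeps nothing
        # beyond the first passage and prob=1.0 keeps everything.
        if rng.random() < prob:
            selected.append(passage)
    return selected


# Example usage with a seeded generator so runs are repeatable.
doc_passages = ["p0", "p1", "p2", "p3"]
subset = select_passages(doc_passages, prob=0.1, rng=random.Random(0))
```

Whatever the value of prob, the result always starts with the first passage, which matches the guarantee stated in the class description.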
- module_name = bertpassage
- load_state(self, qids, docids)
- cache_state(self, qids, docids)
- create_tf_train_feature(self, sample)
Returns a set of features from a doc. Of the num_passages passages present in a document, only a subset is used. params: sample - a dict where each entry has the shape [batch_size, num_passages, maxseqlen]
Returns a list of features. Each feature is a dict, and each value in the dict has the shape [batch_size, maxseqlen]. The output shape differs from the input shape because we sample from the passages.
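To make the shape change concrete, here is a small sketch that uses nested Python lists in place of tensors. The helper `split_by_passage`, the key name `input_ids`, and the chosen passage indices are all assumptions made for illustration; the actual method operates on TensorFlow tensors and picks the passages itself.

```python
def split_by_passage(sample, passage_indices):
    """Hypothetical helper: turn one dict of [batch_size, num_passages, maxseqlen]
    values into a list of dicts of [batch_size, maxseqlen] values,
    one dict per selected passage index."""
    features = []
    for p in passage_indices:
        # For every example in the batch, keep only passage p.
        features.append(
            {name: [example[p] for example in value] for name, value in sample.items()}
        )
    return features


# batch_size=2, num_passages=3, maxseqlen=4; token values encode their position.
sample = {
    "input_ids": [
        [[b * 100 + p * 10 + t for t in range(4)] for p in range(3)]
        for b in range(2)
    ]
}

# Suppose passages 0 and 2 were sampled: two features come back.
features = split_by_passage(sample, [0, 2])
```

Each returned dict has dropped the num_passages axis, so a training record carries a single passage per example.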
- create_tf_dev_feature(self, sample)
Unlike the train feature, the dev feature uses all passages. Both the input and the output are dicts whose entries have the shape [batch_size, num_passages, maxseqlen].
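Keeping every passage at dev time is what allows per-passage scores to be pooled into one score per document during inference. The sketch below assumes max pooling as the default; the pooling function used in practice depends on the model and is not specified here, and `pool_passage_scores` is a hypothetical name.

```python
def pool_passage_scores(passage_scores, pool=max):
    """Hypothetical helper: reduce [batch_size, num_passages] passage scores
    to [batch_size] document scores with the given pooling function."""
    return [pool(doc_scores) for doc_scores in passage_scores]


# Two documents with three passage scores each.
scores = [[0.1, 0.7, 0.3], [0.5, 0.2, 0.4]]
doc_scores = pool_passage_scores(scores)  # max over each document's passages

# Any callable that maps a list of scores to one number can be swapped in.
first_passage_only = pool_passage_scores(scores, pool=lambda s: s[0])
```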
- parse_tf_train_example(self, example_proto)
- parse_tf_dev_example(self, example_proto)
- preprocess(self, qids, docids, topics)
- id2vec(self, qid, posid, negid=None, label=None)
See parent class for docstring