trec-car-tools¶
This is the documentation for trec-car-tools, a Python 3 library for reading
and manipulating the TREC Complex Answer Retrieval (CAR) dataset.
Getting started¶
This library requires Python 3.3 or greater. It can be installed with
setup.py:
python3 ./setup.py install
If you are using Anaconda, install the cbor
library for Python 3.6:
conda install -c laura-dietz cbor=1.0.0
Once you have installed the library, you can download a dataset and start playing.
Reading the dataset¶
The TREC CAR dataset consists of a number of different exports. These include:
- Annotations files (also called “pages files”) contain full Wikipedia pages and their contents
- Paragraphs files contain only paragraphs disembodied from their pages
- Outlines files contain only the section structure of pages and no textual content
To read an annotations file use the iter_annotations() function:
-
trec_car.read_data.iter_annotations(file)[source]¶ Iterate over the
Pages of an annotations file. Return type: typing.Iterator[Page]
For instance, to list the page IDs of pages in a pages file one might write:
from trec_car import read_data
with open('train.test200.cbor', 'rb') as f:
    for page in read_data.iter_annotations(f):
        print(page.page_id)
Likewise, to read a paragraphs file the iter_paragraphs() function is
provided:
-
trec_car.read_data.iter_paragraphs(file)[source]¶ Iterate over the
Paragraphs of a paragraphs file. Return type: typing.Iterator[Paragraph]
To list the text of all paragraphs in a paragraphs file one might write:
from trec_car import read_data
with open('train.test200.cbor', 'rb') as f:
    for para in read_data.iter_paragraphs(f):
        print(para.get_text())
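When exploring a large export it is often enough to look at the first few items. The following is a minimal sketch of that idea, not part of the library: `take` is a hypothetical helper that accepts any reader function (such as `iter_annotations` or `iter_paragraphs`) and assumes the reader yields items lazily.

```python
from itertools import islice

def take(path, reader, n=5):
    """Return the first n items that `reader` yields for the file at `path`."""
    # The exports are binary files, so the file is opened in 'rb' mode.
    with open(path, 'rb') as f:
        # islice stops after n items, so the iterator is not advanced further.
        return list(islice(reader(f), n))

# Hypothetical usage with the library installed:
#   from trec_car import read_data
#   pages = take('train.test200.cbor', read_data.iter_annotations, n=3)
```

Passing the reader as an argument keeps the helper independent of which kind of export is being read.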
Basic types¶
-
class
trec_car.read_data.PageName¶ PageName represents the natural language “name” of a page. Note that this means that it is not necessarily unique. If you need a unique handle for a page, use PageId.
-
class
trec_car.read_data.PageId¶ A
PageId is the unique identifier for a Page.
The Page type¶
-
class
trec_car.read_data.Page(page_name, page_id, skeleton, page_type, page_meta)[source]¶ The name and skeleton of a Wikipedia page.
-
skeleton¶ Return type: typing.List[PageSkeleton] The contents of the page
-
page_meta¶ Return type: PageMetadata Metadata about the page
-
flat_headings_list()[source]¶ Returns a flat list of headings contained by the
Page. Return type: typing.List[Section]
-
get_text()[source]¶ Include all visible text below this element. Includes captions of images, but no headings and no infoboxes. See get_text_with_headings for a version that includes headings.
-
get_text_with_headings(include_heading=False)[source]¶ Include all visible text below this element. While the heading of this element is excluded, headings of subsections will be included. Captions of images are excluded.
-
-
class
trec_car.read_data.PageMetadata(redirectNames, disambiguationNames, disambiguationIds, categoryNames, categoryIds, inlinkIds, inlinkAnchors, wikiDataQid, siteId, pageTags)[source]¶ Metadata for a page
-
categoryNames¶ Return type: str Page names of categories to which this page belongs
-
categoryIds¶ Return type: str Page IDs of categories to which this page belongs
-
inlinkIds¶ Return type: str Page IDs of pages containing inlinks
-
inlinkAnchors¶ Return type: str (Anchor text, frequency) pairs of pages containing inlinks
-
wikidataQid¶ Return type: str Language and time independent Wikidata IDs (e.g. Q12345)
-
siteId¶ Return type: str SiteId (e.g. enwiki). The combination of WikidataQid and SiteId identifies a page in a Wikipedia across timestamps. Note that PageName and PageId can change over time.
-
pageTags¶ Return type: str Template tags of pages, e.g. “Good article” or “Vital article”
-
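Because PageName and PageId can change over time while the WikidataQid and SiteId pair does not, that pair can serve as a key when matching pages across dumps. The following is a minimal sketch under stated assumptions: `stable_key` and `index_by_stable_key` are hypothetical helpers, and the handling of missing values (treated here as `None` or empty) is an assumption rather than documented behaviour.

```python
def stable_key(page):
    """Return a (siteId, wikidataQid) pair, or None if either part is missing."""
    meta = page.page_meta
    # Assumption: missing metadata shows up as None or an empty value.
    if meta is None or not meta.siteId or not meta.wikidataQid:
        return None
    return (meta.siteId, meta.wikidataQid)

def index_by_stable_key(pages):
    """Map stable keys to pages; pages without a complete key are skipped."""
    index = {}
    for page in pages:
        key = stable_key(page)
        if key is not None:
            index[key] = page
    return index
```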
Page structure¶
The high-level structure of a Page is captured by the subclasses of
PageSkeleton.
-
class
trec_car.read_data.PageSkeleton[source]¶ An abstract superclass for the various types of page elements. Subclasses include Para, Section, List, and Image.
-
class
trec_car.read_data.Para(paragraph)[source]¶ Bases:
trec_car.read_data.PageSkeleton A paragraph within a Wikipedia page.
-
paragraph¶ Return type: Paragraph The content of the Paragraph (which in turn contains a list of
ParaBodys)
-
-
class
trec_car.read_data.Section(heading, headingId, children)[source]¶ Bases:
trec_car.read_data.PageSkeleton A section of a Wikipedia page.
-
heading¶ Return type: str The section heading.
-
headingId¶ Return type: str The unique identifier of a section heading.
-
children¶ Return type: typing.List[PageSkeleton] The
PageSkeleton elements contained by the section.
-
-
class
trec_car.read_data.List(level, body)[source]¶ Bases:
trec_car.read_data.PageSkeleton A list element within a Wikipedia page.
-
level¶ Return type: int The list nesting level
-
-
class
trec_car.read_data.Image(imageurl, caption)[source]¶ Bases:
trec_car.read_data.PageSkeleton An image within a Wikipedia page.
-
caption¶ Return type: str PageSkeleton representing the caption of the image
-
imageurl¶ Return type: str URL to the image; spaces need to be replaced with underscores, Wikimedia Commons namespace needs to be prefixed
-
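Since a Section holds further PageSkeleton elements in its children, a page's skeleton forms a tree. The following is a minimal sketch of a recursive walk over that tree which yields the heading path of every section. `heading_paths` is a hypothetical helper, not a library function, and it recognises sections by duck typing on the `heading` and `children` attributes described above.

```python
def heading_paths(skeleton, prefix=()):
    """Yield a tuple of headings (outermost first) for every section."""
    for elem in skeleton:
        # Only Section-like elements carry both `heading` and `children`.
        if hasattr(elem, 'heading') and hasattr(elem, 'children'):
            path = prefix + (elem.heading,)
            yield path
            # Recurse into the elements nested inside this section.
            yield from heading_paths(elem.children, path)

# Hypothetical usage with a Page read from an annotations file:
#   for path in heading_paths(page.skeleton):
#       print(' / '.join(path))
```

Elements without both attributes, such as paragraphs, lists, and images, are skipped.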
Paragraph contents¶
-
class
trec_car.read_data.ParaBody[source]¶ An abstract superclass representing a bit of
Paragraph content.
-
class
trec_car.read_data.ParaText(text)[source]¶ Bases:
trec_car.read_data.ParaBody A bit of plain text from a paragraph.
-
text¶ Return type: str The text
-
-
class
trec_car.read_data.ParaLink(page, link_section, pageid, anchor_text)[source]¶ Bases:
trec_car.read_data.ParaBody A link within a paragraph.
-
link_section¶ Return type: str Section anchor of link target (i.e. the part after the
# in the URL), or None.
-
anchor_text¶ Return type: str The anchor text of the link
-
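A paragraph's content mixes plain text and links, so pulling out only the links means filtering a sequence of ParaBody elements. The following is a minimal sketch, assuming such a sequence is already at hand. `extract_links` is a hypothetical helper, and it assumes a ParaLink exposes its `page` constructor argument as an attribute of the same name.

```python
def extract_links(bodies):
    """Return (anchor_text, page, link_section) for each link-like element."""
    links = []
    for body in bodies:
        # Plain-text elements have no `anchor_text`, so they are skipped.
        if hasattr(body, 'anchor_text'):
            links.append((body.anchor_text, body.page, body.link_section))
    return links
```

Checking for the `anchor_text` attribute rather than the class keeps the sketch usable with stand-in objects in tests.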