trec-car-tools¶

This is the documentation for trec-car-tools, a Python 3 library for reading and manipulating the TREC Complex Answer Retrieval (CAR) dataset.

Getting started¶

This library requires Python 3.3 or greater. It can be installed with setup.py:

python3 ./setup.py install

If you are using Anaconda, install the cbor library for Python 3.6:

conda install -c laura-dietz cbor=1.0.0

Once you have installed the library, you can download a dataset and start playing.

Reading the dataset¶

The TREC CAR dataset consists of a number of different exports. These include:

Annotations files (also called “pages files”) contain full Wikipedia pages and their contents

Paragraphs files contain only paragraphs disembodied from their pages

Outlines files contain only the section structure of pages and no textual content

To read an annotations file use the iter_annotations() function:

trec_car.read_data.iter_annotations(file)[source]¶

Iterate over the Pages of an annotations file.

Return type:	typing.Iterator[Page]

For instance, to list the page IDs of the pages in a pages file, one might write:

from trec_car import read_data
# Print the unique ID of every page in the annotations file.
for page in read_data.iter_annotations(open('train.test200.cbor', 'rb')):
    print(page.page_id)
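When only a sample of a large pages file is needed, the iterator can be cut short instead of being read to the end. The sketch below is a minimal illustration and not part of the library: page_names and page_names_from_file are hypothetical helper names, and it assumes that the page_name constructor argument of Page is exposed as an attribute of the same name.

```python
from itertools import islice


def page_names(pages, limit=5):
    """Return the page_name of at most `limit` pages from an iterator of pages."""
    # islice stops pulling from the iterator after `limit` items,
    # so the rest of the file is never decoded.
    return [page.page_name for page in islice(pages, limit)]


def page_names_from_file(cbor_path, limit=5):
    """Open an annotations file and return the first few page names."""
    # Imported here so page_names() stays usable without the library installed.
    from trec_car import read_data

    # The with-block closes the file once the names have been collected.
    with open(cbor_path, 'rb') as f:
        return page_names(read_data.iter_annotations(f), limit)
```

Calling page_names_from_file('train.test200.cbor', limit=3) would then return a list of up to three page names.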

Likewise, to read a paragraphs file, the iter_paragraphs() function is provided:

trec_car.read_data.iter_paragraphs(file)[source]¶

Iterate over the Paragraphs of a paragraphs file.

Return type:	typing.Iterator[Paragraph]

To list the text of all paragraphs in a paragraphs file, one might write:

from trec_car import read_data
# Print the plain text of every paragraph in the paragraphs file.
for para in read_data.iter_paragraphs(open('train.test200.cbor', 'rb')):
    print(para.get_text())
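A common next step is to look paragraphs up by their identifier. The following is a small sketch under one assumption: that Paragraph exposes its para_id constructor argument as an attribute of the same name. The helper name index_paragraphs is hypothetical and not part of the library.

```python
def index_paragraphs(paragraphs):
    """Map para_id -> paragraph text for every paragraph in the iterator."""
    # Assumption: each Paragraph has a para_id attribute (it is a
    # constructor argument in the reference below).
    return {para.para_id: para.get_text() for para in paragraphs}


# Possible usage, with the file name from the example above:
#   from trec_car import read_data
#   with open('train.test200.cbor', 'rb') as f:
#       index = index_paragraphs(read_data.iter_paragraphs(f))
```

Because the whole mapping is held in memory, this approach only suits files that fit comfortably in RAM.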

Basic types¶

class trec_car.read_data.PageName¶: PageName represents the natural language “name” of a page. Note that this means that it is not necessarily unique. If you need a unique handle for a page use PageId.

class trec_car.read_data.PageId¶: A PageId is the unique identifier for a Page.
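To see why a PageName should not be used as a key, one can group page IDs by name and look for names that map to more than one ID. This is only a sketch: shared_names is a hypothetical helper, and it assumes that pages expose page_name and page_id attributes and that both values are hashable.

```python
from collections import defaultdict


def shared_names(pages):
    """Return {page_name: set of page_ids} for names used by more than one page."""
    ids_by_name = defaultdict(set)
    for page in pages:
        ids_by_name[page.page_name].add(page.page_id)
    # Keep only the names that are not unique handles.
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
```

An empty result means that, in the given collection, names happen to be unique; the documentation above still gives no guarantee that this holds in general.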

The `Page` type¶

class trec_car.read_data.Page(page_name, page_id, skeleton, page_type, page_meta)[source]¶

The name and skeleton of a Wikipedia page.

page_name¶

Return type:	PageName

The name of the page.

skeleton¶

Return type:	typing.List[PageSkeleton]

The contents of the page

page_type¶

Return type:	PageType

The type of the page

page_meta¶

Return type:	PageMetadata

Metadata about the page

flat_headings_list()[source]¶

Returns a flat list of headings contained by the Page.

Return type:	typing.List[Section]

get_text()[source]¶: Returns all visible text below this element. Image captions are included; headings and infoboxes are not. See get_text_with_headings for a version that includes headings.

get_text_with_headings(include_heading=False)[source]¶: Returns all visible text below this element. By default the heading of this element is excluded, while headings of subsections are included. Image captions are excluded.

nested_headings()[source]¶

Each heading recursively represented by a pair of (heading, list_of_child_sections).

Return type:	typing.List[typing.Tuple[Section, typing.List[Section]]]

to_string()[source]¶

Render a string representation of the page.

Return type:	str
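The section structure of a page can also be recovered by walking its skeleton directly. The sketch below builds an indented outline from the heading and children attributes documented for Section further down. The function name outline is hypothetical, and the code uses duck typing (checking for attributes) rather than isinstance so that it does not depend on the library being importable.

```python
def outline(skeleton, depth=0):
    """Return indented heading lines for a list of PageSkeleton elements."""
    lines = []
    for element in skeleton:
        # Only Section elements carry both a heading and children;
        # paragraphs, lists and images are skipped.
        if hasattr(element, 'heading') and hasattr(element, 'children'):
            lines.append('  ' * depth + element.heading)
            # Recurse into the section, indenting one level deeper.
            lines.extend(outline(element.children, depth + 1))
    return lines
```

For a Page object named page, print('\n'.join(outline(page.skeleton))) would print one heading per line, indented by nesting depth.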

class trec_car.read_data.PageMetadata(redirectNames, disambiguationNames, disambiguationIds, categoryNames, categoryIds, inlinkIds, inlinkAnchors, wikiDataQid, siteId, pageTags)[source]¶

Metadata for a page

redirectNames¶

Return type:	PageName

Names of pages which redirect to this page

disambiguationNames¶

Return type:	PageName

Names of disambiguation pages which link to this page

disambiguationIds¶

Return type:	PageId

Page IDs of disambiguation pages which link to this page

categoryNames¶

Return type:	str

Page names of categories to which this page belongs

categoryIds¶

Return type:	str

Page IDs of categories to which this page belongs

inlinkIds¶

Return type:	str

Page IDs of pages containing inlinks

inlinkAnchors¶

Return type:	str

Inlink anchor frequencies: (anchor text, frequency) pairs for pages containing inlinks

wikidataQid¶

Return type:	str

Language and time independent Wikidata IDs (e.g. Q12345)

siteId¶

Return type:	str

SiteId (e.g. enwiki). The combination of WikidataQid and SiteId identifies a page within a Wikipedia across timestamps. Note that PageName and PageId can change over time.

pageTags¶

Return type:	str

Template tags of pages, e.g. “Good article” or “Vital article”
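Since PageName and PageId can change over time, the (siteId, wikidataQid) pair is the natural key for matching pages between two dumps. The following is a sketch only. It assumes the attribute is spelled wikidataQid as in the attribute listing above (the constructor signature spells the argument wikiDataQid, so check the installed version), and that a missing value is represented as None. The names stable_key and match_across_dumps are hypothetical.

```python
def stable_key(page):
    """Return (siteId, wikidataQid) for a page, or None if it has no Wikidata ID."""
    meta = page.page_meta
    # Assumption: absent metadata or an absent Wikidata ID shows up as None.
    if meta is None or meta.wikidataQid is None:
        return None
    return (meta.siteId, meta.wikidataQid)


def match_across_dumps(old_pages, new_pages):
    """Pair page IDs from two dumps that share the same stable key."""
    old_ids = {}
    for page in old_pages:
        key = stable_key(page)
        if key is not None:
            old_ids[key] = page.page_id

    matches = []
    for page in new_pages:
        key = stable_key(page)
        # None is never stored in old_ids, so pages without a key are skipped.
        if key in old_ids:
            matches.append((old_ids[key], page.page_id))
    return matches
```

Each returned pair holds the old and the new page ID of what is, under these assumptions, the same page.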

Types of pages¶

class trec_car.read_data.PageType[source]¶

An abstract base class representing the various types of pages.

Subclasses include

ArticlePage
CategoryPage
DisambiguationPage
RedirectPage

The abstract base class.

class trec_car.read_data.ArticlePage[source]¶

class trec_car.read_data.CategoryPage[source]¶

class trec_car.read_data.DisambiguationPage[source]¶

class trec_car.read_data.RedirectPage(targetPage)[source]¶

targetPage¶

Return type:	PageId

The target of the redirect.
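The page_type attribute of a Page makes it possible to treat the different kinds of pages separately. The sketch below counts pages per type and collects redirect targets. Both function names are hypothetical, and the code identifies types by class name and by the presence of targetPage rather than with isinstance, so that it runs without importing the library; with the library available, isinstance(page.page_type, read_data.RedirectPage) would be the more direct check.

```python
from collections import Counter


def count_page_types(pages):
    """Count pages by the class name of their page_type."""
    return Counter(type(page.page_type).__name__ for page in pages)


def redirect_targets(pages):
    """Map the page_id of each redirect page to the PageId it points at."""
    return {
        page.page_id: page.page_type.targetPage
        for page in pages
        # Only RedirectPage is documented as having a targetPage attribute.
        if hasattr(page.page_type, 'targetPage')
    }
```

Note that both helpers consume their argument, so they need to be given a list, or a fresh iterator each, when used together.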

Page structure¶

The high-level structure of a Page is captured by the subclasses of PageSkeleton.

class trec_car.read_data.PageSkeleton[source]¶

An abstract superclass for the various types of page elements. Subclasses include:

Section
Para
List
Image

get_text()[source]¶: Returns the visible text of this element and the elements below it. Headings are excluded, image captions are included, and infoboxes are ignored. (For a version with headings and no captions, see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]¶: Returns all visible text below this element. By default the heading of this element is excluded, while headings of subsections are included. Image captions are excluded.

class trec_car.read_data.Para(paragraph)[source]¶

Bases: trec_car.read_data.PageSkeleton

A paragraph within a Wikipedia page.

paragraph¶

Return type:	Paragraph

The content of the Paragraph (which in turn contains a list of ParaBodys)

get_text()[source]¶: Returns the visible text of this element and the elements below it. Headings are excluded, image captions are included, and infoboxes are ignored. (For a version with headings and no captions, see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]¶: Returns all visible text below this element. By default the heading of this element is excluded, while headings of subsections are included. Image captions are excluded.

class trec_car.read_data.Section(heading, headingId, children)[source]¶

Bases: trec_car.read_data.PageSkeleton

A section of a Wikipedia page.

heading¶

Return type:	str

The section heading.

headingId¶

Return type:	str

The unique identifier of a section heading.

children¶

Return type:	typing.List[PageSkeleton]

The PageSkeleton elements contained by the section.

get_text()[source]¶: Returns the visible text of this element and the elements below it. Headings are excluded, image captions are included, and infoboxes are ignored. (For a version with headings and no captions, see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]¶: Returns all visible text below this element. By default the heading of this element is excluded, while headings of subsections are included. Image captions are excluded.

class trec_car.read_data.List(level, body)[source]¶

Bases: trec_car.read_data.PageSkeleton

A list element within a Wikipedia page.

level¶

Return type:	int

The list nesting level

body¶: A Paragraph containing the list element contents.

get_text()[source]¶: Returns the visible text of this element and the elements below it. Headings are excluded, image captions are included, and infoboxes are ignored. (For a version with headings and no captions, see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]¶: Returns all visible text below this element. By default the heading of this element is excluded, while headings of subsections are included. Image captions are excluded.

class trec_car.read_data.Image(imageurl, caption)[source]¶

Bases: trec_car.read_data.PageSkeleton

An image within a Wikipedia page.

caption¶

Return type:	str

PageSkeleton representing the caption of the image

imageurl¶

Return type:	str

URL to the image; spaces need to be replaced with underscores, Wikimedia Commons namespace needs to be prefixed

get_text()[source]¶: Returns the visible text of this element and the elements below it. Headings are excluded, image captions are included, and infoboxes are ignored. (For a version with headings and no captions, see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]¶: Returns all visible text below this element. By default the heading of this element is excluded, while headings of subsections are included. Image captions are excluded.
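Because Section elements nest, collecting every paragraph of a page calls for a recursive walk over the skeleton. The sketch below follows the attributes documented above: children for Section, paragraph for Para, and body for List. The function name paragraphs_in is hypothetical; the attribute checks stand in for isinstance tests so that the example does not need the library, and images are skipped.

```python
def paragraphs_in(skeleton):
    """Yield the Paragraph objects of a skeleton in document order."""
    for element in skeleton:
        if hasattr(element, 'children'):
            # Section: descend into the nested elements.
            yield from paragraphs_in(element.children)
        elif hasattr(element, 'paragraph'):
            # Para: wraps a single Paragraph.
            yield element.paragraph
        elif hasattr(element, 'body'):
            # List: the item's contents are stored as a Paragraph in body.
            yield element.body
        # Image elements (and anything unrecognized) are ignored here.
```

For a Page object named page, [p.get_text() for p in paragraphs_in(page.skeleton)] would give the text of each paragraph and list item in order.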

Paragraph contents¶

class trec_car.read_data.Paragraph(para_id, bodies)[source]¶

A paragraph.

get_text()[source]¶

Get all of the contained text.

Return type:	str

class trec_car.read_data.ParaBody[source]¶

An abstract superclass representing a bit of Paragraph content.

get_text()[source]¶

Get all of the text within a ParaBody.

Return type:	str

class trec_car.read_data.ParaText(text)[source]¶

Bases: trec_car.read_data.ParaBody

A bit of plain text from a paragraph.

text¶

Return type:	str

The text

get_text()[source]¶

Get all of the text within a ParaBody.

Return type:	str

class trec_car.read_data.ParaLink(page, link_section, pageid, anchor_text)[source]¶

Bases: trec_car.read_data.ParaBody

A link within a paragraph.

page¶

Return type:	PageName

The page name of the link target

pageid¶

Return type:	PageId

The link target as a trec-car identifier

link_section¶

Return type:	str

Section anchor of link target (i.e. the part after the # in the URL), or None.

anchor_text¶

Return type:	str

The anchor text of the link

get_text()[source]¶

Get all of the text within a ParaBody.

Return type:	str
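The ParaBody subclasses make it straightforward to pull the links out of a paragraph. The following sketch assumes that Paragraph exposes its bodies constructor argument as an attribute of the same name. The function name outgoing_links is hypothetical, and links are recognized by the presence of a pageid attribute instead of an isinstance check against ParaLink, so that the example runs without the library.

```python
def outgoing_links(paragraph):
    """Return (anchor_text, pageid) for every link in a paragraph."""
    links = []
    # Assumption: paragraph.bodies is the list of ParaBody elements.
    for body in paragraph.bodies:
        # Of the documented ParaBody types, only ParaLink has a pageid;
        # ParaText elements are skipped.
        if hasattr(body, 'pageid'):
            links.append((body.anchor_text, body.pageid))
    return links
```

The result preserves the order in which the links appear in the paragraph.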
