trec-car-tools

This is the documentation for trec-car-tools, a Python 3 library for reading and manipulating the TREC Complex Answer Retrieval (CAR) dataset.

Getting started

This library requires Python 3.3 or greater. It can be installed with setup.py:

python3 ./setup.py install

If you are using Anaconda, install the cbor library for Python 3.5 / 3.6:

conda install -c laura-dietz cbor=1.0.0

Once you have installed the library, you can download a dataset and start playing.

Reading the dataset

The TREC CAR dataset consists of a number of different exports. These include:

  • Annotations files (also called “pages files”) contain full Wikipedia pages and their contents
  • Paragraphs files contain only paragraphs disembodied from their pages
  • Outlines files contain only the section structure of pages and no textual content

To read an annotations file use the iter_annotations() function:

trec_car.read_data.iter_annotations(file)[source]

Iterate over the Pages of an annotations file.

Return type:typing.Iterator[Page]

For instance, to list the page IDs of pages in a pages file one might write:

from trec_car import read_data

for page in read_data.iter_annotations(open('train.test200.cbor', 'rb')):
    print(page.page_id)
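Because iter_annotations() returns an iterator, a large file can be sampled without reading all of it. The following is a minimal sketch of that idea; first_page_ids is a hypothetical helper, not part of the library, and the stand-in objects only mimic the page_id attribute named in the Page constructor.

```python
from itertools import islice
from types import SimpleNamespace


def first_page_ids(pages, limit=5):
    """Return the page_id of at most `limit` pages from an iterator of pages."""
    # islice stops pulling from the iterator once `limit` items were seen.
    return [page.page_id for page in islice(pages, limit)]


# Stand-in objects so the sketch runs without the dataset. With the real
# library one would pass read_data.iter_annotations(f) instead, where f is
# a pages file opened in binary mode ('rb').
fake_pages = (SimpleNamespace(page_id="page-%d" % i) for i in range(100))
print(first_page_ids(fake_pages, limit=3))
```

Opening the file in a with block is one way to make sure it is closed once the sample has been taken.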

Likewise, to read a paragraphs file the iter_paragraphs() function is provided:

trec_car.read_data.iter_paragraphs(file)[source]

Iterate over the Paragraphs of a paragraphs file.

Return type:typing.Iterator[Paragraph]

To list the text of all paragraphs in a paragraphs file one might write:

for para in read_data.iter_paragraphs(open('train.test200.cbor', 'rb')):
    print(para.get_text())

Basic types

class trec_car.read_data.PageName

PageName represents the natural language “name” of a page. Note that this means that it is not necessarily unique. If you need a unique handle for a page use PageId.

class trec_car.read_data.PageId

A PageId is the unique identifier for a Page.
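To illustrate why a PageName is not a safe key, the sketch below groups page IDs by page name; any name that maps to more than one ID is ambiguous. ids_by_name is a hypothetical helper, and the sketch assumes names and IDs behave like hashable strings, which this documentation does not state.

```python
from collections import defaultdict
from types import SimpleNamespace


def ids_by_name(pages):
    """Map each page name to the list of page IDs that carry that name."""
    index = defaultdict(list)
    for page in pages:
        index[page.page_name].append(page.page_id)
    return dict(index)


# Stand-in pages with made-up names and IDs, for illustration only.
pages = [
    SimpleNamespace(page_name="Mercury", page_id="id-planet"),
    SimpleNamespace(page_name="Mercury", page_id="id-element"),
    SimpleNamespace(page_name="Venus", page_id="id-venus"),
]
print(ids_by_name(pages))
```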

The Page type

class trec_car.read_data.Page(page_name, page_id, skeleton, page_type, page_meta)[source]

The name and skeleton of a Wikipedia page.

page_name
Return type:PageName

The name of the page.

skeleton
Return type:typing.List[PageSkeleton]

The contents of the page

page_type
Return type:PageType

The type of the page

page_meta
Return type:PageMetadata

Metadata about the page

flat_headings_list()[source]

Returns a flat list of headings contained by the Page.

Return type:typing.List[Section]
get_text()[source]

Includes all visible text below this element. Captions of images are included, but headings and infoboxes are not. See get_text_with_headings for a version that includes headings.

get_text_with_headings(include_heading=False)[source]

Includes all visible text below this element. While the heading of this element is excluded, headings of subsections will be included. Captions of images are excluded.

nested_headings()[source]

Returns each heading, recursively represented by a pair of (heading, list_of_child_sections).

Return type:typing.List[typing.Tuple[Section, typing.List[Section]]]
to_string()[source]

Render a string representation of the page.

Return type:str
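The heading structure of a page can also be recovered by walking the skeleton directly. The sketch below is one way to do that under stated assumptions: the Section and Para classes are hypothetical stand-ins that expose only the attributes documented for the real classes, and outline is an illustrative helper rather than a library function.

```python
class Section:
    """Hypothetical stand-in exposing the documented Section attributes."""
    def __init__(self, heading, headingId, children):
        self.heading = heading
        self.headingId = headingId
        self.children = children


class Para:
    """Hypothetical stand-in for a non-section page element."""
    def __init__(self, paragraph):
        self.paragraph = paragraph


def outline(skeleton, depth=0):
    """Return indented headings for every section reachable from `skeleton`."""
    lines = []
    for element in skeleton:
        # Only sections carry a heading and children; other elements are skipped.
        if hasattr(element, "heading"):
            lines.append("  " * depth + element.heading)
            lines.extend(outline(element.children, depth + 1))
    return lines


skeleton = [
    Para("intro"),
    Section("History", "History", [Section("Origins", "Origins", [Para("p")])]),
    Section("Usage", "Usage", []),
]
print("\n".join(outline(skeleton)))
```

With a real Page, the same function would be called on page.skeleton.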
class trec_car.read_data.PageMetadata(redirectNames, disambiguationNames, disambiguationIds, categoryNames, categoryIds, inlinkIds, inlinkAnchors)[source]

Metadata for a page

redirectNames
Return type:PageName

Names of pages which redirect to this page

disambiguationNames
Return type:PageName

Names of disambiguation pages which link to this page

disambiguationIds
Return type:PageId

Page IDs of disambiguation pages which link to this page

categoryNames
Return type:str

Page names of categories to which this page belongs

categoryIds
Return type:str

Page IDs of categories to which this page belongs

inlinkIds
Return type:str

Page IDs of pages containing inlinks

inlinkAnchors
Return type:str

inlinkAnchor frequencies: (anchor text, frequency) of pages containing inlinks
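As an illustration of how the (anchor text, frequency) pairs might be used, the sketch below ranks anchors by frequency. top_inlink_anchors is a hypothetical helper, the sample values are invented, and treating a missing field as an empty list is an assumption about the data rather than documented behaviour.

```python
from types import SimpleNamespace


def top_inlink_anchors(page_meta, n=3):
    """Return the `n` most frequent (anchor text, frequency) pairs."""
    # Assumption: the field may be None or empty for some pages.
    anchors = page_meta.inlinkAnchors or []
    return sorted(anchors, key=lambda pair: pair[1], reverse=True)[:n]


# Stand-in metadata object with invented values, for illustration only.
meta = SimpleNamespace(
    inlinkAnchors=[("NYC", 12), ("New York", 40), ("Big Apple", 3)]
)
print(top_inlink_anchors(meta, n=2))
```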

Types of pages

class trec_car.read_data.PageType[source]

An abstract base class representing the various types of pages.

Subclasses include ArticlePage, CategoryPage, DisambiguationPage, and RedirectPage.

class trec_car.read_data.ArticlePage[source]
class trec_car.read_data.CategoryPage[source]
class trec_car.read_data.DisambiguationPage[source]
class trec_car.read_data.RedirectPage(targetPage)[source]
targetPage
Return type:PageId

The target of the redirect.
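Since the page type is expressed through subclasses, isinstance checks are a natural way to select pages of one kind. The sketch below collects redirect targets; the classes are hypothetical stand-ins that mirror the documented names, redirect_targets is an illustrative helper, and the IDs are placeholders.

```python
class PageType:
    """Hypothetical stand-in for trec_car.read_data.PageType."""


class ArticlePage(PageType):
    pass


class RedirectPage(PageType):
    def __init__(self, targetPage):
        self.targetPage = targetPage


class Page:
    """Stand-in carrying only the two attributes this sketch reads."""
    def __init__(self, page_id, page_type):
        self.page_id = page_id
        self.page_type = page_type


def redirect_targets(pages):
    """Map the ID of every redirect page to the ID of its target."""
    return {
        page.page_id: page.page_type.targetPage
        for page in pages
        if isinstance(page.page_type, RedirectPage)
    }


pages = [
    Page("NYC", RedirectPage("New_York_City")),
    Page("New_York_City", ArticlePage()),
]
print(redirect_targets(pages))
```

With the real library, RedirectPage would be imported from trec_car.read_data instead of being defined locally.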

Page structure

The high-level structure of a Page is captured by the subclasses of PageSkeleton.

class trec_car.read_data.PageSkeleton[source]

An abstract superclass for the various types of page elements. Subclasses include Para, Section, List, and Image.

get_text()[source]

Includes visible text of this element and below. Headings are excluded. Image captions are included. Infoboxes are ignored. (For a version with headings and no captions see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]

Includes all visible text below this element. While the heading of this element is excluded, headings of subsections will be included. Captions of images are excluded.

class trec_car.read_data.Para(paragraph)[source]

Bases: trec_car.read_data.PageSkeleton

A paragraph within a Wikipedia page.

paragraph
Return type:Paragraph

The content of the Paragraph (which in turn contains a list of ParaBodys)

get_text()[source]

Includes visible text of this element and below. Headings are excluded. Image captions are included. Infoboxes are ignored. (For a version with headings and no captions see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]

Includes all visible text below this element. While the heading of this element is excluded, headings of subsections will be included. Captions of images are excluded.

class trec_car.read_data.Section(heading, headingId, children)[source]

Bases: trec_car.read_data.PageSkeleton

A section of a Wikipedia page.

heading
Return type:str

The section heading.

headingId
Return type:str

The unique identifier of a section heading.

children
Return type:typing.List[PageSkeleton]

The PageSkeleton elements contained by the section.

get_text()[source]

Includes visible text of this element and below. Headings are excluded. Image captions are included. Infoboxes are ignored. (For a version with headings and no captions see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]

Includes all visible text below this element. While the heading of this element is excluded, headings of subsections will be included. Captions of images are excluded.

class trec_car.read_data.List(level, body)[source]

Bases: trec_car.read_data.PageSkeleton

A list element within a Wikipedia page.

level
Return type:int

The list nesting level

body

A Paragraph containing the list element contents.

get_text()[source]

Includes visible text of this element and below. Headings are excluded. Image captions are included. Infoboxes are ignored. (For a version with headings and no captions see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]

Includes all visible text below this element. While the heading of this element is excluded, headings of subsections will be included. Captions of images are excluded.

class trec_car.read_data.Image(imageurl, caption)[source]

Bases: trec_car.read_data.PageSkeleton

An image within a Wikipedia page.

caption
Return type:str

PageSkeleton representing the caption of the image

imageurl
Return type:str

URL to the image. Spaces need to be replaced with underscores, and the Wikimedia Commons namespace needs to be prefixed.

get_text()[source]

Includes visible text of this element and below. Headings are excluded. Image captions are included. Infoboxes are ignored. (For a version with headings and no captions see get_text_with_headings.)

get_text_with_headings(include_heading=False)[source]

Includes all visible text below this element. While the heading of this element is excluded, headings of subsections will be included. Captions of images are excluded.
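The PageSkeleton subclasses form a tree, so most processing comes down to a recursive walk. The following sketch collects the paragraphs held by Para and List elements, descending into Section children. It relies only on the attribute names documented above (children, paragraph, body); iter_paragraphs_in is a hypothetical helper and the stand-in objects are not the library's classes. Image captions are deliberately left out.

```python
from types import SimpleNamespace


def iter_paragraphs_in(skeleton):
    """Yield the paragraphs of Para and List elements, in document order."""
    for element in skeleton:
        if hasattr(element, "children"):      # Section: descend into it
            yield from iter_paragraphs_in(element.children)
        elif hasattr(element, "paragraph"):   # Para
            yield element.paragraph
        elif hasattr(element, "body"):        # List
            yield element.body


# Stand-in elements; strings take the place of Paragraph objects.
skeleton = [
    SimpleNamespace(paragraph="p1"),
    SimpleNamespace(
        heading="Details",
        headingId="Details",
        children=[
            SimpleNamespace(paragraph="p2"),
            SimpleNamespace(level=1, body="item"),
        ],
    ),
]
print(list(iter_paragraphs_in(skeleton)))
```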

Paragraph contents

class trec_car.read_data.Paragraph(para_id, bodies)[source]

A paragraph.

get_text()[source]

Get all of the contained text.

Return type:str
class trec_car.read_data.ParaBody[source]

An abstract superclass representing a bit of Paragraph content.

get_text()[source]

Get all of the text within a ParaBody.

Return type:str
class trec_car.read_data.ParaText(text)[source]

Bases: trec_car.read_data.ParaBody

A bit of plain text from a paragraph.

text
Return type:str

The text

get_text()[source]

Get all of the text within a ParaBody.

Return type:str

class trec_car.read_data.ParaLink(page, link_section, pageid, anchor_text)[source]

Bases: trec_car.read_data.ParaBody

A link within a paragraph.

page
Return type:PageName

The page name of the link target

pageid
Return type:PageId

The link target as a trec-car identifier

link_section
Return type:str

Section anchor of link target (i.e. the part after the # in the URL), or None.

anchor_text
Return type:str

The anchor text of the link

get_text()[source]

Get all of the text within a ParaBody.

Return type:str
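A common use of the paragraph contents is to extract the entity links from a paragraph. The sketch below does this by keeping only the bodies that carry a pageid. It assumes a Paragraph exposes its constructor argument as a bodies attribute; links_in is a hypothetical helper and the sample values are invented.

```python
from types import SimpleNamespace


def links_in(paragraph):
    """Return (anchor_text, pageid) for every link body in a paragraph."""
    # Plain-text bodies have no `pageid`, so they are filtered out here.
    return [
        (body.anchor_text, body.pageid)
        for body in paragraph.bodies
        if hasattr(body, "pageid")
    ]


# Stand-in paragraph: two text bodies around one link body.
paragraph = SimpleNamespace(
    para_id="p1",
    bodies=[
        SimpleNamespace(text="See "),
        SimpleNamespace(page="Chocolate", pageid="Chocolate",
                        anchor_text="cocoa products"),
        SimpleNamespace(text=" for details."),
    ],
)
print(links_in(paragraph))
```

With the real classes, an isinstance check against the link class would be an alternative to the attribute test used here.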
