trec-car-tools
This is the documentation for trec-car-tools, a Python 3 library for reading and manipulating the TREC Complex Answer Retrieval (CAR) dataset.
Getting started
This library requires Python 3.3 or greater. It can be installed with setup.py:
    python3 ./setup.py install
If you are using Anaconda, install the cbor library for Python 3.6:
    conda install -c laura-dietz cbor=1.0.0
Once you have installed the library, you can download a dataset and start playing.
Reading the dataset
The TREC CAR dataset consists of a number of different exports. These include:
- Annotations files (also called “pages files”) contain full Wikipedia pages and their contents
- Paragraphs files contain only paragraphs disembodied from their pages
- Outlines files contain only the section structure of pages and no textual content
To read an annotations file use the iter_annotations() function:
- trec_car.read_data.iter_annotations(file)
  Iterate over the Pages of an annotations file.
  Return type: typing.Iterator[Page]
For instance, to list the page IDs of pages in a pages file one might write:
    from trec_car import read_data
    with open('train.test200.cbor', 'rb') as f:
        for page in read_data.iter_annotations(f):
            print(page.page_id)
Likewise, to read a paragraphs file the iter_paragraphs() function is provided:
- trec_car.read_data.iter_paragraphs(file)
  Iterate over the Paragraphs of a paragraphs file.
  Return type: typing.Iterator[Paragraph]
To list the text of all paragraphs in a paragraphs file one might write:
    from trec_car import read_data
    with open('train.test200.cbor', 'rb') as f:
        for para in read_data.iter_paragraphs(f):
            print(para.get_text())
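Because both readers return iterators, a large file can be sampled without reading it to the end. The sketch below shows that pattern with itertools.islice. Note that fake_reader is a hypothetical stand-in generator, used only so the snippet runs without the library or a data file; in real use one would pass the iterator returned by iter_paragraphs() or iter_annotations() instead. first_n is likewise an illustrative helper, not part of trec-car-tools.

```python
from itertools import islice


def first_n(items, n):
    """Return the first n elements of any iterator as a list."""
    return list(islice(items, n))


def fake_reader():
    # Hypothetical stand-in for an iterator returned by the library:
    # it yields items one at a time instead of building a full list.
    for i in range(1000):
        yield "paragraph %d" % i


# Only three items are ever produced by the generator here.
sample = first_n(fake_reader(), 3)
print(sample)
```

The same helper works unchanged on any iterator, so it can be reused for pages and paragraphs alike.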
Basic types
- class trec_car.read_data.PageName
  PageName represents the natural language “name” of a page. Note that this means that it is not necessarily unique. If you need a unique handle for a page use PageId.
- class trec_car.read_data.PageId
  A PageId is the unique identifier for a Page.
The Page type
- class trec_car.read_data.Page(page_name, page_id, skeleton, page_type, page_meta)
  The name and skeleton of a Wikipedia page.
  - skeleton
    Return type: typing.List[PageSkeleton]
    The contents of the page.
  - page_meta
    Return type: PageMetadata
    Metadata about the page.
  - flat_headings_list()
    Returns a flat list of headings contained by the Page.
    Return type: typing.List[Section]
  - get_text()
    Returns all visible text below this element. Includes captions of images, but no headings and no infoboxes. See get_text_with_headings for a version that includes headings.
  - get_text_with_headings(include_heading=False)
    Returns all visible text below this element. While the heading of this element is excluded, headings of subsections are included. Captions of images are excluded.
- class trec_car.read_data.PageMetadata(redirectNames, disambiguationNames, disambiguationIds, categoryNames, categoryIds, inlinkIds, inlinkAnchors, wikiDataQid, siteId, pageTags)
  Metadata for a page.
  - categoryNames
    Return type: str
    Page names of categories to which this page belongs.
  - categoryIds
    Return type: str
    Page IDs of categories to which this page belongs.
  - inlinkIds
    Return type: str
    Page IDs of pages containing inlinks.
  - inlinkAnchors
    Return type: str
    Inlink anchor frequencies: (anchor text, frequency) of pages containing inlinks.
  - wikidataQid
    Return type: str
    Language- and time-independent Wikidata ID (e.g. Q12345).
  - siteId
    Return type: str
    Site ID (e.g. enwiki). The combination of wikidataQid and siteId identifies a page in a Wikipedia across time stamps. Note that PageName and PageId can change over time.
  - pageTags
    Return type: str
    Template tags of pages, e.g. “Good article” or “Vital article”.
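The siteId description implies that pages should be matched across dumps on the pair (wikidataQid, siteId) rather than on PageName or PageId. A minimal sketch of that idea follows. Meta is a hypothetical stand-in carrying only those two fields so the snippet is self-contained, and stable_key is an illustrative helper, not part of the library; the attribute names are taken from the listing above.

```python
from collections import namedtuple

# Hypothetical stand-in for PageMetadata; only the two fields used here.
Meta = namedtuple("Meta", ["wikidataQid", "siteId"])


def stable_key(meta):
    """Build a key that survives page renames: (Wikidata QID, site ID)."""
    return (meta.wikidataQid, meta.siteId)


# Index built from an older dump, keyed on the stable pair.
old_dump = {stable_key(Meta("Q12345", "enwiki")): "Old page name"}

# Metadata for the same page as seen in a newer dump.
new_meta = Meta("Q12345", "enwiki")

# The lookup succeeds even if PageName or PageId changed between dumps.
print(old_dump.get(stable_key(new_meta)))
```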
Page structure
The high-level structure of a Page is captured by the subclasses of PageSkeleton.
- class trec_car.read_data.PageSkeleton
  An abstract superclass for the various types of page elements. Subclasses include:
- class trec_car.read_data.Para(paragraph)
  Bases: trec_car.read_data.PageSkeleton
  A paragraph within a Wikipedia page.
  - paragraph
    Return type: Paragraph
    The content of the Paragraph (which in turn contains a list of ParaBodys).
- class trec_car.read_data.Section(heading, headingId, children)
  Bases: trec_car.read_data.PageSkeleton
  A section of a Wikipedia page.
  - heading
    Return type: str
    The section heading.
  - headingId
    Return type: str
    The unique identifier of a section heading.
  - children
    Return type: typing.List[PageSkeleton]
    The PageSkeleton elements contained by the section.
- class trec_car.read_data.List(level, body)
  Bases: trec_car.read_data.PageSkeleton
  A list element within a Wikipedia page.
  - level
    Return type: int
    The list nesting level.
- class trec_car.read_data.Image(imageurl, caption)
  Bases: trec_car.read_data.PageSkeleton
  An image within a Wikipedia page.
  - caption
    Return type: str
    PageSkeleton representing the caption of the image.
  - imageurl
    Return type: str
    URL to the image. Spaces need to be replaced with underscores, and the Wikimedia Commons namespace needs to be prefixed.
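Since a Section contains further PageSkeleton elements in its children, a page skeleton is a tree and is naturally processed by recursion. The sketch below prints an indented outline of section headings. The classes here are simplified, hypothetical stand-ins that mirror only the constructor arguments listed above (Para holds a plain string rather than a real Paragraph), and outline is an illustrative helper rather than a library function.

```python
class PageSkeleton:
    """Stand-in for the abstract superclass (hypothetical, simplified)."""


class Para(PageSkeleton):
    def __init__(self, paragraph):
        self.paragraph = paragraph  # here just a str, not a real Paragraph


class Section(PageSkeleton):
    def __init__(self, heading, headingId, children):
        self.heading = heading
        self.headingId = headingId
        self.children = list(children)


def outline(skeleton, depth=0):
    """Yield (depth, heading) for every Section, in document order."""
    for element in skeleton:
        if isinstance(element, Section):
            yield depth, element.heading
            # Nested sections live in children, one level deeper.
            yield from outline(element.children, depth + 1)


skeleton = [
    Para("Lead paragraph."),
    Section("History", "History", [
        Para("Early years."),
        Section("Origins", "Origins", []),
    ]),
    Section("Usage", "Usage", []),
]

headings = list(outline(skeleton))
for depth, heading in headings:
    print("  " * depth + heading)
```

With the real library, the same traversal would presumably start from page.skeleton; the other element types (Para, List, Image) are simply skipped by the isinstance check.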
Paragraph contents
- class trec_car.read_data.ParaBody
  An abstract superclass representing a bit of Paragraph content.
- class trec_car.read_data.ParaText(text)
  Bases: trec_car.read_data.ParaBody
  A bit of plain text from a paragraph.
  - text
    Return type: str
    The text.
- class trec_car.read_data.ParaLink(page, link_section, pageid, anchor_text)
  Bases: trec_car.read_data.ParaBody
  A link within a paragraph.
  - link_section
    Return type: str
    Section anchor of the link target (i.e. the part after the # in the URL), or None.
  - anchor_text
    Return type: str
    The anchor text of the link.
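A paragraph is thus a sequence of ParaBody elements, each either plain text or a link. The sketch below shows how such a sequence might be turned back into readable text and how link targets might be collected. ParaText and ParaLink here are simplified, hypothetical stand-ins built from the constructor arguments listed above; storing page and pageid as attributes of the same name is an assumption, and render and link_targets are illustrative helpers, not library functions.

```python
class ParaBody:
    """Stand-in for the abstract superclass (hypothetical, simplified)."""


class ParaText(ParaBody):
    def __init__(self, text):
        self.text = text


class ParaLink(ParaBody):
    def __init__(self, page, link_section, pageid, anchor_text):
        self.page = page                  # assumed attribute name
        self.link_section = link_section  # may be None
        self.pageid = pageid              # assumed attribute name
        self.anchor_text = anchor_text


def render(bodies):
    """Join plain text and link anchor text into one string."""
    parts = []
    for body in bodies:
        if isinstance(body, ParaLink):
            parts.append(body.anchor_text)
        elif isinstance(body, ParaText):
            parts.append(body.text)
    return "".join(parts)


def link_targets(bodies):
    """Collect (page, link_section) for every link in the paragraph."""
    return [(b.page, b.link_section) for b in bodies if isinstance(b, ParaLink)]


bodies = [
    ParaText("See the "),
    ParaLink("Python (programming language)", "History", "hypothetical-id", "Python"),
    ParaText(" article."),
]
print(render(bodies))
print(link_targets(bodies))
```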
-