trec-car-tools ============== This is the documentation for ``trec-car-tools``, a Python 3 library for reading and manipulating the `TREC Complex Answer Retrieval `_ (CAR) dataset. Getting started --------------- This library requires Python 3.3 or greater. It can can be installed with ``setup.py`` :: python3 ./setup.py install If you are using `Anaconda `_, install the ``cbor`` library for Python 3.6: :: conda install -c laura-dietz cbor=1.0.0 Once you have installed the library, you can download a `dataset `_ and start playing. Reading the dataset ------------------- The TREC CAR dataset consists of a number of different exports. These include, * Annotations files (also called "pages files") contain full Wikipedia pages and their contents * Paragraphs files contain only paragraphs disembodied from their pages * Outlines files contain only the section structure of pages and no textual content To read an annotations file use the :func:`iter_annotations` function: .. autofunction:: trec_car.read_data.iter_annotations For instance, to list the page IDs of pages in a pages file one might write .. code-block:: python for page in read_data.iter_annotations(open('train.test200.cbor', 'rb')): print(page.pageId) Likewise, to read a paragraphs file the :func:`iter_paragraphs` function is provided .. autofunction:: trec_car.read_data.iter_paragraphs To list the text of all paragraphs in a paragarphs file one might write, .. code-block:: python for para in read_data.iter_paragraphs(open('train.test200.cbor', 'rb')): print(para.getText()) Basic types ----------- .. class:: trec_car.read_data.PageName :class:`PageName` represents the natural language "name" of a page. Note that this means that it is not necessarily unique. If you need a unique handle for a page use :class:`PageId`. .. class:: trec_car.read_data.PageId A :class:`PageId` is the unique identifier for a :class:`Page`. The :class:`Page` type ---------------------- .. autoclass:: trec_car.read_data.Page :members: .. autoclass:: trec_car.read_data.PageMetadata :members: Types of pages ~~~~~~~~~~~~~~ .. autoclass:: trec_car.read_data.PageType The abstact base class. .. autoclass:: trec_car.read_data.ArticlePage .. autoclass:: trec_car.read_data.CategoryPage .. autoclass:: trec_car.read_data.DisambiguationPage .. autoclass:: trec_car.read_data.RedirectPage :members: Page structure -------------- The high-level structure of a :class:`Page` is captured by the subclasses of :class:`PageSkeleton`. .. autoclass:: trec_car.read_data.PageSkeleton :members: .. autoclass:: trec_car.read_data.Para :members: :show-inheritance: .. autoclass:: trec_car.read_data.Section :members: :show-inheritance: .. autoclass:: trec_car.read_data.List :members: :show-inheritance: .. autoclass:: trec_car.read_data.Image :members: :show-inheritance: Paragraph contents ------------------ .. autoclass:: trec_car.read_data.Paragraph :members: .. autoclass:: trec_car.read_data.ParaBody :members: .. autoclass:: trec_car.read_data.ParaText :members: :show-inheritance: .. autoclass:: trec_car.read_data.ParaLink :members: :show-inheritance: Indices and tables ================== * :ref:`genindex` * :ref:`modindex` * :ref:`search`