.. _index-file:

The Package Index file
----------------------

Columns whose values are unique across a dataset can be used to
identify a specific row.  These columns are referred to as index
columns [#]_, or alternatively as the *primary key*.  A *package index
file* collects column metadata in one place: it specifies the index
columns, column aliases, etc.  The keys that can be used in a package
index file are documented below.

**path** (string)

    Relative path to a dataset

**idxcols** (list of strings)

    Column names that should be considered part of the index of a
    dataset (or primary key)

**skip** (positive integer)

    Number of lines to skip when reading the dataset

**name** (string)

    Typically the name of a dataset is derived from its file name, but
    when working with the Python API, this key is used to map a
    dataframe to an entry in the index file.  This can also be used to
    map a table in a database to an entry (where the ``path`` key
    points to the database, e.g. the path to an SQLite file).

**alias** (mapping or dictionary)

    A mapping from column names in the dataset to the corresponding
    column names in the registry.  For example, say you use ``node``
    for locations, and you want that column to be mapped to ``region``
    in the registry.  This can be specified with an index entry like
    this:

    .. code-block:: yaml

      - path: demand.csv
        idxcols: [node, timestep]
        alias: {node: region}
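
    The effect of the alias can be sketched in plain Python.  This is
    only an illustration of the mapping, not how the library applies
    it internally; the column list (in particular ``demand``) is
    assumed for the example.

    .. code-block:: python

      # Alias mapping, as in the index entry above
      alias = {"node": "region"}

      # Hypothetical column names found in demand.csv
      columns = ["node", "timestep", "demand"]

      # Columns without an alias keep their own name
      registry_names = [alias.get(col, col) for col in columns]
      print(registry_names)  # ['region', 'timestep', 'demand']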

**iamc** (string)

    A format string to construct the IAMC variable for a file entry.
    It can reference index columns by enclosing them in braces (like a
    Python format string)::

      Installed Capacity|{carrier}|{technology}
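
    The substitution behaves like ordinary Python string formatting.
    The sketch below illustrates the idea under that assumption; it
    does not show the library's own implementation, and the row
    values are made up for the example.

    .. code-block:: python

      # Hypothetical index column values for one row of the dataset
      row = {"carrier": "electricity", "technology": "wind_onshore", "year": 2030}

      template = "Installed Capacity|{carrier}|{technology}"

      # str.format ignores keyword arguments that the template does not
      # reference (here: year)
      variable = template.format(**row)
      print(variable)  # Installed Capacity|electricity|wind_onshore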

**agg** (mapping or dictionary)

    A mapping from an index column name to a list of aggregation rules
    (for IAMC conversion).  Each rule is itself a mapping of the form:

    .. code-block:: yaml

      values:
      - open_field_pv
      - roof_mounted_pv
      variable: Primary Energy|Solar

    As there can be multiple rules for a column, they are included as
    a list.  A complete index entry with aggregation rules looks like:

    .. code-block:: yaml

      - agg:
          technology:
          - values:
            - dac
            variable: Carbon Sequestration|Direct Air Capture
          - values:
            - hydro_reservoir
            - hydro_run_of_river
            variable: Primary Energy|Hydro
          - values:
            - open_field_pv
            - roof_mounted_pv
            variable: Primary Energy|Solar
        iamc: Primary Energy|{technology}
        idxcols:
        - carrier
        - technology
        - year
        path: flow_out_sum.csv

    With the above entry, when converting to IAMC format, all data
    points with technology ``open_field_pv`` or ``roof_mounted_pv``
    will be added together under the IAMC variable name
    ``Primary Energy|Solar``.  Note that aggregation rules apply to a
    single index column; multiple index columns cannot be combined in
    this manner.
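
    The aggregation can be sketched in plain Python as a group-and-sum
    over the values listed in each rule.  This is a minimal
    illustration, not the library's implementation; the data points
    are invented, and the ``carrier`` index column is left out for
    brevity.

    .. code-block:: python

      from collections import defaultdict

      # Two of the aggregation rules for the "technology" column above
      rules = [
          {
              "values": ["hydro_reservoir", "hydro_run_of_river"],
              "variable": "Primary Energy|Hydro",
          },
          {
              "values": ["open_field_pv", "roof_mounted_pv"],
              "variable": "Primary Energy|Solar",
          },
      ]

      # Hypothetical data points: (technology, year, value)
      rows = [
          ("open_field_pv", 2030, 10.0),
          ("roof_mounted_pv", 2030, 5.0),
          ("hydro_reservoir", 2030, 7.0),
          ("hydro_run_of_river", 2030, 3.0),
      ]

      # Map every technology value to the IAMC variable of its rule
      lookup = {val: rule["variable"] for rule in rules for val in rule["values"]}

      # Sum the values that share an IAMC variable and year
      totals = defaultdict(float)
      for technology, year, value in rows:
          totals[(lookup[technology], year)] += value

      print(totals[("Primary Energy|Solar", 2030)])  # 15.0
      print(totals[("Primary Energy|Hydro", 2030)])  # 10.0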

.. [#] It is similar to the index of a book, which lets you jump to a
       specific page in the book by looking up a keyword.