Technical documentation of functions and classes provided by the different friendly_data modules.

Datapackage tools

Functions useful to interact with a data package.

friendly_data.dpkg.create_pkg(meta: Dict, fpaths: Iterable[friendly_data.dpkg._res_t], basepath: Union[str, pathlib.Path] = '', infer=True)[source]

Create a datapackage from metadata and resources.

If resources point to files that exist, their schemas are inferred and added to the package. If basepath is a non-empty string, it is treated as the parent directory, and all resource file paths are checked relative to it.

Parameters
metaDict

A dictionary with package metadata.

fpathsIterable[Union[str, Path, Dict]]

An iterator over different resources. Resources are paths to files, relative to basepath.

basepathstr (default: empty string)

Directory where the package files are located

inferbool (default: True)

Whether to infer resource schema

Returns
Package

A datapackage with inferred schema for all the package resources
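
The way resource paths are checked relative to basepath can be sketched as follows. This is a minimal illustration that assumes resources are given as plain relative paths; existing_resources() is a hypothetical helper, not part of friendly_data.

```python
from pathlib import Path
from typing import Iterable, List, Union


def existing_resources(
    fpaths: Iterable[Union[str, Path]], basepath: Union[str, Path] = ""
) -> List[Path]:
    """Return the resource paths that exist when resolved against basepath."""
    base = Path(basepath)  # Path("") refers to the current directory
    return [Path(p) for p in fpaths if (base / p).exists()]
```

Only the paths that survive this check would have a file to infer a schema from.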

friendly_data.dpkg.entry_from_res(res: frictionless.resource.Resource) Dict[source]

Create an index entry from a Resource object

Parameters
resResource

A resource object

Returns
Dict

A dictionary that is an index entry

friendly_data.dpkg.fullpath(resource: frictionless.resource.Resource) pathlib.Path[source]

Get full path of a resource

Parameters
resourceResource

Resource object/dictionary

Returns
Path

Full path to the resource

friendly_data.dpkg.get_aliased_cols(cols: Iterable[str], col_t: str, alias: Dict[str, str]) Dict[source]

Get aliased columns from the registry

Parameters
colsIterable[str]

List of columns to retrieve

col_tLiteral[“cols”, “idxcols”]

A literal string specifying the kind of column; one of: “cols”, or “idxcols”

aliasDict[str, str]

Dictionary of aliases; key is the column name in the dataset, and the value is the column name in the registry that it is equivalent to.

Returns
Dict

Schema for each column; the column name is the key, and the schema is the value. See the docstring of index_levels() for more.

friendly_data.dpkg.idxpath_from_pkgpath(pkgpath: Union[str, pathlib.Path]) Union[str, pathlib.Path][source]

Return a valid index path given a package path

Parameters
pkgpathUnion[str, Path]

Path to package directory

Returns
Union[str, Path]
  • Returns a valid index path; if there are multiple matches, returns the lexicographically first match

  • If an index file is not found, returns an empty string

Warns
  • Warns if no index file is found
  • Warns if multiple index files are found
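
The lookup and warning behaviour described above could look roughly like the sketch below. The accepted file names (index.json, index.yaml, index.yml) are an assumption, and find_index() is a hypothetical stand-in rather than the library function.

```python
import warnings
from pathlib import Path
from typing import Union


def find_index(pkgpath: Union[str, Path]) -> Union[str, Path]:
    """Return the lexicographically first index file in pkgpath, or ""."""
    # assumption: these are the file names recognised as an index
    names = ("index.json", "index.yaml", "index.yml")
    matches = sorted(p for p in Path(pkgpath).iterdir() if p.name in names)
    if not matches:
        warnings.warn(f"{pkgpath}: no index file found", RuntimeWarning)
        return ""
    if len(matches) > 1:
        warnings.warn(f"multiple index files found, using {matches[0]}", RuntimeWarning)
    return matches[0]
```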
friendly_data.dpkg.index_levels(file_or_df: Union[str, pathlib.Path, friendly_data._types._dfseries_t], idxcols: Iterable[str], alias: Dict[str, str] = {}) Tuple[friendly_data._types._dfseries_t, Dict][source]

Read a dataset and determine the index levels

Parameters
file_or_dfUnion[str, Path, pd.DataFrame, pd.Series]

A dataframe, or the path to a CSV file

idxcolsIterable[str]

List of columns in the dataset that constitute the index

aliasDict[str, str]

Column aliases: {my_alias: col_in_registry}

Returns
Tuple[Union[pd.DataFrame, pd.Series], Dict]

Tuple of the dataset, and the schema of each index column as a dictionary. If idxcols was [“foo”, “bar”], the dictionary might look like:

{
    "foo": {
        "name": "foo",
        "type": "datetime",
        "format": "default"
    },
    "bar": {
        "name": "bar",
        "type": "string",
        "constraints": {
            "enum": ["a", "b"]
        }
    }
}

Note that index columns with categorical values are filled in by reading the dataset and determining the full set of values.
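
As an illustration of how the enum constraint for a categorical index column can be derived from the data, here is a minimal sketch using only the standard library. It treats every requested index column as a categorical string column, which is a simplification; enum_levels() is a hypothetical name and does not consult the registry.

```python
import csv
import io
from typing import Dict, Iterable, List


def enum_levels(csv_text: str, idxcols: Iterable[str]) -> Dict[str, Dict]:
    """Build a string schema with an enum constraint for each index column."""
    rows: List[Dict[str, str]] = list(csv.DictReader(io.StringIO(csv_text)))
    schema = {}
    for col in idxcols:
        # the full set of values seen in the dataset, in sorted order
        values = sorted({row[col] for row in rows})
        schema[col] = {"name": col, "type": "string", "constraints": {"enum": values}}
    return schema


data = "bar,value\na,1\nb,2\na,3\n"
levels = enum_levels(data, ["bar"])
```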

friendly_data.dpkg.pkg_from_files(meta: Dict, fpath: Union[str, pathlib.Path], fpaths: Iterable[Union[str, pathlib.Path]]) Tuple[pathlib.Path, frictionless.package.Package, Optional[friendly_data.dpkg.pkgindex]][source]

Create a package from an index file and other files

Parameters
metaDict

A dictionary with package metadata.

fpathUnion[str, Path]

Path to the package directory or index file. Note the index file has to be at the top level directory of the datapackage. See pkgindex.from_file()

fpathsList[Union[str, Path]]

A list of paths to datasets/resources not in the index. If any of the paths point to a dataset already present in the index, the index entry is respected.

Returns
Tuple[Path, Package, Union[pkgindex, None]]

A datapackage with inferred schema for the resources/datasets present in the index; all other resources are added with a basic inferred schema.

friendly_data.dpkg.pkg_from_index(meta: Dict, fpath: Union[str, pathlib.Path]) Tuple[pathlib.Path, frictionless.package.Package, friendly_data.dpkg.pkgindex][source]

Read an index file, and create a datapackage with the provided metadata.

The index can be in either YAML, or JSON format. It is a list of dataset files, names, and a list of columns in the dataset that are to be treated as index columns (see example below)

Parameters
metaDict

Package metadata dictionary

fpathUnion[str, Path]

Path to the index file. Note the index file has to be at the top level directory of the datapackage.

Returns
Tuple[Path, Package, pkgindex]

The package directory, the Package object, and the index.

Examples

YAML (JSON is also supported):

- path: file1
  name: dst1
  idxcols: [cola, colb]
- path: file2
  name: dst2
  idxcols: [colx, coly, colz]
- path: file3
  name: dst3
  idxcols: [col]
class friendly_data.dpkg.pkgindex(iterable=(), /)[source]

Data package index (a subclass of list)

It is a list of dictionaries, where each dictionary is the respective record for a file. A record may have the following keys:

  • “path”: path to the file,

  • “idxcols”: list of column names that are to be included in the dataset index (optional),

  • “name”: dataset name (optional),

  • “skip”: lines to skip when reading the dataset (optional, CSV only),

  • “alias”: a mapping of column name aliases (optional),

  • “sheet”: sheet name or position (0-indexed) to use as dataset (optional, Excel only)

While iterating over an index, always use records() to ensure all necessary keys are present.

Methods

append(object, /)

Append object to the end of the list.

clear(/)

Remove all items from list.

copy(/)

Return a shallow copy of the list.

count(value, /)

Return number of occurrences of value.

extend(iterable, /)

Extend list by appending elements from the iterable.

from_file(fpath)

Read the index of files included in the data package

get(key)

Get the value of key from all records as a list.

index(value[, start, stop])

Return first index of value.

insert(index, object, /)

Insert object before index.

pop([index])

Remove and return item at index (default last).

records(keys)

Return an iterable of index records.

remove(value, /)

Remove first occurrence of value.

reverse(/)

Reverse IN PLACE.

sort(*[, key, reverse])

Sort the list in ascending order and return None.

classmethod from_file(fpath: Union[str, pathlib.Path]) friendly_data.dpkg.pkgindex[source]

Read the index of files included in the data package

Parameters
fpathUnion[str, Path]

Index file path or a stream object

Returns
List[Dict]
Raises
ValueError

If the file type is correct (YAML/JSON), but does not return a list

RuntimeError

If the file has an unknown extension (raised by friendly_data.io.dwim_file())

MatchError

If the file contains any unknown keys
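
The checks listed under Raises can be sketched for a JSON index as follows. This is an illustration only: load_index() is a hypothetical function, YAML input is left out to stay within the standard library, and unknown keys raise a KeyError here whereas the library reports a MatchError.

```python
import json
from typing import Dict, List

# keys documented for a pkgindex record
KNOWN_KEYS = {"path", "idxcols", "name", "skip", "alias", "sheet"}


def load_index(text: str) -> List[Dict]:
    """Parse a JSON index and check its overall shape."""
    idx = json.loads(text)
    if not isinstance(idx, list):
        raise ValueError("index must be a list of records")
    for record in idx:
        unknown = set(record) - KNOWN_KEYS
        if unknown:
            # the library signals this case with a MatchError instead
            raise KeyError(f"unknown keys: {sorted(unknown)}")
    return idx
```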

get(key: str) List[source]

Get the value of key from all records as a list.

If key is absent, the corresponding value is set to None.

Parameters
keystr

Key to retrieve

Returns
List

List of records with values corresponding to key.

records(keys: List[str]) Iterable[Dict][source]

Return an iterable of index records.

Each record is guaranteed to have all the requested keys. If a value wasn’t specified in the index file, it is set to None.

Parameters
keysList[str]

List of keys that are requested in each record.

Returns
Iterable[Dict]
Raises
glom.MatchError

If keys has an unsupported value
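
The None-filling behaviour of get() and records() can be sketched with plain functions over a list of dictionaries. These are hypothetical stand-ins, not the pkgindex methods; in particular, whether the library also passes through keys that were not requested is not stated here, so this sketch returns only the requested keys.

```python
from typing import Dict, Iterable, List

index = [
    {"path": "file1", "name": "dst1", "idxcols": ["cola", "colb"]},
    {"path": "file2"},
]


def get(index: List[Dict], key: str) -> List:
    """Value of key from every record; None where the key is absent."""
    return [record.get(key) for record in index]


def records(index: List[Dict], keys: List[str]) -> Iterable[Dict]:
    """Yield records restricted to keys, filling absent ones with None."""
    for record in index:
        yield {key: record.get(key) for key in keys}
```

With the sample index above, get(index, "name") gives one entry per record, and every dictionary produced by records(index, ["path", "alias"]) has both keys, so downstream code can index into it without checking.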

friendly_data.dpkg.read_pkg(pkg_path: Union[str, pathlib.Path], extract_dir: Optional[Union[str, pathlib.Path]] = None)[source]

Read a datapackage

If pkg_path points to a datapackage.json file, read it as is. If it points to a zip archive, the archive is extracted before it is opened; if extract_dir is not provided, the archive is extracted into the directory that contains it. If pkg_path is a directory, look for a datapackage.json inside.

Parameters
pkg_pathUnion[str, Path]

Path to the datapackage.json file, or a zip archive

extract_dirUnion[str, Path]

Path to which the zip archive is extracted

Returns
Package
Raises
ValueError

When an unsupported format (not a directory, JSON, or ZIP) is provided

FileNotFoundError

When a datapackage.json file cannot be found
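
The dispatch on the kind of path can be sketched as below. This is not the library's code: locate_descriptor() is a hypothetical helper that only finds the datapackage.json file, and it assumes the default extraction target is the directory containing the archive.

```python
import zipfile
from pathlib import Path
from typing import Optional, Union


def locate_descriptor(
    pkg_path: Union[str, Path], extract_dir: Optional[Union[str, Path]] = None
) -> Path:
    """Return the path to datapackage.json for a directory, JSON file, or zip."""
    pkg_path = Path(pkg_path)
    if pkg_path.is_dir():
        descriptor = pkg_path / "datapackage.json"
    elif pkg_path.suffix == ".json":
        descriptor = pkg_path
    elif pkg_path.suffix == ".zip":
        # assumption: default to the directory that contains the archive
        target = Path(extract_dir) if extract_dir else pkg_path.parent
        with zipfile.ZipFile(pkg_path) as archive:
            archive.extractall(target)
        descriptor = target / "datapackage.json"
    else:
        raise ValueError(f"{pkg_path}: expected a directory, JSON file, or zip archive")
    if not descriptor.exists():
        raise FileNotFoundError(f"{descriptor}: not found")
    return descriptor
```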

friendly_data.dpkg.res_from_entry(entry: Dict, pkg_dir: Union[str, pathlib.Path]) frictionless.resource.Resource[source]

Create a resource from an index entry.

Entry must have the keys: path, idxcols, alias; so use pkgindex.records() to iterate over the index.

Parameters
entryDict

Entry from an index file:

{
  "path": "data.csv",
  "idxcols": ["col1", "col2"],
  "alias": {
    "col1": "col0"
  }
}
pkg_dirUnion[str, Path]

Root directory of the package

Returns
Resource

The resource object (subclass of dict)

friendly_data.dpkg.resource_(spec: Dict, basepath: Union[str, pathlib.Path] = '', infer=True) frictionless.resource.Resource[source]

Create a Resource object based on the dictionary

Parameters
specDict

Dictionary with the structure:

{"path": "relpath/resource.csv", "skip": <nrows>, "sheet": <num>}

both “skip” & “sheet” are optional; “sheet” can be used to select a specific sheet as the dataset; sheet numbering starts at 1.

basepathUnion[str, Path]

Base path for resource object

inferbool (default: True)

Whether to infer resource schema

Returns
Resource
friendly_data.dpkg.set_idxcols(fpath: Union[str, pathlib.Path], basepath: Union[str, pathlib.Path] = '') frictionless.resource.Resource[source]

Create a resource object for a file, with index columns set

Parameters
fpathUnion[str, Path]

Path to a dataset (resource), e.g. a CSV file, relative to basepath

basepathUnion[str, Path], default: empty string (current directory)

Path to directory to consider as the data package basepath

Returns
Resource

Resource object with the index columns set according to the registry

friendly_data.dpkg.write_pkg(pkg: Union[Dict, frictionless.package.Package], pkgdir: Union[str, pathlib.Path], *, idx: Optional[Union[friendly_data.dpkg.pkgindex, List]] = None) List[pathlib.Path][source]

Write a data package to path

Parameters
pkg: Package

Package object

pkgdir: Union[str, Path]

Path to write to

idxUnion[pkgindex, List] (optional)

Package index written to pkgdir/index.json

Returns
List[Path]

List of files written to disk

Metadata tools

Functions useful to access and manipulate package metadata.

friendly_data.metatools.check_license(lic: Dict[str, str]) Dict[str, str][source]

Return the license spec from the metadata

Issue a warning if the license is old. TODO: add other recommendations

Parameters
licDict[str, str], alias _license_t

License metadata dictionary (as returned by the Open Definition License Service) Example: CC-BY-SA:

{
  "domain_content": true,
  "domain_data": true,
  "domain_software": false,
  "family": "",
  "id": "CC-BY-SA-4.0",
  "maintainer": "Creative Commons",
  "od_conformance": "approved",
  "osd_conformance": "not reviewed",
  "status": "active",
  "title": "Creative Commons Attribution Share-Alike 4.0",
  "url": "https://creativecommons.org/licenses/by-sa/4.0/"
}
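
A check of this kind could be sketched as follows. It assumes that an old license is one whose "status" field is anything other than "active"; the source does not spell out the criterion, and warn_if_old() is a hypothetical name.

```python
import warnings
from typing import Dict


def warn_if_old(lic: Dict[str, str]) -> Dict[str, str]:
    """Return lic unchanged, warning when its status is not "active"."""
    # assumption: "old" means the status field is anything but "active"
    if lic.get("status") != "active":
        warnings.warn(f"{lic.get('id')}: license status is {lic.get('status')!r}")
    return lic
```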
friendly_data.metatools.get_license(lic: str, group: str = 'all') Dict[str, str][source]

Return the license metadata

Retrieve the license metadata of the requested group from the Open Definition License Service and cache it in a temporary file. From the retrieved list, find the requested license and return it.

Parameters
licstr or None

Requested license; if None, interactively ask for the license name

group{‘all’, ‘osi’, ‘od’, ‘ckan’}

License group where to find the license

Returns
Dict[str, str], alias _license_t

A dictionary with the license metadata

Raises
ValueError

If the license group is incorrect

KeyError

If the license cannot be found in the provided group

friendly_data.metatools.lic_domain(lic: Dict[str, str]) str[source]

Find the domain of a license

friendly_data.metatools.lic_metadata(keys: typing.Iterable[str], pred: typing.Callable[[typing.Dict], bool] = <function <lambda>>) List[Dict[str, str]][source]

Return a list of license metadata with the requested set of keys

Parameters
keysIterable[str]

List of keys to include in the metadata

predCallable[[Dict], bool]

A predicate to select a subset of licenses. It should accept a dictionary with license metadata, and return a boolean indicating whether to accept or not.

Returns
List[Dict]

List of license metadata

friendly_data.metatools.list_licenses(group: str = 'all') List[str][source]

Return list of valid licenses

friendly_data.metatools.resolve_licenses(meta: Dict) Dict[source]

Check and fix if licenses are specified correctly in the metadata

Registry API

Configurable Friendly data schema registry

Module to wrap around the default friendly_data_registry to add configurability. A custom registry configuration can be specified by using the config_ctx() context manager. The RegistrySchema validates the registry config before customising the default registry.

class friendly_data.registry.RegistrySchema(registry_config: Dict[str, List[Dict]])[source]

Instantiate with the “registry” section of the config file to validate

The registry section looks like this:

registry:
  idxcols:
    - name: enduse
      type: string
      constraints:
        enum:
          - ...
  cols:
    - name: cost
      type: number
      constraints:
        minimum: 0

Methods

clear()

copy()

fromkeys(iterable[, value])

Create a new dictionary with keys from iterable and values set to value.

get(key[, default])

Return the value for key if key is in the dictionary, else default.

items()

keys()

pop(key[, default])

If key is not found, default is returned if given, otherwise KeyError is raised

popitem(/)

Remove and return a (key, value) pair as a 2-tuple.

setdefault(key[, default])

Insert key with a value of default if key is not in the dictionary.

update([E, ]**F)

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values()

friendly_data.registry.config_ctx(*, confdict: Dict[str, List[Dict]] = {}, conffile: Union[str, pathlib.Path] = '', idxcols: List[Dict] = [], cols: List[Dict] = [])[source]

Context manager to temporarily override the default registry

Note that the parameters are accepted only as keyword arguments, and multiple parameters are not allowed at the same time. They are checked in the order shown here, and once one is found, the remaining parameters are ignored.

The registry config is also validated. If validation fails, an error message is logged, and the default registry remains unaltered.

Parameters
confdictDict[str, List[Dict]]

Registry config in dictionary form

conffileUnion[str, Path]

Path to a config file with a custom registry section

idxcolsList[Dict]

List of custom index columns

colsList[Dict]

List of custom value columns

Returns
Generator[Dict[str, List[Dict]]]

The custom registry config

Examples

from friendly_data.registry import config_ctx, get, getall

with config_ctx(conffile="config.yaml") as _:
    print(get("mycol", "cols"))
    print(getall())
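
The override-and-restore pattern behind a context manager like this can be sketched with contextlib. This is not the library's implementation: _state, active(), and override() are hypothetical names, and validation of the config is left out.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List, Sequence

# hypothetical module-level state standing in for the active registry config
_state: Dict[str, Dict[str, List[Dict]]] = {"config": {"idxcols": [], "cols": []}}


def active() -> Dict[str, List[Dict]]:
    """Return the registry config currently in effect."""
    return _state["config"]


@contextmanager
def override(
    *, idxcols: Sequence[Dict] = (), cols: Sequence[Dict] = ()
) -> Iterator[Dict[str, List[Dict]]]:
    """Temporarily replace the active config, restoring it on exit."""
    saved = _state["config"]
    _state["config"] = {"idxcols": list(idxcols), "cols": list(cols)}
    try:
        yield _state["config"]
    finally:
        _state["config"] = saved  # restored even if the body raises
```

Restoring in a finally clause is what keeps the default registry intact when the body of the with block fails.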
friendly_data.registry.get(col: str, col_t: str) Dict[source]

Wraps around the getters in friendly_data_registry.get().

If a custom registry config has been specified, columns from the config are also considered. A custom registry config can be set using the config_ctx() context manager.

friendly_data.registry.getall(with_file=False) Dict[str, List[Dict]][source]

Wraps around the getters in friendly_data_registry.getall().

If a custom registry config has been specified, columns from the config are also considered. A custom registry config can be set using the config_ctx() context manager.


The Friendly data schema registry

This module provides getter methods to retrieve individual columns, get(), or the whole registry, getall(). The function utilities and classes are used by the module internally.

friendly_data_registry.get(col: str, col_t: str) Dict[source]

Retrieve the column schema from column schema registry: friendly_data_registry

Parameters
colstr

Column name to look for

col_tLiteral[“cols”, “idxcols”]

A literal string specifying the kind of column; one of: “cols”, or “idxcols”

Returns
Dict

Column schema; an empty dictionary is returned in case there are no matches

Raises
RuntimeError

When more than one match is found

ValueError

When the schema file in the registry is unsupported; not one of: JSON, or YAML

friendly_data_registry.getall(with_file: bool = False) Dict[str, List[Dict]][source]

Get all columns from registry, primarily to generate documentation

Returns
Dict[str, Dict]

The returned value is separated by column type:

{
  "idxcols": [
    {..}  # column schemas
  ],
  "cols": [
    {..}  # column schemas
  ],
}
Raises
RuntimeError

When more than one match is found

friendly_data_registry.read_file(fpath: Union[str, pathlib.Path]) Union[Dict, List][source]

Read JSON or yaml file; file type is guessed from extension

class friendly_data_registry.schema(schema: dict)[source]

Registry column schema. Instantiate to validate.

Raises
TypeMatchError

When the column schema has a type mismatch

MatchError

Other mismatches like, an incorrectly named key

Methods

clear()

copy()

fromkeys(iterable[, value])

Create a new dictionary with keys from iterable and values set to value.

get(key[, default])

Return the value for key if key is in the dictionary, else default.

items()

keys()

pop(key[, default])

If key is not found, default is returned if given, otherwise KeyError is raised

popitem(/)

Remove and return a (key, value) pair as a 2-tuple.

setdefault(key[, default])

Insert key with a value of default if key is not in the dictionary.

update([E, ]**F)

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values()

Command Line Interface

Functions that are run from the CLI to create, or edit a data package.

friendly_data.cli.create(idxpath: str, *fpaths: str, name: str = '', title: str = '', licenses: str = '', description: str = '', keywords: str = '', inplace: bool = False, export: str = '', config: str = '')[source]

Create a package from an index file and other files

Package metadata provided with command line flags override metadata from the config file.

Parameters
idxpathstr

Path to the index file or package directory with the index file. Note the index file has to be at the top level directory of the datapackage.

fpathsTuple[str]

List of datasets/resources not in the index. If any of them point to a dataset already present in the index, it is ignored.

namestr

Package name (no spaces or special characters)

titlestr

Package title

licensesstr

License

descriptionstr

Package description

keywordsstr

A space separated list of keywords: ‘renewable energy model’ -> [‘renewable’, ‘energy’, ‘model’]

inplacebool

Whether to create the data package by only adding metadata to the current directory. NOTE: one of inplace/export must be chosen

exportstr

Create the data package in the provided directory instead of the current directory

configstr

Config file in YAML format with metadata and custom registry. The metadata should be under a “metadata” section, and the custom registry under a “registry” section.
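
How command line flags could take precedence over metadata from the config file, and how the keywords string is split, can be sketched as below. merge_metadata() is a hypothetical helper, and treating an empty string as "flag not given" is an assumption based on the defaults in the signature.

```python
from typing import Dict


def merge_metadata(config_meta: Dict, **flags: str) -> Dict:
    """Overlay non-empty command line flags on metadata from a config file."""
    meta = dict(config_meta)
    for key, value in flags.items():
        if not value:  # empty string: flag not given, keep the config value
            continue
        # keywords arrive as one space separated string
        meta[key] = value.split() if key == "keywords" else value
    return meta


meta = merge_metadata(
    {"name": "from-config", "title": "Config title"},
    name="from-flag",
    title="",
    keywords="renewable energy model",
)
```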

friendly_data.cli.describe(pkgpath: str, config: str = '', report_dir: str = '')[source]

Give a summary of the data package

Parameters
pkgpathstr

Path to the data package

report_dirstr (default: empty string)

If not empty, generate an HTML report and write it to this directory. The directory will have an index.html file, and one HTML file for each dataset.

configstr

Config file in YAML format with custom registry. It should be defined under a “registry” section.

friendly_data.cli.describe_registry(column_type: str = '')[source]

Describe columns defined in the registry

Parameters
column_typestr (default: empty string → all)

Column type to list; one of: “cols”, or “idxcols”. If nothing is provided (default), columns of both types are listed.

friendly_data.cli.generate_index_file(idxpath: str, *fpaths: str, config: str = '')[source]

Generate an index file from a set of dataset files

Parameters
idxpathstr

Path where the index file (YAML format) should be written

fpathsTuple[str]

List of datasets/resources to include in the index

configstr

Config file in YAML format with custom registry. It should be defined under a “registry” section.

friendly_data.cli.license_info(lic: str) Dict[source]

Give detailed metadata about a license

Parameters
licstr

License ID as listed in the output of friendly_data list-licenses

Returns
Dict

License metadata

friendly_data.cli.license_prompt() Dict[str, str][source]

Prompt for a license on the terminal (with completion).

friendly_data.cli.list_licenses() str[source]

List commonly used licenses

NOTE: for Python API users, not to be confused with metatools.list_licenses().

Returns
str

ASCII table with commonly used licenses

friendly_data.cli.main()[source]

Entry point for console scripts

friendly_data.cli.remove(pkgpath: str, *fpaths: str, rm_from_disk: bool = False) str[source]

Remove datasets from the package

Parameters
pkgpathstr

Path to the package directory

fpathsTuple[str]

List of datasets/resources to be removed from the package. The index is updated accordingly.

rm_from_diskbool (default: False)

Permanently delete the files from disk

friendly_data.cli.reports(pkg: frictionless.package.Package, report_dir: str)[source]

Write HTML reports summarising all resources in the package

Parameters
pkgPackage
report_dirstr

Directory where reports are written

Returns
int

Bytes written (index.html)

friendly_data.cli.to_iamc(config: str, idxpath: str, iamcpath: str, *, wide: bool = False)[source]

Aggregate datasets into an IAMC dataset

Parameters
configstr

Config file

idxpathstr

Index file

iamcpathstr

IAMC dataset

widebool (default: False)

Enable wide IAMC format

friendly_data.cli.update(pkgpath: str, *fpaths: str, name: str = '', title: str = '', licenses: str = '', description: str = '', keywords: str = '', config: str = '')[source]

Update metadata and datasets in a package.

Parameters
pkgpathstr

Path to the package.

fpathsTuple[str]

List of datasets/resources; they could be new datasets or datasets with updated index entries.

namestr

Package name (no spaces or special characters)

titlestr

Package title

descriptionstr

Package description

keywordsstr

A space separated list of keywords: ‘renewable energy model’ -> [‘renewable’, ‘energy’, ‘model’]

licensesstr

License

configstr

Config file in YAML format with metadata and custom registry. The metadata should be under a “metadata” section, and the custom registry under a “registry” section.

Validation functions

Functions useful to validate a data package or parts of its schema.

friendly_data.validate.check_pkg(pkg) List[Dict][source]

Validate all resources in a datapackage for common errors.

Typical errors that are checked:
  • blank-header,

  • extra-label,

  • missing-label,

  • blank-label,

  • duplicate-label,

  • incorrect-label,

  • blank-row,

  • primary-key-error,

  • foreign-key-error,

  • extra-cell,

  • missing-cell,

  • type-error,

  • constraint-error,

  • unique-error

Parameters
pkgfrictionless.Package

The datapackage descriptor dictionary

Returns
List[Dict]

A list of dictionaries with a summary of the validation checks.

friendly_data.validate.check_schema(ref: Dict[str, str], dst: Dict[str, str], *, remap: Optional[Dict[str, str]] = None) Tuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]][source]

Compare a schema with a reference.

The reference schema is a minimal set, meaning, any additional fields in the compared schema are accepted, but omissions are not.

Name comparisons are case-sensitive.

TODO: maybe also compare constraints?

Parameters
refDict[str, str]

Reference schema dictionary

dstDict[str, str]

Schema dictionary from the dataset being validated

remapDict[str, str] (optional)

Column/field names that are to be remapped before checking.

Returns
resultTuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]]

Result tuple:

  • Boolean flag indicating if it passed the checks or not

  • If checks failed, set of missing columns from minimal set

  • If checks failed, set of columns with mismatching types. It is a dictionary with the column name as key, and the reference type and the actual type in a tuple as value.

    {
        'col_x': ('integer', 'number'),
        'col_y': ('datetime', 'string'),
    }
    
  • If primary keys are different, tuple with the diff. The first element is the index where the two differ, and the two subsequent elements are the corresponding elements from the reference and dataset primary key list: (index, ref_col, dst_col)
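
A comparison that produces a result tuple of this shape can be sketched as follows. It assumes both schemas follow the Table Schema layout, with a "fields" list of name/type dictionaries and an optional "primaryKey" list; compare_schema() is a hypothetical function, it ignores remap, and it compares primary keys only over their common length.

```python
from typing import Dict, List, Set, Tuple

Result = Tuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]]


def compare_schema(ref: Dict, dst: Dict) -> Result:
    """Compare dst against the minimal reference schema ref."""
    ref_types = {f["name"]: f["type"] for f in ref["fields"]}
    dst_types = {f["name"]: f["type"] for f in dst["fields"]}
    # reference columns the dataset lacks; extra dataset columns are accepted
    missing = set(ref_types) - set(dst_types)
    # columns present in both, but with different types: (reference, actual)
    mismatch = {
        name: (ref_types[name], dst_types[name])
        for name in ref_types
        if name in dst_types and ref_types[name] != dst_types[name]
    }
    # positions where the two primary key lists differ
    pairs = zip(ref.get("primaryKey", []), dst.get("primaryKey", []))
    pk_diff = [(i, r, d) for i, (r, d) in enumerate(pairs) if r != d]
    return (not (missing or mismatch or pk_diff), missing, mismatch, pk_diff)
```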

friendly_data.validate.summarise_diff(diff: Tuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]]) str[source]

Summarise the schema diff from check_schema() results as a pandas.DataFrame.

friendly_data.validate.summarise_errors(report: List[Dict]) pandas.DataFrame[source]

Summarise the dict/json error report as a pandas.DataFrame

Parameters
reportList[Dict]

List of errors as returned by check_pkg()

Returns
pandas.DataFrame

Summary dataframe; example:

   filename  row  col       error  remark
0   bad.csv   12       extra-cell     ...
1   bad.csv   22  SRB  type-error     ...

Data analysis interface

The following modules provide interfaces that are useful when you are working with common data analysis frameworks like pandas and xarray.

Converters

Functions useful to read a data package resource into common analysis frameworks like pandas, xarray, etc. Currently supported:

Library              Data Structure
pandas               pandas.DataFrame
xarray (via pandas)  xarray.DataArray, xarray.Dataset, multi-file xarray.Dataset

Type mapping between the frictionless specification and pandas types:

schema type  pandas type
boolean      bool
datetime     datetime64
integer      Int64
number       float
string       string
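
The type mapping above can be written as a lookup table. The dictionary holds exactly the pairs listed; pandas_dtypes() is a hypothetical helper that shows how a list of schema fields would translate, and it does not import pandas.

```python
# schema type -> pandas type, as listed in the mapping above
SCHEMA_TO_PANDAS = {
    "boolean": "bool",
    "datetime": "datetime64",
    "integer": "Int64",
    "number": "float",
    "string": "string",
}


def pandas_dtypes(fields):
    """Map each field of a table schema to the pandas type it is read as."""
    return {f["name"]: SCHEMA_TO_PANDAS[f["type"]] for f in fields}
```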

friendly_data.converters.from_df(df: friendly_data._types._dfseries_t, basepath: Union[str, pathlib.Path], datapath: Union[str, pathlib.Path] = '', alias: Dict[str, str] = {}, rename: bool = True) frictionless.resource.Resource[source]

Write dataframe to a CSV file, and return a data package resource.

NOTE: Do not call frictionless.Resource.infer() on the resource instance returned by this function, as that might overwrite our metadata/schema customisations with default heuristics in the frictionless implementation.

Parameters
dfpd.DataFrame | pd.Series

Dataframe to write

basepathUnion[str, Path]

Path to the package directory

datapathUnion[str, Path] (default: empty string)

Path to the CSV file where the dataframe is written. If datapath is empty, a file name is generated by concatenating all the columns in the dataframe.

aliasDict[str, str] (default: {})

A dictionary of column aliases if the dataframe has custom column names that need to be mapped to columns in the registry. The key is the column name in the dataframe, and the value is a column in the registry.

renamebool (default: True)

Rename aliased columns to match the registry when writing to the CSV.

Returns
frictionless.Resource

Data package resource that points to the CSV file.

friendly_data.converters.from_dst(dst: xarray.Dataset, basepath: Union[str, pathlib.Path], alias: Dict[str, str] = {}) List[frictionless.resource.Resource][source]

Write an xarray.Dataset into CSV files, and return the list of resources

Each data variable is written to a separate CSV file in the directory specified by basepath. The file name is derived from the data variable name by sanitising it and appending the CSV extension.

Parameters
dstxr.Dataset

Dataset to write

basepathUnion[str, Path]

Path to the package directory

aliasDict[str, str]

A dictionary of column aliases if the dataset has custom data variable/coordinate names that need to be mapped to columns in the registry.

Returns
List[Resource]

List of data package resources that point to the CSV files.

friendly_data.converters.resolve_aliases(df: friendly_data._types._dfseries_t, alias: Dict[str, str]) friendly_data._types._dfseries_t[source]

Return a copy of the dataframe with aliases resolved

Parameters
dfpd.DataFrame | pd.Series
aliasDict[str, str]

A dictionary of column aliases if the dataframe has custom column names that need to be mapped to columns in the registry. The key is the column name in the dataframe, and the value is a column in the registry.

Returns
pd.DataFrame | pd.Series

Since the column and index levels are renamed, a copy is returned so that the original dataframe/series remains unaltered.

friendly_data.converters.to_da(resource: frictionless.resource.Resource, noexcept: bool = False, **kwargs) xarray.DataArray[source]

Reads a data package resource as an xarray.DataArray

This function is restricted to tables with only one value column (equivalent to a pandas.Series). All indices are treated as xarray.core.coordinates.DataArrayCoordinates and dimensions. The array is reshaped to match the dimensions. Any unit index is extracted and attached as an attribute to the data array. It is assumed that the whole table uses the same unit.

Additional keyword arguments are passed on to xarray.DataArray.

Parameters
resourcefrictionless.Resource

A data package resource object

noexceptbool (default: False)

Whether to suppress an exception

**kwargs

Additional keyword arguments that are passed on to xarray.DataArray

See also

to_df()

see for details on noexcept

friendly_data.converters.to_df(resource: frictionless.resource.Resource, noexcept: bool = False, **kwargs) pandas.DataFrame[source]

Reads a data package resource as a pandas.DataFrame

FIXME: ‘format’ in the schema is ignored.

Parameters
resourcefrictionless.Resource

A data package resource object

noexceptbool (default: False)

Whether to suppress an exception

**kwargs

Additional keyword arguments that are passed on to the reader: pandas.read_csv(), pandas.read_excel(), etc

Returns
pandas.DataFrame

NOTE: when noexcept is True, and there’s an exception, an empty dataframe is returned

Raises
ValueError

If the resource is not local, or if the source type the resource points to isn’t supported

friendly_data.converters.to_dst(resource: frictionless.resource.Resource, noexcept: bool = False, **kwargs) xarray.Dataset[source]

Reads a data package resource as an xarray.Dataset

Unlike to_da(), this function works for all tables. All indices are treated as xarray.core.coordinates.DataArrayCoordinates and dimensions. The arrays are reshaped to match the dimensions. Any unit index is extracted and attached as an attribute to each data array. It is assumed that all columns in the table use the same unit.

Additional keyword arguments are passed on to xarray.Dataset.

Parameters
resourcefrictionless.Resource

A data package resource object

noexceptbool (default: False)

Whether to suppress an exception

**kwargs

Additional keyword arguments that are passed on to xarray.Dataset

See also

to_df()

see for details on noexcept

friendly_data.converters.to_mfdst(resources: Iterable[frictionless.resource.Resource], noexcept: bool = False, **kwargs) xarray.Dataset[source]

Reads a list of data package resources as an xarray.Dataset

This function reads multiple resources/files and converts each column into a data array (identical to to_dst()), which are then combined into one xarray.Dataset. Note that any value column that is present more than once in the data package is overwritten by the last one. If you want support for duplicates, you should use to_dst() and handle the duplicates yourself.
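The overwrite rule for duplicate value columns behaves like successive dictionary updates. A minimal sketch, with plain dictionaries standing in for the data arrays read from each resource (the names combine, res_a and res_b are hypothetical):

```python
def combine(*resources):
    """Merge mappings of column name -> values; later resources win."""
    merged = {}
    for columns in resources:
        # a repeated column name replaces the earlier entry
        merged.update(columns)
    return merged


# two stand-in "resources", both providing a column called "capacity"
res_a = {"capacity": [1, 2, 3], "cost": [10, 20, 30]}
res_b = {"capacity": [4, 5, 6]}

combined = combine(res_a, res_b)
print(combined["capacity"])  # the values that came from res_b
```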

Parameters
resourcesList[frictionless.Resource]

List of data package resource objects

noexceptbool (default: False)

Whether to suppress an exception

**kwargs

Additional keyword arguments that are passed on to xarray.Dataset

See also

to_df()

see for details on noexcept

friendly_data.converters.xr_da(df: pandas.DataFrame, col: Union[int, Hashable], *, coords: Dict, attrs: Dict = {}, **kwargs) xarray.DataArray[source]

Create an xarray data array from a data frame

Parameters
dfpandas.DataFrame
colUnion[int, Hashable]

Column to use to create the data array, either use the column number, or column name

coordsDict

Dictionary of coordinate arrays

attrsDict

Dictionary of metadata attributes like unit

Returns
xarray.DataArray
friendly_data.converters.xr_metadata(df: pandas.DataFrame) Tuple[pandas.DataFrame, Dict, Dict][source]

Extract metadata to create xarray data array/datasets

All indices except unit are extracted as coordinates, and “unit” is extracted as a metadata attribute.

Parameters
dfpandas.DataFrame
Returns
Tuple[pandas.DataFrame, Dict, Dict]

The dataframe with units removed, dictionary of coordinates, dictionary with constant attributes like units
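The split into data, coordinates and attributes can be approximated with plain pandas. This sketch assumes a two-level index with a level named "unit"; the exact structure of the dictionaries returned by xr_metadata() may differ.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("north", "MW"), ("south", "MW")], names=["region", "unit"]
)
df = pd.DataFrame({"capacity": [1.5, 2.5]}, index=idx)

# constant attribute: the unit shared by the whole table
attrs = {"unit": df.index.get_level_values("unit").unique().tolist()}

# drop the unit level so that only coordinate levels remain
data = df.droplevel("unit")
coords = {
    name: data.index.get_level_values(name).unique()
    for name in data.index.names
}
```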

Time series API

Convenience functions to ingest time series data of various shapes into the standard 1-D shape supported by the data package specification.

friendly_data.tseries.from_multicol(fpath: friendly_data.tseries._file_t, *, date_cols: List[friendly_data.tseries._col_t], **kwargs)[source]

Read a time series where datetime values are in multiple columns.

See also

read_timeseries

see for full documentation, main entrypoint for users

friendly_data.tseries.from_table(fpath: friendly_data.tseries._file_t, *, col_units: str, zero_idx: bool, row_fmt: str = '', **kwargs)[source]

Read a time series from a tabular file.

See also

read_timeseries

see for full documentation, main entrypoint for users

friendly_data.tseries.read_timeseries(fpath: friendly_data.tseries._file_t, *, date_cols: Optional[List[friendly_data.tseries._col_t]] = None, col_units: Optional[str] = None, zero_idx: bool = False, row_fmt: str = '', source_t: str = '', **kwargs)[source]

Read a time series from a file.

While the natural way to structure a time series dataset is with the index column as datetime values, with subsequent columns holding other values, there are a few other frequently used structures.

The first is to structure it as a table:

date        1    2    23    24
1/1/2016    0    10   2.3   5.1
4/1/2016    3    11   4.3   9.1

When source_t is set to “table”, this function reads a tabular dataset like the one above, and flattens it into a series, and sets the appropriate datetime values as their index.
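The flattening step can be sketched with the standard library alone. The sketch below assumes that the dates are written day first (row_fmt "%d/%m/%Y") and that the numbered columns are hours; flatten is a hypothetical helper, not the implementation used by read_timeseries().

```python
from datetime import datetime, timedelta

header = ["date", "1", "2", "23", "24"]
rows = [
    ["1/1/2016", 0, 10, 2.3, 5.1],
    ["4/1/2016", 3, 11, 4.3, 9.1],
]


def flatten(header, rows, row_fmt="%d/%m/%Y", zero_idx=False):
    """Flatten a date x hour table into (timestamp, value) pairs."""
    # columns numbered from 1 denote the "nth hour", so shift them by one
    offset = 0 if zero_idx else 1
    series = []
    for date_str, *values in rows:
        day = datetime.strptime(date_str, row_fmt)
        for col, value in zip(header[1:], values):
            stamp = day + timedelta(hours=int(col) - offset)
            series.append((stamp, value))
    return series


series = flatten(header, rows)
```

Each row contributes one entry per hour column, so the two rows above yield eight timestamped values.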

The other common structure is to split the datetime values into multiple columns in the table:

date        time    col1    col2
1/1/2016    10:00   42.0    foo
4/1/2016    11:00   3.14    bar

When source_t is set to “multicol”, as the table is read, the indicated columns are combined to construct the datetime values, which are then set as the index.
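Combining the indicated columns amounts to joining their text and parsing the result as one datetime. A standard-library sketch, assuming day-first dates and 24-hour times (the format string is an assumption):

```python
from datetime import datetime

rows = [
    {"date": "1/1/2016", "time": "10:00", "col1": 42.0, "col2": "foo"},
    {"date": "4/1/2016", "time": "11:00", "col1": 3.14, "col2": "bar"},
]
date_cols = ["date", "time"]

indexed = {}
for row in rows:
    # join the datetime columns, then parse them as a single timestamp
    stamp = " ".join(row[col] for col in date_cols)
    key = datetime.strptime(stamp, "%d/%m/%Y %H:%M")
    # the remaining columns become the values stored under that timestamp
    indexed[key] = {k: v for k, v in row.items() if k not in date_cols}
```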

If source_t is not specified (or set to an empty string), options specific to this function are ignored, and all other keyword options are passed on to the backend transparently; in case of reading a CSV with Pandas, that means all valid keywords for pandas.read_csv are accepted.

Parameters
fpathUnion[str, Path, TextIO]

Path to the dataset file

date_colsList[int, str] (for “multicol” mode)

List of columns to be combined to construct the datetime values

col_unitsstr (for “table” mode)

Time units for the columns. Accepted values: “month”, “hour”.

zero_idxbool (for “table” mode, default: False)

Whether the columns are zero indexed. When the columns represent hours or minutes, it is common to number them as the nth hour, counting from 1 instead of 0; set this to False if that is the case.

row_fmtstr (for “table” mode, default: empty string)

What is the format of the datetime column (use strftime format strings, see: man 3 strftime). If this is left empty, the reader tries to guess a format using the dateutil module (Pandas default)

source_tstr (default: empty string)

Mode of reading the data. Accepted values: “table”, “multicol”, or empty string

**kwargsDict

Other keyword arguments passed on to the reader backend. Any options passed here take precedence, and overwrite other values inferred from the earlier keyword arguments.

Returns
tsSeries/DataFrame

The time series is returned as a series or a dataframe depending on the number of other columns that are present.

Examples

To skip specific rows, maybe because they have bad data, or are empty, you may use the skiprows option. It can be set to a list-like where the entries are row indices (numbers).

>>> read_timeseries("mydata.csv", source_t="table", col_units="hour",
...     skiprows=range(1522, 5480))  

The above example skips rows 1522-5479 (the end of a range is exclusive).

Similarly, data type of the column values can be controlled by using the dtype option. When set to a numpy.dtype, all values will be read as that type, which is probably relevant for the “table” mode. In the “multicol” mode, the types of the values can be controlled at the column level by setting it to a dictionary, where the key matches a column name, and the value is a valid numpy.dtype.
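Since these keywords are handed to the backend, the per-column form of dtype can be tried directly with pandas.read_csv. A small sketch using an in-memory CSV (the column names are only illustrative):

```python
import io

import pandas as pd

csv = io.StringIO(
    "date,time,col1,col2\n"
    "1/1/2016,10:00,42,foo\n"
    "4/1/2016,11:00,3,bar\n"
)

# per-column types: the key is a column name, the value a numpy dtype
df = pd.read_csv(csv, dtype={"col1": "float32"})
```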

Conversion to IAMC

Interface to convert a Friendly dataset to IAMC format

Configuration is split across two separate files. A global config file (in YAML format) sets options like the mapping of an index column to the corresponding IAMC names, and default values for mandatory columns. Per-dataset configuration, such as identifying index columns, mapping a dataset to its IAMC variable name, defining column aliases, and aggregations, goes in an index file (in YAML format).

class friendly_data.iamc.IAMconv(idx: friendly_data.dpkg.pkgindex, indices: Dict, basepath: Union[str, pathlib.Path])[source]

Converter class for IAMC data

This class resolves index columns against the “semi-hierarchical” variables used in IAMC data, and separates them into individual datasets that are part of the datapackage. It relies on the index file and index column definitions to do the disaggregation. It also supports the reverse operation of aggregating multiple datasets into an IAMC dataset.

TODO:

  • describe assumptions (e.g. case insensitive match) and fallbacks (e.g. missing title)

  • limitations (e.g. when no index column exists)

Attributes
basepath

Data package basepath, the directory where the index file is located

indices

Index definitions

res_idx

Package index

Methods

agg_idxcol(df, col, entry)

Aggregate values and generate IAMC dataframes

agg_vals_all(entry)

Find all values in index column that are present in an aggregate rule

frames(entry, df)

Convert the dataframe to IAMC format according to configuration in the entry

from_file(confpath, idxpath)

Create a mapping of IAMC indicator variables with index columns

iamcify(df)

Transform dataframe to match the IAMC (long) format

index_levels(idxcols)

Index levels for user defined index columns

read_indices(path, basepath, **kwargs)

Read index column definitions provided in config

resolve_idxcol_defaults(df)

Find missing IAMC indices and set them to the default value from config

to_csv(files, output[, wide])

Write converted IAMC data frame to a CSV file

to_df(files_or_dfs)

Convert CSV files/dataframes to IAMC format according to the index

agg_idxcol(df: pandas.DataFrame, col: str, entry: Dict) List[pandas.DataFrame][source]

Aggregate values and generate IAMC dataframes

Parameters
dfpd.DataFrame

Dataframe to aggregate from

colstr

Column to perform aggregation on

entryDict

Index entry with aggregation rules

Returns
List[pd.DataFrame]

List of IAMC dataframes

agg_vals_all(entry: Dict) Tuple[str, List[str]][source]

Find all values in index column that are present in an aggregate rule

property basepath

Data package basepath, the directory where the index file is located

frames(entry: Dict, df: pandas.DataFrame) List[pandas.DataFrame][source]

Convert the dataframe to IAMC format according to configuration in the entry

Parameters
entryDict

Index entry

dfpandas.DataFrame

The dataframe that is to be converted to IAMC format

Returns
List[pandas.DataFrame]

List of pandas.DataFrame objects in IAMC format

classmethod from_file(confpath: Union[str, pathlib.Path], idxpath: Union[str, pathlib.Path]) friendly_data.iamc.IAMconv[source]

Create a mapping of IAMC indicator variables with index columns

Parameters
confpathUnion[str, Path]

Path to the IAMC <-> data package config file

idxpathUnion[str, Path]

Path to index file

**kwargs

Keyword arguments passed on to the pandas reader backend.

Returns
IAMconv
iamcify(df: pandas.DataFrame) pandas.DataFrame[source]

Transform dataframe to match the IAMC (long) format

index_levels(idxcols: Iterable) Dict[str, pandas.Series][source]

Index levels for user defined index columns

Parameters
idxcolsIterable[str]

Iterable of index column names

Returns
Dict[str, pd.Series]

Different values for a given set of index columns

property indices: Dict

Index definitions

  • Default value of mandatory index columns in case they are missing

  • Different levels of user defined index columns; points to a 2-column CSV file, with the “name” and “iamc” columns
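A two-column levels file of this kind can be read into a lookup table and used to build IAMC variable names. The sketch below uses only the standard library; the file contents, the variable template and the "|" separator between hierarchy levels are assumptions made for illustration.

```python
import csv
import io

# stand-in for a levels file with the "name" and "iamc" columns
levels_csv = io.StringIO("name,iamc\nwind,Wind\npv,Solar PV\n")
lookup = {row["name"]: row["iamc"] for row in csv.DictReader(levels_csv)}

# hypothetical template for a semi-hierarchical IAMC variable
template = "Capacity|Electricity|{technology}"
variables = [template.format(technology=lookup[name]) for name in ("wind", "pv")]
```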

classmethod read_indices(path: Union[str, pathlib.Path], basepath: Union[str, pathlib.Path], **kwargs) pandas.Series[source]

Read index column definitions provided in config

property res_idx: friendly_data.dpkg.pkgindex

Package index

Each entry corresponds to a resource that may be included in IAMC output.

resolve_idxcol_defaults(df: pandas.DataFrame) pandas.DataFrame[source]

Find missing IAMC indices and set them to the default value from config

The IAMC format requires the following indices: self._IAMC_IDX; if any of them are missing, the corresponding index level is created, and the level values are set to a constant specified in the config.

Parameters
dfpandas.DataFrame
Returns
pandas.DataFrame

Dataframe with default index columns resolved
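Creating a missing index level with a constant value can be sketched in pandas as follows. The list of required names is a stand-in for self._IAMC_IDX, and the default values are hypothetical config entries.

```python
import pandas as pd

required = ["model", "scenario", "region"]  # stand-in for self._IAMC_IDX
defaults = {"model": "my-model", "scenario": "baseline"}  # hypothetical config

df = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.Index(["north", "south"], name="region"),
)

for name in required:
    if name not in df.index.names:
        # add a constant column, then append it to the index as a new level
        df = df.assign(**{name: defaults[name]}).set_index(name, append=True)
```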

to_csv(files: Iterable[Union[str, pathlib.Path]], output: Union[str, pathlib.Path], wide: bool = False)[source]

Write converted IAMC data frame to a CSV file

Parameters
filesIterable[Union[str, Path]]

List of files to collate and convert to IAMC

outputUnion[str, Path] (default: empty string)

Path of the output CSV file; if empty, nothing is written to file.

basepathUnion[str, Path]

Data package base path

widebool (default: False)

Write the CSV in wide format (with years as columns)
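The difference between the long and the wide layout is a pivot of the year level into columns. A pandas sketch with made-up values (the column names are assumptions, not the exact IAMC columns written by to_csv()):

```python
import pandas as pd

long_df = pd.DataFrame(
    {
        "variable": ["Capacity", "Capacity", "Cost", "Cost"],
        "year": [2030, 2050, 2030, 2050],
        "value": [1.0, 2.0, 10.0, 20.0],
    }
)

# move the year level from the index into the columns
wide_df = long_df.set_index(["variable", "year"])["value"].unstack("year")
```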

to_df(files_or_dfs: Union[Iterable[Union[str, pathlib.Path]], Dict[str, pandas.DataFrame]]) pandas.DataFrame[source]

Convert CSV files/dataframes to IAMC format according to the index

Parameters
files_or_dfsUnion[Iterable[Union[str, Path]], Dict[str, pandas.DataFrame]]

List of files or a dictionary of dataframes, to be collated and converted to IAMC format. Each item must have an entry in the package index the converter was initialised with; it is skipped otherwise. Files are matched by file path, whereas dataframes match when the dictionary key matches the index entry name.

Note when the files are read, the basepath is set to whatever the converter was initialised with. If IAMconv.from_file() was used, it is the parent directory of the index file.

Returns
DataFrame

A pandas.DataFrame in IAMC format

Internal interfaces

Internal functions and classes; useful if you are developing new features for friendly_data.

File I/O

Functions useful for I/O and file manipulation

class friendly_data.io.HttpCache(url_t: str)[source]

An HTTP cache

It takes a URL template with placeholders for parameters, e.g. https://www.example.com/path/{}.json; the parameters are provided later, at fetch time. No check is made that the number of parameters passed is compatible with the URL template.

After fetching a resource, it is cached in a file under $TMPDIR/friendly_data_cache/. The file name is of the form http-<checksum-of-url-template>-<checksum-of-url>. The cache is updated every 24 hours. A user may also force a cache cleanup by calling remove().
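The naming scheme and the 24-hour rule can be sketched with the standard library. The checksum algorithm is not stated above, so SHA-256 is an assumption here, and cache_name and is_fresh are hypothetical helpers rather than methods of HttpCache.

```python
import hashlib
import tempfile
import time
from pathlib import Path

MAX_AGE = 24 * 60 * 60  # seconds, i.e. the 24-hour refresh interval


def checksum(text):
    # SHA-256 is an assumption; the documentation only says "checksum"
    return hashlib.sha256(text.encode()).hexdigest()


def cache_name(url_t, *args):
    """Return a cache file name of the documented form, and the URL."""
    url = url_t.format(*args)
    return f"http-{checksum(url_t)}-{checksum(url)}", url


def is_fresh(path, now=None):
    """True if the cache file exists and is younger than MAX_AGE."""
    path = Path(path)
    now = time.time() if now is None else now
    return path.exists() and now - path.stat().st_mtime < MAX_AGE


name, url = cache_name("https://www.example.com/path/{}.json", "foo")
cachefile = Path(tempfile.gettempdir()) / "friendly_data_cache" / name
```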

Parameters
url_tstr

URL template, e.g. https://www.example.com/path/{}.json

Attributes
cachedirpathlib.Path

Path object pointing to the cache directory

Methods

cachefile(arg, *args)

Return the cache file, and the corresponding URL

fetch(url)

Fetch the URL

get(arg, *args)

Get the URL contents

remove(*args)

Remove cache files

cachefile(arg: str, *args: str) Tuple[pathlib.Path, str][source]

Return the cache file, and the corresponding URL

Parameters
argstr

parameters for the URL template (one mandatory)

*argsstr, optional

more parameters (optional)

Returns
Tuple[pathlib.Path, str]

Tuple of Path object pointing to the cache file and the URL string

fetch(url: str) bytes[source]

Fetch the URL

Parameters
urlstr

URL to fetch

Returns
bytes

Byte array of the fetched contents

Raises
ValueError

If the URL is incorrect

get(arg: str, *args: str) bytes[source]

Get the URL contents

If a valid cache exists, return the contents from there, otherwise fetch again.

Parameters
argstr

parameters for the URL template (one mandatory)

*argsstr, optional

more parameters (optional)

Returns
bytes

bytes array of the contents

Raises
ValueError

If the URL is incorrect

requests.ConnectionError

If there is no network connection

remove(*args: str)[source]

Remove cache files

  • Remove all files associated with this cache (w/o arguments).

  • Remove only the files associated with the URL formed from the args.

Parameters
*argsstr, optional

parameters for the URL template

Raises
FileNotFoundError

If an argument is provided to remove a specific cache file, but the cache file does not exist.

friendly_data.io.copy_files(src: Iterable[Union[str, pathlib.Path]], dest: Union[str, pathlib.Path], anchor: Union[str, pathlib.Path] = '') List[pathlib.Path][source]

Copy files to a directory

Without an anchor, the source files are copied to the root of the destination directory; with an anchor, the relative paths between the source files are maintained; any required subdirectories are created.

Parameters
srcIterable[Union[str, Path]]

List of files to be copied

destUnion[str, Path]

Destination directory

anchorUnion[str, Path] (default: empty string)

Top-level directory for anchoring, provide if you want the relative paths between the source files to be maintained with respect to this directory.

Returns
List[Path]

List of files that were copied
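The effect of the anchor can be reproduced with pathlib and shutil. This is a sketch of the described behaviour under the assumption that every source file lies below the anchor; copy_with_anchor is a hypothetical stand-in for copy_files().

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_anchor(src, dest, anchor=""):
    """Copy files to dest, keeping paths relative to anchor if it is given."""
    dest = Path(dest)
    copied = []
    for fpath in map(Path, src):
        # with an anchor keep the relative path, otherwise only the file name
        rel = fpath.relative_to(anchor) if anchor else Path(fpath.name)
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(fpath, target)
        copied.append(target)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "in" / "sub" / "a.csv"
    source.parent.mkdir(parents=True)
    source.write_text("x\n1\n")

    flat = copy_with_anchor([source], root / "flat")
    kept = copy_with_anchor([source], root / "kept", anchor=root / "in")
    flat_rel = [p.relative_to(root).as_posix() for p in flat]
    kept_rel = [p.relative_to(root).as_posix() for p in kept]
```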

friendly_data.io.dwim_file(fpath: Union[str, pathlib.Path]) Union[Dict, List][source]
friendly_data.io.dwim_file(fpath: Union[str, pathlib.Path], data: Any) None

Do What I Mean with file

Depending on the function arguments, either read the contents of a file, or write data to the file. The file type is guessed from the extension; supported formats: JSON and YAML.

Parameters
fpathUnion[str, Path]

File path to read or write to

dataUnion[None, Any]

Data, when writing to a file.

Returns
Union[None, Union[Dict, List]]
  • If writing to a file, nothing (None) is returned

  • If reading from a file, depending on the contents, either a list or a dictionary is returned
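The read-or-write dispatch can be sketched for the JSON case with the standard library. dwim_json is a hypothetical helper; it uses a sentinel to tell "no data given" apart from writing None, which may differ from how dwim_file() decides.

```python
import json
import tempfile
from pathlib import Path

_MISSING = object()  # sentinel, so that None can still be written as data


def dwim_json(fpath, data=_MISSING):
    """Read a JSON file, or write data to it when data is given."""
    fpath = Path(fpath)
    if fpath.suffix != ".json":
        raise ValueError(f"unsupported file type: {fpath.suffix}")
    if data is _MISSING:
        return json.loads(fpath.read_text())
    fpath.write_text(json.dumps(data))
    return None


with tempfile.TemporaryDirectory() as tmp:
    conf = Path(tmp) / "conf.json"
    written = dwim_json(conf, {"indices": ["region", "year"]})
    loaded = dwim_json(conf)
```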

friendly_data.io.get_cachedir() pathlib.Path[source]

Create the directory $TMPDIR/friendly_data_cache and return the Path

friendly_data.io.outoftree_paths(basepath: Union[str, pathlib.Path], fpaths: Iterable[Union[str, pathlib.Path]]) Tuple[List[pathlib.Path], List[pathlib.Path]][source]

Separate a list of paths into in tree and out of tree.

Parameters
basepathUnion[str, Path]

Path to use as the reference when identifying in/out of tree paths.

fpathsIterable[Union[str, Path]]

List of paths.

Returns
Tuple[List[Path], List[Path]]

A pair of lists: the in-tree paths, and the out-of-tree paths
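One way to tell in-tree from out-of-tree paths is to check whether a path can be expressed relative to the base path. A standard-library sketch (split_paths is hypothetical, and resolving the paths first is an assumption):

```python
from pathlib import Path


def split_paths(basepath, fpaths):
    """Separate paths into those below basepath and those outside it."""
    base = Path(basepath).resolve()
    intree, outoftree = [], []
    for fpath in map(Path, fpaths):
        try:
            # relative_to raises ValueError for paths outside of base
            fpath.resolve().relative_to(base)
        except ValueError:
            outoftree.append(fpath)
        else:
            intree.append(fpath)
    return intree, outoftree


intree, outoftree = split_paths("pkg", ["pkg/data/a.csv", "other/b.csv"])
```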

friendly_data.io.path_in(fpaths: Iterable[Union[str, pathlib.Path]], testfile: Union[str, pathlib.Path]) bool[source]

Function to test if a path is in a list of paths.

The test checks if they are the same physical files or not, so the testfile needs to exist on disk.

Parameters
fpathsIterable[Union[str, Path]]

List of paths to check

testfileUnion[str, Path]

Test file (must exist on disk)

Returns
bool
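Comparing physical files rather than path strings can be done with os.path.samefile. The sketch below assumes that approach; same_file_in is a hypothetical stand-in, and whether path_in() uses the same call is not stated above.

```python
import os
import tempfile
from pathlib import Path


def same_file_in(fpaths, testfile):
    """True if any existing path in fpaths is the same file as testfile."""
    return any(
        os.path.exists(fpath) and os.path.samefile(fpath, testfile)
        for fpath in fpaths
    )


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.csv"
    target.write_text("x\n")
    (Path(tmp) / "sub").mkdir()

    # a different spelling of the path that points to the same file
    alias = Path(tmp) / "sub" / ".." / "data.csv"
    found = same_file_in([alias], target)
    missing = same_file_in([Path(tmp) / "other.csv"], target)
```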
friendly_data.io.path_not_in(fpaths: Iterable[Union[str, pathlib.Path]], testfile: Union[str, pathlib.Path]) bool[source]

Function to test if a path is absent from a list of paths.

Opposite of path_in().

Parameters
fpathsIterable[Union[str, Path]]

List of paths to check

testfileUnion[str, Path]

Test file (must exist on disk)

Returns
bool
friendly_data.io.posixpathstr(fpath: Union[str, pathlib.Path]) str[source]

Given a path object, return a POSIX compatible path string

Parameters
fpathUnion[str, Path]

Path object

Returns
str
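The conversion is what pathlib offers as as_posix(). A short sketch with a Windows-style path; that posixpathstr() relies on this method is an assumption.

```python
from pathlib import PureWindowsPath

# backslash separators become forward slashes
win = PureWindowsPath(r"data\2016\values.csv")
posix_str = win.as_posix()
```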
friendly_data.io.relpaths(basepath: Union[str, pathlib.Path], pattern: Union[str, Iterable[Union[str, pathlib.Path]]]) List[str][source]

Convert a list of paths to relative paths

Parameters
basepathUnion[str, Path]

Path to use as the reference when calculating relative paths

patternUnion[str, Iterable[Union[str, Path]]]

Either a pattern relative to basepath to generate a list of paths, or a list of paths to convert.

Returns
List[str]

List of relative paths (as str-s)
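Both input forms can be handled by globbing when a string is given, and converting directly otherwise. A standard-library sketch (relative_paths is a hypothetical helper; the sorting is only there to make the result deterministic):

```python
import tempfile
from pathlib import Path


def relative_paths(basepath, pattern):
    """Return paths relative to basepath, from a glob pattern or a list."""
    base = Path(basepath)
    if isinstance(pattern, str):
        paths = base.glob(pattern)  # pattern is relative to basepath
    else:
        paths = map(Path, pattern)
    return sorted(p.relative_to(base).as_posix() for p in paths)


with tempfile.TemporaryDirectory() as tmp:
    datadir = Path(tmp) / "data"
    datadir.mkdir()
    for fname in ("a.csv", "b.csv"):
        (datadir / fname).write_text("")

    from_glob = relative_paths(tmp, "data/*.csv")
    from_list = relative_paths(tmp, [datadir / "a.csv"])
```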

Helper utilities

Collection of helper functions

friendly_data.helpers.filter_dict(data: Dict, allowed: Iterable) Dict[source]

Filter a dictionary based on a set of allowed keys

friendly_data.helpers.flatten_list(lst: Iterable) Iterable[source]

Flatten an arbitrarily nested list (returns a generator)
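A recursive generator is a common way to flatten nested lists. This sketch only descends into lists and tuples, which is an assumption; flatten_list() may treat other iterables differently.

```python
def flatten(lst):
    """Yield the items of an arbitrarily nested list, depth first."""
    for item in lst:
        if isinstance(item, (list, tuple)):
            # recurse into nested containers
            yield from flatten(item)
        else:
            yield item


flat = list(flatten([1, [2, [3, [4]], 5], (6,)]))
```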

friendly_data.helpers.idx_lvl_values(idx: pandas.MultiIndex, name: str) pandas.Index[source]

Given a pandas.MultiIndex and a level name, find the level values

Parameters
idxpandas.MultiIndex

A multi index

namestr

Level name

Returns
pandas.Index

Index with the level values

friendly_data.helpers.idxslice(lvls: Iterable[str], selection: Dict[str, List]) Tuple[source]

Create an index slice tuple from a set of level names, and selection mapping

NOTE: The order of lvls should match the order of the levels in the index exactly; typically, mydf.index.names.

Parameters
lvlsIterable[str]

Complete set of levels in the index

selectionDict[str, List]

Selection set; the key is a level name, and the value is a list of values to select

Returns
Tuple

Tuple of values, with slice(None) for skipped levels (matches anything)
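The resulting tuple can be pictured with a one-line helper. make_slice is a hypothetical stand-in for idxslice(), and the level names are only illustrative.

```python
def make_slice(lvls, selection):
    """Build a tuple with one entry per level; unselected levels match all."""
    return tuple(selection.get(lvl, slice(None)) for lvl in lvls)


sel = make_slice(
    ["region", "technology", "year"],
    {"region": ["north"], "year": [2030, 2050]},
)
```

A tuple like this is typically passed to the .loc indexer of a dataframe with a matching multi-index, e.g. mydf.loc[sel, :].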

friendly_data.helpers.import_from(module: str, name: str)[source]

Import name from module, if name is empty, return module
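The same behaviour can be sketched with importlib; import_name is a hypothetical stand-in for import_from().

```python
from importlib import import_module


def import_name(module, name):
    """Import name from module; return the module itself if name is empty."""
    mod = import_module(module)
    return getattr(mod, name) if name else mod


sqrt = import_name("math", "sqrt")
math_mod = import_name("math", "")
```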

friendly_data.helpers.is_windows() bool[source]

Check if we are on Windows

friendly_data.helpers.match(pattern, **kwargs)[source]

Wrap glom.Match with the default action set to glom.SKIP.

This is very useful to match items inside nested data structures. A few example uses:

>>> from glom import glom
>>> cols = [
...     {
...         "name": "abc",
...         "type": "integer",
...         "constraints": {"enum": []}
...     },
...     {
...         "name": "def",
...         "type": "string"
...     },
... ]
>>> glom(cols, [match({"constraints": {"enum": list}, str: str})])
[{'name': 'abc', 'type': 'integer', 'constraints': {'enum': []}}]

For details see: glom.Match

class friendly_data.helpers.noop_map[source]

A noop mapping class

A dictionary subclass that falls back to noop on KeyError and returns the key being looked up.

Methods

clear()

copy()

fromkeys(iterable[, value])

Create a new dictionary with keys from iterable and values set to value.

get(key[, default])

Return the value for key if key is in the dictionary, else default.

items()

keys()

pop(key[, default])

If key is not found, default is returned if given, otherwise KeyError is raised

popitem(/)

Remove and return a (key, value) pair as a 2-tuple.

setdefault(key[, default])

Insert key with a value of default if key is not in the dictionary.

update([E, ]**F)

If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]

values()
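The fallback on a missing key can be sketched with the __missing__ hook of dict. NoopMap below is a hypothetical re-implementation, not the class shipped with friendly_data.

```python
class NoopMap(dict):
    """Dictionary that returns the key itself when it is not present."""

    def __missing__(self, key):
        # called by dict.__getitem__ on a failed lookup
        return key


alias = NoopMap({"tech": "technology"})
known = alias["tech"]
unknown = alias["year"]
```

Note that in this sketch only subscript access falls back to the key; get() is not affected by __missing__ and still returns its default.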

friendly_data.helpers.sanitise(string: str) str[source]

Sanitise string for use as group/directory name

friendly_data.helpers.select(spec, **kwargs)[source]

Wrap glom.Check with the default action set to glom.SKIP.

This is very useful to select items inside nested data structures. A few example uses:

>>> from glom import glom
>>> cols = [
...     {
...         "name": "abc",
...         "type": "integer"
...     },
...     {
...         "name": "def",
...         "type": "string"
...     },
... ]
>>> glom(cols, [select("name", equal_to="abc")])
[{'name': 'abc', 'type': 'integer'}]

For details see: glom.Check