Technical documentation of functions and classes provided by the different friendly_data modules.

Datapackage tools

Functions useful to interact with a data package.

friendly_data.dpkg.create_pkg(meta: Dict, fpaths: Iterable[_res_t], basepath: Union[str, Path] = '', infer=True)[source]

Create a datapackage from metadata and resources.

If resources point to files that exist, their schemas are inferred and added to the package. If basepath is a non-empty string, it is treated as the parent directory, and all resource file paths are checked relative to it.

Parameters

metaDict: A dictionary with package metadata.
fpathsIterable[Union[str, Path, Dict]]: An iterator over different resources. Resources are paths to files, relative to basepath.
basepathstr (default: empty string): Directory where the package files are located
inferbool (default: True): Whether to infer resource schema

Returns

Package: A datapackage with inferred schema for all the package resources
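
The way resource paths are checked relative to basepath can be pictured with the standard library alone. This is a minimal sketch, not the friendly_data implementation; the helper name split_existing is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Tuple, Union


def split_existing(
    fpaths: Iterable[Union[str, Path]], basepath: Union[str, Path] = ""
) -> Tuple[List[Path], List[Path]]:
    """Hypothetical helper: partition paths by whether they exist under basepath."""
    base = Path(basepath)  # Path("") behaves like the current directory
    found: List[Path] = []
    missing: List[Path] = []
    for fpath in fpaths:
        # Existence is checked relative to basepath, as described above
        (found if (base / fpath).exists() else missing).append(Path(fpath))
    return found, missing
```

Only the paths in the first list would be candidates for schema inference in this sketch.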

friendly_data.dpkg.entry_from_res(res: Resource) → Dict[source]

Create an index entry from a Resource object

Parameters

resResource: A resource object

Returns

Dict: A dictionary that is an index entry

friendly_data.dpkg.fullpath(resource: Resource) → Path[source]

Get full path of a resource

Parameters

resourceResource: Resource object/dictionary

Returns

Path: Full path to the resource

friendly_data.dpkg.get_aliased_cols(cols: Iterable[str], col_t: str, alias: Dict[str, str]) → Dict[source]

Get aliased columns from the registry

Parameters

colsIterable[str]: List of columns to retrieve
col_tLiteral[“cols”, “idxcols”]: A literal string specifying the kind of column; one of: “cols”, or “idxcols”
aliasDict[str, str]: Dictionary of aliases; key is the column name in the dataset, and the value is the column name in the registry that it is equivalent to.

Returns

Dict: Schema for each column, the column name is the key, and the schema is the value; see the docstring of index_levels() for more.

friendly_data.dpkg.idxpath_from_pkgpath(pkgpath: Union[str, Path]) → Union[str, Path][source]

Return a valid index path given a package path

Parameters

pkgpathUnion[str, Path]: Path to package directory

Returns

Union[str, Path]

Returns a valid index path; if there are multiple matches, returns the lexicographically first match
If an index file is not found, returns an empty string

Warns

Warns if no index file is found
Warns if multiple index files are found
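
The lookup rules above (first match in lexicographic order, empty string and a warning when nothing is found) might be sketched as follows. The function name find_index and the accepted file names (index.yaml, index.yml, index.json) are assumptions made for illustration, not taken from the library.

```python
import warnings
from pathlib import Path
from typing import Union


def find_index(pkgpath: Union[str, Path]) -> Union[str, Path]:
    """Hypothetical stand-in for the index lookup described above."""
    # Assumption: index files are named index.yaml, index.yml, or index.json
    suffixes = {".yaml", ".yml", ".json"}
    matches = sorted(p for p in Path(pkgpath).glob("index.*") if p.suffix in suffixes)
    if not matches:
        warnings.warn(f"{pkgpath}: no index file found")
        return ""
    if len(matches) > 1:
        warnings.warn(f"multiple index files found, using {matches[0]}")
    return matches[0]  # lexicographically first match
```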

friendly_data.dpkg.index_levels(file_or_df: Union[str, Path, _dfseries_t], idxcols: Iterable[str], alias: Dict[str, str] = {}) → Tuple[_dfseries_t, Dict][source]

Read a dataset and determine the index levels

Parameters

file_or_dfUnion[str, Path, pd.DataFrame, pd.Series]: A dataframe, or the path to a CSV file
idxcolsIterable[str]: List of columns in the dataset that constitute the index
aliasDict[str, str]: Column aliases: {my_alias: col_in_registry}

Returns

Tuple[Union[pd.DataFrame, pd.Series], Dict]

Tuple of the dataset, and the schema of each index column as a dictionary. If idxcols was [“foo”, “bar”], the dictionary might look like:

{
    "foo": {
        "name": "foo",
        "type": "datetime",
        "format": "default"
    },
    "bar": {
        "name": "bar",
        "type": "string",
        "constraints": {
            "enum": ["a", "b"]
        }
    }
}

Note that index columns with categorical values are filled in by reading the dataset and determining the full set of values.
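
To make the last point concrete, here is a small sketch of how the set of categorical values could be collected from a CSV dataset. It uses only the standard library; enum_levels is a hypothetical helper, and it treats every index column as a string column, which the real function does not necessarily do.

```python
import csv
import io
from typing import Dict, Iterable


def enum_levels(csv_text: str, idxcols: Iterable[str]) -> Dict[str, Dict]:
    """Hypothetical helper: build a string/enum schema for each index column."""
    idxcols = list(idxcols)
    seen = {col: set() for col in idxcols}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for col in idxcols:
            seen[col].add(row[col])  # collect every value that occurs
    return {
        col: {"name": col, "type": "string", "constraints": {"enum": sorted(vals)}}
        for col, vals in seen.items()
    }


data = "bar,value\na,1\nb,2\na,3\n"
print(enum_levels(data, ["bar"]))
# {'bar': {'name': 'bar', 'type': 'string', 'constraints': {'enum': ['a', 'b']}}}
```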

friendly_data.dpkg.pkg_from_files(meta: Dict, fpath: Union[str, Path], fpaths: Iterable[Union[str, Path]]) → Tuple[Path, Package, Optional[pkgindex]][source]

Create a package from an index file and other files

Parameters

metaDict: A dictionary with package metadata.
fpathUnion[str, Path]: Path to the package directory or index file. Note the index file has to be at the top level directory of the datapackage. See pkgindex.from_file()
fpathsList[Union[str, Path]]: A list of paths to datasets/resources not in the index. If any of the paths point to a dataset already present in the index, the index entry is respected.

Returns

Tuple[Path, Package, Union[pkgindex, None]]: A datapackage with inferred schema for the resources/datasets present in the index; all other resources are added with a basic inferred schema.

friendly_data.dpkg.pkg_from_index(meta: Dict, fpath: Union[str, Path]) → Tuple[Path, Package, pkgindex][source]

Read an index file, and create a datapackage with the provided metadata.

The index can be in either YAML or JSON format. It is a list of dataset files, names, and a list of columns in the dataset that are to be treated as index columns (see the example below).

Parameters

metaDict: Package metadata dictionary
fpathUnion[str, Path]: Path to the index file. Note the index file has to be at the top level directory of the datapackage.

Returns

Tuple[Path, Package, pkgindex]: The package directory, the Package object, and the index.

Examples

YAML (JSON is also supported):

- path: file1
  name: dst1
  idxcols: [cola, colb]
- path: file2
  name: dst2
  idxcols: [colx, coly, colz]
- path: file3
  name: dst3
  idxcols: [col]

class friendly_data.dpkg.pkgindex(iterable=(), /)[source]

Data package index (a subclass of list)

It is a list of dictionaries, where each dictionary is the respective record for a file. A record may have the following keys:

“path”: path to the file,
“idxcols”: list of column names that are to be included in the dataset index (optional),
“name”: dataset name (optional),
“skip”: lines to skip when reading the dataset (optional, CSV only),
“alias”: a mapping of column name aliases (optional),
“sheet”: sheet name or position (0-indexed) to use as dataset (optional, Excel only)

While iterating over an index, always use records() to ensure all necessary keys are present.

Methods

`append`(object, /)	Append object to the end of the list.
`clear`(/)	Remove all items from list.
`copy`(/)	Return a shallow copy of the list.
`count`(value, /)	Return number of occurrences of value.
`extend`(iterable, /)	Extend list by appending elements from the iterable.
`from_file`(fpath)	Read the index of files included in the data package
`get`(key)	Get the value of `key` from all records as a list.
`index`(value[, start, stop])	Return first index of value.
`insert`(index, object, /)	Insert object before index.
`pop`([index])	Remove and return item at index (default last).
`records`(keys)	Return an iterable of index records.
`remove`(value, /)	Remove first occurrence of value.
`reverse`(/)	Reverse IN PLACE.
`sort`(*[, key, reverse])	Sort the list in ascending order and return None.

classmethod from_file(fpath: Union[str, Path]) → pkgindex[source]

Read the index of files included in the data package

Parameters

fpathUnion[str, Path]: Index file path or a stream object

Returns

List[Dict]

Raises

ValueError: If the file type is correct (YAML/JSON), but does not return a list
RuntimeError: If the file has an unknown extension (raised by friendly_data.io.dwim_file())
MatchError: If the file contains any unknown keys

get(key: str) → List[source]

Get the value of key from all records as a list.

If key is absent, the corresponding value is set to None.

Parameters

keystr: Key to retrieve

Returns

List: List of values corresponding to key, one per record.

records(keys: List[str]) → Iterable[Dict][source]

Return an iterable of index records.

Each record is guaranteed to have all the requested keys. If a value wasn’t specified in the index file, it is set to None.

Parameters

keysList[str]: List of keys that are requested in each record.

Returns

Iterable[Dict]

Raises

glom.MatchError: If keys has an unsupported value
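
The None-filling behaviour of get() and records() can be illustrated with a plain list subclass. The class IndexSketch below is hypothetical and deliberately omits the key validation that raises glom.MatchError in the real class.

```python
from typing import Dict, Iterable, List


class IndexSketch(list):
    """Hypothetical list subclass mimicking the documented behaviour."""

    def get(self, key: str) -> List:
        # One value per record; None where the key is absent
        return [rec.get(key) for rec in self]

    def records(self, keys: List[str]) -> Iterable[Dict]:
        # Every yielded record carries all requested keys
        for rec in self:
            yield {key: rec.get(key) for key in keys}


idx = IndexSketch([{"path": "file1", "idxcols": ["cola"]}, {"path": "file2"}])
print(idx.get("idxcols"))  # [['cola'], None]
print(list(idx.records(["path", "alias"])))
# [{'path': 'file1', 'alias': None}, {'path': 'file2', 'alias': None}]
```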

friendly_data.dpkg.read_pkg(pkg_path: Union[str, Path], extract_dir: Optional[Union[str, Path]] = None)[source]

Read a datapackage

If pkg_path points to a datapackage.json file, it is read as is. If it points to a zip archive, the archive is extracted before it is opened; if extract_dir is not provided, the archive is extracted to the directory of the zip archive. If it is a directory, a datapackage.json file is looked for inside it.

Parameters

pkg_pathUnion[str, Path]: Path to the datapackage.json file, or a zip archive
extract_dirUnion[str, Path]: Path to which the zip archive is extracted

Returns

Package

Raises

ValueError: When an unsupported format (not a directory, JSON, or ZIP) is provided
FileNotFoundError: When a datapackage.json file cannot be found
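
The dispatch on the kind of path, including the two documented exceptions, could look roughly like this. It is a sketch under assumptions: read_descriptor is a hypothetical name, it returns the parsed JSON descriptor rather than a Package object, and it assumes datapackage.json sits at the top level of the archive.

```python
import json
import zipfile
from pathlib import Path
from typing import Dict, Optional, Union


def read_descriptor(
    pkg_path: Union[str, Path], extract_dir: Optional[Union[str, Path]] = None
) -> Dict:
    """Hypothetical stand-in: return the parsed datapackage.json descriptor."""
    pkg_path = Path(pkg_path)
    if pkg_path.is_dir():
        pkg_json = pkg_path / "datapackage.json"
    elif pkg_path.suffix == ".json":
        pkg_json = pkg_path
    elif pkg_path.suffix == ".zip":
        # Default to the directory that holds the archive
        extract_dir = Path(extract_dir) if extract_dir else pkg_path.parent
        with zipfile.ZipFile(pkg_path) as archive:
            archive.extractall(extract_dir)
        pkg_json = extract_dir / "datapackage.json"
    else:
        raise ValueError(f"{pkg_path}: expected a directory, JSON, or ZIP file")
    if not pkg_json.exists():
        raise FileNotFoundError(f"{pkg_json}: not found")
    return json.loads(pkg_json.read_text())
```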

friendly_data.dpkg.res_from_entry(entry: Dict, pkg_dir: Union[str, Path]) → Resource[source]

Create a resource from an index entry.

Entry must have the keys: path, idxcols, alias; so use pkgindex.records() to iterate over the index.

Parameters

entryDict

Entry from an index file:

{
  "path": "data.csv",
  "idxcols": ["col1", "col2"],
  "alias": {
    "col1": "col0"
  }
}

pkg_dirUnion[str, Path]

Root directory of the package

Returns

Resource: The resource object (subclass of dict)

friendly_data.dpkg.resource_(spec: Dict, basepath: Union[str, Path] = '', infer=True) → Resource[source]

Create a Resource object based on the dictionary

Parameters

specDict

Dictionary with the structure:

{"path": "relpath/resource.csv", "skip": <nrows>, "sheet": <num>}

both “skip” & “sheet” are optional; “sheet” can be used to select a specific sheet as the dataset; sheet numbering starts at 1.

basepathUnion[str, Path]

Base path for resource object

inferbool (default: True)

Whether to infer resource schema

Returns

Resource

friendly_data.dpkg.set_idxcols(fpath: Union[str, Path], basepath: Union[str, Path] = '') → Resource[source]

Create a resource object for a file, with index columns set

Parameters

fpathUnion[str, Path]: Path to a dataset (resource), e.g. a CSV file, relative to basepath
basepathUnion[str, Path], default: empty string (current directory): Path to directory to consider as the data package basepath

Returns

Resource: Resource object with the index columns set according to the registry

friendly_data.dpkg.write_pkg(pkg: Union[Dict, Package], pkgdir: Union[str, Path], *, idx: Optional[Union[pkgindex, List]] = None) → List[Path][source]

Write a data package to path

Parameters

pkg: Package: Package object
pkgdir: Union[str, Path]: Path to write to
idx: Union[pkgindex, List] (optional): Package index written to pkgdir/index.json

Returns

List[Path]: List of files written to disk

Metadata tools

Functions useful to access and manipulate package metadata.

friendly_data.metatools.check_license(lic: Dict[str, str]) → Dict[str, str][source]

Return the license spec from the metadata

Issue a warning if the license is old. TODO: add other recommendations

Parameters

licDict[str, str], alias _license_t

License metadata dictionary (as returned by the Open Definition License Service). Example for CC-BY-SA:

{
  "domain_content": true,
  "domain_data": true,
  "domain_software": false,
  "family": "",
  "id": "CC-BY-SA-4.0",
  "maintainer": "Creative Commons",
  "od_conformance": "approved",
  "osd_conformance": "not reviewed",
  "status": "active",
  "title": "Creative Commons Attribution Share-Alike 4.0",
  "url": "https://creativecommons.org/licenses/by-sa/4.0/"
}

friendly_data.metatools.get_license(lic: str, group: str = 'all') → Dict[str, str][source]

Return the license metadata

Retrieve the license metadata of the requested group from the Open Definition License Service and cache it in a temporary file. From the retrieved list, find the requested license and return it.

Parameters

licstr or None: Requested license; if None, interactively ask for the license name
group{‘all’, ‘osi’, ‘od’, ‘ckan’}: License group where to find the license

Returns

Dict[str, str], alias _license_t: A dictionary with the license metadata

Raises

ValueError: If the license group is incorrect
KeyError: If the license cannot be found in the provided group

friendly_data.metatools.lic_domain(lic: Dict[str, str]) → str[source]: Find the domain of a license

friendly_data.metatools.lic_metadata(keys: Iterable[str], pred: Callable[[Dict], bool] = <function <lambda>>) → List[Dict[str, str]][source]

Return a list of license metadata with the requested set of keys

Parameters

keysIterable[str]: List of keys to include in the metadata
predCallable[[Dict], bool]: A predicate to select a subset of licenses. It should accept a dictionary with license metadata, and return a boolean indicating whether to accept or not.

Returns

List[Dict]: List of license metadata
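
The interplay of keys and pred can be shown on an in-memory list. Everything below is illustrative: select_metadata is a hypothetical helper, the second license record is made up, and the accept-all default predicate is an assumption (the documentation only shows that the default is a lambda).

```python
from typing import Callable, Dict, Iterable, List

# Made-up records for illustration; real ones come from the license service
LICENSES = [
    {"id": "CC-BY-SA-4.0", "status": "active", "domain_data": True},
    {"id": "EXAMPLE-1.0", "status": "retired", "domain_data": False},
]


def select_metadata(
    keys: Iterable[str], pred: Callable[[Dict], bool] = lambda lic: True
) -> List[Dict]:
    """Hypothetical helper: keep licenses accepted by pred, trimmed to keys."""
    keys = list(keys)
    return [{k: lic[k] for k in keys} for lic in LICENSES if pred(lic)]


print(select_metadata(["id"], lambda lic: lic["status"] == "active"))
# [{'id': 'CC-BY-SA-4.0'}]
```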

friendly_data.metatools.list_licenses(group: str = 'all') → List[str][source]: Return list of valid licenses

friendly_data.metatools.resolve_licenses(meta: Dict) → Dict[source]: Check and fix if licenses are specified correctly in the metadata

Registry API

Configurable Friendly data schema registry

Module to wrap around the default friendly_data_registry to add configurability. A custom registry configuration can be specified by using the config_ctx() context manager. The RegistrySchema validates the registry config before customising the default registry.

class friendly_data.registry.RegistrySchema(registry_config: Dict[str, List[Dict]])[source]

Instantiate with the “registry” section of the config file to validate

The registry section looks like this:

registry:
  idxcols:
    - name: enduse
      type: string
      constraints:
        enum:
          - ...
  cols:
    - name: cost
      type: number
      constraints:
        minimum: 0

Methods

`clear`()
`copy`()
`fromkeys`(iterable[, value])	Create a new dictionary with keys from iterable and values set to value.
`get`(key[, default])	Return the value for key if key is in the dictionary, else default.
`items`()
`keys`()
`pop`(key[, default])	If key is not found, default is returned if given, otherwise KeyError is raised
`popitem`(/)	Remove and return a (key, value) pair as a 2-tuple.
`setdefault`(key[, default])	Insert key with a value of default if key is not in the dictionary.
`update`([E, ]**F)	If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
`values`()

friendly_data.registry.config_ctx(*, confdict: Dict[str, List[Dict]] = {}, conffile: Union[str, Path] = '', idxcols: List[Dict] = [], cols: List[Dict] = [])[source]

Context manager to temporarily override the default registry

Note that the parameters are allowed only as keyword arguments, and multiple parameters are not allowed at the same time. They are checked in the order shown here; once one is found, the following parameters are ignored.

The registry config is also validated. If validation fails, an error message is logged, and the default registry remains unaltered.

Parameters

confdictDict[str, List[Dict]]: Registry config in dictionary form
conffileUnion[str, Path]: Path to a config file with a custom registry section
idxcolsList[Dict]: List of custom index columns
colsList[Dict]: List of custom value columns

Returns

Generator[Dict[str, List[Dict]]]: The custom registry config

Examples

from friendly_data.registry import config_ctx, get, getall

with config_ctx(conffile="config.yaml") as _:
    print(get("mycol", "cols"))
    print(getall())
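
The rule that parameters are checked in order and that the override is temporary can be sketched with contextlib. This is not the library's code: override_ctx, current, and the module-level stack are hypothetical, and the conffile branch is left out to keep the sketch free of a YAML dependency.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional

_stack: List[Dict] = [{}]  # hypothetical stack; the last item is the active config


def current() -> Dict:
    """Return the registry config that is active right now."""
    return _stack[-1]


@contextmanager
def override_ctx(
    *,
    confdict: Optional[Dict[str, List[Dict]]] = None,
    idxcols: Optional[List[Dict]] = None,
    cols: Optional[List[Dict]] = None,
) -> Iterator[Dict]:
    """Hypothetical sketch: the first non-empty argument wins."""
    if confdict:
        conf = confdict
    elif idxcols:
        conf = {"idxcols": idxcols, "cols": []}
    elif cols:
        conf = {"idxcols": [], "cols": cols}
    else:
        conf = {}
    _stack.append(conf)
    try:
        yield conf
    finally:
        _stack.pop()  # the previous config is active again on exit
```

In this sketch, passing both idxcols and cols results in a config where cols is empty, because idxcols is checked first.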

friendly_data.registry.get(col: str, col_t: str) → Dict[source]

Wraps around the getters in friendly_data_registry.get().

If a custom registry config has been specified, columns from the config are also considered. A custom registry config can be set using the config_ctx() context manager.

friendly_data.registry.getall(with_file=False) → Dict[str, List[Dict]][source]

Wraps around the getters in friendly_data_registry.getall().

If a custom registry config has been specified, columns from the config are also considered. A custom registry config can be set using the config_ctx() context manager.

The Friendly data schema registry

This module provides getter methods to retrieve individual columns, get(), or the whole registry, getall(). The function utilities and classes are used by the module internally.

friendly_data_registry.get(col: str, col_t: str) → Dict[source]

Retrieve the column schema from column schema registry: friendly_data_registry

Parameters

colstr: Column name to look for
col_tLiteral[“cols”, “idxcols”]: A literal string specifying the kind of column; one of: “cols”, or “idxcols”

Returns

Dict: Column schema; an empty dictionary is returned in case there are no matches

Raises

RuntimeError: When more than one match is found
ValueError: When the schema file in the registry is unsupported; not one of: JSON, or YAML
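
The return and error semantics (empty dictionary for no match, RuntimeError for several) can be mimicked on an in-memory registry. The names REGISTRY and lookup are hypothetical, and the registry contents are made up from the column names used in the config example earlier in this document.

```python
from typing import Dict, List

# Made-up registry contents, for illustration only
REGISTRY: Dict[str, List[Dict]] = {
    "idxcols": [{"name": "enduse", "type": "string"}],
    "cols": [{"name": "cost", "type": "number"}],
}


def lookup(col: str, col_t: str) -> Dict:
    """Hypothetical stand-in for the lookup semantics described above."""
    matches = [schema for schema in REGISTRY[col_t] if schema["name"] == col]
    if len(matches) > 1:
        raise RuntimeError(f"{col}: multiple matches in the registry")
    return matches[0] if matches else {}


print(lookup("cost", "cols"))     # {'name': 'cost', 'type': 'number'}
print(lookup("cost", "idxcols"))  # {}
```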

friendly_data_registry.getall(with_file: bool = False) → Dict[str, List[Dict]][source]

Get all columns from registry, primarily to generate documentation

Returns

Dict[str, List[Dict]]

The returned value is separated by column type:

{
  "idxcols": [
    {..}  # column schemas
  ],
  "cols": [
    {..}  # column schemas
  ],
}

Raises

RuntimeError: When more than one match is found

friendly_data_registry.read_file(fpath: Union[str, Path]) → Union[Dict, List][source]: Read JSON or yaml file; file type is guessed from extension

class friendly_data_registry.schema(schema: dict)[source]

Registry column schema. Instantiate to validate.

Raises

TypeMatchError: When the column schema has a type mismatch
MatchError: Other mismatches like, an incorrectly named key

Methods

`clear`()
`copy`()
`fromkeys`(iterable[, value])	Create a new dictionary with keys from iterable and values set to value.
`get`(key[, default])	Return the value for key if key is in the dictionary, else default.
`items`()
`keys`()
`pop`(key[, default])	If key is not found, default is returned if given, otherwise KeyError is raised
`popitem`(/)	Remove and return a (key, value) pair as a 2-tuple.
`setdefault`(key[, default])	Insert key with a value of default if key is not in the dictionary.
`update`([E, ]**F)	If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
`values`()

Command Line Interface

Functions that are run from the CLI to create, or edit a data package.

friendly_data.cli.create(idxpath: str, *fpaths: str, name: str = '', title: str = '', licenses: str = '', description: str = '', keywords: str = '', inplace: bool = False, export: str = '', config: str = '')[source]

Create a package from an index file and other files

Package metadata provided with command line flags override metadata from the config file.

Parameters

idxpathstr: Path to the index file or package directory with the index file. Note the index file has to be at the top level directory of the datapackage.
fpathsTuple[str]: List of datasets/resources not in the index. If any of them point to a dataset already present in the index, it is ignored.
namestr: Package name (no spaces or special characters)
titlestr: Package title
licensesstr: License
descriptionstr: Package description
keywordsstr: A space separated list of keywords: ‘renewable energy model’ -> [‘renewable’, ‘energy’, ‘model’]
inplacebool: Whether to create the data package by only adding metadata to the current directory. NOTE: one of inplace/export must be chosen
exportstr: Create the data package in the provided directory instead of the current directory
configstr: Config file in YAML format with metadata and custom registry. The metadata should be under a “metadata” section, and the custom registry under a “registry” section.
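
How command line flags could take precedence over config file metadata, and how the keywords string is split, is sketched below. merge_metadata is a hypothetical helper; treating an empty string as "flag not given" is an assumption based on the empty-string defaults in the signature.

```python
from typing import Dict


def merge_metadata(config_meta: Dict, **flags: str) -> Dict:
    """Hypothetical helper: non-empty flags override config-file metadata."""
    meta = dict(config_meta)
    for key, value in flags.items():
        if not value:  # an empty string means the flag was not given
            continue
        # Keywords arrive as one space separated string
        meta[key] = value.split() if key == "keywords" else value
    return meta


meta = merge_metadata(
    {"name": "from-config", "title": "From config"},
    name="demo",
    keywords="renewable energy model",
)
print(meta)
# {'name': 'demo', 'title': 'From config', 'keywords': ['renewable', 'energy', 'model']}
```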

friendly_data.cli.describe(pkgpath: str, config: str = '', report_dir: str = '')[source]

Give a summary of the data package

Parameters

pkgpathstr: Path to the data package
report_dirstr (default: empty string): If not empty, generate an HTML report and write it to this directory. The directory will have an index.html file, and one HTML file for each dataset.
configstr: Config file in YAML format with custom registry. It should be defined under a “registry” section.

friendly_data.cli.describe_registry(column_type: str = '')[source]

Describe columns defined in the registry

Parameters

column_typestr (default: empty string → all): Column type to list; one of: “cols”, or “idxcols”. If nothing is provided (default), columns of both types are listed.

friendly_data.cli.generate_index_file(idxpath: str, *fpaths: str, config: str = '')[source]

Generate an index file from a set of dataset files

Parameters

idxpathstr: Path where the index file (YAML format) should be written
fpathsTuple[str]: List of datasets/resources to include in the index
configstr: Config file in YAML format with custom registry. It should be defined under a “registry” section.

friendly_data.cli.license_info(lic: str) → Dict[source]

Give detailed metadata about a license

Parameters

licstr: License ID as listed in the output of friendly_data list-licenses

Returns

Dict: License metadata

friendly_data.cli.license_prompt() → Dict[str, str][source]: Prompt for a license on the terminal (with completion).

friendly_data.cli.list_licenses() → str[source]

List commonly used licenses

NOTE: for Python API users, not to be confused with metatools.list_licenses().

Returns

str: ASCII table with commonly used licenses

friendly_data.cli.main()[source]: Entry point for console scripts

friendly_data.cli.remove(pkgpath: str, *fpaths: str, rm_from_disk: bool = False) → str[source]

Remove datasets from the package

Parameters

pkgpathstr: Path to the package directory
fpathsTuple[str]: List of datasets/resources to be removed from the package. The index is updated accordingly.
rm_from_diskbool (default: False): Permanently delete the files from disk

friendly_data.cli.reports(pkg: Package, report_dir: str)[source]

Write HTML reports summarising all resources in the package

Parameters

pkgPackage
report_dirstr: Directory where reports are written

Returns

int: Bytes written (index.html)

friendly_data.cli.to_iamc(config: str, idxpath: str, iamcpath: str, *, wide: bool = False)[source]

Aggregate datasets into an IAMC dataset

Parameters

configstr: Config file
idxpathstr: Index file
iamcpathstr: IAMC dataset
widebool (default: False): Enable wide IAMC format

friendly_data.cli.update(pkgpath: str, *fpaths: str, name: str = '', title: str = '', licenses: str = '', description: str = '', keywords: str = '', config: str = '')[source]

Update metadata and datasets in a package.

Parameters

pkgpathstr: Path to the package.
fpathsTuple[str]: List of datasets/resources; they could be new datasets or datasets with updated index entries.
namestr: Package name (no spaces or special characters)
titlestr: Package title
descriptionstr: Package description
keywordsstr: A space separated list of keywords: ‘renewable energy model’ -> [‘renewable’, ‘energy’, ‘model’]
licensesstr: License
configstr: Config file in YAML format with metadata and custom registry. The metadata should be under a “metadata” section, and the custom registry under a “registry” section.

Validation functions

Functions useful to validate a data package or parts of its schema.

friendly_data.validate.check_pkg(pkg) → List[Dict][source]

Validate all resources in a datapackage for common errors.

Typical errors that are checked:

blank-header,
extra-label,
missing-label,
blank-label,
duplicate-label,
incorrect-label,
blank-row,
primary-key-error,
foreign-key-error,
extra-cell,
missing-cell,
type-error,
constraint-error,
unique-error

Parameters

pkgfrictionless.Package: The datapackage descriptor dictionary

Returns

List[Dict]: A list of dictionaries summarising the validation checks.

friendly_data.validate.check_schema(ref: Dict[str, str], dst: Dict[str, str], *, remap: Optional[Dict[str, str]] = None) → Tuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]][source]

Compare a schema with a reference.

The reference schema is a minimal set, meaning, any additional fields in the compared schema are accepted, but omissions are not.

Name comparisons are case-sensitive.

TODO: maybe also compare constraints?

Parameters

refDict[str, str]: Reference schema dictionary
dstDict[str, str]: Schema dictionary from the dataset being validated
remapDict[str, str] (optional): Column/field names that are to be remapped before checking.

Returns

resultTuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]]

Result tuple:

Boolean flag indicating if it passed the checks or not
If checks failed, set of missing columns from minimal set
If checks failed, set of columns with mismatching types. It is a dictionary with the column name as key, and the reference type and the actual type in a tuple as value.
```
{
    'col_x': ('integer', 'number'),
    'col_y': ('datetime', 'string'),
}
```
If primary keys are different, tuple with the diff. The first element is the index where the two differ, and the two subsequent elements are the corresponding elements from the reference and dataset primary key list: (index, ref_col, dst_col)
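
A comparison that returns a result tuple of this shape might be sketched as follows. It is not the library's implementation: compare_schema is a hypothetical name, and the schema layout (a "fields" list of name/type pairs plus an optional "primaryKey" list) is assumed from the frictionless table schema format.

```python
from typing import Dict, List, Optional, Set, Tuple


def compare_schema(ref: Dict, dst: Dict, *, remap: Optional[Dict[str, str]] = None):
    """Hypothetical sketch of a minimal-set schema comparison."""
    remap = remap or {}
    ref_t = {f["name"]: f["type"] for f in ref["fields"]}
    # Rename dataset columns before comparing; extra columns are accepted
    dst_t = {remap.get(f["name"], f["name"]): f["type"] for f in dst["fields"]}
    missing: Set[str] = set(ref_t) - set(dst_t)
    mismatch: Dict[str, Tuple[str, str]] = {
        name: (ref_t[name], dst_t[name])
        for name in ref_t
        if name in dst_t and ref_t[name] != dst_t[name]
    }
    ref_pk = ref.get("primaryKey", [])
    dst_pk = [remap.get(col, col) for col in dst.get("primaryKey", [])]
    # zip() stops at the shorter list, so length differences are not reported
    pk_diff: List[Tuple] = [
        (i, r, d) for i, (r, d) in enumerate(zip(ref_pk, dst_pk)) if r != d
    ]
    ok = not (missing or mismatch or pk_diff)
    return ok, missing, mismatch, pk_diff
```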

friendly_data.validate.summarise_diff(diff: Tuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]]) → str[source]: Summarise the schema diff from check_schema() results as a pandas.DataFrame.

friendly_data.validate.summarise_errors(report: List[Dict]) → pandas.DataFrame[source]

Summarise the dict/json error report as a pandas.DataFrame

Parameters

reportList[Dict]: List of errors as returned by check_pkg()

Returns

pandas.DataFrame

Summary dataframe; example:

   filename  row  col       error  remark
0   bad.csv   12       extra-cell     ...
1   bad.csv   22  SRB  type-error     ...

Data analysis interface

The following modules provide interfaces that are useful when you are working with common data analysis frameworks like pandas and xarray.

Converters

Functions useful to read a data package resource into common analysis frameworks like pandas, xarray, etc. Currently supported:

Library	Data Structure
`pandas`	`pandas.DataFrame`
`xarray` (via `pandas`)	`xarray.DataArray`, `xarray.Dataset`, multi-file `xarray.Dataset`

Type mapping between the frictionless specification and pandas types:

schema type	`pandas` type
`boolean`	`bool`
`datetime`	`datetime64`
`integer`	`Int64`
`number`	`float`
`string`	`string`
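
As a small illustration, the table above can be expressed as a lookup from schema field types to pandas type names. The helper dtypes_from_schema is hypothetical, and the "fields" layout of the schema is assumed from the frictionless table schema format.

```python
from typing import Dict

# Mapping copied from the table above
PANDAS_TYPES = {
    "boolean": "bool",
    "datetime": "datetime64",
    "integer": "Int64",
    "number": "float",
    "string": "string",
}


def dtypes_from_schema(schema: Dict) -> Dict[str, str]:
    """Hypothetical helper: look up the pandas type for every schema field."""
    return {f["name"]: PANDAS_TYPES[f["type"]] for f in schema["fields"]}


schema = {"fields": [{"name": "enduse", "type": "string"}, {"name": "cost", "type": "number"}]}
print(dtypes_from_schema(schema))  # {'enduse': 'string', 'cost': 'float'}
```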

friendly_data.converters.from_df(df: _dfseries_t, basepath: Union[str, Path], datapath: Union[str, Path] = '', alias: Dict[str, str] = {}, rename: bool = True) → Resource[source]

Write dataframe to a CSV file, and return a data package resource.

NOTE: Do not call frictionless.Resource.infer() on the resource instance returned by this function, as that might overwrite our metadata/schema customisations with default heuristics in the frictionless implementation.

Parameters

dfpd.DataFrame | pd.Series: Dataframe to write
basepathUnion[str, Path]: Path to the package directory
datapathUnion[str, Path] (default: empty string): Path to the CSV file where the dataframe is written. If datapath is empty, a file name is generated by concatenating all the columns in the dataframe.
aliasDict[str, str] (default: {}): A dictionary of column aliases if the dataframe has custom column names that need to be mapped to columns in the registry. The key is the column name in the dataframe, and the value is a column in the registry.
renamebool (default: True): Rename aliased columns to match the registry when writing to the CSV.

Returns

frictionless.Resource: Data package resource that points to the CSV file.

friendly_data.converters.from_dst(dst: xarray.Dataset, basepath: Union[str, Path], alias: Dict[str, str] = {}) → List[Resource][source]

Write an xarray.Dataset into CSV files, and return the list of resources

Each data variable is written to a separate CSV file in the directory specified by basepath. The file name is derived from the data variable name by sanitising it and appending the CSV extension.

Parameters

dstxr.Dataset: Dataset to write
basepathUnion[str, Path]: Path to the package directory
aliasDict[str, str]: A dictionary of column aliases if the dataset has custom data variable/coordinate names that need to be mapped to columns in the registry.

Returns

List[Resource]: List of data package resources that point to the CSV files.

friendly_data.converters.resolve_aliases(df: _dfseries_t, alias: Dict[str, str]) → _dfseries_t[source]

Return a copy of the dataframe with aliases resolved

Parameters

dfpd.DataFrame | pd.Series
aliasDict[str, str]: A dictionary of column aliases if the dataframe has custom column names that need to be mapped to columns in the registry. The key is the column name in the dataframe, and the value is a column in the registry.

Returns

pd.DataFrame | pd.Series: Since the column and index levels are renamed, a copy is returned so that the original dataframe/series remains unaltered.

friendly_data.converters.to_da(resource: Resource, noexcept: bool = False, **kwargs) → xarray.DataArray[source]

Reads a data package resource as an xarray.DataArray

This function is restricted to tables with only one value column (equivalent to a pandas.Series). All indices are treated as xarray.core.coordinates.DataArrayCoordinates and dimensions. The array is reshaped to match the dimensions. Any unit index is extracted and attached as an attribute to the data array. It is assumed that the whole table uses the same unit.

Additional keyword arguments are passed on to xarray.DataArray.

Parameters

resourcefrictionless.Resource: A data package resource object
noexceptbool (default: False): Whether to suppress an exception
**kwargs: Additional keyword arguments that are passed on to xarray.DataArray

See also

to_df(): see for details on noexcept

friendly_data.converters.to_df(resource: Resource, noexcept: bool = False, **kwargs) → pandas.DataFrame[source]

Reads a data package resource as a pandas.DataFrame

FIXME: ‘format’ in the schema is ignored.

Parameters

resourcefrictionless.Resource: A data package resource object
noexceptbool (default: False): Whether to suppress an exception
**kwargs: Additional keyword arguments that are passed on to the reader: pandas.read_csv(), pandas.read_excel(), etc

Returns

pandas.DataFrame: NOTE: when noexcept is True, and there’s an exception, an empty dataframe is returned

Raises

ValueError: If the resource is not local, or if the source type the resource points to is not supported

friendly_data.converters.to_dst(resource: Resource, noexcept: bool = False, **kwargs) → xarray.Dataset[source]

Reads a data package resource as an xarray.Dataset

Unlike to_da(), this function works for all tables. All indices are treated as xarray.core.coordinates.DataArrayCoordinates and dimensions. The arrays are reshaped to match the dimensions. Any unit index is extracted and attached as an attribute to each data array. It is assumed that all columns in the table use the same unit.

Additional keyword arguments are passed on to xarray.Dataset.

Parameters

resourcefrictionless.Resource: A data package resource object
noexceptbool (default: False): Whether to suppress an exception
**kwargs: Additional keyword arguments that are passed on to xarray.Dataset

See also

to_df(): see for details on noexcept

friendly_data.converters.to_mfdst(resources: Iterable[Resource], noexcept: bool = False, **kwargs) → xarray.Dataset[source]

Reads a list of data package resources as an xarray.Dataset

This function reads multiple resources/files and converts each column into a data array (identical to to_dst()), which are then combined into one xarray.Dataset. Note that any value column that is present more than once in the data package is overwritten by the last one. If you want support for duplicates, you should use to_dst() and handle the duplicates yourself.

Parameters

resourcesList[frictionless.Resource]: List of data package resource objects
noexceptbool (default: False): Whether to suppress an exception
**kwargs: Additional keyword arguments that are passed on to xarray.Dataset

See also

to_df(): see for details on noexcept

friendly_data.converters.xr_da(df: pandas.DataFrame, col: Union[int, Hashable], *, coords: Dict, attrs: Dict = {}, **kwargs) → xarray.DataArray[source]

Create an xarray data array from a data frame

Parameters

dfpandas.DataFrame
colUnion[int, Hashable]: Column to use to create the data array, either use the column number, or column name
coordsDict: Dictionary of coordinate arrays
attrsDict: Dictionary of metadata attributes like unit

Returns

xarray.DataArray

friendly_data.converters.xr_metadata(df: pandas.DataFrame) → Tuple[pandas.DataFrame, Dict, Dict][source]

Extract metadata to create xarray data array/datasets

All indices except “unit” are extracted as coordinates, and “unit” is extracted as a metadata attribute.

Parameters

dfpandas.DataFrame

Returns

Tuple[pandas.DataFrame, Dict, Dict]: The dataframe with units removed, dictionary of coordinates, dictionary with constant attributes like units
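
The split described above can be illustrated with plain pandas. This is only a sketch, not the library's implementation: the helper name split_unit is hypothetical, and it assumes the unit lives in an index level called "unit" holding a single value.

```python
import pandas as pd

def split_unit(df):
    """Return (df without the unit level, coordinates, attributes)."""
    attrs = {}
    if "unit" in df.index.names:
        # Assumption: the whole table uses one unit, so the first is enough
        attrs["unit"] = df.index.get_level_values("unit").unique()[0]
        df = df.droplevel("unit")
    # Every remaining index level becomes a coordinate
    coords = {
        name: df.index.get_level_values(name).unique()
        for name in df.index.names
    }
    return df, coords, attrs

idx = pd.MultiIndex.from_tuples(
    [("DE", 2020, "MW"), ("DE", 2030, "MW"), ("FR", 2020, "MW")],
    names=["region", "year", "unit"],
)
table = pd.DataFrame({"capacity": [1.0, 2.0, 3.0]}, index=idx)
stripped, coords, attrs = split_unit(table)
```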

Time series API

Convenience functions to ingest differently shaped time series data into the standard 1-D shape supported by the data package specification.

friendly_data.tseries.from_multicol(fpath: _file_t, *, date_cols: List[_col_t], **kwargs)[source]

Read a time series where datetime values are in multiple columns.

See also

read_timeseries: see for full documentation, main entrypoint for users

friendly_data.tseries.from_table(fpath: _file_t, *, col_units: str, zero_idx: bool, row_fmt: str = '', **kwargs)[source]

Read a time series from a tabular file.

See also

read_timeseries: see for full documentation, main entrypoint for users

friendly_data.tseries.read_timeseries(fpath: _file_t, *, date_cols: Optional[List[_col_t]] = None, col_units: Optional[str] = None, zero_idx: bool = False, row_fmt: str = '', source_t: str = '', **kwargs)[source]

Read a time series from a file.

While the natural way to structure a time series dataset is with the index column as datetime values, with subsequent columns holding other values, there are a few other frequently used structures.

The first is to structure it as a table:

date	1	2	…	23	24
1/1/2016	0	10	…	2.3	5.1
4/1/2016	3	11	…	4.3	9.1

When source_t is set to “table”, this function reads a tabular dataset like the one above, and flattens it into a series, and sets the appropriate datetime values as their index.
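
To make the flattening concrete, here is a minimal standard-library sketch of the idea. It is not how the function is implemented: the helper name flatten_table is hypothetical, the columns are assumed to be hours, and it is assumed that with zero_idx=False a column labelled n maps to an offset of n - 1 hours from the start of the day.

```python
from datetime import datetime, timedelta

def flatten_table(header, rows, row_fmt="%d/%m/%Y", zero_idx=False):
    """Flatten (date, v1, v2, ...) rows into (timestamp, value) pairs."""
    # Assumption: labels counted from 1 are shifted so the first hour is 00:00
    shift = 0 if zero_idx else 1
    offsets = [int(label) - shift for label in header]
    series = []
    for date_str, *values in rows:
        day = datetime.strptime(date_str, row_fmt)
        for offset, value in zip(offsets, values):
            series.append((day + timedelta(hours=offset), value))
    return series

header = ["1", "2", "3"]
rows = [("1/1/2016", 0, 10, 2.3), ("4/1/2016", 3, 11, 4.3)]
flat = flatten_table(header, rows)
```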

The other common structure is to split the datetime values into multiple columns in the table:

date	time	col1	col2
1/1/2016	10:00	42.0	foo
4/1/2016	11:00	3.14	bar

When source_t is set to “multicol”, as the table is read, the indicated columns are combined to construct the datetime values, which are then set as the index.
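
The combination step can be sketched with the standard library alone. The helper name combine_date_cols and the datetime format are assumptions made for illustration; the function itself delegates the parsing to its reader backend.

```python
from datetime import datetime

def combine_date_cols(rows, date_cols, fmt="%d/%m/%Y %H:%M"):
    """Join the date columns of each row, parse them, and index by the result."""
    indexed = {}
    for row in rows:
        stamp = datetime.strptime(" ".join(row[c] for c in date_cols), fmt)
        # The remaining columns become the values for that timestamp
        indexed[stamp] = {k: v for k, v in row.items() if k not in date_cols}
    return indexed

rows = [
    {"date": "1/1/2016", "time": "10:00", "col1": 42.0, "col2": "foo"},
    {"date": "4/1/2016", "time": "11:00", "col1": 3.14, "col2": "bar"},
]
ts = combine_date_cols(rows, ["date", "time"])
```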

If source_t is not specified (or set to an empty string), options specific to this function are ignored, and all other keyword options are passed on to the backend transparently; in case of reading a CSV with Pandas, that means all valid keywords for pandas.read_csv are accepted.

Parameters

fpathUnion[str, Path, TextIO]: Path to the dataset file
date_colsList[int, str] (for “multicol” mode): List of columns to be combined to construct the datetime values
col_unitsstr (for “table” mode): Time units for the columns. Accepted values: “month”, “hour”.
zero_idxbool (for “table” mode, default: False): Whether the columns are zero indexed. When the columns represent hours or minutes, it is common to number them as the nth hour, which means they are counted starting at 1 instead of 0. Set this to False if that is the case.
row_fmtstr (for “table” mode, default: empty string): Format of the datetime column (use strftime format strings, see: man 3 strftime). If this is left empty, the reader tries to guess a format using the dateutil module (Pandas default)
source_tstr (default: empty string): Mode of reading the data. Accepted values: “table”, “multicol”, or empty string
**kwargsDict: Other keyword arguments passed on to the reader backend. Any options passed here take precedence, and overwrite other values inferred from the earlier keyword arguments.

Returns

tsSeries/DataFrame: The time series is returned as a series or a dataframe depending on the number of other columns that are present.

Examples

To skip specific rows, maybe because they have bad data, or are empty, you may use the skiprows option. It can be set to a list-like where the entries are row indices (numbers).

>>> read_timeseries("mydata.csv", source_t="table", col_units="hour",
...     skiprows=range(1522, 5480))  

The above example skips rows 1522 through 5479 (the end of a range is exclusive).

Similarly, the data type of the column values can be controlled by using the dtype option. When set to a numpy.dtype, all values will be read as that type, which is probably relevant for the “table” mode. In the “multicol” mode, the types of the values can be controlled at the column level by setting it to a dictionary, where the key matches a column name, and the value is a valid numpy.dtype.

Conversion to IAMC

Interface to convert a Friendly dataset to IAMC format

Configuration can be done using two separate files. A global config file (in YAML format) can set options like mapping an index column to the corresponding IAMC names, and setting default values for mandatory columns. Per-dataset configuration, like identifying index columns, mapping a dataset to its IAMC variable name, defining column aliases, and aggregations, can be done in an index file (in YAML format).

class friendly_data.iamc.IAMconv(idx: pkgindex, indices: Dict, basepath: Union[str, Path])[source]

Converter class for IAMC data

This class resolves index columns against the “semi-hierarchical” variables used in IAMC data, and separates them into individual datasets that are part of the datapackage. It relies on the index file and index column definitions to do the disaggregation. It also supports the reverse operation of aggregating multiple datasets into an IAMC dataset.

TODO:

describe assumptions (e.g. case insensitive match) and fallbacks (e.g. missing title)
limitations (e.g. when no index column exists)

Attributes

basepath: Data package basepath, the directory where the index file is located
indices: Index definitions
res_idx: Package index

Methods

`agg_idxcol`(df, col, entry)	Aggregate values and generate IAMC dataframes
`agg_vals_all`(entry)	Find all values in index column that are present in an aggregate rule
`frames`(entry, df)	Convert the dataframe to IAMC format according to configuration in the entry
`from_file`(confpath, idxpath)	Create a mapping of IAMC indicator variables with index columns
`iamcify`(df)	Transform dataframe to match the IAMC (long) format
`index_levels`(idxcols)	Index levels for user defined index columns
`read_indices`(path, basepath, **kwargs)	Read index column definitions provided in config
`resolve_idxcol_defaults`(df)	Find missing IAMC indices and set them to the default value from config
`to_csv`(files, output[, wide])	Write converted IAMC data frame to a CSV file
`to_df`(files_or_dfs)	Convert CSV files/dataframes to IAMC format according to the index

agg_idxcol(df: pandas.DataFrame, col: str, entry: Dict) → List[pandas.DataFrame][source]

Aggregate values and generate IAMC dataframes

Parameters

dfpd.DataFrame: Dataframe to aggregate from
colstr: Column to perform aggregation on
entryDict: Index entry with aggregation rules

Returns

List[pd.DataFrame]: List of IAMC dataframes

agg_vals_all(entry: Dict) → Tuple[str, List[str]][source]: Find all values in index column that are present in an aggregate rule

property basepath: Data package basepath, directory the index file is located

frames(entry: Dict, df: pandas.DataFrame) → List[pandas.DataFrame][source]

Convert the dataframe to IAMC format according to configuration in the entry

Parameters

entryDict: Index entry
dfpandas.DataFrame: The dataframe that is to be converted to IAMC format

Returns

List[pandas.DataFrame]: List of pandas.DataFrame objects in IAMC format

classmethod from_file(confpath: Union[str, Path], idxpath: Union[str, Path]) → IAMconv[source]

Create a mapping of IAMC indicator variables with index columns

Parameters

confpathUnion[str, Path]: Path to the config file for the IAMC <-> data package conversion
idxpathUnion[str, Path]: Path to index file
**kwargs: Keyword arguments passed on to the pandas reader backend.

Returns

IAMconv

iamcify(df: pandas.DataFrame) → pandas.DataFrame[source]: Transform dataframe to match the IAMC (long) format

index_levels(idxcols: Iterable) → Dict[str, pandas.Series][source]

Index levels for user defined index columns

Parameters

idxcolsIterable[str]: Iterable of index column names

Returns

Dict[str, pd.Series]: Different values for a given set of index columns

property indices: Dict

Index definitions

Default value of mandatory index columns in case they are missing
Different levels of user defined index columns; points to a 2-column CSV file, with the “name” and “iamc” columns

classmethod read_indices(path: Union[str, Path], basepath: Union[str, Path], **kwargs) → pandas.Series[source]: Read index column definitions provided in config

property res_idx: pkgindex

Package index

Each entry corresponds to a resource that may be included in IAMC output.

resolve_idxcol_defaults(df: pandas.DataFrame) → pandas.DataFrame[source]

Find missing IAMC indices and set them to the default value from config

The IAMC format requires the following indices: self._IAMC_IDX; if any of them are missing, the corresponding index level is created, and the level values are set to a constant specified in the config.

Parameters

dfpandas.DataFrame

Returns

pandas.DataFrame: Dataframe with default index columns resolved

to_csv(files: Iterable[Union[str, Path]], output: Union[str, Path], wide: bool = False)[source]

Write converted IAMC data frame to a CSV file

Parameters

filesIterable[Union[str, Path]]: List of files to collate and convert to IAMC
outputUnion[str, Path] (default: empty string): Path of the output CSV file; if empty, nothing is written to file.
basepathUnion[str, Path]: Data package base path
widebool (default: False): Write the CSV in wide format (with years as columns)
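
The difference between the long and the wide layout can be shown with a small standard-library sketch. The helper name to_wide is hypothetical, and the index columns listed in IAMC_IDX are the usual IAMC ones, assumed here rather than taken from the converter.

```python
from collections import defaultdict

# Assumption: the standard IAMC index columns
IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]

def to_wide(rows):
    """Pivot long rows (one per year) into one record per index, keyed by year."""
    wide = defaultdict(dict)
    for row in rows:
        key = tuple(row[col] for col in IAMC_IDX)
        wide[key][row["year"]] = row["value"]
    return dict(wide)

common = {"model": "m", "scenario": "s", "region": "DE",
          "variable": "Capacity|Wind", "unit": "GW"}
long_rows = [
    {**common, "year": 2020, "value": 1.0},
    {**common, "year": 2030, "value": 2.5},
]
wide = to_wide(long_rows)
```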

to_df(files_or_dfs: Union[Iterable[Union[str, Path]], Dict[str, pandas.DataFrame]]) → pandas.DataFrame[source]

Convert CSV files/dataframes to IAMC format according to the index

Parameters

files_or_dfsUnion[Iterable[Union[str, Path]], Dict[str, pandas.DataFrame]]

List of files or a dictionary of dataframes, to be collated and converted to IAMC format. Each item must have an entry in the package index the converter was initialised with; otherwise it is skipped. Files are matched by file path, whereas dataframes match when the dictionary key matches the index entry name.

Note that when the files are read, the basepath is set to whatever the converter was initialised with. If IAMconv.from_file() was used, it is the parent directory of the index file.

Returns

DataFrame: A pandas.DataFrame in IAMC format

Internal interfaces

Internal functions and classes; useful if you are developing new features for friendly_data.

File I/O

Functions useful for I/O and file manipulation

class friendly_data.io.HttpCache(url_t: str)[source]

An HTTP cache

It accepts a URL template with parameters, e.g. https://www.example.com/path/{}.json; the parameters can be provided later at fetch time. No checks are made on whether the number of parameters passed is compatible with the URL template.

After fetching a resource, it is cached in a file under $TMPDIR/friendly_data_cache/. The file name is of the form http-<checksum-of-url-template>-<checksum-of-url>. The cache is updated every 24 hours. A user may also force a cache cleanup by calling remove().
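
The naming scheme and the 24-hour expiry can be sketched as follows. This is an illustration under stated assumptions, not the class's actual code: the helper names cache_path and is_fresh are hypothetical, and SHA-1 is an arbitrary choice since the checksum algorithm is not documented here.

```python
import hashlib
import tempfile
import time
from pathlib import Path

MAX_AGE = 24 * 60 * 60  # seconds; the documented refresh interval

def cache_path(url_t: str, *args: str) -> Path:
    """Build a file name of the form http-<template checksum>-<url checksum>."""
    url = url_t.format(*args)
    # Assumption: SHA-1 hex digests; the real checksum may differ
    t_sum = hashlib.sha1(url_t.encode()).hexdigest()
    u_sum = hashlib.sha1(url.encode()).hexdigest()
    cachedir = Path(tempfile.gettempdir()) / "friendly_data_cache"
    return cachedir / f"http-{t_sum}-{u_sum}"

def is_fresh(path: Path, max_age: float = MAX_AGE) -> bool:
    """A cache file is usable if it exists and is younger than max_age."""
    return path.exists() and (time.time() - path.stat().st_mtime) < max_age

template = "https://www.example.com/path/{}.json"
foo = cache_path(template, "foo")
bar = cache_path(template, "bar")
```

Because the first checksum depends only on the template, all files belonging to one cache share a common prefix, which is what makes removing a whole cache straightforward.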

Parameters

url_tstr: URL template, e.g. https://www.example.com/path/{}.json

Attributes

cachedirpathlib.Path: Path object pointing to the cache directory

Methods

`cachefile`(arg, *args)	Return the cache file, and the corresponding URL
`fetch`(url)	Fetch the URL
`get`(arg, *args)	Get the URL contents
`remove`(*args)	Remove cache files

cachefile(arg: str, *args: str) → Tuple[Path, str][source]

Return the cache file, and the corresponding URL

Parameters

argstr: parameters for the URL template (one mandatory)
*argsstr, optional: more parameters (optional)

Returns

Tuple[pathlib.Path, str]: Tuple of Path object pointing to the cache file and the URL string

fetch(url: str) → bytes[source]

Fetch the URL

Parameters

urlstr: URL to fetch

Returns

bytes: bytes array of the contents that were fetched

Raises

ValueError: If the URL is incorrect

get(arg: str, *args: str) → bytes[source]

Get the URL contents

If a valid cache exists, return the contents from there, otherwise fetch again.

Parameters

argstr: parameters for the URL template (one mandatory)
*argsstr, optional: more parameters (optional)

Returns

bytes: bytes array of the contents

Raises

ValueError: If the URL is incorrect
requests.ConnectionError: If there is no network connection

remove(*args: str)[source]

Remove cache files

Without arguments, remove all files associated with this cache.
With arguments, remove only the files associated with the URL formed from the args.

Parameters

*argsstr, optional: parameters for the URL template

Raises

FileNotFoundError: If an argument is provided to remove a specific cache file, but the cache file does not exist.

friendly_data.io.copy_files(src: Iterable[Union[str, Path]], dest: Union[str, Path], anchor: Union[str, Path] = '') → List[Path][source]

Copy files to a directory

Without an anchor, the source files are copied to the root of the destination directory; with an anchor, the relative paths between the source files are maintained; any required subdirectories are created.

Parameters

srcIterable[Union[str, Path]]: List of files to be copied
destUnion[str, Path]: Destination directory
anchorUnion[str, Path] (default: empty string): Top-level directory for anchoring, provide if you want the relative paths between the source files to be maintained with respect to this directory.

Returns

List[Path]: List of files that were copied
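
The effect of the anchor can be sketched with the standard library. The helper name copy_with_anchor is hypothetical and the code below only illustrates the documented behaviour; it is not the function's implementation.

```python
import shutil
import tempfile
from pathlib import Path

def copy_with_anchor(src, dest, anchor=""):
    """Copy files into dest; keep paths relative to anchor when it is given."""
    dest = Path(dest)
    copied = []
    for fpath in map(Path, src):
        # Without an anchor only the file name survives; with one, the
        # path relative to the anchor is recreated under dest
        rel = fpath.relative_to(anchor) if anchor else Path(fpath.name)
        target = dest / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(fpath, target)
        copied.append(target)
    return copied

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "src" / "sub").mkdir(parents=True)
    (root / "src" / "sub" / "a.csv").write_text("x\n1\n")
    files = [root / "src" / "sub" / "a.csv"]
    flat = copy_with_anchor(files, root / "flat")
    nested = copy_with_anchor(files, root / "nested", anchor=root / "src")
    flat_rel = [p.relative_to(root).as_posix() for p in flat]
    nested_rel = [p.relative_to(root).as_posix() for p in nested]
```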

friendly_data.io.dwim_file(fpath: Union[str, Path]) → Union[Dict, List][source]

friendly_data.io.dwim_file(fpath: Union[str, Path], data: Any) → None

Do What I Mean with file

Depending on the function arguments, either read the contents of a file, or write data to the file. The file type is guessed from the extension; supported formats: JSON and YAML.

Parameters

fpathUnion[str, Path]: File path to read or write to
dataUnion[None, Any]: Data, when writing to a file.

Returns

Union[None, Union[Dict, List]]

If writing to a file, nothing (None) is returned
If reading from a file, depending on the contents, either a list or a dictionary is returned
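
The read-or-write dispatch can be sketched as below. The helper name dwim_json is hypothetical, and only the JSON branch is shown to keep the sketch free of third-party imports; a YAML branch would follow the same pattern.

```python
import json
import tempfile
from pathlib import Path

_MISSING = object()  # sentinel, so that None can still be written as data

def dwim_json(fpath, data=_MISSING):
    """Read fpath when called with one argument, write data otherwise."""
    fpath = Path(fpath)
    if fpath.suffix != ".json":
        # A YAML branch would go here; omitted in this sketch
        raise ValueError(f"unsupported file type: {fpath.suffix}")
    if data is _MISSING:
        return json.loads(fpath.read_text())
    fpath.write_text(json.dumps(data))
    return None

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "meta.json"
    written = dwim_json(target, {"name": "demo"})
    loaded = dwim_json(target)
```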

friendly_data.io.get_cachedir() → Path[source]: Create the directory $TMPDIR/friendly_data_cache and return the Path

friendly_data.io.outoftree_paths(basepath: Union[str, Path], fpaths: Iterable[Union[str, Path]]) → Tuple[List[Path], List[Path]][source]

Separate a list of paths into in tree and out of tree.

Parameters

basepathUnion[str, Path]: Path to use as the reference when identifying in/out of tree paths.
fpathsIterable[Union[str, Path]]: List of paths.

Returns

Tuple[List[Path], List[Path]]: A pair of lists: the in tree paths, and the out of tree paths
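
One way to make such a split is to test whether each path can be expressed relative to the base path. The sketch below uses a hypothetical helper name, split_by_tree, and assumes that paths are resolved before comparison; how the library treats relative or symlinked paths may differ.

```python
from pathlib import Path

def split_by_tree(basepath, fpaths):
    """Partition fpaths into those under basepath and those outside it."""
    base = Path(basepath).resolve()
    intree, outoftree = [], []
    for fpath in map(Path, fpaths):
        try:
            # relative_to raises ValueError when fpath is not under base
            fpath.resolve().relative_to(base)
        except ValueError:
            outoftree.append(fpath)
        else:
            intree.append(fpath)
    return intree, outoftree

inside, outside = split_by_tree(
    "/data/pkg", ["/data/pkg/a.csv", "/data/other/b.csv"]
)
```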

friendly_data.io.path_in(fpaths: Iterable[Union[str, Path]], testfile: Union[str, Path]) → bool[source]

Function to test if a path is in a list of paths.

The test checks if they are the same physical files or not, so the testfile needs to exist on disk.

Parameters

fpathsIterable[Union[str, Path]]: List of paths to check
testfileUnion[str, Path]: Test file (must exist on disk)

Returns

bool
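
A check for "same physical file" can be sketched with os.path.samefile, which compares the underlying files rather than the path strings. The helper name same_file_in is hypothetical, and skipping non-existent list entries is an assumption of this sketch.

```python
import os
import tempfile
from pathlib import Path

def same_file_in(fpaths, testfile):
    """True if testfile is the same physical file as any existing path."""
    return any(
        os.path.exists(f) and os.path.samefile(f, testfile) for f in fpaths
    )

with tempfile.TemporaryDirectory() as tmp:
    real = Path(tmp) / "data.csv"
    real.write_text("x\n")
    # A differently spelled path that points to the same file
    alias = os.path.join(tmp, ".", "data.csv")
    found = same_file_in([alias], real)
    missing = same_file_in([os.path.join(tmp, "other.csv")], real)
```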

friendly_data.io.path_not_in(fpaths: Iterable[Union[str, Path]], testfile: Union[str, Path]) → bool[source]

Function to test if a path is absent from a list of paths.

Opposite of path_in().

Parameters

fpathsIterable[Union[str, Path]]: List of paths to check
testfileUnion[str, Path]: Test file (must exist on disk)

Returns

bool

friendly_data.io.posixpathstr(fpath: Union[str, Path]) → str[source]

Given a path object, return a POSIX compatible path string

Parameters

fpathUnion[str, Path]: Path object

Returns

str
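
The conversion matters mostly for paths written with Windows separators. A minimal sketch using pathlib follows; to_posix_str is a hypothetical stand-in, not the function itself.

```python
from pathlib import Path, PureWindowsPath

def to_posix_str(fpath):
    """Return the path as a string with forward slashes."""
    return Path(fpath).as_posix()

# PureWindowsPath lets the Windows case be shown on any platform
win_style = PureWindowsPath(r"data\2016\load.csv").as_posix()
posix_style = to_posix_str("data/2016/load.csv")
```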

friendly_data.io.relpaths(basepath: Union[str, Path], pattern: Union[str, Iterable[Union[str, Path]]]) → List[str][source]

Convert a list of paths to relative paths

Parameters

basepathUnion[str, Path]: Path to use as the reference when calculating relative paths
patternUnion[str, Iterable[Union[str, Path]]]: Either a pattern relative to basepath to generate a list of paths, or a list of paths to convert.

Returns

List[str]: List of relative paths (as str-s)
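
Both input forms can be sketched as follows. The helper name relative_paths is hypothetical; treating a string as a glob pattern, sorting the result, and returning POSIX-style strings are assumptions of this sketch.

```python
import tempfile
from pathlib import Path

def relative_paths(basepath, pattern):
    """Expand a glob pattern, or take a list of paths, relative to basepath."""
    base = Path(basepath)
    if isinstance(pattern, str):
        paths = base.glob(pattern)  # assumption: the string is a glob pattern
    else:
        paths = map(Path, pattern)
    return sorted(p.relative_to(base).as_posix() for p in paths)

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data").mkdir()
    for name in ("a.csv", "b.csv"):
        (Path(tmp) / "data" / name).write_text("x\n")
    from_glob = relative_paths(tmp, "data/*.csv")
    from_list = relative_paths(tmp, [Path(tmp) / "data" / "b.csv"])
```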

Helper utilities

Collection of helper functions

friendly_data.helpers.filter_dict(data: Dict, allowed: Iterable) → Dict[source]: Filter a dictionary based on a set of allowed keys

friendly_data.helpers.flatten_list(lst: Iterable) → Iterable[source]: Flatten an arbitrarily nested list (returns a generator)
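
A recursive generator is one common way to flatten arbitrarily nested lists. The sketch below uses a hypothetical name, flatten, and assumes that strings are treated as leaves rather than iterated character by character.

```python
from collections.abc import Iterable

def flatten(lst):
    """Yield the leaves of an arbitrarily nested iterable, depth first."""
    for item in lst:
        # Strings are iterable too, but are treated as leaves here
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from flatten(item)
        else:
            yield item

flat = list(flatten([1, [2, [3, [4, "five"]]], (6,)]))
```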

friendly_data.helpers.idx_lvl_values(idx: pandas.MultiIndex, name: str) → pandas.Index[source]

Given a pandas.MultiIndex and a level name, find the level values

Parameters

idxpandas.MultiIndex: A multi index
namestr: Level name

Returns

pandas.Index: Index with the level values

friendly_data.helpers.idxslice(lvls: Iterable[str], selection: Dict[str, List]) → Tuple[source]

Create an index slice tuple from a set of level names, and selection mapping

NOTE: The order of lvls should match the order of the levels in the index exactly; typically, mydf.index.names.

Parameters

lvlsIterable[str]: Complete set of levels in the index
selectionDict[str, List]: Selection set; the key is a level name, and the value is a list of values to select

Returns

Tuple: Tuple of values, with slice(None) for skipped levels (matches anything)
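
The construction of such a tuple can be sketched in one line. The name make_idxslice is a hypothetical stand-in used for illustration.

```python
def make_idxslice(lvls, selection):
    """Use the selected values for a level, and slice(None) for the others."""
    return tuple(selection.get(lvl, slice(None)) for lvl in lvls)

sel = make_idxslice(
    ["region", "technology", "year"], {"technology": ["wind", "solar"]}
)
```

With pandas, a tuple like this is typically used as mydf.loc[sel, :] on a dataframe with a sorted MultiIndex.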

friendly_data.helpers.import_from(module: str, name: str)[source]: Import name from module; if name is empty, return module
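
The described behaviour can be sketched with importlib. The name import_name is hypothetical, and error handling for missing modules is left out of this sketch.

```python
from importlib import import_module

def import_name(module: str, name: str):
    """Return module.name, or the module itself when name is empty."""
    mod = import_module(module)
    return getattr(mod, name) if name else mod

sqrt = import_name("math", "sqrt")
math_mod = import_name("math", "")
```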

friendly_data.helpers.is_windows() → bool[source]: Check if we are on Windows

friendly_data.helpers.match(pattern, **kwargs)[source]

Wrap glom.Match with the default action set to glom.SKIP.

This is very useful to match items inside nested data structures. A few example uses:

>>> from glom import glom
>>> cols = [
...     {
...         "name": "abc",
...         "type": "integer",
...         "constraints": {"enum": []}
...     },
...     {
...         "name": "def",
...         "type": "string"
...     },
... ]
>>> glom(cols, [match({"constraints": {"enum": list}, str: str})])
[{'name': 'abc', 'type': 'integer', 'constraints': {'enum': []}}]

For details see: glom.Match

class friendly_data.helpers.noop_map[source]

A noop mapping class

A dictionary subclass that falls back to noop on KeyError and returns the key being looked up.

Methods

`clear`()
`copy`()
`fromkeys`(iterable[, value])	Create a new dictionary with keys from iterable and values set to value.
`get`(key[, default])	Return the value for key if key is in the dictionary, else default.
`items`()
`keys`()
`pop`(key[, default])	If key is not found, default is returned if given, otherwise KeyError is raised
`popitem`(/)	Remove and return a (key, value) pair as a 2-tuple.
`setdefault`(key[, default])	Insert key with a value of default if key is not in the dictionary.
`update`([E, ]**F)	If E is present and has a .keys() method, then does: for k in E: D[k] = E[k] If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v In either case, this is followed by: for k in F: D[k] = F[k]
`values`()
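
The fallback behaviour can be sketched with the __missing__ hook of dict, which is called by item lookup when a key is absent. NoopMap below is an illustrative stand-in, not the class itself.

```python
class NoopMap(dict):
    """A dict whose item lookup returns the key itself when it is absent."""

    def __missing__(self, key):
        return key

aliases = NoopMap({"tech": "technology"})
renamed = [aliases[col] for col in ["tech", "year"]]
```

Note that __missing__ only affects lookups with square brackets; dict.get and the in operator behave as usual.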

friendly_data.helpers.sanitise(string: str) → str[source]: Sanitise string for use as group/directory name

friendly_data.helpers.select(spec, **kwargs)[source]

Wrap glom.Check with the default action set to glom.SKIP.

This is very useful to select items inside nested data structures. A few example uses:

>>> from glom import glom
>>> cols = [
...     {
...         "name": "abc",
...         "type": "integer"
...     },
...     {
...         "name": "def",
...         "type": "string"
...     },
... ]
>>> glom(cols, [select("name", equal_to="abc")])
[{'name': 'abc', 'type': 'integer'}]

For details see: glom.Check