Technical documentation of functions and classes provided by the different friendly_data modules.
Datapackage tools
Functions useful to interact with a data package.
- friendly_data.dpkg.create_pkg(meta: Dict, fpaths: Iterable[_res_t], basepath: Union[str, Path] = '', infer=True)[source]
Create a datapackage from metadata and resources.
If resources point to files that exist, their schemas are inferred and added to the package. If basepath is a non-empty string, it is treated as the parent directory, and all resource file paths are checked relative to it.
- Parameters
- metaDict
A dictionary with package metadata.
- fpathsIterable[Union[str, Path, Dict]]
An iterator over different resources. Resources are paths to files, relative to basepath.
- basepathstr (default: empty string)
Directory where the package files are located
- inferbool (default: True)
Whether to infer resource schema
- Returns
- Package
A datapackage with inferred schema for all the package resources
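For illustration, a minimal sketch of calling create_pkg; the directory and file names are hypothetical, and the resource files are assumed to exist under basepath:

    from friendly_data.dpkg import create_pkg

    meta = {"name": "mypkg", "licenses": "CC0-1.0"}

    # resource paths are relative to basepath
    pkg = create_pkg(meta, ["demand.csv", "capacity.csv"], basepath="pkg_dir")
    for res in pkg.resources:  # assuming the frictionless Package API
        print(res["path"])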
- friendly_data.dpkg.entry_from_res(res: Resource) Dict [source]
Create an index entry from a Resource object
- Parameters
- resResource
A resource object
- Returns
- Dict
A dictionary that is an index entry
- friendly_data.dpkg.fullpath(resource: Resource) Path [source]
Get full path of a resource
- Parameters
- resourceResource
Resource object/dictionary
- Returns
- Path
Full path to the resource
- friendly_data.dpkg.get_aliased_cols(cols: Iterable[str], col_t: str, alias: Dict[str, str]) Dict [source]
Get aliased columns from the registry
- Parameters
- colsIterable[str]
List of columns to retrieve
- col_tLiteral[“cols”, “idxcols”]
A literal string specifying the kind of column; one of: “cols”, or “idxcols”
- aliasDict[str, str]
Dictionary of aliases; key is the column name in the dataset, and the value is the column name in the registry that it is equivalent to.
- Returns
- Dict
Schema for each column; the column name is the key, and the schema is the value. See the docstring of index_levels() for more.
- friendly_data.dpkg.idxpath_from_pkgpath(pkgpath: Union[str, Path]) Union[str, Path] [source]
Return a valid index path given a package path
- Parameters
- pkgpathUnion[str, Path]
Path to package directory
- Returns
- Union[str, Path]
Returns a valid index path; if there are multiple matches, returns the lexicographically first match
If an index file is not found, returns an empty string
- Warns
- Warns if no index file is found
- Warns if multiple index files are found
- friendly_data.dpkg.index_levels(file_or_df: Union[str, Path, _dfseries_t], idxcols: Iterable[str], alias: Dict[str, str] = {}) Tuple[_dfseries_t, Dict] [source]
Read a dataset and determine the index levels
- Parameters
- file_or_dfUnion[str, Path, pd.DataFrame, pd.Series]
A dataframe, or the path to a CSV file
- idxcolsIterable[str]
List of columns in the dataset that constitute the index
- aliasDict[str, str]
Column aliases: {my_alias: col_in_registry}
- Returns
- Tuple[Union[pd.DataFrame, pd.Series], Dict]
Tuple of the dataset, and the schema of each index column as a dictionary. If idxcols was ["foo", "bar"], the dictionary might look like:

    {
        "foo": {
            "name": "foo",
            "type": "datetime",
            "format": "default"
        },
        "bar": {
            "name": "bar",
            "type": "string",
            "constraints": {"enum": ["a", "b"]}
        }
    }

Note that index columns with categorical values are filled in by reading the dataset and determining the full set of values.
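A hedged sketch of how this might be called; the file name and index column names are hypothetical:

    from friendly_data.dpkg import index_levels

    df, idx_schema = index_levels("data.csv", idxcols=["technology", "year"])

    # categorical index columns get an enum constraint built from the data
    print(idx_schema["technology"].get("constraints", {}).get("enum"))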
- friendly_data.dpkg.pkg_from_files(meta: Dict, fpath: Union[str, Path], fpaths: Iterable[Union[str, Path]]) Tuple[Path, Package, Optional[pkgindex]] [source]
Create a package from an index file and other files
- Parameters
- metaDict
A dictionary with package metadata.
- fpathUnion[str, Path]
Path to the package directory or index file. Note the index file has to be at the top level directory of the datapackage. See
pkgindex.from_file()
- fpathsList[Union[str, Path]]
A list of paths to datasets/resources not in the index. If any of the paths point to a dataset already present in the index, the index entry is respected.
- Returns
- Tuple[Path, Package, Union[pkgindex, None]]
A datapackage with inferred schema for the resources/datasets present in the index; all other resources are added with a basic inferred schema.
- friendly_data.dpkg.pkg_from_index(meta: Dict, fpath: Union[str, Path]) Tuple[Path, Package, pkgindex] [source]
Read an index file, and create a datapackage with the provided metadata.
The index can be in either YAML or JSON format. It is a list of dataset files, their names, and the columns in each dataset that are to be treated as index columns (see the example below).
- Parameters
- metaDict
Package metadata dictionary
- fpathUnion[str, Path]
Path to the index file. Note the index file has to be at the top level directory of the datapackage.
- Returns
- Tuple[Path, Package, pkgindex]
The package directory, the Package object, and the index.
Examples
YAML (JSON is also supported):
    - path: file1
      name: dst1
      idxcols: [cola, colb]
    - path: file2
      name: dst2
      idxcols: [colx, coly, colz]
    - path: file3
      name: dst3
      idxcols: [col]
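For illustration, a sketch of creating a package from an index file like the one above; the paths and metadata are hypothetical:

    from friendly_data.dpkg import pkg_from_index

    meta = {"name": "mypkg", "licenses": "CC0-1.0"}
    pkg_dir, pkg, idx = pkg_from_index(meta, "pkg_dir/index.yaml")
    print(pkg_dir, idx.get("path"))  # paths of all indexed datasets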
- class friendly_data.dpkg.pkgindex(iterable=(), /)[source]
Data package index (a subclass of list).
It is a list of dictionaries, where each dictionary is the record for a file. A record may have the following keys:
“path”: path to the file,
“idxcols”: list of column names that are to be included in the dataset index (optional),
“name”: dataset name (optional),
“skip”: lines to skip when reading the dataset (optional, CSV only),
“alias”: a mapping of column name aliases (optional),
“sheet”: sheet name or position (0-indexed) to use as dataset (optional, Excel only)
While iterating over an index, always use records() to ensure all necessary keys are present.
Methods
append(object, /): Append object to the end of the list.
clear(/): Remove all items from list.
copy(/): Return a shallow copy of the list.
count(value, /): Return number of occurrences of value.
extend(iterable, /): Extend list by appending elements from the iterable.
from_file(fpath): Read the index of files included in the data package
get(key): Get the value of key from all records as a list.
index(value[, start, stop]): Return first index of value.
insert(index, object, /): Insert object before index.
pop([index]): Remove and return item at index (default last).
records(keys): Return an iterable of index records.
remove(value, /): Remove first occurrence of value.
reverse(/): Reverse IN PLACE.
sort(*[, key, reverse]): Sort the list in ascending order and return None.
- classmethod from_file(fpath: Union[str, Path]) pkgindex [source]
Read the index of files included in the data package
- Parameters
- fpathUnion[str, Path]
Index file path or a stream object
- Returns
- List[Dict]
- Raises
- ValueError
If the file type is correct (YAML/JSON), but does not return a list
- RuntimeError
If the file has an unknown extension (raised by friendly_data.io.dwim_file())
- MatchError
If the file contains any unknown keys
- get(key: str) List [source]
Get the value of key from all records as a list.
If key is absent, the corresponding value is set to None.
- Parameters
- keystr
Key to retrieve
- Returns
- List
List of values corresponding to key, one per record.
- records(keys: List[str]) Iterable[Dict] [source]
Return an iterable of index records.
Each record is guaranteed to have all the requested keys. If a value wasn't specified in the index file, it is set to None.
- Parameters
- keysList[str]
List of keys that are requested in each record.
- Returns
- Iterable[Dict]
- Raises
- glom.MatchError
If keys has an unsupported value
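A short sketch of iterating over an index with records(); the index file path is hypothetical:

    from friendly_data.dpkg import pkgindex

    idx = pkgindex.from_file("pkg_dir/index.yaml")

    # every record has the requested keys; unspecified values are None
    for record in idx.records(["path", "idxcols", "alias"]):
        print(record["path"], record["idxcols"])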
- friendly_data.dpkg.read_pkg(pkg_path: Union[str, Path], extract_dir: Optional[Union[str, Path]] = None)[source]
Read a datapackage
If pkg_path points to a datapackage.json file, read it as is. If it points to a zip archive, the archive is first extracted before opening it; if extract_dir is not provided, the directory containing the zip archive is used. If pkg_path is a directory, look for a datapackage.json inside.
- Parameters
- pkg_pathUnion[str, Path]
Path to the datapackage.json file, or a zip archive
- extract_dirUnion[str, Path]
Path to which the zip archive is extracted
- Returns
- Package
- Raises
- ValueError
When an unsupported format (not a directory, JSON, or ZIP) is provided
- FileNotFoundError
When a datapackage.json file cannot be found
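For illustration, the three supported ways of pointing read_pkg at a package; all paths are hypothetical:

    from friendly_data.dpkg import read_pkg

    pkg = read_pkg("pkg_dir")  # directory containing datapackage.json

    # a descriptor file or a zip archive also work:
    # pkg = read_pkg("pkg_dir/datapackage.json")
    # pkg = read_pkg("pkg.zip", extract_dir="unpacked_pkg")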
- friendly_data.dpkg.res_from_entry(entry: Dict, pkg_dir: Union[str, Path]) Resource [source]
Create a resource from an index entry.
Entry must have the keys: path, idxcols, alias; so use pkgindex.records() to iterate over the index.
- Parameters
- entryDict
Entry from an index file:
{ "path": "data.csv" "idxcols": ["col1", "col2"] "alias": { "col1": "col0" } }
- pkg_dirUnion[str, Path]
Root directory of the package
- Returns
- Resource
The resource object (subclass of dict)
- friendly_data.dpkg.resource_(spec: Dict, basepath: Union[str, Path] = '', infer=True) Resource [source]
Create a Resource object based on the dictionary
- Parameters
- specDict
Dictionary with the structure:
{"path": "relpath/resource.csv", "skip": <nrows>, "sheet": <num>}
both “skip” & “sheet” are optional; “sheet” can be used to select a specific sheet as the dataset; sheet numbering starts at 1.
- basepathUnion[str, Path]
Base path for resource object
- inferbool (default: True)
Whether to infer resource schema
- Returns
- Resource
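A hedged sketch of building resources from spec dictionaries; the paths are hypothetical:

    from friendly_data.dpkg import resource_

    # "skip" and "sheet" are optional
    res_csv = resource_({"path": "data/demand.csv", "skip": 1}, basepath="pkg_dir")
    res_xls = resource_({"path": "data/demand.xlsx", "sheet": 2}, basepath="pkg_dir")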
- friendly_data.dpkg.set_idxcols(fpath: Union[str, Path], basepath: Union[str, Path] = '') Resource [source]
Create a resource object for a file, with index columns set
- Parameters
- fpathUnion[str, Path]
Path to a dataset (resource), e.g. a CSV file, relative to basepath
- basepathUnion[str, Path], default: empty string (current directory)
Path to directory to consider as the data package basepath
- Returns
- Resource
Resource object with the index columns set according to the registry
- friendly_data.dpkg.write_pkg(pkg: Union[Dict, Package], pkgdir: Union[str, Path], *, idx: Optional[Union[pkgindex, List]] = None) List[Path] [source]
Write a data package to path
- Parameters
- pkg: Package
Package object
- pkgdir: Union[str, Path]
Path to write to
- idxUnion[pkgindex, List] (optional)
Package index written to
pkgdir/index.json
- Returns
- List[Path]
List of files written to disk
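A minimal sketch of building a package and writing it to disk; the paths are hypothetical:

    from friendly_data.dpkg import pkg_from_files, write_pkg

    meta = {"name": "mypkg", "licenses": "CC0-1.0"}
    pkg_dir, pkg, idx = pkg_from_files(meta, "pkg_dir", ["extra.csv"])
    written = write_pkg(pkg, pkg_dir, idx=idx)  # writes datapackage.json (and index.json when idx is given)
    print(written)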
Metadata tools
Functions useful to access and manipulate package metadata.
- friendly_data.metatools.check_license(lic: Dict[str, str]) Dict[str, str] [source]
Return the license spec from the metadata
Issue a warning if the license is old. TODO: add other recommendations
- Parameters
- licDict[str, str], alias _license_t
License metadata dictionary (as returned by the Open Definition License Service) Example: CC-BY-SA:
{ "domain_content": true, "domain_data": true, "domain_software": false, "family": "", "id": "CC-BY-SA-4.0", "maintainer": "Creative Commons", "od_conformance": "approved", "osd_conformance": "not reviewed", "status": "active", "title": "Creative Commons Attribution Share-Alike 4.0", "url": "https://creativecommons.org/licenses/by-sa/4.0/" }
- friendly_data.metatools.get_license(lic: str, group: str = 'all') Dict[str, str] [source]
Return the license metadata
Retrieve the license metadata of the requested group from the Open Definition License Service and cache it in a temporary file. From the retrieved list, find the requested license and return it.
- Parameters
- licstr or None
Requested license; if None, interactively ask for the license name
- group{‘all’, ‘osi’, ‘od’, ‘ckan’}
License group where to find the license
- Returns
- Dict[str, str], alias _license_t
A dictionary with the license metadata
- Raises
- ValueError
If the license group is incorrect
- KeyError
If the license cannot be found in the provided group
- friendly_data.metatools.lic_metadata(keys: Iterable[str], pred: Callable[[Dict], bool] = <lambda>) List[Dict[str, str]] [source]
Return a list of license metadata with the requested set of keys
- Parameters
- keysIterable[str]
List of keys to include in the metadata
- predCallable[[Dict], bool]
A predicate to select a subset of licenses. It should accept a dictionary with license metadata, and return a boolean indicating whether to accept or not.
- Returns
- List[Dict]
List of license metadata
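For illustration, a hedged sketch combining get_license and lic_metadata; the license id and predicate are examples, not an exhaustive list:

    from friendly_data.metatools import get_license, lic_metadata

    # look up one license by its identifier (as known to the license service)
    cc_by = get_license("CC-BY-4.0", group="all")

    # id and title of all licenses maintained by Creative Commons
    cc_all = lic_metadata(
        ["id", "title"],
        pred=lambda lic: lic.get("maintainer") == "Creative Commons",
    )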
Registry API
Configurable Friendly data schema registry
Module to wrap around the default friendly_data_registry to add configurability. A custom registry configuration can be specified by using the config_ctx() context manager. The RegistrySchema class validates the registry config before customising the default registry.
- class friendly_data.registry.RegistrySchema(registry_config: Dict[str, List[Dict]])[source]
Instantiate with the “registry” section of the config file to validate
The registry section looks like this:
    registry:
      idxcols:
        - name: enduse
          type: string
          constraints:
            enum:
              - ...
      cols:
        - name: cost
          type: number
          constraints:
            minimum: 0
Methods
clear()
copy()
fromkeys(iterable[, value]): Create a new dictionary with keys from iterable and values set to value.
get(key[, default]): Return the value for key if key is in the dictionary, else default.
items()
keys()
pop(key[, default]): If key is not found, default is returned if given, otherwise KeyError is raised.
popitem(/): Remove and return a (key, value) pair as a 2-tuple.
setdefault(key[, default]): Insert key with a value of default if key is not in the dictionary.
update([E, ]**F): If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
values()
- friendly_data.registry.config_ctx(*, confdict: Dict[str, List[Dict]] = {}, conffile: Union[str, Path] = '', idxcols: List[Dict] = [], cols: List[Dict] = [])[source]
Context manager to temporarily override the default registry
Note that the parameters are keyword-only, and only one of them may be given at a time. They are checked in the order shown here; once one is found, the remaining parameters are ignored.
The registry config is also validated. If validation fails, an error message is logged, and the default registry remains unaltered.
- Parameters
- confdictDict[str, List[Dict]]
Registry config in dictionary form
- conffileUnion[str, Path]
Path to a config file with a custom registry section
- idxcolsList[Dict]
List of custom index columns
- colsList[Dict]
List of custom value columns
- Returns
- Generator[Dict[str, List[Dict]]]
The custom registry config
Examples
    from friendly_data.registry import config_ctx, get, getall

    with config_ctx(conffile="config.yaml") as _:
        print(get("mycol", "cols"))
        print(getall())
- friendly_data.registry.get(col: str, col_t: str) Dict [source]
Wraps around friendly_data_registry.get().
If a custom registry config has been specified, columns from the config are also considered. A custom registry config can be set using the config_ctx() context manager.
- friendly_data.registry.getall(with_file=False) Dict[str, List[Dict]] [source]
Wraps around friendly_data_registry.getall().
If a custom registry config has been specified, columns from the config are also considered. A custom registry config can be set using the config_ctx() context manager.
The Friendly data schema registry
This module provides getter methods to retrieve individual columns, get(), or the whole registry, getall(). The other functions and classes are used internally by the module.
- friendly_data_registry.get(col: str, col_t: str) Dict [source]
Retrieve the column schema from column schema registry: friendly_data_registry
- Parameters
- colstr
Column name to look for
- col_tLiteral[“cols”, “idxcols”]
A literal string specifying the kind of column; one of: “cols”, or “idxcols”
- Returns
- Dict
Column schema; an empty dictionary is returned in case there are no matches
- Raises
- RuntimeError
When more than one match is found
- ValueError
When the schema file in the registry is unsupported; not one of: JSON, or YAML
- friendly_data_registry.getall(with_file: bool = False) Dict[str, List[Dict]] [source]
Get all columns from registry, primarily to generate documentation
- Returns
- Dict[str, Dict]
The returned value is separated by column type:
{ "idxcols": [ {..} # column schemas ], "cols": [ {..} # column schemas ], }
- Raises
- RuntimeError
When more than one match is found
- friendly_data_registry.read_file(fpath: Union[str, Path]) Union[Dict, List] [source]
Read a JSON or YAML file; the file type is guessed from the extension
- class friendly_data_registry.schschemaema(schema: dict)[source]
Registry column schema. Instantiate to validate.
- Raises
- TypeMatchError
When the column schema has a type mismatch
- MatchError
Other mismatches, like an incorrectly named key
Methods
clear()
copy()
fromkeys(iterable[, value]): Create a new dictionary with keys from iterable and values set to value.
get(key[, default]): Return the value for key if key is in the dictionary, else default.
items()
keys()
pop(key[, default]): If key is not found, default is returned if given, otherwise KeyError is raised.
popitem(/): Remove and return a (key, value) pair as a 2-tuple.
setdefault(key[, default]): Insert key with a value of default if key is not in the dictionary.
update([E, ]**F): If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
values()
Command Line Interface
Functions that are run from the CLI to create, or edit a data package.
- friendly_data.cli.create(idxpath: str, *fpaths: str, name: str = '', title: str = '', licenses: str = '', description: str = '', keywords: str = '', inplace: bool = False, export: str = '', config: str = '')[source]
Create a package from an index file and other files
Package metadata provided with command line flags override metadata from the config file.
- Parameters
- idxpathstr
Path to the index file or package directory with the index file. Note the index file has to be at the top level directory of the datapackage.
- fpathsTuple[str]
List of datasets/resources not in the index. If any of them point to a dataset already present in the index, it is ignored.
- namestr
Package name (no spaces or special characters)
- titlestr
Package title
- licensesstr
License
- descriptionstr
Package description
- keywordsstr
A space separated list of keywords: ‘renewable energy model’ -> [‘renewable’, ‘energy’, ‘model’]
- inplacebool
Whether to create the data package by only adding metadata to the current directory. NOTE: one of inplace/export must be chosen
- exportstr
Create the data package in the provided directory instead of the current directory
- configstr
Config file in YAML format with metadata and custom registry. The metadata should be under a “metadata” section, and the custom registry under a “registry” section.
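For illustration, the same operation driven from Python rather than the shell; all paths and metadata values are hypothetical:

    from friendly_data.cli import create

    create(
        "pkg_dir/index.yaml",      # index file (or package directory)
        "extra/notes.csv",         # additional resource not in the index
        name="mypkg",
        title="My data package",
        licenses="CC0-1.0",
        keywords="renewable energy model",
        export="out_pkg",          # one of inplace/export must be chosen
    )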
- friendly_data.cli.describe(pkgpath: str, config: str = '', report_dir: str = '')[source]
Give a summary of the data package
- Parameters
- pkgpathstr
Path to the data package
- report_dirstr (default: empty string)
If not empty, generate an HTML report and write it to the directory report. The directory will have an index.html file, and one HTML file for each dataset.
- configstr
Config file in YAML format with custom registry. It should be defined under a “registry” section.
- friendly_data.cli.describe_registry(column_type: str = '')[source]
Describe columns defined in the registry
- Parameters
- column_typestr (default: empty string → all)
Column type to list; one of: “cols”, or “idxcols”. If nothing is provided (default), columns of both types are listed.
- friendly_data.cli.generate_index_file(idxpath: str, *fpaths: str, config: str = '')[source]
Generate an index file from a set of dataset files
- Parameters
- idxpathstr
Path where the index file (YAML format) should be written
- fpathsTuple[str]
List of datasets/resources to include in the index
- configstr
Config file in YAML format with custom registry. It should be defined under a “registry” section.
- friendly_data.cli.license_info(lic: str) Dict [source]
Give detailed metadata about a license
- Parameters
- licstr
License ID as listed in the output of
friendly_data list-licenses
- Returns
- Dict
License metadata
- friendly_data.cli.license_prompt() Dict[str, str] [source]
Prompt for a license on the terminal (with completion).
- friendly_data.cli.list_licenses() str [source]
List commonly used licenses
NOTE: for Python API users, not to be confused with metatools.list_licenses().
- Returns
- str
ASCII table with commonly used licenses
- friendly_data.cli.remove(pkgpath: str, *fpaths: str, rm_from_disk: bool = False) str [source]
Remove datasets from the package
- Parameters
- pkgpathstr
Path to the package directory
- fpathsTuple[str]
List of datasets/resources to be removed from the package. The index is updated accordingly.
- rm_from_diskbool (default: False)
Permanently delete the files from disk
- friendly_data.cli.reports(pkg: Package, report_dir: str)[source]
Write HTML reports summarising all resources in the package
- Parameters
- pkgPackage
- report_dirstr
Directory where reports are written
- Returns
- int
Bytes written (index.html)
- friendly_data.cli.to_iamc(config: str, idxpath: str, iamcpath: str, *, wide: bool = False)[source]
Aggregate datasets into an IAMC dataset
- Parameters
- configstr
Config file
- idxpathstr
Index file
- iamcpathstr
IAMC dataset
- widebool (default: False)
Enable wide IAMC format
- friendly_data.cli.update(pkgpath: str, *fpaths: str, name: str = '', title: str = '', licenses: str = '', description: str = '', keywords: str = '', config: str = '')[source]
Update metadata and datasets in a package.
- Parameters
- pkgpathstr
Path to the package.
- fpathsTuple[str]
List of datasets/resources; they could be new datasets or datasets with updated index entries.
- namestr
Package name (no spaces or special characters)
- titlestr
Package title
- descriptionstr
Package description
- keywordsstr
A space separated list of keywords: ‘renewable energy model’ -> [‘renewable’, ‘energy’, ‘model’]
- licensesstr
License
- configstr
Config file in YAML format with metadata and custom registry. The metadata should be under a “metadata” section, and the custom registry under a “registry” section.
Validation functions
Functions useful to validate a data package or parts of its schema.
- friendly_data.validate.check_pkg(pkg) List[Dict] [source]
Validate all resources in a datapackage for common errors.
- Typical errors that are checked: blank-header, extra-label, missing-label, blank-label, duplicate-label, incorrect-label, blank-row, primary-key-error, foreign-key-error, extra-cell, missing-cell, type-error, constraint-error, unique-error
- Parameters
- pkgfrictionless.Package
The datapackage descriptor dictionary
- Returns
- List[Dict]
A list of dictionaries summarising the validation errors that were found.
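For illustration, a hedged sketch of validating a package and summarising the findings with summarise_errors() (documented below); the package path is hypothetical:

    from friendly_data.dpkg import read_pkg
    from friendly_data.validate import check_pkg, summarise_errors

    pkg = read_pkg("pkg_dir")
    report = check_pkg(pkg)
    if report:
        print(summarise_errors(report))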
- friendly_data.validate.check_schema(ref: Dict[str, str], dst: Dict[str, str], *, remap: Optional[Dict[str, str]] = None) Tuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]] [source]
Compare a schema with a reference.
The reference schema is a minimal set: any additional fields in the compared schema are accepted, but omissions are not.
Name comparisons are case-sensitive.
TODO: maybe also compare constraints?
- Parameters
- refDict[str, str]
Reference schema dictionary
- dstDict[str, str]
Schema dictionary from the dataset being validated
- remapDict[str, str] (optional)
Column/field names that are to be remapped before checking.
- Returns
- resultTuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]]
Result tuple:
- Boolean flag indicating whether the schema passed the checks
- If checks failed, the set of columns missing from the minimal set
- If checks failed, the columns with mismatching types; a dictionary with the column name as key, and a tuple of the reference type and the actual type as value:

    {
        'col_x': ('integer', 'number'),
        'col_y': ('datetime', 'string'),
    }

- If the primary keys are different, a tuple with the diff. The first element is the index where the two differ, and the two subsequent elements are the corresponding entries from the reference and dataset primary key lists:

    (index, ref_col, dst_col)
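A hedged sketch of comparing two schemas; here both are taken from resources of an existing package purely for illustration (the package path and resource names are hypothetical, and the schema dictionaries are assumed to follow the frictionless table schema layout):

    from friendly_data.dpkg import read_pkg
    from friendly_data.validate import check_schema

    pkg = read_pkg("pkg_dir")
    ref = pkg.get_resource("reference")["schema"]  # reference schema
    dst = pkg.get_resource("candidate")["schema"]  # schema being validated

    ok, missing, mismatch, pk_diff = check_schema(ref, dst)
    if not ok:
        print(missing, mismatch, pk_diff)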
- friendly_data.validate.summarise_diff(diff: Tuple[bool, Set[str], Dict[str, Tuple[str, str]], List[Tuple]]) str [source]
Summarise the schema diff from check_schema() results as a pandas.DataFrame.
- friendly_data.validate.summarise_errors(report: List[Dict]) pandas.DataFrame [source]
Summarise the dict/json error report as a pandas.DataFrame
- Parameters
- reportList[Dict]
List of errors as returned by
check_pkg()
- Returns
- pandas.DataFrame
Summary dataframe; example:
         filename  row  col       error  remark
    0     bad.csv   12       extra-cell     ...
    1     bad.csv   22  SRB   type-error     ...
Data analysis interface
The following modules provide interfaces that are useful when you are working with common data analysis frameworks like pandas and xarray.
Converters
Functions useful to read a data package resource into common analysis frameworks like pandas, xarray, etc. Currently supported:
Library | Data Structure
---|---
pandas | pandas.DataFrame
xarray | xarray.DataArray, xarray.Dataset
Types from the frictionless specification are mapped to the corresponding pandas types when a resource is read.
- friendly_data.converters.from_df(df: _dfseries_t, basepath: Union[str, Path], datapath: Union[str, Path] = '', alias: Dict[str, str] = {}, rename: bool = True) Resource [source]
Write dataframe to a CSV file, and return a data package resource.
NOTE: Do not call frictionless.Resource.infer() on the resource instance returned by this function, as that might overwrite our metadata/schema customisations with default heuristics in the frictionless implementation.
- Parameters
- dfpd.DataFrame | pd.Series
Dataframe to write
- basepathUnion[str, Path]
Path to the package directory
- datapathUnion[str, Path] (default: empty string)
Path to the CSV file where the dataframe is written. If datapath is empty, a file name is generated by concatenating all the columns in the dataframe.
- aliasDict[str, str] (default: {})
A dictionary of column aliases if the dataframe has custom column names that need to be mapped to columns in the registry. The key is the column name in the dataframe, and the value is a column in the registry.
- renamebool (default: True)
Rename aliased columns to match the registry when writing to the CSV.
- Returns
- frictionless.Resource
Data package resource that points to the CSV file.
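A hedged sketch of writing a dataframe out as a package resource; the directory is hypothetical and the column names are assumed to be known to the registry (or mapped to it via alias):

    import pandas as pd

    from friendly_data.converters import from_df

    df = pd.DataFrame(
        {"region": ["A", "B"], "flow_out": [1.0, 2.0]}
    ).set_index("region")

    res = from_df(df, basepath="pkg_dir", datapath="flow_out.csv")
    print(res["path"])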
- friendly_data.converters.from_dst(dst: xarray.Dataset, basepath: Union[str, Path], alias: Dict[str, str] = {}) List[Resource] [source]
Write an xarray.Dataset into CSV files, and return the list of resources.
Each data variable is written to a separate CSV file in the directory specified by basepath. The file name is derived from the data variable name by sanitising it and appending the CSV extension.
- Parameters
- dstxr.Dataset
Dataset to write
- basepathUnion[str, Path]
Path to the package directory
- aliasDict[str, str]
A dictionary of column aliases if the dataset has custom data variable/coordinate names that need to be mapped to columns in the registry.
- Returns
- List[Resource]
List of data package resources that point to the CSV files.
- friendly_data.converters.resolve_aliases(df: _dfseries_t, alias: Dict[str, str]) _dfseries_t [source]
Return a copy of the dataframe with aliases resolved
- Parameters
- dfpd.DataFrame | pd.Series
- aliasDict[str, str]
A dictionary of column aliases if the dataframe has custom column names that need to be mapped to columns in the registry. The key is the column name in the dataframe, and the value is a column in the registry.
- Returns
- pd.DataFrame | pd.Series
Since the column and index levels are renamed, a copy is returned so that the original dataframe/series remains unaltered.
- friendly_data.converters.to_da(resource: Resource, noexcept: bool = False, **kwargs) xarray.DataArray [source]
Reads a data package resource as an xarray.DataArray.
This function is restricted to tables with only one value column (equivalent to a pandas.Series). All indices are treated as xarray.core.coordinates.DataArrayCoordinates and dimensions. The array is reshaped to match the dimensions. Any unit index is extracted and attached as an attribute to the data array; it is assumed that the whole table uses the same unit.
Additional keyword arguments are passed on to xarray.DataArray.
- Parameters
- resourcefrictionless.Resource
A data package resource object
- noexceptbool (default: False)
Whether to suppress an exception
- **kwargs
Additional keyword arguments that are passed on to xarray.DataArray
See also
to_df(): see it for details on noexcept
- friendly_data.converters.to_df(resource: Resource, noexcept: bool = False, **kwargs) pandas.DataFrame [source]
Reads a data package resource as a pandas.DataFrame
FIXME: ‘format’ in the schema is ignored.
- Parameters
- resourcefrictionless.Resource
A data package resource object
- noexceptbool (default: False)
Whether to suppress an exception
- **kwargs
Additional keyword arguments that are passed on to the reader:
pandas.read_csv()
,pandas.read_excel()
, etc
- Returns
- pandas.DataFrame
NOTE: when noexcept is True and there's an exception, an empty dataframe is returned
- Raises
- ValueError
If the resource is not local, or if the source type the resource points to isn't supported
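For illustration, reading the first resource of a package into a dataframe; the package path is hypothetical and the frictionless Package API is assumed for accessing resources:

    from friendly_data.dpkg import read_pkg
    from friendly_data.converters import to_df

    pkg = read_pkg("pkg_dir")
    df = to_df(pkg.resources[0])  # index columns become the dataframe index

    # with noexcept=True an empty dataframe is returned on error
    df_or_empty = to_df(pkg.resources[0], noexcept=True)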
- friendly_data.converters.to_dst(resource: Resource, noexcept: bool = False, **kwargs) xarray.Dataset [source]
Reads a data package resource as an xarray.Dataset.
Unlike to_da(), this function works for all tables. All indices are treated as xarray.core.coordinates.DataArrayCoordinates and dimensions. The arrays are reshaped to match the dimensions. Any unit index is extracted and attached as an attribute to each data array; it is assumed that all columns in the table use the same unit.
Additional keyword arguments are passed on to xarray.Dataset.
- Parameters
- resourcefrictionless.Resource
A data package resource object
- noexceptbool (default: False)
Whether to suppress an exception
- **kwargs
Additional keyword arguments that are passed on to xarray.Dataset
See also
to_df(): see it for details on noexcept
- friendly_data.converters.to_mfdst(resources: Iterable[Resource], noexcept: bool = False, **kwargs) xarray.Dataset [source]
Reads a list of data package resources as an
xarray.Dataset
This function reads multiple resources/files and converts each column into a data array (identical to
to_dst()
), which are then combined into onexarray.Dataset
. Note that any value column that is present more than once in the data package is overwritten by the last one. If you want support for duplicates, you should useto_dst()
and handle the duplicates yourself.- Parameters
- resourcesList[frictionless.Resource]
List of data package resource objects
- noexceptbool (default: False)
Whether to suppress an exception
- **kwargs
Additional keyword arguments that are passed on to
xarray.Dataset
See also
to_df()
see for details on
noexcept
- friendly_data.converters.xr_da(df: pandas.DataFrame, col: Union[int, Hashable], *, coords: Dict, attrs: Dict = {}, **kwargs) xarray.DataArray [source]
Create an xarray data array from a data frame
- Parameters
- dfpandas.DataFrame
- colUnion[int, Hashable]
Column to use to create the data array, either use the column number, or column name
- coordsDict
Dictionary of coordinate arrays
- attrsDict
Dictionary of metadata attributes like unit
- Returns
- xarray.DataArray
- friendly_data.converters.xr_metadata(df: pandas.DataFrame) Tuple[pandas.DataFrame, Dict, Dict] [source]
Extract metadata to create xarray data array/datasets
All indices except unit are extracted as coordinates, and "unit" is extracted as a metadata attribute.
- Parameters
- dfpandas.DataFrame
- Returns
- Tuple[pandas.DataFrame, Dict, Dict]
The dataframe with units removed, dictionary of coordinates, dictionary with constant attributes like units
Time series API
Convenience functions to ingest differently shaped time series data into the standard 1-D shape supported by the data package specification.
- friendly_data.tseries.from_multicol(fpath: _file_t, *, date_cols: List[_col_t], **kwargs)[source]
Read a time series where datetime values are in multiple columns.
See also
read_timeseries(): the main entry point for users; see it for full documentation
- friendly_data.tseries.from_table(fpath: _file_t, *, col_units: str, zero_idx: bool, row_fmt: str = '', **kwargs)[source]
Read a time series from a tabular file.
See also
read_timeseries(): the main entry point for users; see it for full documentation
- friendly_data.tseries.read_timeseries(fpath: _file_t, *, date_cols: Optional[List[_col_t]] = None, col_units: Optional[str] = None, zero_idx: bool = False, row_fmt: str = '', source_t: str = '', **kwargs)[source]
Read a time series from a file.
While the natural way to structure a time series dataset is with the index column as datetime values, with subsequent columns holding other values, there are a few other frequently used structures.
The first is to structure it as a table:
date | 1 | 2 | … | 23 | 24
---|---|---|---|---|---
1/1/2016 | 0 | 10 | … | 2.3 | 5.1
4/1/2016 | 3 | 11 | … | 4.3 | 9.1
When source_t is set to "table", this function reads a tabular dataset like the one above, flattens it into a series, and sets the appropriate datetime values as the index.
The other common structure is to split the datetime values into multiple columns in the table:
date | time | col1 | col2
---|---|---|---
1/1/2016 | 10:00 | 42.0 | foo
4/1/2016 | 11:00 | 3.14 | bar
When source_t is set to “multicol”, as the table is read, the indicated columns are combined to construct the datetime values, which are then set as the index.
If source_t is not specified (or set to an empty string), options specific to this function are ignored, and all other keyword options are passed on to the backend transparently; in case of reading a CSV with Pandas, that means all valid keywords for pandas.read_csv are accepted.
- Parameters
- fpathUnion[str, Path, TextIO]
Path to the dataset file
- date_colsList[int, str] (for “multicol” mode)
List of columns to be combined to construct the datetime values
- col_unitsstr (for “table” mode)
Time units for the columns. Accepted values: “month”, “hour”.
- zero_idxbool (for “table” mode, default: False)
Whether the columns are zero indexed. When the columns represent hours or minutes, it is common to number them as the nth hour, i.e. counting from 1 instead of 0; set this to False in that case.
- row_fmtstr (for “table” mode, default: empty string)
Format of the datetime column (use strftime format strings; see: man 3 strftime). If left empty, the reader tries to guess a format using the dateutil module (the Pandas default)
- source_tstr (default: empty string)
Mode of reading the data. Accepted values: “table”, “multicol”, or empty string
- **kwargsDict
Other keyword arguments passed on to the reader backend. Any options passed here take precedence and overwrite values inferred from the earlier keyword arguments.
- Returns
- tsSeries/DataFrame
The time series is returned as a series or a dataframe depending on the number of other columns that are present.
Examples
To skip specific rows, maybe because they have bad data, or are empty, you may use the skiprows option. It can be set to a list-like where the entries are row indices (numbers).
    >>> read_timeseries("mydata.csv", source_t="table", col_units="hour",
    ...                 skiprows=range(1522, 5480))
The above example skips rows 1522-5480.
Similarly, data type of the column values can be controlled by using the dtype option. When set to a numpy.dtype, all values will be read as that type, which is probably relevant for the “table” mode. In the “multicol” mode, the types of the values can be controlled at the column level by setting it to a dictionary, where the key matches a column name, and the value is a valid numpy.dtype.
Conversion to IAMC
Interface to convert a Friendly dataset to IAMC format
Configuration is done using two separate files. A global config file (in YAML format) sets options like mapping an index column to the corresponding IAMC names, and default values for mandatory columns. Per-dataset configuration, like identifying index columns, mapping a dataset to its IAMC variable name, defining column aliases, and aggregations, is done in an index file (also in YAML format).
- class friendly_data.iamc.IAMconv(idx: pkgindex, indices: Dict, basepath: Union[str, Path])[source]
Converter class for IAMC data
This class resolves index columns against the “semi-hierarchical” variables used in IAMC data, and separates them into individual datasets that are part of the datapackage. It relies on the index file and index column definitions to do the disaggregation. It also supports the reverse operation of aggregating multiple datasets into an IAMC dataset.
TODO:
describe assumptions (e.g. case insensitive match) and fallbacks (e.g. missing title)
limitations (e.g. when no index column exists)
- Attributes
Methods
agg_idxcol(df, col, entry): Aggregate values and generate IAMC dataframes
agg_vals_all(entry): Find all values in index column that are present in an aggregate rule
frames(entry, df): Convert the dataframe to IAMC format according to configuration in the entry
from_file(confpath, idxpath): Create a mapping of IAMC indicator variables with index columns
iamcify(df): Transform dataframe to match the IAMC (long) format
index_levels(idxcols): Index levels for user defined index columns
read_indices(path, basepath, **kwargs): Read index column definitions provided in config
resolve_idxcol_defaults(df): Find missing IAMC indices and set them to the default value from config
to_csv(files, output[, wide]): Write converted IAMC data frame to a CSV file
to_df(files_or_dfs): Convert CSV files/dataframes to IAMC format according to the index
- agg_idxcol(df: pandas.DataFrame, col: str, entry: Dict) List[pandas.DataFrame] [source]
Aggregate values and generate IAMC dataframes
- Parameters
- dfpd.DataFrame
Dataframe to aggregate from
- colstr
Column to perform aggregation on
- entryDict
Index entry with aggregation rules
- Returns
- List[pd.DataFrame]
List of IAMC dataframes
- agg_vals_all(entry: Dict) Tuple[str, List[str]] [source]
Find all values in index column that are present in an aggregate rule
- property basepath
Data package basepath; the directory where the index file is located
- frames(entry: Dict, df: pandas.DataFrame) List[pandas.DataFrame] [source]
Convert the dataframe to IAMC format according to configuration in the entry
- Parameters
- entryDict
Index entry
- dfpandas.DataFrame
The dataframe that is to be converted to IAMC format
- Returns
- List[pandas.DataFrame]
List of pandas.DataFrame objects in IAMC format
- classmethod from_file(confpath: Union[str, Path], idxpath: Union[str, Path]) IAMconv [source]
Create a mapping of IAMC indicator variables with index columns
- Parameters
- confpathUnion[str, Path]
Path to the IAMC <-> data package config file
- idxpathUnion[str, Path]
Path to index file
- **kwargs
Keyword arguments passed on to the pandas reader backend.
- Returns
- IAMconv
- iamcify(df: pandas.DataFrame) pandas.DataFrame [source]
Transform dataframe to match the IAMC (long) format
- index_levels(idxcols: Iterable) Dict[str, pandas.Series] [source]
Index levels for user defined index columns
- Parameters
- idxcolsIterable[str]
Iterable of index column names
- Returns
- Dict[str, pd.Series]
Different values for a given set of index columns
- property indices: Dict
Index definitions:
- default values for mandatory index columns, in case they are missing
- different levels of user defined index columns; points to a 2-column CSV file with the "name" and "iamc" columns
- classmethod read_indices(path: Union[str, Path], basepath: Union[str, Path], **kwargs) pandas.Series [source]
Read index column definitions provided in config
- property res_idx: pkgindex
Package index
Each entry corresponds to a resource that may be included in the IAMC output.
- resolve_idxcol_defaults(df: pandas.DataFrame) pandas.DataFrame [source]
Find missing IAMC indices and set them to the default value from config
The IAMC format requires the following indices: self._IAMC_IDX; if any of them are missing, the corresponding index level is created, and the level values are set to a constant specified in the config.
- Parameters
- dfpandas.DataFrame
- Returns
- pandas.DataFrame
Dataframe with default index columns resolved
- to_csv(files: Iterable[Union[str, Path]], output: Union[str, Path], wide: bool = False)[source]
Write converted IAMC data frame to a CSV file
- Parameters
- filesIterable[Union[str, Path]]
List of files to collate and convert to IAMC
- outputUnion[str, Path] (default: empty string)
Path of the output CSV file; if empty, nothing is written to file.
- basepathUnion[str, Path]
Data package base path
- widebool (default: False)
Write the CSV in wide format (with years as columns)
- to_df(files_or_dfs: Union[Iterable[Union[str, Path]], Dict[str, pandas.DataFrame]]) pandas.DataFrame [source]
Convert CSV files/dataframes to IAMC format according to the index
- Parameters
- files_or_dfsUnion[Iterable[Union[str, Path]], Dict[str, pandas.DataFrame]]
List of files or a dictionary of dataframes, to be collated and converted to IAMC format. Each item must have an entry in the package index the converter was initialised with; it is skipped otherwise. Files are matched by file path, whereas dataframes match when the dictionary key matches the index entry name.
Note that when the files are read, the basepath is set to whatever the converter was initialised with. If IAMconv.from_file() was used, it is the parent directory of the index file.
- Returns
- DataFrame
A pandas.DataFrame in IAMC format
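A hedged sketch of converting datasets listed in the index into IAMC form; the config, index, and dataset file names are hypothetical:

    from friendly_data.iamc import IAMconv

    conv = IAMconv.from_file("config.yaml", "index.yaml")

    # collate two indexed datasets into one IAMC dataframe,
    # then write the same selection out as a wide CSV
    iamc_df = conv.to_df(["capacity.csv", "generation.csv"])
    conv.to_csv(["capacity.csv", "generation.csv"], output="iamc.csv", wide=True)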
Internal interfaces
Internal functions and classes; useful if you are developing new features for friendly_data.
File I/O
Functions useful for I/O and file manipulation
- class friendly_data.io.HttpCache(url_t: str)[source]
An HTTP cache
It accepts a URL template with parameters, e.g. https://www.example.com/path/{}.json; the parameters can be provided later, at fetch time. No checks are made whether the number of parameters passed is compatible with the URL template.
After fetching a resource, it is cached in a file under $TMPDIR/friendly_data_cache/. The file name is of the form http-<checksum-of-url-template>-<checksum-of-url>. The cache is updated every 24 hours. A user may also force a cache cleanup by calling remove().
- Parameters
- url_tstr
URL template, e.g.
https://www.example.com/path/{}.json
- Attributes
- cachedirpathlib.Path
Path object pointing to the cache directory
Methods
cachefile(arg, *args): Return the cache file, and the corresponding URL
fetch(url): Fetch the URL
get(arg, *args): Get the URL contents
remove(*args): Remove cache files
- cachefile(arg: str, *args: str) Tuple[Path, str] [source]
Return the cache file, and the corresponding URL
- Parameters
- argstr
parameters for the URL template (one mandatory)
- *argsstr, optional
more parameters (optional)
- Returns
- Tuple[pathlib.Path, str]
Tuple of Path object pointing to the cache file and the URL string
- fetch(url: str) bytes [source]
Fetch the URL
- Parameters
- urlstr
URL to fetch
- Returns
- bytes
bytes array of the contents that was fetched
- Raises
- ValueError
If the URL is incorrect
- get(arg: str, *args: str) bytes [source]
Get the URL contents
If a valid cache exists, return the contents from there, otherwise fetch again.
- Parameters
- argstr
parameters for the URL template (one mandatory)
- *argsstr, optional
more parameters (optional)
- Returns
- bytes
bytes array of the contents
- Raises
- ValueError
If the URL is incorrect
- requests.ConnectionError
If there is no network connection
- remove(*args: str)[source]
Remove cache files
Without arguments, remove all files associated with this cache. With arguments, remove only the files associated with the URL formed from the arguments.
- Parameters
- *argsstr, optional
parameters for the URL template
- Raises
- FileNotFoundError
If an argument is provided to remove a specific cache file, but the cache file does not exist.
- friendly_data.io.copy_files(src: Iterable[Union[str, Path]], dest: Union[str, Path], anchor: Union[str, Path] = '') List[Path] [source]
Copy files to a directory
Without an anchor, the source files are copied to the root of the destination directory; with an anchor, the relative paths between the source files are maintained; any required subdirectories are created.
- Parameters
- srcIterable[Union[str, Path]]
List of files to be copied
- destUnion[str, Path]
Destination directory
- anchorUnion[str, Path] (default: empty string)
Top-level directory for anchoring, provide if you want the relative paths between the source files to be maintained with respect to this directory.
- Returns
- List[Path]
List of files that were copied
- friendly_data.io.dwim_file(fpath: Union[str, Path]) Union[Dict, List] [source]
- friendly_data.io.dwim_file(fpath: Union[str, Path], data: Any) None
Do What I Mean with file
Depending on the function arguments, either read the contents of a file, or write data to the file. The file type is guessed from the extension; supported formats: JSON and YAML.
- Parameters
- fpathUnion[str, Path]
File path to read or write to
- dataUnion[None, Any]
Data, when writing to a file.
- Returns
- Union[None, Union[Dict, List]]
If writing to a file, nothing (None) is returned.
If reading from a file, depending on the contents, either a list or a dictionary is returned.
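For illustration, both modes of dwim_file; the file names are hypothetical:

    from friendly_data.io import dwim_file

    # reading: returns a dict or a list depending on the file contents
    conf = dwim_file("config.yaml")

    # writing: the output format is chosen from the extension
    dwim_file("out.json", {"name": "mypkg"})
    dwim_file("out.yaml", ["a", "b", "c"])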
- friendly_data.io.get_cachedir() Path [source]
Create the directory $TMPDIR/friendly_data_cache and return the Path
- friendly_data.io.outoftree_paths(basepath: Union[str, Path], fpaths: Iterable[Union[str, Path]]) Tuple[List[Path], List[Path]] [source]
Separate a list of paths into in-tree and out-of-tree paths.
- Parameters
- basepathUnion[str, Path]
Path to use as the reference when identifying in/out of tree paths.
- fpathsIterable[Union[str, Path]]
List of paths.
- Returns
- Tuple[List[str], List[Path]]
A pair of lists: in-tree paths and out-of-tree paths
- friendly_data.io.path_in(fpaths: Iterable[Union[str, Path]], testfile: Union[str, Path]) bool [source]
Function to test if a path is in a list of paths.
The test checks if they are the same physical files or not, so the testfile needs to exist on disk.
- Parameters
- fpathsIterable[Union[str, Path]]
List of paths to check
- testfileUnion[str, Path]
Test file (must exist on disk)
- Returns
- bool
- friendly_data.io.path_not_in(fpaths: Iterable[Union[str, Path]], testfile: Union[str, Path]) bool [source]
Function to test if a path is absent from a list of paths.
Opposite of
path_in()
.- Parameters
- fpathsIterable[Union[str, Path]]
List of paths to check
- testfileUnion[str, Path]
Test file (must exist on disk)
- Returns
- bool
- friendly_data.io.posixpathstr(fpath: Union[str, Path]) str [source]
Given a path object, return a POSIX compatible path string
- Parameters
- fpathUnion[str, Path]
Path object
- Returns
- str
- friendly_data.io.relpaths(basepath: Union[str, Path], pattern: Union[str, Iterable[Union[str, Path]]]) List[str] [source]
Convert a list of paths to relative paths
- Parameters
- basepathUnion[str, Path]
Path to use as the reference when calculating relative paths
- patternUnion[str, Iterable[Union[str, Path]]]
Either a pattern relative to basepath to generate a list of paths, or a list of paths to convert.
- Returns
- List[str]
List of relative paths (as strings)
Helper utilities
Collection of helper functions
- friendly_data.helpers.filter_dict(data: Dict, allowed: Iterable) Dict [source]
Filter a dictionary based on a set of allowed keys
- friendly_data.helpers.flatten_list(lst: Iterable) Iterable [source]
Flatten an arbitrarily nested list (returns a generator)
- friendly_data.helpers.idx_lvl_values(idx: pandas.MultiIndex, name: str) pandas.Index [source]
Given a pandas.MultiIndex and a level name, find the level values
- Parameters
- idxpandas.MultiIndex
A multi index
- namestr
Level name
- Returns
- pandas.Index
Index with the level values
- friendly_data.helpers.idxslice(lvls: Iterable[str], selection: Dict[str, List]) Tuple [source]
Create an index slice tuple from a set of level names, and selection mapping
NOTE: The order of lvls should match the order of the levels in the index exactly; typically, mydf.index.names.
- Parameters
- lvlsIterable[str]
Complete set of levels in the index
- selectionDict[str, List]
Selection set; the key is a level name, and the value is a list of values to select
- Returns
- Tuple
Tuple of values, with slice(None) for skipped levels (matches anything)
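A short sketch of using the returned tuple with pandas.DataFrame.loc; the index levels and values are hypothetical:

    import pandas as pd

    from friendly_data.helpers import idxslice

    df = pd.DataFrame(
        {"value": range(4)},
        index=pd.MultiIndex.from_product(
            [["wind", "solar"], [2020, 2030]], names=["technology", "year"]
        ),
    )

    # select all years for "wind"; unselected levels get slice(None)
    rows = idxslice(df.index.names, {"technology": ["wind"]})
    print(df.loc[rows, :])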
- friendly_data.helpers.import_from(module: str, name: str)[source]
Import name from module; if name is empty, return the module
- friendly_data.helpers.match(pattern, **kwargs)[source]
Wrap glom.Match with the default action set to glom.SKIP.
This is very useful to match items inside nested data structures. A few example uses:
    >>> from glom import glom
    >>> cols = [
    ...     {
    ...         "name": "abc",
    ...         "type": "integer",
    ...         "constraints": {"enum": []}
    ...     },
    ...     {
    ...         "name": "def",
    ...         "type": "string"
    ...     },
    ... ]
    >>> glom(cols, [match({"constraints": {"enum": list}, str: str})])
    [{"name": "abc", "type": "integer", "constraints": {"enum": []}}]
For details see: glom.Match
- class friendly_data.helpers.noop_map[source]
A noop mapping class
A dictionary subclass that falls back to a noop on KeyError and returns the key being looked up.
Methods
clear()
copy()
fromkeys(iterable[, value]): Create a new dictionary with keys from iterable and values set to value.
get(key[, default]): Return the value for key if key is in the dictionary, else default.
items()
keys()
pop(key[, default]): If key is not found, default is returned if given, otherwise KeyError is raised.
popitem(/): Remove and return a (key, value) pair as a 2-tuple.
setdefault(key[, default]): Insert key with a value of default if key is not in the dictionary.
update([E, ]**F): If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
values()
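A short sketch of the fallback behaviour; the keys are hypothetical:

    from friendly_data.helpers import noop_map

    aliases = noop_map({"flow_out": "generation"})
    print(aliases["flow_out"])  # -> "generation"
    print(aliases["region"])    # -> "region" (missing key returns itself)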
- friendly_data.helpers.sanitise(string: str) str [source]
Sanitise string for use as group/directory name
- friendly_data.helpers.select(spec, **kwargs)[source]
Wrap glom.Check with the default action set to glom.SKIP.
This is very useful to select items inside nested data structures. A few example uses:
    >>> from glom import glom
    >>> cols = [
    ...     {
    ...         "name": "abc",
    ...         "type": "integer"
    ...     },
    ...     {
    ...         "name": "def",
    ...         "type": "string"
    ...     },
    ... ]
    >>> glom(cols, [select("name", equal_to="abc")])
    [{"name": "abc", "type": "integer"}]
For details see: glom.Check