amd.compare module

Functions for comparing AMDs and PDDs of crystals.

amd.compare.compare(crystals: Union[amd.periodicset.PeriodicSet, str, List[Union[amd.periodicset.PeriodicSet, str]]], crystals_: Optional[Union[amd.periodicset.PeriodicSet, str, List[Union[amd.periodicset.PeriodicSet, str]]]] = None, by: str = 'AMD', k: int = 100, n_neighbors: Optional[int] = None, csd_refcodes: bool = False, verbose: bool = True, **kwargs) pandas.core.frame.DataFrame

Given one or two sets of crystals, compare by AMD or PDD and return a pandas DataFrame of the distance matrix.

Given one or two paths to CIFs, periodic sets, CSD refcodes or lists thereof, compare by AMD or PDD and return a pandas DataFrame of the distance matrix. The default is to compare by AMD with k = 100. Accepts any keyword arguments accepted by CifReader, CSDReader and the comparison functions in this module.

Parameters
  • crystals (list of str or PeriodicSet) – A path, PeriodicSet, tuple or a list of those.

  • crystals_ (list of str or PeriodicSet, optional) – A path, PeriodicSet, tuple or a list of those.

  • by (str, default 'AMD') – Use AMD or PDD to compare crystals.

  • k (int, default 100) – Parameter for AMD/PDD, the number of neighbor atoms to consider for each atom in a unit cell.

  • n_neighbors (int, default None) – If given, find this number of nearest neighbors instead of computing a full distance matrix between crystals.

  • csd_refcodes (bool, optional, csd-python-api only) – Interpret crystals and crystals_ as CSD refcodes or lists thereof, rather than paths.

  • verbose (bool, optional) – If True, prints a progress bar during reading, calculating and comparing items.

  • **kwargs – Any keyword arguments accepted by amd.CifReader, amd.CSDReader, amd.PDD and the functions used to compare: reader, remove_hydrogens, disorder, heaviest_component, molecular_centres, show_warnings (from CifReader); refcode_families (from CSDReader); collapse_tol (from PDD); metric, low_memory (from AMD_pdist); metric, backend, n_jobs, verbose (from PDD_pdist); algorithm, leaf_size, metric, p, metric_params, n_jobs (from _nearest_items).

Returns

df – DataFrame of the distance matrix for the given crystals compared by the chosen invariant.

Return type

pandas.DataFrame

Raises

ValueError – If by is not 'AMD' or 'PDD', if either given set has no valid crystals to compare, or if crystals or crystals_ is of an invalid type.

Examples

Compare everything in a .cif (default, AMD with k=100):

df = amd.compare('data.cif')

Compare everything in one cif with all crystals in all cifs in a directory (PDD, k=50):

df = amd.compare('data.cif', 'dir/to/cifs', by='PDD', k=50)

Examples (csd-python-api only)

Compare two crystals by CSD refcode (PDD, k=50):

df = amd.compare('DEBXIT01', 'DEBXIT02', csd_refcodes=True, by='PDD', k=50)

Compare everything in a refcode family (AMD, k=100):

df = amd.compare('DEBXIT', csd_refcodes=True, refcode_families=True)

amd.compare.EMD(pdd: numpy.ndarray[Any, numpy.dtype[numpy.floating]], pdd_: numpy.ndarray[Any, numpy.dtype[numpy.floating]], metric: Optional[str] = 'chebyshev', return_transport: Optional[bool] = False, **kwargs) Union[float, Tuple[float, numpy.ndarray[Any, numpy.dtype[numpy.floating]]]]

Calculate the Earth mover’s distance (EMD) between two PDDs, aka the Wasserstein metric.

Parameters
  • pdd (numpy.ndarray) – PDD of a crystal.

  • pdd_ (numpy.ndarray) – PDD of a crystal.

  • metric (str or callable, default 'chebyshev') – EMD between PDDs requires a distance between PDD rows. By default the Chebyshev (L-infinity) distance is used, as with AMDs. Accepts any metric accepted by scipy.spatial.distance.cdist().

  • return_transport (bool, default False) – If True, return a tuple (emd, transport_plan) instead of just the distance, where transport_plan describes the optimal flow.

Returns

emd – Earth mover’s distance between two PDDs. If return_transport is True, return a tuple (emd, transport_plan).

Return type

float, or a tuple (float, numpy.ndarray) if return_transport is True

Raises

ValueError – If pdd and pdd_ do not have the same number of columns.
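
As a rough illustration of what the EMD computes, the sketch below solves the underlying transport problem directly as a linear program with scipy.optimize.linprog. It assumes each PDD stores its row weights in the first column and distances in the remaining columns. emd_sketch is a hypothetical helper written for this example, not the implementation behind amd.compare.EMD().

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist


def emd_sketch(pdd, pdd_, metric="chebyshev"):
    """Earth mover's distance between two PDD-like arrays via a linear program.

    Assumption: column 0 holds row weights (summing to 1) and the
    remaining columns hold the distances for each row.
    """
    pdd = np.asarray(pdd, dtype=float)
    pdd_ = np.asarray(pdd_, dtype=float)
    if pdd.shape[1] != pdd_.shape[1]:
        raise ValueError("pdd and pdd_ must have the same number of columns")

    w, w_ = pdd[:, 0], pdd_[:, 0]
    # Cost of moving weight from row i of pdd to row j of pdd_.
    cost = cdist(pdd[:, 1:], pdd_[:, 1:], metric=metric)
    m, n = cost.shape

    # The flow f[i, j] >= 0 is flattened row-major.
    # Row sums must equal w and column sums must equal w_.
    a_eq = np.zeros((m + n, m * n))
    for i in range(m):
        a_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        a_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([w, w_])

    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return float(res.fun), res.x.reshape(m, n)


# Two small hand-made arrays in the assumed layout (weight, then distances).
pdd_a = np.array([[0.5, 1.0, 2.0],
                  [0.5, 1.5, 2.5]])
pdd_b = np.array([[1.0, 1.0, 2.0]])

dist, plan = emd_sketch(pdd_a, pdd_b)
```

Here half of the weight moves at cost 0 and half at cost 0.5, so the distance is 0.25, and plan has one row per row of pdd_a and one column per row of pdd_b.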

amd.compare.AMD_cdist(amds, amds_, metric: str = 'chebyshev', low_memory: bool = False, **kwargs) numpy.ndarray[Any, numpy.dtype[numpy.floating]]

Compare two sets of AMDs with each other, returning a distance matrix. This function is essentially scipy.spatial.distance.cdist() with the default metric chebyshev and a low memory option.

Parameters
  • amds (ArrayLike) – A list/array of AMDs.

  • amds_ (ArrayLike) – A list/array of AMDs.

  • metric (str or callable, default 'chebyshev') – Usually AMDs are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().

  • low_memory (bool, default False) – Use a slower but more memory efficient method for large collections of AMDs (metric ‘chebyshev’ only).

  • **kwargs – Extra arguments for metric, passed to scipy.spatial.distance.cdist().

Returns

dm – A distance matrix of shape (len(amds), len(amds_)). dm[i][j] is the distance (given by metric) between amds[i] and amds_[j].

Return type

numpy.ndarray
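
Since this function is described as essentially scipy.spatial.distance.cdist() with a Chebyshev default, the shape and meaning of the result can be sketched with SciPy alone. The arrays below are made-up values standing in for AMDs, and the row-at-a-time loop is only an assumption about how a memory-saving variant could work, not the library's low_memory code path.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Made-up vectors standing in for AMDs with k = 3.
amds = np.array([[1.0, 2.0, 3.0],
                 [1.0, 2.5, 3.0]])
amds_ = np.array([[1.0, 2.0, 3.5],
                  [2.0, 2.0, 3.0],
                  [1.0, 2.0, 3.0]])

# Full distance matrix: dm[i][j] is the Chebyshev distance
# between amds[i] and amds_[j].
dm = cdist(amds, amds_, metric="chebyshev")

# Same result computed one row at a time, so that only a
# (len(amds_), k) temporary exists at any moment.
dm_low = np.empty((len(amds), len(amds_)))
for i, row in enumerate(amds):
    dm_low[i] = np.max(np.abs(amds_ - row), axis=1)
```

Both arrays have shape (2, 3), and dm[0][2] is 0 because amds[0] and amds_[2] are identical.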

amd.compare.AMD_pdist(amds, metric: str = 'chebyshev', low_memory: bool = False, **kwargs) numpy.ndarray[Any, numpy.dtype[numpy.floating]]

Compare a set of AMDs pairwise, returning a condensed distance matrix. This function is essentially scipy.spatial.distance.pdist() with the default metric chebyshev and a low memory parameter.

Parameters
  • amds (ArrayLike) – A list/array of AMDs.

  • metric (str or callable, default 'chebyshev') – Usually AMDs are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.pdist().

  • low_memory (bool, default False) – Use a slower but more memory efficient method for large collections of AMDs (metric ‘chebyshev’ only).

  • **kwargs – Extra arguments for metric, passed to scipy.spatial.distance.pdist().

Returns

cdm – A condensed distance matrix: the upper triangle of the square distance matrix, flattened into a vector. Use scipy.spatial.distance.squareform() to convert it to a symmetric square distance matrix.

Return type

numpy.ndarray
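
The relationship between the condensed and square forms can be sketched with SciPy directly, since this function is described as essentially scipy.spatial.distance.pdist(). The array below contains made-up values standing in for AMDs.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up vectors standing in for three AMDs with k = 3.
amds = np.array([[1.0, 2.0, 3.0],
                 [1.0, 2.5, 3.0],
                 [2.0, 2.0, 3.0]])

# Condensed form: one entry per unordered pair, n * (n - 1) / 2 in total,
# ordered (0, 1), (0, 2), (1, 2).
cdm = pdist(amds, metric="chebyshev")

# Square form: symmetric n x n matrix with a zero diagonal.
dm = squareform(cdm)
```

For three items the condensed vector has three entries, and dm[0][1] equals cdm[0].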

amd.compare.PDD_cdist(pdds: List[numpy.ndarray[Any, numpy.dtype[numpy.floating]]], pdds_: List[numpy.ndarray[Any, numpy.dtype[numpy.floating]]], metric: str = 'chebyshev', backend: str = 'multiprocessing', n_jobs: Optional[int] = None, verbose: bool = False, **kwargs) numpy.ndarray[Any, numpy.dtype[numpy.floating]]

Compare two sets of PDDs with each other, returning a distance matrix. Supports parallel processing via joblib. If using parallel processing, wrap the call to this function in an if __name__ == '__main__' guard.

Parameters
  • pdds (List[numpy.ndarray]) – A list of PDDs.

  • pdds_ (List[numpy.ndarray]) – A list of PDDs.

  • metric (str or callable, default 'chebyshev') – Usually PDD rows are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().

  • backend (str, default 'multiprocessing') – The parallelization backend implementation. For a list of supported backends, see the backend argument of joblib.Parallel.

  • n_jobs (int, default None) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.

  • verbose (bool, default False) – Prints a progress bar. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise uses tqdm.

  • **kwargs – Extra arguments for metric, passed to scipy.spatial.distance.cdist().

Returns

dm – A distance matrix of shape (len(pdds), len(pdds_)). dm[i][j] is the Earth mover's distance between pdds[i] and pdds_[j].

Return type

numpy.ndarray

amd.compare.PDD_pdist(pdds: List[numpy.ndarray[Any, numpy.dtype[numpy.floating]]], metric: str = 'chebyshev', backend: str = 'multiprocessing', n_jobs: Optional[int] = None, verbose: bool = False, **kwargs) numpy.ndarray[Any, numpy.dtype[numpy.floating]]

Compare a set of PDDs pairwise, returning a condensed distance matrix. Supports parallel processing via joblib. If using parallel processing, wrap the call to this function in an if __name__ == '__main__' guard.

Parameters
  • pdds (List[numpy.ndarray]) – A list of PDDs.

  • metric (str or callable, default 'chebyshev') – Usually PDD rows are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().

  • backend (str, default 'multiprocessing') – The parallelization backend implementation. For a list of supported backends, see the backend argument of joblib.Parallel.

  • n_jobs (int, default None) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.

  • verbose (bool, default False) – Prints a progress bar. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise uses tqdm.

  • **kwargs – Extra arguments for metric, passed to scipy.spatial.distance.cdist().

Returns

cdm – A condensed distance matrix: the upper triangle of the square distance matrix, flattened into a vector. Use scipy.spatial.distance.squareform() to convert it to a symmetric square distance matrix.

Return type

numpy.ndarray
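
To look up a single pair in a condensed matrix without expanding it to square form, the position can be computed from the pair (i, j). The sketch below assumes the pair ordering documented for scipy.spatial.distance.pdist(). condensed_index is a hypothetical helper written for this example and is not part of amd.compare.

```python
def condensed_index(i: int, j: int, n: int) -> int:
    """Position of the pair (i, j) in a condensed distance matrix of n items."""
    if i == j:
        raise ValueError("a condensed matrix has no diagonal entries")
    if i > j:
        i, j = j, i  # distances are symmetric, so order the pair
    return n * i - i * (i + 1) // 2 + (j - i - 1)


# For n = 4 the condensed vector stores the pairs in this order:
# (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)
n = 4
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
indices = [condensed_index(i, j, n) for i, j in pairs]
```

With cdm returned by PDD_pdist(pdds), the distance between pdds[i] and pdds[j] would then be cdm[condensed_index(i, j, len(pdds))].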

amd.compare.emd(pdd: numpy.ndarray[Any, numpy.dtype[numpy.floating]], pdd_: numpy.ndarray[Any, numpy.dtype[numpy.floating]], **kwargs) Union[float, Tuple[float, numpy.ndarray[Any, numpy.dtype[numpy.floating]]]]

Alias for EMD().