amd.compare module

Functions for comparing AMDs and PDDs of crystals.

amd.compare.compare(crystals, crystals_=None, by: str = 'AMD', k: int = 100, nearest: Optional[int] = None, reader: str = 'gemmi', remove_hydrogens: bool = False, disorder: str = 'skip', heaviest_component: bool = False, molecular_centres: bool = False, csd_refcodes: bool = False, refcode_families: bool = False, show_warnings: bool = True, collapse_tol: float = 0.0001, metric: str = 'chebyshev', n_jobs: Optional[int] = None, backend: str = 'multiprocessing', verbose: bool = False, low_memory: bool = False, **kwargs) pandas.core.frame.DataFrame

Given one or two sets of crystals, compare by AMD or PDD and return a pandas DataFrame of the distance matrix.

Given one or two paths to CIFs, periodic sets, CSD refcodes or lists thereof, compare by AMD or PDD and return a pandas DataFrame of the distance matrix. The default is to compare by AMD with k = 100. Accepts most keyword arguments accepted by CifReader, CSDReader and functions from compare.

Parameters
  • crystals (list of str or PeriodicSet) – A path, PeriodicSet, tuple or a list of those.

  • crystals_ (list of str or PeriodicSet, optional) – A path, PeriodicSet, tuple or a list of those.

  • by (str, default 'AMD') – Use AMD or PDD to compare crystals.

  • k (int, default 100) – Parameter for AMD/PDD, the number of neighbour atoms to consider for each atom in a unit cell.

  • nearest (int, default None) – Find a number of nearest neighbours instead of a full distance matrix between crystals.

  • reader (str, optional) – The backend package used to parse the CIF. The default is gemmi; pymatgen and ase are also accepted, as well as ccdc if csd-python-api is installed. The ccdc reader should be able to read any format accepted by ccdc.io.EntryReader, though only CIFs have been tested.

  • remove_hydrogens (bool, optional) – Remove hydrogens from the crystals.

  • disorder (str, optional) – Controls how disordered structures are handled. The default is skip, which skips any crystal with disorder, since disorder conflicts with the periodic set model. To read disordered structures anyway, choose either ordered_sites to remove atoms with disorder or all_sites to include all atoms regardless of disorder.

  • heaviest_component (bool, optional, csd-python-api only) – Removes all but the heaviest molecule in the asymmetric unit, intended for removing solvents.

  • molecular_centres (bool, default False, csd-python-api only) – Use the centres of molecules for comparison instead of centres of atoms.

  • csd_refcodes (bool, optional, csd-python-api only) – Interpret crystals and crystals_ as CSD refcodes or lists thereof, rather than paths.

  • refcode_families (bool, optional, csd-python-api only) – Read all entries whose refcode starts with the given strings, or ‘families’ (e.g. giving ‘DEBXIT’ reads all entries with refcodes starting with DEBXIT).

  • show_warnings (bool, optional) – Controls whether warnings that arise during reading are printed.

  • collapse_tol (float, default 1e-4, by='PDD' only) – If two PDD rows have all elements closer than collapse_tol, they are merged and weights are given to rows in proportion to the number of times they appeared.

  • metric (str or callable, default 'chebyshev') – The metric to compare AMDs/PDDs with. AMDs are compared directly with this metric. EMD is the metric used between PDDs, which requires giving a metric to use between PDD rows. Chebyshev (L-infinity) distance is the default. Accepts any metric accepted by scipy.spatial.distance.cdist().

  • n_jobs (int, default None, by='PDD' only) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.

  • backend (str, default ‘multiprocessing’, by='PDD' only) – The parallelization backend implementation for PDD comparisons. For a list of supported backends, see the backend argument of joblib.Parallel.

  • verbose (bool, default False) – Prints a progress bar when reading crystals, calculating AMDs/PDDs and comparing PDDs. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise uses tqdm.

  • low_memory (bool, default False, by='AMD' only) – Use a slower but more memory efficient method for large collections of AMDs (metric ‘chebyshev’ only).

Returns

df – DataFrame of the distance matrix for the given crystals compared by the chosen invariant.

Return type

pandas.DataFrame

Raises

ValueError – If by is not ‘AMD’ or ‘PDD’, if either set given has no valid crystals to compare, or if crystals or crystals_ is an invalid type.

Examples

Compare everything in a .cif (default, AMD with k=100):

df = amd.compare('data.cif')

Compare everything in one cif with all crystals in all cifs in a directory (PDD, k=50):

df = amd.compare('data.cif', 'dir/to/cifs', by='PDD', k=50)

Examples (csd-python-api only)

Compare two crystals by CSD refcode (PDD, k=50):

df = amd.compare('DEBXIT01', 'DEBXIT02', csd_refcodes=True, by='PDD', k=50)

Compare everything in a refcode family (AMD, k=100):

df = amd.compare('DEBXIT', csd_refcodes=True, refcode_families=True)
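
The collapse_tol parameter described above merges near-identical PDD rows and weights each surviving row by how often it appeared. The following is a minimal pure-Python sketch of that idea, not the library's implementation: the function name collapse_rows is hypothetical, rows are plain lists of distances, and rows are grouped greedily against the first row of each group.

```python
def collapse_rows(rows, tol=1e-4):
    """Merge rows whose elements all differ by less than tol.

    Returns a list of (weight, row) pairs, where weight is the fraction
    of the original rows that were merged into that row.
    Hypothetical helper for illustration only.
    """
    groups = []  # each entry is [representative_row, count]
    for row in rows:
        for group in groups:
            representative = group[0]
            # "All elements closer than tol" is a Chebyshev-style check.
            if all(abs(a - b) < tol for a, b in zip(representative, row)):
                group[1] += 1
                break
        else:
            # No existing group is close enough, so start a new one.
            groups.append([list(row), 1])
    total = len(rows)
    return [(count / total, representative) for representative, count in groups]


rows = [[1.0, 2.0], [1.0, 2.00001], [1.5, 2.0]]
collapsed = collapse_rows(rows)
for weight, row in collapsed:
    print(round(weight, 3), row)
```

Here the first two rows differ by about 1e-5 in their second element, so they merge into one row carrying two thirds of the weight.
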
amd.compare.EMD(pdd: numpy.ndarray, pdd_: numpy.ndarray, metric: Optional[str] = 'chebyshev', return_transport: Optional[bool] = False, **kwargs) Union[float, Tuple[float, numpy.ndarray]]

Calculate the Earth mover’s distance (EMD) between two PDDs, aka the Wasserstein metric.

Parameters
  • pdd (numpy.ndarray) – PDD of a crystal.

  • pdd_ (numpy.ndarray) – PDD of a crystal.

  • metric (str or callable, default 'chebyshev') – EMD between PDDs requires defining a distance between PDD rows. By default, Chebyshev (L-infinity) distance is chosen like with AMDs. Accepts any metric accepted by scipy.spatial.distance.cdist().

  • return_transport (bool, default False) – If True, return a tuple (emd, transport_plan) instead of the distance alone, where transport_plan describes the optimal flow.

Returns

emd – Earth mover’s distance between two PDDs. If return_transport is True, return a tuple (emd, transport_plan).

Return type

float, or Tuple[float, numpy.ndarray] if return_transport is True

Raises

ValueError – Thrown if pdd and pdd_ do not have the same number of columns.
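
To make the role of the row metric concrete, here is a small sketch of Earth mover's distance in one special case: both inputs have the same number of rows and every row has equal weight. In that case the optimal transport is a one-to-one matching of rows, so a brute-force search over permutations suffices for tiny inputs. This is an illustration only, not how EMD() is implemented; the name emd_uniform is hypothetical, and the rows here are plain distance lists with no weight column.

```python
from itertools import permutations


def chebyshev(u, v):
    """Chebyshev (L-infinity) distance between two equal-length rows."""
    return max(abs(a - b) for a, b in zip(u, v))


def emd_uniform(rows, rows_):
    """EMD between two equally sized, uniformly weighted sets of rows.

    Hypothetical helper: brute force over all matchings, so only
    suitable for a handful of rows.
    """
    if len(rows[0]) != len(rows_[0]):
        raise ValueError("rows must have the same number of columns")
    if len(rows) != len(rows_):
        raise ValueError("this sketch only handles equal numbers of rows")
    n = len(rows)
    # With uniform weights 1/n, the cost of a matching is the mean row distance.
    return min(
        sum(chebyshev(rows[i], rows_[perm[i]]) for i in range(n)) / n
        for perm in permutations(range(n))
    )


a = [[1.0, 2.0], [1.0, 3.0]]
b = [[1.0, 3.0], [1.0, 2.5]]
print(emd_uniform(a, b))
```

Unequal weights or unequal row counts require a general transport solver, which this sketch deliberately leaves out.
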

amd.compare.AMD_cdist(amds, amds_, metric: str = 'chebyshev', low_memory: bool = False, **kwargs) numpy.ndarray

Compare two sets of AMDs with each other, returning a distance matrix. This function is essentially scipy.spatial.distance.cdist() with the default metric chebyshev and a low memory option.

Parameters
  • amds (ArrayLike) – A list/array of AMDs.

  • amds_ (ArrayLike) – A list/array of AMDs.

  • metric (str or callable, default 'chebyshev') – Usually AMDs are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().

  • low_memory (bool, default False) – Use a slower but more memory efficient method for large collections of AMDs (metric ‘chebyshev’ only).

Returns

dm – A distance matrix of shape (len(amds), len(amds_)). dm[i][j] is the distance (given by metric) between amds[i] and amds_[j].

Return type

numpy.ndarray
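
As an illustration of the returned matrix, the sketch below computes Chebyshev distances between two small sets of vectors in pure Python. It is not the library's code (which, per the description above, behaves like scipy.spatial.distance.cdist()); the name chebyshev_cdist and the sample values are made up for the example.

```python
def chebyshev_cdist(amds, amds_):
    """Distance matrix dm where dm[i][j] is the Chebyshev distance
    between amds[i] and amds_[j]. Hypothetical helper for illustration."""
    return [
        [max(abs(a - b) for a, b in zip(u, v)) for v in amds_]
        for u in amds
    ]


amds = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
amds_ = [[1.0, 2.0, 4.0]]
dm = chebyshev_cdist(amds, amds_)
print(dm)  # 2 rows (one per entry of amds), 1 column (one per entry of amds_)
```
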

amd.compare.AMD_pdist(amds, metric: str = 'chebyshev', low_memory: bool = False, **kwargs) numpy.ndarray

Compare a set of AMDs pairwise, returning a condensed distance matrix. This function is essentially scipy.spatial.distance.pdist() with the default metric chebyshev and a low memory parameter.

Parameters
  • amds (ArrayLike) – A list/array of AMDs.

  • metric (str or callable, default 'chebyshev') – Usually AMDs are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.pdist().

  • low_memory (bool, default False) – Use a slower but more memory efficient method for large collections of AMDs (metric ‘chebyshev’ only).

Returns

cdm – A condensed distance matrix: the square distance matrix collapsed into a vector that keeps only the entries above the diagonal. Use scipy.spatial.distance.squareform() to convert it to a symmetric square distance matrix.

Return type

numpy.ndarray
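
The condensed layout can be hard to picture, so here is a pure-Python sketch of how entries map between the condensed vector and the square matrix. It follows the row-by-row upper-triangle ordering that SciPy documents for pdist() and squareform(); the helper names condensed_index and to_square are hypothetical, and in practice squareform() does this conversion for you.

```python
import math


def condensed_index(i, j, n):
    """Position of the distance between items i and j (i != j)
    in a condensed distance vector over n items."""
    if i == j:
        raise ValueError("the diagonal is not stored in condensed form")
    if i > j:
        i, j = j, i  # the matrix is symmetric, so order does not matter
    return n * i + j - ((i + 2) * (i + 1)) // 2


def to_square(cdm):
    """Expand a condensed distance vector into a symmetric square matrix."""
    m = len(cdm)
    # Solve m = n * (n - 1) / 2 for the number of items n.
    n = (1 + math.isqrt(1 + 8 * m)) // 2
    if n * (n - 1) // 2 != m:
        raise ValueError("length is not a valid condensed matrix size")
    square = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            square[i][j] = square[j][i] = cdm[condensed_index(i, j, n)]
    return square


cdm = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # pairs (0,1) (0,2) (0,3) (1,2) (1,3) (2,3)
square = to_square(cdm)
for row in square:
    print(row)
```
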

amd.compare.PDD_cdist(pdds: List[numpy.ndarray], pdds_: List[numpy.ndarray], metric: str = 'chebyshev', backend: str = 'multiprocessing', n_jobs: Optional[int] = None, verbose: bool = False, **kwargs) numpy.ndarray

Compare two sets of PDDs with each other, returning a distance matrix. Supports parallel processing via joblib. If using parallelisation, make sure to include an if __name__ == '__main__' guard around this function.

Parameters
  • pdds (List[numpy.ndarray]) – A list of PDDs.

  • pdds_ (List[numpy.ndarray]) – A list of PDDs.

  • metric (str or callable, default 'chebyshev') – Usually PDD rows are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().

  • backend (str, default 'multiprocessing') – The parallelization backend implementation. For a list of supported backends, see the backend argument of joblib.Parallel.

  • n_jobs (int, default None) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.

  • verbose (bool, default False) – Prints a progress bar. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise uses tqdm.

Returns

dm – A distance matrix of shape (len(pdds), len(pdds_)). dm[i][j] is the Earth mover’s distance between pdds[i] and pdds_[j].

Return type

numpy.ndarray

amd.compare.PDD_pdist(pdds: List[numpy.ndarray], metric: str = 'chebyshev', backend: str = 'multiprocessing', n_jobs: Optional[int] = None, verbose: bool = False, **kwargs) numpy.ndarray

Compare a set of PDDs pairwise, returning a condensed distance matrix. Supports parallelisation via joblib. If using parallelisation, make sure to include an if __name__ == '__main__' guard around this function.

Parameters
  • pdds (List[numpy.ndarray]) – A list of PDDs.

  • metric (str or callable, default 'chebyshev') – Usually PDD rows are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().

  • backend (str, default 'multiprocessing') – The parallelization backend implementation. For a list of supported backends, see the backend argument of joblib.Parallel.

  • n_jobs (int, default None) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.

  • verbose (bool, default False) – Prints a progress bar. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise uses tqdm.

Returns

cdm – A condensed distance matrix: the square distance matrix collapsed into a vector that keeps only the entries above the diagonal. Use scipy.spatial.distance.squareform() to convert it to a symmetric square distance matrix.

Return type

numpy.ndarray

amd.compare.emd(pdd: numpy.ndarray, pdd_: numpy.ndarray, **kwargs) Union[float, Tuple[float, numpy.ndarray]]

Alias for EMD().