amd.compare module
Functions for comparing AMDs and PDDs of crystals.
- amd.compare.compare(crystals, crystals_=None, by: str = 'AMD', k: int = 100, nearest: Optional[int] = None, reader: str = 'gemmi', remove_hydrogens: bool = False, disorder: str = 'skip', heaviest_component: bool = False, molecular_centres: bool = False, csd_refcodes: bool = False, refcode_families: bool = False, show_warnings: bool = True, collapse_tol: float = 0.0001, metric: str = 'chebyshev', n_jobs: Optional[int] = None, backend: str = 'multiprocessing', verbose: bool = False, low_memory: bool = False, **kwargs) → pandas.core.frame.DataFrame
Given one or two sets of crystals, compare by AMD or PDD and return a pandas DataFrame of the distance matrix.
Given one or two paths to CIFs, periodic sets, CSD refcodes or lists thereof, compare by AMD or PDD and return a pandas DataFrame of the distance matrix. The default is to compare by AMD with k = 100. Accepts most keyword arguments accepted by CifReader, CSDReader and functions from compare.
- Parameters
crystals (list of str or PeriodicSet) – A path, PeriodicSet, tuple or a list of those.
crystals_ (list of str or PeriodicSet, optional) – A path, PeriodicSet, tuple or a list of those.
by (str, default 'AMD') – Use AMD or PDD to compare crystals.
k (int, default 100) – Parameter for AMD/PDD, the number of neighbour atoms to consider for each atom in a unit cell.
nearest (int, default None) – Find a number of nearest neighbours instead of a full distance matrix between crystals.
reader (str, optional) – The backend package used to parse the CIF. The default is gemmi; pymatgen and ase are also accepted, as well as ccdc if csd-python-api is installed. The ccdc reader should be able to read any format accepted by ccdc.io.EntryReader, though only CIFs have been tested.
remove_hydrogens (bool, optional) – Remove hydrogens from the crystals.
disorder (str, optional) – Controls how disordered structures are handled. The default is skip, which skips any crystal with disorder, since disorder conflicts with the periodic set model. To read disordered structures anyway, choose either ordered_sites to remove atoms with disorder or all_sites to include all atoms regardless of disorder.
heaviest_component (bool, optional, csd-python-api only) – Removes all but the heaviest molecule in the asymmetric unit, intended for removing solvents.
molecular_centres (bool, default False, csd-python-api only) – Use the centres of molecules for comparison instead of centres of atoms.
csd_refcodes (bool, optional, csd-python-api only) – Interpret crystals and crystals_ as CSD refcodes or lists thereof, rather than paths.
refcode_families (bool, optional, csd-python-api only) – Read all entries whose refcode starts with the given strings, or 'families' (e.g. giving 'DEBXIT' reads all entries with refcodes starting with DEBXIT).
show_warnings (bool, optional) – Controls whether warnings that arise during reading are printed.
collapse_tol (float, default 1e-4, by='PDD' only) – If two PDD rows have all elements closer than collapse_tol, they are merged and weights are given to rows in proportion to the number of times they appeared.
metric (str or callable, default 'chebyshev') – The metric to compare AMDs/PDDs with. AMDs are compared directly with this metric. EMD is the metric used between PDDs, which requires giving a metric to use between PDD rows. Chebyshev (L-infinity) distance is the default. Accepts any metric accepted by scipy.spatial.distance.cdist().
n_jobs (int, default None, by='PDD' only) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.
backend (str, default 'multiprocessing', by='PDD' only) – The parallelization backend implementation for PDD comparisons. For a list of supported backends, see the backend argument of joblib.Parallel.
verbose (bool, default False) – Prints a progress bar when reading crystals, calculating AMDs/PDDs and comparing PDDs. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise tqdm is used.
low_memory (bool, default False, by='AMD' only) – Use a slower but more memory efficient method for large collections of AMDs (metric 'chebyshev' only).
- Returns
df – DataFrame of the distance matrix for the given crystals compared by the chosen invariant.
- Return type
pandas.DataFrame
- Raises
ValueError – If by is not 'AMD' or 'PDD', if either given set has no valid crystals to compare, or if crystals or crystals_ is of an invalid type.
Examples
Compare everything in a .cif (default, AMD with k=100):
df = amd.compare('data.cif')
Compare everything in one cif with all crystals in all cifs in a directory (PDD, k=50):
df = amd.compare('data.cif', 'dir/to/cifs', by='PDD', k=50)
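The nearest parameter asks for a number of nearest neighbours rather than the full distance matrix. As a rough illustration of what that means, the sketch below picks the smallest entries in each row of an already computed distance matrix with NumPy. The helper name nearest_neighbours and the toy matrix are hypothetical, and this is not necessarily how compare() finds neighbours internally.

```python
import numpy as np

def nearest_neighbours(dm, nearest):
    """Return (indices, distances) of the `nearest` smallest entries per row of dm."""
    dm = np.asarray(dm, dtype=float)
    # Sort each row by distance; a stable sort keeps ties in column order.
    order = np.argsort(dm, axis=1, kind="stable")[:, :nearest]
    dists = np.take_along_axis(dm, order, axis=1)
    return order, dists

# Toy distance matrix: 2 query items (rows) against 3 reference items (columns).
dm = np.array([[0.3, 0.1, 0.7],
               [0.9, 0.2, 0.4]])
idx, dists = nearest_neighbours(dm, nearest=2)
print(idx)    # column indices of the two closest reference items per row
print(dists)  # the corresponding distances
```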
Examples (csd-python-api only)
Compare two crystals by CSD refcode (PDD, k=50):
df = amd.compare('DEBXIT01', 'DEBXIT02', csd_refcodes=True, by='PDD', k=50)
Compare everything in a refcode family (AMD, k=100):
df = amd.compare('DEBXIT', csd_refcodes=True, refcode_families=True)
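The collapse_tol parameter merges PDD rows whose elements all differ by less than the tolerance, adding their weights together. The sketch below shows one way such a merge could work. It assumes a PDD-like array with row weights in column 0 and distances in the remaining columns; the helper collapse_rows is hypothetical and is not the library's implementation.

```python
import numpy as np

def collapse_rows(pdd, collapse_tol=1e-4):
    """Merge rows whose distance parts agree within collapse_tol, summing weights.

    Assumption: column 0 holds the row weights, the other columns hold distances.
    """
    pdd = np.asarray(pdd, dtype=float)
    weights, dists = pdd[:, 0], pdd[:, 1:]
    kept_rows, kept_weights = [], []
    for w, row in zip(weights, dists):
        for i, kept in enumerate(kept_rows):
            # "All elements closer than collapse_tol" is a Chebyshev test.
            if np.max(np.abs(kept - row)) < collapse_tol:
                kept_weights[i] += w
                break
        else:
            # No existing row is close enough, so keep this one as a new row.
            kept_rows.append(row)
            kept_weights.append(w)
    return np.column_stack([kept_weights, np.array(kept_rows)])

# The first two rows are identical, so they merge into one row of weight 0.5.
pdd = np.array([[0.25, 1.0, 1.5],
                [0.25, 1.0, 1.5],
                [0.50, 1.0, 2.0]])
collapsed = collapse_rows(pdd)
print(collapsed)
```

The total weight is unchanged by the merge, so the collapsed array still describes the same weighted set of rows.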
- amd.compare.EMD(pdd: numpy.ndarray, pdd_: numpy.ndarray, metric: Optional[str] = 'chebyshev', return_transport: Optional[bool] = False, **kwargs) → Union[float, Tuple[float, numpy.ndarray]]
Calculate the Earth mover’s distance (EMD) between two PDDs, aka the Wasserstein metric.
- Parameters
pdd (numpy.ndarray) – PDD of a crystal.
pdd_ (numpy.ndarray) – PDD of a crystal.
metric (str or callable, default 'chebyshev') – EMD between PDDs requires defining a distance between PDD rows. By default, Chebyshev (L-infinity) distance is chosen, as with AMDs. Accepts any metric accepted by scipy.spatial.distance.cdist().
return_transport (bool, default False) – If True, instead return a tuple (emd, transport_plan) where transport_plan describes the optimal flow.
- Returns
emd – Earth mover's distance between two PDDs. If return_transport is True, return a tuple (emd, transport_plan).
- Return type
float
- Raises
ValueError – Thrown if pdd and pdd_ do not have the same number of columns.
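To make the definition concrete, here is a minimal sketch of an Earth mover's distance between two small PDD-like arrays. It assumes weights in column 0 and distances in the remaining columns, uses Chebyshev distance between rows, and swaps in SciPy's general-purpose linear-programming solver (scipy.optimize.linprog) to find the optimal flow. The function emd_sketch is hypothetical and is not necessarily how EMD() is implemented.

```python
import numpy as np
from scipy.optimize import linprog

def emd_sketch(pdd, pdd_):
    """Return (emd, transport_plan) for two PDD-like arrays (weights in column 0)."""
    pdd, pdd_ = np.asarray(pdd, dtype=float), np.asarray(pdd_, dtype=float)
    if pdd.shape[1] != pdd_.shape[1]:
        raise ValueError("pdd and pdd_ must have the same number of columns")
    w, rows = pdd[:, 0], pdd[:, 1:]
    w_, rows_ = pdd_[:, 0], pdd_[:, 1:]
    m, n = len(w), len(w_)
    # Chebyshev distance between every row of pdd and every row of pdd_.
    cost = np.abs(rows[:, None, :] - rows_[None, :, :]).max(axis=-1)
    # The flow f[i, j] is flattened row-major. Row i of the flow must sum
    # to w[i] and column j must sum to w_[j].
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([w, w_])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(m, n)

# One row of weight 1 against two rows of weight 0.5 each.
a = np.array([[1.0, 1.0, 2.0]])
b = np.array([[0.5, 1.0, 2.0],
              [0.5, 1.5, 2.5]])
dist, plan = emd_sketch(a, b)
print(dist)
print(plan)
```

Here half of the weight moves at cost 0 and half at cost 0.5, so the distance is 0.25.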
- amd.compare.AMD_cdist(amds, amds_, metric: str = 'chebyshev', low_memory: bool = False, **kwargs) → numpy.ndarray
Compare two sets of AMDs with each other, returning a distance matrix. This function is essentially scipy.spatial.distance.cdist() with the default metric chebyshev and a low memory option.
- Parameters
amds (ArrayLike) – A list/array of AMDs.
amds_ (ArrayLike) – A list/array of AMDs.
metric (str or callable, default 'chebyshev') – Usually AMDs are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().
low_memory (bool, default False) – Use a slower but more memory efficient method for large collections of AMDs (metric 'chebyshev' only).
- Returns
dm – A distance matrix of shape (len(amds), len(amds_)). dm[i, j] is the distance (given by metric) between amds[i] and amds_[j].
- Return type
numpy.ndarray
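Since the description says this function is essentially scipy.spatial.distance.cdist() with a Chebyshev default, the sketch below calls cdist directly on two small arrays of AMD-like vectors and compares the result with a row-at-a-time computation. The row-at-a-time helper is only one plausible reading of a memory-saving approach; its name is hypothetical and it is not claimed to be what low_memory=True does.

```python
import numpy as np
from scipy.spatial.distance import cdist

def chebyshev_cdist_low_memory(amds, amds_):
    """Chebyshev distance matrix computed one row at a time (illustrative only)."""
    amds, amds_ = np.asarray(amds, dtype=float), np.asarray(amds_, dtype=float)
    dm = np.empty((len(amds), len(amds_)))
    for i, row in enumerate(amds):
        # Only one temporary of shape (len(amds_), k) exists at any time.
        dm[i] = np.max(np.abs(amds_ - row), axis=1)
    return dm

amds = np.array([[1.0, 2.0, 3.0],
                 [1.0, 2.5, 3.5]])
amds_ = np.array([[1.0, 2.0, 3.0],
                  [2.0, 2.0, 2.0],
                  [0.5, 2.0, 4.0]])

dm_full = cdist(amds, amds_, metric="chebyshev")   # shape (2, 3)
dm_low = chebyshev_cdist_low_memory(amds, amds_)   # same values, row by row
print(dm_full)
```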
- amd.compare.AMD_pdist(amds, metric: str = 'chebyshev', low_memory: bool = False, **kwargs) → numpy.ndarray
Compare a set of AMDs pairwise, returning a condensed distance matrix. This function is essentially scipy.spatial.distance.pdist() with the default metric chebyshev and a low memory parameter.
- Parameters
amds (ArrayLike) – A list/array of AMDs.
metric (str or callable, default 'chebyshev') – Usually AMDs are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.pdist().
low_memory (bool, default False) – Use a slower but more memory efficient method for large collections of AMDs (metric 'chebyshev' only).
- Returns
cdm – A condensed distance matrix: the square distance matrix collapsed into a vector that keeps only the upper half. See the function squareform from SciPy to convert it to a symmetric square distance matrix.
- Return type
numpy.ndarray
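As a small illustration of the condensed form, the sketch below uses SciPy's pdist and squareform on three toy AMD-like vectors (the values are made up). For three items the condensed vector holds the pairs (0, 1), (0, 2) and (1, 2) in that order.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

amds = np.array([[1.0, 2.0, 3.0],
                 [1.0, 2.5, 3.5],
                 [2.0, 2.0, 2.0]])

# Condensed matrix: one entry per unordered pair, upper triangle row by row.
cdm = pdist(amds, metric="chebyshev")

# Expand to a symmetric square matrix with zeros on the diagonal.
dm = squareform(cdm)
print(cdm)
print(dm)
```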
- amd.compare.PDD_cdist(pdds: List[numpy.ndarray], pdds_: List[numpy.ndarray], metric: str = 'chebyshev', backend: str = 'multiprocessing', n_jobs: Optional[int] = None, verbose: bool = False, **kwargs) → numpy.ndarray
Compare two sets of PDDs with each other, returning a distance matrix. Supports parallel processing via joblib. If using parallelisation, make sure to include an if __name__ == '__main__' guard around the code that calls this function.
- Parameters
pdds (List[numpy.ndarray]) – A list of PDDs.
pdds_ (List[numpy.ndarray]) – A list of PDDs.
metric (str or callable, default 'chebyshev') – Usually PDD rows are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().
backend (str, default 'multiprocessing') – The parallelization backend implementation. For a list of supported backends, see the backend argument of joblib.Parallel.
n_jobs (int, default None) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.
verbose (bool, default False) – Prints a progress bar. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise tqdm is used.
- Returns
dm – A distance matrix of shape (len(pdds), len(pdds_)). The \(ij\)-th entry is the Earth mover's distance between pdds[i] and pdds_[j].
- Return type
numpy.ndarray
- amd.compare.PDD_pdist(pdds: List[numpy.ndarray], metric: str = 'chebyshev', backend: str = 'multiprocessing', n_jobs: Optional[int] = None, verbose: bool = False, **kwargs) → numpy.ndarray
Compare a set of PDDs pairwise, returning a condensed distance matrix. Supports parallelisation via joblib. If using parallelisation, make sure to include an if __name__ == '__main__' guard around the code that calls this function.
- Parameters
pdds (List[numpy.ndarray]) – A list of PDDs.
metric (str or callable, default 'chebyshev') – Usually PDD rows are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().
backend (str, default 'multiprocessing') – The parallelization backend implementation. For a list of supported backends, see the backend argument of joblib.Parallel.
n_jobs (int, default None) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.
verbose (bool, default False) – Prints a progress bar. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise tqdm is used.
- Returns
cdm – A condensed distance matrix: the square distance matrix collapsed into a vector that keeps only the upper half. See the function squareform from SciPy to convert it to a symmetric square distance matrix.
- Return type
numpy.ndarray
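When working with a condensed distance matrix directly, it can help to translate between a position in the vector and the pair of items it refers to. The sketch below does this in plain Python, assuming the ordering documented for scipy.spatial.distance.pdist (upper triangle, row by row). Both helper names are hypothetical.

```python
def condensed_index(i, j, n):
    """Position in a condensed matrix of the distance between items i and j (of n)."""
    if i == j:
        raise ValueError("the diagonal is not stored in a condensed matrix")
    if i > j:
        i, j = j, i  # the matrix is symmetric, so order does not matter
    return n * i + j - ((i + 2) * (i + 1)) // 2

def pair_from_index(idx, n):
    """Inverse of condensed_index: recover (i, j) with i < j from a position."""
    i = 0
    row_len = n - 1  # row 0 of the upper triangle holds n - 1 entries
    while idx >= row_len:
        idx -= row_len
        i += 1
        row_len -= 1
    return i, i + 1 + idx

# With 4 items the condensed vector has 6 entries:
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
print(condensed_index(1, 2, 4))
print(pair_from_index(5, 4))
```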
- amd.compare.emd(pdd: numpy.ndarray, pdd_: numpy.ndarray, **kwargs) → Union[float, Tuple[float, numpy.ndarray]]
Alias for EMD().