amd.compare module
Functions for comparing AMDs and PDDs of crystals.
- amd.compare.compare(crystals: Union[amd.periodicset.PeriodicSet, str, List[Union[amd.periodicset.PeriodicSet, str]]], crystals_: Optional[Union[amd.periodicset.PeriodicSet, str, List[Union[amd.periodicset.PeriodicSet, str]]]] = None, by: str = 'AMD', k: int = 100, n_neighbors: Optional[int] = None, csd_refcodes: bool = False, verbose: bool = True, **kwargs) → pandas.core.frame.DataFrame
Given one or two sets of crystals, compare by AMD or PDD and return a pandas DataFrame of the distance matrix.
Given one or two paths to CIFs, periodic sets, CSD refcodes or lists thereof, compare by AMD or PDD and return a pandas DataFrame of the distance matrix. The default is to compare by AMD with k = 100. Accepts any keyword arguments accepted by CifReader, CSDReader and the functions from compare.
- Parameters
crystals (list of str or PeriodicSet) – A path, PeriodicSet, tuple or a list of those.
crystals_ (list of str or PeriodicSet, optional) – A path, PeriodicSet, tuple or a list of those.
by (str, default 'AMD') – Use AMD or PDD to compare crystals.
k (int, default 100) – Parameter for AMD/PDD, the number of neighbor atoms to consider for each atom in a unit cell.
n_neighbors (int, default None) – Find a number of nearest neighbors instead of a full distance matrix between crystals.
csd_refcodes (bool, optional, csd-python-api only) – Interpret crystals and crystals_ as CSD refcodes or lists thereof, rather than paths.
verbose (bool, optional) – If True, prints a progress bar during reading, calculating and comparing items.
**kwargs – Any keyword arguments accepted by amd.CifReader, amd.CSDReader, amd.PDD and the functions used to compare: reader, remove_hydrogens, disorder, heaviest_component, molecular_centres, show_warnings (from CifReader), refcode_families (from CSDReader), collapse_tol (from PDD), metric, low_memory (from AMD_pdist), metric, backend, n_jobs, verbose (from PDD_pdist), algorithm, leaf_size, metric, p, metric_params, n_jobs (from _nearest_items).
- Returns
df – DataFrame of the distance matrix for the given crystals compared by the chosen invariant.
- Return type
pandas.DataFrame
- Raises
ValueError – If by is not ‘AMD’ or ‘PDD’, if either given set has no valid crystals to compare, or if crystals or crystals_ is of an invalid type.
Examples
Compare everything in a .cif (default, AMD with k=100):
df = amd.compare('data.cif')
Compare everything in one cif with all crystals in all cifs in a directory (PDD, k=50):
df = amd.compare('data.cif', 'dir/to/cifs', by='PDD', k=50)
Examples (csd-python-api only)
Compare two crystals by CSD refcode (PDD, k=50):
df = amd.compare('DEBXIT01', 'DEBXIT02', csd_refcodes=True, by='PDD', k=50)
Compare everything in a refcode family (AMD, k=100):
df = amd.compare('DEBXIT', csd_refcodes=True, families=True)
- amd.compare.EMD(pdd: numpy.ndarray[Any, numpy.dtype[numpy.floating]], pdd_: numpy.ndarray[Any, numpy.dtype[numpy.floating]], metric: Optional[str] = 'chebyshev', return_transport: Optional[bool] = False, **kwargs) → Union[float, Tuple[float, numpy.ndarray[Any, numpy.dtype[numpy.floating]]]]
Calculate the Earth mover’s distance (EMD) between two PDDs, aka the Wasserstein metric.
- Parameters
pdd (numpy.ndarray) – PDD of a crystal.
pdd_ (numpy.ndarray) – PDD of a crystal.
metric (str or callable, default 'chebyshev') – EMD between PDDs requires defining a distance between PDD rows. By default, the Chebyshev (L-infinity) distance is used, as with AMDs. Accepts any metric accepted by scipy.spatial.distance.cdist().
return_transport (bool, default False) – If True, instead return a tuple (emd, transport_plan), where transport_plan describes the optimal flow.
- Returns
emd – Earth mover’s distance between two PDDs. If return_transport is True, return a tuple (emd, transport_plan).
- Return type
float
- Raises
ValueError – Thrown if pdd and pdd_ do not have the same number of columns.
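As a rough illustration of what the Earth mover's distance computes, the sketch below solves the underlying transport problem with scipy.optimize.linprog. It assumes that a PDD is a matrix whose first column holds row weights summing to 1 and whose remaining columns hold distances. The helper name emd_sketch and the toy matrices are hypothetical, and this is not the library's own implementation.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist


def emd_sketch(pdd, pdd_, metric="chebyshev"):
    """Illustrative EMD between two PDD-like matrices (not amd's implementation).

    Assumes column 0 holds row weights that sum to 1 and the remaining
    columns hold the distances for each row.
    """
    pdd = np.asarray(pdd, dtype=float)
    pdd_ = np.asarray(pdd_, dtype=float)
    if pdd.shape[1] != pdd_.shape[1]:
        raise ValueError("PDDs must have the same number of columns")

    w, w_ = pdd[:, 0], pdd_[:, 0]
    # Cost of moving weight from row i of pdd to row j of pdd_.
    cost = cdist(pdd[:, 1:], pdd_[:, 1:], metric=metric)
    m, n = cost.shape

    # The flow f[i, j] >= 0 is flattened row-major into a vector of length m * n.
    # Row sums of the flow must equal w; column sums must equal w_.
    a_eq = np.zeros((m + n, m * n))
    for i in range(m):
        a_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        a_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([w, w_])

    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return res.fun, res.x.reshape(m, n)


# Toy PDD-like matrices with k = 2 (made-up values).
pdd_a = np.array([[0.5, 1.0, 2.0],
                  [0.5, 1.5, 2.5]])
pdd_b = np.array([[1.0, 1.0, 2.0]])
emd_value, transport_plan = emd_sketch(pdd_a, pdd_b)
```

In this toy case half of the weight moves at cost 0 and the other half at Chebyshev cost 0.5, so the resulting distance is 0.25.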
- amd.compare.AMD_cdist(amds, amds_, metric: str = 'chebyshev', low_memory: bool = False, **kwargs) → numpy.ndarray[Any, numpy.dtype[numpy.floating]]
Compare two sets of AMDs with each other, returning a distance matrix. This function is essentially scipy.spatial.distance.cdist() with the default metric chebyshev and a low memory option.
- Parameters
amds (ArrayLike) – A list/array of AMDs.
amds_ (ArrayLike) – A list/array of AMDs.
metric (str or callable, default 'chebyshev') – Usually AMDs are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().
low_memory (bool, default False) – Use a slower but more memory efficient method for large collections of AMDs (metric ‘chebyshev’ only).
**kwargs – Extra arguments for metric, passed to scipy.spatial.distance.cdist().
- Returns
dm – A distance matrix of shape (len(amds), len(amds_)). dm[i, j] is the distance (given by metric) between amds[i] and amds_[j].
- Return type
numpy.ndarray
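Since AMD_cdist is described as essentially scipy.spatial.distance.cdist() with the Chebyshev metric, the plain SciPy call gives a feel for the output. This is a minimal sketch with made-up vectors standing in for AMDs with k = 3; it does not call amd itself.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy data: each row stands in for one AMD vector of length k = 3.
amds = np.array([[1.0, 1.5, 2.0],
                 [1.0, 1.4, 2.2]])
amds_ = np.array([[1.1, 1.5, 2.0],
                  [0.9, 1.6, 2.5],
                  [1.0, 1.5, 2.0]])

# Chebyshev (L-infinity) distance: largest absolute difference between entries.
dm = cdist(amds, amds_, metric="chebyshev")

# dm has shape (len(amds), len(amds_)); dm[i, j] compares amds[i] with amds_[j].
by_hand = np.max(np.abs(amds[0] - amds_[1]))
```

Here by_hand reproduces dm[0, 1] directly from the definition of the metric.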
- amd.compare.AMD_pdist(amds, metric: str = 'chebyshev', low_memory: bool = False, **kwargs) → numpy.ndarray[Any, numpy.dtype[numpy.floating]]
Compare a set of AMDs pairwise, returning a condensed distance matrix. This function is essentially scipy.spatial.distance.pdist() with the default metric chebyshev and a low memory parameter.
- Parameters
amds (ArrayLike) – A list/array of AMDs.
metric (str or callable, default 'chebyshev') – Usually AMDs are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.pdist().
low_memory (bool, default False) – Use a slower but more memory efficient method for large collections of AMDs (metric ‘chebyshev’ only).
**kwargs – Extra arguments for metric, passed to scipy.spatial.distance.pdist().
- Returns
cdm – A condensed distance matrix: the upper triangle of the square distance matrix, flattened into a vector. See the function squareform from SciPy to convert to a symmetric square distance matrix.
- Return type
numpy.ndarray
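The relationship between the condensed and square forms can be sketched with SciPy alone. The vectors below are made-up stand-ins for AMDs, and pdist is used in place of AMD_pdist on the assumption, taken from the description above, that the two produce the same layout.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data: three vectors standing in for AMDs with k = 3.
amds = np.array([[1.0, 1.5, 2.0],
                 [1.0, 1.4, 2.2],
                 [1.3, 1.5, 2.1]])

# Condensed form: one entry per unordered pair, in the order (0,1), (0,2), (1,2).
cdm = pdist(amds, metric="chebyshev")

# Expand to a symmetric square matrix with zeros on the diagonal.
dm = squareform(cdm)
```

For n items the condensed vector has n * (n - 1) / 2 entries, which is why it is preferred for large collections.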
- amd.compare.PDD_cdist(pdds: List[numpy.ndarray[Any, numpy.dtype[numpy.floating]]], pdds_: List[numpy.ndarray[Any, numpy.dtype[numpy.floating]]], metric: str = 'chebyshev', backend: str = 'multiprocessing', n_jobs: Optional[int] = None, verbose: bool = False, **kwargs) → numpy.ndarray[Any, numpy.dtype[numpy.floating]]
Compare two sets of PDDs with each other, returning a distance matrix. Supports parallel processing via joblib. If using parallelisation, make sure to call this function inside an if __name__ == '__main__' guard.
- Parameters
pdds (List[numpy.ndarray]) – A list of PDDs.
pdds_ (List[numpy.ndarray]) – A list of PDDs.
metric (str or callable, default 'chebyshev') – Usually PDD rows are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().
backend (str, default 'multiprocessing') – The parallelization backend implementation. For a list of supported backends, see the backend argument of joblib.Parallel.
n_jobs (int, default None) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.
verbose (bool, default False) – Prints a progress bar. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise tqdm is used.
**kwargs – Extra arguments for metric, passed to scipy.spatial.distance.cdist().
- Returns
dm – A distance matrix of shape (len(pdds), len(pdds_)). The (i, j)th entry is the Earth mover’s distance between pdds[i] and pdds_[j].
- Return type
numpy.ndarray
- amd.compare.PDD_pdist(pdds: List[numpy.ndarray[Any, numpy.dtype[numpy.floating]]], metric: str = 'chebyshev', backend: str = 'multiprocessing', n_jobs: Optional[int] = None, verbose: bool = False, **kwargs) → numpy.ndarray[Any, numpy.dtype[numpy.floating]]
Compare a set of PDDs pairwise, returning a condensed distance matrix. Supports parallelisation via joblib. If using parallelisation, make sure to call this function inside an if __name__ == '__main__' guard.
- Parameters
pdds (List[numpy.ndarray]) – A list of PDDs.
metric (str or callable, default 'chebyshev') – Usually PDD rows are compared with the Chebyshev (L-infinity) distance. Accepts any metric accepted by scipy.spatial.distance.cdist().
backend (str, default 'multiprocessing') – The parallelization backend implementation. For a list of supported backends, see the backend argument of joblib.Parallel.
n_jobs (int, default None) – Maximum number of concurrent jobs for parallel processing with joblib. Set to -1 to use the maximum. Using parallel processing may be slower for small inputs.
verbose (bool, default False) – Prints a progress bar. If using parallel processing (n_jobs > 1), the verbose argument of joblib.Parallel is used, otherwise tqdm is used.
**kwargs – Extra arguments for metric, passed to scipy.spatial.distance.cdist().
- Returns
cdm – A condensed distance matrix: the upper triangle of the square distance matrix, flattened into a vector. See the function squareform from SciPy to convert to a symmetric square distance matrix.
- Return type
numpy.ndarray
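To look up a single pair in a condensed distance matrix without expanding it, the pair (i, j) can be mapped to a position in the vector. The helper below is a hypothetical illustration, not part of amd, and it assumes the row-major upper-triangle ordering used by scipy.spatial.distance.pdist().

```python
from itertools import combinations


def condensed_index(i, j, n):
    """Position of the pair (i, j), i != j, in a condensed matrix for n items."""
    if i == j:
        raise ValueError("the diagonal is not stored in a condensed matrix")
    if i > j:
        i, j = j, i
    # Number of entries stored before row i, plus the offset of column j in row i.
    return n * i - i * (i + 1) // 2 + (j - i - 1)


n = 5
pairs = list(combinations(range(n), 2))  # the order in which pairs are stored
positions = [condensed_index(i, j, n) for i, j in pairs]
```

With this mapping, cdm[condensed_index(i, j, len(pdds))] would give the distance between items i and j under the stated ordering assumption.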
- amd.compare.emd(pdd: numpy.ndarray[Any, numpy.dtype[numpy.floating]], pdd_: numpy.ndarray[Any, numpy.dtype[numpy.floating]], **kwargs) → Union[float, Tuple[float, numpy.ndarray[Any, numpy.dtype[numpy.floating]]]]
Alias for EMD().