Getting Started
Use pip to install average-minimum-distance:
pip install average-minimum-distance
Then import average-minimum-distance with import amd.
Compare crystals in .CIF files with amd.compare()
amd.compare() extracts crystals from CIF files and compares them by AMD or PDD. Passing one path to a CIF compares all crystals in that CIF with each other; passing two compares all crystals in one against all crystals in the other. For example, to compare crystals by AMD with k = 100:
>>> import amd
>>> distance_matrix = amd.compare('crystals.cif', by='AMD', k=100)
>>> print(distance_matrix)
           crystal_1  crystal_2  crystal_3  crystal_4
crystal_1   0.000000   1.319907   0.975221   2.023880
crystal_2   1.319907   0.000000   0.520115   0.703973
crystal_3   0.975221   0.520115   0.000000   1.072211
crystal_4   2.023880   0.703973   1.072211   0.000000
The distance matrix returned is a pandas.DataFrame.
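As a small illustration of reading such a matrix, the sketch below finds the closest pair of distinct crystals. It uses only the standard library and a nested dict holding the values printed above, so it is a stand-in for the returned pandas.DataFrame rather than the DataFrame itself; the helper name closest_pair is hypothetical.

```python
# Values copied from the example output above, stored as a plain nested dict.
# This dict is a stand-in for the pandas.DataFrame that amd.compare() returns.
names = ["crystal_1", "crystal_2", "crystal_3", "crystal_4"]
rows = [
    [0.000000, 1.319907, 0.975221, 2.023880],
    [1.319907, 0.000000, 0.520115, 0.703973],
    [0.975221, 0.520115, 0.000000, 1.072211],
    [2.023880, 0.703973, 1.072211, 0.000000],
]
matrix = {a: dict(zip(names, row)) for a, row in zip(names, rows)}


def closest_pair(matrix):
    """Return (distance, name_a, name_b) for the closest pair of distinct crystals."""
    labels = list(matrix)
    best = None
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:  # upper triangle only; the matrix is symmetric
            candidate = (matrix[a][b], a, b)
            if best is None or candidate < best:
                best = candidate
    return best


pair = closest_pair(matrix)
print(pair)  # (0.520115, 'crystal_2', 'crystal_3')
```

With the actual DataFrame, a single entry can be looked up by label with distance_matrix.loc['crystal_2', 'crystal_3'].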
If csd-python-api is installed, amd.compare() can also accept lists of CSD refcodes (pass refcode_families=True) or other formats (pass reader='ccdc').
Read CIFs, calculate descriptors and compare separately
amd.compare() reads from CIFs, calculates the descriptors and compares them, but these steps can be done separately if needed. The code above using amd.compare() is equivalent to the following:
import amd
import pandas as pd
from scipy.spatial.distance import squareform

crystals = list(amd.CifReader('crystals.cif'))          # read crystals
amds = [amd.AMD(crystal, 100) for crystal in crystals]  # calculate AMDs, k=100
cdm = amd.AMD_pdist(amds)                               # compare AMDs pairwise
dm = squareform(cdm)                                    # condensed -> square distance matrix
names = [crystal.name for crystal in crystals]          # row and column labels
distance_matrix = pd.DataFrame(dm, index=names, columns=names)
amd.CifReader reads a CIF and yields the crystals represented as amd.PeriodicSet objects. An amd.PeriodicSet can be passed to amd.AMD() to calculate its AMD (amd.PDD() works similarly for PDDs). Then amd.AMD_pdist() is used to compare the AMDs pairwise, returning a condensed distance matrix which is converted to a symmetric square distance matrix with scipy.spatial.distance.squareform().
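To make the condensed and square forms concrete, here is a minimal pure-Python sketch of pairwise comparison followed by expansion to a square matrix. It is not the library's implementation: the helper names are hypothetical, the toy vectors merely stand in for AMDs, and the pair ordering (0,1), (0,2), ..., (1,2), ... follows the convention SciPy uses for condensed matrices.

```python
def chebyshev(u, v):
    """Chebyshev (L-infinity) distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def pdist_condensed(vectors, metric=chebyshev):
    """Distances for all pairs i < j, flattened as (0,1), (0,2), ..., (1,2), ..."""
    n = len(vectors)
    return [metric(vectors[i], vectors[j]) for i in range(n) for j in range(i + 1, n)]


def to_square(condensed, n):
    """Expand a condensed vector of length n*(n-1)/2 into a symmetric n x n matrix."""
    if len(condensed) != n * (n - 1) // 2:
        raise ValueError("condensed length does not match n")
    square = [[0.0] * n for _ in range(n)]
    entries = iter(condensed)
    for i in range(n):
        for j in range(i + 1, n):
            square[i][j] = square[j][i] = next(entries)
    return square


# Toy stand-ins for AMD vectors (k = 3); real AMDs come from amd.AMD().
toy_amds = [[1.0, 1.5, 2.0], [1.0, 2.0, 2.5], [0.5, 1.5, 3.0]]
condensed = pdist_condensed(toy_amds)         # [0.5, 1.0, 0.5]
square = to_square(condensed, len(toy_amds))  # 3 x 3, zeros on the diagonal
```

The condensed form stores each pair once, so n items need n(n-1)/2 entries instead of n squared.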
The equivalent function for comparing PDDs is amd.PDD_pdist(). There are also amd.AMD_cdist() and amd.PDD_cdist(), which take two collections of AMDs/PDDs and compare everything in one collection with everything in the other, returning a 2D distance matrix.
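The difference from the pairwise case can be sketched as follows: comparing two collections gives a rectangular matrix with one row per item in the first collection and one column per item in the second. This is a pure-Python illustration with hypothetical helper names and toy vectors, not the code behind amd.AMD_cdist().

```python
def chebyshev(u, v):
    """Chebyshev (L-infinity) distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def cdist_rect(set_a, set_b, metric=chebyshev):
    """Row i, column j holds the distance from set_a[i] to set_b[j]."""
    return [[metric(u, v) for v in set_b] for u in set_a]


# Toy stand-ins for two collections of AMD vectors (k = 3).
set_a = [[1.0, 1.5, 2.0], [1.0, 2.0, 2.5]]
set_b = [[0.5, 1.5, 3.0], [1.0, 1.5, 2.0], [2.0, 2.0, 2.0]]

dm = cdist_rect(set_a, set_b)
# dm has 2 rows and 3 columns:
# [[1.0, 0.0, 1.0],
#  [0.5, 0.5, 1.0]]
```

Unlike the pairwise case, the result is generally neither square nor symmetric, and a zero entry (such as dm[0][1] here) means the two items being compared are identical under the chosen metric.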
Optional parameters of amd.compare()
amd.compare() accepts nearly all of the optional parameters available for reading CIFs, computing descriptors or comparing, listed below.
Comparison options
The parameters n_jobs, backend and verbose below only apply to PDD comparisons, and low_memory only applies to AMD comparisons.
by (default AMD) chooses whether to compare by AMD or PDD invariants.
k (default 100) is the parameter passed to AMD or PDD, the number of nearest neighbor atoms to consider.
n_neighbors (default None), if given, finds only that many nearest neighbors for each item rather than a full distance matrix.
metric (default chebyshev) chooses the metric used to compare AMDs or PDD rows (the metric used for PDDs is always Earth Mover’s distance, which requires a chosen metric to compare PDD rows). See SciPy’s cdist/pdist for a list of accepted metrics.
n_jobs (requires by='PDD', default None) is the number of cores to use for multiprocessing comparisons (passed to joblib.Parallel). Pass -1 to use the maximum.
backend (requires by='PDD', default multiprocessing) is the parallelization backend implementation for PDD comparisons.
verbose (requires by='PDD', default False) controls the verbosity level. With parallel processing the verbose argument of joblib.Parallel is used, otherwise tqdm is used.
low_memory (requires by='AMD' and metric='chebyshev', default False) uses a slower algorithm with a smaller memory footprint, better for large input sizes.
**kwargs are additional keyword arguments passed to SciPy’s cdist/pdist or scikit-learn’s NearestNeighbors.
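The idea behind n_neighbors can be sketched as a brute-force search that keeps only the closest few items for each one instead of a whole row of distances. This is an illustration under assumptions: the helper name is hypothetical, each item is excluded from its own neighbor list by choice, and the layout of what amd.compare() actually returns is not reproduced here.

```python
import heapq


def chebyshev(u, v):
    """Chebyshev (L-infinity) distance: the largest coordinate-wise difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def nearest_neighbors(vectors, n_neighbors, metric=chebyshev):
    """For each item, return its n_neighbors closest other items as (index, distance)."""
    result = []
    for i, u in enumerate(vectors):
        # Distances to every other item (assumption: an item is not its own neighbor).
        others = ((j, metric(u, v)) for j, v in enumerate(vectors) if j != i)
        result.append(heapq.nsmallest(n_neighbors, others, key=lambda pair: pair[1]))
    return result


# Toy stand-ins for AMD vectors (k = 3).
toy_amds = [
    [1.0, 1.5, 2.0],
    [1.0, 2.0, 2.5],
    [0.5, 1.5, 3.5],
    [1.0, 2.0, 6.0],
]
neighbors = nearest_neighbors(toy_amds, n_neighbors=2)
# neighbors[0] is [(1, 0.5), (2, 1.5)]: item 1 is closest to item 0, then item 2.
```

For n items this keeps n * n_neighbors entries rather than n squared, which is the practical reason to ask for neighbors only on large inputs.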
Reading options
Parameters of amd.CifReader or amd.CSDReader.
reader (default gemmi) controls the backend package used to parse the file. Accepts gemmi, pycodcif, pymatgen, ase and ccdc (if installed). The ccdc reader can read any format accepted by ccdc.io.EntryReader.
remove_hydrogens (default False) removes hydrogen atoms from the structure.
disorder (default skip) controls how disordered structures are handled. The default skips (ignores) any crystal with disorder, since disorder conflicts with the model of a periodic set. Alternatively, ordered_sites removes sites with disorder and all_sites includes all sites regardless of disorder.
show_warnings (default True) chooses whether to print warnings during reading, e.g. from disordered structures or crystals with missing data.
verbose (default False) displays a progress bar if True.
heaviest_component (csd-python-api only, default False) removes all but the heaviest connected molecule in the asymmetric unit, intended for removing solvents.
molecular_centres (csd-python-api only, default False) makes periodic sets using molecular centers of mass rather than atomic centers.
csd_refcodes (csd-python-api only, default False) interprets all inputs as CSD refcodes.
families (csd-python-api only, default False) interprets the list of strings given as CSD refcode families and reads all crystals in those families.
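The three disorder settings described above can be pictured with a toy model. The Site class and apply_disorder_policy function below are hypothetical and exist only for this illustration; they do not reflect how the readers represent or detect disorder internally.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical stand-in for one atomic site in a crystal."""
    label: str
    disordered: bool


def apply_disorder_policy(sites, disorder="skip"):
    """Return the sites to keep, or None if the whole crystal is skipped."""
    has_disorder = any(site.disordered for site in sites)
    if disorder == "skip":
        # Ignore the crystal entirely if any site is disordered.
        return None if has_disorder else list(sites)
    if disorder == "ordered_sites":
        # Keep the crystal but drop the disordered sites.
        return [site for site in sites if not site.disordered]
    if disorder == "all_sites":
        # Keep every site regardless of disorder.
        return list(sites)
    raise ValueError(f"unknown disorder option: {disorder!r}")


sites = [Site("C1", False), Site("O1", True), Site("N1", False)]

skipped = apply_disorder_policy(sites, "skip")                # None
ordered_only = apply_disorder_policy(sites, "ordered_sites")  # C1 and N1
everything = apply_disorder_policy(sites, "all_sites")        # C1, O1 and N1
```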
PDD options
Parameters of amd.PDD(). amd.AMD() does not accept any optional parameters.
collapse (default True) chooses whether to collapse rows of PDDs which are similar enough (elementwise).
collapse_tol (default 0.0001) is the tolerance for considering PDD rows as the same and collapsing them into one. The merged row is the average of those collapsed.
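A rough sketch of what collapsing could look like is given below. It rests on several assumptions that the text above does not specify: each row carries a weight, merged rows have their weights summed, rows are grouped greedily against the first row of each group, and the comparison uses "less than or equal to" the tolerance. The function collapse_rows is hypothetical and is not the routine used by amd.PDD().

```python
def collapse_rows(weights, rows, tol=1e-4):
    """Merge rows that agree elementwise within tol; sum weights, average rows."""
    groups = []  # each group is [total_weight, list_of_member_rows]
    for weight, row in zip(weights, rows):
        for group in groups:
            representative = group[1][0]  # assumption: compare to the group's first row
            if all(abs(a - b) <= tol for a, b in zip(row, representative)):
                group[0] += weight
                group[1].append(row)
                break
        else:
            groups.append([weight, [row]])  # no close group found, start a new one

    new_weights = [group[0] for group in groups]
    # The merged row is the plain average of the rows collapsed into it.
    new_rows = [
        [sum(column) / len(group[1]) for column in zip(*group[1])]
        for group in groups
    ]
    return new_weights, new_rows


# Four toy rows with equal weights; rows 0, 1 and 3 differ by at most 0.00005.
weights = [0.25, 0.25, 0.25, 0.25]
rows = [[1.0, 2.0], [1.00005, 2.0], [1.5, 2.5], [1.0, 2.0]]

new_weights, new_rows = collapse_rows(weights, rows, tol=1e-4)
# Two rows remain, with weights 0.75 and 0.25.
```

Setting the tolerance to zero in this sketch merges only rows that are exactly equal, and the weights still sum to the same total after collapsing.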