Getting Started¶
Use pip to install average-minimum-distance:
pip install average-minimum-distance
Then import average-minimum-distance with import amd.
Compare crystals in .CIF files with amd.compare()¶
amd.compare() extracts crystals from CIF files and compares them by AMD or PDD. Passing one path to a CIF compares all crystals in that CIF with each other; passing two paths compares all crystals in one against all crystals in the other. For example, to compare crystals in 'crystals.cif' by AMD with k = 100:
>>> import amd
>>> distance_matrix = amd.compare('crystals.cif', by='AMD', k=100)
>>> print(distance_matrix)
           crystal_1  crystal_2  crystal_3  crystal_4
crystal_1   0.000000   1.319907   0.975221   2.023880
crystal_2   1.319907   0.000000   0.520115   0.703973
crystal_3   0.975221   0.520115   0.000000   1.072211
crystal_4   2.023880   0.703973   1.072211   0.000000
The distance matrix returned is a pandas.DataFrame.
If csd-python-api is installed, amd.compare() can also accept lists of CSD refcodes (pass refcode_families=True) or other formats (pass reader='ccdc').
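A common next step is to pick out the most similar pair of crystals from the distance matrix. The sketch below is only an illustration: it copies the values printed above into plain Python lists so that it runs without pandas, and closest_pair is a hypothetical helper, not part of the amd package. With a real pandas.DataFrame the same search would be done on the DataFrame directly.

```python
# Values copied from the printed distance matrix above.
names = ["crystal_1", "crystal_2", "crystal_3", "crystal_4"]
matrix = [
    [0.000000, 1.319907, 0.975221, 2.023880],
    [1.319907, 0.000000, 0.520115, 0.703973],
    [0.975221, 0.520115, 0.000000, 1.072211],
    [2.023880, 0.703973, 1.072211, 0.000000],
]


def closest_pair(names, matrix):
    """Return (name_a, name_b, distance) for the smallest off-diagonal entry.

    Hypothetical helper for illustration; not a function of the amd package.
    """
    best = None
    n = len(names)
    for i in range(n):
        # The matrix is symmetric with a zero diagonal, so only the
        # upper triangle (j > i) needs to be scanned.
        for j in range(i + 1, n):
            if best is None or matrix[i][j] < best[2]:
                best = (names[i], names[j], matrix[i][j])
    return best


print(closest_pair(names, matrix))
# ('crystal_2', 'crystal_3', 0.520115)
```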
Read CIFs, calculate descriptors and compare separately¶
amd.compare() reads crystals from CIFs, calculates descriptors and compares them. These steps can be done separately, e.g. to analyse the descriptors themselves. The line of code above using amd.compare() is equivalent to the following:
import amd
import pandas as pd
from scipy.spatial.distance import squareform

crystals = list(amd.CifReader('crystals.cif'))          # read crystals
amds = [amd.AMD(crystal, 100) for crystal in crystals]  # calculate AMDs with k = 100
dm = squareform(amd.AMD_pdist(amds))                    # compare AMDs pairwise
names = [crystal.name for crystal in crystals]          # row and column labels
distance_matrix = pd.DataFrame(dm, index=names, columns=names)
amd.CifReader reads a CIF and yields the crystals represented as amd.PeriodicSet objects. An amd.PeriodicSet can be passed to amd.AMD() to calculate its AMD (or to amd.PDD() for its PDD).
Here, amd.AMD_pdist() compares the AMDs pairwise and returns a condensed distance matrix, which scipy.spatial.distance.squareform() converts to a symmetric 2D distance matrix. The equivalent function for comparing PDDs is amd.PDD_pdist(). There are also two cdist functions, amd.AMD_cdist() and amd.PDD_cdist(), which take two collections of descriptors and compare everything in one collection with everything in the other, returning a 2D distance matrix.
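To make the difference between the pdist and cdist outputs concrete, here is a minimal pure-Python sketch. It does not use the amd package or SciPy: condensed_pdist, to_square and cross_cdist are hypothetical stand-ins written for illustration, the vectors are made-up numbers playing the role of AMDs, and Chebyshev distance is used because it is the documented default metric.

```python
def chebyshev(u, v):
    """L-infinity distance: the largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))


def condensed_pdist(vectors, metric=chebyshev):
    """Pairwise distances within one collection, in condensed form.

    Entries are ordered (0,1), (0,2), ..., (n-2,n-1): the upper triangle
    of the square matrix read row by row.
    """
    n = len(vectors)
    return [metric(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]


def to_square(condensed, n):
    """Expand a condensed list into a symmetric n x n matrix with zero diagonal."""
    square = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            square[i][j] = square[j][i] = condensed[k]
            k += 1
    return square


def cross_cdist(vectors_a, vectors_b, metric=chebyshev):
    """Distances from every vector in vectors_a to every vector in vectors_b."""
    return [[metric(u, v) for v in vectors_b] for u in vectors_a]


# Toy stand-ins for AMD vectors (made-up numbers, k = 3).
set_a = [[1.0, 2.0, 3.0], [1.5, 2.0, 2.5], [4.0, 4.0, 4.0]]
set_b = [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]]

condensed = condensed_pdist(set_a)          # 3 vectors -> 3 pairs
square = to_square(condensed, len(set_a))   # 3 x 3, symmetric
cross = cross_cdist(set_a, set_b)           # 3 x 2, generally not symmetric

print(condensed)  # [0.5, 3.0, 2.5]
print(cross)      # [[0.0, 3.0], [0.5, 2.5], [3.0, 4.0]]
```

The condensed form stores n(n-1)/2 values instead of n squared, which is why the pdist functions return it and leave the expansion to squareform.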
Optional parameters of amd.compare()¶
amd.compare() reads crystals, computes their invariants and compares them in one function for convenience. It accepts most of the optional parameters from these steps, all of which are listed below.
Reading options¶
Parameters of amd.CifReader or amd.CSDReader.
reader (default gemmi) controls the backend package used to parse the file. Accepts gemmi, pycodcif, pymatgen, ase and ccdc (if installed). The ccdc reader can read formats accepted by ccdc.io.EntryReader.
remove_hydrogens (default False) removes hydrogen atoms from the structure.
disorder (default skip) controls how disordered structures are handled. The default skips any crystal with disorder, since disorder conflicts with the model of a periodic set. Alternatively, ordered_sites removes atoms with disorder and all_sites includes all atoms regardless of disorder.
show_warnings (default True) chooses whether to print warnings during reading, e.g. from disordered structures or crystals with missing data.
verbose (default False) prints a progress bar showing the number of items read so far.
heaviest_component (csd-python-api only, default False) removes all but the heaviest connected molecule in the asymmetric unit, intended for removing solvents.
molecular_centres (csd-python-api only, default False) uses molecular centres of mass instead of atoms as the motif of the periodic set.
csd_refcodes (csd-python-api only, default False) interprets the string(s) given as CSD refcodes.
families (csd-python-api only, default False) interprets the list of strings given as CSD refcode families and reads all crystals in those families.
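The three disorder settings and remove_hydrogens can be pictured as filters over a list of atomic sites. The sketch below is purely conceptual: Site, apply_disorder_option and drop_hydrogens are hypothetical names invented here, and the readers in the amd package may implement these options quite differently.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Site:
    """Hypothetical atomic site; not a class from the amd package."""
    element: str
    disordered: bool = False


def apply_disorder_option(sites: List[Site], disorder: str = "skip") -> Optional[List[Site]]:
    """Return the sites to keep, or None if the whole crystal is skipped."""
    if disorder == "skip":
        # Any disorder at all means the crystal is not read.
        return None if any(s.disordered for s in sites) else list(sites)
    if disorder == "ordered_sites":
        # Keep the crystal but drop the disordered atoms.
        return [s for s in sites if not s.disordered]
    if disorder == "all_sites":
        # Keep every atom regardless of disorder.
        return list(sites)
    raise ValueError(f"unknown disorder option: {disorder!r}")


def drop_hydrogens(sites: List[Site]) -> List[Site]:
    """Mimic remove_hydrogens=True on the hypothetical Site list."""
    return [s for s in sites if s.element != "H"]


sites = [Site("C"), Site("H"), Site("O", disordered=True)]

print(apply_disorder_option(sites, "skip"))                # None
print(len(apply_disorder_option(sites, "ordered_sites")))  # 2
print(len(apply_disorder_option(sites, "all_sites")))      # 3
print([s.element for s in drop_hydrogens(sites)])          # ['C', 'O']
```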
PDD options¶
Parameters of amd.PDD(). amd.AMD() does not accept any optional parameters.
collapse (default True) chooses whether to collapse rows of PDDs which are similar enough (elementwise).
collapse_tol (default 0.0001) is the tolerance for collapsing PDD rows into one. The merged row is the average of those collapsed.
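The effect of collapse and collapse_tol can be sketched in a few lines of plain Python. This is a simplified greedy illustration, not the algorithm used by amd.PDD(): collapse_rows is a hypothetical helper, and it assumes that each row carries a weight alongside its distances and that the weights of merged rows are summed.

```python
def collapse_rows(rows, tol=1e-4):
    """Greedily merge rows whose distances all differ by at most tol.

    Each row is (weight, [distances]). A row joins the first group whose
    reference row it matches elementwise; otherwise it starts a new group.
    Simplified sketch under stated assumptions, not the library's algorithm.
    """
    groups = []
    for weight, dists in rows:
        for group in groups:
            reference = group[0][1]
            if all(abs(a - b) <= tol for a, b in zip(dists, reference)):
                group.append((weight, dists))
                break
        else:
            groups.append([(weight, dists)])

    merged = []
    for group in groups:
        total_weight = sum(w for w, _ in group)  # assumed: weights add up
        k = len(group[0][1])
        # The merged row is the average of the rows that were collapsed.
        mean_dists = [sum(d[i] for _, d in group) / len(group) for i in range(k)]
        merged.append((total_weight, mean_dists))
    return merged


# Four made-up rows with equal weights; rows 1, 2 and 4 agree within 1e-4.
rows = [
    (0.25, [1.0, 1.5, 2.0]),
    (0.25, [1.0, 1.5, 2.00005]),
    (0.25, [1.2, 1.5, 2.0]),
    (0.25, [1.0, 1.5, 2.0]),
]
collapsed = collapse_rows(rows)

for weight, dists in collapsed:
    print(weight, dists)
```

Here the four input rows collapse to two: one with weight 0.75 and one with weight 0.25. Raising the tolerance merges more rows and gives a smaller matrix at the cost of some precision.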
Comparison options¶
The parameters n_jobs, backend and verbose below only apply to PDD comparisons, and low_memory only applies to AMD comparisons.
metric (default chebyshev) chooses the metric used to compare AMDs or PDD rows. PDDs are always compared with the Earth Mover's distance, which requires a chosen 'base' metric to compare rows. See SciPy's cdist/pdist for a list of accepted metrics.
n_jobs (requires by='PDD', default None) is the number of cores to use for multiprocessing (passed to joblib.Parallel). Pass -1 to use the maximum.
backend (requires by='PDD', default multiprocessing) is the parallelization backend used for PDD comparisons.
verbose (requires by='PDD', default False) controls the verbosity level. With parallel processing the verbose argument of joblib.Parallel is used, otherwise tqdm is used.
low_memory (requires by='AMD' and metric='chebyshev', default False) uses a slower algorithm with a smaller memory footprint, better for large input sizes.
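To show how the choice of metric changes the result, the sketch below evaluates three metrics on the same pair of vectors. It is written in plain Python rather than with SciPy; the dictionary keys match SciPy's metric names chebyshev, euclidean and cityblock, and the two vectors are made-up numbers standing in for AMDs.

```python
import math


def chebyshev(u, v):
    """Largest absolute coordinate difference (the default metric)."""
    return max(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cityblock(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


METRICS = {"chebyshev": chebyshev, "euclidean": euclidean, "cityblock": cityblock}

# Made-up stand-ins for two AMD vectors with k = 3.
amd_a = [1.0, 2.0, 4.0]
amd_b = [1.0, 5.0, 8.0]

distances = {name: fn(amd_a, amd_b) for name, fn in METRICS.items()}
print(distances)
# {'chebyshev': 4.0, 'euclidean': 5.0, 'cityblock': 7.0}
```

Chebyshev distance depends only on the single largest difference, so it does not grow just because k is larger, whereas the other two accumulate differences across all k entries.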