Getting Started

Use pip to install average-minimum-distance:

pip install average-minimum-distance

The package is then imported under the name amd, i.e. with import amd.

Compare crystals in .CIF files with amd.compare()

amd.compare() extracts crystals from CIF files and compares them by AMD or PDD. Passing one path to a CIF compares all crystals in that CIF with each other; passing two paths compares every crystal in the first file against every crystal in the second. For example, to compare crystals in ‘crystals.cif’ by AMD with k = 100:

>>> import amd
>>> distance_matrix = amd.compare('crystals.cif', by='AMD', k=100)
>>> print(distance_matrix)

           crystal_1  crystal_2  crystal_3  crystal_4
crystal_1   0.000000   1.319907   0.975221   2.023880
crystal_2   1.319907   0.000000   0.520115   0.703973
crystal_3   0.975221   0.520115   0.000000   1.072211
crystal_4   2.023880   0.703973   1.072211   0.000000

The distance matrix returned is a pandas.DataFrame.

If csd-python-api is installed, amd.compare() can also accept lists of CSD refcodes (pass refcode_families=True) or other formats (pass reader='ccdc').
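
The two-path form described above can be sketched as follows. The function name and any file names passed to it are placeholders, and by='PDD' is chosen only to show the alternative to AMD. The import sits inside the function so that the definition itself does not need the package to be installed.

```python
def compare_two_cifs(path_a, path_b, k=100):
    """Compare every crystal in path_a against every crystal in path_b by PDD."""
    import amd  # needs: pip install average-minimum-distance

    # Two paths: all crystals in the first file are compared
    # against all crystals in the second file.
    return amd.compare(path_a, path_b, by="PDD", k=k)
```

Calling compare_two_cifs('crystals_1.cif', 'crystals_2.cif') should return a pandas.DataFrame of distances, as in the single-file case.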

Read CIFs, calculate descriptors and compare separately

amd.compare() reads crystals from CIFs, calculates descriptors and compares them. These steps can be done separately, e.g. to analyse the descriptors themselves. The line of code above using amd.compare() is equivalent to the following:

import amd
import pandas as pd
from scipy.spatial.distance import squareform

crystals = list(amd.CifReader('crystals.cif'))          # read crystals
amds = [amd.AMD(crystal, 100) for crystal in crystals]  # calculate AMDs
dm = squareform(amd.AMD_pdist(amds))                    # compare AMDs pairwise
names = [crystal.name for crystal in crystals]
distance_matrix = pd.DataFrame(dm, index=names, columns=names)

amd.CifReader reads a CIF and yields the crystals represented as amd.PeriodicSet objects. An amd.PeriodicSet can be passed to amd.AMD() to calculate its AMD (or to amd.PDD() for its PDD). Here, amd.AMD_pdist() is used to compare the AMDs pairwise, returning a condensed distance matrix (see scipy.spatial.distance.squareform(), which converts it to a symmetric 2D distance matrix). There is an equivalent function for comparing PDDs, amd.PDD_pdist(). There are also two cdist functions, amd.AMD_cdist() and amd.PDD_cdist(), which take two collections of descriptors and compare everything in one collection with everything in the other, returning a 2D distance matrix.
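
To make the difference between the condensed, square and cdist-style outputs concrete, here is a minimal pure-Python sketch. It is not how the package or SciPy implement these functions; the helper names simply mirror theirs, and the vectors are toy stand-ins for AMDs with k = 3 rather than real crystal data.

```python
from itertools import combinations


def chebyshev(u, v):
    """Largest absolute coordinate difference between two equal-length vectors."""
    return max(abs(a - b) for a, b in zip(u, v))


def pdist(vectors, metric=chebyshev):
    """Condensed distances: one entry per unordered pair (i < j), row by row."""
    return [metric(u, v) for u, v in combinations(vectors, 2)]


def squareform(condensed, n):
    """Expand a condensed distance list into a symmetric n x n matrix."""
    square = [[0.0] * n for _ in range(n)]
    for (i, j), d in zip(combinations(range(n), 2), condensed):
        square[i][j] = square[j][i] = d
    return square


def cdist(vectors_a, vectors_b, metric=chebyshev):
    """Distances from every vector in one collection to every vector in the other."""
    return [[metric(u, v) for v in vectors_b] for u in vectors_a]


# Toy stand-ins for three AMD vectors (k = 3).
amds = [[1.0, 1.5, 2.0], [1.0, 1.25, 2.5], [2.0, 2.0, 2.0]]

condensed = pdist(amds)                    # 3 entries for 3 vectors
square = squareform(condensed, len(amds))  # symmetric 3 x 3, zero diagonal
cross = cdist(amds[:1], amds[1:])          # 1 x 2, first vector vs. the rest
```

For n descriptors the condensed form holds n(n-1)/2 values, which is why pdist output is passed through squareform before being labelled with crystal names.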

Optional parameters of amd.compare()

amd.compare() reads crystals, computes their invariants and compares them in one function for convenience. It accepts most of the optional parameters of these individual steps, which are listed below.

Reading options

Parameters of amd.CifReader or amd.CSDReader.

  • reader (default gemmi) controls the backend package used to parse the file. Accepts gemmi, pycodcif, pymatgen, ase and ccdc (if installed). The ccdc reader can read formats accepted by ccdc.io.EntryReader.

  • remove_hydrogens (default False) removes hydrogen atoms from the structure.

  • disorder (default skip) controls how disordered structures are handled. The default skips any crystal with disorder, since disorder conflicts with the model of a periodic set. Alternatively, ordered_sites removes atoms with disorder and all_sites includes all atoms regardless of disorder.

  • show_warnings (default True) chooses whether to print warnings during reading, e.g. from disordered structures or crystals with missing data.

  • verbose (default False) prints a progress bar showing the number of items read so far.

  • heaviest_component (csd-python-api only, default False) removes all but the heaviest connected molecule in the asymmetric unit, intended for removing solvents.

  • molecular_centres (csd-python-api only, default False) uses molecular centres of mass instead of atoms as the motif of the periodic set.

  • csd_refcodes (csd-python-api only, default False) interprets the string(s) given as CSD refcodes.

  • families (csd-python-api only, default False) interprets the list of strings given as CSD refcode families and reads all crystals in those families.
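
As an illustration of passing reading options through amd.compare(), the sketch below collects a few of the parameters listed above in a dictionary. The option values and the function name are illustrative choices, not recommendations, and the import is kept inside the function so the definition does not require the package.

```python
# Reading options taken from the list above; the values are examples only.
READING_OPTIONS = {
    "remove_hydrogens": True,     # drop hydrogen atoms
    "disorder": "ordered_sites",  # remove atoms with disorder instead of skipping the crystal
    "show_warnings": False,       # silence warnings while reading
}


def compare_with_reading_options(cif_path, k=100):
    """Compare all crystals in one CIF by AMD using the reading options above."""
    import amd  # needs: pip install average-minimum-distance

    return amd.compare(cif_path, by="AMD", k=k, **READING_OPTIONS)
```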

PDD options

Parameters of amd.PDD(). amd.AMD() does not accept any optional parameters.

  • collapse (default True) chooses whether to collapse rows of PDDs which are similar enough (elementwise).

  • collapse_tol (default 0.0001) is the tolerance for collapsing PDD rows into one. The merged row is the average of those collapsed.
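
The idea behind collapse and collapse_tol can be sketched in plain Python. This is an assumption-laden illustration, not the package's algorithm: it groups rows greedily against the first row of each group, and it assumes each row carries a weight (as in the usual PDD definition) with the weights of merged rows summed.

```python
def collapse_rows(weights, rows, tol=1e-4):
    """Merge rows whose entries all lie within tol of a group's first row."""
    groups = []  # each group is a list of row indices
    for i, row in enumerate(rows):
        for group in groups:
            representative = rows[group[0]]
            if all(abs(a - b) <= tol for a, b in zip(row, representative)):
                group.append(i)
                break
        else:
            groups.append([i])  # no close group found, start a new one

    # Assumed here: weights of merged rows add up.
    merged_weights = [sum(weights[i] for i in g) for g in groups]
    # The merged row is the elementwise average of the rows in the group.
    merged_rows = [
        [sum(column) / len(g) for column in zip(*(rows[i] for i in g))]
        for g in groups
    ]
    return merged_weights, merged_rows


# Toy data: the first two rows differ by 5e-5 in one entry.
weights = [0.25, 0.25, 0.5]
rows = [[1.0, 2.0], [1.00005, 2.0], [1.5, 2.5]]

merged_weights, merged_rows = collapse_rows(weights, rows, tol=1e-4)
kept_weights, kept_rows = collapse_rows(weights, rows, tol=0.0)
```

With the default-sized tolerance the first two rows merge into one row of weight 0.5; with a tolerance of zero all three rows are kept.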

Comparison options

The parameters n_jobs, backend and verbose below apply only to PDD comparisons, and low_memory applies only to AMD comparisons.

  • metric (default chebyshev) chooses the metric used to compare AMDs or PDD rows. PDDs are always compared with Earth Mover’s distance, which needs a ‘base’ metric for comparing rows; this parameter sets that base metric. See SciPy’s cdist/pdist for a list of accepted metrics.

  • n_jobs (requires by='PDD', default None) is the number of cores to use for multiprocessing (passed to joblib.Parallel). Pass -1 to use the maximum.

  • backend (requires by='PDD', default multiprocessing) is the parallelization backend implementation for PDD comparisons.

  • verbose (requires by='PDD', default False) controls the verbosity level. With parallel processing the verbose argument of joblib.Parallel is used, otherwise tqdm is used.

  • low_memory (requires by='AMD' and metric='chebyshev', default False) uses a slower algorithm with a smaller memory footprint, better for large input sizes.
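
The comparison options above might be combined as in the following sketch. The function names and option values are illustrative only, and the imports are placed inside the functions so the definitions do not require the package to be installed.

```python
# Comparison options named in the list above; the values are examples only.
PDD_COMPARE_OPTIONS = {
    "metric": "chebyshev",  # base metric used to compare PDD rows
    "n_jobs": -1,           # use the maximum number of cores
    "verbose": True,        # report progress during the comparison
}


def compare_by_pdd(cif_path, k=100):
    """Pairwise PDD comparison of all crystals in one CIF, run in parallel."""
    import amd  # needs: pip install average-minimum-distance

    return amd.compare(cif_path, by="PDD", k=k, **PDD_COMPARE_OPTIONS)


def compare_by_amd_low_memory(cif_path, k=100):
    """Pairwise AMD comparison using the slower, smaller-footprint algorithm."""
    import amd  # needs: pip install average-minimum-distance

    # low_memory requires by='AMD' together with metric='chebyshev'.
    return amd.compare(cif_path, by="AMD", k=k, metric="chebyshev", low_memory=True)
```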