Getting Started

Comparing crystals

amd.compare() extracts crystals from one or more CIFs and compares them by AMD or PDD. For example, to compare all crystals in a CIF by AMD with k = 100:

import amd
df = amd.compare('crystals.cif', by='AMD', k=100)

To compare by PDD, use by='PDD'. A distance matrix is returned as a pandas DataFrame. amd.compare() can also take two paths to compare all crystals in one file with those in the other.

If csd-python-api is installed, amd.compare() can also accept lists of CSD refcodes, or files in other formats readable by the ccdc reader.
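Because the result is an ordinary pandas DataFrame, standard pandas operations apply to it. The following is a minimal sketch of finding each crystal's nearest neighbour; the small matrix and the crystal names are made up for illustration and stand in for the output of amd.compare().

```python
import numpy as np
import pandas as pd

# Toy symmetric distance matrix standing in for the output of amd.compare();
# the crystal names and values are invented for this example.
names = ['crystal_A', 'crystal_B', 'crystal_C']
dm = np.array([
    [0.0, 0.2, 0.9],
    [0.2, 0.0, 0.5],
    [0.9, 0.5, 0.0],
])
df = pd.DataFrame(dm, index=names, columns=names)

# Mask the zero diagonal so a crystal is not reported as its own neighbour.
masked = df.mask(np.eye(len(df), dtype=bool))

nearest = masked.idxmin(axis=1)    # name of the closest other crystal
nearest_dist = masked.min(axis=1)  # distance to that crystal
print(pd.DataFrame({'nearest': nearest, 'distance': nearest_dist}))
```

The same approach should work on a real result, provided the matrix is square with matching row and column names.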

Read, calculate descriptors and compare separately

amd.compare() reads crystals, calculates AMD or PDD, and compares them. It is sometimes useful to do these steps separately, e.g. to save the descriptors to a file. The code above using amd.compare() is equivalent to the following:

import amd
import pandas as pd
from scipy.spatial.distance import squareform

crystals = list(amd.CifReader('crystals.cif'))          # read crystals
amds = [amd.AMD(crystal, 100) for crystal in crystals]  # calculate AMDs
dm = squareform(amd.AMD_pdist(amds))                    # compare AMDs pairwise
names = [crystal.name for crystal in crystals]
df = pd.DataFrame(dm, index=names, columns=names)

Here, amd.AMD_pdist() is used to compare the AMDs pairwise, returning a condensed distance matrix (see scipy.spatial.distance.squareform(), which converts it to a symmetric 2D distance matrix). There is an equivalent function for comparing PDDs, amd.PDD_pdist(). There are also two cdist functions, amd.AMD_cdist() and amd.PDD_cdist(), which take two collections of descriptors and compare everything in one collection with everything in the other, returning a 2D distance matrix.
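To make the difference between the condensed, square and two-collection forms concrete, here is a small sketch using SciPy's generic pdist, squareform and cdist on made-up vectors. It does not call the amd functions; the vectors only stand in for AMDs, and the assumption is that the shapes returned by the amd functions follow the same conventions.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

# Made-up descriptor vectors standing in for AMDs (one row per crystal).
set_a = np.array([[1.0, 1.5, 2.0],
                  [1.0, 1.4, 2.2],
                  [0.8, 1.5, 2.1]])
set_b = np.array([[1.1, 1.5, 2.0],
                  [0.9, 1.3, 2.4]])

# Condensed form: one entry per unordered pair, in the order (0,1), (0,2), (1,2).
condensed = pdist(set_a, metric='chebyshev')

# Square form: symmetric 3 x 3 matrix with a zero diagonal.
square = squareform(condensed)

# Two collections: entry [i, j] compares set_a[i] with set_b[j].
between = cdist(set_a, set_b, metric='chebyshev')

print(condensed.shape, square.shape, between.shape)
```

For n items the condensed form holds n(n-1)/2 values, which is why it is the default for pairwise comparison within one collection.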

Write crystals or their descriptors to a file

pickle is an easy way to store crystals or their descriptors.

import amd
import pickle

crystals = list(amd.CifReader('crystals.cif'))

with open('crystals.pkl', 'wb') as f: # write
    pickle.dump(crystals, f)

with open('crystals.pkl', 'rb') as f: # read
    crystals = pickle.load(f)
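
Descriptors can be stored the same way. The sketch below assumes the descriptors are NumPy arrays and keeps them in a dict keyed by crystal name; the names, values and temporary file path are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical descriptors keyed by crystal name; in practice the values would
# be the results of amd.AMD() or amd.PDD() (assumed here to be NumPy arrays).
descriptors = {
    'crystal_A': np.array([1.0, 1.5, 2.0]),
    'crystal_B': np.array([1.0, 1.4, 2.2]),
}

# A temporary directory is used only to keep the example self-contained.
path = os.path.join(tempfile.mkdtemp(), 'descriptors.pkl')

with open(path, 'wb') as f:  # write
    pickle.dump(descriptors, f)

with open(path, 'rb') as f:  # read
    loaded = pickle.load(f)
```

Keying by name keeps each descriptor attached to its crystal once the crystal objects themselves are no longer in memory.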

List of optional parameters

amd.compare() reads crystals, computes their invariants and compares them in one function for convenience. It accepts most of the optional parameters from each of these steps; all are listed below.

Reader options

Parameters of amd.CifReader or amd.CSDReader.

  • reader (default ase) controls the backend package used to parse the file. Accepts ase, pycodcif, pymatgen, gemmi and ccdc (if these packages are installed). The ccdc reader can read formats accepted by ccdc.io.EntryReader.

  • remove_hydrogens (default False) removes hydrogen atoms from the structure.

  • disorder (default skip) controls how disordered structures are handled. The default skips any crystal with disorder, since disorder conflicts somewhat with the periodic set model. Alternatively, ordered_sites removes atoms with disorder and all_sites includes all atoms regardless.

  • show_warnings (default True) chooses whether to print warnings during reading, e.g. from disordered structures or crystals with missing data.

  • heaviest_component (default False, CSD Python API only) removes all but the heaviest molecule in the asymmetric unit, intended for removing solvents.

  • molecular_centres (default False, CSD Python API only) uses centres of molecules instead of atoms as the motif of the periodic set.

  • families (default False, CSD Python API only) interprets the list of strings given as CSD refcode families and reads all crystals in those families.

PDD options

Parameters of amd.PDD(). amd.AMD() does not accept any optional parameters.

  • collapse (default True) chooses whether to collapse rows of PDDs which are similar enough (elementwise).

  • collapse_tol (default 0.0001) is the tolerance for collapsing PDD rows into one. The merged row is the average of those collapsed.
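
As an illustration of what collapsing means, the sketch below greedily merges rows that agree elementwise to within a tolerance and averages each group. It is not the library's implementation: the function collapse_rows is hypothetical, it keeps row weights in a separate array, and it assumes that the weights of merged rows are summed.

```python
import numpy as np

def collapse_rows(weights, rows, tol=1e-4):
    """Greedily merge rows whose elementwise difference is at most tol.

    Hypothetical helper for illustration; not part of the amd package.
    """
    unassigned = np.ones(len(rows), dtype=bool)
    merged_weights, merged_rows = [], []
    for i in range(len(rows)):
        if not unassigned[i]:
            continue
        # Rows not yet merged that are within tol of row i in every column
        # (row i itself always qualifies).
        close = unassigned & np.all(np.abs(rows - rows[i]) <= tol, axis=1)
        unassigned[close] = False
        merged_weights.append(weights[close].sum())   # assumption: weights add up
        merged_rows.append(rows[close].mean(axis=0))  # merged row is the average
    return np.array(merged_weights), np.array(merged_rows)

# Made-up example: the first two rows differ by 5e-5 in the last column only.
weights = np.array([0.25, 0.25, 0.5])
rows = np.array([[1.0, 1.5, 2.0],
                 [1.0, 1.5, 2.00005],
                 [1.2, 1.6, 2.1]])
new_weights, new_rows = collapse_rows(weights, rows)
print(new_weights)
print(new_rows)
```

With the default tolerance the first two rows merge into one, leaving two rows; with a tolerance below 5e-5 all three rows would be kept.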

Comparison options

The metric parameter below is accepted by amd.PDD_pdist(), amd.PDD_cdist(), amd.AMD_pdist() and amd.AMD_cdist(). n_jobs and verbose apply only to PDD comparisons, and low_memory applies only to AMD comparisons.

  • metric (default chebyshev) chooses the metric used to compare AMDs or PDD rows. See SciPy’s cdist/pdist for a list of accepted metrics.

  • n_jobs (requires by='PDD', default None) is the number of cores to use for multiprocessing (passed to joblib.Parallel). Pass -1 to use the maximum.

  • verbose (requires by='PDD', default 0) controls the verbosity level, increasing with larger numbers. This is passed to joblib.Parallel; see its documentation for details.

  • low_memory (requires by='AMD' and metric='chebyshev', default False) uses a slower algorithm with a smaller memory footprint, better for large input sizes.
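
The trade-off behind low_memory can be illustrated with two ways of computing Chebyshev distances between two sets of vectors. This is a sketch of the general idea only, not the library's algorithm; both function names are hypothetical and the inputs are random vectors standing in for AMDs.

```python
import numpy as np

def chebyshev_cdist_broadcast(a, b):
    # Builds an (n, m, k) intermediate array in one step; memory use grows
    # with n * m * k.
    return np.max(np.abs(a[:, None, :] - b[None, :, :]), axis=-1)

def chebyshev_cdist_rowwise(a, b):
    # Handles one row of `a` at a time, so the intermediate is only (m, k).
    out = np.empty((len(a), len(b)))
    for i, row in enumerate(a):
        out[i] = np.max(np.abs(b - row), axis=-1)
    return out

# Random vectors standing in for AMDs with k = 100.
rng = np.random.default_rng(0)
a = rng.random((5, 100))
b = rng.random((4, 100))

fast = chebyshev_cdist_broadcast(a, b)
lean = chebyshev_cdist_rowwise(a, b)
print(np.allclose(fast, lean))
```

Both functions return the same distances; the row-wise version trades a Python-level loop for a much smaller intermediate array, which matters once the number of crystals is large.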