Using PDDs

Calculation

The pointwise distance distribution (PDD) of a crystal is given by amd.PDD(). It accepts a crystal and an integer k, returning the \(\text{PDD}_k\) as a 2D NumPy array with k + 1 columns, the weights of each row being in the first column.

If you have a .cif file, use amd.CifReader to read the crystals (see Reading cifs). If csd-python-api is installed, amd.CSDReader accepts CSD refcodes (see Reading from the CSD).

# get PDDs of crystals in a .cif
reader = amd.CifReader('file.cif')
pdds = [amd.PDD(crystal, 100) for crystal in reader]

# get PDDs of DEBXIT01 and DEBXIT02 from the CSD
crystals = list(amd.CSDReader(['DEBXIT01', 'DEBXIT02']))
pdds = [amd.PDD(crystal, 50) for crystal in crystals]

You can also give the coordinates of motif points and unit cell as a tuple of numpy arrays, in Cartesian form:

# PDD (k=10) of 3D cubic lattice
motif = np.array([[0,0,0]])
cell = np.identity(3)
cubic_lattice = (motif, cell)
cubic_pdd = amd.PDD(cubic_lattice, 10)

The object returned by amd.PDD() is a NumPy array with k + 1 columns.

Calculation options

amd.PDD() accepts a few optional arguments (not relevant to amd.AMD()):

amd.PDD(
    periodic_set,              # input crystal (a PeriodicSet or tuple)
    k,                         # k value (number of neighbours)
    lexsort=True,              # lexicograpgically sort rows
    collapse=True,             # collpase identical rows (within tolerance)
    collapse_tol=1e-4,         # tolerance for collapsing rows
    return_row_groups=False    # return info about which rows collapsed
)

lexsort lexicograpgically orders the rows of the PDD, and collapse merges rows if all elements of rows are within collapse_tol. The definition of PDD requires both in order to technically satisfy invariance (that two isometric sets have equal PDDs). But, Earth mover’s distance does not depend on row order, so lexsort=False makes no difference to comparisons. Setting collapse=False will also not change Earth mover’s distances, but comparisons may take longer.

Sometimes it’s useful to know which rows of the PDD came from which motif points in the input. Setting return_row_groups=True makes the function return a tuple (pdd, groups), where groups[i] contains the indices of the point(s) corresponding to pdd[i]. Note that these indices are for the asymmetric unit, whose indices in periodic_set.motif are accessible through periodic_set.asym_unit if it exists.

Comparison

The Earth mover’s distance is the appropriate metric to compare PDDs. The amd.compare module contains functions for these comparisons.

amd.PDD_pdist() and amd.PDD_cdist() are like SciPy’s functions pdist and cdist. pdist takes one set and compares all elements pairwise, whereas cdist takes two sets and compares elements in one with the other. cdist returns a 2D distance matrix, but pdist returns a condensed distance matrix (see SciPy’s pdist function).

# compare crystals in file1.cif with those in file2.cif by PDD, k=100
pdds1 = [amd.PDD(crystal, 100) for crystal in amd.CifReader('file1.cif')]
pdds2 = [amd.PDD(crystal, 100) for crystal in amd.CifReader('file2.cif')]
distance_matrix = amd.PDD_cdist(pdds1, pdds2)

# compare everything in file1.cif with each other
condensed_dm = amd.PDD_pdist(pdds1)

You can compare one PDD with another with amd.EMD():

# compare DEBXIT01 and DEBXIT02 by PDD, k=100
pdds = [amd.PDD(crystal, 100) for crystal in amd.CSDReader(['DEBXIT01', 'DEBXIT02'])]
distance = amd.EMD(pdds[0], pdds[1])

amd.EMD(), amd.PDD_pdist() and amd.PDD_cdist() all accept an optional argument metric, which can be anything accepted by SciPy’s pdist/cdist functions. The metric used to compare PDDs is always Earth mover’s distance, but this still requires another metric between the rows of PDDs (so technically there’s a different Earth mover’s distance for each choice of metric).

Comparison options and multiprocessing

amd.PDD_pdist() and amd.PDD_cdist() share the following optional arguments:

  • metric (default chebyshev) chooses the metric used to compare PDD rows. See SciPy’s cdist/pdist for a list of accepted metrics.

  • n_jobs (new in 1.2.3, default None) is the number of cores to use for multiprocessing (passed to joblib.Parallel). Pass -1 to use the maximum.

  • backend (default multiprocessing) is the parallelization backend implementation for PDD comparisons.

  • verbose (requires by='PDD', default False) controls the verbosity level. With parallel processing the verbose argument of joblib.Parallel is used, otherwise tqdm is used.