Below is the readme for amd, explaining what it is and how to use it. More detailed documentation can be found in the sidebar.

average-minimum-distance: isometrically invariant crystal fingerprints


Implements fingerprints (isometry invariants) of crystal structures based on geometry: average minimum distances (AMD) and pointwise distance distributions (PDD).

What’s amd?

The typical representation of a crystal as a motif and cell is ambiguous, as there are many ways to define the same crystal. This package implements new isometric invariants: average minimum distances (AMD) and pointwise distance distributions (PDD), which always take the same value for any two (isometrically) identical input crystals. They do this in a continuous way, so similar crystals have a small distance between their invariants.

Brief description of AMD and PDD

The pointwise distance distribution (PDD) records the environment of each atom in a unit cell by listing the distances from each atom to neighbouring atoms in order, with some extra steps to ensure independence of cell and motif. A PDD is a collection of lists with attached weights (a matrix). Two PDDs are compared by finding an optimal matching between the two sets of lists while respecting the weights (Earth Mover’s distance), and when the crystals are geometrically identical (regardless of choice of motif and cell) there is always a perfect matching resulting in a distance of zero.
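The matching idea can be sketched for a special case. This is only an illustration under stated assumptions, not the package's implementation: it assumes both PDDs have the same number of rows with equal weights, in which case the Earth Mover's distance reduces to the best one-to-one assignment of rows. It also assumes rows are compared with the Chebyshev (L-infinity) distance, and it brute-forces all assignments, so it only suits tiny inputs. The helper names and the input rows are hypothetical.

```python
from itertools import permutations


def chebyshev(u, v):
    """L-infinity distance between two equal-length lists of distances."""
    return max(abs(a - b) for a, b in zip(u, v))


def emd_uniform(rows_a, rows_b):
    """Earth Mover's distance between two sets of rows that all carry the
    same weight. With equal weights and equal row counts, an optimal
    matching is a permutation, so we try them all (tiny inputs only)."""
    n = len(rows_a)
    if n != len(rows_b):
        raise ValueError("this sketch only handles equal row counts")
    best_total = min(
        sum(chebyshev(rows_a[i], rows_b[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best_total / n  # each matched pair carries weight 1/n


# Hypothetical rows of neighbour distances (k = 2); weights are omitted
# because every row is assumed to have weight 1/2.
rows_a = [[1.0, 1.0], [1.0, 2.0]]
rows_b = [[1.0, 2.0], [1.0, 1.0]]  # the same rows listed in another order
rows_c = [[1.0, 1.0], [2.0, 3.0]]

same = emd_uniform(rows_a, rows_b)  # perfect matching exists
diff = emd_uniform(rows_a, rows_c)
```

Reordering the rows does not change the result, which mirrors why the comparison does not depend on the order in which atoms are listed.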

The average minimum distance (AMD) averages the PDD over atoms in a unit cell to make a vector, which is also the same for any choice of cell and motif. Since AMDs are just vectors, comparing by AMD is much faster than PDD, though AMD contains less information in theory.

Both AMD and PDD have a parameter k, the number of nearest neighbours to consider for each atom, which is the length of the AMD vector or the number of columns in the PDD (plus an extra column for weights of rows).
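To make the roles of k, the rows and the weights concrete, here is a toy sketch for a one-dimensional periodic point set. It is not the package's algorithm: the functions pdd_1d and amd_from_pdd are hypothetical, the weights are assumed to sit in the first column, and the "extra steps" are reduced to merging identical rows and sorting them.

```python
def pdd_1d(motif, cell, k):
    """Toy PDD of a 1D periodic point set with the given motif and cell
    length. Each row is [weight, d_1, ..., d_k], where d_1 <= ... <= d_k
    are the distances to the k nearest neighbours of one motif point."""
    n = len(motif)
    reach = k + 1  # copies of the cell on each side; enough to hold k neighbours
    merged = {}
    for i, p in enumerate(motif):
        dists = sorted(
            abs(q + shift * cell - p)
            for j, q in enumerate(motif)
            for shift in range(-reach, reach + 1)
            if not (i == j and shift == 0)  # skip the point itself
        )
        key = tuple(round(d, 9) for d in dists[:k])
        # Identical rows are merged and their weights added together.
        merged[key] = merged.get(key, 0.0) + 1.0 / n
    return [[weight, *key] for key, weight in sorted(merged.items())]


def amd_from_pdd(pdd):
    """Weighted average of the distance columns of a PDD."""
    k = len(pdd[0]) - 1
    return [sum(row[0] * row[j + 1] for row in pdd) for j in range(k)]


# The same point set (integer points on a line) described in two ways.
small = pdd_1d([0.0], 1.0, k=4)        # one point, cell of length 1
doubled = pdd_1d([0.0, 1.0], 2.0, k=4)  # two points, cell of length 2
amd_small = amd_from_pdd(small)
amd_doubled = amd_from_pdd(doubled)
```

Both descriptions give a single row with weight 1 and k = 4 distances, so the PDD has k + 1 columns and the AMD has length k in either case.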

Getting started

Use pip to install average-minimum-distance:

pip install average-minimum-distance

Then import average-minimum-distance with import amd.

amd.compare() compares sets of crystals by AMD or PDD in one line, e.g. by PDD with k = 100:

import amd
df = amd.compare('crystals.cif', by='PDD', k=100)

The distance matrix is returned as a pandas DataFrame, with the names of the crystals labelling its rows and columns. amd.compare() can also take two paths and compare the crystals in one file with those in the other, for example

df = amd.compare('crystals_1.cif', 'crystals_2.cif', by='AMD', k=100)

Either the first or the second argument can be a list of cif paths (or file objects), which are combined in the final distance matrix.

amd.compare() reads crystals and calculates their AMDs or PDDs, but discards them afterwards. If you will reuse the invariants, it may be faster to save them to a file (e.g. with pickle); see the sections below on how to read, calculate and compare separately.
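Saving invariants could look something like the following minimal sketch. The dictionary of vectors is hypothetical stand-in data (in practice it would hold the results of amd.AMD or amd.PDD), and the file name and temporary directory are arbitrary choices for illustration.

```python
import os
import pickle
import tempfile

# Stand-in data: crystal name -> invariant. In real use the values would be
# the AMDs or PDDs calculated once from your crystals.
invariants = {
    "crystal_A": [1.0, 1.5, 2.0],
    "crystal_B": [1.1, 1.4, 2.2],
}

# Write the invariants to disk once...
path = os.path.join(tempfile.mkdtemp(), "invariants_k3.pkl")
with open(path, "wb") as f:
    pickle.dump(invariants, f)

# ...and load them later without re-reading or recalculating anything.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```

Putting k in the file name is one simple way to avoid mixing invariants calculated with different values of k.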

If csd-python-api is installed, the compare function can also accept one or more CSD refcodes or other file formats instead of cifs (pass reader='ccdc').

Choosing a value of k

The parameter k of the invariants is the number of nearest neighbour atoms considered for each atom in the unit cell, e.g. k = 5 looks at the 5 nearest neighbours of each atom. Two crystals with the same unit molecule will have a small AMD/PDD distance for small enough k. A larger k means the environments of atoms in one crystal must line up with those in the other up to a larger radius to have a small AMD/PDD distance. A very large k does not necessarily give better comparisons, because for large k the invariants increasingly depend only on density.

Reading crystals from a file, calculating the AMDs and PDDs

This code reads a .cif with amd.CifReader and computes the AMDs (k = 100):

import amd
reader = amd.CifReader('path/to/file.cif')
amds = [amd.AMD(crystal, 100) for crystal in reader]  # calc AMDs

Note: CifReader accepts optional arguments, e.g. for removing hydrogen and handling disorder. See the documentation for details.

To calculate PDDs, just replace amd.AMD with amd.PDD.

If csd-python-api is installed, crystals can be read directly from your local copy of the CSD with amd.CSDReader, which accepts a list of refcodes. CifReader can accept file formats other than cif by passing reader='ccdc'.

Comparing by AMD or PDD

amd.AMD_pdist and amd.PDD_pdist take a list of invariants and compare them pairwise, returning a condensed distance matrix like SciPy’s pdist function.

import amd

# read and calculate AMDs and PDDs (k=100)
crystals = list(amd.CifReader('path/to/file.cif'))
amds = [amd.AMD(crystal, 100) for crystal in crystals]
pdds = [amd.PDD(crystal, 100) for crystal in crystals]

amd_cdm = amd.AMD_pdist(amds) # compare a list of AMDs pairwise
pdd_cdm = amd.PDD_pdist(pdds) # compare a list of PDDs pairwise

# Use SciPy's squareform for a symmetric 2D distance matrix
from scipy.spatial.distance import squareform
amd_dm = squareform(amd_cdm)
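
If you only need a few entries, a condensed distance matrix can also be indexed directly without expanding it. The sketch below shows where the distance between items i and j sits in the condensed layout used by SciPy's pdist (upper triangle, row by row); condensed_index is a hypothetical helper, not part of amd or SciPy.

```python
def condensed_index(i, j, n):
    """Position of the distance between items i and j (i != j) in a
    condensed distance matrix over n items."""
    if i == j:
        raise ValueError("the diagonal is not stored in condensed form")
    if i > j:
        i, j = j, i  # distances are symmetric, only i < j is stored
    return n * i + j - ((i + 2) * (i + 1)) // 2


n = 4
# Pairs in the order they appear in the condensed matrix:
# (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
positions = [condensed_index(i, j, n) for i, j in pairs]
```

For n items the condensed form holds n * (n - 1) / 2 distances, which is why it is preferred over the square form for large sets.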

Note: if you want both AMDs and PDDs like above, it’s faster to compute the PDDs first and use amd.PDD_to_AMD() rather than computing both from scratch.

The default metric for comparison is chebyshev (L-infinity), though it can be changed to anything accepted by SciPy’s pdist, e.g. euclidean.
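To illustrate what the choice of metric means, this small sketch compares two hypothetical AMD vectors with hand-written Chebyshev and Euclidean distances. The vectors are made up, and the functions are plain-Python stand-ins for the metrics of the same names.

```python
import math


def chebyshev(u, v):
    """L-infinity distance: the largest difference in any single position."""
    return max(abs(a - b) for a, b in zip(u, v))


def euclidean(u, v):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical AMD vectors with k = 3.
amd_a = [1.0, 2.0, 3.0]
amd_b = [1.0, 5.0, 7.0]

cheb = chebyshev(amd_a, amd_b)  # largest single difference: |3 - 7| = 4
eucl = euclidean(amd_a, amd_b)  # sqrt(0 + 9 + 16) = 5
```

Chebyshev reports only the worst disagreement over the k neighbours, while Euclidean accumulates disagreements across all of them.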

If you have two sets of crystals and want to compare all crystals in one to the other, use amd.AMD_cdist or amd.PDD_cdist.

set1 = amd.CifReader('set1.cif')
set2 = amd.CifReader('set2.cif')
amds1 = [amd.AMD(crystal, 100) for crystal in set1]
amds2 = [amd.AMD(crystal, 100) for crystal in set2]

# dm[i][j] = distance(amds1[i], amds2[j])
dm = amd.AMD_cdist(amds1, amds2)

Example: PDD-based dendrogram

This example compares some crystals in a cif by PDD (k = 100) and plots a single linkage dendrogram:

import amd
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

crystals = list(amd.CifReader('crystals.cif'))
names = [crystal.name for crystal in crystals]
pdds = [amd.PDD(crystal, 100) for crystal in crystals]
cdm = amd.PDD_pdist(pdds)
Z = hierarchy.linkage(cdm, 'single')
dn = hierarchy.dendrogram(Z, labels=names)
plt.show()

Example: Finding n nearest neighbours in one set from another

This example finds the 10 nearest PDD-neighbours in set 2 for every crystal in set 1.

import numpy as np
import amd

n = 10
df = amd.compare('set1.cif', 'set2.cif', by='PDD', k=100)
dm = df.values

# Uses np.argpartition (partial argsort) and np.take_along_axis to find
# nearest neighbours of each item in set1. Works for any distance matrix.
nn_inds = np.array([np.argpartition(row, n)[:n] for row in dm])
nn_dists = np.take_along_axis(dm, nn_inds, axis=-1)
sorted_inds = np.argsort(nn_dists, axis=-1)
nn_inds = np.take_along_axis(nn_inds, sorted_inds, axis=-1)
nn_dists = np.take_along_axis(nn_dists, sorted_inds, axis=-1)

for i in range(len(dm)):  # one row of dm per crystal in set 1
    print('neighbours of', df.index[i])
    for j in range(n):
        print('neighbour', j+1, df.columns[nn_inds[i][j]], 'dist:', nn_dists[i][j])

Cite us

Use the following bib references to cite AMD or PDD.

Average minimum distances of periodic point sets - foundational invariants for mapping periodic crystals. MATCH Communications in Mathematical and in Computer Chemistry, 87(3), 529-559 (2022). https://doi.org/10.46793/match.87-3.529W.

@article{10.46793/match.87-3.529W,
  title = {Average Minimum Distances of periodic point sets - foundational invariants for mapping periodic crystals},
  author = {Widdowson, Daniel and Mosca, Marco M and Pulido, Angeles and Kurlin, Vitaliy and Cooper, Andrew I},
  journal = {MATCH Communications in Mathematical and in Computer Chemistry},
  doi = {10.46793/match.87-3.529W},
  volume = {87},
  number = {3},
  pages = {529-559},
  year = {2022}
}

Pointwise distance distributions of periodic point sets. arXiv preprint arXiv:2108.04798 (2021). https://arxiv.org/abs/2108.04798.

@misc{arXiv:2108.04798,
  author = {Widdowson, Daniel and Kurlin, Vitaliy},
  title = {Pointwise distance distributions of periodic point sets},
  year = {2021},
  eprint = {arXiv:2108.04798},
}
