alchemtest: the simple alchemistry test set

alchemtest is a collection of test datasets for alchemical free energy calculations. The datasets come from a variety of software packages, primarily molecular dynamics engines, and are used as the test set for alchemlyb. The package is standalone, however, and can be used for any purpose.

Datasets are released under an open license that conforms to the Open Definition 2.1, which allows free use, re-use, redistribution, modification, and separation for any purpose and without charge. All data and code can be found in the public GitHub repository alchemistry/alchemtest.

This library is under active development. We use semantic versioning to indicate clearly what kind of changes you may expect between releases. Although it is heavily used for the alchemlyb test suite, it may still contain bugs. Please raise any issues or questions in the Issue Tracker. Contributions of datasets and code in the form of pull requests are very welcome.

Installing alchemtest

alchemtest is pure Python, so it can be installed easily with pip:

pip install alchemtest

If you wish to install this in your user site-packages, use the --user flag:

pip install --user alchemtest

Installing from source

To install from source, clone the repository from GitHub with:

git clone https://github.com/alchemistry/alchemtest.git

then do:

cd alchemtest
pip install .

If you wish to install this in your user site-packages, use the --user flag:

pip install --user .

Basic usage

All datasets in alchemtest are accessible via load_* functions, organized in submodules by the software package that generated them. The current set of submodules is:

gmx

Gromacs molecular dynamics simulation datasets.

amber

Amber molecular dynamics simulation datasets.

namd

NAMD molecular dynamics simulation datasets.

gomc

GOMC Monte Carlo simulation datasets.
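Because every accessor follows the load_* naming pattern, the loaders of a submodule can be discovered by introspection. The following is a minimal sketch, not part of alchemtest: the helper name list_loaders is hypothetical, and a stand-in namespace object replaces a real submodule so that the snippet runs without alchemtest installed.

```python
import types


def list_loaders(module):
    """Return the sorted names of callable ``load_*`` attributes of ``module``."""
    return sorted(
        name
        for name in dir(module)
        if name.startswith("load_") and callable(getattr(module, name))
    )


# Stand-in for a real submodule such as alchemtest.gmx (illustration only).
fake = types.SimpleNamespace(
    load_benzene=lambda: None,
    load_other=lambda: None,
    load_marker="not callable",  # skipped: matches the prefix but is not callable
    helper=lambda: None,         # skipped: does not match the prefix
)
loaders = list_loaders(fake)
print(loaders)
```

With alchemtest installed, passing a real submodule (for example the imported alchemtest.gmx) in place of the stand-in should list its accessor functions.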

As an example, we can access the Gromacs: Benzene in water dataset with:

>>> from alchemtest.gmx import load_benzene
>>> bz = load_benzene()

and use the resulting Bunch object to introspect what this dataset includes. In particular, it features a DESCR attribute with a human-readable description of the dataset:

>>> print(bz.DESCR)
Gromacs: Benzene in water
=========================

Benzene in water, alchemically turned into benzene in vacuum separated from water

Notes
-----
Data Set Characteristics:
    :Number of Legs: 2 (Coulomb, VDW)
    :Number of Windows: 5 for Coulomb, 16 for VDW
    :Length of Windows: 40ns

    :Missing Values: None
    :Creator: \I. Kenney
    :Donor: Ian Kenney (ian.kenney@asu.edu)
    :Date: March 2017
    :License: `CC0
              <https://creativecommons.org/publicdomain/zero/1.0/>`_
              Public Domain Dedication

This dataset was generated using `MDPOW <https://github.com/Becksteinlab/MDPOW>`_, with
the `Gromacs <http://www.gromacs.org/>`_ molecular dynamics engine.

as well as the dataset itself:

>>> sorted(bz.data.keys())
['Coulomb', 'VDW']

which consists in this case of two alchemical legs, each having several files. For this dataset each file happens to correspond to a simulation sampling a particular \(\lambda\):

>>> bz.data['Coulomb']
['/usr/local/python3.6/site-packages/alchemtest/gmx/benzene/Coulomb/0000/dhdl.xvg.bz2',
 '/usr/local/python3.6/site-packages/alchemtest/gmx/benzene/Coulomb/0250/dhdl.xvg.bz2',
 '/usr/local/python3.6/site-packages/alchemtest/gmx/benzene/Coulomb/0500/dhdl.xvg.bz2',
 '/usr/local/python3.6/site-packages/alchemtest/gmx/benzene/Coulomb/0750/dhdl.xvg.bz2',
 '/usr/local/python3.6/site-packages/alchemtest/gmx/benzene/Coulomb/1000/dhdl.xvg.bz2']

These paths can be read by any appropriate parser for further analysis. For this particular dataset, see alchemlyb.parsing.gmx for a good set of parsers.
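The paths above end in .bz2, so the files are bzip2-compressed text. Before handing a file to a parser it can be useful to look at its first lines. The sketch below uses only the Python standard library; the helper name head and the demo file contents are made up for illustration and are not real dataset contents.

```python
import bz2
import os
import tempfile


def head(path, n=3):
    """Return the first ``n`` lines of a bz2-compressed text file."""
    lines = []
    with bz2.open(path, "rt") as handle:
        for line in handle:
            if len(lines) >= n:
                break
            lines.append(line.rstrip("\n"))
    return lines


# Hypothetical stand-in for one entry of bz.data['Coulomb'].
tmpdir = tempfile.mkdtemp()
demo_path = os.path.join(tmpdir, "dhdl.xvg.bz2")
with bz2.open(demo_path, "wt") as handle:
    handle.write("# comment line\n@ legend line\n0.0 1.5\n2.0 1.7\n")

first_two = head(demo_path, n=2)
print(first_two)
```

For actual analysis, the alchemlyb Gromacs parsers mentioned above are the intended route; to the best of our knowledge alchemlyb.parsing.gmx.extract_dHdl and extract_u_nk accept such a path together with the simulation temperature T, but check the alchemlyb documentation for the exact signatures.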

Helper functions and classes

A small number of functions and classes are included to help organize the data.

class alchemtest.Bunch(**kwargs)

Container object for datasets

Dictionary-like object that exposes its keys as attributes.

>>> b = Bunch(a=1, b=2)
>>> b['b']
2
>>> b.b
2
>>> b.a = 3
>>> b['a']
3
>>> b.c = 6
>>> b['c']
6

Code taken from sklearn/utils/__init__.py version 0.19.1 under the ‘New BSD license’ https://github.com/scikit-learn/scikit-learn/blob/master/COPYING
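To show how such a container can behave, here is a minimal sketch of a dictionary that exposes its keys as attributes. It is written for illustration under the name MiniBunch and is not the code shipped in alchemtest or scikit-learn.

```python
class MiniBunch(dict):
    """Dictionary whose keys can also be read and written as attributes."""

    def __init__(self, **kwargs):
        super().__init__(kwargs)

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None

    def __setattr__(self, key, value):
        # Attribute assignment stores the value as a dictionary item.
        self[key] = value


b = MiniBunch(a=1, b=2)
b.c = 6
b.a = 3
print(b["a"], b.b, b["c"])
```

Raising AttributeError for unknown keys keeps hasattr() and getattr() with a default working as they do for ordinary objects.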

Gromacs datasets

Gromacs molecular dynamics simulation datasets.

The alchemtest.gmx module features datasets generated using the Gromacs molecular dynamics engine. They can be accessed using the following accessor functions:

load_benzene()

Load the Gromacs benzene dataset.

load_expanded_ensemble_case_1()

Load the Gromacs Host CB7 Guest C3 expanded ensemble dataset, case 1 (single simulation visits all states).

load_expanded_ensemble_case_2()

Load the Gromacs Host CB7 Guest C3 expanded ensemble dataset, case 2 (two simulations visit all states independently).

load_expanded_ensemble_case_3()

Load the Gromacs Host CB7 Guest C3 REX dataset, case 3.

load_water_particle_with_total_energy()

Load the Gromacs water particle with total energy dataset.

load_water_particle_with_potential_energy()

Load the Gromacs water particle with potential energy dataset.

load_water_particle_without_energy()

Load the Gromacs water particle without energy dataset.

Simple TI and FEP

The data sets contain derivatives of the Hamiltonian (TI) and free energy perturbation (FEP) data suitable for processing with FEP estimators as well as BAR/MBAR. Individual \(\lambda\) windows were run independently.
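As a reminder of what the TI data are used for: thermodynamic integration estimates a free energy difference as the integral of the ensemble-averaged derivative \(\langle \partial H / \partial \lambda \rangle\) over \(\lambda\) from 0 to 1. The sketch below approximates that integral with the trapezoid rule over independent windows. The function name ti_trapezoid and the input values are illustrative assumptions; the numbers are synthetic and not taken from any dataset, and this is not the alchemlyb estimator.

```python
def ti_trapezoid(lambdas, mean_dhdl):
    """Integrate <dH/dlambda> over lambda with the trapezoid rule."""
    if len(lambdas) != len(mean_dhdl):
        raise ValueError("need one mean derivative per lambda window")
    total = 0.0
    for i in range(len(lambdas) - 1):
        width = lambdas[i + 1] - lambdas[i]
        total += 0.5 * width * (mean_dhdl[i] + mean_dhdl[i + 1])
    return total


# Synthetic values: <dH/dlambda> = 2 * lambda, whose exact integral over [0, 1] is 1.
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
mean_dhdl = [2.0 * lam for lam in lambdas]
delta_f = ti_trapezoid(lambdas, mean_dhdl)
print(delta_f)
```

In practice each entry of mean_dhdl would be the time average of the dH/dl column from one window, after discarding equilibration and decorrelating the samples.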

Gromacs: Benzene in water

Benzene in water, alchemically turned into benzene in vacuum separated from water

Notes
Data Set Characteristics:
Number of Legs

2 (Coulomb, VDW)

Number of Windows

5 for Coulomb, 16 for VDW

Length of Windows

40ns

System Size

1668 atoms

Temperature

300 K

Pressure

1 bar

Alchemical Pathway

vdw + coul --> vdw --> vacuum

Experimental Hydration Free Energy

-0.90 +- 0.2 kcal/mol

Missing Values

None

Energy unit

kJ/mol

Time unit

ps

Creator

I. Kenney

Donor

Ian Kenney (ian.kenney@asu.edu)

Date

March 2017

License

CC0 Public Domain Dedication

This dataset was generated using MDPOW, with the Gromacs molecular dynamics engine.

Experimental value sourced from [Mobley2013].

Mobley2013

Mobley, David L. (2013). Experimental and Calculated Small Molecule Hydration Free Energies. UC Irvine: Department of Pharmaceutical Sciences, UCI. Retrieved from: http://escholarship.org/uc/item/6sd403pz
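Note that the experimental value above is quoted in kcal/mol while the dataset energies are in kJ/mol. A small conversion sketch follows; the function names are hypothetical, the conversion factor is the thermochemical calorie (4.184 J), and the gas constant is the CODATA value. The thermal energy at the listed temperature of 300 K is included because analysis tools often work in units of kT.

```python
KJ_PER_KCAL = 4.184               # thermochemical calorie, exact by definition
GAS_CONSTANT_KJ = 8.314462618e-3  # molar gas constant in kJ/(mol K)


def kcal_to_kj(value_kcal):
    """Convert an energy from kcal/mol to kJ/mol."""
    return value_kcal * KJ_PER_KCAL


def kt_in_kj(temperature_kelvin):
    """Return the molar thermal energy RT in kJ/mol."""
    return GAS_CONSTANT_KJ * temperature_kelvin


experimental_kj = kcal_to_kj(-0.90)  # experimental hydration free energy
kt_300 = kt_in_kj(300.0)             # thermal energy at the simulation temperature
print(round(experimental_kj, 4), round(kt_300, 4))
```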

alchemtest.gmx.load_benzene()

Load the Gromacs benzene dataset.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch

Expanded ensemble

Data for expanded ensemble simulations; case 1 and case 2 are expanded ensembles in the alchemical parameters, while case 3 uses replica exchange (REX).
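In these simulations a single trajectory moves between alchemical states, so samples usually have to be grouped by the state in which they were recorded before per-state statistics can be computed. The following sketch shows that bookkeeping step only. The function name and the state trajectory are made up for illustration and do not come from the dataset.

```python
from collections import defaultdict


def group_by_state(states):
    """Map each state index to the list of sample indices recorded in that state."""
    groups = defaultdict(list)
    for sample_index, state in enumerate(states):
        groups[state].append(sample_index)
    return dict(groups)


# Synthetic trajectory of visited state indices (not taken from the dataset).
visited = [0, 0, 1, 2, 1, 0, 2, 2]
groups = group_by_state(visited)
counts = {state: len(idx) for state, idx in sorted(groups.items())}
print(counts)
```

Very uneven counts across states are a quick sign that the simulation did not sample all states equally.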

Gromacs: Host CB7 and Guest C3 in water

Host CB7 and Guest C3 in water, Guest C3 alchemically turned into Guest C3 in vacuum separated from water and Host CB7. This unpublished data uses Host CB7 and Guest C3 from [Muddana2014a]. Similar published data can be found in [Monroe2014a].

Notes
Data Set Characteristics:
Number of Legs

2 (Coulomb, VDW)

Number of Windows

32 total, 20 for Coulomb, 12 for VDW

Number of Simulations

1

Length of Simulation

100ns

System Size

8286 atoms

Temperature

300 K

Alchemical Pathway

vdw + coul --> vdw --> vacuum

Missing Values

None

Energy unit

kJ/mol

Time unit

ps

Creator

T. Jensen

Donor

Travis Jensen (travis.jensen@colorado.edu)

Date

November 2017

License

CC0 Public Domain Dedication

This dataset was generated using the expanded ensemble algorithm in the Gromacs molecular dynamics engine.

Muddana2014a

H. Muddana, A. Fenley, D. Mobley, and M. Gilson. The SAMPL4 host–guest blind prediction challenge: an overview. Journal of Computer-Aided Molecular Design, 28(4):305–317, 2014. PMID: 24599514. DOI: 10.1007/s10822-014-9735-1.

Monroe2014a

J. Monroe and M. Shirts. Converging free energies of binding in cucurbit[7]uril and octa-acid host-guest systems from SAMPL4 using expanded ensemble simulations. Journal of Computer-Aided Molecular Design, 28(4):401–415, 2014. PMID: 24610238. DOI: 10.1007/s10822-014-9716-4.

alchemtest.gmx.load_expanded_ensemble_case_1()

Load the Gromacs Host CB7 Guest C3 expanded ensemble dataset, case 1 (single simulation visits all states).

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch

Gromacs: Host CB7 and Guest C3 in water

Host CB7 and Guest C3 in water, Guest C3 alchemically turned into Guest C3 in vacuum separated from water and Host CB7. This unpublished data uses Host CB7 and Guest C3 from [Muddana2014b]. Similar published data can be found in [Monroe2014b].

Notes
Data Set Characteristics:
Number of Legs

2 (Coulomb, VDW)

Number of Windows

32 total, 20 for Coulomb, 12 for VDW

Number of Simulations

2

Length of Simulation

50ns

System Size

8286 atoms

Temperature

300 K

Alchemical Pathway

vdw + coul --> vdw --> vacuum

Missing Values

None

Energy unit

kJ/mol

Time unit

ps

Creator

T. Jensen

Donor

Travis Jensen (travis.jensen@colorado.edu)

Date

November 2017

License

CC0 Public Domain Dedication

This dataset was generated using the expanded ensemble algorithm in the Gromacs molecular dynamics engine.

Muddana2014b

H. Muddana, A. Fenley, D. Mobley, and M. Gilson. The SAMPL4 host–guest blind prediction challenge: an overview. Journal of Computer-Aided Molecular Design, 28(4):305–317, 2014. PMID: 24599514. DOI: 10.1007/s10822-014-9735-1.

Monroe2014b

J. Monroe and M. Shirts. Converging free energies of binding in cucurbit[7]uril and octa-acid host-guest systems from SAMPL4 using expanded ensemble simulations. Journal of Computer-Aided Molecular Design, 28(4):401–415, 2014. PMID: 24610238. DOI: 10.1007/s10822-014-9716-4.

alchemtest.gmx.load_expanded_ensemble_case_2()

Load the Gromacs Host CB7 Guest C3 expanded ensemble dataset, case 2 (two simulations visit all states independently).

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch

Gromacs: Host CB7 and Guest C3 in water

Host CB7 and Guest C3 in water, Guest C3 alchemically turned into Guest C3 in vacuum separated from water and Host CB7. This unpublished data uses Host CB7 and Guest C3 from [Muddana2014c].

Notes
Data Set Characteristics:
Number of Legs

2 (Coulomb, VDW)

Number of Windows

32 total, 20 for Coulomb, 12 for VDW

Number of Simulations

32

Length of Simulation

5ns

System Size

8286 atoms

Temperature

300 K

Alchemical Pathway

vdw + coul --> vdw --> vacuum

Missing Values

None

Energy unit

kJ/mol

Time unit

ps

Creator

T. Jensen

Donor

Travis Jensen (travis.jensen@colorado.edu)

Date

November 2017

License

CC0 Public Domain Dedication

This dataset was generated using the REX algorithm in the Gromacs molecular dynamics engine.

Muddana2014c

H. Muddana, A. Fenley, D. Mobley, and M. Gilson. The SAMPL4 host–guest blind prediction challenge: an overview. Journal of Computer-Aided Molecular Design, 28(4):305–317, 2014. PMID: 24599514. DOI: 10.1007/s10822-014-9735-1.

alchemtest.gmx.load_expanded_ensemble_case_3()

Load the Gromacs Host CB7 Guest C3 REX dataset, case 3.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch

Water particle TI and FEP

Three simple dH/dl and U_nk datasets of a single water particle, taken from simulations of water between two hydrophilic surfaces. One dataset contains a total energy column, one contains a potential energy column, and one does not contain an energy column.

Gromacs: water particle

Free energy estimation of a water particle between two hydrophilic surfaces

Notes
Data Set Characteristics:
Number of Legs

2 (Coulomb, VDW)

Number of Windows

17 for Coulomb, 20 for VDW

Length of Windows

10ns

System Size

3312 atoms

Temperature

300 K

Ensemble

NVT

Volume

70.204 nm^3

Alchemical Pathway

vacuum --> vdw --> vdw + coul

Missing Values

None

Creator

D. Wille

Donor

Dominik Wille (harlor@web.de)

Date

November 2018

License

CC0 Public Domain Dedication

Similar free energy estimations can be found in:

Schlaich2017

Alexander Schlaich, Julian Kappler, and Roland R. Netz. Hydration Friction in Nanoconfinement: From Bulk via Interfacial to Dry Friction. Nano Lett., 2017, 17 (10), pp 5969–5976. DOI: 10.1021/acs.nanolett.7b02000.

alchemtest.gmx.load_water_particle_with_total_energy()

Load the Gromacs water particle with total energy dataset.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch

alchemtest.gmx.load_water_particle_with_potential_energy()

Load the Gromacs water particle with potential energy dataset.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch

alchemtest.gmx.load_water_particle_without_energy()

Load the Gromacs water particle without energy dataset.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch

Amber datasets

Amber molecular dynamics simulation datasets.

The alchemtest.amber module features datasets generated using the Amber molecular dynamics engine. They can be accessed using the following accessor functions:

load_bace_improper()

Load Amber Bace improper solvated vdw example.

load_bace_example()

Load Amber Bace example perturbation.

load_simplesolvated()

Load the Amber solvated dataset.

load_invalidfiles()

Load the invalid files.

Amber: Small molecule thermodynamic integration free energy difference in water

Improper Bace solvated small molecule perturbation, alchemical vdw perturbation of ligand 1 into ligand 2. This example uses ligands CAT-13a to CAT-13m from [Wang2015].

Notes

Data Set Characteristics:
Number of Legs

1 (vdw)

Number of Windows

12

Length of Windows

1ns

System Size

3920 atoms

Temperature

300 K

Pressure

1 bar

Alchemical Pathway

vdw in ligand 1 --> vdw in ligand 2, softcore is used in vdw

Experimental Free Energy difference

N/A

Missing Values

None

Energy unit

kcal/mol

Time unit

ps

Date

Jan 2018

Donor

Silicon Therapeutics

License

CC0 Public Domain Dedication

This dataset was generated using the Amber molecular dynamics engine.

Wang2015

L. Wang, Y. Wu, Y. Deng, B. Kim, L. Pierce, G. Krilov, D. Lupyan, S. Robinson, M. K. Dahlgren, J. Greenwood, D. L. Romero, C. Masse, J. L. Knight, T. Steinbrecher, T. Beuming, W. Damm, E. Harder, W. Sherman, M. Brewer, R. Wester, M. Murcko, L. Frye, R. Farid, T. Lin, D. L. Mobley, W. L. Jorgensen, B. J. Berne, R. A. Friesner, and R. Abel. Accurate and reliable prediction of relative ligand binding potency in prospective drug discovery by way of a modern free-energy calculation protocol and force field. Journal of the American Chemical Society, 137(7):2695–2703, 2015. PMID: 25625324. DOI: 10.1021/ja512751q.

alchemtest.amber.load_bace_improper()

Load Amber Bace improper solvated vdw example.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ‘data’ : the data files for improper solvated vdw alchemical leg

Return type

Bunch

Amber: Small molecule thermodynamic integration free energy difference in water

Bace complex and solvated small molecule perturbation, alchemical perturbation of ligand 1 into ligand 2. This example uses ligands CAT-13d to CAT-17a from [Wang2015].

Notes

Data Set Characteristics:
Number of Legs

3 (decharge, vdw, recharge)

Number of Windows

5 for decharge, 12 for vdw, 5 for recharge

Length of Windows

1ns

System Size

46594 atoms (complex), 4115 atoms (solvated)

Temperature

300 K

Pressure

1 bar

Alchemical Pathway

(decharge + vdw + recharge) in ligand 1 --> (decharge + vdw + recharge) in ligand 2; decharge, vdw, and recharge are run in parallel, soft core is used in vdw

Experimental Free Energy difference

-0.26 kcal/mol

Missing Values

None

Energy unit

kcal/mol

Time unit

ps

Date

Jan 2018

Donor

Silicon Therapeutics

License

CC0 Public Domain Dedication

This dataset was generated using the Amber molecular dynamics engine.

alchemtest.amber.load_bace_example()

Load Amber Bace example perturbation.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ‘data’ : the data files by system and alchemical leg

Return type

Bunch
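Since this dataset organizes its files by system and by alchemical leg, a loop over both levels is needed to visit every file. The sketch below assumes a nested mapping of the form system -> leg -> list of paths, which is an assumption about the layout rather than something stated here; the function name and the file names in the example are placeholders.

```python
def iter_leg_files(data):
    """Yield (system, leg, path) for a mapping of system -> leg -> list of paths."""
    for system in sorted(data):
        for leg in sorted(data[system]):
            for path in data[system][leg]:
                yield system, leg, path


# Hypothetical layout and file names, for illustration only.
example = {
    "solvated": {"vdw": ["s_vdw_0.out"], "decharge": ["s_dc_0.out", "s_dc_1.out"]},
    "complex": {"vdw": ["c_vdw_0.out"]},
}
rows = list(iter_leg_files(example))
for row in rows:
    print(row)
```

Sorting the keys makes the iteration order reproducible, which is convenient when the result feeds into parametrized tests.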

Amber: Small molecule thermodynamic integration free energy difference in water

Small molecule perturbation in water, alchemically turned ligand 1 into ligand 2 in water. This example uses ligands 17124-1 to 18637-1 from [Wang2015].

Notes

Data Set Characteristics:
Number of Legs

2 (charge, vdw)

Number of Windows

5 for charge, 12 for vdw

Length of Windows

1ns

System Size

5979 atoms

Temperature

300 K

Pressure

1 bar

Alchemical Pathway

(charge + vdw) in ligand 1 --> (charge + vdw) in ligand 2; charge and vdw are run in parallel, soft core is used in vdw

Experimental Free Energy difference

N/A

Missing Values

None

Energy unit

kcal/mol

Time unit

ps

Date

Oct 2017

Donor

Silicon Therapeutics

License

CC0 Public Domain Dedication

This dataset was generated using the Amber molecular dynamics engine.

alchemtest.amber.load_simplesolvated()

Load the Amber solvated dataset.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch

Amber TI invalid output files

Examples for file validation testing.

Notes

  • invalid-case-1.out.bz2: file contains no useful data

  • invalid-case-2.out.bz2: file contains no control data

  • invalid-case-3.out.bz2: file with non-constant temperature

  • invalid-case-4.out.bz2: file with no free energy section

  • invalid-case-5.out.bz2: file with no ATOMIC section

  • invalid-case-6.out.bz2: file with no RESULTS section
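A validation test over files like these typically runs a parser on each one and records which inputs it rejects. The sketch below shows that pattern with a toy parser. How a real parser signals an invalid file (an exception, a warning, or an empty result) is not specified here, so the ValueError convention and both function names are assumptions.

```python
def find_rejected(paths, parser):
    """Return the paths for which ``parser`` raises ValueError."""
    rejected = []
    for path in paths:
        try:
            parser(path)
        except ValueError:
            rejected.append(path)
    return rejected


def toy_parser(path):
    # Hypothetical stand-in: a real parser would inspect the file contents.
    if "invalid" in path:
        raise ValueError(f"cannot parse {path}")
    return path


paths = ["invalid-case-1.out.bz2", "valid.out.bz2", "invalid-case-2.out.bz2"]
rejected = find_rejected(paths, toy_parser)
print(rejected)
```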

alchemtest.amber.load_invalidfiles()

Load the invalid files.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the example of invalid data files

  • ’DESCR’: the full description of the dataset

Return type

Bunch

NAMD datasets

NAMD molecular dynamics simulation datasets.

The alchemtest.namd module features datasets generated using the NAMD molecular dynamics engine. They can be accessed using the following accessor functions:

load_tyr2ala()

Load the NAMD tyrosine to alanine mutation dataset.

NAMD: free energy of tyrosine to alanine mutation in aqueous solution

Free energy change from mutating a tyrosine (Y) residue into alanine (A) in the Ala-Tyr-Ala tripeptide in aqueous environment.

Notes

Data Set Characteristics:
Number of Legs

2 (forward Y-->A, backward A-->Y)

Number of Windows

20 for each leg

Length of Windows

1000 ps (each window interspersed with 200 ps equilibration)

System Size

1521 atoms

Temperature

300 K

Pressure

1 bar

Alchemical Pathway

Point mutation of Tyr to Ala using dual topology hybrid molecule. Nonbonded interactions of perturbed atoms are scaled with their environment.

Experimental Free Energy difference

N/A

Missing Values

None

Energy unit

kcal/mol

Time unit

step

Date

Oct 2017

Donor

JC Gumbart

License

CC0 Public Domain Dedication

This dataset was generated using the NAMD molecular dynamics engine.

alchemtest.namd.load_tyr2ala()

Load the NAMD tyrosine to alanine mutation dataset.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch

GOMC datasets

GOMC Monte Carlo simulation datasets.

The alchemtest.gomc module features datasets generated using the GPU Optimized Monte Carlo (GOMC) simulation engine. They can be accessed using the following accessor functions:

load_benzene()

Load the GOMC benzene dataset.

Simple TI and FEP

The data sets contain derivatives of the Hamiltonian (TI) and free energy perturbation (FEP) data suitable for processing with FEP estimators as well as BAR/MBAR. Individual \(\lambda\) windows were run independently.

GOMC: Benzene in water

Hydration free energy of benzene using the TraPPE-EH model and the SPC water model.

Notes
Data Set Characteristics:
Number of Legs

2 (Coulomb, VDW)

Number of Windows

7 for Coulomb, 15 for VDW

Length of Windows

50 million Monte Carlo steps

System Size

1001 molecules

Temperature

298 K

Pressure

1 bar

Alchemical Pathway

vacuum --> vdw --> vdw + coul

Experimental Hydration Free Energy

-0.90 +- 0.2 kcal/mol

Missing Values

None

Energy unit

kJ/mol

Time unit

Monte Carlo steps

Creator

M. Soroush Barhaghi

Donor

Mohammad Soroush Barhaghi (m.soroush@wayne.edu)

Date

July 2019

License

CC0 Public Domain Dedication

This dataset was generated using the GOMC Monte Carlo simulation engine.

Experimental value sourced from [Mobley2013].

alchemtest.gomc.load_benzene()

Load the GOMC benzene dataset.

Returns

data – Dictionary-like object, the interesting attributes are:

  • ’data’ : the data files by alchemical leg

  • ’DESCR’: the full description of the dataset

Return type

Bunch