Contributing new data sets

We are looking for new data sets. Please read the following and consider contributing data; details are described under Process.

Types of systems

The ideal set of files would be something like the GROMACS dataset alchemtest.gmx: benzene in water, run for 1…10 ns per window, with \(\partial H/\partial\lambda\) saved every 10 ps. For GROMACS we tend to put each lambda in a separate directory (see the directory layout in alchemtest/gmx/benzene), but you should provide files that are typical of how the specific code is run.
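For orientation, a hypothetical dataset directory (all names here are invented for illustration; use whatever layout is natural for your code) could look like:

```
MYDATASET/
    descr.rst                  # dataset description (reST)
    Coulomb/                   # one part of the calculation
        0.00/dhdl.xvg.bz2      # one directory per lambda window
        0.25/dhdl.xvg.bz2
        ...
    VDW/
        0.00/dhdl.xvg.bz2
        ...
```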

Documentation

Add

  • a brief explanation of how you would analyze the data with alchemlyb or your own tool (show Python commands or the full command with options so that we can reproduce) and

  • the value(s) that you get so that we know the ground truth.

Comment on what to look out for in the output files (knowing what is what in the files helps). If you have links to where the format is defined, please let us know.
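As a sketch of the kind of reproducible analysis and ground-truth value we mean, here is a minimal thermodynamic-integration estimate in plain numpy; every number below is made up for illustration and is not from a real simulation:

```python
import numpy as np

# Hypothetical <dH/dlambda> window averages -- made-up numbers,
# not real data; report the values you actually measured.
lambdas = np.linspace(0.0, 1.0, 11)   # 11 equally spaced lambda windows
mean_dhdl = 10.0 * lambdas            # stand-in averages in kJ/mol

# Thermodynamic integration: Delta A = integral of <dH/dlambda> d(lambda),
# estimated here with the trapezoid rule.
delta_A = np.sum((mean_dhdl[1:] + mean_dhdl[:-1]) / 2.0 * np.diff(lambdas))
print(f"Delta A = {delta_A:.2f} kJ/mol")   # about 5.0 kJ/mol for these numbers
```

Whatever tool you actually use (alchemlyb or your own), the point is the same: show the exact commands and state the resulting value(s).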

In general, follow the example of the existing data sets (especially similar data sets or ones for the same MD/MC code) and discuss the specifics in an initial Pull Request.

Licensing

Finally, because we want to make the data part of the actual tests that are run every time new code is committed to the repository, the data need to be made available under an open license, preferably CC0 (public domain) or CC-BY (attribution required). The dataset will carry the license and your authorship.

At the moment, all included data sets are in the public domain via CC0.

Process

  1. Raise an issue in the alchemtest issue tracker proposing the new data set. All discussion of the data set will take place in this issue.

  2. Fork the alchemtest repo and create a branch for your dataset.

  3. Add your dataset to your branch. Follow the existing layout.

    • Choose a top-level directory. If your data files are for GROMACS, add them to alchemtest/gmx; for NAMD, to alchemtest/namd; and so on. If you are contributing data for a code that is not yet supported, create a new directory.

    • Create a subdirectory for your dataset; choose a good, short name for both the dataset and the directory.

      • Create one or more additional directories inside your dataset directory for your actual data files; do whatever seems natural for your problem.

      • Copy your data files to the appropriate subdirectories. Consider compressing them with gzip or bzip2 (alchemlyb can read compressed files).

      • Check the MANIFEST.in: make sure that the line

        recursive-include src/alchemtest *.gz *.bz2 *.zip *.rst *.txt *.out *.xvg
        

        will include your files in the package: if your filename extension(s) are not matched, add them to this line.

      • Create a restructured text (reST) file descr.rst that describes the dataset. Look at other description files as examples: copy one that is close to what you need and modify it. The description will show up in the online documentation and will be part of the dataset Bunch.

    • Add an accessor function load_MYDATASET() to the access.py file at the top of the code directory. The accessor function makes the dataset available as a dict under the data key of the Bunch. The data are typically another dict in which different parts of a calculation (such as the Coulomb and VDW parts) are separate keys. All files that are needed for a single free energy calculation are stored in a list under the appropriate key. The description text is stored under the DESCR key.

      Again, copy an existing function and modify.

    • Add a line from .access import load_MYDATASET to the __init__.py at the top of the code directory to make your accessor function part of alchemtest.
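Putting the pieces of step 3 together, a hypothetical accessor might look roughly like the following. The Bunch class sketched here only mimics the attribute-access dictionary that alchemtest's load_* functions return, and all dataset names, subdirectories, and file patterns are invented for illustration; copy a real accessor from access.py instead of this sketch:

```python
from glob import glob
from os.path import dirname, join


class Bunch(dict):
    """Dictionary that also exposes its keys as attributes (a sketch of the
    container that alchemtest's load_* functions return)."""
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)


def load_MYDATASET():
    """Load the hypothetical MYDATASET files (illustrative only)."""
    module_path = dirname(__file__)
    data = {
        # all files needed for one free energy calculation go in a list
        # under the appropriate key
        'Coulomb': sorted(glob(join(module_path, 'MYDATASET',
                                    'Coulomb', '*', '*.xvg.bz2'))),
        'VDW': sorted(glob(join(module_path, 'MYDATASET',
                                'VDW', '*', '*.xvg.bz2'))),
    }
    with open(join(module_path, 'MYDATASET', 'descr.rst')) as f:
        descr = f.read()
    return Bunch(data=data, DESCR=descr)
```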

  4. Locally test that you can load your dataset:

    from alchemtest.MYCODE import load_MYDATASET
    d = load_MYDATASET()
    print(d.DESCR)
    print(d.data)
    

    You should see your description and the full paths to your data files (possibly nested inside another dictionary). It should be possible to work with your dataset as shown under Basic usage.

    Try building the documentation with

    python setup.py build_sphinx
    

    and look at the docs in build/sphinx/html/index.html.

    Check that your documentation is visible. If not, it’s possible that another page needs to be added to the docs — just move ahead with the next step and ask in the comments on your Pull Request and we will help.

  5. Create a Pull Request with your new code and files.

  6. Engage in the code review — we might have questions, suggestions, and requests for revisions to ensure that your contribution fits into the library.

  7. Once your PR is accepted, a developer will merge it and your dataset becomes part of alchemtest — congratulations!