Contributing new data sets

We are looking for new data sets. Please read the following and consider contributing data; details are described under Process.

Types of systems

The ideal set of files would be something like the GROMACS dataset alchemtest.gmx: benzene in water for 1…10 ns per window, with \(\partial H/\partial\lambda\) saved every 10 ps. For GROMACS we tend to put each lambda in a separate directory (see the directory layout in alchemtest/gmx/benzene), but you should provide files that are typical of how the specific code is run.

Data set description

For your data set, you should be able to include the following in a brief description (which will become part of the data set and the documentation as described in more detail in Process):

  • Include the value(s) that you get when analyzing the data set yourself so that we know the ground truth.

    It is very helpful if you include a brief explanation of how you analyze the data with alchemlyb or your own tool (show Python commands or the full command with options so that one can reproduce the analysis if necessary).

  • State how the data set was generated. Include the temperature.

  • Comment on what to look out for in the output files, e.g., special sampling options.

  • For new file formats: Include information about the file format definition, such as links or paper citations.

In general, follow the example of the existing data sets (especially similar data sets or ones for the same MD/MC code) and discuss the specifics in an initial Pull Request.
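As an illustration of the kind of analysis commands to include, here is a minimal pure-Python sketch of thermodynamic integration with the trapezoidal rule. The ⟨∂H/∂λ⟩ values are made up for illustration only; in practice you would show the alchemlyb commands (or your own tool's command line) that you actually ran.

```python
# Minimal TI sketch with made-up numbers; a real analysis would use
# alchemlyb's estimators or the tool named in your data set description.
lambdas   = [0.0, 0.25, 0.5, 0.75, 1.0]   # lambda windows
mean_dhdl = [10.0, 6.0, 3.0, 1.0, 0.0]    # hypothetical <dH/dl> per window (kJ/mol)

# Thermodynamic integration via the trapezoidal rule:
# Delta G = integral from 0 to 1 of <dH/dl> d(lambda)
delta_G = sum(
    0.5 * (mean_dhdl[i] + mean_dhdl[i + 1]) * (lambdas[i + 1] - lambdas[i])
    for i in range(len(lambdas) - 1)
)
print(delta_G)  # 3.75 (kJ/mol) for these made-up values
```

Whatever the method, report the resulting value(s) together with the exact commands, so the number can be reproduced and used as the ground truth in tests.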

Licensing

Finally, because we want to make the data part of the actual tests that are run every time new code is committed to the repository, the data must be made available under an open license (preferably CC0 (public domain) or CC-BY (attribution required)). The dataset will carry the license and your authorship.

At the moment, all included data sets are in the public domain via CC0.

Process

  1. Raise an issue in the alchemtest issue tracker proposing the new data set. All discussion will take place in this issue.

  2. Fork the alchemtest repo and create a branch for your dataset.

  3. Add your dataset to your branch. Follow the existing layout.

    • Choose a top-level directory. If your data files are for GROMACS, add them to alchemtest/gmx; for NAMD, to alchemtest/namd; etc. If you are adding support for a new code, create a new directory.

    • Create a subdirectory for your dataset; choose a good, short name for both the dataset and the directory.

      • Create one or more additional directories inside your dataset directory for your actual data files; do whatever seems natural for your problem.

      • Copy your data files to the appropriate subdirectories. Consider compressing them with gzip or bzip2 (alchemlyb can read compressed files).
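        A hypothetical layout (the directory and file names are placeholders, and a Coulomb/VDW split is just one natural way to organize the files):

        ```
        alchemtest/MYCODE/MYDATASET/
        ├── descr.rst
        ├── coul/
        │   ├── dhdl.0.xvg.bz2
        │   └── dhdl.1.xvg.bz2
        └── vdw/
            ├── dhdl.0.xvg.bz2
            └── dhdl.1.xvg.bz2
        ```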

      • Check the MANIFEST.in: make sure that the line

        recursive-include alchemtest *.gz *.bz2 *.zip *.rst *.txt *.out *.xvg
        

        includes your files in the package. If your filename extension(s) are not matched, add them.
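        For example, if your code wrote files with a hypothetical .fep extension, you would extend the line to:

        ```
        recursive-include alchemtest *.gz *.bz2 *.zip *.rst *.txt *.out *.xvg *.fep
        ```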

      • Create a reStructuredText (reST) file descr.rst that describes the dataset. Look at other description files as examples: copy one that is close to what you need and modify it. The description will show up in the online documentation and will be part of the dataset Bunch.

    • Add an accessor function load_MYDATASET() to the access.py file at the top of the code directory. The accessor function makes the dataset available as a dict under the data key of the Bunch. The data are typically another dict, with different parts of a calculation (such as the Coulomb and VDW parts) as separate keys; all files that are needed for a single free energy calculation are in a list under the appropriate key. The description text is added under the DESCR key.

      Again, copy an existing function and modify.
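      A rough, self-contained sketch of what such an accessor can look like (the names, the Coulomb/VDW split, and the placeholder description are assumptions; a real accessor in access.py reads descr.rst and lists the actual data files — copy an existing one rather than this sketch):

```python
from os.path import join


class Bunch(dict):
    """Minimal stand-in for the Bunch used by alchemtest:
    a dict whose keys are also accessible as attributes."""
    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


def load_MYDATASET(module_path="."):
    # In the package, module_path would be os.path.dirname(__file__),
    # i.e. the alchemtest/MYCODE directory that contains access.py.
    data = {
        "Coulomb": [join(module_path, "MYDATASET", "coul",
                         "dhdl.{}.xvg.bz2".format(i)) for i in range(3)],
        "VDW": [join(module_path, "MYDATASET", "vdw",
                     "dhdl.{}.xvg.bz2".format(i)) for i in range(3)],
    }
    # The real accessor reads the description from descr.rst next to the
    # data; a placeholder string keeps this sketch self-contained.
    descr = "MYDATASET: placeholder description"
    return Bunch(data=data, DESCR=descr)


d = load_MYDATASET()
print(sorted(d.data))  # ['Coulomb', 'VDW']
```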

    • Add a from .access import load_MYDATASET line to the __init__.py of the code directory to make your accessor function part of alchemtest.

  4. Locally test that you can load your dataset:

    from alchemtest.MYCODE import load_MYDATASET
    d = load_MYDATASET()
    print(d.DESCR)
    print(d.data)
    

    You should see your description and the full paths to your data files (possibly inside another dictionary). It should be possible to work with your dataset as shown under Basic usage.

    Try building the documentation with

    python setup.py build_sphinx
    

    and look at the docs in build/sphinx/html/index.html.

    Check that your documentation is visible. If not, it’s possible that another page needs to be added to the docs — just move ahead with the next step and ask in the comments on your Pull Request and we will help.

  5. Create a Pull Request with your new code and files.

  6. Add a test that checks that your files can be found. Look in the alchemtest/tests directory and follow the examples that are already there. We are also happy to help you with this step — just ask.

    You can run the tests locally with pytest and you will also see that the tests are run on your PR.
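    One way to sketch the core of such a check (missing_files is a hypothetical helper; the actual tests in alchemtest/tests are pytest functions built around the accessor, so follow those examples):

```python
import os


def missing_files(data):
    """Return all paths in a dataset's `data` dict that do not exist on disk."""
    return [path
            for file_list in data.values()
            for path in file_list
            if not os.path.exists(path)]


# In alchemtest/tests this would be wrapped in a pytest test, e.g.
#   def test_files_exist():
#       assert not missing_files(load_MYDATASET().data)
print(missing_files({"Coulomb": ["no/such/file.xvg"]}))  # ['no/such/file.xvg']
```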

  7. Engage in the code review — we might have questions, suggestions, and requests for revisions to ensure that your contribution fits into the library.

  8. Once your PR is accepted, a developer will merge it and your dataset becomes part of alchemtest. Congratulations!