Contributing new data sets
We are looking for new data sets. Please read the following and consider contributing data; details are described under Process.
Types of systems
The ideal set of files would be something like the GROMACS dataset
alchemtest.gmx
: benzene in water for 1–10 ns per
window, with \(\partial H/\partial\lambda\) saved every 10 ps. For
GROMACS we tend to put each lambda in a separate directory (see the
directory layout in alchemtest/gmx/benzene), but you should provide
files that are typical of how the specific code is run.
Documentation
Add a brief explanation of how you would analyze the data with alchemlyb or your own tool (show the Python commands or the full command line with options so that we can reproduce the analysis), and the value(s) that you obtain so that we know the ground truth.
Comment on what to look out for in the output files (knowing what is what in the files helps). If you have links to where the format is defined, please let us know.
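As an illustration of the kind of ground-truth value we are after, here is a minimal, self-contained sketch of a thermodynamic integration (TI) estimate from per-window averages of \(\partial H/\partial\lambda\). All numbers are made up for the example and do not come from any real dataset:

```python
# Hypothetical lambda windows and per-window averages of dH/dlambda
# (kJ/mol); in a real contribution these come from your simulations.
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
dhdl_avg = [12.0, 8.5, 5.0, 2.5, 1.0]

def trapezoid(x, y):
    """Integrate y(x) with the trapezoid rule."""
    return sum((x[i + 1] - x[i]) * (y[i + 1] + y[i]) / 2.0
               for i in range(len(x) - 1))

delta_G = trapezoid(lambdas, dhdl_avg)
print(f"TI estimate: Delta G = {delta_G:.2f} kJ/mol")
```

Reporting a single number like this (together with the exact commands that produced it) is what lets us turn your dataset into a regression test.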
In general, follow the example of the existing data sets (especially similar data sets or ones for the same MD/MC code) and discuss the specifics on an initial Pull Request.
Licensing
Finally, because we want to make the data part of the actual tests that are run every time new code is committed to the repository, the data must be made available under an open license (preferably CC0 (public domain) or CC-BY (attribution required)). The dataset will carry the license and your authorship.
At the moment, all included data sets are in the public domain via CC0.
Process
Raise an issue in the alchemtest issue tracker proposing the new data set. All discussion will take place in this issue.
Fork the alchemtest repo and create a branch for your dataset.
Add your dataset to your branch. Follow the existing layout.
Choose a top-level directory. If your data files are for GROMACS, add them to alchemtest/gmx; for NAMD, to alchemtest/namd; and so on. If you support a new code, create a new directory.
Create a subdirectory for your dataset; choose a good, short name for both the dataset and the directory.
Create one or more additional directories inside your dataset directory for your actual data files; do whatever seems natural for your problem.
Copy your data files to the appropriate subdirectories. Consider compressing them with gzip or bzip2 (alchemlyb can read compressed files).
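As a sketch of the compression step (the dhdl.xvg file and its contents below are hypothetical stand-ins created in a temporary directory), the gzip module from the Python standard library can do the job, and the compressed file stays readable as text:

```python
import gzip
import os
import shutil
import tempfile

# create a hypothetical stand-in for a GROMACS dhdl.xvg output file
tmpdir = tempfile.mkdtemp()
xvg = os.path.join(tmpdir, "dhdl.xvg")
with open(xvg, "w") as f:
    f.write('@ title "dH/dl"\n0.0  12.0\n')

# compress dhdl.xvg -> dhdl.xvg.gz and keep only the compressed copy
with open(xvg, "rb") as src, gzip.open(xvg + ".gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
os.remove(xvg)

# the compressed file can still be read as text
with gzip.open(xvg + ".gz", "rt") as f:
    print(f.readline().rstrip())   # prints the xvg header line
```

Running `gzip dhdl.xvg` on the command line achieves the same thing; the point is that shipping compressed files keeps the repository small while remaining directly readable.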
Check the
MANIFEST.in
: make sure that the line

recursive-include src/alchemtest *.gz *.bz2 *.zip *.rst *.txt *.out *.xvg

will include your files in the package. If your filename extension(s) are not matched, add them.
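A quick way to check this, sketched here with fnmatch from the standard library and hypothetical file names, is to test which of your files no pattern would match:

```python
import fnmatch

# the extension patterns from the recursive-include line in MANIFEST.in
patterns = ["*.gz", "*.bz2", "*.zip", "*.rst", "*.txt", "*.out", "*.xvg"]

def unmatched(filenames, patterns):
    """Return the files that none of the MANIFEST patterns would include."""
    return [name for name in filenames
            if not any(fnmatch.fnmatch(name, pat) for pat in patterns)]

# hypothetical dataset files; a new extension such as .fepout would
# show up here and would need its own pattern added to MANIFEST.in
files = ["descr.rst", "lambda_00/dhdl.xvg.gz", "out.fepout"]
print(unmatched(files, patterns))
```

Any file name printed by this sketch would be silently left out of the package, so add a matching pattern for it.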
Create a restructured text (reST) file
descr.rst
that describes the dataset. Look at other description files as examples: copy one that is close to what you need and modify it. The description will show up in the online documentation and will be part of the dataset Bunch
.
Add an accessor function
load_MYDATASET()
to the access.py
file at the top of the code directory. The accessor function makes the dataset available as a dict
under the data key in the Bunch
. The data are typically another dict
with different parts of a calculation, such as the Coulomb and VDW parts, as separate keys. All files that are needed for a single free energy calculation are kept in a list
under the appropriate key. The description text is added under the DESCR key. Again, copy an existing function and modify it.
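A minimal sketch of what such an accessor might look like: MYDATASET, the subdirectory names, and the file patterns are placeholders, and the Bunch class below is a simple stand-in for the one that alchemtest provides (copy a real accessor from access.py rather than this sketch):

```python
from glob import glob
from os.path import dirname, join

class Bunch(dict):
    """dict that also exposes its keys as attributes (stand-in)."""
    def __getattr__(self, key):
        return self[key]

def load_MYDATASET():
    """Load the MYDATASET files and return them in a Bunch."""
    module_path = dirname(__file__)   # directory holding access.py and the data
    data = {
        # one list of files per part of the calculation
        "Coulomb": sorted(glob(join(module_path, "MYDATASET", "coul", "*.xvg.gz"))),
        "VDW": sorted(glob(join(module_path, "MYDATASET", "vdw", "*.xvg.gz"))),
    }
    # the description text from descr.rst goes under the DESCR key
    with open(join(module_path, "MYDATASET", "descr.rst")) as f:
        descr = f.read()
    return Bunch(data=data, DESCR=descr)
```

The nesting (dict of lists) mirrors the structure described above: one key per part of the calculation, one list of files per free energy calculation.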
Add a line

from .access import load_MYDATASET

to the top-level __init__.py
to make your accessor function part of alchemtest.
Locally test that you can load your dataset:
from alchemtest.MYCODE import load_MYDATASET
d = load_MYDATASET()
print(d.DESCR)
print(d.data)
You should see your description and the full path to your datafiles (possibly inside another dictionary). It should be possible to work with your dataset as shown under Basic usage.
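If you want to double-check the file lists, a small sketch like the following walks a nested data dict and reports paths that do not exist on disk. The dict below is a hypothetical stand-in for d.data from your accessor:

```python
import os

def iter_files(data):
    """Yield every file path from a nested dict/list structure."""
    if isinstance(data, dict):
        for value in data.values():
            yield from iter_files(value)
    elif isinstance(data, (list, tuple)):
        for item in data:
            yield from iter_files(item)
    else:
        yield data

# hypothetical stand-in for d.data
data = {"Coulomb": ["/tmp/coul/dhdl_00.xvg.gz"],
        "VDW": ["/tmp/vdw/dhdl_00.xvg.gz"]}
missing = [path for path in iter_files(data) if not os.path.exists(path)]
print("missing files:", missing)
```

An empty `missing` list means every file your accessor advertises is actually shipped.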
Try building the documentation with
python setup.py build_sphinx
and look at the docs in
build/sphinx/html/index.html
. Check that your documentation is visible. If not, it is possible that another page needs to be added to the docs; just move ahead with the next step, ask in the comments on your Pull Request, and we will help.
Create a Pull Request with your new code and files.
Engage in the code review — we might have questions, suggestions, and requests for revisions to ensure that your contribution fits into the library.
Once your PR is accepted, it will be merged by a developer and your dataset becomes part of alchemtest. Congratulations!