Configure
-
Make a directory for your ingest, using the source of the data as the name:
mkdir src/monarch_ingest/ingests/<source>
For example:
mkdir src/monarch_ingest/ingests/ncbi
-
Add data sources to src/monarch_ingest/download.yaml:
    # <source>
    - url: https://<source>.com/downloads/somedata.txt
      local_name: data/<source>/somedata.txt
      tag: <source>_<ingest>
For example:
    # mgi
    - url: http://www.informatics.jax.org/downloads/reports/MRK_Reference.rpt
      local_name: data/mgi/MRK_Reference.rpt
      tag: mgi_publication_to_gene
Note: You can now use ingest download --tags <tag> or ingest download --all, and your data will be downloaded to the appropriate subdirectory in data/
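A quick sanity check before running the download can catch a malformed entry. The following is a minimal sketch, not part of monarch_ingest: the helper check_download_entry is hypothetical, and it assumes the entries have already been parsed from download.yaml into Python dictionaries (for instance with a YAML parser). It also assumes, based on the examples above, that local_name always sits under data/.

```python
REQUIRED_KEYS = ("url", "local_name", "tag")


def check_download_entry(entry: dict) -> list[str]:
    """Return a list of problems found in one download.yaml entry (hypothetical helper)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if not entry.get(key)]
    local_name = entry.get("local_name", "")
    # Assumption: downloaded files are expected under data/, as in the examples above.
    if local_name and not local_name.startswith("data/"):
        problems.append(f"local_name should start with data/: {local_name}")
    return problems


# The mgi entry from the example above, written as a parsed dictionary.
mgi_entry = {
    "url": "http://www.informatics.jax.org/downloads/reports/MRK_Reference.rpt",
    "local_name": "data/mgi/MRK_Reference.rpt",
    "tag": "mgi_publication_to_gene",
}

print(check_download_entry(mgi_entry))  # prints []
```

An empty list means the entry has all three fields and a local_name in the expected place.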
-
Add your ingest to src/monarch_ingest/ingests.yaml:
    <ingest_name>:
      config: 'ingests/<source>/<ingest>.yaml'
For example:
    ncbi_gene:
      config: 'ingests/ncbi/gene.yaml'
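The ingests.yaml entry follows a predictable shape, so it can be generated from the source and ingest names. This is a minimal sketch under the assumption, inferred only from the ncbi_gene example, that the ingest name is <source>_<ingest>. The function ingest_entry is hypothetical and not part of monarch_ingest.

```python
def ingest_entry(source: str, ingest: str) -> str:
    """Build the YAML text for one ingests.yaml entry (hypothetical helper)."""
    # Assumption: the ingest name is "<source>_<ingest>", as in ncbi_gene.
    name = f"{source}_{ingest}"
    config_path = f"ingests/{source}/{ingest}.yaml"
    return f"{name}:\n  config: '{config_path}'\n"


# Reproduces the ncbi_gene example entry.
print(ingest_entry("ncbi", "gene"))
```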
-
Copy the template:
cp ingest_template/* src/monarch_ingest/ingests/<source>
-
Edit metadata.yaml:
- Update the description, rights link, URL, etc., and then add your source_file
-
Edit the source file YAML:
- Match the columns or required fields with what's available in the file to be ingested
- If it's an ingest that exists in Dipper, check out what Dipper does.
- Check the Biolink Model documentation to look at what you can capture
- If what we need from an ingest can't be captured in the model yet, make a new Biolink issue
- Set the header properties
  - If there is no header at all, set header: False
  - If there are comment lines before the header, count them and set skip_lines: {n}
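Counting comment lines by hand is error-prone for large files. The following is a minimal sketch of how the skip_lines value could be determined, assuming comment lines start with "#" (the file being ingested may use a different prefix). The helper count_leading_comment_lines is hypothetical and not part of monarch_ingest.

```python
import io
from typing import TextIO


def count_leading_comment_lines(handle: TextIO, prefix: str = "#") -> int:
    """Count consecutive lines at the top of a file that start with the comment prefix."""
    count = 0
    for line in handle:
        if not line.startswith(prefix):
            break  # first non-comment line: this is the header (or the first record)
        count += 1
    return count


# Illustrative data only: two comment lines, then a header row, then one record.
sample = io.StringIO("# generated report\n# version 1\nid\tname\n1\tfoo\n")
print(count_leading_comment_lines(sample))  # prints 2
```

The returned number is the value to use for skip_lines. For a real file, pass an open file handle instead of the in-memory sample.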
--
Next step: Adding documentation