Discombobulator
Discombobulate a column of the original data: use text mining to find HPO terms in each cell and make one output column for each identified HPO term.

In the following example, "Book2.xlsx" is an Excel file derived from an original publication. It has a column called "Cardiac defect", some of whose cells contain items such as Ventricular septal defect, Atrial septal defect, and Patent foramen ovale. Other cells contain codes (here, "na" and "UN") that indicate that no information is available; for these we want to output "na". The assumeExcluded argument means that if an observation was made (e.g., echocardiography), then all items are assumed to be excluded except those named in the cell.

The decode method returns a pandas DataFrame with one column for the patient identifier, one column for each identified HPO term, and, as the last column, the original column, which can be used to vet the results. The columns can be inspected and then added to the pyphetools Excel template once any necessary revisions have been made.
import pandas as pd
from pyphetools.creation import Discombobulator

df = pd.read_excel("../../Book2.xlsx")
dc = Discombobulator(df=df, individual_id="individual column name")  # name of the column that holds the patient identifier
cardiac = dc.decode(column="Cardiac defect", trueNa={"na", "UN"}, assumeExcluded=True)
cardiac.to_excel("cardiac.xlsx")  # write the result out for inspection and vetting
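To make the shape of the result concrete, the following is a minimal, standard-library sketch of the behaviour described above. It is not the pyphetools implementation: the function name toy_decode, the exact-match lookup that stands in for text mining, and the status strings "observed" and "excluded" are assumptions made for illustration. Only the "na" output and the column layout come from the description.

```python
# Toy illustration only. The label list, the exact-match lookup, and the
# "observed"/"excluded" strings are assumptions, not pyphetools behaviour.
KNOWN_LABELS = ["Ventricular septal defect", "Atrial septal defect", "Patent foramen ovale"]


def toy_decode(rows, id_key, column, true_na=("na",), delim=",", assume_excluded=False):
    lookup = {label.lower(): label for label in KNOWN_LABELS}
    na_codes = set(true_na)

    # First pass: record which known labels occur in each cell.
    found_per_row = []
    for row in rows:
        cell = row[column].strip()
        if cell in na_codes:
            found_per_row.append(None)  # no information for this individual
            continue
        items = [item.strip().lower() for item in cell.split(delim)]
        found_per_row.append({lookup[i] for i in items if i in lookup})

    # One output column per label identified anywhere in the input column.
    identified = [l for l in KNOWN_LABELS if any(f and l in f for f in found_per_row)]
    missing = "excluded" if assume_excluded else "na"

    table = []
    for row, found in zip(rows, found_per_row):
        out = {id_key: row[id_key]}
        for label in identified:
            if found is None:
                out[label] = "na"
            else:
                out[label] = "observed" if label in found else missing
        out[column] = row[column]  # original cell kept as the last column, for vetting
        table.append(out)
    return table


rows = [
    {"id": "P1", "Cardiac defect": "Ventricular septal defect, Patent foramen ovale"},
    {"id": "P2", "Cardiac defect": "Atrial septal defect"},
    {"id": "P3", "Cardiac defect": "UN"},
]
decoded = toy_decode(rows, "id", "Cardiac defect", true_na={"na", "UN"}, assume_excluded=True)
for line in decoded:
    print(line)
```

In this sketch, individual P3 gets "na" in every term column because the cell holds one of the no-information codes, while P1 and P2 get "excluded" for every identified term that their cell does not name.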
Source code in pyphetools/creation/discombulator.py
decode(column, delim=',', assumeExcluded=False, trueNa='na')
Discombobulate a column of the original data, using text mining to find HPO terms and make one column for each identified HPO term in the output.
assumeExcluded: Assume that if an item is not mentioned in a cell, then it was excluded. This can be justified if, for instance, the column reports echocardiography findings.
trueNa: The code or codes (e.g., "na", "UN") that indicate that no information is available for a cell; "na" is output for such cells.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to discombobulate | required |
delim | str | Delimiter between items | ',' |
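The effect of the delim parameter can be pictured with a small sketch. The helper split_items below is hypothetical and only illustrates splitting a cell into items. Whether pyphetools trims whitespace or drops empty pieces in the same way is not stated here and is an assumption.

```python
def split_items(cell, delim=","):
    """Hypothetical helper: split one cell on delim, trim whitespace, drop empty pieces."""
    return [piece.strip() for piece in cell.split(delim) if piece.strip()]


comma_cell = "Ventricular septal defect, Atrial septal defect"
semicolon_cell = "Ventricular septal defect; Atrial septal defect"

print(split_items(comma_cell))
print(split_items(semicolon_cell))  # the default delimiter leaves this cell as one item
print(split_items(semicolon_cell, delim=";"))
```

If the cells in a column separate items with something other than a comma, passing the matching delimiter keeps each item separate rather than treating the whole cell as a single string.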