Code for manipulating the structure of UMLS files and converting ontologies or sets of ontologies into a graph structure.
The Unified Medical Language System (UMLS) is a set of ontologies and associated resource files, lexicons, etc. that together provide a comprehensive set of structured concepts, codes, and text descriptors for the biomedical domain. UMLS is widely used in biomedical informatics research and in a variety of clinical software products, but the raw UMLS files are a bit unwieldy. If you want to do more than simply load UMLS into a database and search for a given concept, code, etc., its structure can be difficult to understand and manipulate.
My own work has focused on natural language processing (NLP), and I've often encountered situations where I want to find all of the synonyms for a given clinical concept, or I want to know the hierarchical relationships for a given term (e.g., "Atherosclerosis is a type of cardiovascular disease"). In many of these cases UMLS can help, but it often doesn't seem worth the effort to parse all of the raw UMLS files.
Even though it wasn't designed to be used this way, I've found that it often helps to think of UMLS as a graph. Different ontologies can be combined in a flexible way to produce graph structures that represent different subsets of UMLS, and the nodes of the graph (concept unique identifiers, or CUIs) can be decorated with synonyms, billing codes, or whatever you want. So I've written some basic code that converts the raw UMLS files into a graph structure, the details of which can be manipulated using a single configuration file.
The first thing you need to do is download UMLS. The full release is available here but you need to apply for a license before you can download it.
Once you've downloaded it, follow the instructions in the README to install it. You need to use the MetamorphoSys app (included in the download) to install the raw resource files.
A couple of notes:
- On the "Select Default Subset Configuration" screen, I always choose a subset that includes SNOMEDCT_US. SNOMED contains a ton of useful terms and hierarchies.
- By default, MedDRA and CPT are not selected. I would include those, as well as any other English-language terminologies you think will be useful.
- Once the resource files are installed, there will be many more than you need for the graph. You can delete pretty much everything except: MRCONSO.RRF (concept names and sources), MRHIER.RRF (hierarchies), MRREL.RRF (related concepts), MRSAT.RRF (simple concept and atom attributes) and MRSTY.RRF (semantic types). You can find complete descriptions of all of the UMLS files here.
The umls-to-graph code uses Maven to handle dependencies (see the pom.xml file for details). Run the code from Eclipse or IntelliJ, or compile a jar using Maven.
You construct the graph by creating a configuration file (mine is called graph-config.txt and can be found in the resources section of this repository) that looks something like this:
subgraph AIR /Users/beth/Desktop/graph-info/umls-subgraph-AIR.txt
subgraph AOD /Users/beth/Desktop/graph-info/umls-subgraph-AOD.txt
subgraph AOT /Users/beth/Desktop/graph-info/umls-subgraph-AOT.txt
subgraph ATC /Users/beth/Desktop/graph-info/umls-subgraph-ATC.txt
subgraph CCS /Users/beth/Desktop/graph-info/umls-subgraph-CCS.txt
subgraph CCS_10 /Users/beth/Desktop/graph-info/umls-subgraph-CCS_10.txt
subgraph CPT /Users/beth/Desktop/graph-info/umls-subgraph-CPT.txt
subgraph CSP /Users/beth/Desktop/graph-info/umls-subgraph-CSP.txt
subgraph CST /Users/beth/Desktop/graph-info/umls-subgraph-CST.txt
subgraph FMA /Users/beth/Desktop/graph-info/umls-subgraph-FMA.txt
subgraph GO /Users/beth/Desktop/graph-info/umls-subgraph-GO.txt
subgraph HPO /Users/beth/Desktop/graph-info/umls-subgraph-HPO.txt
subgraph ICD10 /Users/beth/Desktop/graph-info/umls-subgraph-ICD10.txt
subgraph ICD10CM /Users/beth/Desktop/graph-info/umls-subgraph-ICD10CM.txt
subgraph ICD10PCS /Users/beth/Desktop/graph-info/umls-subgraph-ICD10PCS.txt
subgraph ICD9CM /Users/beth/Desktop/graph-info/umls-subgraph-ICD9CM.txt
subgraph ICPC /Users/beth/Desktop/graph-info/umls-subgraph-ICPC.txt
subgraph LNC /Users/beth/Desktop/graph-info/umls-subgraph-LNC.txt
subgraph MEDLINEPLUS /Users/beth/Desktop/graph-info/umls-subgraph-MEDLINEPLUS.txt
subgraph MSH /Users/beth/Desktop/graph-info/umls-subgraph-MSH.txt
subgraph MTHHH /Users/beth/Desktop/graph-info/umls-subgraph-MTHHH.txt
subgraph NCBI /Users/beth/Desktop/graph-info/umls-subgraph-NCBI.txt
subgraph NCI /Users/beth/Desktop/graph-info/umls-subgraph-NCI.txt
subgraph NDFRT /Users/beth/Desktop/graph-info/umls-subgraph-NDFRT.txt
subgraph OMIM /Users/beth/Desktop/graph-info/umls-subgraph-OMIM.txt
subgraph PDQ /Users/beth/Desktop/graph-info/umls-subgraph-PDQ.txt
subgraph SNOMEDCT_US /Users/beth/Desktop/graph-info/umls-subgraph-SNOMEDCT_US.txt
subgraph SOP /Users/beth/Desktop/graph-info/umls-subgraph-SOP.txt
subgraph TKMT /Users/beth/Desktop/graph-info/umls-subgraph-TKMT.txt
subgraph USPMG /Users/beth/Desktop/graph-info/umls-subgraph-USPMG.txt
subgraph UWDA /Users/beth/Desktop/graph-info/umls-subgraph-UWDA.txt
subgraph RXNORM /Users/beth/Desktop/graph-info/umls-subgraph-drug-ingredient.txt
codedecorator MDR /Users/beth/Desktop/graph-info/umls-code-decorator-MDR.txt
codedecorator ICD9CM /Users/beth/Desktop/graph-info/umls-code-decorator-ICD9CM.txt
codedecorator MTHICD9 /Users/beth/Desktop/graph-info/umls-code-decorator-MTHICD9.txt
codedecorator ICD10 /Users/beth/Desktop/graph-info/umls-code-decorator-ICD10.txt
codedecorator ICD10CM /Users/beth/Desktop/graph-info/umls-code-decorator-ICD10CM.txt
codedecorator ICD10PCS /Users/beth/Desktop/graph-info/umls-code-decorator-ICD10PCS.txt
codedecorator LNC /Users/beth/Desktop/graph-info/umls-code-decorator-LNC.txt
codedecorator SNOMEDCT_US /Users/beth/Desktop/graph-info/umls-code-decorator-SNOMEDCT_US.txt
codedecorator NDC /Users/beth/Desktop/graph-info/umls-code-decorator-NDC.txt
translation ICD9CM ICD10PCS /Users/beth/Desktop/graph-info/umls-translation-decorator-icd9-icd10.txt
translation ICD9CM SNOMEDCT_US /Users/beth/Desktop/graph-info/umls-translation-decorator-icd9-icd10.txt
decorator ONTOLOGY /Users/beth/Desktop/graph-info/umls-decorator-ontology.txt
decorator SEMGROUP /Users/beth/Desktop/graph-info/umls-decorator-semantic-group.txt
decorator SEMTYPE /Users/beth/Desktop/graph-info/umls-decorator-semantic-type.txt
decorator STRINGS /Users/beth/Desktop/graph-info/umls-decorator-strings.txt.gz
edgefilter SEMGROUPMISMATCH
nodemodifier MODIFIEDSTRINGS
All of those resource files are generated by the create-resource-files.sh script, also in the resources directory, which you should feel free to modify and use.
You can leave out any subgraphs or decorators you want, or add additional subgraphs for other ontologies (details below).
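To make the layout of the config file concrete, here is a minimal sketch of how its lines could be read. It assumes each line is a directive followed by whitespace-separated arguments, which is what the example above suggests; the class GraphConfigSketch and its parse method are hypothetical and are not part of this repository, and the real parser may behave differently (for instance, this sketch would break on paths that contain spaces).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the repository): groups config lines by directive.
class GraphConfigSketch {

    // The first token of each non-blank line is treated as the directive
    // (subgraph, codedecorator, translation, decorator, edgefilter, ...);
    // the remaining tokens are kept as that directive's arguments.
    static Map<String, List<List<String>>> parse(List<String> lines) {
        Map<String, List<List<String>>> byDirective = new LinkedHashMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] tokens = trimmed.split("\\s+");
            List<String> arguments =
                new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
            byDirective.computeIfAbsent(tokens[0], k -> new ArrayList<>()).add(arguments);
        }
        return byDirective;
    }

    public static void main(String[] args) {
        // Paths here are placeholders for illustration only.
        List<String> lines = Arrays.asList(
            "subgraph OMIM /tmp/graph-info/umls-subgraph-OMIM.txt",
            "decorator SEMGROUP /tmp/graph-info/umls-decorator-semantic-group.txt",
            "edgefilter SEMGROUPMISMATCH");
        System.out.println(parse(lines));
    }
}
```

Grouping by directive makes it easy to see at a glance which subgraphs, decorators, and filters a given config will pull in.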
You build the graph by running:
java build.CreateUMLSGraph <graph-config-file> <output-structure-file> <output-decorations-file>
The following ontologies are currently supported and provide hierarchical relationships from MRHIER.RRF:
- AIR (1512 hierarchical relationships in MRHIER.RRF)
- AOD (14284)
- AOT (350)
- ATC (6083)
- CCS (1099)
- CCS_10 (372)
- CPT
- CSP (14582)
- CST (3331)
- FMA (104084)
- GO (809207)
- HPO (81558)
- ICD10 (12319)
- ICD10CM (94516)
- ICD10PCS (190176)
- ICD9CM (22407)
- ICPC (1433)
- LNC (270837)
- MDR
- MEDLINEPLUS (1812)
- MSH (58859)
- MTHHH (6937)
- NCBI (1285985)
- NCI (312695)
- NDFRT (45364)
- OMIM (51769)
- PDQ (3816)
- SNOMEDCT_US (9561078)
- SOP (156)
- TKMT (372)
- USPMG (1910)
- UWDA (419453)
Two more ontologies from MRCONSO.RRF don't provide hierarchical relationships but do provide codes:
And finally, NDC is only in MRSAT.RRF.
Graphs have nodes and edges. The UMLS graph will have nodes that correspond to CUIs in UMLS and edges that correspond to hierarchical relationships from the various ontologies within UMLS.
The graph lives in two files: a "structure" file and a "decorations" file. The structure file contains the edges and the decorations file contains the nodes, along with a bunch of metadata about them.
Both of the graph output files are tab-delimited.
The structure file has two columns:
- parent CUI
- child CUI
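As an illustration of how the structure file can be used, the following sketch loads parent/child lines into an adjacency map and collects all descendants of a node. It assumes only the two-column tab-delimited layout described above; the class StructureFileSketch is hypothetical, and the identifiers in the demo are placeholders rather than real CUIs.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical reader for the structure file (parent CUI <TAB> child CUI).
class StructureFileSketch {

    // Builds a parent -> children map from tab-delimited lines.
    static Map<String, List<String>> childrenByParent(List<String> lines) {
        Map<String, List<String>> children = new HashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length < 2) {
                continue; // skip blank or malformed lines
            }
            children.computeIfAbsent(cols[0], k -> new ArrayList<>()).add(cols[1]);
        }
        return children;
    }

    // Breadth-first traversal that returns every node reachable from root.
    static Set<String> descendants(Map<String, List<String>> children, String root) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String child : children.getOrDefault(node, Collections.emptyList())) {
                if (seen.add(child)) {
                    queue.add(child);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Placeholder identifiers, not real CUIs.
        List<String> lines = Arrays.asList("A\tB", "A\tC", "B\tD");
        System.out.println(descendants(childrenByParent(lines), "A"));
    }
}
```

Reversing the two columns while loading gives a child-to-parents map, which supports the "is a type of" lookups in the other direction.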
The decorations file has the following columns:
- CUI
- sibling CUIs (see explanation below; comma-delimited)
- string descriptions (pipe-delimited)
- codes (pipe-delimited, with sources)
- semantic type(s) (pipe-delimited)
- semantic group(s) (pipe-delimited)
- ontologies (pipe-delimited)
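A line of the decorations file could be parsed along the following lines. This is a sketch under stated assumptions: the seven columns appear in the order listed above, and an empty column is written as an empty string. The class DecorationsLineSketch is hypothetical, and the values in the demo are placeholders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical holder for one line of the decorations file.
class DecorationsLineSketch {
    final String cui;
    final List<String> siblingCuis;    // comma-delimited in the file
    final List<String> strings;        // pipe-delimited
    final List<String> codes;          // pipe-delimited, with sources
    final List<String> semanticTypes;  // pipe-delimited
    final List<String> semanticGroups; // pipe-delimited
    final List<String> ontologies;     // pipe-delimited

    DecorationsLineSketch(String line) {
        // The -1 limit keeps trailing empty columns instead of dropping them.
        String[] cols = line.split("\t", -1);
        if (cols.length < 7) {
            throw new IllegalArgumentException("expected 7 tab-delimited columns");
        }
        cui = cols[0];
        siblingCuis = splitList(cols[1], ",");
        strings = splitList(cols[2], "|");
        codes = splitList(cols[3], "|");
        semanticTypes = splitList(cols[4], "|");
        semanticGroups = splitList(cols[5], "|");
        ontologies = splitList(cols[6], "|");
    }

    // Splits on a literal delimiter; an empty field yields an empty list.
    static List<String> splitList(String field, String delimiter) {
        List<String> items = new ArrayList<>();
        if (!field.isEmpty()) {
            items.addAll(Arrays.asList(field.split(Pattern.quote(delimiter))));
        }
        return items;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        DecorationsLineSketch node = new DecorationsLineSketch(
            "C0000001\t\tfirst term|second term\t\tT1\tDISO\tSRC_A|SRC_B");
        System.out.println(node.cui + " -> " + node.strings);
    }
}
```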
The edges of the UMLS graph come from the hierarchical relationships within UMLS. The relationships from each ontology constitute a subgraph. Multiple subgraphs are combined to create the complete graph. You can choose to include relationships from as many ontologies as you want.
You need to build a resource file for each subgraph before the final graph can be created. This is really useful for debugging later, since you can see where all of the edges in the final graph come from.
For each ontology in MRHIER.RRF for which you want to create a subgraph, run the following:
java subgraphs.MrHierSubgraph <umls-location>/MRHIER.RRF <umls-location>/MRCONSO.RRF <output-subgraph-file> <ontology-name>
For example:
java subgraphs.MrHierSubgraph /Users/beth/Documents/data/2017AB-full/2017AB/META/MRHIER.RRF /Users/beth/Documents/data/2017AB-full/2017AB/META/MRCONSO.RRF /Users/beth/Desktop/subgraphs/umls-subgraph-omim.txt OMIM
Ensure that the string you use to reference the ontology (argument 4, above) matches one of the recognized ontology types listed above.
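For readers wondering why both MRHIER.RRF and MRCONSO.RRF are needed, here is a rough sketch of one way to derive CUI-level parent/child pairs. It relies on the standard pipe-delimited RRF layouts (MRHIER: CUI, AUI, CXN, PAUI, SAB, ...; MRCONSO: CUI in column 1 and AUI in column 8). Using the PAUI column for the parent is an assumption of this sketch; MrHierSubgraph may do it differently, for example by walking the PTR path. The class MrHierSketch and the sample rows are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: MRHIER rows refer to atoms (AUIs), so MRCONSO is
// used to translate each parent atom back to its CUI.
class MrHierSketch {

    // MRCONSO.RRF: column 0 is the CUI, column 7 is the AUI.
    static Map<String, String> auiToCui(List<String> mrconsoLines) {
        Map<String, String> map = new HashMap<>();
        for (String line : mrconsoLines) {
            String[] f = line.split("\\|", -1);
            if (f.length > 7) {
                map.put(f[7], f[0]);
            }
        }
        return map;
    }

    // MRHIER.RRF: column 0 is the CUI, column 3 the parent AUI, column 4 the source (SAB).
    static Map<String, Set<String>> parentToChildren(
            List<String> mrhierLines, Map<String, String> auiToCui, String sab) {
        Map<String, Set<String>> edges = new LinkedHashMap<>();
        for (String line : mrhierLines) {
            String[] f = line.split("\\|", -1);
            if (f.length < 5 || !sab.equals(f[4]) || f[3].isEmpty()) {
                continue; // wrong source, malformed row, or a root with no parent
            }
            String parentCui = auiToCui.get(f[3]);
            if (parentCui == null || parentCui.equals(f[0])) {
                continue; // unknown parent atom or a self-loop
            }
            edges.computeIfAbsent(parentCui, k -> new LinkedHashSet<>()).add(f[0]);
        }
        return edges;
    }

    public static void main(String[] args) {
        // Made-up rows in RRF shape; "SRC" is a placeholder source name.
        List<String> mrconso = Arrays.asList(
            "C0000001|ENG|P|L1|PF|S1|Y|A1|", "C0000002|ENG|P|L2|PF|S2|Y|A2|");
        List<String> mrhier = Arrays.asList("C0000002|A2|1|A1|SRC||A1||");
        System.out.println(parentToChildren(mrhier, auiToCui(mrconso), "SRC"));
    }
}
```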
The output resource file format is a two-column tab-delimited file; the first column is a parent CUI and the second column is a comma-separated list of child CUIs. Note that instead of being stored as strings ("C4228946") the CUIs are stored as integers to save space. So the CUI listed as 543 in the subgraph files corresponds to the CUI listed as C0000543 in the UMLS files.
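The integer encoding can be reproduced with a pair of small helpers. This is a sketch that assumes the standard CUI shape (the letter C followed by seven digits); the class CuiCodecSketch is hypothetical and is not the code used in this repository.

```java
// Hypothetical helpers for converting between "C0000543" and 543.
class CuiCodecSketch {

    // Strips the leading 'C' and parses the remaining digits.
    static int toInt(String cui) {
        if (cui == null || cui.length() < 2 || cui.charAt(0) != 'C') {
            throw new IllegalArgumentException("not a CUI: " + cui);
        }
        return Integer.parseInt(cui.substring(1));
    }

    // Restores the leading 'C' and zero-pads to seven digits.
    static String toCui(int id) {
        return String.format("C%07d", id);
    }

    public static void main(String[] args) {
        System.out.println(toInt("C0000543"));
        System.out.println(toCui(543));
    }
}
```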
Many different branded drugs correspond to the same active ingredient. It is therefore useful to create a second type of subgraph that maps individual drug preparations ("children") to their active ingredients ("parents"). This is totally optional and distinct from the MRHIER.RRF subgraphs.
To get the drug-ingredient relationships, you need MRREL.RRF and MRCONSO.RRF. To create this subgraph, run:
java subgraphs.DrugIngredientSubgraph <umls-location>/MRREL.RRF <umls-location>/MRCONSO.RRF <output-subgraph-file>
For example:
java subgraphs.DrugIngredientSubgraph /Users/beth/Documents/data/2017AB-full/2017AB/META/MRREL.RRF /Users/beth/Documents/data/2017AB-full/2017AB/META/MRCONSO.RRF /Users/beth/Desktop/subgraphs/umls-subgraph-drug-ingredient.txt
The nodes of the subgraphs are all CUIs. CUIs are not very useful on their own for most applications. Normally we'll want to start with a string descriptor of a concept ("diabetes") or a billing code (such as from NDC or ICD9), map it to a CUI, and use the graph to find parent or child CUIs (and their associated strings and codes). To decorate the nodes of the subgraphs with all of this other information, we use an object called a decorator.
The metadata that decorators use to decorate the nodes also lives in resource files. You'll need to create one resource file for each type of decoration you wish to include.
For most NLP applications, string annotations for UMLS concepts will be the most important thing that comes out of this graph. Each CUI is attached to a set of descriptors from various ontologies. The SPECIALIST lexicon (which comes with UMLS) also provides a list of alternate spellings for various terms in the LRSPL file.
The string annotation decorator finds all English-language, all-ASCII strings from MRCONSO.RRF and the alternate spellings from LRSPL, and attaches them to nodes in the graph. To generate a resource file for this decorator, run the following:
java nodedecorators.StringsNodeDecorator <umls-location>/MRCONSO.RRF <umls-location>/LRSPL <output-resource-file>
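The MRCONSO.RRF half of that selection can be sketched as follows. It assumes the standard MRCONSO layout (language in column 2, string in column 15) and leaves the LRSPL spelling variants out entirely; the class StringsSketch and the sample rows are hypothetical, and StringsNodeDecorator itself may apply additional rules.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: collect English, all-ASCII strings per CUI from MRCONSO rows.
class StringsSketch {

    // True when every character is in the 7-bit ASCII range.
    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    // MRCONSO.RRF: column 0 is the CUI, column 1 the language, column 14 the string.
    static Map<String, Set<String>> englishAsciiStrings(List<String> mrconsoLines) {
        Map<String, Set<String>> byCui = new LinkedHashMap<>();
        for (String line : mrconsoLines) {
            String[] f = line.split("\\|", -1);
            if (f.length <= 14 || !"ENG".equals(f[1]) || !isAscii(f[14])) {
                continue;
            }
            byCui.computeIfAbsent(f[0], k -> new LinkedHashSet<>()).add(f[14]);
        }
        return byCui;
    }

    public static void main(String[] args) {
        // Made-up rows in MRCONSO shape; "SRC" is a placeholder source name.
        List<String> rows = Arrays.asList(
            "C0000001|ENG|P|L1|PF|S1|Y|A1||||SRC|PT|X1|example term|0|N||",
            "C0000001|SPA|P|L2|PF|S2|Y|A2||||SRC|PT|X1|termino de ejemplo|0|N||");
        System.out.println(englishAsciiStrings(rows));
    }
}
```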
These decorators map ontology-specific codes to CUIs. Some ontologies whose codes are frequently used in medical practice and billing include: ICD9CM, MTHICD9, ICD10PCS, SNOMEDCT_US, and LOINC (LNC). To generate a resource file for a code decorator, do the following:
java nodedecorators.GenericCodeNodeDecorator <umls-location>/MRCONSO.RRF <ontology-type> <output-resource-file>
For example:
java nodedecorators.GenericCodeNodeDecorator /Users/beth/Documents/data/2017AB-full/2017AB/META/MRCONSO.RRF ICD9CM /Users/beth/Desktop/subgraphs/umls-decorator-icd9cm.txt
will generate a resource file for ICD9 codes.
NDC codes are not accessible from MRCONSO.RRF; the mapping needs to be created separately using MRSAT.RRF. To generate a resource file for NDC codes, run:
java nodedecorators.NdcCodeNodeDecorator <umls-location>/MRSAT.RRF <output-decorator-map-file>
Code decorators are a work in progress and I'd encourage people to reach out if they need a way to include other code types in the graph.
There have been various attempts to map ICD9 codes to ICD10 codes and ICD9 codes to SNOMED codes, mostly to facilitate changes in medical billing and electronic medical record practices. NLM has provided a variety of mapping files that can translate between codes. I've created an object called a CodeTranslationNodeDecorator to handle these mappings. The decorator looks at the current set of codes attached to a node and then adds codes from the other ontologies. So for example, if a node is decorated with ICD9 codes, the mappings will add codes for ICD10 or SNOMED. The reverse is also possible.
There are currently two types of code translation node decorators.
- ICD9 <-> ICD10 translation: Download the general equivalence mapping (GEM) files from here. When you unzip the archive, you'll see two files called 2017_I9gem.txt and 2017_I10gem.txt. Create the resource file using:
java nodedecorators.Icd9Icd10CodeTranslationNodeDecorator <path-to-gems>/2017_I9gem.txt <path-to-gems>/2017_I10gem.txt <output-resource-file>
- ICD9 <-> SNOMED translation: You need to use your UMLS license to download the ICD9-to-SNOMED maps from NLM. You can get them here. When you unzip the archive, you'll see two files called ICD9CM_SNOMED_MAP_1TO1_201612.txt and ICD9CM_SNOMED_MAP_1TOM_201612.txt. Create the resource file using:
java nodedecorators.Icd9SnomedCodeTranslationNodeDecorator <path-to-map-files>/ICD9CM_SNOMED_MAP_1TO1_201612.txt <path-to-map-files>/ICD9CM_SNOMED_MAP_1TOM_201612.txt <output-resource-file>
Not all CUIs are represented in all ontologies. If you want to know which ontologies each CUI belongs to, you can create a resource file that will add a list of ontologies to the CUIs in the graph decorations file. Just do:
java nodedecorators.OntologyNodeDecorator <umls-location>/MRCONSO.RRF <output-resource-file>
UMLS provides semantic group and semantic type annotations for all CUIs. The definitions of these types and groups can be found here. These are particularly useful in cases where you only want a graph that includes certain categories, like disorders, chemicals, etc.
Create the resource files for semantic types and groups by doing the following:
java nodedecorators.SemanticGroupNodeDecorator <umls-location>/MRSTY.RRF <output-resource-file>
java nodedecorators.SemanticTypeNodeDecorator <umls-location>/MRSTY.RRF <output-resource-file>
Sometimes we want to remove nodes from the graph based on certain properties. For example, maybe we want our graph to include only chemicals (semantic group: CHEM). The code accomplishes this using objects called NodeFilters and NodeModifiers. You don't need to generate resource files for these; just include the following line in your graph config file to restrict by semantic group:
nodefilter SEMGROUP <allowed-types>
where <allowed-types> is a pipe-delimited list of allowed types; for example, CHEM|DISO. The code will automatically remove nodes that have none of those types. For nodes with multiple types, only the allowed types will be retained in the final graph.
To restrict by semantic type, you do the opposite:
nodefilter SEMTYPE <disallowed-types>
You list the disallowed types, since there are so many semantic types and more often than not, you simply wish to remove one or two types from the graph. Disallowed types will also be removed from the nodes that have multiple types.
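The two filtering modes can be summarized in a few lines. This sketch assumes that a node is dropped once no semantic groups (or types) remain on it after filtering; the class NodeFilterSketch is hypothetical, and the type names T_A and T_B in the demo are placeholders, not real semantic types.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the allow-list (SEMGROUP) and deny-list (SEMTYPE) behaviour.
class NodeFilterSketch {

    // Parses a pipe-delimited list such as "CHEM|DISO".
    static Set<String> parseList(String pipeDelimited) {
        return new LinkedHashSet<>(Arrays.asList(pipeDelimited.split("\\|")));
    }

    // SEMGROUP: keep only the allowed groups. An empty result means the
    // node would be removed (an assumption of this sketch).
    static Set<String> retainAllowed(Set<String> nodeGroups, Set<String> allowed) {
        Set<String> kept = new LinkedHashSet<>(nodeGroups);
        kept.retainAll(allowed);
        return kept;
    }

    // SEMTYPE: strip the disallowed types. An empty result means the
    // node would be removed (an assumption of this sketch).
    static Set<String> removeDisallowed(Set<String> nodeTypes, Set<String> disallowed) {
        Set<String> kept = new LinkedHashSet<>(nodeTypes);
        kept.removeAll(disallowed);
        return kept;
    }

    public static void main(String[] args) {
        Set<String> allowed = parseList("CHEM|DISO");
        System.out.println(retainAllowed(parseList("CHEM|PHYS"), allowed));
        // T_A and T_B are placeholder type names.
        System.out.println(removeDisallowed(parseList("T_A|T_B"), parseList("T_B")));
    }
}
```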
The final type of node modifier adds modified strings. This is a bit of a hack, but in practice I've noticed that ontology terms will frequently have the form "diabetes, type II" when what appears most often in real text is "type II diabetes". So this modifier switches those around and also adds and subtracts apostrophes from terms like "Alzheimer's Disease". LRSPL takes care of some of these, but this node modifier basically acts as a check, making sure that these common variants are represented. You don't have to include it.
To include the modified strings, add the following line to your config file:
nodemodifier MODIFIEDSTRINGS
and the modified strings will automatically be added to the list of string descriptions for each node.
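The kind of rewriting involved might look like the sketch below. It covers only two cases, swapping the halves of a term that contains a single comma and dropping apostrophes; the exact rules the MODIFIEDSTRINGS modifier applies are not spelled out here, so treat both rules as assumptions. The class ModifiedStringsSketch is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of generating common surface variants of an ontology term.
class ModifiedStringsSketch {

    static Set<String> variants(String term) {
        Set<String> out = new LinkedHashSet<>();

        // "diabetes, type II" becomes "type II diabetes" (single-comma terms only).
        int comma = term.indexOf(", ");
        boolean singleComma = comma > 0 && term.indexOf(',', comma + 1) < 0;
        if (singleComma && comma + 2 < term.length()) {
            out.add(term.substring(comma + 2) + " " + term.substring(0, comma));
        }

        // "Alzheimer's Disease" becomes "Alzheimers Disease" (apostrophe dropped).
        if (term.indexOf('\'') >= 0) {
            out.add(term.replace("'", ""));
        }

        out.remove(term); // never report the original term as its own variant
        return out;
    }

    public static void main(String[] args) {
        System.out.println(variants("diabetes, type II"));
        System.out.println(variants("Alzheimer's Disease"));
    }
}
```

Adding apostrophes (the reverse direction) needs a list of known eponyms or similar evidence, so it is left out of this sketch.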
If you need/want another type of node filter or modifier, please reach out.
You can also filter out edges in the graph using an object called an EdgeFilter. The edge filter looks at the nodes on either side of an edge and removes the edge if the nodes don't fulfill certain properties. For example, occasionally you can end up with a situation where a CHEM node is a parent of a DISO node or something like that. This can lead to weird properties in the final graph, where you have a drug being the parent of a disease, etc. just because of a quirk that occurred when two ontologies were merged.
Right now the only type of edge filter that exists removes edges where the connected nodes have different semantic groups. To include the filter, add the following line to your config file:
edgefilter SEMGROUPMISMATCH
Again, if you can think of another type of edge filter that would be useful, please reach out.
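As a rough illustration of the edge filter, the sketch below keeps an edge only when the parent and child share at least one semantic group. How nodes with several groups are handled is an assumption of this sketch, and the class SemGroupMismatchSketch, along with the C1/C2/C3 identifiers, is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a semantic-group mismatch edge filter.
class SemGroupMismatchSketch {

    // True when the two nodes have at least one semantic group in common.
    static boolean groupsOverlap(Set<String> parentGroups, Set<String> childGroups) {
        return !Collections.disjoint(parentGroups, childGroups);
    }

    // Each edge is a {parent, child} pair; nodes without group data count as having none.
    static List<String[]> filterEdges(List<String[]> edges, Map<String, Set<String>> groupsByCui) {
        List<String[]> kept = new ArrayList<>();
        for (String[] edge : edges) {
            Set<String> parent = groupsByCui.getOrDefault(edge[0], Collections.emptySet());
            Set<String> child = groupsByCui.getOrDefault(edge[1], Collections.emptySet());
            if (groupsOverlap(parent, child)) {
                kept.add(edge);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder identifiers, not real CUIs.
        Map<String, Set<String>> groups = new HashMap<>();
        groups.put("C1", new HashSet<>(Arrays.asList("CHEM")));
        groups.put("C2", new HashSet<>(Arrays.asList("CHEM")));
        groups.put("C3", new HashSet<>(Arrays.asList("DISO")));
        List<String[]> edges = Arrays.asList(
            new String[] {"C1", "C2"}, new String[] {"C1", "C3"});
        for (String[] edge : filterEdges(edges, groups)) {
            System.out.println(edge[0] + " -> " + edge[1]);
        }
    }
}
```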
Occasionally two ontologies will contradict each other (one will have CUI A as the parent of CUI B and the other will have B as the parent of A). To ensure the final graph is a DAG, the code will automatically collapse these cycles and create a new "meta-node" that includes both A and B. All of the CUIs in the cycle (or cycles) that get merged will end up in the same node. These are called "sibling CUIs". If you want to avoid this, you'll need to figure out which ontology or ontologies are causing the cycle and remove those edges.
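One standard way to find the groups of CUIs that would be merged is to compute strongly connected components, for example with Tarjan's algorithm. The sketch below does that on a parent-to-children map. Whether the graph builder uses this exact algorithm is not stated here, so read it as an illustration of the idea; the class CycleCollapseSketch is hypothetical, and the recursive traversal would need to be made iterative for a graph as deep as the full UMLS.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: Tarjan's strongly connected components algorithm.
// Every component with more than one member is a set of "sibling CUIs".
class CycleCollapseSketch {
    private final Map<String, List<String>> children;
    private final Map<String, Integer> index = new HashMap<>();
    private final Map<String, Integer> lowLink = new HashMap<>();
    private final Deque<String> stack = new ArrayDeque<>();
    private final Set<String> onStack = new HashSet<>();
    private final List<Set<String>> components = new ArrayList<>();
    private int counter = 0;

    private CycleCollapseSketch(Map<String, List<String>> children) {
        this.children = children;
    }

    // Returns one set per meta-node; acyclic nodes come back as singleton sets.
    static List<Set<String>> metaNodes(Map<String, List<String>> children) {
        CycleCollapseSketch tarjan = new CycleCollapseSketch(children);
        Set<String> nodes = new TreeSet<>(children.keySet());
        for (List<String> kids : children.values()) {
            nodes.addAll(kids);
        }
        for (String node : nodes) {
            if (!tarjan.index.containsKey(node)) {
                tarjan.visit(node);
            }
        }
        return tarjan.components;
    }

    private void visit(String node) {
        index.put(node, counter);
        lowLink.put(node, counter);
        counter++;
        stack.push(node);
        onStack.add(node);
        for (String child : children.getOrDefault(node, Collections.emptyList())) {
            if (!index.containsKey(child)) {
                visit(child);
                lowLink.put(node, Math.min(lowLink.get(node), lowLink.get(child)));
            } else if (onStack.contains(child)) {
                lowLink.put(node, Math.min(lowLink.get(node), index.get(child)));
            }
        }
        // A node whose low-link equals its own index is the root of a component.
        if (lowLink.get(node).equals(index.get(node))) {
            Set<String> component = new TreeSet<>();
            String member;
            do {
                member = stack.pop();
                onStack.remove(member);
                component.add(member);
            } while (!member.equals(node));
            components.add(component);
        }
    }

    public static void main(String[] args) {
        // A and B point at each other, so they collapse into one meta-node.
        Map<String, List<String>> children = new HashMap<>();
        children.put("A", Arrays.asList("B"));
        children.put("B", Arrays.asList("A", "C"));
        System.out.println(metaNodes(children));
    }
}
```

After the components are known, each edge can be rewritten to connect meta-nodes instead of individual CUIs, and edges that fall inside a single component are dropped, which leaves a DAG.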
This is just some utility code that I've found useful - it is not production-grade and there are bound to be errors. Please let me know if you find any. My plan is to refine it in the coming months. If you're part of the biomedical research community and have any advice or want to contribute, I'd love to hear from you. I can be reached at bethany.percha@mssm.edu.
Please note that this code is released under the GPL.