Medline Analysis

From NCBO Wiki


The Unified Medical Language System (UMLS) Metathesaurus is the most widely used underlying source for biomedical natural language processing (NLP) systems, even though it was not designed as a terminology for NLP tasks. However, the performance of these systems is not satisfactory. In this study, we systematically analyzed UMLS terms by examining their occurrences in over 18 million MEDLINE abstracts written in natural language. Our goals are threefold: 1. analyze UMLS term frequency and syntactic distribution in MEDLINE; 2. build an automatically filtered UMLS Metathesaurus based on the MEDLINE analysis; 3. build an augmented UMLS Metathesaurus in which each term is associated with its MEDLINE frequency and syntactic distribution statistics. The automatically filtered and augmented UMLS Metathesaurus can be used to improve the efficiency and precision of UMLS-based information retrieval and NLP tasks. After automatic MEDLINE filtering, the augmented UMLS contains 518,835 terms, roughly 13% of the original terms. Each term in the augmented UMLS is associated with a vector of syntactic distribution statistics and its MEDLINE frequency.

Code repository: <on our g-forge server> ..

Data and Methods

The 18,413,784 abstracts published in MEDLINE from 1965 to 2009 were split into 96,374,837 sentences. Each sentence was parsed with the Stanford Parser to generate a parse tree. We used the publicly available information retrieval library Lucene to build an index over the sentences and their corresponding parse trees. We used the UMLS 2009AB version, which includes 5,175,449 distinct English strings and 2,120,271 concepts. The term frequency (sentence level) and document frequency (abstract level) were calculated by counting the occurrences of each UMLS term in all MEDLINE sentences and abstracts. The tf-idf (term frequency-inverse document frequency) of each UMLS term was calculated as follows: tf_idf = (1 + log10(tf)) * log(N/df), where tf is the term frequency, df is the document frequency, and N is the total number of abstracts (18,413,784). The syntactic types and their frequencies for each term were collected from all the parse trees in which the term appears. Each term was assigned a vector of syntactic types and probabilities.
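The tf-idf formula above can be sketched in a few lines of Python. This is only an illustration, not the project's Java code: the function name `tf_idf` is hypothetical, and because the text does not state the base of the second logarithm, the natural logarithm is assumed here.

```python
import math

# Total number of MEDLINE abstracts reported in the text.
N_ABSTRACTS = 18_413_784


def tf_idf(tf: int, df: int, n_docs: int = N_ABSTRACTS) -> float:
    """Compute (1 + log10(tf)) * log(N / df).

    tf: number of sentences containing the term.
    df: number of abstracts containing the term.
    ASSUMPTION: log(N / df) uses the natural logarithm; the source
    does not say which base was used.
    """
    if tf <= 0 or df <= 0:
        # A term that never occurs gets a score of zero.
        return 0.0
    return (1 + math.log10(tf)) * math.log(n_docs / df)
```

A term that appears in every abstract scores 0 under this formula, since log(N/N) = 0, which is what pushes very common strings down the ranking.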

Data and data format

1. data directory on biox2:

Medline sentences: /scratch/users/xurong/medline_all/sentences

Medline parse trees: /scratch/users/xurong/medline_all/trees

There are 100 files in each of the sentences and trees directories. The file in which a MEDLINE sentence or parse tree is stored is determined by the last two digits of the PMID. For example, all sentences from the abstract with PMID 13062500 are stored in the file: /scratch/users/xurong/medline_all/sentences/00.txt
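The PMID-to-file rule can be expressed as a small helper. This is a sketch under two assumptions that the text does not confirm: files for remainders below 10 are zero-padded (for example 05.txt), and the trees directory uses the same file names as the sentences directory. The function names are illustrative.

```python
SENTENCE_DIR = "/scratch/users/xurong/medline_all/sentences"
TREE_DIR = "/scratch/users/xurong/medline_all/trees"


def shard_name(pmid: int) -> str:
    """File name chosen by the last two digits of the PMID.

    ASSUMPTION: two-digit, zero-padded names (00.txt ... 99.txt).
    """
    return f"{pmid % 100:02d}.txt"


def sentence_file(pmid: int) -> str:
    """Path of the sentence file that holds this abstract."""
    return f"{SENTENCE_DIR}/{shard_name(pmid)}"


def tree_file(pmid: int) -> str:
    """Path of the parse tree file that holds this abstract."""
    return f"{TREE_DIR}/{shard_name(pmid)}"
```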

2. sentence file format: (pmid_sentenceID|sentence|year)

13062100_0|Further studies on the formation of adrenaline and noradrenaline in the body|1953

13062100_0 means that the sentence "Further studies on the formation of adrenaline and noradrenaline in the body" comes from the abstract with PMID 13062100 and is its title. Sentence IDs start at 0 for the title and increase from there.
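A line in this format can be split into its fields as follows. The sketch assumes only the layout shown above (pmid_sentenceID|sentence|year); the record type and function name are illustrative. It splits on the first and last `|` so that a sentence containing a `|` character would still be kept whole.

```python
from typing import NamedTuple


class SentenceRecord(NamedTuple):
    pmid: str
    sentence_id: int  # 0 is the title
    text: str
    year: int


def parse_sentence_line(line: str) -> SentenceRecord:
    """Parse one 'pmid_sentenceID|sentence|year' line."""
    # First '|' ends the key, last '|' starts the year.
    key, rest = line.rstrip("\n").split("|", 1)
    text, year = rest.rsplit("|", 1)
    pmid, sentence_id = key.split("_", 1)
    return SentenceRecord(pmid, int(sentence_id), text, int(year))
```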

3. parse tree file format: pmid_sentenceID|parse tree

1306200_1|(ROOT [165.455] (S [165.352] (NP [4.654] (PRP [3.405] We)) (VP [150.901] (VBN [7.607] reviewed) (NP [45.822] (NP [26.386] (DT [0.650] the) (JJ [13.702] neuroimaging) (NNS [7.227] studies)) (PP [19.028] (IN [0.669] of) (NP [17.957] (CD [6.498] 150) (NNS [7.159] patients)))) (PP [91.214] (IN [2.595] with) (NP [87.790] (NP [40.279] (JJ [10.955] cavernous) (NNS [11.334] sinus) (NNS [9.944] tumors)) (VP [45.296] (VBN [7.882] operated) (PRT [3.235] (RP [3.173] on)) (PP [28.914] (IN [4.860] during) (NP [23.382] (DT [3.222] an) (JJ [11.114] 8-year) (NN [6.078] period)))))))))
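To read such a line, one can separate the sentence key from the tree and pull out the (part-of-speech tag, word) pairs at the leaves. The sketch below relies only on the bracketed layout visible in the example, where each node is written as (LABEL [score] ...); the regular expression and function name are illustrative, not part of the project code.

```python
import re

# A leaf looks like "(TAG [score] word)": a label, a bracketed score,
# and a single word with no nested parentheses.
LEAF = re.compile(r"\((\S+) \[[-\d.]+\] ([^\s()]+)\)")


def parse_tree_line(line: str):
    """Split 'pmid_sentenceID|parse tree' and list the (tag, word) leaves."""
    key, tree = line.rstrip("\n").split("|", 1)
    leaves = [(tag, word) for tag, word in LEAF.findall(tree)]
    return key, leaves
```

Joining the words of the returned leaves with spaces gives back the tokenized sentence, which is handy for checking a tree against its entry in the sentence file.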

Code, Commands, Sample Input and Output

1. code directory: /home/xurong/java/src/medline_analysis

2. how to compile:

[xurong@frontend1 medline_analysis]$ ant

3. how to run:

a. find the IDs of the sentences in which each input term appears

input: the terms to be counted


sh scripts/ -Xmx2048m medline_analysis.MedlineCountFinder find_single /scratch/users/xurong/medline_all/sentences/00.txt input.txt output/00.output stopwords_2.txt not

NOTE: this command only finds MEDLINE counts in a subset of MEDLINE sentences, namely the sentences from PMIDs ending in 00. To cover all of MEDLINE, run the 100 commands (one per file) in parallel or sequentially in a loop.

[xurong@frontend1 medline_analysis]$ more input.txt

breast cancer

[xurong@frontend1 medline_analysis]$ more output/00.output

breast cancer|4114100_0

breast cancer|14486400_0

breast cancer|13886800_0

breast cancer|168100_1

breast cancer|10791000_0
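Since step a has to be repeated for all 100 sentence files, the command lines can be generated rather than typed. The sketch below only builds the argument lists; it does not run anything. The launcher is passed in by the caller because the script path is elided in the command shown above ("sh scripts/"), and the zero-padded file names 00 to 99 are an assumption. The function name is illustrative.

```python
import shlex

SENTENCE_DIR = "/scratch/users/xurong/medline_all/sentences"


def find_single_commands(launcher, input_file="input.txt",
                         output_dir="output", stopwords="stopwords_2.txt"):
    """Build one find_single command per sentence file.

    launcher: the leading part of the command as a list, e.g. the
    'sh <script>' prefix used on this page (script name not given here).
    ASSUMPTION: files are named 00.txt ... 99.txt.
    """
    commands = []
    for shard in range(100):
        name = f"{shard:02d}"
        commands.append([
            *launcher,
            "-Xmx2048m",
            "medline_analysis.MedlineCountFinder",
            "find_single",
            f"{SENTENCE_DIR}/{name}.txt",
            input_file,
            f"{output_dir}/{name}.output",
            stopwords,
            "not",
        ])
    return commands
```

Printing `shlex.join(cmd)` for each returned list gives shell lines that can be run sequentially or handed to a job scheduler.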

b. gather the medline count

[xurong@frontend1 medline_analysis]$ sh scripts/ -Xmx2048m medline_analysis.MedlineCountFinder count output/ output.counter

[xurong@frontend1 medline_analysis]$ more output.counter

breast cancer|2966.0
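The gathering step can be pictured as follows. This is not the MedlineCountFinder implementation, whose exact counting rules are not described here; it is a sketch that assumes each line of the step a output has the form term|pmid_sentenceID and that a term is counted once per distinct sentence (term frequency) and once per distinct PMID (document frequency).

```python
from collections import defaultdict


def count_hits(lines):
    """Return {term: (sentence_count, abstract_count)}.

    ASSUMPTION: duplicates of the same sentence key count once.
    """
    sentences = defaultdict(set)
    abstracts = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # the sample files contain blank lines
        term, sentence_key = line.rsplit("|", 1)
        pmid = sentence_key.split("_", 1)[0]
        sentences[term].add(sentence_key)
        abstracts[term].add(pmid)
    return {t: (len(sentences[t]), len(abstracts[t])) for t in sentences}
```

Feeding it the concatenated lines of every file in the output directory yields the two counts needed for the tf-idf score of each term.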

c. find syntactic types for each term

a. find the syntactic type for each term

sh scripts/ -Xmx2048m medline_analysis.SyntacticTypeFinder findTypes /scratch/users/xurong/medline_all/sentences/00.txt /scratch/users/xurong/medline_all/trees/00.txt input.txt output2/00.output 00.done stopwords_2.txt

(NOTE: this command only finds syntactic types in a subset of MEDLINE sentences, namely the sentences from PMIDs ending in 00. To cover all of MEDLINE, run the 100 commands (one per file) in parallel or sequentially in a loop.)

input: (input.txt)

breast cancer

output: (output2/00.output)

breast cancer|ROOT|13203800_0

breast cancer|NP|13203800_0
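How SyntacticTypeFinder chooses the types it reports is not described on this page. The sample output lists both ROOT and NP for the same sentence, so one plausible reading is that every constituent covering the term is reported. The sketch below implements that reading only; it is an assumption, not the project's algorithm, and the function names are illustrative.

```python
import re

# Tokens: parentheses, bracketed scores, and everything else (labels, words).
TOKEN = re.compile(r"\(|\)|\[[^\]]*\]|[^\s()\[\]]+")


def constituents(tree: str):
    """Return (leaf words, [(label, start, end)]) with word-index spans."""
    tokens = TOKEN.findall(tree)
    leaves, spans, stack = [], [], []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token after "(" is the node label; record where it starts.
            stack.append((tokens[i + 1], len(leaves)))
            i += 2
        elif tok == ")":
            label, start = stack.pop()
            spans.append((label, start, len(leaves)))
            i += 1
        elif tok.startswith("["):
            i += 1  # parser score, not needed here
        else:
            leaves.append(tok)
            i += 1
    return leaves, spans


def syntactic_types(term: str, tree: str) -> set:
    """Labels of all nodes whose span covers an occurrence of the term.

    ASSUMPTION: matching is case-insensitive on whitespace-split words.
    """
    words = term.lower().split()
    leaves, spans = constituents(tree)
    lowered = [w.lower() for w in leaves]
    found = set()
    for start in range(len(lowered) - len(words) + 1):
        end = start + len(words)
        if lowered[start:end] == words:
            found.update(label for label, s, e in spans
                         if s <= start and end <= e)
    return found
```

For a single-word term this reading also returns the word's part-of-speech tag, since that node covers the word as well.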

b. gather the count

sh scripts/ -Xmx2048m medline_analysis.MedlineCountFinder count output2/ output2.counter

input: (all the files in output2 directory)

breast cancer|ROOT|13203800_0

breast cancer|NP|13203800_0

output: (output2.counter)

breast cancer|NP|7684.0

breast cancer|PP|4449.0

breast cancer|VP|4227.0

breast cancer|S|3400.0
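The per-type counts can be turned into the vector of syntactic types and probabilities mentioned under Data and Methods. The sketch below normalizes each term's counts so that they sum to 1 over the types listed for that term; whether the original vectors were normalized exactly this way is an assumption, and the function name is illustrative.

```python
def type_distribution(lines):
    """Map term -> {syntactic type: probability} from 'term|type|count' lines."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        term, syn_type, count = line.rsplit("|", 2)
        counts.setdefault(term, {})[syn_type] = float(count)
    # ASSUMPTION: probabilities are relative to the types listed for the term.
    result = {}
    for term, by_type in counts.items():
        total = sum(by_type.values())
        result[term] = {t: c / total for t, c in by_type.items()}
    return result
```

With the four counts shown above for "breast cancer" (total 19,760), NP receives 7684 / 19760, or about 0.39.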