Text mining, pathways and disease Andrey Rzhetsky

Scientists are like plants
biomass -> texts/studies
sunlight, water -> peaceful time and resources for work
Production of new texts (in
pages): new additions annually
We estimated that the world
libraries host at least a
TRILLION
unique scholarly pages
(Evans and Rzhetsky, 2009; analysis of the Online Computer
Library Center’s WorldCat database of books and journals in
71,000 libraries across 121 countries)
A bit of context
PubMed search for cancer
Results: 2,358,785 papers (March 17, 2010)
GeneWays as an infogrinder
On-line Journals
GeneWays
Pathways
Graph: multi-type arcs
and nodes
Context-free grammar:
Production rules
S -> A activates B
B -> binding of D by C
A -> p53
C -> bax
D -> bad
Context-free grammar:
Non-terminals
S, A, B, C, D
Context-free grammar
Terminals:
p53
bax
bad
Context-free grammar
[Parse tree: S -> A (p53) activates B; B -> binding of D (bad) by C (bax)]
Context-free grammar:
Parsing
[Parse tree: S -> A (p53) activates B; B -> binding of D (bad) by C (bax)]
[ bind, [protein, bad], [protein,bax] ]
Context-free grammar:
Parsing
[Parse tree: S -> A (p53) activates B]
[activate, [protein, p53], [action, B]]
Context-free grammar:
Parsing
“Our experiments show that p53 activates
binding of bax by bad.”
[activate, [protein, p53],
[bind, [protein, bad],
[protein, bax]
]
]
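The toy grammar is small enough to parse by hand-written recursive descent. The Python sketch below is only an illustration, not the GeneWays parser: all function names are invented here, and it follows the production rules exactly as listed (D -> bad, C -> bax), so it accepts the string "p53 activates binding of bad by bax" and emits the nested-list form used on the slides.

```python
# Minimal recursive-descent parser for the slide's toy grammar (illustrative only).
# S -> A activates B ; B -> binding of D by C ; A -> p53 ; C -> bax ; D -> bad

PROTEIN_FOR = {"A": "p53", "C": "bax", "D": "bad"}  # terminal for each protein non-terminal


def expect(tokens, pos, word):
    """Consume one literal word and return the next position."""
    if pos < len(tokens) and tokens[pos] == word:
        return pos + 1
    raise ValueError(f"expected {word!r} at token {pos}")


def parse_protein(tokens, pos, nonterminal):
    """Parse A, C or D, each of which derives exactly one protein name."""
    name = PROTEIN_FOR[nonterminal]
    return ["protein", name], expect(tokens, pos, name)


def parse_b(tokens, pos):
    """B -> binding of D by C"""
    pos = expect(tokens, pos, "binding")
    pos = expect(tokens, pos, "of")
    d, pos = parse_protein(tokens, pos, "D")
    pos = expect(tokens, pos, "by")
    c, pos = parse_protein(tokens, pos, "C")
    return ["bind", d, c], pos


def parse_s(tokens, pos=0):
    """S -> A activates B"""
    a, pos = parse_protein(tokens, pos, "A")
    pos = expect(tokens, pos, "activates")
    b, pos = parse_b(tokens, pos)
    return ["activate", a, b], pos


def parse(sentence):
    """Tokenize on whitespace and require that the whole input is consumed."""
    tokens = sentence.lower().rstrip(".").split()
    tree, pos = parse_s(tokens)
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return tree


print(parse("p53 activates binding of bad by bax"))
```

A real text-mining grammar would also have to skip framing phrases such as "Our experiments show that" and handle many more terminals; this sketch deliberately does neither.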
To be more concrete…
The phosphorylation of ATP-citrate lyase by NDPK
suggests that NDPK may have a role in the regulation
of membrane biosynthesis
ndpk phosphorylates atp-citrate lyase
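To make the step from sentence to statement tangible, here is a deliberately naive Python sketch that pulls the same (agent, action, target) triple out of the example sentence with one regular expression. The pattern and the function name are assumptions made for illustration; the slides describe a grammar-based parser, not pattern matching, and this sketch only handles the single phrasing "phosphorylation of X by Y" with a one-word agent.

```python
import re

# Hypothetical pattern: "phosphorylation of <substrate> by <one-word agent>".
PATTERN = re.compile(
    r"phosphorylation of (?P<substrate>.+?) by (?P<agent>\S+)",
    re.IGNORECASE,
)


def extract_phosphorylation(sentence):
    """Return (agent, 'phosphorylate', substrate) in lower case, or None if no match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return (
        match.group("agent").lower(),
        "phosphorylate",
        match.group("substrate").lower(),
    )


sentence = (
    "The phosphorylation of ATP-citrate lyase by NDPK suggests that NDPK "
    "may have a role in the regulation of membrane biosynthesis"
)
print(extract_phosphorylation(sentence))
```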
Typical arcs
1001,'bind'
1004,'suppress'
1011,'replace'
1018,'interact'
1020,'activate'
1022,'stimulate'
1023,'phosphorylate'
1027,'increase'
1028,'associate'
1034,'up-regulate'
1036,'inhibit'
1040,'promote'
1041,'down-regulate'
1043,'trigger'
1049,'block'
1054,'modify'
1057,'digest'
1058,'degrade'
1062,'link'
1071,'cleave'
1072,'release'
1074,'catalyze'
1083,'inactivate'
1106,'repress'
1110,'acetylate'
1117,'methylate'
Typical nodes (database ID, name)
17767,'calcium channel antagonists'
20324,'hsp70 chaperone'
17467,'activator protein 1'
5104,'daunorubicin'
13194,'tyrosyl-phosphorylated'
9689,'paroxonase'
4190,'immunodeficiency'
4478,'iga2'
8552,'human fcgammarii'
4472,'iga1'
13151,'ikaros'
9820,'caveolin 1'
7277,'virus-triggered p-dcs'
4366,'complexes pr-3'
12290,'anti-alpha4 mabs'
2258,'gal4-mef2d'
14464,'polyneuropathy'
16044,'alk5'
10393,'mek-1 inhibitor'
13262,'pro-matrilysin'
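One way to picture a graph with multi-type arcs is a directed multigraph whose arcs carry an arc-type ID. The Python sketch below is an assumption about representation, not the GeneWays schema: the class and method names are invented, nodes are keyed by name rather than by database ID (the slides do not give IDs for these particular nodes), and only three arc-type IDs from the list above are included.

```python
from collections import defaultdict

# A few arc-type IDs copied from the "Typical arcs" list.
ARC_TYPES = {1001: "bind", 1020: "activate", 1023: "phosphorylate"}


class TypedGraph:
    """Directed multigraph: each arc is (source, arc_type_id, target)."""

    def __init__(self):
        self.out_arcs = defaultdict(list)  # source -> [(arc_type_id, target), ...]

    def add_arc(self, source, arc_type_id, target):
        if arc_type_id not in ARC_TYPES:
            raise KeyError(f"unknown arc type {arc_type_id}")
        self.out_arcs[source].append((arc_type_id, target))

    def arcs_of_type(self, type_name):
        """List (source, target) pairs for every arc whose type has this name."""
        return [
            (source, target)
            for source, arcs in self.out_arcs.items()
            for arc_type_id, target in arcs
            if ARC_TYPES[arc_type_id] == type_name
        ]


graph = TypedGraph()
graph.add_arc("ndpk", 1023, "atp-citrate lyase")  # statement from the slides
graph.add_arc("bad", 1001, "bax")  # toy grammar example; arc direction is assumed
print(graph.arcs_of_type("phosphorylate"))
```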
Graph: multi-type arcs
and nodes
Useful hairballs…
dandelions…
vermicelli…
dust bunnies…
rat’s nests…
Well, if reality is complicated,
so is the corresponding figure
Intel 8086: breadboard
Making case for utility
Looking at cerebellar malformations
through text-mined interactomes of
mice and humans
Ivan Iossifov, Raul Rodriguez-Esteban, Ilya
Mayzus, Kathleen J. Millen, and Andrey
Rzhetsky
PLoS Computational Biology, 2009
Application to discovery of genes
related to heritable disorders
Molecular triangulation
We (again!) generated a
number of high-confidence
predictions about associations
of phenotypes with genes.
Testing experimentally: to be
continued…
An alternative title:
about useful hairballs…
dandelions…
ridiculograms…
dust bunnies…
rat’s nests…
Intel 8086: breadboard
Using the position
in networks to
describe function
(from Mark Gerstein)
[NY Times, 2-Oct-05, 9-Dec-08]
Star witness: D-SPOP
D-SPOP: phenotype
Wild eye
Egr-triggered cell death
One copy of D-SPOP deleted
D-SPOP as marker
In humans, SPOP was highly
expressed in 99% of clear
cell renal cell carcinomas,
the most prevalent form of
kidney cancer.
A case for (marginal)
usefulness of our efforts
Next: A scary story…
The invisible plague
Steady rise
of prevalence
of neurodevelopmental
disorders during
the last 250
years
Autism
Bipolar disorder
Schizophrenia
Genetic-linkage Mapping of Complex Hereditary Disorders
to a Whole-genome Molecular-interaction Network
AR
Tian Zheng
Miron Baron
T. Conrad Gilliam
One-chromosome example
[Figure: individuals drawn as ☺ and ☹ faces, grouped under the labels Sick and Healthy]
Hypothetical pedigree
Model
Now, we have 23
chromosomes in two copies,
~25,000 genes
Combinations of genes:
10^8 -- 2 genes
10^12 -- 3 genes
10^16 -- 4 genes
10^37 -- 10 genes
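These orders of magnitude can be checked directly: the number of unordered sets of k genes out of roughly 25,000 is the binomial coefficient C(25000, k). The short Python sketch below computes the leading power of ten for each k on the slide; the function name is ours, and 25,000 is the approximate gene count quoted above.

```python
import math

N_GENES = 25_000  # approximate gene count quoted on the slide


def order_of_magnitude(n_genes, k):
    """Return floor(log10(C(n_genes, k))), the exponent of the leading power of ten."""
    return math.floor(math.log10(math.comb(n_genes, k)))


for k in (2, 3, 4, 10):
    print(f"{k} genes: about 10^{order_of_magnitude(N_GENES, k)} combinations")
```

The exponents come out as 8, 12, 16 and 37, matching the slide.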
It doesn’t make sense
(statistically) to test all
possible combinations of
genes for association with
disease
Possible way out: test only
functionally plausible
combinations of genes
Graph: multi-type arcs
and nodes
Graph: only physical
interactions (such as bind,
phosphorylate, etc)
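Restricting the graph to physical interactions amounts to filtering arcs by type. The Python sketch below assumes arcs are stored as (source, arc type, target) triples; the set of "physical" types contains only the two named on the slide (bind, phosphorylate), and which further arc types belong in it is left open. The example arcs are toy data, and the function name is invented here.

```python
# Arc types treated as physical interactions. Only the two named on the slide
# are listed; extending this set is a modelling decision not covered here.
PHYSICAL_TYPES = frozenset({"bind", "phosphorylate"})


def physical_subgraph(arcs, physical_types=PHYSICAL_TYPES):
    """Keep only arcs whose type is in physical_types, preserving input order."""
    return [(source, arc_type, target)
            for source, arc_type, target in arcs
            if arc_type in physical_types]


# Toy arcs for illustration only.
toy_arcs = [
    ("ndpk", "phosphorylate", "atp-citrate lyase"),
    ("bad", "bind", "bax"),
    ("p53", "activate", "bax"),
]
print(physical_subgraph(toy_arcs))
```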
Then we realized
that the clusters
should be
allowed to have
arbitrary
topology
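A cluster of arbitrary topology can be read as any set of genes that is connected in the interaction network, whatever its shape (chain, star, and so on). The Python sketch below enumerates all connected gene sets of a given size in a small undirected toy network by growing sets one neighbor at a time. It is a brute-force illustration under that reading, not the method used in the study, and the gene names and function name are hypothetical.

```python
from math import comb


def connected_gene_sets(adjacency, size):
    """Return every connected set of `size` genes in an undirected graph.

    adjacency maps each gene to the set of its neighbors. Sets are grown by
    adding one neighboring gene at a time; frozensets remove duplicates.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    current = {frozenset([gene]) for gene in adjacency}
    for _ in range(size - 1):
        grown = set()
        for cluster in current:
            for gene in cluster:
                for neighbor in adjacency[gene]:
                    if neighbor not in cluster:
                        grown.add(cluster | {neighbor})
        current = grown
    return current


# Hypothetical toy network: a chain a-b-c-d with an extra gene e attached to b.
toy_network = {
    "a": {"b"},
    "b": {"a", "c", "e"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": {"b"},
}

triples = connected_gene_sets(toy_network, 3)
print(len(triples), "connected triples out of", comb(len(toy_network), 3))
```

In this toy network only 4 of the 10 possible gene triples are connected, which is the kind of reduction that testing only network-connected combinations is meant to provide.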
Genetic heterogeneity model
Pedigrees, phenotypes,
Marker states
parameters
cluster probability for gene 1
cluster probability for gene c
disease
We did this exercise for all
three disorders (autism,
bipolar disorder,
schizophrenia).
We then combined the most
significant predictions.
“Our” predictions indeed form overlapping networks!