WO2015138374A1

WO2015138374A1 - Methods to represent and interact with complex knowledge

Info

Publication number: WO2015138374A1
Application number: PCT/US2015/019573
Authority: WO
Inventors: Toni R. FARLEY; Spyro Mousses; Christopher YOO
Original assignee: Systems Imagination, Inc.
Priority date: 2014-03-10
Filing date: 2015-03-10
Publication date: 2015-09-17

Abstract

Methods for representing and interacting with semantic knowledge.

Description

METHODS TO REPRESENT AND INTERACT WITH COMPLEX KNOWLEDGE

FIELD OF THE INVENTION

[0001] The present invention relates to databases and in particular to a database model to represent complex data relationships in knowledge, and a method to interact with the model.

BACKGROUND OF THE INVENTION

[0002] Data is not currently captured and processed by machines in a way that supports semantics, abstract knowledge concepts, and the evolution of knowledge. Existing database models either do not capture knowledge semantics, or do not do so in a flexible and scalable way. Knowledge may be defined by the relationships among a plurality of data elements. Relational and graph data models are suited to capturing binary relationships, but they are not flexible in representing complex relationships and do not readily scale with increasing volumes of relationships. Complex relationships comprise relationships that are similar in construct, but indicative of different semantics in different contexts. The commonly used relational data model requires a schema defined a priori, which presents difficulties in capturing changing relationship structures as knowledge evolves.

[0003] The present disclosure provides a data model for capturing knowledge semantics, and a method to interact with the model. The data model is flexible to evolving knowledge defined by complex relationships, and scalable to large volumes of data. A concise method to interact with knowledge stored in the data model is provided.

SUMMARY OF THE INVENTION

[0004] In order to overcome the challenges of representing semantics in a database in a flexible and scalable manner, we present a new data model, and method to interact with the model.

DETAILED DESCRIPTION OF THE INVENTION

[0005] The present invention discloses methods for representing semantic knowledge in a data model and interacting with knowledge in a persistent data store based on the model.

[0006] Semantic knowledge may be defined by a collection of related entities. In the present disclosure, an entity is defined by a unique identifier, and a pair of ordered lists comprising a number of sublists each. For example, the following four sublists represent four ways in which an entity relates to another entity, as described in Table 1.

Table 1 Four sublists of an entity (C⁴).

[0007] In Table 1 the pair of lists are denoted α and β, and the four sublists are collectively referred to as C⁴. These lists, along with a unique identifier (UID), define an entity as:

UID, α, β (1)

[0008] The lists α and β in Table 1 have a reciprocal relationship. For instance, given an entity, x, the sublists in x(α) contain the UIDs of other entities that relate to this entity by the semantic meaning given in Table 1, as:

1. composed of (has-a): x is an entity that is made up of the entities in this sublist

2. includes: x is a general classification of the specific entities in this sublist

3. derived from: x is a concept or entity that is derived from the combination of the entities in this sublist

4. caused by: x is an effect that is caused by the entities in this sublist

[0009] The sublists in x(β) contain the UIDs of other entities that relate to this entity by the semantic meaning given in Table 1, as:

1. part of: x is a part of the entities in this sublist

2. member of (is-a): x is a member of the classification entities in this sublist

3. contributes to: x is one of the pluralities of interacting entities that contribute to the derived entity or concept in this sublist

4. effects: x is a cause that results in the effects in this sublist

[0010] By these definitions, the first sublist allows for abstractions, where an entity x can be viewed as a singular entity, or expanded and viewed by the sum of its parts using α(Composition).

[0011] As an example, if x is defined as:

x, [[a], [b], [c, d], [e]], [[], [], [], []] (2)

then a, b, c, d and e are defined as:

a, [[], [], [], []], [[x], [], [], []] (3)

b, [[], [], [], []], [[], [x], [], []] (4)

c, [[], [], [], []], [[], [], [x], []] (5)

d, [[], [], [], []], [[], [], [x], []] (6)

e, [[], [], [], []], [[], [], [], [x]] (7)

[0012] and it follows that a is a part of x, b is an x, c and d, when combined, form x, and e, when present, results in x. To further elaborate on this example, if x is a car, then a may be a tire; if x is the classification, fruit, then b may be an apple; if x is mud, then c and d may be dirt and water respectively; or if x is a sister concept, then c may be a sibling concept, and d may be a female concept; and if x is smoke, then e may be fire.
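The entity definitions of (2) to (7) can be sketched in code. The following Python fragment is an illustration only and is not part of the disclosure: the names Entity, relate and store are hypothetical, and the in-memory dictionary stands in for whatever persistent data store an embodiment might use. It shows how writing one relationship into a sublist of α can also write the reciprocal entry into the matching sublist of β.

```python
from dataclasses import dataclass, field

# Sublist positions C1..C4: composition, classification, combination, causality.
COMPOSITION, CLASSIFICATION, COMBINATION, CAUSALITY = range(4)


def empty_sublists():
    """Return four empty sublists, one per semantic."""
    return [[] for _ in range(4)]


@dataclass
class Entity:
    uid: str
    # alpha: composed of, includes, derived from, caused by
    alpha: list = field(default_factory=empty_sublists)
    # beta: part of, member of, contributes to, effects
    beta: list = field(default_factory=empty_sublists)


def relate(store, source, sublist, target):
    """Add target to source's alpha sublist and the reciprocal entry to target's beta."""
    for uid in (source, target):
        store.setdefault(uid, Entity(uid))
    store[source].alpha[sublist].append(target)
    store[target].beta[sublist].append(source)


# Rebuild the example of (2) to (7).
store = {}
relate(store, "x", COMPOSITION, "a")
relate(store, "x", CLASSIFICATION, "b")
relate(store, "x", COMBINATION, "c")
relate(store, "x", COMBINATION, "d")
relate(store, "x", CAUSALITY, "e")

print(store["x"].alpha)  # the alpha list of x, as in (2)
print(store["e"].beta)   # the beta list of e, as in (7)
```

Maintaining both lists on every write is one possible design choice; it trades extra storage for the ability to traverse a relationship from either end without a search.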

[0013] Further, a sublist may begin with a binary digit specifying whether or not ordering is imposed on the items in the list, where 0 denotes unordered and 1 denotes ordered.
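The ordering digit can be separated from the items when a sublist is read. The sketch below is a hypothetical helper, not part of the disclosure; it assumes that UIDs are stored as strings so that a leading integer 0 or 1 can be recognized as the flag, and it assumes that a sublist without a flag is treated as unordered.

```python
def decode_sublist(sublist):
    """Split a stored sublist into (ordered, items).

    A leading integer 1 marks an ordered sequence and a leading 0 an
    unordered set. Assumption: a sublist with no flag is unordered.
    """
    if sublist and sublist[0] in (0, 1):
        return bool(sublist[0]), list(sublist[1:])
    return False, list(sublist)


print(decode_sublist([1, "DR1", "A1", "G2"]))  # ordered sequence
print(decode_sublist([0, "CO2", "D2", "T1"]))  # unordered set
print(decode_sublist(["c", "d"]))              # no flag present
```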

[0014] A method to traverse a network of related entities to recover data elements and semantic knowledge is presented. In this method, we refer to the sublists C⁴ of α and β as:

α(C₁), α(C₂), α(C₃), α(C₄) (8)

β(C₁), β(C₂), β(C₃), β(C₄) (9)

[0015] A query is defined by a string of tokens in a language using the following grammar rules:

query → path {operator path} (10)

path → idlist list [count] (sublist) [return] (11)

idlist → UID{, UID} (12)

return → 0 | 1 (13)

list → α | β (14)

sublist → C₁ | C₂ | C₃ | C₄ (15)

operator → ∧ | ∨ | ¬ (16)

count → digit {digit} (17)

digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 (18)

[0016] with the following syntax rules:

→ delineates a non-terminal on the left hand side (LHS) and a production rule on the right hand side (RHS)

{. . . } denotes repetition of zero or more

[. . . ] denotes optional (zero or one)

| denotes a choice (logical or)

[0017] and the terminal symbols:

[0018] where the operators in (16) are logical conjunction, disjunction, and negation; the binary value of (13) specifies whether to return the entities traversed at this step, where 1 denotes return, 0 otherwise, wherein this value is optional and defaults to zero; the numeric value of (17) specifies how many times to repeat the following step, wherein this value is optional and defaults to one; and UID is a unique identifier on an entity.
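The grammar of (10) to (18) can be exercised with a small parser. The following Python sketch is illustrative and not part of the disclosure. It assumes that UIDs are alphanumeric, writes the sublists as C1 to C4 with plain digits, and handles only the single-step path form of rule (11), so the chained steps that appear in the later example queries are out of scope. The names Path, PATH_RE and parse_query are hypothetical.

```python
import re
from typing import NamedTuple


class Path(NamedTuple):
    uids: tuple    # idlist, rule (12)
    lst: str       # "α" or "β", rule (14)
    count: int     # repetitions, rule (17), defaults to 1
    sublist: int   # 1..4, rule (15)
    ret: bool      # return flag, rule (13), defaults to False


# One path: idlist list [count] (sublist) [return], per rule (11).
PATH_RE = re.compile(
    r"\s*(?P<uids>[A-Za-z0-9]+(?:\s*,\s*[A-Za-z0-9]+)*)\s*"
    r"(?P<lst>[αβ])\s*(?P<count>[0-9]+)?\s*"
    r"\(\s*C(?P<sub>[1-4])\s*\)\s*(?P<ret>[01])?\s*"
)
OPERATORS = "∧∨¬"  # conjunction, disjunction, negation, rule (16)


def parse_query(text):
    """Parse `path {operator path}` into a flat list of Path and operator tokens."""
    tokens, pos = [], 0
    while True:
        match = PATH_RE.match(text, pos)
        if match is None:
            raise ValueError(f"expected a path at position {pos}")
        tokens.append(Path(
            uids=tuple(u.strip() for u in match["uids"].split(",")),
            lst=match["lst"],
            count=int(match["count"] or 1),
            sublist=int(match["sub"]),
            ret=match["ret"] == "1",
        ))
        pos = match.end()
        if pos == len(text):
            return tokens
        if text[pos] not in OPERATORS:
            raise ValueError(f"expected an operator at position {pos}")
        tokens.append(text[pos])
        pos += 1


for token in parse_query("Aα(C1) ∧ Bα(C2)1"):
    print(token)
```

Because an operator always separates two paths in rule (10), a digit that follows the closing parenthesis can be read as the return flag without ambiguity in this restricted form.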

[0019] The equation in (25) is an example query to extract "all parts of A that are members of B".

Aα(C₁) ∧ Bα(C₂)1 (25)
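One way to read query (25) is as the intersection of two one-step traversals: the entities that A is composed of, and the entities that B includes. The Python sketch below illustrates that reading under stated assumptions. The entities tire, engine and boot are invented for the example, and the dictionary layout and the step function are hypothetical rather than taken from the disclosure.

```python
# Hypothetical store: uid -> {"α": [C1, C2, C3, C4], "β": [C1, C2, C3, C4]}
store = {
    "A": {"α": [["tire", "engine"], [], [], []], "β": [[], [], [], []]},
    "B": {"α": [[], ["tire", "boot"], [], []], "β": [[], [], [], []]},
    "tire": {"α": [[], [], [], []], "β": [["A"], ["B"], [], []]},
    "engine": {"α": [[], [], [], []], "β": [["A"], [], [], []]},
    "boot": {"α": [[], [], [], []], "β": [[], ["B"], [], []]},
}


def step(store, uids, lst, sublist):
    """Follow one list/sublist hop from every UID in `uids`."""
    found = set()
    for uid in uids:
        found.update(store[uid][lst][sublist - 1])
    return found


parts_of_a = step(store, {"A"}, "α", 1)    # Aα(C1): what A is composed of
members_of_b = step(store, {"B"}, "α", 2)  # Bα(C2): what B includes
result = parts_of_a & members_of_b         # logical conjunction
print(result)
```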

[0020] In some embodiments the present invention may be used to persist and interact with knowledge in a network of systems to achieve computational intelligence that is analogous to the human mind. A system, S₁, may be used for generating, capturing and representing models of reality, hypotheses, beliefs, predictions, contingencies and any other imagined relationship structures which may or may not exist in reality. These beliefs may take any form, and may be contributed by human imagination through an interface that allows users to pose models, hypotheses, predictions, contingencies etc. Depending on the source and origin of imagined beliefs, there may also be different kinds of systems that leverage human interfaces for manual inputs of testable models, or learning engine applications to automate the generation of new testable models.

[0021] In some embodiments a network of systems based on the present invention may include another system, S₂. This system may comprise an engine to automatically generate knowledge structures that represent novel relationships that a system imagines, and may apply unsupervised combinatorial approaches to generate novel structures that represent unique models with measurable predictions. The system may also emulate imagination by comparison of prior knowledge analogies to infer new contingencies that have not yet been imagined. The present invention supports a platform for this automated generation of testable models that can greatly expand the limited ability of humans to generate testable models, and therefore eliminate a critical bottleneck in the generation of knowledge. By linking automated belief generation to learned knowledge, such a system may iteratively improve its ability to infer or imagine testable hypotheses based on learning and experience, accelerating the evolution of new knowledge. The present invention provides a flexible and scalable platform to generate, capture, and store testable models, which may then provide input to a network of systems-based model testing processes.

[0022] In some embodiments a network of systems based on the present invention may include another system, S₃, that represents relationships that are precisely captured from observations, measurements, perceptions, or any other means that represents data inputs from perceiving reality. Such a system may provide an input to another system, S₄, and by doing so, S₃ provides the real world evidence repository to enable automated testing of models that are stored in and originated in S₁ or S₂.

[0023] In some embodiments a network of systems based on the present invention may include another system, S₄, which may function as a testing and learning system in the network by accepting model inputs from S₁ or S₂, and inputs from S₃, and performing model fitting functions to test how well the predictions of the model fit data structures in S₃. The output of model testing may be new structures used to annotate the models as contextually validated or not supported based on how well the predictions, contingencies, or models fit specific real world data in S₃. Validation may comprise qualitative or quantitative evaluation to determine how well a prediction or contingency in S₁ or S₂ matches the actual structures or relationships in S₃. S₁, S₂, and S₃ may store information primarily in primary and secondary network structures, while S₄ may operate at a secondary and/or tertiary network level.

[0024] In some embodiments a network of networks based on the present invention may comprise federated networks of networks, wherein networks containing similar types of primary networks with different content may be linked in higher order secondary networks to share content. For example, hundreds of distributed observation networks containing primary networks with different content could be combined into a secondary federated (overlay) network with content awareness, so relationships could be hierarchically structured even further across the secondary network in a way that is analogous to how white matter networks tie together neocortical columns of neurons in the human brain.

[0025] In some embodiments a network of networks based on the present invention may comprise tertiary learning networks, wherein an overlay network may link different kinds of federated networks, which may be designed to achieve higher order functions, such as learning, to create and recover knowledge using the methods of the present invention. Knowledge may be defined as an imagined belief that has been tested and determined to fit well with real world observations. For instance, in the human brain, the two hemispheres of the brain divide the functions of storing and generating imagined beliefs, from capturing and storing observations that precisely represent the world as it is perceived.

[0026] Illustrative Example

[0027] The present invention may be used to structure knowledge in molecular biology, capturing the reciprocal hierarchy of nested relationships in biological structures, and mechanistic processes, such as:

1. A human comprises trillions of cells.

2. Zooming in on one cell, a bone cell normally comprises 46 chromosomes, divided into 23 pairs.

3. Zooming in on one chromosome, the entire sequence of DNA for chromosome 17 comprises about 12,000 different gene loci.

4. Zooming in on one gene on chromosome 17, the DNA region of the locus from base pair 7,668,401 to base pair 7,687,549 encompasses the TP53 gene.

5. The normal TP53 gene sequence is transcribed and translated resulting in codons that encode a sequence of 393 amino acids, which form the p53 protein.

6. Zooming in on the TP53 gene locus, a segment of DNA comprising the nucleotides, G, A, T, C defines exons and introns of the TP53 genomic DNA sequence.

7. The TP53 genomic DNA sequence results in the transcription of multiple RNA molecules which comprise a set of variant messenger RNA transcripts defined by alternative splicing of primary mRNA transcripts.

8. Exposure to radiation can cause mutations in a normal p53 gene sequence, and/or such mutations can be inherited.

9. A single base pair change in DNA in the first base of three-base codon 72 will encode for a different amino acid at protein position 72.

10. An altered amino acid causes the p53 protein to function differently.

11. p53 is also related to multiple causes/functions, including binding to dozens of other proteins, binding to DNA elements to transcriptionally control a specific set of genes, and those genes in turn regulate higher order functions such as double stranded DNA repair, which in turn regulates genomic stability. Structurally altering the sequence of the TP53 gene therefore causes loss of the function of the p53 protein, which in turn deregulates genes controlled by p53, and changes to those genes in turn cause loss of DNA repair functions, and that in turn causes genomic instability, which in turn causes cancer. Representing the complex mechanism that starts with first order TP53 gene sequence changes at the DNA level, and maps/links to intermediary causes and effects in subsequent steps in a multistep mechanism that ends with higher order semantic concepts such as cancer predisposition or clinical drug response, therefore requires a system like the invention described in this disclosure, that can support capturing the multiscalar hierarchical knowledge relationships.

12. Thus, alterations in p53 protein function may cause changes in other pathways and cellular functions, such as DNA repair.

13. A bone cell might lose both copies of the TP53 gene. For example, the first hit may be an inherited deletion in the p53 gene, and the second, a codon 72 mutation caused by radiation.

14. Without any normal TP53, no normal p53 protein product is formed, which means DNA damage cannot be repaired by p53 like it is supposed to, and that lack of repair causes genomic instability to accumulate in that bone cell. When a critical mass of genetic instability and mutations have occurred, that bone cell's progeny inherit malignant traits, such as bypassing natural cell growth control, that allow selective growth advantages for cells that inherit those genetic mutations, thereby allowing for cellular evolution that can produce bone cancer (e.g. osteosarcoma).

15. Patients that have inherited mutations in the p53 gene, are classified as having Li-Fraumeni Syndrome, which is associated with sensitivity to DNA damaging agents, and predisposition to multiple types of cancer.

16. Therefore, families where a mutated p53 allele (an alternative form of the gene or locus) is passed down from generation to generation have higher incidence of certain types of cancer, like sarcoma, leukemia, breast cancer, etc. That increase in risk (predisposition) is caused by inheriting the mutated p53 gene, in combination with exposure to environmental agents that cause additional DNA damage.

[0028] This predisposition example demonstrates that a system using the present invention may be applied in a real world example to represent the fundamental code of life as computable knowledge, allowing us to navigate up and down this hierarchy of relationships, compositions, combinations, classifications, causes, etc., and link new learned knowledge, or recover knowledge at any point of the hierarchy. For instance, a child with osteosarcoma may have p53 mutations in all of their normal and cancer cells, but the additional mutation in p53 in their cancer cells may have caused additional somatic (meaning they are present only in the cancer cells) mutations in the EGFR gene. These EGFR mutations may have been linked to favorable response to a drug called erlotinib, but only in Squamous Cell Carcinoma. The context of having germ line (non-somatic) p53 mutations makes the patient a poor candidate for DNA-damaging chemotherapy because the normal cells will not be able to repair the DNA damage and the patient will likely die of toxicity or secondary cancer. EGFR inhibitors are targeted drugs that do not cause DNA damage, but erlotinib might not be indicated for Osteosarcoma (i.e. there is no clinical evidence). The knowledge above provides a causal mechanistic model, and observing that there is an EGFR mutation supports that hypothesis. Testing that n = 1 hypothesis, and observing a favorable response to erlotinib in this particular context may suggest that the hypothesis might be more generalizable, and that could be tested by suggesting a clinical trial of an adult lung cancer drug for a pediatric bone cancer. The new hypothesis is that other children with the same context, Li-Fraumeni Syndrome who have Osteosarcoma should be tested for EGFR mutations, and if they are present, they should enter into a trial to evaluate the effectiveness of erlotinib as an alternative to other treatments.

[0029] In this n = 1 analysis, imagine that we knew nothing about a child, other than their germline and cancer genomes. A system may use the present invention to:

1. comprehensively and fundamentally represent prior and patient-specific knowledge;

2. capture this knowledge in a way that allows all of the genetic and therapeutic mechanistic knowledge described in the example above to be effectively and efficiently applied to classify this osteosarcoma patient as having Li-Fraumeni Syndrome;

3. support systems level imagination of mechanistic hypotheses about plausible treatments given the context;

4. pose genetic analysis to test that imagined hypothesis; and

5. recover treatments beyond the standard of care (which would not be suitable for this context) that may be more relevant for that context.

[0030] To illustrate aspects of the present example, entity types may be represented by the entities in Table 2, wherein the unique identifiers (UIDs) comprise one or two alpha characters, which denote the UID for a "classification" entity, followed by x = [1..n], where n is the number of items in each classification respectively. For instance, CT is an entity and CTα(C₂) = {CTx | x ∈ [1..n]}, and so on.

[0031] Relationships for the items in Table 2 may be defined using the semantics of Table 1 as shown in Table 3, wherein "comprises" denotes a composition (composed of) relationship, "arise from" and "posits" denote combination (derived from) relationships, "encoded by" denotes a causality (caused by) relationship, and "codes for" denotes a reciprocal causality (effects) relationship. Data sets and sequences in Table 3 are captured using set notation, and x, y, m, and n are variables.

[0032] The relationships in Table 3 may be represented by sublists for each entity defined in Table 4, where α lists that begin with a 1 are sequences (ordered), and α lists that begin with a 0 are unordered sets.

Table 2 Items in an illustrative example.

| UIDs | Item | Specific Item in Example |
|---|---|---|
| CTx | Clinical Trials | CT1 represents an erlotinib trial |
| Px | Patients (Clinical Trial Subjects) | P1 represents a clinical trial subject |
| CLx | Cells | CL1 and CL2 represent normal (germ line) and cancer (somatic) bone cells respectively |
| CHx | Chromosomes | CH17 represents Chromosome 17 |
| Lx | Loci | L1713 represents the Locus 17p13.1 |
| Cx | Codons | C54 and C22 represent the codons CGC and CCC respectively |
| Nx | Nucleotides | N1, N2 represent the nucleotides guanine (G) and cytosine (C) respectively |
| AAx | Amino Acids | AA2 and AA15 represent the amino acids arginine and proline respectively |
| Gx | Genes | G1 and G2 represent the genes TP53 and EGFR respectively |
| PRx | Proteins | PR1 represents the protein p53 |
| Mx | Gene Mutations | M1 represents a specific inherited mutation in TP53, and M2 represents a mutation in EGFR |
| Ex | Environmental factors | E1 represents exposure to radiation |
| COx | Concepts | CO1 represents sensitivity to DNA damaging agents, and CO2 represents a favorable treatment outcome |
| GSx | Gene States | GS1 and GS2 represent classes of amplified and mutated gene states respectively |
| Dx | Diseases/Disorders | D1 and D2 represent Squamous Cell Carcinoma, and Osteosarcoma respectively |
| DRx | Drugs (Pharmaceuticals) | DR1 represents the drug erlotinib (an EGFR inhibitor) |
| Ax | Molecular Actions | A1 represents inhibits |
| Tx | Targets | T1 represents a specific molecular target |
| Hx | Hypotheses | H1 represents a hypothesis to be tested, wherein the hypotheses may be automatically generated |

Table 3 Relationships in the illustrative example.

[0033] Given the data structure of the present example, we can compose queries to recover knowledge. For instance, a query to recover biologically relevant treatment options (drugs) for P1 may search for molecular contexts that contribute to P1, which in this case recovers sets of amplified and expressed genes, and recover any treatment options these genes contribute to. Using the grammar rules of the present disclosure, this query may translate to:

P1α(C₃)α(C₂)α(C₃)α(C₃)1 (26)

[0034] The query of (26) returns a null set as there are no drug targets present in this knowledge that are associated with the patient's amplified and mutated genes.

[0035] A subsequent query (27) may expand the search to include additional molecular contexts that are related to the patient's amplified and mutated genes, as:

P1α(C₃)2α(C₂)1β(C₄)1β(C₂)α(C₃)α(C₃)1 (27)

Table 4 Sublists generated in the illustrative example.

D1   [[],[],[],[]]                [[],[],[CO2],[]]
D2   [[],[],[],[]]                [[],[],[H1],[]]
DR1  [[],[],[],[]]                [[],[],[T1],[]]
A1   [[],[],[],[]]                [[],[],[T1],[]]
T1   [[],[],[1,DR1,A1,G2],[]]     [[],[],[CO2,H1],[]]
H1   [[],[],[0,CO2,D2,T1],[]]     [[],[],[],[]]

[0036] The query of (27) returns a subset of knowledge relating TP53 mutations (M1) to EGFR mutations (M2) (the 1's in the paths (11) specify to return these intermediate results), and the subset further comprises the target T1, and its related drug erlotinib (DR1), mechanism of action, "inhibits" (A1), and the gene EGFR (G2).

[0037] A subsequent query on the recovered target may recover more knowledge of contexts related to this target:

T1β(C₃)1α(C₃)1 (28)

[0038] The query of (28) returns a subset of knowledge comprising the concept of favorable treatment outcome (CO2) for T1 in the context of the disease Squamous Cell Carcinoma (D1), and the hypothesis (H1) that the same treatment outcome (CO2) may arise for T1 in the context of the disease Osteosarcoma (D2).
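The traversal behind query (28) can be traced in code, reading the query as: start at T1, follow β(C₃) and return the result, then follow α(C₃) and return the result. The Python sketch below is illustrative and not part of the disclosure. The rows for D1, D2, DR1, A1, T1 and H1 follow the Table 4 rows shown above; the row for CO2 is not among them and is an assumption inferred from the reciprocal entries of D1, T1 and H1. The helper names entity and run_path are hypothetical.

```python
def entity(alpha_c3=(), beta_c3=()):
    """Build an entity whose only populated sublist is C3 (combination)."""
    return {"α": [[], [], list(alpha_c3), []], "β": [[], [], list(beta_c3), []]}


store = {
    "D1": entity(beta_c3=["CO2"]),
    "D2": entity(beta_c3=["H1"]),
    "DR1": entity(beta_c3=["T1"]),
    "A1": entity(beta_c3=["T1"]),
    "T1": entity(alpha_c3=[1, "DR1", "A1", "G2"], beta_c3=["CO2", "H1"]),
    "H1": entity(alpha_c3=[0, "CO2", "D2", "T1"]),
    # Assumed row: inferred from the reciprocal entries above, not shown in Table 4.
    "CO2": entity(alpha_c3=[0, "D1", "T1"], beta_c3=["H1"]),
}


def run_path(store, start, steps):
    """Apply (list, sublist, return_flag) steps in order and collect flagged results."""
    current, returned = {start}, []
    for lst, sublist, ret in steps:
        reached = set()
        for uid in current:
            items = store[uid][lst][sublist - 1]
            # Skip a leading ordering digit (0 or 1) if one is present.
            if items and items[0] in (0, 1):
                items = items[1:]
            reached.update(items)
        current = reached
        if ret:
            returned.append(sorted(current))
    return returned


results = run_path(store, "T1", [("β", 3, True), ("α", 3, True)])
print(results)
```

Under these assumptions the first step reaches CO2 and H1, and the second step reaches CO2, D1, D2 and T1, which matches the entities named in paragraph [0038].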

[0039] Illustrative Example 2

[0040] To further illustrate an exhaustive capture of prior knowledge, relations among nucleotides, codons, and amino acids may be represented in the present invention as shown in Table 5, where Cod refers to Codons in the first four sets.

Table 5 Capturing the Relationships Among

Nucleotides, Codons, and Amino Acids.

CTC [[C, T,C],[],[],[]] [[],D,[],[Leu]]

CTA [[C, T,A],[],[],[]] [[],D,[],[Leu]]

CTG [[C, T,G],[],[],[]] [[],[],[],[Leu]]

ATT [[A, T, T],[],[],[]] [[],[],[],[He]]

ATC [[A, T,C],[],[],[]] [[],[],[],[He]]

ATA [[A, T,A],[],[],[]] [[UnUIle]]

ATG [[A, T,G],[],[],[]] [[],[],[],[Met]]

GTT [[G, T, T],[],[],[]] [[],[],[],[Val]]

GTC [[G, T,C],[],[],[]] [[],[],[],[Val]]

GTA [[G, T,A],[],[],[]] [[],[],[],[Val]]

GTG [[G, T,G],[],[],[]] [[],[],[],[Val]]

TCT [[T,C, T],[],[],[]] [[],[],D,[Ser]]

TCC [[T,C,C],[],[],[]] [[],[],[],[Ser]]

TCA [[T,C,A],[],[],[]] [[],[],[],[Ser]]

TCG [[T,C,G],[],[],[]] [[],[],D,[Ser]]

CCT [[C,C, T],[],[],[]] [[],[],[],[Pro]]

CCC [[C,C,C],[],[],[]] [[],[],[],[Pro]]

CCA [[C,C,A],[],[],[]] [[],[],[],[Pro]]

CCG [[C,C,G],[],[],[]] [[],[],[],[Pro]]

ACT [[A,C, T],[],[],[]] [[],[],[],[T r]]

ACC [[A,C,C],[],[],[]] [[],[],[],[T r]]

ACA [[A,C,A],[],[],[]] [[],[],[],[¾]]

ACG [[A,C,G],[],[],[]] [[],[],[],[T r]]

GCT [[G,C, T],[],[],[]] [[],[],[],[Ala]]

GCC [[G,C,C],[],[],[]] [[],[],[],[Ala]]

GCA [[G,C,A],[],[],[]] [[MMMAla]]

GCG [[G,C,G],[],[],[]] [[],[],[],[Ala]]

TAT [[T, A, T],[],[],[]] [[UUUTyrll

TAC [[T, A,C],[],[],[]] [[],[],[],[Tyr]]

TAA [[T, A,A],[],[],[]] [[],[],[],[STOP]]

TAG [[T, A,G],[],[],[]] [[],[],[],[STOP]]

CAT [[C, A, T],[],[],[]] [[],[],[],[His]]

CAC [[C, A,C],[],[],[]] [[],[],[],[His]]

CAA [[C, A,A],[],[],[]] [[],[],[],[Gln]]

CAG [[C, A,G],[],[],[]] [[],[],[],[Gln]]

AAT [[A, A, T],[],[],[]] [[],[],[],[Asn]]

AAC [[A, A,C],[],[],[]] [[],[],[],[Asn]]

AAA [[A, A,A],[],[],[]] [[],[],[],[Lys]]

AAG [[A, A,G],[],[],[]] [[],D,[],[Lys]]

GAT [[G, A, T],[],[],[]] [n,il,n,[Aspll

GAC [[G, A,C],[],[],[]] [[],[],[],[Asp]]

GAA [[G, A,A],[],[],[]] [[],[],[],[Glu]]

GAG [[G, A,G],[],[],[]] [[],D,[],[Glu]]

TGT [[T, G, T],[],[],[]] [[lil UCysll

TGC [[T, G,C],[],[],[]] m,n,n,[cysii

[0041] Table 5 comprises:

1. 4 nucleotide bases T, C, A, and G denoted respectively as the UID;

2. 4³ = 64 codons, each comprising a sequence of 3 bases denoted by sequence as the UID;

3. 20 amino acids coded by the codons, and denoted by the abbreviations in Table 6 as the UID;

4. START and STOP sequences coded by codons, and denoted respectively as the UID; and

5. the codon ATG represents both the START sequence, and the amino acid Met.

Table 6 Amino Acid Abbreviations.

[0042] Table 5 further captures the relationships among these entities, representing the knowledge that:

1. a single amino acid can be coded by anywhere from one to six codons,

2. a codon has an exclusive 1 to 1 causal relationship with an amino acid,

3. 1 codon represents the start of an amino acid sequence, and

4. 3 codons represent a stop in the amino acid sequences.

[0043] Polypeptides (or proteins) are derived from sequences of amino acids. A nucleotide sequence (codon) in a gene transcript can be replaced by its related amino acid, using the methods of the present invention, to derive a polypeptide from a gene transcript.
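The replacement of codons by their related amino acids can be sketched as a lookup on the fourth β sublist (effects) of each codon entity. The Python fragment below is an illustration only, not the disclosed implementation. It uses five codon rows consistent with Table 5, represents each entity as a hypothetical (α, β) tuple, and assumes the transcript is already aligned to the reading frame; the function name translate and the sample transcript are invented for the example.

```python
# uid -> (alpha, beta); beta[3] holds the amino acid (or STOP) that the codon effects.
CODONS = {
    "ATG": ([["A", "T", "G"], [], [], []], [[], [], [], ["Met"]]),
    "CTG": ([["C", "T", "G"], [], [], []], [[], [], [], ["Leu"]]),
    "GTT": ([["G", "T", "T"], [], [], []], [[], [], [], ["Val"]]),
    "CCC": ([["C", "C", "C"], [], [], []], [[], [], [], ["Pro"]]),
    "TAA": ([["T", "A", "A"], [], [], []], [[], [], [], ["STOP"]]),
}


def translate(transcript):
    """Replace each codon by the entity in its β(C4) sublist, stopping at STOP."""
    peptide = []
    for i in range(0, len(transcript) - 2, 3):
        codon = transcript[i:i + 3]
        _, beta = CODONS[codon]
        product = beta[3][0]
        if product == "STOP":
            break
        peptide.append(product)
    return peptide


print(translate("ATGCTGGTTCCCTAA"))
```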

Claims

What is claimed:

1. A method to represent semantic knowledge as a collection of related entities, wherein the entities are represented by a data model comprising:

a) a unique identifier;

b) a pair of ordered lists, wherein items in the second list have a reciprocal relationship to items in the first list;

c) the lists further comprise a plurality of sublists; and

d) the sublists represent distinct semantics.

2. The method of claim 1 wherein the lists comprise 4 sublists, representing the semantics: composition, classification, combination, and causality.

3. A method to interact with the data model comprising:

a) a method to traverse a network of related entities;

b) a method to extract data elements and semantic knowledge from the network; and

c) a method to define traversal and extraction procedures, comprising a query language, wherein the query language is defined by a grammar, and legal (allowed/valid) strings in the language represent queries on the network.