Given a list of terms and a short string that appears in many of them, we seek a way to produce useful phrases that contain the short string and can help to pinpoint one's intended term without listing all the matching terms.
So, suppose we have a list of terms like this:
2Fe-2S ferredoxin subdomain
5'3'-Exonuclease N- and I-domain
AAA-protein subdomain
Alanine dehydrogenase/PNT, C-terminal subdomain
Alanine dehydrogenase/PNT, N-terminal subdomain
Alpha-2-macroglobulin receptor-associated protein, domain 1
Alpha amylase, catalytic subdomain
Aspartate/ornithine carbamoyltransferase, carbamoyl-P binding domain
Autotransporter beta-domain
Bacterial extracellular solute-binding protein, family 1 domain
Bacterial membrane-flanked domain
Bacterial transcriptional activator domain
Blue (type 1) copper domain
Bromodomain
Bromodomain transcription factor
Carbohydrate binding domain, family 11
Carbohydrate binding domain, family 15
Carbohydrate binding domain, family 17/28
Carbohydrate-binding domain, family V/XII
Catalytic domain of components of various dehydrogenase complexes
Cell division protein 48, CDC48, domain 2
Chitin-binding, domain 3
Condensation domain
CpcD phycobilisome linker-like subdomain
Cysteine-rich small domain
Dullard-like phosphatase domain
Elongation factor G, domain IV
Elongation factor Tu, domain 2
Epidermal growth-factor receptor (EGFR), L domain
Epoxide hydrolase N-terminal domain-like phosphatase
eRF1 domain 1
eRF1 domain 2
eRF1 domain 3
Fatty acid desaturase subdomain
F-box protein interaction domain
FCP1-like phosphatase, phosphatase domain
Fibrobacter succinogenes major paralogous domain
Fibronectin, type III subdomain
Gal4-like dimerisation domain
Glucan 1,4-alpha-glucosidase with starch-binding domain
Glycoside hydrolase family 2, immunoglobulin-like beta-sandwich domain
Glycoside hydrolase, family 2, TIM barrel domain
GTP1/OBG domain
GTP1/OBG subdomain
GTP-binding signal recognition particle SRP54, G-domain
Hedgehog/intein hint domain, C-terminal
Homeobox domain, ZF-HD class
Homeodomain-like
Homeodomain protein CUT
Homeodomain-related
Hpt, subdomain
Iroquois-class homeodomain protein
Legume lectin, beta domain
Metal-dependent phosphohydrolase, HD subdomain
Mo-co oxidoreductase dimerisation domain
MoeA, C-terminal, domain IV
MoeA, N-terminal, domain I and II
Molybdopterin binding domain
Mycoplasmal MG032/MG096/MG288 1 domain
Mycoplasmal MG032/MG096/MG288 2 domain
N-acetylglutamate kinase with DUF619 domain
Nitrite/sulfite reductase, flavoprotein alpha-component, domains 1 and 3
Paired-like homeodomain protein, OAR
PEA3-type ETS-domain transcription factor, N-terminal
Penicillin-binding protein, dimerisation domain
Peptidase, alpha-lytic prodomain
Peptidoglycan-binding domain 1
Phosphatidylinositol-specific phospholipase C, Y domain
Phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain I
Phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain II
Phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain III
Plant-specific domain of unknown function 3588
Predicted aldehyde dehydrogenase with duplicated domain
Predicted bacteriophytochrome with receiver domain
Predicted kinase with amino acid kinase domain
Predicted prephenate dehydrogenase/arogenate dehydrogenase with a C-terminal regulatory domain
Putative zinc finger domain, LRP1
Quinohemoprotein amine dehydrogenase, alpha chain, domain 3
Relaxase/mobilization nuclease domain
Respiratory-chain NADH dehydrogenase domain, 51 kDa subunit
Response regulator with LytTR DNA-binding domain, AlgR/VirR/ComE type
Restriction modification system DNA specificity domain
RfaE bifunctional protein, domain I
RfaE bifunctional protein, domain II
RNA polymerase Rpb1, domain 1
RNA polymerase Rpb1, domain 3
RNA polymerase Rpb1, domain 4
RNA polymerase Rpb1, domain 5
RNA polymerase Rpb1, domain 6
RNA polymerase Rpb1, domain 7
RNA polymerase Rpb2, domain 2
RNA polymerase Rpb2, domain 3
RNA polymerase Rpb2, domain 4
RNA polymerase Rpb2, domain 5
RNA polymerase Rpb2, domain 6
RNA polymerase Rpb2, domain 7
RuvA domain 2-like
Saposin B subdomain
Sec8 exocyst complex component specific domain
Signal peptide binding (SRP54) M-domain
SLA1 homology domain 1, SHD1
S-layer-related duplication domain
Small GTP-binding protein domain
Sugar-specific permease, EIIA 1 domain
Thioredoxin domain 2
Toprim subdomain
Uncharacterised conserved protein with HAD-like hydrolase domain
Uncharacterized Cys-rich domain
Uncharacterized domain 2
Uncharacterized hydrophobic domain
Uncharacterized plant-specific domain
Uncharacterized plant-specific domain 01589
Uncharacterized plant-specific domain 01627
(This is a list of InterPro terms that contain the substring "domain".)
We define a phrase to be a sequence of consecutive words within a term (like "binding domain"). For example, the phrases of the term "foo bar fnord" are "foo", "foo bar", "foo bar fnord", "bar", "bar fnord" and "fnord".
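The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `phrases` is made up here, and words are assumed to be separated by whitespace.

```python
def phrases(term):
    """Return every run of consecutive words in `term`.

    Assumes words are separated by whitespace; the name `phrases`
    is hypothetical and not taken from the original text.
    """
    words = term.split()
    return [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    ]


print(phrases("foo bar fnord"))
# ['foo', 'foo bar', 'foo bar fnord', 'bar', 'bar fnord', 'fnord']
```

A term of n words yields n(n+1)/2 phrases, so long terms contribute many candidates.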
Given a string like "domain", which occurs often in this list (indeed in every item), it is desirable to produce a short list of frequently occurring phrases that contain "domain".
The process for doing this is twofold. First, we consider all phrases of each term and count their occurrences, discarding those that occur only once.
In this process, we also clean up every phrase encountered. First, we consider only the lower-case version of each phrase (assuming case-insensitivity for the entire process). Next, we remove the first and the last character if it is not a letter, a number, "(" or "[". Finally, we remove leading and trailing "stopwords". A stopword, in this case, is simply a frequently occurring word in the English language, like "the", "and" or "with", which does not contribute significantly if it appears at the beginning or end of a phrase. We have to be careful when doing such transformations not to produce phrases that do not occur in the original list of terms (assuming case-insensitivity).
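A minimal sketch of this first step might look like the following. It reads the cleaning rules literally, so it may not reproduce the histogram on this page exactly. The names `clean` and `phrase_counts` and the contents of `STOPWORDS` are assumptions, since the actual stopword list is not given. The sketch also assumes that a phrase is counted at most once per term.

```python
from collections import Counter

# Hypothetical stopword list; the one actually used is not given in the text.
STOPWORDS = {"the", "and", "with", "of", "a", "an", "in", "to", "for"}
KEEP = "(["  # non-alphanumeric characters that are never trimmed


def clean(phrase):
    """Lower-case a phrase, trim one stray character from each end,
    then drop leading and trailing stopwords."""
    phrase = phrase.lower()
    if phrase and not (phrase[0].isalnum() or phrase[0] in KEEP):
        phrase = phrase[1:]
    if phrase and not (phrase[-1].isalnum() or phrase[-1] in KEEP):
        phrase = phrase[:-1]
    words = phrase.split()
    while words and words[0] in STOPWORDS:
        words.pop(0)
    while words and words[-1] in STOPWORDS:
        words.pop()
    return " ".join(words)


def phrase_counts(terms, min_count=2):
    """Count cleaned phrases over all terms, keeping those seen in at
    least `min_count` terms (assumption: one count per term)."""
    counts = Counter()
    for term in terms:
        words = term.split()
        seen = set()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                cleaned = clean(" ".join(words[start:end]))
                if cleaned:
                    seen.add(cleaned)
        counts.update(seen)
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


terms = ["eRF1 domain 1", "eRF1 domain 2", "Condensation domain"]
print(phrase_counts(terms))
```

Because cleaning only removes characters or whole words from the two ends of a phrase, every cleaned phrase is still a case-insensitive substring of its term, which is the caveat the paragraph above warns about.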
If we look at the above list of terms, a phrase histogram of 95 phrases is produced:
Count: 85 Phrase: "domain"
Count: 13 Phrase: "subdomain"
Count: 12 Phrase: "rna"
Count: 12 Phrase: "rna polymerase"
Count: 12 Phrase: "protein"
Count: 12 Phrase: "polymerase"
Count: 6 Phrase: "uncharacterized"
Count: 6 Phrase: "rpb2, domain"
Count: 6 Phrase: "rpb2"
Count: 6 Phrase: "rpb1, domain"
Count: 6 Phrase: "rpb1"
Count: 6 Phrase: "rna polymerase rpb2, domain"
Count: 6 Phrase: "rna polymerase rpb2"
Count: 6 Phrase: "rna polymerase rpb1, domain"
Count: 6 Phrase: "rna polymerase rpb1"
Count: 6 Phrase: "polymerase rpb2, domain"
Count: 6 Phrase: "polymerase rpb2"
Count: 6 Phrase: "polymerase rpb1, domain"
Count: 6 Phrase: "polymerase rpb1"
Count: 6 Phrase: "domain 2"
Count: 6 Phrase: "binding"
Count: 5 Phrase: "domain 3"
Count: 5 Phrase: "domain 1"
Count: 5 Phrase: "dehydrogenase"
Count: 5 Phrase: "binding domain"
Count: 4 Phrase: "predicted"
Count: 4 Phrase: "plant-specific"
Count: 4 Phrase: "plant-specific domain"
Count: 4 Phrase: "n-terminal"
Count: 4 Phrase: "hydrolase"
Count: 4 Phrase: "factor"
Count: 4 Phrase: "domain,"
Count: 4 Phrase: "c-terminal"
Count: 3 Phrase: "uncharacterized plant-specific"
Count: 3 Phrase: "uncharacterized plant-specific domain"
Count: 3 Phrase: "protein, domain"
Count: 3 Phrase: "phosphoglucomutase/phosphomannomutase"
Count: 3 Phrase: "phosphoglucomutase/phosphomannomutase alpha/beta/alpha"
Count: 3 Phrase: "phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain"
Count: 3 Phrase: "phosphatase"
Count: 3 Phrase: "homeodomain"
Count: 3 Phrase: "homeodomain protein"
Count: 3 Phrase: "erf1"
Count: 3 Phrase: "erf1 domain"
Count: 3 Phrase: "domain i"
Count: 3 Phrase: "dimerisation"
Count: 3 Phrase: "dimerisation domain"
Count: 3 Phrase: "carbohydrate"
Count: 3 Phrase: "carbohydrate binding"
Count: 3 Phrase: "carbohydrate binding domain,"
Count: 3 Phrase: "carbohydrate binding domain"
Count: 3 Phrase: "binding domain,"
Count: 3 Phrase: "bacterial"
Count: 3 Phrase: "alpha/beta/alpha"
Count: 3 Phrase: "alpha/beta/alpha domain"
Count: 3 Phrase: "1 domain"
Count: 2 Phrase: "type"
Count: 2 Phrase: "transcription"
Count: 2 Phrase: "transcription factor"
Count: 2 Phrase: "signal"
Count: 2 Phrase: "rfae"
Count: 2 Phrase: "rfae bifunctional"
Count: 2 Phrase: "rfae bifunctional protein, domain"
Count: 2 Phrase: "rfae bifunctional protein"
Count: 2 Phrase: "phosphatase domain"
Count: 2 Phrase: "mycoplasmal"
Count: 2 Phrase: "mycoplasmal mg032/mg096/mg288"
Count: 2 Phrase: "moea"
Count: 2 Phrase: "mg032/mg096/mg288"
Count: 2 Phrase: "kinase"
Count: 2 Phrase: "iii"
Count: 2 Phrase: "gtp1/obg"
Count: 2 Phrase: "gtp-binding"
Count: 2 Phrase: "glycoside"
Count: 2 Phrase: "glycoside hydrolase"
Count: 2 Phrase: "elongation"
Count: 2 Phrase: "elongation factor"
Count: 2 Phrase: "domain iv"
Count: 2 Phrase: "domain ii"
Count: 2 Phrase: "domain 7"
Count: 2 Phrase: "domain 6"
Count: 2 Phrase: "domain 5"
Count: 2 Phrase: "domain 4"
Count: 2 Phrase: "dehydrogenase/pnt"
Count: 2 Phrase: "catalytic"
Count: 2 Phrase: "bromodomain"
Count: 2 Phrase: "bifunctional"
Count: 2 Phrase: "bifunctional protein, domain"
Count: 2 Phrase: "bifunctional protein"
Count: 2 Phrase: "alpha"
Count: 2 Phrase: "alanine"
Count: 2 Phrase: "alanine dehydrogenase/pnt"
Count: 2 Phrase: "acid"
Count: 2 Phrase: "2"
Count: 2 Phrase: "1"
This list is a nice representation of the original list of terms, but it is still not what we want. For example, the string "domain" appears in several of these phrases: "domain", "rpb1, domain", "rpb2, domain", "polymerase rpb1, domain", "polymerase rpb2, domain", "binding domain" and "carbohydrate binding domain". Clearly, if we wish to recommend a useful subset of phrases, this list is confusing because it contains multiple phrases that originate from the same term. To fix this, we go to the second step.
The second step is to eliminate from this list those phrases that are themselves frequently occurring sub-phrases. That is, consider the list of phrases as a list of terms and apply the phrase-counting algorithm to find frequently occurring phrases.
We can apply this phrase-finding algorithm to the above list of found phrases to produce this histogram of sub-phrases:
Count: 33 Phrase: "domain"
Count: 10 Phrase: "polymerase"
Count: 7 Phrase: "protein"
Count: 6 Phrase: "rpb2"
Count: 6 Phrase: "rpb1"
Count: 6 Phrase: "rna"
Count: 6 Phrase: "binding"
Count: 6 Phrase: "bifunctional"
Count: 5 Phrase: "rna polymerase"
Count: 4 Phrase: "rfae"
Count: 4 Phrase: "polymerase rpb2"
Count: 4 Phrase: "polymerase rpb1"
Count: 4 Phrase: "plant-specific"
Count: 4 Phrase: "carbohydrate"
Count: 4 Phrase: "binding domain"
Count: 4 Phrase: "bifunctional protein"
Count: 4 Phrase: "alpha/beta/alpha"
Count: 3 Phrase: "uncharacterized"
Count: 3 Phrase: "rpb2, domain"
Count: 3 Phrase: "rpb1, domain"
Count: 3 Phrase: "rfae bifunctional"
Count: 3 Phrase: "protein, domain"
Count: 3 Phrase: "phosphoglucomutase/phosphomannomutase"
Count: 3 Phrase: "factor"
Count: 3 Phrase: "carbohydrate binding"
Count: 2 Phrase: "uncharacterized plant-specific"
Count: 2 Phrase: "transcription"
Count: 2 Phrase: "rna polymerase rpb2"
Count: 2 Phrase: "rna polymerase rpb1"
Count: 2 Phrase: "rfae bifunctional protein"
Count: 2 Phrase: "polymerase rpb2, domain"
Count: 2 Phrase: "polymerase rpb1, domain"
Count: 2 Phrase: "plant-specific domain"
Count: 2 Phrase: "phosphoglucomutase/phosphomannomutase alpha/beta/alpha"
Count: 2 Phrase: "phosphatase"
Count: 2 Phrase: "mycoplasmal"
Count: 2 Phrase: "mg032/mg096/mg288"
Count: 2 Phrase: "hydrolase"
Count: 2 Phrase: "homeodomain"
Count: 2 Phrase: "glycoside"
Count: 2 Phrase: "erf1"
Count: 2 Phrase: "elongation"
Count: 2 Phrase: "dimerisation"
Count: 2 Phrase: "dehydrogenase/pnt"
Count: 2 Phrase: "carbohydrate binding domain"
Count: 2 Phrase: "bifunctional protein, domain"
Count: 2 Phrase: "alpha/beta/alpha domain"
Count: 2 Phrase: "alanine"
These 48 sub-phrases occur frequently in the list of phrases. Any phrase that is identical to one of these recurring sub-phrases (not one that merely contains it) ought to be removed. The list of filtered phrases is:
13 occurrences of subdomain
6 occurrences of rna polymerase rpb2, domain
6 occurrences of rna polymerase rpb1, domain
6 occurrences of domain 2
5 occurrences of domain 3
5 occurrences of domain 1
5 occurrences of dehydrogenase
4 occurrences of predicted
4 occurrences of n-terminal
4 occurrences of domain,
4 occurrences of c-terminal
3 occurrences of uncharacterized plant-specific domain
3 occurrences of phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain
3 occurrences of homeodomain protein
3 occurrences of erf1 domain
3 occurrences of domain i
3 occurrences of dimerisation domain
3 occurrences of carbohydrate binding domain,
3 occurrences of binding domain,
3 occurrences of bacterial
3 occurrences of 1 domain
2 occurrences of type
2 occurrences of transcription factor
2 occurrences of signal
2 occurrences of rfae bifunctional protein, domain
2 occurrences of phosphatase domain
2 occurrences of mycoplasmal mg032/mg096/mg288
2 occurrences of moea
2 occurrences of kinase
2 occurrences of iii
2 occurrences of gtp1/obg
2 occurrences of gtp-binding
2 occurrences of glycoside hydrolase
2 occurrences of elongation factor
2 occurrences of domain iv
2 occurrences of domain ii
2 occurrences of domain 7
2 occurrences of domain 6
2 occurrences of domain 5
2 occurrences of domain 4
2 occurrences of catalytic
2 occurrences of bromodomain
2 occurrences of alpha
2 occurrences of alanine dehydrogenase/pnt
2 occurrences of acid
2 occurrences of 2
2 occurrences of 1
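The second step can be sketched roughly as follows. The helper `count_phrases` is a simplified stand-in for the step-one counting: it only lower-cases and strips commas, which is an assumption made to keep the example short. `filter_phrases` is likewise a hypothetical name.

```python
from collections import Counter


def count_phrases(items, min_count=2):
    """Simplified stand-in for step one: count each phrase once per item,
    with lower-casing and comma stripping as the only cleaning."""
    counts = Counter()
    for item in items:
        words = item.lower().split()
        seen = {
            " ".join(words[i:j]).strip(",")
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)
        }
        seen.discard("")
        counts.update(seen)
    return {phrase: n for phrase, n in counts.items() if n >= min_count}


def filter_phrases(first_pass):
    """Drop every phrase that IS a recurring sub-phrase of the phrase list.

    `first_pass` maps phrase -> count from step one. Phrases that merely
    contain a recurring sub-phrase are kept.
    """
    sub_phrases = count_phrases(first_pass.keys())
    return {p: n for p, n in first_pass.items() if p not in sub_phrases}


first_pass = {
    "domain": 85, "subdomain": 13, "rna": 12, "rna polymerase": 12,
    "polymerase": 12, "binding": 6, "binding domain": 5,
}
print(filter_phrases(first_pass))
# {'subdomain': 13, 'rna polymerase': 12, 'binding domain': 5}
```

In this small example "domain", "rna", "polymerase" and "binding" each recur inside the phrase list and are removed, while the longer phrases that contain them survive.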
Given such a list of phrases, those that contain "domain" are:
13 occurrences of subdomain
6 occurrences of rna polymerase rpb2, domain
6 occurrences of rna polymerase rpb1, domain
6 occurrences of domain 2
5 occurrences of domain 3
5 occurrences of domain 1
4 occurrences of domain,
3 occurrences of uncharacterized plant-specific domain
3 occurrences of phosphoglucomutase/phosphomannomutase alpha/beta/alpha domain
3 occurrences of homeodomain protein
3 occurrences of erf1 domain
3 occurrences of domain i
3 occurrences of dimerisation domain
3 occurrences of carbohydrate binding domain,
3 occurrences of binding domain,
3 occurrences of 1 domain
2 occurrences of rfae bifunctional protein, domain
2 occurrences of phosphatase domain
2 occurrences of domain iv
2 occurrences of domain ii
2 occurrences of domain 7
2 occurrences of domain 6
2 occurrences of domain 5
2 occurrences of domain 4
2 occurrences of bromodomain
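Picking out the phrases that contain the query string is a plain substring test. In the sketch below, `suggest` is a hypothetical name. Ordering by descending count with alphabetical tie-breaking is an assumption, so ties may come out in a different order than in the listing above.

```python
def suggest(filtered, query, limit=None):
    """Return (count, phrase) pairs whose phrase contains `query`,
    most frequent first; ties are broken alphabetically (an assumption)."""
    query = query.lower()
    matches = [(n, p) for p, n in filtered.items() if query in p]
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return matches[:limit]


filtered = {
    "subdomain": 13, "domain 2": 6, "dehydrogenase": 5,
    "bromodomain": 2, "kinase": 2,
}
for count, phrase in suggest(filtered, "domain"):
    print(count, "occurrences of", phrase)
# 13 occurrences of subdomain
# 6 occurrences of domain 2
# 2 occurrences of bromodomain
```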
So "subdomain" is possibly what a user might be searching for. Or maybe it's "rna polymerase rpb2, domain", which occurs in six terms.
I've found a really good use for this. Maybe you will too!
https://michal.guerquin.com/phrases.html, updated 2006-02-20 03:11 EST