Some of the most useful applications of molecular information theory have come from studies of binding sites (typically protein recognition sites in DNA or RNA) recognized by the same macromolecule, which typically contain similar but non-identical sequences. Because average information measures the choices made by the system, the theory can comprehensively model the range of sequence variation present in nucleic acid sequences that are recognized by individual proteins or multi-subunit complexes.3

Treating a discrete information source (e.g., telegraphy or DNA sequences) as a Markov process, Shannon defined entropy ($H$) to measure how much information is generated by such a process. The information source generates a series of symbols belonging to an alphabet of size $J$ (e.g., 26 English letters or 4 nucleotides). If symbols are generated according to a known probability distribution $p$, the entropy function $H(p_1, p_2, \ldots, p_J)$ can be evaluated. The units of $H$ are bits, where one bit is the amount of information necessary to select one of two possible states or choices. In this section we describe several important concepts regarding the use of entropy in genomic sequence analysis.

Entropy is a measure of the average uncertainty of symbols or outcomes. Given a random variable $X$ with a set of possible symbols or outcomes $A_X = \{a_1, a_2, \ldots, a_J\}$, having probabilities $\{p_1, p_2, \ldots, p_J\}$, with $P(x = a_i) = p_i$, $p_i \ge 0$ and $\sum_{x \in A_X} P(x) = 1$, the Shannon entropy of $X$ is defined by

$$H(X) = \sum_{x \in A_X} P(x) \log_2 \frac{1}{P(x)} \tag{1}$$

Two important properties of the entropy function are
(a) $H(X) \ge 0$, with equality if and only if $P(x) = 1$ for one $x$; and (b) entropy is maximized if $P(x)$ follows the uniform distribution.

Here the uncertainty or surprisal, $h(x)$, of an outcome $x$ is defined by

$$h(x) = \log_2 \frac{1}{P(x)} \ \text{(bits)} \tag{2}$$

For example, given a DNA sequence, we say each position corresponds to a random variable $X$ with values $A_X = \{A, C, G, T\}$, having probabilities $\{p_a, p_c, p_g, p_t\}$, with $P(x = A) = p_a$, $P(x = C) = p_c$ and so forth. Suppose the probability distribution $P(x)$ at a position of a DNA sequence is $P(x = A) = 1/2$; $P(x = C) = 1/4$; $P(x = G) = 1/8$; $P(x = T) = 1/8$. The uncertainties (surprisals) in this case are $h(A) = 1$, $h(C) = 2$, $h(G) = h(T) = 3$ (bits). The entropy is the average of the uncertainties: $H(X) = E[h(x)] = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3) = 1.75$ bits.

In a study of genomic DNA sequences, Schmitt and Herzel (1997) found that genomic DNA sequences are closer to completely random sequences than to written text, suggesting that higher-order interdependencies between neighboring or adjacent sequence positions contribute little to the block entropy.
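
The single-position example with P(A) = 1/2, P(C) = 1/4, P(G) = P(T) = 1/8 can be checked numerically. The following is a minimal Python sketch, not code from the original work: the function names `surprisal` and `entropy` and the dictionary `position_probs` are hypothetical, and terms with zero probability are skipped under the usual convention that they contribute nothing to the sum.

```python
import math


def surprisal(p):
    """Surprisal h(x) = log2(1 / P(x)) in bits, for a probability p > 0."""
    return math.log2(1.0 / p)


def entropy(probs):
    """Shannon entropy H(X) = sum of P(x) * log2(1 / P(x)), in bits.

    `probs` maps each symbol to its probability. Zero-probability
    symbols are skipped (assumed convention: they contribute 0).
    """
    return sum(p * surprisal(p) for p in probs.values() if p > 0)


# Hypothetical name for the distribution used in the worked example.
position_probs = {"A": 1 / 2, "C": 1 / 4, "G": 1 / 8, "T": 1 / 8}

if __name__ == "__main__":
    for base, p in position_probs.items():
        print(base, surprisal(p), "bits")
    print("H(X) =", entropy(position_probs), "bits")

    # Property (b): the uniform distribution over 4 nucleotides
    # gives the maximum, log2(4) = 2 bits.
    uniform = {base: 1 / 4 for base in "ACGT"}
    print("H(uniform) =", entropy(uniform), "bits")
```

Under these assumptions the script reproduces the surprisals of 1, 2, 3 and 3 bits and the entropy of 1.75 bits, which lies below the 2-bit maximum reached by the uniform distribution over four nucleotides.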