Citation
Computational Methods for the Design and Selection of Hybridization Probes

Material Information

Title:
Computational Methods for the Design and Selection of Hybridization Probes
Creator:
RAGLE, MICHELLE A. ( Author, Primary )
Copyright Date:
2008

Subjects

Subjects / Keywords:
Algorithms ( jstor )
Bioinformatics ( jstor )
Datasets ( jstor )
DNA ( jstor )
Hamming distances ( jstor )
Heuristics ( jstor )
Minimization of cost ( jstor )
Oligonucleotide probes ( jstor )
Oligonucleotides ( jstor )
Optimal solutions ( jstor )

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Michelle A. Ragle. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
5/31/2008
Resource Identifier:
659874702 ( OCLC )


Full Text


I would like to express my sincere gratitude to my advisor, Professor Panos M. Pardalos, for his guidance and support. I am grateful to my committee members, professors J. Cole Smith, Joseph Geunes, and William Hager, for providing me with valuable counsel and encouragement. Finally, I thank my family and friends who have been extremely supportive and without whom I could not have completed this work.


ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER
1 INTRODUCTION
2 COMPUTATIONAL METHODS FOR PROBE DESIGN AND SELECTION
  2.1 Introduction
  2.2 Probe Design and Selection
    2.2.1 Specificity
    2.2.2 Sensitivity
    2.2.3 Homogeneity
    2.2.4 Probe Length
    2.2.5 Probe Selection Problems
      2.2.5.1 Oligonucleotide fingerprinting
      2.2.5.2 Non-Unique probe selection problem
  2.3 Closing Remarks
3 ASYNCHRONOUS TEAMS FOR PROBE SELECTION PROBLEMS
  3.1 Introduction
  3.2 Probe Selection Problems
  3.3 Asynchronous Team
  3.4 A-Team for the Maximum Distinguishing Probe Set Problem
  3.5 A-Team for the Minimum Cost Probe Set Problem
  3.6 Implementation Issues
  3.7 Computational Experiments
    3.7.1 Instances and Test Environment
    3.7.2 Experimental Analysis for the Maximum Distinguishing Probe Set Problem
    3.7.3 Experimental Analysis for the Minimum Cost Probe Set Problem
  3.8 Closing Remarks
4 THE NON-UNIQUE PROBE SELECTION PROBLEM
  4.1 Introduction
  4.2 Problem Definition
  4.3 Data Description


    4.4.1 Heuristic Algorithms
    4.4.2 Computational Experiments
    4.4.3 Conclusions
  4.5 Optimal Cutting-Plane Method for Solution of the Non-Unique Probe Selection Problem
    4.5.1 Cutting-Plane Method
    4.5.2 Computational Experiments
    4.5.3 Conclusions
5 CONCLUSIONS AND FUTURE RESEARCH
REFERENCES
BIOGRAPHICAL SKETCH


Table
3-1 Data from Borneman et al.
3-2 Semi-random data.
3-3 Maximum distinguishing probe set problem results (binary distinguishability).
3-4 Maximum distinguishing probe set problem results (binary distinguishability) on semi-random data after twenty test runs.
3-5 Minimum cost probe set problem results (binary distinguishability) on Borneman et al. data after twenty test runs.
3-6 Minimum cost probe set problem results (binary distinguishability) on semi-random data after twenty test runs.
4-1 Target-probe incidence matrix.
4-2 Numbers of targets and candidate probes.
4-3 Comparison of results.
4-4 Impact of parameter adjustments on cutting-plane algorithm.
4-5 Comparison of results on data from Klau et al.
4-6 Comparison of results on HIV data.


Figure
2-1 Purine bases are adenine and guanine
2-2 Complementary DNA Strands
2-3 Pyrimidine bases are cytosine, thymine and uracil
2-4 Probe TTTAGTCCATGTCCTAGGACTTTC could fold back on itself
3-1 Example of an A-Team.
3-2 A-Team for the maximum distinguishing probe set problem.
3-3 A-Team for the minimum cost probe set problem.


Biologists can use hybridization to determine whether a specific DNA fragment is present in a DNA solution. We have developed several methods for the selection of probe sets to be used in hybridization experiments. The specific problems we considered are called the minimum cost probe set problem, the maximum distinguishing probe set problem and the non-unique probe selection problem. Applications of these problems include analyzing microbial communities, detecting pathogenic bacteria in foods, and detecting many human viruses such as those implicated in AIDS.

We designed teams of heuristic algorithms that work together to generate and improve solutions of the maximum distinguishing probe set and minimum cost probe set problems. Our method and results are presented. We developed two separate algorithms for the non-unique probe selection problem: one heuristic algorithm, and one exact algorithm, which is the first exact algorithm for solution of the non-unique probe selection problem without a priori elimination of probes. A detailed discussion of the algorithms and the results of our computational experiments are presented.

A number of algorithms and software programs have been developed for the design and selection of hybridization probes. In addition to presenting our probe selection algorithms, we describe and compare these methods.


In 2003, the 13-year International Human Genome Project (HGP) was completed. The goals of the project included the identification of all of the genes in human DNA (approximately 20,000-25,000 of them) and the determination of the sequences of the 3 billion base pairs that make up human DNA. According to the Department of Energy Office of Science, one of the key areas of research for the HGP was bioinformatics.

One family of problems that is addressed in molecular biology and bioinformatics involves the design and selection of oligonucleotide probes for use in hybridization experiments. Biologists can use hybridization to determine whether a specific DNA fragment is present in a DNA solution. They often use oligonucleotide probes. Oligonucleotide probes are short pieces of single-stranded DNA that have a known sequence that is complementary to the DNA fragment in question. In general, the biologist would apply the appropriate amount of heat and/or solvent to separate the strands of the unknown DNA, then put the DNA in contact with the probes and allow them to reconstitute. By observing whether a known probe hybridizes to the unknown DNA fragment, a biologist can determine the presence or absence of the complementary sequence. RNA can also be probed to see if a gene is on or off [17]. In 1979 synthetic oligonucleotide probes were introduced for use as hybridization probes [33]. Computational methods can therefore be used to design probes in silico for synthetic construction.

We present several methods we developed and results for the selection of probe sets to be used in hybridization experiments. The problems we considered are called the minimum cost probe set problem, the maximum distinguishing probe set problem and the non-unique probe selection problem. Applications of these problems include analyzing microbial communities, detecting pathogenic bacteria in foods, and detecting many human viruses such as those implicated in AIDS.

A number of algorithms and software programs have been developed for the design and selection of hybridization probes. In Chapter 2, we present a survey of these methods.


[26]. We then detail methods we developed for the selection of optimal or near-optimal probe sets. In Chapter 3 we present a technique called Asynchronous Teams, or A-Teams, to find near-optimal solutions to the maximum distinguishing probe set and minimum cost probe set problems. An A-Team is an organization of algorithms that find initial feasible solutions and then work together to improve these solutions. They communicate with each other by means of shared memories. We designed a number of heuristic algorithms for each problem. We tested our approach on both real and semi-random data and obtained exceptional results that were competitive with those obtained with current algorithms.

Non-unique probes are probes that bind to more than one target sequence. When target sequences are too closely related, it is sometimes not possible to find unique probes that will bind to a single target. Non-unique probes must often be used when a biological sample is to be tested for the presence or absence of a virus or bacteria. In Chapter 4 we define the non-unique probe selection problem and describe two methods we have developed for solution of the problem. The first method discussed is a two-phase heuristic algorithm that generates near-optimal solutions to the non-unique probe selection problem [25]. We later developed the first exact method for finding optimal solutions to the non-unique probe selection problem within practical computational limits, without the a priori elimination of candidate probes. Finally, in Chapter 5, we present our conclusions and discuss future research.


[33]. The reconstitution to double-stranded molecules is referred to as renaturation or hybridization and was first described by Marmur and Doty [24]. These discoveries and the property of sequence complementarity, which we will discuss shortly, form the basis of the methods used to analyze DNA and RNA [33].

DNA is a long polymer of nucleotides. Each nucleotide consists of a 5-carbon sugar, one phosphate group and a nitrogen-containing base attached to the sugar. The nucleotides are symbolized by four letters identifying their bases: adenine (A), cytosine (C), guanine (G), and thymine (T) [33].

Figure 2-1. Purine bases are adenine and guanine

When nucleotides are chained together by a phosphodiester bond, they form a DNA strand. The ends of the DNA strand are numbered by the carbon atom position where the next nucleotide can be attached. The 5' end contains a phosphate group. The 3' position of the nucleotide at the 3' end is free for appending to the next nucleotide. DNA sequences are always read from the 5' end to the 3' end [4].


[4]. This pairing is referred to as Watson-Crick base pairing.

Figure 2-2. Complementary DNA Strands

RNA differs from DNA in that it is almost always single-stranded, and consists of the bases adenine, cytosine, guanine and uracil. That is, the base thymine is replaced by uracil (U). Uracil is complementary to adenine [17].

Figure 2-3. Pyrimidine bases are cytosine, thymine and uracil

Biologists can use hybridization to determine whether a specific DNA fragment is present in a DNA solution. They often use oligonucleotide probes. Oligonucleotide probes are short pieces of single-stranded DNA that have a known sequence that is complementary to the DNA fragment in question. The probes have either a fluorescent


[17]. In general, the biologist would apply the appropriate amount of heat and/or solvent to separate the strands of the unknown DNA, then put the DNA in contact with the probes and allow them to reconstitute. By observing whether a known probe hybridizes to the unknown DNA fragment, a biologist can determine the presence or absence of the complementary sequence. RNA can also be probed to see if a gene is on or off [17]. In 1979 synthetic oligonucleotide probes were introduced for use as hybridization probes [33]. Throughout this chapter the term probe will be used to refer to nucleic acids of known sequence, and the term target will refer to the unknown sequence or set of sequences.

Microarrays allow for large-scale gene expression measurements. A microarray consists of a surface such as a glass slide or membrane that is spotted with DNA fragments or oligonucleotides to form an array. The value of microarrays lies in the fact that since there can be many thousand DNA molecules per array, the expression of many thousands of genes can be measured simultaneously. There are a number of microarray platforms that have been developed. Two commonly used platforms are oligonucleotide and cDNA [40].

Once a microarray is created, a solution containing a labeled target nucleic acid sample is brought in contact with the substrate. Probes on the array hybridize to the target if there is a segment of the target sample that is Watson-Crick complementary to the probe. After some time, the substrate is washed to remove unbound and weakly bound target oligonucleotides from the sample. The microarray is scanned after the remaining target molecules have been stained with a fluorophore. The expression levels are then measured based upon detection of the fluorescent signal [14, 40].

There are many applications in biology that involve the use of probes. Here we discuss a few of those applications. One such application is sequencing by hybridization (SBH).

In 1987, Drmanac and Crkvenjakov applied for a patent for a method they had developed as an alternative approach for DNA sequencing. The method involved the use of


[33]. In this procedure, microarrays are created in which each cell contains a distinct known probe. The array is then brought in contact with a solution containing many copies of the target DNA that is to be sequenced. Each of the copies will have been tagged with a fluorescent or radioactive marker. As described in Section 2.1, probes on the array hybridize to the target if there is a segment of the target that is Watson-Crick complementary to the probe. After all of the DNA copies that have not hybridized are washed off the array, the subset of probes that hybridize to the target, called the spectrum of the sequence, is identified by observing which cells of the array are tagged. The spectrum is then used to reconstruct the DNA sequence by a combinatorial sequencing algorithm [7, 12, 13].

The first SBH design required chips that contained all $4^k$ strings of length $k$. The problem was reduced to finding a Eulerian path on a directed graph. Since then, several new approaches have been proposed.

In another application diagnostic probes can be designed to detect bacterial infections. Given DNA sequences from a group of closely related pathogenic bacteria, the idea is to find a string that is a substring of each of the bacterial sequences without being a substring of the host's DNA sequence. Probes are designed to hybridize to these substring (target) sequences. The probes can then be used to detect the presence of at least one of the bacterial species [22].

Rash and Gusfield addressed the String Barcoding problem [34]. They developed a method to find a probe set of minimal cardinality for use in the detection of viral-size pathogens. We discuss the Rash and Gusfield paper in Section 2.2.5.1.

Antisense oligonucleotides are short synthetic oligonucleotides that are designed to hybridize to RNA. Once they bind to the target RNA, these oligonucleotides prevent expression of the encoded protein product. There are complicated challenges involved in


[42].

Sung and Lee cite an example in [41] in which DNA microarrays can be used to help identify the presence of alternate forms of, or an irregular expression in, a gene that results in resistance to chemotherapy.

These are just a few of the applications that involve the use of probes. It is clear that the optimal design and selection of probes serves a valuable function in many diverse areas. The focus of this chapter is on computational approaches for the design and selection of probes.

[40, 41, 43].

[40, 41].


[41].

As mentioned, specificity minimizes cross-hybridization. Probes containing low-complexity sequences (i.e. repetitive sequences) should be avoided since they are likely to cross-hybridize to other targets. Several software programs have been developed for the design of probes. RepeatMasker is software that detects repeat sequences of all types. It accepts as input a DNA or RNA sequence and returns the sequence with repeats replaced by N's. If designing oligonucleotide probes for use on a microarray, one would avoid selecting probes in regions masked with N's [40].

A homology search algorithm identifies sequences in a selected database that match part or all of a specific sequence referred to as the query sequence. BLAST (Basic Local Alignment Search Tool) is commonly used to search for similarities, but many other programs are also available. To use BLAST to check potential probes, you would input the probe's sequence as the query sequence and then select a database containing other target genes that you do not want to bind to the probe for comparison. The output produced by BLAST includes a hit list which tells the user if the input sequence is similar to a sequence contained in the selected database. The hit list has a bit score column that gives the statistical significance of the alignment. The higher the bit score is, the more similar are the two sequences. In most cases, the bit score is twice the length of the longest


[40].

Wang and Seed [45] developed a program called OligoPicker for the selection of oligonucleotide probes for protein coding sequences. In their selection scheme, they make the rejection of contiguous sequence identity the primary filter, the reason being that they hypothesize that contiguous base pairing is the single most important determinant of cross-hybridization. They use a hash table to quickly find repetitive sequence stretches of 10 to 20 mers in length. They also screen probes based upon their BLAST scores to reduce the likelihood of cross-hybridization due to global similarity. In cases where sequences cannot be represented by a single unique probe, probes are selected from regions that cross-hybridize to the smallest number of other sequences in the sample universe [45].

[40, 41]. In fact, secondary structural motifs in probes of length 20-mer can reduce the hybridization signal up to 50-fold [21].

Figure 2-4. Probe TTTAGTCCATGTCCTAGGACTTTC could fold back on itself

A DNA sequence that is identical to its reverse complement sequence, such as TGCA, is referred to as a palindrome. Dan Gusfield describes algorithms to identify palindromes in sequences in [12]. Secondary structure in both the probe and the target sequence can


[21]. Since the target sequence is often unknown, detection of target secondary structure can be considerably more challenging. A measure that is commonly used to predict the stability of secondary structure in probes is Gibbs free energy [27].

Gibbs free energy describes the energy available to do work within a system. It satisfies the equation $G = H - TS$, where $H$ denotes enthalpy, $T$ denotes temperature in degrees Kelvin and $S$ denotes entropy. Enthalpy is defined to be the sum of the internal energy plus the product of the pressure and volume, and entropy is a measure of disorder or randomness. Both enthalpy and entropy are state functions. For any process at constant pressure and temperature, the change in free energy is given by $\Delta G = \Delta H - T \Delta S$ [4]. The nearest neighbor model is one method that is used to calculate free energy. The method is considered to be computationally feasible while still being sufficiently accurate.

Sung and Lee give preference to probes with the highest free energy in their Find-Probe algorithm to ensure that the probes have minimal secondary structure. Their sensitivity filter eliminates probes determined to have secondary structure. This determination is made in a single pass of each probe where they check each length-x 3' end of the probe to ensure that there are no more than y consecutive complementaries with the 5' end of the probe [41].

A program called mfold is available to predict secondary structure of DNA and RNA molecules from their sequence. It calculates thermodynamic properties using the base-stacking model. Mfold accepts as input a DNA or RNA sequence and returns, among other information, the free energy calculations and drawings of folded input molecules [40].

Li and Stormo [23] developed a program called ProbeSelect to design and select optimal DNA probes for gene expression arrays. Their approach works in two major steps. The first step consists of finding, for each gene, a set of candidate probes that will maximize the minimum number of mismatches to every other gene in the genome. The second step involves selection based upon free energy. Optimal probes are selected so that


[23].

In 2006, Pozhitkov et al. published results of experiments on eukaryotic target sequences in which they compared current approaches to predicting fluorescent signal intensities to actual intensities. Their results did not support the use of thermodynamic properties to accurately predict signal intensity values of duplexes with rRNAs on oligonucleotide DNA microarrays. Based upon their results, they recommended that thermodynamic criteria not be used for the design of oligonucleotide probes for species identification. They suggested instead that each probe be empirically verified in order to provide the best signal intensities [32].

[41]. The melting temperature, denoted $T_m$, is the temperature at which half the strands are in the double-helical state and half are in the random-coil state [37]. The salt concentration of the solution and the base composition of the DNA both affect the melting temperature. DNA with many G-C pairs (high GC content) has a higher melting temperature than DNA with more A-T pairs. G-C pairs have three hydrogen bonds while A-T pairs have two hydrogen bonds [27, 28].

For self-complementary oligonucleotide duplexes, $T_m$ is predicted using the nearest neighbor model as reported by SantaLucia in [37] and [38] by the equation: $T_m = \Delta H^{\circ}$ … [37, 38].


[41] calculate the melting temperature of each probe $p$ using a formula that is based on nearest-neighbor parameters but has a slight variation.

$$T_m(p) = \frac{\Delta H(p)}{\Delta S(p) + R \log(C_T)} - 273.15 + 16.6 \log(\mathrm{Na}^+)$$

where $\mathrm{Na}^+$ is the salt concentration of the solution. $C_T$ is replaced by $C_T$ … where $C_F$ is the formamide concentration of the solution. Finally, Sung and Lee require that the content of any single base not exceed 50% [41].

Wang and Seed [45] require that the melting temperature of all probes fall within a given narrow range. They calculate the median melting temperature using the formula

$$\text{Median} = 64.0 + \frac{41 \cdot \text{gcCount}}{\text{oligoLength}} - \frac{600}{\text{oligoLength}}$$

[45].

In [40] it is noted that for determining the melting temperature and the stability of a nucleic acid duplex, the base-stacking model, which bases calculations on each base pair, supersedes the use of base composition. The base-stacking model does implicitly include base composition (GC content), but it is more complex since it also considers the order of the bases in the sequence.
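The base-composition quantities used by these filters are simple to compute. The following is a minimal sketch, not code from any of the cited programs: the function names are illustrative, and the melting temperature estimate assumes the GC-based formula reads 64.0 + 41·gcCount/oligoLength − 600/oligoLength, which should be checked against [45] before use.

```python
from collections import Counter


def gc_count(probe: str) -> int:
    """Number of G and C bases in the probe."""
    probe = probe.upper()
    return probe.count("G") + probe.count("C")


def max_single_base_fraction(probe: str) -> float:
    """Largest fraction of the probe made up of any one base."""
    counts = Counter(probe.upper())
    return max(counts.values()) / len(probe)


def passes_base_content_filter(probe: str, limit: float = 0.5) -> bool:
    """Base-content check in the spirit of Sung and Lee:
    no single base may make up more than `limit` of the probe."""
    return max_single_base_fraction(probe) <= limit


def gc_based_tm_estimate(probe: str) -> float:
    """ASSUMED form of the GC-based estimate:
    64.0 + 41 * gcCount / oligoLength - 600 / oligoLength."""
    n = len(probe)
    return 64.0 + 41.0 * gc_count(probe) / n - 600.0 / n


if __name__ == "__main__":
    probe = "ACGT" * 5  # hypothetical 20-mer with 10 G/C bases
    print(gc_count(probe), passes_base_content_filter(probe))
    print(gc_based_tm_estimate(probe))
```

A homogeneity filter would then keep only probes whose estimate falls inside a chosen narrow window; the window itself is a design parameter that this sketch does not fix.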


[43].

Long probes tend to have more reliable hybridization properties, but the greater length increases the chance of non-specific cross-hybridization. Probes of length 50-60 show an improved specificity and sensitivity when compared to shorter probes; however, they may not reliably distinguish single base mismatches. In short, the intended application must be considered when determining the best probe length [43].

Chou et al. have studied the effect of varying probe length and the number of probes per gene on microarray analysis. Their results suggest that probes of length approximately 150 mer are optimal for accurate microarray measurement of gene expression. They also found that both short and long probes worked well if either multiple probes were used for each gene or the probes were selected via validation by experimental hybridization [3].

2.2.5.1 Oligonucleotide fingerprinting

[15]. In this section we discuss several papers whose focus is the selection of probe sets for this application.

Borneman et al. [2] discussed an application of microarrays in which a single probe is applied to a DNA microarray containing a large sample of rDNA sequences from the population being studied. This is quite different from the procedure discussed in Section 2.1 where a microarray contains a large number of probes. One use of this method is in the analysis of microbial communities where multiple experiments are performed using a single probe for each experiment. Clearly the number of experiments and, as a result, the cost, is directly related to the number of probes used [2].


[2], the clones are approximately of length 1500 and the probes are of length 6 to 10.

A probe $p$ is said to distinguish a pair of clones $c$ and $d$ if $p$ is a substring of exactly one of $c$ or $d$. The goal is to select a minimal cardinality set of probes such that each distinct pair of clones in $C$ is distinguished by at least one probe in $S$ [2].

For a given set $S$ of $m$ probes, the $S$-fingerprint of a clone $c$, which is denoted $\text{fingerprint}_S(c)$, is a vector of length $m$ which contains one entry for each probe $p$. Each entry denotes the minimum of a given integral value $R$ and the number of occurrences of the probe $p$ in $c$. Note that if $R = 1$, then the fingerprint will just be a binary vector with a one in position $i$ if probe $p_i$ is a substring of clone $c$, and a zero otherwise.

A set of probes $S$ is said to distinguish two clones $c$ and $d$ if $\text{fingerprint}_S(c) \neq \text{fingerprint}_S(d)$. We denote by $\Delta_S \subseteq C^2$ the set of pairs of clones that are distinguished by the set $S$, and so $|\Delta_S|$ denotes the number of distinct pairs of clones distinguished by the probe set $S$ [2].

As noted above, the rDNA sequences of the population being studied are unknown. Therefore choosing probes to distinguish pairs of clones can be a problem. To overcome this problem, a subset $C'$ of rDNA clones is randomly selected from the population. A probe set is then found for the subset $C'$ and used to analyze the population [2].

Two variations of the probe selection problem are considered by Borneman et al. [2]. They are referred to as the Maximum Distinguishing Probe Set (MDPS) and the Minimum Cost Probe Set (MCPS) problems and are defined as follows. In the maximum distinguishing probe set problem (MDPS), the problem instance is comprised of a set $C = \{c_1, c_2, \ldots, c_m\}$ of clones, a set $P = \{p_1, p_2, \ldots, p_n\}$ of probes, and an integer $k$. The solution consists of a subset $S \subseteq P$, with $|S| = k$ where $|\Delta_S|$ is maximized [2]. The problem instance for the minimum cost probe set problem (MCPS) is comprised of a set


[2].

Both problems MDPS and MCPS are NP-hard when the length of probes is unbounded [2]. Note that MDPS and MCPS are duals of one another. MCPS is a special case of the Set Cover problem.

Borneman et al. [2] applied a simulated annealing algorithm to the MDPS problem and a Lagrangian relaxation algorithm to the MCPS problem. Both methods produced successful results. In Chapter 3, we will discuss our approach to the MDPS and MCPS problems where we applied a method called Asynchronous Teams. Our approach produced results that were competitive with those obtained by Borneman et al.

In 2005, Fu, Borneman, Ye and Chrobak developed an improved probe selection method [8]. They found that probe sets selected with the currently used method were theoretically optimal; however, in actual biological experiments probes often did not hybridize in a consistent and predictable manner. They referred to these probes as unreliable.

Two common errors in hybridization experiments are false negatives, where probes are expected to bind with a clone but the signal intensity is low, and false positives, where probes are expected not to bind with a clone but the signal intensity is high. The occurrence of both types of errors may be related to the location of the probe target sites. The basic idea of the approach of Fu et al. lies in the hypothesis that the reliability of the probe is related to its location in the clone. Their method used a probabilistic model to identify unreliable probes and eliminate them. Their results showed that application of this method significantly decreased the number of unreliable probes; in fact 90.9% of unreliable probes can be eliminated [8].

Herwig et al. [15] proposed the design of probe sets for simultaneous identification of sequences by oligonucleotide fingerprinting based upon maximization of Shannon entropy. Given a set of $M$ sequences, $s_1, \ldots, s_M$, and a single probe $p$, the set of sequences


[15], the amount of information of a probe with respect to a set of sequences is measured by

$$I = -\sum_{i=1}^{N} p_i \log_2 p_i$$

[15].

In [34], Rash and Gusfield considered the String Barcoding problem and its application in selecting probes for hybridization experiments used to identify viral-size pathogens. The barcode they refer to is equivalent to a binary oligonucleotide fingerprint. The problem they consider is as follows. Given a set $S$ of $m$ strings, $S = \{s_1, \ldots, s_m\}$, find a set of substrings $P = \{p_1, \ldots, p_n\}$ of minimum cardinality such that every pair of strings in $S$ has at least one substring in $P$ that distinguishes the pair. Once $P$ is found, a binary fingerprint or barcode, as described at the beginning of this section, is associated with each string in $S$. For a given string $s_i \in S$, the set of strings from $P$ that are substrings of $s_i$ is called the signature of $s_i$. Both the barcode and the signature should be unique for each string [34].

In their approach to the String Barcoding problem, Rash and Gusfield construct and solve an integer linear program (ILP) that contains one binary variable for each substring and one inequality for each pair of strings in $S$. Let $i = 1, \ldots, q$ denote the pairs of the set $S$. The variable $v_j$ is included in each equation where $p_j$ distinguishes a given string pair.
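The structure of this formulation can be illustrated on a toy instance. The sketch below is an assumption-laden illustration rather than the method of [34]: it builds, for each pair of strings, the set of candidate substrings that distinguish the pair (one inequality per pair), and then replaces the ILP solver with an exhaustive search that is only practical for very small inputs. All function names are hypothetical.

```python
from itertools import combinations


def substrings(s, min_len, max_len):
    """All substrings of s with length between min_len and max_len."""
    return {s[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(s) - n + 1)}


def distinguishing_sets(strings, candidates):
    """One entry per pair of strings: the indices j of the candidates
    that occur in exactly one string of the pair."""
    rows = []
    for a, b in combinations(strings, 2):
        rows.append({j for j, p in enumerate(candidates)
                     if (p in a) != (p in b)})
    return rows


def min_barcode_bruteforce(strings, candidates):
    """Smallest candidate subset that hits every pair's distinguishing set.
    Exhaustive search stands in for the ILP solver (toy inputs only)."""
    rows = distinguishing_sets(strings, candidates)
    for size in range(len(candidates) + 1):
        for chosen in combinations(range(len(candidates)), size):
            picked = set(chosen)
            if all(row & picked for row in rows):
                return [candidates[j] for j in chosen]
    return None  # some pair cannot be distinguished by any candidate


if __name__ == "__main__":
    strings = ["GGATTCAA", "GGATATGGA", "AGTATAG"]
    candidates = sorted(set().union(*(substrings(s, 3, 3) for s in strings)))
    print(min_barcode_bruteforce(strings, candidates))
```

Each row returned by `distinguishing_sets` corresponds to one covering inequality, so handing the same rows to an ILP solver instead of the exhaustive loop would be the natural next step for realistic instance sizes.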


$$\text{minimize} \quad \sum_{j=1}^{N} v_j$$

Rash and Gusfield use suffix trees [12] to construct the initial set of candidate substrings. By using suffix trees instead of enumerating all possible distinct substrings, they begin with a much smaller set of substrings, i.e. considerably fewer variables in the ILP. They further reduce the number of variables by filtering out substrings that do not satisfy the valid length requirements [34].

Rash and Gusfield also addressed problems that may occur if the virus mutates in nature after a signature is developed and the mutated version must still be identified. To do so, they added two sets of constraints. In the first set of constraints, instead of requiring that each pair of strings in $S$ be distinguished by at least one substring in $P$, they require a minimum of $r$ substrings for each pair. That is,

$$\sum_{j \in T_i} v_j \geq r \quad \text{for all } i.$$

The second set of constraints enforced a minimum edit distance between pairs of strings. So, for each pair of substrings $p_i$ and $p_j$ that are not the minimum edit


is added for each such pair. They determined that edit distances as small as two or four resulted in added confidence that mutations in nature would not invalidate the signatures. They solved the resulting ILP using CPLEX [34].

In [19], Klau et al. consider the selection of non-unique probes for use in experiments to detect the presence or absence of viruses or bacteria in a biological sample. They begin with a zero-one target-probe incidence matrix $H = [h_{ij}]$ where $h_{ij} = 1$ if probe $j$ hybridizes to target $i$, and $h_{ij} = 0$ otherwise. The goal is to find a probe set of minimum size such that every target is covered by at least $c_{min}$ probes and every target pair is distinguished by at least $h_{min}$ probes. They find an initial solution using a heuristic algorithm. They then construct an ILP and use CPLEX to solve the ILP, thus further reducing the size of the probe set. In the description of an ILP which they refer to as the slave ILP, they propose a branch-and-cut approach in which constraints are added to enforce group separation in addition to pairwise separation. The results presented in the paper are, however, strictly for pairwise separation. Finally, they evaluate their results using decoding software [19]. In Chapter 4, we will discuss two algorithms we developed to select non-unique probes for the detection of bacteria or viruses [25].

Gasieniec et al. address probe selection in microarray design in [10]. In their approach they first search for unique probes for an input set of target sequences. If it is not possible to find a unique probe for a given sequence, they search for a small collection of probes that together uniquely identify the sequence. They start by first using a filtering process



[2, 44].

The design and selection of probe sets plays a vital role in hybridization experiments. In this chapter we propose the use of the Asynchronous Team (A-Team) technique to determine near-optimal probe sets for given sets of clones. We focus our attention on two probe selection problems called the Maximum Distinguishing Probe Set (MDPS) and the Minimum Cost Probe Set (MCPS) [2].

The A-Team method was proposed by Souza and Talukdar [6]. An A-Team is comprised of several different heuristic algorithms, called agents, that communicate with each other by means of shared memories. The shared memories store solutions generated by agents. Each agent can make its own decisions about inputs, scheduling and resource allocation. The A-Team method lends itself well to the solution of combinatorial problems.

The remainder of the chapter is organized as follows. In the next section we describe the problems in detail with examples. In Section 3.3, we give a brief description of the A-Team technique. Sections 3.4 and 3.5 outline the A-Team for MDPS and MCPS respectively, with detailed algorithms. In Section 3.6, we discuss issues related to the implementation. Computational results are given and discussed in Section 3.7, and closing remarks are stated in Section 3.8.


Some of the following definitions were discussed in Chapter 2. They are repeated here for the convenience of the reader. A probe $p$ is said to distinguish a pair of clones $c$ and $d$ if $p$ is a substring of exactly one of $c$ or $d$. In some applications clones have length approximately 1500 and probes have length between 6 and 10. In a hybridization experiment the fluorescence response is linear with respect to the number of occurrences of the probe in a clone up to a certain threshold $R$. Because of this, there are different versions of the distinguishability criteria [2].

Two cases, $R = 1$ (binary) and $R = 4$ (non-binary), are considered in the following example. Let $C = \{$GGATTCAA, GGATATGGA, AGTATAG$\}$ and $P = \{$ATT, GGA, ATA, GAT, AGT$\}$. If $R = 1$, then $S = \{$ATT, AGT$\}$ is a smallest set of probes such that two distinct clones $c$ and $d$ from $C$ are distinguished by at least one probe in $S$. That is, probe ATT distinguishes clones GGATTCAA and GGATATGGA, and clones GGATTCAA and AGTATAG, but it does not distinguish clones GGATATGGA and AGTATAG. The probe AGT distinguishes clones GGATATGGA and AGTATAG. If $R = 4$, then $S = \{$GGA$\}$ is a smallest set of probes that distinguishes any two distinct clones in $C$.

Let $\text{occ}(p, c)$ denote the number of occurrences of probe $p$ in clone $c$. Given a finite set $S$ of probes, the $S$-fingerprint of clone $c$, denoted by $\text{fingerprint}_S(c)$, is the vector of values $\min\{R, \text{occ}(p, c)\}$ over all $p \in S$. Let $C$ and $P$ be defined as above, and $R = 1$. If $S = \{$CCT, AAA$\}$, then $\text{occ}($CCT, AAACCTGA$) = 1$, $\text{occ}($AAA, AAACCTGA$) = 1$, $\text{fingerprint}_S($AAACCTGA$) = [1, 1]$. Similarly, $\text{fingerprint}_S($AAACATAAA$) = [0, 1]$, $\text{fingerprint}_S($ACTAACG$) = [0, 0]$.
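The definitions above translate directly into code. The following is a minimal sketch with illustrative function names; it assumes that overlapping occurrences of a probe are counted separately, which the text does not state explicitly.

```python
def occ(probe, clone):
    """Number of occurrences of probe in clone.
    Assumption: overlapping occurrences are counted separately."""
    count = 0
    start = clone.find(probe)
    while start != -1:
        count += 1
        start = clone.find(probe, start + 1)
    return count


def fingerprint(probes, clone, R=1):
    """S-fingerprint of a clone: min(R, occ(p, clone)) for each probe p,
    in the order the probes are listed."""
    return [min(R, occ(p, clone)) for p in probes]


def distinguishes_all(probes, clones, R=1):
    """True if every pair of distinct clones has different fingerprints."""
    prints = [tuple(fingerprint(probes, c, R)) for c in clones]
    return len(set(prints)) == len(prints)


if __name__ == "__main__":
    C = ["GGATTCAA", "GGATATGGA", "AGTATAG"]
    print(fingerprint(["CCT", "AAA"], "AAACCTGA"))   # binary case, R = 1
    print(distinguishes_all(["ATT", "AGT"], C, R=1))
    print(distinguishes_all(["GGA"], C, R=4))
```

With the clone set from the example, the single probe GGA separates all three clones only in the non-binary case, because two of the clones contain it a different number of times.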


We studied two probe selection problems called the maximum distinguishing probe set problem (MDPS) and the minimum cost probe set problem (MCPS). In the MDPS problem, a set $C = \{c_1, c_2, \ldots, c_m\}$ of clones, a set $P = \{p_1, p_2, \ldots, p_n\}$ of probes, and an integer $k$ are given. The objective is to select a subset $S \subseteq P$, with $|S| = k$ so that $|\Delta_S|$ is maximized [2]. Note that $|S|$ represents the number of probes contained in the solution set $S$, and $|\Delta_S|$ represents the number of pairs of clones distinguished by $S$. The minimum cost probe set problem (MCPS) is a special case of the set covering problem where we are given a set $C = \{c_1, c_2, \ldots, c_m\}$ of clones and set $P = \{p_1, p_2, \ldots, p_n\}$ of probes. The objective is to find a subset $S \subseteq P$ such that $\Delta_S = C^2$ and $|S|$ is minimized [2].

Both problems MDPS and MCPS are NP-hard when the length of probes is unbounded [2]. Borneman et al. [2] applied a Lagrangian relaxation based heuristic to the MCPS problem, and a Simulated Annealing algorithm to the MDPS problem.

[6] and has been applied successfully to the Traveling Salesman Problem [6], Flow Shop Scheduling Problem, Job Shop Scheduling Problem, Set Covering Problem, and Point-to-Point Connection Problem [11].

In Figure 3-1 an A-Team is shown, where arrows and rectangles represent, respectively, agents and shared memories. Agents can read from and write to shared memories. Shared


3-1 Agent A reads from Memory 1 and writes to Memory 1; Agent B reads from Memory 1 and writes to Memory 2. Agent F reads from Memory 1 and writes to both Memories 2 and 3. While Agent G is responsible for populating Memories 2 and 3 with solutions, Agent H is responsible for the elimination of solutions from Memories 2 and 3.

Figure 3-1. Example of an A-Team.

An A-Team is characterized predominantly by three features:

instance by selecting a random k-subset S from P.

1. Create a random k-subset S from P.
2. Compute S-fingerprint of each clone in C.
3. Return S, ObjectiveFunctionValue(S).


A-Team for the maximum distinguishing probe set problem.

Algorithms 1-4 detail the functions required for Agent C1. Algorithm 3 gives a naive implementation of the calculation of the objective function value, $|\Delta_S|$, where the S-fingerprints for each pair of distinct clones are compared in $O(m^2 k)$ time. In Section 3.6 we give an $O(mk)$ time algorithm for computing $|\Delta_S|$ [2].

1. Randomly select r solutions, say $S_i$ for $i = 1, \ldots, r$, from Memory M1.
2.
3. Return $S'$.

1. Randomly select two solutions, say $S_1$ and $S_2$, from Memory M1.
2. If the objective function value of $S_1$ is better than the objective function value of $S_2$, then $S' \leftarrow S_1 \setminus S_2$; else $S' \leftarrow S_2 \setminus S_1$.
3. Return $S'$.

1. Select a solution, say $S_1$, from Memory M2.
2.
3. Randomly select $k - t$ probes from P, say $p'_1, p'_2, \ldots, p'_{k-t}$, such that $p'_1, p'_2, \ldots, p'_{k-t}$ do not belong to $S_1$.
4.
5. Return S, ObjectiveFunctionValue(S).
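The difference between the pairwise comparison and a grouped count of distinguished clone pairs can be sketched as follows. This is an illustration only, not the thesis implementation: the function names are ours, and equal fingerprints are grouped with a hash table (`collections.Counter`) rather than by sorting. The grouped count relies on the fact that each clone is distinguished from exactly those clones whose fingerprint differs from its own, so summing size * (m - size) over the fingerprint groups counts every distinguished pair twice.

```python
from collections import Counter
from itertools import combinations


def count_naive(fingerprints):
    """Compare every pair of fingerprints: O(m^2 k) for m clones and k probes."""
    return sum(1 for f, g in combinations(fingerprints, 2) if f != g)


def count_grouped(fingerprints):
    """Group equal fingerprints, then sum size * (m - size) over the groups.

    Each distinguished pair is counted twice, hence the division by 2.
    """
    m = len(fingerprints)
    group_sizes = Counter(tuple(f) for f in fingerprints).values()
    return sum(size * (m - size) for size in group_sizes) // 2


if __name__ == "__main__":
    example = [[1, 1], [0, 1], [0, 0], [0, 1]]
    print(count_naive(example), count_grouped(example))
```

Both functions return the same value on any input; only the amount of work differs once m reaches the thousands of clones used in the experiments.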




probe p, clone c
number of occurrences of probe p in clone c
$t \leftarrow 0$


1. Randomly select a solution, say $S_1$, from Memory M1.
2.
3. For each $p \in S_1$, compute the fraction of pairs from $\{\mathrm{fingerprint}_{\{p\}}(c) : c \in C\}$ that are distinct. If this fraction is greater than the parameter ratio, then insert probe p into S.
4. If $|S|$

[2], this problem is a special case of the well-known Set Covering Problem, where the universe to be covered is $C^2$ and the covering sets are the various $\Delta_p$, where $\Delta_p$ is defined as follows for a given probe p.

Using the notation introduced in Section 3.2 for $\Delta_S$, we let $\Delta_{p_j}$ be the set of pairs of clones in $C^2$ that are distinguished by the single probe $p_j \in P$. Algorithm 7 describes in detail how to compute $\Delta_{p_j}$.

To make things clear, we formally define the Set Covering Problem. We are given a ground set $E = \{e_1, \ldots, e_m\}$, subsets $S_1, \ldots, S_n \subseteq E$, and a cost $w_j$ for each subset $S_j$. The objective is to find a set $I \subseteq \{1, \ldots, n\}$ such that $\bigcup_{j \in I} S_j = E$ and $\sum_{j \in I} w_j$ is minimum. To cast the MCPS as the Set Covering Problem we set $E = C^2$, $S_j = \Delta_{p_j}$, where $p_j$ is a probe in P, and $w_{p_j} = 1$ for each probe $p_j \in P$. The task of Construction Agent C1 is to scan through the sets $\Delta_{p_j}$ and choose the one that will distinguish the largest number of pairs of clones yet to be distinguished. The algorithm details are given in Algorithm 6. Algorithm 7 describes how $\Delta_p$ is found for a given individual probe.

Figure 3-3. A-Team for the minimum cost probe set problem.
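The set covering view of the MCPS and the greedy choice made by the construction agent can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis code: the function names are ours, only binary distinguishability (R = 1) is modeled, ties are broken at random, and the loop stops once no remaining probe separates a new pair, which may happen before every pair is covered.

```python
import random
from itertools import combinations


def delta_sets(clones, probes):
    """Binary case: a probe separates a pair if it occurs in exactly one clone."""
    delta = {}
    for p in probes:
        present = [p in c for c in clones]
        delta[p] = {(a, b) for a, b in combinations(range(len(clones)), 2)
                    if present[a] != present[b]}
    return delta


def greedy_cover(delta, rng=random):
    """Repeatedly pick a probe separating the most not-yet-separated pairs."""
    remaining = {p: set(pairs) for p, pairs in delta.items()}
    chosen = []
    while any(remaining.values()):
        best = max(len(pairs) for pairs in remaining.values())
        ties = sorted(p for p, pairs in remaining.items() if len(pairs) == best)
        pick = rng.choice(ties)
        chosen.append(pick)
        covered = remaining.pop(pick)
        for p in remaining:
            remaining[p] -= covered  # drop pairs that are now distinguished
    return chosen


if __name__ == "__main__":
    clones = ["GGATTCAA", "GGATATGGA", "AGTATAG"]
    probes = ["ATT", "GGA", "ATA", "GAT", "AGT"]
    print(greedy_cover(delta_sets(clones, probes), random.Random(0)))
```

On the three-clone example of Section 3.2 every run returns two probes, which matches the size of the smallest distinguishing set given there for R = 1.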


Randomly select a solution, say $S_1$, from Memory M1
Q > ratio) then
  $S \leftarrow S \cup \{p\}$
  $i \leftarrow i + 1$
  break;
end end end
if (i = k) then Stop $\leftarrow$ true end
end
if (i

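Probe set S
while $\Delta_S \neq C^2$ do
  $L \leftarrow \{p_j : \tilde{\Delta}_{p_j} \neq \emptyset$ and $|\tilde{\Delta}_{p_j}|$ is maximum$\}$
  $p_i \leftarrow$ random(L)
  $S \leftarrow S \cup \{p_i\}$
  for j = 1 to n do
    $\tilde{\Delta}_{p_j} \leftarrow \tilde{\Delta}_{p_j} \setminus \Delta_{p_i}$
  end
end

1. Randomly select r solutions, say $S_i$ for $i = 1, \ldots, r$, from Memory M1.
2.
3. Return $S'$.

1. Randomly select two solutions, say $S_1$ and $S_2$, from Memory M1.
2. If $|S_1| > |S_2|$, then $S' \leftarrow S_1 \setminus S_2$; else $S' \leftarrow S_2 \setminus S_1$.
3. Return $S'$.

1. Select a solution, say $S_1$, from Memory M2.
2. Call agent C1 with the input parameter $S = S_1$.

section.

1. Randomly select a solution, say $S_1$, from Memory M1.
2.
3. For each $p \in S_1$, compute the fraction of pairs from $\{\mathrm{fingerprint}_{\{p\}}(c) : c \in C\}$ that are distinct. If this fraction is greater than the parameter ratio, then insert probe p into $S'$.
4. Call agent C1 with the input parameter $S = S'$.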


$\Delta_p$ (i.e., set of pairs of clones in $C^2$ that are distinguished by probe p)
Compute p-fingerprint of each clone in C

2,1-exchange as follows: for all pairs of probes in S, if possible, exchange two probes with one probe not in S.

1. Randomly select a solution, say $S_1$, from Memory M1.
2. Apply the 2,1-exchange in $S_1$ and generate the feasible solution S.
3. Return S, ObjectiveFunctionValue(S).

Destructor Agent D1 deletes the worst solution in the Memory M1. The worst solution is defined as the solution that contains the greatest number of probes.

Borneman et al. [2] proposed a fast approach to compute $|\Delta_S|$. It is as follows: for each fingerprint f compute the number $\nu_f$ of clones with fingerprint f. It can be shown that $|\Delta_S| = \frac{1}{2} \sum_f \nu_f (m - \nu_f)$. To find $\nu_f$, sort the clones using $\mathrm{fingerprint}_S(c)$ as the sort key.


In our implementation of $|\Delta_S|$ we use the function qsort that implements the quick-sort algorithm. This function is appropriate to sort large arrays. The computational experiments showed the drastic difference between using the naive implementation of $|\Delta_S|$ and the one explained above.

We found that when large data files containing several thousand clones and probes were processed, the Agent C3 in the MDPS A-Team used a significant amount of CPU time. The bottleneck in Agent C3 is in the loop that calculates the number of clone pairs that are distinguished by each probe in a given solution. The loop runs approximately $|C|^2 |S'|$ iterations per call, where $|S'|$ is equal to the number of probes in the solution.

To speed up the process, we developed a multithreaded preprocessing program in C++ to calculate the fingerprints and number of clone pairs distinguished by each probe. A thread is a path of execution within an instance of a program. Generally a program will have one thread (the primary thread) which terminates when the program terminates. More threads can be created to run multiple paths of execution in parallel. The preprocessing program uses ten threads to perform the calculations for any given pair of clone/probe data files with a selected R-value of 1 or 4. Employing multiple threads in the preprocessing program significantly reduced the CPU time required to process files. The output of the program is a file listing the probes and the number of clone pairs distinguished by each probe. The use of preprocessed files significantly increased the speed of Agent C3 by eliminating the need to run through the bottleneck loop.

The Agent C1 in the MCPS A-Team is responsible for the generation of initial solutions in Memory M1. To generate a new solution, Agent C1 chooses the probes that distinguish the greatest number of pairs of clones. As it builds a solution, each time a new probe is added it determines the set of pairs of clones not yet distinguished by the


The Agent C1 in the MCPS A-Team combines all of the issues that were a concern in the first two cases discussed. The value of $|\Delta_S|$ must be recalculated every time a probe is added to determine if the solution is acceptable, and the number of pairs of clones distinguished by the probes must be repeatedly calculated while the set of clones considered constantly changes as pairs that are distinguished are eliminated. It was necessary to take several measures to speed up Agent C1. The more efficient implementation of the calculation of $|\Delta_S|$ was used to find $|\Delta_S|$. The preprocessed probe files were sorted using the qsort algorithm according to the number of pairs of clones they distinguish. The sorting of the probes eliminated the need for Agent C1 to search for the initial probe. To avoid recalculating the number of pairs of clones distinguished for the entire set of probes (which could easily number more than 4000), only subsets of the probes were examined at one time. The subset was selected from the top of the sorted list of probes to ensure that the best candidates were included. The size of the subset, which is generally between twenty and fifty, is input by the user. These improvements all worked together to allow Agent C1 to run very efficiently. A simple implementation of the agent would have been infeasible to use.

A dialog-based framework was used to allow for rapid development of the program. Classes from the Microsoft Foundation Class (MFC) library were utilized for storage of probes, clones and solutions. Individual probes were stored as CStrings. CStringArrays were used to store input clones, probes and probe solutions, and a CObArray was used for Memory M1. A random-number generator was used to randomly select the agent to be called at each iteration.

In Section 3.7.1, we discuss the fact that none of the probe files we had were capable of distinguishing every pair of clones in a clone file. Since it was not possible to distinguish


3.4 and 3.5. In the next subsection we describe the instances used in the tests. In Sections 3.7.2 and 3.7.3 we present the results.

[31], with parameters 16807 (multiplier) and $2^{31} - 1$ (prime number).

All tests were run on a Pentium 4 CPU with a speed of 3.0 GHz and 1 GB of RAM under MS Windows XP. All algorithms were implemented in MS Visual C++ 6.0. CPU times were computed using the function clock.

We tested the A-Teams approach on two groups of data. One group of data files is detailed in Table 3-1 and was obtained from the authors of [2]. It consists of three different clone files. The first (dataset1) contains 1158 small-subunit ribosomal genes from GenBank (NCBI). The nucleotide sequence of each gene in the file was edited so that it contains the sequence between two highly conserved primers, but not the primer sequences themselves [2]. The other two clone files (eubacteria2k and eubacteria5k) contain 2000 and 5000 eubacteria samples respectively. In Table 3-1, the last extension of the probe file name indicates the length of the probes contained in the file.

The second group is semi-random data that we generated using a C++ program we developed. The program randomly generates probes of any length input by the user. It then randomly generates clones with probes from the probe file embedded as substrings. The semi-random data sets are listed in Table 3-2.
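The generation of semi-random data described above can be sketched as follows. This is a hypothetical Python illustration rather than the C++ generator used for the experiments: the function names, the uniform choice of bases, the number of probes embedded per clone, and the embedding positions are all assumptions, since the text does not specify them.

```python
import random

ALPHABET = "ACGT"


def random_probes(count, length, rng):
    """Generate `count` distinct random probes (requires count <= 4 ** length)."""
    probes = set()
    while len(probes) < count:
        probes.add("".join(rng.choice(ALPHABET) for _ in range(length)))
    return sorted(probes)


def random_clone(length, probes, embed, rng):
    """Random clone of `length` bases with `embed` probes written into it.

    Later probes may overwrite earlier ones, so only the last embedded
    probe is guaranteed to survive intact.
    """
    seq = [rng.choice(ALPHABET) for _ in range(length)]
    for probe in rng.sample(probes, embed):
        pos = rng.randrange(length - len(probe) + 1)
        seq[pos:pos + len(probe)] = probe
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(1)
    probes = random_probes(20, 6, rng)
    print(random_clone(60, probes, 3, rng))
```

Embedding candidate probes in the clones ensures that the candidate set has some distinguishing power, which purely random clones and probes of length 9 or 10 would rarely provide.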


3-1 and 3-2.

To date, the only published paper specifically discussing the MDPS and MCPS problems that we are aware of is Borneman et al. [2]. While we did not acquire the results for all of the data sets and tests discussed in their paper, we did have access to a limited number of the results for the MDPS problem for dataset1 and the eubacteria5k data set. We compare our results for MDPS in Section 3.7.2. We did not have results for comparison for MCPS. In Section 3.7.3, we compare our results for the MCPS problem with those given in a table in the paper by Borneman et al. [2].

To further measure the performance of the A-Teams approach for the MDPS problem, we compared the best $|\Delta_S|$ obtained from twenty test runs with solutions containing exactly twenty probes to the value of $|\Delta_S|$ obtained when all candidate probes were included in the solution. That is, we compared the results with twenty probes to the best possible result that could be obtained for each given data set. To evaluate our approach for the MCPS problem, we had the program determine the minimum solution set size that could achieve a minimum cover of 95%. We used the same approach for the data from [2] and the semi-random data.

Rash and Gusfield [34] comment that a weakness of the work by Borneman et al. [2] is that they only consider probes of a fixed length. Our approach is not limited to a fixed probe length. To address this question, we created three semi-random data sets with probes of varying lengths. The data sets are referenced in Tables 3-2, 3-4 and 3-6 as 5,6,7 and 6,7,8,9 and 5,6,7,8,9,10, indicating the varying lengths of the probes. Each data set contains an equal number of probes of each length.

Each probe file was run through the preprocessing program discussed in Section 3.6 with the associated clone file one time before being used in the A-Team algorithms. The


Data from Borneman et al.

                             Clone File (no.)              Max Possible $|\Delta_S|$
                             dataset1.clones (1158)        669,309
candprobes.a.7 (6241)        dataset1.clones (1158)        669,309
candprobes.a.8 (4209)        dataset1.clones (1158)        669,309
candprobes.a.9 (4581)        dataset1.clones (1158)        669,307
candprobes.a.10 (4209)       dataset1.clones (1158)        669,305
eubacteria5k.a.6 (4064)      eubacteria2k.clones (2000)    1,997,759
eubacteria5k.a.6 (4064)      eubacteria5k.clones (5000)    12,494,429

Semi-random data.

                             Clone Length (no.)            Max Possible $|\Delta_S|$
                             1200 (1200)                   718,801
6 (4000)                     1500 (1200)                   718,801
7 (4000)                     1500 (1200)                   718,801
8 (4000)                     2000 (1200)                   718,801
9 (4000)                     2000 (1200)                   718,801
10 (4000)                    2000 (1200)                   718,801
5,6,7 (4200)                 1500 (1200)                   718,801
6,7,8,9 (4400)               2000 (2000)                   1,998,001
5,6,7,8,9,10 (4200)          2000 (1500)                   1,123,501

Twenty tests were run for each data set and the results are reported in Tables 3-3 and 3-4. The Max $|\Delta_S|$ and Min $|\Delta_S|$ columns give the maximum and minimum $|\Delta_S|$ values obtained over the twenty test runs; these correspond to the best and worst solutions respectively. The % column gives the percent of the maximum $|\Delta_S|$ to the best possible


The results for the MDPS problem performed on the data from Borneman et al. [2] are given in Table 3-3. The results are very good. In the worst case, after only 300 iterations and little cpu time, solutions containing twenty probes were able to distinguish at least 99.86% of the clone pairs distinguished when using all of the candidate probes. It is clear from the standard deviation that the results were consistent over the twenty test runs. The cpu times are most affected by the number of clones in the data set. The times are consistent among data sets with the same number of clones.

Borneman et al. [2] noted that they had significantly better results with non-binary distinguishability. We found the same to be true with the A-Teams for both the MDPS and MCPS problems for all but probes of length 10. As noted, the results shown in Tables 3-3 to 3-6 are all for binary distinguishability.

As discussed in Section 3.6, we used the fast implementation to calculate $|\Delta_S|$. To ensure that we were comparing results accurately, we ran solutions from [2] through our code to calculate $|\Delta_S|$. We then compared our results using those values. For solutions containing twenty probes of length 6, we obtained better results for non-binary distinguishability in just 350 iterations and 152.6 seconds of cpu time. Our binary results were within 0.04% of theirs after 500 iterations and 218.9 seconds. We did not have a cpu time for their non-binary result; however, the binary result took approximately 2 hours and 26 minutes. We also compared solutions of twenty probes of lengths 8 and 10 with both binary and non-binary distinguishability. We found that after 350-500 iterations with cpu times of 174 to 215 seconds, we obtained results with $|\Delta_S|$ within 0.05% of their $|\Delta_S|$ values, with the exception of the non-binary case for probes of length 10 which was within


Table 3-3. Maximum distinguishing probe set problem results (binary distinguishability).

                    Max $|\Delta_S|$    %         Min $|\Delta_S|$    Mean             Std Dev    Avg cpu
candprobes.a.6      669,023       99.96%    668,882       668,949.42       41.76      130.98 s
candprobes.a.7      668,777       99.92%    668,311       668,507.20       111.24     132.15 s
candprobes.a.8      668,503       99.88%    667,964       668,215.50       153.42     133.69 s
candprobes.a.9      668,395       99.86%    667,458       667,984.00       240.59     132.87 s
candprobes.a.10     668,903       99.94%    667,135       667,948.60       548.6      133.29 s
Eubacteria (2k)     1,995,476     99.89%    1,994,889     1,995,070.60     268.74     232.10 s
Eubacteria (5k)     12,486,425    99.94%    12,484,723    12,484,740.65    938.84     582.46 s

We noted that for candidate probe sets containing mixed length probes, the A-Teams for MCPS tended to select more probes of length 6 in each case. In fact, only probes of length 6 were selected for solutions for the length 5 through 10 data set. The A-Teams for MDPS, however, selected probes of multiple lengths, and in the length 5 through 10 data set it found solutions that contained at least one probe of each length.


Maximum distinguishing probe set problem results (binary distinguishability) on semi-random data after twenty test runs.

                Max $|\Delta_S|$   %          Min $|\Delta_S|$   Mean            Std Dev    Avg cpu
5               718,801      100.00%    718,799      718,800.50      0.60       109.86 s
6               718,801      100.00%    718,797      718,799.50      1.00       129.73 s
7               718,752      99.99%     718,438      718,639.20      89.26      136.74 s
8               718,477      99.95%     714,888      717,330.00      973.76     168.20 s
9               718,168      99.91%     712,699      715,685.60      1303.74    171.29 s
10              717,719      99.85%     707,790      714,095.00      2825.12    171.22 s
5,6,7           718,795      99.999%    718,776      718,789.75      4.60       136.11 s
6,7,8,9         1,997,997    99.999%    1,997,914    1,997,980.00    18.81      285.18 s
5,6,7,8,9,10    1,123,494    99.999%    1,123,438    1,123,479.00    15.04      210.81 s

3-5 and 3-6.

Table 3-5. Minimum cost probe set problem results (binary distinguishability) on Borneman et al. data after twenty test runs.

                    Min $|S|$   Max $|S|$   Mean    Std Dev   Avg cpu
candprobes.a.6      5           7           6       0.32      43.19 s
candprobes.a.7      5           6           5.95    0.22      50.27 s
candprobes.a.8      6           7           6.65    0.49      51.73 s
candprobes.a.9      6           12          8.75    2.00      21.83 s
candprobes.a.10     6           8           7.00    0.33      56.89 s
Eubacteria (2k)     6           7           6.60    0.50      146.27 s
Eubacteria (5k)     6           6           6.00    0.00      987.50 s

[2] list results for optimal and near-optimal probe sets for two of the data sets (Table 1). We had the data only for the first data set with probes of length 6 and 8 (they found an optimal solution with probes of length 5). The authors did not specify what they meant by "near-optimal", so here we report solution sizes ($|S|$) for sets that distinguish at least 99.9% of clone pairs that can be distinguished using the entire set. For probes of length 6, we found solutions of size 35 for binary and 22 for non-binary distinguishability; for probes of length 8 we found solutions of size 48 and 20 for binary and non-binary respectively. These are smaller than the values reported in Table 1 of [2]; however, we must stress that we do not know the criteria used to determine if a solution was near-optimal in the other paper.


Minimum cost probe set problem results (binary distinguishability) on semi-random data after twenty test runs.

                Min $|S|$   Max $|S|$   Mean    Std Dev   Avg cpu
5               7           7           7.00    0.00      37.35 s
6               5           6           5.95    0.22      104.22 s
7               6           6           6.00    0.00      71.39 s
8               5           5           5.00    0.00      91.18 s
9               5           5           5.00    0.00      86.94 s
10              5           5           5.00    0.00      97.88 s
5,6,7           6           6           6.00    0.00      102.02 s
6,7,8,9         5           5           5.00    0.00      250.27 s
5,6,7,8,9,10    6           6           6.00    0.00      95.15 s

[2] in a small fraction of the time. Our approach has some specific advantages over theirs. One advantage is that the A-Teams approach is able to find near-optimal probe sets containing probes of both fixed and varying lengths. Another is that the A-Teams approach is dynamic in the sense that new heuristic agents can be added at any time without requiring any significant changes to existing code. This allows for continued improvement of the method with the seamless incorporation of new heuristic algorithms.


[35], detection of many human viruses such as those implicated in AIDS, and the detection of viral RNA or DNA that is relevant to pathologies of the central nervous system [5].

As discussed in Chapter 2, biologists can use hybridization to determine whether a specific DNA fragment is present in a DNA solution. A frequently used approach for identifying target sequences is based upon oligonucleotide microarrays [16, 30]. Microarrays allow for a very large number of hybridization experiments to take place in parallel [20].

Suppose that we would like to identify certain virus types in a sample. By observing whether a number of oligonucleotide probes hybridize to the genome of the virus, one can determine if a virus is contained in a sample. In the case where there is more than one virus in the sample, the approach is readily extensible. However, there are drawbacks associated with this approach since finding unique probes (i.e., probes that hybridize to only one target) is difficult in the case of closely related virus subtypes. An alternative approach is to use non-unique probes that hybridize to more than one target [19].

An optimal solution to the non-unique probe selection problem will consist of the smallest possible set of probes that satisfy the minimum requirements of the problem. However, this problem is very difficult to solve, even with immensely powerful computers. In fact, before the method presented in Section 4.5, no optimal algorithm had been


[19].

In this chapter, we present two different methods we have developed for the solution of the Non-Unique Probe Selection problem. The remainder of the chapter is organized as follows. We define the Non-Unique Probe Selection problem in Section 4.2. A description of the data used to test both methods is presented in Section 4.3. In Section 4.4, we describe the heuristic method we developed for the non-unique probe selection problem, and in Section 4.5 we detail our optimal cutting-plane method, which is the first exact method developed for this problem.

4-1 for an example of a target-probe incidence matrix with 4 targets and 6 probes [25]. A probe $p_j$ is said to distinguish or separate two target sequences $t_a$ and $t_b$ if it is a substring of exactly one of $t_a$ or $t_b$ (i.e., if $|h_{aj} - h_{bj}| = 1$). For example, if $t_a$ = AGGCAATT and $t_b$ = CCATATTGG, then for probe $p_j$ = GCAA, we have $h_{aj} = 1$, $h_{bj} = 0$, and so $p_j$ distinguishes $t_a$ and $t_b$. However, for probe $p_k$ = ATT, we have $h_{ak} = h_{bk} = 1$, and so $p_k$ does not distinguish $t_a$ and $t_b$.

Table 4-1. Target-probe incidence matrix.

Given a target-probe incidence matrix H, the goal of the non-unique probe selection problem is to select a minimum set of probes that conclusively determines the presence or absence of a single target. However, due to errors in the experiment (i.e., the experiment reports a hybridization that should not exist, or fails to report a hybridization


In the example given in Table 4-1, suppose that $c_{\min} = d_{\min} = 1$. The set of probes $p_1$, $p_2$, and $p_3$ resolves the experiment using a minimum number of probes. However, if we require $c_{\min} = d_{\min} = 2$ (each target pair must be distinguished by at least two probes, and each target must have at least two probes that hybridize to it), then the set $p_1$, $p_2$, $p_3$ and $p_4$ becomes an optimal set of probes. Of course, an extreme solution to any probe selection problem containing a feasible solution is to simply select all possible probes for the experiment. However, the experimental cost is proportional to the number of probes used, and so such an all-encompassing experiment is not typically cost-effective.

Note that the problem becomes more difficult when multiple targets are present in the sample, and that the prior analysis assumes the existence of a single target in the sample. Selecting the same probe set $p_1$, $p_2$, $p_3$, and $p_4$ as before, note that three out of the six possible pairs of targets (pairs $(t_1, t_2)$, $(t_1, t_3)$, and $(t_1, t_4)$) cause $p_1$, $p_2$, $p_3$, and $p_4$ to hybridize. Hence, if our experiment results in the hybridization of all probes, we can determine the presence of target $t_1$, but cannot determine which other target is present.

One way to resolve this problem is to create aggregated targets that represent the existence of multiple individual targets. For instance, if we believe that it is possible that both $t_1$ and $t_2$ exist in a sample, then a fifth target, $t_5$, can be used to represent the situation in which both $t_1$ and $t_2$ are present. This target would have $h_{5j} = 1$ for j = 1, 2, 3, 4, and 6, and $h_{55} = 0$.
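The coverage and separation requirements, and the construction of an aggregated target, can be sketched as follows. This is an illustration only: the function names and the small incidence matrix are hypothetical (the matrix is not Table 4-1), and the aggregated row is assumed to be the element-wise OR of the two target rows, i.e., a probe hybridizes to the aggregated target if it hybridizes to either individual target.

```python
from itertools import combinations


def aggregate(row_a, row_b):
    """Incidence row of a virtual target standing for both targets together."""
    return [max(a, b) for a, b in zip(row_a, row_b)]


def is_feasible(H, selected, c_min, d_min):
    """Check that the selected probe indices satisfy both requirements."""
    # Coverage: at least c_min selected probes hybridize to every target.
    for row in H:
        if sum(row[j] for j in selected) < c_min:
            return False
    # Separation: every target pair differs on at least d_min selected probes.
    for row_i, row_k in combinations(H, 2):
        if sum(abs(row_i[j] - row_k[j]) for j in selected) < d_min:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical 3-target, 4-probe incidence matrix (rows are targets).
    H = [[1, 1, 0, 0],
         [0, 1, 1, 0],
         [0, 0, 1, 1]]
    print(is_feasible(H, [0, 1, 2], 1, 1))
    print(aggregate(H[0], H[1]))
```

Appending the aggregated row to H and re-running the check shows directly whether a probe set that works for single targets still separates the combined case.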


[19].

It can easily be shown that the non-unique probe selection problem is NP-hard using a reduction from the Set Covering problem, even if $c_{\min} = 0$ or if $d_{\min} = 0$. The reader is referred to a seminal work by Garey and Johnson [9] on the theory of computational complexity and NP-hardness. For the purposes of this chapter, computational complexity refers to a worst-case measure of the computational time or space required to solve a problem by an algorithm, as a function of the size of the problem. Optimization and decision problems are divided into classes, such as the NP-hard class of problems. The important practical implication regarding the fact that the probe selection problem is NP-hard is that one cannot guarantee that an optimal solution can be found in a reasonable amount of time (i.e., in a number of steps bounded by a polynomial function of problem size). Heuristic algorithms are often employed to quickly solve NP-hard problems, usually in time bounded by a polynomial function of problem size. However, such heuristics cannot guarantee that an optimal solution, or even that a feasible solution, will be found. Even in the best case, if an optimal solution is obtained by the heuristic, the fact that it is optimal cannot be proven.

The non-unique probe selection problem can be naturally modeled as an integer linear program (ILP). The following formulation, due to Klau et al. [19], is based on the Set Covering problem [9]. Let $N = \{p_1, \ldots, p_n\}$ denote the set of candidate probes, $M = \{t_1, \ldots, t_m\}$ denote the set of targets, and $P = \{(i, k) \mid 1 \le i < k \le m\}$.

Note that the objective function (4-1a) minimizes the number of probes used in the experiment. Constraints (4-1b) require that at least $c_{\min}$ selected probes hybridize to each possible target. Constraints (4-1c) state the Hamming distance constraints, which require that target $t_i$ is distinguishable from target $t_k$ by at least $d_{\min}$ different probe results, for each $(i, k) \in P$. Finally, restrictions (4-1d) state that each probe selection variable must be binary valued. Solving (4-1) to optimality by integer programming solvers may indeed be possible for small problem instances, but the quadratic rate of growth in the set of constraints (4-1c) renders this approach intractable for large-scale instances.

Klau et al. [19] prescribe a heuristic procedure for the non-unique probe selection problem based on a restricted version of (4-1). First, they generate a feasible subset of probes from the overall set of candidate probes. They then reduce the number of probes in the solution by solving an integer linear program using CPLEX software. Concurrently with the development of the paper presented in Section 4.5, Klau et al. [18] extended their approach by prescribing a branch-and-cut procedure for the non-unique probe selection problem in which multiple targets may be present. Their resulting restricted ILP ("restricted," since several candidate probes are ignored in the first step) is solvable within reasonable computational limits. However, because candidate probes are omitted in the construction and solution of the restricted ILP, the optimal solution to the restricted ILP is not necessarily an optimal solution to the overall problem.
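For reference, the formulation (4-1) can be written out from the description of its objective and constraints given above. The display below is our reconstruction from that description; the binary variable $x_j$, equal to 1 if probe $p_j$ is selected, and the incidence entries $h_{ij}$ follow the notation of Section 4.2, while the exact typesetting of the original is an assumption.

```latex
\begin{align}
\min\quad & \sum_{j=1}^{n} x_j \tag{4-1a}\\
\text{s.t.}\quad & \sum_{j=1}^{n} h_{ij}\, x_j \ge c_{\min}
  && \forall\, t_i \in M \tag{4-1b}\\
& \sum_{j=1}^{n} \lvert h_{ij} - h_{kj} \rvert\, x_j \ge d_{\min}
  && \forall\, (i,k) \in P \tag{4-1c}\\
& x_j \in \{0, 1\}
  && \forall\, p_j \in N \tag{4-1d}
\end{align}
```

There are m constraints of type (4-1b) but m(m-1)/2 of type (4-1c), which is the quadratic growth referred to above.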


[19] and the second group consisted of HIV-1 and HIV-2 data.

The first group of data included both artificial and real (Meiobenthos) data sets. Benthos are the organisms that reside on the seafloor and at the bottoms of lakes and rivers. They are categorized according to size, with Meiobenthos between 100 µm and 500 µm in size [1].

A test set of 679 target Meiobenthos sequences was constructed by clustering 1,230 28S rDNA sequences from different organisms present in the Meiobenthos and arbitrarily selecting a representative from each cluster. The 149 clusters, which contained two or more sequences each, were representative of approximately 56% of all Meiobenthos sequences [19]. The same set of sequences had been used by Schliep et al. [39]. The corresponding probe set contained 15,139 probes. The Meiobenthos data set is designated as "M" in Tables 4-2 through 4-5.

The artificial data were generated using Random Evolutionary FORest Model (REFORM) software. Ten independent test sets were constructed using two different REFORM models. Five test sets with 256 targets each were generated from one model (a1-a5), and five test sets with 400 targets each were generated from the second model (b1-b5). The probe candidates for each of the ten artificial test sets were generated using Promide software. A detailed description of the artificial data set construction is given by Klau et al. [19].

The second group of data consisted of 200 each HIV-1 and HIV-2 sequences downloaded from the National Center for Biotechnology Information. The similarity of the sequences made them good candidates for the non-unique probe selection problem. Candidate probes for the sequences were generated using Primer3 [36] with default input parameters. The default parameters included: length between 18 and 27 nucleotides, melting temperature


between 57 °C and 63 °C, and GC content between 20% and 80%. Eight thousand probes, 40 for each HIV sequence, were generated for each data set. Repeat probes were then deleted before the target-probe incidence matrix was constructed [25]. All tests were conducted with the parameters $d_{\min}$ and $c_{\min}$ equal to 5 and 10, respectively.

Numbers of targets and candidate probes.

Set    Targets    Probes
a1     256        2786
a2     256        2821
a3     256        2871
a4     256        2954
a5     256        2968
b1     400        6292
b2     400        6283
b3     400        6311
b4     400        6223
b5     400        6285
M      679        15139

[39], and for large, real data sets when compared to the method of Klau et al. [19].


8-10.


11) reduces the solution set by selecting two probes at a time to be deleted. If the solution can remain feasible by adding zero or one probe to replace the two, the solution is updated; otherwise it remains the same. The search function used during the construction phase is used to find a replacement probe if one is needed.

There is no random selection at any point in the algorithms. The probes are sorted and selected from in the sorted order, except for the specific search which ends up searching through the remaining probes in the same order. The one element that can be varied is the method in which the probes are selected for deletion in the Two-for-Less


We first discuss results of experiments using the data obtained from Klau et al. [19]. We will then discuss the results for the HIV data.

All of the artificial data sets (a1-a5 and b1-b5) were tested under the same conditions. In the construction phase, the specific search for probes was started when the number of target pairs failing to satisfy the minimum Hamming distance constraint reduced to 200. In the reduction phase, each probe was tested twice for deletion from the solution set unless it remained deleted after the first attempt.

The Meiobenthos data set was tested under slightly different conditions. In the construction phase, the specific probe search was started at 300 target pairs instead of 200.

When compared to the greedy heuristic of Schliep et al. [39], our algorithm produced results that were better (i.e., contained fewer probes) in every case. Table 4-3 compares the results of the greedy heuristic and those of Klau et al. to our results.

Our results for the Meiobenthos data set, which was the largest and only real data set in the first group of data, were significantly better than those obtained by Klau et al. [19]. The solution contained only 2,261 probes, or 897 fewer probes than their solution.

CPLEX (http://www.ilog.com/products/cplex/) is one of the leading mathematical programming software packages available. Very few heuristic algorithms, if any, are able to compete with results produced by CPLEX. In the approach of Klau et al. [19], the


[39]. An ILP was then constructed using the smaller set of candidate probes and solved using CPLEX software to find the final solution. Because only the probes from the reduced set were used to construct the ILP, CPLEX was not aware of the additional candidate probes in the original pool, and therefore could not choose any of those probes as part of the final solution. For the Meiobenthos data set, the candidate probes were reduced from 15,139 to 3,851 by the greedy heuristic. CPLEX could only choose from the 3,851 probes, whereas our algorithm was allowed to choose from all 15,139 probes at every step of the process. As a result, our method was able to find a much smaller solution set.

The reduction of the original set of candidate probes was not nearly as drastic for the ten artificial data sets from Group 1. Data sets a1 through a5 contained between 2,786 and 2,968 original candidate probes which were reduced by the greedy heuristic to 1,137 to 1,175 probes. Data sets b1 through b5 began with 6,223 to 6,311 candidate probes which were reduced to 1,876 to 1,908 probes. In these cases, CPLEX was able to consider a much greater portion of the original probes and as a direct result was much more effective. Our approach produced larger solutions for these data sets.

Table 4-3. Comparison of results.

Set    Greedy Soln [39]    Klau ILP Soln [19]    Our Heuristic Soln
a1     1163                503                   562
a2     1137                519                   558
a3     1175                516                   597
a4     1169                540                   595
a5     1175                504                   601
b1     1908                879                   961
b2     1885                938                   975
b3     1895                891                   946
b4     1888                915                   1001
b5     1876                946                   1019
M      3851                3158                  2261


For the HIV data sets, the specific search during the construction phase was started when the number of target pairs failing to satisfy the minimum Hamming distance constraint had reduced to 200. Each probe was tested twice for deletion in the reduction phase.

Tests on both data sets yielded very good results. Once the repeated probes were deleted for the HIV-1 set, we had an initial set of 4,806 candidate probes. The algorithms produced a solution set containing only 511, or 10.6%, of the candidate probes.

Before running the test with 40 candidate probes generated for each sequence in the HIV-1 data set, we ran the same test with 20 candidate probes generated for each sequence. The 20 probes were generated by Primer3 with the same input parameters. After deleting repeated probes, there were 2,241 distinct candidate probes. As expected, the results were not as good. The final solution contained 565 probes with a larger number of sequence pairs failing to satisfy the minimum Hamming distance constraint.

The initial candidate set for the HIV-2 data contained 4,686 probes after the deletion of repeated probes. The algorithms produced a solution set containing only 543, or 11.6%, of the probes.

4.2. A comparison of our results with those obtained with the greedy heuristic of Schliep et al. [39] showed that our method found a better solution


[19], we found that our method was able to produce significantly better results for the very large, real data set. Our approach produced larger solutions for the small, artificial data sets where CPLEX played a more significant role in finding the solution. We also tested our approach from start to finish by downloading HIV-1 and HIV-2 sequences from NCBI, generating probes using third-party software, and creating target-probe incidence matrices. The algorithm yielded very good results for both data sets.

Our results show that the solution sizes reduce rapidly in the first iterations of the Two-for-Less algorithm. After the initial drop in solution size, the rate of reduction slows. At each step feasibility of the solution is maintained so that a feasible solution could be extracted at any point after the construction phase.

4. An optimal solution may significantly reduce the cost of experimentation using probes, and gives theoretical benchmarks from which to assess the quality of heuristic solutions relative to optimality. However, the direct solution of (4-1) via integer programming software is intractable, due to the large number of variables and constraints within the problem.

The methodological contribution of the paper discussed in this section is the development of a cutting-plane approach to find optimal solutions to the non-unique probe selection problem. Recall that the approach of Klau et al. [19] employs a restricted version of (4-1), which provides upper bounds on the optimal solution. By contrast, if one relaxes (4-1) by removing troublesome constraints, it is possible to obtain lower bounds on the optimal solution. It is thus possible to generate both lower bounds (from the relaxation of (4-1)) and upper bounds (given by any feasible solution to the full


(4-1)), and terminate when these two bounds match. We elect to relax a large subset of Hamming distance constraints (4-1c), reinstating them only if determined to be necessary.

1. Initialize by setting
2. Compute an optimal solution to (4-1) in which (4-1c) is stated only for pairs of targets in
3. Check the feasibility of $x^t$ with respect to all omitted Hamming distance constraints (i.e., those constraints in the set $P \setminus$
4. Let $P^t$ be the set of Hamming distance constraints in $P \setminus$
5. Terminate with optimal solution $x^t$ and optimal objective function value LB.

Note that solving the relaxed version of (4-1), in which only a subset of constraints for (4-1c) are included, yields a lower bound on the optimal objective function value, due to the fact that adding omitted constraints cannot improve (decrease) the optimal objective function value of the relaxed problem. Hence, the value LB computed in Step 2 of the algorithm is a valid lower bound to the optimal solution. In Step 3, if no omitted constraints are being violated, then the solution $x^t$ obtained in the prior step is also feasible, which means that LB is also an upper bound on the problem, and accordingly, the algorithm terminates with $x^t$ being proven as an optimal solution. Else, in order to ensure that these restrictions are not violated again, all violated constraints are reinstated into the set of constraints (4-1c), and the problem is solved again. This process must terminate in a finite number of iterations, since at least one constraint (4-1c) is added to the set of included inequalities,


(4-1b) and (4-1d), and that relatively few Hamming distance constraints are needed to enforce feasibility to the distinguishability criterion. On the other hand, there exists a risk that Step 2 will be encountered too many times for the algorithm to be practically useful; since Step 2 involves the solution of an integer programming problem, it is by far the most time-consuming step of the algorithm. Our computational experiments will test the tractability of this algorithm on real test instances.

Several remarks about the algorithm can be made at this point.

Note that a prematurely terminated problem does not yield a lower bound to the overall problem, because there is no guarantee of optimality to the relaxed problem. However, such a guarantee may not be necessary. If the best solution obtained thus far violates ungenerated Hamming distance constraints, it might be most efficient to proceed with the algorithm, predicting that those violated constraints will eventually need to

PAGE 66

4.5.2. [29]. Additionally, it may be appropriate to guide the branch-and-bound search by branching in a manner that focuses on proving optimality of the incumbent solution (i.e., the best solution found so far), rather than aggressively seeking to improve the incumbent. This setting is called a "best-bound-first" rule. Each strategy will yield the optimal solution, but it is possible that some strategies may result in a much faster execution of the cutting-plane algorithm. Our specific implementations of these rules are described in Section 4.5.2.

[25] can be used given a starting solution of x^t from Step 2 in order to obtain a quick upper bound. Whereas the implementation and study of these procedures are outside the scope of this paper, they could possibly be useful in the context of very-large-scale instances of the non-unique probe selection problem.
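To make the cutting-plane scheme discussed above concrete, the following Python sketch runs it on a small invented instance. It is an illustration under stated assumptions, not the implementation used in the experiments: it assumes that model (4-1) minimizes the number of selected probes subject to a minimum coverage requirement for each target and a minimum Hamming distance requirement for each pair of targets, and it replaces the integer programming solve of Step 2 with brute-force enumeration. The incidence matrix H, the parameters c_min and d_min, and all function names are hypothetical.

```python
from itertools import combinations


def hamming(H, probes, a, b):
    """Number of selected probes that hybridize to exactly one of targets a, b."""
    return sum(H[a][j] != H[b][j] for j in probes)


def solve_relaxed(H, c_min, d_min, active_pairs):
    """Stand-in for the ILP solve of Step 2 (assumption: brute force, toy sizes only).

    Returns a smallest probe set that covers every target at least c_min times
    and separates every pair in active_pairs by at least d_min probes.
    """
    n_targets, n_probes = len(H), len(H[0])
    for size in range(n_probes + 1):
        for subset in combinations(range(n_probes), size):
            covered = all(sum(H[i][j] for j in subset) >= c_min
                          for i in range(n_targets))
            separated = all(hamming(H, subset, a, b) >= d_min
                            for a, b in active_pairs)
            if covered and separated:
                return set(subset)
    raise ValueError("relaxed problem is infeasible")


def cutting_plane(H, c_min, d_min):
    """Relax all Hamming distance constraints, then reinstate violated ones."""
    all_pairs = set(combinations(range(len(H)), 2))
    active = set()          # pairs whose constraint is currently in the model
    iterations = 0
    while True:
        iterations += 1
        x = solve_relaxed(H, c_min, d_min, active)   # lower bound LB = len(x)
        violated = {p for p in all_pairs - active
                    if hamming(H, x, *p) < d_min}    # check omitted constraints
        if not violated:
            return x, iterations, len(active)        # x is feasible, hence optimal
        active |= violated                           # reinstate and re-solve


# Hypothetical 4-target, 6-probe incidence matrix (rows: targets, columns: probes).
H = [
    [1, 1, 0, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
probes, iterations, cuts = cutting_plane(H, c_min=1, d_min=2)
print(sorted(probes), iterations, cuts)
```

Because every pass that does not terminate adds at least one omitted pair, the loop ends after finitely many passes. On realistic instances the enumeration would have to be replaced by an integer programming solver.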

PAGE 67

We begin this section by studying the computational effort required to obtain an optimal solution to the test instances described in Section 4.3. In our implementation, we noted that for all test cases except data sets b1 and b4, the candidate set of probes was not sufficiently large or diverse to satisfy every Hamming distance constraint. For instance, there might be only 3 probes in the entire candidate set that distinguish targets t_a and t_b. As a result, it is not possible to satisfy the Hamming distance constraint for this pair if d_min > 3. In these cases, we included every probe that would distinguish the target pair in question in the solution. We then temporarily added artificial probes in order to obtain a feasible optimal solution. Those constraints are then satisfied to the greatest extent possible, and can be omitted from the set P of pairs that appear in the Hamming distance constraints. (Prior work [18, 19, 25] adds "artificial" or "virtual" probes to satisfy these constraints, but since they appear only in the infeasible target pair constraints, and since those probes are not included in the final objective function, there is actually no need to include them in the model.)

We explored the use of several techniques that were discussed in Section 4.5.1 for increasing the computational efficiency of the algorithm, including setting branching priorities (either as a general "best-bound-first" rule, by encouraging the ILP solver to focus on improving lower bounds, or explicitly, by affording higher branching priorities to probes appearing in the most Hamming distance constraints), and imposing a conditional

PAGE 68

A comparison of algorithmic performance of the basic cutting-plane algorithm and the cutting-plane algorithm with the revised parameter setting is given in Table 4-4. The column "Iter." gives the number of iterations required by the cutting-plane algorithm, "Total Cuts" gives the number of Hamming distance cuts generated by the algorithm, and "CPU Time" gives the number of seconds required to execute the algorithm. Observe that while there is a modest computational improvement on average for the easier instances (a1-a5, HIV1, HIV2), the impact of our improved implementation is most evident on the more difficult instances (b1-b5, M). Indeed, without these improvements, instance b2 could not be solved within 12 hours of CPU time. The run time dropped drastically in our best implementation, from more than 12 hours to only 96.8 seconds.

We explored several combinations of branching priority rules enforced in conjunction with the best-bound-first approach, in combination with the conditional time limit. Our cutting-plane strategy is improved by enforcing either the best-bound-first approach or the Hamming distance-based priority rule. Using both of these rules results in no discernible difference in algorithmic performance. The conditional time limit is also an effective tool, but not when used in conjunction with the best-bound-first rule. By aggressively seeking to improve the lower bound in the ILP phase in the best-bound-first implementation, the optimal solution to the ILP is generally found later in the search than when no branching priorities are set. Thus, premature termination of the algorithm runs a greater risk of adding unnecessary constraints in Step 4 of the algorithm, and thus possibly executing additional iterations of the algorithm.

Next, we compare the optimal objective function value identified by our algorithm to the heuristic values reported by Meneses et al. [25] (abbreviated as MPR-Heur)

PAGE 69

Table 4-4. Impact of parameter adjustments on cutting-plane algorithm.

        Basic Implementation               Best-Bound-First Implementation
Set     Iter.   Total Cuts   CPU Time      Iter.   Total Cuts   CPU Time
a1      5       529          41.7 s        4       529          31.3 s
a2      3       589          22.8 s        3       589          22.9 s
a3      2       626          17.7 s        2       626          17.7 s
a4      3       683          26.5 s        2       682          18.5 s
a5      4       494          32.2 s        4       494          32.8 s
b1      3       604          111.6 s       3       607          97.7 s
b2      Not optimized in 12 hours          3       693          96.8 s
b3      2       737          548.6 s       3       737          101.1 s
b4      2       627          66.4 s        2       626          68.5 s
b5      2       860          409.7 s       4       655          137.4 s
M       8       713          851.0 s       6       711          605.1 s
HIV1    4       250          15.9 s        4       250          13.5 s
HIV2    5       109          20.9 s        5       109          20.2 s

and Klau et al. (abbreviated as KRSVR-Heur). Table 4-5 gives a comparison of the optimal objective function value for the first group of data, compared to the two heuristic strategies. Recall that the goal is to find a smallest set of probes that satisfies the minimum cover and Hamming distance constraints. The optimal objective function value is strictly smaller than the MPR-Heur solution for every instance. A comparison with the KRSVR-Heur reveals that the optimal objective function value is strictly smaller in every test instance, except for data sets a1, a3, and a5. The two objectives match for instance a1, and hence it is likely that the KRSVR-Heur identifies the optimal solution for this instance. Oddly, KRSVR-Heur reports a slightly smaller-than-optimal solution for a3 and a5. These data sets required a significant number of artificial probes, indicating that the original instance was infeasible. Hence, it is evident that this infeasibility was handled in the KRSVR-Heur in a different manner from what was done via our optimal algorithm.

Table 4-6 gives a comparison of our results on the HIV data with the results from the MPR-Heur implementation. The column labeled Candidate Probes lists the total number of probes from which optimal solutions were selected. Naturally, the heuristic solutions represent a significant decrease in the number of probes over the original set of candidate probes. However, the optimal solutions contain substantially fewer probes than what is

PAGE 70

Table 4-5. Comparison of results on data from Klau et al.

Set     Heuristic Soln [25]   Klau ILP Soln [19]   Optimal Soln
a1      562                   503                  503
a2      558                   519                  492
a3      597                   516                  527
a4      595                   540                  537
a5      601                   504                  525
b1      961                   879                  830
b2      975                   938                  841
b3      946                   891                  822
b4      1001                  915                  873
b5      1019                  946                  871
M       2261                  3158                 1887

prescribed by the algorithm. Moreover, the computational times of our cutting-plane algorithm using the HIV data were only 15.9 seconds for the HIV1 instance and 20.0 seconds for the HIV2 instance.

Table 4-6. Comparison of results on HIV data.

Set     Candidate Probes   Heuristic Soln [25]   Optimal Soln
HIV1    4806               511                   431
HIV2    4686               543                   444

In comparison with incumbent heuristic methods, we found that there is a significant need for exact optimization for non-unique probe selection problems. Not only have we contributed exact benchmarks for further heuristic development, but we have shown that

PAGE 71

Future research on this problem may involve a detailed algorithmic strategy to further reduce computational time required by an exact algorithm, perhaps relying on branch-and-cut strategies and other advanced mathematical programming implementations. We also believe that the cutting-plane approach presented here can likely be adapted to obtain optimal solutions for a scope of scenarios broader than the non-unique probe selection problem considered here.
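The treatment of unsatisfiable target pairs and the Hamming distance-based branching priority described in this chapter can be illustrated with a short Python sketch. This is a hedged example rather than the actual implementation: the 0-1 incidence matrix H (targets by probes), the function names, and the rule of counting, for each probe, the retained target pairs it distinguishes are assumptions made for illustration.

```python
from itertools import combinations


def distinguishing_probes(H, a, b):
    """Candidate probes that hybridize to exactly one of targets a and b."""
    return [j for j in range(len(H[0])) if H[a][j] != H[b][j]]


def preprocess_pairs(H, d_min):
    """Split target pairs into satisfiable and unsatisfiable ones.

    If fewer than d_min candidate probes distinguish a pair, every such probe
    is forced into the solution and the pair is left out of the set P of
    pairs that receive a Hamming distance constraint.
    """
    forced = set()   # probes fixed to 1
    P = []           # pairs kept for Hamming distance constraints
    for a, b in combinations(range(len(H)), 2):
        probes = distinguishing_probes(H, a, b)
        if len(probes) < d_min:
            forced.update(probes)
        else:
            P.append((a, b))
    return forced, P


def branching_priorities(H, P):
    """Assumed priority rule: number of retained pairs each probe distinguishes."""
    counts = [0] * len(H[0])
    for a, b in P:
        for j in distinguishing_probes(H, a, b):
            counts[j] += 1
    return counts


# Hypothetical instance: 3 targets, 4 candidate probes.
H = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
]
forced, P = preprocess_pairs(H, d_min=2)
priorities = branching_priorities(H, P)
print(forced, P, priorities)
```

In this toy instance only one probe distinguishes targets 1 and 2, so that probe is forced into the solution and the pair is dropped from P; the remaining pairs then determine the priorities.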

PAGE 72

In this study, we examined methods for the solution of probe design and selection problems. We conducted a survey of existing computational methods for the design and selection of probes. A number of algorithms and software programs are commercially and freely available for the design of oligonucleotide probes. We therefore concentrated on finding new methods for the selection of probe sets for use in hybridization experiments in our research.

The asynchronous teams algorithms we designed for finding solutions to the maximum distinguishing probe set problem and the minimum cost probe set problem were able to find comparable and, in some cases, clearly better solutions than existing methods in a small fraction of the time. The approach had the added benefit of a type of plug-and-play ability for expansion. New algorithms can be seamlessly added to the asynchronous teams algorithms at any time, allowing for continued improvement of the method.

Our heuristic method for the non-unique probe selection problem was an improvement over the greedy heuristic of Schliep et al. [39] in all test cases, and was able to find smaller solution sets for large real data sets than the heuristic algorithm of Klau et al. [19]. However, because of the heuristic nature of the method, it was not able to find a provably optimal solution. Clearly, there was a significant need for exact optimization for non-unique probe selection problems. This led to our development of the first exact method for solving the non-unique probe selection problem within computational limits without a priori elimination of candidate probes. The algorithm contributed exact benchmarks for further heuristic development, and also demonstrated that the modest computational time (within 10 minutes for each instance tested) required for our approach yields a substantial improvement over the incumbent strategy for solving real instances of this problem. Specifically, we computed solutions that were 19.82%, 18.56%, and 22.30% better than the previous best solution found for the three real instances examined.
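The percentages quoted above can be reproduced from Tables 4-5 and 4-6 under the following reading, which is an inference rather than something stated in the text: the three real instances are M, HIV1, and HIV2, the previous best solution is the smaller of the reported heuristic values, and the improvement is measured relative to the optimal solution size. The short Python check below illustrates this reading; the dictionary layout and the function name are hypothetical.

```python
# Probe counts taken from Tables 4-5 and 4-6.
previous_best = {"M": min(2261, 3158), "HIV1": 511, "HIV2": 543}
optimal = {"M": 1887, "HIV1": 431, "HIV2": 444}


def improvement_percent(previous, best):
    """Reduction in probe count, expressed relative to the optimal size (assumed)."""
    return 100.0 * (previous - best) / best


improvements = {
    name: round(improvement_percent(previous_best[name], optimal[name]), 2)
    for name in optimal
}
print(improvements)
```

Measured relative to the heuristic value instead, the same reductions would come out smaller, so the choice of denominator matters when comparing these figures with other reports.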

PAGE 73

2.2). Researchers must weigh the benefit of more accurate measures against the cost of increased complexity and, as a result, computational time of the calculations. As more efficient methods are developed for these measures, they will likely have a direct impact on the methods applied to the design of probes.

Regarding the selection of probe sets, future research may involve continued improvements to further reduce computational time. The A-Teams method can readily incorporate new algorithms or improvements to existing algorithms. Research is also continuing on the non-unique probe selection problem to not only consider pairwise separation of targets, but also group separation of targets. Concurrent with our development of the optimal cutting-plane algorithm for solution of the non-unique probe selection problem, a method was developed to allow for the separation of small groups of targets. Finally, we believe that the cutting-plane approach we presented can likely be adapted to obtain optimal solutions for a scope of scenarios broader than the non-unique probe selection problem.

PAGE 74

[1] R. S. K. Barnes and R. N. Hughes. An Introduction to Marine Ecology. Blackwell Science, Malden, Mass, 1999.
[2] J. Borneman, M. Chrobak, G. D. Vedova, A. Figueroa, and T. Jiang. Probe selection algorithms with applications in the analysis of microbial communities. Bioinformatics, 17:S39-S48, 2001.
[3] C. Chou, C. Chen, T. Lee, and K. Peck. Optimization of probe length and the number of probes per gene for optimal microarray analysis of gene expression. Nucleic Acids Research, 32(12), 2004.
[4] P. Clote and R. Backofen. Computational Molecular Biology. John Wiley and Sons Ltd, West Sussex, England, 2000.
[5] C. Conejero-Goldberg, E. Wang, C. Yi, T. E. Goldberg, L. Jones-Brando, F. M. Marincola, M. J. Webster, and E. F. Torrey. Infectious pathogen detection arrays: viral detection in cell lines and postmortem brain tissue. BioTechniques, pages 741-751, November 2005.
[6] P. S. de Souza and S. N. Talukdar. Asynchronous organizations for MultiAlgorithm problems. In Proceedings of the 1993 ACM/SIGAPP Symposium on Applied Computing: States of the Art and Practice, pages 286-293, Indianapolis, IN, 1993.
[7] A. M. Frieze, F. P. Preparata, and E. Upfal. Optimal reconstruction of a sequence from its probes. Journal of Computational Biology, 6(3), 1999.
[8] Q. Fu, J. Borneman, J. Ye, and M. Chrobak. Improved probe selection for DNA arrays using nonparametric kernel density estimation. In Proceedings of the 2005 IEEE Engineering in Medicine and Biology 27th Annual Conference, Shanghai, China, 2005.
[9] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., New York, NY, 1979.
[10] L. Gasieniec, C. Y. Li, P. Sant, and P. W. H. Wong. Efficient probe selection in microarray design. In Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, pages 247-254, Toronto, Canada, April 2006.
[11] F. C. Gomes, C. N. Meneses, A. G. Lima, and C. A. S. Oliveira. Asynchronous organizations for solving the point-to-point connection problem. In Proceedings of International Conference on Multi-Agent Systems (ICMAS-98), IEEE Computer Society, Paris, France, 1998.
[12] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Press Syndicate of the University of Cambridge, New York, NY.

PAGE 75

[13] E. Halperin, S. Halperin, T. Hartman, and R. Shamir. Handling long targets and errors in sequencing by hybridization. Journal of Computational Biology, 10(3), 2003.
[14] G. Held, G. Grinstein, and Y. Tu. Relationship between gene expression and observed intensities in DNA microarrays - a modeling study. Nucleic Acids Research, 34(9), 2006.
[15] R. Herwig, A. Schmitt, M. Steinfath, J. O'Brien, H. Seidel, S. Meier-Ewert, H. Lehrach, and U. Radelof. Information theoretical probe selection for hybridisation experiments. Bioinformatics, 16:890-898, 2000.
[16] Y. Huang, C. Chang, C. Chan, T. Yeh, Y. Chang, C. Chen, and C. Kao. Integrated minimum-set primers and unique probe design algorithms for differential detection on symptom-related pathogens. Bioinformatics, 21:4330-4337, 2005.
[17] N. C. Jones and P. A. Pevzner. An Introduction to Bioinformatics Algorithms. MIT Press, Boston, MA, 2004.
[18] G. W. Klau, S. Rahmann, A. Schliep, M. Vingron, and K. Reinert. Integer linear programming approaches for non-unique probe selection. To appear in Discrete Applied Mathematics, 2006.
[19] G. W. Klau, S. Rahmann, A. Schliep, M. Vingron, and K. Reinert. Optimal robust non-unique probe selection using integer linear programming. Bioinformatics, 20:i186-i193, 2004.
[20] S. Knudsen. Guide to Analysis of DNA Microarray Data, 2nd Ed. John Wiley & Sons, Ltd., New York, NY, 2004.
[21] R. Koehler and N. Peyret. Effects of DNA secondary structure on oligonucleotide probe binding efficiency. Computational Biology and Chemistry, 29:393-397, 2005.
[22] J. K. Lanctot, M. Li, B. Ma, S. Wang, and L. Zhang. Distinguishing string selection problems. In Proceedings of the tenth annual ACM-SIAM symposium on Discrete algorithms, pages 633-642, Baltimore, Maryland, 1999. Society for Industrial and Applied Mathematics, Philadelphia, PA.
[23] F. Li and G. Stormo. Selection of optimal DNA oligos for gene expression arrays. Bioinformatics, 17:1067-1076, 2001.
[24] M. Marmur and P. Doty. Thermal renaturation of deoxyribonucleic acids. Journal of Molecular Biology, 3:585-594, 1961.
[25] C. N. Meneses, P. M. Pardalos, and M. A. Ragle. A new approach to the non-unique probe selection problem. To appear in Annals of Biomedical Engineering, 2007.
[26] C. N. Meneses, P. M. Pardalos, and M. A. Ragle. A survey of computational methods for probe design and selection. In Gino Lim, editor, to appear in Optimization in Medicine and Biology. Taylor and Francis, 2007.

PAGE 76

[27] O. Milenkovic and N. Kashyap. DNA codes that avoid secondary structure. In Proceedings of the 2005 IEEE International Symposium on Information Theory, Adelaide, Australia, 2005.
[28] R. Murray, D. Granner, P. Mayes, and V. Rodwell. Harper's Illustrated Biochemistry, Twenty-Sixth Edition. McGraw-Hill Companies, 2003.
[29] G. L. Nemhauser and L. A. Wolsey. Integer and Combinatorial Optimization. Wiley Interscience Series in Discrete Mathematics and Optimization. Wiley and Sons, New York, NY, 1988.
[30] E. Nordberg. YODA: selecting signature oligonucleotides. Bioinformatics, 21:1365-1370, 2005.
[31] S. Park and K. Miller. Random number generators: Good ones are hard to find. Communications of the ACM, 31:1192-1201, 1988.
[32] A. Pozhitkov, P. Noble, T. Domazet-Loso, A. Nolte, R. Sonnenberg, P. Staehler, M. Beier, and D. Tautz. Tests of rRNA hybridization to microarrays suggest that hybridization characteristics of oligonucleotide probes for species discrimination cannot be predicted. Nucleic Acids Research, 34(9), 2006.
[33] J. B. Rampal, editor. DNA Arrays: Methods and Protocols. Humana Press Inc., Totowa, NJ, 2001.
[34] S. Rash and D. Gusfield. String barcoding: Uncovering optimal virus signatures. In G. Meyers, S. Hannenballi, S. Istrail, P. Perzner, and M. Waterman, editors, Proceedings of the Sixth Annual International Conference on Computational Biology, pages 254-261, Washington DC, United States, April 2002.
[35] S. B. Roth, J. Jalava, O. Ruuskanen, A. Ruohola, and S. Nikkari. Use of an oligonucleotide array for laboratory diagnosis of bacteria responsible for acute upper respiratory infections. Journal of Clinical Microbiology, pages 4268-4274, 2004.
[36] S. Rozen and H. Skaletsky. Primer3 on the WWW for general users and for biologist programmers. In S. Krawetz and S. Misener, editors, Bioinformatics Methods and Protocols: Methods in Molecular Biology, pages 365-386. Humana Press, Totowa, NJ, 2000.
[37] J. SantaLucia. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. In Proceedings of the National Academy of Science, volume 95, pages 1460-1465, 1998.
[38] J. SantaLucia, H. Allawi, and P. Seneviratne. Improved nearest-neighbor parameters for predicting DNA duplex stability. Biochemistry, 35:3555-3562, 1996.

PAGE 77

[39] A. Schliep, D. C. Torney, and S. Rahmann. Group testing with DNA chips: generating designs and decoding experiments. In G. Meyers, S. Hannenballi, S. Istrail, P. Perzner, and M. Waterman, editors, Proceedings of the 2nd IEEE Computer Society Bioinformatics Conference (CSB 2003), pages 84-93, Stanford, CA, August 2003.
[40] D. Stekel. Microarray Bioinformatics. Cambridge University Press, Cambridge, UK, 2003.
[41] W. Sung and W. Lee. Fast and accurate probe selection algorithm for large genomes. In Proceedings of the Computational Systems Bioinformatics, Stanford, CA, 2003. IEEE Computer Society.
[42] N. Templeton, editor. Gene and Cell Therapy: Therapeutic Mechanisms and Strategies, 2nd ed. Marcel Dekker, Inc., 2004.
[43] S. Tomiuk and K. Hofmann. Microarray probe selection strategies. Briefings in Bioinformatics, 2(4):329-340, 2001.
[44] L. Valinsky, G. D. Vedova, A. J. Scupham, S. Alvey, A. Figueroa, B. Yin, R. J. Hartin, M. Chrobak, D. E. Crowley, T. Jiang, and J. Borneman. Analysis of bacterial community composition by oligonucleotide fingerprinting of rRNA genes. Applied and Environmental Microbiology, pages 3243-3250, 2002.
[45] X. Wang and B. Seed. Selection of oligonucleotide probes for protein coding sequences. Bioinformatics, 19(7), 2003.

PAGE 78

Michelle A. Ragle was born in Holyoke, Massachusetts. She received a B.A. in mathematics in 1993 and an M.A. in mathematics in 1995 from the University of West Florida. She worked as a software engineer from 1995 until 1998, when she obtained her current full-time faculty position in the Department of Mathematics at Okaloosa-Walton College in Niceville, Florida. She has been a Ph.D. student at the University of Florida in the Department of Industrial and Systems Engineering since May of 2004. She received her M.S. in industrial and systems engineering at the University of Florida in 2005.