From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science

Norm Matloff
University of California, Davis

$f_X(t) = c e^{-0.5 (t-\mu)' \Sigma^{-1} (t-\mu)}$

library(MASS)
x <- mvrnorm(mu,sgm)

See Creative Commons license at http://heather.cs.ucdavis.edu/~matloff/probstatbook.html

Contents

1 Discrete Probability Models
  1.1 ALOHA Network Example
  1.2 Basic Ideas of Probability
    1.2.1 The Crucial Notion of a Repeatable Experiment
    1.2.2 Our Definitions
    1.2.3 Basic Probability Computations: ALOHA Network Example
    1.2.4 Bayes' Theorem
    1.2.5 ALOHA in the Notebook Context
    1.2.6 Simulation
      1.2.6.1 Simulation of the ALOHA Example
      1.2.6.2 Rolling Dice
    1.2.7 Combinatorics-Based Probability Computation
      1.2.7.1 Which Is More Likely in Five Cards, One King or Two Hearts?
      1.2.7.2 Association Rules in Data Mining
  1.3 Discrete Random Variables
  1.4 Independence, Expected Value and Variance
    1.4.1 Independent Random Variables
    1.4.2 Expected Value
      1.4.2.1 Intuitive Definition
      1.4.2.2 Computation and Properties of Expected Value
      1.4.2.3 Casinos, Insurance Companies and Sum Users, Compared to Others
    1.4.3 Variance
    1.4.4 Is a Variance of X Large or Small?
    1.4.5 Chebychev's Inequality
    1.4.6 The Coefficient of Variation
    1.4.7 Covariance
    1.4.8 A Combinatorial Example
    1.4.9 Expected Value, Etc. in the ALOHA Example
    1.4.10 Reconciliation of Math and Intuition (optional section)
  1.5 Distributions
    1.5.1 Basic Notions
    1.5.2 Parametric Families of pmfs
      1.5.2.1 The Geometric Family of Distributions
      1.5.2.2 The Binomial Family of Distributions
      1.5.2.3 The Poisson Family of Distributions
      1.5.2.4 The Negative Binomial Family of Distributions
      1.5.2.5 The Power Law Family of Distributions
  1.6 Recognizing Distributions When You See Them
    1.6.1 A Coin Game
    1.6.2 Tossing a Set of Four Coins
    1.6.3 The ALOHA Example Again
  1.7 A Cautionary Tale
    1.7.1 Trick Coins, Tricky Example
    1.7.2 Intuition in Retrospect
    1.7.3 Implications for Modeling
  1.8 Why Not Just Do All Analysis by Simulation?
  1.9 Tips on Finding Probabilities, Expected Values and So On
2 Continuous Probability Models
  2.1 A Random Dart
  2.2 Density Functions
    2.2.1 Motivation, Definition and Interpretation
    2.2.2 Use of Densities to Find Probabilities and Expected Values
  2.3 Famous Parametric Families of Continuous Distributions
    2.3.1 The Uniform Distributions
      2.3.1.1 Density and Properties
      2.3.1.2 Example: Modeling of Disk Performance
      2.3.1.3 Example: Modeling of Denial-of-Service Attack
    2.3.2 The Normal (Gaussian) Family of Continuous Distributions
      2.3.2.1 Density and Properties
      2.3.2.2 Example: Network Intrusion
      2.3.2.3 The Central Limit Theorem
      2.3.2.4 Example: Coin Tosses
      2.3.2.5 Museum Demonstration
      2.3.2.6 Optional topic: Formal Statement of the CLT
      2.3.2.7 Importance in Modeling
    2.3.3 The Chi-Square Family of Distributions
      2.3.3.1 Density and Properties
      2.3.3.2 Importance in Modeling
    2.3.4 The Exponential Family of Distributions
      2.3.4.1 Density and Properties
      2.3.4.2 Connection to the Poisson Distribution Family
      2.3.4.3 Importance in Modeling
    2.3.5 The Gamma Family of Distributions
      2.3.5.1 Density and Properties
      2.3.5.2 Example: Network Buffer
      2.3.5.3 Importance in Modeling
  2.4 Describing Failure
    2.4.1 Memoryless Property
    2.4.2 Hazard Functions
      2.4.2.1 Basic Concepts
    2.4.3 Example: Software Reliability Models
  2.5 A Cautionary Tale: the Bus Paradox
  2.6 Choosing a Model
  2.7 A General Method for Simulating a Random Variable
3 Multivariate Probability Models
  3.1 Multivariate Distributions
    3.1.1 Why Are They Needed?
    3.1.2 Discrete Case
    3.1.3 Multivariate Densities
      3.1.3.1 Motivation and Definition
      3.1.3.2 Use of Multivariate Densities in Finding Probabilities and Expected Values
      3.1.3.3 Example: a Triangular Distribution
  3.2 More on Co-variation of Random Variables
    3.2.1 Covariance
    3.2.2 Correlation
    3.2.3 Example: Continuation of Section 3.1.3.3
    3.2.4 Example: a Catchup Game
  3.3 Sets of Independent Random Variables
    3.3.1 Properties
      3.3.1.1 Probability Mass Functions and Densities Factor
      3.3.1.2 Expected Values Factor
      3.3.1.3 Covariance Is 0
      3.3.1.4 Variances Add
      3.3.1.5 Convolution
    3.3.2 Examples
      3.3.2.1 Example: Dice
      3.3.2.2 Example: Ethernet
      3.3.2.3 Example: Analysis of Seek Time
      3.3.2.4 Example: Backup Battery
  3.4 Matrix Formulations
    3.4.1 Properties of Mean Vectors
    3.4.2 Properties of Covariance Matrices
  3.5 Conditional Distributions
    3.5.1 Conditional Pmfs and Densities
    3.5.2 Conditional Expectation
    3.5.3 The Law of Total Expectation (advanced topic)
      3.5.3.1 Expected Value As a Random Variable
      3.5.3.2 The Famous Formula (Theorem of Total Expectation)
    3.5.4 What About the Variance?
    3.5.5 Example: Trapped Miner
    3.5.6 Example: Analysis of Hash Tables
  3.6 Parametric Families of Distributions
    3.6.1 The Multinomial Family of Distributions
      3.6.1.1 Probability Mass Function
      3.6.1.2 Means and Covariances
      3.6.1.3 Application: Text Mining
    3.6.2 The Multivariate Normal Family of Distributions
      3.6.2.1 Densities and Properties
      3.6.2.2 The Multivariate Central Limit Theorem
      3.6.2.3 Example: Dice Game
      3.6.2.4 Application: Data Mining
  3.7 Simulation of Random Vectors
  3.8 Transform Methods (advanced topic)
      3.8.0.5 Generating Functions
      3.8.0.6 Moment Generating Functions
    3.8.1 Example: Network Packets
      3.8.1.1 Poisson Generating Function
      3.8.1.2 Sums of Independent Poisson Random Variables Are Poisson Distributed
      3.8.1.3 Random Number of Bits in Packets on One Link (advanced topic)
    3.8.2 Other Uses of Transforms
  3.9 Vector Space Interpretations (for the mathematically adventurous only)
    3.9.1 Properties of Correlation
    3.9.2 Conditional Expectation As a Projection
  3.10 Proof of the Law of Total Expectation
4 Introduction to Statistical Inference
  4.1 What Statistics Is All About
  4.2 Introduction to Confidence Intervals
    4.2.1 How Long Should We Run a Simulation?
    4.2.2 Confidence Intervals for Means
      4.2.2.1 Sampling Distributions
      4.2.2.2 Our First Confidence Interval
    4.2.3 Meaning of Confidence Intervals
      4.2.3.1 A Weight Survey in Davis
      4.2.3.2 Back to Our Bus Simulation
      4.2.3.3 One More Point About Interpretation
    4.2.4 Sampling With and Without Replacement
    4.2.5 Other Confidence Levels
    4.2.6 The Standard Error of the Estimate
    4.2.7 Why Not Divide by n-1? The Notion of Bias
    4.2.8 And What About the Student-t Distribution?
    4.2.9 Confidence Intervals for Proportions
      4.2.9.1 Derivation
      4.2.9.2 Examples
      4.2.9.3 Interpretation
      4.2.9.4 Non-Effect of the Population Size
      4.2.9.5 Planning Ahead
    4.2.10 One-Sided Confidence Intervals
    4.2.11 Confidence Intervals for Differences of Means or Proportions
      4.2.11.1 Independent Samples
      4.2.11.2 Random Sample Size
      4.2.11.3 Dependent Samples
    4.2.12 Example: Machine Classification of Forest Covers
    4.2.13 Exact Confidence Intervals
    4.2.14 Slutsky's Theorem (advanced topic)
      4.2.14.1 The Theorem
      4.2.14.2 Why It's Valid to Substitute s for σ
      4.2.14.3 Example: Confidence Interval for a Ratio Estimator
    4.2.15 The Delta Method: Confidence Intervals for General Functions of Means or Proportions (advanced topic)
      4.2.15.1 The Theorem
      4.2.15.2 Example: Square Root Transformation
      4.2.15.3 Example: Confidence Interval for σ^2
    4.2.16 Simultaneous Confidence Intervals
      4.2.16.1 The Bonferroni Method
      4.2.16.2 Scheffe's Method (advanced topic)
      4.2.16.3 Example
      4.2.16.4 Other Methods for Simultaneous Inference
    4.2.17 The Bootstrap Method for Forming Confidence Intervals (advanced topic)
  4.3 Hypothesis Testing
    4.3.1 The Basics
    4.3.2 General Testing Based on Normally Distributed Estimators
    4.3.3 Example: Network Security
    4.3.4 The Notion of p-Values
    4.3.5 What's Random and What Is Not
    4.3.6 One-Sided H_A
    4.3.7 Exact Tests
    4.3.8 What's Wrong with Hypothesis Testing
    4.3.9 What to Do Instead
    4.3.10 Decide on the Basis of the Preponderance of Evidence
  4.4 General Methods of Estimation
    4.4.1 Example: Guessing the Number of Raffle Tickets Sold
    4.4.2 Method of Moments
    4.4.3 Method of Maximum Likelihood
    4.4.4 Example: Estimating the Parameters of a Gamma Distribution
      4.4.4.1 Method of Moments
      4.4.4.2 MLEs
    4.4.5 More Examples
    4.4.6 What About Confidence Intervals?
    4.4.7 Bayesian Methods (advanced topic)
    4.4.8 The Empirical cdf
  4.5 Real Populations and Conceptual Populations
  4.6 Nonparametric Density Estimation
    4.6.1 Basic Ideas
    4.6.2 Histograms
    4.6.3 Kernel-Based Density Estimation (advanced topic)
    4.6.4 Proper Use of Density Estimates
5 Introduction to Model Building
  5.1 Bias Vs. Variance
  5.2 Desperate for Data
    5.2.1 Mathematical Formulation of the Problem
    5.2.2 Bias and Variance of the Two Predictors
    5.2.3 Implications
  5.3 Assessing Goodness of Fit of a Model
    5.3.1 The Chi-Square Goodness of Fit Test
    5.3.2 Kolmogorov-Smirnov Confidence Bands
  5.4 Bias Vs. Variance Again
  5.5 Robustness
6 Statistical Relations Between Variables
  6.1 The Goals: Prediction and Understanding
  6.2 Example Applications: Software Engineering, Networks, Text Mining
  6.3 Regression Analysis
    6.3.1 What Does Relationship Really Mean?
    6.3.2 Multiple Regression: More Than One Predictor Variable
    6.3.3 Interaction Terms
    6.3.4 Nonrandom Predictor Variables
    6.3.5 Prediction
    6.3.6 Optimality of the Regression Function
    6.3.7 Parametric Estimation of Linear Regression Functions
      6.3.7.1 Meaning of Linear
      6.3.7.2 Point Estimates and Matrix Formulation
      6.3.7.3 Back to Our ALOHA Example
      6.3.7.4 Approximate Confidence Intervals
      6.3.7.5 Once Again, Our ALOHA Example
      6.3.7.6 Estimation Vs. Prediction
      6.3.7.7 Exact Confidence Intervals
    6.3.8 The Famous Error Term (advanced topic)
    6.3.9 Model Selection
      6.3.9.1 The Overfitting Problem in Regression
      6.3.9.2 Methods for Predictor Variable Selection
    6.3.10 Nonlinear Parametric Regression Models
    6.3.11 Nonparametric Estimation of Regression Functions
    6.3.12 Regression Diagnostics
    6.3.13 Nominal Variables
    6.3.14 The Case in Which All Predictors Are Nominal Variables: Analysis of Variance
      6.3.14.1 It's a Regression!
      6.3.14.2 Interaction Terms
      6.3.14.3 Now Consider Parsimony
      6.3.14.4 Reparameterization
  6.4 The Classification Problem
    6.4.1 Meaning of the Regression Function
      6.4.1.1 The Mean Here Is a Probability
      6.4.1.2 Optimality of the Regression Function
    6.4.2 Parametric Models for the Regression Function in Classification Problems
      6.4.2.1 The Logistic Model: Form
      6.4.2.2 The Logistic Model: Intuitive Motivation
      6.4.2.3 The Logistic Model: Theoretical Foundation
    6.4.3 Nonparametric Estimation of Regression Functions for Classification (advanced topic)
      6.4.3.1 Use the Kernel Method, CART, Etc.
      6.4.3.2 SVMs
    6.4.4 Variable Selection in Classification Problems
      6.4.4.1 Problems Inherited from the Regression Context
      6.4.4.2 Example: Forest Cover Data
    6.4.5 Y Must Have a Marginal Distribution!
  6.5 Principal Components Analysis
    6.5.1 Dimension Reduction and the Principle of Parsimony
    6.5.2 How to Calculate Them
    6.5.3 Example: Forest Cover Data
  6.6 Log-Linear Models
    6.6.1 The Setting
    6.6.2 The Data
    6.6.3 The Models
    6.6.4 Parameter Estimation
    6.6.5 The Goal: Parsimony Again
  6.7 Simpson's Non-Paradox
7 Markov Chains
  7.1 Discrete-Time Markov Chains
    7.1.1 Example: Finite Random Walk
    7.1.2 Long-Run Distribution
      7.1.2.1 Periodic Chains
      7.1.2.2 The Meaning of the Term Stationary Distribution
    7.1.3 Example: Stuck-At 0 Fault
      7.1.3.1 Description
      7.1.3.2 Initial Analysis
      7.1.3.3 Going Beyond Finding π
    7.1.4 Example: Shared-Memory Multiprocessor
      7.1.4.1 The Model
      7.1.4.2 Going Beyond Finding π
    7.1.5 Example: Slotted ALOHA
      7.1.5.1 Going Beyond Finding π
  7.2 Hidden Markov Models
  7.3 Continuous-Time Markov Chains
    7.3.1 Holding-Time Distribution
    7.3.2 The Notion of Rates
    7.3.3 Stationary Distribution
    7.3.4 Minima of Independent Exponentially Distributed Random Variables
    7.3.5 Example: Machine Repair
    7.3.6 Continuous-Time Birth/Death Processes
    7.3.7 Example: Computer Worm
  7.4 Hitting Times Etc.
    7.4.1 Some Mathematical Conditions
    7.4.2 Example: Random Walks
    7.4.3 Finding Hitting and Recurrence Times
    7.4.4 Example: Finite Random Walk
    7.4.5 Example: Tree-Searching
8 Introduction to Queuing Models
  8.1 Introduction
  8.2 M/M/1
    8.2.1 Steady-State Probabilities
    8.2.2 Mean Queue Length
    8.2.3 Distribution of Residence Time/Little's Rule
  8.3 Multi-Server Models
  8.4 Loss Models
  8.5 Nonexponential Service Times
  8.6 Reversed Markov Chains
    8.6.1 Markov Property
    8.6.2 Long-Run State Proportions
    8.6.3 Form of the Transition Rates of the Reversed Chain
    8.6.4 Reversible Markov Chains
      8.6.4.1 Conditions for Checking Reversibility
      8.6.4.2 Making New Reversible Chains from Old Ones
      8.6.4.3 Example: Queues with a Common Waiting Area
      8.6.4.4 Closed-Form Expression for π for Any Reversible Markov Chain
  8.7 Networks of Queues
    8.7.1 Tandem Queues
    8.7.2 Jackson Networks
      8.7.2.1 Open Networks
    8.7.3 Closed Networks
9 Renewal Theory and Some Applications
  9.1 Introduction
    9.1.1 The Light Bulb Example, Generalized
    9.1.2 Duality Between Lifetime Domain and Counts Domain
  9.2 Where We Are Going
  9.3 Properties of Poisson Processes
    9.3.1 Definition
    9.3.2 Alternate Characterizations of Poisson Processes
      9.3.2.1 Exponential Interrenewal Times
      9.3.2.2 Stationary, Independent Increments
    9.3.3 Conditional Distribution of Renewal Times
    9.3.4 Decomposition and Superposition of Poisson Processes
    9.3.5 Nonhomogeneous Poisson Processes
      9.3.5.1 Example: Software Reliability
  9.4 Properties of General Renewal Processes
    9.4.1 The Regenerative Nature of Renewal Processes
    9.4.2 Some of the Main Theorems
      9.4.2.1 The Functions F_n Sum to m
      9.4.2.2 The Renewal Equation
      9.4.2.3 The Function m(t) Uniquely Determines F(t)
      9.4.2.4 Asymptotic Behavior of m(t)
  9.5 Alternating Renewal Processes
    9.5.1 Definition and Main Result
    9.5.2 Example: Inventory Problem (difficult)
  9.6 Residual-Life Distribution
    9.6.1 Residual-Life Distribution
    9.6.2 Age Distribution
    9.6.3 Mean of the Residual and Age Distributions
    9.6.4 Example: Estimating Web Page Modification Rates
    9.6.5 Example: The (S,s) Inventory Model Again
    9.6.6 Example: Disk File Model
    9.6.7 Example: Event Sets in Discrete Event Simulation (difficult)
    9.6.8 Example: Memory Paging Model

Preface

Why is this book different from all other books on probability and statistics?
First, the book stresses computer science applications. Though other books of this nature have been published, notably the outstanding text by K.S. Trivedi, this book has much more coverage of statistics, including a full chapter titled "Statistical Relations Between Variables." This should prove especially helpful as machine learning and data mining play a greater role in computer science.

Second, there is a strong emphasis on modeling: What do probabilistic models really mean, in real-life terms? How does one choose a model? How do we assess the practical usefulness of models? This aspect is so important that there is a separate chapter for this as well, titled "Introduction to Model Building." Throughout the text, there is considerable discussion of the intuition involving probabilistic concepts. For instance, when probability density functions are introduced, there is an extended discussion regarding the intuitive meaning of densities and their relation to the inherently-discrete nature of real data due to the finite precision of measurement. However, all models and so on are described precisely in terms of random variables and distributions.

Finally, the R statistical/data manipulation language is used throughout. Again, several excellent texts on probability and statistics have been written that feature R, but this book, by virtue of having a computer science audience, uses R in a more sophisticated manner. It is recommended that my online tutorial on R programming, R for Programmers (http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf), be used as a supplement.

As prerequisites, the student must know calculus and basic matrix algebra, and have skill in programming. As with any text in probability and statistics, it is also extremely helpful if the student has a good sense of math intuition, and does not treat mathematics as simply memorization of formulas.

A note regarding the chapters on statistics: It is crucial that students apply the concepts in thought-provoking exercises on real data. Nowadays there are many good sources for real data sets available. Here are a few to get you started:

* UC Irvine Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html
* UCLA Statistics Dept. data sets, http://www.stat.ucla.edu/data/
* Dr. B's Wide World of Web Data, http://research.ed.asu.edu/multimedia/DrB/Default.htm
* StatSci.org, at http://www.statsci.org/datasets.html
* University of Edinburgh School of Informatics, http://www.inf.ed.ac.uk/teaching/courses/dme/html/datasets0405.html

Note that R has the capability of reading files on the Web, e.g.

    > z <- read.table("http://heather.cs.ucdavis.edu/~matloff/z")

This work is licensed under a Creative Commons Attribution-No Derivative Works 3.0 United States License. The details may be viewed at http://creativecommons.org/licenses/by-nd/3.0/us/, but in essence it states that you are free to use, copy and distribute the work, but you must attribute the work to me and not "alter, transform, or build upon" it. If you are using the book, either in teaching a class or for your own learning, I would appreciate your informing me. I retain copyright in all non-U.S. jurisdictions, but permission to use these materials in teaching is still granted, provided the licensing information here is displayed.

Chapter 1

Discrete Probability Models

1.1 ALOHA Network Example

Throughout this book, we will be discussing both "classical" probability examples involving coins, cards and dice, and also examples involving applications to computer science. The latter will involve diverse fields such as data mining, machine learning, computer networks, software engineering and bioinformatics.

In this section, an example from computer networks is presented which will be used at a number of points in this chapter. Probability analysis is used extensively in the development of new, faster types of networks.
Today's Ethernet evolved from an experimental network developed at the University of Hawaii, called ALOHA. A number of network nodes would occasionally try to use the same radio channel to communicate with a central computer. The nodes couldn't hear each other, due to the obstruction of mountains between them. If only one of them made an attempt to send, it would be successful, and it would receive an acknowledgement message in response from the central computer. But if more than one node were to transmit, a collision would occur, garbling all the messages. The sending nodes would time out after waiting for an acknowledgement which never came, and try sending again later. To avoid having too many collisions, nodes would engage in random backoff, meaning that they would refrain from sending for a while even though they had something to send.

One variation is slotted ALOHA, which divides time into intervals which I will call "epochs." Each epoch will have duration 1.0, so epoch 1 extends from time 0.0 to 1.0, epoch 2 extends from 1.0 to 2.0 and so on. In the version we will consider here, in each epoch, if a node is active, i.e. has a message to send, it will either send or refrain from sending, with probability p and 1-p. The value of p is set by the designer of the network. (Real Ethernet hardware does something like this, using a random number generator inside the chip.)

The other parameter q in our model is the probability that a node which had been inactive generates a message during an epoch, and thus becomes "active." Think of what happens when you are at a computer. You are not typing constantly, and when you are not typing, the time until you hit a key again will be random. Our parameter q models that randomness.

Let n be the number of nodes, which we'll assume for simplicity is two. Assume also for simplicity that the timing is as follows. Arrival of a new message happens in the middle of an epoch, and the decision as to whether to send versus back off is made near the end of an epoch, say 90% into the epoch.
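The timing rules just described can be sketched as a small R function that carries the network through one epoch. This is only an illustration under the stated model: the function name `aloha_epoch` and its argument names are hypothetical, not taken from the text, and a node is represented simply by a TRUE/FALSE "active" flag.

```r
# One epoch of the slotted ALOHA model (hypothetical helper, for illustration).
# active: logical vector, TRUE if the node has a message at the start of the epoch
# p: probability that an active node sends
# q: probability that an inactive node generates a message
aloha_epoch <- function(active, p, q) {
  # Middle of the epoch: each inactive node becomes active with probability q.
  # Nodes that are already active do not generate additional messages.
  newmsg <- runif(length(active)) < q
  active <- active | newmsg
  # Near the end of the epoch: each active node sends with probability p.
  sends <- active & (runif(length(active)) < p)
  # A transmission succeeds only if exactly one node sends;
  # the successful sender then becomes inactive.
  if (sum(sends) == 1) active[sends] <- FALSE
  active
}

# Example: node A starts the epoch active, node B does not.
aloha_epoch(c(A = TRUE, B = FALSE), p = 0.4, q = 0.8)
```

Because message arrival is placed before the send decision, a node that becomes active in mid-epoch may already send at the end of that same epoch, which matches the timing assumed above.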
For example, say that at the beginning of the epoch which extends from time 15.0 to 16.0, node A has something to send but node B does not. At time 15.5, node B will either generate a message to send or not, with probability q and 1-q, respectively. Suppose B does generate a new message. At time 15.9, node A will either try to send or refrain, with probability p and 1-p, and node B will do the same. Suppose A refrains but B sends. Then B's transmission will be successful, and at the start of the next epoch (time 16.0) B will be inactive, while node A will still be active. On the other hand, suppose both A and B try to send at time 15.9; both will fail, and thus both will be active at time 16.0, and so on.

Be sure to keep in mind that in our simple model here, during the time a node is active, it won't generate any additional new messages.

Let's observe the network for two epochs, epoch 1 and epoch 2. Assume that the network consists of just two nodes, called node 1 and node 2, both of which start out active. Let $X_1$ and $X_2$ denote the numbers of active nodes at the very end of epochs 1 and 2, after possible transmissions. We'll take p to be 0.4 and q to be 0.8 in this example.

Let's find $P(X_1 = 2)$, the probability that $X_1 = 2$, and then get to the main point, which is to ask what we really mean by this probability.
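Before working the answer out by hand, it may help to see the quantity as a long-run frequency. The R sketch below repeats epoch 1 many times under the setup just described and reports the fraction of repetitions in which $X_1 = 2$. The variable names, the seed and the number of repetitions are arbitrary choices for illustration. Note that q plays no role here, since both nodes start out active and active nodes generate no new messages.

```r
# Estimate P(X1 = 2) when both nodes start epoch 1 active.
set.seed(1)        # arbitrary seed, only for reproducibility
p <- 0.4           # probability that an active node sends
nreps <- 100000    # number of repetitions of the experiment

# Number of nodes (out of the 2 active ones) that try to send in epoch 1
nsend <- rbinom(nreps, size = 2, prob = p)

# Exactly one sender: it succeeds and becomes inactive, leaving 1 active node.
# Zero senders, or two senders (a collision): both nodes remain active.
x1 <- ifelse(nsend == 1, 1, 2)

est <- mean(x1 == 2)   # fraction of repetitions in which X1 = 2
est
```

The estimate should land close to $p^2 + (1-p)^2$, which is 0.52 for p = 0.4.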
How could $X_1 = 2$ occur? There are two possibilities:

* both nodes try to send; this has probability $p^2$
* neither node tries to send; this has probability $(1-p)^2$

Thus

$$P(X_1 = 2) = p^2 + (1-p)^2 = 0.52 \tag{1.1}$$

1.2 Basic Ideas of Probability

1.2.1 The Crucial Notion of a Repeatable Experiment

It's crucial to understand what that 0.52 figure really means in a practical sense. To this end, let's put the ALOHA example aside for a moment, and consider the "experiment" consisting of rolling two dice, say a blue one and a yellow one. Let X and Y denote the number of dots we get on the blue and yellow dice, respectively, and consider the meaning of $P(X + Y = 6) = \frac{5}{36}$.

In the mathematical theory of probability, we talk of a sample space, which consists of the possible outcomes (X,Y), seen in Table 1.1. In a theoretical treatment, we place weights of 1/36 on each of the points in the space, reflecting the fact that each of the 36 points is equally likely, and then say, "What we mean by $P(X + Y = 6) = \frac{5}{36}$ is that the outcomes (1,5), (2,4), (3,3), (4,2), (5,1) have total weight 5/36." Though the notion of sample space is presented in every probability textbook, and is central to the advanced theory of probability, most probability computations do not rely on explicitly writing down a sample space. In this particular example it is useful for us as a vehicle for explaining the concepts, but we will NOT use it much.

    1,1  1,2  1,3  1,4  1,5  1,6
    2,1  2,2  2,3  2,4  2,5  2,6
    3,1  3,2  3,3  3,4  3,5  3,6
    4,1  4,2  4,3  4,4  4,5  4,6
    5,1  5,2  5,3  5,4  5,5  5,6
    6,1  6,2  6,3  6,4  6,5  6,6

Table 1.1: Sample Space for the Dice Example

    notebook line   outcome             blue+yellow = 6?
    1               blue 2, yellow 6    No
    2               blue 3, yellow 1    No
    3               blue 1, yellow 1    No
    4               blue 4, yellow 2    Yes
    5               blue 1, yellow 1    No
    6               blue 3, yellow 4    No
    7               blue 5, yellow 1    Yes
    8               blue 3, yellow 6    No
    9               blue 2, yellow 5    No

Table 1.2: Notebook for the Dice Problem

But the intuitive notion (which is FAR more important) of what $P(X + Y = 6) = \frac{5}{36}$ means is the following. Imagine doing the experiment many, many times, recording the results in a large notebook:

* Roll the dice the first time, and write the outcome on the first line of the notebook.
• Roll the dice the second time, and write the outcome on the second line of the notebook.
• Roll the dice the third time, and write the outcome on the third line of the notebook.
• Roll the dice the fourth time, and write the outcome on the fourth line of the notebook.
• Imagine you keep doing this, thousands of times, filling thousands of lines in the notebook.

The first 9 lines of the notebook might look like Table 1.2. Here 2/9 of these lines say Yes. But after many, many repetitions, approximately 5/36 of the lines will say Yes. For example, after doing the experiment 720 times, approximately $\frac{5}{36} \times 720 = 100$ lines will say Yes.

This is what probability really is: In what fraction of the lines does the event of interest happen? It sounds simple, but if you always think about this "lines in the notebook" idea, probability problems are a lot easier to solve. And it is the fundamental basis of computer simulation.

1.2.2 Our Definitions

These definitions are intuitive, rather than rigorous math, but intuition is what we need. Keep in mind that we are making definitions below, not listing properties.

• We assume an "experiment" which is (at least in concept) repeatable. The experiment of rolling two dice is repeatable, and even the ALOHA experiment is so. (We simply watch the network for a long time, collecting data on pairs of consecutive epochs in which there are two active stations at the beginning.) On the other hand, the econometricians, in forecasting 2009, cannot "repeat" 2008. Yet all of the econometricians' tools assume that events in 2008 were affected by various sorts of randomness, and we think of repeating the experiment in a conceptual sense.
• We imagine performing the experiment a large number of times, recording the result of each repetition on a separate line in a notebook.
• We say A is an event for this experiment if it is a possible boolean (i.e. yes-or-no) outcome of the experiment. In the above example, here are some events:
  * X+Y = 6
  * X = 1
  * Y = 3
  * X-Y = 4
• A random variable is a numerical outcome of the experiment, such as X and Y here, as well as X+Y, 2XY and even sin(XY).
• For any event of interest A, imagine a column on A in the notebook. The $k^{th}$ line in the notebook, k = 1,2,3,..., will say Yes or No, depending on whether A occurred or not during the $k^{th}$ repetition of the experiment. For instance, we have such a column in our table above, for the event A = {blue+yellow = 6}.
• For any event of interest A, we define P(A) to be the long-run proportion of lines with Yes entries.
• For any events A, B, imagine a new column in our notebook, labeled "A and B." In each line, this column will say Yes if and only if there are Yes entries for both A and B. P(A and B) is then the long-run proportion of lines with Yes entries in the new column labeled "A and B." [1]
• For any events A, B, imagine a new column in our notebook, labeled "A or B." In each line, this column will say Yes if and only if at least one of the entries for A and B says Yes. [2]
• For any events A, B, imagine a new column in our notebook, labeled "A | B" and pronounced "A given B." In each line:
  * This new column will say "NA" ("not applicable") if the B entry is No.
  * If it is a line in which the B column says Yes, then this new column will say Yes or No, depending on whether the A column says Yes or No.

[1] In most textbooks, what we call "A and B" here is written $A \cap B$, indicating the intersection of two sets in the sample space. But again, we do not take a sample space point of view here.
[2] In the sample space approach, this is written $A \cup B$.

Think of probabilities in this "notebook" context:

• P(A) means the long-run proportion of lines in the notebook in which the A column says Yes.
• P(A or B) means the long-run proportion of lines in the notebook in which the A-or-B column says Yes.
• P(A and B) means the long-run proportion of lines in the notebook in which the A-and-B column says Yes.
• P(A | B) means the long-run proportion of lines in the notebook in which the A | B column says Yes, among the lines which do NOT say NA.

A hugely common mistake is to confuse P(A and B) and P(A | B). This is where the notebook view becomes so important. Compare the quantities $P(X = 1 \text{ and } S = 6) = \frac{1}{36}$ and $P(X = 1 \mid S = 6) = \frac{1}{5}$, where S = X+Y: [3]

• After a large number of repetitions of the experiment, approximately 1/36 of the lines of the notebook will have the property that both X = 1 and S = 6 (since X = 1 and S = 6 is equivalent to X = 1 and Y = 5).
• After a large number of repetitions of the experiment, if we look only at the lines in which S = 6, then among those lines, approximately 1/5 of those lines will show X = 1.

[3] Think of adding an S column to the notebook too.

The quantity P(A | B) is called the conditional probability of A, given B.

Note that "and" has higher logical precedence than "or". For example, P(A and B or C) means P[(A and B) or C]. Also, "not" has higher precedence than "and".

Here are some more very important definitions and properties:

• Suppose A and B are events such that it is impossible for them to occur in the same line of the notebook. They are said to be disjoint events. Then

$$P(A \text{ or } B) = P(A) + P(B) \tag{1.2}$$

  Again, this terminology "disjoint" stems from the set-theoretic sample space approach, where it means that $A \cap B = \emptyset$. That mathematical terminology works fine for our dice example, but in my experience people have major difficulty applying it correctly in more complicated problems. This is another illustration of why I put so much emphasis on the "notebook" framework.

• If A and B are not disjoint, then

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \tag{1.3}$$

  In the disjoint case, that subtracted term is 0, so (1.3) reduces to (1.2).

• Events A and B are said to be stochastically independent, usually just stated as independent, [4] if

$$P(A \text{ and } B) = P(A) \cdot P(B) \tag{1.4}$$

  In calculating an "and" probability, how does one know whether the events are independent? The answer is that this will typically be clear from the problem. If we toss the blue and yellow dice, for instance, it is clear that one die has no impact on the other, so events involving the blue die are independent of events involving the yellow die. On the other hand, in the ALOHA example, it's clear that events involving $X_1$ are NOT independent of those involving $X_2$.

• If A and B are not independent, the equation (1.4) generalizes to

$$P(A \text{ and } B) = P(A) P(B \mid A) \tag{1.5}$$

  Note that if A and B actually are independent, then $P(B \mid A) = P(B)$, and (1.5) reduces to (1.4).
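The notebook definitions above can be tried out directly in R. The following is only a sketch: the number of notebook lines (`nreps`), the seed, and all variable names are choices made for this illustration. Each element of the vectors plays the role of one line of the notebook, and each logical vector is one Yes/No column.

```r
# Simulate the "notebook" for the two-dice experiment.
# nreps (number of notebook lines) and the seed are arbitrary choices.
set.seed(1)
nreps <- 100000
x <- sample(1:6, nreps, replace = TRUE)   # blue die, one entry per line
y <- sample(1:6, nreps, replace = TRUE)   # yellow die
s <- x + y                                # the S column

a <- (x == 1)   # Yes/No column for the event X = 1
b <- (s == 6)   # Yes/No column for the event S = 6

p_a_and_b   <- mean(a & b)   # proportion of ALL lines with Yes in both columns
p_a_given_b <- mean(a[b])    # proportion of Yes for A among lines where B is Yes
p_a_or_b    <- mean(a | b)   # proportion of lines with at least one Yes

cat("P(X = 1 and S = 6):", p_a_and_b, "   (exact value 1/36)\n")
cat("P(X = 1 | S = 6):  ", p_a_given_b, "   (exact value 1/5)\n")
# (1.3): P(A or B) = P(A) + P(B) - P(A and B)
cat("P(A or B):", p_a_or_b, "  via (1.3):", mean(a) + mean(b) - p_a_and_b, "\n")
# (1.5): P(A and B) = P(A) P(B | A)
cat("P(A) P(B | A):", mean(a) * mean(b[a]), "\n")
```

Note that the two proportions differ only in which lines are counted in the denominator: all lines for P(A and B), but only the lines with a Yes for B in the case of P(A | B).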
[4] The term stochastic is just a fancy synonym for random.

1.2.3 Basic Probability Computations: ALOHA Network Example

Please keep in mind that the notebook idea is simply a vehicle to help you understand what the concepts really mean. This is crucial for your intuition and your ability to apply this material in the real world. But the notebook idea is NOT for the purpose of calculating probabilities. Instead, we use the properties of probability, as seen in the following.

Let's look at all of this in the ALOHA context. In Equation (1.1) we found that

$$P(X_1 = 2) = p^2 + (1-p)^2 = 0.52 \tag{1.6}$$

How did we get this? Let $C_i$ denote the event that node i tries to send, i = 1,2. Then using the definitions above, our steps would be

$$P(X_1 = 2) = P(C_1 \text{ and } C_2 \text{ or not } C_1 \text{ and not } C_2) \tag{1.7}$$

$$= P(C_1 \text{ and } C_2) + P(\text{not } C_1 \text{ and not } C_2) \quad \text{(from (1.2))} \tag{1.8}$$

$$= P(C_1) P(C_2) + P(\text{not } C_1) P(\text{not } C_2) \quad \text{(from (1.4))} \tag{1.9}$$

$$= p^2 + (1-p)^2 \tag{1.10}$$

Here are the reasons for these steps:

• (1.7): We listed the ways in which the event $\{X_1 = 2\}$ could occur.
• (1.8): Write G = $C_1$ and $C_2$, H = $D_1$ and $D_2$, where $D_i$ = not $C_i$, i = 1,2. Then the events G and H are clearly disjoint; if in a given line of our notebook there is a Yes for G, then definitely there will be a No for H, and vice versa.
• (1.9): The two nodes act physically independently of each other. Thus the events $C_1$ and $C_2$ are stochastically independent, so we applied (1.4). Then we did the same for $D_1$ and $D_2$.

Note carefully that in Equation (1.7), our first step was to "break big events down into small events," in this case breaking the event $\{X_1 = 2\}$ down into the events $C_1$ and $C_2$ and $D_1$ and $D_2$. This is a central part of most probability computations. In calculating a probability, ask yourself, "How can it happen?"

Good tip: When you solve problems like this, write out the "and" and "or" conjunctions like I've done above. This helps!
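As a small check on steps (1.7) through (1.10), one can list the four combinations of "tries to send" for the two nodes and add up the ones that give $X_1 = 2$. This is just an illustrative sketch; the data frame `outcomes` and the variable names are inventions for this example, not part of the model.

```r
p <- 0.4   # probability that an active node tries to send

# The four possible combinations of (C1 occurs?, C2 occurs?)
outcomes <- expand.grid(c1 = c(TRUE, FALSE), c2 = c(TRUE, FALSE))

# Independence (1.4): the probability of each combination is a product
outcomes$prob <- ifelse(outcomes$c1, p, 1 - p) * ifelse(outcomes$c2, p, 1 - p)

# X1 = 2 occurs if both nodes try to send (collision) or neither does
x1_is_2 <- (outcomes$c1 == outcomes$c2)

# Disjoint events (1.2): add the probabilities of the qualifying combinations
p_x1_eq_2 <- sum(outcomes$prob[x1_is_2])

print(p_x1_eq_2)          # agrees with p^2 + (1 - p)^2
print(p^2 + (1 - p)^2)
```

Enumerating outcomes like this is exactly "How can it happen?" done mechanically; it only works here because there are so few combinations.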
Now, what about $P(X_2 = 2)$? Again, we break big events down into small events, in this case according to the value of $X_1$:

$$P(X_2 = 2) = P(X_1 = 0 \text{ and } X_2 = 2 \text{ or } X_1 = 1 \text{ and } X_2 = 2 \text{ or } X_1 = 2 \text{ and } X_2 = 2)$$

$$= P(X_1 = 0 \text{ and } X_2 = 2) + P(X_1 = 1 \text{ and } X_2 = 2) + P(X_1 = 2 \text{ and } X_2 = 2) \tag{1.11}$$

Since $X_1$ cannot be 0, that first term, $P(X_1 = 0 \text{ and } X_2 = 2)$, is 0. To deal with the second term, $P(X_1 = 1 \text{ and } X_2 = 2)$, we'll use (1.5). Due to the time-sequential nature of our experiment here, it is natural (but certainly not "mandated," as we'll often see situations to the contrary) to take A and B to be $\{X_1 = 1\}$ and $\{X_2 = 2\}$, respectively. So, we write

$$P(X_1 = 1 \text{ and } X_2 = 2) = P(X_1 = 1) P(X_2 = 2 \mid X_1 = 1) \tag{1.12}$$

To calculate $P(X_1 = 1)$, we use the same kind of reasoning as in Equation (1.1). For the event in question to occur, either node A would send and B wouldn't, or A would refrain from sending and B would send. Thus

$$P(X_1 = 1) = 2p(1-p) = 0.48 \tag{1.13}$$

Now we need to find $P(X_2 = 2 \mid X_1 = 1)$. This again involves breaking big events down into small ones. If $X_1 = 1$, then $X_2 = 2$ can occur only if both of the following occur:

• Event A: Whichever node was the one to successfully transmit during epoch 1 (and we are given that there indeed was one, since $X_1 = 1$) now generates a new message.
• Event B: During epoch 2, no successful transmission occurs, i.e. either they both try to send or neither tries to send.
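The decomposition in (1.11) and (1.12) turns into a few lines of arithmetic. The sketch below assumes, as the model's separate random decisions suggest, that Event A (a new message is generated, probability q) and Event B (no successful transmission in epoch 2) are independent; the variable names are made up for this illustration.

```r
p <- 0.4   # probability that an active node tries to send
q <- 0.8   # probability that an inactive node generates a new message

# With two active nodes, no transmission succeeds if both try or neither tries
p_no_success <- p^2 + (1 - p)^2

p_x1_eq_1 <- 2 * p * (1 - p)   # (1.13)
p_x1_eq_2 <- p_no_success      # (1.1)

# Given X1 = 1: need Event A (prob. q) and Event B (no success in epoch 2)
p_x2_eq_2_given_x1_eq_1 <- q * p_no_success
# Given X1 = 2: both nodes are still active, so only "no success" is needed
p_x2_eq_2_given_x1_eq_2 <- p_no_success

# (1.11), with (1.5) applied to each of the two nonzero terms
p_x2_eq_2 <- p_x1_eq_1 * p_x2_eq_2_given_x1_eq_1 +
             p_x1_eq_2 * p_x2_eq_2_given_x1_eq_2

cat("P(X1 = 1 and X2 = 2):", p_x1_eq_1 * p_x2_eq_2_given_x1_eq_1, "\n")
cat("P(X2 = 2):", p_x2_eq_2, "\n")
```

Changing `p` and `q` at the top shows how sensitive $P(X_2 = 2)$ is to the two model parameters.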
Recalling the definitions of p and q in Section 1.1, we have that

$$P(X_2 = 2 \mid X_1 = 1) = q[p^2 + (1-p)^2] = 0.42 \tag{1.14}$$

Thus $P(X_1 = 1 \text{ and } X_2 = 2) = 0.48 \times 0.42 = 0.20$.

We go through a similar analysis for $P(X_1 = 2 \text{ and } X_2 = 2)$: We recall that $P(X_1 = 2) = 0.52$ from before, and find that $P(X_2 = 2 \mid X_1 = 2) = 0.52$ as well. So we find $P(X_1 = 2 \text{ and } X_2 = 2)$ to be $0.52^2 = 0.27$. Putting all this together, we find that $P(X_2 = 2) = 0.47$.

Let's do one more; let's find $P(X_1 = 1 \mid X_2 = 2)$. [Pause a minute here to make sure you understand that this is quite different from $P(X_2 = 2 \mid X_1 = 1)$.] From (1.5), we know that

$$P(X_1 = 1 \mid X_2 = 2) = \frac{P(X_1 = 1 \text{ and } X_2 = 2)}{P(X_2 = 2)} \tag{1.15}$$

We computed both numerator and denominator here before, in Equations (1.12) and (1.11), so we see that $P(X_1 = 1 \mid X_2 = 2) = 0.20/0.47 = 0.43$.

1.2.4 Bayes' Theorem

Following (1.15) above, we noted that the ingredients had already been computed, in (1.12) and (1.11). If we go back to the derivations in those two equations and substitute in (1.15), we have

$$P(X_1 = 1 \mid X_2 = 2) = \frac{P(X_1 = 1 \text{ and } X_2 = 2)}{P(X_2 = 2)} \tag{1.16}$$

$$= \frac{P(X_1 = 1 \text{ and } X_2 = 2)}{P(X_1 = 1 \text{ and } X_2 = 2) + P(X_1 = 2 \text{ and } X_2 = 2)} \tag{1.17}$$

$$= \frac{P(X_1 = 1) P(X_2 = 2 \mid X_1 = 1)}{P(X_1 = 1) P(X_2 = 2 \mid X_1 = 1) + P(X_1 = 2) P(X_2 = 2 \mid X_1 = 2)} \tag{1.18}$$

Looking at this in more generality, for events A and B we would find that

$$P(A \mid B) = \frac{P(A) P(B \mid A)}{P(A) P(B \mid A) + P(\text{not } A) P(B \mid \text{not } A)} \tag{1.19}$$

This is known as Bayes' Theorem or Bayes' Rule.

    notebook line   X1 = 2   X2 = 2   X1 = 2 and X2 = 2   X2 = 2 | X1 = 2
    1               Yes      No       No                  No
    2               No       No       No                  NA
    3               Yes      Yes      Yes                 Yes
    4               Yes      No       No                  No
    5               Yes      Yes      Yes                 Yes
    6               No       No       No                  NA
    7               No       Yes      No                  NA

Table 1.3: Top of Notebook for Two-Epoch ALOHA Experiment

1.2.5 ALOHA in the Notebook Context

Think of doing the ALOHA "experiment" many, many times.

• Run the network for two epochs, starting with both nodes active, the first time, and write the outcome on the first line of the notebook.
• Run the network for two epochs, starting with both nodes active, the second time, and write the outcome on the second line of the notebook.
• Run the network for two epochs, starting with both nodes active, the third time, and write the outcome on the third line of the notebook.
• Run the network for two epochs, starting with both nodes active, the fourth time, and write the outcome on the fourth line of the notebook.
• Imagine you keep doing this, thousands of times, filling thousands of lines in the notebook.

The first seven lines of the notebook might look like Table 1.3. We see that:

• Among those first seven lines in the notebook, 4/7 of them have $X_1 = 2$. After many, many lines, this proportion will be approximately 0.52.
• Among those first seven lines in the notebook, 3/7 of them have $X_2 = 2$. After many, many lines, this proportion will be approximately 0.47. [5]
• Among those first seven lines in the notebook, 2/7 of them have $X_1 = 2$ and $X_2 = 2$. After many, many lines, this proportion will be approximately 0.27.
• Among the first seven lines in the notebook, four of them do not say NA in the $X_2 = 2 \mid X_1 = 2$ column. Among these four lines, two say Yes, a proportion of 2/4. After many, many lines, this proportion will be approximately 0.52.

[5] Don't make anything of the fact that these probabilities nearly add up to 1.

1.2.6 Simulation

To simulate whether a simple event occurs or not, we typically use the R function runif(). This function generates random numbers from the interval (0,1), with all the points inside being equally likely. So for instance the probability that the function returns a value in (0,0.5) is 0.5. Thus here is code to simulate tossing a coin:

    if (runif(1) < 0.5) heads <- TRUE else heads <- FALSE

The argument 1 means we wish to generate just one random number from the interval (0,1).

1.2.6.1 Simulation of the ALOHA Example

Following is a computation via simulation of the approximate values of $P(X_1 = 2)$, $P(X_2 = 2)$ and $P(X_2 = 2 \mid X_1 = 1)$, using the R statistical language, the language of choice of professional statisticians. It is open source, it's statistically correct (not all statistical packages are so), has dazzling graphics capabilities, etc.

To learn about the syntax (e.g. <- as the assignment operator), see my introduction to R for programmers at http://heather.cs.ucdavis.edu/~matloff/R/RProg.pdf.

    # finds P(X1 = 2), P(X2 = 2) and P(X2 = 2 | X1 = 1) in ALOHA example
    sim <- function(p,q,nreps) {
       countx2eq2 <- 0
       countx1eq1 <- 0
       countx1eq2 <- 0
       countx2eq2givx1eq1 <- 0
       # simulate nreps repetitions of the experiment
       for (i in 1:nreps) {
          numsend <- 0  # no messages sent so far
          # simulate A and B's decision on whether to send in epoch 1
          for (i in 1:2)
             if (runif ...
          # ... (the middle of this listing is missing here)
       cat("P(X2 = 2):",countx2eq2/nreps,"\n")
       cat("P(X2 = 2 | X1 = 1):",countx2eq2givx1eq1/countx1eq1,"\n")
    }

Note that each of the nreps iterations of the main for loop is analogous to one line in our hypothetical notebook. So, to find the approximate value of $P(X_1 = 2)$, divide the count of the number of times $X_1 = 2$ occurred by the number of iterations.

Note especially that the way we calculated $P(X_2 = 2 \mid X_1 = 1)$ was to count the number of times $X_2 = 2$, among those times that $X_1 = 1$, just like in the notebook case.

Remember, simulation results are only approximate. The larger the value we use for nreps, the more accurate our simulation results are likely to be. The question of how large we need to make nreps will be addressed in a later chapter.

1.2.6.2 Rolling Dice

If we roll three dice, what is the probability that their total is 8? We could count all the possibilities, or we could get an approximate answer via simulation:

    # roll d dice; find P(total = k)

    # simulate roll of one die; the possible return values are 1,2,3,4,5,6,
    # all equally likely
    roll <- function() return(sample(1:6,1))

    probtotk <- function(d,k,nreps) {
       count <- 0
       # do the experiment nreps times
       for (rep in 1:nreps) {
          sum <- 0
          # roll d dice and find their sum
          for (j in 1:d) sum <- sum + roll()
          if (sum == k) count <- count + 1
       }
       return(count/nreps)
    }

The call to the built-in R function sample() here says to take a sample of size 1 from the sequence of numbers 1,2,3,4,5,6. That's just what we want to simulate the rolling of a die. The code

    for (j in 1:d) sum <- sum + roll()

then simulates the tossing of a die d times, and computing the sum.
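For readers who want something runnable for the two-epoch ALOHA experiment of Section 1.2.6.1, here is one possible sketch built only from the model description in Section 1.1. It is not the listing above: the function name `aloha_sim`, the variable names, and the choice to return the estimates as a named vector are all assumptions made for this illustration.

```r
# Sketch of a two-epoch ALOHA simulation, built from the model description.
# Returns estimates of P(X1 = 2), P(X2 = 2) and P(X2 = 2 | X1 = 1).
aloha_sim <- function(p, q, nreps) {
  count_x1_eq_1 <- 0
  count_x1_eq_2 <- 0
  count_x2_eq_2 <- 0
  count_x2_eq_2_giv_x1_eq_1 <- 0
  for (r in 1:nreps) {   # each iteration is one line of the notebook
    # epoch 1: both nodes start active; each tries to send with probability p
    numsend <- sum(runif(2) < p)
    # exactly one sender means a success, which leaves one active node
    x1 <- if (numsend == 1) 1 else 2
    # middle of epoch 2: an inactive node generates a message with probability q
    numactive <- x1
    if (x1 == 1 && runif(1) < q) numactive <- 2
    # end of epoch 2: each active node tries to send with probability p
    numsend <- sum(runif(numactive) < p)
    x2 <- if (numsend == 1) numactive - 1 else numactive
    # tallies
    if (x1 == 2) count_x1_eq_2 <- count_x1_eq_2 + 1
    if (x2 == 2) count_x2_eq_2 <- count_x2_eq_2 + 1
    if (x1 == 1) {
      count_x1_eq_1 <- count_x1_eq_1 + 1
      if (x2 == 2) count_x2_eq_2_giv_x1_eq_1 <- count_x2_eq_2_giv_x1_eq_1 + 1
    }
  }
  c(p_x1_eq_2 = count_x1_eq_2 / nreps,
    p_x2_eq_2 = count_x2_eq_2 / nreps,
    p_x2_eq_2_giv_x1_eq_1 = count_x2_eq_2_giv_x1_eq_1 / count_x1_eq_1)
}

set.seed(2)                          # arbitrary seed, for reproducibility only
est <- aloha_sim(0.4, 0.8, 100000)
print(est)
```

With p = 0.4 and q = 0.8 the three estimates should land near the values computed by hand in Section 1.2.3. As in the notebook, the conditional probability is estimated by dividing by the number of lines with $X_1 = 1$, not by `nreps`.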
Since applications of R often use large amounts of computer time, good R programmers are always looking for ways to speed things up. Here is an alternate version of the above program:

    # roll d dice; find P(total = k)

    probtotk <- function(d,k,nreps) {
       count <- 0
       # do the experiment nreps times
       for (rep in 1:nreps) {
          total <- sum(sample(1:6,d,replace=TRUE))
          if (total == k) count <- count + 1
       }
       return(count/nreps)
    }

Here the code

    sample(1:6,d,replace=TRUE)

simulates tossing the die d times (the argument replace says this is sampling with replacement, so for instance we could get two 6s). That returns a d-element array, and we then call R's built-in function sum() to find the total of the d dice.

The second version of the code here is more compact and easier to read. It also eliminates one explicit loop, which is the key to writing fast code in R.

1.2.7 Combinatorics-Based Probability Computation

In some probability problems all the outcomes are equally likely. The probability computation is then simply a matter of counting all the outcomes of interest and dividing by the total number of possible outcomes. Of course, sometimes even such counting can be challenging, but it is simple in principle. We'll discuss two examples here.

1.2.7.1 Which Is More Likely in Five Cards, One King or Two Hearts?

Suppose we deal a 5-card hand from a regular 52-card deck. Which is larger, P(1 king) or P(2 hearts)? Before continuing, take a moment to guess which one is more likely.

Now, here is how we can compute the probabilities. There are $\binom{52}{5}$ possible hands, so this is our denominator.
For P(1 king), our numerator will be the number of hands consisting of one king and four non-kings. Since there are four kings in the deck, the number of ways to choose one king is $\binom{4}{1} = 4$. There are 48 non-kings in the deck, so there are $\binom{48}{4}$ ways to choose them. Every choice of one king can be combined with every choice of four non-kings, so the number of hands consisting of one king and four non-kings is $4 \cdot \binom{48}{4}$. Thus

$$P(1 \text{ king}) = \frac{4 \cdot \binom{48}{4}}{\binom{52}{5}} = 0.299 \tag{1.20}$$

The same reasoning gives us

$$P(2 \text{ hearts}) = \frac{\binom{13}{2} \cdot \binom{39}{3}}{\binom{52}{5}} = 0.274 \tag{1.21}$$

So, the 1-king hand is just slightly more likely.

By the way, I used the R function choose() to evaluate these quantities, running R in interactive mode, e.g.:

    > choose(13,2) * choose(39,3) / choose(52,5)
    [1] 0.2742797

R also has a very nice function combn() which will generate all the $\binom{n}{k}$ combinations of k things chosen from n, and also at your option call a user-specified function on each combination. This allows you to save a lot of computational work. See the examples in R's online documentation.

Here's how we could do the 1-king problem via simulation:

    # use simulation to find P(1 king) when deal a 5-card hand from a
    # standard deck

    # think of the 52 cards as being labeled 1-52, with the 4 kings having
    # numbers 1-4

    sim <- function(nreps) {
       count1king <- 0  # count of number of hands with 1 king
       for (rep in 1:nreps) {
          hand <- sample(1:52,5,replace=FALSE)  # deal hand
          kings <- intersect(1:4,hand)  # find which kings, if any, are in hand
          if (length(kings) == 1) count1king <- count1king + 1
       }
       print(count1king/nreps)
    }

1.2.7.2 Association Rules in Data Mining

The field of data mining is a branch of computer science, but it is largely an application of various statistical methods to really huge databases.
One of the applications of data mining is called the market basket problem. Here the data consists of records of sales transactions, say of books at Amazon.com. The business' goal is exemplified by Amazon's suggestion to customers that "Patrons who bought this book also tended to buy the following books." [6] The goal of the market basket problem is to sift through sales transaction records to produce association rules, patterns in which sales of some combinations of books imply likely sales of other related books.

The notation for association rules is $A, B \Rightarrow C, D, E$, meaning in the book sales example that customers who bought books A and B also tended to buy books C, D and E. Here A and B are called the antecedents of the rule, and C, D and E are called the consequents. Let's suppose here that we are only interested in rules with a single consequent.

We will present some methods for finding good rules in another chapter, but for now, let's look at how many possible rules there are. Obviously, it would be impractical to use rules with a large number of antecedents. [7] Suppose the business has a total of 20 products available for sale. What percentage of potential rules have three or fewer antecedents? [8]

[6] Some customers appreciate such tips, while others view it as insulting or an invasion of privacy, but we'll not address such issues here.
[7] In addition, there are serious statistical problems that would arise, to be discussed in another chapter.
[8] Be sure to note that this is also a probability, namely the probability that a randomly chosen rule will have three or fewer antecedents.
For each k = 1,...,19, there are $\binom{20}{k}$ possible sets of antecedents, and for each such set there are $\binom{20-k}{1}$ ways to choose the single consequent from the remaining products, giving $\binom{20}{k} \binom{20-k}{1}$ possible rules with k antecedents. The fraction of potential rules using three or fewer antecedents is then

$$\frac{\sum_{k=1}^{3} \binom{20}{k} \cdot \binom{20-k}{1}}{\sum_{k=1}^{19} \binom{20}{k} \cdot \binom{20-k}{1}} = \frac{23180}{10485740} = 0.0022 \tag{1.22}$$

So, this is just scratching the surface. And note that with only 20 products, there are already over ten million possible rules. With 50 products, this number is $2.81 \times 10^{16}$! Imagine what happens in a case like Amazon, with millions of products. These staggering numbers show what a tremendous challenge data miners face.

1.3 Discrete Random Variables

In our dice example, the random variable X could take on six values in the set {1,2,3,4,5,6}. This is a finite set.

In the ALOHA example, $X_1$ and $X_2$ each take on values in the set {0,1,2}, again a finite set. [9]

Now think of another experiment, in which we toss a coin until we get heads. Let N be the number of tosses needed. Then N can take on values in the set {1,2,3,...}. This is a countably infinite set.

Now think of one more experiment, in which we throw a dart at the interval (0,1), and assume that the place that is hit, R, can take on any of the values between 0 and 1. This is an uncountably infinite set.

We say that X, $X_1$, $X_2$ and N are discrete random variables, while R is continuous. We'll discuss continuous random variables in a later chapter.

1.4 Independence, Expected Value and Variance

The concepts and properties introduced in this section form the very core of probability and statistics. Except for some specific calculations, these apply to both discrete and continuous random variables.

1.4.1 Independent Random Variables

We already have a definition for the independence of events; what about independence of random variables? Random variables X and Y are said to be independent if for any sets I and J, the events {X is in I} and {Y is in J} are independent, i.e. P(X is in I and Y is in J) = P(X is in I) P(Y is in J).
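The definition of independent random variables can be explored with the simulated dice notebook. In the sketch below, the sets {1,2} and {2,4,6}, the seed, `nreps`, and all variable names are arbitrary choices for illustration. The blue and yellow dice should come out (approximately) independent, while X and S = X+Y should not.

```r
set.seed(3)                                # arbitrary seed
nreps <- 100000
x <- sample(1:6, nreps, replace = TRUE)    # blue die
y <- sample(1:6, nreps, replace = TRUE)    # yellow die

set_i <- c(1, 2)       # an arbitrary set I of values for X
set_j <- c(2, 4, 6)    # an arbitrary set J of values for Y
in_i <- x %in% set_i
in_j <- y %in% set_j

# Independence: P(X in I and Y in J) should match P(X in I) P(Y in J)
lhs <- mean(in_i & in_j)
rhs <- mean(in_i) * mean(in_j)
cat("X, Y:  joint", lhs, "  product", rhs, "\n")

# Contrast: X and S = X + Y are NOT independent.
# If X is 1 or 2, then S >= 10 is impossible, so the joint proportion is 0.
s <- x + y
lhs_dep <- mean(in_i & (s >= 10))
rhs_dep <- mean(in_i) * mean(s >= 10)
cat("X, S:  joint", lhs_dep, "  product", rhs_dep, "\n")
```

A single pair of sets cannot prove independence, since the definition requires the product rule for all sets I and J, but one failing pair is enough to show dependence.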
[9] We could even say that $X_1$ takes on only values in the set {1,2}, but if we were to look at many epochs rather than just two, it would be easier not to make an exceptional case.

1.4.2 Expected Value

1.4.2.1 Intuitive Definition

Consider a repeatable experiment with random variable X. We say that the expected value of X is the long-run average value of X, as we repeat the experiment indefinitely.

In our notebook, there will be a column for X. Let $X_i$ denote the value of X in the $i^{th}$ row of the notebook. Then the long-run average of X is

$$\lim_{n \to \infty} \frac{X_1 + \ldots + X_n}{n} \tag{1.23}$$

Suppose for instance our experiment is to toss 10 coins. Let X denote the number of heads we get out of 10. We might get four heads in the first repetition of the experiment, i.e. $X_1 = 4$, seven heads in the second repetition, so $X_2 = 7$, and so on. Intuitively, the long-run average value of X will be 5. (This will be proven below.) Thus we say that the expected value of X is 5, and write E(X) = 5.

1.4.2.2 Computation and Properties of Expected Value

Continuing the coin toss example above, let $K_{in}$ be the number of times the value i occurs among $X_1, \ldots, X_n$, i = 0,...,10, n = 1,2,3,... For instance, $K_{4,20}$ is the number of times we get four heads, in the first 20 repetitions of our experiment. Then

$$E(X) = \lim_{n \to \infty} \frac{X_1 + \ldots + X_n}{n} \tag{1.24}$$

$$= \lim_{n \to \infty} \frac{0 \cdot K_{0n} + 1 \cdot K_{1n} + 2 \cdot K_{2n} + \ldots + 10 \cdot K_{10,n}}{n} \tag{1.25}$$

$$= \sum_{i=0}^{10} i \cdot \lim_{n \to \infty} \frac{K_{in}}{n} \tag{1.26}$$

But $\lim_{n \to \infty} \frac{K_{in}}{n}$ is the long-run proportion of the time that X = i. In other words, it's P(X = i)! So,

$$E(X) = \sum_{i=0}^{10} i \cdot P(X = i) \tag{1.27}$$

So in general, the expected value of a discrete random variable X which takes values in the set A is

$$E(X) = \sum_{c \in A} c \, P(X = c) \tag{1.28}$$

Note that (1.28) is the formula we'll use. The preceding equations were derivation, to motivate the formula. Note too that (1.28) is not the definition of expected value; that was in (1.23). It is quite important to distinguish between all of these, in terms of goals.
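The step from (1.24) to (1.26) can be watched in action on a finite notebook. The following sketch simulates the 10-coin experiment; `nreps`, the seed, and the variable names are choices made for this illustration. It computes the plain average of the X column and then the same quantity regrouped by value, with the counts playing the role of $K_{in}$.

```r
set.seed(4)       # arbitrary seed
nreps <- 50000    # number of notebook lines (a finite stand-in for n -> infinity)

# Each row is one repetition: 10 coin tosses, coded 1 = heads, 0 = tails
tosses <- matrix(sample(0:1, 10 * nreps, replace = TRUE), nrow = nreps)
x <- rowSums(tosses)                  # the X column: number of heads

# (1.24): average of the X column
long_run_avg <- mean(x)

# (1.25)-(1.26): regroup the same sum by value, using the counts K_in
k <- table(factor(x, levels = 0:10))  # k[i+1] = number of lines with X = i
weighted_avg <- sum((0:10) * as.numeric(k) / nreps)

cat("average of X column:     ", long_run_avg, "\n")
cat("sum over i of i * K_in/n:", weighted_avg, "\n")
```

The two numbers agree exactly (up to floating-point rounding), since they are the same sum arranged two ways, and both should be close to 5.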
It will be shown in Section 1.5.2.2 that in our example above in which X is the number of heads we get in 10 tosses of a coin,

$$P(X = i) = \binom{10}{i} 0.5^i (1 - 0.5)^{10-i} \tag{1.29}$$

So

$$E(X) = \sum_{i=0}^{10} i \binom{10}{i} 0.5^i (1 - 0.5)^{10-i} \tag{1.30}$$

It turns out that E(X) = 5.

For X in our dice example,

$$E(X) = \sum_{c=1}^{6} c \cdot \frac{1}{6} = 3.5 \tag{1.31}$$

It is customary to use capital letters for random variables, e.g. X here, and lower-case letters for values taken on by a random variable, e.g. c here. Please adhere to this convention.

By the way, it is also customary to write EX instead of E(X), whenever removal of the parentheses does not cause any ambiguity. An example in which it would produce ambiguity is $E(U^2)$. The expression $EU^2$ might be taken to mean either $E(U^2)$, which is what we want, or $(EU)^2$, which is not what we want.

For S = X+Y in the dice example,

$$E(S) = 2 \cdot \frac{1}{36} + 3 \cdot \frac{2}{36} + 4 \cdot \frac{3}{36} + \ldots + 12 \cdot \frac{1}{36} = 7 \tag{1.32}$$

In the case of N, tossing a coin until we get a head:

$$E(N) = \sum_{c=1}^{\infty} c \cdot \frac{1}{2^c} = 2 \tag{1.33}$$

(We will not go into the details here concerning how the sum of this particular infinite series is computed.)

Some people like to think of E(X) using a center of gravity analogy. Forget that analogy! Think notebook! Intuitively, E(X) is the long-run average value of X among all the lines of the notebook. So for instance in our dice example, E(X) = 3.5, where X was the number of dots on the blue die, means that if we do the experiment thousands of times, with thousands of lines in our notebook, the average value of X in those lines will be about 3.5. With S = X+Y, E(S) = 7. This means that the long-run average in column S in Table 1.4 is 7.

Of course, by symmetry, E(Y) will be 3.5 too, where Y is the number of dots showing on the yellow die. That means we wasted our time calculating in Equation (1.32); we should have realized beforehand that E(S) is $2 \times 3.5 = 7$.

    notebook line   outcome             blue+yellow = 6?   S
    1               blue 2, yellow 6    No                 8
    2               blue 3, yellow 1    No                 4
    3               blue 1, yellow 1    No                 2
    4               blue 4, yellow 2    Yes                6
    5               blue 1, yellow 1    No                 2
    6               blue 3, yellow 4    No                 7
    7               blue 5, yellow 1    Yes                6
    8               blue 3, yellow 6    No                 9
    9               blue 2, yellow 5    No                 7

Table 1.4: Expanded Notebook for the Dice Problem

In other words, for any random variables U and V, the expected value of a new random variable D = U+V is the sum of the expected values of U and V:

$$E(U + V) = E(U) + E(V) \tag{1.34}$$

Note carefully that U and V do NOT need to be independent random variables for this relation to hold. You should convince yourself of this fact intuitively by thinking about the notebook notion. Say we look at 10000 lines of the notebook, which has columns for the values of U, V and U+V. It makes no difference whether we average U+V in that column, or average U and V in their columns and then add; either way, we'll get the same result.

While you are at it, convince yourself that

$$E(aU + b) = aE(U) + b \tag{1.35}$$

for any constants a and b. For instance, say U is temperature in Celsius. Then the temperature in Fahrenheit is $W = \frac{9}{5} U + 32$. So, W is a new random variable, and we can get its expected value from that of U by using (1.35) with $a = \frac{9}{5}$ and b = 32.

But if U and V are independent, then

$$E(UV) = EU \cdot EV \tag{1.36}$$

In the dice example, for instance, let D denote the product of the numbers of blue dots and yellow dots, i.e. D = XY. Then

$$E(D) = 3.5^2 = 12.25 \tag{1.37}$$

Consider a function g() of one variable, and let W = g(X). W is then a random variable too. Say X takes on values in A, as in (1.28). Then W takes on values in $B = \{g(c) : c \in A\}$. Define

$$A_d = \{c : c \in A, g(c) = d\} \tag{1.38}$$

Then

$$P(W = d) = P(X \in A_d) \tag{1.39}$$

so

$$E(W) = \sum_{d \in B} d \, P(W = d) \tag{1.40}$$

$$= \sum_{d \in B} d \sum_{c \in A_d} P(X = c) \tag{1.41}$$

$$= \sum_{c \in A} g(c) P(X = c) \tag{1.42}$$

The properties of expected value discussed above are key to the entire remainder of this book. You should notice immediately when you are in a setting in which they are applicable. For instance, if you see the expected value of the sum of two random variables, you should instinctively think of (1.34) right away.
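Properties (1.34) through (1.36) and formula (1.42) can be verified exactly for the dice example by weighting each of the 36 outcomes by 1/36. This is an illustrative sketch: the helper `E()` and the data frame `dice` are names invented here, and the constants 9/5 and 32 are simply borrowed from the temperature example to have concrete values of a and b.

```r
# All 36 equally likely outcomes (blue = x, yellow = y), each with weight 1/36
dice <- expand.grid(x = 1:6, y = 1:6)
prob <- rep(1/36, nrow(dice))

# Expected value of any quantity computed from the outcomes
E <- function(v) sum(v * prob)

ex <- E(dice$x)
ey <- E(dice$y)
e_sum  <- E(dice$x + dice$y)       # compare with ex + ey, as in (1.34)
e_lin  <- E(9/5 * dice$x + 32)     # compare with 9/5 * ex + 32, as in (1.35)
e_prod <- E(dice$x * dice$y)       # compare with ex * ey, as in (1.36)
e_sq   <- E(dice$x^2)              # (1.42) with g(c) = c^2

cat("E(X+Y):    ", e_sum,  "  EX + EY:      ", ex + ey, "\n")
cat("E(aX+b):   ", e_lin,  "  a EX + b:     ", 9/5 * ex + 32, "\n")
cat("E(XY):     ", e_prod, "  EX * EY:      ", ex * ey, "\n")
cat("E(X^2):    ", e_sq,   "  (EX)^2:       ", ex^2, "\n")
```

The last line is a reminder that $E(X^2)$ and $(EX)^2$ are different quantities; (1.36) does not apply with U = V = X, because X is not independent of itself.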
1.4.2.3 Casinos, Insurance Companies and "Sum Users," Compared to Others

The expected value is intended as a measure of central tendency, i.e. as some sort of definition of the probabilistic "middle" in the range of a random variable. It plays an absolutely central role in probability and statistics. Yet one should understand its limitations.

First, note that the term expected value itself is a misnomer. For instance, if X is the number of dots on one roll of a die and $W = X^2$, then EW = 91/6 (see (1.47)), yet we do not "expect" W to be 91/6; in fact, it is impossible for W to take on that value.

Second, the expected value is what we call the mean in everyday life. And the mean is terribly overused. Consider, for example, an attempt to describe how wealthy (or not) people are in the city of Davis. If suddenly Bill Gates were to move into town, that would skew the value of the mean beyond recognition. Even without Gates, there is a question as to whether the mean has that much meaning.

More subtly than that, there is the basic question of what the mean means. What, for example, does Equation (1.23) mean in the context of people's incomes in Davis? We would sample a person at random and record his/her income as $X_1$. Then we'd sample another person, to get $X_2$, and so on. Fine, but in that context, what would (1.23) mean? The answer is, not much.

For a casino, though, (1.23) means plenty. Say X is the amount a gambler wins on a play of a roulette wheel, and suppose (1.23) is equal to $1.88. Then after, say, 1000 plays of the wheel (not necessarily by the same gambler), the casino knows it will have paid out a total of about $1,880. So if the casino charges, say, $1.95 per play, it will have made a profit of about $70 over those 1000 plays. It might be a bit more or less than that amount, but the casino can be pretty sure that it will be around $70, and they can plan their business accordingly.

The same principle holds for insurance companies, concerning how much they pay out in claims. With a large number of customers, they know ("expect"!) approximately how much they will pay out, and thus can set their premiums accordingly.
The key point in the casino and insurance companies examples is that they are interested in totals, e.g. total payouts on a blackjack table over a month's time, or total insurance claims paid in a year. Another example might be the number of defectives in a batch of computer chips; the manufacturer is interested in the total number of defective chips produced, say in a month.

By contrast, in describing how wealthy people of a town are, the total income of all the residents is not relevant. Similarly, in describing how well students did on an exam, the sum of the scores of all the students doesn't tell us much. A better description might involve percentiles, including the 50th percentile, the median.

Nevertheless, the mean has certain mathematical properties, such as (1.34), that have allowed the rich development of the fields of probability and statistics over the years. The median, by contrast, does not have nice mathematical properties. So, the mean has become entrenched as a descriptive measure, and we will use it often.

1.4.3 Variance

While the expected value tells us the average value a random variable takes on, we also need a measure of the random variable's variability: how much does it wander from one line of the notebook to another? In other words, we want a measure of dispersion. The classical measure is variance, defined to be the mean squared difference between a random variable and its mean:

$$Var(U) = E[(U - EU)^2] \tag{1.43}$$

For X in the die example, this would be

$$Var(X) = E[(X - 3.5)^2] \tag{1.44}$$

To evaluate this, apply (1.42) with $g(c) = (c - 3.5)^2$:

$$Var(X) = \sum_{c=1}^{6} (c - 3.5)^2 \cdot \frac{1}{6} = 2.92 \tag{1.45}$$

You can see that variance does indeed give us a measure of dispersion. If the values of U are mostly clustered near its mean, the variance will be small; if there is wide variation in U, the variance will be large.

The properties of E() in (1.34) and (1.35) can be used to show that

$$Var(U) = E(U^2) - (EU)^2 \tag{1.46}$$

The term $E(U^2)$ is again evaluated using (1.42).
Thus for example, if X is the number of dots which come up when we roll a die, and $W = X^2$, then

$$E(W) = \sum_{i=1}^{6} i^2 \cdot \frac{1}{6} = \frac{91}{6} \tag{1.47}$$

An important property of variance is that

$$Var(cU) = c^2 Var(U) \tag{1.48}$$

for any random variable U and constant c. It should make sense to you: If we multiply a random variable by 5, say, then its average squared distance to its mean should increase by a factor of 25. And shifting data over by a constant does not change the amount of variation in them, so

$$Var(cU + d) = c^2 Var(U) \tag{1.49}$$

for any constant d.

The square root of the variance is called the standard deviation.

The squaring in the definition of variance produces some distortion, by exaggerating the importance of the larger differences. It would be more natural to use the mean absolute deviation (MAD), $E(|U - EU|)$. However, this is less tractable mathematically, so the statistical pioneers chose to use the mean squared difference, which lends itself to lots of powerful and beautiful math, in which the Pythagorean Theorem pops up in abstract vector spaces. (See Section 3.9.2 for details.)

As with expected values, the properties of variance discussed above, and also in Section 3.2.1 below, are key to the entire remainder of this book. You should notice immediately when you are in a setting in which they are applicable. For instance, if you see the variance of the sum of two random variables, you should instinctively think of (1.61) right away.

1.4.4 Is a Variance of X Large or Small?

Recall that the variance of a random variable X is supposed to be a measure of the dispersion of X, meaning the amount that X varies from one instance (one line in our notebook) to the next. But if Var(X) is, say, 2.5, is that a lot of variability or not? We will pursue this question here.

1.4.5 Chebychev's Inequality

This inequality states that for a random variable X with mean $\mu$ and variance $\sigma^2$,

$$P(|X - \mu| \geq c\sigma) \leq \frac{1}{c^2} \tag{1.50}$$

In other words, X does not often stray more than, say, 3 standard deviations from its mean. This gives some concrete meaning to the concept of variance/standard deviation.
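The variance formulas (1.43), (1.46) and (1.49), and the bound (1.50), can all be checked exactly for a single die. The sketch below is only an illustration; the constants 5 and 3 in the linear transformation and the value c = 1.2 in Chebychev's Inequality are arbitrary choices, as are the variable names.

```r
vals <- 1:6              # values taken on by X, one roll of a die
prob <- rep(1/6, 6)      # P(X = c) for each value c

mu <- sum(vals * prob)                       # EX, from (1.28)
var_def   <- sum((vals - mu)^2 * prob)       # definition (1.43), via (1.42)
var_short <- sum(vals^2 * prob) - mu^2       # shortcut (1.46)

# (1.49) with c = 5 and d = 3 (arbitrary constants): Var(5X + 3) = 25 Var(X)
w <- 5 * vals + 3
var_w <- sum((w - sum(w * prob))^2 * prob)

# Chebychev (1.50) with c = 1.2 (arbitrary): P(|X - mu| >= c sigma) <= 1/c^2
sigma  <- sqrt(var_def)
c_cheb <- 1.2
p_far  <- sum(prob[abs(vals - mu) >= c_cheb * sigma])
bound  <- 1 / c_cheb^2

cat("Var(X) by (1.43):", var_def, "  by (1.46):", var_short, "\n")
cat("Var(5X+3):", var_w, "  25 Var(X):", 25 * var_def, "\n")
cat("P(|X - mu| >= 1.2 sigma):", p_far, "  Chebychev bound:", bound, "\n")
```

In this example the actual probability is well below the bound, which is typical: Chebychev's Inequality holds for every distribution, so for any particular one it is usually far from tight.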
To prove (1.50), let's first state and prove Markov's Inequality: For any nonnegative random variable Y,

$$P(Y \geq d) \leq \frac{EY}{d} \tag{1.51}$$

To prove (1.51), let Z equal 1 if $Y \geq d$, 0 otherwise. Then

$$Y \geq dZ \tag{1.52}$$

(think of the two cases), so

$$EY \geq d \, EZ \tag{1.53}$$

The right-hand side of (1.53) is $d \, P(Y \geq d)$, so (1.51) follows.

Now to prove (1.50), define

$$Y = (X - \mu)^2 \tag{1.54}$$

and set $d = c^2 \sigma^2$. Then (1.51) says

$$P[(X - \mu)^2 \geq c^2 \sigma^2] \leq \frac{E[(X - \mu)^2]}{c^2 \sigma^2} \tag{1.55}$$

Since

$$(X - \mu)^2 \geq c^2 \sigma^2 \text{ if and only if } |X - \mu| \geq c\sigma \tag{1.56}$$

the left-hand side of (1.55) is the same as the left-hand side of (1.50). The numerator of the right-hand side of (1.55) is simply Var(X), i.e. $\sigma^2$, so we are done.

1.4.6 The Coefficient of Variation

Continuing our discussion of the magnitude of a variance, look at our remark following (1.50):

    In other words, X does not often stray more than, say, 3 standard deviations from its mean. This gives some concrete meaning to the concept of variance/standard deviation.

This suggests that any discussion of the size of Var(X) should relate to the size of E(X). Accordingly, one often looks at the coefficient of variation, defined to be the ratio of the standard deviation to the mean:

$$\text{coef. of var.} = \frac{\sqrt{Var(X)}}{EX} \tag{1.57}$$

This is a scale-free measure (e.g. inches divided by inches), and serves as a good way to judge whether a variance is large or not.

1.4.7 Covariance

This is a topic we'll cover fully in Chapter 3, but at least introduce here.

A measure of the degree to which U and V vary together is their covariance,

$$Cov(U, V) = E[(U - EU)(V - EV)] \tag{1.58}$$

Except for a divisor, this is essentially correlation. If U is usually large at the same time V is small, for instance, then you can see that the covariance between them will be negative. On the other hand, if they are usually large together or small together, the covariance will be positive.
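The sign behavior of covariance, and the coefficient of variation (1.57), can be illustrated exactly with the dice. In this sketch the helpers `E()` and `cov_def()` are names made up for the illustration, and the random variable 7 - X is introduced only as a convenient example of something that is small when X is large.

```r
# All 36 equally likely outcomes (blue = x, yellow = y), each with weight 1/36
dice <- expand.grid(x = 1:6, y = 1:6)
prob <- rep(1/36, nrow(dice))
E <- function(v) sum(v * prob)

# Covariance straight from the definition (1.58)
cov_def <- function(u, v) E((u - E(u)) * (v - E(v)))

s <- dice$x + dice$y
cov_x_s   <- cov_def(dice$x, s)            # X and S tend to be large together
cov_x_neg <- cov_def(dice$x, 7 - dice$x)   # 7 - X is small when X is large
cov_x_y   <- cov_def(dice$x, dice$y)       # the two dice do not affect each other

# Coefficient of variation (1.57) for X
var_x <- E((dice$x - E(dice$x))^2)
coef_var_x <- sqrt(var_x) / E(dice$x)

cat("Cov(X, S):    ", cov_x_s, "\n")
cat("Cov(X, 7 - X):", cov_x_neg, "\n")
cat("Cov(X, Y):    ", cov_x_y, "\n")
cat("coef. of var. of X:", coef_var_x, "\n")
```

The first covariance comes out positive and the second negative, matching the verbal description of (1.58).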
Again, one can use the properties of E() to show that

$$\mathrm{Cov}(U,V) = E(UV) - EU \cdot EV \tag{1.59}$$

Also

$$\mathrm{Var}(U + V) = \mathrm{Var}(U) + \mathrm{Var}(V) + 2 \, \mathrm{Cov}(U,V) \tag{1.60}$$

If U and V are independent, then Cov(U,V) = 0 and

$$\mathrm{Var}(U + V) = \mathrm{Var}(U) + \mathrm{Var}(V) \tag{1.61}$$

1.4.8 A Combinatorial Example

A committee of four people is drawn at random from a set of six men and three women. Suppose we are concerned that there may be quite a gender imbalance in the membership of the committee. Toward that end, let M and W denote the numbers of men and women in our committee, and let D = M - W. Let's find E(D).

D can take on the values 4-0, 3-1, 2-2 and 1-3, i.e. 4, 2, 0 and -2. So,

$$ED = -2 \cdot P(D = -2) + 0 \cdot P(D = 0) + 2 \cdot P(D = 2) + 4 \cdot P(D = 4) \tag{1.62}$$

Now, using reasoning along the lines in Section 1.2.7, we have

$$P(D = -2) = P(M = 1 \text{ and } W = 3) = \frac{\binom{6}{1}\binom{3}{3}}{\binom{9}{4}} \tag{1.63}$$

After similar calculations for the other probabilities in (1.62), we find that ED = 1.33. If we were to perform this experiment many times, i.e. choose committees again and again, on average we would have a bit more than one more man than women on the committee.

1.4.9 Expected Value, Etc. in the ALOHA Example

Finding expected values etc. in the ALOHA example is straightforward. For instance,

$$EX_1 = 0 \cdot P(X_1 = 0) + 1 \cdot P(X_1 = 1) + 2 \cdot P(X_1 = 2) = 1 \cdot 0.48 + 2 \cdot 0.52 = 1.52 \tag{1.64}$$

Here is R code to find various values approximately by simulation:

    # finds E(X1), E(X2), Var(X2), Cov(X1,X2)
    sim <- function(p,q,nreps) {
       sumx1 <- 0
       sumx2 <- 0
       sumx2sq <- 0
       sumx1x2 <- 0
       for (i in 1:nreps) {
          numsend <- 0
          for (i in 1:2)
             if (runif ...
       # [the listing is interrupted here; original lines 10-32 are missing from this copy]
       meanx2 <- sumx2/nreps
       cat("E(X2):",meanx2,"\n")
       # Var(X2) = E(X2^2) - (EX2)^2
       cat("Var(X2):",sumx2sq/nreps - meanx2^2,"\n")
       # Cov(X1,X2) = E(X1 X2) - EX1 EX2; meanx1 is assumed to be set in the missing lines
       cat("Cov(X1,X2):",sumx1x2/nreps - meanx1*meanx2,"\n")
    }

As a check on your understanding so far, you should find at least one of these values by hand, and see if it jibes with the simulation output.
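Returning to the committee example of Section 1.4.8, the "similar calculations" behind ED = 1.33 can be carried out in a few lines of R. This is an added sketch rather than code from the book; the variable names are arbitrary.

```r
w <- 0:3            # possible numbers of women on the committee
m <- 4 - w          # corresponding numbers of men
# P(M = m and W = w), by the same counting argument as in (1.63)
prob <- choose(6, m) * choose(3, w) / choose(9, 4)
d <- m - w          # values of D: 4, 2, 0, -2
total <- sum(prob)  # sanity check: the four probabilities should sum to 1
ed <- sum(d * prob) # expected value of D, as in (1.62)
print(c(total, ed))
```

The result is ED = 4/3, in agreement with the rounded value 1.33 quoted in the text.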
1.4.10 Reconciliation of Math and Intuition (optional section)

Here I have been promoting the notebook idea over the sterile, confusing mathematical definitions in the theory of probability. It is worth noting, though, that the theory actually does imply the notebook notion, through a theorem known as the Strong Law of Large Numbers:

Consider a random variable U, and a sequence of independent random variables $U_1, U_2, \ldots$ which all have the same distribution as U, i.e. they are repetitions of the experiment which generates U. Then

$$\lim_{n \to \infty} \frac{U_1 + \ldots + U_n}{n} = E(U) \text{ with probability 1} \tag{1.65}$$

In other words, the average value of U in all the lines of the notebook will indeed converge to EU.

1.5 Distributions

1.5.1 Basic Notions

For the type of random variables we've discussed so far, the distribution of a random variable U is simply a list of all the values it takes on, and their associated probabilities:

Example: For X in the dice example, the distribution of X is

$$\{(1,\tfrac{1}{6}), (2,\tfrac{1}{6}), (3,\tfrac{1}{6}), (4,\tfrac{1}{6}), (5,\tfrac{1}{6}), (6,\tfrac{1}{6})\} \tag{1.66}$$

Example: In the ALOHA example, the distribution of $X_1$ is

$$\{(0, 0.00), (1, 0.48), (2, 0.52)\} \tag{1.67}$$

Example: In our example in which N is the number of tosses of a coin needed to get the first head, the distribution is

$$\{(1,\tfrac{1}{2}), (2,\tfrac{1}{4}), (3,\tfrac{1}{8}), \ldots\} \tag{1.68}$$

It is common to express this in functional notation. We define the probability mass function (pmf) of a discrete random variable V, denoted $p_V$, as

$$p_V(k) = P(V = k) \tag{1.69}$$

for any value k which V can take on.

(Please keep in mind the notation. It is customary to use the lower-case p, with a subscript consisting of the name of the random variable.)

Example: In (1.68),

$$p_N(k) = \frac{1}{2^k}, \quad k = 1, 2, \ldots \tag{1.70}$$

Example: In the dice example, in which S = X + Y,

$$p_S(k) = \begin{cases} \frac{1}{36}, & k = 2 \\ \frac{2}{36}, & k = 3 \\ \frac{3}{36}, & k = 4 \\ \ldots \\ \frac{1}{36}, & k = 12 \end{cases} \tag{1.71}$$

It is important to note that there may not be some nice closed-form expression for $p_V$ like that of (1.70). There was no such form in (1.71), nor is there in our ALOHA example for $p_{X_1}$ and $p_{X_2}$.

1.5.2 Parametric Families of pmfs

1.5.2.1 The Geometric Family of Distributions

Recall our example of tossing a coin until we get the first head, with N denoting the number of tosses needed.
In order for this to take k tosses, we need k-1 tails and then a head. Thus

$$p_N(k) = \left(\frac{1}{2}\right)^{k-1} \cdot \frac{1}{2}, \quad k = 1, 2, \ldots \tag{1.72}$$

We might call getting a head a "success," and refer to a tail as a "failure." Of course, these words don't mean anything; we simply refer to the outcome of interest as "success."

Define M to be the number of rolls of a die needed until the number 5 shows up. Then

$$p_M(k) = \left(\frac{5}{6}\right)^{k-1} \frac{1}{6}, \quad k = 1, 2, \ldots \tag{1.73}$$

reflecting the fact that the event {M = k} occurs if we get k-1 non-5s and then a 5. Here "success" is getting a 5.

The tosses of the coin and the rolls of the die are known as Bernoulli trials, which is a sequence of independent 1-0-valued random variables $B_i$, i = 1,2,3,... $B_i$ is 1 for success, 0 for failure, with success probability p. For instance, p is 1/2 in the coin case, and 1/6 in the die example.

In general, suppose the random variable U is defined to be the number of trials needed to get a success in a sequence of Bernoulli trials. Then

$$p_U(k) = (1 - p)^{k-1} p, \quad k = 1, 2, \ldots \tag{1.74}$$

Note that there is a different distribution for each value of p, so we call this a parametric family of distributions, indexed by the parameter p. We say that U is geometrically distributed with parameter p.

It can be shown that

$$E(U) = \frac{1}{p} \tag{1.75}$$

(which should make good intuitive sense to you) and

$$\mathrm{Var}(U) = \frac{1 - p}{p^2} \tag{1.76}$$

By the way, if we were to think of an experiment involving a geometric distribution in terms of our notebook idea, the notebook would have an infinite number of columns, one for each $B_i$. Within each row of the notebook, the $B_i$ entries would be 0 until the first 1, then NA ("not applicable") after that.

1.5.2.2 The Binomial Family of Distributions

A geometric distribution arises when we have Bernoulli trials with parameter p, with a variable number of trials (N) but a fixed number of successes (one). A binomial distribution arises when we have the opposite: a fixed number of Bernoulli trials (n) but a variable number of successes (say X).
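Before leaving the geometric family, formulas (1.74) through (1.76) can be checked numerically. The following R sketch is an added illustration, not from the book; it uses the die example (p = 1/6) and approximates the infinite sums by truncating at 2000 terms, which is an arbitrary but more than sufficient cutoff. One point to keep in mind is that R's built-in dgeom() counts the failures before the first success rather than the total number of trials, so its argument is k-1.

```r
p <- 1/6                                # success probability: rolling a 5
k <- 1:5
pmf_formula <- (1 - p)^(k - 1) * p      # formula (1.74)
pmf_builtin <- dgeom(k - 1, p)          # dgeom() counts failures, hence k - 1
print(all.equal(pmf_formula, pmf_builtin))

# Truncated sums approximating the mean and variance
kk <- 1:2000
pk <- (1 - p)^(kk - 1) * p
eu   <- sum(kk * pk)                    # should be close to 1/p = 6
varu <- sum(kk^2 * pk) - eu^2           # should be close to (1-p)/p^2 = 30
print(c(eu, varu))
```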
[10] Note again the custom of using capital letters for random variables, and lower-case letters for constants.

For example, say we toss a coin five times, and let X be the number of heads we get. We say that X is binomially distributed with parameters n = 5 and p = 1/2. Let's find P(X = 2). There are many orders in which that could occur, such as HHTTT, TTHHT, HTTHT and so on. Each order has probability $0.5^2 (1 - 0.5)^3$ and there are $\binom{5}{2}$ orders. Thus

$$P(X = 2) = \binom{5}{2} 0.5^2 (1 - 0.5)^3 = \binom{5}{2} / 32 = 5/16 \tag{1.77}$$

For general n and p,

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k} \tag{1.78}$$

So again we have a parametric family of distributions, in this case a family having two parameters, n and p.

Let's write X as a sum of those 0-1 Bernoulli variables we used in the discussion of the geometric distribution above:

$$X = \sum_{i=1}^{n} B_i \tag{1.79}$$

where $B_i$ is 1 or 0, depending on whether there is success on the i-th trial or not. Then the reader should use our earlier properties of E() and Var() in Section 1.4 to fill in the details in the following derivations of the expected value and variance of a binomial random variable:

$$EX = E(B_1 + \ldots + B_n) = EB_1 + \ldots + EB_n = np \tag{1.80}$$

and

$$\mathrm{Var}(X) = \mathrm{Var}(B_1 + \ldots + B_n) = \mathrm{Var}(B_1) + \ldots + \mathrm{Var}(B_n) = np(1 - p) \tag{1.81}$$

Again, (1.80) should make good intuitive sense to you.

1.5.2.3 The Poisson Family of Distributions

Another famous parametric family of distributions is the set of Poisson Distributions, which is used to model unbounded counts. The pmf is

$$P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!}, \quad k = 0, 1, 2, \ldots \tag{1.82}$$

The parameter for the family, $\lambda$, turns out to be the value of E(X) and also Var(X).
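The binomial results (1.77), (1.80) and (1.81), and the claim that a Poisson random variable has mean and variance both equal to its parameter, can be checked directly from the pmfs. The R sketch below is an added illustration; the values n = 10, p = 0.3 and lambda = 2.5 are hypothetical, and the Poisson sums are truncated at 200 terms.

```r
# (1.77): five tosses of a fair coin, two heads
by_hand <- choose(5, 2) * 0.5^2 * (1 - 0.5)^3    # 5/16
by_r    <- dbinom(2, 5, 0.5)                     # R's binomial pmf

# Binomial mean and variance computed from the pmf (hypothetical n and p)
n <- 10; p <- 0.3
k  <- 0:n
pk <- dbinom(k, n, p)
ex <- sum(k * pk)                # should equal n * p
vx <- sum(k^2 * pk) - ex^2       # should equal n * p * (1 - p)

# Poisson mean and variance (hypothetical lambda, truncated sums)
lambda <- 2.5
j  <- 0:200
pj <- dpois(j, lambda)
ej <- sum(j * pj)                # should be close to lambda
vj <- sum(j^2 * pj) - ej^2       # should also be close to lambda
print(c(by_hand, by_r, ex, vx, ej, vj))
```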
The Poisson family is very often used to model count data. For example, if you go to a certain bank every day and count the number of customers who arrive between 11:00 and 11:15 a.m., you will probably find that that distribution is well approximated by a Poisson distribution for some $\lambda$.

1.5.2.4 The Negative Binomial Family of Distributions

Recall that a typical example of the geometric distribution family (Section 1.5.2.1) arises as N, the number of tosses of a coin needed to get our first head. Now generalize that, with N now being the number of tosses needed to get our r-th head, where r is a fixed value. Let's find P(N = k), k = r, r+1, ... For concreteness, look at the case r = 3, k = 5. In other words, we are finding the probability that it will take us 5 tosses to accumulate 3 heads.

First note the equivalence of two events:

$$\{N = 5\} = \{\text{2 heads in the first 4 tosses and head on the 5th toss}\} \tag{1.83}$$

That event described before the "and" corresponds to a binomial probability:

$$P(\text{2 heads in the first 4 tosses}) = \binom{4}{2} \left(\frac{1}{2}\right)^4 \tag{1.84}$$

Since the probability of a head on the k-th toss is 1/2 and the tosses are independent, we find that

$$P(N = 5) = \binom{4}{2} \left(\frac{1}{2}\right)^5 = \frac{3}{16} \tag{1.85}$$

The negative binomial distribution family, indexed by parameters r and p, corresponds to random variables which count the number of independent trials with success probability p needed until we get r successes.
The pmf is

$$P(N = k) = \binom{k-1}{r-1} (1 - p)^{k-r} p^r, \quad k = r, r+1, \ldots \tag{1.86}$$

We can write

$$N = G_1 + \ldots + G_r \tag{1.87}$$

where $G_i$ is the number of tosses between the successes numbers i-1 and i. But each $G_i$ has a geometric distribution! Since the mean of that distribution is 1/p, we have that

$$E(N) = r \cdot \frac{1}{p} \tag{1.88}$$

In fact, those r geometric variables are also independent, so we know the variance of N is the sum of their variances:

$$\mathrm{Var}(N) = r \cdot \frac{1 - p}{p^2} \tag{1.89}$$

1.5.2.5 The Power Law Family of Distributions

Here

$$p_X(k) = c k^{-\gamma}, \quad k = 1, 2, 3, \ldots \tag{1.90}$$

It is required that $\gamma > 1$, as otherwise the sum of probabilities will be infinite. For $\gamma$ satisfying that condition, the value c is chosen so that that sum is 1.0:

$$1.0 = \sum_{k=1}^{\infty} c k^{-\gamma} \approx c \int_{1}^{\infty} k^{-\gamma} \, dk = \frac{c}{\gamma - 1} \tag{1.91}$$

Here again we have a parametric family of distributions, indexed by the parameter $\gamma$.

The power law family is an old-fashioned model (an old-fashioned term for distribution is law), but there has been a resurgence of interest in it in recent years. It turns out that many types of networks in the real world exhibit approximately power law behavior.

For instance, in a famous study of the Web (A. Barabasi and R. Albert, Emergence of Scaling in Random Networks, Science, 1999, 509-512), it was found that the number of links leading to a Web page has an approximate power law distribution with $\gamma = 2.1$. The number of links leading out of a Web page was found to be approximately power-law distributed, with $\gamma = 2.7$.

1.6 Recognizing Distributions When You See Them

Many random variables one encounters do not have a distribution in some famous parametric family. But many do, and it's important to be alert to this point, and recognize one when you see one.

1.6.1 A Coin Game

Consider a game played by Jack and Jill. Each of them tosses a coin many times, but Jack gets a head start of two tosses. So by the time Jack has had, for instance, 8 tosses, Jill has had only 6; when Jack tosses for the 15th time, Jill has her 13th toss; etc.
Let $X_k$ denote the number of heads Jack has gotten through his k-th toss, and let $Y_k$ be the head count for Jill at that same time, i.e. among only k-2 tosses for her. (So, $Y_1 = Y_2 = 0$.) Let's find the probability that Jill is winning after Jack's sixth toss, i.e. $P(Y_6 > X_6)$.

Your first reaction might be, "Aha, binomial distribution!" You would be on the right track, but the problem is that you would not be thinking precisely enough. Just WHAT has a binomial distribution? The answer is that both $X_6$ and $Y_6$ have binomial distributions, both with p = 0.5, but n = 6 for $X_6$ while n = 4 for $Y_6$.

Now, as usual, ask the famous question, "How can it happen?" How can it happen that $Y_6 > X_6$? Well, we could have, for example, $Y_6 = 3$ and $X_6 = 1$, as well as many other possibilities. Let's write it mathematically:

$$P(Y_6 > X_6) = \sum_{i=1}^{4} \sum_{j=0}^{i-1} P(Y_6 = i \text{ and } X_6 = j) \tag{1.92}$$

Make SURE you understand this equation.

Now, to evaluate $P(Y_6 = i \text{ and } X_6 = j)$, we see the "and" so we ask whether $Y_6$ and $X_6$ are independent. They in fact are; Jill's coin tosses certainly don't affect Jack's. So,

$$P(Y_6 = i \text{ and } X_6 = j) = P(Y_6 = i) \cdot P(X_6 = j) \tag{1.93}$$

It is at this point that we finally use the fact that $X_6$ and $Y_6$ have binomial distributions. We have

$$P(Y_6 = i) = \binom{4}{i} 0.5^i (1 - 0.5)^{4-i} \tag{1.94}$$

and

$$P(X_6 = j) = \binom{6}{j} 0.5^j (1 - 0.5)^{6-j} \tag{1.95}$$

We would then substitute (1.94) and (1.95) in (1.92). We could then evaluate it by hand, but it would be more convenient to use R's dbinom() function:

    prob <- 0
    for (i in 1:4)
       for (j in 0:(i-1))   # all head counts for Jack that are below Jill's count i
          prob <- prob + dbinom(i,4,0.5) * dbinom(j,6,0.5)
    print(prob)

We get an answer of about 0.17. If Jack and Jill were to play this game repeatedly, stopping each time after the 6th toss, then Jill would win about 17% of the time.

1.6.2 Tossing a Set of Four Coins

Consider a game in which we have a set of four coins. We keep tossing the set of four until we have a situation in which exactly two of them come up heads. Let N denote the number of times we must toss the set of four coins.
For instance, on the first toss of the set of four, the outcome might be HTHH. The second might be TTTH, and the third could be THHT. In this situation, N = 3.

Let's find P(N = 5). Here we recognize that N has a geometric distribution, with "success" defined as getting two heads in our set of four coins. What value does the parameter p have here?

Well, p is P(X = 2), where X is the number of heads we get from a toss of the set of four coins. We recognize that X is binomial! Thus

$$p = \binom{4}{2} 0.5^4 = \frac{3}{8} \tag{1.96}$$

Thus using the fact that N has a geometric distribution,

$$P(N = 5) = (1 - p)^4 p = 0.057 \tag{1.97}$$

1.6.3 The ALOHA Example Again

As an illustration of how commonly these parametric families arise, let's again look at the ALOHA example. Consider the general case, with transmission probability p, message creation probability q, and m network nodes. We will not restrict our observation to just two epochs.

Suppose $X_i = m$, i.e. at the end of epoch i all nodes have a message to send. Then the number which attempt to send during epoch i+1 will be binomially distributed, with parameters m and p.[11] For instance, the probability that there is a successful transmission is equal to the probability that exactly one of the m nodes attempts to send,

$$\binom{m}{1} p (1 - p)^{m-1} = mp(1 - p)^{m-1} \tag{1.98}$$

Now in that same setting, $X_i = m$, let K be the number of epochs it will take before some message actually gets through. In other words, we will have $X_i = m$, $X_{i+1} = m$, $X_{i+2} = m$, ... but finally $X_{i+K-1} = m - 1$. Then K will be geometrically distributed, with success probability equal to (1.98).

There is no Poisson distribution in this example, but it is central to the analysis of Ethernet, and almost any other network. We will discuss this at various points in later chapters.
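The two-step recognition in Section 1.6.2 (a binomial probability feeding a geometric distribution) can be mirrored in R, and then checked by brute-force simulation. This is an added sketch, not a listing from the book; the function name tosses_needed, the seed and the number of replications are arbitrary choices.

```r
# Step 1: p is a binomial probability, as in (1.96)
p <- dbinom(2, 4, 0.5)         # 3/8
# Step 2: N is geometric with that p, as in (1.97)
pn5 <- (1 - p)^4 * p
pn5_builtin <- dgeom(4, p)     # dgeom() counts the 4 failures before the success
print(c(p, pn5, pn5_builtin))

# Simulation check: toss the set of four coins until exactly two heads appear
set.seed(1)
tosses_needed <- function() {
  n <- 0
  repeat {
    n <- n + 1
    if (sum(sample(0:1, 4, replace = TRUE)) == 2) return(n)
  }
}
nvals <- replicate(100000, tosses_needed())
simprob <- mean(nvals == 5)    # proportion of notebook lines with N = 5
print(simprob)
```

The simulated proportion should land close to the exact value of about 0.057, though it will vary somewhat with the seed.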
[11] Note that this is a conditional distribution, given $X_i = m$.

1.7 A Cautionary Tale

1.7.1 Trick Coins, Tricky Example

Suppose we have two trick coins in a box. They look identical, but one of them, denoted coin 1, is heavily weighted toward heads, with a 0.9 probability of heads, while the other, denoted coin 2, is biased in the opposite direction, with a 0.9 probability of tails. Let $C_1$ and $C_2$ denote the events that we get coin 1 or coin 2, respectively.

Our experiment consists of choosing a coin at random from the box, and then tossing it n times. Let $B_i$ denote the outcome of the i-th toss, i = 1,2,3,..., where $B_i = 1$ means heads and $B_i = 0$ means tails. Let $X_i = B_1 + \ldots + B_i$, so $X_i$ is a count of the number of heads obtained through the i-th toss.

The question is: "Does the random variable $X_i$ have a binomial distribution?" Or, more simply, the question is, "Are the random variables $B_i$ independent?" To most people's surprise, the answer is No (to both questions). Why not?

The variables $B_i$ are indeed 0-1 variables, and they have a common success probability. But they are not independent! Let's see why they aren't.

Consider the events $A_i = \{B_i = 1\}$, i = 1,2,3,... In fact, just look at the first two. By definition, they are independent if and only if

$$P(A_1 \text{ and } A_2) = P(A_1) P(A_2) \tag{1.99}$$

First, what is $P(A_1)$? Now, wait a minute!
Don't answer, "Well, it depends on which coin we get," because this is NOT a conditional probability. Yes, the conditional probabilities $P(A_1 | C_1)$ and $P(A_1 | C_2)$ are 0.9 and 0.1, respectively, but the unconditional probability is $P(A_1) = 0.5$. You can deduce that either by the symmetry of the situation, or by

$$P(A_1) = P(C_1) P(A_1 | C_1) + P(C_2) P(A_1 | C_2) = (0.5)(0.9) + (0.5)(0.1) = 0.5 \tag{1.100}$$

You should think of all this in the notebook context. Each line of the notebook would consist of a report of three things: which coin we get; the outcome of the first toss; and the outcome of the second toss. (Note by the way that in our experiment we don't know which coin we get, but conceptually it should have a column in the notebook.) If we do this experiment for many, many lines in the notebook, about 90% of the lines in which the coin column says "1" will show Heads in the second column. But 50% of the lines overall will show Heads in that column.

So, the right hand side of Equation (1.99) is equal to 0.25. What about the left hand side?

$$P(A_1 \text{ and } A_2) = P(A_1 \text{ and } A_2 \text{ and } C_1) + P(A_1 \text{ and } A_2 \text{ and } C_2) \tag{1.101}$$

$$= P(A_1 \text{ and } A_2 | C_1) P(C_1) + P(A_1 \text{ and } A_2 | C_2) P(C_2) \tag{1.102}$$

$$= (0.9)^2 (0.5) + (0.1)^2 (0.5) \tag{1.103}$$

$$= 0.41 \tag{1.104}$$

Well, 0.41 is not equal to 0.25, so you can see that the events are not independent, contrary to our first intuition. And that also means that $X_i$ is not binomial.

1.7.2 Intuition in Retrospect

To get some intuition here, think about what would happen if we tossed the chosen coin 10000 times instead of just twice. If the tosses were independent, then for example knowledge of the first 9999 tosses should not tell us anything about the 10000th toss. But that is not the case at all. After 9999 tosses, we are going to have a very good idea as to which coin we had chosen, because by that time we will have gotten about 9000 heads (in the case of coin $C_1$) or about 1000 heads (in the case of $C_2$). In the former case, we know that the 10000th toss is likely to be a head, while in the latter case it is likely to be tails. In other words, earlier tosses do indeed give us information about later tosses, so the tosses aren't independent.

1.7.3 Implications for Modeling

The lesson to be learned is that independence can definitely be a tricky thing, not to be assumed cavalierly.
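The trick coin calculation of Section 1.7.1 is easy to check in "notebook" fashion by simulation. The R sketch below is an added illustration, not code from the book; each of the nreps repetitions plays the role of one notebook line, and the seed, the number of repetitions and the variable names are arbitrary choices.

```r
set.seed(2)
nreps <- 200000
coin  <- sample(1:2, nreps, replace = TRUE)   # which coin each notebook line gets
phead <- ifelse(coin == 1, 0.9, 0.1)          # heads probability of the chosen coin
b1 <- rbinom(nreps, 1, phead)                 # first toss of that coin
b2 <- rbinom(nreps, 1, phead)                 # second toss of the same coin

p_a1    <- mean(b1)                    # unconditional P(A1), near 0.5
p_a1_c1 <- mean(b1[coin == 1])         # conditional P(A1 | C1), near 0.9
p_both  <- mean(b1 == 1 & b2 == 1)     # P(A1 and A2), near 0.41
p_prod  <- mean(b1) * mean(b2)         # P(A1) P(A2), near 0.25
print(c(p_a1, p_a1_c1, p_both, p_prod))
```

The gap between the last two numbers is the nonindependence of the tosses showing up in the data.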
In creating probability models of real systems, we must give very, very careful thought to the conditional and unconditional aspects of our models; it can make a huge difference, as we saw above. Also, the conditional aspects often play a key role in formulating models of nonindependence.

This trick coin example is just that, tricky, but similar situations occur often in real life. If in some medical study, say, we sample people at random from the population, the people are independent of each other. But if we sample families from the population, and then look at children within the families, the children within a family are not independent of each other.

1.8 Why Not Just Do All Analysis by Simulation?

Now that computer speeds are so fast, one might ask why we need to do mathematical probability analysis; why not just do everything by simulation? There are a number of reasons:

- Even with a fast computer, simulations of complex systems can take days, weeks or even months.

- Mathematical analysis can provide us with insights that may not be clear in simulation.

- Like all software, simulation programs are prone to bugs. The chance of having an uncaught bug in a simulation program is reduced by doing mathematical analysis for a special case of the system being simulated. This serves as a partial check.

- Statistical analysis is used in many professions, including engineering and computer science, and in order to conduct meaningful, useful statistical analysis, one needs a firm understanding of probability principles.

An example of that second point arose in the computer security research of a graduate student at UCD, C. Senthilkumar, who was working on a way to more quickly detect the spread of a malicious computer worm. He was evaluating his proposed method by simulation, and found that things "hit a wall" at a certain point. He wasn't sure if this was a real limitation; maybe, for example, he just wasn't running his simulation on the right set of parameters to go beyond this limit. But a mathematical analysis showed that the limit was indeed real.

1.9 Tips on Finding Probabilities, Expected Values and So On

First, do not write/think nonsense. For example, the expression "P(A) or P(B)" is nonsense. Do you see why?
Similarly, don't use "formulas" that you didn't learn and are in fact false. For example, in an expression involving a random variable X, one can NOT replace X by EX! (How would you like it if your professor were to lose your exam, and then tell you, "Well, I'll just assign you a score that is equal to the class mean"?)

As noted before, in calculating a probability, ask yourself, "How can it happen?" Then you will typically have a set of and/or terms, which you compute individually and add together. And until you get used to it, write down every step, including reasons, as you see in (1.7)-(1.9).

Another point is that you should define variables, e.g. "Let X denote the number of heads." Write it down! This makes it much easier to translate from words to math expressions and equations.

Exercises

1. This problem concerns the ALOHA network model of Section 1.1. Feel free to use (but cite) computations already in the example.

(a) Find $P(X_1 = 2 \text{ and } X_2 = 1)$, for the same values of p and q in the examples.

(b) Find $P(X_2 = 0)$.

(c) Find $P(X_1 = 1 | X_2 = 1)$.

2. Consider a game in which one rolls a single die until one accumulates a total of at least four dots. Let X denote the number of rolls needed. Find $P(X \le 2)$ and E(X).

3. Recall the committee example in Section 1.4.8. Suppose now, though, that the selection protocol is that there must be at least one man and at least one woman on the committee. Find E(D) and Var(D).

4. Consider the game in Section 1.6.1. Find E(Z) and Var(Z), where $Z = Y_6 - X_6$.

5. Say we choose six cards from a standard deck, one at a time WITHOUT replacement. Let N be the number of kings we get. Does N have a binomial distribution? Choose one: (i) Yes. (ii) No, since trials are not independent. (iii) No, since the probability of success is not constant from trial to trial. (iv) No, since the number of trials is not fixed. (v) (ii) and (iii). (vi) (ii) and (iv). (vii) (iii) and (iv).

6. Suppose we have n independent trials, with the probability of success on the i-th trial being $p_i$. Let X = the number of successes. Use the fact that "the variance of the sum is the sum of the variance" for independent random variables to derive Var(X).

7. You bought three tickets in a lottery, for which 60 tickets were sold in all. There will be five prizes given.
Find the probability that you win at least one prize, and the probability that you win exactly one prize.

8. Two five-person committees are to be formed from your group of 20 people. In order to foster communication, we set a requirement that the two committees have the same chair but no other overlap. Find the probability that you and your friend are both chosen for some committee.

9. Consider a device that lasts either one, two or three months, with probabilities 0.1, 0.7 and 0.2, respectively. We carry one spare. Find the probability that we have some device still working just before four months have elapsed.

10. A building has six floors, and is served by two freight elevators, named Mike and Ike. The destination floor of any order of freight is equally likely to be any of floors 2 through 6. Once an elevator reaches any of these floors, it stays there until summoned. When an order arrives to the building, whichever elevator is currently closer to floor 1 will be summoned, with elevator Ike being the one summoned in the case in which they are both on the same floor.

Find the probability that after the summons, elevator Mike is on floor 3. Assume that only one order of freight can fit in an elevator at a time. Also, suppose the average time between arrivals of freight to the building is much larger than the time for an elevator to travel between the bottom and top floors; this assumption allows us to neglect travel time.

11. Without resorting to using the fact that $\binom{n}{k} = n! / [k!(n-k)!]$, find c and d such that

$$\binom{n}{k} = \binom{n-1}{k} + \binom{c}{d} \tag{1.105}$$

12. Prove Equation (1.46), and also show that b = EU minimizes the quantity $E[(U - b)^2]$.

13. Show that if X is a nonnegative-integer valued random variable, then

$$EX = \sum_{i=1}^{\infty} P(X \ge i) \tag{1.106}$$

Hint: Write $i = \sum_{j=1}^{i} 1$, and when you see an iterated sum, reverse the order of summation.

14. A civil engineer is collecting data on a certain road. She needs to have data on 25 trucks, and 10 percent of the vehicles on that road are trucks. Find the probability that she will need to wait for more than 200 vehicles to pass before she gets the needed data.
15. Suppose we toss a fair coin n times, resulting in X heads. Show that the term expected value is a misnomer, by showing that

$$\lim_{n \to \infty} P(X = n/2) = 0 \tag{1.107}$$

Use Stirling's approximation,

$$k! \approx \sqrt{2 \pi k} \left(\frac{k}{e}\right)^k \tag{1.108}$$

Chapter 2

Continuous Probability Models

2.1 A Random Dart

Imagine that we throw a dart at random at the interval (0,1). Let D denote the spot we hit. By "at random" we mean that all subintervals of equal length are equally likely to get hit. For instance, the probability of the dart landing in (0.7,0.8) is the same as for (0.2,0.3), (0.537,0.637) and so on.

The first crucial point to note is that

$$P(D = c) = 0 \tag{2.1}$$

for any individual point c. That can be seen by the fact that c is in as tiny a subinterval as you wish, or by the fact that the interval (c,c), or even [c,c], has length 0. Or, reason that there are infinitely many points, and if they all had some nonzero probability w, say, then the probabilities would sum to infinity instead of to 1; thus they must have probability 0.

That may sound odd to you, but remember, this is an idealization. D actually cannot be just any old point in (0,1). Our dart has nonzero thickness, our measuring instrument has only finite precision, and so on. So it really is an idealization, though an extremely useful one. It's like the assumption of "massless string" in physics analyses; there is no such thing, but it's a good approximation to reality.

But Equation (2.1) presents a problem for us in defining the term distribution for variables like this. We defined it for a discrete random variable Y as a list of the values Y takes on, together with their probabilities. But that would be impossible here, since all the probabilities of individual values here are 0.

Instead, we define the distribution of a random variable W which puts 0 probability on individual points in another way. To set this up, we first must define, for any random variable W (including discrete ones), its cumulative distribution function (cdf):

$$F_W(t) = P(W \le t), \ \ldots$$

[...] of the name of the random variable.

What is t here? It's simply an argument to a function. The function here has domain $(-\infty, \infty)$, and we must thus define that function for every value of t.
For instance, consider our random dart example above. We know that, for example

$$F_D(0.23) = P(D \le 0.23) = 0.23 \tag{2.3}$$

In general for our dart,

$$F_D(t) = \begin{cases} 0, & \text{if } t \le 0 \\ t, & \text{if } 0 < t < 1 \\ 1, & \text{if } t \ge 1 \end{cases} \tag{2.4}$$

[...] of heads we get from two tosses of a coin. Then

$$F_Z(t) = \begin{cases} 0, & \text{if } t < 0 \\ 0.25, & \text{if } 0 \le t < 1 \\ 0.75, & \text{if } 1 \le t < 2 \\ 1, & \text{if } t \ge 2 \end{cases} \tag{2.5}$$

For instance, $F_Z(1.2) = P(Z \le 1.2) = P(Z = 0 \text{ or } Z = 1) = 0.25 + 0.50 = 0.75$. (Make sure you confirm this!) $F_Z$ is graphed below:

[Figure: graph of $F_Z$]

The fact that one cannot get a noninteger number of heads is what makes the cdf of Z flat between consecutive integers.

In the graphs you see that $F_D$ in (2.4) is continuous while $F_Z$ in (2.5) has jumps. For this reason, we call random variables like D, ones which have 0 probability for individual points, continuous random variables.

At this level of study of probability, most random variables are either discrete or continuous, but some are not.

2.2 Density Functions

Intuition is key here. Make SURE you develop a good intuitive understanding of density functions, as it is vital in being able to apply probability well. We will use it a lot in our course.

2.2.1 Motivation, Definition and Interpretation

OK, now we have a name for random variables that have probability 0 for individual points ("continuous") and we have solved the problem of how to describe their distribution. Now we need something which will be continuous random variables' analog of a probability mass function.

Think as follows. From (2.2) we can see that for a discrete random variable, its cdf can be calculated by summing its pmf. Recall that in the continuous world, we integrate instead of sum. So, our continuous-case analog of the pmf should be something that integrates to the cdf. That of course is the derivative of the cdf, which is called the density. It is defined as

$$f_W(t) = \frac{d}{dt} F_W(t), \ \ldots$$

[...]

Figure 2.1: Approximation of Probability by a Rectangle

So, X will take on values in regions in which $f_X$ is large much more often than in regions where it is small, with the ratio of frequencies being proportional to the values of $f_X$.

For our dart random variable D, $f_D(t) = 1$ for t in (0,1), and it's 0 elsewhere.
Again, $f_D(t)$ is NOT P(D = t), since the latter value is 0, but it is still viewable as a "relative likelihood." The fact that $f_D(t) = 1$ for all t in (0,1) can be interpreted as meaning that all the points in (0,1) are equally likely to be hit by the dart. More precisely put, you can view the constant nature of this density as meaning that all subintervals of the same length within (0,1) have the same probability of being hit.

Note too that if, say, X has the density in the previous paragraph, then $f_X(\ldots) = 6/15 = 0.4$ and thus P(...99 [...]

2.2.2 Use of Densities to Find Probabilities and Expected Values

Equation (2.6) implies that

P(a [...]

2.3 Famous Parametric Families of Continuous Distributions

2.3.1 The Uniform Distributions

2.3.1.1 Density and Properties

In our dart example, we can imagine throwing the dart at the interval (q,r) (so this will be a two-parameter family). Then to be a uniform distribution, i.e. with all the points being "equally likely," the density must be constant in that interval. But it also must integrate to 1 [see (2.11)]. So, that constant must be 1 divided by the length of the interval:

$$f_D(t) = \frac{1}{r - q} \tag{2.19}$$

for t in (q,r), 0 elsewhere.

It is easily shown that $E(D) = \frac{q + r}{2}$ and $\mathrm{Var}(D) = \frac{1}{12}(r - q)^2$.

The notation for this family is U(q,r).

2.3.1.2 Example: Modeling of Disk Performance

Uniform distributions are often used to model computer disk requests. Recall that a disk consists of a large number of concentric rings, called tracks. When a program issues a request to read or write a file, the read/write head must be positioned above the track of the first part of the file. This move, which is called a seek, can be a significant factor in disk performance in large systems, e.g. a database for a bank.

If the number of tracks is large, the position of the read/write head, which I'll denote by X, is like a continuous random variable, and often this position is modeled by a uniform distribution. This situation may hold just before a defragmentation operation. After that operation, the files tend to be bunched together in the central tracks of the disk, so as to reduce seek time, and X will not have a uniform distribution anymore.
Each track consists of a certain number of sectors of a given size, say 512 bytes each. Once the read/write head reaches the proper track, we must wait for the desired sector to rotate around and pass under the read/write head. It should be clear that a uniform distribution is a good model for this rotational delay.

2.3.1.3 Example: Modeling of Denial-of-Service Attack

In one facet of computer security, it has been found that a uniform distribution is actually a warning of trouble, a possible indication of a denial-of-service attack. Here the attacker tries to monopolize, say, a Web server, by inundating it with service requests. According to the research of David Marchette,[2] attackers choose uniformly distributed false IP addresses, a pattern not normally seen at servers.

[2] Statistical Methods for Network and Computer Security, David J. Marchette, Naval Surface Warfare Center, rion.math.iastate.edu/IA/2003/foils/marchette.pdf.

2.3.2 The Normal (Gaussian) Family of Continuous Distributions

These are the famous "bell-shaped curves," so called because their densities have that shape.

2.3.2.1 Density and Properties

Density and Parameters:

The density for a normal distribution is

$$f_W(t) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-0.5 \left(\frac{t - \mu}{\sigma}\right)^2}, \ \ldots$$

[...]

$$f_Y(t) = \frac{d}{dt} F_Y(t) \quad \text{(definition of } f_Y) \tag{2.25}$$

$$= \frac{d}{dt} F_X\left(\frac{t - d}{c}\right) \quad \text{(from (2.24))} \tag{2.26}$$

$$= f_X\left(\frac{t - d}{c}\right) \cdot \frac{d}{dt} \frac{t - d}{c} \quad \text{(definition of } f_X \text{ and the Chain Rule)} \tag{2.27}$$

$$= \frac{1}{c} \cdot \frac{1}{\sqrt{2\pi}\,\sigma} e^{-0.5 \left(\frac{\frac{t - d}{c} - \mu}{\sigma}\right)^2} \quad \text{(from (2.20))} \tag{2.28}$$

$$= \frac{1}{\sqrt{2\pi}\,(c\sigma)} e^{-0.5 \left(\frac{t - (c\mu + d)}{c\sigma}\right)^2} \quad \text{(algebra)} \tag{2.29}$$

That last expression is the $N(c\mu + d, c^2\sigma^2)$ density, so we are done!
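The conclusion of this derivation, that Y = cX + d is again normal with mean $c\mu + d$ and variance $c^2\sigma^2$, can be spot-checked numerically with R's pnorm() and dnorm(). The sketch below is an added illustration, not code from the book; the parameter values and the evaluation points tt are hypothetical, and c is taken to be positive (it is named cc in the code to avoid clashing with R's c() function).

```r
mu <- 2; sigma <- 3        # parameters of X (hypothetical values)
cc <- 4; d <- -1           # Y = cc * X + d, with cc > 0
tt <- c(-20, 0, 7, 15, 40) # points at which to compare

# cdf level: P(Y <= t) = P(X <= (t - d) / cc)
cdf_via_x <- pnorm((tt - d) / cc, mean = mu, sd = sigma)
# cdf of N(cc*mu + d, cc^2 * sigma^2); note pnorm() takes the standard deviation
cdf_claim <- pnorm(tt, mean = cc * mu + d, sd = cc * sigma)

# density level, mirroring steps (2.27) through (2.29)
dens_via_x <- dnorm((tt - d) / cc, mu, sigma) / cc
dens_claim <- dnorm(tt, cc * mu + d, cc * sigma)

print(all.equal(cdf_via_x, cdf_claim))
print(all.equal(dens_via_x, dens_claim))
```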
Evaluating the Normal cdf

The function in (2.20) does not have a closed-form indefinite integral. Thus probabilities involving normal random variables must be approximated. Traditionally, this is done with a table for the cdf of N(0,1). This one table is sufficient for the entire normal family, because if X has the distribution $N(\mu, \sigma^2)$ then

$$\frac{X - \mu}{\sigma} \tag{2.30}$$

is normally distributed too, with distribution N(0,1), due to the affine transformation closure property discussed above.

By the way, the N(0,1) cdf is traditionally denoted by $\Phi$. As noted, traditionally it has played a central role, as one could transform any probability involving some normal distribution to an equivalent probability involving N(0,1). One would then use a table of N(0,1) to find the desired probability.

Nowadays, probabilities for any normal distribution, not just N(0,1), are easily available by computer. In the R statistical package, the normal cdf for any mean and variance is available via the function pnorm(). (Note that pnorm() takes the standard deviation, not the variance, as its third argument.) We'll use both methods in our first couple of examples below.

2.3.2.2 Example: Network Intrusion

As an example, let's look at a simple version of the network intrusion problem. Suppose we have found that in Jill's remote logins to a certain computer, the number of disk sectors she reads or writes, X, has a normal distribution with a mean of 500 and a standard deviation of 15. Say our network intrusion monitor finds that Jill (or someone posing as her) has logged in and has read or written 535 sectors. Should we be suspicious?

To answer this question, let's find $P(X \ge 535)$: Let Z = (X - 500)/15. From our discussion above, we know that Z has a N(0,1) distribution, so

$$P(X \ge 535) = P\left(Z \ge \frac{535 - 500}{15}\right) = 1 - \Phi(35/15) \approx 0.01 \tag{2.31}$$

Again, traditionally we would obtain that 0.01 value from a N(0,1) cdf table in a book. With R, we would just use the function pnorm():

    > 1 - pnorm(535,500,15)
    [1] 0.009815329

Anyway, that 0.01 probability makes us suspicious. While it could really be Jill, this would be unusual behavior for Jill, so we start to suspect that it isn't her. Of course, this is a very crude analysis, and real intrusion detection systems are much more complex, but you can see the main ideas here.
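The "table" route and the direct route in this example are the same computation, which a short R sketch makes explicit. This is an added illustration; the variable names are arbitrary, and the third line uses pnorm()'s lower.tail argument as an alternative to subtracting from 1.

```r
x <- 535; mu <- 500; sigma <- 15

z <- (x - mu) / sigma                     # standardize, as in (2.30)
p_table  <- 1 - pnorm(z)                  # what a N(0,1) table lookup would give
p_direct <- 1 - pnorm(x, mu, sigma)       # direct use of the mean and sd
p_upper  <- pnorm(x, mu, sigma, lower.tail = FALSE)   # upper tail in one call

print(c(z, p_table, p_direct, p_upper))
```

All three probabilities agree with the value 0.009815329 shown above.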
2.3.2.3 The Central Limit Theorem

The Central Limit Theorem (CLT) says, roughly speaking, that a random variable which is a sum of many components will have an approximate normal distribution.⁵

So, for instance, human weights are approximately normally distributed, since a person is made of many components. The same is true for SAT test scores,⁶ as the total score is the sum of scores on the individual problems.

Binomially distributed random variables, though discrete, also are approximately normally distributed. This comes from the fact that if, say, T has a binomial distribution with n trials, then we can write $T = T_1 + \ldots + T_n$, where $T_i$ is 1 for a success and 0 for a failure. Since we have a sum, the CLT applies. Thus we use the CLT if we have binomial distributions with large n.

⁵ There are many versions of the CLT. The basic one requires that the summands be independent and identically distributed, but more advanced versions are broader in scope.

⁶ This refers to the raw scores, before scaling by the testing company.

2.3.2.4 Example: Coin Tosses

For example, let's find the approximate probability of getting more than 12 heads in 20 tosses of a coin. X, the number of heads, has a binomial distribution with n = 20 and p = 0.5. Its mean and variance are then np = 10 and np(1-p) = 5. So, let $Z = (X - 10)/\sqrt{5}$, and write

$$P(X > 12) = P\left(Z > \frac{12-10}{\sqrt{5}}\right) \approx 1 - \Phi(0.894) = 0.186 \tag{2.32}$$

Or:

    > 1 - pnorm(12,10,sqrt(5))
    [1] 0.1855467

The exact answer is 0.132. Remember, the reason we could do this was that X is approximately normal, from the CLT. This is an approximation of the distribution of a discrete random variable by a continuous one, which introduces additional error.

We can get better accuracy by accounting for the fact that X is discrete, replacing 12 by 12.5 above. Think of the number 13 as "owning" the region between 12.5 and 13.5. This is customary, and in this case gives us 0.1317762, while the exact answer to six decimal places is 0.131588. This is called the correction for continuity. Of course, for larger n this adjustment matters less.

2.3.2.5 Museum Demonstration

Many science museums have the following visual demonstration of the CLT.
There are many balls in a chute, with a triangular array of r rows of pins beneath the chute. Each ball falls through the rows of pins, bouncing left and right with probability 0.5 each, eventually being collected into one of r+1 bins, numbered 0 to r. A ball will end up in bin i if it bounces rightward in i of the r rows of pins, i = 0,1,...,r. Key point:

Let X denote the bin number at which a ball ends up. X is the number of rightward bounces ("successes") in r rows ("trials"). Therefore X has a binomial distribution with n = r and p = 0.5.

Each bin is wide enough for only one ball, so the balls in a bin will stack up. And since there are many balls, the height of the stack in bin i will be approximately proportional to P(X = i). And since the latter will be approximately given by the CLT, the stacks of balls will roughly look like the famous bell-shaped curve!

There are many online simulations of this museum demonstration, such as http://www.rand.org/statistics/applets/clt.html and http://www.jcu.edu/math/isep/Quincunx/Quincunx.html. By collecting the balls in bins, the apparatus basically simulates a histogram for X, which will then be approximately bell-shaped.

2.3.2.6 Optional topic: Formal Statement of the CLT

Definition 1 A sequence of random variables $L_1, L_2, L_3, \ldots$ converges in distribution to a random variable $M$ if

$$\lim_{n \to \infty} P(L_n \leq t) = P(M \leq t), \quad \text{for all } t \tag{2.33}$$

Note by the way, that these random variables need not be defined on the same probability space.

The formal statement of the CLT is:

Theorem 2 Suppose $X_1, X_2, \ldots$ are independent random variables, all having the same distribution which has mean $m$ and variance $v^2$. Then

$$Z = \frac{X_1 + \ldots + X_n - nm}{v\sqrt{n}} \tag{2.34}$$

converges in distribution to a $N(0,1)$ random variable.

2.3.2.7 Importance in Modeling

Normal distributions play a key role in statistics. Most of the classical statistical procedures assume that one has sampled from a population having an approximately normal distribution. This should come as no surprise, knowing the CLT. The latter implies that many things in nature do have approximate normal distributions.
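The museum apparatus of Section 2.3.2.5 is easy to imitate in R. The sketch below is only an illustration under assumed values (20 rows of pins and 10000 balls, neither taken from the text). Each ball makes r independent left/right bounces, and the resulting bin proportions are compared with the exact binomial probabilities and with the normal density suggested by the CLT.

```r
set.seed(2024)                      # arbitrary seed, for reproducibility
r <- 20                             # rows of pins (assumed value)
nballs <- 10000                     # number of balls (assumed value)
# each row of the matrix holds one ball's bounces: 1 = right, 0 = left
bounces <- matrix(rbinom(nballs * r, size = 1, prob = 0.5), nrow = nballs)
bins <- rowSums(bounces)            # bin number = number of rightward bounces
# proportion of balls landing in each of the bins 0, 1, ..., r
heights <- as.numeric(table(factor(bins, levels = 0:r))) / nballs
exact <- dbinom(0:r, size = r, prob = 0.5)          # P(X = i), binomial
clt <- dnorm(0:r, mean = r / 2, sd = sqrt(r) / 2)   # normal density at each bin
print(round(cbind(bin = 0:r, heights, exact, clt), 4))
```

The normal density value at an integer i stands in for the probability of the unit-width interval around i. Calling barplot(heights, names.arg = 0:r) would draw the stacks.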
2.3.3 The Chi-Square Family of Distributions

2.3.3.1 Density and Properties

Let $Z_1, Z_2, \ldots, Z_k$ be independent $N(0,1)$ random variables. Then the distribution of

$$Y = Z_1^2 + \ldots + Z_k^2 \tag{2.35}$$

is called chi-square with k degrees of freedom. We write such a distribution as $\chi^2_k$. Chi-square is a one-parameter family of distributions.

It turns out that chi-square is a special case of the gamma family in Section 2.3.5 below, with r = k/2 and $\lambda = 0.5$.

2.3.3.2 Importance in Modeling

This distribution is used widely in statistical applications. As will be seen in our chapters on statistics, many statistical methods involve a sum of squared normal random variables.

2.3.4 The Exponential Family of Distributions

2.3.4.1 Density and Properties

The densities in this family have the form

$$f_W(t) = \lambda e^{-\lambda t}, \quad 0 < t < \infty \tag{2.36}$$

2.3.4.2 Connection to the Poisson Distribution Family

Suppose the lifetimes of a set of light bulbs are independent and identically distributed (i.i.d.), and consider the following process. At time 0, we install a light bulb, which burns an amount of time $X_1$. Then we install a second light bulb, with lifetime $X_2$. Then a third, with lifetime $X_3$, and so on.

Let

$$T_r = X_1 + \ldots + X_r \tag{2.37}$$

denote the time of the r-th replacement. Also, let N(t) denote the number of replacements up to and including time t.⁹ Then it can be shown that if the common distribution of the $X_i$ is exponential, then N(t) has a Poisson distribution with mean $\lambda t$. And the converse is true too: If the $X_i$ are independent and identically distributed and N(t) is Poisson, then the $X_i$ must have exponential distributions.

In other words, N(t) will have a Poisson distribution if and only if the lifetimes are exponentially distributed.
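The "if" direction of this connection can be explored by simulation. The sketch below is an illustration only: the rate 0.5, the time horizon 10 and the function name nrepl() are our own assumptions. It generates exponential lifetimes, counts the replacements made by the horizon, and compares the results with the Poisson distribution having mean λt.

```r
set.seed(12345)                 # arbitrary seed, for reproducibility
lambda <- 0.5                   # assumed replacement rate per unit time
tend <- 10                      # count replacements up to time tend
nreps <- 10000

# simulate one sequence of bulbs; return N(tend), the number of replacements by tend
nrepl <- function(lambda, tend) {
  elapsed <- 0
  count <- 0
  repeat {
    elapsed <- elapsed + rexp(1, rate = lambda)  # add the next lifetime X_i
    if (elapsed > tend) return(count)
    count <- count + 1
  }
}

nt <- replicate(nreps, nrepl(lambda, tend))
simmean <- mean(nt)             # compare with lambda * tend = 5
simp3 <- mean(nt == 3)          # compare with the Poisson probability of 3
print(c(simmean, lambda * tend))
print(c(simp3, dpois(3, lambda * tend)))
```

Agreement here is only up to simulation error, and it illustrates rather than proves the result.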
You can see the "only if" part quickly, by the following argument. First, note that

$$P(X_1 > t) = P[N(t) = 0] = e^{-\lambda t} \tag{2.38}$$

Then

$$f_{X_1}(t) = \frac{d}{dt}\left(1 - e^{-\lambda t}\right) = \lambda e^{-\lambda t} \tag{2.39}$$

The collection of random variables N(t), $t \geq 0$, is called a Poisson process.

The relation $E[N(t)] = \lambda t$ says that replacements are occurring at an average rate of $\lambda$ per unit time. Thus $\lambda$ is called the intensity parameter of the process. It is this "rate" interpretation that makes $\lambda$ a natural indexing parameter in (2.36).

⁹ Again, since the replacement times $T_r$ are continuous random variables, the phrase "and including" is unnecessary here.

2.3.4.3 Importance in Modeling

Many distributions in real life have been found to be approximately exponentially distributed. A famous example is the lifetimes of air conditioners on airplanes. Another famous example is interarrival times, such as customers coming into a bank or messages going out onto a computer network. The exponential family is used in software reliability studies too.

2.3.5 The Gamma Family of Distributions

2.3.5.1 Density and Properties

Recall Equation (2.37), in which the random variable $T_r$ was defined to be the time of the r-th light bulb replacement. $T_r$ is the sum of r independent exponentially distributed random variables with parameter $\lambda$. The distribution of $T_r$ is called an Erlang distribution, with density

$$f_{T_r}(t) = \frac{1}{(r-1)!}\, \lambda^r t^{r-1} e^{-\lambda t}, \quad t > 0 \tag{2.40}$$

This is a two-parameter family.

We can generalize this by allowing r to take noninteger values, by defining a generalization of the factorial function:

$$\Gamma(r) = \int_0^\infty x^{r-1} e^{-x}\, dx \tag{2.41}$$

This is called the gamma function, and it gives us the gamma family of distributions, more general than the Erlang:

$$f_W(t) = \frac{1}{\Gamma(r)}\, \lambda^r t^{r-1} e^{-\lambda t}, \quad t > 0 \tag{2.42}$$

Note that $\Gamma(r)$ is merely serving as the constant that makes the density integrate to 1.0. It doesn't have meaning of its own.
This is again a two-parameter family, with r and $\lambda$ as parameters.

A gamma distribution has mean $r/\lambda$ and variance $r/\lambda^2$. In the case of integer r, this follows from (2.37) and the fact that an exponentially distributed random variable has mean $1/\lambda$ and variance $1/\lambda^2$, and it can be derived in general. Note again that the gamma reduces to the exponential when r = 1.

Recall from above that the gamma distribution, or at least the Erlang, arises as a sum of independent random variables. Thus the Central Limit Theorem implies that the gamma distribution should be approximately normal for large (integer) values of r. We see in Figure 2.2 that even with r = 10 it is rather close to normal.

It also turns out that the chi-square distribution with d degrees of freedom is a gamma distribution, with r = d/2 and $\lambda = 0.5$.

2.3.5.2 Example: Network Buffer

Suppose in a network context (not our ALOHA example), a node does not transmit until it has accumulated five messages in its buffer. Suppose the times between message arrivals are independent and exponentially distributed with mean 100 milliseconds. Let's find the probability that more than 552 ms will pass before a transmission is made, starting with an empty buffer.

Let $X_1$ be the time until the first message arrives, $X_2$ the time from then to the arrival of the second message, and so on. Then the time until we accumulate five messages is $Y = X_1 + \ldots + X_5$. Then from the definition of the gamma family, we see that Y has a gamma distribution with r = 5 and $\lambda = 0.01$. Then

$$P(Y > 552) = \int_{552}^{\infty} \frac{1}{4!}\, 0.01^5\, t^4 e^{-0.01t}\, dt \tag{2.43}$$

This integral could be evaluated via repeated integration by parts, but let's use R instead:

    > 1 - pgamma(552,5,0.01)
    [1] 0.3544101

2.3.5.3 Importance in Modeling

As seen in (2.37), sums of exponentially distributed random variables often arise in applications. Such sums have gamma distributions.

You may ask what the meaning is of a gamma distribution in the case of noninteger r. There is no particular meaning, but when we have a real data set, we often wish to summarize it by fitting a parametric family to it, meaning that we try to find a member of the family that approximates our data well.
In this regard, the gamma family provides us with densities which rise near t = 0, then gradually decrease to 0 as t becomes large, so the family is useful if our data seem to look like this. Graphs of some gamma densities are shown in Figure 2.2.

[Figure 2.2: Various Gamma Densities]

2.4 Describing "Failure"

In addition to density functions, another useful description of a distribution is its hazard function. Again think of the lifetimes of light bulbs, not necessarily assuming an exponential distribution. Intuitively, the hazard function states the likelihood of a bulb failing in the next short interval of time, given that it has lasted up to now. To understand this, let's first talk about a certain property of the exponential distribution family.

2.4.1 Memoryless Property

One of the reasons the exponential family of distributions is so famous is that it has a property that makes many practical stochastic models mathematically tractable: The exponential distributions are memoryless. What this means is that for positive t and u

$$P(W > t + u \mid W > t) = P(W > u) \tag{2.44}$$

Let's derive this:

\begin{align}
P(W > t + u \mid W > t) &= \frac{P(W > t + u \text{ and } W > t)}{P(W > t)} \tag{2.45}\\
&= \frac{P(W > t + u)}{P(W > t)} \tag{2.46}\\
&= \frac{\int_{t+u}^{\infty} \lambda e^{-\lambda s}\, ds}{\int_{t}^{\infty} \lambda e^{-\lambda s}\, ds} \tag{2.47}\\
&= e^{-\lambda u} \tag{2.48}\\
&= P(W > u) \tag{2.49}
\end{align}

We say that this means that "time starts over" at time t, or that W "doesn't remember" what happened before time t.

It is difficult for the beginning modeler to fully appreciate the memoryless property. Let's make it concrete. Consider the problem of waiting to cross the railroad tracks on Eighth Street in Davis, just west of J Street. One cannot see down the tracks, so we don't know whether the end of the train will come soon or not.

If we are driving, the issue at hand is whether to turn off the car's engine. If we leave it on, and the end of the train does not come for a long time, we will be wasting gasoline; if we turn it off, and the end does come soon, we will have to start the engine again, which also wastes gasoline. Or, we may be deciding whether to stay there, or go way over to the Covell Rd. railroad overpass.
Suppose our policy is to turn off the engine if the end of the train won't come for at least s seconds. Suppose also that we arrived at the railroad crossing just when the train first arrived, and we have already waited for r seconds. Will the end of the train come within s more seconds, so that we will keep the engine on? If the length of the train were exponentially distributed (if there are typically many cars, we can model it as continuous even though it is discrete), Equation (2.44) would say that the fact that we have waited r seconds so far is of no value at all in predicting whether the train will end within the next s seconds. The chance of it lasting at least s more seconds right now is no more and no less than the chance it had of lasting at least s seconds when it first arrived.

The memorylessness of exponential distributions implies that a Poisson process N(t) also has a "time starts over" property (called the Markov property). Recall our example in Section 2.3.4.2 in which N(t) was the number of light bulb burnouts up to time t. The memorylessness property means that if we start counting afresh from time, say z, then the number of burnouts after time z, i.e. Q(u) = N(z+u) - N(z), also is a Poisson process. In other words, Q(u) has a Poisson distribution with mean $\lambda u$. Moreover, Q(u) is independent of N(t) for any t < z.

2.4.2 Hazard Functions

2.4.2.1 Basic Concepts

Suppose the lifetimes of light bulbs L were discrete. Suppose a particular bulb has already lasted 80 hours. The probability of it failing in the next hour would be

$$P(L = 81 \mid L > 80) = \frac{P(L = 81)}{P(L > 80)} = \frac{p_L(81)}{1 - F_L(80)} \tag{2.50}$$

By analogy, for continuous L we define

$$h_L(t) = \frac{f_L(t)}{1 - F_L(t)} \tag{2.51}$$

Again, the interpretation is that $h_L(t)$ is the likelihood of the item failing very soon after t, given that it has lasted t amount of time.

Note carefully that the word "failure" here should not be taken literally. In our Davis railroad crossing example above, "failure" means that the train ends, a "failure" which those of us who are waiting will welcome!
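Definition (2.51) translates almost word for word into R. The following is a minimal sketch, with the helper name hazard() and the parameter values being our own illustrative assumptions. It evaluates the hazard function for an exponential density and, for contrast, an Erlang density with r = 2, both with the same rate.

```r
# hazard function (2.51): density divided by the probability of surviving past t;
# dens and cdf are passed in as functions of t
hazard <- function(dens, cdf, t) dens(t) / (1 - cdf(t))

tvals <- c(1, 5, 10, 20, 40)
lambda <- 0.1                                   # assumed rate parameter
hexp <- hazard(function(t) dexp(t, rate = lambda),
               function(t) pexp(t, rate = lambda), tvals)
hgam <- hazard(function(t) dgamma(t, shape = 2, rate = lambda),
               function(t) pgamma(t, shape = 2, rate = lambda), tvals)
print(rbind(tvals, hexp, hgam))
```

The exponential row stays at λ for every t, while the Erlang row increases with t, following λ²t/(1 + λt).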
Since we know that exponentially distributed random variables are memoryless, we would expect intuitively that their hazard functions are constant. We can verify this by evaluating (2.51) for an exponential density with parameter $\lambda$; sure enough, the hazard function is constant, with value $\lambda$.

The reader should verify that in contrast to an exponential distribution's constant failure rate, a uniform distribution has an increasing failure rate (IFR). Some distributions have decreasing failure rates, while most have non-monotone rates.

Hazard function models have been used extensively in software testing. Here "failure" is the discovery of a bug, and quantities of interest include the mean time until the next bug is discovered, and the total number of bugs.

People have what is called a "bathtub-shaped" hazard function. It is high near 0 (reflecting infant mortality) and after, say, 70, but is low and rather flat in between.

You may have noticed that the right-hand side of (2.51) is the derivative of $-\ln[1 - F_L(t)]$. Therefore

$$\int_0^t h_L(s)\, ds = -\ln[1 - F_L(t)] \tag{2.52}$$

so that

$$1 - F_L(t) = e^{-\int_0^t h_L(s)\, ds} \tag{2.53}$$

and thus¹⁰

$$f_L(t) = h_L(t)\, e^{-\int_0^t h_L(s)\, ds} \tag{2.54}$$

In other words, just as we can find the hazard function knowing the density, we can also go in the reverse direction. This establishes that there is a one-to-one correspondence between densities and hazard functions.

This may guide our choice of parametric family for modeling some random variable. We may not only have a good idea of what general shape the density takes on, but may also have an idea of what the hazard function looks like. These two pieces of information can help guide us in our choice of model.

2.4.3 Example: Software Reliability Models

Hazard function models have been used successfully to model the "arrivals" (i.e. discoveries) of bugs in software. Questions that arise are, for instance, "When are we ready to ship?", meaning when can we believe with some confidence that most bugs have been found?
Typically one collects data on bug discoveries from a number of projects of similar complexity, and estimates the hazard function from that data.

See for example Accurate Software Reliability Estimation, by Jason Allen Denton, Dept. of Computer Science, Colorado State University, 1999, and the many references therein.

¹⁰ Recall that the derivative of the integral of a function is the original function!

2.5 A Cautionary Tale: the Bus Paradox

Suppose you arrive at a bus stop, at which buses arrive according to a Poisson process with intensity parameter 0.1, i.e. 0.1 arrival per minute. Recall that this means that the interarrival times have an exponential distribution with mean 10 minutes. What is the expected value of your waiting time until the next bus?

Well, our first thought might be that since the exponential distribution is memoryless, "time starts over" when we reach the bus stop. Therefore our mean wait should be 10.

On the other hand, we might think that on average we will arrive halfway between two consecutive buses. Since the mean time between buses is 10 minutes, the halfway point is at 5 minutes. Thus it would seem that our mean wait should be 5 minutes.

Which analysis is correct? Actually, the correct answer is 10 minutes. So, what is wrong with the second analysis, which concluded that the mean wait is 5 minutes? The problem is that the second analysis did not take into account the fact that although inter-bus intervals have an exponential distribution with mean 10, the particular inter-bus interval that we encounter is special.

Imagine a bag full of sticks, of different lengths. We reach into the bag and choose a stick at random. The key point is that not all pieces are equally likely to be chosen; the longer pieces will have a greater chance of being selected.¹¹ The formal name for this is length-biased sampling.

¹¹ Another example was suggested to me by UCD grad student Shubhabrata Sengupta: Think of a large parking lot on which hundreds of buckets are placed of various diameters. We throw a ball high into the sky, and see what size bucket it lands in. Here the density would be proportional to the square of the diameter.

Similarly, the particular inter-bus interval that we hit is likely to be a longer interval. To see this, suppose we observe the comings and goings of buses for a very long time, and plot their arrivals on a timeline on a wall. In some cases two successive marks on the timeline are close together, sometimes far apart. If we were to stand far from the wall and throw a dart at it, we would hit the interval between some pair of consecutive marks. Intuitively we are more apt to hit a wider interval than a narrower one.

Once one recognizes this and carefully finds the density of that interval, one discovers that the interval does indeed tend to be longer, so much so that the expected value of this interval is 20 minutes! In other words, if we throw a dart at the wall, say, 1000 times, the mean of the 1000 intervals we would hit would be about 20. This is in contrast to the mean of all of the intervals on the wall, which would be 10.

Thus the halfway point comes at 10 minutes, consistent with the analysis which appealed to the memoryless property.

Actually, we can intuitively reason out what the density is of the length of the particular inter-bus interval that we hit, as follows. First consider the bag-of-sticks example, and suppose (somewhat artificially) that stick length X is a discrete random variable. Let Y denote the length of the stick that we pick. Suppose that, say, stick lengths 2 and 6 each comprise 10% of the sticks in the bag, i.e.

$$p_X(2) = p_X(6) = 0.1 \tag{2.55}$$

Intuitively, one would then reason that

$$p_Y(6) = 3\, p_Y(2) \tag{2.56}$$

In other words, the sticks of length 2 are just as numerous as those of length 6, but since the latter are three times as long, they should have triple the chance of being chosen.

Note that this is not some absolute physical law. Different people might draw sticks from the bag in different ways. But it is a reasonable model.

Now let X denote interarrival times between buses, and Y denote the interarrival time that we hit. The analog of (2.56) would be that $f_Y(t)$ is proportional to $t f_X(t)$, i.e.

$$f_Y(t) = c\, t f_X(t) \tag{2.57}$$

for some constant c. Recalling that $f_Y$ must integrate to 1, we see that

$$c = \left(\int_0^\infty t f_X(t)\, dt\right)^{-1} \tag{2.58}$$

But that integral is just EX! The latter quantity is 10, and

$$f_X(t) = 0.1\, e^{-0.1t} \tag{2.59}$$
So,

$$f_Y(t) = 0.01\, t e^{-0.1t} \tag{2.60}$$

You may recognize this as an Erlang density, namely (2.40) with r = 2 and $\lambda = 0.1$; its mean $r/\lambda = 20$ matches the expected interval length claimed above.

2.6 Choosing a Model

The parametric families presented here are often used in the real world. As indicated previously, this may be done on an empirical basis. We would collect data on a random variable X, and plot the frequencies of its values in a histogram. If for example the plot looks roughly like the curves in Figure 2.2, we could choose this as the family for our model.

Or, our choice may arise from theory. If for instance our knowledge of the setting in which we are working says that our distribution is memoryless, that forces us to use the exponential density family.

In either case, the question as to which member of the family we choose will be settled by using some kind of procedure which finds the member of the family which best fits our data. We will discuss this in detail in our chapters on statistics.

Note that we may choose not to use a parametric family at all. We may simply find that our data does not fit any of the common parametric families (there are many others than those presented here) very well. Procedures that do not assume any parametric family are termed nonparametric.

2.7 A General Method for Simulating a Random Variable

Suppose we wish to simulate a random variable X with cdf $F_X$ for which there is no R function. This can be done via $F_X^{-1}(U)$, where U has a U(0,1) distribution. In other words, we call runif() and then plug the result into the inverse of the cdf of X. Here "inverse" is in the sense that, for instance, squaring and "square-rooting," exp() and ln(), etc. are inverse operations of each other.

For example, say X has the density 2t on (0,1). Then $F_X(t) = t^2$, so $F_X^{-1}(s) = s^{0.5}$. We can then generate X in R as sqrt(runif(1)). Here's why:

For brevity, denote $F_X^{-1}$ as G and $F_X$ as H. Our generated random variable is G(U). Then

$$P[G(U) \leq t] = P[U \leq G^{-1}(t)] = P[U \leq H(t)] = H(t) \tag{2.61}$$

In other words, the cdf of G(U) is $F_X$! So, G(U) has the same distribution as X.
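The inverse-cdf recipe for the density 2t on (0,1) can be tried out directly. The sketch below is illustrative only; the sample size and the check points in tvals are arbitrary choices. It generates many values via sqrt(runif()), then compares the empirical cdf with t² and the sample mean with EX, which for this density is the integral of t times 2t over (0,1), i.e. 2/3.

```r
set.seed(31415)                 # arbitrary seed, for reproducibility
n <- 100000
u <- runif(n)                   # U(0,1) variates
x <- sqrt(u)                    # inverse cdf applied to U, since the cdf is t^2

tvals <- c(0.25, 0.5, 0.75)     # arbitrary points at which to check the cdf
empcdf <- sapply(tvals, function(t) mean(x <= t))
print(rbind(tvals, empcdf, theory = tvals^2))
xbar <- mean(x)                 # compare with EX = 2/3
print(xbar)
```

The same pattern applies to any cdf whose inverse can be written down: replace sqrt() by that inverse.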
Note that this method, though valid, is not necessarily practical, since computing $F_X^{-1}$ may not be easy.

Exercises

1. Suppose X has a uniform distribution on (-1,1), and let $Y = X^2$. Find $f_Y$. (Hint: First find $F_Y(t)$.)

2. "All that glitters is not gold," and not every bell-shaped density is normal. The family of Cauchy distributions, having density

$$f_X(t) = \frac{1}{\pi c} \cdot \frac{1}{1 + \left(\frac{t-b}{c}\right)^2}, \quad -\infty < t < \infty$$

6. Suppose a manufacturer of some electronic component finds that its lifetime is exponentially distributed with mean 10000 hours. They give a refund if the item fails before 500 hours. Let N be the number of items they have sold, up to and including the one on which they make the first refund. Find EN and Var(N).

7. For the density $a \exp(-bt)$, $t > 0$, show that we must have a = b. Then show that the mean and variance for this distribution are $1/b$ and $1/b^2$, respectively.

8. Consider the "random bucket" example in Footnote 11. Suppose bucket diameter D, measured in meters, has a uniform distribution on (1,2). Let W denote the diameter of the bucket in which the tossed ball lands.

(a) Find the density, mean and variance of W, and also P(W > 1.5).

(b) Write an R function that will generate random variates having the distribution of W.

9. Suppose that computer roundoff error in computing the square roots of numbers in a certain range is distributed uniformly on (-0.5,0.5), and that we will be computing the sum of n such square roots. Find a number c such that the probability is approximately 95% that the sum is in error by no more than c.

A certain public parking garage charges parking fees of $1.50 for the first hour or fraction thereof, and $1 per hour after that. So, someone who stays 57 minutes pays $1.50, someone who parks for one hour and 12 minutes pays $1.70, and so on. Suppose parking times T are exponentially distributed with mean 1.5 hours. Let W denote the total fee paid. Find EW and Var(W).

10. In Section 2.4.1, we showed that the exponential distribution is memoryless. In fact, it is the only continuous distribution with that property. Show that the U(0,1) distribution does NOT have that property.
To do this, evaluate both sides of (2.44).

Chapter 3

Multivariate Probability Models

3.1 Multivariate Distributions

3.1.1 Why Are They Needed?

Most applications of probability and statistics involve the interaction between variables. For instance, when you buy a book at Amazon.com, the software will likely inform you of other books that people bought in conjunction with the one you selected. Amazon is relying on the fact that sales of certain pairs or groups of books are correlated.

Individual pmfs $p_X$ and densities $f_X$ don't describe these correlations. We need something more. We need ways to describe multivariate distributions.

3.1.2 Discrete Case

Say we roll a blue die and a yellow one. Let X and Y denote the number of dots which appear on the blue and yellow dice, respectively, and let S denote the total number of dots appearing on the two dice. We will not discuss Y much here, focusing on X and S.

Recall that the distribution of X is defined to be a list of all the values X takes on, and their associated probabilities:

$$\left\{\left(1,\tfrac{1}{6}\right), \left(2,\tfrac{1}{6}\right), \left(3,\tfrac{1}{6}\right), \left(4,\tfrac{1}{6}\right), \left(5,\tfrac{1}{6}\right), \left(6,\tfrac{1}{6}\right)\right\} \tag{3.1}$$

We can write this more compactly but equivalently by defining X's probability mass function (pmf):

$$p_X(i) = P(X = i) = \frac{1}{6}, \quad i = 1, 2, \ldots, 6 \tag{3.2}$$

The distribution of S is defined similarly, either as a list,

$$\left\{\left(2,\tfrac{1}{36}\right), \left(3,\tfrac{2}{36}\right), \left(4,\tfrac{3}{36}\right), \left(5,\tfrac{4}{36}\right), \left(6,\tfrac{5}{36}\right), \left(7,\tfrac{6}{36}\right), \left(8,\tfrac{5}{36}\right), \left(9,\tfrac{4}{36}\right), \left(10,\tfrac{3}{36}\right), \left(11,\tfrac{2}{36}\right), \left(12,\tfrac{1}{36}\right)\right\} \tag{3.3}$$

or via its pmf $p_S$.¹

But it may also be important to describe how X and S vary jointly. For example, intuitively we would feel that X and S are positively correlated. How do we describe their joint variation?
To do this, we define the bivariate probability mass function of X and S. Just as the univariate pmf of X is defined to be $p_X(i) = P(X = i)$, we define the bivariate pmf as

$$p_{X,S}(i,j) = P(X = i, S = j) = \frac{1}{36}, \quad i = 1, 2, \ldots, 6;\ j = i+1, \ldots, i+6 \tag{3.4}$$

Expected values are calculated in the analogous manner. Recall that for a function g of X

$$E[g(X)] = \sum_i g(i)\, p_X(i) \tag{3.5}$$

So, for any function g of two discrete random variables U and V, define

$$E[g(U,V)] = \sum_{i,j} g(i,j)\, p_{U,V}(i,j) \tag{3.6}$$

For instance:

$$E(XS) = \sum_{i=1}^{6} \sum_{j=2}^{12} ij\, p_{X,S}(i,j) = \sum_{i=1}^{6} \sum_{j=i+1}^{i+6} ij\, \frac{1}{36} \tag{3.7}$$

The univariate pmfs, called marginal pmfs, can of course be recovered from the bivariate pmf. To get $p_X$ from $p_{X,S}$, we sum over the values of S. For example, let's find $p_X(3)$, which is the probability that X = 3. How could the event X = 3 happen? Well, S could be anywhere from 4 to 9, and each of those six (X,S) pairs has probability 1/36. So,

$$p_X(3) = \sum_{j=4}^{9} p_{X,S}(3,j) = 6 \cdot \frac{1}{36} = \frac{1}{6} \tag{3.8}$$

That is consistent with our univariate calculation of $p_X(3)$, as of course it should be.

We get consistent results for expected values too. Treating X as a function of X and S, we have

$$E(X) = \sum_{i=1}^{6} \sum_{j=i+1}^{i+6} i\, p_{X,S}(i,j) \tag{3.9}$$

but the right-hand side (RHS) of (3.9) reduces to

$$E(X) = \sum_{i=1}^{6} i \sum_{j=i+1}^{i+6} p_{X,S}(i,j) = \sum_{i=1}^{6} i\, p_X(i) \tag{3.10}$$

from (3.8). The last expression in (3.10) is EX as defined in the univariate setting, so everything is indeed consistent.

¹ Recall that the convention for denoting pmfs is to use the letter 'p' with a subscript indicating the random variable.

3.1.3 Multivariate Densities

3.1.3.1 Motivation and Definition

Extending our previous definition of cdf for a single variable, we define the two-dimensional cdf for a pair of random variables X and Y as

$$F_{X,Y}(u,v) = P(X \leq u \text{ and } Y \leq v) \tag{3.11}$$

If X and Y were discrete, we would evaluate that cdf via a double sum of their bivariate pmf. You may have guessed by now that the analog for continuous random variables would be a double integral, and it is. The integrand is the bivariate density:

$$f_{X,Y}(u,v) = \frac{\partial}{\partial u} \frac{\partial}{\partial v} F_{X,Y}(u,v) \tag{3.12}$$

Densities in higher dimensions are defined similarly.

As in the univariate case, a bivariate density shows which regions of the X-Y plane occur more frequently, and which occur less frequently.
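The dice calculations of Section 3.1.2 can be reproduced by storing the bivariate pmf (3.4) as a matrix. This is a small sketch, with variable names of our own choosing. Rows index the value of X and columns the value of S; the column for S = 1 is kept, with probability 0, only so that column number and value of S coincide.

```r
# bivariate pmf of (X,S): entry [i,j] is P(X = i, S = j)
pxs <- matrix(0, nrow = 6, ncol = 12)
for (i in 1:6) {
  for (j in (i + 1):(i + 6)) pxs[i, j] <- 1 / 36    # as in (3.4)
}

px <- rowSums(pxs)                     # marginal pmf of X: sum over the values of S
ps <- colSums(pxs)                     # marginal pmf of S: sum over the values of X
exs <- sum(outer(1:6, 1:12) * pxs)     # E(XS): sum of i*j*p(i,j), as in (3.7)
ex <- sum((1:6) * px)                  # EX computed from the marginal, as in (3.10)
print(px)
print(ps)
print(c(exs, ex))
```

The vector ps reproduces the list (3.3), and px[3] reproduces the value 1/6 found in (3.8).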
3.1.3.2 Use of Multivariate Densities in Finding Probabilities and Expected Values

Again by analogy, for any region A in the X-Y plane,

$$P[(X,Y) \in A] = \iint_A f_{X,Y}(u,v)\, du\, dv \tag{3.13}$$

So, just as probabilities involving a single variable X are found by integrating $f_X$ over the region in question, for probabilities involving X and Y, we take the double integral of $f_{X,Y}$ over that region.

Also, for any function g(X,Y),

$$E[g(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(u,v)\, f_{X,Y}(u,v)\, du\, dv \tag{3.14}$$

where it must be kept in mind that $f_{X,Y}(u,v)$ may be 0 in some regions of the u-v plane. Note that there is no set A here as in (3.13). See (3.18) below for an example.

Finding marginal densities is also analogous to the discrete case, e.g.

$$f_X(s) = \int_t f_{X,Y}(s,t)\, dt \tag{3.15}$$

Other properties and calculations are analogous as well. For instance, the double integral of the density is equal to 1, and so on.

3.1.3.3 Example: a Triangular Distribution

Suppose (X,Y) has the density

$$f_{X,Y}(s,t) = 8st, \quad 0 < t < s < 1 \tag{3.16}$$

Here s represents X and t represents Y. The gray area is the region in which (X,Y) ranges. The subregion A in (3.13), corresponding to the event X+Y > 1, is shown in the striped area in the figure.

The dark vertical line shows all the points (s,t) in the striped region for a typical value of s in the integration process. Since s is the variable in the outer integral, consider it fixed for the time being and ask where t will range for that s. We see that for X = s, Y will range from 1-s to s; thus we set the inner integral's limits to 1-s and s. Finally, we then ask where s can range. The inner range is nonempty only when 1-s < s, so we see from the picture that s ranges from 0.5 to 1. Thus those are the limits for the outer integral.
$$P(X + Y > 1) = \int_{0.5}^{1} \int_{1-s}^{s} 8st\, dt\, ds = \int_{0.5}^{1} 8s\,(s - 0.5)\, ds = \frac{5}{6} \tag{3.17}$$

Following (3.14),

$$E\left[\sqrt{X + Y}\right] = \int_0^1 \int_0^s \sqrt{s + t}\ 8st\, dt\, ds \tag{3.18}$$

Let's find the marginal density $f_Y(t)$. So we must integrate out the s in (3.16):

$$f_Y(t) = \int_t^1 8st\, ds = 4t - 4t^3 \tag{3.19}$$

3.2 More on Co-variation of Random Variables

3.2.1 Covariance

The covariance between random variables X and Y is defined as

$$Cov(X,Y) = E[(X - EX)(Y - EY)] \tag{3.20}$$

Suppose that typically when X is larger than its mean, Y is also larger than its mean, and vice versa for below-mean values. Then (3.20) will likely be positive. In other words, if X and Y are positively correlated (a term we will define formally later but keep intuitive for now), then their covariance is positive. Similarly, if X is often smaller than its mean whenever Y is larger than its mean, the covariance and correlation between them will be negative. All of this is roughly speaking, of course, since it depends on how much X is larger or smaller than its mean, etc.

Covariance is linear in both arguments:

$$Cov(aX + bY, cU + dV) = ac\,Cov(X,U) + ad\,Cov(X,V) + bc\,Cov(Y,U) + bd\,Cov(Y,V) \tag{3.21}$$

for any constants a, b, c and d. Also

$$Cov(X, Y + q) = Cov(X,Y) \tag{3.22}$$

for any constant q, and so on.

Note that

$$Cov(X,X) = Var(X) \tag{3.23}$$

for any X with finite variance.

Also, here is a shortcut way to find the covariance:

$$Cov(X,Y) = E(XY) - EX \cdot EY \tag{3.24}$$

The proof will help you review some important issues, namely (a) E(U+V) = EU + EV, (b) E(cU) = c EU and Ec = c for any constant c, and (c) EX and EY are constants in (3.24).
\begin{align}
Cov(X,Y) &= E[(X - EX)(Y - EY)] && \text{(definition)} \tag{3.25}\\
&= E[XY - EX \cdot Y - EY \cdot X + EX \cdot EY] && \text{(algebra)} \tag{3.26}\\
&= E(XY) + E[-EX \cdot Y] + E[-EY \cdot X] + E[EX \cdot EY] && \text{(E[U+V] = EU + EV)} \tag{3.27}\\
&= E(XY) - EX \cdot EY && \text{(E[cU] = c EU, Ec = c)} \tag{3.28}
\end{align}

Another important property:

$$Var(X + Y) = Var(X) + Var(Y) + 2\,Cov(X,Y) \tag{3.29}$$

This comes from (3.24), the relation $Var(X) = E(X^2) - (EX)^2$ and the corresponding one for Y. Just substitute and do the algebra.

3.2.2 Correlation

Covariance does measure how much or little X and Y vary together, but it is hard to decide whether a given value of covariance is "large" or not. For instance, if we are measuring lengths in feet and change to inches, then (3.21) shows that the covariance will increase by a factor of $12^2 = 144$. Thus it makes sense to scale covariance according to the variables' standard deviations. Accordingly, the correlation between two random variables X and Y is defined by

$$\rho(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}} \tag{3.30}$$

So, correlation is unitless, i.e. does not involve units like feet, pounds, etc.

It is shown later in this chapter that

• $-1 \leq \rho(X,Y) \leq 1$

• $|\rho(X,Y)| = 1$ if and only if X and Y are exact linear functions of each other, i.e. Y = cX + d for some constants c and d

3.2.3 Example: Continuation of Section 3.1.3.3

Let's find the correlation between X and Y in the example in Section 3.1.3.3.
\begin{align}
E(XY) &= \int_0^1 \int_0^s st \cdot 8st\, dt\, ds \tag{3.31}\\
&= \int_0^1 8s^2 \cdot s^3/3\, ds \tag{3.32}\\
&= \frac{4}{9} \tag{3.33}
\end{align}

\begin{align}
f_X(s) &= \int_0^s 8st\, dt \tag{3.34}\\
&= 4st^2 \Big|_0^s \tag{3.35}\\
&= 4s^3 \tag{3.36}
\end{align}

\begin{align}
f_Y(t) &= \int_t^1 8st\, ds \tag{3.37}\\
&= 4t \cdot s^2 \Big|_t^1 \tag{3.38}\\
&= 4t(1 - t^2) \tag{3.39}
\end{align}

$$EX = \int_0^1 s \cdot 4s^3\, ds = \frac{4}{5} \tag{3.40}$$

$$E(X^2) = \int_0^1 s^2 \cdot 4s^3\, ds = \frac{2}{3} \tag{3.41}$$

$$Var(X) = \frac{2}{3} - \left(\frac{4}{5}\right)^2 = 0.027 \tag{3.42}$$

$$EY = \int_0^1 t\,(4t - 4t^3)\, dt = \frac{4}{3} - \frac{4}{5} = \frac{8}{15} \tag{3.43}$$

$$E(Y^2) = \int_0^1 t^2 (4t - 4t^3)\, dt = 1 - \frac{4}{6} = \frac{1}{3} \tag{3.44}$$

$$Var(Y) = \frac{1}{3} - \left(\frac{8}{15}\right)^2 = 0.049 \tag{3.45}$$

$$Cov(X,Y) = \frac{4}{9} - \frac{4}{5} \cdot \frac{8}{15} = 0.018 \tag{3.46}$$

$$\rho(X,Y) = \frac{0.018}{\sqrt{0.027 \cdot 0.049}} = 0.49 \tag{3.47}$$

3.2.4 Example: a Catchup Game

Consider the following simple game. There are two players, who take turns playing. One's position after k turns is the sum of one's winnings in those turns. Basically, a turn consists of generating a random U(0,1) variable, with one difference: if that player is currently losing, he gets a bonus of 0.2 to help him catch up.

Let X and Y be the total winnings of the two players after 10 turns. Intuitively, X and Y should be positively correlated, due to the 0.2 bonus which brings them closer together. Let's see if this is true.

Though very simply stated, this problem is far too tough to solve mathematically in an elementary course (or even an advanced one). So, we will use simulation. In addition to finding the correlation between X and Y, we'll also find $F_{X,Y}(5.8, 5.2)$.
```r
# one turn for a player whose current total is a, against an opponent at b
taketurn <- function(a,b) {
   win <- runif(1)
   if (a >= b) return(win)
   else return(win+0.2)  # the player is behind, so gets the bonus
}

# 2-dim. cdf, estimated from the rows of the matrix xy
cdf2 <- function(xy,t1,t2) {
   # drop=FALSE keeps tmp a matrix even if only one row qualifies
   tmp <- xy[xy[,1] <= t1 & xy[,2] <= t2, , drop=FALSE]
   return(nrow(tmp)/nrow(xy))
}

nreps <- 10000
nturns <- 10
xyvals <- matrix(nrow=nreps,ncol=2)
for (rep in 1:nreps) {
   x <- 0
   y <- 0
   for (turn in 1:nturns) {
      # x's turn
      x <- x + taketurn(x,y)
      # y's turn
      y <- y + taketurn(y,x)
   }
   xyvals[rep,] <- c(x,y)
}
print(cor(xyvals[,1],xyvals[,2]))
print(cdf2(xyvals,5.8,5.2))
```

The output is 0.65 and 0.03. So, X and Y are indeed positively correlated as we had surmised.

Note the use of R's built-in function cor() to compute correlation. Note too that the bonus makes the two players' winnings leapfrog over each other. Without it, we would have EX = EY = 5.0, and \( F_{X,Y}(5.8, 5.2) \) somewhat greater than 0.25. The latter would be the value of \( F_{X,Y}(5.0, 5.0) \). But the bonus moves the distributions of X and Y more toward 10.0.

3.3 Sets of Independent Random Variables

Great mathematical tractability can be achieved by assuming that the \( X_i \) in a random vector \( X = (X_1, ..., X_k) \) are independent. In many applications, this is a reasonable assumption.

3.3.1 Properties

In the next few sections, we will look at some commonly-used properties of sets of independent random variables. For simplicity, consider the case k = 2, with X and Y being independent (scalar) random variables.

3.3.1.1 Probability Mass Functions and Densities Factor

If X and Y are independent, then

\[ p_{X,Y} = p_X \, p_Y \tag{3.48} \]

in the discrete case, and

\[ f_{X,Y} = f_X \, f_Y \tag{3.49} \]

in the continuous case. In other words, the joint pmf/density is the product of the marginal ones.

This is easily seen in the discrete case:

\begin{align}
p_{X,Y}(i,j) &= P(X = i \text{ and } Y = j) && \text{(definition)} \tag{3.50} \\
&= P(X = i) \, P(Y = j) && \text{(independence)} \tag{3.51} \\
&= p_X(i) \, p_Y(j) && \text{(definition)} \tag{3.52}
\end{align}

Here is the proof for the continuous case:

\begin{align}
f_{X,Y}(u,v) &= \frac{\partial}{\partial u} \frac{\partial}{\partial v} F_{X,Y}(u,v) \tag{3.53} \\
&= \frac{\partial}{\partial u} \frac{\partial}{\partial v} P(X \le u \text{ and } Y \le v) \tag{3.54} \\
&= \frac{\partial}{\partial u} \frac{\partial}{\partial v} \left[ P(X \le u) \, P(Y \le v) \right] \tag{3.55} \\
&= \frac{\partial}{\partial u} \frac{\partial}{\partial v} F_X(u) \, F_Y(v) \tag{3.56} \\
&= f_X(u) \, f_Y(v) \tag{3.57}
\end{align}

3.3.1.2 Expected Values Factor

If X and Y are independent, then

\[ E(XY) = E(X) \, E(Y) \tag{3.58} \]

To prove this, use (3.48) and (3.49) for the discrete and continuous cases.
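As a quick check of (3.48) and (3.58), here is a minimal sketch for a hypothetical pair of independent fair dice. The dice and all variable names are assumptions chosen for illustration. The joint pmf is built as the product of the marginals, and E(XY) is then computed directly from that joint pmf.

```r
# marginal pmf of one fair die (hypothetical example, not from the text)
vals <- 1:6
p <- rep(1/6, 6)

# under independence the joint pmf factors as in (3.48):
# joint[i,j] = p_X(i) * p_Y(j)
joint <- outer(p, p)

# E(XY) computed from the joint pmf; EX and EY from the marginals
exy <- sum(outer(vals, vals) * joint)
ex <- sum(vals * p)
ey <- sum(vals * p)

# the two printed values should agree, as (3.58) asserts
print(c(exy, ex * ey))
```

Both quantities are sums over a finite pmf, so no simulation error is involved here.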
3.3.1.3 Covariance Is 0

If X and Y are independent, then from (3.58) and (3.24), we have

\[ \mathrm{Cov}(X,Y) = 0 \tag{3.59} \]

and thus \( \rho(X,Y) = 0 \) as well.

However, the converse is false. A counterexample is the random pair (V,W) that is uniformly distributed on the unit disk, \( \{(s,t) : s^2 + t^2 \le 1\} \).

3.3.1.4 Variances Add

If X and Y are independent, then from (3.29) and (3.58), we have

\[ \mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y). \tag{3.60} \]

3.3.1.5 Convolution

If X and Y are nonnegative, continuous random variables, and we set Z = X+Y, then the density of Z is the convolution of the densities of X and Y:

\[ f_Z(t) = \int_0^t f_X(s) \, f_Y(t-s) \; ds \tag{3.61} \]

You can get intuition on this by considering the discrete case. Say U and V are nonnegative integer-valued random variables, and set W = U+V. Let's find \( p_W \):

\begin{align}
p_W(k) &= P(W = k) && \text{(by definition)} \tag{3.62} \\
&= P(U + V = k) && \text{(substitution)} \tag{3.63} \\
&= \sum_{i=0}^{k} P(U = i \text{ and } V = k-i) && \text{("In what ways can it happen?")} \tag{3.64} \\
&= \sum_{i=0}^{k} p_{U,V}(i, k-i) && \text{(by definition)} \tag{3.65} \\
&= \sum_{i=0}^{k} p_U(i) \, p_V(k-i) && \text{(from Section 3.3.1.1)} \tag{3.66}
\end{align}

Review the analogy between densities and pmfs in our unit on continuous random variables, Section 2.2.1, and then see how (3.61) is analogous to (3.62) through (3.66):

• k in (3.62) is analogous to t in (3.61)

• the limits 0 to k in (3.66) are analogous to the limits 0 to t in (3.61)

• the expression k-i in (3.66) is analogous to t-s in (3.61)

• and so on

3.3.2 Examples

3.3.2.1 Example: Dice

In Section 3.2.1, we speculated that the correlation between X, the number on the blue die, and S, the total of the two dice, was positive. Let's compute it.

Write S = X+Y, where Y is the number on the yellow die. Then using the properties of covariance presented above, we have that

\begin{align}
\mathrm{Cov}(X,S) &= \mathrm{Cov}(X, X+Y) && \text{(by definition)} \tag{3.67} \\
&= \mathrm{Cov}(X,X) + \mathrm{Cov}(X,Y) && \text{(from (3.21))} \tag{3.68} \\
&= \mathrm{Var}(X) + 0 && \text{(from (3.23), (3.59))} \tag{3.69}
\end{align}

Also, from (3.60),

\[ \mathrm{Var}(S) = \mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y) \tag{3.70} \]

But Var(Y) = Var(X). So the correlation between X and S is

\[ \rho(X,S) = \frac{\mathrm{Var}(X)}{\sqrt{\mathrm{Var}(X)}\,\sqrt{2\,\mathrm{Var}(X)}} = 0.707 \tag{3.71} \]

Since correlation is at most 1 in absolute value, 0.707 is considered a fairly high correlation. Of course, we did expect X and S to be highly correlated.
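The value in (3.71) can also be checked by simulation, in the same spirit as the catchup game code. The following is only a sketch; the seed, the number of repetitions and the variable names are arbitrary choices made for illustration.

```r
set.seed(1)      # arbitrary seed, used only for reproducibility
nreps <- 100000
x <- sample(1:6, nreps, replace=TRUE)   # blue die
y <- sample(1:6, nreps, replace=TRUE)   # yellow die, independent of the blue one
s <- x + y                              # total of the two dice

rho_hat <- cor(x, s)   # sample correlation between X and S
print(rho_hat)         # should be near 1/sqrt(2), the value in (3.71)
```

The sample correlation will not match 0.707 exactly, but with this many repetitions the discrepancy should be small.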
3.3.2.2 Example: Ethernet

Consider this network, essentially Ethernet. Here nodes can send at any time. Transmission time is 0.1 seconds. Nodes can also "hear" each other; one node will not start transmitting if it hears that another has a transmission in progress, and even when that transmission ends, the node that had been waiting will wait an additional random time, to reduce the possibility of colliding with some other node that had been waiting.

Suppose two nodes hear a third transmitting, and thus refrain from sending. Let X and Y be their random backoff times, i.e. the random times they wait before trying to send. Let's find the probability that they clash, which is \( P(|X - Y| \le 0.1) \).

Assume that X and Y are independent and exponentially distributed with mean 0.2, i.e. they each have density \( 5e^{-5u} \) on \( (0, \infty) \). Then from (3.49), we know that their joint density is the product of their marginal densities,

\[ f_{X,Y}(s,t) = 25 e^{-5(s+t)}, \quad s,t > 0 \tag{3.72} \]

Now

\[ P(|X - Y| \le 0.1) = 1 - P(|X - Y| > 0.1) = 1 - P(X > Y + 0.1) - P(Y > X + 0.1) \tag{3.73} \]

Look at that first probability. Applying (3.13) with \( A = \{(s,t) : s > t + 0.1,\ 0 \) [...]

[...] second as soon as the first fails. The lifetimes of the batteries are exponentially distributed and independent. Let's find the density of W, the time that the system is operational (i.e. the sum of the lifetimes of the two batteries).
Recall that if the two batteries had the same mean lifetimes, W would have a gamma distribution. But that's not the case here. But we notice that the distribution of W is a convolution of two exponential densities, as it is the sum of two nonnegative independent random variables. Using the convolution formula of Section 3.3.1.5, we have

\[
f_W(t) = \int_0^t f_X(s) \, f_Y(t-s) \; ds = \int_0^t 0.5 e^{-0.5s} e^{-(t-s)} \; ds = e^{-0.5t} - e^{-t}, \quad 0 \ \text{[...]}
\]

[...]

• If A is an r x k (but nonrandom) matrix, define Q = AW. Then Q is an r-component random vector, and

\[ \mathrm{Cov}(Q) = A \, \mathrm{Cov}(W) \, A' \tag{3.80} \]

• Suppose V and W are independent random vectors, meaning that each component in V is independent of each component of W. (But this does NOT mean that the components within V are independent of each other, and similarly for W.) Then

\[ \mathrm{Cov}(V + W) = \mathrm{Cov}(V) + \mathrm{Cov}(W) \tag{3.81} \]

3.5 Conditional Distributions

The key to good probability modeling and statistical analysis is to understand conditional probability. The issue arises constantly.

3.5.1 Conditional Pmfs and Densities

First, let's review: In many repetitions of our experiment, P(A) is the long-run proportion of the time that A occurs. By contrast, \( P(A \mid B) \) is the long-run proportion of the time that A occurs, among those repetitions in which B occurs. Keep this in your mind at all times.
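The "long-run proportion" reading of P(A) and P(A | B) can be made concrete with a short simulation. This is only a sketch with hypothetical events: two dice are rolled, A is the event that the total is at least 10, and B is the event that the blue die shows 6. For these events P(A) = 1/6 and P(A | B) = 1/2.

```r
set.seed(1)      # arbitrary seed, used only for reproducibility
nreps <- 100000
blue <- sample(1:6, nreps, replace=TRUE)
yellow <- sample(1:6, nreps, replace=TRUE)

a <- (blue + yellow >= 10)   # event A occurs in these repetitions
b <- (blue == 6)             # event B occurs in these repetitions

p_a <- mean(a)               # proportion of ALL repetitions in which A occurs
p_a_given_b <- mean(a[b])    # proportion among only those repetitions in which B occurs

print(c(p_a, p_a_given_b))   # should be near 1/6 and 1/2
```

The only difference between the two estimates is the set of repetitions over which the proportion is taken, which is exactly the distinction drawn above.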
Now we apply this to pmfs, densities, etc. We define the conditional pmf as follows for discrete random variables X and Y:

\[ p_{Y|X}(j \mid i) = P(Y = j \mid X = i) = \frac{p_{X,Y}(i,j)}{p_X(i)} \tag{3.82} \]

By analogy, we define the conditional density for continuous X and Y:

\[ f_{Y|X}(t \mid s) = \frac{f_{X,Y}(s,t)}{f_X(s)} \tag{3.83} \]

3.5.2 Conditional Expectation

Conditional expectations are defined as straightforward extensions of (3.82) and (3.83):

\[ E(Y \mid X = i) = \sum_j j \, p_{Y|X}(j \mid i) \tag{3.84} \]

\[ E(Y \mid X = s) = \int_t t \, f_{Y|X}(t \mid s) \; dt \tag{3.85} \]

3.5.3 The Law of Total Expectation (advanced topic)

3.5.3.1 Expected Value As a Random Variable

For a random variable Y and an event A, the quantity \( E(Y \mid A) \) is the long-run average of Y, among the times when A occurs. Note several things about the expression \( E(Y \mid A) \):

• The expression evaluates to a constant.

• The item to the left of the | symbol is a random variable (Y).

• The item on the right of the | symbol is an event (A).

By contrast, for the quantity \( E(Y \mid W) \) defined below, for a random variable W, it is the case that:

• The expression itself is a random variable, not a constant.

• The item to the left of the | symbol is again a random variable (Y).

• But the item to the right of the | symbol is also a random variable (W).

It will be very important to keep these differences in mind.

Consider the function g(t) defined as²

\[ g(t) = E(Y \mid W = t) \tag{3.86} \]

In this case, the item to the right of the | is an event, and thus g(t) is a constant (for each value of t), not a random variable.

Now, define the random variable Q to be g(W). Since W is a random variable, then Q is too. The quantity \( E(Y \mid W) \) is then defined to be Q. (Before reading any further, re-read the two sets of bulleted items above, and make sure you understand the difference between \( E(Y \mid W = t) \) and \( E(Y \mid W) \).)

One can view \( E(Y \mid W) \) as a projection in an abstract vector space. This is very elegant, and actually aids the intuition. If (and only if) you are mathematically adventurous, read the details in Section 3.9.2.

3.5.3.2 The Famous Formula (Theorem of Total Expectation)

An extremely useful formula, given only scant or no mention in most undergraduate probability courses, is

\[ E(Y) = E[E(Y \mid W)] \tag{3.87} \]

for any random variables Y and W.

² Of course, the t is just a placeholder, and any other letter could be used.
The RHS of (3.87) looks odd at first, but it's merely E[g(W)]; since \( Q = E(Y \mid W) \) is a random variable, we can certainly ask what its expected value is.

Equation (3.87) is a bit abstract. It's a very useful abstraction, enabling streamlined writing and thinking about the process. Still, you may find it helpful to consider the case of discrete W, in which (3.87) has the more concrete form

\[ EY = \sum_i P(W = i) \, E(Y \mid W = i) \tag{3.88} \]

To see this intuitively, think of measuring the heights and weights of all the adults in Davis. Say we measure height to the nearest inch, so that height is discrete. We look at all the adults in Davis who are 72 inches tall, and write down their mean weight. Then we write down the mean weight of all adults of height 68. Then we write down the mean weight of all adults of height 75, and so on. Then (3.87) says that if we take the average of all the numbers we write down (the average of the averages) then we get the mean weight among all adults in Davis.

Note carefully, though, that this is a weighted average. If for instance people of height 69 inches are more numerous in the population, then their mean weight will receive greater emphasis in our average of all the means we've written down. This is seen in (3.88), with the weights being the quantities P(W=i).

The relation (3.87) is proved in the discrete case in Section 3.10.

3.5.4 What About the Variance?

By the way, one might guess that the analog of the Theorem of Total Expectation for variance is

\[ \mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid W)] \tag{3.89} \]

But this is false. Think for example of the extreme case in which Y = W. Then \( \mathrm{Var}(Y \mid W) \) would be 0, but Var(Y) would be nonzero.

The correct formula, called the Law of Total Variance, is

\[ \mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid W)] + \mathrm{Var}[E(Y \mid W)] \tag{3.90} \]

Deriving this formula is easy, by simply evaluating both sides, and using the relation \( \mathrm{Var}(X) = E(X^2) - (EX)^2 \). This exercise is left to the reader.

3.5.5 Example: Trapped Miner

(Adapted from Stochastic Processes, by Sheldon Ross, Wiley, 1996.)
A miner is trapped in a mine, and has a choice of three doors. Though he doesn't realize it, if he chooses to exit the first door, it will take him to safety after 2 hours of travel. If he chooses the second one, it will lead back to the mine after 3 hours of travel. The third one leads back to the mine after 5 hours of travel. Suppose the doors look identical, and if he returns to the mine he does not remember which door(s) he tried earlier. What is the expected time until he reaches safety?

Let Y be the time it takes to reach safety, and let W denote the number of the door chosen (1, 2 or 3) on the first try. Then let us consider what values \( E(Y \mid W) \) can have. If W = 1, then Y = 2, so

\[ E(Y \mid W = 1) = 2 \tag{3.91} \]

If W = 2, things are a bit more complicated. The miner will go on a 3-hour excursion, and then be back in his original situation, and thus have a further expected wait of EY, since "time starts over." In other words,

\[ E(Y \mid W = 2) = 3 + EY \tag{3.92} \]

Similarly,

\[ E(Y \mid W = 3) = 5 + EY \tag{3.93} \]

In summary, now considering the random variable \( E(Y \mid W) \), we have

\[
Q = E(Y \mid W) =
\begin{cases}
2, & \text{w.p. } \frac{1}{3} \\
3 + EY, & \text{w.p. } \frac{1}{3} \\
5 + EY, & \text{w.p. } \frac{1}{3}
\end{cases}
\tag{3.94}
\]

where "w.p." means "with probability." So, using (3.87) or (3.88), we have

\[ EY = EQ = 2 \cdot \frac{1}{3} + (3 + EY) \cdot \frac{1}{3} + (5 + EY) \cdot \frac{1}{3} = \frac{10}{3} + \frac{2}{3} EY \tag{3.95} \]

Equating the extreme left and extreme right ends of this series of equations, we can solve for EY, which we find to be 10.

It is left to the reader to see how this would change if we assume that the miner remembers which doors he has already hit.

3.5.6 Example: Analysis of Hash Tables

(Famous example, adapted from various sources.)

Consider a database table consisting of m cells, only some of which are currently occupied. Each time a new key must be inserted, it is used in a hash function to find an unoccupied cell. Since multiple keys can map to the same table cell, we may have to probe multiple times before finding an unoccupied cell.
We wish to find E(Y), where Y is the number of probes needed to insert a new key. One approach to doing so would be to condition on W, the number of currently occupied cells at the time we do a search. After finding \( E(Y \mid W) \), we can use the Theorem of Total Expectation to find EY. We will make two assumptions (to be discussed later):

(a) Given that W = k, each probe will collide with an existing cell with probability k/m, with successive probes being independent.

(b) W is uniformly distributed on the set 1,2,...,m, i.e. P(W = k) = 1/m for each k.

To calculate \( E(Y \mid W = k) \), we note that given W = k, then Y is the number of independent trials until a "success" is reached, where "success" means that our probe turns out to be to an unoccupied cell. This is a geometric distribution, i.e.

\[ P(Y = r \mid W = k) = \left(\frac{k}{m}\right)^{r-1} \left(1 - \frac{k}{m}\right) \tag{3.96} \]

The mean of this geometric distribution is, from (3.75),

\[ \frac{1}{1 - \frac{k}{m}} \tag{3.97} \]

Then

\begin{align}
EY &= E[E(Y \mid W)] \tag{3.98} \\
&= \sum_{k=1}^{m-1} \frac{1}{m} E(Y \mid W = k) \tag{3.99} \\
&= \sum_{k=1}^{m-1} \frac{1}{m-k} \tag{3.100} \\
&= 1 + \frac{1}{2} + \frac{1}{3} + ... + \frac{1}{m-1} \tag{3.101} \\
&\approx \int_1^m \frac{1}{u} \; du \tag{3.102} \\
&= \ln(m) \tag{3.103}
\end{align}

where the approximation is something you might remember from calculus (you can picture it by drawing rectangles to approximate the area under the curve).

Now, what about our assumptions, (a) and (b)? The assumption in (a) of each cell having probability k/m should be reasonably accurate if k is much smaller than m, because hash functions tend to distribute probes uniformly, and the assumption of independence of successive probes is all right too, since it is very unlikely that we would hit the same cell twice. However, if k is not much smaller than m, the accuracy will suffer.
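To get a feel for how close the approximation in (3.102) is, the sum in (3.99) can simply be evaluated numerically. The sketch below does so for a hypothetical table size m = 1000; the value of m and the variable names are assumptions made for illustration.

```r
m <- 1000          # hypothetical table size
k <- 1:(m-1)       # the values of W included in the sum (3.99)

# (3.99), with E(Y | W = k) taken from (3.97)
ey_sum <- sum((1/m) * 1/(1 - k/m))

# the approximation (3.103)
ey_approx <- log(m)

print(c(ey_sum, ey_approx))
```

The sum comes out a little larger than ln(m), by roughly 0.58 for large m (the gap tends to Euler's constant), so (3.103) is best read as describing the logarithmic growth of EY in m rather than its exact value.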
Assumption (b) is more subtle, with differing interpretations. For example, the model may concern one specific database, in which case the assumption may be questionable. Presumably W grows over time, in which case the assumption would make no sense (it doesn't even have a distribution). We could instead think of a database which grows and shrinks as time progresses. However, even here, it would seem that W would probably oscillate around some value like m/2, rather than being uniformly distributed as assumed here. Thus, this model is probably not very realistic. However, even idealized models can sometimes provide important insights.

3.6 Parametric Families of Distributions

Since there are so many ways in which random variables can correlate with each other, there are rather few parametric families commonly used to model multivariate distributions (other than those arising from sets of independent random variables that have a distribution in a common parametric univariate family). We will discuss two here.

3.6.1 The Multinomial Family of Distributions

3.6.1.1 Probability Mass Function

This is a generalization of the binomial family.

Suppose one tosses a die 8 times. What is the probability that the results consist of two 1s, one 2, one 4, three 5s and one 6? Well, if the tosses occur in that order, i.e. the two 1s come first, then the 2, etc., then the probability is

\[ \left(\frac{1}{6}\right)^2 \left(\frac{1}{6}\right)^1 \left(\frac{1}{6}\right)^0 \left(\frac{1}{6}\right)^1 \left(\frac{1}{6}\right)^3 \left(\frac{1}{6}\right)^1 \tag{3.104} \]

But there are many different orderings, in fact

\[ \frac{8!}{2! \, 1! \, 0! \, 1! \, 3! \, 1!} \tag{3.105} \]

of them.

From this, we can see the following. Suppose:

• we have n trials, each of which has r possible outcomes or categories

• the trials are independent

• the i-th outcome has probability \( p_i \)

Let \( X_i \) denote the number of trials with outcome i, i = 1,...,r. Then we say that \( X_1, ..., X_r \) have a multinomial distribution, and the joint pmf of the \( X_1, ..., X_r \) is

\[ p_{X_1,...,X_r}(j_1, ..., j_r) = \frac{n!}{j_1! \, ... \, j_r!} \; p_1^{j_1} ... p_r^{j_r} \tag{3.106} \]

Note that this family of distributions has r+1 parameters.

3.6.1.2 Means and Covariances

Now look at the vector \( X = (X_1, ..., X_r)' \). Let's find its mean vector and covariance matrix.
First, note that the marginal distributions of the \( X_i \) are binomial! So,

\[ EX_i = np_i \quad \text{and} \quad \mathrm{Var}(X_i) = np_i(1 - p_i) \tag{3.107} \]

So we know EX now:

\[ EX = \begin{pmatrix} np_1 \\ ... \\ np_r \end{pmatrix} \tag{3.108} \]

What about Cov(X)? To this end, let \( T_{ki} \) equal 1 or 0, depending on whether the k-th trial results in outcome i, k = 1,...,n and i = 1,...,r. We say that \( T_{ki} \) is the indicator variable for the event that the k-th trial results in outcome i. This is a simple concept, but it has powerful uses, as you'll see.

Make sure you understand that

\[ X_i = \sum_{k=1}^{n} T_{ki} \tag{3.109} \]

From (3.109), you can see that

\[ X = U_1 + ... + U_n \tag{3.110} \]

where

\[ U_k = \begin{pmatrix} T_{k1} \\ ... \\ T_{kr} \end{pmatrix} \tag{3.111} \]

Now, here's where the power of the matrix operations in Section 3.4 will be seen:

\begin{align}
\mathrm{Cov}(X) &= \mathrm{Cov}(U_1 + ... + U_n) && \text{(from (3.110))} \tag{3.112} \\
&= \mathrm{Cov}(U_1) + ... + \mathrm{Cov}(U_n) && \text{(from (3.81))} \tag{3.113} \\
&= n \, \mathrm{Cov}(U_1) && \text{(all have the same distribution)} \tag{3.114}
\end{align}

Now, for \( i \ne j \), we have from (3.24)

\[ \mathrm{Cov}(T_{1i}, T_{1j}) = E(T_{1i} T_{1j}) - ET_{1i} \, ET_{1j} \tag{3.115} \]

But \( T_{1i} T_{1j} = 0 \), since a single trial cannot result in two different outcomes! And \( ET_{1i} = p_i \) and the same for the j case. So,

\[ \mathrm{Cov}(T_{1i}, T_{1j}) = -p_i p_j \tag{3.116} \]

Of course, for i = j, \( \mathrm{Cov}(T_{1i}, T_{1j}) = \mathrm{Var}(T_{1i}) = p_i(1 - p_i) \), since \( T_{1i} \) has a binomial distribution with number of trials equal to 1.

Putting all this together, and recalling (3.114), we see that

\[
\mathrm{Cov}(X) = n
\begin{pmatrix}
p_1(1-p_1) & -p_1 p_2 & ... & -p_1 p_r \\
-p_1 p_2 & p_2(1-p_2) & ... & -p_2 p_r \\
... & ... & ... & ... \\
... & ... & ... & p_r(1-p_r)
\end{pmatrix}
\tag{3.117}
\]

Note too that if we define R = X/n, so that R is the vector of proportions in the various categories (e.g. \( X_1/n \) is the fraction of trials that resulted in category 1), then from (3.117) and (3.79), we have

\[
\mathrm{Cov}(R) = \frac{1}{n}
\begin{pmatrix}
p_1(1-p_1) & -p_1 p_2 & ... & -p_1 p_r \\
-p_1 p_2 & p_2(1-p_2) & ... & -p_2 p_r \\
... & ... & ... & ... \\
... & ... & ... & p_r(1-p_r)
\end{pmatrix}
\tag{3.118}
\]

Whew! That was a workout, but these formulas will become very useful later on, both in this unit and subsequent ones.

3.6.1.3 Application: Text Mining

One of the branches of computer science in which the multinomial family plays a prominent role is in text mining. One goal is automatic document classification. We want to write software that will make reasonably accurate guesses as to whether a document is about sports, the stock market, elections etc., based on the frequencies of various key words the program finds in the document.

Many of the simpler methods for this use the bag of words model. We have r key words we've decided are useful for the classification process, and the model assumes that statistically the frequencies of those words in a given document category, say sports, follow a multinomial distribution. Each category has its own set of probabilities \( p_1, ..., p_r \). For instance, if "Barry Bonds" is considered one word, its probability will be much higher in the sports category than in the elections category, say. So, the observed frequencies of the words in a particular document will hopefully enable our software to make a fairly good guess as to the category the document belongs to.

Once again, this is a very simple model here, designed to just introduce the topic to you. Clearly the multinomial assumption of independence between trials is grossly incorrect here; most models are much more complex than this.

3.6.2 The Multivariate Normal Family of Distributions

Note to the reader: This is a more difficult section, but worth putting extra effort into, as so many statistical applications in computer science make use of it. It will seem hard at times, but in the end won't be too bad.
3.6.2.1 Densities and Properties

Intuitively, this family has densities which are shaped like multidimensional bells, just like the univariate normal has the famous one-dimensional bell shape.

Let's look at the bivariate case first. The joint distribution of \( X_1 \) and \( X_2 \) is said to be bivariate normal if their density is

\[
f_{X,Y}(s,t) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \;
\exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(s-\mu_1)^2}{\sigma_1^2} + \frac{(t-\mu_2)^2}{\sigma_2^2} - \frac{2\rho(s-\mu_1)(t-\mu_2)}{\sigma_1 \sigma_2} \right] \right\},
\]

\( -\infty \) [...]

Figure 3.1: Bivariate Normal Density, \( \rho = 0.2 \)

Figure 3.2: Bivariate Normal Density, \( \rho = 0.8 \)

All of this reflects the high correlation (0.8) between the two variables. If we were to continue to increase \( \rho \) toward 1.0, we would see the bell become narrower and narrower, with \( X_1 \) and \( X_2 \) coming closer and closer to a linear relationship, one which can be shown to be

\[ X_1 - \mu_1 = \frac{\sigma_1}{\sigma_2} (X_2 - \mu_2) \tag{3.120} \]

In this case, that would be

\[ X_1 = \sqrt{\frac{10}{15}} \, X_2 = 0.82 X_2 \tag{3.121} \]

The multivariate normal family of distributions is parameterized by one vector-valued quantity, the mean \( \mu \), and one matrix-valued quantity, the covariance matrix \( \Sigma \). Specifically, suppose the random vector \( X = (X_1, ..., X_k)' \) has a k-variate normal distribution.

The density has this form:

\[ f_X(t) = c \, e^{-0.5 (t - \mu)' \Sigma^{-1} (t - \mu)} \tag{3.122} \]

where

\[ c = \frac{1}{(2\pi)^{k/2} \sqrt{\det(\Sigma)}} \tag{3.123} \]

Here again ' denotes matrix transpose, -1 denotes matrix inversion and det() means determinant. Again, note that t is a k x 1 vector.

Since the matrix is symmetric, there are k(k+1)/2 distinct parameters there, and k parameters in the mean vector, for a total of k(k+3)/2 parameters for this family of distributions.
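To make (3.122) and (3.123) concrete, here is a minimal sketch that evaluates the density directly in base R. The function name mvn_density, its arguments, and the example mean vector and covariance matrix are all hypothetical, introduced only for illustration. The example uses a diagonal covariance matrix, in which case the two components are independent and the result can be compared with a product of univariate normal densities from dnorm().

```r
# density (3.122) of a k-variate normal distribution at the point tvec
# (hypothetical helper, written for illustration only)
mvn_density <- function(tvec, mu, sigma) {
   k <- length(mu)
   const <- 1 / ((2*pi)^(k/2) * sqrt(det(sigma)))   # the constant c in (3.123)
   d <- matrix(tvec - mu, ncol=1)                    # t - mu, as a k x 1 vector
   q <- as.numeric(t(d) %*% solve(sigma) %*% d)     # (t-mu)' Sigma^{-1} (t-mu)
   const * exp(-0.5 * q)
}

mu <- c(0, 0)                               # assumed mean vector
sigma <- matrix(c(10, 0, 0, 15), nrow=2)    # assumed covariance matrix (diagonal)

val <- mvn_density(c(1, 2), mu, sigma)
prod_val <- dnorm(1, 0, sqrt(10)) * dnorm(2, 0, sqrt(15))
print(c(val, prod_val))   # the two values should agree
```

The call to solve() computes the matrix inverse explicitly, which is fine for a small illustration like this one.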
The family has the following important properties:

• The contours for points at which the bell has the same height (think of a topographical map) are elliptical in shape. The larger the correlation (in absolute value) between X and Y, the more elongated the ellipse. When the absolute correlation reaches 1, the ellipse degenerates into a straight line.

• Let X be a k-variate normal vector as above, and let A be a constant (i.e. nonrandom) matrix with k columns. Then the random vector Y = AX also has a multivariate normal distribution, with mean \( A\mu \) and covariance matrix \( A \Sigma A' \).

  This has two important implications:

  - The lower-dimensional marginal distributions are also multivariate normal.

  - Scalar linear combinations of X are normal. In other words, for constant scalars \( a_1, ..., a_k \), the quantity \( Y = a_1 X_1 + ... + a_k X_k \) has a univariate normal distribution with mean \( a\mu \) and variance \( a \Sigma a' \), with a being the row vector \( (a_1, ..., a_k) \).

• If X and Y are normal and they are independent, then they jointly have a bivariate normal distribution. In general, though, having a normal distribution for X and Y does not imply that they are jointly bivariate normal.

• Suppose W has a multivariate normal distribution. The conditional distribution of some components of W, given other components, is again multivariate normal.

In R the density, cdf and quantiles of the multivariate normal distribution are given by the functions dmvnorm(), pmvnorm() and qmvnorm() in the library mvtnorm. You can simulate a multivariate normal distribution by using mvrnorm() in the library MASS.

3.6.2.2 The Multivariate Central Limit Theorem

The multidimensional version of the Central Limit Theorem holds. A sum of independent identically distributed (iid) random vectors has an approximate multivariate normal distribution.
For example, since a person's body consists of many different components, the CLT (a non-independent, non-identically-distributed version of it) explains intuitively why heights and weights are approximately bivariate normal. Histograms of heights will look approximately bell-shaped, and the same is true for weights. The multivariate CLT says that three-dimensional histograms (plotting frequency along the "Z" axis against height and weight along the "X" and "Y" axes) will be approximately three-dimensional bell-shaped.

3.6.2.3 Example: Dice Game

Suppose we roll a die 50 times. Let X denote the number of rolls in which we get one dot, and let Y be the number of times we get either two or three dots. For convenience, let's also define Z to be the number of times we get four or more dots, though our focus will be on X and Y.

Suppose we wish to find \( P(X \le 12 \text{ and } Y \le 16) \). Suppose also that we win $5 for each roll of a one, and $2 for each roll of a two or three; we'll find the probability that we win more than $90.

The triple (X,Y,Z) has a multinomial distribution with n = 50 and three possible outcomes (1; 2 or 3; 4, 5 or 6), with \( p_1 = 1/6 \), \( p_2 = 1/3 \) and \( p_3 = 1/2 \). From (3.110) and the multivariate CLT, we see that (X,Y,Z) has an approximately multivariate normal distribution.

These probabilities of interest to us here would be quite difficult to find directly. For \( P(X \le 12 \text{ and } Y \le 16) \), for instance, we would need to sum (3.106) over many, many different cases. So, the CLT will be very valuable here.
We'll of course need to know the mean vector and covariance matrix of the random vector (X,Y)'. We have those from (3.107) and (3.117):

\[ E[(X,Y)] = (50/6, \; 50/3) \tag{3.124} \]

and

\[
\mathrm{Cov}[(X,Y)] =
\begin{pmatrix}
50 \cdot 5/36 & -50/18 \\
-50/18 & 50 \cdot 2/9
\end{pmatrix}
\tag{3.125}
\]

We use the R function pmvnorm() introduced in Section 3.6.2.1. To account for the integer nature of X and Y, we call the function with upper limits of 12.5 and 16.5, rather than 12 and 16, a correction that is often used to get a better approximation. Our code is

```r
library(mvtnorm)  # provides pmvnorm()

p1 <- 1/6
p23 <- 1/3
meanvec <- 50*c(p1,p23)
var1 <- 50*p1*(1-p1)
var23 <- 50*p23*(1-p23)
covar123 <- -50*p1*p23
covarmat <- matrix(c(var1,covar123,covar123,var23),nrow=2)
# upper limits 12.5 and 16.5 rather than 12 and 16, since X and Y are integer-valued
print(pmvnorm(upper=c(12.5,16.5),mean=meanvec,sigma=covarmat))
```

We find that

\[ P(X \le 12 \text{ and } Y \le 16) \approx 0.43 \tag{3.126} \]

Now, let's find the probability that our total winnings, W, is over $90. We know that W = 5X + 2Y, and Section 3.6.2.1 tells us that linear combinations of a multivariate normal random vector are univariate normal. In other words, W has a normal distribution!

We thus need the mean and variance of W. The mean is easy:

\[ EW = E(5X + 2Y) = 5EX + 2EY = 250/6 + 100/3 = 75 \tag{3.127} \]

For the variance, use (3.29):

\begin{align}
\mathrm{Var}(W) &= \mathrm{Var}[5X + 2Y] && \text{(definition of W)} \tag{3.128} \\
&= \mathrm{Var}(5X) + \mathrm{Var}(2Y) + 2\,\mathrm{Cov}(5X, 2Y) && \text{(from (3.29))} \tag{3.129} \\
&= 5^2 \, \mathrm{Var}(X) + 2^2 \, \mathrm{Var}(Y) + 2 \cdot 5 \cdot 2 \, \mathrm{Cov}(X,Y) && \text{(properties of Var, Cov)} \tag{3.130} \\
&= 25 \cdot \frac{250}{36} + 4 \cdot \frac{100}{9} + 20 \cdot \left(-\frac{50}{18}\right) \tag{3.131} \\
&= 162.5 \tag{3.132}
\end{align}

Then

\[ P(W > 90) = 1 - \Phi\left(\frac{90 - 75}{162.5^{0.5}}\right) = 0.12 \tag{3.133} \]

By the way, what about Z? Since Z = 50 - X - Y, there is no need to look at Z, and we would have difficulties if we did. The reason is that the fact that Z is an exact linear function of X and Y turns out to make the covariance matrix singular, i.e. lacking an inverse. That would create problems in (3.122).

3.6.2.4 Application: Data Mining

The multivariate normal family plays a central role in multivariate statistical methods.
For instance, a major issue in data mining is dimension reduction, which means trying to reduce what may be hundreds or thousands of variables down to a manageable level. One of the tools for this, called principal components analysis (PCA), is based on multivariate normal distributions. Google uses this kind of thing quite heavily. We'll discuss PCA in Section 6.5.

To see a bit of how this works, note that in Figure 3.2, \( X_1 \) and \( X_2 \) had nearly a linear relationship with each other. That means that one of them is nearly redundant, which is good if we are trying to reduce the number of variables we must work with.

In general, the method of principal components takes r original variables, in the vector X, and forms r new ones in a vector Y, each of which is some linear combination of the original ones. These new ones are independent. In other words, there is a square matrix A such that the components of Y = AX are independent. (The matrix A consists of the eigenvectors of Cov(X); more on this in Section 6.5 of our unit on statistical relations.)

We then discard the \( Y_i \) with small variance, as that means they are nearly constant and thus do not carry much information. That leaves us with a smaller set of variables that still captures most of the information of the original ones.

Many analyses in bioinformatics involve data that can be modeled well by multivariate normal distributions. For example, in automated cell analysis, two important variables are forward light scatter (FSC) and sideward light scatter (SSC). The joint distribution of the two is approximately bivariate normal.⁴

3.7 Simulation of Random Vectors

Let \( X = (X_1, ..., X_k)' \) be a random vector having a specified distribution. How can we write code to simulate it? It is not always easy to do this. We'll discuss a couple of easy cases here, and illustrate what one may do in other situations.

The easiest case (and a very frequently-occurring one) is that in which the \( X_i \) are independent. One simply simulates them individually, and that simulates X!
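A minimal sketch of the independent case follows. The particular marginal distributions (an exponential with mean 0.2 and a U(0,1)), the seed and the variable names are hypothetical choices made only for illustration.

```r
set.seed(1)     # arbitrary seed, used only for reproducibility
nreps <- 10000

# simulate each component on its own
x1 <- rexp(nreps, rate=5)   # X1: exponential with mean 0.2 (assumed marginal)
x2 <- runif(nreps)          # X2: U(0,1) (assumed marginal)

# each row of xsim is one simulated value of the vector X = (X1, X2)'
xsim <- cbind(x1, x2)

print(colMeans(xsim))   # should be near (0.2, 0.5)
print(cor(x1, x2))      # should be near 0, consistent with independence
```

Since the two columns are generated by separate calls, nothing ties them together, which is all that independence requires here.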
Another easy case is that in which X has a multivariate normal distribution. We noted in Section 3.6.2.1 that R includes the function mvrnorm(), which we can use to simulate our X here. The way this function works is to use the notion of principal components mentioned in Section 3.6.2.4. We construct Y = AX for the matrix A discussed there. The \( Y_i \) are independent, thus easily simulated, and then we transform back to X via \( X = A^{-1} Y \).

In general, though, things may not be so easy. For instance, consider the distribution in (3.16). There is no formulaic solution here, but the following strategy works.

⁴ See Bioinformatics and Computational Biology Solutions Using R and Bioconductor, edited by Robert Gentleman, Wolfgang Huber, Vincent J. Carey, Rafael A. Irizarry and Sandrine Dudoit, Springer, 2005.

First we find the (marginal) density of X. As in the case for Y shown in (3.19), we compute

\[ f_X(s) = \int_0^s 8st \; dt = 4s^3 \tag{3.134} \]

Using the method shown in our unit on continuous probability, Section 2.7, we can simulate X as

\[ X = F_X^{-1}(W) \tag{3.135} \]

where W is a U(0,1) random variable, generated as runif(1). Since \( F_X(u) = u^4 \), \( F_X^{-1}(v) = v^{0.25} \), and thus our code to simulate X is

runif(1)^0.25

Now that we have X, we can get Y. We know that

\[ f_{Y|X}(t \mid s) = \frac{8st}{4s^3} = \frac{2}{s^2} \, t \tag{3.136} \]

Remember, s is considered constant. So again we use the inverse-cdf method here to find Y, given X, and then we have our pair (X,Y).

3.8 Transform Methods (advanced topic)

We often use the idea of transform functions. For example, you may have seen Laplace transforms in a math or engineering course. The functions we will see here differ from this by just a change of variable.

Though in the form used here they involve only univariate distributions, their applications are often multivariate, as will be the case here.
3.8.0.5 Generating Functions

Let's start with the generating function. For any nonnegative-integer valued random variable V, its generating function is defined by

\[ g_V(s) = E(s^V) = \sum_{i=0}^{\infty} s^i \, p_V(i), \quad 0 \le s \le 1 \tag{3.137} \]

For instance, suppose N has a geometric distribution with parameter p, so that \( p_N(i) = (1-p) p^{i-1} \), i = 1,2,... Then

\[
g_N(s) = \sum_{i=1}^{\infty} s^i (1-p) p^{i-1}
= \frac{1-p}{p} \sum_{i=1}^{\infty} s^i p^i
= \frac{1-p}{p} \cdot \frac{ps}{1 - ps}
= \frac{(1-p)s}{1 - ps}
\tag{3.138}
\]

Why restrict s to the interval [0,1]? The answer is that for s > 1 the series in (3.137) may not converge. For \( 0 \le s \le 1 \), the series does converge. To see this, note that if s = 1, we just get the sum of all probabilities, which is 1.0. If a nonnegative s is less than 1, then \( s^i \) will also be less than 1, so we still have convergence.

One use of the generating function is, as its name implies, to generate the probabilities of values for the random variable in question. In other words, if you have the generating function but not the probabilities, you can obtain the probabilities from the function. Here's why: For clarity, write (3.137) as

\[ g_V(s) = P(V = 0) + sP(V = 1) + s^2 P(V = 2) + ... \tag{3.139} \]

From this we see that

\[ g_V(0) = P(V = 0) \tag{3.140} \]

So, we can obtain P(V = 0) from the generating function. Now differentiating (3.137) with respect to s, we have

\[
g_V'(s) = \frac{d}{ds} \left[ P(V = 0) + sP(V = 1) + s^2 P(V = 2) + ... \right]
= P(V = 1) + 2sP(V = 2) + ...
\tag{3.141}
\]

So, we can obtain P(V = 1) from \( g_V'(0) \), and in a similar manner can calculate the other probabilities from the higher derivatives.

3.8.0.6 Moment Generating Functions

The generating function is handy, but it is limited to discrete random variables. More generally, we can use the moment generating function, defined for any random variable X as

\[ m_X(t) = E[e^{tX}] \tag{3.142} \]

for any t for which the expected value exists.
That last restriction is anathema to mathematicians, so they use the characteristic function,

\[ \phi_X(t) = E[e^{itX}] \tag{3.143} \]

which exists for any t. However, it makes use of pesky complex numbers, so we'll stay clear of it here.

Differentiating (3.142) with respect to t, we have

\[ m_X'(t) = E[X e^{tX}] \tag{3.144} \]

We see then that

\[ m_X'(0) = EX \tag{3.145} \]

So, if we just know the moment-generating function of X, we can obtain EX from it. Also,

\[ m_X''(t) = E(X^2 e^{tX}) \tag{3.146} \]

so

\[ m_X''(0) = E(X^2) \tag{3.147} \]

In this manner, we can for various k obtain \( E(X^k) \), the k-th moment of X, hence the name.

3.8.1 Example: Network Packets

As an example, suppose the number of packets N received on a network link in a given time period has a Poisson distribution with mean \( \mu \), i.e.

\[ P(N = k) = \frac{e^{-\mu} \mu^k}{k!}, \quad k = 0, 1, 2, 3, ... \tag{3.148} \]

3.8.1.1 Poisson Generating Function

Let's first find its generating function.

\[
g_N(t) = \sum_{k=0}^{\infty} t^k \, \frac{e^{-\mu} \mu^k}{k!}
= e^{-\mu} \sum_{k=0}^{\infty} \frac{(\mu t)^k}{k!}
= e^{-\mu + \mu t}
\tag{3.149}
\]

where we made use of the Taylor series from calculus,

\[ e^u = \sum_{k=0}^{\infty} u^k / k! \tag{3.150} \]

3.8.1.2 Sums of Independent Poisson Random Variables Are Poisson Distributed

Suppose packets come in to a network node from two independent links, with counts \( N_1 \) and \( N_2 \), Poisson distributed with means \( \mu_1 \) and \( \mu_2 \). Let's find the distribution of \( N = N_1 + N_2 \), using a transform approach.

\[
g_N(t) = E[t^{N_1 + N_2}] = E[t^{N_1}] \, E[t^{N_2}] = g_{N_1}(t) \, g_{N_2}(t) = e^{-\nu + \nu t}
\tag{3.151}
\]

where \( \nu = \mu_1 + \mu_2 \).

But the last expression in (3.151) is the generating function for a Poisson distribution too! And since there is a one-to-one correspondence between distributions and transforms, we can conclude that N has a Poisson distribution with parameter \( \nu \). We of course knew that N would have mean \( \nu \) but did not know that N would have a Poisson distribution.

So: A sum of two independent Poisson variables itself has a Poisson distribution. By induction, this is also true for sums of k independent Poisson variables.
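This conclusion can be checked numerically without any transforms, by computing the pmf of the sum through the discrete convolution (3.66) and comparing it with a Poisson pmf. The sketch below uses R's built-in dpois(); the means 2 and 3, the cutoff kmax and the variable names are hypothetical values picked for illustration.

```r
mu1 <- 2      # assumed mean of N1
mu2 <- 3      # assumed mean of N2
kmax <- 20    # compare the pmfs at k = 0, 1, ..., kmax

# pmf of N1 + N2 at each k, via the convolution sum in (3.66):
# sum over i of P(N1 = i) P(N2 = k - i)
conv_pmf <- sapply(0:kmax, function(k) sum(dpois(0:k, mu1) * dpois(k:0, mu2)))

# pmf of a Poisson distribution with mean mu1 + mu2
pois_pmf <- dpois(0:kmax, mu1 + mu2)

print(max(abs(conv_pmf - pois_pmf)))   # should be zero up to rounding error
```

Each convolution sum is finite, so the comparison involves no truncation, only floating-point rounding.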
3.8.1.3 Random Number of Bits in Packets on One Link (advanced topic)

Consider just one of the two links now, and for convenience denote the number of packets on the link by N, and its mean as $\lambda$. Continue to assume that N has a Poisson distribution.

Let B denote the number of bits in a packet, with $B_1, \ldots, B_N$ denoting the bit counts in the N packets. We assume the $B_i$ are independent and identically distributed. The total number of bits received during that time period is

$$T = B_1 + \ldots + B_N \tag{3.152}$$

Suppose the generating function of B is known to be h(s). Then what is the generating function of T?

$$g_T(s) = E(s^T) \tag{3.153}$$
$$= E[E(s^T | N)] \tag{3.154}$$
$$= E[E(s^{B_1 + \ldots + B_N} | N)] \tag{3.155}$$
$$= E[E(s^{B_1} | N) \cdots E(s^{B_N} | N)] \tag{3.156}$$
$$= E[h(s)^N] \tag{3.157}$$
$$= g_N[h(s)] \tag{3.158}$$
$$= e^{-\lambda + \lambda h(s)} \tag{3.159}$$

Here is how these steps were made:

• From the first line to the second, we used the Law of Total Expectation.
• From the second to the third, we just used the definition of T.
• From the third to the fourth lines, we have used algebra plus the fact that the expected value of a product of independent random variables is the product of their individual expected values.
• From the fourth to the fifth, we used the definition of h(s).
• From the fifth to the sixth, we used the definition of $g_N$.
• From the sixth to the last we used the formula for the generating function for a Poisson distribution with mean $\lambda$.

We can then get all the information about T we need from this formula, such as its mean, variance, probabilities and so on, as seen previously.

3.8.2 Other Uses of Transforms

Transform techniques are used heavily in queuing analysis, including for models of computer networks. The techniques are also used extensively in modeling of hardware and software reliability.

3.9 Vector Space Interpretations (for the mathematically adventurous only)

3.9.1 Properties of Correlation

Let $\mathcal{V}$ be the set of all random variables with finite variance and mean 0. Treat this as a vector space, with the sum of two vectors X and Y taken to be the random variable X+Y, and for a constant c, the vector cX being the random variable cX. Note that $\mathcal{V}$ is closed under these operations, as it must be.
Define an inner product on this space:

$$(X,Y) = E(XY) = Cov(X,Y) \tag{3.160}$$

(Recall that Cov(X,Y) = E(XY) - EX EY, and that we are working with random variables that have mean 0.) Thus the norm of a vector X is

$$||X|| = (X,X)^{0.5} = \sqrt{E(X^2)} = \sqrt{Var(X)} \tag{3.161}$$

again since EX = 0.

The famous Cauchy-Schwarz Inequality for inner products says,

$$|(X,Y)| \le ||X|| \, ||Y|| \tag{3.162}$$

i.e.

$$|\rho(X,Y)| \le 1 \tag{3.163}$$

Also, the Cauchy-Schwarz Inequality yields equality if and only if one vector is a scalar multiple of the other, i.e. Y = cX for some c. When we then translate this to random variables of nonzero means, we get Y = cX + d.

In other words, the correlation between two random variables is between -1 and 1, with equality if and only if one is an exact linear function of the other.

3.9.2 Conditional Expectation As a Projection

Continue to consider the vector space in Section 3.9.1.

For a random variable X, let $\mathcal{W}$ denote the subspace of $\mathcal{V}$ consisting of all functions h(X) with mean 0 and finite variance. (Again, note that this subspace is indeed closed under vector addition and scalar multiplication.)

Now consider any Y in $\mathcal{V}$. Recall that the projection of Y onto $\mathcal{W}$ is the closest vector T in $\mathcal{W}$ to Y, i.e. T minimizes

$$||Y - T|| = \left( E[(Y-T)^2] \right)^{0.5} \tag{3.164}$$

To find the minimizing T, consider first the minimization of

$$E[(S-c)^2] \tag{3.165}$$

with respect to constants c for some random variable S. Expanding the square, we have

$$E[(S-c)^2] = E(S^2) - 2cES + c^2 \tag{3.166}$$

Taking $\frac{d}{dc}$ and setting the result to 0, we find that the minimizing c is c = ES.

Getting back to (3.164), use the Law of Total Expectation to write

$$E[(Y-T)^2] = E\left\{ E[(Y-T)^2 | X] \right\} \tag{3.167}$$

From what we learned with (3.165), applied to the conditional (i.e. inner) expectation in (3.167), we see that the T which minimizes (3.167) is T = E(Y | X).
In other words, the conditional mean is a projection! Nice, but is this useful in any way? The answer is yes, in the sense that it guides the intuition. All this is related to issues of statistical prediction (here we would be predicting Y from X) and the geometry here can really guide our insight. This is not very evident without getting deeply into the prediction issue, but let's explore some of the implications of the geometry.

For example, a projection is perpendicular to the line connecting the projection to the original vector. So

$$0 = (E(Y|X), Y - E(Y|X)) = Cov[E(Y|X), Y - E(Y|X)] \tag{3.168}$$

This says that the prediction E(Y | X) is uncorrelated with the prediction error, Y - E(Y | X). This in turn has statistical importance. Of course, (3.168) could have been derived directly, but the geometry of the vector space interpretation is what suggested we look at the quantity in the first place. Again, the point is that the vector space view can guide our intuition.

Similarly, the Pythagorean Theorem holds, so

$$||Y||^2 = ||E(Y|X)||^2 + ||Y - E(Y|X)||^2 \tag{3.169}$$

which means that

$$Var(Y) = Var[E(Y|X)] + Var[Y - E(Y|X)] \tag{3.170}$$

Equation (3.170) is a common theme in linear models in statistics, the decomposition of variance.

3.10 Proof of the Law of Total Expectation

Let's prove (3.87) for the case in which W and Y take values only in the set {1,2,3,...}. Recall that if T is an integer-valued random variable and we have some function h(), then L = h(T) is another random variable, and its expected value can be calculated as [5]

$$E(L) = \sum_k h(k) P(T = k) \tag{3.171}$$

In our case here, Q = g(W), with g(i) = E(Y | W = i), is a function of W, so we find its expectation from the distribution of W:

$$E(Q) = \sum_{i=1}^{\infty} g(i) P(W = i)$$
$$= \sum_{i=1}^{\infty} E(Y | W = i) P(W = i)$$
$$= \sum_{i=1}^{\infty} \left[ \sum_{j=1}^{\infty} j P(Y = j | W = i) \right] P(W = i)$$
$$= \sum_{j=1}^{\infty} j \sum_{i=1}^{\infty} P(Y = j | W = i) P(W = i)$$
$$= \sum_{j=1}^{\infty} j P(Y = j)$$
$$= E(Y)$$

In other words,

$$E(Y) = E[E(Y | W)] \tag{3.172}$$

[5] This is sometimes called The Law of the Unconscious Statistician, by nasty probability theorists who look down on statisticians.
Their point is that technically $EL = \sum_k k P(L = k)$, and that (3.171) must be proven, whereas the statisticians supposedly think it's a definition.

Exercises

1. Suppose the random pair (X,Y) has the density $f_{X,Y}(s,t) = 8st$ on the triangle {(s,t): 0 [...]

[...] for any constants a, b, c and d.

7. Suppose we wish to predict a random variable Y by using another random variable, X. We may consider predictors of the form cX + d for constants c and d. Show that the values of c and d that minimize the mean squared prediction error, $E[(Y - cX - d)^2]$, are

$$c = \frac{E(XY) - EX \cdot EY}{Var(X)} \tag{3.175}$$

$$d = \frac{E(X^2) \cdot EY - EX \cdot E(XY)}{Var(X)} \tag{3.176}$$

8. Programs A and B consist of r and s modules, respectively, of which c modules are common to both. As a simple model, assume that each module has probability p of being correct, with the modules acting independently. Let X and Y denote the numbers of correct modules in A and B, respectively. Find the correlation $\rho(X,Y)$ as a function of r, s, c and p.

Hint: Write $X = X_1 + \ldots + X_{r-c}$, where $X_i$ is 1 or 0, depending on whether module i of A is correct, for the nonoverlapping modules of A. Do the same for B, and for the set of common modules.

9. Use transform methods to derive some properties of the Poisson family:

(a) Show that for any Poisson random variable, its mean and variance are equal.

(b) Suppose X and Y are independent random variables, each having a Poisson distribution. Show that Z = X + Y again has a Poisson distribution.

10. Suppose one keeps rolling a die. Let $S_n$ denote the total number of dots after n rolls, mod 8, and let T be the number of rolls needed for the event $S_n = 0$ to occur. Find E(T), using an approach like that in the trapped miner example in Section 3.5.5.
11. In our ordinary coins which we use every day, each one has a slightly different probability of heads, which we'll call H. Say H has the distribution $N(0.5, 0.03^2)$. We choose a coin from a batch at random, then toss it 10 times. Let N be the number of heads we get. Find Var(N).

12. Jack and Jill play a dice game, in which one wins $1 per dot. There are three dice, die A, die B and die C. Jill always rolls dice A and B. Jack always rolls just die C, but he also gets credit for 90% of die B. For instance, say in a particular roll A, B and C are 3, 1 and 6, respectively. Then Jill would win $4 and Jack would get $6.90. Let X and Y be Jill's and Jack's total winnings after 100 rolls. Use the Central Limit Theorem to find the approximate values of P(X > 650, Y < 660) and P(Y > 1.06X).

Hints: This will follow a similar pattern to the dice game in Section 3.6.2.3, in which we win $5 for one dot, and $2 for two or three dots. Remember, in that example, the key was that we noticed that the pair (X,Y) was a sum of random pairs. That meant that (X,Y) had an approximate bivariate normal distribution, so we could find probabilities if we had the mean vector and covariance matrix of (X,Y). Thus we needed to find EX, EY, Var(X), Var(Y) and Cov(X,Y). We used the various properties of E(), Var() and Cov() to get those quantities.

You will do the same thing here. Write $X = U_1 + \ldots + U_{100}$, where $U_i$ is Jill's winnings on the ith roll. Write Y as a similar sum of $V_i$. You probably will find it helpful to define $A_i$, $B_i$ and $C_i$ as the numbers of dots appearing on dice A, B and C on the ith roll. Then find EX etc. Again, make sure to utilize the various properties for E(), Var() and Cov().

13. Show that if random variables U and V are independent,

$$Var(UV) = E(U^2) Var(V) + Var(U) (EV)^2 \tag{3.177}$$

Chapter 4

Introduction to Statistical Inference

4.1 What Statistics Is All About

If you follow the events involving space travel, [1] you may hear statements like, "There is a 40% chance that weather conditions on Friday will be good enough to launch the space shuttle." Your response might be curiosity as to the following questions:

• What does that 40% figure really mean?
• How accurate is that figure?
• What data was used to obtain that figure, and what mathematical model was used?
Well, these are typical statistical issues.

If you thought that statistics is nothing more than adding up columns of numbers and plugging into formulas, you are badly mistaken. Actually, statistics is an application of probability theory. We employ probabilistic models for the behavior of our sample data, and infer from the data accordingly, hence the name, statistical inference.

4.2 Introduction to Confidence Intervals

4.2.1 How Long Should We Run a Simulation?

In our simulations in previous units, it was never quite clear how long the simulation should be run, i.e. what value to set for nreps. Now we will finally address this issue.

As our example, recall the Bus Paradox in Section 2.5: Buses arrive at a certain bus stop at random times, with interarrival times being independent exponentially distributed random variables with mean 10 minutes. You arrive at the bus stop every day at a certain time, say four hours (240 minutes) after the buses start their morning run. What is your mean wait for the next bus?

[1] Personally, I don't.

We found mathematically that, due to the memoryless property of the exponential distribution, our wait is again exponentially distributed with mean 10. But suppose we didn't know that, and we wished to find the answer via simulation. We could write a program:

    doexpt <- function(opt) {
       lastarrival <- 0.0
       while (lastarrival < opt)
    [...]

Let $W_i$ denote the ith waiting time, i = 1,2,...,n and let $\overline{W}$ denote the sample mean,

$$\overline{W} = \frac{W_1 + \ldots + W_n}{n} \tag{4.1}$$

$\overline{W}$ is what the program prints out.

The key points are that

• The random variables $W_i$ each have the distribution $F_W$, and thus each have mean $\mu$ and variance $\sigma^2$.
• The random variables $W_i$ are independent.
• The mean of $\overline{W}$ is also $\mu$:

$$E\overline{W} = \frac{1}{n} E\left( \sum_{i=1}^{n} W_i \right) \quad \text{(for const. c, } E(cU) = cEU \text{)} \tag{4.2}$$
$$= \frac{1}{n} \sum_{i=1}^{n} EW_i \quad (E[U+V] = EU + EV) \tag{4.3}$$
$$= \frac{1}{n} n\mu \quad (EW_i = \mu) \tag{4.4}$$
$$= \mu \tag{4.5}$$

• The variance of $\overline{W}$ is 1/n of the population variance:

$$Var(\overline{W}) = \frac{1}{n^2} Var\left( \sum_{i=1}^{n} W_i \right) \quad \text{(for const. c, } Var[cU] = c^2 Var[U] \text{)} \tag{4.6}$$
$$= \frac{1}{n^2} \sum_{i=1}^{n} Var(W_i) \quad \text{(for U, V indep., } Var[U+V] = Var[U] + Var[V] \text{)} \tag{4.7}$$
$$= \frac{1}{n^2} n\sigma^2 \tag{4.8}$$
$$= \frac{1}{n} \sigma^2 \tag{4.9}$$

Let's think of the notebook example in a different context. Here our experiment is to sample 20 wait times, again either by personally going to the bus stop 20 times or running the above program with nreps equal to 20. Each line of the notebook would consist of data from 20 visits to the bus stop, with wait times $W_1, \ldots, W_{20}$ and the sample mean $\overline{W}$. So our notebook would have a column for $W_1$, one for $W_2$, ..., one for $W_{20}$ and especially one for $\overline{W}$. Here is what we would find:

• When we say that each $W_i$ has the same distribution as the population, we mean the following, say for i = 1: If we were to gather together all the values of $W_1$, one from each of the infinitely many lines of the notebook, then their average would be 10.0. Also, the long-run proportion of lines for which $W_1 < 4$, say, would be equal to P(W < 4).

It can be shown that actually W has an exponential distribution. (This follows from the memoryless property.) So,

$$P(W < 4) = \int_0^4 0.1 e^{-0.1t} \, dt = 0.33 \tag{4.10}$$

Therefore the long-run proportion of lines for which $W_1 < 4$ would be 0.33.

• And if we were to calculate the standard deviation of all those values of $W_1$, we'd get $\sigma$, which we also know to be 10, since the mean and standard deviation are equal in the case of exponential distributions.

• Equation (4.5) says that if we were to average all the values of $\overline{W}$ over all the lines of the notebook, we'd get 10.0 there too.

• If we were to calculate the standard deviation of those values of $\overline{W}$, we'd get $\sigma / \sqrt{n}$, which we know to be $10/\sqrt{20} \approx 2.24$.

These points are absolutely key, forming the very basis of statistics. You should spend extra time pondering them.
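The notebook view of the sample mean can be imitated directly in R. The sketch below is only an illustration under stated assumptions: a large but finite number of lines (nlines, a name introduced here) stands in for the infinitely many lines of the notebook, and the waits are generated straight from the exponential distribution with mean 10 rather than by simulating bus arrivals.

```r
# Sketch of the "notebook": each row is one line, holding n = 20 wait times.
# nlines is an illustrative stand-in for "infinitely many" lines.
set.seed(1)
nlines <- 50000
n <- 20
w <- matrix(rexp(nlines * n, rate = 0.1), nrow = nlines, ncol = n)
wbar <- rowMeans(w)            # the W-bar column, one value per line

meanw1 <- mean(w[, 1])         # long-run average of the W_1 column
propw1 <- mean(w[, 1] < 4)     # long-run proportion of lines with W_1 < 4
sdw1 <- sd(w[, 1])             # standard deviation of the W_1 column
meanwbar <- mean(wbar)         # average of the W-bar column
sdwbar <- sd(wbar)             # standard deviation of the W-bar column

print(c(meanw1, propw1, sdw1, meanwbar, sdwbar))
print(c(10, 1 - exp(-0.4), 10, 10, 10 / sqrt(n)))   # theoretical counterparts
```

The first printed row should be close to the second, which lists the mean, P(W < 4), and standard deviation for a single wait, followed by the mean and standard deviation of the sample mean.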
4.2.2.2 Our First Confidence Interval

The Central Limit Theorem then tells us that

$$Z = \frac{\overline{W} - \mu}{\sigma / \sqrt{n}} \tag{4.11}$$

has an approximately N(0,1) distribution. We will be interested in the central 95% of that distribution, which due to symmetry has 2.5% of the area in the left tail and 2.5% in the right one. Through the R call qnorm(0.025), or by consulting a N(0,1) cdf table in a book, we find that the cutoff points are at -1.96 and 1.96. Thus

$$0.95 \approx P\left( -1.96 < \frac{\overline{W} - \mu}{\sigma / \sqrt{n}} < 1.96 \right) \tag{4.12}$$

Doing a bit of algebra on the inequalities yields

$$0.95 \approx P\left( \overline{W} - 1.96 \frac{\sigma}{\sqrt{n}} < \mu < \overline{W} + 1.96 \frac{\sigma}{\sqrt{n}} \right) \tag{4.13}$$

Now remember, not only do we not know $\mu$, we also don't know $\sigma$. But we can estimate it, as follows:

Recall that by definition

$$\sigma^2 = E[(W - \mu)^2] \tag{4.14}$$

Let's estimate $\sigma^2$ by taking sample analogs. The sample analog of $\mu$ is $\overline{W}$. What about the sample analog of the "E()"? Well, since E() means averaging over the whole population of Ws, the sample analog is to average over the sample. So, we get

$$\frac{1}{n} \sum_{i=1}^{n} (W_i - \overline{W})^2 \tag{4.15}$$

In other words, just as it is natural to estimate the population mean of W by its sample mean, the same holds for Var(W):

The population variance of W is the mean squared distance from W to its population mean. Therefore it is natural to estimate Var(W) by the average squared distance of W from its sample mean, among our sample values $W_i$.

We use $s^2$ as our symbol for this estimate of population variance. [4] We thus take our estimate of $\sigma$ to be s, the square root of that quantity.

By the way, (4.15) is equal to

$$s^2 = \frac{1}{n} \sum_{i=1}^{n} W_i^2 - \overline{W}^2 \tag{4.16}$$

Caution: This way of computing $s^2$ is subject to more roundoff error.

One can show (the details will be given at the end of this section) that (4.13) is still valid if we substitute s for $\sigma$, i.e.
$$0.95 \approx P\left( \overline{W} - 1.96 \frac{s}{\sqrt{n}} < \mu < \overline{W} + 1.96 \frac{s}{\sqrt{n}} \right) \tag{4.17}$$

In other words, we are about 95% sure that the interval

$$\left( \overline{W} - 1.96 \frac{s}{\sqrt{n}}, \; \overline{W} + 1.96 \frac{s}{\sqrt{n}} \right) \tag{4.18}$$

contains $\mu$. This is called a 95% confidence interval for $\mu$.

We could add this feature to our program:

[4] Though I try to stick to the convention of using only capital letters to denote random variables, it is conventional to use lower case in this instance.

    doexpt <- function(opt) {
       lastarrival <- 0.0
       while (lastarrival < opt)
    [...]

[...] though it's not immediately important here, note that there would also be columns for $W_1$ through $W_{1000}$, the weights of our 1000 people, and columns for $\overline{W}$ and s.

Now here is the point: Approximately 95% of all those intervals would contain $\mu$, the mean weight in the entire adult population of Davis. The value of $\mu$ would be unknown to us (that's why we'd be sampling 1000 people in the first place!) but it does exist, and it would be contained in approximately 95% of the intervals.

As a variation on the notebook idea, think of what would happen if you and 99 friends each do this experiment. Each of you would sample 1000 people and form a confidence interval. Since each of you would get a different sample of people, you would each get a different confidence interval. What we mean when we say the confidence level is 95% is that of the 100 intervals formed by you and 99 friends, about 95 of them will contain the true population mean weight. Of course, you hope you yourself will be one of the 95 lucky ones! But remember, you'll never know whose intervals are correct and whose aren't.

Now remember, in practice we only take one sample of 1000 people. Our notebook idea here is merely for the purpose of understanding what we mean when we say that we are about 95% confident that one interval we form does contain the true value of $\mu$.

4.2.3.2 Back to Our Bus Simulation

Well, in our simulation case, it is exactly the same situation. Simulation is a sampling process. Our $\mu$ is the mean in the population of all bus waits, while $\overline{W}$ is the mean in our sample of 1000 waits. This is not mere analogy; mathematically the two situations are completely identical, two instances of the same principle.
Let's use the "you and your 99 friends" idea again. Suppose each of you 100 people run the R program at the end of Section 4.2.2.2. Each of you will get a different confidence interval printed out at the end of your run. [6] Well, when we say that the program prints out a 95% confidence interval, we mean that about 95 of you 100 people will have an interval that contains the true value of EW.

In the Davis weight example above, I stressed that we don't know $\mu$; after all, that's the reason we are taking a sample of people, so as to estimate $\mu$.

Similarly, the whole point of doing a simulation to find some quantity ER is that we don't know the value of ER! We will simulate many values of R, forming $\overline{R}$, and use that quantity as an estimate of ER.

But our bus example was just that, an example, set up to illustrate the notion of adding a confidence interval to the output of a simulation. We actually do know the value of EW here; it's 10. That makes this a rather artificial example, but that's good, because it will allow us to really see the "you and 99 friends" idea in action, as follows.

We'll expand the code to simulate 1000 people running the original program. In other words, we'll add an extra outer loop to do 1000 runs of the program. Each run will compute the confidence interval, and then we'll see in the end how many of the 1000 runs have a confidence interval that includes the true EW, 10.0:

    doexpt <- function(opt) {
       lastarrival <- 0.0
       while (lastarrival < opt)
          lastarrival <- lastarrival + rexp(1,0.1)
       return(lastarrival-opt)
    }

    observationpt <- 240
    nreps <- 1000
    numruns <- 1000
    waits <- vector(length=nreps)
    numcorrectcis <- 0  # number of conf. ints. that contain 10.0
    for (run in 1:numruns) {
       for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
       wbar <- mean(waits)
       s2 <- mean((waits-wbar)^2)  # sample variance with divisor n, as in (4.15)
       s <- sqrt(s2)
       radius <- 1.96*s/sqrt(nreps)
       if (abs(wbar-10.0) <= radius) numcorrectcis <- numcorrectcis + 1
    }
    cat("approx. true confidence level=",numcorrectcis/numruns,"\n")

In fact, the output of that program was 0.958, sure enough about 95%.

Why is it not exactly 0.95?

• We only simulated 1000 intervals; ideally it should be an infinite number, to get the exact probability that an interval contains $\mu$.
• The Central Limit Theorem is only approximate.
• Ideally we would use (4.13), but due to lack of knowledge of the true value of $\sigma$ (we don't know $\mu$, so why would we know $\sigma$?), we resorted to using s instead, in (4.18).

Again remember that in practice we only do one run of simulating 1000 waits for the bus. Our simulation code above is merely for the purpose of understanding what we mean when we say that we are about 95% confident that one interval we form does contain the true value of $\mu$.

4.2.3.3 One More Point About Interpretation

Some statistics instructors give students the odd warning, "You can't say that the probability is 95% that $\mu$ is IN the interval; you can only say that the probability is 95% that the interval CONTAINS $\mu$." [7] This of course is nonsense. As any fool can see, the following two statements are equivalent:

• "$\mu$ is in the interval"
• "the interval contains $\mu$"

So it is ridiculous to say that the first is incorrect. Yet many instructors of statistics say so.

Where did this craziness come from? Well, way back in the early days of statistics, some instructor was afraid that a statement like "The probability is 95% that $\mu$ is in the interval" would make it sound like $\mu$ is a random variable. Granted, that was a legitimate fear, because $\mu$ is not a random variable, and without proper warning, some learners of statistics might think incorrectly. The random entity is the interval, not $\mu$. This is clear in our program above: the 10 is constant, while wbar and s vary from interval to interval.

So, it was reasonable for teachers to warn students not to think $\mu$ is a random variable. But later on, some idiot must have then decided that it is incorrect to say "$\mu$ is in the interval," and other idiots then followed suit.

[7] See for example the Wikipedia entry, "Confidence Intervals," http://en.wikipedia.org/wiki/Confidence_interval#Meaning_and_interpretation.

4.2.4 Sampling With and Without Replacement

Implicit in our analyses so far (in our assumption that the $W_i$ are independent) is that we are sampling with replacement, which means it's possible that our random sampling process might choose the same person twice.
If we sample with replacement, we say that we have a random sample. If it is done without replacement, it's called a simple random sample. In the latter case, (4.9) does not hold, because the $W_i$ are not independent (though they are still identically distributed). To see this, suppose that Davis were a tiny town consisting of just three adults, with weights 120, 161 and 190. Then if for example $W_1 = 190$, then $E(W_2 | W_1 = 190) = (120+161)/2 = 140.5$, while the unconditional mean is $E(W_2) = (120+161+190)/3 = 157$. Thus $W_1$ and $W_2$ are not independent, and (4.9) would fail. [8]

But except for cases in which our sample size is a substantial fraction of the population size, the probability of getting the same person twice would be very low, so it doesn't matter. Thus we can safely use analyses which assume with-replacement sampling even if we are using without-replacement sampling.

4.2.5 Other Confidence Levels

We have been using 95% as our confidence level. This is common, but of course not unique. We can for instance use 90%, which gives us a narrower interval (in (4.18), we multiply by 1.65 instead of by 1.96, which the reader should check), at the expense of lower confidence.

A confidence interval's error rate is usually denoted by $\alpha$ and its confidence level by $1 - \alpha$, so a 95% confidence level has $\alpha = 0.05$.

4.2.6 The Standard Error of the Estimate

Remember, $\overline{W}$ is a random variable. In our Davis people example, each line of the notebook would correspond to a different sample of 1000 people, and thus each line would have a different value for $\overline{W}$. Thus it makes sense to talk about $Var(\overline{W})$, and to refer to the square root of that quantity, i.e. the standard deviation of $\overline{W}$. In (4.9), we found this to be $\sigma / \sqrt{n}$ and decided to estimate it by $s / \sqrt{n}$. The latter is called the standard error of the estimate (s.e.), meaning the estimate of the standard deviation of the estimate $\overline{W}$. (The word estimate was used twice in the preceding sentence. Make sure to understand the two different settings that they apply to.)

[8] Note, though, that (4.5) does hold, because expected values of sums equal sums of expected values even for dependent random variables.
We can see from (4.18) what to do in general, if we are estimating some number $\theta$ by $\hat{\theta}$ [9] and the latter has an approximately normal distribution. Let $s.e.(\hat{\theta})$ denote our estimate for the standard deviation of that distribution, i.e. the standard error of $\hat{\theta}$. Then an approximate 95% confidence interval for $\theta$ is

$$\hat{\theta} \pm 1.96 \; s.e.(\hat{\theta}) \tag{4.19}$$

The standard error of the estimate is one of the most commonly-used quantities in statistical applications. You will encounter it frequently in the output of R, for instance. Make sure you understand what it means and how it is used.

4.2.7 Why Not Divide by n-1? The Notion of Bias

It should be noted that it is customary in (4.15) to divide by n-1 instead of n, for reasons that are largely historical. Here's the issue:

If we divide by n, as we have, then it turns out that

$$E(s^2) = \frac{n-1}{n} \sigma^2 \tag{4.20}$$

Think about this in the Davis people example, once again in the notebook context. Remember, here n is 1000, and each line of the notebook represents our taking a different random sample of 1000 people. Within each line, there will be entries for $W_1$ through $W_{1000}$, the weights of our 1000 people, and for $\overline{W}$ and s. For convenience, let's suppose we record that last column as $s^2$ instead of s.

Now, say we want to estimate the population variance $\sigma^2$. As discussed earlier, the natural estimator for it would be the sample variance, $s^2$. What (4.20) says is that after looking at an infinite number of lines in the notebook, the average value of $s^2$ would be just...a...little...bit...too...small. All the $s^2$ values would average out to $0.999\sigma^2$, rather than to $\sigma^2$. We might say that $s^2$ has a little bit more tendency to underestimate $\sigma^2$ than to overestimate it.

We say that $s^2$ is a biased estimator of $\sigma^2$, with the amount of bias being

$$E(s^2) - \sigma^2 = -\frac{1}{n} \sigma^2 \tag{4.21}$$

Let's prove (4.20). We'll use (4.16). As before, let W be a random variable distributed as the population. Recall from Section 4.2.2.1 that this implies that $EW = \mu$ and $Var(W) = \sigma^2$, where $\mu$ and $\sigma^2$ are the population mean and variance. Write

$$E\left( \sum_{i=1}^{n} W_i^2 \right) = nE(W^2) \quad \text{(Sec. 4.2.2.1)} \tag{4.22}$$
$$= n[Var(W) + (EW)^2] \quad \text{(shortcut formula for Var())} \tag{4.23}$$
$$= n[\sigma^2 + \mu^2] \quad \text{(Sec. 4.2.2.1)} \tag{4.24}$$

[9] The quantity is pronounced "theta-hat." The "hat" symbol is traditional for "estimate of."
Continuing to work from (4.16), write

$$E[\overline{W}^2] = Var(\overline{W}) + [E(\overline{W})]^2 = \frac{1}{n} \sigma^2 + \mu^2 \tag{4.25}$$

Now using all this in (4.16), i.e. $E(s^2) = \frac{1}{n} \cdot n(\sigma^2 + \mu^2) - \left( \frac{1}{n}\sigma^2 + \mu^2 \right)$, we get

$$E(s^2) = \frac{n-1}{n} \sigma^2 \tag{4.26}$$

The earlier developers of statistics were bothered by this bias, so they introduced a "fudge factor" by dividing by n-1 instead of n in (4.15). But we will use n. After all, when n is large (which is what we are assuming by using the Central Limit Theorem in the entire development so far) it doesn't make any appreciable difference. Clearly it is not important in our Davis example, or our bus simulation example.

Moreover, speaking generally now rather than necessarily for the case of $s^2$, there is no particular reason to insist that an estimator be unbiased anyway. An alternative estimator may have a little bias but much smaller variance, and thus might be preferable. And anyway, even if the classical version of $s^2$ is an unbiased estimator for $\sigma^2$, s is not an unbiased estimator for $\sigma$, the population standard deviation. In other words, unbiasedness is not such an important property.

So, in our treatment here, our definition of $s^2$ divides by n rather than by n-1.

The R functions var() and sd() calculate the versions of $s^2$ and s, respectively, that have a divisor of n-1.

4.2.8 And What About the Student-t Distribution?

Another thing we are not doing here is to use the Student t-distribution. That is the name of the distribution of the quantity

$$T = \frac{\overline{W} - \mu}{\tilde{s} / \sqrt{n}} \tag{4.27}$$

Here $\tilde{s}$ denotes the value of s under its classical definition, in which we divide by n-1 instead of n. Note carefully that we are assuming that the $W_i$ themselves (not just $\overline{W}$) have a normal distribution. The exact distribution of T is called the Student t-distribution with n-1 degrees of freedom. These distributions thus form a one-parameter family, with the degrees of freedom being the parameter.

This distribution has been tabulated. In R, for instance, the functions dt(), pt() and so on play the same roles as dnorm(), pnorm() etc. do for the normal family. The call qt(0.975,9) returns 2.26. This enables us to get a confidence interval for $\mu$ from a sample of size 10, at EXACTLY a 95% confidence level, rather than being at an APPROXIMATE 95% level as we have had here, as follows.
We start with (4.12), replacing 1.96 by 2.26, $(\overline{W} - \mu)/(\sigma/\sqrt{n})$ by T, and $\approx$ by =. Doing the same algebra, we find the following confidence interval for $\mu$:

$$\left( \overline{W} - 2.26 \frac{\tilde{s}}{\sqrt{10}}, \; \overline{W} + 2.26 \frac{\tilde{s}}{\sqrt{10}} \right) \tag{4.28}$$

Of course, for general n, replace 2.26 by $t_{0.975,n-1}$, the 0.975 quantile of the t-distribution with n-1 degrees of freedom. The distribution is tabulated by the R functions dt(), pt() and so on.

I do not use the t-distribution here because:

• It depends on the parent population having an exact normal distribution, which is never really true. In the Davis case, for instance, people's weights are approximately normally distributed, but definitely not exactly so. For that to be exactly the case, some people would have to have weights of say, a billion pounds, or negative weights, since any normal distribution takes on all values from $-\infty$ to $\infty$.
• For large n, the difference between the t-distribution and N(0,1) is negligible anyway.

4.2.9 Confidence Intervals for Proportions

In our bus example above, suppose we also want our simulation to print out the (estimated) probability that one must wait longer than 6.2 minutes:

    doexpt <- function(opt) {
       lastarrival <- 0.0
       while (lastarrival < opt)
    [...]

Then

$$E(Y) = 1 \cdot P(Y = 1) + 0 \cdot P(Y = 0) = P(W > 6.2) \tag{4.30}$$

Let p denote this probability, and let $\hat{p}$ denote our estimate of it; $\hat{p}$ is our prop in the program. In (4.16), take $W_i$ to be our $Y_i$ here, and note that $Y_i^2 = Y_i$. That means that

$$s^2 = \hat{p} - \hat{p}^2 = \hat{p}(1 - \hat{p}) \tag{4.31}$$

Equation (4.18) becomes

$$\left( \hat{p} - 1.96 \sqrt{\hat{p}(1-\hat{p})/n}, \; \hat{p} + 1.96 \sqrt{\hat{p}(1-\hat{p})/n} \right) \tag{4.32}$$

4.2.9.2 Examples

We incorporate that into our program:

    doexpt <- function(opt) {
       lastarrival <- 0.0
       while (lastarrival < opt)
    [...]

Note also that although we've used the word proportion in the Davis weights example instead of probability,
they are the same. If I choose an adult at random from the population, the probability that his/her weight is more than 150 is equal to the proportion of adults in the population who have weights of more than 150.

And the same principles are used in opinion polls during presidential elections. Here p is the population proportion of people who plan to vote for the given candidate. This is an unknown quantity, which is exactly the point of polling a sample of people: to estimate that unknown quantity p. Our estimate is $\hat{p}$, the proportion of people in our sample who plan to vote for the given candidate, and n is the number of people that we poll. We again use (4.32).

4.2.9.3 Interpretation

The same interpretation holds as before. Consider the examples in the last section:

• If each of you and 99 friends were to run the R program at the beginning of Section 4.2.9.2, you 100 people would get 100 confidence intervals for P(W > 6.2). About 95 of you would have intervals that do contain that number.
• If each of you and 99 friends were to sample 1000 people in Davis and come up with confidence intervals for the true population proportion of people who weigh more than 150 pounds, about 95 of you would have intervals that do contain that true population proportion.
• If each of you and 99 friends were to sample 1200 people in an election campaign, to estimate the true population proportion of people who will vote for candidate X, about 95 of you will have intervals that do contain this population proportion.

4.2.9.4 (Non-)Effect of the Population Size

Note that in both the Davis and election examples, it doesn't matter what the size of the population is. The approximate distribution of $\hat{p}$, N(p, p(1-p)/n), and thus the accuracy of $\hat{p}$, depends only on p and n. So when people ask, "How can a presidential election poll get by with sampling only 1200 people, when there are more than 100,000,000 voters in the U.S.?" now you know the answer. (We'll discuss the question "Why 1200?" below.)

Another way to see this is to think of a situation in which we wish to estimate the probability p of heads for a certain coin. We toss the coin n times, and use $\hat{p}$ as our estimate of p. Here our "population" (the population of all coin tosses) is infinite, yet it is still the case that 1200 tosses would be enough to get a good estimate of p.
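The interval (4.32) is short enough to wrap in a small function. This is a minimal sketch, not the program from the text: propci() is a hypothetical helper name, and the waits are drawn directly from the exponential distribution with mean 10, relying on the memoryless property, instead of simulating bus arrivals.

```r
# Sketch: approximate 95% confidence interval for a proportion, as in (4.32).
# propci() is a hypothetical helper; y is a vector of 0/1 (or TRUE/FALSE) values.
propci <- function(y) {
   n <- length(y)
   phat <- mean(y)                               # sample proportion
   radius <- 1.96 * sqrt(phat * (1 - phat) / n)
   c(lower = phat - radius, upper = phat + radius)
}

set.seed(1)
waits <- rexp(1000, rate = 0.1)   # assumed: waits are exponential with mean 10
ci <- propci(waits > 6.2)         # interval for P(W > 6.2)
trueprob <- exp(-0.62)            # exact P(W > 6.2) for this distribution
print(ci)
print(trueprob)
```

Because the true probability is known in this artificial setting, one can rerun the last lines with different seeds and see that most, but not all, intervals contain it.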
4.2.9.5 Planning Ahead

Now, why do the pollsters sample 1200 people?

First, note that the maximum possible value of $\hat{p}(1 - \hat{p})$ is 0.25. [10] Then the pollsters know that their margin of error with n = 1200 will be at most $1.96 \times 0.5 / \sqrt{1200}$, or about 3%, even before they poll anyone. They consider 3% to be sufficiently accurate for their purposes, so 1200 is the n they choose.

[10] Use calculus to find the maximum value of f(x) = x(1-x).

4.2.10 One-Sided Confidence Intervals

Confidence intervals as discussed so far give one both an upper and lower bound for the parameter of interest. (From here on, the word parameter is used in a broader context than just parametric families of distributions. The term will refer to any population quantity.)

In some applications, we are interested in having only an upper bound, or only a lower bound. One can go through the same kind of reasoning as in Section 4.2 above to obtain approximate 95% one-sided confidence intervals:

$$\left( \overline{W} - 1.65 \frac{s}{\sqrt{n}}, \; \infty \right) \tag{4.33}$$

$$\left( -\infty, \; \overline{W} + 1.65 \frac{s}{\sqrt{n}} \right) \tag{4.34}$$

Note the constant 1.65, which is the 0.95 quantile of the N(0,1) distribution, compared to 1.96, the 0.975 quantile.
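The planning calculation and the two normal quantiles can be reproduced in a few lines of R. This is a sketch only; marginoferror() is a hypothetical helper name, and the 3% target is the figure quoted in the text.

```r
# Sketch: worst-case margin of error for a proportion, and N(0,1) quantiles.
# marginoferror() is a hypothetical helper; 0.5 is the largest possible
# value of sqrt(phat * (1 - phat)).
marginoferror <- function(n, z = qnorm(0.975)) z * 0.5 / sqrt(n)

m1200 <- marginoferror(1200)                        # worst case for n = 1200
nneeded <- ceiling((qnorm(0.975) * 0.5 / 0.03)^2)   # smallest n with margin <= 0.03
z1 <- qnorm(0.95)    # constant for one-sided 95% intervals
z2 <- qnorm(0.975)   # constant for two-sided 95% intervals
print(c(m1200, nneeded, z1, z2))
```

The margin for n = 1200 comes out a little under 3%, which fits the pollsters' choice.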
4.2.11 Confidence Intervals for Differences of Means or Proportions

4.2.11.1 Independent Samples

Suppose in our sampling of people in Davis we are mainly interested in the difference in weights between men and women. Let $\overline{X}$ and $n_1$ denote the sample mean and sample size for men, and let $\overline{Y}$ and $n_2$ denote them for the women. Denote the population means and variances by $\mu_i$ and $\sigma_i^2$, i = 1,2. We wish to find a confidence interval for $\mu_1 - \mu_2$. The natural estimator for that quantity is $\overline{X} - \overline{Y}$.

In order to form a confidence interval for $\mu_1 - \mu_2$ using $\overline{X} - \overline{Y}$, we need to know the distribution of that latter quantity. To see this, recall that this is how we eventually got (4.18); we started by noting the distribution of $\overline{W}$, or more precisely the distribution of $(\overline{W} - \mu)/(\sigma/\sqrt{n})$ in (4.11), and then used that to derive our confidence interval. So, here we need to know the distribution of $\overline{X} - \overline{Y}$.

Note first that $\overline{X}$ and $\overline{Y}$ are independent. They come from separate people. Also, as noted before, they are approximately normally distributed. So, they jointly have an approximately bivariate normal distribution. Then from our earlier unit on multivariate distributions, page 85, we know that the linear combination

$$\overline{X} - \overline{Y} = (1)\overline{X} + (-1)\overline{Y} \tag{4.35}$$

will also have an approximately normal distribution, with mean $1 \cdot \mu_1 + (-1)\mu_2$ and variance $1^2 \cdot \sigma_1^2/n_1 + (-1)^2 \sigma_2^2/n_2$.
If we then let $s_i^2$, i = 1,2 denote the two sample variances, we have that

$$Z = \frac{\overline{X} - \overline{Y} - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \tag{4.36}$$

has an approximate N(0,1) distribution, and working as before, we have that an approximate 95% confidence interval for $\mu_1 - \mu_2$ is

$$\left(\overline{X} - \overline{Y} - 1.96\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\ \ \overline{X} - \overline{Y} + 1.96\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right) \tag{4.37}$$

A similar derivation gives us an approximate 95% confidence interval for the difference in two population proportions $p_1 - p_2$:

$$\left(\hat{p}_1 - \hat{p}_2 - 1.96\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}},\ \ \hat{p}_1 - \hat{p}_2 + 1.96\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\right) \tag{4.38}$$

where

$$s_i^2 = \hat{p}_i(1 - \hat{p}_i) \tag{4.39}$$

Example: In a network security application, C. Mano et al [11] compare round-trip travel time for packets involved in the same application in certain wired and wireless networks. The data was as follows:

    sample      sample mean   sample s.d.   sample size
    wired             2.000         6.299           436
    wireless         11.520         9.939           344

We had observed quite a difference, 11.52 versus 2.00, but could it be due to sampling variation? Maybe we have unusual samples? This calls for a confidence interval!

Then a 95% confidence interval for the difference between wireless and wired networks is

$$11.520 - 2.000 \pm 1.96\sqrt{\frac{9.939^2}{344} + \frac{6.299^2}{436}} = 9.52 \pm 1.21 \tag{4.40}$$

So you can see that there is a big difference between the two networks, even after allowing for sampling variation.
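For readers who want to reproduce this kind of interval, here is a small R sketch of (4.37) working from summary statistics. The function name diff_ci and its argument names are ours, not part of any package.

```r
# Approximate confidence interval for mu1 - mu2 from summary statistics,
# following (4.37): means m1, m2; standard deviations s1, s2; sizes n1, n2
diff_ci <- function(m1, s1, n1, m2, s2, n2, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)        # 1.96 for a 95% level
  se <- sqrt(s1^2 / n1 + s2^2 / n2)      # standard error of the difference
  est <- m1 - m2
  c(estimate = est, se = se, lower = est - z * se, upper = est + z * se)
}

# Wireless minus wired round-trip times, from the table above
diff_ci(11.520, 9.939, 344, 2.000, 6.299, 436)
```

With these inputs the standard error comes out near 0.61 and the interval runs from roughly 8.3 to 10.7, far from 0.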
[11] RIPPS: Rogue Identifying Packet Payload Slicer Detecting Unauthorized Wireless Hosts Through Network Traffic Conditioning, C. Mano and a ton of other authors, ACM Transactions on Information Systems and Security, to appear.

4.2.11.2 Random Sample Size

In our Davis weights example in Section 4.2.3.1, we were implicitly assuming that the sample sizes of the two groups, $n_1$ and $n_2$, were nonrandom. For instance, we might sample 500 men and 500 women.

On the other hand, we might simply sample 1000 people without regard to gender. Then the number of men and women in the sample would be random. Think once again of our notebook view. In our first sample of 1000 people, we might have 492 men and 508 women. In our second sample, the gender breakdown might be 505 and 495, and so on. In keeping with the convention to denote random quantities by capital letters, we might write the numbers of men and women in our sample as $N_1$ and $N_2$.

However, in most cases it should not matter. As long as there is not some odd property of our sampling method, e.g. one in which there would be a tendency for larger samples to have shorter men, we can simply do our inference conditionally on $N_1$ and $N_2$, thus treating them as constants.

4.2.11.3 Dependent Samples

Note carefully, though, that a key point above was the independence of the two samples. By contrast, suppose we wish, for instance, to find a confidence interval for $\mu_1 - \mu_2$, the difference in mean weights in Davis of 15-year-old and 10-year-old children, and suppose our data consist of pairs of weight measurements at the two ages on the same children. In other words, we have a sample of n children, and for the i-th child we have his/her weight $U_i$ at age 15 and $V_i$ at age 10. Let $\overline{V}$ and $\overline{U}$ denote the sample means.

The problem is that the two sample means are not independent. If a child is heavier than his/her peers at age 15, he/she was probably heavier than them when they were all age 10. In other words, for each i, $V_i$ and $U_i$ are positively correlated, and thus the same is true for $\overline{V}$ and $\overline{U}$. Thus we cannot use (4.37).
However, the random variables $T_i = V_i - U_i$, i = 1,2,...,n are still independent. Thus we can use (4.18), so that our approximate 95% confidence interval is

$$\left(\overline{T} - 1.96\,\frac{s}{\sqrt{n}},\ \ \overline{T} + 1.96\,\frac{s}{\sqrt{n}}\right) \tag{4.41}$$

where $s^2$ is the sample variance of the $T_i$.

A common situation in which we have dependent samples is that in which we are comparing two dependent proportions. Suppose for example that there are three candidates running for a political office, A, B and C. We poll 1,000 voters and ask whom they plan to vote for. Let $p_A$, $p_B$ and $p_C$ be the three population proportions of people planning to vote for the various candidates, and let $\hat{p}_A$, $\hat{p}_B$ and $\hat{p}_C$ be the corresponding sample proportions.

Suppose we wish to form a confidence interval for $p_A - p_B$. Clearly, the two sample proportions are not independent random variables, since for instance if $\hat{p}_A = 1$ then we know for sure that $\hat{p}_B$ is 0. To deal with this, we could set up variables $U_i$ and $V_i$ as above, with for example $U_i$ being 1 or 0, according to whether the i-th person in our sample plans to vote for A or not.
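The paired approach just described can be sketched in R as follows. The poll counts are made up purely for illustration, and the variable names are ours.

```r
# Hypothetical poll of 1000 voters: 420 for A, 350 for B, 230 for C
# (invented counts, used only to illustrate the method)
votes <- rep(c("A", "B", "C"), times = c(420, 350, 230))

u <- as.numeric(votes == "A")   # U_i: 1 if voter i plans to vote for A
v <- as.numeric(votes == "B")   # V_i: 1 if voter i plans to vote for B
t_i <- u - v                    # per-voter differences, independent across voters

n <- length(t_i)
est <- mean(t_i)                # same as p-hat_A minus p-hat_B
se <- sd(t_i) / sqrt(n)         # sd() divides by n - 1; negligible for n = 1000
ci <- c(estimate = est, lower = est - 1.96 * se, upper = est + 1.96 * se)
ci
```

Because each voter contributes one difference, the dependence between the two sample proportions is absorbed into the sample variance of the differences, and the interval is formed exactly as in (4.41).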
But we can do better. Let $N_A$, $N_B$ and $N_C$ denote the actual numbers of people in our sample who state they will vote for the various candidates, so that for instance $\hat{p}_A = N_A/1000$. Well, the point is that the vector $(N_A, N_B, N_C)^T$ has a multinomial distribution. For such a vector, $Cov(N_A, N_B) = -1000\,p_A p_B$, so the covariance term enters the variance of a difference with a plus sign. Thus we know that

$$\hat{p}_A - \hat{p}_B = 0.001\,N_A - 0.001\,N_B \tag{4.42}$$

has variance

$$0.001\,p_A(1 - p_A) + 0.001\,p_B(1 - p_B) + 0.002\,p_A p_B \tag{4.43}$$

So, the standard error of $\hat{p}_A - \hat{p}_B$ is

$$\sqrt{0.001\,\hat{p}_A(1 - \hat{p}_A) + 0.001\,\hat{p}_B(1 - \hat{p}_B) + 0.002\,\hat{p}_A \hat{p}_B} \tag{4.44}$$

4.2.12 Example: Machine Classification of Forest Covers

Remote sensing is machine classification of type from variables observed aerially, typically by satellite. In the application we'll consider here, researchers want to predict forest cover type for a given location (there are seven different types), from known geographic data, as direct observation is too expensive and may suffer from land access permission issues. See Blackard, Jock A. and Denis J. Dean, 2000, Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables, Computers and Electronics in Agriculture, 24:131-151.

There were over 50,000 observations, but for simplicity we'll just use the first 1,000 here.
One of the variables was the amount of hillside shade at noon, which we'll call HS12. Let's find an approximate 95% confidence interval for the difference in population mean HS12 values in cover type 1 and type 2 locations. The two sample means were 223.8 and 226.3, with s values of 15.3 and 14.3, and the sample sizes were 226 and 585. So our confidence interval is

$$223.8 - 226.3 \pm 1.96\sqrt{\frac{15.3^2}{226} + \frac{14.3^2}{585}} = -2.5 \pm 2.3 = (-4.8,\ -0.2) \tag{4.45}$$

Now let's find a confidence interval for the difference in population proportions of sites that have cover types 1 and 2. Our sample estimate is

$$\hat{p}_1 - \hat{p}_2 = 0.226 - 0.585 = -0.359 \tag{4.46}$$

The standard error of this quantity is the square root of the estimated variance of a difference of two multinomial proportions, in which the covariance term carries a plus sign:

$$\sqrt{0.001 \cdot 0.226 \cdot 0.774 + 0.001 \cdot 0.585 \cdot 0.415 + 0.002 \cdot 0.226 \cdot 0.585} = 0.026 \tag{4.47}$$

That gives us a confidence interval of

$$-0.359 \pm 1.96 \cdot 0.026 = (-0.410,\ -0.308) \tag{4.48}$$

4.2.13 Exact Confidence Intervals

Recall how we derived our previous confidence intervals. We began with a probability statement involving our estimator, and then did some algebra to turn it around into a formula for a confidence interval. Those operations had nothing to do with the approximate nature of the distributions involved. We can do the same thing if we have exact distributions.
For example, suppose we have a random sample $X_1, \ldots, X_{10}$ from an exponential distribution with parameter $\lambda$. Let's find an exact 95% confidence interval for $\lambda$.

Let

$$T = X_1 + \ldots + X_{10} \tag{4.49}$$

Recall that T has a gamma distribution with parameters 10 (the shape, in R's terminology) and $\lambda$. Let $q(\lambda)$ denote the 0.95 quantile of this distribution, i.e. the point to the right of which there is only 5% of the area under the density. Note carefully that this is indeed a function of $\lambda$; it has different values for different $\lambda$.

Then:

$$0.95 = P[T \le q(\lambda)] = P[q^{-1}(T) \ge \lambda] \tag{4.50}$$

Here we have used the fact that q is a decreasing function.

So, an EXACT 95% one-sided confidence interval for $\lambda$ is

$$(0,\ q^{-1}(T)) \tag{4.51}$$

Now, what IS $q^{-1}$? Recall what q is, the 0.95 quantile of the gamma distribution with shape 10. It always helps intuition to look at some specific numbers:

    > qgamma(.95,10,2.5)
    [1] 6.282087
    > qgamma(.95,10,4)
    [1] 3.926304

So, q(2.5) = 6.28 and q(4) = 3.93. That means $q^{-1}(6.28) = 2.5$ and $q^{-1}(3.93) = 4$.

You can now see how we can form the interval. Say T = 16.4. Then we do some trial-and-error until we find a number w such that qgamma(.95,10,w) = 16.4. Our confidence interval is then (0,w).

4.2.14 Slutsky's Theorem (advanced topic)

The reader should review Section 2.3.2.6 before continuing.

Since one generally does not know the value of $\sigma$ in (4.13), we replace it by s, yielding (4.17). Why was that legitimate?

The answer depends on the theorem below. First, we need a definition.

Definition 3 We say that a sequence of random variables $L_n$ converges in probability to the random variable L if for every $\epsilon > 0$

$$\lim_{n \to \infty} P(|L_n - L| > \epsilon) = 0 \tag{4.52}$$

This is a little weaker than convergence with probability 1, as in the Strong Law of Large Numbers (SLLN, Section 1.4.10). Convergence with probability 1 implies convergence in probability but not vice versa.

So for example, if $Q_1, Q_2, Q_3, \ldots$ are i.i.d. with mean $\mu$, then the SLLN implies that

$$L_n = \frac{Q_1 + \ldots + Q_n}{n} \tag{4.53}$$

converges with probability 1 to $\mu$, and thus $L_n$ converges in probability to $\mu$ too.
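To make Definition 3 concrete, here is a small R sketch for one special case that we have chosen ourselves: the $Q_i$ are fair-coin indicators, so that the probability in (4.52) can be computed exactly from the binomial distribution rather than simulated. The function name tail_prob and the choice of 0.05 for epsilon are ours.

```r
# Exact P(|L_n - p| > eps), where L_n is the mean of n independent
# Bernoulli(p) variables, so that n * L_n is binomial(n, p)
tail_prob <- function(n, eps = 0.05, p = 0.5) {
  s <- 0:n                               # possible values of the sum
  far <- abs(s / n - p) > eps            # outcomes with L_n far from p
  sum(dbinom(s[far], n, p))
}

ns <- c(10, 100, 1000, 10000)
probs <- sapply(ns, tail_prob)
data.frame(n = ns, prob = probs)
```

The probabilities shrink toward 0 as n grows, which is what convergence in probability of $L_n$ to 0.5 means in this case.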
4.2.14.1 The Theorem

Theorem 4 Slutsky's Theorem (abridged version): Consider random variables $X_n$, $Y_n$, and X, such that $X_n$ converges in distribution to X and $Y_n$ converges in probability to a constant c. Then:

(a) $X_n + Y_n$ converges in distribution to X + c.

(b) $X_n / Y_n$ converges in distribution to X/c, provided c is nonzero.

4.2.14.2 Why It's Valid to Substitute s for $\sigma$

We now return to the question raised above. In our context here, we take

$$X_n = \frac{\overline{W} - \mu}{\sigma/\sqrt{n}} \tag{4.54}$$

$$Y_n = \frac{s}{\sigma} \tag{4.55}$$

We know that (4.54) converges in distribution to N(0,1) while (4.55) converges in probability to 1. Thus for large n, we have that

$$\frac{\overline{W} - \mu}{s/\sqrt{n}} \tag{4.56}$$

has an approximate N(0,1) distribution, so that (4.17) is valid.

4.2.14.3 Example: Confidence Interval for a Ratio Estimator

Again consider the example in Section 4.2.3.1 of weights of men and women in Davis, but this time suppose we wish to form a confidence interval for the ratio of the means,

$$\gamma = \frac{\mu_1}{\mu_2} \tag{4.57}$$

Again, the natural estimator is

$$\hat{\gamma} = \frac{\overline{X}}{\overline{Y}} \tag{4.58}$$

How can we construct a confidence interval from this estimator? If it were a linear combination of $\overline{X}$ and $\overline{Y}$ we'd have no problem, since a linear combination of multivariate normal random variables is again normal.

That is not exactly the case here, but it's close. Since $\overline{Y}$ converges in probability to $\mu_2$, Slutsky's Theorem (Section 4.2.14) tells us that the problem here really is one of such a linear combination. We can form a confidence interval for $\mu_1$, then divide both endpoints of the interval by $\overline{Y}$, yielding a confidence interval for $\gamma$.

4.2.15 The Delta Method: Confidence Intervals for General Functions of Means or Proportions (advanced topic)

The delta method is a great way to derive asymptotic distributions of quantities that are functions of random variables whose asymptotic distributions are already known.
4.2.15.1 The Theorem

Theorem 5 Suppose $R_1, \ldots, R_k$ are estimators of $\eta_1, \ldots, \eta_k$ based on a random sample of size n, such that the random vector

$$\sqrt{n}\,(R - \eta) = \sqrt{n}\begin{pmatrix} R_1 - \eta_1 \\ R_2 - \eta_2 \\ \vdots \\ R_k - \eta_k \end{pmatrix} \tag{4.59}$$

has an asymptotically multivariate normal distribution with mean 0 and nonsingular covariance matrix $\Sigma = (\sigma_{ij})$. Let h be a smooth scalar function of k variables, with $h_i$ denoting its i-th partial derivative. Consider the random variable

$$Y = h(R_1, \ldots, R_k) \tag{4.60}$$

Then $\sqrt{n}\,[Y - h(\eta_1, \ldots, \eta_k)]$ converges in distribution to a normal distribution with mean 0 and variance

$$[\nu_1, \ldots, \nu_k]^T\, \Sigma\, [\nu_1, \ldots, \nu_k] \tag{4.61}$$

provided not all of

$$\nu_i = h_i(\eta_1, \ldots, \eta_k),\quad i = 1, \ldots, k \tag{4.62}$$

are 0.

Informally, the theorem says that Y will be approximately normal with mean $h(\eta_1, \ldots, \eta_k)$ and variance 1/n times (4.61). This can be used to form confidence intervals for $h(\eta_1, \ldots, \eta_k)$. Of course, the quantities in (4.61) are typically estimated from the sample.

In other words, our approximate 95% confidence interval for $h(\eta_1, \ldots, \eta_k)$ is

$$h(R_1, \ldots, R_k) \pm 1.96\sqrt{\frac{1}{n}\,[\hat{\nu}_1, \ldots, \hat{\nu}_k]^T\, \hat{\Sigma}\, [\hat{\nu}_1, \ldots, \hat{\nu}_k]} \tag{4.63}$$

Proof

We'll cover the case k = 1 (dropping the subscript 1 for convenience). Recall the Mean Value Theorem from calculus: [12]

$$h(R) = h(\eta) + h'(W)(R - \eta) \tag{4.64}$$

for some W between $\eta$ and R. Rewriting this, we have

$$\sqrt{n}\,[h(R) - h(\eta)] = \sqrt{n}\,h'(W)(R - \eta) \tag{4.65}$$

It can be shown (and should be intuitively plausible to you) that if a sequence of random variables converges in distribution to a constant, the convergence is in probability too. So, $R - \eta$ converges in probability to 0, forcing W to converge in probability to $\eta$. Then from Slutsky's Theorem, the asymptotic distribution of (4.65) is the same as that of $\sqrt{n}\,h'(\eta)(R - \eta)$. The result follows.
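In the k = 1 case treated in the proof, the recipe amounts to multiplying the standard error of R by the absolute value of h' evaluated at R. The following R sketch shows this; the function delta_ci, its arguments and the numbers in the example call are ours, invented for illustration.

```r
# One-variable delta method: approximate confidence interval for h(eta),
# given an estimate r of eta, its standard error se_r, the function h
# and its derivative h_prime
delta_ci <- function(r, se_r, h, h_prime, z = 1.96) {
  se_h <- abs(h_prime(r)) * se_r         # standard error of h(R)
  c(estimate = h(r), lower = h(r) - z * se_h, upper = h(r) + z * se_h)
}

# Made-up example: estimate 4 with standard error 0.2, interval for log(eta)
delta_ci(r = 4, se_r = 0.2, h = log, h_prime = function(t) 1 / t)
```

Here h'(4) = 0.25, so the standard error of log(R) is taken to be 0.05, and the interval is log(4) plus and minus 1.96 times that.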
[12] This is where the delta in the name of the method comes from, an allusion to the fact that derivatives are limits of difference quotients.

4.2.15.2 Example: Square Root Transformation

It used to be common, and to some degree is still common today, for statistical analysts to apply a square-root transformation to Poisson data. The delta method sheds light on the motivation for this.

Consider a random variable X that is Poisson-distributed with mean $\lambda$. Recall from Section 3.8.1.2 that sums of independent Poisson random variables are themselves Poisson distributed. For that reason, X has the same distribution as

$$Y_1 + \ldots + Y_k \tag{4.66}$$

where the $Y_i$ are i.i.d. Poisson random variables each having mean $\lambda/k$. By the Central Limit Theorem, $Y_1 + \ldots + Y_k$ has an approximate normal distribution, with mean and variance $\lambda$. (This is not quite a rigorous argument, as the mean of $Y_i$ depends on k, so our treatment here is informal.)

Now consider $W = \sqrt{X} = \sqrt{Y_1 + \ldots + Y_k}$. Let $h(t) = \sqrt{t}$, so that $h'(t) = 1/(2\sqrt{t})$. The delta method then says that W also has an approximate normal distribution, with asymptotic variance

$$\frac{1}{4\lambda} \cdot \lambda = \frac{1}{4} \tag{4.67}$$

So, the asymptotic variance of $\sqrt{X}$ is a constant, independent of $\lambda$. This becomes relevant in regression analysis, where, as we will discuss in Chapter 6, a classical assumption is that a certain collection of random variables all have the same variance.

4.2.15.3 Example: Confidence Interval for $\sigma^2$

Recall that in Section 4.2.7 we noted that (4.18) is only an approximate confidence interval for the mean. An exact interval is available using the Student t-distribution, if the population is normally distributed. We pointed out that (4.18) is very close to the exact interval for even moderately large n anyway, and since no population is exactly normal, (4.18) is good enough. Note that one of the implications of this and the fact that (4.18) did not assume any particular population distribution is that a Student-t based confidence interval works well even for non-normal populations. We say that the Student-t interval is robust to the normality assumption.
But what about a confidence interval for a variance? Here one can form an exact interval based on the chi-square distribution, if the population is normal. In this case, though, the interval does NOT work well for non-normal populations; it is NOT robust to the normality assumption. So, let's derive an interval that doesn't assume normality; we'll use the delta method. (Warning: This will get a little messy.)

Write

$$\sigma^2 = E(W^2) - (EW)^2 \tag{4.68}$$

and from (4.16) write our estimator of $\sigma^2$ as

$$s^2 = \frac{1}{n}\sum_{i=1}^{n} W_i^2 - \overline{W}^2 = T_2 - T_1^2 \tag{4.69}$$

where $T_1 = \overline{W}$ and $T_2 = \frac{1}{n}\sum_{i=1}^{n} W_i^2$. We are using our old notation, with $W_1, \ldots, W_n$ being a random sample from our population, and with W representing a random variable having the population distribution.

Since $ET_2 = E(W^2)$ and $ET_1 = EW$, we take our function h to be

$$h(u, v) = u - v^2 \tag{4.70}$$

In other words, in the notation of the theorem, $R_1$ is our $T_2$ and $R_2$ is our $T_1$.

We'll need the variances of $T_1$ and $T_2$, and their covariance. We already have their means, as noted above.
We also have the variance of $T_1$, from (4.9):

$$Var(T_1) = \frac{1}{n}\,Var(W) \tag{4.71}$$

Now for the variance of $T_2$: Using (4.9) but on $W^2$ instead of W, we have

$$Var(T_2) = \frac{1}{n}\,Var(W^2) = \frac{1}{n}\left[E(W^4) - (E(W^2))^2\right] \tag{4.72}$$

Now for the covariance:

$$Cov(T_1, T_2) = \frac{1}{n^2}\sum_{i=1}^{n} Cov(W_i, W_i^2) = \frac{1}{n^2}\,n\,Cov(W, W^2) \tag{4.73}$$

But from the famous formula for covariance,

$$Cov(W, W^2) = E(W^3) - EW \cdot E(W^2) \tag{4.74}$$

To summarize:

$$Var(T_1) = \frac{1}{n}\left[E(W^2) - (EW)^2\right] \tag{4.75}$$

$$Var(T_2) = \frac{1}{n}\left[E(W^4) - (E(W^2))^2\right] \tag{4.76}$$

$$Cov(T_1, T_2) = \frac{1}{n}\left[E(W^3) - EW \cdot E(W^2)\right] \tag{4.77}$$

Also, the gradient of h and its value at the means of $(T_2, T_1)$ are

$$h'(u, v) = (1, -2v)^T,\qquad h'(ET_2, ET_1) = (1, -2\,ET_1)^T = (1, -2\,EW)^T \tag{4.78}$$

The asymptotic variance to use in our confidence interval for $\sigma^2$ is seen in (4.61) to be

$$\frac{1}{n}\left[E(W^4) - (E(W^2))^2 + 4(EW)^2\{E(W^2) - (EW)^2\} - 4\,EW\{E(W^3) - EW \cdot E(W^2)\}\right] \tag{4.79}$$

Now, we do not know the value of $E(W^m)$ here, m = 1,2,3,4. So, we estimate $E(W^m)$ as

$$\frac{1}{n}\sum_{i=1}^{n} W_i^m \tag{4.80}$$

Our confidence interval is then $s^2$ plus and minus 1.96 times the square root of this quantity.

It should be noted, though, that estimating means of higher powers of a random variable requires larger samples in order to achieve comparable accuracy. Our confidence interval here may need a rather large sample to be accurate, as opposed to the situation with (4.18), in which even n = 20 should work well.

4.2.16 Simultaneous Confidence Intervals

Suppose in our study of heights, weights and so on of people in Davis, we are interested in estimating a number of different quantities, with our forming a confidence interval for each one. Though our confidence level for each one of them will be 95%, our overall confidence level will be less than that. In other words, we cannot say we are 95% confident that all the intervals contain their respective population values.
In some cases we may wish to construct confidence intervals in such a way that we can say we are 95% confident that all the intervals are correct. This branch of statistics is known as simultaneous inference or multiple inference.

Usually this kind of methodology is used in the comparison of several treatments. Though this term originated in the life sciences, e.g. comparing the effectiveness of several different medications for controlling hypertension, it can be applied in any context. For instance, we might be interested in comparing how well programmers do in several different programming languages, say Python, Ruby and Perl. We'd form three groups of programmers, one for each language, with say 20 programmers per group. Then we would have them write code for a given application. Our measurement could be the length of time T that it takes for them to develop the program to the point at which it runs correctly on a suite of test cases.

Let $T_{ij}$ be the value of T for the j-th programmer in the i-th group, i = 1,2,3, j = 1,2,...,20. We would then wish to compare the three treatments, i.e. programming languages, by estimating $\mu_i = ET_{i1}$, i = 1,2,3. Our estimators would be $U_i = \sum_{j=1}^{20} T_{ij}/20$, i = 1,2,3. Since we are comparing the three population means, we may not be satisfied with simply forming ordinary 95% confidence intervals for each mean. We may wish to form confidence intervals which jointly have confidence level 95%. [13]

[13] The word may is important here. It really is a matter of philosophy as to whether one uses simultaneous inference procedures.

Note very, very carefully what this means. As usual, think of our notebook idea. Each line of the notebook would contain the 60 observations; different lines would involve different sets of 60 people. So, there would be 60 columns for the raw data, and three columns for the $U_i$. We would also have six more columns for the confidence intervals (lower and upper bounds) for the $\mu_i$. Finally, imagine three more columns, one for each confidence interval, with the entry for each being either Right or Wrong. A confidence interval is labeled Right if it really does contain its target population value, and otherwise is labeled Wrong.
Now, if we construct individual 95% confidence intervals, that means that in a given Right/Wrong column, in the long run 95% of the entries will say Right. But for simultaneous intervals, we hope that within a line we see three Rights, and 95% of all lines will have that property.

In our context here, if we set up our three intervals to have individual confidence levels of 95%, their simultaneous level will be $0.95^3 = 0.86$, since the three confidence intervals are independent. Conversely, if we want a simultaneous level of 0.95, we could take each one at a 98.3% level, since $0.95^{1/3} \approx 0.983$.

However, in general the intervals we wish to form will not be independent, so the above cube root method would not work. Here we will give a short introduction to more general procedures.

Note that nothing in life is free. If we want simultaneous confidence intervals, they will be wider.

4.2.16.1 The Bonferroni Method

One simple approach is Bonferroni's Inequality:

Lemma 6 Suppose $A_1, \ldots, A_g$ are events. Then

$$P(A_1 \text{ or } \ldots \text{ or } A_g) \le \sum_{i=1}^{g} P(A_i) \tag{4.81}$$

You can easily see this for g = 2:

$$P(A_1 \text{ or } A_2) = P(A_1) + P(A_2) - P(A_1 \text{ and } A_2) \le P(A_1) + P(A_2) \tag{4.82}$$

One can then prove the general case by mathematical induction.
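Returning for a moment to the independent case discussed before the lemma, the cube-root calculation extends to any number g of independent intervals. The short R sketch below, with variable names of our own choosing, tabulates both directions of that calculation.

```r
g <- 1:5                           # number of independent intervals

# Joint confidence level if each interval is formed at the 95% level
joint_level <- 0.95^g

# Individual level each interval needs for a joint level of 95%
needed_level <- 0.95^(1 / g)

round(cbind(g, joint_level, needed_level), 3)
```

The row for g = 3 shows the 0.86 and 0.983 figures quoted above; the joint level drops quickly as more intervals are added.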
Now to apply this to forming simultaneous confidence intervals, take $A_i$ to be the event that the i-th confidence interval is incorrect, i.e. fails to include the population quantity being estimated. Then (4.81) says that if, say, we form two confidence intervals, each having individual confidence level (100-5/2)%, i.e. 97.5%, then the overall collective confidence level for those two intervals is at least 95%. Here's why: Let $A_1$ be the event that the first interval is wrong, and $A_2$ be the corresponding event for the second interval. Then

$$\text{overall conf. level} = P(\text{not } A_1 \text{ and not } A_2) \tag{4.83}$$

$$= 1 - P(A_1 \text{ or } A_2) \tag{4.84}$$

$$\ge 1 - P(A_1) - P(A_2) \tag{4.85}$$

$$= 1 - 0.025 - 0.025 \tag{4.86}$$

$$= 0.95 \tag{4.87}$$

4.2.16.2 Scheffé's Method (advanced topic)

The Bonferroni method is unsuitable for more than a few intervals; each one would have to have such a high individual confidence level that the intervals would be very wide. Many alternatives exist, a famous one being Scheffé's method. [14]

Theorem 7 Suppose $R_1, \ldots, R_k$ have an approximately multivariate normal distribution, with mean vector $\mu = (\mu_i)$ and covariance matrix $\Sigma = (\sigma_{ij})$. Let $\hat{\Sigma}$ be a consistent estimator of $\Sigma$.

For any constants $c_1, \ldots, c_k$, consider linear combinations of the $R_i$,

$$\sum_{i=1}^{k} c_i R_i \tag{4.88}$$

which estimate

$$\sum_{i=1}^{k} c_i \mu_i \tag{4.89}$$

Form the confidence intervals

$$\sum_{i=1}^{k} c_i R_i \pm \sqrt{\chi^2_{\alpha;k}}\ s(c_1, \ldots, c_k) \tag{4.90}$$

where

$$[s(c_1, \ldots, c_k)]^2 = (c_1, \ldots, c_k)^T\, \hat{\Sigma}\, (c_1, \ldots, c_k) \tag{4.91}$$

and where $\chi^2_{\alpha;k}$ is the upper-$\alpha$ percentile of a chi-square distribution with k degrees of freedom. [15]

Then all of these intervals (for infinitely many values of the $c_i$!) have simultaneous confidence level $1 - \alpha$.

By the way, if we are interested in only constructing confidence intervals for contrasts, i.e. $c_i$ having the property that $\sum_i c_i = 0$, then the number of degrees of freedom reduces to k-1, thus producing narrower intervals.

Just as in Section 4.2.7 we avoided the t-distribution, here we have avoided the F distribution, which is used instead of chi-square in the exact form of Scheffé's method.
[14] The name is pronounced sheh-FAY.

[15] Recall that the distribution of the sum of squares of g independent N(0,1) random variables is called chi-square with g degrees of freedom. It is tabulated in the R statistical package's function qchisq().

4.2.16.3 Example

For example, again consider the Davis heights example in Section 4.2.11. Suppose we want to find approximate 95% confidence intervals for two population quantities, $\mu_1$ and $\mu_2$. These correspond to values of $(c_1, c_2)$ of (1,0) and (0,1). Since the two samples are independent, $\sigma_{12} = 0$. The chi-square value for two degrees of freedom is 5.99, [16] whose square root is 2.45. So, we would compute (4.18) for $\overline{X}$ and then for $\overline{Y}$, but would use 2.45 instead of 1.96.

This actually is not as good as Bonferroni in this case. For Bonferroni, we would find two 97.5% confidence intervals, which would use 2.24 instead of 1.96.

Scheffé's method is too conservative if we just are forming a small number of intervals, but it is great if we form a lot of them. Moreover, it is very general, usable whenever we have a set of approximately normal estimators.

4.2.16.4 Other Methods for Simultaneous Inference

There are many other methods for simultaneous inference. It should be noted, though, that many of them are limited in scope, in contrast to Scheffé's method, which is usable whenever one has multivariate normal estimators, and Bonferroni's method, which is universally usable.

4.2.17 The Bootstrap Method for Forming Confidence Intervals (advanced topic)

Many statistical applications can be quite complex, which makes them very difficult to analyze mathematically. Fortunately, there is a fairly general method for finding confidence intervals called the bootstrap. Here is a very brief overview.

Say we are estimating some population value $\theta$ based on i.i.d. random variables $Q_i$, i = 1,...,n. Say our estimator is $\hat{\theta}$. Then we draw k new samples of size n, by drawing values with replacement from the $Q_i$. For each sample, we recompute $\hat{\theta}$, giving us values $\tilde{\theta}_i$, i = 1,...,k. We sort these latter values and find the 0.025 and 0.975 quantiles, i.e. the 2.5% and 97.5% points of the values $\tilde{\theta}_i$, i = 1,...,k. These two points form our confidence interval for $\theta$.

R's boot package includes the boot() function to do the mechanics of this for us.
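The recipe is short enough that it can also be sketched directly in base R, without the boot package. The function name boot_percentile_ci, the default of 2000 resamples and the simulated exponential data are all our own choices for illustration.

```r
# Percentile bootstrap interval: resample the data with replacement k times,
# recompute the estimator each time, and take the middle 95% of the results
boot_percentile_ci <- function(q, estimator, k = 2000, level = 0.95) {
  n <- length(q)
  est <- replicate(k, estimator(sample(q, n, replace = TRUE)))
  alpha <- 1 - level
  quantile(est, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(1)                          # for reproducibility of this sketch
q <- rexp(50, rate = 0.5)            # made-up data: 50 exponential lifetimes
ci <- boot_percentile_ci(q, median)  # interval for the population median
ci
```

The median is used here because its sampling distribution is awkward to derive by hand, which is the kind of situation the bootstrap is meant for.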
[16] Obtained from R via qchisq(0.95,2).

4.3 Hypothesis Testing

4.3.1 The Basics

Suppose you have a coin which you want to assess for fairness. Let p be the probability of heads for the coin. You could toss the coin, say, 100 times, and then form a confidence interval for p using (4.32). The width of the interval would tell you whether 100 tosses was enough for the accuracy you want, and the location of the interval would tell you whether the coin is fair enough.

For instance, if your interval were (0.49,0.54), you might feel satisfied that this coin is reasonably fair. In fact, note carefully that even if the interval were, say, (0.502,0.506), you would still consider the coin to be reasonably fair.

Unfortunately, this entire process would be counter to the traditional usage of statistics. Most users of statistics would use the toss data to test the null hypothesis

$$H_0: p = 0.5 \tag{4.92}$$

against the alternate hypothesis

$$H_A: p \ne 0.5 \tag{4.93}$$

The approach is to consider $H_0$ innocent until proven guilty. We form the test statistic

$$Z = \frac{\hat{p} - 0.5}{\sqrt{\frac{1}{n}\,\hat{p}(1 - \hat{p})}} \tag{4.94}$$

Under $H_0$ the random variable Z would have an approximate N(0,1) distribution. The basic idea is that if Z turns out to have a value which is rare for that distribution, we say, Rather than believe we've observed a rare event, we choose instead to abandon our assumption that $H_0$ is true.

So, what do we take for our cutoff value for rareness? This probability is called the significance level, denoted by $\alpha$. The classical value for $\alpha$ is 0.05. If $H_0$ were true, Z would have an approximate N(0,1) distribution, and thus would be less than -1.96 or greater than 1.96 only 5% of the time, a rare event.

So, if Z does stray that far (i.e. 1.96 or more in either direction) from 0, we reject $H_0$, and decide that $p \ne 0.5$. We say, The value of p is significantly different from 0.5; more on this below, as it is NOT what it sounds like.

Let X be the number of heads we get from our 100 tosses. Note that our rule for decision making formulated above is equivalent (do the algebra to see this for yourself) to saying that we will accept $H_0$ if 40 < X < 60 and reject it otherwise.
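Instead of doing the algebra, one can let R evaluate the test statistic (4.94) at every possible head count. This is only a sketch, with variable names of our own choosing.

```r
n <- 100
x <- 0:n                                   # every possible number of heads
phat <- x / n

# Test statistic (4.94); at x = 0 and x = n the denominator is 0,
# which gives -Inf and Inf, so those outcomes count as rejections
z <- (phat - 0.5) / sqrt(phat * (1 - phat) / n)

accept <- x[abs(z) < 1.96]                 # head counts for which H0 is accepted
range(accept)
```

The accepted counts form the unbroken run from 41 through 59, i.e. exactly the values with 40 < X < 60.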
4.3.2 General Testing Based on Normally Distributed Estimators

Suppose $\hat{\theta}$ is an approximately normally distributed estimator of some population value $\theta$. Then to test $H_0: \theta = c$, form the test statistic

$$Z = \frac{\hat{\theta} - c}{s.e.(\hat{\theta})} \tag{4.95}$$

where $s.e.(\hat{\theta})$ is the standard error of $\hat{\theta}$, and proceed as before:

Reject $H_0: \theta = c$ at the significance level of $\alpha = 0.05$ if $|Z| \ge 1.96$.

4.3.3 Example: Network Security

Let's look at the network security example in Section 4.2.11.1 again. Here $\hat{\theta} = \overline{X} - \overline{Y}$, and c is presumably 0 (depending on the goals of Mano et al). If you review the material leading up to (4.36), you'll see that

$$s.e.(\overline{X} - \overline{Y}) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \tag{4.96}$$

In that example, we found that the standard error was 0.61. So, our test statistic (4.95) is

$$Z = \frac{\overline{X} - \overline{Y} - 0}{0.61} = \frac{11.52 - 2.00}{0.61} = 15.61 \tag{4.97}$$

This is definitely larger in absolute value than 1.96, so we reject $H_0$, and conclude that the population mean round-trip times are different in the wired and wireless cases.
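The general recipe of Section 4.3.2 fits in a few lines of R. The function z_test below is our own illustrative helper, not a built-in.

```r
# Generic test of H0: theta = c0 based on an approximately normal estimator,
# as in (4.95); returns the statistic and the reject/accept decision
z_test <- function(estimate, se, c0 = 0, alpha = 0.05) {
  z <- (estimate - c0) / se
  list(z = z, reject = abs(z) >= qnorm(1 - alpha / 2))
}

# Network example: wireless minus wired, standard error from (4.96)
se <- sqrt(9.939^2 / 344 + 6.299^2 / 436)
res <- z_test(11.520 - 2.000, se)
res
```

Using the unrounded standard error gives a Z value of about 15.5 rather than the 15.61 obtained with the rounded value 0.61; the conclusion is of course the same.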
4.3.4 The Notion of p-Values

In that example above, the Z value, 15.61, was far larger than the cutoff for rejection of $H_0$, 1.96. You might say that we resoundingly rejected $H_0$. When data analysts encounter such a situation, they want to indicate it in their reports. This is done through something called the observed significance level, more often called the p-value.

To illustrate this, let's look at a somewhat milder case, in which Z = 2.14. By checking a table of the N(0,1) distribution, or say by calling pnorm(2.14) in R and subtracting the result from 1, we would find that the N(0,1) distribution has area 0.016 to the right of 2.14, and of course an equal area to the left of -2.14. In other words, in the general formulation in Section 4.3.2, we would be able to reject $H_0$ even at the much more stringent significance level of 0.032 instead of 0.05. This would be a stronger statement, and in the research community it is customary to say, The p-value was 0.032.

In our example above in which Z was 15.61, the value is literally off the chart; pnorm(15.61) returns a value of 1. Of course, it's a tiny bit less than 1, but it is so far out in the right tail of the N(0,1) distribution that the area to the right is essentially 0. So, this would be treated as very, very highly significant.

If many tests are performed and are summarized in a table, it is customary to denote the ones with small p-values by asterisks. This is generally one asterisk for p under 0.05, two for p less than 0.01, three for 0.001, etc. The more asterisks, the more significant the data is supposed to be. Well, that's a common interpretation, but careful analysts know it to be misleading, as we will now discuss.

4.3.5 What's Random and What Is Not

It is crucial to keep in mind that $H_0$ is not an event or any other kind of random entity. This coin either has p = 0.5 or it doesn't. If we repeat the experiment, we will get a different value of X, but p doesn't change. So for example, it would be wrong and meaningless to speak of the probability that $H_0$ is true.
Similarly, it would be wrong and meaningless to write $0.05 = P(|Z| > 1.96 \mid H_0)$, again because $H_0$ is not an event and this kind of conditional probability would not make sense. What is customarily written is something like

$$0.05 = P_{H_0}(|Z| > 1.96) \tag{4.98}$$

This is read aloud as the probability that |Z| is larger than 1.96 under $H_0$, with the phrase under $H_0$ referring to the probability measure in the case in which $H_0$ is true.

4.3.6 One-Sided $H_A$

Suppose that somehow we are sure that our coin in the example above is either fair or it is more heavily weighted towards heads. Then we would take our alternate hypothesis to be

$$H_A: p > 0.5 \tag{4.99}$$

A rare event which could make us abandon our belief in $H_0$ would now be if Z in (4.94) is very large in the positive direction. So, with $\alpha = 0.05$, our rule would now be to reject $H_0$ if Z > 1.65.

The same would be the case if our null hypothesis were

$$H_0: p \le 0.5 \tag{4.100}$$

instead of

$$H_0: p = 0.5 \tag{4.101}$$

Then (4.98) would change to

$$0.05 \ge P_{H_0}(Z > 1.65) \tag{4.102}$$

4.3.7 Exact Tests

Remember, the tests we've seen so far are all approximate. In (4.94), for instance, $\hat{p}$ had an approximate normal distribution, so that the distribution of Z was approximately N(0,1). Thus the significance level $\alpha$ was approximate, as were the p-values and so on. [17]

[17] Another class of probabilities which would be approximate would be the power values. These are the probabilities of rejecting $H_0$ if the latter is not true. We would speak, for instance, of the power of our test at p = 0.55, meaning the chances that we would reject the null hypothesis if the true population value of p were 0.55.

But the only reason our tests were approximate is that we only had the approximate distribution of our test statistic Z, or equivalently, we only had the approximate distribution of our estimator, e.g. $\hat{p}$. If we have an exact distribution to work with, then we can perform an exact test.

Let's consider the coin example again. To keep things simple, let's suppose we toss the coin 10 times. We will make our decision based on X, the number of heads out of 10 tosses. Suppose we set our threshold for strong evidence against $H_0$ to be 8 heads, i.e. we will reject $H_0$ if $X \ge 8$. What will $\alpha$ be?
$$\alpha = \sum_{i=8}^{10} P(X = i) = \sum_{i=8}^{10} \binom{10}{i} \left(\frac{1}{2}\right)^{10} = 0.055 \tag{4.103}$$

That's not 0.05. Clearly we cannot get an exact significance level of 0.05, [18] but our $\alpha$ is exactly 0.055.

[18] Actually, it could be done by introducing some randomization to our test.

Of course, if you are willing to assume that you are sampling from a normally-distributed population, then the Student-t test is nominally exact. The R function t.test() performs this operation.

As another example, suppose lifetimes of lightbulbs are exponentially distributed with mean $\mu$. In the past, $\mu = 1000$, but there is a claim that the new lightbulbs are improved and $\mu > 1000$. To test that claim, we will sample 10 lightbulbs, getting lifetimes $X_1, \ldots, X_{10}$, and compute the sample mean $\overline{X}$. We will then perform a hypothesis test of

$$H_0: \mu = 1000 \tag{4.104}$$

vs.

$$H_A: \mu > 1000 \tag{4.105}$$

It is natural to have our test take the form in which we reject $H_0$ if

$$\overline{X} > w \tag{4.106}$$

for some constant w chosen so that

$$P(\overline{X} > w) = 0.05 \tag{4.107}$$

under $H_0$. Suppose we want an exact test, not one based on a normal approximation.

Recall that $10\overline{X}$, the sum of the $X_i$, has a gamma distribution, with r = 10 and $\lambda = 0.001$. So, we can find the w for which $P(\overline{X} > w) = 0.05$ by using R's qgamma():

    > qgamma(.95,10,0.001)
    [1] 15705.22

So, we reject $H_0$ if our sample mean is larger than 1570.5.

4.3.8 What's Wrong with Hypothesis Testing

Hypothesis testing is a time-honored approach, used by tens of thousands of people every day. But it is "wrong." I use the quotation marks here because, although hypothesis testing is mathematically correct, it is at best noninformative and at worst seriously misleading.

To begin with, it's absurd to test $H_0$ in the first place. No coin is absolutely perfectly balanced, with p = 0.5000000000000000000000000000... We know that before even collecting any data.

But much worse is this word "significant." Say our coin actually has p = 0.502. From anyone's point of view, that's a fair coin! But look what happens in (4.94) as the sample size n grows. If we have a large enough sample, eventually the denominator in (4.94) will be small enough, and $\hat{p}$ will be close enough to 0.502, that Z will be larger than 1.96 and we will declare that p is "significantly" different from 0.5. But it isn't! Yes, p is different from 0.5, but NOT in any significant sense.
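To see how large a sample it takes for this to happen, the R sketch below evaluates (4.94) with $\hat{p}$ held at 0.502 for a range of sample sizes. Fixing $\hat{p}$ at exactly the true p is a simplification we make for illustration, and the function name z_at is ours.

```r
# Z from (4.94) when the sample proportion equals 0.502, as a function of n
z_at <- function(n, phat = 0.502) {
  (phat - 0.5) / sqrt(phat * (1 - phat) / n)
}

n <- 10^(2:7)
data.frame(n = n, z = round(z_at(n), 2), rejects = abs(z_at(n)) > 1.96)
```

With a hundred thousand tosses Z is still only about 1.3, but with a million tosses it is about 4, and the test declares this practically fair coin to be significantly unfair.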
This is especially a problem in computer science applications of statistics, because they often use very large data sets. A data mining application, for instance, may consist of hundreds of thousands of retail purchases. The same is true for data on visits to a Web site, network traffic data and so on. In all of these, the standard use of hypothesis testing can result in our pouncing on very small differences that are quite insignificant to us, yet will be declared "significant" by the test.

Conversely, if our sample is too small, we can miss a difference that actually is significant (i.e. important to us), and we would declare that p is NOT significantly different from 0.5.

In summary, the two basic problems with hypothesis testing are

• $H_0$ is improperly specified. What we are really interested in here is whether p is near 0.5, not whether it is exactly 0.5 (which we know is not the case anyway).

• Use of the word significant is grossly improper (or, if you wish, grossly misinterpreted).

Hypothesis testing forms the very core usage of statistics, yet you can now see that it is, as I said above, at best noninformative and at worst seriously misleading. This is widely recognized by thinking statisticians. For instance, see http://www.indiana.edu/~stigtsts/quotsagn.html for a nice collection of quotes from famous statisticians on this point. There is an entire chapter devoted to this issue in one of the best-selling elementary statistics textbooks in the nation.[19] But the practice of hypothesis testing is too deeply entrenched for things to have any prospect of changing.

4.3.9 What to Do Instead

In the coin example, we could set limits of fairness, say require that p be no more than 0.01 from 0.5 in order to consider it fair. We could then test the hypothesis

$$H_0: 0.49 \le p \le 0.51 \tag{4.108}$$

[19] Statistics, third edition, by David Freedman, Robert Pisani, Roger Purves, pub. by W.W. Norton, 1997.

Such an approach is almost never used in practice, as it is somewhat difficult to use and explain. But even more importantly, what if the true value of p were, say, 0.51001? Would we still really want to reject the coin in such a scenario?
Note carefully that I am not saying that we should not make a decision. We do have to decide, e.g. decide whether a new hypertension drug is safe or in this case decide whether this coin is "fair" enough for practical purposes, say for determining which team gets the kickoff in the Super Bowl. But it should be an informed decision, and even testing the modified $H_0$ above would be much less informative than a confidence interval.

Forming a confidence interval is the far superior approach. The width of the interval shows us whether n is large enough for $\hat{p}$ to be reasonably accurate, and the location of the interval tells us whether the coin is fair enough for our purposes.

Note that in making such a decision, we do NOT simply check whether 0.5 is in the interval. That would make the confidence interval reduce to a hypothesis test, which is what we are trying to avoid. If for example the interval is (0.502, 0.505), we would probably be quite satisfied that the coin is fair enough for our purposes, even though 0.5 is not in the interval.

Hypothesis testing is also used for model building, such as for predictor variable selection in regression analysis (a method to be covered in a later unit). The problem is even worse there, because there is no reason to use $\alpha = 0.05$ as the cutoff point for selecting a variable. In fact, even if one uses hypothesis testing for this purpose (again, very questionable), some studies have found that the best values of $\alpha$ for this kind of application are in the range 0.25 to 0.40.

In model building, we still can and should use confidence intervals. However, it does take more work to do so. We will return to this point in our unit on modeling, Chapter 5.

4.3.10 Decide on the Basis of "the Preponderance of Evidence"

In the movies, you see stories of murder trials in which the accused must be "proven guilty beyond the shadow of a doubt." But in most noncriminal trials, the standard of proof is considerably lighter, preponderance of evidence. This is the standard you must use when making decisions based on statistical data. Such data cannot "prove" anything in a mathematical sense. Instead, it should be taken merely as evidence.
The width of the confidence interval tells us the likely accuracy of that evidence. We must then weigh that evidence against other information we have about the subject being studied, and then ultimately make a decision on the basis of the preponderance of all the evidence.

Yes, juries must make a decision. But they don't base their verdict on some formula. Similarly, you the data analyst should not base your decision on the blind application of a method that is usually of little relevance to the problem at hand: hypothesis testing.

4.4 General Methods of Estimation

In the preceding sections, we often referred to certain estimators as being "natural." For example, if we are estimating a population mean, an obvious choice of estimator would be the sample mean. But in many applications, it is less clear what a "natural" estimate for a parameter of interest would be.[20] We will present general methods for estimation in this section.

4.4.1 Example: Guessing the Number of Raffle Tickets Sold

You've just bought a raffle ticket, and find that you have ticket number 68. You check with a couple of friends, and find that their numbers are 46 and 79. Let c be the total number of tickets. How should we estimate c, using our data 68, 46 and 79?

It is reasonable to assume that each of the three of you is equally likely to get assigned any of the numbers 1,2,...,c. In other words, the numbers we get, $X_i$, i = 1,2,3, are uniformly distributed on the set {1,2,...,c}. We can also assume that they are independent; that's not exactly true, since we are sampling without replacement, but for large c (or better stated, for n/c small) it's close enough.

So, we are assuming that the $X_i$ are independent and identically distributed (famously written as i.i.d. in the statistics world) on the set {1,2,...,c}. How do we use the $X_i$ to estimate c?
4.4.2 Method of Moments

One approach, an intuitive one, would be to reason as follows. Note first that

$$E(X) = \frac{c+1}{2} \tag{4.109}$$

Let's solve for c:

$$c = 2EX - 1 \tag{4.110}$$

We know that we can use

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{4.111}$$

to estimate EX, so by (4.110), $2\bar{X} - 1$ is an estimate of c. Thus we take our estimator for c to be

$$\hat{c} = 2\bar{X} - 1 \tag{4.112}$$

This estimator is called the Method of Moments estimator of c.

Let's step back and review what we did:

• We wrote our parameter as a function of the population mean EX of our data item X. Here, that resulted in (4.110).

• In that function, we substituted our sample mean $\bar{X}$ for EX, and substituted our estimator $\hat{c}$ for the parameter c, yielding (4.112). We then solved for our estimator.

[20] Recall from Section 4.2.10 that we are now using the term parameter to mean any population quantity, rather than an index into a parametric family of distributions.

We say that an estimator $\hat{\theta}$ of some parameter $\theta$ is consistent if

$$\lim_{n \to \infty} \hat{\theta} = \theta \tag{4.113}$$

where n is the sample size. In other words, as the sample size grows, the estimator eventually converges to the true population value.

Of course here $\bar{X}$ is a consistent estimator of EX. Thus you can see from (4.110) and (4.112) that $\hat{c}$ is a consistent estimator of c. In other words, the Method of Moments generally gives us consistent estimators.

What if we have more than one parameter to estimate, say $\theta_1, \ldots, \theta_r$? We generalize what we did above. To see how, recall that $E(X^i)$ is called the i-th moment of X;[21] let's denote it by $\eta_i$. Also, note that although we derived (4.110) by solving (4.109) for c, we did start with (4.109). So we do the following:

• For i = 1,...,r we write $\eta_i$ as a function $g_i$ of all the $\theta_k$.

• For i = 1,...,r set

$$\hat{\eta}_i = \frac{1}{n} \sum_{j=1}^{n} X_j^i \tag{4.114}$$

• Substitute the $\hat{\theta}_k$ in the $g_i$ and then solve for them.

In the above example with the raffle, we had r = 1, $\theta_1 = c$, $g_1(c) = (c+1)/2$ and so on. A two-parameter example will be given below.
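As a small illustration of (4.112), the sketch below applies the Method of Moments estimator to the three raffle numbers, and then tries it on simulated samples of growing size to suggest the consistency property. The function name mm_est and the "true" value ctrue = 500 are invented for this sketch; the simulated tickets are drawn with replacement to match the i.i.d. assumption.

```r
# Raffle data from the example: our ticket and two friends' tickets
x <- c(68, 46, 79)

# Method of Moments estimator (4.112): chat = 2 * Xbar - 1
mm_est <- function(x) 2 * mean(x) - 1
chat <- mm_est(x)
print(chat)

# Consistency check by simulation; ctrue = 500 is an arbitrary, hypothetical value
set.seed(1)
ctrue <- 500
est <- sapply(c(10, 1000, 100000), function(n)
  mm_est(sample(1:ctrue, n, replace = TRUE)))
print(est)
```

With the raffle data the estimate is 2(193/3) - 1, a bit under 128. In the simulation the estimates should settle near ctrue as n grows, though any single run is of course random.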
4.4.3 Method of Maximum Likelihood

Another method, much more commonly used, is called the Method of Maximum Likelihood. In our example above, it means asking the question, "What value of c would have made our data (68, 46, 79) most likely to happen?" Well, let's find what is called the likelihood, i.e. the probability of our particular data values occurring:

$$L = P(X_1 = 68, X_2 = 46, X_3 = 79) = \begin{cases} \left(\frac{1}{c}\right)^3, & \text{if } c \ge 79 \\ 0, & \text{otherwise} \end{cases} \tag{4.115}$$

[21] Hence the name, Method of Moments.

Now keep in mind that c is a fixed, though unknown constant. It is not a random variable. What we are doing here is just asking "What if" questions, e.g. "If c were 85, how likely would our data be? What about c = 91?"

Well then, what value of c maximizes (4.115)? Clearly, it is c = 79. Any smaller value of c gives us a likelihood of 0. And for c larger than 79, the larger c is, the smaller (4.115) is. So, our maximum likelihood estimator (MLE) is 79. In general, if our sample size in this problem were n, our MLE for c would be

$$\check{c} = \max_i X_i \tag{4.116}$$

4.4.4 Example: Estimating the Parameters of a Gamma Distribution

As another example, suppose we have a random sample $X_1, \ldots, X_n$ from a gamma distribution,

$$f_X(t) = \frac{1}{\Gamma(c)} \lambda^c t^{c-1} e^{-\lambda t}, \quad t > 0 \tag{4.117}$$

for some unknown c and $\lambda$. How do we estimate c and $\lambda$ from the $X_i$?
4.4.4.1 Method of Moments

Let's try the Method of Moments, as follows. We have two population parameters to estimate, c and $\lambda$, so we need to involve two moments of X. That could be EX and $E(X^2)$, but here it would more conveniently be EX and Var(X). We know from our previous unit on continuous random variables, Chapter 2, that

$$EX = \frac{c}{\lambda} \tag{4.118}$$

$$Var(X) = \frac{c}{\lambda^2} \tag{4.119}$$

In our earlier notation, this would be r = 2, $\theta_1 = c$, $\theta_2 = \lambda$ and $g_1(c, \lambda) = c/\lambda$ and $g_2(c, \lambda) = c/\lambda^2$.

Switching to sample analogs and estimates, we have

$$\frac{\hat{c}}{\hat{\lambda}} = \bar{X} \tag{4.120}$$

$$\frac{\hat{c}}{\hat{\lambda}^2} = s^2 \tag{4.121}$$

Dividing the two quantities yields

$$\hat{\lambda} = \frac{\bar{X}}{s^2} \tag{4.122}$$

which then gives

$$\hat{c} = \frac{\bar{X}^2}{s^2} \tag{4.123}$$

4.4.4.2 MLEs

What about the MLEs of c and $\lambda$? Remember, the $X_i$ are continuous random variables, so the likelihood function, i.e. the analog of (4.115), is the product of the density values:

$$L = \prod_{i=1}^{n} \frac{1}{\Gamma(c)} \lambda^c X_i^{c-1} e^{-\lambda X_i} \tag{4.124}$$

In general, it is usually easier to maximize the log likelihood (and maximizing this is the same as maximizing the original likelihood):

$$l = (c-1) \sum_{i=1}^{n} \ln(X_i) - \lambda \sum_{i=1}^{n} X_i + nc \ln(\lambda) - n \ln(\Gamma(c)) \tag{4.125}$$

One then takes the partial derivatives of (4.125) with respect to c and $\lambda$, and sets them to 0. The solution values, $\check{c}$ and $\check{\lambda}$, are then the MLEs of c and $\lambda$. Unfortunately, these equations do not have closed-form solutions, so they must be solved numerically.

4.4.5 More Examples

Suppose $f_W(t) = c t^{c-1}$ for t in (0,1), with the density being 0 elsewhere, for some unknown $c > 0$. We have a random sample $W_1, \ldots, W_n$ from this density.

Let's find the Method of Moments estimator.

$$EW = \int_0^1 t \, c t^{c-1} \, dt = \frac{c}{c+1} \tag{4.126}$$

So, set

$$\bar{W} = \frac{\hat{c}}{\hat{c}+1} \tag{4.127}$$

yielding

$$\hat{c} = \frac{\bar{W}}{1 - \bar{W}} \tag{4.128}$$

What about the MLE?
$$L = \prod_{i=1}^{n} c W_i^{c-1} \tag{4.129}$$

so

$$l = n \ln c + (c-1) \sum_{i=1}^{n} \ln W_i \tag{4.130}$$

Then set

$$0 = \frac{n}{\hat{c}} + \sum_{i=1}^{n} \ln W_i \tag{4.131}$$

and thus

$$\hat{c} = -\frac{1}{\frac{1}{n} \sum_{i=1}^{n} \ln W_i} \tag{4.132}$$

As in Section 4.4.3, not every MLE can be determined by taking derivatives. Consider a continuous analog of the example in that section, with $f_W(t) = \frac{1}{c}$ on (0,c), 0 elsewhere, for some $c > 0$.

The likelihood is

$$\left(\frac{1}{c}\right)^n \tag{4.133}$$

as long as

$$c \ge \max_i W_i \tag{4.134}$$

and is 0 otherwise. So,

$$\hat{c} = \max_i W_i \tag{4.135}$$

as before.

Let's find the bias of this estimator.

The bias is $E\hat{c} - c$. To get $E\hat{c}$ we need the density of that estimator, which we get as follows:

$$P(\hat{c} \le t) = P(\text{all } W_i \le t) \quad \text{(definition)} \tag{4.136}$$

$$= \left(\frac{t}{c}\right)^n \quad \text{(density of } W_i\text{)} \tag{4.137}$$

So,

$$f_{\hat{c}}(t) = \frac{n}{c^n} t^{n-1} \tag{4.138}$$

Integrating against t, we find that

$$E\hat{c} = \frac{n}{n+1} c \tag{4.139}$$

So the bias is $-c/(n+1)$, i.e. the estimator falls short of c by $c/(n+1)$ on average, not bad at all.

4.4.6 What About Confidence Intervals?

Usually we are not satisfied with simply forming estimates (called point estimates). We also want some indication of how accurate these estimates are, in the form of confidence intervals (interval estimates).

In many special cases, finding confidence intervals can be done easily on an ad hoc basis. Look, for instance, at the Method of Moments Estimator in Section 4.4.2. Our estimator (4.112) is a linear function of $\bar{X}$, so we easily obtain a confidence interval for c from one for EX.

Another example is (4.132). Taking the limit as $n \to \infty$, the equation shows us (and we could verify) that

$$c = -\frac{1}{E[\ln W]} \tag{4.140}$$

Defining $X_i = \ln W_i$ and $\bar{X} = (X_1 + \ldots + X_n)/n$, we can obtain a confidence interval for EX in the usual way. We then see from (4.140) that we can form a confidence interval for c by simply taking the negative reciprocal of each endpoint of the interval. (Both endpoints for EX are negative, and $-1/x$ is increasing there, so the endpoints keep their left and right order.)

What about in general? For the Method of Moments case, our estimators are functions of the sample moments, and since the latter are formed from sums and thus are asymptotically normal, the delta method can be used to show that our estimators are asymptotically normal and to obtain asymptotic variances for them.
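The bias calculation for the maximum of a uniform sample on (0,c) in Section 4.4.5, in particular (4.139), can be checked by simulation. This is only a sketch; the values ctrue = 10 and n = 5 are arbitrary choices made for illustration.

```r
set.seed(1)
ctrue <- 10      # hypothetical true value of c
n <- 5           # sample size
nreps <- 100000  # number of simulated samples

# Each row is one sample of size n from U(0, ctrue); the MLE is the row maximum
w <- matrix(runif(nreps * n, 0, ctrue), nrow = nreps)
chat <- apply(w, 1, max)

sim_mean <- mean(chat)
theory_mean <- n / (n + 1) * ctrue   # right-hand side of (4.139)
print(c(simulated = sim_mean, theoretical = theory_mean))
print(c(shortfall = ctrue - sim_mean, c_over_n_plus_1 = ctrue / (n + 1)))
```

The simulated mean of the estimator should sit close to nc/(n+1), so that on average the maximum falls short of c by about c/(n+1).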
There is a well-developed asymptotic theory for MLEs, which under certain conditions not only shows asymptotic normality with a determined asymptotic variance, but also establishes that MLEs are in a certain sense optimal among all estimators. We will not pursue this here.

4.4.7 Bayesian Methods (advanced topic)

Consider again the example of estimating p, the probability of heads for a certain coin. Suppose we were to say, before tossing the coin even once, "I think p could be any number, but more likely near 0.5, something like a normal distribution with mean 0.5 and standard deviation, oh, let's say 0.1." Note carefully the word "think." We are just using our "gut feeling" here, our "hunch." The number p is NOT a random variable! We are dealing with this ONE coin, and it has just ONE value of p. Yet we are treating p as random anyway.

Under this "random p" assumption, the MLE would change. Our data here is X, the number of heads we get from n tosses of the coin. Instead of the likelihood being

$$L = \binom{n}{X} p^X (1-p)^{n-X} \tag{4.141}$$

it now becomes

$$L = \frac{1}{\sqrt{2\pi} \, 0.1} \exp\left\{-0.5 \left[(p - 0.5)/0.1\right]^2\right\} \binom{n}{X} p^X (1-p)^{n-X} \tag{4.142}$$

We would then find the value of p which maximizes L, and take that as our estimate.

A "gut feeling" or "hunch" used in this manner is called a subjective prior. "Prior" to collecting any data, we have a certain belief about p. This is very controversial, and many people (including me) consider it to be highly inappropriate. They/I feel that there is nothing wrong using one's gut feelings to make a decision, but it should NOT be part of the mathematical analysis of the data. One's hunches can play a role in deciding the "preponderance of evidence," as discussed in Section 4.3.10.
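To make the maximization of (4.142) concrete, here is a minimal numerical sketch. The data (7 heads in 10 tosses) are hypothetical, the function name logpost is invented here, and the use of optimize() on the log of (4.142) is just one convenient way to find the maximizing p, not a method prescribed by the text.

```r
# Hypothetical data: x heads in n tosses
n <- 10
x <- 7

# Log of (4.142): normal density with mean 0.5, sd 0.1, times binomial likelihood
logpost <- function(p) dnorm(p, mean = 0.5, sd = 0.1, log = TRUE) +
  dbinom(x, size = n, prob = p, log = TRUE)

# Maximize over p in (0,1)
pmax <- optimize(logpost, interval = c(0, 1), maximum = TRUE)$maximum
print(c(mle = x / n, with_prior = pmax))
```

Under these assumptions the maximizing value lies between the ordinary MLE, x/n = 0.7, and the center of the assumed normal density, 0.5, which shows how the hunch pulls the estimate toward itself.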
On the other hand, maybe we have actual data on coins, presumed to be a random sample from the population of all coins of that type, and we assume that our coin now is chosen randomly from that population. Say we have formed a normal or other model for p based on that data. It would be fine to use this in estimating p for a new coin, and the second L above would be appropriate. In this case, we would be using an empirical prior.

4.4.8 The Empirical cdf

Recall that $F_X$, the cdf of X, is defined as

$$F_X(t) = P(X \le t), \quad -\infty < t < \infty$$

[…] the sorted version of the $X_i$.[22] Then

$$\hat{F}_X(t) = \begin{cases} 0, & \text{for } t < \ldots \end{cases}$$

[…]

And what about our raffle example in Section 4.4.1? Certainly we can imagine various kinds of randomness that contribute to the numbers people get on their raffle tickets. Maybe, for instance, you were in a traffic jam on the way to the place where you bought the ticket, so you bought it a little later than you might have and thus got a higher number. But I've always emphasized the notion of a repeatable experiment in these notes. How can that happen here? You could imagine, for instance, the raffle chair suddenly losing all the tickets, and asking everyone to draw again, resulting in different ticket numbers. Or you can imagine the population of all raffles that you might submit to which have the same value of c.

You can see from this that if one chooses to apply statistics carefully (which you absolutely should do) there sometimes are some knotty problems of interpretation to think about.

4.6 Nonparametric Density Estimation

Consider the Bus Paradox example again. Recall that W denoted the time until the next bus arrives. This is called the forward recurrence time. The backward recurrence time is the time since the last bus was here, which we will denote by R.

Suppose we are interested in estimating the density of R, $f_R$, based on the sample data $R_1, \ldots, R_n$ that we gather in our simulation in Section 4.2.1, where n = 1000. How can we do this?
We could, of course, assume that $f_R$ is a member of some parametric family of distributions, say the two-parameter gamma family. We would then estimate those two parameters as in Section 4.4, and possibly check our assumption using goodness-of-fit procedures, discussed in our unit on modeling, Chapter 5. On the other hand, we may wish to estimate $f_R$ without making any parametric assumptions. In fact, one reason we may wish to do so is to visualize the data in order to search for a suitable parametric model.

If we do not assume any parametric model, we have in essence changed our problem from estimating a finite number of parameters to an infinite-parameter problem; the "parameters" are the values of $f_R(t)$ for all the different values of t. Of course, we probably are willing to assume some structure on $f_R$, such as continuity, but then we still would have an infinite-parameter problem.

We call such estimation nonparametric, meaning that we don't use a parametric model. However, you can see that it is really infinite-parametric estimation.

As discussed again in our unit on modeling, Chapter 5, the more complex the model, the higher the variance of its estimator. So, nonparametric estimators will have higher variance than parametric ones. The nonparametric estimators will also generally have smaller bias, of course.

4.6.1 Basic Ideas

Recall that

$$f_R(t) = \frac{d}{dt} F_R(t) = \frac{d}{dt} P(R \le t) \tag{4.146}$$

[24] Actually, our unit on renewal theory, Chapter 9, proves that R has an exponential distribution. However, here we'll pretend we don't know that.
From calculus, that means that

$$f_R(t) \approx \frac{P(R \le t+h) - P(R \le t-h)}{2h} \tag{4.147}$$

$$= \frac{P(t-h < R \le t+h)}{2h}$$

[…]

    # the first two lines of this listing are missing from the source;
    # the header and initialization below are reconstructed from the call doexpt(observationpt)
    doexpt <- function(opt) {
       lastarrival <- 0
       while (TRUE) {
          newlastarrival <- lastarrival + rexp(1,0.1)
          if (newlastarrival > opt)
             return(opt-lastarrival)
          else lastarrival <- newlastarrival
       }
    }

    observationpt <- 240
    nreps <- 10000
    waits <- vector(length=nreps)
    for (rep in 1:nreps) waits[rep] <- doexpt(observationpt)
    hist(waits)

Note that I used the default number of intervals, 20. Here is the result:

[Figure: histogram of waits, default number of intervals]

The density seems to have a shape like that of the exponential parametric family. (This is not surprising, because it is exponential, but remember we're pretending we don't know that.)

Here is the plot with 100 intervals:

[Figure: histogram of waits, 100 intervals]

Again, a similar shape, though more raggedy.

4.6.3 Kernel-Based Density Estimation (advanced topic)

No matter what the interval width is, the histogram will consist of a bunch of rectangles, rather than a curve. That is basically because, for any particular value of t, $\hat{f}_R(t)$ depends only on the $R_i$ that fall into that interval. We could get a smoother result if we used all our data to estimate $f_R(t)$ but put more weight on the data that is closer to t. One way to do this is called kernel-based density estimation, which in R is handled by the function density().

We need a set of weights, more precisely a weight function k, called the kernel. Any nonnegative function which integrates to 1 (i.e. a density function in its own right) will work. Our estimator is then

$$\hat{f}_R(t) = \frac{1}{nh} \sum_{i=1}^{n} k\left(\frac{t - R_i}{h}\right) \tag{4.152}$$

To make this idea concrete, take k to be the uniform density on (-1,1), which has the value 0.5 on (-1,1) and 0 elsewhere. Then (4.152) reduces to (4.150). Note how the parameter h, called the bandwidth, continues to control how far away from t we wish to go for data points.
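The estimator (4.152) is short enough to code directly, which may help in seeing what the kernel and bandwidth do. The sketch below is illustrative only: the data are a stand-in exponential sample with mean 10 rather than the output of the bus simulation, and the names kde and kunif are invented here. With the uniform kernel, the estimate should coincide with the proportion of data points within h of t, divided by 2h.

```r
set.seed(1)
r <- rexp(1000, 0.1)   # stand-in sample; exponential with mean 10 (an assumption)

# Kernel estimate (4.152): fhat(t) = (1/(n*h)) * sum of k((t - R_i)/h)
kde <- function(t, data, h, k)
  sapply(t, function(t0) sum(k((t0 - data) / h)) / (length(data) * h))

kunif <- function(u) ifelse(abs(u) < 1, 0.5, 0)  # uniform density on (-1,1)

t0 <- 5; h <- 2
f_unif <- kde(t0, r, h, kunif)
f_count <- sum(abs(r - t0) < h) / (2 * h * length(r))  # points within h of t0
f_norm <- kde(t0, r, h, dnorm)                          # N(0,1) kernel
print(c(uniform = f_unif, counting = f_count, normal = f_norm, true = dexp(t0, 0.1)))
```

Passing a vector of t values to kde() and plotting the result gives a curve comparable in spirit to what density() draws, though density() evaluates on a grid and uses its own default bandwidth.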
But as mentioned, what we really want is to include all data points, so we typically use a kernel with support on all of $(-\infty, \infty)$. In R, the default kernel is that of the N(0,1) density. The bandwidth h controls how much smoothing we do; smaller values of h place heavier weights on data points near t and much lighter weights on the distant points. In R, the bandwidth is expressed as the standard deviation of the scaled kernel, and the default value is chosen from the data by a rule of thumb.

For our data here, I took the defaults:

    plot(density(r))

The result is seen in Figure 4.1.

[Figure 4.1: Kernel estimate, default bandwidth]

I then tried it with a bandwidth of 0.5. See Figure 4.2. This curve oscillates a lot, so an analyst might think 0.5 is too small. (We are prejudiced here, because we know the true population density is exponential.)

[Figure 4.2: Kernel estimate, bandwidth 0.5]

4.6.4 Proper Use of Density Estimates

There is no good, practical way to choose a good bin width or bandwidth. Moreover, there is also no good way to form a reasonable confidence band for a density estimate.

So, density estimates should be used as exploratory tools, not as firm bases for decision making. You will probably find it quite unsettling to learn that there is no exact answer to the problem. But that's real life!

Exercises

Note to instructor: See the Preface for a list of sources of real data on which exercises can be assigned to complement the theoretical exercises below.

1. Suppose we draw a sample of size 2 from a population in which X has the values 10, 15 and 12. Find $p_{\bar{X}}$, first assuming sampling with replacement, then assuming sampling without replacement.

2. Suppose lifetimes of lightbulbs are exponentially distributed with mean $\mu$. In the past, $\mu$ was 1000, but there is a claim that the new light bulbs are improved and $\mu > 1000$. To test that claim, we will sample 20 lightbulbs, getting lifetimes $X_1, \ldots, X_{20}$, and compute $\bar{X} = (X_1 + \ldots + X_{20})/20$. We will then perform a hypothesis test of $H_0: \mu = 1000$ vs.
$H_A: \mu > 1000$. It is natural to have our test take the form in which we reject $H_0$ if $\bar{X} > r$ for some constant r chosen so that $P(\bar{X} > r) = 0.05$ under $H_0$. Suppose we want an exact test, not one based on a normal approximation. Find r.

3. Consider the Method of Moments Estimator $\hat{c}$ in the raffle example, Section 4.4.1. Find the exact value of $Var(\hat{c})$. Use the facts that $1 + 2 + \ldots + r = r(r+1)/2$ and $1^2 + 2^2 + \ldots + r^2 = r(r+1)(2r+1)/6$.

4. Suppose W has a uniform distribution on (-c,c), and we draw a random sample of size n, $W_1, \ldots, W_n$. Find the Method of Moments and Maximum Likelihood estimators.

5. An urn contains $\omega$ marbles, one of which is black and the rest being white. We draw marbles from the urn one at a time, without replacement, until we draw the black one; let N denote the number of draws needed. Find the Method of Moments estimator of $\omega$ based on N.

6. In the raffle example, Section 4.4.1, find a $100(1-\alpha)\%$ confidence interval for c based on $\check{c}$, the Maximum Likelihood Estimate of c.

Hint: Use the example in Section 4.2.13 as a guide.

7. In many applications, observations come in correlated clusters. For instance, we may sample r trees at random, then s leaves within each tree. Clearly, leaves from the same tree will be more similar to each other than leaves on different trees.

In this context, suppose we have a random sample $X_1, \ldots, X_n$, n even, such that there is correlation within pairs. Specifically, suppose the pair $(X_{2i+1}, X_{2i+2})$ has a bivariate normal distribution with mean $(\mu, \mu)$ and covariance matrix

$$\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \tag{4.153}$$

i = 0,...,n/2-1, with the n/2 pairs being independent. Find the Method of Moments estimators of $\mu$ and $\rho$.

8. Candidates A, B and C are vying for election. Let $p_1$, $p_2$ and $p_3$ denote the fractions of people planning to vote for them. We poll n people at random, yielding estimates $\hat{p}_1$, $\hat{p}_2$ and $\hat{p}_3$. B claims that she has more supporters than the other two candidates combined. Give a formula for an approximate 95% confidence interval for $p_2 - (p_1 + p_3)$.

Chapter 5

Introduction to Model Building

All models are wrong, but some are useful.
George Box[1]

[Mathematical models] should be made as simple as possible, but not simpler.

Albert Einstein[2]

The above quote by Box says it all. Consider for example the family of normal distributions. In real life, random variables are bounded (no person's height is negative or greater than 500 inches) and are inherently discrete, due to the finite precision of our measuring instruments. Thus, technically, no random variable in practice can have an exact normal distribution. Yet the assumption of normality pervades statistics, and has been enormously successful, provided one understands its approximate nature.

The situation is similar to that of physics. We know that in many analyses of bodies in motion, we can neglect the effect of air resistance, but that in some situations one must include that factor in our model.

So, the field of probability and statistics is fundamentally about modeling. The field is extremely useful, provided the user understands the modeling issues well. For this reason, this book contains this separate chapter on modeling issues.

5.1 Bias Vs. Variance

Consider a general estimator Q of some population value b. Then a common measure of the quality of the estimator Q is the mean squared error (MSE),

$$E[(Q - b)^2] \tag{5.1}$$

Of course, the smaller the MSE, the better.

[1] George Box (1919-) is a famous statistician, with several statistical procedures named after him.

[2] The reader is undoubtedly aware of Einstein's (1879-1955) famous theories of relativity, but may not know his connections to probability theory. His work on Brownian motion, which describes the path of a molecule as it is bombarded by others, is probabilistic in nature, and later developed into a major branch of probability theory. Einstein was also a pioneer in quantum mechanics, which is probabilistic as well. At one point, he doubted the validity of quantum theory, and made his famous remark, "God does not play dice with the universe."
One can break (5.1) down into variance and (squared) bias components, as follows:[3]

$$MSE(Q) = E[(Q - b)^2] \tag{5.2}$$

$$= E[\{(Q - EQ) + (EQ - b)\}^2] \tag{5.3}$$

$$= E[(Q - EQ)^2] + 2E[(Q - EQ)(EQ - b)] + E[(EQ - b)^2] \tag{5.4}$$

$$= Var(Q) + (EQ - b)^2 \tag{5.5}$$

$$= \text{variance} + \text{squared bias} \tag{5.6}$$

(The middle term in (5.4) vanishes because $EQ - b$ is a constant and $E(Q - EQ) = 0$.)

In other words, in discussing the accuracy of an estimator (especially in comparing two or more candidates to use for our estimator) the average squared error has two main components, one for variance and one for bias. In building a model, these two components are often at odds with each other; we may be able to find an estimator with smaller bias but more variance, or vice versa.

5.2 "Desperate" for Data

Suppose we have the samples of men's and women's heights described in Section 4.2.11, and say we wish to predict the height H of a new person who we know to be a man but for whom we know nothing else.

The question is, should we take gender into account in our prediction? If so, we would predict the man to be of height[4]

$$T_1 = \bar{X}, \tag{5.7}$$

our estimate for the mean height of all men. If not, then we predict his height to be

$$T_2 = \frac{\bar{X} + \bar{Y}}{2}, \tag{5.8}$$

our estimate of the mean height of all people (assuming that half the population is male).

Recalling our notation from Section 4.2.11, assume that $n_1 = n_2$, and call the common value n. Also, for simplicity, let's assume that $\sigma_1 = \sigma_2 = \sigma$.

5.2.1 Mathematical Formulation of the Problem

Let's formalize this a bit. Let G denote gender, 1 for male, 2 for female. Then our random quantity here is (H,G). Our experiment here is to choose a person from the population at random. Thus the height and gender will be random variables.

[3] In reading the following derivation, keep in mind that EQ and b are constants.

[4] Assuming that predicting too high and too low are of equal concern to us, etc.

Then the correct population model is

$$E(H \mid G = i) = \mu_i \tag{5.9}$$

and our predictor $T_1$ reflects this.
However, $T_2$ makes the simplifying assumption that $\mu_1 = \mu_2$, so that

$$E(H \mid G = i) = \mu \tag{5.10}$$

where $\mu$ is the common value of $\mu_1$ and $\mu_2$. We'll refer to (5.9) as the complex model (two parameters, not counting variances), and to (5.10) as the simple model (one parameter, not counting variances).

5.2.2 Bias and Variance of the Two Predictors

Since the true model is (5.9), $T_1$ is unbiased, from (4.5). But the predictor $T_2$ from the simple model is biased:

$$E(T_2 \mid G = 1) = E(0.5\bar{X} + 0.5\bar{Y}) \quad \text{(definition)} \tag{5.11}$$

$$= 0.5 E(\bar{X}) + 0.5 E(\bar{Y}) \quad \text{(linearity of E())} \tag{5.12}$$

$$= 0.5\mu_1 + 0.5\mu_2 \quad \text{[from (4.5)]} \tag{5.13}$$

$$\ne \mu_1 \tag{5.14}$$

On the other hand, $T_2$ has a smaller variance. Recalling (4.9), we have

$$Var(T_1 \mid G = 1) = \frac{\sigma^2}{n} \tag{5.15}$$

And

$$Var(T_2 \mid G = 1) = Var(0.5\bar{X} + 0.5\bar{Y}) \tag{5.16}$$

$$= 0.5^2 Var(\bar{X}) + 0.5^2 Var(\bar{Y}) \quad \text{(properties of Var())} \tag{5.17}$$

$$= \frac{\sigma^2}{2n} \quad \text{[from (4.9)]} \tag{5.18}$$

5.2.3 Implications

These findings are highly instructive. You might at first think that "of course" $T_1$ would be the better predictor than $T_2$. But for a small sample size, the smaller (actually 0) bias of $T_1$ is not enough to counteract its larger variance. $T_2$ is biased, yes, but it is based on double the sample size and thus has half the variance.

[5] We are calling it a predictor rather than an estimator as in other examples. This follows custom, which is to use the latter term when the target is a constant, e.g. a population mean, and to use the former term when the target is a random quantity. It is not a major distinction.

In light of (5.6), we see that $T_1$, the "true" predictor, may not necessarily be the better of the two predictors. Granted, it has no bias whereas $T_2$ does have a bias, but the latter has a smaller variance.
Let's consider this in more detail, using (5.5):

$$MSE(T_1) = \frac{\sigma^2}{n} + 0^2 = \frac{\sigma^2}{n} \tag{5.19}$$

$$MSE(T_2) = \frac{\sigma^2}{2n} + \left(\frac{\mu_1 + \mu_2}{2} - \mu_1\right)^2 = \frac{\sigma^2}{2n} + \left(\frac{\mu_2 - \mu_1}{2}\right)^2 \tag{5.20}$$

$T_1$ is a better predictor than $T_2$ if (5.19) is smaller than (5.20), which is true if

$$\left(\frac{\mu_2 - \mu_1}{2}\right)^2 > \frac{\sigma^2}{2n} \tag{5.21}$$

So you can see that $T_1$ is better only if either

• n is large enough, or

• the difference in population mean heights between men and women is large enough, or

• there is not much variation within each population, e.g. most men have very similar heights

Since that third item, small within-population variance, is rarely seen, let's concentrate on the first two items. The big revelation here is that:

A more complex model is more accurate than a simpler one only if either

• we have enough data to support it, or

• the complex model is sufficiently different from the simpler one

In the height/gender example above, if n is too small, we are "desperate for data," and thus make use of the female data to augment our male data. Though women tend to be shorter than men, the bias that results from the augmentation is offset by the reduction in estimator variance that we get. But if n is large enough, the variance will be small in either model, so when we go to the more complex model, the advantage gained by reducing the bias will more than compensate for the increase in variance.

THIS IS AN ABSOLUTELY FUNDAMENTAL NOTION IN STATISTICS.

This was a very simple example, but you can see that in complex settings, fitting too rich a model can result in very high MSEs for the estimates. In essence, everything becomes noise. (Some people have cleverly coined the term noise mining, a play on the term data mining.) This is the famous overfitting problem.

In our unit on statistical relations, Chapter 6, we will show the results of a scary experiment done at the Wharton School, the University of Pennsylvania's business school. The researchers deliberately added fake data to a prediction equation, and standard statistical software identified it as "significant"! This is partly a problem with the word itself, as we saw in Section 4.3.8, but also a problem of using far too complex a model, as will be seen in that future unit.
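The tradeoff in (5.19) through (5.21) is easy to tabulate. The sketch below uses hypothetical population values (mean heights 70 and 69, common standard deviation 4, chosen only so that the crossover is visible) and adds a simulation check at n = 10 that assumes normally distributed heights. The function names mse_t1 and mse_t2 are invented for this illustration.

```r
# Theoretical MSEs from (5.19) and (5.20)
mse_t1 <- function(n, sigma) sigma^2 / n
mse_t2 <- function(n, sigma, mu1, mu2) sigma^2 / (2 * n) + ((mu2 - mu1) / 2)^2

mu1 <- 70; mu2 <- 69; sigma <- 4   # hypothetical population values
ns <- c(5, 10, 32, 50, 200)
print(data.frame(n = ns,
                 mse_t1 = mse_t1(ns, sigma),
                 mse_t2 = mse_t2(ns, sigma, mu1, mu2)))

# Simulation check at n = 10, assuming normal heights
set.seed(1)
n <- 10; nreps <- 100000
xbar <- rowMeans(matrix(rnorm(nreps * n, mu1, sigma), nrow = nreps))
ybar <- rowMeans(matrix(rnorm(nreps * n, mu2, sigma), nrow = nreps))
sim_t1 <- mean((xbar - mu1)^2)
sim_t2 <- mean(((xbar + ybar) / 2 - mu1)^2)
print(c(sim_t1 = sim_t1, sim_t2 = sim_t2))
```

With these particular values, rearranging (5.21) gives a crossover at $n = 2\sigma^2/(\mu_2 - \mu_1)^2 = 32$: below that, the biased predictor from the simple model has the smaller MSE, and above it the complex model wins.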
5.3 Assessing "Goodness of Fit" of a Model

Our example in Section 4.4.4 concerned how to estimate the parameters of a gamma distribution, given a sample from the distribution. But that assumed that we had already decided that the gamma model was reasonable in our application. Here we will be concerned with how we might come to such decisions.

Assume we have a random sample $X_1, \ldots, X_n$ from a distribution having density $f_X$.

5.3.1 The Chi-Square Goodness of Fit Test

The classic way to do this would be the Chi-Square Goodness of Fit Test. We would set

$$H_0: f_X \text{ is a member of the exponential parametric family} \tag{5.22}$$

This would involve partitioning $(0, \infty)$ into k intervals $(s_{i-1}, s_i)$ of our choice, and setting

$$N_i = \text{number of } X_j \text{ in } (s_{i-1}, s_i) \tag{5.23}$$

We would then find the Maximum Likelihood Estimate (MLE) of $\lambda$, on the assumption that the distribution of X really is exponential. The MLE turns out to be the reciprocal of the sample mean, i.e.

$$\hat{\lambda} = 1/\bar{X} \tag{5.24}$$

This would be considered the parameter of the "best-fitting" exponential density for our data. We would then estimate the probabilities

$$p_i = P[X \in (s_{i-1}, s_i)] = e^{-\lambda s_{i-1}} - e^{-\lambda s_i}, \quad i = 1, \ldots, k \tag{5.25}$$

by

$$\hat{p}_i = e^{-\hat{\lambda} s_{i-1}} - e^{-\hat{\lambda} s_i}, \quad i = 1, \ldots, k \tag{5.26}$$

Note that $N_i$ has a binomial distribution, with n trials and success probability $p_i$. Using this, the expected value of $N_i$ is estimated to be

$$\nu_i = n\left(e^{-\hat{\lambda} s_{i-1}} - e^{-\hat{\lambda} s_i}\right), \quad i = 1, \ldots, k \tag{5.27}$$

Our test statistic would then be

$$Q = \sum_{i=1}^{k} \frac{(N_i - \nu_i)^2}{\nu_i} \tag{5.28}$$

where $\nu_i$ is the expected value of $N_i$ under the assumption of "exponentialness." It can be shown that Q is
approximately chi-square distributed with k-2 degrees of freedom. Note that only large values of Q should be suspicious, i.e. should lead us to reject $H_0$; if Q is small, it indicates a good fit. If Q were large enough to be a "rare event," say larger than $\chi^2_{0.95,k-2}$, we would decide NOT to use the exponential model; otherwise, we would use it.

Hopefully the reader has immediately recognized the problem here. If we have a large sample, this procedure will pounce on tiny deviations from the exponential distribution, and we would decide not to use the exponential model, even if those deviations were quite minor. Again, no model is 100% correct, and thus a goodness of fit test will eventually tell us not to use any model at all.

5.3.2 Kolmogorov-Smirnov Confidence Bands

Again consider the problem above, in which we were assessing the fit of an exponential model. In line with our major point that confidence intervals are far superior to hypothesis tests, we now present Kolmogorov-Smirnov confidence bands, which work as follows.

Recall the concept of empirical cdfs, presented in Section 4.4.8. It turns out that the distribution of

$$M = \max \ldots$$

[…]

Warning: The Kolmogorov-Smirnov procedure available in the R language performs only a hypothesis test, rather than forming a confidence band. In other words, it simply checks to see whether a member of the family falls within the band. This is not what we want, because we may be perfectly happy if a member is only near the band.

Of course, another way, this one less formal, of assessing data for suitability for some model is to plot the data in a histogram or something of that nature.

5.4 Bias Vs. Variance Again

In our unit on estimation, Section 4.6, we saw a classic tradeoff in histogram- and kernel-based density estimators. With histograms, for instance, the wider bin width produces a graph which is smoother, but possibly too smooth, i.e. with less oscillation than the true population curve has. The same problem occurs with larger values of h in the kernel case.

This is actually yet another example of the bias/variance tradeoff, discussed above and, as mentioned, ONE OF THE MOST RECURRING NOTIONS IN STATISTICS. A large bin width, or a large value of h, produces more bias. In general, the larger the bin width or h, the further $E[\hat{f}_R(t)]$ is from the true value of $f_R(t)$. This occurs because we are making use of points which are not so near t, and thus at which the density height is different from that of $f_R(t)$. On the other hand, because we are making use of more points, $Var[\hat{f}_R(t)]$ will be smaller.

THERE IS NO GOOD WAY TO CHOOSE THE BIN WIDTH OR h. Even though there is a lot of theory to suggest how to choose the bin width or h, no method is foolproof. This is made even worse by the fact that the theory generally has a goal of minimizing integrated mean squared error,

$$\int_{-\infty}^{\infty} E\left[ \left( \hat{f}_R(t) - f_R(t) \right)^2 \right] dt \tag{5.33}$$

rather than, say, the mean squared error at a particular point of interest, v:

$$E\left[ \left( \hat{f}_R(v) - f_R(v) \right)^2 \right] \tag{5.34}$$

5.5 Robustness

Traditionally, the term robust in statistics has meant resilience to violations in assumptions. For example, in Section 4.2.8, we presented Student-t, a method for finding exact confidence intervals for means, assuming normally-distributed populations. But as noted at the outset of this chapter, no population in the real world has an exact normal distribution. The question at hand (which we will address below) is, does the Student-t method still give approximately correct results if the sampled population is not normal? If so, we say that Student-t is robust to the normality assumption.

Later, there was quite a lot of interest among statisticians in estimation procedures that do well even if there are outliers in the data, i.e. erroneous observations that are in the fringes of the sample. Such procedures are said to be robust to outliers.

Our interest here is in robustness to assumptions. Let us first consider the Student-t example. As discussed in Section 4.2.8, the main statistic here is

$$T = \frac{\bar{X} - \mu}{s/\sqrt{n}} \tag{5.35}$$

where $\mu$ is the population mean and s is the square root of the unbiased version of the sample variance:

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \tag{5.36}$$

The distribution of T, under the assumption of a normal population, has been tabulated, and tables for it appear in virtually every textbook on statistics. But what if the population is not normal, as is inevitably the case?

The answer is that it doesn't matter. For large n, and even for samples having, say, n = 20, the distribution of T is close to N(0,1) by the Central Limit Theorem regardless of whether the population is normal.

By contrast, consider the classic procedure for performing hypothesis tests and forming confidence intervals for a population variance $\sigma^2$, which relies on the statistic

$$K = \frac{(n-1) s^2}{\sigma^2} \tag{5.37}$$

where again $s^2$ is the unbiased version of the sample variance. If the sampled population is normal, then K can be shown to have a chi-square distribution with n-1 degrees of freedom. This then sets up the tests or intervals. However, it has been shown that these procedures are not robust to the assumption of a normal population. See The Analysis of Variance: Fixed, Random, and Mixed Models, by Hardeo Sahai and Mohammed I. Ageel, Springer, 2000, and the earlier references they cite, especially the pioneering work of Scheffé.

Exercises

Note to instructor: See the Preface for a list of sources of real data on which exercises can be assigned to complement the theoretical exercises below.

1. In our example in Section 5.2, assume $\mu_1 = 70$, $\mu_2 = 66$, $\sigma = 4$ and the distribution of height is normal in the two populations. Suppose we are predicting the height of a man who, unknown to us, has height 68. We hope to guess within two inches. Find $P(|T_1 - 68| < 2)$ and $P(|T_2 - 68| < 2)$ for various values of n.

2. In Section 4.2.16 we discussed simultaneous inference, the forming of confidence intervals whose joint confidence level was 95% or some other target value. The Kolmogorov-Smirnov confidence band in Section 5.3.2 allows us to compute infinitely many confidence intervals for $F_X(t)$ at different values of t, at a "price" of only 1.358. Still, if we are just estimating $F_X(t)$ at a single value of t, an individual confidence interval using .32 would be narrower than that given to us by Kolmogorov-Smirnov. Compare the widths of these two intervals in a situation in which the true value of $F_X(t)$ is 0.4.

Chapter 6
Statistical Relations Between Variables

6.1 The Goals: Prediction and Understanding

Prediction is difficult, especially when it's about the future.
    Yogi Berra [1]

In this unit we are interested in relations between variables. Before beginning, it is important to understand the typical goals in analyzing such relations:

• Prediction: Here we are trying to predict one variable from one or more others.
• Understanding: Here we wish to determine which variables have a greater effect on a given variable.

Denote the predictor variables by $X^{(1)},...,X^{(r)}$. The variable to be predicted, Y, is often called the response variable.

A common statistical methodology used for such analyses is called regression analysis. In the important special case in which the response variable Y is an indicator variable [2], taking on just the values 1 and 0 to indicate class membership, we call this the classification problem. (If we have more than two classes, we need several Ys.)

In the above context, we are interested in the relation of a single variable Y with other variables $X^{(i)}$. But in some applications, we are interested in the more symmetric problem of relations among variables $X^{(i)}$ (with there being no Y). A typical tool for the case of continuous random variables is principal components analysis, and a popular one for the discrete case is the log-linear model; both will be discussed later in this unit.

[1] Yogi Berra is a former baseball player and manager, famous for his malapropisms, such as "When you reach a fork in the road, take it"; "That restaurant is so crowded that no one goes there anymore"; and "I never said half the things I really said."
[2] Sometimes called a dummy variable.

6.2 Example Applications: Software Engineering, Networks, Text Mining

Example: As an aid in deciding which applicants to admit to a graduate program in computer science, we might try to predict Y, a faculty rating of a student after completion of his/her first year in the program, from $X^{(1)}$ = the student's CS GRE score, $X^{(2)}$ = the student's undergraduate GPA and various other variables. Here our goal would be Prediction, but educational researchers might do the same thing with the goal of Understanding. For an example of the latter, see Predicting Academic Performance in the School of Computing & Information Technology (SCIT), 35th ASEE/IEEE Frontiers in Education Conference, by Paul Golding and Sophia McNamarah, 2005.

Example: In a paper, Estimation of Network Distances Using Off-line Measurements, Computer Communications, by Danny Raz, Nidhan Choudhuri and Prasun Sinha, 2006, the authors wanted to predict Y, the round-trip time (RTT) for packets in a network, using the predictor variables $X^{(1)}$ = geographical distance between the two nodes, $X^{(2)}$ = number of router-to-router hops, and other variables. The goal here is primarily Prediction.

Example: In a paper, Productivity Analysis of Object-Oriented Software Developed in a Commercial Environment, Software Practice and Experience, by Thomas E. Potok, Mladen Vouk and Andy Rindos, 1999, the authors mainly had an Understanding goal: What impact, positive or negative, does the use of object-oriented programming have on programmer productivity? Here they predicted Y = number of person-months needed to complete the project, from $X^{(1)}$ = size of the project as measured in lines of code, $X^{(2)}$ = 1 or 0 depending on whether an object-oriented or procedural approach was used, and other variables.

Example: Most text mining applications are classification problems. For example, the paper Untangling Text Data Mining, Proceedings of ACL'99, by Marti Hearst, 1999 cites, inter alia, an application in which the analysts wished to know what proportion of patents come from publicly funded research. They were using a patent database, which of course is far too huge to feasibly search by hand. That meant that they needed to be able to (reasonably reliably) predict Y = 1 or 0 according to whether the patent was publicly funded from a number of $X^{(i)}$, each of which was an indicator variable for a given key word, such as "NSF." They would then treat the predicted Y values as the real ones, and estimate their proportion from them.

6.3 Regression Analysis

6.3.1 What Does "Relationship" Really Mean?

Consider the Davis city population example again. In addition to the random variable W for weight, let H denote the person's height. Suppose we are interested in exploring the relationship between height and weight.

As usual, we must first ask, what does that really mean? What do we mean by "relationship"? Clearly, there is no exact relationship; for instance, we cannot exactly predict a person's weight from his/her height.

Intuitively, though, we would guess that mean weight increases with height. To state this precisely, take Y to be the weight W and X to be the height H, and define

$$m_{W;H}(t) = E(W \mid H = t) \tag{6.1}$$

This looks abstract, but it is just common-sense stuff. For example, $m_{W;H}(68)$ would be the mean weight of all people in the population of height 68 inches. The value of $m_{W;H}(t)$ varies with t, and we would expect that a graph of it would show an increasing trend with t, reflecting that taller people tend to be heavier.

We call $m_{W;H}$ the regression function of W on H. In general, $m_{Y;X}(t)$ means the mean of Y among all units in the population for which X = t.

Note the word population in that last sentence. The function m is a population function.
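
To make the definition concrete, here is a minimal sketch in R. It assumes a made-up population: the height distribution, the line -180 + 5h and the noise level are all hypothetical, chosen only for illustration. It computes $m_{W;H}(t)$ as a plain average of weight over everyone whose height is t.

```r
# Hypothetical population; every number here is made up for illustration.
set.seed(1)
n <- 100000
h <- round(rnorm(n, mean = 68, sd = 3))          # heights, rounded to whole inches
w <- -180 + 5 * h + rnorm(n, mean = 0, sd = 20)  # weights; mean weight rises with height

# m_{W;H}(t): the mean weight among all units whose height equals t
m_wh <- function(t) mean(w[h == t])

m68 <- m_wh(68)
m72 <- m_wh(72)
print(c(m68, m72))
```

Under these assumptions the two averages come out near 160 and 180, the values of -180 + 5t at t = 68 and t = 72, even though individual weights at each height vary a good deal.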

Now, let's again suppose we have a random sample of 1000 people from Davis, with

$$(H_1,W_1),...,(H_{1000},W_{1000}) \tag{6.2}$$

being their heights and weights. We again wish to use this data to estimate population values. But the difference here is that we are estimating a whole function now, the whole curve m. That means we are estimating infinitely many values, with one $m_{W;H}(t)$ value for each t. [3] How do we do this?

The traditional method is to choose a parametric model for the regression function. That way we estimate only a finite number of quantities instead of an infinite number.

Typically the parametric model chosen is linear, i.e. we assume that $m_{W;H}(t)$ is a linear function of t:

$$m_{W;H}(t) = ct + d \tag{6.3}$$

for some constants c and d. If this assumption is reasonable (meaning that though it may not be exactly true, it is reasonably close) then it is a huge gain for us over a nonparametric model. Do you see why? Again, the answer is that instead of having to estimate an infinite number of quantities, we now must estimate only two quantities, the parameters c and d.

Equation (6.3) is thus called a parametric model of $m_{W;H}$. The set of straight lines indexed by c and d is a two-parameter family, analogous to parametric families of distributions, such as the two-parameter gamma family; the difference, of course, is that in the gamma case we were modeling a density function, and here we are modeling a regression function.

Note that c and d are indeed population parameters in the same sense that, for instance, r and $\lambda$ are parameters in the gamma distribution family. We will see how to estimate c and d in Section 6.3.7.

6.3.2 Multiple Regression: More Than One Predictor Variable

Note that X and t could be vector-valued. For instance, we could have Y be weight and have X be the pair

$$X = (X^{(1)}, X^{(2)}) = (H, A) = (\textrm{height}, \textrm{age}) \tag{6.4}$$

so as to study the relationship of weight with height and age. If we used a linear model, we would write for $t = (t_1, t_2)$,

$$m_{W;H,A}(t) = \beta_0 + \beta_1 t_1 + \beta_2 t_2 \tag{6.5}$$

In other words

$$\textrm{mean weight} = \beta_0 + \beta_1 \textrm{ height} + \beta_2 \textrm{ age} \tag{6.6}$$

It is traditional to use the Greek letter $\beta$ to name the coefficients in a linear regression model.

[3] Of course, the population of Davis is finite, but there is the conceptual population of all people who could live in Davis.

So for instance $m_{W;H,A}(68, 37.2)$ would be the mean weight in the population of all people having height 68 and age 37.2.

6.3.3 Interaction Terms

Equation (6.5) implicitly says that, for instance, the effect of age on weight is the same at all height levels. In other words, the difference in mean weight between 30-year-olds and 40-year-olds is the same regardless of whether we are looking at tall people or short people. To see that, just plug 40 and 30 for age in (6.5), with the same number for height in both, and subtract; you get $10 \beta_2$, an expression that has no height term.

If we feel that the assumption is not a good one (there are also data plotting techniques to help assess this), we can add an interaction term to (6.5), consisting of the product of the two original predictors. Our new predictor variable $X^{(3)}$ is equal to $X^{(1)} X^{(2)}$, and thus our regression function is

$$m_{W;H,A}(t) = \beta_0 + \beta_1 t_1 + \beta_2 t_2 + \beta_3 t_1 t_2 \tag{6.7}$$

If you perform the same subtraction described above, you get $10 (\beta_2 + \beta_3 t_1)$, which depends on the height $t_1$. So this more complex model does not assume, as the old one did, that the difference in mean weight between 30-year-olds and 40-year-olds is the same regardless of whether we are looking at tall people or short people.

Recall the study of object-oriented programming in Section 6.2. The authors there set $X^{(3)} = X^{(1)} X^{(2)}$. The reader should make sure to understand that without this term, we are basically saying that the effect (whether positive or negative) of using object-oriented programming is the same for any code size.

Though the idea of adding interaction terms to a regression model is tempting, it can easily get out of hand. If we have k basic predictor variables, then there are $\binom{k}{2}$ potential two-way interaction terms, $\binom{k}{3}$ three-way terms and so on. Unless we have a very large amount of data, we run a big risk of overfitting (Section 6.3.9.1). And with so many interaction terms, the model would be difficult to interpret.

6.3.4 Nonrandom Predictor Variables

In our weight/height/age example above, all three variables are random. If we repeat the experiment, i.e. we choose another sample of 1000 people, these new people will have different weights, different heights and different ages from the people in the first sample.

But we must point out that the function $m_{Y;X}$ makes sense even if X is nonrandom. To illustrate this, let's look at the ALOHA network example in our introductory unit on discrete probability, Section 1.1.

    # simulation of simple form of slotted ALOHA

    # a node is active if it has a message to send (it will never have more
    # than one in this model), inactive otherwise

    # the inactives have a chance to go active earlier within a slot, after
    # which the actives (including those newly-active) may try to send; if
    # there is a collision, no message gets through

    # parameters of the system:
    #    s = number of nodes
    #    b = probability an active node refrains from sending
    #    q = probability an inactive node becomes active

    # parameters of the simulation:
    #    nslots = number of slots to be simulated
    #    nb = number of values of b to run; they will be evenly spaced in (0.2,0.9)

    # will find mean message delay as a function of b;

    # we will rely on the "ergodicity" of this process, which is a Markov
    # chain (see http://heather.cs.ucdavis.edu/matloff/132/PLN/Markov.tex),
    # which means that we look at just one repetition of observing the chain
    # through many time slots

    # main loop, running the simulation for many values of b
    alohamain <- function(s,q,nslots,nb) {
       deltab <- 0.7 / nb  # we'll try nb values of b in (0.2,0.9)
       md <- matrix(nrow=nb,ncol=2)
       b <- 0.2
       for (i in 1:nb) {
          b <- b + deltab
          # one simulation run per value of b; returns c(b, mean delay)
          md[i,] <- alohasim(s,b,q,nslots)
       }
       return(md)
    }

    # simulate the process for nslots slots
    alohasim <- function(s,b,q,nslots) {
       # status[i,1] = 1 or 0, for node i active or not
       # status[i,2] = if node i active, then epoch in which msg was created
       # (could try a list structure instead of a matrix)
       status <- matrix(nrow=s,ncol=2)
       # start with all active with msg created at time 0
       for (node in 1:s) status[node,] <- c(1,0)
       nsent <- 0  # number of successful transmits so far
       sumdelay <- 0  # total delay among successful transmits so far
       # now simulate the nslots slots
       for (slot in 1:nslots) {
          # check for new actives
          for (node in 1:s) {
             if (!status[node,1])  # inactive
                # NOTE: from this line down to the bookkeeping step, the
                # listing is a reconstruction based on the header comments;
                # those lines are missing from the source listing
                if (runif(1) < q) status[node,] <- c(1,slot)
          }
          # check for attempted transmissions
          ntry <- 0
          for (node in 1:s) {
             if (status[node,1])  # active
                if (runif(1) > b) {
                   ntry <- ntry + 1
                   whotried <- node
                }
          }
          if (ntry == 1) {  # a message gets through iff exactly one node tries
             # do our bookkeeping
             sumdelay <- sumdelay + slot - status[whotried,2]
             # this node now back to inactive
             status[whotried,1] <- 0
             nsent <- nsent + 1
          }
       }
       return(c(b,sumdelay/nsent))
    }

A minor change is that I replaced the probability p, the probability that an active node would send in the original example, by b, the probability of not sending (b for "backoff"). Let A denote the time (measured in slots) between the creation of a message and the time it is successfully transmitted.

We are interested in mean delay, i.e. the mean of A. We are particularly interested in the effect of b on that mean. Our goal here, as described in Section 6.1, could be Prediction, so that we could have an idea of how much delay to expect in future settings. Or, we may wish to explore finding an optimal b, i.e. one that minimizes the mean delay, in which case our goal would be more in the direction of Understanding.

I ran the program with certain arguments, and then plotted the data:

    > md <- alohamain(4,0.1,1000,100)
    > plot(md,cex=0.5,xlab="b",ylab="A")

The plot is shown in Figure 6.1.

Note that though our values of b here are nonrandom, the A values are indeed random. To dramatize that point, I ran the program again. (Remember, unless you specify otherwise, R will use a different seed for its random number stream each time you run a program.) I've superimposed this second data set on the first, using filled circles this time to represent the points:

    > md2 <- alohamain(4,0.1,1000,100)
    > points(md2,cex=0.5,pch=19)

The plot is shown in Figure 6.2.

We do expect some kind of U-shaped relation, as seen here. For b too small, the nodes are clashing with each other a lot, causing long delays to message transmission. For b too large, we are needlessly backing off in many cases in which we actually would get through.

This looks like a quadratic relationship, meaning the following. Take our response variable Y to be A, take our first predictor $X^{(1)}$ to be b, and take our second predictor $X^{(2)}$ to be $b^2$. Then when we say A and b have a quadratic relationship, we mean

$$m_{A;b}(b) = \beta_0 + \beta_1 b + \beta_2 b^2 \tag{6.8}$$

for some constants $\beta_0, \beta_1, \beta_2$. So, we are using a three-parameter family for our model of $m_{A;b}$. No model is exact, but our data seem to indicate that this one is reasonably good, and if further investigation confirms that, it provides for a nice compact summary of the situation.

Again, we'll see how to estimate the $\beta_i$ in Section 6.3.7.

Figure 6.1: Scatter Plot

Figure 6.2: Scatter Plot, Two Data Sets

We could also try adding two more predictor variables, consisting of $X^{(3)} = q$ and $X^{(4)} = s$. We would collect more data, in which we varied the values of q and s, and then could entertain the model

$$m_{A;b,q,s}(b,q,s) = \beta_0 + \beta_1 b + \beta_2 b^2 + \beta_3 q + \beta_4 s \tag{6.9}$$

6.3.5 Prediction

So, we've taken our data on weight/height/age, and estimated the function m using that data, yielding $\hat{m}$. Now, a new person comes in, of height 70.4 and age 24.8. What should we predict his weight to be?

The answer is that we predict his weight to be our estimated mean weight for his height/age group,

$$\hat{m}_{W;H,A}(70.4, 24.8) \tag{6.10}$$

If our model is (6.5), then (6.10) is

$$\hat{m}_{W;H,A}(70.4, 24.8) = \hat{\beta}_0 + \hat{\beta}_1 \cdot 70.4 + \hat{\beta}_2 \cdot 24.8 \tag{6.11}$$

where the $\hat{\beta}_i$ are estimated from our data as in Section 6.3.7 below.

6.3.6 Optimality of the Regression Function

In predicting Y from X (with X random), we might assess our predictive ability by the mean squared prediction error (MSPE):

$$MSPE = E\left[ (Y - w(X))^2 \right] \tag{6.12}$$

where w is some function we will use to form our prediction for Y based on X. What w is best, i.e. which w minimizes MSPE?

To answer this question, condition on X in (6.12):

$$MSPE = E\left[ E\left\{ (Y - w(X))^2 \mid X \right\} \right] \tag{6.13}$$

Theorem 8 The best w is m, i.e. the best way to predict Y from X is to "plug in" X in the regression function.

We need this lemma:

Lemma 9 For any random variable Z, the constant c which minimizes

$$E[(Z - c)^2] \tag{6.14}$$

is

$$c = EZ \tag{6.15}$$

Proof
Expand (6.14) to

$$E(Z^2) - 2cEZ + c^2 \tag{6.16}$$

and use calculus to find the best c: setting the derivative with respect to c, $-2EZ + 2c$, to 0 gives c = EZ.

Apply the lemma to the inner expectation in (6.13), with Z being Y and c being some function of X. The minimizing value is EZ, i.e. $E(Y \mid X)$ since our expectation here is conditional on X.

All of this tells us that the best function w in (6.12) is $m_{Y;X}$. This proves the theorem.
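
The theorem can also be checked numerically. The following is a small simulation sketch rather than part of the development above: the model Y = X^2 + noise with X uniform on (0,1) is an assumption made purely for illustration, chosen so that $m_{Y;X}(t) = t^2$ is known and can be compared against two other candidate functions w.

```r
# Illustrative assumption: X ~ U(0,1), Y = X^2 + normal noise, so m_{Y;X}(t) = t^2.
set.seed(2)
n <- 200000
x <- runif(n)
y <- x^2 + rnorm(n, sd = 0.5)

# sample analog of MSPE = E[(Y - w(X))^2] for a candidate predictor function w
mspe <- function(w) mean((y - w(x))^2)

mspe_m      <- mspe(function(t) t^2)                  # the regression function itself
mspe_linear <- mspe(function(t) t - 1/6)              # least-squares straight line for t^2 on (0,1)
mspe_const  <- mspe(function(t) rep(1/3, length(t)))  # the constant EY = 1/3
print(c(mspe_m, mspe_linear, mspe_const))
```

Under these assumptions the regression function gives the smallest of the three values, close to the noise variance 0.25, which no predictor based on X alone can beat.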

6.3.7 Parametric Estimation of Linear Regression Functions

6.3.7.1 Meaning of "Linear"

Here we model $m_{Y;X}$ as a linear function of $X^{(1)},...,X^{(r)}$:

$$m_{Y;X}(t) = \beta_0 + \beta_1 t_1 + ... + \beta_r t_r \tag{6.17}$$

Note that the term linear regression does NOT necessarily mean that the graph of the regression function is a straight line or a plane. Instead, the word linear refers to the regression function being linear in the parameters. So, for instance, (6.8) is a linear model; if for example we multiply $\beta_0$, $\beta_1$ and $\beta_2$ by 8, then m is multiplied by 8.

6.3.7.2 Point Estimates and Matrix Formulation

So, how do we estimate the $\beta_i$? Look for instance at (6.8). Keep in mind that in (6.8), the $\beta_i$ are population values. We need to estimate them from our data. How do we do that?

Let's define $(b_i, A_i)$ to be the i-th pair from the simulation. In the program, this is md[i,]. Our estimated parameters will be denoted by $\hat{\beta}_i$. Using the result of Section 6.3.6 as a guide, the estimation methodology involves finding the values of $\hat{\beta}_i$ which minimize the sum of squared differences between the actual A values and their predicted values:

$$\sum_{i=1}^{100} \left[ A_i - (\hat{\beta}_0 + \hat{\beta}_1 b_i + \hat{\beta}_2 b_i^2) \right]^2 \tag{6.18}$$

Obviously, this is a calculus problem. We set the partial derivatives of (6.18) with respect to the $\hat{\beta}_i$ to 0, giving us three linear equations in three unknowns, and then solve.

For the general case (6.17), we have r+1 equations in r+1 unknowns. This is most conveniently expressed in matrix terms. Let $X_i^{(j)}$ be the value of $X^{(j)}$ for the i-th observation in our sample, and let $Y_i$ be the corresponding Y value. Plugging this data into (6.17), we have

$$E(Y_i \mid X_i^{(1)},...,X_i^{(r)}) = \beta_0 + \beta_1 X_i^{(1)} + ... + \beta_r X_i^{(r)}, \quad i = 1,...,n \tag{6.19}$$

That's a system of n linear equations, which from your linear algebra class you know can be represented more compactly by a matrix. That would be

$$E(V \mid Q) = Q \beta \tag{6.20}$$

where (with ' denoting matrix transpose and a vector without a ' being a row vector)

$$V = (Y_1,...,Y_n)' \tag{6.21}$$

$$\beta = (\beta_0, \beta_1,...,\beta_r)' \tag{6.22}$$

and Q is the n x (r+1) matrix whose (i,j) element is $X_i^{(j)}$, j = 0,1,...,r, with $X_i^{(0)}$ taken to be 1. For instance, if we are predicting weight from height and age, then row 5 of Q would consist of a 1, then the height and age of the fifth person in our sample.
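
As a quick illustration of the layout of Q and V, here is a sketch in R using a tiny made-up sample. The six heights, ages and weights are hypothetical values invented for this example.

```r
# Made-up sample of 6 people (heights in inches, ages in years, weights in pounds).
height <- c(70, 64, 68, 72, 66, 69)
age    <- c(25, 41, 33, 52, 29, 38)
weight <- c(165, 130, 155, 190, 140, 170)

Q <- cbind(1, height, age)   # n x (r+1) matrix; the column of 1s goes with beta_0
V <- weight                  # the response vector (Y_1, ..., Y_n)'

print(dim(Q))                # n = 6 rows, r + 1 = 3 columns
print(Q[5, ])                # row 5: a 1, then the height and age of the fifth person
```

Each row of Q corresponds to one equation in (6.19), and the leading 1 in every row is what lets $\beta_0$ enter the product $Q \beta$.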

Now to estimate the $\beta_i$, let

$$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1,...,\hat{\beta}_r)' \tag{6.23}$$

Then it can be shown that, after all the partial derivatives are taken and set to 0, the solution is

$$\hat{\beta} = (Q'Q)^{-1} Q'V \tag{6.24}$$

6.3.7.3 Back to Our ALOHA Example

R or any other statistical package does the work for us. In R, we can use the lm() ("linear model") function:

    > md <- cbind(md,md[,1]^2)
    > lmout <- lm(md[,2] ~ md[,1] + md[,3])

First I added a new column to the data matrix, consisting of $b^2$. I then called lm(), with the argument

    md[,2] ~ md[,1] + md[,3]

R documentation calls this model specification argument the formula. It states that I wish to use the first and third columns of md, i.e. b and $b^2$, as predictors, and use A, i.e. the second column, as the response variable. [4]

[4] Inside a formula R gives the ^ operator a special meaning, so I used cbind() to make a new matrix with the squared column; wrapping the term in I(), as in I(md[,1]^2), is another way to do this.

The return value from this call, which I've stored in lmout, is an object of class lm. One of the member variables of that class, coefficients, is the vector $\hat{\beta}$:

    > lmout$coefficients
    (Intercept)     md[, 1]     md[, 3]
       27.56852   -90.72585    79.98616

So, $\hat{\beta}_0 = 27.57$ and so on.

The result is

$$\hat{m}_{A;b}(t) = 27.57 - 90.73 t + 79.99 t^2 \tag{6.25}$$

Another member variable in the lm class is fitted.values. This is the "fitted curve," meaning the values of (6.25) at $b_1,...,b_{100}$. I plotted this curve on the same graph,

    > lines(cbind(md[,1],lmout$fitted.values))

See Figure 6.3. As you can see, the fit looks fairly good. What should we look for?

Remember, we don't expect the curve to go through the points; we are estimating the mean of A for each b, not the A values themselves. There is always variation around the mean. If for instance we are looking at the relationship between people's heights and weights, the mean weight for people of height 70 inches might be, say, 160 pounds, but we know that some 70-inch-tall people weigh more than this and some weigh less.

Figure 6.3: Quadratic Fit Superimposed

However, there seems to be a tendency for our estimates $\hat{m}_{A;b}(t)$ to be too low for values in the middle range of t, and possibly too high for t around 0.3 or 0.4.
However, with a sample size of only 100, it's difficult to tell. It's always important to keep in mind that the data are random; a different sample may show somewhat different patterns. Nevertheless, we should consider a more complex model.

So I tried a quartic, i.e. fourth-degree, polynomial model. I added third- and fourth-power columns to md, calling the result md4, and invoked the call

    lm(md4[,2] ~ md4[,1] + md4[,3] + md4[,4] + md4[,5])

The result was

    > lmout$coefficients
    (Intercept)     md4[, 1]     md4[, 3]     md4[, 4]     md4[, 5]
       95.98882   -664.02780   1731.90848  -1973.00660    835.89714

In other words, we have an estimated regression function of

$$\hat{m}_{A;b}(t) = 95.98882 - 664.02780 t + 1731.90848 t^2 - 1973.00660 t^3 + 835.89714 t^4 \tag{6.26}$$

The fit is shown in Figure 6.4. It looks much better. On the other hand, we have to worry about overfitting. We return to this issue in Section 6.3.9.1.

Figure 6.4: Fourth Degree Fit Superimposed

6.3.7.4 Approximate Confidence Intervals

As usual, we should not be satisfied with just point estimates, in this case the $\hat{\beta}_i$. We need an indication of how accurate they are, so we need confidence intervals. In other words, we need to use the $\hat{\beta}_i$ to form confidence intervals for the $\beta_i$.

For instance, recall the study on object-oriented programming in Section 6.2. The goal there was primarily Understanding, specifically assessing the impact of OOP. That impact is measured by $\beta_2$. Thus, we want to find a confidence interval for $\beta_2$.

Equation (6.24) shows that the $\hat{\beta}_i$ are sums of the components of V, i.e. the $Y_j$. So, the Central Limit Theorem implies that the $\hat{\beta}_i$ are approximately normally distributed. That in turn means that, in order to form confidence intervals, we need standard errors for the $\hat{\beta}_i$. How will we get them?

Note carefully that so far we have made NO assumptions other than (6.17). Now, though, we need to add an assumption: [5]

$$Var(Y \mid X = t) = \sigma^2 \tag{6.27}$$

for all t. Note that this and the independence of the sample observations (e.g. the various people sampled in the Davis height/weight example are independent of each other) imply that

$$Cov(V \mid Q) = \sigma^2 I \tag{6.28}$$

where I is the usual identity matrix (1s on the diagonal, 0s off diagonal).

[5] Actually, we could derive some usable, though messy, standard errors without this assumption.

Be sure you understand what this means. In the Davis weights example, for instance, it means that the variance of weight among 72-inch-tall people is the same as that for 65-inch-tall people. That is not quite true (the taller group has larger variance) but it's probably accurate enough for our purposes here.

Keep in mind that the derivation below is conditional on the $X_i^{(j)}$, which is the standard approach, especially since there is the case of nonrandom X. Thus we will later get conditional confidence intervals, which is fine. To avoid clutter, I will sometimes not show the conditioning explicitly, and thus for instance will write Cov(V) instead of Cov(V|Q).

We can derive the covariance matrix of $\hat{\beta}$, as follows. First, we can easily derive that for any m x 1 random vector M and constant (i.e. nonrandom) matrix c with m columns,

$$Cov(cM) = c \, Cov(M) \, c' \tag{6.29}$$

Also, one can show that the transpose of the product of two matrices is the reverse product of the transposes.

In (6.29), set $c = (Q'Q)^{-1} Q'$ and M = V. Then from (6.24),

$$Cov(\hat{\beta}) = [(Q'Q)^{-1} Q'] \, Cov(V) \, Q [(Q'Q)^{-1}]' \tag{6.30}$$

$$= [(Q'Q)^{-1} Q'] \, \sigma^2 I \, Q [(Q'Q)^{-1}]' \tag{6.31}$$

$$= \sigma^2 (Q'Q)^{-1} \tag{6.32}$$

Here we have used the fact that Q'Q is a symmetric matrix, which implies the same property for its inverse.

Whew! That's a lot of work for you, if your linear algebra is rusty. But it's worth it, because (6.32) now gives us what we need for confidence intervals. Here's how:

First, we need to estimate $\sigma^2$. Recalling that for any random variable U, $Var(U) = E[(U - EU)^2]$, we have

$$\sigma^2 = Var(Y \mid X = t) \tag{6.33}$$

$$= Var(Y \mid X^{(1)} = t_1,...,X^{(r)} = t_r) \tag{6.34}$$

$$= E\left[ \{ Y - m_{Y;X}(t) \}^2 \right] \tag{6.35}$$

$$= E\left[ (Y - \beta_0 - \beta_1 t_1 - ... - \beta_r t_r)^2 \right] \tag{6.36}$$

Thus, a natural estimate for $\sigma^2$ would be the sample analog, where we replace E() by averaging over our sample, and replace population quantities by sample estimates:

$$s^2 = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i^{(1)} - ... - \hat{\beta}_r X_i^{(r)} \right)^2 \tag{6.37}$$

So, the estimated covariance matrix for $\hat{\beta}$ is

$$\widehat{Cov}(\hat{\beta}) = s^2 (Q'Q)^{-1} \tag{6.38}$$

6.3.7.5 Once Again, Our ALOHA Example

In R we can obtain (6.38) via the generic function vcov():

    > vcov(lmout)
                (Intercept)     md4[, 1]    md4[, 3]    md4[, 4]    md4[, 5]
    (Intercept)    92.73734    -794.4755    2358.860   -2915.238    1279.981
    md4[, 1]     -794.47553    6896.8443  -20705.705   25822.832  -11422.355
    md4[, 3]     2358.86046  -20705.7047   62804.912  -79026.086   35220.412
    md4[, 4]    -2915.23828   25822.8320  -79026.086  100239.652  -44990.271
    md4[, 5]     1279.98125  -11422.3550   35220.412  -44990.271   20320.809

What is this telling us? For instance, it is saying that the (4,4) position in the matrix (6.38), counting rows and columns from 0 to match the $\hat{\beta}_i$, is equal to 20320.809, so the standard error of $\hat{\beta}_4$ is the square root of this, 142.6. Thus an approximate 95% confidence interval for the true population $\beta_4$ is

$$835.89714 \pm 1.96 \cdot 142.6 = (556.4, 1115.4) \tag{6.39}$$

That interval is quite wide. Remember what this tells us: that our sample of size 100 is not very large. On the other hand, the interval is quite far from 0, which indicates that our fourth-degree model is legitimately better than our quadratic one.

By the way, applying the R function summary() to a linear model object such as lmout here gives standard errors for the $\hat{\beta}_i$ and lots of other information.
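
The computations in (6.24), (6.37) and (6.38) can also be sketched directly with matrix operations. The data below are simulated from a hypothetical quadratic; the coefficients and noise level are assumptions made for illustration, not the ALOHA output. One point to be aware of when comparing with R's own output: lm() and vcov() divide the sum of squared residuals by n-(r+1) rather than by n, so the two covariance matrices differ by the constant factor n/(n-(r+1)).

```r
# Simulated data from a made-up quadratic; not the ALOHA simulation output.
set.seed(3)
n <- 100
b <- seq(0.207, 0.9, length.out = n)               # nonrandom predictor values
A <- 27 - 90 * b + 80 * b^2 + rnorm(n, sd = 1.5)   # hypothetical responses

Q <- cbind(1, b, b^2)
betahat <- solve(t(Q) %*% Q, t(Q) %*% A)   # (Q'Q)^{-1} Q'V, as in (6.24)
res <- A - Q %*% betahat                   # residuals Y_i minus fitted values
s2 <- sum(res^2) / n                       # divisor n, as in (6.37)
covhat <- s2 * solve(t(Q) %*% Q)           # estimated covariance matrix, as in (6.38)

se <- sqrt(diag(covhat))                   # standard errors of the betahat_i
ci <- cbind(betahat - 1.96 * se, betahat + 1.96 * se)  # approximate 95% intervals
print(ci)

# compare with R's built-in versions
lmout <- lm(A ~ b + I(b^2))
ratio <- vcov(lmout) / covhat              # every entry should equal n / (n - 3)
print(range(ratio))
```

With n = 100 the factor n/(n-3) is about 1.03, so the hand-computed standard errors are only about 1.5% smaller than those reported by R.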

6.3.7.6 Estimation Vs. Prediction

In statistical parlance, there is a keen distinction made between the words estimation and prediction. To explain this, let's again consider the example of predicting Y = weight from X = (height, age). Say we have someone of height 67 inches and age 27, and want to guess (i.e. predict) her weight.

From Section 6.3.6, we know that the best prediction is $m_{W;H,A}(67,27)$. However, we do not know the value of that quantity, so we must estimate it from our data. So, our predicted value for this person's weight will be $\hat{m}_{W;H,A}(67,27)$, i.e. our estimate for the value of the regression function at the point (67,27).

6.3.7.7 Exact Confidence Intervals

Note carefully that we have not assumed that Y, given X, is normally distributed. In the height/weight context, for example, such an assumption would mean that weights in a specific height subpopulation, say all people of height 70 inches, have a normal distribution.

If we do make such an assumption, then we can get exact confidence intervals (which of course, only hold if we really do have an exact normal distribution in the population). This again uses Student-t distributions. In that analysis, $s^2$ has n-(r+1) in its denominator instead of our n, just as there was n-1 in the denominator for $s^2$ when we estimated a single population variance. The number of degrees of freedom in the Student-t distribution is likewise n-(r+1). But as before, for even moderately large n, it doesn't matter.
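
For a sense of how little the exact version changes things, this short sketch compares the normal multiplier with the Student-t multiplier in a setting like the quartic ALOHA fit, taken here to be n = 100 observations and r = 4 predictors.

```r
n <- 100                   # sample size, as in the ALOHA runs
r <- 4                     # number of predictors in the quartic model
df <- n - (r + 1)          # degrees of freedom for the Student-t version

z_mult <- qnorm(0.975)     # multiplier for the approximate (normal) 95% interval
t_mult <- qt(0.975, df)    # multiplier for the exact (Student-t) 95% interval
print(c(z_mult, t_mult))
```

The two multipliers differ only in the second decimal place, so the exact interval is just slightly wider than the approximate one.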

6.3.8 The Famous "Error Term" (advanced topic)

Books on regression analysis (and there are hundreds, if not thousands, of these) generally introduce the subject as follows. They consider the linear case with r = 1, and write

$$Y = \beta_0 + \beta_1 X + \epsilon, \quad E\epsilon = 0 \tag{6.40}$$

with $\epsilon$ being independent of X. They also assume that $\epsilon$ has a normal distribution with variance $\sigma^2$.

Let's see how this compares to what we have been assuming here so far. In the linear case with r = 1, we would write

$$m_{Y;X}(t) = E(Y \mid X = t) = \beta_0 + \beta_1 t \tag{6.41}$$

Note that in our context, we would define $\epsilon$ as

$$\epsilon = Y - m_{Y;X}(X) \tag{6.42}$$

Equation (6.40) is consistent with (6.41): The former has $E\epsilon = 0$, and so does the latter, since

$$E\epsilon = EY - E[m_{Y;X}(X)] = EY - E[E(Y \mid X)] = EY - EY = 0 \tag{6.43}$$

In order to produce confidence intervals, we later added the assumption (6.27), which you can see is consistent with (6.40) since the latter assumes that $Var(\epsilon) = \sigma^2$ no matter what value X has.

Now, what about the normality assumption in (6.40)? That would be equivalent to saying that in our context, the conditional distribution of Y given X is normal, which is an assumption we did not make. Note that in the weight/height example, this assumption would say that, for instance, the distribution of weights among people of height 68.2 inches is normal.

No matter what the context is, the variable $\epsilon$ is called the error term. Originally this was an allusion to measurement error, e.g. in chemistry experiments, but the modern interpretation would be prediction error, i.e. how much error we make when we use $m_{Y;X}(t)$ to predict Y.

6.3.9 Model Selection

The issues raised in Chapter 5 become crucial in regression and classification problems. In this unit, we will typically deal with models having large numbers of parameters. A central principle will be that simpler models are preferable, provided of course they are accurate. Hence the Einstein quote above. Simpler models are often called parsimonious.

Here I use the term model selection to mean which predictor variables we will use. If we have data on many predictors, we almost certainly will not be able to use them all, for the following reason:

6.3.9.1 The Overfitting Problem in Regression

Recall (6.8). There we assumed a second-degree polynomial for $m_{A;b}$. Why not a third-degree, or fourth, and so on?

You can see that if we carry this notion to its extreme, we get absurd results. If we fit a polynomial of degree 99 to our 100 points, we can make our fitted curve exactly pass through every point! This clearly would give us a meaningless, useless curve. We are simply fitting the noise.

Recall that we analyzed this problem in Section 5.2.3 in our unit on modeling. There we noted an absolutely fundamental principle in statistics:

In choosing between a simpler model and a more complex one, the latter is more accurate only if either
• we have enough data to support it, or
• the complex model is sufficiently different from the simpler one

This is extremely important in regression analysis.
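
The "fitting the noise" effect is easy to see in a small experiment. The sketch below assumes simulated data whose true regression function is exactly quadratic (the coefficients and noise level are made up), and fits polynomials of degree 1 through 10 using lm() with poly().

```r
# Made-up data: the true regression function is quadratic, plus noise.
set.seed(4)
n <- 100
b <- seq(0.2, 0.9, length.out = n)
A <- 27 - 90 * b + 80 * b^2 + rnorm(n, sd = 1.5)

# sum of squared residuals on the SAME data used for fitting, by degree
train_sse <- sapply(1:10, function(d) {
  fit <- lm(A ~ poly(b, d))    # orthogonal polynomial of degree d
  sum(residuals(fit)^2)
})
print(round(train_sse, 2))
```

The sum of squared residuals never goes up as the degree grows, even though every degree above 2 is modeling only noise here. This is why a close fit to the sample at hand is not, by itself, evidence for the more complex model.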

For example, look at our regression model for A against b in the ALOHA simulation in earlier sections. We did analyses for a simpler model, a quadratic polynomial, and a more complex model, a quartic (polynomial of degree 4). Rephrasing the above points in this context, we would say,

In choosing between the quadratic and quartic models, the latter is more accurate only if either
• we have enough data to support it, or
• at least one of the coefficients $\beta_3$ and $\beta_4$ is quite different from 0

In the weight/height/age example in Section 6.3.2, this would be phrased as

In deciding whether to predict from height only, versus from both height and age, the latter is more accurate only if either
• we have enough data to support it, or
• the coefficient $\beta_2$ is quite different from 0

If we use too many predictor variables [6], our data is "diluted," by being "shared" by so many $\beta_i$. As a result, $Var(\hat{\beta}_i)$ will be large, with big implications: Whether our goal is Prediction or Understanding, our estimates will be so poor that neither goal is achieved.

The questions raised in turn by the above considerations, i.e. How much data is enough data?, and How different from 0 is "quite different"?, are addressed below in Section 6.3.9.2.

A detailed mathematical example of overfitting in regression is presented in my paper A Careful Look at the Use of Statistical Methodology in Data Mining (book chapter), by N. Matloff, in Foundations of Data Mining and Granular Computing, edited by T.Y. Lin, Wesley Chu and L. Matzlack, Springer-Verlag Lecture Notes in Computer Science, 2005.

6.3.9.2 Methods for Predictor Variable Selection

So, we typically must discard some, maybe many, of our predictor variables. In the weight/height/age example, we may need to discard the age variable. In the ALOHA example, we might need to discard $b^4$ and even $b^3$. How do we make these decisions?

Note carefully that this is an unsolved problem. If anyone ever claims they have a foolproof way to do this, they do not understand the problem in the first place. Entire books have been written on this subject (e.g. Subset Selection in Regression, by Alan Miller, pub. by Chapman and Hall, 2002), discussing myriad different methods, but again, none of them is foolproof.
Most of the methods for variable selection use hypothesis testing in one form or another. Typically this takes the form

$$H_0: \beta_i = 0 \tag{6.44}$$

In the context of (6.6), this would mean testing

$$H_0: \beta_2 = 0 \tag{6.45}$$

If we reject $H_0$, then we use the age variable; otherwise we discard it.

[6] In the ALOHA example above, $b$, $b^2$, $b^3$ and $b^4$ are separate predictors, even though they are of course correlated.

I hope I've convinced you that this is not a good idea. As usual, the hypothesis test is asking the wrong question. For instance, in the weight/height/age example, the test is asking whether $\beta_2$ is zero or not, whereas what we want to know is whether $\beta_2$ is far enough from 0 for age to give us better predictions of weight. Those are two very, very different questions.

A very interesting example of overfitting using real data may be found in the paper, Honest Confidence Intervals for the Error Variance in Stepwise Regression, by Foster and Stine, www-stat.wharton.upenn.edu/~stine/research/honests2.pdf. The authors, of the University of Pennsylvania Wharton School, took real financial data and deliberately added a number of extra "predictors" that were in fact random noise, independent of the real data. They then tested the hypothesis (6.44). They found that each of the fake predictors was "significantly" related to Y! This illustrates both the dangers of hypothesis testing and the possible need for multiple inference procedures.[7] This problem has always been known by thinking statisticians, but the Wharton study certainly dramatized it.

Well, then, what can be done instead? First, there is the same alternative to hypothesis testing that we discussed before: confidence intervals. We saw an example of that in (6.39). Granted, the interval was very wide, telling us that it would be nice to have more data. But even the lower bound of that interval is far from zero, so it looks pretty safe to use $b^4$ as a predictor.

Moreover, a confidence interval for $\beta_i$ tells us whether the variable $X^{(i)}$ would have much value as a predictor. Once again, consider the weight/height/age example. Suppose our confidence interval for $\beta_2$ is (0.04, 0.06). That would say that, for instance, a 10-year difference in age only makes about half a pound difference in mean weight, in which case age would be of almost no value in predicting weight.
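The confidence-interval approach just described can be tried out in R. The following is only a sketch on simulated data: the variable names, the sample size and the "true" coefficients are all assumptions made for illustration, not the actual weight/height/age data.

```r
# Simulated stand-in for the weight/height/age data (all numbers hypothetical).
set.seed(1)
n <- 200
height <- rnorm(n, mean = 68, sd = 3)     # inches
age    <- runif(n, min = 20, max = 60)    # years
# Assumed population model: age has only a small effect on mean weight.
weight <- -200 + 5.5 * height + 0.05 * age + rnorm(n, sd = 15)

fit <- lm(weight ~ height + age)

# Approximate 95% confidence intervals for beta_0, beta_1 (height), beta_2 (age).
ci <- confint(fit, level = 0.95)
print(ci)

# Implied range for the change in mean weight over a 10-year age difference.
age_effect_10yr <- 10 * ci["age", ]
print(age_effect_10yr)
```

Reading the last interval in pounds, rather than asking only whether it contains 0, is the kind of judgment the text recommends.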
A method that enjoys some popularity in certain circles is the Akaike Information Criterion (AIC). It uses a formula, backed by some theoretical analysis, which creates a tradeoff between richness of the model and size of the standard errors of the $\hat{\beta}_i$. The R statistical package includes a function AIC for this, which is used by step in the regression case.

The most popular alternative to hypothesis testing for variable selection today is probably cross validation. Here we split our data into a training set, which we use to estimate the $\beta_i$, and a validation set, in which we see how well our fitted model predicts new data, say in terms of average squared prediction error. We do this for several models, i.e. several sets of predictors, and choose the one which does best in the validation set. I like this method very much, though I often simply stick with confidence intervals.

A rough rule of thumb is that one should have $r < \sqrt{n}$.[8]

[7] They added so many predictors that r became greater than n. However, the problems they found would have been there to a large degree even if r were less than n but r/n was substantial.

[8] Asymptotic Behavior of Likelihood Methods for Exponential Families When the Number of Parameters Tends to Infinity, Stephen Portnoy, Annals of Statistics, 1988.

6.3.10 Nonlinear Parametric Regression Models

We pointed out in Section 6.3.7.1 that the word linear in linear regression model means linear in $\beta$, not in t. This is the most popular approach, as it is computationally easy, but nonlinear models are often used.

The most famous of these is the logistic model, for the case in which Y takes on only the values 0 and 1. As we have seen before, in this case the expected value becomes a probability. The logistic model for a nonvector X is then

$$m_{Y;X}(t) = P(Y = 1 | X = t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 t)}} \tag{6.46}$$

It extends to the case of vector-valued X in the obvious way.

The logistic model is quite widely used in computer science, in medicine, economics, psychology and so on.

Here is an example of a nonlinear model used in kinetics of chemical reactions, with r = 3:[9]

$$m_{Y;X}(t) = \frac{\beta_1 t^{(2)} - t^{(3)}/\beta_5}{1 + \beta_2 t^{(1)} + \beta_3 t^{(2)} + \beta_4 t^{(3)}} \tag{6.47}$$

Here the X vector is (hydrogen, n-pentane, isopentane)'.
Unfortunately, in most cases, the least-squares estimates of the parameters in nonlinear regression do not have closed-form solutions, and numerical methods must be used. But R does that for you, via the nls function in general, and via glm for the logistic and related models in particular.

[9] See http://www.mathworks.com/index.html?s_cid=docframe_homepage.

6.3.11 Nonparametric Estimation of Regression Functions

In some applications, there may be no obvious parametric model for $m_{Y;X}$. Or, we may have a parametric model that we are considering, but we would like to have some kind of nonparametric estimation method available as a means of checking the validity of our parametric model. So, how do we estimate a regression function nonparametrically?

To guide our intuition on this, let's turn again to the Davis example of the relationship between height and weight. Consider estimation of the quantity $m_{W;H}(68.2)$, the population mean weight of all people of height 68.2. We could take our estimate $\hat{m}_{W;H}(68.2)$ to be the average weight of all the people in our sample who have that height. But we may have very few people of that height, so that our estimate may have a high variance, i.e. may not be very accurate.

What we could do instead is to take the mean weight of all the people in our sample whose heights are near 68.2, say between 67.7 and 68.7. That would bias things a bit, but we'd get a lower variance. All nonparametric regression methods work like this, though with many variations.

As our definition of "near," we could take all people in our sample whose heights are within h amount of 68.2. This should remind you of our density estimators in Section 4.6 of our unit on estimation and testing. As we saw there, a generalization would be to use a kernel method. For instance, for univariate X and t:

$$\hat{m}_{Y;X}(t) = \frac{\sum_{i=1}^{n} Y_i \, k\left(\frac{t - X_i}{h}\right)}{\sum_{i=1}^{n} k\left(\frac{t - X_i}{h}\right)} \tag{6.48}$$

There is an R package that includes a function nkreg to do this. The R base has a similar method, called LOESS, available through the functions lowess and loess.

Other types of nonparametric methods include Classification and Regression Trees (CART), nearest neighbor methods, support vector machines, splines, etc.
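The kernel estimator (6.48) is short enough to write directly. The sketch below is an illustration only: the function name kernreg, the choice of a Gaussian kernel (dnorm) for k, and the simulated heights and weights are all assumptions, not the Davis data or the interface of any particular package.

```r
# Kernel estimate of m(t) as in (6.48), with a Gaussian kernel k (dnorm).
kernreg <- function(t, x, y, h) {
  sapply(t, function(t0) {
    w <- dnorm((t0 - x) / h)   # weight given to each observation
    sum(w * y) / sum(w)        # weighted average of the Y_i
  })
}

# Simulated heights and weights (hypothetical numbers, not the Davis data).
set.seed(2)
ht <- runif(500, min = 60, max = 76)
wt <- -200 + 5.5 * ht + rnorm(500, sd = 15)

# Estimated mean weight at height 68.2, for a narrow and a wide bandwidth h.
est_narrow <- kernreg(68.2, ht, wt, h = 0.5)
est_wide   <- kernreg(68.2, ht, wt, h = 3)
print(c(narrow = est_narrow, wide = est_wide))
```

A small h uses only people very close to 68.2 (less bias, more variance), while a large h averages over a wider range of heights, which is the tradeoff described above.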
6.3.12 Regression Diagnostics

Researchers in regression analysis have devised some diagnostic methods, meaning methods to check the fit of a model, the validity of assumptions [e.g. (6.27)], search for data points that may have an undue influence (and may actually be in error), and so on.

The R package has tons of diagnostic methods. See for example Chapter 4 of Linear Models with R, Julian Faraway, Chapman and Hall, 2005.

6.3.13 Nominal Variables

Recall our example in Section 6.2 concerning a study of software engineer productivity. To review, the authors of the study predicted Y = number of person-months needed to complete the project, from $X^{(1)}$ = size of the project as measured in lines of code, $X^{(2)}$ = 1 or 0 depending on whether an object-oriented or procedural approach was used, and other variables.

As mentioned at the time, $X^{(2)}$ is called an indicator variable. Let's generalize that a bit. Suppose we are comparing two different object-oriented languages, C++ and Java, as well as the procedural language C. Then we could change the definition of $X^{(2)}$ to have the value 1 for C++ and 0 for non-C++, and we could add another variable, $X^{(3)}$, which has the value 1 for Java and 0 for non-Java. Use of the C language would be implied by the situation $X^{(2)} = X^{(3)} = 0$.

Here we are dealing with a nominal variable, Language, which has three values, C++, Java and C, and representing it by the two indicator variables $X^{(2)}$ and $X^{(3)}$. Note that we do NOT want to represent Language by a single value having the values 0, 1 and 2, which would imply that C has, for instance, double the impact of Java.
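R builds such indicator variables automatically when a predictor is stored as a factor. The following minimal sketch uses a made-up vector of six language values; the variable names are hypothetical, and C is declared as the baseline level so that the two generated columns correspond to the C++ and Java indicators described above.

```r
# Six hypothetical projects, with C as the baseline (first) level.
lang <- factor(c("C++", "Java", "C", "Java", "C++", "C"),
               levels = c("C", "C++", "Java"))

# Design matrix: an intercept column plus one indicator per non-baseline level.
X <- model.matrix(~ lang)
print(X)

# Column 2 is the C++ indicator, column 3 the Java indicator;
# rows where both are 0 are the C projects.
cpp_ind  <- X[, 2]
java_ind <- X[, 3]
```

Functions such as lm and glm do this expansion internally, so a three-level factor contributes two coefficients to the model.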
You can see that if a nominal variable takes on q values, we need q-1 indicator variables to represent it. We say that the variable has q levels.

6.3.14 The Case in Which All Predictors Are Nominal Variables: "Analysis of Variance"

Continuing the ideas in Section 6.3.13, suppose in the software engineering study they had kept the project size constant, and instead of $X^{(1)}$ being project size, this variable recorded whether the programmer uses an integrated development environment (IDE). Say $X^{(1)}$ is 1 or 0, depending on whether the programmer uses the Eclipse IDE or no IDE, respectively. Continue to assume the study included the nominal Language variable, i.e. assume the study included the indicator variables $X^{(2)}$ (C++) and $X^{(3)}$ (Java). Now all of our predictors would be nominal/indicator variables. Regression analysis in such settings is called analysis of variance (ANOVA).

Each nominal variable is called a factor. So, in our software engineering example, the factors are IDE and Language. Note again that in terms of the actual predictor variables, each factor is represented by one or more indicator variables; here IDE has one indicator variable and Language has two.

Analysis of variance is a classic statistical procedure, used heavily in agriculture, for example. We will not go into details here, but mention it briefly both for the sake of completeness and for its relevance to Sections 6.3.3 and 6.6. (The reader is strongly advised to review Section 6.3.3 before continuing.)

6.3.14.1 It's a Regression!

The term analysis of variance is a misnomer. A more appropriate name would be analysis of means, as it is in fact a regression analysis, as follows.

First, note in our software engineering example we basically are talking about six groups, because there are six different combinations of values for the triple $(X^{(1)}, X^{(2)}, X^{(3)})$. For instance, the triple (1,0,1) means that the programmer is using an IDE and programming in Java. Note that triples of the form (w,1,1) are impossible.
So, all that is happening here is that we have six groups with six means. But that is a regression! Remember, for variables U and V, $m_{V;U}(s)$ is the mean of all values of V in the subpopulation group of people (or cars or whatever) defined by U = s. If U is a continuous variable, then we have infinitely many such groups, thus infinitely many means. In our software engineering example, we only have six groups, but the principle is the same. We can thus cast the problem in regression terms:

$$m_{Y;X}(i,j,k) = E(Y | X^{(1)} = i, X^{(2)} = j, X^{(3)} = k), \quad i,j,k = 0,1, \quad j + k \le 1 \tag{6.49}$$

Note the restriction $j + k \le 1$, which reflects the fact that j and k can't both be 1.

Again, keep in mind that we are working with means. For instance, $m_{Y;X}(0,1,0)$ is the population mean project completion time for the programmers who do not use Eclipse and who program in C++.

Since the triple (i,j,k) can take on only six values, m can be modeled fully generally in the following six-parameter linear form:

$$m_{Y;X}(i,j,k) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k + \beta_4 ij + \beta_5 ik \tag{6.50}$$

where $\beta_4$ and $\beta_5$ are the coefficients of two interaction terms, as in Section 6.3.3.

6.3.14.2 Interaction Terms

It is crucial to understand the interaction terms. Without the ij and ik terms, for instance, our model would be

$$m_{Y;X}(i,j,k) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k \tag{6.51}$$

which would mean (as in Section 6.3.3) that the difference between using Eclipse and no IDE is the same for all three programming languages, C++, Java and C. That common difference would be $\beta_1$. If this condition (the impact of using an IDE is the same across languages) doesn't hold, at least approximately, then we would use the full model, (6.50). More on this below.

Note carefully that there is no interaction term corresponding to jk, since that quantity is 0, and thus there is no three-way interaction term corresponding to ijk either.

But suppose we add a third factor, Education, represented by the indicator $X^{(4)}$, having the value 1 if the programmer has at least a Master's degree, 0 otherwise. Then m would take on 12 values, and the full model would have 12 parameters:

$$m_{Y;X}(i,j,k,l) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k + \beta_4 l + \beta_5 ij + \beta_6 ik + \beta_7 il + \beta_8 jl + \beta_9 kl + \beta_{10} ijl + \beta_{11} ikl \tag{6.52}$$

Again, there would be no ijkl term, as jk = 0.
Here $\beta_1$, $\beta_2$, $\beta_3$ and $\beta_4$ are called the main effects, as opposed to the coefficients of the interaction terms, called of course the interaction effects.

The no-interaction version would be

$$m_{Y;X}(i,j,k,l) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k + \beta_4 l \tag{6.53}$$

6.3.14.3 Now Consider Parsimony

In the three-factor example above, we have 12 groups and 12 means. Why not just treat it that way, instead of applying the powerful tool of regression analysis? The answer lies in our desire for parsimony, as noted in Section 6.3.9.1.

If for example (6.53) were to hold, at least approximately, we would have a far more satisfying model. We could for instance then talk of "the" effect of using an IDE, rather than qualifying such a statement by stating what the effect would be for each different language and education level. Moreover, if our sample size is not very large, we would get more accurate estimates of the various subpopulation means.

Or it could be that, while (6.53) doesn't hold, a model with only two-way interactions,

$$m_{Y;X}(i,j,k,l) = \beta_0 + \beta_1 i + \beta_2 j + \beta_3 k + \beta_4 l + \beta_5 ij + \beta_6 ik + \beta_7 il + \beta_8 jl + \beta_9 kl \tag{6.54}$$

does work well. This would not be as nice as (6.53), but it still would be more parsimonious than (6.52).

Accordingly, the major thrust of ANOVA is to decide how rich a model is needed to do a good job of describing the situation under study. There is an implied hierarchy of models of interest here:
• the full model, including two- and three-way interactions, (6.52)
• the model with two-factor interactions only, (6.54)
• the no-interaction model, (6.53)

Traditionally these are determined via hypothesis testing, which involves certain partitionings of sums of squares similar to (6.18). (This is where the name analysis of variance stems from.) The null distribution of the test statistic often turns out to be an F-distribution. Of course, in this book, we consider hypothesis testing inappropriate, preferring to give some careful thought to the estimated parameters, but it is standard. Further testing can be done on individual $\beta_1$ and so on. Often people use simultaneous inference procedures, discussed briefly in Section 4.2.16 of our unit on estimation and testing, since many tests are performed.
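The parameter counts in this hierarchy (12, 10 and 5) can be checked by letting R build the design matrices. This is only a sketch: the factor names ide, lang and edu and their level codes are hypothetical labels for the IDE, Language and Education factors.

```r
# One row per group: 2 IDE levels x 3 languages x 2 education levels = 12 groups.
groups <- expand.grid(ide  = factor(0:1),
                      lang = factor(c("C", "C++", "Java")),
                      edu  = factor(0:1))

# Design matrices for the three models in the hierarchy.
X_full <- model.matrix(~ ide * lang * edu, data = groups)      # all interactions
X_two  <- model.matrix(~ (ide + lang + edu)^2, data = groups)  # two-way only
X_main <- model.matrix(~ ide + lang + edu, data = groups)      # no interactions

# Number of parameters (columns) in each model.
n_par <- c(full = ncol(X_full), twoway = ncol(X_two), main = ncol(X_main))
print(n_par)

# The full model is saturated: as many parameters as group means.
full_rank <- qr(X_full)$rank
```

The full model has exactly as many parameters as there are groups, so it places no restriction on the 12 means; each step down the hierarchy buys parsimony by imposing restrictions.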
6.3.14.4 Reparameterization

Classical ANOVA uses a somewhat different parameterization than that we've considered here. For instance, consider a single-factor setting (called one-way ANOVA) with three levels. Our predictors are then $X^{(1)}$ and $X^{(2)}$. Taking our approach here, we would write

$$m_{Y;X}(i,j) = \beta_0 + \beta_1 i + \beta_2 j \tag{6.55}$$

The traditional formulation would be

$$\mu_i = \mu + \alpha_i, \quad i = 1,2,3 \tag{6.56}$$

where

$$\mu = \frac{\mu_1 + \mu_2 + \mu_3}{3} \tag{6.57}$$

and

$$\alpha_i = \mu_i - \mu \tag{6.58}$$

Of course, the two formulations are equivalent. The three group means in (6.55) are $\beta_0 + \beta_1$, $\beta_0 + \beta_2$ and $\beta_0$, so it is left to the reader to check that, for instance,

$$\mu = \beta_0 + \frac{\beta_1 + \beta_2}{3} \tag{6.59}$$

There are similar formulations for ANOVA designs with more than one factor.

Note that the classical formulation overparameterizes the problem. In the one-way example above, for instance, there are four parameters ($\mu$, $\alpha_1$, $\alpha_2$, $\alpha_3$) but only three groups. This would make the system indeterminate, but we add the constraint

$$\sum_{i=1}^{3} \alpha_i = 0 \tag{6.60}$$

Equation (6.24) then must make use of generalized matrix inverses.

6.4 The Classification Problem

As mentioned earlier, in the special case in which Y is an indicator variable, with the value 1 if the object is in a class and 0 if not, the regression problem is called the classification problem. It is also sometimes called pattern recognition, in which case the predictors are called features. Also, the term machine learning usually refers to classification problems.

If there are c classes, we need c (or c-1) Y variables, which I will denote by $Y^{(i)}$, i = 1,...,c.

6.4.1 Meaning of the Regression Function

6.4.1.1 The Mean Here Is a Probability

Now, here is a key point: Since the mean of any indicator random variable is the probability that the variable is equal to 1, the regression function in classification problems reduces to

$$m_{Y;X}(t) = P(Y = 1 | X = t) \tag{6.61}$$

(Remember that X and t are vector-valued.)

For concreteness, let's look at the patent example in Section 6.1. Again, Y will be 1 or 0, depending on whether the patent had public funding. We'll take $X^{(1)}$ to be an indicator variable for the presence or absence of "NSF" in the patent, $X^{(2)}$ to be an indicator variable for "NIH," and take $X^{(3)}$ to be the number of claims in the patent. This last predictor might be relevant, e.g. if industrial patents are lengthier.
So, $m_{Y;X}([1,0,5])$ would be the population proportion of all patents that are publicly funded, among those that contain the word "NSF," do not contain "NIH," and make five claims.

6.4.1.2 Optimality of the Regression Function

Again, our context is that we want to guess Y, knowing X. Since Y is 0-1 valued, our guess for Y based on X, g(X), should be 0-1 valued too. What is the best g?

Again, since Y and g are 0-1 valued, our criterion should be what I will call Probability of Correct Classification (PCC):

$$PCC = P[Y = g(X)] \tag{6.62}$$

Now proceed as in (6.13):

$$PCC = E[P\{Y = g(X) | X\}] \tag{6.63}$$

The analog of Lemma 9 is

Lemma 10 Suppose W takes on values in the set A = {0,1}, and consider the problem of maximizing

$$P(W = c), \quad c \in A \tag{6.64}$$

The solution is

$$c = \begin{cases} 1, & \text{if } P(W = 1) > 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{6.65}$$

Proof

Again recalling that c is either 1 or 0, we have

$$P(W = c) = P(W = 1) c + [1 - P(W = 1)](1 - c) \tag{6.66}$$

$$= [2P(W = 1) - 1] c + 1 - P(W = 1) \tag{6.67}$$

The coefficient of c in (6.67) is positive exactly when P(W = 1) > 0.5, and the result follows.

Applying this to (6.63), we see that the best g is given by

$$g(t) = \begin{cases} 1, & \text{if } m_{Y;X}(t) > 0.5 \\ 0, & \text{otherwise} \end{cases} \tag{6.68}$$

So we find that the regression function is again optimal, in this new context.

6.4.2 Parametric Models for the Regression Function in Classification Problems

Remember, we often try a parametric model for our regression function first, as it means we are estimating a finite number of quantities, instead of an infinite number.
6.4.2.1 The Logistic Model: Form

The most common parametric model in the classification problem is the logistic model (often called the logit model), seen in Section 6.3.10. In its r-predictor form, it is

$$m_{Y;X}(t) = P(Y = 1 | X = t) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 t_1 + ... + \beta_r t_r)}} \tag{6.69}$$

For instance, consider the patent example. Under the logistic model, the population proportion of all patents that are publicly funded, among those that contain the word "NSF," do not contain "NIH," and make five claims would have the value

$$\frac{1}{1 + e^{-(\beta_0 + \beta_1 + 5\beta_3)}} \tag{6.70}$$

6.4.2.2 The Logistic Model: Intuitive Motivation

The logistic function itself,

$$\frac{1}{1 + e^{-u}} \tag{6.71}$$

has values between 0 and 1, and is thus a candidate for modeling a probability. Also, it is monotonic in u, making it further attractive, as in many classification problems we believe that $m_{Y;X}(t)$ should be monotonic in the predictor variables.

6.4.2.3 The Logistic Model: Theoretical Foundation

But there are much stronger reasons to use the logit model, as it includes many common parametric models for X. To see this, note that we can write, for vector-valued discrete X and t,

$$P(Y = 1 | X = t) = \frac{P(Y = 1 \text{ and } X = t)}{P(X = t)} \tag{6.72}$$

$$= \frac{P(Y = 1) P(X = t | Y = 1)}{P(X = t)} \tag{6.73}$$

$$= \frac{P(Y = 1) P(X = t | Y = 1)}{P(Y = 1) P(X = t | Y = 1) + P(Y = 0) P(X = t | Y = 0)} \tag{6.74}$$

$$= \frac{1}{1 + \frac{(1 - q) P(X = t | Y = 0)}{q P(X = t | Y = 1)}} \tag{6.75}$$

where q = P(Y = 1) is the proportion of members of the population which have Y = 1. (Keep in mind that this probability is unconditional! In the patent example, for instance, if say q = 0.12, then 12% of all patents in the patent population, without regard to words used, numbers of claims, etc., are publicly funded.)
If X is a continuous random vector, then the analog of (6.75) is

$$P(Y = 1 | X = t) = \frac{1}{1 + \frac{(1 - q) f_{X|Y=0}(t)}{q f_{X|Y=1}(t)}} \tag{6.76}$$

Now suppose X, given Y, has a normal distribution. In other words, within each class, X is normally distributed. Consider the case of just one predictor variable, i.e. r = 1. Suppose that given Y = i, X has the distribution $N(\mu_i, \sigma^2)$, i = 0,1. Then

$$f_{X|Y=i}(t) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -0.5 \left( \frac{t - \mu_i}{\sigma} \right)^2 \right] \tag{6.77}$$

After doing some elementary but rather tedious algebra, (6.76) reduces to the logistic form

$$\frac{1}{1 + e^{-(\beta_0 + \beta_1 t)}} \tag{6.78}$$

where $\beta_0$ and $\beta_1$ are functions of $\mu_0$, $\mu_1$, $\sigma$ and q; specifically, $\beta_1 = (\mu_1 - \mu_0)/\sigma^2$ and $\beta_0 = \ln[q/(1-q)] + (\mu_0^2 - \mu_1^2)/(2\sigma^2)$.

In other words, if X is normally distributed in both classes, with the same variance but different means, then $m_{Y;X}$ has the logistic form! And the same is true if X is multivariate normal in each class, with different mean vectors but equal covariance matrices. (The algebra is even more tedious here, but it does work out.)

So, not only does the logistic model have an intuitively appealing form, it is also implied by one of the most famous distributions X can have within each class: the multivariate normal.

If you reread the derivation above, you will see that the logit model will hold for any within-class distributions for which

$$\ln \frac{f_{X|Y=0}(t)}{f_{X|Y=1}(t)} \tag{6.79}$$

(or its discrete analog) is linear in t. Well, guess what: this condition is true for exponential distributions too! Work it out for yourself.

In fact, a number of famous distributions imply the logit model.

6.4.3 Nonparametric Estimation of Regression Functions for Classification (advanced topic)

6.4.3.1 Use the Kernel Method, CART, Etc.

Since the classification problem is a special case of the general regression problem, nonparametric regression methods can be used here too.

6.4.3.2 SVMs

There are also some methods which have been developed exclusively, or mainly, for classification. One of them which has been getting a lot of publicity in computer science circles is support vector machines (SVMs). To explain the SVM concept, consider the case r = 2, i.e. two predictor variables $X^{(1)}$ and $X^{(2)}$.
What an SVM would do is use our sample data to draw a curve in the $X^{(1)}$-$X^{(2)}$ plane, with our classification rule then being, "Guess Y to be 1 if X is on one side of the curve, and guess it to be 0 if X is on the other side."

DON'T BUY SNAKE OIL! There are no "magic" solutions to statistical problems. SVMs do very well in some situations, not so well in others. I highly recommend the site www.dtreg.com/benchmarks.htm, which compares six different types of classification function estimators (including logistic regression and SVM) on several dozen real data sets. The overall percent misclassification rates, averaged over all the data sets, were fairly close, ranging from a high of 25.3% to a low of 19.2%. The much-vaunted SVM came in at 20.3%. That's nice, but it was only a tad better than logit's 20.9%. Considering that the latter has a big advantage in that one gets an actual equation for the classification function, complete with parameters which we can estimate and make confidence intervals for, it is not clear just what role SVM and the other nonparametric estimators should play, in general, though in specific applications they may be appropriate.

6.4.4 Variable Selection in Classification Problems

6.4.4.1 Problems Inherited from the Regression Context

In Section 6.3.9.2, it was pointed out that the problem of predictor variable selection in regression is unsolved. Since the classification problem is a special case of regression, there is no surefire way to select predictor variables there either.

6.4.4.2 Example: Forest Cover Data

And again, using hypothesis testing to choose predictors is not the answer. To illustrate this, let's look again at the forest cover data we saw in Section 4.2.12.

There were seven classes of forest cover there. Let's restrict attention to classes 1 and 2. In my R analysis I had the class 1 and 2 data in objects cov1 and cov2, respectively. I combined them,

    > cov1and2 <- rbind(cov1,cov2)

and created a new variable to serve as Y:

    cov1and2[,56] <- ifelse(cov1and2[,55] == 1,1,0)

Let's see how well we can predict a site's class from the variable HS12 (hillside shade at noon) that we investigated in that past unit, using a logistic model.
In R we fit logistic models via the glm function, for generalized linear models. The word generalized here refers to models in which some function of $m_{Y;X}(t)$ is linear in parameters $\beta_i$. For the classification model,

$$\ln\left( m_{Y;X}(t) / [1 - m_{Y;X}(t)] \right) = \beta_0 + \beta_1 t_1 + ... + \beta_r t_r \tag{6.80}$$

This kind of generalized linear model is specified in R by setting the named argument family to binomial. Here is the call:

    > g <- glm(cov1and2[,56] ~ cov1and2[,8],family=binomial)

The result was:

    > summary(g)

    Call:
    glm(formula = cov1and2[, 56] ~ cov1and2[, 8], family = binomial)

    Deviance Residuals:
       Min      1Q  Median      3Q     Max
    -1.165  -0.820  -0.775   1.504   1.741

    Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
    (Intercept)    1.515820   1.148665   1.320   0.1870
    cov1and2[, 8] -0.010960   0.005103  -2.148   0.0317 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 959.72  on 810  degrees of freedom
    Residual deviance: 955.14  on 809  degrees of freedom
    AIC: 959.14

    Number of Fisher Scoring iterations: 4

So, $\hat{\beta}_1 = -0.01$. This is tiny, as can be seen from our data in the last unit. There we found that the estimated mean values of HS12 for cover types 1 and 2 were 223.8 and 226.3, a difference of only 2.5. That difference in essence gets multiplied by 0.01. More concretely, in (6.46), plug in our estimates 1.52 and -0.01 from our R output above, first taking t to be 223.8 and then 226.3. The results are 0.328 and 0.322, respectively. In other words, HS12 isn't having much effect on the probability of cover type 1, and so it cannot be a good predictor of cover type.

Yet the R output says that $\beta_1$ is "significantly" different from 0, with a p-value of 0.03. Thus, we see once again that hypothesis testing does not achieve our goal. Again, cross validation is a better method for choosing predictors.

6.4.5 Y Must Have a Marginal Distribution!

In our material here, we have tacitly assumed that the vector (Y,X) has a distribution. That may seem like an odd and puzzling remark to make here, but it is absolutely crucial. Let's see what it means.
Consider the study on object-oriented programming in Section 6.1, but turned around. (This example will be somewhat contrived, but it will illustrate the principle.) Suppose we know how many lines of code are in a project, which we will still call $X^{(1)}$, and we know how long it took to complete, which we will now take as $X^{(2)}$, and from this we want to guess whether object-oriented or procedural programming was used (without being able to look at the code, of course), which is now our new Y.

Here is our huge problem: Given our sample data, there is no way to estimate q in (6.75). That's because the authors of the study simply took two groups of programmers and had one group use object-oriented programming and had the other group use procedural programming. If we had sampled programmers at random from actual projects done at this company, that would enable us to estimate q, the population proportion of projects done with OOP. But we can't do that with the data that we do have. Indeed, in this setting, it may not even make sense to speak of q in the first place.

Mathematically speaking, if you think about the process under which the data was collected in this study, there does exist some conditional distribution of X given Y, but Y itself has no distribution. So, we can NOT estimate P(Y = 1 | X). About the best we can do is try to guess Y on the basis of whichever value of i makes $f_{X|Y=i}(X)$ larger.

6.5 Principal Components Analysis

6.5.1 Dimension Reduction and the Principle of Parsimony

Consider a random vector $X = (X_1, X_2)^T$. Suppose the two components of X are highly correlated with each other. Then for some constants c and d,

$$X_2 \approx c + d X_1 \tag{6.81}$$

Then in a sense there is really just one random variable here, as the second is nearly equal to some linear combination of the first. The second provides us with almost no new information, once we have the first.

In other words, even though the vector X roams in two-dimensional space, it usually sticks close to a one-dimensional object, namely the line (6.81). We saw a graph illustrating this in our unit on multivariate distributions, page 84.
In general, consider a k-component random vector

$$X = (X_1, ..., X_k)^T \tag{6.82}$$

We again wish to investigate whether just a few, say w, of the $X_i$ tell almost the whole story, i.e. whether most $X_j$ can be expressed approximately as linear combinations of these few $X_i$. In other words, even though X is k-dimensional, it tends to stick close to some w-dimensional subspace.

Note that although (6.81) is phrased in prediction terms, we are not (or more accurately, not necessarily) interested in prediction here. We have not designated one of the $X_i$ to be a response variable and the rest to be predictors.

Once again, the Principle of Parsimony is key. If we have, say, 20 or 30 variables, it would be nice if we could reduce that to, for example, three or four. This may be easier to understand and work with, albeit with the complication that our new variables would be linear combinations of the old ones.

6.5.2 How to Calculate Them

Here's how it works. The theory of linear algebra says that since $\Sigma$, the covariance matrix of X, is a symmetric matrix, it is diagonalizable, i.e. there is a real matrix Q for which

$$Q^T \Sigma Q = D \tag{6.83}$$

where D is a diagonal matrix. (This is a special case of singular value decomposition.) The columns $C_i$ of Q are the eigenvectors of $\Sigma$, and it turns out that they are orthogonal to each other, i.e. their dot product is 0.

Let

$$W_i = C_i^T X, \quad i = 1, ..., k \tag{6.84}$$

so that the $W_i$ are scalar random variables, and set

$$W = (W_1, ..., W_k)^T \tag{6.85}$$

Then

$$W = Q^T X \tag{6.86}$$

Now, use the material on covariance matrices from our unit on multivariate analysis, page 75,

$$Cov(W) = Cov(Q^T X) = Q^T Cov(X) Q = D \quad \text{(from (6.83))} \tag{6.87}$$

Note too that if X has a multivariate normal distribution (which we are not assuming), then W does too.

Let's recap:
• We have created new random variables $W_i$ as linear combinations of our original $X_j$.
• The $W_i$ are uncorrelated. Thus if in addition X has a multivariate normal distribution, so that W does too, then the $W_i$ will be independent.
• The variance of $W_i$ is given by the i-th diagonal element of D.

The $W_i$ are called the principal components of the distribution of X.

It is customary to relabel the $W_i$ so that $W_1$ has the largest variance, $W_2$ has the second-largest, and so on.
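The computation in (6.83) through (6.87) can be reproduced numerically. The sketch below uses simulated data (three hypothetical variables, the second nearly a linear function of the first) and the sample covariance matrix in place of $\Sigma$; it then compares the result with R's prcomp function. The data are centered before forming W, which changes the means of the $W_i$ but not their covariances.

```r
# Simulated data: x2 is nearly a linear function of x1; x3 is unrelated.
set.seed(3)
n  <- 1000
x1 <- rnorm(n)
x2 <- 2 + 3 * x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)

S <- cov(X)                       # sample version of Sigma
e <- eigen(S, symmetric = TRUE)   # eigenvalues come back in decreasing order
Q <- e$vectors                    # columns are the eigenvectors C_i

D <- t(Q) %*% S %*% Q             # should be diagonal, as in (6.83)

# Principal components of each observation: rows of W are (W_1, W_2, W_3).
W <- scale(X, center = TRUE, scale = FALSE) %*% Q

# prcomp reports standard deviations; squaring them gives the variances.
prc <- prcomp(X)
print(rbind(eigen = e$values, prcomp = prc$sdev^2))
```

In this simulated example one of the three variances is close to 0, reflecting the fact that x2 adds almost nothing once x1 is known.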
We then choose those $W_i$ that have the larger variances, and discard the others, because the latter, having small variances, are close to constant and thus carry no information.

All this will become clearer in the example below.

6.5.3 Example: Forest Cover Data

Let's try using principal component analysis on the forest cover data set we've looked at before. There are 10 continuous variables (also many discrete ones, but there is another tool for that case, the log-linear model, discussed in Section 6.6).

In my R run, the data set (not restricted to just two forest cover types, but consisting only of the first 1000 observations) was in the object f. Here are the call and the results:

    > prc <- prcomp(f[,1:10])
    > summary(prc)
    Importance of components:
                                PC1      PC2      PC3      PC4      PC5      PC6
    Standard deviation     1812.394 1613.287 1.89e+02 1.10e+02 96.93455 30.16789
    Proportion of Variance    0.552    0.438 6.01e-03 2.04e-03  0.00158  0.00015
    Cumulative Proportion     0.552    0.990 9.96e-01 9.98e-01  0.99968  0.99984
                                PC7      PC8 PC9  PC10
    Standard deviation     25.95478 16.78595 4.2 0.783
    Proportion of Variance  0.00011  0.00005 0.0 0.000
    Cumulative Proportion   0.99995  1.00000 1.0 1.000

You can see from the variance values here that R has scaled the $W_i$ so that their variances sum to 1.0. (It has not done so for the standard deviations, which are for the nonscaled variables.) This is fine, as we are only interested in the variances relative to each other, i.e. saving the principal components with the larger variances.

What we see here is that eight of the 10 principal components have very small variances, i.e. are close to constant. In other words, though we have 10 variables $X_1, ..., X_{10}$, there is really only two variables' worth of information carried in them.

So for example if we wish to predict forest cover type from these 10 variables, we should only use two of them. We could use $W_1$ and $W_2$, but for the sake of interpretability we stick to the original X vector; we can use any two of the $X_i$.

The coefficients of the linear combinations which produce W from X, i.e. the Q matrix, are available via prc$rotation.

6.6 Log-Linear Models

Here we discuss a procedure which is something of an analog of principal components for discrete variables.
Our material on ANOVA will also come into play. It is recommended that the reader review Sections 6.3.14 and 6.5 before continuing.

6.6.1 The Setting

Let's consider a variation on the software engineering example in Sections 6.2 and 6.3.14. Assume we have the factors, IDE, Language and Education. Our change, of extreme importance, is that we will now assume that these factors are RANDOM. What does this mean?

In the original example described in Section 6.2, programmers were assigned to languages, and in our extensions of that example, we continued to assume this. Thus for example the number of programmers who use an IDE and program in Java was fixed; if we repeated the experiment, that number would stay the same. If we were sampling from some programmer population, our new sample would have new programmers, but the number using an IDE and Java would be the same as before, as our study procedure specifies this.

By contrast, let's now assume that we simply sample programmers at random, and ask them whether they prefer to use an IDE or not, and which language they prefer.[10] Then for example the number of programmers who prefer to use an IDE and program in Java will be random, not fixed; if we repeat the experiment, we will get a different count.

Suppose now we wish to investigate relations between the factors. Are choice of platform and language related to education, for instance?

6.6.2 The Data

Denote our three factors by $X^{(s)}$, s = 1,2,3. Here $X^{(1)}$, IDE, will take on the values 1 and 2 instead of 1 and 0 as before, 1 meaning that the programmer prefers to use an IDE, and 2 meaning not so. $X^{(3)}$, Education, changes this way too (1 for a Bachelor's degree or less, 2 for a Master's or more), and $X^{(2)}$, Language, will take on the values 1 for C++, 2 for Java and 3 for C. Note that we no longer use indicator variables.

Let $X_r^{(s)}$ denote the value of $X^{(s)}$ for the r-th programmer in our sample, r = 1,2,...,n. Our data are the counts

$$N_{ijk} = \text{number of } r \text{ such that } X_r^{(1)} = i, \ X_r^{(2)} = j \text{ and } X_r^{(3)} = k \tag{6.88}$$

For instance, if we sample 100 programmers, our data might look like this:

    prefers to use IDE:
            Bachelor's or less   Master's or more
    C++             18                  15
    Java            22                  10
    C                6                   4

    prefers not to use IDE:
            Bachelor's or less   Master's or more
    C++              7                   4
    Java             6                   2
    C                3                   3

So for example $N_{122} = 10$ and $N_{212} = 4$.

Here we have a three-dimensional contingency table. Each $N_{ijk}$ value is a cell in the table.
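In R, a table like this is naturally stored as a three-dimensional array whose indices follow the (i,j,k) order of $N_{ijk}$. The sketch below enters the 100 hypothetical counts shown above; the object and dimension names are assumptions chosen for readability.

```r
# N[i, j, k]: i = IDE preference, j = language, k = education level.
N <- array(0, dim = c(2, 3, 2),
           dimnames = list(IDE       = c("yes", "no"),
                           Language  = c("C++", "Java", "C"),
                           Education = c("Bachelor's or less", "Master's or more")))

# Each 3 x 2 slice has one row per language and one column per education level.
N["yes", , ] <- rbind(c(18, 15), c(22, 10), c(6, 4))
N["no", , ]  <- rbind(c(7, 4),  c(6, 2),   c(3, 3))

n <- sum(N)        # total number of programmers sampled
print(N[1, 2, 2])  # the cell N_122
print(N[2, 1, 2])  # the cell N_212

# Sample proportions for each cell, and the counts for each factor separately.
p_hat  <- N / n
ide_n  <- apply(N, 1, sum)
lang_n <- apply(N, 2, sum)
edu_n  <- apply(N, 3, sum)
```

An array of this form is the kind of input that R's table-oriented functions expect.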
[10] Other sampling schemes are possible too.

6.6.3 The Models

Let $p_{ijk}$ be the population probability of a randomly-chosen programmer falling into cell ijk, i.e.

$$p_{ijk} = P(X^{(1)} = i \text{ and } X^{(2)} = j \text{ and } X^{(3)} = k) = E(N_{ijk})/n \tag{6.89}$$

As mentioned, we are interested in relations between the factors, in the form of independence, full and partial. Consider first the case of full independence:

$$p_{ijk} = P(X^{(1)} = i \text{ and } X^{(2)} = j \text{ and } X^{(3)} = k) \tag{6.90}$$

$$= P(X^{(1)} = i) \, P(X^{(2)} = j) \, P(X^{(3)} = k) \tag{6.91}$$

Taking logs of both sides in (6.91), we see that independence of the three factors is equivalent to saying

$$\log(p_{ijk}) = a_i + b_j + c_k \tag{6.92}$$

for some numbers $a_i$, $b_j$ and $c_k$. The numbers must be nonpositive, and since

$$\sum_m P(X^{(s)} = m) = 1 \tag{6.93}$$

we must have, for instance,

$$\sum_{g=1}^{2} \exp(c_g) = 1 \tag{6.94}$$

The point is that (6.92) looks like our no-interaction ANOVA models, e.g. (6.51). On the other hand, if we assume instead that Education is independent of IDE and Language but that IDE and Language are not independent of each other, our model would be

$$\log(p_{ijk}) = \log\left[ P(X^{(1)} = i \text{ and } X^{(2)} = j) \, P(X^{(3)} = k) \right] \tag{6.95}$$

$$= a_i + b_j + d_{ij} + c_k \tag{6.96}$$

Here we have written $\log P(X^{(1)} = i \text{ and } X^{(2)} = j)$ as a sum of "main effects" $a_i$ and $b_j$, and "interaction effects," $d_{ij}$, analogous to ANOVA.

Another possible model would have IDE and Language conditionally independent, given Education, meaning that at any level of education, a programmer's preference to use IDE or not, and his choice of programming language, are not related. We'd write the model this way:

$$\log(p_{ijk}) = \log\left[ P(X^{(1)} = i | X^{(3)} = k) \, P(X^{(2)} = j | X^{(3)} = k) \, P(X^{(3)} = k) \right] \tag{6.97}$$

$$= a_i + b_j + f_{ik} + h_{jk} + c_k \tag{6.98}$$

Note carefully that the type of independence in (6.98) has a quite different interpretation than that in (6.96).

The full model, with no independence assumptions at all, would have three two-way interaction terms, as well as a three-way interaction term.

6.6.4 Parameter Estimation

Remember, whenever we have parametric models, the statistician's "Swiss army knife" is maximum likelihood estimation. That is what is most often used in the case of log-linear models.
How, then, do we compute the likelihood of our data, the $N_{ijk}$? It's actually quite straightforward, because the $N_{ijk}$ have the multinomial distribution we studied in Section 3.6.1.1 of our unit on multivariate distributions.

$$L = \frac{n!}{\prod_{i,j,k} N_{ijk}!} \prod_{i,j,k} p_{ijk}^{N_{ijk}} \tag{6.99}$$

We then write the $p_{ijk}$ in terms of our model parameters. Take for example (6.96), where we write

$$p_{ijk} = e^{a_i + b_j + d_{ij} + c_k} \tag{6.100}$$

We then substitute (6.100) in (6.99), and maximize the latter with respect to the $a_i$, $b_j$, $d_{ij}$ and $c_k$, subject to constraints such as (6.94).

The maximization may be messy. But certain cases have been worked out in closed form, and in any case today one would typically do the computation by computer. In R, for example, there is the loglin() function for this purpose.

6.6.5 The Goal: Parsimony Again

Again, we'd like the simplest model possible, but not simpler. This means a model with as much independence between factors as possible, subject to the model being accurate.

Classical log-linear model procedures do model selection by hypothesis testing, testing whether various interaction terms are 0. The tests often parallel ANOVA testing, with chi-square distributions arising instead of F-distributions.

6.7 Simpson's (Non-)Paradox

Suppose each individual in a population either possesses or does not possess traits A, B and C, and that we wish to predict trait A. Let $\bar{A}$, $\bar{B}$ and $\bar{C}$ denote the situations in which the individual does not possess the given trait. Simpson's Paradox then describes a situation in which

$$P(A \mid B) > P(A \mid \bar{B}) \tag{6.101}$$

and yet

$P(A \mid B, C)$ [...]

Richmond's population was 37% black, proportionally far more than New York's 0.2%. So, Richmond's heavy concentration of blacks made its overall mortality rate look worse than New York's, even though things were actually much worse in New York.

But is this really a "paradox"? Closer consideration of this example reveals that the only reason this example (and others like it) is surprising is that the predictors were used in the wrong order. One normally looks for predictors one at a time, first finding the best single predictor, then the best pair of predictors, and so on.
If this were done on the above data set, the first predictor variable chosen would be race, not city. In other words, the sequence of analysis would look something like this:

    P(mortality | Richmond) = 0.0022
    P(mortality | New York) = 0.0019
    P(mortality | black) = 0.0048
    P(mortality | white) = 0.0018
    P(mortality | black, Richmond) = 0.0033
    P(mortality | black, New York) = 0.0056
    P(mortality | white, Richmond) = 0.0016
    P(mortality | white, New York) = 0.0018

The analyst would have seen that race is a better predictor than city, and thus would have chosen race as the best single predictor. The analyst would then investigate the race/city predictor pair, and would never reach a point in which city alone were in the selected predictor set. Thus no anomalies would arise.

Exercises

Note to instructor: See the Preface for a list of sources of real data on which exercises can be assigned to complement the theoretical exercises below.

1. Suppose we are interested in documents of a certain type, which we'll call Type 1. Everything that is not Type 1 we'll call Type 2, with a proportion q of all documents being Type 1. Our goal will be to try to guess document type by the presence or absence of a certain word; we will guess Type 1 if the word is present, and otherwise will guess Type 2.

Let T denote document type, and let W denote the event that the word is in the document. Also, let $p_i$ be the proportion of documents that contain the word, among all documents of Type i, i = 1,2. The event C will denote our guessing correctly.

Find the overall probability of correct classification, $P(C)$, and also $P(C \mid W)$.

Hint: Be careful of your conditional and unconditional probabilities here.

2. In the quartic model in the ALOHA simulation example, find an approximate 95% confidence interval for the true population mean wait if our backoff parameter b is set to 0.6.

Hint: You will need to use the fact that a linear combination of the components of a multivariate normal random vector has a univariate normal distribution, as discussed in Section 3.6.2.1.

3. Consider the linear regression model with one predictor, i.e. r = 1. Let $Y_i$ and $X_i$ represent the values of the response and predictor variables for the i-th observation in our sample.
(a) Assume as in Section 6.3.7.4 that $Var(Y \mid X = t)$ is a constant in t, $\sigma^2$. Find the exact value of $Cov(\hat{\beta}_0, \hat{\beta}_1)$, as a function of the $X_i$ and $\sigma^2$. Your final answer should be in scalar, i.e. non-matrix form.

(b) Suppose we wish to fit the model $m_{Y;X}(t) = \beta_1 t$, i.e. the usual linear model but without the constant term, $\beta_0$. Derive a formula for the least-squares estimate of $\beta_1$.

4. Suppose the random pair (X,Y) has density 8st on 0 [...]

[...] is equal to $\rho^2(X,Y)$.

Chapter 7

Markov Chains

One of the most famous stochastic models is that of a Markov chain. This type of model is widely used in computer science, biology, physics and so on.

7.1 Discrete-Time Markov Chains

7.1.1 Example: Finite Random Walk

To motivate the discussion of Markov chains, let us start with a simple example: Consider a random walk on the set of integers between 1 and 5, moving randomly through that set, say one move per second, according to the following scheme. If we are currently at position i, then one time period later we will be at either i-1, i or i+1, according to the outcome of rolling a fair die: we move to i-1 if the die comes up 1 or 2, stay at i if the die comes up 3 or 4, and move to i+1 in the case of a 5 or 6. For the special cases i = 1 and i = 5, we simply move back to 2 or 4, respectively. (In random walk terminology, these are called reflecting barriers.)

The integers 1 through 5 form the state space for this process; if we are currently at 4, for instance, we say we are in state 4. Let $X_t$ represent the position of the particle at time t, t = 0,1,2,....
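The rules of the walk are easy to simulate. What follows is a minimal sketch in Python rather than a definitive implementation; the function names `step` and `simulate`, the starting position, the run length and the seed are all our own choices.

```python
import random

def step(i, rng):
    """One move of the walk on {1,...,5} with reflecting barriers at 1 and 5."""
    if i == 1:
        return 2
    if i == 5:
        return 4
    die = rng.randint(1, 6)   # fair die, values 1..6 inclusive
    if die <= 2:
        return i - 1          # die shows 1 or 2: move left
    if die <= 4:
        return i              # die shows 3 or 4: stay put
    return i + 1              # die shows 5 or 6: move right

def simulate(x0, nsteps, seed=0):
    """Return the path X_0, X_1, ..., X_nsteps."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(nsteps):
        path.append(step(path[-1], rng))
    return path

nsteps = 100000
path = simulate(3, nsteps)
# Empirical fraction of times 1..nsteps spent at position 4
frac_at_4 = path[1:].count(4) / nsteps
print(frac_at_4)
```

Rerunning with a different starting position or seed changes the path, but for a long run the fraction of time spent at a given position should change very little.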
The random walk is a Markov process. The process is "memoryless," meaning that we can "forget the past"; given the present and the past, the future depends only on the present:

$$P(X_{t+1} = s_{t+1} \mid X_t = s_t, X_{t-1} = s_{t-1}, \ldots, X_0 = s_0) = P(X_{t+1} = s_{t+1} \mid X_t = s_t) \tag{7.1}$$

The term Markov process is the general one. If the state space is discrete, i.e. finite or countably infinite, then we usually use the more specialized term, Markov chain.

Although Equation (7.1) has a very complex look, it has a very simple meaning: The distribution of our next position, given our current position and all our past positions, is dependent only on the current position. [1] It is clear that the random walk process above does have this property; for instance, if we are now at position 4, the probability that our next state will be 3 is 1/3, no matter where we were in the past.

[1] This can be generalized, so that the future depends on the present and also on the state one unit of time ago, etc. However, such models become quite unwieldy.

Continuing this example, let $p_{ij}$ denote the probability of going from position i to position j in one step. For example, $p_{21} = p_{23} = \frac{1}{3}$ while $p_{24} = 0$ (we can reach position 4 from position 2 in two steps, but not in one step). The numbers $p_{ij}$ are called the one-step transition probabilities of the process. Denote by P the matrix whose entries are the $p_{ij}$:

$$P = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 & 0 \\
0 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} & 0 \\
0 & 0 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\
0 & 0 & 0 & 1 & 0
\end{pmatrix} \tag{7.2}$$

By the way, it turns out that the matrix $P^k$ gives the k-step transition probabilities. In other words, the element (i,j) of this matrix gives the probability of going from i to j in k steps.

7.1.2 Long-Run Distribution

In typical applications we are interested in the long-run distribution of the process, for example the long-run proportion of the time that we are at position 4. For each state i, define

$$\pi_i = \lim_{t \to \infty} \frac{N_{it}}{t} \tag{7.3}$$

where $N_{it}$ is the number of visits the process makes to state i among times 1,2,...,t. In most practical cases, this proportion will exist and be independent of our initial position $X_0$. The $\pi_i$ are called the steady-state probabilities, or the stationary distribution of the Markov chain.
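The claim about $P^k$ can be checked directly for the random walk matrix. Here is a small sketch using exact fractions from the Python standard library; `matmul` and `matpow` are helper names of our own, not functions from any package.

```python
from fractions import Fraction as F

t = F(1, 3)
# One-step transition matrix (7.2); row i-1, column j-1 holds p_ij
P = [
    [0, 1, 0, 0, 0],
    [t, t, t, 0, 0],
    [0, t, t, t, 0],
    [0, 0, t, t, t],
    [0, 0, 0, 1, 0],
]

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def matpow(A, k):
    """A multiplied by itself k times (k >= 0), by repeated multiplication."""
    n = len(A)
    result = [[F(int(i == j)) for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = matmul(result, A)
    return result

P2 = matpow(P, 2)
# Probability of going from position 2 to position 4 in exactly two steps
two_step_2_to_4 = P2[1][3]

# For large k, the rows of P^k become nearly identical
P50 = matpow(P, 50)
col4 = [float(row[3]) for row in P50]
print(two_step_2_to_4, col4)
```

The only two-step route from 2 to 4 is 2, 3, 4, which is why the two-step entry is positive even though $p_{24} = 0$. The near-identical rows of the 50-step matrix illustrate the idea that the long-run behavior does not depend on the starting position.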
Intuitively, the existence of $\pi_i$ implies that as t approaches infinity, the system approaches steady-state, in the sense that

$$\lim_{t \to \infty} P(X_t = i) = \pi_i \tag{7.4}$$

Actually, the limit (7.4) may not exist in some cases. We'll return to that point later, but for typical cases it does exist, and we will usually assume this. It then suggests a way to calculate the values $\pi_i$, as follows.

First note that

$$P(X_{t+1} = i) = \sum_k P(X_t = k \text{ and } X_{t+1} = i) = \sum_k P(X_t = k) P(X_{t+1} = i \mid X_t = k) = \sum_k P(X_t = k)\, p_{ki} \tag{7.5}$$

where the sum goes over all states k. For example, in our random walk example above, we would have

$$P(X_{t+1} = 3) = \sum_{k=1}^{5} P(X_t = k \text{ and } X_{t+1} = 3) = \sum_{k=1}^{5} P(X_t = k) P(X_{t+1} = 3 \mid X_t = k) = \sum_{k=1}^{5} P(X_t = k)\, p_{k3} \tag{7.6}$$

Then as $t \to \infty$ in Equation (7.5), intuitively we would have

$$\pi_i = \sum_k \pi_k p_{ki} \tag{7.7}$$

Remember, here we know the $p_{ki}$ and want to find the $\pi_i$. Solving these equations (one for each i), called the balance equations, gives us the $\pi_i$.

A matrix formulation is also useful. Letting $\pi$ denote the row vector of the elements $\pi_i$, i.e. $\pi = (\pi_1, \pi_2, \ldots)$, these equations (one for each i) then have the matrix form

$$\pi = \pi P \tag{7.8}$$

or

$$(I - P')\,\pi' = 0 \tag{7.9}$$

where ' denotes transpose. Note that there is also the constraint

$$\sum_i \pi_i = 1 \tag{7.10}$$

For the random walk problem above, for instance, the solution is $\pi = \left(\frac{1}{11}, \frac{3}{11}, \frac{3}{11}, \frac{3}{11}, \frac{1}{11}\right)$. Thus in the long run we will spend 1/11 of our time at position 1, 3/11 of our time at position 2, and so on.

The matrix form can be used to calculate the $\pi_i$. It turns out that one of the equations in the system is redundant. We thus eliminate one of them, say by removing the last row of I-P' in (7.9). To reflect (7.10), we replace the removed row by a row of all 1s, and in the right-hand side of (7.9) we replace the last 0 by a 1. We can then solve the system. It can be done with R's solve() function. Or one can note from (7.8) that $\pi$ is a left eigenvector of P with eigenvalue 1, so one can call eigen() on P'.
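The row-replacement recipe just described can be sketched as follows. This is an illustration in Python with exact fractions, not the R code the text alludes to; the function name `stationary` and the plain Gauss-Jordan elimination inside it are our own choices.

```python
from fractions import Fraction as F

def stationary(P):
    """Solve the balance equations for a finite chain with transition matrix P.

    Builds I - P' (one row per balance equation), replaces the last row by
    all 1s and the last right-hand side entry by 1, then solves the system.
    """
    n = len(P)
    A = [[F(int(i == j)) - F(P[j][i]) for j in range(n)] for i in range(n)]
    b = [F(0)] * n
    A[n - 1] = [F(1)] * n   # constraint: the pi_i sum to 1
    b[n - 1] = F(1)
    # Gauss-Jordan elimination with a search for a nonzero pivot
    for col in range(n):
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        inv = 1 / A[col][col]
        A[col] = [a * inv for a in A[col]]
        b[col] = b[col] * inv
        for r in range(n):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] = b[r] - f * b[col]
    return b

t = F(1, 3)
P = [
    [0, 1, 0, 0, 0],
    [t, t, t, 0, 0],
    [0, t, t, t, 0],
    [0, 0, t, t, t],
    [0, 0, 0, 1, 0],
]
pi = stationary(P)
print(pi)
```

Working in exact fractions avoids roundoff for a chain this small; for large state spaces one would normally use a floating-point linear algebra routine instead.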
But Equation (7.9) may not be easy to solve. For instance, if the state space is infinite, then this matrix equation represents infinitely many scalar equations. In such cases, you may need to try to find some clever trick which will allow you to solve the system, or in many cases a clever trick to analyze the process in some way other than explicit solution of the system of equations.

And even for finite state spaces, the matrix may be extremely large. In some cases, you may need to resort to numerical methods, or symbolic math packages.

7.1.2.1 Periodic Chains

Note again that even if Equation (7.9) has a solution, this does not imply that (7.4) holds. For instance, suppose we alter the random walk example above so that

$$p_{i,i-1} = p_{i,i+1} = \frac{1}{2} \tag{7.11}$$

for i = 2,3,4, with transitions out of states 1 and 5 remaining as before. In this case, the solution to Equation (7.9) is $\left(\frac{1}{8}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{8}\right)$. This solution is still valid, in the sense that Equation (7.3) will hold. For example, we will spend 1/4 of our time at Position 4 in the long run. But the limit of $P(X_t = 4)$ will not be 1/4, and in fact the limit will not even exist. If say $X_0$ is even, then $X_t$ can be even only for even values of t. We say that this Markov chain is periodic with period 2, meaning that returns to a given state can only occur after amounts of time which are multiples of 2.

7.1.2.2 The Meaning of the Term "Stationary Distribution"

Though we have informally defined the term stationary distribution in terms of long-run proportions, the technical definition is this:

Definition 11 Consider a Markov chain. Suppose we have a vector $\pi$ of nonnegative numbers that sum to 1. Let $X_0$ have the distribution $\pi$. If that results in $X_1$ having that distribution too (and thus also all $X_n$), we say that $\pi$ is the stationary distribution of this Markov chain.

Note that this definition stems from (7.5).

In our (first) random walk example above, this would mean that if we have $X_0$ distributed on the integers 1 through 5 with probabilities $\left(\frac{1}{11}, \frac{3}{11}, \frac{3}{11}, \frac{3}{11}, \frac{1}{11}\right)$, then for example $P(X_1 = 1) = \frac{1}{11}$, $P(X_1 = 4) = \frac{3}{11}$ etc. This is indeed the case, as you can verify using (7.5) with t = 0.
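Both points, the stationarity check via (7.5) and the failure of the limit (7.4) for the periodic chain, can be verified mechanically. The following is a sketch in Python with exact fractions; the helper name `push` and the choice of starting the periodic chain at position 2 are ours.

```python
from fractions import Fraction as F

def push(dist, P):
    """Apply (7.5) once: the distribution of X_{t+1}, given that of X_t."""
    n = len(P)
    return [sum(dist[k] * P[k][i] for k in range(n)) for i in range(n)]

t, h = F(1, 3), F(1, 2)

# First random walk example (die-based moves, reflecting barriers)
P_walk = [
    [0, 1, 0, 0, 0],
    [t, t, t, 0, 0],
    [0, t, t, t, 0],
    [0, 0, t, t, t],
    [0, 0, 0, 1, 0],
]
pi_walk = [F(1, 11), F(3, 11), F(3, 11), F(3, 11), F(1, 11)]
walk_is_stationary = push(pi_walk, P_walk) == pi_walk

# Periodic variant (7.11): interior states move left or right, never stay
P_per = [
    [0, 1, 0, 0, 0],
    [h, 0, h, 0, 0],
    [0, h, 0, h, 0],
    [0, 0, h, 0, h],
    [0, 0, 0, 1, 0],
]
pi_per = [F(1, 8), F(1, 4), F(1, 4), F(1, 4), F(1, 8)]
per_is_stationary = push(pi_per, P_per) == pi_per

# Start the periodic chain at X_0 = 2 and record P(X_t = 4) for t = 1,...,40
dist = [0, 1, 0, 0, 0]
prob_at_4 = []
for _ in range(40):
    dist = push(dist, P_per)
    prob_at_4.append(dist[3])

print(walk_is_stationary, per_is_stationary)
print([float(p) for p in prob_at_4[:6]])
```

In the periodic chain the recorded probabilities are zero at every odd time, so they cannot converge to 1/4, even though the vector (1/8, 1/4, 1/4, 1/4, 1/8) does satisfy the balance equations.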
In our "notebook" view, here is what we would do. Imagine that we generate a random integer between 1 and 5 according to the probabilities $\left(\frac{1}{11}, \frac{3}{11}, \frac{3}{11}, \frac{3}{11}, \frac{1}{11}\right)$, [2] and set $X_0$ to that number. We would then generate another random number, by rolling an ordinary die, and going left, right or staying put, with probability 1/3 each (or simply moving back from the barrier, if $X_0$ is 1 or 5). We would then write down $X_0$ and $X_1$ on the first line of our notebook. We would then do this experiment again, recording the results on the second line, then again and again. In the long run, 3/11 of the lines would have, for instance, $X_0 = 4$, and 3/11 of the lines would have $X_1 = 4$. In other words, $X_1$ would have the same distribution as $X_0$.

[2] Say by rolling an 11-sided die.

7.1.3 Example: Stuck-At 0 Fault

7.1.3.1 Description

In the above example, the labels for the states consisted of single integers i. In some other examples, convenient labels may be r-tuples, for example 2-tuples (i,j).

Consider a serial communication line. Let $B_1, B_2, B_3, \ldots$ denote the sequence of bits transmitted on this line. It is reasonable to assume the $B_i$ to be independent, and that $P(B_i = 0)$ and $P(B_i = 1)$ are both equal to 0.5.

Suppose that the receiver will eventually fail, with the type of failure being stuck at 0, meaning that after failure it will report all future received bits to be 0, regardless of their true value. Once failed, the receiver stays failed, and should be replaced. Eventually the new receiver will also fail, and we will replace it; we continue this process indefinitely.

Let $\rho$ denote the probability that the receiver fails on any given bit, with independence between bits in terms of receiver failure. Then the lifetime of the receiver, that is, the time to failure, is geometrically distributed with "success" probability $\rho$, i.e. the probability of failing on receipt of the i-th bit after the receiver is installed is $(1-\rho)^{i-1}\rho$ for i = 1,2,3,...
However, the problem is that we will not know whether a receiver has failed (unless we test it once in a while, which we are not including in this example). If the receiver reports a long string of 0s, we should suspect that the receiver has failed, but of course we cannot be sure that it has; it is still possible that the message being transmitted just happened to contain a long string of 0s.

Suppose we adopt the policy that, if we receive k consecutive 0s, we will replace the receiver with a new unit. Here k is a design parameter; what value should we choose for it? If we use a very small value, then we will incur great expense, due to the fact that we will be replacing receiver units at an unnecessarily high rate. On the other hand, if we make k too large, then we will often wait too long to replace the receiver, and the resulting error rate in received bits will be sizable. Resolution of this tradeoff between expense and accuracy depends on the relative importance of the two. (There are also other possibilities, involving the addition of redundant bits for error detection, such as parity bits. For simplicity, we will not consider such refinements here. However, the analysis of more complex systems would be similar to the one below.)

7.1.3.2 Initial Analysis

A natural state space in this example would be

$$\{(i,j) : i = 0, 1, \ldots, k-1;\ j = 0, 1;\ i + j \neq 0\} \tag{7.12}$$

where i represents the number of consecutive 0s that we have received so far, and j represents the state of the receiver (0 for failed, 1 for nonfailed). Note that when we are in a state of the form (k-1,j), if we receive a 0 on the next bit (whether it is a true 0 or the receiver has failed), our new state will be (0,1), as we will install a new receiver. Note too that there is no state (0,0), since if the receiver is down it must have received at least one bit.
The calculation of the transition matrix P is straightforward, though it requires careful thought. For example, suppose the current state is (2,1), and that we are investigating the expense and bit accuracy corresponding to a policy having k = 5. What can happen upon receipt of the next bit? The next bit will have a true value of either 0 or 1, with probability 0.5 each. The receiver will change from working to failed status with probability $\rho$. Thus our next state could be:

• (3,1), if a 0 arrives, and the receiver does not fail;

• (0,1), if a 1 arrives, and the receiver does not fail; or

• (3,0), if the receiver fails

The probabilities of these three transitions out of state (2,1) are:

\begin{align}
p_{(2,1),(3,1)} &= 0.5(1-\rho) \tag{7.13} \\
p_{(2,1),(0,1)} &= 0.5(1-\rho) \tag{7.14} \\
p_{(2,1),(3,0)} &= \rho \tag{7.15}
\end{align}

Other entries of the matrix P can be computed similarly. Note by the way that from state (4,1) we will go to (0,1), no matter what happens.

Formally specifying the matrix P using the 2-tuple notation as above would be very cumbersome. In this case, it would be much easier to map to a one-dimensional labeling. For example, if k = 5, the nine states (1,0),...,(4,0),(0,1),(1,1),...,(4,1) could be renamed states 1,2,...,9. Then we could form P under this labeling, and the transition probabilities above would appear as

\begin{align}
p_{78} &= 0.5(1-\rho) \tag{7.16} \\
p_{75} &= 0.5(1-\rho) \tag{7.17} \\
p_{73} &= \rho \tag{7.18}
\end{align}

7.1.3.3 Going Beyond Finding $\pi$

Finding the $\pi_i$ should be just the first step. We then want to use them to calculate various quantities of interest. [3] For instance, in this example, it would also be useful to find the error rate $\epsilon$, and the mean time (i.e., the mean number of bit receptions) between receiver replacements, $\mu$. We can find both $\epsilon$ and $\mu$ in terms of the $\pi_i$, in the following manner.
The quantity $\epsilon$ is the proportion of the time during which the true value of the received bit is 1 but the receiver is down, which is 0.5 times the proportion of the time spent in states of the form (i,0):

$$\epsilon = 0.5(\pi_1 + \pi_2 + \pi_3 + \pi_4) \tag{7.19}$$

This should be clear intuitively, but it would also be instructive to present a more formal derivation of the same thing. Let $E_n$ be the event that the n-th bit is received in error, with $D_n$ denoting the event that the receiver is down. Then

\begin{align}
\epsilon &= \lim_{n \to \infty} P(E_n) \tag{7.20} \\
&= \lim_{n \to \infty} P(B_n = 1 \text{ and } D_n) \tag{7.21} \\
&= \lim_{n \to \infty} P(B_n = 1)\, P(D_n) \tag{7.22} \\
&= 0.5(\pi_1 + \pi_2 + \pi_3 + \pi_4) \tag{7.23}
\end{align}

Here we used the fact that the true bit value $B_n$ and the receiver state are independent.

[3] Note that unlike a classroom setting, where those quantities would be listed for the students to calculate, in research we must decide on our own which quantities are of interest.

Equations (7.20) through (7.23) follow a pattern we'll use repeatedly in this chapter. In subsequent examples we will not show the steps with the limits, but the limits are indeed there. Make sure to mentally go through these steps yourself. [4]

Now to get $\mu$ in terms of the $\pi_i$, note that since $\mu$ is the long-run average number of bits between receiver replacements, it is then the reciprocal of $\nu$, the long-run fraction of bits that result in replacements. For example, say we replace the receiver on average every 20 bits. Over a period of 1000 bits, then (speaking on an intuitive level) that would mean about 50 replacements. Thus approximately 0.05 (50 out of 1000) of all bits result in replacements.

$$\mu = \frac{1}{\nu} \tag{7.24}$$

Again suppose k = 5. A replacement will occur only from states of the form (4,j), and even then only under the condition that the next reported bit is a 0. In other words, there are three possible ways in which replacement can occur:

(a) We are in state (4,0). Here, since the receiver has failed, the next reported bit will definitely be a 0, regardless of that bit's true value. We will then have a total of k = 5 consecutive received 0s, and therefore will replace the receiver.

(b) We are in the state (4,1), and the next bit to arrive is a true 0. It then will be reported as a 0, our fifth consecutive 0, and we will replace the receiver, as in (a).
(c) We are in the state (4,1), and the next bit to arrive is a true 1, but the receiver fails at that time, resulting in the reported value being a 0. Again we have five consecutive reported 0s, so we replace the receiver.

Therefore,

$$\nu = \pi_4 + \pi_9 (0.5 + 0.5\rho) \tag{7.25}$$

Again, make sure you work through the full version of (7.25), using the pattern in (7.20).

[4] The other way to work this out rigorously is to assume that $X_0$ has the distribution $\pi$, as in Section 7.1.2.2. Then no limits are needed in (7.20). But this may be more difficult to understand.

Thus

$$\mu = \frac{1}{\nu} = \frac{1}{\pi_4 + 0.5\pi_9(1 + \rho)} \tag{7.26}$$

This kind of analysis could be used as the core of a cost-benefit tradeoff investigation to determine a good value of k. (Note that the $\pi_i$ are functions of k, and that the above equations for the case k = 5 must be modified for other values of k.)

7.1.4 Example: Shared-Memory Multiprocessor

(Adapted from Probability and Statistics, with Reliability, Queuing and Computer Science Applications, by K.S. Trivedi, Prentice-Hall, 1982 and 2002, but similar to many models in the research literature.)

7.1.4.1 The Model

Consider a shared-memory multiprocessor system with m memory modules and m CPUs. The address space is partitioned into m chunks, based on either the most-significant or least-significant $\log_2 m$ bits in the address. [5]

The CPUs will need to access the memory modules in some random way, depending on the programs they are running. To make this idea concrete, consider the Intel assembly language instruction

    add %eax, (%ebx)

which adds the contents of the EAX register to the word in memory pointed to by the EBX register. Execution of that instruction will (absent cache and other similar effects, as we will assume here and below) involve two accesses to memory: one to fetch the old value of the word pointed to by EBX, and another to store the new value. Moreover, the instruction itself must be fetched from memory. So, altogether the processing of this instruction involves three memory accesses.
Since different programs are made up of different instructions, use different register values and so on, the sequence of addresses in memory that are generated by CPUs are modeled as random variables. In our model here, the CPUs are assumed to act independently of each other, and successive requests from a given CPU are independent of each other too. A CPU will choose the i-th module with probability $q_i$. A memory request takes one unit of time to process, though the wait may be longer due to queuing. In this very simplistic model, as soon as a CPU's memory request is fulfilled, it generates another one. On the other hand, while a CPU has one memory request pending, it does not generate another.

Let's assume a crossbar interconnect, which means there are $m^2$ separate paths from CPUs to memory modules, so that if the m CPUs have memory requests to m different memory modules, then all the requests can be fulfilled simultaneously. Also, assume as an approximation that we can ignore communication delays.

[5] You may recognize this as high-order and low-order interleaving, respectively.

How good are these assumptions? One weakness, for instance, is that many instructions do not use memory at all, except for the instruction fetch, and as mentioned, even the latter may be suppressed due to cache effects.

Another example of potential problems with the assumptions involves the fact that many programs will have code like

    for (i = 0; i < 10000; i++) sum += x[i];

Since the elements of the array x will be stored in consecutive addresses, successive memory requests from the CPU while executing this code will not be independent. The assumption would be more justified if we were including cache effects, or (noticed by Earl Barr) if we are studying a timesharing system with a small quantum size.

Thus, many models of systems like this have been quite complex, in order to capture the effects of various things like caching, nonindependence and so on in the model. Nevertheless, one can often get some insight from even very simple models too. In any case, for our purposes here it is best to stick to simple models, so as to understand more easily.
Our state will be an m-tuple $(N_1, \ldots, N_m)$, where $N_i$ is the number of requests currently pending at memory module i. Recalling our assumption that a CPU generates another memory request immediately after the previous one is fulfilled, we always have that $N_1 + \ldots + N_m = m$.

It is straightforward to find the transition probabilities $p_{ij}$. Here are a couple of examples, with m = 2:

• $p_{(2,0),(1,1)}$: Recall that state (2,0) means that currently there are two requests pending at Module 1, one being served and one in the queue, and no requests at Module 2. For the transition (2,0) → (1,1) to occur, when the request being served at Module 1 is done, the CPU that issued it will make a new request, this time for Module 2. This will occur with probability $q_2$. Meanwhile, the request which had been queued at Module 1 will now start service. So, $p_{(2,0),(1,1)} = q_2$.

• $p_{(1,1),(1,1)}$: In state (1,1), both pending requests will finish in this cycle. To go to (1,1) again, that would mean that the two CPUs request different modules from each other (CPUs 1 and 2 choose Modules 1 and 2 or 2 and 1). Each of those two possibilities has probability $q_1 q_2$, so $p_{(1,1),(1,1)} = 2 q_1 q_2$.

We then solve for the $\pi$, using (7.7). It turns out, for example, that

$$\pi_{(1,1)} = \frac{q_1 q_2}{1 - 2 q_1 q_2} \tag{7.27}$$

7.1.4.2 Going Beyond Finding $\pi$

Let B denote the number of memory requests completed in a given memory cycle. Then we may be interested in E(B), the number of requests completed per unit time, i.e. per cycle. We can find E(B) as follows. Let S denote the current state. Then, continuing the case m = 2, we have from the Law of Total Expectation, [6]

\begin{align}
E(B) &= E[E(B \mid S)] \tag{7.28} \\
&= P(S = (2,0))\, E(B \mid S = (2,0)) + P(S = (1,1))\, E(B \mid S = (1,1)) + P(S = (0,2))\, E(B \mid S = (0,2)) \tag{7.29} \\
&= \pi_{(2,0)} E(B \mid S = (2,0)) + \pi_{(1,1)} E(B \mid S = (1,1)) + \pi_{(0,2)} E(B \mid S = (0,2)) \tag{7.30}
\end{align}

All this equation is doing is finding the overall mean of B by breaking down into the cases for the different states.
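The m = 2 chain is small enough to check numerically. The sketch below is ours, not part of the original model description: it fills in the remaining transition probabilities by the same reasoning as the two examples above (an assumption on our part), finds $\pi$ by repeatedly applying (7.5), and evaluates (7.30) using the fact that one request completes per cycle in states (2,0) and (0,2) and two complete in state (1,1). The function names and the value $q_1 = 0.3$ are arbitrary.

```python
def stationary_by_iteration(P, iters=5000):
    """Approximate the stationary distribution by applying (7.5) many times."""
    n = len(P)
    dist = [1.0 / n] * n
    for _ in range(iters):
        dist = [sum(dist[k] * P[k][i] for k in range(n)) for i in range(n)]
    return dist

def shared_memory_m2(q1):
    """Return (pi, E(B)) for m = 2; states are ordered (2,0), (1,1), (0,2)."""
    q2 = 1.0 - q1
    P = [
        [q1,      q2,          0.0],      # from (2,0): the finishing CPU picks Module 1 or 2
        [q1 * q1, 2 * q1 * q2, q2 * q2],  # from (1,1): both CPUs pick anew
        [0.0,     q1,          q2],       # from (0,2): mirror image of (2,0)
    ]
    pi = stationary_by_iteration(P)
    completed = [1, 2, 1]                 # E(B | S) for the three states
    eb = sum(p * b for p, b in zip(pi, completed))
    return pi, eb

q1 = 0.3
pi, eb = shared_memory_m2(q1)
pi11_closed_form = q1 * (1 - q1) / (1 - 2 * q1 * (1 - q1))   # formula (7.27)
print(pi[1], pi11_closed_form, eb)

_, eb_equal = shared_memory_m2(0.5)       # both modules equally likely
print(eb_equal)
```

Iterating (7.5) is adequate here because the chain has only three states and settles down quickly; for larger m one would solve the balance equations directly.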
Now if we are in state (2,0), only one request will be completed this cycle, and B will be 1. Thus $E(B \mid S = (2,0)) = 1$. Similarly, $E(B \mid S = (1,1)) = 2$ and so on. After doing all the algebra, we find that

$$E(B) = \frac{1 - q_1 q_2}{1 - 2 q_1 q_2} \tag{7.31}$$

The maximum value of E(B) occurs when $q_1 = q_2 = \frac{1}{2}$, in which case E(B) = 1.5. This is a lot less than the maximum capacity of the memory system, which is m = 2 requests per cycle.

So, we can learn a lot even from this simple model, in this case learning that there may be a substantial underutilization of the system. This is a common theme in probabilistic modeling: Simple models may be worthwhile in terms of insight provided, even if their numerical predictions may not be too accurate.

[6] Actually, we could take a more direct route in this case, noting that B can only take on the values 1 and 2. Then $E(B) = P(B = 1) + 2P(B = 2) = \pi_{(2,0)} + \pi_{(0,2)} + 2\pi_{(1,1)}$. But the analysis below extends better to the case of general m.

7.1.5 Example: Slotted ALOHA

Recall the slotted ALOHA model from Chapter 1:

• Time is divided into slots or epochs.

• There are n nodes, each of which is either idle or has a single message transmission pending. So, a node doesn't generate a new message until the old one is successfully transmitted (a very unrealistic assumption, but we're keeping things simple here).

• In the middle of each time slot, each of the idle nodes generates a message with probability q.

• Just before the end of each time slot, each active node attempts to send its message with probability p.

• If more than one node attempts to send within a given time slot, there is a collision, and each of the transmissions involved will fail.

• So, we include a backoff mechanism: a node with a message does not always attempt to send it; it attempts to send only with probability p.

So, p is a design parameter, which must be chosen carefully. If p is too large, we will have too many collisions, thus increasing the average time to send a message. If p is too small, a node will often refrain from sending even if no other node is there to collide with.
Define our state for any given time slot to be the number of nodes currently having a message to send at the very beginning of the time slot (before new messages are generated). Then for 0 [...]

Now, to calculate P(success xmit | in state s), recall that in state s we start the slot with s nonidle nodes, but that we may acquire some new ones; each of the n-s idle nodes will create a new message, with probability q. So,

$$P(\text{success xmit} \mid \text{in state } s) = \sum_{j=0}^{n-s} \binom{n-s}{j} q^j (1-q)^{n-s-j} \cdot (s+j)(1-p)^{s+j-1} p \tag{7.36}$$

Substituting into (7.35), we have

$$\tau = \sum_{s=0}^{n} \sum_{j=0}^{n-s} \binom{n-s}{j} q^j (1-q)^{n-s-j} \cdot (s+j)(1-p)^{s+j-1} p \, \pi_s \tag{7.37}$$

With some more subtle reasoning, one can derive the mean time a message waits before being successfully transmitted, as follows:

Focus attention on one particular node, say Node 0. It will repeatedly cycle through idle and busy periods, I and B. We wish to find E(B). I has a geometric distribution with parameter q, [7] so

$$E(I) = \frac{1}{q} \tag{7.38}$$

Then if we can find E(I+B), we will get E(B) by subtraction.

To find E(I+B), note that there is a one-to-one correspondence between I+B cycles and successful transmissions; each I+B period ends with a successful transmission at Node 0. Imagine again observing this node for, say, 100000 time slots, and say E(I+B) is 2000. That would mean we'd have about 50 cycles, thus 50 successful transmissions from this node. In other words, the throughput would be approximately 50/100000 = 0.0005 = 1/E(I+B). So, a fraction

$$\frac{1}{E(I+B)} \tag{7.39}$$

of the time slots have successful transmissions from this node.
But that quantity is the throughput for this node (number of successful transmissions per unit time), and due to the symmetry of the system, that throughput is 1/n of the total throughput of the n nodes in the network, which we denoted above by $\tau$.

So,

$$E(I+B) = \frac{n}{\tau} \tag{7.40}$$

[7] If a message is sent in the same slot in which it is created, we will count B as 1. If it is sent in the following slot, B = 2, etc. B will have a modified geometric distribution starting at 0 instead of 1, but we will ignore this here for the sake of simplicity.

Thus from (7.38) and (7.40) we have

$$E(B) = \frac{n}{\tau} - \frac{1}{q} \tag{7.41}$$

where of course $\tau$ is the function of the $\pi_i$ in (7.35).

Now let's find the proportion of attempted transmissions which are successful. This will be

$$\frac{E(\text{number of successful transmissions in a slot})}{E(\text{number of attempted transmissions in a slot})} \tag{7.42}$$

To see why this is the case, again think of watching the network for 100,000 slots. Then the proportion of successful transmissions during that period of time is the number of successful transmissions divided by the number of attempted transmissions. Dividing each of those two numbers by 100,000 gives approximately the numerator and denominator of (7.42).

Now, how do we evaluate (7.42)? Well, the numerator is easy, since it is $\tau$, which we found before. The denominator will be

$$\sum_s \pi_s \left[ sp + (n-s)pq \right] \tag{7.43}$$

The factor sp + (n-s)pq comes from the following reasoning. If we are in state s, the s nodes which already have something to send will each transmit with probability p, so there will be an expected number sp of them that try to send. Also, of the n-s nodes which are idle at the beginning of the slot, an expected (n-s)q of them will generate new messages, and of those (n-s)q, an expected (n-s)qp will try to send.

7.2 Hidden Markov Models

The word hidden in the term Hidden Markov Model (HMM) refers to the fact that the state of the process is hidden, i.e. unobservable.

Actually, we've already seen an example of this, back in Section 7.1.3. There the state, actually just part of it, was unobservable, namely the status of the receiver being up or down. But there we were not trying to guess $X_n$ from $Y_n$ (see below), so it probably would not be considered an HMM.
An HMM consists of a Markov chain $X_n$ which is unobservable, together with observable values $Y_n$. The $X_n$ are governed by the transition probabilities $p_{ij}$, and the $Y_n$ are generated from the $X_n$ according to

$$r_{km} = P(Y_n = m \mid X_n = k) \tag{7.44}$$

Typically the idea is to guess the $X_n$ from the $Y_n$ and our knowledge of the $p_{ij}$ and $r_{km}$. The details are too complex to give here, but you can at least understand that Bayes' Rule comes into play.

A good example of HMMs would be in text mining applications. Here the $Y_n$ might be words in the text, and $X_n$ would be their parts of speech (POS): nouns, verbs, adjectives and so on. Consider the word round, for instance. Your first thought might be that it is an adjective, but it could be a noun (e.g. an elimination round in a tournament) or a verb (e.g. to round off a number or round a corner). The HMM would help us to guess which, and therefore guess the true meaning of the word.

HMMs are also used in speech processing, DNA modeling and many other applications.

7.3 Continuous-Time Markov Chains

In the Markov chains we analyzed above, events occur only at integer times. However, many Markov chain models are of the continuous-time type, in which events can occur at any times. Here the holding time, i.e. the time the system spends in one state before changing to another state, is a continuous random variable.
The state of a Markov chain at any time now has a continuous subscript. Instead of the chain consisting of the random variables $X_n$, n = 1,2,3,... (you can also start n at 0 in the sense of Section 7.1.2.2), it now consists of $\{X_t : t \in [0, \infty)\}$. The Markov property is now

$$P(X_{t+u} = k \mid X_s \text{ for all } 0 \le s \le t) = P(X_{t+u} = k \mid X_t) \text{ for all } t, u > 0 \tag{7.45}$$

7.3.1 Holding-Time Distribution

In order for the Markov property to hold, the distribution of holding time at a given state needs to be "memoryless." You may recall that exponentially distributed random variables have this property. In other words, if a random variable W has density

$$f(t) = \lambda e^{-\lambda t} \tag{7.46}$$

for some $\lambda$ then

$$P(W > r + s \mid W > r) = P(W > s) \tag{7.47}$$

for all positive r and s. Actually, one can show that exponential distributions are the only continuous distributions which have this property. Therefore, holding times in Markov chains must be exponentially distributed.

It is difficult for the beginning modeler to fully appreciate the memoryless property. You are urged to read the material on exponential distributions in Section 2.3.4.1 before continuing.

Because it is central to the Markov property, the exponential distribution is assumed for all basic activities in Markov models. In queuing models, for instance, both the interarrival time and service time are assumed to be exponentially distributed (though of course with different values of $\lambda$). In reliability modeling, the lifetime of a component is assumed to have an exponential distribution.

Such assumptions have in many cases been verified empirically. If you go to a bank, for example, and record data on when customers arrive at the door, you will find the exponential model to work well (though you may have to restrict yourself to a given time of day, to account for nonrandom effects such as heavy traffic at the noon hour). In a study of time to failure for airplane air conditioners, the distribution was also found to be well fitted by an exponential density. On the other hand, in many cases the distribution is not close to exponential, and purely Markovian models cannot be used.

7.3.2 The Notion of "Rates"

A key point is that the parameter $\lambda$ in (7.46) has the interpretation of a rate, in the sense we will now discuss.
First, recall that $1/\lambda$ is the mean. Say light bulb lifetimes have an exponential distribution with mean 100 hours, so $\lambda = 0.01$. In our lamp, whenever its bulb burns out, we immediately replace it with a new one. Imagine watching this lamp for, say, 100,000 hours. During that time, we will have done approximately 100000/100 = 1000 replacements. That would be using 1000 light bulbs in 100000 hours, so we are using bulbs at the rate of 0.01 bulb per hour. For a general $\lambda$, we would use light bulbs at the rate of $\lambda$ bulbs per hour. This concept is crucial to what follows.

7.3.3 Stationary Distribution

We again define $\pi_i$ to be the long-run proportion of time the system is in state i, and we again will derive a system of linear equations to solve for these proportions.

To this end, let $\lambda_i$ denote the parameter in the holding-time distribution at state i, and define the following:

• $U_{i,t}$ is the total time spent at state i up through time t

• $N_{i,t}$ is the number of visits to state i up through time t

• $H_{ij}$ is the holding time during the j-th visit to state i

The reason $U_{i,t}$ is of interest to us is that

$$\lim_{t \to \infty} \frac{U_{i,t}}{t} = \pi_i \tag{7.48}$$

Next, write

$$U_{i,t} = H_{i1} + H_{i2} + \ldots + H_{i,N_{i,t}} + \text{small error} \tag{7.49}$$

The reason for the small error is that at time t, we may be currently at state i, in a visit that has not yet finished. As $t \to \infty$, this term becomes negligible relative to the sum, so we'll ignore it.

Now in taking the expected value in (7.49), we need to deal with the fact that there is a random number of terms in the sum on the right-hand side. This we do using the Theorem of Total Expectation, as seen in the example in Section 3.8.1.3, yielding

$$E[U_{i,t}] = \frac{1}{\lambda_i} E[N_{i,t}] \tag{7.50}$$

since holding times at state i have mean $1/\lambda_i$. And then for large t

$$U_{i,t} \approx \frac{1}{\lambda_i} N_{i,t} \tag{7.51}$$

This may be clearer to you if you divide both sides by t. Both U and N are essentially cumulative sums, so dividing by t sets up something like the Strong Law of Large Numbers, discussed in Section 1.4.10.

The next point is to look at the rates of transitions into and out of state i. These should be equal in the long run, and that will be the basis for our balance equations.
The number of transitions out of i, up through time t (except for the small error), is equal to $N_{i,t}$. What about inbound transitions? Let $p_{ji}$ be the probability that, when a holding time at state j ends, our transition is to i. Then for large t, the number of transitions from state j to state i is approximately $N_{j,t}\, p_{ji}$. Equating the two, we have, again for large t

$$\sum_{j \ne i} N_{j,t}\, p_{ji} \approx N_{i,t} \tag{7.52}$$

Combining (7.51) and (7.52), we have

$$U_{i,t} \lambda_i \approx N_{i,t} \approx \sum_{j \ne i} N_{j,t}\, p_{ji} \approx \sum_{j \ne i} U_{j,t} \lambda_j p_{ji} \tag{7.53}$$

Dividing by t and taking limits, we have

$$\pi_i \lambda_i = \sum_{j \ne i} \pi_j \lambda_j p_{ji} \tag{7.54}$$

So, voila!, there are our balance equations, one for each i.

We will sometimes refer to quantities

$$\rho_{rs} = \lambda_r p_{rs} \tag{7.55}$$

with the following interpretation. In the context of the ideas in our example of the rate of light bulb replacements in Section 7.3.2, one can view (7.55) as the rate of transitions from r to s, during the time we are in state r. Equation (7.54) can then be interpreted as equating the rate of transitions into i and the rate out of i.

7.3.4 Minima of Independent Exponentially Distributed Random Variables

Equations (7.54) are fine, but in actual examples there will be an issue with finding the $p_{ji}$. The material in this section will be used for that purpose in later sections.

Suppose $W_1, \ldots, W_k$ are independent random variables, with $W_i$ being exponentially distributed with parameter $\lambda_i$. Let $Z = \min(W_1, \ldots, W_k)$. Then

(a) Z is exponentially distributed with parameter $\lambda_1 + \ldots + \lambda_k$

(b) $P(Z = W_i) = \dfrac{\lambda_i}{\lambda_1 + \ldots + \lambda_k}$

The sum $\lambda_1 + \ldots + \lambda_k$ in (a) should make good intuitive sense to you, for the following reasons. Say we have persons 1 and 2. Each has a lamp. Person i uses Brand i light bulbs. Say Brand i light bulbs have exponential lifetimes with parameter $\lambda_i$. Suppose each time person i replaces a bulb, he shouts out, "New bulb!" and each time anyone replaces a bulb, I shout out "New bulb!" Persons 1 and 2 are shouting at a rate of $\lambda_1$ and $\lambda_2$, respectively, so I am shouting at a rate of $\lambda_1 + \lambda_2$. Moreover, at any given time, the time at which I shout next will be the minimum of the times at which persons 1 and 2 shout next.

Similarly, (b) should be intuitively clear as well from the above "thought experiment," since for instance a proportion $\lambda_1 / (\lambda_1 + \lambda_2)$ of my shouts will be in response to person 1's shouts.
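Properties (a) and (b) can also be checked empirically. Here is a small Python simulation sketch; the helper name `simulate_min`, the rates 0.5 and 1.5, the seed and the number of replications are all illustrative assumptions, not taken from the text. It draws the $W_i$ repeatedly, and records the average of Z and how often each $W_i$ is the minimum.

```python
import random

def simulate_min(rates, nreps, rng):
    """Return the sample mean of Z = min(W_1..W_k) and the fraction of
    replications in which each W_i attains the minimum."""
    total = 0.0
    wins = [0] * len(rates)
    for _ in range(nreps):
        draws = [rng.expovariate(r) for r in rates]   # independent exponentials
        z = min(draws)
        total += z
        wins[draws.index(z)] += 1                     # which W_i was smallest
    return total / nreps, [w / nreps for w in wins]

rng = random.Random(1)           # fixed seed, chosen arbitrarily
rates = [0.5, 1.5]               # lambda_1, lambda_2
mean_z, win_frac = simulate_min(rates, 200_000, rng)
print(mean_z, win_frac)
```

With these rates, (a) says Z has parameter 2.0 and thus mean 0.5, and (b) says the win fractions should be near 0.25 and 0.75. Note that the sample mean is only consistent with (a); it does not by itself verify that Z is exponential.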
Properties (a) and (b) above are easy to prove, starting with the relation

$$F_Z(t) = 1 - P(Z > t) = 1 - P(W_1 > t \text{ and } \ldots \text{ and } W_k > t) = 1 - \prod_i e^{-\lambda_i t} = 1 - e^{-(\lambda_1 + \ldots + \lambda_k) t} \tag{7.56}$$

Taking $\frac{d}{dt}$ of both sides shows (a).

For (b), suppose k = 2. We have that

$$P(Z = W_1) = P(W_1 < W_2) = \int_0^\infty \lambda_1 e^{-\lambda_1 t}\, e^{-\lambda_2 t}\, dt = \frac{\lambda_1}{\lambda_1 + \lambda_2} \tag{7.57}$$

[...]

Suppose the time until failure of a single machine, carrying the full load of the factory, has an exponential distribution with mean 20.0, but the mean is 25.0 when the other machine is working, since it is not so loaded. Repair time is exponentially distributed with mean 8.0.

We can take as our state space {0,1,2}, where the state is the number of working machines. Now, let us find the parameters $\lambda_i$ and $p_{ji}$ for this system. For example, what about $\lambda_2$? The holding time in state 2 is the minimum of the two lifetimes of the machines, and thus from the results of Section 7.3.4, has parameter $\frac{1}{25.0} + \frac{1}{25.0} = 0.08$.

For $\lambda_1$, a transition out of state 1 will be either to state 2 (the down machine is repaired) or to state 0 (the up machine fails). The time until transition will be the minimum of the lifetime of the up machine and the repair time of the down machine, and thus will have parameter $\frac{1}{20.0} + \frac{1}{8.0} = 0.175$. Similarly, $\lambda_0 = \frac{1}{8.0} + \frac{1}{8.0} = 0.25$.

It is important to understand how the Markov property is being used here. Suppose we are in state 1, and the down machine is repaired, sending us into state 2. Remember, the machine which had already been up has "lived" for some time now. But the memoryless property of the exponential distribution implies that this machine is now "born again."
What about the parameters $p_{ji}$? Well, $p_{21}$ is certainly easy to find; since the transition $2 \to 1$ is the only transition possible out of state 2, $p_{21} = 1$.

For $p_{12}$, recall that transitions out of state 1 are to states 0 and 2, with rates 1/20.0 and 1/8.0, respectively. So,

$$p_{12} = \frac{1/8.0}{1/20.0 + 1/8.0} = 0.71 \tag{7.58}$$

Working in this manner, we finally arrive at the complete system of equations (7.54):

$$\pi_2 (0.08) = \pi_1 (0.125) \tag{7.59}$$

$$\pi_1 (0.175) = \pi_2 (0.08) + \pi_0 (0.25) \tag{7.60}$$

$$\pi_0 (0.25) = \pi_1 (0.05) \tag{7.61}$$

Of course, we also have the constraint $\pi_2 + \pi_1 + \pi_0 = 1$. The solution turns out to be

$$\pi = (\pi_0, \pi_1, \pi_2) = (0.072, 0.362, 0.566) \tag{7.62}$$

Thus for example, during 7.2% of the time, there will be no machine available at all.

Several variations of this problem could be analyzed. We could compare the two-machine system with a one-machine version. It turns out that the proportion of downtime (i.e. time when no machine is available) increases to 28.6%. Or we could analyze the case in which only one repairperson is employed by this factory, so that only one machine can be repaired at a time, compared to the situation above, in which we (tacitly) assumed that if both machines are down, they can be repaired in parallel. We leave these variations as exercises for the reader.

7.3.6 Continuous-Time Birth/Death Processes

We noted earlier that the system of equations for the $\pi_i$ may not be easy to solve. In many cases, for instance, the state space is infinite and thus the system of equations is infinite too. However, there is a rich class of Markov chains for which closed-form solutions have been found, called birth/death processes.⁸

Here the state space consists of (or has been mapped to) the set of nonnegative integers, and $p_{ji}$ is nonzero only in cases in which $|i - j| = 1$. (The name "birth/death" has its origin in Markov models of biological populations, in which the state is the current population size.) Note for instance that the example of the gracefully degrading system above has this form. An M/M/1 queue (one server, "Markov," i.e. exponential, interarrival times and Markov service times) is also a birth/death process, with the state being the number of jobs in the system.
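Since the machine-repair example has just been identified as a birth/death chain, it makes a convenient numerical check on the balance equations (7.54). The sketch below is an illustration only: the helper names `solve_linear` and `stationary` are hypothetical, and the pure-Python Gaussian elimination is simply one convenient way to solve a 3x3 system. The input is the matrix of rates $\lambda_r p_{rs}$ from (7.55), read off from the machine-repair description.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def stationary(rho):
    """rho[r][s] = lambda_r * p_rs for r != s; returns pi solving (7.54)."""
    n = len(rho)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j == i:
                # rate out of i: lambda_i = sum over s of rho[i][s]
                a[i][j] = sum(rho[i][s] for s in range(n) if s != i)
            else:
                a[i][j] = -rho[j][i]                  # rate into i from j
    # The balance equations are linearly dependent, so the last one is
    # replaced by the constraint that the pi_i sum to 1.
    a[n - 1] = [1.0] * n
    b = [0.0] * (n - 1) + [1.0]
    return solve_linear(a, b)

# State = number of working machines.
rho = [
    [0.0, 2 / 8.0, 0.0],         # state 0: two repairs in parallel, up to 1
    [1 / 20.0, 0.0, 1 / 8.0],    # state 1: failure to 0, repair to 2
    [0.0, 2 / 25.0, 0.0],        # state 2: either machine fails, down to 1
]
pi = stationary(rho)
print([round(p, 3) for p in pi])
```

Under these assumptions the result should agree with (7.62) to three decimal places, in the order $(\pi_0, \pi_1, \pi_2)$.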
Because the $p_{ji}$ have such a simple structure, there is hope that we can find a closed-form solution to (7.54), and it turns out we can. In the notation of (7.55), let $u_i = \rho_{i,i+1}$ and $d_i = \rho_{i,i-1}$ ('u' for "up," 'd' for "down"). Then (7.54) is

$$\pi_{i+1} d_{i+1} + \pi_{i-1} u_{i-1} = \pi_i \lambda_i = \pi_i (u_i + d_i), \quad i \ge 1 \tag{7.63}$$

$$\pi_1 d_1 = \pi_0 \lambda_0 = \pi_0 u_0 \tag{7.64}$$

In other words,

$$\pi_{i+1} d_{i+1} - \pi_i u_i = \pi_i d_i - \pi_{i-1} u_{i-1}, \quad i \ge 1 \tag{7.65}$$

$$\pi_1 d_1 - \pi_0 u_0 = 0 \tag{7.66}$$

Applying (7.65) recursively to the base (7.66), we see that

$$\pi_i d_i - \pi_{i-1} u_{i-1} = 0, \quad i \ge 1 \tag{7.67}$$

so that

$$\pi_i = \pi_{i-1} \frac{u_{i-1}}{d_i}, \quad i \ge 1 \tag{7.68}$$

and thus

$$\pi_i = \pi_0 r_i \tag{7.69}$$

where

$$r_i = \prod_{k=1}^{i} \frac{u_{k-1}}{d_k} \tag{7.70}$$

where $r_i = 0$ for $i > m$ if the chain has no states past m.

Then since the $\pi_i$ must sum to 1, we have that

$$\pi_0 = \frac{1}{1 + \sum_{i=1}^{\infty} r_i} \tag{7.71}$$

and the other $\pi_i$ are then found via (7.69).

Note that the chain might be finite, i.e. have $u_i = 0$ for some i. In that case it is still a birth/death chain, and the formulas above for $\pi$ still apply.

⁸ Though we treat the continuous-time case here, there is also a discrete-time analog.

7.3.7 Example: Computer Worm

Not all interesting Markov chains have stationary distributions. Here is an example in which other considerations come into play. This chain happens to be a birth/death chain, but it is "pure birth," and thus does not have a stationary distribution, or more accurately, has its stationary distribution concentrated on the absorbing state.

A computer science graduate student at UCD, C. Senthilkumar, was working on a worm alert mechanism. A simplified version of the model is that network hosts are divided into groups of size g, say on the basis of sharing the same router. Each infected host tries to infect all the others in the group. When g-1 group members are infected, an alert is sent to the outside world.
The student was studying this model via simulation, and found some surprising behavior. No matter how large he made g, the mean time until an external alert was raised seemed bounded. He asked me for advice.

I modeled this as a pure birth process. In state i, there are i infected hosts, each trying to infect all of the g-i noninfected hosts. When the process reaches state g-1, the process ends; we call this state an absorbing state, i.e. one from which the process never leaves.

Suppose that for each infected/noninfected pair of hosts, the time to infection of the noninfected member by the infected member has an exponential distribution with mean 1.0. Assume independence among the various infection attempts. Since in state i there are i(g-i) such pairs, and since we go to state i+1 when the first infection among these occurs, we have $\lambda_i = i(g-i)$. Thus the mean time to go from state i to state i+1 is 1/[i(g-i)].

Then the mean time to go from state 1 to state g-1 is

$$\sum_{i=1}^{g-1} \frac{1}{i(g-i)} \tag{7.72}$$

Using a calculus approximation, we have

$$\int_1^{g-1} \frac{1}{x(g-x)}\, dx = \frac{1}{g} \int_1^{g-1} \left( \frac{1}{x} + \frac{1}{g-x} \right) dx = \frac{2}{g} \ln(g-1) \tag{7.73}$$

The latter quantity goes to zero as $g \to \infty$. This confirms that the behavior seen by the student in simulations holds in general. In other words, (7.72) remains bounded as $g \to \infty$. This is a very interesting result, since it says that the mean time to alert is bounded no matter how big our group size is.

7.4 Hitting Times Etc.

In this section we're interested in the amount of time it takes to get from one state to another, including cases in which this might be infinite.

7.4.1 Some Mathematical Conditions

There is a rich mathematical theory regarding the asymptotic behavior of Markov chains. We will not present such material here in this brief introduction, but we will give an example of the implications the theory can have.
A state in a Markov chain is called recurrent if it is guaranteed that, if we start at that state, we will return to the state infinitely many times. A nonrecurrent state is called transient.

Let $T_{ii}$ denote the time needed to return to state i if we start there.⁹ Note that an equivalent definition of recurrence is that $P(T_{ii} < \infty) = 1$, i.e. we are sure to return to i at least once. By the Markov property, if we are sure to return once, then we are sure to return again once after that, and so on, so this implies infinitely many visits.

A recurrent state i is called positive recurrent if $E(T_{ii}) < \infty$, while a state which is recurrent but not positive recurrent is called null recurrent.

Let $T_{ij}$ be the time it takes to get to state j if we are now in i. Note that this is measured from the time that we enter state i to the time we enter state j.

One can show that in the discrete time case, a state i is recurrent if and only if

$$\sum_{n=0}^{\infty} P(X_n = i \mid X_0 = i) = \infty \tag{7.74}$$

Note that the terms here are the probabilities of being at i at time n, not of returning for the first time at time n; the latter probabilities sum to at most 1.

Consider an irreducible Markov chain, meaning one which has the property that one can get from any state to any other state (though not necessarily in one step). One can show that in an irreducible chain, if one state is recurrent then they all are. The same statement holds if "recurrent" is replaced by "positive recurrent."

⁹ Keep in mind that $T_{ii}$ is the time from one entry to state i to the next entry to state i. So, it includes time spent in i, which is 1 unit of time for a discrete-time chain and a random exponential amount of time in the continuous-time case, and then time spent away from i, up to the time of next entry to i.

7.4.2 Example: Random Walks

Consider the famous random walk on the full set of integers: At each time step, one goes left one integer or right one integer (e.g. to +3 or +5 from +4), with probability 1/2 each. In other words, we flip a coin and go left for heads, right for tails.

If we start at 0, then we are back at 0 at time n exactly when we have accumulated an equal number of heads and tails. So for even-numbered n, i.e. n = 2m, we have

$$P(X_n = 0 \mid X_0 = 0) = P(m \text{ heads and } m \text{ tails}) = \binom{2m}{m} \frac{1}{2^{2m}} \tag{7.75}$$

One can use Stirling's approximation,

$$m! \approx \sqrt{2\pi}\, e^{-m} m^{m + 1/2} \tag{7.76}$$

to show that the series (7.74) diverges in this case. So, this chain (meaning all states in the chain) is recurrent. However, it is not positive recurrent.
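The divergence can be seen numerically as well. The following Python sketch is illustrative only (the function name `return_probs` and the cutoffs are assumptions). It computes the terms (7.75) through the recurrence $c_m = c_{m-1} (2m-1)/(2m)$, which follows from the ratio of successive binomial coefficients, and compares them with $1/\sqrt{\pi m}$, the form that Stirling's approximation (7.76) gives for (7.75).

```python
import math

def return_probs(mmax):
    """Terms C(2m, m) / 2^(2m) of (7.75) for m = 1, ..., mmax."""
    probs = []
    c = 1.0                              # value for m = 0
    for m in range(1, mmax + 1):
        c *= (2 * m - 1) / (2 * m)       # ratio of successive terms
        probs.append(c)
    return probs

probs = return_probs(10_000)
# Stirling-based approximation to the same terms
stirling = [1 / math.sqrt(math.pi * m) for m in range(1, 10_001)]
# Partial sums of the series over even n = 2m, for growing cutoffs
partial = [sum(probs[:k]) for k in (100, 1_000, 10_000)]
print(probs[-1] / stirling[-1], partial)
```

The terms decay only like $m^{-1/2}$, so the partial sums keep growing, roughly like $2\sqrt{M/\pi}$ for cutoff M, rather than leveling off. A finite computation of course only suggests divergence; the proof is via (7.76).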
The same is true for the corresponding random walk on the two-dimensional integer lattice (moving up, down, left or right with probability 1/4 each). However, in the three-dimensional case, the chain is not even null recurrent; it is transient.

7.4.3 Finding Hitting and Recurrence Times

For a positive recurrent state i in a discrete-time Markov chain,

$$\pi_i = \frac{1}{E(T_{ii})} \tag{7.77}$$

The approach to deriving this is similar to that of Section 7.1.5.1. Define alternating On and Off subcycles, where On means we are at state i and Off means we are elsewhere. An On subcycle has duration 1, and an Off subcycle has duration $T_{ii} - 1$. Define a full cycle to consist of an On subcycle followed by an Off subcycle.

Then intuitively the proportion of time we are in state i is

$$\pi_i = \frac{E(\text{On})}{E(\text{On}) + E(\text{Off})} = \frac{1}{1 + E(T_{ii} - 1)} = \frac{1}{E(T_{ii})} \tag{7.78}$$

The equation is similar for the continuous-time case. Here $E(\text{On}) = 1/\lambda_i$. The Off subcycle has mean duration $E(T_{ii}) - 1/\lambda_i$. Note that $T_{ii}$ is measured from the time we enter state i once until the time we enter it again. We then have

$$\pi_i = \frac{1/\lambda_i}{E(T_{ii})} \tag{7.79}$$

Thus positive recurrence means that $\pi_i > 0$. For a null recurrent chain, the limits in Equation (7.3) are 0, which means that there may be rather little one can say of interest regarding the long-run behavior of the chain.

We are often interested in finding quantities of the form $E(T_{ij})$. We can do so by setting up systems of equations similar to the balance equations used for finding stationary distributions.

First consider the discrete case. Conditioning on the first step we take after being at state i, we have

$$E(T_{ij}) = \sum_{k \ne j} p_{ik} [1 + E(T_{kj})] + p_{ij} \cdot 1 \tag{7.80}$$

By varying i and j in (7.80), we get a system of linear equations which we can solve to find the $E(T_{ij})$. Note that (7.77) gives us equations we can use here too.

The continuous version uses the same reasoning:

$$E(T_{ij}) = \sum_{k \ne j} p_{ik} \left[ \frac{1}{\lambda_i} + E(T_{kj}) \right] + p_{ij} \cdot \frac{1}{\lambda_i} \tag{7.81}$$

One can use a similar analysis to determine the probability of ever reaching a state, in chains which have transient or absorbing states. For fixed j define

$$\alpha_i = P(T_{ij} < \infty) \tag{7.82}$$

Then

$$\alpha_i = \sum_{k \ne j} p_{ik} \alpha_k + p_{ij} \tag{7.83}$$

7.4.4 Example: Finite Random Walk

Let's go back to the example in Section 7.1.1.
Suppose we start our random walk at 2. How long will it take to reach state 4? Set $b_i = E(T_{i4} \mid \text{start at } i)$. From (7.80) we could set up equations like

$$b_2 = \frac{1}{3}(1 + b_1) + \frac{1}{3}(1 + b_2) + \frac{1}{3}(1 + b_3) \tag{7.84}$$

Now change the model a little, and make states 1 and 6 absorbing. Suppose we start at position 3. What is the probability that we eventually are absorbed at 6 rather than 1? We could set up equations like (7.83) to find this.

7.4.5 Example: Tree-Searching

Consider the following Markov chain with infinite state space {0,1,2,3,...}.¹⁰ The transition matrix is defined by $p_{i,i+1} = q_i$ and $p_{i0} = 1 - q_i$. This kind of model has many different applications, including in computer science tree-searching algorithms. (The state represents the level in the tree where the search is currently, and a return to 0 represents a backtrack. More general backtracking can be modeled similarly.)

The question at hand is, What conditions on the $q_i$ will give us a positive recurrent chain?

Assuming $0 < q_i < 1$ for all i, starting from 0 we have still not returned to 0 after n steps exactly when the first n transitions, out of states 0, 1, ..., n-1, all go up. So

$$P(T_{00} > n) = \prod_{i=0}^{n-1} q_i \tag{7.85}$$

Therefore, the chain is recurrent if and only if

$$\lim_{n \to \infty} \prod_{i=0}^{n-1} q_i = 0 \tag{7.86}$$

For positive recurrence, we need $E(T_{00}) < \infty$. Now, one can show that for any nonnegative integer-valued random variable Y

$$E(Y) = \sum_{n=0}^{\infty} P(Y > n) \tag{7.87}$$

Thus for positive recurrence, our condition on the $q_i$ is

$$\sum_{n=0}^{\infty} \prod_{i=0}^{n-1} q_i < \infty \tag{7.88}$$

¹⁰ Adapted from Performance Modelling of Communication Networks and Computer Architectures, by P. Harrison and N. Patel, pub. by Addison-Wesley, 1993.

Exercises

1. Consider a "wraparound" variant of the random walk in Section 7.1.1. We still have a reflecting barrier at 1, but at 5, we go back to 4, stay at 5 or "wrap around" to 1, each with probability 1/3. Find the new set of stationary probabilities.

2. Consider the Markov model of the shared-memory multiprocessor system in our PLN. In each part below, your answer will be a function of $q_1, \ldots, q_m$.

(a) For the case m = 3, find $p_{(\ldots,0,1),(\ldots,1,1)}$.