Citation
A Comprehensive Methodology for Analysis of Diverse Processor Architectures

Material Information

Title:
A Comprehensive Methodology for Analysis of Diverse Processor Architectures
Creator:
Richardson, Justin W
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
1 online resource (122 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
GEORGE,ALAN DALE
Committee Co-Chair:
GORDON-ROSS,ANN M
Committee Members:
LAM,HERMAN
OLEXA,MICHAEL T
Graduation Date:
5/2/2015

Subjects

Subjects / Keywords:
Atoms ( jstor )
Bandwidth ( jstor )
Benchmarking ( jstor )
Computer memory ( jstor )
Data processing ( jstor )
Distance functions ( jstor )
Input output ( jstor )
Integers ( jstor )
Narrative devices ( jstor )
Vendors ( jstor )
Electrical and Computer Engineering -- Dissertations, Academic -- UF
comparison -- cpu -- device -- dsp -- fpga -- gpu -- performance -- reconfigurable
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Electrical and Computer Engineering thesis, Ph.D.

Notes

Abstract:
The modern computational device landscape is a varied and diverse community of devices. Developers need a way to quickly and fairly analyze various devices for use with their particular applications. This paper proposes a framework for early device comparison based on computational capacity, memory availability, and power. Building upon computational performance metrics, this study also adds memory-based and realizable efficiency metrics. This framework allows developers and device designers to compare devices, saving money and time by reducing re-design iterations. The proposed metrics enable comparisons between fixed and reconfigurable devices such as: Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), embedded system-on-chip architectures (SoCs), and performance-based Central Processing Units (CPUs). Using metrics to characterize the computational abilities, memory limitations, and realizable efficiency, this work examines a large spectrum of devices from the perspectives of performance, I/O, power, and efficiency. The dissertation highlights the trade-offs between the architectures studied, strengths of reconfigurable devices, the benefits of GPU-based devices, and the efficiencies of new embedded system-on-a-chip architectures. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2015.
Local:
Adviser: GEORGE,ALAN DALE.
Local:
Co-adviser: GORDON-ROSS,ANN M.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2015-11-30
Statement of Responsibility:
by Justin W Richardson.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Embargo Date:
11/30/2015
Classification:
LD1780 2015 ( lcc )

Full Text

PAGE 1

A COMPREHENSIVE METHODOLOGY FOR ANALYSIS OF DIVERSE PROCESSOR ARCHITECTURES

By

JUSTIN W. RICHARDSON

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2015

PAGE 2

© 2015 Justin W. Richardson

PAGE 3

I dedicate this to my loving family who helped make this possible.

PAGE 4

ACKNOWLEDGMENTS

Thank you to all those I worked with during the course of this research. A big thank you to CHREC and Dr. George for the opportunity. Finally, thank you to my family for supporting me through the PhD process.

PAGE 5

TABLE OF CONTENTS

page

ACKNOWLEDGMENTS .... 4
LIST OF TABLES .... 7
LIST OF FIGURES .... 8
ABSTRACT .... 10

CHAPTER

1 INTRODUCTION .... 11
2 RELATED WORK .... 13
3 INTERNAL DEVICE METRICS .... 18
  3.1 MC Taxonomy .... 20
  3.2 Metric Methodology .... 20
    3.2.1 Bit-level Computational Density .... 20
    3.2.2 Integer Computational Density .... 22
    3.2.3 Floating-point Computational Density .... 25
    3.2.4 Power Consumption .... 25
    3.2.5 Internal Memory Bandwidth .... 26
  3.3 Accelerator Overview .... 28
    3.3.1 FMC Devices .... 28
    3.3.2 RMC Devices .... 32
  3.4 Results and Analysis .... 35
    3.4.1 Bit-level Computational Density per Watt .... 38
    3.4.2 16-bit Integer Computational Density per Watt .... 38
    3.4.3 32-bit Integer Computational Density per Watt .... 39
    3.4.4 Single-Precision Floating-Point Computational Density per Watt .... 41
    3.4.5 Double-Precision Floating-Point Computational Density per Watt .... 43
    3.4.6 Internal Memory Bandwidth .... 44
  3.5 Discussion .... 46
4 EXTERNAL DEVICE METRICS .... 51
  4.1 Memory-based device metrics .... 51
    4.1.1 Review of CD and CD/W .... 51
    4.1.2 Review of IMB .... 54
    4.1.3 External Memory Bandwidth .... 55
    4.1.4 IOB .... 56
  4.2 Results and Analysis .... 58
    4.2.1 CD and CD/W Metrics .... 59
    4.2.2 EMB Results .... 60

PAGE 6

    4.2.3 IOB Results .... 62
  4.3 Discussion .... 64
5 REALIZABLE UTILIZATION .... 66
  5.1 Expanded Computational Device Metrics .... 66
  5.2 CD and CD/W .... 68
    5.2.1 Review of CD and CD/W .... 69
    5.2.2 CD Results and Analysis .... 73
    5.2.3 CD/W Results and Analysis .... 75
  5.3 Device Memory Metrics .... 78
    5.3.1 External Memory Bandwidth .... 78
    5.3.2 Input/Output Bandwidth .... 79
    5.3.3 EMB Results .... 80
    5.3.4 IOB Results .... 82
  5.4 Realizable Utilization (RU) Metric .... 84
    5.4.1 RU Methodology .... 85
    5.4.2 Calculating RU .... 86
    5.4.3 Using RU .... 87
    5.4.4 Arithmetic Kernels for RU .... 88
    5.4.5 Literature Study Results .... 89
    5.4.6 RU Benchmarking Results .... 94
  5.5 Discussion .... 97
6 CONCLUSIONS .... 99

APPENDIX: ADDITIONAL DATA .... 101
REFERENCES .... 117
BIOGRAPHICAL SKETCH .... 122

PAGE 7

LIST OF TABLES

Table / page

2-1 Reconfigurability factors .... 17
3-1 Device classification .... 29
3-2 FMC device processing features .... 30
3-3 FMC device power and memory features .... 31
3-4 FPGA device processing features .... 33
3-5 FPGA device power and memory features .... 34
3-6 Maximum CD (in billions of operations per second or GOPs) .... 36
3-7 FPGA achievable frequency (in MHz) .... 37
A-1 EMB of non-FPGA devices .... 102
A-2 CD of FPGA devices showing frequency and power variations .... 103
A-3 IMB for BBS over a range of achievable frequencies (GB/s) .... 103
A-4 IMB for CBS for various hit rates (GB/s) .... 104
A-5 Memory-sustainable CD across precisions .... 105
A-6 EMB of RLD and FLD devices .... 106
A-7 Computational density .... 107
A-8 Computational density per watt .... 110
A-9 Matrix multiplication .... 113
A-10 Matrix decomposition .... 114
A-11 N-body simulation .... 115
A-12 RU benchmarking results .... 116

PAGE 8

LIST OF FIGURES

Figure / page

3-1 MC taxonomy .... 21
3-2 Bit-level computational density per watt (FMC) .... 39
3-3 Bit-level computational density per watt (RMC) .... 40
3-4 16-bit integer computational density per watt (FMC) .... 41
3-5 16-bit integer computational density per watt (RMC) .... 42
3-6 32-bit integer computational density per watt (FMC) .... 43
3-7 32-bit integer computational density per watt (RMC) .... 44
3-8 Single-precision floating-point computational density per watt (FMC) .... 45
3-9 Single-precision floating-point computational density per watt (RMC) .... 46
3-10 Double-precision floating-point computational density per watt (FMC) .... 47
3-11 Double-precision floating-point computational density per watt (RMC) .... 48
3-12 Internal memory bandwidth (CBS) .... 49
3-13 Internal memory bandwidth (BBS) .... 50
4-1 CD/W for RMC devices .... 59
4-2 CD/W for FMC devices .... 61
4-3 EMB for FPGA devices .... 62
4-4 EMB for FMC devices .... 63
4-5 IOB data for selected devices .... 64
5-1 CD for select fixed-logic processors .... 74
5-2 CD for select reconfigurable-logic devices .... 75
5-3 CD/W for select fixed-logic devices .... 76
5-4 CD/W for select reconfigurable-logic processors .... 77
5-5 EMB for select fixed-logic processors .... 81
5-6 EMB for select reconfigurable-logic devices .... 82
5-7 IOB data for select fixed-logic devices .... 83

PAGE 9

5-8 IOB data for select reconfigurable-logic devices .... 84
5-9 Concept diagram of realizable utilization .... 86
5-10 Realized utilization for matrix multiplication .... 90
5-11 Realizable utilization for matrix decomposition .... 91
5-12 Realized utilization for n-body simulations .... 92
5-13 Combined realizable utilization results .... 94
5-14 Realizable utilization benchmarking results .... 95

PAGE 10

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

A COMPREHENSIVE METHODOLOGY FOR ANALYSIS OF DIVERSE PROCESSOR ARCHITECTURES

By

Justin W. Richardson

May 2015

Chair: Alan D. George
Major: Electrical and Computer Engineering

The modern computational device landscape is a varied and diverse community of devices. Developers need a way to quickly and fairly analyze various devices for use with their particular applications. This paper proposes a framework for early device comparison based on computational capacity, memory availability, and power. Building upon computational performance metrics, this study also adds memory-based and realizable efficiency metrics. This framework allows developers and device designers to compare devices, saving money and time by reducing re-design iterations. The proposed metrics enable comparisons between fixed and reconfigurable devices such as: Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), embedded system-on-chip architectures (SoCs), and performance-based Central Processing Units (CPUs). Using metrics to characterize the computational abilities, memory limitations, and realizable efficiency, this work examines a large spectrum of devices from the perspectives of performance, I/O, power, and efficiency. The dissertation highlights the trade-offs between the architectures studied, strengths of reconfigurable devices, the benefits of GPU-based devices, and the efficiencies of new embedded SoC architectures.

PAGE 11

CHAPTER 1
INTRODUCTION

The landscape of modern computational processing devices is large, varied, and extends across a broad spectrum of different architectures, features, and target applications. One of the goals of this study is the materialization of a methodology to facilitate quick and fair device characterization and comparison across both traditional fixed processing architectures and newer reconfigurable device architectures. This goal also incorporates the development of a schema to generically quantify key performance factors in computational devices independent of their target application. Finally, this study aims to evaluate and compare theoretical, projected, and actual performance of processing devices across the device spectrum using the developed metrics.

A key motivation for this investigation is the need for a quick and easily adaptable system of fair device comparison for this world of fast-moving computational technologies. Such a rubric is required to allow both system builders and device manufacturers a process for making informed decisions for their applications, based on analytical or empirical metrics, as opposed to more subjective criteria. Secondly, innovation in computational device technology rapidly adds new devices and technologies, therefore requiring a flexible basis for comparison. Finally, standardized metrics allow for effective scrutiny of device efficiency against application performance, because applications rarely perform at an ideal level.

There are many challenges that inhibit the ease of device comparison. Chief among them is the complexity of characterizing a large and diverse set of devices. This device diversity provides significant room for tuned performance, but can make it extremely difficult to predict application performance for an abundance of varied devices, systems, and applications. In addition, for most computing platforms, expert knowledge is needed for devices to achieve maximum performance.

PAGE 12

This study is split into three major phases. Phase 1 examines the use of device metrics to characterize the computational capacities of both fixed and reconfigurable computing devices. Phase 2 expands these device metrics to include metrics to characterize the memory limitations of devices. Phase 3 expands the device space and introduces a new Realizable Utilization metric to quantify the difference between application performance and the theoretical computational device ability as defined in the phase 1 metrics.

PAGE 13

CHAPTER 2
RELATED WORK

Many researchers have previously surveyed the field of computing devices and computing characterization techniques¹. The literature includes several classification techniques and we build off of many of them in this paper.

Previous works have used numerous criteria to classify both fixed and reconfigurable accelerators. Originally intended to describe fixed architecture devices, Flynn's taxonomy is a common method used to describe a device's parallelism. It classifies accelerators as Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instruction Single Data (MISD), and Multiple Instruction Multiple Data (MIMD) [Flynn 1966]. Host/coprocessor coupling treats the accelerator as a coprocessor to a traditional microprocessor host and classifies the accelerator based on the level of integration. The coprocessor can be directly connected to the host processor, connected via the memory bus, or connected as I/O [Compton and Hauck 2002; Radunovic and Milutinovic 1998].

There are many other important classifiers in the literature targeting reconfigurable accelerators. Device size is the amount of reconfigurable logic used for reconfigurable processing [Guccione and Gonzalez 1995]. The presence of on-chip memories and various memory configurations have also been used to classify devices [Guccione and Gonzalez 1995; Sawitzki and Spallek 1999]. Guccione and Gonzalez use device size and memory configuration to establish four categories of reconfigurable machines. A small reconfigurable device with no local memory is a Custom Instruction-Set Architecture. Large devices without local memory are Application-Specific Architectures.

¹ Portions of this chapter have been previously published and adapted/reprinted with permission of: Jason Williams, Chris Massie, Alan D. George, Justin Richardson, Kunal Gosrani, and Herman Lam. 2010. Characterization of Fixed and Reconfigurable Multi-Core Devices for Application Acceleration. ACM Trans. Reconfigurable Technol. Syst. 3, 4, Article 19 (November 2010), 29 pages. DOI=10.1145/1862648.1862649 http://doi.acm.org/10.1145/1862648.1862649

PAGE 14

For devices with local memory, small devices are classified as Reconfigurable Logic Coprocessors, and large devices are Reconfigurable Supercomputers [Guccione and Gonzalez 1995]. Fault tolerance is important for some mission-critical applications for both fixed and reconfigurable architectures [Radunovic and Milutinovic 1998]. For networks of devices, reconfigurability of the device-to-device interconnect is an important classifier [Radunovic and Milutinovic 1998]. Methods of reconfiguration, such as parallel or serial loading of a bitstream, and support for dynamic and partial reconfiguration, can be used to categorize many reconfigurable devices [Bondalapati and Prasanna 2002]. Vertical and horizontal microinstructions are used to distinguish devices in [Sima et al. 2002]. Vertical microinstructions only control one resource; horizontal microinstructions control multiple resources. The execution model refers to the operation of reconfigurable resources when coupled with a host system as described in [Compton and Hauck 2002]. Reconfigurable resources can operate simultaneously with host operation, or operation on the host can be suspended while the reconfigurable resources are processing.

There are numerous previously researched characterizations that are particularly applicable to our taxonomy. Processing element (PE) granularity and heterogeneity of components [Compton and Hauck 2002; Radunovic and Milutinovic 1998] are common classifiers in the literature. The granularity of a device is based on the native granularity of its basic processing elements. A heterogeneous device has processing elements of different types or structures that are optimized to perform different tasks. Homogeneous devices contain only a single type of computational unit. We primarily focus on reconfigurability and heterogeneity in our taxonomy.

One of the primary challenges of reconfigurable and exotic fixed device evaluation is acquiring computational performance metrics in terms that are comparable to traditional microprocessors. We leverage several related works on device performance

PAGE 15

characterization. Our computational density (CD) metric is primarily an adaptation of work done by DeHon. It relates processing element width, the number of processing elements, and clock frequency to performance, normalized by die area and process technology [DeHon 1996]. Floating-point performance evaluation methods for reconfigurable architectures are explored in [Strenski 2007] using vendor tools and datasheet information to determine the maximum number of processing elements for a particular operation that can be supported in parallel. Again, the maximum achievable frequency is used to relate the number of basic operations that can be done in parallel for a given precision (parallel operations) to performance. We extend this methodology to common integer operations. Several performance comparisons are shown in [Underwood and Hemmert 2004] that demonstrate the applicability of RC technologies to floating-point operations.

Memory performance plays a significant role in overall system performance. Several works discuss the increasing disparity between the improvement in processor speeds and the improvement in memory speeds. Using general application memory access behaviour, Wulf and McKee show memory performance to be the dominant performance factor as the average memory access time exceeds the time to execute five instructions [Wulf and McKee 1995]. They also predict the average number of cycles per access to be 98.8 by 2010. Sohi and Franklin illustrate through simulation results that low cache bandwidth hampers overall system performance, particularly as instruction issuing or parallel processing capabilities increase [Sohi and Franklin 1991]. Burger et al. use simulation and execution time decomposition methods to show that memory bandwidth is a primary performance bottleneck due to aggressive latency-hiding techniques on several benchmarks [Burger et al. 1996]. Furthermore, many latency-hiding techniques actually exacerbate memory bandwidth limitations. System performance is increasingly dictated by the operand transfer rate from external memory and the effectiveness of on-chip memory in preserving operands for reuse.

PAGE 16

Saulsbury et al. suggest integrating simpler processors with DRAM memory. Tight integration between memory and simple single-scalar processors is shown to outperform high-end superscalar processors with traditional memory hierarchies and memory bandwidth limitations [Saulsbury et al. 1996].

Although much of the literature focuses on the growing gap between processor performance and off-chip memory performance, the message is the same regardless of memory structure or location in the memory hierarchy: as processing performance continues to increase, especially via explicit parallelism, more stress is placed on the memory system to provide data at a fast enough rate to keep processing elements fully utilized. The emphasis on the memory bottleneck points out the need for additional methods and metrics to evaluate memory performance. Our IMB metric is proposed to quantitatively assess on-chip memory performance.

While much of the previous work focused on separately classifying fixed and reconfigurable architectures, one important distinction is that we focus on incorporating both paradigms into a single, many-core taxonomy. A new taxonomy is needed due to the on-going architecture and application reformations. The taxonomy proposed is used both as a means to classify devices and to help select the appropriate in-depth characterization methods described in Section 3.1. Additionally, much of the previous focus has been on the computational performance of devices. Although computational performance is an important device selection criterion, we expand the selection process by incorporating two further criteria: power consumption, an issue of increasingly vital importance in both high-performance computing (HPC) and high-performance embedded computing (HPEC), and memory performance, to address potential memory bottleneck issues.

PAGE 17

Table 2-1. Reconfigurability factors

Factor | Description | Example
--- | --- | ---
Datapath | Device can change width or depth of datapath(s) | Reconfigure from four parallel datapaths with 3 pipeline stages to five parallel datapaths with 4 pipeline stages
Device Memory | Device can change width or depth of on-chip memory blocks | Reconfigure from a 32-bit, 1024-deep memory block to a 64-bit, 512-deep memory block
PE/Block | Device can change operation of PE/Block | Reconfigure PE from a Multiplier operation to a Multiply-&-Accumulate operation
Precision | Device can change numerical precision of PEs | Reconfigure PE from a 64-bit Multiplier to a 24-bit Multiplier
Interface | Device can change memory or I/O interface | Reconfigure memory interface from RLDRAM controller to a DDR2 RAM controller
Mode | Device can change assignment of tasks to processing elements | Reconfigure from all PEs performing task A to two PEs performing task A and two PEs performing task B
Power | Device can cycle power of PEs for performance and power tradeoff | Reconfigure PEs from high-power, high-performance operation to low-power, low-performance operation
Interconnect | Device can change communication paths between PEs on chip | Reconfigure communication from bus interconnection topology to mesh interconnection topology between PEs
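The eight factors of Table 2-1 are later used in Section 3.1 to separate fixed from reconfigurable devices according to how many factors a device exhibits. A minimal Python sketch of that idea follows. The names `Factor` and `classify`, and in particular the numeric `threshold`, are hypothetical: the text only says "zero or very few" versus "many" and does not give a cut-off.

```python
from enum import Enum, auto


class Factor(Enum):
    """The eight reconfigurability factors listed in Table 2-1."""
    DATAPATH = auto()
    DEVICE_MEMORY = auto()
    PE_BLOCK = auto()
    PRECISION = auto()
    INTERFACE = auto()
    MODE = auto()
    POWER = auto()
    INTERCONNECT = auto()


def classify(factors, threshold=4):
    """Label a device by how many distinct factors it exhibits.

    The threshold is an illustrative assumption, not a value from the text.
    """
    distinct = len(set(factors))
    return "reconfigurable" if distinct >= threshold else "fixed"


# Hypothetical device that can only trade power for performance.
print(classify([Factor.POWER]))
# Hypothetical device exhibiting every factor in the table.
print(classify(list(Factor)))
```

A count-based rule is only one possible reading; a weighting of individual factors would be an equally plausible interpretation of the table.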

PAGE 18

CHAPTER 3
INTERNAL DEVICE METRICS

Although Moore's Law continues to hold true in that transistor counts on devices are doubling every 18 months, we have reached a point where we can no longer increase clock rates and instruction-level parallelism (ILP) to meet the insatiable demand for computing performance¹. Thus, large amounts of research are currently focused on how to best utilize all of the transistors on a chip. Over the last few years, multi-core devices have emerged as the leading technology to take advantage of increasing transistor counts. This architecture reformation is shifting the focus to exploiting explicit parallelism, rather than relying on ILP and higher clock rates to achieve high performance. The resulting application reformation is driving application developers to write explicitly parallel programs, rather than relying on automatic compiler optimizations for high performance.

Multi-core devices are finding their way into new accelerator technologies that are used to augment the performance of traditional microprocessor-based systems. Multi-core devices have at least two major computational components in a single package. Many-core devices have many (e.g., hundreds) computational components in a single package. The demarcation between multi-core and many-core devices is still somewhat vague. We do not differentiate between multi-core and many-core devices and use the notation MC to refer to them collectively. In this paper, we define two primary classes of MC architecture technology: Fixed MC (FMC) and Reconfigurable MC (RMC). FMC devices have a fixed hardware structure that cannot be changed after

¹ Portions of this chapter have been previously published and adapted/reprinted with permission of: Jason Williams, Chris Massie, Alan D. George, Justin Richardson, Kunal Gosrani, and Herman Lam. 2010. Characterization of Fixed and Reconfigurable Multi-Core Devices for Application Acceleration. ACM Trans. Reconfigurable Technol. Syst. 3, 4, Article 19 (November 2010), 29 pages. DOI=10.1145/1862648.1862649 http://doi.acm.org/10.1145/1862648.1862649

PAGE 19

fabrication. A prime example of an FMC device is the Intel Xeon X3230 processor. It has four identical fixed processor cores on a single die [Intel Corp. 2008b,c]. By contrast, RMC devices can change their hardware structure after fabrication to adapt to changing problem requirements. Multiple computational cores can be instantiated on the RMC fabric. The primary enabling technology in RMC is the field-programmable gate array (FPGA), but several other exciting technologies are entering the market in this category. Several sub-categories were defined in Chapter 2 along with facets of reconfigurability.

In order to achieve near-optimal implementations given specific design goals and to reduce development time, a system designer must be able to analyze and evaluate appropriate computing devices and accelerator technologies early in the development cycle. However, comparing disparate computing technologies impartially and objectively has been a challenge throughout the history of computing. This is an even greater challenge considering the vast design space of FMC and RMC devices and the number and variety of available architectures.

We propose several computational density (CD) metrics to facilitate comparing devices within and between architectural categories. These metrics provide the designer with relative performance information in terms of bit, integer, and floating-point operations, incorporating power consumption and memory constraints. Additionally, we have defined an internal memory bandwidth (IMB) metric. The IMB metric is used to analyze a device's on-chip memory access capabilities, which is a common bottleneck in many systems. In calculating IMB, we differentiate between block-based systems (BBS) and cache-based systems (CBS). These contributions are intended to assist designers in rapid device exploration for efficient target device selection. We have not currently included metrics to describe a device's ease of programming or off-chip bandwidth.

The remainder of this paper is organized as follows. Section 3.1 introduces a hierarchical MC computing taxonomy and eight reconfigurability factors. Section 3.2 discusses the methods used to calculate the CD and IMB metrics for each device

PAGE 20

type. An overview of the accelerator technologies considered in this study is presented in Section 3.3. Results and discussion of CD and IMB calculations are presented in Section 3.4. Finally, conclusions are rendered in Section 3.5.

3.1 MC Taxonomy

We propose a hierarchical tree structure to classify computing devices as shown in Fig. 3-1. The single-core version of this taxonomy is fairly trivial. Thus, the root of the tree is the MC category. The next level of the taxonomy differentiates between FMC and RMC devices. The basic definitions of FMC and RMC are as previously described. Devices can also be a hybrid of FMC and RMC, with segregated fixed and reconfigurable resources on a single die that operate in a mutually exclusive manner. At the lowest level we differentiate between heterogeneous and homogeneous architectures. As previously defined, heterogeneous devices contain multiple types of processing elements. Homogeneous devices contain only a single type of processing element. Finally, within each category there are devices with a variety of base PE granularities.

To further clarify and classify the differences between fixed and reconfigurable architectures, we introduce a set of reconfigurability factors that were summarized in Table 2-1. Devices that exhibit zero or very few of the reconfigurability factors would be classified as fixed devices. Conversely, devices exhibiting many of the reconfigurability factors would be classified as reconfigurable devices. The next section describes the metrics calculations for this study and provides examples of their use.

3.2 Metric Methodology

In this section we propose several metrics to compare devices within and between taxonomy categories. We evaluate bit-level, integer, and floating-point operations.

3.2.1 Bit-level Computational Density

Bit-level CD was originally proposed by DeHon [DeHon 1996]. It describes the computational performance of a device on individual bits, normalizing by die area and

PAGE 21

Figure 3-1. MC taxonomy

process technology. We deviate from the original metric by omitting the normalization and instead group devices by process technology. Bit-level CD can be defined in terms of device type. Eq. 3-1 applies for FMC devices and coarse-grained RMC devices as

$$CD_{bit} = f \sum_i (W_i N_i) \tag{3-1}$$

where $W_i$ is the width of element type $i$, $N_i$ is the number of elements of type $i$ or the number of instructions that can be issued simultaneously, and $f$ is the clock frequency. Vector units are included in Eq. 3-1.

We now redefine this metric for FPGAs in terms of look-up tables (LUTs). Each LUT can implement at least one gate-level bit operation. Eq. 3-2 pertains to FPGA

PAGE 22

technologies as

$$CD_{bit} = f \left[ N_{LUT} + \sum_i (W_i N_i) \right] \tag{3-2}$$

where $N_{LUT}$ is the number of LUTs, $W_i$ is the width of element type $i$ (such as DSP multiplier resources), $N_i$ is the number of elements of type $i$, and $f$ is the clock frequency.

These two equations give us an estimate of maximum bit-level CD in terms of the clock rate and parallelism (the $N$ terms). It is important to note that these are theoretical peak values. One of the key advantages of FPGAs is that they have less overhead for bit-level computations, so achievable performance will be much closer to peak performance than it would be for coarser-grained devices [DeHon 1996]. We have conservatively chosen to not make a distinction between the number of 4- and 6-input LUTs. Depending on the functions being implemented, a 6-input LUT is not always 1.5 times more computationally powerful than a 4-input LUT.

3.2.2 Integer Computational Density

FMC and coarse-grained RMC devices typically contain Arithmetic Logic Units (ALUs) or coarse-grained processing elements for integer computation. In this case, to determine the integer CD, we use Eq. 3-3 as

$$CD_{int} = f \sum_i \frac{N_i}{CPI_i} \tag{3-3}$$

where $N_i$ is the number of integer execution units or the number of integer instructions that can be issued simultaneously of element type $i$, $CPI_i$ is the average number of clock cycles per integer instruction for element type $i$, and $f$ is the operating frequency of the device. The summation over $i$ in this equation takes into account architectures that support vector/SIMD integer instructions.

Integer addition, subtraction, and multiplication often require the same number of clock cycles for fixed architecture devices in terms of throughput. Integer dividers are more complex and require more clock cycles. Consequently, integer and floating-point

PAGE 23

performance is often reported in terms of addition and multiplication performance and not division. For consistency with previous practices, we will only consider addition and multiplication. Our methodology balances addition and multiplication performance for each integer CD metric by using an equal number of addition and multiplication operations, such that the number of balanced operations is maximized.

For the Field Programmable Object Array (FPOA), which is described in more detail in Subsection 3.3.2, integer CD can be calculated for integer widths that match multiples of the width of the basic block. Due to the heterogeneous nature of most RMC devices, the integer CD metric is a summation of the computational capacities of the various elements. The total number of each type of element is extracted from the device datasheet.

For FPGAs, a CD methodology similar to the one described by Strenski is applied [Strenski 2007]. This characterization is highly dependent on the performance of the IP cores. We assume that integer cores provided by the vendor are highly optimized and will provide a good basis for characterization. The parameters in the following procedure are available as part of the core documentation from the vendor or via experimentation using vendor tools. When using experimentation, typical methods to optimally balance high clock frequency and low resource utilization should be used. These methods are as follows:

1. Determine the maximum amount of logic resources and the maximum amount of special on-chip resources (e.g., DSP multipliers) for the device.
2. Assume 15% logic resource overhead for steering logic and memory or I/O interfacing.
3. Determine the resource utilization and maximum achievable frequency for one instance of the core using DSP resources.
4. Determine the resource utilization and maximum achievable frequency for one instance of the core utilizing logic-only resources.

PAGE 24

5. Determine the number of simultaneous cores, $Ops_{DSP}$, that can be instantiated until all DSP resources are exhausted.
6. Using any remaining logic resources, determine the number of simultaneous logic-only cores, $Ops_{logic}$, that can be instantiated.
7. The usable frequency $f$ is the lower of the frequencies determined in steps 3 and 4.

Thus, the integer CD is defined as:

$$CD_{int} = f \times \left( Ops_{DSP} + Ops_{logic} \right) \tag{3}$$

This formula represents peak CD without any consideration of potential performance limitations due to memory bandwidth or on-chip RAM resource restrictions for data buffering. Strenski [Strenski 2007] describes a method to limit the number of parallel operations based on the amount of available on-chip memory resources. Memory needs to be allocated to store two operands per operation. The operands can be overwritten with the result in memory. Dual-port memory configurations are used to increase the internal bandwidth. Thus, the memory-sustainable CD is limited by the size of the operands and the amount of parallel paths to on-chip memory.

For the FPGA calculations presented in Section 3.4, we attempted to maximize the number of parallel operations while trying to balance the number of addition and multiplication operations. This balance can be achieved by iterating through combinations of DSP and logic resources allocated to addition or multiplication operations using the methods previously discussed. The frequency used for all calculations is the lower of the multiplier and adder frequencies. This method is applicable to both integer and floating-point calculations. Single instantiations of a core provide a reasonable estimate for achievable frequency since, to be conservative, we are not necessarily assuming all parallel operations are constituents of a pipeline and we use the lowest achievable frequency calculated for each precision. Although the achievable frequencies used here may be optimistic compared to achievable frequencies


on real applications, the memory-sustainability limitations on the number of parallel operations caused by the wide and flat structure assumed in this methodology are pessimistic. When taken together, we feel these imprecisions balance each other to provide a reasonable performance estimate. The simplifying assumptions in our methodology enable rapid device comparison independent of specific algorithm requirements.

3.2.3 Floating-point Computational Density

In most cases, floating-point CD can be determined at the device level using methods similar to those shown above for integer CD. Coarse-grained devices use the same model as integer CD, Eq. 3, inserting the number of floating-point units or simultaneous floating-point instructions that can be issued for $N_i$, and the number of cycles per floating-point instruction for $CPI_i$. The same constraints regarding division apply; only addition and multiplication operations are considered as part of this metric. As with integer CD, an equal number of addition and multiplication operations are used in finding the maximum number of parallel operations.

Again, floating-point CD for FPGAs is calculated using the same procedure we used for integer operations by repeating the calculation using floating-point computational cores. The same iterative procedure as integer CD is used to determine the maximum number of parallel operations that employ an equal number of addition and multiplication operations. The number of parallel floating-point operations of these devices is typically much less than the number of parallel integer operations since there is more resource utilization in each floating-point computational core. Consequently, the memory limitations noted previously could have a much greater impact on integer operations than floating-point operations in terms of memory-sustainable CD.

3.2.4 Power Consumption

Power consumption is also an important device characteristic, for HPEC and HPC alike. Power consumption can be a challenging metric to compute for RMC devices.
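Returning briefly to the FPGA procedure of Subsection 3.2.2, steps 5 through 7 and the memory-sustainability limit can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tooling used in this study: the function names and every numeric input are hypothetical, the per-core resource figures would in practice come from vendor tools, and the balancing of addition and multiplication cores is omitted for brevity.

```python
LOGIC_OVERHEAD_PERCENT = 15  # step 2: logic reserved for steering and interfacing


def raw_parallel_ops(total_luts, total_dsps,
                     dsp_core_luts, dsp_core_dsps, logic_core_luts):
    """Steps 5 and 6: count DSP-based cores first, then logic-only cores."""
    usable_luts = total_luts * (100 - LOGIC_OVERHEAD_PERCENT) // 100
    # Step 5: instantiate DSP-based cores until DSPs (or logic) run out.
    ops_dsp = total_dsps // dsp_core_dsps
    if dsp_core_luts > 0:
        ops_dsp = min(ops_dsp, usable_luts // dsp_core_luts)
    # Step 6: fill the remaining logic with logic-only cores.
    remaining_luts = usable_luts - ops_dsp * dsp_core_luts
    ops_logic = remaining_luts // logic_core_luts
    return ops_dsp, ops_logic


def integer_cd_gops(ops_dsp, ops_logic, f_dsp_mhz, f_logic_mhz,
                    operand_pairs_per_cycle=None):
    """Step 7 plus the optional memory-sustainable cap; returns GOPs."""
    f_mhz = min(f_dsp_mhz, f_logic_mhz)  # usable frequency is the lower one
    ops = ops_dsp + ops_logic
    if operand_pairs_per_cycle is not None:
        # On-chip memory must supply two operands per operation each cycle.
        ops = min(ops, operand_pairs_per_cycle)
    return f_mhz * ops / 1000.0


# Hypothetical device and core figures, for illustration only.
ops_dsp, ops_logic = raw_parallel_ops(
    total_luts=100_000, total_dsps=200,
    dsp_core_luts=50, dsp_core_dsps=4, logic_core_luts=600)
raw_cd = integer_cd_gops(ops_dsp, ops_logic, 300, 250)
sustainable_cd = integer_cd_gops(ops_dsp, ops_logic, 300, 250,
                                 operand_pairs_per_cycle=120)
print(ops_dsp, ops_logic, raw_cd, sustainable_cd)
```

The cap is applied to the operation count rather than to the frequency, which mirrors the statement that memory-sustainable CD is limited by the number of parallel paths to on-chip memory.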


Reconfigurable devices can have much lower power consumption than their peak values since only configured portions of the chip are active. A detailed analysis of static and dynamic power is beyond the scope of this paper. Within a metric we hold frequency constant. Therefore, for reconfigurable architectures, we assume in the results that follow that power scales linearly with resource utilization up to the maximum power consumption specified in vendor documentation. For FPGAs, we also scale the maximum power by the ratio of achievable frequency to maximum frequency. Using vendor tools and documentation, we are able to estimate the static power of a configured FPGA, which is used as the starting point in the linear approximation. It is difficult to estimate the error introduced by scaling the maximum power consumption based on frequency and resource utilization. An error estimate would be challenging because FPGA power consumption must be measured in isolation from other devices on the board or system (e.g., RAM, interface controllers, etc.). Given the linear dependence of dynamic power on frequency and the fact that power increases with resource utilization, we feel that our method for estimating power over a range of parallelism is reasonable. The CD per Watt (CD/W) metrics are calculated by taking the CD for each level of parallelism and dividing by the power consumption at that level of parallelism.

3.2.5 Internal Memory Bandwidth

Internal Memory Bandwidth is a crucial metric due to the impact that the memory subsystem can have on overall system performance. IMB is especially critical for devices with high computational capabilities (e.g., high CD), which require high internal memory performance to keep processing elements busy. Idle PEs are not performing computational work and are wasting energy. The IMB metric seeks to quantify memory performance by describing the rate at which data can be transferred from on-chip memories to processing elements for both block-based systems (BBS) and cache-based systems (CBS).
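The linear power model and the CD/W calculation of Subsection 3.2.4 can be sketched as follows. This is an illustrative reading rather than a definitive implementation: the function names and the numeric inputs are hypothetical, utilization is assumed to equal the fraction of the maximum number of parallel operations that is instantiated, and the frequency ratio is assumed to apply to the vendor maximum power only, leaving the static power unscaled.

```python
def scaled_max_power(p_max_w, f_achievable_mhz, f_max_mhz):
    """Scale the vendor maximum power by achievable/maximum frequency (FPGAs)."""
    return p_max_w * f_achievable_mhz / f_max_mhz


def power_at_utilization(p_static_w, p_max_w, utilization):
    """Linear interpolation from static power (0%) to maximum power (100%)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return p_static_w + (p_max_w - p_static_w) * utilization


def cd_per_watt(f_mhz, parallel_ops, max_parallel_ops, p_static_w, p_max_w):
    """CD/W at one level of parallelism; CD in GOPs, power in watts."""
    cd_gops = f_mhz * parallel_ops / 1000.0
    utilization = parallel_ops / max_parallel_ops
    return cd_gops / power_at_utilization(p_static_w, p_max_w, utilization)


# Hypothetical FPGA: 20 W vendor maximum at 500 MHz, 2 W static power,
# 250 MHz achievable frequency, 100 of 200 possible operations instantiated.
p_max = scaled_max_power(20.0, 250, 500)
half_power = power_at_utilization(2.0, p_max, 0.5)
half_cdw = cd_per_watt(250, 100, 200, 2.0, p_max)
print(p_max, half_power, half_cdw)
```

Evaluating `cd_per_watt` over a range of `parallel_ops` values would trace out one CD/W curve of the kind discussed in Section 3.4.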


BBSs typically have a very large number of blocks of memory accessible to the PEs. Block memory acts similarly to scratchpad memory; there is no inherent cache structure. Control and design issues such as data segmentation, distribution, replacement, and coherency are left to the user to implement. BBSs can be found in both FMC and RMC devices. Eq. 3 is used to calculate IMB for block-based systems as

$$IMB_{block} = \sum_i \frac{N_i \times P_i \times W_i \times f_i}{8 \times CPA_i} \tag{3}$$

where $N_i$ is the number of block memories of type $i$, $P_i$ is the number of ports for memory of type $i$, $W_i$ is the width of memory of type $i$, $f_i$ is the operating frequency of memory type $i$, dividing by 8 converts from bits per second to bytes per second, and $CPA_i$ is the number of cycles per access for memory of type $i$. For devices that do not produce fixed-frequency designs (e.g., FPGAs), $f_i$ is variable up to the achievable frequency of the design; otherwise, $f_i$ is constant.

CBSs generally have multiple levels of cache available to the PEs. There is a hardware structure in place that implements cache design features and control options such as associativity, line size, replacement algorithms, coherency, etc. CBSs can be found in RMC and FMC devices, although they are much more prevalent in FMC devices. RMC devices that do not natively implement a CBS could be configured to do so, but this option is not considered here for simplicity. Implementing a CBS structure on a BBS RMC device would require additional complex control logic that would diminish the advantage that simpler block-based systems have for some application scenarios. The method for calculating IMB for CBS is shown in Eq. 3 as

$$IMB_{cache} = \%_{hitrate} \times \sum_i \frac{N_i \times P_i \times W_i \times f_i}{8 \times CPA_i} \tag{3}$$

where $\%_{hitrate}$ is a scale factor to account for variable hit rates, $N_i$ is the number of block memories of type $i$, $P_i$ is the number of ports for memory of type $i$, $W_i$ is the width


of memory of type $i$, $f_i$ is the operating frequency of memory of type $i$, dividing by 8 converts from bits per second to bytes per second, and $CPA_i$ is the number of cycles per access for memory of type $i$. IMB is calculated separately for each level of cache, thus the hit-rate scale factor is not included in the summation.

IMB does not include the data rates between computational circuitry and registers. We assume that registers are internal structures present in both BBSs and CBSs that are separate from either block or cache memory. The goal of IMB is to evaluate a device's capability to keep PEs fed; registers are excluded from IMB because they are typically not the bottleneck to PE utilization. Although IMB is not explicitly integrated into the CD and CD/W metrics, CD does include some notion of memory-sustainability as noted in Subsection 3.2.2.

3.3 Accelerator Overview

In this section we describe the features of a variety of FMC and RMC devices so that the CD and IMB metrics described in Section 3.2 can be applied. We have included devices from 130 nm, 90 nm, 65 nm, 45 nm, and 40 nm process technologies. Table 3-1 provides a summary of classifications for the various devices.

3.3.1 FMC Devices

Several FMC devices have been included in this study. Tables 3-2 and 3-3 list the devices and provide a summary of the key features needed to compute the CD and IMB metrics. These devices exhibit very few of the reconfigurability factors listed in Table 2-1, and are thus classified as FMC devices. This information was gathered from [ClearSpeed Technology PLC 2007] for the ClearSpeed CSX600 accelerator, [Freescale Semiconductor, Inc. 2006, 2005] for the Freescale MPC7447 PowerPC, [AMD, Inc. 2008; X-bit Laboratories 2006] for the AMD Athlon X2 6400+, [Chen et al. 2007; Wang 2005] for the Cell Broadband Engine (Cell BE), [Freescale Semiconductor, Inc. 2006, 2008] for the Freescale MPC8640D PowerPC, [Nvidia Corp. 2007, 2006, 2008] for the Nvidia Tesla C870 graphics processing unit (GPU), [Intel Corp. 2008a,


2000] for the Intel Xeon 7041, [X-bit Laboratories 2007; Hester 2006] for the AMD Opteron 8360 SE, [Chen et al. 2007; IBM Corp. 2008] for IBM's PowerXCell 8i, [Intel Corp. 2008b, c, 2006] for the Intel Xeon X3230, and [Intel Corp. 2008b, a; Shimpi 2008] for Intel's Atom² N270 processor.

² Detailed information on the L1 cache structures was not available for Atom. We have assumed the Atom's L1 cache interfaces are similar to the Core microarchitecture since both Atom and Core use similar L2 interfaces.

Table 3-1. Device classification

Device | Marks, in left-to-right order, under the columns FMC, RMC, BBS, CBS, Hetero., Homo.
--- | ---
Arrix FPOA | X X X
ECA-64 | X X X
MONARCH | X X X X
Stratix-II S180 | X X X
Stratix-III SL340 | X X X
Stratix-III SE260 | X X X
Stratix-IV SE530 | X X X
TILE64 | X X X
Virtex-4 LX200 | X X X
Virtex-4 SX55 | X X X
Virtex-5 LX330T | X X X
Virtex-5 SX95T | X X X
Athlon 64 X2 6400+ | X X X
Atom N270 | X X X
Cell BE | X X X
CSX600 | X X X
MPC7447 | X X X
MPC8640D | X X X
Opteron 8360 SE | X X X
PowerXCell 8i | X X X
Tesla C870 | X X X
Xeon 7041 | X X X
Xeon X3230 | X X X

The Cell BE is a heterogeneous device since it has


a traditional processing unit plus up to eight additional compute units. This structure of a processing unit with wider compute units leads to the 1+8 and 64/128 notation in Table 3-2. The other devices are considered homogeneous because all of the sub-units are the same at the level of replication. The devices we have categorized as FMC at most only exhibit the Mode and Power reconfigurability factors. The majority of reconfigurability factors are not represented by this set of devices, leading to the FMC designation.

Table 3-2. FMC device processing features

Device | Cores | Instructions Issued/Core | Datapath Width (bits) | Frequency (MHz) | Process Tech. (nm)
--- | --- | --- | --- | --- | ---
CSX600 | 96 | 1 | 64 | 250 | 130
MPC7447 | 1 | 1+2 | 32/128 | 1000 | 130
Athlon 64 X2 6400+ | 2 | 3+1 | 32/64 | 3200 | 90
Cell BE | 1+8 | 2+1 | 64/128 | 3200 | 90
MPC8640D | 2 | 1+2 | 32/128 | 1000 | 90
Tesla C870 | 128 | 2 | 32 | 1350 | 90
Xeon 7041 | 2 | 3+1 | 64/128 | 3000 | 90
Opteron 8360 SE | 4 | 3+1 | 64/128 | 2500 | 65
PowerXCell 8i | 1+8 | 2+1 | 64/128 | 3200 | 65
Xeon X3230 | 4 | 4+1 | 64/128 | 2660 | 65
Atom N270 | 1 | 1+1 | 64/128 | 1600 | 45

Additional background on FMC architectures is of interest for calculating CD and IMB. The AMD, Freescale, and Intel Xeon processors each have vector units for each core, again leading to the x+y and 64/128 notation. Note that the Athlon X2 6400+ and Xeon 7041 vector units have a throughput of one instruction every two clock cycles and that for 32-bit integer multiplication there is a 4x throughput reduction for the Tesla C870. The PowerPC processors (MPC7447 and MPC8640D) can issue one ALU and two vector instructions for integer operations, two Floating Point Unit (FPU) and one


vector for single-precision floating-point operations, and three FPU instructions for double-precision floating-point.

Table 3-3. FMC device power and memory features

Device | Power (W) | On-Chip Memory
--- | --- | ---
CSX600 | 10 | I, D caches, 96 × 32-bit SRAM banks
MPC7447 | 10 | L1-I, L1-D: 4 words/2 clock cycles, L2: 8 words/9 clock cycles
Athlon 64 X2 6400+ | 125 | Each core: L1-I: 16 bytes/clock cycle, L1-D: 2 × 64-bit interface, L2: 2 × 64-bit interface
Cell BE | 70 | L1-I, L1-D, L2 (PPE), 8 × 128-bit LS banks (SPEs)
MPC8640D | 14 | Each core: L1-I, L1-D: 4 words/2 clock cycles, L2: 8 words/11.5 clock cycles
Tesla C870 | 120 | 16 shared memories each w/ 16 × 32-bit banks
Xeon 7041 | 165 | Each core: L1-I: 3 ops/clock cycle, L1-D: 2 × 128-bit interface, L2: 256-bit interface
Opteron 8360 SE | 105 | Each core: L1-I: 32 bytes/clock cycle, L1-D: 2 × 128-bit interface, L2: 2 × 128-bit interface, L3: variable
PowerXCell 8i | 92 | L1-I, L1-D, L2 (PPE), 8 × 128-bit LS banks (SPEs)
Xeon X3230 | 95 | Each core: L1-I: 128-bit interface, L1-D: 2 × 128-bit interface, L2: 256-bit interface
Atom N270 | 3.3 | L1-I: 128-bit interface, L1-D: 2 × 128-bit interface, L2: 256-bit interface

In IMB calculations, we assume that each op is 32 bits since it is similar to a RISC instruction [Hennessy and Patterson 2007] for the Xeon 7041. Finally, we focus only on block memory for the devices included here that have both cache and block memories on-chip (e.g., Cell BE, etc.); cache resources are minor bandwidth contributors compared to block memory resources in these devices. For all


FMC devices, on-chip memory operates at the core speed and the number of cycles per access is equal to one unless otherwise noted. Fixed-function ASICs, although not included here, would not exhibit Mode reconfigurability. We have excluded fixed-function ASICs because they tend to be proprietary and thus it is difficult to find data describing them.

3.3.2 RMC Devices

We have evaluated a variety of FPGA and non-FPGA RMC devices. All of these devices show evidence of numerous reconfigurability factors and are therefore considered RMC devices. Some of the key parameters for FPGA devices regarding the CD and IMB metrics are listed in Tables 3-4 and 3-5. Note that the Altera Stratix-II FPGA uses 9×9-bit DSP multipliers. The Altera Stratix-III, Stratix-IV, and Xilinx Virtex-4 devices use 18×18-bit multipliers. The Xilinx Virtex-5 devices use 25×18-bit multipliers. As shown in the next section, the maximum frequency listed here is only used for the bit-level CD metric. The maximum achievable core frequency, which is the frequency for a single instance of a DSP or logic-only core after Place-and-Route (PAR), is used for the integer and floating-point metrics. Further details on determining maximum achievable frequency are provided in Subsection 3.2.2. The values in Tables 3-4 and 3-5 were acquired from [Altera Corp. 2007a, b, 2008; BittWare, Inc. 2008; Xilinx, Inc. 2007, 2008a]. Power consumption values were estimated using Altera's PowerPlay early power estimator and Xilinx's XPower power estimator spreadsheet tools. Maximum utilization for clock, logic, DSP, and on-chip memory resources, maximum clock frequency, and a 12.5% switching rate were assumed in the spreadsheet tools to estimate maximum power consumption. All of the FPGA devices in this study display all eight of the reconfigurability factors.

We have also considered several new, alternative RMC technologies. MathStar's Arrix Field-Programmable Object Array (FPOA), model MOA2400D-10, has a clock rate of 1 GHz and was built on 90 nm process technology. The FPOA has 256 ALU objects


and sixty-four 16×16-bit Multiply-Accumulate (MAC) Objects. They also include 40-bit accumulators that can perform an operation every clock cycle. Power consumption is rated at 15.3 W at 25% utilization and 37.6 W for 100% utilization, for a 1 V core voltage [Mathstar, Inc. 2007a, b]. We consider the FPOA a heterogeneous device. The FPOA is categorized as RMC because it represents the Datapath, Device Memory, Mode, Power, and Interconnect reconfigurability factors.

Table 3-4. FPGA device processing features

Device | LUTs | DSPs | Max. Frequency (MHz) | Process Tech. (nm)
--- | --- | --- | --- | ---
Stratix-II EP2S180 | 143,520 | 768 | 500 | 90
Stratix-III EP3SE260 | 203,520 | 768 | 550 | 65
Stratix-III EP3SL340 | 270,400 | 576 | 550 | 65
Stratix-IV EP4SE530 | 424,960 | 1,024 | 600 | 40
Virtex-4 SX55 | 49,152 | 512 | 500 | 90
Virtex-4 LX200 | 178,176 | 96 | 500 | 90
Virtex-5 SX95T | 58,800 | 640 | 550 | 65
Virtex-5 LX330T | 207,360 | 192 | 550 | 65

Element CXI's ECA-64 is a heterogeneous, data-flow, reconfigurable processor built on 90 nm process technology with a 200 MHz clock. There is a variety of processing element types, supporting many parallel operations. The ECA-64 has published power consumption of up to one Watt at full utilization [Element CXI, Inc. 2007a, b]. The goal of the ECA-64 is to provide performance and fault resiliency and to replace many custom processors. The ECA-64 includes all eight reconfigurability factors and is classified as RMC.

The TILE64 processor from Tilera is a 64-core processor (at most 63 cores can be used for processing) with a reconfigurable mesh network. Each core is a full 32-bit


processor, running at 750 MHz.

Table 3-5. FPGA device power and memory features

Device | Min. Power (W) | Max. Power (W) | On-Chip Memory
--- | --- | --- | ---
Stratix-II EP2S180 | 3.26 | 30 | 9 × 128-bit dual-port blocks @ 420 MHz, 768 × 32-bit dual-port blocks @ 550 MHz, 930 × 16-bit dual-port blocks @ 500 MHz
Stratix-III EP3SE260 | 2.11 | 25 | 48 × 72-bit dual-port blocks @ 600 MHz, 864 × 32-bit dual-port blocks @ 580 MHz
Stratix-III EP3SL340 | 2.83 | 32 | 48 × 72-bit dual-port blocks @ 600 MHz, 1,040 × 32-bit dual-port blocks @ 580 MHz
Stratix-IV EP4SE530 | 3.55 | 39 | 64 × 72-bit dual-port blocks @ 600 MHz, 1,280 × 32-bit dual-port blocks @ 600 MHz
Virtex-4 SX55 | 1.00 | 10 | 320 × 32-bit dual-port blocks @ 500 MHz
Virtex-4 LX200 | 1.27 | 23 | 336 × 32-bit dual-port blocks @ 500 MHz
Virtex-5 SX95T | 1.89 | 10 | 488 × 72-bit dual-port blocks @ 550 MHz
Virtex-5 LX330T | 3.43 | 27 | 648 × 72-bit dual-port blocks @ 550 MHz

Each core is a Very Long Instruction Word (VLIW) architecture that can issue three instructions per clock cycle. Instruction packing allows four 16-bit or five 8-bit integer operations to be processed simultaneously. Its idle power consumption is 5 W and maximum power consumption is 28 W. The TILE64 is built on 90 nm technology [Tilera Corp. 2008; Barton 2007]. The goal of the TILE64 is to provide supercomputing performance on a single chip. Custom pipelines can be set up among tiles, enabling Datapath reconfigurability on the TILE64. The collective L2 caches


for each tile can also function as a unified L3 cache, demonstrating device memory reconfigurability. The TILE64 also incorporates the Mode, Power, and Interconnect reconfigurability factors.

Finally, we consider a device that operates using fixed or reconfigurable resources in a somewhat mutually exclusive manner. The MONARCH polymorphous processor spans both FMC and RMC categories. It contains six RISC processors and a Field Programmable Computing Array (FPCA) of coarse-grained elements. It operates at 333 MHz and has a standby power consumption of 6.7 W and a maximum power consumption of 33 W [Lewins et al. 2007]. The goal of the MONARCH processor is to provide the capability to adapt to changing application requirements. When employing the RISC processors, MONARCH only exhibits the Mode and Power reconfigurability factors. When using the FPCA, MONARCH can take advantage of Datapath, Device Memory, Mode, Power, and Interconnect reconfigurability. The differences in the two distinct modes of operation cause MONARCH to be classified as both an FMC and RMC device.

3.4 Results and Analysis

In this section we primarily focus on detailed results for memory-sustainable CD/W. Table 3-6 summarizes both raw and memory-sustainable CD. Memory-sustainable CD is defined as the CD that a device can support with its on-chip memory structure. Memory-sustainable CD is limited when there are not enough parallel paths to memory for the maximum number of parallel operations that can be processed. For each metric and each device, we calculate the maximum memory-sustainable CD. This enables us to determine the maximum amount of exploitable parallelism, which is the number of memory-sustainable parallel operations that can be processed. As indicated previously, clock frequency is held constant within a metric. Maximum clock frequency is used for the bit-level metrics (CD and CD/W) and achievable frequency is used for the remaining metrics as shown in Table 3-7. We then examine the impact of varying parallelism to


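Table 3-6. Maximum CD (in billions of operations per second or GOPs)

Device | Bit-level Raw | Bit-level Sus. | 16-bit Int. Raw | 16-bit Int. Sus. | 32-bit Int. Raw | 32-bit Int. Sus. | SPFP Raw | SPFP Sus. | DPFP Raw | DPFP Sus.
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Arrix FPOA | 5120 | 5120 | 320 | 320 | 160 | 160 | n/a | n/a | n/a | n/a
ECA-64 | 435 | 435 | 13 | 13 | 6 | 6 | n/a | n/a | n/a | n/a
MONARCH | 2048 | 2048 | 65 | 65 | 65 | 65 | 65 | 65 | n/a | n/a
Stratix-II S180 | 75216 | 75216 | 442 | 442 | 123 | 123 | 53 | 53 | 11 | 11
Stratix-III SL340 | 154422 | 154422 | 933 | 918 | 213 | 213 | 96 | 96 | 26 | 26
Stratix-III SE260 | 119539 | 119539 | 817 | 778 | 204 | 204 | 73 | 73 | 22 | 22
Stratix-IV SE530 | 243866 | 243866 | 990 | 766 | 312 | 312 | 132 | 132 | 68 | 68
TILE64 | 4608 | 4608 | 240 | 240 | 144 | 144 | n/a | n/a | n/a | n/a
Virtex-4 LX200 | 89952 | 89952 | 357 | 116 | 66 | 42 | 68 | 46 | 16 | 16
Virtex-4 SX55 | 29184 | 29184 | 365 | 110 | 71 | 40 | 31 | 31 | 7 | 7
Virtex-5 LX330T | 150163 | 150163 | 606 | 300 | 131 | 122 | 119 | 116 | 26 | 26
Virtex-5 SX95T | 48435 | 48435 | 599 | 226 | 221 | 92 | 82 | 82 | 15 | 15
Athlon X2 6400+ | 2048 | 2048 | 70 | 70 | 45 | 45 | 26 | 26 | 13 | 13
Atom N270 | 307 | 307 | 14 | 14 | 8 | 8 | 8 | 8 | 5 | 5
Cell BE | 4096 | 4096 | 205 | 205 | 115 | 115 | 205 | 205 | 19 | 19
CSX600 | 1536 | 1536 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24
MPC7447 | 288 | 288 | 17 | 17 | 9 | 9 | 6 | 6 | 3 | 3
MPC8640D | 576 | 576 | 34 | 34 | 18 | 18 | 12 | 12 | 6 | 6
Opteron 8360 SE | 4480 | 4480 | 190 | 190 | 110 | 110 | 80 | 80 | 40 | 40
PowerXCell 8i | 4096 | 4096 | 205 | 205 | 115 | 115 | 205 | 205 | 102 | 102
Tesla C870 | 11059 | 11059 | 346 | 346 | 216 | 216 | 346 | 346 | n/a | n/a
Xeon 7041 | 1536 | 1536 | 42 | 42 | 30 | 30 | 30 | 30 | 24 | 24
Xeon X3230 | 4095 | 4095 | 128 | 128 | 85 | 85 | 85 | 85 | 64 | 64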


compare performance. We are using the process technology of a device to help group and compare devices both within their generation and across generations. For the integer and floating-point metrics, we adjust the maximum power consumption of FPGA devices by the ratio of achievable frequency to maximum frequency.

Table 3-7. FPGA achievable frequency (in MHz)

Device | Bit-level | 16-bit Integer | 32-bit Integer | SPFP | DPFP
--- | --- | --- | --- | --- | ---
Stratix-II S180 | 500 | 410 | 420 | 286 | 148
Stratix-III SL340 | 550 | 400 | 273 | 329 | 195
Stratix-III SE260 | 550 | 400 | 273 | 329 | 195
Stratix-IV SE530 | 550 | 291 | 243 | 241 | 184
Virtex-4 LX200 | 500 | 344 | 249 | 274 | 185
Virtex-4 SX55 | 500 | 344 | 249 | 274 | 185
Virtex-5 LX330T | 550 | 463 | 378 | 357 | 237
Virtex-5 SX95T | 550 | 463 | 378 | 357 | 237

There may be intuitive expectations for each metric. For the bit-level metrics, it is expected that the FPGAs would perform the best due to their fine-grained LUT-based architecture and low power consumption. For the integer metrics, one might expect the coarse-grained reconfigurable devices, such as the Arrix FPOA, to be the best performers, due to the large number of coarse-grained processing elements that can be active simultaneously. For the single-precision floating-point (SPFP) metric, devices used primarily for graphics processing (Cell BE, Tesla C870) might be expected to perform the best. For double-precision floating-point (DPFP) CD/W, one might expect that the server-class microprocessors, often used as HPC cluster building blocks, would provide the best performance. However, in many cases the results provided surprises and new insight.
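As a cross-check on how the bit-level entries in Table 3-6 arise, the bit-level CD equation from Section 3.2 can be evaluated with the LUT counts, DSP counts, and maximum frequencies of Table 3-4. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the element width $W_i$ of a DSP multiplier is one operand width (18 for the 18×18-bit multipliers of the Virtex-4 devices, 9 for the 9×9-bit multipliers of the Stratix-II). Under that assumption the three results coincide with the corresponding bit-level values in Table 3-6.

```python
def cd_bit_gops(f_mhz, n_lut, elements):
    """Bit-level CD in GOPs: f * (N_LUT + sum over i of W_i * N_i).

    `elements` is a list of (width_bits, count) pairs, one per element type.
    """
    bit_ops_per_cycle = n_lut + sum(width * count for width, count in elements)
    return f_mhz * bit_ops_per_cycle / 1000.0


# (max frequency in MHz, LUTs, [(assumed DSP width, DSP count)]) from Table 3-4.
devices = {
    "Virtex-4 LX200": (500, 178_176, [(18, 96)]),
    "Virtex-4 SX55": (500, 49_152, [(18, 512)]),
    "Stratix-II EP2S180": (500, 143_520, [(9, 768)]),
}

results = {name: cd_bit_gops(f, luts, elems)
           for name, (f, luts, elems) in devices.items()}
for name, gops in results.items():
    print(f"{name}: {gops:.0f} GOPs")
```

Because the LUT term dominates the sum, bit-level CD in this formulation tracks fabric size closely, which is consistent with the ordering of the FPGA rows in Table 3-6.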


3.4.1 Bit-level Computational Density per Watt

The 40 nm Stratix-IV EP4SE530 FPGA has significantly more reconfigurable logic resources than all other devices, including the 65 and 90 nm FPGAs, and has the overall best bit-level CD/W performance as shown in Figs. 3-2 and 3-3. The CD/W curves flatten when the device has reached its maximum number of parallel operations. Further levels of parallelism will not show an increase in performance. The best 65 nm performer is the Virtex-5 LX330T, with the other 65 nm RMC devices a close second. Within the 90 nm devices, the FPGAs also have significantly better performance than the other devices. For 90 nm, the Virtex-4 LX200 has the highest performance with the SX55 a close second. The Stratix-II S180 lags behind the other FPGAs since it has fewer logic resources than the LX200 and higher power consumption than the SX55. For this metric the maximum frequencies listed in Table 3-7 were used. The FMC devices for all process technologies perform poorly in this metric due to the high overhead for bit operations on these devices and considerably higher power consumption.

3.4.2 16-bit Integer Computational Density per Watt

Figs. 3-4 and 3-5 show the CD/W metrics for both FMC and RMC devices, respectively. The Virtex-5 SX95T is the overall leader, although the 40 nm Stratix-IV SE530 is a very close second at high operation densities. For 90 nm devices the leader for almost all levels of parallelism is the Virtex-4 SX55, while the ECA-64 and Stratix-II EP2S180 also perform well. We scale power linearly up to the maximum resource utilization, which corresponds to the maximum number of raw parallel operations. Devices only have memory-sustainable operations instantiated, thus they will never reach their maximum power dissipation if full utilization of the chip cannot be sustained (e.g., the SX55 and SX95T). The large amount of low-power DSP resources and the power savings due to the limited number of memory-sustainable operations that can be instantiated give the SX55 and SX95T a large CD/W advantage within their


respective technology nodes for moderate levels of parallelism. As the number of parallel operations increases, several Altera FPGAs show very good performance since they are not as limited by memory-sustainability issues as the Xilinx devices. Again, the FMC devices tend to perform poorly in this metric due to their high, fixed power consumption.

Figure 3-2. Bit-level computational density per watt (FMC)

3.4.3 32-bit Integer Computational Density per Watt

Overall, the Virtex-5 SX95T leads this metric as shown in Figs. 3-6 and 3-7 for high levels of exploitable parallelism, again due to the power consumption issues cited previously. For very large levels of parallelism, the Altera FPGAs show strong performance, with the Stratix-IV SE530 nearing the performance of the SX95T. 32-bit


CD/W is another instance where some of the Xilinx FPGAs suffer in terms of sustainable performance due to memory capacity and hierarchy issues.

Figure 3-3. Bit-level computational density per watt (RMC)

The Virtex-4 LX200 has a raw maximum 32-bit integer CD of 66 GOPs, but can only sustain 41.5 GOPs, a 37% reduction. The Virtex-5 LX330T has a raw maximum CD of 131 GOPs but can only sustain 122 GOPs, a 6% reduction. Memory and buffering limitations lead to a reduction in overall CD and CD/W for these devices. The Altera FPGAs have a better balance between the number of parallel operations and the number of memory locations, and thus they do not exhibit this limitation. There is an interesting situation which can be seen in Fig. 3-7. Even though the raw performance of the ECA-64 is relatively low, the power consumption is so low that it initially leads the CD/W metric for 90 nm


devices. The negative initial slope for the ECA-64 arises because its power consumption is less than one Watt at less than 100% resource utilization and this metric is normalized to one Watt. For low levels of exploitable parallelism, the Stratix-II EP2S180 and TILE64 are good performers. As exploitable parallelism increases, the Virtex-4 SX55 becomes a better performer.

Figure 3-4. 16-bit integer computational density per watt (FMC)

3.4.4 Single-Precision Floating-Point Computational Density per Watt

Devices that are not intended for SPFP or DPFP operations and would likely perform poorly are therefore not included in the SPFP and DPFP metrics. In addition, due to lack of vendor data, results for TILE64 are also excluded. Despite their significant performance advantage for raw CD performance for SPFP over other devices, the Cell,


PowerXCell 8i, and GPU are extremely power-hungry and perform worse on CD/W than most of the RMC devices, as shown in Figs. 3-8 and 3-9. The 65 nm FPGAs have a major performance increase over the previous generation devices, while maintaining good power efficiency, so that they achieve the best CD/W for all levels of parallelism, led by the Virtex-5 SX95T as shown in Fig. 3-9. Although the Stratix-IV SE530 has twice the CD performance of the SX95T, the higher power consumption of the Stratix-IV SE530 causes it to be a close second in this metric. Even the extremely low-power Atom N270 does not have enough SPFP processing capability to overcome the performance per Watt advantage of most RMC devices, specifically the FPGAs.

Figure 3-5. 16-bit integer computational density per watt (RMC)


Figure 3-6. 32-bit integer computational density per watt (FMC)

3.4.5 Double-Precision Floating-Point Computational Density per Watt

The Stratix-IV SE530 is the noticeable leader for all devices due to its large fabric and the area-intensive nature of DPFP cores as shown in Fig. 3-11. The Virtex-5 SX95T had the highest CD/W score of all 65 nm devices, with the other 65 nm Altera and Xilinx FPGAs clustered together. For 90 nm devices, the Virtex-4 devices were the clear winners for all levels of parallelism for the power-normalized metric. Although it has higher performance, tighter integration of multiple cores on a single die, and improved power efficiency over previous generations, the Xeon X3230 (shown in Fig. 3-10) continues to lag behind the 65 nm RMC devices in CD/W. The 65 nm PowerXCell 8i, a version of the Cell BE with improved DPFP performance, had the highest overall DPFP


CD performance of all devices, but was hampered in CD/W by its 92 Watts of maximum power consumption. The CSX600 shows remarkable overall CD/W performance for a 130 nm device due to its relatively high CD score and low power consumption, as shown in Fig. 3-10. Although it was the best raw CD performer of the 90 nm devices, the Xeon 7041 was the worst device in terms of CD/W due to its very high power consumption.

Figure 3-7. 32-bit integer computational density per watt (RMC)

3.4.6 Internal Memory Bandwidth

Figs. 3-12 and 3-13 show the IMB metric for cache-based systems and block-based systems, respectively. BBS devices, specifically FPGAs, tend to far outperform CBS devices. The MPC7447, MPC8640D, and Atom N270 significantly lag BBS devices due to their limited parallel paths to on-chip memory coupled with their low memory-access


frequency, relative to other CBS devices.

Figure 3-8. Single-precision floating-point computational density per watt (FMC)

The TILE64, having 64 L1 and L2 caches with a 750 MHz access frequency, has moderate parallel access to cache, but still cannot compete with any of the FPGAs. The Athlon, Opteron, and Xeon processors have multiple cores each having paths to cache. Their relatively low parallelism to memory is not overcome by their high memory-access frequency, and thus they also significantly lag behind many BBS devices. From Fig. 3-13, we can see that FPGAs dominate this metric, even at relatively low achievable frequencies, due to their high level of parallel access to memory.
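To make the source of this gap concrete, the IMB equations of Subsection 3.2.5 can be evaluated directly. The following is a sketch under stated assumptions rather than the calculation behind Figs. 3-12 and 3-13: the function names are hypothetical, each dual-port block is assumed to contribute two ports, one cycle per access is assumed for FPGA block memory, the block counts and the memory frequencies of Table 3-5 are used in place of a design's achievable frequency, and the cache example uses made-up values.

```python
def imb_block_gbytes(memories):
    """IMB_block in GB/s.

    Each entry is (N, P, W, f_mhz, CPA): number of memories, ports per
    memory, width in bits, frequency in MHz, and cycles per access.
    """
    mbytes_per_s = sum(n * p * w * f_mhz / (8 * cpa)
                       for n, p, w, f_mhz, cpa in memories)
    return mbytes_per_s / 1000.0


def imb_cache_gbytes(hit_rate, memories):
    """IMB_cache in GB/s for one cache level: hit-rate scale factor times the sum."""
    return hit_rate * imb_block_gbytes(memories)


# Block memories from Table 3-5 (P = 2 and CPA = 1 are assumptions).
virtex4_sx55 = [(320, 2, 32, 500, 1)]
stratix2_s180 = [(9, 2, 128, 420, 1), (768, 2, 32, 550, 1), (930, 2, 16, 500, 1)]

sx55_imb = imb_block_gbytes(virtex4_sx55)
s180_imb = imb_block_gbytes(stratix2_s180)

# Hypothetical cache level: one dual-ported 128-bit interface at 3000 MHz, 90% hits.
cache_imb = imb_cache_gbytes(0.9, [(1, 2, 128, 3000, 1)])
print(sx55_imb, s180_imb, cache_imb)
```

With these inputs the block count term $N_i$ differs by more than two orders of magnitude between the FPGA and the cache example, which outweighs the cache's higher access frequency.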


Figure 3-9. Single-precision floating-point computational density per watt (RMC)

3.5 Discussion

We have presented a taxonomy and a set of reconfigurability factors for classifying fixed and reconfigurable device accelerator technologies. These factors and taxonomy provide useful concepts and terminology to define characteristics of computing technologies. Additionally, we have presented a methodology to comparatively assess these technologies in terms of computational and memory performance and power consumption. Finally, we have shown the large variations in resulting data that can arise when this methodology is applied to disparate accelerator technologies.

As shown in Section 3.4, various devices show good computational performance depending on the level of exploitable parallelism and the size and type of operation


considered.

Figure 3-10. Double-precision floating-point computational density per watt (FMC)

Although FMC devices tended to perform better in terms of raw floating-point CD, the RMC devices performed better when this metric was normalized by power consumption. RMC devices have a distinct advantage over FMC devices since they can adapt to a much wider range of problem requirements with better power efficiency. In this study, RMC devices showed strong CD performance for most precisions and showed a clear CD/W advantage in all cases. Specifically, FPGAs with many low-power DSPs tended to have very high CD/W scores, even for floating-point operations. We recognize there may still be specific situations, especially when high raw double-precision floating-point performance is required, when developers may want to choose an FMC device.


Figure 3-11. Double-precision floating-point computational density per watt (RMC)

Block-based systems tended to have better memory performance than cache-based systems. FPGAs dominate the IMB metric due to their many parallel paths to on-chip memory and low overhead for memory accesses. The mismatch between raw CD and memory-sustainable CD for some devices demonstrates the importance of balancing computational capabilities and memory access capabilities in architecture design. CD and CD/W for devices such as the Virtex-4 SX55 and Virtex-5 SX95T could be improved by implementing more parallel memory structures. This would increase their aggregate IMB, and more parallel operations could be sustained, leading to higher CD scores and potentially higher CD/W scores. Memory-sustainability is rarely an issue for Altera FPGAs due to their better balance between computation and memory access


capabilities, although 16-bit integer CD and CD/W could be slightly improved in the Stratix-IV SE530 with more parallel memory structures.

Figure 3-12. Internal memory bandwidth (CBS)

In general, the newer process technology devices performed better on CD/W than the older process technology devices. This demonstrates a key aspect of the architecture reformation: we have not reached the end of Moore's law, and explicit parallelism will allow us to fully utilize process technology advances and increasing transistor counts while achieving good power efficiency. We recognize the need for progress in the application reformation to enable programs to exploit the high level of parallelism presented here.


Figure 3-13. Internal memory bandwidth (BBS)

There are several other important metrics for overall system performance that are planned for future work. Off-chip memory bandwidth describes the ability of a device to keep its processing elements fed with data from external sources. The I/O capabilities and bandwidth are also important considerations in some systems. Finally, cost is another driving factor in device selection, but is difficult to assess due to the time-varying, volume-dependent nature of retail cost and other complex cost components (e.g., NRE and production costs). We also plan to explore and analyze application metrics and to develop a mapping between the application metrics and the metrics presented here.


CHAPTER 4
EXTERNAL DEVICE METRICS

4.1 Memory-based device metrics

This chapter overviews some of the previous research on device metrics and introduces new memory-based metrics. This includes a quick review of CD and CD/W from the previous chapter. In addition, this section outlines the methodology for the new memory-based device metrics.

4.1.1 Review of CD and CD/W

The CD metric [Williams et al. 2011] is used to determine the computational capability of a device and compare it to other devices, both within and between architectural categories at varying levels of precision¹. A device's capability for computation is characterized by its integer CD, floating-point CD, and bit-level CD at various sizes. As in [Williams et al. 2011], we use MC to collectively refer to multi-core and many-core devices, which have at least two major computational components in a single package; Fixed MC (FMC) and Reconfigurable MC (RMC) are the two primary classes. FMC devices have a fixed hardware structure that cannot be changed after fabrication. RMC devices can change their logical hardware structure after fabrication to adapt to changing problem requirements. The reader should refer to [Williams et al. 2011] for a more detailed discussion on the reconfigurability factors that are used to classify a device as either FMC or RMC.

¹ Portions of this chapter, © 2010 IEEE, have been previously published and adapted/reprinted with permission of: Richardson, J.; Fingulin, S.; Raghunathan, D.; Massie, C.; George, A.; Lam, H., "Comparative analysis of HPC and accelerator devices: Computation, memory, I/O, and power," High-Performance Reconfigurable Computing Technology and Applications (HPRCTA), 2010 Fourth International Workshop on, vol., no., pp. 1, 10, 14-14 Nov. 2010, doi: 10.1109/HPRCTA.2010.5670797


To determine the integer CD for FMC and coarse-grained RMC devices, Eq. 4 is used, where $N_i$ is the number of integer execution units or the number of integer instructions that can be issued simultaneously of element type $i$, $CPI_i$ is the average number of clock cycles per integer instruction for element type $i$ (such as DSP, ALU, or LUT resources), and $f$ is the operating frequency of the device. The subscript $i$ represents the type of computational element within the device that is under analysis. The summation over $i$, in this equation, takes into account architectures that support vector/SIMD integer instructions by including different types of computational components. We assume that only addition and multiplication operations are considered, and the number of parallel operations is maximized while keeping the number of additions and multiplications equal. When calculating the number of parallel operations supported by a device, we consider a hardware-supported multiply-accumulate operation as only one operation.

$$CD_{int} = f \times \sum_i \frac{N_i}{CPI_i} \tag{4}$$

For FPGAs, integer CD is determined using achievable frequency and the number of parallel operations of a fully utilized logic fabric and DSP resources. A single integer core for both addition and multiplication is instantiated on an FPGA using vendor IP cores. For each core, the resource utilization along with the maximum achievable frequency is determined from the vendor tools. This information allows the number of simultaneous cores that can be instantiated on a device to be determined, utilizing all available DSP and logic resources and assuming 15% logic overhead for steering logic and I/O interfacing. Again, only addition and multiplication operations are considered and balanced. The number of parallel operations is multiplied by the maximum achievable frequency, which is limited to the lower of the multiplication and addition frequencies. Based on the amount of available on-chip memory resources, the number of parallel operations is limited in order to incorporate memory bandwidth or on-chip RAM resources for
data buffering, which can have a limiting effect on the peak CD. The on-chip memory needs to allocate two operands per operation for memory-sustainable CD, which is the CD used throughout this paper. This provision ensures that the number of parallel operations a device can support is limited by the realistic ability of the internal memory structure to provide data for each parallel operation.

To illustrate the process in which CD is calculated for a device, a Virtex-6 SX475T FPGA can be analyzed for integer performance. For example, when calculating the 32-bit integer (i.e., Int32) CD for the Virtex-6 SX475T, the Int32 IP cores of adders and multipliers are first generated with tools supplied by the vendor. One of each design is synthesized and simulated on the FPGA. Using the utilization report, it can be determined that the fabric could support 1937 operations in parallel, half multiplies and half additions. From [Xilinx, Inc. 2008b] we see that the block RAMs of the fabric can only supply 1064 pairs of operands each clock cycle. Since this amount is less than the maximum number of parallel operations, 1064 is the maximum amount of memory-sustainable operations that can be computed in parallel. Using the timing report generated from the design of multipliers and adders, the operating frequency is determined to be 296 MHz. Since each operation takes one cycle, CPI is 1, and we are computing 1064 operations in parallel. Using Eq. 4, the memory-sustainable Int32 CD for the device is calculated as:

$$CD_{int} = 296\ \text{MHz} \times 1064\ \text{ops} = 314.944\ \text{GOPS} \tag{4}$$

The CD per watt (CD/W) metric is calculated by taking the CD for each level of parallelism and dividing by the power consumption at that level of parallelism. For FMC, the maximum power is used. For RMC, power is assumed to scale linearly with resource utilization and achievable frequency. Floating-point CD and CD/W are determined using a similar procedure as integer CD and CD/W. For a more detailed methodology of CD
and CD/W, please refer to [Williams et al. 2011] [Williams et al. 2008a] [Williams et al. 2008b].

4.1.2 Review of IMB

IMB is used to measure the on-chip memory interface capabilities. It quantifies the memory performance of a system by measuring the rate at which data can be transferred from on-chip memories to the processing elements. IMB is important because memory often becomes a bottleneck, limiting the number of operands supplied to the processing elements of the system. It is defined separately for cache-based systems (CBS) and block-based systems (BBS). Eq. 4 from [Williams et al. 2011] is used to calculate the IMB for BBS.

$$IMB_{block} = \sum_i \frac{N_i \times P_i \times W_i \times f_i}{8 \times CPA_i} \tag{4}$$

In Eq. 4, $N_i$ is the number of block memories of type $i$, $W_i$ is the data width, $P_i$ is the number of ports for memory of type $i$, $CPA_i$ is the number of cycles per access to memory, division by 8 converts from bits per second to bytes per second, and $f_i$ is the frequency of the device. If the frequency is not constant, as in FPGAs, then it is variable up to the operating frequency of the design. Once again, the subscript $i$ denotes memory type to support devices that have more than one type of internal memory.

In CBS, multiple separate levels of cache may exist, and IMB is calculated separately for each, so that the hit-rate consideration is only included per level of cache. Cache-based systems also feature hardware to determine associativity, line size, coherency protocol, replacement algorithms, etc. These parameters affect the access times of the cache, and are addressed by the frequency and cycles-per-access variables. The equation used to calculate IMB for CBS is taken from [Williams et al. 2011] and given in Eq. 4:
$$IMB_{cache} = \%\,\text{hit rate} \times \sum_i \frac{N_i \times P_i \times W_i \times f_i}{8 \times CPA_i} \tag{4}$$

In Eq. 4, $N_i$ is the number of block memories of type $i$, $W_i$ is the data width, $P_i$ is the number of ports for memory of type $i$, $CPA_i$ is the number of cycles per access to memory, division by 8 is to convert from bits to bytes per second, and $f_i$ is the frequency of the device. IMB does not account for register access times; instead it is assumed that these are internal components separate from block or cache memory. Registers are usually quickly accessed and do not hinder performance. IMB only seeks to evaluate the rate at which processing elements have all the necessary operands, and it is assumed that registers do not limit this rate.

To aid the understanding of this calculation, an example using the Virtex-6 SX475T is presented. On the Virtex-6 SX475T, there are 1064 36 Kb block RAMs. Each of these has a 72-bit port width, and simple dual-port functionality. The maximum operating frequency is used, which is 600 MHz. Since all of these BRAMs are the same, the summation is over this single set. Using Eq. 4, the IMB for the device is calculated as:

$$IMB_{block} = \frac{1064 \times 2 \times 72 \times 600\ \text{MHz}}{8 \times 1} = 7776\ \text{GB/s} \tag{4}$$

4.1.3 External Memory Bandwidth

External Memory Bandwidth (EMB) is proposed and introduced in this paper to describe the total bandwidth achievable to external memory from a device or vice versa. EMB only includes the bandwidth of usable data, so extra bits used for error-correction coding (ECC) are not included. EMB and IOB (detailed in Subsection 4.1.4) are introduced to characterize the capability of a device to interface with the rest of the system. EMB does not include I/O bandwidth or network-controller bandwidth as these are typically at the cost of a user-defined interfacing implementation for an application.
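The two IMB equations reviewed in Subsection 4.1.2 share one summation, which a short Python sketch can make concrete. This is an illustrative sketch rather than the authors' tooling: the `MemoryType` class, the function names, and the example parameters (100 dual-port, 32-bit memories at 500 MHz with a 90% hit rate) are hypothetical and were chosen only so that the arithmetic is easy to check by hand.

```python
from dataclasses import dataclass


@dataclass
class MemoryType:
    """One type of on-chip memory (index i in the IMB equations)."""
    count: int                # N_i: number of memories of this type
    ports: int                # P_i: ports per memory
    width_bits: int           # W_i: data width per port, in bits
    freq_hz: float            # f_i: operating frequency, in Hz
    cycles_per_access: float  # CPA_i: clock cycles per access


def imb_block(memories):
    """IMB for block-based systems, in bytes per second."""
    return sum(
        m.count * m.ports * m.width_bits * m.freq_hz / (8 * m.cycles_per_access)
        for m in memories
    )


def imb_cache(memories, hit_rate):
    """IMB for one cache level: the block formula scaled by the hit rate."""
    return hit_rate * imb_block(memories)


# Hypothetical device, not one of the devices in the study.
example = [MemoryType(count=100, ports=2, width_bits=32,
                      freq_hz=500e6, cycles_per_access=1)]
print(f"IMB (block): {imb_block(example) / 1e9:.1f} GB/s")
print(f"IMB (cache, 90% hits): {imb_cache(example, 0.9) / 1e9:.1f} GB/s")
```

Passing a list of memory types mirrors the summation over $i$, so a device with several kinds of internal memory only needs one list entry per kind.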


EMB is defined only for directly attached memory. Although a device could access another device's memory through an I/O port, this is not considered in the calculation of EMB. For FMC and coarse-grained RMC devices with built-in memory controllers, EMB is the sum of the concurrent EMB provided by all memory controllers. For devices that use a front-side bus (FSB), the entire bus is allocated for memory bandwidth. Otherwise, the external data path width and the switching frequency are used to determine EMB.

To determine EMB for FPGAs, the methodology employed is similar to determining CD for FPGAs. A single memory controller is instantiated on the FPGA using a vendor IP core. The resource utilization is determined along with the maximum-achievable memory interface frequency. The number of simultaneous cores that can be instantiated utilizing all available resources is determined, assuming 15% logic overhead for steering logic and I/O interfacing. Limiting factors include the number of LUTs, ALMs/Slices, and the number of bonded IOBs.

To get a better understanding of how to calculate EMB, a step-by-step calculation for the Virtex-6 SX475T is as follows. A single DDR2 memory-controller IP core is instantiated on the chip and the resource utilization is obtained. The maximum number of DDR2 controllers that can be instantiated simultaneously is calculated by dividing the available number of bonded IOBs (840) by the number of IOBs used by a single memory controller (121) and rounding down, which gives 6. The memory-interface frequency (533 MHz) is multiplied by the memory-interface width (64 bits) using the appropriate units, to get the EMB of one DDR2 controller as 8.528 GB/s. This rate is in turn multiplied by 6, the maximum number of DDR2 controllers instantiated, to calculate a maximum EMB of 51.168 GB/s.

4.1.4 IOB

IOB is proposed here and used to describe the total I/O capabilities of a device, not just the external memory bandwidth. Devices with dedicated ports for interfacing with memory often also have additional ports for data input/output, which are not considered
in the EMB calculation. Devices may also have higher bandwidth capabilities on a port that shares all or some pins with ones used for a memory interface, such as is the case for FPGAs. An I/O bandwidth metric describes the maximum data throughput of a device that EMB omits or a higher total bandwidth that is possible on a device.

IOB is calculated as the total aggregate sum of the bandwidth provided by all inputs and outputs that can operate concurrently. The highest bandwidth ports are used when there is overlap or non-concurrency. Eq. 4 shows the aggregation of I/O ports based on $i$, where $i$ represents the different types of I/O ports that can be used concurrently. Line encoders can be used to encode data into a different format which benefits transmission for reasons other than data throughput. Various schemes, such as 8b/10b or 64b/66b, can be employed that have varying overheads on the line rate. If an encoding scheme is used, such as 8b/10b, then $\alpha_i$ represents the fraction of IOB that is available for data. For the case of 8b/10b encoding, we assume that fraction is 0.80. The aggregate sum is then added to the input/output bandwidth of any dedicated external memory controllers available on the device (denoted as $IOB_{mem}$).

$$IOB = IOB_{mem} + \sum_i \alpha_i \times IOB_i \tag{4}$$

There are numerous ways to characterize the I/O of a device. In single-ended I/O, one signal is made between two ICs and compared to a specified voltage range or to a reference voltage. In differential signaling, two signals are made between two ICs and the signals are compared to each other to determine the logic value [Athavale and Christensen 2005]. These two signaling methods can have differing bandwidths, even when comparing two single-ended signals to a single differential signaling pair. When studying devices' IOB, it is important to use fair comparisons and keep all parameters equal when direct comparisons are desired.

For an example, consider the Nvidia Tesla C1060 GPU. There are two interfaces for I/O on this processor, the memory interface and the PCIe bus. To compute IOB, the
aggregate is taken of both interfaces and the calculations are shown in the equations that follow. The C1060 has 4 GB of dedicated GDDR3 memory clocked at 800 MHz on a 512-bit interface. The PCIe interface has a 500 MB/s transfer rate for each lane in each direction.

$$IOB_{mem} = 800\ \text{MHz} \times (512\ \text{bits} / 8) \times 2 = 102.4\ \text{GB/s} \tag{4}$$

$$IOB_{PCIe} = 500\ \text{MB/s} \times 16\ \text{lanes} \times 2 = 16\ \text{GB/s} \tag{4}$$

$$IOB = IOB_{mem} + 1 \times IOB_{PCIe} = 118.4\ \text{GB/s} \tag{4}$$

4.2 Results and Analysis

In this section we focus on reporting trends observed amongst various devices when the suite of metrics is applied to them. For our study, a full range of results was collected, including Bit, Int16, Int32, Single-Precision Floating Point (SPFP) and Double-Precision Floating Point (DPFP) forms of CD and CD/W, as well as IMB, EMB, and IOB for the devices shown in the following figures. Due to the large number of devices which have been included in our study, we illustrate the more interesting cases through graphs in this section and have reported all the other metrics in expansive tables in the Appendix. Many of these devices are new to this paper and were not included in [Williams et al. 2011]. Only selected metrics are presented in detail: Int16 CD/W for RMC and FMC devices; SPFP CD/W for RMC and FMC devices; EMB for FMC and RMC devices; and IOB for FMC and RMC devices. The FMC devices highlighted in this study, listed in Fig. 4-2, include a range of CPUs, DSPs, and GPUs. The RMC devices, listed in Fig. 4-1, include FPGAs of varying types, the Tilera TILE64 processor, and the PACT XPP-3c. In this study the maximum level of exploitable parallelism of devices was calculated and the performance of devices is compared by varying the level of parallelism. This means that the clock frequency is recalculated for each metric (i.e., Int16 vs. Int32) and varies based on which metric is being calculated.
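Before turning to the results, the EMB and IOB worked examples from Subsections 4.1.3 and 4.1.4 can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the authors' tooling: the function names are hypothetical, the EMB helper treats bonded IOBs as the only limiting resource, and the factor of two in the DDR2 rate is this sketch's reading of "the appropriate units" (double data rate), chosen because it reproduces the 8.528 GB/s figure quoted in the text.

```python
def emb_fpga(bonded_iobs, iobs_per_controller, interface_hz,
             width_bits, transfers_per_clock=2):
    """Return (controller count, aggregate EMB in bytes per second).

    Assumptions of this sketch: bonded IOBs are the only limiting
    resource, and the interface moves transfers_per_clock words per clock
    (2 for a double-data-rate memory).
    """
    controllers = bonded_iobs // iobs_per_controller  # round down
    per_controller = interface_hz * transfers_per_clock * width_bits / 8
    return controllers, controllers * per_controller


def iob_total(iob_mem, ports):
    """Aggregate IOB in bytes per second.

    ports is an iterable of (data_fraction, raw_bandwidth) pairs, where
    data_fraction is the share of the line rate left for data (0.8 for
    8b/10b, 1.0 when the quoted rate is already a data rate).
    """
    return iob_mem + sum(fraction * bw for fraction, bw in ports)


# Virtex-6 SX475T with DDR2 controllers (Subsection 4.1.3).
controllers, emb = emb_fpga(840, 121, 533e6, 64)
print(f"{controllers} controllers, EMB = {emb / 1e9:.3f} GB/s")

# Nvidia Tesla C1060 (Subsection 4.1.4); factors copied from the text.
iob_mem = 800e6 * (512 / 8) * 2   # 800 MHz, 512-bit interface, factor of 2
iob_pcie = 500e6 * 16 * 2         # 500 MB/s per lane, 16 lanes, 2 directions
iob = iob_total(iob_mem, [(1.0, iob_pcie)])
print(f"IOB = {iob / 1e9:.1f} GB/s")
```

The PCIe port is given a data fraction of 1.0 because the text quotes its per-lane figure as a transfer rate and multiplies it by 1 in the final sum; an FPGA serial link quoted as a raw line rate with 8b/10b encoding would instead use 0.8.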


4.2.1 CD and CD/W Metrics

Fig. 4-1 shows Int16 and SPFP forms of CD/W for RMC devices. The bars are grouped by device, one representing Int16 precision and the other SPFP. The data labels above each bar denote the maximum number of parallel operations that each device is capable of sustaining at the given precision. Some interesting results can be observed from the figure. The EP4SE530 has the highest memory-sustainable Int16 CD/W (54.07 GOPS/Watt) due to the large logic fabric and the high amount of on-chip memory. The Virtex-6 LX760 has the highest memory-sustainable SPFP CD/W (10.27 GOPS/Watt). The PACT XPP-3c has the largest number of parallel operations (348) of non-FPGA devices studied and has higher Int16 CD/W (16.24 GOPS/Watt) than all the other coarse-grained RMC devices such as the TILE64 even though it has no SPFP support.

Figure 4-1. CD/W for RMC devices

Fig. 4-2 shows Int16 and SPFP forms of CD/W for FMC devices in the same format as Fig. 4-1. The TI-OMAP device has the highest Int16 CD/W ratio (7.72 GOPS/Watt). Even though this device does not have a large number of computational resources,
it achieves high CD/W due to its low power consumption. Another trend visible from this figure is significantly lower CD/W numbers for the FMC devices as compared to RMC, which can be directly attributed to the presence of a vast number of computational resources on an FPGA and the higher power consumption of FMC devices. The Virtex-6 LX760 has the greatest unconstrained Int16 CD (3,443.2 GOPS), but it is limited by its IMB, which significantly lowers the memory-sustainable CD/W to 48.37 GOPS/Watt. For SPFP CD/W, the results show that most RMC devices perform better than GPU devices; this is because GPU devices have high power consumption, which offsets the benefits of their superior SPFP performance. For high levels of parallelism, the GeForce GTX 480 has the highest SPFP CD/W (4.12 GOPS/Watt) of the FMC devices studied. Although high-end FPGAs in the Stratix IV and Virtex-6 families have a much larger number of parallel operations than the GeForce GTX 480, the achievable frequency is low compared to the operating frequency of the GTX 480. However, the FPGAs mentioned have a CD/W that is 3x larger than that of the GTX 480. Interestingly, the GTX 285 does not perform nearly as well as the GTX 480, even though they are of the same family, due to the high power consumption of the device and lower number of processors. Similarly, the Core i7-980X (1.84 GOPS/Watt) and Itanium 9350 (1.18 GOPS/Watt) have high CD performance, but are not power-efficient. The TI OMAP-L137 DSP uses very low power, which allows it to perform well in CD/W despite having the lowest CD of the devices studied. For a complete list of devices with their respective CD values, see Table A-5 in the Appendix. Table A-2 in the Appendix reports CD for FPGA devices for Int16 and Int32 precisions. This table is included to point out the difference in frequencies and power usage over various precisions for FPGAs.

4.2.2 EMB Results

Fig. 4-3 shows EMB for key RMC devices in GB/s. The Virtex-6 LX760 has the highest EMB (68.2 GB/s) of the devices studied due to the fact that the LX760 has
a higher number of bonded IOBs than corresponding devices. These bonded IOBs allow the simultaneous instantiation of a higher number of DDR2 controllers resulting in a higher EMB. The numbers above each bar in the graph show the remaining logic utilization after the instantiation of memory controllers required to attain maximum EMB. These percentages illustrate the fact that a very low logic overhead is required to have multiple memory controllers and hence attain a higher EMB. Interestingly, it is seen that most of the devices require almost no logic utilization after instantiating their memory controllers whereas the Virtex-4 SX55 occupies almost 40% of the chip for its two memory controllers.

Figure 4-2. CD/W for FMC devices

Fig. 4-4 shows EMB for FMC devices in GB/s. As expected, GPU devices perform the best amongst all categories of devices. GPUs are designed to handle highly parallel applications which require large sets of streaming data. This design makes it necessary to have fast and wide memory buses that result in high EMB. The Nvidia GeForce
GTX 480 has the highest EMB (177.4 GB/s) amongst all devices. CPU devices typically handle smaller applications using smaller sets of data, hence they have a lower EMB. The Intel Xeon X7560 has the highest EMB (34.11 GB/s) amongst the non-GPU FMC devices, which is due to the addition of the high-bandwidth Intel QuickPath Interconnect (QPI).

Figure 4-3. EMB for FPGA devices

4.2.3 IOB Results

Fig. 4-5 shows IOB for both FMC and RMC devices in GB/s. The package of an FPGA is shown in parentheses. The FPGAs assume differential signaling and an 8b/10b encoding scheme, which has an overhead of 20 percent, in order to compare the I/O data rates of FMC devices to the I/O line rates of the FPGAs. The I/O bandwidth shown consists of equal parts input and output. It should be noted that an unbalanced I/O
can have an effect on the total I/O achievable by a device since not all channels are bidirectional and there may be an unequal number of input and output ports.

Figure 4-4. EMB for FMC devices

The GeForce GTX 480 has the highest IOB (193.4 GB/s) of the devices studied, followed by the GTX 285 (175 GB/s). GPUs are optimized for 3D rendering, which requires processing on large working sets of data. This form of data processing makes having a wide and fast memory bus a necessity to achieve high performance. Comparing against microprocessors, the amount of data that needs to be processed is too large to fit in the cache of a CPU. The working sets of applications that typically run on CPUs have random memory access patterns and are smaller than those that run on a GPU, requiring many frequent fetches from off-chip memory. CPU memory interfaces are shifting from buses to a very fast group of serial data lines communicating via packets with much lower latency, such as HyperTransport or Intel's QPI. As CPUs
have been increasing in the number of cores and IOB, streaming applications may be more effectively parallelized on them.

Figure 4-5. IOB data for selected devices

4.3 Discussion

We have enhanced our existing methodology for device metrics to assess the off-chip memory bandwidth of devices, using external memory bandwidth (EMB) and I/O bandwidth (IOB). Developers can use the device metrics described in this paper and in [Williams et al. 2011] to assist in algorithm-guided device selection early in the development cycle and to understand device tradeoffs. We have also presented a study of a new and diverse set of devices to determine their computational capabilities and off-chip bandwidths. There is a large variation in the resulting data that arises when these metrics are used to study disparate accelerator technologies.


A few interesting trends observed by evaluating the data include FPGA devices showing the highest CD and CD/W for bit and integer operations. This trend can be attributed to the large fabrics and number of LUTs enabling FPGAs to achieve massive amounts of parallelism. As observed in Section 4.2, GPUs tend to perform well in most categories; however, they stand out in floating-point calculations due to the high clock rates of their shader units and the sheer number of them (the GTX 480 has the highest SPFP CD: 1031.136 GOPS). CPUs also perform well in floating-point, especially double-precision. Many of the other devices have to expend extra resources or clock cycles for DPFP calculations; however, modern CPUs have dedicated functional units for that purpose. Combined with the highest clock speeds of any of the studied devices, they perform DPFP operations well.

The EMB and IOB results show that GPUs have the highest external bandwidth, with very wide memory controllers working at very high frequencies. FPGAs also perform well, but their operating frequency keeps them from matching GPU memory bandwidth. Some of the newer CPUs, particularly the Intel Core i7-980X, achieve very high IOB scores as compared to other CPUs. This is due to the shift from the FSB connection to QPI. Using a point-to-point interconnect allows a much higher clock rate than a shared bus, providing much higher data bandwidth.

Future work is planned to allow for more user-defined parameters when calculating certain metrics. These parameters include the addition of more available operations and varying the ratio of operations when determining CD. This expansion will allow users to more closely determine which device would fit their algorithm based upon the required calculations. Another planned extension of these metrics includes the parameterization of IOB and EMB metrics. Our goal is to allow users to define and customize the individual attributes used to calculate each metric to best suit their application.


CHAPTER 5 REALIZABLE UTILIZATION

The modern processor landscape is a varied and diverse community.¹ As such, developers need a way to quickly and fairly compare various devices for use with particular applications. This chapter expands the authors' previously published computational-density metrics and presents an analysis of a new generation of various device architectures including CPU, DSP, FPGA, GPU, and hybrid architectures. Also, new memory metrics are added to expand the existing suite of metrics to characterize the memory resources on various processing devices. Finally, a new relational metric, realizable utilization (RU), is introduced which quantifies the fraction of the computational density metric that an application achieves within an individual implementation. The RU metric can be used to provide valuable feedback to application developers and architecture designers by highlighting the upper bound on specific application optimization and providing a quantifiable measure of theoretical and realizable performance. Overall, the analysis in this chapter quantifies the performance tradeoffs between the architectures studied, the memory characteristics of different device types, and the efficiency of device architectures.

¹ Portions of this chapter, © 2012 IEEE, have been previously published and adapted/reprinted with permission of: Richardson, J.W.; George, A.D.; Lam, H., "Performance Analysis of GPU Accelerators with Realizable Utilization of Computational Density," Application Accelerators in High Performance Computing (SAAHPC), 2012 Symposium on, vol., no., pp. 137,140, 10-11 July 2012, doi: 10.1109/SAAHPC.2012.13

5.1 Expanded Computational Device Metrics

The ever-changing landscape of computational technologies has created a need for methodologies by which diverse processing architectures can be quickly and objectively compared. Previously published methods, including our own, based on computational device metrics, have been effective in performing such comparisons [DeHon 1996; Williams et al. 2011; Milluzzi et al. 2014; Sohi and Franklin 1991; Saulsbury et al.
1996; Burger et al. 1996]. Using this approach, a modern processing device, regardless of whether it is fixed-logic (e.g., CPU, DSP, GPU), reconfigurable-logic (e.g., FPGA, CPLD), or hybrid (e.g., CPU/FPGA, CPU/DSP, CPU/GPU), can be abstractly characterized using key device metrics and thus be compared objectively.

This paper expands the authors' previously published work in three ways. First, using metrics that we established in [Williams et al. 2011], this paper presents an analysis of a new generation of various device architectures. Second, this paper expands our existing suite of metrics with new metrics to characterize the memory resources on various processing devices. Finally, a third contribution of this paper is to introduce a new relational metric, realizable utilization (RU), which quantifies the fraction of the computational density metric (i.e., upper bound) that an application achieves within an individual implementation. Focusing on these concepts, a brief outline of each section is presented in the following paragraphs.

In this paper, a large suite of fixed-logic, reconfigurable-logic, and hybrid devices is analyzed based on their computational density and computational density per Watt. The computational density (CD) of a device characterizes the computational capacity of the available processing cores on a device. Computational density per Watt (CD/W) is the power-aware version of this metric and normalizes the computational density by power consumption. In Section 5.2, these device metrics will be reviewed in more detail and the computational metrics will be used to analyze a large suite of new generations of CPU, DSP, FPGA, GPU, and hybrid devices. The results provide insight into the strengths and weaknesses of different device architectures in relation to theoretical computational performance. From the CD and CD/W analysis, this paper will show the different performance tradeoffs between device architectures, operation precisions, and power efficiencies.

In addition to computational capacity, the bandwidth of external memory units is a crucial performance bottleneck. In order to capture the performance of both memory
and I/O peripherals, the external memory bandwidth (EMB) and the input/output bandwidth (IOB) metrics are introduced for characterizing and comparing devices with regard to data movement. IOB and EMB, which highlight memory resources, will be defined in more detail in Section 5.3 and will be used to analyze the same suite of devices. The results from this section will showcase the relative strengths of devices with high-speed memories in contrast to devices with more generic I/O capabilities.

Finally, device metrics provide a first-order analysis, effectively providing an upper bound of a device's capability. How much of the device's upper-bound capability can be used in a particular application, however, is determined by many factors, including application characteristics, design tools, and user experience. To explore this relationship, Section 5.4 introduces and defines the realizable-utilization metric. The authors use RU to show how close real-world applications can get to the upper bound found with the metrics. Prior to benchmarking investment, RU results from a literature study, focused on GPUs, showcase the decreasing application efficiency as the computational density of new devices increases. RU benchmarking on a large set of kernels and devices provides insight into the performance that is gained using hand-coded intrinsics or optimized libraries versus benchmarking kernels coded for portability. In summary, the RU metric can be used to provide valuable feedback to application developers and architecture designers by highlighting the upper bound on specific application optimization and providing a quantifiable measure of theoretical and realizable performance. Section 5.5 concludes this paper with a final summary of insights and directions for future work.

5.2 CD and CD/W

In this section, we review two of our previously published metrics that characterize a device's computational capacity, CD and CD/W. Following the review, Subsections 5.2.2 and 5.2.3 show the results and give analysis of the top-scoring new devices studied.


This analysis provides insights into how a device's architecture can be characterized and compared with other device architectures available on the market.

5.2.1 Review of CD and CD/W

Our previous work in [Williams et al. 2011], reviewed here, introduced CD, a metric which was used to determine the computational capability of a device, and was used to compare devices both within and between architectural categories using operations of varying data types such as floating-point, integer, and bit-level. This methodology also allowed for varying operations such as addition, multiplication, multiply-accumulate, etc., and the ratio between each instruction type and the data precision (e.g., 16-bit, 32-bit, 64-bit) could easily be adjusted as desired. This flexibility allowed for analysis of data operations of any size the hardware could support. CD results were presented and used to evaluate the computational performance of the two broad categories of processing devices, fixed-logic and reconfigurable-logic. Fixed-logic devices have a fixed hardware structure that cannot be changed after fabrication (e.g., CPUs, DSPs, GPUs). Reconfigurable-logic devices can change their logical hardware structure after fabrication to adapt to changing problem requirements (e.g., FPGAs). The reader may refer to [Williams et al. 2011] for a more detailed discussion on the reconfigurability factors that were used to classify a device.

In our previous work and in this new study, we focused on addition and multiplication instructions for data types ranging from 16-bit integer (i.e., Int16) to 64-bit double-precision floating point (i.e., DPFP). During this study, we maximized the number of parallel operations while keeping the number of additions and multiplications equal; however, the methodology allows for other operations and mixes as desired. The units for CD are operations per second (OPS) and, when calculating the number of parallel operations supported by a device, we considered a hardware-supported multiply-accumulate as only one operation.


As an example of CD for fixed-logic devices, Eqs. 5 and 5 show the CD calculations for a 32-bit integer (Int32) analysis with a 50% add-multiply split for the AMD Trinity A10-6800K APU. Since this is a hybrid device, the contributions from both the CPU and GPU halves were combined. For the CPU side, the operating frequency of the device was multiplied by both the number of operating cores and the sum of all available processing elements running Int32 additions and multiplications, as shown in Eq. 5. To maintain memory sustainability, the number of operations that could be processed was limited by the incoming data bus width. Thus, the data bus width (128-bit) of each unit was divided by the data type size (32-bit) to determine the number of instructions that could be supported by the memory architectures at the same time.

$$CD_{Int32\ CPU} = 4.4\ \text{GHz} \times 4 \times \sum_{i=1}^{4} \frac{128}{32} = 281.6\ \text{GOPS} \tag{5}$$

The GPU side, Eq. 5, was similar, with the key differences being the operating frequency of the GPU (0.844 GHz) and the number of available computational units (16). Together, these equations yielded a total device CD of 605.7 GOPS. In this example, the GPU's single-instruction, multiple-data (SIMD) processing units contributed significantly to the overall computational capability of the device.

$$CD_{Int32\ GPU} = 0.844\ \text{GHz} \times 6 \times \sum_{i=1}^{16} \frac{128}{32} = 324.1\ \text{GOPS} \tag{5}$$

Eq. 5 shows how a different data type, DPFP, limits the available units capable of handling that type of instruction and thereby changes the calculation for the CPU side of the device. Similarly for the GPU side, the only factor that changed was the operation size, reducing the available bandwidth by half. Changing the type of instruction to DPFP yields a GPU CD of 162.05 GOPS, thereby resulting in a total device CD of 232.45 GOPS.
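The Trinity A10-6800K arithmetic above can be checked with a small Python sketch. It is illustrative only: `cd_fixed` and its parameter names are hypothetical, the leading multipliers (4 for the CPU, 6 for the GPU) and the unit counts are copied from the chapter's equations without further interpretation, and the CPU DPFP unit count of 2 is taken from the chapter's DPFP CPU calculation (70.4 GOPS).

```python
def cd_fixed(freq_hz, multiplier, units, bus_width_bits, op_bits):
    """Memory-sustainable CD in operations per second.

    Each unit is credited with bus_width_bits // op_bits operations per
    cycle, mirroring the 128/32 (or 128/64) term inside the summation.
    """
    ops_per_unit = bus_width_bits // op_bits
    return freq_hz * multiplier * units * ops_per_unit


# Int32: factors copied from the two Int32 equations.
cpu_int32 = cd_fixed(4.4e9, multiplier=4, units=4, bus_width_bits=128, op_bits=32)
gpu_int32 = cd_fixed(0.844e9, multiplier=6, units=16, bus_width_bits=128, op_bits=32)

# DPFP: the CPU side drops to 2 units; the GPU side only changes operand size.
cpu_dpfp = cd_fixed(4.4e9, multiplier=4, units=2, bus_width_bits=128, op_bits=64)
gpu_dpfp = cd_fixed(0.844e9, multiplier=6, units=16, bus_width_bits=128, op_bits=64)

for label, cpu, gpu in [("Int32", cpu_int32, gpu_int32),
                        ("DPFP", cpu_dpfp, gpu_dpfp)]:
    print(f"{label}: CPU {cpu / 1e9:.2f} + GPU {gpu / 1e9:.2f} "
          f"= {(cpu + gpu) / 1e9:.2f} GOPS")
```

Keeping the CPU and GPU halves as separate calls makes it easy to see which half dominates for a given data type, which is the comparison the text draws for this hybrid device.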


$$CD_{DPFP\ CPU} = 4.4\ \text{GHz} \times 4 \times \sum_{i=1}^{2} \frac{128}{64} = 70.4\ \text{GOPS} \tag{5}$$

For reconfigurable-logic devices such as FPGAs, Int32 CD was determined using achievable frequency and the number of parallel operations of fully utilized DSP resources and logic fabric. A single integer core for both addition and multiplication was instantiated on an FPGA using vendor IP cores. For each core, the resource utilization along with the maximum achievable frequency was determined from vendor tools. This information allowed the number of simultaneous cores that could be instantiated on a device to be determined by using all available DSP and logic resources while reserving 15% overhead for steering logic and I/O interfacing. Again, for our analysis, only addition and multiplication operations were considered and balanced, but other hardware-supported operations and mixes of operations could easily be studied. The number of parallel operations was multiplied by the maximum achievable frequency, which was limited to the lower of the multiplication and addition frequencies. Based on the amount of available on-chip memory resources, the number of available parallel operations in the CD calculation was limited. This limitation was enforced in order to account for memory bandwidth of on-chip RAM resources for data buffering, which could have a limiting effect on the peak CD. The on-chip memory must allocate two operands per operation for memory-sustainable CD. This provision ensured that the number of parallel operations a device could support was limited by the capability of the internal memory structure to provide data for each parallel operation.

To illustrate the process in which CD is calculated for a reconfigurable-logic device, a Virtex-6 SX475T FPGA was analyzed for Int32 performance. When calculating the Int32 CD for the SX475T, the Int32 IP cores of adders and multipliers were first generated with tools supplied by Xilinx. One of each design was synthesized and simulated on the FPGA. Using the utilization report, it was determined that the fabric could support 1,936 operations in parallel, of which 50% were multiplications and 50%
additions. From [Xilinx, Inc. 2008b] we determined that the block RAMs of the fabric could only supply 1,064 pairs of operands each clock cycle. Since this amount was less than the maximum number of parallel operations, 1,064 was the maximum amount of memory-sustainable operations that could be computed in parallel. Using the timing report generated from the design of multipliers and adders, the operating frequency was determined to be 296 MHz. Since each operation takes one cycle, and we were computing 1,064 operations in parallel, Eq. 5 shows the memory-sustainable Int32 CD for the SX475T. Similarly for DPFP, the number of sustainable operations was reduced to 386 and the frequency reduced to 214 MHz, as shown in Eq. 5.

$$CD_{Int32} = \frac{296\ \text{MHz} \times 1064\ \text{ops}}{1} = 314.94\ \text{GOPS} \tag{5}$$

$$CD_{DPFP} = \frac{214\ \text{MHz} \times 386\ \text{ops}}{1} = 82.6\ \text{GOPS} \tag{5}$$

As processing devices increase in computational capacity and power consumption, device efficiency has become a major concern, and our methodology introduced a power-efficiency metric to quantify this important information in the form of CD per Watt (CD/W). This metric is calculated by taking the CD and dividing by the power consumption at the level of device utilization used to compute the CD. For fixed-logic processors, the maximum thermal design power (TDP) is used for comparability. For reconfigurable-logic processors, vendor tools are used to estimate power usage of the device, and dynamic power is assumed to scale linearly with resource utilization if more detailed information is unavailable. For the Trinity A10-6800K example, the Int32 CD (605.7 GOPS) is divided by the max TDP (100 Watts) yielding an Int32 CD/W of 6.06 GOPS/W. Likewise for the SX475T, the Int32 CD (314.94 GOPS) is divided by the estimated power (20.54 Watts) to give an Int32 CD/W of 15.33 GOPS/W. For more details on CD and CD/W and their applications, the reader may refer to [Williams
et al. 2011, 2008a, b]. These metrics have been applied to a broad range of modern processors with new results and analysis presented in the following subsections.

5.2.2 CD Results and Analysis

During this study, we analyzed the CD and CD/W values of 130 newly studied devices, feature size less than or equal to 90 nm, spanning fixed and reconfigurable-logic architectures with their hybrids. First, this subsection will overview the most interesting CD results and analysis from the 81 fixed-logic, 32 reconfigurable-logic, and 17 hybrid devices. Similarly, Subsection 5.2.3 will highlight the CD/W results and analysis for the same set of devices.

Due to the wealth of data, we have filtered the fixed-logic results into a subset based on the top-scoring devices in either CD or CD/W. We grouped the fixed-logic devices into five major categories: CPUs, GPUs, DSPs, many-core (M-CORE), and fixed-logic hybrids (FL-HYBRID). For this study, we focused on Int16, Int32, single-precision floating point (SPFP), and DPFP data types for both addition and multiplication operations. Our operation mix was set at 50% adds and 50% multiplies. For a full list of all the fixed-logic devices studied, with their respective CD values, please see Table A-7 in the Appendix.

Fig. 5-1 shows the CD results for the fixed-logic processors in each category with the bars representing different operation types. Notable results from our analysis follow. The AMD Radeon R9 295X2 had the highest SPFP score (5,745 GOPS) by a significant margin. This result stemmed from the twin GPUs on this device and was expected, given the trend of increasing computational power of GPU devices for general-purpose computing. For the Int32 and DPFP data types, the NVIDIA GTX 690 GPU (3,133 GOPS) and NVIDIA GTX Titan (785 GOPS) surpassed the R9 295X2, highlighting the different emphases between these GPU architectures. The Opteron 6380 scored highly in the Int16 precision (3,584 GOPS), even surpassing the GPUs. GPU devices are built to specifically
Figure5-1. CDforselectxed-logicprocessors handleoating-pointinstructionsand,whiletheyarebeingenhancedtosupportotherprecisions,theirmainemphasisshowsintheirSPFPandDPFPscores.Inasimilarmannertothexed-logicdevices,welteredtherecongurable-logicresultsintoasubsetofthetop-scoringdevicesineitherCDorCD/W.Additionally,wegroupedtherecongurable-logicdevicesintotwomajorcategories,FPGAfamiliesandrecongurable-logichybrid(RL-HYBRID)devices,asshowninFig. 5-2 .TheVirtex-7VX980TbyXilinxscoreshighestinalldatatypesfortheFPGAdevicesduetothelargenumberofresourcesusedforcomputationallogic.Overall,therecongurablenatureofthefabricinFPGAsenablesthemtotakebetteradvantageoftheirresourceswithsmallerdatatypes.Incontrast,theSPFPresultsforFPGAsaresignicantlylowerthantheGPUsbecauseSPFPfunctionalunitsaremuchlargerandrequiresignicantlymoreresourcesthanthesmallerxed-pointfunctionalunits.ForacompletelistoftheFPGAsstudiedandtheirCDvalues,seeTable A-7 intheAppendix. 74
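The memory-sustainable CD arithmetic worked earlier in this section for the Virtex-6 SX475T can be sketched in a few lines. This is only an illustrative sketch, not the tooling used in the study: the function name `computational_density_gops` and its parameters are hypothetical, and the raw Int32 operation count (1,937) is taken from Table A-2 on the assumption that it corresponds to the "maximum number of parallel operations" mentioned in the text.

```python
def computational_density_gops(parallel_ops, freq_mhz, cycles_per_op=1):
    """Return CD in GOPS: operations completed per cycle times the clock rate.

    Hypothetical helper for illustration; MHz * ops gives MOPS, so divide by 1000.
    """
    return parallel_ops / cycles_per_op * freq_mhz / 1000.0


# Virtex-6 SX475T, Int32 (figures quoted in the text and in Table A-2).
raw_ops = 1937         # operations the logic fabric could host (assumed from Table A-2)
memory_fed_ops = 1064  # operand pairs the block RAMs can supply each cycle

# The sustainable operation count is capped by whichever resource runs out first.
sustainable_ops = min(raw_ops, memory_fed_ops)

cd_int32 = computational_density_gops(sustainable_ops, freq_mhz=296)
cd_dpfp = computational_density_gops(386, freq_mhz=214)

print(f"Int32 CD: {cd_int32:.2f} GOPS")
print(f"DPFP CD:  {cd_dpfp:.1f} GOPS")
```

Taking the minimum of the fabric-limited and memory-limited operation counts is what makes the figure "memory-sustainable" rather than a raw peak.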


Figure 5-2. CD for select reconfigurable-logic devices

5.2.3 CD/W Results and Analysis

The CD results, presented in the previous section, characterize the computational abilities of the respective devices without consideration of the power consumed. This section details how different devices excel when power is considered. The power-aware CD/W scores for fixed-logic and reconfigurable-logic processors are presented in Figs. 5-3 and 5-4, respectively. The fixed-logic CD/W data clearly shows the power efficiency of the new generation of heterogeneous hybrid devices. The TI OMAP4430 hybrid SoC has the highest Int16 CD/W ratio (90.43 GOPS/W, 0.6 W), with its successor, the OMAP5430, having the best SPFP (70.92 GOPS/W, 1.37 W) and DPFP (23.39 GOPS/W, 1.37 W). Even though these devices do not have large computational resources, they achieve high CD/W due to low power consumption. Another hybrid device, the NVIDIA Tegra K1, scores the highest in Int32 (62.24 GOPS/W, 5 W). Additionally, the data shows significantly lower CD/W numbers for traditional fixed-logic processors as compared to FPGA processors and hybrid devices. This trend can be directly attributed to the large number of computational resources on many FPGAs and the higher power consumption of traditional fixed-logic devices.

Figure 5-3. CD/W for select fixed-logic devices

Among the reconfigurable-logic devices, the Stratix IV EP4SE680 has the highest Int16 CD/W (135.18 GOPS/W), Int32 CD/W (57.74 GOPS/W), and DPFP CD/W (13.16 GOPS/W) due to its large logic fabric and high amount of on-chip memory while maintaining low power consumption (6.74 W). Additionally, the Virtex-7 585T has the highest SPFP CD/W (30.50 GOPS/W, 16.25 W) due to the number of floating-point operations that can be packed on the device within its power envelope. The Achronix Speedster22i HD has the smallest process technology (22 nm) of the FPGA devices, which improves its CD/W score (51.55 GOPS/W, 14.48 W), but counterintuitively, the greater available resources on the other FPGA devices outmatch the process-technology gains and yield higher performance per Watt.
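Every CD/W value quoted in this subsection comes from the same division of CD by power described in the methodology. The following sketch reproduces the two worked examples from that description (the Trinity A10-6800K and the Virtex-6 SX475T). The helper names are hypothetical, and the linear power model only illustrates the stated assumption that dynamic power scales linearly with resource utilization; its static and dynamic wattages are made-up placeholders, not measured values.

```python
def cd_per_watt(cd_gops, watts):
    """Power-efficiency metric: CD divided by the power drawn at that utilization."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return cd_gops / watts


def estimated_fpga_power(static_w, dynamic_full_w, utilization):
    """Linear model: static power plus dynamic power scaled by utilization (0..1)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return static_w + dynamic_full_w * utilization


# Fixed-logic example from the text: Int32 CD over maximum TDP.
trinity_cdw = cd_per_watt(605.7, 100.0)

# Reconfigurable-logic example from the text: Int32 CD over tool-estimated power.
sx475t_cdw = cd_per_watt(314.94, 20.54)

# Placeholder numbers only, to show how the linear scaling would be applied.
example_power = estimated_fpga_power(static_w=5.0, dynamic_full_w=20.0, utilization=0.5)

print(f"Trinity A10-6800K: {trinity_cdw:.2f} GOPS/W")
print(f"Virtex-6 SX475T:   {sx475t_cdw:.2f} GOPS/W")
print(f"Example FPGA power at 50% utilization: {example_power:.1f} W")
```

Using TDP for fixed-logic parts and an estimate at the utilized level for FPGAs means the two device classes are not measured identically, which is why the text treats CD/W as a first-order comparison.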


Figure 5-4. CD/W for select reconfigurable-logic processors

For SPFP, the CD/W results show that many FPGA processors perform on par with GPU devices. The higher power consumption of GPU devices offsets the benefits of their superior SPFP performance. For high levels of parallelism, the Radeon R9 295X2 has the highest CD/W in SPFP (11.5 GOPS/W, 500 W) of the high-powered GPU devices studied. Although high-end FPGAs in the Stratix IV and V, as well as the Virtex-6 and 7, families have a much larger number of parallel operations than the Radeon R9 295X2, their achievable frequency is low compared to the GPU's operating frequency. The OMAP5430 SoC uses very little power, which allows it to perform well in CD/W, despite having a much lower CD than other fixed-logic devices studied. For a complete list of devices with their respective CD/W values, see Table A-8 in the Appendix.

The computational device metrics provide first-order insight into the performance capabilities of devices. CD and CD/W showed that the computational ability of high-performance GPUs comes at a steep power cost, and that the flexibility of FPGA devices shows in small-precision performance. Hybrid devices, especially CPU/GPU hybrids, show a significant performance-per-Watt advantage over traditional fixed-logic processors and are competing with FPGAs in power-efficient computing.

5.3 Device Memory Metrics

To provide a thorough characterization of a computational device, we introduce two new memory metrics to quantify the ability of a processor to move information into and out of the processing cores. Building upon the related research and device metrics presented in [Williams et al. 2011], Subsections 5.3.1 and 5.3.2 introduce the external memory bandwidth (EMB) and the input/output bandwidth (IOB) metrics. Subsections 5.3.3 and 5.3.4 highlight the results of our memory metrics analysis.

5.3.1 External Memory Bandwidth

External memory bandwidth describes the total bandwidth between a device and attached external memory. This metric only includes the bandwidth of usable data, excluding bits for error-correction coding. In addition, EMB does not include I/O or network-controller bandwidth, as these typically come at the cost of a user-defined interfacing implementation for an application. Although a device could access another device's memory through an I/O port, this is not considered in the calculation. The following paragraphs detail the EMB calculation for both fixed-logic and reconfigurable-logic devices.

For fixed-logic processors with built-in memory controllers, EMB is the sum of the concurrent bandwidth provided by all memory controllers. For devices that use a front-side bus, the entire bus is allocated for memory bandwidth.

For reconfigurable-logic devices, the methodology employed is similar to determining CD for FPGAs. Instead of implementing computational cores, memory controllers are implemented in the logic fabric. In contrast to the CD case, limiting factors for memory controllers include the number of LUTs, ALMs/Slices, and the number of bonded I/O blocks (bonded IOBs, not to be confused with the IOB metric).
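The reconfigurable-logic EMB procedure can be sketched as follows, using the Virtex-6 SX475T figures from the step-by-step calculation in this subsection. The function name and parameters are hypothetical. The factor of two transfers per clock is an assumption inferred from the quoted per-controller rate (533 MHz on a 64-bit interface yielding 8.528 GB/s), which is consistent with a double-data-rate interface.

```python
def fpga_emb_gb_per_s(bonded_iobs, iobs_per_controller, interface_mhz,
                      width_bits, transfers_per_clock=2):
    """Return (controller count, per-controller GB/s, total EMB in GB/s).

    Assumes bonded I/O blocks are the limiting resource, as in the worked example.
    """
    # Only whole controllers can be instantiated, so round down.
    controllers = bonded_iobs // iobs_per_controller
    bytes_per_transfer = width_bits / 8
    per_controller = (interface_mhz * 1e6 * transfers_per_clock
                      * bytes_per_transfer) / 1e9
    return controllers, per_controller, controllers * per_controller


controllers, per_controller, emb = fpga_emb_gb_per_s(
    bonded_iobs=840,          # bonded I/O blocks available on the device
    iobs_per_controller=121,  # I/O blocks used by one DDR2 controller
    interface_mhz=533,
    width_bits=64,
)

print(f"{controllers} controllers x {per_controller:.3f} GB/s = {emb:.3f} GB/s")
```

For a fixed-logic device the same idea reduces to summing the concurrent bandwidth of the built-in controllers, with no resource-fitting step.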


To give a better understanding of how to calculate EMB, a step-by-step calculation for the Virtex-6 SX475T follows. A single DDR2 (Double Data Rate) memory-controller IP core is instantiated on the chip, and the resource utilization is obtained from the post place-and-route report. In this example, we used the DDR2 core because it was the highest-performance memory core from the vendor compatible with the device. From the place-and-route information, the most limiting resource was the number of bonded IOBs. The maximum number of DDR2 controllers that can be instantiated simultaneously is calculated by dividing the available number of bonded IOBs (840) by the number of IOBs used by a single memory controller (121) and rounding down, which gives six. The memory-interface frequency (533 MHz) is multiplied by the memory-interface width (64 bits), using the appropriate units and counting two transfers per clock for the double-data-rate interface, to get the EMB of one DDR2 controller (8.528 GB/s). This rate is in turn multiplied by six, the maximum number of DDR2 controllers instantiated, to calculate a maximum EMB of 51.168 GB/s.

5.3.2 Input/Output Bandwidth

Input/output bandwidth describes the total I/O capabilities of a device. Devices with dedicated ports for interfacing with memory often have additional ports for data input/output, which are not considered in the EMB calculation. Devices may also have higher bandwidth capabilities on a port that shares all or some pins with ones used for a memory interface, as in reconfigurable-logic devices. In this case, IOB would include the combination producing the most bandwidth for the device.

There are numerous ways to characterize the I/O of a device. In single-ended I/O, one signal is made between two integrated circuits (ICs) and compared to a specified voltage range or to a reference voltage. In differential signaling, two signals are made between two ICs, and the signals are compared to each other to determine the logic value [Athavale and Christensen 2005]. These two signaling methods can have differing bandwidths, even when comparing two single-ended signals to an individual differential signaling pair. When studying IOB, it is important to keep similar parameters equal when direct comparisons are desired.

IOB is calculated as the aggregate sum of the bandwidth provided by all inputs and outputs that can operate concurrently. The highest-bandwidth ports are used when there is overlap or non-concurrency. Line encoders can be used to encode data into a different format, which benefits transmission for reasons other than data throughput. Various schemes, such as 8b/10b or 64b/66b, can be employed that have varying overheads on the line rate. If an encoding scheme such as 8b/10b is used, then the IOB represents the fraction of bandwidth that is available for real data. The aggregate sum is then added to the input/output bandwidth of any dedicated external memory controllers available on the device.

For example, consider the NVIDIA Tesla C1060 GPU. There are two interfaces for data I/O on this processor, the memory interface and the PCIe bus. To compute IOB, the aggregate is taken of both interfaces, and the calculations are shown in Eqs. 5. It has dedicated GDDR3 memory clocked at 800 MHz on a 512-bit interface. The PCIe interface has a 500 MB/s transfer rate for each of its 16 lanes in each of two directions.

\[ IOB_{mem} = 800\ \text{MHz} \times (512\ \text{bits} / 8) \times 2 = 102.4\ \text{GB/s} \tag{5} \]

\[ IOB_{PCIe} = 500\ \text{MB/s} \times 16\ \text{lanes} \times 2 = 16\ \text{GB/s} \tag{5} \]

\[ IOB = IOB_{mem} + IOB_{PCIe} = 118.4\ \text{GB/s} \tag{5} \]

5.3.3 EMB Results

This section presents highlights from the EMB analysis for the same suite of 130 fixed-logic and reconfigurable-logic devices. Fig. 5-5 shows EMB, in GB/s, for the fixed-logic processors highlighted in the previous section. As expected, GPU devices perform best in all categories studied. GPUs are designed to handle memory-intensive applications that require large sets of streaming data. This design has fast and wide memory buses that result in high EMB. The Radeon R9 295X2 has the highest EMB (640 GB/s) of the devices studied. CPU devices typically handle smaller applications using smaller sets of data; hence, they have a lower EMB. The Intel Xeon Phi 7120P has the highest EMB (352 GB/s) of the non-GPU, fixed-logic devices due to the high-speed memory interface supporting the Xeon cores on the device.

Figure 5-5. EMB for select fixed-logic processors

Fig. 5-6 shows EMB for highlighted reconfigurable-logic devices and their hybrids. The Virtex-7 VX980T has the highest EMB (89.6 GB/s) due to the greater number of communication resources available as external memory controllers. These resources allow the simultaneous instantiation of more DDR2 controllers, resulting in a higher EMB score. Most of the devices require almost no logic utilization after instantiating their memory controllers, although the smallest FPGAs can use a significant number of logic resources in supporting the memory controllers.
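The IOB aggregation defined in Subsection 5.3.2 can be sketched using the Tesla C1060 example from that subsection. The helper names below are hypothetical, and the sketch simply adds the memory-interface and PCIe contributions as the text describes. The `payload_fraction` helper illustrates how a line code reduces usable bandwidth; whether the quoted 500 MB/s per-lane figure already accounts for encoding is not stated in the text, so it is used here as given.

```python
def memory_interface_gb_per_s(clock_mhz, width_bits, transfers_per_clock=2):
    """Bandwidth of a parallel memory interface in GB/s."""
    return clock_mhz * 1e6 * (width_bits / 8) * transfers_per_clock / 1e9


def serial_link_gb_per_s(per_lane_mb_per_s, lanes, directions=2):
    """Aggregate bandwidth of a multi-lane link, counting both directions."""
    return per_lane_mb_per_s * lanes * directions / 1000.0


def payload_fraction(data_bits, coded_bits):
    """Fraction of the line rate left for real data under an NbMb line code."""
    return data_bits / coded_bits


iob_mem = memory_interface_gb_per_s(clock_mhz=800, width_bits=512)
iob_pcie = serial_link_gb_per_s(per_lane_mb_per_s=500, lanes=16)
iob_total = iob_mem + iob_pcie

print(f"IOB_mem  = {iob_mem:.1f} GB/s")
print(f"IOB_PCIe = {iob_pcie:.1f} GB/s")
print(f"IOB      = {iob_total:.1f} GB/s")
print(f"8b/10b payload fraction:  {payload_fraction(8, 10):.3f}")
print(f"64b/66b payload fraction: {payload_fraction(64, 66):.3f}")
```

When ports share pins and cannot run concurrently, only the highest-bandwidth combination would be passed into the sum, which this sketch leaves to the caller.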


Figure 5-6. EMB for select reconfigurable-logic devices

5.3.4 IOB Results

The highlights from the IOB analysis present some of the differences between architectures dominated by EMB (e.g., GPUs) and more I/O-diverse processing devices (e.g., FPGAs). Fig. 5-7 shows IOB for fixed-logic devices, with the lighter portion representing the contribution from EMB. The R9 295X2 has the highest IOB (640 GB/s) of the devices studied due to its high EMB score. GPUs are optimized for 3D rendering, which requires processing on large working sets of data. Compared with microprocessors, the amount of data that needs to be processed is too large to fit in the cache of a CPU. The working sets of applications that typically run on CPUs have random memory-access patterns and are smaller than those that run on a GPU, requiring frequent fetches from off-chip memory. CPU memory interfaces have shifted from buses to a very fast group of serial data lines communicating via packets with much lower latency, such as AMD's HyperTransport or Intel's QuickPath Interconnect (QPI). This trend means that, as CPU core counts and IOB increase, streaming applications can be more effectively parallelized on them.

Figure 5-7. IOB data for select fixed-logic devices

For reconfigurable-logic devices, shown in Fig. 5-8, the IOB tends to be made up of different I/O connections. The Virtex-7 VX980T has the highest IOB (417.43 GB/s) of the FPGA devices due to its large EMB and the high availability of I/O resources. The Speedster22i HD uses hard macro blocks for I/O, which reduce the logic available for computation but help the device score well in IOB (267.41 GB/s) versus other FPGA devices. Smaller FPGAs struggle in IOB due to the low number of pins available for I/O operations and a lower number of communication transceivers.

The memory metrics provide first-order insight into the data-movement capabilities of a device. EMB and IOB showed us that the high memory-metric scores of new GPU devices are mostly due to their high-speed memory-controller interfaces. In contrast, reconfigurable-logic devices have a greater focus on providing communication flexibility in more ways than simple memory-controller interfaces, hence their comparatively better IOB scores. These insights can help application and hardware designers tailor their device choices based on the types of data being manipulated in their desired applications.

Figure 5-8. IOB data for select reconfigurable-logic devices

5.4 Realizable Utilization (RU) Metric

Metrics provide a means to facilitate objective device comparison, trade-off analysis, and first-order performance prediction. This approach provides an upper bound on a device's capability. The fraction of a device's capability that can be utilized in a particular application cannot be determined without application performance data. To address this issue, we introduce the concept of realizable utilization (RU), which quantifies the fraction of CD that an application achieves within a specific implementation. The RU metric provides insight on device-to-application mapping in terms of achieved performance, tempering the computational metrics with realistic expectations. In this section, computational density is contrasted with performance data from both literature and benchmarking results. Additionally, to demonstrate how RU can showcase the trade-offs in performance versus portability, code with intrinsic functions is compared with optimized libraries and benchmarks coded for portability.

5.4.1 RU Methodology

There are many factors that can reduce the performance of a device, including application characteristics, tools, and user experience. Realizable utilization is a method to quantify the difference between a device's theoretical performance and the actual performance a user can expect to achieve. Since benchmarking every device with every application is impractical, RU allows developers to estimate their application's projected performance on a particular device. A device performance metric, such as CD, scaled using RU data from published technical literature and benchmarking, can be used to estimate the realized performance expected in a specific application.

In Fig. 5-9, illustrating the RU methodology, the theoretical computational capacity, represented by CD, is reduced by various factors such as developer experience, tools used, and application characteristics. We observed this throttling trend through data found in technical publications and benchmarking studies. RU starts with CD representing the theoretical computational capacity of a device. Performance data is then collected from either scholarly publications or benchmarking experience. The application throughput from technical data and benchmarks shows the performance achieved for a specific platform, application, and implementation. This performance data is used to compare observed throughput with CD, yielding the RU score. The RU score becomes more applicable as the amount of literature increases; for new technologies this data may be sparse. Benchmarking requires more development time and effort but, since it uses more hardware resources and is closer to the desired application, it provides more accurate data. Using data acquired from benchmarking provides more revealing RU scores, since the developer frequently tunes the application to the hardware or, conversely, tunes the hardware to the application, as in FPGAs.

Figure 5-9. Concept diagram of realizable utilization

5.4.2 Calculating RU

Once CD is determined, the RU metric is calculated by dividing the observed throughput (T) in OPS by the CD of the device used in the application, as seen in Eq. 5. The device's CD is multiplied by a scaling factor representing the fraction of the device used by the application. This factor, α, depends on the implementation of the application. For example, if an application only uses one core of a quad-core CPU, then the factor is 0.25. The factor is necessary because some applications have not been parallelized and, without adjusting the available CD, the comparison between applications would not be as insightful.
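The calculation described above can be sketched as a pair of small functions. This is an illustrative sketch rather than the tooling used in the study: the function names are hypothetical, `fraction_used` stands for the scaling factor that represents the portion of the device used by the application, and the CPU figures in the example are made-up placeholders chosen only to exercise the quad-core case mentioned in the text.

```python
def realizable_utilization(throughput_gops, cd_gops, fraction_used=1.0):
    """RU: observed throughput divided by the CD available to the application."""
    if not 0.0 < fraction_used <= 1.0:
        raise ValueError("fraction_used must be in (0, 1]")
    if cd_gops <= 0:
        raise ValueError("cd_gops must be positive")
    return throughput_gops / (fraction_used * cd_gops)


def projected_throughput_gops(ru, cd_gops, fraction_used=1.0):
    """Inverse use: scale a device's CD by an RU score from similar applications."""
    return ru * fraction_used * cd_gops


# Placeholder numbers: a quad-core CPU with a CD of 100 GOPS running a
# single-threaded kernel that achieves 15 GOPS.
ru = realizable_utilization(throughput_gops=15.0, cd_gops=100.0, fraction_used=0.25)
print(f"RU = {ru:.0%}")

# The same RU applied to all four cores gives a first-order projection.
print(f"Projected throughput: {projected_throughput_gops(ru, 100.0):.1f} GOPS")
```

Without the scaling factor, the single-threaded kernel above would score only 15%, which would understate how well the code uses the one core it actually occupies.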


\[ RU = \frac{T}{\alpha \times CD_{device}} \tag{5} \]

The developer's knowledge of their application and its implementation allows α to be calculated best during benchmarking. When the information found in publication sources does not provide enough data to reliably determine α, a ratio of 1 is assumed. This assumption is based on the hypothesis that most developers who are publishing their work and having it peer-reviewed will be trying to maximize the performance of their application. If the application is not using all of the main resources of the device, the developer generally includes enough information to calculate α. Since the CD value represents the theoretical maximum throughput, Eq. 5 shows that the RU metric is bounded below by zero and above by one. While RU is a ratio, it is expressed as a percentage. From this alternate perspective, RU is the percentage of a device's theoretical performance achieved by an application. This analysis provides insight for developers, not only before coding their application, but also during the development cycle.

5.4.3 Using RU

Once the RU metric has been calculated, it provides useful insight for various program types. One use of RU is in the device design process. Device architects can use RU when developing novel devices by comparing a new architecture to similarly structured existing devices. The RU score indicates what program areas are most likely to map well onto a future device, and that information can be incorporated into the device development process. For example, a novel many-core device could be compared with existing RU scores for GPUs, showing that similar device enhancements for data movement could help improve the performance of dense linear algebra.

Another use of RU, from a system designer's standpoint, is to assist in the selection of an appropriate acceleration platform before significant costs are expended on cutting-edge hardware. Applications with similar structure or kernels can be analyzed to see what platforms are making the most of the available resources. For example, in a system that is being built to run a significant number of FFTs, the RU scores can show that a DSP or FPGA option might be the most effective use of resources. This insight mitigates some of the risk of developing applications on new platforms and allows developers to narrow the field of application accelerators.

Finally, software developers can use RU to gain feedback while developing their applications. During the optimization cycle of development, it can be difficult to judge when the maximum performance from an optimization has been reached and how much more performance can be expected from further optimization. RU allows developers to compare the kernels or applications that are undergoing optimization, such as SIMD optimization with ARM NEON acceleration, to similar applications and kernels. This comparison can help developers judge how much more performance they can expect to achieve from their application, then decide if the additional performance is worth the time and cost.

5.4.4 Arithmetic Kernels for RU

The authors investigated the application of RU through a literature study of three arithmetic-heavy application kernels. The first kernel in this study, matrix multiplication (MM), includes both simple matrix-multiplication kernels, Eq. 5, and general matrix-multiplication (GEMM) kernels, Eq. 5.

\[ (AB)_{ij} = \sum_{k=1}^{m} A_{ik} B_{kj} \tag{5} \]

GEMM kernels are common subroutines used as part of the Basic Linear Algebra Subprograms (BLAS). For Eq. 5, the values α and β are scalar coefficients.

\[ C \leftarrow \alpha A B + \beta C \tag{5} \]
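The two matrix-multiplication forms above can be written directly from their definitions. The following is a deliberately naive, pure-Python reference sketch meant only to make the index notation concrete; it is not how an optimized BLAS library implements GEMM, and the function names `matmul` and `gemm` are hypothetical.

```python
def matmul(a, b):
    """Simple matrix product: (AB)[i][j] = sum over k of A[i][k] * B[k][j]."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("columns of A must match rows of B")
    cols = len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(a))]


def gemm(alpha, a, b, beta, c):
    """General matrix multiply: returns alpha * (A B) + beta * C as a new matrix."""
    ab = matmul(a, b)
    if len(ab) != len(c) or len(ab[0]) != len(c[0]):
        raise ValueError("C must have the same shape as A B")
    return [[alpha * ab[i][j] + beta * c[i][j] for j in range(len(c[0]))]
            for i in range(len(c))]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
identity = [[1, 0], [0, 1]]

product = matmul(a, b)
updated = gemm(2, a, b, 1, identity)

print(product)
print(updated)
```

Setting α to 1 and β to 0 reduces GEMM to the simple product, which is why the study can treat both forms as one kernel family.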


The second kernel type studied is matrix decomposition (MD), including Cholesky, LU, and QR decompositions. Cholesky decompositions, Eq. 5, break down a real, symmetric positive-definite matrix (A) into the product of an upper triangular matrix (R) with positive diagonal coefficients, the Cholesky factor, and its transpose (R^T) [Dongarra et al. 1979].

\[ A = R^{T} R \tag{5} \]

LU decompositions, Eq. 5, reduce a square matrix (A) into a lower triangular matrix (L) and an upper triangular matrix (U) [Lancaster and Tismenetsky 1985].

\[ A = L U \tag{5} \]

The final type of matrix decomposition, QR, is used for eigenvalue calculations in many applications. Eq. 5 shows how QR decomposition breaks the square matrix (A) into an orthogonal matrix (Q) and an upper triangular matrix (R) [Bhaskar 2006]. In contrast to the previous LU decomposition, the QR decomposition always exists.

\[ A = Q R \tag{5} \]

The third kernel area studied in this paper is n-body simulations. Two major types of n-body simulations are molecular dynamics simulations and astronomical or gravitational simulations. Molecular dynamics simulations compute the electrostatic forces and interactions between molecules within the simulation [Bailey 1995], and gravitational simulations compute the gravitational forces from a set of spatially related objects [Aarseth and Aarseth 2003].

5.4.5 Literature Study Results

To build upon the metrics results, the authors selected GPU-based devices for a more in-depth literature study. This literature study was conducted to highlight the use of RU without investing the time and money that benchmarking requires. Surveying both academic and vendor papers, we analyzed the most common GPU architectures by calculating the achieved throughput and using that information to calculate RU. These results are separated into three key kernel types: matrix multiplication (Table A-9 in the Appendix); matrix decomposition (Table A-10); and n-body simulations (Table A-11). In many of the publications, the Intel Core 2 architectures were highlighted as comparison points. Their RU scores, using the Intel Math Kernel Library (MKL), were included as a reference for general-purpose CPUs. Fig. 5-10 shows the results collected for matrix multiplication.

Figure 5-10. Realized utilization for matrix multiplication

For the matrix-multiplication kernels, the best RU scores for GPU devices are found in the GeForce 8 Series GPUs. Many of the various optimizations, such as CUDA BLAS, generated groupings of similar performance marks within an architectural family. These groupings highlighted key optimization levels by clustering similar optimization scores close to each other in each family column. Of the three kernels studied, matrix multiplication was the most common and, due to the well-known features of its operations, matrix multiplication provides significant RU results.

The peak matrix-multiplication RU scores highlight an obvious trend: RU decreases significantly as CD increases. This trend implies that, in most cases, raw performance is still increasing with more powerful chips, but the realized performance is not keeping pace with the theoretical capacities of the devices. This decreasing RU trend points to applications and/or tools as the limiting factor in achieved performance. In contrast to the GPUs, the CPUs using MKL achieve more of their computational capacity, hence a higher RU. The higher RU scores and lower CD scores of CPUs further emphasize the inverse relationship between RU and CD noted previously in the GPU scores.

Figure 5-11. Realizable utilization for matrix decomposition


The highest-scoring GPU device in the second kernel area, matrix decomposition, shown in Fig. 5-11, is the GeForce 8800 GTX in the 8-series family, with an RU score of 55.56%. This device has the same basic architecture as the Tesla C870 of the same family, the highest-scoring GPU in matrix multiplication, and is one of the most commonly found devices in this literature study. The matrix-decomposition RU scores further reveal the trend of decreasing RU scores with increasing device capacity and show similar performance patterns to matrix multiplication. Once again, we observe the inverse relationship: lower-CD CPUs can achieve a higher percentage of their theoretical performance but, as the CD increases, the realized portion decreases.

Figure 5-12. Realized utilization for n-body simulations


The final kernel area in this study is n-body simulations. Both molecular dynamics simulations and astrophysical simulations are included in the data set plotted in Fig. 5-12. In addition to the simulations found for the Core 2 platform, we also found results for the PowerPC processor and included its RU scores to show that the trend does not apply only to the Core 2 CPU series. The data for n-body simulations shows the widest range of scores (1% to 99%) and includes the highest scores overall. These high scores point to a better fit between the application's structure and the GPU device's hardware resources. Conversely, the CPUs did not score as well in this kernel area, indicating that the n-body kernels studied are not as efficient on the CPU architectures. While the devices with higher CD scored significantly better in RU here than they did in other application areas, they still fell behind the older GeForce 8 Series GPUs.

Fig. 5-13 shows all the RU scores plotted together. The data points reinforce the significant prevalence of the GeForce 8 Series. These older GPUs are more cost-efficient to obtain and employ the same device architecture. Combined, these three sets of results show that the higher-CD devices tend to have lower RU scores, especially in matrix-based kernels. While overall raw performance is increasing as the device CD grows, the downward trend in RU shows that applications and tools are not yet able to capitalize fully on the added computational resources. This trend could be caused by many different issues, including device tools, developer experience, application characteristics, architecture bottlenecks, and others. Determining which of these issues are most responsible for reducing RU is being considered for future work.

One of the major issues with the literature-based approach is that up-to-date results are hard to find, given the time needed for academic optimization and publication. One way to address this issue is to do actual benchmarking. The following section highlights some of our results in this area.


Figure 5-13. Combined realizable utilization results

5.4.6 RU Benchmarking Results

Expanding upon the previous literature study, benchmarking follows as the next step in using RU. For this case study, multiple devices were compared based on several computational kernels and library implementations. Using vendor-optimized libraries, hand-optimized kernels, and codes optimized for portability, we benchmarked a combination of nine kernels on eleven devices, for a total of twenty-nine individual data points. Kernels included several common types found in compute-heavy applications, such as Matrix Multiplication (MM), Fast Fourier Transform (FFT), and Singular Value Decomposition (SVD). Advanced libraries, ATLAS for example, were tested against hand-optimized algorithms and code optimized for portability. RU allows the quantification of the performance difference between implementation types with respect to device metrics like CD. Fig. 5-14 shows, in decreasing order of RU, the results of this study.

Figure 5-14. Realizable utilization benchmarking results

The first key insight from Fig. 5-14 is the clear performance gained by using mature libraries on high-performance processors. The highest RU score (91.4%) was found on a Tesla K20X using a matrix-multiplication algorithm based on the high-performance cuBLAS library. This library from NVIDIA provides key Basic Linear Algebra Subprograms (BLAS) support for their GPU architectures. Another insight is the excellent RU gain from using SIMD processing when the algorithm supports it. The time-domain finite impulse response (TDFIR) filter, coded with DSP intrinsic functions, achieved the second-highest RU score (59.7%) on the TI KeyStone-I DSP device.

In contrast to the high performance of tuned libraries, code that was optimized for portability between platforms sacrificed a significant amount of RU to achieve its general applicability. Limiting the code to generic coding techniques and instructions that could be easily carried between different devices illustrates a major trade-off between portability and performance. For example, this trade-off is shown by the two matrix-multiplication implementations on the TI DSP, which range from 50% RU when using a matrix-operation library to 0.2% when using generic instructions.

Between the high-performance libraries and the lower-performance codes optimized for portability is a middle ground, comprising an interesting set of application/device combinations. Counterintuitively, even the highly optimized libraries frequently score mid-range (e.g., 11.7%) on FFT algorithms. This result indicates that the structure of FFTs does not map as well to these devices as linear-algebra kernels do. Therefore, alternative device architectures may be of interest for FFT-heavy applications. Table A-12 in the Appendix shows the numerical RU scores for the devices included in this case study.

This section introduced the realizable utilization metric to quantify the fraction of the CD that applications achieve. RU from literature studies showed the strengths of mature architectures and the weaknesses of less-developed devices by showcasing the differences in achieved utilization. From our benchmarking results, RU showed significant utilization gains with intrinsic optimizations and highly optimized libraries in contrast to code optimized for portability. Together, these results highlight how the RU metric can be used to help developers predict application performance on various computational devices.


5.5 Discussion

In this paper, the authors have used computational-density metrics to survey a new generation of various device architectures. Also, the metrics methodology was expanded to include new memory-based metrics to characterize the data-movement abilities of processors. Further, a new realizable utilization metric was introduced to quantify theoretical versus achieved application performance.

First, this work evaluated the computational-density data of a new generation of devices, clearly showing the strength of new hybrid devices in terms of computation per Watt. The hybrid OMAP processors score the highest CD/W in Int32, SPFP, and DPFP precisions due to their small power envelope. Additionally, for Int16, the Stratix IV EP4SE680 FPGA device shows the highest CD/W because of its power-efficient reconfigurable resources. Another insight, observed in Section 5.2, was that the Radeon R9 295X2 GPU has a distinct advantage in SPFP calculations when power is not considered, due to the high clock rates of the shader units and the sheer core count.

Second, this paper enhanced our existing methodology [Williams et al. 2011] for device metrics by introducing new memory metrics, EMB and IOB, and used these metrics to discuss the off-chip memory bandwidth of devices. The EMB and IOB results, in Section 5.3, show that GPUs, the Radeon R9 295X2 for example, have the highest external bandwidth, with wide memory controllers working at very high frequencies. FPGAs emphasize data connectivity, with high IOB, but their lower operating frequency keeps them from matching GPU memory bandwidth. Newer CPUs, particularly the Xeon E5-2670, can achieve higher IOB compared to earlier CPUs due to the shift from front-side bus connections to higher-speed interfaces such as QPI.

Finally, realizable utilization, a new metric introduced in Section 5.4, is used to temper computational density with real application performance. From the GPU literature study, we noticed a significant downward RU trend with matrix-based applications as the architecture generations became newer. Raw GPU performance continued to increase, but the computational capacity outpaces the application performance. The benchmarking analysis showed that RU can be used to demonstrate and quantify the trade-off between highly optimized code with low portability and general code with higher portability. The analysis showed that the best RU scores occur with highly tuned libraries such as cuBLAS, or with optimized SIMD instructions and dense computational loads. When coded for portability, using generic structures, similar applications sacrifice a significant amount of performance for compatibility.

Overall, the metrics presented in this paper provide a first-order analysis of device characterization and insight into the strengths of a wide variety of device architectures. Future work is planned to allow for user-defined parameters when calculating certain metrics. These parameters include the addition of more operations and enhanced application-to-metric mapping using automated tools. This expansion will allow users to more accurately determine which of many devices would best fit their algorithm. Our ultimate goal is to allow users to define and customize the individual attributes used to calculate each metric to best suit their application.


CHAPTER 6
CONCLUSIONS

The landscape of modern computational processing devices is large and varied, and it includes a broad spectrum of different architectures, features, and target applications. This study identified a need and defined a methodology to compare a diverse set of computational device architectures in a complete and equivalent manner. Additionally, this study has shown interesting results by comparing achieved versus theoretical performance.

Phase 1 demonstrated device metrics to enable a comprehensive and impartial analysis of theoretical performance across a broad spectrum of device types. This phase focused on the characterization of the internal bottlenecks of devices using device metrics. CD, CD/W, and IMB were created, and their usefulness was demonstrated across a diverse set of processing devices that included both fixed and reconfigurable devices.

Phase 2 focused on adding external memory metrics that provide insight into a device's data-interface abilities. The EMB and IOB metrics were created to help characterize the memory-flow aspects of computational devices. This study also expanded the landscape of devices, which increases the impact of the metrics methodology.

Building on the previous two phases, Phase 3 introduced Realizable Utilization, a new metric to quantify the difference between the theoretical performance measured using the device metrics from the earlier phases and the achieved performance from an application benchmark. The new RU metric was used to quantify the achieved system performance from data found in technical publications and from benchmarking experience. The realizable utilization study showed the performance benefit of using advanced computing libraries and the performance sacrificed for portability.


In conclusion, this work provides a metric methodology to compare a diverse set of computing device architectures, provides examples of how it has been used, and highlights trends found within the landscape of processing devices. Theoretical computational capacities and power efficiency were described using the CD and CD/W metrics, with FPGA devices showing high performance per Watt. The IMB, EMB, and IOB metrics were used to demonstrate and characterize the memory capability of computational devices and how that capability limited the computational ability of devices. Finally, RU was presented to quantify the difference between the theoretical computational capacity represented by CD and the achieved performance from literature and benchmarking. In combination, this suite of metrics provides both system builders and device designers the functionality to compare and contrast the performance abilities of computational devices regardless of their individual architectures.


APPENDIX: ADDITIONAL DATA


Table A-1. EMB of non-FPGA devices

Device                    EMB (GB/s)
ADSP-TS203S                     0.50
AMD Athlon II X4 635           21.33
AMD Opteron 8439 SE            12.80
AMD Phenom II X6 1090T         21.33
Athlon 64 X2 6400+             12.80
Atom N270                       4.26
Blue Gene/P                    13.60
Cell                           25.60
Core 2 Duo T9900                8.53
CSX600                          3.20
ECA-64                          2.40
Freescale P2020                 6.40
Freescale P4080                25.60
GeForce GTX 285               158.98
GeForce GTX 480               177.41
IBM PowerXCell 8i              25.60
Intel Core i7 980X             25.58
Intel Itanium 9350             17.06
Intel Xeon X7560               34.11
MONARCH                         8.53
MPC7447                         1.00
MPC8640D                       17.07
NVIDIA Ion                     17.07
NVIDIA Tesla C1060            102.40
NVIDIA Tesla C870              76.80
Opteron 8360 SE                12.80
PACT XPP-3c                     9.60
TI OMAP-L137                    2.40
TILE64                         25.60
Xeon 7041                       5.30
Xeon W5580                     32.00
Xeon X3230                      8.53


Table A-2. CD of FPGA devices showing frequency and power variations

Int16 CD
Device                 Par. Ops Raw  Par. Ops Sustainable  Frequency (MHz)  GOPs Raw  GOPs Sustainable  Min Watts  Max Watts
Stratix II EP2S180     1079          1079                  410              442.39    442.39            3.26       24.60
Stratix III EP3SE260   2043          1944                  400              817.20    777.60            2.11       18.18
Stratix III EP3SL340   2327          2296                  400              930.80    918.40            2.83       23.27
Virtex-4 LX100         640           240                   344              220.16    82.56             1.34       14.45
Virtex-4 LX200         1038          336                   344              357.07    115.58            1.27       15.82
Virtex-4 SX55          1061          320                   344              364.98    110.08            1.25       16.43
Virtex-5 LX330T        1346          648                   463              623.20    300.02            3.43       22.73
Virtex-5 SX95T         1336          488                   463              618.57    225.94            2.25       16.68
Virtex-6 LX760         6062          1440                  568              3443.22   817.92            6.06       51.77
Virtex-6 SX475T        6186          2128                  568              3513.65   1208.70           6.01       62.35

Int32 CD
Device                 Par. Ops Raw  Par. Ops Sustainable  Frequency (MHz)  GOPs Raw  GOPs Sustainable  Min Watts  Max Watts
Stratix II EP2S180     292           292                   420              122.64    122.64            3.26       25.20
Stratix III EP3SE260   737           737                   273              201.20    201.20            2.11       12.41
Stratix III EP3SL340   781           781                   273              213.21    213.21            2.83       15.88
Virtex-4 LX100         180           120                   249              44.82     29.88             1.34       10.46
Virtex-4 LX200         279           168                   249              69.47     41.83             1.27       11.45
Virtex-4 SX55          301           160                   249              74.95     39.84             1.25       11.89
Virtex-5 LX330T        359           324                   378              135.70    122.47            3.43       18.56
Virtex-5 SX95T         355           244                   378              134.19    92.23             2.25       13.61
Virtex-6 LX760         1774          720                   296              525.10    213.12            6.06       26.98
Virtex-6 SX475T        1937          1064                  296              573.35    314.94            6.01       32.49

Table A-3. IMB for BBS over a range of achievable frequencies (GB/s)

Achievable Freq. (MHz)  50        100       150       200       250       300       350       400       450       500       550        600
Cell LS                 409.600   409.600   409.600   409.600   409.600   409.600   409.600   409.600   409.600   409.600   409.600    409.600
CSX600                  96.000    96.000    96.000    96.000    96.000    96.000    96.000    96.000    96.000    96.000    96.000     96.000
EP2S180                 507.600   1015.200  1522.800  2030.400  2538.000  3045.600  3553.200  4060.800  4559.760  5052.960  5360.160   5360.160
EP3SE260                388.800   777.600   1166.400  1555.200  1944.000  2332.800  2721.600  3110.400  3499.200  3888.000  4276.800   4527.360
EP3SL340                459.200   918.400   1377.600  1836.800  2296.000  2755.200  3214.400  3673.600  4132.800  4592.000  5051.200   5344.000
EP4SE530                569.600   1139.200  1708.800  2278.400  2848.000  3417.600  3987.200  4556.800  5126.400  5696.000  6265.600   6835.200
EP4SE680                669.200   1338.400  2007.600  2676.800  3346.000  4015.200  4684.400  5353.600  6022.800  6692.000  7361.200   8030.400
FPOA                    694.000   694.000   694.000   694.000   694.000   694.000   694.000   694.000   694.000   694.000   694.000    694.000
MONARCH                 330.336   330.336   330.336   330.336   330.336   330.336   330.336   330.336   330.336   330.336   330.336    330.336
Tesla C1060             4992.000  4992.000  4992.000  4992.000  4992.000  4992.000  4992.000  4992.000  4992.000  4992.000  4992.000   4992.000
Tesla C870              1382.400  1382.400  1382.400  1382.400  1382.400  1382.400  1382.400  1382.400  1382.400  1382.400  1382.400   1382.400
V4LX200                 134.400   268.800   403.200   537.600   672.000   806.400   940.800   1075.200  1209.600  1344.000  1344.000   1344.000
V4SX55                  128.000   256.000   384.000   512.000   640.000   768.000   896.000   1024.000  1152.000  1280.000  1280.000   1280.000
V5LX330T                583.200   1166.400  1749.600  2332.800  2916.000  3499.200  4082.400  4665.600  5248.800  5832.000  6415.200   6415.200
V5SX95T                 439.200   878.400   1317.600  1756.800  2196.000  2635.200  3074.000  3513.600  3952.800  4392.000  4831.200   4831.200
V6LX760                 648.000   1296.000  1944.000  2592.000  3240.000  3888.000  4536.000  5184.000  5832.000  6480.000  7128.000   7776.000
V6SX475T                957.600   1915.200  2872.800  3830.400  4788.000  5745.600  6703.200  7660.800  8618.400  9576.000  10533.600  11491.200
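
The GOPS columns of Table A-2 are consistent with computational density being the number of parallel operations per cycle multiplied by the clock frequency. The short Python sketch below checks that relationship against two Int16 rows of the table. The function name cd_gops is hypothetical, and the formula is inferred from the tabulated values rather than quoted from the text, so treat it as an illustration only.

```python
def cd_gops(parallel_ops: int, frequency_mhz: float) -> float:
    """Computational density in GOPS: parallel ops per cycle times clock rate.

    Assumption: CD = parallel_ops * frequency, inferred from Table A-2.
    """
    return parallel_ops * frequency_mhz / 1000.0


# Int16 rows copied from Table A-2: (device, raw ops, sustainable ops, MHz)
rows = [
    ("Stratix II EP2S180", 1079, 1079, 410),
    ("Virtex-4 LX100", 640, 240, 344),
]

for device, raw_ops, sustainable_ops, mhz in rows:
    raw = cd_gops(raw_ops, mhz)
    sustainable = cd_gops(sustainable_ops, mhz)
    print(f"{device}: raw {raw:.2f} GOPS, sustainable {sustainable:.2f} GOPS")
```

The gap between the raw and sustainable values for the Virtex-4 LX100 comes entirely from the lower sustainable operation count, since both use the same frequency.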


Table A-4. IMB for CBS for various hit rates (GB/s)

Hit Rate                 10%       20%       30%       40%       50%       60%       70%       80%       90%       100%
Atom 330 L1 D+I          15.36     30.72     46.08     61.44     76.80     92.16     107.52    122.88    138.24    153.60
Atom 330 L2              10.24     20.48     30.72     40.96     51.20     61.44     71.68     81.92     92.16     102.40
BlueGene/P L1 D+I        10.88     21.76     32.64     43.52     54.40     65.28     76.16     87.04     97.92     108.80
BlueGene/P L2            16.32     32.64     48.96     65.28     81.60     97.92     114.24    130.56    146.88    163.20
Core 2 Duo T9900 L1 D+I  29.43     58.87     88.30     117.73    147.17    176.60    206.04    235.47    264.90    294.34
Core 2 Duo T9900 L2      9.81      19.62     29.43     39.24     49.06     58.87     68.68     78.49     88.30     98.11
Opteron 8360SE L1 D+I    7372.80   14745.60  22118.40  29491.20  36864.00  44236.80  51609.60  58982.40  66355.20  73728.00
Opteron 8360SE L2        3686.40   7372.80   11059.20  14745.60  18432.00  22118.40  25804.80  29491.20  33177.60  36864.00
P2020 L1 D+I             11.52     23.04     34.56     46.08     57.60     69.12     80.64     92.16     103.68    115.20
P2020 L2                 7.68      15.36     23.04     30.72     38.40     46.08     53.76     61.44     69.12     76.80
P4080 L1 D+I             57.60     115.20    172.80    230.40    288.00    345.60    403.20    460.80    518.40    576.00
P4080 L2                 38.40     76.80     115.20    153.60    192.00    230.40    268.80    307.20    345.60    384.00
TI OMAP-L137 L1 D+I      442.37    884.74    1327.10   1769.47   2211.84   2654.21   3096.58   3538.94   3981.31   4423.68
TI OMAP-L137 L2          110.59    221.18    331.78    442.37    552.96    663.55    774.14    884.74    995.33    1105.92
Xeon W5580 L1 D+I        61.44     122.88    184.32    245.76    307.20    368.64    430.08    491.52    552.96    614.40
Xeon W5580 L2            40.96     81.92     122.88    163.84    204.80    245.76    286.72    327.68    368.64    409.60
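
Each row of Table A-4 scales linearly with the cache hit rate: the value in a given column is the 100% column multiplied by that hit rate. The Python sketch below reproduces one row under that assumption. The function name imb_at_hit_rate is hypothetical, and the linear model is read off the table rather than stated in the text.

```python
def imb_at_hit_rate(peak_imb_gbs: float, hit_rate: float) -> float:
    """Scale a cache's peak internal memory bandwidth (GB/s) by its hit rate.

    Assumption: effective IMB = peak IMB * hit rate, as the rows of
    Table A-4 suggest. hit_rate is a fraction between 0 and 1.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return peak_imb_gbs * hit_rate


# Atom 330 L1 peak bandwidth from the 100% column of Table A-4
atom330_l1_peak = 153.60
for percent in range(10, 101, 10):
    value = imb_at_hit_rate(atom330_l1_peak, percent / 100)
    print(f"{percent:3d}%  {value:7.2f} GB/s")
```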


TableA-5. Memory-sustainableCDacrossprecisions DeviceBitInt16Int32SPFPDPFPPar.OpsGOPSPar.OpsGOPSPar.OpsGOPSPar.OpsGOPSPar.OpsGOPS ADSP-TS203S384192.0010.005.006.003.006.003.000.000.00Athlon64X26400+6402048.0022.0070.4014.0044.808.0025.604.0012.80AthlonIIX463517925196.8076.00220.4044.00127.6032.0092.8016.0046.40Atom330384614.4018.0028.8010.0016.0010.0016.006.009.60AtomN270192307.209.0014.405.008.005.008.003.004.80BlueGene/P12801088.0016.0013.608.006.8016.0013.608.006.80Cell12804096.0064.00204.8036.00115.2064.00204.806.0019.20Core2DuoT99007682354.6948.00147.1724.0073.5816.0049.068.0024.53Corei7-980X23047672.32144.00479.5272.00239.7672.00239.7636.00119.88CSX60061441536.0096.0024.0096.0024.0096.0024.0096.0024.00ECA-642176435.2064.0012.8032.006.400.000.000.000.00FPOA61446144.00320.00320.00160.00160.0013.000.220.000.00FreescaleP2020256307.2016.0019.208.009.608.009.604.004.80FreescaleP40807681152.0016.0024.0016.0024.008.0012.003.004.50GeForceGTX4801945627257.86736.001031.14736.001031.14736.001031.14128.89128.89GeForceGTX2851536022671.36480.00708.48480.00708.48480.00708.4830.0044.28Itanium935015362359.3096.00166.0848.0083.0416.0027.688.0013.84Monarch61502047.95196.0065.27196.0065.27196.0065.270.000.00MPC7447288288.0017.0017.009.009.006.006.003.003.00MPC8640D576576.0034.0034.0018.0018.0012.0012.006.006.00Nvidia9400M10241126.4032.0035.2032.0035.2032.0035.200.000.00NvidiaIon1555491740.80155549.0064.00155549.0051.2051.2051.20155549.009.60NvidiaTeslaC10601536019968.00480.00624.00480.00624.00480.00624.0030.0039.00NvidiaTeslaC870819211059.20256.00345.60256.00216.00256.00345.600.000.00Opteron8360SE17924480.0076.00190.0044.00110.0032.0080.0016.0040.00Opteron8439SE26887526.40114.00319.2066.00184.8048.00134.4024.0067.20PACTXPP-3c55681948.80348.00121.80174.0060.900.000.000.000.00PhenomIIX61090T26888601.60114.00364.8066.00211.2048.00153.6024.0076.80PowerXCell8i12804096.0064.00204.8036.00115.2064.00204.8032.00102.40StratixIIEP2S18015043275216.001079.00442.39292.00122.64186.0053.2072.0010
.66StratixIIIEp3SE260217344119539.201944.00777.60737.00201.20221.0078.23114.0039.12StratixIIIEP3SL340280768154422.402296.00918.40781.00213.21292.0096.07134.0026.13StratixIVEP4SE530443392243865.602632.00765.911352.00328.54551.00132.79371.0068.26TIOMAP-L1376419.2016.004.806.001.806.001.802.500.75TILE6461444608.00320.00240.00192.00144.0048.0036.000.000.00Virtex-4LX10010003250016.00240.0082.56120.0029.88120.0032.8852.009.62Virtex-4LX20017990489952.00336.00115.58168.0041.83168.0046.0384.0015.54Virtex-4SX555836829184.00320.00110.08160.0039.84114.0040.2443.0013.93Virtex-5LX330T210816115948.80648.00300.02324.00122.47324.00115.67109.0025.83Virtex-5SX95T7040038720.00488.00225.94244.0092.23180.0073.8061.0021.72Virtex-6HX565T369792221875.201824.001036.03912.00269.95836.00255.82284.0060.78Virtex-6LX550T35923221539.201264.00717.95632.00187.07632.00193.39278.0059.49Virtex-6LX760489792293875.201440.00817.92720.00213.12720.00220.32346.0074.04Virtex-6SX475T333888200332.802128.001208.701064.00314.941059.00343.12386.0082.60Xeon70415121536.0014.0042.0010.0030.0010.0030.008.0024.00XeonW558015364915.2096.00307.2048.00153.6048.00153.6024.0076.80XeonX323015364094.9848.00127.9732.0085.3132.0085.3124.0063.98XeonX756030726961.15192.00435.0796.00217.5496.00217.5448.00108.77 105


Table A-6. EMB of RLD and FLD devices

Manufacturer  Device                        Type    Process Tech (nm)  EMB     IOB
AMD           Radeon R9 295X2               GPU     28                 640.00  656.00
Xilinx        Virtex-7 VX980T               FPGA    28                 89.60   417.43
Intel         Xeon Phi 7120p                CPU     22                 352.00  358.87
Intel         Xeon Phi 5110p                CPU     22                 320.00  326.87
NVIDIA        Tesla K40                     GPU     28                 288.40  304.40
Achronix      Speedster22i HD               FPGA    22                 57.56   267.41
Altera        Stratix V SGXMBBR1H43I2       FPGA    28                 33.59   213.92
Altera        Stratix V 5SGXEBBR1H43I2      FPGA    28                 44.80   191.13
Xilinx        Kintex-7Q XQ7K410T RF900-1M   FPGA    28                 42.67   184.29
Intel         Xeon E5-2670                  CPU     32                 51.20   163.20
Xilinx        Zynq-7000 Q7Z045 RF676-1Q     HYBRID  28                 34.11   126.63
Xilinx        Virtex-5 FX130T FF1738-1      FPGA    65                 21.33   121.58
Altera        Arria V SoC 5ASXFB5H4F40I3    HYBRID  28                 53.33   107.96
Altera        Stratix V 5SGSMB8I2F45C2      FPGA    28                 70.36   98.73
AMD           Opteron 6380                  CPU     32                 51.20   91.20
Altera        Stratix IV EP4SGX230KF40I3    FPGA    40                 44.80   87.90
AMD           Trinity A10-5800K             CPU     32                 29.86   73.59
Intel         Core i7-4610Y                 CPU     22                 25.60   64.10
AMD           Richland A10-6800k            CPU     32                 34.13   52.13
Tilera        TILEGx-8036                   CPU     40                 12.80   45.92
TI            TMS320C6678                   DSP     40                 12.80   33.86
Altera        Arria V 5AGXMB7G6             FPGA    28                 8.53    31.56
NVIDIA        Tegra K1                      HYBRID  28                 13.85   22.55
Altera        Arria II GX260FF35I5          FPGA    40                 19.20   22.39
ClearSpeed    CSX700                        DSP     90                 7.50    21.50
Intel         Atom Z3770                    CPU     22                 17.36   18.79


TableA-7. Computationaldensity DeviceInfo.CD(GOPS)ManufacturerDeviceTypeTech(nm)Frequency(MHz)Int16Int32SPFPDPFP Aeroex/GaislerGR712RCCPU180100.000.080.080.030.03AchronixSpeedster22iHDFPGA22RC1064.00190.50189.1027.96AlteraArriaIIGX260FF35I5FPGA40RC160.38119.6793.6035.52AlteraArriaV5AGXFB7K4F40I3FPGA28RC1197.00310.00260.7097.10AlteraArriaV5AGXMB7G6FPGA28RC1003.00241.6084.1331.30AlteraArriaVSoC5ASXFB5H4F40I3HYBRID281170.20294.20246.9091.29AlteraCycloneV5CGXFC9E6F35I7FPGA28RC358.8081.5394.1731.81AlteraCycloneV5CGXFC9E7FPGA28RC382.8088.5940.9715.38AlteraCycloneVSoC5CSXFC6D6F31I7HYBRID28149.5040.9945.9613.84AlteraStratixIIEP2S180FPGA90RC442.39122.6453.2010.66AlteraStratixIIIEP3SE260FPGA65RC777.60201.2078.2339.12AlteraStratixIIIEP3SL150F115214FPGA65RC329.7087.1069.4321.55AlteraStratixIIIEP3SL340FPGA65RC918.40213.2196.0726.13AlteraStratixIVEP4SE530FPGA40RC765.91328.54132.7968.26AlteraStratixIVEP4SE680FPGA40RC910.83389.04171.5988.69AlteraStratixIVEP4SGX230KF40I3FPGA40RC838.70258.10153.2058.13AlteraStratixIVEP4SGX230KF40M3FPGA40RC811.50258.10153.0060.55AlteraStratixV5SGSMB8I2F45C2FPGA28RC877.50351.00261.0085.69AlteraStratixV5SGXEBBR1H43I2FPGA28RC1999.00499.00461.30140.60AlteraStratixVSGXMBBR1H43I2FPGA28RC2448.00554.10310.6087.75AMDAthlon64X26400+CPU902000.0070.4044.8025.6012.80AMDAthlonIIX4635CPU454000.00220.40127.6092.8046.40AMDFireProS10000GPU28825.001524.60785.402956.80739.20AMDFusion-A8-3870KCPU323000.00384.00192.00192.0096.00AMDOpteron6380CPU322500.003584.001792.00448.00224.00AMDOpteron8360SECPU652500.00190.00110.0080.0040.00AMDOpteron8439SECPU452800.00319.20184.80134.4067.20AMDPhenomIIX61090TBlackCPU453200.00364.80211.20153.6076.80AMDRadeonHD7970GPU281000.002112.001088.004096.001024.00AMDRadeonHD7990GPU281000.002112.001088.004096.001024.00AMDRadeonR9290XGPU281000.001472.70758.352824.40352.38AMDRadeonR9295X2GPU281020.002962.081525.925744.64718.08AMDRichlandA10-6800kCPU324400.00887.30605.70469.40232.40AMDTrinityA10-5800KCPU323800.00998.40499.20275.20137.60BAESystemsRAD75
0CPU250133.000.270.270.130.13BAESystemsRADSPEEDDSP90233.0017.7110.1270.8370.83BoeingMAESTROCPU90260.0050.9625.4812.7412.74ClearSpeedCSX600DSP90210.0024.0024.0024.0024.00ClearSpeedCSX700DSP90250.0024.0013.7196.0096.00ElementCXIECA-64CPU12.806.400.000.00FreescaleMSC8251DSP451000.004.002.002.001.00FreescaleMSC8256DSP451000.0024.0012.0012.006.00FreescaleP2020CPU451000.004.004.008.008.00FreescaleP4080CPU451500.0024.0024.0012.004.50FreescaleP5040CPU452200.0017.6017.608.808.80FreescalePowerPC603eCPU290266.000.270.270.270.27HoneywellHXRHPPCCPU35080.000.080.080.080.08IBMBlueGene/PCPU90850.0013.606.8013.606.80IBMPowerPC750CPU90300.000.800.800.400.40IBMPowerXCell8iCPU653200.00204.80115.20204.80102.40IntelAtom330CPU451600.0028.8016.0016.009.60IntelAtomD425CPU451800.0016.209.009.005.40IntelAtomE680CPU451600.0014.408.008.004.80IntelAtomN270CPU451600.0014.408.008.004.80IntelAtomN280CPU451660.0014.948.308.304.98 107


Table A-7 continued DeviceInfo.CD(GOPS)ManufacturerDeviceTypeTech(nm)Frequency(MHz)Int16Int32SPFPDPFP IntelAtomS1260CPU322000.0080.0040.0024.008.00IntelAtomZ2760HYBRID321800.0072.0036.0021.607.20IntelAtomZ3770CPU222390.00105.1258.4046.7223.36IntelAtomZ500CPU45800.007.204.004.002.40IntelAtomZ515CPU451200.0010.806.006.003.60IntelAtomZ560CPU452130.0019.1710.6510.656.39IntelCore2DuoE6700CPU452660.00153.6076.8051.2025.50IntelCore2DuoMobileSU9300CPU451200.0057.6028.8019.209.60IntelCore2DuoMobileU7500CPU651060.0050.8825.4416.968.48IntelCore2DuoT9900(Penryn)CPU453060.00147.1773.5849.0624.53IntelCore2ExtremeQX6700CPU652660.00255.36127.6885.1242.56IntelCore2QuadQ6600CPU652400.00230.40115.2076.8038.40IntelCorei7-2700KCPU323900.00374.40187.20249.60124.80IntelCorei7-3770KCPU223500.00540.80270.40332.80166.40IntelCorei7-3930KCPU323800.00547.20273.60364.80182.40IntelCorei7-3960XCPU323300.00561.60280.80374.40187.20IntelCorei7-4610YCPU222900.00316.20177.00124.8078.40IntelCorei7-4770kCPU223900.00876.40446.00305.60152.80IntelCorei7-980XCPU323330.00479.52239.76239.76119.88IntelItanium9350CPU651730.00166.0883.0427.6813.84IntelQuarkX1000CPU32400.000.800.800.400.40IntelXeon7041CPU903000.0042.0030.0030.0024.00IntelXeonE5-2670CPU322600.00633.60316.80422.40211.20IntelXeonPhi3120pCPU221100.001065.901065.901065.90564.30IntelXeonPhi5110pCPU221053.001074.061074.061074.06568.62IntelXeonPhi7120pCPU221238.001379.211379.211379.21730.17IntelXeonW5580CPU453200.00307.20153.60153.6076.80IntelXeonX3230CPU652660.00127.9785.3185.3163.98IntelXeonX7560CPU452260.00479.52239.76239.76119.88MonarchMonarchCPU9065.2765.2765.270.00NVIDIAGeForce9400MGPU651100.0035.2035.2035.200.00NVIDIAGeForce8600GTSGPU801450.0092.8092.8092.800.00NVIDIAGeForce8800UltraGPU901500.00768.00384.00384.000.00NVIDIAGeforce9800GX2GPU651500.00768.00768.00768.000.00NVIDIAGTX285GPU551476.00708.48708.48708.4844.28NVIDIAGTX480GPU401215.001031.141031.141031.14128.89NVIDIAGTX590GPU401215.001555.201555.201555.20777.60NVIDIAGTX690GPU281019.003130.3931
30.393130.3965.22NVIDIAGTXTitanGPU28876.002354.692354.692354.69784.90NVIDIAIon(9400M+Atom330)GPU651100.0064.0051.2051.209.60NVIDIATegra2HYBRID401200.0052.8028.8028.809.60NVIDIATegra3HYBRID401600.00137.9873.9873.9825.60NVIDIATegra4HYBRID281900.00261.18154.78109.1830.40NVIDIATegraK1HYBRID282300.00440.00311.20256.0044.40NVIDIATeslaC1060GPU551300.00624.00624.00624.0039.00NVIDIATeslaC870GPU901350.00345.60216.00345.600.00NVIDIATeslaK20GPU28706.001762.181762.181762.18587.39NVIDIATeslaK20xGPU28706.001897.721897.721897.72632.58NVIDIATeslaK40GPU28875.002520.002520.002520.00837.12 108


Table A-7 continued DeviceInfo.CD(GOPS)ManufacturerDeviceTypeTech(nm)Frequency(MHz)Int16Int32SPFPDPFP PACTXPP-3cDSP90121.8060.900.000.00FreescaleMPC7447CPU1301300.0017.009.006.003.00FreescaleMPC8640DCPU901250.0034.0018.0012.006.00QualcommSnapdragonMSM8660HYBRID451500.0057.6028.8028.809.60QualcommSnapdragonMSM8960HYBRID281500.0099.0949.6036.8012.00SPARC64VIIIfxCPU452000.00256.00128.00128.0064.00STICellCPU45204.80115.20204.8019.20TIDM3730HYBRID451000.0029.1214.5612.483.72TIKeystoneII66AK2H12HYBRID281400.00729.60364.80198.4099.20TIOMAP4430HYBRID451000.0054.2627.1326.738.47TIOMAP4460HYBRID451200.0073.0936.5436.1412.00TIOMAP5430HYBRID281700.00115.6857.8497.0232.00TIOMAP-L137HYBRID65456.004.801.801.800.75TITMS320C6672DSP401500.00144.0072.0048.0024.00TITMS320C6678DSP401250.00480.00240.00160.0080.00TigerSharcADSP-TS203SDSP130500.005.003.003.000.00TileraTILEGx-8036CPU401500.00270.00162.0054.0054.00TileraTile64CPU90900.00240.00144.0036.000.00TileraTilePro36CPU90500.0090.0054.0013.500.00TileraTilePro64CPU90700.00179.2089.600.000.00XilinxArtix-7350TFPGA28RC358.43220.09191.8649.40XilinxArtix-7QA350T FB676-1MFPGA28RC939.10163.30134.2045.52XilinxKintex-7QXQ7K410T RF900-1MFPGA28RC1696.00380.60224.3091.95XilinxSpartan-6QXQ6SLX150T FGG676-2FPGA40RC185.1037.9621.227.86XilinxVirtex-4XC4VLX160 FF1513-10FPGA90RC216.8048.3939.7210.72XilinxVirtex-4LX200 FF1513-10FPGA90RC258.3063.3751.0213.78XilinxVirtex-4QLX160 FF1513-10FPGA90RC213.4051.2543.5511.20XilinxVirtex-4QVLX200 CF1509-10FPGA90RC266.8959.1049.8114.28XilinxVirtex-5FX130T FF1738-1FPGA65RC416.3089.6880.2317.53XilinxVirtex-5LX155T FF1136-1FPGA65RC283.0065.0472.4116.09XilinxVirtex-5QFX130T EF1738-1FPGA65RC422.0085.0480.2317.07XilinxVirtex-5QFX200T EF1738-1FPGA65RC529.30106.10109.6025.63XilinxVirtex-5QLX155T EF1136-1FPGA65RC319.3038.7572.9716.48XilinxVirtex-6SX475TFPGA40RC1208.70314.94343.1282.60XilinxVirtex-6QLX550T 
RF1759-1FPGA40RC1382.00352.90311.4081.47XilinxVirtex-7585TFPGA28RC2028.80490.80495.70135.30XilinxVirtex-7VX980TFPGA28RC2912.00841.20539.80219.40XilinxVirtex-7QVX980T RF1930-1IFPGA28RC2186.00890.50539.60207.20XilinxZynq-7000Q7Z020 CL484-1QHYBRID28152.0146.2940.5712.63XilinxZynq-7000Q7Z045 RF676-1QHYBRID28719.81269.20194.1065.27 109


TableA-8. Computationaldensityperwatt DeviceInfo.CD/W(GOPS/Watt)ManufacturerDeviceTypeTech(nm)Frequency(MHz)Int16Int32SPFPDPFP Aeroex/GaislerGR712RCCPU180100.000.050.050.020.02AchronixSpeedster22iHDFPGA22RC51.5513.1613.301.31AlteraArriaIIGX260FF35I5FPGA40RC32.9516.6912.695.91AlteraArriaV5AGXFB7K4F40I3FPGA28RC70.3320.618.613.48AlteraArriaV5AGXMB7G6FPGA28RC76.1522.079.603.83AlteraArriaVSoC5ASXFB5H4F40I3HYBRID28RC65.9619.308.523.40AlteraCycloneV5CGXFC9E6F35I7FPGA28RC65.9718.178.653.46AlteraCycloneV5CGXFC9E7FPGA28RC66.1618.279.493.80AlteraCycloneVSoC5CSXFC6D6F31I7HYBRID28RC44.5613.238.513.03AlteraStratixIIEP2S180FPGA90RC691.93191.8283.2016.67AlteraStratixIIIEP3SE260FPGA65RC437.28113.1543.9922.00AlteraStratixIIIEP3SL150F115214FPGA65RC16.465.332.581.26AlteraStratixIIIEP3SL340FPGA65RC604.09140.2463.1917.19AlteraStratixIVEP4SE530FPGA40RC54.0721.938.545.75AlteraStratixIVEP4SE680FPGA40RC135.1857.7425.4713.16AlteraStratixIVEP4SGX230KF40I3FPGA40RC41.2915.046.002.69AlteraStratixIVEP4SGX230KF40M3FPGA40RC41.2315.046.002.69AlteraStratixV5SGSMB8I2F45C2FPGA28RC61.7116.3816.267.99AlteraStratixV5SGXEBBR1H43I2FPGA28RC69.1019.238.673.06AlteraStratixVSGXMBBR1H43I2FPGA28RC80.6222.3612.894.33AMDAthlon64X26400+CPU902000.000.560.360.200.10AMDAthlonIIX4635CPU454000.002.321.340.980.49AMDFireProS10000GPU28825.004.072.097.881.97AMDFusion-A8-3870KCPU323000.003.841.921.920.96AMDOpteron6380CPU322500.0031.1715.583.901.95AMDOpteron8360SECPU652500.001.811.050.760.38AMDOpteron8439SECPU452800.003.041.761.280.64AMDPhenomIIX61090TBlackCPU453200.002.921.691.230.61AMDRadeonHD7970GPU281000.005.632.9010.922.73AMDRadeonHD7990GPU281000.005.632.9010.922.73AMDRadeonR9290XGPU281000.005.082.629.731.22AMDRadeonR9295X2GPU281020.006.003.0911.501.43AMDRichlandA10-6800kCPU324400.008.876.064.652.32AMDTrinityA10-5800KCPU323800.009.984.992.751.38BAESystemsRAD750CPU250133.000.050.050.030.03BAESystemsRADSPEEDDSP90233.001.180.674.724.72BoeingMAESTROCPU90260.002.301.150.570.57ClearSpeedCSX600DSP90210.002.402.402.402.40ClearSpee
dCSX700DSP90250.002.401.379.609.60FreescaleMSC8251DSP451000.001.370.690.690.34FreescaleMSC8256DSP451000.003.971.991.990.99FreescaleP2020CPU451000.000.620.621.231.23FreescaleP4080CPU451500.000.800.800.400.15FreescaleP5040CPU452200.000.360.360.180.18FreescalePowerPC603eCPU290266.000.080.080.080.08HoneywellHXRHPPCCPU35080.000.010.010.010.01IBMBlueGene/PCPU90850.000.850.430.850.43IBMPowerPC750CPU90300.000.170.170.090.09IBMPowerXCell8iCPU653200.002.231.252.231.11IntelAtom330CPU451600.003.602.002.001.20IntelAtomD425CPU451800.001.620.900.900.54IntelAtomE680CPU451600.003.201.781.781.07IntelAtomN270CPU451600.004.362.422.421.45IntelAtomN280CPU451660.005.983.323.321.99 110


Table A-8 continued DeviceInfo.CD/W(GOPS/Watt)ManufacturerDeviceTypeTech(nm)Frequency(MHz)Int16Int32SPFPDPFP IntelAtomS1260CPU322000.009.414.712.820.94IntelAtomZ2760HYBRID321800.0024.0012.007.202.40IntelAtomZ3770CPU222390.0026.2814.6011.685.84IntelAtomZ500CPU45800.0011.086.156.153.69IntelAtomZ515CPU451200.007.714.294.292.57IntelAtomZ560CPU452130.007.674.264.262.56IntelCore2DuoE6700CPU452660.002.361.180.790.39IntelCore2DuoMobileSU9300CPU451200.005.762.881.920.96IntelCore2DuoMobileU7500CPU651060.005.092.541.700.85IntelCore2DuoT9900(Penryn)CPU453060.004.202.101.400.70IntelCore2ExtremeQX6700CPU652660.001.960.980.650.33IntelCore2QuadQ6600CPU652400.002.191.100.730.37IntelCorei7-2700KCPU323900.003.941.972.631.31IntelCorei7-3770KCPU223500.007.023.514.322.16IntelCorei7-3930KCPU323800.004.212.102.811.40IntelCorei7-3960XCPU323300.004.322.162.881.44IntelCorei7-4610YCPU222900.0027.5015.3910.856.82IntelCorei7-4770kCPU223900.0010.435.313.641.82IntelCorei7-980XCPU323330.003.691.841.840.92IntelItanium9350CPU651730.000.900.450.150.07IntelQuarkX1000CPU32400.000.360.360.180.18IntelXeon7041CPU903000.000.250.180.180.15IntelXeonE5-2670CPU322600.005.512.753.671.84IntelXeonPhi3120pCPU221100.003.553.553.551.88IntelXeonPhi5110pCPU221053.004.774.774.772.53IntelXeonPhi7120pCPU221238.004.604.604.602.43IntelXeonW5580CPU453200.002.361.181.180.59IntelXeonX3230CPU652660.001.350.900.900.67IntelXeonX7560CPU452260.003.351.841.670.92NVIDIAGeForce9400MGPU651100.002.932.932.930.00NVIDIAGeForce8600GTSGPU801450.001.311.311.310.00NVIDIAGeForce8800UltraGPU901500.004.392.192.190.00NVIDIAGeforce9800GX2GPU651500.003.903.903.900.00NVIDIAGTX285GPU551476.003.473.473.470.22NVIDIAGTX480GPU401215.004.124.124.120.51NVIDIAGTX590GPU401215.004.264.264.262.13NVIDIAGTX690GPU281019.0010.4310.4310.430.22NVIDIAGTXTitanGPU28876.009.429.429.423.14NVIDIAIon(9400M+Atom330)GPU651100.002.722.182.180.41NVIDIATegra2HYBRID401200.0052.8028.8028.809.60NVIDIATegra3HYBRID401600.0068.9936.9936.9912.80NVIDIATegra4HYBRID281900.0052.2430.9621.
846.08NVIDIATegraK1HYBRID282300.0088.0062.2451.208.88 111


Table A-8 continued DeviceInfo.CD/W(GOPS/Watt)ManufacturerDeviceTypeTech(nm)Frequency(MHz)Int16Int32SPFPDPFP NVIDIATeslaC1060GPU551300.003.323.323.320.21NVIDIATeslaC870GPU901350.002.031.272.030.00NVIDIATeslaK20GPU28706.007.837.837.832.61NVIDIATeslaK20xGPU28706.008.088.088.082.69NVIDIATeslaK40GPU28875.0010.7210.7210.723.56FreescaleMPC7447CPU1301300.001.700.900.600.30FreescaleMPC8640DCPU901250.002.431.290.860.43QualcommSnapdragonMSM8660HYBRID451500.0048.0024.0024.008.00QualcommSnapdragonMSM8960HYBRID281500.0036.7018.3713.634.44SPARC64VIIIfxCPU452000.004.412.212.211.10STICellCPU452.931.652.930.27TIDM3730HYBRID451000.0029.1214.5612.483.72TIKeystoneII66AK2H12HYBRID281400.0033.6416.829.154.57TIOMAP4430HYBRID451000.0090.4345.2144.5514.11TIOMAP4460HYBRID451200.0038.7119.3619.146.36TIOMAP5430HYBRID281700.0084.5542.2870.9223.39TIOMAP-L137HYBRID65456.007.722.902.901.21TITMS320C6672DSP401500.0010.395.193.461.73TITMS320C6678DSP401250.0028.3714.197.883.94TigerSharcADSP-TS203SDSP130500.001.711.021.020.00TileraTILEGx-8036CPU401500.009.005.401.801.80TileraTile64CPU90900.006.133.680.920.00TileraTilePro36CPU90500.006.924.151.040.00TileraTilePro64CPU90700.007.793.900.000.00XilinxArtix-7350TFPGA28RC46.7811.969.273.36XilinxArtix-7QA350T FB676-1MFPGA28RC52.6513.738.933.72XilinxKintex-7QXQ7K410T RF900-1MFPGA28RC51.3618.038.723.50XilinxSpartan-6QXQ6SLX150T FGG676-2FPGA40RC22.465.953.891.47XilinxVirtex-4XC4VLX160 FF1513-10FPGA90RC13.893.782.911.05XilinxVirtex-4LX200 FF1513-10FPGA90RC19.245.314.051.50XilinxVirtex-4QLX160 FF1513-10FPGA90RC14.183.893.011.08XilinxVirtex-5FX130T FF1738-1FPGA65RC23.816.095.752.03XilinxVirtex-5LX155T FF1136-1FPGA65RC18.214.714.551.47XilinxVirtex-5QFX130T EF1738-1FPGA65RC25.156.245.751.73XilinxVirtex-5QFX200T EF1738-1FPGA65RC23.335.825.301.71XilinxVirtex-5QLX155T EF1136-1FPGA65RC18.464.334.551.48XilinxVirtex-6SX475TFPGA40RC47.6315.339.653.52XilinxVirtex-6QLX550T 
RF1759-1FPGA40RC27.827.315.431.94XilinxVirtex-7585TFPGA28RC85.8030.4030.505.70XilinxVirtex7VX980TFPGA28RC51.6215.399.323.76XilinxVirtex-7QVX980T RF1930-1IFPGA28RC48.3114.798.763.47XilinxZynq-7000Q7Z020 CL484-1QHYBRID28RC36.9110.727.833.06XilinxZynq-7000Q7Z045 RF676-1QHYBRID28RC44.6912.257.933.17 112


Table A-9. Matrix multiplication

Platform                        Throughput (MFLOPS)  CD (GOPS)  RU (%)  Ref.
Tesla C870 CUBLAS               128,020.00           345.60     37.04   [Volkov and Demmel 2008b]
Tesla C870 Custom               207,600.00           345.60     60.07   [Volkov and Demmel 2008b]
Tesla C1060 CUBLAS              188,760.00           624.00     30.25   [Volkov and Demmel 2008b]
Tesla C1060 Custom              257,400.00           624.00     41.25   [Volkov and Demmel 2008b]
Tesla C870 1024, square         1,510.00             345.60     0.44    [Cecilia]
Tesla C870 2048, square         1,600.00             345.60     0.46    [Cecilia]
Tesla C870 4096, square         1,720.00             345.60     0.50    [Cecilia]
Tesla C870 4x4, tile            3,760.00             345.60     1.09    [Cecilia]
Tesla C870 8x8, tile            7,730.00             345.60     2.24    [Cecilia]
Tesla C870 12x12, tile          11,340.00            345.60     3.28    [Cecilia]
Tesla C870 16x16, tile          22,100.00            345.60     6.39    [Cecilia]
GeForce 8800 GTX                43,000.00            345.60     12.44   [Cecilia]
GeForce 480 GTX 64              3,326.40             1,031.14   0.32    [NVidia 2010a]
GeForce 480 GTX 512             206,564.90           1,031.14   20.03   [NVidia 2010a]
GeForce 480 GTX 1024            234,492.10           1,031.14   22.74   [NVidia 2010a]
GeForce 480 GTX 64, short       4,693.50             1,031.14   0.46    [NVidia 2010a]
GeForce 480 GTX 512, short      215,042.00           1,031.14   20.85   [NVidia 2010a]
GeForce 480 GTX 1024, short     237,920.00           1,031.14   23.07   [NVidia 2010a]
GeForce 480 GTX 2048            174,334.90           1,031.14   16.91   [NVidia 2010a]
GeForce 240 GT 64               3,748.20             238.50     1.57    [NVidia 2010a]
GeForce 240 GT 512              66,741.70            238.50     27.98   [NVidia 2010a]
GeForce 240 GT 1024             70,230.00            238.50     29.45   [NVidia 2010a]
GeForce 240 GT 2048             60,850.00            238.50     25.51   [NVidia 2010a]
GeForce 8800 GTX, initial       10,580.00            345.60     3.06    [Ryoo et al. 2008]
GeForce 8800 GTX, 16x16         46,490.00            345.60     13.45   [Ryoo et al. 2008]
GeForce 8800 GTX, latency       87,600.00            345.60     25.35   [Ryoo et al. 2008]
GeForce 8800 GTX, reduced       91,140.00            345.60     26.37   [Ryoo et al. 2008]
GeForce 285 GTX CSR             12,800.00            624.00     2.05    [Bell and Garland 2009]
GeForce 8800 GTX                206,000.00           345.60     59.61   [Volkov and Demmel 2008a]
GeForce 8800 GTX, transposed    205,000.00           345.60     59.32   [Volkov and Demmel 2008a]
Core2 E6700, MKL                25,900.00            42.00      61.67   [Volkov and Demmel 2008a]
Core2 Q6600, MKL                69,800.00            76.80      90.89   [Volkov and Demmel 2008a]
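
The RU column in Table A-9 can be reproduced by dividing the achieved throughput by the device's CD and expressing the result as a percentage. The Python sketch below does this for three rows of the table. The function name realizable_utilization is hypothetical, and the sketch follows the table in treating MFLOPS and GOPS as directly comparable after unit conversion.

```python
def realizable_utilization(throughput_mflops: float, cd_gops: float) -> float:
    """Return RU in percent: achieved throughput relative to theoretical CD.

    Assumption: RU = achieved / theoretical, with 1 GOPS taken as
    1000 MFLOPS, which matches the values tabulated in Table A-9.
    """
    if cd_gops <= 0:
        raise ValueError("cd_gops must be positive")
    return throughput_mflops / (cd_gops * 1000.0) * 100.0


# (platform, throughput in MFLOPS, CD in GOPS) copied from Table A-9
rows = [
    ("Tesla C870 CUBLAS", 128_020.00, 345.60),
    ("Tesla C870 Custom", 207_600.00, 345.60),
    ("Core2 Q6600, MKL", 69_800.00, 76.80),
]

for platform, mflops, cd in rows:
    print(f"{platform}: RU = {realizable_utilization(mflops, cd):.2f}%")
```

The same function applies to Tables A-10 and A-11, which use the same Throughput, CD, and RU columns.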


Table A-10. Matrix decomposition

Platform                             Throughput (MFLOPS)  CD (GOPS)  RU (%)  Ref.
GeForce 8800 Ultra, Cholesky         27,000.00            384.00     7.03    [Barrachina et al. 2008]
GeForce 8800 Ultra, Cholesky         33,500.00            384.00     8.72    [Barrachina et al. 2008]
GeForce 8800 Ultra, Cholesky         41,200.00            384.00     10.73   [Barrachina et al. 2008]
GeForce 8800 Ultra, LU               37,800.00            384.00     9.84    [Barrachina et al. 2008]
GeForce 8800 Ultra, LU               47,900.00            384.00     12.47   [Barrachina et al. 2008]
GeForce 8800 Ultra, LU               46,900.00            384.00     12.21   [Barrachina et al. 2008]
GeForce 8800 GTX, Cholesky           183,000.00           345.60     52.95   [Volkov and Demmel 2008a]
GeForce 8800 GTX, LU                 179,000.00           345.60     51.79   [Volkov and Demmel 2008a]
GeForce 8800 GTX, QR                 192,000.00           345.60     55.56   [Volkov and Demmel 2008a]
Core2 E6700, Cholesky                23,000.00            42.00      54.76   [Volkov and Demmel 2008a]
Core2 E6700, LU                      22,700.00            42.00      54.05   [Volkov and Demmel 2008a]
Core2 E6700, QR                      22,500.00            42.00      53.57   [Volkov and Demmel 2008a]
Core2 Q6600, Cholesky                34,900.00            76.80      45.44   [Volkov and Demmel 2008a]
Core2 Q6600, LU                      59,200.00            76.80      77.08   [Volkov and Demmel 2008a]
Core2 Q6600, QR                      44,800.00            76.80      58.33   [Volkov and Demmel 2008a]
GeForce 280 GTX, Cholesky            152,400.00           622.08     24.50   [Barrachina et al. 2009]
GeForce 280 GTX, Cholesky            46,800.00            622.08     7.52    [Barrachina et al. 2009]
GeForce 280 GTX, Cholesky            156,200.00           622.08     25.11   [Barrachina et al. 2009]
GeForce 280 GTX, LU                  64,900.00            622.08     10.43   [Barrachina et al. 2009]
GeForce 280 GTX, LU                  140,600.00           622.08     22.60   [Barrachina et al. 2009]
GeForce 280 GTX, LU                  142,400.00           622.08     22.89   [Barrachina et al. 2009]
GeForce 280 GTX, padded Cholesky     154,100.00           622.08     24.77   [Barrachina et al. 2009]
GeForce 280 GTX, padded Cholesky     46,800.00            622.08     7.52    [Barrachina et al. 2009]
GeForce 280 GTX, padded Cholesky     159,500.00           622.08     25.64   [Barrachina et al. 2009]
GeForce 280 GTX, padded LU           63,200.00            622.08     10.16   [Barrachina et al. 2009]
GeForce 280 GTX, padded LU           145,400.00           622.08     23.37   [Barrachina et al. 2009]
GeForce 280 GTX, padded LU           143,900.00           622.08     23.13   [Barrachina et al. 2009]
GeForce 280 GTX, hybrid Cholesky     153,900.00           622.08     24.74   [Barrachina et al. 2009]
GeForce 280 GTX, hybrid Cholesky     168,000.00           622.08     27.01   [Barrachina et al. 2009]
GeForce 280 GTX, hybrid Cholesky     175,600.00           622.08     28.23   [Barrachina et al. 2009]
GeForce 280 GTX, hybrid LU           138,900.00           622.08     22.33   [Barrachina et al. 2009]
GeForce 280 GTX, hybrid LU           142,800.00           622.08     22.96   [Barrachina et al. 2009]
GeForce 280 GTX, hybrid LU           145,900.00           622.08     23.45   [Barrachina et al. 2009]


Table A-11. N-body simulation

Platform                            Throughput (MFLOPS)  CD (GOPS)  RU (%)  Ref.
PPC                                 11.84                8.50       0.14    [Holland et al. 2009]
GeForce 480 GTX 1024 point          16,916.67            1,031.14   1.64    [NVidia 2010b]
GeForce 480 GTX 30,720 point        757,763.33           1,031.14   73.49   [NVidia 2010b]
GeForce 480 GTX 65,536 point        746,770.00           1,041.45   71.71   [NVidia 2010b]
GeForce 8800 GTX memory aware       291,000.00           345.60     84.20   [Owens et al. 2008]
GeForce 8800 GTX                    191,000.00           345.60     55.27   [Owens et al. 2008]
Xeon QX6700                         280.00               85.31      0.33    [Owens et al. 2008]
Xeon QX6700                         5,300.00             85.31      6.21    [Owens et al. 2008]
GeForce 8800 GTX 131,072 point      256,000.00           345.60     74.07   [Hamada and Iitaka 2007]
GeForce 8800 GTX 16,384 point       342,000.00           345.60     98.96   [Nyland et al. 2007]
GeForce 8800 GTX 131,072 point      340,000.00           345.60     98.38   [Belleman et al. 2008]


Table A-12. RU benchmarking results

Algorithm (Device, Optimization)   Achieved GOPS  Device CD (GOPS)  RU (%)
MM (K20X, cuBLAS)                  578.042        632.580           91.378
TDFIR (DSP, Intrinsic)             95.550         160.000           59.719
MM (Phi, MKL)                      326.665        568.620           57.449
MM (DSP, mathlib)                  80.000         160.000           50.000
MM (G80, cuBLAS)                   234.500        1031.136          22.742
MM (A15-ODROID, ATLAS)             2.775          12.800            21.683
MM (A15-KeyStone-II, ATLAS)        2.125          10.000            21.249
Convolution (DSP, Intrinsic)       27.580         160.000           17.238
FDFIR (DSP, Intrinsic)             21.310         160.000           13.319
FFT (DSP, mathlib)                 18.770         160.000           11.731
FFT (K20x, cuFFT)                  61.608         632.580           9.739
FFT-2D (K20x, cuFFT)               61.608         632.580           9.739
FFT-1D (K20x, cuFFT)               60.247         632.580           9.524
Convolution (MSV)                  0.012          0.133             9.249
DGESVD (K20x, CULA)                48.167         633.580           7.602
MM (A15-ODROID, Intrinsic)         0.800          12.800            6.250
Filtering (DSP)                    9.110          160.000           5.694
FFT (CPU, FFTW)                    8.818          211.200           4.175
MM (A9, ATLAS)                     1.552          48.510            3.199
MM (MSV)                           0.003          0.133             2.441
Convolution (Zynq, ARM)            0.082          8.040             1.025
FFT (Phi, MKL)                     4.042          568.620           0.711
MM (A15-ODROID)                    0.047          12.800            0.367
MM (Zynq, ARM)                     0.029          8.040             0.355
Beamforming (DSP)                  0.380          160.000           0.238
MM (DSP)                           0.318          160.000           0.199
Convolution (Tile)                 0.075          54.000            0.140
Convolution (DSP)                  0.126          160.000           0.079
MM (Tile)                          0.012          54.000            0.021

Matrix Multiplication (MM), Fast Fourier Transform (FFT), Singular Value Decomposition (SVD)


REFERENCES

AARSETH, S. AND AARSETH, S. 2003. Gravitational N-Body Simulations. Cambridge Monographs on Mathematical Physics. Cambridge University Press.
Altera Corp. 2007a. Stratix II Device Handbook. Altera Corp.
Altera Corp. 2007b. Stratix III Device Handbook. Altera Corp.
Altera Corp. 2008. Stratix IV Device Handbook. Altera Corp.
AMD, Inc. 2008. Key architectural features AMD Athlon X2 dual-core processors. http://www.amd.com/us-en/Processors/ProductInformation/0,,30 118 9485 13041%5E13043,00.html .
ATHAVALE, A. AND CHRISTENSEN, C. 2005. High-Speed Serial I/O Made Simple, A Designers' Guide, with FPGA Applications. Xilinx Connectivity Solutions.
BAILEY, D. 1995. Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing. Proceedings in Applied Mathematics Series. SIAM.
BARRACHINA, S., CASTILLO, M., IGUAL, F., MAYO, R., AND QUINTANA-ORTI, E. 2008. Solving dense linear systems on graphics processors. In Euro-Par 2008 Parallel Processing, E. Luque, T. Margalef, and D. Benítez, Eds. Lecture Notes in Computer Science, vol. 5168. Springer Berlin/Heidelberg, 739.
BARRACHINA, S., CASTILLO, M., IGUAL, F. D., MAYO, R., QUINTANA-ORTI, E. S., AND QUINTANA-ORTI, G. June 7, 2009. Exploiting the capabilities of modern GPUs for dense matrix computations. Concurrency and Computation: Practice and Experience.
BARTON, M. 2007. Tilera's cores communicate better. Microprocessor Report.
BELL, B. AND GARLAND, M. 2009. Implementing sparse matrix-vector multiplication on throughput-oriented processors. High Performance Computing, Networking, Storage and Analysis, 2009. SC 2009. International Conference for.
BELLEMAN, R. G., BEDORF, J., AND ZWART, S. F. P. 2008. High performance direct gravitational N-body simulations on graphics processing units II: An implementation in CUDA. New Astronomy 13.
BHASKAR. 2006. Applied Mathematical Methods. Pearson Education.
BittWare, Inc. 2008. B2-AMC Data Sheet. BittWare, Inc.
BONDALAPATI, K. AND PRASANNA, V. K. 2002. Reconfigurable computing systems. Proceedings of the IEEE 90, 7 (Jul.), 1201.
BURGER, D., GOODMAN, J. R., AND KAGI, A. 1996. Memory bandwidth limitations of future microprocessors. In ISCA '96: Proceedings of the 23rd Annual International Symposium on Computer Architecture. ACM, New York, NY, USA, 78.


CECILIA, J. M. The GPU on the matrix-matrix multiply: Performance study and contributions.
CHEN, T., RAGHAVAN, R., DALE, J. N., AND IWATA, E. 2007. Cell broadband engine architecture and its first implementation: a performance view. IBM Journal of Research and Development 51, 5, 559.
ClearSpeed Technology PLC 2007. CSX600 Architecture Whitepaper. ClearSpeed Technology PLC.
COMPTON, K. AND HAUCK, S. 2002. Reconfigurable computing: a survey of systems and software. ACM Computing Surveys 34, 2, 171.
DEHON, A. 1996. Reconfigurable architectures for general-purpose computing. Tech. rep., Massachusetts Institute of Technology, Cambridge, MA, USA.
DONGARRA, J., FOR INDUSTRIAL, S., AND MATHEMATICS, A. 1979. Linpack Users' Guide. Miscellaneous Bks. Society for Industrial and Applied Mathematics.
Element CXI, Inc. 2007a. ECA-64 Device Architecture Overview. Element CXI, Inc.
Element CXI, Inc. 2007b. ECA-64 Product Brief. Element CXI, Inc.
FLYNN, M. J. 1966. Very high-speed computing systems. Proceedings of the IEEE 54, 12 (Dec.), 1901.
Freescale Semiconductor, Inc. 2005. MPC7450 RISC Microprocessor Family Reference Manual Rev. 5. Freescale Semiconductor, Inc.
Freescale Semiconductor, Inc. 2006. Altivec Technology Programming Environments Manual Rev. 3. Freescale Semiconductor, Inc.
Freescale Semiconductor, Inc. 2008. MPC8641D Integrated Host Processor Family Reference Manual Rev. 2. Freescale Semiconductor, Inc.
GUCCIONE, S. AND GONZALEZ, M. J. 1995. Classification and performance of reconfigurable architectures. In FPL '95: Proceedings of the 5th International Workshop on Field-Programmable Logic and Applications. Springer-Verlag, London, UK, 439.
HAMADA, T. AND IITAKA, T. March 5, 2007. The chamomile scheme: An optimized algorithm for N-body simulations on programmable graphics processing units. New Astronomy.
HENNESSY, J. L. AND PATTERSON, D. A. 2007. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
HESTER, P. 2006. 2006 technology analyst day. http://www.amd.com/us-en/assets/content type/DownloadableAssets/PhilHesterAMDAnalystDayV2.pdf .

HOLLAND, B., NAGARAJAN, K., AND GEORGE, A. 2009. Rat: Rc amenability test for rapid performance prediction. ACM Trans. Reconfigurable Technol. Syst. 1, 4 (Jan.), 22:1:31.
IBM Corp. 2008. PowerXCell 8i Processor Specifications. IBM Corp.
Intel Corp. 2000. Intel netburst architecture. http://www.intel.com/software/products/documentation/vlin/mergedprojects/analyzer ec/mergedprojects/reference olh/reference hh/inbma.htm .
Intel Corp. 2006. Inside Intel Core Microarchitecture. Intel Corp.
Intel Corp. 2008a. Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture. Intel Corp.
Intel Corp. 2008a. Intel xeon processor 7041. http://processorfinder.intel.com/Details.aspx?sSpec=SL8UD .
Intel Corp. 2008b. Intel xeon processor x3230. http://processorfinder.intel.com/Details.aspx?sSpec=SLACS .
Intel Corp. 2008b. Mobile Intel Atom Processor N270 Single Core Datasheet. Intel Corp.
Intel Corp. 2008c. Product Brief Intel Xeon Processor 3000 Sequence. Intel Corp.
LANCASTER, P. AND TISMENETSKY, M. 1985. The Theory of Matrices: With Applications. Computer Science and Applied Mathematics. Academic Press.
LEWINS, L., PRAGER, K., GROVES, G., AND VAHEY, M. 2007. World's first polymorphic computer-monarch. In Proceedings of the 11th Annual High-Performance Embedded Computing Workshop. Lexington, MA, USA.
Mathstar, Inc. 2007a. Arrix Family FPOA Architecture Guide. Mathstar, Inc.
Mathstar, Inc. 2007b. Arrix Family Product Data Sheet and Design Guide. Mathstar, Inc.
MILLUZZI, A., RICHARDSON, J., GEORGE, A., AND LAM, H. 2014. A multi-tiered optimization framework for heterogeneous computing. IEEE Proc. of High-Performance Extreme Computing Conference.
NVIDIA. 2010a. Nvidia sdk core. http://developer.nvidia.com/cuda-toolkit .
NVIDIA. 2010b. Nvidia sdk directcompute core. http://developer.nvidia.com/cuda-toolkit .
Nvidia Corp. 2006. Nvidia GeForce 8800 GPU Architecture Overview. Nvidia Corp.
Nvidia Corp. 2007. Nvidia CUDA Compute Unified Device Architecture Programming Guide. Nvidia Corp.
Nvidia Corp. 2008. Nvidia tesla c870 specifications. http://www.nvidia.com/object/tesla c870.html .

NYLAND, L., HARRIS, M., AND PRINS, J. 2007. Fast n-body simulation with cuda. GPU Gems 3.
OWENS, J., HOUSTON, M., LUEBKE, D., GREEN, S., STONE, J., AND PHILLIPS, J. 2008. Gpu computing. Proceedings of the IEEE 96, 5 (May), 879.
RADUNOVIC, B. AND MILUTINOVIC, V. M. 1998. A survey of reconfigurable computing architectures. In FPL '98: Proceedings of the 8th International Workshop on Field-Programmable Logic and Applications, From FPGAs to Computing Paradigm. Springer-Verlag, London, UK, 376.
RYOO, S., RODRIGUES, C. I., BAGHSORKHI, S. S., STONE, S. S., KIRK, D. B., AND HWU, W.-M. W. 2008. Optimization principles and application performance evaluation of a multithreaded gpu using cuda. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming. PPoPP '08. ACM, New York, NY, USA, 73.
SAULSBURY, A., PONG, F., AND NOWATZYK, A. 1996. Missing the memory wall: the case for processor/memory integration. In ISCA '96: Proceedings of the 23rd Annual International Symposium on Computer Architecture. ACM, New York, NY, USA, 90.
SAWITZKI, S. AND SPALLEK, R. G. 1999. A concept for an evaluation framework for reconfigurable systems. In FPL '99: Proceedings of the 9th International Workshop on Field-Programmable Logic and Applications. Springer-Verlag, London, UK, 475.
SHIMPI, A. L. 2008. Intel's silverthorne unveiled: Detailing baby centrino. http://www.anandtech.com/showdoc.aspx?i=3230&p=3 .
SIMA, M., VASSILIADIS, S., COTOFANA, S., VAN EIJNDHOVEN, J. T. J., AND VISSERS, K. A. 2002. Field-programmable custom computing machines - a taxonomy -. In FPL '02: Proceedings of the Reconfigurable Computing Is Going Mainstream, 12th International Conference on Field-Programmable Logic and Applications. Springer-Verlag, London, UK, 79.
SOHI, G. S. AND FRANKLIN, M. 1991. High-bandwidth data memory systems for superscalar processors. SIGOPS Operating Systems Review 25, Special Issue, 53.
STRENSKI, D. 2007. Fpga floating point performance – a pencil and paper evaluation. HPCWire.
Tilera Corp. 2008. TILE64 Processor Product Brief. Tilera Corp.
UNDERWOOD, K. D. AND HEMMERT, K. S. 2004. Closing the gap: Cpu and fpga trends in sustainable floating-point blas performance. In FCCM '04: Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. IEEE Computer Society, Washington, DC, USA, 219.

VOLKOV, V. AND DEMMEL, J. 2008a. Lu, qr and cholesky factorizations using vector capabilities of gpus. http://www.eecs.berkeley.edu/Pubs/TechRpts/2008/EECS-2008-49.html .
VOLKOV, V. AND DEMMEL, J. W. Nov. 15-21, 2008b. Benchmarking gpus to tune dense linear algebra. High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for.
WANG, D. T. 2005. Isscc 2005: the cell microprocessor. http://www.realworldtech.com/page.cfm?ArticleID=RWT021005084318&p=2 .
WILLIAMS, J., GEORGE, A., RICHARDSON, J., GOSRANI, K., MASSIE, C., AND LAM, H. 2011. Characterization of fixed and reconfigurable multi-core devices for application acceleration. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 3, 4.
WILLIAMS, J., GEORGE, A., RICHARDSON, J., GOSRANI, K., AND SURESH, S. July 7-10, 2008a. Computational density of fixed and reconfigurable multi-core devices for application acceleration. Proc. of Reconfigurable Systems Summer Institute 2008 (RSSI).
WILLIAMS, J., GEORGE, A., RICHARDSON, J., GOSRANI, K., AND SURESH, S. Sep. 23-25, 2008b. Fixed and reconfigurable multi-core device characterization for hpec. Proc. of High-Performance Embedded Computing Workshop (HPEC).
WULF, W. A. AND MCKEE, S. A. 1995. Hitting the memory wall: implications of the obvious. SIGARCH Computer Architecture News 23, 1, 20.
X-bit Laboratories 2006. Amd's next generation microarchitecture preview: from k8 to k8l. http://www.xbitlabs.com/articles/cpu/display/amd-k8l.html .
X-bit Laboratories 2007. Amd k10 micro-architecture. http://www.xbitlabs.com/articles/cpu/display/amd-k10.html .
Xilinx, Inc. 2007. Virtex-4 Family Overview. Xilinx, Inc.
Xilinx, Inc. 2008a. Virtex-5 Family Overview. Xilinx, Inc.
Xilinx, Inc. 2008b. Virtex-6 Family Overview. Xilinx, Inc.

BIOGRAPHICAL SKETCH

Justin Richardson was a PhD student from Gainesville, Florida, with a major in electrical and computer engineering. During his tenure at the University of Florida, he received a Bachelor of Science in electrical engineering with a minor in mathematics in the spring of 2007. He graduated magna cum laude and was a member of the Phi Kappa Phi and Golden Key International honour societies. Justin also received a Master of Science in electrical and computer engineering in the fall of 2008 and a Master of Science in management from the Hough Graduate School of Business at the University of Florida.