
Intermediate Fabrics

Permanent Link: http://ufdc.ufl.edu/UFE0045513/00001

Material Information

Title: Intermediate Fabrics: Low-Overhead Coarse-Grained Virtual Reconfigurable Fabric to Enable Fast Place and Route
Physical Description: 1 online resource (59 p.)
Language: english
Creator: Landy, Aaron M
Publisher: University of Florida
Place of Publication: Gainesville, Fla.
Publication Date: 2013

Subjects

Subjects / Keywords: fpga -- reconfigurable -- routing
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre: Electrical and Computer Engineering thesis, M.S.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract: Field-programmable gate arrays (FPGAs) have been widely shown to have significant performance and power advantages compared to microprocessors and graphics-processing units (GPUs), but remain a niche technology due in part to productivity challenges. Although such challenges have numerous causes, previous work has shown two significant contributing factors: 1) prohibitive place-and-route times preventing mainstream design methodologies, and 2) limited application portability preventing design reuse. Virtual reconfigurable architectures, referred to as intermediate fabrics (IFs), were recently introduced as a potential solution to these problems, providing 100x-1000x place-and-route speedup, while also enabling application portability across potentially any physical FPGA. However, one significant limitation of existing intermediate fabrics is area overhead incurred from virtualized interconnect resources. In this work, we approach this problem through complementary top-down and bottom-up approaches, seeking to reduce both the number and size of multiplexers that comprise the interconnect. First we perform design-space exploration of virtual interconnect architectures and introduce an optimized virtual interconnect that reduces area overhead by 48% to 54% compared to previous work, while also improving clock frequencies by 24% with a modest routability overhead of 16%. We also extend "constant folding" used by traditional logic optimization to support pseudo-constants, which are values that change with low frequency. We present a method of pseudo-constant logic optimization based on dynamically reconfigurable capabilities of FPGAs, which optimizes logic for different pseudo-constant values and then reconfigures the logic whenever the pseudo-constant changes. We show this optimization achieves up to a 1.25x increase in functional density for multiplexers.
General Note: In the series University of Florida Digital Collections.
General Note: Includes vita.
Bibliography: Includes bibliographical references.
Source of Description: Description based on online resource; title from PDF title page.
Source of Description: This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility: by Aaron M Landy.
Thesis: Thesis (M.S.)--University of Florida, 2013.
Local: Adviser: Stitt, Greg M.

Record Information

Source Institution: UFRGP
Rights Management: Applicable rights reserved.
Classification: lcc - LD1780 2013
System ID: UFE0045513:00001




INTERMEDIATE FABRICS: LOW-OVERHEAD COARSE-GRAINED VIRTUAL RECONFIGURABLE FABRICS TO ENABLE FAST PLACE AND ROUTE

By

AARON LANDY

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2013

© 2013 Aaron Landy

ACKNOWLEDGMENTS

I thank the chair and members of my supervisory committee for their mentoring and time, the University of Florida Graduate School, the National Science Foundation, and the NSF Center for High Performance Reconfigurable Computing (CHREC) for their generous support. I thank my parents for their many years of loving encouragement, and I thank Elyse for supporting my goals and dreams.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
2 RELATED WORK
    2.1 Placement and Routing
    2.2 Overlay Networks
    2.3 Constant Propagation
    2.4 Intermediate Fabrics
3 INTERCONNECT ENHANCEMENTS
    3.1 Intermediate Fabric Architecture
        3.1.1 Overview
        3.1.2 Previous Interconnect Architecture
    3.2 Optimized Interconnect
    3.3 Experiments
        3.3.1 Experimental Setup
        3.3.2 Tool Flow
        3.3.3 Routability Metric
        3.3.4 Interconnect Evaluation
        3.3.5 Interconnect Comparison for Uniform Intermediate Fabrics
        3.3.6 Interconnect Comparison for Specialized Intermediate Fabrics
4 PSEUDO-CONSTANT LOGIC OPTIMIZATION
    4.1 Pseudo-Constant Design Process
        4.1.1 Pseudo-Constant Identification
        4.1.2 Pseudo-Constant Technology Mapping
        4.1.3 Pseudo-Constant Bitfile Creation
        4.1.4 Pseudo-Constant Invalidation Detection
    4.2 Technology Mapping
        4.2.1 Pseudo-Constant Primitives for Xilinx Virtex 5
            4.2.1.1 Distributed RAM
            4.2.1.2 Shift Register
        4.2.2 Architectural Extensions
    4.3 Experiments
            4.3.0.1 32-bit Full Adder
            4.3.0.2 Multiplexer
        4.3.1 32-bit Comparator
        4.3.2 Functional Density
5 CONCLUSIONS

REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

3-1 A comparison between the presented virtual interconnect and previous uniform virtual interconnect
3-2 Comparison of Intermediate Fabric Overhead

LIST OF FIGURES

3-1 Overview of an intermediate fabrics implementation
3-2 Previous intermediate fabric interconnect architecture
3-3 An optimized virtual-track implementation
3-4 Layout of intermediate fabric using optimized interconnect
3-5 Switch box topologies
3-6 Virtex 4 LX100 multiplexer LUT usage
4-1 A comparison of constant propagation
4-2 Functional architecture of a Xilinx Virtex 5 LUT
4-3 A Xilinx Virtex 5 SLICEM configured as distributed RAM
4-4 A modified Virtex 5 slice
4-5 A pseudo-constant adder
4-6 Comparison of adder LUT utilization
4-7 Comparison of multiplexer LUT utilization
4-8 Pseudo-constant comparator
4-9 Functional density of pseudo-constant adders
4-10 Functional density of pseudo-constant multiplexers

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

INTERMEDIATE FABRICS: LOW-OVERHEAD COARSE-GRAINED VIRTUAL RECONFIGURABLE FABRICS TO ENABLE FAST PLACE AND ROUTE

By Aaron Landy

May 2013

Chair: Greg Stitt
Major: Electrical and Computer Engineering

Field-programmable gate arrays (FPGAs) have been widely shown to have significant performance and power advantages compared to microprocessors and graphics-processing units (GPUs), but remain a niche technology due in part to productivity challenges. Although such challenges have numerous causes, previous work has shown two significant contributing factors: 1) prohibitive place-and-route times preventing mainstream design methodologies, and 2) limited application portability preventing design reuse. Virtual reconfigurable architectures, referred to as intermediate fabrics (IFs), were recently introduced as a potential solution to these problems, providing 100x-1000x place-and-route speedup, while also enabling application portability across potentially any physical FPGA. However, one significant limitation of existing intermediate fabrics is area overhead incurred from virtualized interconnect resources.

In this work, we approach this problem through complementary top-down and bottom-up approaches, seeking to reduce both the number and size of multiplexers that comprise the interconnect. First we perform design-space exploration of virtual interconnect architectures and introduce an optimized virtual interconnect that reduces area overhead by 48% to 54% compared to previous work, while also improving clock frequencies by 24% with a modest routability overhead of 16%. We also extend "constant folding" used by traditional logic optimization to support pseudo-constants, which are values that change with low frequency. We present a method of pseudo-constant logic optimization based on dynamically reconfigurable capabilities of FPGAs, which optimizes logic for different pseudo-constant values and then reconfigures the logic whenever the pseudo-constant changes. We show this optimization achieves up to a 1.25x increase in functional density for multiplexers.

CHAPTER 1
INTRODUCTION

Many studies in reconfigurable computing have shown that FPGAs enable orders of magnitude performance, power, and size advantages over traditional microprocessors and graphics-processing units for applications in many fields of both high-performance and embedded computing. Despite these advantages, many mainstream designers do not use FPGAs due to the high complexity, poor productivity, and lack of portability of modern FPGA design methodologies.

Advances in high-level synthesis promise to enable FPGA compilation from high-level heterogeneous processing languages, such as OpenCL and CUDA. While compiler improvements help free designers from the low-level complexity of the traditional ASIC-prototyping-based FPGA design flow, these improvements do not address the extremely long compilation times needed to perform full-detail placement and routing of an FPGA design. Despite ongoing research into full-detail place and route (PAR) speedup, PAR can last hours for small designs and even days for large designs [11]. Additionally, once a design has been compiled for a specific FPGA, the resulting bitfile cannot be used on any other type of FPGA, even within the same family.

Previous work [11][42] has presented intermediate fabrics (IFs) as a potential solution to these problems, providing orders of magnitude PAR speedup and application portability across physical devices. Intermediate fabrics are coarse-grained, application-specialized, virtual reconfigurable fabrics implemented on off-the-shelf FPGAs. By abstracting fine-grained resources such as individual FPGA lookup tables (LUTs), registers, and fixed-hardware arithmetic units into coarse-grained functional operators and logic cores, intermediate fabrics enable quick mapping of high-level operations onto FPGA hardware.

Although intermediate fabrics provide significant productivity improvements, previous fabric implementations have limited applicability due to area overhead incurred by the virtual interconnect, which prohibits many usage cases. Although this overhead can be reduced via specialization [11], previous intermediate fabrics can still use 2.5x the area of a circuit directly implemented on a physical FPGA [42].

In this paper, we seek to address and reduce virtual interconnect area overhead. We examine the virtual interconnect architecture employed by previous intermediate fabrics studies, considering tradeoffs between area overhead, clock overhead, place-and-route time, routing flexibility, bitfile size, and reconfiguration time, among others. By redesigning the local connectivity configuration of the interconnect, we identify an optimized alternative architecture that reduces LUT requirements by 48%-54% and flip-flop requirements by 46%-59%, while improving clock frequencies by an average of 24%. To achieve these improvements, the new interconnect has a routability overhead of 16%, which could be addressed by sacrificing a small amount of area savings to include more virtual routing resources.

Additionally, we explore a complementary approach to reducing interconnect area overhead by reducing the area consumed by each of the many multiplexers that compose the interconnect. We introduce pseudo-constant logic optimization, which is conceptually similar to traditional constant folding [19], widely used in static logic optimization. However, unlike traditional constant folding, pseudo-constant logic optimization exploits FPGA lookup-table (LUT) reconfigurability to dynamically modify the synthesized logic, allowing for infrequent changes in the pseudo-constant value. We show that for common types of logic, such as multiplexers, pseudo-constant logic optimizations can achieve area savings from 27%-50% on Xilinx Virtex 5 FPGAs. Additionally, we show that pseudo-constant optimized multiplexers match the functional density of traditional synthesis in as few as 128 operations per invalidation, and approach up to 1.25x greater functional density for infrequent invalidations.

In this article, we describe intermediate fabrics, examine related work in FPGA virtualization and fast place and route, discuss the sources of overhead in intermediate fabrics, and present a novel low-overhead interconnect architecture. Additionally, we discuss pseudo-constant based logic optimization and its application to reduce intermediate fabric interconnect overhead.

CHAPTER 2
RELATED WORK

This chapter examines previous work related to the optimizations presented in the following chapters, and highlights key similarities and differences. Specifically, we discuss those works relating to FPGA placement and routing, FPGA overlay networks, and intermediate fabrics.

2.1 Placement and Routing

Much previous work has focused on fast place-and-route using both coarse-grained architectures [8][27][41][47] and specialized algorithms [4][30][37], in some cases combined with a place-and-route-amenable fabric. Intermediate fabrics are complementary to these approaches and could potentially use these algorithms for place-and-route.

2.2 Overlay Networks

Numerous previous studies have focused on overlay networks, which are conceptually similar to intermediate fabrics and implement a virtual network atop a physical FPGA. For example, Kapre et al. [26] compared tradeoffs between packet-switched and time-multiplexed overlay networks implemented on an FPGA. Intermediate fabrics differ from these overlay networks by providing a virtual interconnect capable of implementing register-transfer-level (RTL) circuits at different levels of granularity as opposed to arbitrary communication between abstract processing elements. By this definition, an intermediate fabric is an overlay network, but an overlay network is not necessarily an intermediate fabric.

Previous work has also investigated fine-grained overlay networks for virtual FPGAs [7][33]. Virtual FPGAs are conceptually similar to intermediate fabrics, which also provide virtual reconfigurable fabrics for implementing digital circuits. However, overlays for virtual FPGAs closely imitate fine-grained FPGA architectures [7][33] (e.g., LUTs as resources). Intermediate fabrics can also implement LUT-based architectures, but instead are usually specialized for specific domains and even individual applications using a resource granularity uncommon to FPGAs, which provides fast place-and-route. Previous virtual FPGAs can be viewed as specific, low-level instances of an intermediate fabric. One key difference is that because intermediate fabrics can be specialized, interconnect requirements differ from fine-grained virtual FPGAs, and also vary between specializations.

Numerous previous studies have introduced reconfigurable, coarse-grained physical devices for different application domains [5][10][15][22][24][34][40][41][43]. Although those devices provide good performance for their targeted applications, the disadvantage of such an approach is that specialized physical devices generally have high costs due to limited economy of scale. Intermediate fabrics can provide the same architectures implemented virtually atop common commercial-off-the-shelf FPGAs, which has significant cost advantages and an acceptable overhead for some use cases. Several studies have also considered virtual coarse-grained architectures for specific domains [41][45]. These approaches are complementary and represent individual instances of intermediate fabrics.

2.3 Constant Propagation

Many studies have shown that constant propagation can increase functional density and performance [12][13][19][20][21][23][31]. While those techniques are effective, synthesis must be able to statically identify constants. The presented work enables these optimizations in cases where a constant value is not known at compile time, and also when a value changes with low frequency.

Previous studies have demonstrated a concept similar to pseudo-constants by using partial reconfiguration for run-time logic minimization [17][23][31][32][44][46]. Previous work also showed that partial reconfiguration can have prohibitive reconfiguration times, implementation complexity, and limitations on reconfiguration granularity [14][35][46]. This past work examined trade-offs between area and reconfiguration time when using run-time logic optimization, and included the development of a functional density metric to quantify the advantages. We extend past work by reducing reconfiguration times and implementation complexity via the LUT-based RAM primitives provided by most FPGAs.

Prior studies have also used LUT RAM as dynamically reconfigurable logic. The FPGA overlay network presented by Brant et al. [7] used LUT RAM to implement virtual LUTs in a virtual FPGA fabric. That work also decreased multiplexer resources via an approach similar to what we describe. We expand upon that work by generalizing pseudo-constant logic optimization for potentially any logic function.

2.4 Intermediate Fabrics

In [11], Coole and Stitt introduce intermediate fabrics as a possible solution to exceedingly long FPGA place-and-route times. They also propose fabric specialization to address area overhead concerns. Using specialization, fabric overhead can be reduced by including in the fabric only those resources essential to implement a given application. This represents the lowest overhead achievable by early intermediate fabrics, but pays a significant penalty in fabric reusability. The optimizations presented in this work offer alternative approaches to overhead reduction without sacrificing fabric reusability.

CHAPTER 3
INTERCONNECT ENHANCEMENTS

This chapter discusses enhancements made to the intermediate fabric interconnect architecture to reduce area overhead while minimizing routability tradeoffs. The chapter first provides an overview of the intermediate fabric architecture and the interconnect used by initial intermediate fabric studies. It then details the optimized interconnect style and finally compares area overhead and routability between the original and optimized interconnect.

3.1 Intermediate Fabric Architecture

This section overviews intermediate fabrics in Section 3.1.1 and then discusses the virtual interconnect architecture used by previous intermediate fabrics in Section 3.1.2.

3.1.1 Overview

As shown in Figure 3-1, an intermediate fabric is a virtual reconfigurable device, implemented atop a physical FPGA, which implements circuits from HDL or high-level code via synthesis, placement, and routing. Intermediate fabrics, like overlay networks [26] and virtual FPGAs [7][33], provide a fabric capable of implementing numerous circuits. However, unlike those techniques, intermediate fabrics tend to be specialized for the requirements of a specific set of applications, while providing enough routability to support similar applications or different functions in the same domain.

Figure 3-1. Intermediate fabrics (IFs) are virtual application-specialized fabrics implemented atop FPGAs that hide physical device complexity to achieve fast place-and-route and application portability.

The example in Figure 3-1 illustrates an intermediate fabric specialized for a frequency-domain signal-processing circuit, and provides corresponding floating-point resources for FFTs and arithmetic computation. When directly compiling this circuit to an FPGA, place-and-route is likely to require hours due to the compiler decomposing the circuit into tens of thousands of LUTs. However, when targeting the intermediate fabric, the compiler decomposes the circuit into several coarse-grained resources, which reduces the place-and-route input size by orders of magnitude and provides 100x to 1000x place-and-route speedup [11][42].

A complete discussion of intermediate fabric usage models and their implementations is outside the scope of this paper; we instead summarize two basic models. The library model provides a large, pre-implemented set of intermediate fabrics that a designer or synthesis tool can choose from based on the requirements of the application. For the example in Figure 3-1, a designer or tool could choose the selected fabric from one of many fabrics that provide different fabric sizes, different combinations of resources, different precisions, etc. An alternative is the synthesis model, during which the synthesis tool creates a specialized fabric based on the application requirements. The advantage of the synthesis model is reduced area overhead. However, the disadvantage is that the application designer must wait for place-and-route to implement the intermediate fabric on the physical FPGA. Although such place-and-route may require hours, the compilation time is amortized over the lifetime of the fabric because the physical place-and-route is only needed once.
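
The amortization argument for the synthesis model can be made concrete with a small calculation. The sketch below is illustrative only: the function name and every timing value are assumptions chosen for the example, not measurements from this work. Only the roughly 100x place-and-route speedup is taken from the text above.

```python
def break_even_compiles(fabric_par_min, direct_par_min, if_par_min):
    """Return the smallest number of application compiles for which
    (one-time fabric place-and-route + fast IF place-and-route per compile)
    is cheaper than direct FPGA place-and-route on every compile.

    All arguments are times in minutes and are hypothetical inputs.
    """
    if if_par_min >= direct_par_min:
        raise ValueError("IF place-and-route must be faster than direct place-and-route")
    n = 1
    # Cost with a fabric: fabric_par_min + n * if_par_min
    # Cost without a fabric: n * direct_par_min
    while fabric_par_min + n * if_par_min >= n * direct_par_min:
        n += 1
    return n


# Assumed values: the fabric takes 180 minutes to place and route once,
# a direct FPGA compile takes 120 minutes, and the IF compile is 100x faster.
direct = 120.0
if_compile = direct / 100.0
compiles_needed = break_even_compiles(180.0, direct, if_compile)
print(compiles_needed)
```

Under these assumed values the one-time fabric cost is recovered by the second compile; with a smaller speedup or a cheaper direct compile, the break-even point moves later, which is why the library model avoids the one-time cost at the price of less specialization.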

PAGE 18

Figure3-2. Previousintermediatefabricinterconnectarchitecture,where(b)routingtracksbetweenresourceswereimplementedas(c)multiplexersbasedonthenumberoftracksources. 3.1.2PreviousInterconnectArchitectureFigure 3-2 (a)illustratesthebasicisland-stylefabricusedinpreviousintermediatefabrics[ 11 ][ 42 ].SuchafabriccloselyimitatesthewidelystudiedstructureofphysicalFPGAsconsistingofswitchboxes,connectionboxes,andbidirectionalroutingtracks,butreplacesLUTswithapplication-specicresources(e.g.,oating-pointunits,FFTs)referredtoascomputationalunits(CUs).Notethatbecauseintermediatefabricscanbespecialized,theCUsandvirtualroutingtrackscanpotentiallybeanywidth.Forexample,afabricwithoating-pointCUsmightprovide32-bitroutingtracks.Intermediatefabricsalsocontainspecializedregionsforcontrolandmemoryoperations.However,inthispaper,wefocusontheareasofacircuitthatcontributethemosttolongplace-and-route,whichformanyapplicationsarecoarse-grained,pipelineddatapathoperations(e.g.,FFTs).ThemainlimitationofpreviousintermediatefabricsisareaoverheadincurredbyimplementingthevirtualfabricatopaphysicalFPGA(i.e.,synthesizedVHDLforthevirtualfabric).Suchoverheadresultsfromseveralsources.Thelargestsourceofoverheadcomesfrommuxlogicinthevirtualinterconnect.Previousintermediatefabricsusevirtualbidirectionalroutingtracks[ 11 ][ 42 ],whoseregister-transfer-level(RTL)implementationisshowninFigure 3-2 (b)and(c).Foranm-bittrackwithnpossiblesources,theRTLimplementationusesanm-bit,n:1mux,insomecaseswith 18

PAGE 19

aregisterorlatchonthemuxoutput.Forexample,Figure 3-2 (b)showsacommoncongurationofabidirectionaltrackwithfoursources:twoswitchboxesandtwoCUs,withthecorrespondingRTLimplementationshowninFigure 3-2 (c)asa4:1mux,withaselectvaluestoredina2-bitvirtualcongurationregister.Consideringthelargenumberoftracksfoundinmostfabrics,thismux-basedimplementationofvirtualtracksusesnumerousLUTresourcesinthephysicalFPGA,andisresponsibleforover50%ofthetotalLUTusageinmanyintermediatefabrics.Similarly,virtualswitchboxesandconnectionboxesimplementvarioustopologiesusingadditionalmuxesbetweenvirtualtracks.TheexactpercentageofLUTusageforswitch/connectionboxesvariesdependingontheboxtopologyandexibility,butisalsoasignicantcontributortoareaoverhead.Whencombiningallinterconnectresources(tracks,switchboxes,andconnectionboxes),wedeterminedthatthevirtualinterconnectiscommonlyresponsibleforover90%ofLUTrequirements.Inadditiontothemuxoverhead,intermediatefabricsalsorequirephysicalip-opresourcesforanystorage.Virtualregistersaretechnicallynotoverheadbecausesynthesistoolscandirectlyimplementvirtualregistersonphysicalip-opsintheFPGA.However,virtualcongurationip-opsandanypipelinedinterconnectisoverheadbecausetheresultingphysicalip-opswouldnotbeusedbyacircuitdirectlytargetingtheFPGA. 3.2OptimizedInterconnectBasedonthesignicantoverheadcausedbythevirtualinterconnectdescribedintheprevioussection,inthispaperwefocusonvirtualinterconnectoptimizationstoreducemuxes,withthegoalofretaininghighroutability.Duringaninitialattemptatoptimizingvirtualtracks,weobservedthattheRTLimplementationshowninFigure 3-2 (c)containssomeredundancythatcouldpotentiallyberemoved.Specically,aphysicaltrackwouldneverhaveacommonsourceandsink,whichresultsinanunnecessaryinputtothemux.Forexample,aphysicalFPGAwouldneverrouteasignaloutofaswitchboxandbackintothesameswitchboxusingthesametrack.Therefore, 19

PAGE 20

Figure3-3. (a)Anoptimizedvirtual-trackimplementationtoreduceroutingredundancy,whicheliminatesmuxeswhen(b)trackshavetwosources. wecaneliminatetheredundantroutesandreplacethen:1muxwithndifferent,n-1:1muxes,whereeachmuxdenesoneofthepossibletrackdestinations.Figure 3-3 (a)showsanexamplefortheprevioustrackinFigure 3-2 (c),wheren=4.Despiteeliminatingroutingredundancy,suchanapproachdoesnotsaveareabecauseinmostcases,nseparaten-1:1muxesrequiremoreLUTsthanasinglen:1mux.However,wehaveobservedthereisaspecialcasewherethetrackimplementationinFigure 3-3 (a)canachievereducedarea.Foranyvirtualtrackwithexactlytwopossiblesources,thisimplementationsimpliesintotwodirectionalwiresasshowninFigure 3-3 (b).Inotherwords,a2-sourcevirtualtrackrequirestwoseparate1:1muxes,buta1:1muxisjustawire.Therefore,byusingonly2-sourcevirtualtracksthroughouttheentireintermediatefabric,wecanpotentiallyreplaceallmuxlogicandwiresinFigure 3-3 (a)withtwowiresforeachtrack.Suchanoptimizationhassignicantpotentialduetovirtualtrackscontributingtoover50%ofareaoverhead.Furthermore,thisoptimizationsavesasignicantamountofwirespertrack,whilesimultaneouslyimprovingroutabilitybyenablingroutingintwodirections.Anadditionaladvantage 20

PAGE 21

Figure3-4. LayoutofintermediatefabricusingoptimizedinterconnectwithCUI/Oconnecteddirectlytoadjacentswitchboxes. isthatbyreducingmuxes,thefabricrequireslesscongurationregisterstostorethecorrespondingselectvalues,whichreducesip-opoverheadwhilealsoimprovingrecongurationtimes.Althoughusing2-sourcevirtualtracksreducesarea,replacingthe3-and4-sourcetracksusedinpreviousfabricsisasignicantchallenge.Inatraditionalisland-stylearchitecture,atracktypicallyhas3-4possiblesources:2switchboxesand1-2CUs.Ifweeliminatetheswitchboxconnections,thetrackcanonlyroutebetweenadjacentresources,whichsignicantlylimitsroutability.Similarly,ifweremovetheCUconnections,thenthereisnowayforroutingtoreachCUs.Toaddressthisproblem,weconsideredseveralsignicantmodicationstotraditionalfabrics.First,westartedwith2-sourcetracksbetweenadjacentswitchboxes,witheachswitchboxasapossiblesource.However,thatinterconnectcongurationdoesnotprovideamechanismforconnectingCUstotheroutingtracks.Wecouldhave 21

PAGE 22

Figure3-5. Switchboxtopologiesfor(a)previousintermediatefabricinterconnectand(b)thepresentedinterconnectwithdiagonalCUchannels. addedconnectionboxes,butthatwouldviolatethe2-sourcerestriction.Therefore,weconsideredaddingadditionalchannelstoeachswitchboxwithdirectconnectionstotheCUI/O.TheoverallfabriclayoutforthisoptimizedvirtualinterconnectisshowninFigure 3-4 .Asillustrated,inthisunconventionalfabric,novirtualtrackhasmorethan2sources,whicheliminatesallmuxespreviouslyneededtoimplementtracks.Onechallengeindesigningthisoptimizedinterconnectisthatalthoughweeliminatedtrackmuxes,weaddedadditionalmuxesinsideoftheswitchboxestosupporttheadditionalCUchannels.Unlesstheswitchboxesaddfewermuxesthanweremovedfromthetracks,thisoptimizationdoesnotreducearea.ToensurethattheoptimizedinterconnectreducesLUTusage,weexploittheinternalcharacteristicsoftheswitchboxtohandletheadditionalroutingrequirementswithminimallogic.Previousintermediatefabricswitchboxesuseaplanartopology,whereeachoutputfromtheswitchboxusesa3:1muxthatselectsaninputfromoneofthethreeotherchannels,asshowninFigure 3-5 (a).Forthenewinterconnect,thesemultiplexorscouldpotentiallyrequirefourmoreinputstohandleroutingofthefouradjacentCUs,whichwouldsignicantlyoutweightracksavings.However,wecanexploitthefactthatincreasingmuxinputs 22

PAGE 23

Figure3-6. Virtex4LX100multiplexerLUTusageforvaryingMUXinputcounts.Theplateausprovideopportunitiesforswitchboxestoaddmoreconnectionswithoutanareapenalty. doesnotalwaysincreaseLUTrequirements.AsshowninFigure 3-6 ,FPGAshavedifferentareaplateauswhereadditionalmuxinputshavethesameLUTrequirementsaslesserinputs(e.g.,3-4inputsand6-8inputs).TheoptimizedinterconnectexploitsthischaracteristicbyaddingCUI/Oconnectionstothemuxesuntilreachingthelargestinputsizeofaplateau,whichmaximizesroutabilitywithoutanyincreaseinarea.Interestingly,thepresentedinterconnectcanbespecializedfordifferentphysicalFPGAs,whichhavedifferentmuxplateausduetovaryingLUTsizes.Althoughtheoptimizedinterconnectswitchboxesarenotrestrictedtoaspecictopology,wechooseaplanar-liketopologyforevaluationandtargetthemuxplateausfor4-inputmuxes.Therefore,theswitchboxesincrease3-inputmuxesto4inputswhereverpossible.Theswitchboxesalsouse5-inputmuxes,butdonotincreasetheinputsto6ormore,despitetheplateaubetween6and8inputs.Increasingthemuxinputsto8mayimproveroutabilitywithadditionaloverhead,butwedefersuchanalysistofuturework.AnexampletopologyisshowninFigure 3-5 (b),wheretheswitchboxprovidesaplanartopologyforthenorth,east,south,andwestchannels,whichcorrespondto 23

virtual tracks. In this example, the CU channels (southeast, southwest, northwest, northeast) connect to the other channels in customizable ways. Note that we are not proposing a specific switchbox topology for the optimized interconnect. Instead, like any intermediate fabric, we expect the topology to change based on application and routability requirements. For the applications we evaluated, using a highly directional fabric was beneficial due to pipelined, feed-forward datapaths. However, the switchbox can easily be customized for other topologies. In the experiments, we use a fabric generation tool that allows specification of the exact switchbox topology in a fabric description file.

3.3 Experiments

In this section, we compare intermediate fabrics using the presented virtual interconnect with previous work [11][42]. Section 3.3.1 describes the experimental setup. Section 3.3.5 compares area requirements, clock speedups, and routability of both approaches for unspecialized, uniform fabrics. Section 3.3.6 presents similar experiments for application-specialized fabrics.

3.3.1 Experimental Setup

This section describes the intermediate fabric toolflow used for the experiments (Section 3.3.2), along with the routability measurements (Section 3.3.3), and the tools used for evaluating the different interconnects (Section 3.3.4).

3.3.2 Toolflow

To implement applications on the intermediate fabrics, we manually synthesize circuits by creating technology-mapped netlists. We plan to convert open-source synthesis tools to target intermediate fabrics, including OpenCL high-level synthesis, but such a project is outside the scope of this paper. For place-and-route, we use the algorithm previously described in [11] to ensure that the comparison between the new and previous interconnect is not unfairly skewed by improved placement. In fact, the place-and-route results for the new interconnect are likely pessimistic because we

did not modify the placer cost function for the new interconnect. The place-and-route algorithm is a variation of VPR [6], and uses simulated annealing for placement with a cost function that minimizes bounding box size. Routing uses the well-known PathFinder [36] negotiated-congestion algorithm.

Both the new and previous interconnect have varying amounts of pipelining in switchboxes or on tracks. Instead of using pipelined routing algorithms (e.g., [16]), both approaches use realignment registers in front of each CU to balance the routing delays of all inputs. Because this pipelining strategy only works for pipelined datapaths that can be retimed without affecting correctness, we limit the evaluation to fabrics with coarse-grained resources commonly needed by datapaths in signal processing.

To configure the intermediate fabric for different applications, the place-and-route tool outputs a configuration bitfile that we store in a block RAM on the targeted FPGA. Each intermediate fabric includes a programmer which loads the bitfile from the block RAM by shifting bits into virtual configuration registers that control the CUs and virtual switchboxes.

3.3.3 Routability Metric

To fairly compare tradeoffs between interconnects, it is necessary to measure routability. To perform these measurements for a given intermediate fabric, we place-and-route a large number of randomly generated netlists of varying sizes, and determine the routability score of the interconnect based on the percentage of netlists that route successfully. Due to the fast place-and-route time for intermediate fabrics, we were able to test 1,000 netlists for each fabric to obtain a high-precision metric.

The random netlist generator creates directed acyclic graph structures representative of pipelined datapaths. Based on the CU composition of each individual fabric tested, the generator creates a random number of datapath stages, each consisting of a random number of technology-mapped cells, and creates random connections between each stage. Each stage contains at minimum enough cells, and enough connections are made between stages, such that each cell has at least one path to the next stage. This

method results in netlists containing one or more disjoint pipelines of one or more stages each.

3.3.4 Interconnect Evaluation

To evaluate different interconnects, we developed a tool capable of generating VHDL for intermediate fabrics using the new interconnect. The tool takes as input a fabric-description file that defines the parameters of the fabric, such as size, aspect ratio, bit-width, and the makeup of the fabric, including CU composition and row and column channel descriptions. Channel descriptions include number of tracks, direction of each track, and switchbox topology.

To obtain physical FPGA utilization and timing results, we synthesized the intermediate fabric VHDL using Xilinx ISE 10.1, Synopsys Synplify Pro 2012, and Altera Quartus II 10.1, depending on the targeted FPGA. To evaluate the effects of FPGA variation on each virtual interconnect, we implemented intermediate fabrics on Xilinx Virtex 4 LX100 and LX200, Xilinx Virtex 5 LX330, and Altera Stratix IV E530 FPGAs. The intermediate fabric HDL synthesized for each test case uses the fixed-logic multipliers available on each physical device for all CUs (Xilinx DSP48s and Altera 18x18 multipliers); therefore, all device utilization represents the LUT and flip-flop overhead of implementing the target application via an intermediate fabric rather than a direct HDL implementation.

3.3.5 Interconnect Comparison for Uniform Intermediate Fabrics

In this section we compare area, routability, and maximum clock speed of intermediate fabrics using the presented interconnect to intermediate fabrics using interconnect previously presented in [11] and [42]. We evaluate each interconnect using different fabric sizes, implemented on several different physical FPGAs. Although intermediate fabrics can be specialized to an application, in this section we evaluate fabrics independently of targeted applications by using a uniform fabric consisting of 16-bit DSP CUs with various dimensions (e.g., 5x5 = 5 rows and 5 columns of I/O and

CUs).

Table 3-1. A comparison between the presented virtual interconnect (New) and previous uniform virtual interconnect (Prev).

                          LUT Usage           Flip-Flop Usage     Routability         Clock
FPGA             Size     Prev   New   Save   Prev   New   Save   Prev   New   Loss   Prev      New       Speedup
Xilinx V4 LX200  3x3      2%     1%    71%    1%     1%    72%    100%   78%   22%    173 MHz   175 MHz   1%
                 4x4      5%     2%    64%    1%     1%    65%    100%   95%   5%     163 MHz   172 MHz   6%
                 5x5      8%     3%    60%    2%     1%    62%    100%   87%   13%    152 MHz   172 MHz   13%
                 6x6      12%    5%    55%    3%     1%    59%    100%   85%   15%    144 MHz   171 MHz   19%
                 7x7      17%    8%    53%    5%     2%    57%    100%   84%   16%    123 MHz   170 MHz   38%
                 8x8      23%    11%   52%    6%     3%    56%    100%   85%   16%    125 MHz   170 MHz   36%
                 9x9      30%    15%   51%    8%     4%    55%    99%    84%   16%    115 MHz   168 MHz   46%
                 12x8     36%    20%   46%    10%    5%    55%    99%    79%   20%    113 MHz   160 MHz   42%
Xilinx V5 LX330  13x13    37%    20%   46%    18%    9%    53%    98%    80%   18%    125 MHz   162 MHz   30%
                 14x14    44%    24%   46%    21%    10%   52%    94%    83%   12%    131 MHz   146 MHz   11%
Altera S4 E530   15x15    n/a*   14%   n/a*   n/a*   18%   n/a*   90%    71%   21%    n/a*      175 MHz   n/a*
                 16x16    n/a*   16%   n/a*   n/a*   21%   n/a*   90%    70%   22%    n/a*      177 MHz   n/a*
Average                   21%    11%   54%    8%     3%    59%    98%    82%   16%    136 MHz   167 MHz   24%

Table 3-1 compares LUT and flip-flop utilization (as a % of total device resources), routability of 1000 randomly generated netlists, and maximum clock speed for identical intermediate fabrics using the new and previous interconnects. We implemented fabric sizes between 3x3 and 12x8 on a Virtex 4 LX200, where an NxM fabric is composed of one row of M inputs, N-2 rows of M CUs, and one row of M outputs. We evaluated larger fabric sizes of 13x13 and 14x14 on a Virtex 5 LX330, and sizes 15x15 and 16x16 on a large Stratix IV E530. For fabrics using the previous interconnect, we used three 16-bit tracks per channel with specialized connection boxes from [11], as previous work indicated this configuration to be an effective tradeoff between routability and overhead. For fabrics using the new interconnect, we used two 16-bit tracks per row and four tracks per column with the switchbox topology described in Section 3.2 optimized for 4-input muxes.

These results show the LUT and flip-flop utilizations of the new interconnect are significantly less than the previous interconnect, with an average LUT savings of 54% and flip-flop savings of 59% for the fabrics evaluated. Note that we were unable to synthesize the old interconnect on the Stratix IV device. We tried three different versions of Quartus, but the old interconnect would cause a crash during the retiming stage of synthesis. For this reason, we exclude the Stratix IV results from the averages. Additionally, the new interconnect showed significant maximum clock frequency speedup

for larger fabrics. When implemented on the Virtex 4, new interconnect clock speeds decreased only 6.3% between fabrics of size 3x3 to 12x8, whereas the previous interconnect suffered from a 34.7% decrease in clock speed over the same range. Overall, the new interconnect averaged 167 MHz compared to 136 MHz.

The new interconnect did incur a routability penalty, with an average decrease of 16% compared to the previous interconnect. While this overhead is a potential limitation of the new interconnect, especially when applied to a general-purpose fabric, we believe this overhead to be an acceptable tradeoff when compared to the significant area savings provided by the new interconnect. Routability overhead can also be easily compensated for when designing the CU composition of a fabric. Because the placer algorithm used in these experiments is unchanged from that used for the old fabric, it is likely that an appropriately customized placer cost function would significantly improve the routability of the new interconnect. Similarly, fabrics using the new interconnect could account for decreased routability by including many more routing resources while still saving area.

Routability generally decreased with increased fabric size due to the increased difficulty of routing larger netlists. The one exception was the 3x3 fabric with the new interconnect, which had lower routability than the larger fabrics. We identified the source of this problem as limited connections between I/O and CUs for very small fabrics using the new interconnect. Because we expect 3x3 to be an unusually small size for actual usage, this overhead is not a significant limitation.

These results also show decreased LUT overhead savings of only 46% in fabrics implemented on the Virtex 5 device. This smaller improvement is likely due to the different CLB configuration used by that device, with slightly altered mux-area plateau characteristics, whereas the evaluated interconnect was optimized for 4-input muxes. Despite being optimized for a different LUT configuration, the new interconnect still had significant savings.

Flip-flop usage on the Altera device was significantly higher than on both Xilinx devices, which resulted from the Xilinx FPGAs implementing the realignment registers as SRL16

primitives, in contrast to the Altera FPGA, which used flip-flops. As future work, we will investigate optimizations for Altera FPGAs.

One additional advantage of reducing muxes throughout the interconnect is the corresponding elimination of configuration registers to store the select values. Having fewer registers reduces flip-flop usage, as shown in Table 3-1, and also reduces configuration bitfile size, which correspondingly reduces configuration times and block RAM overhead of the fabric. For the examples in this section, the new interconnect improved configuration times by an average of 55% compared to the previous interconnect.

3.3.6 Interconnect Comparison for Specialized Intermediate Fabrics

One advantage of intermediate fabrics is that a designer or tool can specialize the architecture and interconnect for a given domain or even an individual application. In this section, we compare intermediate fabrics using application-specialized interconnect presented in [11] with the new interconnect. To enable a fair comparison, we evaluate the same application circuits from [11] using the same specialized fabrics as previous experiments. Specialization used in the previous experiments included varying fabric sizes and non-uniform interconnects. For the new interconnect, we limit specialization to fabric sizes, making the results pessimistic. For all specialized fabrics, we used the smallest fabric and interconnect that could successfully route the target application netlist. For these experiments, the physical FPGA is a Virtex 4 LX100, which we chose to match the previous experiments.

To perform the comparison, we used the twelve applications from [11], seven of which were implemented using both 16-bit fixed-point arithmetic and 32-bit floating-point arithmetic, indicated with a FXD or FLT suffix respectively. All track widths matched the CU widths. All circuits without a suffix used 16-bit fixed-point CUs.

We briefly summarize the previous applications as follows. Matrix multiply performs the kernel of a matrix multiplication, calculating the inner product of two 8-element vectors using 7 adders and 8 multipliers. FIR implements a 12-tap finite impulse response filter in transpose form with symmetric coefficients

using 11 adders and 12 multipliers. N-body, representing the kernel of an N-body simulation, calculates the gravitational force exerted on a particle due to other particles in two-dimensional space using 13 adders, multipliers, and a divider. Accum monitors a stream, counting the number of times the value is less than a threshold. It is the smallest netlist, consisting of 4 comparators and 3 adders. Normalize normalizes an input stream using 8 multipliers and 8 adders. Bilinear performs bilinear interpolation on an image, requiring 8 multipliers and 3 adders. Floyd-Steinberg performs image dithering using 6 adders and 4 multipliers. Thresholding performs automatic image thresholding using 8 comparators and 14 adders. Sobel uses a 3x3 convolution to perform Sobel edge detection with 2 multipliers and 11 adders. Gaussian blur uses a 5x5 convolution to perform noise reduction using 25 multipliers and 24 adders. Max filter performs a 3x3 sliding-window image filter with 8 comparators. Mean filter similarly calculates the average of a sliding window, which we vary from 3x3 to 7x7, requiring a maximum of 48 adders and 1 multiplier.

Table 3-2 compares the interconnects for each case study. The first major column, Place-and-Route Time, compares place-and-route execution times for an intermediate fabric with the previous interconnect (IF Prev), an intermediate fabric with the new interconnect (IF New), and when synthesizing VHDL for each example directly to the FPGA. The table also shows the resulting place-and-route speedup for the new and previous interconnects. The results show comparable place-and-route times for both the old and new interconnect. However, because the previous interconnect already achieves a place-and-route speedup of 554x compared to an FPGA, the further improvement by the new interconnect provided a 1350x place-and-route speedup. The place-and-route speedup was larger for the floating-point examples due to longer place-and-route times for the physical FPGA. Furthermore, these place-and-route speedups are highly pessimistic because the specialized examples from [11] do not include common board logic such as PCIe and memory controllers. Other studies have shown that including these controllers with tight timing constraints can add up to

20 minutes to FPGA place-and-route time, but have no effect on intermediate fabric place-and-route time [42].

Table 3-2. A comparison between intermediate fabrics (IFs) with the presented virtual interconnect (IF New) and previous application-specialized interconnect (IF Prev).

The second major column in Table 3-2 reports area savings of the new interconnect in terms of FPGA LUTs and flip-flops, along with the routability overhead incurred to achieve these savings. On average, the new interconnect significantly reduced LUT usage by 48% and flip-flop usage by 46%, despite the significant specialization by the previous fabrics. On average, routability slightly improved by 8% with the new interconnect. However, this average is skewed by three outliers, normalize, Gaussian, and mean 7x7, which had very low routability due to significant specialization in the previous fabrics. Excluding these outliers, the new interconnect had a 2% routability overhead. The smaller routability overhead compared to the previous section is due to the specialized versions of the previous interconnect, which used just enough

routing resources to route the targeted application, and therefore lowered general routability.

The final column of Table 3-2 compares the maximum clock speed of the specialized fabrics using both the new and old interconnect. For specialized fabrics, these experiments show a negligible average impact on clock speed, with both interconnects showing an average clock frequency of 186 MHz. However, there was significant variation, as high as 21%, between specialized fabrics. It should be noted that these results are contrary to the results for larger fabrics presented in the previous section, which showed a clear trend of faster clock speeds for larger fabrics using the new interconnect. The smaller clock improvement compared to the previous section is due to the higher specialization of the previous interconnect, as opposed to using a uniform interconnect.
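
As a recap of the routability metric from Section 3.3.3, the following Python sketch generates random feed-forward netlists and scores a router by the percentage it routes. It is a minimal illustration under stated assumptions, not the thesis tooling: the names random_netlist, routability_score, and toy_router are hypothetical, the generator ignores CU composition, and toy_router is a stand-in for the real place-and-route tool.

```python
import random


def random_netlist(max_cells, rng, max_stages=4):
    """Return (stages, edges) for a random feed-forward pipeline netlist.

    stages: list of lists of cell ids, one list per pipeline stage.
    edges:  set of (src, dst) pairs, always from one stage to the next.
    """
    num_stages = rng.randint(1, min(max_stages, max_cells))
    sizes = [1] * num_stages                       # every stage gets a cell
    for _ in range(rng.randint(0, max_cells - num_stages)):
        sizes[rng.randrange(num_stages)] += 1      # spread remaining cells

    stages, next_id = [], 0
    for size in sizes:
        stages.append(list(range(next_id, next_id + size)))
        next_id += size

    edges = set()
    for cur, nxt in zip(stages, stages[1:]):
        for cell in cur:                           # guaranteed path forward
            edges.add((cell, rng.choice(nxt)))
        for _ in range(rng.randint(0, len(cur))):  # optional extra connections
            edges.add((rng.choice(cur), rng.choice(nxt)))
    return stages, edges


def routability_score(netlists, try_route):
    """Percentage of netlists for which try_route(netlist) returns True."""
    routed = sum(1 for netlist in netlists if try_route(netlist))
    return 100.0 * routed / len(netlists)


def toy_router(netlist, edge_capacity=8):
    """Hypothetical stand-in for place-and-route: succeeds if few edges."""
    _, edges = netlist
    return len(edges) <= edge_capacity


rng = random.Random(0)
netlists = [random_netlist(max_cells=9, rng=rng) for _ in range(1000)]
score = routability_score(netlists, toy_router)
```

In an actual experiment, try_route would invoke the fabric's place-and-route algorithm, and max_cells would be derived from the CU composition of the fabric under test.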

CHAPTER 4
PSEUDO-CONSTANT LOGIC OPTIMIZATION

This chapter discusses a bottom-up approach to reducing intermediate fabric interconnect area overhead. Specifically, we seek to reduce the area consumed by each of the many multiplexers that compose the interconnect. The optimizations presented below are complementary to the top-down architectural approach presented in the previous chapter, which seeks to limit the number of multiplexers used in the fabric.

FPGA logic optimization is a widely studied topic, with dozens of existing optimizations that build upon decades of digital-design research [3][7][9][18][25][38][39]. A common strategy involves iteratively propagating constants while performing logic minimization (i.e., constant folding [19]). For example, Figure 4-1(a) shows a 4:1 multiplexer, which a synthesis tool may map to three 4-input lookup tables (LUTs). In some situations, as shown in Figure 4-1(b), a constant may propagate to the mux's select input, which simplifies the logic to a wire.

Unfortunately, constant-based optimizations have limited applicability. For example, circuit designers often avoid constant inputs to enable support for as many use cases as possible [46]. However, we have observed that circuits commonly include signals that exhibit near-constant behavior where the signal value is rarely changed, which we define as pseudo-constant. For example, many signal-processing applications initially set a pseudo-constant convolution kernel, which remains the same for the duration of the application. Alternatively, each frame of a low frame-rate video may also be considered pseudo-constant. These pseudo-constant values are often inputs to common logic components such as adders, multipliers, comparators, and muxes (e.g., [7][29]), which could potentially benefit from constant folding to reduce area and/or increase replication.

We introduce pseudo-constant logic optimization, which is conceptually similar to traditional constant folding, widely used in static logic optimization. However, when a pseudo-constant changes values at runtime, the optimized logic becomes

invalid.

Figure 4-1. A comparison of constant propagation for a multiplexer with (a) a non-constant select, (b) a constant select, and (c) when using pseudo-constant logic optimization for inputs that rarely change.

To prevent these invalidations from affecting correctness, we exploit FPGA lookup-table (LUT) reconfigurability to dynamically modify the logic according to the new pseudo-constant value. Although LUT reconfiguration causes performance overhead, low-frequency invalidations often make this overhead insignificant. In general, higher invalidation frequencies provide various tradeoffs between area savings and performance overhead. This chapter discusses the design process, implementation, and evaluation of pseudo-constant logic optimization.

4.1 Pseudo-Constant Design Process

This section defines the four components of pseudo-constant logic optimization: identification, technology mapping, bitfile creation, and invalidation detection.
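
To make the idea behind Figure 4-1(c) concrete, the sketch below folds a pseudo-constant select into the truth table of a single 4-input LUT, producing one 16-bit table per select value. It is an illustrative model only: the function names are hypothetical, and the bit ordering (data input i drives LUT address bit i) is an assumption rather than a documented device convention.

```python
def folded_mux_truth_table(select, num_inputs=4):
    """Truth table (as an int) of a LUT whose output equals data input `select`.

    Bit `address` of the result is the LUT output when the data inputs,
    read as a binary number, equal `address`.
    """
    table = 0
    for address in range(2 ** num_inputs):
        if (address >> select) & 1:
            table |= 1 << address
    return table


def lut_eval(table, data_bits):
    """Look up the LUT output for a list of data input bits (bit 0 first)."""
    address = sum(bit << i for i, bit in enumerate(data_bits))
    return (table >> address) & 1


# One 16-bit table per possible value of the pseudo-constant select.
tables = {select: folded_mux_truth_table(select) for select in range(4)}
```

With the select folded into the table contents, the 4:1 mux occupies one LUT; changing the select then means rewriting the 16 table bits rather than driving a select wire.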

4.1.1 Pseudo-Constant Identification

The first step of pseudo-constant logic optimization is the identification of potential pseudo-constants. Our current approach uses designer-specified identification, where designers use knowledge of an application's behavior to manually specify pseudo-constant signals. Rather than requiring the actual value of a pseudo-constant, a designer or synthesis tool need only know that a signal will be pseudo-constant (e.g., a convolution kernel).

Although designers may often be aware of pseudo-constants, there will be situations where potential pseudo-constants are not obvious. Synthesis tools could potentially use a profiling-based heuristic that profiles the number of distinct values of a given signal, along with the frequency that the value changes (i.e., the invalidation frequency). Furthermore, we envision a hybrid approach where designers specify the signals to profile. Previous work [28] has introduced such profiling for both simulation and in-circuit behavior. We plan to investigate automatic pseudo-constant identification as future work.

4.1.2 Pseudo-Constant Technology Mapping

One important difference between pseudo-constant and traditional logic optimization is that the elaborated circuits may be identical, but may differ significantly after technology mapping. For example, consider the 4:1 multiplexer from Figure 4-1, with a constant or pseudo-constant select input. For this example, logic optimizations would replace the multiplexer with a wire connected to the mux input that corresponds to the select's constant/pseudo-constant value. However, depending on the available FPGA primitives, technology mapping for the pseudo-constant logic may require more than a wire because the resulting circuit must handle changes caused by invalidated pseudo-constants.

To deal with these invalidations, any logic that is optimized in the previous step is marked as pseudo-constant logic, which technology mapping handles differently from normal logic. Technology mapping for pseudo-constant logic is similar to traditional technology mapping, but is restricted to FPGA primitives that support runtime

reconfiguration. Although there could potentially be numerous primitives, in this paper we focus on common primitives in existing FPGA devices: LUT RAM and LUT shift registers. Section 4.2 describes example mappings for Xilinx devices. Pseudo-constants are also possible on Altera devices using MLABs, but evaluation of MLABs is outside the scope of this paper. Previous work has focused on using partial reconfiguration for similar goals [46], which we omit from this study due to the long reconfiguration times compared to rewriting LUT contents. However, partial reconfiguration may represent a Pareto-optimal tradeoff in terms of area savings and performance overhead. We plan to investigate these tradeoffs as future work.

4.1.3 Pseudo-Constant Bitfile Creation

After technology mapping, the resulting circuit must create and/or provide a small, corresponding bitfile that implements the logic for each pseudo-constant value. In the case of LUT RAM or LUT shift registers, this bitfile is simply the truth table stored in the LUT. For the mux example in Figure 4-1(c), the bitfile for the pseudo-constant mux is 16 bits, due to the 4-input LUT (2^4 bits).

The overhead of pseudo-constant bitfile creation depends on the characteristics of a particular pseudo-constant, which may be amenable to either offline or online creation. In this paper, we focus mainly on offline creation, but discuss the tradeoffs and challenges of online creation to present all possible use cases.

Offline creation is possible when the designer or synthesis tool is aware that a pseudo-constant only has several possible values. In this case, the synthesis tool can pre-compute the bitfile for all possible values and store the bitfiles in on-chip memory, which the circuit loads into the corresponding primitives during a pseudo-constant invalidation. For example, for a 4:1 mux with a pseudo-constant select, a synthesis tool could statically determine four separate bitfiles and store them in a block RAM or other memory. Offline creation is not limited to functions with small numbers of inputs. For example, an input to a 32-bit comparator may only have two different possible values for a given application (e.g., a runtime-specified threshold in

an image-processing application), which would enable a synthesis tool to statically create two separate bitfiles.

Online bitfile creation is needed when a synthesis tool is not aware of the different possible values of a pseudo-constant, or alternatively when there are too many possible values, which would require a significant amount of on-chip memory to store bitfiles. In general, online bitfile creation is more complicated and requires a portion of the circuit, or a co-processor, to calculate truth tables for invalidated pseudo-constant logic. In many situations, online creation is not practical because the logic required for bitfile creation is larger than the savings from the pseudo-constant logic.

Note that pseudo-constant bitfiles also create memory overhead. We therefore expect pseudo-constant optimization to be appropriate where block RAM is not the main resource bottleneck of an application.

4.1.4 Pseudo-Constant Invalidation Detection

Pseudo-constant circuits must identify when a pseudo-constant changes values, which we refer to as invalidation detection. After detecting an invalidation, the circuit loads a new bitfile into the corresponding resources. In this paper, we use application-specified detection, where the designer explicitly specifies when a given pseudo-constant changes. One disadvantage is that this approach is error prone and requires knowledge of pseudo-constant invalidations. However, for many applications, invalidations are obvious. For example, for designer-specified pseudo-constants, the designer is already aware of pseudo-constants, and is likely aware of when the application changes a pseudo-constant (e.g., a new image).

As future work, we envision the possibility of runtime detection, which doesn't require designer knowledge, but is often impractical due to overhead. In the general case, runtime detection requires a comparator, which may outweigh savings except for large regions of pseudo-constant logic (e.g., large adder trees).
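
Offline bitfile creation and application-specified invalidation can be modeled together in a few lines of Python. The sketch below is a behavioral illustration under stated assumptions, not the thesis implementation: the class PseudoConstantLut and the helper less_than are hypothetical names, a Python dict stands in for the block RAM holding precomputed bitfiles, and a 4-bit threshold comparator stands in for the 32-bit comparator example.

```python
class PseudoConstantLut:
    """Behavioral model of one reconfigurable LUT with offline bitfiles."""

    def __init__(self, func, num_inputs, possible_values):
        # Offline creation: one truth table per possible pseudo-constant
        # value, indexed by the LUT address formed from the normal inputs.
        self.bitfiles = {
            value: [func(value, address) & 1
                    for address in range(2 ** num_inputs)]
            for value in possible_values
        }
        self.table = None   # LUT contents; unset until the first load
        self.reloads = 0    # number of reconfigurations performed

    def invalidate(self, new_value):
        """Application-specified invalidation: load the matching bitfile.

        Raises KeyError if no bitfile was precomputed for new_value.
        """
        self.table = self.bitfiles[new_value]
        self.reloads += 1

    def evaluate(self, address):
        """Read the LUT output for the current non-constant inputs."""
        return self.table[address]


def less_than(threshold, x):
    """Comparator with the threshold acting as the pseudo-constant."""
    return int(x < threshold)


# Two possible thresholds, so two 16-bit bitfiles are created offline.
lut = PseudoConstantLut(less_than, num_inputs=4, possible_values=[5, 12])
lut.invalidate(5)
below_five = lut.evaluate(4)
lut.invalidate(12)
below_twelve = lut.evaluate(11)
```

The model makes the memory tradeoff visible: storage grows with the number of possible values times 2^num_inputs bits, which is why a large or unknown value set pushes the design toward online creation.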

Figure 4-2. Functional architecture of a Xilinx Virtex 5 LUT. Each LUT can be configured as a 64x1 dual-ported RAM, a single variable-length shift register up to 32 bits long, or two independent variable-length shift registers up to 16 bits long each.

4.2 Technology Mapping

In this section, we discuss how different pseudo-constant logic can be technology mapped onto FPGA LUT primitives. In Section 4.2.1, we present pseudo-constant primitives for the Xilinx Virtex 5. In Section 4.2.2, we identify architectural bottlenecks and present extensions that would enable Virtex 5 to better support pseudo-constants.

4.2.1 Pseudo-Constant Primitives for Xilinx Virtex 5

General-purpose logic resources in Xilinx Virtex 5 devices are composed of columns of configurable logic blocks (CLBs). Each CLB is composed of two SLICEs, each of which contains four LUTs. While devices are composed of equal numbers of two different SLICE types, SLICEM and SLICEL, only SLICEMs have dynamically reconfigurable LUT primitives; therefore, in this work we consider only SLICEMs.

Figure 4-2 shows the simplified functional architecture of the Virtex 5's six-input, two-output LUT. The LUT is logically composed of two five-input, one-output (32x1)

random-access FIFO structures, addressed by the LUT's lower five inputs (A1:A5). A mux uses the sixth input (A6) to select one of the two 32x1 structures to drive the LUT's primary O6 output, while one structure directly drives a secondary output, O5. Output O6 may be any combinational function of all six inputs, while O5 is a subset function of only five inputs.

Each Virtex 5 LUT can be configured as either a 64x1 or 32x2 dual-ported RAM, with one synchronous write and one asynchronous read port. An additional six inputs (WA[1:6]) specify the write address. Each LUT can also be configured as one 32-bit shift register, or two 16-bit shift registers, each with addressable outputs that can select any bit of the shift register. Xilinx refers to these shift register primitives as SRL32 and SRL16 respectively.

Figure 4-3 shows a simplified view of a SLICEM from the Virtex 5 user guide [2]. Each SLICEM contains four LUTs, referred to as A, B, C, and D. Each LUT has six dedicated logic or read address inputs, as well as two data inputs to drive LUT RAM and SRL inputs. LUT D's read address inputs (D1:D6) are also used to drive the six write address inputs for all four LUTs. As discussed later, this addressing method is a limitation for pseudo-constant logic because LUT D cannot be efficiently used for outputs. Paired with each LUT is dedicated carry-chain logic and a flip-flop. Dedicated outputs carry each LUT O6 output and flip-flop output to the routing fabric. Muxes select from the LUT O5 output, carry-chain output, and shift-out to drive the flip-flop input and a third output. The shift-out port of each LUT is connected to the shift-in port of the next LUT (i.e., ShiftOutD -> ShiftInC) to create longer shift register chains, up to 128 bits per SLICEM. Two dedicated muxes select between outputs from LUT A and B, and LUT C and D, and a third mux selects between those two muxes. This structure enables eight-input muxes using only two LUTs, and 16-input muxes using four LUTs.

4.2.1.1 Distributed RAM

To implement the LUT RAM pseudo-constant primitive, we use Xilinx Distributed RAM. Each Xilinx LUT allows read and write access to the 64 SRAM bits in either

64x1-bit or 32x2-bit dimensions.

Figure 4-3. All four LUTs (A-D) of a single Xilinx Virtex 5 SLICEM configured as distributed RAM.

Multiple LUTs per slice can be grouped together to create wider or deeper memories. Because the write addresses for the four LUTs are driven by LUT D's six logic and read inputs, significant limitations are placed on the efficiency of LUT RAM structures when using the Virtex 5. For example, a dual-ported 64x1 RAM requires two LUTs, resulting in a 50% area penalty.

To achieve maximum area efficiency, a LUT RAM primitive using Virtex 5 distributed RAM should ideally use all four LUTs in a single SLICEM. Inputs D[1:6] drive the common write address and are used to configure LUTs A, B, and C, which can then be used as three independent LUTs, while LUT D's inputs are consumed by serving as the write address for LUTs A, B, and C. Figure 4-3 shows four LUTs connected in this fashion.

Using LUT RAM, each SLICEM yields either three 6-input, 1-output functions, or three 5-input, 2-output functions. Because only one flip-flop is available to each LUT, in the case of 2-output functions, only one output can use a flip-flop, which can be a limitation for

pipelined logic. If inputs D[1:6] can be driven by both logic during normal operation and configuration hardware during reconfiguration, then LUT D may also be used for pseudo-constant based logic, eliminating the penalty. In this case, four 6:1 or 5:2 functions could be realized per SLICEM.

4.2.1.2 Shift Register

LUT shift-register primitives can be implemented using Xilinx SRL primitives. When configuring LUTs as shift registers, configuration bits for many LUTs can be shifted serially in a single configuration chain. Using the SRL32, a single LUT can be configured as a five-input, one-output function. Configured as two SRL16s, each LUT can be configured as a four-input, two-output function. Unlike SRL32, each SRL16 must be driven by an independent configuration input; multiple SRL16 primitives cannot be chained together in a single, long configuration chain.

4.2.2 Architectural Extensions

The pseudo-constant primitives for the Virtex 5, described above, highlight many of the challenges of implementing pseudo-constant logic optimization on modern FPGAs. In this section, we discuss the FPGA architectural characteristics that most limit the effectiveness of pseudo-constant logic optimizations, particularly those of the Xilinx Virtex 5 CLB architecture, and suggest modifications to improve the efficiency of pseudo-constants.

Pseudo-constant implementations can be viewed as a traditional input/output-bound problem. The Virtex 5 implementations described above show that the number of inputs and outputs to an FPGA's LUTs are a key limitation of pseudo-constant logic packing and place an upper bound on the achievable area reduction. Additionally, the number of inputs required to produce a given output value, and the number of inputs shared among multiple outputs, enforce a similar limitation. For example, in the design of an adder circuit described in the next section, the key design limitation was the number of outputs from a LUT. While groups of four to six input pins, producing three to five sum outputs and a carry, could drive a single LUT, at

most two outputs per LUT could be generated. Additionally, the availability of only one set of fast carry logic and flip-flop per LUT limits the achievable maximum clock speed when using two outputs per LUT. Additionally, with LUT RAM primitives, one LUT per SLICE is consumed solely by the use of its address pins by the RAM write address, and cannot be used for logic.

As a result of these challenges, we suggest that future FPGA architectures could be augmented to improve the viability of pseudo-constant optimizations. For example, we have observed that modifications to improve efficiency of wider-output functions, such as those found in many arithmetic operations, could greatly benefit pseudo-constant optimizations. In particular, more outputs per LUT and a fast carry logic and flip-flop pair for each LUT output could greatly improve the efficiency of wide- or multi-output functions. By adding an extra set of address pins to the SLICE to serve as the common write address input, the 25 percent loss of functional density in LUT RAM based designs can be averted.

Figure 4-4 shows a possible SLICE architecture for a device including these modifications. Specifically, this device's SLICE architecture is composed of four six-input, two-output LUTs identical to those of a Virtex 5. Additionally, carry-logic and flip-flop stages identical to those available to the O6 output of Virtex 5 LUTs are added to both outputs of each LUT. An additional fifth set of six input pins is added to serve as a common write-address input for LUT RAM primitives.

4.3 Experiments

To evaluate pseudo-constant logic optimization, we manually technology mapped common logic functions onto pseudo-constant primitives for Xilinx Virtex 5 FPGAs. Because Virtex 5, Virtex 6, and Virtex 7 devices all employ an identical CLB architecture, the results also apply to those devices. To determine benefits, we also synthesize each circuit without the proposed optimization to a Xilinx Virtex 5 LX50 FPGA using Xilinx ISE 14.2. For each example, we include results for only the most efficient of either a LUT RAM or shift-register-based design. In addition to the Virtex 5, we evaluate the same

circuits on a theoretical device incorporating the modifications proposed in Section 4.2.2.

Figure 4-4. A modified Virtex 5 slice including enhancements to improve efficiency of pseudo-constant optimized logic. An extra carry logic and flip-flop output stage is added to the secondary O5 output of each LUT, and a fifth set of address pins is added for dedicated write address use.

This theoretical device is composed of CLBs using the modified Virtex 5 architecture shown in Figure 4-4, for which we assume Virtex 5 timing and switching characteristics [1]. Note that the timing results are optimistic because the theoretical architecture may have longer delays. Similarly, the theoretical device may have general-purpose area tradeoffs, which are outside the scope of this study.

We also evaluated support logic to reconfigure pseudo-constant circuits. We implemented a simple programmer to iteratively shift each bit of the pseudo-constant bitfile into the configuration bits of each LUT. This circuit is composed of a counter and a BRAM to store the bitfile, and consumes as few as 10 LUTs. Therefore, pseudo-constant optimizations are only beneficial after saving more than 10 LUTs. This circuit can be used to program each LUT sequentially, or replicated to program two or more in parallel.

In this section, we evaluate logic that is commonly replicated in large numbers by many FPGA applications. These replicated circuits represent an appropriate usage

case for pseudo-constant circuits as many copies can share a small number of support circuits for reconfiguration and invalidation. This sharing enables overhead of the pseudo-constant support logic to be amortized over a large number of optimized circuits. The evaluated circuits include an adder, a comparator, and a multiplexer.

4.3.0.1 32-bit Full Adder

When synthesized into FPGA LUTs, adder circuits are output-bound. Because addition operations are wide-output functions, the key challenge in minimizing the number of LUTs is in driving all N outputs. Synthesis in Xilinx ISE for a Virtex 5 adder uses the dedicated fast carry logic to create ripple-carry adders. Each LUT adds the ith bit of each input A and B, generating a sum and carry output. These outputs drive the carry logic, which combines these signals with C_{i-1} to generate S_i and C_i.

If instead the add operation had one pseudo-constant input and one normal (i.e., non-constant) input, the pseudo-constant value can be folded into the function implemented by each LUT to reduce the LUT utilization. In this case, the only input to each LUT would be the ith bit of the non-constant input. Even though each LUT has several free inputs, because both sum and carry-out signals must be generated for each bit, both LUT outputs are consumed, and no LUTs can be eliminated from the circuit.

Suppose instead three bits of the non-constant input, [A_i, ..., A_{i-2}], along with a carry input C_{i-3}, were connected to two LUTs. The four available outputs from this structure can then implement outputs [S_i, ..., S_{i-2}] and C_i. Because each previous bit's inputs are available to the function generator for each output bit, the internal carry values can be calculated without consuming LUT outputs. This structure implements a 3-bit full adder using only two LUTs, rather than three, providing a 33% area savings.

Using the SRL16-based four-input, two-output pseudo-constant LUT primitive described above, many such pseudo-constant 3-bit full adders can be chained together to implement wider pseudo-constant adders. Figure 4-5 shows a 32-bit adder designed using these structures. When synthesized for a Virtex 5 device

PAGE 45

using Xilinx XST, a normal 32-bit adder consumes 32 LUTs. When synthesized using the pseudo-constant based design, a 32-bit adder consumes only 22 LUTs, an area savings of 31%.

Figure 4-5. An SRL16-based pseudo-constant 32-bit full adder design.

Figure 4-6 shows how adder LUT count grows with input width for both the traditional and pseudo-constant adders. The darker line shows how traditional adder LUT count grows linearly, equal to the bit width. The lighter line shows how the pseudo-constant adder LUT count grows more slowly and in a step-wise fashion. The step-wise behavior and the LUT savings are due to the fact that every other LUT generates two bits of the output. The figure also shows that LUT savings increase as adder width increases.

Because the Virtex 5 CLB's fast carry logic is accessible by only one output from each LUT, the optimized design cannot benefit from the fast carry logic. Despite a shorter overall combinational path, 11 logic stages rather than 32, the longer path between neighboring LUTs increases the circuit's combinational delay by 5x, from 2.515
ns for traditional logic to 10.377 ns using the pseudo-constant design. Additionally, because only one flip-flop is available per LUT, only one output from each LUT can directly drive a pipeline register without consuming an additional route-through LUT.

Figure 4-6. A graph of LUT counts for pseudo-constant and traditionally synthesized adders on a Xilinx Virtex 5 as adder bit width increases.

When the pseudo-constant design is instead mapped onto the modified architecture from Section 4.2.2, the 31% area savings are retained, while at the same time each output bit can take advantage of the fast carry logic and flip-flop output stage. Thus, a 32-bit ripple carry adder can be mapped to the modified architecture using 22 LUTs with a combinational delay of 1.343 ns. This delay for the pseudo-constant-optimized adder is 47% shorter than that of a traditionally synthesized adder.

4.3.0.2 Multiplexer

A pseudo-constant multiplexer can be designed similarly to the adder described in the previous section. Using traditional synthesis methods, a four-input mux requires one LUT on a Virtex 5. Multiple four-input muxes can be combined using dedicated SLICE mux hardware to create up to one 16-input mux per SLICE. If the select input to a mux were found to be pseudo-constant, using the SRL32 five-input, one-output
LUT primitive, a five-input mux consumes only one LUT, and a 20-input mux can be created in each SLICE. This design yields a 25 percent increase in functional density over traditional synthesis. Additionally, a four-input, two-output mux can be designed using the SRL16 four-input, two-output LUT primitive consuming only one LUT, yielding up to 50 percent LUT savings. By taking advantage of the LUTRAM-based primitive in the modified architecture, a six-input, one-output mux can be created using just one LUT, with up to a 24-input mux per SLICE. This design yields a 50 percent increase in functional density over traditional synthesis, and a 25 percent increase over the pseudo-constant design on the Virtex 5.

Figure 4-7. A graph of LUT counts for pseudo-constant and traditionally synthesized multiplexers on a Xilinx Virtex 5 as the number of inputs increases.

Figure 4-7 shows the LUT count needed for muxes implemented with each design as the number of inputs grows, up to 32 inputs. The figure shows a step-wise trend due to LUT counts growing in different multiples. The unoptimized mux increases in multiples of four inputs per LUT. The pseudo-constant mux on the Virtex 5 increases in multiples of five inputs per LUT. The mux on the modified architecture increases in multiples of six inputs per LUT. As muxes grow larger, the LUT savings achieved by pseudo-constant designs increase. There is no difference in timing performance between the designs.
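
The step-wise growth described above can be illustrated with a small Python sketch. It is a simplified model, not the synthesis flow used in this work: it assumes the LUT count is simply the input count divided by the inputs handled per LUT (four, five, or six, as quoted above), rounded up, and it ignores any extra logic needed to combine muxes across SLICEs. The names `INPUTS_PER_LUT`, `mux_luts`, and `density_gain` are hypothetical.

```python
from math import ceil

# Data inputs handled per LUT for each design, taken from the text above.
# The dictionary keys are illustrative labels, not tool or vendor names.
INPUTS_PER_LUT = {
    "traditional": 4,               # four-input mux per LUT on a Virtex 5
    "pseudo_constant_virtex5": 5,   # SRL32 five-input, one-output primitive
    "pseudo_constant_modified": 6,  # six-input LUTRAM primitive (modified architecture)
}


def mux_luts(num_inputs: int, design: str) -> int:
    """Simplified LUT count for a num_inputs-to-1 mux (inter-SLICE combining ignored)."""
    if num_inputs < 1:
        raise ValueError("num_inputs must be positive")
    return ceil(num_inputs / INPUTS_PER_LUT[design])


def density_gain(design: str) -> float:
    """Inputs-per-LUT ratio relative to traditional synthesis (timing assumed equal)."""
    return INPUTS_PER_LUT[design] / INPUTS_PER_LUT["traditional"]


if __name__ == "__main__":
    for n in (4, 8, 16, 20, 24, 32):
        counts = {design: mux_luts(n, design) for design in INPUTS_PER_LUT}
        print(n, counts)
```

Because the text reports no timing difference between the designs, the inputs-per-LUT ratio in `density_gain` is the only term that changes functional density in this model, which is where the 1.25x and 1.5x figures come from.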

4.3.1 32-bit Comparator

A pseudo-constant comparator can be designed similarly to the adder described above. Suppose a circuit must compare two 32-bit numbers, A and B, for equivalence. When synthesized to the Virtex 5 architecture, this circuit requires 11 LUTs, with a propagation delay of 4.658 ns. If input B were found to be pseudo-constant, its value could be folded into the function implemented by the circuit's LUTs. Figure 4-8 shows a comparator design using the SRL32-based five-input, one-output LUT primitive described above. The inputs to each LUT comprise a group of four consecutive bits of the variable input, along with a carry-out from the previous group. The outputs from these groups are cascaded together to create a 32-bit wide comparator using only 8 LUTs, vs. 11, for an area savings of 27%. The propagation delay increased to 6.556 ns.

Figure 4-8. An SRL16-based pseudo-constant 32-bit comparator design.

By taking advantage of the modified architecture, using the six-input, one-output LUTRAM primitive, the pseudo-constant comparator circuit can be optimized further. The size of each group of inputs is increased from four to five. Thus a 32-bit pseudo-constant comparator can be synthesized on the modified architecture using only 6 LUTs, yielding a 20 percent area decrease and a 20 percent shorter combinational delay compared to the pseudo-constant design synthesized on the Virtex 5. The resulting delay is only 2.7% slower than the traditional comparator.
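
To make the idea of folding the pseudo-constant into the LUT contents concrete, the following Python sketch models the Virtex 5 variant behaviorally. It is an illustration under stated assumptions rather than the synthesized circuit: each LUT is modeled as a 32-entry truth table addressed by four bits of the variable input plus the cascade signal from the previous group, and the first group's cascade input is assumed to be tied high. The function names `build_luts` and `equals_constant` are hypothetical.

```python
GROUP_BITS = 4  # variable-input bits per LUT; the fifth LUT input is the cascade signal


def build_luts(constant: int, width: int = 32) -> list[list[int]]:
    """Return one 32-entry truth table per 4-bit group, specialized to `constant`.

    Rebuilding these tables models reconfiguration after the pseudo-constant changes.
    """
    if width % GROUP_BITS != 0:
        raise ValueError("width must be a multiple of the group size")
    luts = []
    for group in range(width // GROUP_BITS):
        b_group = (constant >> (group * GROUP_BITS)) & 0xF
        # Address layout: bit 4 is the cascade input, bits 3..0 are the variable bits.
        # The output is 1 only if all earlier groups matched and this group
        # equals the folded constant bits.
        table = [
            int(((addr >> GROUP_BITS) == 1) and ((addr & 0xF) == b_group))
            for addr in range(1 << (GROUP_BITS + 1))
        ]
        luts.append(table)
    return luts


def equals_constant(luts: list[list[int]], a: int) -> bool:
    """Evaluate the cascaded LUT chain for variable input `a`."""
    cascade = 1  # assumption: the first group's cascade input is tied high
    for group, table in enumerate(luts):
        a_group = (a >> (group * GROUP_BITS)) & 0xF
        cascade = table[(cascade << GROUP_BITS) | a_group]
    return bool(cascade)


if __name__ == "__main__":
    luts = build_luts(0xDEADBEEF)
    print(len(luts), equals_constant(luts, 0xDEADBEEF), equals_constant(luts, 0xDEADBEEE))
```

In this model the constant input B never appears as a signal. It exists only in the table contents, which is why each LUT can devote four of its five inputs to the variable operand and why eight tables suffice for a 32-bit comparison.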

4.3.2 Functional Density

In [46], Wirthlin et al. present a functional density metric, D, defined as the inverse of the product of a circuit's area, A, and operating time, T, as shown in Equation 4-1:

$$ D = \frac{1}{A\,T} \tag{4-1} $$

This metric is used to quantify the benefits of circuit specialization and enable comparison of area and performance tradeoffs. Additionally, [46] presents a specialized form of Equation 4-1 for use with run-time reconfigurable circuits such as pseudo-constant optimized circuits. By adding reconfiguration time, t_config, divided by operations per reconfiguration, n, to the operating time term, the metric accounts for the performance effects of reconfiguration operations at a given invalidation frequency. Equation 4-2 shows this modified metric:

$$ D = \frac{1}{A\left(t_{exec} + \dfrac{t_{config}}{n}\right)} \tag{4-2} $$

Figure 4-9 plots the functional density, as defined by Equation 4-2, for each of the three adder circuits. The figure shows the operations between invalidations (i.e., the inverse of invalidation frequency) decreasing logarithmically. This figure shows that while the combinational delay overhead on the Virtex 5 architecture prevents the pseudo-constant circuit from matching the functional density of the traditional adder circuit, on the modified architecture the pseudo-constant circuit surpasses the functional density of the traditional adder after only 19 operations between reconfigurations. Additionally, reconfiguration overhead per operation reaches nearly zero after only 214 operations, a small figure considering FPGA clock frequencies in the hundreds of megahertz. For infrequent invalidations, the functional density of the pseudo-constant adder on the modified architecture approaches 2.7x that of the traditional adder.

In any pseudo-constant design using LUTRAM or shift-register LUTs, reconfiguration can load the pseudo-constant bitfile into each LUT either in serial or in parallel. In serial reconfiguration, each bit
is written into each LUT one bit at a time, one LUT at a time. This method yields the longest reconfiguration time and the largest performance penalty. Alternatively, during parallel reconfiguration, each bit must still be written into each LUT one bit at a time, but all LUTs can be written on the same cycle. Parallel reconfiguration decreases reconfiguration time by a factor of N, where N is the total number of LUTs in the pseudo-constant circuit. Because parallel reconfiguration requires proportionally more reconfiguration resources, a designer or synthesis tool must consider an area-performance tradeoff between parallel and serial reconfiguration. Additionally, the degree of parallelism can be adjusted to find an appropriate Pareto-optimal design point for each design.

Figure 4-10 compares the functional density of each pseudo-constant 32-input mux to traditional muxes using either fully-parallel or fully-serial reconfiguration. Longer dashed lines show parallel reconfiguration, and dotted lines show serial reconfiguration. Lighter lines show pseudo-constant muxes implemented on the standard Virtex 5 architecture. Darker lines show pseudo-constant designs implemented on the modified architecture. Functional density of intermediary degrees of parallelism can be inferred between each trend line for each architecture. All densities are shown as a ratio to the functional density of a traditional Virtex 5 mux, shown in solid black. The results show that pseudo-constant muxes approach a functional density of 1.25x on the Virtex 5 architecture, and 1.5x on the modified architecture, when compared to traditional synthesis. Additionally, the graph shows that the break-even point, at which the functional densities of the pseudo-constant optimized and traditional circuits are equal, is approximately 128 operations per invalidation using fully parallel reconfiguration, and fewer than 900 operations using fully serial reconfiguration.
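
The functional density metric with reconfiguration overhead, D = 1 / (A (t_exec + t_config / n)), can be evaluated numerically with the adder figures reported in Section 4.3.0.1 (32 LUTs at 2.515 ns for the traditional adder, 22 LUTs at 10.377 ns on the Virtex 5, and 22 LUTs at 1.343 ns on the modified architecture). The Python sketch below is illustrative only. The function names, the 100 ns reconfiguration time used in the usage lines, and the serial versus parallel cycle model (one bit per LUT per clock cycle) are assumptions, and the area term ignores the support logic for reconfiguration.

```python
import math


def functional_density(area_luts, t_exec_ns, t_config_ns=0.0, ops_per_reconfig=1):
    """D = 1 / (A * (t_exec + t_config / n)); with t_config = 0 this is D = 1 / (A * T)."""
    if ops_per_reconfig < 1:
        raise ValueError("ops_per_reconfig must be at least 1")
    return 1.0 / (area_luts * (t_exec_ns + t_config_ns / ops_per_reconfig))


def reconfig_time_ns(num_luts, bits_per_lut, clock_ns, parallel):
    """Assumed model: one configuration bit per cycle, per LUT (serial) or for all LUTs (parallel)."""
    cycles = bits_per_lut if parallel else bits_per_lut * num_luts
    return cycles * clock_ns


def break_even_ops(t_config_ns, pc_area, pc_t_ns, trad_area, trad_t_ns):
    """Smallest n at which the pseudo-constant density reaches the traditional one.

    Returns None when the pseudo-constant circuit cannot catch up for any n.
    """
    # Solve pc_area * (pc_t + t_config / n) <= trad_area * trad_t for n.
    slack_ns = trad_area * trad_t_ns / pc_area - pc_t_ns
    if slack_ns <= 0:
        return None
    return max(1, math.ceil(t_config_ns / slack_ns))


# (area in LUTs, combinational delay in ns), as reported for the 32-bit adders.
TRAD = (32, 2.515)
PC_V5 = (22, 10.377)
PC_MOD = (22, 1.343)

if __name__ == "__main__":
    trad = functional_density(*TRAD)
    print("limit on modified architecture:", functional_density(*PC_MOD) / trad)
    print("limit on Virtex 5:", functional_density(*PC_V5) / trad)
    # 100 ns is a hypothetical reconfiguration time, not a measured value.
    print("break-even ops:", break_even_ops(100.0, *PC_MOD, *TRAD))
```

With these inputs the density ratio on the modified architecture tends toward about 2.7 as n grows, in line with the figure quoted above, while the ratio on the Virtex 5 stays below one for every n because the 10.377 ns delay outweighs the area savings.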

Figure 4-9. Functional density of a pseudo-constant adder compared to a traditional adder as the invalidation frequency increases. Results are shown for both the Virtex 5 and modified architectures.

Figure 4-10. Functional density of each pseudo-constant mux design compared to a traditional mux as the invalidation frequency increases. Functional density for each design is shown for both fully-parallel and fully-serial reconfiguration.

CHAPTER 5
CONCLUSIONS

Previous work introduced intermediate fabrics to address FPGA problems related to lengthy place-and-route times and a lack of application portability. Although previous intermediate fabric approaches achieve both application portability and significant place-and-route speedup, the area overhead of those approaches prohibits important use cases. To address this problem, we identified the virtual interconnect as the main source of the overhead, and followed two complementary approaches to reduce overhead.

After identifying multiplexers as the primary component of the interconnect, we first performed design-space exploration to identify unconventional alternatives that could achieve effective Pareto-optimal tradeoffs between overhead and routability. Based on this analysis, we introduced an optimized virtual interconnect architecture that reduces area requirements by approximately 50% and improves clock frequencies by 24%, with a modest 16% reduction in routability.

Additionally, we sought to reduce the overhead due to each individual multiplexer through pseudo-constant logic optimization. We showed that pseudo-constant optimizations can increase functional density of common logic structures such as multiplexers by up to 1.25x. While these optimizations can apply to many other functional elements, such as adders and comparators, the experiments also show the difficulty of implementing pseudo-constant designs on modern FPGAs. In particular, restrictions on dynamic reconfigurability and narrow-output functional units limit the effectiveness of pseudo-constant optimizations. If future FPGA designs address these concerns, pseudo-constant optimizations could be a viable method of increasing functional density in FPGA designs, with improvements as high as 2.7x.

While these optimizations enable designers to employ intermediate fabrics in a wider range of area-constrained applications, there is still opportunity for continued
improvement. Future work must address and limit the routability and flexibility penalty of the optimized interconnect presented, as well as both the manual and automated design challenges of integrating pseudo-constant logic optimizations. Even with a 50-75% reduction in LUT utilization, intermediate fabrics will still have prohibitive overhead for use cases where an FPGA is close to being fully utilized. Fortunately, the trends towards multi-million-LUT FPGAs will lessen this problem over time. In addition, we plan to investigate virtual interconnect that directly targets the physical FPGA interconnect without using muxes. Such an approach could map virtual switch boxes directly onto physical switch boxes, potentially eliminating much of the remaining overhead. However, such an approach requires knowledge of proprietary routing architectures, and is therefore deferred to future work.

REFERENCES

[1] Xilinx Virtex-5 FPGA Data Sheet: DC and Switching Characteristics, 2010. URL http://www.xilinx.com/support/documentation/data_sheets/ds202.pdf
[2] Xilinx Virtex-5 FPGA User Guide, 2012. URL http://www.xilinx.com/support/documentation/user_guides/ug190.pdf
[3] Ashenhurst, R.L. The decomposition of switching functions. Proc. Internatl. Symp. Theory of Switching, Annals Computation Lab. vol. 29. Cambridge, Mass: Harvard University, 1957, 74.
[4] Athanas, P., Bowen, J., Dunham, T., Patterson, C., Rice, J., Shelburne, M., Suris, J., Bucciero, M., and Graf, J. Wires on Demand: Run-Time Communication Synthesis for Reconfigurable Computing. FPL '07: International Conference on Field Programmable Logic and Applications. 2007, 513.
[5] Becker, J., Pionteck, T., Habermann, C., and Glesner, M. Design and implementation of a coarse-grained dynamically reconfigurable hardware architecture. VLSI '01: Proceedings of IEEE Computer Society Workshop on VLSI. 2001, 41.
[6] Betz, Vaughn and Rose, Jonathan. VPR: A new packing, placement and routing tool for FPGA research. FPL '97: Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications. London, UK: Springer-Verlag, 1997, 213.
[7] Brant, A. and Lemieux, G.G.F. ZUMA: An Open FPGA Overlay Architecture. Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on. 2012, 93.
[8] Callahan, Timothy J., Chong, Philip, DeHon, Andre, and Wawrzynek, John. Fast module mapping and placement for datapaths in FPGAs. FPGA '98: Proceedings of the 1998 ACM/SIGDA sixth international symposium on Field programmable gate arrays. New York, NY, USA: ACM, 1998, 123.
[9] Chen, Chau-Shen, Tsay, Yu-Wen, Hwang, TingTing, Wu, A.C.H., and Lin, Youn-Long. Combining technology mapping and placement for delay-minimization in FPGA designs. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 14 (1995). 9: 1076.
[10] Compton, Katherine and Hauck, Scott. Totem: Custom Reconfigurable Array Generation. FCCM '01: Proceedings of the 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. Washington, DC, USA: IEEE Computer Society, 2001, 111.
[11] Coole, James and Stitt, Greg. Intermediate Fabrics: Virtual Architectures for Circuit Portability and Fast Placement and Routing. CODES/ISSS '10: Proceedings of the IEEE/ACM/IFIP international conference on Hardware/Software codesign and system synthesis. 2010, 13.
[12] Cox, C.E. and Blanz, W.E. GANGLION - a fast field-programmable gate array implementation of a connectionist classifier. Solid-State Circuits, IEEE Journal of 27 (1992). 3: 288.
[13] Dehon, Andre Maurice. Reconfigurable architectures for general-purpose computing. Ph.D. thesis, 1996. AAI0597715.
[14] Donthi, S. and Haggard, R.L. A survey of dynamically reconfigurable FPGA devices. System Theory, 2003. Proceedings of the 35th Southeastern Symposium on. 2003, 422-426.
[15] Ebeling, Carl, Cronquist, Darren C., and Franklin, Paul. RaPiD - Reconfigurable Pipelined Datapath. FPL '96: Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers. London, UK: Springer-Verlag, 1996, 126.
[16] Eguro, Ken and Hauck, Scott. Armada: timing-driven pipeline-aware routing for FPGAs. FPGA '06: Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field programmable gate arrays. New York, NY, USA: ACM, 2006, 169.
[17] Eldredge, J.G. and Hutchings, B.L. Density enhancement of a neural network using FPGAs and run-time reconfiguration. FPGAs for Custom Computing Machines, 1994. Proceedings. IEEE Workshop on. 1994, 180.
[18] Farrahi, A.H. and Sarrafzadeh, M. Complexity of the lookup-table minimization problem for FPGA technology mapping. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 13 (1994). 11: 1319.
[19] Foulk, P.W. Data-folding in SRAM configurable FPGAs. FPGAs for Custom Computing Machines, 1993. Proceedings. IEEE Workshop on. 1993, 163.
[20] Giri, A., Visvanathan, V., Nandy, S.K., and Ghoshal, S.K. High speed digital filtering on SRAM-based FPGAs. VLSI Design, 1994., Proceedings of the Seventh International Conference on. 1994, 229.
[21] Goslin, G. Using Xilinx FPGAs to design custom digital signal processing devices. 1995, 565.
[22] Grant, David, Wang, Chris, and Lemieux, Guy G.F. A CAD framework for Malibu: an FPGA with time-multiplexed coarse-grained elements. Proceedings of the 19th ACM/SIGDA international symposium on Field programmable gate arrays. FPGA '11. New York, NY, USA: ACM, 2011, 123.
[23] Gunther, B., Milne, G., and Narasimhan, L. Assessing document relevance with run-time reconfigurable machines. FPGAs for Custom Computing Machines, 1996. Proceedings. IEEE Symposium on. 1996, 10.
[24] Hammerquist, M. and Lysecky, R. Design space exploration for application specific FPGAs in system-on-a-chip designs. SOC '08: Proceedings of the IEEE International SOC Conference. 2008, 279.
[25] Hayes, J.P. A unified switching theory with applications to VLSI design. Proceedings of the IEEE 70 (1982). 10: 1140-1151.
[26] Kapre, Nachiket, Mehta, Nikil, deLorimier, Michael, Rubin, Raphael, Barnor, Henry, Wilson, Michael J., Wrighton, Michael, and DeHon, Andre. Packet-Switched vs. Time-Multiplexed FPGA Overlay Networks. Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines. 2006.
[27] Koch, Andreas. Structured design implementation: a strategy for implementing regular datapaths on FPGAs. FPGA '96: Proceedings of the 1996 ACM fourth international symposium on Field-programmable gate arrays. New York, NY, USA: ACM, 1996, 151.
[28] Koehler, S., Stitt, G., and George, A.D. Performance visualization and exploration for reconfigurable computing applications. ????
[29] Landy, Aaron and Stitt, Greg. A low-overhead interconnect architecture for virtual reconfigurable fabrics. Proceedings of the 2012 international conference on Compilers, architectures and synthesis for embedded systems. CASES '12. New York, NY, USA: ACM, 2012, 111.
[30] Lavin, C., Padilla, M., Lamprecht, J., Lundrigan, P., Nelson, B., and Hutchings, B. HMFlow: Accelerating FPGA Compilation with Hard Macros for Rapid Prototyping. Field-Programmable Custom Computing Machines (FCCM), 2011 IEEE 19th Annual International Symposium on. 2011, 117.
[31] Lemoine, E. and Merceron, D. Runtime reconfiguration of FPGA for scanning genomic databases. FPGAs for Custom Computing Machines, 1995. Proceedings. IEEE Symposium on. 1995, 90.
[32] Lysaght, Patrick, Stockwood, Jon, Law, J., and Girma, D. Artificial Neural Network Implementation on a Fine-Grained FPGA. Proceedings of the 4th International Workshop on Field-Programmable Logic and Applications: Field-Programmable Logic, Architectures, Synthesis and Applications. FPL '94. London, UK: Springer-Verlag, 1994, 421.
[33] Lysecky, Roman, Miller, Kris, Vahid, Frank, and Vissers, Kees. Firm-core Virtual FPGA for Just-in-Time FPGA Compilation (abstract only). Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays. FPGA '05. New York, NY, USA: ACM, 2005, 271.
[34] Marshall, Alan, Stansfield, Tony, Kostarnov, Igor, Vuillemin, Jean, and Hutchings, Brad. A reconfigurable arithmetic array for multimedia applications. FPGA '99: Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 1999, 135.
[35] McDonald, E.J. Runtime FPGA Partial Reconfiguration. Aerospace Conference, 2008 IEEE. 2008, 1.
[36] McMurchie, Larry and Ebeling, Carl. PathFinder: a negotiation-based performance-driven router for FPGAs. FPGA '95: Proceedings of the 1995 ACM Third International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 1995, 111.
[37] Mulpuri, Chandra and Hauck, Scott. Runtime and quality tradeoffs in FPGA placement and routing. FPGA '01: Proceedings of the 2001 ACM/SIGDA Ninth International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 2001, 29.
[38] Murgai, R., Nishizaki, Y., Shenoy, N., Brayton, R.K., and Sangiovanni-Vincentelli, A. Logic synthesis for programmable gate arrays. Design Automation Conference, 1990. Proceedings., 27th ACM/IEEE. 1990, 620.
[39] Roth, J. Paul and Karp, R.M. Minimization Over Boolean Graphs. IBM Journal of Research and Development 6 (1962). 2: 227.
[40] Sekanina, Lukas. Evolvable Systems: From Biology to Hardware, chap. Virtual Reconfigurable Circuits for Real-World Applications of Evolvable Hardware. Springer Berlin/Heidelberg, 2003, 116.
[41] Shukla, Sunil, Bergmann, Neil W., and Becker, Jurgen. QUKU: A Two-Level Reconfigurable Architecture. ISVLSI '06: Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures. Washington, DC, USA: IEEE Computer Society, 2006, 109.
[42] Stitt, G. and Coole, J. Intermediate Fabrics: Virtual Architectures for Near-Instant FPGA Compilation. Embedded Systems Letters, IEEE 3 (2011). 3: 81.
[43] Tsu, William, Macy, Kip, Joshi, Atul, Huang, Randy, Walker, Norman, Tung, Tony, Rowhani, Omid, George, Varghese, Wawrzynek, John, and DeHon, Andre. HSRA: high-speed, hierarchical synchronous reconfigurable array. FPGA '99: Proceedings of the 1999 ACM/SIGDA seventh international symposium on Field programmable gate arrays. New York, NY, USA: ACM, 1999, 125.
[44] Villasenor, J., Schoner, B., Chia, Kang-Ngee, Zapata, C., Kim, Hea Joung, Jones, C., Lansing, S., and Mangione-Smith, B. Configurable computing solutions for automatic target recognition. FPGAs for Custom Computing Machines, 1996. Proceedings. IEEE Symposium on. 1996, 70.
[45] Wang, J., Chen, Q.S., and Lee, C.H. Design and implementation of a virtual reconfigurable architecture for different applications of intrinsic evolvable hardware. Computers & Digital Techniques, IET 2 (2008). 5: 386.
[46] Wirthlin, Michael J. and Hutchings, Brad L. Improving Functional Density Through Run-Time Constant Propagation. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays. 1997, 86.
[47] Yiannacouras, Peter, Steffan, J. Gregory, and Rose, Jonathan. VESPA: portable, scalable, and flexible FPGA-based vector processors. Proceedings of the 2008 international conference on Compilers, architectures and synthesis for embedded systems. CASES '08. New York, NY, USA: ACM, 2008, 61.

BIOGRAPHICAL SKETCH

Aaron Landy received the Bachelor of Science in Electrical Engineering degree from the University of Texas at Austin in 2011, with specialization in Computer Architecture and Embedded Systems. While at the University of Texas, he worked under Dr. Derek Chiou to implement a low-overhead in-situ debugging framework for FPGA applications. In 2011, he worked in Post-Silicon Validation for the Atom System-on-Chip at Intel Corporation in Austin, Texas. He joined the NSF Center for High Performance Reconfigurable Computing (CHREC) at the University of Florida as a Ph.D. student and research assistant under Dr. Greg Stitt. Aaron received the Master of Science in Electrical Engineering degree from the University of Florida in 2013. His research interests include reconfigurable computing, computer architecture, and embedded systems. His current work focuses on FPGA tool flows and productivity, particularly fast place-and-route, high-level synthesis, and FPGA virtualization.