ICBR (Interdisciplinary Center for Biotechnology Research) and XSEDE (Extreme Science and Engineering Discovery Environment)


Material Information

Title:
ICBR (Interdisciplinary Center for Biotechnology Research) and XSEDE (Extreme Science and Engineering Discovery Environment)
Series Title:
Big Data, Little Data: Having it All. A Research Data & Data Management 2013 Workshop
Physical Description:
Presentation slides
Language:
English
Creator:
Gardner, Aaron
Publisher:
George A. Smathers Libraries, University of Florida
Place of Publication:
Gainesville, FL
Publication Date:

Notes

Abstract:
Presentation slides by a resource expert presenter at the Big Data, Little Data Workshop on Oct. 3, 2013.
General Note:
Data Management / Curation Task Force Materials (DMCTF Materials)

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Applicable rights reserved.
System ID:
AA00017906:00009

Full Text

PAGE 2

Data Overview at ICBR
We have O( )
Aaron Gardner (agardner@ufl.edu)

PAGE 3

Interdisciplinary Center for Biotechnology Research (ICBR)
Created in 1987 by the Florida Legislature
Organized under the Office of the Vice President for Research, the Institute of Food and Agricultural Sciences, the College of Medicine, and the College of Liberal Arts & Sciences
55 ICBR employees: 22% faculty, 45% full-time staff, and 33% postdoctoral associates, etc.
With more than 400 different services, ranging from custom and high-throughput DNA sequencing to electron microscopy and antibody development and production, ICBR is here to exceed the research needs of UF

PAGE 4

ICBR 25th Anniversary
Scientists and researchers at the University of Florida recognize the ICBR as the leading resource for a full range of life science technologies. Projects within the ICBR have ranged from developing new cultivars of plants and mapping the genome of ants to tracing the DNA of people across continents, analyzing genes to understand how social situations and health change our DNA, and developing cures for rare and cancerous diseases.

PAGE 5

ICBR Organizational Overview
Administration
Divisions: Bioinformatics, Cellomics, Genomics, Proteomics
Sections: Cyberinfrastructure, Special Projects

PAGE 6

Scientific Core Laboratories by Division
Bioinformatics: Bioinformatics
Cellomics: Electron Microscopy and Bioimaging, Flow Cytometry, Hybridoma
Genomics: Gene Expression, Genotyping, NextGen DNA Sequencing, Quantitative PCR, Sanger Sequencing
Proteomics: Mass Spectrometry, Protein Biomarkers

PAGE 7

Cyberinfrastructure Section
Formed in 2011; the product of over a decade of lessons learned with previous ICBR computing efforts
A section within Administration, with a renewed focus on enabling all labs and giving users additional freedom with the resources they use
Staff: Dustin White, Jeremy Jensen, Aaron Gardner, Omar Lopez, Alex Oumantsev
Current Infrastructure Quick Stats:
Compute: 1000 cores; computational resources are bare metal, virtual machine pools, and SMP
Storage: 600 TB; storage resources include a parallel tier, disk backup, and an archival tier
Networking: 1 & 10 GbE, QDR IB; network standardizing on Ethernet

PAGE 8

Data Drivers at ICBR: Historical
1987-1997: GCG Wisconsin Package running on a VAX. Interesting note: ICBR was funded from the beginning to support computation for the center within its core laboratory model.
1998-2004: High-throughput capillary sequencing; Sanger-scale tools that fit the demands of the time; convenient public interfaces to analysis results (LIMS, BlastQuest, etc.); Bioinformatics as a Service.
2005-2011: Next-generation sequencing (NGS) comes to ICBR; infrastructure for sequencers and for bioinformatics services increases in complexity and scale; the scale of data and computation required for analysis disrupts everything, everywhere.

PAGE 9

Data Drivers at ICBR (cont.)
NGS: DNA sequencing technologies change, costs drop dramatically, and core facilities large and small contend with generating new instrument data at an unprecedented scale. In the last decade the cost of sequencing a human-sized genome has fallen from $100M to $10K (a 10,000x cost reduction).

PAGE 10

Data Drivers at ICBR (cont.): Future
2013 onward: The lessons learned from NGS are proactively being applied to all other laboratories and disciplines within ICBR.
Cryo-electron microscopy (cryo-EM): We already have two instruments; the larger can generate up to 2 TB/day when running (see the sketch after this slide). Unlike NGS, the primary analysis algorithms are still in flux, so raw data still holds value when reanalyzed months or years later.
High-throughput proteomics is on the radar: We are already seeing orders of magnitude more MS/MS spectra on newer instruments, and the resources required for analysis are climbing as well.
Collaborations and shared resources: Preparing our center to participate in larger projects (personalized and translational medicine, epidemiology, etc.) is a priority. This is driving our push for a converged, on-demand infrastructure that can accommodate internal and external tenants. Forever?
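To put the cryo-EM rate in perspective, the short back-of-the-envelope sketch below (not part of the original slides) compares the stated 2 TB/day instrument output with the 600 TB storage figure quoted earlier; the duty-cycle value is a hypothetical assumption.

# Back-of-the-envelope estimate of cryo-EM data growth versus current storage.
# The 2 TB/day rate and the 600 TB storage tier come from the slides; the
# duty cycle (fraction of days the instrument actually runs) is hypothetical.

TB_PER_DAY = 2        # larger cryo-EM instrument output when running (from slides)
STORAGE_TB = 600      # current ICBR storage capacity (from slides)
DUTY_CYCLE = 0.5      # assumed: instrument runs half of all days (hypothetical)

tb_per_year = TB_PER_DAY * 365 * DUTY_CYCLE
days_to_fill = STORAGE_TB / (TB_PER_DAY * DUTY_CYCLE)

print(f"~{tb_per_year:.0f} TB/year from one instrument at a {DUTY_CYCLE:.0%} duty cycle")
print(f"~{days_to_fill:.0f} days to fill the {STORAGE_TB} TB storage tier at that rate")

Even under that conservative assumption, a single instrument could fill the entire current storage tier in well under two years, which is why cryo-EM is listed alongside NGS as a data driver.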

PAGE 11

What we are experimenting with for DM: iRODS
The Integrated Rule-Oriented Data System (iRODS) aims at managing massive distributed data.
Started as an open-source initiative in 2004 with the help of $20M in NSF funding; a collaboration between UNC and RENCI.
Applications: data grids, institutional repositories, archives, genomics, astronomy, high-energy physics.
Scalable to billions of files and petabytes of data across tens of federated data grids.

PAGE 12

What is iRODS?
iRODS, the Integrated Rule-Oriented Data System, is a project for building next-generation data management cyberinfrastructure.
One of the main ideas behind iRODS is to provide a system that enables a flexible, adaptive, customizable data management architecture.
At the iRODS core, a Rule Engine interprets the Rules to decide how the system responds to various requests and conditions.
Interfaces: GUI, Web, WebDAV, CLI.
Operations: search, access and view, add/extract metadata, annotate, analyze and process, manage, replicate, copy, share, repurpose, control and track access, subscribe, and more (a short scripting sketch follows this slide).
iRODS Server software and the Rule Engine run on each data server. The iRODS iCAT Metadata Catalog uses a database to track metadata describing the data and everything that happens to it.
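To illustrate the kind of operations listed above (putting data into the grid and attaching metadata that the iCAT catalog tracks), here is a minimal sketch using the python-irodsclient package; that client is not mentioned in the slides, and the host, zone, credentials, and paths are hypothetical placeholders.

# Minimal sketch: upload a file into an iRODS data grid and tag it with
# metadata tracked by the iCAT catalog. Assumes the python-irodsclient
# package; host, zone, credentials, and paths are hypothetical.
from irods.session import iRODSSession

with iRODSSession(host="irods.example.org", port=1247,
                  user="agardner", password="changeme", zone="icbrZone") as session:
    logical_path = "/icbrZone/home/agardner/run42/sample.fastq"

    # Put the local file into the grid (stored on whatever resource the rules choose).
    session.data_objects.put("sample.fastq", logical_path)

    # Attach descriptive metadata; the iCAT catalog records it alongside the object.
    obj = session.data_objects.get(logical_path)
    obj.metadata.add("instrument", "hiseq2000")
    obj.metadata.add("project", "maize-drought")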

PAGE 13

Challenges With This Model
Need for single-instance resources capable of dealing with big data.
We will now need multitenancy capabilities even as a single organization.
We must minimize latency to better utilize limited resources; a massive-scalability approach might not be suitable for a small or midsize research environment with legacy codes, inexperienced users, etc.
Future Vision
Currently, management of our computational resources and data is strictly controlled, with CI members fully supporting all services. This minimizes service downtime, prevents data breaches, etc., but it comes at a cost: we are no longer as agile as we need to be.
Solution: Move to a three-level service and support model:
1. Users instantiate resources on demand to which they have privileged access, but no support is offered beyond clearing hang-ups.
2. Common-use services requiring a higher degree of reliability and/or security are built and managed by us upon request, with unprivileged access provided to users.
3. Core computational services are still supported end to end by us, and provided for consumption by resources in the previous two levels.

PAGE 14

Converged Infrastructure Will Help Us Get There
Infrastructure as a Service (IaaS): multitenancy for big data.
Need for software-defined networking (SDN). Example: the movement toward OpenFlow and SDN at UF (I2IP). A unified fabric? (or at least an abstracted one)
We need massive single-instance scalability in terms of compute cycles, main memory, IOPS, and storage throughput (an edge case for the public cloud, the general case for us).
Scalable building blocks handling both data throughput and computation simultaneously (extending the in-storage processing concept) allow big data performance for even small and midsize infrastructures, minimize latency in the stack, and let data processing become LUN to LUN.
DevOps and sharing of infrastructure code is the way to support tenants (local staff, partner institutions, students, etc.) while still giving them control of and freedom with their resources. Examples: Galaxy Toolshed, Chef Cookbooks, Rackspace Private Cloud (OpenStack). A brief provisioning sketch follows this slide.
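To make the IaaS and on-demand tenancy idea concrete, below is a rough sketch of a tenant provisioning a compute instance against an OpenStack-based private cloud (such as the Rackspace Private Cloud mentioned above) using the openstacksdk Python library; the library choice, cloud entry, image, flavor, and network names are assumptions, not something prescribed by the slides.

# Rough sketch: a tenant spins up an on-demand compute instance on an
# OpenStack private cloud. Uses the openstacksdk library (an assumption);
# cloud entry, image, flavor, and network names are hypothetical.
import openstack

# Credentials and endpoint come from a clouds.yaml entry named "icbr-private" (assumed).
conn = openstack.connect(cloud="icbr-private")

image = conn.compute.find_image("bio-analysis-base")   # hypothetical image name
flavor = conn.compute.find_flavor("m1.large")          # hypothetical flavor name
network = conn.network.find_network("tenant-net")      # hypothetical tenant network

server = conn.compute.create_server(
    name="ngs-analysis-01",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(f"{server.name} is {server.status}")

The point is less the specific API than the service model from the previous slide: tenants get privileged, self-service access at the first level, while common-use and core services remain centrally built and supported.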

PAGE 15

Converged Infrastructure Will Help Us Get There (cont.)
Converged infrastructure is not as necessary when you are:
1. Hiring lots of smart people and committing their time to infrastructure
2. Attacking a single or small set of large problems
3. Rarely revalidating or reintegrating your HW stack after deployment
This is because tying your platform closely to a mixed and disparate hardware stack costs staff time to explore reintegration and revalidation issues and to rewrite code for new architectures. That can work for hyper-giants and single-service efforts, but it breaks down for legacy and vendor-controlled codes, flexible infrastructure, and infrastructure for yet-unknown or unsolved problems.
An iRODS-embedded platform helps us pilot this idea and provides a place to checkpoint and stabilize "wild wild west" resource usage, establish authoritative data, and enable controlled data sharing between tenants.
Conclusion: Converging the hardware stack lets you focus more on the software stack, and having a software stack that is portable and reproducible across infrastructures and organizational boundaries is where we want to go.

PAGE 16

Aaron Gardner (agardner@ufl.edu)
Cyberinfrastructure Section Director
Interdisciplinary Center for Biotechnology Research
biotech.ufl.edu

PAGE 18

XSEDE at a Glance
October 8, 2013
Aaron Gardner (agardner@ufl.edu)
Campus Champion, University of Florida

PAGE 19

What is XSEDE?
The Extreme Science and Engineering Discovery Environment (XSEDE) is the most advanced, powerful, and robust collection of integrated digital resources and services in the world. It is a single virtual system that scientists can use to interactively share computing resources, data, and expertise.

PAGE 20

What is XSEDE?
A 5-year, $121 million project supported by the NSF.
Replaces and expands on the NSF TeraGrid project; more than 10,000 scientists used TeraGrid.
XSEDE continues the same sort of work as TeraGrid with an expanded scope and a broader range of fields and disciplines.
Leadership-class resources at partner sites combine to create an integrated, persistent computational resource.
Allocated through a national peer review process. Free* (see the allocation slides later in the deck).


PAGE 23

What is Cyberinfrastructure?
"Cyberinfrastructure is a technological solution to the problem of efficiently connecting data, computers, and people with the goal of enabling derivation of novel scientific theories and knowledge." [1]
The term was used by the NSF Blue Ribbon committee in 2003 in response to the question of what barriers stand in the way of the rapid evolution of high-performance computing, making it truly usable by all the nation's scientists, engineers, scholars, and citizens. The TeraGrid [2] is the NSF's response to this question.
Cyberinfrastructure is also called e-Science. [3]
[1] Source: Wikipedia
[2] More properly, the TeraGrid in its current form: the Extensible Terascale Facility
[3] Source: NSF

PAGE 24

Who Uses XSEDE?
Physics (91) 19%; Molecular Biosciences (271) 17%; Astronomical Sciences (115) 13%; Atmospheric Sciences (72) 11%; Materials Research (131) 9%; Chemical, Thermal Systems (89) 8%; Chemistry (161) 7%; Scientific Computing (60) 2%; Earth Sciences (29) 2%; Training (51) 2%
More than 2 billion CPU hours allocated; 1400 allocations; 350 institutions; 32 research domains

PAGE 25

XSEDE Supports a Breadth of Research
From direct contact with the user community as part of requirements collection: earthquake science and civil engineering, molecular dynamics, nanotechnology, plant science, storm modeling, epidemiology, particle physics, economic analysis of phone network patterns, brain science, analysis of large cosmological simulations, DNA sequencing, computational molecular sciences, neutron science, and international collaboration in cosmology and plasma physics.
This is a sampling of a much larger set. Many examples are new to the use of advanced digital services; they range from petascale to disjoint HTC, and many are data driven. XSEDE will support thousands of such projects.

PAGE 26

Campus Champion Institutions (revised September 3, 2013)
Standard: 82
EPSCoR states: 49
Minority Serving Institutions: 12
EPSCoR states and Minority Serving Institutions: 8
Total Campus Champion institutions: 151

PAGE 27

Who Uses XSEDE? Spider Silk
PI: Markus Buehler; Institution: MIT
"We found that the structure of spider silk at the nanoscale can explain why this material is as strong as steel, even though the glue of the hydrogen bonds holding spider silk together at the molecular level is 100 to 1,000 times weaker than steel's metallic bonds," says Buehler.
Excerpted from TeraGrid Science Highlights 2010

PAGE 28

Data Mining and Text Analysis
PI: Sorin Matei, David Braun; Institution: Purdue University
Purdue researchers led by Sorin Adam Matei are analyzing the entire collection of articles produced in Wikipedia from 2001 to 2008, and all their revisions, a computationally demanding task made possible by TeraGrid resources. "We looked at how article production is distributed across users' contributions relative to each other over time. The work includes visualizations of patterns to make them easier to discern," says Matei.
Excerpted from TeraGrid Science Highlights 2010

PAGE 29

XSEDE Science Gateways for Bioinformatics
Web-based Science Gateways provide access to XSEDE.
CIPRES provides: BEAST, GARLI, MrBayes, RAxML, MAFFT.
High-performance, parallel applications run at SDSC.
4000 users of CIPRES and hundreds of journal citations.
*Adapted from information provided by Wayne Pfeiffer, SDSC

PAGE 30

Who Uses XSEDE? iPlant
Science goals by 2015: A major emerging computational problem is deducing phenotype from genotype, e.g. QTL (Quantitative Trait Locus) mapping: accurate prediction of traits (such as drought tolerance for maize) based on genetic sequence, using data collected in hundreds of labs around the world and stored in dozens of online distributed databases.
Infrastructure needs: This data-driven, petascale, combinatoric problem requires high-speed access to both genotypic and phenotypic databases (distributed at several sites). XSEDE will provide the coordinated networking and workflow tools needed for this type of work.

PAGE 31

Brain Science: Connectome
Science goals by 2015: Capture, process, and analyze ~1 mm^3 of brain tissue, reconstructing the complete neural wiring diagram at full synaptic resolution, and present the resulting image data repository to the national community for analysis and visualization.
Infrastructure needs: High-throughput transmission electron microscopy (TEM); high-resolution images of sections must be processed, registered (taking warping into account), and assembled for viewing. Raw data (>6 PB) must be archived, and TEM data must be streamed in near real time at a sustained ~1 GB/s, resulting in ~3 PB of co-registered data (a rough timescale sketch follows this slide). As with all large datasets that researchers throughout the country will want to access, XSEDE's data motion, network tuning, and campus bridging capabilities will be invaluable.
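As a quick back-of-the-envelope check on those numbers (not part of the original slide), the sketch below estimates how long the raw data volume takes to move at the stated sustained rate; decimal units are an assumption.

# Back-of-the-envelope: time to stream >6 PB of raw TEM data at ~1 GB/s.
# The 6 PB and 1 GB/s figures come from the slide; decimal units
# (1 PB = 1,000,000 GB) are assumed for simplicity.

RAW_DATA_PB = 6
RATE_GB_PER_S = 1.0

seconds = RAW_DATA_PB * 1_000_000 / RATE_GB_PER_S
days = seconds / 86_400

print(f"{RAW_DATA_PB} PB at {RATE_GB_PER_S} GB/s is about {days:.0f} days of sustained streaming")

That works out to roughly 70 days of continuous streaming, which is why sustained data motion and network tuning, rather than raw compute alone, dominate the infrastructure needs here.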

PAGE 32

What Resources Does XSEDE Offer?

PAGE 33

Data Storage and Transfer
SDSC Gordon: SSD system with fast storage
NCSA Mass Storage System: http://www.ncsa.illinois.edu/UserInfo/Data/MSS
NICS HPSS: http://www.nics.utk.edu/computing-resources/hpss/
Easy data transfer: in-browser SFTP or SCP clients through the Portal; SSH
Standard data transfer: SCP to move data in/out of XSEDE systems (requires SSH key setup); rsync to move data in (a scripted example follows this slide)
High-performance data transfer: Globus Online, https://www.globusonline.org/
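For the standard SCP/SFTP path above, here is a minimal scripted equivalent using the paramiko Python library; the library, the login hostname, the username, and the paths are all assumptions for illustration, and in practice one would usually just run scp or rsync directly.

# Minimal sketch: push a dataset to a remote login node over SFTP, mirroring
# the scp-based workflow described in the slide. Uses the paramiko library
# (an assumption); hostname, username, key, and paths are hypothetical.
import paramiko

client = paramiko.SSHClient()
client.load_system_host_keys()  # the host must already be in known_hosts
# XSEDE systems require SSH key setup; point at the key registered for the resource.
client.connect("login.example-resource.xsede.org",
               username="agardner",
               key_filename="/home/agardner/.ssh/id_rsa")

sftp = client.open_sftp()
sftp.put("results.tar.gz", "/scratch/agardner/results.tar.gz")  # local -> remote
sftp.close()
client.close()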

PAGE 34

Support Resources
Local Campus Champion
Centralized XSEDE help: help@xsede.org
Extended one-on-one help (ECSS): https://www.xsede.org/ecss
Training: http://www.xsede.org/training

PAGE 35

Other Resources
Science Gateways, Extended Support, Open Science Grid, FutureGrid, Blue Waters (NCSA), Titan (ORNL/NICS), ALCF (Argonne), Hopper (NERSC)

PAGE 36

Tap into community knowledge
Extended Collaborative Support Service (ECSS)
Resources with complementary characteristics to those found locally
Extending your network of collaborators
The XSEDE community (noticing a theme yet?)


PAGE 38

Campus Champion: get your feet wet with XSEDE; < 10k CPU hours; 2-day lead time
Start Up: benchmark and gain experience with resources; 200k CPU hours; 2-week lead time
Education: class and workshop support; short term (1 week to 6 months)
XSEDE Research Allocation (XRAC): up to 10M CPU hours; 10-page request, 4-month lead time
https://www.xsede.org/how-to-get-an-allocation

PAGE 39

Steps to Getting Your Allocation
Step One: Campus Champion Allocation. Log onto the Portal and get an account, then send your Campus Champion (me!) your portal account ID.
Step Two: Start-Up/Educational Allocation. Sign up for a startup account and do benchmarking.
Step Three: XRAC. Requires a written proposal and CVs.

PAGE 40

Campus Champion Role Summary
What I will do for you:
Help set up your XSEDE portal account
Get you acquainted with accessing XSEDE systems
Walk you through the allocations process
Answer XSEDE-related questions that I have the answers to
Get you plugged in with the right people from XSEDE when I don't
Pass issues and feedback to the community with high visibility
Help evangelize XSEDE at events you host or directly with your colleagues
What I will not do for you:
Fix code
Babysit production jobs
Plug the government back in

PAGE 41

Acknowledgements & Contact
Presentation content: thank you to Jeff Gardner (University of Washington), Kim Dillman (Purdue University), other XSEDE Campus Champions, and the XSEDE community at large.
Aaron Gardner, agardner@ufl.edu


