This year's program incorporates a day of developer and administrator tutorials. You should attend the developer tutorial if you intend to build research applications that incorporate Globus services. If you are an HPC/campus computing administrator managing Globus endpoints, you should attend the introductory and advanced administration tutorials.
Tuesday, April 11
registration desk open
GlobusWorld 2017 - Opening Keynote
Steve Tuecke, Globus Co-founder | slides
Steve will review notable events in the evolution of the Globus service over the past year, and provide an update on future product direction.
Compute Canada: Experiences with Globus as National Research Infrastructure
Compute Canada entered into a partnership with Globus in 2014, and has deployed Globus file transfer and sharing tools at over twenty computational and storage-intensive sites across Canada, comprising a national research data transfer service. This talk will briefly describe Compute Canada’s data service, highlight how Globus was used to migrate more than 1.5 PB of data from aging legacy systems to new national storage systems, and explain how Compute Canada and the Canadian Association of Research Libraries have partnered to leverage Globus’s new search functionality and data publication software for a national-scale research data repository for Canada.
Enabling the Minnesota Supercomputing Institute's Large Data Archive System
In 2009, the Minnesota Supercomputing Institute (MSI), an organization that provides high-performance computing and large-scale storage for research groups at the University of Minnesota, applied for a National Institutes of Health (NIH) research grant. To apply for the grant, MSI was required to install a large data archive system, prompting it to seek out a new storage solution. The organization required a system that was reliable, dense, and highly scalable, leading it to purchase a Spectra tape library. Five years later, MSI upgraded its tape technology and incorporated a Spectra BlackPearl appliance into its environment. MSI’s most recent modification to its data center was the addition of Globus client software, which allows faculty to easily move data files between computers, servers, and the supercomputing facility using a simple browser. This prevents groundbreaking research efforts from stalling when IT issues arise. MSI’s current configuration enables the university to archive and share petabytes of information conveniently.
File Transfer Feedback from Globus Tools
We’ve all had the experience of file transfers having very different performance from day to day. How well a file transfer goes is generally out of the control of the end user – sometimes it’s speedy, and other times, not so much. Our group spends a lot of time understanding the behavior of networks, and trying to make sure end users get the performance they expect from them. What if there were a way to let us, and the folks at Globus, know that your transfer didn’t go well? And what if someone could look into what was happening behind the scenes, on the backbone networks that most users don’t even see?
A Collaborative Platform for Integrating AgroInformatics Data Using Globus
The field of agricultural informatics (AgroInformatics) is of growing interest to researchers in academia, private industry and governmental non-profits. Much of this interest is driven by an acute need to develop sustainable agricultural practices that optimize food production. Furthermore, new data sources and new high performance computational resources open up the opportunity for researchers to ask big questions concerning the role that genotype, the environment, management practices, and socioeconomic factors have in agricultural successes. Lastly, AgroInformatics research is becoming more collaborative, with organizations showing an increased interest in selectively sharing data in order to foster an environment in which questions related to such things as trait identification to improve crop yields can be asked on a global scale.
Globus at the University of Michigan and the Advanced Research Computing Organization
In this talk I will present how Globus is currently being deployed and used on campus in various research entities. Since research data storage is distributed across many different units on campus, management of Globus endpoints poses unique challenges. Advanced Research Computing (ARC) is the biggest Globus user, with more than 80% of usage flowing through one endpoint system, despite hosting only 5-10% of all storage. Most storage and endpoints are thus not under ARC's control or direct management. I will describe some of the issues this decentralized environment presents, as well as some of the solutions. Thoughts on future directions will be presented as well.
Login with XSEDE and Jetstream: Our Experiences with Globus Auth
The XSEDE community and the Jetstream cloud service provider are using Globus Auth to simplify and streamline user authentication and to add support for identity linking, especially of campus credentials. We will share what we've done and what we've learned.
Topic Modeling in the Cloud with Globus and CloudyCluster
In this lightning talk, we will present a case study on how topic modeling in the cloud can leverage Globus data transfers to simplify and facilitate computation for PLDA+ in AWS with CloudyCluster in under an hour.
Please join a table for informal conversation on a topic of interest. Globus staff will be spread across tables to participate in discussions.
Lightning Talks - Globus Labs
Introducing Globus Labs
An Ensemble-based Recommendation Engine for Scientific Data Transfers using Recurrent Neural Networks
Big data scientists face the challenge of locating valuable datasets across a network of distributed storage locations. We explore methods for recommending storage locations (“endpoints”) for users based on a range of prediction models including collaborative filtering and heuristics that consider available information such as user, institution, access history, endpoint ownership, and endpoint usage. We combine the strengths of these models by training a deep recurrent neural network on their predictions. Collectively we show, via analysis of historical usage from the Globus research data management service, that our approach can predict the next storage location accessed by users with 80.3% and 95.3% accuracy for top-1 and top-3 recommendations, respectively. Additionally, our heuristics can predict the endpoints that users will use in the future with over 75% precision and recall.
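As an illustration of what the top-1 and top-3 figures measure, here is a minimal sketch of a top-k accuracy calculation. The helper `top_k_accuracy`, the endpoint names, and the rankings are all made up for this example; they are not taken from the talk or from Globus usage data.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], actual: Sequence[str], k: int) -> float:
    """Fraction of cases where the true endpoint is among the first k recommendations."""
    if len(ranked) != len(actual):
        raise ValueError("ranked and actual must have the same length")
    if not actual:
        raise ValueError("need at least one case")
    hits = sum(1 for recs, truth in zip(ranked, actual) if truth in recs[:k])
    return hits / len(actual)


# Hypothetical ranked recommendations for four accesses (best guess first).
ranked = [
    ["ep-a", "ep-b", "ep-c"],
    ["ep-b", "ep-a", "ep-d"],
    ["ep-c", "ep-d", "ep-a"],
    ["ep-d", "ep-c", "ep-b"],
]
# Hypothetical endpoints the user actually accessed next.
actual = ["ep-a", "ep-a", "ep-a", "ep-x"]

top1 = top_k_accuracy(ranked, actual, k=1)  # only the first case is a hit
top3 = top_k_accuracy(ranked, actual, k=3)  # the first three cases are hits
```

Top-3 accuracy can never be lower than top-1 accuracy, because every top-1 hit is also a top-3 hit.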
Draining the Data Swamp
Scientists’ capacity to make use of existing data is predicated on their ability to find and understand those data. While significant progress has been made with respect to data publication, and indeed one can point to a number of well-organized and highly utilized data repositories, there remain many such repositories in which archived data are poorly described and thus impossible to use. We present Skluma—an automated system designed to process vast amounts of data and extract deeply embedded metadata, latent topics, relationships between data, and contextual metadata derived from related documents. We show that Skluma can be used to organize and index a large climate data collection that totals more than 500GB of data in over a half-million files.
Responsive Storage: Home Automation for Research Data Management
Exploding data volumes, coupled with the rapidly increasing rate of data acquisition and the need for yet more complex research processes, have placed a significant strain on researchers’ data management processes. It is not uncommon now for research data to flow through pipelines composed of dozens of different management, organization, and analysis processes, while simultaneously being distributed across a number of different storage systems. To alleviate these issues we propose adopting a home automation approach to managing data throughout its lifecycle. To do so, we have developed RIPPLE, a responsive storage architecture that allows users to express data management tasks using high level rules. RIPPLE monitors storage systems for events, evaluates rules, and uses serverless computing techniques to execute actions in response to these events. We evaluate our approach by examining two real-world projects and demonstrate that RIPPLE can automate many mundane and cumbersome data management processes.
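The event, rule, and action pattern described above can be sketched in a few lines. This is a toy illustration only, not RIPPLE's implementation: the `Event` and `Rule` classes, the `dispatch` function, and the example rules are hypothetical, and plain local functions stand in for serverless actions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Event:
    kind: str  # hypothetical event kinds, e.g. "created" or "modified"
    path: str


@dataclass(frozen=True)
class Rule:
    name: str
    condition: Callable[[Event], bool]  # when the rule applies
    action: Callable[[Event], str]      # what to do (stand-in for a serverless function)


def dispatch(event: Event, rules: List[Rule]) -> List[str]:
    """Run the action of every rule whose condition matches the event."""
    return [rule.action(event) for rule in rules if rule.condition(event)]


rules = [
    Rule(
        name="replicate-new-images",
        condition=lambda e: e.kind == "created" and e.path.endswith(".tif"),
        action=lambda e: f"transfer {e.path} to archive",
    ),
    Rule(
        name="notify-on-change",
        condition=lambda e: e.kind == "modified",
        action=lambda e: f"email owner about {e.path}",
    ),
]

# A newly created image matches only the first rule.
results = dispatch(Event(kind="created", path="/data/scan01.tif"), rules)
```

Keeping conditions and actions as separate callables means new rules can be added without changing the dispatch loop.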
Explaining Wide Area Data Transfer Performance
Increasing scientific data and worldwide science discovery collaboration require moving large amounts of data over wide area networks (WANs). End-to-end file transfers over WAN involve many subsystems and tunable application parameters that pose significant challenges for performance optimization. Performance models make it possible to evaluate resource configurations efficiently, allowing systems to identify an optimal or near-optimal parameter set for a given transfer requirement. Armed with log data for millions of Globus transfers involving billions of files and 100s of petabytes, we develop models that can be used to determine bottlenecks and predict transfer rates based on a combination of historical transfer data and current endpoint activity, without the need for online experiments on individual endpoints. Our work broadens understanding of factors that influence file transfer rate by clarifying relationships between achieved transfer rates, transfer characteristics, and various measures of endpoint load. We create profiles for endpoint CPU load, network interface card load, and transfer characteristics via extensive feature engineering, and show that these profiles can be used to explain large fractions of transfer performance. For 27,130 transfers over 30 heavily used source-destination pairs (“edges”), totaling 5191TB in 254 million files, we obtained median absolute percentage prediction errors (MdAPE) of 7.0% and 4.6% when using distinct linear and nonlinear models per edge, respectively. When using a single model for all edges, we obtain MdAPEs of 19% and 6.8%, respectively. These profiles are useful not only for this particular prediction task but also for optimization and explanation, providing new understanding of the impact of competing load on transfer rate. Their prediction can be used for distributed workflow scheduling and optimization.
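For readers unfamiliar with the metric, MdAPE is the median, over all transfers, of |actual - predicted| / |actual| expressed as a percentage. The sketch below computes it for a handful of invented transfer rates; the `mdape` helper and the numbers are illustrative assumptions, not data from the study.

```python
from statistics import median
from typing import Sequence


def mdape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Median absolute percentage error, in percent."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    # Absolute error of each prediction as a percentage of the observed value.
    errors = [abs(a - p) * 100.0 / abs(a) for a, p in zip(actual, predicted)]
    return median(errors)


# Hypothetical observed and predicted transfer rates (e.g. in MB/s).
observed = [100.0, 200.0, 400.0, 800.0, 1000.0]
predicted = [90.0, 210.0, 400.0, 1000.0, 950.0]

error = mdape(observed, predicted)  # per-transfer errors are 10%, 5%, 0%, 25%, 5%
```

Because it uses the median rather than the mean, MdAPE is less sensitive to a few badly mispredicted transfers, such as the 25% error in this example.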
The Materials Data Facility
The Materials Data Facility (MDF) operates two cloud-hosted services, data publication and data discovery, built on Globus services. These MDF services are built to promote open data sharing, self-service data publication and curation, and encourage data reuse, layered with powerful data discovery tools. The data publication service simplifies the process of copying data to a secure storage location, assigning data a citable persistent identifier, and recording custom (e.g., material, technique, or instrument specific) and automatically-extracted metadata in a registry while the data discovery service will provide advanced search capabilities (e.g., faceting, free text range querying, and full text search) against the registered data and metadata. The MDF services empower individual researchers, research projects, and institutions to publish research datasets, regardless of size, from distributed storage, and to interact with and discover published and indexed data and metadata via REST APIs to facilitate automation and analysis.
Integrating Globus with Jupyter
Project Jupyter supports interactive data science and scientific computing across several programming languages, providing a tool that enables the rapid sharing of computational science tools and methods. We see several areas where Globus can contribute to Jupyter to broaden the access and distribution of notebooks, along with enabling streamlined data discovery and access within the notebook environment. To begin, we’ve extended the current suite of Jupyter OAuth2 authenticators with one using the Globus Auth platform. Building on this we’re continuing to incorporate Globus Transfer and other services, and we’ll describe and demonstrate our progress.
Bridging Compute and Storage Infrastructures
The Compute and Data Environment for Science at Oak Ridge National Lab is providing compute and data infrastructure resources, coupled with experts, to create a new environment for scientific discovery. The CADES goal is to continually develop an environment that allows researchers to share data, among local and distributed resources, easily and in a performant manner. Through the Science DMZ architecture we can start to connect and abstract different infrastructures by deploying workflows that utilize data portal tools, allowing us to achieve that cohesive and performant environment across the lab. This talk will give a preview of the CADES environment, the Science DMZ architecture, and the workflows we are helping develop which utilize the Science DMZ and Globus.
Science DMZ Patterns for the Modern Research Data Portal
This talk will describe the modern research data portal, and how it can be built using Globus and the Science DMZ. The architectural enhancements over the legacy data portal model will be discussed, as well as current scalability of Globus endpoints at large-scale HPC facilities.
Shaping the User Experience
We interact with a wide range of objects and services every day. Some help us achieve our goals and even delight us while others impede our progress and raise our ire. We'll take a peek at how Globus approaches the process of improving the user experience.
Globus Professional Service Engagements
Last year, Globus created a dedicated professional services team: engineers that can aid organizations and projects in leveraging the Globus platform-as-a-service in building and supporting custom integrations. Rick will discuss current engagements and the roles that the professional services team could play in your projects.
The Globus Python SDK
The Globus Python SDK provides a powerful suite of tools for interacting with Globus Services and, in particular, handling authentication and authorization via Globus Auth. However, these tools are pluggable, and can be used to handle the use of Globus Auth with your own APIs. The core abstractions of the SDK are a simple object model of Authorizers and Clients. Authorizers handle authorization and recovery from “Unauthorized” API responses, and Clients use Authorizers to authorize their requests to a service. Learn how to extend the SDK with Custom Clients, and see how Globus is using this capability to add support for new services.
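The Authorizer and Client object model described above can be sketched in plain Python. The classes below are a hypothetical toy model written for this illustration; they are not the Globus SDK's classes or method names, and `fake_service` is a stand-in for a real HTTP transport.

```python
from typing import Callable, Dict, Optional


class TokenAuthorizer:
    """Toy authorizer: supplies a bearer token and can fetch a new one on demand."""

    def __init__(self, token: str, refresh: Optional[Callable[[], str]] = None) -> None:
        self._token = token
        self._refresh = refresh

    def authorization_header(self) -> str:
        return f"Bearer {self._token}"

    def handle_unauthorized(self) -> bool:
        """Try to recover from a 401; return True if the request should be retried."""
        if self._refresh is None:
            return False
        self._token = self._refresh()
        return True


class ToyClient:
    """Toy client: delegates authorization of each request to its authorizer."""

    def __init__(self, authorizer: TokenAuthorizer,
                 send: Callable[[str, Dict[str, str]], int]) -> None:
        self._authorizer = authorizer
        self._send = send  # stand-in transport that returns an HTTP status code

    def get(self, path: str) -> int:
        status = self._send(path, {"Authorization": self._authorizer.authorization_header()})
        # On "Unauthorized", let the authorizer attempt recovery, then retry once.
        if status == 401 and self._authorizer.handle_unauthorized():
            status = self._send(path, {"Authorization": self._authorizer.authorization_header()})
        return status


def fake_service(path: str, headers: Dict[str, str]) -> int:
    """Pretend service that accepts only one specific token."""
    return 200 if headers.get("Authorization") == "Bearer fresh-token" else 401


client = ToyClient(TokenAuthorizer("stale-token", refresh=lambda: "fresh-token"), fake_service)
status = client.get("/items")  # first attempt gets 401, the retry succeeds
```

Because the client only calls two methods on its authorizer, a different authorization scheme can be plugged in without changing the client.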
Ask What We Can Do for YOU!
We will describe our customer engagement strategy and the programs we are developing for research computing centers to better educate their users about Globus capabilities.
Palladium (Ground Floor)
Wednesday, April 12
registration desk open
Tutorial: Introduction to Globus for System Administrators
Led by: Vas Vasiliadis | slides
You will learn how to install and configure a Globus endpoint using Globus Connect Server. This session is targeted at system administrators, storage managers, and anyone who is tasked with maintaining Globus endpoints at their institution. The content will include a mix of presentation and hands-on exercises.
Tutorial: Advanced Globus Administration Topics
Led by: Vas Vasiliadis
This session is designed to address your specific Globus deployment issues. We will provide more detailed reviews of common deployment configurations such as multi-server data transfer nodes, using the management console with pause/resume rules, and integrating campus identity systems for streamlined user authentication.
Developer Tutorial: Building the Modern Research Data Portal, Part 1: Introduction and Transfer API
Led by: Rachana Ananthakrishnan | slides
We will introduce the Globus platform and describe how you can use Globus services to deliver unique data management capabilities in your applications. This will include:
You will use a Jupyter notebook to experiment with the Globus Transfer API, using it to manage endpoints, transfer and share files. We will also demonstrate a simple, yet fully-functional, application that leverages the Globus platform for data distribution and analysis.
Developer Tutorial: Building the Modern Research Data Portal, Part 2: Globus Auth and Sample Portal
Led by: Rachana Ananthakrishnan
We will introduce the Globus Auth API and demonstrate how it is used in a sample data portal. You will learn how to register an application with Globus Auth, authenticate using Globus Auth's OpenID Connect API, and access various authentication and authorization functions via sample Python scripts. We will also demonstrate how to directly access files from an endpoint using the Globus Connect HTTPS Endpoint Server.
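As a rough illustration of the first step in an OAuth2 / OpenID Connect authorization code flow, the sketch below builds the URL to which a portal would redirect a user's browser. The authorize endpoint shown is believed to be the Globus Auth one, but treat it as an assumption and check the Globus Auth documentation; the client ID, redirect URI, and the `build_authorize_url` helper are hypothetical placeholders, not material from the tutorial.

```python
import secrets
from typing import Sequence
from urllib.parse import urlencode

# Assumed Globus Auth authorize endpoint; verify against the official documentation.
AUTHORIZE_URL = "https://auth.globus.org/v2/oauth2/authorize"


def build_authorize_url(client_id: str, redirect_uri: str,
                        scopes: Sequence[str], state: str) -> str:
    """Build the browser redirect URL for an authorization code request."""
    query = urlencode({
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",    # ask for an authorization code
        "scope": " ".join(scopes),  # scopes are space-separated
        "state": state,             # opaque value echoed back, used to detect forgery
    })
    return f"{AUTHORIZE_URL}?{query}"


# Hypothetical values obtained when registering an application.
state = secrets.token_urlsafe(16)
url = build_authorize_url(
    client_id="my-registered-client-id",
    redirect_uri="https://portal.example.org/authcallback",
    scopes=["openid", "profile", "email"],
    state=state,
)
```

After the user logs in, the service redirects back to the redirect URI with a code, which the application exchanges for tokens in a separate request.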
poster session - hosted by Globus and NDS
reception with hors d'oeuvres and cash bar
The 7th National Data Service Consortium Workshop will be held at the Hotel Allegro immediately following GlobusWorld 2017. For more information please visit the NDS workshop site.
We are soliciting poster submissions to be presented at a joint session of the co-located NDS/GlobusWorld workshops, Chicago, April 12, 2017.