Monday, November 30, 2015

AWS Week in Review – November 23, 2015

Let’s take a quick look at what happened in AWS-land last week:

Monday

November 23

Tuesday

November 24

Wednesday

November 25

Thursday

November 26

Friday

November 27

Saturday

November 28

Sunday

November 29

New & Notable Open Source

New SlideShare Presentations

New Customer Success Stories

New YouTube Videos

Upcoming Events

Help Wanted

Stay tuned for next week! In the meantime, follow me on Twitter and subscribe to the RSS feed.

Jeff;

Thursday, November 26, 2015

Early “Black Friday” ?? YGBSM! (*)

After a nice Thanksgiving Day dinner with my mother, my brother and his wife, my daughter with her husband and son, my wife and I decided to make a quick trip to Walmart to pick up some odds & ends. First problem, it was rainy with poor visibility. Second problem, the entire area …

Tuesday, November 24, 2015

New AWS Quick Start – Sitecore

Sitecore is a popular enterprise content management system that also includes a multi-channel marketing automation component with an architecture that is a great fit for the AWS cloud! It allows marketers to deliver a personalized experience that takes into account the customers’ prior interaction with the site and the brand (they call this feature Context Marketing).

Today we are publishing a new Sitecore Quick Start Reference Deployment. This 19-page document will show you how to build an AWS cluster that is fault-tolerant and highly scalable. It builds on the information provided in the Sitecore Scaling Guide and recommends an architecture that uses the Amazon Relational Database Service (RDS), Elastic Load Balancing, and Auto Scaling.

Using the AWS CloudFormation template referenced in the Quick Start, you can launch Sitecore into an Amazon Virtual Private Cloud in a matter of minutes. The template creates a fully functional deployment of Sitecore 7.2 that runs on Windows Server 2012 R2. The production configuration runs in two Availability Zones:

You can use the template as-is, or you can copy it and then modify it as you see fit. If you decide to do this, the new CloudFormation Visual Designer may be helpful:
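If you prefer to launch the template from the command line instead of the Console, a minimal sketch looks like this (the stack name, template URL, and parameter names are placeholders; consult the Quick Start guide for the actual values):

```shell
# Launch the Sitecore Quick Start stack from its CloudFormation template
# (template URL and parameters below are placeholders - see the Quick Start guide)
aws cloudformation create-stack \
    --stack-name sitecore-quickstart \
    --template-url https://example.com/sitecore-template.json \
    --capabilities CAPABILITY_IAM \
    --parameters ParameterKey=KeyPairName,ParameterValue=my-key-pair
```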

The Quick Start includes directions for setting up a test server along with some security guidelines. It also discusses the use of Amazon CloudFront to improve site performance and AWS WAF to help improve application security.

Jeff;

Now Available – EC2 Dedicated Hosts

Last month, I announced that we would soon be making EC2 Dedicated Hosts available. As I wrote at the time, this model allows you to control the mapping of EC2 instances to the underlying physical servers. Dedicated Hosts allow you to:

  • Bring Your Own Licenses – You can bring your existing server-based licenses for Windows Server, SQL Server, SUSE Linux Enterprise Server, and other enterprise systems and products to the cloud. Dedicated Hosts provide you with visibility into the number of sockets and physical cores that are available so that you can obtain and use software licenses that are a good match for the actual hardware.
  • Help Meet Compliance and Regulatory Requirements – You can allocate Dedicated Hosts and use them to run applications on hardware that is fully dedicated to your use.
  • Track Usage – You can use AWS Config to track the history of instances that are started and stopped on each of your Dedicated Hosts. This data can be used to verify usage against your licensing metrics.
  • Control Instance Placement – You can exercise fine-grained control over the placement of EC2 instances on each of your Dedicated Hosts.

Available Now
I am happy to announce that Dedicated Hosts are available now and that you can start using them today. You can allocate them from the AWS Management Console, the AWS Command Line Interface (CLI), the AWS Tools for Windows PowerShell, or via code that makes calls to the AWS SDKs.

Let’s provision a Dedicated Host and then launch some EC2 instances on it via the Console! I simply open up the EC2 Console, select Dedicated Hosts in the left-side navigation bar, and click on Allocate a Host.

I choose the instance type (Dedicated hosts for M3, M4, C3, C4, G2, R3, D2, and I2 instances are available), the Availability Zone, and the quantity (each Dedicated Host can accommodate one or more instances of a particular type, all of which must be the same size).

If I choose to allow instance auto-placement, subsequent launches of the designated instance type in the chosen Availability Zone are eligible for automatic placement on the Dedicated Host, and will be placed there if instance capacity is available on the host and the launch specifies a tenancy of Host without naming a particular Host. If I do not allow auto-placement, I must specifically target this Dedicated Host when I launch an instance.
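The same allocation can be scripted with the AWS CLI; here's a minimal sketch (the instance type, Availability Zone, and quantity are illustrative):

```shell
# Allocate one m4.large Dedicated Host in us-east-1a with auto-placement enabled
aws ec2 allocate-hosts \
    --instance-type m4.large \
    --availability-zone us-east-1a \
    --quantity 1 \
    --auto-placement on
```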

When I click Allocate host, I’ll receive confirmation that it was allocated:

Billing for the Dedicated Host begins at this point. The number and size of the instances running on it have no impact on the cost.

I can see all of my Dedicated Hosts at a glance. Selecting one displays detailed information about it:

As you can see, my Dedicated Host has 2 sockets and 24 cores. It can host up to 22 m4.large instances, but is currently not hosting any. The next step is to run some instances on my Dedicated Host. I click on Actions and choose Launch Instance(s) onto Host (I can also use the existing EC2 launch wizard):

Then I pick an AMI. Some AMIs (currently RHEL, SUSE Linux, and those which include Windows licenses) cannot be used with Dedicated Hosts, and cannot be selected in the screen below or from the AWS Marketplace:

The instance type is already selected:

Instances launched on a Dedicated Host must always reside within a VPC. A single Dedicated Host can accommodate instances that run in more than one VPC.

The remainder of the instance launch process proceeds in the usual way and I have access to the options that make sense when running on a Dedicated Host. You cannot, for example, run Spot instances on a Dedicated Host.

I can also choose to target one of my Dedicated Hosts when I launch an EC2 instance in the traditional way. I simply set the Tenancy option to Dedicated host and choose one of my Dedicated Hosts (I can also leave it set to No preference and have AWS make the choice for me):

If I select Affinity, a persistent relationship will be created between the Dedicated Host and the instance. This gives you confidence that the instance will restart on the same Host, and minimizes the possibility that you will inadvertently run licensed software on the wrong Host. If you import a Windows Server image (to pick one that we expect to be popular), you can keep it assigned to a particular physical server for at least 90 days, in accordance with the terms of the license.
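From the CLI, targeting a particular Dedicated Host with affinity can be sketched like this (the AMI ID and host ID below are placeholders):

```shell
# Launch an instance onto a specific Dedicated Host with host affinity,
# so that the instance always restarts on the same physical server
aws ec2 run-instances \
    --image-id ami-12345678 \
    --instance-type m4.large \
    --placement "Tenancy=host,HostId=h-0123456789abcdef0,Affinity=host"
```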

I can return to the Dedicated Hosts section of the Console, select one of my Hosts, and learn more about the instances that are running on it:

Using & Tracking Licensed Software
You can use your existing software licenses on Dedicated Hosts. Verify that the terms allow the software to be used in a virtualized environment, and use VM Import/Export to bring your existing machine images into the cloud. To learn more, read about Bring Your Own License in the EC2 Documentation. To learn more about Windows licensing options as they relate to AWS, read about Microsoft Licensing on AWS and our detailed Windows BYOL Licensing FAQ.

You can use AWS Config to record configuration changes for your Dedicated Hosts and the instances that are launched, stopped, or terminated on them. This information will prove useful for license reporting. You can use the Edit Config Recording button in the Console to change the settings (hovering your mouse over the button will display the current status):

To learn more, read about Using AWS Config.
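Once recording is enabled, the configuration history for a host can be retrieved programmatically; a sketch, assuming a Config resource type of AWS::EC2::Host (the host ID is a placeholder):

```shell
# Retrieve the recorded configuration history for a Dedicated Host
# (requires AWS Config recording to be enabled for the account)
aws configservice get-resource-config-history \
    --resource-type AWS::EC2::Host \
    --resource-id h-0123456789abcdef0
```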

Some Important Details
As I mentioned earlier, billing begins when you allocate a Dedicated Host. For more information about pricing, visit the Dedicated Host Pricing page.

EC2 automatically monitors the health of each of your Dedicated Hosts and communicates it to you via the Console. The state is normally available; it switches to under-assessment if we are exploring a possible issue with the Dedicated Host.

Instances launched on Dedicated Hosts must always reside within a VPC, but cannot make use of Placement Groups. Auto Scaling is not supported, and neither is RDS.

Dedicated Hosts are available in the US East (Northern Virginia), US West (Oregon), US West (Northern California), Europe (Ireland), Europe (Frankfurt), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (Brazil) regions. You can allocate up to 2 Dedicated Hosts per instance family (M4, C4, and so forth) per region; if you need more, just ask.

Jeff;

AWS Week in Review – November 16, 2015

Let’s take a quick look at what happened in AWS-land last week:

Monday

November 16

Tuesday

November 17

Wednesday

November 18

Thursday

November 19

Friday

November 20

New & Notable Open Source

  • goofyfs is a filey (their terminology) system for S3.
  • aws-sdk-perl is an attempt to build a complete AWS SDK in Perl.
  • aws-ses-recorder is a set of Lambda functions to process SES.
  • flywheel is a proxy for AWS.
  • aws-sdk-config-loader is an AWS config file loader for the CLI tools.
  • caravan is a lightweight Python framework for SWF.
  • rusoto is a set of AWS client libraries for Rust.
  • ng-upload-s3 is an AngularJS directive to upload files directly to S3.
  • aws-templates is a collection of custom CloudFormation templates.
  • ec2-browser is an EC2 browser.
  • Consigliere is an AWS Trusted Advisor dashboard that supports multiple accounts.

New SlideShare Presentations

New Customer Success Stories

New YouTube Videos

Upcoming Events

Help Wanted

Stay tuned for next week! In the meantime, follow me on Twitter and subscribe to the RSS feed.

Jeff;

Friday, November 20, 2015

New – Saved Reports for the AWS Cost Explorer

The AWS Cost Explorer allows you to explore and forecast your AWS costs (read The New Cost Explorer for AWS to learn more). You can use Cost Explorer’s built-in filtering and grouping facilities to analyze your expenditures by Account, Service, Tag, Availability Zone, Purchase Option, and API Operation. For example, here’s a quick look at my personal AWS account, with charges grouped by service:

Earlier this month we added a new feature that allows you to save your Cost Explorer reports. After I create the report above, I can save it by entering a new name (Monthly Spend by Service) and clicking on Save report:

Then I can see the built-in reports, along with the ones that I have created, in the menu:

As you can see from the menu, I also created a report named Daily Spend by Service. I can view it by choosing it from the menu. The reports are saved on a per-account basis. They can be accessed by the “root” account and by any IAM users that have the proper permissions.

I spent some time exploring my own personal expenditures, and found that it was illustrative to explore my costs on a per-API basis. I can actually see the cost of the resources created by each API call:

The tall blue bar on the right indicates the charge that I incurred when I renewed one of the many domain names that I own.

Use it Now
This functionality was released earlier this month. If you have not used Cost Explorer before, you will need to enable it for your account (read Enabling Cost Explorer to learn more).

Jeff;

Thursday, November 19, 2015

Amazon EMR Update – Apache Spark 1.5.2, Ganglia, Presto, Zeppelin, and Oozie

My colleague Jon Fritz wrote the guest post below to introduce you to the newest version of Amazon EMR.

Jeff;


Today we are announcing Amazon EMR release 4.2.0, which adds support for Apache Spark 1.5.2, Ganglia 3.6 for Apache Hadoop and Spark monitoring, and new sandbox releases for Presto (0.125), Apache Zeppelin (0.5.5), and Apache Oozie (4.2.0).

New Applications in Release 4.2.0
Amazon EMR provides an easy way to install and configure distributed big data applications in the Hadoop and Spark ecosystems on managed clusters of Amazon EC2 instances. You can create Amazon EMR clusters from the Create Cluster page in the AWS Management Console, from the AWS Command Line Interface (CLI), or by using an SDK with the EMR API. In the latest release, we added support for several new versions of applications:

  • Spark 1.5.2 – Spark 1.5.2 was released on November 9th, and we’re happy to give you access to it within two weeks of general availability. This version is a maintenance release, with improvements to Spark SQL, SparkR, the DataFrame API, and miscellaneous enhancements and bug fixes. Also, Spark documentation now includes information on enabling wire encryption for the block transfer service. For a complete set of changes, view the JIRA. To learn more about Spark on Amazon EMR, click here.
  • Ganglia 3.6 – Ganglia is a scalable, distributed monitoring system which can be installed on your Amazon EMR cluster to display Amazon EC2 instance level metrics which are also aggregated at the cluster level. We also configure Ganglia to ingest and display Hadoop and Spark metrics along with general resource utilization information from instances in your cluster, and metrics are displayed in a variety of time spans. You can view these metrics using the Ganglia web-UI on the master node of your Amazon EMR cluster. To learn more about Ganglia on Amazon EMR, click here.
  • Presto 0.125 – Presto is an open-source, distributed SQL query engine designed for low-latency queries on large datasets in Amazon S3 and the Hadoop Distributed Filesystem (HDFS). Presto 0.125 is a maintenance release, with optimizations to SQL operations, performance enhancements, and general bug fixes. To learn more about Presto on Amazon EMR, click here.
  • Zeppelin 0.5.5 – Zeppelin is an open-source interactive and collaborative notebook for data exploration using Spark. You can use Scala, Python, SQL, or HiveQL to manipulate data and visualize results. Zeppelin 0.5.5 is a maintenance release, and contains miscellaneous improvements and bug fixes. To learn more about Zeppelin on Amazon EMR, click here.
  • Oozie 4.2.0 – Oozie is a workflow designer and scheduler for Hadoop and Spark. This version now includes Spark and HiveServer2 actions, making it easier to incorporate Spark and Hive jobs in Oozie workflows. Also, you can create and manage your Oozie workflows using the Oozie Editor and Dashboard in Hue, an application which offers a web-UI for Hive, Pig, and Oozie. Please note that in Hue 3.7.1, you must still use Shell actions to run Spark jobs. To learn more about Oozie in Amazon EMR, click here.

Launch an Amazon EMR Cluster with Release 4.2.0 Today
To create an Amazon EMR cluster with 4.2.0, select release 4.2.0 on the Create Cluster page in the AWS Management Console, or use the release label emr-4.2.0 when creating your cluster from the AWS CLI or using an SDK with the EMR API.
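For example, a CLI launch of a small cluster with the applications discussed above might look like this (the cluster name, key pair, instance type, and count are illustrative, and the sandbox application names reflect their status in this release):

```shell
# Create a 3-node EMR 4.2.0 cluster with Spark, Ganglia, and the
# Presto, Zeppelin, and Oozie sandbox applications
aws emr create-cluster \
    --name "My EMR 4.2.0 Cluster" \
    --release-label emr-4.2.0 \
    --applications Name=Spark Name=Ganglia Name=Presto-Sandbox \
                   Name=Zeppelin-Sandbox Name=Oozie-Sandbox \
    --ec2-attributes KeyName=my-key-pair \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles
```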

Jon Fritz, Senior Product Manager

Wednesday, November 18, 2015

New AWS Public Data Sets – TCGA and ICGC

My colleagues Angel Pizarro and Ariel Gold wrote the incredibly interesting guest post below.

— Jeff;


Today we are pleased to announce that qualified researchers can now access two of the world’s largest collections of cancer genome data at no cost on AWS as part of the AWS Public Data Sets program. Providing access to these petabyte-scale genomic data as shared resources on AWS lowers the barrier to entry, thus expanding the research community and accelerating the pace of research and discovery in the development of new treatments for cancer patients.

The Cancer Genome Atlas (TCGA) corpus of raw and processed genomic, transcriptomic, and epigenomic data from thousands of cancer patients is now freely available on Amazon S3 to users of the Cancer Genomics Cloud, a cloud pilot program funded by the National Cancer Institute and powered by the Seven Bridges Genomics platform.

The International Cancer Genome Consortium (ICGC) PanCancer dataset generated by the Pancancer Analysis of Whole Genomes (PCAWG) study is also now available on AWS, giving cancer researchers access to over 2,400 consistently analyzed genomes corresponding to over 1,100 unique ICGC donors. These data will also be freely available on Amazon S3 for credentialed researchers subject to ICGC data sharing policies.

These two data sets represent the first controlled-access genomic data that have been redistributed to the wider research audience on the cloud. Previously, researchers needed to download and store their own copies of the data before they could begin their experiments. Now, with this data hosted on AWS for the community, researchers can begin their work right away. Researchers will also have access to a broader toolset hosted and shared by the community within AWS. This translates into a much lower barrier to entry and more time for science.

Making these data and tools available in the cloud will also enable a greater level of collaboration across research groups, since they will have a common place to access and share data. Finally, researchers will also be able to securely bring their own data and tools into AWS, and combine these with the existing public data for more robust analysis. No-cost data access, a broader set of available tools, and increased collaborative capabilities will enable researchers to focus on their science and not infrastructure, allowing them to get more done in shorter periods, and ultimately accelerating the pace of research and discovery in the study of cancer.

Accessing TCGA and ICGC on AWS
The difference between TCGA and ICGC, and previously released AWS Public Data Sets such as the National Institutes of Health (NIH) 1000 Genomes Project, Genome in a Bottle (GIAB), and the 3000 Rice Genome, is the need to limit access to researchers who have gone through a review process for their intended use of the data. Because of this requirement, access to TCGA and ICGC on AWS will be administered by our third-party partners, Seven Bridges Genomics and the Ontario Institute for Cancer Research, respectively. These partners have the rights to redistribute the data on behalf of the original data providers. The partners will also curate and update the data over time, as well as develop a community of users who can share cloud-based tools and best practices in order to accelerate use of the data and advance our understanding of cancer.

You can learn more about the data sets, and specifics on how to access them, on our TCGA on AWS page and ICGC on AWS page.

Tools and Resources for Working with the Data
The TCGA data will be available to users of the Cancer Genomics Cloud (CGC). Researchers can apply for early access here. Once accepted, users will be able access the data via the CGC Web portal or use the CGC’s API for programmatic access to the data. The CGC will have a set of data analysis pipelines already integrated into the platform so that users can start working right away with the most common toolsets.

The ICGC data will be generally accessible via the use of a downloadable command line tool. Users can search for files using the ICGC Data Portal and access individual or related sets of alignment and variant files through the ICGC Storage Client. The alignments and a selection of Sanger somatic variant calls are currently available in Amazon S3. Further variant calls will be released following additional quality checking, validation, and analysis. For more information see the ICGC on the Cloud page and ICGC Storage Client documentation.

As always, when working with sensitive genomics data on AWS, you should take care to secure your storage and computational resources. The Architecting for Genomic Data Security and Compliance in AWS whitepaper is a good starting point if you are unfamiliar with the service features and tools necessary to work with data in a secure manner. Genomics platforms such as the CGC take care to meet these types of requirements as part of their value proposition. For example, DNAnexus has provided user documentation on how to leverage the ICGC Storage Client within their platform here.

Recognizing that it is no easy task to work with data at this scale, the PCAWG group is also releasing the PanCancer Launcher. This is an open source system that creates EC2 instances, enqueues the analysis work items, triggers Docker-based analysis pipelines, and cleans up the launched resources as computational tasks complete.

Currently, the PanCancer Launcher includes support for the BWA-mem-based alignment pipeline and its associated quality control steps. Future releases will expand support for the variant calling pipelines created by the project, which encompass current best-practice variant calling pipelines from four academic organizations: the German Cancer Research Center (DKFZ), the European Molecular Biology Laboratory (EMBL) in Heidelberg, the Wellcome Trust Sanger Institute, and the Broad Institute. You can read more about how to leverage the PanCancer Launcher in the Launcher HOWTO Guide.

Genomics in the Era of Cloud Computing
It has been interesting to witness the parallel evolution of genomics and cloud computing over the last decade. Both have been driven by new technologies that leverage economies of scale. Both have fundamentally changed the types of questions that can be asked simply because we can now collect and analyze the data in the same place.

The genomics research community, which has witnessed its storage and compute requirements double overnight when new chemistry kits are released, realized long ago that scalable cloud computing models are a better fit than large capital purchases that have to be planned for and amortized over 3-5 years. Today, it is common practice to work with data sets that reach into the hundreds of terabytes, and a few important ones, like TCGA and ICGC, that reach into the petabytes. For genomics, the cloud has become the new normal for how science gets done.

You can learn more about how genomics thought leaders are innovating in the genomics field through the use of cloud in this new video:

Be sure to also visit the Scientific Computing on AWS and Genomics on AWS pages for more user stories and tools.

Thank You
We’d like to thank our collaborators at the Ontario Institute for Cancer Research and Seven Bridges Genomics who helped us launch these public data sets and will be curating the data, administering access, and cultivating the ecosystem of tools around them. We look forward to working with many more organizations and researchers who will share their expertise and tools in order to accelerate the development of new treatments for cancer patients. Tell us how you’re using the data via the TCGA on AWS and ICGC on AWS pages and sign up for project updates.

Angel Pizarro (AWS Scientific Computing) and Ariel Gold (AWS Public Data Sets)