AWS Comprehend, Translate and Transcribe

At re:Invent 2017 AWS presented a lot of new services (read all the announcements here: re:Invent 2017 Product Announcements). In this post we are going to look at three new services related to language processing.

  • Amazon Comprehend
  • Amazon Translate
  • Amazon Transcribe

These new services are listed within the Machine Learning section.

Amazon Comprehend

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights and relationships in text. Amazon Comprehend identifies the language of the text; extracts key phrases, places, people, brands, or events; understands how positive or negative the text is; and automatically organizes a collection of text files by topic.

You can use the Amazon Comprehend APIs to analyze text and use the results in a wide range of applications including voice of customer analysis, intelligent document search, and content personalization for web applications.

The service constantly learns and improves from a variety of information sources, including Amazon.com product descriptions and consumer reviews – one of the largest natural language data sets in the world – to keep pace with the evolution of language.

You can read more about it here: AWS Comprehend, and here: Amazon Comprehend – Continuously Trained Natural Language Processing.
Watch the video from the AWS re:Invent Launchpad: Amazon Comprehend.

This service is already available; here you can find a few examples (using Boto3, the Python AWS SDK).

Instantiate a new client.
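A minimal sketch (the region name is an assumption; Comprehend is only available in a few regions):

```python
import boto3

# Create a new Comprehend client (assumes AWS credentials are configured)
comprehend = boto3.client('comprehend', region_name='us-east-1')
```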

Detect the dominant language in your text.
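For example, something like:

```python
text = 'Amazon Comprehend is a natural language processing service.'

# Returns the detected languages, each with a confidence score
response = comprehend.detect_dominant_language(Text=text)
print(response['Languages'])
```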

Detect the entities in your text.
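Entities come back with a type and a confidence score:

```python
# Returns the entities found in the text (PERSON, LOCATION,
# ORGANIZATION, EVENT, DATE, ...) with confidence scores
response = comprehend.detect_entities(Text=text, LanguageCode='en')
for entity in response['Entities']:
    print(entity['Type'], entity['Text'], entity['Score'])
```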

Detect the key phrases in your text.
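Key phrases work the same way:

```python
# Returns the key noun phrases found in the text, with scores
response = comprehend.detect_key_phrases(Text=text, LanguageCode='en')
for phrase in response['KeyPhrases']:
    print(phrase['Text'], phrase['Score'])
```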

Get the sentiment in your text.
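The sentiment call returns an overall label plus a score for each class:

```python
# Returns POSITIVE, NEGATIVE, NEUTRAL or MIXED, with per-class scores
response = comprehend.detect_sentiment(Text=text, LanguageCode='en')
print(response['Sentiment'], response['SentimentScore'])
```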

Amazon Translate

Amazon Translate is a neural machine translation service that delivers fast, high-quality, and affordable language translation. Neural machine translation is a form of language translation automation that uses machine learning and deep learning models to deliver more accurate and more natural sounding translation than traditional statistical and rule-based translation algorithms. Amazon Translate allows you to easily translate large volumes of text efficiently, and to localize websites and applications for international users.

The service is still in preview; watch the launch video here: AWS re:Invent 2017: Introducing Amazon Translate.

You can read more about it here: Introducing Amazon Translate – Real-time Language Translation.

Amazon Transcribe

Amazon Transcribe is an automatic speech recognition (ASR) service that makes it easy for developers to add speech to text capability to their applications. Using the Amazon Transcribe API, you can analyze audio files stored in Amazon S3 and have the service return a text file of the transcribed speech.

Amazon Transcribe can be used for lots of common applications, including the transcription of customer service calls and generating subtitles on audio and video content. The service can transcribe audio files stored in common formats, like WAV and MP3, with time stamps for every word so you can easily locate the audio in the original source by searching for the text. Amazon Transcribe is continually learning and improving to keep pace with the evolution of language.

The service is still in preview; watch the launch video here: AWS re:Invent 2017: Introducing Amazon Transcribe.
You can read more about it here: Amazon Transcribe – Accurate Speech To Text At Scale.

This is an example of how to use this service (code originally written by @jrhunt and taken from here).
Note that the API for Transcribe (while in preview) is subject to change, so this code may not reflect the final version of the API:
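Since the original snippet is not reproduced here, the following is a sketch of the call flow based on the Transcribe API as later documented (bucket, file, and job names are placeholders, and the preview API may have differed):

```python
import time

import boto3

transcribe = boto3.client('transcribe')

# Start an asynchronous transcription job for an audio file stored on S3
transcribe.start_transcription_job(
    TranscriptionJobName='my-first-transcription',
    LanguageCode='en-US',
    MediaFormat='mp3',
    Media={'MediaFileUri': 'https://s3.amazonaws.com/my-bucket/speech.mp3'},
)

# Poll until the job completes; the result contains a transcript file URI
while True:
    job = transcribe.get_transcription_job(
        TranscriptionJobName='my-first-transcription')['TranscriptionJob']
    if job['TranscriptionJobStatus'] in ('COMPLETED', 'FAILED'):
        break
    time.sleep(10)

if job['TranscriptionJobStatus'] == 'COMPLETED':
    print(job['Transcript']['TranscriptFileUri'])
```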

Output of the speech recognition:

I am looking forward to using these new services: they are easy to use, they integrate easily with the AWS world, and they can add powerful features to your applications.

Machine learning with Amazon Rekognition and Elasticsearch – Skedler Blog

I published a new blog post on the Skedler Blog.
In the post we see how to build a machine learning system that performs image recognition and uses Elasticsearch as a search engine to search for the labels identified within the images.

The components that we used are the following:

  • Elasticsearch
  • Kibana
  • Skedler Reports and Alerts
  • Amazon S3 bucket
  • Amazon Simple Queue Service (you can optionally replace this with AWS Lambda)
  • Amazon Rekognition

System Architecture:


You can read the full post here: Machine learning with Amazon Rekognition and Elasticsearch

Please share the post and let me know your feedback.

Simple Elasticsearch monitor with AWS Lambda and Amazon QuickSight

Recently I needed a simple dashboard to monitor, visualize, and aggregate some Elasticsearch metrics, and I had heard about Amazon QuickSight.
Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data.
You can upload CSV or Excel files; ingest data from AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, Amazon S3, and Amazon EMR (Presto and Apache Spark); connect to databases like SQL Server, MySQL, and PostgreSQL.

The Elasticsearch metrics I want to monitor are the following:

  • Indices Health (by color)
  • Indices Status
  • Number of documents
  • Storage size
  • Indices by size and health

Since the dashboard had to be really simple, I did not want to manage any infrastructure at all, so I opted for a serverless architecture.
If the word serverless sounds completely new to you or you want to read more, here you can find some useful information:

The components of my architecture are the following:

  • AWS Lambda (Python 3.6)
  • Amazon S3
  • Amazon QuickSight
  • Elasticsearch 5.6.3 – Lucene 6.6.1 (my ES Cluster is deployed on Elastic Cloud, I assume you have your own cluster deployed somewhere)

I assume you know about the previously listed components; if not, please read about them before going further.

The goal of the AWS Lambda function is to fetch the Elasticsearch metrics from the cluster and store two CSV files (one for indices metrics and one for cluster metrics) in Amazon S3 (the Lambda execution is scheduled daily). Once the CSVs have been uploaded to S3, the QuickSight dashboard fetches them and displays the metrics we need.

To deploy the Lambda function, I used the Serverless framework (version 1.23.0 – npm 5.4.2 – node v6.11.4). Serverless is your toolkit for deploying and operating serverless architectures. I assume you know about the Serverless framework (writing a serverless configuration file and deploying/invoking a function) and that you have installed and configured it.

Let’s start by defining the serverless.yml configuration file.
We define a new function, get_es_stats, scheduled to run every 24 hours, and a set of environment variables (related to the ES cluster details and the S3 bucket).
Note that we need to define an iamRoleStatements block to allow the Lambda function to write to the S3 bucket.

I am using the serverless-python-requirements plugin to install the Python requirements (note the plugins and custom sections).
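A minimal sketch of how the serverless.yml might look (the service name, environment variable names, and bucket name are placeholders):

```yaml
service: es-monitor

provider:
  name: aws
  runtime: python3.6
  environment:
    ES_HOST: https://my-cluster.example.com:9243
    ES_USER: elastic
    ES_PASSWORD: changeme
    S3_BUCKET: my-es-metrics-bucket
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject
      Resource: arn:aws:s3:::my-es-metrics-bucket/*

functions:
  get_es_stats:
    handler: handler.get_es_stats
    events:
      - schedule: rate(24 hours)

plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    dockerizePip: false
```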

Once we have defined the serverless configuration, let's create the Python function that fetches the Elasticsearch metrics and posts them to S3.

I ran some performance tests and decided not to use the Python Elasticsearch library but to call the REST API of the ES cluster directly.
To fetch the indices stats, I used the Elasticsearch cat indices endpoint.

To fetch the cluster health, I used the Elasticsearch Cluster Health endpoint.
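A sketch of the handler under these choices (the environment variable names and the CSV layout are assumptions; requests does the HTTP calls):

```python
import csv
import io
import os

import boto3
import requests

ES_HOST = os.environ['ES_HOST']
S3_BUCKET = os.environ['S3_BUCKET']
AUTH = (os.environ['ES_USER'], os.environ['ES_PASSWORD'])

s3 = boto3.client('s3')


def upload_csv(key, rows):
    # Serialize a list of dicts to CSV in memory and upload it to S3
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=S3_BUCKET, Key=key,
                  Body=buf.getvalue().encode('utf-8'))


def get_es_stats(event, context):
    # Per-index stats from _cat/indices (format=json returns a list of dicts)
    indices = requests.get(ES_HOST + '/_cat/indices?format=json',
                           auth=AUTH).json()
    # Cluster-level stats from _cluster/health (a single dict)
    cluster = requests.get(ES_HOST + '/_cluster/health', auth=AUTH).json()

    upload_csv('indices_stats.csv', indices)
    upload_csv('cluster_stats.csv', [cluster])
```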

We are now ready to deploy the function. Create a requirements.txt file with the following lines:
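For the handler sketched above this is likely just the HTTP client (boto3 is already available in the Lambda runtime):

```
requests
```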

and then run
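```
serverless deploy
```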

and this to manually invoke your function
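```
serverless invoke -f get_es_stats
```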

Once you have run your function, you will find two CSV files in your S3 bucket: indices_stats.csv and cluster_stats.csv.

Now you can create a QuickSight dashboard.

From the QuickSight page, create a new Data Set from S3 (when you create a new QuickSight account, be sure you have set the right permissions to read from the S3 bucket).

Upload two Amazon S3 manifest files, one for the indices_stats.csv file and one for the cluster_stats.csv file. You use JSON manifest files to specify files in Amazon S3 to import into Amazon QuickSight.
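For example, the manifest for indices_stats.csv might look like this (the bucket name is a placeholder):

```json
{
  "fileLocations": [
    {
      "URIs": [
        "https://s3.amazonaws.com/my-es-metrics-bucket/indices_stats.csv"
      ]
    }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "delimiter": ",",
    "containsHeader": "true"
  }
}
```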

Once you have created the two data sets, you will find them among the available data sets.

You can now create a new QuickSight analysis to show the collected metrics; here are a few examples of visualizations. You can schedule a refresh for the two data sets, so when the Lambda function updates the two CSVs on S3, QuickSight will refresh the sources and the dashboard will be updated.

Block charts showing the indices by status and health.

Pie chart showing the indices by their size and the total storage used.

Indices by number of documents and health, and the number of nodes in the cluster and active shards.


The goal of this post is to present a simple serverless architecture that shows a few Elasticsearch metrics in a simple dashboard. You can extend and improve this architecture by monitoring more metrics and creating a better QuickSight dashboard.

You can use this type of architecture when you do not have Kibana (and an X-Pack subscription) or when you want a simple analytics system inside the AWS world.

List AWS S3 buckets with public ACLs

This morning AWS sent a reminder to its users about S3 buckets with access control lists (ACLs) configured to allow read access from any user on the Internet (public).
The email encourages you to promptly review your S3 buckets and their contents to ensure that you are not inadvertently making objects visible to users that you don’t intend.

Here you can find the full email that has been sent: Securing Amazon S3 Buckets.

Here you can find more details about the AWS S3 ACLs:

In this post I am not going to show how to secure an S3 bucket. For that, you can read the blog post How to secure an Amazon S3 Bucket by Mark Nunnikhoven @marknca (AWS Community Hero).
I will quote just a few sentences from the article.

  1. TL:DR; Just put everything in buckets & don’t make them PUBLIC!
  2. Amazon S3 buckets are private by default. You have to take explicit steps to allow public, unauthenticated access as in the case of these two leaks.
  3. Amazon S3 provides logical methods for controlling access to your data. Which method you use depends on your needs. Regardless of the method you choose, you should be regularly reviewing the current access to your data.

I wrote a Python script to list the S3 buckets with one of the following ACLs configured:

  • Public Access: List Objects
  • Public Access: Write Objects
  • Public Access: Read Bucket Permissions
  • Public Access: Write Bucket Permissions
  • Public Access: Full Control

I used the official AWS Python SDK Boto3 and Python 3.5.

Create a new connection to the S3 resource.
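A minimal sketch (assumes AWS credentials are configured):

```python
import boto3

# Connect to the S3 resource
s3 = boto3.resource('s3')
```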

Loop through the S3 buckets and the grants for each bucket.
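Something like the following; the body of the loop is filled in by the next steps:

```python
# Iterate over all buckets in the account and over each bucket's ACL grants
for bucket in s3.buckets.all():
    for grant in bucket.Acl().grants:
        # each grant is a dict with 'Grantee' and 'Permission' keys,
        # inspected in the next steps
        ...
```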

We are looking for the ACLs assigned to a Group, in particular the All Users group (public).
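Inside the loop, a check like this (the URI below is the documented identifier of the AllUsers group):

```python
# URI identifying the public "All Users" group
ALL_USERS_URI = 'http://acs.amazonaws.com/groups/global/AllUsers'

grantee = grant['Grantee']
is_public = (grantee['Type'] == 'Group'
             and grantee.get('URI') == ALL_USERS_URI)
```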

Check the granted permission.
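The ACL permission can be mapped to the descriptions listed above:

```python
# Map the ACL permission names to the console wording listed above
PERMISSIONS = {
    'READ': 'List Objects',
    'WRITE': 'Write Objects',
    'READ_ACP': 'Read Bucket Permissions',
    'WRITE_ACP': 'Write Bucket Permissions',
    'FULL_CONTROL': 'Full Control',
}
permission = PERMISSIONS[grant['Permission']]
```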

Once we find a public ACL, we can log the information or write some code to remove it (Boto3 Put Bucket ACL).
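Still inside the loop, something like:

```python
if is_public:
    print('Bucket {} allows public access: {}'.format(
        bucket.name, permission))
    # To remediate, you could reset the ACL to private, e.g.:
    # bucket.Acl().put(ACL='private')
```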

Here you can find the full code shown in this post: S3-public-ACLs-finder
In the repository version of the code, a message is also posted to Slack when a bucket with public ACLs is found.

Feel free to improve the code (maybe adding other notification systems beside Slack) and share it.

——— Update 31 August 2017 ———

As @marknca noticed (check the Tweet thread here), this script does not check the S3 bucket policy, so it is still possible to grant public S3 actions to a bucket using a policy.
These few lines allow you to check the bucket policy and look for the actions granted to anonymous (public) users.
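A sketch of such a check, reusing the s3 resource from above (the statement parsing is simplified; Action can be a string or a list):

```python
import json

from botocore.exceptions import ClientError


def public_policy_actions(bucket_name):
    # Return the S3 actions that the bucket policy grants to anonymous users
    try:
        policy = json.loads(s3.BucketPolicy(bucket_name).policy)
    except ClientError:
        return []  # the bucket has no policy attached
    actions = []
    for statement in policy.get('Statement', []):
        principal = statement.get('Principal')
        if statement.get('Effect') == 'Allow' and principal in ('*', {'AWS': '*'}):
            action = statement.get('Action', [])
            actions.extend([action] if isinstance(action, str) else action)
    return actions
```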

Here you can find more details about public S3 policy and S3 actions: