Machine learning with Amazon Rekognition and Elasticsearch – Skedler Blog

I published a new blog post on the Skedler Blog.
In the post we are going to see how to build a machine learning system that performs image recognition and uses Elasticsearch as a search engine to search for the labels identified within the images.

The components that we used are the following:

  • Elasticsearch
  • Kibana
  • Skedler Reports and Alerts
  • Amazon S3 bucket
  • Amazon Simple Queue Service (you could optionally replace this with AWS Lambda)
  • Amazon Rekognition

System Architecture:


You can read the full post here: Machine learning with Amazon Rekognition and Elasticsearch

Please share the post and let me know your feedback.

Simple Elasticsearch monitor with AWS Lambda and Amazon QuickSight

Recently I needed a simple dashboard to monitor, visualize, and aggregate some Elasticsearch metrics, and I had heard about Amazon QuickSight.
Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data.
You can upload CSV or Excel files; ingest data from AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, Amazon S3, and Amazon EMR (Presto and Apache Spark); or connect to databases like SQL Server, MySQL, and PostgreSQL.

The Elasticsearch metrics I want to monitor are the following:

  • Indices Health (by color)
  • Indices Status
  • Number of documents
  • Storage size
  • Indices by size and health

Since the dashboard had to be really simple, I did not want to manage any infrastructure at all, so I went for a serverless architecture.
If the word serverless sounds completely new to you or you want to read more, here you can find some useful information:

The components of my architecture are the following:

  • AWS Lambda (Python 3.6)
  • Amazon S3
  • Amazon QuickSight
  • Elasticsearch 5.6.3 – Lucene 6.6.1 (my ES cluster is deployed on Elastic Cloud; I assume you have your own cluster deployed somewhere)

I assume you know about the components listed above; if not, please read about them before going further.

The goal of the AWS Lambda function is to fetch the Elasticsearch metrics from the cluster and store two CSV files (one for indices metrics and one for cluster metrics) in Amazon S3 (the Lambda execution is scheduled daily). Once the CSVs have been uploaded to S3, the QuickSight dashboard fetches them and displays the metrics we need.

To deploy the Lambda function, I used the Serverless framework (version 1.23.0 – npm 5.4.2 – node v6.11.4). Serverless is a toolkit for deploying and operating serverless architectures. I assume you know the Serverless framework (how to write a serverless configuration file and how to deploy/invoke a function) and that you have installed and configured it.

Let’s start by defining the serverless .yaml configuration file.
We define a new function get_es_stats scheduled to run every 24 hours. We create a set of environment variables (related to the ES cluster details and S3 bucket).
Note that we need to define an iamRoleStatements block to allow the Lambda function to write to the S3 bucket.

I am using the serverless-python-requirements plugin to install the Python requirements (note the plugins and custom sections in the configuration below).
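A minimal sketch of what the serverless.yml could look like follows; the service name, region defaults, credentials, bucket name, and environment variable names are placeholders, not the original values:

service: es-metrics-monitor            # placeholder service name

provider:
  name: aws
  runtime: python3.6
  environment:
    ES_HOST: https://your-cluster.example.com:9243    # placeholder cluster endpoint
    ES_USER: elastic                                   # placeholder credentials
    ES_PASSWORD: changeme
    S3_BUCKET: your-metrics-bucket                     # placeholder bucket name
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject
      Resource: "arn:aws:s3:::your-metrics-bucket/*"

plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    dockerizePip: false

functions:
  get_es_stats:
    handler: handler.get_es_stats
    events:
      - schedule: rate(24 hours)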

Once the serverless configuration is defined, let's create the Python function that fetches the Elasticsearch metrics and posts them to S3.

I ran some performance tests and decided not to use the Python Elasticsearch library but to call the REST API of the ES cluster directly.
To fetch the indices stats, I used the cat indices endpoint: Elasticsearch cat indices

To fetch the cluster health, I used the cluster health endpoint: Elasticsearch Cluster Health
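Below is a minimal sketch of what the handler (handler.py) could look like; the environment variable names match the placeholders used in the configuration sketch above, and the basic-auth and CSV-serialization details are my assumptions, not the original code:

import csv
import io
import os

import boto3
import requests

ES_HOST = os.environ['ES_HOST']
ES_AUTH = (os.environ['ES_USER'], os.environ['ES_PASSWORD'])
S3_BUCKET = os.environ['S3_BUCKET']


def upload_csv(s3, rows, key):
    # Serialize a list of dictionaries to CSV and upload it to the S3 bucket
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=buffer.getvalue())


def get_es_stats(event, context):
    s3 = boto3.client('s3')

    # Indices metrics: /_cat/indices returns one entry per index
    indices = requests.get(ES_HOST + '/_cat/indices?format=json&bytes=b', auth=ES_AUTH).json()
    upload_csv(s3, indices, 'indices_stats.csv')

    # Cluster metrics: /_cluster/health returns a single JSON document
    health = requests.get(ES_HOST + '/_cluster/health', auth=ES_AUTH).json()
    upload_csv(s3, [health], 'cluster_stats.csv')

    return {'indices': len(indices)}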

We are now ready to deploy the function. The next step is to create a requirements.txt file with the project's Python dependencies.
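For the handler sketched above, requests should be the only external dependency to list (boto3 is already available in the Lambda runtime):

requests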

and then run serverless deploy to package and deploy the function,

and serverless invoke -f get_es_stats to manually invoke it.

Once you have run your function you will find two CSV files in your S3 bucket: indices_stats.csv


and cluster_stats.csv

Now you can create a QuickSight dashboard.

From the QuickSight page, create a new Data Set from S3 (when you create a new QuickSight account, be sure to set the right permissions to read from the S3 bucket).

Upload two Amazon S3 manifest files, one for the indices_stats.csv file and one for the cluster_stats.csv file. You use JSON manifest files to specify files in Amazon S3 to import into Amazon QuickSight.
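For reference, a manifest for the indices file could look roughly like this (adjust the URI to your own bucket and key; the bucket name below is the placeholder used earlier):

{
    "fileLocations": [
        {
            "URIs": [
                "https://s3.amazonaws.com/your-metrics-bucket/indices_stats.csv"
            ]
        }
    ],
    "globalUploadSettings": {
        "format": "CSV",
        "delimiter": ",",
        "containsHeader": "true"
    }
}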

Once you have created the two datasets, you will find them among the available datasets.

You can now create a new QuickSight analysis to show the collected metrics; here are a few examples of visualizations. You can schedule a refresh for the two datasets, so when the Lambda function updates the two CSVs on S3, QuickSight refreshes the sources and the dashboard is updated.

Block charts showing the indices by status and health.

Pie charts showing the indices by size and the total storage used.

Indices by number of documents and health, plus the number of nodes in the cluster and the active shards.


The goal of this post is to present a simple serverless architecture that shows a few Elasticsearch metrics in a simple dashboard. You can extend and improve this architecture by monitoring more metrics and building a richer QuickSight dashboard.

You can use this type of architecture when you do not have Kibana (and an X-Pack subscription) or you want a simple analytics system inside the AWS world.

List AWS S3 buckets with public ACLs

This morning AWS sent a reminder to its users about S3 buckets with access control lists (ACLs) configured to allow read access from any user on the Internet (public).
The email encourages you to promptly review your S3 buckets and their contents to ensure that you are not inadvertently making objects visible to users that you don’t intend.

Here you can find the full email that has been sent: Securing Amazon S3 Buckets.

Here you can find more details about the AWS S3 ACLs:

In this post I am not going to show how to secure an S3 bucket. For that, you can read the blog post How to secure an Amazon S3 Bucket by Mark Nunnikhoven @marknca (AWS Community Hero).
I will just quote a few sentences from the article.

  1. TL:DR; Just put everything in buckets & don’t make them PUBLIC!
  2. Amazon S3 buckets are private by default. You have to take explicit steps to allow public, unauthenticated access as in the case of these two leaks.
  3. Amazon S3 provides logical methods for controlling access to your data. Which method you use depends on your needs. Regardless of the method you choose, you should be regularly reviewing the current access to your data.

I wrote a Python script to list the S3 buckets with one of the following ACLs configured:

  • Public Access: List Objects
  • Public Access: Write Objects
  • Public Access: Read Bucket Permissions
  • Public Access: Write Bucket Permissions
  • Public Access: Full Control

I used the official AWS Python SDK Boto3 and Python 3.5.

Create a new connection to the S3 resource.
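Something along these lines, using the default credentials/profile configured for Boto3:

import boto3

# Connect to the S3 service as a high-level resource
s3 = boto3.resource('s3')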

Loop through the S3 buckets and the grants for each bucket.

We are looking for the grants assigned to a Group, in particular the AllUsers group (public).

Check the granted permission.
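Putting these steps together, here is a minimal sketch of the detection loop, assuming the s3 resource created above; the permission names map to the public ACLs listed earlier (READ, WRITE, READ_ACP, WRITE_ACP, FULL_CONTROL):

# Public grants are assigned to the AllUsers group through this URI
ALL_USERS_URI = 'http://acs.amazonaws.com/groups/global/AllUsers'

# ACL permissions that expose the bucket content or its permissions to everyone
PUBLIC_PERMISSIONS = ['READ', 'WRITE', 'READ_ACP', 'WRITE_ACP', 'FULL_CONTROL']

for bucket in s3.buckets.all():
    for grant in bucket.Acl().grants:
        grantee = grant['Grantee']
        if grantee['Type'] == 'Group' and grantee.get('URI') == ALL_USERS_URI:
            if grant['Permission'] in PUBLIC_PERMISSIONS:
                print('Bucket {} has a public ACL: {}'.format(bucket.name, grant['Permission']))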

Once we find a public ACL we can log the information or optionally write some code to remove it (Boto3 Put Bucket ACL).

Here you can find the full code shown in this post: S3-public-ACLs-finder
In the repository version, a message is posted to Slack when a bucket with public ACLs is found.

Feel free to improve the code (maybe adding other notification systems beside Slack) and share it.

——— Update 31 August 2017 ———

As @marknca noticed (check the Tweet thread here), this script does not check the S3 bucket policy, so it is still possible to grant public S3 actions to a bucket using a policy.
A few extra lines allow you to check the bucket policy and look for the actions granted to anonymous (public) users.
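A possible sketch of that check follows; the statement parsing is simplified (in a real policy the Principal element can take several forms), so treat it as a starting point:

import json

import boto3
import botocore

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

for bucket in s3.buckets.all():
    try:
        policy = json.loads(s3_client.get_bucket_policy(Bucket=bucket.name)['Policy'])
    except botocore.exceptions.ClientError:
        # The bucket has no policy attached
        continue
    for statement in policy.get('Statement', []):
        principal = statement.get('Principal')
        # A wildcard principal means anonymous (public) access
        if statement.get('Effect') == 'Allow' and principal in ('*', {'AWS': '*'}):
            print('Bucket {} grants public actions: {}'.format(bucket.name, statement.get('Action')))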

Here you can find more details about public S3 policy and S3 actions:

Create Amazon Lightsail instance using Python

At the end of November (during the re:Invent 2016 event) AWS launched a new service: Amazon Lightsail.
Amazon Lightsail is the easiest way to launch and manage a virtual private server with AWS. With a couple of clicks you can launch a virtual machine pre-configured with SSD storage, DNS management, and a static IP address.
Today you can launch one of the following operating systems:

  • Amazon Linux AMI
  • Ubuntu

or one of the following developer stacks:

  • LAMP
  • LEMP
  • MEAN
  • Node.js

or one of the following applications:

  • Drupal
  • Joomla
  • Redmine
  • GitLab

The following instance plans are available:

[Screenshot: Lightsail instance plans]

In this post we are going to see how to launch a new Lightsail (MEAN stack) instance using the Python SDK.
If you want to read more about Amazon Lightsail, take a look at the following resources:

We are going to use Python 3.5 and the official Amazon SDK: Boto3 (Boto3 Documentation).
Define a new client for the Lightsail service:
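Something like this should do it:

import boto3

# Lightsail API calls must target the N. Virginia region (us-east-1)
lightsail = boto3.client('lightsail', region_name='us-east-1')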

Notice that Lightsail is available only in N. Virginia: Regions and Endpoints – Lightsail

We can now list and print all the available blueprints:
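A possible version of that listing, using the get_blueprints call:

# List the available blueprints (operating systems, developer stacks, applications)
blueprints = lightsail.get_blueprints()['blueprints']
for blueprint in blueprints:
    print(blueprint['blueprintId'], blueprint['name'], blueprint['version'])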

The output will look like this (we need the blueprint id to launch a new instance):

[Screenshot: list of available blueprints]
Besides the blueprint, to launch a new instance we also have to specify the instance type. We can get the available bundles using the get_bundles method:
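Something like this:

# List the available bundles (instance plans); we need the bundleId to launch an instance
bundles = lightsail.get_bundles()['bundles']
for bundle in bundles:
    print(bundle['bundleId'], bundle['ramSizeInGb'], bundle['cpuCount'], bundle['price'])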

For each bundle, this is the available information (we need the bundleId):
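The shape of a single bundle entry is roughly the following (the values below are illustrative, not actual API output):

# {
#     'bundleId': 'nano_1_0',
#     'name': 'Nano',
#     'price': 5.0,
#     'cpuCount': 1,
#     'ramSizeInGb': 0.5,
#     'diskSizeInGb': 20,
#     'transferPerMonthInGb': 1024,
#     'instanceType': 'nano',
#     'isActive': True
# }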

We can now launch a new Lightsail instance (in the example MEAN developer stack and nano instance):
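A sketch of the create_instances call; the instance name is made up, and the blueprint/bundle ids are placeholders to be replaced with the values returned by the two calls above:

response = lightsail.create_instances(
    instanceNames=['mean-dev-instance'],   # hypothetical instance name
    availabilityZone='us-east-1a',
    blueprintId='<mean-blueprint-id>',     # placeholder: MEAN blueprint id from get_blueprints
    bundleId='<nano-bundle-id>'            # placeholder: nano bundle id from get_bundles
)
print(response['operations'][0]['status'])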

From the Lightsail dashboard we can follow the launch process:

[Screenshot: Lightsail instance being created]
Once the instance is up and running we can connect using SSH (if not specified, the default account key-pair will be used) and deploy our MEAN application.

[Screenshot: Lightsail instance running]
Check out the public IP of your instance to see the default Bitnami MEAN Stack page.

[Screenshot: default Bitnami MEAN stack page]
In this post we saw how to use the AWS API to deploy a MEAN developer stack running in a Lightsail virtual private server.
Lightsail is a new service and is quickly evolving. I suggest you take a look at it because it is a super-fast way to launch a development stack for easy deployment (when you do not need high performance and scalability).