PyCon Nove – Speaker

PyCon Nove is the ninth edition of the Italian Python Conference.
The event will take place in Florence, 19–22 April 2018.

During the event I will be speaking about “Monitoring Python Flask Application Performances with Elasticsearch and Kibana”.

Here you can find the complete schedule: Pycon Nove Schedule and the abstract of my talk.

You can reserve your spot here: Pycon Registration.

PS: after the conference on Friday evening don’t miss the Elastic Meetup!

Hope to see you there 🙂

How to combine Text Analytics and Search using AWS Comprehend and Elasticsearch 6.0 – Skedler Blog

I published a new blog post on the Skedler Blog.
In the post we are going to see how to combine text analytics and search using AWS Comprehend and Elasticsearch 6.0.

The components that we are going to use are the following:

  • Amazon S3 bucket and Amazon Simple Queue Service
  • Amazon Comprehend
  • Elasticsearch 6.0
  • Elasticsearch Ingest Attachment Processor Plugin
  • Elasticsearch Ingest Geoip Processor Plugin
  • Kibana
  • Skedler Reports and Alerts

System Architecture:


 
You can read the full post here: How to combine Text Analytics and Search using AWS Comprehend and Elasticsearch 6.0

Please share the post and let me know your feedback.

Machine learning with Amazon Rekognition and Elasticsearch – Skedler Blog

I published a new blog post on the Skedler Blog.
In the post we are going to see how to build a machine learning system that performs image recognition and uses Elasticsearch as the search engine for the labels identified within the images.

The components that we used are the following:

  • Elasticsearch
  • Kibana
  • Skedler Reports and Alerts
  • Amazon S3 bucket
  • Amazon Simple Queue Service (you could replace this with AWS Lambda)
  • Amazon Rekognition

System Architecture:

 

You can read the full post here: Machine learning with Amazon Rekognition and Elasticsearch

Please share the post and let me know your feedback.

Simple Elasticsearch monitor with AWS Lambda and Amazon QuickSight

Recently I needed a simple dashboard to monitor some Elasticsearch metrics and visualize/aggregate them, so I decided to give Amazon QuickSight a try.
Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data.
You can upload CSV or Excel files; ingest data from AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, Amazon S3, and Amazon EMR (Presto and Apache Spark); connect to databases like SQL Server, MySQL, and PostgreSQL.

The Elasticsearch metrics I want to monitor are the following:

  • Indices Health (by color)
  • Indices Status
  • Number of documents
  • Storage size
  • Indices by size and health

Since the dashboard had to be really simple, I did not want to manage any infrastructure at all, so I opted for a serverless architecture.
If the word serverless sounds completely new to you, or you want to read more, there is plenty of useful information online.

The components of my architecture are the following:

  • AWS Lambda (Python 3.6)
  • Amazon S3
  • Amazon QuickSight
  • Elasticsearch 5.6.3 – Lucene 6.6.1 (my ES Cluster is deployed on Elastic Cloud, I assume you have your own cluster deployed somewhere)

I assume you know about the previously listed components; if not, please go online and read about them before going further.

The goal of the AWS Lambda function is to fetch the Elasticsearch metrics from the cluster and store two CSV files (one for indices metrics and one for cluster metrics) in Amazon S3 (the Lambda execution is scheduled daily). Once the CSVs have been uploaded to S3, the QuickSight dashboard fetches them and displays the metrics we need.

To deploy the Lambda function, I used the Serverless framework (version 1.23.0, npm 5.4.2, node v6.11.4). Serverless is a toolkit for deploying and operating serverless architectures. I assume you already know the Serverless framework (how to write a serverless configuration file and deploy/invoke a function) and have installed and configured it.

Let’s start by defining the serverless.yml configuration file.
We define a new function get_es_stats scheduled to run every 24 hours. We create a set of environment variables (related to the ES cluster details and the S3 bucket).
Note that we need to define an iamRoleStatements entry to allow the Lambda function to write to the S3 bucket.

I am using the serverless-python-requirements plugin to install the Python requirements (note the plugins and custom sections in the configuration sketched below).
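
Here is a minimal sketch of what such a configuration could look like. The service name, region, bucket name, and environment-variable names are placeholders of mine, not necessarily the ones used in the original setup:

```yaml
# serverless.yml - minimal sketch (names, region and bucket are placeholders)
service: es-monitor

provider:
  name: aws
  runtime: python3.6
  region: eu-west-1
  environment:
    ES_HOST: ${env:ES_HOST}          # e.g. https://<cluster-id>.eu-west-1.aws.found.io:9243
    ES_USER: ${env:ES_USER}
    ES_PASSWORD: ${env:ES_PASSWORD}
    S3_BUCKET: my-es-metrics-bucket
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject
      Resource: "arn:aws:s3:::my-es-metrics-bucket/*"

plugins:
  - serverless-python-requirements

custom:
  pythonRequirements:
    dockerizePip: false

functions:
  get_es_stats:
    handler: handler.get_es_stats
    events:
      - schedule: rate(24 hours)
```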

Once we have defined the serverless configuration, let’s create the Python function that fetches the Elasticsearch metrics and posts them to S3.

I ran some performance tests and decided not to use the Python Elasticsearch library, but to call the REST API of the ES cluster directly.
To fetch the indices stats, I used the following endpoint: Elasticsearch cat indices.

To fetch the cluster health, I used the following endpoint: Elasticsearch Cluster Health.
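
A sketch of such a handler might look like the following. The environment-variable names match the ones assumed in the serverless configuration above, and the CSV columns simply mirror whatever fields the two endpoints return:

```python
import csv
import io
import os

import boto3
import requests

# Environment variables as assumed in the serverless configuration above
ES_HOST = os.environ['ES_HOST']
ES_AUTH = (os.environ['ES_USER'], os.environ['ES_PASSWORD'])
S3_BUCKET = os.environ['S3_BUCKET']

s3 = boto3.client('s3')


def get_es_stats(event, context):
    # One JSON object per index (health, status, docs.count, store.size, ...)
    indices = requests.get(ES_HOST + '/_cat/indices?format=json&bytes=b',
                           auth=ES_AUTH).json()
    # A single JSON object with the cluster-level metrics (status, nodes, shards, ...)
    cluster = requests.get(ES_HOST + '/_cluster/health', auth=ES_AUTH).json()

    upload_csv(indices, 'indices_stats.csv')
    upload_csv([cluster], 'cluster_stats.csv')
    return {'indices': len(indices)}


def upload_csv(rows, key):
    """Serialize a list of flat dicts to CSV and upload it to S3."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    s3.put_object(Bucket=S3_BUCKET, Key=key, Body=buffer.getvalue().encode('utf-8'))
```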

We are now ready to deploy the function. Create a requirements.txt file listing the dependencies of the handler.
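
For the handler sketched above I assume requests is the only external dependency (boto3 already ships with the AWS Lambda Python runtime), so the file can be as small as:

```
# assumption: requests is the only external dependency of the handler sketch
requests
```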

and then run
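
```bash
# packages the function, its requirements and the IAM role into a CloudFormation stack
serverless deploy
```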

and this to manually invoke your function
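
```bash
# get_es_stats is the function name defined in the serverless configuration;
# --log prints the CloudWatch output of the invocation
serverless invoke -f get_es_stats --log
```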

Once you have run your function, you will find two CSV files in your S3 bucket: indices_stats.csv


and cluster_stats.csv

Now you can create a QuickSight dashboard.

From the QuickSight page, create a new Data Set from S3 (when you create a new QuickSight account, be sure you have set the right permissions to read from the S3 bucket).

Upload two Amazon S3 manifest files, one for the indices_stats.csv file and one for the cluster_stats.csv file. You use JSON manifest files to specify files in Amazon S3 to import into Amazon QuickSight.
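
A manifest for the indices file could look roughly like this; the bucket name is a placeholder of mine and the structure roughly follows the format described in the QuickSight documentation:

```json
{
  "fileLocations": [
    {
      "URIs": ["s3://my-es-metrics-bucket/indices_stats.csv"]
    }
  ],
  "globalUploadSettings": {
    "format": "CSV",
    "delimiter": ",",
    "containsHeader": "true"
  }
}
```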

Once you have created the two datasets, you will find them among the available datasets.

You can now create a new QuickSight analysis and show the collected metrics; here are a few examples of visualizations. You can schedule a refresh for the two datasets, so when the Lambda function updates the two CSVs on S3, QuickSight refreshes the sources and the dashboard is updated.

Block charts showing the indices by status and health.

Pie charts showing the indices by size and the total storage used.

Indices by number of documents and health, plus the number of nodes in the cluster and the active shards.

 

The goal of this post is to present a simple serverless architecture that shows a few Elasticsearch metrics in a dashboard. You can extend and improve this architecture by monitoring more metrics and building a richer QuickSight dashboard.

You can use this type of architecture when you do not have Kibana (and an X-Pack subscription) or you want a simple analytics system inside the AWS world.

Real-time Tweets geolocation visualization with Elasticsearch and Kibana region map

In Kibana version 5.5 a new type of chart has been added: Region Map.
Region maps are thematic maps in which boundary vector shapes are colored using a gradient: higher intensity colors indicate larger values, and lower intensity colors indicate smaller values. These are also known as choropleth maps.

In this post we are going to see how to use the Region Map to visualize the geolocation details of a stream of Tweets (consumed using the Twitter streaming API). Basically, we will show the location (by country) of a stream of Tweets on the map (higher intensity colors indicate a larger volume of Tweets).

You can read more about the Region Map in the Kibana documentation.

I am using Elasticsearch and Kibana version 5.5 on Ubuntu 14.04 and Python 3.4.

We are going to use the Twitter streaming API to consume the public data stream flowing through Twitter (set some hashtags/keywords to filter the tweets). Given the latitude and longitude (GeoJSON format) of each tweet (when available), we are going to use the Google Maps API (Geocoding) to get the country name (or code) from the latitude and longitude.
Once we have identified the country (given the latitude and longitude), we are going to index the Tweet to Elasticsearch and then visualize its location using the Kibana Region Map.
For each Tweet we are interested in the country (which represents the geographic location of the Tweet as reported by the user or client application), the text (for further queries) and the creation date (to filter our results).

First of all, define a new Elasticsearch mapping called tweet within the index tweetrepository:
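
Here is a minimal sketch of that mapping, created with the elasticsearch-py client; the text and created_at field names are my assumptions, while country must be a keyword to serve as the Region Map join field:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])

# Sketch of the tweet mapping: `country` must be a keyword to act as the
# join field; the other field names are illustrative assumptions.
es.indices.create(index='tweetrepository', body={
    'mappings': {
        'tweet': {
            'properties': {
                'country': {'type': 'keyword'},
                'text': {'type': 'text'},
                'created_at': {'type': 'date'}
            }
        }
    }
})
```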

Notice that the country field is a keyword field type (a field to index structured content such as email addresses, hostnames, status codes, zip codes or tags). It will be used as the join field (between the map and the terms aggregation) for the Region Map visualization.

Using Python and tweepy, we are going to read the public stream of Tweets.
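
A sketch of the streaming part with tweepy 3.x could look like this; the credentials and the tracked keywords are placeholders, and index_tweet is defined in the next snippet:

```python
import tweepy

# Placeholder Twitter application credentials
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET')


class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Keep only tweets that carry GeoJSON coordinates: [longitude, latitude]
        if status.coordinates:
            lon, lat = status.coordinates['coordinates']
            index_tweet(status.text, status.created_at, lat, lon)

    def on_error(self, status_code):
        # Returning False disconnects the stream (useful on rate limiting, HTTP 420)
        return False


stream = tweepy.Stream(auth=auth, listener=TweetListener())
stream.filter(track=['python', 'elasticsearch'])  # hashtags/keywords to filter
```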

For each tweet we use the Google Geocoding API to identify the country from the GeoJSON details. Once we have identified the country, we index the document to Elasticsearch.
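
Here is a possible implementation of that step, again only a sketch: the Google API key is a placeholder and the document fields mirror the mapping sketched earlier. Note that reverse geocoding every tweet counts against the Geocoding API quota.

```python
import requests
from elasticsearch import Elasticsearch

es = Elasticsearch(['http://localhost:9200'])
GOOGLE_API_KEY = 'YOUR_GOOGLE_API_KEY'  # placeholder


def country_from_coordinates(lat, lon):
    """Reverse geocode with the Google Maps Geocoding API and return the country name."""
    response = requests.get(
        'https://maps.googleapis.com/maps/api/geocode/json',
        params={'latlng': '{},{}'.format(lat, lon), 'key': GOOGLE_API_KEY}
    ).json()
    for result in response.get('results', []):
        for component in result['address_components']:
            if 'country' in component['types']:
                return component['long_name']
    return None


def index_tweet(text, created_at, lat, lon):
    # Index the tweet only when a country can be resolved from the coordinates
    country = country_from_coordinates(lat, lon)
    if country is None:
        return
    es.index(index='tweetrepository', doc_type='tweet', body={
        'country': country,
        'text': text,
        'created_at': created_at
    })
```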

This is what an indexed document looks like.

[Screenshot: an indexed Tweet document]

We are now going to create a new region map visualization.

[Screenshot: creating a new Region Map visualization]

In the options section of the visualization, select the Vector Map. This is the map layer that will be used. The list includes the maps that are hosted by the Elastic Maps Service as well as your self-hosted layers that are configured in the config/kibana.yml file. To learn more about how to configure Kibana to make self-hosted layers available, see the region map settings documentation.

We will use the World Country vector map. The join field is the property from the selected vector map that will be used to join on the terms in your terms-aggregation. In this example the join field is the country name (so we can match the regions of the map with our documents).

In the style section you can choose the color scheme (red to green, shades of blue/green, heatmap) that will be used.
[Screenshot: Region Map configuration – options and style]

 

In the buckets section, select the country field (the field from our mapping). The values of this field will be used as the lookup (join) on the vector map.

[Screenshot: Region Map configuration – buckets]

 

This is what our region map looks like. The darker countries are the ones with a higher number of Tweets.

[Screenshot: the resulting region map]
I really like this new type of visualization: it is easy to use and allows you to add nice map visualizations (even with self-hosted layers configured in the config/kibana.yml file) to your Kibana dashboards.

If you use Kibana to visualize logs and you use Logstash, take a look at the GeoIP Filter plugin. The GeoIP filter adds information about the geographical location of IP addresses, based on data from the MaxMind GeoLite2 databases (so you can use the geographical location in your region map).
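
As a minimal example, and assuming your events have a clientip field holding the IP address, the filter section of a Logstash pipeline could look like:

```
filter {
  geoip {
    source => "clientip"   # field containing the IP address to look up
  }
}
```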