Index file attachments in Elasticsearch

Elasticsearch can index file attachments using the mapper-attachments plugin. This plugin relies on the Apache Tika text extraction library, which supports many file formats; you can find the list here.

To install the plugin, run bin/plugin install mapper-attachments on every node of your Elasticsearch cluster (each node must be restarted after installation).

The plugin adds an attachment field type to mappings, so that a document field can hold the contents of a file (encoded as Base64).

NB: The mapper-attachments plugin will be replaced by the ingest-attachment plugin in the 5.x version of Elasticsearch.

I am using the 2.3.3 version of Elasticsearch on Ubuntu 14.04.4 LTS.

We are now going to create a new index called library, add a new type (book), and index some documents.
To create a new index:
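As a minimal sketch, the request can be built with Python's standard library. The address http://localhost:9200 is an assumption (a local node on the default HTTP port); adjust it to your cluster. The request is only built and printed here, and the line that would send it is left as a comment.

```python
import urllib.request

# Assumption: a node is listening on localhost:9200 (the default HTTP port).
ES_URL = "http://localhost:9200"

# PUT /library creates the index with default settings.
create_index = urllib.request.Request(ES_URL + "/library", method="PUT")

print(create_index.get_method(), create_index.full_url)
# To actually send it: urllib.request.urlopen(create_index)
```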

Then we add the mapping for the type book. We use the attachment data type (the new data type added by the plugin).
The attachment type not only indexes the content of the file in the content sub-field, but also automatically adds metadata about the attachment (when available). You can find the list of all the metadata here.
In the example we add a field called file with two metadata sub-fields: content_type and language.
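A mapping along these lines could look like the sketch below. The "store": "yes" setting is my assumption, borrowed from the example in the plugin documentation, so that the two metadata values can be returned with search hits; the node address is also assumed.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, default port

# Mapping for the "book" type: one "file" field of type "attachment",
# with the content_type and language metadata sub-fields stored.
mapping = {
    "book": {
        "properties": {
            "file": {
                "type": "attachment",
                "fields": {
                    "content_type": {"store": "yes"},  # assumption: stored
                    "language": {"store": "yes"},      # assumption: stored
                },
            }
        }
    }
}

# PUT /library/book/_mapping registers the mapping on the existing index.
put_mapping = urllib.request.Request(
    ES_URL + "/library/book/_mapping",
    data=json.dumps(mapping).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
# To actually send it: urllib.request.urlopen(put_mapping)
print(json.dumps(mapping, indent=2))
```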

Now we can post some documents (I generated some random text in English using randomtextgenerator.com and encoded it in Base64 using base64encode.org):
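The encoding step can also be done locally. Here is a small sketch that Base64-encodes a text and wraps it in a document for the file field; the sample text is a placeholder (not the text I actually indexed) and the node address is assumed.

```python
import base64
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, default port

# Placeholder standing in for the generated English paragraph.
text = "Some random English text to index."

# The attachment field expects the raw file bytes encoded as Base64.
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
doc = {"file": encoded}

# POST /library/book lets Elasticsearch generate the document id.
index_doc = urllib.request.Request(
    ES_URL + "/library/book",
    data=json.dumps(doc).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# To actually send it: urllib.request.urlopen(index_doc)
print(json.dumps(doc))
```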

The added document has been analyzed by the plugin and the metadata have been automatically inferred (content_type: text/plain and language: en).

To enable language detection for documents, you need to set index.mapping.attachment.detect_language: true in the elasticsearch.yml configuration file (language detection can come at a cost).

This is how the added document is stored (you can see that the content_type and language metadata have been automatically inferred).

[Image: content1]

It is possible to search the content of the document using a query_string query (I ran the query in Sense).
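A query of this kind might look like the sketch below, which prints a body that can be pasted into Sense under GET /library/book/_search. The search term "text" is a hypothetical example, and pointing default_field at file.content is my assumption about where to search.

```python
import json

# "text" is a hypothetical search term; "file.content" is the sub-field
# holding the text extracted from the "file" attachment.
search_body = {
    "query": {
        "query_string": {
            "default_field": "file.content",
            "query": "text",
        }
    }
}

# Prints a body for: GET /library/book/_search
print(json.dumps(search_body, indent=2))
```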

I posted another document (this time in Greek).

Now we can perform a search query to extract all the documents that are written in English and contain the word θέατρο (a random Greek word XD).
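One way to combine the two conditions is a bool query with two must clauses, sketched below. This is an assumption about the query shape, not necessarily the exact request I ran; the fields entry assumes the language sub-field is stored in the mapping.

```python
import json

# Both clauses must match: detected language "en" and the Greek word
# in the extracted content.
search_body = {
    "fields": ["file.language"],  # assumption: sub-field is stored
    "query": {
        "bool": {
            "must": [
                {"match": {"file.language": "en"}},
                {"match": {"file.content": "θέατρο"}},
            ]
        }
    },
}

# ensure_ascii=False keeps the Greek word readable in the printed body.
# Prints a body for: GET /library/book/_search
print(json.dumps(search_body, indent=2, ensure_ascii=False))
```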

This is the result:

[Image: content2]

The plugin is also able to infer the content_type of the attachment we are going to index, so I generated some HTML (containing some paragraphs in Polish) and posted it.

We can now search for all the documents with the HTML content type and we can see that the result of the query contains the indexed document.
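A sketch of such a query follows. I use a match query with the "and" operator because, assuming the default analyzer, the value text/html is split into the tokens text and html; requiring both avoids also matching text/plain documents.

```python
import json

# Match documents whose detected content type contains both tokens
# "text" and "html" (operator "and" requires every token to match).
search_body = {
    "fields": ["file.content_type"],  # assumption: sub-field is stored
    "query": {
        "match": {
            "file.content_type": {
                "query": "text/html",
                "operator": "and",
            }
        }
    },
}

# Prints a body for: GET /library/book/_search
print(json.dumps(search_body, indent=2))
```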

[Image: content3]

Here you can find the complete documentation of the mapper-attachments plugin on the Elastic website (it covers some topics we did not see in this post, such as how to control the number of indexed characters and how to highlight attachment content).