Kicking the tires of AWS Textract

Amazon Web Services' new ML/AI service, Amazon Textract, reached general availability, so I gave it a quick test.

AWS has multiple services in the AI/ML field. These include, for example, Amazon Comprehend for text analysis, Amazon Forecast for predicting future values from a set of data, and Amazon Rekognition for extracting information from pictures. Amazon Textract is a new service in this field and has just been announced as generally available. Textract performs Optical Character Recognition (OCR) on multiple file formats and stores the output in a more usable form, as JSON.

At the time of release, Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols. It accepts PNG, JPEG and PDF files as input. I would say there are enough input formats, but I would have liked to see more languages available. Of course, Finnish is not something I expect to see anytime soon, if at all. Textract is now available in three regions in the US and in Ireland in Europe.

Analysis test

Textract makes it easy to test what kind of results you can get with it. When you open the Textract service in the console, you first see a sample document created by AWS. This helps you get started and gives an idea of how to use the service. Documents can be uploaded directly from the console, which automatically creates an S3 bucket to store them.

Textract sample document

 

I ran tests with multiple files and file formats to see how it performs, but I use one PDF document as the example in this post. The PDF was the AWS Landing Zone immersion day information sheet, because it was handily available and contains text, a table and an image. On the left in the picture we can again see the areas where Textract has identified content, and on the right is the extracted text. From a clear and simple document like this, it seems to have picked up everything easily. It took around 10 seconds to analyse this document.

Test document

 

I would say that Textract handled all the files I gave it without much trouble. The preview of the file and the places where it marks text do not always align, even though the text output is correct. This happened, for example, with my CV, where the visual representation was off in many places.

Visual analysis sample

Results

Outputs can also be downloaded directly from the console as a zip file, which contains these four files:

  • apiResponse.json
  • tables.csv
  • keyValues.csv
  • rawText.txt

The files tables.csv, keyValues.csv and rawText.txt are all quite clear. The tables.csv file holds all the tables and fields Textract found in the document, and keyValues.csv holds form data. This is the table that was found in the document. It has been read correctly and put into a table. Interestingly, Textract has also added empty columns for the long empty spaces between texts.

Test document table
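
The empty filler columns mentioned above are easy to strip after the fact. Here is a minimal sketch using only Python's standard csv module. The sample rows are hypothetical and only imitate the kind of content in tables.csv; the real file layout may differ, and `drop_empty_columns` is just an illustrative helper name.

```python
import csv
import io


def drop_empty_columns(rows):
    """Return the rows with every all-blank column removed."""
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    padded = [row + [""] * (width - len(row)) for row in rows]
    # Keep a column only if at least one cell in it has visible text.
    keep = [i for i in range(width) if any(r[i].strip() for r in padded)]
    return [[r[i] for i in keep] for r in padded]


# Hypothetical excerpt with blank filler columns (not real Textract output).
sample = (
    "10:00,,10:30,,Welcome and Registration\n"
    "10:30,,10:40,,Example session\n"
)
rows = list(csv.reader(io.StringIO(sample)))
cleaned = drop_empty_columns(rows)
for row in cleaned:
    print(row)
```

For a real file, the same helper could be fed from `csv.reader(open("tables.csv", newline=""))`, assuming the file is plain comma-separated text.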

 

The rawText.txt file contains the text extracted from the document in raw form: all the words one after another, unedited.

H Automated Landing Zone Immersion Day Please join the AWS Nordics Partner team for an immersion day for the Automated Landing Zone. Learn how to set up an account structure according to best practices with the help of the ALZ solution. After you have performed this training, you will get access to the ALZ solution tools and materials sO you can use when setting up customer environments. This training will also be helpful for those of you interested in the AWS Control Tower service that will be available later this year. WHEN: April 1st 2019 (no joke) WHERE: AWS Office at Kungsgatan 49 in Stockholm Preliminary agenda 10:00 10:30 Welcome and Registration 10:30 10:40………
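
A similar flat text dump can be rebuilt from the JSON response. The sketch below assumes the response has a top-level "Blocks" list in which LINE blocks carry a "Text" field, which is how the Textract API describes its output. Whether the console produces rawText.txt in exactly this way is my assumption, and the miniature response is hand-written for illustration.

```python
def raw_text_from_blocks(response):
    """Join the Text of every LINE block into one space-separated string."""
    return " ".join(
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    )


# Hand-written miniature response (hypothetical, not real Textract output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Automated Landing Zone"},
        {"BlockType": "WORD", "Text": "Automated"},
        {"BlockType": "WORD", "Text": "Landing"},
        {"BlockType": "WORD", "Text": "Zone"},
        {"BlockType": "LINE", "Text": "Immersion Day"},
    ]
}

print(raw_text_from_blocks(sample_response))
```

WORD blocks are skipped on purpose, because each LINE block already contains the text of its words and including both would duplicate everything.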

Textract also gives the full output of the process. This is in JSON format and contains all the information about the findings: detailed information on what was found and where, along with a confidence percentage for each finding. The JSON document is very large even for a small PDF, almost as big as the original PDF.

    {
      "BlockType": "WORD",
      "Confidence": 99.962646484375,
      "Text": "account",
      "Geometry": {
        "BoundingBox": {
          "Width": 0.0724315419793129,
          "Height": 0.012798813171684742,
          "Left": 0.448628693819046,
          "Top": 0.37925970554351807
        },
        "Polygon": [
          {
            "X": 0.448628693819046,
            "Y": 0.37925970554351807
          },
          {
            "X": 0.5210602283477783,
            "Y": 0.37925970554351807
          },
          {
            "X": 0.5210602283477783,
            "Y": 0.39205852150917053
          },
          {
            "X": 0.448628693819046,
            "Y": 0.39205852150917053
          }
        ]
      },
      "Id": "f1c9bdeb-f76a-44ff-8037-6cb746d5613d",
      "Page": 1
    },
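
To show how a block like this might be used, here is a small Python sketch. Textract reports geometry as ratios of the page width and height, so the bounding box has to be scaled to get pixel coordinates. The 1000 x 1000 page size and the helper names `to_pixel_box` and `words_below` are assumptions for illustration; the block itself is the one shown above, with the polygon left out.

```python
def to_pixel_box(bounding_box, page_width, page_height):
    """Scale a ratio-based BoundingBox to (left, top, right, bottom) pixels."""
    left = bounding_box["Left"] * page_width
    top = bounding_box["Top"] * page_height
    right = (bounding_box["Left"] + bounding_box["Width"]) * page_width
    bottom = (bounding_box["Top"] + bounding_box["Height"]) * page_height
    return round(left), round(top), round(right), round(bottom)


def words_below(blocks, threshold):
    """Return the text of WORD blocks whose Confidence is under threshold."""
    return [
        block["Text"]
        for block in blocks
        if block.get("BlockType") == "WORD" and block["Confidence"] < threshold
    ]


# The WORD block from the response above, without the Polygon entry.
block = {
    "BlockType": "WORD",
    "Confidence": 99.962646484375,
    "Text": "account",
    "Geometry": {
        "BoundingBox": {
            "Width": 0.0724315419793129,
            "Height": 0.012798813171684742,
            "Left": 0.448628693819046,
            "Top": 0.37925970554351807,
        }
    },
    "Page": 1,
}

# Assumed page size of 1000 x 1000 pixels, chosen only for the example.
box = to_pixel_box(block["Geometry"]["BoundingBox"], 1000, 1000)
print(box)  # (449, 379, 521, 392)
print(words_below([block], 99.0))  # [] because 99.96 is above the threshold
```

Filtering on confidence like this is one way to pick out words that may deserve a manual check.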

 

Conclusion

Textract is a welcome addition to the AWS AI/ML service family and fills a gap in its analysis tools. Textract promises to read English from multiple file formats and seems to do that well. All my tests with PDFs and pictures were successful. Of course, one wouldn’t normally use the service like this, uploading single files manually. Textract is supported in the AWS CLI and in both the Java and Python SDKs. That makes it possible to have, for example, automatic triggers on an S3 bucket that launch Textract to do its thing when new files are uploaded. Overall, a nice service that will probably be very useful for text analysis use cases.
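
The S3 trigger idea could be sketched as a small AWS Lambda handler in Python. This is an untested outline rather than something I deployed: it assumes the function is subscribed to the bucket's object-created events and has permission to call Textract. The optional `textract` parameter is my own addition so that the handler can be exercised with a stand-in client.

```python
from urllib.parse import unquote_plus


def handler(event, context=None, textract=None):
    """Start an asynchronous text-detection job for each uploaded object."""
    if textract is None:
        # Imported lazily so the sketch can be tried without AWS access.
        import boto3

        textract = boto3.client("textract")

    job_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = textract.start_document_text_detection(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        job_ids.append(response["JobId"])
    return job_ids
```

The asynchronous `start_document_text_detection` call is used here because it accepts PDFs stored in S3. The results would then be fetched later with `get_document_text_detection`, using the returned job ID.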


AWS Summit Berlin 2019

My thoughts on the Berlin AWS Summit 2019

What is an AWS Summit?

AWS Summits are small, free events that take place in various cities around the world. They are “satellite” events of re:Invent, which takes place in Las Vegas every year in November. If you cannot attend re:Invent, you should definitely try to attend an AWS Summit.

Berlin AWS Summit

I have had the pleasure of attending the Berlin AWS Summit for 4 years in a row.

Werner Vogels

The event was a two-day event held on 26-27 February 2019 in Berlin. The first day was aimed more at management and new cloud users, and the second day had more deep-dive technical sessions. The event started with a keynote by Werner Vogels, CTO of Amazon. This year the Berlin AWS Summit seemed very focused on topics around machine learning and AI. I also think there were more people attending this year than in 2018 or 2017.

You will always find other sessions that interest you, even if ML and AI are currently not on your radar. For example, I attended the session “Observability for Modern Applications”, which showed how to use AWS X-Ray and App Mesh to monitor and control large-scale microservices running in Amazon EKS or similar. App Mesh is currently in public preview and it looks very interesting!

The partners

Every year there are a lot of stands where various partners showcase their products to passers-by. You can also take part in raffles at the cost of your email address (and the marketing emails that will obviously ensue). Most of them also hand out free swag such as stickers or pens.

Stands 1, Stands 2, Stands 3

Solita Oy is an AWS Partner; please check our qualifications on the AWS Partners page.

Differences to previous years

This year there was no AWS Certified lounge, which was a surprise to me. The lounge is a restricted area where people who hold an active AWS Certification can network with other certified people. I hope it will return next year.

 

Thank you for the event!

Thank you and goodbye

Choosing a cloud provider

Sticking with your old habits and misconceptions is dangerous; choosing a cloud partner is something that should be done with care.

Nowadays there is a plethora of cloud operators to choose from, and almost everyone has their favourite. AWS is the oldest and probably has the most features and services, Azure is the go-to place for running Microsoft-related applications or workloads, and if you are looking into AI or ML you go with Google. This has been a common misconception.

In reality, choosing your cloud is not so black and white. Providers who came into the game a bit later than Amazon have been investing heavily in development and are catching up fast. Amazon hasn’t been resting on the AI or ML front either. And there is also Alibaba, the Amazon of China, which is now pushing hard into the West and seems to focus on AI and ML.

Relying on this kind of categorisation is dangerous, as cloud operators' strengths can change quite quickly and it might limit your ability to operate efficiently.

This is where you need to focus. Map your main goals for using the cloud. Check the available options yourself, if your skill set is up to date with all of them. This might be almost impossible, as cloud providers push out new services almost daily. So I highly recommend that you move to the most important step and choose a partner to help you.

Choosing the right partner can bring serious cost savings and accelerate your development. Do your homework and spend some time benchmarking potential partners. Make sure your partner has enough real-life experience of building for and running in the cloud.