Ye old timey IoT, what was it anyway and does it have an upgrade path?

What were the internet-connected devices of old that collected data? Are they obsolete and in need of complete replacement, or is there an upgrade path for integrating them into data warehouses in the cloud?

Previously on the internet

In the beginning the Universe was created.
This has made a lot of people very angry and been widely regarded as a bad move.

— Douglas Adams in his book “The Restaurant at the End of the Universe”

Then someone had another great idea: to create computers, the Internet and the World Wide Web, and ever since then it’s been a constant stream of all kinds of multimedia content that one might enjoy as a kind of recompense for the previous blunders by the universe. (As these things usually go, some people have regarded these as bad moves as well.)

Measuring the world

Some, however, enjoy a completely different type of content. I am talking about data, of course. This need to understand and measure the world around us has been with us ever since the dawn of mankind, but the interconnected worldwide network, combined with cheaper and better automation, has accelerated our efforts massively.

Previously you had to trek to the ends of the earth, usually at great financial and bodily risk, to try to set up test equipment, or to monitor it with your senses and write down the readings on a piece of paper. But then, suddenly, electronic sensors and other measurement apparatus could be combined with a computer to collect data on-site and warehouse it. (Of course, back then we called a warehouse of data a “database” or a “network drive” and had none of this new age poppycock terminology.)

Things were great; no need any longer to put your snowshoes on and risk being eaten by a vicious polar bear when you could just sit comfortably in your chair, next to a desk with a brand new IBM PS/2 on it, and check the measurements through this latest invention called the Mosaic web browser, or a VT100 terminal if your department was really old-school. (Oh, those were the days.)

These prototypes of IoT devices were very specialized pieces of hardware for very special use cases, for scientists and other rather special types of folk, and no Internet-connected washing machines were in sight, yet. (Oh, in hindsight ignorance is bliss. Is it not?)

The rise of the acronym

First, they used dual-tone multi-frequency signalling, or DTMF. You put your phone next to the device, pushed a button on it, and the thing would scream an ear-shattering series of audible tones into your phone, which then relayed them to a computer somewhere. Later, if you were lucky, a repairman would come over, completely disregard the self-diagnostic report your washing machine had just sent over the telephone lines, and usually either fix the damn thing or make the issue even worse while cursing computers all the way to hell. (Plumbers make bad IT support personnel and vice versa.)

From landlines to wireless

So, because of this and many other reasons, someone had a great idea: network devices like these directly to your Internet connection and cut the middleman, your phone, out of the equation altogether. This made things simpler for everyone. (Except for the poor plumber, who still continued to disregard the self-diagnostic reports.) And everything was great for a while again until, one day, we woke up and there was a legion of decades-old washing machines, TVs, temperature sensors, cameras, refrigerators, ice boxes, video recorders, toothbrushes and a plethora of other “smart” devices connected to the Internet.

Internet of Things, or IoT for short, describes these devices as a whole and the phenomenon, the age, that created them.

Suddenly it was no longer just a set of specialized hardware for special people that had connected smart devices collecting data. Now it was for everybody. (This has, yet again, been regarded as a bad move.) If we look past the obvious security concerns that this craze of connecting every single (useless) thing to the Internet has created, we can also see the benefit. The data flows, and data is the new oil, as the saying goes.

And there is a lot of data

The volume of data collected with all these IoT devices is staggering, and simple old-timey daily FTP transfers to a central server are therefore no longer a viable way of collecting it. We have come up with new protocols like REST, WebSockets and MQTT to ingest real-time streams of new data points into our databases from all of these data-collecting devices.

Eventually, backend systems were migrated or converted into data warehouses that accept data only through these new protocols and are therefore fundamentally incompatible with the old IoT devices.

What to do? Declare them all obsolete and replace them, or is there something that can be done to extend the lifespan of these devices and keep them useful?

The upgrade path, a personal journey

As an example of an upgrade path, I shall share a personal journey on which I embarked in the late 1990s. By now this is a macabre exercise in fighting the inevitable obsolescence, but I have devoted tears, sweat and countless hours over the years to keeping these systems alive, and today’s example is no different. The service in question runs on a minimal budget and with volunteer effort, so heavy doses of ingenuity are required.

Vaisala weather station at the Hämeenlinna observatory.
The Vaisala weather station located at the Hämeenlinna observatory is now connected to the remote logger software through a Moxa serial server.

 

Even though Finland is located near or partly above the Arctic Circle, there are no polar bears around, except in a zoo. Setting up a Vaisala weather station is not something that will cause a furry meat grinder to release your soul from your mortal coil; no, it is actually quite safe. Due to a few circumstances and happy accidents, it is just what I ended up doing two decades ago when setting up a local weather station service in the city of Hämeenlinna. The antiquated 1990s web page design is one of those things I look forward to updating at some point, but today we are talking about what goes on in the background: data collection.

The old, the garbage and the obsolete

Here we have the type of equipment that measures and logs data points about the weather conditions at a steady pace. The measurements are then read out by specialized software on a computer placed next to the station, since the communication is just plain old ASCII over a serial connection. The software is old. I mean really old. Actually, I am pretty sure that some of you reading this were not even born back in 1998:

Analysis of text strings inside of a binary
The image above shows an analysis of the program called YourVIEW.exe that is used to receive data from this antiquated weather station. It was programmed with LabVIEW version 3.243, released back in 1998. This software does not run properly on anything newer than Windows 2000.

This creates a few problematic dependencies; problems that tend to get bigger as time passes.

The first issue is an obvious one: an old and unsupported version of the Windows operating system. No new security patches or drivers are available, which is a huge problem in any IT scenario, but still a common one in aging IoT solutions.

The second problem: no new hardware is available. No operating system support means no new drivers, which means no new hardware if the old gear breaks down. After a decade of scavenging this and that piece of obsolete computer hardware, pulling together a somewhat functioning PC is a daunting task that keeps getting harder every year. People tend to just dispose of their old PCs when buying new ones, so the half-life of old PC “obtainium” is really short.

The third challenge: one can’t get rid of Windows 2000 even if one wanted to, since the logging software does not run on anything newer. And yes, I even tried black magic, voodoo sacrifices and Wine under Linux, all to no avail.

And finally, the data collection itself is a problem: how do you modernize something that uses its own data collection/logging software, and integrate it with modern cloud services, when said software was conceived before modern cloud computing even existed?

Path step 1, an intermediate solution

As with any technical problem, investigating it yields several solutions, but most of them are infeasible for one reason or another. In my example case, I came up with a partial solution that I can later continue building on top of. At its core this is a cloud journey, a cloud migration, not much different from those I work on daily with our customers at Solita.

For the first problem, Windows updates, we really can’t do anything without updating the Windows operating system to a more recent and supported release, but unfortunately the data logging software won’t run on anything newer than Windows 2000; lift and shift it is, then. The solution is to virtualize the server and bolster the security around the vulnerable underbelly of the system with firewalls and other security tools. This has the added benefit of improving the service SLA, since server/workstation hardware failures and on-site network and power outages are no longer a factor. However, since the weather station communicates over a serial connection (RS-232), we also need to somehow virtualize away the added physical distance. There are many solutions, but I chose a Moxa NPort 5110A serial server for this project. Combined with an Internet router capable of creating a secure IPsec tunnel between the cloud and the site, and by using Moxa’s Windows RealCOM drivers, one can securely map the on-site serial port to the remote Windows 2000 virtual server.
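
As a side note, if the NPort were configured in plain TCP Server mode instead of the RealCOM mode used here, the ASCII telegrams could also be read straight over the network, for example with pySerial’s socket:// URL handler. A minimal sketch (the IP address and TCP port are placeholders, not the actual configuration):

    # Read ASCII telegrams from a serial server running in TCP Server mode.
    # Assumes pySerial is installed; the address and port are placeholders.
    import serial

    ser = serial.serial_for_url("socket://192.0.2.10:4001", timeout=30)

    while True:
        line = ser.readline().decode("ascii", errors="replace").strip()
        if line:
            print("Telegram from the weather station:", line)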

How about modernizing the data collection, then? Luckily, YourVIEW writes the received data points into a CSV file, so it is possible to write a secondary logger in Python that collects those data points into a remote MySQL server as they become available.
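
A minimal sketch of such a secondary logger is below. It assumes the CSV is reachable from a machine with a modern Python, and the file path, column order (timestamp, temperature, humidity, pressure) and table name are made up for illustration:

    # Tail the YourVIEW CSV file and push new rows to a remote MySQL server.
    # Path, credentials, column order and table name are placeholder assumptions.
    import csv
    import time

    import pymysql

    CSV_PATH = r"C:\YourVIEW\weather.csv"

    conn = pymysql.connect(host="db.example.org", user="logger",
                           password="secret", database="weather", autocommit=True)

    def follow(path):
        """Yield lines appended to the file, tail -f style."""
        with open(path, newline="") as f:
            f.seek(0, 2)                 # start from the end of the file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(5)        # wait for YourVIEW to append a new row
                    continue
                yield line

    for raw in follow(CSV_PATH):
        row = next(csv.reader([raw]))
        if len(row) < 4:
            continue                     # skip partial or malformed lines
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO observations (ts, temperature, humidity, pressure) "
                "VALUES (%s, %s, %s, %s)",
                row[:4],
            )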

Path step 2, next steps

What was before a vulnerable and obsolete pile of scavenged parts is still a pile of obsolete junk, but now it has a way forward. Many would have discarded this data collection platform as garbage and thrown it away, but with this example I hope to demonstrate that everything has a migration path, and that with proper lifecycle management your IoT infrastructure investment does not necessarily need to be only a three-year plan; you can expect it to yield returns for decades.

An effort on my part is ongoing to replace the YourVIEW software altogether with a homebrew logger that runs in a Docker container and publishes data with MQTT to Google Cloud Platform IoT Core. IoT Core together with Google Cloud Pub/Sub makes an unbeatable data ingestion framework. Data can then be stored in, for example, Google Cloud SQL and/or exported to BigQuery for further data warehousing, and finally visualized in, say, Google Data Studio.
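
To give an idea of the MQTT part, here is a hedged sketch of publishing one data point to the IoT Core MQTT bridge with paho-mqtt and PyJWT. The project, region, registry, device and key file names are placeholders, not the actual setup:

    # Publish a single measurement to Google Cloud IoT Core over MQTT.
    # All identifiers below are placeholder assumptions.
    import datetime
    import json
    import ssl

    import jwt                      # PyJWT
    import paho.mqtt.client as mqtt

    PROJECT, REGION = "my-weather-project", "europe-west1"
    REGISTRY, DEVICE = "weather-stations", "hameenlinna-1"

    def create_jwt(project_id, private_key_file):
        """Device JWT used as the MQTT password by the IoT Core bridge."""
        now = datetime.datetime.utcnow()
        claims = {"iat": now, "exp": now + datetime.timedelta(minutes=60), "aud": project_id}
        with open(private_key_file) as f:
            return jwt.encode(claims, f.read(), algorithm="RS256")

    client_id = f"projects/{PROJECT}/locations/{REGION}/registries/{REGISTRY}/devices/{DEVICE}"
    client = mqtt.Client(client_id=client_id)
    client.username_pw_set(username="unused", password=create_jwt(PROJECT, "rsa_private.pem"))
    client.tls_set(ca_certs="roots.pem", tls_version=ssl.PROTOCOL_TLSv1_2)
    client.connect("mqtt.googleapis.com", 8883)
    client.loop_start()

    payload = json.dumps({"temperature": -4.2, "humidity": 81})
    client.publish(f"/devices/{DEVICE}/events", payload, qos=1)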

Even though I use the term “logger” here, the term “gateway” would be just as suitable. Old systems require interpretation and translation to be able to talk to modern cloud services. Either a commercial solution exists from the hardware vendor, or, as in my case, you have to write one.

Together we are stronger

I would like to think that my very specific example above is unique, but I am afraid it is not. In principle, all integration and cloud migration journeys have their own unique challenges.

Luckily, modern partners like Solita, with extensive expertise in cloud platforms such as Google Cloud Platform, Amazon Web Services and Microsoft Azure, and in software development, integration and data analytics, can help a customer tackle these obstacles. Together we can modernize and integrate existing data collection infrastructures, for example on the web, in healthcare, in banking, on the factory floor or in logistics. Throwing existing hardware or software into the trash and replacing it with something new is time-consuming, expensive and sometimes easier said than done. Therefore, carefully planning an upgrade path with a knowledgeable partner might be a better way forward.

Even when you are considering investing in a completely new data collection solution, integration is usually required at some stage of the implementation, and Solita, together with our extensive partner network, is here to help you.

My Wednesday at AWS re:Invent 2019

It was an early morning today, because the alarm clock woke me up around 6 am. The day started with the Worldwide Public Sector Keynote talk at 7 am in the Palazzo ballroom at the Venetian.

Worldwide Public Sector Breakfast Keynote – WPS01

This was my first time taking part in the Public Sector keynote. I’m not sure how worldwide it was: at least Singapore and Australia were mentioned, but I cannot remember anything specific being said about Europe.

Anyone who follows the international security industry even a little cannot have missed how many cities, communities and other organisations have faced ransomware attacks. Some victims paid the ransom in Bitcoin, some did not pay, and many victims just keep quiet. The public cloud environment is a great way to protect your infrastructure and important data. Here is a summary of how to protect yourself:

RMIT University from Australia has multiple education programs for AWS competencies, and it was announced that they are now an official AWS Cloud Innovation Centre (CIC). Typical students have some educational background (e.g. a bachelor’s degree in IT) and want to make a move in the job market through re-education. That sounds like a great approach!

The speaker, Mr. Martin Bean from RMIT, showed a picture from The Jetsons (1963, by Hanna-Barbera Productions) that already featured multiple things which were invented for mass markets much later. Mr. Bean also mentioned two things that got my attention: more people own a cellphone than a toothbrush, and 50 percent of jobs are going to transform into something else within the next 20 years.

Visit to expo area

After the keynote I visited the expo in the Venetian Sands Expo area before heading to the Aria for the rest of Wednesday. The expo was huge, noisy and crowded; the more detailed experience from last year was enough for me. At the AWS Cloud Cafe I took a panorama picture (click to zoom in), and that was it, I was ready to leave.

I took the shuttle bus towards the Aria. I was very happy that the bus driver dropped us off next to the main door of the Aria hotel, which saves on average 20-30 minutes of queueing in the Aria’s parking garage. An important change! On the way I passed the Manhattan skyline of New York-New York.

Get started with Amazon ElastiCache in 60 minutes – DAT407

Mr. Kevin McGehee (Principal Software Engineer, AWS) was the instructor for the ElastiCache Redis builder session. In the session we logged in to the AWS console, opened a Cloud9 development environment and then just followed the clear written instructions.

The guide for the builder session can be found here: https://reinvent2019-elasticache-workshop.s3.amazonaws.com/guide.pdf

This session was about how to import data into Redis with Python, and how to index and refine the data during the import phase. In refinement the raw data becomes information, with aggregated scoring, geolocation and so on, which is easier for the requester to use. It was interesting and looked easy.
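
The core idea, as I understood it, looks roughly like this with redis-py (the endpoint, keys and fields are made up, not the workshop code):

    # Enrich a record while importing it into Redis so it can be queried later.
    # Endpoint, keys and fields are placeholder assumptions.
    import redis

    r = redis.Redis(host="my-cluster.xxxxxx.0001.euw1.cache.amazonaws.com", port=6379)

    record = {"id": "zip:00100", "city": "Helsinki",
              "lat": 60.17, "lon": 24.94, "orders": 42}

    # Store the raw record as a hash.
    r.hset(f"record:{record['id']}", mapping={k: str(v) for k, v in record.items()})

    # Index by an aggregated score so top records can be read with ZREVRANGE.
    r.zadd("orders:by_count", {record["id"]: record["orders"]})

    # Index geographically so nearby records can be found with GEOSEARCH/GEORADIUS.
    r.geoadd("records:geo", (record["lon"], record["lat"], record["id"]))  # redis-py >= 4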

Build an effective resource compliance program – MGT405

Mr. Faraz Kazmi (Software Development Engineer, AWS) held this builder session.

Conformance packs for the AWS Config service were published last week. They can be integrated at the AWS Organizations level of your account structure. With conformance packs you can easily group Config rules (roughly, governance rules for common settings) into a YAML template and get a consolidated view of those rules. A few AWS-managed packs are currently available; the “Operational Best Practices For PCI-DSS” pack is one example. It’s clear that AWS will provide more and more of these rule sets in the coming months, and so will the community via GitHub.
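
Deploying a pack is a single API call; a hedged boto3 sketch (the pack name, template file and delivery bucket are placeholders):

    # Deploy an AWS Config conformance pack from a local YAML template.
    # Names and bucket are placeholder assumptions.
    import boto3

    config = boto3.client("config")

    with open("operational-best-practices-for-pci-dss.yaml") as f:
        template_body = f.read()

    config.put_conformance_pack(
        ConformancePackName="pci-dss-baseline",
        TemplateBody=template_body,
        DeliveryS3Bucket="my-config-conformance-artifacts",
    )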

There are timeline and compliance views of all your resources, which makes this a very effective tool for getting a consolidated view of resource compliance.

You can find the material here: https://reinvent2019.aws-management.tools/mgt405/en/

By the way, if you cannot find conformance packs, you are probably using the old Config service UI in the AWS Console. Make sure to switch to the new UI; all new features are added only to the new UI.

The clean-up phase in the guide is not perfect. In addition to the guide’s steps, you have to manually delete the SNS topic and IAM roles that were created by the wizards. It was a disappointment that no sandbox account was provided.

Best practices for detecting and preventing data exposure – MGT408

Ms. Claudia Charro (Enterprise Solutions Architect, AWS) from Brasília was the teacher in this session. It was very similar to my previous session, which I was not aware of beforehand; in both sessions we used Config rules and blocked public S3 usage.
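
The blocking part boils down to something like the following boto3 call, sketched here with a placeholder account ID (the session may have done it through a Config remediation instead):

    # Turn on account-level S3 Block Public Access.
    # The account ID is a placeholder assumption.
    import boto3

    s3control = boto3.client("s3control")

    s3control.put_public_access_block(
        AccountId="111122223333",
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )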

The material can be found here: https://reinvent2019.aws-management.tools/mgt408/en/cont/testingourenforcement.html

AWS Certification Appreciation Reception

The Wednesday evening started (as usual) with a reception for certified people at the Brooklyn Bowl. It is, again, a nice venue to have some food, drinks and music and to mingle with other people. I’m writing this around 8 pm, so I left a bit early to get a good night’s sleep for Thursday, which is the last full day.

Photos: Brooklyn Bowl from the outside and the inside, the bowling lanes and the dance floor.

On the way back to my hotel (Paris Las Vegas) I found a version 3 Tesla Supercharger station, which was one of the first v3 stations in the world. It was not too crowded, and the station was big compared with the Supercharger stations in Finland. The v3 Supercharger stations can provide up to 250 kW of charging power for the Model 3 Long Range (LR), which has a 75 kWh battery. I would have liked to see the new (experimental) Cybertruck model.

Would you like to hear more about what is happening at re:Invent 2019? Sign up for our team’s WhatsApp group to chat with them, and register for the What happens in Vegas won’t stay in Vegas webinar to hear a full wrap-up after the event.


 

My Tuesday at AWS re:Invent 2019

Please also check my blog post from Monday.

Starting from Tuesday, each event venue provides a great breakfast with lightning-fast service. It scales and it works. It’s always amazing how each venue can provide food service for thousands of people in a short period of time. Most people are there for the first time, so the guidance has to be very clear and simple.

Today started with the keynote by Mr. Andy Jassy. I was not able to join the live session at the Venetian because of my next session. Moving from one location to another takes at least 15 minutes, and you have to be at your session at least 10 minutes early to claim your reserved seat. Since last year, the booking system forces a one-hour gap between sessions in different venues.

Keynote by Andy Jassy on Tuesday

You can find the full recap written by my colleagues here: Andy Jassy’s release roller coaster at AWS re:Invent 2019

Machine learning was the thing today. The SageMaker service received tons of new features; those are explained by our ML specialists here: Coming soon!

So I joined the overflow room at the Mirage for Jassy’s keynote session; everyone there has their own personal headphones. I have more than 15 years of background in software development, so it was love at first sight with CodeGuru. There is good news and bad news. It is a service for static analysis of your code, for automated code reviews and, last but definitely not least, for real-time profiling via an installed agent.

The profiling information is provided in 5-minute periods and covers several factors: CPU, memory and latency. It looks like a promising product, because Mr. Jassy said that Amazon has used it internally for a couple of years already, so it is already a mature product.

So, what was the bad news? It supports only Java. Nothing to add to that.

The other announcement that was interesting to me was the general availability of Outposts. Finally, also in Europe, you can have fully AWS-managed servers inside your corporate data center. Those servers integrate fully with the AWS Console and can be used e.g. for running ECS container services. The starting price of 8,300 USD per month is very competitive, because it already includes roughly 200 cores, 800 GB of memory and 2.7 TB of instance storage. You can additionally add EBS storage starting from 2.7 TB.

You can find more information here: https://aws.amazon.com/blogs/aws/aws-outposts-now-available-order-your-racks-today/

Performing analytics at the edge – IOT405

This session was a workshop, and a level 400 one (the highest). It was held by Mr. Sudhir Jena (Sr. IoT Consultant, AWS) and Mr. Rob Marano (Sr. Practice Manager, AWS).

Industry 4.0, a.k.a. IoT, was a totally new sector for me, and this was a very informative and pleasant session. It was all about the AWS IoT Greengrass service, which provides low-latency responses but is still a managed platform with tons of features for handling data streams from IoT devices locally.

For many people this was their first touch of the AWS Cloud Development Kit, which I fell in love with about three months ago. It has multiple advantages, like easy refactoring, strong typing and good IDE support. You can find more information about AWS CDK here: https://docs.aws.amazon.com/cdk/latest/guide/home.html
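
To show the flavour of CDK in Python, here is a tiny hedged sketch (CDK v2 style; the stack and bucket are examples of mine, not the workshop’s stack):

    # A minimal AWS CDK (v2) app: plain Python classes, so refactoring,
    # strong typing and IDE completion all work as advertised.
    import aws_cdk as cdk
    from aws_cdk import aws_s3 as s3
    from constructs import Construct

    class EdgeAnalyticsStack(cdk.Stack):
        def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
            super().__init__(scope, construct_id, **kwargs)
            s3.Bucket(self, "TelemetryBucket", versioned=True)

    app = cdk.App()
    EdgeAnalyticsStack(app, "EdgeAnalyticsStack")
    app.synth()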

In our workshop session we demonstrated receiving a time series data stream of temperature, humidity and so on from an IoT device; in our case the IoT device was an EC2 instance simulating one. From the AWS IoT Greengrass console you can, for example, deploy new versions of the analytics functions to the IoT devices.

The material for the workshop can be found on GitHub: https://github.com/aws-samples/aws-iot-greengrass-edge-analytics-workshop

AWS Transit Gateway reference architectures for many VPCs – NET406

This session was held by Nick Matthews (Principal Solutions Architect, AWS) in the glamorous Ballroom F at the Mirage, which fits more than a thousand people. The room was almost full, so it was a very popular session.

To summarize the topic: there are several good ways to handle interconnectivity between multiple VPCs and a corporate data center. At small scale you can do things more manually, but at large scale you need automation.

One solution presented for automation was based on the use of tags. An autonomous team (the owner of account A) tags their shared resources in a predefined way, and the transit account picks up those changes via CloudTrail logging. Each modification creates a CloudTrail audit event, which triggers a Lambda function. The function checks whether a change is required and writes a change request item into a metadata table in DynamoDB to wait for approval. The network operator is notified via SNS (Simple Notification Service) and can then allow (or decline) the modification. Another Lambda then makes the needed route table modifications in the transit account and in account A.
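
A rough sketch of the first Lambda in that flow could look like this; the table name, topic ARN, tag key and the assumption that the trigger is an EventBridge rule on EC2 CreateTags CloudTrail events are all mine, not from the talk:

    # Record a tag-driven change request and notify the network operator.
    # Table, topic, tag key and event shape are placeholder assumptions.
    import json
    import os

    import boto3

    dynamodb = boto3.resource("dynamodb")
    sns = boto3.client("sns")

    TABLE = dynamodb.Table(os.environ.get("CHANGE_TABLE", "tgw-change-requests"))
    TOPIC_ARN = os.environ.get("OPERATOR_TOPIC",
                               "arn:aws:sns:eu-west-1:111122223333:tgw-approvals")
    ROUTE_TAG = "share-with-transit"

    def handler(event, context):
        detail = event["detail"]                      # the CloudTrail record
        if detail.get("eventName") != "CreateTags":
            return
        tags = detail["requestParameters"]["tagSet"]["items"]
        if not any(t["key"] == ROUTE_TAG for t in tags):
            return                                    # not a change we care about
        request = {
            "requestId": detail["eventID"],
            "account": detail["userIdentity"]["accountId"],
            "resources": [r["resourceId"]
                          for r in detail["requestParameters"]["resourcesSet"]["items"]],
            "status": "PENDING_APPROVAL",
        }
        TABLE.put_item(Item=request)                  # waits here for operator approval
        sns.publish(TopicArn=TOPIC_ARN,
                    Subject="Transit Gateway change request",
                    Message=json.dumps(request, indent=2))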

If you are interested, you can watch a video from August 2019: https://pages.awscloud.com/AWS-Transit-Gateway-Reference-Architectures-for-Many-Amazon-VPCs_2019_0811-NET_OD.html

If you would rather wait, I’m pretty sure this re:Invent talk was also recorded and will appear on the AWS YouTube channel in a few weeks: https://www.youtube.com/user/AmazonWebServices

Fortifying web apps against bots and scrapers with AWS WAF – SEC357

Mr. Yuri Duchovny (Solutions Architect, AWS) held the session. It was the most intensive session so far, with lots to do and many architectural examples and usage scenarios on the demo screen. The AWS WAF service has got a shiny new UI in the AWS Console, and AWS has already published a few new features in the last few weeks, e.g. managed rules that give more protection in a nondisruptive way. WAF itself did not previously have many predefined protection rules; only XSS (cross-site scripting) and SQLi (SQL injection) were supported, and all other rules needed to be configured manually, as regular expressions and the like.

WAF is a service that should always be turned on for CloudFront distributions, Application Load Balancers (ALB) and API Gateway.

The workshop material is again public and can be found here: https://github.com/gtaws/ProtectWithWAF

Encryption options for AWS Direct Connect – NET405

Mr. Sohaib Tahir (Sr. Solutions Architect, AWS) from Seattle was the teacher in this session. It was more listening than doing because of the short amount of time. We attendees were a group of seven from the USA, Japan and Finland.

There were five possibilities for encrypting a Direct Connect connection:

1. Private VIF (virtual interface) + application-layer TLS
2. Private VIF + virtual VPN appliances (can be in transit VPC)
3. Private VIF + detached VGW + AWS Site-to-site VPN (CloudHub functionality)
4. Public VIF + AWS Virtual Private Gateway (GP, IPSec tunnel, BGP)
5. Public VIF + AWS Transit Gateway (BGP, IPSec tunnel, BGP) NEW!

It’s good to remember that single VPN connections has 1,25 Gbps limit which can be hit easily with DX connection and eg. data intensive migration jobs. AWS recommendation is to use number five architecture if it is possible. Using the fifth architecture requires to have own direct connection so you cannot use shared model direct connection from 3rd party operator.

Yesterday AWS published cross-region VPC connectivity via Transit Gateway. During the session Mr. Tahir started to demonstrate this new feature ad hoc, but we ran out of time.

My Monday at AWS re:Invent 2019

I started the day with breakfast at Denny’s. It was nice to have a typical (I think) American breakfast. Thanks, Mr. Heikki Hämäläinen, for your company. By the way, all attendees from Solita are wearing the bright red hoodies shown in the picture, thanks to our Cloud Ambassador Anton Floor. The hoodie makes it a lot easier to spot a colleague in crowded places. Okay, let’s start going through my actual sessions.

How NextRoll leverages AWS Batch for daily business operations – CMP311

Advertisement company’s Tech Lead Mr. Roozbeh Zabihollahi described shortly their journey with AWS Batch service. If I remember correctly, they use about 5000 CPU years which is huge amount of computing power. It was nice to hear NextRoll allows their teams quite freely to choose which services they want to use. Nowadays Mr. Zabihollahi sees that more and more teams are looking into AWS Batch as a promising choice to use, rather than Hadoop or Spark.

Mr. Zabihollahi believes that AWS Batch is good for several things:

AWS Batch is good for

If you are considering starting to use AWS Batch, you should be familiar with at least these challenges:

Mr. Steve Kendrex (Sr. Technical Product Manager, AWS) presented the roadmap of the AWS Batch service. Support for Fargate (a.k.a. the serverless container service) is coming, but Steve could not provide details to a wide audience. My personal guess is that Spot support for Fargate is coming soon, which would provide a key cost-efficiency factor for batch operations.
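
For anyone who has not used AWS Batch, submitting work boils down to a job definition, a job queue and one API call. A hedged boto3 sketch (the names are placeholders, not NextRoll’s setup):

    # Submit a containerized job to an existing AWS Batch queue.
    # Job name, queue, definition and command are placeholder assumptions.
    import boto3

    batch = boto3.client("batch")

    response = batch.submit_job(
        jobName="nightly-aggregation",
        jobQueue="reporting-queue",           # backed by a compute environment
        jobDefinition="aggregate-logs:3",     # name:revision of the job definition
        containerOverrides={
            "command": ["python", "aggregate.py", "--date", "2019-12-04"],
            "environment": [{"name": "LOG_LEVEL", "value": "INFO"}],
        },
    )
    print("Submitted job", response["jobId"])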

Build self-service registration with facial recognition – ARC320

My first builder session this year was about integrating facial recognition into registering guests for an event. Four other attendees and I were guided through this interesting topic by Mr. Alan Newcomer (Solutions Architect, AWS). Mr. Newcomer had previously lived near Las Vegas, which was interesting to hear about.

Each builder session starts with a short queue for the right table, at which you have hopefully reserved a spot beforehand:

The hall has multiple tables, each with seven chairs (one for the teacher and six for participants) and a screen for guidance purposes.

Typically the teacher provides a website with all the information required to do the exercise. In addition, the teacher provides a unique password for each participant, e.g. for the AWS Console login. After that, each participant can start doing the exercise by themselves, and the teacher helps whenever needed. You need to keep up a good pace all the time to be able to complete the whole exercise.

During the recognition session we built an application that had three main functionalities: registering a user, RSVPing one day before the event, and finally registering the user at the event via facial recognition. You can actually work through the workshop material yourself here: http://regappworkshop.com/
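
The facial-recognition part can be sketched with Amazon Rekognition roughly as below; the collection, bucket and key names are placeholders, and the actual workshop implementation may differ:

    # Register a guest's reference photo and later match a check-in photo.
    # Collection, bucket and key names are placeholder assumptions.
    import boto3

    rekognition = boto3.client("rekognition")
    COLLECTION = "event-guests"

    def register_guest(bucket, key, guest_id):
        """At registration time, index the guest's reference photo."""
        rekognition.index_faces(
            CollectionId=COLLECTION,
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            ExternalImageId=guest_id,
            MaxFaces=1,
        )

    def check_in(bucket, key):
        """At the door, match a fresh photo against the indexed guests."""
        resp = rekognition.search_faces_by_image(
            CollectionId=COLLECTION,
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            FaceMatchThreshold=90,
            MaxFaces=1,
        )
        matches = resp["FaceMatches"]
        return matches[0]["Face"]["ExternalImageId"] if matches else None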

Managing DNS across hundreds of VPCs – NET411

This was my second chalk talk today. It started very well, because right at the beginning the audience heard real-life problems from different attendees. The chalk talk was guided by Mr. Matt Johnson (Manager, Solutions Architecture, WWPS, AWS) and Mr. Gavin McCullagh (Principal System Development Engineer, AWS), and they did extremely well.

We were reminded that support for overlapping private zones was published recently. It enables autonomous, structured DNS management in a multi-account environment. For more information, go to: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-route-53-now-supports-overlapping-namespaces-for-private-hosted-zones/

During the session we looked at four different architectures for sharing DNS information across multiple VPCs (roughly, accounts). Number four, “Share and Associate Zones and Rules”, was the most interesting; it suits a massive number of private DNS zones and VPCs. It has a hub account for outbound DNS traffic to the corporate network and uses private zone association between VPCs. The association does not yet have native CloudFormation support, but there are several ways to handle it, e.g. using CloudFormation custom resources or a custom Lambda function.
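
Without native CloudFormation support, the association itself is a two-step API dance between the zone-owning account and the VPC-owning account; a hedged boto3 sketch with placeholder IDs:

    # Associate a private hosted zone in account A with a VPC in account B.
    # Zone ID, VPC ID and region are placeholder assumptions.
    import boto3

    ZONE_ID = "Z0123456789EXAMPLE"
    VPC = {"VPCRegion": "eu-west-1", "VPCId": "vpc-0abc1234def567890"}

    # Run with credentials for the account that owns the hosted zone:
    zone_owner = boto3.client("route53")
    zone_owner.create_vpc_association_authorization(HostedZoneId=ZONE_ID, VPC=VPC)

    # Run with credentials for the account that owns the VPC:
    vpc_owner = boto3.client("route53")
    vpc_owner.associate_vpc_with_hosted_zone(HostedZoneId=ZONE_ID, VPC=VPC)

    # Optional clean-up of the authorization afterwards (zone owner again):
    zone_owner.delete_vpc_association_authorization(HostedZoneId=ZONE_ID, VPC=VPC)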

One major feature request was that AWS should support DNS query logging (queries and responses) in the VPC. The audience wanted to receive the logging information in CloudWatch log groups; the logging is needed for security/audit and debugging purposes.

Processing AWS Ground Station data in AWS – NET409

This, my second builder session, sounded very fancy: handling data from satellites. The attendees had very different levels of experience with AWS and with satellites, everything from one to five on both topics. After the session I needed to update my CV…

There are a few open satellites in the sky. They can be listened to with the AWS Ground Station service and the data received into your AWS account. The data link between the Ground Station and your VPC is made via an elastic network interface (ENI).

In the example case we received a 15 Mbps stream for 15 minutes, which was the period during which the satellite was visible to the Ground Station’s antenna system. The stream from Ground Station always needs to be received by the Kratos DataDefender software, which parses the UDP traffic. The Ground Station traffic does not arrive in the right order and is sometimes missing pieces, which DataDefender handles.

The data stream was analyzed in a few phases via an S3 bucket and EC2 instances. The final product was a precise TIFF-format picture of the satellite’s view as it passed the Ground Station antenna. The resolution was about 1 megapixel per kilometer.

Nordics Customer Reception

The evening ended with the pleasant and well-organised Nordics Customer Reception event at the Barrymore. Solita was one of the sponsors of the event. From the terrace we had a great view towards the Encore hotel:

Would you like to hear more about what is happening at re:Invent 2019? Sign up for our team’s WhatsApp group to chat with them, and register for the What happens in Vegas won’t stay in Vegas webinar to hear a full wrap-up after the event.


Kicking the tires of AWS Textract

Amazon Web Services’ new ML/AI service, Amazon Textract, has reached general availability, and I gave it a quick test.

AWS has multiple services in the AI/ML field. These include, for example, Amazon Comprehend for text analysis, Amazon Forecast for predicting the future from a set of data, and Amazon Rekognition for extracting information from pictures. Amazon Textract is a new service in this field and was just announced to be generally available. Textract performs Optical Character Recognition (OCR) on multiple file formats and stores the output in a more usable format, JSON.

At the moment of release, Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols. It accepts PNG, JPEG and PDF input files. I would say there are enough input formats, but I would have wanted to see more languages available; of course, Finnish is not something I expect to see anytime soon, or at all. Textract is now available in three regions in the US and in Ireland in Europe.
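
Before looking at the console, here is what a quick test looks like through the SDK; a hedged boto3 sketch for a PNG stored in S3 (the bucket and key are placeholders; PDFs go through the asynchronous Start/Get APIs instead):

    # Synchronous text detection on an image stored in S3.
    # Region, bucket and key are placeholder assumptions.
    import boto3

    textract = boto3.client("textract", region_name="eu-west-1")

    response = textract.detect_document_text(
        Document={"S3Object": {"Bucket": "my-textract-tests", "Name": "sample-page.png"}}
    )

    # Print every detected line with its confidence score.
    for block in response["Blocks"]:
        if block["BlockType"] == "LINE":
            print(f"{block['Confidence']:.1f}%  {block['Text']}")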

Analysis test

Textract allows one to easily test what kind of results one can get with it. One can open the Textract service and first see a sample document created by AWS, which helps to get started and gives some idea of how to use it. Documents can be uploaded directly from the console, and the service automatically creates an S3 bucket to store them.

Textract sample document

 

I did tests with multiple files and file formats to see how it performs, but I used one PDF document as the example for this post. The PDF I used was an AWS Landing Zone immersion day information sheet, because it was handily available and had text, a table and an image in it. On the left in the picture we can see the areas where Textract has identified content, and on the right is the extraction. From this kind of clear and simple document it seems to have picked up everything easily. It took around 10 seconds for this document to be analysed.

Test document

 

I would say that Textract handled all the files I gave it without too much trouble. The view of the file and the places where it finds text do not always align, even though the text output is correct. This happened, for example, with my CV, where the visual representation was off in many places.

Visual analysis sample

Results

Outputs can also be downloaded directly from the console as a zip file, which contains these four files:

  • apiResponse.json
  • tables.csv
  • keyValues.csv
  • rawText.txt

tables.csv, keyValues.csv and rawText.txt are all quite clear: tables.csv holds all the tables and fields Textract found in the document, and keyValues.csv holds the form data. Below is the table that was found in the document. It has been read correctly and put into a table; interestingly, Textract has also added empty columns for the long empty spaces between texts.

Test document table

 

rawText.txt contains the text extracted from the document in raw format: all the text, unedited, with the words one after another.

H Automated Landing Zone Immersion Day Please join the AWS Nordics Partner team for an immersion day for the Automated Landing Zone. Learn how to set up an account structure according to best practices with the help of the ALZ solution. After you have performed this training, you will get access to the ALZ solution tools and materials sO you can use when setting up customer environments. This training will also be helpful for those of you interested in the AWS Control Tower service that will be available later this year. WHEN: April 1st 2019 (no joke) WHERE: AWS Office at Kungsgatan 49 in Stockholm Preliminary agenda 10:00 10:30 Welcome and Registration 10:30 10:40………

Textract also gives a full output of the process in JSON format, containing all the information about the findings: detailed information about what was found and where, along with a confidence percentage for each finding. This is a very large JSON document even for a small PDF, almost as big a file as the original PDF.

    {
      "BlockType": "WORD",
      "Confidence": 99.962646484375,
      "Text": "account",
      "Geometry": {
        "BoundingBox": {
          "Width": 0.0724315419793129,
          "Height": 0.012798813171684742,
          "Left": 0.448628693819046,
          "Top": 0.37925970554351807
        },
        "Polygon": [
          {
            "X": 0.448628693819046,
            "Y": 0.37925970554351807
          },
          {
            "X": 0.5210602283477783,
            "Y": 0.37925970554351807
          },
          {
            "X": 0.5210602283477783,
            "Y": 0.39205852150917053
          },
          {
            "X": 0.448628693819046,
            "Y": 0.39205852150917053
          }
        ]
      },
      "Id": "f1c9bdeb-f76a-44ff-8037-6cb746d5613d",
      "Page": 1
    },

 

Conclusion

Textract is a needed addition to the AWS AI/ML service family and fills a gap in the analysis tools. Textract promises to read English from multiple file formats and seems to do that well; all my tests with PDFs and pictures were successful. Of course, one wouldn’t use the service like this, uploading single files manually. Textract is supported in the AWS CLI and in both the Java and Python SDKs, which makes it possible to have, for example, an automatic trigger on an S3 bucket that launches Textract to do its thing whenever new files are uploaded. Overall, a nice service which will probably be very useful for text analysis use cases.
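
That S3-trigger idea could be sketched roughly like this: a Lambda function that starts an asynchronous Textract job for each uploaded document and has the completion notification sent to an SNS topic. The topic and role ARNs are placeholders:

    # Start an asynchronous Textract job for every object uploaded to the bucket.
    # The SNS topic and IAM role ARNs are placeholder assumptions.
    import boto3

    textract = boto3.client("textract")

    def handler(event, context):
        for record in event["Records"]:           # one entry per uploaded object
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            textract.start_document_text_detection(
                DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
                NotificationChannel={
                    "SNSTopicArn": "arn:aws:sns:eu-west-1:111122223333:textract-done",
                    "RoleArn": "arn:aws:iam::111122223333:role/textract-sns-publish",
                },
            )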

Download a free Cloud Buyer's Guide

AWS Summit Berlin 2019

My thoughts on the Berlin AWS Summit 2019

What is an AWS Summit?

AWS Summits are smaller, free events that happen in various cities around the world. They are “satellite” events of re:Invent, which takes place in Las Vegas every year in November. If you cannot attend re:Invent, you should definitely try to attend an AWS Summit.

Berlin AWS Summit

I have had the pleasure of attending the Berlin AWS Summit for 4 years in a row.

Werner Vogels

The event was a two-day affair held on 26-27 February 2019 in Berlin. The first day was focused more on management and new cloud users, and the second day had more deep-dive technical sessions. The event started with a keynote by Werner Vogels, CTO of Amazon. This year the Berlin AWS Summit seemed to be very focused on topics around machine learning and AI. I also think there were more people attending this year compared to 2018 or 2017.

You will always find sessions that are interesting to you, even if ML & AI are currently not on your radar. For example, I attended the session “Observability for Modern Applications”, which showed how to use AWS X-Ray and App Mesh to monitor and control large-scale microservices running in AWS EKS or similar. App Mesh is currently in public preview and it looks very interesting!

The partners

Every year there are a lot of stands from various partners showcasing their products to passers-by. You can also participate in raffles at the cost of your email address (and the obvious marketing emails that will ensue). Most of them will also hand out free swag: stickers, pens and the like.

Photos: partner stands at the expo.

Solita Oy is an AWS Partner; please check our qualifications on the AWS Partners page.

Differences to previous years

This year there was no AWS Certified lounge, which was a surprise to me. It is a restricted area for people who have an active AWS certification, where they can network with other certified people. I hope it will return next year.

 

Thank you for the event!

Thank you and goodbye

Choosing a provider for the cloud

Sticking with your old habits and misconceptions is dangerous; choosing a cloud partner is something that should be done with care.

Nowadays there is a plethora of cloud operators to choose from, and almost everyone has their favourite. AWS is the oldest and probably has the most features and services, Azure is the go-to place when running Microsoft-related applications or workloads, and if you are looking into using AI or ML you go with Google. Or so the common misconception goes.

In reality, choosing your cloud is not so black and white. Providers who came into the game a bit later than Amazon have been investing heavily in development and are catching up fast. Amazon hasn’t been resting on the AI or ML front either. And there is also Alibaba, the Amazon of China, which is now pushing hard into the West and seems to focus on AI and ML.

Relying on this kind of categorisation is dangerous, as cloud operators’ strengths can change quite quickly, and it might limit your ability to operate efficiently.

This is where you need to focus: map your main goals for using the cloud. Check the available options yourself, if your skillset is up to date with all of them; this might be almost impossible, as cloud providers are pushing out new services almost daily. So I highly recommend that you move on to the most important step and choose a partner to help you.

Choosing the right partner can bring some serious cost savings and accelerate your development. Do your homework and spend some time benchmarking potential partners. Make sure your partner has enough real-life experience in building for and running in the cloud.