THE TIBERIUM BLOG - recent events, threats, and all things cyber

Chapter 2: Classifying domains through string entropy

Introduction 

This is the second blog in the ‘Classifying Malicious Domains’ series, which aims to give insight into how we at Tiberium use our knowledge of attackers’ tactics, techniques, and procedures to detect attacks before they occur.

Today we’re going to talk about ‘dodgy’ looking domains – that is, domains that look more like a plate of alphabet soup than a bona fide website.

An early tl;dr 

Before we get started, we’re actually going to hit the tl;dr and look at the results of the ‘dodgy domain’ classifier I’ll be talking about today. For context, we’re using ‘Shannon Entropy’ to calculate this ‘dodginess’.

The classifier achieved an 80% true-positive rate on NCSC’s latest list of malicious domains, with an 8% false-positive rate across Alexa’s top 1000 websites. I’m stating all this up front just to lend credence to the wealth of awesome context that is about to follow, so stick with me as there’s something of real value here.

What’s a ‘dodgy’ domain 

Kicking things off, I’ll give you an example to show what a ‘dodgy’ domain looks like. Here are two domains: one’s malicious, one’s completely safe.

tiberium[.]io 

sffxmrlphxqjceiaqjdi[.]com 

The former looks like a standard and inconspicuous domain, whereas the latter is an unintelligible mess of letters that genuinely panics me. This is natural for us as humans to spot, but how could this identification occur at a machine level? After all, both are just strings of letters… right?

High entropy 

When we look at sffxmrlphxqjceiaqjdi[.]com we’re looking at what’s known as a ‘high-entropy string’. At a high level, entropy measures the unpredictability of an event, and the domain above is at its core a very unpredictable string of English letters. I won’t dare try and explain this more granularly as long as this blog on Shannon Entropy exists, as I’d be doing a disservice to techies worldwide.

As stated, we’re going to be using ‘Shannon Entropy’ to calculate the entropy of a domain, as it’s a really simple but devilishly effective means of calculating the entropy of a string. 
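For the curious, here’s a minimal sketch of the calculation in Python. It is not our production code: the function name shannon_entropy is made up for illustration, and scoring only the label (the part before the TLD) is an assumption on my part. The idea is simply to count how often each character appears and sum p * log2(1/p) over the distinct characters.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # H = sum over distinct characters of p * log2(1/p), where p = n / total
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Assumption: we score the label only, without the TLD.
for label in ("tiberium", "sffxmrlphxqjceiaqjdi"):
    print(f"{label}: {shannon_entropy(label):.2f}")
# tiberium: 2.75
# sffxmrlphxqjceiaqjdi: 3.82
```

The alphabet-soup label scores noticeably higher than the readable one, which is exactly the gap the classifier leans on.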

Does high entropy = malicious? 

Before we get carried away with the technicalities we need to confirm that high entropy generally means a domain is more likely to be malicious.  

MITRE has stated high-entropy domains are often indicative of the use of domain generation algorithms, which are used to switch domains for malware to communicate with.  

Read this article from the ever fantastic Malwarebytes to understand domain generation algorithms (DGA) better, but in essence, any sign of a DGA-created domain is likely bad news, so we’re rolling with this assumption. 

The classifier build 

The Shannon Entropy classifier has been built using Azure Logic Apps and Function Apps (both of which I can’t sing the praises of enough). As mentioned earlier, we’ll be using NCSC’s malicious domains list as our known-malicious set and the Alexa top 1000 websites as our (mostly) known non-malicious set.

[Figure: high-level design of the classifier]

Our results 

After mapping out the Shannon Entropy for these two sets of domains and plotting them by percentile, we can find the golden threshold for classifying. At a Shannon Entropy > 3.1, we correctly classify 80% of the NCSC malicious domains and incorrectly flag only 8% of the Alexa top 1000 domains. This is a fantastic result given the simplicity of what we’re looking for here, and when we start tying in additional classifiers (such as our RDAP classifier from the previous chapter) our success rate becomes even stronger.
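If you want to try the thresholding step yourself, a rough sketch might look like the following. Everything here is illustrative: flagged_rate is a hypothetical helper, the two sample lists are tiny placeholders and NOT the NCSC or Alexa data, and the only value taken from our results is the 3.1 threshold.

```python
import math
from collections import Counter

THRESHOLD = 3.1  # the threshold quoted in this post


def shannon_entropy(text: str) -> float:
    """Shannon entropy of `text` in bits per character (0.0 for empty input)."""
    total = len(text)
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(total / n) for n in Counter(text).values())


def flagged_rate(labels, threshold=THRESHOLD):
    """Fraction of labels whose entropy is strictly above the threshold."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(shannon_entropy(label) > threshold for label in labels) / len(labels)


# Placeholder inputs for illustration only (not the real data sets).
malicious_sample = ["sffxmrlphxqjceiaqjdi", "aaaaaaaaaaaaaaaaaaaa"]
benign_sample = ["tiberium", "example"]

# On a known-malicious list the flagged rate is the true-positive rate;
# on a known-benign list it is the false-positive rate.
tpr = flagged_rate(malicious_sample)
fpr = flagged_rate(benign_sample)
print(f"threshold={THRESHOLD}: TPR={tpr:.0%}, FPR={fpr:.0%}")
```

Sweeping the threshold argument over a range of values and comparing the two rates at each step is one simple way to hunt for a sweet spot like 3.1.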

And think about what it is we’ve achieved here – we are able to proactively identify ~80% of the malicious domains on this list without the need for any IOCs. This is pure TTP-based threat intel on display, and isn’t it stunning?

What’s next 

It amazes me how something as simple as entropy can have such high efficacy in tasks like this, and paired with some of the other classifiers in our arsenal we quickly become a blue-team force to be reckoned with.  

To end, though, I’ll show an odd way an attacker can bypass this classifier with almost humiliating ease.

aaaaaaaaaaaaaaaaaaaa[.]xyz // Shannon Entropy = 0.0, so classified as ‘non-malicious’
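Why zero? A label made of one repeated character has a single symbol with probability 1, and log2(1) is 0. The short sketch below shows this, assuming (as in the example above) that only the label before the TLD is scored; naive_entropy is an illustrative name, not our production function. It uses the textbook form with a leading minus sign, which is why Python prints a signed zero for this input.

```python
import math
from collections import Counter


def naive_entropy(text: str) -> float:
    """Textbook form H = -sum(p * log2(p)); assumes `text` is non-empty."""
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


label = "a" * 20  # the label from aaaaaaaaaaaaaaaaaaaa[.]xyz
h = naive_entropy(label)
print(h)        # -0.0 (negating a sum of zero gives a signed zero)
print(h > 3.1)  # False, so the label sails under the threshold
```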

Keep your eyes open for the next chapter of this blog series to find out how we can harness a power even stronger than entropy to achieve a more potent true positive rate, and book a demo of Tiberium MYTHIC to see this good stuff in action. 
