Selecting domains with random names

Published: 2017-07-05
Last Updated: 2017-07-05 18:30:23 UTC
by Didier Stevens (Version: 1)
5 comment(s)

I often have to go through lists of domains or URLs and pick out the domains that look like random strings of characters (and could thus have been generated by malware using an algorithm).

That's one of the reasons I developed my re-search.py tool. re-search is a tool to search through (text) files with regular expressions. Regular expressions on their own cannot identify strings that look random, which is why re-search has methods to enhance regular expressions with this capability.

We will use this list of URLs in our example:
http://didierstevens.com
http://zcczjhbczhbzhj.com
http://www.google.com
http://ryzaocnsyvozkd.com
http://www.microsoft.com
http://ahsnvyetdhfkg.com

Here is an example to extract alphabetical .com domains from file list.txt with a regular expression:
re-search.py "[a-z]+\.com" list.txt

Output:
didierstevens.com
zcczjhbczhbzhj.com
google.com
ryzaocnsyvozkd.com
microsoft.com
ahsnvyetdhfkg.com
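
For readers who want to reproduce this step without the tool, here is a minimal sketch with Python's standard re module. It is not how re-search.py is implemented; it only applies the same regular expression to the same list, which is embedded as a string (the variable names are mine).

```python
import re

# The example list from this diary, embedded so the sketch is self-contained.
urls = """http://didierstevens.com
http://zcczjhbczhbzhj.com
http://www.google.com
http://ryzaocnsyvozkd.com
http://www.microsoft.com
http://ahsnvyetdhfkg.com
"""

# Same pattern as on the command line: one or more lowercase letters, then ".com".
domains = re.findall(r"[a-z]+\.com", urls)

for domain in domains:
    print(domain)
```

Note that "www." is not part of the matches, because the pattern requires the letters to be directly followed by ".com".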

Detecting random looking domains is done with a method I call "gibberish detection", and it is implemented by prefixing the regular expression with a comment. Regular expressions can contain comments, like programming languages. This is a comment for regular expressions: (?#comment).

If you use re-search with regular expression comments, nothing special happens:
re-search.py "(?#comment)[a-z]+\.com" list.txt
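
The same behaviour can be checked directly in Python, whose re module also supports (?#...) comments. This small sketch (sample line and variable names are mine) shows that a pattern with a comment matches exactly like the pattern without it:

```python
import re

line = "http://www.google.com"

# The comment group is ignored by the regular expression engine.
plain = re.findall(r"[a-z]+\.com", line)
commented = re.findall(r"(?#comment)[a-z]+\.com", line)

print(plain)
print(commented)
```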

However, if your regular expression comment prefixes the regular expression, and the comment starts with keyword extra=, then you can use gibberish detection (and other methods; use re-search.py -m for a complete manual).
To use gibberish detection, you use directive S (S stands for sensical). If you want to select all strings that match the regular expression and are gibberish, you use the following regular expression comment: (?#extra=S:g). :g means that you want to filter for gibberish.
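
To illustrate the idea of a directive hidden in a leading comment, here is a hypothetical sketch of how such a prefix could be separated from the rest of the regular expression. The helper split_extra and its pattern are my own assumptions, not code from re-search.py, and they only recognise the single-directive form shown in this diary.

```python
import re

# Hypothetical: matches a leading comment of the form (?#extra=X:argument).
EXTRA_PREFIX = re.compile(r"^\(\?#extra=([A-Za-z]):([^)]*)\)")

def split_extra(regex):
    """Return (directive, argument, remaining_regex).

    directive and argument are None when the regex has no extra= prefix.
    """
    match = EXTRA_PREFIX.match(regex)
    if match is None:
        return None, None, regex
    return match.group(1), match.group(2), regex[match.end():]

print(split_extra(r"(?#extra=S:g)[a-z]+\.com"))
print(split_extra(r"(?#comment)[a-z]+\.com"))
```

An ordinary comment is left untouched, so it keeps behaving like a plain regular expression comment.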

Here is an example that extracts the alphabetical .com domains from file list.txt that are gibberish:
re-search.py "(?#extra=S:g)[a-z]+\.com" list.txt

Output:
zcczjhbczhbzhj.com
ryzaocnsyvozkd.com
ahsnvyetdhfkg.com

If you want to filter all strings that match the regular expression and are not gibberish, you use the following regular expression comment: (?#extra=S:s). :s means that you want to filter for sensical strings.

Classifying a string as gibberish or not is done with a set of classes that I developed based on work done by rrenaud at https://github.com/rrenaud/Gibberish-Detector. The training text is a public domain book in the Sherlock Holmes series, which means that the classifier is trained on English text. You can provide your own trained pickle file with option -s.
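
The general idea behind this kind of detector can be sketched as a character-transition model: count how often each character follows another in the training text, then score a candidate string by the average log-probability of its transitions. The code below is a simplified illustration under my own assumptions (function names, add-one smoothing, and a two-sentence training text); it is not the code used by re-search.py or by rrenaud's project.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

# Assumption: a tiny sample only. A usable model needs a whole book of text.
TRAINING_TEXT = (
    "To Sherlock Holmes she is always the woman. "
    "I have seldom heard him mention her under any other name."
)

def normalize(text):
    """Lowercase the text and keep only letters and spaces."""
    return [ch for ch in text.lower() if ch in ALPHABET]

def train(text):
    """Count character-to-character transitions in the training text."""
    counts = {}
    chars = normalize(text)
    for first, second in zip(chars, chars[1:]):
        row = counts.setdefault(first, {})
        row[second] = row.get(second, 0) + 1
    return counts

def score(counts, candidate):
    """Average log-probability of the transitions (higher = more text-like)."""
    chars = normalize(candidate)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return float("-inf")
    total = 0.0
    for first, second in pairs:
        row = counts.get(first, {})
        # Add-one smoothing: unseen transitions get a small, non-zero probability.
        probability = (row.get(second, 0) + 1) / (sum(row.values()) + len(ALPHABET))
        total += math.log(probability)
    return total / len(pairs)

model = train(TRAINING_TEXT)
print(score(model, "sherlock holmes"))
print(score(model, "zcczjhbczhbzhj"))
```

With such a small training text, only strings close to the sample score well; ordinary domain names would not be separated reliably. A string is classified by comparing its score with a threshold, and both the model and the threshold only become meaningful with a large training text in the right language.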

Didier Stevens
Microsoft MVP Consumer Security
blog.DidierStevens.com DidierStevensLabs.com

Keywords: DGA domain malware

Comments

Perhaps some numbers might be appropriate in the domain names. They seem to be a growing fashion these days.

{^_^} Joanne

Another script to note is one by Mark Baggett, which looks at frequency of occurrence or looks for high entropy.
His script, freq.py: https://github.com/MarkBaggett/MarkBaggett/blob/master/freq/freq.py


-JG-
You could use the work described here:
http://datadrivensecurity.info/blog/posts/2014/Sep/dga-part1/
which, as is explained in the article, comes from this:
https://github.com/ClickSecurity/data_hacking/tree/master/dga_detection

There are DGA domain names... and then there are Turkish ones. They can be difficult to tell apart.

I had the same experience with Dutch and Polish domain names. That's why you can "train the script" for a particular language. When I trained it for Dutch and Polish, these domains no longer appeared as false positives.
