The "No Standard Format" Problem
You’ve probably looked at a “security feed” before. I’m not talking about an RSS feed of security news. I’m talking about lists of bad IPs or bad domains or malware hashes. Have you ever tried to do anything with one? How about more than one? Chances are if you’ve ever set eyes on more than one of them, you’ve noticed something: they NEVER come in the same format.
The problem is that there’s no standard format for sharing this info. And everyone’s format is better than everyone else’s. Not to mention, all of the “things” that can use this information - Snort, Splunk, Bro - take it in different formats (literally, those 3 take 3 different formats).
Splunk is the closest to “standard” in that it uses regular old CSV, and it’ll take whatever fields are in the CSV, as long as you know how to relate them to your data (hopefully there’s an IP or domain field in your data and the CSV). Bro needs a particularly formatted CSV in what appears to be a common non-standard format, CSV-with-a-#-sign-before-the-headers. Snort, well, it needs Snort rules (there’s also some reputation stuff, but that’s another topic).
Enter the Collective Intelligence Framework Tool
The fact of the matter is, nothing matches anything else, or what your tool of choice is expecting. Fortunately, there’s a tool out there that claims to solve this problem - CIF, or Collective Intelligence Framework. CIF is a project led by REN-ISAC (if you’ve heard of them before, I’m impressed; I hadn’t, and their website tells me they haven’t had a web designer since the 90’s). Other major supporters include the NSF, Indiana University and Internet2. The problem with CIF is it's a beast of a piece (or many pieces) of code. It is written in Perl, with a hodgepodge of module dependencies, which fail if you try to run it on the newest Ubuntu, and in testing it can crash a large-sized Linode. I’m not entirely sure what it’s doing that can cause that, but one of two things is wrong with this picture: either this code is bloated and disgusting and massively fragile, or the task at hand requires such brute force that it tackles a Linode. What was that task again? Right. Parsing *plain text* data feeds and “normalizing” them (read: storing them in a common database) for later access. Maybe doing some DNS lookups.
There are some services out there that try to make this better by providing various APIs into their data. The most common format is DNSBL. In this world, you can’t download the whole list, but you can check any IP relatively quickly by reversing its octets and shoving the blacklist’s domain on the end. This is nice for real-time checking of IPs against a blacklist or two (its most common use is in spam filtering). However, when you’re trying to run live analytics on a live network stream of potentially gigabytes of traffic, you’re going to knock over whatever service you’re using. Which is why you’re normally limited in the number of requests you can make unless you pay top dollar.
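To make the "reverse it and shove a domain on the end" part concrete, here is a minimal Python sketch. The zone `dnsbl.example.org` is a placeholder, not a real blacklist, and the function names are made up for illustration. It assumes the usual DNSBL convention that a listed address resolves to an A record and an unlisted one does not resolve at all.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the DNSBL zone answers for this address.

    Assumption: a listed address resolves (typically to 127.0.0.x) and an
    unlisted one fails to resolve. Every call is one DNS query, which is
    exactly why this does not scale to a live traffic stream.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False


# "dnsbl.example.org" is a hypothetical zone used only for illustration.
print(dnsbl_query_name("192.0.2.99", "dnsbl.example.org"))
# prints 99.2.0.192.dnsbl.example.org
```

One lookup per IP is fine for a mail server checking each incoming connection. Multiply it by every address in a busy network stream and you can see where the rate limits come from.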
There are also more and more services starting to offer API access. Emerging Threats is one of them (their service is currently in closed beta). For a small fee, you can query their database for any IP, domain or malware hash and get information about it (historical information no less). It’s actually quite a nice intel system. The problem here is it's still only one source of data, and it’s still not very good for real-time lookups.
The Answer Is: There Is No Real Good Answer
The answer is there is no real good answer. I can only think that right now people in the industry are: a) using something like CIF and over-dedicating computing resources to something that should be a few scripts and a cronjob, b) using a few cobbled-together scripts and a cronjob, or c) doing everything by hand. Until there’s either a standard format for all of this data, or a simple tool to aggregate it, security intelligence scores an F on the usability scale, and an F on the “plays well with others” scale.
In an ideal world, security intelligence feeds would come in some magical standard format that contained all of the relevant data you could ever want - the “ioc”, what type of “thing” it is, when it was first and last seen doing bad things, a link to some article describing its behavior, a confidence level, a recommended action (block, log, etc). The fact of the matter is, that’s all a pipe dream. So, the next best thing comes close to CIF, but is smoother and cleaner, more reliable, and less of a resource hog. There’s an encouraging tool called TekDefense Automator but it, too, is intended for manual analysis and not real-time analysis.
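For the sake of illustration, the wish list above could look something like the following Python record. This is only a sketch: the class name, the field names, the 0-100 confidence scale and the merge rule are all assumptions of mine, not any existing standard.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class Indicator:
    """One hypothetical normalized feed entry (field names are made up)."""
    value: str             # the "ioc" itself: an IP, domain or hash
    kind: str              # what type of "thing" it is, e.g. "ip"
    first_seen: datetime   # first time it was seen doing bad things
    last_seen: datetime    # most recent time it was seen
    reference: str = ""    # link to an article describing its behavior
    confidence: int = 50   # assumed scale: 0 (no idea) to 100 (certain)
    action: str = "log"    # recommended action: "block", "log", ...


def merge(a: Indicator, b: Indicator) -> Indicator:
    """Combine two sightings of the same indicator from different feeds.

    Assumed rule: keep the widest time window, and take the confidence
    and action from whichever sighting is more confident.
    """
    if (a.value, a.kind) != (b.value, b.kind):
        raise ValueError("cannot merge different indicators")
    stronger, weaker = (a, b) if a.confidence >= b.confidence else (b, a)
    return replace(
        stronger,
        first_seen=min(a.first_seen, b.first_seen),
        last_seen=max(a.last_seen, b.last_seen),
        reference=stronger.reference or weaker.reference,
    )
```

Once every feed is squeezed into one shape like this, merging two lists that mention the same IP stops being a parsing problem and becomes a few lines of bookkeeping.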
What's the Next Best Thing?
The ideal “next best thing” is a simple, easy to configure tool with a web interface where you can enter the source of the list, what format it’s in (something resembling CSV, or XML, or just a plain ol’ list, etc), tag it (malware, phishing, tor, etc), set a confidence, etc. Then, on a set schedule, this tool would download all the lists, parse out the “IOC”s, insert them or update their records in the DB, etc. Lastly, it would provide customizable outputs for various formats, such as “Splunk lookup table” (aka good ol’ CSV), Bro, Snort rules, whatever. It could even output Cisco/iptables configs to block the IPs, a zone file for a DNSBL, etc. It should be simple and easy to add new output mechanisms. And, most importantly, it should run on a poor, defenseless Linode who never saw it coming. Without a standard “ioc” description format, this is the next best thing.
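Here is a rough Python sketch of the parse-and-output half of that idea, just to show how little code the core needs. It is not an existing tool: the function names, the three-field record and the `OUTPUTS` registry are invented for illustration, it only handles the "plain ol’ list" input format, it assumes every entry is an IP, and it skips the download, schedule and database parts entirely.

```python
import csv
import io
from typing import Callable, Dict, List

Record = Dict[str, str]
FIELDS = ["ioc", "tag", "confidence"]


def parse_plain_list(text: str, tag: str, confidence: int) -> List[Record]:
    """Parse a one-indicator-per-line feed, skipping blanks and # comments."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        records.append({"ioc": line, "tag": tag, "confidence": str(confidence)})
    return records


# Hypothetical registry: adding a new output format means adding one function.
OUTPUTS: Dict[str, Callable[[List[Record]], str]] = {}


def output(name: str):
    """Decorator that registers an output formatter under a name."""
    def register(func):
        OUTPUTS[name] = func
        return func
    return register


@output("csv")
def to_csv(records: List[Record]) -> str:
    """Plain CSV with a header row, the kind a lookup table can use."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


@output("iptables")
def to_iptables(records: List[Record]) -> str:
    """One DROP rule per record; assumes every ioc is an IP address."""
    return "".join(f"iptables -A INPUT -s {r['ioc']} -j DROP\n" for r in records)


feed = "# example feed\n192.0.2.1\n\n198.51.100.7\n"
records = parse_plain_list(feed, tag="tor", confidence=80)
print(OUTPUTS["csv"](records))
print(OUTPUTS["iptables"](records))
```

The registry is the part that matters: a Bro or Snort formatter would be one more decorated function, and nothing else in the tool would have to change.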