7.1. Why we do what we do

While ConsoleUser supports many models of user behavior, we have found that in many situations a Markov model of transitions between applications, and of activities within an application, often produces users whose profiles can be distinguished from each other and also reflect some sources of real-world data. That is what our automated tools produce. If other models are needed, we have provided ways to use them, including examples that use ConsoleUser to “act” out parts of a script to create insider threat scenarios (for example, see this paper).
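As a rough illustration of the idea, the sketch below walks a small first-order Markov chain over applications. It is not ConsoleUser code: the application names, the probabilities, and the simulate helper are all hypothetical, chosen only to show how a transition table drives a sequence of application switches.

```python
import random

# Hypothetical applications and transition probabilities (illustrative only;
# these values are not taken from ConsoleUser or any of its data sources).
TRANSITIONS = {
    "browser": {"browser": 0.6, "email": 0.3, "editor": 0.1},
    "email": {"browser": 0.4, "email": 0.4, "editor": 0.2},
    "editor": {"browser": 0.2, "email": 0.2, "editor": 0.6},
}


def simulate(transitions, start, steps, rng):
    """Walk the chain for `steps` transitions and return the visited states."""
    state = start
    path = [state]
    for _ in range(steps):
        row = transitions[state]
        # Draw the next application according to the current state's row.
        state = rng.choices(list(row), weights=list(row.values()), k=1)[0]
        path.append(state)
    return path


rng = random.Random(0)  # seeded so repeated runs give the same walk
path = simulate(TRANSITIONS, "browser", 20, rng)
print(path)
```

Two simulated users given different transition tables will tend to produce visibly different walks, which is the property the paragraph above relies on.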

Our models focus on maintaining the distribution of activity across the applications a user uses, without paying attention to what the particular applications are. That is, we attempt to recreate the distribution of time a user spends on their “favorite” application, their second favorite, and so on down the line. We do not, however, try to make the choice of which application is the favorite for each user match what is actually the most popular application, either in a population or for a particular model of the “type” of user each could be.
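One way to picture this rank-based approach is sketched below, under stated assumptions: the observed usage figures, the application names, and the helpers rank_profile and assign_profile are hypothetical and are not part of ConsoleUser. The sketch keeps only the sorted time shares from an observed user and attaches them to arbitrarily chosen applications for a synthetic user.

```python
import random


def rank_profile(time_by_app):
    """Return time shares sorted from most to least used, dropping app names."""
    total = sum(time_by_app.values())
    return sorted((t / total for t in time_by_app.values()), reverse=True)


def assign_profile(shares, available_apps, rng):
    """Attach rank shares to randomly chosen apps.

    Which application ends up as the "favorite" is arbitrary; only the shape
    of the distribution across ranks is preserved.
    """
    chosen = rng.sample(available_apps, k=len(shares))
    return dict(zip(chosen, shares))


# Hypothetical observed minutes per application for one real user.
observed = {"app_a": 50.0, "app_b": 30.0, "app_c": 20.0}
shares = rank_profile(observed)

# Hypothetical pool of applications available to the synthetic user.
pool = ["mail", "browser", "editor", "terminal"]
synthetic = assign_profile(shares, pool, random.Random(1))
print(synthetic)
```

The synthetic user spends the same fraction of time on a first, second, and third favorite as the observed user did, but nothing ties those ranks to the applications the observed user actually preferred.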

For a more detailed presentation of how our Markov transition matrices are generated, see here.

7.1.1. Data Sources

Over the years we have been able to access different pools of data in order to shape our user models.

Schonlau Data

This is the data used by DuMouchel and Schonlau for their paper (http://www.schonlau.net/publication/interface98.pdf) on using transition probabilities to predict masqueraders. The data is available at http://www.schonlau.net/ and consists of Unix command sequences from 50 users.
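For readers unfamiliar with transition probabilities over command sequences, the sketch below shows a simple count-based estimate from a single sequence. It is only an illustration: the command list is made up, the transition_probabilities helper is hypothetical, and this is neither the statistical procedure from the DuMouchel and Schonlau paper nor the method our tools use.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate first-order transition probabilities from adjacent pairs."""
    counts = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    # Normalize each row of counts so it sums to 1.
    return {
        state: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for state, row in counts.items()
    }


# Hypothetical command sequence for one user (not taken from the data set).
commands = ["ls", "cd", "ls", "cat", "ls", "cd", "vi"]
probs = transition_probabilities(commands)
for state, row in probs.items():
    print(state, row)
```

A command that only ever appears as the last item of the sequence (here "vi") has no outgoing row, so a real pipeline would need to decide how to handle such states.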

NSA Data

As part of our DARPA Dynamic Quarantine of Worms (DQW) program, we had access to a set of Windows application data. The data was arranged in a way that allowed identification of applications that were spawned by parent applications. This data was never published but was made available to researchers on a limited basis.

Skaion Data

This data was captured by Skaion Corporation during 2004 to aid in enclave modelling for the DARPA Dynamic Quarantine of Worms (DQW) program. Although the main focus here was on network traffic, user behavior can be inferred from the appearance of traffic on the network.

ADAMS Data

The ADAMS program had, as part of its evaluation procedures, the user activity of ~5000 users gathered using SureView (http://www.prnewswire.com/news-releases/raytheon-chosen-by-darpa-for-critical-cybersecurity-research-program-on-insider-threats-126829403.html). Although we were not allowed to use the data outside of a secure environment, we were allowed to export statistical data to help inform our modelling. With sufficient warning to allow us to clean up and package that data, and with the permission of DARPA and Raytheon, we may be able to make this data available to researchers.