7.2. UserDBConfig

This is a tool to generate the configurations for all the users across both ConsoleUser and MPTGS so that they will preserve desired statistical properties across the different application/traffic types.

7.2.1. Installation

UserDBConfig is a collection of Perl scripts and modules. To use it, some additional modules are required (that are part of a default setup). The tar file that includes UserDBConfig also includes the extra modules it needs, so they can be installed without an Internet connection; however, building those modules requires a general development environment. The module build/install process requires gcc and the ability to process Perl Makefiles (e.g., perl-ExtUtils-MakeMaker).

Once those system requirements are met, change into the libs directory. From there, run ./install_libs.sh, which processes the modules there in the order given in install_order. Once all the libraries are installed, UserDBConfig is ready to generate its files.

The tools that fetch ConsoleUser information require ConsoleUser to be installed. They use a couple of tools that ConsoleUser provides, and they also look for the file /usr/local/tg/db.info, which can store DB connection information so that it does not need to be repeated on the command line for each fetch.

Completing the process requires a MySQL/MariaDB server to be available for ConsoleUser configs. If the tool is generating MPTGS configurations as well, a PostgreSQL database server is also required. The instructions below assume that the MariaDB server is installed locally and that the user running the commands does not require a password. If either of those assumptions is not true for your environment, please modify the commands accordingly.

Fetching ConsoleUser configs expects that ConsoleUser has been installed in /usr/local/tg/cu.

7.2.2. Process

To generate the final configurations, one must first set up the tool’s configuration files. After that, GenerateTables.pl will produce intermediate files that can be used to populate a MySQL/MariaDB database. This database is populated by feeding the output file populate_db.sql to the mysql client, as in the two-step example below.

Those intermediate files can also populate a separate PostgreSQL database. This database is used directly by the MPTGS during its operation. In contrast, the MySQL/MariaDB database is used to coordinate distribution of ConsoleUser configurations to the appropriate controllers and directories.

The two steps, then, to generate config files and populate the database for ConsoleUsers are:

$ ./GenerateTables.pl main.config
$ mysql < populate_db.sql

There are additional steps to populate the database that MPTGS will use. That process has been put into the script doAllMPTGS.sh, which will prepare a PostgreSQL database, run GenerateTables.pl, massage the data for MPTGS needs, then populate the DB. After this process, that database is ready to be used by an instance of MPTGS.

doAllMPTGS.sh will populate the database named in the environment variable PGDATABASE if that variable is set; otherwise it uses a hard-coded name.

To populate both databases, the following process is recommended:

$ ./doAllMPTGS.sh
$ mysql < populate_db.sql

7.2.3. Main Config

This is the file that is passed to GenerateTables.pl and controls all of the parameters and relationships GenerateTables.pl is to produce.

This can be in YAML or Legacy Text. The treatment of many of the keys is the same, so the structure of those common elements is discussed below.

7.2.3.1. Domains

Many aspects of the user configuration deal with logical groupings of elements, called domains. The users in a domain need not be located in the same DNS domain, Windows domain, or IP space; a domain is just a convenient grouping for this configuration’s needs. Domains allow the relationships between groups to be specified, but those groups are for the convenience of the specifier and need have no meaning to anyone else.

7.2.3.2. Distribution Matrices

To express the relationships between different domains, a matrix line is specified. These lines give the likelihood of pairing users from one domain with a target from each relevant domain. For personal communication such as email, both sides are user domains (UDOMAIN<N>, as in the example below). When selecting a website to add to a user’s bookmark list, the user’s UDOMAIN is matched against the proportions given for the different WDOMAIN<N> entries.

The format of these lines is a pipe-delimited list of comma-delimited CDF values. That is, the line is first tokenized on | and each of those tokens is matched to the relevant source domain in the order the domains were specified. Each token is then tokenized on , to give CDF values that map to the target domains in the order they were specified. For example, a config file containing

NDOMAINS 2
UDOMAIN1 firstUsers
UDOMAIN2 secondUsers

NWDOMAINS 3
WDOMAIN1 firstWeb
WDOMAIN2 secondWeb
WDOMAIN3 thirdWeb

WDIST 0.5,0.8,1.0|0.25,1.0,1.0

would cause users in UDOMAIN1 to have 50% of their bookmarks go to sites in firstWeb, 30% to sites in secondWeb, and 20% to sites in thirdWeb. UDOMAIN2 users would have 25% of their bookmarks go to sites in firstWeb, 75% to sites in secondWeb, and none to sites in thirdWeb.

(One side note: because our traffic generally follows links in pages, this does not guarantee that the traffic will ultimately match these percentages, only that initial selections will be made from pools with these percentages.)
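As an illustration of the matrix format described above, the following Python sketch parses the WDIST value from the example, converts each CDF row into per-domain proportions, and picks a target domain from a uniform random draw. The function names are hypothetical and this is not the code GenerateTables.pl uses; it only assumes the pipe/comma CDF layout described in this section.

```python
import bisect
import random


def parse_dist(value):
    """Split a matrix value into one list of CDF values per source domain."""
    return [[float(v) for v in row.split(",")] for row in value.split("|")]


def cdf_to_shares(cdf_row):
    """Convert cumulative values into the proportion for each target domain."""
    shares = []
    previous = 0.0
    for value in cdf_row:
        # Rounding only hides floating-point noise such as 0.30000000000000004.
        shares.append(round(value - previous, 10))
        previous = value
    return shares


def pick_target(cdf_row, targets, rng=random):
    """Return the first target whose CDF value exceeds a uniform draw in [0, 1)."""
    index = bisect.bisect_right(cdf_row, rng.random())
    # Clamp in case the last CDF value is slightly below 1.0.
    return targets[min(index, len(targets) - 1)]


web_domains = ["firstWeb", "secondWeb", "thirdWeb"]
matrix = parse_dist("0.5,0.8,1.0|0.25,1.0,1.0")
for name, row in zip(["firstUsers", "secondUsers"], matrix):
    print(name, dict(zip(web_domains, cdf_to_shares(row))))
```

Because the second row repeats 1.0, the third target has a share of zero and is never picked for secondUsers.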

7.2.3.3. Popularity Distribution

Many aspects of configuration involve users building lists of values the user will choose from. This includes things like applications to use, websites to visit, recipients to email, etc. Our observations from various data sources (see Why we do what we do) are that many of these lists have a long tail in terms of how many elements they contain.

Some users will stick to a small number of choices for how to go about a task, while others will have a wide variety of values. In real life, we can imagine one user who only visits a couple of websites and rarely strays from them, versus someone who is continually visiting new sites.

To capture that variety we generate the size of assorted lists as follows:

\[randomValue \le 1 - k^{-NODE\_PARAM}\]

where k is the smallest number that makes the right-hand side rise above a value drawn from a uniform random distribution. We then add a minimum number of choices, which is currently 3.
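The inequality above can be inverted to give k directly: the smallest integer satisfying it is the ceiling of (1 - randomValue) raised to the power -1/NODE_PARAM. The Python sketch below shows that calculation. It makes two assumptions that the text does not confirm: that k is an integer starting at 1, and that the minimum of 3 is added to k rather than used as a floor. The function names are hypothetical.

```python
import math
import random

MIN_CHOICES = 3  # minimum number of choices stated in the text


def power_law_k(u, node_param):
    """Smallest integer k >= 1 with u <= 1 - k**(-node_param), for 0 <= u < 1."""
    k = max(1, math.ceil((1.0 - u) ** (-1.0 / node_param)))
    # Correct for floating-point rounding at the boundary in either direction.
    while u > 1.0 - k ** (-node_param):
        k += 1
    while k > 1 and u <= 1.0 - (k - 1) ** (-node_param):
        k -= 1
    return k


def list_size(node_param, rng=random):
    """Assumed reading: list size is the minimum plus the power-law draw."""
    return MIN_CHOICES + power_law_k(rng.random(), node_param)


rng = random.Random(42)
print(sorted(list_size(1.5, rng) for _ in range(10)))
```

Larger values of NODE_PARAM shorten the tail, so most users end up close to the minimum list size.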

Other modules select list sizes fitting a lognormal distribution, such that the number is generated as:

\[ \begin{aligned} Z &= random\_normal() \\ N &= e^{\mu + \sigma Z} \end{aligned} \]

where mu and sigma can be specified for each activity type via <type>_COUNT_MU and <type>_COUNT_SIGMA respectively.
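A minimal Python sketch of the lognormal draw follows. The text does not say how the real-valued N is turned into a list length, so rounding to the nearest integer with a floor of 1 is an assumption here. The function name and the example mu and sigma values are hypothetical.

```python
import math
import random


def lognormal_count(mu, sigma, rng=random):
    """Draw a list size N = exp(mu + sigma * Z) with Z standard normal."""
    z = rng.gauss(0.0, 1.0)
    n = math.exp(mu + sigma * z)
    # Assumption: round to an integer and never return fewer than 1 item.
    return max(1, round(n))


rng = random.Random(7)
# Hypothetical values standing in for <type>_COUNT_MU and <type>_COUNT_SIGMA.
print([lognormal_count(2.0, 0.75, rng) for _ in range(10)])
```

With sigma set to 0 every draw collapses to exp(mu), which is a convenient check when choosing mu.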

After building a list of the proper size, each element gets its own per-user probability. An element can show up in multiple users’ lists and have a different likelihood of being selected from the list for each of them. The probabilities for each user’s list containing N elements are assigned following a Zipf distribution, with the normalizing constant (the reciprocal of the generalized harmonic number) calculated as

\[H_{k, N} = \frac{1}{\sum_{k=1}^N{k^{-zipf}}}\]

where the zipf number can be specified for each activity type via a <type>_ZIPF parameter in the main config file. The zipf parameter is a shape parameter as defined in the Riemann zeta function.

Each element then gets assigned its probability based on its list index i as

\[CDF_i = \sum_{n=1}^i{H_{k, N} * n^{-zipf}}\]
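The per-user CDF can be sketched in Python as follows. This assumes the constant H is chosen so that the last CDF value equals 1, that is, H is the reciprocal of the sum of k^{-zipf} for k from 1 to N. The function name is hypothetical.

```python
def zipf_cdf(n_items, zipf):
    """Return CDF_1..CDF_N for a Zipf-weighted list of n_items elements."""
    weights = [k ** (-zipf) for k in range(1, n_items + 1)]
    h = 1.0 / sum(weights)  # normalizing constant so the CDF ends at 1
    cdf = []
    running = 0.0
    for weight in weights:
        running += h * weight
        cdf.append(running)
    return cdf


print(zipf_cdf(3, 1.0))
```

For three items with zipf = 1 the weights are 1, 1/2 and 1/3, so the first element is selected with probability 6/11 and the first two together with probability 9/11.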

That determines how many items will be added to the user’s preference list. We also want there to be some items that are selected as the most popular across all users. To choose an example from the real world, when browsing the web different people have different sets of sites they typically visit. Still, across all people some sites show up in nearly every list: Google (or similar) is probably placed highly on everyone’s list of sites to visit, while the rest of the list varies from person to person.

When the next element is chosen to add to the user’s list, first we select a domain using the <type>DIST matrix. Within that domain a specific element is chosen. The probability of each candidate in that domain being chosen is generated according to the following formula where i is the index of the entry in the list:

\[ \begin{aligned} val_i &= 0.5 \cdot \operatorname{erfc}\left(\frac{i - 1}{\sigma}\right) + i^{-zipf} \\ H &= \sum_{n=1}^{\text{list size}} val_n \\ PDF_i &= \frac{val_i}{H} \end{aligned} \]

zipf and sigma can be specified by adding <type>_ZIPF and <type>_SIGMA to the config file respectively. Note that this sigma is different from the <type>_COUNT_SIGMA described above.
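The candidate-selection formula can be sketched in Python using the standard library's math.erfc. The function name and the example parameter values are hypothetical; the sketch only assumes the formula as written, with val computed per index i and H as the sum over the whole candidate list.

```python
import math


def popularity_pdf(list_size, zipf, sigma):
    """Probability of each candidate (index 1..list_size) being chosen."""
    vals = [
        0.5 * math.erfc((i - 1) / sigma) + i ** (-zipf)
        for i in range(1, list_size + 1)
    ]
    h = sum(vals)  # normalizer so the probabilities sum to 1
    return [v / h for v in vals]


# Hypothetical values standing in for <type>_ZIPF and <type>_SIGMA.
for index, p in enumerate(popularity_pdf(5, 1.0, 2.0), start=1):
    print(index, round(p, 4))
```

The erfc term boosts roughly the first sigma entries of the candidate list, while the zipf term supplies the long tail, which is how a few entries end up near the top of most users' lists.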

Combining these then, we have the likelihood of any element appearing in a user’s list governed by one distribution, and then how popular that element is within a user’s list governed by another. Each user has their own favorite entries, but across the population some entries are going to be more popular than others.