7.2. UserDBConfig
This is a tool to generate the configurations for all the users across both ConsoleUser and MPTGS so that they will preserve desired statistical properties across the different application/traffic types.
7.2.1. Installation
UserDBConfig is a collection of Perl scripts and modules. To use it, some additional modules are required (that are part of a default setup). The tar that includes UserDBConfig also includes the extra modules it needs, so they can be installed without an Internet connection; however, building those modules requires a general development environment. The module build/install process requires gcc and the ability to process Perl Makefiles (e.g., perl-ExtUtils-MakeMaker).
Once those system requirements are met, change into the libs directory. From there, run ./install_libs.sh, which will process the modules there in the order given in install_order. Once all the libraries are installed, UserDBConfig is ready to run to generate its files.
The tools to fetch ConsoleUser information require ConsoleUser to be installed. They use a couple of tools that ConsoleUser provides, and they will also look for the file /usr/local/tg/db.info, which can store DB connection information instead of needing to repeat it on the command line for each fetch.
Completing the process requires a MySQL/MariaDB server to be available for ConsoleUser configs. If the tool is generating for MPTGS as well, a PostgreSQL database server is also required. The instructions below assume the MariaDB server is installed locally and that the user running the commands does not require a password. If either of those assumptions is not true for your environment, please modify the commands accordingly.
Fetching ConsoleUser configs expects that ConsoleUser has been installed in /usr/local/tg/cu.
7.2.2. Process
To generate the final configurations, one must first set up the tool’s configuration files. After that, GenerateTables.pl will produce intermediate files that can be used to populate a MySQL/MariaDB database. This database can be populated by using the output file populate_db.sql with a command like mysql < populate_db.sql.
Those intermediate files can also populate a separate PostgreSQL database. This database is used directly by MPTGS during its operation. In contrast, the mysql database is used to coordinate distribution of ConsoleUser configurations to appropriate controllers and directories.
The two steps, then, to generate config files and populate the database for ConsoleUsers are:
$ ./GenerateTables.pl main.config
$ mysql < populate_db.sql
There are additional steps to populate the database that MPTGS will use. That process has been put into the script doAllMPTGS.sh, which will prepare a PostgreSQL database, run GenerateTables.pl, massage the data for MPTGS needs, then populate the DB. After this process, that database is ready to be used by an instance of MPTGS.
doAllMPTGS.sh will populate the database named in the environment variable PGDATABASE if it exists; otherwise a hard-coded name is used.
To populate both databases, the following process is recommended:
$ ./doAllMPTGS.sh
$ mysql < populate_db.sql
7.2.3. Main Config
This is the file that is passed to GenerateTables.pl and controls all of the parameters and relationships GenerateTables.pl is to produce.
This can be in YAML or Legacy Text. The treatment of many of the keys is the same, so the structure of those common elements is discussed below.
7.2.3.1. Domains
Many aspects of the user configuration deal with logical groupings of elements, called domains. A domain of users need not be located in the same DNS domain, Windows domain, or IP space; it is just a convenient grouping for this configuration’s needs. Domains allow the relationship between groups to be specified, but those groups are for the convenience of the specifier and need have no meaning to anyone else.
7.2.3.2. Distribution Matrices
To express the relationships between different domains, a matrix line is specified. These lines give the likelihood of pairing users from one domain with a target from each relevant domain. For personal communication such as email, the domains are both user domains (UDOMAIN<N> above). When selecting a website for a user to add to its bookmark list, the user from one UDOMAIN is matched against the proportion of different WDOMAIN<N>.
The format of these lines is a pipe-delimited list of comma-delimited CDF values. That is, the line is first tokenized on | and each of those tokens is matched to the relevant domain in the order each was specified. Those tokens are themselves tokenized on , where the values are CDF values mapping to the target domains in the order they were specified. For example, a config file containing
NDOMAINS 2
UDOMAIN1 firstUsers
UDOMAIN2 secondUsers
NWDOMAINS 3
WDOMAIN1 firstWeb
WDOMAIN2 secondWeb
WDOMAIN3 thirdWeb
WDIST 0.5,0.8,1.0|0.25,1.0,1.0
would result in UDOMAIN1 users having their bookmarks go 50% to sites in firstWeb, 30% to sites in secondWeb, and 20% to sites in thirdWeb. UDOMAIN2 users would have their bookmarks go 25% to sites in firstWeb, 75% to sites in secondWeb, and not at all to sites in thirdWeb.
(One side note: because our traffic generally follows links in pages, this does not guarantee that the traffic will ultimately represent these percentages, just that initial selections will be made from pools with these percentages.)
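The tokenizing described above can be sketched in Python. This is only an illustration of the line format, not code from UserDBConfig: the function names parse_dist, cdf_to_shares, and pick_domain are hypothetical, and selecting a domain by comparing a uniform draw against the CDF row is an assumption about how such a row would typically be used.

```python
import bisect
import random


def parse_dist(line):
    """Split a matrix line on '|' (one row per source domain), then on ','."""
    return [[float(v) for v in row.split(",")] for row in line.split("|")]


def cdf_to_shares(cdf):
    """Turn cumulative values into per-domain shares by differencing."""
    shares, prev = [], 0.0
    for value in cdf:
        # Rounding only hides floating-point noise such as 0.30000000000000004.
        shares.append(round(value - prev, 10))
        prev = value
    return shares


def pick_domain(cdf, targets, rng=random):
    """Pick the first target whose CDF value exceeds a uniform draw in [0, 1).

    Assumes the last CDF value in the row is 1.0.
    """
    return targets[bisect.bisect_right(cdf, rng.random())]


web_domains = ["firstWeb", "secondWeb", "thirdWeb"]
matrix = parse_dist("0.5,0.8,1.0|0.25,1.0,1.0")
for source, row in zip(["firstUsers", "secondUsers"], matrix):
    print(source, dict(zip(web_domains, cdf_to_shares(row))))
```

Differencing each row recovers the percentages quoted in the example: 0.5, 0.3, and 0.2 for the first user domain, and 0.25, 0.75, and 0.0 for the second.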
7.2.3.3. Popularity Distribution
Many aspects of configuration involve users building lists of values the user will choose from. This includes things like applications to use, websites to visit, recipients to email, etc. Our observations from various data sources (see Why we do what we do) are that many of these lists have a long tail in terms of how many elements they contain.
Some users will stick to a small number of choices for how to go about a task, while others will have a wide variety of values. In real life, we can imagine one user who only visits a couple of websites and rarely strays from them, versus someone who is continually visiting new sites.
To capture that variety we generate the size of assorted lists as follows:
where k is the number that makes that side rise above a value from a uniform random distribution. We then add a minimum number of choices, which is currently 3.
Other modules select list sizes fitting a lognormal distribution such that the number is generated as:
where mu and sigma can be specified for each activity type via <type>_COUNT_MU and <type>_COUNT_SIGMA respectively.
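A minimal Python sketch of a lognormal list size, assuming the textbook construction exp(mu + sigma * Z) with Z drawn from a standard normal distribution. The function name lognormal_list_size, the rounding to an integer, and the floor of one element are assumptions of this sketch and may differ from what the modules actually do.

```python
import math
import random


def lognormal_list_size(mu, sigma, rng=random, minimum=1):
    """Draw a list size as exp(mu + sigma * Z), with Z ~ N(0, 1).

    Rounding to an integer and the `minimum` floor are assumptions of this
    sketch, not documented behavior.
    """
    z = rng.gauss(0.0, 1.0)
    return max(minimum, round(math.exp(mu + sigma * z)))


# Example: the values a <type>_COUNT_MU / <type>_COUNT_SIGMA pair might hold.
rng = random.Random(42)
sizes = [lognormal_list_size(2.0, 0.5, rng) for _ in range(10)]
print(sizes)
```

Larger sigma values stretch the tail, so a few users end up with much longer lists than the typical user, which is the long-tail behavior described above.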
After building a list of the proper size, each element gets its own per-user probability. An element can show up in multiple users’ lists and have a different likelihood of being selected from the list for each of them. The probabilities for each user’s list containing N elements are assigned following a Zipf distribution with the harmonic number calculated as
where the zipf number can be specified for each activity type via a <type>_ZIPF parameter in the main config file. The zipf parameter is a shape parameter as defined in the Riemann zeta function.
Each element then gets assigned its probability based on its list index i as
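A minimal Python sketch of this assignment, assuming the standard Zipf form: index i (counted from 1) gets weight 1/i^zipf, and the harmonic number is the sum of those weights over the N elements. The function name zipf_probabilities and the 1-based indexing are assumptions of the sketch, not taken from UserDBConfig.

```python
def zipf_probabilities(n, zipf):
    """Return per-index probabilities for a list of n elements.

    Assumes p_i = (1 / i**zipf) / H, where H is the generalized harmonic
    number sum(1 / k**zipf for k in 1..n) and i is counted from 1.
    """
    weights = [1.0 / (i ** zipf) for i in range(1, n + 1)]
    harmonic = sum(weights)
    return [w / harmonic for w in weights]


# Example: a five-element list with a hypothetical <type>_ZIPF value of 1.0.
probs = zipf_probabilities(5, 1.0)
print([round(p, 3) for p in probs])
```

Under this form, a zipf value of 0 gives every element the same probability, and larger values concentrate more of the probability on the first entries of the list.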
That determines how many items will be added to the user’s preference list. We also want there to be some items that are selected as the most popular across all users. To choose an example from the real world, when browsing the web different people have different sets of sites they typically visit. Still, across all people some sites show up in nearly every list: Google (or similar) is probably highly placed on everyone’s list of sites to visit, while the rest of the list varies from person to person.
When the next element is chosen to add to the user’s list, first we select a domain using the <type>DIST matrix. Within that domain a specific element is chosen. The probability of each candidate in that domain being chosen is generated according to the following formula, where i is the index of the entry in the list:
zipf and sigma can be specified by adding <type>_ZIPF and <type>_SIGMA to the config file respectively. Note that this sigma is different from the <type>_COUNT_SIGMA from above.
Combining these then, we have the likelihood of any element appearing in a user’s list governed by one distribution, and then how popular that element is within a user’s list governed by another. Each user has their own favorite entries, but across the population some entries are going to be more popular than others.