.. _userdb_config:

UserDBConfig
============

This is a tool to generate the configurations for all the users across
both :ref:`cu` and :ref:`mptgs` so that they will preserve
desired statistical properties across the different application/traffic
types.

Installation
------------

UserDBConfig is a collection of Perl scripts and modules.  It requires
some additional modules that are not part of a default Perl setup.
The tar that includes UserDBConfig bundles those extra modules so they
can be installed without an Internet connection; however, building
them requires a general development environment.  The module
build/install process requires ``gcc`` and the ability to process Perl
Makefiles (e.g., perl-ExtUtils-MakeMaker).

Once those system requirements are met, change into the ``libs``
directory.  From there run ``./install_libs.sh``, which will process the
modules there in the order given in ``install_order``.  Once all the
libraries are installed, UserDBConfig is ready to generate its files.

The tools that fetch ConsoleUser information require ConsoleUser to be
installed.  They use a couple of the tools it provides, and they also
look for the file ``/usr/local/tg/db.info``, which can store DB
connection information so that it does not need to be repeated on the
command line for each fetch.

Completing the process requires a MySQL/MariaDB server to be available
for :ref:`cu` configs.  If the tool is generating :ref:`mptgs` configs
as well, a PostgreSQL database server is also required.  The
instructions below assume the MariaDB server is installed locally and
that the user running the commands does not require a password.  If
either of those assumptions is not true for your environment, please
modify the commands accordingly.

Fetching ConsoleUser configs expects that ConsoleUser has been
installed in ``/usr/local/tg/cu``.


Process
-------

To generate the final configurations, first set up the tool's
configuration files.  After that, ``GenerateTables.pl`` will produce
intermediate files that can be used to populate a MySQL/MariaDB
database.  This database can be populated by using the output file
``populate_db.sql`` with a command like ``mysql < populate_db.sql``.

Those intermediate files can also populate a separate PostgreSQL
database.  This database is used directly by the :ref:`mptgs` during
its operation.  In contrast, the MySQL/MariaDB database is used to
coordinate distribution of :ref:`cu` configurations to the
appropriate controllers and directories.

The two steps to generate config files and populate the database for
ConsoleUsers are then:

.. code-block:: bash

    $ ./GenerateTables.pl main.config
    $ mysql < populate_db.sql

There are additional steps to populate the database that MPTGS will use.
That process has been put into the script ``doAllMPTGS.sh``, which will
prepare a PostgreSQL database, run ``GenerateTables.pl``, massage the
data for MPTGS needs, then populate the DB.  After this process, that
database is ready to be used by an instance of MPTGS.

``doAllMPTGS.sh`` will populate the database named in the environment
variable ``PGDATABASE`` if it is set; otherwise it uses a hard-coded
name.

To populate both databases, the following process is recommended:

.. code-block:: bash

    $ ./doAllMPTGS.sh
    $ mysql < populate_db.sql


Main Config
-----------

This is the file that is passed to ``GenerateTables.pl`` and controls
all of the parameters and relationships ``GenerateTables.pl`` is to
produce.

This can be in :ref:`YAML <userdb_yaml_config>` or
:ref:`Legacy Text <userdb_text_config>`.  The treatment of many of the
keys is the same, so the structure of those common elements is
discussed below.


Domains
^^^^^^^

Many aspects of the user configuration deal with logical groupings of
elements, called domains.  A domain of users need not be located in
the same DNS domain, Windows domain, or IP space; it is just a
convenient grouping for this configuration's needs.  Domains allow the
relationships between groups to be specified, but those groups exist
for the convenience of the specifier and need have no meaning to anyone
else.
 
.. _dist_matrix:

Distribution Matrices
^^^^^^^^^^^^^^^^^^^^^

To express the relationships between different domains, a matrix line is
specified.  These lines give the likelihood of pairing users from one
domain with a target from each relevant domain.  For personal
communication such as email, both domains are user domains (the
``UDOMAIN*N*`` entries).  When selecting a website for a user to add to
its bookmark list, the user's ``UDOMAIN`` is matched against the
proportions given for the different ``WDOMAIN*N*`` entries.

The format of these lines is a pipe delimited list of comma delimited
CDF values.  That is, the line is first tokenized on ``|`` and each
of those tokens is matched to the relevant domain in the order each
was specified.  Those tokens are themselves tokenized on ``,`` where
the values are CDF values mapping to the target domains in the order
they were specified.  For example, a config file containing

.. code-block:: ini

    NDOMAINS 2
    UDOMAIN1 firstUsers
    UDOMAIN2 secondUsers

    NWDOMAINS 3
    WDOMAIN1 firstWeb
    WDOMAIN2 secondWeb
    WDOMAIN3 thirdWeb

    WDIST 0.5,0.8,1.0|0.25,1.0,1.0

would result in ``UDOMAIN1`` users having their bookmarks go 50% to
sites in ``firstWeb``, 30% to sites in ``secondWeb``, and 20% to sites
in ``thirdWeb``.  ``UDOMAIN2`` users would have their bookmarks go 25%
to sites in ``firstWeb``, 75% to sites in ``secondWeb``, and not at all
to sites in ``thirdWeb``.

(As a side note: because our traffic generally follows links in pages,
this does not guarantee that the traffic will ultimately represent
these percentages, just that initial selections will be made from pools
with these percentages.)
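
As an illustration only, the following Python sketch shows one way to
read a matrix line of this shape and turn each CDF row back into
per-domain shares.  It is not the code used by ``GenerateTables.pl``;
the function names ``parse_dist_matrix``, ``cdf_to_shares``, and
``pick_target`` are hypothetical and exist only for this example.

.. code-block:: python

    def parse_dist_matrix(line):
        """Split a pipe-delimited line of comma-delimited CDF values into rows."""
        return [[float(v) for v in row.split(",")] for row in line.split("|")]


    def cdf_to_shares(cdf_row):
        """Convert one CDF row into the share that goes to each target domain."""
        shares = []
        previous = 0.0
        for value in cdf_row:
            # Rounding only hides floating-point noise in the differences.
            shares.append(round(value - previous, 10))
            previous = value
        return shares


    def pick_target(cdf_row, u):
        """Return the index of the first target domain whose CDF value exceeds u."""
        for index, value in enumerate(cdf_row):
            if u < value:
                return index
        return len(cdf_row) - 1


    rows = parse_dist_matrix("0.5,0.8,1.0|0.25,1.0,1.0")
    first_shares = cdf_to_shares(rows[0])   # shares for the first user domain
    second_shares = cdf_to_shares(rows[1])  # shares for the second user domain

Here ``u`` stands for a uniform random value in ``[0, 1)``; a target
domain whose CDF value equals that of the domain before it can never
be selected.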

.. _node_param:

Popularity Distribution
^^^^^^^^^^^^^^^^^^^^^^^

Many aspects of configuration involve users building lists of values
the user will choose from.  This includes things like applications to
use, websites to visit, recipients to email, etc.  Our observations
from various data sources (see :ref:`justification`) are that many of
these lists have a long tail in terms of how many elements they
contain.

Some users will stick to a small number of choices for how to go about
a task, while others will have a wide variety of values.  In real life,
we can imagine one user who only visits a couple of websites and rarely
strays from them, versus someone who is continually visiting new sites.

To capture that variety we generate the size of assorted lists as follows:

.. math::

    randomValue \le 1 - k^{-NODE\_PARAM}

where ``k`` is the smallest number that makes the right-hand side rise
to or above ``randomValue``, a value drawn from a uniform random
distribution.  We then add a minimum number of choices, which is
currently 3.
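
The inequality can be solved for ``k`` directly, since
``u <= 1 - k**(-p)`` is equivalent to ``k >= (1 - u)**(-1 / p)``.  The
Python sketch below uses that closed form.  It is a minimal sketch
under two assumptions that the text does not confirm: that ``k`` is
taken as the smallest integer satisfying the inequality, and that the
minimum is added to ``k`` rather than used as a floor.  The names
``list_size`` and ``MIN_CHOICES`` are hypothetical.

.. code-block:: python

    import math
    import random

    MIN_CHOICES = 3  # minimum number of choices stated in the text


    def list_size(node_param, u=None):
        """Return an illustrative list size for one user.

        k is the smallest integer with u <= 1 - k**(-node_param)
        (an assumption), and the minimum is then added to it.
        """
        if u is None:
            u = random.random()  # uniform value in [0, 1)
        k = max(1, math.ceil((1.0 - u) ** (-1.0 / node_param)))
        return k + MIN_CHOICES

Larger values of ``node_param`` shorten the tail: for the same ``u``,
the resulting ``k`` is smaller.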

Other modules select list sizes fitting a
`log-normal distribution <https://en.wikipedia.org/wiki/Log-normal_distribution>`_
such that the number is generated as:

.. math::

    Z &= random\_normal()

    N &= e^{{\mu} + {\sigma} * Z}

where ``mu`` and ``sigma`` can be specified for each activity type via
``<type>_COUNT_MU`` and ``<type>_COUNT_SIGMA`` respectively.

After building a list of the proper size, each element gets its own
per-user probability.  An element can show up in multiple users' lists
and have a different likelihood of being selected from the list for
each of them.  The probabilities for each user's list containing ``N``
elements are assigned following a Zipf_ distribution, normalized by
the reciprocal of the generalized harmonic number, calculated as

.. math::

    H_{k, N} = \frac{1}{\sum_{k=1}^N{k^{-zipf}}}

where the ``zipf`` number can be specified for each activity type via a
``<type>_ZIPF`` parameter in the main config file.  The ``zipf``
parameter is a shape parameter as defined in the
`Riemann zeta function <https://en.wikipedia.org/wiki/Riemann_zeta_function>`_.

Each element then gets assigned its probability based on its list index
``i`` as

.. math::

    CDF_i = \sum_{n=1}^i{H_{k, N} * n^{-zipf}}
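
The following Python sketch computes such a CDF for a list of a given
size.  It assumes the weights are normalized so that the last CDF
value is 1; the function name ``zipf_cdf`` is hypothetical and this is
not the tool's own code.

.. code-block:: python

    def zipf_cdf(n_items, zipf):
        """Return cumulative selection probabilities for a list of n_items."""
        weights = [rank ** (-zipf) for rank in range(1, n_items + 1)]
        norm = 1.0 / sum(weights)  # normalization so the CDF ends at 1
        cdf = []
        running = 0.0
        for weight in weights:
            running += norm * weight
            cdf.append(running)
        return cdf


    cdf = zipf_cdf(3, 1.0)

Earlier entries in the list receive the larger share of the
probability, and a larger ``zipf`` value concentrates more of it on
the first few entries.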

That determines how many items will be added to the user's preference
list.  We also want some items to be selected as the most popular
across all users.  To take an example from the real world: when
browsing the web, different people have different sets of sites they
typically visit.  Still, some sites show up in nearly every list; a
search engine such as Google is probably highly placed on everyone's
list of sites to visit, while the rest of the list varies from person
to person.

When the next element is chosen to add to the user's list, first we
select a domain using the ``<type>DIST`` matrix.  Within that domain a
specific element is chosen.  The probability of each candidate in that
domain being chosen is generated according to the following formula
where ``i`` is the index of the entry in the list:

.. math::

    val_i &= 0.5 * \operatorname{erfc}\left(\frac{i - 1}{\sigma}\right) + i^{-zipf}

    H &= \sum_{n=1}^{\text{list size}} val_n

    PDF_i &= \frac{val_i}{H}

``zipf`` and ``sigma`` can be specified by adding ``<type>_ZIPF`` and
``<type>_SIGMA`` to the config file respectively.  Note that this
``sigma`` is different from the ``<type>_COUNT_SIGMA`` from above.
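
As a sketch of how these per-candidate probabilities could be
computed, the Python below evaluates the complementary error function
term plus the Zipf term for each index and normalizes by their sum.
The function name ``popularity_pdf`` is hypothetical; the arguments
``zipf`` and ``sigma`` stand for the ``<type>_ZIPF`` and
``<type>_SIGMA`` values.

.. code-block:: python

    import math


    def popularity_pdf(list_size, zipf, sigma):
        """Return the selection probability of each candidate, indexed from 1."""
        vals = [
            0.5 * math.erfc((i - 1) / sigma) + i ** (-zipf)
            for i in range(1, list_size + 1)
        ]
        total = sum(vals)  # H in the formula: the sum of all val entries
        return [v / total for v in vals]


    pdf = popularity_pdf(5, 1.0, 2.0)

The ``erfc`` term gives the first few entries an extra boost that
fades over a range set by ``sigma``, while the Zipf term supplies the
long tail.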

Combining these then, we have the likelihood of any element appearing
in a user's list governed by one distribution, and then how popular
that element is within a user's list governed by another.  Each user
has their own favorite entries, but across the population some entries
are going to be more popular than others.

.. _Zipf: https://en.wikipedia.org/wiki/Zipf%27s_law