.. _userdb_config:

UserDBConfig
============

This is a tool to generate the configurations for all the users across both
:ref:`cu` and :ref:`mptgs` so that they preserve the desired statistical
properties across the different application/traffic types.

Installation
------------

UserDBConfig is a collection of Perl scripts and modules. It requires some
additional modules that are not part of a default setup. The tar that
includes UserDBConfig also includes the extra modules it needs, so they can be
installed without an Internet connection; building those modules, however,
requires a general development environment. The module build/install process
requires ``gcc`` and the ability to process Perl Makefiles (e.g.,
perl-ExtUtils-MakeMaker).

Once those system requirements are met, change into the ``libs`` directory.
From there run ``./install_libs.sh``, which processes the modules in the order
given in ``install_order``. Once all the libraries are installed, UserDBConfig
is ready to generate its files.

The tools that fetch ConsoleUser information require ConsoleUser to be
installed. They use a couple of the tools that ConsoleUser provides, and they
also look for the file ``/usr/local/tg/db.info``, which can store DB
connection information so that it need not be repeated on the command line for
each fetch.

Completing the process requires a MySQL/MariaDB server to be available for
:ref:`cu` configs. If the tool is generating :ref:`mptgs` configs as well, a
PostgreSQL database server is also required.

The instructions below assume that the MariaDB server is installed locally and
that the user running the commands does not require a password. If either of
those assumptions is not true for your environment, please modify the commands
accordingly. Fetching ConsoleUser configs expects that ConsoleUser has been
installed in ``/usr/local/tg/cu``.

Process
-------

To generate the final configurations, one must first set up the tool's
configuration files.
After that, ``GenerateTables.pl`` will produce intermediate files that can be
used to populate a MySQL/MariaDB database. This database can be populated from
the output file ``populate_db.sql`` with a command like
``mysql < populate_db.sql``.

Those intermediate files can also populate a separate PostgreSQL database.
This database is used directly by the :ref:`mptgs` during its operation. In
contrast, the ``mysql`` database is used to coordinate distribution of
:ref:`cu` configurations to the appropriate controllers and directories.

The two steps to generate config files and populate the database for
ConsoleUsers are therefore

.. code-block:: bash

   $ ./GenerateTables.pl main.config
   $ mysql < populate_db.sql

There are additional steps to populate the database that MPTGS will use. That
process has been put into the script ``doAllMPTGS.sh``, which prepares a
PostgreSQL database, runs ``GenerateTables.pl``, massages the data for MPTGS
needs, and then populates the DB. After this process the database is ready to
be used by an instance of MPTGS. ``doAllMPTGS.sh`` populates the database
named in the environment variable ``PGDATABASE`` if it is set; otherwise the
name is hard-coded.

To populate both databases, the following process is recommended:

.. code-block:: bash

   $ ./doAllMPTGS.sh
   $ mysql < populate_db.sql

Main Config
-----------

This is the file that is passed to ``GenerateTables.pl``; it controls all of
the parameters and relationships that ``GenerateTables.pl`` is to produce. It
can be in :ref:`YAML ` or :ref:`Legacy Text `. The treatment of many of the
keys is the same in both formats, so the structure of those common elements is
discussed below.

Domains
^^^^^^^

Many aspects of the user configuration deal with logical groupings of
elements, called domains. A domain of users need not be located in the same
DNS domain, Windows domain, or IP space; it is just a convenient grouping for
this configuration's needs.
It allows the relationship between groups to be specified, but those groups
are for the convenience of the specifier and need have no meaning to anyone
else.

.. _dist_matrix:

Distribution Matrices
^^^^^^^^^^^^^^^^^^^^^

To express the relationships between different domains, a matrix line is
specified. These lines give the likelihood of pairing users from one domain
with a target from each relevant domain. For personal communication such as
email, the domains are both user domains (``UDOMAIN*N*`` above). When
selecting a website for a user to add to its bookmark list, the user from one
``UDOMAIN`` is matched against the proportions of the different
``WDOMAIN*N*``.

The format of these lines is a pipe-delimited list of comma-delimited CDF
values. That is, the line is first tokenized on ``|``, and each of those
tokens is matched to the relevant source domain in the order the domains were
specified. Those tokens are themselves tokenized on ``,``, where the values
are CDF values mapping to the target domains in the order they were specified.
For example, a config file containing

.. code-block:: ini

   NDOMAINS 2
   UDOMAIN1 firstUsers
   UDOMAIN2 secondUsers
   NWDOMAINS 3
   WDOMAIN1 firstWeb
   WDOMAIN2 secondWeb
   WDOMAIN3 thirdWeb
   WDIST 0.5,0.8,1.0|0.25,1.0,1.0

would result in ``UDOMAIN1`` users having their bookmarks go 50% to sites in
``firstWeb``, 30% to sites in ``secondWeb``, and 20% to sites in
``thirdWeb``. ``UDOMAIN2`` users would have their bookmarks go 25% to sites in
``firstWeb``, 75% to sites in ``secondWeb``, and not at all to sites in
``thirdWeb``. (One side note: because our traffic generally follows links in
pages, this does not guarantee that the traffic will ultimately represent
these percentages, only that initial selections will be made from pools with
these percentages.)

.. _node_param:

Popularity Distribution
^^^^^^^^^^^^^^^^^^^^^^^

Many aspects of configuration involve users building lists of values the user
will choose from.
This includes things like applications to use, websites to visit, recipients
to email, etc. Our observations from various data sources (see
:ref:`justification`) are that many of these lists have a long tail in the
number of elements they contain. Some users stick to a small number of choices
for how to go about a task, while others use a wide variety of values. In real
life, we can imagine one user who only visits a couple of websites and rarely
strays from them, versus someone who is continually visiting new sites. To
capture that variety we generate the size of assorted lists as follows:

.. math::

   randomValue <= 1 - k^{-NODE\_PARAM}

where ``k`` is the number that makes the right-hand side rise above a value
drawn from a uniform random distribution. We then add a minimum number of
choices, which is currently 3.

Other modules select list sizes fitting a `lognormal distribution `_ such that
the number is generated as:

.. math::

   Z &= random\_normal() \\
   N &= e^{\mu + \sigma * Z}

where ``mu`` and ``sigma`` can be specified for each activity type via
``_COUNT_MU`` and ``_COUNT_SIGMA`` respectively.

After building a list of the proper size, each element gets its own per-user
probability. An element can show up in multiple users' lists and have a
different likelihood of being selected from the list for each of them. The
probabilities for each user's list containing ``N`` elements are assigned
following a Zipf_ distribution, with the normalizing harmonic term calculated
as

.. math::

   H_{k, N} = \frac{1}{\sum_{k=1}^N{k^{-zipf}}}

where the ``zipf`` number can be specified for each activity type via a
``_ZIPF`` parameter in the main config file. The ``zipf`` parameter is a shape
parameter as defined in the `Riemann zeta function `_. Each element then gets
assigned its cumulative probability based on its list index ``i`` as

.. math::

   CDF_i = \sum_{n=1}^i{H_{k, N} * n^{-zipf}}

That determines how many items will be added to the user's preference list.
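
The list-size rule and the per-user Zipf CDF described above can be sketched in a few lines of Python. This is only an illustration, not the tool's Perl implementation: the function names are hypothetical, ``k`` is assumed to be the smallest integer satisfying the inequality, the minimum of 3 is assumed to be added to ``k``, and the normalizing term is assumed to sum ``n^{-zipf}`` so that the CDF ends at 1.

```python
import bisect
import math

MIN_CHOICES = 3  # minimum number of choices quoted in the text


def power_law_list_size(node_param, u):
    """List size for a uniform draw u in [0, 1).

    Assumption: k is the smallest integer with u <= 1 - k**(-node_param),
    and the minimum is added on top of k.
    """
    k = math.ceil((1.0 - u) ** (-1.0 / node_param))
    return k + MIN_CHOICES


def zipf_cdf(n, zipf):
    """Cumulative probabilities for list indices 1..n, weight i**(-zipf)."""
    weights = [i ** (-zipf) for i in range(1, n + 1)]
    h = 1.0 / sum(weights)  # normalizing term H
    cdf, running = [], 0.0
    for w in weights:
        running += h * w
        cdf.append(running)
    cdf[-1] = 1.0  # guard against floating-point drift in the last entry
    return cdf


def pick_index(cdf, u):
    """Return the 1-based list index selected by a uniform draw u."""
    return bisect.bisect_left(cdf, u) + 1


cdf = zipf_cdf(3, 1.0)  # three elements, zipf exponent 1
```

With ``zipf = 1`` and three elements, the first element carries 6/11 of the probability mass, so most uniform draws select the head of the list while the tail is still reachable.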
We also want there to be some items that are selected as the most popular
across all users. To choose an example from the real world, when browsing the
web different people have different sets of sites they typically visit. Still,
across all people some sites show up in nearly every list: Google (or similar)
is probably highly placed on everyone's list of sites to visit, while the rest
of the list varies from person to person.

When the next element is chosen to add to the user's list, we first select a
domain using the ``DIST`` matrix. Within that domain a specific element is
chosen. The probability of each candidate in that domain being chosen is
generated according to the following formula, where ``i`` is the index of the
entry in the list:

.. math::

   val_i &= 0.5 * \operatorname{erfc}\left(\frac{i - 1}{\sigma}\right) + i^{-zipf} \\
   H &= \sum_{n=1}^{\text{list size}} val_n \\
   PDF_i &= \frac{val_i}{H}

``zipf`` and ``sigma`` can be specified by adding ``_ZIPF`` and ``_SIGMA`` to
the config file respectively. Note that this ``sigma`` is different from the
``_COUNT_SIGMA`` above.

Combining these, the likelihood of any element appearing in a user's list is
governed by one distribution, and how popular that element is within a user's
list is governed by another. Each user has their own favorite entries, but
across the population some entries are going to be more popular than others.

.. _Zipf: https://en.wikipedia.org/wiki/Zipf%27s_law
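
As a rough illustration of the candidate-popularity formula in this section, the following Python sketch computes the PDF over the entries of one domain. The function name and the parameter values are hypothetical and chosen only for the example; the tool itself is written in Perl and may differ in detail.

```python
import math


def popularity_pdf(list_size, zipf, sigma):
    """Probability of choosing each candidate at index 1..list_size.

    val_i = 0.5 * erfc((i - 1) / sigma) + i**(-zipf), divided by the
    sum of all val_i so that the probabilities add up to 1.
    """
    vals = [
        0.5 * math.erfc((i - 1) / sigma) + i ** (-zipf)
        for i in range(1, list_size + 1)
    ]
    total = sum(vals)  # H in the formula
    return [v / total for v in vals]


# Illustrative values only: five candidates, zipf exponent 1, sigma 2.
pdf = popularity_pdf(list_size=5, zipf=1.0, sigma=2.0)
```

The ``erfc`` term boosts the first few entries of the domain on top of the Zipf tail, which is what makes a handful of entries popular across all users; a larger ``sigma`` spreads that boost over more entries.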