Discussions


Trying out Distributed MapD/OmniSci

  • 1.  Trying out Distributed MapD/OmniSci

    Posted 01-10-2019 20:28

    So we recently started looking into running OmniSci/MapD with larger data sets than before. We have had some performance issues in the past, and simply moving to larger hardware is not always a practical solution for us, either because it isn't an option or because it doesn't make business sense. With the cluster option now available, we thought we'd explore it, and I was asked to set up an instance for our tests.

    I contacted the support team, as we are an enterprise customer, and received a quick overview on the phone with a rep, plus a PDF guide on setting up the cluster. However, the guide showed how to set up multiple instances of MapD on a single host which, well, doesn't suit our needs. We want to run it on multiple host machines. I thought I would write this up in case someone else is trying to do something similar. I can't guarantee this post will make total sense unless you've seen the guide, but I will do my best to keep references to it down. I'm both ESL and not a great writer/editor.

    Setting Up the Infrastructure

    The way I'm tackling this is to use Terraform to set up the infrastructure. Since the tests will span multiple days and may involve combinations of different hardware, it's just easier to be able to issue a single command to make those changes. It will also be done on Google Cloud Platform. Neither of these choices should matter much, since I don't think I will be posting the Terraform files directly at this point anyway. I'd like to focus on what gets set up, and use Terraform code simply to demonstrate, but you do you.

    To start, I created a google_compute_instance_template as the baseline for all the machines that will run MapD components. I then set up my instances as two separate resources: a google_compute_instance_from_template for a standalone VM acting as the master, and a google_compute_instance_group_manager for a managed group of leaf nodes. Rather than setting up a bunch of different VMs that need to be managed individually, this allows me to manage all of the leaf nodes together as a group, no matter how many there are.
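
    Roughly, the skeleton looks something like this (the resource names, machine type, image, zone, and group size are placeholders rather than our exact values, and depending on your provider version the group manager may want a version block instead of instance_template):

    resource "google_compute_instance_template" "explorer" {
      name_prefix  = "${local.namespace}-"
      machine_type = "n1-highmem-16"
      tags         = ["${local.namespace}"]

      disk {
        source_image = "ubuntu-os-cloud/ubuntu-1804-lts"
        boot         = true
      }

      network_interface {
        subnetwork = "${google_compute_subnetwork.explorer.self_link}"
      }
    }

    # The leaf nodes live in a managed group so they can be resized as one unit.
    resource "google_compute_instance_group_manager" "leaves" {
      name               = "${local.namespace}-leaves"
      base_instance_name = "${local.namespace}-leaf"
      zone               = "us-central1-a"
      target_size        = 2
      instance_template  = "${google_compute_instance_template.explorer.self_link}"
    }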

    I don't put the master node inside the managed group because of the way the managed group adds or removes machines: you can't guarantee your master won't be affected.

    The master node, where the aggregator and the dictionary server will be installed, requires additional disks for our internal tooling. Luckily, google_compute_instance_from_template allows overriding the template through the attached_disk block, so we must re-list all of the disks that will be attached here. I also use disks of the PERSISTENT type, which means that when the instances are brought down, the disks are kept alive. They won't be reattached by the auto-scaler, but at least the data is safe.
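
    A minimal sketch of the master with an extra disk attached (the disk name, size, and zone are hypothetical):

    # A google_compute_disk is a persistent disk, so it survives the instance
    # being torn down.
    resource "google_compute_disk" "tooling" {
      name = "${local.namespace}-tooling"
      type = "pd-ssd"
      size = 500
      zone = "us-central1-a"
    }

    resource "google_compute_instance_from_template" "master" {
      name                     = "${local.namespace}-master"
      zone                     = "us-central1-a"
      source_instance_template = "${google_compute_instance_template.explorer.self_link}"

      # Declaring attached_disk overrides the template's disk list, so every
      # disk the master needs must be listed here again.
      attached_disk {
        source = "${google_compute_disk.tooling.self_link}"
        mode   = "READ_WRITE"
      }
    }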

    You can run an instance of MapD on your aggregator node, and I was assured that the aggregator and the string dictionary server are very lightweight. The leaf nodes, however, should go into your managed group. Also, while I haven't gotten to this stage yet, you should be able to do the entire MapD setup using just the init scripts. The configuration for all of the leaf nodes is the same anyway in terms of the path to the data drive, the address of the master node, and the port to connect to. This can all be done through startup scripts included with the Terraform config.
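
    As a rough sketch of that idea (leaf-init.sh is a hypothetical script name, and the actual config keys it writes come from the guide):

    resource "google_compute_instance_template" "explorer" {
      # ...same fields as the earlier sketch, plus a shared bootstrap script.
      # Every leaf runs the same steps at boot: prepare the storage path and
      # render its config pointing at the master's internal address.
      metadata_startup_script = "${file("${path.module}/leaf-init.sh")}"
    }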

    To ensure these MapD instances can access each other openly, I created a google_compute_firewall rule that opens up port 9091 to all resources with the tags I use to annotate my MapD resources.

    allow {
      protocol = "tcp"
      ports    = ["9091"]
    }

    target_tags = ["${local.namespace}"]
    source_tags = ["${google_compute_instance_template.explorer.tags}"]
    

    It's a little broad, but adequate for our testing.

    Of course, there are other resources you will need to set up as well, like the VPC and the subnet for the instances. However, those should follow a pretty standard setup for any application.
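
    For completeness, a minimal sketch of the network side that the instance template above points at (the region and CIDR range are placeholders):

    resource "google_compute_network" "explorer" {
      name                    = "${local.namespace}"
      auto_create_subnetworks = false
    }

    resource "google_compute_subnetwork" "explorer" {
      name          = "${local.namespace}"
      region        = "us-central1"
      network       = "${google_compute_network.explorer.self_link}"
      ip_cidr_range = "10.10.0.0/16"
    }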

    Installing MapD

    Installing MapD on each machine follows the normal method; only the configuration is different. In our case, we did a regular tarball deployment instead of the Docker version we usually use. This lets us stay closer to the guide and start the processes with systemctl.

    One important note that was left out of the guide: the install_mapd_systemd.sh script needs to be run from the directory it lives in, because it uses paths relative to your current working directory rather than the script's own location to find the other files it sources. Essentially, the systemd services will not be created unless you execute the script from its own folder.
    Also, something that wasn't very clear to me was the path variables. I wasn't sure whether it wanted the folder that contains the root, or the root folder itself. The guide refers to the storage folder as the place where we create the data folder, which left the impression that it wanted the directory in which the root folder would sit. After trying both out, I found I needed to pass in the path to the root folder itself.

    MAPD_PATH: /home/explorer/explorer/mapd
    MAPD_STORAGE: /home/explorer/explorer/mapd/storage/data
    

    We also needed to modify the service files to point to the right paths for the different nodes. Since we mount an external drive at /home/explorer/explorer/mapd/storage/data, the aggr and leaf1 directories end up inside the data folder. This breaks from the convention in the example, where the data folder for each node sits inside a folder named after the node, like aggr/data or leaf1/data. This could probably be avoided by reorganizing how we set up the drive, but I'm sticking with this for now.

    The cluster.conf file also needs hostnames that are correct from the perspective of each leaf instance, so they can't all be localhost. Because the firewall rule is tag-based, we have to use the internal IP addresses when declaring the hostnames for the master and the nodes in the cluster.conf file.

    Once you have the configuration files in place, you can start the master node by running the following commands.

    sudo systemctl start mapd_sd_server
    sudo systemctl start mapd_server@aggr
    sudo systemctl start mapd_web_server@aggr
    

    On the leaf nodes, you will want to start the leaf variant of the service instead.

    sudo systemctl start mapd_server@leaf
    

    Hopefully I didn't leave anything out, but I was able to connect to the front end and run queries OK. If you have any questions, or anything to add, I'd love to hear it.



  • 2.  RE: Trying out Distributed MapD/OmniSci

    Posted 01-10-2019 20:58

    Thanks @mayppong! These writeups are full of great information, and we appreciate you posting them. We liked "How we migrate to role-based permission" as well - very informative.