The case for GlueOps

I am a Unix/Linux infrastructure engineer, a university-educated sysadmin of the old lore. In my 25+ years of sysadmining / devopsing I have never seen a case where the craft was isolated from business decisions, service considerations, efficiency of delivery or developer enablement. The job included all of the above, quite a bit of coding and then some. We, the old-school sysadmins, would take all the ingredients and GLUE them together into a cohesive, coherent service, all the while working with open source tools because the tooling budget was almost always nil. This process of creation is what I call GlueOps, a holistic view of the modern IT landscape.

Open Source

Never was a decision made on its technological merit alone; it has always been: "do it for free or as close to free as you can get". This is not a limitation, this is the springboard of innovation. The greatest boon to humanity in the last (20th) century has been open source software. Millions of spectacular minds spent significant amounts of time collaborating to produce incredible software. The GNU, Apache and Linux foundations, to name a few, and a host of universities published code that created the Internet boom as we now know it. Some projects were duds (Gopher), some remained in the stone age but are still useful (FTP) and some bloomed into thousands of viable systems (Linux).

In GlueOps the end target is the complete service to be offered to the client, be it one internal to the organization or, traditionally, an external one. The first step, even before deciding on the infrastructure to be used (cloud, physical, hybrid), is the selection of software components. For example, if there is a need for a centralized authentication system, will one choose LDAP or something else? One weighs the merits of each choice and goes shopping, so to speak, for the best-fitting open source package. A target of 90% fit is a very viable one. If one can find a piece of software that covers 90% of the needs, one can always add another 5% via custom coding and cover just about every base. 100% is unattainable, as the time-effort-to-target curve is asymptotic at 100%; it will reach it only at the heat death of the universe.

GlueOps places particular emphasis on the resiliency of services, so the tooling deployed must always be able to operate in a redundant, highly available and, if possible, load-balanced modus operandi. Having systems lying about waiting to be used in case of an emergency is considered inelegant, unless one is building a disaster recovery site. So tools like HAProxy and keepalived are essential building blocks of GlueOps.
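To make the idea concrete, here is a minimal sketch of the load-balancing half of that pattern: an HAProxy fragment (global and defaults sections omitted) that listens on a virtual IP owned by keepalived and spreads traffic over two application servers. All names, addresses and the /healthz check are hypothetical placeholders.

# haproxy.cfg fragment -- the VIP itself is moved between nodes by keepalived
frontend web_front
    mode http
    bind 192.0.2.10:80                # virtual IP managed by keepalived
    default_backend web_back

backend web_back
    mode http
    balance roundrobin
    option httpchk GET /healthz       # assumes the app exposes a health endpoint
    server app1 10.0.0.11:8080 check
    server app2 10.0.0.12:8080 check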

The next level of GlueOps consists of the more traditional aspects of IT in general: databases, VPNs, web servers and accelerators, and so on and so forth. The methodology of choice and application stays the same.

The religious aspect of GlueOps

It always surprises me that, after so much code has been written, there are still little nooks and crannies that have not been touched. To fill in these nooks an admin will invariably need to author one or more tools, and that leads us to religion: computer languages are many, and fights are all too easy to start among the proponents of each paradigm. These fights have been fought with a ferocity that would have made ISIS fighters green with envy.

GlueOps emphasizes versatility. Admins should be versatile enough to use the best tool for the job, keeping in mind the first tenet: look for existing open code. As code can be written in any language, a master GlueOp should be able to at least read and understand most of them. What is an absolute necessity is to be competent at shell scripting (bash) and at least one more scripting language like Perl or Python. The final criterion for choosing a scripting language is the availability of libraries for common tasks. A case in point is Perl: the language has been struggling to attract new developers, although it still has some life in it. As activity around CPAN (the Comprehensive Perl Archive Network) waned, the language quickly lost ground to the new kid on the block: Python. So staying relevant and re-educating oneself is a cornerstone of GlueOps.

At the wizard level of GlueOps one has to be a master C programmer. So many of the scripting libraries are but a shell around C library functions. Understanding C is therefore understanding how the Linux kernel operates, and that is wizard-fu!

Automation

I will not spend much time on this. Automation is absolutely essential when one deals with a swarm of machines. It not only helps in day-to-day operations, it makes work repeatable and at least semi-documented. One can choose SaltStack or Ansible or both, but one absolutely must use something, if for nothing else than to have a quick look at how the systems were set up and are expected to operate.
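As an illustration only (the host group, packages and template are made up for this sketch), a tiny Ansible playbook that both enforces and documents how a pair of load balancers is set up could look like this:

# site.yml -- hypothetical sketch; run with: ansible-playbook -i inventory site.yml
- hosts: loadbalancers
  become: true
  tasks:
    - name: install haproxy and keepalived
      apt:
        name: [haproxy, keepalived]
        state: present

    - name: deploy the haproxy configuration
      template:
        src: templates/haproxy.cfg.j2
        dest: /etc/haproxy/haproxy.cfg
      notify: reload haproxy

  handlers:
    - name: reload haproxy
      service:
        name: haproxy
        state: reloaded

Even if you rarely run it, a playbook like this doubles as that quick look at how the systems were set up.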

Observability and measurements

The cornerstone of engineering is measurement. In fantasy lore, if one knows the true name of a dragon one can control it. The counterpart in engineering lore is: if it cannot be measured, it cannot be controlled. Therefore every project should have its own observability subproject. Thankfully there are great tools out there now, practically for free: Prometheus, InfluxDB or the ELK stack for time series databases, and tools like Grafana for visualization and alerting. Without these packages to measure the services one is blind, and that is a bad thing in the business world.

One will have the usual graphing and alerting on usage, timeouts and so on, but measuring will also give further insights into systems and applications. Suppose, for example, that you see a flat system load line fixed at 1 on a two-core system. This almost certainly means that a process is wedged and needs to be restarted, even though the service as a whole looks fine. Even an unexplained but regular I/O spike should be investigated; in the engineering world there is no room for twilight.
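A quick, hand-rolled way to hunt down the culprit behind such a flat load line (a sketch, independent of the tools above) is to look for processes pegging a core or stuck in uninterruptible sleep:

#!/bin/bash
# current load average next to the top CPU consumers
uptime
ps -eo pid,state,pcpu,comm --sort=-pcpu | head -5

# processes in uninterruptible sleep (state D) also drive the load up
# without using CPU -- usually a sign of stuck I/O
ps -eo pid,state,comm | awk '$2 ~ /^D/'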

The code

Of course, without the application proper there is no service. GlueOps has an important role to play there too: guidance. Developers often forget that we live in a distributed world and must always keep in mind resiliency, load balancing and so on, as stated above. A GlueOp who is worth his salary should be able to sit down with these cats and explain networking and systems in such a manner that they understand the needs of the service as a whole. The conversation should preferably finish without murdering any hotheaded fools. I am not being cute here; some fights have been bloody indeed. A case in point is explaining how a global lock in a network filesystem degrades performance across the cluster, but that is another story, to be recited at some other time while quaffing bottles of bourbon.

So what do GlueOps do?

Considering all the above aspects of the work, one can easily see that every single one of them is necessary. Take any one of them lightly and the service as a whole becomes brittle and unstable. The admin has to glue every piece together in such a fashion that the total comes alive. GlueOps do this: they glue together code, infrastructure, management and observability platforms to create business-nourishing services. They are go-to people, problem solvers, engineers, coaches. Companies should stop trying to limit their job description to "systems operator" or "create a continuous integration pipeline". That is far too limiting for someone who is called to keep money-making services running 24x7x365.

Remotely local workers

This is the year of our Lord 2020. A plague wafts over the land. People are scared and stay indoors, awaiting salvation.

That is how it would have sounded if we were in the Middle Ages, yet it is a pretty accurate description of the situation at hand. The latest victim of the disease is commercial real estate. Companies have begun to sell their lavish headquarters, and this triggered my grey cells.

First of all, let me clarify my position. I am not a fanatic of remote work, and I have said so in the past (https://managenot.wordpress.com/2011/02/01/rubber-band-work-force/), yet it is convenient and is currently helping us keep working during the pandemic. The real issue is that people like to have other people around. It is part of our genetic makeup: being alone is no fun! So eventually businesses will need enough office space again.

This is the crux of my idea: since we now have the ability to work remotely just as efficiently as we do on site, it would be a good idea to create co-working spaces with an emphasis on cross-company fertilization.

What I mean by that is that companies should not own or lease their own buildings to accommodate their entire workforce. Instead they should rent co-working spaces, closer to the employees' locations if necessary. This will create a mixing and association of people from completely disparate companies and backgrounds.

Picture this: instead of ruminating endlessly on how to optimize a computer process, go have a cup of coffee in the common lounge and listen to a salesperson tell a tall tale. You do not have to listen. It is not a meeting within your company. You are not bound by rules to attend it. You are free to stay or leave, free to engage or not. No matter what you decide, this input, foreign to your immediate needs, will trigger some associations in your brain. These mental associations are the fertilizer of innovation and ingenuity. You don't believe me? Go have a dozen meetings with the same people on the same problem and let me know how zombified you feel afterwards.

Will companies, which normally value secrecy far too much, adapt to such disruption? This is a massive change in paradigm and habit! Let me remind you of some old wisdom on the subject. Frank Herbert wrote throughout his work of the "Harq al'Ada": the breaking of the habit. It is the same meme Marvin Minsky invokes when he says: "Try to surprise yourself by the way you think today". Both these tenets have been very much at play in the recent startup scene, and they have served the newfound wealth extremely well. It is time to break everyone's habits. It is time to break the habit of worker over-concentration which, ironically, leads to isolation. Fewer zombie meetings, more zest!

Cloud wisps

This is a recipe for people who run their own VMs for any reason whatsoever.

If you are running a lot of Linux virtual machines, try to locate a kernel compiled in the "cloud" fashion, like Debian's linux-image-cloud-amd64. This flavour of the Linux kernel has almost all hardware-related code removed, along with a lot of other stuff. The performance improvement one gets is threefold:

a) Far fewer context switches in your guests, which translates to snappier system responses, i.e. less latency.

b) Less CPU utilization, which means you can service more guests if you need to.

c) The extra CPU headroom will allow you to make your KSM (kernel same-page merging) tuning more aggressive, and therefore free up more RAM overall.
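Getting the cloud kernel in place on a Debian guest is, for reference, a one-off exercise (a sketch; the package name is Debian's, adjust for your distribution):

# install the cloud-flavoured kernel and boot into it
apt-get update
apt-get install -y linux-image-cloud-amd64
reboot
# once the guest is happy, the generic kernel can be purged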

Add to the above a knob to tune Linux’s memory ballooning device and you can oversell your RAM even more.
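On the KVM host the KSM knobs live under /sys/kernel/mm/ksm, and the guests' balloons can be driven from there as well. A sketch of a more aggressive setting follows; the values are purely illustrative, tune them against the CPU headroom you actually gained:

#!/bin/bash
# run on the KVM host: make kernel same-page merging scan harder
echo 1    > /sys/kernel/mm/ksm/run              # enable KSM
echo 2000 > /sys/kernel/mm/ksm/pages_to_scan    # pages scanned per wake-up (illustrative)
echo 20   > /sys/kernel/mm/ksm/sleep_millisecs  # wake up more often (illustrative)

# how much is actually being merged right now?
grep . /sys/kernel/mm/ksm/pages_sharing /sys/kernel/mm/ksm/pages_shared

# a guest's balloon can be squeezed live, e.g. shrink it to 2 GiB (value in KiB)
# virsh setmem <guest> 2097152 --live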

And this is how you can become a tiny AWS 😉

Un-learning

Learning is a difficult process! It is so difficult, in fact, that we have institutionalized it in protected places called universities. It is so difficult that you get awarded a degree when the institution of learning decrees that you have achieved certain goals. It is so difficult that, when it is time to unlearn, almost all of us have a negative visceral reaction to it.

Yet when the time is right a good learner has to re-learn; one has to remember the joy and wonder of the first time one started to learn!

This is particularly applicable in the world of computer science / programming / devops, where new tools crop up continuously. A decent engineer has to continually educate oneself as the technologies in use advance. So the trick to keeping a young mind and learning continuously is to accept that sometimes one has to un-learn, to cast aside old assumptions. THAT is difficult for all the reasons above.

But there is a simple, naive if you like, method: be young at heart and accept that the world still has wonders to be discovered; all one has to do is un-learn.

After all, un-learning is no different from learning. It is the rediscovery of one's original passion to partake of the wonders of the world! It is a change of mental clothes: the person underneath is still the same, and even better dressed after the change into fresh clothes 😉

Elegantly monitor Gluster health

A Gluster cluster is a nice thing to have as long as it does not lose coherence. Oftentimes volumes need to heal and bricks lose their connection. Thankfully, monitoring all such niceties is possible with collectd plus a simple and elegant bash script.

First you need a collectd.conf.d gluster config file


LoadPlugin exec
TypesDB     "/usr/share/collectd/types.db.gluster"
TypesDB     "/usr/share/collectd/types.db"

<Plugin exec>
  Exec "nobody:nogroup" "/usr/local/bin/glusterstats.bash"
</Plugin>
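
The script below reports a custom brick_heal_entries type, which is why the extra types.db.gluster file is listed above. That file is not shipped with collectd, so here is a minimal sketch of what it could contain, assuming a simple gauge is enough (the counter type used for disconnections already lives in the stock types.db):

brick_heal_entries      value:GAUGE:0:U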


And now you need the script itself


#!/bin/bash
#
# Report gluster heal entries and brick disconnections to collectd.
# Runs under the collectd exec plugin; the "nobody" user needs sudo
# rights for the gluster commands used below.

HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname)}"
INTERVAL="${COLLECTD_INTERVAL:-10}"
# IP of this node on the storage network (adjust the interface name)
myip=$(/sbin/ifconfig ens19 | grep inet | awk '{print $2}')

while true
do
  volumes=$(sudo /usr/sbin/gluster volume list)
  for vol in $volumes
  do
    sudo /usr/sbin/gluster volume heal $vol info > /tmp/$vol.info
    brickdata=""
    brick=""
    disc=0
    while read line
    do
      entries=""
      # a "Brick host:/path" line starts a new block; only track local bricks
      if [[ $brickdata == "" ]]
      then
        brickdata=$(echo $line | grep Brick | grep $myip | awk '{print $2}')
        brick=$(echo $brickdata | cut -f2 -d: | sed -e 's/^\///g' | sed -e 's/\//_/g')
        peer=$(echo $brickdata | cut -f1 -d:)
      fi
      # the "Status:" line follows the Brick line; count disconnections of local bricks
      if [[ $brick != "" ]] && echo $line | grep -q "Status: Transport endpoint is not connected"
      then
        disc=$((disc+1))
      fi
      # "Number of entries: N" closes the block for this brick
      entries=$(echo $line | grep entries | cut -f2 -d: | sed -e 's/^ [ ]*//g')
      [[ $entries == "-" ]] && entries=0   # disconnected bricks report "-"

      if [[ $entries != "" && $brick != "" ]]
      then
        # echo $peer $brick $entries
        echo "PUTVAL \"$HOSTNAME/gluster/brick_heal_entries-${brick}\" interval=$INTERVAL N:$entries"
        echo "PUTVAL \"$HOSTNAME/gluster/counter-disconnections-${brick}\" interval=$INTERVAL N:$disc"
        brickdata=""
        brick=""
        disc=0
      fi
    done < /tmp/$vol.info
  done
  sleep $INTERVAL
done

Remember to use Grafana for alerting

Active-Active HA cluster on Hetzner servers

Hetzner is a dedicated server provider that gives extremely good value to its customers but has a few quirks. The particular quirk that made my life, and I guess the lives of many others, so difficult is its implementation of floating IPs.

Hetzner's implementation of floating IPs is not ARP-based, as an orthodox sysadmin would expect. I will hazard a guess that it is done this way so that data streams do not leak across customers.

But despair not, there is a straightforward way to do exactly what one would expect to do with Linux-HA. One can set up a pair of floating IPs between a pair of hosts and use keepalived to manage IP migration, thus effectively creating an active-active setup.

First you need two servers with a private interconnect between them and one floating IP per server. Configure the IPs on each server following the instructions on the Hetzner wiki. Make sure that you configure both floating IPs on each host so that on failover the interfaces will be ready. Also remember to set up your internal IPs and set aside two addresses of your own for the internal floating scheme that VRRP needs. In our example we will use 192.168.1.1 for master1 and 192.168.2.1 for master2; obviously the internal network is 192.168.0.0/16.

Now set up the Hetzner API scripts as shown in their failover documentation and you are almost done. The next step is to install keepalived and configure it so that only its vrrpd module is active. As an example, on a Debian server edit the file /etc/default/keepalived and insert the following line:

DAEMON_ARGS="-P"

Now the trick is to have two virtual routers configured in keepalived, each having one of your hosts as master and the other as backup. A sample configuration for master1 follows:



#
# This runs on master1
# 192.168.[12].1 are the internal floating IPs we manage as a necessity
# but the notification scripts do the actual work
# eth1 is the internal interconnect
#
global_defs {
      notification_email {
      devops@yourdomain
    }
    notification_email_from noreply@yourdomain
    smtp_server smtp.yourdomain
    smtp_connect_timeout 60
    script_user root
}


vrrp_instance virtualrouter1 {
    state MASTER
    interface eth1
    virtual_router_id 1
    priority 200
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234pass
    }
        virtual_ipaddress {
        192.168.1.1
    }
    preempt_delay 5
    notify_master "/usr/local/bin/hetzner.sh ip1 set master1"
    notify_backup "/usr/local/bin/hetzner.sh ip1 set master2"
}

vrrp_instance virtualrouter2 {
    state BACKUP
    interface eth1
    virtual_router_id 2
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass pass4321
    }
    virtual_ipaddress {
        192.168.2.1
    }
    preempt_delay 5
    notify_master "/usr/local/bin/hetzner.sh ip2 set master1"
    notify_backup "/usr/local/bin/hetzner.sh ip2 set master2"
}

Obviously the configuration on master2 is the inverse of the above.
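The heavy lifting is done by the notification script, which pushes the floating IP to the right server through Hetzner's failover API. What follows is only a rough sketch of the shape such a script can take; the addresses and credentials are placeholders, and the exact endpoint and parameters should be taken from Hetzner's own failover documentation rather than from here.

#!/bin/bash
# /usr/local/bin/hetzner.sh <ip1|ip2> set <master1|master2>  -- sketch only
# map the logical names used in keepalived.conf to real addresses (placeholders)
declare -A FLOATING=( [ip1]="203.0.113.10" [ip2]="203.0.113.11" )
declare -A MAINIP=(   [master1]="198.51.100.1" [master2]="198.51.100.2" )

ROBOT_USER="your-robot-user"        # Hetzner Robot webservice credentials
ROBOT_PASS="your-robot-password"

fip=${FLOATING[$1]}
target=${MAINIP[$3]}

# point the failover IP at the target server's main IP
# (check the Hetzner docs for the authoritative endpoint and parameters)
curl -s -u "$ROBOT_USER:$ROBOT_PASS" \
     -d "active_server_ip=$target" \
     "https://robot-ws.your-server.de/failover/$fip"

Whatever the script ends up doing, keep it idempotent: keepalived may call it more than once while the cluster is flapping.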

  • Beware: failover and failback times are a bit longer than what one would expect, but the setup adds a nine to your overall availability!

Bust your users’ chops by finding their passwords
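In a nutshell: the script below pulls the SSHA password hashes out of the LDAP directory, feeds them to hashcat together with a couple of public word lists, and prints whatever falls out in clear text. Anyone who shows up in cleartext_passwords gets their chops busted.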



#!/bin/bash

# Pull the SSHA password hashes out of LDAP and try to crack them with hashcat.
# rockyou is @ https://github.com/brannondorsey/naive-hashcat/releases/download/data/rockyou.txt
# wordlists are @ http://www.md5this.com/tools/wordlists.html

# hashcat workload profile; increase it if you feel like it
performance=1

# the bind DN must be allowed to read the userPassword attribute
LDAPSERVER=
LDAPBASE=
LDAPBINDDN=
LDAPPASS=

cd /root/infosec/hashcat

# extract the base64-encoded userPassword values and keep only the {SSHA} hashes
(ldapsearch -Z -x -h $LDAPSERVER -b ou=users,$LDAPBASE -D $LDAPBINDDN -w "$LDAPPASS" uid userPassword |\
  grep ^userPassword |\
  awk '{ print $2}' | while read line
  do
    echo $line | base64 --decode 2>/dev/null | grep SSHA
  done ) > /root/infosec/hashcat/passwords

# -m 111 is hashcat's mode for Netscape LDAP salted SHA-1 ({SSHA}) hashes
/root/infosec/hashcat/hashcat \
  --quiet \
  -O -w $performance \
  -D 2 \
  --gpu-temp-abort 80 \
  -m 111 \
  -r rules/generated2.rule \
  -o cleartext_passwords \
  passwords dicts/rockyou.txt dicts/www.md5this.com/Wordlist.txt dicts/www.md5this.com/wordlists/*.dic

if [ -f cleartext_passwords ]
then
  echo found some passwords
  cat cleartext_passwords
fi



Beautifully Track Certificate Expirations with Grafana, InfluxDB and Collectd

In my previous write-up I had a sample Python script to display the number of days a certificate is still valid. I have since moved forward and created a complete certificate tracking solution using collectd, InfluxDB and Grafana. I will not go through the complete setup here because there are just too many tutorials about this kind of thing. What follows is the recipe for just tracking certificate expirations, an arduous and perilous task for admins of all ages.

First you need a collectd config file that uses the exec plugin; normally it would go inside /etc/collectd/collectd.conf.d/certs_valid.conf

LoadPlugin exec

<Plugin exec>
   Exec nobody "/etc/collectd/collectd.conf.d/collectd_certs_valid.py" "one_domain:443"
   Exec nobody "/etc/collectd/collectd.conf.d/collectd_certs_valid.py" "another_domain:443"
</Plugin>

Now copy the following Python code into /etc/collectd/collectd.conf.d/collectd_certs_valid.py and make sure you can run it by hand. A good test is collectd_certs_valid.py www.google.com:443; it should start giving out lines like PUTVAL "www.google.com/x509-seconds/timeleft" interval=10 N:5948602 every ten seconds.




#!/usr/bin/python3 -u
#
# Calculate the expiration time for a cert and report it to collectd. angelos@multiwave.fr
#
import OpenSSL
import ssl
import sys
import datetime
import time
import os


if len(sys.argv) <= 1:
  print("Usage: expires host:port")
  exit(1)

# accept either "host:port" or a bare hostname (default port 443)
try:
  [hostname, port] = sys.argv[1].split(":")
except:
  hostname = sys.argv[1]
  port = 443

# fail early if the host is unreachable
try:
  conn = ssl.create_connection((hostname, port))
  conn.close()
except:
  print("ssl connection failed")
  exit(1)

# collectd exports its interval to exec plugins via the environment
try:
  interval = int(os.environ['COLLECTD_INTERVAL'])
except:
  interval = 10

while True:
  # re-fetch the certificate on every iteration
  conn = ssl.create_connection((hostname, port))
  context = ssl.SSLContext(ssl.PROTOCOL_SSLv23)
  sock = context.wrap_socket(conn, server_hostname=hostname)
  certificate = ssl.DER_cert_to_PEM_cert(sock.getpeercert(True))
  sock.close()
  x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, certificate)
  expires = time.strptime(x509.get_notAfter().decode('ascii'), '%Y%m%d%H%M%SZ')

  now = datetime.datetime.now().timetuple()
  diff = time.mktime(expires) - time.mktime(now)

  # emit a line in the collectd exec plugin protocol
  print("PUTVAL \"%s/x509-seconds/timeleft\" interval=%d N:%d" % (hostname, interval, diff))

  time.sleep(interval)



The above effectively allows collectd to insert proper expiration-time data into your InfluxDB. Note the -u parameter to Python, which forces unbuffered output.

Now all you have to do is fire up Grafana, create a new dashboard/panel combination and use the x509 metrics retrieved from the Python script.

Remember to set your Y axis to time data in seconds and the net result of the graph will be pretty nice. Adjust it to your heart's content.

The really, truly awesome part is Grafana's alerts. You can create an alert that fires when a certificate has less than a week left.

Never be surprised by an expired certificate ever again!

P.S. Many thanks to all the net folks who know the innards of OpenSSL and x509 for Python.

Certificate Expiration Tracking

Days a cert is still valid

A Python script to check the validity period of an SSL certificate



#!/usr/bin/python3
#
# Calculate the expiration days for a cert
#
import OpenSSL
import ssl
import sys
import datetime
import time

try:
  [hostname, port] = sys.argv[1].split(":")
except:
  hostname = sys.argv[1]
  port = 443

try:
  conn = ssl.create_connection((hostname, port))
except:
  print("ssl connection failed")
  exit(1)

context = ssl.SSLContext(ssl.PROTOCOL_TLSv1)
sock = context.wrap_socket(conn, server_hostname=hostname)
certificate = ssl.DER_cert_to_PEM_cert(sock.getpeercert(True))
x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, certificate)
expires = time.strptime(x509.get_notAfter().decode('ascii'), '%Y%m%d%H%M%SZ')

now = datetime.datetime.now().timetuple()
diff = ((time.mktime(expires) - time.mktime(now)) / 3600 / 24)

print("%s:%s days left:%d" % (hostname, port, diff))
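To check a certificate by hand, assuming the script was saved as expires.py, run for example:

./expires.py www.google.com:443

which prints a single "host:port days left:N" line.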

Use flake8 to recursively check a Python dir/project

For all those lazy yet finicky Python coders. Code to be read while listening to David Byrne's "Lazy".




#!/bin/bash

TOSCAN=$1
FLAKE8="flake8 --max-line-length=100"

# default to the current directory if no argument is given
if [ "X$TOSCAN" == "X" ]
then
        TOSCAN="."
fi

# per-run scratch directory, named after this shell's PID
TMPDIR=/tmp/$$
mkdir -p $TMPDIR

echo "0" > $TMPDIR/flakeerrors.txt
flakeerrors=0
find $TOSCAN -type f -name "*.py" |\
while IFS= read -r file
do
        echo ">>>>>>>> flake8 checking file:$file "
        res=0
        $FLAKE8 "$file" >&  $TMPDIR/flake8out.txt
        res=$?
        cat $TMPDIR/flake8out.txt
        if (( $res > 0 ))
        then
                echo "======== flake8 Return:$res"
                (( flakeerrors++ ))
        fi
        # subshells do not propagate vars, so persist the counter to a file
        echo $flakeerrors > $TMPDIR/flakeerrors.txt
done
# back from the while subshell
flakeerrors=$(cat $TMPDIR/flakeerrors.txt)
if (( $flakeerrors > 0 ))
then
        echo " **********************
                flake8 error
                count: ${flakeerrors}
                *********************" | /usr/games/cowsay -e \*\*
fi
rm -rf $TMPDIR
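
Assuming the script was saved as flakecheck.bash, pointing it at a project is simply:

./flakecheck.bash ~/src/myproject

It walks every *.py file below that directory and, if anything fails, cowsay delivers the bad news.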