A mind-dump on refreshments

After playing with various tools in #AWS, #GCP, and elsewhere, I realized the reason for the former’s success, at least insofar as my vision allows me to grok it.

Back in the day, the drivers of innovation were the Internet Service Providers, the legendary ISPs. They were constantly brainstorming to create services to harness the power of this new medium. They would try to create all sorts of goodies to entice customers with. They created email accounts with antispam, free personal web pages, local caches, etc. etc. Much of this is quite commonplace today, but back in ’92–’95 it had to be invented. Yet these days the ISPs have resigned themselves to being mere connectivity providers, and the mantle of the innovator has passed to the hyperscalers, pretty much.

So should we discuss who is the most innovative of the hyperscalers? Well, one can argue for decades over who is driving innovation, but let’s focus on the customer aspect for a sec. Which one has a plethora of readily available services? I have the distinct feeling that technology consumers (what we call customers) in general have a better time using #AWS. These people have a real avalanche of extra services to consume. There is most definitely a cost associated with these services, but that is another gripe altogether.

That is another reason why this modus operandi seems so familiar to old timers like me. There is so much creation and flowing creativity that it reminds me of the early days of the internet, and that is refreshing. It pushes my retirement mindset back into the far future.

k8s war story

As all things come to pass, so do kubernetes versions. Time has come for an upgrade, or as kops would like to call it, an update. Did you notice I mentioned kops? One of our clusters is built with kops, which in general behaves wonderfully and does the job in a veeery slow and methodical fashion. It takes forever to update a cluster, but this is a good thing because the services on it keep on operating while the update is taking place.
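For reference, a kops version bump boils down to roughly the following incantations. This is a hedged sketch only: the cluster name is a placeholder, and the exact sequence and flags should be checked against the kops docs for your version.

#!/bin/bash
# assuming KOPS_STATE_STORE and kubectl context are already set up
kops upgrade cluster --name my.cluster.example.com --yes          # bump the version in the cluster spec
kops update cluster --name my.cluster.example.com --yes           # apply the new configuration to the cloud resources
kops rolling-update cluster --name my.cluster.example.com --yes   # roll the nodes one at a time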

We need to update 3 (three) major versions. Being paranoid (on occasion) I decided to update one major revision at a time to avoid massive compatibility breakage. So the first update starts chugging along a node at a time, and after a not insignificant amount of time it tells me that all went well.

After a quick verification that everything is working as expected, he fires up the second updating round for the next major revision. Kops chugs along, updates a few nodes and after quite a while freezes. OK says the master and restarts it, goes for a coffee and the damn thing freezes again. OK says the master and tries to strace the process, only to find out that it is waiting on a futex. Useless information! OK says the master and goes tribal. So what does a tribesman do when kubernetes is acting all ornery like a spoiled child denied its fifth glass of chocolate milk? He checks the server logs!

Our engineer dives into the server logs because that was a node operation after all! Our senior keeps seeing an error pertaining to /etc/cni.d being empty, which prevents kops from populating the networking stack. After spending some (as in a few) frustrating quarters of an hour trying to google that sh|t, his extremely keen side vision detects a pull image error. Lo and behold, the error makes sense! Docker.com has limited the number of free image pulls to 100 per day for anonymous users. We already did one major update and pretty much used up our quota. But there is a 200 per day limit for authenticated users. YES! There is hope, we can work with that. The mouse goes on overdrive and we now have an account – a cheapskate free account – and we can pull twice as much.

Alright, now all we need is to create a kubernetes secret in every namespace, which would make this late job even later! So what does a seasoned (as in old) engineer do? He picks up all the failed image pulls from the syslog and runs them manually on the node itself. After all, that is what kops and kubernetes do at the lowest of levels. To boot, our senior is extremely handy with sshing over bastion hosts into nodes and quickly running scores of commands. That is, how can one put it, a subconscious skill acquired over the decades.
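For the curious, the manual workaround looked roughly like the sketch below. It is a hedged reconstruction, not the exact commands: the grep pattern, image names and log path are illustrative assumptions, and it presumes the nodes run Docker as the container runtime.

#!/bin/bash
# On each affected node (reached over the bastion host):
docker login -u your-dockerhub-user   # use the freshly created free account

# Fish the failed pulls out of syslog and retry them by hand.
# The "Failed to pull image" pattern is an assumption; adjust to the actual log text.
grep -oP 'Failed to pull image "\K[^"]+' /var/log/syslog | sort -u | while read -r img
do
  docker pull "$img"
done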

One (as in a single) quarter of an hour later, all nodes are updated and up, the cluster is working and all the apps are working as expected! Success!

Suffice it to say that the 3rd and final revision update never went ahead. Real soon now …

The perfect me

There is so much pressure on us to be somebody else, something else. Be flawless, be grand, be fabulous, to quote the TV series. “Why can’t you be like Jude Law?”

Well ladies and gents no cigar.

I am flawed, I have made bad choices, I have paid for them.

I am awesome, I have made great choices. I have been rewarded for them, I have derived great pleasure from them.

I am not perfect, I am human, I cope.

I will not change my appearance any more, I have grown up.

I will not imitate a TV persona like Lenny Kravitz, he made his choices, I made mine.

I will not change my gender, I am secure.

I will not become a statistic to fuel your “counter-establishment fight”, I am home.

Good luck, I hope you find your Ithaca.

Remotely local workers

This is the year of our lord 2020. A plague wafts over the land. People are scared and stay indoors awaiting salvation.

That is how it would have sounded if we were in the Middle Ages, yet it is a pretty accurate description of the situation at hand. The latest victim of the disease is commercial real estate. Companies have begun to sell their lavish headquarters and this triggered my grey cells.

First of all let me clarify my position. I am not a fanatic of remote work and I have said so in the past (https://managenot.wordpress.com/2011/02/01/rubber-band-work-force/), yet it is convenient and is currently helping us keep working during the pandemic. The real issue is that people like to have other people around. It is part of our genetic makeup: being alone is no fun! So eventually businesses will need enough office space again.

This is the crux of my idea: since we now have the ability to work remotely just as efficiently as we do on site, it would be a good idea to create co-working spaces with an emphasis on cross-company fertilization.

What I mean by that is that companies should not own or lease their own buildings to accommodate their entire work force. Instead they should rent co-working spaces, closer to the employees’ locations if necessary. This will create a mixing and association of people from completely disparate companies and backgrounds.

Picture this: instead of ruminating endlessly on how to optimize a computer process, someone goes to have a cup of coffee in the common lounge and listens to a sales person tell a tall tale. You do not have to listen. It is not a meeting within your company. You are not bound by rules to attend it. You are free to stay or leave. You are free to engage or not. No matter what you decide, this input, foreign to your immediate needs, will trigger some associations in your brain. These mental associations are the fertilizer of innovation and ingenuity. You don’t believe me? Go have a dozen meetings with the same people on the same problem and let me know how zombified you feel afterwards.

Will companies, which normally value secrecy too much, adapt to such disruption? This is a massive change in paradigm and habit! Let me remind you of some old wisdom on the subject: Frank Herbert wrote throughout his work of the “Harq al’Ada”: the breaking of the habit. It is the same meme that Marvin Minsky denotes when he says: “Try to surprise yourself by the way you think today”. Both these tenets have been very much at play in the recent startup scene. They have served the newfound wealth extremely well. It is time to break everyone’s habits. It is now time to break the habit of worker over-concentration, which ironically leads to isolation. Fewer zombie meetings, more zest!

Cloud wisps

This is a recipe for people who run their own VMs for any reason whatsoever.

If you are running a lot of Linux virtual hosts, try to locate a kernel compiled in the “cloud” fashion, like Debian’s linux-image-cloud-amd64. This version of the Linux kernel has almost all hardware related code removed, along with a lot of other stuff too. The performance improvement one gets is threefold:

a) far fewer context switches in your hosts, which translates to snappier system responses, i.e. less latency.

b) less CPU utilization, which means you can service more guests if you need to.

c) the extra CPU headroom will allow you to make your KSM tuning more aggressive, and therefore gain more free RAM overall.

Add to the above a knob to tune Linux’s memory ballooning device and you can oversell your RAM even more.
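A minimal sketch of the idea on a Debian host follows. The package name is the real Debian one, but the KSM values are illustrative assumptions to be tuned for your own workload.

#!/bin/bash
# In the guest: install the trimmed-down cloud kernel
apt-get install -y linux-image-cloud-amd64

# On the KVM host: make KSM scan more aggressively now that the guests are cheaper to run.
# These numbers are examples, not recommendations.
echo 1    > /sys/kernel/mm/ksm/run              # enable KSM
echo 2000 > /sys/kernel/mm/ksm/pages_to_scan    # pages scanned per wake-up
echo 20   > /sys/kernel/mm/ksm/sleep_millisecs  # shorter naps between scans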

And this is how you can become a tiny AWS 😉

Un-learning

#Learning is a difficult process! It is so difficult in fact that we have institutionalized it in protected places called Universities. It is so difficult that you get awarded a degree when the institution of learning decrees that you have achieved certain goals. It is so difficult that when it is time to unlearn almost all of us have a negative visceral reaction to it.

Yet when the time is right a good learner has to re-learn; one has to remember the joy and wonder of the first time one started to learn!

This is particularly applicable in the world of computer science / programming / devops where new tools crop up continuously. A decent engineer has to continually educate oneself as the technologies used advance. So the trick to keeping a young mind and to keep learning is to accept that sometimes one has to un-learn, to cast aside old assumptions. THAT is difficult for all the reasons above.

But there is a simple, naive if you like, method. Be young at heart, accept that the world still has wonders to be discovered, all one has to do is un-learn.

After all Un-learning is no different from learning. It is the rediscovering of one’s original passion to partake of the wonders of the world! It is a change of mental clothes. The person underneath is still the same and even better dressed after the change to fresh clothes 😉

The case for GlueOps

I am a Unix/Linux Infrastructure engineer, a university educated sysadmin of the old lore. In my 25+ years of Sysadmining / Devopsing I have never seen the case where the craft was isolated from business decisions, service considerations, efficiency in delivery or developer enablement. The job included all of the above, quite a bit of coding and then some. We, of the old school sysadmins, would take all the ingredients and GLUE them together into a cohesive, coherent service, all the while working with open source tools because the tooling budget was almost always nil. This process of creation is what I call GlueOps, a holistic view of the modern IT landscape.

Open Source

Never was a decision made on its technological merit alone; it has always been: “do it for free or as close as you can get”. This is not a limitation, this is the springboard of innovation. The greatest boon to humanity in the last (20th) century has been open source software. Millions of spectacular minds spent significant amounts of time collaborating to produce incredible software. The GNU, Apache and Linux Foundations, to name a few, and a host of universities published code that created the Internet boom as we now know it. Some were duds (gopher), some remained in the stone age but still useful (FTP) and some bloomed into thousands of viable systems (Linux).

In GlueOps the end target is the complete service to be offered to the client, be it one internal to the organization or, more traditionally, an external one. The first step, even before deciding on the infrastructure to be used (cloud, physical, hybrid), is the selection of software components. For example, if there is a need for a centralized authentication system, will one choose LDAP or something else? One weighs the merits of each choice and goes shopping, so to speak, for the best fitting open source package to fit the needs. A target of 90% fit is a very viable one. If one can find a piece of software that covers 90% of the needs, one can always add another 5% via custom coding and cover just about every base. 100% is unattainable, as the time-effort to target curve is asymptotic at 100%; it will reach it only at the heat death of the universe.

GlueOps puts particular emphasis on resiliency of services, so the tooling deployed must always be able to operate in a redundant, highly available and, if possible, load balanced modus operandi. Having systems lying about waiting to be used in case of an emergency is considered inelegant, unless one is building a disaster recovery site. So tools like HAProxy and keepalived are essential building blocks of GlueOps.

The next level of GlueOps covers the more traditional aspects of IT in general: databases, VPNs, web servers and accelerators, and so on and so forth. The methodology of choice and application stays the same.

The religious aspect of GlueOps

It always surprises me that after so much code has been written there are still little nooks and crannies that have not been touched. To fill in these nooks, an admin will invariably need to author one or more tools, and that leads us to religion: computer languages are too many, and fights are so easy to start among the various proponents of each paradigm. These fights have been fought with a ferocity that would have made ISIS fighters green with envy.

GlueOps emphasizes versatility. Admins should be versatile enough to use the best tool for the job, keeping in mind the first tenet: look for existing open code. As code could be written in any language, a master GlueOp should be able to at least read and understand most of them. What is an absolute necessity is for one to be competent at shell scripting (bash) and at least one more scripting language like Perl or Python. The final choice of scripting language comes down to the availability of libraries for common tasks. A case in point is Perl: the language had been struggling to attract new developers but still had some life in it. Once CPAN (the Comprehensive Perl Archive Network) went offline, the language quickly died off, to be replaced by the new kid on the block: Python. So staying relevant and re-educating oneself is a cornerstone of GlueOps.

At the wizard level of GlueOps one has to be a master C programmer. So many of the scripting libraries are but a shell around C library functions. Understanding C, therefore, is understanding how a Linux kernel operates, and that is wizard-fu!

Automation

I will not spend much time on this. Automation is absolutely essential when one deals with a swarm of machines. It not only helps in day to day operations, it helps to make work repeatable and at least semi-documented. One can choose Saltstack or Ansible or both, but one must absolutely use something. If for nothing else, then just to have a quick look at how the systems were set up and are expected to operate.
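Purely as an illustration (assuming Ansible, and with inventory.ini and site.yml being placeholder names, not part of any real setup), the day-to-day usage boils down to something like:

#!/bin/bash
# Check that all the documented hosts are reachable
ansible all -i inventory.ini -m ping

# Re-apply the documented setup; site.yml is whatever playbook describes your systems
ansible-playbook -i inventory.ini site.yml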

Observability and measurements

The cornerstone of engineering is measurement. In Fantasy Lore, if one knows the true name of a dragon one can control it. The equivalent in engineering lore is: if it cannot be measured, it cannot be controlled. Therefore every project should have its own observability subproject. Thankfully there are great tools out there now for practically free: Prometheus, InfluxDB or the ELK stack for time series databases, and tools like Grafana for visualization and alerting. Without these packages to measure the services, one is blind, and that is a bad thing in the business world.

One will have the usual graphing and alerting on usage, alerts on timeouts and so on, but measuring will also give you further insights into your systems and applications. Suppose for example that you see a flat system load line fixed at 1 on a two core system. This almost certainly means that a service is wedged and needs to be restarted, even though the service as a whole appears fine. Even an unexplained, regular IO spike should be investigated; in the engineering world there is no room for twilight.
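As a toy illustration of that “flat load line” heuristic, here is a hedged bash sketch. The thresholds are arbitrary assumptions, and a real deployment would express this as an alerting rule in your monitoring stack instead.

#!/bin/bash
# Flag a 1-minute load average pinned suspiciously close to 1 on a two-core host,
# which usually points to a single wedged, CPU-spinning process.
load=$(cut -d' ' -f1 /proc/loadavg)
cores=$(nproc)
if [ "$cores" -eq 2 ] && awk -v l="$load" 'BEGIN { exit !(l > 0.9 && l < 1.1) }'
then
  echo "load pinned at ~1 on a 2-core host: check for a wedged service"
fi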

The code

Of course, without the application proper there is no service. GlueOps has an important role to play there too: guidance. Developers often forget that we live in a distributed world and must always keep in mind the subjects of resiliency, load balancing etc. etc. as stated above. A GlueOp who is worth his salary should be able to sit down with these cats and explain networking and systems in such a manner that they understand the needs of the service as a whole. The conversation should preferably finish without murdering any hotheaded fools. I am not being cute here, some fights have been bloody indeed. A case in point is explaining how a global lock in a network filesystem degrades performance across the cluster, but that is another story to be recited at some other time while quaffing bottles of bourbon.

So what do GlueOps do?

Considering all the above aspects of the work, one can easily see that every single one of them is necessary. Take any one of them lightly and the service as a whole becomes brittle and unstable. The admin has to Glue every piece together in such a fashion that the total comes alive. GlueOps do this: they glue together code, infrastructure, management and observability platforms to create business nourishing services. They are go-to people, problem solvers, engineers, coaches. Companies should stop trying to limit their job description to “systems operator” or “create a continuous integration pipeline”. That is far too limiting for someone who is called to keep money making services running 24x7x365.

Elegantly monitor Gluster health

A gluster cluster is a nice thing to have, as long as it does not lose coherence. Oftentimes volumes need to heal and bricks drop their connection. Thankfully, monitoring all such niceties is possible with collectd plus a simple and elegant bash script.

First you need a collectd.conf.d gluster config file


LoadPlugin exec
TypesDB     "/usr/share/collectd/types.db.gluster"
TypesDB     "/usr/share/collectd/types.db"

<Plugin exec>
  # collectd's exec plugin will not run scripts as root; the gluster calls in the script go through sudo
  Exec "nobody:nogroup" "/usr/local/bin/glusterstats.bash"
</Plugin>

And now you need the script itself


#!/bin/bash
# Collectd exec-plugin script: reports per-brick heal entry counts and
# "Transport endpoint is not connected" disconnections for this node's bricks.

HOSTNAME="${COLLECTD_HOSTNAME:-$(hostname)}"
INTERVAL="${COLLECTD_INTERVAL:-10}"
# IP of the interface gluster listens on; used to pick out this node's bricks
myip=$(/sbin/ifconfig ens19 | grep inet | awk '{print $2}')

while true
do
  volumes=$(sudo /usr/sbin/gluster volume list)
  for vol in $volumes
  do
    sudo /usr/sbin/gluster volume heal $vol info > /tmp/$vol.info
    brickdata=""
    disc=0
    while read line
    do
      entries=""
      if [[ $brickdata == "" ]]
      then
        # Start of a brick block: remember the brick if it lives on this node
        brickdata=$(echo $line | grep Brick | grep $myip | awk '{print $2}')
        brick=$(echo $brickdata | cut -f2 -d: | sed -e 's/^\///g' | sed -e 's/\//_/g')
        peer=$(echo $brickdata | cut -f1 -d:)
        status=$(echo $line | grep "Status: Transport endpoint is not connected")
        if [ $? -eq 0 ]
        then
          disc=$((disc+1))
        fi
      fi
      # "Number of entries: N" closes the brick block
      entries=$(echo $line | grep entries | cut -f2 -d: | sed -e 's/^ [ ]*//g')

      if [[ $entries != "" && $brick != "" ]]
      then
        # echo $peer $brick $entries   # debug
        echo "PUTVAL \"$HOSTNAME/gluster/brick_heal_entries-${brick}\" interval=$INTERVAL N:$entries"
        echo "PUTVAL \"$HOSTNAME/gluster/counter-disconnections-${brick}\" interval=$INTERVAL N:$disc"
        brickdata=""
        disc=0
      fi
    done < /tmp/$vol.info
  done
  sleep $INTERVAL
done

Remember to use Grafana for alerting

Active-Active HA cluster on Hetzner servers

Hetzner is a dedicated server provider that gives extremely good value to its customers, but has a few quirks. The particular quirk that made my life so difficult, and I guess that of many others, is its implementation of floating IPs.

Hetzner’s implementation of floating IPs is not ARP based as an orthodox sysadmin would expect. I will hazard a guess that it is done in such a fashion so that data streams do not leak across customers.

But despair not, there is a straightforward way to do exactly what one would expect to do with linux-ha. One can set up a pair of floating IPs between a pair of hosts and use keepalived to manage IP migration, thus effectively creating an active-active setup.

First you need two servers with a private interconnection between them and one floating IP per server. Configure the IP on each server following the instructions on the Hetzner Wiki. Make sure that you configure both floating IPs on each host so that on failover the interfaces will be ready. Also remember to set up your internal IPs and set aside two IPs of your own for the internal floating scheme that VRRP needs. In our example we will use 192.168.1.1 for master1 and 192.168.2.1 for master2. Obviously the internal network is 192.168.0.0/16.

Now set up the Hetzner API scripts as shown in their failover script and you are almost done. The next step is to install keepalived and configure it so that only its vrrpd module is active. As an example, on a Debian server edit the file /etc/default/keepalived and insert the following line in there:

DAEMON_ARGS="-P"

Now the trick is to have two virtual routers configured in keepalived, each having one of your hosts as master and the other as backup. A sample configuration for master1 follows:



#
# This runs on master1
# 192.168.[12].1 are the internal floating IPs we manage as a necessity
# but the notification scripts do the actual work
# eth1 is the internal interconnect
#
global_defs {
      notification_email {
      devops@yourdomain
    }
    notification_email_from noreply@yourdomain
    smtp_server smtp.yourdomain
    smtp_connect_timeout 60
    script_user root
}


vrrp_instance virtualrouter1 {
    state MASTER
    interface eth1
    virtual_router_id 1
    priority 200
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1234pass
    }
        virtual_ipaddress {
        192.168.1.1
    }
    preempt_delay 5
    notify_master "/usr/local/bin/hetzner.sh ip1 set master1"
    notify_backup "/usr/local/bin/hetzner.sh ip1 set master2"
}

vrrp_instance virtualrouter2 {
    state BACKUP
    interface eth1
    virtual_router_id 2
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass pass4321
    }
    virtual_ipaddress {
        192.168.2.1
    }
    preempt_delay 5
    notify_master "/usr/local/bin/hetzner.sh ip2 set master1"
    notify_backup "/usr/local/bin/hetzner.sh ip2 set master2"
}

Obviously the configuration on master2 is the inverse of the above.
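As for the notify scripts referenced above, adapt them from Hetzner’s own failover API example. Purely as a hypothetical sketch (the credentials, IPs, mapping and even the API endpoint below are placeholders to be verified against Hetzner’s documentation), /usr/local/bin/hetzner.sh could look roughly like this:

#!/bin/bash
# Hypothetical sketch of /usr/local/bin/hetzner.sh -- everything below is a placeholder.

ROBOT_USER="robot-user"        # Hetzner Robot webservice credentials
ROBOT_PASS="robot-pass"

# map the aliases used in the keepalived notify lines to the actual addresses
declare -A FLOATING=( [ip1]="203.0.113.10"    [ip2]="203.0.113.11" )
declare -A SERVER=(   [master1]="198.51.100.1" [master2]="198.51.100.2" )

ipalias=$1   # ip1 or ip2
action=$2    # "set"
target=$3    # master1 or master2

if [ "$action" = "set" ]
then
  # point the floating IP at the server that just became MASTER for this VRRP instance
  curl -s -u "$ROBOT_USER:$ROBOT_PASS" \
       -d "active_server_ip=${SERVER[$target]}" \
       "https://robot-ws.your-server.de/failover/${FLOATING[$ipalias]}"
fi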

  • Beware: failover and failback times are a bit longer than what one would expect, but the setup adds a nine to your overall availability!

Bust your users’ chops by finding their passwords



#!/bin/bash

# Dump the {SSHA} password hashes from LDAP and run them through hashcat
# against a couple of public wordlists.
# rockyou is @ https://github.com/brannondorsey/naive-hashcat/releases/download/data/rockyou.txt
# wordlists are @ http://www.md5this.com/tools/wordlists.html

# increase performance if you feel like it (-w 3 is aggressive, -w 1 stays polite);
# the second assignment wins, so comment it out to go faster
performance=3
performance=1

# fill in your own LDAP details
LDAPSERVER=
LDAPBASE=
LDAPBINDDN=
LDAPPASS=

cd /root/infosec/hashcat

# pull the userPassword attributes, base64-decode them and keep only the {SSHA} hashes
(ldapsearch -Z -x -h $LDAPSERVER -b ou=users,$LDAPBASE -D $LDAPBINDDN -w "$LDAPPASS" uid userPassword |\
grep ^userPassword |\
awk '{ print $2}' | while read line
do
  echo $line | base64 --decode 2>/dev/null | grep SSHA
done ) > /root/infosec/hashcat/passwords

# -m 111 is hashcat's mode for Netscape LDAP {SSHA} (salted SHA-1, base64)
/root/infosec/hashcat/hashcat \
--quiet \
-O -w $performance \
-D 2 \
--gpu-temp-abort 80 \
-m 111 \
-r rules/generated2.rule \
-o cleartext_passwords \
passwords dicts/rockyou.txt dicts/www.md5this.com/Wordlist.txt dicts/www.md5this.com/wordlists/*.dic

# hashcat only creates the outfile when it actually cracks something
if [ -f cleartext_passwords ]
then
  echo found some passwords
  cat cleartext_passwords
fi