Thursday, January 28, 2016

Disnix 0.5 release announcement and some reflection

In this blog post, I'd like to announce the next Disnix release. At the same time, I noticed that it has been eight years since I started developing it, so this is also a nice opportunity to do some reflection.

Some background information


The idea was born while I was working on my master's thesis. A few months prior, I got familiar with Nix and NixOS -- I read Eelco Dolstra's PhD thesis, managed to package some software, and wrote a couple of services for NixOS.

Most of my packaging work was done to automate the deployment of WebDSL applications, a case study in domain-specific language engineering that is still an ongoing research project in my former research group. WebDSL is a domain-specific language for developing dynamic web applications with a rich data model.

Many aspects of Nix/NixOS were quite "primitive" compared to today's implementations -- there was no NixOS module system, making it less flexible to extend the system configuration. Many packages that I needed were missing and I had to write Nix expressions for them myself, such as Apache Tomcat, MySQL, and Midnight Commander. The desktop experience, such as KDE, was also quite basic, as only the base package was supported.

As part of my master's thesis project, I did an internship at the Healthcare Systems Architecture group at Philips Research. They had been developing a platform called SDS2, whose purpose was to provide asset tracking and utilization analysis services for medical equipment.

SDS2 qualifies as a service-oriented system (a term that people used to talk about frequently, but not so much anymore :) ). As such, it can be decomposed into a set of distributable components (a.k.a. services) that interact with each other through "standardized protocols" (e.g. SOAP), sometimes over network links.

There are a variety of reasons why SDS2 has a distributed architecture. For example, data that has been gathered from medical devices may have to be physically stored inside a hospital for privacy reasons. The analysis components may require a lot of computing power and would perform better if they run in a data center with a huge amount of system resources.

Being able to distribute services is good for many reasons (e.g. meeting certain non-functional requirements, such as privacy), but it also has a big drawback -- services are software components, and one of their characteristics is that they are units of deployment. Deploying a single service to one machine without any (or proper) automation is already complicated and time consuming, but deploying a network of machines is many times as complex.

The goal of my thesis assignment was to automate SDS2's deployment in distributed environments using the Nix package manager as a basis. Nix provides a number of unique properties compared to many conventional deployment solutions, such as fully automated deployment from declarative specifications, and reliable and reproducible deployment. However, it was also lacking a number of features to provide the same or similar kinds of quality properties to deployment processes of service-oriented systems in networks of machines.

The result of my master's thesis project was the first prototype of Disnix, which I never officially released. After my internship, I started my PhD research and resumed working on Disnix (among several other things). This resulted in a second prototype and two official releases, eventually turning Disnix into what it is today.

Prototype 1


This was the prototype resulting from my master's thesis and was primarily designed for deploying SDS2.

The first component that I developed was a web service (using similar kinds of technologies as SDS2, such as Apache Tomcat and Apache Axis2) exposing a set of deployment operations to remote machines (most of them consulting the Nix package manager).

To cope with permissions and security, I decided to make the web service just an interface around a "core service" that was responsible for actually executing the deployment activities. The web service used the D-Bus protocol to communicate with the core.

On top of the web service layer, I implemented a collection of tools each executing a specific deployment activity in a network of machines, such as building, distributing and activating services. There were also a number of tools combining deployment activities, such as the "famous" disnix-env command responsible for executing all the activities required to deploy a system.

The first prototype of disnix-env, in contrast to today's implementation, provided two deployment procedure variants: building on targets and building on the coordinator.

The first variant was basically inspired by the manual workflow I used to carry out to get SDS2 deployed -- I manually installed a couple of NixOS machines, used SSH to connect to them remotely, checked out Nixpkgs and all the other Nix expressions that I needed, deployed all packages from source, and finally modified the system configuration (e.g. Apache Tomcat) to run the web services.

Unfortunately, transferring Nix expressions is not an easy process, as they are rarely self contained and typically rely on other Nix expression files scattered over the file system. While thinking about a solution, I "discovered" that the Nix expression evaluator creates so-called store derivation files (low-level build specifications) for each package build. Store derivations are also stored in the Nix store next to ordinary packages, including their dependencies. I could instead instantiate a Nix expression on the coordinator, transfer the closure of store derivation files to a remote machine, and build them there.

After some discussion with my company supervisor Merijn de Jonge, I learned that compiling on target machines was undesired, in particular in production environments. Then I learned more about Nix's purely functional nature, and "discovered" that builds are referentially transparent -- for example, it should not matter where a build has been performed. As long as the dependencies remain the same, the outcome would be the same as well. With this "new knowledge" in mind, I implemented a second deployment procedure variant that would do the package builds on the coordinator machine, and transfer their closures (dependencies) to the target machines.

As with the current implementation, deployment in Disnix was driven by three kinds of specifications: the services model, the infrastructure model and the distribution model. However, their notational conventions were a bit different -- the services model already knew about inter-dependencies, but propagating the properties of inter-dependencies to build functions was an ad-hoc process. The distribution model was a list of attribute sets, which also allowed someone to specify the same mapping multiple times (resulting in undefined outcomes).

Another primitive aspect was the activation step, such as deploying web applications inside Apache Tomcat. It was basically done by a hardcoded script that only knew about Java web applications and Java command-line tools. Database activation was completely unsupported, and had to be done by hand.

I also did a couple of other interesting things. I studied the "two-phase commit protocol" for upgrading distributed systems atomically and mapped its concepts to Nix operations, to support (almost) atomic upgrades. This idea resulted in a research paper that I presented at HotSWUp 2008.

Finally, I sketched a simple dynamic deployment extension (and wrote a partial implementation for it) that would calculate a distribution model, but time did not permit me to finish it.

Prototype 2


The first Disnix prototype made me quite happy in the early stages of my PhD research -- I gave many cool demos to various kinds of people, including our industry partner Philips Healthcare and NWO/Jacquard, the organization that was funding me. However, I soon realized that the first prototype had become too limited.

The first annoyance was my reliance on Java. Most of the tools in the Disnix distribution were implemented in Java and depended on the Java Runtime Environment, which is quite a big dependency for a set of command-line utilities. I reengineered most of the Disnix codebase and rewrote it in C. I only kept the core service (which was implemented in C already) and the web service interface, which I separated into an external package called DisnixWebService.

I also got rid of the reliance on a web service to execute remote deployment operations, because it was quite tedious to deploy it. I made the communication aspect pluggable and implemented an SSH plugin that became the default communication protocol (the web service protocol could still be used as an external plugin).

For the activation and deactivation of services, I developed a plugin system (Disnix activation scripts) and a set of modules supporting various kinds of services replacing the hardcoded script. This plugin system allowed me to activate and deactivate many kinds of components, including databases.

Finally, I unified the two deployment procedure variants of disnix-env into one procedure. Building on the targets became simply an optional step that was carried out before building on the coordinator.

Disnix 0.1


After my major reengineering effort, I was looking into publishing something about it. While working on a paper (whose first version got badly rejected), I realized that services in an SOA context are "platform independent" because of their interfaces, but they still have implementations underneath that could depend on many kinds of technologies. This heterogeneity makes deployment extra complicated.

There was still one piece missing to bring service-oriented systems to their full potential -- Disnix had no support for multiple operating systems. The Nix package manager could also be used on several operating systems besides Linux, but Disnix was bound to Linux only.

I did another major reengineering effort to make the system architecture of the target systems configurable, which required me to change many things internally. I also developed new notational conventions for the Disnix models. Each service expression became a nested function in which the outer function corresponds to the intra-dependencies and the inner function to the inter-dependencies, making them look quite similar to expressions for ordinary Nix packages. Moreover, I removed the ambiguity problem in the distribution model by making it an attribute set.

The resulting Disnix version was first described in my SEAA 2010 paper. Shortly after the paper got accepted, I decided to officially release this version as Disnix 0.1. Many external aspects of this version are still visible in the current version.

Disnix 0.2


After releasing the first version of Disnix, I realized that there were still a few pieces missing for automating the deployment processes of service-oriented systems. One of the limitations of Disnix is that it expects machines to already be present, possibly running a number of preinstalled system services, such as MySQL, Apache Tomcat, and the Disnix service exposing remote deployment operations. These machines had to be deployed by other means first.

Together with Eelco Dolstra I had been working on declarative deployment and testing of networked NixOS configurations, resulting in a tool called nixos-deploy-network that deploys networks of NixOS machines and a NixOS test driver capable of spawning networks of NixOS virtual machines in which system integration tests can be run non-interactively. These contributions were documented in a tech report and the ISSRE 2010 paper.

I made Disnix more modular so that extensions could be built on top of it. The most prominent extension was DisnixOS, which integrates NixOS deployment and the NixOS test driver's features with Disnix service deployment so that a service-oriented system's deployment process can be fully automated.

Another extension was Dynamic Disnix, a continuation of the dynamic deployment extension that I never finished during my internship. Dynamic Disnix extends the basic toolset with an infrastructure discovery tool and a distribution generator using deployment planning algorithms from the academic literature to map services to machines. The extended architecture is described in the SEAMS 2011 paper.

The revised Disnix architecture has been documented in both the WASDeTT 2010 and SCP 2014 papers and was released as Disnix 0.2.

Disnix 0.3


After the 0.2 release I got really busy, which was partly caused by the fact that I had to write my PhD thesis and yet another research paper for an unfinished chapter.

The last Disnix-related research contribution was a tool called Dysnomia, which I based on the Disnix activation scripts package. I augmented the plugins with experimental state deployment operations and turned the package into a new tool that (in theory) could be combined with other tools as well, or used independently.

Unfortunately, I had to quickly rush out a paper for HotSWUp 2012 and the code was in a barely usable state. Moreover, the state management facilities had some huge drawbacks, so I was not that eager to get them integrated into the mainstream version.

Then I had to fully dedicate myself to completing my PhD thesis and for more than six months, I hardly wrote any code.

After finishing the first draft of my PhD thesis and while waiting for feedback from my committee, I left academia and switched jobs. Because I had no practical use cases for Disnix, and because of other duties in my new job, its development was done mostly in my spare time at a very low pace -- one thing that I accomplished in that period was creating a 'slim' version of Dysnomia that supported all the activities in the HotSWUp paper without any snapshotting facilities.

Meanwhile, nixos-deploy-network got replaced by a new tool named Charon, which later became NixOps. In addition to deployment, NixOps could also instantiate virtual machines in IaaS environments, such as Amazon EC2. I modified DisnixOS to also integrate with NixOps so that these capabilities can be used.

Three and a half years after the previous release (late 2014), my new employer wanted to deploy their new microservices-based system to a production environment, which made me quite motivated to work on Disnix again. I did some huge refactorings and optimized a few aspects to make it work for larger systems. Some interesting optimizations were concurrent data transfers and concurrent service activations.

I also implemented support for multiple connection protocols. For example, you could use SSH to connect to one machine and SOAP to another.

After implementing the optimizations, I realized that I had reached a stable point and decided that it was a good time to announce the next release, after a few years of little development activity.

Disnix 0.4


Despite being happy with the recent Disnix 0.3 release and using it to deploy many services to production environments, I quickly ran into another problem -- the services that I had to manage store data in their own dedicated databases. Sometimes I had to move services from one machine to another. Disnix (like the other Nix tools) does not manage state, requiring me to manually migrate data, which was quite painful.

I decided to dig up the state deployment facilities from the HotSWUp 2012 paper to cope with this problem. Despite the solution having a number of limitations, the databases that I had to manage were relatively small (tens of megabytes), so it was still a good fit.

I integrated the state management facilities from the prototype described in the paper into the "production" version of Dysnomia, and modified Disnix to use them. I left out the incremental snapshot facilities described in the paper, because there was no practical use for them. When the work was done, I announced the next release.

Disnix 0.5


With Disnix 0.4, all my configuration management work was automated. However, I spotted a couple of inefficiencies, such as many unnecessary redeployments while upgrading. I solved this issue by making the concept of target-specific services a first-class citizen in Disnix. Moreover, I regularly had to deal with RAM issues and added on-demand activation support (by using the operating system's service manager, such as systemd).

There were also some user-unfriendly aspects that I improved -- better and more concise logging, more helpful error messages, --rollback and --switch-generation options for disnix-env, and commands that work on the deployment manifest (e.g. disnix-visualize) were extended to take the last deployed manifest into account when no parameters are provided.

Conclusion


This long blog post describes how the current Disnix version (0.5) came about after nearly eight years of development. I'd like to announce its immediate availability! Consult the Disnix homepage for more information.

Friday, January 22, 2016

Integrating callback and promise based function invocation patterns (Asynchronous programming with JavaScript part 4)

It has been quiet for a while on my blog in the programming language domain. Over two years ago, I started writing a series of blog posts about asynchronous programming with JavaScript.

In the first blog post, I explained some general asynchronous programming issues, code structuring issues and briefly demonstrated how the async library can be used to structure code more properly. Later, I have written a blog post about promises, another abstraction mechanism dealing with asynchronous programming complexities. Finally, I have developed my own abstraction functions by investigating how JavaScript's structured programming language constructs (that are synchronous) translate to the asynchronous programming world.

In these blog posts, I have used two kinds of function invocation styles -- something that I call the Node.js-function invocation style, and the promises invocation style. As the name implies, the former is used by the Node.js standard library, as well as many Node.js-based APIs. The latter is getting more common in the browser world. As a matter of fact, many modern browsers provide a Promise prototype as part of their DOM API, allowing others to construct their own Promise-based APIs with it.

In this blog post, I will compare both function invocation styles and describe some of their differences. Additionally, there are situations in which I have to mix APIs using both styles and I have observed that it is quite annoying to combine them. I will show how to alleviate this pain a bit by developing my own generically applicable adapter functions.

Two example invocations


The most frequently used invocation style in my blog posts is something that I call the Node.js-function invocation style. An example code fragment that uses such an invocation is the following:

fs.readFile("hello.txt", function(err, data) {
    if(err) {
        console.log("Error while opening file: "+err);
    } else {
        console.log("File contents is: "+data);
    }
});

As you may see in the code fragment above, when we invoke the readFile() function, it returns immediately (to be precise: it returns, but without a value). We use a callback function (that is typically the last function parameter) to retrieve the results of the invocation (or the error if something went wrong) at a later point in time.

By convention, the first parameter of the callback is an error parameter that is not null if some error occurs. The remaining parameters are optional and can be used to retrieve the corresponding results.
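To illustrate this convention, the following sketch defines a hypothetical delayedAdd() function (my own illustration, not part of any real API) that follows the Node.js callback style:

/* A hypothetical function following the Node.js callback convention:
 * the last parameter is a callback whose first parameter is an error
 * (or null), followed by the result.
 */
function delayedAdd(a, b, callback) {
    setTimeout(function() {
        if(typeof a !== "number" || typeof b !== "number") {
            callback("Both parameters must be numbers"); /* Report an error */
        } else {
            callback(null, a + b); /* No error, propagate the result */
        }
    }, 100);
}

delayedAdd(1, 2, function(err, sum) {
    if(err) {
        console.log("Error: "+err);
    } else {
        console.log("Sum is: "+sum);
    }
});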

When using promises (more specifically: promises that conform to the Promises/A and Promises/A+ specifications), we use a different invocation pattern that may look as follows:

Task.findAll().then(function(tasks) {
    for(var i = 0; i < tasks.length; i++) {
        var task = tasks[i];
        console.log(task.title + ": "+ task.description);
    }
}, function(err) {
    console.log("An error occured: "+err);
});

As with the previous example, the findAll() function invocation shown above also returns immediately. However, it also does something different compared to the Node.js-style function invocation -- it returns an object called a promise whereas the invocation in the previous example never returns anything.

By convention, the resulting promise object provides a method called then() in which (according to the Promises/A and A+ standards) the first parameter is a callback that gets invoked when the function invocation succeeds and the second parameter is a callback that gets invoked when the function invocation fails. The parameters of these callback functions represent result objects or error objects.

Comparing the invocation styles


At first sight, you will probably notice that, despite having different styles, both function invocations return immediately and need an "artificial facility" to retrieve the corresponding results (or errors) at a later point in time, as opposed to directly returning a result.

The major difference is that in the promises invocation style, you will always get a promise as a result of an invocation. A promise is a reference to a result that will be delivered at some point in the future. For example, when running:

var tasks = Task.findAll();

I will obtain a promise that, at some point in the future, provides me with an array of tasks. I can use this reference to do other things, for example by passing the promise around as an argument to other functions.

For example, I may want to construct a UI displaying the list of tasks. I can already construct pieces of it without waiting for the full list of tasks to be retrieved:

displayTasks(tasks);

The above function could, for example, already start rendering a header, some table cells and buttons without the results being available yet. The display function invokes the then() function when it really needs the data.
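For instance, a possible sketch of such a display function (my own illustration, not taken from the original example code) could look as follows:

/* A sketch of displayTasks(): parts of the output can be produced right
 * away, and the promise is only consumed when the data is really needed.
 */
function displayTasks(tasks) {
    console.log("== Task overview ==");  /* Can be rendered immediately */

    tasks.then(function(taskList) {      /* Invoked once the data arrives */
        taskList.forEach(function(task) {
            console.log(task.title + ": " + task.description);
        });
    }, function(err) {
        console.log("An error occurred: "+err);
    });
}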

By contrast, in the Node.js-callback style, I have no reference to the pending invocation at all. This means that I always have to wait for its completion before I can render anything UI related. Because we are forced to wait for its completion, it will probably make the application quite unresponsive, in particular when we have to retrieve many task records.

So in general, in addition to better structured code, promises support composability whereas Node.js-style callbacks do not. Because of this reason, I consider promises to be more powerful.
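Composability also makes it easy to combine multiple pending results. For example, assuming that the promise implementation provides the common Promise.all() function (native ES6 promises do), and assuming a hypothetical User model with a similar findAll() method, two independent queries could be combined as follows:

/* A sketch: both queries run concurrently and the combined promise
 * fulfills once both results have been delivered. Promise.all() and the
 * User model are assumptions, not part of the original example.
 */
var tasks = Task.findAll();
var users = User.findAll();

Promise.all([ tasks, users ]).then(function(results) {
    console.log("Retrieved "+results[0].length+" tasks and "+results[1].length+" users");
}, function(err) {
    console.log("An error occurred: "+err);
});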

However, there is also something that I consider a disadvantage. In my first blog post, I have shown the following Node.js-function invocation style pyramid code example as a result of nesting callbacks:

var fs = require('fs');
var path = require('path');

fs.mkdir("out", 0755, function(err) {
    if(err) throw err;
    
    fs.mkdir(path.join("out, "test"), 0755, function(err) {
        if (err) throw err;        
        var filename = path.join("out", "test", "hello.txt");

        fs.writeFile(filename, "Hello world!", function(err) {
            if(err) throw err;
                    
            fs.readFile(filename, function(err, data) {
                if(err) throw err;
                
                if(data == "Hello world!")
                    process.stderr.write("File is correct!\n");
                else
                    process.stderr.write("File is incorrect!\n");
            });
        });
    });
});

I have also shown in the same blog post, that I can use the async.waterfall() abstraction to flatten its structure:

var async = require('async');
var fs = require('fs');
var path = require('path');

var filename = path.join("out", "test", "hello.txt");

async.waterfall([
    function(callback) {
        fs.mkdir("out", 0755, callback);
    },

    function(callback) {
        fs.mkdir(path.join("out, "test"), 0755, callback);
    },

    function(callback) {
        fs.writeFile(filename, "Hello world!", callback);
    },

    function(callback) {
        fs.readFile(filename, callback);
    },

    function(data, callback) {
        if(data == "Hello world!")
            process.stderr.write("File is correct!\n");
        else
            process.stderr.write("File is incorrect!\n");

        callback();
    }

], function(err, result) {
    if(err) throw err;
});

As you may probably notice, the code fragment above is much more readable and easier to maintain.

In my second blog post, I implemented a promises-based variant of the same example:

var fs = require('fs');
var path = require('path');
var Promise = require('rsvp').Promise;

/* Promise object definitions */

var mkdir = function(dirname) {
    return new Promise(function(resolve, reject) {
        fs.mkdir(dirname, 0755, function(err) {
            if(err) reject(err);
            else resolve();
        });
    });
};

var writeHelloTxt = function(filename) {
    return new Promise(function(resolve, reject) {
        fs.writeFile(filename, "Hello world!", function(err) {
            if(err) reject(err);
            else resolve();
        });
    });
};

var readHelloTxt = function(filename) {
    return new Promise(function(resolve, reject) {
        fs.readFile(filename, function(err, data) {
            if(err) reject(err);
            else resolve(data);
        });
    });
};

/* Promise execution chain */

var filename = path.join("out", "test", "hello.txt");

mkdir(path.join("out"))
.then(function() {
    return mkdir(path.join("out", "test"));
})
.then(function() {
    return writeHelloTxt(filename);
})
.then(function() {
    return readHelloTxt(filename);
})
.then(function(data) {
    if(data == "Hello world!")
        process.stderr.write("File is correct!\n");
    else
        process.stderr.write("File is incorrect!\n");
}, function(err) {
    console.log("An error occured: "+err);
});

As you may notice, because the then() function invocations can be chained, we also have a flat structure making the code better maintainable. However, the code fragment is also considerably longer than the async library variant and the unstructured variant -- for each asynchronous function invocation, we must construct a promise object, adding quite a bit of overhead to the code.

From my perspective, if you need to do many ad-hoc steps (and do not have to compose complex things), callbacks are probably more convenient. For reusable operations, promises are typically a nicer solution.

Mixing function invocations from both styles


It may happen that function invocations from both styles need to be mixed. Typically mixing is imposed by third-party APIs -- for example, when developing a Node.js web application we may want to use express.js (callback based) for implementing a web application interface in combination with sequelize (promises based) for accessing a relational database.

Of course, you could write a function constructing promises that internally only uses Node.js-style invocations, or the opposite. But if you have to regularly intermix calls, you may end up writing a lot of boilerplate code. For example, if I would use the async.waterfall() abstraction in combination with promise-style function invocations, I may end up writing:

async.waterfall([
    function(callback) {
        Task.sync().then(function() {
            callback();
        }, function(err) {
            callback(err);
        });
    },
    
    function(callback) {
        Task.create({
            title: "Get some coffee",
            description: "Get some coffee ASAP"
        }).then(function() {
            callback();
        }, function(err) {
            callback(err);
        });
    },
    
    function(callback) {
        Task.create({
            title: "Drink coffee",
            description: "Because I need caffeine"
        }).then(function() {
            callback();
        }, function(err) {
            callback(err);
        });
    },
    
    function(callback) {
        Task.findAll().then(function(tasks) {
            callback(null, tasks);
        }, function(err) {
            callback(err);
        });
    },
    
    function(tasks, callback) {
        for(var i = 0; i < tasks.length; i++) {
            var task = tasks[i];
            console.log(task.title + ": "+ task.description);
        }

        callback();
    }
], function(err) {
    if(err) {
        console.log("An error occurred: "+err);
        process.exit(1);
    } else {
        process.exit(0);
    }
});

For each Promise-based function invocation, I need to invoke the then() function and, in the corresponding callbacks, I must invoke the callback of each function block to propagate the results or the error. This makes the amount of code I have to write unnecessarily long, tedious to write and a pain to maintain.

Fortunately, I can create a function that abstracts over this pattern:

function chainCallback(promise, callback) {
    promise.then(function() {
        var args = Array.prototype.slice.call(arguments, 0);
        
        args.unshift(null);
        callback.apply(null, args);
    }, function() {
        var args = Array.prototype.slice.call(arguments, 0);
        
        if(args.length == 0) {
            callback("Promise error");
        } else if(args.length == 1) {
            callback(args[0]);
        } else {
            callback(args);
        }
    });
}

The above code fragment does the following:

  • We define a function that takes a promise and a Node.js-style callback function as parameters and invokes the then() method of the promise.
  • When the promise has been fulfilled, it sets the error parameter of the callback to null (to indicate that there is no error) and propagates all resulting objects as remaining parameters to the callback.
  • When the promise has been rejected, we propagate the resulting error object. Because the Node.js-style callback expects a single error object, we compose one ourselves if no error object was returned, and we return an array as the error object if multiple error objects were returned.

Using this abstraction function, we can rewrite the earlier pattern as follows:

async.waterfall([
    function(callback) {
        prom2cb.chainCallback(Task.sync(), callback);
    },
    
    function(callback) {
        prom2cb.chainCallback(Task.create({
            title: "Get some coffee",
            description: "Get some coffee ASAP"
        }), callback);
    },
    
    function(callback) {
        prom2cb.chainCallback(Task.create({
            title: "Drink coffee",
            description: "Because I need caffeine"
        }), callback);
    },
    
    function(callback) {
        prom2cb.chainCallback(Task.findAll(), callback);
    },
    
    function(tasks, callback) {
        for(var i = 0; i < tasks.length; i++) {
            var task = tasks[i];
            console.log(task.title + ": "+ task.description);
        }

        callback();
    }
], function(err) {
    if(err) {
        console.log("An error occurred: "+err);
        process.exit(1);
    } else {
        process.exit(0);
    }
});

As may be observed, this code fragment is more concise and significantly shorter.

The opposite mixing pattern also leads to issues. For example, we can first retrieve the list of tasks from the database (through a promise-style invocation) and then write it as a JSON file to disk (through a Node.js-style invocation):

Task.findAll().then(function(tasks) {
    fs.writeFile("tasks.txt", JSON.stringify(tasks), function(err) {
        if(err) {
            console.log("error: "+err);
        } else {
            console.log("everything is OK");
        }
    });
}, function(err) {
    console.log("error: "+err);
});

The biggest annoyance is that we are forced to do the successive step (writing the file) inside the callback function, causing us to write pyramid code that is harder to read and tedious to maintain. The reason is that we can only "chain" a promise to another promise.

Fortunately, we can create a function abstraction that wraps an adapter around any Node.js-style function: the adapter takes the same parameters (without the callback) and returns a promise:

function promisify(Promise, fun) {
    return function() {
       var args = Array.prototype.slice.call(arguments, 0);
           
       return new Promise(function(resolve, reject) {
            function callback() {
                var args = Array.prototype.slice.call(arguments, 0);
                var err = args[0];
                args.shift();
                    
                if(err) {
                    reject(err);
                } else {
                    resolve(args);
                }
            }
           
            args.push(callback);
                
            fun.apply(null, args);
        });
    };
}

In the above code fragment, we do the following:

  • We define a function that takes two parameters: a Promise prototype that can be used to construct promises and a function representing any Node.js-style function (whose last parameter is a Node.js-style callback).
  • In the function, we construct (and return) a wrapper function that returns a promise.
  • We construct an adapter callback function, that invokes the Promise toolkit's reject() function in case of an error (with the corresponding error object provided by the callback), and resolve() in case of success. In case of success, it simply propagates any result object provided by the Node.js-style callback.
  • Finally, we invoke the Node.js-function with the given function parameters and our adapter callback.

With this function abstraction we can rewrite the earlier example as follows:

Task.findAll().then(function(tasks) {
    return prom2cb.promisify(Promise, fs.writeFile)("tasks.txt", JSON.stringify(tasks));
})
.then(function() {
    console.log("everything is OK");
}, function(err) {
    console.log("error: "+err);
});

As may be observed, we can convert the writeFile() Node.js-style function invocation into an invocation returning a promise, and nicely structure the find and write-file invocations by chaining then() invocations.
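One subtle detail is that the adapter invokes resolve(args), so the fulfilled value is an array containing the result parameters of the Node.js-style callback. For example, when wrapping fs.readFile() (a small usage sketch of the abstraction shown above), the file contents end up in the first array element:

var readFileP = prom2cb.promisify(Promise, fs.readFile);

readFileP("hello.txt").then(function(args) {
    console.log("File contents is: "+args[0]); /* args is an array: [ data ] */
}, function(err) {
    console.log("error: "+err);
});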

Conclusions


In this blog post, I have explored two kinds of asynchronous function invocation patterns: Node.js-style and promise-style. You may wonder which one I like the most.

I actually hate them both, but I consider promises to be the more powerful of the two because of their composability. However, this comes at a price of doing some extra work to construct them. The most ideal solution to me is still a facility that is part of the language, instead of "forgetting" about existing language constructs and replacing them by custom-made abstractions.

I have also explained that we may have to combine both patterns, which is often quite tedious. Fortunately, we can create function abstractions that convert one into another to ease the pain.

Related work


I am not the first one comparing the function invocation patterns described in this blog post. Parts of this blog post are inspired by a blog post titled: "Callbacks are imperative, promises are functional: Node’s biggest missed opportunity". In this blog post, a comparison between the two invocation styles is done from a programming language paradigm perspective, and is IMO quite interesting to read.

I am also not the first to implement conversion functions between these two styles. For example, promises constructed with the bluebird library implement a method called .asCallback() allowing a user to chain a Node.js-style callback to a promise. Similarly, it provides a function: Promise.promisify() to wrap a Node.js-style function into a function returning a promise.
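For example, with bluebird the two conversions could look roughly as follows (a sketch based on bluebird's documented API):

/* A rough sketch assuming the bluebird toolkit is used */
var Promise = require('bluebird');
var fs = require('fs');

/* Wrap a Node.js-style function into one returning a promise */
var readFileAsync = Promise.promisify(fs.readFile);

/* Chain a Node.js-style callback to a promise */
readFileAsync("hello.txt").asCallback(function(err, data) {
    if(err) {
        console.log("error: "+err);
    } else {
        console.log("File contents is: "+data);
    }
});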

However, the downside of bluebird is that these facilities can only be used if bluebird is used as a toolkit in an API. Some APIs use different toolkits or construct promises themselves. As explained earlier, Promises/A and Promises/A+ are just interface specifications and only the purpose of then() is defined, whereas the other facilities are extensions.

My function abstractions only make a few assumptions and should work with many implementations. Basically it only requires a proper .then() method (which should be obvious) and a new Promise(function(resolve, reject) { ... }) constructor.
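For example, the following sketch should work with the native ES6 Promise constructor found in modern Node.js versions and browsers, without any third-party promise library:

var fs = require('fs');
var prom2cb = require('prom2cb');

/* The native Promise constructor satisfies the two requirements:
 * a then() method and a new Promise(function(resolve, reject) { ... })
 * constructor.
 */
prom2cb.promisify(Promise, fs.writeFile)("hello.txt", "Hello world!")
.then(function() {
    console.log("everything is OK");
}, function(err) {
    console.log("error: "+err);
});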

Besides the two function invocation styles covered in this blog post, there are others as well. For example, Zef's blog post titled: "Callback-Free Harmonious Node.js" covers a mechanism called 'Thunks'. In this pattern, an asynchronous function returns a function, which can be invoked to retrieve the corresponding error or result at a later point in time.
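To illustrate the idea, a rough sketch (my own illustration, not Zef's actual implementation) of a thunk-returning counterpart of fs.readFile() could look like this:

var fs = require('fs');

/* A thunk-returning counterpart of fs.readFile(): the asynchronous
 * function returns a function that accepts the callback */
function readFileThunk(filename) {
    return function(callback) {
        fs.readFile(filename, callback);
    };
}

/* The thunk can be passed around and invoked later to obtain the
 * error or result */
var thunk = readFileThunk("hello.txt");

thunk(function(err, data) {
    if(err) {
        console.log("error: "+err);
    } else {
        console.log("File contents is: "+data);
    }
});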

References


The two conversion abstractions described in this blog post are part of a package called prom2cb. It can be obtained from my GitHub page and the NPM registry.

Wednesday, December 30, 2015

Fifth yearly blog reflection

Today, it's my blog's fifth anniversary. As usual, this is a nice opportunity to reflect over last year's writings.

Disnix


Something that I cannot leave unmentioned is Disnix, a toolset that I have developed as part of my master's and PhD research. For quite some time, its development was progressing at a very low pace, mainly because I had other obligations -- I had to finish my PhD thesis, and after I left academia, I was working on other things.

Fortunately, things have changed considerably. Since October last year I have been actively using Disnix to maintain the deployment of a production system that can be decomposed into independently deployable services. As a result, the development of Disnix also became much more progressive, which resulted in a large number of Disnix related blog posts and some major improvements.

In the first blog post, I compared Disnix with another tool from the Nix project: NixOps, described their differences and demonstrated that they can be combined to fully automate all deployment aspects of a service-oriented system. Shortly after publishing this blog post, I announced the next Disnix release: Disnix 0.3, 4 years after its previous release.

A few months later, I announced yet another Disnix release: Disnix 0.4 in which I have integrated the majority of state deployment facilities from the prototype described in the HotSWUp 2012 paper.

The remainder of blog posts provide solutions for additional problems and describe some optimizations. I have formulated a port assignment problem which may manifest itself while deploying microservices and developed a tool that can be used to provide a solution. I also modified Disnix to deploy target-specific services (in addition to target-agnostic services), which in some scenarios, make deployments more efficient.

Another optimization that I have developed is on-demand activation and self termination of services. This is particularly useful for poorly developed services that, for example, leak memory.

Finally, I attended NixCon 2015, where I gave a talk about Disnix (including two live demos) and showed how it can be used to deploy (micro)services. An interesting aspect of the presentation is the first live demo, in which I deploy a simple example system into a network of heterogeneous machines (machines running multiple operating systems, having multiple CPU architectures, and reachable by multiple connection protocols).

The Nix project


In addition to Disnix, I have also written about some general Nix aspects. In February, I visited FOSDEM. In this year's edition, we had a NixOS stand to promote the project (including its sub-projects). From my own personal experience, I know that advertising Nix is quite challenging. For this event, I crafted a sales pitch explanation recipe that worked quite well for me in most cases.

A blog post that I am particularly proud of is my evaluation and comparison of Snappy Ubuntu with Nix/NixOS, in which I describe the deployment properties of Snappy and compare how they conceptually relate to Nix/NixOS. It attracted a huge amount of visitors breaking my old monthly visitors record from three years ago!

I also wrote a tutorial blog post demonstrating how we can deploy prebuilt binaries with the Nix package manager. In some cases, packaging prebuilt software can be quite challenging, and the purpose of this blog post is to show a number of techniques that can be used to accomplish this.

Methodology


Besides deployment, I have also written two methodology-related blog posts. In the first blog post, I described my experiences with Agile software development and Scrum. Something that has been bothering me for quite a while is people claiming that "implementing" such a methodology considerably improves the development process and the quality of software.

In my opinion this is ridiculous! These methodologies provide some structure, but the "secret" lies in its undefined parts -- to be agile you should accept that nothing will completely go as planned, you should remain focussed, take small steps (not huge leaps), and most importantly: continuously adapt and improve. But no methodology provides a universally applicable recipe that makes you successful in doing it.

However, despite being critical, I think that implementing a methodology is not bad per se. In another blog post, I have described how I implemented a basic software configuration management process in a small organization.

Development


I have also reflected over my experiences while developing command-line utilities and wrote a blog post with some considerations I take into account.

Side projects and research


In my previous reflections, there was always a section dedicated to research and side projects. Unfortunately, this year there is not much to report about -- I made a number of small changes and additions to my side projects, but I did not make any significant advancements.

Probably the fact that Disnix became a main and side project contributes to that. Moreover, I also have other stuff to do that has nothing to do with software development or research. I hope that I can find more time next year to report about my other side projects, but I guess this is basically just a luxury problem. :-)

Blog posts


As with my previous annual blog reflections, I will also publish the top 10 of my most frequently read blog posts:

  1. On Nix and GNU Guix. As with the previous three blog reflections, this blog post remains on top. However, its popularity finally seems to be challenged by the number two!
  2. An evaluation and comparison of Snappy Ubuntu. This is the only blog post I have written this year that ended up in the overall top 10. It attracted a record number of visitors in one month and now rivals the number one in popularity.
  3. An alternative explanation of the Nix package manager. This was last year's number two and dropped to the third place, because of the Snappy Ubuntu blog post.
  4. Setting up a multi-user Nix installation on non-NixOS systems. This blog post was also in last year's top 10 but it seems to have become even more popular. I think this is probably caused by the fact that it is still hard to set up a multi-user installation.
  5. Managing private Nix packages outside the Nixpkgs tree. I wrote this blog for newcomers and observed that people keep frequently consulting it. As a consequence, it has entered the overall top 10.
  6. Asynchronous programming with JavaScript. This blog post was also in last year's top 10 and became slightly more popular. As a result, it moved to the 6th position.
  7. Yet another blog post about Object Oriented Programming and JavaScript. Another JavaScript related blog post that was in last year's top 10. It became slightly more popular and moved to the 7th place.
  8. Composing FHS-compatible chroot environments with Nix (or deploying Steam in NixOS). This blog post was the third most popular last year, but now seems to be not that interesting anymore.
  9. Setting up a Hydra build cluster for continuous integration and testing (part 1). Remains a popular blog post, but also considerably dropped in popularity compared to last year.
  10. Using Nix while doing development. A very popular blog post last year, but considerably dropped in popularity.

Conclusion


I am still not out of ideas yet, so stay tuned! The remaining thing I want to say is:

HAPPY NEW YEAR!!!!!!!!!!!

Thursday, December 3, 2015

On-demand service activation and self termination

I have written quite a few blog posts on service deployment with Disnix this year. The deployment mechanics that Disnix implements work quite well for my own purposes.

Unfortunately, having a relatively good deployment solution does not necessarily mean that a system functions well in a production environment -- there are also many other concerns that must be dealt with.

Another important concern of service-oriented systems is dealing with resource consumption, such as RAM, CPU and disk space. Obviously, services need them to accomplish something. However, since they are typically long running, they also consume resources even if they are not doing any work.

These problems could become quite severe if services have been poorly developed. For example, they may leak memory and never fully release the RAM they have allocated. As a result, an entire machine may eventually run out of memory. Moreover, "idle" services may degrade the performance of other services running on the same machine.

There are various ways to deal with resource problems:

  • The most obvious solution is buying bigger or additional hardware resources, but this typically increases the costs of maintaining a production environment. Moreover, it does not take the source of some of the problems away.
  • Another solution would be to fix and optimize problematic services, but this could be a time consuming and costly process, in particular when there is a high technical debt.
  • A third solution would be to support on-demand service activation and self termination -- a service gets activated the first time it is consulted and terminates itself after a period of idleness.

In this blog post, I will describe how to implement and deploy a system supporting the last solution.

To accomplish this goal, we need to modify the implementations of the services -- we must retrieve the listening socket from the host system's service manager (which activates the service when a client connects) and make the service terminate itself when the moment is right.

Furthermore, we need to adapt a service's deployment procedure to use these facilities.

Retrieving a socket from the host system's service manager


In many conventional deployment scenarios, the services themselves are responsible for creating the sockets to which clients can connect. However, this property conflicts with on-demand activation -- the socket must already exist before the process runs, so that the process can be started when a client connects.

We can use a service manager that supports socket activation to accomplish on-demand activation. There are various solutions supporting this property. The most prominently advertised solution is probably systemd, but there are other solutions that can do this as well, such as launchd, inetd, or xinetd, although the protocols that activated processes must implement differ.

In one of my toy example systems used for testing Disnix (the TCP proxy example) I used to do the following:

static int create_server_socket(int source_port)
{
    int sockfd, on = 1;
    struct sockaddr_in client_addr;
        
    /* Create socket */
    sockfd = socket(AF_INET, SOCK_STREAM, 0);
    if(sockfd < 0)
    {
        fprintf(stderr, "Error creating server socket!\n");
        return -1;
    }    

    /* Create address struct */
    memset(&client_addr, '\0', sizeof(client_addr));
    client_addr.sin_family = AF_INET;
    client_addr.sin_addr.s_addr = htonl(INADDR_ANY);
    client_addr.sin_port = htons(source_port);
        
    /* Set socket options to reuse the address */
    setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &on, 4);
      
    /* Bind the name (ip address) to the socket */
    if(bind(sockfd, (struct sockaddr *)&client_addr, sizeof(client_addr)) < 0)
        fprintf(stderr, "Error binding on port: %d, %s\n", source_port, strerror(errno));
        
    /* Listen for connections on the socket */
    if(listen(sockfd, 5) < 0)
        fprintf(stderr, "Error listening on port %d\n", source_port);

    /* Return the socket file descriptor */
    return sockfd;
}

The function listed above is responsible for creating a socket file descriptor, binding the socket to an IP address and TCP port, and listening for incoming connections.

To support on-demand activation, I need to modify this function to retrieve the server socket from the service manager. Systemd's socket activation protocol works by passing the socket as the third file descriptor to the process that it spawns. By adjusting the previously listed code into the following:

static int create_server_socket(int source_port)
{
    int sockfd, on = 1;

#ifdef SYSTEMD_SOCKET_ACTIVATION
    int n = sd_listen_fds(0);
    
    if(n > 1)
    {
        fprintf(stderr, "Too many file descriptors received!\n");
        return -1;
    }
    else if(n == 1)
        sockfd = SD_LISTEN_FDS_START + 0;
    else
    {
#endif
        struct sockaddr_in client_addr;
        
        /* Create socket */
        sockfd = socket(AF_INET, SOCK_STREAM, 0);
        if(sockfd < 0)
        {
            fprintf(stderr, "Error creating server socket!\n");
            return -1;
        }
        
        /* Create address struct */
        memset(&client_addr, '\0', sizeof(client_addr));
        client_addr.sin_family = AF_INET;
        client_addr.sin_addr.s_addr = htonl(INADDR_ANY);
        client_addr.sin_port = htons(source_port);
        
        /* Set socket options to reuse the address */
        setsockopt(sockfd, SOL_SOCKET, SO_REUSEADDR, &on, 4);
        
        /* Bind the name (ip address) to the socket */
        if(bind(sockfd, (struct sockaddr *)&client_addr, sizeof(client_addr)) < 0)
            fprintf(stderr, "Error binding on port: %d, %s\n", source_port, strerror(errno));
        
        /* Listen for connections on the socket */
        if(listen(sockfd, 5) < 0)
            fprintf(stderr, "Error listening on port %d\n", source_port);

#ifdef SYSTEMD_SOCKET_ACTIVATION
    }
#endif

    /* Return the socket file descriptor */
    return sockfd;
}

the server will use the socket that has been created by systemd (and passed as the third file descriptor). Moreover, if the server is started as a standalone process, it will revert to its old behaviour and allocate the server socket itself.

I have wrapped the systemd specific functionality inside a conditional preprocessor block so that it only gets included when I explicitly ask for it. The downside of supporting systemd's socket activation protocol is that we require some functionality that is exposed by a shared library that has been bundled with systemd. As systemd is Linux (and glibc) specific, it makes no sense to build a service with this functionality enabled on non-systemd based Linux distributions and non-Linux operating systems.

Besides conditionally including the code, I also made linking against the systemd library conditional in the Makefile:

CC = gcc

ifeq ($(SYSTEMD_SOCKET_ACTIVATION),1)
    EXTRA_BUILDFLAGS=-DSYSTEMD_SOCKET_ACTIVATION=1 $(shell pkg-config --cflags --libs libsystemd)
endif

all:
 $(CC) $(EXTRA_BUILDFLAGS) hello-world-server.c -o hello-world-server

...

so that the systemd-specific code block and library only get included if I run 'make' with socket activation explicitly enabled:

$ make SYSTEMD_SOCKET_ACTIVATION=1

Implementing self termination


As with on-demand activation, there is no way to do self termination generically and we must modify the service to support this property in some way.

In the TCP proxy example, I have implemented a simple approach using a counter (that is initially set to 0):

volatile unsigned int num_of_connections = 0;

For each client that connects to the server, we fork a child process that handles the connection. Each time we fork, we also increase the connection counter in the parent process:

while(TRUE)
{
    /* Create client socket if there is an incoming connection */
    if((client_sockfd = wait_for_connection(server_sockfd)) >= 0)
    {
        /* Fork a new process for each incoming client */
        pid_t pid = fork();
     
        if(pid == 0)
        {
            /* Handle the client's request and terminate
             * when it disconnects */
        }
        else if(pid == -1)
            fprintf(stderr, "Cannot fork connection handling process!\n");
#ifdef SELF_TERMINATION
        else
            num_of_connections++;
#endif
    }

    close(client_sockfd);
    client_sockfd = -1;
}

(As with socket activation, I have wrapped the termination functionality in a conditional preprocessor block -- it makes no sense to include this functionality into a service that cannot be activated on demand).

When a client disconnects, the process handling its connection terminates and sends a SIGCHLD signal to the parent. We can configure a signal handler for this type of signal as follows:

#ifdef SELF_TERMINATION
    signal(SIGCHLD, sigreap);
#endif

and use the corresponding signal handler function to decrease the counter and wait for the client process to terminate:

#ifdef SELF_TERMINATION

void sigreap(int sig)
{
    pid_t pid;
    int status;
    num_of_connections--;
    
    /* Event handler when a child terminates */
    signal(SIGCHLD, sigreap);
    
    /* Wait until all child processes terminate */
    while((pid = waitpid(-1, &status, WNOHANG)) > 0);

Finally, the server can terminate itself when the counter has reached 0 (which means that it is not handling any connections and the server has become idle):

    if(num_of_connections == 0)
        _exit(0);
}
#endif

Deploying services with on demand activation and self termination enabled


Besides implementing socket activation and self termination, we must also deploy the server with these features enabled. When using Disnix as a deployment system, we can write the following service expression to accomplish this:

{stdenv, pkgconfig, systemd}:
{port, enableSystemdSocketActivation ? false}:

let
  makeFlags = "PREFIX=$out port=${toString port}${stdenv.lib.optionalString enableSystemdSocketActivation " SYSTEMD_SOCKET_ACTIVATION=1"}";
in
stdenv.mkDerivation {
  name = "hello-world-server";
  src = ../../../services/hello-world-server;
  buildInputs = if enableSystemdSocketActivation then [ pkgconfig systemd ] else [];
  buildPhase = "make ${makeFlags}";
  installPhase = ''
    make ${makeFlags} install
    
    mkdir -p $out/etc
    cat > $out/etc/process_config <<EOF
    container_process=$out/bin/process
    EOF
    
    ${stdenv.lib.optionalString enableSystemdSocketActivation ''
      mkdir -p $out/etc
      cat > $out/etc/socket <<EOF
      [Unit]
      Description=Hello world server socket
      
      [Socket]
      ListenStream=${toString port}
      EOF
    ''}
  '';
}

In the expression shown above, we do the following:

  • We make the socket activation and self termination features configurable by exposing them through a function parameter (that defaults to false, disabling them).
  • If the socket activation parameter has been enabled, we pass the SYSTEMD_SOCKET_ACTIVATION=1 flag to 'make' so that these facilities are enabled in the build system.
  • We must also provide two extra dependencies: pkgconfig and systemd to allow the program to find the required library functions to retrieve the socket from systemd.
  • We also compose a systemd socket unit file that configures systemd on the target system to allocate a server socket that activates the process when a client connects to it.

Modifying Dysnomia modules to support socket activation


As explained in an older blog post, Disnix consults a plugin system called Dysnomia that takes care of executing various kinds of deployment activities, such as activating and deactivating services. The reason that a plugin system is used is that services can be any kind of deployment unit, with no generic activation procedure.

For services of the 'process' and 'wrapper' type, Dysnomia integrates with the host system's service manager. To support systemd's socket activation feature, we must modify the corresponding Dysnomia modules to start the socket unit instead of the service unit on activation. For example:

$ systemctl start disnix-53bb1pl...-hello-world-server.socket

starts the socket unit, which in turn starts the service unit with the same name when a client connects to it.

To deactivate the service, we must first stop the socket unit and then the service unit:

$ systemctl stop disnix-53bb1pl...-hello-world-server.socket
$ systemctl stop disnix-53bb1pl...-hello-world-server.service

Discussion


In this blog post, I have described an on-demand service activation and self termination approach using systemd, Disnix, and a number of code modifications. Some benefits of this approach are that we can save system resources such as RAM and CPU, improve the performance of non-idle services running on the same machine, and reduce the impact of poorly implemented services that (for example) leak memory.

There are also some disadvantages. For example, connecting to an inactive service introduces latency, in particular when a service has a slow start-up procedure, making the approach less suitable for systems that must remain responsive.

Moreover, it does not cope with potential disk space issues -- a non-running service still consumes disk space for storing its package dependencies and persistent state, such as databases.

Finally, there are some practical notes on the solutions described in the blog post. The self termination procedure in the example program terminates the server immediately after it has discovered that there are no active connections. In practice, it may be better to implement a timeout to prevent unnecessary latencies.
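
A minimal sketch of such a timeout, assuming the num_of_connections counter from the example server (the function names and the ten-second timeout are merely illustrative), could use alarm() to postpone the termination decision:

#include <signal.h>
#include <unistd.h>

#define IDLE_TIMEOUT 10 /* number of seconds to remain idle before terminating */

/* connection counter, maintained by the accept loop as in the example */
static volatile sig_atomic_t num_of_connections = 0;

static void sigalrm_handler(int sig)
{
    /* Only terminate if the server is still idle when the timer expires */
    if(num_of_connections == 0)
        _exit(0);
}

/* Invoke this instead of exiting immediately when the counter drops to 0;
   a newly accepted connection can cancel the pending timer with alarm(0) */
static void schedule_self_termination(void)
{
    signal(SIGALRM, sigalrm_handler);
    alarm(IDLE_TIMEOUT);
}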

Furthermore, I have only experimented with systemd's socket activation features. However, it is also possible to modify the Dysnomia modules to support different kinds of activation protocols, such as the ones provided by launchd, inetd or xinetd.
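
Supporting such a protocol typically also requires small changes on the program side. For comparison, under the classic inetd/xinetd protocol the already-accepted client connection is passed as standard input and output, so one process serves exactly one connection. A minimal hello world service in that style (not part of the example package) could look as follows:

#include <unistd.h>
#include <string.h>

/* inetd-style service: the connected client socket is stdin/stdout,
   so the request can be read from fd 0 and the reply written to fd 1 */
int main(void)
{
    char buf[256];
    ssize_t len = read(STDIN_FILENO, buf, sizeof(buf));

    if(len > 0 && strncmp(buf, "hello", 5) == 0)
        write(STDOUT_FILENO, "Hello world!\n", 13);

    return 0;
}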

The TCP proxy example uses C as an implementation language, but systemd's socket activation protocol is not limited to C programs. For instance, an example program on GitHub demonstrates how a Python program running an embedded HTTP server can be activated with systemd's socket activation mechanism.

References


I have modified the development version of Dysnomia to support the socket activation feature of systemd. Moreover, I have extended the TCP proxy example package with a sub example that implements the on-demand activation and self termination approach described in this blog post.

Both packages can be obtained from my GitHub page.

Monday, November 23, 2015

Deploying services to a heterogeneous network of machines with Disnix


Last week I was in Berlin to visit the first official Nix conference: NixCon 2015. Besides just being there, I have also given a talk about deploying (micro)services with Disnix.

In my talk, I elaborated on various aspects, such as microservices in general, their implications (such as increased operational complexity), the concepts of Disnix, and a number of examples including a real-life usage scenario.

I have given two live demos in the talk. The first demo is IMHO quite interesting, because it shows the full potential of Disnix when you have to deal with many heterogeneous traits of service-oriented systems and their environments -- we deploy services to a network of machines running multiple kinds of operating systems and CPU architectures, which must be reached through multiple connection protocols (e.g. SSH and SOAP/HTTP).

Furthermore, I consider it a nice example that should be relatively straightforward for others to repeat. The good parts of the example are that it is small (only two services that communicate through a TCP socket) and that it imposes no specific requirements on the target systems, such as infrastructure components (e.g. a DBMS or application server) that must be preinstalled first.

In this blog post, I will describe what I did to set up the machines and I will explain how to repeat the example deployment scenarios shown in the presentation.

Configuring the target machines


Despite being a simple example, the thing that makes repeating the demo hard is that Disnix expects the target machines to be present already, running the Nix package manager and the Disnix service that is responsible for executing deployment steps remotely.

For the demo, I manually instantiated three VirtualBox VMs. Moreover, I installed their configurations manually as well, which took me quite a bit of effort.

Instantiating the VMs


For instantiation of the VirtualBox VMs, most of the standard settings were sufficient -- I simply provided the operating system type and CPU architecture to VirtualBox and used the recommended disk and RAM settings that VirtualBox provided me.

The only modification I have made to the VM configurations is adding an additional network interface. The first network interface is used to connect to the host machine and the internet (with the host machine being the gateway). The second interface is used to allow the host machine to connect to any VM belonging to the same private subnet.

To configure the second network interface, I right-click on the corresponding VM, pick the 'Network' option and open the 'Adapter 2' tab. In this tab, I enable the adapter and attach it to a host-only network, so that the host machine can reach the VMs on the private subnet.


Installing the operating systems


For the Kubuntu and Windows 7 machine, I have just followed their standard installation procedures. For the NixOS machine, I have used the following NixOS configuration file:

{ pkgs, ... }:

{
  boot.loader.grub.device = "/dev/sda";
  fileSystems = {
    "/" = { label = "root"; };
  };
  networking.firewall.enable = false;
  services.openssh.enable = true;
  services.tomcat.enable = true;
  services.disnix.enable = true;
  services.disnix.useWebServiceInterface = true;
  
  environment.systemPackages = [ pkgs.mc ];
}

The above configuration file captures a machine configuration providing OpenSSH, Apache Tomcat (for hosting the web service interface) and the Disnix service with the web service interface enabled.

Configuring SSH


The Kubuntu and Windows 7 machines require an OpenSSH server to be running so that deployment operations can be executed from a remote location.

I ran the following command-line instruction to enable the OpenSSH server on Kubuntu:

$ sudo apt-get install openssh-server

On the Windows 7 machine, I ran the following command in a Cygwin terminal to configure the OpenSSH server:

$ ssh-host-config

One of the things the above script does is setting up a Windows service that runs the SSH daemon. It can be started by opening the 'Control Panel -> System and Security -> Administrative Tools -> Services', right clicking on 'CYGWIN sshd' and then selecting 'Start'.

Setting up user accounts


We need to set up specialized user accounts to allow the coordinator machine to connect to the target machines. By default, the coordinator machine connects as the same user that carries out the deployment process. I have configured all three VMs to have a user account named 'sander'.

To prevent the SSH client from asking for a password for each request, we must set up a pair of public-private SSH keys. This can be done by running:

$ ssh-keygen

After generating the key pair, we must upload the public key (~/.ssh/id_rsa.pub) to all target machines in the network and configure them so that the key can be used. Basically, we need to append it to each machine's authorized_keys file and set the correct file permissions:

$ mkdir -p ~/.ssh
$ chmod 700 ~/.ssh
$ cat id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys

Installing Nix, Dysnomia and Disnix


The next step is installing the required deployment tools on the host machines. For the NixOS machine all required tools have been installed as part of the system configuration, so no additional installation steps are required. For the other machines we must manually install Nix, Dysnomia and Disnix.

On the Kubuntu machine, I first did a single user installation of the Nix package manager under my own user account:

$ curl https://nixos.org/nix/install | sh

After installing Nix, I deployed Dysnomia from Nixpkgs. The following command-line instruction configures Dysnomia to use the direct activation mechanism for processes:

$ nix-env -i $(nix-build -E 'with import <nixpkgs> {}; dysnomia.override { jobTemplate = "direct"; }')

Installing Disnix can be done as follows:

$ nix-env -f '<nixpkgs>' -iA disnix

We must perform a few additional steps to get the Disnix service running. The following command copies the Disnix DBus configuration file, which allows the service to run on the system bus and grants permissions to the appropriate class of users:

$ sudo cp /nix/var/nix/profiles/default/etc/dbus-1/system.d/disnix.conf \
    /etc/dbus-1/system.d

Then I manually edit /etc/dbus-1/system.d/disnix.conf and change the line:
<policy user="root">

into:

<policy user="sander">

to allow the Disnix service to run under my own personal user account (that has a single user Nix installation).

We also need an init.d script that starts the Disnix service on startup. The Disnix distribution includes a Debian-compatible init.d script that can be installed as follows:

$ sudo cp /nix/var/nix/profiles/default/share/doc/disnix/disnix-service.initd /etc/init.d/disnix-service
$ sudo ln -s ../init.d/disnix-service /etc/rc2.d/S06disnix-service
$ sudo ln -s ../init.d/disnix-service /etc/rc3.d/S06disnix-service
$ sudo ln -s ../init.d/disnix-service /etc/rc4.d/S06disnix-service
$ sudo ln -s ../init.d/disnix-service /etc/rc5.d/S06disnix-service

The script has been configured to run the service under my user account, because it contains the following line:

DAEMONUSER=sander

The username should correspond to the user under which the Nix package manager has been installed.

After executing the previous steps, the DBus daemon needs to be restarted so that it can use the Disnix configuration. Since DBus is a critical system service, it is probably more convenient to just reboot the entire machine. After rebooting, the Disnix service should be activated on startup.

Installing the same packages on the Windows/Cygwin machine is much more tricky -- there is no installer provided for the Nix package manager on Cygwin, so we need to compile it from source. I installed the following Cygwin packages to make source installations of all required packages possible:

curl
patch
perl
libbz2-devel
sqlite3
make
gcc-g++
pkg-config
libsqlite3-devel
libcurl-devel
openssl-devel
libcrypt-devel
libdbus1-devel
libdbus-glib-1-devel
libxml2-devel
libxslt-devel
dbus
openssh

Besides the above Cygwin packages, we also need to install a number of Perl packages from CPAN. I opened a Cygwin terminal in administrator mode (right click, run as: Administrator) and ran the following commands:

$ perl -MCPAN -e shell
install DBD::SQLite
install WWW::Curl

Then I installed the Nix package manager by obtaining the source tarball and running:

tar xfv nix-1.10.tar.xz
cd nix-1.10
./configure
make
make install

I installed Dysnomia by obtaining the source tarball and running:

tar xfv dysnomia-0.5pre1234.tar.gz
cd dysnomia-0.5pre1234
./configure --with-job-template=direct
make
make install

And Disnix by running:

tar xfv disnix-0.5pre1234.tar.gz
cd disnix-0.5pre1234
./configure
make
make install

As with the Kubuntu machine, we must provide a service configuration file for DBus allowing the Disnix service to run on the system bus:

$ cp /nix/var/nix/profiles/default/etc/dbus-1/system.d/disnix.conf \
    /etc/dbus-1/system.d

Also, I have to manually edit /etc/dbus-1/system.d/disnix.conf and change the line:
<policy user="root">

into:

<policy user="sander">

to allow operations to be executed under my own less privileged user account.

To run the Disnix service, we must define two Windows services. The following command-line instruction creates a Windows service for DBus:

$ cygrunsrv -I dbus -p /usr/bin/dbus-daemon.exe \
    -a '--system --nofork'

The following command-line instruction creates a Disnix service running under my own user account:

$ cygrunsrv -I disnix -p /usr/local/bin/disnix-service.exe \
  -e 'PATH=/bin:/usr/bin:/usr/local/bin' \
  -y dbus -u sander

In order to make the Windows service work, the user account requires the right to log on as a service. To check whether this right has been granted, we can run:

$ editrights -u sander -l

which should list SeServiceLogonRight. If this is not the case, this permission can be granted by running:

$ editrights -u sander -a SeServiceLogonRight

Finally, we must start the Disnix service. This can be done by opening the services configuration screen (Control Panel -> System and Security -> Administrative Tools -> Services), right clicking on: 'disnix' and selecting: 'Start'.

Deploying the example scenarios


After deploying the virtual machines and their configurations, we can start doing some deployment experiments with the Disnix TCP proxy example. The Disnix deployment models can be found in the deployment/DistributedDeployment sub folder:

$ cd deployment/DistributedDeployment

Before we can do any deployment, we must write an infrastructure model (infrastructure.nix) reflecting the configuration properties of the machines that we have deployed previously:

{
  test1 = { # x86 Linux machine (Kubuntu) reachable with SSH
    hostname = "192.168.56.101";
    system = "i686-linux";
    targetProperty = "hostname";
    clientInterface = "disnix-ssh-client";
  };
  
  test2 = { # x86-64 Linux machine (NixOS) reachable with SOAP/HTTP
    hostname = "192.168.56.102";
    system = "x86_64-linux";
    targetEPR = http://192.168.56.102:8080/DisnixWebService/services/DisnixWebService;
    targetProperty = "targetEPR";
    clientInterface = "disnix-soap-client";
  };

  test3 = { # x86-64 Windows machine (Windows 7) reachable with SSH
    hostname = "192.168.56.103";
    system = "x86_64-cygwin";
    targetProperty = "hostname";
    clientInterface = "disnix-ssh-client";
  };
}

and write a distribution model (distribution.nix) to reflect the initial deployment scenario shown in the presentation:

{infrastructure}:

{
  hello_world_server = [ infrastructure.test2 ];
  hello_world_client = [ infrastructure.test1 ];
}

Now we can deploy the system by running:

$ disnix-env -s services-without-proxy.nix \
  -i infrastructure.nix -d distribution.nix

If we open a terminal on the Kubuntu machine, we should be able to run the client:

$ /nix/var/nix/profiles/disnix/default/bin/hello-world-client

When we type: 'hello' the client should respond by saying: 'Hello world!'. The client can be exited by typing: 'quit'.

We can also deploy a second client instance by changing the distribution model:

{infrastructure}:

{
  hello_world_server = [ infrastructure.test2 ];
  hello_world_client = [ infrastructure.test1 infrastructure.test3 ];
}

and running the same command-line instruction again:

$ disnix-env -s services-without-proxy.nix \
  -i infrastructure.nix -d distribution.nix

After the redeployment has completed, we should also be able to start the second client instance (on the Windows machine); it connects to the same server instance running on the second test machine (the NixOS machine).

Another thing we could do is moving the server to the Windows machine:

{infrastructure}:

{
  hello_world_server = [ infrastructure.test3 ];
  hello_world_client = [ infrastructure.test1 infrastructure.test3 ];
}

However, running the following command:

$ disnix-env -s services-without-proxy.nix \
  -i infrastructure.nix -d distribution.nix

probably leads to a build error, because the host machine (that runs Linux) is unable to build packages for Cygwin. Fortunately, this problem can be solved by enabling building on the target machines:

$ disnix-env -s services-without-proxy.nix \
  -i infrastructure.nix -d distribution.nix \
  --build-on-targets

After deploying the new configuration, you will observe that the clients have been disconnected. You can restart any of the clients to observe that they have been reconfigured to connect to the new server instance that has been deployed to the Windows machine.

Discussion


In this blog post, I have described how to set up and repeat the heterogeneous network deployment scenario that I have shown in my presentation. Despite being a simple example, the thing that makes repeating it difficult is that the machines must be deployed first, a process that is not automated by Disnix. (As a sidenote: with the DisnixOS extension we can automate the deployment of machines as well, but this does not work with a network of non-NixOS machines, such as Windows installations).

Additionally, the fact that there is no installer (or official support) for the Nix deployment tools on other platforms than Linux and Mac OS X makes it even more difficult. (Fortunately, compiling from source on Cygwin should work and there are also some ongoing efforts to revive FreeBSD support).

To alleviate some of these issues, I have improved the Disnix documentation a bit to explain how to work with single user Nix installations on non-NixOS platforms and included the Debian init.d script in the Disnix distribution as an example. These changes have been integrated into the current development version of Disnix.

I am also considering writing a simple infrastructure model generator for static deployment purposes (a more advanced prototype already exists in the Dynamic Disnix toolset) and include it with the basic Disnix toolset to avoid some repetition while deploying target machines manually.

References


I have published the slides of my talk on SlideShare.

Furthermore, the recordings of the NixCon 2015 talks are also online.

Thursday, October 29, 2015

Deploying prebuilt binary software with the Nix package manager

As described in a number of older blog posts, Nix is primarily a source based package manager -- it constructs packages from source code by executing their build procedures in isolated environments in which only specified dependencies can be found.

As an optimization, it provides transparent binary deployment -- if a package that has been built from the same inputs exists elsewhere, it can be downloaded from that location instead of being built from source improving the efficiency of deployment processes.

Because Nix is a source based package manager, the documentation mainly describes how to build packages from source code. Moreover, the Nix expressions are written in such a way that they can be included in the Nixpkgs collection, a repository containing build recipes for more than 2500 packages.

Although the manual contains some basic packaging instructions, I noticed that a few practical bits were missing. For example, how to package software privately, outside the Nixpkgs tree, is not clearly described, which makes experimentation a bit less convenient, in particular for newbies.

Despite being a source package manager, Nix can also be used to deploy binary software packages (i.e. software for which no source code and build scripts have been provided). Unfortunately, getting prebuilt binaries to run properly is quite tricky. Furthermore, apart from some references, there are no examples in the manual describing how to do this either.

Since I am receiving too many questions about this lately, I have decided to write a blog post about it covering two examples that should be relatively simple to repeat.

Why prebuilt binaries will typically not work


Prebuilt binaries deployed by Nix typically do not work out of the box. For example, if we want to deploy a simple binary package such as pngout (containing only a set of ELF executables), we may initially think that copying the executable into the Nix store suffices:

with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "pngout-20130221";

  src = fetchurl {
    url = http://static.jonof.id.au/dl/kenutils/pngout-20130221-linux.tar.gz;
    sha256 = "1qdzmgx7si9zr7wjdj8fgf5dqmmqw4zg19ypg0pdz7521ns5xbvi";
  };

  installPhase = ''
    mkdir -p $out/bin
    cp x86_64/pngout $out/bin
  '';
}

However, when we build the above package:

$ nix-build pngout.nix

and attempt to run the executable, we stumble upon the following error:

$ ./result/bin/pngout
bash: ./result/bin/pngout: No such file or directory

The above error is quite strange -- the corresponding file resides in exactly the specified location yet it appears that it cannot be found!

The actual problem is not that the executable is missing, but that one of its dependencies is. Every ELF executable that uses shared libraries consults the dynamic linker/loader (which typically resides in /lib/ld-linux.so.2 on x86 Linux platforms and /lib64/ld-linux-x86-64.so.2 on x86-64 Linux platforms) to provide the shared libraries it needs. This path is hardwired into the ELF executable, as can be observed by running:

$ readelf -l ./result/bin/pngout 

Elf file type is EXEC (Executable file)
Entry point 0x401160
There are 8 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001c0 0x00000000000001c0  R E    8
  INTERP         0x0000000000000200 0x0000000000400200 0x0000000000400200
                 0x000000000000001c 0x000000000000001c  R      1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x000000000001593c 0x000000000001593c  R E    200000
  LOAD           0x0000000000015940 0x0000000000615940 0x0000000000615940
                 0x00000000000005b4 0x00000000014f9018  RW     200000
  DYNAMIC        0x0000000000015968 0x0000000000615968 0x0000000000615968
                 0x00000000000001b0 0x00000000000001b0  RW     8
  NOTE           0x000000000000021c 0x000000000040021c 0x000000000040021c
                 0x0000000000000044 0x0000000000000044  R      4
  GNU_EH_FRAME   0x0000000000014e5c 0x0000000000414e5c 0x0000000000414e5c
                 0x00000000000001fc 0x00000000000001fc  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     8

In NixOS, most parts of the system are stored in a special purpose directory called the Nix store (i.e. /nix/store) including the dynamic linker. As a consequence, the dynamic linker cannot be found because it resides elsewhere.

Another reason why most binaries will not work is that they must know where to find their required shared libraries. In most conventional Linux distributions, these reside in global directories (e.g. /lib and /usr/lib). In NixOS, these folders do not exist. Instead, every package is stored in isolation in a separate folder in the Nix store.

Why compilation from source works


In contrast to prebuilt ELF binaries, binaries produced by a source build in a Nix build environment typically work out of the box (i.e. they often do not require any special modifications in the build procedure). So why is that?

The "secret" is that the linker (that gets invoked by the compiler) has been wrapped in the Nix build environment -- if we invoke ld, then we actually end up using a wrapper: ld-wrapper that does a number of additional things besides the tasks the linker normally carries out.

Whenever we supply a library to link to, the wrapper appends an -rpath parameter providing its location. Furthermore, it appends the path to the dynamic linker/loader (-dynamic-linker) so that the resulting executable can load the shared libraries on startup.

For example, when producing an executable, the compiler may invoke the following command that links a library to a piece of object code:

$ ld test.o -lz -o test

in reality, ld has been wrapped and executes something like this:

$ ld test.o -lz \
  -rpath /nix/store/31w31mc8i...-zlib-1.2.8/lib \
  -dynamic-linker \
    /nix/store/hd6km3hscb...-glibc-2.21/lib/ld-linux-x86-64.so.2 \
  ...
  -o test

As may be observed, the wrapper transparently appends the path to zlib as an RPATH parameter and provides the path to the dynamic linker.

The RPATH attribute is basically a colon separated string of paths in which the dynamic linker looks for its shared dependencies. The RPATH is hardwired into an ELF binary.

Consider the following simple C program (test.c) that displays the version of the zlib library that it links against:

#include <stdio.h>
#include <zlib.h>

int main()
{
    printf("zlib version is: %s\n", ZLIB_VERSION);
    return 0;
}

With the following Nix expression we can compile an executable from it and link it against the zlib library:

with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "test";
  buildInputs = [ zlib ];
  buildCommand = ''
    gcc ${./test.c} -lz -o test
    mkdir -p $out/bin
    cp test $out/bin
  '';
}

When we build the above package:

$ nix-build test.nix

and inspect the program headers of the ELF binary, we can observe that the dynamic linker (program interpreter) corresponds to an instance residing in the Nix store:

$ readelf -l ./result/bin/test 

Elf file type is EXEC (Executable file)
Entry point 0x400680
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000001f8 0x00000000000001f8  R E    8
  INTERP         0x0000000000000238 0x0000000000400238 0x0000000000400238
                 0x0000000000000050 0x0000000000000050  R      1
      [Requesting program interpreter: /nix/store/hd6km3hs...-glibc-2.21/lib/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x000000000000096c 0x000000000000096c  R E    200000
  LOAD           0x0000000000000970 0x0000000000600970 0x0000000000600970
                 0x0000000000000260 0x0000000000000268  RW     200000
  DYNAMIC        0x0000000000000988 0x0000000000600988 0x0000000000600988
                 0x0000000000000200 0x0000000000000200  RW     8
  NOTE           0x0000000000000288 0x0000000000400288 0x0000000000400288
                 0x0000000000000020 0x0000000000000020  R      4
  GNU_EH_FRAME   0x0000000000000840 0x0000000000400840 0x0000000000400840
                 0x0000000000000034 0x0000000000000034  R      4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     8
  PAX_FLAGS      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000         8

Furthermore, if we inspect the dynamic section of the binary, we will see that an RPATH attribute has been hardwired into it providing a collection of library paths (including the path to zlib):

$ readelf -d ./result/bin/test 

Dynamic section at offset 0x988 contains 27 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libz.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000f (RPATH)              Library rpath: [
/nix/store/8w39iz6sp...-test/lib64:
/nix/store/8w39iz6sp...-test/lib:
/nix/store/i9nn1fkcy...-gcc-4.9.3/libexec/gcc/x86_64-unknown-linux-gnu/4.9.3:
/nix/store/31w31mc8i...-zlib-1.2.8/lib:
/nix/store/hd6km3hsc...-glibc-2.21/lib:
/nix/store/i9nn1fkcy...-gcc-4.9.3/lib]
 0x000000000000001d (RUNPATH)            Library runpath: [
/nix/store/8w39iz6sp...-test/lib64:
/nix/store/8w39iz6sp...-test/lib:
/nix/store/i9nn1fkcy...-gcc-4.9.3/libexec/gcc/x86_64-unknown-linux-gnu/4.9.3:
/nix/store/31w31mc8i...-zlib-1.2.8/lib:
/nix/store/hd6km3hsc...-glibc-2.21/lib:
/nix/store/i9nn1fkcy...-gcc-4.9.3/lib]
 0x000000000000000c (INIT)               0x400620
 0x000000000000000d (FINI)               0x400814
 0x0000000000000019 (INIT_ARRAY)         0x600970
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x600978
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x0000000000000004 (HASH)               0x4002a8
 0x0000000000000005 (STRTAB)             0x400380
 0x0000000000000006 (SYMTAB)             0x4002d8
 0x000000000000000a (STRSZ)              528 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x600b90
 0x0000000000000002 (PLTRELSZ)           72 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x4005d8
 0x0000000000000007 (RELA)               0x4005c0
 0x0000000000000008 (RELASZ)             24 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0x4005a0
 0x000000006fffffff (VERNEEDNUM)         1
 0x000000006ffffff0 (VERSYM)             0x400590
 0x0000000000000000 (NULL)               0x0

As a result, the program works as expected:

$ ./result/bin/test 
zlib version is: 1.2.8

Patching existing ELF binaries


To summarize, the reason why ELF binaries produced in a Nix build environment work is that they refer to the correct path of the dynamic linker and have an RPATH value that refers to the paths of the shared libraries that they need.

Fortunately, we can accomplish the same thing with prebuilt binaries by using the PatchELF tool. With PatchELF we can patch existing ELF binaries to have a different dynamic linker and RPATH.

Running the following instruction in a Nix expression allows us to change the dynamic linker of the pngout executable shown earlier:

$ patchelf --set-interpreter \
    ${stdenv.glibc}/lib/ld-linux-x86-64.so.2 $out/bin/pngout

By inspecting the dynamic section of a binary, we can find out what shared libraries it requires:

$ readelf -d ./result/bin/pngout

Dynamic section at offset 0x15968 contains 22 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libm.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x400ea8
 0x000000000000000d (FINI)               0x413a78
 0x0000000000000004 (HASH)               0x400260
 0x000000006ffffef5 (GNU_HASH)           0x4003b8
 0x0000000000000005 (STRTAB)             0x400850
 0x0000000000000006 (SYMTAB)             0x4003e8
 0x000000000000000a (STRSZ)              379 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x615b20
 0x0000000000000002 (PLTRELSZ)           984 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x400ad0
 0x0000000000000007 (RELA)               0x400a70
 0x0000000000000008 (RELASZ)             96 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffe (VERNEED)            0x400a30
 0x000000006fffffff (VERNEEDNUM)         2
 0x000000006ffffff0 (VERSYM)             0x4009cc
 0x0000000000000000 (NULL)               0x0

According to the information listed above, two libraries are required (libm.so.6 and libc.so.6) which can be provided by the glibc package. We can change the executable's RPATH in the Nix expression as follows:

$ patchelf --set-rpath ${stdenv.glibc}/lib $out/bin/pngout

We can write a revised Nix expression for pngout (taking patching into account) that looks as follows:

with import <nixpkgs> {};

stdenv.mkDerivation {
  name = "pngout-20130221";

  src = fetchurl {
    url = http://static.jonof.id.au/dl/kenutils/pngout-20130221-linux.tar.gz;
    sha256 = "1qdzmgx7si9zr7wjdj8fgf5dqmmqw4zg19ypg0pdz7521ns5xbvi";
  };

  installPhase = ''
    mkdir -p $out/bin
    cp x86_64/pngout $out/bin
    patchelf --set-interpreter \
        ${stdenv.glibc}/lib/ld-linux-x86-64.so.2 $out/bin/pngout
    patchelf --set-rpath ${stdenv.glibc}/lib $out/bin/pngout
  '';
}

When we build the expression:

$ nix-build pngout.nix

and try to run the executable:

$ ./result/bin/pngout 
PNGOUT [In:{PNG,JPG,GIF,TGA,PCX,BMP}] (Out:PNG) (options...)
by Ken Silverman (http://advsys.net/ken)
Linux port by Jonathon Fowler (http://www.jonof.id.au/pngout)

We will see that the executable works as expected!

A more complex example: Quake 4 demo


The pngout example shown earlier is quite simple, as it is just a tarball with a single executable that must be installed and patched. Now that we are familiar with some basic concepts -- how should we approach a more complex prebuilt package, such as a computer game like the Quake 4 demo?

When we download the Quake 4 demo installer for Linux, we actually get a Loki setup tools based installer that is a self-extracting shell script executing an installer program.

Unfortunately, we cannot use this installer program in NixOS for two reasons. First, the installer executes (prebuilt) executables that will not work. Second, to use the full potential of NixOS, it is better to deploy packages with Nix in isolation in the Nix store.

Fortunately, running the installer with the --help parameter reveals that it is also possible to extract its contents without running the installer:

$ bash ./quake4-linux-1.0-demo.x86.run --noexec --keep

After executing the above command-line instruction, we can find the extracted files in the ./quake4-linux-1.0-demo folder in the current working directory.

The next step is figuring out where the game files reside and which binaries need to be patched. A rough inspection of the extracted folder:

$ cd quake4-linux-1.0-demo
$ ls
bin
Docs
License.txt
openurl.sh
q4base
q4icon.bmp
README
setup.data
setup.sh
version.info

reveals that the files of the installer (./setup.data) and the game are intermixed with each other. Some files are required to run the game, but others, such as the setup files (e.g. the ones residing in setup.data/), are unnecessary.

Running the following command helps me to figure out which ELF binaries we may have to patch:

$ file $(find . -type f)         
./Docs/QUAKE4_demo_readme.txt:     Little-endian UTF-16 Unicode text, with CRLF line terminators
./bin/Linux/x86/libstdc++.so.5:    ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, not stripped
./bin/Linux/x86/quake4-demo:       POSIX shell script, ASCII text executable
./bin/Linux/x86/quake4.x86:        ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.0.30, stripped
./bin/Linux/x86/quake4-demoded:    POSIX shell script, ASCII text executable
./bin/Linux/x86/libgcc_s.so.1:     ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV), dynamically linked, not stripped
./bin/Linux/x86/q4ded.x86:         ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.0.30, stripped
./README:                          ASCII text
./version.info:                    ASCII text
./q4base/game100.pk4:              Zip archive data, at least v2.0 to extract
./q4base/mapcycle.scriptcfg:       ASCII text, with CRLF line terminators
./q4base/game000.pk4:              Zip archive data, at least v1.0 to extract
./License.txt:                     ISO-8859 text, with very long lines
./openurl.sh:                      POSIX shell script, ASCII text executable
./q4icon.bmp:                      PC bitmap, Windows 3.x format, 48 x 48 x 24
...

As we can see in the output, the ./bin/Linux/x86 sub folder contains a number of ELF executables and shared libraries that most likely require patching.

As with the previous example (pngout), we can use readelf to inspect what libraries the ELF executables require. The first executable q4ded.x86 has the following dynamic section:

$ cd ./bin/Linux/x86
$ readelf -d q4ded.x86 

Dynamic section at offset 0x366220 contains 25 entries:
  Tag        Type                         Name/Value
 0x00000001 (NEEDED)                     Shared library: [libpthread.so.0]
 0x00000001 (NEEDED)                     Shared library: [libdl.so.2]
 0x00000001 (NEEDED)                     Shared library: [libstdc++.so.5]
 0x00000001 (NEEDED)                     Shared library: [libm.so.6]
 0x00000001 (NEEDED)                     Shared library: [libgcc_s.so.1]
 0x00000001 (NEEDED)                     Shared library: [libc.so.6]
...

According to the above information, the executable requires a couple of libraries that seem to be stored in the same package (in the same folder to be precise): libstdc++.so.5 and libgcc_s.so.1.

Furthermore, it also requires a number of libraries that are not in the same folder. These missing libraries must be provided by external packages. I know from experience that the remaining libraries: libpthread.so.0, libdl.so.2, libm.so.6, libc.so.6, are provided by the glibc package.

The other ELF executable has the following library references:

$ readelf -d ./quake4.x86 

Dynamic section at offset 0x3779ec contains 29 entries:
  Tag        Type                         Name/Value
 0x00000001 (NEEDED)                     Shared library: [libSDL-1.2.so.0]
 0x00000001 (NEEDED)                     Shared library: [libpthread.so.0]
 0x00000001 (NEEDED)                     Shared library: [libdl.so.2]
 0x00000001 (NEEDED)                     Shared library: [libstdc++.so.5]
 0x00000001 (NEEDED)                     Shared library: [libm.so.6]
 0x00000001 (NEEDED)                     Shared library: [libgcc_s.so.1]
 0x00000001 (NEEDED)                     Shared library: [libc.so.6]
 0x00000001 (NEEDED)                     Shared library: [libX11.so.6]
 0x00000001 (NEEDED)                     Shared library: [libXext.so.6]
...

This executable has a number of dependencies that are identical to those of the previous executable. Additionally, it requires libSDL-1.2.so.0, which can be provided by SDL, libX11.so.6 by libX11, and libXext.so.6 by libXext.

Besides the executables, the shared libraries bundled with the package may also have dependencies on shared libraries. We need to inspect and fix these as well.

Inspecting the dynamic section of libgcc_s.so.1 reveals the following:

$ readelf -d ./libgcc_s.so.1 

Dynamic section at offset 0x7190 contains 23 entries:
  Tag        Type                         Name/Value
 0x00000001 (NEEDED)                     Shared library: [libc.so.6]
...

The above library has a dependency on libc.so.6, which can be provided by glibc.

The remaining library (libstdc++.so.5) has the following dependencies:

$ readelf -d ./libstdc++.so.5 

Dynamic section at offset 0xadd8c contains 25 entries:
  Tag        Type                         Name/Value
 0x00000001 (NEEDED)                     Shared library: [libm.so.6]
 0x00000001 (NEEDED)                     Shared library: [libgcc_s.so.1]
 0x00000001 (NEEDED)                     Shared library: [libc.so.6]
...

It seems to depend on libgcc_s.so.1 residing in the same folder. Similar to the previous binaries, libm.so.6 and libc.so.6 can be provided by glibc.

With the gathered information so far, we can write the following Nix expression that we can use as a first attempt to run the game:

with import <nixpkgs> { system = "i686-linux"; };

stdenv.mkDerivation {
  name = "quake4-demo-1.0";
  src = fetchurl {
    url = ftp://ftp.idsoftware.com/idstuff/quake4/demo/quake4-linux-1.0-demo.x86.run;
    sha256 = "0wxw2iw84x92qxjbl2kp5rn52p6k8kr67p4qrimlkl9dna69xrk9";
  };
  buildCommand = ''
    # Extract files from the installer
    cp $src quake4-linux-1.0-demo.x86.run
    bash ./quake4-linux-1.0-demo.x86.run --noexec --keep
    
    # Move extracted files into the Nix store
    mkdir -p $out/libexec
    mv quake4-linux-1.0-demo $out/libexec
    cd $out/libexec/quake4-linux-1.0-demo
    
    # Remove obsolete setup files
    rm -rf setup.data
    
    # Patch ELF binaries
    cd bin/Linux/x86
    patchelf --set-interpreter ${stdenv.cc.libc}/lib/ld-linux.so.2 ./quake4.x86
    patchelf --set-rpath $(pwd):${stdenv.cc.libc}/lib:${SDL}/lib:${xlibs.libX11}/lib:${xlibs.libXext}/lib ./quake4.x86
    chmod +x ./quake4.x86
    
    patchelf --set-interpreter ${stdenv.cc.libc}/lib/ld-linux.so.2 ./q4ded.x86
    patchelf --set-rpath $(pwd):${stdenv.cc.libc}/lib ./q4ded.x86
    chmod +x ./q4ded.x86
    
    patchelf --set-rpath ${stdenv.cc.libc}/lib ./libgcc_s.so.1
    patchelf --set-rpath $(pwd):${stdenv.cc.libc}/lib ./libstdc++.so.5
  '';
}

In the above Nix expression, we do the following:

  • We import the Nixpkgs collection so that we can provide the external dependencies that the package needs. Because the executables are 32-bit x86 binaries, we need to refer to packages built for the i686-linux architecture.
  • We download the Quake 4 demo installer from Id software's FTP server.
  • We automate the steps we have done earlier -- we extract the files from the installer, move them into the Nix store, prune the obsolete setup files, and finally patch the ELF executables and libraries with the paths to the dependencies that we have discovered in our investigation.

We should now be able to build the package:

$ nix-build quake4demo.nix

and investigate whether the executables can be started:

./result/libexec/quake4-linux-1.0-demo/bin/Linux/x86/quake4.x86

Unfortunately, it does not seem to work:

...
no 'q4base' directory in executable path /nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86, skipping
no 'q4base' directory in current durectory /home/sander/quake4, skipping

According to the output, it cannot find the q4base/ folder. Running the same command with strace reveals why:

$ strace -f ./result/libexec/quake4-linux-1.0-demo/bin/Linux/x86/quake4.x86
...
stat64("/nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86/q4base", 0xffd7b230) = -1 ENOENT (No such file or directory)
write(1, "no 'q4base' directory in executa"..., 155no 'q4base' directory in executable path /nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86, skipping
) = 155
...

It seems that the program searches relative to the current working directory. The missing q4base/ folder apparently resides in the base directory of the extracted folder.

By changing the current working directory and invoking the executable again, the q4base/ directory can be found:

$ cd result/libexec/quake4-linux-1.0-demo
$ ./bin/Linux/x86/quake4.x86
...
--------------- R_InitOpenGL ----------------
Initializing SDL subsystem
Loading GL driver 'libGL.so.1' through SDL
libGL error: unable to load driver: i965_dri.so
libGL error: driver pointer missing
libGL error: failed to load driver: i965
libGL error: unable to load driver: swrast_dri.so
libGL error: failed to load driver: swrast
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  154 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  33
  Current serial number in output stream:  34

Despite fixing the problem, we have run into another one! Apparently the OpenGL driver cannot be loaded. Running the same command again with the following environment variable set:

$ export LIBGL_DEBUG=verbose

shows us what is causing it:

--------------- R_InitOpenGL ----------------
Initializing SDL subsystem
Loading GL driver 'libGL.so.1' through SDL
libGL: OpenDriver: trying /run/opengl-driver-32/lib/dri/tls/i965_dri.so
libGL: OpenDriver: trying /run/opengl-driver-32/lib/dri/i965_dri.so
libGL: dlopen /run/opengl-driver-32/lib/dri/i965_dri.so failed (/nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86/libgcc_s.so.1: version `GCC_3.4' not found (required by /run/opengl-driver-32/lib/dri/i965_dri.so))
libGL error: unable to load driver: i965_dri.so
libGL error: driver pointer missing
libGL error: failed to load driver: i965
libGL: OpenDriver: trying /run/opengl-driver-32/lib/dri/tls/swrast_dri.so
libGL: OpenDriver: trying /run/opengl-driver-32/lib/dri/swrast_dri.so
libGL: dlopen /run/opengl-driver-32/lib/dri/swrast_dri.so failed (/nix/store/0kfgsjryycsk5kfv97phj8ypv66n6caz-quake4-demo-1.0/libexec/quake4-linux-1.0-demo/bin/Linux/x86/libgcc_s.so.1: version `GCC_3.4' not found (required by /run/opengl-driver-32/lib/dri/swrast_dri.so))
libGL error: unable to load driver: swrast_dri.so
libGL error: failed to load driver: swrast
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  154 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  33
  Current serial number in output stream:  34

Apparently, the libgcc_s.so.1 library bundled with the game conflicts with Mesa3D. According to this GitHub issue, replacing the conflicting version with the one from the host system's GCC fixes it.

In our situation, we can accomplish this by appending the path to the host system's GCC library folder to the RPATH of the binaries referring to it and by removing the conflicting library from the package.

Moreover, we can address the annoying issue with the missing q4base/ folder by creating wrapper scripts that change the current working folder and invoke the executable.

The revised expression taking these aspects into account will be as follows:

with import <nixpkgs> { system = "i686-linux"; };

stdenv.mkDerivation {
  name = "quake4-demo-1.0";
  src = fetchurl {
    url = ftp://ftp.idsoftware.com/idstuff/quake4/demo/quake4-linux-1.0-demo.x86.run;
    sha256 = "0wxw2iw84x92qxjbl2kp5rn52p6k8kr67p4qrimlkl9dna69xrk9";
  };
  buildCommand = ''
    # Extract files from the installer
    cp $src quake4-linux-1.0-demo.x86.run
    bash ./quake4-linux-1.0-demo.x86.run --noexec --keep
    
    # Move extracted files into the Nix store
    mkdir -p $out/libexec
    mv quake4-linux-1.0-demo $out/libexec
    cd $out/libexec/quake4-linux-1.0-demo
    
    # Remove obsolete setup files
    rm -rf setup.data
    
    # Patch ELF binaries
    cd bin/Linux/x86
    patchelf --set-interpreter ${stdenv.cc.libc}/lib/ld-linux.so.2 ./quake4.x86
    patchelf --set-rpath $(pwd):${stdenv.cc.cc}/lib:${stdenv.cc.libc}/lib:${SDL}/lib:${xlibs.libX11}/lib:${xlibs.libXext}/lib ./quake4.x86
    chmod +x ./quake4.x86
    
    patchelf --set-interpreter ${stdenv.cc.libc}/lib/ld-linux.so.2 ./q4ded.x86
    patchelf --set-rpath $(pwd):${stdenv.cc.cc}/lib:${stdenv.cc.libc}/lib ./q4ded.x86
    chmod +x ./q4ded.x86
    
    patchelf --set-rpath $(pwd):${stdenv.cc.libc}/lib ./libstdc++.so.5
    
    # Remove libgcc_s.so.1 that conflicts with Mesa3D's libGL.so
    rm ./libgcc_s.so.1
    
    # Create wrappers for the executables
    mkdir -p $out/bin
    cat > $out/bin/q4ded <<EOF
    #! ${stdenv.shell} -e
    cd $out/libexec/quake4-linux-1.0-demo
    ./bin/Linux/x86/q4ded.x86 "\$@"
    EOF
    chmod +x $out/bin/q4ded
    
    cat > $out/bin/quake4 <<EOF
    #! ${stdenv.shell} -e
    cd $out/libexec/quake4-linux-1.0-demo
    ./bin/Linux/x86/quake4.x86 "\$@"
    EOF
    chmod +x $out/bin/quake4
  '';
}

We can install the revised package in our Nix profile as follows:

$ nix-env -f quake4demo.nix -i quake4-demo

and conveniently run it from the command-line:

$ quake4


Happy playing!

(As a sidenote: besides creating a wrapper script, it is also possible to create a Freedesktop-compliant .desktop entry file, so that the game can be launched from the KDE/GNOME applications menu, but I leave this as an open exercise for the reader!)

Conclusion


In this blog post, I have explained that prebuilt binaries do not work out of the box in NixOS. The main reason is that they cannot find their dependencies in their "usual locations", because these do not exist in NixOS. As a solution, it is possible to patch binaries with a tool called PatchELF to provide them with the correct location of the dynamic linker and the paths to the libraries they need.

Moreover, I have shown two example packaging approaches (a simple and complex one) that should be relatively easy to repeat as an exercise.

Although source deployments typically work out of the box with few or no modifications, getting prebuilt binaries to work is often a journey that requires patching, wrapping, and experimentation. In this blog post I have described a few tricks that can be applied to make prebuilt packages work.

The approach described in this blog post is not the only solution to get prebuilt binaries to work in NixOS. An alternative approach is composing FHS-compatible chroot environments from Nix packages. This solution simulates an environment in which dependencies can be found in their common FHS locations. As a result, we do not require any modifications to a binary.

Although FHS chroot environments are conceptually nice, I would still prefer the patching approach described in this blog post unless there is no other way to make a package work properly -- it has less overhead, does not require any special privileges (e.g. super user rights), we can use the distribution mechanisms of Nix to their full extent, and we can also install a package as an unprivileged user.

Steam is a notable exception for which FHS-compatible chroot environments are used, because it is a deployment tool itself that conflicts with Nix's deployment properties.

As a final practical note: if you want to repeat the Quake 4 demo packaging process, please check the following:

  • To enable hardware accelerated OpenGL for 32-bit applications in a 64-bit NixOS, add the following property to /etc/nixos/configuration.nix:

    hardware.opengl.driSupport32Bit = true;
    
  • Id Software's FTP server seems to be quite slow to download from. You can also obtain the demo from a different download site (e.g. Fileplanet) and run the following command to import it into the Nix store:

    $ nix-prefetch-url file:///home/sander/quake4-linux-1.0-demo.x86.run