Scalatra tutorial (Part 1): Setting up a barebones “Hello world” REST API

I went through the exercise of setting up a Scalatra web API on Mac OS X for work. There are lots of guides out there, most of which use pre-existing projects on GitHub as a starting point. The official Scalatra documentation does this as well, although I ran into issues trying to follow their instructions (they seem to work fine on Ubuntu but not on Mac). I used these guides as a reference, but I wanted to start from scratch and build out a simple GET endpoint that returned "Hello World" using the least amount of code possible. The following is a guide on how to do exactly that. The goal was to familiarize myself with how Scalatra APIs are structured and build upon that foundation.

I used Mac OS X (Mavericks) and IntelliJ (13.1) with the SBT and Scala plugins. The SBT plugin is optional, as you can use Homebrew to install SBT and run the API from the command line. Most of the skeleton frameworks on GitHub have an SBT bash script in the project root directory which unpacks and runs the SBT jar file (typically found under the project folder beneath the project root directory). Keep in mind that this is why most documentation tells you to run the SBT commands using "./sbt". The Homebrew install adds SBT to your path, so you can simply type "sbt" instead.

If you want to follow along on GitHub, you can check out the sample project at https://github.com/jieyangh/scalatra-sample-API. The commit hash for part 1 is 6c66a3df87ab321fa736f6ab0a819cb16d8e039e

Let’s get started!

Install Scala:

brew install scala

Install SBT (optional):

brew install sbt

SBT is a simple build tool, analogous to Maven. In most projects, there is a bash shell script with the same name in the project root that runs the SBT jar file (typically located in the project directory, and run via "./sbt"; if you install with brew or opt for a manual install, you can just type "sbt" instead). It is typically used to generate the IntelliJ project files, update dependencies, and launch the application. However, the IntelliJ plugin makes this unnecessary, so you can skip this step.

Install the Scala and SBT plugins for IntelliJ:

Go to Preferences->IDE Settings->Plugins->Browse Repositories. Search for "Scala" and "SBT" and install them both. You'll need to restart IntelliJ for the changes to take effect. The SBT plugin is optional, but it makes life much easier.

[Screenshot: installing the Scala and SBT plugins]

Create a new intelliJ project:

In the New Project dialog window, select Scala and choose “SBT”

[Screenshot: the New Project dialog]

Define the web.xml file:

Create a webapp/WEB-INF directory under src/main and add a web.xml file there. This file is a carryover from Java, which uses web.xml for servlet configuration. In Scalatra, web.xml is still utilized, but its role is much diminished. All that's needed here is a definition for our listener class:

<?xml version="1.0" encoding="UTF-8"?>
<web-app xmlns="http://java.sun.com/xml/ns/javaee"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://java.sun.com/xml/ns/javaee
http://java.sun.com/xml/ns/javaee/web-app_3_0.xsd"
version="3.0">
        <listener>
                <listener-class>org.scalatra.servlet.ScalatraListener</listener-class>
        </listener>
</web-app>

Configure the routes in ScalatraBootstrap.scala:

Create a ScalatraBootstrap.scala file under src/main/scala, which will contain the bulk of your application configuration code. The design philosophy seems to favor using code over XML, a pattern that we will see repeated later when we look at how SBT handles plugins and dependencies. The ScalatraBootstrap file is where our route definitions will go. Since we're keeping things simple, we will mount a "sample" endpoint under the website root and map it to the yet-to-be-implemented servlet, GreetingController, that services it.

import javax.servlet.ServletContext
import org.scalatra.LifeCycle

import sampleApi.controllers.GreetingController

class ScalatraBootstrap extends LifeCycle {
  override def init(context: ServletContext) {
    // Mount servlets.
    context.mount(new GreetingController, "/sample/*")
  }
} 

Implement the GreetingController servlet:

Here is where we will implement our Hello World endpoints. Create a sampleApi.controllers package under src/main/scala and add a GreetingController.scala file. Note that it's customary to put the API endpoints under a controllers package, but not necessary for the implementation.

package sampleApi.controllers
import org.scalatra.ScalatraServlet

class GreetingController extends ScalatraServlet {
  get("/") {
    "Hello world"
  }

  get("/:name") {
    val name = params.getOrElse("name", "world")
    "Hello " + name
  }
} 

Two HTTP GET routes are defined: one that accepts a "name" path parameter, and one that doesn't. Both are defined relative to where the servlet is mounted (in our case, http://localhost:8080/sample/).

Add dependencies in build.sbt:

Now edit the build.sbt file under the project root. This is where we manage all our package dependencies. Modify build.sbt so that it contains the following code (you can safely override the default auto-generated file provided by IntelliJ):

libraryDependencies ++= Seq(
  "org.scalatra" %% "scalatra" % "2.2.2",
  "org.eclipse.jetty" % "jetty-webapp" % "8.1.7.v20120910" % "container,compile",
  "org.eclipse.jetty.orbit" % "javax.servlet" % "3.0.0.v201112011016")

Seq(webSettings :_*)

libraryDependencies is a variable declared in keys.scala that stores all the managed dependencies. "++= Seq" appends our list of dependencies to the variable. Each dependency uses the same groupID % artifactID % revision format that Maven does. The main difference here is that instead of specifying the dependencies in XML in a pom file, Scala code is used. Again, the design philosophy seems to favor the expressiveness and flexibility of code over that of XML.
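For comparison, here is roughly what the first dependency above would look like in a Maven pom file. Note that the "%%" operator appends the Scala binary version to the artifact name for you; scalatra_2.10 below assumes you are building against Scala 2.10, which was typical for the Scalatra 2.2.x era:

```xml
<!-- Maven equivalent of: "org.scalatra" %% "scalatra" % "2.2.2" -->
<dependency>
    <groupId>org.scalatra</groupId>
    <artifactId>scalatra_2.10</artifactId>
    <version>2.2.2</version>
</dependency>
```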

IntelliJ will prompt you to re-import your project. Enable auto-import to save you some clicks in the future:

[Screenshot: the re-import prompt]

Add the Jetty plugin to SBT:

We are still missing one more dependency in order to get our application to run successfully. We will want to launch our application from SBT by running it inside a Jetty web container (a servlet container that fills the same role as Tomcat). SBT plugins are managed in the project/plugins.sbt file. Note that plugins.sbt manages all the SBT plugins, while build.sbt manages the dependencies for the application itself. Add the following line:

addSbtPlugin("com.earldouglas" % "xsbt-web-plugin" % "0.7.0")

Launch the application:

Your project structure should look like the following:
[Screenshot: the project structure]

Go to View->Tool Windows->SBT Console. Hit the green arrow button to launch SBT. It will load the project definition and update the dependencies automatically. Once SBT has finished loading, you should see a “success” message followed by a command prompt. Type “container:start” to launch the jetty web server:

[Screenshot: the SBT console]

You can now open up a browser, point it to "localhost:8080/sample", and see "Hello world". Congratulations!

References

http://ethanway.com/scalatra/

Great guide, but it does some things that aren't standard. Namely, it uses a pom file to manage dependencies and launches the application by manually spinning up a Jetty server in code instead of using SBT.

https://github.com/laurilehmijoki/sbt-scalatra-skeleton

A simple HelloWorld API with JSON output, complete with a working test and instructions on how to launch the application.

http://www.scalatra.org/2.3/guides/deployment/configuration.html

Official documentation outlining how to set up your API

null + null = 0, +new Date, and why loosely typed languages like Javascript are a pain

Javascript is a loosely typed language. This was done to expedite the writing of code and avoid tedious tasks such as declaring variables and explicitly defining their types. You know, the kind of stuff only suckers would do. The irony is that these so-called shortcuts actually result in more tedium, as everything requires a tremendous amount of type checking. Consider the simple act of concatenating two strings: str1 and str2.

str1.concat(str2) fails immediately if str1 is undefined or null, so the concat method is out of the question. But we can always use the Javascript "+" operator, right? Sure, if you want to trust the implicit type casting in Javascript. In a strongly typed language such as C#, if str1 and str2 are both null strings, str1 + str2 simply evaluates to an empty string. The result is unsurprising. In Javascript, the result of str1 + str2 is the number 0, because both nulls are implicitly cast to 0. This can lead to some hilarious and unintended consequences in the code that can be a pain in the ass to track down. But wait, there's more! If both str1 and str2 are undefined, str1 + str2 is NaN (not a number). If str1 is "" and str2 is null, then str1 + str2 is equal to "null". That's right: because str1 is a string, Javascript implicitly casts str2 to a string as well. The expression becomes "" + "null", which evaluates to "null". Ugh. If str1 is "" and str2 is undefined, well, guess what? The string "undefined" is what you get when you concatenate the two.
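You can verify all of these oddities yourself in a browser console or a Node REPL:

```javascript
// Implicit casting in action: the same "+" operator, four different surprises
console.log(null + null);           // 0: both nulls are coerced to the number 0
console.log(undefined + undefined); // NaN: undefined coerces to NaN
console.log("" + null);             // "null": null is cast to the string "null"
console.log("" + undefined);        // "undefined": same deal
```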

If you want a result that is not nonsensical, you wind up having to write some hideous code to convert all null and undefined values to an empty string first:

if (str1 === undefined || str1 === null)
{
    str1 = "";
}

if (str2 === undefined || str2 === null)
{
    str2 = "";
}

var result = str1 + str2;
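As an aside, a more compact (if slightly sloppier) way to get the same effect is to fall back to an empty string with the || operator. Fair warning: this also swallows other falsy values such as 0 and "", which happens to be harmless here since we want empty strings anyway:

```javascript
var str1 = null;
var str2 = "world";

// || substitutes "" for null, undefined, or any other falsy value
var result = (str1 || "") + (str2 || "");
console.log(result); // "world"
```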

And yes, the triple equals is absolutely necessary. Without it, the implicit casting rears its ugly head again. So to sum up, use common sense when writing Javascript code. Stick with the basic best practices.

DECLARE YOUR VARIABLES.

str1 = "hello"; //BAD but valid, sets str1 as a property of the global object. That's right, str1 is now a global variable. Horrible
var str2 = "world"; //You have to type three extra characters but it saves you a ton of grief

DO NOT USE YOUR VARIABLES FOR MORE THAN ONE PURPOSE

var str1 = "hello";
console.log(str1);
str1 = 5;  //For the love of god, just declare a new variable here instead
console.log(str1);

DO NOT DO CUTE STUFF WITH IMPLICIT CASTING TO LOOK COOL

var start = +new Date();

This is a clever snippet of code, but it can be rewritten into this functionally equivalent form:

// the long-hand version of +new Date()
var startDateTime = new Date();
var startTicks = startDateTime.getTime();

Both versions of the code do the same thing. The second requires a bit more typing but is far more readable. +new Date is gimmicky. Yes, it "looks cool" and shows off your Javascript skillz, but the brevity in this case is detrimental to maintainability in the long run. +new Date works because it first creates a new Date object. By default, the Date constructor returns the current date. The "+" unary operator is then applied, which implicitly casts the Date to a number: the number of milliseconds since 1/1/1970. Great, but why have this esoteric code? It relies on an implicit cast, so it depends entirely on the version of Javascript and the browser it is running on. Any future changes could potentially break the code. If typing two lines instead of one is so painful, create a utility function that calls +new Date and slap a comment on there. Or don't, because +new Date has horrible performance.
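For what it's worth, if all you need is the current timestamp, ES5 added Date.now(), which returns the same millisecond count without any implicit casting gimmicks:

```javascript
// Date.now() returns milliseconds since 1/1/1970 as a plain number
var startTicks = Date.now();
console.log(typeof startTicks); // "number"
```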

So remember, scripting languages such as javascript make writing code easy, but the ease of use leads to ease of abuse. Be careful or you’ll wind up making things more difficult in the long run.

Scoping of “this” – client side Javascript versus server side NodeJS

Every function in Javascript has a "this" object that behaves differently depending on the execution context. Every time a function is executed in Javascript, there is an object context associated with it; "this" provides a way to reference that object. When calling an object method, such as Foo.bar(), "this" refers to the Foo object. When calling a function in a global context, such as bar(), "this" refers to the Javascript global object. In most browsers, that would be the window object. bar() is essentially shorthand for window.bar().

Now, the behavior in NodeJS is quite different. In order to prevent pollution of the global namespace, code running in the top-level scope of a Node module is not scoped to the global object, but to a module-local object instead. That means the top-level "this" in a module is this local object, not the global object. According to the official NodeJS documentation:

"In browsers, the top-level scope is the global scope. That means that in browsers if you're in the global scope var something will define a global variable. In Node this is different. The top-level scope is not the global scope; var something inside a Node module will be local to that module."

Consider the following code:


function somefunc() {
    this.value = 100;
    console.log("this.value inside somefunc() is " + this.value);
}

this.value = 1;
somefunc();
console.log("this.value in top level scope is " + this.value);

The NodeJS output is:

this.value inside somefunc() is 100
this.value in top level scope is 1

The output from the javascript console in a browser such as Chrome or Firefox is:

this.value inside somefunc() is 100 
this.value in top level scope is 100 

Because somefunc is attached to the global object, "this.value" actually refers to window.value. Thus, changing "this.value" inside somefunc() changes window.value, so "this.value" is still 100 even after somefunc() has finished execution. In NodeJS, however, "this.value" inside somefunc is not the same as "this.value" outside of somefunc: the top-level "this" refers to the module's local object, while "this" inside a function called without an object context refers to the global object. The same distinction holds for anonymous functions as well:

 
function somefunc(cb) {
     this.value = 100;
     console.log("this.value inside somefunc() is: " + this.value);
     cb();
}

this.value = 1;
somefunc(function() {
    this.value = 50;
    console.log("this.value inside the anonymous function is: " + this.value);
});

console.log("this.value in top level scope is " + this.value); 

NodeJS outputs:

this.value inside somefunc() is: 100
this.value inside the anonymous function is: 50
this.value in top level scope is 1 

The browser outputs:

this.value inside somefunc() is: 100 
this.value inside the anonymous function is: 50  
this.value in top level scope is 50 

As before, the browser behavior is a result of somefunc being executed in a global context. In the preceding code, this.value always refers to window.value, even when it appears inside the anonymous function callback (more on this later). In NodeJS, both somefunc and the anonymous callback see the global object as "this" (neither is called through an object), while the top-level "this" is the module's local object. This results in the three different values for "this.value".

Now let’s take a look at what happens when we explicitly set the execution context of somefunc to that of a local object:

function somefunc(cb) {
    this.value = 100;
    console.log("this.value within somefunc is: " + this.value);
    cb();
}

var obj = {
    method: somefunc
};

this.value = 1;

obj.method(function() {
    this.value = 50;
    console.log("this.value within anonymous function is: " + this.value);
});

console.log("obj.value is: " + obj.value);
console.log("this.value in top level scope is: " + this.value);               

The browser prints out:

this.value within somefunc is: 100
this.value within anonymous function is: 50
obj.value is: 100
this.value in top level scope is: 50

NodeJS prints out:

this.value within somefunc is: 100
this.value within anonymous function is: 50
obj.value is: 100
this.value in top level scope is: 1

When run from a browser, obj.value is set to 100 and window.value is set to 50. Since somefunc is no longer being called in a global context, "this" inside somefunc refers to the object that called it (imaginatively named obj in this example); "this.value" is actually referencing obj.value. Note that inside the anonymous callback function, "this" is not scoped to obj, but rather to the global object. Because the anonymous function is not explicitly called via an object context, it still defaults to a global context!

The behavior in Node is consistent with what we observed before. When called in an object context, "this" inside somefunc refers to obj. The anonymous callback is invoked without an object context, so its modifications are reflected neither in obj nor in the top-level scope. As a result, obj.value is 100, whereas this.value remains 1 in the top-level scope; it was initialized once and never modified again.

The key takeaway here is that "this" can be incredibly confusing and unintuitive. Javascript provides methods such as apply() and call() for a reason: so that you can explicitly set "this" and avoid all the headache of second-guessing yourself trying to figure out what "this" actually is.

When keeping it Agile goes horribly wrong

Disclaimer: I don't like being Agile. I also don't like complaining. Too much negativity is a bad thing, and in some cases, it can be even worse than being Agile. I subscribe to the simple philosophy that time spent whining can be better spent improving the situation. Given that I'm not always successful at following my own principles, and given that this article is (mostly) facetious, I've decided to compromise: the following essay is a tongue-in-cheek commentary about why Agile sucks. It makes little to no attempt to come up with something better (perhaps because, much like democracy, Agile is the worst methodology except for all the others that have been tried). So without further ado, here is my anti-Agile manifesto:

Once upon a time, there was this development methodology called Waterfall. It sucked, and everyone wrote huge essays ripping it to pieces. It was such an easy target because of how ludicrous it was. Anyone who has ever worked on a project with complexity greater than that of a "Hello World" application knows that getting all the requirements and design right on the first try, let alone the implementation, is a delusional dream. In some ways, it's not even really fair to criticize the Waterfall model anymore. It's like beating a dead horse. Why bother? There's no challenge. Well, people still like to berate the Waterfall model because it serves as a useful strawman, a foil against which they can hawk their new and improved development methodologies. The thing is, a wise man named Fred Brooks once said that there is no silver bullet. Software complexity spans multiple layers, covering architectural distances that are orders of magnitude apart. So long as humans are writing the code, the building of software is always going to be an imperfect process. Any methodology should merely be a set of guidelines, not strict rules defining an orthodoxy that constricts and binds the builders. Anyone promising the one true way to software development must be regarded with skepticism.

Yet over the years, new false prophets in software engineering delivered sermons promising untold productivity and awesomeness. The most popular of these came from the cult of Agile, which quickly became a full-blown religion, complete with its own church and accompanying dogma. A methodology that was supposed to be about "individuals and interactions over processes and tools" is now bogged down by the very process it sought to avoid. Instead of circumventing tools by shuffling sticky notes around on a whiteboard, software tools designed to simulate sticky notes shuffled around on a whiteboard were adopted. Heretics who complained that it was all a bit too much were castigated and excoriated by the church. Cowed into submission, they would dutifully complete their two week sprints while complaining about how stupid it all was.

Among the front line of enforcers in the Agile army are the scrummasters. They ensure strict adherence to the doctrine and processes of Agile. Straying from the path is punished by meetings on improving the process improvement process. Scrummasters come in two flavors: the full-time scrummaster, and the hapless developer who is asked to do double duty writing code and managing the sprint. If the former has no technical skills and no domain knowledge either, then it's best to run. Run and don't look back. Apparently you can get certified as a scrummaster, and once you have been ordained, you are supposed to be able to provide value to a project just because you are familiar with Agile. I've seen this in action and the results aren't pretty.

I remember one meeting where, while the developers were having a technical discussion, the scrummaster looked on helplessly and asked everyone to please use the "As an X, I want Y, so that Z" format so as to provide the proper context. He was promptly ignored for being useless. Later, said scrummaster tried to randomize the team by asking the developers to retroactively edit the completed stories in a sprint to be tagged under a different release, presumably because the scrummaster had no idea how to use Microsoft Team Foundation Server. Such a scrummaster is dead weight. The problem is exacerbated because the scrummaster has no real power or control. The product owner/manager controls what features go in the milestone. The developers not only don't respect the scrummaster, they don't report to him either; they report to their dev managers. The takeaway here is: don't hire full-time scrummasters, but if you must, make sure they have a few years of domain knowledge. The developer/scrummaster hybrid role is not much better and is an unholy abomination. Developers are typically overloaded and randomized enough as it is. Having to manage a sprint on top of that, while being invited to yet even more useless meetings, is a major pain in the ass.

The meetings all revolve around completing the sprint. Sprints are typically two weeks long and are made up of stories, the basic unit of work. Stories, in turn, are the building blocks of epics. Unlike the Epic of Gilgamesh, which details the exploits of the legendary warrior king of Uruk and his partner in crime Enkidu, these epics are typically mundane high level overviews of new features that need to be built. Depending on how orthodox your particular Agile church is, the stories need to be in the famous "As an X, I want Y, so that Z" format. For example: "As a user, I want this super duper awesome feature that shoots pew pew laser beams, so that I feel awesome and give more money to the company". Or: "As a developer, I want to drink wine out of the bottle, so I can dull the pain of shuffling sticky notes around a whiteboard every day during my daily standup meetings".

If the stories are not written correctly, they are rewritten or removed from the sprint in the grooming meetings, where the wheat is separated from the chaff. These are supposed to be highly efficient meetings, but I've seen them degenerate into pointless semantic arguments about the definition of "done" and what the proper "acceptance criteria" for a story are. These typically spin off yet more banal meetings about vocabulary words where there is much weeping and gnashing of teeth. Once the stories have been groomed, the sprint planning meeting can begin. The stories in the sprint are "sized", and scope is adjusted accordingly. "Size" is just an arbitrary metric measuring how much work a story is. It can be measured either in terms of complexity (how difficult a task is) or time (how long it will take). The two are not necessarily the same: on an increasing scale of 0 to 5 points, banging your head against the wall 1000 times would be a 1 point story based on complexity, but a 5 point story based on time. These meetings can drag on forever if huge arguments ensue about how large a story is. The good news is that most people just want to get the meeting over with. As the sprint planning drags on, most people's spirits are broken and they simply stop caring enough to bother arguing.

Once the sprint is finally complete, there is the sprint demo. A lot of the time this is for non-technical stakeholders or customers. While demos are great if you're working on a UI and have something flashy to show off, webservices, messaging queues, build scripts, database work, and the other sort of behind-the-scenes stuff that makes everything work don't lend themselves to exciting presentations. It's pretty awkward, but there are workarounds, such as demoing only the UI and hiding any XML, database queries, and command line interfaces. Following the demo there is usually a sprint review. This is a meeting where everyone looks at the sprint burndown chart, a graph that tracks the number of remaining story points in the sprint over time. A perfect burndown is a straight line with a negative slope that reaches 0 at the end of day 14. In reality, the burndown chart resembles a drunk person stumbling around the park. Sometimes the line plateaus for days before small and erratic stepwise increases and decreases eventually reach a number that is not quite zero. During the review meeting, everyone discusses ways to make the burndown chart prettier in the future:
“Perhaps instead of sizing based on time or complexity, we can size based on the position of the moon relative to the earth and the sun?”
“How about consulting the oracle bones instead?”
“How about basing it on how many developers are weeping when they read the story?”
Regardless of whatever zany ideas are proposed, the resulting burndown never even remotely resembles a straight downward line.

Closely related but subtly different from the sprint review is the sprint retro, where the team reflects on the sprint. On a dysfunctional team, this is mostly an exercise in catharsis as team members yell about what went wrong. Ideally, it results in a set of action items with clearly identified owners. This can actually be a great exercise in team improvement, but your mileage may vary. Those who feel empowered and accountable for the project at large will actually follow up on the action items. Those who are shackled by bureaucracy and red tape will shrug their shoulders, send out a few passive aggressive emails, and call it a day. On teams that are well run, the retro is mostly just a formality.

Throughout the entirety of the sprint are the daily standup meetings. This is where people talk about what they're working on. Each person usually says something along the lines of: "Yesterday I worked on this stuff, today I'm going to work on this other stuff. No blockers. Sunny outside with a small chance of showers." These tend to be tedious. To be fair though, they can provide some modicum of value by making sure the entire team is on the same page and aware of any potential risks or blocking issues. Then again, simply having a team that talks to one another accomplishes the exact same thing. The potential danger lurking in the standup meetings is having a weak scrummaster. It's up to the scrummaster to make sure that people don't take up too much time reporting their status, ramble on, go off topic, or get into a huge debate with others. Any kind of serious issue should be taken offline and discussed after the meeting. If the scrummaster is not strict about this, the standup meetings will start to drag on and become a major source of pain for everyone involved. The problem grows worse the larger the team, as though daily standup drama were directly proportional to how many people are unlucky enough to attend.

On paper, Agile isn't completely terrible. There are a lot of good ideas. The problem comes when it is actually put into practice. In that regard, it is quite similar to Communism: an idea that sounds great on paper but fails miserably in the real world. Agile, an anti-process process of writing software, becomes bogged down in process. Managers obsessed with metrics start demanding more granular detail on story size estimates, more information on what business value a particular story provides, and more visibility into the sprint progress (typically accomplished by asking for daily status reports). Soon things get so ridiculous that something as simple as fixing a typo in a variable name requires a story, approval from business, and a triage meeting before the scrummaster moves it into the sprint. Those on a dysfunctional team will hide behind process like a government worker hides behind red tape to avoid getting anything done. They thrive in an environment like this, feeding off the death of team morale.

At this point in my life, I'd rather do everything ad hoc. There has to be a certain amount of flexibility in the software development process. Sometimes the requirements will not be clear, and they will rarely be perfect. Nor will the requirements be set in stone. Unfortunately, requirements do change, and sometimes at the last possible minute. It sucks, but the solution isn't to pile on yet more processes and meetings to prevent this from happening. Minimizing volatility is a good goal, but the team should eschew the rigidity of a predefined process and feel empowered to make their own decisions. Agile helps foster communication among team members, but if the team is filled with toxic people who can't play well together, no methodology will ever solve that. If the team is filled with competent people who have good rapport and open lines of communication, then Agile is not really necessary. I once worked on a team that had a hard ship date in two months. For those two months we kept the daily standup but eliminated the overhead of grooming, planning, and retros, and simply did what needed to be done. This requires a certain level of proactivity and an ability to handle the almost chaotic nature of such freedom, but it was a refreshing reprieve from the suffocating nature of Agile.

The TL;DR is: teams should follow the spirit of Agile rather than enforcing it to the letter. Rigidity leads to bureaucracy, which defeats the whole purpose of Agile in the first place. Agile was meant to be a lightweight methodology designed to ship code quickly and frequently. The keyword here is lightweight. The frequent grooming, design, planning, demos, and retros are there to foster effective communication, not to stifle people with the tedium of it all.

Deduping an XML document with XSLT (listing only unique elements)

XSLT provides a key() function that retrieves a subset of nodes in an XML document based on selection criteria specified in an xsl:key element. In essence, xsl:key defines an index of the XML document, and key() serves as an index lookup function. This versatility allows you to accomplish a lot with a little. Let's look at how we could use this to dedupe an XML document that contains duplicate items, listing each unique element only once. Consider the following XML:

<Inventory>
<Product>
	<ProductID>1</ProductID>
	<name>Sword of the ancients</name>
	<description>A legendary sword fit for a hero</description> 
</Product>
<Product>
	<ProductID>2</ProductID>
	<name>BFG 9000</name>
	<description>A slayer of demons</description> 
</Product>
<Product>
	<ProductID>2</ProductID>
	<name>BFG 9000</name>
	<description>A slayer of demons</description> 
</Product>
<Product>
	<ProductID>3</ProductID>
	<name>Flaming sword</name>
	<description>Effective against the undead</description> 
</Product>
<Product>
	<ProductID>4</ProductID>
	<name>Aegis bulwark</name>
	<description>A stalwart shield that blocks the deadliest blows</description> 
</Product>
<Product>
	<ProductID>4</ProductID>
	<name>Aegis bulwark</name>
	<description>A stalwart shield that blocks the deadliest blows</description> 
</Product>
<Product>
	<ProductID>4</ProductID>
	<name>Aegis bulwark</name>
	<description>A stalwart shield that blocks the deadliest blows</description> 
</Product>
</Inventory>

Each Product is uniquely identified by its ProductID, and some of the products appear more than once. Let’s write a transform that cleans up the XML and lists each item only once. To do so, we will need the key() and generate-id() functions. Let’s look at key() in detail first. key() takes two parameters: the name of the key to do the lookup on, and a search string to match against:

key(key-name, search-string)

search-string is then matched against the selection criteria defined in the xsl:key element named key-name. An xsl:key element has three attributes: @name, @match, and @use. @name is the key-name parameter that will be passed in to the key() function. @match is an XPath expression that specifies which elements in the XML document to index. @use is an XPath expression that serves as the lookup key. It must be relative to the elements specified in @match, and it will be evaluated against the key() search-string parameter. The following xsl:key indexes all the Products in the previous XML document by their ProductID element:

<xsl:key name="ID-key" match="Product" use="ProductID" />  

key('ID-key', '2') would return the two BFG 9000 elements. In order to extract only the unique elements, we will need to use generate-id(). generate-id() takes a node as a parameter and generates an ID for it that is unique across the entire document: no two nodes in the same document will ever map to the same ID. However, generate-id() does not guarantee that the value generated for a given node will be the same on a different run. We will use generate-id() as a way to test for identity between nodes. The following expression selects all the unique Product elements in the document:

<xsl:for-each select="//Product[generate-id(.)=generate-id(key('ID-key', ProductID)[1])]">

Let’s break this gnarly expression down into smaller, more manageable parts. The xsl:for-each selects the set of Product nodes that satisfy the equality condition inside the bracketed predicate. What is the predicate trying to accomplish? Starting with the innermost expression on the right-hand side, key('ID-key', ProductID) returns the list of nodes that have the same ProductID value as the current node being evaluated. [1] returns the first node in this key node-set, which is then passed as an argument to generate-id(). Note that we are always guaranteed at least one result from the key lookup, since the node-set returned by key() will always include the current node.

On the left-hand side of the condition is the expression generate-id(.), which generates an ID for the current Product node being evaluated. If the two generated IDs are equal, we include the node in the result set. Subsequent nodes with the same ProductID will fail this equality test, since we are only comparing against the *first* node returned by key(). The rest will be ignored. This guarantees that we get a list of Product nodes where no two ProductIDs are the same; the equality test essentially filters out the redundant elements.

With that hairy expression out of the way, it’s just a simple matter of copying out all the unique elements:

<xsl:for-each select="//Product[generate-id(.)=generate-id(key('ID-key', ProductID)[1])]">
    <xsl:copy-of select="."/>
</xsl:for-each>

The complete code listing is here:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt" exclude-result-prefixes="msxsl">
  <xsl:output method="xml" indent="yes"/>
  <xsl:key name="ID-key" match="Product" use="ProductID" />


  <xsl:template match="/">
    <xsl:element name="Inventory">
      <xsl:for-each select="//Product[generate-id(.)=generate-id(key('ID-key', ProductID)[1])]">
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </xsl:element>

  </xsl:template>
</xsl:stylesheet>

Running the transform on the previously listed input XML will output the following XML file:

<?xml version="1.0" encoding="utf-8"?>
<Inventory>
  <Product>
    <ProductID>1</ProductID>
    <name>Sword of the ancients</name>
    <description>A legendary sword fit for a hero</description>
  </Product>
  <Product>
    <ProductID>2</ProductID>
    <name>BFG 9000</name>
    <description>A slayer of demons</description>
  </Product>
  <Product>
    <ProductID>3</ProductID>
    <name>Flaming sword</name>
    <description>Effective against the undead</description>
  </Product>
  <Product>
    <ProductID>4</ProductID>
    <name>Aegis bulwark</name>
    <description>A stalwart shield that blocks the deadliest blows</description>
  </Product>
</Inventory>

Exploring JavaScript closures

JavaScript closures are a concept that is easiest to explain by example:

function outer(increment) {
    var outer_foo = 1;
    function inner() {
        outer_foo += increment;
        return outer_foo; 
    }
    return inner;
}

var outer_instance = outer(2);
var foo1 = outer_instance();
var foo2 = outer_instance();  

alert(foo1);   //3
alert(foo2);   //5

The inner function references the local variables of the outer function. The inner function is the closure. It is enclosed by its parent function and has access to its parent’s execution context, even after the parent has finished execution and returned. In essence, variables in the scope chain of the parent function stay in scope, as though they were allocated on the heap and not on the stack. Contrast this to programming languages such as C, where parameters and variables local to a function are declared on the stack frame, and immediately deallocated after the function has exited.

Because the parent function’s variables stay in scope and are essentially passed by reference to the closure, this can lead to bugs. Case in point:

function createFunctions() {
    var result = new Array();

    for (var i = 0; i < 10; i++) {
        result[i] = function() { return i; };
    }

    return result;
}

var functionArray = createFunctions();
var func1 = functionArray[0];
var foo1 = func1();  
alert(foo1); //10
var func7 = functionArray[6];
var foo7 = func7(); 
alert(foo7); //10

The createFunctions method returns an array of closures, each of which is a function that returns the value of the loop index, i. At first glance, it would seem as though the closures would return the numbers 0 through 9, respectively. In actuality, every closure in the array returns 10. This is because the i variable in each closure references the i variable of the parent execution context. Because this is a reference, the value of i will not be the value at the time of the creation of the closure, but rather at the time of the execution of the closure. The for loop will have already terminated by this point and set i to 10.

In order to match the original intent behind the function, createFunctions() can be rewritten as follows:

function createFunctions2() {
    var result = new Array();

    for (var i = 0; i < 10; i++) {
        result[i] = function(num) {
            return function() {
                return num;
            };
        }(i);
    }

    return result;
}

This syntax can be a bit jarring to read at first. The innermost anonymous function is a closure that references the variable num, which is scoped to its parent function:

 
    //snipped out code ... 
        function() {
            return num; 
        }
    //...

The inner function is the return value of its parent function, which takes in a single argument: num. This parent function is immediately invoked with the loop counter i as its argument, as denoted by the "(i)" at the end of the function declaration.

//snipped out code ...
        function(num) { 
	   //...
	}(i);
//...

Because parameters are passed by value and not by reference, num retains the value of i at the time the outer function was invoked. In other words, num equals the value of i at the current iteration of the for loop, and keeps this value independent of any future modifications to i. The outer anonymous function serves as an intermediary: it returns a closure that returns the value of num "frozen" at that point in time, allowing the code to function as originally intended.

function createFunctions2() {
    var result = new Array();

    for (var i = 0; i < 10; i++) {
        result[i] = function(num) {
            return function() {
                return num;
            };
        }(i);
    }

    return result;
}

var functionArray = createFunctions2();
       
//prints out the numbers 0 through 9
for (var i = 0; i < 10; i++) {
    var bub = functionArray[i]();
    alert(bub);
}

The key takeaway here is that closures access variables in their parent’s scope chain by reference, not by value. To bypass this behavior, the parent variables that the closure accesses must themselves be copies of the originals, which is accomplished by passing them in as parameters to an intermediary function. The code that accomplishes this can look unwieldy, but it’s not so bad once you recognize the basic pattern behind it.
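As an aside, modern JavaScript offers a simpler fix: declaring the loop counter with let (added in ES6, after the var-based pattern above became idiomatic) gives each loop iteration its own binding, so every closure captures a distinct variable. A minimal sketch (function name is mine):

```javascript
// Each iteration of a let-declared loop gets its own binding of i,
// so every closure captures a distinct variable and no
// intermediary function is needed.
function createFunctions3() {
    var result = [];
    for (let i = 0; i < 10; i++) {
        result[i] = function() { return i; };
    }
    return result;
}

var functionArray = createFunctions3();
console.log(functionArray[0]()); // 0
console.log(functionArray[6]()); // 6
```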

Implementing a for loop in XSLT

XSLT is a language designed to transform XML documents. Rather than being a procedural programming language, XSLT models itself after functional programming, borrowing many of its fundamental design concepts. Functional programming deals with expressions rather than statements. Statements are executed. They may or may not return a value, and variables directly or indirectly impacted by a statement can be modified, resulting in potentially far-reaching side effects once a particular statement has been executed. Expressions, on the other hand, are evaluated. They can be nested, chained together as arguments to a parent expression, and combined in various other ways, but every expression eventually simplifies down into a single return value. In a functional programming language, the parameters an expression acts upon are immutable, and thus there are no side effects when an expression is evaluated.

XSLT takes the same approach in its underlying design philosophy. In XSLT, templates are essentially expressions that can be combined in various ways, eventually being evaluated and returning an output value that is typically, although not necessarily, a well-formed XML document. Because XSLT uses immutable variables, there are no side effects when templates are evaluated. This means that an XSLT engine can optimize the transform, applying template rules in any order, sometimes even simultaneously.

Imagine what would happen if variables were mutable instead. A given template in an XSLT stylesheet can match multiple nodes in an XML document. The order in which those nodes are processed would suddenly matter if, say, the template output some text from the node along with the value of an auto-incrementing integer variable. Different XSLT engine implementations would produce different outputs. Worse, any implementation with a non-deterministic order of evaluation would have race conditions, producing nondeterministic output every time the transform is run. The same holds at the template level, as side effects from one template that modify a variable used by another template could impact the other’s output. Nondeterministic output in a language designed specifically to output documents is a nonstarter from a design perspective, so XSLT would suddenly require some mechanism for specifying the order of evaluation. The end result would be a cumbersome and awkward language, as opposed to the clean and concise syntax that XSLT uses today.

It’s important to have this functional programming mindset when working with XSLT. Trying to do procedural programming in XSLT is like trying to fit a square peg into a round hole. It’s possible, but the end result is not pretty. Take, for instance, a for loop. XSLT provides iteration capabilities via XPath, letting it iterate over a set of nodes and return the ones that match a certain set of criteria. However, this is technically not the same as a for loop in a procedural programming language, which specifies a counter, an increment/decrement operation, and a comparison that terminates the loop. Because variables are immutable in XSLT, something as simple as having a loop counter suddenly becomes nontrivial.

It is indeed possible to simulate a for loop in XSLT, but whether you would want to is another question altogether. This is more an interesting exercise in mapping a procedural construct onto a functional language. The following solution is based on the mammoth O’Reilly XSLT book by Doug Tidwell. The important thing to keep in mind, however, is that in a real-world scenario, the need for a for loop might indicate that XSLT is not the right tool for the job. I’ve encountered such for loops in some hairy XSLT code, and it isn’t pretty. It’s best to leave procedural programming tasks to actual procedural programming languages. This concept can be lost on people who forget that they can either do any necessary pre- or post-processing in code instead of XSLT, or write a code module in a more powerful language and call it from XSLT as an extension function.

But I digress. If you’re a glutton for punishment, here’s how you would go about implementing a for loop. The basic idea is that because variables are immutable, we cannot increment the loop counter directly. Instead, we pass it as a parameter by recursively calling the template with the incremented value.

 
<xsl:template name="for-loop">
  <xsl:param name="loop-counter" />
  <xsl:param name="increment-value" />

  ......do some processing here.....

  <xsl:call-template name="for-loop">
    <xsl:with-param name="loop-counter" select="$loop-counter + $increment-value" />
    <xsl:with-param name="increment-value" select="$increment-value"/>
  </xsl:call-template>
</xsl:template>

Remember that any task that can be done iteratively can also be done with recursion (given a large enough call stack). The recursion will terminate once the condition of the for-loop template fails. We need some way of testing the condition against the current value of the loop counter. To make the template more flexible, we define a "$conditional-test" parameter that specifies which comparison operator to perform and a "$conditional-value" parameter to evaluate it against. A "$conditional-passed" variable stores the result of the comparison. Now, because XSLT does not have a way to pass in an expression that can be dynamically evaluated at runtime, we will have to manually map each potential value of "$conditional-test" to the appropriate test. This requires writing a lot of tedious code:


  <xsl:variable name="conditional-passed">
      <xsl:choose>
        <xsl:when test="$conditional-test = '!='">
          <xsl:if test="$loop-counter != $conditional-value">
            <xsl:text>true</xsl:text>
          </xsl:if>
        </xsl:when>

        <xsl:when test="$conditional-test = '&gt;='">
          <xsl:if test="$loop-counter &gt;= $conditional-value">
            <xsl:text>true</xsl:text>
          </xsl:if>
        </xsl:when>

        ... add support for other conditional tests here ...

        <xsl:otherwise>
          <xsl:message terminate="yes">
            Invalid conditional-test passed in
          </xsl:message>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>

If the value of "$conditional-test" matches none of the pre-defined values, we use xsl:message to write an error to the output and terminate the transform. Otherwise, the final step is to check the value of "$conditional-passed":

 
     <xsl:if test="$conditional-passed='true'">
         
         <!-- Add any custom code here that the for-loop needs to execute. In this case, we merely print the loop-counter -->
         <xsl:text>loop-counter: </xsl:text>
         <xsl:value-of select="$loop-counter" />
         <xsl:text> &#xA;</xsl:text>

      <xsl:call-template name="for-loop">
        <xsl:with-param name="loop-counter"   select="$loop-counter + $increment-value" />
        <xsl:with-param name="increment-value" select="$increment-value"/>
        <xsl:with-param name="conditional-test" select="$conditional-test"/>
        <xsl:with-param name="conditional-value" select="$conditional-value"/>
      </xsl:call-template>
    </xsl:if>

Note that the custom code inside this if statement corresponds to the user-defined body of a standard procedural for loop. If $conditional-passed is not equal to ‘true’, then we have reached the end of our loop and can exit the template. Otherwise, we execute the code and recursively call the template again with $loop-counter incremented accordingly, simulating the next iteration of the loop. Note that this section of the template must be changed every time the for loop needs to execute a different set of code. Because there is no way in XSLT to pass in a set of instructions as a parameter to be dynamically evaluated at runtime, an XSLT for-loop template is hard to re-use. Any time new functionality is needed inside our for-loop construct, a new template must be created. The old template can be copy/pasted and the custom code inside the if statement modified accordingly, but this is not an ideal solution. This further drives home the point that it’s best to use the right tool for the job.

Putting it all together, we arrive at the following code in all its glory:

<?xml version="1.0"?>

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:param name="loop-counter" select="10" />
  <xsl:param name="increment-value" select="-1"/>
  <xsl:param name="conditional-test" select="'&gt;='"/>
  <xsl:param name="conditional-value" select="0" />

  <xsl:template match="/">
    <xsl:call-template name="for-loop">
      <xsl:with-param name="loop-counter" select="$loop-counter" />
      <xsl:with-param name="increment-value" select="$increment-value"/>
      <xsl:with-param name="conditional-test" select="$conditional-test"/>
      <xsl:with-param name="conditional-value" select="$conditional-value"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="for-loop">
    <xsl:param name="loop-counter" />
    <xsl:param name="increment-value" />
    <xsl:param name="conditional-test" />
    <xsl:param name="conditional-value" />

    <xsl:variable name="conditional-passed">
      <xsl:choose>
        <xsl:when test="$conditional-test = '!='">
          <xsl:if test="$loop-counter != $conditional-value">
            <xsl:text>true</xsl:text>
          </xsl:if>
        </xsl:when>

        <xsl:when test="$conditional-test = '&gt;='">
          <xsl:if test="$loop-counter &gt;= $conditional-value">
            <xsl:text>true</xsl:text>
          </xsl:if>
        </xsl:when>

        <xsl:when test="$conditional-test = '&lt;='">
          <xsl:if test="$loop-counter &lt;= $conditional-value">
            <xsl:text>true</xsl:text>
          </xsl:if>
        </xsl:when>

        <xsl:when test="$conditional-test = '&gt;'">
          <xsl:if test="$loop-counter &gt; $conditional-value">
            <xsl:text>true</xsl:text>
          </xsl:if>
        </xsl:when>

        <xsl:when test="$conditional-test = '&lt;'">
          <xsl:if test="$loop-counter &lt; $conditional-value">
            <xsl:text>true</xsl:text>
          </xsl:if>
        </xsl:when>

        <xsl:when test="$conditional-test = '='">
          <xsl:if test="$loop-counter = $conditional-value">
            <xsl:text>true</xsl:text>
          </xsl:if>
        </xsl:when>
        <xsl:otherwise>
          <xsl:message terminate="yes">
            Invalid conditional-test passed in
          </xsl:message>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>


    <xsl:if test="$conditional-passed='true'">
      <xsl:text>loop-counter: </xsl:text>
      <xsl:value-of select="$loop-counter" />
      <xsl:text> &#xA;</xsl:text>

      <xsl:call-template name="for-loop">
        <xsl:with-param name="loop-counter"   select="$loop-counter + $increment-value" />
        <xsl:with-param name="increment-value" select="$increment-value"/>
        <xsl:with-param name="conditional-test" select="$conditional-test"/>
        <xsl:with-param name="conditional-value" select="$conditional-value"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
 

A simple .NET program can be whipped up to apply this transform:

using System.Xml;
using System.Xml.Xsl;
using System.Text;

namespace XSLTTest
{
    class Program
    {
        static void Main(string[] args)
        {
            XslCompiledTransform stylesheet = new XslCompiledTransform(); 
            stylesheet.Load("Forloop.xslt"); 
            stylesheet.Transform("input.xml", "output.xml");
        }
    }
}

Since this template blindly prints out the loop counter, the input XML is completely irrelevant. For a real-world example, one would likely define additional parameters in the for-loop template to read in values for the loop to operate on. In our case, the output simply looks like the following:

loop-counter: 10 
loop-counter: 9 
loop-counter: 8 
loop-counter: 7 
loop-counter: 6 
loop-counter: 5 
loop-counter: 4 
loop-counter: 3 
loop-counter: 2 
loop-counter: 1 
loop-counter: 0 


Interview questions: Sorting a terabyte of integers with a gigabyte of memory (part 2)

In the previous article, I mentioned that the mergesort algorithm could be modified to solve this problem. The usual mergesort implementation assumes that all the unsorted elements can fit into main memory. The algorithm divides the unsorted elements into two halves, then recursively calls mergesort on each half before merging them together. The merge operation requires a temporary array to hold the merged result, making it a non-starter. Worse, the merging would be horribly inefficient on disk, as swapping two elements on disk is orders of magnitude slower than swapping two elements of an in-memory array. Disk I/O is optimized for sequential, not random, access. Furthermore, due to the large size of the unsorted input, there would likely not be enough stack space for all the recursive calls required.

We will need an external sorting algorithm to address these issues, and the basic idea behind mergesort can be used to come up with a workable solution. Let us define N as the number of unsorted integers residing on hard drive D1, and M as the number of records that can be sorted in memory at any given time (M takes into account any overhead needed to perform the sort itself). Assume also that we have three additional hard drives besides D1 (or some other storage medium optimized for sequential read/write access, such as tape): D2, D3, and D4. We can sort the data in multiple passes. The initial pass generates N/M sorted subsets of size M. Subsequent passes will merge these subsets together until we are left with one final merged set of sorted integers.

Let’s use a simple example with N = 18 and M = 3. Here is the unsorted data before we apply our sorting algorithm:

D1    15 4 1 20 19 3 100 80 8 12 10 11 55 40 31 39 67 88           
D2
D3
D4

The first pass reads M elements at a time from D1 and sorts them in memory (any performant algorithm with a good asymptotic runtime such as quicksort will do). The resulting subsets will be written to disk, with the destination alternating between D3 and D4.

D1              
D2
D3    1  4  15 | 8  80 100 | 31 40 55
D4    3  19 20 | 10 11 12  | 39 67 88

D3 and D4 now each contain three (half of N/M) sorted subsets of size M. The next step will be to combine two subsets at a time, one from D3 and one from D4, into a set of size 2*M. The output will be written to disk, this time with the destination alternating between D1 and D2.

D1    1  3  4  15 19 20  | 31 39 40 55 67 88           
D2    8  10 11 12 80 100 
D3    
D4    

This method of reading in the data from two disks and writing the merged output to the other two disks is repeated until we end up with one disk which contains all the data in sorted order.

D1               
D2     
D3    1  3  4  8  10 11 12 15 19 20 80 100
D4    31 39 40 55 67 88
D1    1 3 4 8 10 11 12 15 19 20 31 39 40 55 67 80 88 100            
D2     
D3   
D4   

The final detail is how we perform the merge operation itself given the limited amount of memory available. It turns out we can re-use the merge logic from mergesort. Recall that in the mergesort algorithm, the merge operation takes in two subarrays, S1 and S2 (since the sort happens in memory, these are typically passed in as start and end indexes into the array being sorted), which are then merged in sorted order into a temporary array T1. The contents of T1 are then written back to the array being sorted. The logic for the merge is fairly simple. Let A1, A2, and A3 be indexes into S1, S2, and T1, respectively. One integer is read from each subarray at a time, with the smaller of the two being written out to the temporary array, after which the appropriate indexes are advanced. For example, if S1[A1] < S2[A2], then S1[A1] is written to T1[A3], and A1 and A3 are incremented. If S1[A1] > S2[A2], then S2[A2] is written to T1[A3], and A2 and A3 are incremented.
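The merge logic just described can be sketched in JavaScript (an in-memory toy version with names of my own choosing; the external variant would replace the array reads and writes with sequential disk I/O):

```javascript
// Merge two sorted arrays into one sorted array using the
// two-pointer technique: compare the current head of each run,
// emit the smaller, and advance that run's index. Only the two
// elements under comparison need to be held "in memory".
function merge(s1, s2) {
    var t1 = [];
    var a1 = 0, a2 = 0;
    while (a1 < s1.length && a2 < s2.length) {
        if (s1[a1] <= s2[a2]) {
            t1.push(s1[a1++]);
        } else {
            t1.push(s2[a2++]);
        }
    }
    // One of the runs is exhausted; copy the remainder of the other.
    while (a1 < s1.length) t1.push(s1[a1++]);
    while (a2 < s2.length) t1.push(s2[a2++]);
    return t1;
}

console.log(merge([1, 4, 15], [3, 19, 20])); // [1, 3, 4, 15, 19, 20]
```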

The important takeaway here is that only two integers need to be compared at any given time, which means the merge operation requires very little memory. The logic for the merge operation of our external sorting algorithm is thus almost identical, except that instead of reading and writing from an array, we deal with disk I/O: instead of incrementing an index into an array, we advance the read position on the disk. Because we read and write from the disks in sequential order, we avoid being penalized by random-access disk I/O.

As with mergesort, each pass over the data halves the number of sets. After the initial pass, there are N/M subsets, so log2(N/M) subsequent passes through the data are required. Note that if additional external storage devices were available, fewer passes would be required. For example, if there were six total disks, each pass would cut the number of sets to one third. Given 2k disks, the number of passes required would be logk(N/M). There is additional complexity in the merge logic, however. Finding the smaller of two integers is trivial, since we can compare them directly. Finding the smallest element in a set of k integers requires more work; a data structure such as a priority queue could be used to return the minimum. Note that we also need to keep track of which disk the minimum came from, so that we can read in the next integer from that disk. Additional calls to remove the previous minimum from the priority queue and insert the next integer read from disk will also need to be made.
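For illustration, here is a rough sketch of the k-way merge (the names are hypothetical, and a linear scan stands in for the priority queue, which is fine for small k):

```javascript
// k-way merge: repeatedly take the smallest head element among k
// sorted runs, remembering which run it came from so the next
// element can be read from that run. For large k, a priority queue
// would replace the linear scan for the minimum.
function kWayMerge(runs) {
    var positions = runs.map(function() { return 0; });
    var result = [];
    while (true) {
        var minRun = -1;
        for (var r = 0; r < runs.length; r++) {
            if (positions[r] < runs[r].length &&
                (minRun === -1 ||
                 runs[r][positions[r]] < runs[minRun][positions[minRun]])) {
                minRun = r;
            }
        }
        if (minRun === -1) break; // all runs exhausted
        result.push(runs[minRun][positions[minRun]++]);
    }
    return result;
}

console.log(kWayMerge([[1, 4], [3, 19], [2, 8]])); // [1, 2, 3, 4, 8, 19]
```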

Interview questions explained: Sorting a terabyte of integers with a gigabyte of memory

Given a terabyte of unsorted 32-bit unsigned integers on disk and only one gigabyte of memory, sort the integers. At first glance, mergesort would be the first solution that comes to mind. However, the typical mergesort implementation assumes that all the elements can fit into main memory. It is still possible to modify the mergesort algorithm to solve this problem, but there is another alternative: counting sort. The basic idea is to keep track of how many times each integer appears; a hash table would be the ideal data structure for this purpose. From there, it’s just a simple matter of iterating through each possible integer value in order, looking up how many times it appears, and writing the integers out to disk. The end result is a sorted list of integers. This sort has an O(n) runtime, as it requires one linear pass to count all the integers and a second one to write them out in sorted order. A counting sort can only be used in special cases where the set of possible values of the unsorted elements is known ahead of time and finite; it would not work, for example, if we were dealing with arbitrary floats or strings.
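As a toy illustration of the counting sort idea (in-memory only, with a plain array of counts standing in for the hash table; the actual problem requires chunking the counts, as discussed next):

```javascript
// Counting sort over a known range [0, maxValue]. One pass counts
// occurrences; a second pass emits each value count times, in order.
function countingSort(values, maxValue) {
    var counts = new Array(maxValue + 1).fill(0);
    for (var i = 0; i < values.length; i++) {
        counts[values[i]]++;
    }
    var sorted = [];
    for (var v = 0; v <= maxValue; v++) {
        for (var c = 0; c < counts[v]; c++) {
            sorted.push(v);
        }
    }
    return sorted;
}

console.log(countingSort([5, 1, 5, 3, 0], 5)); // [0, 1, 3, 5, 5]
```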

However, because we are dealing with integers, we can apply the counting sort in this scenario. Here are some quick numbers. There are 1,099,511,627,776 bytes in a terabyte. Each integer takes up four bytes of storage, which means there are a total of 274,877,906,944 unsorted integers. There are 1,073,741,824 bytes in a gigabyte. Assume we use some kind of hash table to keep track of the number of occurrences of each integer, and that each key/value pair consists of two integers. Without considering the additional overhead of storing the hash table metadata in memory, we would need eight bytes for each key/value pair, meaning that at most we could store the counts for 134,217,728 integers in memory at a time. It quickly becomes clear that there is not enough memory to do the counting sort in one pass. One workaround is to write the integer counts out to disk, reading and counting the integers in one-gigabyte chunks at a time (allowing for additional memory overhead used by the program itself). Disk I/O is expensive, so to make the system more robust, we can also write additional metadata to disk to keep track of what has been read so far, allowing the program to resume from where it left off in the case of a fatal error.

This solution can be further optimized based on the distribution of the unsorted integers. This is why asking for clarification during the interview process is so important. If the range of possible integer values is limited by a minimum and maximum value, then this frees up the amount of memory required for storing the keys. Likewise, if there is a maximum number of times each integer can appear in the unsorted list, this frees up the amount of memory required for storing the count values.

The first chapter of the book Programming Pearls provides one such example of a problem that can be solved entirely in memory with an optimized counting sort. The problem involves a list of unique nine-digit phone numbers that need to appear in sorted order. Because each number appears at most once in the unsorted set, only one bit of memory is needed to store the count for each one. Thus, a bit vector can be used to store the key/value pairs for the counting sort. Each bit in the vector represents one of the possible nine-digit numbers between 0 and 999,999,999. If the bit corresponding to a number is flipped on, the number appears in the unsorted list; otherwise it does not. A bit vector with 1,000,000,000 bits requires only 125,000,000 bytes, or roughly 119 megabytes of memory. Not bad. But even in a scenario where the set of key/value pairs cannot fit into main memory, this limitation can be bypassed altogether.
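A sketch of the bit-vector variant (the function name is my own, and a Uint8Array stands in for the bit vector; value v maps to bit v % 8 of byte v / 8):

```javascript
// Bit-vector sort for a set of unique non-negative integers below
// maxValue. Setting a bit marks the value as present; scanning the
// bits in ascending order recovers the values in sorted order.
function bitVectorSort(values, maxValue) {
    var bits = new Uint8Array(Math.ceil(maxValue / 8));
    values.forEach(function(v) {
        bits[Math.floor(v / 8)] |= 1 << (v % 8);
    });
    var sorted = [];
    for (var v = 0; v < maxValue; v++) {
        if (bits[Math.floor(v / 8)] & (1 << (v % 8))) {
            sorted.push(v);
        }
    }
    return sorted;
}

console.log(bitVectorSort([867, 5309, 42], 10000)); // [42, 867, 5309]
```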

The advantage of the counting sort is that it can scale with the size of the unsorted elements by using a distributed system to do the counting. Here is a high-level overview of one possible design. The range of possible values can be divided into buckets across various slave machines, with each machine storing the counts for its respective range of values. For example, machine A stores the counts for all the integers between 1 and 1,000,000, machine B for all the integers between 1,000,001 and 2,000,000, and so on. Each time an integer is read from the master, a mapping function is applied to determine the machine whose range the integer falls in. A notification service then alerts the appropriate machine, which increments the count for that integer accordingly. Once all the integers have been read and counted, each machine in the system can return its set of ordered key/value pairs by exposing it via a web service interface, or some functional equivalent. Keys in the range that have a count of zero can be omitted from the response. The master can then aggregate this data and print out the final result in sorted order.
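The mapping function itself is trivial. A sketch, assuming the one-million-value buckets from the example and zero-based machine indexes (both hypothetical details):

```javascript
// Map an integer to the index of the machine whose bucket it falls
// in, assuming each machine covers a contiguous range of bucketSize
// values starting at 1: machine 0 covers 1..1000000, machine 1
// covers 1000001..2000000, and so on.
var bucketSize = 1000000;
function machineFor(value) {
    return Math.floor((value - 1) / bucketSize);
}

console.log(machineFor(1));       // 0
console.log(machineFor(1000000)); // 0
console.log(machineFor(1000001)); // 1
```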

The potential downside of a distributed approach is the increased complexity. Each machine is now a potential point of failure. To address this, additional error checking logic is needed to make the system more robust. For example, every time a machine increments a count, we want to notify the master that the work is done. If an error is encountered, we want to retry the increment operation. Ideally, this should all be done asynchronously to improve performance, using some sort of event driven notification system. The easiest solution here would be to simply use a third party component. Messaging queues such as RabbitMQ, MSMQ, IBM MQ, etc. were built from the ground up with these kinds of issues in mind. We can add redundancy and provide even more error checking by having clusters of machines be responsible for each range of numbers. Once all the unsorted integers have been counted, we can MD5 hash the data structure containing the key/value pairs on each machine. By checking that this value is identical across all the machines in a given cluster, we can help ensure that the counts were performed correctly. The end result of this additional complexity is a highly scalable system that can sort an unlimited number of elements, given enough hardware.

When given a sorting problem in an interview, it is important to ask lots of questions and try to break the problem down into something simpler. What exactly is being sorted? By identifying the various constraints on the possible set of values, an easier solution such as a counting sort may present itself. Typically, the interviewer will follow up the question by asking you to scale your solution. You’ll want to think about how to divide the work up in parallel across a large distributed system. The map/reduce class of algorithms was designed to do exactly that; a distributed counting sort is one such example. It may not be the right solution every time, but it’s a good one to bring up during the interview process. The right mindset to have when asked a scalability question is to have a conversation with the interviewer, discussing the various possible solutions and their design tradeoffs. A good talk goes a long way, and at the very least, you may come out of it learning a lot.

Interview questions explained: Fun with multiplication!

Write a method that takes as its input an array of integers. For each element in the array, print the product of all the other elements in the array. For example, given the following array:
[1,2,3,4,5]
The output would be:
120, 60, 40, 30, 24

Let’s start with the most literal solution. For each element in the array, loop through all the other elements and multiply them together:

int[] j = { 1, 2, 3, 4, 5 };

for (int outerIndex = 0; outerIndex < j.Length; outerIndex++)
{
    int product = 1;
    for (int innerIndex = 0; innerIndex < j.Length; innerIndex++)
    {
        if (innerIndex != outerIndex)
        {
            product *= j[innerIndex];
        }
    }

    Console.WriteLine(product);
}

This solution requires an outer for loop to iterate through each element in the array and then an inner for loop to iterate through the other n-1 elements in the array. The asymptotic complexity of this is O(n^2).

Obviously, this isn’t a very good solution, so we can do better. The key observation to make here is that for any given element with value x in the array, the product of all the other elements in the array is simply the total product of all the elements divided by x. We only need to calculate the total product once.

This solution only requires two passes through the array. The first calculates the total product, and the second divides the total product by each element in the array:

int[] j = { 1, 2, 3, 4, 5 };

int totalProduct = 1;
for (int i = 0; i < j.Length; i++)
{
    totalProduct *= j[i];
}

for (int i = 0; i < j.Length; i++)
{
    //assume we check for zeroes beforehand to prevent divide by zero errors
    Console.WriteLine(totalProduct / j[i]);
}

This solution has O(n) complexity, since it requires only two linear scans of the array.

Now let’s make the problem a little bit more challenging. What if we cannot use division? Assume that the operation is prohibitively expensive and that we need to find a workaround (a not uncommon scenario in the real world).

We can use an algorithm design technique known as dynamic programming: Break the problem up into smaller sub-problems, solve and store the solution for each one, and then combine the solutions as necessary to arrive at the answer to the original problem. One difference between dynamic programming and the divide and conquer class of algorithms is that dynamic programming stores the solutions to the subproblems, which are then retrieved at a later point in time. This is an optimization that prevents the same sub-problems from being solved multiple times. One example of this approach is calculating the nth element in the Fibonacci sequence. The Fibonacci sequence is a sequence that starts with 0, 1, 1, 2 and where every subsequent number is the sum of the previous two numbers in the sequence. Typically, the solution involves using either recursion or iteration. However, we can use dynamic programming by precomputing the sequence up to some number j. Thus, for all numbers where n ≤ j, instead of having to use recursion or iteration, we can do a simple lookup. For numbers where n > j, we can still reduce the number of computations by summing forward from the values at j and j-1 instead of starting from 0 and 1.
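A minimal sketch of that lookup-table idea (the cutoff of 50 is an arbitrary assumption for illustration):

```csharp
using System;

class Fibonacci
{
    // j: how far ahead we precompute; an arbitrary cutoff for this sketch
    const int Precomputed = 50;
    static long[] fib = new long[Precomputed + 1];

    static void Precompute()
    {
        fib[0] = 0;
        fib[1] = 1;
        for (int i = 2; i <= Precomputed; i++)
            fib[i] = fib[i - 1] + fib[i - 2];
    }

    // For n <= j this is a simple table lookup; for n > j we sum
    // forward from the last two stored values instead of starting over
    static long Fib(int n)
    {
        if (n <= Precomputed)
            return fib[n];

        long a = fib[Precomputed - 1], b = fib[Precomputed];
        for (int i = Precomputed; i < n; i++)
        {
            long next = a + b;
            a = b;
            b = next;
        }
        return b;
    }

    static void Main()
    {
        Precompute();
        Console.WriteLine(Fib(10)); // 55, served straight from the table
        Console.WriteLine(Fib(60)); // computed by summing forward from index 50
    }
}
```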

With respect to the problem of determining the products without the usage of division, we can use a similar approach. Note that the inefficiency of the first solution is due to the same multiplication operations being carried out multiple times. We can eliminate these redundant calculations by using dynamic programming. For any given element k in the array j of size n, we want to multiply all the numbers to the left of k with all the numbers to the right of k. By precomputing the running product of all the elements in the array in both directions (from left to right and vice versa), we now know the product of all the numbers to the left and right of any element k in the array. The problem can then be solved:

int[] j = {1,2,3,4,5};

int[] runningProductLeft = new int[j.Length];
int[] runningProductRight = new int[j.Length];  

int product = 1;
//there is no element to the left of the start of the array, so set it to 1
runningProductLeft[0] = 1;
            
//since we already set the first element of runningProductLeft
//start populating it from index 1
for (int i = 1; i < j.Length; i++)
{
    product = product * j[i - 1];
    runningProductLeft[i] = product;
}

//we want to populate runningProductRight in reverse by starting from the end of the array

//there is no element to the right of the end of the array, so set it to 1
runningProductRight[j.Length - 1] = 1;
product = 1;

//since we already populated the last element of runningProductRight, start populating from the second to last element in the array
for (int i = j.Length - 2; i >= 0; i--)
{ 
    product = product * j[i + 1];
    runningProductRight[i] = product;
}

//now that the running products have been precomputed, printing out the solution becomes trivial
for (int i = 0; i < j.Length; i++)
{
    product = runningProductLeft[i] * runningProductRight[i];
    Console.WriteLine(product);
}

This solution requires three linear scans through the array, so the runtime complexity is still O(n), at the cost of O(n) extra space for the two running-product arrays.