Fun with sed

Given that I’ve spent most of my employment history at Microsoft shops, I’m always looking for an excuse to use Cygwin and gain more experience with the UNIX command line. I ran into a problem a while back that gave me a good opportunity to do so. Imagine the following: deep within the bowels of a classic ASP web application, there lives a giant monstrosity of a VBScript case statement spanning thousands of lines. This eyesore is responsible for maintaining cobranding image URLs for various client companies. It looks something like this:

select case companyName
case "ACME"
	logoURL = "/images/ACMELogo.jpeg"
	loginSplashScreenURL = "/images/ACMESplashscreen.jpeg"
	bannerURL = "/images/ACMEBanner.jpeg"
case "WidgetCorp"
	logoURL = "/images/WidgetCorpLogo.jpeg"
	loginSplashScreenURL = "/images/WidgetCorpSplashscreen.jpeg"
	bannerURL = "/images/WidgetCorpBanner.jpeg"
...

Every time a new company wanted to set up cobranding, the support team had to manually add another entry to this already gargantuan case statement. In the beginning there were not many client companies, so the natural solution, a database, didn’t seem necessary. After a few years it became obvious that the case statement wasn’t the best approach, and migrating everything to a database started to make a lot of sense.

The task would be split into two parts. First, I’d need to create a new page where ops could upload and manage these images. Second, I’d need to import the existing logos into the database. The powers that be ultimately decided that only the “logoURL” variables would need to be migrated for now. I decided to create an ASP migration page that would loop through all the existing logos and import them into the database. To do that, I’d need a way to parse the existing case statement and produce an array of company names and their corresponding logo URLs. To make this parsing easier, I decided to store everything in one array, with even-numbered indexes containing the company and odd-numbered indexes containing the logo. It would look something like this:

migrationArray = array("ACME",_
                       "/images/ACMELogo.jpeg",_
                       "WidgetCorp",_
                       "/images/WidgetCorpLogo.jpeg",_
                       .....)

And this is when I decided to use sed to generate the array. Sed is the UNIX stream editor. Tedious batch editing scenarios such as the one I described are exactly what sed was designed for. In a nutshell: you specify a set of commands and an input file. Sed then reads the input file one line at a time, applying the appropriate commands to each line. A command consists of an optional address (a line number, a range of line numbers, or a regex pattern that selects which lines the command applies to) and an action (append, delete, substitute, and so on).
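
As a quick illustration of how an address and an action combine, here is a minimal sketch that is not part of the original migration. The three-line input and the variable names (input, by_line, by_regex) are made up for the example:

```shell
# Three sample lines to feed to sed on standard input.
input=$(printf 'alpha\nbeta\ngamma\n')

# Address "2" plus action "d": delete only the second line.
by_line=$(printf '%s\n' "$input" | sed '2d')

# Address "/^g/" (a regex) plus action "s": substitute only on lines starting with "g".
by_regex=$(printf '%s\n' "$input" | sed '/^g/ s/a/A/')

printf '%s\n---\n%s\n' "$by_line" "$by_regex"
```

Without an address, a command applies to every line of the input.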

Think of sed as a Turing machine that can only move forward, not backward (although sed does have a hold buffer that addresses this problem). For more complex tasks that require the full power of a Turing complete language, a Ruby or Python script would probably make more sense. In most cases, however, sed produces clean and concise solutions. Sure, a lot of this text editing could be done manually via find and replace, but automating things makes life easier. Laziness is a virtue. By storing all my sed commands in a file, I can easily tweak the commands and rerun them at my convenience.

In order to accomplish the text transform that I wanted, I first copied and pasted the case statement into a separate text file. This would simplify my sed commands; I wouldn’t need to worry about parsing out the rest of the ASP file, which was quite large. After doing so, I began working out the list of sed commands that I would need. First on the list: delete unwanted lines. Since I only cared about the case lines and the logo URLs, everything else would need to be removed. Sed provides a handy delete command that does exactly this. “d” tells sed to delete any line that matches the preceding pattern (delimited by a leading and trailing “/”). The regex syntax here should be familiar to those who use JavaScript or Perl:

/select case companyName/d
/loginSplashScreenURL/d
/bannerURL/d
/^\s*$/d

The first line deletes the select case statement. The next two lines delete the loginSplashScreenURL and bannerURL assignment statements. The last line deletes all blank and whitespace-only lines. “^” and “$” mean start of line and end of line in regex, and “\s” (a GNU sed extension) matches any whitespace character. Put another way, the statement basically means: “delete any line where only whitespace exists between the start and the end of the line”.
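
To see the delete step on its own, here is a small sketch run against a cut-down sample of the case statement. It passes each command with -e instead of a script file and spells “\s” as the POSIX class [[:space:]] so it does not depend on GNU sed; both choices are mine, as are the variable names:

```shell
# Sample input modeled on the case statement above, including one blank line.
sample='select case companyName
case "ACME"
    logoURL = "/images/ACMELogo.jpeg"
    loginSplashScreenURL = "/images/ACMESplashscreen.jpeg"

    bannerURL = "/images/ACMEBanner.jpeg"'

# Each -e adds one delete command; a line matching any pattern is dropped.
kept=$(printf '%s\n' "$sample" | sed \
    -e '/select case companyName/d' \
    -e '/loginSplashScreenURL/d' \
    -e '/bannerURL/d' \
    -e '/^[[:space:]]*$/d')

printf '%s\n' "$kept"
```

Only the case line and the logoURL line survive, still carrying their original indentation.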

The next thing to take care of was stripping out unwanted characters. The substitute command (think find and replace in a typical text editor) does exactly this. “s” tells sed to look at each line in the file and replace the first occurrence of a given regex (between the first and second “/”) with a given replacement string (between the second and third “/”). Since we are stripping out these characters, the replacement string is empty, hence the “//”:

s/case //
s/logoURL = //
s/'.*$//
s/\s*//g

The first statement removes the “case” from the case statement, keeping the actual company name. Likewise, the second statement keeps only the right hand side of the logoURL assignment. The third line removes all comments (' denotes a comment in VBScript) by deleting everything from the apostrophe to the end of the line. The fourth line removes all whitespace. Note the use of “g”, which tells sed to do the replacement on every match in a given line rather than only the first (again, this syntax should be familiar to those who have used scripting languages with regex support).
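
Here is a hedged sketch of the strip step applied to two lines as they would look after the deletes. The trailing VBScript comment in the sample is invented for the example, the comment-stripping command is written as s/'.*// (my reading of the intent), and [[:space:]] stands in for “\s” for portability:

```shell
workdir=$(mktemp -d)

# Two surviving lines; the ' comment on the second one is a made-up example.
cat > "$workdir/lines.txt" <<'EOF'
case "ACME"
    logoURL = "/images/ACMELogo.jpeg"  ' main logo
EOF

# Commands run in order on every line: drop "case ", drop "logoURL = ",
# drop any trailing comment, then remove all remaining whitespace.
stripped=$(sed \
    -e 's/case //' \
    -e 's/logoURL = //' \
    -e "s/'.*//" \
    -e 's/[[:space:]]*//g' \
    "$workdir/lines.txt")

printf '%s\n' "$stripped"
rm -r "$workdir"
```

One thing to keep in mind with this approach: the whitespace command would also remove spaces inside a quoted company name, and the comment command would cut a line at an apostrophe inside a string.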

Now, since this is an array, I need to append a comma after every item. Because each item is on its own line, and because VBScript is an ugly language, I also need to append an underscore (this tells VBScript that the statement continues on the next line). Substitute is a flexible command and can be used to append text by using “$” as the pattern (which means end of line in regex). The following statement means “replace the end of the line with a comma and an underscore”:

s/$/,_/

I also need to add the array declaration at the start of the output file:

1 s/^/migrationArray = array(/

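Here I’m using the optional address argument. The “1” specifies the first line in the input file (to specify lines 1 through 10, you would use “1,10” instead). The “^” in the pattern specifies the start of the line. Translated literally, this means: replace the start of the first line in the input file with “migrationArray = array(”. One caveat: the address counts input lines, so the command only fires if line 1 of the input survives the earlier delete commands. If the “select case” line (or a blank line) comes first, it is deleted before this command is reached, and the prepend has to target the first surviving line instead (for example, “2” when only the “select case” line precedes the first company).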
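
The append and prepend commands can be tried in isolation with a short sketch like the following. The two-item input stands in for the already-stripped lines, and the last command is an unrelated toy input that only demonstrates a “1,2” range address; the variable names are mine:

```shell
# Two already-stripped items, one per line.
items='"ACME"
"/images/ACMELogo.jpeg"'

# Append ",_" to every line, then prepend the declaration to input line 1 only.
result=$(printf '%s\n' "$items" | sed \
    -e 's/$/,_/' \
    -e '1 s/^/migrationArray = array(/')
printf '%s\n' "$result"

# A range address: the substitution applies to lines 1 through 2, not line 3.
ranged=$(printf 'a\nb\nc\n' | sed '1,2 s/^/> /')
printf '%s\n' "$ranged"
```

Matching on “^” or “$” alone replaces nothing; it simply inserts the replacement text at that position.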

The final result looks something like this (as with most UNIX tools, “#” specifies a comment):

#delete these lines:
/select case companyName/d
/loginSplashScreenURL/d
/bannerURL/d
/^\s*$/d

#strip out the following:
s/case //
s/logoURL = //
s/'.*$//
s/\s*//g
 

#append:
s/$/,_/

#prepend: 
1 s/^/migrationArray = array(/

With all the commands saved into the file “migrationscript.sed” (the .sed file extension is just a formality that lets me know that this file contains sed commands), I then ran the commands via:

sed -f migrationscript.sed casestatement.txt > output.txt

where casestatement.txt contains the original case statement, and output.txt is the file I’m redirecting sed’s output to.
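
For anyone who wants to try the whole first pass, here is a self-contained sketch that builds both files in a temporary directory and runs the same command. A few things are assumptions of mine rather than the original setup: the sample input starts at the first case line (an address of “1” only fires if input line 1 survives the deletes), the comment-stripping command is written as s/'.*//, and [[:space:]] replaces “\s” so the script does not rely on GNU sed:

```shell
workdir=$(mktemp -d)

# Cut-down input; it deliberately begins at the first "case" line.
cat > "$workdir/casestatement.txt" <<'EOF'
case "ACME"
    logoURL = "/images/ACMELogo.jpeg"
    loginSplashScreenURL = "/images/ACMESplashscreen.jpeg"
    bannerURL = "/images/ACMEBanner.jpeg"
case "WidgetCorp"
    logoURL = "/images/WidgetCorpLogo.jpeg"
    loginSplashScreenURL = "/images/WidgetCorpSplashscreen.jpeg"
    bannerURL = "/images/WidgetCorpBanner.jpeg"
EOF

# The sed script: delete, strip, append, prepend (in that order).
cat > "$workdir/migrationscript.sed" <<'EOF'
#delete these lines:
/select case companyName/d
/loginSplashScreenURL/d
/bannerURL/d
/^[[:space:]]*$/d

#strip out the following:
s/case //
s/logoURL = //
s/'.*//
s/[[:space:]]*//g

#append:
s/$/,_/

#prepend:
1 s/^/migrationArray = array(/
EOF

sed -f "$workdir/migrationscript.sed" "$workdir/casestatement.txt" > "$workdir/output.txt"

# Keep a copy of the result in a variable before cleaning up.
output=$(cat "$workdir/output.txt")
printf '%s\n' "$output"
rm -r "$workdir"
```

The last line of the result still ends in “,_” at this point, which is what the second pass takes care of.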

The only thing missing is replacing the final “,_” with the closing parenthesis at the end of the file. Sed can address the last line, but that means the last line of the input file, which is not the same as the last line of the final output (lines could have been added or deleted). Since I know the last logo URL occurs on the third-to-last line of the input, I could try specifying that line as the address of the substitute command. Unfortunately, since sed is a stream editor, it doesn’t know it has reached the last line until it has actually reached the end of the file. By then it’s too late. Sed does have a “hold buffer” that could potentially provide a solution. To me, that came too close to crossing the complexity threshold where a more powerful, general purpose scripting language such as Python would make sense. So, in order to keep things simple, I ran a second sed command afterward. It uses GNU sed’s -i flag to edit output.txt in place; redirecting the output back into output.txt with “>” would not work, because the shell truncates the file before sed gets a chance to read it:

sed -i '$ s/,_$/)/' output.txt

“$” is a line number argument that refers to the last line in the file (note that it is specified in the address portion of the command, and not the pattern portion, where it would mean the end of the current line). Since the output from the prior sed command is now the input file, the last line is the one we actually want to edit. And so, by slapping two sed commands into a bash script, I was able to automate the tedious editing of a mountain of text. HOORAY! In this author’s humble opinion, text transforms such as this are much more fun than, say, using XSL to transform XML. *Shudders*
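
A wrapper along these lines could tie the two passes together. This is only a sketch under my own assumptions, not the original script: the function name transform is hypothetical, the two passes are joined with a pipe instead of an intermediate file, and the prepend is moved into the second pass so that “1” and “$” both refer to lines of the first pass's output:

```shell
# Hypothetical helper: reads the case statement on stdin, writes the array on stdout.
transform() {
    # Pass 1: delete unwanted lines, strip unwanted text, append ",_".
    sed -e '/select case companyName/d' \
        -e '/loginSplashScreenURL/d' \
        -e '/bannerURL/d' \
        -e '/^[[:space:]]*$/d' \
        -e 's/case //' \
        -e 's/logoURL = //' \
        -e "s/'.*//" \
        -e 's/[[:space:]]*//g' \
        -e 's/$/,_/' |
    # Pass 2: open the array on the first line, close it on the last.
    sed -e '1 s/^/migrationArray = array(/' \
        -e '$ s/,_$/)/'
}

sample='select case companyName
case "ACME"
    logoURL = "/images/ACMELogo.jpeg"
    bannerURL = "/images/ACMEBanner.jpeg"
case "WidgetCorp"
    logoURL = "/images/WidgetCorpLogo.jpeg"
    bannerURL = "/images/WidgetCorpBanner.jpeg"'

array=$(printf '%s\n' "$sample" | transform)
printf '%s\n' "$array"
```

Because the second sed only ever sees the cleaned-up stream, neither address depends on which input lines were deleted along the way.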
