vim tutorial – using the s command to replace text on the fly

The vim text editor comes with a powerful :s (substitution) command that is more versatile and expressive than the search and replace functionality found in GUI based text editors.

The general form of the command is:
:[g][address]s/search-string/replace-string[/option]

The address specifies which lines vim will search. If none is provided, it will default to the current line only. You can enter in a single line number to search, or specify an inclusive range by entering in the lower and upper bounds separated by a comma. For example: an address of 1,10 is lines one through ten inclusive.

You can also provide a string value for the address by enclosing it with forward slashes. vim will operate on the next line that matches this string. If the address string is preceded by “g”, vim will operate on every line that matches it. /hello/ matches the next line that contains hello, whereas g/hello/ matches every line that contains hello.

The search-string is a regular expression and the replace-string can reference the matched string by using an ampersand (&).

[option] allows even more fine grained control over the substitution. One of the more common options used is “g”, not to be confused with the “g” that precedes address. Option “g”, which appears at the end of the command, replaces every occurrence of the search-string on the line. Normally, the substitute command only matches on the first occurrence and then stops.

s/ten/10/g

run on the following line:
ten + ten = 20

results in:

10 + 10 = 20

as opposed to:

10 + ten = 20

without the global option.
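The difference is easy to check from a shell. The sketch below uses sed as a stand-in for vim, on the assumption that its s command treats the trailing g the same way for this simple pattern; the variable names are made up for illustration.

```shell
# Sample line from the example above.
line='ten + ten = 20'

# Without the g option only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/ten/10/')

# With the g option every match on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/ten/10/g')

printf '%s\n' "$first_only"
printf '%s\n' "$every_match"
```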

Given all this versatility, the :s command comes in quite handy. Consider the following scenario. There is a comma delimited file that is missing trailing commas on some lines and not others. In order to normalize the text file so that all lines end with a comma, you could run:

1,$s/[^,]$/&,

The address range 1,$ spans the entire file ($ in the address means the last line in the file). The search-string “[^,]$” is a regular expression that matches every line that ends with any character except comma ($ in a regex indicates end of the line). The replace-string has an &, which refers to the trailing character matched in the search-string. By setting the replace-string to “&,” we are telling VIM to take the last character on every line that is not a comma and add a comma to it.

[^,]$ won’t match empty lines because [^,] expects at least one character to be on the line. One way around this is a negative look-behind, which vim does support, though with its own syntax (\@<!) rather than the Perl style (?<!...). The easiest way around it is a second substitute command for the empty lines:
1,$s/^$/,/

This adds a comma to every line that is empty (^ in a regex indicates start of line, so ^$ matches a line with nothing between its start and its end).
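To see both substitutions at work without opening vim, here is a minimal sketch that runs them through sed, which accepts the same s syntax for these two patterns. The sample data is invented, and the second expression uses ^$ to match empty lines.

```shell
# Four sample lines: no trailing comma, trailing comma, empty, no trailing comma.
normalized=$(printf 'a,b,c\nd,e,f,\n\ng,h\n' |
  sed -e 's/[^,]$/&,/' \
      -e 's/^$/,/')

# Every line should now end with a comma, including the empty one.
printf '%s\n' "$normalized"
```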

This is just one example of course. By coming up with the right regex in the search-string, you can automate all sorts of normally tedious tasks with succinct commands. The best part is, unlike those cumbersome GUI based editors that often require the use of a pesky mouse, your hands never have to leave the keyboard! For even more control and flexibility, you could use sed, but :s can handle most day to day tasks quite easily.

Linux tutorial: Searching all files of a specific type for a string

Let’s say you want to search all the *.txt files in a given parent directory and all its children for the word “hello”.

Something like grep -rl "hello" *.txt would not work because the shell will expand “*.txt” for the current directory only. The -r flag for recursion would essentially be ignored. For example, if the parent directory contained:

a.txt

and the child directory contained:

a.txt b.txt c.txt d.txt

grep -rl "hello" *.txt would only search a.txt in the parent directory. This is because the shell will only evaluate the * wildcard for the parent directory from which the command is run.
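You can watch the shell do this expansion yourself. The sketch below builds a throwaway directory tree (the layout mirrors the example above, the temp directory comes from mktemp) and uses echo to show the command line that grep actually receives.

```shell
# Build a scratch tree: a.txt in the parent, four text files in the child.
demo=$(mktemp -d)
mkdir "$demo/child"
printf 'hello\n' > "$demo/a.txt"
for name in a b c d; do
  printf 'hello\n' > "$demo/child/$name.txt"
done

# echo shows the arguments after the shell has expanded *.txt.
expanded=$(cd "$demo" && echo grep -rl "hello" *.txt)

# The real command therefore only ever sees the parent's a.txt.
glob_hits=$(cd "$demo" && grep -rl "hello" *.txt)

echo "$expanded"
echo "$glob_hits"
rm -rf "$demo"
```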

What we actually want to do is use the find command to recursively list all the text files in the directory and its children, and then pass each of these files as arguments into grep, which will then search each argument for any instances of the string “hello”.

The find command to locate all the text files looks like this:
find ./ -name "*.txt"

In order to pass each file as an argument into grep, we use xargs. The xargs utility reads items from standard input (the default delimiter is whitespace or newline) and executes a given command with those items appended as arguments. In practice it packs as many items as will fit into each invocation rather than running the command once per item, but the effect for grep is the same.

Essentially what it is doing is this:

foreach item in stdin
{
execute "[command] [initial arguments] [arg]"
}
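A quick way to see what xargs builds is to hand it echo instead of a real command. In this small sketch the "run:" label stands in for [initial arguments]. By default xargs packs the items into a single invocation, and -n 1 forces the one-command-per-item behavior of the loop above.

```shell
# Default: all three items are appended to a single echo invocation.
one_call=$(printf 'a\nb\nc\n' | xargs echo run:)

# -n 1: at most one item per invocation, so echo runs three times.
per_item=$(printf 'a\nb\nc\n' | xargs -n 1 echo run:)

printf '%s\n' "$one_call"
printf '%s\n' "$per_item"
```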

In our example, we want to run xargs grep "hello" (grep being [command] and “hello” being [initial arguments]), with stdin coming from the output of the find command. Putting this all together, we get the following:

find ./ -name "*.txt" | xargs grep "hello"
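One caveat: because xargs splits on whitespace, a file name containing a space gets broken into pieces. The sketch below shows two common workarounds on a scratch directory. It assumes a find and xargs that support -print0 and -0 (GNU and BSD versions do), and it uses grep -l to print only the matching file names. The directory layout and file names are invented.

```shell
# Scratch tree with a space in both a directory name and a file name.
demo=$(mktemp -d)
mkdir -p "$demo/child dir"
printf 'hello world\n'  > "$demo/a.txt"
printf 'hello again\n'  > "$demo/child dir/b c.txt"
printf 'nothing here\n' > "$demo/child dir/d.txt"
printf 'hello\n'        > "$demo/notes.md"

# NUL-delimited hand-off: names with spaces survive intact.
matches=$(cd "$demo" && find . -name '*.txt' -print0 | xargs -0 grep -l "hello" | sort)

# Alternative without xargs: let find batch the arguments itself.
exec_matches=$(cd "$demo" && find . -name '*.txt' -exec grep -l "hello" {} + | sort)

printf '%s\n' "$matches"
rm -rf "$demo"
```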

Combining commands together is the strength of the UNIX design philosophy. The various command line utilities are designed to play well with one another, using the output from one as the input into another. Think of each utility as a puzzle piece that can fit together with any other puzzle piece, combining in interesting ways to solve complex problems. Often there will be many possible solutions to a given problem; such is the versatility of the platform!

Fun with sed

Given that I’ve spent most of my employment history at Microsoft shops, I’m always looking for an excuse to use CYGWIN more and gain more experience with the UNIX command line. I ran into a problem a while back and found a good opportunity to do so. Imagine the following: Deep within the bowels of a classic asp web application, there lives a giant monstrosity of a VBScript case statement spanning thousands of lines. This eyesore is responsible for maintaining cobranding image URLs for various client companies. It looks something like this:

select case companyName
case "ACME"
	logoURL = "/images/ACMELogo.jpeg"
	loginSplashScreenURL = "/images/ACMESplashscreen.jpeg"
	bannerURL = "/images/ACMEBanner.jpeg"
case "WidgetCorp"
	logoURL = "/images/WidgetCorpLogo.jpeg"
	loginSplashScreenURL = "/images/WidgetCorpSplashscreen.jpeg"
	bannerURL = "/images/WidgetCorpBanner.jpeg"
...

Every time a new company wanted to set up cobranding, the support team would need to manually add another entry to this already gargantuan case statement. Note of course that in the beginning, there were not many client companies, so the natural solution, a database, didn’t seem like it would be necessary. Naturally, it became obvious after a few years that this wasn’t the best solution. Migrating everything to a database started to make a lot of sense.

This task would need to be split up into two parts. First, I’d need to create a new page where ops could upload and manage these images. The second part would be to import the existing logos into the database. The powers that be ultimately decided that only the “logoURL” variables would need to be migrated for now. I decided to create an ASP migration page that would loop through all the existing logos and import them into the database. In order to do so, I’d need to come up with a way to parse the existing case statement, and come up with an array of company names and their corresponding logo URLs. To make this parsing easier, I decided to store everything into one array, with even numbered indexes containing the company, and odd indexes containing the logo. It would look something like this:

migrationArray = array("ACME",_
                       "/images/ACMELogo.jpeg",_
                       "WidgetCorp",_
                       "/images/WidgetCorpLogo.jpeg",_
                       .....)

And this is when I decided to use Sed to generate the array. Sed is the UNIX stream editor. Tedious batch editing scenarios such as the one I described are among the things Sed was designed for. In a nutshell: You specify a set of commands and an input file. Sed then reads the input file, one line at a time, while applying the appropriate commands. Commands consist of an optional address (which line numbers this command applies to), a pattern (a regex that’s then matched against the line), and an action (append, delete, substitute, and so on).

Think of Sed as a Turing machine that can only move forward, not backward (although sed does have a hold buffer that addresses this problem). For more complex tasks, a general purpose scripting language such as Ruby or Python would probably make more sense. However in most cases, Sed can generate clean and concise solutions. Sure, a lot of this text editing functionality can be done manually via find and replace. However, automating things makes life easier. Laziness is a virtue. By storing all my sed commands in a file, I can automate the editing process. This lets me easily tweak the commands and rerun them at my convenience.

In order to accomplish the text transform that I wanted, I first copied and pasted the case statement into a separate text file. This would help simplify my sed commands; I wouldn’t need to worry about parsing out the rest of the asp file, which was quite large. After doing so, I began the task of coming up with the list of sed commands that I would need. First on the list: delete unwanted lines. Since I only cared about the case statement and the logo URLs, everything else would need to be removed. Sed provides a handy delete command that does exactly this. “d” tells sed to delete any line that matches the given preceding pattern (denoted by a leading and trailing “/”). The regex syntax here should be familiar to those who use javascript or perl:

/select case companyName/d
/loginSplashScreenURL/d
/bannerURL/d
/^\s*$/d

The first line deletes the select case statement. The next two lines delete the bannerURL and loginSplashScreenURL assignment statements. The last line deletes all whitespace lines. “^” and “$” mean start of line and end of line in regex, and “\s” matches any whitespace character. Put another way, the statement basically means: “delete any line where only whitespace exists between start and end of the line”.
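Here is a small sketch of those four deletes run against a trimmed copy of the sample case statement. It writes the whitespace class as [[:space:]] instead of \s, on the assumption that the POSIX spelling is accepted by more sed implementations. The blank line in the input is there to exercise the last command.

```shell
# Feed a few representative lines through the four delete commands.
kept=$(printf '%s\n' \
    'select case companyName' \
    'case "ACME"' \
    '    logoURL = "/images/ACMELogo.jpeg"' \
    '    loginSplashScreenURL = "/images/ACMESplashscreen.jpeg"' \
    '    bannerURL = "/images/ACMEBanner.jpeg"' \
    '' \
    'case "WidgetCorp"' \
    '    logoURL = "/images/WidgetCorpLogo.jpeg"' |
  sed -e '/select case companyName/d' \
      -e '/loginSplashScreenURL/d' \
      -e '/bannerURL/d' \
      -e '/^[[:space:]]*$/d')

# Only the case lines and the logoURL lines should remain.
printf '%s\n' "$kept"
```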

The next thing that I would need to take care of would be to strip out unwanted characters. The substitute command (think find and replace in a typical text editor) does exactly this. “s” tells sed to look at each line in the file, and replace the first occurrence of a given regex (denoted again by a leading and trailing “/”) , with a given replacement string (which is also denoted by a leading and trailing “/”). Since we are stripping out these characters, the replacement string is blank, hence the “//”:

s/case //
s/logoURL = //
s/'.*$//
s/\s*//g

The first statement removes the “case” from the case statement, keeping the actual company name. Likewise, the second statement keeps only the right hand side of the logoURL assignment. The third statement removes all comments (' denotes a comment in vbscript). The fourth statement removes all white space. Note the use of “g”, which tells sed to do the replacement on every match in a given line (again, this syntax should be familiar to those who have used scripting languages that support some variant of the POSIX regex standard).
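As a sanity check, this sketch applies the four substitutions to two sample lines, one of which carries a made-up trailing comment. The comment rule is written as '.*$ (a quote, then everything to end of line), and [[:space:]] again stands in for \s as an assumption about portability.

```shell
# Two lines that have already survived the delete step.
stripped=$(printf '%s\n' \
    "case \"ACME\" ' first client" \
    "    logoURL = \"/images/ACMELogo.jpeg\"" |
  sed -e 's/case //' \
      -e 's/logoURL = //' \
      -e "s/'.*\$//" \
      -e 's/[[:space:]]*//g')

# What is left is just the quoted company name and the quoted URL.
printf '%s\n' "$stripped"
```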

Now, since this is an array, I need to append a comma after every item in the array. Because each item is on its own line, and because vbscript is an ugly language, I also need to append an underscore as well (this tells vbscript that this is a multi line statement). Substitute is a flexible command and can be used to append text by using “$” in the pattern (which means end of line in regex). The following statement means “replace the end of the line with a comma and an underscore”:

s/$/,_/

I also need to add the array declaration at the start of the output file:

1 s/^/migrationArray = array(/

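Here I’m using the optional address argument. The “1” specifies the first line in the input file (to do something such as specify lines 1 through 10, you would use “1,10” instead). The “^” in the pattern specifies the start of line. Translated literally, this means: replace the start of the first line in the input file with “migrationArray = array(“. Note that addresses count input lines, and a line removed by “d” never reaches later commands, so the address has to name a line that survives the deletes; if “select case companyName” is itself line 1 of the input file, the first surviving line is line 2 and the address must be “2”.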
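Line addresses are counted against the input, not the output, which matters once a delete command is in the mix. The sketch below, using invented sample lines, shows a plain address, then what happens when line 1 is deleted before the addressed substitute can run, and finally the same script with the address moved to the first surviving line.

```shell
# Address 1 applies the substitute to the first input line only.
prefixed=$(printf 'one\ntwo\n' | sed '1 s/^/> /')

# Line 1 is deleted, which ends processing for that line, so "1 s" never fires.
skipped=$(printf 'header\none\ntwo\n' | sed -e '/header/d' -e '1 s/^/> /')

# Addressing input line 2 (the first survivor) gives the intended result.
shifted=$(printf 'header\none\ntwo\n' | sed -e '/header/d' -e '2 s/^/> /')

printf '%s\n' "$prefixed" "$skipped" "$shifted"
```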

The final result looks something like this (as with most UNIX tools, “#” specifies a comment):

#delete these lines:
/select case companyName/d
/loginSplashScreenURL/d
/bannerURL/d
/^\s*$/d

#strip out the following:
s/case //
s/logoURL = //
s/'.*$//
s/\s*//g
 

#append:
s/$/,_/

#prepend: 
1 s/^/migrationArray = array(/

With all the commands saved into the file “migrationscript.sed” (the .sed file extension is just a formality that lets me know that this file contains sed commands), I then ran the commands via:

sed -f migrationscript.sed  casestatement.txt > output.txt

where casestatement.txt contains the original case statement, and output.txt is the file I’m redirecting sed’s output to.

The only thing missing is replacing the final “,_” with the closing parentheses at the end of the file. You can specify the last line in sed, but this is the last line in the input file, which is not the same as the last line in the final output file (lines could have been added or deleted). Since I know the last logo URL occurs in the third to last line in the file, I could try specifying the third to last line as a line number argument to the substitute command. Unfortunately, since sed is a stream editor, it doesn’t know it has reached the last line until it has actually reached the end of the file. By then it’s too late. Sed does have a “hold buffer” that could potentially provide a solution. To me, that came too close to crossing the complexity threshold where using a more powerful and general purpose scripting language such as Python would make sense. So, in order to keep things simple, I ran this sed command afterward, writing to a new file because redirecting sed’s output onto its own input file would truncate that file before sed could read it:

sed '$ s/,_$/)/' output.txt > final.txt

“$” is a line number argument that refers to the last line in the file (note that it is specified in the address portion of the command, and not the pattern portion, where it would mean the end of the current line). Since now the output from the prior sed command is the input file, the last line is the one we actually want to edit. And so, by slapping two sed commands into a bash script, I was able to automate the tedious editing of a mountain of text. HOORAY! In this author’s humble opinion, text transforms such as this are much more fun than say …. using XSL to transform XML. *Shudders*
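For anyone who wants to replay the whole thing, here is a rough end-to-end sketch on the two-company sample. It is not the original script: it works in a temp directory from mktemp, spells whitespace as [[:space:]], writes the second pass to a separate final.txt so the input is not clobbered, and prepends at address 2 because in this sample the deleted "select case" line is input line 1.

```shell
work=$(mktemp -d)

# The sed commands, saved to a file as described above.
cat > "$work/migrationscript.sed" <<'EOF'
#delete these lines:
/select case companyName/d
/loginSplashScreenURL/d
/bannerURL/d
/^[[:space:]]*$/d
#strip out the following:
s/case //
s/logoURL = //
s/'.*$//
s/[[:space:]]*//g
#append:
s/$/,_/
#prepend (input line 2 is the first line that survives the deletes):
2 s/^/migrationArray = array(/
EOF

# A two-company copy of the case statement.
cat > "$work/casestatement.txt" <<'EOF'
select case companyName
case "ACME"
    logoURL = "/images/ACMELogo.jpeg"
    loginSplashScreenURL = "/images/ACMESplashscreen.jpeg"
    bannerURL = "/images/ACMEBanner.jpeg"
case "WidgetCorp"
    logoURL = "/images/WidgetCorpLogo.jpeg"
    loginSplashScreenURL = "/images/WidgetCorpSplashscreen.jpeg"
    bannerURL = "/images/WidgetCorpBanner.jpeg"
EOF

# Pass 1: build the array body. Pass 2: close it on the last output line.
sed -f "$work/migrationscript.sed" "$work/casestatement.txt" > "$work/output.txt"
sed '$ s/,_$/)/' "$work/output.txt" > "$work/final.txt"

final=$(cat "$work/final.txt")
printf '%s\n' "$final"
rm -rf "$work"
```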