Saturday, February 13, 2010

Searching the contents of text files

EXECUTIVE SUMMARY

Search for the string 'Recipe' in all files that have the .org or .html extension anywhere in the current directory or below, ensuring that the filename is prepended to all matches:

> grep -e 'Recipe' `find .  \( -name "*.org" -o -name "*.html" \)` /dev/null

Same as above except all non-binary files are searched:

> grep -HIre 'Recipe' *

SUPPORTING JABBER

Consider the situation where you have many text files in a certain directory tree and you want to discover which files have particular content. Here we discuss the use of grep and find to help solve this problem. Modern versions of grep remove the need to use find, and we will discuss that method after the one applicable to more disadvantaged systems.

The grep command is used to search the contents of files. A familiar output is to have the filename prepended to the line that matches the search, for example

> grep -e 'ground' *
photos.org:   background.  I also had the privilege of seeing the physical
photos.org:   ground on the night of the 27th.  On the morning of the 28th there
Quotes.org:going to take a lovely, simple melody and drive it into the ground. --

It is tempting to interpret the prepended filename as the overall default. However, whether the filename appears also depends on how grep was invoked. Specifically, when grep is given a single file to search, the filename is not prepended,

> grep -e 'ground' photos.org
   background.  I also had the privilege of seeing the physical
   ground on the night of the 27th.  On the morning of the 28th there

This is reasonable behavior from the perspective of grep: since only a single file was given, there should be no doubt which file contained the match. As we will see below, there are times when grep may be provided a single file but the user does not know what that file is. In these cases we want to force the filename to be identified. One way to do this is to pass grep the real file and one other file whose contents will never match the search expression, for example /dev/null. Witness the difference,

> grep -e 'ground' photos.org /dev/null
photos.org:   background.  I also had the privilege of seeing the physical
photos.org:   ground on the night of the 27th.  On the morning of the 28th there
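The effect is easy to reproduce in isolation. The following is only a sketch; the scratch directory and the one-line photos.org are made up for the demonstration.

```shell
# Build a throwaway directory so the example does not touch real files.
demo=$(mktemp -d)
printf 'solid ground\n' > "$demo/photos.org"

# One file argument: the match is printed without a filename.
single=$(cd "$demo" && grep -e 'ground' photos.org)

# Adding /dev/null makes two file arguments, so the filename is prepended.
forced=$(cd "$demo" && grep -e 'ground' photos.org /dev/null)

printf '%s\n' "$single" "$forced"
rm -rf "$demo"
```

The first line printed is the bare match and the second carries the photos.org: prefix.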

Before continuing, there are two observations to be made about the grep invocations above. First, and almost as an aside, the calls could have been written just a bit more simply by dropping the -e switch and the quote marks. However, this construct allows for more complex search expressions. An example is to find either the string 'ground' or the string 'Recipe' in any file,

> grep -e 'ground\|Recipe' *
photos.org:   background.  I also had the privilege of seeing the physical
photos.org:   ground on the night of the 27th.  On the morning of the 28th there
Quotes.org:going to take a lovely, simple melody and drive it into the ground. --
Recipes.org:#+TITLE: Recipes
sitemap.org:   + [[file:Recipes.org][Recipes]]
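One caveat: the \| alternation in a basic regular expression is a GNU extension, so it may not work with every grep. As a sketch of two alternatives that should behave the same way, the pattern can be split across repeated -e switches, or -E can be used to select extended regular expressions. The scratch files here are invented for the demonstration.

```shell
# Throwaway files: two that match and one that does not.
demo=$(mktemp -d)
printf 'into the ground\n' > "$demo/Quotes.org"
printf '#+TITLE: Recipes\n' > "$demo/Recipes.org"
printf 'nothing here\n' > "$demo/other.org"

# Repeating -e gives one pattern per switch; a line matches if any pattern does.
multi=$(cd "$demo" && grep -e 'ground' -e 'Recipe' *.org)

# -E selects extended regular expressions, where a bare | means "or".
ext=$(cd "$demo" && grep -E -e 'ground|Recipe' *.org)

printf '%s\n' "$multi"
rm -rf "$demo"
```

Both forms report the same two lines, one from Quotes.org and one from Recipes.org.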

The observation that pertains directly to the problem at hand is that the list of files for grep to search must be specified somehow. If all the files are in the same directory, then a simple wildcard expression might be all that is needed. However, sometimes the search is to be done recursively or across several directories.

The find command is useful for finding files on the system with particular characteristics. As an example, the following expression finds all files in the current directory and below that have either a .org or .html extension,

> find .  \( -name "*.org" -o -name "*.html" \)
backcountry/photos.html
backcountry/readme.html
backcountry/maintenance.html
backcountry/sitemap.html
backcountry/index.html 
 [--snip--]
templates/rketburt-01-Level00.org
templates/rketburt-01-Level01.org
 [--snip--]

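Be aware that the space after the \( and before the \) proved to be vital while testing commands for this article. This is a general necessity rather than a quirk of one system: find expects each parenthesis as a separate argument, and the shell only separates arguments at whitespace, so \(-name would reach find as the single word (-name.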
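The parentheses themselves start to matter as soon as anything follows the two -name tests, because find binds an implied "and" more tightly than -o. A small sketch, using three invented empty files:

```shell
# Three empty files with different extensions.
demo=$(mktemp -d)
touch "$demo/a.org" "$demo/b.html" "$demo/c.txt"

# Grouped: -print applies to files matching either name test.
grouped=$(cd "$demo" && find . \( -name "*.org" -o -name "*.html" \) -print | sort)

# Ungrouped: -print attaches only to the second test, so a.org is not listed.
ungrouped=$(cd "$demo" && find . -name "*.org" -o -name "*.html" -print | sort)

printf 'grouped:\n%s\nungrouped:\n%s\n' "$grouped" "$ungrouped"
rm -rf "$demo"
```

The same grouping issue applies when -exec follows the name tests, as it does later in this article.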

Now it is a simple matter to search the contents of multiple files. We build the file list using find embedded in backticks (`) to capture the result, then invoke grep on that list. Here is a complete example,

> grep -e 'Recipe' `find .  \( -name "*.org" -o -name "*.html" \)` /dev/null
rketburt-org/Recipes.org:#+TITLE: Recipes
rketburt-org/sitemap.org:   + [[file:Recipes.org][Recipes]]
rketburt/sitemap.html:<a href="Recipes.html">Recipes</a>
rketburt/Recipes.html:<title>Recipes</title>
rketburt/Recipes.html:<h1 class="title">Recipes</h1>
rketburt/index.html:<a href="Recipes.html">Recipes</a>

Note the use of /dev/null as a file argument to grep to ensure that the filename is prepended.
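The /dev/null argument has a second benefit: if find matches nothing, grep still has a file to read instead of waiting on standard input. The sketch below shows both cases, written with $( ), which POSIX shells treat the same as backticks. The directory layout is invented, and note that this style still breaks on filenames containing spaces, since the shell splits the list at whitespace.

```shell
# One matching .org file in a subdirectory, plus a .txt file that must be ignored.
demo=$(mktemp -d)
mkdir "$demo/sub"
printf '#+TITLE: Recipes\n' > "$demo/sub/Recipes.org"
printf 'Recipe in a text file\n' > "$demo/notes.txt"

# find builds the file list; /dev/null forces the filename to be shown.
found=$(cd "$demo" && grep -e 'Recipe' $(find . \( -name "*.org" -o -name "*.html" \)) /dev/null)

# Empty file list: grep searches only /dev/null and returns at once.
# grep exits non-zero when nothing matches, hence the "|| true".
empty=$(cd "$demo" && grep -e 'Recipe' $(find . -name "*.nomatch") /dev/null || true)

printf '%s\n' "$found"
rm -rf "$demo"
```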

Another way to effect the same final result is to invoke find first and use the -exec argument to call grep. In this ordering grep is only provided with a single file at a time, which leads to the missing-filename problem described earlier. The overall syntax is a bit more cumbersome as well, since {} is used to pass the result of find to grep and a trailing \; is required. An example equivalent to the one in the previous paragraph is

> find .  \( -name "*.org" -o -name "*.html" \) -exec grep -e 'Recipe' {} /dev/null \;

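Syntax or preferences aside, it is interesting to note that while these two examples provided the same end result, the one that begins with grep executed nearly 10 times faster. A likely reason is that -exec with \; starts a separate grep process for every file found, whereas the backtick form starts grep only once.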
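POSIX find also accepts + in place of \; at the end of -exec, which passes many filenames to each grep invocation instead of one. This was not timed for this article, so treat the following as a sketch rather than a measured improvement. It does have the side benefit of coping with spaces in filenames; the "my notes" directory is invented to show that.

```shell
# Two matching files, one of them under a directory with a space in its name.
demo=$(mktemp -d)
mkdir "$demo/my notes"
printf '#+TITLE: Recipes\n' > "$demo/my notes/Recipes.org"
printf '<title>Recipes</title>\n' > "$demo/index.html"

# {} must come directly before +, so /dev/null is placed ahead of it.
batched=$(cd "$demo" && find . \( -name "*.org" -o -name "*.html" \) -exec grep -e 'Recipe' /dev/null {} + | sort)

printf '%s\n' "$batched"
rm -rf "$demo"
```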

The find command has been used above for two reasons. First, the desire was to search files that may appear in directories below the one called out. In other words, we desired a recursive search. The second reason was to eliminate the prospect of searching non-text files, which would simply have been a time sink. The method to exclude the binary files was to limit the file extensions to just two (.org and .html). This may be the exact behavior desired for some questions, but may be too restrictive for others.

Modern versions of grep permit both recursive searching (-r) and binary file exclusion (-I). Additionally, prepending the filename can be specified (-H) even in the event only a single file is searched. To find all text files in or below the current directory that contain the string 'Recipe', the command is now simply

> grep -HIre 'Recipe' *

During testing for this article, this command at its fastest took only about twice as long as the grep that uses find in backticks, and at its slowest took over 100 times as long. This difference may have been due to the system load, or possibly to the fact that there were hundreds of files that together total nearly 2GB. Even so, there may be times when the blind search is well worth the time spent to discover something.
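When the blind search is too slow, a middle ground may be available. GNU grep (and some other implementations) accept --include to restrict a recursive search to filenames matching a pattern, which recovers the .org and .html filter without find. This variant was not part of the timing tests above, and the files below are invented. Giving . instead of * as the starting point also covers dot-files in the top directory, which the shell does not expand * to.

```shell
# One .org file that should be searched and one .txt file that should be skipped.
demo=$(mktemp -d)
mkdir "$demo/sub"
printf '#+TITLE: Recipes\n' > "$demo/sub/Recipes.org"
printf 'Recipe notes\n' > "$demo/sub/notes.txt"

# --include may be repeated; only files matching one of the globs are read.
limited=$(cd "$demo" && grep -HIr --include='*.org' --include='*.html' -e 'Recipe' .)

printf '%s\n' "$limited"
rm -rf "$demo"
```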
