Inside F#

Brian's thoughts on F# and .NET

Pipelining in F#

Posted by Brian on March 30, 2008

In my previous blog entry, I mentioned how pipelining interacts with type inference.  I wanted to find a natural example to talk about writing pipelines in F#, and Wikipedia helped out.  The Wikipedia page describing Unix command-line pipelining has an example of a Unix pipeline that spell-checks a web page at a given URL.  So let’s write a similar tool in F#.  The logic of the individual steps in F# is a little different from the Unix example, but it’s still a great example of a long pipeline in action.

open System.IO
open System.Net
open System.Text.RegularExpressions

// fetch a web page
let HttpGet (url: string) =
    let req = WebRequest.Create(url)
    // 'use' disposes the response, stream, and reader when the function exits,
    // even if reading throws
    use resp = req.GetResponse()
    use stream = resp.GetResponseStream()
    use reader = new StreamReader(stream)
    reader.ReadToEnd()
    
// Use Word to spellcheck (assumes you have referenced Microsoft.Office.Interop.Word.dll)
let msword = new Microsoft.Office.Interop.Word.ApplicationClass()
let mutable x = System.Reflection.Missing.Value :> System.Object
let Spellcheck text = 
    msword.CheckSpelling(text, &x, &x, &x, &x, &x, &x, &x, &x, &x, &x, &x, &x)

// find all misspelled words on a particular web page, using a pipeline 
printfn "misspelled words:"
HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
|> fun s -> Regex.Replace(s, "[^A-Za-z']", " ")
|> fun s -> Regex.Split(s, " +")
|> Set.ofArray 
|> Set.filter (fun word -> not (Spellcheck word))
|> Set.iter (fun word -> printfn "   %s" word)

Let’s examine the pipeline (the last six lines of code) more closely.  We start by fetching a page from the web (a page I clearly picked at random, with absolutely no significance whatsoever :) ) using our HttpGet function, which returns the contents as a string:

HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"

Then we pipe this string into a function that produces a new string where all the non-word characters have been replaced by spaces:

|> fun s -> Regex.Replace(s, "[^A-Za-z']", " ")

Then we pipe that string into a function that splits the string on whitespace into an array of words:

|> fun s -> Regex.Split(s, " +") 
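One detail worth knowing about this step: when the input starts or ends with a match, Regex.Split includes an empty string at that end of the result.  Here is a minimal sketch of that behavior; the sample text and the Array.filter cleanup are assumptions added for illustration and are not part of the original tool.

```fsharp
open System.Text.RegularExpressions

// Hypothetical sample input with a leading and a trailing space
let sample = " hello  wrold "

// Splitting on runs of spaces leaves an empty string at each end
let raw = Regex.Split(sample, " +")
// raw = [| ""; "hello"; "wrold"; "" |]

// One possible cleanup (an addition, not in the original pipeline):
// drop the empty entries before passing the words along
let cleaned = raw |> Array.filter (fun w -> w <> "")
// cleaned = [| "hello"; "wrold" |]
```

In the pipeline above, an empty string would simply flow into the Set and on to the spell-checker like any other word.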

Then we’d like to sort that array and get rid of duplicates, and we can do that simply by creating a Set of the words:

|> Set.ofArray 
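To see the sort-and-deduplicate effect in isolation, here is a small sketch using a made-up word array (the sample data is hypothetical):

```fsharp
// Hypothetical word array with a duplicate, deliberately out of order
let sampleWords = [| "pear"; "apple"; "pear"; "fig" |]

// Set.ofArray drops the duplicate; enumerating the set yields sorted order
let distinctSorted = sampleWords |> Set.ofArray |> Set.toList
// distinctSorted = ["apple"; "fig"; "pear"]
```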

Then we want to filter down the set to only those words which fail to spell-check:

|> Set.filter (fun word -> not (Spellcheck word)) 

And finally, we print out each remaining (misspelled) word in the Set:

|> Set.iter (fun word -> printfn "   %s" word) 

That’s a fine example of code that reads very naturally in a long pipeline. 
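If you want to try the pipeline without Word installed or a network connection, something like the following sketch should work.  It assumes two stand-ins that are not in the original code: a hard-coded string in place of HttpGet, and SpellcheckStub, a tiny fixed dictionary in place of the Word-based Spellcheck.  It also adds an Array.filter step to drop empty strings produced by the split.

```fsharp
open System.Text.RegularExpressions

// Hypothetical stand-in for the Word-based Spellcheck: a tiny fixed dictionary
let knownWords = set [ "the"; "quick"; "brown"; "fox"; "jumps" ]
let SpellcheckStub (word: string) = knownWords.Contains(word.ToLowerInvariant())

// Hypothetical stand-in for HttpGet: hard-coded "page" text
let pageText = "The quick brwon fox jumps, the fox jmups!"

let misspelled =
    pageText
    |> fun s -> Regex.Replace(s, "[^A-Za-z']", " ")
    |> fun s -> Regex.Split(s, " +")
    |> Array.filter (fun w -> w <> "")   // added step: drop empty strings left by Split
    |> Set.ofArray
    |> Set.filter (fun word -> not (SpellcheckStub word))

// Prints the two misspelled words, "brwon" and "jmups"
misspelled |> Set.iter (fun word -> printfn "   %s" word)
```

The shape of the pipeline is unchanged; only the two ends (where the text comes from and how a word is checked) have been swapped for local substitutes.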

(Pipelining also demonstrates one of the benefits of well-authored functions that take arguments in curried form rather than tupled form.  However, I’ll save those details for another blog entry.)
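As a rough preview of that point, here is one way it could look.  The wrappers replacePattern and splitOn are hypothetical names invented for this sketch; they wrap the tupled .NET methods in curried functions whose last parameter is the value being piped, so partial application can stand in for the lambdas.

```fsharp
open System.Text.RegularExpressions

// Hypothetical curried wrappers: the piped value is the LAST parameter
let replacePattern (pattern: string) (replacement: string) (input: string) =
    Regex.Replace(input, pattern, replacement)

let splitOn (pattern: string) (input: string) =
    Regex.Split(input, pattern)

// With curried functions, partial application removes the need for lambdas
let wordsOf (text: string) =
    text
    |> replacePattern "[^A-Za-z']" " "
    |> splitOn " +"
    |> Set.ofArray

let demoWords = wordsOf "spell check, check spell"
// demoWords = set ["check"; "spell"]
```

Set.filter and Set.iter already follow this convention, which is why they slot into the pipeline without any wrapping.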

Of course, we could have written our web-spell-checker without using pipelining.  In one big expression, it looks like this:

(Set.iter (fun word -> printfn "   %s" word)
    (Set.filter 
        (fun word -> not (Spellcheck word))
        (Set.ofArray
            (Regex.Split(
                Regex.Replace(
                    HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1",
                    "[^A-Za-z']",
                    " "
                ), " +"
            ))
        )
    )
)

Yuck.  This looks similar to how you’d write this code as one huge expression in any language (e.g. C#), and it reveals the problem with using huge expressions: they’re very hard to read.  This is true especially because expressions evaluate inside-out, which means the "first thing that happens" is nested in the middle of the expression, and then its result is passed as an argument to some function surrounding it, whose result is passed as an argument to a function surrounding it… Code with deeply nested function calls just reads unnaturally to humans, since it’s easier to grok things in small bits in the sequence in which they occur.  As a result, the other main way to write such code without pipelining is to name each intermediate expression:

let page = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
let pageWords = Regex.Replace(page, "[^A-Za-z']", " ")
let wordArray = Regex.Split(pageWords, " +")
let wordSet = Set.ofArray wordArray
let filteredWordSet = Set.filter (fun word -> not (Spellcheck word)) wordSet
Set.iter (fun word -> printfn "   %s" word) filteredWordSet

That code reads pretty well.  The functions now appear in the order they will be executed, and each line "does one thing".  Each intermediate result is named, and then referenced by name.  Depending on the exact nature of what you’re coding, these names can be either a good thing or a bad thing, I think, either helping or hindering the readability of the code.  In general, I feel you should name intermediate values only if the name you introduce adds readability-value to the code by explaining what’s happening.  In this example it’s actually a pretty close call, assuming the reader has a modest familiarity with the functions in the Regex class and the Set module.  So you’ll have to tune your personal coding aesthetic to decide whether to code like the example above with all the "let"s, or using the pipelining style:

HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1" 
|> fun s -> Regex.Replace(s, "[^A-Za-z']", " ")
|> fun s -> Regex.Split(s, " +") 
|> Set.ofArray  
|> Set.filter (fun word -> not (Spellcheck word)) 
|> Set.iter (fun word -> printfn "   %s" word) 

Of course, it’s not all-or-nothing.  A mix of styles often produces the most readable code, and I am rather fond of:

let page = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
let words = Regex.Replace(page, "[^A-Za-z']", " ")
            |> fun s -> Regex.Split(s, " +")
            |> Set.ofArray
words |> Set.filter (fun word -> not (Spellcheck word))
      |> Set.iter (fun word -> printfn "   %s" word)

for this example.  I think I like this form of the code because it enables the eye to scan down the left side of the code and see the major conceptual tasks happening: we’re going to get a page, we’re going to compute the words, and we’re going to do something with those words.

So there you go.  Hopefully now you have a better understanding of the pipeline operator "|>" – how it works and when you might choose to use it.  Given what a simple thing it is (recall that "x |> f" just means "f x"), this little operator sure provokes a lot of discussion!
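To make the "x |> f just means f x" point concrete, here is a sketch of how such an operator can be written by hand.  The name |>> is hypothetical, chosen only so the sketch does not shadow the built-in operator; the library's own definition is essentially the same one-liner.

```fsharp
// A hand-rolled pipeline operator; the name |>> is hypothetical and is used
// only to avoid shadowing the built-in |>
let inline (|>>) x f = f x

// The same computation written three ways
let viaCustom  = [3; 1; 2] |>> List.sort |>> List.rev
let viaBuiltIn = [3; 1; 2] |> List.sort |> List.rev
let viaPlain   = List.rev (List.sort [3; 1; 2])
// all three are [3; 2; 1]
```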

3 Responses to “Pipelining in F#”

  1. Art said

    Thanks Brian.
    More, encore, F# programming style is needed.

  2. Amit said

    Thanks for the article. Just a couple of questions:

    (a) Could we not have written

    "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
    |> HttpGet
    |> and_so_on

    just to satisfy the aesthete in us?

    (b) Is it possible to pipeline more than one argument at a time, or do we have to curry the function? For instance, what would be the correct way to effect the following (intended) pipeline:

    2 3
    |> fun (x : int) (y : int) -> x + y;;

    Visual Studio won’t compile the above code, and making an ordered pair out of the two numbers results in a mismatch.

    Your materials and responses at hubfs have really helped me a great deal.

  3. Brian said

    (a) yes

    (b) not like that (as "2 3" parses as "apply the function '2' to the argument '3'"), but you can author a different operator, e.g. ||>, that takes a tuple on the left (actually, such an operator may already be in the library, I forget)
