Inside F# (Brian's blog)
March 30

Pipelining in F#

In my previous blog entry, I mentioned how pipelining interacts with type inference. I wanted to find a natural example to talk about writing pipelines in F#, and Wikipedia helped out. On this Wikipedia page describing unix command-line pipelining, there is an example of a unix pipeline that spell-checks a web page at a given URL. So let's write a similar tool in F#. The logic of the individual steps in F# is a little different from the unix example, but it's nevertheless still a great example of a long pipeline in action.

    open System.IO
    open System.Net
    open System.Text.RegularExpressions

    // fetch a web page
    let HttpGet (url: string) =
        let req = System.Net.WebRequest.Create(url)
        let resp = req.GetResponse()
        let stream = resp.GetResponseStream()
        let reader = new StreamReader(stream)
        let data = reader.ReadToEnd()
        resp.Close()
        data

    // Use Word to spellcheck (assumes you have referenced Microsoft.Office.Interop.Word.dll)
    let msword = new Microsoft.Office.Interop.Word.ApplicationClass()
    let mutable x = System.Reflection.Missing.Value :> System.Object
    let Spellcheck text =
        msword.CheckSpelling(text, &x, &x, &x, &x, &x, &x, &x, &x, &x, &x, &x, &x)

    // find all misspelled words on a particular web page, using a pipeline
    printfn "misspelled words:"
    HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
    |> fun s -> Regex.Replace(s, "[^A-Za-z']", " ")
    |> fun s -> Regex.Split(s, " +")
    |> Set.of_array
    |> Set.filter (fun word -> not (Spellcheck word))
    |> Set.iter (fun word -> printfn " %s" word)

Let's examine the pipeline (the last six lines of code) more closely.
We start by fetching a page from the web (a page I clearly picked at random, with absolutely no significance whatsoever :) ) using our HttpGet function, which returns the contents as a string:

    HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"

Then we pipe this string into a function that produces a new string where all the non-word characters have been replaced by spaces:

    |> fun s -> Regex.Replace(s, "[^A-Za-z']", " ")

Then we pipe that string into a function that splits the string on whitespace into an array of words:

    |> fun s -> Regex.Split(s, " +")

Then we'd like to sort that array and get rid of duplicates, and we can do that simply by creating a Set of the words:

    |> Set.of_array

Then we want to filter down the set to only those words which fail to spell-check:

    |> Set.filter (fun word -> not (Spellcheck word))

And finally, we print out each remaining (misspelled) word in the Set:

    |> Set.iter (fun word -> printfn " %s" word)

That's a fine example of code that reads very naturally in a long pipeline. (Pipelining also demonstrates one of the benefits of well-authored functions that take arguments in curried form rather than tupled form. However, I'll save those details for another blog entry.)

Of course, we could have written our web-spell-checker without using pipelining. In one big expression, it looks like this:

    (Set.iter (fun word -> printfn " %s" word)
        (Set.filter (fun word -> not (Spellcheck word))
            (Set.of_array
                (Regex.Split(
                    Regex.Replace(
                        HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1",
                        "[^A-Za-z']", " "
                    ),
                    " +"
                ))
            )
        )
    )

Yuck. This looks similar to how you'd write this code as one huge expression in any language (e.g. C#), and it reveals the problem with using huge expressions: they're very hard to read.
This is true especially because expressions evaluate inside-out, which means the "first thing that happens" is nested in the middle of the expression, and then its result is passed as an argument to some function surrounding it, which is passed as an argument to a function surrounding it... Code with deeply nested function calls just reads unnaturally to humans, since it's easier to grok things in small bits in the sequence in which they occur.

As a result, the other main way to write such code without pipelining is to name each intermediate expression:

    let page = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
    let pageWords = Regex.Replace(page, "[^A-Za-z']", " ")
    let wordArray = Regex.Split(pageWords, " +")
    let wordSet = Set.of_array wordArray
    let filteredWordSet = Set.filter (fun word -> not (Spellcheck word)) wordSet
    Set.iter (fun word -> printfn " %s" word) filteredWordSet

That code reads pretty well. The functions now appear in the order they will be executed, and each line "does one thing". Each intermediate result is named, and then referenced by name. Depending on the exact nature of what you're coding, these names can be either a good thing or a bad thing, I think, either helping or hindering the readability of the code. In general, I feel you should name intermediate values only if the name you introduce adds readability-value to the code by explaining what's happening. In this example it's actually a pretty close call, assuming the reader has a modest familiarity with the functions in the Regex class and the Set module.
So you'll have to tune your personal coding aesthetic to decide whether to code like the example above with all the "let"s, or using the pipelining style:

    HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
    |> fun s -> Regex.Replace(s, "[^A-Za-z']", " ")
    |> fun s -> Regex.Split(s, " +")
    |> Set.of_array
    |> Set.filter (fun word -> not (Spellcheck word))
    |> Set.iter (fun word -> printfn " %s" word)

Of course, it's not all-or-nothing. A mix of styles often produces the most readable code, and I am rather fond of:

    let page = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
    let words =
        Regex.Replace(page, "[^A-Za-z']", " ")
        |> fun s -> Regex.Split(s, " +")
        |> Set.of_array
    words
    |> Set.filter (fun word -> not (Spellcheck word))
    |> Set.iter (fun word -> printfn " %s" word)

for this example. I think I like this form of the code because it enables the eye to scan down the left side of the code and see the major conceptual tasks happening: we're going to get a page, we're going to compute the words, and we're going to do something with those words.

So there you go. Hopefully now you have a better understanding of the pipeline operator "|>" - how it works and when you might choose to use it. Given what a simple thing it is (recall that "x |> f" just means "f x"), this little operator sure provokes a lot of discussion!
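
To make that last point concrete, here is a minimal, self-contained sketch rather than code from the post itself. It assumes a current F# compiler, where the post's Set.of_array is spelled Set.ofArray. The names pipe, dictionary, spellcheck, and misspelled are hypothetical, and the tiny word list only stands in for the Word-based Spellcheck so that the example runs without Office or a network connection.

```fsharp
open System.Text.RegularExpressions

// "x |> f" just means "f x". This hypothetical 'pipe' function spells that
// out under an ordinary name instead of redefining the built-in operator.
let pipe x f = f x

// Hypothetical stand-in for the Word-based Spellcheck: a word counts as
// correctly spelled if it appears in this tiny, made-up dictionary.
let dictionary = Set.ofList [ "the"; "quick"; "brown"; "fox" ]
let spellcheck (word: string) =
    Set.contains (word.ToLowerInvariant()) dictionary

// Same shape as the post's pipeline, minus the web request.
// Trim is an addition in this sketch so that trailing punctuation
// does not leave an empty string behind as a "word".
let misspelled (text: string) =
    text
    |> (fun s -> Regex.Replace(s, "[^A-Za-z']", " "))
    |> (fun s -> Regex.Split(s.Trim(), " +"))
    |> Set.ofArray
    |> Set.filter (fun word -> not (spellcheck word))

// Both bindings compute the same value: piping is plain function application.
let viaOperator = "brwon" |> spellcheck
let viaPipe = pipe "brwon" spellcheck

// Print each misspelled word in the sample sentence.
misspelled "The quick brwon fox, the qiuck fox!"
|> Set.iter (fun word -> printfn "  %s" word)
```

Run as a script, this should print the two deliberately misspelled words, brwon and qiuck. The lambdas are parenthesized here only to keep the parse unambiguous; the unparenthesized style used in the post reads the same way.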