Thursday, September 19, 2013

4.7: An Extended Example: Web Spiders - Part 4

The Spider is the crawler. It constructs the iterator chain we use to parse the web pages and starts the Fetcher process. Its interface to the world is an i.Forward iterator that streams out urls and some information about those urls. It operates by sending the url to the Fetcher by putting it on the Fetcher queue. It waits then for the Fetcher to return the first page. The Spider then returns the result to the caller and hands the page over to the parsing routine that runs concurrently. The parser runs through the page, queueing any urls that come out of the iterator chain. The user only blocks if there are no pages waiting on the channel from the Fetcher.

type Entry struct {
    Url, Referer string
    StatusCode   int
}

type Spider struct {
    starturl string
    entries  list.List
    queued   map[string]bool
    err      error
    fetcher  *Fetcher
    robot    Robot
    pages    chan *page
}

The Entry struct is what we return to the user. It contains the url, the page that refered it, and the status code that resulted from attempting to fetch it.
The Spider struct contains the state of the operatation: starturl is the starting url, entries contains the data we send to the user, queued we use as a log of urls we've already worked on, fetcher and robot are the two main components of teh system, and pages is the channel we use to send pages to the parsing funciton.

func NewSpider(url string) *Spider {
    url = strings.ToLower(url)
    var s Spider
    s.starturl = url
    s.robot, _ = NewRobot(url)
    s.queued = make(map[string]bool)
    s.queued[url] = true
    s.fetcher = NewFetcher()
    s.fetcher.Queue(url, url)
    s.fetcher.Run()
    s.pages = make(chan *page)
    s.parse()
    p := <-s.fetcher.Pages()
    s.err = s.processEntry(p)
    return &s
}

The constructor creates the initial state of the spider and queues and fetches the starting url. It then waits for the first page and sends it to the parser before returning.

func (s *Spider) Value() interface{} {
    return s.entries.Front().Value
}

func (s *Spider) Error() error {
    return s.err
}
func (s *Spider) AtEnd() bool {
    return s.entries.Len() == 0
}
func (s *Spider) Close() {
    close(s.pages)
    s.fetcher.Close()
}

The value we return to the user is the head of the entries queue. Once its empty, we are done. To clean up resources, the user has to call Close() on the spider, it closes the channel to the parser and shuts down the fetcher.

func (s *Spider) Next() error {
    s.entries.Remove(s.entries.Front())
    if s.AtEnd() {
        for {
            if p, ok := <-s.fetcher.Pages(); ok {
                s.err = s.processEntry(p)
                if !s.AtEnd() {
                    break
                }
            }
        }
    }
    return s.err
}

The Next() function pops the head from the queue. If it is empty it fetches the next page from the channel from the Fetcher and processes it.

func (s *Spider) processEntry(p *page) error {
    if p.err != nil {
        return p.err
    }
    s.entries.PushBack(&Entry{Url: p.url, Referer: p.ref, StatusCode: p.response.StatusCode})
    s.pages <- p
    return nil
}

The processing routine simply creates an Entry to return to the user and then sends the page off to the parsing function.

func (s *Spider) parse() {
    go func() {
        for p := range s.pages {
            if p.response.Body != nil {
                itr :=
                    HostMapper(p.url,
                        NormalizeItr(
                            UrlItr(
                                LinkItr(
                                    NodeItr(p.response.Body, DepthFirst)))))
                if s.robot != nil {
                    itr = RobotItr(s.robot, itr)
                }
                itr = BindByRef(s.starturl, Referer(p.url, itr))
                for ; !itr.AtEnd(); itr.Next() {
                    urlpair, _ := itr.Value().([]string)
                    url := urlpair[0]
                    if _, ok := s.queued[url]; !ok {
                        s.queued[url] = true
                        s.fetcher.Queue(url, urlpair[1])
                    }
                }
                p.response.Body.Close()
            }
        }
    }()
}

The parsing function loops on the channel of pages, blocking until a new one is available. When it comes through it starts a chain of iterators to filter the links on the page. Every link that comes through the chain is queued to be processed by the Fetcher.


The iterator channel built in the parsing routine is made of eight components that transform the html nodes to a pair of strings representing the url and its referer. Those processes are for the most part specialitations of i.Map and i.Filter.
The complete process of the spider looks like this:

The two red sections represent two independent threads of execution, the Fetcher and the parse() method.
Using the spider is a simple matter of iterating through the urls that the spider returns. In this case, we want to further filter them by only printin the urls that return a error code. This is accomplished by using a hoi.FilterFunc function that checks for the StatusCode. Now we have an url checker that checks the sites for urls that are not available.

func find4xx(itr i.Iterator) bool {
    e, _ := itr.Value().(*spider.Entry)
    return e.StatusCode >= 400 && e.StatusCode < 500
}

func main() {
    s := spider.NewSpider(os.Args[1])
    itr := hoi.Filter(find4xx, s)
    count := 0
    for ; !itr.AtEnd(); itr.Next() {
        e, _ := itr.Value().(*spider.Entry)
        count++
        fmt.Printf("%d: Url: %s, Code: %d, Referer: %s\n", count, e.Url, e.StatusCode, e.Referer)
    }
    if itr.Error() != nil {
        fmt.Println(itr.Error())
    }
    s.Close()
}

Both of these source files are on GitHub at hog/spider/spider and hog/checkurls.

Wednesday, September 18, 2013

4.7: An Extended Example: Web Spiders - Part 3

The robots package is very simple since github.com/temoto/robotstxt-go does the heavy lifting of parsing and querying the robot.txt file.

type Robot interface{}

func NewRobot(url string) (Robot, error) {
    src := hostFromBase(url) + "/robots.txt"
    resp, err := http.Get(src)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    robots, err := robot.FromResponse(resp)
    if err != nil {
        return nil, err
    }
    return robots.FindGroup("GoGoSpider"), nil
}
func RobotItr(rules Robot, itr i.Forward) i.Forward {
    rrules, _ := rules.(*robot.Group)
    filter := func(itr i.Iterator) bool {
        url, _ := itr.Value().(string)
        return rrules.Test(url)
    }
    return hoi.Filter(filter, itr)
}

The NewRobot() constructor uses the host name of our root url to fetch the robots.txt file and returns the rules for the GoGoSpider group (thats us!). The iterator ueses these rules to filter the url stream, rejecting any urls that are not allowed according to the rules in the robots.txt file.
The Fetcher object is the first, and only object in this system that has nothing at all to do with iterators. It doesn't consume them, doesn't produce them. What it does do is fetch and produce web pages concurrently.

type urlpair struct {
    url, ref string
}

type page struct {
    url, ref string
    err      error
    response *http.Response
}
type Fetcher struct {
    pages  chan *page
    done   chan bool
    queue  list.List
    lockq  sync.Mutex
    client http.Client
}

The urlpair is used to transport an url and its referer from the user of the Fetcher to the engine that fetches the pages. The page contains those url pairs along with the http response recieved. The Fetcher maintains page channel, a queue and a mutex to guard access to the queue.

func NewFetcher() *Fetcher {
    f := Fetcher{}
    f.pages = make(chan *page, 5)
    f.done = make(chan bool)
    return &f
}

func (f *Fetcher) Stop() {
    f.done <- true
    <-f.done
}

The constructor creates buffered channels to use in the fetching operations. It can fetch 5 pages before it blocks and has to wait.

func (f *Fetcher) Pages() <-chan *page {
    return f.pages
}
func (f *Fetcher) Queue(url, ref string) {
    f.lockq.Lock()
    defer f.lockq.Unlock()
    f.queue.PushBack(&urlpair{url: url, ref: ref})
}

The Queue() method is where links enter the fetching system. They get stored on a queue so the caller won't block (for long, only if the mutex is locked). The Pages() method gives the user access to the page channel to retrieve them. Its where the results of the queueing operation come back out to the user.

func (f *Fetcher) Run() {
    go func() {
        for {
            f.lockq.Lock()
            if f.queue.Len() > 0 {
                e := f.queue.Front()
                urlpair, _ := e.Value.(*urlpair)
                f.queue.Remove(e)
                f.lockq.Unlock()
                headResp, err := f.client.Head(urlpair.url)
                var p *page
                if err == nil {
                    content := headResp.Header.Get("Content-Type")
                    if !strings.HasPrefix(content, "text/html") || !strings.HasPrefix(content, "text/xhtml") {
                        headResp.Body.Close()
                        getResp, err := f.client.Get(urlpair.url)
                        if err == nil {
                            p = &page{url: urlpair.url, response: getResp, ref: urlpair.ref}
                        } else {
                            p = &page{url: urlpair.url, ref: urlpair.ref, err: err}
                        }
                    } else {
                        p = &page{url: urlpair.url, ref: urlpair.ref, response: headResp}
                    }
                } else {
                    p = &page{url: urlpair.url, ref: urlpair.ref, err: err}
                }
                select {
                case f.pages <- p:
                case <-f.done:
                    p.response.Body.Close()
                    for {
                        select {
                        case sentpage := <-f.pages:
                            sentpage.response.Body.Close()
                        default:
                            close(f.pages)
                            close(f.done)
                            return
                        }
                    }
                }
            } else {
                f.lockq.Unlock()
                time.Sleep(1)
            }
        }
    }()
}

The Run() method is the engine of the Fetcher. It starts a Go routine that fetches links of the queue. It performes a HEAD request on the link, checking the response and the content type of the response. If the content is an html document it performes a GET request, fetching the document. The response of the request that succeded (or the fact that it failed) is then packaged with the url and the referer variables and sent over the pages channel.
If the done channel is sending a value, the fetcher retrieves any documents from the pages channel to close them and free any resources. Once that is done, it quites the Go routine.


The red line represents the boundaries between the two threads of execution. The queue and the channel cross those boundaries.
Both of these source files are on GitHub at hog/spider/robots and hog/spider/fetcher.

Tuesday, September 17, 2013

4.7: An Extended Example: Web Spiders - Part 2

The job of the linkitr.go package is to operate on a stream of html nodes from the NodeItr iterator and transform it into a stream of strings.

The first iterator is the LinkItr.

func links(itr i.Iterator) bool {
    n, _ := itr.Value().(*html.Node)
    if n.Type != html.ElementNode {
        return false
    }
    if n.Data != "a" && n.Data != "img" && n.Data != "link" && n.Data != "style" && n.Data != "script" {
        return false
    }
    if n.Data == "style" || n.Data == "script" {
        src := attr("src", n.Attr)
        return src != ""
    }
    return true
}
func LinkItr(itr i.Forward) i.Forward {
    return hoi.Filter(links, itr)
}

The LinkItr is an i.Filter iterator that runs through the node stream and removes any nodes that are not a, img, link or style nodes.

func geturl(itr i.Iterator) interface{} {
    n, _ := itr.Value().(*html.Node)
    var url string
    if n.Data == "a" {
        url = attr("href", n.Attr)
    } else if n.Data == "img" {
        url = attr("src", n.Attr)
    } else if n.Data == "link" {
        url = attr("href", n.Attr)
    } else if n.Data == "style" {
        url = attr("srr", n.Attr)
    } else if n.Data == "script" {
        url = attr("src", n.Attr)
    }
    return url
}

func UrlItr(itr i.Forward) i.Forward {
    return hoi.Map(geturl, itr)
}

The UrlItr uses hoi.Map to transform the node stream into a string stream, from now on we are operating on a stream of urls.

func attr(name string, attrs []html.Attribute) string {
    for _, a := range attrs {
        if a.Key == name {
            return a.Val
        }
    }
    return ""
}

The attr() method is a helper that gets the value of attribute name from the node. One improvement that could be made to this code would be to implement both the links method and the geturl method with a DispatchTable as discussed in chapter 2 of HOP. After this we hand the stream over to the iterators in the urlitr.go package. It contains five iterators, NormalizeItr, HostMapper, Referer, BindByRef and BindByHost.

func removeUrl(itr i.Iterator) bool {
    url, _ := itr.Value().(string)
    url = strings.TrimSpace(url)
    if url == "" {
        return false
    }
    if strings.HasPrefix(url, "tel:") || strings.HasPrefix(url, "mailto:") {
        return false
    }
    return true
}

func remHash(u string) string {
    idx := strings.Index(u, "#")
    if idx == -1 {
        return u
    }
    if idx == 0 {
        return ""
    }
    return u[0 : idx-1]
}

func normalizeUrl(itr i.Iterator) interface{} {
    url, _ := itr.Value().(string)
    url = strings.TrimSuffix(remHash(url), "/")
    return strings.ToLower(url)
}
func NormalizeItr(itr i.Forward) i.Forward {
    return hoi.Filter(removeUrl, hoi.Map(normalizeUrl, itr))
}

The job of NormalizeItr is to trasform the url into a canalogical version of the url. The hash portion gets cut of, trailing slash chomped, the rest lower cased, emtpy strings excluded along with mailto and tel links. The iterator uses both hoi.Map and hoi.Filter to accomplish this.

func hostFromBase(base string) string {
    slashslashidx := strings.Index(base, "//")
    idx := strings.Index(base[slashslashidx+2:], "/")
    if idx > 0 {
        return base[0 : slashslashidx+2+idx]
    }
    return base
}

func hostMapper(base string) i.MapFunc {
    host := hostFromBase(base)
    if !strings.HasSuffix(base, "/") {
        base = base + "/"
    }
    return func(itr i.Iterator) interface{} {
        url, _ := itr.Value().(string)
        if strings.Contains(url, "://") {
            return url
        }
        if strings.HasPrefix(url, "/") {
            return host + url
        }
        return base + url
    }
}

func HostMapper(base string, itr i.Forward) i.Forward {
    return hoi.Map(hostMapper(base), itr)
}

The HostMapper runs over the url stream, turning any relative urls into absolute urls.

func referer(ref string) hoi.MapFunc {
    return func(itr i.Iterator) interface{} {
        url, _ := itr.Value().(string)
        return []string{url, ref}
    }
}

func Referer(ref string, itr i.Forward) i.Forward {
    return hoi.Map(referer(ref), itr)
}

The Referer transforms the url stream into a stream of twinlets, the url and the page it was found on.

func removeIfNotReferedBy(ref string) hoi.FilterFunc {
    host := hostFromBase(ref)
    return func(itr i.Iterator) bool {
        ref, _ := itr.Value().([]string)
        return strings.HasPrefix(ref[1], host)
    }
}

func BindByRef(ref string, itr i.Forward) i.Forward {
    return hoi.Filter(removeIfNotReferedBy(ref), itr)
}

The BindByRef removes any urls from the stream that are not refered by the reference url. This is to stop the crawler to run away into the distance, we are only checking the urls that are refered by the host of the starting url. Those urls might take us to another website but we wont go any further.

func removeIfNotOnHost(url string) hoi.FilterFunc {
    host := hostFromBase(url)
    return func(itr i.Iterator) bool {
        u, _ := itr.Value().([]string)
        return strings.HasPrefix(u[0], host)
    }
}

func BindByHost(host string, itr i.Forward) i.Forward {
    return hoi.Filter(removeIfNotOnHost(host), itr)
}

The BindByHost is an iterator that removes any urls from the stream that are not on the same host as the referencing url. If you use this iterator on the stream, you are only crawling urls that are one the same domain as the starting url. Both of these source files are on GitHub at hog/spider/linkitr and hog/spider/urlitr.

Monday, September 16, 2013

4.7: An Extended Example: Web Spiders - Part 1

The spider example is, like the Flat File Database, long enough to need more than one post. And as with Ffdb I will attempt to explain the code from the bottom up.

But first the birds view of the system. The Node iterator gives us access to the html document. We then use various iterators to transform that stream of nodes into a stream of links. The fetcher's job is to retrieve documents from the web, and the robot's job is to parse the robot.txt file so we can act as good citizens when hitting some poor server. The spider runs the whole show and maintains the state of the crawling. One crucial change I made from the example in HOP: the fetching operation is now a concurrent operation. This is a blog about Go and its characteristics after all.




The Node iterators job is to take a document tree of html nodes and turn it into a one dimensional stream of nodes. The iterator offers two strategies to do this, a Depth-First and a Breath-First.


type SearchTactic int

const (
    DepthFirst SearchTactic = iota
    BreathFirst
)

type nodeitr struct {
    err   error
    cur   *html.Node
    next  func() error
    queue list.List
}

func NodeItr(in io.Reader, stactic SearchTactic) i.Forward {
    n := nodeitr{}
    n.cur, n.err = html.Parse(in)
    if stactic == DepthFirst {
        n.next = n.depthFirst
    } else {
        n.queue.PushBack(n.cur)
        n.next = n.breathFirst
        n.next()
    }
    return &n
}

The nodeitr maintains a reference to the current node, a queue for the Breath-First search and function pointer to the search algorithm. The constructor retrieves the root node and assigns the correct search function to next according to the SearchTactic value.

func (i *nodeitr) Value() interface{} {
    return i.cur
}

func (i *nodeitr) Error() error {
    return i.err
}

func (i *nodeitr) AtEnd() bool {
    return i.cur == nil
}

func (i *nodeitr) Next() error {
    return i.next()
}

The only thing to node here is that the only thing Next() does is to forward the operation to whatever function is in the next variable, be that the Depth-First or the Breath-First search.

func (i *nodeitr) depthFirst() error {
    if i.err != nil {
        return i.err
    }
    if i.cur.FirstChild != nil {
        i.cur = i.cur.FirstChild
    } else if i.cur.NextSibling != nil {
        i.cur = i.cur.NextSibling
    } else if i.cur.Parent != nil {
        for i.cur.Parent != nil {
            i.cur = i.cur.Parent
            if i.cur.NextSibling != nil {
                i.cur = i.cur.NextSibling
                return i.err
            }
        }
        i.cur = nil
    }
    return i.err
}

The depthFirst() simply uses the FirstChild, NextSibling and Parent references that every node has to traverse the tree in a Depth-First fashion. If the current node has a child we go there. Else if it has a sibling we go there. If both fails we start looping up the parent references, stopping on the first one that has a sibling. If that fails we are back at the root node and the iteration is finished.

func (i *nodeitr) breathFirst() error {
    if i.err != nil {
        return i.err
    }
    i.cur = nil
    if i.queue.Len() > 0 {
        i.cur = i.queue.Front().Value.(*html.Node)
        i.queue.Remove(i.queue.Front())
        if i.cur.FirstChild != nil {
            for c := i.cur.FirstChild; c != nil; c = c.NextSibling {
                i.queue.PushBack(c)
            }
        }
    }
    return i.err
}

Our breathFirst() function uses a queue to maintain a list of nodes that we are yet to visit. If the queue is not empty we pop the first node of the queue; that is our current position. Then we check if this node has any child nodes; if so they get appended to the queue. If the queue is empty we are finished.

Get the source at GitHub.

Sunday, September 15, 2013

4.6.2: An Iterator with an each-Like Interface

The concept behind each-Like is to run a Map operation transforming a value and returning it along with the original to the caller.
This is a Map iterator in almost every way.

type eachlike struct {
    itr i.Forward
    f   hoi.MapFunc
}

func EachLike(f i.MapFunc, itr i.Forward) *eachlike {
    return &eachlike{f: f, itr: itr}
}

The constructor takes a f.MapFunc function and a i.Forward iterator to work on.

func (i *eachlike) Value() interface{} {
    ret := make([]interface{}, 2)
    ret[0] = i.itr.Value()
    ret[1] = i.f(i.itr)
    return ret
}

func (i *eachlike) ValuePair() (interface{}, interface{}) {
    res := i.Value().([]interface{})
    return res[0], res[1]
}

The Value() method returns an interface{} value that contains both the original value and the transformed value. The ValuePair() method unpacks the interface{} value into a interface{} slice.

func (i *eachlike) Error() error {
    return i.itr.Error()
}

func (i *eachlike) Next() error {
    return i.itr.Next()
}
func (i *eachlike) AtEnd() bool {
    return i.itr.AtEnd()
}

These functions simply forward to the wrapped iterator.

func multiply(n int) i.MapFunc {
    return func(itr i.Iterator) interface{} {
        return n * itr.Value().(int)
    }
}

This is a simple closure that creates a i.MapFunc that multiplies two integers together.

itr := EachLike(multiply(10), icon.List(1, 2, 3, 4, 5, 6, 7, 8, 9))
for ; !itr.AtEnd(); itr.Next() {
    val, res := itr.ValuePair()
    fmt.Printf("Value: %v, result: %v\n", val, res)
}

To use it we construct the iterator from the i.MapFunc and the i.Forward iterator. We then loop over it and print out every value, result pair.
The rational behind this is to exploit a feature in Perl where the caller can indicate to a function whether it expects a single value or a list value returned. The EachLike would return the transformed value in the single context, and the (transformed value, original value) in the list context. This is not possible in Go. And the core functionality of returning the transformed value with the origianl can easily be achived with the regular i.Map iterator and a i.MapFunc function that simply returns a slice with both values.
Get the source at GitHub.

Concerning the i library

During the last few days, the spinof library that is the i library has seen many changes. I've refactored it into subpackages (i, hoi, icon, iadapt, ityped, itk), added tests and examples, and a first draft of a documentation is up on godoc.org. Over the next days I will add to the documentation and examples.

The hoi package is the Higher Order Iterator package, as such it contains mainy useful higher order functions to use with iterators. As of today it contains the following iterators:
Documentation for those functions is up on GoDoc.

As a result of the refactoring, I've had to go back over some of the examples in chapter 4 and change the code in the hog repository. I've also gone through the past blogposts and updated the links to files in the i repository.

Saturday, September 14, 2013

4.6.1: Using foreach to Loop Over More Than One Array

Zipping lists is to take some lists A, B, C .. and produce a new list containing the elements (a1, b1, c1..), (a2, b2, c2, ..) .. A key decision here is does the new list stop producing new items once the shortest list is exhausted, or does it continue to produce elements until the longest list is finished, and fill in nil values for the shorter lists.

type stop int

const (
    StopAtMin stop = iota
    StopAtMax
)

type eacharray struct {
    itrs   []i.Forward
    err    error
    stopAt stop
}

func EachArray(stopAt stop, itrs ...i.Forward) i.Forward {
    return &eacharray{itrs: itrs, stopAt: stopAt, err: nil}
}

The constructor takes an enum to control the behaviour vis-a-vis the length of individual streams, and a list of iterators.

func (i *eacharray) Value() interface{} {
    ret := make([]interface{}, len(i.itrs))
    for idx, itr := range i.itrs {
        if !itr.AtEnd() {
            ret[idx] = itr.Value()
        } else {
            ret[idx] = nil
        }
    }
    return ret
}

func (i *eacharray) Error() error {
    return i.err
}

func ( *eacharray) SetError(err error) {
    i.err = err
}

The Value() function loops over all iterators, collecting a value from each of them. If the iterator is at end, we use a nil value for that iterator.

func (i *eacharray) Next() error {
    for _, itr := range i.itrs {
        if !itr.AtEnd() {
            err := itr.Next()
            if err != nil {
                i.err = err
                break
            }
        }
    }
    return i.err
}

func (i *eacharray) AtEnd() bool {
    atEndCount := 0
    for _, itr := range i.itrs {
        if itr.AtEnd() {
            if i.stopAt == StopAtMin {
                return true
            }
            atEndCount++
        }
    }
    return atEndCount == len(i.itrs)
}

The Next() method loops over all iterators, calling Next() on each iterator that is not at end yet. The AtEnd() method loops over all the iterator and checks each one for EOS. If stopAt is StopAtMin, we return true on first EOS we encoundter, otherwise we return false until all iterators return EOS.

itr := EachArray(
    StopAtMax,
    i.List(1, 2, 3, 4, 5, 6),
    i.List(6.4, 7.1, 8.2, 9.9),
    i.List("A", "B", "C", "D", "E"))

for ; !itr.AtEnd(); itr.Next() {
    fmt.Println(itr.Value())
}

Usage is simple and there is no need for all the lists to be of the same type.

Get the source at GitHub. It is also available in the iterator library as Zip.

Friday, September 13, 2013

4.4.4: append()

Go has an append function, it simply appends two slices and returns the resulting slice. An append iterator is similar, the major difference being that you can start to work on the joined list without actually going through the process of copying them together. If the lists are long, this is a good strategy.

type iappend struct {
    itrs []i.Forward
    pos  int
    err  error
}

func Append(itrs ...i.Forward) i.Forward {
    return &iappend{itrs: itrs}
}

func (i *iappend) AtEnd() bool {
    if i.pos < len(i.itrs)-1 {
        return false
    } else if i.pos >= len(i.itrs) {
        return true
    }
    return i.itrs[i.pos].AtEnd()
}

func (i *iappend) Next() error {
    if i.pos >= len(i.itrs) {
        i.err = fmt.Errorf("Appending beyond last iterator")
        return i.err
    }
    i.itrs[i.pos].Next()
    for i.itrs[i.pos].AtEnd() {
        i.pos++
        if i.pos >= len(i.itrs) {
            return nil
        }
    }
    return nil
}

func (i *iappend) Value() interface{} {
    return i.itrs[i.pos].Value()
}

func (i *iappend) Error() error {
    if i.err != nil {
        return i.err
    }
    for _, v := range i.itrs {
        err := v.Error()
        if err != nil {
            return err
        }
    }
    return nil
}

func (i *iappend) SetError(err error) {
    i.err = err
}

Append simply runs through each iterator until it hits the end, then it switches to the next one. Once the last iterator is exhausted, Append quits.

The usage is very simple:

for itr := Append(i.List(1, 2, 3), igen.Range(10, 20)); !itr.AtEnd(); itr.Next() {
    fmt.Println(itr.Value())
}

Get the source at GitHub. It is also available in the iterator library.

Thursday, September 12, 2013

4.4.3: list_iterator()

In this section MJD introduces the imap and igrep iterators. We've already used them; imap was introduced in 4.3.3: Example - A Flat-File Database - part 2 (available in the iterator library as i.Map) while igrep was introduced in 4.3: Examples (available in the iterator library as i.Filter). I've won't spend any more time on them here.

A list_iterator() transforms a Perl array into an iterator. Go's native container is the slice so we will build an adaptor for the same purpose here.

One problem we run into is that a slice of some type, say []int, can not be typecasted to a []interface{} type, we would need to create a interface{} of equal length as the []int and then copy the values over. This is not efficient and not something I want to do. I have two solutions to this problem. The first one is to write an adapter speficic for the type you want to use.

type intslice struct {
    slice []int
    pos   int
}

func IntSlice(vals []int) i.Forward {
    return &intslice{slice: vals}
}

func (i *intslice) AtEnd() bool {
    return i.pos >= len(i.slice)
}

func (i *intslice) Next() error {
    i.pos++
    return nil
}

func (i *intslice) Value() interface{} {
    return i.slice[i.pos]
}

func (i *intslice) Error() error {
    return nil
}

func (i *intslice) SetError(err error) {
}

This is a typical i.Forward iterator that simply wraps around the int slice you provide.

itr := IntSlice([]int{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11})
itr = i.Filter(
    func(itr i.Iterator) bool {
        n, _ := itr.Value().(int)
        return math.Mod(float64(n), 3) == 0
    }, itr)

for !itr.AtEnd() {
    fmt.Println(itr.Value())
    itr.Next()
}

I've created a script that generates iterators such as this for all the base types in Go. They are all available in the iterator library.


A second strategy is not to wrap a slice but to use Go's support for variable arguments. Yes, if you are forced to wrap a slice this will not help you but sometimes all we need is to wrap a list of values that we've written out in the source file.

type list struct {
    slice []interface{}
    pos   int
}

func List(vals ...interface{}) i.Forward {
    return &list{slice: vals}
}

func (i *list) AtEnd() bool {
    return i.pos >= len(i.slice)
}

func (i *list) Next() error {
    i.pos++
    return nil
}

func (i *list) Value() interface{} {
    return i.slice[i.pos]
}

func (i *list) Error() error {
    return nil
}

func (i *list) SetError(err error) {
}

This will work for any type, even a mix of different types.


itr = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
itr = i.Filter(func(itr i.Iterator) bool {
    n, _ := itr.Value().(int)
    return math.Mod(float64(n), 2) == 0
}, itr)

for !itr.AtEnd() {
    fmt.Println(itr.Value())
    itr.Next()
}

Get the source at GitHub. The variable argument version is also available in the iterator library.

Wednesday, September 11, 2013

4.3.6: Example - Random Number Generator

Our final example of an iterator is short and simple, it is a random number generator.

type Rnd struct {
    seed int64
    cur  float64
    gen  *rand.Rand
}

func Rand() *Rnd {
    var r Rnd
    r.seed = time.Now().UnixNano()
    r.First()
    return &r
}

The structure contains the seed used to generate the sequence, the current value (a floating number in the [0..1] range) and a reference to Go's random number generator.
The constructor simply seeds a random sequence from the current time and generates the first value.

func (r *Rnd) First() error {
    r.gen = rand.New(rand.NewSource(r.seed))
    r.cur = r.gen.Float64()
    return nil
}

func (r *Rnd) Next() error {
    r.cur = r.gen.Float64()
    return nil
}

The iterator is a BoundedAtStart iterator, i.e. it has a beginning and can easily be restored to it through the First() method. The Next() generates the next random value.

func (r *Rnd) AtEnd() bool {
    return false
}

func (r *Rnd) Error() error {
    return nil
}

This iterator represents an infinite stream of values on the form [rval1, rval2, ...), therefore the only job of the AtEnd() method is to return false.

func (r *Rnd) Value() interface{} {
    return r.cur
}

func (r *Rnd) Float64() float64 {
    return r.cur
}

The iterator has a special type casting function so we can read float64 values from it without have to do the type assertion ourselves.

func (r *Rnd) SetSeed(seed int64) {
    r.seed = seed
    r.First()
}

func (r *Rnd) Seed() int64 {
    return r.seed
}

Lastsly, the iterator provids methods to retrieve the seeding used to generate the sequence, as well as reset and iterator to a previously saved seeding value.
A sampel usage is as follows:

fmt.Print("A quarted of random pairs: ")
ritr1, ritr2 := Rand(), Rand()
for i := 0; i < 4; i++ {
    r1 := ritr1.Float64()
    r2 := ritr2.Float64()
    fmt.Printf("(%f, %f), ", r1, r2)
    ritr1.Next()
    ritr2.Next()
}

seed1, seed2 := ritr1.Seed(), ritr2.Seed()
fmt.Println("")
fmt.Print("A quarted of another random pairs: ")
ritr1, ritr2 = Rand(), Rand()
for i := 0; i < 4; i++ {
    r1 := ritr1.Float64()
    r2 := ritr2.Float64()
    fmt.Printf("(%f, %f), ", r1, r2)
    ritr1.Next()
    ritr2.Next()
}

fmt.Println("")
fmt.Print("The first quarted of random pairs: ")
ritr1.SetSeed(seed1)
ritr2.SetSeed(seed2)
for i := 0; i < 4; i++ {
    r1 := ritr1.Float64()
    r2 := ritr2.Float64()
    fmt.Printf("(%f, %f), ", r1, r2)
    ritr1.Next()
    ritr2.Next()
}

We generate three list of a quarted of a pair of random numbers, where the first sequence and the last sequence are the same.
Get the source at GitHub. The Random iterator is also awailable in the iterator library as a RandomAccess iterator.

Tuesday, September 10, 2013

4.3.3: Example - A Flat-File Database - part 3

We conclude this example by going over the Query package.

type (
    FieldQueryFunc func(string) bool
)

A FieldQueryFunc is similar to a i.FilterFunc with the exception that it is specialized to work on string values rather than i.Iterator values.

func (db *Ffdb) Query(f i.FilterFunc, d Direction) i.Forward {
    return i.Filter(f, RecordItr(db, d))
}

func (db *Ffdb) QueryField(fieldname string, f FieldQueryFunc, d Direction) i.Forward {
    fieldname = strings.ToUpper(fieldname)
    if _, ok := db.fieldnames[fieldname]; !ok {
        panic(fmt.Errorf("Unknown field: %q\n", fieldname))
    }

    return db.Query(func(itr i.Iterator) bool {
        r, _ := itr.Value().(*Record)
        return f(r.Value(fieldname))
    }, d)
}

The Query package contains two fundamental methods that serve as building blocks for all other queries. The first, Query(), is an iterator that uses the i.Filter iterator to construct a i.Forward iterator that will filter a Record stream in the direction of Direction with the filter f.
The QueryField() constructs a i.FilterFunc and binds it to a field name and a FieldQueryFunc. We can now iterate over the stream of Records and ask specific questions about individual field values. Two examples are provided in the Query package.

func (db *Ffdb) QueryFieldRx(fieldname, rxs string, d Direction) i.Forward {
    rx := regexp.MustCompile(rxs)
    return db.QueryField(fieldname, func(val string) bool {
        return rx.MatchString(val)
    }, d)
}

func (db *Ffdb) QueryGreater(fieldname string, check float64, d Direction) i.Forward {
    return db.QueryField(fieldname, func(fieldvalue string) bool {
        value, _ := strconv.ParseFloat(fieldvalue, 64)
        return value > check
    }, d)
}

QueryFieldRx is a query that checks field named fieldname against a regular expression rsxQueryGreater is a query that returns records where the float64 value of field fieldname is greater than some value suppled to the query.

An entire pipeline of iterators, from query to text file using e.g. the QueryFieldRx, would look something like this:


The blue lines represent how data moves through the construction of the pipeline. FieldName, Rx and Direction are send to the QueryFieldRx which constructs a FieldQueryFunc and sends it along with the FieldName and Direction to the QueryField object. It constructs a i.FilterFunc and sends it with the Direction to the Query object. The Query constructs the Record iterator, which in turns constructs the Reader from the Direction.
Now that the pipeline is ready, we can read bytes from the file. The Reader turns those into a string for each line in the file and hands those over to the Record iterator. The Record iterator, being a i.Map iterator, maps the lines to a Record which the Query then filters and returns to the user.
Following is an example of how to use this package.

    db, _ := ffdb.NewFfdbHeader(os.Args[1], "[:]")
    defer db.Close()

    fmt.Println("From state: MA & NY")
    itrNY := db.QueryFieldRx("state", "NY", ffdb.Reverse)
    itrMA := db.QueryFieldRx("state", "MA", ffdb.Forward)
    for !itrNY.AtEnd() || !itrMA.AtEnd() {
        if !itrNY.AtEnd() {
            fmt.Println(itrNY.Value())
            itrNY.Next()
        }
        if !itrMA.AtEnd() {
            fmt.Println(itrMA.Value())
            itrMA.Next()
        }
    }
    fmt.Println("\nOwes more than 100")
    for itr := db.QueryGreater("owes", 100, ffdb.Forward); !itr.AtEnd(); itr.Next() {
        fmt.Println(itr.Value())
    }

We open the database, build some query iterators and run through them printing out the result.
The two source files discussed here are available at GitHub: Query and ffdb

Monday, September 9, 2013

4.3.3: Example - A Flat-File Database - part 2

The Record package iterates over a stream of lines, parsing them into a Record structure and passing them on. It utilizes a new function in the iterator library, Map. Lets start by running through it.

type MapFunc func(Iterator) interface{}

type imap struct {
    fmap MapFunc
    val  interface{}
    itr  Forward
}

func Map(fmap MapFunc, itr Forward) Forward {
    return &imap{fmap: fmap, itr: itr, val: nil}
}

func (i *imap) AtEnd() bool {
    return i.itr.AtEnd()
}

func (i *imap) Next() error {
    i.val = nil
    return i.itr.Next()
}

func (i *imap) Value() interface{} {
    if i.val == nil {
        i.val = i.fmap(i.itr)
    }
    return i.val
}

func (i *imap) Error() error {
    return i.itr.Error()
}

The Map iterator accepts a MapFunc function and a Forward iterator. It then iterates over that iterator and uses the MapFunc to transform the value before returning it. You can see source at GitHub.

type (
    Record map[string]string
)

func RecordItr(db *Ffdb, d Direction) i.Forward {
    return i.Map(recordParser(db), d(db))
}

The Record type is simply a map of strings keyed to a string. And the constructor creates a map from a stream of lines to a stream of records as defined by recordParser. The Direction is a function pointer to either the Forward or the Reverse reader.

func recordParser(db *Ffdb) i.MapFunc {
    return func(itr i.Iterator) interface{} {
        line, _ := itr.Value().(string)
        fields := db.fieldsep.Split(line, -1)
        if len(fields) != len(db.fields) {
            panic(fmt.Errorf("Unexpected number of fields in record %q, expected %d, got %d\n", line, len(db.fields), len(fields)))
        }
        var record Record = make(map[string]string, len(fields))
        for i, v := range fields {
            record[db.fields[i]] = v
        }
        return &record
    }
}

The recordParser returns a function of type i.MapFunc. Its purpose is to accept a string, tokenize it with a field seperator defined in the Ffdb object, validate that the number of fields in the line matches with the number of fields in the Ffdb object, create a Record structure and fill it with the values from the line keyed to the fields defined in the Ffdb object.

func (r *Record) String() string {
    m := (map[string]string)(*r)
    return fmt.Sprint(m)
}

func (r *Record) Value(key string) string {
    return (*r)[key]
}

Finally we have some helper functions that allows to work with a Record structure. The Ffdb package defines the flat file database object.

type Bound struct {
    start, end int64
}

type Ffdb struct {
    file       *os.File
    fields     []string
    fieldnames map[string]int
    fieldsep   *regexp.Regexp
    bound      Bound
}

The Bound structure we've seen before, it defines the boundaries of the underlying text file for the readers. The Ffdb structure holds the neccessary state for the database: the name of the file, name of the fields, a mapping of field names to field numbers, the field seperator and the Bound structure.

func NewFfdb(name, fieldsep string, schema []string) (*Ffdb, error) {
    var db Ffdb
    var err error
    db.fieldsep = regexp.MustCompile(fieldsep)
    if db.file, err = os.Open(name); err != nil {
        return nil, err
    }
    db.fields = schema
    db.bound.end = -1
    return &db, nil
}

We have to constructors for the Ffdb object. The former assumes that there is no schema in the file and therefore expects the caller to supplie a slice of strings containing the names of the fields (and indirectly the number of fields). It defines the boundaries as the entire file.

func NewFfdbHeader(name, fieldsep string) (*Ffdb, error) {
    db, err := NewFfdb(name, fieldsep, nil)
    if err != nil {
        return nil, err
    }
    readbuf := make([]byte, 4096)
    var buf bytes.Buffer
    var pos int64
    found := false
    for !found {
        n, err := db.file.Read(readbuf)
        if err != nil && err != io.EOF {
            return nil, errors.New("Unable to read schema")
        }
        r := bytes.NewReader(readbuf)
        for r.Len() > 0 {
            ch, _, _ := r.ReadRune()
            if ch == '\n' {
                offset, err := r.Seek(0, os.SEEK_CUR)
                if err != nil {
                    return nil, err
                }
                pos += offset
                buf.Write(readbuf[0 : offset-1])
                found = true
                break
            }
        }
        if !found {
            buf.Write(readbuf)
            pos += int64(n)
        }
    }
    db.fields = db.fieldsep.Split(buf.String(), -1)
    db.fieldnames = make(map[string]int, len(db.fields))
    for i, v := range db.fields {
        db.fieldnames[v] = i
    }
    db.bound.start = pos
    return db, nil
}

The second constructor assumes that the first line in the text file contains the schema of the data base. It attempts to read this line and parse it to fill both the fields and the fieldnames members. It also saves the ending position of the first line and uses that to define the boundaries for the readers.

func (db *Ffdb) Close() {
    db.file.Close()
}

The Ffdb is at heart a file resource, and must therefore be closed to avoid resource leaks.

The two source files discussed here are available at GitHub: Record and Ffdb

Sunday, September 8, 2013

4.3.3: Example - A Flat-File Database - part 1

The flat file database is a text file in the form of e.g.:
LASTNAME:FIRSTNAME:CITY:STATE:COUNTRY:OWES
Adler:David:New York:NY:US:157.00
Asthon:Elaine:Boston:MA:US:0.00
Dominus:Mark:Philadelphia:PA:US:0.00
Orwant:Jon:Cambridge:MA:US:26.30
Schern:Michael:New York:NY:US:149658.23
Wall:Larry:Mountain View:CA:US:-372.14
Gylfason:Magnús:Reykjavík:NA:IS:20.00

The first line is an optional schema that names the individual columns, while ":" is the column separator. If no schema is in the file we have to supply a list of field names to the Ffdb constructor.

An overview over my implementation of this system is as such:


The Readers package provides iterators that allow us to iterate through text files line by line, the Record package provides an iterator that transforms a stream of lines into a stream of Record structures, the Ffdb provides the flat-file database object and lastly the Query package implements various examples of queries that are possible. This is a somewhat larger example than the previous ones, so we will go through this in more than one post. I think it is best to go through this bottom up, so lets start down by the metal. The Readers package provides two Forward iterators, Forward and Reverse, that iterate through text files line by line. A future solution would be to provide one BiDirectional iterator that could fulfill both roles, but this was simpler and will have to do for now.
type Direction func(*Ffdb) i.Forward

The Direction type is simply the type of the constructors of the readers, this allows us to specify when we create the query whether we want to search forwards or reversed simply by passing the name of the reader to the query.
type reader struct {
    file  *os.File
    pos   int64
    line  string
    err   error
    atEnd bool
    bound Bound
}

The reader structure is the state shared by both the Forward reader and the Reverse reader. It contains the file we are reading and our position in it, current line, last error, an atEnd indicator and the Bound structure which is used to bind the readers within a range of the file. This is so we can exclude the schema header if it is located within the file (usually the header). The readers are valid on the range bound.start <= pos <= bound.end.
func (r *reader) Error() error {
    if r.err == io.EOF {
        return nil
    }
    return r.err
}

func (r *reader) Value() interface{} {
    return r.line
}

func (r *reader) AtEnd() bool {
    return r.atEnd
}

These methods are shared between the readers and are therefore bound to the reader structure.
type forward struct {
    reader
    in *bufio.Reader
}

func Forward(db *Ffdb) i.Forward {
    var itr forward
    if itr.pos, itr.err = db.file.Seek(db.bound.start, os.SEEK_SET); itr.err == nil {
        itr.file = db.file
        itr.bound = db.bound
        itr.in = bufio.NewReader(db.file)
        itr.Next()
    }
    return &itr
}

The forward state is simply the reader state plus a bufio.Reader. This reader is very similar to the one we created in 4.3.3: Example - Filehandle Iterators with the exception of boundaries. The constructor seeks to the beginning position in the file as defined by bound.Start, creates the buffered reader from bufio and reads the first line.
func (f *forward) Next() error {
    if f.err == io.EOF || (f.bound.end > 0 && f.pos >= f.bound.end) {
        f.atEnd = true
        return nil
    }

    f.file.Seek(f.pos, os.SEEK_SET)
    if f.line, f.err = f.in.ReadString('\n'); f.err != nil && f.err != io.EOF {
        return f.err
    }

    f.pos, _ = f.file.Seek(0, os.SEEK_CUR)
    // chomp
    f.line = strings.TrimSuffix(f.line, "\n")
    return nil
}

The Next() method seeks to the last position in the file. This is so we can have many iterators on the same database file active at the same time, each maintaining its own position. It reads the line from the file, saves the new position in the file and chomps of the trailing NL character. A io.EOF or a position beyond the boundaries indicates that we've reached the end of the database for this iterator.
const readbufsize = 4096
type reverse struct {
    reader
    readbufpos, readbuflen int64
    readbuf                []byte
    buf                    bytes.Buffer
}

The Reverse reader is a bit more complicated. It inherits from the reader and adds various buffers and reading positions to it. The strategy here is to read the file backwards in readbufsize chunks, searching through the chunks for NL characters. We need to handle both that each chunk might contain many lines and that any line could span across many chunks.
func Reverse(db *Ffdb) i.Forward {
    whence := os.SEEK_SET
    pos := db.bound.end
    if pos == -1 {
        whence = os.SEEK_END
        pos = 0
    }

    var itr reverse
    if itr.pos, itr.err = db.file.Seek(pos, whence); itr.err == nil {
        itr.file = db.file
        itr.bound = db.bound
        itr.readbufpos = -1
        itr.readbuf = make([]byte, readbufsize)
        itr.Next()
    }
    return &itr
}

The constructor must start by positioning itself at the end of the area that the iterator is valid on, be that the end of the file or the position in bound.end. When then proceed to read the first chunk into our buffer.
func (r *reverse) writeBuffer() {
    var out bytes.Buffer
    out.Write(r.readbuf[r.readbufpos+1 : r.readbuflen])
    if r.buf.Len() > 0 {
        out.Write(r.buf.Bytes())
        r.buf.Reset()
    }

    r.readbuflen = r.readbufpos
    r.readbufpos--
    r.line = out.String()
}

The writeBuffer() writes the current line that is in the chunk to the variable returned by the Value() method. It then appends the leftover from the last iteration.
func (r *reverse) Next() error {
    for {
        if r.readbufpos < 0 {
            if r.pos == r.bound.start {
                r.atEnd = true
                return nil
            }
            r.readbuflen = readbufsize
            if r.pos < readbufsize {
                r.readbuflen = r.pos
            }
            r.pos -= r.readbuflen
            r.file.Seek(r.pos, os.SEEK_SET)
            if _, r.err = r.file.Read(r.readbuf); r.err != nil && r.err != io.EOF {
                return r.err
            }
            r.readbufpos = r.readbuflen - 1
        }

        if r.readbufpos < 0 {
            r.atEnd = true
            return nil
        }

        for ; r.readbufpos >= 0; r.readbufpos-- {
            if r.readbuf[r.readbufpos] == '\n' {
                r.writeBuffer()
                return nil
            }
        }

        if r.readbufpos < 0 && r.pos == r.bound.start {
            r.writeBuffer()
            return nil
        }

        var joinbuf bytes.Buffer
        joinbuf.Write(r.readbuf[0:r.readbuflen])
        if r.buf.Len() > 0 {
            joinbuf.Write(r.buf.Bytes())
        }
        r.buf = joinbuf
    }
    return nil
}

The Next() method starts by checking the readbufpos variable, if it is below zero we must read a new chunk from the file into our buffer. Normally we try to read readbufsize number of characters but if we are closer than that to the start of the file we simply read the rest of the file. If after reading readbufpos is still below zero we are finished. Otherwise we loop through the buffer searching for a NL character. If we find it we write out the buffer and return, otherwise we join the buffer with the previous buffer and iterate again. A special case is if we find no NL character but move beyond both the buffer and the reading boundary; then we've found the last (first) line in the file. Get the source at GitHub.

Saturday, September 7, 2013

4.3.3: Example - Filehandle Iterators

Go already provides a good package to read a file line by line. It is bufio.Reader so this iterator is simply a wrapper around that reader.

type fh struct {
    in   *bufio.Reader
    line string
    err  error
}

func Fh(in io.Reader) i.Forward {
    f := fh{in: bufio.NewReader(in)}
    f.Next()
    return &f
}

func (f *fh) Error() error {
    if f.err == io.EOF {
        return nil
    }
    return f.err
}

func (f *fh) Value() interface{} {
    return f.line
}

func (f *fh) AtEnd() bool {
    return f.err == io.EOF
}

We need to check if err is io.EOF or any other error since io.EOF is not an error condition but simply an EOS indicator. Everything else is pretty much self explanatory.

func (f *fh) Next() error {
    if f.line, f.err = f.in.ReadString('\n'); f.err != nil && f.err != io.EOF {
        return f.err
    }
    // chomp
    f.line = strings.TrimSuffix(f.line, "\n")
    return nil
}

The Next() function will read the next line from the file and chomp of the new line indicator.

i.Each(
    Fh(os.Stdin),
    func(itr i.Iterator) bool {
        line, _ := itr.Value().(string)
        fmt.Println("Line: " + line)
        return true
    })

Usage is same as before.
Get the source at GitHub.

Friday, September 6, 2013

4.3.2: Example - Genomic Sequence Generator

Given the pattern A(CG)G(AT), the Gene iterator will produce the sequence:

ACGA
AGGA
ACGT
AGGT

Strings within () represent possible permutations while strings outside () are static.

type gene struct {
    atEnd bool
    data  []interface{}
    cur   string
}

func Gene(pattern string) i.Forward {
    r := regexp.MustCompile("[()]")
    tokens := r.Split(pattern, -1)
    g := gene{atEnd: false, data: make([]interface{}, 0, len(tokens))}
    for i, v := range tokens {
        if math.Mod(float64(i), 2) == 0 {
            // constant
            if v != "" {
                g.data = append(g.data, v)
            }
        } else {
            // option
            if v != "" {
                option := make([]interface{}, 0, len(v)+1)
                option = append(option, 0)
                for _, c := range v {
                    option = append(option, string(c))
                }
                g.data = append(g.data, option)
            }
        }
    }
    g.Next()
    return &g
}

The state of the iteration is held in the data variable. The constructor accepts a pattern string a loops through it creating a datastructure such as:

A(GA)C => ['A',['G','A'],'C']

A string element represent a static element while a slice element represents the possible permuations that we need to loop through. Calling Next() creates the first permuation.

func (g *gene) Error() error {
    return nil
}

func (g *gene) Value() interface{} {
    return g.cur
}
func (g *gene) AtEnd() bool {
    return g.cur == ""
}

The only thing we need to talk about here is that the atEnd variable realy means that there are no more permuations to generate. But there is still a value to return as log as cur is not empty.

func (g *gene) Next() error {
    if g.atEnd {
        g.cur = ""
        return nil
    }
    finishedIncr := false
    var result bytes.Buffer
    for _, t := range g.data {
        if v, ok := t.(string); ok {
            result.WriteString(v)
        } else if v, ok := t.([]interface{}); ok {
            n := v[0].(int)
            result.WriteString(v[n+1].(string))
            if !finishedIncr {
                if n == len(v)-2 {
                    v[0] = 0
                } else {
                    v[0] = v[0].(int) + 1
                    finishedIncr = true
                }
            }
        }
    }
    g.atEnd = !finishedIncr
    g.cur = result.String()
    return nil
}

Next() starts by checking the atEnd variable, clearing cur and returning if it is true. AtEnd() will now return true.
Otherwise we loop through the data variable, generating the next value for the sequence.

i.Each(
    Gene(os.Args[1]),
    func(itr i.Iterator) bool {
        fmt.Println(itr.Value())
        return true
    })

The main loop is the same as before.
Get the source at GitHub.

Thursday, September 5, 2013

4.3.1: Example - Permutations

The purpose of the Permute iterator is to accept a range such as A..C and write out all possible permutations:

ABC
ACB
BAC
BCA
CAB
CBA

As always, we start with capturing and storing the state of the iteration, i.e. the permuation.

type permute struct {
    items []rune
    n     int
    atEnd bool
}

func Permute(items []rune) i.Forward {
    return &permute{items: items, n: 1, atEnd: false}
}

items[] and n are the state of the current permuation while atEnd indicates whether we've listed all possible permuations.

func (p *permute) Error() error {
    return nil
}

func (p *permute) Value() interface{} {
    return p.items
}

func (p *permute) AtEnd() bool {
    return p.atEnd
}

These three functions are very simple, very little to do here apart from the obvious.

func (per *permute) Next() error {
    var i int
    p := per.n
    for i = 1; i <= len(per.items) && math.Mod(float64(p), float64(i)) == 0; i++ {
        p /= i
    }
    d := int(math.Mod(float64(p), float64(i)))
    j := len(per.items) - i
    if j < 0 {
        per.atEnd = true
        return nil
    }

    copy(per.items[j+1:len(per.items)], reverse(per.items[j+1:len(per.items)]))
    per.items[j], per.items[j+d] = per.items[j+d], per.items[j]
    per.n++
    return nil
}

Next() is where the generation of the next permutation occurs, and if the three before were to simple, this one is to clever by far. The idea is to go through the permuations from the last element to the first, such as:

ABCD > ABDC > ACBD > ACDB > ADBC > ADCB > BADC ...

The first element is held fixed while we go through all permuations of the other elements. Once we are done, we swap the first and second element and go through the process again. If nothing else, it is a very efficient use of modular mathematics and vectors.

func reverse(in []rune) []rune {
    out := make([]rune, len(in))
    for i, v := range in {
        out[len(out)-i-1] = v
    }
    return out
}

reverse() is a helper that Next() uses to copy one slice to another, reversing the order in the process.

func generate(from, to rune) []rune {
    list := make([]rune, 0, to-from+1)
    for itr := iutil.Range(int(from), int(to)+1); !itr.AtEnd(); itr.Next() {
        list = append(list, rune(itr.Int()))
    }
    return list
}

generate() accepts paramters such as 'A', 'D' and returns a slice such as 'A','B','C','D'. It utilizes i.Range to generate an integer interval such as [63, 67) that is used to generate the rune slice.

i.Each(
    Permute(generate(from, to)),
    func(itr i.Iterator) bool {
        r, _ := itr.Value().([]rune)
        fmt.Println(string(r))
        return true
    })

The useage is the same as we've used to before, simply iterator through all possible permuations and print them out.
Get the source at GitHub.

Wednesday, September 4, 2013

4.3: Examples

The first example is to use the Dirwalk iterator from the previous post and create some filters on top of it. To do that, we need to create a Filter iterator.

type FilterFunc func(Iterator) bool

type filter struct {
    f   FilterFunc
    itr Forward
}

func Filter(f FilterFunc, itr Forward) Forward {
    return &filter{f, itr}
}

func (i *filter) AtEnd() bool {
    for !i.itr.AtEnd() {
        if i.f(i.itr) {
            return false
        }
        i.itr.Next()
    }
    return true
}

func (i *filter) Next() error {
    return i.itr.Next()
}

func (i *filter) Value() interface{} {
    return i.itr.Value()
}

func (i *filter) Error() error {
    return i.itr.Error()
}

The Filter iterator is a struct that wraps around a Forward iterator. For most part it simply forwards any method calls to the enclosed iterator. The filtering function is provided by the AtEnd() method. If the wrapped iterator contains further elements, it simply loops until it finds an element that fulfills the requirements of the user supplied FilterFunc. At first, this seems to break the idempotency requirement. But if we call AtEnd() it will loop until it hits the end of stream or finds an element that is not filtered. If we then call AtEnd() again without calling Next(), it simply finds the same element again and returns at the same position.

I've put this Filter iterator into the i package.

Now the problem is simply to create a FilterFunc that performs the filtering we require.

func hasInName(val string) i.FilterFunc {
    val = strings.ToUpper(val)
    return func(itr i.Iterator) bool {
        filename, _ := itr.Value().(string)
        return strings.Contains(strings.ToUpper(filename), val)
    }
}

func not(f i.FilterFunc) i.FilterFunc {
    return func(itr i.Iterator) bool {
        return !f(itr)
    }
}

hasInName performs a case insensitive search, matching val to filenames that Dirwalk returns. not simply negates the result of any FilterFunc.

i.Each(
    i.Filter(not(hasInName("example")), Dirwalk(os.Args[1])),
    func(itr i.Iterator) bool {
        fmt.Println(itr.Value())
        return true
    })

This constructs loops through all filenames under os.Args[1] and returns any that don't contain the string example.

Get the source at GitHub.

Tuesday, September 3, 2013

4.2.2: dir_walk()

The recursive dirwalk can be rewritten as an Forward iterator.


type dirwalk struct {
    cur   string
    queue []string
    err   error
}

func Dirwalk(filename string) i.Forward {
    // remove trailing /
    strings.TrimSuffix(filename, "/")
    // construct and initialize
    var dw dirwalk
    dw.queue = []string{filename}
    dw.Next()
    return &dw
}

The dirwalk structure contains a queue of filenames to be processed. The cur variable holds the filename that will be returned when Value() is called. Dirwalk simply constructs this structure and returns it, encapsulated as a Forward iterator.


func (dw *dirwalk) Value() interface{} {
    return dw.cur
}

func (dw *dirwalk) Error() error {
    return dw.err
}

func (dw *dirwalk) AtEnd() bool {
    return len(dw.queue) == 0
}

The length of the queue represents the state of the iteration; once it hits zero we are done.


func (dw *dirwalk) Next() error {
    // pop head from queue
    dw.cur, dw.queue = dw.queue[0], append(dw.queue[1:])

    // open file
    var file *os.File
    if file, dw.err = os.Open(dw.cur); dw.err != nil {
        return dw.err
    }
    defer file.Close()
    // stat file
    var stat os.FileInfo
    if stat, dw.err = file.Stat(); dw.err != nil {
        return dw.err
    }
    if stat.IsDir() {
        // read files in directory
        var files []string
        if files, dw.err = file.Readdirnames(0); dw.err != nil {
            return dw.err
        }

        // add files in directory to queue
        for _, subfile := range files {
            dw.queue = append(dw.queue, dw.cur+string(os.PathSeparator)+subfile)
        }
    }
    return dw.err
}

Next is the engine of the iterator. It pops the queue for the current filename and then processes it. If it is a simple file we are done and return. If it is a directory we read through it and put the filenames in it on the queue.


for itr := Dirwalk(os.Args[1]); !itr.AtEnd(); itr.Next() {
    fmt.Println(itr.Value())
}

Using the iterator is very simple, simply construct it and loop through it. This provides an opportunity to write the first higher order function that depends on the Iterator interface. The loop can be written as an Each function as follows:


type EachFunc func(i Iterator) bool

func Each(i Forward, e EachFunc) {
    for ; !i.AtEnd(); i.Next() {
        if !e(i) {
            break
        }
    }
}

It accepts a Forward iterator and a function of type EachFunc, loops through the data stream and calls the function on each Iterator. If the function returns false, the loop breaks. Now we can use the Dirwalk() iterator as follows:


Each(Dirwalk(os.Args[1]), func(itr i.Iterator) bool {
    fmt.Println(itr.Value())
    return true
})

Yes, this is very trivial and of dubious value, but it is a start. You can get the source at GitHub and you can get the source for the Each function in the i package.

Monday, September 2, 2013

4.2.1: A Trivial Iterator - Take three

Iterators are about movement across data, and what differentiates one class of iterators from another is the type of movement it allows. The Upto iterator is essentially a Forward iterator, it provides a way to move forward, to get the value at the current location, and a way to check if we have reached the end of the data.


type upto struct {
    m, n int
}

func Upto(m, n int) *upto {
    return &upto{m: m, n: n}
}

func (i *upto) Next() {
    i.m++
}

func (i *upto) AtEnd() bool {
    return i.m >= i.n
}

func (i *upto) Value() int {
    return i.m
}

As before, we use a struct to capture the state of the iterator. Three methods provide the functionality of the iterator: Value() gives us the value at the current location, Next() moves the iterator to the next location, and AtEnd() allows us to check if we've reached EOS. At the cost of a bit more of boilerplate, we now have an iterator with a well defined interface and no hidden behaviours.


func main() {
    for i := Upto(3, 5); !i.AtEnd(); i.Next() {
        fmt.Println(i.Value())
    }
}

This design fits very well in the initialization, check, increment design of the standard for loop. Get the source at GitHub.


Formal design of iterators

Going forward this is the design that I will use, and I want to formalize it a bit more, especially with respect to types of iterators. These are all external iterators classified according to the movements they provide and how they define the boundaries of the data sets they provide access too.


Iterator interface {
    Value() interface{}
    Error() error
}

The basic interface is the Iterator. It simply provides two idempotent ways to access the value and last error raised.


Forward interface {
    Iterator
    Next() error
    AtEnd() bool
}

The Forward is a extension of the Iterator. It provides previously seen Next() method and AtEnd() method. This is a iterator to provide access to any lazily evaluated, potentially infinite stream of data. The stream must have a way to move forward, but whether it has any end at all is unanswered.


BiDirectional interface {
    Forward
    Prev() error
    AtStart() bool
}

The BiDirectional is a Forward that allows us to move back in the stream with Prev() and check if we reached the start of the stream with AtStart(). AtStart() should be idempotent.


BoundedAtStart interface {
    Forward
    First() error
}

BoundedAtStart is a Forward iterator that has a well defined beginning that it can be reset to in an efficient manner.


Bounded interface {
    BiDirectional
    First() error
    Last() error
}

Bounded is a BiDirectional with the ability to jump to the end of the data set quickly. This is e.g. a file on disk where we can quickly seek to the end of the file.


RandomAccess interface {
    Bounded
    Goto(int) error
    Len() int
}

The last class of iterators is the RandomAccess. It is a Bounded iterator with two more functions. Goto() provides a quick way to jump into any point in the dataset while Len() gives us a quick count of distinct elements in the data set.

To understand the difference between Bounded and RandomAccess think of a text file on the disk. If we want to iterate over the individual bytes in the file, RandomAccess will provide us all the tools needed with acceptable performance. If we want to iterate over the individual lines in the file, we have to do with a Bounded, since there is no efficient way to either count the individual lines or position them in the file.

This is the design I will use moving forward and to that end I've set up a new project on GitHub. Right now it only contains the definitions of iterator interfaces and a RandomAccess version of the Upto iterator, now named Range, but I will add more higher order algorithms and iterators to it as I progress through the chapter.