Skip to content
This repository was archived by the owner on Oct 18, 2019. It is now read-only.

Idiomatic Go code #1

Merged
merged 3 commits into from
Oct 13, 2014
Merged

Idiomatic Go code #1

merged 3 commits into from
Oct 13, 2014

Conversation

egonelbre
Copy link
Contributor

Here's how you would implement this properly in Go.

@egonelbre
Copy link
Contributor Author

PS. I didn't test the performance, because I was too lazy to download the tweet dataset.

dimroc added a commit that referenced this pull request Oct 13, 2014
@dimroc dimroc merged commit 972cc0b into dimroc:master Oct 13, 2014
@dimroc
Copy link
Owner

dimroc commented Oct 13, 2014

Thanks for the suggestions! Code looks great but it seems to be skipping over a step. It foregoes writing the mapped results to a file and does it in memory. This makes the solution inconsistent with the others ones. Is this right?

@dimroc dimroc self-assigned this Oct 13, 2014
@egonelbre
Copy link
Contributor Author

Yes, it skips that step intentionally... because otherwise it wouldn't be an idiomatic solution; it doesn't make sense to write those files in any language. It just doesn't make sense to impose random constraints on a solution when it solves the problem. If for some peculiar reason there is need for such files... it would do the counting and writing the files simultaneously (i.e. read input, count and write files).

@dimroc
Copy link
Owner

dimroc commented Oct 13, 2014

You're making the false assumption that the reducers will be running on the same box as the mappers, thus eliminating the need for mappings to be written to a file. The common way to handle large scale map reduce across all systems and languages is to write to disk, such as hdfs, and bring the code to the data rather than vice versa. If we have hundreds of thousands of input files, we'll break down.

My little test does run on just one computer, but the underlying constraints mimic ones that happen in the real world when spanning dynos on heroku, instances on EC2, or instances on mesosphere. It has nothing to do with what's idiomatic for a particular language.

I definitely appreciate the cleaner Go code and will learn from it, but there's definitely a space for mapping to a file and then reducing from a collection of files in any language. I'll have to revert the commit unless we map to a file 👎 . Even if you disagree with the above points, the benchmark is invalidated by skipping the IO which all the other implementations are doing.

@egonelbre
Copy link
Contributor Author

The common way to handle large scale map reduce across...

I guess, my problem was that I didn't see this as a large problem. Also I wouldn't write "0", "1" entries, I would do such reductions immediately and write that to the disk.

I already deleted my repo, but the code would look something like:

// queues
filenames := make(chan string, *procs)
filtered := make(chan string, *procs)
results := make(chan map[string]int, *procs)

// find files
go func() {
    filelist, err := ioutil.ReadDir(*input)
    check(err)
    for _, file := range filelist {
        filenames <- filepath.Join(*input, file.Name())
    }
    close(filenames)
}()

// parse and filter
Spawn(*procs, func() {
    for filename := range filenames {
        file, err := os.Open(filename)
        outfilename := filename + ".out"
        out, err := os.Create(outfilename)
        check(err)

        scanner := bufio.NewScanner(file)
        for scanner.Scan() {
            record := strings.Split(scanner.Text(), "\t")
            // id, hood, borough, message
            hood, message := record[1], record[3]
            if reQuery.MatchString(message) {
                out.WriteString(hood + "\t1\n")
            } else {
                out.WriteString(hood + "\t0\n")
            }
        }
        file.Close()
        out.Close()
        filtered <- outfilename
    }
}, func() { close(filtered) })

// reduce
Spawn(*procs, func() {
    count := make(map[string]int)
    for filename := range filtered {
        file, err := os.Open(filename)
        check(err)

        rd := bufio.NewReader(file)
        for {
            line, _, err := rd.ReadLine()
            if err == io.EOF {
                break
            }
            check(err)
            record := strings.Split(string(line), "\t")
            // hood, matches
            hood, matches := record[0], record[1]
            if matches == "1" {
                count[hood]++
            }
        }

        file.Close()
    }
    results <- count
}, func() { close(results) })

Probably in the first span there could be additional "bufio.Writer", I can't remember whether it already did that. Also, I'm not sure whether bufio.Reader or bufio.Scanner was faster.

@dimroc
Copy link
Owner

dimroc commented Nov 19, 2014

Just got the chance to take a look at all this and the code in your comment mostly worked. Thanks again for the implementation. I found the Spawn helper method very interesting and will write up on it.

@dimroc
Copy link
Owner

dimroc commented Nov 16, 2015

Hi @egonelbre,
Would you be up to write a quick README.md in the golang/ folder on the implementation details? A lot of it is covered in this discussion.

@egonelbre
Copy link
Contributor Author

I'm not sure what parts of it? It's pretty straight-forward, except the Spawn func.

@dimroc
Copy link
Owner

dimroc commented Nov 16, 2015

Talking about what makes it idiomatic and the Spawn func would be great.

On Mon, Nov 16, 2015 at 11:24 AM, Egon Elbre [email protected]
wrote:

I'm not sure what parts of it? It's pretty straight-forward, except the
Spawn func.


Reply to this email directly or view it on GitHub
#1 (comment)
.

@egonelbre
Copy link
Contributor Author

Idiomatic usually means, similar to stdlib. The best way I can describe it is, avoid abstractions, unless proven to be necessary, and avoid abstractions that hide performance, until proven to be necessary. tl;dr; idiomatic == very concrete code.

The Spawn func creates N goroutines and uses atomic-s to also run finishing funcs after all goroutines have terminated. This requires some knowledge how atomics work.

There really isn't much more to explain than already exists in the code.

@dimroc
Copy link
Owner

dimroc commented Nov 16, 2015

There is also the dramatic performance difference between regex and
substring check. This discussion alone is enough to drop in the readme, but
if you don't feel it's worth the effort you can leave it as is.

On Mon, Nov 16, 2015 at 11:33 AM, Egon Elbre [email protected]
wrote:

Idiomatic usually means, similar to stdlib. The best way I can describe it
is, avoid abstractions, unless proven to be necessary, and avoid
abstractions that hide performance, until proven to be necessary. tl;dr;
idiomatic == very concrete code.

The Spawn func creates N goroutines and uses atomic-s to also run
finishing funcs after all goroutines have terminated. This requires some
knowledge how atomics work.

There really isn't much more to explain than already exists in the code.


Reply to this email directly or view it on GitHub
#1 (comment)
.

@egonelbre
Copy link
Contributor Author

I'm not sure what's going on there... I haven't monitored the perf. I can take a quick look tomorrow. Although, the code hasn't been written as high-performance. It's written to be nice and readable.

As why the regexp isn't as fast, I can only guess ATM:

  1. it's hitting some corner case, due to the (?i)
  2. It's due to golang regexp-s linear-time guarantee https://swtch.com/~rsc/regexp/regexp1.html, https://golang.org/pkg/regexp/. It makes some trade-offs with regards to worst-case scenario.
  3. (?i) is not doing the same as strings.ToLower check, e.g. it might have more extensive unicode handling. (although very unlikely)

@egonelbre
Copy link
Contributor Author

I created golang/go#13288 for this. It is doing some interesting case-folding, and I'm not sure why.

@dimroc
Copy link
Owner

dimroc commented Nov 17, 2015

This is fantastic. It would be great to get a PR into golang as a result of
your analysis. I'll keep tabs on it.

Still don't think we have material for a README? 😃
Down the road, people won't know to read this thread.

On Tue, Nov 17, 2015 at 4:54 AM, Egon Elbre [email protected]
wrote:

I created golang/go#13288 golang/go#13288 for
this. It is doing some interesting case-folding, and I'm not sure why.


Reply to this email directly or view it on GitHub
#1 (comment)
.

@egonelbre
Copy link
Contributor Author

For the README, probably not, for a blog-post, maybe. For me it's just another "technical details of optimizing". However, the issue might be worth mentioning in the benchmark, i.e. "regexp is slow in Go 1.5 with case insensitive matches, it's being addressed in issue https://golang.org/issue/13288".

dimroc pushed a commit that referenced this pull request Jun 9, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants