Wednesday, November 20, 2013

The many faces of nameless chunks of code in Ruby

The internet is littered with posts about the differences between blocks/procs and lambdas. But when experienced programmers first encounter this topic, they have several questions about it, and hence find themselves having to read several posts or threads. I'm going to try to summarize the answers to several related questions here.

At a very high level, as far as the philosophy behind the design of the language goes, there is the question of why there are different ways to do the same thing. One of the guiding principles at play is that there is more than one way to do everything. So if two approaches to a problem are convenient in different contexts, both often exist, so that you always have the more convenient option at hand.

Another important guiding principle that's part of Ruby is the principle of least astonishment. I'm just mentioning it here, but I'll talk about why it's relevant further down.

On a mechanical level it's important to understand the differences between lambdas and blocks/procs. Here is one source. The important differences are argument checking and the behavior of return, break and next. It's important to understand not only how they differ, but also that there are things you can do with lambdas that you cannot do with blocks, and things you can do with blocks that you cannot do with lambdas.
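
A minimal sketch of those two differences (the method names here are purely illustrative):

# Arity: lambdas check their arguments strictly, procs do not.
l = lambda { |a, b| [a, b] }
p = proc   { |a, b| [a, b] }

p.call(1)    # => [1, nil]     -- the missing argument becomes nil
# l.call(1)  # => ArgumentError -- the lambda insists on two arguments

# return: in a lambda it returns from the lambda; in a proc (or block)
# it returns from the enclosing method.
def try_lambda
  l = lambda { return :from_lambda }
  l.call
  :after_lambda          # reached: the lambda's return stopped at the lambda
end

def try_proc
  p = proc { return :from_proc }
  p.call
  :after_proc            # never reached: the proc's return exits try_proc
end

try_lambda  # => :after_lambda
try_proc    # => :from_proc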

The next thing to understand is what they are.

  • Lambdas are just like the lambdas you see in most other languages that have anonymous functions: a closure that you can pass around as a parameter, return from a method, and call later. 
  • Blocks are something completely new. They are a syntactic construct, not first-class objects. At a superficial level they behave like lambdas (with restrictions: a method can take only one, it comes after the other arguments, and so on), and the differences between blocks and lambdas have been described above.
  • A Proc is like an object representation of a block. You use a proc when you'd like to accept a block and then do something funky with it. The real decision, API-wise, that you must make is whether you want to accept a lambda or accept a block; procs are what you get when you peek under the hood and decide that you really need to do something a little more unusual. (There is a small sketch of this decision just after this list.)
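
As promised, a small sketch of that decision. The method and argument names are made up for illustration: a method can take a block and call it with yield, capture the block as a Proc with &, or simply take a lambda as an ordinary argument.

# Accepting a block: the caller uses block syntax; the method captures the
# block as a Proc with & when it needs to hold on to it (or just uses yield).
def with_logging(&blk)       # blk is a Proc wrapping the block
  puts "before"
  result = blk.call          # same as yield
  puts "after"
  result
end

with_logging { 1 + 1 }       # => 2

# Accepting a lambda: it's just an ordinary argument you call later.
def with_logging_fn(fn)
  puts "before"
  result = fn.call
  puts "after"
  result
end

with_logging_fn(lambda { 1 + 1 })  # => 2
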
People who are used to lambdas find blocks and procs inelegant and confusing. Inelegant because there already is a mechanism to pass closures around; confusing because they have unusual semantics. Once you have internalized what they are and what they can do differently, it's easy to see that these differences at times make certain things possible that would not have been possible otherwise. 

But let's talk about this being confusing and the principle of least surprise. The reason this seems confusing is that it's not how closures in other languages behave. But the principle of least surprise has never been about compatibility with concepts from other languages. All that the POLA says is that once you understand how something works and how to think about it, APIs will ideally behave in a non-surprising manner. It does not imply that you will not have to learn how Ruby works. What it DOES imply, though, is that there is a conceptual model that can be used to understand blocks that is different from the conceptual model people use for lambdas. One where the behavior of return in blocks is not confusing at all.

The way to think about blocks is that the code inside a block belongs to the function in which it is defined, and hence the code inside should behave exactly as it would if it were outside the block. A block is not a function that you are passing in ... that's a detail of how it's implemented. A block is a chunk of code from your function that executes when the method you call chooses to activate it. And since it's a chunk of code from your function, the return statement continues to behave the way it would behave if it were not inside the block. Code inside the block should not be thought of as special compared to code outside; it just happens to run at a time that's not of your choosing. It's a different mental model.
The keywords next and break are special for code inside a block: next jumps out of the block (ending the current pass), and break jumps out of the method that took the block. This makes them consistent with loops. Since blocks are so often used for iteration, break and next behave just the same inside iterators as they do in regular loops.
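
A small sketch of that mental model, with made-up method names:

def first_even(numbers)
  numbers.each do |n|
    next unless n.even?   # next: skip the rest of this pass, like in a bare loop
    return n              # return: exits first_even itself, just as it would
                          # outside the block
  end
  nil                     # only reached if each finishes without a return
end

first_even([1, 3, 4, 5])  # => 4

# break, by contrast, would terminate the each call itself and carry on inside
# first_even, exactly the way it terminates a plain while loop.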
  


Sunday, October 31, 2010

Introducing StepRewrite (aka I finally understand macros)

So it's finally done! I could probably do a bit more, but for the moment it's working and packaged! For a while now, I've been writing about a problem I want to solve, and discovering the tools to solve it. Of course it's also a problem that may not really exist, but fuck it! To really understand the motivations behind this thing I've built, follow those links above, but here is the short version.

I said that Evented IO makes you write ugly code just to sequence a bunch of operations. I figured out that the only way to be able to sequence code normally but run it in an evented IO environment is to use macros and actually rewrite code. So I wrote step_rewrite (which got that name since it was born out of an unholy union of step.js and rewrite). If you install it, you can write code in a flat, sequential style that behaves as if it had been written as a chain of nested callbacks.
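
The original inline examples aren't reproduced here, so the following is only a rough sketch of the idea, reusing the file-writing example from the earlier posts. The fs methods and the step_rewrite entry point are guesses made up for illustration; the &_ placeholder is the convention described below.

# What you write: flat, sequential code, with &_ marking the spot where each
# call would normally take its callback. The code is rewritten before it runs,
# so &_ is only a marker.
step_rewrite do
  file = fs.open("hello.txt", "w", &_)
  fs.write(file, "hello ", &_)
  fs.write(file, "world", &_)
  fs.close(file, &_)
end

# Roughly what it behaves as if you had written: each call now receives a
# block containing all the code that followed it.
fs.open("hello.txt", "w") do |file|
  fs.write(file, "hello ") do
    fs.write(file, "world") do
      fs.close(file) do
      end
    end
  end
end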

This doesn't mean that you are forced to (or in fact should) write every block in this manner. It's meant to be used only when the blocks exist purely to sequence the rest of the method to occur inside a callback, i.e., when you really do intend what the first piece of code implies: that you have a series of operations that should occur one after another, and it just so happens that they perform IO. (If you squint really hard, you can pretend the &_ bits are invisible.)

So step_rewrite can be used either as a function that takes a block to eval, or as a function that takes a block to define a method. It rewrites the code, converting every function call that takes the special callback, followed by some other code, into a form where that function now receives a block with the rest of the code inside it, for situations where you intend the former but your environment requires the latter.

It also converts return values into block arguments.
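
The original example isn't reproduced here; based on the hunter.kill call discussed just below, the shape of this rewrite is roughly the following (the victim variable and the bury call are made up):

# Before rewriting: the result of the async call is assigned as if it were an
# ordinary return value.
victim = hunter.kill(&_)
bury(victim)

# After rewriting: the "return value" arrives as a block argument instead.
hunter.kill do |victim|
  bury(victim)
end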

This is acceptable for the most part because using &_ has made hunter.kill the last statement in its block, so anything it returns will be the return value of its block.

Of course this abstraction is leaky. There are a lot of complicated situations where you have to be aware of what the converted code will look like. I just hope that 80-90% of the time, you can be oblivious to it.

This works using the ParseTree gem. ParseTree converts code into S-Expressions, the language of macros. Here are some examples of S-expressions.
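
The original examples aren't reproduced here, but the general shape is something like this. The exact node names vary between ParseTree/parser versions, so treat these as approximations; the s() helper comes with the sexp_processor gem.

require 'sexp_processor'   # provides Sexp and the s() helper

# x = 1 + 2
s(:lasgn, :x, s(:call, s(:lit, 1), :+, s(:lit, 2)))

# foo.bar { |y| puts y }
s(:iter,
  s(:call, s(:call, nil, :foo), :bar),
  s(:args, :y),
  s(:call, nil, :puts, s(:lvar, :y)))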

Given the S-expression, I can now manipulate it: chop out bits, wrap it in other pieces. The resulting S-expression is converted back into Ruby using Ruby2Ruby and I'm done :).

So yeah, what I really do is convert an S-Expression of the first form to the second form.

I'm excited about this because I finally understood what the whole macro thing was about. I've always heard that it was about extending the language itself, but I never really got it. It seemed to me that anything I wanted to do could either be implemented using good old-fashioned metaprogramming, or was not possible even with macros. I could not see this middle ground of extending the language without writing a full-fledged parser. But now I finally see it :)

Lastly, it looks like Narrative.js does something similar to what I was trying to do. It contains a full JavaScript parser, so it's a bigger project. I need to look at it to see if it converts code into a similar form, and if so, how it overcomes the various problems I have because of leaky abstractions.


Wednesday, October 20, 2010

Silly Mistake

So recently I stumbled across this webcomic Wonderella, which is awesome! After I read a few pages, I decided to do what I do with all webcomics I like, ... download the whole thing. Because I can't stand that 30 second - 1 minute wait between pages. Previously, I used to do this sort of thing with wget, and try to discover the rule, and invariably there would be some complication or the other.

Then I wised up and started using Mechanize, which made things a breeze. And sometimes I've started multiple threads/processes and allocated tasks to the threads so that the whole download would happen faster. But this time I thought, "I've been thinking and writing about evented I/O a bit now, and someone on the Node.js IRC channel had mentioned he used node.js for HTTP scraping, so why not try the same."

I decided to use it with EventMachine because that way I could also test my new (then work-in-progress) pet project, which I'm going to write about soon. So I started reading about EventMachine and used Hpricot for the HTML parsing. In 30 minutes I had the script ready and I had tested bits of it.

I asked it to just get the first four comics from the archive page, and it worked. So I did it again, this time asking it to get the entire thing. But, nothing happened!! Nothing was downloaded. Little print statements let me know that it had correctly parsed the archives page and was going after the right pages, but nothing! I let it run for a while, but still nothing :/.

It took me a while to realize that I had been solving the wrong problem. I wanted to do things in parallel. But my bottleneck wasn't how much memory I could afford on my machine, which is where evented I/O helps you. My bottleneck was how many connections the webserver would allow me to make. I was attempting to download all the comics simultaneously, and that didn't work.

So like yeah, funny story :/. I just redid it with Mechanize and it was delicious. Here is the code if anyone is curious what EM code looks like.
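
The original script isn't included here, but for the curious, EventMachine code of this shape looks roughly like the sketch below. The URL, the selector, and the save_comic helper are made up; the real script used Hpricot for parsing and em-http-request for the fetches. Note that it fires every request at once, which is exactly the mistake described above.

require 'eventmachine'
require 'em-http-request'
require 'hpricot'

EventMachine.run do
  archive = EventMachine::HttpRequest.new('http://example.com/archive').get

  archive.callback do
    doc   = Hpricot(archive.response)
    links = (doc / 'a.comic-link').map { |a| a['href'] }   # selector is illustrative

    # Kick off every download at once: the server won't accept this many
    # simultaneous connections, which is why nothing ever came back.
    pending = links.size
    links.each do |url|
      req = EventMachine::HttpRequest.new(url).get
      req.callback do
        save_comic(url, req.response)                      # hypothetical helper
        pending -= 1
        EventMachine.stop if pending.zero?
      end
    end
  end
end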

Sunday, October 3, 2010

Changing the Rules

Evented programming with non-blocking I/O is the new black. In evented I/O systems, every I/O operation (or at least the expensive ones) takes a callback function and executes asynchronously, so that your code does not block on I/O. It either continues on, or it sleeps to allow a different request a chance. The main purpose of these systems is that, without really getting into concurrency or threads, people can parallelize a system that spends a fair amount of time on I/O and handle several independent requests simultaneously.

Every time an I/O operation starts, the system registers your callback and continues on. Once the input or output operation is completed, the callback executes. So every time you intend to do any input or output operation, you put the code that's meant to execute after it is done in the callback. But this means that in any interesting program, a large part of it is going to be spent in nested callbacks.

Let's look at the sample piece of code I've been playing around with. Except I'm going to use Ruby without EventMachine and mimic some code I've seen in node.js (node has non-blocking operations for everything, so it should be easier to follow). For anyone who hasn't read this before, this code writes hello to a file, appends world to it, and then reads the last line. Everything I/O takes a callback: writeFile, read, write, close, etc.
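
The original snippet isn't reproduced here, but it was roughly of this shape: a made-up, Node-flavoured file API in Ruby where every operation takes a block as its callback, so each subsequent step has to nest inside the previous one.

# Hypothetical async file API, mimicking Node's fs module in Ruby.
fs.open("hello.txt", "w") do |file|
  file.write("hello") do
    file.write(" world") do
      file.close do
        fs.read_last_line("hello.txt") do |line|
          puts line
        end
      end
    end
  end
end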

Notice that the primary purpose of callbacks here is not, as is customary, to plug custom behaviour into the middle of an algorithm. This is not the strategy pattern. It's far more low-level than that. It's one of the three fundamental control structures in programming (sequence, selection, iteration): callbacks here exist to sequence your operations. By using a callback, you are sequencing operations so that whatever is inside the callback happens after the current operation. And it can make your code a bit hard to follow.

Programming languages have had, from day one, a way to sequence operations: you just put the operations down one after the other, and that's the sequence they occur in. We could imagine a magical programming language which has all the power of the languages we currently use and love, with special support for evented IO, so that I can mark those calls as special but still sequence operations normally, and let the compiler or interpreter understand that some operations are to be executed only after the async one returns.

Step.js was a library that tried to solve this problem, so I ported it to Ruby. It takes a series of lambdas as input and executes them so that each subsequent one runs when the previous one's callback fires. Basically, it sequences them correctly. This is what the same code looks like with step in Ruby (cb is chosen as a magic variable to indicate that &cb is where the callback to each function would normally be passed).
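
The original snippet isn't shown here, but with the port it looked roughly like this, again using the made-up async file API from the sketch above; each lambda runs when the previous one's callback fires, and cb stands in for the callback each call expects. (This version also exhibits the bug described next: write never passes the file handle along, so close has nothing to work with.)

step(:cb, [
  lambda {        fs.open("hello.txt", "w", &cb) },
  lambda { |file| file.write("hello world", &cb) },
  lambda { |file| file.close(&cb) },   # broken: file here is whatever write's
                                       # callback passed on, not the open handle
  lambda {        fs.read_last_line("hello.txt", &cb) },
  lambda { |line| puts line }
])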

Firstly, it has a bug: close doesn't really work, because the file has gone out of scope; the file handle is not being passed on by write. Secondly, it's pretty ugly. Lambda, lambda, lambda ... If only I could take a bunch of code, without it being wrapped in lambdas, and sequence it correctly.

While Ruby allows for a lot of metaprogramming and building DSLs, this problem seems insurmountable unless we change the syntax of the language itself, unless we invent our own control structures that do what we ask them to and extend the language. What we need are macros.

A macro is something that allows you to automagically expand something preselected into a sequence of operations. Excel and Word had macros. Games have macros. It usually expands out into a larger body of code and prevents you from having to type it out. For example, you could imagine a macro that contains two or three operations that always occur together. You might want to allow for variable substitution in your macro so that it's actually useful.

The macro is read by some macro interpreter or compiler, it modifies the code, and then the regular interpreter or compiler reads your code. But how powerful should your macro system be? At first, you might decide to keep complexity low and only allow templating, so that, for example, you could have a macro that, given the name of a loop variable, generates code to run through its items and do some operation (in case your programming language of choice doesn't already support this). Later you might want rudimentary branching so that you can, for example, change the code that is executed in development mode to make for easy debugging. Or looping support to generate repetitive code. Before you know it, you have a whole other programming language. In which case, why not allow your programming language of choice itself to be your macro language? So if you use C, use the full power of C in your macros. If you use Ruby, use the full power of Ruby in your macros.

Lisp does that. Lisp gives you access to your parsed code in the form of S-expressions and allows you to modify or generate S-expressions before it executes this code. It's built into the language. And it works really well because the language itself has support for it.

Now Ruby and JavaScript have always had eval. You could load your source and modify it. But the effort of parsing such code is large, and you don't want to work at the level of strings. Which is why I got really excited when I saw this awesome video by Reginald Braithwaite, where he built some macros in Ruby, using the ParseTree project to parse Ruby and get S-expressions (which can then be modified) and Ruby2Ruby, which takes S-expressions and generates Ruby code.
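
The round trip looks roughly like this. I'm showing ruby_parser here as the source of S-expressions, since ParseTree's 1.8-era entry points differ slightly; treat the exact calls and output as illustrative.

require 'ruby_parser'
require 'ruby2ruby'

# Parse Ruby source into an S-expression...
sexp = RubyParser.new.parse("a = 1\nputs a")

# ...manipulate the sexp here: chop bits out, wrap them in other nodes...

# ...and turn it back into Ruby source.
puts Ruby2Ruby.new.process(sexp)
# prints something close to:
#   a = 1
#   puts(a)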

So my next step is to rewrite Step using macros so that you can sequence code the usual way, but allow it to work on an evented IO system.

Lastly, let me leave you with a joke.

A drunk loses the keys to his house and is looking for them under a lamppost. A policeman comes over and asks what he’s doing.

“I’m looking for my keys” he says. “I lost them over there”.

The policeman looks puzzled. “Then why are you looking for them all the way over here?”

“Because the light is so much better”.


Thursday, September 30, 2010

Node.js means having to put your toys back in the closet after you're done :(

When I was first exposed to Ruby and how easy it was for functions to accept blocks and how ubiquitous such functions were, I was delighted that for example
File.open("some_file") { |file| foo(file); bar; baz; qux }
would let me do a bunch of operations on a file and not have to worry about closing the file. This was similar to how cool it was to move from C++ to C# the first time and stop worrying about memory. This was how things should be.
And then recently this post on Stack Overflow led me to this code sample in Node.js. The code is opening a file and appending the word world.

I have to remember to close the file!! In a language that allows functions to be passed as parameters, I have to remember to close the file. How much does that suck? My colleague Rakesh Pai, a node.js fan, suggested that maybe this is because this is a low-level API, and higher-level APIs will take care of this problem.

But on thinking about it, this may not happen. I mean, I could imagine an fs.append function that takes a file and some text, appends it and then calls the callback, but not something that gives you access to a file object to play with.
Or maybe fs.open could take two functions: one meant to be executed after opening the file, and a second to be executed after closing the file. But this will probably cause a lot of scoping grief. So maybe the first function will have to return an array of variables that then become available to the second one?

So it's possible this never happens. If this is the case, is it possible that the file will not be closed for a long time? I mean if your code is something like

Then wouldn't your workers hang on to the file handle for an awfully long time, even though it isn't being used? Since the file is trapped in that closure there, there's no way out, unless my runtime has analyzed my code and seen that I will never use file from this point on, not even via an eval. I can't see the GC or whatever figuring this out. How about if, say, as a good programmer, I do this.

Now can it figure it out, since I haven't passed file to my doJob function? Does that mean that in this scenario I'm safe? I assume it will, since file isn't trapped in a closure here. But I don't know how the V8 GC works. In the previous scenario, though, it feels to me like it will not be able to detect my intention, so we'll have to be a bit more careful programming with node.js.

Edit: From tuxychandru's comment, I realised that I may not have been very clear. I wasn't just whining about some API requiring people to do work; I understand that there will always be low-level APIs that require more care. I was saying that in some ways node is being positioned as an answer to the programming ills of certain kinds of problems, and that it looks like the paradigm has brought back a previously solved problem: that as part of everyday programming, cleaning up after yourself is required again.

Tuesday, September 28, 2010

Naive Implementation of Step.js in Ruby

I wrote about step.js here and here, since I thought it was really cool. I've built a naive implementation in Ruby. To demonstrate the usage, imagine the same set of functions that node provides existed in Ruby. So this is how code would be written both in the serial and parallel cases.

Ruby uses self, unlike the this in JavaScript. But self is lexically scoped, whereas this is dynamically scoped. So here I have chosen to have the caller decide the name of the magic variable (cb in the example above) and use instance_exec to make it happen. There's no real advantage to this approach though, since existing references to self will still break.
I call it a naive implementation because I've made assumptions that are true in the node environment, but not necessarily in a random piece of Ruby code. Hence, there are some issues. But first, here is the code itself.
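
The original implementation isn't reproduced here; the following is only a rough sketch of the serial half, under the assumptions described above: the caller picks the magic variable name (cb in the examples), the lambdas come in as the second parameter, and the chain is built with a fold over that list, which is the bit people's eyes glaze over.

# Sketch of step's serial case. A throwaway context class defines the
# caller-chosen magic variable as a method that returns "call the next
# lambda" as a proc; instance_exec runs each lambda against that context.
def step(callback_name, lambdas)
  context_class = Class.new do
    define_method(:initialize) { |next_step| @next_step = next_step }

    define_method(callback_name) do
      next_step = @next_step
      lambda { |*results| next_step.call(*results) }
    end
  end

  # Fold from the right: wrap each lambda so that, when it runs, the magic
  # callback points at the lambda that follows it. The last lambda gets a
  # do-nothing callback.
  chain = lambdas.reverse.inject(lambda { |*args| }) do |next_step, current|
    lambda do |*args|
      context_class.new(next_step).instance_exec(*args, &current)
    end
  end

  chain.call
end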

So what are the problems?
  • The serial implementation has a bug when you have a callback function receiving a parameter that is needed both by it and by the functions further down. Since you normally rely on nested scopes in JavaScript or Ruby, the intermediate functions don't pass those parameters through. If you look at the first example I wrote, or at the Ruby example today, it actually has this bug: File.open provides a file handle meant to be used by writeFile and close. I didn't realize this was a problem until my Ruby code broke. The JavaScript implementation has the same bug, and I don't think it can be fixed.
  • The parallel implementation works by incrementing a counter every time someone requests the parallel callback, and executes the next function only when the counter is back to zero. This works in a node-style environment where any callback can execute ONLY after the current code completes executing, so the counter starts decrementing only after it has been fully incremented. Which means that if the I/O functions I use came from EventMachine, it may work. But sync calls masquerading as async will break, since they will execute the callback right away: the counter starts decrementing immediately and the next function will execute several times. This is the same implementation as the original JavaScript, but it is not a bug there since the node environment has no sync I/O. One of the reasons node gets more love than EventMachine.
  • At the end of it, this is as ugly as the JavaScript implementation. You have to wrap everything in a lambda call. It would be interesting to see if we could do better than that.
On a side note, I looked at the implementation of step after I implemented mine in Ruby. The original implementation chose a more procedural style for the main piece, whereas I chose a more functional style. But it feels as if the procedural one is easier to understand. The eyes of everybody I show my code to seem to glaze over at the bit where I do a fold over the list of lambdas I get as the second parameter to step. What do you think?
Of course all this goes out the window when I get to the parallel method where I have to maintain state externally. I have no idea how to do that in a more functional style. Any inputs would be cool :).

Wednesday, September 15, 2010

Beware ActiveSupport's Default Autoload Mechanism

By default, active_support sets its dependency loading mechanism to load rather than require. We're working on a non-Rails app with DataMapper, and some models add before :create hooks. If you don't take care, these models are loaded twice and the hooks are added twice, which can result in strange behaviour.
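
For example (the model, its properties, and the hook body here are made up), a model like this registers its hook once per evaluation of the file, so if dependency loading evaluates the file twice, the hook fires twice on every create:

require 'dm-core'
require 'securerandom'

class Order
  include DataMapper::Resource

  property :id,    Serial
  property :token, String

  # If this file gets load-ed twice, two copies of this hook are registered,
  # and the token (plus any side effects) gets generated twice per create.
  before :create do
    self.token = SecureRandom.hex(8)
  end
end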

So remember to do this: ActiveSupport::Dependencies.mechanism = :require