Memory profiling in Ruby

The easiest approach is to shell out to ps with the current pid and read back the resident set size (the amount of physical memory allocated to the process).

def Process.rss; `ps -o rss= -p #{Process.pid}`.chomp.to_i; end  

If you're only interested in a temporary heuristic for debugging a particular issue, this is probably fine. It's platform-specific, though, and you don't have any guarantees about what the garbage collector is doing between calls.

You can use ruby-prof, but measuring memory with it requires patching the Ruby interpreter.

There's also the memory_profiler gem, which uses the ObjectSpace allocation tracing API introduced in Ruby 2.1. Since this tracks allocations by origin, it can be resource intensive; in my particular case I found it used more memory than what it was profiling. It's also a young gem and still a bit buggy.
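To give a feel for what "tracking allocations by origin" means, here is a minimal sketch using the objspace standard library directly. It is not how memory_profiler itself is structured; the `origins` hash and the sample strings are made up for illustration.

```ruby
require 'objspace'

# Count how many tracked objects were allocated at each file:line.
origins = Hash.new(0)

ObjectSpace.trace_object_allocations do
  # Illustrative workload: three interpolated strings built on one line.
  strings = Array.new(3) { |i| "item #{i}" }

  strings.each do |s|
    file = ObjectSpace.allocation_sourcefile(s)
    line = ObjectSpace.allocation_sourceline(s)
    origins["#{file}:#{line}"] += 1
  end
end

origins.each { |origin, count| puts "#{origin} allocated #{count} objects" }
```

Keeping a record like this for every live object is where the overhead comes from.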

I ended up extracting the core of memory_profiler into a more basic thing which just looks at the total amount of memory allocated over the course of a block, and so is particularly suitable for unit tests:

require 'objspace'                                                              

module MemoryUsage  
  MemoryReport = Struct.new(:total_memsize)                                     

  def self.full_gc                                                              
    GC.start(full_mark: true)                                                   
  end                                                                           

  def self.report(&block)                                                       
    rvalue_size = GC::INTERNAL_CONSTANTS[:RVALUE_SIZE]                          

    full_gc                                                                     
    GC.disable                                                                  

    total_memsize = 0                                                           

    generation = nil                                                            
    ObjectSpace.trace_object_allocations do                                     
      generation = GC.count                                                     
      block.call                                                                
    end                                                                         

    ObjectSpace.each_object do |obj|                                            
      next unless generation == ObjectSpace.allocation_generation(obj)          
      memsize = ObjectSpace.memsize_of(obj) + rvalue_size                       
      # compensate for API bug                                                  
      memsize = rvalue_size if memsize > 100_000_000_000                        
      total_memsize += memsize                                                  
    end                                                                         

    MemoryReport.new(total_memsize)
  ensure
    # Re-enable the GC even if the block raised
    GC.enable
    full_gc
  end                                                                           
end  
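If all a test needs is a rough bound, a cheaper cousin of the module above is to count allocated objects rather than bytes. The sketch below is a different technique, not an extract of memory_profiler: `AllocationCount` is a hypothetical helper name, and it assumes Ruby 2.2 or later, where `GC.stat(:total_allocated_objects)` is available.

```ruby
# Hypothetical helper: counts objects allocated while the block runs.
# GC.stat(:total_allocated_objects) is a cumulative counter, so the
# difference is unaffected by garbage collections during the block.
module AllocationCount
  def self.of
    before = GC.stat(:total_allocated_objects)
    yield
    GC.stat(:total_allocated_objects) - before
  end
end

small = AllocationCount.of { 10.times { "x" * 10 } }
large = AllocationCount.of { 10_000.times { "x" * 10 } }

puts "small block: #{small} objects, large block: #{large} objects"
```

In a unit test you would assert that the count stays under some threshold, which catches accidental allocation blowups without tracing every object.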

Extending the Markdown syntax in Ghost

I'm writing a somewhat lengthy thing which really wants footnotes, but Ghost doesn't have any native syntax for them yet. You can put them in manually using HTML, but it is tedious, and troublesome to reorder. Fortunately it wasn't too hard to add my own hacky[1] implementation[2]:

// Adds footnote syntax as per Markdown Extra:
//
// https://michelf.ca/projects/php-markdown/extra/#footnotes
//
// That's some text with a footnote.[^1]
//
// [^1]: And that's the footnote.
//
//     That's the second paragraph.
//
// Also supports [^n] if you don't want to worry about preserving
// the footnote order yourself.

(function () {
    var footnotes = function () {
        return [
            { type: 'lang', filter: function(text) {
                var preExtractions = {},
                    hashID = 0;

                function hashId() {
                    return hashID++;
                }

                // Extract pre blocks
                text = text.replace(/```[\s\S]*?\n```/gim, function (x) {
                    var hash = hashId();
                    preExtractions[hash] = x;
                    return "{gfm-js-extract-pre-" + hash + "}";
                });

                // Inline footnotes e.g. "foo[^1]"
                var i = 0;
                var inline_regex = /(?!^)\[\^(\d|n)\]/gim;
                text = text.replace(inline_regex, function(match, n) {
                    // We allow both automatic and manual footnote numbering
                    if (n == "n") n = i+1;

                    var s = '<sup id="fnref:'+n+'">' +
                              '<a href="#fn:'+n+'" rel="footnote">'+n+'</a>' +
                            '</sup>';
                    i += 1;
                    return s;
                });

                // Expanded footnotes at the end e.g. "[^1]: cool stuff"
                var end_regex = /\[\^(\d|n)\]: ([\s\S]*?)\n(?!    )/gim;
                var m = text.match(end_regex);
                var total = m ? m.length : 0;
                i = 0; // reset the counter declared above

                text = text.replace(end_regex, function(match, n, content) {
                    if (n == "n") n = i+1;

                    content = content.replace(/\n    /g, "<br>");

                    var s = '<li class="footnote" id="fn:'+n+'">' +
                              '<p>'+content+'<a href="#fnref:'+n +
                                '" title="return to article"> ↩</a>' +
                              '</p>' +
                            '</li>';

                    if (i == 0) {
                        s = '<div class="footnotes"><ol>' + s;
                    }

                    if (i == total-1) {
                        s = s + '</ol></div>';
                    }

                    i += 1;
                    return s;
                });

                // replace extractions
                text = text.replace(/\{gfm-js-extract-pre-([0-9]+)\}/gm, function (x, y) {
                    return preExtractions[y];
                });

                return text;
            }}
        ];
    };

    // Client-side export
    if (typeof window !== 'undefined' && window.Showdown && window.Showdown.extensions) {
        window.Showdown.extensions.footnotes = footnotes;
    }
    // Server-side export
    if (typeof module !== 'undefined') {
        module.exports = footnotes;
    }
}());

  1. Please don't do this with regexes unless you have to, kids.

  2. Can be found on a branch.

OSW 2014 Update

The open science workshop went well! I gave my talk about SciRate and improving the way people do science. Slides are up here. It also doubled as a useful opportunity to test the site on people in person, and I got some good usability feedback.

Some of the other interesting projects I learned about:

  • SageMathCloud is a computational mathematics tool based on the Python-based Sage framework, and lets you collaboratively edit Sage worksheets and IPython notebooks.

  • Authorea is a web-based paper authoring tool supporting LaTeX and Markdown which uses git as a backend. It takes a lot of inspiration from GitHub, with unlimited free public projects and paid private ones.

  • NBViewer is a simple tool for taking IPython notebooks and displaying them publicly on the web.

  • eLife is an open access biosciences journal which uses a non-standard peer review process in which the reviewers collaborate directly with each other.

There were also a lot of nifty groups and events:

  • Inspire9 is the little hackerspace in Richmond where the workshop was held, and they host regular Ruby and Python meetups, among other events. There's an entire room painted with flowers and butterflies, so I felt quite at home.

  • HealthHack is happening in October this year, and brings together programmers and scientists to help solve medical research problems. It's been a good many years since I last used my biology background so I'm looking forward to this one.

  • The Open Knowledge Foundation runs HealthHack and lots of other cool stuff, like GovHack.

  • OpenTechSchool runs a bunch of experimental tech education projects, including Rails Girls which I'd heard about earlier. It has a Melbourne chapter!

I want to get involved in more of these things!

Open Science Workshop 2014

There's a little event on Saturday (July 19) about adapting techniques from the software community to make science more open and accessible. If you're a scientist in Melbourne, or a developer for a related project, I hope you consider coming along! I'll be there working on SciRate and giving a brief talk about it.

The mysterious nature of bots

A couple of years ago @JackLScanlan made a joke of some kind, as he often does. The subject of the joke was @Horse_ebooks, a uniquely Twitter oddity and likely the most infamous spambot to have ever lived. This seemed like a prime opportunity for silliness, so after a bit of coding @scanlan_ebooks was born. Little did I know this would be but the first of many robot clones.

Markov chain chatbots have a long history in programming, being very easy toy examples of a simple but powerful mathematical model which is used for a whole lot of more serious stuff. The classic Markov text generator maintains a probability map of which words are more or less likely to come after some number of preceding words, and builds a sentence by following it from a given start point.
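A minimal sketch of that classic generator, assuming a first-order model (one preceding word) and whitespace tokenization. The `MarkovChain` class and its method names are invented for illustration and are not taken from any particular library.

```ruby
# Minimal first-order Markov text generator (illustrative sketch only).
class MarkovChain
  def initialize(sentences)
    # Maps each word to the list of words seen directly after it.
    # Duplicates are kept so that sampling reflects observed frequency.
    @followers = Hash.new { |hash, word| hash[word] = [] }
    sentences.each do |sentence|
      sentence.split.each_cons(2) { |a, b| @followers[a] << b }
    end
  end

  # Walks the chain from `start` until a dead end or `max_words` is reached.
  def generate(start, max_words: 20, rng: Random.new)
    words = [start]
    while words.length < max_words
      options = @followers.fetch(words.last, [])
      break if options.empty?
      words << options.sample(random: rng)
    end
    words.join(" ")
  end
end

chain = MarkovChain.new(["the cat sat on the mat", "the dog sat on the rug"])
puts chain.generate("the")
```

Every adjacent pair of words in the output occurred somewhere in the corpus, but the sentence as a whole usually did not.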

The algorithm I use now is a variation on this. Instead of linearly chaining words, it starts with an intact sentence from the corpus and mixes it with one or more other sentences in a manner similar to DNA recombination. The Markov model is used to select the junction sites where this recombination occurs. This seems to strike a nice balance between diversifying the output and avoiding complete gibberish; the sentences it produces are grammatically correct more often than not. (well, assuming the source is!)
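The recombination idea can be sketched roughly as follows. This is a deliberately simplified stand-in, not the twitter_ebooks implementation: it assumes a junction is any word the two sentences share and picks one at random, where the real thing uses the Markov model to choose. `recombine` is a hypothetical name.

```ruby
# Illustrative sketch only; not the twitter_ebooks implementation.
# Splices the head of one sentence onto the tail of another at a shared word.
def recombine(base, other, rng: Random.new)
  base_words  = base.split
  other_words = other.split

  # Candidate junctions: [i, j] pairs where the same word occurs in both
  # sentences, excluding the final word of `other` so a tail remains.
  junctions = []
  base_words.each_with_index do |word, i|
    other_words[0...-1].each_with_index do |candidate, j|
      junctions << [i, j] if word == candidate
    end
  end
  return base if junctions.empty?

  i, j = junctions.sample(random: rng)
  (base_words[0..i] + other_words[(j + 1)..-1]).join(" ")
end

puts recombine("the cat sat on the mat", "a dog sat under the table")
```

Because each half is a contiguous run from a real sentence, the result tends to stay grammatical on either side of the junction.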

This has proliferated somewhat, and I have no idea how many of the various _ebooks accounts are using my twitter_ebooks Ruby gem or how modified they are. There have been bots based on novels, cartoon characters, and all manner of strange text corpora. Kevin Nguyen wrote a very introspective article about @knguyen_ebooks, deployed by @negatendo.

What I find much more interesting than the bots themselves, though, is the way people interact with them. People generally fall into three groups:

  • Those familiar with Markov chains who are being tongue-in-cheek about it
  • Non-programmers experiencing the ELIZA effect to various degrees
  • People who should probably never be relied upon to judge a Turing test

The third group is more populous than you might expect, especially if you include ESL speakers. My bots will try to imitate human interaction patterns, responding to mentions using keyword analysis to come up with something vaguely related to the input, and a slight random delay to avoid appearing superhuman. They will also follow back and occasionally favorite or RT tweets they find sufficiently interesting.

Some examples of amusing events in recent history:

mcc_ebooks and the robot uprising

I think @mcc_ebooks is my favorite overall, just because @mcclure111 and her friends are already so suffused with baffling surreal humor that it just sort of amplifies it.

People tend to give it the benefit of the doubt, which is often very sweet and heart-warming.

As the original human tweets at and about the bot, more bot-related statements enter the corpus, so it becomes "self-aware".

Which of course, has only one logical endpoint.

m1sp1dea_ebooks spooks Rackspace security

@m1sp1dea_ebooks uses a combined corpus consisting of my tweets and @0xabad1dea's. It's kind of a freakish hybrid. (People keep confusing the two of us anyway, somehow.)

Of course since @0xabad1dea spends a lot of time talking about infosec, it was inevitable that the bot would one day announce it had found a vulnerability.

And not do very much to discourage the idea.

Fortunately, a human quickly intervened.

The political intrigues of TonyAbotMHR

During the last Australian federal election season, someone made a joke about Tony Abbott and his propensity for Markov-like meaningless rambling. Thus, @TonyAbotMHR was born, using a slightly different algorithm that replaces nouns with random other nouns.

Occasionally, he is mistaken for the real thing by endearingly optimistic citizens who seemingly believe the denizens of high politics are likely to engage in individual discourse with them.

There's been at least one truly epic debate, covering everything from genetically modified giraffes to the local entertainment industry.

This man has since been elected Prime Minister, to our great dismay.

winocm_ebooks and the jailbreak swarm

@winocm has the highest follower count of my Twitter friends by a wide margin, mostly on account of her role in the iOS jailbreaking community. Sadly this means she is constantly pestered by people demanding the release of various things.

Fortunately, having the bot answer these people was a trivial extension to make to @winocm_ebooks.

  make_bot(bot, "winocm") do |gen|
    EM.next_tick do
      bot.stream.track("@winocm") do |tweet|
        text = tweet[:text].downcase
        if !tweet[:user][:screen_name].include?("_ebooks") && (text.include?("7.1") || text.include?("jailbreak") || text.split.include?("jb"))
          bot.reply(tweet, "@#{tweet[:user][:screen_name]} " + gen.model.make_response(tweet[:text]))
        end
      end
    end
  end

It works surprisingly well. People mention @winocm, receive a reply from @winocm_ebooks, and proceed to engage with it, seemingly unaware that their jailbreaking deity has been replaced with a robot.

These conversations go on for many, many pages. A few bold individuals even requested the bot's hand in marriage:

I'm fairly sure this isn't legal anywhere yet. Maybe Japan.

Can we draw any interesting conclusions from all of this? Probably not. I do like to think, though, that the readiness with which people engage with the bots speaks well of our capacity to accept that which is fundamentally different from us. Should true non-human intelligence appear, I hope we will be similarly ready to adapt our culture around it.

Instant feedback UIs are really nice

Just migrated my blog from Octopress to Ghost. While I like static site generators a lot in theory, the overhead of having to run a local server and wait (often several seconds) for it to regenerate meant I didn't write anything very often. I think Ghost represents a nice midpoint between this and WYSIWYG editors, in that it lets you use Markdown or HTML but also renders a preview of the end result in realtime:

In general, I like UI design which does this kind of keypress-level processing. Immediacy of feedback is important for learning: the shorter the gap between action and response, the easier it is for human brains to form the right connection between the two. It also helps with temporal discounting, an element of procrastination where humans have difficulty perceiving the value of delayed rewards for their efforts.

Fuzzy search autocomplete is a lovely example of this, which shows up in bash's reverse search and the Windows 8 start screen, among other places. You can add fuzzy file search to vim via unite and it is so, so much nicer than trying to manually type out paths all the time, even with tab completion.
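As a rough illustration of what "fuzzy" means here, the sketch below treats the query as a subsequence: each typed character has to appear in the candidate in order, but not necessarily adjacently. This is a toy under that assumption, with a hypothetical `fuzzy_filter` name and made-up paths; it is not how bash, Windows, or unite actually rank results.

```ruby
# Toy subsequence matcher: keeps candidates that contain the query's
# characters in order (case-insensitive), shortest candidates first.
def fuzzy_filter(query, candidates)
  pattern = Regexp.new(
    query.chars.map { |c| Regexp.escape(c) }.join(".*"),
    Regexp::IGNORECASE
  )
  candidates.select { |candidate| candidate =~ pattern }.sort_by(&:length)
end

paths = [
  "app/controllers/users_controller.rb",
  "app/models/user.rb",
  "README.md"
]

puts fuzzy_filter("usrc", paths)
```

A filter like this is cheap enough to rerun on every keypress, which is what makes the instant feedback possible.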

I wrote something very similar to Ghost's instant preview for SciRate comments, which parses both Markdown and a subset of LaTeX. People don't use it too often at the moment, but when they do they tend to make super mathsy doom comments so I'm quite pleased with it.